Sample records for comparing gene set

  1. Next-generation text-mining mediated generation of chemical response-specific gene sets for interpretation of gene expression data.

    PubMed

    Hettne, Kristina M; Boorsma, André; van Dartel, Dorien A M; Goeman, Jelle J; de Jong, Esther; Piersma, Aldert H; Stierum, Rob H; Kleinjans, Jos C; Kors, Jan A

    2013-01-29

    Availability of chemical response-specific lists of genes (gene sets) for pharmacological and/or toxic effect prediction for compounds is limited. We hypothesize that more gene sets can be created by next-generation text mining (next-gen TM), and that these can be used with gene set analysis (GSA) methods for chemical treatment identification, for pharmacological mechanism elucidation, and for comparing compound toxicity profiles. We created 30,211 chemical response-specific gene sets for human and mouse by next-gen TM, and derived 1,189 (human) and 588 (mouse) gene sets from the Comparative Toxicogenomics Database (CTD). We tested for significant differential expression (SDE) (false discovery rate-corrected p-values < 0.05) of the next-gen TM-derived gene sets and the CTD-derived gene sets in gene expression (GE) data sets of five chemicals (from experimental models). We tested for SDE of gene sets for six fibrates in a peroxisome proliferator-activated receptor alpha (PPARA) knock-out GE dataset and compared the results with those from the Connectivity Map (cMap). We tested for SDE of 319 next-gen TM-derived gene sets for environmental toxicants in three GE data sets of triazoles, and tested for SDE of 442 gene sets associated with embryonic structures. We compared the gene sets to triazole effects seen in the Whole Embryo Culture (WEC), and used principal component analysis (PCA) to discriminate triazoles from other chemicals. Next-gen TM-derived gene sets matching the chemical treatment were significantly altered in three GE data sets, and the corresponding CTD-derived gene sets were significantly altered in five GE data sets. Six next-gen TM-derived and four CTD-derived fibrate gene sets were significantly altered in the PPARA knock-out GE dataset. None of the fibrate signatures in cMap scored significant against the PPARA GE signature. 33 environmental toxicant gene sets were significantly altered in the triazole GE data sets. 21 of these toxicants had a toxicity pattern similar to that of the triazoles. We confirmed embryotoxic effects, and discriminated triazoles from other chemicals. Gene set analysis with next-gen TM-derived chemical response-specific gene sets is a scalable method for identifying similarities in gene responses to other chemicals, from which one may infer potential mode of action and/or toxic effect.

  2. Next-generation text-mining mediated generation of chemical response-specific gene sets for interpretation of gene expression data

    PubMed Central

    2013-01-01

    Background: Availability of chemical response-specific lists of genes (gene sets) for pharmacological and/or toxic effect prediction for compounds is limited. We hypothesize that more gene sets can be created by next-generation text mining (next-gen TM), and that these can be used with gene set analysis (GSA) methods for chemical treatment identification, for pharmacological mechanism elucidation, and for comparing compound toxicity profiles. Methods: We created 30,211 chemical response-specific gene sets for human and mouse by next-gen TM, and derived 1,189 (human) and 588 (mouse) gene sets from the Comparative Toxicogenomics Database (CTD). We tested for significant differential expression (SDE) (false discovery rate-corrected p-values < 0.05) of the next-gen TM-derived gene sets and the CTD-derived gene sets in gene expression (GE) data sets of five chemicals (from experimental models). We tested for SDE of gene sets for six fibrates in a peroxisome proliferator-activated receptor alpha (PPARA) knock-out GE dataset and compared the results with those from the Connectivity Map (cMap). We tested for SDE of 319 next-gen TM-derived gene sets for environmental toxicants in three GE data sets of triazoles, and tested for SDE of 442 gene sets associated with embryonic structures. We compared the gene sets to triazole effects seen in the Whole Embryo Culture (WEC), and used principal component analysis (PCA) to discriminate triazoles from other chemicals. Results: Next-gen TM-derived gene sets matching the chemical treatment were significantly altered in three GE data sets, and the corresponding CTD-derived gene sets were significantly altered in five GE data sets. Six next-gen TM-derived and four CTD-derived fibrate gene sets were significantly altered in the PPARA knock-out GE dataset. None of the fibrate signatures in cMap scored significant against the PPARA GE signature. 33 environmental toxicant gene sets were significantly altered in the triazole GE data sets. 21 of these toxicants had a toxicity pattern similar to that of the triazoles. We confirmed embryotoxic effects, and discriminated triazoles from other chemicals. Conclusions: Gene set analysis with next-gen TM-derived chemical response-specific gene sets is a scalable method for identifying similarities in gene responses to other chemicals, from which one may infer potential mode of action and/or toxic effect. PMID:23356878
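
    The abstract above calls a gene set significant when its false discovery rate-corrected p-value is below 0.05, without naming the correction procedure. As a minimal sketch, assuming the widely used Benjamini-Hochberg step-up procedure (an assumption, not confirmed by the record), the thresholding step could look like this. The gene set names and raw p-values are made up for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four gene sets (illustrative only).
gene_set_p = {"set_a": 0.001, "set_b": 0.02, "set_c": 0.04, "set_d": 0.30}
names = list(gene_set_p)
adj = benjamini_hochberg([gene_set_p[n] for n in names])

# Keep the gene sets whose corrected p-value is below the 0.05 cutoff.
significant = [n for n, q in zip(names, adj) if q < 0.05]
print(significant)
```

    Note that set_c has a raw p-value below 0.05 but does not pass after correction, which is the point of applying the cutoff to corrected values.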

  3. Estimation of gene induction enables a relevance-based ranking of gene sets.

    PubMed

    Bartholomé, Kilian; Kreutz, Clemens; Timmer, Jens

    2009-07-01

    In order to handle and interpret the vast amounts of data produced by microarray experiments, the analysis of sets of genes with a common biological functionality has been shown to be advantageous compared to single gene analyses. Some statistical methods have been proposed to analyse the differential gene expression of gene sets in microarray experiments. However, most of these methods either require threshold values to be chosen for the analysis, or they need some reference set for the determination of significance. We present a method that estimates the number of differentially expressed genes in a gene set without requiring a threshold value for significance of genes. The method is self-contained (i.e., it does not require a reference set for comparison). In contrast to other methods which are focused on significance, our approach emphasizes the relevance of the regulation of gene sets. The presented method measures the degree of regulation of a gene set and is a useful tool to compare the induction of different gene sets and place the results of microarray experiments into the biological context. An R-package is available.

  4. Comparative study on gene set and pathway topology-based enrichment methods.

    PubMed

    Bayerlová, Michaela; Jung, Klaus; Kramer, Frank; Klemm, Florian; Bleckmann, Annalen; Beißbarth, Tim

    2015-10-22

    Enrichment analysis is a popular approach to identify pathways or sets of genes which are significantly enriched in the context of differentially expressed genes. The traditional gene set enrichment approach considers a pathway as a simple gene list, disregarding any knowledge of gene or protein interactions. In contrast, the new group of so-called pathway topology-based methods integrates the topological structure of a pathway into the analysis. We comparatively investigated gene set and pathway topology-based enrichment approaches, considering three gene set and four topological methods. These methods were compared in two extensive simulation studies and on a benchmark of 36 real datasets, providing the same pathway input data for all methods. In the benchmark data analysis both types of methods showed a comparable ability to detect enriched pathways. The first simulation study was conducted with KEGG pathways, which showed considerable gene overlaps between each other. In this study with original KEGG pathways, none of the topology-based methods outperformed the gene set approach. Therefore, a second simulation study was performed on non-overlapping pathways created by unique gene IDs. Here, methods accounting for pathway topology reached higher accuracy than the gene set methods; however, their sensitivity was lower. We conducted one of the first comprehensive comparative works on evaluating gene set against pathway topology-based enrichment methods. The topological methods showed better performance in the simulation scenarios with non-overlapping pathways; however, they were not conclusively better in the other scenarios. This suggests that a simple gene set approach might be sufficient to detect an enriched pathway under realistic circumstances. Nevertheless, more extensive studies and further benchmark data are needed to systematically evaluate these methods and to assess what gain and cost pathway topology information introduces into enrichment analysis. Both types of methods for enrichment analysis require further improvements in order to deal with the problem of pathway overlaps.
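
    The record above contrasts topology-based methods with the traditional approach that treats a pathway as a plain gene list. The abstract does not say which three gene set methods were compared, so the following is only a minimal sketch of one common list-based test, a one-sided hypergeometric (over-representation) test, chosen here as an assumption. The function name and the counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_pathway, n_de, n_overlap):
    """P(overlap >= n_overlap) when n_de genes are drawn without replacement
    from a universe of n_universe genes that contains n_pathway pathway genes."""
    total = comb(n_universe, n_de)
    upper = min(n_pathway, n_de)
    # Sum the upper tail of the hypergeometric distribution.
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_de - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Illustrative counts: 20 measured genes, a 5-gene pathway,
# 4 differentially expressed genes, 3 of which fall in the pathway.
p = hypergeom_enrichment_p(n_universe=20, n_pathway=5, n_de=4, n_overlap=3)
print(round(p, 4))
```

    Because the test only counts list membership, two pathways that share many genes receive correlated p-values, which is one way to see the pathway overlap problem the abstract raises.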

  5. Training set selection for the prediction of essential genes.

    PubMed

    Cheng, Jian; Xu, Zhao; Wu, Wenwu; Zhao, Li; Li, Xiangchen; Liu, Yanlin; Tao, Shiheng

    2014-01-01

    Various computational models have been developed to transfer annotations of gene essentiality between organisms. However, despite the increasing number of microorganisms with well-characterized sets of essential genes, selection of appropriate training sets for predicting the essential genes of poorly-studied or newly sequenced organisms remains challenging. In this study, a machine learning approach was applied reciprocally to predict the essential genes in 21 microorganisms. Results showed that training set selection greatly influenced predictive accuracy. We determined four criteria for training set selection: (1) essential genes in the selected training set should be reliable; (2) the growth conditions in which essential genes are defined should be consistent in training and prediction sets; (3) species used as training set should be closely related to the target organism; and (4) organisms used as training and prediction sets should exhibit similar phenotypes or lifestyles. We then analyzed the performance of an incomplete training set and an integrated training set with multiple organisms. We found that the size of the training set should be at least 10% of the total genes to yield accurate predictions. Additionally, the integrated training sets exhibited remarkable increase in stability and accuracy compared with single sets. Finally, we compared the performance of the integrated training sets with the four criteria and with random selection. The results revealed that a rational selection of training sets based on our criteria yields better performance than random selection. Thus, our results provide empirical guidance on training set selection for the identification of essential genes on a genome-wide scale.

  6. Initial description of primate-specific cystine-knot Prometheus genes and differential gene expansions of D-dopachrome tautomerase genes

    PubMed Central

    Premzl, Marko

    2015-01-01

    Using eutherian comparative genomic analysis protocol and public genomic sequence data sets, the present work attempted to update and revise two gene data sets. The most comprehensive third party annotation gene data sets of eutherian adenohypophysis cystine-knot genes (128 complete coding sequences), and d-dopachrome tautomerases and macrophage migration inhibitory factor genes (30 complete coding sequences) were annotated. For example, the present study first described primate-specific cystine-knot Prometheus genes, as well as differential gene expansions of D-dopachrome tautomerase genes. Furthermore, new frameworks of future experiments of two eutherian gene data sets were proposed. PMID:25941635

  7. Time-Course Gene Set Analysis for Longitudinal Gene Expression Data

    PubMed Central

    Hejblum, Boris P.; Skinner, Jason; Thiébaut, Rodolphe

    2015-01-01

    Gene set analysis methods, which consider predefined groups of genes in the analysis of genomic data, have been successfully applied for analyzing gene expression data in cross-sectional studies. The time-course gene set analysis (TcGSA) introduced here is an extension of gene set analysis to longitudinal data. The proposed method relies on random effects modeling with maximum likelihood estimates. It allows the use of all available repeated measurements while dealing with unbalanced data due to missing at random (MAR) measurements. TcGSA is a hypothesis-driven method that identifies a priori defined gene sets with significant expression variations over time, taking into account the potential heterogeneity of expression within gene sets. When biological conditions are compared, the method indicates if the time patterns of gene sets significantly differ according to these conditions. The interest of the method is illustrated by its application to two real-life datasets: an HIV therapeutic vaccine trial (DALIA-1 trial), and data from a recent study on influenza and pneumococcal vaccines. In the DALIA-1 trial TcGSA revealed a significant change in gene expression over time within 69 gene sets during vaccination, while a standard univariate individual gene analysis corrected for multiple testing as well as a standard Gene Set Enrichment Analysis (GSEA) for time series both failed to detect any significant pattern change over time. When applied to the second illustrative data set, TcGSA allowed the identification of 4 gene sets that were also linked with the influenza vaccine, although previous analyses had found them to be associated with the pneumococcal vaccine only. In our simulation study TcGSA exhibits good statistical properties, and an increased power compared to other approaches for analyzing time-course expression patterns of gene sets. The method is made available for the community through an R package. PMID:26111374

  8. Transcriptional responses in thyroid tissues from rats treated with a tumorigenic and a non-tumorigenic triazole conazole fungicide.

    PubMed

    Hester, Susan D; Nesnow, Stephen

    2008-03-15

    Conazoles are azole-containing fungicides that are used in agriculture and medicine. Conazoles can induce follicular cell adenomas of the thyroid in rats after chronic bioassay. The goal of this study was to identify pathways and networks of genes that were associated with thyroid tumorigenesis through transcriptional analyses. To this end, we compared transcriptional profiles from tissues of rats treated with a tumorigenic and a non-tumorigenic conazole. Triadimefon, a rat thyroid tumorigen, and myclobutanil, which was not tumorigenic in rats after a 2-year bioassay, were administered in the feed to male Wistar/Han rats for 30 or 90 days, similar to the treatment conditions previously used in their chronic bioassays. Thyroid gene expression was determined using high density Affymetrix GeneChips (Rat 230_2). Gene expression was analyzed by the Gene Set Expression Analyses method, which clearly separated the tumorigenic treatments (tumorigenic response group (TRG)) from the non-tumorigenic treatments (non-tumorigenic response group (NRG)). Core genes from these gene sets were mapped to canonical, metabolic, and GeneGo processes, and these processes were compared across group and treatment time. Extensive analyses were performed on the 30-day gene sets as they represented the major perturbations. Gene sets in the 30-day TRG group had overrepresentation of fatty acid metabolism, oxidation, and degradation processes (including PPARgamma and CYP involvement), and of cell proliferation responses. Core genes from these gene sets were combined into networks and found to possess signaling interactions. In addition, the core genes in each gene set were compared with genes known to be associated with human thyroid cancer. Among the genes that appeared in both rat and human data sets were: Acaca, Asns, Cebpg, Crem, Ddit3, Gja1, Grn, Jun, Junb, and Vegf. These genes were major contributors in the previously developed network from triadimefon-treated rat thyroids. It is postulated that triadimefon induces oxidative response genes and activates the nuclear receptor, Ppargamma, initiating transcription of gene products and signaling to a series of genes involved in cell proliferation.

  9. Global Landscape of a Co-Expressed Gene Network in Barley and its Application to Gene Discovery in Triticeae Crops

    PubMed Central

    Mochida, Keiichi; Uehara-Yamaguchi, Yukiko; Yoshida, Takuhiro; Sakurai, Tetsuya; Shinozaki, Kazuo

    2011-01-01

    Accumulated transcriptome data can be used to investigate regulatory networks of genes involved in various biological systems. Co-expression analysis data sets generated from comprehensively collected transcriptome data sets now represent efficient resources that are capable of facilitating the discovery of genes with closely correlated expression patterns. In order to construct a co-expression network for barley, we analyzed 45 publicly available experimental series, which are composed of 1,347 sets of GeneChip data for barley. On the basis of a gene-to-gene weighted correlation coefficient, we constructed a global barley co-expression network and classified it into clusters of subnetwork modules. The resulting clusters are candidates for functional regulatory modules in the barley transcriptome. To annotate each of the modules, we performed comparative annotation using genes in Arabidopsis and Brachypodium distachyon. On the basis of a comparative analysis between barley and two model species, we investigated functional properties from the representative distributions of the gene ontology (GO) terms. Modules putatively involved in drought stress response and cellulose biogenesis have been identified. These modules are discussed to demonstrate the effectiveness of the co-expression analysis. Furthermore, we applied the data set of co-expressed genes coupled with comparative analysis in attempts to discover potentially Triticeae-specific network modules. These results demonstrate that analysis of the co-expression network of the barley transcriptome together with comparative analysis should promote the process of gene discovery in barley. Furthermore, the insights obtained should be transferable to investigations of Triticeae plants. The associated data set generated in this analysis is publicly accessible at http://coexpression.psc.riken.jp/barley/. PMID:21441235

  10. Effect of the absolute statistic on gene-sampling gene-set analysis methods.

    PubMed

    Nam, Dougu

    2017-06-01

    Gene-set enrichment analysis and its modified versions have commonly been used for identifying altered functions or pathways in disease from microarray data. In particular, the simple gene-sampling gene-set analysis methods have been heavily used for datasets with only a few sample replicates. The biggest problem with this approach is the highly inflated false-positive rate. In this paper, the effect of absolute gene statistic on gene-sampling gene-set analysis methods is systematically investigated. Thus far, the absolute gene statistic has merely been regarded as a supplementary method for capturing the bidirectional changes in each gene set. Here, it is shown that incorporating the absolute gene statistic in gene-sampling gene-set analysis substantially reduces the false-positive rate and improves the overall discriminatory ability. Its effect was investigated by power, false-positive rate, and receiver operating curve for a number of simulated and real datasets. The performances of gene-set analysis methods in one-tailed (genome-wide association study) and two-tailed (gene expression data) tests were also compared and discussed.
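
    The record above describes gene-sampling gene-set analysis, in which a set's score is compared against scores of randomly drawn gene sets of the same size, and the use of an absolute gene statistic to capture bidirectional changes. The sketch below is a simplified illustration under stated assumptions: the set score is the mean gene statistic, the p-value uses a plus-one correction, and all gene names and statistic values are invented. It is not the specific procedure evaluated in the paper.

```python
import random


def gene_sampling_p(stats, gene_set, use_absolute=True, n_samples=2000, seed=0):
    """One-sided gene-sampling p-value for the mean statistic of gene_set."""
    transform = abs if use_absolute else (lambda x: x)
    scores = {g: transform(s) for g, s in stats.items()}
    genes = sorted(scores)
    k = len(gene_set)
    observed = sum(scores[g] for g in gene_set) / k
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    hits = 0
    for _ in range(n_samples):
        # Draw a random gene set of the same size from all measured genes.
        sample = rng.sample(genes, k)
        if sum(scores[g] for g in sample) / k >= observed:
            hits += 1
    # Plus-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_samples + 1)


# Hypothetical gene-level statistics: weak background signal for 100 genes.
stats = {f"g{i}": ((i % 7) - 3) * 0.2 for i in range(100)}
# A gene set whose members change strongly in both directions.
target = [f"g{i}" for i in range(6)]
for i, g in enumerate(target):
    stats[g] = 3.0 if i % 2 == 0 else -3.0

p_abs = gene_sampling_p(stats, target, use_absolute=True)
p_signed = gene_sampling_p(stats, target, use_absolute=False)
print(p_abs, p_signed)
```

    In this toy data the signed mean of the target set is zero, so the signed test sees nothing, while the absolute statistic flags the set. The effect on false-positive rates reported in the abstract is a separate property that this small example does not demonstrate.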

  11. Computing and Applying Atomic Regulons to Understand Gene Expression and Regulation

    PubMed Central

    Faria, José P.; Davis, James J.; Edirisinghe, Janaka N.; Taylor, Ronald C.; Weisenhorn, Pamela; Olson, Robert D.; Stevens, Rick L.; Rocha, Miguel; Rocha, Isabel; Best, Aaron A.; DeJongh, Matthew; Tintle, Nathan L.; Parrello, Bruce; Overbeek, Ross; Henry, Christopher S.

    2016-01-01

    Understanding gene function and regulation is essential for the interpretation, prediction, and ultimate design of cell responses to changes in the environment. An important step toward meeting the challenge of understanding gene function and regulation is the identification of sets of genes that are always co-expressed. These gene sets, Atomic Regulons (ARs), represent fundamental units of function within a cell and could be used to associate genes of unknown function with cellular processes and to enable rational genetic engineering of cellular systems. Here, we describe an approach for inferring ARs that leverages large-scale expression data sets, gene context, and functional relationships among genes. We computed ARs for Escherichia coli based on 907 gene expression experiments and compared our results with gene clusters produced by two prevalent data-driven methods: Hierarchical clustering and k-means clustering. We compared ARs and purely data-driven gene clusters to the curated set of regulatory interactions for E. coli found in RegulonDB, showing that ARs are more consistent with gold standard regulons than are data-driven gene clusters. We further examined the consistency of ARs and data-driven gene clusters in the context of gene interactions predicted by Context Likelihood of Relatedness (CLR) analysis, finding that the ARs show better agreement with CLR predicted interactions. We determined the impact of increasing amounts of expression data on AR construction and find that while more data improve ARs, it is not necessary to use the full set of gene expression experiments available for E. coli to produce high quality ARs. In order to explore the conservation of co-regulated gene sets across different organisms, we computed ARs for Shewanella oneidensis, Pseudomonas aeruginosa, Thermus thermophilus, and Staphylococcus aureus, each of which represents increasing degrees of phylogenetic distance from E. coli. Comparison of the organism-specific ARs showed that the consistency of AR gene membership correlates with phylogenetic distance, but there is clear variability in the regulatory networks of closely related organisms. As large scale expression data sets become increasingly common for model and non-model organisms, comparative analyses of atomic regulons will provide valuable insights into fundamental regulatory modules used across the bacterial domain. PMID:27933038

  12. A hybrid approach of gene sets and single genes for the prediction of survival risks with gene expression data.

    PubMed

    Seok, Junhee; Davis, Ronald W; Xiao, Wenzhong

    2015-01-01

    Accumulated biological knowledge is often encoded as gene sets, collections of genes associated with similar biological functions or pathways. The use of gene sets in the analyses of high-throughput gene expression data has been intensively studied and applied in clinical research. However, the main interest remains in finding modules of biological knowledge, or corresponding gene sets, significantly associated with disease conditions. Risk prediction from censored survival times using gene sets hasn't been well studied. In this work, we propose a hybrid method that uses both single gene and gene set information together to predict patient survival risks from gene expression profiles. In the proposed method, gene sets provide context-level information that is poorly reflected by single genes. Complementarily, single genes help to supplement incomplete information of gene sets due to our imperfect biomedical knowledge. Through the tests over multiple data sets of cancer and trauma injury, the proposed method showed robust and improved performance compared with the conventional approaches with only single genes or gene sets solely. Additionally, we examined the prediction result in the trauma injury data, and showed that the modules of biological knowledge used in the prediction by the proposed method were highly interpretable in biology. A wide range of survival prediction problems in clinical genomics is expected to benefit from the use of biological knowledge.

  13. A Hybrid Approach of Gene Sets and Single Genes for the Prediction of Survival Risks with Gene Expression Data

    PubMed Central

    Seok, Junhee; Davis, Ronald W.; Xiao, Wenzhong

    2015-01-01

    Accumulated biological knowledge is often encoded as gene sets, collections of genes associated with similar biological functions or pathways. The use of gene sets in the analyses of high-throughput gene expression data has been intensively studied and applied in clinical research. However, the main interest remains in finding modules of biological knowledge, or corresponding gene sets, significantly associated with disease conditions. Risk prediction from censored survival times using gene sets hasn’t been well studied. In this work, we propose a hybrid method that uses both single gene and gene set information together to predict patient survival risks from gene expression profiles. In the proposed method, gene sets provide context-level information that is poorly reflected by single genes. Complementarily, single genes help to supplement incomplete information of gene sets due to our imperfect biomedical knowledge. Through the tests over multiple data sets of cancer and trauma injury, the proposed method showed robust and improved performance compared with the conventional approaches with only single genes or gene sets solely. Additionally, we examined the prediction result in the trauma injury data, and showed that the modules of biological knowledge used in the prediction by the proposed method were highly interpretable in biology. A wide range of survival prediction problems in clinical genomics is expected to benefit from the use of biological knowledge. PMID:25933378

  14. Curated eutherian third party data gene data sets.

    PubMed

    Premzl, Marko

    2016-03-01

    The freely available eutherian genomic sequence data sets advanced the scientific field of genomics. Of note, future revisions of gene data sets were expected, due to incompleteness of public eutherian genomic sequence assemblies and potential genomic sequence errors. The eutherian comparative genomic analysis protocol was proposed as guidance in protection against potential genomic sequence errors in public eutherian genomic sequences. The protocol was applicable in updates of 7 major eutherian gene data sets, including 812 complete coding sequences deposited in European Nucleotide Archive as curated third party data gene data sets.

  15. MAGMA: Generalized Gene-Set Analysis of GWAS Data

    PubMed Central

    de Leeuw, Christiaan A.; Mooij, Joris M.; Heskes, Tom; Posthuma, Danielle

    2015-01-01

    By aggregating data for complex traits in a biologically meaningful way, gene and gene-set analysis constitute a valuable addition to single-marker analysis. However, although various methods for gene and gene-set analysis currently exist, they generally suffer from a number of issues. Statistical power for most methods is strongly affected by linkage disequilibrium between markers, multi-marker associations are often hard to detect, and the reliance on permutation to compute p-values tends to make the analysis computationally very expensive. To address these issues we have developed MAGMA, a novel tool for gene and gene-set analysis. The gene analysis is based on a multiple regression model, to provide better statistical performance. The gene-set analysis is built as a separate layer around the gene analysis for additional flexibility. This gene-set analysis also uses a regression structure to allow generalization to analysis of continuous properties of genes and simultaneous analysis of multiple gene sets and other gene properties. Simulations and an analysis of Crohn’s Disease data are used to evaluate the performance of MAGMA and to compare it to a number of other gene and gene-set analysis tools. The results show that MAGMA has significantly more power than other tools for both the gene and the gene-set analysis, identifying more genes and gene sets associated with Crohn’s Disease while maintaining a correct type 1 error rate. Moreover, the MAGMA analysis of the Crohn’s Disease data was found to be considerably faster as well. PMID:25885710

  16. MAGMA: generalized gene-set analysis of GWAS data.

    PubMed

    de Leeuw, Christiaan A; Mooij, Joris M; Heskes, Tom; Posthuma, Danielle

    2015-04-01

    By aggregating data for complex traits in a biologically meaningful way, gene and gene-set analysis constitute a valuable addition to single-marker analysis. However, although various methods for gene and gene-set analysis currently exist, they generally suffer from a number of issues. Statistical power for most methods is strongly affected by linkage disequilibrium between markers, multi-marker associations are often hard to detect, and the reliance on permutation to compute p-values tends to make the analysis computationally very expensive. To address these issues we have developed MAGMA, a novel tool for gene and gene-set analysis. The gene analysis is based on a multiple regression model, to provide better statistical performance. The gene-set analysis is built as a separate layer around the gene analysis for additional flexibility. This gene-set analysis also uses a regression structure to allow generalization to analysis of continuous properties of genes and simultaneous analysis of multiple gene sets and other gene properties. Simulations and an analysis of Crohn's Disease data are used to evaluate the performance of MAGMA and to compare it to a number of other gene and gene-set analysis tools. The results show that MAGMA has significantly more power than other tools for both the gene and the gene-set analysis, identifying more genes and gene sets associated with Crohn's Disease while maintaining a correct type 1 error rate. Moreover, the MAGMA analysis of the Crohn's Disease data was found to be considerably faster as well.

  17. Random forests-based differential analysis of gene sets for gene expression data.

    PubMed

    Hsueh, Huey-Miin; Zhou, Da-Wei; Tsai, Chen-An

    2013-04-10

    In DNA microarray studies, gene-set analysis (GSA) has become the focus of gene expression data analysis. GSA utilizes the gene expression profiles of functionally related gene sets in Gene Ontology (GO) categories or priori-defined biological classes to assess the significance of gene sets associated with clinical outcomes or phenotypes. Many statistical approaches have been proposed to determine whether such functionally related gene sets express differentially (enrichment and/or deletion) in variations of phenotypes. However, little attention has been given to the discriminatory power of gene sets and classification of patients. In this study, we propose a method of gene set analysis, in which gene sets are used to develop classifications of patients based on the Random Forest (RF) algorithm. The corresponding empirical p-value of an observed out-of-bag (OOB) error rate of the classifier is introduced to identify differentially expressed gene sets using an adequate resampling method. In addition, we discuss the impacts and correlations of genes within each gene set based on the measures of variable importance in the RF algorithm. Significant classifications are reported and visualized together with the underlying gene sets and their contribution to the phenotypes of interest. Numerical studies using both synthesized data and a series of publicly available gene expression data sets are conducted to evaluate the performance of the proposed methods. Compared with other hypothesis testing approaches, our proposed methods are reliable and successful in identifying enriched gene sets and in discovering the contributions of genes within a gene set. The classification results of identified gene sets can provide an valuable alternative to gene set testing to reveal the unknown, biologically relevant classes of samples or patients. 
In summary, our proposed method allows one to simultaneously assess the discriminatory ability of gene sets and the importance of genes for interpretation of data in complex biological systems. The classifications of biologically defined gene sets can reveal the underlying interactions of gene sets associated with the phenotypes, and provide an insightful complement to conventional gene set analyses. Copyright © 2012 Elsevier B.V. All rights reserved.
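
The empirical p-value idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a leave-one-out nearest-class-mean classifier on a single per-sample gene-set score stands in for the Random Forest and its OOB error, and the function names, the toy scores, and the number of permutations are all assumptions made for illustration. The phenotype labels are shuffled repeatedly, and the p-value is the share of shuffles whose error rate is at least as low as the observed one.

```python
import random

def loo_error(values, labels):
    """Leave-one-out error rate of a nearest-class-mean classifier on 1-D scores."""
    errors = 0
    for i, (v, y) in enumerate(zip(values, labels)):
        means = {}
        for cls in sorted(set(labels)):
            rest = [values[j] for j in range(len(values))
                    if j != i and labels[j] == cls]
            if rest:  # skip a class that has no remaining training samples
                means[cls] = sum(rest) / len(rest)
        predicted = min(means, key=lambda c: abs(v - means[c]))
        errors += predicted != y
    return errors / len(values)

def empirical_p_value(values, labels, n_perm=500, seed=0):
    """Permutation p-value: how often shuffled labels do at least as well."""
    rng = random.Random(seed)
    observed = loo_error(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if loo_error(values, shuffled) <= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical gene-set scores for four controls (0) and four cases (1).
scores = [0.1, 0.3, 0.2, 0.4, 2.1, 2.3, 1.9, 2.2]
phenotype = [0, 0, 0, 0, 1, 1, 1, 1]
observed_error, p_value = empirical_p_value(scores, phenotype)
```

A small p-value here means the gene-set score separates the phenotypes better than almost any random relabelling does, which is the sense in which the abstract calls a gene set differentially expressed.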

  18. Statistical Test of Expression Pattern (STEPath): a new strategy to integrate gene expression data with genomic information in individual and meta-analysis studies.

    PubMed

    Martini, Paolo; Risso, Davide; Sales, Gabriele; Romualdi, Chiara; Lanfranchi, Gerolamo; Cagnin, Stefano

    2011-04-11

    In the last decades, microarray technology has spread, leading to a dramatic increase of publicly available datasets. The first statistical tools developed were focused on the identification of significant differentially expressed genes. Later, researchers moved toward the systematic integration of gene expression profiles with additional biological information, such as chromosomal location, ontological annotations or sequence features. The analysis of gene expression linked to physical location of genes on chromosomes allows the identification of transcriptionally imbalanced regions, while Gene Set Analysis focuses on the detection of coordinated changes in transcriptional levels among sets of biologically related genes. In this field, meta-analysis offers the possibility to compare different studies, addressing the same biological question to fully exploit public gene expression datasets. We describe STEPath, a method that starts from gene expression profiles and integrates the analysis of imbalanced regions as an a priori step before performing gene set analysis. The application of STEPath in individual studies produced gene set scores weighted by chromosomal activation. As a final step, we propose a way to compare these scores across different studies (meta-analysis) on related biological issues. One complication with meta-analysis is batch effects, which occur because molecular measurements are affected by laboratory conditions, reagent lots and personnel differences. Major problems occur when batch effects are correlated with an outcome of interest and lead to incorrect conclusions. We evaluated the power of combining chromosome mapping and gene set enrichment analysis, performing the analysis on a dataset of leukaemia (example of individual study) and on a dataset of skeletal muscle diseases (meta-analysis approach). 
In leukaemia, we identified the Hox gene set, a gene set closely related to the pathology that other algorithms of gene set analysis do not identify, while the meta-analysis approach on muscular disease discriminates between related pathologies and correlates similar ones from different studies. STEPath is a new method that integrates gene expression profiles, genomic co-expressed regions and the information about the biological function of genes. The usage of the STEPath-computed gene set scores overcomes batch effects in the meta-analysis approaches allowing the direct comparison of different pathologies and different studies on a gene set activation level.

  19. A Guide to the PLAZA 3.0 Plant Comparative Genomic Database.

    PubMed

    Vandepoele, Klaas

    2017-01-01

    PLAZA 3.0 is an online resource for comparative genomics and offers a versatile platform to study gene functions and gene families or to analyze genome organization and evolution in the green plant lineage. Starting from genome sequence information for over 35 plant species, precomputed comparative genomic data sets cover homologous gene families, multiple sequence alignments, phylogenetic trees, and genomic colinearity information within and between species. Complementary functional data sets, a Workbench, and interactive visualization tools are available through a user-friendly web interface, making PLAZA an excellent starting point to translate sequence or omics data sets into biological knowledge. PLAZA is available at http://bioinformatics.psb.ugent.be/plaza/ .

  20. Multiconstrained gene clustering based on generalized projections

    PubMed Central

    2010-01-01

    Background: Gene clustering for annotating gene functions is one of the fundamental issues in bioinformatics. The best clustering solution is often regularized by multiple constraints such as gene expressions, Gene Ontology (GO) annotations and gene network structures. How to integrate multiple constraints for an optimal clustering solution still remains an unsolved problem. Results: We propose a novel multiconstrained gene clustering (MGC) method within the generalized projection onto convex sets (POCS) framework used widely in image reconstruction. Each constraint is formulated as a corresponding set. The generalized projector iteratively projects the clustering solution onto these sets in order to find a consistent solution included in the intersection set that satisfies all constraints. Compared with previous MGC methods, POCS can integrate multiple constraints of different natures without distorting the original constraints. To evaluate the clustering solution, we also propose a new performance measure referred to as Gene Log Likelihood (GLL) that considers genes having more than one function and hence in more than one cluster. Comparative experimental results show that our POCS-based gene clustering method outperforms current state-of-the-art MGC methods. Conclusions: The POCS-based MGC method can successfully combine multiple constraints of different natures for gene clustering. Also, the proposed GLL is an effective performance measure for the soft clustering solutions. PMID:20356386
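
The iterative projection step described in this abstract can be sketched in its classical form. This is an illustration only and assumes two toy convex sets (a hyperplane on which the entries sum to one, and the unit box) rather than the paper's expression, GO and network constraints. The function names and the starting vector are hypothetical.

```python
def project_hyperplane(x, total=1.0):
    """Euclidean projection onto the set {x : sum(x) == total}."""
    shift = (total - sum(x)) / len(x)
    return [xi + shift for xi in x]

def project_box(x, lo=0.0, hi=1.0):
    """Euclidean projection onto the box {x : lo <= x_i <= hi}."""
    return [min(max(xi, lo), hi) for xi in x]

def pocs(x, projections, n_iter=200):
    """Cycle through the projections; for convex sets with a non-empty
    intersection the iterates approach a point lying in all of them."""
    for _ in range(n_iter):
        for project in projections:
            x = project(x)
    return x

# Start from a vector that violates both constraints.
solution = pocs([2.0, -1.0, 0.5], [project_hyperplane, project_box])
```

The returned vector can be read as a soft membership vector: every entry lies in [0, 1] and the entries sum to one up to rounding, so both constraints hold at once.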

  1. Principal Angle Enrichment Analysis (PAEA): Dimensionally Reduced Multivariate Gene Set Enrichment Analysis Tool

    PubMed Central

    Clark, Neil R.; Szymkiewicz, Maciej; Wang, Zichen; Monteiro, Caroline D.; Jones, Matthew R.; Ma’ayan, Avi

    2016-01-01

    Gene set analysis of differential expression, which identifies collectively differentially expressed gene sets, has become an important tool for biology. The power of this approach lies in its reduction of the dimensionality of the statistical problem and its incorporation of biological interpretation by construction. Many approaches to gene set analysis have been proposed, but benchmarking their performance in the setting of real biological data is difficult due to the lack of a gold standard. In a previously published work we proposed a geometrical approach to differential expression which performed highly in benchmarking tests and compared well to the most popular methods of differential gene expression. As reported, this approach has a natural extension to gene set analysis which we call Principal Angle Enrichment Analysis (PAEA). PAEA employs dimensionality reduction and a multivariate approach for gene set enrichment analysis. However, the performance of this method has not been assessed nor its implementation as a web-based tool. Here we describe new benchmarking protocols for gene set analysis methods and find that PAEA performs highly. The PAEA method is implemented as a user-friendly web-based tool, which contains 70 gene set libraries and is freely available to the community. PMID:26848405

  2. Principal Angle Enrichment Analysis (PAEA): Dimensionally Reduced Multivariate Gene Set Enrichment Analysis Tool.

    PubMed

    Clark, Neil R; Szymkiewicz, Maciej; Wang, Zichen; Monteiro, Caroline D; Jones, Matthew R; Ma'ayan, Avi

    2015-11-01

    Gene set analysis of differential expression, which identifies collectively differentially expressed gene sets, has become an important tool for biology. The power of this approach lies in its reduction of the dimensionality of the statistical problem and its incorporation of biological interpretation by construction. Many approaches to gene set analysis have been proposed, but benchmarking their performance in the setting of real biological data is difficult due to the lack of a gold standard. In a previously published work we proposed a geometrical approach to differential expression which performed highly in benchmarking tests and compared well to the most popular methods of differential gene expression. As reported, this approach has a natural extension to gene set analysis which we call Principal Angle Enrichment Analysis (PAEA). PAEA employs dimensionality reduction and a multivariate approach for gene set enrichment analysis. However, the performance of this method has not been assessed nor its implementation as a web-based tool. Here we describe new benchmarking protocols for gene set analysis methods and find that PAEA performs highly. The PAEA method is implemented as a user-friendly web-based tool, which contains 70 gene set libraries and is freely available to the community.

  3. Seten: a tool for systematic identification and comparison of processes, phenotypes, and diseases associated with RNA-binding proteins from condition-specific CLIP-seq profiles.

    PubMed

    Budak, Gungor; Srivastava, Rajneesh; Janga, Sarath Chandra

    2017-06-01

    RNA-binding proteins (RBPs) control the regulation of gene expression in eukaryotic genomes at post-transcriptional level by binding to their cognate RNAs. Although several variants of CLIP (crosslinking and immunoprecipitation) protocols are currently available to study the global protein-RNA interaction landscape at single-nucleotide resolution in a cell, currently there are very few tools that can facilitate understanding and dissecting the functional associations of RBPs from the resulting binding maps. Here, we present Seten, a web-based and command line tool, which can identify and compare processes, phenotypes, and diseases associated with RBPs from condition-specific CLIP-seq profiles. Seten uses BED files resulting from most peak calling algorithms, which include scores reflecting the extent of binding of an RBP on the target transcript, to provide both traditional functional enrichment as well as gene set enrichment results for a number of gene set collections including BioCarta, KEGG, Reactome, Gene Ontology (GO), Human Phenotype Ontology (HPO), and MalaCards Disease Ontology for several organisms including fruit fly, human, mouse, rat, worm, and yeast. It also provides an option to dynamically compare the associated gene sets across data sets as bubble charts, to facilitate comparative analysis. Benchmarking of Seten using eCLIP data for IGF2BP1, SRSF7, and PTBP1 against their corresponding CRISPR RNA-seq in K562 cells as well as randomized negative controls, demonstrated that its gene set enrichment method outperforms functional enrichment, with scores significantly contributing to the discovery of true annotations. Comparative performance analysis using these CRISPR control data sets revealed significantly higher precision and comparable recall to that observed using ChIP-Enrich. 
Seten's web interface currently provides precomputed results for about 200 CLIP-seq data sets and both command line as well as web interfaces can be used to analyze CLIP-seq data sets. We highlight several examples to show the utility of Seten for rapid profiling of various CLIP-seq data sets. Seten is available on http://www.iupui.edu/~sysbio/seten/. © 2017 Budak et al.; Published by Cold Spring Harbor Laboratory Press for the RNA Society.

  4. Tissue Non-Specific Genes and Pathways Associated with Diabetes: An Expression Meta-Analysis.

    PubMed

    Mei, Hao; Li, Lianna; Liu, Shijian; Jiang, Fan; Griswold, Michael; Mosley, Thomas

    2017-01-21

    We performed expression studies to identify tissue non-specific genes and pathways of diabetes by meta-analysis. We searched curated datasets of the Gene Expression Omnibus (GEO) database and identified 13 and five expression studies of diabetes and insulin responses at various tissues, respectively. We tested differential gene expression by empirical Bayes-based linear method and investigated gene set expression association by knowledge-based enrichment analysis. Meta-analysis by different methods was applied to identify tissue non-specific genes and gene sets. We also proposed pathway mapping analysis to infer functions of the identified gene sets, and correlation and independent analysis to evaluate expression association profile of genes and gene sets between studies and tissues. Our analysis showed that PGRMC1 and HADH genes were significant over diabetes studies, while IRS1 and MPST genes were significant over insulin response studies, and joint analysis showed that HADH and MPST genes were significant over all combined data sets. The pathway analysis identified six significant gene sets over all studies. The KEGG pathway mapping indicated that the significant gene sets are related to diabetes pathogenesis. The results also presented that 12.8% and 59.0% pairwise studies had significantly correlated expression association for genes and gene sets, respectively; moreover, 12.8% pairwise studies had independent expression association for genes, but no studies were observed significantly different for expression association of gene sets. Our analysis indicated that there are both tissue specific and non-specific genes and pathways associated with diabetes pathogenesis. Compared to the gene expression, pathway association tends to be tissue non-specific, and a common pathway influencing diabetes development is activated through different genes at different tissues.

  5. Bioinformatics approach to evaluate differential gene expression of M1/M2 macrophage phenotypes and antioxidant genes in atherosclerosis.

    PubMed

    da Rocha, Ricardo Fagundes; De Bastiani, Marco Antônio; Klamt, Fábio

    2014-11-01

    Atherosclerosis is a pro-inflammatory process intrinsically related to systemic redox impairments. Macrophages play a major role in disease development. The specific involvement of classically activated, M1 (pro-inflammatory), or alternatively activated, M2 (anti-inflammatory), macrophages in plaque formation and disease progression is still not established. Thus, based on meta-data analysis of public micro-array datasets, we compared differential gene expression levels of the human antioxidant genes (HAG) and M1/M2 genes between early and advanced human atherosclerotic plaques, and among peripheric macrophages (with or without foam cells induction by oxidized low density lipoprotein, oxLDL) from healthy and atherosclerotic subjects. Two independent datasets, GSE28829 and GSE9874, were selected from gene expression omnibus (http://www.ncbi.nlm.nih.gov/geo/) repository. Functional interactions were obtained with STRING (http://string-db.org/) and Medusa (http://coot.embl.de/medusa/). Statistical analysis was performed with ViaComplex® (http://lief.if.ufrgs.br/pub/biosoftwares/viacomplex/) and gene score enrichment analysis (http://www.broadinstitute.org/gsea/index.jsp). Bootstrap analysis demonstrated that the activity (expression) of HAG and M1 gene sets were significantly increased in advanced compared to early atherosclerotic plaque. Increased expressions of HAG, M1, and M2 gene sets were found in peripheric macrophages from atherosclerotic subjects compared to peripheric macrophages from healthy subjects, while only M1 gene set was increased in foam cells from atherosclerotic subjects compared to foam cells from healthy subjects. However, M1 gene set was decreased in foam cells from healthy subjects compared to peripheric macrophages from healthy subjects, while no differences were found in foam cells from atherosclerotic subjects compared to peripheric macrophages from atherosclerotic subjects. 
Our data suggest that, unlike in cancer, in atherosclerosis there is no M1 or M2 polarization of macrophages. Instead, the M1 and M2 phenotypes are equally induced, which is an important aspect for better understanding the disease progression and can help to develop new therapeutic approaches.

  6. Comparative Bacterial Proteomics: Analysis of the Core Genome Concept

    PubMed Central

    Callister, Stephen J.; McCue, Lee Ann; Turse, Joshua E.; Monroe, Matthew E.; Auberry, Kenneth J.; Smith, Richard D.; Adkins, Joshua N.; Lipton, Mary S.

    2008-01-01

    While comparative bacterial genomic studies commonly predict a set of genes indicative of common ancestry, experimental validation of the existence of this core genome requires extensive measurement and is typically not undertaken. Enabled by an extensive proteome database developed over six years, we have experimentally verified the expression of proteins predicted from genomic ortholog comparisons among 17 environmental and pathogenic bacteria. More exclusive relationships were observed among the expressed protein content of phenotypically related bacteria, which is indicative of the specific lifestyles associated with these organisms. Although genomic studies can establish relative orthologous relationships among a set of bacteria and propose a set of ancestral genes, our proteomics study establishes expressed lifestyle differences among conserved genes and proposes a set of expressed ancestral traits. PMID:18253490

  7. Comparative physical mapping between wheat chromosome arm 2BL and rice chromosome 4.

    PubMed

    Lee, Tong Geon; Lee, Yong Jin; Kim, Dae Yeon; Seo, Yong Weon

    2010-12-01

    Physical maps of chromosomes provide a framework for organizing and integrating diverse genetic information. DNA microarrays are a valuable technique for physical mapping and can also be used to facilitate the discovery of single feature polymorphisms (SFPs). Wheat chromosome arm 2BL was physically mapped using a Wheat Genome Array onto near-isogenic lines (NILs) with the aid of wheat-rice synteny and mapped wheat EST information. Using high variance probe set (HVP) analysis, 314 HVPs constituting genes present on 2BL were identified. The 314 HVPs were grouped into 3 categories: HVPs that match only rice chromosome 4 (298 HVPs), those that match only wheat ESTs mapped on 2BL (1), and those that match both rice chromosome 4 and wheat ESTs mapped on 2BL (15). All HVPs were converted into gene sets, which represented either unique rice gene models or mapped wheat ESTs that matched identified HVPs. Comparative physical maps were constructed for 16 wheat gene sets and 271 rice gene sets. Of the 271 rice gene sets, 257 were mapped to the 18-35 Mb regions on rice chromosome 4. Based on HVP analysis and sequence similarity between the gene models in the rice chromosomes and mapped wheat ESTs, the outermost rice gene model that limits the translocation breakpoint to orthologous regions was identified.

  8. shRNA-Induced Gene Knockdown In Vivo to Investigate Neutrophil Function.

    PubMed

    Basit, Abdul; Tang, Wenwen; Wu, Dianqing

    2016-01-01

    To silence genes in neutrophils efficiently, we exploited RNA interference and developed an shRNA-based gene knockdown technique. This method involves transfection of mouse bone marrow-derived hematopoietic stem cells with a retroviral vector carrying shRNA directed at a specific gene. Transfected stem cells are then transplanted into irradiated wild-type mice. After engraftment of stem cells, the transplanted mice have two sets of circulating neutrophils. One set has a gene of interest knocked down while the other set has the full complement of expressed genes. This efficient technique provides a unique way to directly compare the response of neutrophils with a knocked-down gene to that of neutrophils with the full complement of expressed genes in the same environment.

  9. GO-based functional dissimilarity of gene sets.

    PubMed

    Díaz-Díaz, Norberto; Aguilar-Ruiz, Jesús S

    2011-09-01

    The Gene Ontology (GO) provides a controlled vocabulary for describing the functions of genes and can be used to evaluate the functional coherence of gene sets. Many functional coherence measures consider each pair of gene functions in a set and produce an output based on all pairwise distances. A single gene can encode multiple proteins that may differ in function. For each functionality, other proteins that exhibit the same activity may also participate. Therefore, an identification of the most common function for all of the genes involved in a biological process is important in evaluating the functional similarity of groups of genes, and a quantification of functional coherence can help to clarify the role of a group of genes working together. To implement this approach to functional assessment, we present GFD (GO-based Functional Dissimilarity), a novel dissimilarity measure for evaluating groups of genes based on the most relevant functions of the whole set. The measure assigns a numerical value to the gene set for each of the three GO sub-ontologies. Results show that GFD performs robustly when applied to gene sets of known functionality (extracted from KEGG). It performs particularly well on randomly generated gene sets. An ROC analysis reveals that the performance of GFD in evaluating the functional dissimilarity of gene sets is very satisfactory. A comparative analysis against other functional measures, such as GS2 and those presented by Resnik and Wang, also demonstrates the robustness of GFD.

  10. Cross-Study Homogeneity of Psoriasis Gene Expression in Skin across a Large Expression Range

    PubMed Central

    Kerkof, Keith; Timour, Martin; Russell, Christopher B.

    2013-01-01

    Background: In psoriasis, only limited overlap between sets of genes identified as differentially expressed (psoriatic lesional vs. psoriatic non-lesional) was found using statistical and fold-change cut-offs. To provide a framework for utilizing prior psoriasis data sets we sought to understand the consistency of those sets. Methodology/Principal Findings: Microarray expression profiling and qRT-PCR were used to characterize gene expression in PP and PN skin from psoriasis patients. cDNA (three new data sets) and cRNA hybridization (four existing data sets) data were compared using a common analysis pipeline. Agreement between data sets was assessed using varying qualitative and quantitative cut-offs to generate a DEG list in a source data set and then using other data sets to validate the list. Concordance increased from 67% across all probe sets to over 99% across more than 10,000 probe sets when statistical filters were employed. The fold-change behavior of individual genes tended to be consistent across the multiple data sets. We found that genes with <2-fold change values were quantitatively reproducible between pairs of data-sets. In a subset of transcripts with a role in inflammation changes detected by microarray were confirmed by qRT-PCR with high concordance. For transcripts with both PN and PP levels within the microarray dynamic range, microarray and qRT-PCR were quantitatively reproducible, including minimal fold-changes in IL13, TNFSF11, and TNFRSF11B and genes with >10-fold changes in either direction such as CHRM3, IL12B and IFNG. Conclusions/Significance: Gene expression changes in psoriatic lesions were consistent across different studies, despite differences in patient selection, sample handling, and microarray platforms, but between-study comparisons showed stronger agreement within than between platforms. 
We could use cut-offs as low as log10(ratio) = 0.1 (fold-change = 1.26), generating larger gene lists that validate on independent data sets. The reproducibility of PP signatures across data sets suggests that different sample sets can be productively compared. PMID:23308107
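
The stated correspondence between log10(ratio) = 0.1 and a fold-change of 1.26 follows from 10^0.1 ≈ 1.259. A short sketch of that arithmetic, with hypothetical function names and made-up expression values, is:

```python
import math

def log10_ratio_to_fold_change(log_ratio):
    """Convert a log10 expression ratio to an unsigned fold-change."""
    return 10 ** abs(log_ratio)

def passes_cutoff(expr_lesional, expr_nonlesional, min_log10_ratio=0.1):
    """True if the lesional/non-lesional ratio meets the cut-off in either direction."""
    return abs(math.log10(expr_lesional / expr_nonlesional)) >= min_log10_ratio

fold_change = log10_ratio_to_fold_change(0.1)   # about 1.259
kept = passes_cutoff(130.0, 100.0)              # log10(1.3) is about 0.114
dropped = passes_cutoff(120.0, 100.0)           # log10(1.2) is about 0.079
```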

  11. ExAtlas: An interactive online tool for meta-analysis of gene expression data.

    PubMed

    Sharov, Alexei A; Schlessinger, David; Ko, Minoru S H

    2015-12-01

    We have developed ExAtlas, an on-line software tool for meta-analysis and visualization of gene expression data. In contrast to existing software tools, ExAtlas compares multi-component data sets and generates results for all combinations (e.g. all gene expression profiles versus all Gene Ontology annotations). ExAtlas handles both users' own data and data extracted semi-automatically from the public repository (GEO/NCBI database). ExAtlas provides a variety of tools for meta-analyses: (1) standard meta-analysis (fixed effects, random effects, z-score, and Fisher's methods); (2) analyses of global correlations between gene expression data sets; (3) gene set enrichment; (4) gene set overlap; (5) gene association by expression profile; (6) gene specificity; and (7) statistical analysis (ANOVA, pairwise comparison, and PCA). ExAtlas produces graphical outputs, including heatmaps, scatter-plots, bar-charts, and three-dimensional images. Some of the most widely used public data sets (e.g. GNF/BioGPS, Gene Ontology, KEGG, GAD phenotypes, BrainScan, ENCODE ChIP-seq, and protein-protein interaction) are pre-loaded and can be used for functional annotations.
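
Two of the standard meta-analysis options named above, Fisher's method and the z-score method, can be sketched with the standard library alone. This is a generic illustration, not ExAtlas code. It assumes independent one-sided p-values strictly between 0 and 1, uses the unweighted (Stouffer) form of the z-score method, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

def fisher_combined_p(p_values):
    """Fisher's method: X = -2 * sum(ln p) follows chi-square with 2k df under H0.
    For even degrees of freedom the survival function has a closed form:
    exp(-X/2) * sum_{i<k} (X/2)^i / i!."""
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # equals X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total

def stouffer_combined_p(p_values):
    """Unweighted z-score method: average the normal quantiles of the p-values."""
    normal = NormalDist()
    z_scores = [normal.inv_cdf(1.0 - p) for p in p_values]
    combined_z = sum(z_scores) / math.sqrt(len(z_scores))
    return 1.0 - normal.cdf(combined_z)

# Two hypothetical studies, each reporting p = 0.05 for the same gene.
fisher_p = fisher_combined_p([0.05, 0.05])      # about 0.0175
stouffer_p = stouffer_combined_p([0.05, 0.05])  # about 0.0100
```

Both methods give a combined p-value below either input, reflecting agreement between the two studies. With a single study, Fisher's method returns the input p-value unchanged.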

  12. Identification, characterization and expression analysis of lineage-specific genes within sweet orange (Citrus sinensis).

    PubMed

    Xu, Yuantao; Wu, Guizhi; Hao, Baohai; Chen, Lingling; Deng, Xiuxin; Xu, Qiang

    2015-11-23

    With the availability of a rapidly increasing number of genome and transcriptome sequences, lineage-specific genes (LSGs) can be identified and characterized. Like other conserved functional genes, LSGs play important roles in biological evolution and functions. Two sets of citrus LSGs, 296 citrus-specific genes (CSGs) and 1039 orphan genes specific to sweet orange, were identified by comparative analysis between the sweet orange genome sequences and 41 genomes and 273 transcriptomes. With the two sets of genes, gene structure and gene expression pattern were investigated. On average, both the CSGs and orphan genes have fewer exons, shorter gene length and higher GC content when compared with evolutionarily conserved genes (ECs). Expression profiling indicated that most of the LSGs were expressed in various tissues of sweet orange and some of them exhibited distinct temporal and spatial expression patterns. Particularly, the orphan genes were preferentially expressed in callus, which is an important pluripotent tissue of citrus. Besides, some of the CSGs and orphan genes were expressed in response to abiotic stress, indicating their potential functions during interaction with the environment. This study identified and characterized two sets of LSGs in citrus, dissected their sequence features and expression patterns, and provided valuable clues for future functional analysis of the LSGs in sweet orange.

  13. Comparative analysis of expressed sequence tags of conifers and angiosperms reveals sequences specifically conserved in conifers.

    PubMed

    Ujino-Ihara, Tokuko; Kanamori, Hiroyuki; Yamane, Hiroko; Taguchi, Yuriko; Namiki, Nobukazu; Mukai, Yuzuru; Yoshimura, Kensuke; Tsumura, Yoshihiko

    2005-12-01

    To identify and characterize lineage-specific genes of conifers, two sets of ESTs (with 12791 and 5902 ESTs, representing 5373 and 3018 gene transcripts, respectively) were generated from the Cupressaceae species Cryptomeria japonica and Chamaecyparis obtusa. These transcripts were compared with non-redundant sets of genes generated from Pinaceae species, other gymnosperms and angiosperms. About 6% of tentative unique genes (Unigenes) of C. japonica and C. obtusa had homologs in other conifers but not angiosperms, and about 70% had apparent homologs in angiosperms. The calculated GC contents of orthologous genes showed that GC contents of coniferous genes are likely to be lower than those of angiosperms. Comparisons of the numbers of homologous genes in each species suggest that copy numbers of genes may be correlated between diverse seed plants. This correlation suggests that the multiplicity of such genes may have arisen before the divergence of gymnosperms and angiosperms.

  14. Accurately Assessing the Risk of Schizophrenia Conferred by Rare Copy-Number Variation Affecting Genes with Brain Function

    PubMed Central

    Raychaudhuri, Soumya; Korn, Joshua M.; McCarroll, Steven A.; Altshuler, David; Sklar, Pamela; Purcell, Shaun; Daly, Mark J.

    2010-01-01

    Investigators have linked rare copy number variation (CNVs) to neuropsychiatric diseases, such as schizophrenia. One hypothesis is that CNV events cause disease by affecting genes with specific brain functions. Under these circumstances, we expect that CNV events in cases should impact brain-function genes more frequently than those events in controls. Previous publications have applied “pathway” analyses to genes within neuropsychiatric case CNVs to show enrichment for brain-functions. While such analyses have been suggestive, they often have not rigorously compared the rates of CNVs impacting genes with brain function in cases to controls, and therefore do not address important confounders such as the large size of brain genes and overall differences in rates and sizes of CNVs. To demonstrate the potential impact of confounders, we genotyped rare CNV events in 2,415 unaffected controls with Affymetrix 6.0; we then applied standard pathway analyses using four sets of brain-function genes and observed an apparently highly significant enrichment for each set. The enrichment is simply driven by the large size of brain-function genes. Instead, we propose a case-control statistical test, cnv-enrichment-test, to compare the rate of CNVs impacting specific gene sets in cases versus controls. With simulations, we demonstrate that cnv-enrichment-test is robust to case-control differences in CNV size, CNV rate, and systematic differences in gene size. Finally, we apply cnv-enrichment-test to rare CNV events published by the International Schizophrenia Consortium (ISC). This approach reveals nominal evidence of case-association in neuronal-activity and the learning gene sets, but not the other two examined gene sets. The neuronal-activity genes have been associated in a separate set of schizophrenia cases and controls; however, testing in independent samples is necessary to definitively confirm this association. Our method is implemented in the PLINK software package. 
PMID:20838587

  15. ArrayVigil: a methodology for statistical comparison of gene signatures using segregated-one-tailed (SOT) Wilcoxon's signed-rank test.

    PubMed

    Khan, Haseeb Ahmad

    2005-01-28

    Due to versatile diagnostic and prognostic fidelity, molecular signatures or fingerprints are anticipated as the most powerful tools for cancer management in the near future. Notwithstanding the experimental advancements in microarray technology, methods for analyzing either whole arrays or gene signatures have not been firmly established. Recently, an algorithm, ArraySolver, has been reported by Khan for two-group comparison of microarray gene expression data using two-tailed Wilcoxon signed-rank test. Most of the molecular signatures are composed of two sets of genes (hybrid signatures) wherein up-regulation of one set and down-regulation of the other set collectively define the purpose of a gene signature. Since the direction of a selected gene's expression (positive or negative) with respect to a particular disease condition is known, application of one-tailed statistics could be a more relevant choice. A novel method, ArrayVigil, is described for comparing hybrid signatures using segregated-one-tailed (SOT) Wilcoxon signed-rank test and the results compared with integrated-two-tailed (ITT) procedures (SPSS and ArraySolver). ArrayVigil resulted in lower P values than those obtained from ITT statistics while comparing real data from four signatures.
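
The one-tailed building block of such a segregated procedure can be sketched as an exact Wilcoxon signed-rank test. This is not ArrayVigil itself: how the abstract's method combines the two per-set results is not described here, so the sketch only tests each set in its expected direction. The function name and the paired differences are hypothetical.

```python
def signed_rank_one_tailed(differences):
    """Exact one-tailed Wilcoxon signed-rank test for paired differences.
    Returns (W+, P(W+ >= observed)) under the null of symmetry about zero.
    Zero differences are dropped and tied magnitudes receive midranks."""
    d = [x for x in differences if x != 0]
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks2 = [0] * n  # doubled midranks, so all arithmetic stays in integers
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        doubled_midrank = (i + 1) + (j + 1)
        for k in range(i, j + 1):
            ranks2[order[k]] = doubled_midrank
        i = j + 1
    w_plus2 = sum(r for r, x in zip(ranks2, d) if x > 0)
    # Count, over all 2^n sign assignments, how many reach each rank sum.
    counts = {0: 1}
    for r in ranks2:
        updated = {}
        for s, c in counts.items():
            updated[s] = updated.get(s, 0) + c          # rank gets a minus sign
            updated[s + r] = updated.get(s + r, 0) + c  # rank gets a plus sign
        counts = updated
    tail = sum(c for s, c in counts.items() if s >= w_plus2)
    return w_plus2 / 2, tail / 2 ** n

# Hypothetical disease-minus-control differences for the two halves of a signature.
up_set = [1.2, 0.8, 2.5, 0.3, 1.9, 0.6]          # expected to increase
down_set = [-1.1, -0.4, -2.0, -0.7, -1.6, -0.9]  # expected to decrease
w_up, p_up = signed_rank_one_tailed(up_set)
# Negate the down-regulated set so the same upper-tail test applies.
w_down, p_down = signed_rank_one_tailed([-x for x in down_set])
```

With six pairs that all move in the expected direction, each one-tailed p-value is 1/64, half of what the two-tailed test would report for the same data.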

  16. LEGO: a novel method for gene set over-representation analysis by incorporating network-based gene weights

    PubMed Central

    Dong, Xinran; Hao, Yun; Wang, Xiao; Tian, Weidong

    2016-01-01

    Pathway or gene set over-representation analysis (ORA) has become a routine task in functional genomics studies. However, currently widely used ORA tools employ statistical methods such as Fisher’s exact test that reduce a pathway into a list of genes, ignoring the constitutive functional non-equivalent roles of genes and the complex gene-gene interactions. Here, we develop a novel method named LEGO (functional Link Enrichment of Gene Ontology or gene sets) that takes into consideration these two types of information by incorporating network-based gene weights in ORA analysis. In three benchmarks, LEGO achieves better performance than Fisher and three other network-based methods. To further evaluate LEGO’s usefulness, we compare LEGO with five gene expression-based and three pathway topology-based methods using a benchmark of 34 disease gene expression datasets compiled by a recent publication, and show that LEGO is among the top-ranked methods in terms of both sensitivity and prioritization for detecting target KEGG pathways. In addition, we develop a cluster-and-filter approach to reduce the redundancy among the enriched gene sets, making the results more interpretable to biologists. Finally, we apply LEGO to two lists of autism genes, and identify relevant gene sets to autism that could not be found by Fisher. PMID:26750448
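
    For readers unfamiliar with the baseline that LEGO is compared against, the sketch below shows a plain over-representation test: the one-sided Fisher's exact test, written as a hypergeometric tail probability. It treats every gene equally, which is exactly the simplification the abstract criticizes. The function and parameter names are illustrative assumptions, not part of LEGO or any other tool.

```python
from math import comb


def ora_p_value(universe_size, set_size, hits_size, overlap):
    """One-sided over-representation p-value, P(X >= overlap).

    X is hypergeometric: hits_size genes are drawn without replacement from
    universe_size genes, of which set_size belong to the gene set. This is
    the one-sided Fisher's exact test on the corresponding 2x2 table.
    """
    total = comb(universe_size, hits_size)
    max_overlap = min(set_size, hits_size)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, hits_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Toy numbers: 20 genes in the universe, a pathway of 5 genes, and a
# hit list of 5 genes that shares 3 genes with the pathway.
print(ora_p_value(universe_size=20, set_size=5, hits_size=5, overlap=3))
```

    Network-weighted methods such as the one described here replace the simple overlap count with a weighted statistic; that step is not shown.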

  17. LEGO: a novel method for gene set over-representation analysis by incorporating network-based gene weights.

    PubMed

    Dong, Xinran; Hao, Yun; Wang, Xiao; Tian, Weidong

    2016-01-11

    Pathway or gene set over-representation analysis (ORA) has become a routine task in functional genomics studies. However, currently widely used ORA tools employ statistical methods such as Fisher's exact test that reduce a pathway into a list of genes, ignoring the constitutive functional non-equivalent roles of genes and the complex gene-gene interactions. Here, we develop a novel method named LEGO (functional Link Enrichment of Gene Ontology or gene sets) that takes into consideration these two types of information by incorporating network-based gene weights in ORA analysis. In three benchmarks, LEGO achieves better performance than Fisher and three other network-based methods. To further evaluate LEGO's usefulness, we compare LEGO with five gene expression-based and three pathway topology-based methods using a benchmark of 34 disease gene expression datasets compiled by a recent publication, and show that LEGO is among the top-ranked methods in terms of both sensitivity and prioritization for detecting target KEGG pathways. In addition, we develop a cluster-and-filter approach to reduce the redundancy among the enriched gene sets, making the results more interpretable to biologists. Finally, we apply LEGO to two lists of autism genes, and identify relevant gene sets to autism that could not be found by Fisher.

  18. Consistency of gene starts among Burkholderia genomes

    PubMed Central

    2011-01-01

    Background Evolutionary divergence in the position of the translational start site among orthologous genes can have significant functional impacts. Divergence can alter the translation rate, degradation rate, subcellular location, and function of the encoded proteins. Results Existing Genbank gene maps for Burkholderia genomes suggest that extensive divergence has occurred--53% of ortholog sets based on Genbank gene maps had inconsistent gene start sites. However, most of these inconsistencies appear to be gene-calling errors. Evolutionary divergence was the most plausible explanation for only 17% of the ortholog sets. Correcting probable errors in the Genbank gene maps decreased the percentage of ortholog sets with inconsistent starts by 68%, increased the percentage of ortholog sets with extractable upstream intergenic regions by 32%, increased the sequence similarity of intergenic regions and predicted proteins, and increased the number of proteins with identifiable signal peptides. Conclusions Our findings highlight an emerging problem in comparative genomics: single-digit percent errors in gene predictions can lead to double-digit percentages of inconsistent ortholog sets. The work demonstrates a simple approach to evaluate and improve the quality of gene maps. PMID:21342528

  19. Genome-wide identification of lineage-specific genes in Arabidopsis, Oryza and Populus

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Yang, Xiaohan; Jawdy, Sara; Tschaplinski, Timothy J

    2009-01-01

    Protein sequences were compared among Arabidopsis, Oryza and Populus to identify differential gene (DG) sets that are in one but not the other two genomes. The DG sets were screened against a plant transcript database, the NR protein database and six newly-sequenced genomes (Carica, Glycine, Medicago, Sorghum, Vitis and Zea) to identify a set of species-specific genes (SS). Gene expression, protein motif and intron number were examined. 192, 641 and 109 SS genes were identified in Arabidopsis, Oryza and Populus, respectively. Some SS genes were preferentially expressed in flowers, roots, xylem and cambium or up-regulated by stress. Six conserved motifs in Arabidopsis and Oryza SS proteins were found in other distant lineages. The SS gene sets were enriched with intronless genes. The results reflect functional and/or anatomical differences between monocots and eudicots or between herbaceous and woody plants. The Populus-specific genes are candidates for carbon sequestration and biofuel research.

  20. Identification of a conserved set of upregulated genes in mouse skeletal muscle hypertrophy and regrowth.

    PubMed

    Chaillou, Thomas; Jackson, Janna R; England, Jonathan H; Kirby, Tyler J; Richards-White, Jena; Esser, Karyn A; Dupont-Versteegden, Esther E; McCarthy, John J

    2015-01-01

    The purpose of this study was to compare the gene expression profile of mouse skeletal muscle undergoing two forms of growth (hypertrophy and regrowth) with the goal of identifying a conserved set of differentially expressed genes. Expression profiling by microarray was performed on the plantaris muscle subjected to 1, 3, 5, 7, 10, and 14 days of hypertrophy or regrowth following 2 wk of hind-limb suspension. We identified 97 differentially expressed genes (≥2-fold increase or ≥50% decrease compared with control muscle) that were conserved during the two forms of muscle growth. The vast majority (∼90%) of the differentially expressed genes was upregulated and occurred at a single time point (64 out of 86 genes), which most often was on the first day of the time course. Microarray analysis from the conserved upregulated genes showed a set of genes related to contractile apparatus and stress response at day 1, including three genes involved in mechanotransduction and four genes encoding heat shock proteins. Our analysis further identified three cell cycle-related genes at day and several genes associated with extracellular matrix (ECM) at both days 3 and 10. In conclusion, we have identified a core set of genes commonly upregulated in two forms of muscle growth that could play a role in the maintenance of sarcomere stability, ECM remodeling, cell proliferation, fast-to-slow fiber type transition, and the regulation of skeletal muscle growth. These findings suggest conserved regulatory mechanisms involved in the adaptation of skeletal muscle to increased mechanical loading. Copyright © 2015 the American Physiological Society.
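
    The selection rule quoted above (at least a 2-fold increase or at least a 50% decrease relative to control, shared by both growth models) can be sketched as a simple filter. The gene names and expression values are made up, and requiring the same direction of change in both models is an assumption of this example rather than something the abstract states.

```python
def call_direction(treated, control, up_ratio=2.0, down_ratio=0.5):
    """Label one gene 'up', 'down', or 'unchanged' from its expression ratio."""
    ratio = treated / control
    if ratio >= up_ratio:
        return "up"
    if ratio <= down_ratio:
        return "down"
    return "unchanged"


def conserved_calls(model_a, model_b):
    """Return genes that change in the same direction in both models.

    Each argument maps gene name -> (treated, control) expression values.
    """
    shared = {}
    for gene in model_a.keys() & model_b.keys():
        call_a = call_direction(*model_a[gene])
        call_b = call_direction(*model_b[gene])
        if call_a == call_b and call_a != "unchanged":
            shared[gene] = call_a
    return shared


# Made-up expression values, for illustration only.
hypertrophy = {"geneA": (400.0, 100.0), "geneB": (90.0, 100.0), "geneC": (40.0, 100.0)}
regrowth = {"geneA": (250.0, 100.0), "geneB": (300.0, 100.0), "geneC": (50.0, 100.0)}
print(conserved_calls(hypertrophy, regrowth))
```

    Here geneA passes the 2-fold threshold in both models and geneC meets the 50% decrease in both, while geneB changes in only one model and is dropped.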

  1. Identification of a conserved set of upregulated genes in mouse skeletal muscle hypertrophy and regrowth

    PubMed Central

    Chaillou, Thomas; Jackson, Janna R.; England, Jonathan H.; Kirby, Tyler J.; Richards-White, Jena; Esser, Karyn A.; Dupont-Versteegden, Esther E.

    2014-01-01

    The purpose of this study was to compare the gene expression profile of mouse skeletal muscle undergoing two forms of growth (hypertrophy and regrowth) with the goal of identifying a conserved set of differentially expressed genes. Expression profiling by microarray was performed on the plantaris muscle subjected to 1, 3, 5, 7, 10, and 14 days of hypertrophy or regrowth following 2 wk of hind-limb suspension. We identified 97 differentially expressed genes (≥2-fold increase or ≥50% decrease compared with control muscle) that were conserved during the two forms of muscle growth. The vast majority (∼90%) of the differentially expressed genes was upregulated and occurred at a single time point (64 out of 86 genes), which most often was on the first day of the time course. Microarray analysis from the conserved upregulated genes showed a set of genes related to contractile apparatus and stress response at day 1, including three genes involved in mechanotransduction and four genes encoding heat shock proteins. Our analysis further identified three cell cycle-related genes at day and several genes associated with extracellular matrix (ECM) at both days 3 and 10. In conclusion, we have identified a core set of genes commonly upregulated in two forms of muscle growth that could play a role in the maintenance of sarcomere stability, ECM remodeling, cell proliferation, fast-to-slow fiber type transition, and the regulation of skeletal muscle growth. These findings suggest conserved regulatory mechanisms involved in the adaptation of skeletal muscle to increased mechanical loading. PMID:25554798

  2. Combining multiple tools outperforms individual methods in gene set enrichment analyses.

    PubMed

    Alhamdoosh, Monther; Ng, Milica; Wilson, Nicholas J; Sheridan, Julie M; Huynh, Huy; Wilson, Michael J; Ritchie, Matthew E

    2017-02-01

    Gene set enrichment (GSE) analysis allows researchers to efficiently extract biological insight from long lists of differentially expressed genes by interrogating them at a systems level. In recent years, there has been a proliferation of GSE analysis methods and hence it has become increasingly difficult for researchers to select an optimal GSE tool based on their particular dataset. Moreover, the majority of GSE analysis methods do not allow researchers to simultaneously compare gene set level results between multiple experimental conditions. The ensemble of gene set enrichment analyses (EGSEA) is a method developed for RNA-sequencing data that combines results from twelve algorithms and calculates collective gene set scores to improve the biological relevance of the highest ranked gene sets. EGSEA's gene set database contains around 25 000 gene sets from sixteen collections. It has multiple visualization capabilities that allow researchers to view gene sets at various levels of granularity. EGSEA has been tested on simulated data and on a number of human and mouse datasets and, based on biologists' feedback, consistently outperforms the individual tools that have been combined. Our evaluation demonstrates the superiority of the ensemble approach for GSE analysis, and its utility to effectively and efficiently extrapolate biological functions and potential involvement in disease processes from lists of differentially regulated genes. EGSEA is available as an R package at http://www.bioconductor.org/packages/EGSEA/ . The gene sets collections are available in the R package EGSEAdata from http://www.bioconductor.org/packages/EGSEAdata/ . Contact: monther.alhamdoosh@csl.com.au, mritchie@wehi.edu.au. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
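
    One simple way to combine gene set results from several methods is to rank the gene sets within each method and average the ranks. The sketch below illustrates that general ensemble idea only; it is not EGSEA's scoring scheme, and the method names, gene set names, and p-values are hypothetical.

```python
from statistics import mean


def rank_gene_sets(p_values):
    """Rank gene sets within one method: rank 1 is the smallest p-value."""
    ordered = sorted(p_values, key=lambda name: (p_values[name], name))
    return {name: position for position, name in enumerate(ordered, start=1)}


def ensemble_ranking(p_values_by_method):
    """Order gene sets by their mean rank across methods (best first).

    p_values_by_method maps method name -> {gene set name: p-value}; every
    method is assumed to report the same gene sets.
    """
    per_method = [rank_gene_sets(p) for p in p_values_by_method.values()]
    gene_sets = per_method[0].keys()
    mean_rank = {name: mean(r[name] for r in per_method) for name in gene_sets}
    return sorted(mean_rank.items(), key=lambda item: (item[1], item[0]))


# Hypothetical p-values from three hypothetical methods.
example = {
    "method1": {"setA": 0.01, "setB": 0.20, "setC": 0.03},
    "method2": {"setA": 0.04, "setB": 0.50, "setC": 0.02},
    "method3": {"setA": 0.001, "setB": 0.03, "setC": 0.20},
}
for name, score in ensemble_ranking(example):
    print(name, round(score, 2))
```

    Averaging ranks rather than raw p-values keeps one method with extreme p-values from dominating the combined ordering, at the cost of discarding effect magnitude.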

  3. Investigating the different mechanisms of genotoxic and non-genotoxic carcinogens by a gene set analysis.

    PubMed

    Lee, Won Jun; Kim, Sang Cheol; Lee, Seul Ji; Lee, Jeongmi; Park, Jeong Hill; Yu, Kyung-Sang; Lim, Johan; Kwon, Sung Won

    2014-01-01

    Based on the process of carcinogenesis, carcinogens are classified as either genotoxic or non-genotoxic. In contrast to non-genotoxic carcinogens, many genotoxic carcinogens have been reported to cause tumor in carcinogenic bioassays in animals. Thus evaluating the genotoxicity potential of chemicals is important to discriminate genotoxic from non-genotoxic carcinogens for health care and pharmaceutical industry safety. Additionally, investigating the difference between the mechanisms of genotoxic and non-genotoxic carcinogens could provide the foundation for a mechanism-based classification for unknown compounds. In this study, we investigated the gene expression of HepG2 cells treated with genotoxic or non-genotoxic carcinogens and compared their mechanisms of action. To enhance our understanding of the differences in the mechanisms of genotoxic and non-genotoxic carcinogens, we implemented a gene set analysis using 12 compounds for the training set (12, 24, 48 h) and validated significant gene sets using 22 compounds for the test set (24, 48 h). For a direct biological translation, we conducted a gene set analysis using Globaltest and selected significant gene sets. To validate the results, training and test compounds were predicted by the significant gene sets using a prediction analysis for microarrays (PAM). Finally, we obtained 6 gene sets, including sets enriched for genes involved in the adherens junction, bladder cancer, p53 signaling pathway, pathways in cancer, peroxisome and RNA degradation. Among the 6 gene sets, the bladder cancer and p53 signaling pathway sets were significant at 12, 24 and 48 h. We also found that the DDB2, RRM2B and GADD45A, genes related to the repair and damage prevention of DNA, were consistently up-regulated for genotoxic carcinogens. 
    Our results suggest that a gene set analysis could provide a robust tool in the investigation of the different mechanisms of genotoxic and non-genotoxic carcinogens and construct a more detailed understanding of the perturbation of significant pathways.

  4. Investigating the Different Mechanisms of Genotoxic and Non-Genotoxic Carcinogens by a Gene Set Analysis

    PubMed Central

    Lee, Won Jun; Kim, Sang Cheol; Lee, Seul Ji; Lee, Jeongmi; Park, Jeong Hill; Yu, Kyung-Sang; Lim, Johan; Kwon, Sung Won

    2014-01-01

    Based on the process of carcinogenesis, carcinogens are classified as either genotoxic or non-genotoxic. In contrast to non-genotoxic carcinogens, many genotoxic carcinogens have been reported to cause tumor in carcinogenic bioassays in animals. Thus evaluating the genotoxicity potential of chemicals is important to discriminate genotoxic from non-genotoxic carcinogens for health care and pharmaceutical industry safety. Additionally, investigating the difference between the mechanisms of genotoxic and non-genotoxic carcinogens could provide the foundation for a mechanism-based classification for unknown compounds. In this study, we investigated the gene expression of HepG2 cells treated with genotoxic or non-genotoxic carcinogens and compared their mechanisms of action. To enhance our understanding of the differences in the mechanisms of genotoxic and non-genotoxic carcinogens, we implemented a gene set analysis using 12 compounds for the training set (12, 24, 48 h) and validated significant gene sets using 22 compounds for the test set (24, 48 h). For a direct biological translation, we conducted a gene set analysis using Globaltest and selected significant gene sets. To validate the results, training and test compounds were predicted by the significant gene sets using a prediction analysis for microarrays (PAM). Finally, we obtained 6 gene sets, including sets enriched for genes involved in the adherens junction, bladder cancer, p53 signaling pathway, pathways in cancer, peroxisome and RNA degradation. Among the 6 gene sets, the bladder cancer and p53 signaling pathway sets were significant at 12, 24 and 48 h. We also found that the DDB2, RRM2B and GADD45A, genes related to the repair and damage prevention of DNA, were consistently up-regulated for genotoxic carcinogens. 
    Our results suggest that a gene set analysis could provide a robust tool in the investigation of the different mechanisms of genotoxic and non-genotoxic carcinogens and construct a more detailed understanding of the perturbation of significant pathways. PMID:24497971

  5. Anopheles gambiae genome reannotation through synthesis of ab initio and comparative gene prediction algorithms

    PubMed Central

    Li, Jun; Riehle, Michelle M; Zhang, Yan; Xu, Jiannong; Oduol, Frederick; Gomez, Shawn M; Eiglmeier, Karin; Ueberheide, Beatrix M; Shabanowitz, Jeffrey; Hunt, Donald F; Ribeiro, José MC; Vernick, Kenneth D

    2006-01-01

    Background Complete genome annotation is a necessary tool as Anopheles gambiae researchers probe the biology of this potent malaria vector. Results We reannotate the A. gambiae genome by synthesizing comparative and ab initio sets of predicted coding sequences (CDSs) into a single set using an exon-gene-union algorithm followed by an open-reading-frame-selection algorithm. The reannotation predicts 20,970 CDSs supported by at least two lines of evidence, and it lowers the proportion of CDSs lacking start and/or stop codons to only approximately 4%. The reannotated CDS set includes a set of 4,681 novel CDSs not represented in the Ensembl annotation but with EST support, and another set of 4,031 Ensembl-supported genes that undergo major structural and, therefore, probably functional changes in the reannotated set. The quality and accuracy of the reannotation was assessed by comparison with end sequences from 20,249 full-length cDNA clones, and evaluation of mass spectrometry peptide hit rates from an A. gambiae shotgun proteomic dataset confirms that the reannotated CDSs offer a high quality protein database for proteomics. We provide a functional proteomics annotation, ReAnoXcel, obtained by analysis of the new CDSs through the AnoXcel pipeline, which allows functional comparisons of the CDS sets within the same bioinformatic platform. CDS data are available for download. Conclusion Comprehensive A. gambiae genome reannotation is achieved through a combination of comparative and ab initio gene prediction algorithms. PMID:16569258

  6. A Guideline to Family-Wide Comparative State-of-the-Art Quantitative RT-PCR Analysis Exemplified with a Brassicaceae Cross-Species Seed Germination Case Study

    PubMed Central

    Graeber, Kai; Linkies, Ada; Wood, Andrew T.A.; Leubner-Metzger, Gerhard

    2011-01-01

    Comparative biology includes the comparison of transcriptome and quantitative real-time RT-PCR (qRT-PCR) data sets in a range of species to detect evolutionarily conserved and divergent processes. Transcript abundance analysis of target genes by qRT-PCR requires a highly accurate and robust workflow. This includes reference genes with high expression stability (i.e., low intersample transcript abundance variation) for correct target gene normalization. Cross-species qRT-PCR for proper comparative transcript quantification requires reference genes suitable for different species. We addressed this issue using tissue-specific transcriptome data sets of germinating Lepidium sativum seeds to identify new candidate reference genes. We investigated their expression stability in germinating seeds of L. sativum and Arabidopsis thaliana by qRT-PCR, combined with in silico analysis of Arabidopsis and Brassica napus microarray data sets. This revealed that reference gene expression stability is higher for a given developmental process between distinct species than for distinct developmental processes within a given single species. The identified superior cross-species reference genes may be used for family-wide comparative qRT-PCR analysis of Brassicaceae seed germination. Furthermore, using germinating seeds, we exemplify optimization of the qRT-PCR workflow for challenging tissues regarding RNA quality, transcript stability, and tissue abundance. Our work therefore can serve as a guideline for moving beyond Arabidopsis by establishing high-quality cross-species qRT-PCR. PMID:21666000

  7. International interlaboratory study comparing single organism 16S rRNA gene sequencing data: Beyond consensus sequence comparisons

    PubMed Central

    Olson, Nathan D.; Lund, Steven P.; Zook, Justin M.; Rojas-Cornejo, Fabiola; Beck, Brian; Foy, Carole; Huggett, Jim; Whale, Alexandra S.; Sui, Zhiwei; Baoutina, Anna; Dobeson, Michael; Partis, Lina; Morrow, Jayne B.

    2015-01-01

    This study presents the results from an interlaboratory sequencing study for which we developed a novel high-resolution method for comparing data from different sequencing platforms for a multi-copy, paralogous gene. The combination of PCR amplification and 16S ribosomal RNA gene (16S rRNA) sequencing has revolutionized bacteriology by enabling rapid identification, frequently without the need for culture. To assess variability between laboratories in sequencing 16S rRNA, six laboratories sequenced the gene encoding the 16S rRNA from Escherichia coli O157:H7 strain EDL933 and Listeria monocytogenes serovar 4b strain NCTC11994. Participants performed sequencing methods and protocols available in their laboratories: Sanger sequencing, Roche 454 pyrosequencing®, or Ion Torrent PGM®. The sequencing data were evaluated on three levels: (1) identity of biologically conserved position, (2) ratio of 16S rRNA gene copies featuring identified variants, and (3) the collection of variant combinations in a set of 16S rRNA gene copies. The same set of biologically conserved positions was identified for each sequencing method. Analytical methods using Bayesian and maximum likelihood statistics were developed to estimate variant copy ratios, which describe the ratio of nucleotides at each identified biologically variable position, as well as the likely set of variant combinations present in 16S rRNA gene copies. Our results indicate that estimated variant copy ratios at biologically variable positions were only reproducible for high throughput sequencing methods. Furthermore, the likely variant combination set was only reproducible with increased sequencing depth and longer read lengths. We also demonstrate novel methods for evaluating variable positions when comparing multi-copy gene sequence data from multiple laboratories generated using multiple sequencing technologies. PMID:27077030

  8. Expression of the histone chaperone SET/TAF-Iβ during the strobilation process of Mesocestoides corti (Platyhelminthes, Cestoda).

    PubMed

    Costa, Caroline B; Monteiro, Karina M; Teichmann, Aline; da Silva, Edileuza D; Lorenzatto, Karina R; Cancela, Martín; Paes, Jéssica A; Benitz, André de N D; Castillo, Estela; Margis, Rogério; Zaha, Arnaldo; Ferreira, Henrique B

    2015-08-01

    The histone chaperone SET/TAF-Iβ is implicated in processes of chromatin remodelling and gene expression regulation. It has been associated with the control of developmental processes, but little is known about its function in helminth parasites. In Mesocestoides corti, a partial cDNA sequence related to SET/TAF-Iβ was isolated in a screening for genes differentially expressed in larvae (tetrathyridia) and adult worms. Here, the full-length coding sequence of the M. corti SET/TAF-Iβ gene was analysed and the encoded protein (McSET/TAF) was compared with orthologous sequences, showing that McSET/TAF can be regarded as a SET/TAF-Iβ family member, with a typical nucleosome-assembly protein (NAP) domain and an acidic tail. The expression patterns of the McSET/TAF gene and protein were investigated during the strobilation process by RT-qPCR, using a set of five reference genes, and by immunoblot and immunofluorescence, using monospecific polyclonal antibodies. A gradual increase in McSET/TAF transcripts and McSET/TAF protein was observed upon development induction by trypsin, demonstrating McSET/TAF differential expression during strobilation. These results provided the first evidence for the involvement of a protein from the NAP family of epigenetic effectors in the regulation of cestode development.

  9. Gene integrated set profile analysis: a context-based approach for inferring biological endpoints

    PubMed Central

    Kowalski, Jeanne; Dwivedi, Bhakti; Newman, Scott; Switchenko, Jeffery M.; Pauly, Rini; Gutman, David A.; Arora, Jyoti; Gandhi, Khanjan; Ainslie, Kylie; Doho, Gregory; Qin, Zhaohui; Moreno, Carlos S.; Rossi, Michael R.; Vertino, Paula M.; Lonial, Sagar; Bernal-Mizrachi, Leon; Boise, Lawrence H.

    2016-01-01

    The identification of genes with specific patterns of change (e.g. down-regulated and methylated) as phenotype drivers or samples with similar profiles for a given gene set as drivers of clinical outcome, requires the integration of several genomic data types for which an ‘integrate by intersection’ (IBI) approach is often applied. In this approach, results from separate analyses of each data type are intersected, which has the limitation of a smaller intersection with more data types. We introduce a new method, GISPA (Gene Integrated Set Profile Analysis) for integrated genomic analysis and its variation, SISPA (Sample Integrated Set Profile Analysis) for defining respective genes and samples with the context of similar, a priori specified molecular profiles. With GISPA, the user defines a molecular profile that is compared among several classes and obtains ranked gene sets that satisfy the profile as drivers of each class. With SISPA, the user defines a gene set that satisfies a profile and obtains sample groups of profile activity. Our results from applying GISPA to human multiple myeloma (MM) cell lines contained genes of known profiles and importance, along with several novel targets, and their further SISPA application to MM coMMpass trial data showed clinical relevance. PMID:26826710

  10. A Meta-Analysis of Multiple Matched Copy Number and Transcriptomics Data Sets for Inferring Gene Regulatory Relationships

    PubMed Central

    Newton, Richard; Wernisch, Lorenz

    2014-01-01

    Inferring gene regulatory relationships from observational data is challenging. Manipulation and intervention is often required to unravel causal relationships unambiguously. However, gene copy number changes, as they frequently occur in cancer cells, might be considered natural manipulation experiments on gene expression. An increasing number of data sets on matched array comparative genomic hybridisation and transcriptomics experiments from a variety of cancer pathologies are becoming publicly available. Here we explore the potential of a meta-analysis of thirty such data sets. The aim of our analysis was to assess the potential of in silico inference of trans-acting gene regulatory relationships from this type of data. We found sufficient correlation signal in the data to infer gene regulatory relationships, with interesting similarities between data sets. A number of genes had highly correlated copy number and expression changes in many of the data sets and we present predicted potential trans-acted regulatory relationships for each of these genes. The study also investigates to what extent heterogeneity between cell types and between pathologies determines the number of statistically significant predictions available from a meta-analysis of experiments. PMID:25148247

  11. Differential Effect of Active Smoking on Gene Expression in Male and Female Smokers

    PubMed Central

    Paul, Sunirmal; Amundson, Sally A

    2015-01-01

    Smoking is the second leading cause of preventable death in the United States. Cohort epidemiological studies have demonstrated that women are more vulnerable to cigarette-smoking induced diseases than their male counterparts; however, the molecular basis of these differences has remained unknown. In this study, we explored whether there were differences in the gene expression patterns between male and female smokers, and how these patterns might reflect different sex-specific responses to the stress of smoking. Using whole genome microarray gene expression profiling, we found that a substantial number of oxidant related genes were expressed in both male and female smokers; however, smoking-responsive genes did indeed differ greatly between male and female smokers. Gene set enrichment analysis (GSEA) against reference oncogenic signature gene sets identified a large number of oncogenic pathway gene-sets that were significantly altered in female smokers compared to male smokers. In addition, functional annotation with Ingenuity Pathway Analysis (IPA) identified smoking-correlated genes associated with biological functions in male and female smokers that are directly relevant to well-known smoking related pathologies. However, these relevant biological functions were strikingly overrepresented in female smokers compared to male smokers. IPA network analysis with the functional categories of immune and inflammatory response gene products suggested potential interactions between smoking response and female hormones. Our results demonstrate a striking dichotomy between male and female gene expression responses to smoking. This is the first genome-wide expression study to compare the sex-specific impacts of smoking at a molecular level and suggests a novel potential connection between sex hormone signaling and smoking-induced diseases in female smokers. PMID:25621181

  12. QuickMap: a public tool for large-scale gene therapy vector insertion site mapping and analysis.

    PubMed

    Appelt, J-U; Giordano, F A; Ecker, M; Roeder, I; Grund, N; Hotz-Wagenblatt, A; Opelz, G; Zeller, W J; Allgayer, H; Fruehauf, S; Laufs, S

    2009-07-01

    Several events of insertional mutagenesis in pre-clinical and clinical gene therapy studies have created intense interest in assessing the genomic insertion profiles of gene therapy vectors. For the construction of such profiles, vector-flanking sequences detected by inverse PCR, linear amplification-mediated-PCR or ligation-mediated-PCR need to be mapped to the host cell's genome and compared to a reference set. Although remarkable progress has been achieved in mapping gene therapy vector insertion sites, public reference sets are lacking, as are the possibilities to quickly detect non-random patterns in experimental data. We developed a tool termed QuickMap, which uniformly maps and analyzes human and murine vector-flanking sequences within seconds (available at www.gtsg.org). Besides information about hits in chromosomes and fragile sites, QuickMap automatically determines insertion frequencies in +/- 250 kb adjacency to genes, cancer genes, pseudogenes, transcription factor and (post-transcriptional) miRNA binding sites, CpG islands and repetitive elements (short interspersed nuclear elements (SINE), long interspersed nuclear elements (LINE), Type II elements and LTR elements). Additionally, all experimental frequencies are compared with the data obtained from a reference set, containing 1 000 000 random integrations ('random set'). Thus, for the first time a tool allowing high-throughput profiling of gene therapy vector insertion sites is available. It provides a basis for large-scale insertion site analyses, which is now urgently needed to discover novel gene therapy vectors with 'safe' insertion profiles.
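
    The comparison described above, insertion frequency within a ±250 kb window of genes measured against a random reference set, can be sketched as follows. This is not QuickMap code: the coordinate tuples, function names, and example positions are assumptions for illustration, and a real tool would use an interval index rather than this linear scan.

```python
def fraction_near_genes(sites, genes, window=250_000):
    """Fraction of insertion sites within `window` bp of any gene.

    sites: list of (chromosome, position)
    genes: list of (chromosome, start, end)
    """
    def is_near(chrom, pos):
        return any(
            gene_chrom == chrom and start - window <= pos <= end + window
            for gene_chrom, start, end in genes
        )

    hits = sum(1 for chrom, pos in sites if is_near(chrom, pos))
    return hits / len(sites)


def enrichment_ratio(observed_sites, reference_sites, genes, window=250_000):
    """Observed near-gene frequency divided by the reference frequency."""
    observed = fraction_near_genes(observed_sites, genes, window)
    expected = fraction_near_genes(reference_sites, genes, window)
    return observed / expected


# Hypothetical coordinates, for illustration only.
genes = [("chr1", 1_000_000, 1_050_000)]
vector_sites = [("chr1", 800_000), ("chr1", 700_000), ("chr2", 1_000_000), ("chr1", 1_300_000)]
random_sites = [("chr1", 900_000), ("chr1", 5_000_000), ("chr2", 100_000), ("chr2", 7_000_000)]
print(fraction_near_genes(vector_sites, genes))
print(enrichment_ratio(vector_sites, random_sites, genes))
```

    A ratio above 1 would suggest the vector integrates near genes more often than chance; judging significance would need a statistical test on the counts, which is omitted here.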

  13. Drug2Gene: an exhaustive resource to explore effectively the drug-target relation network.

    PubMed

    Roider, Helge G; Pavlova, Nadia; Kirov, Ivaylo; Slavov, Stoyan; Slavov, Todor; Uzunov, Zlatyo; Weiss, Bertram

    2014-03-11

    Information about drug-target relations is at the heart of drug discovery. There are now dozens of databases providing drug-target interaction data with varying scope, and focus. Therefore, and due to the large chemical space, the overlap of the different data sets is surprisingly small. As searching through these sources manually is cumbersome, time-consuming and error-prone, integrating all the data is highly desirable. Despite a few attempts, integration has been hampered by the diversity of descriptions of compounds, and by the fact that the reported activity values, coming from different data sets, are not always directly comparable due to usage of different metrics or data formats. We have built Drug2Gene, a knowledge base, which combines the compound/drug-gene/protein information from 19 publicly available databases. A key feature is our rigorous unification and standardization process which makes the data truly comparable on a large scale, allowing for the first time effective data mining in such a large knowledge corpus. As of version 3.2, Drug2Gene contains 4,372,290 unified relations between compounds and their targets most of which include reported bioactivity data. We extend this set with putative (i.e. homology-inferred) relations where sufficient sequence homology between proteins suggests they may bind to similar compounds. Drug2Gene provides powerful search functionalities, very flexible export procedures, and a user-friendly web interface. Drug2Gene v3.2 has become a mature and comprehensive knowledge base providing unified, standardized drug-target related information gathered from publicly available data sources. It can be used to integrate proprietary data sets with publicly available data sets. Its main goal is to be a 'one-stop shop' to identify tool compounds targeting a given gene product or for finding all known targets of a drug. 
Drug2Gene with its integrated data set of public compound-target relations is freely accessible without restrictions at http://www.drug2gene.com.

  14. Pathway-based analysis of GWAs data identifies association of sex determination genes with susceptibility to testicular germ cell tumors.

    PubMed

    Koster, Roelof; Mitra, Nandita; D'Andrea, Kurt; Vardhanabhuti, Saran; Chung, Charles C; Wang, Zhaoming; Loren Erickson, R; Vaughn, David J; Litchfield, Kevin; Rahman, Nazneen; Greene, Mark H; McGlynn, Katherine A; Turnbull, Clare; Chanock, Stephen J; Nathanson, Katherine L; Kanetsky, Peter A

    2014-11-15

    Genome-wide association (GWA) studies of testicular germ cell tumor (TGCT) have identified 18 susceptibility loci, some containing genes encoding proteins important in male germ cell development. Deletions of one of these genes, DMRT1, lead to male-to-female sex reversal and are associated with development of gonadoblastoma. To further explore genetic association with TGCT, we undertook a pathway-based analysis of SNP marker associations in the Penn GWAs (349 TGCT cases and 919 controls). We analyzed a custom-built sex determination gene set consisting of 32 genes using three different methods of pathway-based analysis. The sex determination gene set ranked highly compared with canonical gene sets, and it was associated with TGCT (FDRG = 2.28 × 10^-5, FDRM = 0.014 and FDRI = 0.008 for Gene Set Analysis-SNP (GSA-SNP), Meta-Analysis Gene Set Enrichment of Variant Associations (MAGENTA) and Improved Gene Set Enrichment Analysis for Genome-wide Association Study (i-GSEA4GWAS) analysis, respectively). The association remained after removal of DMRT1 from the gene set (FDRG = 0.0002, FDRM = 0.055 and FDRI = 0.009). Using data from the NCI GWA scan (582 TGCT cases and 1056 controls) and UK scan (986 TGCT cases and 4946 controls), we replicated these findings (NCI: FDRG = 0.006, FDRM = 0.014, FDRI = 0.033, and UK: FDRG = 1.04 × 10^-6, FDRM = 0.016, FDRI = 0.025). After removal of DMRT1 from the gene set, the sex determination gene set remains associated with TGCT in the NCI (FDRG = 0.039, FDRM = 0.050 and FDRI = 0.055) and UK scans (FDRG = 3.00 × 10^-5, FDRM = 0.056 and FDRI = 0.044). With the exception of DMRT1, genes in the sex determination gene set have not previously been identified as TGCT susceptibility loci in these GWA scans, demonstrating the complementary nature of a pathway-based approach for genome-wide analysis of TGCT. © The Author 2014. Published by Oxford University Press. All rights reserved. 

  15. Analysis of co-evolving genes in Campylobacter jejuni and C. coli

    USDA-ARS's Scientific Manuscript database

    Background: The population structure of Campylobacter has been frequently studied by MLST, for which fragments of housekeeping genes are compared. We wished to determine if the used MLST genes are representative of the complete genome. Methods: A set of 1029 core gene families (CGF) was identifie...

  16. Meta-Analysis of Tumor Stem-Like Breast Cancer Cells Using Gene Set and Network Analysis

    PubMed Central

    Lee, Won Jun; Kim, Sang Cheol; Yoon, Jung-Ho; Yoon, Sang Jun; Lim, Johan; Kim, You-Sun; Kwon, Sung Won; Park, Jeong Hill

    2016-01-01

    Generally, cancer stem cells have epithelial-to-mesenchymal-transition characteristics and other aggressive properties that cause metastasis. However, there are no reliable markers for identifying cancer stem cells, and comparative methods examining adherent and sphere cells are widely used to investigate the mechanisms underlying cancer stem cells, because sphere cells are known to maintain cancer stem cell characteristics. In this study, we conducted a meta-analysis that combined gene expression profiles from several studies that utilized tumorsphere technology to investigate tumor stem-like breast cancer cells. We used our own gene expression profiles along with three different gene expression profiles from the Gene Expression Omnibus, which we combined using the ComBat method, and obtained significant gene sets using gene set analysis of our datasets and the combined dataset. This experiment focused on four gene sets, such as cytokine-cytokine receptor interaction, that demonstrated significance in both datasets. Our observations demonstrated that among the genes of the four significant gene sets, six genes were consistently up-regulated and met the p-value threshold of < 0.05, and our network analysis showed high connectivity in five genes. From these results, we established CXCR4, CXCL1 and HMGCS1, the intersecting genes of the datasets with high connectivity and a p-value of < 0.05, as significant genes in the identification of cancer stem cells. An additional experiment using quantitative reverse transcription-polymerase chain reaction showed significant up-regulation in MCF-7 derived sphere cells and confirmed the importance of these three genes. Taken together, using a meta-analysis that combines gene set and network analysis, we suggest CXCR4, CXCL1 and HMGCS1 as candidates involved in tumor stem-like breast cancer cells. 
Unlike other meta-analyses, by using gene set analysis we selected possible markers that can explain the biological mechanisms, and we suggested network analysis as an additional criterion for selecting candidates. PMID:26870956

  17. Determining Semantically Related Significant Genes.

    PubMed

    Taha, Kamal

    2014-01-01

    GO relation embodies some aspects of existence dependency. If GO term x is existence-dependent on GO term y, the presence of y implies the presence of x. Therefore, the genes annotated with the function of the GO term y are usually functionally and semantically related to the genes annotated with the function of the GO term x. A large number of gene set enrichment analysis methods have been developed in recent years for analyzing gene sets enrichment. However, most of these methods overlook the structural dependencies between GO terms in GO graph by not considering the concept of existence dependency. We propose in this paper a biological search engine called RSGSearch that identifies enriched sets of genes annotated with different functions using the concept of existence dependency. We observe that GO term x cannot be existence-dependent on GO term y, if x and y have the same specificity (biological characteristics). After encoding into a numeric format the contributions of GO terms annotating target genes to the semantics of their lowest common ancestors (LCAs), RSGSearch uses microarray experiment to identify the most significant LCA that annotates the result genes. We evaluated RSGSearch experimentally and compared it with five gene set enrichment systems. Results showed marked improvement.

  18. Genome-Wide Temporal Expression Profiling in Caenorhabditis elegans Identifies a Core Gene Set Related to Long-Term Memory.

    PubMed

    Freytag, Virginie; Probst, Sabine; Hadziselimovic, Nils; Boglari, Csaba; Hauser, Yannick; Peter, Fabian; Gabor Fenyves, Bank; Milnik, Annette; Demougin, Philippe; Vukojevic, Vanja; de Quervain, Dominique J-F; Papassotiropoulos, Andreas; Stetak, Attila

    2017-07-12

    The identification of genes related to encoding, storage, and retrieval of memories is a major interest in neuroscience. In the current study, we analyzed the temporal gene expression changes in a neuronal mRNA pool during an olfactory long-term associative memory (LTAM) in Caenorhabditis elegans hermaphrodites. Here, we identified a core set of 712 (538 upregulated and 174 downregulated) genes that follows three distinct temporal peaks demonstrating multiple gene regulation waves in LTAM. Compared with the previously published positive LTAM gene set (Lakhina et al., 2015), 50% of the identified upregulated genes here overlap with the previous dataset, possibly representing stimulus-independent memory-related genes. On the other hand, the remaining genes were not previously identified in positive associative memory and may specifically regulate aversive LTAM. Our results suggest a multistep gene activation process during the formation and retrieval of long-term memory and define general memory-implicated genes as well as conditioning-type-dependent gene sets. SIGNIFICANCE STATEMENT The identification of genes regulating different steps of memory is of major interest in neuroscience. Identification of common memory genes across different learning paradigms and the temporal activation of the genes are poorly studied. Here, we investigated the temporal aspects of Caenorhabditis elegans gene expression changes using aversive olfactory associative long-term memory (LTAM) and identified three major gene activation waves. Like in previous studies, aversive LTAM is also CREB dependent, and CREB activity is necessary immediately after training. Finally, we define a list of memory paradigm-independent core gene sets as well as conditioning-dependent genes. Copyright © 2017 the authors.

  19. COGNATE: comparative gene annotation characterizer.

    PubMed

    Wilbrandt, Jeanne; Misof, Bernhard; Niehuis, Oliver

    2017-07-17

    The comparison of gene and genome structures across species has the potential to reveal major trends of genome evolution. However, such a comparative approach is currently hampered by a lack of standardization (e.g., Elliott TA, Gregory TR, Philos Trans Royal Soc B: Biol Sci 370:20140331, 2015). For example, testing the hypothesis that the total amount of coding sequences is a reliable measure of potential proteome diversity (Wang M, Kurland CG, Caetano-Anollés G, PNAS 108:11954, 2011) requires the application of standardized definitions of coding sequence and genes to create both comparable and comprehensive data sets and corresponding summary statistics. However, such standard definitions either do not exist or are not consistently applied. These circumstances call for a standard at the descriptive level using a minimum of parameters as well as an undeviating use of standardized terms, and for software that infers the required data under these strict definitions. The acquisition of a comprehensive, descriptive, and standardized set of parameters and summary statistics for genome publications and further analyses can thus greatly benefit from the availability of an easy to use standard tool. We developed a new open-source command-line tool, COGNATE (Comparative Gene Annotation Characterizer), which uses a given genome assembly and its annotation of protein-coding genes for a detailed description of the respective gene and genome structure parameters. Additionally, we revised the standard definitions of gene and genome structures and provide the definitions used by COGNATE as a working draft suggestion for further reference. Complete parameter lists and summary statistics are inferred using this set of definitions to allow down-stream analyses and to provide an overview of the genome and gene repertoire characteristics. 
COGNATE is written in Perl and freely available at the ZFMK homepage ( https://www.zfmk.de/en/COGNATE ) and on github ( https://github.com/ZFMK/COGNATE ). The tool COGNATE allows comparing genome assemblies and structural elements on multiple levels (e.g., scaffold or contig sequence, gene). It clearly enhances comparability between analyses. Thus, COGNATE can provide the important standardization of both genome and gene structure parameter disclosure as well as data acquisition for future comparative analyses. With the establishment of comprehensive descriptive standards and the extensive availability of genomes, an encompassing database will become possible.

  20. Analysis of genetic association in Listeria and Diabetes using Hierarchical Clustering and Silhouette Index

    NASA Astrophysics Data System (ADS)

    Pagnuco, Inti A.; Pastore, Juan I.; Abras, Guillermo; Brun, Marcel; Ballarin, Virginia L.

    2016-04-01

    It is usually assumed that co-expressed genes suggest co-regulation in the underlying regulatory network. Determining sets of co-expressed genes is an important task, where significant groups of genes are defined based on some criteria. This task is usually performed by clustering algorithms, where the whole family of genes, or a subset of them, is clustered into meaningful groups based on their expression values in a set of experiments. In this work we used a methodology based on the Silhouette index as a measure of cluster quality for individual gene groups, and a combination of several variants of hierarchical clustering to generate the candidate groups, to obtain sets of co-expressed genes for two real data examples. We analyzed the quality of the best ranked groups, obtained by the algorithm, using an online bioinformatics tool that provides network information for the selected genes. Moreover, to verify the performance of the algorithm, considering the fact that it doesn’t find all possible subsets, we compared its results against a full search, to determine the number of good co-regulated sets not detected.
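
    The Silhouette index mentioned in this abstract can be illustrated with a minimal Python sketch. The toy expression profiles, the Euclidean distance, and the function names below are assumptions made for illustration only; they are not taken from the authors' implementation, and the sketch assumes at least two groups are present.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def silhouette_values(points, labels):
    """Silhouette value s(i) = (b - a) / max(a, b) for every point.

    a: mean distance to the other members of the point's own group.
    b: smallest mean distance to the members of any other group.
    Points in singleton groups are given s(i) = 0.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    values = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            values.append(0.0)
            continue
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values

def group_silhouette(points, labels, group):
    """Mean silhouette value over the members of one group."""
    vals = [s for s, lab in zip(silhouette_values(points, labels), labels)
            if lab == group]
    return sum(vals) / len(vals)

# Hypothetical expression profiles (rows = genes, columns = experiments).
profiles = [
    [1.0, 1.1, 0.9],
    [1.2, 1.0, 1.0],
    [5.0, 5.2, 4.9],
    [5.1, 4.8, 5.0],
    [3.0, 1.0, 5.0],  # poorly fitting member of group "B"
]
labels = ["A", "A", "B", "B", "B"]

score_a = group_silhouette(profiles, labels, "A")
score_b = group_silhouette(profiles, labels, "B")
print(f"group A: {score_a:.3f}, group B: {score_b:.3f}")
```

    In this toy example the tight group "A" scores higher than group "B", which contains one poorly fitting gene; ranking candidate groups by such a score is the general idea the abstract describes.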

  1. A closer look at cross-validation for assessing the accuracy of gene regulatory networks and models.

    PubMed

    Tabe-Bordbar, Shayan; Emad, Amin; Zhao, Sihai Dave; Sinha, Saurabh

    2018-04-26

    Cross-validation (CV) is a technique to assess the generalizability of a model to unseen data. This technique relies on assumptions that may not be satisfied when studying genomics datasets. For example, random CV (RCV) assumes that a randomly selected set of samples, the test set, well represents unseen data. This assumption doesn't hold true where samples are obtained from different experimental conditions, and the goal is to learn regulatory relationships among the genes that generalize beyond the observed conditions. In this study, we investigated how the CV procedure affects the assessment of supervised learning methods used to learn gene regulatory networks (or in other applications). We compared the performance of a regression-based method for gene expression prediction estimated using RCV with that estimated using a clustering-based CV (CCV) procedure. Our analysis illustrates that RCV can produce over-optimistic estimates of the model's generalizability compared to CCV. Next, we defined the 'distinctness' of test set from training set and showed that this measure is predictive of performance of the regression method. Finally, we introduced a simulated annealing method to construct partitions with gradually increasing distinctness and showed that performance of different gene expression prediction methods can be better evaluated using this method.
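
    The contrast between random CV (RCV) and clustering-based CV (CCV) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the cluster labels are taken as given (computed beforehand by any clustering method), and the function names and the greedy fold assignment are hypothetical choices, not the procedure used in the paper.

```python
import random

def random_cv_folds(n_samples, k, seed=0):
    """RCV: shuffle sample indices and deal them into k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [sorted(idx[f::k]) for f in range(k)]

def clustered_cv_folds(cluster_labels, k):
    """CCV: assign whole clusters to folds, so that no cluster is split
    between the training set and the test set."""
    clusters = {}
    for i, lab in enumerate(cluster_labels):
        clusters.setdefault(lab, []).append(i)
    folds = [[] for _ in range(k)]
    # Greedy balancing: place the largest clusters first, each into the
    # fold that is currently smallest.
    for members in sorted(clusters.values(), key=len, reverse=True):
        min(folds, key=len).extend(members)
    return [sorted(f) for f in folds]

def split_clusters(folds, cluster_labels):
    """Count clusters whose samples appear in more than one fold."""
    seen = {}
    for f, fold in enumerate(folds):
        for i in fold:
            seen.setdefault(cluster_labels[i], set()).add(f)
    return sum(1 for fold_ids in seen.values() if len(fold_ids) > 1)

# Hypothetical cluster labels for 12 samples (e.g. experimental conditions).
cluster_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]

rcv = random_cv_folds(len(cluster_labels), k=3)
ccv = clustered_cv_folds(cluster_labels, k=3)
print("clusters split across folds, RCV:", split_clusters(rcv, cluster_labels))
print("clusters split across folds, CCV:", split_clusters(ccv, cluster_labels))
```

    Because CCV keeps similar samples on the same side of the split, the test fold is more distinct from the training folds, which is the property the abstract argues gives a less optimistic estimate of generalizability.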

  2. Comparative genomic analysis of SET domain family reveals the origin, expansion, and putative function of the arthropod-specific SmydA genes as histone modifiers in insects.

    PubMed

    Jiang, Feng; Liu, Qing; Wang, Yanli; Zhang, Jie; Wang, Huimin; Song, Tianqi; Yang, Meiling; Wang, Xianhui; Kang, Le

    2017-06-01

    The SET domain is an evolutionarily conserved motif present in histone lysine methyltransferases, which are important in the regulation of chromatin and gene expression in animals. In this study, we searched for SET domain-containing genes (SET genes) in all of the 147 arthropod genomes sequenced at the time of carrying out this experiment to understand the evolutionary history by which SET domains have evolved in insects. Phylogenetic and ancestral state reconstruction analysis revealed an arthropod-specific SET gene family, named SmydA, that is ancestral to arthropod animals and specifically diversified during insect evolution. Considering that pseudogenization is the most probable fate of the new emerging gene copies, we provided experimental and evolutionary evidence to demonstrate their essential functions. Fluorescence in situ hybridization analysis and in vitro methyltransferase activity assays showed that the SmydA-2 gene was transcriptionally active and retained the original histone methylation activity. Expression knockdown by RNA interference significantly increased mortality, implying that the SmydA genes may be essential for insect survival. We further showed predominantly strong purifying selection on the SmydA gene family and a potential association between the regulation of gene expression and insect phenotypic plasticity by transcriptome analysis. Overall, these data suggest that the SmydA gene family retains essential functions that may possibly define novel regulatory pathways in insects. This work provides insights into the roles of lineage-specific domain duplication in insect evolution. © The Authors 2017. Published by Oxford University Press.

  3. Comparative genomic analysis of SET domain family reveals the origin, expansion, and putative function of the arthropod-specific SmydA genes as histone modifiers in insects

    PubMed Central

    Jiang, Feng; Liu, Qing; Wang, Yanli; Zhang, Jie; Wang, Huimin; Song, Tianqi; Yang, Meiling

    2017-01-01

    The SET domain is an evolutionarily conserved motif present in histone lysine methyltransferases, which are important in the regulation of chromatin and gene expression in animals. In this study, we searched for SET domain–containing genes (SET genes) in all of the 147 arthropod genomes sequenced at the time of carrying out this experiment to understand the evolutionary history by which SET domains have evolved in insects. Phylogenetic and ancestral state reconstruction analysis revealed an arthropod-specific SET gene family, named SmydA, that is ancestral to arthropod animals and specifically diversified during insect evolution. Considering that pseudogenization is the most probable fate of the new emerging gene copies, we provided experimental and evolutionary evidence to demonstrate their essential functions. Fluorescence in situ hybridization analysis and in vitro methyltransferase activity assays showed that the SmydA-2 gene was transcriptionally active and retained the original histone methylation activity. Expression knockdown by RNA interference significantly increased mortality, implying that the SmydA genes may be essential for insect survival. We further showed predominantly strong purifying selection on the SmydA gene family and a potential association between the regulation of gene expression and insect phenotypic plasticity by transcriptome analysis. Overall, these data suggest that the SmydA gene family retains essential functions that may possibly define novel regulatory pathways in insects. This work provides insights into the roles of lineage-specific domain duplication in insect evolution. PMID:28444351

  4. GSNFS: Gene subnetwork biomarker identification of lung cancer expression data.

    PubMed

    Doungpan, Narumol; Engchuan, Worrawat; Chan, Jonathan H; Meechai, Asawin

    2016-12-05

    Gene expression has been used to identify disease gene biomarkers, but there are ongoing challenges. Single gene or gene-set biomarkers are inadequate to provide sufficient understanding of complex disease mechanisms and the relationship among those genes. Network-based methods have thus been considered for inferring the interaction within a group of genes to further study the disease mechanism. Recently, the Gene-Network-based Feature Set (GNFS), which is capable of handling case-control and multiclass expression for gene biomarker identification, has been proposed, partly taking network topology into account. However, its performance relies on a greedy search for building subnetworks and thus requires further improvement. In this work, we establish a new approach named Gene Sub-Network-based Feature Selection (GSNFS) by implementing the GNFS framework with two proposed searching and scoring algorithms, namely gene-set-based (GS) search and parent-node-based (PN) search, to identify subnetworks. An additional dataset is used to validate the results. The two proposed searching algorithms of the GSNFS method for subnetwork expansion are concerned with the degree of connectivity and the scoring scheme for building subnetworks and their topology. For each iteration of expansion, the neighbour genes of the current subnetwork whose expression data improve the overall subnetwork score are recruited. While the GS search calculates the subnetwork score using an activity score of the current subnetwork and the gene expression values of its neighbours, the PN search uses the expression value of the corresponding parent of each neighbour gene. Four lung cancer expression datasets were used for subnetwork identification. In addition, the use of pathway data and protein-protein interactions as network data to consider the interactions among significant genes is discussed. 
Classification was performed to compare the performance of the identified gene subnetworks with three subnetwork identification algorithms. The two searching algorithms resulted in better classification and gene/gene-set agreement compared to the original greedy search of the GNFS method. The identified lung cancer subnetwork using the proposed searching algorithm resulted in an improvement of the cross-dataset validation and an increase in the consistency of findings between two independent datasets. The homogeneity measurement of the datasets was conducted to assess dataset compatibility in cross-dataset validation. The lung cancer dataset with higher homogeneity showed a better result when using the GS search while the dataset with low homogeneity showed a better result when using the PN search. The 10-fold cross-dataset validation on the independent lung cancer datasets showed higher classification performance of the proposed algorithms when compared with the greedy search in the original GNFS method. The proposed searching algorithms provide a higher number of genes in the subnetwork expansion step than the greedy algorithm. As a result, the performance of the subnetworks identified from the GSNFS method was improved in terms of classification performance and gene/gene-set level agreement depending on the homogeneity of the datasets used in the analysis. Some common genes obtained from the four datasets using different searching algorithms are genes known to play a role in lung cancer. The improvement of classification performance and the gene/gene-set level agreement, and the biological relevance indicated the effectiveness of the GSNFS method for gene subnetwork identification using expression data.

  5. Concordant integrative gene set enrichment analysis of multiple large-scale two-sample expression data sets.

    PubMed

    Lai, Yinglei; Zhang, Fanni; Nayak, Tapan K; Modarres, Reza; Lee, Norman H; McCaffrey, Timothy A

    2014-01-01

    Gene set enrichment analysis (GSEA) is an important approach to the analysis of coordinate expression changes at a pathway level. Although many statistical and computational methods have been proposed for GSEA, the issue of a concordant integrative GSEA of multiple expression data sets has not been well addressed. Among different related data sets collected for the same or similar study purposes, it is important to identify pathways or gene sets with concordant enrichment. We categorize the underlying true states of differential expression into three representative categories: no change, positive change and negative change. Due to data noise, what we observe from experiments may not indicate the underlying truth. Although these categories are not observed in practice, they can be considered in a mixture model framework. Then, we define the mathematical concept of concordant gene set enrichment and calculate its related probability based on a three-component multivariate normal mixture model. The related false discovery rate can be calculated and used to rank different gene sets. We used three published lung cancer microarray gene expression data sets to illustrate our proposed method. One analysis based on the first two data sets was conducted to compare our result with a previous published result based on a GSEA conducted separately for each individual data set. This comparison illustrates the advantage of our proposed concordant integrative gene set enrichment analysis. Then, with a relatively new and larger pathway collection, we used our method to conduct an integrative analysis of the first two data sets and also all three data sets. Both results showed that many gene sets could be identified with low false discovery rates. A consistency between both results was also observed. A further exploration based on the KEGG cancer pathway collection showed that a majority of these pathways could be identified by our proposed method. 
This study illustrates that we can improve detection power and discovery consistency through a concordant integrative analysis of multiple large-scale two-sample gene expression data sets.

  6. Distributed Function Mining for Gene Expression Programming Based on Fast Reduction.

    PubMed

    Deng, Song; Yue, Dong; Yang, Le-chan; Fu, Xiong; Feng, Ya-zhou

    2016-01-01

    For high-dimensional and massive data sets, traditional centralized gene expression programming (GEP) or improved algorithms lead to increased run-time and decreased prediction accuracy. To solve this problem, this paper proposes a new improved algorithm called distributed function mining for gene expression programming based on fast reduction (DFMGEP-FR). In DFMGEP-FR, fast attribution reduction in binary search algorithms (FAR-BSA) is proposed to quickly find the optimal attribution set, and the function consistency replacement algorithm is given to solve integration of the local function model. Thorough comparative experiments for DFMGEP-FR, centralized GEP and the parallel gene expression programming algorithm based on simulated annealing (parallel GEPSA) are included in this paper. For the waveform, mushroom, connect-4 and musk datasets, the comparative results show that the average time-consumption of DFMGEP-FR drops by 89.09%, 88.85%, 85.79% and 93.06%, respectively, in contrast to centralized GEP and by 12.5%, 8.42%, 9.62% and 13.75%, respectively, compared with parallel GEPSA. Six well-studied UCI test data sets demonstrate the efficiency and capability of our proposed DFMGEP-FR algorithm for distributed function mining.

  7. Candidate genes for obesity-susceptibility show enriched association within a large genome-wide association study for BMI.

    PubMed

    Vimaleswaran, Karani S; Tachmazidou, Ioanna; Zhao, Jing Hua; Hirschhorn, Joel N; Dudbridge, Frank; Loos, Ruth J F

    2012-10-15

    Before the advent of genome-wide association studies (GWASs), hundreds of candidate genes for obesity-susceptibility had been identified through a variety of approaches. We examined whether those obesity candidate genes are enriched for associations with body mass index (BMI) compared with non-candidate genes by using data from a large-scale GWAS. A thorough literature search identified 547 candidate genes for obesity-susceptibility based on evidence from animal studies, Mendelian syndromes, linkage studies, genetic association studies and expression studies. Genomic regions were defined to include the genes ±10 kb of flanking sequence around candidate and non-candidate genes. We used summary statistics publicly available from the discovery stage of the genome-wide meta-analysis for BMI performed by the genetic investigation of anthropometric traits consortium in 123 564 individuals. Hypergeometric, rank tail-strength and gene-set enrichment analysis tests were used to test for the enrichment of association in candidate compared with non-candidate genes. The hypergeometric test of enrichment was not significant at the 5% P-value quantile (P = 0.35), but was nominally significant at the 25% quantile (P = 0.015). The rank tail-strength and gene-set enrichment tests were nominally significant for the full set of genes and borderline significant for the subset without SNPs at P < 10^-7. Taken together, the observed evidence for enrichment suggests that the candidate gene approach retains some value. However, the degree of enrichment is small despite the extensive number of candidate genes and the large sample size. Studies that focus on candidate genes have only slightly increased chances of detecting associations, and are likely to miss many true effects in non-candidate genes, at least for obesity-related traits.
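
    A hypergeometric enrichment test of the kind named in this abstract can be sketched in a few lines of Python. The function name and the toy counts are assumptions chosen for illustration; they are not the study's data or code.

```python
from math import comb

def hypergeom_enrichment_p(n_total, n_candidates, n_top, n_candidates_in_top):
    """One-sided hypergeometric P-value, P(X >= n_candidates_in_top).

    n_total:              all genes tested
    n_candidates:         candidate genes among them
    n_top:                genes in the top quantile of association P-values
    n_candidates_in_top:  candidate genes observed in that top quantile
    """
    upper = min(n_candidates, n_top)
    tail = sum(
        comb(n_candidates, x) * comb(n_total - n_candidates, n_top - x)
        for x in range(n_candidates_in_top, upper + 1)
    )
    return tail / comb(n_total, n_top)

# Toy numbers (hypothetical): 20 genes, 5 candidates, top quantile of 4 genes,
# 3 of which are candidates.
p = hypergeom_enrichment_p(20, 5, 4, 3)
print(f"P = {p:.4f}")
```

    Changing `n_top` corresponds to choosing a different P-value quantile, which is why the abstract reports different results at the 5% and 25% quantiles.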

  8. Comparative mRNA analysis of behavioral and genetic mouse models of aggression.

    PubMed

    Malki, Karim; Tosto, Maria G; Pain, Oliver; Sluyter, Frans; Mineur, Yann S; Crusio, Wim E; de Boer, Sietse; Sandnabba, Kenneth N; Kesserwani, Jad; Robinson, Edward; Schalkwyk, Leonard C; Asherson, Philip

    2016-04-01

    Mouse models of aggression have traditionally compared strains, most notably BALB/cJ and C57BL/6. However, these strains were not designed to study aggression despite differences in aggression-related traits and distinct reactivity to stress. This study compared expression of genes differentially regulated in a stress (behavioral) mouse model of aggression with those from a recent genetic mouse model of aggression. The study used a discovery-replication design using two independent mRNA studies from mouse brain tissue. The discovery study identified strain (BALB/cJ and C57BL/6J) × stress (chronic mild stress or control) interactions. Probe sets differentially regulated in the discovery set were intersected with those uncovered in the replication study, which evaluated differences between high and low aggressive animals from three strains specifically bred to study aggression. Network analysis was conducted on overlapping genes uncovered across both studies. A significant overlap was found with the genetic mouse study sharing 1,916 probe sets with the stress model. Fifty-one probe sets were found to be strongly dysregulated across both studies mapping to 50 known genes. Network analysis revealed two plausible pathways including one centered on the UBC gene hub which encodes ubiquitin, a protein well-known for protein degradation, and another on P38 MAPK. Findings from this study support the stress model of aggression, which showed remarkable molecular overlap with a genetic model. The study uncovered a set of candidate genes including the Erg2 gene, which has previously been implicated in different psychopathologies. The gene networks uncovered point to a redox pathway as potentially implicated in aggression-related behaviors. © 2016 Wiley Periodicals, Inc.

  9. Gene expression profiling of rat spermatogonia and Sertoli cells reveals signaling pathways from stem cells to niche and testicular cancer cells to surrounding stroma

    PubMed Central

    2011-01-01

    Background: Stem cells and their niches are studied in many systems, but mammalian germ stem cells (GSC) and their niches are still poorly understood. In rat testis, spermatogonia and undifferentiated Sertoli cells proliferate before puberty, but at puberty most spermatogonia enter spermatogenesis, and Sertoli cells differentiate to support this program. Thus, pre-pubertal spermatogonia might possess GSC potential and pre-pubertal Sertoli cells niche functions. We hypothesized that the different stem cell pools at pre-puberty and maturity provide a model for the identification of stem cell and niche-specific genes. We compared the transcript profiles of spermatogonia and Sertoli cells from pre-pubertal and pubertal rats and examined how these related to genes expressed in testicular cancers, which might originate from inappropriate communication between GSCs and Sertoli cells. Results: The pre-pubertal spermatogonia-specific gene set comprised known stem cell and spermatogonial stem cell (SSC) markers. Similarly, the pre-pubertal Sertoli cell-specific gene set comprised known niche gene transcripts. A large fraction of these specifically enriched transcripts encoded trans-membrane, extra-cellular, and secreted proteins highlighting stem cell to niche communication. Comparing selective gene sets established in this study with published gene expression data of testicular cancers and their stroma, we identified sets of expressed genes shared between testicular tumors and pre-pubertal spermatogonia, and between tumor stroma and pre-pubertal Sertoli cells, with statistical significance. Conclusions: Our data suggest that SSC and their niche specifically express complementary factors for cell communication and that the same factors might be implicated in the communication between tumor cells and their micro-environment in testicular cancer. PMID:21232125

  10. Using the gene ontology for microarray data mining: a comparison of methods and application to age effects in human prefrontal cortex.

    PubMed

    Pavlidis, Paul; Qin, Jie; Arango, Victoria; Mann, John J; Sibille, Etienne

    2004-06-01

    One of the challenges in the analysis of gene expression data is placing the results in the context of other data available about genes and their relationships to each other. Here, we approach this problem in the study of gene expression changes associated with age in two areas of the human prefrontal cortex, comparing two computational methods. The first method, "overrepresentation analysis" (ORA), is based on statistically evaluating the fraction of genes in a particular gene ontology class found among the set of genes showing age-related changes in expression. The second method, "functional class scoring" (FCS), examines the statistical distribution of individual gene scores among all genes in the gene ontology class and does not involve an initial gene selection step. We find that FCS yields more consistent results than ORA, and the results of ORA depended strongly on the gene selection threshold. Our findings highlight the utility of functional class scoring for the analysis of complex expression data sets and emphasize the advantage of considering all available genomic information rather than sets of genes that pass a predetermined "threshold of significance."
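
    The contrast between the two approaches described in this abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say which statistics were used, so the sketch assumes a hypergeometric upper-tail test for ORA and a resampled mean-score test for FCS. The function names `ora_p_value` and `fcs_p_value` and the toy data are hypothetical.

```python
import random
from math import comb


def ora_p_value(n_universe, n_in_class, n_selected, n_overlap):
    """ORA sketch (assumed hypergeometric test): P(overlap >= n_overlap).

    n_selected is the number of genes that passed the selection threshold,
    so the result depends directly on where that threshold is placed.
    """
    total = comb(n_universe, n_selected)
    upper = min(n_in_class, n_selected)
    tail = sum(
        comb(n_in_class, k) * comb(n_universe - n_in_class, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


def fcs_p_value(scores, gene_class, n_resamples=2000, seed=0):
    """FCS sketch (assumed mean-score statistic): uses every gene score.

    No gene selection step is involved; the class mean is compared with
    the means of random gene sets of the same size.
    """
    rng = random.Random(seed)
    genes = list(scores)
    members = [g for g in genes if g in gene_class]
    observed = sum(scores[g] for g in members) / len(members)
    hits = 0
    for _ in range(n_resamples):
        sample = rng.sample(genes, len(members))
        if sum(scores[g] for g in sample) / len(members) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Toy data (hypothetical): 100 genes scored 0..99, class = the 5 top-scored genes.
toy_scores = {f"g{i}": i for i in range(100)}
toy_class = {f"g{i}" for i in range(95, 100)}
p_fcs = fcs_p_value(toy_scores, toy_class)
p_ora = ora_p_value(n_universe=100, n_in_class=5, n_selected=10, n_overlap=5)
```

    Changing `n_selected` in the ORA call changes the p-value even though the underlying scores are unchanged, which is the threshold dependence the abstract reports; the FCS call has no such parameter.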

  11. BubbleGUM: automatic extraction of phenotype molecular signatures and comprehensive visualization of multiple Gene Set Enrichment Analyses.

    PubMed

    Spinelli, Lionel; Carpentier, Sabrina; Montañana Sanchis, Frédéric; Dalod, Marc; Vu Manh, Thien-Phong

    2015-10-19

    Recent advances in the analysis of high-throughput expression data have led to the development of tools that scaled-up their focus from single-gene to gene set level. For example, the popular Gene Set Enrichment Analysis (GSEA) algorithm can detect moderate but coordinated expression changes of groups of presumably related genes between pairs of experimental conditions. This considerably improves extraction of information from high-throughput gene expression data. However, although many gene sets covering a large panel of biological fields are available in public databases, the ability to generate home-made gene sets relevant to one's biological question is crucial but remains a substantial challenge to most biologists lacking statistic or bioinformatic expertise. This is all the more the case when attempting to define a gene set specific of one condition compared to many other ones. Thus, there is a crucial need for an easy-to-use software for generation of relevant home-made gene sets from complex datasets, their use in GSEA, and the correction of the results when applied to multiple comparisons of many experimental conditions. We developed BubbleGUM (GSEA Unlimited Map), a tool that allows to automatically extract molecular signatures from transcriptomic data and perform exhaustive GSEA with multiple testing correction. One original feature of BubbleGUM notably resides in its capacity to integrate and compare numerous GSEA results into an easy-to-grasp graphical representation. We applied our method to generate transcriptomic fingerprints for murine cell types and to assess their enrichments in human cell types. This analysis allowed us to confirm homologies between mouse and human immunocytes. BubbleGUM is an open-source software that allows to automatically generate molecular signatures out of complex expression datasets and to assess directly their enrichment by GSEA on independent datasets. 
Enrichments are displayed in a graphical output that helps in interpreting the results. This innovative methodology has recently been used to answer important questions in functional genomics, such as the degree of similarity between microarray datasets from different laboratories or with different experimental models or clinical cohorts. BubbleGUM is executable through an intuitive interface so that both bioinformaticians and biologists can use it. It is available at http://www.ciml.univ-mrs.fr/applications/BubbleGUM/index.html .

  12. Performance Comparison of Two Gene Set Analysis Methods for Genome-wide Association Study Results: GSA-SNP vs i-GSEA4GWAS.

    PubMed

    Kwon, Ji-Sun; Kim, Jihye; Nam, Dougu; Kim, Sangsoo

    2012-06-01

    Gene set analysis (GSA) is useful in interpreting a genome-wide association study (GWAS) result in terms of biological mechanism. We compared the performance of two different GSA implementations that accept GWAS p-values of single nucleotide polymorphisms (SNPs) or gene-by-gene summaries thereof, GSA-SNP and i-GSEA4GWAS, under the same settings of inputs and parameters. GSA runs were made with two sets of p-values from a Korean type 2 diabetes mellitus GWAS study: 259,188 and 1,152,947 SNPs of the original and imputed genotype datasets, respectively. When Gene Ontology terms were used as gene sets, i-GSEA4GWAS produced 283 and 1,070 hits for the unimputed and imputed datasets, respectively. On the other hand, GSA-SNP reported 94 and 38 hits, respectively, for the same two datasets. Similar trends, though to a lesser degree, were observed with Kyoto Encyclopedia of Genes and Genomes (KEGG) gene sets as well. The huge number of hits by i-GSEA4GWAS for the imputed dataset was probably an artifact due to the scaling step in the algorithm. The decrease in hits by GSA-SNP for the imputed dataset may be due to the fact that it relies on Z-statistics, which are sensitive to variations in the background level of associations. Judicious evaluation of the GSA outcomes, perhaps based on multiple programs, is recommended.

  13. sigReannot: an oligo-set re-annotation pipeline based on similarities with the Ensembl transcripts and Unigene clusters.

    PubMed

    Casel, Pierrot; Moreews, François; Lagarrigue, Sandrine; Klopp, Christophe

    2009-07-16

    Microarray is a powerful technology that makes it possible to monitor tens of thousands of genes in a single experiment. Most microarrays now use oligo-sets. The design of the oligo-nucleotides is time consuming and error prone. Genome-wide microarray oligo-sets are designed using as large a set of transcripts as possible in order to monitor as many genes as possible. Depending on the genome sequencing state and on the assembly state, the knowledge of the existing transcripts can be very different. This knowledge evolves with the different genome builds and gene builds. Once the design is done, the microarrays are often used for several years. The biologists working in EADGENE expressed the need for up-to-date annotation files for the oligo-sets they share, including information about the orthologous genes of model species, the Gene Ontology, the corresponding pathways and the chromosomal location. The results of SigReannot on a chicken micro-array used in the EADGENE project compared to the initial annotations show that 23% of the oligo-nucleotide gene annotations were not confirmed, 2% were modified and 1% were added. The interest of this up-to-date annotation procedure is demonstrated through the analysis of real data previously published. SigReannot uses the oligo-nucleotide design procedure criteria to validate the probe-gene link and the Ensembl transcripts as reference for annotation. It therefore produces a high quality annotation based on reference gene sets.

  14. Reproducible detection of disease-associated markers from gene expression data.

    PubMed

    Omae, Katsuhiro; Komori, Osamu; Eguchi, Shinto

    2016-08-18

    Detection of disease-associated markers plays a crucial role in gene screening for biological studies. Two-sample test statistics, such as the t-statistic, are widely used to rank genes based on gene expression data. However, the resultant gene ranking is often not reproducible among different data sets. Such irreproducibility may be caused by disease heterogeneity. When we divided data into two subsets, we found that the signs of the two t-statistics were often reversed. Focusing on such instability, we proposed a sign-sum statistic that counts the signs of the t-statistics for all possible subsets. The proposed method excludes genes affected by heterogeneity, thereby improving the reproducibility of gene ranking. We compared the sign-sum statistic with the t-statistic by a theoretical evaluation of the upper confidence limit. Through simulations and applications to real data sets, we show that the sign-sum statistic exhibits superior performance. We derive the sign-sum statistic for obtaining a robust gene ranking. The sign-sum statistic gives a more reproducible ranking than the t-statistic. Using simulated data sets, we show that the sign-sum statistic excludes hetero-type genes well. For the real data sets as well, the sign-sum statistic performs well from the viewpoint of ranking reproducibility.
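
    The sign instability that motivates this work can be illustrated with a small sketch. It is not the authors' sign-sum statistic, whose exact definition is not given in the abstract; it only estimates how often the t-statistic changes sign when the samples are split into two random halves. The Welch form of the t-statistic, the function names and the toy data are all assumptions made for illustration.

```python
import random
from statistics import mean, variance


def welch_t(cases, controls):
    """Two-sample t-statistic with unequal variances (assumed Welch form)."""
    se2 = variance(cases) / len(cases) + variance(controls) / len(controls)
    return (mean(cases) - mean(controls)) / se2 ** 0.5


def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def sign_reversal_rate(cases, controls, n_splits=200, seed=0):
    """Fraction of random half-splits whose two t-statistics disagree in sign."""
    rng = random.Random(seed)
    reversed_count = 0
    for _ in range(n_splits):
        ca, co = cases[:], controls[:]
        rng.shuffle(ca)
        rng.shuffle(co)
        h_ca, h_co = len(ca) // 2, len(co) // 2
        t_first = welch_t(ca[:h_ca], co[:h_co])
        t_second = welch_t(ca[h_ca:], co[h_co:])
        if sign(t_first) * sign(t_second) < 0:
            reversed_count += 1
    return reversed_count / n_splits


# Toy gene 1 (hypothetical): all cases shifted upward, a homogeneous effect.
controls = [0.0, 1.0, 2.0, 3.0, 0.5, 1.5, 2.5, 3.5]
homogeneous_cases = [10.0, 11.0, 12.0, 13.0, 10.5, 11.5, 12.5, 13.5]

# Toy gene 2 (hypothetical): half the cases up, half down, a heterogeneous effect.
flat_controls = [0.1, -0.1, 0.2, -0.2, 0.3, -0.3, 0.4, -0.4]
heterogeneous_cases = [5.0, 5.5, 6.0, 6.5, -5.0, -5.5, -6.0, -6.5]

stable_rate = sign_reversal_rate(homogeneous_cases, controls)
unstable_rate = sign_reversal_rate(heterogeneous_cases, flat_controls)
```

    In this toy setting the homogeneous gene never reverses sign, while the heterogeneous gene reverses in a sizeable share of splits, which is the kind of instability the abstract attributes to disease heterogeneity.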

  15. Informatic selection of a neural crest-melanocyte cDNA set for microarray analysis

    PubMed Central

    Loftus, S. K.; Chen, Y.; Gooden, G.; Ryan, J. F.; Birznieks, G.; Hilliard, M.; Baxevanis, A. D.; Bittner, M.; Meltzer, P.; Trent, J.; Pavan, W.

    1999-01-01

    With cDNA microarrays, it is now possible to compare the expression of many genes simultaneously. To maximize the likelihood of finding genes whose expression is altered under the experimental conditions, it would be advantageous to be able to select clones for tissue-appropriate cDNA sets. We have taken advantage of the extensive sequence information in the dbEST expressed sequence tag (EST) database to identify a neural crest-derived melanocyte cDNA set for microarray analysis. Analysis of characterized genes with dbEST identified one library that contained ESTs representing 21 neural crest-expressed genes (library 198). The distribution of the ESTs corresponding to these genes was biased toward being derived from library 198. This is in contrast to the EST distribution profile for a set of control genes, characterized to be more ubiquitously expressed in multiple tissues (P < 1 × 10−9). From library 198, a subset of 852 clustered ESTs were selected that have a library distribution profile similar to that of the 21 neural crest-expressed genes. Microarray analysis demonstrated the majority of the neural crest-selected 852 ESTs (Mel1 array) were differentially expressed in melanoma cell lines compared with a non-neural crest kidney epithelial cell line (P < 1 × 10−8). This was not observed with an array of 1,238 ESTs that was selected without library origin bias (P = 0.204). This study presents an approach for selecting tissue-appropriate cDNAs that can be used to examine the expression profiles of developmental processes and diseases. PMID:10430933

  16. The GENCODE exome: sequencing the complete human exome

    PubMed Central

    Coffey, Alison J; Kokocinski, Felix; Calafato, Maria S; Scott, Carol E; Palta, Priit; Drury, Eleanor; Joyce, Christopher J; LeProust, Emily M; Harrow, Jen; Hunt, Sarah; Lehesjoki, Anna-Elina; Turner, Daniel J; Hubbard, Tim J; Palotie, Aarno

    2011-01-01

    Sequencing the coding regions, the exome, of the human genome is one of the major current strategies to identify low frequency and rare variants associated with human disease traits. So far, the most widely used commercial exome capture reagents have mainly targeted the consensus coding sequence (CCDS) database. We report the design of an extended set of targets for capturing the complete human exome, based on annotation from the GENCODE consortium. The extended set covers an additional 5594 genes and 10.3 Mb compared with the current CCDS-based sets. The additional regions include potential disease genes previously inaccessible to exome resequencing studies, such as 43 genes linked to ion channel activity and 70 genes linked to protein kinase activity. In total, the new GENCODE exome set developed here covers 47.9 Mb and performed well in sequence capture experiments. In the sample set used in this study, we identified over 5000 SNP variants more in the GENCODE exome target (24%) than in the CCDS-based exome sequencing. PMID:21364695

  17. Sequencing and comparing whole mitochondrial genomes of animals

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Boore, Jeffrey L.; Macey, J. Robert; Medina, Monica

    2005-04-22

    Comparing complete animal mitochondrial genome sequences is becoming increasingly common for phylogenetic reconstruction and as a model for genome evolution. Not only are they much more informative than shorter sequences of individual genes for inferring evolutionary relatedness, but these data also provide sets of genome-level characters, such as the relative arrangements of genes, that can be especially powerful. We describe here the protocols commonly used for physically isolating mtDNA, for amplifying these by PCR or RCA, for cloning, sequencing, assembly, validation, and gene annotation, and for comparing both sequences and gene arrangements. On several topics, we offer general observations based on our experiences to date with determining and comparing complete mtDNA sequences.

  18. STBase: one million species trees for comparative biology.

    PubMed

    McMahon, Michelle M; Deepak, Akshay; Fernández-Baca, David; Boss, Darren; Sanderson, Michael J

    2015-01-01

    Comprehensively sampled phylogenetic trees provide the most compelling foundations for strong inferences in comparative evolutionary biology. Mismatches are common, however, between the taxa for which comparative data are available and the taxa sampled by published phylogenetic analyses. Moreover, many published phylogenies are gene trees, which cannot always be adapted immediately for species level comparisons because of discordance, gene duplication, and other confounding biological processes. A new database, STBase, lets comparative biologists quickly retrieve species level phylogenetic hypotheses in response to a query list of species names. The database consists of 1 million single- and multi-locus data sets, each with a confidence set of 1000 putative species trees, computed from GenBank sequence data for 413,000 eukaryotic taxa. Two bodies of theoretical work are leveraged to aid in the assembly of multi-locus concatenated data sets for species tree construction. First, multiply labeled gene trees are pruned to conflict-free singly-labeled species-level trees that can be combined between loci. Second, impacts of missing data in multi-locus data sets are ameliorated by assembling only decisive data sets. Data sets overlapping with the user's query are ranked using a scheme that depends on user-provided weights for tree quality and for taxonomic overlap of the tree with the query. Retrieval times are independent of the size of the database, typically a few seconds. Tree quality is assessed by a real-time evaluation of bootstrap support on just the overlapping subtree. Associated sequence alignments, tree files and metadata can be downloaded for subsequent analysis. STBase provides a tool for comparative biologists interested in exploiting the most relevant sequence data available for the taxa of interest. 
It may also serve as a prototype for future species tree oriented databases and as a resource for assembly of larger species phylogenies from precomputed trees.

  19. YouGenMap: a web platform for dynamic multi-comparative mapping and visualization of genetic maps

    Treesearch

    Keith Batesole; Kokulapalan Wimalanathan; Lin Liu; Fan Zhang; Craig S. Echt; Chun Liang

    2014-01-01

    Comparative genetic maps are used in examination of genome organization, detection of conserved gene order, and exploration of marker order variations. YouGenMap is an open-source web tool that offers dynamic comparative mapping capability of users' own genetic mapping between 2 or more map sets. Users' genetic map data and optional gene annotations are...

  20. Mining pathway associations for disease-related pathway activity analysis based on gene expression and methylation data.

    PubMed

    Lee, Hyeonjeong; Shin, Miyoung

    2017-01-01

    The problem of discovering genetic markers as disease signatures is of great significance for the successful diagnosis, treatment, and prognosis of complex diseases. Even if many earlier studies worked on identifying disease markers from a variety of biological resources, they mostly focused on the markers of genes or gene-sets (i.e., pathways). However, these markers may not be enough to explain biological interactions between genetic variables that are related to diseases. Thus, in this study, our aim is to investigate distinctive associations among active pathways (i.e., pathway-sets) shown each in case and control samples which can be observed from gene expression and/or methylation data. The pathway-sets are obtained by identifying a set of associated pathways that are often active together over a significant number of class samples. For this purpose, gene expression or methylation profiles are first analyzed to identify significant (active) pathways via gene-set enrichment analysis. Then, regarding these active pathways, an association rule mining approach is applied to examine interesting pathway-sets in each class of samples (case or control). By doing so, the sets of associated pathways often working together in activity profiles are finally chosen as our distinctive signature of each class. The identified pathway-sets are aggregated into a pathway activity network (PAN), which facilitates the visualization of differential pathway associations between case and control samples. From our experiments with two publicly available datasets, we could find interesting PAN structures as the distinctive signatures of breast cancer and uterine leiomyoma cancer, respectively. Our pathway-set markers were shown to be superior or very comparable to other genetic markers (such as genes or gene-sets) in disease classification. 
Furthermore, the PAN structure, which can be constructed from the identified markers of pathway-sets, could provide deeper insights into distinctive associations between pathway activities in case and control samples.

  1. Gene expression changes in the course of normal brain aging are sexually dimorphic

    PubMed Central

    Berchtold, Nicole C.; Cribbs, David H.; Coleman, Paul D.; Rogers, Joseph; Head, Elizabeth; Kim, Ronald; Beach, Tom; Miller, Carol; Troncoso, Juan; Trojanowski, John Q.; Zielke, H. Ronald; Cotman, Carl W.

    2008-01-01

    Gene expression profiles were assessed in the hippocampus, entorhinal cortex, superior-frontal gyrus, and postcentral gyrus across the lifespan of 55 cognitively intact individuals aged 20–99 years. Perspectives on global gene changes that are associated with brain aging emerged, revealing two overarching concepts. First, different regions of the forebrain exhibited substantially different gene profile changes with age. For example, comparing equally powered groups, 5,029 probe sets were significantly altered with age in the superior-frontal gyrus, compared with 1,110 in the entorhinal cortex. Prominent change occurred in the sixth to seventh decades across cortical regions, suggesting that this period is a critical transition point in brain aging, particularly in males. Second, clear gender differences in brain aging were evident, suggesting that the brain undergoes sexually dimorphic changes in gene expression not only in development but also in later life. Globally across all brain regions, males showed more gene change than females. Further, Gene Ontology analysis revealed that different categories of genes were predominantly affected in males vs. females. Notably, the male brain was characterized by global decreased catabolic and anabolic capacity with aging, with down-regulated genes heavily enriched in energy production and protein synthesis/transport categories. Increased immune activation was a prominent feature of aging in both sexes, with proportionally greater activation in the female brain. These data open opportunities to explore age-dependent changes in gene expression that set the balance between neurodegeneration and compensatory mechanisms in the brain and suggest that this balance is set differently in males and females, an intriguing idea. PMID:18832152

  2. Comparative analysis of gene expression profiles of hip articular cartilage between non-traumatic necrosis and osteoarthritis.

    PubMed

    Wang, Wenyu; Liu, Yang; Hao, Jingcan; Zheng, Shuyu; Wen, Yan; Xiao, Xiao; He, Awen; Fan, Qianrui; Zhang, Feng; Liu, Ruiyu

    2016-10-10

    Hip cartilage destruction is consistently observed in non-traumatic osteonecrosis of the femoral head (NOFH) and accelerates its bone necrosis. The molecular mechanism underlying the cartilage damage of NOFH remains elusive. In this study, we conducted a systematic comparative study of gene expression profiles between NOFH and osteoarthritis (OA). Hip articular cartilage specimens were collected from 12 NOFH patients and 12 controls with traumatic femoral neck fracture for microarray (n=4) and quantitative real-time PCR validation experiments (n=8). Gene expression profiling of articular cartilage was performed using the Agilent Human 4×44K Microarray chip. The accuracy of the microarray experiment was further validated by qRT-PCR. Gene expression results of OA hip cartilage were derived from a previously published study. Significance Analysis of Microarrays (SAM) software was applied for identifying differentially expressed genes. Gene ontology (GO) and pathway enrichment analysis were conducted by Gene Set Enrichment Analysis software and the DAVID tool, respectively. In total, 27 differentially expressed genes were identified for NOFH. Comparing the gene expression profiles of NOFH cartilage and OA cartilage detected 8 common differentially expressed genes, including COL5A1, OGN, ANGPTL4, CRIP1, NFIL3, METRNL, ID2 and STEAP1. GO comparative analysis identified 10 common significant GO terms, mainly implicated in apoptosis and development process. Pathway comparative analysis observed that the ECM-receptor interaction pathway and focal adhesion pathway were enriched in the differentially expressed genes of both NOFH and hip OA. In conclusion, we identified a set of differentially expressed genes, GO terms and pathways for NOFH articular destruction, some of which were also involved in hip OA. Our study results may help to reveal the pathogenetic similarities and differences of cartilage damage of NOFH and hip OA.

  3. Ensembl comparative genomics resources.

    PubMed

    Herrero, Javier; Muffato, Matthieu; Beal, Kathryn; Fitzgerald, Stephen; Gordon, Leo; Pignatelli, Miguel; Vilella, Albert J; Searle, Stephen M J; Amode, Ridwan; Brent, Simon; Spooner, William; Kulesha, Eugene; Yates, Andrew; Flicek, Paul

    2016-01-01

    Evolution provides the unifying framework with which to understand biology. The coherent investigation of genic and genomic data often requires comparative genomics analyses based on whole-genome alignments, sets of homologous genes and other relevant datasets in order to evaluate and answer evolutionary-related questions. However, the complexity and computational requirements of producing such data are substantial: this has led to only a small number of reference resources that are used for most comparative analyses. The Ensembl comparative genomics resources are one such reference set that facilitates comprehensive and reproducible analysis of chordate genome data. Ensembl computes pairwise and multiple whole-genome alignments from which large-scale synteny, per-base conservation scores and constrained elements are obtained. Gene alignments are used to define Ensembl Protein Families, GeneTrees and homologies for both protein-coding and non-coding RNA genes. These resources are updated frequently and have a consistent informatics infrastructure and data presentation across all supported species. Specialized web-based visualizations are also available including synteny displays, collapsible gene tree plots, a gene family locator and different alignment views. The Ensembl comparative genomics infrastructure is extensively reused for the analysis of non-vertebrate species by other projects including Ensembl Genomes and Gramene and much of the information here is relevant to these projects. The consistency of the annotation across species and the focus on vertebrates makes Ensembl an ideal system to perform and support vertebrate comparative genomic analyses. We use robust software and pipelines to produce reference comparative data and make it freely available. Database URL: http://www.ensembl.org. © The Author(s) 2016. Published by Oxford University Press.

  4. Ensembl comparative genomics resources

    PubMed Central

    Muffato, Matthieu; Beal, Kathryn; Fitzgerald, Stephen; Gordon, Leo; Pignatelli, Miguel; Vilella, Albert J.; Searle, Stephen M. J.; Amode, Ridwan; Brent, Simon; Spooner, William; Kulesha, Eugene; Yates, Andrew; Flicek, Paul

    2016-01-01

    Evolution provides the unifying framework with which to understand biology. The coherent investigation of genic and genomic data often requires comparative genomics analyses based on whole-genome alignments, sets of homologous genes and other relevant datasets in order to evaluate and answer evolutionary-related questions. However, the complexity and computational requirements of producing such data are substantial: this has led to only a small number of reference resources that are used for most comparative analyses. The Ensembl comparative genomics resources are one such reference set that facilitates comprehensive and reproducible analysis of chordate genome data. Ensembl computes pairwise and multiple whole-genome alignments from which large-scale synteny, per-base conservation scores and constrained elements are obtained. Gene alignments are used to define Ensembl Protein Families, GeneTrees and homologies for both protein-coding and non-coding RNA genes. These resources are updated frequently and have a consistent informatics infrastructure and data presentation across all supported species. Specialized web-based visualizations are also available including synteny displays, collapsible gene tree plots, a gene family locator and different alignment views. The Ensembl comparative genomics infrastructure is extensively reused for the analysis of non-vertebrate species by other projects including Ensembl Genomes and Gramene and much of the information here is relevant to these projects. The consistency of the annotation across species and the focus on vertebrates makes Ensembl an ideal system to perform and support vertebrate comparative genomic analyses. We use robust software and pipelines to produce reference comparative data and make it freely available. Database URL: http://www.ensembl.org. PMID:26896847

  5. The Functional Genetics of Handedness and Language Lateralization: Insights from Gene Ontology, Pathway and Disease Association Analyses.

    PubMed

    Schmitz, Judith; Lor, Stephanie; Klose, Rena; Güntürkün, Onur; Ocklenburg, Sebastian

    2017-01-01

    Handedness and language lateralization are partially determined by genetic influences. It has been estimated that at least 40 (and potentially more) possibly interacting genes may influence the ontogenesis of hemispheric asymmetries. Recently, it has been suggested that analyzing the genetics of hemispheric asymmetries on the level of gene ontology sets, rather than at the level of individual genes, might be more informative for understanding the underlying functional cascades. Here, we performed gene ontology, pathway and disease association analyses on genes that have previously been associated with handedness and language lateralization. Significant gene ontology sets for handedness were anatomical structure development, pattern specification (especially asymmetry formation) and biological regulation. Pathway analysis highlighted the importance of the TGF-beta signaling pathway for handedness ontogenesis. Significant gene ontology sets for language lateralization were responses to different stimuli, nervous system development, transport, signaling, and biological regulation. Despite the fact that some authors assume that handedness and language lateralization share a common ontogenetic basis, gene ontology sets barely overlap between phenotypes. Compared to genes involved in handedness, which mostly contribute to structural development, genes involved in language lateralization rather contribute to activity-dependent cognitive processes. Disease association analysis revealed associations of genes involved in handedness with diseases affecting the whole body, while genes involved in language lateralization were specifically engaged in mental and neurological diseases. These findings further support the idea that handedness and language lateralization are ontogenetically independent, complex phenotypes.

  6. The Functional Genetics of Handedness and Language Lateralization: Insights from Gene Ontology, Pathway and Disease Association Analyses

    PubMed Central

    Schmitz, Judith; Lor, Stephanie; Klose, Rena; Güntürkün, Onur; Ocklenburg, Sebastian

    2017-01-01

    Handedness and language lateralization are partially determined by genetic influences. It has been estimated that at least 40 (and potentially more) possibly interacting genes may influence the ontogenesis of hemispheric asymmetries. Recently, it has been suggested that analyzing the genetics of hemispheric asymmetries on the level of gene ontology sets, rather than at the level of individual genes, might be more informative for understanding the underlying functional cascades. Here, we performed gene ontology, pathway and disease association analyses on genes that have previously been associated with handedness and language lateralization. Significant gene ontology sets for handedness were anatomical structure development, pattern specification (especially asymmetry formation) and biological regulation. Pathway analysis highlighted the importance of the TGF-beta signaling pathway for handedness ontogenesis. Significant gene ontology sets for language lateralization were responses to different stimuli, nervous system development, transport, signaling, and biological regulation. Despite the fact that some authors assume that handedness and language lateralization share a common ontogenetic basis, gene ontology sets barely overlap between phenotypes. Compared to genes involved in handedness, which mostly contribute to structural development, genes involved in language lateralization rather contribute to activity-dependent cognitive processes. Disease association analysis revealed associations of genes involved in handedness with diseases affecting the whole body, while genes involved in language lateralization were specifically engaged in mental and neurological diseases. These findings further support the idea that handedness and language lateralization are ontogenetically independent, complex phenotypes. PMID:28729848

  7. Statistical inference for time course RNA-Seq data using a negative binomial mixed-effect model.

    PubMed

    Sun, Xiaoxiao; Dalpiaz, David; Wu, Di; S Liu, Jun; Zhong, Wenxuan; Ma, Ping

    2016-08-26

    Accurate identification of differentially expressed (DE) genes in time course RNA-Seq data is crucial for understanding the dynamics of transcriptional regulatory network. However, most of the available methods treat gene expressions at different time points as replicates and test the significance of the mean expression difference between treatments or conditions irrespective of time. They thus fail to identify many DE genes with different profiles across time. In this article, we propose a negative binomial mixed-effect model (NBMM) to identify DE genes in time course RNA-Seq data. In the NBMM, mean gene expression is characterized by a fixed effect, and time dependency is described by random effects. The NBMM is very flexible and can be fitted to both unreplicated and replicated time course RNA-Seq data via a penalized likelihood method. By comparing gene expression profiles over time, we further classify the DE genes into two subtypes to enhance the understanding of expression dynamics. A significance test for detecting DE genes is derived using a Kullback-Leibler distance ratio. Additionally, a significance test for gene sets is developed using a gene set score. Simulation analysis shows that the NBMM outperforms currently available methods for detecting DE genes and gene sets. Moreover, our real data analysis of fruit fly developmental time course RNA-Seq data demonstrates the NBMM identifies biologically relevant genes which are well justified by gene ontology analysis. The proposed method is powerful and efficient to detect biologically relevant DE genes and gene sets in time course RNA-Seq data.

  8. Gene Network Rewiring to Study Melanoma Stage Progression and Elements Essential for Driving Melanoma

    PubMed Central

    Kaushik, Abhinav; Bhatia, Yashuma; Ali, Shakir; Gupta, Dinesh

    2015-01-01

    Metastatic melanoma patients have a poor prognosis, mainly attributable to the underlying heterogeneity in melanoma driver genes and altered gene expression profiles. These characteristics of melanoma also make the development of drugs and identification of novel drug targets for metastatic melanoma a daunting task. Systems biology offers an alternative approach to re-explore the genes or gene sets that display dysregulated behaviour without being differentially expressed. In this study, we have performed systems biology studies to enhance our knowledge about the conserved property of disease genes or gene sets among mutually exclusive datasets representing melanoma progression. We meta-analysed 642 microarray samples to generate melanoma reconstructed networks representing four different stages of melanoma progression to extract genes with altered molecular circuitry wiring as compared to a normal cellular state. Intriguingly, a majority of the melanoma network-rewired genes are not differentially expressed and the disease genes involved in melanoma progression consistently modulate its activity by rewiring network connections. We found that the shortlisted disease genes in the study show strong and abnormal network connectivity, which increases with disease progression. Moreover, the deviated network properties of the disease gene sets allow ranking/prioritization of different enriched, dysregulated and conserved pathway terms in metastatic melanoma, in agreement with previous findings. Our analysis also reveals the presence of distinct network hubs in different stages of metastasizing tumor for the same set of pathways in the statistically conserved gene sets. The study results are also presented as a freely available database at http://bioinfo.icgeb.res.in/m3db/. The web-based database resource consists of results from the analysis presented here, integrated with Cytoscape Web and user-friendly tools for visualization, retrieval and further analysis. PMID:26558755

  9. In silico pathway analysis in cervical carcinoma reveals potential new targets for treatment

    PubMed Central

    van Dam, Peter A.; van Dam, Pieter-Jan H. H.; Rolfo, Christian; Giallombardo, Marco; van Berckelaer, Christophe; Trinh, Xuan Bich; Altintas, Sevilay; Huizing, Manon; Papadimitriou, Kostas; Tjalma, Wiebren A. A.; van Laere, Steven

    2016-01-01

    An in silico pathway analysis was performed in order to improve current knowledge on the molecular drivers of cervical cancer and detect potential targets for treatment. Three publicly available Affymetrix gene expression data-sets (GSE5787, GSE7803, GSE9750) were retrieved, accounting for a total of 9 cervical cancer cell lines (CCCLs), 39 normal cervical samples, 7 CIN3 samples and 111 cervical cancer samples (CCSs). Prediction analysis of microarrays was performed in the Affymetrix sets to identify cervical cancer biomarkers. To select cancer cell-specific genes the CCSs were compared to the CCCLs. Validated genes were submitted to a gene set enrichment analysis (GSEA) and Expression2Kinases (E2K). In the CCSs a total of 1,547 probe sets were identified that were overexpressed (FDR < 0.1). Compared with the CCCLs, 560 probe sets (481 unique genes) had a cancer cell-specific expression profile, and 315 of these genes (65%) were validated. GSEA identified 5 cancer hallmarks enriched in CCSs (P < 0.01 and FDR < 0.25) showing that deregulation of the cell cycle is a major component of cervical cancer biology. E2K identified a protein-protein interaction (PPI) network of 162 nodes (including 20 druggable kinases) and 1626 edges. This PPI-network consists of 5 signaling modules associated with MYC signaling (Module 1), cell cycle deregulation (Module 2), TGFβ-signaling (Module 3), MAPK signaling (Module 4) and chromatin modeling (Module 5). Potential targets for treatment which could be identified were CDK1, CDK2, ABL1, ATM, AKT1, MAPK1, MAPK3 among others. The present study identified important driver pathways in cervical carcinogenesis which should be assessed for their potential therapeutic druggability. PMID:26701206

  10. Discovery of error-tolerant biclusters from noisy gene expression data.

    PubMed

    Gupta, Rohit; Rao, Navneet; Kumar, Vipin

    2011-11-24

    An important analysis performed on microarray gene-expression data is to discover biclusters, which denote groups of genes that are coherently expressed for a subset of conditions. Various biclustering algorithms have been proposed to find different types of biclusters from these real-valued gene-expression data sets. However, these algorithms suffer from several limitations such as inability to explicitly handle errors/noise in the data; difficulty in discovering small biclusters due to their top-down approach; inability of some of the approaches to find overlapping biclusters, which is crucial as many genes participate in multiple biological processes. Association pattern mining algorithms also produce biclusters as their result and can naturally address some of these limitations. However, traditional association mining only finds exact biclusters, which limits its applicability in real-life data sets where the biclusters may be fragmented due to random noise/errors. Moreover, as they only work with binary or boolean attributes, their application on gene-expression data requires transforming real-valued attributes to binary attributes, which often results in loss of information. Many past approaches have tried to address the issue of noise and handling real-valued attributes independently but there is no systematic approach that addresses both of these issues together. In this paper, we first propose a novel error-tolerant biclustering model, 'ET-bicluster', and then propose a bottom-up heuristic-based mining algorithm to sequentially discover error-tolerant biclusters directly from real-valued gene-expression data. The efficacy of our proposed approach is illustrated by comparing it with a recent approach RAP in the context of two biological problems: discovery of functional modules and discovery of biomarkers. For the first problem, two real-valued S. cerevisiae microarray gene-expression data sets are used to demonstrate that the biclusters obtained from the ET-bicluster approach not only recover a larger set of genes as compared to those obtained from the RAP approach but also have higher functional coherence as evaluated using the GO-based functional enrichment analysis. The statistical significance of the discovered error-tolerant biclusters, as estimated using two randomization tests, reveals that they are indeed biologically meaningful and statistically significant. For the second problem of biomarker discovery, we used four real-valued breast cancer microarray gene-expression data sets and evaluated the biomarkers obtained using MSigDB gene sets. The results obtained for both problems, functional module discovery and biomarker discovery, clearly signify the usefulness of the proposed ET-bicluster approach and illustrate the importance of explicitly incorporating noise/errors in discovering coherent groups of genes from gene-expression data.

  11. The compositional transition of vertebrate genomes: an analysis of the secondary structure of the proteins encoded by human genes.

    PubMed

    D'Onofrio, Giuseppe; Ghosh, Tapash Chandra

    2005-01-17

    Fluctuations and increments of both C(3) and G(3) levels along the human coding sequences were investigated comparing two sets of Xenopus/human orthologous genes. The first set of genes shows minor differences of the GC(3) levels, the second shows considerable increments of the GC(3) levels in the human genes. In both data sets, the fluctuations of C(3) and G(3) levels along the coding sequences correlated with the secondary structures of the encoded proteins. The human genes that underwent the compositional transition showed a different increment of the C(3) and G(3) levels within and among the structural units of the proteins. The relative synonymous codon usage (RSCU) of several amino acids were also affected during the compositional transition, showing that there exists a correlation between RSCU and protein secondary structures in human genes. The importance of natural selection for the formation of isochore organization of the human genome has been discussed on the basis of these results.
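
    To make the quantities named in this abstract concrete, the following minimal Python sketch computes third-codon-position C(3), G(3) and GC(3) levels and RSCU values for one coding sequence. The abstract does not state its formulas, so the usual definitions are assumed: C(3) and G(3) are the fractions of codons ending in C or G, and RSCU is a codon's observed count divided by the mean count of its synonymous codons under the standard genetic code. The function names and the example sequence are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: NCBI table 1).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons(seq):
    """Split an in-frame coding sequence into complete codons."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def third_position_levels(seq):
    """Fractions of codons whose third base is C, G, or either (GC3)."""
    thirds = [c[2] for c in codons(seq)]
    if not thirds:
        raise ValueError("sequence contains no complete codon")
    c3 = thirds.count("C") / len(thirds)
    g3 = thirds.count("G") / len(thirds)
    return {"C3": c3, "G3": g3, "GC3": c3 + g3}


def rscu(seq):
    """RSCU per codon: observed count / mean count over its synonymous family."""
    counts = Counter(c for c in codons(seq) if c in CODON_TABLE)
    values = {}
    for aa in set(AMINO) - {"*"}:
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent from this sequence
        for c in family:
            values[c] = counts[c] * len(family) / total
    return values


example = "GCTGCCGCCGCG"  # four alanine codons: GCT, GCC, GCC, GCG
levels = third_position_levels(example)
usage = rscu(example)
```

    For this toy sequence the levels are C3 = 0.5, G3 = 0.25 and GC3 = 0.75, and the alanine codon GCC, used twice out of four, has an RSCU of 2.0. Relating such values to secondary-structure units, as the study does, would require structure annotations that are not part of this sketch.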

  12. Combining Shigella Tn-seq data with gold-standard E. coli gene deletion data suggests rare transitions between essential and non-essential gene functionality.

    PubMed

    Freed, Nikki E; Bumann, Dirk; Silander, Olin K

    2016-09-06

    Gene essentiality - whether or not a gene is necessary for cell growth - is a fundamental component of gene function. It is not well established how quickly gene essentiality can change, as few studies have compared empirical measures of essentiality between closely related organisms. Here we present the results of a Tn-seq experiment designed to detect essential protein coding genes in the bacterial pathogen Shigella flexneri 2a 2457T on a genome-wide scale. Superficial analysis of this data suggested that 481 protein-coding genes in this Shigella strain are critical for robust cellular growth on rich media. Comparison of this set of genes with a gold-standard data set of essential genes in the closely related Escherichia coli K12 BW25113 revealed that an excessive number of genes appeared essential in Shigella but non-essential in E. coli. Importantly, and in converse to this comparison, we found no genes that were essential in E. coli and non-essential in Shigella, implying that many genes were artefactually inferred as essential in Shigella. Controlling for such artefacts resulted in a much smaller set of discrepant genes. Among these, we identified three sets of functionally related genes, two of which have previously been implicated as critical for Shigella growth, but which are dispensable for E. coli growth. The data presented here highlight the small number of protein coding genes for which we have strong evidence that their essentiality status differs between the closely related bacterial taxa E. coli and Shigella. A set of genes involved in acetate utilization provides a canonical example. These results leave open the possibility of developing strain-specific antibiotic treatments targeting such differentially essential genes, but suggest that such opportunities may be rare in closely related bacteria.

  13. Pathway Distiller - multisource biological pathway consolidation

    PubMed Central

    2012-01-01

    Background One method to understand and evaluate an experiment that produces a large set of genes, such as a gene expression microarray analysis, is to identify overrepresentation or enrichment for biological pathways. Because pathways are able to functionally describe the set of genes, much effort has been made to collect curated biological pathways into publicly accessible databases. When combining disparate databases, highly related or redundant pathways exist, making their consolidation into pathway concepts essential. This will facilitate unbiased, comprehensive yet streamlined analysis of experiments that result in large gene sets. Methods After gene set enrichment finds representative pathways for large gene sets, pathways are consolidated into representative pathway concepts. Three complementary, but different methods of pathway consolidation are explored. Enrichment Consolidation combines the set of the pathways enriched for the signature gene list through iterative combining of enriched pathways with other pathways with similar signature gene sets; Weighted Consolidation utilizes a Protein-Protein Interaction network based gene-weighting approach that finds clusters of both enriched and non-enriched pathways limited to the experiments' resultant gene list; and finally the de novo Consolidation method uses several measurements of pathway similarity, that finds static pathway clusters independent of any given experiment. Results We demonstrate that the three consolidation methods provide unified yet different functional insights of a resultant gene set derived from a genome-wide profiling experiment. Results from the methods are presented, demonstrating their applications in biological studies and comparing with a pathway web-based framework that also combines several pathway databases. 
Additionally a web-based consolidation framework that encompasses all three methods discussed in this paper, Pathway Distiller (http://cbbiweb.uthscsa.edu/PathwayDistiller), is established to allow researchers access to the methods and example microarray data described in this manuscript, and the ability to analyze their own gene list by using our unique consolidation methods. Conclusions By combining several pathway systems, implementing different, but complementary pathway consolidation methods, and providing a user-friendly web-accessible tool, we have enabled users the ability to extract functional explanations of their genome wide experiments. PMID:23134636
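
    The two ingredients this abstract builds on, pathway overrepresentation and pathway similarity, can be sketched in Python as follows. The abstract names neither the enrichment statistic nor the similarity measures used by Pathway Distiller, so a one-sided hypergeometric test and the Jaccard index are assumed here purely as common choices; all function names and gene identifiers are hypothetical.

```python
from math import comb


def overrepresentation_p(universe_size, pathway_size, list_size, overlap):
    """One-sided hypergeometric p-value: P(X >= overlap).

    X is the number of pathway genes found in a gene list of `list_size`
    drawn without replacement from `universe_size` genes, of which
    `pathway_size` belong to the pathway.
    """
    upper = min(pathway_size, list_size)
    total = comb(universe_size, list_size)
    tail = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def jaccard(pathway_a, pathway_b):
    """Jaccard similarity of two pathways given as collections of gene IDs."""
    a, b = set(pathway_a), set(pathway_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy numbers: 10 genes in total, a 5-gene pathway, a 2-gene result list,
# both of whose genes fall in the pathway.
p_value = overrepresentation_p(universe_size=10, pathway_size=5, list_size=2, overlap=2)

# Two hypothetical pathways sharing two of four distinct genes.
similarity = jaccard({"A", "B", "C"}, {"B", "C", "D"})
```

    Here p_value is 10/45 (about 0.22) and similarity is 0.5. A consolidation step could then merge pathways whose similarity exceeds a chosen threshold; that threshold and the merging rule are design decisions the abstract does not specify.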

  14. Pathway Distiller - multisource biological pathway consolidation.

    PubMed

    Doderer, Mark S; Anguiano, Zachry; Suresh, Uthra; Dashnamoorthy, Ravi; Bishop, Alexander J R; Chen, Yidong

    2012-01-01

    One method to understand and evaluate an experiment that produces a large set of genes, such as a gene expression microarray analysis, is to identify overrepresentation or enrichment for biological pathways. Because pathways are able to functionally describe the set of genes, much effort has been made to collect curated biological pathways into publicly accessible databases. When combining disparate databases, highly related or redundant pathways exist, making their consolidation into pathway concepts essential. This will facilitate unbiased, comprehensive yet streamlined analysis of experiments that result in large gene sets. After gene set enrichment finds representative pathways for large gene sets, pathways are consolidated into representative pathway concepts. Three complementary, but different methods of pathway consolidation are explored. Enrichment Consolidation combines the set of the pathways enriched for the signature gene list through iterative combining of enriched pathways with other pathways with similar signature gene sets; Weighted Consolidation utilizes a Protein-Protein Interaction network based gene-weighting approach that finds clusters of both enriched and non-enriched pathways limited to the experiments' resultant gene list; and finally the de novo Consolidation method uses several measurements of pathway similarity, that finds static pathway clusters independent of any given experiment. We demonstrate that the three consolidation methods provide unified yet different functional insights of a resultant gene set derived from a genome-wide profiling experiment. Results from the methods are presented, demonstrating their applications in biological studies and comparing with a pathway web-based framework that also combines several pathway databases. 
Additionally a web-based consolidation framework that encompasses all three methods discussed in this paper, Pathway Distiller (http://cbbiweb.uthscsa.edu/PathwayDistiller), is established to allow researchers access to the methods and example microarray data described in this manuscript, and the ability to analyze their own gene list by using our unique consolidation methods. By combining several pathway systems, implementing different, but complementary pathway consolidation methods, and providing a user-friendly web-accessible tool, we have enabled users the ability to extract functional explanations of their genome wide experiments.

  15. Sample entropy analysis of cervical neoplasia gene-expression signatures

    PubMed Central

    Botting, Shaleen K; Trzeciakowski, Jerome P; Benoit, Michelle F; Salama, Salama A; Diaz-Arrastia, Concepcion R

    2009-01-01

    Background We introduce Approximate Entropy as a mathematical method of analysis for microarray data. Approximate entropy is applied here as a method to classify the complex gene expression patterns resulting from a clinical sample set. Since Entropy is a measure of disorder in a system, we believe that by choosing genes which display minimum entropy in normal controls and maximum entropy in the cancerous sample set we will be able to distinguish those genes which display the greatest variability in the cancerous set. Here we describe a method of utilizing Approximate Sample Entropy (ApSE) analysis to identify genes of interest with the highest probability of producing an accurate, predictive, classification model from our data set. Results In the development of a diagnostic gene-expression profile for cervical intraepithelial neoplasia (CIN) and squamous cell carcinoma of the cervix, we identified 208 genes which are unchanging in all normal tissue samples, yet exhibit a random pattern indicative of the genetic instability and heterogeneity of malignant cells. This may be measured in terms of the ApSE when compared to normal tissue. We have validated 10 of these genes on 10 Normal and 20 cancer and CIN3 samples. We report that the predictive value of the sample entropy calculation for these 10 genes of interest is promising (75% sensitivity, 80% specificity for prediction of cervical cancer over CIN3). Conclusion The success of the Approximate Sample Entropy approach in discerning alterations in complexity from a biological system with such a relatively small sample set, and in extracting biologically relevant genes of interest, holds great promise. PMID:19232110
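
    The abstract does not define its entropy statistic precisely, so the Python sketch below shows the standard sample entropy calculation as an assumed stand-in, not the authors' ApSE procedure. It counts pairs of length-m templates that match within an absolute tolerance r, counts how many of those pairs still match at length m + 1, and returns the negative log of the ratio. The function name, the defaults for m and r, and the toy series are hypothetical.

```python
import math


def sample_entropy(series, m=2, r=0.2):
    """Standard sample entropy of a numeric series (assumed definition).

    Templates are compared with the Chebyshev (maximum) distance, and
    self-matches are excluded. Returns infinity when no matches exist.
    """
    n = len(series)

    def count_matches(length):
        # Use the same n - m starting positions for both template lengths.
        templates = [series[i:i + length] for i in range(n - m)]
        matches = 0
        for i in range(len(templates)):
            for j in range(i + 1, len(templates)):
                distance = max(abs(a - b) for a, b in zip(templates[i], templates[j]))
                if distance <= r:
                    matches += 1
        return matches

    b = count_matches(m)       # matching pairs at length m
    a = count_matches(m + 1)   # matching pairs at length m + 1
    if a == 0 or b == 0:
        return float("inf")
    return -math.log(a / b)


# A perfectly regular expression profile has zero sample entropy.
flat = sample_entropy([1.0] * 10, m=2, r=0.1)
alternating = sample_entropy([0.0, 1.0] * 5, m=2, r=0.1)
```

    Under this definition, a gene whose expression is stable across normal samples scores near zero, while an erratic profile scores higher, which is the contrast the abstract describes between normal and cancerous sample sets. In practice r is often scaled to the standard deviation of the series; that choice is left to the caller here.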

  16. A Vector Library for Silencing Central Carbon Metabolism Genes with Antisense RNAs in Escherichia coli

    PubMed Central

    Ohno, Satoshi; Yoshikawa, Katsunori; Shimizu, Hiroshi; Tamura, Tomohiro

    2014-01-01

    We describe here the construction of a series of 71 vectors to silence central carbon metabolism genes in Escherichia coli. The vectors inducibly express antisense RNAs called paired-terminus antisense RNAs, which have a higher silencing efficacy than ordinary antisense RNAs. By measuring mRNA amounts, measuring activities of target proteins, or observing specific phenotypes, it was confirmed that all the vectors were able to silence the expression of target genes efficiently. Using this vector set, each of the central carbon metabolism genes was silenced individually, and the accumulation of metabolites was investigated. We were able to obtain accurate information on ways to increase the production of pyruvate, an industrially valuable compound, from the silencing results. Furthermore, the experimental results of pyruvate accumulation were compared to in silico predictions, and both sets of results were consistent. Compared to the gene disruption approach, the silencing approach has an advantage in that any E. coli strain can be used and multiple gene silencing is easily possible in any combination. PMID:24212579

  17. Query-based biclustering of gene expression data using Probabilistic Relational Models.

    PubMed

    Zhao, Hui; Cloots, Lore; Van den Bulcke, Tim; Wu, Yan; De Smet, Riet; Storms, Valerie; Meysman, Pieter; Engelen, Kristof; Marchal, Kathleen

    2011-02-15

    With the availability of large scale expression compendia it is now possible to view one's own findings in the light of what is already available and retrieve genes with an expression profile similar to a set of genes of interest (i.e., a query or seed set) for a subset of conditions. To that end, a query-based strategy is needed that maximally exploits the coexpression behaviour of the seed genes to guide the biclustering, but that at the same time is robust against the presence of noisy genes in the seed set, as seed genes are often assumed, but not guaranteed, to be coexpressed in the queried compendium. Therefore, we developed ProBic, a query-based biclustering strategy based on Probabilistic Relational Models (PRMs) that exploits the use of prior distributions to extract the information contained within the seed set. We applied ProBic on a large scale Escherichia coli compendium to extend partially described regulons with potentially novel members. We compared ProBic's performance with previously published query-based biclustering algorithms, namely ISA and QDB, from the perspective of bicluster expression quality, robustness of the outcome against noisy seed sets and biological relevance. This comparison shows that ProBic is able to retrieve biologically relevant, high quality biclusters that retain their seed genes and that it is particularly strong in handling noisy seeds. ProBic is a query-based biclustering algorithm developed in a flexible framework, designed to detect biologically relevant, high quality biclusters that retain relevant seed genes even in the presence of noise or when dealing with low quality seed sets.

  18. Transcriptome meta-analysis reveals common differential and global gene expression profiles in cystic fibrosis and other respiratory disorders and identifies CFTR regulators.

    PubMed

    Clarke, Luka A; Botelho, Hugo M; Sousa, Lisete; Falcao, Andre O; Amaral, Margarida D

    2015-11-01

    A meta-analysis of 13 independent microarray data sets was performed and gene expression profiles from cystic fibrosis (CF), similar disorders (COPD: chronic obstructive pulmonary disease, IPF: idiopathic pulmonary fibrosis, asthma), environmental conditions (smoking, epithelial injury), related cellular processes (epithelial differentiation/regeneration), and non-respiratory "control" conditions (schizophrenia, dieting), were compared. Similarity among differentially expressed (DE) gene lists was assessed using a permutation test, and a clustergram was constructed, identifying common gene markers. Global gene expression values were standardized using a novel approach, revealing that similarities between independent data sets run deeper than shared DE genes. Correlation of gene expression values identified putative gene regulators of the CF transmembrane conductance regulator (CFTR) gene, of potential therapeutic significance. Our study provides a novel perspective on CF epithelial gene expression in the context of other lung disorders and conditions, and highlights the contribution of differentiation/EMT and injury to gene signatures of respiratory disease.
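
    A permutation test for the similarity of two DE gene lists, as mentioned in this abstract, might look like the Python sketch below. The abstract does not say which similarity statistic or null model was used, so this sketch assumes the simplest option: the statistic is the size of the overlap, and the null distribution comes from replacing one list with random gene sets of the same size drawn from a shared universe. The function name, parameters and gene identifiers are hypothetical.

```python
import random


def overlap_permutation_p(universe, list_a, list_b, n_perm=999, seed=0):
    """Empirical p-value for the overlap between two gene lists.

    Counts how often a random gene set of the same size as list_b overlaps
    list_a at least as much as list_b does. The +1 terms include the
    observed lists among the permutations, so the p-value is never zero.
    """
    genes = sorted(universe)          # fixed order keeps the seed reproducible
    set_a, set_b = set(list_a), set(list_b)
    observed = len(set_a & set_b)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        random_b = rng.sample(genes, len(set_b))
        if len(set_a.intersection(random_b)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


universe = [f"gene{i}" for i in range(100)]
identical_p = overlap_permutation_p(universe, universe[:10], universe[:10])
disjoint_p = overlap_permutation_p(universe, universe[:10], universe[50:60])
```

    Two identical ten-gene lists give a small p-value, while two disjoint lists give exactly 1.0, because every random draw overlaps at least zero genes. Pairwise p-values of this kind could feed a clustergram like the one the abstract describes, although the clustering procedure itself is not specified there.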

  19. Twenty-four signature genes predict the prognosis of oral squamous cell carcinoma with high accuracy and repeatability

    PubMed Central

    Gao, Jianyong; Tian, Gang; Han, Xu; Zhu, Qiang

    2018-01-01

    Oral squamous cell carcinoma (OSCC) is the sixth most common type of cancer worldwide, with poor prognosis. The present study aimed to identify gene signatures that could classify OSCC and predict prognosis in different stages. A training data set (GSE41613) and two validation data sets (GSE42743 and GSE26549) were acquired from the online Gene Expression Omnibus database. In the training data set, patients were classified based on the tumor-node-metastasis staging system, and subsequently grouped into low stage (L) or high stage (H). Signature genes between L and H stages were selected by disparity index analysis, and classification was performed by the expression of these signature genes. The established classification was compared with the L and H classification, and fivefold cross validation was used to evaluate the stability. Enrichment analysis for the signature genes was implemented by the Database for Annotation, Visualization and Integrated Discovery. Two validation data sets were used to determine the precision of classification. Survival analysis was conducted following each classification using the package ‘survival’ in R software. A set of 24 signature genes was identified based on the classification model with the Fi value of 0.47, which was used to distinguish OSCC samples in two different stages. Overall survival of patients in the H stage was higher than those in the L stage. Signature genes were primarily enriched in ‘ether lipid metabolism’ pathway and biological processes such as ‘positive regulation of adaptive immune response’ and ‘apoptotic cell clearance’. The results provided a novel 24-gene set that may be used as biomarkers to predict OSCC prognosis with high accuracy, which may be used to determine an appropriate treatment program for patients with OSCC in addition to the traditional evaluation index. PMID:29257303

  20. An integrative machine learning strategy for improved prediction of essential genes in Escherichia coli metabolism using flux-coupled features.

    PubMed

    Nandi, Sutanu; Subramanian, Abhishek; Sarkar, Ram Rup

    2017-07-25

    Prediction of essential genes helps to identify a minimal set of genes that are absolutely required for the appropriate functioning and survival of a cell. The available machine learning techniques for essential gene prediction have inherent problems, like imbalanced provision of training datasets, biased choice of the best model for a given balanced dataset, choice of a complex machine learning algorithm, and data-based automated selection of biologically relevant features for classification. Here, we propose a simple support vector machine-based learning strategy for the prediction of essential genes in Escherichia coli K-12 MG1655 metabolism that integrates a non-conventional combination of an appropriate sample balanced training set, a unique organism-specific genotype, phenotype attributes that characterize essential genes, and optimal parameters of the learning algorithm to generate the best machine learning model (the model with the highest accuracy among all the models trained for different sample training sets). For the first time, we also introduce flux-coupled metabolic subnetwork-based features for enhancing the classification performance. Our strategy proves to be superior as compared to previous SVM-based strategies in obtaining a biologically relevant classification of genes with high sensitivity and specificity. This methodology was also trained with datasets of other recent supervised classification techniques for essential gene classification and tested using reported test datasets. The testing accuracy was always high as compared to the known techniques, proving that our method outperforms known methods. Observations from our study indicate that essential genes are conserved among homologous bacterial species, demonstrate high codon usage bias, GC content and gene expression, and predominantly possess a tendency to form physiological flux modules in metabolism.

  1. Gene expression changes in response to aging compared to heat stress, oxidative stress and ionizing radiation in Drosophila melanogaster.

    PubMed

    Landis, Gary; Shen, Jie; Tower, John

    2012-11-01

    Gene expression changes in response to aging, heat stress, hyperoxia, hydrogen peroxide, and ionizing radiation were compared using microarrays. A set of 18 genes were up-regulated across all conditions, indicating a general stress response shared with aging, including the heat shock protein (Hsp) genes Hsp70, Hsp83 and l(2)efl, the glutathione-S-transferase gene GstD2, and the mitochondrial unfolded protein response (mUPR) gene ref(2)P. Selected gene expression changes were confirmed using quantitative PCR, Northern analysis and GstD-GFP reporter constructs. Certain genes were altered in only a subset of the conditions, for example, up-regulation of numerous developmental pathway and signaling genes in response to hydrogen peroxide. While aging shared features with each stress, aging was more similar to the stresses most associated with oxidative stress (hyperoxia, hydrogen peroxide, ionizing radiation) than to heat stress. Aging is associated with down-regulation of numerous mitochondrial genes, including electron-transport-chain (ETC) genes and mitochondrial metabolism genes, and a sub-set of these changes was also observed upon hydrogen peroxide stress and ionizing radiation stress. Aging shared the largest number of gene expression changes with hyperoxia. The extensive down-regulation of mitochondrial and ETC genes during aging is consistent with an aging-associated failure in mitochondrial maintenance, which may underlie the oxidative stress-like and proteotoxic stress-like responses observed during aging.

  2. Gene expression changes in response to aging compared to heat stress, oxidative stress and ionizing radiation in Drosophila melanogaster

    PubMed Central

    Landis, Gary; Shen, Jie; Tower, John

    2012-01-01

    Gene expression changes in response to aging, heat stress, hyperoxia, hydrogen peroxide, and ionizing radiation were compared using microarrays. A set of 18 genes were up-regulated across all conditions, indicating a general stress response shared with aging, including the heat shock protein (Hsp) genes Hsp70, Hsp83 and l(2)efl, the glutathione-S-transferase gene GstD2, and the mitochondrial unfolded protein response (mUPR) gene ref(2)P. Selected gene expression changes were confirmed using quantitative PCR, Northern analysis and GstD-GFP reporter constructs. Certain genes were altered in only a subset of the conditions, for example, up-regulation of numerous developmental pathway and signaling genes in response to hydrogen peroxide. While aging shared features with each stress, aging was more similar to the stresses most associated with oxidative stress (hyperoxia, hydrogen peroxide, ionizing radiation) than to heat stress. Aging is associated with down-regulation of numerous mitochondrial genes, including electron-transport-chain (ETC) genes and mitochondrial metabolism genes, and a sub-set of these changes was also observed upon hydrogen peroxide stress and ionizing radiation stress. Aging shared the largest number of gene expression changes with hyperoxia. The extensive down-regulation of mitochondrial and ETC genes during aging is consistent with an aging-associated failure in mitochondrial maintenance, which may underlie the oxidative stress-like and proteotoxic stress-like responses observed during aging. PMID:23211361

  3. Evolution of Prdm Genes in Animals: Insights from Comparative Genomics

    PubMed Central

    Vervoort, Michel; Meulemeester, David; Béhague, Julien; Kerner, Pierre

    2016-01-01

    Prdm genes encode transcription factors with a subtype of SET domain known as the PRDF1-RIZ (PR) homology domain and a variable number of zinc finger motifs. These genes are involved in a wide variety of functions during animal development. As most Prdm genes have been studied in vertebrates, especially in mice, little is known about the evolution of this gene family. We searched for Prdm genes in the fully sequenced genomes of 93 different species representative of all the main metazoan lineages. A total of 976 Prdm genes were identified in these species. The number of Prdm genes per species ranges from 2 to 19. To better understand how the Prdm gene family has evolved in metazoans, we performed phylogenetic analyses using this large set of identified Prdm genes. These analyses allowed us to define 14 different subfamilies of Prdm genes and to establish, through ancestral state reconstruction, that 11 of them are ancestral to bilaterian animals. Three additional subfamilies were acquired during early vertebrate evolution (Prdm5, Prdm11, and Prdm17). Several gene duplication and gene loss events were identified and mapped onto the metazoan phylogenetic tree. By studying a large number of nonmetazoan genomes, we confirmed that Prdm genes likely constitute a metazoan-specific gene family. Our data also suggest that Prdm genes originated before the diversification of animals through the association of a single ancestral SET domain encoding gene with one or several zinc finger encoding genes. PMID:26560352

  4. Evaluation of new gyrB-based real-time PCR system for the detection of B. fragilis as an indicator of human-specific fecal contamination.

    PubMed

    Lee, Chang Soo; Lee, Jiyoung

    2010-09-01

    A rapid and specific gyrB-based real-time PCR system has been developed for detecting Bacteroides fragilis as a human-specific marker of fecal contamination. Its specificity and sensitivity were evaluated by comparison with other 16S rRNA gene-based primers using closely related Bacteroides and Prevotella. Many studies have used 16S rRNA gene-based methods targeting Bacteroides because this genus is relatively abundant in human feces and is useful for microbial source tracking. However, 16S rRNA gene-based primers are evolutionarily too conserved among taxa to discriminate between human-specific species of Bacteroides and other closely related genera, such as Prevotella. Recently, one of the housekeeping genes, gyrB, has been used as an alternative target in multilocus sequence analysis (MLSA) to provide greater phylogenetic resolution. In this study, a new B. fragilis-specific primer set (Bf904F/Bf958R) was designed by alignments of 322 gyrB genes, and its performance was compared with that of the 16S rRNA gene-based primers in the presence of B. fragilis, Bacteroides ovatus and Prevotella melaninogenica. Amplicons were sequenced and a phylogenetic tree was constructed to confirm the specificity of the primers to B. fragilis. The gyrB-based primers successfully discriminated B. fragilis from B. ovatus and P. melaninogenica. Real-time PCR results showed that the gyrB primer set had a comparable sensitivity in the detection of B. fragilis when compared with the 16S rRNA primer set. The host-specificity of our gyrB-based primer set was validated with human, pig, cow, and dog fecal samples. The gyrB primer system had superior human-specificity. The gyrB-based system can rapidly detect human-specific fecal sources and can be used for improved source tracking of human contamination. (c) 2010 Elsevier B.V. All rights reserved.

  5. Design and verification of a pangenome microarray oligonucleotide probe set for Dehalococcoides spp.

    PubMed

    Hug, Laura A; Salehi, Maryam; Nuin, Paulo; Tillier, Elisabeth R; Edwards, Elizabeth A

    2011-08-01

    Dehalococcoides spp. are an industrially relevant group of Chloroflexi bacteria capable of reductively dechlorinating contaminants in groundwater environments. Existing Dehalococcoides genomes revealed a high level of sequence identity within this group, including 98 to 100% 16S rRNA sequence identity between strains with diverse substrate specificities. Common molecular techniques for identification of microbial populations are often not applicable for distinguishing Dehalococcoides strains. Here we describe an oligonucleotide microarray probe set designed based on clustered Dehalococcoides genes from five different sources (strain DET195, CBDB1, BAV1, and VS genomes and the KB-1 metagenome). This "pangenome" probe set provides coverage of core Dehalococcoides genes as well as strain-specific genes while optimizing the potential for hybridization to closely related, previously unknown Dehalococcoides strains. The pangenome probe set was compared to probe sets designed independently for each of the five Dehalococcoides strains. The pangenome probe set demonstrated better predictability and higher detection of Dehalococcoides genes than strain-specific probe sets on nontarget strains with <99% average nucleotide identity. An in silico analysis of the expected probe hybridization against the recently released Dehalococcoides strain GT genome and additional KB-1 metagenome sequence data indicated that the pangenome probe set performs more robustly than the combined strain-specific probe sets in the detection of genes not included in the original design. The pangenome probe set represents a highly specific, universal tool for the detection and characterization of Dehalococcoides from contaminated sites. It has the potential to become a common platform for Dehalococcoides-focused research, allowing meaningful comparisons between microarray experiments regardless of the strain examined.

  6. Reprogramming of gene expression during compression wood formation in pine: Coordinated modulation of S-adenosylmethionine, lignin and lignan related genes

    PubMed Central

    2012-01-01

    Background Transcript profiling of differentiating secondary xylem has allowed us to draw a general picture of the genes involved in wood formation. However, our knowledge is still limited about the regulatory mechanisms that coordinate and modulate the different pathways providing substrates during xylogenesis. The development of compression wood in conifers constitutes an exceptional model for these studies. Although differential expression of a few genes in differentiating compression wood compared to normal or opposite wood has been reported, the broad range of features that distinguish this reaction wood suggest that the expression of a larger set of genes would be modified. Results By combining the construction of different cDNA libraries with microarray analyses we have identified a total of 496 genes in maritime pine (Pinus pinaster, Ait.) that change in expression during differentiation of compression wood (331 up-regulated and 165 down-regulated compared to opposite wood). Samples from different provenances collected in different years and geographic locations were integrated into the analyses to mitigate the effects of multiple sources of variability. This strategy allowed us to define a group of genes that are consistently associated with compression wood formation. Correlating with the deposition of a thicker secondary cell wall that characterizes compression wood development, the expression of a number of genes involved in synthesis of cellulose, hemicellulose, lignin and lignans was up-regulated. Further analysis of a set of these genes involved in S-adenosylmethionine metabolism, ammonium recycling, and lignin and lignans biosynthesis showed changes in expression levels in parallel to the levels of lignin accumulation in cells undergoing xylogenesis in vivo and in vitro. 
Conclusions The comparative transcriptomic analysis reported here has revealed a broad spectrum of coordinated transcriptional modulation of genes involved in the biosynthesis of different cell wall polymers associated with within-tree variations in pine wood structure and composition. In particular, we demonstrate the coordinated modulation at the transcriptional level of a gene set involved in S-adenosylmethionine synthesis and ammonium assimilation with increased demand for coniferyl alcohol for lignin and lignan synthesis, enabling a better understanding of the metabolic requirements in cells undergoing lignification. PMID:22747794

  7. Comparative Analysis of Four Calypogeia Species Revealed Unexpected Change in Evolutionarily-Stable Liverwort Mitogenomes

    PubMed Central

    Ślipiko, Monika; Buczkowska-Chmielewska, Katarzyna; Bączkiewicz, Alina; Szczecińska, Monika; Sawicki, Jakub

    2017-01-01

    Liverwort mitogenomes are considered to be evolutionarily stable. A comparative analysis of four Calypogeia species revealed differences compared to previously sequenced liverwort mitogenomes. Such differences involve unexpected structural changes in the two genes, cox1 and atp1, which have lost three and two introns, respectively. The group I introns in the cox1 gene are proposed to have been lost by two-step localized retroprocessing, whereas one-step retroprocessing could be responsible for the disappearance of the group II introns in the atp1 gene. These cases represent the first identified losses of introns in mitogenomes of leafy liverworts (Jungermanniopsida) contrasting the stability of mitochondrial gene order with certain changes in the gene content and intron set in liverworts. PMID:29257096

  8. Combined protein construct and synthetic gene engineering for heterologous protein expression and crystallization using Gene Composer

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Raymond, Amy; Lovell, Scott; Lorimer, Don

    2009-12-01

    With the goal of improving yield and success rates of heterologous protein production for structural studies, we have developed the database and algorithm software package Gene Composer. This freely available electronic tool facilitates the information-rich design of protein constructs and their engineered synthetic gene sequences, as detailed in the accompanying manuscript. In this report, we compare heterologous protein expression levels from native sequences to those of codon-engineered synthetic gene constructs designed by Gene Composer. A test set of proteins including a human kinase (p38α), viral polymerase (HCV NS5B), and bacterial structural protein (FtsZ) were expressed in both E. coli and a cell-free wheat germ translation system. We also compare the protein expression levels in E. coli for a set of 11 different proteins with greatly varied G:C content and codon bias. The results consistently demonstrate that protein yields from codon-engineered Gene Composer designs are as good as or better than those achieved from the synonymous native genes. Moreover, structure-guided N- and C-terminal deletion constructs designed with the aid of Gene Composer can lead to greater success in gene-to-structure work, as exemplified by the X-ray crystallographic structure determination of FtsZ from Bacillus subtilis. These results validate the Gene Composer algorithms, and suggest that using a combination of synthetic gene and protein construct engineering tools can improve the economics of gene-to-structure research.

  9. Gene regulatory network inference using fused LASSO on multiple data sets

    PubMed Central

    Omranian, Nooshin; Eloundou-Mbebi, Jeanne M. O.; Mueller-Roeber, Bernd; Nikoloski, Zoran

    2016-01-01

    Devising computational methods to accurately reconstruct gene regulatory networks given gene expression data is key to systems biology applications. Here we propose a method for reconstructing gene regulatory networks by simultaneous consideration of data sets from different perturbation experiments and corresponding controls. The method imposes three biologically meaningful constraints: (1) expression levels of each gene should be explained by the expression levels of a small number of transcription factor coding genes, (2) networks inferred from different data sets should be similar with respect to the type and number of regulatory interactions, and (3) relationships between genes which exhibit similar differential behavior over the considered perturbations should be favored. We demonstrate that these constraints can be transformed in a fused LASSO formulation for the proposed method. The comparative analysis on transcriptomics time-series data from prokaryotic species, Escherichia coli and Mycobacterium tuberculosis, as well as a eukaryotic species, mouse, demonstrated that the proposed method has the advantages of the most recent approaches for regulatory network inference, while obtaining better performance and assigning higher scores to the true regulatory links. The study indicates that the combination of sparse regression techniques with other biologically meaningful constraints is a promising framework for gene regulatory network reconstructions. PMID:26864687
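    As a rough illustration of how the first two constraints above might be written as a fused LASSO objective, the sketch below gives a generic multi-data-set form for one target gene. It is not taken from the paper: the symbols (K data sets, response vector y^{(k)}, transcription-factor expression matrix X^{(k)}, coefficient vector \beta^{(k)}, and tuning parameters \lambda_1, \lambda_2) are assumptions introduced here, and the third constraint is not represented.

```latex
% Generic fused LASSO sketch for one target gene across K data sets
% (all notation is assumed for illustration, not the authors' formulation)
\min_{\beta^{(1)},\dots,\beta^{(K)}}
  \sum_{k=1}^{K} \bigl\| y^{(k)} - X^{(k)} \beta^{(k)} \bigr\|_2^2
  \;+\; \lambda_1 \sum_{k=1}^{K} \bigl\| \beta^{(k)} \bigr\|_1
  \;+\; \lambda_2 \sum_{1 \le k < l \le K} \bigl\| \beta^{(k)} - \beta^{(l)} \bigr\|_1
```

    In this form the \lambda_1 term encourages each gene to be explained by few regulators (constraint 1), while the \lambda_2 term penalizes differences between the networks inferred from different data sets (constraint 2).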

  10. Comparative Reannotation of 21 Aspergillus Genomes

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Salamov, Asaf; Riley, Robert; Kuo, Alan

    2013-03-08

    We used comparative gene modeling to reannotate 21 Aspergillus genomes. Initial automatic annotation of individual genomes may contain errors of various kinds, e.g. missing genes, incorrect exon-intron structures, 'chimeras' that fuse two or more real genes, or, conversely, real genes split into two or more models. The main premise behind the comparative modeling approach is that for closely related genomes most orthologous families have the same conserved gene structure. The algorithm maps all gene models predicted in each individual Aspergillus genome to the other genomes and, for each locus, selects from potentially many competing models the one which most closely resembles the orthologous genes from other genomes. This procedure is iterated until no further change in gene models is observed. For Aspergillus genomes we predicted in total 4503 new gene models (~2% per genome), supported by comparative analysis, additionally correcting ~18% of old gene models. This resulted in a total of 4065 more genes with annotated PFAM domains (~3% increase per genome). Analysis of a few genomes with EST/transcriptomics data shows that the new annotation sets also have a higher number of EST-supported splice sites at exon-intron boundaries.

  11. Tcof1-Related Molecular Networks in Treacher Collins Syndrome.

    PubMed

    Dai, Jiewen; Si, Jiawen; Wang, Minjiao; Huang, Li; Fang, Bing; Shi, Jun; Wang, Xudong; Shen, Guofang

    2016-09-01

    Treacher Collins syndrome (TCS) is a rare, autosomal-dominant disorder characterized by craniofacial deformities, and is primarily caused by mutations in the Tcof1 gene. This article aimed to perform a comprehensive literature review and systematic bioinformatic analysis of Tcof1-related molecular networks in TCS. First, the up- and down-regulated genes in Tcof1 heterozygous haploinsufficient mutant mouse embryos and in Tcof1 knockdown and Tcof1 over-expressing neuroblastoma N1E-115 cells were obtained from the Gene Expression Omnibus database. The GeneDecks database was used to calculate the 500 genes most closely related to Tcof1. Then, the relationships between 4 gene sets (a predicted set and sets comparing the wildtype with the 3 Gene Expression Omnibus datasets) were analyzed using the DAVID, GeneMANIA and STRING databases. The analysis results showed that the Tcof1-related genes were enriched in various biological processes, including cell proliferation, apoptosis, cell cycle, differentiation, and migration. They were also enriched in several signaling pathways, such as the ribosome, p53, cell cycle, and WNT signaling pathways. Additionally, these genes clearly had direct or indirect interactions with Tcof1 and with each other. The findings of the literature review and bioinformatic analysis imply that special attention should be given to these pathways, as they may offer target points for TCS therapies.

  12. Gene Network Construction from Microarray Data Identifies a Key Network Module and Several Candidate Hub Genes in Age-Associated Spatial Learning Impairment

    PubMed Central

    Uddin, Raihan; Singh, Shiva M.

    2017-01-01

    As humans age many suffer from a decrease in normal brain functions including spatial learning impairments. This study aimed to better understand the molecular mechanisms in age-associated spatial learning impairment (ASLI). We used a mathematical modeling approach implemented in Weighted Gene Co-expression Network Analysis (WGCNA) to create and compare gene network models of young (learning unimpaired) and aged (predominantly learning impaired) brains from a set of exploratory datasets in rats in the context of ASLI. The major goal was to overcome some of the limitations previously observed in the traditional meta- and pathway analysis using these data, and identify novel ASLI related genes and their networks based on co-expression relationship of genes. This analysis identified a set of network modules in the young, each of which is highly enriched with genes functioning in broad but distinct GO functional categories or biological pathways. Interestingly, the analysis pointed to a single module that was highly enriched with genes functioning in “learning and memory” related functions and pathways. Subsequent differential network analysis of this “learning and memory” module in the aged (predominantly learning impaired) rats compared to the young learning unimpaired rats allowed us to identify a set of novel ASLI candidate hub genes. Some of these genes show significant repeatability in networks generated from independent young and aged validation datasets. These hub genes are highly co-expressed with other genes in the network, which not only show differential expression but also differential co-expression and differential connectivity across age and learning impairment. The known function of these hub genes indicate that they play key roles in critical pathways, including kinase and phosphatase signaling, in functions related to various ion channels, and in maintaining neuronal integrity relating to synaptic plasticity and memory formation. 
Taken together, they provide a new insight and generate new hypotheses into the molecular mechanisms responsible for age associated learning impairment, including spatial learning. PMID:29066959
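    For readers unfamiliar with WGCNA, the quantities behind "co-expression" and "connectivity" can be sketched as follows. This is the standard unsigned-network definition; whether the study used an unsigned or signed network, and which soft-thresholding power \beta it chose, is not stated in the abstract, so both are assumptions here.

```latex
% Standard unsigned WGCNA adjacency and connectivity
% (network type and the value of \beta are assumptions, not from the abstract)
a_{ij} = \bigl| \operatorname{cor}(x_i, x_j) \bigr|^{\beta},
\qquad
k_i = \sum_{j \neq i} a_{ij}
```

    Here x_i is the expression profile of gene i across samples, and genes with large k_i within a module are the candidates usually described as hub genes.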

  13. Gene Network Construction from Microarray Data Identifies a Key Network Module and Several Candidate Hub Genes in Age-Associated Spatial Learning Impairment.

    PubMed

    Uddin, Raihan; Singh, Shiva M

    2017-01-01

    As humans age many suffer from a decrease in normal brain functions including spatial learning impairments. This study aimed to better understand the molecular mechanisms in age-associated spatial learning impairment (ASLI). We used a mathematical modeling approach implemented in Weighted Gene Co-expression Network Analysis (WGCNA) to create and compare gene network models of young (learning unimpaired) and aged (predominantly learning impaired) brains from a set of exploratory datasets in rats in the context of ASLI. The major goal was to overcome some of the limitations previously observed in the traditional meta- and pathway analysis using these data, and identify novel ASLI related genes and their networks based on co-expression relationship of genes. This analysis identified a set of network modules in the young, each of which is highly enriched with genes functioning in broad but distinct GO functional categories or biological pathways. Interestingly, the analysis pointed to a single module that was highly enriched with genes functioning in "learning and memory" related functions and pathways. Subsequent differential network analysis of this "learning and memory" module in the aged (predominantly learning impaired) rats compared to the young learning unimpaired rats allowed us to identify a set of novel ASLI candidate hub genes. Some of these genes show significant repeatability in networks generated from independent young and aged validation datasets. These hub genes are highly co-expressed with other genes in the network, which not only show differential expression but also differential co-expression and differential connectivity across age and learning impairment. The known function of these hub genes indicate that they play key roles in critical pathways, including kinase and phosphatase signaling, in functions related to various ion channels, and in maintaining neuronal integrity relating to synaptic plasticity and memory formation. 
Taken together, they provide a new insight and generate new hypotheses into the molecular mechanisms responsible for age associated learning impairment, including spatial learning.

  14. MALDI-TOF mass spectrometry for quantitative gene expression analysis of acid responses in Staphylococcus aureus.

    PubMed

    Rode, Tone Mari; Berget, Ingunn; Langsrud, Solveig; Møretrø, Trond; Holck, Askild

    2009-07-01

    Microorganisms are constantly exposed to new and altered growth conditions, and respond by changing gene expression patterns. Several methods for studying gene expression exist. During the last decade, the analysis of microarrays has been one of the most common approaches applied for large scale gene expression studies. A relatively new method for gene expression analysis is MassARRAY, which combines real competitive-PCR and MALDI-TOF (matrix-assisted laser desorption/ionization time-of-flight) mass spectrometry. In contrast to microarray methods, MassARRAY technology is suitable for analysing a larger number of samples, though for a smaller set of genes. In this study we compare the results from MassARRAY with microarrays on gene expression responses of Staphylococcus aureus exposed to acid stress at pH 4.5. RNA isolated from the same stress experiments was analysed using both the MassARRAY and the microarray methods. The MassARRAY and microarray methods showed good correlation. Both MassARRAY and microarray estimated somewhat lower fold changes compared with quantitative real-time PCR (qRT-PCR). The results confirmed the up-regulation of the urease genes in acidic environments, and also indicated the importance of metal ion regulation. This study shows that the MassARRAY technology is suitable for gene expression analysis in prokaryotes, and has advantages when a set of genes is being analysed for an organism exposed to many different environmental conditions.

  15. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool

    PubMed Central

    2013-01-01

    Background System-wide profiling of genes and proteins in mammalian cells produces lists of differentially expressed genes/proteins that need to be further analyzed for their collective functions in order to extract new knowledge. Once unbiased lists of genes or proteins are generated from such experiments, these lists are used as input for computing enrichment with existing lists created from prior knowledge organized into gene-set libraries. While many enrichment analysis tools and gene-set library databases have been developed, there is still room for improvement. Results Here, we present Enrichr, an integrative web-based and mobile software application that includes new gene-set libraries, an alternative approach to rank enriched terms, and various interactive visualization approaches to display enrichment results using the JavaScript library, Data Driven Documents (D3). The software can also be embedded into any tool that performs gene list analysis. We applied Enrichr to analyze nine cancer cell lines by comparing their enrichment signatures to the enrichment signatures of matched normal tissues. We observed a common pattern of up-regulation of the polycomb group PRC2 and enrichment for the histone mark H3K27me3 in many cancer cell lines, as well as alterations in Toll-like receptor and interleukin signaling in K562 cells when compared with normal myeloid CD33+ cells. Such analyses provide global visualization of critical differences between normal tissues and cancer cell lines but can be applied to many other scenarios. Conclusions Enrichr is an easy-to-use, intuitive, web-based enrichment analysis tool providing various types of visualization summaries of collective functions of gene lists. Enrichr is open source and freely available online at: http://amp.pharm.mssm.edu/Enrichr. PMID:23586463
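    The idea of "computing enrichment" of an input gene list against a gene-set library can be illustrated with a plain over-representation test. The sketch below uses a one-sided hypergeometric tail probability; it is not Enrichr's code or its alternative ranking approach, and the function names, the toy library and the universe size are all hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(universe_size, set_size, list_size, overlap):
    """One-sided p-value P(X >= overlap) for drawing `list_size` genes
    without replacement from a universe in which `set_size` genes
    belong to the library term."""
    max_k = min(set_size, list_size)
    total = comb(universe_size, list_size)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, list_size - k)
        for k in range(overlap, max_k + 1)
    )
    return tail / total


def enrich(gene_list, library, universe_size):
    """Score every term in a {term: [genes]} library against the input
    list and return (term, overlap, p_value) tuples, smallest p first."""
    genes = set(gene_list)
    results = []
    for term, members in library.items():
        member_set = set(members)
        hits = genes & member_set
        p = hypergeom_enrichment_p(
            universe_size, len(member_set), len(genes), len(hits)
        )
        results.append((term, len(hits), p))
    return sorted(results, key=lambda r: r[2])


if __name__ == "__main__":
    # Hypothetical toy library; real libraries hold thousands of terms.
    toy_library = {"term_A": ["g1", "g2"], "term_B": ["g3", "g4"]}
    for term, n_hits, p in enrich(["g1", "g2"], toy_library, universe_size=4):
        print(term, n_hits, round(p, 4))
```

    In practice the p-values would also be corrected for multiple testing across all terms in the library, which this sketch omits.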

  16. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool.

    PubMed

    Chen, Edward Y; Tan, Christopher M; Kou, Yan; Duan, Qiaonan; Wang, Zichen; Meirelles, Gabriela Vaz; Clark, Neil R; Ma'ayan, Avi

    2013-04-15

    System-wide profiling of genes and proteins in mammalian cells produces lists of differentially expressed genes/proteins that need to be further analyzed for their collective functions in order to extract new knowledge. Once unbiased lists of genes or proteins are generated from such experiments, these lists are used as input for computing enrichment with existing lists created from prior knowledge organized into gene-set libraries. While many enrichment analysis tools and gene-set library databases have been developed, there is still room for improvement. Here, we present Enrichr, an integrative web-based and mobile software application that includes new gene-set libraries, an alternative approach to rank enriched terms, and various interactive visualization approaches to display enrichment results using the JavaScript library, Data Driven Documents (D3). The software can also be embedded into any tool that performs gene list analysis. We applied Enrichr to analyze nine cancer cell lines by comparing their enrichment signatures to the enrichment signatures of matched normal tissues. We observed a common pattern of up-regulation of the polycomb group PRC2 and enrichment for the histone mark H3K27me3 in many cancer cell lines, as well as alterations in Toll-like receptor and interleukin signaling in K562 cells when compared with normal myeloid CD33+ cells. Such analyses provide global visualization of critical differences between normal tissues and cancer cell lines but can be applied to many other scenarios. Enrichr is an easy-to-use, intuitive, web-based enrichment analysis tool providing various types of visualization summaries of collective functions of gene lists. Enrichr is open source and freely available online at: http://amp.pharm.mssm.edu/Enrichr.

  17. Comparative transcript profiling of alloplasmic male-sterile lines revealed altered gene expression related to pollen development in rice (Oryza sativa L.).

    PubMed

    Hu, Jihong; Chen, Guanglong; Zhang, Hongyuan; Qian, Qian; Ding, Yi

    2016-08-05

    Cytoplasmic male sterility (CMS) is an ideal model for investigating the mitochondrial-nuclear interaction and down-regulated genes in CMS lines which might be the candidate genes for pollen development in rice. In this study, a set of rice alloplasmic sporophytic CMS lines was obtained by successive backcrossing of Meixiang B, with three different cytoplasmic types: D62A (D type), ZS97A (WA type) and XQZ-A (DA type). Using microarray, the anther transcript profiles of the three indica rice CMS lines revealed 622 differentially expressed genes (DEGs) in each of the three CMS lines compared with the maintainer line Meixiang B. GO and MapMan analysis indicated that these DEGs were mainly involved in lipid metabolism and cell wall organization. Compared with the gene expression of sporophytic and gametophytic CMS lines, 303 DEGs were identified and 56 of them were down-regulated in all the CMS lines of rice. These down-regulated DEGs in the CMS lines were found to be involved in tapetum or cell wall formation and their suppressed expression might be related to male sterility. Weighted gene co-expression network analysis (WGCNA) revealed that two modules were significantly associated with male sterility and many hub genes that were differentially expressed in the CMS lines. A large set of putative genes involved in anther development was identified in the present study. The results will give some information for the nuclear gene regulation by different cytoplasmic genotypes and provide a rich resource for further functional research on the pollen development in rice.

  18. The Model-Based Study of the Effectiveness of Reporting Lists of Small Feature Sets Using RNA-Seq Data.

    PubMed

    Kim, Eunji; Ivanov, Ivan; Hua, Jianping; Lampe, Johanna W; Hullar, Meredith Aj; Chapkin, Robert S; Dougherty, Edward R

    2017-01-01

    Ranking feature sets for phenotype classification based on gene expression is a challenging issue in cancer bioinformatics. When the number of samples is small, all feature selection algorithms are known to be unreliable, producing significant error, and error estimators suffer from different degrees of imprecision. The problem is compounded by the fact that the accuracy of classification depends on the manner in which the phenomena are transformed into data by the measurement technology. Because next-generation sequencing technologies amount to a nonlinear transformation of the actual gene or RNA concentrations, they can potentially produce less discriminative data relative to the actual gene expression levels. In this study, we compare the performance of ranking feature sets derived from a model of RNA-Seq data with that of a multivariate normal model of gene concentrations using 3 measures: (1) ranking power, (2) length of extensions, and (3) Bayes features. This is the model-based study to examine the effectiveness of reporting lists of small feature sets using RNA-Seq data and the effects of different model parameters and error estimators. The results demonstrate that the general trends of the parameter effects on the ranking power of the underlying gene concentrations are preserved in the RNA-Seq data, whereas the power of finding a good feature set becomes weaker when gene concentrations are transformed by the sequencing machine.

  19. Multiple genome alignment for identifying the core structure among moderately related microbial genomes.

    PubMed

    Uchiyama, Ikuo

    2008-10-31

    Identifying the set of intrinsically conserved genes, or the genomic core, among related genomes is crucial for understanding prokaryotic genomes where horizontal gene transfers are common. Although core genome identification appears to be obvious among very closely related genomes, it becomes more difficult when more distantly related genomes are compared. Here, we consider the core structure as a set of sufficiently long segments in which gene orders are conserved so that they are likely to have been inherited mainly through vertical transfer, and developed a method for identifying the core structure by finding the order of pre-identified orthologous groups (OGs) that maximally retains the conserved gene orders. The method was applied to genome comparisons of two well-characterized families, Bacillaceae and Enterobacteriaceae, and identified their core structures comprising 1438 and 2125 OGs, respectively. The core sets contained most of the essential genes and their related genes, which were primarily included in the intersection of the two core sets comprising around 700 OGs. The definition of the genomic core based on gene order conservation was demonstrated to be more robust than the simpler approach based only on gene conservation. We also investigated the core structures in terms of G+C content homogeneity and phylogenetic congruence, and found that the core genes primarily exhibited the expected characteristic, i.e., being indigenous and sharing the same history, more than the non-core genes. The results demonstrate that our strategy of genome alignment based on gene order conservation can provide an effective approach to identify the genomic core among moderately related microbial genomes.
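    The "simpler approach based only on gene conservation" that the abstract uses as a point of comparison can be sketched as a set intersection: an orthologous group (OG) is called core if it occurs in every genome. This is only that baseline, not the gene-order-based alignment method the abstract describes, and the function name and toy OG identifiers are hypothetical.

```python
def conservation_core(genomes):
    """Return the OGs present in every genome.

    `genomes` maps a genome name to an iterable of OG identifiers.
    Gene order is ignored, which is exactly what distinguishes this
    baseline from an order-conservation-based definition of the core.
    """
    og_sets = [set(ogs) for ogs in genomes.values()]
    if not og_sets:
        return set()
    return set.intersection(*og_sets)


if __name__ == "__main__":
    # Hypothetical toy input: three genomes, four OGs.
    toy = {
        "genome_A": ["og1", "og2", "og3"],
        "genome_B": ["og2", "og3", "og4"],
        "genome_C": ["og3", "og2"],
    }
    print(sorted(conservation_core(toy)))
```

    The same intersection operation applies one level up, for example when comparing the core sets of two families to find the OGs they share.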

  20. A complete collection of single-gene deletion mutants of Acinetobacter baylyi ADP1

    PubMed Central

    de Berardinis, Véronique; Vallenet, David; Castelli, Vanina; Besnard, Marielle; Pinet, Agnès; Cruaud, Corinne; Samair, Sumitta; Lechaplais, Christophe; Gyapay, Gabor; Richez, Céline; Durot, Maxime; Kreimeyer, Annett; Le Fèvre, François; Schächter, Vincent; Pezo, Valérie; Döring, Volker; Scarpelli, Claude; Médigue, Claudine; Cohen, Georges N; Marlière, Philippe; Salanoubat, Marcel; Weissenbach, Jean

    2008-01-01

    We have constructed a collection of single-gene deletion mutants for all dispensable genes of the soil bacterium Acinetobacter baylyi ADP1. A total of 2594 deletion mutants were obtained, whereas 499 (16%) were not, and are therefore candidate essential genes for life on minimal medium. This essentiality data set is 88% consistent with the Escherichia coli data set inferred from the Keio mutant collection profiled for growth on minimal medium, while 80% of the orthologous genes described as essential in Pseudomonas aeruginosa are also essential in ADP1. Several strategies were undertaken to investigate ADP1 metabolism: (1) searching for discrepancies between our essentiality data and current metabolic knowledge, (2) comparing this essentiality data set to those from other organisms, and (3) systematic phenotyping of the mutant collection on a variety of carbon sources (quinate, 2,3-butanediol, glucose, etc.). This collection provides a new resource for the study of gene function by forward and reverse genetic approaches and constitutes a robust experimental data source for systems biology approaches. PMID:18319726
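    The reported 16% can be reproduced from the two counts in the abstract, assuming (this is an interpretation, not stated explicitly) that the percentage is taken over all targeted genes, i.e. mutants obtained plus mutants not obtained.

```python
# Counts taken from the abstract.
obtained = 2594      # deletion mutants successfully constructed
not_obtained = 499   # candidate essential genes (no mutant obtained)

# Assumption: the denominator is every gene for which a deletion was attempted.
targeted = obtained + not_obtained
essential_fraction = not_obtained / targeted

print(f"{targeted} genes targeted, {essential_fraction:.1%} candidate essential")
```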

  1. MorphDB: Prioritizing Genes for Specialized Metabolism Pathways and Gene Ontology Categories in Plants.

    PubMed

    Zwaenepoel, Arthur; Diels, Tim; Amar, David; Van Parys, Thomas; Shamir, Ron; Van de Peer, Yves; Tzfadia, Oren

    2018-01-01

    Recent times have seen an enormous growth of "omics" data, of which high-throughput gene expression data are arguably the most important from a functional perspective. Despite huge improvements in computational techniques for the functional classification of gene sequences, common similarity-based methods often fall short of providing full and reliable functional information. Recently, the combination of comparative genomics with approaches in functional genomics has received considerable interest for gene function analysis, leveraging both gene expression based guilt-by-association methods and annotation efforts in closely related model organisms. Besides the identification of missing genes in pathways, these methods also typically enable the discovery of biological regulators (i.e., transcription factors or signaling genes). A previously built guilt-by-association method is MORPH, which was proven to be an efficient algorithm that performs particularly well in identifying and prioritizing missing genes in plant metabolic pathways. Here, we present MorphDB, a resource where MORPH-based candidate genes for large-scale functional annotations (Gene Ontology, MapMan bins) are integrated across multiple plant species. Besides a gene centric query utility, we present a comparative network approach that enables researchers to efficiently browse MORPH predictions across functional gene sets and species, facilitating efficient gene discovery and candidate gene prioritization. MorphDB is available at http://bioinformatics.psb.ugent.be/webtools/morphdb/morphDB/index/. We also provide a toolkit, named "MORPH bulk" (https://github.com/arzwa/morph-bulk), for running MORPH in bulk mode on novel data sets, enabling researchers to apply MORPH to their own species of interest.

  2. A critical assessment of Mus musculus gene function prediction using integrated genomic evidence

    PubMed Central

    Peña-Castillo, Lourdes; Tasan, Murat; Myers, Chad L; Lee, Hyunju; Joshi, Trupti; Zhang, Chao; Guan, Yuanfang; Leone, Michele; Pagnani, Andrea; Kim, Wan Kyu; Krumpelman, Chase; Tian, Weidong; Obozinski, Guillaume; Qi, Yanjun; Mostafavi, Sara; Lin, Guan Ning; Berriz, Gabriel F; Gibbons, Francis D; Lanckriet, Gert; Qiu, Jian; Grant, Charles; Barutcuoglu, Zafer; Hill, David P; Warde-Farley, David; Grouios, Chris; Ray, Debajyoti; Blake, Judith A; Deng, Minghua; Jordan, Michael I; Noble, William S; Morris, Quaid; Klein-Seetharaman, Judith; Bar-Joseph, Ziv; Chen, Ting; Sun, Fengzhu; Troyanskaya, Olga G; Marcotte, Edward M; Xu, Dong; Hughes, Timothy R; Roth, Frederick P

    2008-01-01

    Background: Several years after sequencing the human genome and the mouse genome, much remains to be discovered about the functions of most human and mouse genes. Computational prediction of gene function promises to help focus limited experimental resources on the most likely hypotheses. Several algorithms using diverse genomic data have been applied to this task in model organisms; however, the performance of such approaches in mammals has not yet been evaluated. Results: In this study, a standardized collection of mouse functional genomic data was assembled; nine bioinformatics teams used this data set to independently train classifiers and generate predictions of function, as defined by Gene Ontology (GO) terms, for 21,603 mouse genes; and the best performing submissions were combined in a single set of predictions. We identified strengths and weaknesses of current functional genomic data sets and compared the performance of function prediction algorithms. This analysis inferred functions for 76% of mouse genes, including 5,000 currently uncharacterized genes. At a recall rate of 20%, a unified set of predictions averaged 41% precision, with 26% of GO terms achieving a precision better than 90%. Conclusion: We performed a systematic evaluation of diverse, independently developed computational approaches for predicting gene function from heterogeneous data sources in mammals. The results show that currently available data for mammals allows predictions with both breadth and accuracy. Importantly, many highly novel predictions emerge for the 38% of mouse genes that remain uncharacterized. PMID:18613946
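
    The headline figure above (41% precision at a recall rate of 20%) is one point on a precision-recall curve. A minimal sketch of how such a point can be read off a ranked list of predictions for a single GO term follows; the scores and labels are invented for illustration and are not data from the study.

```python
def precision_at_recall(scores, labels, target_recall):
    """Precision at the first rank cutoff whose recall reaches target_recall.

    scores: prediction confidences (higher means more confident).
    labels: 1 if the gene truly carries the GO term, else 0.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("no positive labels")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    for k, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        if true_pos / total_pos >= target_recall:
            return true_pos / k
    return true_pos / len(ranked)


# Hypothetical predictions for ten genes, five of which truly have the term.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]
print(precision_at_recall(scores, labels, 0.2))
print(precision_at_recall(scores, labels, 0.6))
```

    Averaging this quantity over many GO terms would give a summary comparable in spirit to the one reported, although the exact averaging scheme used in the assessment is not described in the abstract.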

  3. STBase: One Million Species Trees for Comparative Biology

    PubMed Central

    McMahon, Michelle M.; Deepak, Akshay; Fernández-Baca, David; Boss, Darren; Sanderson, Michael J.

    2015-01-01

    Comprehensively sampled phylogenetic trees provide the most compelling foundations for strong inferences in comparative evolutionary biology. Mismatches are common, however, between the taxa for which comparative data are available and the taxa sampled by published phylogenetic analyses. Moreover, many published phylogenies are gene trees, which cannot always be adapted immediately for species level comparisons because of discordance, gene duplication, and other confounding biological processes. A new database, STBase, lets comparative biologists quickly retrieve species level phylogenetic hypotheses in response to a query list of species names. The database consists of 1 million single- and multi-locus data sets, each with a confidence set of 1000 putative species trees, computed from GenBank sequence data for 413,000 eukaryotic taxa. Two bodies of theoretical work are leveraged to aid in the assembly of multi-locus concatenated data sets for species tree construction. First, multiply labeled gene trees are pruned to conflict-free singly-labeled species-level trees that can be combined between loci. Second, impacts of missing data in multi-locus data sets are ameliorated by assembling only decisive data sets. Data sets overlapping with the user’s query are ranked using a scheme that depends on user-provided weights for tree quality and for taxonomic overlap of the tree with the query. Retrieval times are independent of the size of the database, typically a few seconds. Tree quality is assessed by a real-time evaluation of bootstrap support on just the overlapping subtree. Associated sequence alignments, tree files and metadata can be downloaded for subsequent analysis. STBase provides a tool for comparative biologists interested in exploiting the most relevant sequence data available for the taxa of interest. It may also serve as a prototype for future species tree oriented databases and as a resource for assembly of larger species phylogenies from precomputed trees. PMID:25679219

  4. Genomic Analysis Reveals Contrasting PIFq Contribution to Diurnal Rhythmic Gene Expression in PIF-Induced and -Repressed Genes.

    PubMed

    Martin, Guiomar; Soy, Judit; Monte, Elena

    2016-01-01

    Members of the PIF quartet (PIFq; PIF1, PIF3, PIF4, and PIF5) collectively contribute to induce growth in Arabidopsis seedlings under short day (SD) conditions, specifically promoting elongation at dawn. Their action involves the direct regulation of growth-related and hormone-associated genes. However, a comprehensive definition of the PIFq-regulated transcriptome under SD is still lacking. We have recently shown that SD and free-running (LL) conditions correspond to "growth" and "no growth" conditions, respectively, correlating with greater abundance of PIF protein in SD. Here, we present a genomic analysis whereby we first define SD-regulated genes at dawn compared to LL in the wild type, followed by identification of those SD-regulated genes whose expression depends on the presence of PIFq. By using this sequential strategy, we have identified 349 PIF/SD-regulated genes, approximately 55% induced and 42% repressed by both SD and PIFq. Comparison with available databases indicates that PIF/SD-induced and PIF/SD-repressed sets are differently phased at dawn and mid-morning, respectively. In addition, we found that whereas rhythmicity of the PIF/SD-induced gene set is lost in LL, most PIF/SD-repressed genes keep their rhythmicity in LL, suggesting differential regulation of both gene sets by the circadian clock. Moreover, we also uncovered distinct overrepresented functions in the induced and repressed gene sets, in accord with previous studies in other examined PIF-regulated processes. Interestingly, promoter analyses showed that, whereas PIF/SD-induced genes are enriched in direct PIF targets, PIF/SD-repressed genes are mostly indirectly regulated by the PIFs and might be more enriched in ABA-regulated genes.

  5. Chilling Affects Phytohormone and Post-Embryonic Development Pathways during Bud Break and Fruit Set in Apple (Malus domestica Borkh.)

    PubMed Central

    Kumar, Gulshan; Gupta, Khushboo; Pathania, Shivalika; Swarnkar, Mohit Kumar; Rattan, Usha Kumari; Singh, Gagandeep; Sharma, Ram Kumar; Singh, Anil Kumar

    2017-01-01

    The availability of sufficient chilling during bud dormancy plays an important role in the subsequent yield and quality of apple fruit, whereas insufficient chilling negatively impacts apple production. Transcriptome profiling during bud dormancy release and initial fruit set under low and high chill conditions was performed using RNA-seq. The comparatively high number of differentially expressed genes during bud break and fruit set under the high chill condition indicates that chilling availability is associated with transcriptional reorganization. The comparative analysis reveals the differential expression of genes involved in phytohormone metabolism, particularly for abscisic acid, gibberellic acid, ethylene, auxin and cytokinin. The expression of Dormancy Associated MADS-box, Flowering Locus C-like, Flowering Locus T-like and Terminal Flower 1-like genes was found to be modulated under differential chilling. The co-expression network analysis identified two high chill specific modules that were found to be enriched for “post-embryonic development” GO terms. The network analysis also identified hub genes including Early flowering 7, RAF10, ZEP4 and F-box, which may be involved in regulating chilling-mediated dormancy release and fruit set. The results of transcriptome and co-expression network analysis indicate that chilling availability mainly regulates phytohormone-related pathways and post-embryonic development during bud break. PMID:28198417

  6. Genome-Wide Comparative Gene Family Classification

    PubMed Central

    Frech, Christian; Chen, Nansheng

    2010-01-01

    Correct classification of genes into gene families is important for understanding gene function and evolution. Although gene families of many species have been resolved both computationally and experimentally with high accuracy, gene family classification in most newly sequenced genomes has not been done with the same high standard. This project has been designed to develop a strategy to effectively and accurately classify gene families across genomes. We first examine and compare the performance of computer programs developed for automated gene family classification. We demonstrate that some programs, including the hierarchical average-linkage clustering algorithm MC-UPGMA and the popular Markov clustering algorithm TRIBE-MCL, can reconstruct manual curation of gene families accurately. However, their performance is highly sensitive to parameter setting, i.e. different gene families require different program parameters for correct resolution. To circumvent the problem of parameterization, we have developed a comparative strategy for gene family classification. This strategy takes advantage of existing curated gene families of reference species to find suitable parameters for classifying genes in related genomes. To demonstrate the effectiveness of this novel strategy, we use TRIBE-MCL to classify chemosensory and ABC transporter gene families in C. elegans and its four sister species. We conclude that fully automated programs can establish biologically accurate gene families if parameterized accordingly. Comparative gene family classification finds optimal parameters automatically, thus allowing rapid insights into gene families of newly sequenced species. PMID:20976221

  7. Comparative genomic analysis by microbial COGs self-attraction rate.

    PubMed

    Santoni, Daniele; Romano-Spica, Vincenzo

    2009-06-21

    Whole genome analysis provides new perspectives to determine phylogenetic relationships among microorganisms. The availability of whole nucleotide sequences allows different levels of comparison among genomes by several approaches. In this work, self-attraction rates were considered for each cluster of orthologous groups of proteins (COGs) class in order to analyse gene aggregation levels in physical maps. Phylogenetic relationships among microorganisms were obtained by comparing self-attraction coefficients. Eighteen-dimensional vectors were computed for a set of 168 completely sequenced microbial genomes (19 archaea, 149 bacteria). The components of the vector represent the aggregation rate of the genes belonging to each of 18 COGs classes. Genes involved in nonessential functions or related to environmental conditions showed the highest aggregation rates. In contrast, genes involved in basic cellular tasks showed a more uniform distribution along the genome, except for translation genes. The self-attraction clustering approach allowed classification of Proteobacteria, Bacilli and other species belonging to Firmicutes. Rearrangement and Lateral Gene Transfer events may influence divergences from classical taxonomy. Each set of COG classes' aggregation values represents an intrinsic property of the microbial genome. This novel approach provides a new point of view for whole genome analysis and bacterial characterization.

  8. Transcriptome profiling reveals novel BMI- and sex-specific gene expression signatures for human cardiac hypertrophy.

    PubMed

    Newman, Mackenzie S; Nguyen, Tina; Watson, Michael J; Hull, Robert W; Yu, Han-Gang

    2017-07-01

    How obesity or sex may affect the gene expression profiles of human cardiac hypertrophy is unknown. We hypothesized that body-mass index (BMI) and sex can affect gene expression profiles of cardiac hypertrophy. Human heart tissues were grouped according to sex (male, female), BMI (lean, <25 kg/m^2; obese, >30 kg/m^2), or left ventricular hypertrophy (LVH) and non-LVH nonfailed controls (NF). We identified 24 differentially expressed (DE) genes comparing female with male samples. In the obese subgroup, there were 236 DE genes comparing LVH with NF; in the lean subgroup, there were seven DE genes comparing LVH with NF. In the female subgroup, we identified 1,320 significant genes comparing LVH with NF; in the male subgroup, there were 1,383 significant genes comparing LVH with NF. There were seven significant genes comparing obese LVH with lean NF; comparing male obese LVH with male lean NF samples we found 106 significant genes; comparing female obese LVH with male lean NF, we found no significant genes. Using absolute value of log2 fold-change > 2 or extremely small P value (10^-20) as a criterion, we identified nine significant genes (HBA1, HBB, HIST1H2AC, GSTT1, MYL7, NPPA, NPPB, PDK4, PLA2G2A) in LVH, also found in published data set for ischemic and dilated cardiomyopathy in heart failure. We identified a potential gene expression signature that distinguishes between patients with high BMI or between men and women with cardiac hypertrophy. Expression of established biomarkers atrial natriuretic peptide A (NPPA) and B (NPPB) was already significantly increased in hypertrophy compared with controls. Copyright © 2017 the American Physiological Society.
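
    The fold-change part of the criterion above (absolute log2 fold-change greater than 2) can be sketched as follows. The gene labels, expression values, and the pseudocount are assumptions made for illustration; the study's normalization and testing pipeline are not described in the abstract.

```python
import math


def log2_fold_change(case_mean, control_mean, pseudocount=1.0):
    """log2 ratio of mean expression; the pseudocount (an assumption) avoids division by zero."""
    return math.log2((case_mean + pseudocount) / (control_mean + pseudocount))


def select_genes(case, control, threshold=2.0):
    """Keep genes whose absolute log2 fold-change exceeds the threshold."""
    selected = {}
    for gene in case.keys() & control.keys():
        lfc = log2_fold_change(case[gene], control[gene])
        if abs(lfc) > threshold:
            selected[gene] = lfc
    return selected


# Hypothetical mean expression values (arbitrary units).
lvh = {"GENE_A": 79.0, "GENE_B": 10.0, "GENE_C": 1.0}
nf = {"GENE_A": 9.0, "GENE_B": 9.0, "GENE_C": 31.0}
print(select_genes(lvh, nf))
```

    A threshold of 2 on the log2 scale corresponds to at least a fourfold change in either direction, which is why `GENE_B`, with a ratio close to 1, is dropped.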

  9. Incorporating networks in a probabilistic graphical model to find drivers for complex human diseases.

    PubMed

    Mezlini, Aziz M; Goldenberg, Anna

    2017-10-01

    Discovering genetic mechanisms driving complex diseases is a hard problem. Existing methods often lack power to identify the set of responsible genes. Protein-protein interaction networks have been shown to boost power when detecting gene-disease associations. We introduce a Bayesian framework, Conflux, to find disease associated genes from exome sequencing data using networks as a prior. There are two main advantages to using networks within a probabilistic graphical model. First, networks are noisy and incomplete, a substantial impediment to gene discovery. Incorporating networks into the structure of a probabilistic model for gene inference has less impact on the solution than relying on the noisy network structure directly. Second, using a Bayesian framework we can keep track of the uncertainty of each gene being associated with the phenotype rather than returning a fixed list of genes. We first show that using networks clearly improves gene detection compared to individual gene testing. We then show consistently improved performance of Conflux compared to the state-of-the-art diffusion network-based method Hotnet2 and a variety of other network and variant aggregation methods, using randomly generated and literature-reported gene sets. We test Hotnet2 and Conflux on several network configurations to reveal biases and patterns of false positives and false negatives in each case. Our experiments show that our novel Bayesian framework Conflux incorporates many of the advantages of the current state-of-the-art methods, while offering more flexibility and improved power in many gene-disease association scenarios.

  10. PSAT: A web tool to compare genomic neighborhoods of multiple prokaryotic genomes

    PubMed Central

    Fong, Christine; Rohmer, Laurence; Radey, Matthew; Wasnick, Michael; Brittnacher, Mitchell J

    2008-01-01

    Background: The conservation of gene order among prokaryotic genomes can provide valuable insight into gene function, protein interactions, or events by which genomes have evolved. Although some tools are available for visualizing and comparing the order of genes between genomes of study, few support an efficient and organized analysis between large numbers of genomes. The Prokaryotic Sequence homology Analysis Tool (PSAT) is a web tool for comparing gene neighborhoods among multiple prokaryotic genomes. Results: PSAT utilizes a database that is preloaded with gene annotation, BLAST hit results, and gene-clustering scores designed to help identify regions of conserved gene order. Researchers use the PSAT web interface to find a gene of interest in a reference genome and efficiently retrieve the sequence homologs found in other bacterial genomes. The tool generates a graphic of the genomic neighborhood surrounding the selected gene and the corresponding regions for its homologs in each comparison genome. Homologs in each region are color coded to assist users with analyzing gene order among various genomes. In contrast to common comparative analysis methods that filter sequence homolog data based on alignment score cutoffs, PSAT leverages gene context information for homologs, including those with weak alignment scores, enabling a more sensitive analysis. Features for constraining or ordering results are designed to help researchers browse results from large numbers of comparison genomes in an organized manner. PSAT has been demonstrated to be useful for helping to identify gene orthologs and potential functional gene clusters, and detecting genome modifications that may result in loss of function. Conclusion: PSAT allows researchers to investigate the order of genes within local genomic neighborhoods of multiple genomes. A PSAT web server for public use is available for performing analyses on a growing set of reference genomes through any web browser with no client side software setup or installation required. Source code is freely available to researchers interested in setting up a local version of PSAT for analysis of genomes not available through the public server. Access to the public web server and instructions for obtaining source code can be found at . PMID:18366802

  11. Gene context conservation of a higher order than operons.

    PubMed

    Lathe, W C; Snel, B; Bork, P

    2000-10-01

    Operons, co-transcribed and co-regulated contiguous sets of genes, are poorly conserved over short periods of evolutionary time. The gene order, gene content and regulatory mechanisms of operons can be very different, even in closely related species. Here, we present several lines of evidence which suggest that, although an operon and its individual genes and regulatory structures are rearranged when comparing the genomes of different species, this rearrangement is a conservative process. Genomic rearrangements invariably maintain individual genes in very specific functional and regulatory contexts. We call this conserved context an uber-operon.

  12. Improving information retrieval in functional analysis.

    PubMed

    Rodriguez, Juan C; González, Germán A; Fresno, Cristóbal; Llera, Andrea S; Fernández, Elmer A

    2016-12-01

    Transcriptome analysis is essential to understand the mechanisms regulating key biological processes and functions. The first step usually consists of identifying candidate genes; to find out which pathways are affected by those genes, however, functional analysis (FA) is mandatory. The most frequently used strategies for this purpose are Gene Set and Singular Enrichment Analysis (GSEA and SEA) over Gene Ontology. Several statistical methods have been developed and compared in terms of computational efficiency and/or statistical appropriateness. However, whether their results are similar or complementary, the sensitivity to parameter settings, or possible bias in the analyzed terms has not been addressed so far. Here, two GSEA and four SEA methods and their parameter combinations were evaluated in six datasets by comparing two breast cancer subtypes with well-known differences in genetic background and patient outcomes. We show that GSEA and SEA lead to different results depending on the chosen statistic, model and/or parameters. Both approaches provide complementary results from a biological perspective. Hence, an Integrative Functional Analysis (IFA) tool is proposed to improve information retrieval in FA. It provides a common gene expression analytic framework that grants a comprehensive and coherent analysis. Only a minimal user parameter setting is required, since the best SEA/GSEA alternatives are integrated. IFA utility was demonstrated by evaluating four prostate cancer and the TCGA breast cancer microarray datasets, which showed its biological generalization capabilities. Copyright © 2016 Elsevier Ltd. All rights reserved.
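
    For readers unfamiliar with SEA-style over-representation testing, one common choice of statistic is the upper tail of the hypergeometric distribution. The sketch below shows that calculation under stated assumptions: the abstract does not say which statistics the evaluated methods use, and the function name and the counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(universe_size, term_size, candidate_size, overlap):
    """P(X >= overlap) when candidate_size genes are drawn without replacement
    from a universe in which term_size genes are annotated to the GO term."""
    max_overlap = min(term_size, candidate_size)
    total = comb(universe_size, candidate_size)
    tail = sum(
        comb(term_size, k) * comb(universe_size - term_size, candidate_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical counts: 20 genes measured, 5 annotated to the term,
# 4 candidate genes, 3 of which carry the annotation.
p_value = hypergeom_enrichment_p(20, 5, 4, 3)
print(round(p_value, 4))
```

    GSEA-type methods differ in that they use the full ranked gene list rather than a fixed candidate set, which is one reason the two families can give complementary results.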

  13. Methods for Genome-Wide Analysis of Gene Expression Changes in Polyploids

    PubMed Central

    Wang, Jianlin; Lee, Jinsuk J.; Tian, Lu; Lee, Hyeon-Se; Chen, Meng; Rao, Sheetal; Wei, Edward N.; Doerge, R. W.; Comai, Luca; Jeffrey Chen, Z.

    2007-01-01

    Polyploidy is an evolutionary innovation, providing extra sets of genetic material for phenotypic variation and adaptation. It is predicted that changes of gene expression by genetic and epigenetic mechanisms are responsible for novel variation in nascent and established polyploids (Liu and Wendel, 2002; Osborn et al., 2003; Pikaard, 2001). Studying gene expression changes in allopolyploids is more complicated than in autopolyploids, because allopolyploids contain more than two sets of genomes originating from divergent, but related, species. Here we describe two methods that are applicable to the genome-wide analysis of gene expression differences resulting from genome duplication in autopolyploids or interactions between homoeologous genomes in allopolyploids. First, we describe an amplified fragment length polymorphism (AFLP)–complementary DNA (cDNA) display method that allows the discrimination of homoeologous loci based on restriction polymorphisms between the progenitors. Second, we describe microarray analyses that can be used to compare gene expression differences between the allopolyploids and respective progenitors using appropriate experimental design and statistical analysis. We demonstrate the utility of these two complementary methods and discuss the pros and cons of using the methods to analyze gene expression changes in autopolyploids and allopolyploids. Furthermore, we describe these methods in general terms to be of wider applicability for comparative gene expression in a variety of evolutionary, genetic, biological, and physiological contexts. PMID:15865985

  14. Application of community phylogenetic approaches to understand gene expression: differential exploration of venom gene space in predatory marine gastropods.

    PubMed

    Chang, Dan; Duda, Thomas F

    2014-06-05

    Predatory marine gastropods of the genus Conus exhibit substantial variation in venom composition both within and among species. Apart from mechanisms associated with extensive turnover of gene families and rapid evolution of genes that encode venom components ('conotoxins'), the evolution of distinct conotoxin expression patterns is an additional source of variation that may drive interspecific differences in the utilization of species' 'venom gene space'. To determine the evolution of expression patterns of venom genes of Conus species, we evaluated the expression of A-superfamily conotoxin genes of a set of closely related Conus species by comparing recovered transcripts of A-superfamily genes that were previously identified from the genomes of these species. We modified community phylogenetics approaches to incorporate phylogenetic history and disparity of genes and their expression profiles to determine patterns of venom gene space utilization. Less than half of the A-superfamily gene repertoire of these species is expressed, and only a few orthologous genes are coexpressed among species. Species exhibit substantially distinct expression strategies, with some expressing sets of closely related loci ('under-dispersed' expression of available genes) while others express sets of more disparate genes ('over-dispersed' expression). In addition, expressed genes show higher dN/dS values than either unexpressed or ancestral genes; this implies that expression exposes genes to selection and facilitates rapid evolution of these genes. Few recent lineage-specific gene duplicates are expressed simultaneously, suggesting that expression divergence among redundant gene copies may be established shortly after gene duplication. Our study demonstrates that venom gene space is explored differentially by Conus species, a process that effectively permits the independent and rapid evolution of venoms in these species.

  15. Role of Mesenchymal-Derived Stem Cells in Stimulating Dormant Tumor Cells to Proliferate and Form Clinical Metastases

    DTIC Science & Technology

    2016-07-01

    tumor dormancy we have determined the break in dormancy is dependent on collagen and other fibrotic extracellular matrix components for the induction... collagen to induce a break from dormancy compared to dormant D2.0R cells revealed a set of genes that overlap with published dormancy gene sets. We...dormant D2.0R”) and proliferate when cultured in matrigel supplemented with collagen type-1 (“proliferative D2.0R”) (Barkan, Cancer Research, 2008, 68

  16. An extended data mining method for identifying differentially expressed assay-specific signatures in functional genomic studies.

    PubMed

    Rollins, Derrick K; Teh, Ailing

    2010-12-17

    Microarray data sets provide relative expression levels for thousands of genes for a small number, in comparison, of different experimental conditions called assays. Data mining techniques are used to extract specific information of genes as they relate to the assays. The multivariate statistical technique of principal component analysis (PCA) has proven useful in providing effective data mining methods. This article extends the PCA approach of Rollins et al. to rank the genes in microarray data sets that are expressed most differently between two biologically different groupings of assays. This method is evaluated on real and simulated data and compared to a current approach on the basis of false discovery rate (FDR) and statistical power (SP), which is the ability to correctly identify important genes. This work developed and evaluated two new test statistics based on PCA and compared them to a popular method that is not PCA based. Both test statistics were found to be effective as evaluated in three case studies: (i) exposing E. coli cells to two different ethanol levels; (ii) application of myostatin to two groups of mice; and (iii) a simulated data study derived from the properties of (ii). The proposed method (PM) effectively identified critical genes in these studies based on comparison with the current method (CM). The simulation study supports higher identification accuracy for PM over CM for both proposed test statistics when the gene variance is constant and for one of the test statistics when the gene variance is non-constant. PM compares quite favorably to CM in terms of lower FDR and much higher SP. Thus, PM can be quite effective in producing accurate signatures from large microarray data sets for differential expression between assay groups identified in a preliminary step of the PCA procedure and is, therefore, recommended for use in these applications.

  17. An Independent Filter for Gene Set Testing Based on Spectral Enrichment.

    PubMed

    Frost, H Robert; Li, Zhigang; Asselbergs, Folkert W; Moore, Jason H

    2015-01-01

    Gene set testing has become an indispensable tool for the analysis of high-dimensional genomic data. An important motivation for testing gene sets, rather than individual genomic variables, is to improve statistical power by reducing the number of tested hypotheses. Given the dramatic growth in common gene set collections, however, testing is often performed with nearly as many gene sets as underlying genomic variables. To address the challenge to statistical power posed by large gene set collections, we have developed spectral gene set filtering (SGSF), a novel technique for independent filtering of gene set collections prior to gene set testing. The SGSF method uses as a filter statistic the p-value measuring the statistical significance of the association between each gene set and the sample principal components (PCs), taking into account the significance of the associated eigenvalues. Because this filter statistic is independent of standard gene set test statistics under the null hypothesis but dependent under the alternative, the proportion of enriched gene sets is increased without impacting the type I error rate. As shown using simulated and real gene expression data, the SGSF algorithm accurately filters gene sets unrelated to the experimental outcome resulting in significantly increased gene set testing power.
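
    The power argument above rests on multiple-testing correction: the fewer gene sets that reach the testing stage, the milder the adjustment applied to each. The sketch below shows this effect with Benjamini-Hochberg adjustment and a generic filter. The p-values, filter statistics, and cutoff are invented, and the code does not implement the SGSF filter statistic itself.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def filter_then_adjust(test_p, filter_stat, cutoff):
    """Drop gene sets whose filter statistic exceeds cutoff, then adjust the rest."""
    kept = [name for name in test_p if filter_stat[name] <= cutoff]
    return dict(zip(kept, bh_adjust([test_p[name] for name in kept])))


# Hypothetical gene set test p-values and filter statistics.
test_p = {"set_a": 0.01, "set_b": 0.04, "set_c": 0.5, "set_d": 0.9}
filter_stat = {"set_a": 0.001, "set_b": 0.02, "set_c": 0.6, "set_d": 0.8}

unfiltered = dict(zip(test_p, bh_adjust(list(test_p.values()))))
filtered = filter_then_adjust(test_p, filter_stat, cutoff=0.05)
print(unfiltered["set_b"], filtered["set_b"])
```

    In this toy example the adjusted p-value for `set_b` falls once two gene sets are filtered out beforehand. The approach is only valid when the filter statistic is independent of the test statistic under the null hypothesis, which is the property the abstract claims for SGSF.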

  18. Comparative Metagenomics Revealed Commonly Enriched Gene Sets in Human Gut Microbiomes

    PubMed Central

    Kurokawa, Ken; Itoh, Takehiko; Kuwahara, Tomomi; Oshima, Kenshiro; Toh, Hidehiro; Toyoda, Atsushi; Takami, Hideto; Morita, Hidetoshi; Sharma, Vineet K.; Srivastava, Tulika P.; Taylor, Todd D.; Noguchi, Hideki; Mori, Hiroshi; Ogura, Yoshitoshi; Ehrlich, Dusko S.; Itoh, Kikuji; Takagi, Toshihisa; Sakaki, Yoshiyuki; Hayashi, Tetsuya; Hattori, Masahira

    2007-01-01

    Numerous microbes inhabit the human intestine, many of which are uncharacterized or uncultivable. They form a complex microbial community that deeply affects human physiology. To identify the genomic features common to all human gut microbiomes as well as those variable among them, we performed a large-scale comparative metagenomic analysis of fecal samples from 13 healthy individuals of various ages, including unweaned infants. We found that, while the gut microbiota from unweaned infants were simple and showed a high inter-individual variation in taxonomic and gene composition, those from adults and weaned children were more complex but showed a high functional uniformity regardless of age or sex. In searching for the genes over-represented in gut microbiomes, we identified 237 gene families commonly enriched in adult-type and 136 families in infant-type microbiomes, with a small overlap. An analysis of their predicted functions revealed various strategies employed by each type of microbiota to adapt to its intestinal environment, suggesting that these gene sets encode the core functions of adult and infant-type gut microbiota. By analysing the orphan genes, 647 new gene families were identified to be exclusively present in human intestinal microbiomes. In addition, we discovered a conjugative transposon family explosively amplified in human gut microbiomes, which strongly suggests that the intestine is a ‘hot spot’ for horizontal gene transfer between microbes. PMID:17916580

  19. A computational approach to identify cellular heterogeneity and tissue-specific gene regulatory networks.

    PubMed

    Jambusaria, Ankit; Klomp, Jeff; Hong, Zhigang; Rafii, Shahin; Dai, Yang; Malik, Asrar B; Rehman, Jalees

    2018-06-07

    The heterogeneity of cells across tissue types represents a major challenge for studying biological mechanisms as well as for therapeutic targeting of distinct tissues. Computational prediction of tissue-specific gene regulatory networks may provide important insights into the mechanisms underlying the cellular heterogeneity of cells in distinct organs and tissues. Using three pathway analysis techniques, gene set enrichment analysis (GSEA), parametric analysis of gene set enrichment (PGSEA), alongside our novel model (HeteroPath), which assesses heterogeneously upregulated and downregulated genes within the context of pathways, we generated distinct tissue-specific gene regulatory networks. We analyzed gene expression data derived from freshly isolated heart, brain, and lung endothelial cells and populations of neurons in the hippocampus, cingulate cortex, and amygdala. In both datasets, we found that HeteroPath segregated the distinct cellular populations by identifying regulatory pathways that were not identified by GSEA or PGSEA. Using simulated datasets, HeteroPath demonstrated robustness that was comparable to what was seen using existing gene set enrichment methods. Furthermore, we generated tissue-specific gene regulatory networks involved in vascular heterogeneity and neuronal heterogeneity by performing motif enrichment of the heterogeneous genes identified by HeteroPath and linking the enriched motifs to regulatory transcription factors in the ENCODE database. HeteroPath assesses contextual bidirectional gene expression within pathways and thus allows for transcriptomic assessment of cellular heterogeneity. Unraveling tissue-specific heterogeneity of gene expression can lead to a better understanding of the molecular underpinnings of tissue-specific phenotypes.

  20. Down-weighting overlapping genes improves gene set analysis

    PubMed Central

    2012-01-01

    Background The identification of gene sets that are significantly impacted in a given condition based on microarray data is a crucial step in current life science research. Most gene set analysis methods treat genes equally, regardless of how specific they are to a given gene set. Results In this work we propose a new gene set analysis method that computes a gene set score as the mean of absolute values of weighted moderated gene t-scores. The gene weights are designed to emphasize the genes appearing in few gene sets, versus genes that appear in many gene sets. We demonstrate the usefulness of the method when analyzing gene sets that correspond to the KEGG pathways, and hence we called our method Pathway Analysis with Down-weighting of Overlapping Genes (PADOG). Unlike most gene set analysis methods which are validated through the analysis of 2-3 data sets followed by a human interpretation of the results, the validation employed here uses 24 different data sets and a completely objective assessment scheme that makes minimal assumptions and eliminates the need for possibly biased human assessments of the analysis results. Conclusions PADOG significantly improves gene set ranking and boosts sensitivity of analysis using information already available in the gene expression profiles and the collection of gene sets to be analyzed. The advantages of PADOG over other existing approaches are shown to be stable to changes in the database of gene sets to be analyzed. PADOG was implemented as an R package available at: http://bioinformaticsprb.med.wayne.edu/PADOG/ or http://www.bioconductor.org. PMID:22713124
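
    The scoring idea in this record (mean of absolute weighted t-scores, with genes that occur in few sets weighted more heavily) can be illustrated with a minimal Python sketch. The abstract does not give the weight formula, so the square-root weighting in [1, 2] used below is an assumption, as are the function names and the toy data; moderated t-scores are taken as precomputed inputs. This is not the PADOG R package itself.

```python
from collections import Counter
from math import sqrt

def gene_weights(gene_sets):
    """Assumed weighting: 2 for genes in the fewest sets, 1 for genes in the most."""
    freq = Counter(g for genes in gene_sets.values() for g in set(genes))
    lo, hi = min(freq.values()), max(freq.values())
    if hi == lo:  # every gene is equally frequent, so no down-weighting applies
        return {g: 1.0 for g in freq}
    return {g: 1.0 + sqrt((hi - f) / (hi - lo)) for g, f in freq.items()}

def set_scores(t_scores, gene_sets):
    """Gene set score = mean of |weight * t| over member genes that have a t-score."""
    w = gene_weights(gene_sets)
    scores = {}
    for name, genes in gene_sets.items():
        vals = [abs(t_scores[g]) * w[g] for g in genes if g in t_scores]
        scores[name] = sum(vals) / len(vals) if vals else float("nan")
    return scores

# Hypothetical toy data: g2 belongs to all three sets, so it is down-weighted.
toy_sets = {"A": ["g1", "g2"], "B": ["g2", "g3"], "C": ["g2", "g4"]}
toy_t = {"g1": 2.0, "g2": -3.0, "g3": 0.5, "g4": -1.0}
toy_weights = gene_weights(toy_sets)
toy_scores = set_scores(toy_t, toy_sets)
```

    In this toy example the shared gene g2 receives weight 1 and the set-specific genes receive weight 2, so a set's score is driven more by its specific genes than by a strongly changed gene that appears everywhere.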

  1. Novel gene sets improve set-level classification of prokaryotic gene expression data.

    PubMed

    Holec, Matěj; Kuželka, Ondřej; Železný, Filip

    2015-10-28

    Set-level classification of gene expression data has received significant attention recently. In this setting, high-dimensional vectors of features corresponding to genes are converted into lower-dimensional vectors of features corresponding to biologically interpretable gene sets. The dimensionality reduction brings the promise of a decreased risk of overfitting, potentially resulting in improved accuracy of the learned classifiers. However, recent empirical research has not confirmed this expectation. Here we hypothesize that the reported unfavorable classification results in the set-level framework were due to the adoption of unsuitable gene sets defined typically on the basis of the Gene Ontology and the KEGG database of metabolic networks. We explore an alternative approach to defining gene sets, based on regulatory interactions, which we expect to collect genes with more correlated expression. We hypothesize that such more correlated gene sets will make it possible to learn more accurate classifiers. We define two families of gene sets using information on regulatory interactions, and evaluate them on phenotype-classification tasks using public prokaryotic gene expression data sets. From each of the two gene-set families, we first select the best-performing subtype. The two selected subtypes are then evaluated on independent (testing) data sets against state-of-the-art gene sets and against the conventional gene-level approach. The novel gene sets are indeed more correlated than the conventional ones, and lead to significantly more accurate classifiers. Novel gene sets defined on the basis of regulatory interactions improve set-level classification of gene expression data. The experimental scripts and other material needed to reproduce the experiments are available at http://ida.felk.cvut.cz/novelgenesets.tar.gz.
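
    The conversion from gene-level to set-level features described in this record can be sketched as follows. The abstract does not say how member genes are aggregated, so the per-set mean used here is an assumption, and the function name, gene names and set names are hypothetical.

```python
def to_set_level(sample, gene_sets):
    """Map one sample (gene -> expression) to set-level features (set -> mean expression).

    Genes missing from the sample are skipped; sets with no measured genes are omitted.
    """
    features = {}
    for name, genes in gene_sets.items():
        vals = [sample[g] for g in genes if g in sample]
        if vals:
            features[name] = sum(vals) / len(vals)
    return features

# Hypothetical sample with four measured genes and two regulon-style gene sets.
toy_sample = {"geneA": 2.0, "geneB": 4.0, "geneC": 1.0, "geneD": 3.0}
toy_regulons = {
    "regulon_1": ["geneA", "geneB"],
    "regulon_2": ["geneC", "geneD", "geneE"],  # geneE is not measured
    "regulon_3": ["geneF"],                    # no measured genes, so dropped
}
toy_features = to_set_level(toy_sample, toy_regulons)
```

    The resulting feature vector has one entry per gene set rather than one per gene, which is the dimensionality reduction the abstract refers to; a classifier would then be trained on these set-level vectors.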

  2. Three gene expression vector sets for concurrently expressing multiple genes in Saccharomyces cerevisiae.

    PubMed

    Ishii, Jun; Kondo, Takashi; Makino, Harumi; Ogura, Akira; Matsuda, Fumio; Kondo, Akihiko

    2014-05-01

    Yeast has the potential to be used in bulk-scale fermentative production of fuels and chemicals due to its tolerance for low pH and robustness for autolysis. However, expression of multiple external genes in one host yeast strain is considerably labor-intensive due to the lack of polycistronic transcription. To promote the metabolic engineering of yeast, we generated systematic and convenient genetic engineering tools to express multiple genes in Saccharomyces cerevisiae. We constructed a series of multi-copy and integration vector sets for concurrently expressing two or three genes in S. cerevisiae by embedding three classical promoters. The comparative expression capabilities of the constructed vectors were monitored with green fluorescent protein, and the concurrent expression of genes was monitored with three different fluorescent proteins. Our multiple gene expression tool will be helpful to the advanced construction of genetically engineered yeast strains in a variety of research fields other than metabolic engineering. © 2014 Federation of European Microbiological Societies. Published by John Wiley & Sons Ltd. All rights reserved.

  3. GEM-TREND: a web tool for gene expression data mining toward relevant network discovery

    PubMed Central

    Feng, Chunlai; Araki, Michihiro; Kunimoto, Ryo; Tamon, Akiko; Makiguchi, Hiroki; Niijima, Satoshi; Tsujimoto, Gozoh; Okuno, Yasushi

    2009-01-01

    Background DNA microarray technology provides us with a first step toward the goal of uncovering gene functions on a genomic scale. In recent years, vast amounts of gene expression data have been collected, much of which are available in public databases, such as the Gene Expression Omnibus (GEO). To date, most researchers have been manually retrieving data from databases through web browsers using accession numbers (IDs) or keywords, but gene-expression patterns are not considered when retrieving such data. The Connectivity Map was recently introduced to compare gene expression data by introducing gene-expression signatures (represented by a set of genes with up- or down-regulated labels according to their biological states) and is available as a web tool for detecting similar gene-expression signatures from a limited data set (approximately 7,000 expression profiles representing 1,309 compounds). In order to support researchers to utilize the public gene expression data more effectively, we developed a web tool for finding similar gene expression data and generating its co-expression networks from a publicly available database. Results GEM-TREND, a web tool for searching gene expression data, allows users to search data from GEO using gene-expression signatures or gene expression ratio data as a query and retrieve gene expression data by comparing gene-expression pattern between the query and GEO gene expression data. The comparison methods are based on the nonparametric, rank-based pattern matching approach of Lamb et al. (Science 2006) with the additional calculation of statistical significance. The web tool was tested using gene expression ratio data randomly extracted from the GEO and with in-house microarray data, respectively. The results validated the ability of GEM-TREND to retrieve gene expression entries biologically related to a query from GEO. 
For further analysis, a network visualization interface is also provided, whereby genes and gene annotations are dynamically linked to external data repositories. Conclusion GEM-TREND was developed to retrieve gene expression data by comparing query gene-expression pattern with those of GEO gene expression data. It could be a very useful resource for finding similar gene expression profiles and constructing its gene co-expression networks from a publicly available database. GEM-TREND was designed to be user-friendly and is expected to support knowledge discovery. GEM-TREND is freely available at . PMID:19728865

  4. GEM-TREND: a web tool for gene expression data mining toward relevant network discovery.

    PubMed

    Feng, Chunlai; Araki, Michihiro; Kunimoto, Ryo; Tamon, Akiko; Makiguchi, Hiroki; Niijima, Satoshi; Tsujimoto, Gozoh; Okuno, Yasushi

    2009-09-03

    DNA microarray technology provides us with a first step toward the goal of uncovering gene functions on a genomic scale. In recent years, vast amounts of gene expression data have been collected, much of which are available in public databases, such as the Gene Expression Omnibus (GEO). To date, most researchers have been manually retrieving data from databases through web browsers using accession numbers (IDs) or keywords, but gene-expression patterns are not considered when retrieving such data. The Connectivity Map was recently introduced to compare gene expression data by introducing gene-expression signatures (represented by a set of genes with up- or down-regulated labels according to their biological states) and is available as a web tool for detecting similar gene-expression signatures from a limited data set (approximately 7,000 expression profiles representing 1,309 compounds). In order to support researchers to utilize the public gene expression data more effectively, we developed a web tool for finding similar gene expression data and generating its co-expression networks from a publicly available database. GEM-TREND, a web tool for searching gene expression data, allows users to search data from GEO using gene-expression signatures or gene expression ratio data as a query and retrieve gene expression data by comparing gene-expression pattern between the query and GEO gene expression data. The comparison methods are based on the nonparametric, rank-based pattern matching approach of Lamb et al. (Science 2006) with the additional calculation of statistical significance. The web tool was tested using gene expression ratio data randomly extracted from the GEO and with in-house microarray data, respectively. The results validated the ability of GEM-TREND to retrieve gene expression entries biologically related to a query from GEO. 
For further analysis, a network visualization interface is also provided, whereby genes and gene annotations are dynamically linked to external data repositories. GEM-TREND was developed to retrieve gene expression data by comparing query gene-expression pattern with those of GEO gene expression data. It could be a very useful resource for finding similar gene expression profiles and constructing its gene co-expression networks from a publicly available database. GEM-TREND was designed to be user-friendly and is expected to support knowledge discovery. GEM-TREND is freely available at http://cgs.pharm.kyoto-u.ac.jp/services/network.

  5. Learning contextual gene set interaction networks of cancer with condition specificity

    PubMed Central

    2013-01-01

    Background Identifying similarities and differences in the molecular constitutions of various types of cancer is one of the key challenges in cancer research. The appearances of a cancer depend on complex molecular interactions, including gene regulatory networks and gene-environment interactions. This complexity makes it challenging to decipher the molecular origin of the cancer. In recent years, many studies reported methods to uncover heterogeneous depictions of complex cancers, which are often categorized into different subtypes. The challenge is to identify diverse molecular contexts within a cancer, to relate them to different subtypes, and to learn underlying molecular interactions specific to molecular contexts so that we can recommend context-specific treatment to patients. Results In this study, we describe a novel method to discern molecular interactions specific to certain molecular contexts. Unlike conventional approaches to build modular networks of individual genes, our focus is to identify cancer-generic and subtype-specific interactions between contextual gene sets, where a contextual gene set is a gene set whose members share coherent transcriptional patterns across a subset of samples. We then apply a novel formulation for quantitating the effect of the samples from each subtype on the calculated strength of interactions observed. Two cancer data sets were analyzed to support the validity of condition-specificity of identified interactions. When compared to an existing approach, the proposed method was much more sensitive in identifying condition-specific interactions even in a heterogeneous data set. The results also revealed that network components specific to different types of cancer are related to different biological functions than cancer-generic network components. 
We found not only the results that are consistent with previous studies, but also new hypotheses on the biological mechanisms specific to certain cancer types that warrant further investigations. Conclusions The analysis on the contextual gene sets and characterization of networks of interaction composed of these sets discovered distinct functional differences underlying various types of cancer. The results show that our method successfully reveals many subtype-specific regions in the identified maps of biological contexts, which well represent biological functions that can be connected to specific subtypes. PMID:23418942

  6. Differential reconstructed gene interaction networks for deriving toxicity threshold in chemical risk assessment.

    PubMed

    Yang, Yi; Maxwell, Andrew; Zhang, Xiaowei; Wang, Nan; Perkins, Edward J; Zhang, Chaoyang; Gong, Ping

    2013-01-01

    Pathway alterations reflected as changes in gene expression regulation and gene interaction can result from cellular exposure to toxicants. Such information is often used to elucidate toxicological modes of action. From a risk assessment perspective, alterations in biological pathways are a rich resource for setting toxicant thresholds, which may be more sensitive and mechanism-informed than traditional toxicity endpoints. Here we developed a novel differential networks (DNs) approach to connect pathway perturbation with toxicity threshold setting. Our DNs approach consists of 6 steps: time-series gene expression data collection, identification of altered genes, gene interaction network reconstruction, differential edge inference, mapping of genes with differential edges to pathways, and establishment of causal relationships between chemical concentration and perturbed pathways. A one-sample Gaussian process model and a linear regression model were used to identify genes that exhibited significant profile changes across an entire time course and between treatments, respectively. Interaction networks of differentially expressed (DE) genes were reconstructed for different treatments using a state space model and then compared to infer differential edges/interactions. DE genes possessing differential edges were mapped to biological pathways in databases such as KEGG pathways. Using the DNs approach, we analyzed a time-series Escherichia coli live cell gene expression dataset consisting of 4 treatments (control, 10, 100, 1000 mg/L naphthenic acids, NAs) and 18 time points. Through comparison of reconstructed networks and construction of differential networks, 80 genes were identified as DE genes with a significant number of differential edges, and 22 KEGG pathways were altered in a concentration-dependent manner. 
Some of these pathways were perturbed to a degree as high as 70% even at the lowest exposure concentration, implying a high sensitivity of our DNs approach. Findings from this proof-of-concept study suggest that our approach has a great potential in providing a novel and sensitive tool for threshold setting in chemical risk assessment. In future work, we plan to analyze more time-series datasets with a full spectrum of concentrations and sufficient replications per treatment. The pathway alteration-derived thresholds will also be compared with those derived from apical endpoints such as cell growth rate.

  7. B-cell Ligand Processing Pathways Detected by Large-scale Comparative Analysis

    PubMed Central

    Towfic, Fadi; Gupta, Shakti; Honavar, Vasant; Subramaniam, Shankar

    2012-01-01

    The initiation of B-cell ligand recognition is a critical step for the generation of an immune response against foreign bodies. We sought to identify the biochemical pathways involved in the B-cell ligand recognition cascade and sets of ligands that trigger similar immunological responses. We utilized several comparative approaches to analyze the gene coexpression networks generated from a set of microarray experiments spanning 33 different ligands. First, we compared the degree distributions of the generated networks. Second, we utilized a pairwise network alignment algorithm, BiNA, to align the networks based on the hubs in the networks. Third, we aligned the networks based on a set of KEGG pathways. We summarized our results by constructing a consensus hierarchy of pathways that are involved in B cell ligand recognition. The resulting pathways were further validated through literature for their common physiological responses. Collectively, the results based on our comparative analyses of degree distributions, alignment of hubs, and alignment based on KEGG pathways provide a basis for molecular characterization of the immune response states of B-cells and demonstrate the power of comparative approaches (e.g., gene coexpression network alignment algorithms) in elucidating biochemical pathways involved in complex signaling events in cells. PMID:22917187

  8. Uterine responses to early pre-attachment embryos in the domestic dog and comparisons with other domestic animal species

    PubMed Central

    Graubner, Felix R.; Gram, Aykut; Kautz, Ewa; Bauersachs, Stefan; Aslan, Selim; Agaoglu, Ali R.; Boos, Alois

    2017-01-01

    In the dog, there is no luteolysis in the absence of pregnancy. Thus, this species lacks any anti-luteolytic endocrine signal as found in other species that modulate uterine function during the critical period of pregnancy establishment. Nevertheless, in the dog an embryo-maternal communication must occur in order to prevent rejection of embryos. Based on this hypothesis, we performed microarray analysis of canine uterine samples collected during pre-attachment phase (days 10-12) and in corresponding non-pregnant controls, in order to elucidate the embryo attachment signal. An additional goal was to identify differences in uterine responses to pre-attachment embryos between dogs and other mammalian species exhibiting different reproductive patterns with regard to luteolysis, implantation, and preparation for placentation. Therefore, the canine microarray data were compared with gene sets from pigs, cattle, horses, and humans. We found 412 genes differentially regulated between the two experimental groups. The functional terms most strongly enriched in response to pre-attachment embryos related to extracellular matrix function and remodeling, and to immune and inflammatory responses. Several candidate genes were validated by semi-quantitative PCR. When compared with other species, best matches were found with human and equine counterparts. Especially for the pig, the majority of overlapping genes showed opposite expression patterns. Interestingly, 1926 genes did not pair with any of the other gene sets. Using a microarray approach, we report the uterine changes in the dog driven by the presence of embryos and compare these results with datasets from other mammalian species, finding common-, contrary-, and exclusively canine-regulated genes. PMID:28651344

  9. Using Separation-of-Function Mutagenesis To Define the Full Spectrum of Activities Performed by the Est1 Telomerase Subunit in Vivo.

    PubMed

    Lubin, Johnathan W; Tucey, Timothy M; Lundblad, Victoria

    2018-01-01

    A leading objective in biology is to identify the complete set of activities that each gene performs in vivo. In this study, we have asked whether a genetic approach can provide an efficient means of achieving this goal, through the identification and analysis of a comprehensive set of separation-of-function (sof⁻) mutations in a gene. Toward this goal, we have subjected the Saccharomyces cerevisiae EST1 gene, which encodes a regulatory subunit of telomerase, to intensive mutagenesis (with an average coverage of one mutation for every 4.5 residues), using strategies that eliminated those mutations that disrupted protein folding/stability. The resulting set of sof⁻ mutations defined four biochemically distinct activities for the Est1 telomerase protein: two temporally separable steps in telomerase holoenzyme assembly, a telomerase recruitment activity, and a fourth newly discovered regulatory function. Although biochemically distinct, impairment of each of these four different activities nevertheless conferred a common phenotype (critically short telomeres) comparable to that of an est1-∆ null strain. This highlights the limitations of gene deletions, even for nonessential genes; we suggest that employing a representative set of sof⁻ mutations for each gene in future high- and low-throughput investigations will provide deeper insights into how proteins interact inside the cell. Copyright © 2018 by the Genetics Society of America.

  10. nGASP - the nematode genome annotation assessment project

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Coghlan, A; Fiedler, T J; McKay, S J

    2008-12-19

    While the C. elegans genome is extensively annotated, relatively little information is available for other Caenorhabditis species. The nematode genome annotation assessment project (nGASP) was launched to objectively assess the accuracy of protein-coding gene prediction software in C. elegans, and to apply this knowledge to the annotation of the genomes of four additional Caenorhabditis species and other nematodes. Seventeen groups worldwide participated in nGASP, and submitted 47 prediction sets for 10 Mb of the C. elegans genome. Predictions were compared to reference gene sets consisting of confirmed or manually curated gene models from WormBase. The most accurate gene-finders were 'combiner' algorithms, which made use of transcript- and protein-alignments and multi-genome alignments, as well as gene predictions from other gene-finders. Gene-finders that used alignments of ESTs, mRNAs and proteins came in second place. There was a tie for third place between gene-finders that used multi-genome alignments and ab initio gene-finders. The median gene level sensitivity of combiners was 78% and their specificity was 42%, which is nearly the same accuracy as reported for combiners in the human genome. C. elegans genes with exons of unusual hexamer content, as well as those with many exons, short exons, long introns, a weak translation start signal, weak splice sites, or poorly conserved orthologs were the most challenging for gene-finders.
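
    The gene-level sensitivity and specificity figures quoted in this record can be made concrete with a small Python sketch. It assumes the usual gene-finding convention, in which sensitivity is the fraction of reference genes that are predicted correctly and specificity is the fraction of predicted genes that are correct. The abstract does not state its matching rule, so the exact-match criterion, the function name and the toy gene models below are illustrative assumptions.

```python
def gene_level_accuracy(predicted, reference):
    """Return (sensitivity, specificity) under exact-match scoring.

    sensitivity = correct predictions / reference genes
    specificity = correct predictions / predicted genes (gene-finding convention)
    """
    predicted, reference = set(predicted), set(reference)
    correct = len(predicted & reference)
    sens = correct / len(reference) if reference else float("nan")
    spec = correct / len(predicted) if predicted else float("nan")
    return sens, spec

# Hypothetical gene models: (chromosome, strand, tuple of exon (start, end) pairs).
toy_reference = [
    ("I", "+", ((100, 200), (300, 400))),
    ("I", "-", ((900, 1200),)),
    ("II", "+", ((50, 80), (120, 300))),
    ("II", "+", ((1000, 1500),)),
]
toy_predicted = [
    ("I", "+", ((100, 200), (300, 400))),   # exact match
    ("I", "-", ((900, 1200),)),             # exact match
    ("II", "+", ((50, 80), (120, 310))),    # wrong exon end, counted as incorrect
    ("II", "+", ((1000, 1500),)),           # exact match
    ("III", "+", ((10, 90),)),              # no reference gene here
]
toy_sens, toy_spec = gene_level_accuracy(toy_predicted, toy_reference)
```

    With three of four reference genes recovered and three of five predictions correct, the toy values are a sensitivity of 0.75 and a specificity of 0.6, which shows how a predictor can recover most genes while still making many incorrect calls.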

  11. Using the gene ontology to scan multilevel gene sets for associations in genome wide association studies.

    PubMed

    Schaid, Daniel J; Sinnwell, Jason P; Jenkins, Gregory D; McDonnell, Shannon K; Ingle, James N; Kubo, Michiaki; Goss, Paul E; Costantino, Joseph P; Wickerham, D Lawrence; Weinshilboum, Richard M

    2012-01-01

    Gene-set analyses have been widely used in gene expression studies, and some of the developed methods have been extended to genome wide association studies (GWAS). Yet, complications due to linkage disequilibrium (LD) among single nucleotide polymorphisms (SNPs), and variable numbers of SNPs per gene and genes per gene-set, have plagued current approaches, often leading to ad hoc "fixes." To overcome some of the current limitations, we developed a general approach to scan GWAS SNP data for both gene-level and gene-set analyses, building on score statistics for generalized linear models, and taking advantage of the directed acyclic graph structure of the gene ontology when creating gene-sets. However, other types of gene-set structures can be used, such as the popular Kyoto Encyclopedia of Genes and Genomes (KEGG). Our approach combines SNPs into genes, and genes into gene-sets, but assures that positive and negative effects of genes on a trait do not cancel. To control for multiple testing of many gene-sets, we use an efficient computational strategy that accounts for LD and provides accurate step-down adjusted P-values for each gene-set. Application of our methods to two different GWAS provide guidance on the potential strengths and weaknesses of our proposed gene-set analyses. © 2011 Wiley Periodicals, Inc.

  12. An improved method for functional similarity analysis of genes based on Gene Ontology.

    PubMed

    Tian, Zhen; Wang, Chunyu; Guo, Maozu; Liu, Xiaoyan; Teng, Zhixia

    2016-12-23

    Measures of gene functional similarity are essential tools for gene clustering, gene function prediction, evaluation of protein-protein interaction, disease gene prioritization and other applications. In recent years, many gene functional similarity methods have been proposed based on the semantic similarity of GO terms. However, these leading approaches may make error-prone judgments especially when they measure the specificity of GO terms as well as the IC of a term set. Therefore, how to estimate the gene functional similarity reliably is still a challenging problem. We propose WIS, an effective method to measure the gene functional similarity. First of all, WIS computes the IC of a term by employing its depth, the number of its ancestors as well as the topology of its descendants in the GO graph. Secondly, WIS calculates the IC of a term set by means of considering the weighted inherited semantics of terms. Finally, WIS estimates the gene functional similarity based on the IC overlap ratio of term sets. WIS is superior to some other representative measures on the experiments of functional classification of genes in a biological pathway, collaborative evaluation of GO-based semantic similarity measures, protein-protein interaction prediction and correlation with gene expression. Further analysis suggests that WIS takes fully into account the specificity of terms and the weighted inherited semantics of terms between GO terms. The proposed WIS method is an effective and reliable way to compare gene function. The web service of WIS is freely available at http://nclab.hit.edu.cn/WIS/.

  13. Gene expression patterns combined with bioinformatics analysis identify genes associated with cholangiocarcinoma.

    PubMed

    Li, Chen; Shen, Weixing; Shen, Sheng; Ai, Zhilong

    2013-12-01

    To explore the molecular mechanisms of cholangiocarcinoma (CC), microarray technology was used to find biomarkers for early detection and diagnosis. The gene expression profiles from 6 patients with CC and 5 normal controls were downloaded from Gene Expression Omnibus and compared. As a result, 204 differentially co-expressed genes (DCGs) in CC patients compared to normal controls were identified using a computational bioinformatics analysis. These genes were mainly involved in coenzyme metabolic process, peptidase activity and oxidation reduction. A regulatory network was constructed by mapping the DCGs to known regulation data. Four transcription factors, FOXC1, ZIC2, NKX2-2 and GCGR, were hub nodes in the network. In conclusion, this study provides a set of targets useful for future investigations into molecular biomarker studies. Copyright © 2013 Elsevier Ltd. All rights reserved.

  14. Screening of differentially expressed genes between multiple trauma patients with and without sepsis.

    PubMed

    Ji, S C; Pan, Y T; Lu, Q Y; Sun, Z Y; Liu, Y Z

    2014-03-17

    The purpose of this study was to identify critical genes associated with septic multiple trauma by comparing peripheral whole blood samples from multiple trauma patients with and without sepsis. A microarray data set was downloaded from the Gene Expression Omnibus (GEO) database. This data set included 70 samples, 36 from multiple trauma patients with sepsis and 34 from multiple trauma patients without sepsis (as a control set). The data were preprocessed, and differentially expressed genes (DEGs) were then screened for using packages of the R language. Functional analysis of DEGs was performed with DAVID. Interaction networks were then established for the most up- and down-regulated genes using HitPredict. Pathway-enrichment analysis was conducted for genes in the networks using WebGestalt. Fifty-eight DEGs were identified. The expression levels of PLAU (down-regulated) and MMP8 (up-regulated) presented the largest fold-changes, and interaction networks were established for these genes. Further analysis revealed that PLAT (plasminogen activator, tissue) and SERPINF2 (serpin peptidase inhibitor, clade F, member 2), which interact with PLAU, play important roles in the pathway of the component and coagulation cascade. We hypothesize that PLAU is a major regulator of the component and coagulation cascade, and down-regulation of PLAU results in dysfunction of the pathway, causing sepsis.

  15. Methodology and software to detect viral integration site hot-spots

    PubMed Central

    2011-01-01

    Background Modern gene therapy methods have limited control over where a therapeutic viral vector inserts into the host genome. Vector integration can activate local gene expression, which can cause cancer if the vector inserts near an oncogene. Viral integration hot-spots or 'common insertion sites' (CIS) are scrutinized to evaluate and predict patient safety. CIS are typically defined by a minimum density of insertions (such as 2-4 within a 30-100 kb region), which unfortunately depends on the total number of observed VIS. This is problematic for comparing hot-spot distributions across data sets and patients, where the VIS numbers may vary. Results We develop two new methods for defining hot-spots that are relatively independent of data set size. Both methods operate on distributions of VIS across consecutive 1 Mb 'bins' of the genome. The first method 'z-threshold' tallies the number of VIS per bin, converts these counts to z-scores, and applies a threshold to define high density bins. The second method 'BCP' applies a Bayesian change-point model to the z-scores to define hot-spots. The novel hot-spot methods are compared with a conventional CIS method using simulated data sets and data sets from five published human studies, including the X-linked ALD (adrenoleukodystrophy), CGD (chronic granulomatous disease) and SCID-X1 (X-linked severe combined immunodeficiency) trials. The BCP analysis of the human X-linked ALD data for two patients separately (774 and 1627 VIS) and combined (2401 VIS) resulted in 5-6 hot-spots covering 0.17-0.251% of the genome and containing 5.56-7.74% of the total VIS. In comparison, the CIS analysis resulted in 12-110 hot-spots covering 0.018-0.246% of the genome and containing 5.81-22.7% of the VIS, corresponding to a greater number of hot-spots as the data set size increased. Our hot-spot methods enable one to evaluate the extent of VIS clustering, and formally compare data sets in terms of hot-spot overlap. 
Finally, we show that the BCP hot-spots from the repopulating samples coincide with greater gene and CpG island density than the median genome density. Conclusions The z-threshold and BCP methods are useful for comparing hot-spot patterns across data sets of disparate sizes. The methodology and software provided here should enable one to study hot-spot conservation across a variety of VIS data sets and evaluate vector safety for gene therapy trials. PMID:21914224
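
    The z-threshold idea described in this abstract (count VIS per 1 Mb bin, convert counts to z-scores, keep bins above a cutoff) can be sketched as follows. This is a minimal illustration, not the authors' software: the function name, the input layout and the cutoff of 3.0 are assumptions, and the abstract does not state which cutoff was actually used.

```python
from collections import Counter
from statistics import mean, pstdev

BIN_SIZE = 1_000_000  # 1 Mb bins, as described in the abstract


def z_threshold_hotspots(sites, chrom_lengths, z_cutoff=3.0):
    """Return {(chrom, bin_index): z} for bins whose z-score exceeds z_cutoff.

    sites: iterable of (chrom, position) integration sites.
    chrom_lengths: {chrom: length_in_bp}; defines every bin, including empty ones.
    z_cutoff: hypothetical threshold chosen for illustration only.
    """
    counts = Counter((chrom, pos // BIN_SIZE) for chrom, pos in sites)
    # Enumerate all bins so that empty bins contribute zeros to the mean and SD.
    bins = [(chrom, i)
            for chrom, length in chrom_lengths.items()
            for i in range((length + BIN_SIZE - 1) // BIN_SIZE)]
    values = [counts.get(b, 0) for b in bins]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return {}  # no variation between bins, so no bin stands out
    z_scores = {b: (counts.get(b, 0) - mu) / sigma for b in bins}
    return {b: z for b, z in z_scores.items() if z > z_cutoff}
```

    Because the z-score rescales each bin by the mean and spread of all bins, the cutoff does not have to be retuned by hand when the total number of VIS changes, which is the property the abstract emphasizes.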

  16. DTFP-Growth: Dynamic Threshold-Based FP-Growth Rule Mining Algorithm Through Integrating Gene Expression, Methylation, and Protein-Protein Interaction Profiles.

    PubMed

    Mallik, Saurav; Bhadra, Tapas; Mukherji, Ayan

    2018-04-01

    Association rule mining is an important technique for identifying interesting relationships between gene pairs in a biological data set. Earlier methods basically work for a single biological data set, and, in most cases, a single minimum support cutoff can be applied globally, i.e., across all genesets/itemsets. To overcome this limitation, in this paper, we propose a dynamic threshold-based FP-growth rule mining algorithm that integrates gene expression, methylation and protein-protein interaction profiles based on weighted shortest distance to find novel associations among different pairs of genes in multi-view data sets. For this purpose, we introduce three new thresholds, namely, Distance-based Variable/Dynamic Supports (DVS), Distance-based Variable Confidences (DVC), and Distance-based Variable Lifts (DVL) for each rule by integrating the co-expression, co-methylation, and protein-protein interactions existing in the multi-omics data set. We develop the proposed algorithm utilizing these three novel multiple threshold measures. In the proposed algorithm, the values of DVS, DVC, and DVL are computed for each rule separately, and subsequently it is verified whether the support, confidence, and lift of each evolved rule are greater than or equal to the corresponding individual DVS, DVC, and DVL values, respectively. If all three conditions hold for a rule, the rule is treated as a resultant rule. One of the major advantages of the proposed method compared with other related state-of-the-art methods is that it considers both the quantitative and interactive significance among all pairwise genes belonging to each rule. Moreover, the proposed method generates fewer rules, takes less running time, and provides greater biological significance for the resultant top-ranking rules compared to previous methods.
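
    The per-rule acceptance test described above can be illustrated with a short sketch. Support, confidence and lift follow their standard association-rule definitions; the per-rule threshold values are passed in as plain numbers because the abstract does not give the distance-based formulas that produce them, so every threshold value used with this sketch is hypothetical.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent.

    transactions: list of sets of items (for example, gene labels per sample).
    """
    a, c = set(antecedent), set(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if a <= t)         # contain the antecedent
    n_c = sum(1 for t in transactions if c <= t)         # contain the consequent
    n_ac = sum(1 for t in transactions if (a | c) <= t)  # contain both
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift


def passes_dynamic_thresholds(metrics, dvs, dvc, dvl):
    """Keep a rule only if all three measures reach its own thresholds.

    dvs, dvc, dvl: per-rule thresholds; how they are derived from the
    weighted shortest distance is not shown here (assumed to be given).
    """
    support, confidence, lift = metrics
    return support >= dvs and confidence >= dvc and lift >= dvl
```

    The difference from a classical FP-growth filter is only in where the thresholds come from: here each rule carries its own triple instead of sharing one global cutoff.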

  17. GeneSigDB: a manually curated database and resource for analysis of gene expression signatures

    PubMed Central

    Culhane, Aedín C.; Schröder, Markus S.; Sultana, Razvan; Picard, Shaita C.; Martinelli, Enzo N.; Kelly, Caroline; Haibe-Kains, Benjamin; Kapushesky, Misha; St Pierre, Anne-Alyssa; Flahive, William; Picard, Kermshlise C.; Gusenleitner, Daniel; Papenhausen, Gerald; O'Connor, Niall; Correll, Mick; Quackenbush, John

    2012-01-01

    GeneSigDB (http://www.genesigdb.org or http://compbio.dfci.harvard.edu/genesigdb/) is a database of gene signatures that have been extracted and manually curated from the published literature. It provides a standardized resource of published prognostic, diagnostic and other gene signatures of cancer and related disease to the community so they can compare the predictive power of gene signatures or use these in gene set enrichment analysis. Since GeneSigDB release 1.0, we have expanded from 575 to 3515 gene signatures, which were collected and transcribed from 1604 published articles largely focused on gene expression in cancer, stem cells, immune cells, development and lung disease. We have made substantial upgrades to the GeneSigDB website to improve accessibility and usability, including adding a tag cloud browse function, faceted navigation and a ‘basket’ feature to store genes or gene signatures of interest. Users can analyze GeneSigDB gene signatures, or upload their own gene list, to identify gene signatures with significant gene overlap and results can be viewed on a dynamic editable heatmap that can be downloaded as a publication quality image. All data in GeneSigDB can be downloaded in numerous formats including .gmt file format for gene set enrichment analysis or as an R/Bioconductor data file. GeneSigDB is available from http://www.genesigdb.org. PMID:22110038
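
    As an illustration of working with signatures exported in .gmt format, the sketch below parses GMT lines (tab-separated: set name, description, then gene symbols) and scores the overlap between a query list and a signature. The hypergeometric tail probability is an assumption made for this example; the abstract does not say which overlap statistic the GeneSigDB website uses, and the function names are hypothetical.

```python
import math


def parse_gmt(lines):
    """Parse GMT text lines into {set_name: set_of_genes}."""
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        gene_sets[name] = {g for g in genes if g}
    return gene_sets


def overlap_pvalue(universe_size, signature_size, query_size, overlap):
    """P(X >= overlap) when query_size genes are drawn from the universe.

    Hypergeometric upper tail; the choice of this test is an assumption.
    """
    total = math.comb(universe_size, query_size)
    upper = min(signature_size, query_size)
    tail = sum(math.comb(signature_size, i)
               * math.comb(universe_size - signature_size, query_size - i)
               for i in range(overlap, upper + 1))
    return tail / total
```

    A typical use would be to call `parse_gmt` on an open file handle and then rank signatures by `overlap_pvalue` against an uploaded gene list; the result depends strongly on the assumed gene universe.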

  18. nGASP--the nematode genome annotation assessment project.

    PubMed

    Coghlan, Avril; Fiedler, Tristan J; McKay, Sheldon J; Flicek, Paul; Harris, Todd W; Blasiar, Darin; Stein, Lincoln D

    2008-12-19

    While the C. elegans genome is extensively annotated, relatively little information is available for other Caenorhabditis species. The nematode genome annotation assessment project (nGASP) was launched to objectively assess the accuracy of protein-coding gene prediction software in C. elegans, and to apply this knowledge to the annotation of the genomes of four additional Caenorhabditis species and other nematodes. Seventeen groups worldwide participated in nGASP, and submitted 47 prediction sets across 10 Mb of the C. elegans genome. Predictions were compared to reference gene sets consisting of confirmed or manually curated gene models from WormBase. The most accurate gene-finders were 'combiner' algorithms, which made use of transcript- and protein-alignments and multi-genome alignments, as well as gene predictions from other gene-finders. Gene-finders that used alignments of ESTs, mRNAs and proteins came in second. There was a tie for third place between gene-finders that used multi-genome alignments and ab initio gene-finders. The median gene level sensitivity of combiners was 78% and their specificity was 42%, which is nearly the same accuracy reported for combiners in the human genome. C. elegans genes with exons of unusual hexamer content, as well as those with unusually many exons, short exons, long introns, a weak translation start signal, weak splice sites, or poorly conserved orthologs posed the greatest difficulty for gene-finders. This experiment establishes a baseline of gene prediction accuracy in Caenorhabditis genomes, and has guided the choice of gene-finders for the annotation of newly sequenced genomes of Caenorhabditis and other nematode species. We have created new gene sets for C. briggsae, C. remanei, C. brenneri, C. japonica, and Brugia malayi using some of the best-performing gene-finders.
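
    The gene-level figures quoted above can be made concrete with a small sketch. It assumes the convention common in gene-prediction benchmarks, where sensitivity is the fraction of reference genes predicted exactly and "specificity" is the fraction of predicted genes that match a reference gene exactly; the exact matching rules used by nGASP are not given in the abstract, and the data layout here is hypothetical.

```python
def gene_level_accuracy(reference, predicted):
    """Return (sensitivity, specificity) at the gene level.

    reference, predicted: iterables of hashable gene models, for example
    tuples of (start, end) coding exons. A prediction counts as correct
    only if it is identical to a reference model (assumed matching rule).
    """
    ref, pred = set(reference), set(predicted)
    true_positives = len(ref & pred)
    sensitivity = true_positives / len(ref) if ref else 0.0
    specificity = true_positives / len(pred) if pred else 0.0
    return sensitivity, specificity
```

    Under this convention a gene-finder can have high sensitivity and low specificity at the same time, simply by predicting many extra models that do not match the reference set.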

  19. Figure 4 from Integrative Genomics Viewer: Visualizing Big Data | Office of Cancer Genomics

    Cancer.gov

    Gene-list view of genomic data. The gene-list view allows users to compare data across a set of loci. The data in this figure includes copy number, mutation, and clinical data from 202 glioblastoma samples from TCGA. Adapted from Figure 7; Thorvaldsdottir H et al. 2012

  20. TimesVector: a vectorized clustering approach to the analysis of time series transcriptome data from multiple phenotypes.

    PubMed

    Jung, Inuk; Jo, Kyuri; Kang, Hyejin; Ahn, Hongryul; Yu, Youngjae; Kim, Sun

    2017-12-01

    Identifying biologically meaningful gene expression patterns from time series gene expression data is important to understand the underlying biological mechanisms. To identify significantly perturbed gene sets between different phenotypes, analysis of time series transcriptome data requires consideration of time and sample dimensions. Thus, the analysis of such time series data seeks to search gene sets that exhibit similar or different expression patterns between two or more sample conditions, constituting the three-dimensional data, i.e. gene-time-condition. Computational complexity for analyzing such data is very high, compared to the already difficult NP-hard two dimensional biclustering algorithms. Because of this challenge, traditional time series clustering algorithms are designed to capture co-expressed genes with similar expression pattern in two sample conditions. We present a triclustering algorithm, TimesVector, specifically designed for clustering three-dimensional time series data to capture distinctively similar or different gene expression patterns between two or more sample conditions. TimesVector identifies clusters with distinctive expression patterns in three steps: (i) dimension reduction and clustering of time-condition concatenated vectors, (ii) post-processing clusters for detecting similar and distinct expression patterns and (iii) rescuing genes from unclassified clusters. Using four sets of time series gene expression data, generated by both microarray and high throughput sequencing platforms, we demonstrated that TimesVector successfully detected biologically meaningful clusters of high quality. TimesVector improved the clustering quality compared to existing triclustering tools and only TimesVector detected clusters with differential expression patterns across conditions successfully. The TimesVector software is available at http://biohealth.snu.ac.kr/software/TimesVector/. sunkim.bioinfo@snu.ac.kr. 
Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press.

  1. Combining Evidence of Preferential Gene-Tissue Relationships from Multiple Sources

    PubMed Central

    Guo, Jing; Hammar, Mårten; Öberg, Lisa; Padmanabhuni, Shanmukha S.; Bjäreland, Marcus; Dalevi, Daniel

    2013-01-01

    An important challenge in drug discovery and disease prognosis is to predict genes that are preferentially expressed in one or a few tissues, i.e. showing a considerably higher expression in one tissue(s) compared to the others. Although several data sources and methods have been published explicitly for this purpose, they often disagree and it is not evident how to retrieve these genes and how to distinguish true biological findings from those that are due to choice-of-method and/or experimental settings. In this work we have developed a computational approach that combines results from multiple methods and datasets with the aim to eliminate method/study-specific biases and to improve the predictability of preferentially expressed human genes. A rule-based score is used to merge and assign support to the results. Five sets of genes with known tissue specificity were used for parameter pruning and cross-validation. In total we identify 3434 tissue-specific genes. We compare the genes of highest scores with the public databases: PaGenBase (microarray), TiGER (EST) and HPA (protein expression data). The results have 85% overlap to PaGenBase, 71% to TiGER and only 28% to HPA. 99% of our predictions have support from at least one of these databases. Our approach also performs better than any of the databases on identifying drug targets and biomarkers with known tissue-specificity. PMID:23950964

  2. Gene expression signature of benign prostatic hyperplasia revealed by cDNA microarray analysis.

    PubMed

    Luo, Jun; Dunn, Thomas; Ewing, Charles; Sauvageot, Jurga; Chen, Yidong; Trent, Jeffrey; Isaacs, William

    2002-05-15

    Despite the high prevalence of benign prostatic hyperplasia (BPH) in the aging male, little is known regarding the etiology of this disease. A better understanding of the molecular etiology of BPH would be facilitated by a comprehensive analysis of gene expression patterns that are characteristic of benign growth in the prostate gland. Since genes differentially expressed between BPH and normal prostate tissues are likely to reflect underlying pathogenic mechanisms involved in the development of BPH, we performed comparative gene expression analysis using cDNA microarray technology to identify candidate genes associated with BPH. Total RNA was extracted from a set of 9 BPH specimens from men with extensive hyperplasia and a set of 12 histologically normal prostate tissues excised from radical prostatectomy specimens. Each of these 21 RNA samples was labeled with Cy3 in a reverse transcription reaction and cohybridized with a Cy5 labeled common reference sample to a cDNA microarray containing 6,500 human genes. Normalized fluorescent intensity ratios from each hybridization experiment were extracted to represent the relative mRNA abundance for each gene in each sample. Weighted gene and random permutation analyses were performed to generate a subset of genes with statistically significant differences in expression between BPH and normal prostate tissues. Semi-quantitative PCR analysis was performed to validate differential expression. A subset of 76 genes involved in a wide range of cellular functions was identified to be differentially expressed between BPH and normal prostate tissues. Semi-quantitative PCR was performed on 10 genes and 8 were validated. Genes consistently upregulated in BPH when compared to normal prostate tissues included: a restricted set of growth factors and their binding proteins (e.g. IGF-1 and -2, TGF-beta3, BMP5, latent TGF-beta binding protein 1 and -2); hydrolases, proteases, and protease inhibitors (e.g. 
neuropathy target esterase, MMP2, alpha-2-macroglobulin); stress response enzymes (e.g. COX2, GSTM5); and extracellular matrix molecules (e.g. laminin alpha 4 and beta 1, chondroitin sulfate proteoglycan 2, lumican). Genes consistently expressing less mRNA in BPH than in normal prostate tissues were less commonly observed and included the transcription factor KLF4, thrombospondin 4, nitric oxide synthase 2A, transglutaminase 3, and gastrin releasing peptide. We identified a diverse set of genes that are potentially related to benign prostatic hyperplasia, including genes both previously implicated in BPH pathogenesis as well as others not previously linked to this disease. Further targeted validation and investigations of these genes at the DNA, mRNA, and protein levels are warranted to determine the clinical relevance and possible therapeutic utility of these genes. Copyright 2002 Wiley-Liss, Inc.

  3. A Comparative Transcriptomic Analysis Reveals Conserved Features of Stem Cell Pluripotency in Planarians and Mammals

    PubMed Central

    Labbé, Roselyne M.; Irimia, Manuel; Currie, Ko W.; Lin, Alexander; Zhu, Shu Jun; Brown, David D.R.; Ross, Eric J.; Voisin, Veronique; Bader, Gary D.; Blencowe, Benjamin J.; Pearson, Bret J.

    2014-01-01

    Many long-lived species of animals require the function of adult stem cells throughout their lives. However, the transcriptomes of stem cells in invertebrates and vertebrates have not been compared, and consequently, ancestral regulatory circuits that control stem cell populations remain poorly defined. In this study, we have used data from high-throughput RNA sequencing to compare the transcriptomes of pluripotent adult stem cells from planarians with the transcriptomes of human and mouse pluripotent embryonic stem cells. From a stringently defined set of 4,432 orthologs shared between planarians, mice and humans, we identified 123 conserved genes that are ≥5-fold differentially expressed in stem cells from all three species. Guided by this gene set, we used RNAi screening in adult planarians to discover novel stem cell regulators, which we found to affect the stem cell-associated functions of tissue homeostasis, regeneration, and stem cell maintenance. Examples of genes that disrupted these processes included the orthologs of TBL3, PSD12, TTC27, and RACK1. From these analyses, we concluded that by comparing stem cell transcriptomes from diverse species, it is possible to uncover conserved factors that function in stem cell biology. These results provide insights into which genes comprised the ancestral circuitry underlying the control of stem cell self-renewal and pluripotency. PMID:22696458

  4. Searching for the molecular benchmark of physiological intestinal anastomotic healing in rats: an experimental study.

    PubMed

    Seifert, Gabriel J; Seifert, Michael; Kulemann, Birte; Holzner, Philipp A; Glatz, Torben; Timme, Sylvia; Sick, Olivia; Höppner, Jens; Hopt, Ulrich T; Marjanovic, Goran

    2014-01-01

    This investigation focuses on the physiological characteristics of gene transcription of intestinal tissue following anastomosis formation. In eight rats, end-to-end ileo-ileal anastomoses were performed (n = 2/group). The healthy intestinal tissue resected for this operation was used as a control. On days 0, 2, 4 and 8, 10-mm perianastomotic segments were resected. Control and perianastomotic segments were examined with an Affymetrix microarray chip to assess changes in gene regulation. Microarray findings were validated using real-time PCR for selected genes. In addition to screening global gene expression, we identified genes intensely regulated during healing and also subjected our data sets to an overrepresentation analysis using the Gene Ontology (GO) and Kyoto Encyclopedia for Genes and Genomes (KEGG). Compared to the control group, we observed that the number of differentially regulated genes peaked on day 2 with a total of 2,238 genes, decreasing by day 4 to 1,687 genes and to 1,407 genes by day 8. PCR validation for matrix metalloproteinases-3 and -13 showed not only identical transcription patterns but also analogous regulation intensity. When setting the cutoff of upregulation at 10-fold to identify genes likely to be relevant, the total gene count was significantly lower with 55, 45 and 37 genes on days 2, 4 and 8, respectively. A total of 947 GO subcategories were significantly overrepresented during anastomotic healing. Furthermore, 23 overrepresented KEGG pathways were identified. This study is the first of its kind that focuses explicitly on gene transcription during intestinal anastomotic healing under standardized conditions. Our work sets a foundation for further studies toward a more profound understanding of the physiology of anastomotic healing.

  5. Conjunctival transcriptome profiling of Solomon Islanders with active trachoma in the absence of Chlamydia trachomatis infection.

    PubMed

    Vasileva, Hristina; Butcher, Robert; Pickering, Harry; Sokana, Oliver; Jack, Kelvin; Solomon, Anthony W; Holland, Martin J; Roberts, Chrissy H

    2018-02-21

    Clinical signs of active (inflammatory) trachoma are found in many children in the Solomon Islands, but the majority of these individuals have no serological evidence of previous infection with Chlamydia trachomatis. In Temotu and Rennell and Bellona provinces, ocular infections with C. trachomatis were seldom detected among children with active trachoma; a similar lack of association was seen between active trachoma and other common bacterial and viral causes of follicular conjunctivitis. Here, we set out to characterise patterns of gene expression at the conjunctivae of children in these provinces with and without clinical signs of trachomatous inflammation-follicular (TF) and C. trachomatis infection. Purified RNA from children with and without active trachoma was run on Affymetrix GeneChip Human Transcriptome Array 2.0 microarrays. Profiles were compared between individuals with ocular C. trachomatis infection and TF (group DI; n = 6), individuals with TF but no C. trachomatis infection (group D; n = 7), and individuals without TF or C. trachomatis infection (group N; n = 7). Differential gene expression and gene set enrichment for pathway membership were assessed. Conjunctival gene expression profiles were more similar within-group than between-group. Principal components analysis indicated that the first and second principal components combined explained almost 50% of the variance in the dataset. When comparing the DI group to the N group, genes involved in T-cell proliferation, B-cell signalling and CD8+ T cell signalling pathways were differentially regulated. When comparing the DI group to the D group, CD8+ T-cell regulation, interferon-gamma and IL17 production pathways were enriched. Genes involved in RNA transcription and translation pathways were upregulated when comparing the D group to the N group. Gene expression profiles in children in the Solomon Islands indicate immune responses consistent with bacterial infection when TF and C. 
trachomatis infection are concurrent. The transcriptomes of children with TF but without identified infection were not consistent with allergic or viral conjunctivitis.

  6. Detection of the inferred interaction network in hepatocellular carcinoma from EHCO (Encyclopedia of Hepatocellular Carcinoma genes Online)

    PubMed Central

    Hsu, Chun-Nan; Lai, Jin-Mei; Liu, Chia-Hung; Tseng, Huei-Hun; Lin, Chih-Yun; Lin, Kuan-Ting; Yeh, Hsu-Hua; Sung, Ting-Yi; Hsu, Wen-Lian; Su, Li-Jen; Lee, Sheng-An; Chen, Chang-Han; Lee, Gen-Cher; Lee, DT; Shiue, Yow-Ling; Yeh, Chang-Wei; Chang, Chao-Hui; Kao, Cheng-Yan; Huang, Chi-Ying F

    2007-01-01

    Background The significant advances in microarray and proteomics analyses have resulted in an exponential increase in potential new targets and have promised to shed light on the identification of disease markers and cellular pathways. We aim to collect and decipher the HCC-related genes at the systems level. Results Here, we build an integrative platform, the Encyclopedia of Hepatocellular Carcinoma genes Online, dubbed EHCO, to systematically collect, organize and compare the pileup of unsorted HCC-related studies by using natural language processing and softbots. Among the eight gene set collections, ranging across PubMed, SAGE, microarray, and proteomics data, there are 2,906 genes in total; however, more than 77% of these genes are included only once, suggesting that tremendous efforts need to be exerted to characterize the relationship between HCC and these genes. Of these HCC inventories, protein binding represents the largest proportion (~25%) from Gene Ontology analysis. In fact, many differentially expressed gene sets in EHCO could form interaction networks (e.g. HBV-associated HCC network) by using available human protein-protein interaction datasets. To further highlight the potential new targets in the inferred network from EHCO, we combine comparative genomics and interactomics approaches to analyze 120 evolutionarily conserved and overexpressed genes in HCC. 47 out of 120 queries can form a highly interactive network with 18 queries serving as hubs. Conclusion This architectural map may represent the first step toward deciphering hepatocarcinogenesis at the systems level. Targeting hubs and/or disrupting network formation might reveal novel strategies for HCC treatment. PMID:17326819

  7. Comparison of two schemes for automatic keyword extraction from MEDLINE for functional gene clustering.

    PubMed

    Liu, Ying; Ciliax, Brian J; Borges, Karin; Dasigi, Venu; Ram, Ashwin; Navathe, Shamkant B; Dingledine, Ray

    2004-01-01

    One of the key challenges of microarray studies is to derive biological insights from the unprecedented quantities of data on gene-expression patterns. Clustering genes by functional keyword association can provide direct information about the nature of the functional links among genes within the derived clusters. However, the quality of the keyword lists extracted from biomedical literature for each gene significantly affects the clustering results. We extracted keywords from MEDLINE that describe the most prominent functions of the genes, and used the resulting weights of the keywords as feature vectors for gene clustering. By analyzing the resulting cluster quality, we compared two keyword weighting schemes: normalized z-score and term frequency-inverse document frequency (TFIDF). The best combination of background comparison set, stop list and stemming algorithm was selected based on precision and recall metrics. In a test set of four known gene groups, a hierarchical algorithm correctly assigned 25 of 26 genes to the appropriate clusters based on keywords extracted by the TFIDF weighting scheme, but only 23 of 26 with the z-score method. To evaluate the effectiveness of the weighting schemes for keyword extraction for gene clusters from microarray profiles, 44 yeast genes that are differentially expressed during the cell cycle were used as a second test set. Using established measures of cluster quality, the results produced from TFIDF-weighted keywords had higher purity, lower entropy, and higher mutual information than those produced from normalized z-score weighted keywords. The optimized algorithms should be useful for sorting genes from microarray lists into functionally discrete clusters.
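
    The two keyword weighting schemes compared in this abstract can be sketched in simplified form. Both functions below are illustrative assumptions: the TFIDF variant uses relative term frequency times the log inverse document frequency, and the z-score variant standardizes each keyword's relative frequency across the genes in the input rather than against the separate background set the authors describe. Stop lists and stemming are omitted.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def tfidf_weights(gene_docs):
    """gene_docs: {gene: list of keyword tokens} -> {gene: {word: weight}}."""
    n_genes = len(gene_docs)
    doc_freq = Counter(w for words in gene_docs.values() for w in set(words))
    weights = {}
    for gene, words in gene_docs.items():
        term_freq = Counter(words)
        weights[gene] = {w: (count / len(words)) * math.log(n_genes / doc_freq[w])
                         for w, count in term_freq.items()}
    return weights


def zscore_weights(gene_docs):
    """Standardize each keyword's relative frequency across genes."""
    vocab = {w for words in gene_docs.values() for w in words}
    rel = {}
    for gene, words in gene_docs.items():
        counts = Counter(words)
        rel[gene] = {w: (counts[w] / len(words) if words else 0.0) for w in vocab}
    weights = {gene: {} for gene in gene_docs}
    for w in vocab:
        column = [rel[gene][w] for gene in gene_docs]
        mu, sigma = mean(column), pstdev(column)
        for gene in gene_docs:
            weights[gene][w] = (rel[gene][w] - mu) / sigma if sigma else 0.0
    return weights
```

    Either weight dictionary can serve as a sparse feature vector per gene for a clustering step; a keyword that occurs for every gene receives a TFIDF weight of zero, which is one reason the two schemes can rank keywords differently.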

  8. Mining functionally relevant gene sets for analyzing physiologically novel clinical expression data.

    PubMed

    Turcan, Sevin; Vetter, Douglas E; Maron, Jill L; Wei, Xintao; Slonim, Donna K

    2011-01-01

    Gene set analyses have become a standard approach for increasing the sensitivity of transcriptomic studies. However, analytical methods incorporating gene sets require the availability of pre-defined gene sets relevant to the underlying physiology being studied. For novel physiological problems, relevant gene sets may be unavailable or existing gene set databases may bias the results towards only the best-studied of the relevant biological processes. We describe a successful attempt to mine novel functional gene sets for translational projects where the underlying physiology is not necessarily well characterized in existing annotation databases. We choose targeted training data from public expression data repositories and define new criteria for selecting biclusters to serve as candidate gene sets. Many of the discovered gene sets show little or no enrichment for informative Gene Ontology terms or other functional annotation. However, we observe that such gene sets show coherent differential expression in new clinical test data sets, even if derived from different species, tissues, and disease states. We demonstrate the efficacy of this method on a human metabolic data set, where we discover novel, uncharacterized gene sets that are diagnostic of diabetes, and on additional data sets related to neuronal processes and human development. Our results suggest that our approach may be an efficient way to generate a collection of gene sets relevant to the analysis of data for novel clinical applications where existing functional annotation is relatively incomplete.

  9. Statistical mechanical model of coupled transcription from multiple promoters due to transcription factor titration

    PubMed Central

    Rydenfelt, Mattias; Cox, Robert Sidney; Garcia, Hernan; Phillips, Rob

    2014-01-01

    Regulatory action by transcription factors (TFs) at multiple promoter targets is the rule rather than the exception, with examples ranging from the cAMP receptor protein (CRP) in E. coli, which regulates hundreds of different genes simultaneously, to situations involving multiple copies of the same gene, such as plasmids, retrotransposons, or highly replicated viral DNA. When the number of TFs heavily exceeds the number of binding sites, TF binding to each promoter can be regarded as independent. However, when the number of TF molecules is comparable to the number of binding sites, TF titration will result in correlation (“promoter entanglement”) between transcription of different genes. We develop a statistical mechanical model which takes the TF titration effect into account and use it to predict both the level of gene expression for a general set of promoters and the resulting correlation in transcription rates of different genes. Our results show that the TF titration effect could be important for understanding gene expression in many regulatory settings. PMID:24580252
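
    The titration effect can be illustrated with a toy equilibrium calculation. This is not the authors' model: it assumes a fixed pool of TF molecules shared between identical specific sites and a large reservoir of nonspecific sites, and gives the state with k occupied sites the statistical weight C(M, k) * N!/(N-k)! * (exp(-delta_eps) / N_ns)^k. All parameter names and values are hypothetical.

```python
import math


def mean_occupancy(n_tf, n_sites, n_ns, delta_eps_kt):
    """Mean fraction of n_sites identical specific sites bound by a TF.

    n_tf: number of TF molecules (N).
    n_sites: number of identical specific binding sites (M).
    n_ns: number of nonspecific reservoir sites (assumed much larger than n_tf).
    delta_eps_kt: specific minus nonspecific binding energy in units of kT
                  (negative means specific binding is favoured).
    """
    boltzmann = math.exp(-delta_eps_kt)
    weights = []
    for k in range(min(n_tf, n_sites) + 1):
        # Choose which k sites are occupied and which k TFs leave the reservoir.
        weights.append(math.comb(n_sites, k) * math.perm(n_tf, k)
                       * (boltzmann / n_ns) ** k)
    partition = sum(weights)
    mean_bound = sum(k * w for k, w in enumerate(weights)) / partition
    return mean_bound / n_sites
```

    With a single site the expression reduces to the familiar x / (1 + x) form with x = N * exp(-delta_eps) / N_ns. Once the number of sites becomes comparable to the number of TF molecules, the per-site occupancy drops because the sites compete for the same pool, which is the coupling the abstract refers to.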

  10. Fully moderated T-statistic for small sample size gene expression arrays.

    PubMed

    Yu, Lianbo; Gulati, Parul; Fernandez, Soledad; Pennell, Michael; Kirschner, Lawrence; Jarjoura, David

    2011-09-15

    Gene expression microarray experiments with few replications lead to great variability in estimates of gene variances. Several Bayesian methods have been developed to reduce this variability and to increase power. Thus far, moderated t methods assumed a constant coefficient of variation (CV) for the gene variances. We provide evidence against this assumption, and extend the method by allowing the CV to vary with gene expression. Our CV varying method, which we refer to as the fully moderated t-statistic, was compared to three other methods (ordinary t, and two moderated t predecessors). A simulation study and a familiar spike-in data set were used to assess the performance of the testing methods. The results showed that our CV varying method had higher power than the other three methods, identified a greater number of true positives in spike-in data, fit simulated data under varying assumptions very well, and in a real data set better identified higher expressing genes that were consistent with functional pathways associated with the experiments.
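
    For orientation, a moderated t-statistic in its usual form shrinks each gene's pooled variance toward a prior variance before forming the test statistic. The sketch below takes the prior variance and prior degrees of freedom as given inputs; how they are estimated, and how the fully moderated version lets the prior vary with expression level, is not specified in the abstract, so those inputs are assumptions of this example.

```python
import math
from statistics import mean, variance


def moderated_t(group_a, group_b, prior_var, prior_df):
    """Two-sample moderated t-statistic for one gene.

    prior_var, prior_df: prior variance and its degrees of freedom, assumed
    to be supplied by the caller (for example, estimated from all genes, or
    from genes of similar expression level).
    Returns (t, total_degrees_of_freedom).
    """
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / df
    # Weighted average of the prior variance and the gene's own variance.
    shrunk_var = (prior_df * prior_var + df * pooled_var) / (prior_df + df)
    std_err = math.sqrt(shrunk_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / std_err, prior_df + df
```

    Setting `prior_df` to zero recovers the ordinary pooled-variance t-statistic, so the same function covers the ordinary t baseline that the abstract compares against.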

  11. A Protocol for Using Gene Set Enrichment Analysis to Identify the Appropriate Animal Model for Translational Research.

    PubMed

    Weidner, Christopher; Steinfath, Matthias; Wistorf, Elisa; Oelgeschläger, Michael; Schneider, Marlon R; Schönfelder, Gilbert

    2017-08-16

    Recent studies that compared transcriptomic datasets of human diseases with datasets from mouse models using traditional gene-to-gene comparison techniques resulted in contradictory conclusions regarding the relevance of animal models for translational research. A major reason for the discrepancies between different gene expression analyses is the arbitrary filtering of differentially expressed genes. Furthermore, the comparison of single genes between different species and platforms often is limited by technical variance, leading to misinterpretation of the con/discordance between data from human and animal models. Thus, standardized approaches for systematic data analysis are needed. To overcome subjective gene filtering and ineffective gene-to-gene comparisons, we recently demonstrated that gene set enrichment analysis (GSEA) has the potential to avoid these problems. Therefore, we developed a standardized protocol for the use of GSEA to distinguish between appropriate and inappropriate animal models for translational research. This protocol is not suitable to predict how to design new model systems a-priori, as it requires existing experimental omics data. However, the protocol describes how to interpret existing data in a standardized manner in order to select the most suitable animal model, thus avoiding unnecessary animal experiments and misleading translational studies.
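
    The core of gene set enrichment analysis is a running-sum statistic over a ranked gene list. The sketch below implements the classic unweighted (Kolmogorov-Smirnov-like) form as an illustration only; the weighting, ranking metric and permutation settings used in the published protocol are not given in the abstract, and the function name is hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted enrichment score of gene_set in ranked_genes.

    ranked_genes: genes ordered from most up-regulated to most down-regulated.
    Returns the running-sum value with the largest absolute deviation from
    zero (positive: set concentrated at the top; negative: at the bottom).
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0  # score is undefined without both hits and misses
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1 / n_hit if is_hit else -1 / n_miss
        if abs(running) > abs(best):
            best = running
    return best
```

    Because the score depends on the whole ranking rather than on a list of genes passing a cutoff, it avoids the arbitrary filtering step that the abstract identifies as a source of contradictory conclusions.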

  12. Recursive feature selection with significant variables of support vectors.

    PubMed

    Tsai, Chen-An; Huang, Chien-Hsun; Chang, Ching-Wei; Chen, Chun-Houh

    2012-01-01

    The development of DNA microarrays allows researchers to screen thousands of genes simultaneously and helps identify high- and low-expression genes in normal and diseased tissues. Selecting relevant genes for cancer classification is an important issue. Most gene selection methods use univariate ranking criteria and arbitrarily choose a threshold to select genes. However, the parameter setting may not be compatible with the selected classification algorithms. In this paper, we propose a new gene selection method (SVM-t) based on the use of t-statistics embedded in a support vector machine. We compared its performance to two similar SVM-based methods: SVM recursive feature elimination (SVMRFE) and recursive support vector machine (RSVM). The three methods were compared based on extensive simulation experiments and analyses of two published microarray datasets. In the simulation experiments, we found that the proposed method is more robust in selecting informative genes than SVMRFE and RSVM and is capable of attaining good classification performance when the variations of informative and noninformative genes are different. In the analysis of two microarray datasets, the proposed method yields better performance in identifying fewer genes with good prediction accuracy, compared to SVMRFE and RSVM.

  13. Genome-Wide Gene Expression in relation to Age in Large Laboratory Cohorts of Drosophila melanogaster

    PubMed Central

    Carlson, Kimberly A.; Gardner, Kylee; Pashaj, Anjeza; Carlson, Darby J.; Yu, Fang; Eudy, James D.; Zhang, Chi; Harshman, Lawrence G.

    2015-01-01

    Aging is a complex process characterized by a steady decline in an organism's ability to perform life-sustaining tasks. In the present study, two cages of approximately 12,000 mated Drosophila melanogaster females were used as a source of RNA from individuals sampled frequently as a function of age. A linear model for microarray data method was used for the microarray analysis to adjust for the box effect; it identified 1,581 candidate aging genes. Cluster analyses using a self-organizing map algorithm on the 1,581 significant genes identified gene expression patterns across different ages. Genes involved in immune system function and regulation, chorion assembly and function, and metabolism were all significantly differentially expressed as a function of age. The temporal pattern of data indicated that gene expression related to aging is affected relatively early in life span. In addition, the temporal variance in gene expression in immune function genes was compared to a random set of genes. There was an increase in the variance of gene expression within each cohort, which was not observed in the set of random genes. This observation is compatible with the hypothesis that D. melanogaster immune function genes lose control of gene expression as flies age. PMID:26090231

  14. Identification of ecotype-specific marker genes for categorization of beer-spoiling Lactobacillus brevis.

    PubMed

    Behr, Jürgen; Geissler, Andreas J; Preissler, Patrick; Ehrenreich, Armin; Angelov, Angel; Vogel, Rudi F

    2015-10-01

    The tolerance to hop compounds, which is mainly associated with inhibition of bacterial growth in beer, is a multi-factorial trait. Any approaches to predict the physiological differences between beer-spoiling and non-spoiling strains on the basis of a single marker gene are limited. We identified ecotype-specific genes related to the ability to grow in Pilsner beer via comparative genome sequencing. The genome sequences of four different strains of Lactobacillus brevis were compared, including newly established genomes of two highly hop tolerant beer isolates, one strain isolated from faeces and one published genome of a silage isolate. Gene fragments exclusively occurring in beer-spoiling strains as well as sequences only occurring in non-spoiling strains were identified. Comparative genomic arrays were established and hybridized with a set of L. brevis strains, which are characterized by their ability to spoil beer. As a result, a set of 33 and 4 oligonucleotide probes could be established specifically detecting beer-spoilers and non-spoilers, respectively. The detection of more than one of these marker sequences according to a genetic barcode enables scoring of L. brevis for their beer-spoiling potential and can thus assist in risk evaluation in the brewing industry. Copyright © 2015 Elsevier Ltd. All rights reserved.

  15. Molecular and Cellular Profiling of Scalp Psoriasis Reveals Differences and Similarities Compared to Skin Psoriasis

    PubMed Central

    Ruano, Juan; Suárez-Fariñas, Mayte; Shemer, Avner; Oliva, Margeaux

    2016-01-01

    Scalp psoriasis shows a variable clinical spectrum and in many cases poses a great therapeutic challenge. However, it remains unknown whether the immune response of scalp psoriasis differs from understood pathomechanisms of psoriasis in other skin areas. We sought to determine the cellular and molecular phenotype of scalp psoriasis by performing a comparative analysis of scalp and skin using lesional and nonlesional samples from 20 Caucasian subjects with untreated moderate to severe psoriasis and significant scalp involvement and 10 control subjects without psoriasis. Our results suggest that even in the scalp, psoriasis is a disease of the inter-follicular skin. The immune mechanisms that mediate scalp psoriasis were found to be similar to those involved in skin psoriasis. However, the magnitude of dysregulation, number of differentially expressed genes, and enrichment of the psoriatic genomic fingerprint were more prominent in skin lesions. Furthermore, the scalp transcriptome showed increased modulation of several gene-sets, particularly those induced by interferon-gamma, compared with that of skin psoriasis, which was mainly associated with activation of TNFα/IL-17/IL-22-induced keratinocyte response genes. We also detected differences in expression of gene-sets involving negative regulation, epigenetic regulation, epidermal differentiation, and dendritic cell or Th1/Th17/Th22-related T-cell processes. PMID:26849645

  16. Discovery of Genomic Breakpoints Affecting Breast Cancer Progression and Prognosis

    DTIC Science & Technology

    2010-10-01

    mutations compared to those detected by the 5Kbp method alone. Fosmid diTag method also reveals much higher proportion of gene fusions and truncations...observed highly similar structural mutational spectra affecting different sets of genes, pointing to similar histories of genomic instability against... mutations have been identified in non-BRCA1/2 multiethnic breast cancer cases (45,46), no truncating mutation of the RAP80 gene in breast cancer has

  17. Systems Genetics Analysis of GWAS reveals Novel Associations between Key Biological Processes and Coronary Artery Disease

    PubMed Central

    Ghosh, Sujoy; Vivar, Juan; Nelson, Christopher P; Willenborg, Christina; Segrè, Ayellet V; Mäkinen, Ville-Petteri; Nikpay, Majid; Erdmann, Jeannette; Blankenberg, Stefan; O'Donnell, Christopher; März, Winfried; Laaksonen, Reijo; Stewart, Alexandre FR; Epstein, Stephen E; Shah, Svati H; Granger, Christopher B; Hazen, Stanley L; Kathiresan, Sekar; Reilly, Muredach P; Yang, Xia; Quertermous, Thomas; Samani, Nilesh J; Schunkert, Heribert; Assimes, Themistocles L; McPherson, Ruth

    2016-01-01

    Objective Genome-wide association (GWA) studies have identified multiple genetic variants affecting the risk of coronary artery disease (CAD). However, individually these explain only a small fraction of the heritability of CAD and for most, the causal biological mechanisms remain unclear. We sought to obtain further insights into potential causal processes of CAD by integrating large-scale GWA data with expertly curated databases of core human pathways and functional networks. Approaches and Results Employing pathways (gene sets) from Reactome, we carried out a two-stage gene set enrichment analysis strategy. From a meta-analyzed discovery cohort of 7 CAD GWAS data sets (9,889 cases/11,089 controls), nominally significant gene-sets were tested for replication in a meta-analysis of 9 additional studies (15,502 cases/55,730 controls) from the CARDIoGRAM Consortium. A total of 32 of 639 Reactome pathways tested showed convincing association with CAD (replication p<0.05). These pathways resided in 9 of 21 core biological processes represented in Reactome, and included pathways relevant to extracellular matrix integrity, innate immunity, axon guidance, and signaling by PDGF, NOTCH, and the TGF-β/SMAD receptor complex. Many of these pathways had strengths of association comparable to those observed in lipid transport pathways. Network analysis of unique genes within the replicated pathways further revealed several interconnected functional and topologically interacting modules representing novel associations (e.g. semaphorin regulated axonal guidance pathway) besides confirming known processes (lipid metabolism). The connectivity in the observed networks was statistically significant compared to random networks (p<0.001). Network centrality analysis (‘degree’ and ‘betweenness’) further identified genes (e.g. NCAM1, FYN, FURIN etc.) likely to play critical roles in the maintenance and functioning of several of the replicated pathways. Conclusions These findings provide novel insights into how genetic variation, interpreted in the context of biological processes and functional interactions among genes, may help define the genetic architecture of CAD. PMID:25977570

  18. Physiology of Pseudomonas aeruginosa in biofilms as revealed by transcriptome analysis

    PubMed Central

    2010-01-01

    Background Transcriptome analysis was applied to characterize the physiological activities of Pseudomonas aeruginosa grown for three days in drip-flow biofilm reactors. Conventional applications of transcriptional profiling often compare two paired data sets that differ in a single experimentally controlled variable. In contrast, this study obtained the transcriptome of a single biofilm state, ranked transcript signals to make the priorities of the population manifest, and compared rankings for a priori identified physiological marker genes between the biofilm and published data sets. Results Biofilms tolerated exposure to antibiotics, harbored steep oxygen concentration gradients, and exhibited stratified and heterogeneous spatial patterns of protein synthetic activity. Transcriptional profiling was performed and the signal intensity of each transcript was ranked to gain insight into the physiological state of the biofilm population. Similar rankings were obtained from data sets published in the GEO database http://www.ncbi.nlm.nih.gov/geo. By comparing the rank of genes selected as markers for particular physiological activities between the biofilm and comparator data sets, it was possible to infer qualitative features of the physiological state of the biofilm bacteria. These biofilms appeared, from their transcriptome, to be glucose nourished, iron replete, oxygen limited, and growing slowly or exhibiting stationary phase character. Genes associated with elaboration of type IV pili were strongly expressed in the biofilm. The biofilm population did not indicate oxidative stress, homoserine lactone mediated quorum sensing, or activation of efflux pumps. Using correlations with transcript ranks, the average specific growth rate of biofilm cells was estimated to be 0.08 h⁻¹. Conclusions Collectively these data underscore the oxygen-limited, slow-growing nature of the biofilm population and are consistent with antimicrobial tolerance due to low metabolic activity. PMID:21083928
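
    The rank-based comparison described in this abstract can be sketched as follows: each transcript's signal is converted to a rank percentile within its own data set, and the percentiles of chosen marker genes are then compared across data sets. The gene identifiers, signal values, and function names below are hypothetical; the abstract does not specify how ranks were normalized or how ties were handled, so this is one plausible reading rather than the study's procedure.

```python
def rank_percentiles(signal):
    """Map each gene to its rank percentile within one data set
    (0.0 = weakest signal, 1.0 = strongest). Ties keep input order."""
    ordered = sorted(signal, key=signal.get)
    n = len(ordered)
    if n == 1:
        return {ordered[0]: 1.0}
    return {gene: i / (n - 1) for i, gene in enumerate(ordered)}


def compare_marker_ranks(biofilm, comparator, markers):
    """Difference in rank percentile (biofilm minus comparator) for each
    marker gene present in both data sets."""
    b = rank_percentiles(biofilm)
    c = rank_percentiles(comparator)
    return {g: b[g] - c[g] for g in markers if g in b and g in c}


if __name__ == "__main__":
    # Hypothetical signal intensities for five transcripts.
    biofilm = {"g1": 10, "g2": 50, "g3": 200, "g4": 5, "g5": 120}
    comparator = {"g1": 300, "g2": 40, "g3": 20, "g4": 10, "g5": 90}
    for gene, shift in compare_marker_ranks(biofilm, comparator, ["g1", "g3"]).items():
        print(gene, f"{shift:+.2f}")
```

    Working with ranks rather than raw intensities is what allows a single biofilm transcriptome to be compared with published data sets measured on different scales.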

  19. Evaluating the consistency of gene sets used in the analysis of bacterial gene expression data.

    PubMed

    Tintle, Nathan L; Sitarik, Alexandra; Boerema, Benjamin; Young, Kylie; Best, Aaron A; Dejongh, Matthew

    2012-08-08

    Statistical analyses of whole genome expression data require functional information about genes in order to yield meaningful biological conclusions. The Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) are common sources of functionally grouped gene sets. For bacteria, the SEED and MicrobesOnline provide alternative, complementary sources of gene sets. To date, no comprehensive evaluation of the data obtained from these resources has been performed. We define a series of gene set consistency metrics directly related to the most common classes of statistical analyses for gene expression data, and then perform a comprehensive analysis of 3581 Affymetrix® gene expression arrays across 17 diverse bacteria. We find that gene sets obtained from GO and KEGG demonstrate lower consistency than those obtained from the SEED and MicrobesOnline, regardless of gene set size. Despite the widespread use of GO and KEGG gene sets in bacterial gene expression data analysis, the SEED and MicrobesOnline provide more consistent sets for a wide variety of statistical analyses. Increased use of the SEED and MicrobesOnline gene sets in the analysis of bacterial gene expression data may improve statistical power and utility of expression data.

  20. Gene set analysis of purine and pyrimidine antimetabolites cancer therapies.

    PubMed

    Fridley, Brooke L; Batzler, Anthony; Li, Liang; Li, Fang; Matimba, Alice; Jenkins, Gregory D; Ji, Yuan; Wang, Liewei; Weinshilboum, Richard M

    2011-11-01

    Responses to therapies, either with regard to toxicities or efficacy, are expected to involve complex relationships of gene products within the same molecular pathway or functional gene set. Therefore, pathways or gene sets, as opposed to single genes, may better reflect the true underlying biology and may be more appropriate units for analysis of pharmacogenomic studies. Application of such methods to pharmacogenomic studies may enable the detection of more subtle effects of multiple genes in the same pathway that may be missed by assessing each gene individually. A gene set analysis of 3821 gene sets is presented assessing the association between basal messenger RNA expression and drug cytotoxicity using ethnically defined human lymphoblastoid cell lines for two classes of drugs: pyrimidines [gemcitabine (dFdC) and arabinoside] and purines [6-thioguanine and 6-mercaptopurine]. The gene set nucleoside-diphosphatase activity was found to be significantly associated with both dFdC and arabinoside, whereas gene set γ-aminobutyric acid catabolic process was associated with dFdC and 6-thioguanine. These gene sets were significantly associated with the phenotype even after adjusting for multiple testing. In addition, five associated gene sets were found in common between the pyrimidines and two gene sets for the purines (3',5'-cyclic-AMP phosphodiesterase activity and γ-aminobutyric acid catabolic process) with a P value of less than 0.0001. Functional validation was attempted with four genes each in gene sets for thiopurine and pyrimidine antimetabolites. All four genes selected from the pyrimidine gene sets (PSME3, CANT1, ENTPD6, ADRM1) were validated, but only one (PDE4D) was validated for the thiopurine gene sets. In summary, results from the gene set analysis of pyrimidine and purine therapies, used often in the treatment of various cancers, provide novel insight into the relationship between genomic variation and drug response.

  1. A house finch (Haemorhous mexicanus) spleen transcriptome reveals intra- and interspecific patterns of gene expression, alternative splicing and genetic diversity in passerines.

    PubMed

    Zhang, Qu; Hill, Geoffrey E; Edwards, Scott V; Backström, Niclas

    2014-04-24

    With its plumage color dimorphism and unique history in North America, including a recent population expansion and an epizootic of Mycoplasma gallisepticum (MG), the house finch (Haemorhous mexicanus) is a model species for studying sexual selection, plumage coloration and host-parasite interactions. As part of our ongoing efforts to make available genomic resources for this species, here we report a transcriptome assembly derived from genes expressed in spleen. We characterize transcriptomes from two populations with different histories of demography and disease exposure: a recently founded population in the eastern US that has been exposed to MG for over a decade and a native population from the western range that has never been exposed to MG. We utilize this resource to quantify conservation in gene expression in passerine birds over approximately 50 MY by comparing splenic expression profiles for 9,646 house finch transcripts and those from zebra finch and find that less than half of all genes expressed in spleen in either species are expressed in both species. Comparative gene annotations from several vertebrate species suggest that the house finch transcriptomes contain ~15 genes not yet found in previously sequenced vertebrate genomes. The house finch transcriptomes harbour ~85,000 SNPs, ~20,000 of which are non-synonymous. Although not yet validated by biological or technical replication, we identify a set of genes exhibiting differences between populations in gene expression (n = 182; 2% of all transcripts), allele frequencies (76 FST outliers) and alternative splicing as well as genes with several fixed non-synonymous substitutions; this set includes genes with functions related to double-strand break repair and immune response. The two house finch spleen transcriptome profiles will add to the increasing data on genome and transcriptome sequence information from natural populations. Differences in splenic expression between house finch and zebra finch imply either significant evolutionary turnover of splenic expression patterns or different physiological states of the individuals examined. The transcriptome resource will enhance the potential to annotate an eventual house finch genome, and the set of gene-based high-quality SNPs will help clarify the genetic underpinnings of host-pathogen interactions and sexual selection.

  2. Reliable measurement of E. coli single cell fluorescence distribution using a standard microscope set-up.

    PubMed

    Cortesi, Marilisa; Bandiera, Lucia; Pasini, Alice; Bevilacqua, Alessandro; Gherardi, Alessandro; Furini, Simone; Giordano, Emanuele

    2017-01-01

    Quantifying gene expression at single cell level is fundamental for the complete characterization of synthetic gene circuits, due to the significant impact of noise and inter-cellular variability on the system's functionality. Commercial set-ups that allow the acquisition of fluorescent signal at single cell level (flow cytometers or quantitative microscopes) are expensive apparatuses that are hardly affordable by small laboratories. A protocol that makes a standard optical microscope able to acquire quantitative, single cell, fluorescent data from a bacterial population transformed with synthetic gene circuitry is presented. Single cell fluorescence values, acquired with a microscope set-up and processed with custom-made software, are compared with results that were obtained with a flow cytometer in a bacterial population transformed with the same gene circuitry. The high correlation between data from the two experimental set-ups, with a correlation coefficient computed over the tested dynamic range > 0.99, proves that a standard optical microscope, when coupled with appropriate software for image processing, might be used for quantitative single-cell fluorescence measurements. The calibration of the set-up, together with its validation, is described. The experimental protocol described in this paper makes quantitative measurement of single cell fluorescence accessible to laboratories equipped with standard optical microscope set-ups. Our method allows for an affordable measurement/quantification of intercellular variability; a better understanding of this phenomenon will improve our comprehension of cellular behaviors and the design of synthetic gene circuits. All the required software is freely available to the synthetic biology community (MUSIQ Microscope flUorescence SIngle cell Quantification).

  3. Predictive Genes in Adjacent Normal Tissue Are Preferentially Altered by sCNV during Tumorigenesis in Liver Cancer and May Be Rate Limiting

    PubMed Central

    Lamb, John R.; Zhang, Chunsheng; Xie, Tao; Wang, Kai; Zhang, Bin; Hao, Ke; Chudin, Eugene; Fraser, Hunter B.; Millstein, Joshua; Ferguson, Mark; Suver, Christine; Ivanovska, Irena; Scott, Martin; Philippar, Ulrike; Bansal, Dimple; Zhang, Zhan; Burchard, Julja; Smith, Ryan; Greenawalt, Danielle; Cleary, Michele; Derry, Jonathan; Loboda, Andrey; Watters, James; Poon, Ronnie T. P.; Fan, Sheung T.; Yeung, Chun; Lee, Nikki P. Y.; Guinney, Justin; Molony, Cliona; Emilsson, Valur; Buser-Doepner, Carolyn; Zhu, Jun; Friend, Stephen; Mao, Mao; Shaw, Peter M.; Dai, Hongyue; Luk, John M.; Schadt, Eric E.

    2011-01-01

    Background In hepatocellular carcinoma (HCC) genes predictive of survival have been found in both adjacent normal (AN) and tumor (TU) tissues. The relationships between these two sets of predictive genes and the general process of tumorigenesis and disease progression remain unclear. Methodology/Principal Findings Here we have investigated HCC tumorigenesis by comparing gene expression, DNA copy number variation and survival using ∼250 AN and TU samples representing, respectively, the pre-cancer state, and the result of tumorigenesis. Genes that participate in tumorigenesis were defined using a gene-gene correlation meta-analysis procedure that compared AN versus TU tissues. Genes predictive of survival in AN (AN-survival genes) were found to be enriched in the differential gene-gene correlation gene set, indicating that they directly participate in the process of tumorigenesis. Additionally, the AN-survival genes were mostly not predictive after tumorigenesis in TU tissue, and this transition was associated with and could largely be explained by the effect of somatic DNA copy number variation (sCNV) in cis and in trans. The data were consistent with the variance of AN-survival genes being rate-limiting steps in tumorigenesis and this was confirmed using a treatment that promotes HCC tumorigenesis that selectively altered AN-survival genes and genes differentially correlated between AN and TU. Conclusions/Significance This suggests that the process of tumor evolution involves rate-limiting steps related to the background from which the tumor evolved where these were frequently predictive of clinical outcome. Additionally, treatments that alter the likelihood of tumorigenesis occurring may act by altering AN-survival genes, suggesting that the process can be manipulated. Further, sCNV explains a substantial fraction of tumor specific expression and may therefore be a causal driver of tumor evolution in HCC and perhaps many solid tumor types. PMID:21750698

  4. Combining Shapley value and statistics to the analysis of gene expression data in children exposed to air pollution

    PubMed Central

    Moretti, Stefano; van Leeuwen, Danitsja; Gmuender, Hans; Bonassi, Stefano; van Delft, Joost; Kleinjans, Jos; Patrone, Fioravante; Merlo, Domenico Franco

    2008-01-01

    Background In gene expression analysis, statistical tests for differential gene expression provide lists of candidate genes having, individually, a sufficiently low p-value. However, the interpretation of each single p-value within complex systems involving several interacting genes is problematic. In parallel, in the last sixty years, game theory has been applied to political and social problems to assess the power of interacting agents in forcing a decision and, more recently, to represent the relevance of genes in response to certain conditions. Results In this paper we introduce a bootstrap procedure to test the null hypothesis that each gene has the same relevance between two conditions, where the relevance is represented by the Shapley value of a particular coalitional game defined on a microarray data-set. This method, which is called Comparative Analysis of Shapley value (CASh for short), is applied to data concerning the gene expression in children differentially exposed to air pollution. The results provided by CASh are compared with the results from a parametric statistical test for testing differential gene expression. Both lists of genes provided by CASh and t-test are informative enough to discriminate exposed subjects on the basis of their gene expression profiles. While many genes are selected in common by CASh and the parametric test, it turns out that the biological interpretation of the differences between these two selections is more interesting, suggesting a different interpretation of the main biological pathways in gene expression regulation for exposed individuals. A simulation study suggests that CASh offers more power than t-test for the detection of differential gene expression variability. Conclusion CASh is successfully applied to gene expression analysis of a data-set where the joint expression behavior of genes may be critical to characterize the expression response to air pollution. We demonstrate a synergistic effect between coalitional games and statistics that resulted in a selection of genes with a potential impact in the regulation of complex pathways. PMID:18764936
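
    To make the notion of a gene's Shapley value concrete, the sketch below computes exact Shapley values for a toy coalitional game by averaging each player's marginal contribution over all orderings. The game used here is an assumption, since the abstract does not define it: the worth of a gene coalition is taken to be the fraction of samples whose set of abnormally expressed genes lies entirely inside the coalition. The gene labels and sample sets are hypothetical, and the bootstrap comparison between two conditions that CASh adds on top is not shown.

```python
from fractions import Fraction
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution of each player
    over every ordering. Only feasible for a handful of players."""
    totals = {p: Fraction(0) for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = set()
        for p in order:
            before = value(frozenset(coalition))
            coalition.add(p)
            totals[p] += value(frozenset(coalition)) - before
    return {p: total / len(orderings) for p, total in totals.items()}


def make_game(supports):
    """Assumed toy game: worth of a coalition = fraction of samples whose
    abnormally expressed gene set is contained in the coalition."""
    def value(coalition):
        covered = sum(1 for s in supports if s <= coalition)
        return Fraction(covered, len(supports))
    return value


if __name__ == "__main__":
    # Hypothetical abnormal-expression sets for three samples.
    supports = [frozenset({"A"}), frozenset({"A", "B"}), frozenset({"B", "C"})]
    phi = shapley_values(["A", "B", "C"], make_game(supports))
    for gene, v in phi.items():
        print(gene, v)
```

    Enumerating orderings grows factorially with the number of genes, so a real analysis would need a closed form for the chosen game or a sampling approximation.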

  5. Identification of a B cell signature associated with renal transplant tolerance in humans

    PubMed Central

    Newell, Kenneth A.; Asare, Adam; Kirk, Allan D.; Gisler, Trang D.; Bourcier, Kasia; Suthanthiran, Manikkam; Burlingham, William J.; Marks, William H.; Sanz, Ignacio; Lechler, Robert I.; Hernandez-Fuentes, Maria P.; Turka, Laurence A.; Seyfert-Margolis, Vicki L.

    2010-01-01

    Establishing long-term allograft acceptance without the requirement for continuous immunosuppression, a condition known as allograft tolerance, is a highly desirable therapeutic goal in solid organ transplantation. Determining which recipients would benefit from withdrawal or minimization of immunosuppression would be greatly facilitated by biomarkers predictive of tolerance. In this study, we identified the largest reported cohort to our knowledge of tolerant renal transplant recipients, as defined by stable graft function and receiving no immunosuppression for more than 1 year, and compared their gene expression profiles and peripheral blood lymphocyte subsets with those of subjects with stable graft function who are receiving immunosuppressive drugs as well as healthy controls. In addition to being associated with clinical and phenotypic parameters, renal allograft tolerance was strongly associated with a B cell signature using several assays. Tolerant subjects showed increased expression of multiple B cell differentiation genes, and a set of just 3 of these genes distinguished tolerant from nontolerant recipients in a unique test set of samples. This B cell signature was associated with upregulation of CD20 mRNA in urine sediment cells and elevated numbers of peripheral blood naive and transitional B cells in tolerant participants compared with those receiving immunosuppression. These results point to a critical role for B cells in regulating alloimmunity and provide a candidate set of genes for wider-scale screening of renal transplant recipients. PMID:20501946

  6. Epigenetic regulation of depot-specific gene expression in adipose tissue.

    PubMed

    Gehrke, Sandra; Brueckner, Bodo; Schepky, Andreas; Klein, Johannes; Iwen, Alexander; Bosch, Thomas C G; Wenck, Horst; Winnefeld, Marc; Hagemann, Sabine

    2013-01-01

    In humans, adipose tissue is distributed in subcutaneous abdominal and subcutaneous gluteal depots that comprise a variety of functional differences. Whereas energy storage in gluteal adipose tissue has been shown to mediate a protective effect, an increase of abdominal adipose tissue is associated with metabolic disorders. However, the molecular basis of depot-specific characteristics is not completely understood yet. Using array-based analyses of transcription profiles, we identified a specific set of genes that was differentially expressed between subcutaneous abdominal and gluteal adipose tissue. To investigate the role of epigenetic regulation in depot-specific gene expression, we additionally analyzed genome-wide DNA methylation patterns in abdominal and gluteal depots. By combining both data sets, we identified a highly significant set of depot-specifically expressed genes that appear to be epigenetically regulated. Interestingly, the majority of these genes form part of the homeobox gene family. Moreover, genes involved in fatty acid metabolism were also differentially expressed. Therefore we suppose that changes in gene expression profiles might account for depot-specific differences in lipid composition. Indeed, triglycerides and fatty acids of abdominal adipose tissue were more saturated compared to triglycerides and fatty acids in gluteal adipose tissue. Taken together, our results uncover clear differences between abdominal and gluteal adipose tissue on the gene expression and DNA methylation level as well as in fatty acid composition. Therefore, a detailed molecular characterization of adipose tissue depots will be essential to develop new treatment strategies for metabolic syndrome associated complications.

  7. Reranking candidate gene models with cross-species comparison for improved gene prediction

    PubMed Central

    Liu, Qian; Crammer, Koby; Pereira, Fernando CN; Roos, David S

    2008-01-01

    Background Most gene finders score candidate gene models with state-based methods, typically HMMs, by combining local properties (coding potential, splice donor and acceptor patterns, etc). Competing models with similar state-based scores may be distinguishable with additional information. In particular, functional and comparative genomics datasets may help to select among competing models of comparable probability by exploiting features likely to be associated with the correct gene models, such as conserved exon/intron structure or protein sequence features. Results We have investigated the utility of a simple post-processing step for selecting among a set of alternative gene models, using global scoring rules to rerank competing models for more accurate prediction. For each gene locus, we first generate the K best candidate gene models using the gene finder Evigan, and then rerank these models using comparisons with putative orthologous genes from closely-related species. Candidate gene models with lower scores in the original gene finder may be selected if they exhibit strong similarity to probable orthologs in coding sequence, splice site location, or signal peptide occurrence. Experiments on Drosophila melanogaster demonstrate that reranking based on cross-species comparison outperforms the best gene models identified by Evigan alone, and also outperforms the comparative gene finders GeneWise and Augustus+. Conclusion Reranking gene models with cross-species comparison improves gene prediction accuracy. This straightforward method can be readily adapted to incorporate additional lines of evidence, as it requires only a ranked source of candidate gene models. PMID:18854050
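
    The reranking step can be pictured as a second sort of the K best candidates using a score that mixes the gene finder's own score with evidence from orthologs. The abstract does not give the scoring rule, so the additive combination, the `weight` parameter, and the field names below are purely hypothetical stand-ins.

```python
def rerank(candidates, weight=1.0):
    """Sort candidate gene models by gene-finder score plus a weighted
    ortholog-similarity term (highest combined score first)."""
    return sorted(
        candidates,
        key=lambda c: c["finder_score"] + weight * c["ortholog_similarity"],
        reverse=True,
    )


if __name__ == "__main__":
    # Two hypothetical models for one locus: m1 is preferred by the gene
    # finder, m2 agrees better with putative orthologs.
    candidates = [
        {"model": "m1", "finder_score": 0.9, "ortholog_similarity": 0.1},
        {"model": "m2", "finder_score": 0.8, "ortholog_similarity": 0.7},
    ]
    print([c["model"] for c in rerank(candidates)])
    print([c["model"] for c in rerank(candidates, weight=0.0)])
```

    Setting `weight` to zero reproduces the gene finder's original ordering, which makes it easy to see when the cross-species evidence changes the selected model.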

  8. Genome-wide analysis reveals inositol, not choline, as the major effector of Ino2p-Ino4p and unfolded protein response target gene expression in yeast.

    PubMed

    Jesch, Stephen A; Zhao, Xin; Wells, Martin T; Henry, Susan A

    2005-03-11

    In the yeast Saccharomyces cerevisiae, the transcription of many genes encoding enzymes of phospholipid biosynthesis is repressed in cells grown in the presence of the phospholipid precursors inositol and choline. A genome-wide approach using cDNA microarray technology was used to profile the changes in the expression of all genes in yeast that respond to the exogenous presence of inositol and choline. We report that the global response to inositol is completely distinct from the effect of choline. Whereas the effect of inositol on gene expression was primarily repressing, the effect of choline on gene expression was activating. Moreover, the combination of inositol and choline increased the number of repressed genes compared with inositol alone and enhanced the repression levels of a subset of genes that responded to inositol. In all, 110 genes were repressed in the presence of inositol and choline. Two distinct sets of genes exhibited differential expression in response to inositol or the combination of inositol and choline in wild-type cells. One set of genes contained the UASINO sequence and were bound by Ino2p and Ino4p. Many of these genes were also negatively regulated by OPI1, suggesting a common regulatory mechanism for Ino2p, Ino4p, and Opi1p. Another nonoverlapping set of genes was coregulated by the unfolded protein response pathway, an ER-localized stress response pathway, but was not dependent on OPI1 and did not show further repression when choline was present together with inositol. These results suggest that inositol is the major effector of target gene expression, whereas choline plays a minor role.

  9. Genome Wide Analysis Reveals Inositol, not Choline, as the Major Effector of Ino2p-Ino4p and Unfolded Protein Response Target Gene Expression in Yeast

    PubMed Central

    Jesch, Stephen A.; Zhao, Xin; Wells, Martin T.; Henry, Susan A.

    2005-01-01

    SUMMARY In the yeast Saccharomyces cerevisiae, the transcription of many genes encoding enzymes of phospholipid biosynthesis is repressed in cells grown in the presence of the phospholipid precursors inositol and choline. A genome-wide approach using cDNA microarray technology was utilized to profile the changes in the expression of all genes in yeast that respond to the exogenous presence of inositol and choline. We report that the global response to inositol is completely distinct from the effect of choline. Whereas the effect of inositol on gene expression was primarily repressing, the effect of choline on gene expression was activating. Moreover, the combination of inositol and choline increased the number of repressed genes compared to inositol alone and enhanced the repression levels of a subset of genes that responded to inositol. In all, 110 genes were repressed in the presence of inositol and choline. Two distinct sets of genes exhibited differential expression in response to inositol or the combination of inositol and choline in wild type cells. One set of genes contained the UASINO sequence and were bound by Ino2p and Ino4p. Many of these genes were also negatively regulated by OPI1, suggesting a common regulatory mechanism for Ino2p, Ino4p, and Opi1p. Another non-overlapping set of genes was coregulated by the unfolded protein response pathway, an ER-localized stress response pathway, but was not dependent on OPI1 and did not show further repression when choline was present together with inositol. These results suggest that inositol is the major effector of target gene expression, while choline plays a minor role. PMID:15611057

  10. GOTree Machine (GOTM): a web-based platform for interpreting sets of interesting genes using Gene Ontology hierarchies

    PubMed Central

    Zhang, Bing; Schmoyer, Denise; Kirov, Stefan; Snoddy, Jay

    2004-01-01

    Background Microarray and other high-throughput technologies are producing large sets of interesting genes that are difficult to analyze directly. Bioinformatics tools are needed to interpret the functional information in the gene sets. Results We have created a web-based tool for data analysis and data visualization for sets of genes called GOTree Machine (GOTM). This tool was originally intended to analyze sets of co-regulated genes identified from microarray analysis but is adaptable for use with other gene sets from other high-throughput analyses. GOTree Machine generates a GOTree, a tree-like structure to navigate the Gene Ontology Directed Acyclic Graph for input gene sets. This system provides user-friendly data navigation and visualization. Statistical analysis helps users to identify the most important Gene Ontology categories for the input gene sets and suggests biological areas that warrant further study. GOTree Machine is available online. Conclusion GOTree Machine has a broad application in functional genomic, proteomic and other high-throughput methods that generate large sets of interesting genes; its primary purpose is to help users sort for interesting patterns in gene sets. PMID:14975175

  11. MARQ: an online tool to mine GEO for experiments with similar or opposite gene expression signatures.

    PubMed

    Vazquez, Miguel; Nogales-Cadenas, Ruben; Arroyo, Javier; Botías, Pedro; García, Raul; Carazo, Jose M; Tirado, Francisco; Pascual-Montano, Alberto; Carmona-Saez, Pedro

    2010-07-01

    The enormous amount of data available in public gene expression repositories such as Gene Expression Omnibus (GEO) offers an inestimable resource to explore gene expression programs across several organisms and conditions. This information can be used to discover experiments that induce similar or opposite gene expression patterns to a given query, which in turn may lead to the discovery of new relationships among diseases, drugs or pathways, as well as the generation of new hypotheses. In this work, we present MARQ, a web-based application that allows researchers to compare a query set of genes, e.g. a set of over- and under-expressed genes, against a signature database built from GEO datasets for different organisms and platforms. MARQ offers an easy-to-use and integrated environment to mine GEO, in order to identify conditions that induce similar or opposite gene expression patterns to a given experimental condition. MARQ also includes additional functionalities for the exploration of the results, including a meta-analysis pipeline to find genes that are differentially expressed across different experiments. The application is freely available at http://marq.dacya.ucm.es.

  12. Diversification of Root Hair Development Genes in Vascular Plants.

    PubMed

    Huang, Ling; Shi, Xinhui; Wang, Wenjia; Ryu, Kook Hui; Schiefelbein, John

    2017-07-01

    The molecular genetic program for root hair development has been studied intensively in Arabidopsis (Arabidopsis thaliana). To understand the extent to which this program might operate in other plants, we conducted a large-scale comparative analysis of root hair development genes from diverse vascular plants, including eudicots, monocots, and a lycophyte. Combining phylogenetics and transcriptomics, we discovered conservation of a core set of root hair genes across all vascular plants, which may derive from an ancient program for unidirectional cell growth coopted for root hair development during vascular plant evolution. Interestingly, we also discovered preferential diversification in the structure and expression of root hair development genes, relative to other root hair- and root-expressed genes, among these species. These differences enabled the definition of sets of genes and gene functions that were acquired or lost in specific lineages during vascular plant evolution. In particular, we found substantial divergence in the structure and expression of genes used for root hair patterning, suggesting that the Arabidopsis transcriptional regulatory mechanism is not shared by other species. To our knowledge, this study provides the first comprehensive view of gene expression in a single plant cell type across multiple species. © 2017 American Society of Plant Biologists. All Rights Reserved.

  13. Diversification of Root Hair Development Genes in Vascular Plants

    PubMed Central

    Shi, Xinhui; Wang, Wenjia; Ryu, Kook Hui

    2017-01-01

    The molecular genetic program for root hair development has been studied intensively in Arabidopsis (Arabidopsis thaliana). To understand the extent to which this program might operate in other plants, we conducted a large-scale comparative analysis of root hair development genes from diverse vascular plants, including eudicots, monocots, and a lycophyte. Combining phylogenetics and transcriptomics, we discovered conservation of a core set of root hair genes across all vascular plants, which may derive from an ancient program for unidirectional cell growth coopted for root hair development during vascular plant evolution. Interestingly, we also discovered preferential diversification in the structure and expression of root hair development genes, relative to other root hair- and root-expressed genes, among these species. These differences enabled the definition of sets of genes and gene functions that were acquired or lost in specific lineages during vascular plant evolution. In particular, we found substantial divergence in the structure and expression of genes used for root hair patterning, suggesting that the Arabidopsis transcriptional regulatory mechanism is not shared by other species. To our knowledge, this study provides the first comprehensive view of gene expression in a single plant cell type across multiple species. PMID:28487476

  14. Uterine responses to early pre-attachment embryos in the domestic dog and comparisons with other domestic animal species.

    PubMed

    Graubner, Felix R; Gram, Aykut; Kautz, Ewa; Bauersachs, Stefan; Aslan, Selim; Agaoglu, Ali R; Boos, Alois; Kowalewski, Mariusz P

    2017-08-01

    In the dog, there is no luteolysis in the absence of pregnancy. Thus, this species lacks any anti-luteolytic endocrine signal as found in other species that modulate uterine function during the critical period of pregnancy establishment. Nevertheless, in the dog an embryo-maternal communication must occur in order to prevent rejection of embryos. Based on this hypothesis, we performed microarray analysis of canine uterine samples collected during pre-attachment phase (days 10-12) and in corresponding non-pregnant controls, in order to elucidate the embryo attachment signal. An additional goal was to identify differences in uterine responses to pre-attachment embryos between dogs and other mammalian species exhibiting different reproductive patterns with regard to luteolysis, implantation, and preparation for placentation. Therefore, the canine microarray data were compared with gene sets from pigs, cattle, horses, and humans. We found 412 genes differentially regulated between the two experimental groups. The functional terms most strongly enriched in response to pre-attachment embryos related to extracellular matrix function and remodeling, and to immune and inflammatory responses. Several candidate genes were validated by semi-quantitative PCR. When compared with other species, best matches were found with human and equine counterparts. Especially for the pig, the majority of overlapping genes showed opposite expression patterns. Interestingly, 1926 genes did not pair with any of the other gene sets. Using a microarray approach, we report the uterine changes in the dog driven by the presence of embryos and compare these results with datasets from other mammalian species, finding common-, contrary-, and exclusively canine-regulated genes. © The Authors 2017. Published by Oxford University Press on behalf of Society for the Study of Reproduction.

  15. Characteristics of genomic signatures derived using univariate methods and mechanistically anchored functional descriptors for predicting drug- and xenobiotic-induced nephrotoxicity.

    PubMed

    Shi, Weiwei; Bugrim, Andrej; Nikolsky, Yuri; Nikolskya, Tatiana; Brennan, Richard J

    2008-01-01

    ABSTRACT The ideal toxicity biomarker is composed of the properties of prediction (is detected prior to traditional pathological signs of injury), accuracy (high sensitivity and specificity), and mechanistic relationships to the endpoint measured (biological relevance). Gene expression-based toxicity biomarkers ("signatures") have shown good predictive power and accuracy, but are difficult to interpret biologically. We have compared different statistical methods of feature selection with knowledge-based approaches, using GeneGo's database of canonical pathway maps, to generate gene sets for the classification of renal tubule toxicity. The gene set selection algorithms include four univariate analyses: t-statistics, fold-change, B-statistics, and RankProd, and their combination and overlap for the identification of differentially expressed probes. Enrichment analysis following the results of the four univariate analyses, the Hotelling T-square test, and, finally, out-of-bag selection, a variant of cross-validation, were used to identify canonical pathway maps (sets of genes coordinately involved in key biological processes) with classification power. Differentially expressed genes identified by the different statistical univariate analyses all generated reasonably performing classifiers of tubule toxicity. Maps identified by enrichment analysis or Hotelling T-square had lower classification power, but highlighted perturbed lipid homeostasis as a common discriminator of nephrotoxic treatments. The out-of-bag method yielded the best functionally integrated classifier. The map "ephrins signaling" performed comparably to a classifier derived using sparse linear programming, a machine learning algorithm, and represents a signaling network specifically involved in renal tubule development and integrity. Such functional descriptors of toxicity promise to better integrate predictive toxicogenomics with mechanistic analysis, facilitating the interpretation and risk assessment of predictive genomic investigations.

  16. A comparative analysis of biclustering algorithms for gene expression data

    PubMed Central

    Eren, Kemal; Deveci, Mehmet; Küçüktunç, Onur; Çatalyürek, Ümit V.

    2013-01-01

    The need to analyze high-dimension biological data is driving the development of new data mining methods. Biclustering algorithms have been successfully applied to gene expression data to discover local patterns, in which a subset of genes exhibit similar expression levels over a subset of conditions. However, it is not clear which algorithms are best suited for this task. Many algorithms have been published in the past decade, most of which have been compared only to a small number of algorithms. Surveys and comparisons exist in the literature, but because of the large number and variety of biclustering algorithms, they are quickly outdated. In this article we partially address this problem of evaluating the strengths and weaknesses of existing biclustering methods. We used the BiBench package to compare 12 algorithms, many of which were recently published or have not been extensively studied. The algorithms were tested on a suite of synthetic data sets to measure their performance on data with varying conditions, such as different bicluster models, varying noise, varying numbers of biclusters and overlapping biclusters. The algorithms were also tested on eight large gene expression data sets obtained from the Gene Expression Omnibus. Gene Ontology enrichment analysis was performed on the resulting biclusters, and the best enrichment terms are reported. Our analyses show that the biclustering method and its parameters should be selected based on the desired model, whether that model allows overlapping biclusters, and its robustness to noise. In addition, we observe that the biclustering algorithms capable of finding more than one model are more successful at capturing biologically relevant clusters. PMID:22772837

  17. Follow up of a robust meta-signature to identify Zika virus infection in Aedes aegypti: another brick in the wall.

    PubMed

    Fukutani, Eduardo; Rodrigues, Moreno; Kasprzykowski, José Irahe; Araujo, Cintia Figueiredo de; Paschoal, Alexandre Rossi; Ramos, Pablo Ivan Pereira; Fukutani, Kiyoshi Ferreira; Queiroz, Artur Trancoso Lopo de

    2018-01-01

    The mosquito Aedes aegypti is the main vector of several arthropod-borne diseases that have global impacts. In a previous meta-analysis, our group identified a vector gene set containing 110 genes strongly associated with infections of dengue, West Nile and yellow fever viruses. Of these 110 genes, four genes allowed a highly accurate classification of infected status. More recently, a new study of Ae. aegypti infected with Zika virus (ZIKV) was published, providing new data to investigate whether this "infection" gene set is also altered during a ZIKV infection. Our hypothesis is that the infection-associated signature may also serve as a proxy to classify the ZIKV infection in the vector. Raw data associated with the NCBI/BioProject were downloaded and re-analysed. A total of 18 paired-end replicates corresponding to three ZIKV-infected samples and three controls were included in this study. The nMDS technique with a logistic regression was used to obtain the probabilities of belonging to a given class. Thus, to compare both gene sets, we used the area under the curve and performed a comparison using the bootstrap method. Our meta-signature was able to separate the infected mosquitoes from the controls with good predictive power to classify the Zika-infected mosquitoes.

  18. In vitro osteogenic/dentinogenic potential of an experimental calcium aluminosilicate cement

    PubMed Central

    Eid, Ashraf A.; Niu, Li-na; Primus, Carolyn M.; Opperman, Lynne A.; Watanabe, Ikuya; Pashley, David H.; Tay, Franklin R.

    2013-01-01

    Introduction Calcium aluminosilicate cements are fast-setting, acid-resistant, bioactive cements that may be used as root-repair materials. This study examined the osteogenic/dentinogenic potential of an experimental calcium aluminosilicate cement (Quick-Set) using a murine odontoblast-like cell model. Methods Quick-Set and white ProRoot MTA (WMTA) were mixed with the proprietary gel or deionized water, allowed to set completely in 100% relative humidity and aged in complete growth medium for 2 weeks until rendered non-cytotoxic. Similarly-aged Teflon discs were used as negative control. The MDPC-23 cell-line was used for evaluating changes in mRNA expressions of genes associated with osteogenic/dentinogenic differentiation and mineralization (qRT-PCR) alkaline phosphatase enzyme production and extracellular matrix mineralization (Alizarin red-S staining). Results After MDPC-23 cells were incubated with the materials in osteogenic differentiation medium for 1 week, both cements showed upregulation in ALP and DSPP expression. Fold increases in these two genes were not significantly different between Quick-Set and WMTA. Both cements showed no statistically significant upregulation/downregulation in RUNX2, OCN, BSP and DMP1 gene expression compared with Teflon. Alkaline phosphatase activity of cells cultured on Quick-Set and WMTA were not significantly different at 1 week or 2 weeks, but were significantly higher (p<0.05) than Teflon in both weeks. Both cements showed significantly higher calcium deposition compared with Teflon after 3 weeks of incubation in mineralizing medium (p<0.001). Differences between Quick-Set and WMTA were not statistically significant. Conclusions The experimental calcium aluminosilicate cement exhibits similar osteogenic/dentinogenic properties to WMTA and may be a potential substitute for commercially-available tricalcium silicate cements. PMID:23953291

  19. Updated clusters of orthologous genes for Archaea: a complex ancestor of the Archaea and the byways of horizontal gene transfer.

    PubMed

    Wolf, Yuri I; Makarova, Kira S; Yutin, Natalya; Koonin, Eugene V

    2012-12-14

    Collections of Clusters of Orthologous Genes (COGs) provide indispensable tools for comparative genomic analysis, evolutionary reconstruction and functional annotation of new genomes. Initially, COGs were made for all complete genomes of cellular life forms that were available at the time. However, with the accumulation of thousands of complete genomes, construction of a comprehensive COG set has become extremely computationally demanding and prone to error propagation, necessitating the switch to taxon-specific COG collections. Previously, we reported the collection of COGs for 41 genomes of Archaea (arCOGs). Here we present a major update of the arCOGs and describe evolutionary reconstructions to reveal general trends in the evolution of Archaea. The updated version of the arCOG database incorporates 91% of the pangenome of 120 archaea (251,032 protein-coding genes altogether) into 10,335 arCOGs. Using this new set of arCOGs, we performed maximum likelihood reconstruction of the genome content of archaeal ancestral forms and gene gain and loss events in archaeal evolution. This reconstruction shows that the last Common Ancestor of the extant Archaea was an organism of greater complexity than most of the extant archaea, probably with over 2,500 protein-coding genes. The subsequent evolution of almost all archaeal lineages was apparently dominated by gene loss resulting in genome streamlining. Overall, in the evolution of Archaea as well as a representative set of bacteria that was similarly analyzed for comparison, gene losses are estimated to outnumber gene gains at least 4 to 1. Analysis of specific patterns of gene gain in Archaea shows that, although some groups, in particular Halobacteria, acquire substantially more genes than others, on the whole, gene exchange between major groups of Archaea appears to be largely random, with no major 'highways' of horizontal gene transfer. The updated collection of arCOGs is expected to become a key resource for comparative genomics, evolutionary reconstruction and functional annotation of new archaeal genomes. Given that, in spite of the major increase in the number of genomes, the conserved core of archaeal genes appears to be stabilizing, the major evolutionary trends revealed here have a chance to stand the test of time. This article was reviewed by (for complete reviews see the Reviewers' Reports section): Dr. PLG, Prof. PF, Dr. PL (nominated by Prof. JPG).

  20. Gene set analysis using variance component tests.

    PubMed

    Huang, Yen-Tsung; Lin, Xihong

    2013-06-28

    Gene set analyses have become increasingly important in genomic research, as many complex diseases result jointly from alterations in numerous genes. Genes often coordinate together as a functional repertoire, e.g., a biological pathway/network, and are highly correlated. However, most of the existing gene set analysis methods do not fully account for the correlation among the genes. Here we propose to tackle this important feature of a gene set to improve statistical power in gene set analyses. We propose to model the effects of an independent variable, e.g., exposure/biological status (yes/no), on multiple gene expression values in a gene set using a multivariate linear regression model, where the correlation among the genes is explicitly modeled using a working covariance matrix. We develop TEGS (Test for the Effect of a Gene Set), a variance component test for the gene set effects by assuming a common distribution for regression coefficients in multivariate linear regression models, and calculate the p-values using permutation and a scaled chi-square approximation. We show using simulations that type I error is protected under different choices of working covariance matrices and power is improved as the working covariance approaches the true covariance. The global test is a special case of TEGS when correlation among genes in a gene set is ignored. Using both simulation data and a published diabetes dataset, we show that our test outperforms the commonly used approaches, the global test and gene set enrichment analysis (GSEA). We develop a gene set analysis method (TEGS) under the multivariate regression framework, which directly models the interdependence of the expression values in a gene set using a working covariance. TEGS outperforms two widely used methods, GSEA and the global test, in both simulations and a diabetes microarray dataset.
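    To make the permutation idea in this abstract concrete, here is a deliberately simplified sketch. It is not TEGS: it ignores the correlation among genes entirely and uses an assumed statistic (the sum over genes of the squared difference in group means), so it illustrates only how a permutation p-value for a set-level statistic is obtained. The function names and the toy data are invented for this example.

```python
import random


def set_statistic(expr, labels):
    """Sum over genes of the squared difference in group means.

    expr   : list of per-gene lists, one expression value per sample
    labels : list of 0/1 group labels, one per sample
    """
    n1 = sum(labels)
    n0 = len(labels) - n1
    total = 0.0
    for gene in expr:
        m1 = sum(v for v, g in zip(gene, labels) if g == 1) / n1
        m0 = sum(v for v, g in zip(gene, labels) if g == 0) / n0
        total += (m1 - m0) ** 2
    return total


def permutation_pvalue(expr, labels, n_perm=2000, seed=0):
    """Estimate a p-value by shuffling sample labels (gene values stay fixed)."""
    rng = random.Random(seed)
    observed = set_statistic(expr, labels)
    perm = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(perm)
        if set_statistic(expr, perm) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Toy gene set: two genes shifted between groups, one unchanged (invented values).
labels = [0, 0, 0, 0, 1, 1, 1, 1]
expr = [
    [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1],
    [2.0, 2.1, 1.9, 2.2, 4.0, 4.1, 3.9, 4.2],
    [0.5, 0.4, 0.6, 0.5, 0.5, 0.6, 0.4, 0.5],
]

observed = set_statistic(expr, labels)
p_value = permutation_pvalue(expr, labels)
```

    Shuffling labels rather than genes keeps the between-gene correlation structure intact in every permuted dataset, which is why permutation remains valid even when the statistic itself ignores that correlation.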

  1. Combined Large-Scale Phenotyping and Transcriptomics in Maize Reveals a Robust Growth Regulatory Network

    PubMed Central

    Herman, Dorota; Slabbinck, Bram; Pè, Mario Enrico

    2016-01-01

    Leaves are vital organs for biomass and seed production because of their role in the generation of metabolic energy and organic compounds. A better understanding of the molecular networks underlying leaf development is crucial to sustain global requirements for food and renewable energy. Here, we combined transcriptome profiling of proliferative leaf tissue with in-depth phenotyping of the fourth leaf at later stages of development in 197 recombinant inbred lines of two different maize (Zea mays) populations. Previously, correlation analysis in a classical biparental mapping population identified 1,740 genes correlated with at least one of 14 traits. Here, we extended these results with data from a multiparent advanced generation intercross population. As expected, the phenotypic variability was found to be larger in the latter population than in the biparental population, although general conclusions on the correlations among the traits are comparable. Data integration from the two diverse populations allowed us to identify a set of 226 genes that are robustly associated with diverse leaf traits. This set of genes is enriched for transcriptional regulators and genes involved in protein synthesis and cell wall metabolism. In order to investigate the molecular network context of the candidate gene set, we integrated our data with publicly available functional genomics data and identified a growth regulatory network of 185 genes. Our results illustrate the power of combining in-depth phenotyping with transcriptomics in mapping populations to dissect the genetic control of complex traits and present a set of candidate genes for use in biomass improvement. PMID:26754667

  2. Combined Large-Scale Phenotyping and Transcriptomics in Maize Reveals a Robust Growth Regulatory Network.

    PubMed

    Baute, Joke; Herman, Dorota; Coppens, Frederik; De Block, Jolien; Slabbinck, Bram; Dell'Acqua, Matteo; Pè, Mario Enrico; Maere, Steven; Nelissen, Hilde; Inzé, Dirk

    2016-03-01

    Leaves are vital organs for biomass and seed production because of their role in the generation of metabolic energy and organic compounds. A better understanding of the molecular networks underlying leaf development is crucial to sustain global requirements for food and renewable energy. Here, we combined transcriptome profiling of proliferative leaf tissue with in-depth phenotyping of the fourth leaf at later stages of development in 197 recombinant inbred lines of two different maize (Zea mays) populations. Previously, correlation analysis in a classical biparental mapping population identified 1,740 genes correlated with at least one of 14 traits. Here, we extended these results with data from a multiparent advanced generation intercross population. As expected, the phenotypic variability was found to be larger in the latter population than in the biparental population, although general conclusions on the correlations among the traits are comparable. Data integration from the two diverse populations allowed us to identify a set of 226 genes that are robustly associated with diverse leaf traits. This set of genes is enriched for transcriptional regulators and genes involved in protein synthesis and cell wall metabolism. In order to investigate the molecular network context of the candidate gene set, we integrated our data with publicly available functional genomics data and identified a growth regulatory network of 185 genes. Our results illustrate the power of combining in-depth phenotyping with transcriptomics in mapping populations to dissect the genetic control of complex traits and present a set of candidate genes for use in biomass improvement. © 2016 American Society of Plant Biologists. All Rights Reserved.

  3. GARNET--gene set analysis with exploration of annotation relations.

    PubMed

    Rho, Kyoohyoung; Kim, Bumjin; Jang, Youngjun; Lee, Sanghyun; Bae, Taejeong; Seo, Jihae; Seo, Chaehwa; Lee, Jihyun; Kang, Hyunjung; Yu, Ungsik; Kim, Sunghoon; Lee, Sanghyuk; Kim, Wan Kyu

    2011-02-15

    Gene set analysis is a powerful method of deducing biological meaning for an a priori defined set of genes. Numerous tools have been developed to test statistical enrichment or depletion in specific pathways or gene ontology (GO) terms. Major difficulties towards biological interpretation are integrating diverse types of annotation categories and exploring the relationships between annotation terms of similar information. GARNET (Gene Annotation Relationship NEtwork Tools) is an integrative platform for gene set analysis with many novel features. It includes tools for retrieval of genes from annotation database, statistical analysis & visualization of annotation relationships, and managing gene sets. In an effort to allow access to a full spectrum of amassed biological knowledge, we have integrated a variety of annotation data that include the GO, domain, disease, drug, chromosomal location, and custom-defined annotations. Diverse types of molecular networks (pathways, transcription and microRNA regulations, protein-protein interaction) are also included. The pair-wise relationship between annotation gene sets was calculated using kappa statistics. GARNET consists of three modules--gene set manager, gene set analysis and gene set retrieval, which are tightly integrated to provide virtually automatic analysis for gene sets. A dedicated viewer for annotation network has been developed to facilitate exploration of the related annotations. GARNET (gene annotation relationship network tools) is an integrative platform for diverse types of gene set analysis, where complex relationships among gene annotations can be easily explored with an intuitive network visualization tool (http://garnet.isysbio.org/ or http://ercsb.ewha.ac.kr/garnet/).
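    The abstract states that pairwise relationships between annotation gene sets were calculated using kappa statistics but gives no formula. One common reading, assumed here, is Cohen's kappa on binary gene membership over a shared gene universe. The function name, the handling of the degenerate case, and the toy gene identifiers below are all choices made for this sketch, not details taken from GARNET.

```python
def kappa(set_a, set_b, universe):
    """Cohen's kappa for agreement of two gene sets on membership in `universe`."""
    n = len(universe)
    a = set_a & universe
    b = set_b & universe
    both = len(a & b)               # genes annotated to both terms
    neither = n - len(a | b)        # genes annotated to neither term
    p_observed = (both + neither) / n
    pa = len(a) / n
    pb = len(b) / n
    # Agreement expected by chance if membership were independent.
    p_expected = pa * pb + (1 - pa) * (1 - pb)
    if p_expected == 1.0:
        # Degenerate case (both sets empty or both equal to the universe);
        # returning 1.0 is a convention assumed for this sketch.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Toy example with invented gene identifiers.
universe = {f"g{i}" for i in range(10)}
term_x = {"g0", "g1", "g2", "g3"}
term_y = {"g2", "g3", "g4", "g5"}

k_xy = kappa(term_x, term_y, universe)
k_xx = kappa(term_x, term_x, universe)
```

    Here the two terms share 2 of their 4 genes, which gives observed agreement 0.6 against chance agreement 0.52, so `k_xy` is 0.08 / 0.48, roughly 0.17, while identical sets give 1.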

  4. Gene Expression in Wilms’ Tumor Mimics the Earliest Committed Stage in the Metanephric Mesenchymal-Epithelial Transition

    PubMed Central

    Li, Chi-Ming; Guo, Meirong; Borczuk, Alain; Powell, Charles A.; Wei, Michelle; Thaker, Harshwardhan M.; Friedman, Richard; Klein, Ulf; Tycko, Benjamin

    2002-01-01

    Wilms’ tumor (WT) has been considered a prototype for arrested cellular differentiation in cancer, but previous studies have relied on selected markers. We have now performed an unbiased survey of gene expression in WTs using oligonucleotide microarrays. Statistical criteria identified 357 genes as differentially expressed between WTs and fetal kidneys. This set contained 124 matches to genes on a microarray used by Stuart and colleagues (Stuart RO, Bush KT, Nigam SK: Changes in global gene expression patterns during development and maturation of the rat kidney. Proc Natl Acad Sci USA 2001, 98:5649–5654) to establish genes with stage-specific expression in the developing rat kidney. Mapping between the two data sets showed that WTs systematically overexpressed genes corresponding to the earliest stage of metanephric development, and underexpressed genes corresponding to later stages. Automated clustering identified a smaller group of 27 genes that were highly expressed in WTs compared to fetal kidney and heterologous tumor and normal tissues. This signature set was enriched in genes encoding transcription factors. Four of these, PAX2, EYA1, HBF2, and HOXA11, are essential for cell survival and proliferation in early metanephric development, whereas others, including SIX1, MOX1, and SALL2, are predicted to act at this stage. SIX1 and SALL2 proteins were expressed in the condensing mesenchyme in normal human fetal kidneys, but were absent (SIX1) or reduced (SALL2) in cells at other developmental stages. These data imply that the blastema in WTs has progressed to the committed stage in the mesenchymal-epithelial transition, where it is partially arrested in differentiation. The WT-signature set also contained the Wnt receptor FZD7, the tumor antigen PRAME, the imprinted gene NNAT and the metastasis-associated transcription factor E1AF. PMID:12057921

  5. Alteration of topoisomerase II-alpha gene in human breast cancer: association with responsiveness to anthracycline-based chemotherapy.

    PubMed

    Press, Michael F; Sauter, Guido; Buyse, Marc; Bernstein, Leslie; Guzman, Roberta; Santiago, Angela; Villalobos, Ivonne E; Eiermann, Wolfgang; Pienkowski, Tadeusz; Martin, Miguel; Robert, Nicholas; Crown, John; Bee, Valerie; Taupin, Henry; Flom, Kerry J; Tabah-Fisch, Isabelle; Pauletti, Giovanni; Lindsay, Mary-Ann; Riva, Alessandro; Slamon, Dennis J

    2011-03-01

    Approximately 35% of HER2-amplified breast cancers have coamplification of the topoisomerase II-alpha (TOP2A) gene encoding an enzyme that is a major target of anthracyclines. This study was designed to evaluate whether TOP2A gene alterations may predict incremental responsiveness to anthracyclines in some breast cancers. A total of 4,943 breast cancers were analyzed for alterations in TOP2A and HER2. Primary tumor tissues from patients with metastatic breast cancer treated in a trial of chemotherapy plus/minus trastuzumab were studied for amplification/deletion of TOP2A and HER2 as a test set followed by evaluation of malignancies from two separate, large trials for changes in these same genes as a validation set. Association between these alterations and clinical outcomes was determined. Test set cases containing HER2 amplification treated with doxorubicin and cyclophosphamide (AC) plus trastuzumab, demonstrated longer progression-free survival compared to those treated with AC alone (P = .0002). However, patients treated with AC alone whose tumors contain HER2/TOP2A coamplification experienced a similar improvement in survival (P = .004). Conversely, for patients treated with paclitaxel, HER2/TOP2A coamplification was not associated with improved outcomes. These observations were confirmed in a larger validation set, where HER2/TOP2A coamplification was again associated with longer survival when only anthracycline-containing chemotherapy was used for treatment compared with outcome in HER2-positive cancers lacking TOP2A coamplification. In a study involving nearly 5,000 breast malignancies, both test set and validation set demonstrate that TOP2A coamplification, not HER2 amplification, is the clinically useful predictive marker of an incremental response to anthracycline-based chemotherapy. Absence of HER2/TOP2A coamplification may indicate a more restricted efficacy advantage for breast cancers than previously thought.

  6. Alteration of Topoisomerase II–Alpha Gene in Human Breast Cancer: Association With Responsiveness to Anthracycline-Based Chemotherapy

    PubMed Central

    Press, Michael F.; Sauter, Guido; Buyse, Marc; Bernstein, Leslie; Guzman, Roberta; Santiago, Angela; Villalobos, Ivonne E.; Eiermann, Wolfgang; Pienkowski, Tadeusz; Martin, Miguel; Robert, Nicholas; Crown, John; Bee, Valerie; Taupin, Henry; Flom, Kerry J.; Tabah-Fisch, Isabelle; Pauletti, Giovanni; Lindsay, Mary-Ann; Riva, Alessandro; Slamon, Dennis J.

    2011-01-01

    Purpose: Approximately 35% of HER2-amplified breast cancers have coamplification of the topoisomerase II-alpha (TOP2A) gene encoding an enzyme that is a major target of anthracyclines. This study was designed to evaluate whether TOP2A gene alterations may predict incremental responsiveness to anthracyclines in some breast cancers. Methods: A total of 4,943 breast cancers were analyzed for alterations in TOP2A and HER2. Primary tumor tissues from patients with metastatic breast cancer treated in a trial of chemotherapy plus/minus trastuzumab were studied for amplification/deletion of TOP2A and HER2 as a test set followed by evaluation of malignancies from two separate, large trials for changes in these same genes as a validation set. Association between these alterations and clinical outcomes was determined. Results: Test set cases containing HER2 amplification treated with doxorubicin and cyclophosphamide (AC) plus trastuzumab, demonstrated longer progression-free survival compared to those treated with AC alone (P = .0002). However, patients treated with AC alone whose tumors contain HER2/TOP2A coamplification experienced a similar improvement in survival (P = .004). Conversely, for patients treated with paclitaxel, HER2/TOP2A coamplification was not associated with improved outcomes. These observations were confirmed in a larger validation set, where HER2/TOP2A coamplification was again associated with longer survival when only anthracycline-containing chemotherapy was used for treatment compared with outcome in HER2-positive cancers lacking TOP2A coamplification. Conclusion: In a study involving nearly 5,000 breast malignancies, both test set and validation set demonstrate that TOP2A coamplification, not HER2 amplification, is the clinically useful predictive marker of an incremental response to anthracycline-based chemotherapy. Absence of HER2/TOP2A coamplification may indicate a more restricted efficacy advantage for breast cancers than previously thought. PMID:21189395

  7. Sorting genomes by reciprocal translocations, insertions, and deletions.

    PubMed

    Qi, Xingqin; Li, Guojun; Li, Shuguang; Xu, Ying

    2010-01-01

    The problem of sorting by reciprocal translocations (abbreviated as SBT) arises in comparative genomics: the task is to find a shortest sequence of reciprocal translocations that transforms one genome Pi into another genome Gamma, with the restriction that Pi and Gamma contain the same genes. SBT has been proved to be polynomial-time solvable, and several polynomial algorithms have been developed. In this paper, we show how to extend Bergeron's SBT algorithm to include insertions and deletions, allowing comparison of genomes containing different genes. In particular, if the gene set of Pi is a subset (or superset, respectively) of the gene set of Gamma, we present an approximation algorithm for transforming Pi into Gamma by reciprocal translocations and deletions (insertions, respectively), providing a sorting sequence with length at most OPT + 2, where OPT is the minimum number of translocations and deletions (insertions, respectively) needed to transform Pi into Gamma; if Pi and Gamma have different genes and neither gene set contains the other, we give a heuristic to transform Pi into Gamma by a shortest sequence of reciprocal translocations, insertions, and deletions, with bounds for the length of the sorting sequence it outputs. At a conceptual level, there is some similarity between our algorithm and the algorithm developed by El Mabrouk which is used to sort two chromosomes with different gene contents by reversals, insertions, and deletions.

  8. Functional cohesion of gene sets determined by latent semantic indexing of PubMed abstracts.

    PubMed

    Xu, Lijing; Furlotte, Nicholas; Lin, Yunyue; Heinrich, Kevin; Berry, Michael W; George, Ebenezer O; Homayouni, Ramin

    2011-04-14

    High-throughput genomic technologies enable researchers to identify genes that are co-regulated with respect to specific experimental conditions. Numerous statistical approaches have been developed to identify differentially expressed genes. Because each approach can produce distinct gene sets, it is difficult for biologists to determine which statistical approach yields biologically relevant gene sets and is appropriate for their study. To address this issue, we implemented Latent Semantic Indexing (LSI) to determine the functional coherence of gene sets. An LSI model was built using over 1 million Medline abstracts for over 20,000 mouse and human genes annotated in Entrez Gene. The gene-to-gene LSI-derived similarities were used to calculate a literature cohesion p-value (LPv) for a given gene set using a Fisher's exact test. We tested this method against genes in more than 6,000 functional pathways annotated in Gene Ontology (GO) and found that approximately 75% of gene sets in GO biological process category and 90% of the gene sets in GO molecular function and cellular component categories were functionally cohesive (LPv<0.05). These results indicate that the LPv methodology is both robust and accurate. Application of this method to previously published microarray datasets demonstrated that LPv can be helpful in selecting the appropriate feature extraction methods. To enable real-time calculation of LPv for mouse or human gene sets, we developed a web tool called Gene-set Cohesion Analysis Tool (GCAT). GCAT can complement other gene set enrichment approaches by determining the overall functional cohesion of data sets, taking into account both explicit and implicit gene interactions reported in the biomedical literature. GCAT is freely available at http://binf1.memphis.edu/gcat.

  9. The Identification of Novel Diagnostic Marker Genes for the Detection of Beer Spoiling Pediococcus damnosus Strains Using the BlAst Diagnostic Gene findEr

    PubMed Central

    Schmid, Jonas; Zehe, Anja; Vogel, Rudi F.

    2016-01-01

    As the number of bacterial genomes increases dramatically, the demand for easy to use tools with transparent functionality and comprehensible output for applied comparative genomics grows as well. We present BlAst Diagnostic Gene findEr (BADGE), a tool for the rapid prediction of diagnostic marker genes (DMGs) for the differentiation of bacterial groups (e.g. pathogenic / nonpathogenic). DMG identification settings can be modified easily and installing and running BADGE does not require specific bioinformatics skills. During the BADGE run the user is informed step by step about the DMG finding process, thus making it easy to evaluate the impact of chosen settings and options. On the basis of an example with relevance for beer brewing, being one of the oldest biotechnological processes known, we show a straightforward procedure, from phenotyping, genome sequencing, assembly and annotation, up to a discriminant marker gene PCR assay, making comparative genomics a means to an end. The value and the functionality of BADGE were thoroughly examined, resulting in the successful identification and validation of an outstanding novel DMG (fabZ) for the discrimination of harmless and harmful contaminations of Pediococcus damnosus, which can be applied for spoilage risk determination in breweries. Concomitantly, we present and compare five complete P. damnosus genomes sequenced in this study, finding that the ability to produce the unwanted, spoilage associated off-flavor diacetyl is a plasmid encoded trait in this important beer spoiling species. PMID:27028007

  10. TEGS-CN: A Statistical Method for Pathway Analysis of Genome-wide Copy Number Profile.

    PubMed

    Huang, Yen-Tsung; Hsu, Thomas; Christiani, David C

    2014-01-01

    The effects of copy number alterations make up a significant part of the tumor genome profile, but pathway analyses of these alterations are still not well established. We proposed a novel method to analyze multiple copy numbers of genes within a pathway, termed Test for the Effect of a Gene Set with Copy Number data (TEGS-CN). TEGS-CN was adapted from TEGS, a method that we previously developed for gene expression data using a variance component score test. With additional development, we extend the method to analyze DNA copy number data, accounting for different sizes and thus various numbers of copy number probes in genes. The test statistic follows a mixture of χ² distributions that can be obtained using permutation with scaled χ² approximation. We conducted simulation studies to evaluate the size and the power of TEGS-CN and to compare its performance with TEGS. We analyzed genome-wide copy number data from 264 patients with non-small-cell lung cancer. With the Molecular Signatures Database (MSigDB) pathway database, the genome-wide copy number data can be classified into 1814 biological pathways or gene sets. We investigated associations of the copy number profile of the 1814 gene sets with pack-years of cigarette smoking. Our analysis revealed five pathways with significant P values after Bonferroni adjustment (<2.8 × 10⁻⁵), including the PTEN pathway (7.8 × 10⁻⁷), the gene set up-regulated under heat shock (3.6 × 10⁻⁶), the gene sets involved in the immune profile for rejection of kidney transplantation (9.2 × 10⁻⁶) and for transcriptional control of leukocytes (2.2 × 10⁻⁵), and the ganglioside biosynthesis pathway (2.7 × 10⁻⁵). In conclusion, we present a new method for pathway analyses of copy number data, and causal mechanisms of the five pathways require further study.
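
    The Bonferroni cutoff quoted in this abstract can be reproduced with a few lines of arithmetic. The sketch below assumes a family-wise significance level of 0.05, which the abstract does not state; dividing it by the 1814 gene sets gives roughly 2.76e-05, consistent with the reported threshold of about 2.8e-05. The dictionary keys are shortened labels chosen here for illustration, and the P values are the ones listed in the abstract.

```python
# Minimal sketch of the Bonferroni arithmetic behind the reported cutoff.
# ASSUMPTION: ALPHA = 0.05 is not stated in the abstract; it is the
# conventional level that happens to reproduce the quoted threshold.
ALPHA = 0.05
N_GENE_SETS = 1814  # number of MSigDB gene sets given in the abstract


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test cutoff that bounds the family-wise error rate by alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# P values as reported in the abstract (labels are shortened for illustration).
reported_p_values = {
    "PTEN pathway": 7.8e-7,
    "up-regulated under heat shock": 3.6e-6,
    "kidney transplant rejection immune profile": 9.2e-6,
    "transcriptional control of leukocytes": 2.2e-5,
    "ganglioside biosynthesis": 2.7e-5,
}

threshold = bonferroni_threshold(ALPHA, N_GENE_SETS)
significant = [name for name, p in reported_p_values.items() if p < threshold]

if __name__ == "__main__":
    print(f"Bonferroni threshold: {threshold:.2e}")
    for name in significant:
        print(f"significant: {name}")
```

    Under this assumption all five reported pathways fall below the cutoff, the ganglioside biosynthesis pathway only narrowly.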

  11. The Gene Set Builder: collation, curation, and distribution of sets of genes

    PubMed Central

    Yusuf, Dimas; Lim, Jonathan S; Wasserman, Wyeth W

    2005-01-01

    Background: In bioinformatics and genomics, there are many applications designed to investigate the common properties for a set of genes. Often, these multi-gene analysis tools attempt to reveal sequential, functional, and expressional ties. However, while tremendous effort has been invested in developing tools that can analyze a set of genes, minimal effort has been invested in developing tools that can help researchers compile, store, and annotate gene sets in the first place. As a result, the process of making or accessing a set often involves tedious and time consuming steps such as finding identifiers for each individual gene. These steps are often repeated extensively to shift from one identifier type to another; or to recreate a published set. In this paper, we present a simple online tool which – with the help of the gene catalogs Ensembl and GeneLynx – can help researchers build and annotate sets of genes quickly and easily. Description: The Gene Set Builder is a database-driven, web-based tool designed to help researchers compile, store, export, and share sets of genes. This application supports the 17 eukaryotic genomes found in version 32 of the Ensembl database, which includes species from yeast to human. User-created information such as sets and customized annotations are stored to facilitate easy access. Gene sets stored in the system can be "exported" in a variety of output formats – as lists of identifiers, in tables, or as sequences. In addition, gene sets can be "shared" with specific users to facilitate collaborations or fully released to provide access to published results. The application also features a Perl API (Application Programming Interface) for direct connectivity to custom analysis tools. A downloadable Quick Reference guide and an online tutorial are available to help new users learn its functionalities. Conclusion: The Gene Set Builder is an Ensembl-facilitated online tool designed to help researchers compile and manage sets of genes in a user-friendly environment. The application can be accessed via . PMID:16371163

  12. Accurate, Rapid Taxonomic Classification of Fungal Large-Subunit rRNA Genes

    PubMed Central

    Liu, Kuan-Liang; Porras-Alfaro, Andrea; Eichorst, Stephanie A.

    2012-01-01

    Taxonomic and phylogenetic fingerprinting based on sequence analysis of gene fragments from the large-subunit rRNA (LSU) gene or the internal transcribed spacer (ITS) region is becoming an integral part of fungal classification. The lack of an accurate and robust classification tool trained by a validated sequence database for taxonomic placement of fungal LSU genes is a severe limitation in taxonomic analysis of fungal isolates or large data sets obtained from environmental surveys. Using a hand-curated set of 8,506 fungal LSU gene fragments, we determined the performance characteristics of a naïve Bayesian classifier across multiple taxonomic levels and compared the classifier performance to that of a sequence similarity-based (BLASTN) approach. The naïve Bayesian classifier was computationally more rapid (>460-fold with our system) than the BLASTN approach, and it provided equal or superior classification accuracy. Classifier accuracies were compared using sequence fragments of 100 bp and 400 bp and two different PCR primer anchor points to mimic sequence read lengths commonly obtained using current high-throughput sequencing technologies. Accuracy was higher with 400-bp sequence reads than with 100-bp reads. It was also significantly affected by sequence location across the 1,400-bp test region. The highest accuracy was obtained across either the D1 or D2 variable region. The naïve Bayesian classifier provides an effective and rapid means to classify fungal LSU sequences from large environmental surveys. The training set and tool are publicly available through the Ribosomal Database Project (http://rdp.cme.msu.edu/classifier/classifier.jsp). PMID:22194300

  13. Analysis of high-throughput biological data using their rank values.

    PubMed

    Dembélé, Doulaye

    2018-01-01

    High-throughput biological technologies are routinely used to generate gene expression profiling or cytogenetics data. To achieve high performance, methods available in the literature become more specialized and often require high computational resources. Here, we propose a new versatile method based on the data-ordering rank values. We use linear algebra, the Perron-Frobenius theorem and also extend a method presented earlier for searching differentially expressed genes for the detection of recurrent copy number aberration. A result derived from the proposed method is a one-sample Student's t-test based on rank values. The proposed method is, to our knowledge, the only one that applies to both gene expression profiling and cytogenetics data sets. This new method is fast, deterministic, and requires a low computational load. Probabilities are associated with genes to allow a statistically significant subset selection in the data set. Stability scores are also introduced as quality parameters. The performance and comparative analyses were carried out using real data sets. The proposed method can be accessed through an R package available from the CRAN (Comprehensive R Archive Network) website: https://cran.r-project.org/web/packages/fcros .

  14. Prediction of cancer class with majority voting genetic programming classifier using gene expression data.

    PubMed

    Paul, Topon Kumar; Iba, Hitoshi

    2009-01-01

    In order to get a better understanding of different types of cancers and to find the possible biomarkers for diseases, recently, many researchers are analyzing the gene expression data using various machine learning techniques. However, due to a very small number of training samples compared to the huge number of genes and class imbalance, most of these methods suffer from overfitting. In this paper, we present a majority voting genetic programming classifier (MVGPC) for the classification of microarray data. Instead of a single rule or a single set of rules, we evolve multiple rules with genetic programming (GP) and then apply those rules to test samples to determine their labels with majority voting technique. By performing experiments on four different public cancer data sets, including multiclass data sets, we have found that the test accuracies of MVGPC are better than those of other methods, including AdaBoost with GP. Moreover, some of the more frequently occurring genes in the classification rules are known to be associated with the types of cancers being studied in this paper.
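
    The majority-voting step described in this abstract can be sketched as follows. This is only an illustration of the voting idea: the three rules, the gene names (geneA, geneB, geneC), the thresholds, and the class labels are hypothetical stand-ins for rules that genetic programming would evolve, and nothing here reproduces the authors' MVGPC implementation.

```python
from collections import Counter
from typing import Callable, Dict, Sequence

# A sample maps gene names to expression values; a rule maps a sample to a label.
Sample = Dict[str, float]
Rule = Callable[[Sample], str]


def majority_vote(rules: Sequence[Rule], sample: Sample) -> str:
    """Apply every rule to the sample and return the most frequent label.

    Ties are resolved in favour of the label that was voted first,
    which is how Counter.most_common orders equal counts.
    """
    if not rules:
        raise ValueError("at least one rule is required")
    votes = Counter(rule(sample) for rule in rules)
    return votes.most_common(1)[0][0]


# HYPOTHETICAL rules standing in for GP-evolved expressions over gene values.
example_rules = [
    lambda s: "tumor" if s["geneA"] > 2.0 else "normal",
    lambda s: "tumor" if s["geneB"] - s["geneC"] > 0.5 else "normal",
    lambda s: "tumor" if s["geneA"] * s["geneC"] > 3.0 else "normal",
]

if __name__ == "__main__":
    test_sample = {"geneA": 2.5, "geneB": 2.0, "geneC": 1.0}
    print(majority_vote(example_rules, test_sample))
```

    Using an odd number of rules for a two-class problem avoids ties; multiclass problems need an explicit tie-breaking policy such as the first-vote rule used here.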

  15. HisB as novel selection marker for gene targeting approaches in Aspergillus niger.

    PubMed

    Fiedler, Markus R M; Gensheimer, Tarek; Kubisch, Christin; Meyer, Vera

    2017-03-08

    For Aspergillus niger, a broad set of auxotrophic and dominant resistance markers is available. However, only few offer targeted modification of a gene of interest into or at a genomic locus of choice, which hampers functional genomics studies. We thus aimed to extend the available set by generating a histidine auxotrophic strain with a characterized hisB locus for targeted gene integration and deletion in A. niger. A histidine-auxotrophic strain was established via disruption of the A. niger hisB gene by using the counterselectable pyrG marker. After curing, a hisB⁻, pyrG⁻ strain was obtained, which served as recipient strain for further studies. We show here that both hisB orthologs from A. nidulans and A. niger can be used to reestablish histidine prototrophy in this recipient strain. Whereas the hisB gene from A. nidulans was suitable for efficient gene targeting at different loci in A. niger, the hisB gene from A. niger allowed efficient integration of a Tet-on driven luciferase reporter construct at the endogenous non-functional hisB locus. Subsequent analysis of the luciferase activity revealed that the hisB locus is tight under non-inducing conditions and allows even higher luciferase expression levels compared to the pyrG integration locus. Taken together, we provide here an alternative selection marker for A. niger, hisB, which allows efficient homologous integration rates as well as high expression levels which compare favorably to the well-established pyrG selection marker.

  16. Spectral gene set enrichment (SGSE).

    PubMed

    Frost, H Robert; Li, Zhigang; Moore, Jason H

    2015-03-03

    Gene set testing is typically performed in a supervised context to quantify the association between groups of genes and a clinical phenotype. In many cases, however, a gene set-based interpretation of genomic data is desired in the absence of a phenotype variable. Although methods exist for unsupervised gene set testing, they predominantly compute enrichment relative to clusters of the genomic variables with performance strongly dependent on the clustering algorithm and number of clusters. We propose a novel method, spectral gene set enrichment (SGSE), for unsupervised competitive testing of the association between gene sets and empirical data sources. SGSE first computes the statistical association between gene sets and principal components (PCs) using our principal component gene set enrichment (PCGSE) method. The overall statistical association between each gene set and the spectral structure of the data is then computed by combining the PC-level p-values using the weighted Z-method with weights set to the PC variance scaled by Tracy-Widom test p-values. Using simulated data, we show that the SGSE algorithm can accurately recover spectral features from noisy data. To illustrate the utility of our method on real data, we demonstrate the superior performance of the SGSE method relative to standard cluster-based techniques for testing the association between MSigDB gene sets and the variance structure of microarray gene expression data. Unsupervised gene set testing can provide important information about the biological signal held in high-dimensional genomic data sets. Because it uses the association between gene sets and samples PCs to generate a measure of unsupervised enrichment, the SGSE method is independent of cluster or network creation algorithms and, most importantly, is able to utilize the statistical significance of PC eigenvalues to ignore elements of the data most likely to represent noise.
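
    The weighted Z-method mentioned in this abstract combines one-sided p-values by converting each to a standard normal score z_i, forming Z = sum(w_i * z_i) / sqrt(sum(w_i^2)), and converting Z back to a p-value. The sketch below shows only this generic combination step using the Python standard library. The example p-values and weights are made up; in SGSE the weights are described as PC variances scaled by Tracy-Widom test p-values, which is not computed here.

```python
from math import sqrt
from statistics import NormalDist
from typing import Sequence

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def weighted_z_combine(p_values: Sequence[float], weights: Sequence[float]) -> float:
    """Combine one-sided p-values with the weighted Z-method.

    Each p-value must lie strictly between 0 and 1, because the inverse
    normal CDF is undefined at the endpoints.
    """
    if not p_values or len(p_values) != len(weights):
        raise ValueError("p_values and weights must be non-empty and equal in length")
    norm = sqrt(sum(w * w for w in weights))
    if norm == 0.0:
        raise ValueError("at least one weight must be non-zero")
    # Small p-values map to large positive z-scores.
    z_scores = [_STD_NORMAL.inv_cdf(1.0 - p) for p in p_values]
    combined_z = sum(w * z for w, z in zip(weights, z_scores)) / norm
    return 1.0 - _STD_NORMAL.cdf(combined_z)


if __name__ == "__main__":
    # HYPOTHETICAL per-PC p-values for one gene set and illustrative weights.
    pc_p_values = [0.01, 0.20, 0.65]
    pc_weights = [0.5, 0.3, 0.1]
    print(f"combined p-value: {weighted_z_combine(pc_p_values, pc_weights):.4f}")
```

    Because the weights enter both the numerator and the normalizing term, only their relative sizes matter; a component with a near-zero weight contributes almost nothing to the combined statistic.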

  17. SiBIC: a web server for generating gene set networks based on biclusters obtained by maximal frequent itemset mining.

    PubMed

    Takahashi, Kei-ichiro; Takigawa, Ichigaku; Mamitsuka, Hiroshi

    2013-01-01

    Detecting biclusters from expression data is useful, since biclusters are coexpressed genes under only part of all given experimental conditions. We present software called SiBIC, which, from a given expression dataset, first exhaustively enumerates biclusters, then merges them into largely independent biclusters, and finally uses these to generate gene set networks, in which the gene set assigned to each node consists of coexpressed genes. We evaluated each step of this procedure: 1) significance of the generated biclusters biologically and statistically, 2) biological quality of merged biclusters, and 3) biological significance of gene set networks. We emphasize that gene set networks, in which nodes are not genes but gene sets, can be more compact than usual gene networks, meaning that gene set networks are more comprehensible. SiBIC is available at http://utrecht.kuicr.kyoto-u.ac.jp:8080/miami/faces/index.jsp.

  18. The Role of the Immune Response in the Pathogenesis of Thyroid Eye Disease: A Reassessment

    PubMed Central

    Rosenbaum, James T.; Choi, Dongseok; Wong, Amanda; Wilson, David J.; Grossniklaus, Hans E.; Harrington, Christina A.; Dailey, Roger A.; Ng, John D.; Steele, Eric A.; Czyz, Craig N.; Foster, Jill A.; Tse, David; Alabiad, Chris; Dubovy, Sander; Parekh, Prashant K.; Harris, Gerald J.; Kazim, Michael; Patel, Payal J.; White, Valerie A.; Dolman, Peter J.; Edward, Deepak P.; Alkatan, Hind M.; al Hussain, Hailah; Selva, Dinesh; Yeatts, R. Patrick; Korn, Bobby S.; Kikkawa, Don O.; Stauffer, Patrick; Planck, Stephen R.

    2015-01-01

    Background: Although thyroid eye disease is a common complication of Graves’ disease, the pathogenesis of the orbital disease is poorly understood. Most authorities implicate the immune response as an important causal factor. We sought to clarify pathogenesis by using gene expression microarray. Methods: An international consortium of ocular pathologists and orbital surgeons contributed formalin fixed orbital biopsies. RNA was extracted from orbital tissue from 20 healthy controls, 25 patients with thyroid eye disease (TED), 25 patients with nonspecific orbital inflammation (NSOI), 7 patients with sarcoidosis and 6 patients with granulomatosis with polyangiitis (GPA). Tissue was divided into a discovery set and a validation set. Gene expression was quantified using Affymetrix U133 Plus 2.0 microarrays which include 54,000 probe sets. Results: Principal component analysis showed that gene expression from tissue from patients with TED more closely resembled gene expression from healthy control tissue in comparison to gene expression characteristic of sarcoidosis, NSOI, or granulomatosis with polyangiitis. Unsupervised cluster dendrograms further indicated the similarity between TED and healthy controls. Heat maps based on gene expression for cytokines, chemokines, or their receptors showed that these inflammatory markers were associated with NSOI, sarcoidosis, or GPA much more frequently than with TED. Conclusion: This is the first study to compare gene expression in TED to gene expression associated with other causes of exophthalmos. The juxtaposition shows that inflammatory markers are far less characteristic of TED relative to other orbital inflammatory diseases. PMID:26371757

  19. Parenclitic networks: uncovering new functions in biological data

    PubMed Central

    Zanin, Massimiliano; Alcazar, Joaquín Medina; Carbajosa, Jesus Vicente; Paez, Marcela Gomez; Papo, David; Sousa, Pedro; Menasalvas, Ernestina; Boccaletti, Stefano

    2014-01-01

    We introduce a novel method to represent time independent, scalar data sets as complex networks. We apply our method to investigate gene expression in the response to osmotic stress of Arabidopsis thaliana. In the proposed network representation, the most important genes for the plant response turn out to be the nodes with highest centrality in appropriately reconstructed networks. We also performed a target experiment, in which the predicted genes were artificially induced one by one, and the growth of the corresponding phenotypes compared to that of the wild-type. The joint application of the network reconstruction method and of the in vivo experiments allowed identifying 15 previously unknown key genes, and provided models of their mutual relationships. This novel representation extends the use of graph theory to data sets hitherto considered outside of the realm of its application, vastly simplifying the characterization of their underlying structure. PMID:24870931

  20. Mouse Genome Database: From sequence to phenotypes and disease models

    PubMed Central

    Richardson, Joel E.; Kadin, James A.; Smith, Cynthia L.; Blake, Judith A.; Bult, Carol J.

    2015-01-01

    Summary: The Mouse Genome Database (MGD, www.informatics.jax.org) is the international scientific database for genetic, genomic, and biological data on the laboratory mouse to support the research requirements of the biomedical community. To accomplish this goal, MGD provides broad data coverage, serves as the authoritative standard for mouse nomenclature for genes, mutants, and strains, and curates and integrates many types of data from literature and electronic sources. Among the key data sets MGD supports are: the complete catalog of mouse genes and genome features, comparative homology data for mouse and vertebrate genes, the authoritative set of Gene Ontology (GO) annotations for mouse gene functions, a comprehensive catalog of mouse mutations and their phenotypes, and a curated compendium of mouse models of human diseases. Here, we describe the data acquisition process, specifics about MGD's key data areas, methods to access and query MGD data, and outreach and user help facilities. genesis 53:458–473, 2015. © 2015 The Authors. Genesis Published by Wiley Periodicals, Inc. PMID:26150326

  1. The recurrent SET-NUP214 fusion as a new HOXA activation mechanism in pediatric T-cell acute lymphoblastic leukemia

    PubMed Central

    Van Vlierberghe, Pieter; van Grotel, Martine; Tchinda, Joëlle; Lee, Charles; Beverloo, H. Berna; van der Spek, Peter J.; Stubbs, Andrew; Cools, Jan; Nagata, Kyosuke; Fornerod, Maarten; Buijs-Gladdines, Jessica; Horstmann, Martin; van Wering, Elisabeth R.; Soulier, Jean; Pieters, Rob

    2008-01-01

    T-cell acute lymphoblastic leukemia (T-ALL) is mostly characterized by specific chromosomal abnormalities, some occurring in a mutually exclusive manner that possibly delineate specific T-ALL subgroups. One subgroup, including MLL-rearranged, CALM-AF10 or inv(7)(p15q34) patients, is characterized by elevated expression of HOXA genes. Using a gene expression–based clustering analysis of 67 T-ALL cases with recurrent molecular genetic abnormalities and 25 samples lacking apparent aberrations, we identified 5 new patients with elevated HOXA levels. Using microarray-based comparative genomic hybridization (array-CGH), a cryptic and recurrent deletion, del(9)(q34.11q34.13), was exclusively identified in 3 of these 5 patients. This deletion results in a conserved SET-NUP214 fusion product, which was also identified in the T-ALL cell line LOUCY. SET-NUP214 binds in the promoter regions of specific HOXA genes, where it interacts with CRM1 and DOT1L, which may transcriptionally activate specific members of the HOXA cluster. Targeted inhibition of SET-NUP214 by siRNA abolished expression of HOXA genes, inhibited proliferation, and induced differentiation in LOUCY but not in other T-ALL lines. We conclude that SET-NUP214 may contribute to the pathogenesis of T-ALL by enforcing T-cell differentiation arrest. PMID:18299449

  2. Novel Gene Expression Profile of Women with Intrinsic Skin Youthfulness by Whole Transcriptome Sequencing

    PubMed Central

    Xu, Jin; Spitale, Robert C.; Guan, Linna; Flynn, Ryan A.; Torre, Eduardo A.; Li, Rui; Raber, Inbar; Qu, Kun; Kern, Dale; Knaggs, Helen E.; Chang, Howard Y.; Chang, Anne Lynn S.

    2016-01-01

    While much is known about genes that promote aging, little is known about genes that protect against or prevent aging, particularly in human skin. The main objective of this study was to perform an unbiased, whole transcriptome search for genes that associate with intrinsic skin youthfulness. To accomplish this, healthy women (n = 122) of European descent, ages 18–89 years with Fitzpatrick skin type I/II were examined for facial skin aging parameters and clinical covariates, including smoking and ultraviolet exposure. Skin youthfulness was defined as the top 10% of individuals whose assessed skin aging features were most discrepant with their chronological ages. Skin biopsies from sun-protected inner arm were subjected to 3′-end sequencing for expression quantification, with results verified by quantitative reverse transcriptase-polymerase chain reaction. Unbiased clustering revealed gene expression signatures characteristic of older women with skin youthfulness (n = 12) compared to older women without skin youthfulness (n = 33), after accounting for gene expression changes associated with chronological age alone. Gene set analysis was performed using Genomica open-access software. This study identified a novel set of candidate skin youthfulness genes demonstrating differences between SY and non-SY group, including pleckstrin homology like domain family A member 1 (PHLDA1) (p = 2.4 × 10⁻⁵), a follicle stem cell marker, and hyaluronan synthase 2-anti-sense 1 (HAS2-AS1) (p = 0.00105), a non-coding RNA that is part of the hyaluronan synthesis pathway. We show that immunologic gene sets are the most significantly altered in skin youthfulness (with the most significant gene set p = 2.4 × 10⁻⁵), suggesting the immune system plays an important role in skin youthfulness, a finding that has not previously been recognized. These results are a valuable resource from which multiple future studies may be undertaken to better understand the mechanisms that promote skin youthfulness in humans. PMID:27829007

  3. When is hub gene selection better than standard meta-analysis?

    PubMed

    Langfelder, Peter; Mischel, Paul S; Horvath, Steve

    2013-01-01

    Since hub nodes have been found to play important roles in many networks, highly connected hub genes are expected to play an important role in biology as well. However, the empirical evidence remains ambiguous. An open question is whether (or when) hub gene selection leads to more meaningful gene lists than a standard statistical analysis based on significance testing when analyzing genomic data sets (e.g., gene expression or DNA methylation data). Here we address this question for the special case when multiple genomic data sets are available. This is of great practical importance since for many research questions multiple data sets are publicly available. In this case, the data analyst can decide between a standard statistical approach (e.g., based on meta-analysis) and a co-expression network analysis approach that selects intramodular hubs in consensus modules. We assess the performance of these two types of approaches according to two criteria. The first criterion evaluates the biological insights gained and is relevant in basic research. The second criterion evaluates the validation success (reproducibility) in independent data sets and often applies in clinical diagnostic or prognostic applications. We compare meta-analysis with consensus network analysis based on weighted correlation network analysis (WGCNA) in three comprehensive and unbiased empirical studies: (1) Finding genes predictive of lung cancer survival, (2) finding methylation markers related to age, and (3) finding mouse genes related to total cholesterol. The results demonstrate that intramodular hub gene status with respect to consensus modules is more useful than a meta-analysis p-value when identifying biologically meaningful gene lists (reflecting criterion 1). However, standard meta-analysis methods perform as good as (if not better than) a consensus network approach in terms of validation success (criterion 2). 
The article also reports a comparison of meta-analysis techniques applied to gene expression data and presents novel R functions for carrying out consensus network analysis, network-based screening, and meta-analysis.
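
    To make the notion of an intramodular hub concrete, the following is a minimal Python sketch, not the article's R functions. It assumes a single data set and an unsigned soft-threshold adjacency |cor|^beta with beta = 6, and it does not attempt the consensus step across data sets. The gene names and expression values are invented for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def intramodular_connectivity(module, beta=6):
    """Sum of soft-threshold adjacencies |cor|**beta to the other module genes.

    `module` maps a gene name to its expression profile; beta = 6 is an
    assumed soft-thresholding power, not a value taken from the abstract.
    """
    genes = list(module)
    k = {g: 0.0 for g in genes}
    for i, gi in enumerate(genes):
        for gj in genes[i + 1:]:
            a = abs(pearson(module[gi], module[gj])) ** beta
            k[gi] += a
            k[gj] += a
    return k

# Hypothetical module: three tightly co-expressed genes and one outlier.
module = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [1.1, 2.1, 2.9, 4.2],
    "geneC": [2.0, 4.1, 6.0, 7.9],
    "geneD": [3.0, 1.0, 4.0, 2.0],
}
connectivity = intramodular_connectivity(module)
hub = max(connectivity, key=connectivity.get)
print(hub, connectivity)
```

    Ranking genes by this connectivity, rather than by a per-gene p-value, is the kind of hub-based selection the abstract contrasts with meta-analysis.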

  4. Comprehensive identification of Vibrio vulnificus genes required for growth in human serum.

    PubMed

    Carda-Diéguez, M; Silva-Hernández, F X; Hubbard, T P; Chao, M C; Waldor, M K; Amaro, C

    2018-12-31

    Vibrio vulnificus can be a highly invasive pathogen capable of spreading from an infection site to the bloodstream, causing sepsis and death. To survive and proliferate in blood, the pathogen requires mechanisms to overcome the innate immune defenses and metabolic limitations of this host niche. We created a high-density transposon mutant library in YJ016, a strain representative of the most virulent V. vulnificus lineage (or phylogroup) and used transposon insertion sequencing (TIS) screens to identify loci that enable the pathogen to survive and proliferate in human serum. Initially, genes underrepresented for insertions were used to estimate the V. vulnificus essential gene set; comparisons of these genes with similar TIS-based classification of underrepresented genes in other vibrios enabled the compilation of a common Vibrio essential gene set. Analysis of the relative abundance of insertion mutants in the library after exposure to serum suggested that genes involved in capsule biogenesis are critical for YJ016 complement resistance. Notably, homologues of two genes required for YJ016 serum-resistance and capsule biogenesis were not previously linked to capsule biogenesis and are largely absent from other V. vulnificus strains. The relative abundance of mutants after exposure to heat-inactivated serum was compared with the findings from the serum screen. These comparisons suggest that in both conditions the pathogen relies on its Na+-transporting NADH-ubiquinone reductase (NQR) complex and type II secretion system to survive/proliferate within the metabolic constraints of serum. Collectively, our findings reveal the potency of comparative TIS screens to provide knowledge of how a pathogen overcomes the diverse limitations to growth imposed by serum.

  5. The Maximal C³ Self-Complementary Trinucleotide Circular Code X in Genes of Bacteria, Archaea, Eukaryotes, Plasmids and Viruses.

    PubMed

    Michel, Christian J

    2017-04-18

    In 1996, a set X of 20 trinucleotides was identified in genes of both prokaryotes and eukaryotes which has on average the highest occurrence in reading frame compared to its two shifted frames. Furthermore, this set X has an interesting mathematical property as X is a maximal C³ self-complementary trinucleotide circular code. In 2015, by quantifying the inspection approach used in 1996, the circular code X was confirmed in the genes of bacteria and eukaryotes and was also identified in the genes of plasmids and viruses. The method was based on the preferential occurrence of trinucleotides among the three frames at the gene population level. We extend here this definition at the gene level. This new statistical approach considers all the genes, i.e., of large and small lengths, with the same weight for searching the circular code X. As a consequence, the concept of circular code, in particular the reading frame retrieval, is directly associated with each gene. At the gene level, the circular code X is strengthened in the genes of bacteria, eukaryotes, plasmids, and viruses, and is now also identified in the genes of archaea. The genes of mitochondria and chloroplasts contain a subset of the circular code X. Finally, by studying viral genes, the circular code X was found in DNA genomes, RNA genomes, double-stranded genomes, and single-stranded genomes.
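
    The frame comparison described above can be illustrated with a short Python sketch. It is only a counting illustration, not the article's statistical method. The abstract does not list the 20 trinucleotides; the set below is the one usually given for X in the circular-code literature and should be treated as an assumption to check against the paper. The toy sequence is invented.

```python
# Assumed membership of the circular code X (not listed in the abstract).
X = {
    "AAC", "AAT", "ACC", "ATC", "ATT", "CAG", "CTC", "CTG", "GAA", "GAC",
    "GAG", "GAT", "GCC", "GGC", "GGT", "GTA", "GTC", "GTT", "TAC", "TTC",
}

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(trinucleotide):
    """Reverse complement of a DNA trinucleotide."""
    return "".join(COMPLEMENT[base] for base in reversed(trinucleotide))

def is_self_complementary(code):
    """True if the reverse complement of every member is also a member."""
    return all(reverse_complement(t) in code for t in code)

def frame_counts(gene, code):
    """Count trinucleotides of `code` in frames 0, 1 and 2 of one gene."""
    counts = []
    for frame in range(3):
        starts = range(frame, len(gene) - 2, 3)
        counts.append(sum(gene[i:i + 3] in code for i in starts))
    return counts

toy_gene = "GAAGAGGTC"  # invented sequence, read in frame 0
print(is_self_complementary(X))   # True
print(frame_counts(toy_gene, X))  # [3, 0, 1]
```

    A gene-level analysis in the spirit of the abstract would compute such counts for each gene separately, so that short and long genes carry the same weight.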

  6. Reconstruction of metabolic networks from high-throughput metabolite profiling data: in silico analysis of red blood cell metabolism.

    PubMed

    Nemenman, Ilya; Escola, G Sean; Hlavacek, William S; Unkefer, Pat J; Unkefer, Clifford J; Wall, Michael E

    2007-12-01

    We investigate the ability of algorithms developed for reverse engineering of transcriptional regulatory networks to reconstruct metabolic networks from high-throughput metabolite profiling data. For benchmarking purposes, we generate synthetic metabolic profiles based on a well-established model for red blood cell metabolism. A variety of data sets are generated, accounting for different properties of real metabolic networks, such as experimental noise, metabolite correlations, and temporal dynamics. These data sets are made available online. We use ARACNE, a mainstream algorithm for reverse engineering of transcriptional regulatory networks from gene expression data, to predict metabolic interactions from these data sets. We find that the performance of ARACNE on metabolic data is comparable to that on gene expression data.

  7. The limitations of simple gene set enrichment analysis assuming gene independence.

    PubMed

    Tamayo, Pablo; Steinhardt, George; Liberzon, Arthur; Mesirov, Jill P

    2016-02-01

    Since its first publication in 2003, the Gene Set Enrichment Analysis method, based on the Kolmogorov-Smirnov statistic, has been heavily used, modified, and also questioned. Recently a simplified approach using a one-sample t-test score to assess enrichment and ignoring gene-gene correlations was proposed by Irizarry et al. (2009) as a serious contender. The argument criticizes Gene Set Enrichment Analysis's nonparametric nature and its use of an empirical null distribution as unnecessary and hard to compute. We refute these claims by careful consideration of the assumptions of the simplified method and its results, including a comparison with Gene Set Enrichment Analysis on a large benchmark set of 50 datasets. Our results provide strong empirical evidence that gene-gene correlations cannot be ignored, due to the significant variance inflation they produce on the enrichment scores, and should be taken into account when estimating gene set enrichment significance. In addition, we discuss the challenges that the complex correlation structure and multi-modality of gene sets pose more generally for gene set enrichment methods.
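
    The variance inflation mentioned above can be made concrete with a standard textbook result, sketched here in Python. The assumptions are that the gene-level scores have unit variance under the null and share a common average pairwise correlation; this is not the procedure used in the article. Under those assumptions, the variance of the normalized sum of m scores is 1 + (m - 1) * rho rather than 1.

```python
import math

def variance_inflation(m, mean_corr):
    """Variance of sum(scores) / sqrt(m) for m unit-variance scores
    whose average pairwise correlation is `mean_corr`."""
    return 1.0 + (m - 1) * mean_corr

def set_z(gene_scores, mean_corr=0.0):
    """Set-level z statistic, deflated by the assumed variance inflation.

    With mean_corr = 0 this is the naive statistic that treats genes
    as independent.
    """
    m = len(gene_scores)
    naive = sum(gene_scores) / math.sqrt(m)
    return naive / math.sqrt(variance_inflation(m, mean_corr))

scores = [0.4] * 50            # invented gene-level z scores
naive_z = set_z(scores)        # independence assumed
adjusted_z = set_z(scores, mean_corr=0.1)
print(naive_z, adjusted_z)
```

    In this invented example a modest average correlation of 0.1 inflates the variance almost sixfold for a 50-gene set, enough to move the set from nominally significant to not significant at the two-sided 5% level.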

  8. An analysis of gene expression in PTSD implicates genes involved in the glucocorticoid receptor pathway and neural responses to stress

    PubMed Central

    Logue, Mark W.; Smith, Alicia K.; Baldwin, Clinton; Wolf, Erika J.; Guffanti, Guia; Ratanatharathorn, Andrew; Stone, Annjanette; Schichman, Steven A.; Humphries, Donald; Binder, Elisabeth B.; Arloth, Janine; Menke, Andreas; Uddin, Monica; Wildman, Derek; Galea, Sandro; Aiello, Allison E.; Koenen, Karestan C.; Miller, Mark W.

    2015-01-01

    We examined the association between posttraumatic stress disorder (PTSD) and gene expression using whole blood samples from a cohort of trauma-exposed white non-Hispanic male veterans (115 cases and 28 controls). 10,264 probes of genes and gene transcripts were analyzed. We found 41 that were differentially expressed in PTSD cases versus controls (multiple-testing corrected p<0.05). The most significant was DSCAM, a neurological gene expressed widely in the developing brain and in the amygdala and hippocampus of the adult brain. We then examined the 41 differentially expressed genes in a meta-analysis using two replication cohorts and found significant associations with PTSD for 7 of the 41 (p<0.05), one of which (ATP6AP1L) survived multiple-testing correction. There was also broad evidence of overlap across the discovery and replication samples for the entire set of genes implicated in the discovery data based on the direction of effect and an enrichment of p<0.05 significant probes beyond what would be expected under the null. Finally, we found that the set of differentially expressed genes from the discovery sample was enriched for genes responsive to glucocorticoid signaling with most showing reduced expression in PTSD cases compared to controls. PMID:25867994

  9. Effect of feed supplementation with live yeast on the intestinal transcriptome profile of weaning pigs orally challenged with Escherichia coli F4.

    PubMed

    Trevisi, P; Latorre, R; Priori, D; Luise, D; Archetti, I; Mazzoni, M; D'Inca, R; Bosi, P

    2017-01-01

    The ability of live yeasts to modulate pig intestinal cell signals in response to infection with Escherichia coli F4ac (ETEC) has not been studied in-depth. The aim of this trial was to evaluate the effect of Saccharomyces cerevisiae CNCM I-4407 (Sc), supplied at different times, on the transcriptome profile of the jejunal mucosa of pigs 24 h after infection with ETEC. In total, 20 piglets selected to be ETEC-susceptible were weaned at 24 days of age (day 0) and allotted by litter to one of the following groups: control (CO), CO+colistin (AB), CO+5×10¹⁰ colony-forming units (CFU) Sc/kg feed from day 0 (PR), and CO+5×10¹⁰ CFU Sc/kg feed from day 7 (CM). On day 7, the pigs were orally challenged with ETEC and were slaughtered 24 h later after blood sampling for haptoglobin (Hp) and C-reactive protein (CRP) determination. The jejunal mucosa was sampled (1) for morphometry; (2) for quantification of proliferation, apoptosis and zonula occludens (ZO-1); (3) to carry out the microarray analysis. A functional analysis was carried out using Gene Set Enrichment Analysis. The normalized enrichment score (NES) was calculated for each gene set, and statistical significance was defined as a false discovery rate <25% and a P-value of the NES <0.05. The blood concentration of CRP and Hp, and the score for ZO-1 integrity on the jejunal villi did not differ between groups. The intestinal crypts were deeper in the AB (P=0.05) and the yeast groups (P<0.05) than in the CO group. Antibiotic treatment increased the number of mitotic cells in intestinal villi as compared with the control group (P<0.05). The PR group tended to increase the mitotic cells in villi and crypts and tended to reduce the cells in apoptosis as compared with the CM group. The transcriptome profiles of the AB and PR groups were similar. 
In both groups, the gene sets involved in mitosis and in mitochondria development ranked the highest, whereas in the CO group, the gene sets related to cell junction and anion channels were affected. In the CM group, the gene sets linked to the metabolic process and transcription ranked the highest; a gene set linked with a negative effect on growth was also affected. In conclusion, the constant supplementation in the feed with the strain of yeast tested was effective in counteracting the detrimental effect of ETEC infection in susceptible pigs and limited the early activation of the gene sets related to the impairment of the jejunal mucosa.

  10. Ensemble positive unlabeled learning for disease gene identification.

    PubMed

    Yang, Peng; Li, Xiaoli; Chua, Hon-Nian; Kwoh, Chee-Keong; Ng, See-Kiong

    2014-01-01

    An increasing number of genes have been experimentally confirmed in recent years as causative genes for various human diseases. The newly available knowledge can be exploited by machine learning methods to discover additional unknown genes that are likely to be associated with diseases. In particular, positive unlabeled learning (PU learning) methods, which require only a positive training set P (confirmed disease genes) and an unlabeled set U (the unknown candidate genes) instead of a negative training set N, have been shown to be effective in uncovering new disease genes in the current scenario. Using only a single source of data for prediction can be susceptible to bias due to incompleteness and noise in the genomic data, and a single machine learning predictor is prone to bias caused by inherent limitations of individual methods. In this paper, we propose an effective PU learning framework that integrates multiple biological data sources and an ensemble of powerful machine learning classifiers for disease gene identification. Our proposed method integrates data from multiple biological sources for training PU learning classifiers. A novel ensemble-based PU learning method EPU is then used to integrate multiple PU learning classifiers to achieve accurate and robust disease gene predictions. Our evaluation experiments across six disease groups showed that EPU achieved significantly better results compared with various state-of-the-art prediction methods as well as ensemble learning classifiers. Through integrating multiple biological data sources for training and the outputs of an ensemble of PU learning classifiers for prediction, we are able to minimize the potential bias and errors in individual data sources and machine learning algorithms to achieve more accurate and robust disease gene predictions. 
In the future, our EPU method provides an effective framework to integrate the additional biological and computational resources for better disease gene predictions.

  11. Gene selection for the reconstruction of stem cell differentiation trees: a linear programming approach.

    PubMed

    Ghadie, Mohamed A; Japkowicz, Nathalie; Perkins, Theodore J

    2015-08-15

    Stem cell differentiation is largely guided by master transcriptional regulators, but it also depends on the expression of other types of genes, such as cell cycle genes, signaling genes, metabolic genes, trafficking genes, etc. Traditional approaches to understanding gene expression patterns across multiple conditions, such as principal components analysis or K-means clustering, can group cell types based on gene expression, but they do so without knowledge of the differentiation hierarchy. Hierarchical clustering can organize cell types into a tree, but in general this tree is different from the differentiation hierarchy itself. Given the differentiation hierarchy and gene expression data at each node, we construct a weighted Euclidean distance metric such that the minimum spanning tree with respect to that metric is precisely the given differentiation hierarchy. We provide a set of linear constraints that are provably sufficient for the desired construction and a linear programming approach to identify sparse sets of weights, effectively identifying genes that are most relevant for discriminating different parts of the tree. We apply our method to microarray gene expression data describing 38 cell types in the hematopoiesis hierarchy, constructing a weighted Euclidean metric that uses just 175 genes. However, we find that there are many alternative sets of weights that satisfy the linear constraints. Thus, in the style of random-forest training, we also construct metrics based on random subsets of the genes and compare them to the metric of 175 genes. We then report on the selected genes and their biological functions. Our approach offers a new way to identify genes that may have important roles in stem cell differentiation. Supplementary data are available at Bioinformatics online.
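
    The core idea, that gene weights can be chosen so the minimum spanning tree of a weighted Euclidean metric matches a known hierarchy, can be sketched in Python as follows. The sketch only checks a given weight vector using Prim's algorithm. It does not implement the article's linear constraints or linear program, and the cell types, profiles and weights are invented.

```python
import math

def weighted_distance(x, y, weights):
    """Weighted Euclidean distance between two expression profiles."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))

def minimum_spanning_tree(profiles, weights):
    """Prim's algorithm; returns the tree as a set of undirected edges."""
    nodes = list(profiles)
    in_tree = {nodes[0]}
    edges = set()
    while len(in_tree) < len(nodes):
        # Cheapest edge leaving the current tree.
        _, u, v = min(
            (weighted_distance(profiles[u], profiles[v], weights), u, v)
            for u in in_tree
            for v in nodes
            if v not in in_tree
        )
        in_tree.add(v)
        edges.add(frozenset((u, v)))
    return edges

# Hypothetical three-gene profiles for three cell types.
profiles = {
    "stem": (0.0, 0.0, 5.0),
    "progenitor": (1.0, 0.0, 0.0),
    "mature": (2.0, 0.0, 5.0),
}
hierarchy = {frozenset(("stem", "progenitor")),
             frozenset(("progenitor", "mature"))}

unweighted_tree = minimum_spanning_tree(profiles, (1.0, 1.0, 1.0))
weighted_tree = minimum_spanning_tree(profiles, (1.0, 1.0, 0.0))
print(unweighted_tree == hierarchy, weighted_tree == hierarchy)
```

    In this toy case the third gene misleads the unweighted metric, and giving it zero weight recovers the hierarchy; a sparse weight vector of this kind is what the linear program in the abstract is described as searching for.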

  12. Assessment of gene order computing methods for Alzheimer's disease

    PubMed Central

    2013-01-01

    Background: Computational genomics of Alzheimer disease (AD), the most common form of senile dementia, is a nascent field in AD research. The field includes AD gene clustering by computing gene order, which generates higher quality gene clustering patterns than most other clustering methods. However, there are few available gene order computing methods, such as Genetic Algorithm (GA) and Ant Colony Optimization (ACO). Further, their performance in gene order computation using AD microarray data is not known. We thus set out to evaluate the performance of current gene order computing methods with different distance formulas, and to identify additional features associated with gene order computation. Methods: Using different distance formulas (Pearson distance, Euclidean distance, and squared Euclidean distance) and other conditions, gene orders were calculated by ACO and GA (including standard GA and improved GA) methods, respectively. The qualities of the gene orders were compared, and new features from the calculated gene orders were identified. Results: Compared to the GA methods tested in this study, ACO fits the AD microarray data the best when calculating gene order. In addition, the following features were revealed: different distance formulas generated a different quality of gene order, and the commonly used Pearson distance was not the best distance formula when used with both GA and ACO methods for AD microarray data. Conclusion: Compared with Pearson distance and Euclidean distance, the squared Euclidean distance generated the best quality gene order computed by GA and ACO methods. PMID:23369541
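
    For reference, the three distance formulas compared in this study can be written out as below. This is a hedged Python sketch: the abstract does not define "Pearson distance", so the common convention 1 - r is assumed, and the gene profiles are invented.

```python
import math

def squared_euclidean(x, y):
    """Sum of squared coordinate differences."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def euclidean(x, y):
    """Ordinary Euclidean distance."""
    return math.sqrt(squared_euclidean(x, y))

def pearson_distance(x, y):
    """1 - Pearson correlation (assumed definition)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1.0 - sxy / math.sqrt(sxx * syy)

gene_a = [1.0, 2.0, 3.0]   # invented expression profiles
gene_b = [2.0, 4.0, 7.0]
print(squared_euclidean(gene_a, gene_b))  # 21.0
print(euclidean(gene_a, gene_b))
print(pearson_distance(gene_a, gene_b))
```

    Note that the Pearson distance ignores differences in scale and offset between profiles, while both Euclidean variants do not, which is one reason the choice of formula can change the resulting gene order.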

  13. The transcriptomic fingerprint of glucoamylase over-expression in Aspergillus niger

    PubMed Central

    2012-01-01

    Background: Filamentous fungi such as Aspergillus niger are well known for their exceptionally high capacity for secretion of proteins, organic acids, and secondary metabolites and they are therefore used in biotechnology as versatile microbial production platforms. However, system-wide insights into their metabolic and secretory capacities are sparse and rational strain improvement approaches are therefore limited. In order to gain a genome-wide view on the transcriptional regulation of the protein secretory pathway of A. niger, we investigated the transcriptome of A. niger when it was forced to overexpress the glaA gene (encoding glucoamylase, GlaA) and secrete GlaA at a high level. Results: An A. niger wild-type strain and a GlaA over-expressing strain, containing multiple copies of the glaA gene, were cultivated under maltose-limited chemostat conditions (specific growth rate 0.1 h⁻¹). Elevated glaA mRNA and extracellular GlaA levels in the over-expressing strain were accompanied by elevated transcript levels from 772 genes and lowered transcript levels from 815 genes when compared to the wild-type strain. Using GO term enrichment analysis, four higher-order categories were identified in the up-regulated gene set: i) endoplasmic reticulum (ER) membrane translocation, ii) protein glycosylation, iii) vesicle transport, and iv) ion homeostasis. Among these, about 130 genes had predicted functions for the passage of proteins through the ER and those genes included target genes of the HacA transcription factor that mediates the unfolded protein response (UPR), e.g. bipA, clxA, prpA, tigA and pdiA. In order to identify those genes that are important for high-level secretion of proteins by A. niger, we compared the transcriptome of the GlaA overexpression strain of A. niger with six other relevant transcriptomes of A. niger. 
Overall, 40 genes were found to have either elevated (from 36 genes) or lowered (from 4 genes) transcript levels under all conditions that were examined, thus defining the core set of genes important for ensuring high protein traffic through the secretory pathway. Conclusion: We have defined the A. niger genes that respond to elevated secretion of GlaA and, furthermore, we have defined a core set of genes that appear to be involved more generally in the intensified traffic of proteins through the secretory pathway of A. niger. The consistent up-regulation of a gene encoding the acetyl-coenzyme A transporter suggests a possible role for transient acetylation to ensure correct folding of secreted proteins. PMID:23237452

  14. Approximate geodesic distances reveal biologically relevant structures in microarray data.

    PubMed

    Nilsson, Jens; Fioretos, Thoas; Höglund, Mattias; Fontes, Magnus

    2004-04-12

    Genome-wide gene expression measurements, as currently determined by the microarray technology, can be represented mathematically as points in a high-dimensional gene expression space. Genes interact with each other in regulatory networks, restricting the cellular gene expression profiles to a certain manifold, or surface, in gene expression space. To obtain knowledge about this manifold, various dimensionality reduction methods and distance metrics are used. For data points distributed on curved manifolds, a sensible distance measure would be the geodesic distance along the manifold. In this work, we examine whether an approximate geodesic distance measure captures biological similarities better than the traditionally used Euclidean distance. We computed approximate geodesic distances, determined by the Isomap algorithm, for one set of lymphoma and one set of lung cancer microarray samples. Compared with the ordinary Euclidean distance metric, this distance measure produced more instructive, biologically relevant, visualizations when applying multidimensional scaling. This suggests the Isomap algorithm as a promising tool for the interpretation of microarray data. Furthermore, the results demonstrate the benefit and importance of taking nonlinearities in gene expression data into account.
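
    The approximate geodesic distance described above can be sketched in a few lines of Python. This is a simplified illustration of the graph-distance step used by Isomap (a k-nearest-neighbour graph followed by all-pairs shortest paths), not the authors' implementation. The neighbourhood size k = 2 and the half-circle test points are assumptions chosen for illustration.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def approximate_geodesics(points, k=2):
    """Shortest-path distances through a symmetric k-nearest-neighbour graph."""
    n = len(points)
    d = [[euclidean(p, q) for q in points] for p in points]
    g = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        g[i][i] = 0.0
        # Index 0 of the sorted list is the point itself, so skip it.
        for j in sorted(range(n), key=lambda j: d[i][j])[1:k + 1]:
            g[i][j] = g[j][i] = d[i][j]
    # Floyd-Warshall all-pairs shortest paths.
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if g[i][m] + g[m][j] < g[i][j]:
                    g[i][j] = g[i][m] + g[m][j]
    return g

# Nine points on a half circle of radius 1: a curved one-dimensional manifold.
points = [(math.cos(i * math.pi / 8), math.sin(i * math.pi / 8)) for i in range(9)]
geodesic = approximate_geodesics(points, k=2)
print(euclidean(points[0], points[-1]), geodesic[0][-1])
```

    For the two end points the Euclidean distance is the chord length 2, whereas the graph distance is a little over 3, close to the arc length pi. That difference is the nonlinearity the abstract argues should be taken into account.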

  15. Determination of the Core of a Minimal Bacterial Gene Set

    PubMed Central

    Gil, Rosario; Silva, Francisco J.; Peretó, Juli; Moya, Andrés

    2004-01-01

    The availability of a large number of complete genome sequences raises the question of how many genes are essential for cellular life. Trying to reconstruct the core of the protein-coding gene set for a hypothetical minimal bacterial cell, we have performed a computational comparative analysis of eight bacterial genomes. Six of the analyzed genomes are very small due to a dramatic genome size reduction process, while the other two, corresponding to free-living relatives, are larger. The available data from several systematic experimental approaches to define all the essential genes in some completely sequenced bacterial genomes were also considered, and a reconstruction of a minimal metabolic machinery necessary to sustain life was carried out. The proposed minimal genome contains 206 protein-coding genes with all the genetic information necessary for self-maintenance and reproduction in the presence of a full complement of essential nutrients and in the absence of environmental stress. The main features of such a minimal gene set, as well as the metabolic functions that must be present in the hypothetical minimal cell, are discussed. PMID:15353568

  16. Deinococcus geothermalis: The Pool of Extreme Radiation Resistance Genes Shrinks

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Makarova, Kira S.; Omelchenko, Marina V.; Gaidamakova, Elena K.

    Bacteria of the genus Deinococcus are extremely resistant to ionizing radiation (IR), ultraviolet light (UV) and desiccation. The mesophile Deinococcus radiodurans was the first member of this group whose genome was completely sequenced. Analysis of the genome sequence of D. radiodurans, however, failed to identify unique DNA repair systems. To further delineate the genes underlying the resistance phenotypes, we report the whole-genome sequence of a second Deinococcus species, the thermophile Deinococcus geothermalis, which at its optimal growth temperature is as resistant to IR, UV and desiccation as D. radiodurans, and a comparative analysis of the two Deinococcus genomes. Many D. radiodurans genes previously implicated in resistance, but for which no sensitive phenotype was observed upon disruption, are absent in D. geothermalis. In contrast, most D. radiodurans genes whose mutants displayed a radiation-sensitive phenotype in D. radiodurans are conserved in D. geothermalis. Supporting the existence of a Deinococcus radiation response regulon, a common palindromic DNA motif was identified in a conserved set of genes associated with resistance, and a dedicated transcriptional regulator was predicted. We present the case that these two species evolved essentially the same diverse set of gene families, and that the extreme stress-resistance phenotypes of the Deinococcus lineage emerged progressively by amassing cell-cleaning systems from different sources, but not by acquisition of novel DNA repair systems. Our reconstruction of the genomic evolution of the Deinococcus-Thermus phylum indicates that the corresponding set of enzymes proliferated mainly in the common ancestor of Deinococcus. 
Results of the comparative analysis weaken the arguments for a role of higher-order chromosome alignment structures in resistance; more clearly define and substantially revise downward the number of uncharacterized genes that might participate in DNA repair and contribute to resistance; and strengthen the case for a role in survival of systems involved in manganese and iron homeostasis.

  17. Optimized Probe Masking for Comparative Transcriptomics of Closely Related Species

    PubMed Central

    Poeschl, Yvonne; Delker, Carolin; Trenner, Jana; Ullrich, Kristian Karsten; Quint, Marcel; Grosse, Ivo

    2013-01-01

    Microarrays are commonly applied to study the transcriptome of specific species. However, many available microarrays are restricted to model organisms, and the design of custom microarrays for other species is often not feasible. Hence, transcriptomics approaches for non-model organisms, as well as comparative transcriptomics studies among two or more species, often rely on cost-intensive RNAseq studies or, alternatively, on hybridizing transcripts of a query species to a microarray of a closely related species. When analyzing these cross-species microarray expression data, differences in the transcriptome of the query species can cause problems, such as the following: (i) lower hybridization accuracy of probes due to mismatches or deletions, (ii) probes binding multiple transcripts of different genes, and (iii) probes binding transcripts of non-orthologous genes. So far, methods for (i) exist, but these neglect (ii) and (iii). Here, we propose an approach for comparative transcriptomics addressing problems (i) to (iii), which retains only transcript-specific probes binding transcripts of orthologous genes. We apply this approach to an Arabidopsis lyrata expression data set measured on a microarray designed for Arabidopsis thaliana, and compare it to two alternative approaches, a sequence-based approach and a genomic DNA hybridization-based approach. We investigate the number of retained probe sets, and we validate the resulting expression responses by qRT-PCR. We find that the proposed approach combines the benefit of sequence-based stringency and accuracy while allowing the expression analysis of many more genes than the alternative sequence-based approach. As an added benefit, the proposed approach requires probes to detect transcripts of orthologous genes only, which provides a superior base for biological interpretation of the measured expression responses. PMID:24260119

  18. Validating internal controls for quantitative plant gene expression studies.

    PubMed

    Brunner, Amy M; Yakovlev, Igor A; Strauss, Steven H

    2004-08-18

    Real-time reverse transcription PCR (RT-PCR) has greatly improved the ease and sensitivity of quantitative gene expression studies. However, accurate measurement of gene expression with this method relies on the choice of a valid reference for data normalization. Studies rarely verify that gene expression levels for reference genes are adequately consistent among the samples used, nor compare alternative genes to assess which are most reliable for the experimental conditions analyzed. Using real-time RT-PCR to study the expression of 10 poplar (genus Populus) housekeeping genes, we demonstrate a simple method for determining the degree of stability of gene expression over a set of experimental conditions. Based on a traditional method for analyzing the stability of varieties in plant breeding, it defines measures of gene expression stability from analysis of variance (ANOVA) and linear regression. We found that the potential internal control genes differed widely in their expression stability over the different tissues, developmental stages and environmental conditions studied. Our results support that quantitative comparisons of candidate reference genes are an important part of real-time RT-PCR studies that seek to precisely evaluate variation in gene expression. The method we demonstrated facilitates statistical and graphical evaluation of gene expression stability. Selection of the best reference gene for a given set of experimental conditions should enable detection of biologically significant changes in gene expression that are too small to be revealed by less precise methods, or when highly variable reference genes are unknowingly used in real-time RT-PCR experiments.

  19. Comparative study of joint analysis of microarray gene expression data in survival prediction and risk assessment of breast cancer patients

    PubMed Central

    2016-01-01

    Microarray gene expression data sets are jointly analyzed to increase statistical power. They could either be merged together or analyzed by meta-analysis. For a given ensemble of data sets, it cannot be foreseen which of these paradigms, merging or meta-analysis, works better. In this article, three joint analysis methods, Z-score normalization, ComBat and the inverse normal method (meta-analysis) were selected for survival prognosis and risk assessment of breast cancer patients. The methods were applied to eight microarray gene expression data sets, totaling 1324 patients with two clinical endpoints, overall survival and relapse-free survival. The performance derived from the joint analysis methods was evaluated using Cox regression for survival analysis and independent validation used as bias estimation. Overall, Z-score normalization had a better performance than ComBat and meta-analysis. Higher Area Under the Receiver Operating Characteristic curve and hazard ratio were also obtained when independent validation was used as bias estimation. With a lower time and memory complexity, Z-score normalization is a simple method for joint analysis of microarray gene expression data sets. The derived findings suggest further assessment of this method in future survival prediction and cancer classification applications. PMID:26504096
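
    Two of the joint analysis methods named above are simple enough to sketch in Python with the standard library. The assumptions are that Z-score normalization is applied per gene within each data set before merging, and that the inverse normal method combines one-sided p-values with optional weights (Stouffer's approach). The article's exact preprocessing and weighting are not given in the abstract, and all numbers are invented.

```python
import math
from statistics import NormalDist, mean, pstdev

def zscore(values):
    """Centre and scale one gene's values within a single data set."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]

def inverse_normal(p_values, weights=None):
    """Combine one-sided p-values (each strictly between 0 and 1)."""
    nd = NormalDist()
    if weights is None:
        weights = [1.0] * len(p_values)
    z = sum(w * nd.inv_cdf(1.0 - p) for w, p in zip(weights, p_values))
    z /= math.sqrt(sum(w * w for w in weights))
    return 1.0 - nd.cdf(z)

# Merging: the same gene measured on different scales in two data sets.
study_1 = [5.0, 6.0, 7.0]
study_2 = [50.0, 60.0, 70.0]
merged = zscore(study_1) + zscore(study_2)

# Meta-analysis: the same gene tested separately in two data sets.
combined_p = inverse_normal([0.05, 0.05])
print(merged, combined_p)
```

    After normalization both studies contribute values on the same scale, which is what makes direct merging possible. The combined p-value for two p-values of 0.05 is about 0.01, illustrating the gain in power from meta-analysis.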

  20. Careful Selection of Reference Genes Is Required for Reliable Performance of RT-qPCR in Human Normal and Cancer Cell Lines

    PubMed Central

    Jacob, Francis; Guertler, Rea; Naim, Stephanie; Nixdorf, Sheri; Fedier, André; Hacker, Neville F.; Heinzelmann-Schwarz, Viola

    2013-01-01

    Reverse Transcription - quantitative Polymerase Chain Reaction (RT-qPCR) is a standard technique in most laboratories. The selection of reference genes is essential for data normalization and the selection of suitable reference genes remains critical. Our aim was to 1) review the literature since implementation of the MIQE guidelines in order to identify the degree of acceptance; 2) compare various algorithms in their expression stability; 3) identify a set of suitable and most reliable reference genes for a variety of human cancer cell lines. A PubMed database review was performed and publications since 2009 were selected. Twelve putative reference genes were profiled in normal and various cancer cell lines (n = 25) using 2-step RT-qPCR. Investigated reference genes were ranked according to their expression stability by five algorithms (geNorm, Normfinder, BestKeeper, comparative ΔCt, and RefFinder). Our review revealed 37 publications, two thirds using patient samples and one third using cell lines. qPCR efficiency was given in 68.4% of all publications, but only 28.9% of all studies provided RNA/cDNA amount and standard curves. GeNorm and Normfinder algorithms were used in combination in 60.5%. In our selection of 25 cancer cell lines, we identified HSPCB, RRN18S, and RPS13 as the most stably expressed reference genes. In the subset of ovarian cancer cell lines, the reference genes were PPIA, RPS13 and SDHA, clearly demonstrating the necessity to select genes depending on the research focus. Moreover, a cohort of at least three suitable reference genes needs to be established in advance of the experiments, according to the guidelines. For establishing a set of reference genes for gene normalization we recommend the use of ideally three reference genes selected by at least three stability algorithms. The unfortunate lack of compliance with the MIQE guidelines reflects that these need to be further established in the research community. PMID:23554992
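
    As an illustration of how such stability rankings work, here is a simplified Python sketch in the spirit of the comparative ΔCt approach. For each candidate gene it averages the standard deviation of its Ct difference to every other candidate across samples, so a lower score means more stable. It is not a reimplementation of geNorm, Normfinder, BestKeeper or RefFinder, and the gene labels and Ct values are invented.

```python
from statistics import mean, stdev

def stability_scores(ct_values):
    """Mean standard deviation of pairwise Ct differences per gene.

    `ct_values` maps a gene name to its Ct values, one per sample,
    with samples in the same order for every gene.
    """
    scores = {}
    for gene, cts in ct_values.items():
        spreads = []
        for other, other_cts in ct_values.items():
            if other == gene:
                continue
            deltas = [a - b for a, b in zip(cts, other_cts)]
            spreads.append(stdev(deltas))
        scores[gene] = mean(spreads)
    return scores

# Hypothetical Ct values for three candidate reference genes in four samples.
ct_values = {
    "REF1": [20.0, 21.0, 20.0, 21.0],
    "REF2": [22.0, 23.0, 22.0, 23.0],
    "REF3": [18.0, 25.0, 19.0, 24.0],
}
scores = stability_scores(ct_values)
ranking = sorted(scores, key=scores.get)
print(ranking, scores)
```

    REF1 and REF2 track each other exactly and therefore score as more stable than REF3. Because each algorithm weighs variation differently, comparing several of them, as the abstract recommends, guards against relying on one ranking.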

  1. Development of 25 near-isogenic lines (NILs) with ten BPH resistance genes in rice (Oryza sativa L.): production, resistance spectrum, and molecular analysis.

    PubMed

    Jena, Kshirod K; Hechanova, Sherry Lou; Verdeprado, Holden; Prahalada, G D; Kim, Sung-Ryul

    2017-11-01

    A first set of 25 NILs carrying ten BPH resistance genes and their pyramids was developed in the background of indica variety IR24 for insect resistance breeding in rice. Brown planthopper (Nilaparvata lugens Stal.) is one of the most destructive insect pests in rice. Development of near-isogenic lines (NILs) is an important strategy for genetic analysis of brown planthopper (BPH) resistance (R) genes and their deployment against diverse BPH populations. A set of 25 NILs with 9 single R genes and 16 multiple R gene combinations consisting of 11 two-gene pyramids and 5 three-gene pyramids in the genetic background of the susceptible indica rice cultivar IR24 was developed through marker-assisted selection. The linked DNA markers for each of the R genes were used for foreground selection and confirming the introgressed regions of the BPH R genes. Modified seed box screening and feeding rate of BPH were used to evaluate the spectrum of resistance. BPH reaction of each of the NILs carrying different single genes was variable at the antibiosis level with the four BPH populations of the Philippines. The NILs with two- to three-pyramided genes showed a stronger level of antibiosis (49.3-99.0%) against BPH populations compared with NILs with a single R gene (42.0-83.5%) and IR24 (10.0%). Background genotyping by high-density SNP markers revealed that most of the chromosome regions of the NILs (BC3F5) had IR24 genome recovery of 82.0-94.2%. Data for six major agronomic traits of the NILs showed agronomic performance phenotypically comparable with IR24. These newly developed NILs will be useful as new genetic resources for BPH resistance breeding and are valuable sources of genes in monitoring against the emerging BPH biotypes in different rice-growing countries.

  2. An integrated approach for identifying wrongly labelled samples when performing classification in microarray data.

    PubMed

    Leung, Yuk Yee; Chang, Chun Qi; Hung, Yeung Sam

    2012-01-01

    Using a hybrid approach for gene selection and classification is common, as the results obtained are generally better than performing the two tasks independently. Yet, for some microarray datasets, both classification accuracy and stability of the gene sets obtained still have room for improvement. This may be due to the presence of samples with wrong class labels (i.e. outliers). Outlier detection algorithms proposed so far are either not suitable for microarray data, or only solve the outlier detection problem on their own. We tackle the outlier detection problem based on a previously proposed Multiple-Filter-Multiple-Wrapper (MFMW) model, which was demonstrated to yield promising results when compared to other hybrid approaches (Leung and Hung, 2010). To incorporate outlier detection and overcome limitations of the existing MFMW model, three new features are introduced in our proposed MFMW-outlier approach: 1) an unbiased external Leave-One-Out Cross-Validation framework is developed to replace internal cross-validation in the previous MFMW model; 2) wrongly labeled samples are identified within the MFMW-outlier model; and 3) a stable set of genes is selected using an L1-norm SVM that removes any redundant genes present. Six binary-class microarray datasets were tested. Compared with outlier detection studies on the same datasets, MFMW-outlier could detect all the outliers found in the original paper (for which the data was provided for analysis), and the genes selected after outlier removal were proven to have biological relevance. We also compared MFMW-outlier with PRAPIV (Zhang et al., 2006) based on the same synthetic datasets. MFMW-outlier gave better average precision and recall values on three different settings. Lastly, artificially flipped microarray datasets were created by removing our detected outliers and flipping some of the remaining samples' labels. Almost all the 'wrong' (artificially flipped) samples were detected, suggesting that MFMW-outlier was sufficiently powerful to detect outliers in high-dimensional microarray datasets.

  3. Vascular tone pathway polymorphisms in relation to primary open-angle glaucoma.

    PubMed

    Kang, J H; Loomis, S J; Yaspan, B L; Bailey, J C; Weinreb, R N; Lee, R K; Lichter, P R; Budenz, D L; Liu, Y; Realini, T; Gaasterland, D; Gaasterland, T; Friedman, D S; McCarty, C A; Moroi, S E; Olson, L; Schuman, J S; Singh, K; Vollrath, D; Wollstein, G; Zack, D J; Brilliant, M; Sit, A J; Christen, W G; Fingert, J; Forman, J P; Buys, E S; Kraft, P; Zhang, K; Allingham, R R; Pericak-Vance, M A; Richards, J E; Hauser, M A; Haines, J L; Wiggs, J L; Pasquale, L R

    2014-06-01

    Vascular perfusion may be impaired in primary open-angle glaucoma (POAG); thus, we evaluated a panel of markers in vascular tone-regulating genes in relation to POAG. We used Illumina 660W-Quad array genotype data and pooled P-values from 3108 POAG cases and 3430 controls from the combined National Eye Institute Glaucoma Human Genetics Collaboration consortium and Glaucoma Genes and Environment studies. Using information from previous literature and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, we compiled single-nucleotide polymorphisms (SNPs) in 186 vascular tone-regulating genes. We used the 'Pathway Analysis by Randomization Incorporating Structure' analysis software, which performed 1000 permutations to compare the overall pathway and selected genes with comparable randomly generated pathways and genes in their association with POAG. The vascular tone pathway was not associated with POAG overall or POAG subtypes, defined by the type of visual field loss (early paracentral loss (n=224 cases) or only peripheral loss (n=993 cases)) (permuted P≥0.20). In gene-based analyses, eight were associated with POAG overall at permuted P<0.001: PRKAA1, CAV1, ITPR3, EDNRB, GNB2, DNM2, HFE, and MYL9. Notably, six of these eight (the first six listed) code for factors involved in the endothelial nitric oxide synthase activity, and three of these six (CAV1, ITPR3, and EDNRB) were also associated with early paracentral loss at P<0.001, whereas none of the six genes reached P<0.001 for peripheral loss only. Although the assembled vascular tone SNP set was not associated with POAG, genes that code for local factors involved in setting vascular tone were associated with POAG.
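
The permutation logic described above, comparing an observed pathway against randomly generated ones, can be illustrated with a simplified sketch. It is not the 'Pathway Analysis by Randomization Incorporating Structure' software, which builds structure-matched random pathways; here random gene sets of the same size are simply drawn from a background list. The gene names, scores and function name are hypothetical, and the empirical p-value uses the common (hits + 1) / (permutations + 1) convention.

```python
import random
from statistics import mean


def permutation_pvalue(scores, pathway, n_perm=1000, seed=0):
    """Empirical p-value for a pathway's mean gene-level score.

    scores maps gene name -> association score (higher = stronger).
    pathway is a list of gene names that all appear in scores.
    Random gene sets of equal size are drawn from all scored genes.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    background = list(scores)
    observed = mean(scores[g] for g in pathway)
    hits = 0
    for _ in range(n_perm):
        random_set = rng.sample(background, len(pathway))
        if mean(scores[g] for g in random_set) >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical scores: 50 genes, with the pathway holding the top three.
toy_scores = {f"g{i}": float(i) for i in range(50)}
toy_pathway = ["g47", "g48", "g49"]
p_top = permutation_pvalue(toy_scores, toy_pathway)
```

With 1000 permutations the smallest reportable value is 1/1001, which is why thresholds such as permuted P<0.001 sit at the resolution limit of the procedure.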

  4. A gene expression resource generated by genome-wide lacZ profiling in the mouse

    PubMed Central

    Tuck, Elizabeth; Estabel, Jeanne; Oellrich, Anika; Maguire, Anna Karin; Adissu, Hibret A.; Souter, Luke; Siragher, Emma; Lillistone, Charlotte; Green, Angela L.; Wardle-Jones, Hannah; Carragher, Damian M.; Karp, Natasha A.; Smedley, Damian; Adams, Niels C.; Bussell, James N.; Adams, David J.; Ramírez-Solis, Ramiro; Steel, Karen P.; Galli, Antonella; White, Jacqueline K.

    2015-01-01

    Knowledge of the expression profile of a gene is a critical piece of information required to build an understanding of the normal and essential functions of that gene and any role it may play in the development or progression of disease. High-throughput, large-scale efforts are on-going internationally to characterise reporter-tagged knockout mouse lines. As part of that effort, we report an open access adult mouse expression resource, in which the expression profile of 424 genes has been assessed in up to 47 different organs, tissues and sub-structures using a lacZ reporter gene. Many specific and informative expression patterns were noted. Expression was most commonly observed in the testis and brain and was most restricted in white adipose tissue and mammary gland. Over half of the assessed genes presented with an absent or localised expression pattern (categorised as 0-10 positive structures). A link between complexity of expression profile and viability of homozygous null animals was observed; inactivation of genes expressed in ≥21 structures was more likely to result in reduced viability by postnatal day 14 compared with more restricted expression profiles. For validation purposes, this mouse expression resource was compared with Bgee, a federated composite of RNA-based expression data sets. Strong agreement was observed, indicating a high degree of specificity in our data. Furthermore, there were 1207 observations of expression of a particular gene in an anatomical structure where Bgee had no data, indicating a large amount of novelty in our data set. Examples of expression data corroborating and extending genotype-phenotype associations and supporting disease gene candidacy are presented to demonstrate the potential of this powerful resource. PMID:26398943

  5. The Molecular Signatures Database (MSigDB) hallmark gene set collection.

    PubMed

    Liberzon, Arthur; Birger, Chet; Thorvaldsdóttir, Helga; Ghandi, Mahmoud; Mesirov, Jill P; Tamayo, Pablo

    2015-12-23

    The Molecular Signatures Database (MSigDB) is one of the most widely used and comprehensive databases of gene sets for performing gene set enrichment analysis. Since its creation, MSigDB has grown beyond its roots in metabolic disease and cancer to include >10,000 gene sets. These better represent a wider range of biological processes and diseases, but the utility of the database is reduced by increased redundancy across, and heterogeneity within, gene sets. To address this challenge, here we use a combination of automated approaches and expert curation to develop a collection of "hallmark" gene sets as part of MSigDB. Each hallmark in this collection consists of a "refined" gene set, derived from multiple "founder" sets, that conveys a specific biological state or process and displays coherent expression. The hallmarks effectively summarize most of the relevant information of the original founder sets and, by reducing both variation and redundancy, provide more refined and concise inputs for gene set enrichment analysis.

  6. Estimating genome-wide regulatory activity from multi-omics data sets using mathematical optimization.

    PubMed

    Trescher, Saskia; Münchmeyer, Jannes; Leser, Ulf

    2017-03-27

    Gene regulation is one of the most important cellular processes, indispensable for the adaptability of organisms and closely interlinked with several classes of pathogenesis and their progression. Elucidation of regulatory mechanisms can be approached by a multitude of experimental methods, yet integration of the resulting heterogeneous, large, and noisy data sets into comprehensive and tissue or disease-specific cellular models requires rigorous computational methods. Recently, several algorithms have been proposed which model genome-wide gene regulation as sets of (linear) equations over the activity and relationships of transcription factors, genes and other factors. Subsequent optimization finds those parameters that minimize the divergence of predicted and measured expression intensities. In various settings, these methods produced promising results in terms of estimating transcription factor activity and identifying key biomarkers for specific phenotypes. However, despite their common root in mathematical optimization, they vastly differ in the types of experimental data being integrated, the background knowledge necessary for their application, the granularity of their regulatory model, the concrete paradigm used for solving the optimization problem and the data sets used for evaluation. Here, we review five recent methods of this class in detail and compare them with respect to several key properties. Furthermore, we quantitatively compare the results of four of the presented methods based on publicly available data sets. The results show that all methods seem to find biologically relevant information. However, we also observe that the mutual result overlaps are very low, which contradicts biological intuition. Our aim is to raise further awareness of the power of these methods, yet also to identify common shortcomings and necessary extensions enabling focused research on the critical points.

  7. Text analysis of MEDLINE for discovering functional relationships among genes: evaluation of keyword extraction weighting schemes.

    PubMed

    Liu, Ying; Navathe, Shamkant B; Pivoshenko, Alex; Dasigi, Venu G; Dingledine, Ray; Ciliax, Brian J

    2006-01-01

    One of the key challenges of microarray studies is to derive biological insights from the gene-expression patterns. Clustering genes by functional keyword association can provide direct information about the functional links among genes. However, the quality of the keyword lists significantly affects the clustering results. We compared two keyword weighting schemes: normalised z-score and term frequency-inverse document frequency (TFIDF). Two gene sets were tested to evaluate the effectiveness of the weighting schemes for keyword extraction for gene clustering. Using established measures of cluster quality, the results produced from TFIDF-weighted keywords outperformed those produced from normalised z-score weighted keywords. The optimised algorithms should be useful for partitioning genes from microarray lists into functionally discrete clusters.
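
As an illustration of TFIDF keyword weighting, the sketch below treats each gene as a "document" made of the keywords extracted for it. It uses one common textbook definition (relative term frequency times the natural log of the inverse document frequency); the abstract does not state the exact variant the authors used, so the formula, the function name and the gene and keyword values are assumptions.

```python
import math
from collections import Counter


def tfidf(docs):
    """Compute TF-IDF weights for keywords attached to each gene.

    docs maps a gene name to the list of keywords extracted for it.
    Returns gene -> {keyword: weight}. Keywords shared by every gene
    get weight 0 because their inverse document frequency is log(1).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for words in docs.values():
        doc_freq.update(set(words))  # count each keyword once per gene
    weights = {}
    for gene, words in docs.items():
        counts = Counter(words)
        total = len(words)
        weights[gene] = {
            w: (c / total) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        }
    return weights


# Hypothetical keyword lists for three genes.
toy_docs = {
    "geneA": ["kinase", "receptor", "kinase"],
    "geneB": ["receptor", "channel"],
    "geneC": ["receptor", "transport"],
}
toy_weights = tfidf(toy_docs)
```

Here "receptor" appears for every gene and is weighted to zero, while "kinase" is specific to `geneA` and receives the largest weight, which is the property that makes such weights useful for separating genes into functional clusters.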

  8. Validation of the β-amy1 transcription profiling assay and selection of reference genes suited for a RT-qPCR assay in developing barley caryopsis.

    PubMed

    Ovesná, Jaroslava; Kučera, Ladislav; Vaculová, Kateřina; Štrymplová, Kamila; Svobodová, Ilona; Milella, Luigi

    2012-01-01

    Reverse transcription coupled with real-time quantitative PCR (RT-qPCR) is a frequently used method for gene expression profiling. Reference genes (RGs) are commonly employed to normalize gene expression data. Limited information exists on gene expression and profiling in the developing barley caryopsis. Expression stability was assessed by measuring the cycle threshold (Ct) range and applying both the GeNorm (pair-wise comparison of geometric means) and Normfinder (model-based approach) principles for the calculation. Here, we have identified a set of four RGs suitable for studying gene expression in the developing barley caryopsis. These encode the proteins GAPDH, HSP90, HSP70 and ubiquitin. We found a correlation between the frequency of occurrence of a transcript in silico and its suitability as an RG. This set of RGs was tested by comparing the normalized level of β-amylase (β-amy1) transcript with directly measured quantities of the BMY1 gene product in the developing barley caryopsis. This panel of genes could be used for other gene expression studies, as well as to optimize β-amy1 analysis for study of the impact of β-amy1 expression upon barley end-use quality.

  9. Differential gene expression in HIV/SIV-associated and spontaneous lymphomas

    PubMed Central

    2005-01-01

    Diffuse large B-cell lymphoma (DLBCL) is more prevalent and more often fatal in HIV-infected patients and SIV-infected monkeys compared to immune-competent individuals. Molecular, biological, and immunological data indicate that virus-associated lymphomagenesis is similar in both infected hosts. To find genes specifically overexpressed in HIV/SIV-associated and non-HIV/SIV-associated DLBCL we compared gene expression profiles of HIV/SIV-related and non-HIV-related lymphomas using subtractive hybridization and Northern blot analysis. Our experimental approach allowed us to detect two genes (a-myb and pub) upregulated solely in HIV/SIV-associated DLBCLs potentially involved in virus-specific lymphomagenesis in human and monkey. Downregulation of the pub gene was observed in all non-HIV-associated lymphomas investigated. In addition, we have found genes upregulated in both non-HIV- and HIV-associated lymphomas. Among those were genes both with known (set, ND4, SMG-1) and unknown functions. In summary, we have demonstrated that simultaneous transcriptional upregulation of at least two genes (a-myb and pub) was specific for AIDS-associated lymphomas. PMID:16239949

  10. Development of PCR primers specific for the amplification and direct sequencing of gyrB genes from microbacteria, order Actinomycetales.

    PubMed

    Richert, Kathrin; Brambilla, Evelyne; Stackebrandt, Erko

    2005-01-01

    PCR primer sets were developed for the specific amplification and sequence analysis of the gene encoding gyrase subunit B (gyrB) in members of the family Microbacteriaceae, class Actinobacteria. The family contains species highly related by 16S rRNA gene sequence analyses. In order to test if the gene sequence analysis of gyrB is appropriate to discriminate between closely related species, we evaluate the 16S rRNA gene phylogeny of its members. As the published universal primer set for gyrB failed to amplify the corresponding gene of the majority of the 80 type strains of the family, three new primer sets were identified that generated fragments with a composite sequence length of about 900 nt. However, the amplification of all three fragments was successful only in 25% of the 80 type strains. In this study, the substitution frequencies in genes encoding gyrase and 16S rDNA were compared for 10 strains of nine genera. The frequency of gyrB nucleotide substitution is significantly higher than that of the 16S rDNA, and no linear correlation exists between the similarities of both molecules among members of the Microbacteriaceae. The phylogenetic analyses using the gyrB sequences provide higher resolution than using 16S rDNA sequences and seem able to discriminate between closely related species.

  11. Microarray-based gene expression profiling in patients with cryopyrin-associated periodic syndromes defines a disease-related signature and IL-1-responsive transcripts.

    PubMed

    Balow, James E; Ryan, John G; Chae, Jae Jin; Booty, Matthew G; Bulua, Ariel; Stone, Deborah; Sun, Hong-Wei; Greene, James; Barham, Beverly; Goldbach-Mansky, Raphaela; Kastner, Daniel L; Aksentijevich, Ivona

    2013-06-01

    To analyse gene expression patterns and to define a specific gene expression signature in patients with the severe end of the spectrum of cryopyrin-associated periodic syndromes (CAPS). The molecular consequences of interleukin 1 inhibition were examined by comparing gene expression patterns in 16 CAPS patients before and after treatment with anakinra. We collected peripheral blood mononuclear cells from 22 CAPS patients with active disease and from 14 healthy children. Transcripts that passed stringent filtering criteria (p values≤false discovery rate 1%) were considered as differentially expressed genes (DEG). A set of DEG was validated by quantitative reverse transcription PCR and functional studies with primary cells from CAPS patients and healthy controls. We used 17 CAPS and 66 non-CAPS patient samples to create a set of gene expression models that differentiates CAPS patients from controls and from patients with other autoinflammatory conditions. Many DEG include transcripts related to the regulation of innate and adaptive immune responses, oxidative stress, cell death, cell adhesion and motility. A set of gene expression-based models comprising the CAPS-specific gene expression signature correctly classified all 17 samples from an independent dataset. This classifier also correctly identified 15 of 16 post-anakinra CAPS samples despite the fact that these CAPS patients were in clinical remission. We identified a gene expression signature that clearly distinguished CAPS patients from controls. A number of DEG were in common with other systemic inflammatory diseases such as systemic onset juvenile idiopathic arthritis. The CAPS-specific gene expression classifiers also suggest incomplete suppression of inflammation at low doses of anakinra.

  12. Microarray-based gene expression profiling in patients with cryopyrin-associated periodic syndromes defines a disease-related signature and IL-1-responsive transcripts

    PubMed Central

    Balow, James E; Ryan, John G; Chae, Jae Jin; Booty, Matthew G; Bulua, Ariel; Stone, Deborah; Sun, Hong-Wei; Greene, James; Barham, Beverly; Goldbach-Mansky, Raphaela; Kastner, Daniel L; Aksentijevich, Ivona

    2014-01-01

    Objective: To analyse gene expression patterns and to define a specific gene expression signature in patients with the severe end of the spectrum of cryopyrin-associated periodic syndromes (CAPS). The molecular consequences of interleukin 1 inhibition were examined by comparing gene expression patterns in 16 CAPS patients before and after treatment with anakinra. Methods: We collected peripheral blood mononuclear cells from 22 CAPS patients with active disease and from 14 healthy children. Transcripts that passed stringent filtering criteria (p values ≤ false discovery rate 1%) were considered as differentially expressed genes (DEG). A set of DEG was validated by quantitative reverse transcription PCR and functional studies with primary cells from CAPS patients and healthy controls. We used 17 CAPS and 66 non-CAPS patient samples to create a set of gene expression models that differentiates CAPS patients from controls and from patients with other autoinflammatory conditions. Results: Many DEG include transcripts related to the regulation of innate and adaptive immune responses, oxidative stress, cell death, cell adhesion and motility. A set of gene expression-based models comprising the CAPS-specific gene expression signature correctly classified all 17 samples from an independent dataset. This classifier also correctly identified 15 of 16 post-anakinra CAPS samples despite the fact that these CAPS patients were in clinical remission. Conclusions: We identified a gene expression signature that clearly distinguished CAPS patients from controls. A number of DEG were in common with other systemic inflammatory diseases such as systemic onset juvenile idiopathic arthritis. The CAPS-specific gene expression classifiers also suggest incomplete suppression of inflammation at low doses of anakinra. PMID:23223423

  13. snpGeneSets: An R Package for Genome-Wide Study Annotation

    PubMed Central

    Mei, Hao; Li, Lianna; Jiang, Fan; Simino, Jeannette; Griswold, Michael; Mosley, Thomas; Liu, Shijian

    2016-01-01

    Genome-wide studies (GWS) of SNP associations and differential gene expressions have generated abundant results; next-generation sequencing technology has further boosted the number of variants and genes identified. Effective interpretation requires massive annotation and downstream analysis of these genome-wide results, a computationally challenging task. We developed the snpGeneSets package to simplify annotation and analysis of GWS results. Our package integrates local copies of knowledge bases for SNPs, genes, and gene sets, and implements wrapper functions in the R language to enable transparent access to low-level databases for efficient annotation of large genomic data. The package contains functions that execute three types of annotations: (1) genomic mapping annotation for SNPs and genes and functional annotation for gene sets; (2) bidirectional mapping between SNPs and genes, and genes and gene sets; and (3) calculation of gene effect measures from SNP associations and performance of gene set enrichment analyses to identify functional pathways. We applied snpGeneSets to type 2 diabetes (T2D) results from the NHGRI genome-wide association study (GWAS) catalog, a Finnish GWAS, and a genome-wide expression study (GWES). These studies demonstrate the usefulness of snpGeneSets for annotating and performing enrichment analysis of GWS results. The package is open-source, free, and can be downloaded at: https://www.umc.edu/biostats_software/. PMID:27807048

  14. The Association of Multiple Interacting Genes with Specific Phenotypes in Rice Using Gene Coexpression Networks

    PubMed Central

    Ficklin, Stephen P.; Luo, Feng; Feltus, F. Alex

    2010-01-01

    Discovering gene sets underlying the expression of a given phenotype is of great importance, as many phenotypes are the result of complex gene-gene interactions. Gene coexpression networks, built using a set of microarray samples as input, can help elucidate tightly coexpressed gene sets (modules) that are mixed with genes of known and unknown function. Functional enrichment analysis of modules further subdivides the coexpressed gene set into cofunctional gene clusters that may coexist in the module with other functionally related gene clusters. In this study, 45 coexpressed gene modules and 76 cofunctional gene clusters were discovered for rice (Oryza sativa) using a global, knowledge-independent paradigm and the combination of two network construction methodologies. Some clusters were enriched for previously characterized mutant phenotypes, providing evidence for specific gene sets (and their annotated molecular functions) that underlie specific phenotypes. PMID:20668062

  15. The association of multiple interacting genes with specific phenotypes in rice using gene coexpression networks.

    PubMed

    Ficklin, Stephen P; Luo, Feng; Feltus, F Alex

    2010-09-01

    Discovering gene sets underlying the expression of a given phenotype is of great importance, as many phenotypes are the result of complex gene-gene interactions. Gene coexpression networks, built using a set of microarray samples as input, can help elucidate tightly coexpressed gene sets (modules) that are mixed with genes of known and unknown function. Functional enrichment analysis of modules further subdivides the coexpressed gene set into cofunctional gene clusters that may coexist in the module with other functionally related gene clusters. In this study, 45 coexpressed gene modules and 76 cofunctional gene clusters were discovered for rice (Oryza sativa) using a global, knowledge-independent paradigm and the combination of two network construction methodologies. Some clusters were enriched for previously characterized mutant phenotypes, providing evidence for specific gene sets (and their annotated molecular functions) that underlie specific phenotypes.

  16. Effect of Aggregation Operators on Network-Based Disease Gene Prioritization: A Case Study on Blood Disorders.

    PubMed

    Grewal, Nivit; Singh, Shailendra; Chand, Trilok

    2017-01-01

    Owing to the innate noise in biological data sources, a single source or a single measure does not suffice for effective disease gene prioritization, so integration of multiple data sources or aggregation of multiple measures is needed. Aggregation operators combine multiple related data values into a single value such that the combined value reflects all the individual values. This paper applies fuzzy aggregation to network-based disease gene prioritization and investigates its effect under noisy conditions. The study was conducted for a set of 15 blood disorders by fusing four different network measures, computed from the protein interaction network, using a selected set of aggregation operators and ranking the genes on the basis of the aggregated value. The aggregation operator-based rankings were compared with the "Random walk with restart" gene prioritization method. The impact of noise was also investigated by adding varying proportions of noise to the seed set. The results reveal that, for all the selected blood disorders, the Mean of Maximal operator outperformed the other aggregation operators for noisy as well as non-noisy data.

  17. Loci and pathways associated with uterine capacity for pregnancy and fertility in beef cattle.

    PubMed

    Neupane, Mahesh; Geary, Thomas W; Kiser, Jennifer N; Burns, Gregory W; Hansen, Peter J; Spencer, Thomas E; Neibergs, Holly L

    2017-01-01

    Infertility and subfertility negatively impact the economics and reproductive performance of cattle. Of note, significant pregnancy loss occurs in cattle during the first month of pregnancy, yet little is known about the genetic loci influencing pregnancy success and loss in cattle. To identify quantitative trait loci (QTL) with large effects associated with early pregnancy loss, Angus crossbred heifers were classified based on day 28 pregnancy outcomes to serial embryo transfer. A genome wide association analysis (GWAA) was conducted comparing 30 high fertility heifers with 100% success in establishing pregnancy to 55 subfertile heifers with 25% or less success. A gene set enrichment analysis SNP (GSEA-SNP) was performed to identify gene sets and leading edge genes influencing pregnancy loss. The GWAA identified 22 QTL (p < 1 × 10⁻⁵), and GSEA-SNP identified 9 gene sets (normalized enrichment score > 3.0) with 253 leading edge genes. Network analysis identified TNF (tumor necrosis factor), estrogen, and TP53 (tumor protein 53) as the top of 671 upstream regulators (p < 0.001), whereas the SOX2 (SRY [sex determining region Y]-box 2) and OCT4 (octamer-binding transcription factor 4) complex was the top master regulator out of 773 master regulators associated with fertility (p < 0.001). Identification of QTL and genes in pathways that improve early pregnancy success provides critical information for genomic selection to increase fertility in cattle. The identified genes and regulators also provide insight into the complex biological mechanisms underlying pregnancy establishment in cattle.

  18. Turning publicly available gene expression data into discoveries using gene set context analysis.

    PubMed

    Ji, Zhicheng; Vokes, Steven A; Dang, Chi V; Ji, Hongkai

    2016-01-08

    Gene Set Context Analysis (GSCA) is an open source software package to help researchers use massive amounts of publicly available gene expression data (PED) to make discoveries. Users can interactively visualize and explore gene and gene set activities in 25,000+ consistently normalized human and mouse gene expression samples representing diverse biological contexts (e.g. different cells, tissues and disease types, etc.). By providing one or multiple genes or gene sets as input and specifying a gene set activity pattern of interest, users can query the expression compendium to systematically identify biological contexts associated with the specified gene set activity pattern. In this way, researchers with new gene sets from their own experiments may discover previously unknown contexts of gene set functions and hence increase the value of their experiments. GSCA has a graphical user interface (GUI). The GUI makes the analysis convenient and customizable. Analysis results can be conveniently exported as publication quality figures and tables. GSCA is available at https://github.com/zji90/GSCA. This software significantly lowers the bar for biomedical investigators to use PED in their daily research for generating and screening hypotheses, which was previously difficult because of the complexity, heterogeneity and size of the data. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.

  19. High-resolution gene expression data from blastoderm embryos of the scuttle fly Megaselia abdita

    PubMed Central

    Wotton, Karl R; Jiménez-Guri, Eva; Crombach, Anton; Cicin-Sain, Damjan; Jaeger, Johannes

    2015-01-01

    Gap genes are involved in segment determination during early development in dipteran insects (flies, midges, and mosquitoes). We carried out a systematic quantitative comparative analysis of the gap gene network across different dipteran species. Our work provides mechanistic insights into the evolution of this pattern-forming network. As a central component of our project, we created a high-resolution quantitative spatio-temporal data set of gap and maternal co-ordinate gene expression in the blastoderm embryo of the non-drosophilid scuttle fly, Megaselia abdita. Our data include expression patterns in both wild-type and RNAi-treated embryos. The data—covering 10 genes, 10 time points, and over 1,000 individual embryos—consist of original embryo images, quantified expression profiles, extracted positions of expression boundaries, and integrated expression patterns, plus metadata and intermediate processing steps. These data provide a valuable resource for researchers interested in the comparative study of gene regulatory networks and pattern formation, an essential step towards a more quantitative and mechanistic understanding of developmental evolution. PMID:25977812

  20. MAVTgsa: An R Package for Gene Set (Enrichment) Analysis

    DOE PAGES

    Chien, Chih-Yi; Chang, Ching-Wei; Tsai, Chen-An; ...

    2014-01-01

    Gene set analysis methods aim to determine whether an a priori defined set of genes shows a statistically significant difference in expression on either categorical or continuous outcomes. Although many methods for gene set analysis have been proposed, a systematic analysis tool for identification of different types of gene set significance modules has not been developed previously. This work presents an R package, called MAVTgsa, which includes three different methods for integrated gene set enrichment analysis. (1) The one-sided OLS (ordinary least squares) test detects coordinated changes of genes in a gene set in one direction, either up- or downregulation. (2) The two-sided MANOVA (multivariate analysis of variance) detects changes in both directions, up- and downregulation, for studying two or more experimental conditions. (3) A random forests-based procedure identifies gene sets that can accurately predict samples from different experimental conditions or are associated with continuous phenotypes. MAVTgsa computes the P values and FDR (false discovery rate) q-values for all gene sets in the study. Furthermore, MAVTgsa provides several visualization outputs to support and interpret the enrichment results. This package is available online.

  1. Selection of Phototransduction Genes in Homo sapiens.

    PubMed

    Christopher, Mark; Scheetz, Todd E; Mullins, Robert F; Abràmoff, Michael D

    2013-08-13

    We investigated the evidence of recent positive selection in the human phototransduction system at single nucleotide polymorphism (SNP) and gene level. SNP genotyping data from the International HapMap Project for European, Eastern Asian, and African populations was used to discover differences in haplotype length and allele frequency between these populations. Numeric selection metrics were computed for each SNP and aggregated into gene-level metrics to measure evidence of recent positive selection. The level of recent positive selection in phototransduction genes was evaluated and compared to a set of genes shown previously to be under recent selection, and a set of highly conserved genes as positive and negative controls, respectively. Six of 20 phototransduction genes evaluated had gene-level selection metrics above the 90th percentile: RGS9, GNB1, RHO, PDE6G, GNAT1, and SLC24A1. The selection signal across these genes was found to be of similar magnitude to the positive control genes and much greater than the negative control genes. There is evidence for selective pressure in the genes involved in retinal phototransduction, and traces of this selective pressure can be demonstrated using SNP-level and gene-level metrics of allelic variation. We hypothesize that the selective pressure on these genes was related to their role in low light vision and retinal adaptation to ambient light changes. Uncovering the underlying genetics of evolutionary adaptations in phototransduction not only allows greater understanding of vision and visual diseases, but also the development of patient-specific diagnostic and intervention strategies.

  2. A house finch (Haemorhous mexicanus) spleen transcriptome reveals intra- and interspecific patterns of gene expression, alternative splicing and genetic diversity in passerines

    PubMed Central

    2014-01-01

    Background With its plumage color dimorphism and unique history in North America, including a recent population expansion and an epizootic of Mycoplasma gallisepticum (MG), the house finch (Haemorhous mexicanus) is a model species for studying sexual selection, plumage coloration and host-parasite interactions. As part of our ongoing efforts to make available genomic resources for this species, here we report a transcriptome assembly derived from genes expressed in spleen. Results We characterize transcriptomes from two populations with different histories of demography and disease exposure: a recently founded population in the eastern US that has been exposed to MG for over a decade and a native population from the western range that has never been exposed to MG. We utilize this resource to quantify conservation in gene expression in passerine birds over approximately 50 MY by comparing splenic expression profiles for 9,646 house finch transcripts and those from zebra finch and find that less than half of all genes expressed in spleen in either species are expressed in both species. Comparative gene annotations from several vertebrate species suggest that the house finch transcriptomes contain ~15 genes not yet found in previously sequenced vertebrate genomes. The house finch transcriptomes harbour ~85,000 SNPs, ~20,000 of which are non-synonymous. Although not yet validated by biological or technical replication, we identify a set of genes exhibiting differences between populations in gene expression (n = 182; 2% of all transcripts), allele frequencies (76 FST outliers) and alternative splicing as well as genes with several fixed non-synonymous substitutions; this set includes genes with functions related to double-strand break repair and immune response. Conclusions The two house finch spleen transcriptome profiles will add to the increasing data on genome and transcriptome sequence information from natural populations. Differences in splenic expression between house finch and zebra finch imply either significant evolutionary turnover of splenic expression patterns or different physiological states of the individuals examined. The transcriptome resource will enhance the potential to annotate an eventual house finch genome, and the set of gene-based high-quality SNPs will help clarify the genetic underpinnings of host-pathogen interactions and sexual selection. PMID:24758272

  3. An enhancement of binary particle swarm optimization for gene selection in classifying cancer classes

    PubMed Central

    2013-01-01

    Background Gene expression data could likely be a momentous help in the progress of proficient cancer diagnoses and classification platforms. Lately, many researchers analyze gene expression data using diverse computational intelligence methods, for selecting a small subset of informative genes from the data for cancer classification. Many computational methods face difficulties in selecting small subsets due to the small number of samples compared to the huge number of genes (high-dimension), irrelevant genes, and noisy genes. Methods We propose an enhanced binary particle swarm optimization to perform the selection of small subsets of informative genes which is significant for cancer classification. Particle speed, rule, and modified sigmoid function are introduced in this proposed method to increase the probability of the bits in a particle’s position to be zero. The method was empirically applied to a suite of ten well-known benchmark gene expression data sets. Results The performance of the proposed method proved to be superior to other previous related works, including the conventional version of binary particle swarm optimization (BPSO) in terms of classification accuracy and the number of selected genes. The proposed method also requires lower computational time compared to BPSO. PMID:23617960
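
For orientation, the conventional BPSO that this abstract uses as its baseline can be sketched as follows. This is a minimal illustration, not the enhanced method of the paper: the modified sigmoid and particle-speed rule described above are not reproduced, and the function names, parameter values, and toy fitness function are assumptions made for the example.

```python
import math
import random


def sigmoid(v):
    """Standard logistic transfer function used by conventional BPSO."""
    return 1.0 / (1.0 + math.exp(-v))


def bpso(fitness, n_bits, n_particles=20, n_iter=50,
         w=0.7, c1=1.5, c2=1.5, v_max=4.0, seed=0):
    """Maximise `fitness` over bit vectors; bit d == 1 means gene d is selected.

    All parameter defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(n_particles)]
    vel = [[0.0] * n_bits for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(n_bits):
                r1, r2 = rng.random(), rng.random()
                v = (w * vel[i][d]
                     + c1 * r1 * (pbest[i][d] - pos[i][d])
                     + c2 * r2 * (gbest[d] - pos[i][d]))
                # Clamp the velocity so the sigmoid never saturates completely.
                vel[i][d] = max(-v_max, min(v_max, v))
                # The velocity sets the probability that the bit becomes 1.
                pos[i][d] = 1 if rng.random() < sigmoid(vel[i][d]) else 0
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit


# Toy stand-in for classification accuracy: genes 0 and 3 are "informative",
# and every selected gene carries a small penalty to favour small subsets.
INFORMATIVE = (0, 3)


def toy_fitness(bits):
    return sum(bits[i] for i in INFORMATIVE) - 0.1 * sum(bits)


best_bits, best_fit = bpso(toy_fitness, n_bits=8)
print(best_bits, best_fit)
```

In a real gene selection setting the fitness would be a cross-validated classifier accuracy on the selected genes; the penalty term here only mimics the preference for few genes.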

  4. The Essential Genome of Escherichia coli K-12

    PubMed Central

    2018-01-01

    ABSTRACT Transposon-directed insertion site sequencing (TraDIS) is a high-throughput method coupling transposon mutagenesis with short-fragment DNA sequencing. It is commonly used to identify essential genes. Single gene deletion libraries are considered the gold standard for identifying essential genes. Currently, the TraDIS method has not been benchmarked against such libraries, and therefore, it remains unclear whether the two methodologies are comparable. To address this, a high-density transposon library was constructed in Escherichia coli K-12. Essential genes predicted from sequencing of this library were compared to existing essential gene databases. To decrease false-positive identification of essential genes, statistical data analysis included corrections for both gene length and genome length. Through this analysis, new essential genes and genes previously incorrectly designated essential were identified. We show that manual analysis of TraDIS data reveals novel features that would not have been detected by statistical analysis alone. Examples include short essential regions within genes, orientation-dependent effects, and fine-resolution identification of genome and protein features. Recognition of these insertion profiles in transposon mutagenesis data sets will assist genome annotation of less well characterized genomes and provides new insights into bacterial physiology and biochemistry. PMID:29463657

  5. A large-scale benchmark of gene prioritization methods.

    PubMed

    Guala, Dimitri; Sonnhammer, Erik L L

    2017-04-21

    In order to maximize the use of results from high-throughput experimental studies, e.g. GWAS, for identification and diagnostics of new disease-associated genes, it is important to have properly analyzed and benchmarked gene prioritization tools. While prospective benchmarks are underpowered to provide statistically significant results in their attempt to differentiate the performance of gene prioritization tools, a strategy for retrospective benchmarking has been missing, and new tools usually only provide internal validations. The Gene Ontology (GO) contains genes clustered around annotation terms. This intrinsic property of GO can be utilized in construction of robust benchmarks, objective to the problem domain. We demonstrate how this can be achieved for network-based gene prioritization tools, utilizing the FunCoup network. We use cross-validation and a set of appropriate performance measures to compare state-of-the-art gene prioritization algorithms: three based on network diffusion, NetRank and two implementations of Random Walk with Restart, and MaxLink that utilizes network neighborhood. Our benchmark suite provides a systematic and objective way to compare the multitude of available and future gene prioritization tools, enabling researchers to select the best gene prioritization tool for the task at hand, and helping to guide the development of more accurate methods.
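
As a rough illustration of the network-diffusion idea behind Random Walk with Restart, the sketch below iterates p ← (1 − r)·W·p + r·p0 on a tiny unweighted graph until the scores stop changing. It is not one of the benchmarked implementations and does not use the FunCoup network; the graph, the seed gene, the restart probability, and the function name are all assumptions made for the example.

```python
def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Score every node by its steady-state visiting probability.

    adj     -- dict mapping each node to a list of its neighbours (undirected)
    seeds   -- set of known disease genes where the walker restarts
    restart -- probability of jumping back to a seed at each step (assumed value)
    """
    nodes = sorted(adj)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        # Restart term: a fixed share of the mass returns to the seeds.
        nxt = {n: restart * p0[n] for n in nodes}
        # Diffusion term: each node spreads its remaining mass evenly.
        for u in nodes:
            neighbours = adj[u]
            if not neighbours:
                continue
            share = (1.0 - restart) * p[u] / len(neighbours)
            for v in neighbours:
                nxt[v] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Hypothetical five-gene network; "A" plays the role of a known disease gene.
toy_network = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C", "E"],
    "E": ["D"],
}

scores = random_walk_with_restart(toy_network, seeds={"A"})
candidates = sorted((n for n in scores if n != "A"), key=scores.get, reverse=True)
print(candidates)
```

Candidate genes are then ranked by score, so genes that are well connected to the seeds come out ahead of distant ones.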

  6. Transcriptional reprogramming of gene expression in bovine somatic cell chromatin transfer embryos

    PubMed Central

    Rodriguez-Osorio, Nelida; Wang, Zhongde; Kasinathan, Poothappillai; Page, Grier P; Robl, James M; Memili, Erdogan

    2009-01-01

    Background Successful reprogramming of a somatic genome to produce a healthy clone by somatic cells nuclear transfer (SCNT) is a rare event and the mechanisms involved in this process are poorly defined. When serial or successive rounds of cloning are performed, blastocyst and full term development rates decline even further with the increasing rounds of cloning. Identifying the "cumulative errors" could reveal the epigenetic reprogramming blocks in animal cloning. Results Bovine clones from up to four generations of successive cloning were produced by chromatin transfer (CT). Using Affymetrix bovine microarrays we determined that the transcriptomes of blastocysts derived from the first and the fourth rounds of cloning (CT1 and CT4 respectively) have undergone an extensive reprogramming and were more similar to blastocysts derived from in vitro fertilization (IVF) than to the donor cells used for the first and the fourth rounds of chromatin transfer (DC1 and DC4 respectively). However a set of transcripts in the cloned embryos showed a misregulated pattern when compared to IVF embryos. Among the genes consistently upregulated in both CT groups compared to the IVF embryos were genes involved in regulation of cytoskeleton and cell shape. Among the genes consistently upregulated in IVF embryos compared to both CT groups were genes involved in chromatin remodelling and stress coping. Conclusion The present study provides a data set that could contribute in our understanding of epigenetic errors in somatic cell chromatin transfer. Identifying "cumulative errors" after serial cloning could reveal some of the epigenetic reprogramming blocks shedding light on the reprogramming process, important for both basic and applied research. PMID:19393066

  7. In silico analysis of stomach lineage specific gene set expression pattern in gastric cancer

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Pandi, Narayanan Sathiya, E-mail: sathiyapandi@gmail.com; Suganya, Sivagurunathan; Rajendran, Suriliyandi

    Highlights: • Identified stomach lineage specific gene set (SLSGS) was found to be under expressed in gastric tumors. • Elevated expression of SLSGS in gastric tumor is a molecular predictor of metabolic type gastric cancer. • In silico pathway scanning identified estrogen-α signaling is a putative regulator of SLSGS in gastric cancer. • Elevated expression of SLSGS in GC is associated with an overall increase in the survival of GC patients. -- Abstract: Stomach lineage specific gene products act as a protective barrier in the normal stomach and their expression maintains the normal physiological processes, cellular integrity and morphology of the gastric wall. However, the regulation of stomach lineage specific genes in gastric cancer (GC) is far less clear. In the present study, we sought to investigate the role and regulation of stomach lineage specific gene set (SLSGS) in GC. SLSGS was identified by comparing the mRNA expression profiles of normal stomach tissue with other organ tissue. The obtained SLSGS was found to be under expressed in gastric tumors. Functional annotation analysis revealed that the SLSGS was enriched for digestive function and gastric epithelial maintenance. Employing a single sample prediction method across GC mRNA expression profiles identified the under expression of SLSGS in proliferative type and invasive type gastric tumors compared to the metabolic type gastric tumors. Integrative pathway activation prediction analysis revealed a close association between estrogen-α signaling and SLSGS expression pattern in GC. Elevated expression of SLSGS in GC is associated with an overall increase in the survival of GC patients. In conclusion, our results highlight that estrogen mediated regulation of SLSGS in gastric tumor is a molecular predictor of metabolic type GC and prognostic factor in GC.

  8. Comparison of normalization methods for differential gene expression analysis in RNA-Seq experiments

    PubMed Central

    Maza, Elie; Frasse, Pierre; Senin, Pavel; Bouzayen, Mondher; Zouine, Mohamed

    2013-01-01

    In recent years, RNA-Seq technologies became a powerful tool for transcriptome studies. However, computational methods dedicated to the analysis of high-throughput sequencing data are yet to be standardized. In particular, it is known that the choice of a normalization procedure leads to a great variability in results of differential gene expression analysis. The present study compares the most widespread normalization procedures and proposes a novel one aiming at removing an inherent bias of studied transcriptomes related to their relative size. Comparisons of the normalization procedures are performed on real and simulated data sets. Real RNA-Seq data sets analyses, performed with all the different normalization methods, show that only 50% of significantly differentially expressed genes are common. This result highlights the influence of the normalization step on the differential expression analysis. Real and simulated data sets analyses give similar results showing 3 different groups of procedures having the same behavior. The group including the novel method named “Median Ratio Normalization” (MRN) gives the lower number of false discoveries. Within this group the MRN method is less sensitive to the modification of parameters related to the relative size of transcriptomes such as the number of down- and upregulated genes and the gene expression levels. The newly proposed MRN method efficiently deals with intrinsic bias resulting from relative size of studied transcriptomes. Validation with real and simulated data sets confirmed that MRN is more consistent and robust than existing methods. PMID:26442135

  9. Systems Genetics Analysis of Genome-Wide Association Study Reveals Novel Associations Between Key Biological Processes and Coronary Artery Disease.

    PubMed

    Ghosh, Sujoy; Vivar, Juan; Nelson, Christopher P; Willenborg, Christina; Segrè, Ayellet V; Mäkinen, Ville-Petteri; Nikpay, Majid; Erdmann, Jeannette; Blankenberg, Stefan; O'Donnell, Christopher; März, Winfried; Laaksonen, Reijo; Stewart, Alexandre F R; Epstein, Stephen E; Shah, Svati H; Granger, Christopher B; Hazen, Stanley L; Kathiresan, Sekar; Reilly, Muredach P; Yang, Xia; Quertermous, Thomas; Samani, Nilesh J; Schunkert, Heribert; Assimes, Themistocles L; McPherson, Ruth

    2015-07-01

    Genome-wide association studies have identified multiple genetic variants affecting the risk of coronary artery disease (CAD). However, individually these explain only a small fraction of the heritability of CAD and for most, the causal biological mechanisms remain unclear. We sought to obtain further insights into potential causal processes of CAD by integrating large-scale GWA data with expertly curated databases of core human pathways and functional networks. Using pathways (gene sets) from Reactome, we carried out a 2-stage gene set enrichment analysis strategy. From a meta-analyzed discovery cohort of 7 CAD genome-wide association study data sets (9889 cases/11 089 controls), nominally significant gene sets were tested for replication in a meta-analysis of 9 additional studies (15 502 cases/55 730 controls) from the Coronary ARtery DIsease Genome wide Replication and Meta-analysis (CARDIoGRAM) Consortium. A total of 32 of 639 Reactome pathways tested showed convincing association with CAD (replication P<0.05). These pathways resided in 9 of 21 core biological processes represented in Reactome, and included pathways relevant to extracellular matrix (ECM) integrity, innate immunity, axon guidance, and signaling by PDGF (platelet-derived growth factor), NOTCH, and the transforming growth factor-β/SMAD receptor complex. Many of these pathways had strengths of association comparable to those observed in lipid transport pathways. Network analysis of unique genes within the replicated pathways further revealed several interconnected functional and topologically interacting modules representing novel associations (eg, semaphorin-regulated axonal guidance pathway) besides confirming known processes (lipid metabolism). The connectivity in the observed networks was statistically significant compared with random networks (P<0.001). Network centrality analysis (degree and betweenness) further identified genes (eg, NCAM1, FYN, FURIN, etc) likely to play critical roles in the maintenance and functioning of several of the replicated pathways. These findings provide novel insights into how genetic variation, interpreted in the context of biological processes and functional interactions among genes, may help define the genetic architecture of CAD. © 2015 American Heart Association, Inc.

  10. Gene expression models for prediction of longitudinal dispersion coefficient in streams

    NASA Astrophysics Data System (ADS)

    Sattar, Ahmed M. A.; Gharabaghi, Bahram

    2015-05-01

    Longitudinal dispersion is the key hydrologic process that governs transport of pollutants in natural streams. It is critical for spill action centers to be able to predict the pollutant travel time and break-through curves accurately following accidental spills in urban streams. This study presents a novel gene expression model for longitudinal dispersion developed using 150 published data sets of geometric and hydraulic parameters in natural streams in the United States, Canada, Europe, and New Zealand. The training and testing of the model were accomplished using randomly-selected 67% (100 data sets) and 33% (50 data sets) of the data sets, respectively. Gene expression programming (GEP) is used to develop empirical relations between the longitudinal dispersion coefficient and various control variables, including the Froude number which reflects the effect of reach slope, aspect ratio, and the bed material roughness on the dispersion coefficient. Two GEP models have been developed, and the prediction uncertainties of the developed GEP models are quantified and compared with those of existing models, showing improved prediction accuracy in favor of GEP models. Finally, a parametric analysis is performed for further verification of the developed GEP models. The main reason for the higher accuracy of the GEP models compared to the existing regression models is that exponents of the key variables (aspect ratio and bed material roughness) are not constants but a function of the Froude number. The proposed relations are both simple and accurate and can be effectively used to predict the longitudinal dispersion coefficients in natural streams.

  11. Finding the missing honey bee genes: lessons learned from a genome upgrade.

    PubMed

    Elsik, Christine G; Worley, Kim C; Bennett, Anna K; Beye, Martin; Camara, Francisco; Childers, Christopher P; de Graaf, Dirk C; Debyser, Griet; Deng, Jixin; Devreese, Bart; Elhaik, Eran; Evans, Jay D; Foster, Leonard J; Graur, Dan; Guigo, Roderic; Hoff, Katharina Jasmin; Holder, Michael E; Hudson, Matthew E; Hunt, Greg J; Jiang, Huaiyang; Joshi, Vandita; Khetani, Radhika S; Kosarev, Peter; Kovar, Christie L; Ma, Jian; Maleszka, Ryszard; Moritz, Robin F A; Munoz-Torres, Monica C; Murphy, Terence D; Muzny, Donna M; Newsham, Irene F; Reese, Justin T; Robertson, Hugh M; Robinson, Gene E; Rueppell, Olav; Solovyev, Victor; Stanke, Mario; Stolle, Eckart; Tsuruda, Jennifer M; Vaerenbergh, Matthias Van; Waterhouse, Robert M; Weaver, Daniel B; Whitfield, Charles W; Wu, Yuanqing; Zdobnov, Evgeny M; Zhang, Lan; Zhu, Dianhui; Gibbs, Richard A

    2014-01-30

    The first generation of genome sequence assemblies and annotations have had a significant impact upon our understanding of the biology of the sequenced species, the phylogenetic relationships among species, the study of populations within and across species, and have informed the biology of humans. As only a few Metazoan genomes are approaching finished quality (human, mouse, fly and worm), there is room for improvement of most genome assemblies. The honey bee (Apis mellifera) genome, published in 2006, was noted for its bimodal GC content distribution that affected the quality of the assembly in some regions and for fewer genes in the initial gene set (OGSv1.0) compared to what would be expected based on other sequenced insect genomes. Here, we report an improved honey bee genome assembly (Amel_4.5) with a new gene annotation set (OGSv3.2), and show that the honey bee genome contains a number of genes similar to that of other insect genomes, contrary to what was suggested in OGSv1.0. The new genome assembly is more contiguous and complete and the new gene set includes ~5000 more protein-coding genes, 50% more than previously reported. About 1/6 of the additional genes were due to improvements to the assembly, and the remaining were inferred based on new RNAseq and protein data. Lessons learned from this genome upgrade have important implications for future genome sequencing projects. Furthermore, the improvements significantly enhance genomic resources for the honey bee, a key model for social behavior and essential to global ecology through pollination.

  12. Comparative genomics reveals candidate carotenoid pathway regulators of ripening watermelon fruit.

    PubMed

    Grassi, Stefania; Piro, Gabriella; Lee, Je Min; Zheng, Yi; Fei, Zhangjun; Dalessandro, Giuseppe; Giovannoni, James J; Lenucci, Marcello S

    2013-11-12

    Many fruits, including watermelon, are proficient in carotenoid accumulation during ripening. While most genes encoding steps in the carotenoid biosynthetic pathway have been cloned, few transcriptional regulators of these genes have been defined to date. Here we describe the identification of a set of putative carotenoid-related transcription factors resulting from fresh watermelon carotenoid and transcriptome analysis during fruit development and ripening. Our goal is to both clarify the expression profiles of carotenoid pathway genes and to identify candidate regulators and molecular targets for crop improvement. Total carotenoids progressively increased during fruit ripening up to ~55 μg g(-1) fw in red-ripe fruits. Trans-lycopene was the carotenoid that contributed most to this increase. Many of the genes related to carotenoid metabolism displayed changing expression levels during fruit ripening generating a metabolic flux toward carotenoid synthesis. Constitutive low expression of lycopene cyclase genes resulted in lycopene accumulation. RNA-seq expression profiling of watermelon fruit development yielded a set of transcription factors whose expression was correlated with ripening and carotenoid accumulation. Nineteen putative transcription factor genes from watermelon and homologous to tomato carotenoid-associated genes were identified. Among these, six were differentially expressed in the flesh of both species during fruit development and ripening. Taken together the data suggest that, while the regulation of a common set of metabolic genes likely influences carotenoid synthesis and accumulation in watermelon and tomato fruits during development and ripening, specific and limiting regulators may differ between climacteric and non-climacteric fruits, possibly related to their differential susceptibility to and use of ethylene during ripening.

  13. Finding the missing honey bee genes: lessons learned from a genome upgrade

    PubMed Central

    2014-01-01

    Background The first generation of genome sequence assemblies and annotations have had a significant impact upon our understanding of the biology of the sequenced species, the phylogenetic relationships among species, the study of populations within and across species, and have informed the biology of humans. As only a few Metazoan genomes are approaching finished quality (human, mouse, fly and worm), there is room for improvement of most genome assemblies. The honey bee (Apis mellifera) genome, published in 2006, was noted for its bimodal GC content distribution that affected the quality of the assembly in some regions and for fewer genes in the initial gene set (OGSv1.0) compared to what would be expected based on other sequenced insect genomes. Results Here, we report an improved honey bee genome assembly (Amel_4.5) with a new gene annotation set (OGSv3.2), and show that the honey bee genome contains a number of genes similar to that of other insect genomes, contrary to what was suggested in OGSv1.0. The new genome assembly is more contiguous and complete and the new gene set includes ~5000 more protein-coding genes, 50% more than previously reported. About 1/6 of the additional genes were due to improvements to the assembly, and the remaining were inferred based on new RNAseq and protein data. Conclusions Lessons learned from this genome upgrade have important implications for future genome sequencing projects. Furthermore, the improvements significantly enhance genomic resources for the honey bee, a key model for social behavior and essential to global ecology through pollination. PMID:24479613

  14. Comparative genomics reveals candidate carotenoid pathway regulators of ripening watermelon fruit

    PubMed Central

    2013-01-01

    Background Many fruits, including watermelon, are proficient in carotenoid accumulation during ripening. While most genes encoding steps in the carotenoid biosynthetic pathway have been cloned, few transcriptional regulators of these genes have been defined to date. Here we describe the identification of a set of putative carotenoid-related transcription factors resulting from fresh watermelon carotenoid and transcriptome analysis during fruit development and ripening. Our goal is to both clarify the expression profiles of carotenoid pathway genes and to identify candidate regulators and molecular targets for crop improvement. Results Total carotenoids progressively increased during fruit ripening up to ~55 μg g-1 fw in red-ripe fruits. Trans-lycopene was the carotenoid that contributed most to this increase. Many of the genes related to carotenoid metabolism displayed changing expression levels during fruit ripening generating a metabolic flux toward carotenoid synthesis. Constitutive low expression of lycopene cyclase genes resulted in lycopene accumulation. RNA-seq expression profiling of watermelon fruit development yielded a set of transcription factors whose expression was correlated with ripening and carotenoid accumulation. Nineteen putative transcription factor genes from watermelon and homologous to tomato carotenoid-associated genes were identified. Among these, six were differentially expressed in the flesh of both species during fruit development and ripening. Conclusions Taken together the data suggest that, while the regulation of a common set of metabolic genes likely influences carotenoid synthesis and accumulation in watermelon and tomato fruits during development and ripening, specific and limiting regulators may differ between climacteric and non-climacteric fruits, possibly related to their differential susceptibility to and use of ethylene during ripening. PMID:24219562

  15. Prediction of regulatory gene pairs using dynamic time warping and gene ontology.

    PubMed

    Yang, Andy C; Hsu, Hui-Huang; Lu, Ming-Da; Tseng, Vincent S; Shih, Timothy K

    2014-01-01

    Selecting informative genes is the most important task for data analysis on microarray gene expression data. In this work, we aim at identifying regulatory gene pairs from microarray gene expression data. However, microarray data often contain multiple missing expression values. Missing value imputation is thus needed before further processing for regulatory gene pairs becomes possible. We develop a novel approach to first impute missing values in microarray time series data by combining k-Nearest Neighbour (KNN), Dynamic Time Warping (DTW) and Gene Ontology (GO). After missing values are imputed, we then perform gene regulation prediction based on our proposed DTW-GO distance measurement of gene pairs. Experimental results show that our approach is more accurate when compared with existing missing value imputation methods on real microarray data sets. Furthermore, our approach can also discover more regulatory gene pairs that are known in the literature than other methods.
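
The Dynamic Time Warping component mentioned in this abstract can be illustrated with the classic dynamic-programming recurrence. This is plain DTW with an absolute-difference local cost, not the combined DTW-GO distance proposed by the authors; the function name and the two expression profiles are made up for the example.

```python
def dtw_distance(a, b):
    """Classic DTW distance between two numeric time series.

    cost[i][j] holds the cheapest cumulative cost of aligning the first
    i points of `a` with the first j points of `b`.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments:
            # stretch a, stretch b, or advance both series together.
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]


# Hypothetical expression profiles: gene_b follows gene_a with a one-step delay.
gene_a = [0.0, 1.0, 2.0, 1.0, 0.0]
gene_b = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]
print(dtw_distance(gene_a, gene_b))
```

Because the alignment may stretch either series, a delayed copy of a profile stays close to the original, which is why DTW suits time series in which one gene lags behind another.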

  16. ADAGE signature analysis: differential expression analysis with data-defined gene sets.

    PubMed

    Tan, Jie; Huyck, Matthew; Hu, Dongbo; Zelaya, René A; Hogan, Deborah A; Greene, Casey S

    2017-11-22

    Gene set enrichment analysis and overrepresentation analyses are commonly used methods to determine the biological processes affected by a differential expression experiment. This approach requires biologically relevant gene sets, which are currently curated manually, limiting their availability and accuracy in many organisms without extensively curated resources. New feature learning approaches can now be paired with existing data collections to directly extract functional gene sets from big data. Here we introduce a method to identify perturbed processes. In contrast with methods that use curated gene sets, this approach uses signatures extracted from public expression data. We first extract expression signatures from public data using ADAGE, a neural network-based feature extraction approach. We next identify signatures that are differentially active under a given treatment. Our results demonstrate that these signatures represent biological processes that are perturbed by the experiment. Because these signatures are directly learned from data without supervision, they can identify uncurated or novel biological processes. We implemented ADAGE signature analysis for the bacterial pathogen Pseudomonas aeruginosa. For the convenience of different user groups, we implemented both an R package (ADAGEpath) and a web server ( http://adage.greenelab.com ) to run these analyses. Both are open-source to allow easy expansion to other organisms or signature generation methods. We applied ADAGE signature analysis to an example dataset in which wild-type and ∆anr mutant cells were grown as biofilms on the Cystic Fibrosis genotype bronchial epithelial cells. We mapped active signatures in the dataset to KEGG pathways and compared with pathways identified using GSEA. The two approaches generally return consistent results; however, ADAGE signature analysis also identified a signature that revealed the molecularly supported link between the MexT regulon and Anr. 
    We designed ADAGE signature analysis to perform gene set analysis using data-defined functional gene signatures. This approach addresses an important gap for biologists studying non-traditional model organisms and those without extensive curated resources available. We built both an R package and web server to provide ADAGE signature analysis to the community.
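
    The two steps this abstract describes (scoring each signature's activity per sample, then asking whether that activity differs between conditions) can be sketched in plain Python. This is only an illustration under stated assumptions and is not the ADAGEpath API: the gene names, weights, expression values, and the choice of a Welch t statistic are all hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def signature_activity(expression, weights):
    """Weighted sum of one sample's expression values for one signature."""
    return sum(expression[gene] * weight for gene, weight in weights.items())


def welch_t(group_a, group_b):
    """Welch t statistic for a difference in mean activity (assumed test)."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


# Hypothetical signature: gene -> weight (a real signature would be learned).
signature = {"geneA": 1.0, "geneB": 0.5, "geneC": -1.0}

# Hypothetical expression values for two conditions, three samples each.
wild_type = [
    {"geneA": 2.0, "geneB": 2.0, "geneC": 1.0},
    {"geneA": 2.2, "geneB": 1.8, "geneC": 0.9},
    {"geneA": 1.8, "geneB": 2.2, "geneC": 1.1},
]
mutant = [
    {"geneA": 1.0, "geneB": 2.0, "geneC": 2.0},
    {"geneA": 1.2, "geneB": 1.8, "geneC": 1.9},
    {"geneA": 0.8, "geneB": 2.2, "geneC": 2.1},
]

wt_activity = [signature_activity(s, signature) for s in wild_type]
mut_activity = [signature_activity(s, signature) for s in mutant]
t_statistic = welch_t(wt_activity, mut_activity)
print(wt_activity, mut_activity, round(t_statistic, 2))
```

    A large absolute t statistic flags the signature as differentially active in this toy setting; a real analysis would also correct for testing many signatures.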

  17. Distinct skeletal muscle fiber characteristics and gene expression in diet-sensitive versus diet-resistant obesity.

    PubMed

    Gerrits, Martin F; Ghosh, Sujoy; Kavaslar, Nihan; Hill, Benjamin; Tour, Anastasia; Seifert, Erin L; Beauchamp, Brittany; Gorman, Shelby; Stuart, Joan; Dent, Robert; McPherson, Ruth; Harper, Mary-Ellen

    2010-08-01

    Inter-individual variability in weight gain and loss under energy surfeit and deficit conditions, respectively, are well recognized but poorly understood phenomena. We documented weight loss variability in an intensively supervised clinical weight loss program and assessed skeletal muscle gene expression and phenotypic characteristics related to variable response to a 900 kcal regimen. Matched pairs of healthy, diet-compliant, obese diet-sensitive (ODS) and diet-resistant (ODR) subjects were defined as those in the highest and lowest quintiles for weight loss rate. Physical activity energy expenditure was minimal and comparable. Following program completion and weight stabilization, skeletal muscle biopsies were obtained. Gene expression analysis of rectus femoris and vastus lateralis indicated upregulation of genes and gene sets involved in oxidative phosphorylation and glucose and fatty acid metabolism in ODS compared with ODR. In vastus lateralis, there was a higher proportion of oxidative (type I) fibers in ODS compared with ODR women and lean controls, fiber hypertrophy in ODS compared with ODR women and lean controls, and lower succinate dehydrogenase in oxidative and oxidative-glycolytic fibers in all obese compared with lean subjects. Intramuscular lipid content was generally higher in obese versus lean, and specifically higher in ODS vs. lean women. Altogether, our findings demonstrate differences in muscle gene expression and fiber composition related to clinical weight loss success.

  18. Distinct skeletal muscle fiber characteristics and gene expression in diet-sensitive versus diet-resistant obesity

    PubMed Central

    Gerrits, Martin F.; Ghosh, Sujoy; Kavaslar, Nihan; Hill, Benjamin; Tour, Anastasia; Seifert, Erin L.; Beauchamp, Brittany; Gorman, Shelby; Stuart, Joan; Dent, Robert; McPherson, Ruth; Harper, Mary-Ellen

    2010-01-01

    Inter-individual variability in weight gain and loss under energy surfeit and deficit conditions, respectively, are well recognized but poorly understood phenomena. We documented weight loss variability in an intensively supervised clinical weight loss program and assessed skeletal muscle gene expression and phenotypic characteristics related to variable response to a 900 kcal regimen. Matched pairs of healthy, diet-compliant, obese diet-sensitive (ODS) and diet-resistant (ODR) subjects were defined as those in the highest and lowest quintiles for weight loss rate. Physical activity energy expenditure was minimal and comparable. Following program completion and weight stabilization, skeletal muscle biopsies were obtained. Gene expression analysis of rectus femoris and vastus lateralis indicated upregulation of genes and gene sets involved in oxidative phosphorylation and glucose and fatty acid metabolism in ODS compared with ODR. In vastus lateralis, there was a higher proportion of oxidative (type I) fibers in ODS compared with ODR women and lean controls, fiber hypertrophy in ODS compared with ODR women and lean controls, and lower succinate dehydrogenase in oxidative and oxidative-glycolytic fibers in all obese compared with lean subjects. Intramuscular lipid content was generally higher in obese versus lean, and specifically higher in ODS vs. lean women. Altogether, our findings demonstrate differences in muscle gene expression and fiber composition related to clinical weight loss success. PMID:20332421

  19. The Maximal C3 Self-Complementary Trinucleotide Circular Code X in Genes of Bacteria, Archaea, Eukaryotes, Plasmids and Viruses

    PubMed Central

    Michel, Christian J.

    2017-01-01

    In 1996, a set X of 20 trinucleotides was identified in genes of both prokaryotes and eukaryotes which has on average the highest occurrence in reading frame compared to its two shifted frames. Furthermore, this set X has an interesting mathematical property as X is a maximal C3 self-complementary trinucleotide circular code. In 2015, by quantifying the inspection approach used in 1996, the circular code X was confirmed in the genes of bacteria and eukaryotes and was also identified in the genes of plasmids and viruses. The method was based on the preferential occurrence of trinucleotides among the three frames at the gene population level. We extend here this definition at the gene level. This new statistical approach considers all the genes, i.e., of large and small lengths, with the same weight for searching the circular code X. As a consequence, the concept of circular code, in particular the reading frame retrieval, is directly associated to each gene. At the gene level, the circular code X is strengthened in the genes of bacteria, eukaryotes, plasmids, and viruses, and is now also identified in the genes of archaea. The genes of mitochondria and chloroplasts contain a subset of the circular code X. Finally, by studying viral genes, the circular code X was found in DNA genomes, RNA genomes, double-stranded genomes, and single-stranded genomes. PMID:28420220
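
    The frame comparison behind this result can be sketched for a single gene: count how many trinucleotides of X occur in the reading frame and in each of the two shifted frames. The set X below is the 20-trinucleotide list as it is usually given in the circular code literature and should be checked against the paper; the test sequence is made up, and the simple count stands in for the paper's statistical approach.

```python
# The 20 trinucleotides of X as commonly listed (assumption: verify against the paper).
X = frozenset(
    "AAC AAT ACC ATC ATT CAG CTC CTG GAA GAC "
    "GAG GAT GCC GGC GGT GTA GTC GTT TAC TTC".split()
)

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(trinucleotide):
    """Reverse complement of a DNA word."""
    return trinucleotide.translate(_COMPLEMENT)[::-1]


def is_self_complementary(code):
    """True if the reverse complement of every word is also in the code."""
    return all(reverse_complement(word) in code for word in code)


def frame_counts(gene, code=X):
    """Occurrences of code words in frame 0 (reading frame) and frames 1 and 2."""
    counts = []
    for frame in range(3):
        words = (gene[i:i + 3] for i in range(frame, len(gene) - 2, 3))
        counts.append(sum(1 for word in words if word in code))
    return counts


toy_gene = "GAAGAGGTC"  # hypothetical sequence, not from the study
print(is_self_complementary(X), frame_counts(toy_gene))
```

    For a gene in which X is "strengthened", the first count would be expected to exceed the other two.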

  20. Intragenome Diversity of Gene Families Encoding Toxin-like Proteins in Venomous Animals.

    PubMed

    Rodríguez de la Vega, Ricardo C; Giraud, Tatiana

    2016-11-01

    The evolution of venoms is the story of how toxins arise and of the processes that generate and maintain their diversity. For animal venoms these processes include recruitment for expression in the venom gland, neofunctionalization, paralogous expansions, and functional divergence. The systematic study of these processes requires the reliable identification of the venom components involved in antagonistic interactions. High-throughput sequencing has the potential of uncovering the entire set of toxins in a given organism, yet the existence of non-venom toxin paralogs and the misleading effects of partial census of the molecular diversity of toxins make it necessary to collect complementary evidence to distinguish true toxins from their non-venom paralogs. Here, we analyzed the whole genomes of two scorpions, one spider and one snake, aiming at the identification of the full repertoires of genes encoding toxin-like proteins. We classified the entire set of protein-coding genes into paralogous groups and monotypic genes, identified genes encoding toxin-like proteins based on known toxin families, and quantified their expression in both venom glands and pooled tissues. Our results confirm that genes encoding toxin-like proteins are part of multigene families, and that these families arise by recruitment events from non-toxin genes followed by limited expansions of the toxin-like protein coding genes. We also show that failing to account for sequence similarity with non-toxin proteins has a considerable misleading effect that can be greatly reduced by comparative transcriptomics. Our study overall contributes to the understanding of the evolutionary dynamics of proteins involved in antagonistic interactions.

  1. Whole Genome Gene Expression Meta-Analysis of Inflammatory Bowel Disease Colon Mucosa Demonstrates Lack of Major Differences between Crohn's Disease and Ulcerative Colitis

    PubMed Central

    Østvik, Ann E.; Drozdov, Ignat; Gustafsson, Bjørn I.; Kidd, Mark; Beisvag, Vidar; Torp, Sverre H.; Waldum, Helge L.; Martinsen, Tom Christian; Damås, Jan Kristian; Espevik, Terje; Sandvik, Arne K.

    2013-01-01

    Background: In inflammatory bowel disease (IBD), genetic susceptibility together with environmental factors disturbs gut homeostasis producing chronic inflammation. The two main IBD subtypes are Ulcerative colitis (UC) and Crohn’s disease (CD). We present the largest microarray gene expression study on IBD to date, encompassing both inflamed and un-inflamed colonic tissue. A meta-analysis including all available, comparable data was used to explore important aspects of IBD inflammation, thereby validating consistent gene expression patterns. Methods: Colon pinch biopsies from IBD patients were analysed using Illumina whole genome gene expression technology. Differential expression (DE) was identified using LIMMA linear model in the R statistical computing environment. Results were enriched for gene ontology (GO) categories. Sets of genes encoding antimicrobial proteins (AMP) and proteins involved in T helper (Th) cell differentiation were used in the interpretation of the results. All available data sets were analysed using the same methods, and results were compared on a global and focused level as t-scores. Results: Gene expression in inflamed mucosa from UC and CD is remarkably similar. The meta-analysis confirmed this. The patterns of AMP and Th cell-related gene expression were also very similar, except for IL23A, which was consistently more highly expressed in UC than in CD. Un-inflamed tissue from patients demonstrated minimal differences from healthy controls. Conclusions: There is no difference in the Th subgroup involvement between UC and CD. Th1/Th17 related expression, with little Th2 differentiation, dominated both diseases. The different IL23A expression between UC and CD suggests an IBD subtype specific role. AMPs, previously little studied, are strongly overexpressed in IBD. The presented meta-analysis provides a sound background for further research on IBD pathobiology. PMID:23468882

  2. Whole genome gene expression meta-analysis of inflammatory bowel disease colon mucosa demonstrates lack of major differences between Crohn's disease and ulcerative colitis.

    PubMed

    Granlund, Atle van Beelen; Flatberg, Arnar; Østvik, Ann E; Drozdov, Ignat; Gustafsson, Bjørn I; Kidd, Mark; Beisvag, Vidar; Torp, Sverre H; Waldum, Helge L; Martinsen, Tom Christian; Damås, Jan Kristian; Espevik, Terje; Sandvik, Arne K

    2013-01-01

    In inflammatory bowel disease (IBD), genetic susceptibility together with environmental factors disturbs gut homeostasis producing chronic inflammation. The two main IBD subtypes are Ulcerative colitis (UC) and Crohn's disease (CD). We present the largest microarray gene expression study on IBD to date, encompassing both inflamed and un-inflamed colonic tissue. A meta-analysis including all available, comparable data was used to explore important aspects of IBD inflammation, thereby validating consistent gene expression patterns. Colon pinch biopsies from IBD patients were analysed using Illumina whole genome gene expression technology. Differential expression (DE) was identified using LIMMA linear model in the R statistical computing environment. Results were enriched for gene ontology (GO) categories. Sets of genes encoding antimicrobial proteins (AMP) and proteins involved in T helper (Th) cell differentiation were used in the interpretation of the results. All available data sets were analysed using the same methods, and results were compared on a global and focused level as t-scores. Gene expression in inflamed mucosa from UC and CD is remarkably similar. The meta-analysis confirmed this. The patterns of AMP and Th cell-related gene expression were also very similar, except for IL23A, which was consistently more highly expressed in UC than in CD. Un-inflamed tissue from patients demonstrated minimal differences from healthy controls. There is no difference in the Th subgroup involvement between UC and CD. Th1/Th17 related expression, with little Th2 differentiation, dominated both diseases. The different IL23A expression between UC and CD suggests an IBD subtype specific role. AMPs, previously little studied, are strongly overexpressed in IBD. The presented meta-analysis provides a sound background for further research on IBD pathobiology.

  3. svdPPCS: an effective singular value decomposition-based method for conserved and divergent co-expression gene module identification.

    PubMed

    Zhang, Wensheng; Edwards, Andrea; Fan, Wei; Zhu, Dongxiao; Zhang, Kun

    2010-06-22

    Comparative analysis of gene expression profiling of multiple biological categories, such as different species of organisms or different kinds of tissue, promises to enhance the fundamental understanding of the universality as well as the specialization of mechanisms and related biological themes. Grouping genes with a similar expression pattern or exhibiting co-expression together is a starting point in understanding and analyzing gene expression data. In recent literature, gene module level analysis is advocated in order to understand biological network design and system behaviors in disease and life processes; however, practical difficulties often lie in the implementation of existing methods. Using the singular value decomposition (SVD) technique, we developed a new computational tool, named svdPPCS (SVD-based Pattern Pairing and Chart Splitting), to identify conserved and divergent co-expression modules of two sets of microarray experiments. In the proposed methods, gene modules are identified by splitting the two-way chart coordinated with a pair of left singular vectors factorized from the gene expression matrices of the two biological categories. Importantly, the cutoffs are determined by a data-driven algorithm using the well-defined statistic, SVD-p. The implementation was illustrated on two time series microarray data sets generated from the samples of accessory gland (ACG) and malpighian tubule (MT) tissues of the line W118 of M. drosophila. Two conserved modules and six divergent modules, each of which has a unique characteristic profile across tissue kinds and aging processes, were identified. The number of genes contained in these modules ranged from five to a few hundred. Three to over a hundred GO terms were over-represented in individual modules with FDR < 0.1. One divergent module suggested the tissue-specific relationship between the expressions of mitochondrion-related genes and the aging process. This finding, together with others, may be of biological significance. The validity of the proposed SVD-based method was further verified by a simulation study, as well as the comparisons with regression analysis and cubic spline regression analysis plus PAM based clustering. svdPPCS is a novel computational tool for the comparative analysis of transcriptional profiling. It especially fits the comparison of time series data of related organisms or different tissues of the same organism under equivalent or similar experimental conditions. The general scheme can be directly extended to the comparisons of multiple data sets. It also can be applied to the integration of data sets from different platforms and of different sources.

  4. Comparison of taxon-specific versus general locus sets for targeted sequence capture in plant phylogenomics.

    PubMed

    Chau, John H; Rahfeldt, Wolfgang A; Olmstead, Richard G

    2018-03-01

    Targeted sequence capture can be used to efficiently gather sequence data for large numbers of loci, such as single-copy nuclear loci. Most published studies in plants have used taxon-specific locus sets developed individually for a clade using multiple genomic and transcriptomic resources. General locus sets can also be developed from loci that have been identified as single-copy and have orthologs in large clades of plants. We identify and compare a taxon-specific locus set and three general locus sets (conserved ortholog set [COSII], shared single-copy nuclear [APVO SSC] genes, and pentatricopeptide repeat [PPR] genes) for targeted sequence capture in Buddleja (Scrophulariaceae) and outgroups. We evaluate their performance in terms of assembly success, sequence variability, and resolution and support of inferred phylogenetic trees. The taxon-specific locus set had the most target loci. Assembly success was high for all locus sets in Buddleja samples. For outgroups, general locus sets had greater assembly success. Taxon-specific and PPR loci had the highest average variability. The taxon-specific data set produced the best-supported tree, but all data sets showed improved resolution over previous non-sequence capture data sets. General locus sets can be a useful source of sequence capture targets, especially if multiple genomic resources are not available for a taxon.

  5. Mitochondrial comparative genomics and phylogenetic signal assessment of mtDNA among arbuscular mycorrhizal fungi.

    PubMed

    Nadimi, Maryam; Daubois, Laurence; Hijri, Mohamed

    2016-05-01

    Mitochondrial (mt) genes, such as cytochrome C oxidase genes (cox), have been widely used for barcoding in many groups of organisms, although this approach has been less powerful in the fungal kingdom due to the rapid evolution of their mt genomes. The use of mt genes in phylogenetic studies of Dikarya has been met with success, while early diverging fungal lineages remain less studied, particularly the arbuscular mycorrhizal fungi (AMF). Advances in next-generation sequencing have substantially increased the number of publicly available mtDNA sequences for the Glomeromycota. As a result, comparison of mtDNA across key AMF taxa can now be applied to assess the phylogenetic signal of individual mt coding genes, as well as concatenated subsets of coding genes. Here we show comparative analyses of publicly available mt genomes of Glomeromycota, augmented with two mtDNA genomes that were newly sequenced for this study (Rhizophagus irregularis DAOM240159 and Glomus aggregatum DAOM240163), resulting in 16 complete mtDNA datasets. R. irregularis isolate DAOM240159 and G. aggregatum isolate DAOM240163 showed mt genomes measuring 72,293bp and 69,505bp with G+C contents of 37.1% and 37.3%, respectively. We assessed the phylogenies inferred from single mt genes and complete sets of coding genes, which are referred to as "supergenes" (16 concatenated coding genes), using Shimodaira-Hasegawa tests, in order to identify genes that best described AMF phylogeny. We found that rnl, nad5, cox1, and nad2 genes, as well as concatenated subsets of these genes, provided phylogenies that were similar to the supergene set. This mitochondrial genomic analysis was also combined with principal coordinate and partitioning analyses, which helped to unravel certain evolutionary relationships in the Rhizophagus genus and for G. aggregatum within the Glomeromycota. We showed evidence to support the position of G. aggregatum within the R. irregularis 'species complex'.

  6. Identification of Disease Critical Genes Using Collective Meta-heuristic Approaches: An Application to Preeclampsia.

    PubMed

    Biswas, Surama; Dutta, Subarna; Acharyya, Sriyankar

    2017-12-01

    Identifying a small subset of disease critical genes in large microarray gene expression data sets is a challenge in computational life sciences. This paper has applied four meta-heuristic algorithms, namely, honey bee mating optimization (HBMO), harmony search (HS), differential evolution (DE) and genetic algorithm (basic version GA) to find disease critical genes of preeclampsia, which affects women during gestation. Two hybrid algorithms, namely, HBMO-kNN and HS-kNN, have been newly proposed here, where kNN (k nearest neighbor classifier) is used for sample classification. Performances of these new approaches have been compared with two other hybrid algorithms, namely, DE-kNN and SGA-kNN. Three datasets of different sizes have been used. In a dataset, the set of genes found common in the output of each algorithm is considered here as disease critical genes. In different datasets, the percentage of classification or classification accuracy of meta-heuristic algorithms varied between 92.46 and 100%. HBMO-kNN has the best performance (99.64-100%) in almost all data sets. DE-kNN secures the second position (99.42-100%). Disease critical genes obtained here match with clinically revealed preeclampsia genes to a large extent.
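
    Two ingredients of this abstract lend themselves to a short sketch: the kNN classification accuracy that a meta-heuristic would use to score a candidate gene subset, and the rule that genes common to every algorithm's output are taken as disease critical. The meta-heuristic searches themselves (HBMO, HS, DE, GA) are omitted. The leave-one-out scoring, the toy data, and the gene identifiers below are assumptions for illustration, not details taken from the paper.

```python
from collections import Counter


def knn_loo_accuracy(samples, labels, genes, k=1):
    """Leave-one-out accuracy of a kNN classifier restricted to `genes`.

    `samples` is a list of expression vectors and `genes` a list of column
    indices; a search algorithm would try to maximise this score.
    """
    correct = 0
    for i, query in enumerate(samples):
        neighbours = []
        for j, other in enumerate(samples):
            if i == j:
                continue  # leave the query sample out
            distance = sum((query[g] - other[g]) ** 2 for g in genes)
            neighbours.append((distance, labels[j]))
        neighbours.sort(key=lambda pair: pair[0])
        votes = Counter(label for _, label in neighbours[:k])
        if votes.most_common(1)[0][0] == labels[i]:
            correct += 1
    return correct / len(samples)


def common_genes(selections):
    """Genes selected by every algorithm (set intersection)."""
    return set.intersection(*(set(s) for s in selections))


# Toy data: column 0 separates the classes, column 1 does not.
samples = [[0.1, 5.0], [0.2, 1.0], [0.9, 5.1], [1.0, 0.9]]
labels = ["control", "control", "case", "case"]

print(knn_loo_accuracy(samples, labels, genes=[0]))
print(knn_loo_accuracy(samples, labels, genes=[1]))
print(common_genes([{"g1", "g2"}, {"g2", "g3"}, {"g2"}]))
```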

  7. Spectral biclustering of microarray data: coclustering genes and conditions.

    PubMed

    Kluger, Yuval; Basri, Ronen; Chang, Joseph T; Gerstein, Mark

    2003-04-01

    Global analyses of RNA expression levels are useful for classifying genes and overall phenotypes. Often these classification problems are linked, and one wants to find "marker genes" that are differentially expressed in particular sets of "conditions." We have developed a method that simultaneously clusters genes and conditions, finding distinctive "checkerboard" patterns in matrices of gene expression data, if they exist. In a cancer context, these checkerboards correspond to genes that are markedly up- or downregulated in patients with particular types of tumors. Our method, spectral biclustering, is based on the observation that checkerboard structures in matrices of expression data can be found in eigenvectors corresponding to characteristic expression patterns across genes or conditions. In addition, these eigenvectors can be readily identified by commonly used linear algebra approaches, in particular the singular value decomposition (SVD), coupled with closely integrated normalization steps. We present a number of variants of the approach, depending on whether the normalization over genes and conditions is done independently or in a coupled fashion. We then apply spectral biclustering to a selection of publicly available cancer expression data sets, and examine the degree to which the approach is able to identify checkerboard structures. Furthermore, we compare the performance of our biclustering methods against a number of reasonable benchmarks (e.g., direct application of SVD or normalized cuts to raw data).
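
    The abstract mentions normalization over genes and conditions done "in a coupled fashion" without naming the variants. One coupled transform commonly associated with spectral biclustering is the log-interaction normalization, in which row, column, and overall means of the log-transformed matrix are removed together; treating it as the variant meant here is an assumption. The sketch below applies it to a made-up checkerboard matrix and omits the subsequent SVD step.

```python
from math import log2


def log_interaction(matrix):
    """Return K with K[i][j] = L[i][j] - rowmean_i - colmean_j + overall mean,
    where L is the element-wise log2 of a strictly positive matrix."""
    logged = [[log2(value) for value in row] for row in matrix]
    n_rows, n_cols = len(logged), len(logged[0])
    row_means = [sum(row) / n_cols for row in logged]
    col_means = [sum(logged[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    return [[logged[i][j] - row_means[i] - col_means[j] + overall
             for j in range(n_cols)]
            for i in range(n_rows)]


# Hypothetical genes x conditions matrix with a 2 x 2 checkerboard of blocks.
expression = [
    [8.0, 8.0, 1.0, 1.0],
    [8.0, 8.0, 1.0, 1.0],
    [1.0, 1.0, 8.0, 8.0],
    [1.0, 1.0, 8.0, 8.0],
]

normalized = log_interaction(expression)
for row in normalized:
    print(row)
```

    After the transform every row and column sums to zero and the sign pattern of the entries exposes the checkerboard blocks, which is the structure the singular vectors are then used to recover.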

  8. Gene discovery in the hamster: a comparative genomics approach for gene annotation by sequencing of hamster testis cDNAs

    PubMed Central

    Oduru, Sreedhar; Campbell, Janee L; Karri, SriTulasi; Hendry, William J; Khan, Shafiq A; Williams, Simon C

    2003-01-01

    Background: Complete genome annotation will likely be achieved through a combination of computer-based analysis of available genome sequences combined with direct experimental characterization of expressed regions of individual genomes. We have utilized a comparative genomics approach involving the sequencing of randomly selected hamster testis cDNAs to begin to identify genes not previously annotated on the human, mouse, rat and Fugu (pufferfish) genomes. Results: 735 distinct sequences were analyzed for their relatedness to known sequences in public databases. Eight of these sequences were derived from previously unidentified genes and expression of these genes in testis was confirmed by Northern blotting. The genomic locations of each sequence were mapped in human, mouse, rat and pufferfish, where applicable, and the structure of their cognate genes was derived using computer-based predictions, genomic comparisons and analysis of uncharacterized cDNA sequences from human and macaque. Conclusion: The use of a comparative genomics approach resulted in the identification of eight cDNAs that correspond to previously uncharacterized genes in the human genome. The proteins encoded by these genes included a new member of the kinesin superfamily, a SET/MYND-domain protein, and six proteins for which no specific function could be predicted. Each gene was expressed primarily in testis, suggesting that they may play roles in the development and/or function of testicular cells. PMID:12783626

  9. Presymptomatic Diagnosis of Celiac Disease in Predisposed Children: The Role of Gene Expression Profile.

    PubMed

    Galatola, Martina; Cielo, Donatella; Panico, Camilla; Stellato, Pio; Malamisura, Basilio; Carbone, Lorenzo; Gianfrani, Carmen; Troncone, Riccardo; Greco, Luigi; Auricchio, Renata

    2017-09-01

    The prevalence of celiac disease (CD) has increased significantly in recent years, and risk prediction and early diagnosis have become imperative, especially in at-risk families. In a previous study, we identified individuals with CD based on the expression profile of a set of candidate genes in peripheral blood monocytes. Here we evaluated the expression of a panel of CD candidate genes in peripheral blood mononuclear cells from at-risk infants long before any symptoms or antibody production. We analyzed the gene expression of a set of 9 candidate genes, associated with CD, in 22 human leukocyte antigen predisposed children from at-risk families for CD, studied from birth to 6 years of age. Nine of them developed CD (patients) and 13 did not (controls). We analyzed gene expression at 3 different time points (age matched in the 2 groups): 4-19 months before diagnosis, at the time of CD diagnosis, and after at least 1 year of a gluten-free diet. At similar age points, controls were also evaluated. Three genes (KIAA, TAGAP [T-cell Activation GTPase Activating Protein], and SH2B3 [SH2B Adaptor Protein 3]) were overexpressed in patients, compared with controls, at least 9 months before CD diagnosis. In a stepwise discriminant analysis, 4 genes (RGS1 [Regulator of G-protein signaling 1], TAGAP, TNFSF14 [Tumor Necrosis Factor (Ligand) Superfamily member 14], and SH2B3) differentiated patients from controls before serum antibody production and clinical symptoms. The multivariate equation correctly classified CD from non-CD children in 95.5% of patients. The expression of a small set of candidate genes in peripheral blood mononuclear cells can predict CD at least 9 months before the appearance of any clinical and serological signs of the disease.

  10. Gene expression profiles in whole blood and associations with metabolic dysregulation in obesity.

    PubMed

    Cox, Amanda J; Zhang, Ping; Evans, Tiffany J; Scott, Rodney J; Cripps, Allan W; West, Nicholas P

    Gene expression data provides one tool to gain further insight into the complex biological interactions linking obesity and metabolic disease. This study examined associations between blood gene expression profiles and metabolic disease in obesity. Whole blood gene expression profiles, performed using the Illumina HT-12v4 Human Expression Beadchip, were compared between (i) individuals with obesity (O) or lean (L) individuals (n=21 each), (ii) individuals with (M) or without (H) Metabolic Syndrome (n=11 each) matched on age and gender. Enrichment of differentially expressed genes (DEG) into biological pathways was assessed using Ingenuity Pathway Analysis. Association between sets of genes from biological pathways considered functionally relevant and Metabolic Syndrome were further assessed using an area under the curve (AUC) and cross-validated classification rate (CR). For OvL, only 50 genes were significantly differentially expressed based on the selected differential expression threshold (1.2-fold, p<0.05). For MvH, 582 genes were significantly differentially expressed (1.2-fold, p<0.05) and pathway analysis revealed enrichment of DEG into a diverse set of pathways including immune/inflammatory control, insulin signalling and mitochondrial function pathways. Gene sets from the mTOR signalling pathways demonstrated the strongest association with Metabolic Syndrome (p=8.1×10^-8; AUC: 0.909, CR: 72.7%). These results support the use of expression profiling in whole blood in the absence of more specific tissue types for investigations of metabolic disease. Using a pathway analysis approach it was possible to identify an enrichment of DEG into biological pathways that could be targeted for in vitro follow-up.
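
    The differential expression threshold (1.2-fold, p<0.05) and the AUC used in this abstract can be illustrated with a short sketch. It assumes linear-scale, positive group means and computes the AUC as the fraction of case/control pairs ranked correctly (ties count one half); the study's actual pipeline is not described in the abstract, and all numbers below are invented.

```python
def is_deg(mean_a, mean_b, p_value, fold=1.2, alpha=0.05):
    """True if the change is at least `fold` in either direction and p < alpha."""
    ratio = mean_a / mean_b
    return p_value < alpha and (ratio >= fold or ratio <= 1.0 / fold)


def auc(scores_positive, scores_negative):
    """Area under the ROC curve from pairwise comparisons of scores."""
    wins = 0.0
    for p in scores_positive:
        for n in scores_negative:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_positive) * len(scores_negative))


# Hypothetical gene-set scores for subjects with and without the condition.
with_condition = [0.9, 0.8, 0.4]
without_condition = [0.5, 0.3, 0.1]

print(is_deg(150.0, 100.0, 0.01))   # 1.5-fold up, significant
print(is_deg(110.0, 100.0, 0.01))   # only 1.1-fold
print(auc(with_condition, without_condition))
```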

  11. Positive-unlabeled learning for disease gene identification

    PubMed Central

    Yang, Peng; Li, Xiao-Li; Mei, Jian-Ping; Kwoh, Chee-Keong; Ng, See-Kiong

    2012-01-01

    Background: Identifying disease genes from human genome is an important but challenging task in biomedical research. Machine learning methods can be applied to discover new disease genes based on the known ones. Existing machine learning methods typically use the known disease genes as the positive training set P and the unknown genes as the negative training set N (non-disease gene set does not exist) to build classifiers to identify new disease genes from the unknown genes. However, such classifiers are actually built from a noisy negative set N, as there can be unknown disease genes in N itself. As a result, the classifiers do not perform as well as they could. Result: Instead of treating the unknown genes as negative examples in N, we treat them as an unlabeled set U. We design a novel positive-unlabeled (PU) learning algorithm PUDI (PU learning for disease gene identification) to build a classifier using P and U. We first partition U into four sets, namely, reliable negative set RN, likely positive set LP, likely negative set LN and weak negative set WN. The weighted support vector machines are then used to build a multi-level classifier based on the four training sets and positive training set P to identify disease genes. Our experimental results demonstrate that our proposed PUDI algorithm outperformed the existing methods significantly. Conclusion: The proposed PUDI algorithm is able to identify disease genes more accurately by treating the unknown data more appropriately as unlabeled set U instead of negative set N. Given that many machine learning problems in biomedical research do involve positive and unlabeled data instead of negative data, it is possible that the machine learning methods for these problems can be further improved by adopting PU learning methods, as we have done here for disease gene identification. Availability and implementation: The executable program and data are available at http://www1.i2r.a-star.edu.sg/~xlli/PUDI/PUDI.html. Contact: xlli@i2r.a-star.edu.sg or yang0293@e.ntu.edu.sg Supplementary information: Supplementary Data are available at Bioinformatics online. PMID:22923290

  12. Validating internal controls for quantitative plant gene expression studies

    PubMed Central

    Brunner, Amy M; Yakovlev, Igor A; Strauss, Steven H

    2004-01-01

    Background: Real-time reverse transcription PCR (RT-PCR) has greatly improved the ease and sensitivity of quantitative gene expression studies. However, accurate measurement of gene expression with this method relies on the choice of a valid reference for data normalization. Studies rarely verify that gene expression levels for reference genes are adequately consistent among the samples used, nor compare alternative genes to assess which are most reliable for the experimental conditions analyzed. Results: Using real-time RT-PCR to study the expression of 10 poplar (genus Populus) housekeeping genes, we demonstrate a simple method for determining the degree of stability of gene expression over a set of experimental conditions. Based on a traditional method for analyzing the stability of varieties in plant breeding, it defines measures of gene expression stability from analysis of variance (ANOVA) and linear regression. We found that the potential internal control genes differed widely in their expression stability over the different tissues, developmental stages and environmental conditions studied. Conclusion: Our results support that quantitative comparisons of candidate reference genes are an important part of real-time RT-PCR studies that seek to precisely evaluate variation in gene expression. The method we demonstrated facilitates statistical and graphical evaluation of gene expression stability. Selection of the best reference gene for a given set of experimental conditions should enable detection of biologically significant changes in gene expression that are too small to be revealed by less precise methods, or when highly variable reference genes are unknowingly used in real-time RT-PCR experiments. PMID:15317655
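
    The regression part of such a stability analysis can be sketched as follows. In the plant-breeding tradition, each entry is regressed on an index formed from the mean of all entries in each condition; applying that here, each candidate reference gene is regressed on the per-condition mean across all candidate genes, and the slope and residual sum of squares are reported. That this matches the paper's exact procedure is an assumption, the gene names and values are hypothetical, and how the two quantities are combined into a ranking is left to the paper.

```python
from statistics import mean


def stability(expression):
    """Regress each gene on the per-condition mean of all genes.

    `expression` maps gene name -> list of values, one per condition.
    Returns gene -> (slope, residual sum of squares).
    """
    genes = list(expression)
    n_conditions = len(expression[genes[0]])
    index = [mean(expression[g][c] for g in genes) for c in range(n_conditions)]
    index_mean = mean(index)
    sxx = sum((x - index_mean) ** 2 for x in index)
    if sxx == 0:
        raise ValueError("the condition index does not vary")
    result = {}
    for gene in genes:
        values = expression[gene]
        value_mean = mean(values)
        slope = sum((x - index_mean) * (v - value_mean)
                    for x, v in zip(index, values)) / sxx
        residual_ss = sum((v - value_mean - slope * (x - index_mean)) ** 2
                          for x, v in zip(index, values))
        result[gene] = (slope, residual_ss)
    return result


# Hypothetical expression values for two candidate genes in three conditions.
candidates = {"refA": [1.0, 2.0, 3.0], "refB": [1.0, 4.0, 7.0]}
stability_table = stability(candidates)
print(stability_table)
```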

  13. A gene expression estimator of intramuscular fat percentage for use in both cattle and sheep

    PubMed Central

    2014-01-01

    Background The expression of genes encoding proteins involved in triacylglyceride and fatty acid synthesis and storage in cattle muscle is correlated with intramuscular fat (IMF)%. Are the same genes also correlated with IMF% in sheep muscle, and can the same set of genes be used to estimate IMF% in both species? Results The correlation between gene expression (microarray) and IMF% in the longissimus muscle (LM) of twenty sheep was calculated. An integrated analysis of this dataset with an equivalent cattle correlation dataset and a cattle differential expression dataset was undertaken. A total of 30 genes were identified to be strongly correlated with IMF% in both cattle and sheep. The overlap of genes was highly significant: 8 of the 13 genes in the TAG gene set and 8 of the 13 genes in the FA gene set were in the top 100 and 500 genes, respectively, most correlated with IMF% in sheep (P-value = 0). Of the 30 genes, CIDEA, THRSP, ACSM1, DGAT2 and FABP4 had the highest average rank in both species. Using the data from two small groups of Brahman cattle (control and hormone growth promotant-treated [known to decrease IMF% in muscle]) and 22 animals in total, the utility of a direct measure and different estimators of IMF% (ultrasound and gene expression) to differentiate between the two groups was examined. Directly measured IMF% and IMF% estimated from ultrasound scanning could not discriminate between the two groups. However, using gene expression to estimate IMF% discriminated between the two groups. Increasing the number of genes used to estimate IMF% from one to five significantly increased the discrimination power, but increasing the number of genes to 15 resulted in little further improvement. Conclusion We have demonstrated the utility of a comparative approach to identify robust estimators of IMF% in the LM in cattle and sheep. We have also demonstrated a number of approaches (potentially applicable to much smaller groups of animals than conventional methods) to using gene expression to rank animals for IMF% within a single farm/treatment, or to estimate differences in IMF% between two farms/treatments. PMID:25028604
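
    One way to see why an overlap such as 8 of 13 gene-set members among the top 100 ranked genes gives a P-value that rounds to zero is a hypergeometric tail test. The sketch below is illustrative only: the abstract does not say which test was used, and the universe size of 10,000 genes and the function name `overlap_p_value` are assumptions introduced here.

```python
from math import comb


def overlap_p_value(universe, set_size, top_n, overlap):
    """Hypergeometric tail P(X >= overlap).

    `top_n` genes are drawn without replacement from `universe` genes,
    of which `set_size` belong to the gene set of interest.
    """
    total = comb(universe, top_n)
    upper = min(set_size, top_n)
    tail = sum(
        comb(set_size, i) * comb(universe - set_size, top_n - i)
        for i in range(overlap, upper + 1)
    )
    # Python divides arbitrarily large integers exactly before rounding.
    return tail / total


# 8 of the 13 TAG genes in the top 100; the 10,000-gene universe is hypothetical.
p = overlap_p_value(universe=10_000, set_size=13, top_n=100, overlap=8)
print(f"P(overlap >= 8) = {p:.3g}")
```

    With an expected overlap of well under one gene, the tail probability is small enough that reporting software would plausibly print it as 0.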

  14. Computation and application of tissue-specific gene set weights.

    PubMed

    Frost, H Robert

    2018-04-06

    Gene set testing, or pathway analysis, has become a critical tool for the analysis of high-dimensional genomic data. Although the function and activity of many genes and higher-level processes is tissue-specific, gene set testing is typically performed in a tissue agnostic fashion, which impacts statistical power and the interpretation and replication of results. To address this challenge, we have developed a bioinformatics approach to compute tissue-specific weights for individual gene sets using information on tissue-specific gene activity from the Human Protein Atlas (HPA). We used this approach to create a public repository of tissue-specific gene set weights for 37 different human tissue types from the HPA and all collections in the Molecular Signatures Database (MSigDB). To demonstrate the validity and utility of these weights, we explored three different applications: the functional characterization of human tissues, multi-tissue analysis for systemic diseases and tissue-specific gene set testing. All data used in the reported analyses is publicly available. An R implementation of the method and tissue-specific weights for MSigDB gene set collections can be downloaded at http://www.dartmouth.edu/~hrfrost/TissueSpecificGeneSets. Contact: rob.frost@dartmouth.edu.

  15. Dynamic sporulation gene co-expression networks for Bacillus subtilis 168 and the food-borne isolate Bacillus amyloliquefaciens: a transcriptomic model

    PubMed Central

    Omony, Jimmy; de Jong, Anne; Krawczyk, Antonina O.; Eijlander, Robyn T.; Kuipers, Oscar P.

    2018-01-01

    Sporulation is a survival strategy, adapted by bacterial cells in response to harsh environmental adversities. The adaptation potential differs between strains and the variations may arise from differences in gene regulation. Gene networks are a valuable way of studying such regulation processes and establishing associations between genes. We reconstructed and compared sporulation gene co-expression networks (GCNs) of the model laboratory strain Bacillus subtilis 168 and the food-borne industrial isolate Bacillus amyloliquefaciens. Transcriptome data obtained from samples of six stages during the sporulation process were used for network inference. Subsequently, a gene set enrichment analysis was performed to compare the reconstructed GCNs of B. subtilis 168 and B. amyloliquefaciens with respect to biological functions, which showed the enriched modules with coherent functional groups associated with sporulation. On the basis of the GCNs and time-evolution of differentially expressed genes, we could identify novel candidate genes strongly associated with sporulation in B. subtilis 168 and B. amyloliquefaciens. The GCNs offer a framework for exploring transcription factors, their targets, and co-expressed genes during sporulation. Furthermore, the methodology described here can conveniently be applied to other species or biological processes. PMID:29424683

  16. Dynamic sporulation gene co-expression networks for Bacillus subtilis 168 and the food-borne isolate Bacillus amyloliquefaciens: a transcriptomic model.

    PubMed

    Omony, Jimmy; de Jong, Anne; Krawczyk, Antonina O; Eijlander, Robyn T; Kuipers, Oscar P

    2018-02-09

    Sporulation is a survival strategy, adapted by bacterial cells in response to harsh environmental adversities. The adaptation potential differs between strains and the variations may arise from differences in gene regulation. Gene networks are a valuable way of studying such regulation processes and establishing associations between genes. We reconstructed and compared sporulation gene co-expression networks (GCNs) of the model laboratory strain Bacillus subtilis 168 and the food-borne industrial isolate Bacillus amyloliquefaciens. Transcriptome data obtained from samples of six stages during the sporulation process were used for network inference. Subsequently, a gene set enrichment analysis was performed to compare the reconstructed GCNs of B. subtilis 168 and B. amyloliquefaciens with respect to biological functions, which showed the enriched modules with coherent functional groups associated with sporulation. On the basis of the GCNs and time-evolution of differentially expressed genes, we could identify novel candidate genes strongly associated with sporulation in B. subtilis 168 and B. amyloliquefaciens. The GCNs offer a framework for exploring transcription factors, their targets, and co-expressed genes during sporulation. Furthermore, the methodology described here can conveniently be applied to other species or biological processes.

  17. Functional clustering of time series gene expression data by Granger causality

    PubMed Central

    2012-01-01

    Background A common approach for time series gene expression data analysis includes the clustering of genes with similar expression patterns throughout time. Clustered gene expression profiles point to the joint contribution of groups of genes to a particular cellular process. However, since genes belong to intricate networks, other features, besides comparable expression patterns, should provide additional information for the identification of functionally similar genes. Results In this study we perform gene clustering through the identification of Granger causality between and within sets of time series gene expression data. Granger causality is based on the idea that the cause of an event cannot come after its consequence. Conclusions This kind of analysis can be used as a complementary approach for functional clustering, wherein genes would be clustered not solely based on their expression similarity but on their topological proximity built according to the intensity of Granger causality among them. PMID:23107425
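
    The idea that a cause must precede its consequence can be made concrete with a pairwise Granger test: series x "Granger-causes" series y if past values of x improve the prediction of y beyond what past values of y already provide. The sketch below is a simplified single-lag, two-series version written for illustration; the abstract describes causality between and within sets of genes, which this does not implement, and the function names and simulated data are assumptions.

```python
import random


def ols_rss(rows, targets):
    """Residual sum of squares of a least-squares fit via the normal equations."""
    k = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(k):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [a - factor * b for a, b in zip(aug[row], aug[col])]
    beta = [aug[i][k] / aug[i][i] for i in range(k)]
    return sum(
        (t - sum(b * v for b, v in zip(beta, r))) ** 2
        for r, t in zip(rows, targets)
    )


def granger_f(cause, effect, lag=1):
    """F statistic comparing 'effect on its own past' with
    'effect on its own past plus the past of cause'."""
    times = range(lag, len(effect))
    targets = [effect[t] for t in times]
    restricted = [
        [1.0] + [effect[t - j] for j in range(1, lag + 1)] for t in times
    ]
    unrestricted = [
        row + [cause[t - j] for j in range(1, lag + 1)]
        for row, t in zip(restricted, times)
    ]
    rss_r = ols_rss(restricted, targets)
    rss_u = ols_rss(unrestricted, targets)
    dof = len(targets) - (2 * lag + 1)
    return ((rss_r - rss_u) / lag) / (rss_u / dof)


# Simulated expression profiles: y follows x with a delay of one time point.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [0.0] + [0.9 * x[t - 1] + 0.1 * rng.gauss(0, 1) for t in range(1, 200)]

f_xy = granger_f(x, y)  # does x help predict y?
f_yx = granger_f(y, x)  # does y help predict x?
print(f"F(x -> y) = {f_xy:.1f}, F(y -> x) = {f_yx:.2f}")
```

    A large F statistic in one direction only suggests a directed edge from x to y; a clustering procedure could then group genes by the strength of such edges rather than by profile similarity alone.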

  18. Comparative genomics of ParaHox clusters of teleost fishes: gene cluster breakup and the retention of gene sets following whole genome duplications

    PubMed Central

    Siegel, Nicol; Hoegg, Simone; Salzburger, Walter; Braasch, Ingo; Meyer, Axel

    2007-01-01

    Background The evolutionary lineage leading to the teleost fish underwent a whole genome duplication termed FSGD or 3R in addition to two prior genome duplications that took place earlier during vertebrate evolution (termed 1R and 2R). Resulting from the FSGD, additional copies of genes are present in fish, compared to tetrapods whose lineage did not experience the 3R genome duplication. Interestingly, we find that ParaHox genes do not differ in number in extant teleost fishes despite their additional genome duplication from the genomic situation in mammals, but they are distributed over twice as many paralogous regions in fish genomes. Results We determined the DNA sequence of the entire ParaHox C1 paralogon in the East African cichlid fish Astatotilapia burtoni, and compared it to orthologous regions in other vertebrate genomes as well as to the paralogous vertebrate ParaHox D paralogons. Evolutionary relationships among genes from these four chromosomal regions were studied with several phylogenetic algorithms. We provide evidence that the genes of the ParaHox C paralogous cluster are duplicated in teleosts, just as it had been shown previously for the D paralogon genes. Overall, however, synteny and cluster integrity seem to be less conserved in ParaHox gene clusters than in Hox gene clusters. Comparative analyses of non-coding sequences uncovered conserved, possibly co-regulatory elements, which are likely to contain promoter motifs of the genes belonging to the ParaHox paralogons. Conclusion There seems to be strong stabilizing selection for gene order as well as gene orientation in the ParaHox C paralogon, since with a few exceptions, only the lengths of the introns and intergenic regions differ between the distantly related species examined.
The high degree of evolutionary conservation of this gene cluster's architecture in particular – but possibly clusters of genes more generally – might be linked to the presence of promoter, enhancer or inhibitor motifs that serve to regulate more than just one gene. Therefore, deletions, inversions or relocations of individual genes could destroy the regulation of the clustered genes in this region. The existence of such a regulation network might explain the evolutionary conservation of gene order and orientation over the course of hundreds of millions of years of vertebrate evolution. Another possible explanation for the highly conserved gene order might be the existence of a regulator not located immediately next to its corresponding gene but further away since a relocation or inversion would possibly interrupt this interaction. Different ParaHox clusters were found to have experienced differential gene loss in teleosts. Yet the complete set of these homeobox genes was maintained, albeit distributed over almost twice the number of chromosomes. Selection due to dosage effects and/or stoichiometric disturbance might act more strongly to maintain a modal number of homeobox genes (and possibly transcription factors more generally) per genome, yet permit the accumulation of other (non regulatory) genes associated with these homeobox gene clusters. PMID:17822543

  19. Wound healing, calcium signaling, and other novel pathways are associated with the formation of butterfly eyespots.

    PubMed

    Özsu, Nesibe; Monteiro, Antónia

    2017-10-16

    One hypothesis surrounding the origin of novel traits is that they originate from the co-option of pre-existing genes or larger gene regulatory networks into novel developmental contexts. Insights into a trait's evolutionary origins can, thus, be gained via identification of the genes underlying trait development, and exploring whether those genes also function in other developmental contexts. Here we investigate the set of genes associated with the development of eyespot color patterns, a trait that originated once within the Nymphalid family of butterflies. Although several genes associated with eyespot development have been identified, the eyespot gene regulatory network remains largely unknown. In this study, next-generation sequencing and transcriptome analyses were used to identify a large set of genes associated with eyespot development of Bicyclus anynana butterflies, at 3-6 h after pupation, prior to the differentiation of the color rings. Eyespot-associated genes were identified by comparing the transcriptomes of homologous micro-dissected wing tissues that either develop or do not develop eyespots in wild-type and a mutant line of butterflies, Spotty, with extra eyespots. Overall, 186 genes were significantly up and down-regulated in wing tissues that develop eyespots compared to wing tissues that do not. Many of the differentially expressed genes have yet to be annotated. New signaling pathways, including the Toll, Fibroblast Growth Factor (FGF), extracellular signal-regulated kinase (ERK) and/or Jun N-terminal kinase (JNK) signaling pathways are associated for the first time with eyespot development. In addition, several genes involved in wound healing and calcium signaling were also found to be associated with eyespots. 
Overall, this study provides the identity of many new genes and signaling pathways associated with eyespots, and suggests that the ancient wound healing gene regulatory network may have been co-opted to cells at the center of the pattern to aid in eyespot origins. New transcription factors that may be providing different identities to distinct wing sectors, and genes with sexually dimorphic expression in the eyespots were also identified.

  20. The effects of inference method, population sampling, and gene sampling on species tree inferences: an empirical study in slender salamanders (Plethodontidae: Batrachoseps).

    PubMed

    Jockusch, Elizabeth L; Martínez-Solano, Iñigo; Timpe, Elizabeth K

    2015-01-01

    Species tree methods are now widely used to infer the relationships among species from multilocus data sets. Many methods have been developed, which differ in whether gene and species trees are estimated simultaneously or sequentially, and in how gene trees are used to infer the species tree. While these methods perform well on simulated data, less is known about what impacts their performance on empirical data. We used a data set including five nuclear genes and one mitochondrial gene for 22 species of Batrachoseps to compare the effects of method of analysis, within-species sampling and gene sampling on species tree inferences. For this data set, the choice of inference method had the largest effect on the species tree topology. Exclusion of individual loci had large effects in *BEAST and STEM, but not in MP-EST. Different loci carried the greatest leverage in these different methods, showing that the causes of their disproportionate effects differ. Even though substantial information was present in the nuclear loci, the mitochondrial gene dominated the *BEAST species tree. This leverage is inherent to the mtDNA locus and results from its high variation and lower assumed ploidy. This mtDNA leverage may be problematic when mtDNA has undergone introgression, as is likely in this data set. By contrast, the leverage of RAG1 in STEM analyses does not reflect properties inherent to the locus, but rather results from a gene tree that is strongly discordant with all others, and is best explained by introgression between distantly related species. Within-species sampling was also important, especially in *BEAST analyses, as shown by differences in tree topology across 100 subsampled data sets. Despite the sensitivity of the species tree methods to multiple factors, five species groups, the relationships among these, and some relationships within them, are generally consistently resolved for Batrachoseps. © The Author(s) 2014. 
Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved. For Permissions, please email: journals.permissions@oup.com.

  1. Ecological transcriptomics of lake-type and riverine sockeye salmon (Oncorhynchus nerka)

    PubMed Central

    2011-01-01

    Background There are a growing number of genomes sequenced with tentative functions assigned to a large proportion of the individual genes. Model organisms in laboratory settings form the basis for the assignment of gene function, and the ecological context of gene function is lacking. This work addresses this shortcoming by investigating expressed genes of sockeye salmon (Oncorhynchus nerka) muscle tissue. We compared morphology and gene expression in natural juvenile sockeye populations related to river and lake habitats. Based on previously documented divergent morphology, feeding strategy, and predation in association with these distinct environments, we expect that burst swimming is favored in riverine population and continuous swimming is favored in lake-type population. In turn we predict that morphology and expressed genes promote burst swimming in riverine sockeye and continuous swimming in lake-type sockeye. Results We found the riverine sockeye population had deep, robust bodies and lake-type had shallow, streamlined bodies. Gene expression patterns were measured using a 16K microarray, discovering 141 genes with significant differential expression. Overall, the identity and function of these genes was consistent with our hypothesis. In addition, Gene Ontology (GO) enrichment analyses with a larger set of differentially expressed genes found the "biosynthesis" category enriched for the riverine population and the "metabolism" category enriched for the lake-type population. Conclusions This study provides a framework for understanding sockeye life history from a transcriptomic perspective and a starting point for more extensive, targeted studies determining the ecological context of genes. PMID:22136247

  2. Ecological transcriptomics of lake-type and riverine sockeye salmon (Oncorhynchus nerka).

    PubMed

    Pavey, Scott A; Sutherland, Ben J G; Leong, Jong; Robb, Adrienne; von Schalburg, Kris; Hamon, Troy R; Koop, Ben F; Nielsen, Jennifer L

    2011-12-02

    There are a growing number of genomes sequenced with tentative functions assigned to a large proportion of the individual genes. Model organisms in laboratory settings form the basis for the assignment of gene function, and the ecological context of gene function is lacking. This work addresses this shortcoming by investigating expressed genes of sockeye salmon (Oncorhynchus nerka) muscle tissue. We compared morphology and gene expression in natural juvenile sockeye populations related to river and lake habitats. Based on previously documented divergent morphology, feeding strategy, and predation in association with these distinct environments, we expect that burst swimming is favored in riverine population and continuous swimming is favored in lake-type population. In turn we predict that morphology and expressed genes promote burst swimming in riverine sockeye and continuous swimming in lake-type sockeye. We found the riverine sockeye population had deep, robust bodies and lake-type had shallow, streamlined bodies. Gene expression patterns were measured using a 16 k microarray, discovering 141 genes with significant differential expression. Overall, the identity and function of these genes was consistent with our hypothesis. In addition, Gene Ontology (GO) enrichment analyses with a larger set of differentially expressed genes found the "biosynthesis" category enriched for the riverine population and the "metabolism" category enriched for the lake-type population. This study provides a framework for understanding sockeye life history from a transcriptomic perspective and a starting point for more extensive, targeted studies determining the ecological context of genes.

  3. Using parentage analysis to examine gene flow and spatial genetic structure.

    PubMed

    Kane, Nolan C; King, Matthew G

    2009-04-01

    Numerous approaches have been developed to examine recent and historical gene flow between populations, but few studies have used empirical data sets to compare different approaches. Some methods are expected to perform better under particular scenarios, such as high or low gene flow, but this, too, has rarely been tested. In this issue of Molecular Ecology, Saenz-Agudelo et al. (2009) apply assignment tests and parentage analysis to microsatellite data from five geographically proximal (2-6 km) and one much more distant (1500 km) panda clownfish populations, showing that parentage analysis performed better in situations of high gene flow, while their assignment tests did better with low gene flow. This unusually complete data set is comprised of multiple exhaustively sampled populations, including nearly all adults and large numbers of juveniles, enabling the authors to ask questions that in many systems would be impossible to answer. Their results emphasize the importance of selecting the right analysis to use, based on the underlying model and how well its assumptions are met by the populations to be analysed.

  4. Transcriptomic Analysis of the Adaptation of Listeria monocytogenes to Lagoon and Soil Matrices Associated with a Piggery Environment: Comparison of Expression Profiles

    PubMed Central

    Vivant, Anne-Laure; Desneux, Jeremy; Pourcher, Anne-Marie; Piveteau, Pascal

    2017-01-01

    Understanding how Listeria monocytogenes, the causative agent of listeriosis, adapts to the environment is crucial. Adaptation to new matrices requires regulation of gene expression. To determine how the pathogen adapts to lagoon effluent and soil, two matrices where L. monocytogenes has been isolated, we compared the transcriptomes of L. monocytogenes CIP 110868 20 min and 24 h after its transfer to effluent and soil extract. Results showed major variations in the transcriptome of L. monocytogenes in the lagoon effluent but only minor modifications in the soil. In both the lagoon effluent and in the soil, genes involved in mobility and chemotaxis and in the transport of carbohydrates were the most frequently represented in the set of genes with higher transcript levels, and genes with phage-related functions were the most represented in the set of genes with lower transcript levels. A modification of the cell envelope was only found in the lagoon environment. Finally, the differential analysis included a large proportion of regulators, regulons, and ncRNAs. PMID:29018416

  5. Transcriptomic Analysis of the Adaptation of Listeria monocytogenes to Lagoon and Soil Matrices Associated with a Piggery Environment: Comparison of Expression Profiles.

    PubMed

    Vivant, Anne-Laure; Desneux, Jeremy; Pourcher, Anne-Marie; Piveteau, Pascal

    2017-01-01

    Understanding how Listeria monocytogenes, the causative agent of listeriosis, adapts to the environment is crucial. Adaptation to new matrices requires regulation of gene expression. To determine how the pathogen adapts to lagoon effluent and soil, two matrices where L. monocytogenes has been isolated, we compared the transcriptomes of L. monocytogenes CIP 110868 20 min and 24 h after its transfer to effluent and soil extract. Results showed major variations in the transcriptome of L. monocytogenes in the lagoon effluent but only minor modifications in the soil. In both the lagoon effluent and in the soil, genes involved in mobility and chemotaxis and in the transport of carbohydrates were the most frequently represented in the set of genes with higher transcript levels, and genes with phage-related functions were the most represented in the set of genes with lower transcript levels. A modification of the cell envelope was only found in the lagoon environment. Finally, the differential analysis included a large proportion of regulators, regulons, and ncRNAs.

  6. Probabilistic modeling of the evolution of gene synteny within reconciled phylogenies

    PubMed Central

    2015-01-01

    Background Most models of genome evolution concern either genetic sequences, gene content or gene order. They sometimes integrate two of the three levels, but rarely the three of them. Probabilistic models of gene order evolution usually have to assume constant gene content or adopt a presence/absence coding of gene neighborhoods which is blind to complex events modifying gene content. Results We propose a probabilistic evolutionary model for gene neighborhoods, allowing genes to be inserted, duplicated or lost. It uses reconciled phylogenies, which integrate sequence and gene content evolution. We are then able to optimize parameters such as phylogeny branch lengths, or probabilistic laws depicting the diversity of susceptibility of syntenic regions to rearrangements. We reconstruct a structure for ancestral genomes by optimizing a likelihood, keeping track of all evolutionary events at the level of gene content and gene synteny. Ancestral syntenies are associated with a probability of presence. We implemented the model with the restriction that at most one gene duplication separates two gene speciations in reconciled gene trees. We reconstruct ancestral syntenies on a set of 12 drosophila genomes, and compare the evolutionary rates along the branches and along the sites. We compare with a parsimony method and find a significant number of results not supported by the posterior probability. The model is implemented in the Bio++ library. It thus benefits from and enriches the classical models and methods for molecular evolution. PMID:26452018

  7. An enhanced deterministic K-Means clustering algorithm for cancer subtype prediction from gene expression data.

    PubMed

    Nidheesh, N; Abdul Nazeer, K A; Ameer, P M

    2017-12-01

    Clustering algorithms with steps involving randomness usually give different results on different executions for the same dataset. This non-deterministic nature of algorithms such as the K-Means clustering algorithm limits their applicability in areas such as cancer subtype prediction using gene expression data. It is hard to sensibly compare the results of such algorithms with those of other algorithms. The non-deterministic nature of K-Means is due to its random selection of data points as initial centroids. We propose an improved, density based version of K-Means, which involves a novel and systematic method for selecting initial centroids. The key idea of the algorithm is to select data points which belong to dense regions and which are adequately separated in feature space as the initial centroids. We compared the proposed algorithm to a set of eleven widely used single clustering algorithms and a prominent ensemble clustering algorithm which is being used for cancer data classification, based on the performances on a set of datasets comprising ten cancer gene expression datasets. The proposed algorithm has shown better overall performance than the others. There is a pressing need in the Biomedical domain for simple, easy-to-use and more accurate Machine Learning tools for cancer subtype prediction. The proposed algorithm is simple, easy-to-use and gives stable results. Moreover, it provides comparatively better predictions of cancer subtypes from gene expression data. Copyright © 2017 Elsevier Ltd. All rights reserved.
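
    The key idea, choosing initial centroids that lie in dense regions and are well separated, can be sketched without any randomness. The code below is not the authors' algorithm: the density estimate (inverse mean distance to the nearest neighbours), the density-times-separation score, and the names `density_seeds` and `kmeans` are assumptions chosen to illustrate the principle on toy data.

```python
import math


def density_seeds(points, k, n_neighbors=3):
    """Pick k initial centroids deterministically: dense and far apart."""
    density = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        mean_nn = sum(dists[:n_neighbors]) / n_neighbors
        density.append(1.0 / (mean_nn + 1e-12))

    # First seed: the densest point (ties resolved by lowest index).
    seeds = [max(range(len(points)), key=lambda i: density[i])]
    while len(seeds) < k:
        # Later seeds: high density AND far from every seed chosen so far.
        def score(i):
            gap = min(math.dist(points[i], points[s]) for s in seeds)
            return density[i] * gap

        seeds.append(max(range(len(points)), key=score))
    return [points[i] for i in seeds]


def kmeans(points, k, iters=20):
    """Standard Lloyd iterations started from the deterministic seeds."""
    centroids = density_seeds(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels


# Toy stand-in for expression profiles: two tight groups in two dimensions.
samples = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, labels = kmeans(samples, k=2)
print(centroids, labels)
```

    Because no step draws random numbers, repeated runs on the same data return identical centroids and labels, which is the property that makes results comparable across algorithms.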

  8. High-throughput gene mapping in Caenorhabditis elegans.

    PubMed

    Swan, Kathryn A; Curtis, Damian E; McKusick, Kathleen B; Voinov, Alexander V; Mapa, Felipa A; Cancilla, Michael R

    2002-07-01

    Positional cloning of mutations in model genetic systems is a powerful method for the identification of targets of medical and agricultural importance. To facilitate the high-throughput mapping of mutations in Caenorhabditis elegans, we have identified a further 9602 putative new single nucleotide polymorphisms (SNPs) between two C. elegans strains, Bristol N2 and the Hawaiian mapping strain CB4856, by sequencing inserts from a CB4856 genomic DNA library and using an informatics pipeline to compare sequences with the canonical N2 genomic sequence. When combined with data from other laboratories, our marker set of 17,189 SNPs provides even coverage of the complete worm genome. To date, we have confirmed >1099 evenly spaced SNPs (one every 91 +/- 56 kb) across the six chromosomes and validated the utility of our SNP marker set and new fluorescence polarization-based genotyping methods for systematic and high-throughput identification of genes in C. elegans by cloning several proprietary genes. We illustrate our approach by recombination mapping and confirmation of the mutation in the cloned gene, dpy-18.

  9. The genetic diversity of Epstein-Barr virus in the setting of transplantation relative to non-transplant settings: A feasibility study.

    PubMed

    Allen, Upton D; Hu, Pingzhao; Pereira, Sergio L; Robinson, Joan L; Paton, Tara A; Beyene, Joseph; Khodai-Booran, Nasser; Dipchand, Anne; Hébert, Diane; Ng, Vicky; Nalpathamkalam, Thomas; Read, Stanley

    2016-02-01

    This study examines EBV strains from transplant patients and patients with IM by sequencing major EBV genes. We also used NGS to detect EBV DNA within total genomic DNA, and to evaluate its genetic variation. Sanger sequencing of major EBV genes was used to compare SNVs from samples taken from transplant patients vs. patients with IM. We sequenced EBV DNA from a healthy EBV-seropositive individual on a HiSeq 2000 instrument. Data were mapped to the EBV reference genomes (AG876 and B95-8). The number of EBNA2 SNVs was higher than for EBNA1 and the other genes sequenced within comparable reference coordinates. For EBNA2, there was a median of 15 SNVs among transplant samples compared with 10 among IM samples (p = 0.036). EBNA1 showed little variation between samples. For NGS, we identified 640 and 892 variants at an unadjusted p value of 5 × 10^-8 for AG876 and B95-8 genomes, respectively. We used complementary sequence strategies to examine EBV genetic diversity and its application to transplantation. The results provide the framework for further characterization of EBV strains and related outcomes after organ transplantation. © 2015 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.

  10. Suppression subtractive hybridization and comparative expression analysis to identify developmentally regulated genes in filamentous fungi.

    PubMed

    Gesing, Stefan; Schindler, Daniel; Nowrousian, Minou

    2013-09-01

    Ascomycetes differentiate four major morphological types of fruiting bodies (apothecia, perithecia, pseudothecia and cleistothecia) that are derived from an ancestral fruiting body. Thus, fruiting body differentiation is most likely controlled by a set of common core genes. One way to identify such genes is to search for genes with evolutionary conserved expression patterns. Using suppression subtractive hybridization (SSH), we selected differentially expressed transcripts in Pyronema confluens (Pezizales) by comparing two cDNA libraries specific for sexual and for vegetative development, respectively. The expression patterns of selected genes from both libraries were verified by quantitative real time PCR. Expression of several corresponding homologous genes was found to be conserved in two members of the Sordariales (Sordaria macrospora and Neurospora crassa), a derived group of ascomycetes that is only distantly related to the Pezizales. Knockout studies with N. crassa orthologues of differentially regulated genes revealed a functional role during fruiting body development for the gene NCU05079, encoding a putative MFS peptide transporter. These data indicate conserved gene expression patterns and a functional role of the corresponding genes during fruiting body development; such genes are candidates of choice for further functional analysis. © 2013 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  11. The extraction of drug-disease correlations based on module distance in incomplete human interactome.

    PubMed

    Yu, Liang; Wang, Bingbo; Ma, Xiaoke; Gao, Lin

    2016-12-23

    Extracting drug-disease correlations is crucial in unveiling disease mechanisms, as well as discovering new indications of available drugs, or drug repositioning. Both the interactome and the knowledge of disease-associated and drug-associated genes remain incomplete. We present a new method to predict the associations between drugs and diseases. Our method is based on a module distance, which was originally proposed to calculate distances between modules in the incomplete human interactome. We first map all the disease genes and drug genes to a combined protein interaction network. Then, based on the module distance, we calculate the distances between drug gene sets and disease gene sets, and take the distances as the relationships of drug-disease pairs. We also filter possible false-positive drug-disease correlations by p-value. Finally, we validate the top-100 drug-disease associations related to six drugs in the predicted results. The overlap between our predicted correlations and those reported in the Comparative Toxicogenomics Database (CTD) and the literature, together with their enriched Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, demonstrates that our approach can not only effectively identify new drug indications but also provide new insight into drug-disease discovery.
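
    The abstract does not give the formula for the module distance. One measure that fits the description is a network separation score, s_AB = <d_AB> - (<d_AA> + <d_BB>) / 2, where each <d> is a mean hop count to the closest gene of the other (or the same) set. The sketch below assumes that definition; it may not be exactly what the authors computed, and the function names and the toy graph are hypothetical.

```python
from collections import deque

def bfs_distances(graph, source):
    """Hop counts from source to every reachable node of an adjacency dict."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def closest_distances(graph, sources, targets, exclude_self=False):
    """For each source gene, the hop count to its nearest target gene.
    Sources that cannot reach any target are skipped."""
    result = []
    for s in sources:
        dist = bfs_distances(graph, s)
        hops = [dist[t] for t in targets
                if t in dist and not (exclude_self and t == s)]
        if hops:
            result.append(min(hops))
    return result

def mean(values):
    return sum(values) / len(values)

def separation(graph, set_a, set_b):
    """Assumed separation score: negative values suggest overlapping modules.
    Assumes each set holds at least two mutually reachable genes."""
    d_aa = mean(closest_distances(graph, set_a, set_a, exclude_self=True))
    d_bb = mean(closest_distances(graph, set_b, set_b, exclude_self=True))
    d_ab = mean(closest_distances(graph, set_a, set_b)
                + closest_distances(graph, set_b, set_a))
    return d_ab - (d_aa + d_bb) / 2

# Toy interaction network: a simple path 1-2-3-4-5-6 (hypothetical data).
toy_graph = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 6], 6: [5]}
drug_genes, disease_genes = {1, 2}, {5, 6}
score = separation(toy_graph, drug_genes, disease_genes)
```

    Under this assumed definition, a drug gene set and a disease gene set that share genes or sit in the same network neighbourhood score at or below zero, while well-separated sets score positive.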

  12. Distinct gene expression profiles determine molecular treatment response in childhood acute lymphoblastic leukemia.

    PubMed

    Cario, Gunnar; Stanulla, Martin; Fine, Bernard M; Teuffel, Oliver; Neuhoff, Nils V; Schrauder, André; Flohr, Thomas; Schäfer, Beat W; Bartram, Claus R; Welte, Karl; Schlegelberger, Brigitte; Schrappe, Martin

    2005-01-15

    Treatment resistance, as indicated by the presence of high levels of minimal residual disease (MRD) after induction therapy and induction consolidation, is associated with a poor prognosis in childhood acute lymphoblastic leukemia (ALL). We hypothesized that treatment resistance is an intrinsic feature of ALL cells reflected in the gene expression pattern and that resistance to chemotherapy can be predicted before treatment. To test these hypotheses, gene expression signatures of ALL samples with high MRD load were compared with those of samples without measurable MRD during treatment. We identified 54 genes that clearly distinguished resistant from sensitive ALL samples. Genes with low expression in resistant samples were predominantly associated with cell-cycle progression and apoptosis, suggesting that impaired cell proliferation and apoptosis are involved in treatment resistance. Prediction analysis using randomly selected samples as a training set and the remaining samples as a test set revealed an accuracy of 84%. We conclude that resistance to chemotherapy seems at least in part to be an intrinsic feature of ALL cells. Because treatment response could be predicted with high accuracy, gene expression profiling could become a clinically relevant tool for treatment stratification in the early course of childhood ALL.

  13. Conjugative plasmids: vessels of the communal gene pool

    PubMed Central

    Norman, Anders; Hansen, Lars H.; Sørensen, Søren J.

    2009-01-01

    Comparative whole-genome analyses have demonstrated that horizontal gene transfer (HGT) provides a significant contribution to prokaryotic genome innovation. The evolution of specific prokaryotes is therefore tightly linked to the environment in which they live and the communal pool of genes available within that environment. Here we use the term supergenome to describe the set of all genes that a prokaryotic ‘individual’ can draw on within a particular environmental setting. Conjugative plasmids can be considered particularly successful entities within the communal pool, which have enabled HGT over large taxonomic distances. These plasmids are collections of discrete regions of genes that function as ‘backbone modules’ to undertake different aspects of overall plasmid maintenance and propagation. Conjugative plasmids often carry suites of ‘accessory elements’ that contribute adaptive traits to the hosts and, potentially, other resident prokaryotes within specific environmental niches. Insight into the evolution of plasmid modules therefore contributes to our knowledge of gene dissemination and evolution within prokaryotic communities. This communal pool provides the prokaryotes with an important mechanistic framework for obtaining adaptability and functional diversity that alleviates the need for large genomes of specialized ‘private genes’. PMID:19571247

  14. Defining the optimal animal model for translational research using gene set enrichment analysis.

    PubMed

    Weidner, Christopher; Steinfath, Matthias; Opitz, Elisa; Oelgeschläger, Michael; Schönfelder, Gilbert

    2016-08-01

    The mouse is the main model organism used to study the functions of human genes because most biological processes in the mouse are highly conserved in humans. Recent reports that compared identical transcriptomic datasets of human inflammatory diseases with datasets from mouse models using traditional gene-to-gene comparison techniques resulted in contradictory conclusions regarding the relevance of animal models for translational research. To reduce susceptibility to biased interpretation, all genes of interest for the biological question under investigation should be considered. Thus, standardized approaches for systematic data analysis are needed. We analyzed the same datasets using gene set enrichment analysis focusing on pathways assigned to inflammatory processes in either humans or mice. The analyses revealed a moderate overlap between all human and mouse datasets, with average positive and negative predictive values of 48 and 57% significant correlations. Subgroups of the septic mouse models (i.e., Staphylococcus aureus injection) correlated very well with most human studies. These findings support the applicability of targeted strategies to identify the optimal animal model and protocol to improve the success of translational research. © 2016 The Authors. Published under the terms of the CC BY 4.0 license.
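
    The positive and negative predictive values quoted above can be read as agreement rates between two sets of per-pathway calls. The following is a minimal sketch, assuming (this is not stated in the abstract) that the human result is treated as the reference and that each pathway is simply called significant or not. The function name and pathway names are hypothetical.

```python
def predictive_values(reference, predicted):
    """Return (PPV, NPV) for pathway calls present in both dicts.

    reference / predicted map pathway name -> True if the pathway was
    called significant in that dataset, False otherwise.
    """
    shared = reference.keys() & predicted.keys()
    tp = sum(1 for p in shared if predicted[p] and reference[p])
    fp = sum(1 for p in shared if predicted[p] and not reference[p])
    tn = sum(1 for p in shared if not predicted[p] and not reference[p])
    fn = sum(1 for p in shared if not predicted[p] and reference[p])
    # PPV: share of positive model calls confirmed by the reference.
    ppv = tp / (tp + fp) if (tp + fp) else None
    # NPV: share of negative model calls confirmed by the reference.
    npv = tn / (tn + fn) if (tn + fn) else None
    return ppv, npv

# Hypothetical calls for five pathways.
human = {"a": True, "b": True, "c": False, "d": False, "e": True}
mouse = {"a": True, "b": False, "c": True, "d": False, "e": True}
ppv, npv = predictive_values(human, mouse)
```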

  15. GeneTopics - interpretation of gene sets via literature-driven topic models

    PubMed Central

    2013-01-01

    Background: Annotation of a set of genes is often accomplished through comparison to a library of labelled gene sets such as biological processes or canonical pathways. However, this approach might fail if the employed libraries are not up to date with the latest research, don't capture relevant biological themes or are curated at a different level of granularity than is required to appropriately analyze the input gene set. At the same time, the vast biomedical literature offers an unstructured repository of the latest research findings that can be tapped to provide thematic sub-groupings for any input gene set. Methods: Our proposed method relies on a gene-specific text corpus and extracts commonalities between documents in an unsupervised manner using a topic model approach. We automatically determine the number of topics summarizing the corpus and calculate a gene relevancy score for each topic allowing us to eliminate non-specific topics. As a result we obtain a set of literature topics in which each topic is associated with a subset of the input genes providing directly interpretable keywords and corresponding documents for literature research. Results: We validate our method based on labelled gene sets from the KEGG metabolic pathway collection and the genetic association database (GAD) and show that the approach is able to detect topics consistent with the labelled annotation. Furthermore, we discuss the results on three different types of experimentally derived gene sets, (1) differentially expressed genes from a cardiac hypertrophy experiment in mice, (2) altered transcript abundance in human pancreatic beta cells, and (3) genes implicated by GWA studies to be associated with metabolite levels in a healthy population. In all three cases, we are able to replicate findings from the original papers in a quick and semi-automated manner. Conclusions: Our approach provides a novel way of automatically generating meaningful annotations for gene sets that are directly tied to relevant articles in the literature. Extending a general topic model method, the approach introduced here establishes a workflow for the interpretation of gene sets generated from diverse experimental scenarios that can complement the classical approach of comparison to reference gene sets. PMID:24564875

  16. Gene selection and cancer type classification of diffuse large-B-cell lymphoma using a bivariate mixture model for two-species data.

    PubMed

    Su, Yuhua; Nielsen, Dahlia; Zhu, Lei; Richards, Kristy; Suter, Steven; Breen, Matthew; Motsinger-Reif, Alison; Osborne, Jason

    2013-01-05

    A bivariate mixture model utilizing information across two species was proposed to solve the fundamental problem of identifying differentially expressed genes in microarray experiments. The model utility was illustrated using a dog and human lymphoma data set prepared by a group of scientists in the College of Veterinary Medicine at North Carolina State University. A small number of genes were identified as being differentially expressed in both species and the human genes in this cluster serve as a good predictor for classifying diffuse large-B-cell lymphoma (DLBCL) patients into two subgroups, the germinal center B-cell-like diffuse large B-cell lymphoma and the activated B-cell-like diffuse large B-cell lymphoma. The number of human genes that were observed to be significantly differentially expressed (21) from the two-species analysis was very small compared to the number of human genes (190) identified with only one-species analysis (human data). The genes may be clinically relevant/important, as this small set achieved low misclassification rates of DLBCL subtypes. Additionally, the two subgroups defined by this cluster of human genes had significantly different survival functions, indicating that the stratification based on gene-expression profiling using the proposed mixture model provided improved insight into the clinical differences between the two cancer subtypes.

  17. Machine Learning–Based Differential Network Analysis: A Study of Stress-Responsive Transcriptomes in Arabidopsis

    PubMed Central

    Ma, Chuang; Xin, Mingming; Feldmann, Kenneth A.; Wang, Xiangfeng

    2014-01-01

    Machine learning (ML) is an intelligent data mining technique that builds a prediction model based on the learning of prior knowledge to recognize patterns in large-scale data sets. We present an ML-based methodology for transcriptome analysis via comparison of gene coexpression networks, implemented as an R package called machine learning–based differential network analysis (mlDNA) and apply this method to reanalyze a set of abiotic stress expression data in Arabidopsis thaliana. The mlDNA first used a ML-based filtering process to remove nonexpressed, constitutively expressed, or non-stress-responsive “noninformative” genes prior to network construction, through learning the patterns of 32 expression characteristics of known stress-related genes. The retained “informative” genes were subsequently analyzed by ML-based network comparison to predict candidate stress-related genes showing expression and network differences between control and stress networks, based on 33 network topological characteristics. Comparative evaluation of the network-centric and gene-centric analytic methods showed that mlDNA substantially outperformed traditional statistical testing–based differential expression analysis at identifying stress-related genes, with markedly improved prediction accuracy. To experimentally validate the mlDNA predictions, we selected 89 candidates out of the 1784 predicted salt stress–related genes with available SALK T-DNA mutagenesis lines for phenotypic screening and identified two previously unreported genes, mutants of which showed salt-sensitive phenotypes. PMID:24520154

  18. A new computational strategy for predicting essential genes.

    PubMed

    Cheng, Jian; Wu, Wenwu; Zhang, Yinwen; Li, Xiangchen; Jiang, Xiaoqian; Wei, Gehong; Tao, Shiheng

    2013-12-21

    Determination of the minimum gene set for cellular life is one of the central goals in biology. Genome-wide essential gene identification has progressed rapidly in certain bacterial species; however, it remains difficult to achieve in most eukaryotic species. Several computational models have recently been developed to integrate gene features and used as alternatives to transfer gene essentiality annotations between organisms. We first collected features that were widely used by previous predictive models and assessed the relationships between gene features and gene essentiality using a stepwise regression model. We found two issues that could significantly reduce model accuracy: (i) the effect of multicollinearity among gene features and (ii) the diverse and even contrasting correlations between gene features and gene essentiality existing within and among different species. To address these issues, we developed a novel model called feature-based weighted Naïve Bayes model (FWM), which is based on Naïve Bayes classifiers, logistic regression, and genetic algorithm. The proposed model assesses features and filters out the effects of multicollinearity and diversity. The performance of FWM was compared with other popular models, such as support vector machine, Naïve Bayes model, and logistic regression model, by applying FWM to reciprocally predict essential genes among and within 21 species. Our results showed that FWM significantly improves the accuracy and robustness of essential gene prediction. FWM can remarkably improve the accuracy of essential gene prediction and may be used as an alternative method for other classification work. This method can contribute substantially to the knowledge of the minimum gene sets required for living organisms and the discovery of new drug targets.

  19. Potential gene flow from transgenic rice (Oryza sativa L.) to different weedy rice (Oryza sativa f. spontanea) accessions based on reproductive compatibility.

    PubMed

    Song, Xiaoling; Liu, Linli; Wang, Zhou; Qiang, Sheng

    2009-08-01

    The possibility of gene flow from transgenic crops to wild relatives may be affected by reproductive capacity between them. The potential gene flow from two transgenic rice lines containing the bar gene to five accessions of weedy rice (WR1-WR5) was determined through examination of reproductive compatibility under controlled pollination. The pollen grain germination of two transgenic rice lines on the stigma of all weedy rice, rice pollen tube growth down the style and entry into the weedy rice ovary were similar to self-pollination in weedy rice. However, delayed double fertilisation and embryo abortion in crosses between WR2 and Y0003 were observed. Seed sets between transgenic rice lines and weedy rice varied from 8 to 76%. Although repeated pollination increased seed set significantly, the rank of the seed set between the weedy rice accessions and rice lines was not changed. The germination rates of F(1) hybrids were similar or greater compared with respective females. All F(1) plants expressed glufosinate resistance in the presence of glufosinate selection pressure. The frequency of gene flow between different weedy rice accessions and transgenic herbicide-resistant rice may differ owing to different reproductive compatibility. This result suggests that, when wild relatives are selected as experimental materials for assessing the gene flow of transgenic rice, it is necessary to address the compatibility between transgenic rice and wild relatives.

  20. Gene Selection and Cancer Classification: A Rough Sets Based Approach

    NASA Astrophysics Data System (ADS)

    Sun, Lijun; Miao, Duoqian; Zhang, Hongyun

    Identification of informative gene subsets responsible for discerning between available samples of gene expression data is an important task in bioinformatics. Reducts, from rough sets theory, corresponding to a minimal set of essential genes for discerning samples, are an efficient tool for gene selection. Due to the computational complexity of the existing reduct algorithms, feature ranking is usually used to narrow down the gene space as the first step, and top-ranked genes are selected. In this paper, we define a novel criterion for scoring genes, based on the expression level difference between classes and the gene's contribution to classification, and present an algorithm for generating all possible reducts from informative genes. The algorithm takes the whole attribute set into account and finds short reducts with a significant reduction in computational complexity. An exploration of this approach on benchmark gene expression data sets demonstrates that it is successful at selecting highly discriminative genes and that the classification accuracy is impressive.
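
    To make the notion of a reduct concrete: in a table of discretized expression values, a reduct is a minimal subset of genes on which samples from different classes never look identical. The sketch below enumerates all reducts by exhaustive search over subsets of increasing size. It is an illustration of the definition only, not the authors' algorithm (exhaustive search is exponential in the number of genes), and the function names and toy table are hypothetical.

```python
from itertools import combinations

def discerns(rows, labels, attrs):
    """True if no two samples with different labels agree on every gene in attrs."""
    seen = {}
    for row, label in zip(rows, labels):
        key = tuple(row[a] for a in attrs)
        # setdefault stores the first label seen for this value pattern.
        if seen.setdefault(key, label) != label:
            return False
    return True

def all_reducts(rows, labels):
    """Enumerate every minimal discerning subset of column indices."""
    n_attrs = len(rows[0])
    reducts = []
    for size in range(1, n_attrs + 1):
        for attrs in combinations(range(n_attrs), size):
            # A superset of a known reduct still discerns but is not minimal.
            if any(set(r) <= set(attrs) for r in reducts):
                continue
            if discerns(rows, labels, attrs):
                reducts.append(attrs)
    return reducts

# Toy table: 4 samples x 3 genes, expression discretized to 0/1.
rows = [(0, 0, 1), (0, 1, 1), (1, 0, 0), (1, 1, 0)]
labels = ["A", "A", "B", "B"]
reducts = all_reducts(rows, labels)  # genes 0 and 2 each separate the classes alone
```

    Because subsets are visited in order of increasing size and supersets of known reducts are skipped, every subset that passes the check is minimal.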

  1. Genes Regulated by Vitamin D in Bone Cells Are Positively Selected in East Asians

    PubMed Central

    Chen, Yuan; Xue, Yali; Luiselli, Donata; Tyler-Smith, Chris; Pagani, Luca; Ayub, Qasim

    2015-01-01

    Vitamin D and folate are activated and degraded by sunlight, respectively, and the physiological processes they control are likely to have been targets of selection as humans expanded from Africa into Eurasia. We investigated signals of positive selection in gene sets involved in the metabolism, regulation and action of these two vitamins in worldwide populations sequenced by Phase I of the 1000 Genomes Project. Comparing allele frequency-spectrum-based summary statistics between these gene sets and matched control genes, we observed a selection signal specific to East Asians for a gene set associated with vitamin D action in bones. The selection signal was mainly driven by three genes CXXC finger protein 1 (CXXC1), low density lipoprotein receptor-related protein 5 (LRP5) and runt-related transcription factor 2 (RUNX2). Examination of population differentiation and haplotypes allowed us to identify several candidate causal regulatory variants in each gene. Four of these candidate variants (one each in CXXC1 and RUNX2 and two in LRP5) had a >70% derived allele frequency in East Asians, but were present at lower (20–60%) frequency in Europeans as well, suggesting that the adaptation might have been part of a common response to climatic and dietary changes as humans expanded out of Africa, with implications for their role in vitamin D-dependent bone mineralization and osteoporosis insurgence. We also observed haplotype sharing between East Asians, Finns and an extinct archaic human (Denisovan) sample at the CXXC1 locus, which is best explained by incomplete lineage sorting. PMID:26719974

  2. Whole-transcriptome RNA-seq, gene set enrichment pathway analysis, and exon coverage analysis of two plastid RNA editing mutants.

    PubMed

    Hackett, Justin B; Lu, Yan

    2017-05-04

    In land plants, plastid and mitochondrial RNAs are subject to post-transcriptional C-to-U RNA editing. T-DNA insertions in the ORGANELLE RNA RECOGNITION MOTIF PROTEIN6 gene resulted in reduced photosystem II (PSII) activity and smaller plant and leaf sizes. Exon coverage analysis of the ORRM6 gene showed that orrm6-1 and orrm6-2 are loss-of-function mutants. Compared to other ORRM proteins, ORRM6 affects a relatively small number of RNA editing sites. Sanger sequencing of reverse transcription-PCR products of plastid transcripts revealed 2 plastid RNA editing sites that are substantially affected in the orrm6 mutants: psbF-C77 and accD-C794. The psbF gene encodes the β subunit of cytochrome b559, an essential component of PSII. The accD gene encodes the β subunit of acetyl-CoA carboxylase, a protein required in plastid fatty acid biosynthesis. Whole-transcriptome RNA-seq demonstrated that editing at psbF-C77 is nearly absent and the editing extent at accD-C794 was significantly reduced. Gene set enrichment pathway analysis showed that expression of multiple gene sets involved in photosynthesis, especially photosynthetic electron transport, is significantly upregulated in both orrm6 mutants. The upregulation could be a mechanism to compensate for the reduced PSII electron transport rate in the orrm6 mutants. These results further demonstrated that Organelle RNA Recognition Motif protein ORRM6 is required in editing of specific RNAs in the Arabidopsis (Arabidopsis thaliana) plastid.

  3. Loci and pathways associated with uterine capacity for pregnancy and fertility in beef cattle

    PubMed Central

    Geary, Thomas W.; Kiser, Jennifer N.; Burns, Gregory W.; Hansen, Peter J.; Spencer, Thomas E.; Neibergs, Holly L.

    2017-01-01

    Infertility and subfertility negatively impact the economics and reproductive performance of cattle. Of note, significant pregnancy loss occurs in cattle during the first month of pregnancy, yet little is known about the genetic loci influencing pregnancy success and loss in cattle. To identify quantitative trait loci (QTL) with large effects associated with early pregnancy loss, Angus crossbred heifers were classified based on day 28 pregnancy outcomes to serial embryo transfer. A genome wide association analysis (GWAA) was conducted comparing 30 high fertility heifers with 100% success in establishing pregnancy to 55 subfertile heifers with 25% or less success. A gene set enrichment analysis SNP (GSEA-SNP) was performed to identify gene sets and leading edge genes influencing pregnancy loss. The GWAA identified 22 QTL (p < 1 × 10^-5), and GSEA-SNP identified 9 gene sets (normalized enrichment score > 3.0) with 253 leading edge genes. Network analysis identified TNF (tumor necrosis factor), estrogen, and TP53 (tumor protein 53) as the top of 671 upstream regulators (p < 0.001), whereas the SOX2 (SRY [sex determining region Y]-box 2) and OCT4 (octamer-binding transcription factor 4) complex was the top master regulator out of 773 master regulators associated with fertility (p < 0.001). Identification of QTL and genes in pathways that improve early pregnancy success provides critical information for genomic selection to increase fertility in cattle. The identified genes and regulators also provide insight into the complex biological mechanisms underlying pregnancy establishment in cattle. PMID:29228019

  4. Transcriptome profiling of visceral adipose tissue in a novel obese rat model, WNIN/Ob & its comparison with other animal models.

    PubMed

    Sakamuri, Siva Sankara Vara Prasad; Putcha, Uday Kumar; Veettil, Giridharan Nappan; Ayyalasomayajula, Vajreswari

    2016-09-01

    Adipose tissue dysfunction in obesity is linked to the development of type 2 diabetes and cardiovascular diseases. We studied the differential gene expression in retroperitoneal adipose tissue of a novel obese rat model, WNIN/Ob, to understand the possible underlying transcriptional changes involved in the development of obesity and associated comorbidities in this model. Four-month-old male WNIN/Ob lean and obese rats were taken, blood was collected and tissues were dissected. Body composition analysis and adipose tissue histology were performed. Global gene expression in retroperitoneal adipose tissue of lean and obese rats was studied by microarray using Affymetrix GeneChips. One thousand and seventeen probe sets were downregulated and 963 probe sets were upregulated (more than two-fold) in adipose tissue of WNIN/Ob obese rats when compared to that of lean rats. Small nucleolar RNA (SnoRNA) made up most of the underexpressed probe sets, whereas immune system-related genes were the most overexpressed in the adipose tissues of obese rats. Genes coding for cytoskeletal proteins were downregulated, whereas genes related to lipid biosynthesis were elevated in the adipose tissue of obese rats. The majority of the altered genes and pathways in adipose tissue of WNIN/Ob obese rats were similar to the observations in other obese animal models and human obesity. Based on these observations, it is proposed that the WNIN/Ob obese rat model may be a good model to study the mechanisms involved in the development of obesity and its comorbidities. Downregulation of SnoRNA appears to be a novel feature in this obese rat model.

  5. Enrichr: a comprehensive gene set enrichment analysis web server 2016 update

    PubMed Central

    Kuleshov, Maxim V.; Jones, Matthew R.; Rouillard, Andrew D.; Fernandez, Nicolas F.; Duan, Qiaonan; Wang, Zichen; Koplev, Simon; Jenkins, Sherry L.; Jagodnik, Kathleen M.; Lachmann, Alexander; McDermott, Michael G.; Monteiro, Caroline D.; Gundersen, Gregory W.; Ma'ayan, Avi

    2016-01-01

    Enrichment analysis is a popular method for analyzing gene sets generated by genome-wide experiments. Here we present a significant update to one of the tools in this domain called Enrichr. Enrichr currently contains a large collection of diverse gene set libraries available for analysis and download. In total, Enrichr currently contains 180,184 annotated gene sets from 102 gene set libraries. New features have been added to Enrichr including the ability to submit fuzzy sets, upload BED files, improved application programming interface and visualization of the results as clustergrams. Overall, Enrichr is a comprehensive resource for curated gene sets and a search engine that accumulates biological knowledge for further biological discoveries. Enrichr is freely available at: http://amp.pharm.mssm.edu/Enrichr. PMID:27141961
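
    As a rough illustration of what an enrichment analysis computes, the sketch below scores the overlap between a query gene list and each set in a library with a one-sided hypergeometric tail probability. This is a generic over-representation test, not Enrichr's own scoring or API. The function names, toy library, and universe size are assumptions made for the example.

```python
from math import comb

def overrepresentation_p(universe_size, set_size, query_size, overlap):
    """P(X >= overlap) when query_size genes are drawn without replacement
    from a universe that contains set_size annotated genes."""
    max_overlap = min(set_size, query_size)
    total = comb(universe_size, query_size)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, query_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total

def enrich(query, library, universe_size):
    """Score every gene set in a {name: set_of_genes} library against the
    query and return (name, overlap_count, p_value) sorted by p-value."""
    query = set(query)
    results = []
    for name, genes in library.items():
        hits = query & genes
        p = overrepresentation_p(universe_size, len(genes), len(query), len(hits))
        results.append((name, len(hits), p))
    return sorted(results, key=lambda r: r[2])

# Hypothetical library with two gene sets in a universe of 10 genes.
toy_library = {"set_A": {"g1", "g2", "g3"}, "set_B": {"g7", "g8", "g9"}}
ranked = enrich(["g1", "g2", "g3"], toy_library, universe_size=10)
```

    In practice the p-values would also be corrected for the number of gene sets tested, which this sketch leaves out.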

  6. Comparing large covariance matrices under weak conditions on the dependence structure and its application to gene clustering.

    PubMed

    Chang, Jinyuan; Zhou, Wen; Zhou, Wen-Xin; Wang, Lan

    2017-03-01

    Comparing large covariance matrices has important applications in modern genomics, where scientists are often interested in understanding whether relationships (e.g., dependencies or co-regulations) among a large number of genes vary between different biological states. We propose a computationally fast procedure for testing the equality of two large covariance matrices when the dimensions of the covariance matrices are much larger than the sample sizes. A distinguishing feature of the new procedure is that it imposes no structural assumptions on the unknown covariance matrices. Hence, the test is robust with respect to various complex dependence structures that frequently arise in genomics. We prove that the proposed procedure is asymptotically valid under weak moment conditions. As an interesting application, we derive a new gene clustering algorithm which shares the same nice property of avoiding restrictive structural assumptions for high-dimensional genomics data. Using an asthma gene expression dataset, we illustrate how the new test helps compare the covariance matrices of the genes across different gene sets/pathways between the disease group and the control group, and how the gene clustering algorithm provides new insights on the way gene clustering patterns differ between the two groups. The proposed methods have been implemented in an R-package HDtest and are available on CRAN. © 2016, The International Biometric Society.

  7. Diversity and evolution of the emerging Pandoraviridae family.

    PubMed

    Legendre, Matthieu; Fabre, Elisabeth; Poirot, Olivier; Jeudy, Sandra; Lartigue, Audrey; Alempic, Jean-Marie; Beucher, Laure; Philippe, Nadège; Bertaux, Lionel; Christo-Foroux, Eugène; Labadie, Karine; Couté, Yohann; Abergel, Chantal; Claverie, Jean-Michel

    2018-06-11

    With DNA genomes reaching 2.5 Mb packed in particles of bacterium-like shape and dimension, the first two Acanthamoeba-infecting pandoraviruses remained up to now the most complex viruses since their discovery in 2013. Our isolation of three new strains from distant locations and environments is now used to perform the first comparative genomics analysis of the emerging worldwide-distributed Pandoraviridae family. Thorough annotation of the genomes combining transcriptomic, proteomic, and bioinformatic analyses reveals many non-coding transcripts and significantly reduces the former set of predicted protein-coding genes. Here we show that the pandoraviruses exhibit an open pan-genome, the enormous size of which is not adequately explained by gene duplications or horizontal transfers. As most of the strain-specific genes have no extant homolog and exhibit statistical features comparable to intergenic regions, we suggest that de novo gene creation could contribute to the evolution of the giant pandoravirus genomes.

  8. SET oncoprotein accumulation regulates transcription through DNA demethylation and histone hypoacetylation.

    PubMed

    Almeida, Luciana O; Neto, Marinaldo P C; Sousa, Lucas O; Tannous, Maryna A; Curti, Carlos; Leopoldino, Andreia M

    2017-04-18

    Epigenetic modifications are essential in the control of normal cellular processes and cancer development. DNA methylation and histone acetylation are major epigenetic modifications involved in gene transcription and abnormal events driving the oncogenic process. SET protein accumulates in many cancer types, including head and neck squamous cell carcinoma (HNSCC); SET is a member of the INHAT complex that inhibits gene transcription associating with histones and preventing their acetylation. We explored how SET protein accumulation impacts on the regulation of gene expression, focusing on DNA methylation and histone acetylation. DNA methylation profile of 24 tumour suppressors evidenced that SET accumulation decreased DNA methylation in association with loss of 5-methylcytidine, formation of 5-hydroxymethylcytosine and increased TET1 levels, indicating an active DNA demethylation mechanism. However, the expression of some suppressor genes was lowered in cells with high SET levels, suggesting that loss of methylation is not the main mechanism modulating gene expression. SET accumulation also downregulated the expression of 32 genes of a panel of 84 transcription factors, and SET directly interacted with chromatin at the promoter of the downregulated genes, decreasing histone acetylation. Gene expression analysis after cell treatment with 5-aza-2'-deoxycytidine (5-AZA) and Trichostatin A (TSA) revealed that histone acetylation reversed transcription repression promoted by SET. These results suggest a new function for SET in the regulation of chromatin dynamics. In addition, TSA diminished both SET protein levels and SET capability to bind to gene promoter, suggesting that administration of epigenetic modifier agents could be efficient to reverse SET phenotype in cancer.

  9. Comparative Genomics and Host Resistance against Infectious Diseases

    PubMed Central

    Qureshi, Salman T.; Skamene, Emil

    1999-01-01

    The large size and complexity of the human genome have limited the identification and functional characterization of components of the innate immune system that play a critical role in front-line defense against invading microorganisms. However, advances in genome analysis (including the development of comprehensive sets of informative genetic markers, improved physical mapping methods, and novel techniques for transcript identification) have reduced the obstacles to discovery of novel host resistance genes. Study of the genomic organization and content of widely divergent vertebrate species has shown a remarkable degree of evolutionary conservation and enables meaningful cross-species comparison and analysis of newly discovered genes. Application of comparative genomics to host resistance will rapidly expand our understanding of human immune defense by facilitating the translation of knowledge acquired through the study of model organisms. We review the rationale and resources for comparative genomic analysis and describe three examples of host resistance genes successfully identified by this approach. PMID:10081670

  10. Defining genes using "blueprint" versus "instruction" metaphors: effects for genetic determinism, response efficacy, and perceived control.

    PubMed

    Parrott, Roxanne; Smith, Rachel A

    2014-01-01

    Evidence supports mixed attributions aligned with personal and/or clinical control and gene expression for health in this era of genomic science and health care. We consider variance in these attributions and possible relationships to individual mind sets associated with essentialist beliefs that genes determine health versus threat beliefs that genes increase susceptibility for disease and severity linked to gene-environment interactions. Further, we contribute to theory and empirical research to evaluate the use of metaphors to define genes. Participants (N = 324) read a message that varied the introduction by providing a definition of genes that used either an "instruction" metaphor or a "blueprint" metaphor. The "instruction" metaphor compared to the "blueprint" metaphor promoted stronger threat perceptions, which aligned with both belief in the response efficacy of genetic research for health and perceived behavioral control linked to genes and health. The "blueprint" metaphor compared to the "instruction" metaphor promoted stronger essentialist beliefs, which aligned with more intense positive regard for the efficacy of genetic research and human health. Implications for health communicators include societal effects aligned with stigma and discrimination that such findings portend.

  11. EG-05 COMBINATION OF GENE COPY GAIN AND EPIGENETIC DEREGULATION ARE ASSOCIATED WITH THE ABERRANT EXPRESSION OF A STEM CELL RELATED HOX-SIGNATURE IN GLIOBLASTOMA

    PubMed Central

    Kurscheid, Sebastian; Bady, Pierre; Sciuscio, Davide; Samarzija, Ivana; Shay, Tal; Vassallo, Irene; Van Criekinge, Wim; Domany, Eytan; Stupp, Roger; Delorenzi, Mauro; Hegi, Monika

    2014-01-01

    We previously reported a stem cell related HOX gene signature associated with resistance to chemo-radiotherapy (TMZ/RT→TMZ) in glioblastoma. However, underlying mechanisms triggering overexpression remain mostly elusive. Interestingly, HOX genes are neither involved in the developing brain, nor expressed in normal brain, suggestive of an acquired gene expression signature during gliomagenesis. HOXA genes are located on chromosome 7, which displays trisomy in most glioblastoma; this strongly impacts gene expression on this chromosome, modulated by local regulatory elements. Furthermore, we observed more pronounced DNA methylation across the HOXA locus as compared to non-tumoral brain (Human methylation 450K BeadChip Illumina; 59 glioblastoma, 5 non-tumoral brain samples). CpG probes annotated for HOX-signature genes, contributing most to the variability, served as input into the analysis of DNA methylation and expression to identify key regulatory regions. The structural similarity of the observed correlation matrices between DNA methylation and gene expression in our cohort and an independent data-set from TCGA (106 glioblastoma) was remarkable (RV-coefficient, 0.84; p-value < 0.0001). We identified a CpG located in the promoter region of the HOXA10 locus exerting the strongest mean negative correlation between methylation and expression of the whole HOX-signature. Applying this analysis, the same CpG emerged in the external set. We then determined the contribution of both gene copy number aberration (CNA) and methylation at the selected probe to explain expression of the HOX-signature using a linear model. Statistically significant results suggested an additive effect between gene dosage and methylation at the key CpG identified. Similarly, such an additive effect was also observed in the external data-set. Taken together, we hypothesize that overexpression of the stem-cell related HOX signature is triggered by gain of trisomy 7 and escape from compensatory DNA methylation at positions controlling the effect of enhanced gene dose on expression.
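
The RV coefficient quoted in the abstract above measures how similar two data tables are in structure when they describe the same rows. The following is a minimal sketch of the standard definition, RV = tr(S_x S_y) / sqrt(tr(S_x S_x) tr(S_y S_y)) with S = M M^T computed from column-centered matrices. The function names and the toy matrices are illustrative assumptions; the abstract does not say how the authors arranged their matrices or which software they used.

```python
import math

def center_columns(m):
    """Subtract each column mean so every column sums to zero."""
    n = len(m)
    means = [sum(row[j] for row in m) / n for j in range(len(m[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in m]

def gram(m):
    """Return the n x n matrix of row inner products, M M^T."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in m] for r1 in m]

def trace_product(a, b):
    """trace(A B) for two square matrices of equal size."""
    n = len(a)
    return sum(a[i][j] * b[j][i] for i in range(n) for j in range(n))

def rv_coefficient(x, y):
    """RV coefficient between two tables that share the same rows."""
    sx = gram(center_columns(x))
    sy = gram(center_columns(y))
    return trace_product(sx, sy) / math.sqrt(
        trace_product(sx, sx) * trace_product(sy, sy)
    )

# Hypothetical toy tables: 4 shared rows, different column counts.
x_demo = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]]
y_demo = [[1.0], [0.0], [1.0], [0.0]]
rv_same = rv_coefficient(x_demo, x_demo)
rv_diff = rv_coefficient(x_demo, y_demo)
```

A table compared with itself gives a value of 1 up to rounding, and any other pair falls between 0 and 1, so a reported 0.84 indicates closely matching structure.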

  12. Characterizing gene sets using discriminative random walks with restart on heterogeneous biological networks.

    PubMed

    Blatti, Charles; Sinha, Saurabh

    2016-07-15

    Analysis of co-expressed gene sets typically involves testing for enrichment of different annotations or 'properties' such as biological processes, pathways, transcription factor binding sites, etc., one property at a time. This common approach ignores any known relationships among the properties or the genes themselves. It is believed that known biological relationships among genes and their many properties may be exploited to more accurately reveal commonalities of a gene set. Previous work has sought to achieve this by building biological networks that combine multiple types of gene-gene or gene-property relationships, and performing network analysis to identify other genes and properties most relevant to a given gene set. Most existing network-based approaches for recognizing genes or annotations relevant to a given gene set collapse information about different properties to simplify (homogenize) the networks. We present a network-based method for ranking genes or properties related to a given gene set. Such related genes or properties are identified from among the nodes of a large, heterogeneous network of biological information. Our method involves a random walk with restarts, performed on an initial network with multiple node and edge types that preserve more of the original, specific property information than current methods that operate on homogeneous networks. In this first stage of our algorithm, we find the properties that are the most relevant to the given gene set and extract a subnetwork of the original network, comprising only these relevant properties. We then re-rank genes by their similarity to the given gene set, based on a second random walk with restarts, performed on the above subnetwork. We demonstrate the effectiveness of this algorithm for ranking genes related to Drosophila embryonic development and aggressive responses in the brains of social animals. DRaWR was implemented as an R package available at veda.cs.illinois.edu/DRaWR. 
    Contact: blatti@illinois.edu. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
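
The basic building block named in the abstract above, a random walk with restart, can be sketched as follows. This is only the generic single-stage walk on a small undirected toy network, iterating p_next = (1 - c) * W * p + c * r until the change is negligible. It is not the DRaWR R package and does not reproduce its two-stage procedure or its handling of multiple edge types. The graph, node names, and restart probability are illustrative assumptions.

```python
def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Return stationary visiting probabilities for a walk restarting at seeds.

    adj maps each node to a list of its neighbours (every neighbour must
    also be a key of adj); seeds is the query gene set.
    """
    nodes = sorted(adj)
    # Restart vector r: uniform over the seed nodes.
    r = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(r)
    for _ in range(max_iter):
        nxt = {n: restart * r[n] for n in nodes}
        for u in nodes:
            nbrs = adj[u]
            if not nbrs:
                # Isolated node: send its walking mass back to the seeds.
                for n in nodes:
                    nxt[n] += (1 - restart) * p[u] * r[n]
                continue
            share = (1 - restart) * p[u] / len(nbrs)
            for v in nbrs:
                nxt[v] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p

# Hypothetical toy network: genes g1..g3 linked to property nodes P1, P2.
toy_adj = {
    "g1": ["P1"],
    "g2": ["P1", "g3"],
    "g3": ["g2", "P2"],
    "P1": ["g1", "g2"],
    "P2": ["g3"],
}
scores = random_walk_with_restart(toy_adj, seeds={"g1"})
ranking = sorted(scores, key=scores.get, reverse=True)
```

Nodes near the query set accumulate more probability than distant ones, which is what makes the scores usable as a relevance ranking.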

  13. Draft genomes of two blister beetles Hycleus cichorii and Hycleus phaleratus

    PubMed Central

    Wu, Yuan-Ming; Li, Jiang

    2018-01-01

    Background: Commonly known as blister beetles or Spanish fly, there are more than 1500 species in the Meloidae family (Hexapoda: Coleoptera: Tenebrionoidea) that produce the potent defensive blistering agent cantharidin. Cantharidin and its derivatives have been used to treat cancers such as liver, stomach, lung, and esophageal cancers. Hycleus cichorii and Hycleus phaleratus are the most commercially important blister beetles in China due to their ability to biosynthesize this potent vesicant. However, there is a lack of genome reference, which has hindered development of studies on the biosynthesis of cantharidin and a better understanding of its biology and pharmacology. Results: We report 2 draft genomes and quantified gene sets for the blister beetles H. cichorii and H. phaleratus, 2 complex genomes with >72% repeats and approximately 1% heterozygosity, using Illumina sequencing data. An integrated assembly pipeline was performed for assembly, and most of the coding regions were obtained. Benchmarking universal single-copy orthologs (BUSCO) assessment showed that our assembly obtained more than 98% of the Endopterygota universal single-copy orthologs. Comparison analysis showed that the completeness of coding genes in our assembly was comparable to other beetle genomes such as Dendroctonus ponderosae and Agrilus planipennis. Gene annotation yielded 13 813 and 13 725 protein-coding genes in H. cichorii and H. phaleratus, of which approximately 89% were functionally annotated. BUSCO assessment showed that approximately 86% and 84% of the Endopterygota universal single-copy orthologs were annotated completely in these 2 gene sets, whose completeness is comparable to that of D. ponderosae and A. planipennis. Conclusions: Assembly of both blister beetle genomes provides a valuable resource for future biosynthesis of cantharidin and comparative genomic studies of blister beetles and other beetles. PMID:29444297

  14. Draft genomes of two blister beetles Hycleus cichorii and Hycleus phaleratus.

    PubMed

    Wu, Yuan-Ming; Li, Jiang; Chen, Xiang-Sheng

    2018-03-01

    Commonly known as blister beetles or Spanish fly, there are more than 1500 species in the Meloidae family (Hexapoda: Coleoptera: Tenebrionoidea) that produce the potent defensive blistering agent cantharidin. Cantharidin and its derivatives have been used to treat cancers such as liver, stomach, lung, and esophageal cancers. Hycleus cichorii and Hycleus phaleratus are the most commercially important blister beetles in China due to their ability to biosynthesize this potent vesicant. However, there is a lack of genome reference, which has hindered development of studies on the biosynthesis of cantharidin and a better understanding of its biology and pharmacology. We report 2 draft genomes and quantified gene sets for the blister beetles H. cichorii and H. phaleratus, 2 complex genomes with >72% repeats and approximately 1% heterozygosity, using Illumina sequencing data. An integrated assembly pipeline was performed for assembly, and most of the coding regions were obtained. Benchmarking universal single-copy orthologs (BUSCO) assessment showed that our assembly obtained more than 98% of the Endopterygota universal single-copy orthologs. Comparison analysis showed that the completeness of coding genes in our assembly was comparable to other beetle genomes such as Dendroctonus ponderosae and Agrilus planipennis. Gene annotation yielded 13 813 and 13 725 protein-coding genes in H. cichorii and H. phaleratus, of which approximately 89% were functionally annotated. BUSCO assessment showed that approximately 86% and 84% of the Endopterygota universal single-copy orthologs were annotated completely in these 2 gene sets, whose completeness is comparable to that of D. ponderosae and A. planipennis. Assembly of both blister beetle genomes provides a valuable resource for future biosynthesis of cantharidin and comparative genomic studies of blister beetles and other beetles.

  15. Comparison of gene expression microarray data with count-based RNA measurements informs microarray interpretation.

    PubMed

    Richard, Arianne C; Lyons, Paul A; Peters, James E; Biasci, Daniele; Flint, Shaun M; Lee, James C; McKinney, Eoin F; Siegel, Richard M; Smith, Kenneth G C

    2014-08-04

    Although numerous investigations have compared gene expression microarray platforms, preprocessing methods and batch correction algorithms using constructed spike-in or dilution datasets, there remains a paucity of studies examining the properties of microarray data using diverse biological samples. Most microarray experiments seek to identify subtle differences between samples with variable background noise, a scenario poorly represented by constructed datasets. Thus, microarray users lack important information regarding the complexities introduced in real-world experimental settings. The recent development of a multiplexed, digital technology for nucleic acid measurement enables counting of individual RNA molecules without amplification and, for the first time, permits such a study. Using a set of human leukocyte subset RNA samples, we compared previously acquired microarray expression values with RNA molecule counts determined by the nCounter Analysis System (NanoString Technologies) in selected genes. We found that gene measurements across samples correlated well between the two platforms, particularly for high-variance genes, while genes deemed unexpressed by the nCounter generally had both low expression and low variance on the microarray. Confirming previous findings from spike-in and dilution datasets, this "gold-standard" comparison demonstrated signal compression that varied dramatically by expression level and, to a lesser extent, by dataset. Most importantly, examination of three different cell types revealed that noise levels differed across tissues. Microarray measurements generally correlate with relative RNA molecule counts within optimal ranges but suffer from expression-dependent accuracy bias and precision that varies across datasets. We urge microarray users to consider expression-level effects in signal interpretation and to evaluate noise properties in each dataset independently.

  16. Whole-genome relationships among Francisella bacteria of diverse origins define new species and provide specific regions for detection

    DOE PAGES

    Challacombe, Jean Faust; Petersen, Jeannine M.; Gallegos-Graves, La Verne A.; ...

    2016-11-23

    Francisella tularensis is a highly virulent zoonotic pathogen that causes tularemia and, because of weaponization efforts in past world wars, is considered a tier 1 biothreat agent. Detection and surveillance of F. tularensis may be confounded by the presence of uncharacterized, closely related organisms. Through DNA-based diagnostics and environmental surveys, novel clinical and environmental Francisella isolates have been obtained in recent years. Here we present 7 new Francisella genomes and a comparison of their characteristics to each other and to 24 publicly available genomes as well as a comparative analysis of 16S rRNA and sdhA genes from over 90 Francisella strains. Delineation of new species in bacteria is challenging, especially when isolates having very close genomic characteristics exhibit different physiological features—for example, when some are virulent pathogens in humans and animals while others are nonpathogenic or are opportunistic pathogens. Species resolution within Francisella varies with analyses of single genes, multiple gene or protein sets, or whole-genome comparisons of nucleic acid and amino acid sequences. Analyses focusing on single genes (16S rRNA, sdhA), multiple gene sets (virulence genes, lipopolysaccharide [LPS] biosynthesis genes, pathogenicity island), and whole-genome comparisons (nucleotide and protein) gave congruent results, but with different levels of discrimination confidence. We designate four new species within the genus; Francisella opportunistica sp. nov. (MA06-7296), Francisella salina sp. nov. (TX07-7308), Francisella uliginis sp. nov. (TX07-7310), and Francisella frigiditurris sp. nov. (CA97-1460). Lastly, this study provides a robust comparative framework to discern species and virulence features of newly detected Francisella bacteria.

  17. Whole-genome relationships among Francisella bacteria of diverse origins define new species and provide specific regions for detection

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Challacombe, Jean Faust; Petersen, Jeannine M.; Gallegos-Graves, La Verne A.

    Francisella tularensis is a highly virulent zoonotic pathogen that causes tularemia and, because of weaponization efforts in past world wars, is considered a tier 1 biothreat agent. Detection and surveillance of F. tularensis may be confounded by the presence of uncharacterized, closely related organisms. Through DNA-based diagnostics and environmental surveys, novel clinical and environmental Francisella isolates have been obtained in recent years. Here we present 7 new Francisella genomes and a comparison of their characteristics to each other and to 24 publicly available genomes as well as a comparative analysis of 16S rRNA and sdhA genes from over 90 Francisella strains. Delineation of new species in bacteria is challenging, especially when isolates having very close genomic characteristics exhibit different physiological features—for example, when some are virulent pathogens in humans and animals while others are nonpathogenic or are opportunistic pathogens. Species resolution within Francisella varies with analyses of single genes, multiple gene or protein sets, or whole-genome comparisons of nucleic acid and amino acid sequences. Analyses focusing on single genes (16S rRNA, sdhA), multiple gene sets (virulence genes, lipopolysaccharide [LPS] biosynthesis genes, pathogenicity island), and whole-genome comparisons (nucleotide and protein) gave congruent results, but with different levels of discrimination confidence. We designate four new species within the genus; Francisella opportunistica sp. nov. (MA06-7296), Francisella salina sp. nov. (TX07-7308), Francisella uliginis sp. nov. (TX07-7310), and Francisella frigiditurris sp. nov. (CA97-1460). Lastly, this study provides a robust comparative framework to discern species and virulence features of newly detected Francisella bacteria.

  18. Gene expression profiles of Arabidopsis Cvi seeds during dormancy cycling indicate a common underlying dormancy control mechanism.

    PubMed

    Cadman, Cassandra S C; Toorop, Peter E; Hilhorst, Henk W M; Finch-Savage, William E

    2006-06-01

    Physiologically dormant seeds, like those of Arabidopsis, will cycle through dormant states as seasons change until the environment is favourable for seedling establishment. This phenomenon is widespread in the plant kingdom, but has not been studied at the molecular level. Full-genome microarrays were used for a global transcript analysis of Arabidopsis thaliana (accession Cvi) seeds in a range of dormant and dry after-ripened states during cycling. Principal component analysis of the expression patterns observed showed that they differed in newly imbibed primary dormant seeds, as commonly used in experimental studies, compared with those in the maintained primary and secondary dormant states that exist during cycling. Dormant and after-ripened seeds appear to have equally active although distinct gene expression programmes, dormant seeds having greatly reduced gene expression associated with protein synthesis, potentially controlling the completion of germination. A core set of 442 genes was identified that had higher expression in all dormant states compared with after-ripened states. Abscisic acid (ABA) responsive elements were significantly over-represented in this set of genes, the expression of which was enhanced when multiple copies of the elements were present. ABA regulation of dormancy was further supported by expression patterns of key genes in ABA synthesis/catabolism, and dormancy loss in the presence of fluridone. The data support an ABA-gibberellic acid hormone balance mechanism controlling cycling through dormant states that depends on synthetic and catabolic pathways of both hormones. Many of the most highly expressed genes in dormant states were stress-related even in the absence of abiotic stress, indicating that ABA, stress and dormancy responses overlap significantly at the transcriptome level.

  19. Genome variations associated with viral susceptibility and calcification in Emiliania huxleyi.

    PubMed

    Kegel, Jessica U; John, Uwe; Valentin, Klaus; Frickenhaus, Stephan

    2013-01-01

    Emiliania huxleyi, a key player in the global carbon cycle, is one of the best studied coccolithophores with respect to biogeochemical cycles, climatology, and host-virus interactions. Strains of E. huxleyi show phenotypic plasticity regarding growth behaviour, light-response, calcification, acidification, and virus susceptibility. This phenomenon is likely a consequence of genomic differences, or transcriptomic responses, to environmental conditions or threats such as viral infections. We used an E. huxleyi genome microarray based on the sequenced strain CCMP1516 (reference strain) to perform comparative genomic hybridizations (CGH) of 16 E. huxleyi strains of different geographic origin. We investigated the genomic diversity and plasticity and focused on the identification of genes related to virus susceptibility and coccolith production (calcification). Among the tested 31940 gene models, a core genome of 14628 genes was identified by hybridization among 16 E. huxleyi strains. 224 probes were characterized as specific for the reference strain CCMP1516. Compared to the sequenced E. huxleyi strain CCMP1516, variation in gene content of up to 30 percent among strains was observed. Comparison of core and non-core transcript sets in terms of annotated functions reveals a broad, almost equal functional coverage over all KOG-categories of both transcript sets within the whole annotated genome. Within the variable (non-core) genome we identified genes associated with virus susceptibility and calcification. Genes associated with virus susceptibility include a Bax inhibitor-1 protein, three LRR receptor-like protein kinases, and mitogen-activated protein kinase. Our list of transcripts associated with coccolith production will stimulate further research, e.g. by genetic manipulation. In particular, the V-type proton ATPase 16 kDa proteolipid subunit is proposed to be a plausible target gene for further calcification studies.
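
The split into a core and a variable (non-core) genome described in the abstract above comes down to set operations on per-strain gene presence calls. The sketch below assumes the hybridization signals have already been turned into present/absent calls, which is the hard part and is not shown. The strain labels other than CCMP1516, the gene identifiers, and the function names are hypothetical.

```python
def core_and_variable(presence_by_strain):
    """Split genes into those present in every strain and all the others."""
    gene_sets = list(presence_by_strain.values())
    core = set.intersection(*gene_sets)
    pan = set.union(*gene_sets)
    return core, pan - core

def fraction_missing(reference_genes, strain_genes):
    """Share of reference genes with no presence call in a strain."""
    return len(reference_genes - strain_genes) / len(reference_genes)

# Hypothetical presence calls; a real array would probe thousands of genes.
presence = {
    "CCMP1516": {"g1", "g2", "g3", "g4"},
    "strainA": {"g1", "g2", "g4"},
    "strainB": {"g1", "g2", "g3"},
}
core, variable = core_and_variable(presence)
missing_in_a = fraction_missing(presence["CCMP1516"], presence["strainA"])
```

Because the array is built from the reference strain, genes absent from the reference cannot be detected, so the variable set here only captures losses relative to CCMP1516.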

  20. Genome Variations Associated with Viral Susceptibility and Calcification in Emiliania huxleyi

    PubMed Central

    Kegel, Jessica U.; John, Uwe; Valentin, Klaus; Frickenhaus, Stephan

    2013-01-01

    Emiliania huxleyi, a key player in the global carbon cycle, is one of the best studied coccolithophores with respect to biogeochemical cycles, climatology, and host-virus interactions. Strains of E. huxleyi show phenotypic plasticity regarding growth behaviour, light-response, calcification, acidification, and virus susceptibility. This phenomenon is likely a consequence of genomic differences, or transcriptomic responses, to environmental conditions or threats such as viral infections. We used an E. huxleyi genome microarray based on the sequenced strain CCMP1516 (reference strain) to perform comparative genomic hybridizations (CGH) of 16 E. huxleyi strains of different geographic origin. We investigated the genomic diversity and plasticity and focused on the identification of genes related to virus susceptibility and coccolith production (calcification). Among the tested 31940 gene models, a core genome of 14628 genes was identified by hybridization among 16 E. huxleyi strains. 224 probes were characterized as specific for the reference strain CCMP1516. Compared to the sequenced E. huxleyi strain CCMP1516, variation in gene content of up to 30 percent among strains was observed. Comparison of core and non-core transcript sets in terms of annotated functions reveals a broad, almost equal functional coverage over all KOG-categories of both transcript sets within the whole annotated genome. Within the variable (non-core) genome we identified genes associated with virus susceptibility and calcification. Genes associated with virus susceptibility include a Bax inhibitor-1 protein, three LRR receptor-like protein kinases, and mitogen-activated protein kinase. Our list of transcripts associated with coccolith production will stimulate further research, e.g. by genetic manipulation. In particular, the V-type proton ATPase 16 kDa proteolipid subunit is proposed to be a plausible target gene for further calcification studies. PMID:24260453

  1. Gene Expression Profiling in BRAF-Mutated Melanoma Reveals Patient Subgroups with Poor Outcomes to Vemurafenib That May Be Overcome by Cobimetinib Plus Vemurafenib.

    PubMed

    Wongchenko, Matthew J; McArthur, Grant A; Dréno, Brigitte; Larkin, James; Ascierto, Paolo A; Sosman, Jeffrey; Andries, Luc; Kockx, Mark; Hurst, Stephen D; Caro, Ivor; Rooney, Isabelle; Hegde, Priti S; Molinero, Luciana; Yue, Huibin; Chang, Ilsung; Amler, Lukas; Yan, Yibing; Ribas, Antoni

    2017-09-01

    Purpose: The association of tumor gene expression profiles with progression-free survival (PFS) outcomes in patients with BRAF V600-mutated melanoma treated with vemurafenib or cobimetinib combined with vemurafenib was evaluated. Experimental Design: Gene expression of archival tumor samples from patients in four trials (BRIM-2, BRIM-3, BRIM-7, and coBRIM) was evaluated. Genes significantly associated with PFS (P < 0.05) were identified by univariate Cox proportional hazards modeling, then subjected to unsupervised hierarchical clustering, principal component analysis, and recursive partitioning to develop optimized gene signatures. Results: Forty-six genes were identified as significantly associated with PFS in both BRIM-2 (n = 63) and the vemurafenib arm of BRIM-3 (n = 160). Two distinct signatures were identified: cell cycle and immune. Among vemurafenib-treated patients, the cell-cycle signature was associated with shortened PFS compared with the immune signature in the BRIM-2/BRIM-3 training set [hazard ratio (HR) 1.8; 95% confidence interval (CI), 1.3-2.6, P = 0.0001] and in the coBRIM validation set (n = 101; HR, 1.6; 95% CI, 1.0-2.5; P = 0.08). The adverse impact of the cell-cycle signature on PFS was not observed in patients treated with cobimetinib combined with vemurafenib (n = 99; HR, 1.1; 95% CI, 0.7-1.8; P = 0.66). Conclusions: In vemurafenib-treated patients, the cell-cycle gene signature was associated with shorter PFS. However, in cobimetinib combined with vemurafenib-treated patients, both cell cycle and immune signature subgroups had comparable PFS. Cobimetinib combined with vemurafenib may abrogate the adverse impact of the cell-cycle signature. Clin Cancer Res; 23(17); 5238-45. ©2017 AACR.

  2. Pericentromeric Effects Shape the Patterns of Divergence, Retention, and Expression of Duplicated Genes in the Paleopolyploid Soybean

    PubMed Central

    Du, Jianchang; Tian, Zhixi; Sui, Yi; Zhao, Meixia; Song, Qijian; Cannon, Steven B.; Cregan, Perry; Ma, Jianxin

    2012-01-01

    The evolutionary forces that govern the divergence and retention of duplicated genes in polyploids are poorly understood. In this study, we first investigated the rates of nonsynonymous substitution (Ka) and the rates of synonymous substitution (Ks) for a nearly complete set of genes in the paleopolyploid soybean (Glycine max) by comparing the orthologs between soybean and its progenitor species Glycine soja and then compared the patterns of gene divergence and expression between pericentromeric regions and chromosomal arms in different gene categories. Our results reveal strong associations between duplication status and Ka and gene expression levels and overall low Ks and low levels of gene expression in pericentromeric regions. It is theorized that deleterious mutations can easily accumulate in recombination-suppressed regions, because of Hill-Robertson effects. Intriguingly, the genes in pericentromeric regions—the cold spots for meiotic recombination in soybean—showed significantly lower Ka and higher levels of expression than their homoeologs in chromosomal arms. This asymmetric evolution of two members of individual whole genome duplication (WGD)-derived gene pairs, echoing the biased accumulation of singletons in pericentromeric regions, suggests that distinct genomic features between the two distinct chromatin types are important determinants shaping the patterns of divergence and retention of WGD-derived genes. PMID:22227891

  3. Customized oligonucleotide microarray gene expression-based classification of neuroblastoma patients outperforms current clinical risk stratification.

    PubMed

    Oberthuer, André; Berthold, Frank; Warnat, Patrick; Hero, Barbara; Kahlert, Yvonne; Spitz, Rüdiger; Ernestus, Karen; König, Rainer; Haas, Stefan; Eils, Roland; Schwab, Manfred; Brors, Benedikt; Westermann, Frank; Fischer, Matthias

    2006-11-01

    To develop a gene expression-based classifier for neuroblastoma patients that reliably predicts courses of the disease. Two hundred fifty-one neuroblastoma specimens were analyzed using a customized oligonucleotide microarray comprising 10,163 probes for transcripts with differential expression in clinical subgroups of the disease. Subsequently, the prediction analysis for microarrays (PAM) was applied to a first set of patients with maximally divergent clinical courses (n = 77). The classification accuracy was estimated by a complete 10-times-repeated 10-fold cross validation, and a 144-gene predictor was constructed from this set. This classifier's predictive power was evaluated in an independent second set (n = 174) by comparing results of the gene expression-based classification with those of risk stratification systems of current trials from Germany, Japan, and the United States. The first set of patients was accurately predicted by PAM (cross-validated accuracy, 99%). Within the second set, the PAM classifier significantly separated cohorts with distinct courses (3-year event-free survival [EFS] 0.86 +/- 0.03 [favorable; n = 115] v 0.52 +/- 0.07 [unfavorable; n = 59] and 3-year overall survival 0.99 +/- 0.01 v 0.84 +/- 0.05; both P < .0001) and separated risk groups of current neuroblastoma trials into subgroups with divergent outcome (NB2004: low-risk 3-year EFS 0.86 +/- 0.04 v 0.25 +/- 0.15, P < .0001; intermediate-risk 1.00 v 0.57 +/- 0.19, P = .018; high-risk 0.81 +/- 0.10 v 0.56 +/- 0.08, P = .06). In a multivariate Cox regression model, the PAM predictor classified patients of the second set more accurately than risk stratification of current trials from Germany, Japan, and the United States (P < .001; hazard ratio, 4.756 [95% CI, 2.544 to 8.893]). Integration of gene expression-based class prediction of neuroblastoma patients may improve risk estimation of current neuroblastoma trials.
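
The "10-times-repeated 10-fold cross validation" mentioned in the abstract above can be illustrated by generating the train/test index splits. This is a minimal sketch of the resampling scheme only. It does not implement the PAM classifier, and the function name, the seed, and the use of the first set's size (n = 77) as the example are illustrative assumptions.

```python
import random

def repeated_kfold(n_samples, k=10, repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for rep in range(repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)  # a fresh random partition for every repeat
        folds = [idx[i::k] for i in range(k)]
        for f, test_idx in enumerate(folds):
            train_idx = [i for j, fold in enumerate(folds) if j != f for i in fold]
            yield rep, f, train_idx, test_idx

# Example: 77 samples give 10 x 10 = 100 train/test splits.
splits = list(repeated_kfold(77))
```

Within each repeat every sample is held out exactly once, so accuracy averaged over all splits is less dependent on any single random partition than a single 10-fold run would be.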

  4. GSCALite: A Web Server for Gene Set Cancer Analysis.

    PubMed

    Liu, Chun-Jie; Hu, Fei-Fei; Xia, Mengxuan; Han, Leng; Zhang, Qiong; Guo, An-Yuan

    2018-05-22

    The availability of cancer genomic data makes it possible to analyze genes related to cancer. Cancer is usually the result of a set of genes and the signal of a single gene could be covered by background noise. Here, we present a web server named Gene Set Cancer Analysis (GSCALite) to analyze a set of genes in cancers with the following functional modules. (i) Differential expression in tumor vs normal, and the survival analysis; (ii) Genomic variations and their survival analysis; (iii) Gene expression associated cancer pathway activity; (iv) miRNA regulatory network for genes; (v) Drug sensitivity for genes; (vi) Normal tissue expression and eQTL for genes. GSCALite is a user-friendly web server for dynamic analysis and visualization of gene sets in cancer and of drug sensitivity correlation, which will be of broad utility to cancer researchers. GSCALite is available at http://bioinfo.life.hust.edu.cn/web/GSCALite/. Contact: guoay@hust.edu.cn or zhangqiong@hust.edu.cn. Supplementary data are available at Bioinformatics online.

  5. Impact of training sets on classification of high-throughput bacterial 16S rRNA gene surveys

    PubMed Central

    Werner, Jeffrey J; Koren, Omry; Hugenholtz, Philip; DeSantis, Todd Z; Walters, William A; Caporaso, J Gregory; Angenent, Largus T; Knight, Rob; Ley, Ruth E

    2012-01-01

    Taxonomic classification of the thousands–millions of 16S rRNA gene sequences generated in microbiome studies is often achieved using a naïve Bayesian classifier (for example, the Ribosomal Database Project II (RDP) classifier), due to favorable trade-offs among automation, speed and accuracy. The resulting classification depends on the reference sequences and taxonomic hierarchy used to train the model; although the influence of primer sets and classification algorithms have been explored in detail, the influence of training set has not been characterized. We compared classification results obtained using three different publicly available databases as training sets, applied to five different bacterial 16S rRNA gene pyrosequencing data sets generated (from human body, mouse gut, python gut, soil and anaerobic digester samples). We observed numerous advantages to using the largest, most diverse training set available, that we constructed from the Greengenes (GG) bacterial/archaeal 16S rRNA gene sequence database and the latest GG taxonomy. Phylogenetic clusters of previously unclassified experimental sequences were identified with notable improvements (for example, 50% reduction in reads unclassified at the phylum level in mouse gut, soil and anaerobic digester samples), especially for phylotypes belonging to specific phyla (Tenericutes, Chloroflexi, Synergistetes and Candidate phyla TM6, TM7). Trimming the reference sequences to the primer region resulted in systematic improvements in classification depth, and greatest gains at higher confidence thresholds. Phylotypes unclassified at the genus level represented a greater proportion of the total community variation than classified operational taxonomic units in mouse gut and anaerobic digester samples, underscoring the need for greater diversity in existing reference databases. PMID:21716311
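
Why the training set matters for a naïve Bayesian sequence classifier can be seen in a toy version. The sketch below scores a query by the k-mer words it contains, with a simple additive smoothing term. It is a simplified illustration, not the RDP classifier: the real tool uses 8-base words, its own word priors, and bootstrap confidence estimates, none of which are reproduced here. The sequences, the value k = 4, and the function names are illustrative assumptions.

```python
import math
from collections import defaultdict

def kmers(seq, k):
    """Return the set of distinct length-k words in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def train(training_set, k):
    """Count, per taxon, how many training sequences contain each word."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for taxon, seq in training_set:
        totals[taxon] += 1
        for word in kmers(seq, k):
            counts[taxon][word] += 1
    return counts, totals

def classify(seq, model, k):
    """Assign the taxon with the highest summed log word probability."""
    counts, totals = model
    best, best_score = None, -math.inf
    for taxon, n in totals.items():
        score = sum(
            math.log((counts[taxon].get(word, 0) + 0.5) / (n + 1))
            for word in kmers(seq, k)
        )
        if score > best_score:
            best, best_score = taxon, score
    return best

# Hypothetical toy references; real 16S references are about 1500 bases long.
small_set = [("Firmicutes", "ACGTACGTAC"), ("Bacteroidetes", "TTGGAACCAATT")]
large_set = small_set + [("Tenericutes", "GGGAAACCCG")]
query = "GGGAAACCC"
call_small = classify(query, train(small_set, 4), 4)
call_large = classify(query, train(large_set, 4), 4)
```

With the smaller training set the query is forced into the closest available taxon, while adding a matching reference changes the call. This is the kind of dependence on reference diversity that the abstract describes.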

  6. Sequence Search and Comparative Genomic Analysis of SUMO-Activating Enzymes Using CoGe.

    PubMed

    Carretero-Paulet, Lorenzo; Albert, Victor A

    2016-01-01

    The growing number of genome sequences completed during the last few years has made necessary the development of bioinformatics tools for the easy access and retrieval of sequence data, as well as for downstream comparative genomic analyses. Some of these are implemented as online platforms that integrate genomic data produced by different genome sequencing initiatives with data mining tools as well as various comparative genomic and evolutionary analysis possibilities. Here, we use the online comparative genomics platform CoGe ( http://www.genomevolution.org/coge/ ) (Lyons and Freeling. Plant J 53:661-673, 2008; Tang and Lyons. Front Plant Sci 3:172, 2012) (1) to retrieve the entire complement of orthologous and paralogous genes belonging to the SUMO-Activating Enzymes 1 (SAE1) gene family from a set of species representative of the Brassicaceae plant eudicot family with genomes fully sequenced, and (2) to investigate the history, timing, and molecular mechanisms of the gene duplications driving the evolutionary expansion and functional diversification of the SAE1 family in Brassicaceae.

  7. Comparative analyses of Xanthomonas and Xylella complete genomes.

    PubMed

    Moreira, Leandro M; De Souza, Robson F; Digiampietri, Luciano A; Da Silva, Ana C R; Setubal, João C

    2005-01-01

    Computational analyses of four bacterial genomes of the Xanthomonadaceae family reveal new unique genes that may be involved in adaptation, pathogenicity, and host specificity. The Xanthomonas genus presents 3636 unique genes distributed in 1470 families, while Xylella genus presents 1026 unique genes distributed in 375 families. Among Xanthomonas-specific genes, we highlight a large number of cell wall degrading enzymes, proteases, and iron receptors, a set of energy metabolism genes, second copy of the type II secretion system, type III secretion system, flagella and chemotactic machinery, and the xanthomonadin synthesis gene cluster. Important genes unique to the Xylella genus are an additional copy of a type IV pili gene cluster and the complete machinery of colicin V synthesis and secretion. Intersections of gene sets from both genera reveal a cluster of genes homologous to Salmonella's SPI-7 island in Xanthomonas axonopodis pv citri and Xylella fastidiosa 9a5c, which might be involved in host specificity. Each genome also presents important unique genes, such as an HMS cluster, the kdgT gene, and O-antigen in Xanthomonas axonopodis pv citri; a number of avrBS genes and a distinct O-antigen in Xanthomonas campestris pv campestris, a type I restriction-modification system and a nickase gene in Xylella fastidiosa 9a5c, and a type II restriction-modification system and two genes related to peptidoglycan biosynthesis in Xylella fastidiosa temecula 1. All these differences imply a considerable number of gene gains and losses during the divergence of the four lineages, and are associated with structural genome modifications that may have a direct relation with the mode of transmission, adaptation to specific environments and pathogenicity of each organism.

  8. A transcriptomic study reveals differentially expressed genes and pathways respond to simulated acid rain in Arabidopsis thaliana.

    PubMed

    Liu, Ting-Wu; Niu, Li; Fu, Bin; Chen, Juan; Wu, Fei-Hua; Chen, Juan; Wang, Wen-Hua; Hu, Wen-Jun; He, Jun-Xian; Zheng, Hai-Lei

    2013-01-01

    Acid rain, as a worldwide environmental issue, can cause serious damage to plants. In this study, we provide the first case study on the systematic responses of arabidopsis (Arabidopsis thaliana (L.) Heynh.) to simulated acid rain (SiAR) using a transcriptomic approach. Transcriptomic analysis revealed that the expression of a set of genes related to primary metabolism, including nitrogen, sulfur, amino acid, photosynthesis, and reactive oxygen species metabolism, was altered under SiAR. In addition, transport and signal transduction related pathways, especially calcium-related signaling pathways, were found to play important roles in the response of arabidopsis to SiAR stress. Further, we compared our data set with previously published data sets on the arabidopsis transcriptome subjected to various stresses, including wound, salt, light, heavy metal, karrikin, temperature, osmosis, etc. The results showed that many genes overlapped across several stresses, suggesting that plant response to SiAR is a complex process, which may require the participation of multiple defense-signaling pathways. The results of this study will help us gain further insights into the response mechanisms of plants to acid rain stress.

  9. Transcriptional differences between normal and glioma-derived glial progenitor cells identify a core set of dysregulated genes.

    PubMed

    Auvergne, Romane M; Sim, Fraser J; Wang, Su; Chandler-Militello, Devin; Burch, Jaclyn; Al Fanek, Yazan; Davis, Danielle; Benraiss, Abdellatif; Walter, Kevin; Achanta, Pragathi; Johnson, Mahlon; Quinones-Hinojosa, Alfredo; Natesan, Sridaran; Ford, Heide L; Goldman, Steven A

    2013-06-27

    Glial progenitor cells (GPCs) are a potential source of malignant gliomas. We used A2B5-based sorting to extract tumorigenic GPCs from human gliomas spanning World Health Organization grades II-IV. Messenger RNA profiling identified a cohort of genes that distinguished A2B5+ glioma tumor progenitor cells (TPCs) from A2B5+ GPCs isolated from normal white matter. A core set of genes and pathways was substantially dysregulated in A2B5+ TPCs, which included the transcription factor SIX1 and its principal cofactors, EYA1 and DACH2. Small hairpin RNAi silencing of SIX1 inhibited the expansion of glioma TPCs in vitro and in vivo, suggesting a critical and unrecognized role of the SIX1-EYA1-DACH2 system in glioma genesis or progression. By comparing the expression patterns of glioma TPCs with those of normal GPCs, we have identified a discrete set of pathways by which glial tumorigenesis may be better understood and more specifically targeted. Copyright © 2013 The Authors. Published by Elsevier Inc. All rights reserved.

  10. Application of discrete Fourier inter-coefficient difference for assessing genetic sequence similarity.

    PubMed

    King, Brian R; Aburdene, Maurice; Thompson, Alex; Warres, Zach

    2014-01-01

    Digital signal processing (DSP) techniques for biological sequence analysis continue to grow in popularity due to the inherent digital nature of these sequences. DSP methods have demonstrated early success for detection of coding regions in a gene. Recently, these methods are being used to establish DNA gene similarity. We present the inter-coefficient difference (ICD) transformation, a novel extension of the discrete Fourier transformation, which can be applied to any DNA sequence. The ICD method is a mathematical, alignment-free DNA comparison method that generates a genetic signature for any DNA sequence that is used to generate relative measures of similarity among DNA sequences. We demonstrate our method on a set of insulin genes obtained from an evolutionarily wide range of species, and on a set of avian influenza viral sequences, which represents a set of highly similar sequences. We compare phylogenetic trees generated using our technique against trees generated using traditional alignment techniques for similarity and demonstrate that the ICD method produces a highly accurate tree without requiring an alignment prior to establishing sequence similarity.

  11. Gap Gene Regulatory Dynamics Evolve along a Genotype Network

    PubMed Central

    Crombach, Anton; Wotton, Karl R.; Jiménez-Guri, Eva; Jaeger, Johannes

    2016-01-01

    Developmental gene networks implement the dynamic regulatory mechanisms that pattern and shape the organism. Over evolutionary time, the wiring of these networks changes, yet the patterning outcome is often preserved, a phenomenon known as “system drift.” System drift is illustrated by the gap gene network—involved in segmental patterning—in dipteran insects. In the classic model organism Drosophila melanogaster and the nonmodel scuttle fly Megaselia abdita, early activation and placement of gap gene expression domains show significant quantitative differences, yet the final patterning output of the system is essentially identical in both species. In this detailed modeling analysis of system drift, we use gene circuits which are fit to quantitative gap gene expression data in M. abdita and compare them with an equivalent set of models from D. melanogaster. The results of this comparative analysis show precisely how compensatory regulatory mechanisms achieve equivalent final patterns in both species. We discuss the larger implications of the work in terms of “genotype networks” and the ways in which the structure of regulatory networks can influence patterns of evolutionary change (evolvability). PMID:26796549

  12. Enrichr: a comprehensive gene set enrichment analysis web server 2016 update.

    PubMed

    Kuleshov, Maxim V; Jones, Matthew R; Rouillard, Andrew D; Fernandez, Nicolas F; Duan, Qiaonan; Wang, Zichen; Koplev, Simon; Jenkins, Sherry L; Jagodnik, Kathleen M; Lachmann, Alexander; McDermott, Michael G; Monteiro, Caroline D; Gundersen, Gregory W; Ma'ayan, Avi

    2016-07-08

    Enrichment analysis is a popular method for analyzing gene sets generated by genome-wide experiments. Here we present a significant update to one of the tools in this domain called Enrichr. Enrichr currently contains a large collection of diverse gene set libraries available for analysis and download. In total, Enrichr currently contains 180 184 annotated gene sets from 102 gene set libraries. New features have been added to Enrichr including the ability to submit fuzzy sets, upload BED files, improved application programming interface and visualization of the results as clustergrams. Overall, Enrichr is a comprehensive resource for curated gene sets and a search engine that accumulates biological knowledge for further biological discoveries. Enrichr is freely available at: http://amp.pharm.mssm.edu/Enrichr. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
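
As a rough illustration of what an enrichment analysis of a gene set can compute, the sketch below scores the overlap between a query gene list and one annotated gene set with a hypergeometric tail probability. This is an assumed, generic formulation and is not taken from the abstract; it is not claimed to be Enrichr's own scoring. The function name `overlap_p_value`, the gene labels, and the background size are hypothetical.

```python
from math import comb

def overlap_p_value(query, gene_set, background_size):
    """Probability of seeing at least the observed overlap by chance.

    Assumes the query is a random draw (without replacement) from a
    background of `background_size` genes that contains `gene_set`.
    """
    query, gene_set = set(query), set(gene_set)
    n, K, N = len(query), len(gene_set), background_size
    if N < K or N < n:
        raise ValueError("background must be at least as large as both lists")
    k = len(query & gene_set)  # observed overlap
    # Sum hypergeometric probabilities for overlaps of size k, k+1, ...
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / comb(N, n)

# Hypothetical example: a 3-gene query that matches a 3-gene set exactly,
# drawn from a background of 10 genes.
p = overlap_p_value(["g1", "g2", "g3"], ["g1", "g2", "g3"], background_size=10)
print(p)  # 1/120, since only one of the 120 possible draws matches
```

A small p-value suggests the overlap is larger than expected by chance; in practice such values would also be corrected for testing many gene sets.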

  13. Gaussian mixture clustering and imputation of microarray data.

    PubMed

    Ouyang, Ming; Welsh, William J; Georgopoulos, Panos

    2004-04-12

    In microarray experiments, missing entries arise from blemishes on the chips. In large-scale studies, virtually every chip contains some missing entries and more than 90% of the genes are affected. Many analysis methods require a full set of data. Either those genes with missing entries are excluded, or the missing entries are filled with estimates prior to the analyses. This study compares methods of missing value estimation. Two evaluation metrics of imputation accuracy are employed. First, the root mean squared error measures the difference between the true values and the imputed values. Second, the number of mis-clustered genes measures the difference between clustering with true values and that with imputed values; it examines the bias introduced by imputation to clustering. The Gaussian mixture clustering with model averaging imputation is superior to all other imputation methods, according to both evaluation metrics, on both time-series (correlated) and non-time series (uncorrelated) data sets.
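
The first evaluation metric named in this abstract, root mean squared error between true and imputed values, can be sketched as follows. The data values are hypothetical, and the function name `rmse` is chosen here for illustration; the study's own evaluation code is not shown in the abstract.

```python
import math

def rmse(true_values, imputed_values):
    """Root mean squared error between held-out true values and imputed estimates."""
    if len(true_values) != len(imputed_values) or not true_values:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - i) ** 2 for t, i in zip(true_values, imputed_values)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical held-out expression values and their imputed estimates.
true_vals = [1.0, 2.0, 3.0, 4.0]
imputed = [1.0, 2.0, 3.0, 6.0]
print(rmse(true_vals, imputed))  # 1.0
```

In an evaluation of this kind, entries are typically hidden from a complete data set, imputed, and then compared against the hidden values; a lower RMSE indicates a closer reconstruction.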

  14. Prediction of missing common genes for disease pairs using network based module separation on incomplete human interactome.

    PubMed

    Akram, Pakeeza; Liao, Li

    2017-12-06

    Identification of common genes associated with comorbid diseases can be critical in understanding their pathobiological mechanism. This work presents a novel method to predict missing common genes associated with a disease pair. Searching for missing common genes is formulated as an optimization problem to minimize network-based module separation between two subgraphs produced by mapping genes associated with each disease onto the interactome. Using cross validation on more than 600 disease pairs, our method achieves a significantly higher average receiver operating characteristic (ROC) score of 0.95, compared with a baseline ROC score of 0.60 using randomized data. Prediction of missing common genes aims to complete the gene set associated with comorbid diseases for better understanding of biological intervention. It will also be useful for gene-targeted therapeutics related to comorbid diseases. This method can be further considered for prediction of missing edges to complete the subgraph associated with a disease pair.

  15. Initial leukemic gene expression profiles of patients with poor in vivo prednisone response are similar to those of blasts persisting under prednisone treatment in childhood acute lymphoblastic leukemia.

    PubMed

    Cario, Gunnar; Fetz, Andrea; Bretscher, Christian; Möricke, Anja; Schrauder, Andre; Stanulla, Martin; Schrappe, Martin

    2008-09-01

    Response to initial glucocorticoid (GC) treatment is a strong prognostic factor in childhood acute lymphoblastic leukemia (ALL). Patients with a poor prednisone response (PPR) have a poor event-free survival as compared to those with a good prednisone response (PGR). Causes of prednisone resistance are still not well understood. We hypothesized that GC resistance is an intrinsic feature of ALL cells which is reflected in the gene expression pattern and analyzed genome-wide gene expression using microarrays. A case-control study was performed comparing gene expression profiles from initial ALL samples of 20 patients with PPR and those of 20 patients with PGR. Differential gene expression of a subset of genes was confirmed by real-time quantitative polymerase chain reaction analysis and validation was performed in a second independent patient sample (n=20). We identified 121 genes that clearly distinguished prednisone-resistant from sensitive ALL samples (FDR < 5%, fold change ≥ 1.5). Differential gene expression of 21 of these genes could be validated in a second independent set. Of importance, there was a remarkable concordance of genes identified by comparing expression signatures of PPR and PGR cells at diagnosis and those previously described to be up- or downregulated in leukemic cells persisting under GC treatment. Thus, GC resistance seems at least in part to be an intrinsic feature of leukemic cells. Leukemic cells of patients with PPR are characterized by gene expression patterns that are similar to those of resistant cells persisting under glucocorticoid treatment.

  16. Comparing Patterns of Natural Selection across Species Using Selective Signatures

    PubMed Central

    Shapiro, B. Jesse; Alm, Eric J

    2008-01-01

    Comparing gene expression profiles over many different conditions has led to insights that were not obvious from single experiments. In the same way, comparing patterns of natural selection across a set of ecologically distinct species may extend what can be learned from individual genome-wide surveys. Toward this end, we show how variation in protein evolutionary rates, after correcting for genome-wide effects such as mutation rate and demographic factors, can be used to estimate the level and types of natural selection acting on genes across different species. We identify unusually rapidly and slowly evolving genes, relative to empirically derived genome-wide and gene family-specific background rates for 744 core protein families in 30 γ-proteobacterial species. We describe the pattern of fast or slow evolution across species as the “selective signature” of a gene. Selective signatures represent a profile of selection across species that is predictive of gene function: pairs of genes with correlated selective signatures are more likely to share the same cellular function, and genes in the same pathway can evolve in concert. For example, glycolysis and phenylalanine metabolism genes evolve rapidly in Idiomarina loihiensis, mirroring an ecological shift in carbon source from sugars to amino acids. In a broader context, our results suggest that the genomic landscape is organized into functional modules even at the level of natural selection, and thus it may be easier than expected to understand the complex evolutionary pressures on a cell. PMID:18266472

  17. A Versatile Panel of Reference Gene Assays for the Measurement of Chicken mRNA by Quantitative PCR

    PubMed Central

    Maier, Helena J.; Van Borm, Steven; Young, John R.; Fife, Mark

    2016-01-01

    Quantitative real-time PCR assays are widely used for the quantification of mRNA within avian experimental samples. Multiple stably-expressed reference genes, selected for the lowest variation in representative samples, can be used to control random technical variation. Reference gene assays must be reliable, have high amplification specificity and efficiency, and not produce signals from contaminating DNA. Whilst recent research papers identify specific genes that are stable in particular tissues and experimental treatments, here we describe a panel of ten avian gene primer and probe sets that can be used to identify suitable reference genes in many experimental contexts. The panel was tested with TaqMan and SYBR Green systems in two experimental scenarios: a tissue collection and virus infection of cultured fibroblasts. GeNorm and NormFinder algorithms were able to select appropriate reference gene sets in each case. We show the effects of using the selected genes on the detection of statistically significant differences in expression. The results are compared with those obtained using 28s ribosomal RNA, the present most widely accepted reference gene in chicken work, identifying circumstances where its use might provide misleading results. Methods for eliminating DNA contamination of RNA reduced, but did not completely remove, detectable DNA. We therefore attached special importance to testing each qPCR assay for absence of signal using DNA template. The assays and analyses developed here provide a useful resource for selecting reference genes for investigations of avian biology. PMID:27537060

  18. Phylogenetics and evolution of Su(var)3-9 SET genes in land plants: rapid diversification in structure and function.

    PubMed

    Zhu, Xinyu; Ma, Hong; Chen, Zhiduan

    2011-03-09

    Plants contain numerous Su(var)3-9 homologues (SUVH) and related (SUVR) genes, some of which await functional characterization. Although there have been studies on the evolution of plant Su(var)3-9 SET genes, a systematic evolutionary study including major land plant groups has not been reported. Large-scale phylogenetic and evolutionary analyses can help to elucidate the underlying molecular mechanisms and contribute to improving genome annotation. Putative orthologs of plant Su(var)3-9 SET protein sequences were retrieved from major representatives of land plants. A novel clustering that included most members analyzed, henceforth referred to as the core Su(var)3-9 homologues and related (cSUVHR) gene clade, was identified, as well as all orthologous groups previously identified. Our analysis showed that plant Su(var)3-9 SET proteins possessed a variety of domain organizations, and can be classified into five types and ten subtypes. Plant Su(var)3-9 SET genes also exhibit a wide range of gene structures among different paralogs within a family, even in the regions encoding conserved PreSET and SET domains. We also found that the majority of SUVH members were intronless and formed three subclades within the SUVH clade. A detailed phylogenetic analysis of the plant Su(var)3-9 SET genes was performed. A novel deep phylogenetic relationship including most plant Su(var)3-9 SET genes was identified. Additional domains such as SAR, ZnF_C2H2 and WIYLD were integrated early into the primordial PreSET/SET/PostSET domain organization. At least three classes of gene structures had been formed before the divergence of Physcomitrella patens (moss) from other land plants. One or multiple retroposition events might have occurred among SUVH genes with the donor genes leading to the V-2 orthologous group. The structural differences among evolutionary groups of plant Su(var)3-9 SET genes with different functions were described, contributing to the design of further experimental studies.

  19. Inference of combinatorial Boolean rules of synergistic gene sets from cancer microarray datasets.

    PubMed

    Park, Inho; Lee, Kwang H; Lee, Doheon

    2010-06-15

    Gene set analysis has become an important tool for the functional interpretation of high-throughput gene expression datasets. Moreover, pattern analyses based on inferred gene set activities of individual samples have shown the ability to identify more robust disease signatures than individual gene-based pattern analyses. Although a number of approaches have been proposed for gene set-based pattern analysis, the combinatorial influence of deregulated gene sets on disease phenotype classification has not been studied sufficiently. We propose a new approach for inferring combinatorial Boolean rules of gene sets for a better understanding of cancer transcriptome and cancer classification. To reduce the search space of the possible Boolean rules, we identify small groups of gene sets that synergistically contribute to the classification of samples into their corresponding phenotypic groups (such as normal and cancer). We then measure the significance of the candidate Boolean rules derived from each group of gene sets; the level of significance is based on the class entropy of the samples selected in accordance with the rules. By applying the present approach to publicly available prostate cancer datasets, we identified 72 significant Boolean rules. Finally, we discuss several identified Boolean rules, such as the rule of glutathione metabolism (down) and prostaglandin synthesis regulation (down), which are consistent with known prostate cancer biology. Scripts written in Python and R are available at http://biosoft.kaist.ac.kr/~ihpark/. The refined gene sets and the full list of the identified Boolean rules are provided in the Supplementary Material. Supplementary data are available at Bioinformatics online.
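
The abstract states that rule significance is based on the class entropy of the samples a rule selects, without giving the exact scoring. As a hedged illustration only, the sketch below computes plain Shannon entropy over class labels; the function name `class_entropy` and the sample labels are hypothetical, and the authors' significance procedure is not reproduced here.

```python
from collections import Counter
from math import log2

def class_entropy(labels):
    """Shannon entropy (in bits) of the class labels of a group of samples."""
    if not labels:
        raise ValueError("labels must be non-empty")
    total = len(labels)
    # Each class contributes p * log2(1/p), where p is its relative frequency.
    return sum((c / total) * log2(total / c) for c in Counter(labels).values())

# Hypothetical samples selected by two candidate rules.
pure = ["cancer", "cancer", "cancer", "cancer"]   # rule selects one class only
mixed = ["cancer", "normal", "cancer", "normal"]  # rule selects both classes equally
print(class_entropy(pure))   # 0.0
print(class_entropy(mixed))  # 1.0
```

Under this reading, a rule whose selected samples have entropy near zero separates the phenotypic groups cleanly, while entropy near one bit (for two classes) indicates no separation.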

  20. Distributional fold change test – a statistical approach for detecting differential expression in microarray experiments

    PubMed Central

    2012-01-01

    Background: Because of the large volume of data and the intrinsic variation of data intensity observed in microarray experiments, different statistical methods have been used to systematically extract biological information and to quantify the associated uncertainty. The simplest method to identify differentially expressed genes is to evaluate the ratio of average intensities in two different conditions and consider all genes that differ by more than an arbitrary cut-off value to be differentially expressed. This filtering approach is not a statistical test and there is no associated value that can indicate the level of confidence in the designation of genes as differentially expressed or not differentially expressed. At the same time, the fold change by itself provides valuable information and it is important to find unambiguous ways of using this information in expression data treatment. Results: A new method of finding differentially expressed genes, called the distributional fold change (DFC) test, is introduced. The method is based on an analysis of the intensity distribution of all microarray probe sets mapped to a three-dimensional feature space composed of average expression level, average difference of gene expression and total variance. The proposed method allows one to rank each feature based on the signal-to-noise ratio and to ascertain for each feature the confidence level and power for being differentially expressed. The performance of the new method was evaluated using the total and partial area under receiver operating characteristic curves and tested on 11 data sets from Gene Omnibus Database with independently verified differentially expressed genes and compared with the t-test and shrinkage t-test. Overall the DFC test performed the best; on average it had higher sensitivity and partial AUC, and its elevation was most prominent in the low range of differentially expressed features, typical for formalin-fixed paraffin-embedded sample sets. Conclusions: The distributional fold change test is an effective method for finding and ranking differentially expressed probe sets on microarrays. The application of this test is advantageous to data sets using formalin-fixed paraffin-embedded samples or other systems where degradation effects diminish the applicability of correlation-adjusted methods to the whole feature set. PMID:23122055
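
The simple filtering approach described in the Background of this abstract (ratio of average intensities in two conditions, compared against an arbitrary cut-off) can be sketched as below. This illustrates only that baseline filter, not the DFC test itself. The function name `fold_change_filter`, the gene labels, the intensities, and the cut-off of 2.0 are all hypothetical, and strictly positive intensities are assumed.

```python
def fold_change_filter(cond_a, cond_b, cutoff=2.0):
    """Return genes whose ratio of mean intensities (B over A) exceeds the cut-off
    in either direction. Inputs map gene name -> list of positive intensities."""
    selected = []
    for gene, values_a in cond_a.items():
        mean_a = sum(values_a) / len(values_a)
        mean_b = sum(cond_b[gene]) / len(cond_b[gene])
        ratio = mean_b / mean_a
        # Treat up- and down-regulation symmetrically.
        if ratio >= cutoff or ratio <= 1.0 / cutoff:
            selected.append(gene)
    return selected

# Hypothetical replicate intensities for three genes in two conditions.
control = {"geneA": [100.0, 120.0], "geneB": [200.0, 200.0], "geneC": [300.0, 300.0]}
treated = {"geneA": [400.0, 480.0], "geneB": [210.0, 190.0], "geneC": [100.0, 100.0]}
print(fold_change_filter(control, treated))  # ['geneA', 'geneC']
```

As the abstract notes, a filter like this yields no confidence value for any call, which is the gap the proposed test is meant to address.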

  1. Lung cancer signature biomarkers: tissue specific semantic similarity based clustering of digital differential display (DDD) data.

    PubMed

    Srivastava, Mousami; Khurana, Pankaj; Sugadev, Ragumani

    2012-11-02

    The tissue-specific Unigene Sets derived from more than one million expressed sequence tags (ESTs) in the NCBI GenBank database offer a platform for identifying significantly and differentially expressed tissue-specific genes by in-silico methods. Digital differential display (DDD) rapidly creates transcription profiles based on EST comparisons and numerically calculates, as a fraction of the pool of ESTs, the relative sequence abundance of known and novel genes. However, the process of identifying the most likely tissue for a specific disease in which to search for candidate genes from the pool of differentially expressed genes remains difficult. Therefore, we have used the 'Gene Ontology semantic similarity score' to measure the GO similarity between gene products of lung tissue-specific candidate genes from control (normal) and disease (cancer) sets. Hierarchical clustering of this semantic similarity score matrix is represented in the form of a dendrogram. The dendrogram cluster stability was assessed by multiple bootstrapping. Multiple bootstrapping also computes a p-value for each cluster and corrects the bias of the bootstrap probability. Subsequent hierarchical clustering by the multiple bootstrapping method (α = 0.95) identified seven clusters. The comparative, as well as subtractive, approach revealed a set of 38 biomarkers comprising four distinct lung cancer signature biomarker clusters (panel 1-4). Further gene enrichment analysis of the four panels revealed that each panel represents a set of lung cancer linked metastasis diagnostic biomarkers (panel 1), chemotherapy/drug resistance biomarkers (panel 2), hypoxia regulated biomarkers (panel 3) and lung extracellular matrix biomarkers (panel 4). Expression analysis reveals that hypoxia induced lung cancer related biomarkers (panel 3), HIF and its modulating proteins (TGM2, CSNK1A1, CTNNA1, NAMPT/Visfatin, TNFRSF1A, ETS1, SRC-1, FN1, APLP2, DMBT1/SAG, AIB1 and AZIN1) are significantly down regulated. All down regulated genes in this panel were highly up regulated in most other types of cancers. These panels of proteins may represent signature biomarkers for lung cancer and will aid in lung cancer diagnosis and disease monitoring as well as in the prediction of responses to therapeutics.

  2. A 6-gene signature identifies four molecular subgroups of neuroblastoma

    PubMed Central

    2011-01-01

    Background: There are currently three postulated genomic subtypes of the childhood tumour neuroblastoma (NB): Type 1, Type 2A, and Type 2B. The most aggressive forms of NB are characterized by amplification of the oncogene MYCN (MNA) and low expression of the favourable marker NTRK1. Recently, mutations or high expression of the familial predisposition gene Anaplastic Lymphoma Kinase (ALK) were associated with unfavourable biology of sporadic NB. Also, various other genes have been linked to NB pathogenesis. Results: The present study explores subgroup discrimination by gene expression profiling using three published microarray studies on NB (47 samples). Four distinct clusters were identified by Principal Components Analysis (PCA) in two separate data sets, which could be verified by an unsupervised hierarchical clustering in a third independent data set (101 NB samples) using a set of 74 discriminative genes. The expression signature of six NB-associated genes, ALK, BIRC5, CCND1, MYCN, NTRK1, and PHOX2B, significantly discriminated the four clusters (p < 0.05, one-way ANOVA test). PCA clusters p1, p2, and p3 were found to correspond well to the postulated subtypes 1, 2A, and 2B, respectively. Remarkably, a fourth novel cluster was detected in all three independent data sets. This cluster comprised mainly 11q-deleted MNA-negative tumours with low expression of ALK, BIRC5, and PHOX2B, and was significantly associated with higher tumour stage, poor outcome and poor survival compared to the Type 1-corresponding favourable group (INSS stage 4 and/or dead of disease, p < 0.05, Fisher's exact test). Conclusions: Based on expression profiling we have identified four molecular subgroups of neuroblastoma, which can be distinguished by a 6-gene signature. The fourth subgroup has not been described elsewhere, and efforts are currently being made to further investigate this group's specific characteristics. PMID:21492432

  3. Phylogenetics and evolution of Trx SET genes in fully sequenced land plants.

    PubMed

    Zhu, Xinyu; Chen, Caoyi; Wang, Baohua

    2012-04-01

    Plant Trx SET proteins are involved in H3K4 methylation and play a key role in plant floral development. Genes encoding Trx SET proteins constitute a multigene family in which the copy number varies among plant species and functional divergence appears to have occurred repeatedly. To investigate the evolutionary history of the Trx SET gene family, we performed a comprehensive evolutionary analysis of this gene family from 13 major representatives of green plants. A novel clustering (here named the cpTrx clade), which included the previously resolved III-1, III-2, and III-4 orthologous groups, was identified. Our analysis showed that plant Trx proteins possessed a variety of domain organizations and gene structures among paralogs. Additional domains such as PHD, PWWP, and FYR were integrated early into the primordial SET-PostSET domain organization of the cpTrx clade. We suggest that the PostSET domain was lost in some members of the III-4 orthologous group during the evolution of land plants. At least four classes of gene structures had been formed at the early evolutionary stage of land plants. Three intronless orphan Trx SET genes from Physcomitrella patens (moss) were identified, and supposedly, their parental genes have been eliminated from the genome. The structural differences among evolutionary groups of plant Trx SET genes with different functions were described, contributing to the design of further experimental studies.

  4. Gene array analysis reveals a common Runx transcriptional program controlling cell adhesion and survival

    PubMed Central

    Wotton, Sandy; Terry, Anne; Kilbey, Anna; Jenkins, Alma; Herzyk, Pawel; Cameron, Ewan; Neil, James C.

    2008-01-01

    The Runx genes play divergent roles in development and cancer, where they can act either as oncogenes or tumour suppressors. We compared the effects of ectopic Runx expression in established fibroblasts, where all three genes produce an indistinguishable phenotype entailing epithelioid morphology and increased cell survival under stress conditions. Gene array analysis revealed a strongly overlapping transcriptional signature, with no examples of opposing regulation of the same target gene. A common set of 50 highly regulated genes was identified after further filtering on regulation by inducible RUNX1-ER. This set revealed a strong bias towards genes with annotated roles in cancer and development, and a preponderance of targets encoding extracellular or surface proteins, reflecting the marked effects of Runx on cell adhesion. Furthermore, in silico prediction of resistance to glucocorticoid growth inhibition was confirmed in fibroblasts and lymphoid cells expressing ectopic Runx. The effects of fibroblast expression of common RUNX1 fusion oncoproteins (RUNX1-ETO, TEL-RUNX1, CBFB-MYH11) were also tested. While two direct Runx activation target genes were repressed (Ncam1, Rgc32), the fusion proteins appeared to disrupt regulation of down-regulated targets (Cebpd, Id2, Rgs2) rather than impose constitutive repression. These results elucidate the oncogenic potential of the Runx family and reveal novel targets for therapeutic inhibition. PMID:18560354

  5. A 16-Gene Signature Distinguishes Anaplastic Astrocytoma from Glioblastoma

    PubMed Central

    Rao, Soumya Alige Mahabala; Srinivasan, Sujaya; Patric, Irene Rosita Pia; Hegde, Alangar Sathyaranjandas; Chandramouli, Bangalore Ashwathnarayanara; Arimappamagan, Arivazhagan; Santosh, Vani; Kondaiah, Paturu; Rao, Manchanahalli R. Sathyanarayana; Somasundaram, Kumaravel

    2014-01-01

    Anaplastic astrocytoma (AA; Grade III) and glioblastoma (GBM; Grade IV) are diffusely infiltrating tumors and are called malignant astrocytomas. The treatment regimen and prognosis are distinctly different between anaplastic astrocytoma and glioblastoma patients. Although histopathology based current grading system is well accepted and largely reproducible, intratumoral histologic variations often lead to difficulties in classification of malignant astrocytoma samples. In order to obtain a more robust molecular classifier, we analysed RT-qPCR expression data of 175 differentially regulated genes across astrocytoma using Prediction Analysis of Microarrays (PAM) and found the most discriminatory 16-gene expression signature for the classification of anaplastic astrocytoma and glioblastoma. The 16-gene signature obtained in the training set was validated in the test set with diagnostic accuracy of 89%. Additionally, validation of the 16-gene signature in multiple independent cohorts revealed that the signature predicted anaplastic astrocytoma and glioblastoma samples with accuracy rates of 99%, 88%, and 92% in TCGA, GSE1993 and GSE4422 datasets, respectively. The protein-protein interaction network and pathway analysis suggested that the 16-genes of the signature identified epithelial-mesenchymal transition (EMT) pathway as the most differentially regulated pathway in glioblastoma compared to anaplastic astrocytoma. In addition to identifying 16 gene classification signature, we also demonstrated that genes involved in epithelial-mesenchymal transition may play an important role in distinguishing glioblastoma from anaplastic astrocytoma. PMID:24475040

  6. Molecular Diagnosis of Infantile Mitochondrial Disease with Targeted Next-Generation Sequencing

    PubMed Central

    Calvo, Sarah E.; Compton, Alison G.; Hershman, Steven G.; Lim, Sze Chern; Lieber, Daniel S.; Tucker, Elena J.; Laskowski, Adrienne; Garone, Caterina; Liu, Shangtao; Jaffe, David B.; Christodoulou, John; Fletcher, Janice M.; Bruno, Damien L; Goldblatt, Jack; DiMauro, Salvatore; Thorburn, David R.; Mootha, Vamsi K.

    2012-01-01

    Advances in next-generation sequencing (NGS) promise to facilitate diagnosis of inherited disorders. While in research settings NGS has pinpointed causal alleles using segregation in large families, the key challenge for clinical diagnosis is application to single individuals. To explore its diagnostic utility, we performed targeted NGS in 42 unrelated infants with clinical and biochemical evidence of mitochondrial oxidative phosphorylation disease, who were refractory to traditional molecular diagnosis. These devastating mitochondrial disorders are characterized by phenotypic and genetic heterogeneity, with over 100 causal genes identified to date. We performed “MitoExome” sequencing of the mitochondrial DNA (mtDNA) and exons of ~1000 nuclear genes encoding mitochondrial proteins and prioritized rare mutations predicted to disrupt function. Since patients and controls harbored a comparable number of such heterozygous alleles, we could not prioritize dominant acting genes. However, patients showed a five-fold enrichment of genes with two such mutations that could underlie recessive disease. In total, 23/42 (55%) patients harbored such recessive genes or pathogenic mtDNA variants. Firm diagnoses were enabled in 10 patients (24%) who had mutations in genes previously linked to disease. 13 patients (31%) had mutations in nuclear genes never linked to disease. The pathogenicity of two such genes, NDUFB3 and AGK, was supported by cDNA complementation and evidence from multiple patients, respectively. The results underscore the immediate potential and challenges of deploying NGS in clinical settings. PMID:22277967

  7. ADGO: analysis of differentially expressed gene sets using composite GO annotation.

    PubMed

    Nam, Dougu; Kim, Sang-Bae; Kim, Seon-Kyu; Yang, Sungjin; Kim, Seon-Young; Chu, In-Sun

    2006-09-15

    Genes are typically expressed in modular manners in biological processes. Recent studies reflect such features in analyzing gene expression patterns by directly scoring gene sets. Gene annotations have been used to define the gene sets, which have served to reveal specific biological themes from expression data. However, current annotations have limited analytical power, because they are classified by single categories providing only unary information for the gene sets. Here we propose a method for discovering composite biological themes from expression data. We intersected two annotated gene sets from different categories of Gene Ontology (GO). We then scored the expression changes of all the single and intersected sets. In this way, we were able to uncover, for example, a gene set with the molecular function F and the cellular component C that showed significant expression change, while the changes in individual gene sets were not significant. We provided an exemplary analysis for HIV-1 immune response. In addition, we tested the method on 20 public datasets where we found many 'filtered' composite terms, the number of which reached approximately 34% (a strong criterion, 5% significance) of the number of significant unary terms on average. By using composite annotation, we can derive new and improved information about disease and biological processes from expression data. We provide a web application (ADGO: http://array.kobic.re.kr/ADGO) for the analysis of differentially expressed gene sets with composite GO annotations. The user can analyze Affymetrix and dual channel array (spotted cDNA and spotted oligo microarray) data for four species: human, mouse, rat and yeast. Contact: chu@kribb.re.kr.
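
    The intersect-then-score idea described in this abstract can be sketched as follows. This is only an illustration under stated assumptions, not ADGO's implementation: the gene names, the per-gene scores, and the choice of a simple z-statistic (set mean standardized against all genes) are hypothetical and are not taken from the paper.

```python
import math
from statistics import mean, pstdev


def set_z_score(gene_set, scores):
    """Z-statistic of a gene set's mean score against the background of all genes.

    Assumption: the scoring statistic is a plain z-score; the paper's own
    scoring method is not specified in the abstract.
    """
    members = [scores[g] for g in gene_set if g in scores]
    if not members:
        return 0.0
    background = list(scores.values())
    mu, sigma = mean(background), pstdev(background)
    if sigma == 0:
        return 0.0
    return (mean(members) - mu) / (sigma / math.sqrt(len(members)))


# Hypothetical per-gene expression-change scores.
scores = {"g1": 2.0, "g2": 1.8, "g3": -0.1, "g4": 0.0,
          "g5": -1.9, "g6": 0.1, "g7": -1.9, "g8": 0.0}

# Hypothetical annotations from two different GO categories.
molecular_function_F = {"g1", "g2", "g3", "g4"}
cellular_component_C = {"g1", "g2", "g5", "g7"}

# Composite term: genes annotated with both F and C.
composite = molecular_function_F & cellular_component_C

z_f = set_z_score(molecular_function_F, scores)
z_c = set_z_score(cellular_component_C, scores)
z_composite = set_z_score(composite, scores)

print(f"F: {z_f:.2f}  C: {z_c:.2f}  F and C: {z_composite:.2f}")
```

    In this toy input the intersected set scores higher than either unary set, which is the kind of signal a composite annotation is meant to expose; a real analysis would also need a significance test and multiple-testing correction.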

  8. Discovering potential driver genes through an integrated model of somatic mutation profiles and gene functional information.

    PubMed

    Xi, Jianing; Wang, Minghui; Li, Ao

    2017-09-26

    The accumulating availability of next-generation sequencing data offers an opportunity to pinpoint driver genes that are causally implicated in oncogenesis through computational models. Despite previous efforts made regarding this challenging problem, there is still room for improvement in the driver gene identification accuracy. In this paper, we propose a novel integrated approach called IntDriver for prioritizing driver genes. Based on a matrix factorization framework, IntDriver can effectively incorporate functional information from both the interaction network and Gene Ontology similarity, and detect driver genes mutated in different sets of patients at the same time. When evaluated through known benchmarking driver genes, the top ranked genes of our result show highly significant enrichment for the known genes. Meanwhile, IntDriver also detects some known driver genes that are not found by the other competing approaches. When measured by precision, recall and F1 score, the performances of our approach are comparable or increased in comparison to the competing approaches.
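
    The evaluation measures named in this abstract (precision, recall and F1 score against benchmark driver genes) can be computed as in the sketch below. The gene identifiers and both lists are hypothetical placeholders; the abstract does not state how many top-ranked genes were evaluated.

```python
def precision_recall_f1(predicted, benchmark):
    """Precision, recall and F1 of a predicted gene list against a benchmark set."""
    predicted, benchmark = set(predicted), set(benchmark)
    true_positives = len(predicted & benchmark)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(benchmark) if benchmark else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical top-ranked genes and hypothetical benchmark driver genes.
top_ranked = ["G1", "G2", "G7", "G9"]
benchmark = ["G1", "G2", "G3", "G4", "G5"]

precision, recall, f1 = precision_recall_f1(top_ranked, benchmark)
print(precision, recall, f1)
```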

  9. Comparative mapping in the Fagaceae and beyond with EST-SSRs

    PubMed Central

    2012-01-01

    Background: Genetic markers and linkage mapping are basic prerequisites for comparative genetic analyses, QTL detection and map-based cloning. A large number of mapping populations have been developed for oak, but few gene-based markers are available for constructing integrated genetic linkage maps and comparing gene order and QTL location across related species. Results: We developed a set of 573 expressed sequence tag-derived simple sequence repeats (EST-SSRs) and located 397 markers (EST-SSRs and genomic SSRs) on the 12 oak chromosomes (2n = 2x = 24) on the basis of Mendelian segregation patterns in 5 full-sib mapping pedigrees of two species: Quercus robur (pedunculate oak) and Quercus petraea (sessile oak). Consensus maps for the two species were constructed and aligned. They showed a high degree of macrosynteny between these two sympatric European oaks. We assessed the transferability of EST-SSRs to other Fagaceae genera and a subset of these markers was mapped in Castanea sativa, the European chestnut. Reasonably high levels of macrosynteny were observed between oak and chestnut. We also obtained diversity statistics for a subset of EST-SSRs, to support further population genetic analyses with gene-based markers. Finally, based on the orthologous relationships between the oak, Arabidopsis, grape, poplar, Medicago, and soybean genomes and the paralogous relationships between the 12 oak chromosomes, we propose an evolutionary scenario of the 12 oak chromosomes from the eudicot ancestral karyotype. Conclusions: This study provides map locations for a large set of EST-SSRs in two oak species of recognized biological importance in natural ecosystems. This first step toward the construction of a gene-based linkage map will facilitate the assignment of future genome scaffolds to pseudo-chromosomes. This study also provides an indication of the potential utility of new gene-based markers for population genetics and comparative mapping within and beyond the Fagaceae. PMID:22931513

  10. categoryCompare, an analytical tool based on feature annotations

    PubMed Central

    Flight, Robert M.; Harrison, Benjamin J.; Mohammad, Fahim; Bunge, Mary B.; Moon, Lawrence D. F.; Petruska, Jeffrey C.; Rouchka, Eric C.

    2014-01-01

    Assessment of high-throughput -omics data initially focuses on relative or raw levels of a particular feature, such as an expression value for a transcript, protein, or metabolite. At a second level, analyses of annotations, including known or predicted functions and associations of each individual feature, attempt to distill biological context. Most currently available comparative and meta-analysis methods are dependent on the availability of identical features across data sets, and concentrate on determining features that are differentially expressed across experiments, some of which may be considered “biomarkers.” The heterogeneity of measurement platforms and inherent variability of biological systems confound the search for robust biomarkers indicative of a particular condition. In many instances, however, multiple data sets show involvement of common biological processes or signaling pathways, even though individual features are not commonly measured or differentially expressed between them. We developed a methodology, categoryCompare, for cross-platform and cross-sample comparison of high-throughput data at the annotation level. We assessed the utility of the approach using hypothetical data, as well as determining similarities and differences in the set of processes in two instances: (1) denervated skin vs. denervated muscle, and (2) colon from Crohn's disease vs. colon from ulcerative colitis (UC). The hypothetical data showed that in many cases comparing annotations gave superior results to comparing only at the gene level. Improved analytical results depended as well on the number of genes included in the annotation term, the amount of noise in relation to the number of genes expressing in unenriched annotation categories, and the specific method in which samples are combined. In the skin vs. muscle denervation comparison, the tissues demonstrated markedly different responses. The Crohn's vs. UC comparison showed gross similarities in inflammatory response in the two diseases, with particular processes specific to each disease. PMID:24808906

  11. Expression Levels of pvcrt-o and pvmdr-1 Are Associated with Chloroquine Resistance and Severe Plasmodium vivax Malaria in Patients of the Brazilian Amazon

    PubMed Central

    Melo, Gisely C.; Monteiro, Wuelton M.; Siqueira, André M.; Silva, Siuhelem R.; Magalhães, Belisa M. L.; Alencar, Aline C. C.; Kuehn, Andrea; Portillo, Hernando A. del.; Fernandez-Becerra, Carmen; Lacerda, Marcus V. G.

    2014-01-01

    Molecular markers associated with the increase of chloroquine resistance and disease severity in Plasmodium vivax are needed. The objective of this study was to evaluate the expression levels of pvcrt-o and pvmdr-1 genes in a group of patients presenting CQRPv and patients who developed severe complications triggered exclusively by P. vivax infection. Two different sets of patients were included in this comprehensive study performed in the Brazilian Amazon: 1) patients with clinically characterized chloroquine-resistant P. vivax compared with patients with susceptible parasites from in vivo studies and 2) patients with severe vivax malaria compared with patients without severity. Quantitative real-time PCR was performed to compare the transcript levels of two main transporter genes, the P. vivax chloroquine resistance transporter (pvcrt-o) and the P. vivax multidrug resistance transporter (pvmdr-1). Twelve chloroquine-resistant cases and 15 other isolates from susceptible cases were included in the first set of patients. For the second set, seven patients with P. vivax-attributed severe manifestations and 10 with mild manifestations were included. Parasites from patients with chloroquine resistance presented up to 6.1 (95% CI: 3.8–14.3) and 2.4 (95% CI: 0.53–9.1) fold increase in pvcrt-o and pvmdr-1 expression levels, respectively, compared to the susceptible group. Parasites from the severe vivax group had a 2.9 (95% CI: 1.1–8.3) and 4.9 (95% CI: 2.3–18.8) fold increase in pvcrt-o and pvmdr-1 expression levels as compared to the control group with mild disease. These findings suggest that chloroquine resistance and clinical severity in P. vivax infections are strongly associated with increased expression levels of the pvcrt-o and pvmdr-1 genes likely involved in chloroquine resistance. PMID:25157811

  12. Association of FLOWERING LOCUS T/TERMINAL FLOWER 1-like gene FTL2 expression with growth rhythm in Scots pine (Pinus sylvestris).

    PubMed

    Avia, Komlan; Kärkkäinen, Katri; Lagercrantz, Ulf; Savolainen, Outi

    2014-10-01

    Understanding the genetic basis of the timing of bud set, an important trait in conifers, is relevant for adaptation and forestry practice. In common garden experiments, both Scots pine (Pinus sylvestris) and Norway spruce (Picea abies) show a latitudinal cline in the trait. We compared the regulation of their bud set biology by examining the expression of PsFTL2, a Pinus sylvestris homolog to PaFTL2, a FLOWERING LOCUS T/TERMINAL FLOWER 1 (FT/TFL1)-like gene, the expression levels of which have been found previously to be associated with the timing of bud set in Norway spruce. In a common garden study, we analyzed the relationship of bud phenology under natural and artificial photoperiods and the expression of PsFTL2 in a set of Scots pine populations from different latitudes. The expression of PsFTL2 increased in the needles preceding bud set and decreased during bud burst. In the northernmost population, even short night periods were efficient to trigger this expression, which also increased earlier under all photoperiodic regimes compared with the southern populations. Despite the different biology, with few limitations, the two conifers that diverged 140 million yr ago probably share an association of FTL2 with bud set, pointing to a common mechanism for the timing of growth cessation in conifers. © 2014 The Authors. New Phytologist © 2014 New Phytologist Trust.

  13. A comparative gene expression analysis of iron-limited cultures of Chaetoceros socialis and Pseudo-nitzschia arenysensis using newly developed iron assays

    NASA Astrophysics Data System (ADS)

    Abdala, Z. M.; Powell, K.; Cronin, D.; Chappell, D.

    2016-02-01

    Diatoms, accounting for about 40% of the primary production in marine ecosystems, play a vital role in the dynamics of marine systems. Iron availability is understood to be a driving factor controlling productivity of many marine phytoplankton, including diatoms, as it functions as a cofactor for many proteins including several involved with photosynthetic processes. Previous work examining transcriptomes of diatoms of the Thalassiosira genus grown in controlled laboratory settings has identified genes whose expression can be used as sensitive markers of iron status. Data mining publicly available diatom transcriptome data for these genes enables development of additional iron status assays for environmentally-relevant diatoms. For the present study, gene expression analysis of iron-limited laboratory cultures of Chaetoceros socialis and Pseudo-nitzschia arenysensis grown in continuous light was done using quantitative reverse transcriptase polymerase chain reaction (qRT-PCR). C. socialis and P. arenysensis serve as comparative models for analyzing gene expression in iron limitation in different ecological community assemblages. These data may ultimately help to illuminate the function of iron in photosynthetic activity in diatoms.

  14. SynFind: Compiling Syntenic Regions across Any Set of Genomes on Demand.

    PubMed

    Tang, Haibao; Bomhoff, Matthew D; Briones, Evan; Zhang, Liangsheng; Schnable, James C; Lyons, Eric

    2015-11-11

    The identification of conserved syntenic regions enables discovery of predicted locations for orthologous and homeologous genes, even when no such gene is present. This capability means that synteny-based methods are far more effective than sequence similarity-based methods in identifying true-negatives, a necessity for studying gene loss and gene transposition. However, the identification of syntenic regions requires complex analyses which must be repeated for pairwise comparisons between any two species. Therefore, as the number of published genomes increases, there is a growing demand for scalable, simple-to-use applications to perform comparative genomic analyses that cater to both gene family studies and genome-scale studies. We implemented SynFind, a web-based tool that addresses this need. Given one query genome, SynFind is capable of identifying conserved syntenic regions in any set of target genomes. SynFind is capable of reporting per-gene information, useful for researchers studying specific gene families, as well as genome-wide data sets of syntenic gene and predicted gene locations, critical for researchers focused on large-scale genomic analyses. Inference of syntenic homologs provides the basis for correlation of functional changes around genes of interest between related organisms. Deployed on the CoGe online platform, SynFind is connected to the genomic data from over 15,000 organisms from all domains of life as well as supporting multiple releases of the same organism. SynFind makes use of a powerful job execution framework that promises scalability and reproducibility. SynFind can be accessed at http://genomevolution.org/CoGe/SynFind.pl. A video tutorial of SynFind using Phytophthora as an example is available at http://www.youtube.com/watch?v=2Agczny9Nyc. © The Author(s) 2015. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.

  15. Discretization provides a conceptually simple tool to build expression networks.

    PubMed

    Vass, J Keith; Higham, Desmond J; Mudaliar, Manikhandan A V; Mao, Xuerong; Crowther, Daniel J

    2011-04-18

    Biomarker identification, using network methods, depends on finding regular co-expression patterns; the overall connectivity is of greater importance than any single relationship. A second requirement is a simple algorithm for ranking patients on how relevant a gene-set is. For both of these requirements discretized data helps to first identify gene cliques, and then to stratify patients. We explore a biologically intuitive discretization technique which codes genes as up- or down-regulated, with values close to the mean set as unchanged; this allows a richer description of relationships between genes than can be achieved by positive and negative correlation. We find a close agreement between our results and the template gene-interactions used to build synthetic microarray-like data by SynTReN, which synthesizes "microarray" data using known relationships which are successfully identified by our method. We are able to split positive co-regulation into up-together and down-together and negative co-regulation is considered as directed up-down relationships. In some cases these exist in only one direction, with real data, but not with the synthetic data. We illustrate our approach using two studies on white blood cells and derived immortalized cell lines and compare the approach with standard correlation-based computations. No attempt is made to distinguish possible causal links as the search for biomarkers would be crippled by losing highly significant co-expression relationships. This contrasts with approaches like ARACNE and IRIS. The method is illustrated with an analysis of gene-expression for energy metabolism pathways. For each discovered relationship we are able to identify the samples on which this is based in the discretized sample-gene matrix, along with a simplified view of the patterns of gene expression; this helps to dissect the gene-sample relevant to a research topic: identifying sets of co-regulated and anti-regulated genes and the samples or patients in which this relationship occurs.
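
    The discretization described in this abstract (genes coded as up, down, or unchanged near the mean, with co-regulation split into directed categories) might look like the sketch below. The threshold of half a standard deviation around the mean, the function names, and the expression values are assumptions made for illustration; the abstract does not give the authors' actual cut-offs.

```python
from statistics import mean, pstdev


def discretize(values, width=0.5):
    """Code each value as +1 (up), -1 (down) or 0 (unchanged).

    Assumption: "close to the mean" means within width * SD of the gene's mean.
    """
    mu, sd = mean(values), pstdev(values)
    band = width * sd
    return [1 if v > mu + band else -1 if v < mu - band else 0 for v in values]


def pair_relationships(a, b):
    """Sample indices supporting each directed relationship between two coded genes."""
    rel = {"up_together": [], "down_together": [],
           "a_up_b_down": [], "a_down_b_up": []}
    for i, (x, y) in enumerate(zip(a, b)):
        if x == 1 and y == 1:
            rel["up_together"].append(i)
        elif x == -1 and y == -1:
            rel["down_together"].append(i)
        elif x == 1 and y == -1:
            rel["a_up_b_down"].append(i)
        elif x == -1 and y == 1:
            rel["a_down_b_up"].append(i)
    return rel


# Hypothetical expression values for two genes across six samples.
gene_a = [1.0, 5.0, 9.0, 5.0, 1.0, 9.0]
gene_b = [2.0, 6.0, 10.0, 6.0, 10.0, 2.0]

coded_a = discretize(gene_a)
coded_b = discretize(gene_b)
relationships = pair_relationships(coded_a, coded_b)
print(coded_a, coded_b, relationships)
```

    Keeping the sample indices for each category, rather than a single correlation coefficient, mirrors the abstract's point that each discovered relationship can be traced back to the samples or patients in which it occurs.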

  16. Reference Gene Validation for RT-qPCR, a Note on Different Available Software Packages

    PubMed Central

    De Spiegelaere, Ward; Dern-Wieloch, Jutta; Weigel, Roswitha; Schumacher, Valérie; Schorle, Hubert; Nettersheim, Daniel; Bergmann, Martin; Brehm, Ralph; Kliesch, Sabine; Vandekerckhove, Linos; Fink, Cornelia

    2015-01-01

    Background: An appropriate normalization strategy is crucial for data analysis from real time reverse transcription polymerase chain reactions (RT-qPCR). It is widely supported to identify and validate stable reference genes, since no single biological gene is stably expressed between cell types or within cells under different conditions. Different algorithms exist to validate optimal reference genes for normalization. Applying human cells, we here compare the three main methods to the online available RefFinder tool that integrates these algorithms along with R-based software packages which include the NormFinder and GeNorm algorithms. Results: 14 candidate reference genes were assessed by RT-qPCR in two sample sets, i.e. a set of samples of human testicular tissue containing carcinoma in situ (CIS), and a set of samples from the human adult Sertoli cell line (FS1) either cultured alone or in co-culture with the seminoma like cell line (TCam-2) or with equine bone marrow derived mesenchymal stem cells (eBM-MSC). Expression stabilities of the reference genes were evaluated using geNorm, NormFinder, and BestKeeper. Similar results were obtained by the three approaches for the most and least stably expressed genes. The R-based packages NormqPCR, SLqPCR and the NormFinder for R script gave identical gene rankings. Interestingly, different outputs were obtained between the original software packages and the RefFinder tool, which is based on raw Cq values for input. When the raw data were reanalysed assuming 100% efficiency for all genes, then the outputs of the original software packages were similar to the RefFinder software, indicating that RefFinder outputs may be biased because PCR efficiencies are not taken into account. Conclusions: This report shows that assay efficiency is an important parameter for reference gene validation. New software tools that incorporate these algorithms should be carefully validated prior to use. PMID:25825906
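
    Why PCR efficiency matters for this comparison can be seen from the commonly used conversion of raw Cq values to relative quantities, Q = E^(Cq_min - Cq), where E is the amplification factor per cycle (2.0 at 100% efficiency). The sketch below is an illustration only: the Cq values and the 80% efficiency are hypothetical, and it does not reproduce the geNorm, NormFinder, BestKeeper or RefFinder calculations themselves.

```python
def relative_quantities(cq_values, amplification_factor=2.0):
    """Convert raw Cq values to quantities relative to the sample with the lowest Cq.

    amplification_factor is 1 + efficiency, so 2.0 corresponds to 100% efficiency.
    """
    cq_min = min(cq_values)
    return [amplification_factor ** (cq_min - cq) for cq in cq_values]


# Hypothetical Cq values for one candidate reference gene in three samples.
cq = [20.0, 21.0, 22.0]

assumed = relative_quantities(cq)            # assumes 100% efficiency
corrected = relative_quantities(cq, 1.8)     # hypothetical measured efficiency of 80%

print(assumed)
print(corrected)
```

    The same raw Cq values give different relative quantities under the two efficiency settings, so a tool that works from raw Cq values alone implicitly fixes the efficiency for every gene.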

  17. Reference gene validation for RT-qPCR, a note on different available software packages.

    PubMed

    De Spiegelaere, Ward; Dern-Wieloch, Jutta; Weigel, Roswitha; Schumacher, Valérie; Schorle, Hubert; Nettersheim, Daniel; Bergmann, Martin; Brehm, Ralph; Kliesch, Sabine; Vandekerckhove, Linos; Fink, Cornelia

    2015-01-01

    An appropriate normalization strategy is crucial for data analysis from real time reverse transcription polymerase chain reactions (RT-qPCR). It is widely supported to identify and validate stable reference genes, since no single biological gene is stably expressed between cell types or within cells under different conditions. Different algorithms exist to validate optimal reference genes for normalization. Applying human cells, we here compare the three main methods to the online available RefFinder tool that integrates these algorithms along with R-based software packages which include the NormFinder and GeNorm algorithms. 14 candidate reference genes were assessed by RT-qPCR in two sample sets, i.e. a set of samples of human testicular tissue containing carcinoma in situ (CIS), and a set of samples from the human adult Sertoli cell line (FS1) either cultured alone or in co-culture with the seminoma like cell line (TCam-2) or with equine bone marrow derived mesenchymal stem cells (eBM-MSC). Expression stabilities of the reference genes were evaluated using geNorm, NormFinder, and BestKeeper. Similar results were obtained by the three approaches for the most and least stably expressed genes. The R-based packages NormqPCR, SLqPCR and the NormFinder for R script gave identical gene rankings. Interestingly, different outputs were obtained between the original software packages and the RefFinder tool, which is based on raw Cq values for input. When the raw data were reanalysed assuming 100% efficiency for all genes, then the outputs of the original software packages were similar to the RefFinder software, indicating that RefFinder outputs may be biased because PCR efficiencies are not taken into account. This report shows that assay efficiency is an important parameter for reference gene validation. New software tools that incorporate these algorithms should be carefully validated prior to use.

  18. SEGEL: A Web Server for Visualization of Smoking Effects on Human Lung Gene Expression.

    PubMed

    Xu, Yan; Hu, Brian; Alnajm, Sammy S; Lu, Yin; Huang, Yangxin; Allen-Gipson, Diane; Cheng, Feng

    2015-01-01

    Cigarette smoking is a major cause of death worldwide resulting in over six million deaths per year. Cigarette smoke contains complex mixtures of chemicals that are harmful to nearly all organs of the human body, especially the lungs. Cigarette smoking is considered the major risk factor for many lung diseases, particularly chronic obstructive pulmonary diseases (COPD) and lung cancer. However, the underlying molecular mechanisms of smoking-induced lung injury associated with these lung diseases still remain largely unknown. Expression microarray techniques have been widely applied to detect the effects of smoking on gene expression in different human cells in the lungs. These projects have provided a lot of useful information for researchers to understand the potential molecular mechanism(s) of smoke-induced pathogenesis. However, a user-friendly web server that would allow scientists to quickly query these data sets and compare the smoking effects on gene expression across different cells had not yet been established. For that reason, we have integrated eight public expression microarray data sets from trachea epithelial cells, large airway epithelial cells, small airway epithelial cells, and alveolar macrophage into an online web server called SEGEL (Smoking Effects on Gene Expression of Lung). Users can query gene expression patterns across these cells from smokers and nonsmokers by gene symbols, and find the effects of smoking on the gene expression of lungs from this web server. Sex difference in response to smoking is also shown. The relationship between gene expression and cigarette consumption was calculated and is shown in the server. The current version of SEGEL web server contains 42,400 annotated gene probe sets represented on the Affymetrix Human Genome U133 Plus 2.0 platform. SEGEL will be an invaluable resource for researchers interested in the effects of smoking on gene expression in the lungs. The server also provides useful information for drug development against smoking-related diseases. The SEGEL web server is available online at http://www.chengfeng.info/smoking_database.html.

  19. Divergent and convergent modes of interaction between wheat and Puccinia graminis f. sp. tritici isolates revealed by the comparative gene co-expression network and genome analyses.

    PubMed

    Rutter, William B; Salcedo, Andres; Akhunova, Alina; He, Fei; Wang, Shichen; Liang, Hanquan; Bowden, Robert L; Akhunov, Eduard

    2017-04-12

    Two opposing evolutionary constraints exert pressure on plant pathogens: one to diversify virulence factors in order to evade plant defenses, and the other to retain virulence factors critical for maintaining a compatible interaction with the plant host. To better understand how the diversified arsenals of fungal genes promote interaction with the same compatible wheat line, we performed a comparative genomic analysis of two North American isolates of Puccinia graminis f. sp. tritici (Pgt). The patterns of inter-isolate divergence in the secreted candidate effector genes were compared with the levels of conservation and divergence of plant-pathogen gene co-expression networks (GCN) developed for each isolate. Comparative genomic analyses revealed a substantial level of inter-isolate divergence in effector gene complement and sequence divergence. Gene Ontology (GO) analyses of the conserved and unique parts of the isolate-specific GCNs identified a number of conserved host pathways targeted by both isolates. Interestingly, the degree of inter-isolate sub-network conservation varied widely for the different host pathways and was positively associated with the proportion of conserved effector candidates associated with each sub-network. While different Pgt isolates tended to exploit similar wheat pathways for infection, the mode of plant-pathogen interaction varied for different pathways with some pathways being associated with the conserved set of effectors and others being linked with the diverged or isolate-specific effectors. Our data suggest that at the intra-species level pathogen populations likely maintain divergent sets of effectors capable of targeting the same plant host pathways. This functional redundancy may play an important role in the dynamic of the "arms-race" between host and pathogen serving as the basis for diverse virulence strategies and creating conditions where mutations in certain effector groups will not have a major effect on the pathogen's ability to infect the host.

  20. Learning Abilities and Disabilities: Generalist Genes, Specialist Environments.

    PubMed

    Kovas, Yulia; Plomin, Robert

    2007-10-01

    Twin studies comparing identical and fraternal twins consistently show substantial genetic influence on individual differences in learning abilities such as reading and mathematics, as well as in other cognitive abilities such as spatial ability and memory. Multivariate genetic research has shown that the same set of genes is largely responsible for genetic influence on these diverse cognitive areas. We call these "generalist genes." What differentiates these abilities is largely the environment, especially nonshared environments that make children growing up in the same family different from one another. These multivariate genetic findings of generalist genes and specialist environments have far-reaching implications for diagnosis and treatment of learning disabilities and for understanding the brain mechanisms that mediate these effects.

  1. MED: a new non-supervised gene prediction algorithm for bacterial and archaeal genomes.

    PubMed

    Zhu, Huaiqiu; Hu, Gang-Qing; Yang, Yi-Fan; Wang, Jin; She, Zhen-Su

    2007-03-16

    Despite remarkable success in the computational prediction of genes in Bacteria and Archaea, a lack of comprehensive understanding of prokaryotic gene structures prevents further elucidation of differences among genomes. It continues to be interesting to develop new ab initio algorithms which not only accurately predict genes, but also facilitate comparative studies of prokaryotic genomes. This paper describes a new prokaryotic gene-finding algorithm based on a comprehensive statistical model of protein coding Open Reading Frames (ORFs) and Translation Initiation Sites (TISs). The former is based on a linguistic "Entropy Density Profile" (EDP) model of coding DNA sequence and the latter comprises several relevant features related to the translation initiation. They are combined to form a so-called Multivariate Entropy Distance (MED) algorithm, MED 2.0, that incorporates several strategies in the iterative program. The iterations enable us to develop a non-supervised learning process and to obtain a set of genome-specific parameters for the gene structure, before making the prediction of genes. Results of extensive tests show that MED 2.0 achieves competitively high performance in gene prediction for both 5' and 3' end matches, compared to the current best prokaryotic gene finders. The advantage of MED 2.0 is particularly evident for GC-rich genomes and archaeal genomes. Furthermore, the genome-specific parameters given by MED 2.0 match the current understanding of prokaryotic genomes and may serve as tools for comparative genomic studies. In particular, MED 2.0 is shown to reveal divergent translation initiation mechanisms in archaeal genomes while making a more accurate prediction of TISs compared to the existing gene finders and the current GenBank annotation.

  2. Transcriptome analysis of genes and gene networks involved in aggressive behavior in mouse and zebrafish.

    PubMed

    Malki, Karim; Du Rietz, Ebba; Crusio, Wim E; Pain, Oliver; Paya-Cano, Jose; Karadaghi, Rezhaw L; Sluyter, Frans; de Boer, Sietse F; Sandnabba, Kenneth; Schalkwyk, Leonard C; Asherson, Philip; Tosto, Maria Grazia

    2016-09-01

    Despite moderate heritability estimates, the molecular architecture of aggressive behavior remains poorly characterized. This study compared gene expression profiles from a genetic mouse model of aggression with zebrafish, an animal model traditionally used to study aggression. A meta-analytic, cross-species approach was used to identify genomic variants associated with aggressive behavior. The Rankprod algorithm was used to evaluate mRNA differences from prefrontal cortex tissues of three sets of mouse lines (N = 18) selectively bred for low and high aggressive behavior (SAL/LAL, TA/TNA, and NC900/NC100). The same approach was used to evaluate mRNA differences in zebrafish (N = 12) exposed to aggressive or non-aggressive social encounters. Results were compared to uncover genes consistently implicated in aggression across both studies. Seventy-six genes were differentially expressed (PFP < 0.05) in aggressive compared to non-aggressive mice. Seventy genes were differentially expressed in zebrafish exposed to a fight encounter compared to isolated zebrafish. Seven genes (Fos, Dusp1, Hdac4, Ier2, Bdnf, Btg2, and Nr4a1) were differentially expressed across both species, five of which belong to a gene network centred on the c-Fos gene hub. Network analysis revealed an association with the MAPK signaling cascade. In human studies HDAC4 haploinsufficiency is a key genetic mechanism associated with brachydactyly mental retardation syndrome (BDMR), which is associated with aggressive behaviors. Moreover, the HDAC4 receptor is a drug target for valproic acid, which is being employed as an effective pharmacological treatment for aggressive behavior in geriatric, psychiatric, and brain-injury patients. © 2016 Wiley Periodicals, Inc.
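
    The rank-product idea behind the Rankprod analysis mentioned in this abstract can be illustrated with a minimal sketch. It assumes each gene has one log fold change per comparison and scores a gene by the geometric mean of its ranks. The gene labels and values are hypothetical toy data, and the sketch omits the permutation step that a full rank-product analysis uses to estimate the percentage of false positives (PFP). Ties are not treated specially.

```python
import math


def rank_products(fold_changes):
    """Geometric mean of per-comparison ranks (rank 1 = most up-regulated).

    fold_changes: dict mapping gene label -> list of log fold changes,
    one entry per comparison (all lists must have the same length).
    """
    genes = list(fold_changes)
    n_comparisons = len(next(iter(fold_changes.values())))
    log_rank_sums = {gene: 0.0 for gene in genes}
    for k in range(n_comparisons):
        # Rank genes within comparison k, largest fold change first.
        ordered = sorted(genes, key=lambda g: fold_changes[g][k], reverse=True)
        for rank, gene in enumerate(ordered, start=1):
            log_rank_sums[gene] += math.log(rank)
    return {g: math.exp(s / n_comparisons) for g, s in log_rank_sums.items()}


# Hypothetical toy data: three genes, three comparisons.
toy = {
    "g1": [2.1, 1.8, 2.5],
    "g2": [0.1, -0.2, 0.3],
    "g3": [1.0, 2.0, 0.9],
}
rp = rank_products(toy)
for gene in sorted(rp, key=rp.get):
    print(gene, round(rp[gene], 3))
```

    A small rank product means the gene sits near the top of the list in most comparisons, which is why the statistic suits small replicate numbers such as those in the study.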

  3. A powerful nonparametric method for detecting differentially co-expressed genes: distance correlation screening and edge-count test.

    PubMed

    Zhang, Qingyang

    2018-05-16

    Differential co-expression analysis, as a complement of differential expression analysis, offers significant insights into the changes in molecular mechanism of different phenotypes. A prevailing approach to detecting differentially co-expressed genes is to compare Pearson's correlation coefficients in two phenotypes. However, due to the limitations of Pearson's correlation measure, this approach lacks the power to detect nonlinear changes in gene co-expression which is common in gene regulatory networks. In this work, a new nonparametric procedure is proposed to search differentially co-expressed gene pairs in different phenotypes from large-scale data. Our computational pipeline consisted of two main steps, a screening step and a testing step. The screening step is to reduce the search space by filtering out all the independent gene pairs using distance correlation measure. In the testing step, we compare the gene co-expression patterns in different phenotypes by a recently developed edge-count test. Both steps are distribution-free and targeting nonlinear relations. We illustrate the promise of the new approach by analyzing the Cancer Genome Atlas data and the METABRIC data for breast cancer subtypes. Compared with some existing methods, the new method is more powerful in detecting nonlinear type of differential co-expressions. The distance correlation screening can greatly improve computational efficiency, facilitating its application to large data sets.
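
    To make the screening step more concrete, here is a minimal sketch of the sample distance correlation between two expression vectors, using the standard double-centered distance-matrix definition. It is an illustration under that assumption, not the author's pipeline, and it does not implement the edge-count test. The input vectors are hypothetical.

```python
import math


def _centered_distances(values):
    """Pairwise absolute distances, double-centered by row, column and grand mean."""
    n = len(values)
    d = [[abs(a - b) for b in values] for a in values]
    row_means = [sum(row) / n for row in d]
    grand_mean = sum(row_means) / n
    # The distance matrix is symmetric, so row means equal column means.
    return [
        [d[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def distance_correlation(x, y):
    """Sample distance correlation of two equal-length numeric sequences."""
    n = len(x)
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n**2
    dvar_x = sum(a[i][j] ** 2 for i in range(n) for j in range(n)) / n**2
    dvar_y = sum(b[i][j] ** 2 for i in range(n) for j in range(n)) / n**2
    denom = math.sqrt(dvar_x * dvar_y)
    if denom == 0.0:
        return 0.0  # at least one variable is constant
    return math.sqrt(max(dcov2, 0.0) / denom)


# Hypothetical example: a symmetric parabola has zero Pearson correlation
# with x, yet its distance correlation is clearly positive.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y_quadratic = [v * v for v in x]
print(distance_correlation(x, y_quadratic))
print(distance_correlation(x, x))
```

    Gene pairs whose distance correlation is near zero in every phenotype would be the ones a screening step of this kind filters out.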

  4. Use of the Ion PGM and the GeneReader NGS Systems in Daily Routine Practice for Advanced Lung Adenocarcinoma Patients: A Practical Point of View Reporting a Comparative Study and Assessment of 90 Patients.

    PubMed

    Heeke, Simon; Hofman, Véronique; Long-Mira, Elodie; Lespinet, Virginie; Lalvée, Salomé; Bordone, Olivier; Ribeyre, Camille; Tanga, Virginie; Benzaquen, Jonathan; Leroy, Sylvie; Cohen, Charlotte; Mouroux, Jérôme; Marquette, Charles Hugo; Ilié, Marius; Hofman, Paul

    2018-03-21

    Background: With the integration of various targeted therapies into the clinical management of patients with advanced lung adenocarcinoma, next-generation sequencing (NGS) has become the technology of choice and has led to an increase in simultaneously interrogated genes. However, the broader adoption of NGS for routine clinical practice is still hampered by sophisticated workflows, complex bioinformatics analysis and medical interpretation. Therefore, the performance of the novel QIAGEN GeneReader NGS system was compared to an in-house ISO-15189 certified Ion PGM NGS platform. Methods: Clinical samples from 90 patients (60 retrospectively and 30 prospectively) with lung adenocarcinoma were sequenced with both systems. Mutations were analyzed and EGFR, KRAS, BRAF, NRAS, ALK, PIK3CA and ERBB2 genes were compared and sampling time and suitability for clinical testing were assessed. Results: Both sequencing systems showed perfect concordance for the overlapping genes. Correlation of allele frequency was r² = 0.93 for the retrospective patients and r² = 0.81 for the prospective patients. Hands-on time and total run time were shorter using the PGM system, while the GeneReader platform provided good traceability and up-to-date interpretation of the results. Conclusion: We demonstrated the suitability of the GeneReader NGS system in routine practice in a clinical pathology laboratory setting.

  5. The cure: design and evaluation of a crowdsourcing game for gene selection for breast cancer survival prediction.

    PubMed

    Good, Benjamin M; Loguercio, Salvatore; Griffith, Obi L; Nanis, Max; Wu, Chunlei; Su, Andrew I

    2014-07-29

    Molecular signatures for predicting breast cancer prognosis could greatly improve care through personalization of treatment. Computational analyses of genome-wide expression datasets have identified such signatures, but these signatures leave much to be desired in terms of accuracy, reproducibility, and biological interpretability. Methods that take advantage of structured prior knowledge (eg, protein interaction networks) show promise in helping to define better signatures, but most knowledge remains unstructured. Crowdsourcing via scientific discovery games is an emerging methodology that has the potential to tap into human intelligence at scales and in modes unheard of before. The main objective of this study was to test the hypothesis that knowledge linking expression patterns of specific genes to breast cancer outcomes could be captured from players of an open, Web-based game. We envisioned capturing knowledge both from the player's prior experience and from their ability to interpret text related to candidate genes presented to them in the context of the game. We developed and evaluated an online game called The Cure that captured information from players regarding genes for use as predictors of breast cancer survival. Information gathered from game play was aggregated using a voting approach, and used to create rankings of genes. The top genes from these rankings were evaluated using annotation enrichment analysis, comparison to prior predictor gene sets, and by using them to train and test machine learning systems for predicting 10 year survival. Between its launch in September 2012 and September 2013, The Cure attracted more than 1000 registered players, who collectively played nearly 10,000 games. Gene sets assembled through aggregation of the collected data showed significant enrichment for genes known to be related to key concepts such as cancer, disease progression, and recurrence. 
In terms of the predictive accuracy of models trained using this information, these gene sets provided comparable performance to gene sets generated using other methods, including those used in commercial tests. The Cure is available on the Internet. The principal contribution of this work is to show that crowdsourcing games can be developed as a means to address problems involving domain knowledge. While most prior work on scientific discovery games and crowdsourcing in general takes as a premise that contributors have little or no expertise, here we demonstrated a crowdsourcing system that succeeded in capturing expert knowledge.

  6. Enrichment of Circular Code Motifs in the Genes of the Yeast Saccharomyces cerevisiae.

    PubMed

    Michel, Christian J; Ngoune, Viviane Nguefack; Poch, Olivier; Ripp, Raymond; Thompson, Julie D

    2017-12-03

    A set X of 20 trinucleotides has been found to have the highest average occurrence in the reading frame, compared to the two shifted frames, of genes of bacteria, archaea, eukaryotes, plasmids and viruses. This set X has an interesting mathematical property, since X is a maximal C3 self-complementary trinucleotide circular code. Furthermore, any motif obtained from this circular code X has the capacity to retrieve, maintain and synchronize the original (reading) frame. Since 1996, the theory of circular codes in genes has mainly been developed by analysing the properties of the 20 trinucleotides of X, using combinatorics and statistical approaches. For the first time, we test this theory by analysing the X motifs, i.e., motifs from the circular code X, in the complete genome of the yeast Saccharomyces cerevisiae. Several properties of X motifs are identified by basic statistics (at the frequency level), and evaluated by comparison to R motifs, i.e., random motifs generated from 30 different random codes R. We first show that the frequency of X motifs is significantly greater than that of R motifs in the genome of S. cerevisiae. We then verify that no significant difference is observed between the frequencies of X and R motifs in the non-coding regions of S. cerevisiae, but that the occurrence number of X motifs is significantly higher than R motifs in the genes (protein-coding regions). This property is true for all cardinalities of X motifs (from 4 to 20) and for all 16 chromosomes. We further investigate the distribution of X motifs in the three frames of S. cerevisiae genes and show that they occur more frequently in the reading frame, regardless of their cardinality or their length. Finally, the ratio of X genes, i.e., genes with at least one X motif, to non-X genes, in the set of verified genes is significantly different to that observed in the set of putative or dubious genes with no experimental evidence. 
These results, taken together, represent the first evidence for a significant enrichment of X motifs in the genes of an extant organism. They raise two hypotheses: the X motifs may be evolutionary relics of the primitive codes used for translation, or they may continue to play a functional role in the complex processes of genome decoding and protein synthesis.
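
    The frame-wise comparison described in this abstract can be sketched as a simple count of how many trinucleotides from a given set occur in each of the three frames of a sequence. This is only an illustration: the two-element set and the sequence below are hypothetical placeholders, not the 20-trinucleotide code X, and the sketch counts single trinucleotides rather than the multi-trinucleotide X motifs the authors analyse.

```python
def frame_counts(sequence, code):
    """Count trinucleotides belonging to `code` in frames 0, 1 and 2.

    Frame 0 is taken to be the reading frame; frames 1 and 2 are the
    two shifted frames.
    """
    counts = []
    for frame in range(3):
        n = 0
        # Step through the sequence three bases at a time from this offset.
        for i in range(frame, len(sequence) - 2, 3):
            if sequence[i:i + 3] in code:
                n += 1
        counts.append(n)
    return counts


# Hypothetical placeholder set and sequence, for illustration only.
toy_code = {"GAA", "GTC"}
toy_gene = "GAAGTCGAAGTC"
print(frame_counts(toy_gene, toy_code))  # [4, 0, 0]
```

    Comparing such counts against those obtained with randomly generated trinucleotide sets is the kind of contrast the abstract draws between X motifs and R motifs.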

  7. Dose-response assessment for influenza A virus based on data sets of infection with its live attenuated reassortants.

    PubMed

    Watanabe, Toru; Bartrand, Timothy A; Omura, Tatsuo; Haas, Charles N

    2012-03-01

    Reported data sets on infection of volunteers challenged with wild-type influenza A virus at graded doses are few. As an alternative, we aimed at developing a dose-response assessment for this virus based on the data sets for its live attenuated reassortants. Eleven data sets for live attenuated reassortants were fit to beta-Poisson and exponential dose-response models. Dose-response relationships for those reassortants were characterized by pooling analysis of the data sets with respect to virus subtype (H1N1 or H3N2), attenuation method (cold-adapted or avian-human gene reassortment), and human age (adults or children). Furthermore, by comparing the above data sets to a limited number of reported data sets for wild-type virus, we quantified the degree of attenuation of wild-type virus with gene reassortment and estimated its infectivity. As a result, dose-response relationships of all reassortants were best described by a beta-Poisson model. Virus subtype and human age were significant factors determining the dose-response relationship, whereas attenuation method affected only the relationship of H1N1 virus infection to adults. The data sets for H3N2 wild-type virus could be pooled with those for its reassortants on the assumption that the gene reassortment attenuates wild-type virus by at least 63 times and most likely 1,070 times. Considering this most likely degree of attenuation, the 10% infectious dose of H3N2 wild-type virus for adults was estimated at 18 TCID50 (95% CI = 8.8-35 TCID50). The infectivity of wild-type H1N1 virus remains unknown as the data set pooling was unsuccessful. © 2011 Society for Risk Analysis.
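
    For readers unfamiliar with the two model families named in this abstract, the sketch below writes out the exponential model and the commonly used approximate beta-Poisson model, plus the dose that yields a chosen infection probability under the latter. The parameter values are hypothetical and are not the fitted values from the study; the abstract does not state which beta-Poisson parameterization the authors used, so the approximate form is an assumption.

```python
import math


def exponential_model(dose, r):
    """P(infection) = 1 - exp(-r * dose)."""
    return 1.0 - math.exp(-r * dose)


def beta_poisson_model(dose, alpha, beta):
    """Approximate beta-Poisson: P(infection) = 1 - (1 + dose / beta) ** (-alpha)."""
    return 1.0 - (1.0 + dose / beta) ** (-alpha)


def beta_poisson_dose_for_risk(p, alpha, beta):
    """Invert the approximate beta-Poisson model for the dose giving risk p."""
    return beta * ((1.0 - p) ** (-1.0 / alpha) - 1.0)


# Hypothetical parameters, chosen only to show the calculation.
alpha, beta = 0.5, 100.0
id10 = beta_poisson_dose_for_risk(0.10, alpha, beta)
print(round(id10, 2))
print(round(beta_poisson_model(id10, alpha, beta), 4))
```

    A 10% infectious dose such as the one reported in the abstract corresponds to evaluating this inverse at p = 0.10 with the fitted parameters.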

  8. CORE_TF: a user-friendly interface to identify evolutionary conserved transcription factor binding sites in sets of co-regulated genes

    PubMed Central

    Hestand, Matthew S; van Galen, Michiel; Villerius, Michel P; van Ommen, Gert-Jan B; den Dunnen, Johan T; 't Hoen, Peter AC

    2008-01-01

    Background The identification of transcription factor binding sites is difficult since they are only a small number of nucleotides in size, resulting in large numbers of false positives and false negatives in current approaches. Computational methods to reduce false positives are to look for over-representation of transcription factor binding sites in a set of similarly regulated promoters or to look for conservation in orthologous promoter alignments. Results We have developed a novel tool, "CORE_TF" (Conserved and Over-REpresented Transcription Factor binding sites) that identifies common transcription factor binding sites in promoters of co-regulated genes. To improve upon existing binding site predictions, the tool searches for position weight matrices from the TRANSFAC® database that are over-represented in an experimental set compared to a random set of promoters and identifies cross-species conservation of the predicted transcription factor binding sites. The algorithm has been evaluated with expression and chromatin-immunoprecipitation on microarray data. We also implement and demonstrate the importance of matching the random set of promoters to the experimental promoters by GC content, which is a unique feature of our tool. Conclusion The program CORE_TF is accessible in a user-friendly web interface at . It provides a table of over-represented transcription factor binding sites in the promoters of the user's input genes and a graphical view of evolutionarily conserved transcription factor binding sites. In our test data sets it successfully predicts target transcription factors and their binding sites. PMID:19036135

  9. Systems biology definition of the core proteome of metabolism and expression is consistent with high-throughput data.

    PubMed

    Yang, Laurence; Tan, Justin; O'Brien, Edward J; Monk, Jonathan M; Kim, Donghyuk; Li, Howard J; Charusanti, Pep; Ebrahim, Ali; Lloyd, Colton J; Yurkovich, James T; Du, Bin; Dräger, Andreas; Thomas, Alex; Sun, Yuekai; Saunders, Michael A; Palsson, Bernhard O

    2015-08-25

    Finding the minimal set of gene functions needed to sustain life is of both fundamental and practical importance. Minimal gene lists have been proposed by using comparative genomics-based core proteome definitions. A definition of a core proteome that is supported by empirical data, is understood at the systems-level, and provides a basis for computing essential cell functions is lacking. Here, we use a systems biology-based genome-scale model of metabolism and expression to define a functional core proteome consisting of 356 gene products, accounting for 44% of the Escherichia coli proteome by mass based on proteomics data. This systems biology core proteome includes 212 genes not found in previous comparative genomics-based core proteome definitions, accounts for 65% of known essential genes in E. coli, and has 78% gene function overlap with minimal genomes (Buchnera aphidicola and Mycoplasma genitalium). Based on transcriptomics data across environmental and genetic backgrounds, the systems biology core proteome is significantly enriched in nondifferentially expressed genes and depleted in differentially expressed genes. Compared with the noncore, core gene expression levels are also similar across genetic backgrounds (two times higher Spearman rank correlation) and exhibit significantly more complex transcriptional and posttranscriptional regulatory features (40% more transcription start sites per gene, 22% longer 5'UTR). Thus, genome-scale systems biology approaches rigorously identify a functional core proteome needed to support growth. This framework, validated by using high-throughput datasets, facilitates a mechanistic understanding of systems-level core proteome function through in silico models; it de facto defines a paleome.

  10. Comparative genomics among Saccharomyces cerevisiae × Saccharomyces kudriavzevii natural hybrid strains isolated from wine and beer reveals different origins

    PubMed Central

    2012-01-01

    Background Interspecific hybrids between S. cerevisiae × S. kudriavzevii have frequently been detected in wine and beer fermentations. Significant physiological differences among parental and hybrid strains under different stress conditions have been evidenced. In this study, we used comparative genome hybridization analysis to evaluate the genome composition of different S. cerevisiae × S. kudriavzevii natural hybrids isolated from wine and beer fermentations to infer their evolutionary origins and to figure out the potential role of the common S. kudriavzevii gene fraction present in these hybrids. Results Comparative genomic hybridization (CGH) and ploidy analyses carried out in this study confirmed the presence of individual and differential chromosomal composition patterns for most S. cerevisiae × S. kudriavzevii hybrids from beer and wine. All hybrids share a common set of depleted S. cerevisiae genes, which also are depleted or absent in the wine strains studied so far, and the presence of a common set of S. kudriavzevii genes, which may be associated with their capability to grow at low temperatures. Finally, a maximum parsimony analysis of the chromosomal rearrangement events that occurred in the hybrid genomes indicated the presence of two main groups of wine hybrids and different divergent lineages of brewing strains. Conclusion Our data suggest that wine and beer S. cerevisiae × S. kudriavzevii hybrids originated from different rare-mating events involving a diploid wine S. cerevisiae strain and a haploid or diploid European S. kudriavzevii strain. Hybrids maintain several S. kudriavzevii genes involved in cold adaptation as well as those related to S. kudriavzevii mitochondrial functions. PMID:22906207

  11. Decreased Nucleotide and Expression Diversity and Modified Coexpression Patterns Characterize Domestication in the Common Bean

    PubMed Central

    Bellucci, Elisa; Bitocchi, Elena; Ferrarini, Alberto; Benazzo, Andrea; Biagetti, Eleonora; Klie, Sebastian; Minio, Andrea; Rau, Domenico; Rodriguez, Monica; Panziera, Alex; Venturini, Luca; Attene, Giovanna; Albertini, Emidio; Jackson, Scott A.; Nanni, Laura; Fernie, Alisdair R.; Nikoloski, Zoran; Bertorelle, Giorgio; Delledonne, Massimo; Papa, Roberto

    2014-01-01

    Using RNA sequencing technology and de novo transcriptome assembly, we compared representative sets of wild and domesticated accessions of common bean (Phaseolus vulgaris) from Mesoamerica. RNA was extracted at the first true-leaf stage, and de novo assembly was used to develop a reference transcriptome; the final data set consists of ∼190,000 single nucleotide polymorphisms from 27,243 contigs in expressed genomic regions. A drastic reduction in nucleotide diversity (∼60%) is evident for the domesticated form, compared with the wild form, and almost 50% of the contigs that are polymorphic were brought to fixation by domestication. In parallel, the effects of domestication decreased the diversity of gene expression (18%). While the coexpression networks for the wild and domesticated accessions demonstrate similar seminal network properties, they show distinct community structures that are enriched for different molecular functions. After simulating the demographic dynamics during domestication, we found that 9% of the genes were actively selected during domestication. We also show that selection induced a further reduction in the diversity of gene expression (26%) and was associated with 5-fold enrichment of differentially expressed genes. While there is substantial evidence of positive selection associated with domestication, in a few cases, this selection has increased the nucleotide diversity in the domesticated pool at target loci associated with abiotic stress responses, flowering time, and morphology. PMID:24850850
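
    A reduction in nucleotide diversity of the kind reported here can be illustrated with a minimal sketch that computes the average number of pairwise differences per site for a set of aligned sequences. The abstract does not say which diversity estimator the authors used, so treating it as pairwise diversity is an assumption, and the short sequences below are hypothetical toy alignments rather than bean data.

```python
from itertools import combinations


def nucleotide_diversity(sequences):
    """Average pairwise differences per site among equal-length aligned sequences."""
    length = len(sequences[0])
    pairs = list(combinations(sequences, 2))
    total_differences = sum(
        sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs
    )
    return total_differences / (len(pairs) * length)


# Hypothetical toy alignments (three sequences of four sites each).
wild = ["ACGT", "ACGA", "TCGA"]
domesticated = ["ACGT", "ACGT", "ACGA"]

pi_wild = nucleotide_diversity(wild)
pi_dom = nucleotide_diversity(domesticated)
reduction = 1.0 - pi_dom / pi_wild
print(round(pi_wild, 4), round(pi_dom, 4), round(reduction, 2))
```

    In this toy case the domesticated set retains half the diversity of the wild set; the study reports the analogous comparison across 27,243 contigs.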

  12. In silico analysis of stomach lineage specific gene set expression pattern in gastric cancer.

    PubMed

    Pandi, Narayanan Sathiya; Suganya, Sivagurunathan; Rajendran, Suriliyandi

    2013-10-04

    Stomach lineage specific gene products act as a protective barrier in the normal stomach and their expression maintains the normal physiological processes, cellular integrity and morphology of the gastric wall. However, the regulation of stomach lineage specific genes in gastric cancer (GC) is far less clear. In the present study, we sought to investigate the role and regulation of stomach lineage specific gene set (SLSGS) in GC. SLSGS was identified by comparing the mRNA expression profiles of normal stomach tissue with other organ tissue. The obtained SLSGS was found to be under expressed in gastric tumors. Functional annotation analysis revealed that the SLSGS was enriched for digestive function and gastric epithelial maintenance. Employing a single sample prediction method across GC mRNA expression profiles identified the under expression of SLSGS in proliferative type and invasive type gastric tumors compared to the metabolic type gastric tumors. Integrative pathway activation prediction analysis revealed a close association between estrogen-α signaling and SLSGS expression pattern in GC. Elevated expression of SLSGS in GC is associated with an overall increase in the survival of GC patients. In conclusion, our results highlight that estrogen mediated regulation of SLSGS in gastric tumor is a molecular predictor of metabolic type GC and prognostic factor in GC. Copyright © 2013 Elsevier Inc. All rights reserved.

  13. Comparative Life Cycle Transcriptomics Revises Leishmania mexicana Genome Annotation and Links a Chromosome Duplication with Parasitism of Vertebrates

    PubMed Central

    Fiebig, Michael; Kelly, Steven; Gluenz, Eva

    2015-01-01

    Leishmania spp. are protozoan parasites that have two principal life cycle stages: the motile promastigote forms that live in the alimentary tract of the sandfly and the amastigote forms, which are adapted to survive and replicate in the harsh conditions of the phagolysosome of mammalian macrophages. Here, we used Illumina sequencing of poly-A selected RNA to characterise and compare the transcriptomes of L. mexicana promastigotes, axenic amastigotes and intracellular amastigotes. These data allowed the production of the first transcriptome evidence-based annotation of gene models for this species, including genome-wide mapping of trans-splice sites and poly-A addition sites. The revised genome annotation encompassed 9,169 protein-coding genes including 936 novel genes as well as modifications to previously existing gene models. Comparative analysis of gene expression across promastigote and amastigote forms revealed that 3,832 genes are differentially expressed between promastigotes and intracellular amastigotes. A large proportion of genes that were downregulated during differentiation to amastigotes were associated with the function of the motile flagellum. In contrast, those genes that were upregulated included cell surface proteins, transporters, peptidases and many uncharacterized genes, including 293 of the 936 novel genes. Genome-wide distribution analysis of the differentially expressed genes revealed that the tetraploid chromosome 30 is highly enriched for genes that were upregulated in amastigotes, providing the first evidence of a link between this whole chromosome duplication event and adaptation to the vertebrate host in this group. Peptide evidence for 42 proteins encoded by novel transcripts supports the idea of an as yet uncharacterised set of small proteins in Leishmania spp. with possible implications for host-pathogen interactions. PMID:26452044

  14. Robust extraction of functional signals from gene set analysis using a generalized threshold free scoring function

    PubMed Central

    2009-01-01

    Background A central task in contemporary biosciences is the identification of biological processes showing response in genome-wide differential gene expression experiments. Two types of analysis are common. Either one generates an ordered list based on the differential expression values of the probed genes and examines the tail areas of the list for over-representation of various functional classes, or one monitors the average differential expression level of genes belonging to a given functional class. So far these two types of method have not been combined. Results We introduce a scoring function, Gene Set Z-score (GSZ), for the analysis of functional class over-representation that combines two previous analysis methods. GSZ encompasses popular functions such as correlation, hypergeometric test, Max-Mean and Random Sets as limiting cases. GSZ is stable against changes in class size as well as across different positions of the analysed gene list in tests with randomized data. GSZ shows the best overall performance in a detailed comparison to popular functions using artificial data. Likewise, GSZ stands out in a cross-validation of methods using split real data. A comparison of empirical p-values further shows a strong difference in favour of GSZ, which clearly reports better p-values for top classes than the other methods. Furthermore, GSZ detects relevant biological themes that are missed by the other methods. These observations also hold when comparing GSZ with popular program packages. Conclusion GSZ and improved versions of earlier methods are a useful contribution to the analysis of differential gene expression. The methods and supplementary material are available from the website http://ekhidna.biocenter.helsinki.fi/users/petri/public/GSZ/GSZscore.html. PMID:19775443
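
    The first type of analysis described in this abstract, testing the top of an ordered gene list for over-representation of a functional class, is often done with a hypergeometric test, which the abstract names as one limiting case of GSZ. The sketch below shows that test only, not the GSZ score itself, and all counts are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(total_genes, class_size, list_size, overlap):
    """P(X >= overlap) when list_size genes are drawn without replacement
    from total_genes, of which class_size belong to the functional class."""
    denominator = comb(total_genes, list_size)
    numerator = sum(
        comb(class_size, i) * comb(total_genes - class_size, list_size - i)
        for i in range(overlap, min(class_size, list_size) + 1)
    )
    return numerator / denominator


# Hypothetical counts: 10 genes in total, 3 in the class, top list of 3,
# all 3 of which belong to the class.
p_value = hypergeom_upper_tail(10, 3, 3, 3)
print(p_value)
```

    A test of this kind depends on where the list is cut, which is the threshold dependence that a threshold-free score is meant to avoid.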

  15. Distinct transcriptional MYCN/c-MYC activities are associated with spontaneous regression or malignant progression in neuroblastomas

    PubMed Central

    Westermann, Frank; Muth, Daniel; Benner, Axel; Bauer, Tobias; Henrich, Kai-Oliver; Oberthuer, André; Brors, Benedikt; Beissbarth, Tim; Vandesompele, Jo; Pattyn, Filip; Hero, Barbara; König, Rainer; Fischer, Matthias; Schwab, Manfred

    2008-01-01

    Background Amplified MYCN oncogene resulting in deregulated MYCN transcriptional activity is observed in 20% of neuroblastomas and identifies a highly aggressive subtype. In MYCN single-copy neuroblastomas, elevated MYCN mRNA and protein levels are paradoxically associated with a more favorable clinical phenotype, including disseminated tumors that subsequently regress spontaneously (stage 4s-non-amplified). In this study, we asked whether distinct transcriptional MYCN or c-MYC activities are associated with specific neuroblastoma phenotypes. Results We defined a core set of direct MYCN/c-MYC target genes by applying gene expression profiling and chromatin immunoprecipitation (ChIP, ChIP-chip) in neuroblastoma cells that allow conditional regulation of MYCN and c-MYC. Their transcript levels were analyzed in 251 primary neuroblastomas. Compared to localized-non-amplified neuroblastomas, MYCN/c-MYC target gene expression gradually increases from stage 4s-non-amplified through stage 4-non-amplified to MYCN amplified tumors. This was associated with MYCN activation in stage 4s-non-amplified and predominantly c-MYC activation in stage 4-non-amplified tumors. A defined set of MYCN/c-MYC target genes was induced in stage 4-non-amplified but not in stage 4s-non-amplified neuroblastomas. In line with this, high expression of a subset of MYCN/c-MYC target genes identifies a patient subtype with poor overall survival independent of the established risk markers amplified MYCN, disease stage, and age at diagnosis. Conclusions High MYCN/c-MYC target gene expression is a hallmark of malignant neuroblastoma progression, which is predominantly driven by c-MYC in stage 4-non-amplified tumors. In contrast, moderate MYCN function gain in stage 4s-non-amplified tumors induces only a restricted set of target genes that is still compatible with spontaneous regression. PMID:18851746

  16. The Bos taurus–Bos indicus balance in fertility and milk related genes

    PubMed Central

    Lehnert, Sigrid A.; Mudadu, Mauricio A.; Coutinho, Luiz; Regitano, Luciana; George, Andrew; Reverter, Antonio

    2017-01-01

    Numerical approaches to high-density single nucleotide polymorphism (SNP) data are often employed independently to address individual questions. We linked independent approaches in a bioinformatics pipeline for further insight. The pipeline driven by heterozygosity and Hardy-Weinberg equilibrium (HWE) analyses was applied to characterize Bos taurus and Bos indicus ancestry. We infer a gene co-heterozygosity network that regulates bovine fertility, from data on 18,363 cattle with genotypes for 729,068 SNP. Hierarchical clustering separated populations according to Bos taurus and Bos indicus ancestry. The weights of the first principal component were subjected to Normal mixture modelling allowing the estimation of a gene’s contribution to the Bos taurus-Bos indicus axis. We used deviation from HWE, contribution to Bos indicus content and association to fertility traits to select 1,284 genes. With this set, we developed a co-heterozygosity network where the group of genes annotated as fertility-related had significantly higher Bos indicus content compared to other functional classes of genes, while the group of genes associated with milk production had significantly higher Bos taurus content. The network analysis resulted in capturing novel gene associations of relevance to bovine domestication events. We report transcription factors that are likely to regulate genes associated with cattle domestication and tropical adaptation. Our pipeline can be generalized to any scenario where population structure requires scrutiny at the molecular level, particularly in the presence of an a priori set of genes known to impact a phenotype of evolutionary interest such as fertility. PMID:28763475
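
    Deviation from Hardy-Weinberg equilibrium at a single biallelic SNP can be quantified with a one-degree-of-freedom chi-square test, sketched below. The abstract does not say which HWE statistic the pipeline uses, so this textbook test is an assumption, and the genotype counts are hypothetical.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) for Hardy-Weinberg equilibrium at a biallelic SNP.

    Returns (chi2, p_value). Monomorphic SNPs return (0.0, 1.0) because
    the test is undefined when an expected count is zero.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    if min(expected) == 0.0:
        return 0.0, 1.0
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical genotype counts.
print(hwe_chi_square(25, 50, 25))  # genotypes exactly at HWE proportions
print(hwe_chi_square(50, 0, 50))   # complete absence of heterozygotes
```

    A heterozygote deficit like the second case is what one would expect at a SNP whose alleles are nearly fixed in different subpopulations, such as taurine and indicine ancestry groups pooled together.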

  17. Complete Chloroplast Genome Sequence of Holoparasite Cistanche deserticola (Orobanchaceae) Reveals Gene Loss and Horizontal Gene Transfer from Its Host Haloxylon ammodendron (Chenopodiaceae)

    PubMed Central

    Qiao, Qin; Ren, Zhumei; Zhao, Jiayuan; Yonezawa, Takahiro; Hasegawa, Masami; Crabbe, M. James C; Li, Jianqiang; Zhong, Yang

    2013-01-01

    Background The central function of chloroplasts is to carry out photosynthesis, and its gene content and structure are highly conserved across land plants. Parasitic plants, which have reduced photosynthetic ability, suffer gene losses from the chloroplast (cp) genome accompanied by the relaxation of selective constraints. Compared with the rapid rise in the number of cp genome sequences of photosynthetic organisms, there are limited data sets from parasitic plants. Principal Findings/Significance Here we report the complete sequence of the cp genome of Cistanche deserticola, a holoparasitic desert species belonging to the family Orobanchaceae. The cp genome of C. deserticola is greatly reduced both in size (102,657 bp) and in gene content, indicating that all genes required for photosynthesis suffer from gene loss and pseudogenization, except for psbM. The striking difference from other holoparasitic plants is that it retains almost a full set of tRNA genes, and it has lower dN/dS for most genes than another close holoparasitic plant, E. virginiana, suggesting that Cistanche deserticola has undergone fewer losses, either due to a reduced level of holoparasitism, or to a recent switch to this life history. We also found that the rpoC2 gene was present in two copies within C. deserticola. Its own copy has much shortened and turned out to be a pseudogene. Another copy, which was not located in its cp genome, was a homolog of the host plant, Haloxylon ammodendron (Chenopodiaceae), suggesting that it was acquired from its host via a horizontal gene transfer. PMID:23554920

  18. Transcriptional profiling of the host cell response to feline immunodeficiency virus infection.

    PubMed

    Ertl, Reinhard; Klein, Dieter

    2014-03-19

    Feline immunodeficiency virus (FIV) is a widespread pathogen of the domestic cat and an important animal model for human immunodeficiency virus (HIV) research. In contrast to HIV, only limited information is available on the transcriptional host cell response to FIV infections. This study aims to identify FIV-induced gene expression changes in feline T-cells during the early phase of the infection. Illumina RNA-sequencing (RNA-seq) was used to identify differentially expressed genes (DEGs) at 24 h after FIV infection. After removal of low-quality reads, the remaining sequencing data were mapped against the cat genome and the numbers of mapping reads were counted for each gene. Regulated genes were identified through the comparison of FIV and mock-infected data sets. After statistical analysis and the removal of genes with insufficient coverage, we detected a total of 69 significant DEGs (44 up- and 25 down-regulated genes) upon FIV infection. The results obtained by RNA-seq were validated by reverse transcription qPCR analysis for 10 genes. Out of the most distinct DEGs identified in this study, several genes are already known to interact with HIV in humans, indicating comparable effects of both viruses on the host cell gene expression and, furthermore, highlighting the importance of FIV as a model system for HIV. In addition, a set of new genes not previously linked to virus infections could be identified. The provided list of virus-induced genes may represent useful information for future studies focusing on the molecular mechanisms of virus-host interactions in FIV pathogenesis.

  19. Complete chloroplast genome sequence of holoparasite Cistanche deserticola (Orobanchaceae) reveals gene loss and horizontal gene transfer from its host Haloxylon ammodendron (Chenopodiaceae).

    PubMed

    Li, Xi; Zhang, Ti-Cao; Qiao, Qin; Ren, Zhumei; Zhao, Jiayuan; Yonezawa, Takahiro; Hasegawa, Masami; Crabbe, M James C; Li, Jianqiang; Zhong, Yang

    2013-01-01

    The central function of chloroplasts is to carry out photosynthesis, and its gene content and structure are highly conserved across land plants. Parasitic plants, which have reduced photosynthetic ability, suffer gene losses from the chloroplast (cp) genome accompanied by the relaxation of selective constraints. Compared with the rapid rise in the number of cp genome sequences of photosynthetic organisms, there are limited data sets from parasitic plants. PRINCIPAL FINDINGS/SIGNIFICANCE: Here we report the complete sequence of the cp genome of Cistanche deserticola, a holoparasitic desert species belonging to the family Orobanchaceae. The cp genome of C. deserticola is greatly reduced both in size (102,657 bp) and in gene content, indicating that all genes required for photosynthesis suffer from gene loss and pseudogenization, except for psbM. The striking difference from other holoparasitic plants is that it retains almost a full set of tRNA genes, and it has lower dN/dS for most genes than another close holoparasitic plant, E. virginiana, suggesting that Cistanche deserticola has undergone fewer losses, either due to a reduced level of holoparasitism, or to a recent switch to this life history. We also found that the rpoC2 gene was present in two copies within C. deserticola. Its own copy has much shortened and turned out to be a pseudogene. Another copy, which was not located in its cp genome, was a homolog of the host plant, Haloxylon ammodendron (Chenopodiaceae), suggesting that it was acquired from its host via a horizontal gene transfer.

  20. A support vector machine based test for incongruence between sets of trees in tree space

    PubMed Central

    2012-01-01

    Background The increased use of multi-locus data sets for phylogenetic reconstruction has increased the need to determine whether a set of gene trees significantly deviates from the phylogenetic patterns of other genes. Such unusual gene trees may have been influenced by other evolutionary processes such as selection, gene duplication, or horizontal gene transfer. Results Motivated by this problem we propose a nonparametric goodness-of-fit test for two empirical distributions of gene trees, and we developed the software GeneOut to estimate a p-value for the test. Our approach maps trees into a multi-dimensional vector space and then applies support vector machines (SVMs) to measure the separation between two sets of pre-defined trees. We use a permutation test to assess the significance of the SVM separation. To demonstrate the performance of GeneOut, we applied it to the comparison of gene trees simulated within different species trees across a range of species tree depths. Applied directly to sets of simulated gene trees with large sample sizes, GeneOut was able to detect very small differences between two sets of gene trees generated under different species trees. Our statistical test can also include tree reconstruction into its test framework through a variety of phylogenetic optimality criteria. When applied to DNA sequence data simulated from different sets of gene trees, results in the form of receiver operating characteristic (ROC) curves indicated that GeneOut performed well in the detection of differences between sets of trees with different distributions in a multi-dimensional space. Furthermore, it controlled false positive and false negative rates very well, indicating a high degree of accuracy. Conclusions The non-parametric nature of our statistical test provides fast and efficient analyses, and makes it an applicable test for any scenario where evolutionary or other factors can lead to trees with different multi-dimensional distributions. The software GeneOut is freely available under the GNU public license. PMID:22909268
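
    The permutation step described in this abstract can be sketched as follows. This is not GeneOut's code: the trees are assumed to be already mapped to numeric vectors, and the SVM-based separation measure is replaced by a much simpler stand-in (the distance between group centroids) so that the example needs only the standard library. All names are hypothetical.

```python
import random


def centroid_distance(group_a, group_b):
    """Euclidean distance between group centroids (stand-in for an SVM separation)."""
    dim = len(group_a[0])
    mean_a = [sum(v[i] for v in group_a) / len(group_a) for i in range(dim)]
    mean_b = [sum(v[i] for v in group_b) / len(group_b) for i in range(dim)]
    return sum((x - y) ** 2 for x, y in zip(mean_a, mean_b)) ** 0.5


def permutation_p_value(group_a, group_b, statistic=centroid_distance,
                        n_perm=999, seed=0):
    """Estimate a p-value by reshuffling group labels n_perm times."""
    rng = random.Random(seed)
    observed = statistic(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    size_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # Count relabelings that separate the groups at least as well.
        if statistic(pooled[:size_a], pooled[size_a:]) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_perm + 1)
```

    Any other separation measure with the same two-argument signature, including one derived from a trained SVM, could be passed as `statistic`.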

  1. A graph-theoretic approach for inparalog detection.

    PubMed

    Tremblay-Savard, Olivier; Swenson, Krister M

    2012-01-01

    Understanding the history of a gene family that evolves through duplication, speciation, and loss is a fundamental problem in comparative genomics. Features such as function, position, and structural similarity between genes are intimately connected to this history; relationships between genes such as orthology (genes related through a speciation event) or paralogy (genes related through a duplication event) are usually correlated with these features. For example, recent work has shown that in human and mouse there is a strong connection between function and inparalogs, the paralogs that were created since the speciation event separating the human and mouse lineages. Methods exist for detecting inparalogs that either use information from only two species, or consider a set of species but rely on clustering methods. In this paper we present a graph-theoretic approach for finding lower bounds on the number of inparalogs for a given set of species; we pose an edge covering problem on the similarity graph and give an efficient 2/3-approximation as well as a faster heuristic. Since the physical position of inparalogs corresponding to recent speciations is not likely to have changed since the duplication, we also use our predictions to estimate the types of duplications that have occurred in some vertebrates and drosophila.

  2. Welcome to pandoraviruses at the ‘Fourth TRUC’ club

    PubMed Central

    Sharma, Vikas; Colson, Philippe; Chabrol, Olivier; Scheid, Patrick; Pontarotti, Pierre; Raoult, Didier

    2015-01-01

    Nucleocytoplasmic large DNA viruses, or representatives of the proposed order Megavirales, belong to families of giant viruses that infect a broad range of eukaryotic hosts. Megaviruses have been previously described to comprise a fourth monophylogenetic TRUC (things resisting uncompleted classification) together with cellular domains in the universal tree of life. Recently described pandoraviruses have large (1.9–2.5 MB) and highly divergent genomes. In the present study, we updated the classification of pandoraviruses and other reported giant viruses. Phylogenetic trees were constructed based on six informational genes. Hierarchical clustering was performed based on a set of informational genes from Megavirales members and cellular organisms. Homologous sequences were selected from cellular organisms using TimeTree software, comprising comprehensive, and representative sets of members from Bacteria, Archaea, and Eukarya. Phylogenetic analyses based on three conserved core genes clustered pandoraviruses with phycodnaviruses, exhibiting their close relatedness. Additionally, hierarchical clustering analyses based on informational genes grouped pandoraviruses with Megavirales members as a super group distinct from cellular organisms. Thus, the analyses based on core conserved genes revealed that pandoraviruses are new genuine members of the ‘Fourth TRUC’ club, encompassing distinct life forms compared with cellular organisms. PMID:26042093

  3. Welcome to pandoraviruses at the 'Fourth TRUC' club.

    PubMed

    Sharma, Vikas; Colson, Philippe; Chabrol, Olivier; Scheid, Patrick; Pontarotti, Pierre; Raoult, Didier

    2015-01-01

    Nucleocytoplasmic large DNA viruses, or representatives of the proposed order Megavirales, belong to families of giant viruses that infect a broad range of eukaryotic hosts. Megaviruses have been previously described to comprise a fourth monophylogenetic TRUC (things resisting uncompleted classification) together with cellular domains in the universal tree of life. Recently described pandoraviruses have large (1.9-2.5 MB) and highly divergent genomes. In the present study, we updated the classification of pandoraviruses and other reported giant viruses. Phylogenetic trees were constructed based on six informational genes. Hierarchical clustering was performed based on a set of informational genes from Megavirales members and cellular organisms. Homologous sequences were selected from cellular organisms using TimeTree software, comprising comprehensive, and representative sets of members from Bacteria, Archaea, and Eukarya. Phylogenetic analyses based on three conserved core genes clustered pandoraviruses with phycodnaviruses, exhibiting their close relatedness. Additionally, hierarchical clustering analyses based on informational genes grouped pandoraviruses with Megavirales members as a super group distinct from cellular organisms. Thus, the analyses based on core conserved genes revealed that pandoraviruses are new genuine members of the 'Fourth TRUC' club, encompassing distinct life forms compared with cellular organisms.

  4. Transcriptome Sequencing of Gracilariopsis lemaneiformis to Analyze the Genes Related to Optically Active Phycoerythrin Synthesis.

    PubMed

    Huang, Xiaoyun; Zang, Xiaonan; Wu, Fei; Jin, Yuming; Wang, Haitao; Liu, Chang; Ding, Yating; He, Bangxiang; Xiao, Dongfang; Song, Xinwei; Liu, Zhu

    2017-01-01

    Gracilariopsis lemaneiformis (aka Gracilaria lemaneiformis) is a red macroalga rich in phycoerythrin, which can capture light efficiently and transfer it to photosystem II. However, little is known about the synthesis of optically active phycoerythrin in G. lemaneiformis at the molecular level. With the advent of high-throughput sequencing technology, analysis of genetic information for G. lemaneiformis by transcriptome sequencing is an effective means to get a deeper insight into the molecular mechanism of phycoerythrin synthesis. Illumina technology was employed to sequence the transcriptome of two strains of G. lemaneiformis: the wild type and a green-pigmented mutant. We obtained a total of 86915 assembled unigenes as a reference gene set, and 42884 unigenes were annotated in at least one public database. Taking the above transcriptome sequencing as a reference gene set, 4041 differentially expressed genes were screened to analyze and compare the gene expression profiles of the wild type and green mutant. By GO and KEGG pathway analysis, we concluded that three factors, including a reduction in the expression level of apo-phycoerythrin, an increase of chlorophyll light-harvesting complex synthesis, and reduction of phycoerythrobilin by competitive inhibition, caused the reduction of optically active phycoerythrin in the green-pigmented mutant.

  5. Identifying differentially expressed genes in cancer patients using a non-parameter Ising model.

    PubMed

    Li, Xumeng; Feltus, Frank A; Sun, Xiaoqian; Wang, James Z; Luo, Feng

    2011-10-01

    Identification of genes and pathways involved in diseases and physiological conditions is a major task in systems biology. In this study, we developed a novel non-parameter Ising model to integrate protein-protein interaction network and microarray data for identifying differentially expressed (DE) genes. We also proposed a simulated annealing algorithm to find the optimal configuration of the Ising model. The Ising model was applied to two breast cancer microarray data sets. The results showed that more cancer-related DE sub-networks and genes were identified by the Ising model than those by the Markov random field model. Furthermore, cross-validation experiments showed that DE genes identified by Ising model can improve classification performance compared with DE genes identified by Markov random field model. Copyright © 2011 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  6. Gene expression pattern recognition algorithm inferences to classify samples exposed to chemical agents

    NASA Astrophysics Data System (ADS)

    Bushel, Pierre R.; Bennett, Lee; Hamadeh, Hisham; Green, James; Ableson, Alan; Misener, Steve; Paules, Richard; Afshari, Cynthia

    2002-06-01

    We present an analysis of pattern recognition procedures used to predict the classes of samples exposed to pharmacologic agents by comparing gene expression patterns from samples treated with two classes of compounds. Rat liver mRNA samples following exposure for 24 hours with phenobarbital or peroxisome proliferators were analyzed using a 1700 rat cDNA microarray platform. Sets of genes that were consistently differentially expressed in the rat liver samples following treatment were stored in the MicroArray Project System (MAPS) database. MAPS identified 238 genes in common that possessed a low probability (P < 0.01) of being randomly detected as differentially expressed at the 95% confidence level. Hierarchical cluster analysis on the 238 genes clustered specific gene expression profiles that separated samples based on exposure to a particular class of compound.

  7. A systems biology analysis of the changes in gene expression via silencing of HPV-18 E1 expression in HeLa cells.

    PubMed

    Castillo, Andres; Wang, Lu; Koriyama, Chihaya; Eizuru, Yoshito; Jordan, King; Akiba, Suminori

    2014-10-01

    Previous studies have reported the detection of a truncated E1 mRNA generated from HPV-18 in HeLa cells. Although it is unclear whether a truncated E1 protein could function as a replicative helicase for viral replication, it would still retain binding sites for potential interactions with different host cell proteins. Furthermore, in this study, we found evidence in support of expression of full-length HPV-18 E1 mRNA in HeLa cells. To determine whether interactions between E1 and cellular proteins play an important role in cellular processes other than viral replication, genome-wide expression profiles of HPV-18 positive HeLa cells were compared before and after the siRNA knockdown of E1 expression. Differential expression and gene set enrichment analysis uncovered four functionally related sets of genes implicated in host defence mechanisms against viral infection. These included the toll-like receptor, interferon and apoptosis pathways, along with the antiviral interferon-stimulated gene set. In addition, we found that the transcriptional coactivator E1A-binding protein p300 (EP300) was downregulated, which is interesting given that EP300 is thought to be required for the transcription of HPV-18 genes in HeLa cells. The observed changes in gene expression produced via the silencing of HPV-18 E1 expression in HeLa cells indicate that in addition to its well-known role in viral replication, the E1 protein may also play an important role in mitigating the host's ability to defend against viral infection.

  8. Quality controls in cellular immunotherapies: rapid assessment of clinical grade dendritic cells by gene expression profiling.

    PubMed

    Castiello, Luciano; Sabatino, Marianna; Zhao, Yingdong; Tumaini, Barbara; Ren, Jiaqiang; Ping, Jin; Wang, Ena; Wood, Lauren V; Marincola, Francesco M; Puri, Raj K; Stroncek, David F

    2013-02-01

    Cell-based immunotherapies are among the most promising approaches for developing effective and targeted immune response. However, their clinical usefulness and the evaluation of their efficacy rely heavily on complex quality control assessment. Therefore, rapid systematic methods are urgently needed for the in-depth characterization of relevant factors affecting newly developed cell product consistency and the identification of reliable markers for quality control. Using dendritic cells (DCs) as a model, we present a strategy to comprehensively characterize manufactured cellular products in order to define factors affecting their variability, quality and function. After generating clinical grade human monocyte-derived mature DCs (mDCs), we tested by gene expression profiling the degrees of product consistency related to the manufacturing process and variability due to intra- and interdonor factors, and how each factor affects single gene variation. Then, by calculating for each gene an index of variation we selected candidate markers for identity testing, and defined a set of genes that may be useful comparability and potency markers. Subsequently, we confirmed the observed gene index of variation in a larger clinical data set. In conclusion, using high-throughput technology we developed a method for the characterization of cellular therapies and the discovery of novel candidate quality assurance markers.

  9. Broad-Enrich: functional interpretation of large sets of broad genomic regions.

    PubMed

    Cavalcante, Raymond G; Lee, Chee; Welch, Ryan P; Patil, Snehal; Weymouth, Terry; Scott, Laura J; Sartor, Maureen A

    2014-09-01

    Functional enrichment testing facilitates the interpretation of Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) data in terms of pathways and other biological contexts. Previous methods developed and used to test for key gene sets affected in ChIP-seq experiments treat peaks as points, and are based on the number of peaks associated with a gene or a binary score for each gene. These approaches work well for transcription factors, but histone modifications often occur over broad domains, and across multiple genes. To incorporate the unique properties of broad domains into functional enrichment testing, we developed Broad-Enrich, a method that uses the proportion of each gene's locus covered by a peak. We show that our method has a well-calibrated false-positive rate, performing well with ChIP-seq data having broad domains compared with alternative approaches. We illustrate Broad-Enrich with 55 ENCODE ChIP-seq datasets using different methods to define gene loci. Broad-Enrich can also be applied to other datasets consisting of broad genomic domains such as copy number variations. Broad-Enrich is available as a Web version and an R package at http://broad-enrich.med.umich.edu. Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press.
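
    The per-gene quantity named in this abstract, the proportion of a gene's locus covered by peaks, can be sketched as below. This is only an illustration of that one quantity, not the Broad-Enrich package: coordinates are assumed to be half-open (start, end) pairs on a single chromosome, overlapping peaks are counted once, and the function name is hypothetical. The enrichment test itself is not shown.

```python
def covered_fraction(locus, peaks):
    """Fraction of the half-open interval locus=(start, end) covered by any peak."""
    start, end = locus
    if end <= start:
        raise ValueError("locus must have positive length")
    # Keep only peaks that overlap the locus, clipped to its boundaries.
    clipped = sorted(
        (max(s, start), min(e, end))
        for s, e in peaks
        if e > start and s < end
    )
    covered = 0
    reach = start  # right-most position already counted
    for s, e in clipped:
        if e > reach:
            covered += e - max(s, reach)
            reach = e
    return covered / (end - start)
```

    For a locus spanning 100 to 200 with peaks at (50, 120), (110, 150) and (180, 300), the covered positions are 100-150 and 180-200, giving a proportion of 0.7.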

  10. Comparison of transcriptional profiles of Clostridium thermocellum grown on cellobiose and pretreated yellow poplar using RNA-Seq

    PubMed Central

    Wei, Hui; Fu, Yan; Magnusson, Lauren; Baker, John O.; Maness, Pin-Ching; Xu, Qi; Yang, Shihui; Bowersox, Andrew; Bogorad, Igor; Wang, Wei; Tucker, Melvin P.; Himmel, Michael E.; Ding, Shi-You

    2014-01-01

    The anaerobic, thermophilic bacterium, Clostridium thermocellum, secretes multi-protein enzyme complexes, termed cellulosomes, which synergistically interact with the microbial cell surface and efficiently disassemble plant cell wall biomass. C. thermocellum has also been considered a potential consolidated bioprocessing (CBP) organism due to its ability to produce the biofuel products, hydrogen, and ethanol. We found that C. thermocellum fermentation of pretreated yellow poplar (PYP) produced 30 and 39% of ethanol and hydrogen product concentrations, respectively, compared to fermentation of cellobiose. RNA-seq was used to analyze the transcriptional profiles of these cells. The PYP-grown cells taken for analysis at the late stationary phase showed 1211 genes up-regulated and 314 down-regulated by more than two-fold compared to the cellobiose-grown cells. These affected genes cover a broad spectrum of specific functional categories. The transcriptional analysis was further validated by sub-proteomics data taken from the literature, as well as by quantitative reverse transcription-PCR (qRT-PCR) analyses of selected genes. Specifically, 47 cellulosomal protein-encoding genes, genes for 4 pairs of SigI-RsgI for polysaccharide sensing, 7 cellodextrin ABC transporter genes, and a set of NAD(P)H hydrogenase and alcohol dehydrogenase genes were up-regulated for cells growing on PYP compared to cellobiose. These genes could be potential candidates for future studies aimed at gaining insight into the regulatory mechanism of this organism as well as for improvement of C. thermocellum in its role as a CBP organism. PMID:24782837

  11. Inferring species trees from incongruent multi-copy gene trees using the Robinson-Foulds distance

    PubMed Central

    2013-01-01

    Background Constructing species trees from multi-copy gene trees remains a challenging problem in phylogenetics. One difficulty is that the underlying genes can be incongruent due to evolutionary processes such as gene duplication and loss, deep coalescence, or lateral gene transfer. Gene tree estimation errors may further exacerbate the difficulties of species tree estimation. Results We present a new approach for inferring species trees from incongruent multi-copy gene trees that is based on a generalization of the Robinson-Foulds (RF) distance measure to multi-labeled trees (mul-trees). We prove that it is NP-hard to compute the RF distance between two mul-trees; however, it is easy to calculate this distance between a mul-tree and a singly-labeled species tree. Motivated by this, we formulate the RF problem for mul-trees (MulRF) as follows: Given a collection of multi-copy gene trees, find a singly-labeled species tree that minimizes the total RF distance from the input mul-trees. We develop and implement a fast SPR-based heuristic algorithm for the NP-hard MulRF problem. We compare the performance of the MulRF method (available at http://genome.cs.iastate.edu/CBL/MulRF/) with several gene tree parsimony approaches using gene tree simulations that incorporate gene tree error, gene duplications and losses, and/or lateral transfer. The MulRF method produces more accurate species trees than gene tree parsimony approaches. We also demonstrate that the MulRF method infers in minutes a credible plant species tree from a collection of nearly 2,000 gene trees. Conclusions Our new phylogenetic inference method, based on a generalized RF distance, makes it possible to quickly estimate species trees from large genomic data sets. Since the MulRF method, unlike gene tree parsimony, is based on a generic tree distance measure, it is appealing for analyses of genomic data sets, in which many processes such as deep coalescence, recombination, gene duplication and losses as well as phylogenetic error may contribute to gene tree discord. In experiments, the MulRF method estimated species trees accurately and quickly, demonstrating MulRF as an efficient alternative approach for phylogenetic inference from large-scale genomic data sets. PMID:24180377
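
    For orientation, the sketch below computes the ordinary Robinson-Foulds distance that this abstract generalizes. It covers only the simplest case, two rooted, singly-labeled trees on the same leaf set, and is not the mul-tree generalization or the MulRF software. Trees are assumed to be nested tuples with leaf labels as strings; the function names are hypothetical.

```python
def clusters(tree):
    """Return the non-trivial clades of a rooted tree as frozensets of leaf labels."""
    found = set()

    def leaves_below(node):
        if not isinstance(node, tuple):
            return frozenset([node])  # a leaf contributes only its own label
        below = frozenset().union(*(leaves_below(child) for child in node))
        found.add(below)
        return below

    all_leaves = leaves_below(tree)
    found.discard(all_leaves)  # the root clade is shared by every tree
    return found


def rf_distance(tree_a, tree_b):
    """Number of clades present in exactly one of the two rooted trees."""
    return len(clusters(tree_a) ^ clusters(tree_b))
```

    With `t1 = (("A", "B"), ("C", "D"))` and `t2 = (("A", "C"), ("B", "D"))`, the two trees share no non-trivial clade, so the distance is 4; comparing a tree with itself gives 0.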

  12. Virulotyping of Shigella spp. isolated from pediatric patients in Tehran, Iran.

    PubMed

    Ranjbar, Reza; Bolandian, Masomeh; Behzadi, Payam

    2017-03-01

    Shigellosis is a serious infectious disease with high morbidity and mortality among children worldwide. This survey investigated the prevalence of four important virulence genes (ial, ipaH, set1A, and set1B) among Shigella strains and identified the related gene profiles. Stool specimens were collected over 3 years (2008-2010) from children with suspected shigellosis who were referred to two hospitals in Tehran, Iran. Shigella spp. were identified through microbiological and serological tests and then subjected to PCR for virulotyping. Shigella sonnei ranked first (65.5%), followed by Shigella flexneri (25.9%), Shigella boydii (6.9%), and Shigella dysenteriae (1.7%). The ial gene was the most frequent virulence gene among the isolated strains, followed by ipaH, set1B, and set1A. S. flexneri possessed all of the studied virulence genes (ial 65.51%, ipaH 58.62%, set1A 12.07%, and set1B 22.41%). Moreover, the virulence gene profiles ial, ial-ipaH, ial-ipaH-set1B, and ial-ipaH-set1B-set1A were identified among the isolated Shigella spp. strains. The pattern of virulence genes has changed in the Shigella strains isolated in this study: the ial gene ranks first and ipaH second.

  13. Altered gut transcriptome in spondyloarthropathy

    PubMed Central

    Laukens, D; Peeters, H; Cruyssen, B V; Boonefaes, T; Elewaut, D; De Keyser, F; Mielants, H; Cuvelier, C; Veys, E M; Knecht, K; Van Hummelen, P; Remaut, E; Steidler, L; De Vos, M; Rottiers, P

    2006-01-01

    Background Intestinal inflammation is a common feature of spondyloarthropathy (SpA) and Crohn's disease. Inflammation is manifested clinically in Crohn's disease and subclinically in SpA. However, a fraction of patients with SpA develops overt Crohn's disease. Aims To investigate whether subclinical gut lesions in patients with SpA are associated with transcriptome changes comparable to those seen in Crohn's disease and to examine global gene expression in non‐inflamed colon biopsy specimens and screen patients for differentially expressed genes. Methods Macroarray analysis was used as an initial genomewide screen for selecting a comprehensive set of genes relevant to Crohn's disease and SpA. This led to the identification of 2625 expressed sequence tags that are differentially expressed in the colon of patients with Crohn's disease or SpA. These clones, with appropriate controls (6779 in total), were used to construct a glass‐based microarray, which was then used to analyse colon biopsy specimens from 15 patients with SpA, 11 patients with Crohn's disease and 10 controls. Results 95 genes were identified as differentially expressed in patients with SpA having a history of subclinical chronic gut inflammation and also in patients with Crohn's disease. Principal component analysis of this filtered set of genes successfully distinguished colon biopsy specimens from the three groups studied. Patients with SpA having subclinical chronic gut inflammation cluster together and are more related to those with Crohn's disease. Conclusion The transcriptome in the intestine of patients with SpA differs from that of controls. Moreover, these gene changes are comparable to those seen in patients with Crohn's disease, confirming initial clinical observations. On the basis of these findings, new (genetic) markers for detection of early Crohn's disease in patients with SpA can be considered. PMID:16476712

  14. Computing and Applying Atomic Regulons to Understand Gene Expression and Regulation

    DOE PAGES

    Faria, José P.; Davis, James J.; Edirisinghe, Janaka N.; ...

    2016-11-24

    Understanding gene function and regulation is essential for the interpretation, prediction, and ultimate design of cell responses to changes in the environment. A multitude of technologies, abstractions, and interpretive frameworks have emerged to answer the challenges presented by genome function and regulatory network inference. Here, we propose a new approach for producing biologically meaningful clusters of coexpressed genes, called Atomic Regulons (ARs), based on expression data, gene context, and functional relationships. We demonstrate this new approach by computing ARs for Escherichia coli, which we compare with the coexpressed gene clusters predicted by two prevalent existing methods: hierarchical clustering and k-means clustering. We test the consistency of ARs predicted by all methods against expected interactions predicted by the Context Likelihood of Relatedness (CLR) mutual information based method, finding that the ARs produced by our approach show better agreement with CLR interactions. We then apply our method to compute ARs for four other genomes: Shewanella oneidensis, Pseudomonas aeruginosa, Thermus thermophilus, and Staphylococcus aureus. We compare the AR clusters from all genomes to study the similarity of coexpression among a phylogenetically diverse set of species, identifying subsystems that show remarkable similarity over wide phylogenetic distances. We also study the sensitivity of our method for computing ARs to the expression data used in the computation, showing that our new approach requires less data than competing approaches to converge to a near final configuration of ARs. We go on to use our sensitivity analysis to identify the specific experiments that lead most rapidly to the final set of ARs for E. coli. As a result, this analysis produces insights into improving the design of gene expression experiments.

  15. Computing and Applying Atomic Regulons to Understand Gene Expression and Regulation

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Faria, José P.; Davis, James J.; Edirisinghe, Janaka N.

    Understanding gene function and regulation is essential for the interpretation, prediction, and ultimate design of cell responses to changes in the environment. A multitude of technologies, abstractions, and interpretive frameworks have emerged to answer the challenges presented by genome function and regulatory network inference. Here, we propose a new approach for producing biologically meaningful clusters of coexpressed genes, called Atomic Regulons (ARs), based on expression data, gene context, and functional relationships. We demonstrate this new approach by computing ARs for Escherichia coli, which we compare with the coexpressed gene clusters predicted by two prevalent existing methods: hierarchical clustering and k-means clustering. We test the consistency of ARs predicted by all methods against expected interactions predicted by the Context Likelihood of Relatedness (CLR) mutual information based method, finding that the ARs produced by our approach show better agreement with CLR interactions. We then apply our method to compute ARs for four other genomes: Shewanella oneidensis, Pseudomonas aeruginosa, Thermus thermophilus, and Staphylococcus aureus. We compare the AR clusters from all genomes to study the similarity of coexpression among a phylogenetically diverse set of species, identifying subsystems that show remarkable similarity over wide phylogenetic distances. We also study the sensitivity of our method for computing ARs to the expression data used in the computation, showing that our new approach requires less data than competing approaches to converge to a near final configuration of ARs. We go on to use our sensitivity analysis to identify the specific experiments that lead most rapidly to the final set of ARs for E. coli. As a result, this analysis produces insights into improving the design of gene expression experiments.

  16. Transcriptional Changes in Canine Distemper Virus-Induced Demyelinating Leukoencephalitis Favor a Biphasic Mode of Demyelination

    PubMed Central

    Ulrich, Reiner; Puff, Christina; Wewetzer, Konstantin; Kalkuhl, Arno; Deschl, Ulrich; Baumgärtner, Wolfgang

    2014-01-01

    Canine distemper virus (CDV)-induced demyelinating leukoencephalitis in dogs (Canis familiaris) is suggested to represent a naturally occurring translational model for subacute sclerosing panencephalitis and multiple sclerosis in humans. The aim of this study was a hypothesis-free microarray analysis of the transcriptional changes within cerebellar specimens of five cases of acute, six cases of subacute demyelinating, and three cases of chronic demyelinating and inflammatory CDV leukoencephalitis as compared to twelve non-infected control dogs. Frozen cerebellar specimens were used for analysis of histopathological changes including demyelination, transcriptional changes employing microarrays, and presence of CDV nucleoprotein RNA and protein using microarrays, RT-qPCR and immunohistochemistry. Microarray analysis revealed 780 differentially expressed probe sets. The dominating change was an up-regulation of genes related to the innate and the humoral immune response, and less distinct the cytotoxic T-cell-mediated immune response in all subtypes of CDV leukoencephalitis as compared to controls. Multiple myelin genes including myelin basic protein and proteolipid protein displayed a selective down-regulation in subacute CDV leukoencephalitis, suggestive of an oligodendrocyte dystrophy. In contrast, a marked up-regulation of multiple immunoglobulin-like expressed sequence tags and the delta polypeptide of the CD3 antigen was observed in chronic CDV leukoencephalitis, in agreement with the hypothesis of an immune-mediated demyelination in the late inflammatory phase of the disease. Analysis of pathways intimately linked to demyelination as determined by morphometry employing correlation-based Gene Set Enrichment Analysis highlighted the pathomechanistic importance of up-regulated genes comprised by the gene ontology terms “viral replication” and “humoral immune response” as well as down-regulated genes functionally related to “metabolite and energy generation”. 
    PMID:24755553

  17. Transcriptional changes in canine distemper virus-induced demyelinating leukoencephalitis favor a biphasic mode of demyelination.

    PubMed

    Ulrich, Reiner; Puff, Christina; Wewetzer, Konstantin; Kalkuhl, Arno; Deschl, Ulrich; Baumgärtner, Wolfgang

    2014-01-01

    Canine distemper virus (CDV)-induced demyelinating leukoencephalitis in dogs (Canis familiaris) is suggested to represent a naturally occurring translational model for subacute sclerosing panencephalitis and multiple sclerosis in humans. The aim of this study was a hypothesis-free microarray analysis of the transcriptional changes within cerebellar specimens of five cases of acute, six cases of subacute demyelinating, and three cases of chronic demyelinating and inflammatory CDV leukoencephalitis as compared to twelve non-infected control dogs. Frozen cerebellar specimens were used for analysis of histopathological changes including demyelination, transcriptional changes employing microarrays, and presence of CDV nucleoprotein RNA and protein using microarrays, RT-qPCR and immunohistochemistry. Microarray analysis revealed 780 differentially expressed probe sets. The dominating change was an up-regulation of genes related to the innate and the humoral immune response, and less distinct the cytotoxic T-cell-mediated immune response in all subtypes of CDV leukoencephalitis as compared to controls. Multiple myelin genes including myelin basic protein and proteolipid protein displayed a selective down-regulation in subacute CDV leukoencephalitis, suggestive of an oligodendrocyte dystrophy. In contrast, a marked up-regulation of multiple immunoglobulin-like expressed sequence tags and the delta polypeptide of the CD3 antigen was observed in chronic CDV leukoencephalitis, in agreement with the hypothesis of an immune-mediated demyelination in the late inflammatory phase of the disease. Analysis of pathways intimately linked to demyelination as determined by morphometry employing correlation-based Gene Set Enrichment Analysis highlighted the pathomechanistic importance of up-regulated genes comprised by the gene ontology terms "viral replication" and "humoral immune response" as well as down-regulated genes functionally related to "metabolite and energy generation".

  18. A whole blood gene expression-based signature for smoking status

    PubMed Central

    2012-01-01

    Background: Smoking is the leading cause of preventable death worldwide and has been shown to increase the risk of multiple diseases including coronary artery disease (CAD). We sought to identify genes whose levels of expression in whole blood correlate with self-reported smoking status. Methods: Microarrays were used to identify gene expression changes in whole blood which correlated with self-reported smoking status; a set of significant genes from the microarray analysis was validated by qRT-PCR in an independent set of subjects. Stepwise forward logistic regression was performed using the qRT-PCR data to create a predictive model whose performance was validated in an independent set of subjects and compared to cotinine, a nicotine metabolite. Results: Microarray analysis of whole blood RNA from 209 PREDICT subjects (41 current smokers, 4 quit ≤ 2 months, 64 quit > 2 months, 100 never smoked; NCT00500617) identified 4214 genes significantly correlated with self-reported smoking status. qRT-PCR was performed on 1,071 PREDICT subjects across 256 microarray genes significantly correlated with smoking or CAD. A five gene (CLDND1, LRRN3, MUC1, GOPC, LEF1) predictive model, derived from the qRT-PCR data using stepwise forward logistic regression, had a cross-validated mean AUC of 0.93 (sensitivity=0.78; specificity=0.95), and was validated using 180 independent PREDICT subjects (AUC=0.82, CI 0.69-0.94; sensitivity=0.63; specificity=0.94). Plasma from the 180 validation subjects was used to assess levels of cotinine; a model using a threshold of 10 ng/ml cotinine resulted in an AUC of 0.89 (CI 0.81-0.97; sensitivity=0.81; specificity=0.97; kappa with expression model = 0.53). Conclusion: We have constructed and validated a whole blood gene expression score for the evaluation of smoking status, demonstrating that clinical and environmental factors contributing to cardiovascular disease risk can be assessed by gene expression. PMID:23210427

  19. A functional polymorphism of the TNF-α gene that is associated with type 2 DM

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Susa, Shinji; Daimon, Makoto; Sakabe, Jun-Ichi

    2008-05-09

    To examine the association of the tumor necrosis factor-α (TNF-α) gene region with type 2 diabetes (DM), 11 single-nucleotide polymorphisms (SNPs) of the region were analyzed. The initial study using a sample set (148 cases vs. 227 controls) showed a significant association of the SNP IVS1G + 123A of the TNF-α gene with DM (p = 0.0056). Multiple logistic regression analysis using an enlarged sample set (225 vs. 716) revealed a significant association of the SNP with DM independently of any clinical traits examined (OR: 1.49, p = 0.014). The functional relevance of the SNP was examined by electrophoretic mobility shift assays using nuclear extracts from U937 and NIH3T3 cells and by luciferase assays in these cells with Simian virus 40 promoter- and TNF-α promoter-reporter gene constructs. The functional analyses showed that the YY1 transcription factor bound allele-specifically to the SNP region, and the IVS1 + 123A allele showed increased luciferase expression compared with the G allele.

  20. A Granular Self-Organizing Map for Clustering and Gene Selection in Microarray Data.

    PubMed

    Ray, Shubhra Sankar; Ganivada, Avatharam; Pal, Sankar K

    2016-09-01

    A new granular self-organizing map (GSOM) is developed by integrating the concept of a fuzzy rough set with the SOM. While training the GSOM, the weights of a winning neuron and the neighborhood neurons are updated through a modified learning procedure. The neighborhood is newly defined using the fuzzy rough sets. The clusters (granules) evolved by the GSOM are presented to a decision table as its decision classes. Based on the decision table, a method of gene selection is developed. The effectiveness of the GSOM is shown in both clustering samples and developing an unsupervised fuzzy rough feature selection (UFRFS) method for gene selection in microarray data. While the superior results of the GSOM, as compared with the related clustering methods, are provided in terms of β-index, DB-index, Dunn-index, and fuzzy rough entropy, the genes selected by the UFRFS are not only better in terms of classification accuracy and a feature evaluation index, but also statistically more significant than the related unsupervised methods. The C-codes of the GSOM and UFRFS are available online at http://avatharamg.webs.com/software-code.

  1. Integrating genome-wide association study and expression quantitative trait loci data identifies multiple genes and gene set associated with neuroticism.

    PubMed

    Fan, Qianrui; Wang, Wenyu; Hao, Jingcan; He, Awen; Wen, Yan; Guo, Xiong; Wu, Cuiyan; Ning, Yujie; Wang, Xi; Wang, Sen; Zhang, Feng

    2017-08-01

    Neuroticism is a fundamental personality trait with a significant genetic determinant. To identify novel susceptibility genes for neuroticism, we conducted an integrative analysis of genomic and transcriptomic data from a genome-wide association study (GWAS) and an expression quantitative trait locus (eQTL) study. GWAS summary data were derived from published studies of neuroticism, involving a total of 170,906 subjects. An eQTL dataset containing 927,753 eQTLs was obtained from an eQTL meta-analysis of 5311 samples. Integrative analysis of GWAS and eQTL data was conducted with summary data-based Mendelian randomization (SMR) analysis software. To identify neuroticism-associated gene sets, the SMR analysis results were further subjected to gene set enrichment analysis (GSEA). The gene set annotation dataset (containing 13,311 annotated gene sets) of the GSEA Molecular Signatures Database was used. SMR single gene analysis identified 6 significant genes for neuroticism, including MSRA (p value=2.27×10⁻¹⁰), MGC57346 (p value=6.92×10⁻⁷), BLK (p value=1.01×10⁻⁶), XKR6 (p value=1.11×10⁻⁶), C17ORF69 (p value=1.12×10⁻⁶) and KIAA1267 (p value=4.00×10⁻⁶). Gene set enrichment analysis observed a significant association for the Chr8p23 gene set (false discovery rate=0.033). Our results provide novel clues for studies of the genetic mechanism of neuroticism.

  2. Strategies for comparing gene expression profiles from different microarray platforms: application to a case-control experiment.

    PubMed

    Severgnini, Marco; Bicciato, Silvio; Mangano, Eleonora; Scarlatti, Francesca; Mezzelani, Alessandra; Mattioli, Michela; Ghidoni, Riccardo; Peano, Clelia; Bonnal, Raoul; Viti, Federica; Milanesi, Luciano; De Bellis, Gianluca; Battaglia, Cristina

    2006-06-01

    Meta-analysis of microarray data is increasingly important, considering both the availability of multiple platforms using disparate technologies and the accumulation in public repositories of data sets from different laboratories. We addressed the issue of comparing gene expression profiles from two microarray platforms by devising a standardized investigative strategy. We tested this procedure by studying MDA-MB-231 cells, which undergo apoptosis on treatment with resveratrol. Gene expression profiles were obtained using high-density, short-oligonucleotide, single-color microarray platforms: GeneChip (Affymetrix) and CodeLink (Amersham). Interplatform analyses were carried out on 8414 common transcripts represented on both platforms, as identified by LocusLink ID, representing 70.8% and 88.6% of annotated GeneChip and CodeLink features, respectively. We identified 105 differentially expressed genes (DEGs) on CodeLink and 42 DEGs on GeneChip. Among them, only 9 DEGs were commonly identified by both platforms. Multiple analyses (BLAST alignment of probes with target sequences, gene ontology, literature mining, and quantitative real-time PCR) permitted us to investigate the factors contributing to the generation of platform-dependent results in single-color microarray experiments. An effective approach to cross-platform comparison involves microarrays of similar technologies, samples prepared by identical methods, and a standardized battery of bioinformatic and statistical analyses.
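To put the platform agreement reported in this abstract into numbers, the sketch below computes the Jaccard index and the per-platform shared fraction for two lists of differentially expressed genes. The function name `deg_overlap` and the gene identifiers are placeholders invented for illustration; only the list sizes (105, 42, and 9 shared) come from the abstract.

```python
def deg_overlap(degs_a, degs_b):
    """Summarise agreement between two sets of differentially expressed genes."""
    shared = degs_a & degs_b
    union = degs_a | degs_b
    return {
        "shared": len(shared),
        "jaccard": len(shared) / len(union),     # shared / union
        "frac_of_a": len(shared) / len(degs_a),  # share of list A confirmed by B
        "frac_of_b": len(shared) / len(degs_b),  # share of list B confirmed by A
    }


# Placeholder identifiers chosen only to reproduce the reported list sizes.
codelink = {f"gene{i}" for i in range(105)}      # 105 DEGs
genechip = {f"gene{i}" for i in range(96, 138)}  # 42 DEGs, 9 of them shared

summary = deg_overlap(codelink, genechip)
print(summary["shared"])                # 9
print(round(summary["jaccard"], 3))     # 0.065
print(round(summary["frac_of_a"], 3))   # 0.086
print(round(summary["frac_of_b"], 3))   # 0.214
```

Reporting both the Jaccard index and the per-list fractions is a design choice: when the lists differ in size, as here, a single overlap figure hides how much of the shorter list is actually confirmed.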

  3. Comparative transcriptomic evidence for Tween80-enhanced biodegradation of phenanthrene by Sphingomonas sp. GY2B.

    PubMed

    Liu, Shasha; Guo, Chuling; Lin, Weijia; Wu, Fengji; Lu, Guining; Lu, Jing; Dang, Zhi

    2017-12-31

    Previous studies of the effects of surfactants on the biodegradation of phenanthrene focused on investigating alterations of the cell characteristics of Sphingomonas sp. GY2B. However, the gene regulation associated with biodegradation and biological processes in response to the presence of surfactants remains unclear. In this study, comparative transcriptome analysis was conducted to observe the gene expression of GY2B during phenanthrene biodegradation in the presence and absence of Tween80. A diverse set of genes was regulated by Tween80, leading to increased biodegradation of phenanthrene by GY2B: (i) Tween80 increased expression of genes related to H⁺ transport in the plasma membrane to provide a driving force (i.e., ATP) for accelerating transmembrane transport of phenanthrene with increasing Tween80 concentrations, thereby enhancing the uptake and degradation of phenanthrene by GY2B; (ii) Tween80 (1 and 8 CMC) promoted intracellular biodegradation of phenanthrene by stimulating expression of genes encoding dioxygenases and monooxygenase, increasing expression of genes involved in intracellular metabolic processes (e.g., TCA cycle); and (iii) Tween80 likely increased GY2B vitality and growth by inducing expression of genes associated with ABC transporters and protein transport, regulating genes involved in other biological processes (e.g., transcription, translation).

  4. Roles of Distal and Genic Methylation in the Development of Prostate Tumorigenesis Revealed by Genome-wide DNA Methylation Analysis.

    PubMed

    Wang, Yao; Jadhav, Rohit Ramakant; Liu, Joseph; Wilson, Desiree; Chen, Yidong; Thompson, Ian M; Troyer, Dean A; Hernandez, Javier; Shi, Huidong; Leach, Robin J; Huang, Tim H-M; Jin, Victor X

    2016-02-29

    Aberrant DNA methylation at promoters is often linked to tumorigenesis. But many aspects of DNA methylation remain unexplored, including the individual roles of distal and gene body methylation, as well as their collaborative roles with promoter methylation. Here we performed an MBD-seq analysis on prostate specimens classified into low-, high-, and very-high-risk groups based on Gleason score and TNM stages. We identified gene sets with differential methylation regions (DMRs) in Distal, TSS, gene body and TES. To understand the collaborative roles, TSS was compared with the other three DMRs, resulting in 12 groups of genes with collaborative differential methylation patterns (CDMPs). We found several groups of genes that show opposite methylation patterns in Distal and Genic regions compared to the TSS region, and in general they are differentially expressed genes (DEGs) in tumors in TCGA RNA-seq data. IPA (Ingenuity Pathway Analysis) reveals the AR/TP53 signaling network to be a major signaling pathway, and survival analysis indicates gene subsets significantly associated with prostate cancer recurrence. Our results suggest that DNA methylation in Distal and Genic regions also plays critical roles in contributing to prostate tumorigenesis, and may act either positively or negatively with TSSs to alter gene regulation in tumors.

  5. A transversal approach to predict gene product networks from ontology-based similarity

    PubMed Central

    Chabalier, Julie; Mosser, Jean; Burgun, Anita

    2007-01-01

    Background: Interpretation of transcriptomic data is usually made through a "standard" approach which consists in clustering the genes according to their expression patterns and exploiting Gene Ontology (GO) annotations within each expression cluster. This approach makes it difficult to underline functional relationships between gene products that belong to different expression clusters. To address this issue, we propose a transversal analysis that aims to predict functional networks based on a combination of GO processes and data expression. Results: The transversal approach presented in this paper consists in computing the semantic similarity between gene products in a Vector Space Model. Through a weighting scheme over the annotations, we take into account the representativity of the terms that annotate a gene product. Comparing annotation vectors results in a matrix of gene product similarities. Combined with expression data, the matrix is displayed as a set of functional gene networks. The transversal approach was applied to 186 genes related to the enterocyte differentiation stages. This approach resulted in 18 functional networks that proved to be biologically relevant. These results were compared with those obtained through a standard approach and with an approach based on information content similarity. Conclusion: Complementary to the standard approach, the transversal approach offers new insight into the cellular mechanisms and reveals new research hypotheses by combining gene product networks based on semantic similarity, and data expression. PMID:17605807
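The vector-space comparison described in this abstract can be illustrated with a small sketch: each gene product becomes a weighted vector over its annotation terms, and pairs are compared by cosine similarity. The abstract does not specify the weighting scheme, so the inverse-document-frequency weight used here is an assumption, and the gene and term names are hypothetical.

```python
import math


def idf_weights(annotations):
    """Weight each term by how rarely it annotates a gene (assumed IDF scheme)."""
    n_genes = len(annotations)
    doc_freq = {}
    for terms in annotations.values():
        for term in set(terms):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {t: math.log(n_genes / df) + 1.0 for t, df in doc_freq.items()}


def cosine(terms_a, terms_b, weights):
    """Cosine similarity between two weighted annotation vectors."""
    a, b = set(terms_a), set(terms_b)
    dot = sum(weights[t] ** 2 for t in a & b)
    norm_a = math.sqrt(sum(weights[t] ** 2 for t in a))
    norm_b = math.sqrt(sum(weights[t] ** 2 for t in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical gene products and annotation terms, for illustration only.
annotations = {
    "geneA": ["term_1", "term_2"],
    "geneB": ["term_2", "term_3"],
    "geneC": ["term_4"],
}
weights = idf_weights(annotations)

# Pairwise similarity matrix, stored as a dict keyed by gene pair.
similarity = {
    (g, h): cosine(annotations[g], annotations[h], weights)
    for g in annotations
    for h in annotations
}
```

In this toy input, `geneA` and `geneB` share one fairly common term and so receive an intermediate score, while `geneC` shares nothing and scores zero against both.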

  6. paraGSEA: a scalable approach for large-scale gene expression profiling

    PubMed Central

    Peng, Shaoliang; Yang, Shunyun

    2017-01-01

    More studies have been conducted using gene expression similarity to identify functional connections among genes, diseases and drugs. Gene Set Enrichment Analysis (GSEA) is a powerful analytical method for interpreting gene expression data. However, due to its enormous computational overhead in the significance-level estimation and multiple-hypothesis-testing steps, its computational scalability and efficiency are poor on large-scale datasets. We proposed paraGSEA for efficient large-scale transcriptome data analysis. By optimization, the overall time complexity of paraGSEA is reduced from O(mn) to O(m+n), where m is the length of the gene sets and n is the length of the gene expression profiles, which contributes a more than 100-fold increase in performance compared with other popular GSEA implementations such as GSEA-P, SAM-GS and GSEA2. By further parallelization, a near-linear speed-up is gained on both workstations and clusters in an efficient manner with high scalability and performance on large-scale datasets. The analysis time of the whole LINCS phase I dataset (GSE92742) was reduced to nearly half an hour on a 1000 node cluster on Tianhe-2, or within 120 hours on a 96-core workstation. The source code of paraGSEA is licensed under the GPLv3 and available at http://github.com/ysycloud/paraGSEA. PMID:28973463
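For readers unfamiliar with the statistic that GSEA implementations compute, the sketch below shows the commonly described weighted running-sum enrichment score for one gene set against one ranked profile. It is not paraGSEA's optimized algorithm, which the abstract does not detail; the function name, the exponent default `p=1.0`, and the toy gene names are assumptions for illustration.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Running-sum enrichment score: maximum deviation from zero of hits minus misses.

    ranked_genes: gene names sorted by decreasing ranking metric
    scores:       ranking metric for each gene, in the same order
    gene_set:     set of gene names to test
    """
    hits = [g in gene_set for g in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)
    hit_norm = sum(abs(s) ** p for s, h in zip(scores, hits) if h)
    if hit_norm == 0 or n_miss == 0:
        return 0.0  # the score is undefined without both hits and misses

    running, best = 0.0, 0.0
    for s, h in zip(scores, hits):
        # Step up by the weighted score at a hit, down by a constant at a miss.
        running += abs(s) ** p / hit_norm if h else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Toy ranked profile (hypothetical gene names and scores).
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
metric = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]

es_top = enrichment_score(genes, metric, {"g1", "g2"})     # set at the top of the list
es_bottom = enrichment_score(genes, metric, {"g5", "g6"})  # set at the bottom of the list
```

A set concentrated at the top of the ranking yields a positive score and one concentrated at the bottom a negative score. Significance testing repeats this pass over many permutations and many gene sets, which is the step the abstract identifies as the computational bottleneck.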

  7. Identifying biologically relevant putative mechanisms in a given phenotype comparison

    PubMed Central

    Hanoudi, Samer; Donato, Michele; Draghici, Sorin

    2017-01-01

    A major challenge in life science research is understanding the mechanism involved in a given phenotype. The ability to identify the correct mechanisms is needed in order to understand fundamental and very important phenomena such as mechanisms of disease, immune system responses to various challenges, and mechanisms of drug action. The current data analysis methods focus on the identification of the differentially expressed (DE) genes using their fold change and/or p-values. Major shortcomings of this approach are that: i) it does not consider the interactions between genes; ii) its results are sensitive to the selection of the threshold(s) used, and iii) the set of genes produced by this approach is not always conducive to formulating mechanistic hypotheses. Here we present a method that can construct networks of genes that can be considered putative mechanisms. The putative mechanisms constructed by this approach are not limited to the set of DE genes, but also consider all known and relevant gene-gene interactions. We analyzed three real datasets for which both the causes of the phenotype, as well as the true mechanisms were known. We show that the method identified the correct mechanisms when applied on microarray datasets from mouse. We compared the results of our method with the results of the classical approach, showing that our method produces more meaningful biological insights. PMID:28486531

  8. Spectral Biclustering of Microarray Data: Coclustering Genes and Conditions

    PubMed Central

    Kluger, Yuval; Basri, Ronen; Chang, Joseph T.; Gerstein, Mark

    2003-01-01

    Global analyses of RNA expression levels are useful for classifying genes and overall phenotypes. Often these classification problems are linked, and one wants to find “marker genes” that are differentially expressed in particular sets of “conditions.” We have developed a method that simultaneously clusters genes and conditions, finding distinctive “checkerboard” patterns in matrices of gene expression data, if they exist. In a cancer context, these checkerboards correspond to genes that are markedly up- or downregulated in patients with particular types of tumors. Our method, spectral biclustering, is based on the observation that checkerboard structures in matrices of expression data can be found in eigenvectors corresponding to characteristic expression patterns across genes or conditions. In addition, these eigenvectors can be readily identified by commonly used linear algebra approaches, in particular the singular value decomposition (SVD), coupled with closely integrated normalization steps. We present a number of variants of the approach, depending on whether the normalization over genes and conditions is done independently or in a coupled fashion. We then apply spectral biclustering to a selection of publicly available cancer expression data sets, and examine the degree to which the approach is able to identify checkerboard structures. Furthermore, we compare the performance of our biclustering methods against a number of reasonable benchmarks (e.g., direct application of SVD or normalized cuts to raw data). PMID:12671006

  9. Development and evaluation of near-isogenic lines for major blast resistance gene(s) in Basmati rice.

    PubMed

    Khanna, Apurva; Sharma, Vinay; Ellur, Ranjith K; Shikari, Asif B; Gopala Krishnan, S; Singh, U D; Prakash, G; Sharma, T R; Rathour, Rajeev; Variar, Mukund; Prashanthi, S K; Nagarajan, M; Vinod, K K; Bhowmick, Prolay K; Singh, N K; Prabhu, K V; Singh, B D; Singh, Ashok K

    2015-07-01

    A set of NILs carrying major blast resistance genes in a Basmati rice variety has been developed. Also, the efficacy of pyramids over monogenic NILs against the rice blast pathogen Magnaporthe oryzae has been demonstrated. Productivity and quality of Basmati rice are severely affected by rice blast disease. Major genes and QTLs conferring resistance to blast have been reported only in non-Basmati rice germplasm. Here, we report incorporation of seven blast resistance genes from the donor lines DHMASQ164-2a (Pi54, Pi1, Pita), IRBLz5-CA (Pi2), IRBLb-B (Pib), IRBL5-M (Pi5) and IRBL9-W (Pi9) into the genetic background of an elite Basmati rice variety Pusa Basmati 1 (PB1). A total of 36 near-isogenic lines (NILs) comprising 14 monogenic, 16 two-gene pyramids and six three-gene pyramids were developed through marker-assisted backcross breeding (MABB). Foreground, recombinant and background selection was used to identify the plants with target gene(s), minimize the linkage drag and increase the recurrent parent genome (RPG) recovery (93.5-98.6%), respectively, in the NILs. Comparative analysis performed using 50,051 SNPs and 500 SSR markers revealed that the SNPs provided better insight into the RPG recovery. Most of the monogenic NILs showed comparable performance in yield and quality; in addition, Pusa1637-18-7-6-20 (Pi9) was significantly superior in yield and stable across four different environments as compared to recurrent parent (RP) PB1. Further, among the pyramids, Pusa1930-12-6 (Pi2+Pi5) showed significantly higher yield and Pusa1633-7-8-53-6-8 (Pi54+Pi1+Pita) was superior in cooking quality as compared to RP PB1. The NILs carrying gene Pi9 were found to be the most effective against the concoction of virulent races predominant in the hotspot locations for blast disease. Conversely, when analyzed under artificial inoculation, three-gene pyramids expressed enhanced resistance as compared to the two-gene and monogenic NILs.

  10. Comparative transcriptomics among floral organs of the basal eudicot Eschscholzia californica as reference for floral evolutionary developmental studies

    PubMed Central

    2010-01-01

    Background: Molecular genetic studies of floral development have concentrated on several core eudicots and grasses (monocots), which have canalized floral forms. Basal eudicots possess a wider range of floral morphologies than the core eudicots and grasses and can serve as an evolutionary link between core eudicots and monocots, and provide a reference for studies of other basal angiosperms. Recent advances in genomics have enabled researchers to profile gene activities during floral development, primarily in the eudicot Arabidopsis thaliana and the monocots rice and maize. However, our understanding of floral developmental processes among the basal eudicots remains limited. Results: Using a recently generated expressed sequence tag (EST) set, we have designed an oligonucleotide microarray for the basal eudicot Eschscholzia californica (California poppy). We performed microarray experiments with an interwoven-loop design in order to characterize the E. californica floral transcriptome and to identify differentially expressed genes in flower buds with pre-meiotic and meiotic cells, four floral organs at pre-anthesis stages (sepals, petals, stamens and carpels), developing fruits, and leaves. Conclusions: Our results provide a foundation for comparative gene expression studies between eudicots and basal angiosperms. We identified whorl-specific gene expression patterns in E. californica and examined the floral expression of several gene families. Interestingly, most E. californica homologs of Arabidopsis genes important for flower development, except for genes encoding MADS-box transcription factors, show different expression patterns between the two species. Our comparative transcriptomics study highlights the unique evolutionary position of E. californica compared with basal angiosperms and core eudicots. PMID:20950453

  11. Profiling of gene duplication patterns of sequenced teleost genomes: evidence for rapid lineage-specific genome expansion mediated by recent tandem duplications.

    PubMed

    Lu, Jianguo; Peatman, Eric; Tang, Haibao; Lewis, Joshua; Liu, Zhanjiang

    2012-06-15

    Gene duplication has had a major impact on genome evolution. Localized (or tandem) duplication resulting from unequal crossing over and whole genome duplication are believed to be the two dominant mechanisms contributing to vertebrate genome evolution. While much scrutiny has been directed toward discerning patterns indicative of whole-genome duplication events in teleost species, less attention has been paid to the continuous nature of gene duplications and their impact on the size, gene content, functional diversity, and overall architecture of teleost genomes. Here, using a Markov clustering algorithm directed approach we catalogue and analyze patterns of gene duplication in the four model teleost species with chromosomal coordinates: zebrafish, medaka, stickleback, and Tetraodon. Our analyses based on set size, duplication type, synonymous substitution rate (Ks), and gene ontology emphasize shared and lineage-specific patterns of genome evolution via gene duplication. Most strikingly, our analyses highlight the extraordinary duplication and retention rate of recent duplicates in zebrafish and their likely role in the structural and functional expansion of the zebrafish genome. We find that the zebrafish genome is remarkable in its large number of duplicated genes, small duplicate set size, biased Ks distribution toward minimal mutational divergence, and proportion of tandem and intra-chromosomal duplicates when compared with the other teleost model genomes. The observed gene duplication patterns have played significant roles in shaping the architecture of teleost genomes and appear to have contributed to the recent functional diversification and divergence of important physiological processes in zebrafish. We have analyzed gene duplication patterns and duplication types among the available teleost genomes and found that a large number of genes were tandemly and intrachromosomally duplicated, suggesting their origin of independent and continuous duplication. 
    This is particularly true for the zebrafish genome. Further analysis of the duplicated gene sets indicated that a significant portion of duplicated genes in the zebrafish genome were of recent, lineage-specific duplication events. Most strikingly, a subset of duplicated genes is enriched among the recently duplicated genes involved in immune or sensory response pathways. Such findings demonstrated the significance of continuous gene duplication as well as that of whole genome duplication in the course of genome evolution.

  12. CodingQuarry: highly accurate hidden Markov model gene prediction in fungal genomes using RNA-seq transcripts.

    PubMed

    Testa, Alison C; Hane, James K; Ellwood, Simon R; Oliver, Richard P

    2015-03-11

    The impact of gene annotation quality on functional and comparative genomics makes gene prediction an important process, particularly in non-model species, including many fungi. Sets of homologous protein sequences are rarely complete with respect to the fungal species of interest and are often small or unreliable, especially when closely related species have not been sequenced or annotated in detail. In these cases, protein homology-based evidence fails to correctly annotate many genes, or significantly improve ab initio predictions. Generalised hidden Markov models (GHMM) have proven to be invaluable tools in gene annotation and, recently, RNA-seq has emerged as a cost-effective means to significantly improve the quality of automated gene annotation. As these methods do not require sets of homologous proteins, improving gene prediction from these resources is of benefit to fungal researchers. While many pipelines now incorporate RNA-seq data in training GHMMs, there has been relatively little investigation into additionally combining RNA-seq data at the point of prediction, and room for improvement in this area motivates this study. CodingQuarry is a highly accurate, self-training GHMM fungal gene predictor designed to work with assembled, aligned RNA-seq transcripts. RNA-seq data informs annotations both during gene-model training and in prediction. Our approach capitalises on the high quality of fungal transcript assemblies by incorporating predictions made directly from transcript sequences. Correct predictions are made despite transcript assembly problems, including those caused by overlap between the transcripts of adjacent gene loci. Stringent benchmarking against high-confidence annotation subsets showed CodingQuarry predicted 91.3% of Schizosaccharomyces pombe genes and 90.4% of Saccharomyces cerevisiae genes perfectly. These results are 4-5% better than those of AUGUSTUS, the next best performing RNA-seq driven gene predictor tested. 
Comparisons against whole genome Sc. pombe and S. cerevisiae annotations further substantiate a 4-5% improvement in the number of correctly predicted genes. We demonstrate the success of a novel method of incorporating RNA-seq data into GHMM fungal gene prediction. This shows that a high quality annotation can be achieved without relying on protein homology or a training set of genes. CodingQuarry is freely available ( https://sourceforge.net/projects/codingquarry/ ), and suitable for incorporation into genome annotation pipelines.

  13. Gene expression profiling in whole blood of patients with coronary artery disease

    PubMed Central

    Taurino, Chiara; Miller, William H.; McBride, Martin W.; McClure, John D.; Khanin, Raya; Moreno, María U.; Dymott, Jane A.; Delles, Christian; Dominiczak, Anna F.

    2010-01-01

    Owing to the dynamic nature of the transcriptome, gene expression profiling is a promising tool for discovery of disease-related genes and biological pathways. In the present study, we examined gene expression in whole blood of 12 patients with CAD (coronary artery disease) and 12 healthy control subjects. Furthermore, ten patients with CAD underwent whole-blood gene expression analysis before and after the completion of a cardiac rehabilitation programme following surgical coronary revascularization. mRNA and miRNA (microRNA) were isolated for expression profiling. Gene expression analysis identified 365 differentially expressed genes in patients with CAD compared with healthy controls (175 up- and 190 down-regulated in CAD), and 645 in CAD rehabilitation patients (196 up- and 449 down-regulated post-rehabilitation). Biological pathway analysis identified a number of canonical pathways, including oxidative phosphorylation and mitochondrial function, as being significantly and consistently modulated across the groups. Analysis of miRNA expression revealed a number of differentially expressed miRNAs, including hsa-miR-140-3p (control compared with CAD, P=0.017), hsa-miR-182 (control compared with CAD, P=0.093), hsa-miR-92a and hsa-miR-92b (post- compared with pre-exercise, P<0.01). Global analysis of predicted miRNA targets found significantly reduced expression of genes with target regions compared with those without: hsa-miR-140-3p (P=0.002), hsa-miR-182 (P=0.001), hsa-miR-92a and hsa-miR-92b (P=2.2×10^−16). In conclusion, using whole blood as a ‘surrogate tissue’ in patients with CAD, we have identified differentially expressed miRNAs, differentially regulated genes and modulated pathways which warrant further investigation in the setting of cardiovascular function. This approach may represent a novel non-invasive strategy to unravel potentially modifiable pathways and possible therapeutic targets in cardiovascular disease. PMID:20528768

  14. shinyGISPA: A web application for characterizing phenotype by gene sets using multiple omics data combinations.

    PubMed

    Dwivedi, Bhakti; Kowalski, Jeanne

    2018-01-01

    While many methods exist for integrating multi-omics data or defining gene sets, there is no one single tool that defines gene sets based on merging of multiple omics data sets. We present shinyGISPA, an open-source application with a user-friendly web-based interface to define genes according to their similarity in several molecular changes that are driving a disease phenotype. This tool was developed to help facilitate the usability of a previously published method, Gene Integrated Set Profile Analysis (GISPA), among researchers with limited computer-programming skills. The GISPA method allows the identification of multiple gene sets that may play a role in the characterization, clinical application, or functional relevance of a disease phenotype. The tool provides an automated workflow that is highly scalable and adaptable to applications that go beyond genomic data merging analysis. It is available at http://shinygispa.winship.emory.edu/shinyGISPA/.

  15. shinyGISPA: A web application for characterizing phenotype by gene sets using multiple omics data combinations

    PubMed Central

    Dwivedi, Bhakti

    2018-01-01

    While many methods exist for integrating multi-omics data or defining gene sets, there is no one single tool that defines gene sets based on merging of multiple omics data sets. We present shinyGISPA, an open-source application with a user-friendly web-based interface to define genes according to their similarity in several molecular changes that are driving a disease phenotype. This tool was developed to help facilitate the usability of a previously published method, Gene Integrated Set Profile Analysis (GISPA), among researchers with limited computer-programming skills. The GISPA method allows the identification of multiple gene sets that may play a role in the characterization, clinical application, or functional relevance of a disease phenotype. The tool provides an automated workflow that is highly scalable and adaptable to applications that go beyond genomic data merging analysis. It is available at http://shinygispa.winship.emory.edu/shinyGISPA/. PMID:29415010

  16. Gene expression analysis identify a metabolic and cell function alterations as a hallmark of obesity without metabolic syndrome in peripheral blood, a pilot study.

    PubMed

    de Luis, Daniel Antonio; Almansa, Raquel; Aller, Rocío; Izaola, Olatz; Romero, E

    2017-06-10

    Understanding molecular basis involved in overweight is an important first step in developing therapeutic pathways against excess in body weight gain. The purpose of our pilot study was to evaluate the gene expression profiles in the peripheral blood of obese patients without other metabolic complications. A sample of 17 obese patients without metabolic syndrome and 15 non obese control subjects was evaluated in a prospective way. Following 'One-Color Microarray-Based Gene Expression Analysis' protocol Version 5.7 (Agilent p/n 4140-90040), cRNA was hybridized with Whole Human Genome Oligo Microarray Kit (Agilent p/n G2519F-014850) containing 41,000+ unique human genes and transcripts. The average age of the study group was 43.6 ± 19.7 years with a sex distribution of 64.7% females and 35.3% males. No statistical differences were detected with healthy controls 41.9 ± 12.3 years with a sex distribution of 70% females and 30% males. Obese patients showed 1436 genes that were differentially expressed compared to control group. Ingenuity Pathway Analysis showed that these genes participated in 13 different categories related to metabolism and cellular functions. In the gene set of cellular function, the most important genes were C-terminal region of Nel-like molecule 1 protein (NELL1) and Pigment epithelium-derived factor (SPEDF), both genes were over-expressed. In the gene set of metabolism, insulin growth factor type 1 (IGF1), ApoA5 (apolipoprotein subtype 5), Foxo4 (Forkhead transcription factor 4), ADIPOR1 (receptor of adiponectin type 1) and AQP7 (aquaporin channel proteins7) were over expressed. Moreover, PIKFYVE (PtdIns(3) P 5-kinase), and ROCK-2 (rho-kinase II) were under expressed. We showed that PBMCs from obese subjects presented significant changes in gene expression, exhibiting 1436 differentially expressed genes compared to PBMCs from non-obese subjects. 
Furthermore, our data showed a number of genes involved in relevant processes implicated in metabolism, with genes presenting high fold-change values (up-regulation and down regulation) associated with lipid, carbohydrate and protein metabolism. Copyright © 2017 Elsevier Ltd and European Society for Clinical Nutrition and Metabolism. All rights reserved.

  17. Integrative Functional Genomics for Systems Genetics in GeneWeaver.org.

    PubMed

    Bubier, Jason A; Langston, Michael A; Baker, Erich J; Chesler, Elissa J

    2017-01-01

    The abundance of existing functional genomics studies permits an integrative approach to interpreting and resolving the results of diverse systems genetics studies. However, a major challenge lies in assembling and harmonizing heterogeneous data sets across species for facile comparison to the positional candidate genes and coexpression networks that come from systems genetic studies. GeneWeaver is an online database and suite of tools at www.geneweaver.org that allows for fast aggregation and analysis of gene set-centric data. GeneWeaver contains curated experimental data together with resource-level data such as GO annotations, MP annotations, and KEGG pathways, along with persistent stores of user entered data sets. These can be entered directly into GeneWeaver or transferred from widely used resources such as GeneNetwork.org. Data are analyzed using statistical tools and advanced graph algorithms to discover new relations, prioritize candidate genes, and generate function hypotheses. Here we use GeneWeaver to find genes common to multiple gene sets, prioritize candidate genes from a quantitative trait locus, and characterize a set of differentially expressed genes. Coupling a large multispecies repository curated and empirical functional genomics data to fast computational tools allows for the rapid integrative analysis of heterogeneous data for interpreting and extrapolating systems genetics results.
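
    One of the tasks named in this abstract, finding genes common to multiple gene sets, can be illustrated with a minimal sketch. This is not GeneWeaver's own code or API; the function name, the `min_sets` parameter, and the example gene symbols are hypothetical and chosen only for illustration.

```python
from collections import Counter


def genes_shared_across_sets(gene_sets, min_sets=2):
    """Return {gene: number of sets containing it} for genes in >= min_sets sets."""
    counts = Counter()
    for members in gene_sets.values():
        # set() ensures a gene listed twice in one set is counted once for that set
        counts.update(set(members))
    return {gene: n for gene, n in counts.items() if n >= min_sets}


# Hypothetical gene sets (placeholder symbols, not real data)
example_sets = {
    "qtl_candidates": ["Gene1", "Gene2", "Gene3"],
    "diff_expressed": ["Gene2", "Gene3", "Gene4"],
    "go_annotation": ["Gene3", "Gene5"],
}
shared = genes_shared_across_sets(example_sets, min_sets=2)
print(shared)
```

    Ranking genes by the number of sets that contain them is one simple way to prioritize candidates; the abstract itself mentions statistical tools and graph algorithms that go well beyond this counting approach.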

  18. Comparative Analysis of Syntenic Genes in Grass Genomes Reveals Accelerated Rates of Gene Structure and Coding Sequence Evolution in Polyploid Wheat

    PubMed Central

    Akhunov, Eduard D.; Sehgal, Sunish; Liang, Hanquan; Wang, Shichen; Akhunova, Alina R.; Kaur, Gaganpreet; Li, Wanlong; Forrest, Kerrie L.; See, Deven; Šimková, Hana; Ma, Yaqin; Hayden, Matthew J.; Luo, Mingcheng; Faris, Justin D.; Doležel, Jaroslav; Gill, Bikram S.

    2013-01-01

    Cycles of whole-genome duplication (WGD) and diploidization are hallmarks of eukaryotic genome evolution and speciation. Polyploid wheat (Triticum aestivum) has had a massive increase in genome size largely due to recent WGDs. How these processes may impact the dynamics of gene evolution was studied by comparing the patterns of gene structure changes, alternative splicing (AS), and codon substitution rates among wheat and model grass genomes. In orthologous gene sets, significantly more acquired and lost exonic sequences were detected in wheat than in model grasses. In wheat, 35% of these gene structure rearrangements resulted in frame-shift mutations and premature termination codons. An increased codon mutation rate in the wheat lineage compared with Brachypodium distachyon was found for 17% of orthologs. The discovery of premature termination codons in 38% of expressed genes was consistent with ongoing pseudogenization of the wheat genome. The rates of AS within the individual wheat subgenomes (21%–25%) were similar to diploid plants. However, we uncovered a high level of AS pattern divergence between the duplicated homeologous copies of genes. Our results are consistent with the accelerated accumulation of AS isoforms, nonsynonymous mutations, and gene structure rearrangements in the wheat lineage, likely due to genetic redundancy created by WGDs. Whereas these processes mostly contribute to the degeneration of a duplicated genome and its diploidization, they have the potential to facilitate the origin of new functional variations, which, upon selection in the evolutionary lineage, may play an important role in the origin of novel traits. PMID:23124323

  19. Comparing the transcriptomes of embryos from domesticated and wild Atlantic salmon (Salmo salar L.) stocks and examining factors that influence heritability of gene expression.

    PubMed

    Bicskei, Beatrix; Taggart, John B; Glover, Kevin A; Bron, James E

    2016-03-17

    Due to selective breeding, domesticated and wild Atlantic salmon are genetically diverged, which raises concerns about farmed escapees having the potential to alter the genetic composition of wild populations and thereby disrupting local adaptation. Documenting transcriptional differences between wild and domesticated stocks under controlled conditions is one way to explore the consequences of domestication and selection. We compared the transcriptomes of wild and domesticated Atlantic salmon embryos, by using a custom 44k oligonucleotide microarray to identify perturbed gene pathways between the two stocks, and to document the inheritance patterns of differentially-expressed genes by examining gene expression in their reciprocal hybrids. Data from 24 array interrogations were analysed: four reciprocal cross types (W♀ × W♂, D♀ × W♂; W♀ × D♂, D♀ × D♂) × six biological replicates. A common set of 31,491 features on the microarrays passed quality control, of which about 62 % were assigned a KEGG Orthology number. A total of 6037 distinct genes were identified for gene-set enrichment/pathway analysis. The most highly enriched functional groups that were perturbed between the two stocks were cellular signalling and immune system, ribosome and RNA transport, and focal adhesion and gap junction pathways, relating to cell communication and cell adhesion molecules. Most transcripts that were differentially expressed between the stocks were governed by additive gene interaction (33 to 42 %). Maternal dominance and over-dominance were also prevalent modes of inheritance, with no convincing evidence for a stock effect. Our data indicate that even at this relatively early developmental stage, transcriptional differences exist between the two stocks and affect pathways that are relevant to wild versus domesticated environments. 
Many of the identified differentially perturbed pathways are involved in organogenesis, which is expected to be an active process at the eyed egg stage. The dominant effects are more largely due to the maternal line than to the origin of the stock. This finding is particularly relevant in the context of potential introgression between farmed and wild fish, since female escapees tend to have a higher spawning success rate compared to males.

  20. Redundancy control in pathway databases (ReCiPa): an application for improving gene-set enrichment analysis in Omics studies and "Big data" biology.

    PubMed

    Vivar, Juan C; Pemu, Priscilla; McPherson, Ruth; Ghosh, Sujoy

    2013-08-01

    Unparalleled technological advances have fueled an explosive growth in the scope and scale of biological data and have propelled life sciences into the realm of "Big Data" that cannot be managed or analyzed by conventional approaches. Big Data in the life sciences are driven primarily via a diverse collection of 'omics'-based technologies, including genomics, proteomics, metabolomics, transcriptomics, metagenomics, and lipidomics. Gene-set enrichment analysis is a powerful approach for interrogating large 'omics' datasets, leading to the identification of biological mechanisms associated with observed outcomes. While several factors influence the results from such analysis, the impact from the contents of pathway databases is often under-appreciated. Pathway databases often contain variously named pathways that overlap with one another to varying degrees. Ignoring such redundancies during pathway analysis can lead to the designation of several pathways as being significant due to high content-similarity, rather than truly independent biological mechanisms. Statistically, such dependencies also result in correlated p values and overdispersion, leading to biased results. We investigated the level of redundancies in multiple pathway databases and observed large discrepancies in the nature and extent of pathway overlap. This prompted us to develop the application, ReCiPa (Redundancy Control in Pathway Databases), to control redundancies in pathway databases based on user-defined thresholds. Analysis of genomic and genetic datasets, using ReCiPa-generated overlap-controlled versions of KEGG and Reactome pathways, led to a reduction in redundancy among the top-scoring gene-sets and allowed for the inclusion of additional gene-sets representing possibly novel biological mechanisms. 
Using obesity as an example, bioinformatic analysis further demonstrated that gene-sets identified from overlap-controlled pathway databases show stronger evidence of prior association to obesity compared to pathways identified from the original databases.
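
    The idea of controlling pathway redundancy with a user-defined threshold can be sketched as follows. The abstract does not describe ReCiPa's actual procedure, so this is only an assumed, simplified version: pathways are merged greedily whenever the fraction of shared genes (in either direction) reaches a single hypothetical `threshold`. The function names and example pathways are illustrative, not taken from the tool.

```python
def overlap_fraction(a, b):
    """Fraction of the genes in set `a` that are also present in set `b`."""
    return len(a & b) / len(a) if a else 0.0


def merge_redundant_pathways(pathways, threshold=0.8):
    """Greedily merge pathway pairs whose overlap fraction is at least `threshold`.

    Assumption: overlap is checked in both directions and the larger value is used,
    so a small pathway fully contained in a larger one counts as redundant.
    """
    merged = {name: set(genes) for name, genes in pathways.items()}
    changed = True
    while changed:
        changed = False
        names = sorted(merged)
        for i, first in enumerate(names):
            for second in names[i + 1:]:
                a, b = merged[first], merged[second]
                if max(overlap_fraction(a, b), overlap_fraction(b, a)) >= threshold:
                    # Replace the two redundant pathways with their union
                    del merged[first], merged[second]
                    merged[first + "+" + second] = a | b
                    changed = True
                    break
            if changed:
                break  # restart the scan with the updated pathway collection
    return merged


# Hypothetical pathways: B is fully contained in A, C is unrelated
example_pathways = {
    "A": {"g1", "g2", "g3", "g4"},
    "B": {"g1", "g2", "g3"},
    "C": {"g7", "g8"},
}
controlled = merge_redundant_pathways(example_pathways, threshold=0.8)
print(sorted(controlled))
```

    Lowering the threshold merges more aggressively and leaves fewer, broader gene sets; raising it keeps more of the original pathways distinct.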

  1. Gene selection for tumor classification using neighborhood rough sets and entropy measures.

    PubMed

    Chen, Yumin; Zhang, Zunjun; Zheng, Jianzhong; Ma, Ying; Xue, Yu

    2017-03-01

    With the development of bioinformatics, tumor classification from gene expression data becomes an important useful technology for cancer diagnosis. Since a gene expression data often contains thousands of genes and a small number of samples, gene selection from gene expression data becomes a key step for tumor classification. Attribute reduction of rough sets has been successfully applied to gene selection field, as it has the characters of data driving and requiring no additional information. However, traditional rough set method deals with discrete data only. As for the gene expression data containing real-value or noisy data, they are usually employed by a discrete preprocessing, which may result in poor classification accuracy. In this paper, we propose a novel gene selection method based on the neighborhood rough set model, which has the ability of dealing with real-value data whilst maintaining the original gene classification information. Moreover, this paper addresses an entropy measure under the frame of neighborhood rough sets for tackling the uncertainty and noisy of gene expression data. The utilization of this measure can bring about a discovery of compact gene subsets. Finally, a gene selection algorithm is designed based on neighborhood granules and the entropy measure. Some experiments on two gene expression data show that the proposed gene selection is an effective method for improving the accuracy of tumor classification. Copyright © 2017 Elsevier Inc. All rights reserved.

  2. Conditional entropy in variation-adjusted windows detects selection signatures associated with expression quantitative trait loci (eQTLs)

    PubMed Central

    2015-01-01

    Background: Over the past 50,000 years, shifts in human-environmental or human-human interactions shaped genetic differences within and among human populations, including variants under positive selection. Shaped by environmental factors, such variants influence the genetics of modern health, disease, and treatment outcome. Because evolutionary processes tend to act on gene regulation, we test whether regulatory variants are under positive selection. We introduce a new approach to enhance detection of genetic markers undergoing positive selection, using conditional entropy to capture recent local selection signals. Results: We use conditional logistic regression to compare our Adjusted Haplotype Conditional Entropy (H|H) measure of positive selection to existing positive selection measures. H|H and existing measures were applied to published regulatory variants acting in cis (cis-eQTLs), with conditional logistic regression testing whether regulatory variants undergo stronger positive selection than the surrounding gene. These cis-eQTLs were drawn from six independent studies of genotype and RNA expression. The conditional logistic regression shows that, overall, H|H is substantially more powerful than existing positive-selection methods in identifying cis-eQTLs against other Single Nucleotide Polymorphisms (SNPs) in the same genes. When broken down by Gene Ontology, H|H predictions are particularly strong in some biological process categories, where regulatory variants are under strong positive selection compared to the bulk of the gene, distinct from those GO categories under overall positive selection. However, cis-eQTLs in a second group of genes lack positive selection signatures detectable by H|H, consistent with ancient short haplotypes compared to the surrounding gene (for example, in innate immunity GO:0042742); under such other modes of selection, H|H would not be expected to be a strong predictor. 
These conditional logistic regression models are adjusted for minor allele frequency (MAF); otherwise, ascertainment bias is a huge factor in all eQTL data sets. Relationships between Gene Ontology categories, positive selection and eQTL specificity were replicated with H|H in a single larger data set. Our measure, Adjusted Haplotype Conditional Entropy (H|H), was essential in generating all of the results above because it: 1) is a stronger overall predictor for eQTLs than comparable existing approaches, and 2) shows low sequential auto-correlation, overcoming problems with convergence of these conditional regression statistical models. Conclusions: Our new method, H|H, provides a consistently more robust signal associated with cis-eQTLs compared to existing methods. We interpret this to indicate that some cis-eQTLs are under positive selection compared to their surrounding genes. Conditional entropy indicative of a selective sweep is an especially strong predictor of eQTLs for genes in several biological processes of medical interest. Where conditional entropy is a weak or negative predictor of eQTLs, such as innate immune genes, this would be consistent with balancing selection acting on such eQTLs over long time periods. Different measures of selection may be needed for variant prioritization under other modes of evolutionary selection. PMID:26111110
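
    For readers unfamiliar with the quantity underlying this measure, the sketch below computes plain Shannon conditional entropy H(Y|X) from paired observations. It is not the authors' Adjusted Haplotype Conditional Entropy (H|H): the variation-adjusted windows and haplotype handling described in the abstract are not reproduced, and the function name and toy allele pairs are hypothetical.

```python
import math
from collections import Counter


def conditional_entropy(pairs):
    """Estimate H(Y|X) in bits from an iterable of observed (x, y) pairs."""
    pairs = list(pairs)
    n = len(pairs)
    joint = Counter(pairs)                      # counts of each (x, y) combination
    marginal_x = Counter(x for x, _ in pairs)   # counts of each x value
    h = 0.0
    for (x, _), count in joint.items():
        # Each term is p(x, y) * log2(p(x) / p(x, y))
        h += (count / n) * math.log2(marginal_x[x] / count)
    return h


# Toy data: allele at one site (x) paired with the allele at a neighbouring site (y)
toy_pairs = [("A", "A"), ("A", "G"), ("G", "G"), ("G", "G")]
h_toy = conditional_entropy(toy_pairs)
print(h_toy)
```

    Low conditional entropy means the neighbouring allele is largely determined by the first, which is the kind of reduced local diversity the abstract associates with a selective sweep.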

  3. Involvement of astrocyte metabolic coupling in Tourette syndrome pathogenesis.

    PubMed

    de Leeuw, Christiaan; Goudriaan, Andrea; Smit, August B; Yu, Dongmei; Mathews, Carol A; Scharf, Jeremiah M; Verheijen, Mark H G; Posthuma, Danielle

    2015-11-01

    Tourette syndrome is a heritable neurodevelopmental disorder whose pathophysiology remains unknown. Recent genome-wide association studies suggest that it is a polygenic disorder influenced by many genes of small effect. We tested whether these genes cluster in cellular function by applying gene-set analysis using expert curated sets of brain-expressed genes in the current largest available Tourette syndrome genome-wide association data set, involving 1285 cases and 4964 controls. The gene sets included specific synaptic, astrocytic, oligodendrocyte and microglial functions. We report association of Tourette syndrome with a set of genes involved in astrocyte function, specifically in astrocyte carbohydrate metabolism. This association is driven primarily by a subset of 33 genes involved in glycolysis and glutamate metabolism through which astrocytes support synaptic function. Our results indicate for the first time that the process of astrocyte-neuron metabolic coupling may be an important contributor to Tourette syndrome pathogenesis.

  4. Involvement of astrocyte metabolic coupling in Tourette syndrome pathogenesis

    PubMed Central

    de Leeuw, Christiaan; Goudriaan, Andrea; Smit, August B; Yu, Dongmei; Mathews, Carol A; Scharf, Jeremiah M; Scharf, J M; Pauls, D L; Yu, D; Illmann, C; Osiecki, L; Neale, B M; Mathews, C A; Reus, V I; Lowe, T L; Freimer, N B; Cox, N J; Davis, L K; Rouleau, G A; Chouinard, S; Dion, Y; Girard, S; Cath, D C; Posthuma, D; Smit, J H; Heutink, P; King, R A; Fernandez, T; Leckman, J F; Sandor, P; Barr, C L; McMahon, W; Lyon, G; Leppert, M; Morgan, J; Weiss, R; Grados, M A; Singer, H; Jankovic, J; Tischfield, J A; Heiman, G A; Verheijen, Mark H G; Posthuma, Danielle

    2015-01-01

    Tourette syndrome is a heritable neurodevelopmental disorder whose pathophysiology remains unknown. Recent genome-wide association studies suggest that it is a polygenic disorder influenced by many genes of small effect. We tested whether these genes cluster in cellular function by applying gene-set analysis using expert curated sets of brain-expressed genes in the current largest available Tourette syndrome genome-wide association data set, involving 1285 cases and 4964 controls. The gene sets included specific synaptic, astrocytic, oligodendrocyte and microglial functions. We report association of Tourette syndrome with a set of genes involved in astrocyte function, specifically in astrocyte carbohydrate metabolism. This association is driven primarily by a subset of 33 genes involved in glycolysis and glutamate metabolism through which astrocytes support synaptic function. Our results indicate for the first time that the process of astrocyte-neuron metabolic coupling may be an important contributor to Tourette syndrome pathogenesis. PMID:25735483

  5. Evolutionary Genomics of Genes Involved in Olfactory Behavior in the Drosophila melanogaster Species Group

    PubMed Central

    Lavagnino, Nicolás; Serra, François; Arbiza, Leonardo; Dopazo, Hernán; Hasson, Esteban

    2012-01-01

    Previous comparative genomic studies of genes involved in olfactory behavior in Drosophila focused only on particular gene families such as odorant receptor and/or odorant binding proteins. However, olfactory behavior has a complex genetic architecture that is orchestrated by many interacting genes. In this paper, we present a comparative genomic study of olfactory behavior in Drosophila including an extended set of genes known to affect olfactory behavior. We took advantage of the recent burst of whole genome sequences and the development of powerful statistical tools to analyze genomic data and test evolutionary and functional hypotheses of olfactory genes in the six species of the Drosophila melanogaster species group for which whole genome sequences are available. Our study reveals widespread purifying selection and limited incidence of positive selection on olfactory genes. We show that the pace of evolution of olfactory genes is mostly independent of the life cycle stage, and of the number of life cycle stages, in which they participate in olfaction. However, we detected a relationship between evolutionary rates and the position that the gene products occupy in the olfactory system, genes occupying central positions tend to be more constrained than peripheral genes. Finally, we demonstrate that specialization to one host does not seem to be associated with bursts of adaptive evolution in olfactory genes in D. sechellia and D. erecta, the two specialists species analyzed, but rather different lineages have idiosyncratic evolutionary histories in which both historical and ecological factors have been involved. PMID:22346339

  6. SYBR green-based real-time reverse transcription-PCR for typing and subtyping of all hemagglutinin and neuraminidase genes of avian influenza viruses and comparison to standard serological subtyping tests.

    PubMed

    Tsukamoto, Kenji; Panei, Carlos Javier; Javier, Panei Carlos; Shishido, Makiko; Noguchi, Daigo; Pearce, John; Kang, Hyun-Mi; Jeong, Ok Mi; Lee, Youn-Jeong; Nakanishi, Koji; Ashizawa, Takayoshi

    2012-01-01

    Continuing outbreaks of H5N1 highly pathogenic (HP) avian influenza virus (AIV) infections of wild birds and poultry worldwide emphasize the need for global surveillance of wild birds. To support the future surveillance activities, we developed a SYBR green-based, real-time reverse transcriptase PCR (rRT-PCR) for detecting nucleoprotein (NP) genes and subtyping 16 hemagglutinin (HA) and 9 neuraminidase (NA) genes simultaneously. Primers were improved by focusing on Eurasian or North American lineage genes; the number of mixed-base positions per primer was set to five or fewer, and the concentration of each primer set was optimized empirically. Also, 30 cycles of amplification of 1:10 dilutions of cDNAs from cultured viruses effectively reduced minor cross- or nonspecific reactions. Under these conditions, 346 HA and 345 NA genes of 349 AIVs were detected, with average sensitivities of NP, HA, and NA genes of 10^1.5, 10^2.3, and 10^3.1 50% egg infective doses, respectively. Utility of rRT-PCR for subtyping AIVs was compared with that of current standard serological tests by using 104 recent migratory duck virus isolates. As a result, all HA genes and 99% of the NA genes were genetically subtyped, while only 45% of HA genes and 74% of NA genes were serologically subtyped. Additionally, direct subtyping of AIVs in fecal samples was possible by 40 cycles of amplification: approximately 70% of HA and NA genes of NP gene-positive samples were successfully subtyped. This validation study indicates that rRT-PCR with optimized primers and reaction conditions is a powerful tool for subtyping varied AIVs in clinical and cultured samples.

  7. Comparative functional characterization of the CSR-1 22G-RNA pathway in Caenorhabditis nematodes

    PubMed Central

    Tu, Shikui; Wu, Monica Z.; Wang, Jie; Cutter, Asher D.; Weng, Zhiping; Claycomb, Julie M.

    2015-01-01

    As a champion of small RNA research for two decades, Caenorhabditis elegans has revealed the essential Argonaute CSR-1 to play key nuclear roles in modulating chromatin, chromosome segregation and germline gene expression via 22G-small RNAs. Despite CSR-1 being preserved among diverse nematodes, the conservation and divergence in function of the targets of small RNA pathways remains poorly resolved. Here we apply comparative functional genomic analysis between C. elegans and Caenorhabditis briggsae to characterize the CSR-1 pathway, its targets and their evolution. C. briggsae CSR-1-associated small RNAs that we identified by immunoprecipitation-small RNA sequencing overlap with 22G-RNAs depleted in cbr-csr-1 RNAi-treated worms. By comparing 22G-RNAs and target genes between species, we defined a set of CSR-1 target genes with conserved germline expression, enrichment in operons and more slowly evolving coding sequences than other genes, along with a small group of evolutionarily labile targets. We demonstrate that the association of CSR-1 with chromatin is preserved, and show that depletion of cbr-csr-1 leads to chromosome segregation defects and embryonic lethality. This first comparative characterization of a small RNA pathway in Caenorhabditis establishes a conserved nuclear role for CSR-1 and highlights its key role in germline gene regulation across multiple animal species. PMID:25510497

  8. Lineage-Specific Evolutionary Histories and Regulation of Major Starch Metabolism Genes during Banana Ripening

    PubMed Central

    Jourda, Cyril; Cardi, Céline; Gibert, Olivier; Giraldo Toro, Andrès; Ricci, Julien; Mbéguié-A-Mbéguié, Didier; Yahiaoui, Nabila

    2016-01-01

    Starch is the most widespread and abundant storage carbohydrate in plants. It is also a major feature of cultivated bananas as it accumulates to large amounts during banana fruit development before almost complete conversion to soluble sugars during ripening. Little is known about the structure of major gene families involved in banana starch metabolism and their evolution compared to other species. To identify genes involved in banana starch metabolism and investigate their evolutionary history, we analyzed six gene families playing a crucial role in plant starch biosynthesis and degradation: the ADP-glucose pyrophosphorylases (AGPases), starch synthases (SS), starch branching enzymes (SBE), debranching enzymes (DBE), α-amylases (AMY) and β-amylases (BAM). Using comparative genomics and phylogenetic approaches, these genes were classified into families and sub-families and orthology relationships with functional genes in Eudicots and in grasses were identified. In addition to known ancestral duplications shaping starch metabolism gene families, independent evolution in banana and grasses also occurred through lineage-specific whole genome duplications for specific sub-families of AGPase, SS, SBE, and BAM genes; and through gene-scale duplications for AMY genes. In particular, banana lineage duplications yielded a set of AGPase, SBE and BAM genes that were highly or specifically expressed in banana fruits. Gene expression analysis highlighted a complex transcriptional reprogramming of starch metabolism genes during ripening of banana fruits. A differential regulation of expression between banana gene duplicates was identified for SBE and BAM genes, suggesting that part of starch metabolism regulation in the fruit evolved in the banana lineage. PMID:27994606

  9. Gene-expression profiles of epithelial cells treated with EMD in vitro: analysis using complementary DNA arrays.

    PubMed

    Kapferer, I; Schmidt, S; Gstir, R; Durstberger, G; Huber, L A; Vietor, I

    2011-02-01

    During surgical periodontal treatment, EMD is topically applied in order to facilitate regeneration of the periodontal ligament, acellular cementum and alveolar bone. Suppression of epithelial down-growth is essential for successful periodontal regeneration; however, the underlying mechanisms of how EMD influences epithelial wound healing are poorly understood. In the present study, the effects of EMD on gene-expression profiling in an epithelial cell line (HSC-2) model were investigated. Gene-expression modifications, determined using a comparative genome-wide expression-profiling strategy, were independently validated by quantitative real-time RT-PCR. Additionally, cell cycle, cell growth and in vitro wound-healing assays were conducted. A set of 43 EMD-regulated genes was defined, which may be responsible for the reduced epithelial down-growth upon EMD application. Gene ontology analysis revealed genes that could be attributed to pathways of locomotion, developmental processes and associated processes such as regulation of cell size and cell growth. Additionally, eight regulated genes have previously been reported to take part in the process of epithelial-to-mesenchymal transition. Several independent experimental assays revealed significant inhibition of cell migration, growth and cell cycle by EMD. The set of EMD-regulated genes identified in this study offers the opportunity to clarify mechanisms underlying the effects of EMD on epithelial cells. Reduced epithelial repopulation of the dental root upon periodontal surgery may be the consequence of reduced migration and cell growth, as well as epithelial-to-mesenchymal transition. © 2010 John Wiley & Sons A/S.

  10. Scuba: scalable kernel-based gene prioritization.

    PubMed

    Zampieri, Guido; Tran, Dinh Van; Donini, Michele; Navarin, Nicolò; Aiolli, Fabio; Sperduti, Alessandro; Valle, Giorgio

    2018-01-25

    The uncovering of genes linked to human diseases is a pressing challenge in molecular biology and precision medicine. This task is often hindered by the large number of candidate genes and by the heterogeneity of the available information. Computational methods for the prioritization of candidate genes can help to cope with these problems. In particular, kernel-based methods are a powerful resource for the integration of heterogeneous biological knowledge; however, their practical implementation is often precluded by their limited scalability. We propose Scuba, a scalable kernel-based method for gene prioritization. It implements a novel multiple kernel learning approach, based on a semi-supervised perspective and on the optimization of the margin distribution. Scuba is optimized to cope with strongly unbalanced settings where known disease genes are few and large scale predictions are required. Importantly, it is able to efficiently deal both with a large number of candidate genes and with an arbitrary number of data sources. As a direct consequence of scalability, Scuba also integrates a new efficient strategy to select optimal kernel parameters for each data source. We performed cross-validation experiments and simulated a realistic usage setting, showing that Scuba outperforms a wide range of state-of-the-art methods. Scuba achieves state-of-the-art performance and has enhanced scalability compared to existing kernel-based approaches for genomic data. This method can be useful to prioritize candidate genes, particularly when their number is large or when input data is highly heterogeneous. The code is freely available at https://github.com/gzampieri/Scuba .

  11. Case-based retrieval framework for gene expression data.

    PubMed

    Anaissi, Ali; Goyal, Madhu; Catchpoole, Daniel R; Braytee, Ali; Kennedy, Paul J

    2015-01-01

    The process of retrieving similar cases in a case-based reasoning system is considered a big challenge for gene expression data sets. The huge number of gene expression values generated by microarray technology leads to complex data sets and similarity measures for high-dimensional data are problematic. Hence, gene expression similarity measurements require numerous machine-learning and data-mining techniques, such as feature selection and dimensionality reduction, to be incorporated into the retrieval process. This article proposes a case-based retrieval framework that uses a k-nearest-neighbor classifier with a weighted-feature-based similarity to retrieve previously treated patients based on their gene expression profiles. The herein-proposed methodology is validated on several data sets: a childhood leukemia data set collected from The Children's Hospital at Westmead, as well as the Colon cancer, the National Cancer Institute (NCI), and the Prostate cancer data sets. Results obtained by the proposed framework in retrieving patients of the data sets who are similar to new patients are as follows: 96% accuracy on the childhood leukemia data set, 95% on the NCI data set, 93% on the Colon cancer data set, and 98% on the Prostate cancer data set. The designed case-based retrieval framework is an appropriate choice for retrieving previous patients who are similar to a new patient, on the basis of their gene expression data, for better diagnosis and treatment of childhood leukemia. Moreover, this framework can be applied to other gene expression data sets using some or all of its steps.
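
    The retrieval step described above can be sketched as a nearest-neighbor search under a feature-weighted distance. This is a minimal illustration only, not the authors' implementation: the function names, the weighted Euclidean distance, and the toy expression values and weights are all assumptions made for the example.

```python
import math

def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two expression profiles."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

def retrieve_similar_cases(case_base, query, weights, k=3):
    """Return the indices of the k stored cases closest to the query profile."""
    scored = [(weighted_distance(profile, query, weights), i)
              for i, profile in enumerate(case_base)]
    scored.sort()  # smallest distance first; ties broken by case index
    return [i for _, i in scored[:k]]

# Hypothetical case base: four patients, three genes each.
cases = [
    [1.0, 0.0, 5.0],
    [0.9, 0.1, 0.0],
    [5.0, 5.0, 5.0],
    [1.1, 0.0, 9.0],
]
# Hypothetical feature weights: the third gene is treated as uninformative.
weights = [1.0, 1.0, 0.0]
query = [1.0, 0.0, 2.0]

nearest = retrieve_similar_cases(cases, query, weights, k=2)
print(nearest)
```

    Setting a weight to zero removes that gene from the comparison entirely, which is one simple way a feature-selection step could feed into the similarity measure.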

  12. How powerful are summary-based methods for identifying expression-trait associations under different genetic architectures?

    PubMed

    Veturi, Yogasudha; Ritchie, Marylyn D

    2018-01-01

    Transcriptome-wide association studies (TWAS) have recently been employed as an approach that can draw upon the advantages of genome-wide association studies (GWAS) and gene expression studies to identify genes associated with complex traits. Unlike standard GWAS, summary level data suffices for TWAS and offers improved statistical power. Two popular TWAS methods include either (a) imputing the cis genetic component of gene expression from smaller sized studies (using multi-SNP prediction or MP) into much larger effective sample sizes afforded by GWAS - TWAS-MP or (b) using summary-based Mendelian randomization - TWAS-SMR. Although these methods have been effective at detecting functional variants, it remains unclear how extensive variability in the genetic architecture of complex traits and diseases impacts TWAS results. Our goal was to investigate the different scenarios under which these methods yielded enough power to detect significant expression-trait associations. In this study, we conducted extensive simulations based on 6000 randomly chosen, unrelated Caucasian males from Geisinger's MyCode population to compare the power to detect cis expression-trait associations (within 500 kb of a gene) using the above-described approaches. To test TWAS across varying genetic backgrounds we simulated gene expression and phenotype using different quantitative trait loci per gene and cis-expression/trait heritability under genetic models that differentiate the effect of causality from that of pleiotropy. For each gene, on a training set ranging from 100 to 1000 individuals, we either (a) estimated regression coefficients with gene expression as the response using five different methods: LASSO, elastic net, Bayesian LASSO, Bayesian spike-slab, and Bayesian ridge regression or (b) performed eQTL analysis. 
We then sampled with replacement 50,000, 150,000, and 300,000 individuals respectively from the testing set of the remaining 5000 individuals and conducted GWAS on each set. Subsequently, we integrated the GWAS summary statistics derived from the testing set with the weights (or eQTLs) derived from the training set to identify expression-trait associations using (a) TWAS-MP (b) TWAS-SMR (c) eQTL-based GWAS, or (d) standalone GWAS. Finally, we examined the power to detect functionally relevant genes using the different approaches under the considered simulation scenarios. In general, we observed great similarities among TWAS-MP methods although the Bayesian methods resulted in improved power in comparison to LASSO and elastic net as the trait architecture grew more complex while training sample sizes and expression heritability remained small. Finally, we observed high power under causality but very low to moderate power under pleiotropy.

  13. Latent Gammaherpesvirus 68 Infection Induces Distinct Transcriptional Changes in Different Organs

    PubMed Central

    Canny, Susan P.; Goel, Gautam; Reese, Tiffany A.; Zhang, Xin; Xavier, Ramnik

    2014-01-01

    Previous studies identified a role for latent herpesvirus infection in cross-protection against infection and exacerbation of chronic inflammatory diseases. Here, we identified more than 500 genes differentially expressed in spleens, livers, or brains of mice latently infected with gammaherpesvirus 68 and found that distinct sets of genes linked to different pathways were altered in the spleen compared to those in the liver. Several of the most differentially expressed latency-specific genes (e.g., the gamma interferon [IFN-γ], Cxcl9, and Ccl5 genes) are associated with known latency-specific phenotypes. Chronic herpesvirus infection, therefore, significantly alters the transcriptional status of host organs. We speculate that such changes may influence host physiology, the status of the immune system, and disease susceptibility. PMID:24155394

  14. Comparative genomics of the lactic acid bacteria

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Makarova, K.; Slesarev, A.; Wolf, Y.

    Lactic acid-producing bacteria are associated with various plant and animal niches and play a key role in the production of fermented foods and beverages. We report nine genome sequences representing the phylogenetic and functional diversity of these bacteria. The small genomes of lactic acid bacteria encode a broad repertoire of transporters for efficient carbon and nitrogen acquisition from the nutritionally rich environments they inhabit and reflect a limited range of biosynthetic capabilities that indicate both prototrophic and auxotrophic strains. Phylogenetic analyses, comparison of gene content across the group, and reconstruction of ancestral gene sets indicate a combination of extensive gene loss and key gene acquisitions via horizontal gene transfer during the coevolution of lactic acid bacteria with their habitats.

  15. Learning Abilities and Disabilities: Generalist Genes, Specialist Environments

    PubMed Central

    Kovas, Yulia; Plomin, Robert

    2007-01-01

    Twin studies comparing identical and fraternal twins consistently show substantial genetic influence on individual differences in learning abilities such as reading and mathematics, as well as in other cognitive abilities such as spatial ability and memory. Multivariate genetic research has shown that the same set of genes is largely responsible for genetic influence on these diverse cognitive areas. We call these “generalist genes.” What differentiates these abilities is largely the environment, especially nonshared environments that make children growing up in the same family different from one another. These multivariate genetic findings of generalist genes and specialist environments have far-reaching implications for diagnosis and treatment of learning disabilities and for understanding the brain mechanisms that mediate these effects. PMID:20351764

  16. Limnobacter spp. as newly detected phenol-degraders among Baltic Sea surface water bacteria characterised by comparative analysis of catabolic genes.

    PubMed

    Vedler, Eve; Heinaru, Eeva; Jutkina, Jekaterina; Viggor, Signe; Koressaar, Triinu; Remm, Maido; Heinaru, Ain

    2013-12-01

    A set of phenol-degrading strains of a collection of bacteria isolated from Baltic Sea surface water was screened for the presence of two key catabolic genes coding for phenol hydroxylases and catechol 2,3-dioxygenases. The multicomponent phenol hydroxylase (LmPH) gene was detected in 70 out of 92 strains studied, and 41 strains among these LmPH(+) phenol-degraders were found to exhibit catechol 2,3-dioxygenase (C23O) activity. Comparative phylogenetic analyses of LmPH and C23O sequences from 56 representative strains were performed. The studied strains were mostly affiliated to the genera Pseudomonas and Acinetobacter. However, the study also widened the range of phenol-degraders by including the genus Limnobacter. Furthermore, using a next generation sequencing approach, the LmPH genes of Limnobacter strains were found to be the most prevalent ones in the microbial community of the Baltic Sea surface water. Four different Limnobacter strains having almost identical 16S rRNA gene sequences (99%) and similar physiological properties formed separate phylogenetic clusters of LmPH and C23O genes in the respective phylogenetic trees. Copyright © 2013 Elsevier GmbH. All rights reserved.

  17. The evolution of mollusc shells.

    PubMed

    McDougall, Carmel; Degnan, Bernard M

    2018-05-01

    Molluscan shells are externally fabricated by specialized epithelial cells on the dorsal mantle. Although a conserved set of regulatory genes appears to underlie specification of mantle progenitor cells, the genes that contribute to the formation of the mature shell are incredibly diverse. Recent comparative analyses of mantle transcriptomes and shell proteomes of gastropods and bivalves are consistent with shell diversity being underpinned by a rapidly evolving mantle secretome (suite of genes expressed in the mantle that encode secreted proteins) that is the product of (a) high rates of gene co-option into and loss from the mantle gene regulatory network, and (b) the rapid evolution of coding sequences, particularly those encoding repetitive low complexity domains. Outside a few conserved genes, such as carbonic anhydrase, a so-called "biomineralization toolkit" has yet to be discovered. Despite this, a common suite of protein domains, which are often associated with the extracellular matrix and immunity, appear to have been independently and often uniquely co-opted into the mantle secretomes of different species. The evolvability of the mantle secretome provides a molecular explanation for the evolution and diversity of molluscan shells. These genomic processes are likely to underlie the evolution of other animal biominerals, including coral and echinoderm skeletons. This article is categorized under: Comparative Development and Evolution > Regulation of Organ Diversity Comparative Development and Evolution > Evolutionary Novelties. © 2018 Wiley Periodicals, Inc.

  18. CYP1A1, GCLC, AGT, AGTR1 gene-gene interactions in community-acquired pneumonia pulmonary complications.

    PubMed

    Salnikova, Lyubov E; Smelaya, Tamara V; Golubev, Arkadiy M; Rubanovich, Alexander V; Moroz, Viktor V

    2013-11-01

    This study was conducted to establish the possible contribution of functional gene polymorphisms in detoxification/oxidative stress and vascular remodeling pathways to community-acquired pneumonia (CAP) susceptibility in the case-control study (350 CAP patients, 432 control subjects) and to predisposition to the development of CAP complications in the prospective study. All subjects were genotyped for 16 polymorphic variants in the 14 genes of xenobiotics detoxification CYP1A1, AhR, GSTM1, GSTT1, ABCB1, redox-status SOD2, CAT, GCLC, and vascular homeostasis ACE, AGT, AGTR1, NOS3, MTHFR, VEGFα. Risk of pulmonary complications (PC) in the single locus analysis was associated with CYP1A1, GCLC and AGTR1 genes. Extra PC (toxic shock syndrome and myocarditis) were not associated with these genes. We evaluated gene-gene interactions using multi-factor dimensionality reduction, and cumulative gene risk score approaches. The final model which included >5 risk alleles in the CYP1A1 (rs2606345, rs4646903, rs1048943), GCLC, AGT, and AGTR1 genes was associated with pleuritis, empyema, acute respiratory distress syndrome, all PC and acute respiratory failure (ARF). We considered CYP1A1, GCLC, AGT, AGTR1 gene set using Set Distiller mode implemented in GeneDecks for discovering gene-set relations via the degree of sharing descriptors within a given gene set. N-acetylcysteine and oxygen were defined by Set Distiller as the best descriptors for the gene set associated in the present study with PC and ARF. Results of the study are in line with literature data and suggest that genetically determined oxidative stress exacerbation may contribute to the progression of lung inflammation.
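
    A cumulative gene risk score of the kind mentioned above can be sketched as a count of risk alleles across the loci in the final model, compared against the ">5 risk alleles" cut-off. The genotype coding (0, 1, or 2 risk alleles per locus), the function names, and the two example patients are hypothetical; the abstract does not give these details.

```python
# Loci in the final model as listed in the abstract. The dictionary keys are
# illustrative labels, not identifiers taken from the study's data files.
LOCI = [
    "CYP1A1 rs2606345",
    "CYP1A1 rs4646903",
    "CYP1A1 rs1048943",
    "GCLC",
    "AGT",
    "AGTR1",
]

def cumulative_risk_score(genotype):
    """Sum the risk-allele counts (assumed coded 0, 1, or 2) over all loci."""
    for locus in LOCI:
        if genotype[locus] not in (0, 1, 2):
            raise ValueError(f"unexpected allele count at {locus}")
    return sum(genotype[locus] for locus in LOCI)

def is_high_risk(genotype, threshold=5):
    """Flag carriers of more than `threshold` risk alleles."""
    return cumulative_risk_score(genotype) > threshold

# Hypothetical patients.
patient_a = dict(zip(LOCI, [2, 1, 1, 1, 1, 1]))
patient_b = dict(zip(LOCI, [1, 0, 0, 1, 1, 0]))

print(cumulative_risk_score(patient_a), is_high_risk(patient_a))
print(cumulative_risk_score(patient_b), is_high_risk(patient_b))
```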

  19. Challenges in projecting clustering results across gene expression-profiling datasets.

    PubMed

    Lusa, Lara; McShane, Lisa M; Reid, James F; De Cecco, Loris; Ambrogi, Federico; Biganzoli, Elia; Gariboldi, Manuela; Pierotti, Marco A

    2007-11-21

    Gene expression microarray studies for several types of cancer have been reported to identify previously unknown subtypes of tumors. For breast cancer, a molecular classification consisting of five subtypes based on gene expression microarray data has been proposed. These subtypes have been reported to exist across several breast cancer microarray studies, and they have demonstrated some association with clinical outcome. A classification rule based on the method of centroids has been proposed for identifying the subtypes in new collections of breast cancer samples; the method is based on the similarity of the new profiles to the mean expression profile of the previously identified subtypes. Previously identified centroids of five breast cancer subtypes were used to assign 99 breast cancer samples, including a subset of 65 estrogen receptor-positive (ER+) samples, to five breast cancer subtypes based on microarray data for the samples. The effect of mean centering the genes (i.e., transforming the expression of each gene so that its mean expression is equal to 0) on subtype assignment by method of centroids was assessed. Further studies of the effect of mean centering and of class prevalence in the test set on the accuracy of method of centroids classifications of ER status were carried out using training and test sets for which ER status had been independently determined by ligand-binding assay and for which the proportion of ER+ and ER- samples were systematically varied. When all 99 samples were considered, mean centering before application of the method of centroids appeared to be helpful for correctly assigning samples to subtypes, as evidenced by the expression of genes that had previously been used as markers to identify the subtypes. However, when only the 65 ER+ samples were considered for classification, many samples appeared to be misclassified, as evidenced by an unexpected distribution of ER+ samples among the resultant subtypes. 
When genes were mean centered before classification of samples for ER status, the accuracy of the ER subgroup assignments was highly dependent on the proportion of ER+ samples in the test set; this effect of subtype prevalence was not seen when gene expression data were not mean centered. Simple corrections such as mean centering of genes aimed at microarray platform or batch effect correction can have undesirable consequences because patient population effects can easily be confused with these assay-related effects. Careful thought should be given to the comparability of the patient populations before attempting to force data comparability for purposes of assigning subtypes to independent subjects.
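
    The dependence on test-set composition can be illustrated with a toy, one-gene version of the method of centroids. This is a sketch under stated assumptions, not the study's analysis: the centroid values, expression values, and function names are invented for the example.

```python
def mean_center(samples):
    """Subtract each gene's mean, computed across the given samples only."""
    n = len(samples)
    means = [sum(column) / n for column in zip(*samples)]
    return [[x - m for x, m in zip(row, means)] for row in samples]

def nearest_centroid(sample, centroids):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sqdist(sample, centroids[label]))

# Hypothetical centroids for one marker gene on a mean-centered scale.
centroids = {"ER+": [1.0], "ER-": [-1.0]}

# Test set A: two high-expressing and two low-expressing samples.
mixed = [[8.0], [8.2], [4.0], [4.2]]
# Test set B: only high-expressing samples.
only_high = [[8.0], [8.2], [7.8], [8.4]]

calls_mixed = [nearest_centroid(s, centroids) for s in mean_center(mixed)]
calls_high = [nearest_centroid(s, centroids) for s in mean_center(only_high)]

print(calls_mixed)
print(calls_high)
```

    In the balanced set, centering places the two groups on opposite sides of zero and the calls follow the expression level. In the homogeneous set, centering moves the mean of the high-expressing samples to zero, so about half of them fall on the "ER-" side even though their raw expression is uniformly high.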

  20. F-MAP: A Bayesian approach to infer the gene regulatory network using external hints

    PubMed Central

    Shahdoust, Maryam; Mahjub, Hossein; Sadeghi, Mehdi

    2017-01-01

    The common topological features of the gene regulatory networks of related species suggest that the network of one species can be reconstructed using additional information from the gene expression profiles of related species. We present an algorithm, named F-MAP, for reconstructing the gene regulatory network, which applies knowledge about gene interactions from related species. Our algorithm sets up a Bayesian framework to estimate the precision matrix of one species' microarray gene expression data set in order to infer the Gaussian graphical model of the network. The conjugate Wishart prior is used, and the information from related species is applied to estimate the hyperparameters of the prior distribution by using factor analysis. Applying the proposed algorithm to six related species of Drosophila shows that the precision of the reconstructed networks is improved considerably compared to the precision of networks constructed by other Bayesian approaches. PMID:28938012
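
    For orientation, the textbook conjugate update for a Gaussian precision matrix under a Wishart prior is shown below. It is the standard result, not necessarily the exact estimator used in F-MAP; the symbols (n samples, p genes, zero-mean data matrix X, prior degrees of freedom \nu and scale matrix V) are notation assumed here, and in the abstract's setting the hyperparameters would be informed by the related species.

```latex
% Likelihood: rows x_i of X are i.i.d. N_p(0, \Omega^{-1}), with S = X^\top X.
% Prior:      \Omega \sim \mathcal{W}_p(\nu, V).
\begin{align*}
  p(\Omega \mid X)
    &\propto |\Omega|^{n/2} \exp\!\left(-\tfrac{1}{2}\operatorname{tr}(S\,\Omega)\right)
       \cdot |\Omega|^{(\nu - p - 1)/2} \exp\!\left(-\tfrac{1}{2}\operatorname{tr}(V^{-1}\Omega)\right) \\
    &= |\Omega|^{(\nu + n - p - 1)/2}
       \exp\!\left(-\tfrac{1}{2}\operatorname{tr}\!\big((V^{-1} + S)\,\Omega\big)\right), \\
  \Omega \mid X &\sim \mathcal{W}_p\!\big(\nu + n,\; (V^{-1} + S)^{-1}\big), \\
  \hat{\Omega}_{\mathrm{MAP}} &= (\nu + n - p - 1)\,(V^{-1} + S)^{-1},
    \qquad \nu + n \ge p + 1 .
\end{align*}
```

    Nonzero off-diagonal entries of the estimated precision matrix correspond to edges of the Gaussian graphical model, which is how a precision estimate translates into a network.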

  1. EXP-PAC: providing comparative analysis and storage of next generation gene expression data.

    PubMed

    Church, Philip C; Goscinski, Andrzej; Lefèvre, Christophe

    2012-07-01

    Microarrays and, more recently, RNA sequencing have led to an increase in available gene expression data. How to manage and store this data is becoming a key issue. In response we have developed EXP-PAC, a web-based software package for storage, management and analysis of gene expression and sequence data. Unique to this package is SQL-based querying of gene expression data sets, distributed normalization of raw gene expression data and analysis of gene expression data across experiments and species. This package has been populated with lactation data in the international milk genomic consortium web portal (http://milkgenomics.org/). Source code is also available which can be hosted on a Windows, Linux or Mac APACHE server connected to a private or public network (http://mamsap.it.deakin.edu.au/~pcc/Release/EXP_PAC.html). Copyright © 2012 Elsevier Inc. All rights reserved.

  2. EvoCor: a platform for predicting functionally related genes using phylogenetic and expression profiles.

    PubMed

    Dittmar, W James; McIver, Lauren; Michalak, Pawel; Garner, Harold R; Valdez, Gregorio

    2014-07-01

    The wealth of publicly available gene expression and genomic data provides unique opportunities for computational inference to discover groups of genes that function to control specific cellular processes. Such genes are likely to have co-evolved and be expressed in the same tissues and cells. Unfortunately, the expertise and computational resources required to compare tens of genomes and gene expression data sets make this type of analysis difficult for the average end-user. Here, we describe the implementation of a web server that predicts genes involved in affecting specific cellular processes together with a gene of interest. We termed the server 'EvoCor', to denote that it detects functional relationships among genes through evolutionary analysis and gene expression correlation. This web server integrates profiles of sequence divergence derived by a Hidden Markov Model (HMM) and tissue-wide gene expression patterns to determine putative functional linkages between pairs of genes. This server is easy to use and freely available at http://pilot-hmm.vbi.vt.edu/. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.

  3. MiRNA-TF-gene network analysis through ranking of biomolecules for multi-informative uterine leiomyoma dataset.

    PubMed

    Mallik, Saurav; Maulik, Ujjwal

    2015-10-01

    Gene ranking is an important problem in bioinformatics. Here, we propose a new framework for ranking biomolecules (viz., miRNAs, transcription factors (TFs) and genes) in a multi-informative uterine leiomyoma dataset having both gene expression and methylation data, using a (statistical) eigenvector centrality-based approach. First, genes that are both differentially expressed and methylated are identified using the Limma statistical test. A network comprising these genes, the corresponding TFs from the TRANSFAC and ITFP databases, and targeter miRNAs from the miRWalk database is then built. The biomolecules are then ranked based on eigenvector centrality. Our proposed method provides better average accuracy in hub gene and non-hub gene classifications than other methods. Furthermore, pre-ranked gene set enrichment analysis is applied to the pathway database as well as the GO-term databases of the Molecular Signatures Database, using a pre-ranked gene list based on different centrality values, to compare the ranking methods. Finally, top novel potential gene markers for uterine leiomyoma are provided. Copyright © 2015 Elsevier Inc. All rights reserved.
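
    Ranking nodes by eigenvector centrality can be sketched with a short power iteration. The four-node network, its labels, and the function name are hypothetical and serve only to show the ranking step; the iteration uses A + I rather than A, a common shift that leaves the leading eigenvector unchanged while avoiding oscillation on bipartite graphs.

```python
def eigenvector_centrality(adjacency, iterations=500, tol=1e-12):
    """Power iteration on (A + I) for a symmetric 0/1 adjacency matrix."""
    n = len(adjacency)
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # One multiplication by (A + I): each node keeps its own score and
        # adds the scores of its neighbours.
        updated = [
            scores[i] + sum(adjacency[i][j] * scores[j] for j in range(n))
            for i in range(n)
        ]
        total = sum(updated)
        updated = [x / total for x in updated]  # normalise to sum to 1
        converged = max(abs(a - b) for a, b in zip(updated, scores)) < tol
        scores = updated
        if converged:
            break
    return scores

# Hypothetical network: a TF linked to two genes and one miRNA, with the
# miRNA also targeting geneB.
nodes = ["geneA", "TF1", "miRNA1", "geneB"]
adjacency = [
    [0, 1, 0, 0],  # geneA  - TF1
    [1, 0, 1, 1],  # TF1    - geneA, miRNA1, geneB
    [0, 1, 0, 1],  # miRNA1 - TF1, geneB
    [0, 1, 1, 0],  # geneB  - TF1, miRNA1
]

scores = eigenvector_centrality(adjacency)
ranked = sorted(zip(nodes, scores), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}\t{score:.3f}")
```

    Unlike a plain degree count, the score of a node depends on how central its neighbours are, which is why the well-connected TF ranks first and the gene attached only to that TF ranks last.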

  4. Repression of Middle Sporulation Genes in Saccharomyces cerevisiae by the Sum1-Rfm1-Hst1 Complex Is Maintained by Set1 and H3K4 Methylation

    PubMed Central

    Jaiswal, Deepika; Jezek, Meagan; Quijote, Jeremiah; Lum, Joanna; Choi, Grace; Kulkarni, Rushmie; Park, DoHwan; Green, Erin M.

    2017-01-01

    The conserved yeast histone methyltransferase Set1 targets H3 lysine 4 (H3K4) for mono, di, and trimethylation and is linked to active transcription due to the euchromatic distribution of these methyl marks and the recruitment of Set1 during transcription. However, loss of Set1 results in increased expression of multiple classes of genes, including genes adjacent to telomeres and middle sporulation genes, which are repressed under normal growth conditions because they function in meiotic progression and spore formation. The mechanisms underlying Set1-mediated gene repression are varied, and still unclear in some cases, although repression has been linked to both direct and indirect action of Set1, associated with noncoding transcription, and is often dependent on the H3K4me2 mark. We show that Set1, and particularly the H3K4me2 mark, are implicated in repression of a subset of middle sporulation genes during vegetative growth. In the absence of Set1, there is loss of the DNA-binding transcriptional regulator Sum1 and the associated histone deacetylase Hst1 from chromatin in a locus-specific manner. This is linked to increased H4K5ac at these loci and aberrant middle gene expression. These data indicate that, in addition to DNA sequence, histone modification status also contributes to proper localization of Sum1. Our results also show that the role for Set1 in middle gene expression control diverges as cells receive signals to undergo meiosis. Overall, this work dissects an unexplored role for Set1 in gene-specific repression, and provides important insights into a new mechanism associated with the control of gene expression linked to meiotic differentiation. PMID:29066473

  5. The genetic architecture of gene expression levels in wild baboons.

    PubMed

    Tung, Jenny; Zhou, Xiang; Alberts, Susan C; Stephens, Matthew; Gilad, Yoav

    2015-02-25

    Primate evolution has been argued to result, in part, from changes in how genes are regulated. However, we still know little about gene regulation in natural primate populations. We conducted an RNA sequencing (RNA-seq)-based study of baboons from an intensively studied wild population. We performed complementary expression quantitative trait locus (eQTL) mapping and allele-specific expression analyses, discovering substantial evidence for, and surprising power to detect, genetic effects on gene expression levels in the baboons. eQTL were most likely to be identified for lineage-specific, rapidly evolving genes; interestingly, genes with eQTL significantly overlapped between baboons and a comparable human eQTL data set. Our results suggest that genes vary in their tolerance of genetic perturbation, and that this property may be conserved across species. Further, they establish the feasibility of eQTL mapping using RNA-seq data alone, and represent an important step towards understanding the genetic architecture of gene expression in primates.

  6. The genetic architecture of gene expression levels in wild baboons

    PubMed Central

    Tung, Jenny; Zhou, Xiang; Alberts, Susan C; Stephens, Matthew; Gilad, Yoav

    2015-01-01

    Primate evolution has been argued to result, in part, from changes in how genes are regulated. However, we still know little about gene regulation in natural primate populations. We conducted an RNA sequencing (RNA-seq)-based study of baboons from an intensively studied wild population. We performed complementary expression quantitative trait locus (eQTL) mapping and allele-specific expression analyses, discovering substantial evidence for, and surprising power to detect, genetic effects on gene expression levels in the baboons. eQTL were most likely to be identified for lineage-specific, rapidly evolving genes; interestingly, genes with eQTL significantly overlapped between baboons and a comparable human eQTL data set. Our results suggest that genes vary in their tolerance of genetic perturbation, and that this property may be conserved across species. Further, they establish the feasibility of eQTL mapping using RNA-seq data alone, and represent an important step towards understanding the genetic architecture of gene expression in primates. DOI: http://dx.doi.org/10.7554/eLife.04729.001 PMID:25714927

  7. NvERTx: a gene expression database to compare embryogenesis and regeneration in the sea anemone Nematostella vectensis.

    PubMed

    Warner, Jacob F; Guerlais, Vincent; Amiel, Aldine R; Johnston, Hereroa; Nedoncelle, Karine; Röttinger, Eric

    2018-05-17

    For over a century, researchers have been comparing embryogenesis and regeneration hoping that lessons learned from embryonic development will unlock hidden regenerative potential. This problem has historically been a difficult one to investigate because the best regenerative model systems are poor embryonic models and vice versa. Recently, however, there has been renewed interest in this question, as emerging models have allowed researchers to investigate these processes in the same organism. This interest has been further fueled by the advent of high-throughput transcriptomic analyses that provide virtual mountains of data. Here, we present Nematostella vectensis Embryogenesis and Regeneration Transcriptomics (NvERTx), a platform for comparing gene expression during embryogenesis and regeneration. NvERTx consists of close to 50 transcriptomic data sets spanning embryogenesis and regeneration in Nematostella. These data were used to perform a robust de novo transcriptome assembly, with which users can search, conduct BLAST analyses, and plot the expression of multiple genes during these two developmental processes. The site is also home to the results of gene clustering analyses, to further mine the data and identify groups of co-expressed genes. The site can be accessed at http://nvertx.kahikai.org. © 2018. Published by The Company of Biologists Ltd.

  8. A Simple Screening Approach To Prioritize Genes for Functional Analysis Identifies a Role for Interferon Regulatory Factor 7 in the Control of Respiratory Syncytial Virus Disease

    PubMed Central

    McDonald, Jacqueline U.; Kaforou, Myrsini; Clare, Simon; Hale, Christine; Ivanova, Maria; Huntley, Derek; Dorner, Marcus; Wright, Victoria J.; Levin, Michael; Martinon-Torres, Federico; Herberg, Jethro A.

    2016-01-01

    ABSTRACT Greater understanding of the functions of host gene products in response to infection is required. While many of these genes enable pathogen clearance, some enhance pathogen growth or contribute to disease symptoms. Many studies have profiled transcriptomic and proteomic responses to infection, generating large data sets, but selecting targets for further study is challenging. Here we propose a novel data-mining approach combining multiple heterogeneous data sets to prioritize genes for further study by using respiratory syncytial virus (RSV) infection as a model pathogen with a significant health care impact. The assumption was that the more frequently a gene is detected across multiple studies, the more important its role is. A literature search was performed to find data sets of genes and proteins that change after RSV infection. The data sets were standardized, collated into a single database, and then panned to determine which genes occurred in multiple data sets, generating a candidate gene list. This candidate gene list was validated by using both a clinical cohort and in vitro screening. We identified several genes that were frequently expressed following RSV infection with no assigned function in RSV control, including IFI27, IFIT3, IFI44L, GBP1, OAS3, IFI44, and IRF7. Drilling down into the function of these genes, we demonstrate a role in disease for the gene for interferon regulatory factor 7, which was highly ranked on the list, but not for IRF1, which was not. Thus, we have developed and validated an approach for collating published data sets into a manageable list of candidates, identifying novel targets for future analysis. IMPORTANCE Making the most of “big data” is one of the core challenges of current biology. There is a large array of heterogeneous data sets of host gene responses to infection, but these data sets do not inform us about gene function and require specialized skill sets and training for their utilization. 
Here we describe an approach that combines and simplifies these data sets, distilling this information into a single list of genes commonly upregulated in response to infection with RSV as a model pathogen. Many of the genes on the list have unknown functions in RSV disease. We validated the gene list with new clinical, in vitro, and in vivo data. This approach allows the rapid selection of genes of interest for further, more-detailed studies, thus reducing time and costs. Furthermore, the approach is simple to use and widely applicable to a range of diseases. PMID:27822537
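
    The collation step described above (count how often each gene recurs across independent data sets, then rank) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name, the `min_datasets` cutoff, the upper-casing used to standardize symbols, and the toy study contents are all assumptions made for the example.

```python
from collections import Counter


def rank_genes_by_recurrence(datasets, min_datasets=2):
    """Rank genes by the number of data sets in which they appear.

    datasets: dict mapping a data set name to an iterable of gene symbols.
    min_datasets: hypothetical cutoff; genes seen in fewer sets are dropped.
    """
    counts = Counter()
    for genes in datasets.values():
        # Standardize symbols and use a set so that a gene is counted
        # at most once per data set.
        counts.update({g.strip().upper() for g in genes})
    # Sort by descending count, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(gene, n) for gene, n in ranked if n >= min_datasets]


# Hypothetical toy input, not data from the study.
example_datasets = {
    "study_a": ["IRF7", "IFI27", "OAS3"],
    "study_b": ["irf7", "IFI27", "GBP1"],
    "study_c": ["IRF7", "IRF1"],
}
candidates = rank_genes_by_recurrence(example_datasets)
print(candidates)
```

    Counting each gene once per data set keeps a single large study from dominating the ranking, which matches the stated assumption that recurrence across studies is the signal of interest.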

  9. Gene Expression Profiling in the Hibernating Primate, Cheirogaleus Medius

    PubMed Central

    Faherty, Sheena L.; Villanueva-Cañas, José Luis; Klopfer, Peter H.; Albà, M. Mar; Yoder, Anne D.

    2016-01-01

    Hibernation is a complex physiological response that some mammalian species employ to evade energetic demands. Previous work in mammalian hibernators suggests that hibernation is activated not by a set of genes unique to hibernators, but by differential expression of genes that are present in all mammals. This question of universal genetic mechanisms requires further investigation and can only be tested through additional investigations of phylogenetically dispersed species. To explore this question, we use RNA-Seq to investigate gene expression dynamics as they relate to the varying physiological states experienced throughout the year in a group of primate hibernators—Madagascar’s dwarf lemurs (genus Cheirogaleus). In a novel experimental approach, we use longitudinal sampling of biological tissues as a method for capturing gene expression profiles from the same individuals throughout their annual hibernation cycle. We identify 90 candidate genes that have variable expression patterns when comparing two active states (Active 1 and Active 2) with a torpor state. These include genes that are involved in metabolic pathways, feeding behavior, and circadian rhythms, as might be expected to correlate with seasonal physiological state changes. The identified genes appear to be critical for maintaining the health of an animal that undergoes prolonged periods of metabolic depression concurrent with the hibernation phenotype. By focusing on these differentially expressed genes in dwarf lemurs, we compare gene expression patterns in previously studied mammalian hibernators. Additionally, by employing evolutionary rate analysis, we find that hibernation-related genes do not evolve under positive selection in hibernating species relative to nonhibernators. PMID:27412611

  10. Selecting a set of housekeeping genes for quantitative real-time PCR in normal and tetraploid haemocytes of soft-shell clams, Mya arenaria.

    PubMed

    Siah, A; Dohoo, C; McKenna, P; Delaporte, M; Berthe, F C J

    2008-09-01

    The transcripts involved in the molecular mechanisms of haemic neoplasia in relation to the haemocyte ploidy status of the soft-shell clam, Mya arenaria, have yet to be identified. For this purpose, real-time quantitative RT-PCR constitutes a sensitive and efficient technique, which can help determine the gene expression involved in haemocyte tetraploid status in clams affected by haemic neoplasia. One of the critical steps in comparing transcription profiles is the stability of selected housekeeping genes, as well as an accurate normalization. In this study, we selected five reference genes, S18, L37, EF1, EF2 and actin, generally used as single control genes. Their expression was analyzed by real-time quantitative RT-PCR at different levels of haemocyte ploidy status in order to select the most stable genes. Using the geNorm software, our results showed that L37, EF1 and S18 represent the most stable gene expressions related to various ploidy status ranging from 0 to 78% of tetraploid haemocytes in clams sampled in North River (Prince Edward Island, Canada). However, actin gene expression appeared to be highly regulated. Hence, using it as a housekeeping gene in tetraploid haemocytes can result in inaccurate data. To compare gene expression levels related to haemocyte ploidy status in Mya arenaria, using L37, EF1 and S18 as housekeeping genes for accurate normalization is therefore recommended.
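
    For readers unfamiliar with geNorm, its stability measure M is commonly described as follows: for each candidate gene, take the standard deviation across samples of its log2 expression ratio with every other candidate, then average those standard deviations. A lower M means a more stable gene. The sketch below assumes that definition; the function name and the expression values are hypothetical and are not taken from the study.

```python
import math
from statistics import stdev


def genorm_m_values(expression):
    """Compute a geNorm-style stability value M for each candidate gene.

    expression: dict mapping gene name to a list of positive relative
    expression quantities, one per sample, in the same sample order.
    """
    genes = list(expression)
    m_values = {}
    for j in genes:
        pairwise_sd = []
        for k in genes:
            if k == j:
                continue
            # Log2 ratio of gene j to gene k in every sample.
            log_ratios = [
                math.log2(a / b) for a, b in zip(expression[j], expression[k])
            ]
            # Pairwise variation: spread of that ratio across samples.
            pairwise_sd.append(stdev(log_ratios))
        m_values[j] = sum(pairwise_sd) / len(pairwise_sd)
    return m_values


# Hypothetical toy quantities: L37 and EF1 keep a constant ratio,
# while actin varies independently of both.
toy = {
    "L37": [1.0, 2.0, 4.0],
    "EF1": [2.0, 4.0, 8.0],
    "actin": [1.0, 8.0, 2.0],
}
m_values = genorm_m_values(toy)
for gene, m in sorted(m_values.items(), key=lambda kv: kv[1]):
    print(f"{gene}: M = {m:.3f}")
```

    In this toy input the gene with the largest M is the one whose ratio to the others fluctuates most, which is the pattern the abstract reports for actin.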

  11. Genome-wide analysis of starch metabolism genes in potato (Solanum tuberosum L.).

    PubMed

    Van Harsselaar, Jessica K; Lorenz, Julia; Senning, Melanie; Sonnewald, Uwe; Sonnewald, Sophia

    2017-01-05

    Starch is the principle constituent of potato tubers and is of considerable importance for food and non-food applications. Its metabolism has been subject of extensive research over the past decades. Despite its importance, a description of the complete inventory of genes involved in starch metabolism and their genome organization in potato plants is still missing. Moreover, mechanisms regulating the expression of starch genes in leaves and tubers remain elusive with regard to differences between transitory and storage starch metabolism, respectively. This study aimed at identifying and mapping the complete set of potato starch genes, and to study their expression pattern in leaves and tubers using different sets of transcriptome data. Moreover, we wanted to uncover transcription factors co-regulated with starch accumulation in tubers in order to get insight into the regulation of starch metabolism. We identified 77 genomic loci encoding enzymes involved in starch metabolism. Novel isoforms of many enzymes were found. Their analysis will help to elucidate mechanisms of starch biosynthesis and degradation. Expression analysis of starch genes led to the identification of tissue-specific isoenzymes suggesting differences in the transcriptional regulation of starch metabolism between potato leaf and tuber tissues. Selection of genes predominantly expressed in developing potato tubers and exhibiting an expression pattern indicative for a role in starch biosynthesis enabled the identification of possible transcriptional regulators of tuber starch biosynthesis by co-expression analysis. This study provides the annotation of the complete set of starch metabolic genes in potato plants and their genomic localizations. Novel, so far undescribed, enzyme isoforms were revealed. Comparative transcriptome analysis enabled the identification of tuber- and leaf-specific isoforms of starch genes. 
This finding suggests distinct regulatory mechanisms in transitory and storage starch metabolism. Putative regulatory proteins of starch biosynthesis in potato tubers have been identified by co-expression and their expression was verified by quantitative RT-PCR.

  12. RARGE II: an integrated phenotype database of Arabidopsis mutant traits using a controlled vocabulary.

    PubMed

    Akiyama, Kenji; Kurotani, Atsushi; Iida, Kei; Kuromori, Takashi; Shinozaki, Kazuo; Sakurai, Tetsuya

    2014-01-01

    Arabidopsis thaliana is one of the most popular experimental plants. However, only 40% of its genes have at least one experimental Gene Ontology (GO) annotation assigned. Systematic observation of mutant phenotypes is an important technique for elucidating gene functions. Indeed, several large-scale phenotypic analyses have been performed and have generated phenotypic data sets from many Arabidopsis mutant lines and overexpressing lines, which are freely available online. Since each Arabidopsis mutant line database uses individual phenotype expression, the differences in the structured term sets used by each database make it difficult to compare data sets and make it impossible to search across databases. Therefore, we obtained publicly available information for a total of 66,209 Arabidopsis mutant lines, including loss-of-function (RATM and TARAPPER) and gain-of-function (AtFOX and OsFOX) lines, and integrated the phenotype data by mapping the descriptions onto Plant Ontology (PO) and Phenotypic Quality Ontology (PATO) terms. This approach made it possible to manage the four different phenotype databases as one large data set. Here, we report a publicly accessible web-based database, the RIKEN Arabidopsis Genome Encyclopedia II (RARGE II; http://rarge-v2.psc.riken.jp/), in which all of the data described in this study are included. Using the database, we demonstrated consistency (in terms of protein function) with a previous study and identified the presumed function of an unknown gene. We provide examples of AT1G21600, which is a subunit in the plastid-encoded RNA polymerase complex, and AT5G56980, which is related to the jasmonic acid signaling pathway.

  13. Global parameter estimation for thermodynamic models of transcriptional regulation.

    PubMed

    Suleimenov, Yerzhan; Ay, Ahmet; Samee, Md Abul Hassan; Dresch, Jacqueline M; Sinha, Saurabh; Arnosti, David N

    2013-07-15

    Deciphering the mechanisms involved in gene regulation holds the key to understanding the control of central biological processes, including human disease, population variation, and the evolution of morphological innovations. New experimental techniques including whole genome sequencing and transcriptome analysis have enabled comprehensive modeling approaches to study gene regulation. In many cases, it is useful to be able to assign biological significance to the inferred model parameters, but such interpretation should take into account features that affect these parameters, including model construction and sensitivity, the type of fitness calculation, and the effectiveness of parameter estimation. This last point is often neglected, as estimation methods are often selected for historical reasons or for computational ease. Here, we compare the performance of two parameter estimation techniques broadly representative of local and global approaches, namely, a quasi-Newton/Nelder-Mead simplex (QN/NMS) method and a covariance matrix adaptation-evolutionary strategy (CMA-ES) method. The estimation methods were applied to a set of thermodynamic models of gene transcription applied to regulatory elements active in the Drosophila embryo. Measuring overall fit, the global CMA-ES method performed significantly better than the local QN/NMS method on high quality data sets, but this difference was negligible on lower quality data sets with increased noise or on data sets simplified by stringent thresholding. Our results suggest that the choice of parameter estimation technique for evaluation of gene expression models depends both on quality of data, the nature of the models [again, remains to be established] and the aims of the modeling effort. Copyright © 2013 Elsevier Inc. All rights reserved.
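
    The local-versus-global distinction can be illustrated on a toy problem. The sketch below does not implement QN/NMS or CMA-ES; as a stand-in it uses plain gradient descent as the local method and a multi-start wrapper around it as a crude global strategy. The objective function, step size, iteration count, and start points are all assumptions chosen so that the example is deterministic.

```python
def f(x):
    """Toy objective with a local minimum near x = 1.13 and a global one near x = -1.30."""
    return x**4 - 3 * x**2 + x


def grad_f(x):
    """Analytic derivative of f."""
    return 4 * x**3 - 6 * x + 1


def gradient_descent(x0, lr=0.01, steps=2000):
    """Local search: follow the negative gradient from a single start point."""
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x


def multi_start(starts):
    """Crude global search: run the local search from several starts, keep the best."""
    return min((gradient_descent(x0) for x0 in starts), key=f)


local_x = gradient_descent(2.0)
global_x = multi_start([-2.0, -1.0, 0.0, 1.0, 2.0])
print(f"local:  x = {local_x:.3f}, f = {f(local_x):.3f}")
print(f"global: x = {global_x:.3f}, f = {f(global_x):.3f}")
```

    Started at x = 2, the local search settles in the shallower basin, while the multi-start search reaches the deeper one. Real thermodynamic models have many parameters and noisy fitness landscapes, so this only conveys the qualitative point.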

  14. Inference of Evolutionary Forces Acting on Human Biological Pathways

    PubMed Central

    Daub, Josephine T.; Dupanloup, Isabelle; Robinson-Rechavi, Marc; Excoffier, Laurent

    2015-01-01

    Because natural selection is likely to act on multiple genes underlying a given phenotypic trait, we study here the potential effect of ongoing and past selection on the genetic diversity of human biological pathways. We first show that genes included in gene sets are generally under stronger selective constraints than other genes and that their evolutionary response is correlated. We then introduce a new procedure to detect selection at the pathway level based on a decomposition of the classical McDonald–Kreitman test extended to multiple genes. This new test, called 2DNS, detects outlier gene sets and takes into account past demographic effects and evolutionary constraints specific to gene sets. Selective forces acting on gene sets can be easily identified by a mere visual inspection of the position of the gene sets relative to their two-dimensional null distribution. We thus find several outlier gene sets that show signals of positive, balancing, or purifying selection but also others showing an ancient relaxation of selective constraints. The principle of the 2DNS test can also be applied to other genomic contrasts. For instance, the comparison of patterns of polymorphisms private to African and non-African populations reveals that most pathways show a higher proportion of nonsynonymous mutations in non-Africans than in Africans, potentially due to different demographic histories and selective pressures. PMID:25971280
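
    As background for the test that 2DNS extends, the classical single-gene McDonald-Kreitman comparison contrasts nonsynonymous and synonymous counts among fixed differences (Dn, Ds) and polymorphisms (Pn, Ps). The sketch below computes the commonly used neutrality index NI = (Pn/Ps)/(Dn/Ds) and alpha = 1 - NI. It does not implement the 2DNS decomposition, which the abstract does not specify; the function names and the counts are illustrative assumptions.

```python
def neutrality_index(dn, ds, pn, ps):
    """Neutrality index NI = (Pn / Ps) / (Dn / Ds) from a 2x2 MK table.

    dn, ds: nonsynonymous and synonymous fixed differences between species.
    pn, ps: nonsynonymous and synonymous polymorphisms within a species.
    """
    if 0 in (dn, ds, ps):
        raise ValueError("NI is undefined when Dn, Ds or Ps is zero")
    return (pn / ps) / (dn / ds)


def alpha_from_ni(ni):
    """alpha = 1 - NI, often read as the share of substitutions driven by positive selection."""
    return 1.0 - ni


# Illustrative counts only.
ni = neutrality_index(dn=7, ds=17, pn=2, ps=42)
alpha = alpha_from_ni(ni)
print(f"NI = {ni:.3f}, alpha = {alpha:.3f}")
```

    Under the usual reading, NI below 1 points to an excess of nonsynonymous divergence (consistent with positive selection), and NI above 1 to an excess of nonsynonymous polymorphism. A significance test on the 2x2 table would be needed before drawing any conclusion.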

  15. A Compendium of Canine Normal Tissue Gene Expression

    PubMed Central

    Chen, Qing-Rong; Wen, Xinyu; Khan, Javed; Khanna, Chand

    2011-01-01

    Background Our understanding of disease is increasingly informed by changes in gene expression between normal and abnormal tissues. The release of the canine genome sequence in 2005 provided an opportunity to better understand human health and disease using the dog as a clinically relevant model. Accordingly, we now present the first genome-wide, canine normal tissue gene expression compendium with corresponding human cross-species analysis. Methodology/Principal Findings The Affymetrix platform was utilized to catalogue gene expression signatures of 10 normal canine tissues including: liver, kidney, heart, lung, cerebrum, lymph node, spleen, jejunum, pancreas and skeletal muscle. The quality of the database was assessed in several ways. Organ defining gene sets were identified for each tissue and functional enrichment analysis revealed themes consistent with known physio-anatomic functions for each organ. In addition, a comparison of orthologous gene expression between matched canine and human normal tissues uncovered remarkable similarity. To demonstrate the utility of this dataset, novel canine gene annotations were established based on comparative analysis of dog and human tissue selective gene expression and manual curation of canine probeset mapping. Public access, using infrastructure identical to that currently in use for human normal tissues, has been established and allows for additional comparisons across species. Conclusions/Significance These data advance our understanding of the canine genome through a comprehensive analysis of gene expression in a diverse set of tissues, contributing to improved functional annotation that has been lacking. Importantly, it will be used to inform future studies of disease in the dog as a model for human translational research and provides a novel resource to the community at large. PMID:21655323

  16. Comparison of dorsal root ganglion gene expression in rat models of traumatic and HIV-associated neuropathic pain

    PubMed Central

    Maratou, Klio; Wallace, Victoria C.J.; Hasnie, Fauzia S.; Okuse, Kenji; Hosseini, Ramine; Jina, Nipurna; Blackbeard, Julie; Pheby, Timothy; Orengo, Christine; Dickenson, Anthony H.; McMahon, Stephen B.; Rice, Andrew S.C.

    2009-01-01

    To elucidate the mechanisms underlying peripheral neuropathic pain in the context of HIV infection and antiretroviral therapy, we measured gene expression in dorsal root ganglia (DRG) of rats subjected to systemic treatment with the anti-retroviral agent, ddC (Zalcitabine) and concomitant delivery of HIV-gp120 to the rat sciatic nerve. L4 and L5 DRGs were collected at day 14 (time of peak behavioural change) and changes in gene expression were measured using Affymetrix whole genome rat arrays. Conventional analysis of this data set and Gene Set Enrichment Analysis (GSEA) were performed to discover biological processes altered in this model. Transcripts associated with G protein coupled receptor signalling and cell adhesion were enriched in the treated animals, while ribosomal proteins and proteasome pathways were associated with gene down-regulation. To identify genes that are directly relevant to neuropathic mechanical hypersensitivity, as opposed to epiphenomena associated with other aspects of the response to a sciatic nerve lesion, we compared the gp120 + ddC-evoked gene expression with that observed in a model of traumatic neuropathic pain (L5 spinal nerve transection), where hypersensitivity to a static mechanical stimulus is also observed. We identified 39 genes/expressed sequence tags that are differentially expressed in the same direction in both models. Most of these have not previously been implicated in mechanical hypersensitivity and may represent novel targets for therapeutic intervention. As an external control, the RNA expression of three genes was examined by RT-PCR, while the protein levels of two were studied using western blot analysis. PMID:18606552

  17. A tree of life based on ninety-eight expressed genes conserved across diverse eukaryotic species

    PubMed Central

    Jayaswal, Pawan Kumar; Dogra, Vivek; Shanker, Asheesh; Sharma, Tilak Raj

    2017-01-01

    Rapid advances in DNA sequencing technologies have resulted in the accumulation of large data sets in the public domain, facilitating comparative studies to provide novel insights into the evolution of life. Phylogenetic studies across the eukaryotic taxa have been reported but on the basis of a limited number of genes. Here we present a genome-wide analysis across different plant, fungal, protist, and animal species, with reference to the 36,002 expressed genes of the rice genome. Our analysis revealed 9831 genes unique to rice and 98 genes conserved across all 49 eukaryotic species analysed. The 98 genes conserved across diverse eukaryotes mostly exhibited binding and catalytic activities and shared common sequence motifs; and hence appeared to have a common origin. The 98 conserved genes belonged to 22 functional gene families including 26S protease, actin, ADP–ribosylation factor, ATP synthase, casein kinase, DEAD-box protein, DnaK, elongation factor 2, glyceraldehyde 3-phosphate, phosphatase 2A, ras-related protein, Ser/Thr protein phosphatase family protein, tubulin, ubiquitin and others. The consensus Bayesian eukaryotic tree of life developed in this study demonstrated widely separated clades of plants, fungi, and animals. Musa acuminata provided an evolutionary link between monocotyledons and dicotyledons, and Salpingoeca rosetta provided an evolutionary link between fungi and animals, indicating that protozoan species are close relatives of fungi and animals. The divergence times for 1176 species pairs were estimated accurately by integrating fossil information with synonymous substitution rates in the comprehensive set of 98 genes. The present study provides valuable insight into the evolution of eukaryotes. PMID:28922368

  18. Characterization of Arabidopsis Transcriptional Responses to Different Aphid Species Reveals Genes that Contribute to Host Susceptibility and Non-host Resistance

    PubMed Central

    Jaouannet, Maëlle; Morris, Jenny A.; Hedley, Peter E.; Bos, Jorunn I. B.

    2015-01-01

    Aphids are economically important pests that display exceptional variation in host range. The determinants of diverse aphid host ranges are not well understood, but it is likely that molecular interactions are involved. With significant progress being made towards understanding host responses upon aphid attack, the mechanisms underlying non-host resistance remain to be elucidated. Here, we investigated and compared Arabidopsis thaliana host and non-host responses to aphids at the transcriptional level using three different aphid species, Myzus persicae, Myzus cerasi and Rhopalosiphum padi. Gene expression analyses revealed a high level of overlap in the overall gene expression changes during the host and non-host interactions with regards to the sets of genes differentially expressed and the direction of expression changes. Despite this overlap in transcriptional responses across interactions, there was a stronger repression of genes involved in metabolism and oxidative responses specifically during the host interaction with M. persicae. In addition, we identified a set of genes with opposite gene expression patterns during the host versus non-host interactions. Aphid performance assays on Arabidopsis mutants that were selected based on our transcriptome analyses identified novel genes contributing to host susceptibility, host defences during interactions with M. persicae as well as to non-host resistance against R. padi. Understanding how plants respond to aphid species that differ in their ability to infest plant species, and identifying the genes and signaling pathways involved, is essential for the development of novel and durable aphid control in crop plants. PMID:25993686

  19. Impact of microRNAs on regulatory networks and pathways in human colorectal carcinogenesis and development of metastasis

    PubMed Central

    2013-01-01

    Background Qualitative alterations or abnormal expression of microRNAs (miRNAs) in colon cancer have mainly been demonstrated in primary tumors. Poorly overlapping sets of oncomiRs, tumor suppressor miRNAs and metastamiRs have been linked with distinct stages in the progression of colorectal cancer. To identify changes in both miRNA and gene expression levels among normal colon mucosa, primary tumor and liver metastasis samples, and to classify miRNAs into functional networks, in this work miRNA and gene expression profiles in 158 samples from 46 patients were analysed. Results Most changes in miRNA and gene expression levels had already manifested in the primary tumors while these levels were almost stably maintained in the subsequent primary tumor-to-metastasis transition. In addition, comparing normal tissue, tumor and metastasis, we did not observe general impairment or any rise in miRNA biogenesis. While only few mRNAs were found to be differentially expressed between primary colorectal carcinoma and liver metastases, miRNA expression profiles can classify primary tumors and metastases well, including differential expression of miR-10b, miR-210 and miR-708. Of 82 miRNAs that were modulated during tumor progression, 22 were involved in EMT. qRT-PCR confirmed the down-regulation of miR-150 and miR-10b in both primary tumor and metastasis compared to normal mucosa and of miR-146a in metastases compared to primary tumor. The upregulation of miR-201 in metastasis compared both with normal and primary tumour was also confirmed. A preliminary survival analysis considering differentially expressed miRNAs suggested a possible link between miR-10b expression in metastasis and patient survival. By integrating miRNA and target gene expression data, we identified a combination of interconnected miRNAs, which are organized into sub-networks, including several regulatory relationships with differentially expressed genes. 
Key regulatory interactions were validated experimentally. Specific mixed circuits involving miRNAs and transcription factors were identified and deserve further investigation. The suppressor activity of miR-182 on ENTPD5 gene was identified for the first time and confirmed in an independent set of samples. Conclusions Using a large dataset of CRC miRNA and gene expression profiles, we describe the interplay of miRNA groups in regulating gene expression, which in turn affects modulated pathways that are important for tumor development. PMID:23987127

  20. Genetic and cytokine changes associated with symptomatic stages of CLL.

    PubMed

    Agarwal, Amit; Cooke, Lawrence; Riley, Christopher; Qi, Wenqing; Mount, David; Mahadevan, Daruka

    2014-09-01

    The pathogenesis and drug resistance of symptomatic CLL patients involves genetic changes associated with the CLL clone as well as changes within the microenvironment. To further understand these processes, we compared early stage CLL to symptomatic late stage using gene expression and serum cytokine profiling to gain insight into the genetic and microenvironment changes associated with the most severe form of the disease. Patients were classified into low stage (Rai stage 0/I/II) and high stage (Rai stage III/IV). Gene expression profiles were obtained on pretreatment samples using the HG-U133A 2.0 Affymetrix platform. A comparison of low versus high stage CLL revealed a set of 21 differentially expressed genes. 15 genes were up regulated in the high stage compared to low stage while 6 genes were down regulated. Analysis of GO molecular function revealed 9 of 21 genes were involved in transcription factor activity. Serum cytokine profiles showed six cytokines to be significantly different in high stage patients. Two chemokines, SDF-1/CXCL12 and uPAR known to be involved in stem cell mobilization and homing were increased in serum of high stage patients. This study has identified therapeutic targets for symptomatic CLL patients. Copyright © 2014 Elsevier Ltd. All rights reserved.

  1. Comparative genomics of the mimicry switch in Papilio dardanus.

    PubMed

    Timmermans, Martijn J T N; Baxter, Simon W; Clark, Rebecca; Heckel, David G; Vogel, Heiko; Collins, Steve; Papanicolaou, Alexie; Fukova, Iva; Joron, Mathieu; Thompson, Martin J; Jiggins, Chris D; ffrench-Constant, Richard H; Vogler, Alfried P

    2014-07-22

    The African Mocker Swallowtail, Papilio dardanus, is a textbook example in evolutionary genetics. Classical breeding experiments have shown that wing pattern variation in this polymorphic Batesian mimic is determined by the polyallelic H locus that controls a set of distinct mimetic phenotypes. Using bacterial artificial chromosome (BAC) sequencing, recombination analyses and comparative genomics, we show that H co-segregates with an interval of less than 500 kb that is collinear with two other Lepidoptera genomes and contains 24 genes, including the transcription factor genes engrailed (en) and invected (inv). H is located in a region of conserved gene order, which argues against any role for genomic translocations in the evolution of a hypothesized multi-gene mimicry locus. Natural populations of P. dardanus show significant associations of specific morphs with single nucleotide polymorphisms (SNPs), centred on en. In addition, SNP variation in the H region reveals evidence of non-neutral molecular evolution in the en gene alone. We find evidence for a duplication potentially driving physical constraints on recombination in the lamborni morph. Absence of perfect linkage disequilibrium between different genes in the other morphs suggests that H is limited to nucleotide positions in the regulatory and coding regions of en. Our results therefore support the hypothesis that a single gene underlies wing pattern variation in P. dardanus.

  2. Comparative modular analysis of gene expression in vertebrate organs.

    PubMed

    Piasecka, Barbara; Kutalik, Zoltán; Roux, Julien; Bergmann, Sven; Robinson-Rechavi, Marc

    2012-03-29

    The degree of conservation of gene expression between homologous organs largely remains an open question. Several recent studies reported some evidence in favor of such conservation. Most studies compute organs' similarity across all orthologous genes, whereas the expression levels of many genes are not informative about organ specificity. Here, we use a modularization algorithm to overcome this limitation through the identification of inter-species co-modules of organs and genes. We identify such co-modules using mouse and human microarray expression data. They are functionally coherent both in terms of genes and of organs from both organisms. We show that a large proportion of genes belonging to the same co-module are orthologous between mouse and human. Moreover, their zebrafish orthologs also tend to be expressed in the corresponding homologous organs. Notable exceptions to the general pattern of conservation are the testis and the olfactory bulb. Interestingly, some co-modules consist of single organs, while others combine several functionally related organs. For instance, amygdala, cerebral cortex, hypothalamus and spinal cord form a clearly discernible unit of expression, both in mouse and human. Our study provides a new framework for comparative analysis which will be applicable also to other sets of large-scale phenotypic data collected across different species.

  3. Inferring gene regression networks with model trees

    PubMed Central

    2010-01-01

    Background Novel strategies are required in order to handle the huge amount of data produced by microarray technologies. To infer gene regulatory networks, the first step is to find direct regulatory relationships between genes building the so-called gene co-expression networks. They are typically generated using correlation statistics as pairwise similarity measures. Correlation-based methods are very useful in order to determine whether two genes have a strong global similarity but do not detect local similarities. Results We propose model trees as a method to identify gene interaction networks. While correlation-based methods analyze each pair of genes, in our approach we generate a single regression tree for each gene from the remaining genes. Finally, a graph from all the relationships among output and input genes is built taking into account whether the pair of genes is statistically significant. For this reason we apply a statistical procedure to control the false discovery rate. The performance of our approach, named REGNET, is experimentally tested on two well-known data sets: the Saccharomyces cerevisiae and E. coli data sets. First, the biological coherence of the results is tested. Second, the E. coli transcriptional network (in the Regulon database) is used as a control to compare the results to those of a correlation-based method. This experiment shows that REGNET performs more accurately at detecting true gene associations than the Pearson and Spearman zeroth and first-order correlation-based methods. Conclusions REGNET generates gene association networks from gene expression data, and differs from correlation-based methods in that the relationship between one gene and others is calculated simultaneously. Model trees are very useful techniques to estimate the numerical values for the target genes by linear regression functions. 
They are very often more precise than linear regression models because they can fit different linear regressions to separate areas of the search space, favoring the inference of localized similarities over a more global similarity. Furthermore, experimental results show the good performance of REGNET. PMID:20950452

  4. Missing value imputation in DNA microarrays based on conjugate gradient method.

    PubMed

    Dorri, Fatemeh; Azmi, Paeiz; Dorri, Faezeh

    2012-02-01

    Analysis of gene expression profiles needs a complete matrix of gene array values; consequently, imputation methods have been suggested. In this paper, an algorithm that is based on the conjugate gradient (CG) method is proposed to estimate missing values. The k-nearest neighbors of the missing entry are first selected based on the absolute values of their Pearson correlation coefficients. Then a subset of genes among the k-nearest neighbors is labeled as the most similar ones. The CG algorithm with this subset as its input is then used to estimate the missing values. Our proposed CG-based algorithm (CGimpute) is evaluated on different data sets. The results are compared with sequential local least squares (SLLSimpute), Bayesian principal component analysis (BPCAimpute), local least squares imputation (LLSimpute), iterated local least squares imputation (ILLSimpute) and adaptive k-nearest neighbors imputation (KNNKimpute) methods. The average of the normalized root mean square error (NRMSE) and the relative NRMSE in different data sets with various missing rates shows that CGimpute outperforms the other methods. Copyright © 2011 Elsevier Ltd. All rights reserved.
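
    The neighbor-selection and CG steps described in this abstract can be illustrated with a minimal sketch. It is only one plausible reading, not the CGimpute implementation: it assumes a single missing entry, complete candidate genes, and a least-squares fit of the target gene on its neighbors whose normal equations are solved by conjugate gradient. The intermediate "most similar subset" step is skipped, and all function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def conjugate_gradient(matrix, rhs, iterations=50, tol=1e-12):
    """Solve matrix @ x = rhs for a symmetric positive-definite matrix."""
    n = len(rhs)
    x = [0.0] * n
    r = list(rhs)            # residual, since x starts at zero
    p = list(r)              # first search direction
    rs_old = sum(v * v for v in r)
    for _ in range(iterations):
        if rs_old < tol:
            break
        ap = [sum(matrix[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs_old / sum(p[i] * ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * ap[i] for i in range(n)]
        rs_new = sum(v * v for v in r)
        p = [r[i] + (rs_new / rs_old) * p[i] for i in range(n)]
        rs_old = rs_new
    return x


def impute(target, candidates, missing_col, k=2):
    """Estimate target[missing_col] from the k candidates with the largest |r|."""
    cols = [c for c in range(len(target)) if c != missing_col]
    b = [target[c] for c in cols]
    # Rank candidate genes by absolute Pearson correlation on the observed columns.
    ranked = sorted(candidates, key=lambda g: -abs(pearson([g[c] for c in cols], b)))
    neighbors = ranked[:k]
    a = [[g[c] for c in cols] for g in neighbors]
    m = len(neighbors)
    # Normal equations (A A^T) w = A b of the least-squares fit target ~ sum_i w_i * neighbor_i.
    gram = [[sum(u * v for u, v in zip(a[i], a[j])) for j in range(m)] for i in range(m)]
    rhs = [sum(u * v for u, v in zip(a[i], b)) for i in range(m)]
    w = conjugate_gradient(gram, rhs)
    return sum(w[i] * neighbors[i][missing_col] for i in range(m))
```

    Ranking by the absolute correlation lets strongly anti-correlated genes serve as neighbors too, which matches the selection rule stated in the abstract. The sketch assumes the chosen neighbors are not collinear, since collinear rows make the normal equations singular.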

  5. Simple, rapid and sensitive detection of Orientia tsutsugamushi by loop-isothermal DNA amplification.

    PubMed

    Paris, Daniel H; Blacksell, Stuart D; Newton, Paul N; Day, Nicholas P J

    2008-12-01

    We present a loop-mediated isothermal PCR assay (LAMP) targeting the groEL gene, which encodes the 60kDa heat shock protein of Orientia tsutsugamushi. Evaluation included testing of 63 samples of contemporary in vitro isolates, buffy coats and whole blood samples from patients with fever. Detection limits for LAMP were assessed by serial dilutions and quantitation by real-time PCR assay based on the same target gene: three copies/microl for linearized plasmids, 26 copies/microl for VERO cell culture isolates, 14 copies/microl for full blood samples and 41 copies/microl for clinical buffy coats. Based on a limited sample number, the LAMP assay is comparable in sensitivity with conventional nested PCR (56kDa gene), with limits of detection well below the range of known admission bacterial loads of patients with scrub typhus. This inexpensive method requires no sophisticated equipment or sample preparation, and may prove useful as a diagnostic assay in financially poor settings; however, it requires further prospective validation in the field setting.

  6. Solving the influence maximization problem reveals regulatory organization of the yeast cell cycle.

    PubMed

    Gibbs, David L; Shmulevich, Ilya

    2017-06-01

    The Influence Maximization Problem (IMP) aims to discover the set of nodes with the greatest influence on network dynamics. The problem has previously been applied in epidemiology and social network analysis. Here, we demonstrate the application to cell cycle regulatory network analysis for Saccharomyces cerevisiae. Fundamentally, gene regulation is linked to the flow of information. Therefore, our implementation of the IMP was framed as an information theoretic problem using network diffusion. Utilizing more than 26,000 regulatory edges from YeastMine, gene expression dynamics were encoded as edge weights using time lagged transfer entropy, a method for quantifying information transfer between variables. By picking a set of source nodes, a diffusion process covers a portion of the network. The size of the network cover relates to the influence of the source nodes. The set of nodes that maximizes influence is the solution to the IMP. By solving the IMP over different numbers of source nodes, an influence ranking on genes was produced. The influence ranking was compared to other metrics of network centrality. Although the top genes from each centrality ranking contained well-known cell cycle regulators, there was little agreement and no clear winner. However, it was found that influential genes tend to directly regulate or sit upstream of genes ranked by other centrality measures. The influential nodes act as critical sources of information flow, potentially having a large impact on the state of the network. Biological events that affect influential nodes and thereby affect information flow could have a strong effect on network dynamics, potentially leading to disease. Code and data can be found at: https://github.com/gibbsdavidl/miergolf.
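
    The idea of choosing source nodes whose diffusion covers the largest part of the network can be sketched with a simple greedy heuristic. This is a swapped-in illustration, not the authors' optimizer: it assumes the cover is plain reachability along edges whose weight meets a hypothetical `min_weight` threshold, whereas the study weights edges with time-lagged transfer entropy and uses a diffusion process. Function names and the toy graph format are assumptions.

```python
from collections import deque


def cover(graph, sources, min_weight=0.0):
    """Nodes reachable from `sources` along edges with weight >= min_weight."""
    seen = set(sources)
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for target, weight in graph.get(node, []):
            if weight >= min_weight and target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def greedy_influence(graph, k, min_weight=0.0):
    """Pick up to k source nodes, each time adding the node that enlarges the cover most."""
    nodes = sorted(set(graph) | {t for edges in graph.values() for t, _ in edges})
    chosen = []
    covered = set()
    for _ in range(k):
        best, best_cover = None, covered
        for node in nodes:
            if node in chosen:
                continue
            candidate_cover = cover(graph, chosen + [node], min_weight)
            if len(candidate_cover) > len(best_cover):
                best, best_cover = node, candidate_cover
        if best is None:     # no remaining node enlarges the cover
            break
        chosen.append(best)
        covered = best_cover
    return chosen, covered
```

    Running the selection for increasing values of `k` and recording when each gene first enters the chosen set gives one simple way to derive a ranking, in the spirit of the influence ranking described in the abstract.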

  7. BloodChIP: a database of comparative genome-wide transcription factor binding profiles in human blood cells.

    PubMed

    Chacon, Diego; Beck, Dominik; Perera, Dilmi; Wong, Jason W H; Pimanda, John E

    2014-01-01

    The BloodChIP database (http://www.med.unsw.edu.au/CRCWeb.nsf/page/BloodChIP) supports exploration and visualization of combinatorial transcription factor (TF) binding at a particular locus in human CD34-positive and other normal and leukaemic cells or retrieval of target gene sets for user-defined combinations of TFs across one or more cell types. Increasing numbers of genome-wide TF binding profiles are being added to public repositories, and this trend is likely to continue. For the power of these data sets to be fully harnessed by experimental scientists, there is a need for these data to be placed in context and easily accessible for downstream applications. To this end, we have built a user-friendly database that has at its core the genome-wide binding profiles of seven key haematopoietic TFs in human stem/progenitor cells. These binding profiles are compared with binding profiles in normal differentiated and leukaemic cells. We have integrated these TF binding profiles with chromatin marks and expression data in normal and leukaemic cell fractions. All queries can be exported into external sites to construct TF-gene and protein-protein networks and to evaluate the association of genes with cellular processes and tissue expression.

  8. A methodology for the analysis of differential coexpression across the human lifespan.

    PubMed

    Gillis, Jesse; Pavlidis, Paul

    2009-09-22

    Differential coexpression is a change in coexpression between genes that may reflect 'rewiring' of transcriptional networks. It has previously been hypothesized that such changes might be occurring over time in the lifespan of an organism. While both coexpression and differential expression of genes have been previously studied in life stage change or aging, differential coexpression has not. Generalizing differential coexpression analysis to many time points presents a methodological challenge. Here we introduce a method for analyzing changes in coexpression across multiple ordered groups (e.g., over time) and extensively test its validity and usefulness. Our method is based on the use of the Haar basis set to efficiently represent changes in coexpression at multiple time scales, and thus represents a principled and generalizable extension of the idea of differential coexpression to life stage data. We used published microarray studies categorized by age to test the methodology. We validated the methodology by testing our ability to reconstruct Gene Ontology (GO) categories using our measure of differential coexpression and compared this result to using coexpression alone. Our method allows significant improvement in characterizing these groups of genes. Further, we examine the statistical properties of our measure of differential coexpression and establish that the results are significant both statistically and by an improvement in semantic similarity. In addition, we found that our method finds more significant changes in gene relationships compared to several other methods of expressing temporal relationships between genes, such as coexpression over time. Differential coexpression over age generates significant and biologically relevant information about the genes producing it. Our Haar basis methodology for determining age-related differential coexpression performs better than other tested methods. 
The Haar basis set also lends itself to ready interpretation in terms of both evolutionary and physiological mechanisms of aging and can be seen as a natural generalization of two-category differential coexpression. Contact: paul@bioinformatics.ubc.ca.
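
    As a rough illustration of how a Haar basis can summarize coexpression changes over ordered groups, the sketch below transforms a vector of per-age-group correlations for one gene pair. It assumes the number of groups is a power of two and uses an averaging normalization; both choices, and the function name, are assumptions rather than details taken from the paper.

```python
def haar_transform(values):
    """Haar decomposition of a sequence whose length is a power of two.

    Returns [overall mean, coarsest detail, ..., finest details], where each
    detail is half the difference between the means of two adjacent blocks.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    details = []
    current = list(values)
    while len(current) > 1:
        pairs = range(0, len(current), 2)
        averages = [(current[i] + current[i + 1]) / 2 for i in pairs]
        level = [(current[i] - current[i + 1]) / 2 for i in pairs]
        details = level + details      # coarser levels end up first
        current = averages
    return current + details


# Correlation of one gene pair in four ordered age groups (illustrative numbers).
coefficients = haar_transform([0.8, 0.6, 0.2, 0.0])
```

    With four groups, the second coefficient compares the two younger groups with the two older groups, which is the familiar two-category differential coexpression; the remaining coefficients capture changes at the finer time scale.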

  9. A strategy to apply quantitative epistasis analysis on developmental traits.

    PubMed

    Labocha, Marta K; Yuan, Wang; Aleman-Meza, Boanerges; Zhong, Weiwei

    2017-05-15

    Genetic interactions are key to understanding complex traits and evolution. Epistasis analysis is an effective method to map genetic interactions. Large-scale quantitative epistasis analysis has been well established for single cells. However, there is a substantial lack of such studies in multicellular organisms and their complex phenotypes such as development. Here we present a method to extend quantitative epistasis analysis to developmental traits. In the nematode Caenorhabditis elegans, we applied RNA interference to mutants to inactivate two genes, used an imaging system to quantitatively measure phenotypes, and developed a set of statistical methods to extract genetic interactions from phenotypic measurements. Using two different C. elegans developmental phenotypes, body length and sex ratio, as examples, we showed that this method could accommodate various metazoan phenotypes with performance comparable to that of methods used in single-cell growth studies. Compared with qualitative observations, this method of quantitative epistasis enabled detection of new interactions involving subtle phenotypes. For example, several sex-ratio genes were found to interact with brc-1 and brd-1, the orthologs of the human breast cancer genes BRCA1 and BARD1, respectively. We confirmed the brc-1 interactions with the following genes in DNA damage response: C34F6.1, him-3 (ortholog of HORMAD1, HORMAD2), sdc-1, and set-2 (ortholog of SETD1A, SETD1B, KMT2C, KMT2D), validating the effectiveness of our method in detecting genetic interactions. We developed a reliable, high-throughput method for quantitative epistasis analysis of developmental phenotypes.

  10. Characterization of a Genomic Signature of Pregnancy in the Breast

    PubMed Central

    Belitskaya-Lévy, Ilana; Zeleniuch-Jacquotte, Anne; Russo, Jose; Russo, Irma H.; Bordás, Pal; Åhman, Janet; Afanasyeva, Yelena; Johansson, Robert; Lenner, Per; Li, Xiaochun; de Cicco, Ricardo López; Peri, Suraj; Ross, Eric; Russo, Patricia A.; Santucci-Pereira, Julia; Sheriff, Fathima S.; Slifker, Michael; Hallmans, Göran; Toniolo, Paolo; Arslan, Alan A.

    2012-01-01

    The objective of the current study was to comprehensively compare the genomic profiles in the breast of parous and nulliparous postmenopausal women to identify genes that permanently change their expression following pregnancy. The study was designed as a two-phase approach. In the discovery phase, we compared breast genomic profiles of 37 parous with 18 nulliparous postmenopausal women. In the validation phase, confirmation of the genomic patterns observed in the discovery phase was sought in an independent set of 30 parous and 22 nulliparous postmenopausal women. RNA was hybridized to Affymetrix HG_U133 Plus 2.0 oligonucleotide arrays containing probes to 54,675 transcripts; the arrays were scanned and the images analyzed using Affymetrix GCOS software. Surrogate variable analysis, logistic regression and significance analysis for microarrays were used to identify statistically significant differences in expression of genes. The False Discovery Rate (FDR) approach was used to control for multiple comparisons. We found that 208 genes (305 probe sets) were differentially expressed between parous and nulliparous women in both discovery and validation phases of the study at an FDR of 10% and with at least a 1.25-fold change. These genes are involved in regulation of transcription, centrosome organization, RNA splicing, cell cycle control, adhesion and differentiation. The results provide persuasive evidence that full-term pregnancy induces long-term genomic changes in the breast. The genomic signature of pregnancy could be used as an intermediate marker to assess potential chemopreventive interventions with hormones mimicking the effects of pregnancy for prevention of breast cancer. PMID:21622728
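
    The combined filter mentioned in this abstract (an FDR of 10% together with at least a 1.25-fold change) can be sketched as follows. The abstract does not say which FDR procedure was applied, so the Benjamini-Hochberg adjustment used here is an assumption, as are the function names and the treatment of fold changes as ratios where 0.8 counts as a 1.25-fold decrease.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_genes(p_values, fold_changes, fdr=0.10, min_fold=1.25):
    """Indices passing both the FDR threshold and the fold-change threshold."""
    q_values = benjamini_hochberg(p_values)
    selected = []
    for i, (q, fc) in enumerate(zip(q_values, fold_changes)):
        magnitude = max(fc, 1.0 / fc)   # symmetric for up- and down-regulation
        if q <= fdr and magnitude >= min_fold:
            selected.append(i)
    return selected
```

    Requiring a minimum fold change in addition to statistical significance screens out genes whose differences are detectable but small.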

  11. Integrative analysis of gene expression and copy number alterations using canonical correlation analysis.

    PubMed

    Soneson, Charlotte; Lilljebjörn, Henrik; Fioretos, Thoas; Fontes, Magnus

    2010-04-15

    With the rapid development of new genetic measurement methods, several types of genetic alterations can be quantified in a high-throughput manner. While the initial focus has been on investigating each data set separately, there is an increasing interest in studying the correlation structure between two or more data sets. Multivariate methods based on Canonical Correlation Analysis (CCA) have been proposed for integrating paired genetic data sets. The high dimensionality of microarray data imposes computational difficulties, which have been addressed for instance by studying the covariance structure of the data, or by reducing the number of variables prior to applying the CCA. In this work, we propose a new method for analyzing high-dimensional paired genetic data sets, which mainly emphasizes the correlation structure and still permits efficient application to very large data sets. The method is implemented by translating a regularized CCA to its dual form, where the computational complexity depends mainly on the number of samples instead of the number of variables. The optimal regularization parameters are chosen by cross-validation. We apply the regularized dual CCA, as well as a classical CCA preceded by a dimension-reducing Principal Components Analysis (PCA), to a paired data set of gene expression changes and copy number alterations in leukemia. Using the correlation-maximizing methods, regularized dual CCA and PCA+CCA, we show that without pre-selection of known disease-relevant genes, and without using information about clinical class membership, an exploratory analysis singles out two patient groups, corresponding to well-known leukemia subtypes. Furthermore, the variables showing the highest relevance to the extracted features agree with previous biological knowledge concerning copy number alterations and gene expression changes in these subtypes. 
Finally, the correlation-maximizing methods are shown to yield results which are more biologically interpretable than those resulting from a covariance-maximizing method, and provide different insight compared to when each variable set is studied separately using PCA. We conclude that regularized dual CCA as well as PCA+CCA are useful methods for exploratory analysis of paired genetic data sets, and can be efficiently implemented also when the number of variables is very large.

  12. The Cure: Design and Evaluation of a Crowdsourcing Game for Gene Selection for Breast Cancer Survival Prediction

    PubMed Central

    Loguercio, Salvatore; Griffith, Obi L; Nanis, Max; Wu, Chunlei; Su, Andrew I

    2014-01-01

    Background Molecular signatures for predicting breast cancer prognosis could greatly improve care through personalization of treatment. Computational analyses of genome-wide expression datasets have identified such signatures, but these signatures leave much to be desired in terms of accuracy, reproducibility, and biological interpretability. Methods that take advantage of structured prior knowledge (eg, protein interaction networks) show promise in helping to define better signatures, but most knowledge remains unstructured. Crowdsourcing via scientific discovery games is an emerging methodology that has the potential to tap into human intelligence at scales and in modes unheard of before. Objective The main objective of this study was to test the hypothesis that knowledge linking expression patterns of specific genes to breast cancer outcomes could be captured from players of an open, Web-based game. We envisioned capturing knowledge both from the player’s prior experience and from their ability to interpret text related to candidate genes presented to them in the context of the game. Methods We developed and evaluated an online game called The Cure that captured information from players regarding genes for use as predictors of breast cancer survival. Information gathered from game play was aggregated using a voting approach, and used to create rankings of genes. The top genes from these rankings were evaluated using annotation enrichment analysis, comparison to prior predictor gene sets, and by using them to train and test machine learning systems for predicting 10 year survival. Results Between its launch in September 2012 and September 2013, The Cure attracted more than 1000 registered players, who collectively played nearly 10,000 games. Gene sets assembled through aggregation of the collected data showed significant enrichment for genes known to be related to key concepts such as cancer, disease progression, and recurrence. 
In terms of the predictive accuracy of models trained using this information, these gene sets provided comparable performance to gene sets generated using other methods, including those used in commercial tests. The Cure is available on the Internet. Conclusions The principal contribution of this work is to show that crowdsourcing games can be developed as a means to address problems involving domain knowledge. While most prior work on scientific discovery games and crowdsourcing in general takes as a premise that contributors have little or no expertise, here we demonstrated a crowdsourcing system that succeeded in capturing expert knowledge. PMID:25654473

  13. A literature search tool for intelligent extraction of disease-associated genes.

    PubMed

    Jung, Jae-Yoon; DeLuca, Todd F; Nelson, Tristan H; Wall, Dennis P

    2014-01-01

    To extract disorder-associated genes from the scientific literature in PubMed with greater sensitivity for literature-based support than existing methods. We developed a PubMed query to retrieve disorder-related, original research articles. Then we applied a rule-based text-mining algorithm with keyword matching to extract target disorders, genes with significant results, and the type of study described by the article. We compared our resulting candidate disorder genes and supporting references with existing databases. We demonstrated that our candidate gene set covers nearly all genes in manually curated databases, and that the references supporting the disorder-gene link are more extensive and accurate than other general purpose gene-to-disorder association databases. We implemented a novel publication search tool to find target articles, specifically focused on links between disorders and genotypes. Through comparison against gold-standard manually updated gene-disorder databases and comparison with automated databases of similar functionality we show that our tool can search through the entirety of PubMed to extract the main gene findings for human diseases rapidly and accurately.

  14. Complete mitochondrial genome and taxonomic revision of Cardiodactylus muiri Otte, 2007 (Gryllidae: Eneopterinae: Lebinthini).

    PubMed

    Dong, Jiajia; Vicente, Natallia; Chintauan-Marquier, Ioana C; Ramadi, Cahyo; Dettai, Agnès; Robillard, Tony

    2017-05-15

    In the present study, we report the high-coverage complete mitochondrial genome (mitogenome) of the cricket Cardiodactylus muiri Otte, 2007. The mitogenome was sequenced using a long-PCR approach on an Ion Torrent Personal Genome Machine (PGM) for next generation sequencing technology. The total length of the amplified mitogenome is 16,328 bp, representing 13 protein-coding genes, 22 transfer RNA genes, two ribosomal RNA genes and one noncoding region (D-loop region). The new sets of long-PCR primers reported here are invaluable resources for future comparative evolutionary genomic studies in Orthopteran insects. The new mitogenome sequence is compared with published cricket mitogenomes. In the taxonomic part, we present new records for the species and describe life-history traits, habitat and male calling song of the species; based on observation of new material, the species Cardiodactylus buru Gorochov & Robillard, 2014 is synonymized under C. muiri.

  15. Whole genome sequencing data and de novo draft assemblies for 66 teleost species

    PubMed Central

    Malmstrøm, Martin; Matschiner, Michael; Tørresen, Ole K.; Jakobsen, Kjetill S.; Jentoft, Sissel

    2017-01-01

    Teleost fishes comprise more than half of all vertebrate species, yet genomic data are only available for 0.2% of their diversity. Here, we present whole genome sequencing data for 66 new species of teleosts, vastly expanding the availability of genomic data for this important vertebrate group. We report on de novo assemblies based on low-coverage (9–39×) sequencing and present detailed methodology for all analyses. To facilitate further utilization of this data set, we present statistical analyses of the gene space completeness and verify the expected phylogenetic position of the sequenced genomes in a large mitogenomic context. We further present a nuclear marker set used for phylogenetic inference and evaluate each gene tree in relation to the species tree to test for homogeneity in the phylogenetic signal. Collectively, these analyses illustrate the robustness of this highly diverse data set and enable extensive reuse of the selected phylogenetic markers and the genomic data in general. This data set covers all major teleost lineages and provides unprecedented opportunities for comparative studies of teleosts. PMID:28094797

  16. Functional genomics of the evolution of increased resistance to parasitism in Drosophila.

    PubMed

    Wertheim, Bregje; Kraaijeveld, Alex R; Hopkins, Meirion G; Walther Boer, Mark; Godfray, H Charles J

    2011-03-01

    Individual hosts normally respond to parasite attack by launching an acute immune response (a phenotypic plastic response), while host populations can respond in the longer term by evolving a higher level of defence against parasites. Little is known about the genetics of the evolved response: the identity and number of genes involved and whether it involves a pre-activation of the regulatory systems governing the plastic response. We explored these questions by surveying transcriptional changes in a Drosophila melanogaster strain artificially selected for resistance against the hymenopteran endoparasitoid Asobara tabida. Using micro-arrays, we profiled gene expression at seven time points during development (from the egg to the second instar larva) and found a large number of genes (almost 900) with altered expression levels. Bioinformatic analysis showed that some were involved in immunity or defence-associated functions but many were not. Previously, we had defined a set of genes whose level of expression changed after parasitoid attack and a comparison with the present set showed a significant though comparatively small overlap. This suggests that the evolutionary response to parasitism is not a simple pre-activation of the plastic, acute response. We also found overlap in the genes involved in the evolutionary response to parasitism and to other biotic and abiotic stressors, perhaps suggesting a 'module' of genes involved in a generalized stress response as has been found in other organisms. © 2010 Blackwell Publishing Ltd.

  17. Cytochrome cd1-containing nitrite reductase encoding gene nirS as a new functional biomarker for detection of anaerobic ammonium oxidizing (Anammox) bacteria.

    PubMed

    Li, Meng; Ford, Tim; Li, Xiaoyan; Gu, Ji-Dong

    2011-04-15

    A newly designed primer set (AnnirS), together with a previously published primer set (ScnirS), was used to detect anammox bacterial nirS genes from sediments collected from three marine environments. Phylogenetic analysis demonstrated that all retrieved sequences were clearly different from typical denitrifiers' nirS, but do group together with the known anammox bacterial nirS. Sequences targeted by ScnirS are closely related to Scalindua nirS genes recovered from the Peruvian oxygen minimum zone (OMZ), whereas sequences targeted by AnnirS are more closely affiliated with the nirS of Candidatus 'Kuenenia stuttgartiensis' and even form a new phylogenetic nirS clade, which might be related to other genera of the anammox bacteria. Analysis demonstrated that retrieved sequences had higher sequence identities (>60%) with known anammox bacterial nirS genes than with denitrifiers' nirS, on both nucleotide and amino acid levels. Compared to the 16S rRNA and hydrazine oxidoreductase (hzo) genes, the anammox bacterial nirS not only showed consistent phylogenetic relationships but also demonstrated more reliable quantification of anammox bacteria because of the single copy of the nirS gene in the anammox bacterial genome and the specificity of PCR primers for different genera of anammox bacteria, thus providing a suitable functional biomarker for investigation of anammox bacteria.

  18. Horizontal gene acquisitions, mobile element proliferation, and genome decay in the host-restricted plant pathogen Erwinia tracheiphila

    USDA-ARS's Scientific Manuscript database

    Modern industrial agriculture depends on high-density cultivation of genetically similar crop plants, creating favorable conditions for the emergence of novel pathogens with increased fitness in managed compared with ecologically intact settings. Here, we present the genome sequence of six strains o...

  19. Virulence Phenotypes and Molecular Genotypes of Puccinia triticina Isolates from Italy

    USDA-ARS's Scientific Manuscript database

    Twenty-four isolates of Puccinia triticina from Italy were characterized for virulence to seedlings of 22 common wheat cv. Thatcher isolines each with a different leaf rust resistance gene, and for molecular genotypes at 15 simple sequence repeat (SSR) loci. The isolates were compared with a set of ...

  20. A random set scoring model for prioritization of disease candidate genes using protein complexes and data-mining of GeneRIF, OMIM and PubMed records.

    PubMed

    Jiang, Li; Edwards, Stefan M; Thomsen, Bo; Workman, Christopher T; Guldbrandtsen, Bernt; Sørensen, Peter

    2014-09-24

    Prioritizing genetic variants is a challenge because disease susceptibility loci are often located in genes of unknown function or the relationship with the corresponding phenotype is unclear. A global data-mining exercise on the biomedical literature can establish the phenotypic profile of genes with respect to their connection to disease phenotypes. The importance of protein-protein interaction networks in the genetic heterogeneity of common diseases or complex traits is becoming increasingly recognized. Thus, the development of a network-based approach combined with phenotypic profiling would be useful for disease gene prioritization. We developed a random-set scoring model and implemented it to quantify phenotype relevance in a network-based disease gene-prioritization approach. We validated our approach based on different gene phenotypic profiles, which were generated from PubMed abstracts, OMIM, and GeneRIF records. We also investigated the validity of several vocabulary filters and different likelihood thresholds for predicted protein-protein interactions in terms of their effect on the network-based gene-prioritization approach, which relies on text-mining of the phenotype data. Our method demonstrated good precision and sensitivity compared with those of two alternative complex-based prioritization approaches. We then conducted a global ranking of all human genes according to their relevance to a range of human diseases. The resulting accurate ranking of known causal genes supported the reliability of our approach. Moreover, these data suggest many promising novel candidate genes for human disorders that have a complex mode of inheritance. We have implemented and validated a network-based approach to prioritize genes for human diseases based on their phenotypic profile. We have devised a powerful and transparent tool to identify and rank candidate genes. 
Our global gene prioritization provides a unique resource for the biological interpretation of data from genome-wide association studies, and will help in the understanding of how the associated genetic variants influence disease or quantitative phenotypes.
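
    The abstract does not spell out the random-set scoring model, so the following is only a generic sketch of the underlying idea: compare the mean phenotype-relevance score of a gene set (for example, the members of one protein complex) with what would be expected for a random set of the same size drawn without replacement from all scored genes. The function name and the example scores are hypothetical.

```python
import math


def random_set_z(scores, set_indices):
    """Z-score of a gene set's mean score against random sets of equal size.

    Uses the mean and variance of a sample mean drawn without replacement
    from the finite population of all gene scores. Assumes the set is a
    proper subset and the scores are not all identical.
    """
    n_total = len(scores)
    m = len(set_indices)
    mu = sum(scores) / n_total
    sigma2 = sum((s - mu) ** 2 for s in scores) / n_total   # population variance
    set_mean = sum(scores[i] for i in set_indices) / m
    # Finite-population correction: the variance shrinks as the set approaches the whole population.
    variance = (sigma2 / m) * (n_total - m) / (n_total - 1)
    return (set_mean - mu) / math.sqrt(variance)


# Six scored genes; the candidate set holds the two highest-scoring ones.
z = random_set_z([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [4, 5])
```

    A large positive value indicates that the set scores higher than random sets of the same size would, which is the kind of evidence a set-based prioritization scheme can rank on.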
