Sample records for gene expression datasets

  1. Integrative analysis of gene expression and DNA methylation using unsupervised feature extraction for detecting candidate cancer biomarkers.

    PubMed

    Moon, Myungjin; Nakai, Kenta

    2018-04-01

    Currently, cancer biomarker discovery is one of the important research topics worldwide. In particular, detecting significant genes related to cancer is an important task for early diagnosis and treatment of cancer. Conventional studies mostly focus on genes that are differentially expressed in different states of cancer; however, noise in gene expression datasets and insufficient information in limited datasets impede precise analysis of novel candidate biomarkers. In this study, we propose an integrative analysis of gene expression and DNA methylation using normalization and unsupervised feature extractions to identify candidate biomarkers of cancer using renal cell carcinoma RNA-seq datasets. Gene expression and DNA methylation datasets are normalized by Box-Cox transformation and integrated into a one-dimensional dataset that retains the major characteristics of the original datasets by unsupervised feature extraction methods, and differentially expressed genes are selected from the integrated dataset. Use of the integrated dataset demonstrated improved performance as compared with conventional approaches that utilize gene expression or DNA methylation datasets alone. Validation based on the literature showed that a considerable number of top-ranked genes from the integrated dataset have known relationships with cancer, implying that novel candidate biomarkers can also be acquired from the proposed analysis method. Furthermore, we expect that the proposed method can be expanded for applications involving various types of multi-omics datasets.
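
    The normalization step named in this record can be illustrated with the standard one-parameter Box-Cox transform. This is a minimal sketch, not the authors' pipeline: the function name `box_cox`, the toy expression values, and the choice `lam = 0.5` are assumptions made for illustration, since the abstract does not say how the parameter was chosen.

```python
import math


def box_cox(value, lam):
    """Standard one-parameter Box-Cox transform of one positive value."""
    if value <= 0:
        raise ValueError("Box-Cox requires strictly positive input")
    if lam == 0:
        # The lam -> 0 limit of (value**lam - 1) / lam is the natural log.
        return math.log(value)
    return (value ** lam - 1.0) / lam


# Hypothetical expression values; lam = 0.5 is an arbitrary illustrative choice.
expression = [1.0, 4.0, 9.0, 16.0]
transformed = [box_cox(v, 0.5) for v in expression]
```

    In practice the parameter is usually estimated from the data (for example by maximum likelihood) rather than fixed by hand.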

  2. Utility and Limitations of Using Gene Expression Data to Identify Functional Associations

    PubMed Central

    Peng, Cheng; Shiu, Shin-Han

    2016-01-01

    Gene co-expression has been widely used to hypothesize gene function through guilt-by-association. However, it is not clear to what degree co-expression is informative, whether it can be applied to genes involved in different biological processes, and how the type of dataset impacts inferences about gene functions. Here our goal is to assess the utility and limitations of using co-expression as a criterion to recover functional associations between genes. Using Arabidopsis thaliana as an example, we determined the percentage of gene pairs in a metabolic pathway with significant expression correlation and found that many genes in the same pathway do not have similar transcript profiles, and that the choice of dataset, annotation quality, gene function, expression similarity measure, and clustering approach significantly impacts the ability to recover functional associations between genes. Some datasets are more informative in capturing coordinated expression profiles, and larger datasets are not always better. In addition, to recover the maximum number of known pathways and identify candidate genes with similar functions, it is important to explore rather exhaustively multiple dataset combinations, similarity measures, clustering algorithms and parameters. Finally, we validated the biological relevance of co-expression cluster memberships with an independent phenomics dataset and found that genes that consistently cluster with leucine degradation genes tend to have similar leucine levels in mutants. This study provides a framework for obtaining gene functional associations by maximizing the information that can be obtained from gene expression datasets. PMID:27935950
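
    The pathway-level measure described in this record (the share of gene pairs in a pathway whose expression profiles are correlated) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names, the toy profiles, and the fixed Pearson cutoff `r_cutoff = 0.8` are hypothetical, and a fixed cutoff stands in for the significance test used in the study.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / sqrt(vx * vy)


def fraction_coexpressed(profiles, pathway_genes, r_cutoff=0.8):
    """Fraction of gene pairs in a pathway with correlation >= r_cutoff."""
    genes = [g for g in pathway_genes if g in profiles]
    pairs = list(combinations(genes, 2))
    if not pairs:
        return 0.0
    hits = sum(
        1 for a, b in pairs if pearson(profiles[a], profiles[b]) >= r_cutoff
    )
    return hits / len(pairs)


# Hypothetical profiles: geneA and geneB rise together, geneC does not.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 5.9, 8.0],
    "geneC": [4.0, 1.0, 3.5, 2.0],
}
coexpressed_share = fraction_coexpressed(profiles, ["geneA", "geneB", "geneC"])
```

    Here only one of the three gene pairs passes the cutoff, which mirrors the record's observation that genes in the same pathway need not share similar transcript profiles.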

  3. Similarity of markers identified from cancer gene expression studies: observations from GEO.

    PubMed

    Shi, Xingjie; Shen, Shihao; Liu, Jin; Huang, Jian; Zhou, Yong; Ma, Shuangge

    2014-09-01

    Gene expression profiling has been extensively conducted in cancer research. The analysis of multiple independent cancer gene expression datasets may provide additional information and complement single-dataset analysis. In this study, we conduct multi-dataset analysis and are interested in evaluating the similarity of cancer-associated genes identified from different datasets. The first objective of this study is to briefly review some statistical methods that can be used for such evaluation. Both marginal analysis and joint analysis methods are reviewed. The second objective is to apply those methods to 26 Gene Expression Omnibus (GEO) datasets on five types of cancers. Our analysis suggests that for the same cancer, the marker identification results may vary significantly across datasets, and different datasets share few common genes. In addition, datasets on different cancers share few common genes. The shared genetic basis of datasets on the same or different cancers, which has been suggested in the literature, is not observed in the analysis of GEO data. © The Author 2013. Published by Oxford University Press. For Permissions, please email: journals.permissions@oup.com.

  4. UNCLES: method for the identification of genes differentially consistently co-expressed in a specific subset of datasets.

    PubMed

    Abu-Jamous, Basel; Fa, Rui; Roberts, David J; Nandi, Asoke K

    2015-06-04

    Collective analysis of the increasingly emerging gene expression datasets is required. The recently proposed binarisation of consensus partition matrices (Bi-CoPaM) method can combine clustering results from multiple datasets to identify the subsets of genes which are consistently co-expressed in all of the provided datasets in a tuneable manner. However, results validation and parameter setting are issues that complicate the design of such methods. Moreover, although it is a common practice to test methods by application to synthetic datasets, the mathematical models used to synthesise such datasets are usually based on approximations which may not always be sufficiently representative of real datasets. Here, we propose an unsupervised method for the unification of clustering results from multiple datasets using external specifications (UNCLES). This method has the ability to identify the subsets of genes consistently co-expressed in a subset of datasets while being poorly co-expressed in another subset of datasets, and to identify the subsets of genes consistently co-expressed in all given datasets. We also propose the M-N scatter plots validation technique and adopt it to set the parameters of UNCLES, such as the number of clusters, automatically. Additionally, we propose an approach for the synthesis of gene expression datasets using real data profiles in a way which combines the ground-truth-knowledge of synthetic data and the realistic expression values of real data, and therefore overcomes the problem of faithfulness of synthetic expression data modelling. By application to those datasets, we validate UNCLES while comparing it with other conventional clustering methods, and of particular relevance, biclustering methods. We further validate UNCLES by application to a set of 14 real genome-wide yeast datasets as it produces focused clusters that conform well to known biological facts. Furthermore, in-silico-based hypotheses regarding the function of a few previously unknown genes in those focused clusters are drawn. The UNCLES method, the M-N scatter plots technique, and the expression data synthesis approach will have wide application for the comprehensive analysis of genomic and other sources of multiple complex biological datasets. Moreover, the derived in-silico-based biological hypotheses represent subjects for future functional studies.

  5. DigOut: viewing differential expression genes as outliers.

    PubMed

    Yu, Hui; Tu, Kang; Xie, Lu; Li, Yuan-Yuan

    2010-12-01

    With regard to well-replicated two-conditional microarray datasets, the selection of differentially expressed (DE) genes is a well-studied computational topic, but for multi-conditional microarray datasets with limited or no replication, the same task has not been properly addressed by previous studies. This paper adopts multivariate outlier analysis to analyze replication-lacking multi-conditional microarray datasets, finding that it performs significantly better than the widely used limit fold change (LFC) model in a simulated comparative experiment. Compared with the LFC model, the multivariate outlier analysis also demonstrates improved stability against sample variations in a series of manipulated real expression datasets. The reanalysis of a real non-replicated multi-conditional expression dataset series leads to satisfactory results. In conclusion, a multivariate outlier analysis algorithm, like DigOut, is particularly useful for selecting DE genes from non-replicated multi-conditional gene expression datasets.

  6. cGRNB: a web server for building combinatorial gene regulatory networks through integrated engineering of seed-matching sequence information and gene expression datasets.

    PubMed

    Xu, Huayong; Yu, Hui; Tu, Kang; Shi, Qianqian; Wei, Chaochun; Li, Yuan-Yuan; Li, Yi-Xue

    2013-01-01

    We are witnessing rapid progress in the development of methodologies for building the combinatorial gene regulatory networks involving both TFs (Transcription Factors) and miRNAs (microRNAs). There are a few tools available to do these jobs but most of them are not easy to use and not accessible online. A web server is especially needed in order to allow users to upload experimental expression datasets and build combinatorial regulatory networks corresponding to their particular contexts. In this work, we compiled putative TF-gene, miRNA-gene and TF-miRNA regulatory relationships from forward-engineering pipelines and curated them as built-in data libraries. We streamlined the R code of our two separate forward-and-reverse engineering algorithms for combinatorial gene regulatory network construction and formalized them as two major functional modules. As a result, we released the cGRNB (combinatorial Gene Regulatory Networks Builder): a web server for constructing combinatorial gene regulatory networks through integrated engineering of seed-matching sequence information and gene expression datasets. The cGRNB enables two major network-building modules, one for MPGE (miRNA-perturbed gene expression) datasets and the other for parallel miRNA/mRNA expression datasets. A miRNA-centered two-layer combinatorial regulatory cascade is the output of the first module and a comprehensive genome-wide network involving all three types of combinatorial regulations (TF-gene, TF-miRNA, and miRNA-gene) is the output of the second module. In this article we propose cGRNB, a web server for building combinatorial gene regulatory networks through integrated engineering of seed-matching sequence information and gene expression datasets. Since parallel miRNA/mRNA expression datasets are rapidly accumulating with the advance of next-generation sequencing techniques, cGRNB will be a very useful tool for researchers to build combinatorial gene regulatory networks based on expression datasets. The cGRNB web-server is free and available online at http://www.scbit.org/cgrnb.

  7. A comparative study of RNA-Seq and microarray data analysis on the two examples of rectal-cancer patients and Burkitt Lymphoma cells.

    PubMed

    Wolff, Alexander; Bayerlová, Michaela; Gaedcke, Jochen; Kube, Dieter; Beißbarth, Tim

    2018-01-01

    Pipeline comparisons for gene expression data are highly valuable for applied real data analyses, as they enable the selection of suitable analysis strategies for the dataset at hand. Such pipelines for RNA-Seq data should include mapping of reads, counting and differential gene expression analysis or preprocessing, normalization and differential gene expression in case of microarray analysis, in order to give a global insight into pipeline performances. Four commonly used RNA-Seq pipelines (STAR/HTSeq-Count/edgeR, STAR/RSEM/edgeR, Sailfish/edgeR, TopHat2/Cufflinks/CuffDiff) were investigated on multiple levels (alignment and counting) and cross-compared with the microarray counterpart on the level of gene expression and gene ontology enrichment. For these comparisons we generated two matched microarray and RNA-Seq datasets: Burkitt Lymphoma cell line data and rectal cancer patient data. The overall mapping rate of STAR was 98.98% for the cell line dataset and 98.49% for the patient dataset. TopHat2's overall mapping rate was 97.02% and 96.73%, respectively, while Sailfish had an overall mapping rate of only 84.81% and 54.44%. The correlation of gene expression in microarray and RNA-Seq data was moderately worse for the patient dataset (ρ = 0.67-0.69) than for the cell line dataset (ρ = 0.87-0.88). The exceptions were the correlation results of Cufflinks, which were substantially lower (ρ = 0.21-0.29 and 0.34-0.53). For both datasets we identified very low numbers of differentially expressed genes using the microarray platform. For RNA-Seq we checked the agreement of differentially expressed genes identified in the different pipelines and of GO-term enrichment results. In conclusion, the combination of STAR aligner with HTSeq-Count followed by STAR aligner with RSEM and Sailfish generated differentially expressed genes best suited for the dataset at hand and in agreement with most of the other transcriptomics pipelines.

  8. GSNFS: Gene subnetwork biomarker identification of lung cancer expression data.

    PubMed

    Doungpan, Narumol; Engchuan, Worrawat; Chan, Jonathan H; Meechai, Asawin

    2016-12-05

    Gene expression has been used to identify disease gene biomarkers, but there are ongoing challenges. Single gene or gene-set biomarkers are inadequate to provide sufficient understanding of complex disease mechanisms and the relationship among those genes. Network-based methods have thus been considered for inferring the interaction within a group of genes to further study the disease mechanism. Recently, the Gene-Network-based Feature Set (GNFS), which is capable of handling case-control and multiclass expression for gene biomarker identification, has been proposed, partly taking network topology into account. However, its performance relies on a greedy search for building subnetworks and thus requires further improvement. In this work, we establish a new approach named Gene Sub-Network-based Feature Selection (GSNFS) by implementing the GNFS framework with two proposed searching and scoring algorithms, namely gene-set-based (GS) search and parent-node-based (PN) search, to identify subnetworks. An additional dataset is used to validate the results. The two proposed searching algorithms of the GSNFS method for subnetwork expansion are concerned with the degree of connectivity and the scoring scheme for building subnetworks and their topology. For each iteration of expansion, the neighbour genes of a current subnetwork whose expression data improve the overall subnetwork score are recruited. While the GS search calculates the subnetwork score using an activity score of a current subnetwork and the gene expression values of its neighbours, the PN search uses the expression value of the corresponding parent of each neighbour gene. Four lung cancer expression datasets were used for subnetwork identification. In addition, the use of pathway data and protein-protein interactions as network data, in order to consider the interactions among significant genes, was discussed.
Classification was performed to compare the performance of the identified gene subnetworks with three subnetwork identification algorithms. The two searching algorithms resulted in better classification and gene/gene-set agreement compared to the original greedy search of the GNFS method. The identified lung cancer subnetwork using the proposed searching algorithm resulted in an improvement of the cross-dataset validation and an increase in the consistency of findings between two independent datasets. The homogeneity measurement of the datasets was conducted to assess dataset compatibility in cross-dataset validation. The lung cancer dataset with higher homogeneity showed a better result when using the GS search while the dataset with low homogeneity showed a better result when using the PN search. The 10-fold cross-dataset validation on the independent lung cancer datasets showed higher classification performance of the proposed algorithms when compared with the greedy search in the original GNFS method. The proposed searching algorithms provide a higher number of genes in the subnetwork expansion step than the greedy algorithm. As a result, the performance of the subnetworks identified from the GSNFS method was improved in terms of classification performance and gene/gene-set level agreement depending on the homogeneity of the datasets used in the analysis. Some common genes obtained from the four datasets using different searching algorithms are genes known to play a role in lung cancer. The improvement of classification performance and the gene/gene-set level agreement, and the biological relevance indicated the effectiveness of the GSNFS method for gene subnetwork identification using expression data.

  9. Mining Gene Regulatory Networks by Neural Modeling of Expression Time-Series.

    PubMed

    Rubiolo, Mariano; Milone, Diego H; Stegmayer, Georgina

    2015-01-01

    Discovering gene regulatory networks from data is one of the most studied topics in recent years. Neural networks can be successfully used to infer an underlying gene network by modeling expression profiles as time series. This work proposes a novel method based on a pool of neural networks for obtaining a gene regulatory network from a gene expression dataset. They are used for modeling each possible interaction between pairs of genes in the dataset, and a set of mining rules is applied to accurately detect the underlying relations among genes. The results obtained on artificial and real datasets confirm the method's effectiveness for discovering regulatory networks from a proper modeling of the temporal dynamics of gene expression profiles.

  10. A group LASSO-based method for robustly inferring gene regulatory networks from multiple time-course datasets.

    PubMed

    Liu, Li-Zhi; Wu, Fang-Xiang; Zhang, Wen-Jun

    2014-01-01

    As an abstract mapping of the gene regulations in the cell, a gene regulatory network is important to both biological research and practical applications. The reverse engineering of gene regulatory networks from microarray gene expression data is a challenging research problem in systems biology. With the development of biological technologies, multiple time-course gene expression datasets might be collected for a specific gene network under different circumstances. The inference of a gene regulatory network can be improved by integrating these multiple datasets. It is also known that gene expression data may be contaminated with large errors or outliers, which may affect the inference results. A novel method, Huber group LASSO, is proposed to infer the same underlying network topology from multiple time-course gene expression datasets as well as to take the robustness to large errors or outliers into account. To solve the optimization problem involved in the proposed method, an efficient algorithm which combines the ideas of auxiliary function minimization and block descent is developed. A stability selection method is adapted to our method to find a network topology consisting of edges with scores. The proposed method is applied to both simulation datasets and real experimental datasets. It shows that Huber group LASSO outperforms the group LASSO in terms of both areas under receiver operating characteristic curves and areas under the precision-recall curves. The convergence analysis of the algorithm theoretically shows that the sequence generated from the algorithm converges to the optimal solution of the problem. The simulation and real data examples demonstrate the effectiveness of the Huber group LASSO in integrating multiple time-course gene expression datasets and improving the resistance to large errors or outliers.
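
    The robustness to outliers mentioned in this record comes from replacing the squared error with a Huber-type loss. The sketch below shows only the standard Huber loss next to the squared loss; it is not the full Huber group LASSO objective from the paper, and the function names and the threshold `delta = 1.0` are assumptions chosen for illustration.

```python
def squared_loss(residual):
    """Ordinary least-squares loss: grows quadratically with the residual."""
    return 0.5 * residual ** 2


def huber_loss(residual, delta=1.0):
    """Standard Huber loss: quadratic near zero, linear beyond delta."""
    magnitude = abs(residual)
    if magnitude <= delta:
        return 0.5 * residual ** 2
    return delta * (magnitude - 0.5 * delta)


# A small residual is treated identically; a large one is penalized far less.
small = (squared_loss(0.5), huber_loss(0.5))
large = (squared_loss(10.0), huber_loss(10.0))
```

    Because the penalty grows only linearly for large residuals, a single outlying measurement pulls the fitted model much less than it would under the squared loss.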

  11. Gene expression profile of mouse prostate tumors reveals dysregulations in major biological processes and identifies potential murine targets for preclinical development of human prostate cancer therapy.

    PubMed

    Haram, Kerstyn M; Peltier, Heidi J; Lu, Bin; Bhasin, Manoj; Otu, Hasan H; Choy, Bob; Regan, Meredith; Libermann, Towia A; Latham, Gary J; Sanda, Martin G; Arredouani, Mohamed S

    2008-10-01

    Translation of preclinical studies into effective human cancer therapy is hampered by the lack of defined molecular expression patterns in mouse models that correspond to the human counterpart. We sought to generate an open source TRAMP mouse microarray dataset and to use this array to identify differentially expressed genes from human prostate cancer (PCa) that have concordant expression in TRAMP tumors, and thereby represent lead targets for preclinical therapy development. We performed microarrays on total RNA extracted and amplified from eight TRAMP tumors and nine normal prostates. A subset of differentially expressed genes was validated by QRT-PCR. Differentially expressed TRAMP genes were analyzed for concordant expression in publicly available human prostate array datasets and a subset of resulting genes was analyzed by QRT-PCR. Cross-referencing differentially expressed TRAMP genes to public human prostate array datasets revealed 66 genes with concordant expression in mouse and human PCa; 56 between metastases and normal and 10 between primary tumor and normal tissues. Of these 10 genes, two, Sox4 and Tubb2a, were validated by QRT-PCR. Our analysis also revealed various dysregulations in major biologic pathways in the TRAMP prostates. We report a TRAMP microarray dataset of which a gene subset was validated by QRT-PCR with expression patterns consistent with previous gene-specific TRAMP studies. Concordance analysis between TRAMP and human PCa associated genes supports the utility of the model and suggests several novel molecular targets for preclinical therapy.

  12. A high resolution atlas of gene expression in the domestic sheep (Ovis aries)

    PubMed Central

    Farquhar, Iseabail L.; Young, Rachel; Lefevre, Lucas; Pridans, Clare; Tsang, Hiu G.; Afrasiabi, Cyrus; Watson, Mick; Whitelaw, C. Bruce; Freeman, Tom C.; Archibald, Alan L.; Hume, David A.

    2017-01-01

    Sheep are a key source of meat, milk and fibre for the global livestock sector, and an important biomedical model. Global analysis of gene expression across multiple tissues has aided genome annotation and supported functional annotation of mammalian genes. We present a large-scale RNA-Seq dataset representing all the major organ systems from adult sheep and from several juvenile, neonatal and prenatal developmental time points. The Ovis aries reference genome (Oar v3.1) includes 27,504 genes (20,921 protein coding), of which 25,350 (19,921 protein coding) had detectable expression in at least one tissue in the sheep gene expression atlas dataset. Network-based cluster analysis of this dataset grouped genes according to their expression pattern. The principle of ‘guilt by association’ was used to infer the function of uncharacterised genes from their co-expression with genes of known function. We describe the overall transcriptional signatures present in the sheep gene expression atlas and assign those signatures, where possible, to specific cell populations or pathways. The findings are related to innate immunity by focusing on clusters with an immune signature, and to the advantages of cross-breeding by examining the patterns of genes exhibiting the greatest expression differences between purebred and crossbred animals. This high-resolution gene expression atlas for sheep is, to our knowledge, the largest transcriptomic dataset from any livestock species to date. It provides a resource to improve the annotation of the current reference genome for sheep, presenting a model transcriptome for ruminants and insight into gene, cell and tissue function at multiple developmental stages. PMID:28915238

  13. A high resolution atlas of gene expression in the domestic sheep (Ovis aries).

    PubMed

    Clark, Emily L; Bush, Stephen J; McCulloch, Mary E B; Farquhar, Iseabail L; Young, Rachel; Lefevre, Lucas; Pridans, Clare; Tsang, Hiu G; Wu, Chunlei; Afrasiabi, Cyrus; Watson, Mick; Whitelaw, C Bruce; Freeman, Tom C; Summers, Kim M; Archibald, Alan L; Hume, David A

    2017-09-01

    Sheep are a key source of meat, milk and fibre for the global livestock sector, and an important biomedical model. Global analysis of gene expression across multiple tissues has aided genome annotation and supported functional annotation of mammalian genes. We present a large-scale RNA-Seq dataset representing all the major organ systems from adult sheep and from several juvenile, neonatal and prenatal developmental time points. The Ovis aries reference genome (Oar v3.1) includes 27,504 genes (20,921 protein coding), of which 25,350 (19,921 protein coding) had detectable expression in at least one tissue in the sheep gene expression atlas dataset. Network-based cluster analysis of this dataset grouped genes according to their expression pattern. The principle of 'guilt by association' was used to infer the function of uncharacterised genes from their co-expression with genes of known function. We describe the overall transcriptional signatures present in the sheep gene expression atlas and assign those signatures, where possible, to specific cell populations or pathways. The findings are related to innate immunity by focusing on clusters with an immune signature, and to the advantages of cross-breeding by examining the patterns of genes exhibiting the greatest expression differences between purebred and crossbred animals. This high-resolution gene expression atlas for sheep is, to our knowledge, the largest transcriptomic dataset from any livestock species to date. It provides a resource to improve the annotation of the current reference genome for sheep, presenting a model transcriptome for ruminants and insight into gene, cell and tissue function at multiple developmental stages.

  14. CLIC, a tool for expanding biological pathways based on co-expression across thousands of datasets

    PubMed Central

    Li, Yang; Liu, Jun S.; Mootha, Vamsi K.

    2017-01-01

    In recent years, there has been a huge rise in the number of publicly available transcriptional profiling datasets. These massive compendia comprise billions of measurements and provide a special opportunity to predict the function of unstudied genes based on co-expression to well-studied pathways. Such analyses can be very challenging, however, since biological pathways are modular and may exhibit co-expression only in specific contexts. To overcome these challenges we introduce CLIC, CLustering by Inferred Co-expression. CLIC accepts as input a pathway consisting of two or more genes. It then uses a Bayesian partition model to simultaneously partition the input gene set into coherent co-expressed modules (CEMs), while assigning the posterior probability for each dataset in support of each CEM. CLIC then expands each CEM by scanning the transcriptome for additional co-expressed genes, quantified by an integrated log-likelihood ratio (LLR) score weighted for each dataset. As a byproduct, CLIC automatically learns the conditions (datasets) within which a CEM is operative. We implemented CLIC using a compendium of 1774 mouse microarray datasets (28628 microarrays) or 1887 human microarray datasets (45158 microarrays). CLIC analysis reveals that of 910 canonical biological pathways, 30% consist of strongly co-expressed gene modules for which new members are predicted. For example, CLIC predicts a functional connection between protein C7orf55 (FMC1) and the mitochondrial ATP synthase complex that we have experimentally validated. CLIC is freely available at www.gene-clic.org. We anticipate that CLIC will be valuable both for revealing new components of biological pathways as well as the conditions in which they are active. PMID:28719601

  15. Dynamic association rules for gene expression data analysis.

    PubMed

    Chen, Shu-Chuan; Tsai, Tsung-Hsien; Chung, Cheng-Han; Li, Wen-Hsiung

    2015-10-14

    The purpose of gene expression analysis is to look for the association between regulation of gene expression levels and phenotypic variations. This association based on gene expression profile has been used to determine whether the induction/repression of genes corresponds to phenotypic variations including cell regulations, clinical diagnoses and drug development. Statistical analyses on microarray data have been developed to resolve the gene selection issue. However, these methods do not inform us of causality between genes and phenotypes. In this paper, we propose the dynamic association rule algorithm (DAR algorithm) which helps one to efficiently select a subset of significant genes for subsequent analysis. The DAR algorithm is based on association rules from market basket analysis in marketing. We first propose a statistical way, based on constructing a one-sided confidence interval and hypothesis testing, to determine if an association rule is meaningful. Based on the proposed statistical method, we then developed the DAR algorithm for gene expression data analysis. The method was applied to analyze four microarray datasets and one Next Generation Sequencing (NGS) dataset: the Mice Apo A1 dataset, the whole genome expression dataset of mouse embryonic stem cells, expression profiling of the bone marrow of Leukemia patients, the Microarray Quality Control (MAQC) dataset, and the RNA-seq dataset of a mouse genomic imprinting study. A comparison of the proposed method with the t-test on the expression profiling of the bone marrow of Leukemia patients was conducted. We developed a statistical way, based on the concept of confidence interval, to determine the minimum support and minimum confidence for mining association relationships among items. With the minimum support and minimum confidence, one can find significant rules in one single step. The DAR algorithm was then developed for gene expression data analysis.
Four gene expression datasets showed that the proposed DAR algorithm not only was able to identify a set of differentially expressed genes that largely agreed with that of other methods, but also provided an efficient and accurate way to find influential genes of a disease. In the paper, the well-established association rule mining technique from marketing has been successfully modified to determine the minimum support and minimum confidence based on the concept of confidence interval and hypothesis testing. It can be applied to gene expression data to mine significant association rules between gene regulation and phenotype. The proposed DAR algorithm provides an efficient way to find influential genes that underlie the phenotypic variance.
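
    The support and confidence quantities that this record refers to are the standard measures from association rule mining. The sketch below computes them for a single rule; it does not reproduce the DAR algorithm or its confidence-interval procedure for choosing the minimum thresholds. The function names, the item labels such as `geneA_up`, and the toy samples are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    wanted = set(items)
    return sum(1 for t in transactions if wanted <= t) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    antecedent_support = support(transactions, antecedent)
    if antecedent_support == 0:
        return 0.0  # the rule never fires, so it has no meaningful confidence
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / antecedent_support


# Hypothetical samples: each set lists the discretized features seen in one sample.
samples = [
    {"geneA_up", "disease"},
    {"geneA_up", "disease"},
    {"geneA_up"},
    {"geneB_up"},
]
rule_support = support(samples, {"geneA_up", "disease"})
rule_confidence = confidence(samples, {"geneA_up"}, {"disease"})
```

    In this toy example the rule "geneA_up -> disease" holds in half of all samples and in two of the three samples where geneA is up-regulated.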

  16. A microarray whole-genome gene expression dataset in a rat model of inflammatory corneal angiogenesis.

    PubMed

    Mukwaya, Anthony; Lindvall, Jessica M; Xeroudaki, Maria; Peebo, Beatrice; Ali, Zaheer; Lennikov, Anton; Jensen, Lasse Dahl Ejby; Lagali, Neil

    2016-11-22

    In angiogenesis with concurrent inflammation, many pathways are activated, some linked to VEGF and others largely VEGF-independent. Pathways involving inflammatory mediators, chemokines, and micro-RNAs may play important roles in maintaining a pro-angiogenic environment or mediating angiogenic regression. Here, we describe a gene expression dataset to facilitate exploration of pro-angiogenic, pro-inflammatory, and remodelling/normalization-associated genes during both an active capillary sprouting phase, and in the restoration of an avascular phenotype. The dataset was generated by microarray analysis of the whole transcriptome in a rat model of suture-induced inflammatory corneal neovascularisation. Regions of active capillary sprout growth or regression in the cornea were harvested and total RNA extracted from four biological replicates per group. High quality RNA was obtained for gene expression analysis using microarrays. Fold change of selected genes was validated by qPCR, and protein expression was evaluated by immunohistochemistry. We provide a gene expression dataset that may be re-used to investigate corneal neovascularisation, and may also have implications in other contexts of inflammation-mediated angiogenesis.

  17. RefEx, a reference gene expression dataset as a web tool for the functional analysis of genes.

    PubMed

    Ono, Hiromasa; Ogasawara, Osamu; Okubo, Kosaku; Bono, Hidemasa

    2017-08-29

    Gene expression data are exponentially accumulating; thus, the functional annotation of such sequence data from metadata is urgently required. However, life scientists have difficulty utilizing the available data due to its sheer magnitude and complicated access. We have developed a web tool for browsing reference gene expression pattern of mammalian tissues and cell lines measured using different methods, which should facilitate the reuse of the precious data archived in several public databases. The web tool is called Reference Expression dataset (RefEx), and RefEx allows users to search by the gene name, various types of IDs, chromosomal regions in genetic maps, gene family based on InterPro, gene expression patterns, or biological categories based on Gene Ontology. RefEx also provides information about genes with tissue-specific expression, and the relative gene expression values are shown as choropleth maps on 3D human body images from BodyParts3D. Combined with the newly incorporated Functional Annotation of Mammals (FANTOM) dataset, RefEx provides insight regarding the functional interpretation of unfamiliar genes. RefEx is publicly available at http://refex.dbcls.jp/.

  18. RefEx, a reference gene expression dataset as a web tool for the functional analysis of genes

    PubMed Central

    Ono, Hiromasa; Ogasawara, Osamu; Okubo, Kosaku; Bono, Hidemasa

    2017-01-01

    Gene expression data are exponentially accumulating; thus, the functional annotation of such sequence data from metadata is urgently required. However, life scientists have difficulty utilizing the available data due to its sheer magnitude and complicated access. We have developed a web tool for browsing reference gene expression pattern of mammalian tissues and cell lines measured using different methods, which should facilitate the reuse of the precious data archived in several public databases. The web tool is called Reference Expression dataset (RefEx), and RefEx allows users to search by the gene name, various types of IDs, chromosomal regions in genetic maps, gene family based on InterPro, gene expression patterns, or biological categories based on Gene Ontology. RefEx also provides information about genes with tissue-specific expression, and the relative gene expression values are shown as choropleth maps on 3D human body images from BodyParts3D. Combined with the newly incorporated Functional Annotation of Mammals (FANTOM) dataset, RefEx provides insight regarding the functional interpretation of unfamiliar genes. RefEx is publicly available at http://refex.dbcls.jp/. PMID:28850115

  19. A Self-Directed Method for Cell-Type Identification and Separation of Gene Expression Microarrays

    PubMed Central

    Zuckerman, Neta S.; Noam, Yair; Goldsmith, Andrea J.; Lee, Peter P.

    2013-01-01

    Gene expression analysis is generally performed on heterogeneous tissue samples consisting of multiple cell types. Current methods developed to separate heterogeneous gene expression rely on prior knowledge of the cell-type composition and/or signatures, which are not available in most public datasets. We present a novel method to identify the cell-type composition, signatures and proportions per sample without need for a priori information. The method was successfully tested on controlled and semi-controlled datasets and performed as accurately as current methods that do require additional information. As such, this method enables the analysis of cell-type specific gene expression using existing large pools of publicly available microarray datasets. PMID:23990767

  20. Comparison of gene expression microarray data with count-based RNA measurements informs microarray interpretation.

    PubMed

    Richard, Arianne C; Lyons, Paul A; Peters, James E; Biasci, Daniele; Flint, Shaun M; Lee, James C; McKinney, Eoin F; Siegel, Richard M; Smith, Kenneth G C

    2014-08-04

    Although numerous investigations have compared gene expression microarray platforms, preprocessing methods and batch correction algorithms using constructed spike-in or dilution datasets, there remains a paucity of studies examining the properties of microarray data using diverse biological samples. Most microarray experiments seek to identify subtle differences between samples with variable background noise, a scenario poorly represented by constructed datasets. Thus, microarray users lack important information regarding the complexities introduced in real-world experimental settings. The recent development of a multiplexed, digital technology for nucleic acid measurement enables counting of individual RNA molecules without amplification and, for the first time, permits such a study. Using a set of human leukocyte subset RNA samples, we compared previously acquired microarray expression values with RNA molecule counts determined by the nCounter Analysis System (NanoString Technologies) in selected genes. We found that gene measurements across samples correlated well between the two platforms, particularly for high-variance genes, while genes deemed unexpressed by the nCounter generally had both low expression and low variance on the microarray. Confirming previous findings from spike-in and dilution datasets, this "gold-standard" comparison demonstrated signal compression that varied dramatically by expression level and, to a lesser extent, by dataset. Most importantly, examination of three different cell types revealed that noise levels differed across tissues. Microarray measurements generally correlate with relative RNA molecule counts within optimal ranges but suffer from expression-dependent accuracy bias and precision that varies across datasets. We urge microarray users to consider expression-level effects in signal interpretation and to evaluate noise properties in each dataset independently.

  1. Cancer Detection in Microarray Data Using a Modified Cat Swarm Optimization Clustering Approach

    PubMed

    M, Pandi; R, Balamurugan; N, Sadhasivam

    2017-12-29

    Objective: A better understanding of functional genomics can be obtained by extracting patterns hidden in gene expression data. This could have paramount implications for cancer diagnosis, gene treatments and other domains. Clustering may reveal natural structures and identify interesting patterns in underlying data. The main objective of this research was to derive a heuristic approach to detection of highly co-expressed genes related to cancer from gene expression data with minimum Mean Squared Error (MSE). Methods: A modified CSO algorithm using Harmony Search (MCSO-HS) for clustering cancer gene expression data was applied. Experimental results were analyzed using two cancer gene expression benchmark datasets, one for leukaemia and one for breast cancer. Result: In terms of MSE, MCSO-HS improved on HS and CSO by 13% and 9%, respectively, on the leukaemia dataset, and by 22% and 17%, respectively, on the breast cancer dataset. Conclusion: The results showed MCSO-HS to outperform HS and CSO with both benchmark datasets. To validate the clustering results, this work was tested with internal and external cluster validation indices. This work also points to biological validation of clusters with gene ontology in terms of function, process and component.
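
As a rough illustration of the MSE criterion named in the abstract above, the sketch below scores a clustering by the mean squared Euclidean distance between each expression profile and the centroid of its assigned cluster. This definition is an assumption, since the abstract does not spell out the formula, and the sketch does not implement the MCSO-HS optimizer itself. The function name `clustering_mse` and the toy profiles are hypothetical.

```python
from typing import Dict, List, Sequence


def clustering_mse(points: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean squared distance of each point to its cluster centroid.

    Assumed definition of clustering MSE; lower values mean tighter clusters.
    """
    if len(points) != len(labels) or not points:
        raise ValueError("points and labels must be non-empty and of equal length")
    # Group the points by cluster label.
    clusters: Dict[int, List[Sequence[float]]] = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        # Centroid is the per-dimension mean of the cluster members.
        centroid = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        for m in members:
            total += sum((m[d] - centroid[d]) ** 2 for d in range(dim))
    return total / len(points)


# Toy expression profiles (two conditions per gene) and two candidate labelings.
profiles = [(1.0, 2.0), (3.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
good = clustering_mse(profiles, [0, 0, 1, 1])
bad = clustering_mse(profiles, [0, 1, 0, 1])
```

A search heuristic such as the one described would compare candidate labelings by a score of this kind and keep the one with the lower value.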

  2. Complex nature of SNP genotype effects on gene expression in primary human leucocytes.

    PubMed

    Heap, Graham A; Trynka, Gosia; Jansen, Ritsert C; Bruinenberg, Marcel; Swertz, Morris A; Dinesen, Lotte C; Hunt, Karen A; Wijmenga, Cisca; Vanheel, David A; Franke, Lude

    2009-01-07

    Genome wide association studies have been hugely successful in identifying disease risk variants, yet most variants do not lead to coding changes and how variants influence biological function is usually unknown. We correlated gene expression and genetic variation in untouched primary leucocytes (n = 110) from individuals with celiac disease - a common condition with multiple risk variants identified. We compared our observations with an EBV-transformed HapMap B cell line dataset (n = 90), and performed a meta-analysis to increase power to detect non-tissue specific effects. In celiac peripheral blood, 2,315 SNP variants influenced gene expression at 765 different transcripts (< 250 kb from SNP, at FDR = 0.05, cis expression quantitative trait loci, eQTLs). 135 of the detected SNP-probe effects (reflecting 51 unique probes) were also detected in a HapMap B cell line published dataset, all with effects in the same allelic direction. Overall gene expression differences within the two datasets predominantly explain the limited overlap in observed cis-eQTLs. Celiac associated risk variants from two regions, containing genes IL18RAP and CCR3, showed significant cis genotype-expression correlations in the peripheral blood but not in the B cell line datasets. We identified 14 genes where a SNP affected the expression of different probes within the same gene, but in opposite allelic directions. By incorporating genetic variation in co-expression analyses, functional relationships between genes can be more significantly detected. In conclusion, the complex nature of genotypic effects in human populations makes the use of a relevant tissue, large datasets, and analysis of different exons essential to enable the identification of the function for many genetic risk variants in common diseases.

  3. SPP1 and AGER as potential prognostic biomarkers for lung adenocarcinoma.

    PubMed

    Zhang, Weiguo; Fan, Junli; Chen, Qiang; Lei, Caipeng; Qiao, Bin; Liu, Qin

    2018-05-01

    Overdue treatment and prognostic evaluation lead to low survival rates in patients with lung adenocarcinoma (LUAD). To date, effective biomarkers for prognosis are still required. The aim of the present study was to screen differentially expressed genes (DEGs) as biomarkers for prognostic evaluation of LUAD. DEGs in tumor and normal samples were identified and analyzed for Kyoto Encyclopedia of Genes and Genomes/Gene Ontology functional enrichments. The common genes that are up and downregulated were selected for prognostic analysis using RNAseq data in The Cancer Genome Atlas. Differential expression analysis was performed with 164 samples in GSE10072 and GSE7670 datasets. A total of 484 DEGs that were present in GSE10072 and GSE7670 datasets were screened, including secreted phosphoprotein 1 (SPP1) that was highly expressed and DEGs ficolin 3, advanced glycosylation end-product specific receptor (AGER), transmembrane protein 100 that were lowly expressed in tumor tissues. These four key genes were subsequently verified using an independent dataset, GSE19804. The gene expression model was consistent with GSE10072 and GSE7670 datasets. The dysregulation of highly expressed SPP1 and lowly expressed AGER significantly reduced the median survival time of patients with LUAD. These findings suggest that SPP1 and AGER are risk factors for LUAD, and these two genes may be utilized in the prognostic evaluation of patients with LUAD. Additionally, the key genes and functional enrichments may provide a reference for investigating the molecular expression mechanisms underlying LUAD.

  4. Hybrid coexpression link similarity graph clustering for mining biological modules from multiple gene expression datasets.

    PubMed

    Salem, Saeed; Ozcaglar, Cagri

    2014-01-01

    Advances in genomic technologies have enabled the accumulation of vast amount of genomic data, including gene expression data for multiple species under various biological and environmental conditions. Integration of these gene expression datasets is a promising strategy to alleviate the challenges of protein functional annotation and biological module discovery based on a single gene expression data, which suffers from spurious coexpression. We propose a joint mining algorithm that constructs a weighted hybrid similarity graph whose nodes are the coexpression links. The weight of an edge between two coexpression links in this hybrid graph is a linear combination of the topological similarities and co-appearance similarities of the corresponding two coexpression links. Clustering the weighted hybrid similarity graph yields recurrent coexpression link clusters (modules). Experimental results on Human gene expression datasets show that the reported modules are functionally homogeneous as evident by their enrichment with biological process GO terms and KEGG pathways.

  5. SpeCond: a method to detect condition-specific gene expression

    PubMed Central

    2011-01-01

    Transcriptomic studies routinely measure expression levels across numerous conditions. These datasets allow identification of genes that are specifically expressed in a small number of conditions. However, there are currently no statistically robust methods for identifying such genes. Here we present SpeCond, a method to detect condition-specific genes that outperforms alternative approaches. We apply the method to a dataset of 32 human tissues to determine 2,673 specifically expressed genes. An implementation of SpeCond is freely available as a Bioconductor package at http://www.bioconductor.org/packages/release/bioc/html/SpeCond.html. PMID:22008066

  6. SurvExpress: an online biomarker validation tool and database for cancer gene expression data using survival analysis.

    PubMed

    Aguirre-Gamboa, Raul; Gomez-Rueda, Hugo; Martínez-Ledesma, Emmanuel; Martínez-Torteya, Antonio; Chacolla-Huaringa, Rafael; Rodriguez-Barrientos, Alberto; Tamez-Peña, José G; Treviño, Victor

    2013-01-01

    Validation of multi-gene biomarkers for clinical outcomes is one of the most important issues for cancer prognosis. An important source of information for virtual validation is the high number of available cancer datasets. Nevertheless, assessing the prognostic performance of a gene expression signature along datasets is a difficult task for Biologists and Physicians and also time-consuming for Statisticians and Bioinformaticians. Therefore, to facilitate performance comparisons and validations of survival biomarkers for cancer outcomes, we developed SurvExpress, a cancer-wide gene expression database with clinical outcomes and a web-based tool that provides survival analysis and risk assessment of cancer datasets. The main input of SurvExpress is only the biomarker gene list. We generated a cancer database collecting more than 20,000 samples and 130 datasets with censored clinical information covering tumors over 20 tissues. We implemented a web interface to perform biomarker validation and comparisons in this database, where a multivariate survival analysis can be accomplished in about one minute. We show the utility and simplicity of SurvExpress in two biomarker applications for breast and lung cancer. Compared to other tools, SurvExpress is the largest, most versatile, and quickest free tool available. SurvExpress web can be accessed in http://bioinformatica.mty.itesm.mx/SurvExpress (a tutorial is included). The website was implemented in JSP, JavaScript, MySQL, and R.

  7. SurvExpress: An Online Biomarker Validation Tool and Database for Cancer Gene Expression Data Using Survival Analysis

    PubMed Central

    Aguirre-Gamboa, Raul; Gomez-Rueda, Hugo; Martínez-Ledesma, Emmanuel; Martínez-Torteya, Antonio; Chacolla-Huaringa, Rafael; Rodriguez-Barrientos, Alberto; Tamez-Peña, José G.; Treviño, Victor

    2013-01-01

    Validation of multi-gene biomarkers for clinical outcomes is one of the most important issues for cancer prognosis. An important source of information for virtual validation is the high number of available cancer datasets. Nevertheless, assessing the prognostic performance of a gene expression signature along datasets is a difficult task for Biologists and Physicians and also time-consuming for Statisticians and Bioinformaticians. Therefore, to facilitate performance comparisons and validations of survival biomarkers for cancer outcomes, we developed SurvExpress, a cancer-wide gene expression database with clinical outcomes and a web-based tool that provides survival analysis and risk assessment of cancer datasets. The main input of SurvExpress is only the biomarker gene list. We generated a cancer database collecting more than 20,000 samples and 130 datasets with censored clinical information covering tumors over 20 tissues. We implemented a web interface to perform biomarker validation and comparisons in this database, where a multivariate survival analysis can be accomplished in about one minute. We show the utility and simplicity of SurvExpress in two biomarker applications for breast and lung cancer. Compared to other tools, SurvExpress is the largest, most versatile, and quickest free tool available. SurvExpress web can be accessed in http://bioinformatica.mty.itesm.mx/SurvExpress (a tutorial is included). The website was implemented in JSP, JavaScript, MySQL, and R. PMID:24066126

  8. Determining Cutoff Point of Ensemble Trees Based on Sample Size in Predicting Clinical Dose with DNA Microarray Data.

    PubMed

    Yılmaz Isıkhan, Selen; Karabulut, Erdem; Alpar, Celal Reha

    2016-01-01

    Background/Aim. Evaluating the success of dose prediction based on genetic or clinical data has substantially advanced recently. The aim of this study is to predict various clinical dose values from DNA gene expression datasets using data mining techniques. Materials and Methods. Eleven real gene expression datasets containing dose values were included. First, important genes for dose prediction were selected using iterative sure independence screening. Then, the performances of regression trees (RTs), support vector regression (SVR), RT bagging, SVR bagging, and RT boosting were examined. Results. The results demonstrated that a regression-based feature selection method substantially reduced the number of irrelevant genes from raw datasets. Overall, the best prediction performance in nine of 11 datasets was achieved using SVR; the second most accurate performance was provided using a gradient-boosting machine (GBM). Conclusion. Analysis of various dose values based on microarray gene expression data identified common genes found in our study and the referenced studies. According to our findings, SVR and GBM can be good predictors of dose-gene datasets. Another result of the study was to identify the sample size of n = 25 as a cutoff point for RT bagging to outperform a single RT.

  9. 8D.07: GENE EXPRESSION ANALYSIS AND BIOINFORMATICS REVEALED POTENTIAL TRANSCRIPTION FACTORS ASSOCIATED WITH RENIN-ANGIOTENSIN-ALDOSTERONE SYSTEM IN ATHEROMA.

    PubMed

    Nehme, A; Zibara, K; Cerutti, C; Bricca, G

    2015-06-01

    The implication of the renin-angiotensin-aldosterone system (RAAS) in atheroma development is well described. However, a complete view of the local RAAS in atheroma is still missing. In this study we aimed to reveal the organization of RAAS in atheroma at the transcriptomic level and identify the transcriptional regulators behind it. Extended RAAS (extRAAS) was defined as the set of 37 genes coding for classical and novel RAAS participants (Figure 1). Five microarray datasets containing overall 590 samples representing carotid and peripheral atheroma were downloaded from the GEO database. Correlation-based hierarchical clustering (R software) of extRAAS genes within each dataset allowed the identification of modules of co-expressed genes. Reproducible co-expression modules across datasets were then extracted. Transcription factors (TFs) having common binding sites (TFBSs) in the promoters of coordinated genes were identified using the Genomatix database tools and analyzed for their correlation with extRAAS genes in the microarray datasets. Expression data revealed the expressed extRAAS components and their relative abundance displaying the favored pathways in atheroma. Three co-expression modules with more than 80% reproducibility across datasets were extracted. Two of them (M1 and M2) contained genes coding for angiotensin metabolizing enzymes involved in different pathways: M1 included ACE, MME, RNPEP, and DPP3, in addition to 7 other genes; and M2 included CMA1, CTSG, and CPA3. The third module (M3) contained genes coding for receptors known to be implicated in atheroma (AGTR1, MR, GR, LNPEP, EGFR and GPER). M1 and M3 were negatively correlated in 3 of 5 datasets. We identified 19 TFs that have enriched TFBSs in the promoters of genes of M1, and two for M3, but none was found for M2. Among the extracted TFs, ELF1, MAX, and IRF5 showed significant positive correlations with peptidase-coding genes from M1 and negative correlations with receptors-coding genes from M3 (p < 0.05). The identified co-expression modules display the transcriptional organization of local extRAAS in human carotid atheroma. The identification of several TFs potentially associated to extRAAS genes may provide a frame for the discovery of atheroma-specific modulators of extRAAS activity. (Figure is included in full-text article.)
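
The correlation-based hierarchical clustering step described in this abstract can be sketched as follows. The study used R; this pure-Python version is only an illustration under stated assumptions: the distance is taken as 1 minus the Pearson correlation, the linkage is average linkage, and the tree is cut at a fixed number of clusters. None of these choices is confirmed by the abstract. The gene names and expression values are placeholders, not data from the study.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Sequence, Set


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # raises ZeroDivisionError for constant profiles


def cluster_genes(expr: Dict[str, Sequence[float]], n_clusters: int) -> List[Set[str]]:
    """Agglomerative clustering with distance 1 - r and average linkage."""
    dist = {
        frozenset((g, h)): 1.0 - pearson(expr[g], expr[h])
        for g, h in combinations(expr, 2)
    }

    def linkage(a: Set[str], b: Set[str]) -> float:
        # Average pairwise distance between members of two clusters.
        return sum(dist[frozenset((g, h))] for g in a for h in b) / (len(a) * len(b))

    clusters: List[Set[str]] = [{g} for g in expr]
    while len(clusters) > n_clusters:
        # Merge the closest pair of clusters; i < j, so deleting j is safe.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters


# Placeholder profiles across four samples.
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 6.0, 8.2],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.0, 6.2, 3.9, 2.0],
}
modules = cluster_genes(expr, n_clusters=2)
```

With these placeholder values the two positively correlated pairs end up in separate modules, which mirrors the idea of co-expression modules that are negatively correlated with each other.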

  10. Probe-level linear model fitting and mixture modeling results in high accuracy detection of differential gene expression.

    PubMed

    Lemieux, Sébastien

    2006-08-25

    The identification of differentially expressed genes (DEGs) from Affymetrix GeneChips arrays is currently done by first computing expression levels from the low-level probe intensities, then deriving significance by comparing these expression levels between conditions. The proposed PL-LM (Probe-Level Linear Model) method implements a linear model applied on the probe-level data to directly estimate the treatment effect. A finite mixture of Gaussian components is then used to identify DEGs using the coefficients estimated by the linear model. This approach can readily be applied to experimental design with or without replication. On a wholly defined dataset, the PL-LM method was able to identify 75% of the differentially expressed genes within 10% of false positives. This accuracy was achieved both using the three replicates per conditions available in the dataset and using only one replicate per condition. The method achieves, on this dataset, a higher accuracy than the best set of tools identified by the authors of the dataset, and does so using only one replicate per condition.

  11. Hybrid coexpression link similarity graph clustering for mining biological modules from multiple gene expression datasets

    PubMed Central

    2014-01-01

    Background: Advances in genomic technologies have enabled the accumulation of vast amount of genomic data, including gene expression data for multiple species under various biological and environmental conditions. Integration of these gene expression datasets is a promising strategy to alleviate the challenges of protein functional annotation and biological module discovery based on a single gene expression data, which suffers from spurious coexpression. Results: We propose a joint mining algorithm that constructs a weighted hybrid similarity graph whose nodes are the coexpression links. The weight of an edge between two coexpression links in this hybrid graph is a linear combination of the topological similarities and co-appearance similarities of the corresponding two coexpression links. Clustering the weighted hybrid similarity graph yields recurrent coexpression link clusters (modules). Experimental results on Human gene expression datasets show that the reported modules are functionally homogeneous as evident by their enrichment with biological process GO terms and KEGG pathways. PMID:25221624

  12. Enhancing biological relevance of a weighted gene co-expression network for functional module identification.

    PubMed

    Prom-On, Santitham; Chanthaphan, Atthawut; Chan, Jonathan Hoyin; Meechai, Asawin

    2011-02-01

    Relationships among gene expression levels may be associated with the mechanisms of the disease. While identifying a direct association such as a difference in expression levels between case and control groups links genes to disease mechanisms, uncovering an indirect association in the form of a network structure may help reveal the underlying functional module associated with the disease under scrutiny. This paper presents a method to improve the biological relevance in functional module identification from the gene expression microarray data by enhancing the structure of a weighted gene co-expression network using minimum spanning tree. The enhanced network, which is called a backbone network, contains only the essential structural information to represent the gene co-expression network. The entire backbone network is decoupled into a number of coherent sub-networks, and then the functional modules are reconstructed from these sub-networks to ensure minimum redundancy. The method was tested with a simulated gene expression dataset and case-control expression datasets of autism spectrum disorder and colorectal cancer studies. The results indicate that the proposed method can accurately identify clusters in the simulated dataset, and the functional modules of the backbone network are more biologically relevant than those obtained from the original approach.
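
The backbone idea in this abstract, keeping only a spanning tree of the weighted co-expression network, can be sketched with Kruskal's algorithm. The sketch assumes edge distance is 1 minus the co-expression similarity, so the minimum spanning tree retains the strongest links; the abstract does not state the weighting, and the later steps (decoupling into sub-networks and reconstructing modules) are not shown. The gene labels and similarity values are hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def mst_backbone(similarity: Dict[Edge, float]) -> List[Edge]:
    """Return the edges of a minimum spanning tree over distance 1 - similarity.

    Kruskal's algorithm with a small union-find structure. For a disconnected
    network this yields a spanning forest instead of a single tree.
    """
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    backbone: List[Edge] = []
    # Visit edges from the shortest distance (strongest co-expression) upward.
    for (u, v), _ in sorted(similarity.items(), key=lambda kv: 1.0 - kv[1]):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # edge joins two components, so no cycle is formed
            parent[root_u] = root_v
            backbone.append((u, v))
    return backbone


# Hypothetical co-expression similarities between four genes.
similarity = {
    ("g1", "g2"): 0.9,
    ("g2", "g3"): 0.8,
    ("g1", "g3"): 0.3,
    ("g3", "g4"): 0.7,
    ("g1", "g4"): 0.2,
}
backbone = mst_backbone(similarity)
```

The weak links g1-g3 and g1-g4 are dropped because the genes they join are already connected through stronger edges, leaving a tree with one fewer edge than there are genes.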

  13. Integrative Analysis of GWASs, Human Protein Interaction, and Gene Expression Identified Gene Modules Associated With BMDs

    PubMed Central

    He, Hao; Zhang, Lei; Li, Jian; Wang, Yu-Ping; Zhang, Ji-Gang; Shen, Jie; Guo, Yan-Fang

    2014-01-01

    Context: To date, few systems genetics studies in the bone field have been performed. We designed our study from a systems-level perspective by integrating genome-wide association studies (GWASs), human protein-protein interaction (PPI) network, and gene expression to identify gene modules contributing to osteoporosis risk. Methods: First we searched for modules significantly enriched with bone mineral density (BMD)-associated genes in human PPI network by using 2 large meta-analysis GWAS datasets through a dense module search algorithm. One included 7 individual GWAS samples (Meta7). The other was from the Genetic Factors for Osteoporosis Consortium (GEFOS2). One was assigned as a discovery dataset and the other as an evaluation dataset, and vice versa. Results: In total, 42 modules and 129 modules were identified significantly in both Meta7 and GEFOS2 datasets for femoral neck and spine BMD, respectively. There were 3340 modules identified for hip BMD only in Meta7. As candidate modules, they were assessed for the biological relevance to BMD by gene set enrichment analysis in 2 expression profiles generated from circulating monocytes in subjects with low versus high BMD values. Interestingly, there were 2 modules significantly enriched in monocytes from the low BMD group in both gene expression datasets (nominal P value <.05). Two modules had 16 nonredundant genes. Functional enrichment analysis revealed that both modules were enriched for genes involved in Wnt receptor signaling and osteoblast differentiation. Conclusion: We highlighted 2 modules and novel genes playing important roles in the regulation of bone mass, providing important clues for therapeutic approaches for osteoporosis. PMID:25119315

  14. Dataset of proinflammatory cytokine and cytokine receptor gene expression in rainbow trout (Oncorhynchus mykiss) measured using a novel GeXP multiplex, RT-PCR assay

    USDA-ARS's Scientific Manuscript database

    A GeXP multiplex, RT-PCR assay was developed and optimized that simultaneously measures expression of a suite of immune-relevant genes in rainbow trout (Oncorhynchus mykiss), concentrating on tumor necrosis factor and interleukin-1 ligand/receptor systems and acute phase response genes. The dataset ...

  15. Microarray Data Mining for Potential Selenium Targets in Chemoprevention of Prostate Cancer

    PubMed Central

    ZHANG, HAITAO; DONG, YAN; ZHAO, HONGJUAN; BROOKS, JAMES D.; HAWTHORN, LESLEYANN; NOWAK, NORMA; MARSHALL, JAMES R.; GAO, ALLEN C.; IP, CLEMENT

    2008-01-01

    Background: A previous clinical trial showed that selenium supplementation significantly reduced the incidence of prostate cancer. We report here a bioinformatics approach to gain new insights into selenium molecular targets that might be relevant to prostate cancer chemoprevention. Materials and Methods: We first performed data mining analysis to identify genes which are consistently dysregulated in prostate cancer using published datasets from gene expression profiling of clinical prostate specimens. We then devised a method to systematically analyze three selenium microarray datasets from the LNCaP human prostate cancer cells, and to match the analysis to the cohort of genes implicated in prostate carcinogenesis. Moreover, we compared the selenium datasets with two datasets obtained from expression profiling of androgen-stimulated LNCaP cells. Results: We found that selenium reverses the expression of genes implicated in prostate carcinogenesis. In addition, we found that selenium could counteract the effect of androgen on the expression of a subset obtained from androgen-regulated genes. Conclusions: The above information provides us with a treasure of new clues to investigate the mechanism of selenium chemoprevention of prostate cancer. Furthermore, these selenium target genes could also serve as biomarkers in future clinical trials to gauge the efficacy of selenium intervention. PMID:18548127

  16. EPConDB: a web resource for gene expression related to pancreatic development, beta-cell function and diabetes.

    PubMed

    Mazzarelli, Joan M; Brestelli, John; Gorski, Regina K; Liu, Junmin; Manduchi, Elisabetta; Pinney, Deborah F; Schug, Jonathan; White, Peter; Kaestner, Klaus H; Stoeckert, Christian J

    2007-01-01

    EPConDB (http://www.cbil.upenn.edu/EPConDB) is a public web site that supports research in diabetes, pancreatic development and beta-cell function by providing information about genes expressed in cells of the pancreas. EPConDB displays expression profiles for individual genes and information about transcripts, promoter elements and transcription factor binding sites. Gene expression results are obtained from studies examining tissue expression, pancreatic development and growth, differentiation of insulin-producing cells, islet or beta-cell injury, and genetic models of impaired beta-cell function. The expression datasets are derived using different microarray platforms, including the BCBC PancChips and Affymetrix gene expression arrays. Other datasets include semi-quantitative RT-PCR and MPSS expression studies. For selected microarray studies, lists of differentially expressed genes, derived from PaGE analysis, are displayed on the site. EPConDB provides database queries and tools to examine the relationship between a gene, its transcriptional regulation, protein function and expression in pancreatic tissues.

  17. New Statistics for Testing Differential Expression of Pathways from Microarray Data

    NASA Astrophysics Data System (ADS)

    Siu, Hoicheong; Dong, Hua; Jin, Li; Xiong, Momiao

    Exploring biological meaning from microarray data is very important but remains a great challenge. Here, we developed three new statistics: linear combination test, quadratic test and de-correlation test to identify differentially expressed pathways from gene expression profile. We apply our statistics to two rheumatoid arthritis datasets. Notably, our results reveal three significant pathways and 275 genes in common in two datasets. The pathways we found are meaningful to uncover the disease mechanisms of rheumatoid arthritis, which implies that our statistics are a powerful tool in functional analysis of gene expression data.

  18. Integrated Analysis of Alzheimer's Disease and Schizophrenia Dataset Revealed Different Expression Pattern in Learning and Memory.

    PubMed

    Li, Wen-Xing; Dai, Shao-Xing; Liu, Jia-Qian; Wang, Qian; Li, Gong-Hua; Huang, Jing-Fei

    2016-01-01

    Alzheimer's disease (AD) and schizophrenia (SZ) are both accompanied by impaired learning and memory functions. This study aims to explore the expression profiles of learning or memory genes between AD and SZ. We downloaded 10 AD and 10 SZ datasets from GEO-NCBI for integrated analysis. These datasets were processed using the RMA algorithm and a global renormalization across all studies. An empirical Bayes algorithm was then used to find the differentially expressed genes between patients and controls. The results showed that most of the differentially expressed genes were related to AD, whereas the gene expression profile was little affected in SZ. Furthermore, in terms of the number of differentially expressed genes, the fold change and the brain region, there was a great difference in the expression of learning or memory related genes between AD and SZ. In AD, CALB1, GABRA5, and TAC1 were significantly downregulated in whole brain, frontal lobe, temporal lobe, and hippocampus. However, in SZ, only two genes, CRHBP and CX3CR1, were downregulated in hippocampus, and other brain regions were not affected. The effect of these genes on learning or memory impairment has been widely studied. It was suggested that these genes may play a crucial role in AD or SZ pathogenesis. The different gene expression patterns between AD and SZ on learning and memory functions in different brain regions revealed in our study may help to understand the different mechanisms of the two diseases.

  19. Genome-Level Longitudinal Expression of Signaling Pathways and Gene Networks in Pediatric Septic Shock

    PubMed Central

    Shanley, Thomas P; Cvijanovich, Natalie; Lin, Richard; Allen, Geoffrey L; Thomas, Neal J; Doctor, Allan; Kalyanaraman, Meena; Tofil, Nancy M; Penfil, Scott; Monaco, Marie; Odoms, Kelli; Barnes, Michael; Sakthivel, Bhuvaneswari; Aronow, Bruce J; Wong, Hector R

    2007-01-01

    We have conducted longitudinal studies focused on the expression profiles of signaling pathways and gene networks in children with septic shock. Genome-level expression profiles were generated from whole blood-derived RNA of children with septic shock (n = 30) corresponding to day one and day three of septic shock, respectively. Based on sequential statistical and expression filters, day one and day three of septic shock were characterized by differential regulation of 2,142 and 2,504 gene probes, respectively, relative to controls (n = 15). Venn analysis demonstrated 239 unique genes in the day one dataset, 598 unique genes in the day three dataset, and 1,906 genes common to both datasets. Functional analyses demonstrated time-dependent, differential regulation of genes involved in multiple signaling pathways and gene networks primarily related to immunity and inflammation. Notably, multiple and distinct gene networks involving T cell- and MHC antigen-related biology were persistently downregulated on both day one and day three. Further analyses demonstrated large scale, persistent downregulation of genes corresponding to functional annotations related to zinc homeostasis. These data represent the largest reported cohort of patients with septic shock subjected to longitudinal genome-level expression profiling. The data further advance our genome-level understanding of pediatric septic shock and support novel hypotheses. PMID:17932561
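
    The Venn analysis described above amounts to partitioning two lists of differentially regulated probes into day-specific and shared sets. A minimal Python sketch follows; the function name and the toy probe identifiers are hypothetical and are not taken from the study.

```python
def venn_partition(day_one, day_three):
    """Split two collections of probe IDs into unique and shared subsets."""
    day_one, day_three = set(day_one), set(day_three)
    return {
        "day_one_only": day_one - day_three,    # regulated on day one only
        "day_three_only": day_three - day_one,  # regulated on day three only
        "common": day_one & day_three,          # regulated on both days
    }


# Toy input (hypothetical probe IDs, not data from the study)
parts = venn_partition(["p1", "p2", "p3"], ["p2", "p3", "p4", "p5"])
for name, probes in parts.items():
    print(name, sorted(probes))
```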

  20. Integrated analyses for genetic markers of polycystic ovary syndrome with 9 case-control studies of gene expression profiles.

    PubMed

    Lu, Chenqi; Liu, Xiaoqin; Wang, Lin; Jiang, Ning; Yu, Jun; Zhao, Xiaobo; Hu, Hairong; Zheng, Saihua; Li, Xuelian; Wang, Guiying

    2017-01-10

    Due to genetic heterogeneity and variable diagnostic criteria, genetic studies of polycystic ovary syndrome are particularly challenging. Furthermore, lack of sufficiently large cohorts limits the identification of susceptibility genes contributing to polycystic ovary syndrome. Here, we carried out a systematic search of studies deposited in the Gene Expression Omnibus database through August 31, 2016. The present analyses included studies with: 1) patients with polycystic ovary syndrome and normal controls, 2) gene expression profiling of messenger RNA, and 3) sufficient data for our analysis. Ultimately, a total of 9 studies with 13 datasets met the inclusion criteria and were included in the subsequent integrated analyses. Through comprehensive analyses, 13 genetic factors overlapped in all datasets and were identified as significant specific genes for polycystic ovary syndrome. After quality control assessment, six datasets remained. Further gene ontology enrichment and pathway analyses suggested that differentially expressed genes were mainly enriched in oocyte pathways. These findings provide potential molecular markers for diagnosis and prognosis of polycystic ovary syndrome, and in-depth studies are needed on their exact function and mechanism in polycystic ovary syndrome.

  1. Genes@Work: an efficient algorithm for pattern discovery and multivariate feature selection in gene expression data.

    PubMed

    Lepre, Jorge; Rice, J Jeremy; Tu, Yuhai; Stolovitzky, Gustavo

    2004-05-01

    Despite the growing literature devoted to finding differentially expressed genes in assays probing different tissue types, little attention has been paid to the combinatorial nature of feature selection inherent to large, high-dimensional gene expression datasets. New flexible data analysis approaches capable of searching relevant subgroups of genes and experiments are needed to understand multivariate associations of gene expression patterns with observed phenotypes. We present in detail a deterministic algorithm to discover patterns of multivariate gene associations in gene expression data. The patterns discovered are differential with respect to a control dataset. The algorithm is exhaustive and efficient, reporting all existing patterns that fit a given input parameter set while avoiding enumeration of the entire pattern space. The value of the pattern discovery approach is demonstrated by finding a set of genes that differentiate between two types of lymphoma. Moreover, these genes are found to behave consistently in an independent dataset produced in a different laboratory using different arrays, thus validating the genes selected using our algorithm. We show that the genes deemed significant in terms of their multivariate statistics will be missed using other methods. Our set of pattern discovery algorithms including a user interface is distributed as a package called Genes@Work. This package is freely available to non-commercial users and can be downloaded from our website (http://www.research.ibm.com/FunGen).

  2. Validation of MIMGO: a method to identify differentially expressed GO terms in a microarray dataset

    PubMed Central

    2012-01-01

    Background: We previously proposed an algorithm for the identification of GO terms that commonly annotate genes whose expression is upregulated or downregulated in some microarray data compared with in other microarray data. We call these “differentially expressed GO terms” and have named the algorithm “matrix-assisted identification method of differentially expressed GO terms” (MIMGO). MIMGO can also identify microarray data in which genes annotated with a differentially expressed GO term are upregulated or downregulated. However, MIMGO has not yet been validated on a real microarray dataset using all available GO terms. Findings: We combined Gene Set Enrichment Analysis (GSEA) with MIMGO to identify differentially expressed GO terms in a yeast cell cycle microarray dataset. GSEA followed by MIMGO (GSEA + MIMGO) correctly identified (p < 0.05) microarray data in which genes annotated to differentially expressed GO terms are upregulated. We found that GSEA + MIMGO was slightly less effective than, or comparable to, GSEA (Pearson), a method that uses Pearson’s correlation as a metric, at detecting true differentially expressed GO terms. However, unlike other methods including GSEA (Pearson), GSEA + MIMGO can comprehensively identify the microarray data in which genes annotated with a differentially expressed GO term are upregulated or downregulated. Conclusions: MIMGO is a reliable method to identify differentially expressed GO terms comprehensively. PMID:23232071

  3. Seq-ing answers: uncovering the unexpected in global gene regulation.

    PubMed

    Otto, George Maxwell; Brar, Gloria Ann

    2018-04-19

    The development of techniques for measuring gene expression globally has greatly expanded our understanding of gene regulatory mechanisms in depth and scale. We can now quantify every intermediate and transition in the canonical pathway of gene expression, from DNA to mRNA to protein, genome-wide. Employing such measurements in parallel can produce rich datasets, but extracting the most information requires careful experimental design and analysis. Here, we argue for the value of genome-wide studies that measure multiple outputs of gene expression over many timepoints during the course of a natural developmental process. We discuss our findings from a highly parallel gene expression dataset of meiotic differentiation, and those of others, to illustrate how leveraging these features can provide new and surprising insight into fundamental mechanisms of gene regulation.

  4. A method for generating new datasets based on copy number for cancer analysis.

    PubMed

    Kim, Shinuk; Kon, Mark; Kang, Hyunsik

    2015-01-01

    New data sources for the analysis of cancer data are rapidly supplementing the large number of gene-expression markers used for current methods of analysis. Significant among these new sources are copy number variation (CNV) datasets, which typically enumerate several hundred thousand CNVs distributed throughout the genome. Several useful algorithms allow systems-level analyses of such datasets. However, these rich data sources have not yet been analyzed as deeply as gene-expression data. To address this issue, the extensive toolsets used for analyzing expression data in cancerous and noncancerous tissue (e.g., gene set enrichment analysis and phenotype prediction) could be redirected to extract a great deal of predictive information from CNV data, in particular those derived from cancers. Here we present a software package capable of preprocessing standard Agilent copy number datasets into a form to which essentially all expression analysis tools can be applied. We illustrate the use of this toolset in predicting the survival time of patients with ovarian cancer or glioblastoma multiforme and also provide an analysis of gene- and pathway-level deletions in these two types of cancer.

  5. Investigating the Effects of Imputation Methods for Modelling Gene Networks Using a Dynamic Bayesian Network from Gene Expression Data

    PubMed Central

    CHAI, Lian En; LAW, Chow Kuan; MOHAMAD, Mohd Saberi; CHONG, Chuii Khim; CHOON, Yee Wen; DERIS, Safaai; ILLIAS, Rosli Md

    2014-01-01

    Background: Gene expression data often contain missing expression values. Therefore, several imputation methods have been applied to solve the missing values, which include k-nearest neighbour (kNN), local least squares (LLS), and Bayesian principal component analysis (BPCA). However, the effects of these imputation methods on the modelling of gene regulatory networks from gene expression data have rarely been investigated and analysed using a dynamic Bayesian network (DBN). Methods: In the present study, we separately imputed datasets of the Escherichia coli S.O.S. DNA repair pathway and the Saccharomyces cerevisiae cell cycle pathway with kNN, LLS, and BPCA, and subsequently used these to generate gene regulatory networks (GRNs) using a discrete DBN. We made comparisons on the basis of previous studies in order to select the gene network with the least error. Results: We found that BPCA and LLS performed better on larger networks (based on the S. cerevisiae dataset), whereas kNN performed better on smaller networks (based on the E. coli dataset). Conclusion: The results suggest that the performance of each imputation method is dependent on the size of the dataset, and this subsequently affects the modelling of the resultant GRNs using a DBN. In addition, on the basis of these results, a DBN has the capacity to discover potential edges, as well as display interactions, between genes. PMID:24876803
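
    To make the kNN imputation mentioned above concrete, here is a minimal pure-Python sketch. It assumes a genes-by-samples matrix with None marking missing values, a Euclidean distance computed over co-observed samples, and an unweighted mean of the k nearest genes; these choices, the function name and the toy data are illustrative assumptions, not the procedure used in the study.

```python
import math


def knn_impute(matrix, k=2):
    """Fill None entries in a genes x samples matrix from the k nearest genes."""

    def distance(a, b):
        # Root-mean-square difference over samples observed in both rows
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [row[:] for row in matrix]  # leave the input untouched
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbours: other genes with an observed value in sample j
            candidates = [
                (distance(row, other), other[j])
                for m, other in enumerate(matrix)
                if m != i and other[j] is not None
            ]
            candidates = [c for c in candidates if c[0] != math.inf]
            candidates.sort(key=lambda c: c[0])
            nearest = candidates[:k]
            if nearest:  # if no usable neighbour exists, the gap is left as None
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


# Toy expression matrix (hypothetical values): gene 2 lacks its third sample
expr = [
    [1.0, 2.0, 3.0],
    [1.0, 2.0, None],
    [1.1, 2.1, 3.2],
    [5.0, 6.0, 9.0],
]
print(knn_impute(expr, k=2)[1])
```

    In this toy case the two nearest genes are the first and third rows, so the gap is filled with the mean of their third-sample values. Distances are taken from the original matrix, so earlier imputations do not influence later ones.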

  6. Evaluation of Different Normalization and Analysis Procedures for Illumina Gene Expression Microarray Data Involving Small Changes

    PubMed Central

    Johnstone, Daniel M.; Riveros, Carlos; Heidari, Moones; Graham, Ross M.; Trinder, Debbie; Berretta, Regina; Olynyk, John K.; Scott, Rodney J.; Moscato, Pablo; Milward, Elizabeth A.

    2013-01-01

    While Illumina microarrays can be used successfully for detecting small gene expression changes due to their high degree of technical replicability, there is little information on how different normalization and differential expression analysis strategies affect outcomes. To evaluate this, we assessed concordance across gene lists generated by applying different combinations of normalization strategy and analytical approach to two Illumina datasets with modest expression changes. In addition to using traditional statistical approaches, we also tested an approach based on combinatorial optimization. We found that the choice of both normalization strategy and analytical approach considerably affected outcomes, in some cases leading to substantial differences in gene lists and subsequent pathway analysis results. Our findings suggest that important biological phenomena may be overlooked when there is a routine practice of using only one approach to investigate all microarray datasets. Analytical artefacts of this kind are likely to be especially relevant for datasets involving small fold changes, where inherent technical variation—if not adequately minimized by effective normalization—may overshadow true biological variation. This report provides some basic guidelines for optimizing outcomes when working with Illumina datasets involving small expression changes. PMID:27605185
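
    One simple way to quantify concordance between gene lists produced by different normalization and analysis combinations is the Jaccard index. The sketch below is an assumption for illustration only: the abstract does not state which concordance measure was used, and the pipeline names and gene identifiers are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pairwise_concordance(gene_lists):
    """Return the Jaccard index for every pair of named gene lists."""
    return {
        (n1, n2): jaccard(gene_lists[n1], gene_lists[n2])
        for n1, n2 in combinations(sorted(gene_lists), 2)
    }


# Hypothetical gene lists from three analysis pipelines
gene_lists = {
    "pipeline_A": ["g1", "g2", "g3"],
    "pipeline_B": ["g2", "g3", "g4"],
    "pipeline_C": ["g7", "g8"],
}
for pair, score in pairwise_concordance(gene_lists).items():
    print(pair, round(score, 2))
```

    A low index between two pipelines applied to the same dataset would flag the kind of analysis-dependent gene list that the abstract warns about.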

  7. Functional Analyses of NSF1 in Wine Yeast Using Interconnected Correlation Clustering and Molecular Analyses

    PubMed Central

    Bessonov, Kyrylo; Walkey, Christopher J.; Shelp, Barry J.; van Vuuren, Hennie J. J.; Chiu, David; van der Merwe, George

    2013-01-01

    Analyzing time-course expression data captured in microarray datasets is a complex undertaking as the vast and complex data space is represented by a relatively low number of samples as compared to thousands of available genes. Here, we developed the Interdependent Correlation Clustering (ICC) method to analyze relationships that exist among genes conditioned on the expression of a specific target gene in microarray data. Based on Correlation Clustering, the ICC method analyzes a large set of correlation values related to gene expression profiles extracted from given microarray datasets. ICC can be applied to any microarray dataset and any target gene. We applied this method to microarray data generated from wine fermentations and selected NSF1, which encodes a C2H2 zinc finger-type transcription factor, as the target gene. The validity of the method was verified by accurate identifications of the previously known functional roles of NSF1. In addition, we identified and verified potential new functions for this gene; specifically, NSF1 is a negative regulator for the expression of sulfur metabolism genes, the nuclear localization of Nsf1 protein (Nsf1p) is controlled in a sulfur-dependent manner, and the transcription of NSF1 is regulated by Met4p, an important transcriptional activator of sulfur metabolism genes. The inter-disciplinary approach adopted here highlighted the accuracy and relevancy of the ICC method in mining for novel gene functions using complex microarray datasets with a limited number of samples. PMID:24130853

  8. Statistical Test of Expression Pattern (STEPath): a new strategy to integrate gene expression data with genomic information in individual and meta-analysis studies.

    PubMed

    Martini, Paolo; Risso, Davide; Sales, Gabriele; Romualdi, Chiara; Lanfranchi, Gerolamo; Cagnin, Stefano

    2011-04-11

    In the last decades, microarray technology has spread, leading to a dramatic increase of publicly available datasets. The first statistical tools developed were focused on the identification of significant differentially expressed genes. Later, researchers moved toward the systematic integration of gene expression profiles with additional biological information, such as chromosomal location, ontological annotations or sequence features. The analysis of gene expression linked to physical location of genes on chromosomes allows the identification of transcriptionally imbalanced regions, while Gene Set Analysis focuses on the detection of coordinated changes in transcriptional levels among sets of biologically related genes. In this field, meta-analysis offers the possibility to compare different studies addressing the same biological question to fully exploit public gene expression datasets. We describe STEPath, a method that starts from gene expression profiles and integrates the analysis of imbalanced regions as an a priori step before performing gene set analysis. The application of STEPath in individual studies produced gene set scores weighted by chromosomal activation. As a final step, we propose a way to compare these scores across different studies (meta-analysis) on related biological issues. One complication with meta-analysis is batch effects, which occur because molecular measurements are affected by laboratory conditions, reagent lots and personnel differences. Major problems occur when batch effects are correlated with an outcome of interest and lead to incorrect conclusions. We evaluated the power of combining chromosome mapping and gene set enrichment analysis, performing the analysis on a dataset of leukaemia (example of individual study) and on a dataset of skeletal muscle diseases (meta-analysis approach). In leukaemia, we identified the Hox gene set, a gene set closely related to the pathology that other algorithms of gene set analysis do not identify, while the meta-analysis approach on muscular disease discriminates between related pathologies and correlates similar ones from different studies. STEPath is a new method that integrates gene expression profiles, genomic co-expressed regions and the information about the biological function of genes. The usage of the STEPath-computed gene set scores overcomes batch effects in the meta-analysis approaches allowing the direct comparison of different pathologies and different studies on a gene set activation level.

  9. An improved Pearson's correlation proximity-based hierarchical clustering for mining biological association between genes.

    PubMed

    Booma, P M; Prabhakaran, S; Dhanalakshmi, R

    2014-01-01

    Microarray gene expression datasets have attracted great attention among molecular biologists, statisticians, and computer scientists. Data mining that extracts the hidden and usual information from datasets fails to identify the most significant biological associations between genes. A search made with heuristic for standard biological process measures only the gene expression level, threshold, and response time. Heuristic search identifies and mines the best biological solution, but the association process was not efficiently addressed. To monitor a higher rate of expression levels between genes, a hierarchical clustering model was proposed, where the biological association between genes is measured simultaneously using a proximity measure of improved Pearson's correlation (PCPHC). Additionally, the Seed Augment algorithm adopts average linkage methods on rows and columns in order to expand a seed PCPHC model into a maximal global PCPHC (GL-PCPHC) model and to identify association between the clusters. Moreover, GL-PCPHC applies a pattern growing method to mine the PCPHC patterns. Compared to existing gene expression analysis, the PCPHC model achieves better performance. Experimental evaluations are conducted for the GL-PCPHC model with standard benchmark gene expression datasets extracted from the UCI repository and GenBank database in terms of execution time, size of pattern, significance level, biological association efficiency, and pattern quality.
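
    For orientation, the general idea of correlation-based hierarchical clustering with average linkage can be sketched in pure Python as below. This uses the standard Pearson correlation with distance 1 - r, not the authors' improved proximity measure or their Seed Augment algorithm; the function names and toy profiles are hypothetical, and each profile is assumed to be non-constant so that the correlation is defined.

```python
import math


def pearson(x, y):
    """Standard Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_linkage(profiles, n_clusters):
    """Agglomerative clustering of genes using distance 1 - r and average linkage."""

    def dist(g, h):
        return 1.0 - pearson(profiles[g], profiles[h])

    clusters = [[g] for g in profiles]  # start with one cluster per gene
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average linkage: mean distance over all cross-cluster gene pairs
                total = sum(dist(g, h) for g in clusters[i] for h in clusters[j])
                d = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters


# Toy profiles (hypothetical): a and b rise across samples, c and d fall
profiles = {
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8.5],
    "c": [4, 3, 2, 1],
    "d": [8, 6, 4.5, 2],
}
print(average_linkage(profiles, n_clusters=2))
```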

  10. An Improved Pearson's Correlation Proximity-Based Hierarchical Clustering for Mining Biological Association between Genes

    PubMed Central

    Booma, P. M.; Prabhakaran, S.; Dhanalakshmi, R.

    2014-01-01

    Microarray gene expression datasets have attracted great attention among molecular biologists, statisticians, and computer scientists. Data mining that extracts the hidden and usual information from datasets fails to identify the most significant biological associations between genes. A search made with heuristic for standard biological process measures only the gene expression level, threshold, and response time. Heuristic search identifies and mines the best biological solution, but the association process was not efficiently addressed. To monitor a higher rate of expression levels between genes, a hierarchical clustering model was proposed, where the biological association between genes is measured simultaneously using a proximity measure of improved Pearson's correlation (PCPHC). Additionally, the Seed Augment algorithm adopts average linkage methods on rows and columns in order to expand a seed PCPHC model into a maximal global PCPHC (GL-PCPHC) model and to identify association between the clusters. Moreover, GL-PCPHC applies a pattern growing method to mine the PCPHC patterns. Compared to existing gene expression analysis, the PCPHC model achieves better performance. Experimental evaluations are conducted for the GL-PCPHC model with standard benchmark gene expression datasets extracted from the UCI repository and GenBank database in terms of execution time, size of pattern, significance level, biological association efficiency, and pattern quality. PMID:25136661

  11. Gene-Expression Signature Predicts Postoperative Recurrence in Stage I Non-Small Cell Lung Cancer Patients

    PubMed Central

    Lu, Yan; Wang, Liang; Liu, Pengyuan; Yang, Ping; You, Ming

    2012-01-01

    About 30% of stage I non-small cell lung cancer (NSCLC) patients undergoing resection will recur. Robust prognostic markers are required to better manage therapy options. The purpose of this study is to develop and validate a novel gene-expression signature that can predict tumor recurrence of stage I NSCLC patients. Cox proportional hazards regression analysis was performed to identify recurrence-related genes and a partial Cox regression model was used to generate a gene signature of recurrence in the training dataset (142 stage I lung adenocarcinomas without adjunctive therapy from the Director's Challenge Consortium). Four independent validation datasets, including GSE5843, GSE8894, and two other datasets provided by Mayo Clinic and Washington University, were used to assess the prediction accuracy by calculating the correlation between risk score estimated from gene expression and real recurrence-free survival time and AUC of time-dependent ROC analysis. Pathway-based survival analyses were also performed. 104 probesets correlated with recurrence in the training dataset. They are enriched in cell adhesion, apoptosis and regulation of cell proliferation. A 51-gene expression signature was identified to distinguish patients likely to develop tumor recurrence (Dxy = −0.83, P<1e-16) and this signature was validated in four independent datasets with AUC >85%. Multiple pathways including leukocyte transendothelial migration and cell adhesion were highly correlated with recurrence-free survival. The gene signature is highly predictive of recurrence in stage I NSCLC patients, which has important prognostic and therapeutic implications for the future management of these patients. PMID:22292069

  12. Identifying spatially similar gene expression patterns in early stage fruit fly embryo images: binary feature versus invariant moment digital representations

    PubMed Central

    Gurunathan, Rajalakshmi; Van Emden, Bernard; Panchanathan, Sethuraman; Kumar, Sudhir

    2004-01-01

    Background: Modern developmental biology relies heavily on the analysis of embryonic gene expression patterns. Investigators manually inspect hundreds or thousands of expression patterns to identify those that are spatially similar and to ultimately infer potential gene interactions. However, the rapid accumulation of gene expression pattern data over the last two decades, facilitated by high-throughput techniques, has produced a need for the development of efficient approaches for direct comparison of images, rather than their textual descriptions, to identify spatially similar expression patterns. Results: The effectiveness of the Binary Feature Vector (BFV) and Invariant Moment Vector (IMV) based digital representations of the gene expression patterns in finding biologically meaningful patterns was compared for a small (226 images) and a large (1819 images) dataset. For each dataset, an ordered list of images, with respect to a query image, was generated to identify overlapping and similar gene expression patterns, in a manner comparable to what a developmental biologist might do. The results showed that the BFV representation consistently outperforms the IMV representation in finding biologically meaningful matches when spatial overlap of the gene expression pattern and the genes involved are considered. Furthermore, we explored the value of conducting image-content based searches in a dataset where individual expression components (or domains) of multi-domain expression patterns were also included separately. We found that this technique improves performance of both IMV and BFV based searches. Conclusions: We conclude that the BFV representation consistently produces a more extensive and better list of biologically useful patterns than the IMV representation. The high quality of results obtained scales well as the search database becomes larger, which encourages efforts to build automated image query and retrieval systems for spatial gene expression patterns. PMID:15603586

  13. Integrative multi-platform meta-analysis of gene expression profiles in pancreatic ductal adenocarcinoma patients for identifying novel diagnostic biomarkers.

    PubMed

    Irigoyen, Antonio; Jimenez-Luna, Cristina; Benavides, Manuel; Caba, Octavio; Gallego, Javier; Ortuño, Francisco Manuel; Guillen-Ponce, Carmen; Rojas, Ignacio; Aranda, Enrique; Torres, Carolina; Prados, Jose

    2018-01-01

    Applying differentially expressed genes (DEGs) to identify feasible biomarkers in diseases can be a hard task when working with heterogeneous datasets. Expression data are strongly influenced by technology, sample preparation processes, and/or labeling methods. The proliferation of different microarray platforms for measuring gene expression increases the need to develop models able to compare their results, especially when different technologies can lead to signal values that vary greatly. Integrative meta-analysis can significantly improve the reliability and robustness of DEG detection. The objective of this work was to develop an integrative approach for identifying potential cancer biomarkers by integrating gene expression data from two different platforms. Pancreatic ductal adenocarcinoma (PDAC), where there is an urgent need to find new biomarkers due to its late diagnosis, is an ideal candidate for testing this technology. Expression data from two different datasets, namely Affymetrix and Illumina (18 and 36 PDAC patients, respectively), as well as from 18 healthy controls, were used for this study. A meta-analysis based on an empirical Bayesian methodology (ComBat) was then proposed to integrate these datasets. DEGs were finally identified from the integrated data by using the statistical programming language R. After our integrative meta-analysis, 5 genes were commonly identified within the individual analyses of the independent datasets. In addition, 28 novel genes that were not reported by the individual analyses ('gained' genes) were discovered. Several of these gained genes have already been related to other gastroenterological tumors. The proposed integrative meta-analysis has revealed novel DEGs that may play an important role in PDAC and could be potential biomarkers for diagnosing the disease.

  14. Integration of Steady-State and Temporal Gene Expression Data for the Inference of Gene Regulatory Networks

    PubMed Central

    Wang, Yi Kan; Hurley, Daniel G.; Schnell, Santiago; Print, Cristin G.; Crampin, Edmund J.

    2013-01-01

    We develop a new regression algorithm, cMIKANA, for inference of gene regulatory networks from combinations of steady-state and time-series gene expression data. Using simulated gene expression datasets to assess the accuracy of reconstructing gene regulatory networks, we show that steady-state and time-series data sets can successfully be combined to identify gene regulatory interactions using the new algorithm. Inferring gene networks from combined data sets was found to be advantageous when using noisy measurements collected with either lower sampling rates or a limited number of experimental replicates. We illustrate our method by applying it to a microarray gene expression dataset from human umbilical vein endothelial cells (HUVECs) which combines time series data from treatment with growth factor TNF and steady state data from siRNA knockdown treatments. Our results suggest that the combination of steady-state and time-series datasets may provide better prediction of RNA-to-RNA interactions, and may also reveal biological features that cannot be identified from dynamic or steady state information alone. Finally, we consider the experimental design of genomics experiments for gene regulatory network inference and show that network inference can be improved by incorporating steady-state measurements with time-series data. PMID:23967277

  15. A curated transcriptomic dataset collection relevant to embryonic development associated with in vitro fertilization in healthy individuals and patients with polycystic ovary syndrome.

    PubMed

    Mackeh, Rafah; Boughorbel, Sabri; Chaussabel, Damien; Kino, Tomoshige

    2017-01-01

    The collection of large-scale datasets available in public repositories is rapidly growing and providing opportunities to identify and fill gaps in different fields of biomedical research. However, users of these datasets should be able to selectively browse datasets related to their field of interest. Here we made available a collection of transcriptome datasets related to human follicular cells from normal individuals or patients with polycystic ovary syndrome, in the process of their development, during in vitro fertilization. After RNA-seq dataset exclusion and careful selection based on study description and sample information, 12 datasets, encompassing a total of 85 unique transcriptome profiles, were identified in NCBI Gene Expression Omnibus and uploaded to the Gene Expression Browser (GXB), a web application specifically designed for interactive query and visualization of integrated large-scale data. Once the datasets were annotated in GXB, multiple sample groupings were made in order to create rank lists to allow easy data interpretation and comparison. The GXB tool also allows users to browse a single gene across multiple projects to evaluate its expression profiles in multiple biological systems/conditions in web-based customized graphical views. The curated dataset is accessible at the following link: http://ivf.gxbsidra.org/dm3/landing.gsp.

  16. A curated transcriptomic dataset collection relevant to embryonic development associated with in vitro fertilization in healthy individuals and patients with polycystic ovary syndrome

    PubMed Central

    Mackeh, Rafah; Boughorbel, Sabri; Chaussabel, Damien; Kino, Tomoshige

    2017-01-01

    The collection of large-scale datasets available in public repositories is rapidly growing and providing opportunities to identify and fill gaps in different fields of biomedical research. However, users of these datasets should be able to selectively browse datasets related to their field of interest. Here we make available a collection of transcriptome datasets related to human follicular cells from normal individuals or patients with polycystic ovary syndrome, in the process of their development, during in vitro fertilization. After RNA-seq dataset exclusion and careful selection based on study description and sample information, 12 datasets, encompassing a total of 85 unique transcriptome profiles, were identified in NCBI Gene Expression Omnibus and uploaded to the Gene Expression Browser (GXB), a web application specifically designed for interactive query and visualization of integrated large-scale data. Once the datasets were annotated in GXB, samples were grouped in multiple ways to create rank lists that allow easy data interpretation and comparison. The GXB tool also allows users to browse a single gene across multiple projects and evaluate its expression profiles in multiple biological systems/conditions through customized web-based graphical views. The curated dataset is accessible at the following link: http://ivf.gxbsidra.org/dm3/landing.gsp. PMID:28413616

  17. Molecular Subtypes of Glioblastoma Are Relevant to Lower Grade Glioma

    PubMed Central

    Sloan, Andrew E.; Chen, Yanwen; Brat, Daniel J.; O’Neill, Brian Patrick; de Groot, John; Yust-Katz, Shlomit; Yung, Wai-Kwan Alfred; Cohen, Mark L.; Aldape, Kenneth D.; Rosenfeld, Steven; Verhaak, Roeland G. W.; Barnholtz-Sloan, Jill S.

    2014-01-01

    Background Gliomas are the most common primary malignant brain tumors in adults with great heterogeneity in histopathology and clinical course. The intent was to evaluate the relevance of known glioblastoma (GBM) expression and methylation based subtypes to grade II and III gliomas (i.e., lower grade gliomas). Methods Gene expression array, single nucleotide polymorphism (SNP) array and clinical data were obtained for 228 GBMs and 176 grade II/III gliomas (GII/III) from the publicly available Rembrandt dataset. Two additional datasets with IDH1 mutation status were utilized as validation datasets (one publicly available dataset and one newly generated dataset from MD Anderson). Unsupervised clustering was performed and compared to gene expression subtypes assigned using the Verhaak et al. 840-gene classifier. The glioma-CpG Island Methylator Phenotype (G-CIMP) was assigned using prediction models by Fine et al. Results Unsupervised clustering by gene expression aligned with the Verhaak 840-gene subtype group assignments. GII/IIIs were preferentially assigned to the proneural subtype with IDH1 mutation and G-CIMP. GBMs were evenly distributed among the four subtypes. Proneural, IDH1 mutant, G-CIMP GII/IIIs had significantly better survival than other molecular subtypes. Only 6% of GBMs were proneural and had either IDH1 mutation or G-CIMP but these tumors had significantly better survival than other GBMs. Copy number changes in chromosomes 1p and 19q were associated with GII/IIIs, while these changes in CDKN2A, PTEN and EGFR were more commonly associated with GBMs. Conclusions GBM gene-expression and methylation based subtypes are relevant for GII/IIIs and associate with overall survival differences. A better understanding of the association between these subtypes and GII/IIIs could further knowledge regarding prognosis and mechanisms of glioma progression. PMID:24614622

  18. GEOGLE: context mining tool for the correlation between gene expression and the phenotypic distinction.

    PubMed

    Yu, Yao; Tu, Kang; Zheng, Siyuan; Li, Yun; Ding, Guohui; Ping, Jie; Hao, Pei; Li, Yixue

    2009-08-25

    In the post-genomic era, the development of high-throughput gene expression detection technology provides huge amounts of experimental data, which challenges the traditional pipelines for data processing and analyzing in scientific researches. In our work, we integrated gene expression information from Gene Expression Omnibus (GEO), biomedical ontology from Medical Subject Headings (MeSH) and signaling pathway knowledge from sigPathway entries to develop a context mining tool for gene expression analysis - GEOGLE. GEOGLE offers a rapid and convenient way for searching relevant experimental datasets, pathways and biological terms according to multiple types of queries: including biomedical vocabularies, GDS IDs, gene IDs, pathway names and signature list. Moreover, GEOGLE summarizes the signature genes from a subset of GDSes and estimates the correlation between gene expression and the phenotypic distinction with an integrated p value. This approach performing global searching of expression data may expand the traditional way of collecting heterogeneous gene expression experiment data. GEOGLE is a novel tool that provides researchers a quantitative way to understand the correlation between gene expression and phenotypic distinction through meta-analysis of gene expression datasets from different experiments, as well as the biological meaning behind. The web site and user guide of GEOGLE are available at: http://omics.biosino.org:14000/kweb/workflow.jsp?id=00020.

  19. Altered Pathway Analyzer: A gene expression dataset analysis tool for identification and prioritization of differentially regulated and network rewired pathways

    PubMed Central

    Kaushik, Abhinav; Ali, Shakir; Gupta, Dinesh

    2017-01-01

    Gene connection rewiring is an essential feature of gene network dynamics. Apart from its normal functional role, it may also lead to dysregulated functional states by disturbing pathway homeostasis. Very few computational tools measure rewiring within gene co-expression and its corresponding regulatory networks in order to identify and prioritize altered pathways which may or may not be differentially regulated. We have developed Altered Pathway Analyzer (APA), a microarray dataset analysis tool for identification and prioritization of altered pathways, including those which are differentially regulated by TFs, by quantifying rewired sub-network topology. Moreover, APA also helps in re-prioritization of APA shortlisted altered pathways enriched with context-specific genes. We performed APA analysis of simulated datasets and p53 status NCI-60 cell line microarray data to demonstrate potential of APA for identification of several case-specific altered pathways. APA analysis reveals several altered pathways not detected by other tools evaluated by us. APA analysis of unrelated prostate cancer datasets identifies sample-specific as well as conserved altered biological processes, mainly associated with lipid metabolism, cellular differentiation and proliferation. APA is designed as a cross platform tool which may be transparently customized to perform pathway analysis in different gene expression datasets. APA is freely available at http://bioinfo.icgeb.res.in/APA. PMID:28084397

  20. Phosphoproteome and transcriptome analyses of ErbB ligand-stimulated MCF-7 cells.

    PubMed

    Nagashima, Takeshi; Oyama, Masaaki; Kozuka-Hata, Hiroko; Yumoto, Noriko; Sakaki, Yoshiyuki; Hatakeyama, Mariko

    2008-01-01

    Cellular signal transduction pathways and gene expression are tightly regulated to accommodate changes in response to physiological environments. In the current study, molecules were identified that are activated as a result of intracellular signaling and immediately expressed as mRNA in MCF-7 breast cancer cells shortly after stimulation with the ErbB receptor ligands epidermal growth factor (EGF) or heregulin (HRG). For the identification of tyrosine-phosphorylated proteins and expressed genes, a SILAC (stable isotopic labeling using amino acids in cell culture) method and Affymetrix gene expression array system, respectively, were used. Unexpectedly, the overlap between the genes appearing in the two experimental datasets was very low for HRG (43 hits in the proteome data, 1,655 in the transcriptome data, and 5 hits common to both datasets), while no overlapping gene was detected for EGF (15 hits in the proteome data, 211 hits in the transcriptome data, and no hits common to both datasets). The HRG overlapping genes included ERBB2, NEDD9, MAPK3, JUP and EPHA2. Biological pathway analysis indicated that HRG-stimulated molecular activation is significantly related to cancer pathways including bladder cancer, chronic myeloid leukemia and pancreatic cancer (p < 0.05). The proteome datasets of EGF and HRG contain molecules that are related to Axon guidance, ErbB signaling and VEGF signaling at a high rate.
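
    The overlap counts quoted in this abstract can be condensed into one similarity figure. The following is a minimal sketch and not part of the original study: it computes the Jaccard index from the reported hit counts. The function name `jaccard_from_counts` is hypothetical, and the Jaccard index is an illustrative choice of overlap measure that the authors do not mention.

```python
def jaccard_from_counts(n_a: int, n_b: int, n_common: int) -> float:
    """Jaccard index |A & B| / |A | B|, computed from set sizes alone."""
    if min(n_a, n_b, n_common) < 0:
        raise ValueError("counts must be non-negative")
    if n_common > min(n_a, n_b):
        raise ValueError("overlap cannot exceed the smaller set")
    union = n_a + n_b - n_common
    # Two empty sets have no defined overlap; report 0.0 by convention.
    return n_common / union if union else 0.0


# Hit counts quoted in the abstract: (proteome, transcriptome, common).
hrg = jaccard_from_counts(43, 1655, 5)
egf = jaccard_from_counts(15, 211, 0)
print(f"HRG Jaccard index: {hrg:.4f}")
print(f"EGF Jaccard index: {egf:.4f}")
```

    Working from counts rather than gene lists keeps the sketch tied to numbers that actually appear in the abstract; with the full hit lists, the same quantity would follow from ordinary set intersection and union.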

  1. An Integrated Bioinformatics Approach Identifies Elevated Cyclin E2 Expression and E2F Activity as Distinct Features of Tamoxifen Resistant Breast Tumors

    PubMed Central

    Huang, Lei; Zhao, Shuangping; Frasor, Jonna M.; Dai, Yang

    2011-01-01

    Approximately half of estrogen receptor (ER) positive breast tumors will fail to respond to endocrine therapy. Here we used an integrative bioinformatics approach to analyze three gene expression profiling data sets from breast tumors in an attempt to uncover underlying mechanisms contributing to the development of resistance and potential therapeutic strategies to counteract these mechanisms. Genes that are differentially expressed in tamoxifen resistant vs. sensitive breast tumors were identified from three different publicly available microarray datasets. These differentially expressed (DE) genes were analyzed using gene function and gene set enrichment and examined in intrinsic subtypes of breast tumors. The Connectivity Map analysis was utilized to link gene expression profiles of tamoxifen resistant tumors to small molecules and validation studies were carried out in a tamoxifen resistant cell line. Despite little overlap in genes that are differentially expressed in tamoxifen resistant vs. sensitive tumors, a high degree of functional similarity was observed among the three datasets. Tamoxifen resistant tumors displayed enriched expression of genes related to cell cycle and proliferation, as well as elevated activity of E2F transcription factors, and were highly correlated with a Luminal intrinsic subtype. A number of small molecules, including phenothiazines, were found that induced a gene signature in breast cancer cell lines opposite to that found in tamoxifen resistant vs. sensitive tumors and the ability of phenothiazines to down-regulate cyclin E2 and inhibit proliferation of tamoxifen resistant breast cancer cells was validated. Our findings demonstrate that an integrated bioinformatics approach to analyze gene expression profiles from multiple breast tumor datasets can identify important biological pathways and potentially novel therapeutic options for tamoxifen-resistant breast cancers. PMID:21789246

  2. pySAPC, a python package for sparse affinity propagation clustering: Application to odontogenesis whole genome time series gene-expression data.

    PubMed

    Cao, Huojun; Amendt, Brad A

    2016-11-01

    Developmental dental anomalies are common forms of congenital defects. The molecular mechanisms of dental anomalies are poorly understood. Systematic approaches such as clustering genes based on similar expression patterns could identify novel genes involved in dental anomalies and provide a framework for understanding molecular regulatory mechanisms of these genes during tooth development (odontogenesis). A Python package (pySAPC) implementing a sparse affinity propagation clustering algorithm for large datasets was developed. Whole-genome pairwise similarity was calculated from expression patterns across 45 microarrays covering several stages of odontogenesis. pySAPC identified 743 gene clusters based on expression pattern similarity during mouse tooth development. Three clusters are significantly enriched for genes associated with dental anomalies (with FDR <0.1). The three clusters of genes have distinct expression patterns during odontogenesis. Clustering genes based on similar expression profiles recovered several known regulatory relationships for genes involved in odontogenesis, as well as many novel genes that may be involved with the same genetic pathways as genes that have already been shown to contribute to dental defects. By using a sparse similarity matrix, pySAPC uses much less memory and CPU time than the original affinity propagation program, which uses a full similarity matrix. This Python package will be useful for many applications where datasets are too large to use a full similarity matrix. This article is part of a Special Issue entitled "System Genetics" Guest Editor: Dr. Yudong Cai and Dr. Tao Huang. Copyright © 2016. Published by Elsevier B.V.

  3. A multi-strategy approach to informative gene identification from gene expression data.

    PubMed

    Liu, Ziying; Phan, Sieu; Famili, Fazel; Pan, Youlian; Lenferink, Anne E G; Cantin, Christiane; Collins, Catherine; O'Connor-McCourt, Maureen D

    2010-02-01

    An unsupervised multi-strategy approach has been developed to identify informative genes from high throughput genomic data. Several statistical methods have been used in the field to identify differentially expressed genes. Since different methods generate different lists of genes, it is very challenging to determine the most reliable gene list and the appropriate method. This paper presents a multi-strategy method, in which a combination of several data analysis techniques is applied to a given dataset and a confidence measure is established to select genes from the gene lists generated by these techniques to form the core of our final selection. The remainder of the genes, which form the peripheral region, are subject to exclusion from or inclusion in the final selection. This paper demonstrates this methodology through its application to an in-house cancer genomics dataset and a public dataset. The results indicate that our method provides a more reliable list of genes, which are validated using biological knowledge, biological experiments, and literature search. We further evaluated our multi-strategy method by consolidating two pairs of independent datasets, each pair covering the same disease but generated by different labs using different platforms. The results showed that our method has produced far better results.
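
    The core/peripheral split described in this abstract can be illustrated with a simple voting scheme. This is a minimal sketch under stated assumptions, not the authors' method: the abstract does not define its confidence measure, so the code assumes confidence is the fraction of methods that selected a gene. The function name `split_core_peripheral`, the `min_confidence` threshold and the gene names are all hypothetical.

```python
from collections import Counter


def split_core_peripheral(gene_lists, min_confidence=1.0):
    """Split genes into core and peripheral sets by agreement across methods.

    confidence[g] is the fraction of methods that selected gene g. This is an
    assumed measure; the abstract does not say how confidence is computed.
    """
    if not gene_lists:
        raise ValueError("need at least one gene list")
    # Count each gene at most once per method.
    votes = Counter(gene for genes in gene_lists for gene in set(genes))
    n_methods = len(gene_lists)
    confidence = {gene: count / n_methods for gene, count in votes.items()}
    core = {gene for gene, c in confidence.items() if c >= min_confidence}
    peripheral = set(confidence) - core
    return core, peripheral, confidence


# Hypothetical output of three selection methods (placeholder gene names).
lists = [["g1", "g2", "g3"], ["g1", "g3"], ["g1", "g4"]]
core, peripheral, confidence = split_core_peripheral(lists)
print("core:", sorted(core))
print("peripheral:", sorted(peripheral))
```

    Lowering `min_confidence` moves peripheral genes into the core, which is one way to model the inclusion or exclusion step the abstract mentions.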

  4. A Pilot Proteogenomic Study with Data Integration Identifies MCT1 and GLUT1 as Prognostic Markers in Lung Adenocarcinoma.

    PubMed

    Stewart, Paul A; Parapatics, Katja; Welsh, Eric A; Müller, André C; Cao, Haoyun; Fang, Bin; Koomen, John M; Eschrich, Steven A; Bennett, Keiryn L; Haura, Eric B

    2015-01-01

    We performed a pilot proteogenomic study to compare lung adenocarcinoma to lung squamous cell carcinoma using quantitative proteomics (6-plex TMT) combined with a customized Affymetrix GeneChip. Using MaxQuant software, we identified 51,001 unique peptides that mapped to 7,241 unique proteins and from these identified 6,373 genes with matching protein expression for further analysis. We found a minor correlation between gene expression and protein expression; both datasets were able to independently recapitulate known differences between the adenocarcinoma and squamous cell carcinoma subtypes. We found 565 proteins and 629 genes to be differentially expressed between adenocarcinoma and squamous cell carcinoma, with 113 of these consistently differentially expressed at both the gene and protein levels. We then compared our results to published adenocarcinoma versus squamous cell carcinoma proteomic data that we also processed with MaxQuant. We selected two proteins consistently overexpressed in squamous cell carcinoma in all studies, MCT1 (SLC16A1) and GLUT1 (SLC2A1), for further investigation. We found differential expression of these same proteins at the gene level in our study as well as in other public gene expression datasets. These findings combined with survival analysis of public datasets suggest that MCT1 and GLUT1 may be potential prognostic markers in adenocarcinoma and druggable targets in squamous cell carcinoma. Data are available via ProteomeXchange with identifier PXD002622.

  5. VTCdb: a gene co-expression database for the crop species Vitis vinifera (grapevine).

    PubMed

    Wong, Darren C J; Sweetman, Crystal; Drew, Damian P; Ford, Christopher M

    2013-12-16

    Gene expression datasets in model plants such as Arabidopsis have contributed to our understanding of gene function and how a single underlying biological process can be governed by a diverse network of genes. The accumulation of publicly available microarray data encompassing a wide range of biological and environmental conditions has enabled the development of additional capabilities including gene co-expression analysis (GCA). GCA is based on the understanding that genes encoding proteins involved in similar and/or related biological processes may exhibit comparable expression patterns over a range of experimental conditions, developmental stages and tissues. We present an open access database for the investigation of gene co-expression networks within the cultivated grapevine, Vitis vinifera. The new gene co-expression database, VTCdb (http://vtcdb.adelaide.edu.au/Home.aspx), offers an online platform for transcriptional regulatory inference in the cultivated grapevine. Using condition-independent and condition-dependent approaches, grapevine co-expression networks were constructed using the latest publicly available microarray datasets from diverse experimental series, utilising the Affymetrix Vitis vinifera GeneChip (16 K) and the NimbleGen Grape Whole-genome microarray chip (29 K), thus making it possible to profile approximately 29,000 genes (95% of the predicted grapevine transcriptome). Applications available with the online platform include the use of gene names, probesets, modules or biological processes to query the co-expression networks, with the option to choose between Affymetrix or Nimblegen datasets and between multiple co-expression measures. Alternatively, the user can browse existing network modules using interactive network visualisation and analysis via CytoscapeWeb. 
    To demonstrate the utility of the database, we present examples from three fundamental biological processes (berry development, photosynthesis and flavonoid biosynthesis) whereby the recovered sub-networks reconfirm established plant gene functions and also identify novel associations. Together, we present valuable insights into grapevine transcriptional regulation by developing network models applicable to researchers in their prioritisation of gene candidates, for on-going study of biological processes related to grapevine development, metabolism and stress responses.

  6. Integrating genome-wide association studies and gene expression data highlights dysregulated multiple sclerosis risk pathways.

    PubMed

    Liu, Guiyou; Zhang, Fang; Jiang, Yongshuai; Hu, Yang; Gong, Zhongying; Liu, Shoufeng; Chen, Xiuju; Jiang, Qinghua; Hao, Junwei

    2017-02-01

    Much effort has been expended on identifying the genetic determinants of multiple sclerosis (MS). Existing large-scale genome-wide association study (GWAS) datasets provide strong support for using pathway and network-based analysis methods to investigate the mechanisms underlying MS. However, no shared genetic pathways have been identified to date. We hypothesize that shared genetic pathways may indeed exist in different MS-GWAS datasets. Here, we report results from a three-stage analysis of GWAS and expression datasets. In stage 1, we conducted multiple pathway analyses of two MS-GWAS datasets. In stage 2, we performed a candidate pathway analysis of the large-scale MS-GWAS dataset. In stage 3, we performed a pathway analysis using the dysregulated MS gene list from seven human MS case-control expression datasets. In stage 1, we identified 15 shared pathways. In stage 2, we successfully replicated 14 of these 15 significant pathways. In stage 3, we found that dysregulated MS genes were significantly enriched in 10 of 15 MS risk pathways identified in stages 1 and 2. We report shared genetic pathways in different MS-GWAS datasets and highlight some new MS risk pathways. Our findings provide new insights on the genetic determinants of MS.
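
    Stage 3 of this abstract asks whether dysregulated genes are over-represented in a pathway. The sketch below shows one common way to score such enrichment, a one-sided hypergeometric test. It is an assumption for illustration only: the abstract does not state which enrichment statistic the authors used, and the function name `hypergeom_enrichment_p` and the example numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(N: int, K: int, n: int, k: int) -> float:
    """Return P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: genes in the background, K: genes in the pathway,
    n: dysregulated genes, k: dysregulated genes that fall in the pathway.
    """
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= min(K, n)):
        raise ValueError("inconsistent counts")
    total = comb(N, n)
    # Sum the upper tail; math.comb returns 0 when the draw is impossible.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / total


# Toy numbers, not taken from the study: 10 background genes, 3 in the
# pathway, 3 dysregulated, all 3 of which fall in the pathway.
p_value = hypergeom_enrichment_p(N=10, K=3, n=3, k=3)
print(f"enrichment p-value: {p_value:.5f}")
```

    Exact integer arithmetic with `math.comb` avoids floating-point underflow for moderate gene counts; testing many pathways at once would additionally need a multiple-testing correction, which this sketch leaves out.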

  7. Microarray Analysis Dataset

    EPA Pesticide Factsheets

    This file contains a link to Gene Expression Omnibus and the GSE designations for the publicly available gene expression data used in the study and reflected in Figures 6 and 7 of the Das et al., 2016 paper. This dataset is associated with the following publication: Das, K., C. Wood, M. Lin, A.A. Starkov, C. Lau, K.B. Wallace, C. Corton, and B. Abbott. Perfluoroalkyl acids-induced liver steatosis: Effects on genes controlling lipid homeostasis. TOXICOLOGY. Elsevier Science Ltd, New York, NY, USA, 378: 32-52, (2017).

  8. BABAR: an R package to simplify the normalisation of common reference design microarray-based transcriptomic datasets

    PubMed Central

    2010-01-01

    Background The development of DNA microarrays has facilitated the generation of hundreds of thousands of transcriptomic datasets. The use of a common reference microarray design allows existing transcriptomic data to be readily compared and re-analysed in the light of new data, and the combination of this design with large datasets is ideal for 'systems'-level analyses. One issue is that these datasets are typically collected over many years and may be heterogeneous in nature, containing different microarray file formats and gene array layouts, dye-swaps, and showing varying scales of log2- ratios of expression between microarrays. Excellent software exists for the normalisation and analysis of microarray data but many data have yet to be analysed as existing methods struggle with heterogeneous datasets; options include normalising microarrays on an individual or experimental group basis. Our solution was to develop the Batch Anti-Banana Algorithm in R (BABAR) algorithm and software package which uses cyclic loess to normalise across the complete dataset. We have already used BABAR to analyse the function of Salmonella genes involved in the process of infection of mammalian cells. Results The only input required by BABAR is unprocessed GenePix or BlueFuse microarray data files. BABAR provides a combination of 'within' and 'between' microarray normalisation steps and diagnostic boxplots. When applied to a real heterogeneous dataset, BABAR normalised the dataset to produce a comparable scaling between the microarrays, with the microarray data in excellent agreement with RT-PCR analysis. When applied to a real non-heterogeneous dataset and a simulated dataset, BABAR's performance in identifying differentially expressed genes showed some benefits over standard techniques. Conclusions BABAR is an easy-to-use software tool, simplifying the simultaneous normalisation of heterogeneous two-colour common reference design cDNA microarray-based transcriptomic datasets. 
    We show BABAR transforms real and simulated datasets to allow for the correct interpretation of these data, and is the ideal tool to facilitate the identification of differentially expressed genes or network inference analysis from transcriptomic datasets. PMID:20128918

  9. Meta-Analysis of Tumor Stem-Like Breast Cancer Cells Using Gene Set and Network Analysis

    PubMed Central

    Lee, Won Jun; Kim, Sang Cheol; Yoon, Jung-Ho; Yoon, Sang Jun; Lim, Johan; Kim, You-Sun; Kwon, Sung Won; Park, Jeong Hill

    2016-01-01

    Generally, cancer stem cells have epithelial-to-mesenchymal-transition characteristics and other aggressive properties that cause metastasis. However, there have been no confident markers for the identification of cancer stem cells and comparative methods examining adherent and sphere cells are widely used to investigate mechanism underlying cancer stem cells, because sphere cells have been known to maintain cancer stem cell characteristics. In this study, we conducted a meta-analysis that combined gene expression profiles from several studies that utilized tumorsphere technology to investigate tumor stem-like breast cancer cells. We used our own gene expression profiles along with the three different gene expression profiles from the Gene Expression Omnibus, which we combined using the ComBat method, and obtained significant gene sets using the gene set analysis of our datasets and the combined dataset. This experiment focused on four gene sets such as cytokine-cytokine receptor interaction that demonstrated significance in both datasets. Our observations demonstrated that among the genes of four significant gene sets, six genes were consistently up-regulated and satisfied the p-value of < 0.05, and our network analysis showed high connectivity in five genes. From these results, we established CXCR4, CXCL1 and HMGCS1, the intersecting genes of the datasets with high connectivity and p-value of < 0.05, as significant genes in the identification of cancer stem cells. Additional experiment using quantitative reverse transcription-polymerase chain reaction showed significant up-regulation in MCF-7 derived sphere cells and confirmed the importance of these three genes. Taken together, using meta-analysis that combines gene set and network analysis, we suggested CXCR4, CXCL1 and HMGCS1 as candidates involved in tumor stem-like breast cancer cells. 
    Unlike other meta-analyses, by using gene set analysis, we selected possible markers which can explain the biological mechanisms and suggested network analysis as an additional criterion for selecting candidates. PMID:26870956

  10. Identifying key genes in glaucoma based on a benchmarked dataset and the gene regulatory network.

    PubMed

    Chen, Xi; Wang, Qiao-Ling; Zhang, Meng-Hui

    2017-10-01

    The current study aimed to identify key genes in glaucoma based on a benchmarked dataset and gene regulatory network (GRN). Local and global noise was added to the gene expression dataset to produce a benchmarked dataset. Differentially-expressed genes (DEGs) between patients with glaucoma and normal controls were identified utilizing the Linear Models for Microarray Data (Limma) package based on the benchmarked dataset. A total of 5 GRN inference methods, including Zscore, GeneNet, context likelihood of relatedness (CLR) algorithm, Partial Correlation coefficient with Information Theory (PCIT) and GEne Network Inference with Ensemble of Trees (Genie3) were evaluated using receiver operating characteristic (ROC) and precision and recall (PR) curves. The inference method with the best performance was selected to construct the GRN. Subsequently, topological centrality analysis (degree, closeness and betweenness) was conducted to identify key genes in the GRN of glaucoma. Finally, the key genes were validated by performing reverse transcription-quantitative polymerase chain reaction (RT-qPCR). A total of 176 DEGs were detected from the benchmarked dataset. The ROC and PR curves of the 5 methods were analyzed and it was determined that Genie3 had a clear advantage over the other methods; thus, Genie3 was used to construct the GRN. Following topological centrality analysis, 14 key genes for glaucoma were identified, including IL6, EPHA2 and GSTT1, and 5 of these 14 key genes were validated by RT-qPCR. Therefore, the current study identified 14 key genes in glaucoma, which may be potential biomarkers to use in the diagnosis of glaucoma and aid in identifying the molecular mechanism of this disease.
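
    Two of the topological centrality measures named in this abstract, degree and closeness, can be computed with a short breadth-first search. The code below is a minimal sketch for an undirected, unweighted network and is not the authors' pipeline. Betweenness is left out to keep the example short, and the function name `degree_and_closeness` and the gene labels are hypothetical.

```python
from collections import deque


def degree_and_closeness(edges):
    """Degree and closeness centrality for an undirected, unweighted graph.

    Closeness is (reachable nodes - 1) / (sum of shortest-path lengths),
    computed within each node's connected component.
    """
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    degree = {node: len(neighbours) for node, neighbours in adjacency.items()}

    closeness = {}
    for source in adjacency:
        # Breadth-first search yields shortest path lengths from source.
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
        total = sum(distance.values())
        closeness[source] = (len(distance) - 1) / total if total else 0.0
    return degree, closeness


# Toy three-gene chain (placeholder names): g1 - g2 - g3.
edges = [("g1", "g2"), ("g2", "g3")]
degree, closeness = degree_and_closeness(edges)
print("degree:", degree)
print("closeness:", closeness)
```

    In the toy chain the middle gene scores highest on both measures, which is the kind of ranking a key-gene selection step would rely on.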

  11. Pairwise gene GO-based measures for biclustering of high-dimensional expression data.

    PubMed

    Nepomuceno, Juan A; Troncoso, Alicia; Nepomuceno-Chamorro, Isabel A; Aguilar-Ruiz, Jesús S

    2018-01-01

    Biclustering algorithms search for groups of genes that share the same behavior under a subset of samples in gene expression data. Nowadays, the biological knowledge available in public repositories can be used to drive these algorithms to find biclusters composed of groups of functionally coherent genes. On the other hand, a distance between genes can be defined from the information stored about them in Gene Ontology (GO). Gene pairwise GO semantic similarity measures report a value for each pair of genes which establishes their functional similarity. A scatter search-based algorithm that optimizes a merit function that integrates GO information is studied in this paper. This merit function uses a term that addresses the information through a GO measure. The effect of two different gene pairwise GO measures on the performance of the algorithm is analyzed. Firstly, three well known yeast datasets with approximately one thousand genes are studied. Secondly, a group of human datasets related to clinical data of cancer is also explored by the algorithm. Most of these data are high-dimensional datasets composed of a huge number of genes. The resultant biclusters reveal groups of genes linked by the same functionality when the search procedure is driven by one of the proposed GO measures. Furthermore, a qualitative biological study of a group of biclusters shows their relevance from a cancer disease perspective. It can be concluded that the integration of biological information improves the performance of the biclustering process. The two different GO measures studied show an improvement in the results obtained for the yeast dataset. However, if datasets are composed of a huge number of genes, only one of them really improves the algorithm performance. This second case constitutes a clear option to explore interesting datasets from a clinical point of view.

  12. A multiple kernel support vector machine scheme for feature selection and rule extraction from gene expression data of cancer tissue.

    PubMed

    Chen, Zhenyu; Li, Jianping; Wei, Liwei

    2007-10-01

    Recently, gene expression profiling using microarray techniques has been shown to be a promising tool to improve the diagnosis and treatment of cancer. Gene expression data contain a high level of noise and an overwhelming number of genes relative to the number of available samples. This poses a great challenge for machine learning and statistical techniques. Support vector machine (SVM) has been successfully used to classify gene expression data of cancer tissue. In the medical field, it is crucial to deliver the user a transparent decision process. How to explain the computed solutions and present the extracted knowledge becomes a main obstacle for SVM. A multiple kernel support vector machine (MK-SVM) scheme, consisting of feature selection, rule extraction and prediction modeling, is proposed to improve the explanation capacity of SVM. In this scheme, we show that the feature selection problem can be translated into an ordinary multiple parameters learning problem. A shrinkage approach, 1-norm based linear programming, is proposed to obtain the sparse parameters and the corresponding selected features. We propose a novel rule extraction approach using the information provided by the separating hyperplane and support vectors to improve the generalization capacity and comprehensibility of rules and reduce the computational complexity. Two public gene expression datasets, a leukemia dataset and a colon tumor dataset, are used to demonstrate the performance of this approach. Using the small number of selected genes, MK-SVM achieves encouraging classification accuracy: more than 90% for both datasets. Moreover, very simple rules with linguistic labels are extracted. The rule sets have high diagnostic power because of their good classification performance.

  13. The pineal gland: A model for adrenergic modulation of ubiquitin ligases.

    PubMed

    Vriend, Jerry; Liu, Wenjun; Reiter, Russel J

    2017-01-01

    A recent study of the pineal gland of the rat found that the expression of more than 3000 genes showed significant day/night variations (The Hartley dataset). The investigators of this report made available a supplemental table in which they tabulated the expression of many genes that they did not discuss, including those coding for components of the ubiquitin proteasome system. Herein we identify the genes of the ubiquitin proteasome system whose expression were significantly influenced by environmental lighting in the Hartley dataset, those that were stimulated by DBcAMP in pineal glands in culture, and those that were stimulated by norepinephrine. Using the Ubiquitin and Ubiquitin-like Conjugation Database (UUCA) we identified ubiquitin ligases and conjugases, and deubiquitinases in the Hartley dataset for the purpose of determining whether expression of genes of the ubiquitin proteasome pathway were significantly influenced by day/night variations and if these variations were regulated by autonomic innervation of the pineal gland from the superior cervical ganglia. In the Hartley experiments pineal glands groups of rats sacrificed during the day and groups sacrificed during the night were examined for gene expression. Additional groups of rats had their superior cervical ganglia removed surgically or surgically decentralized and the pineal glands likewise examined for gene expression. The genes with at least a 2-fold day/night significant difference in expression included genes for 5 ubiquitin conjugating enzymes, genes for 58 ubiquitin E3 ligases and genes for 6 deubiquitinases. A 35-fold day/night difference was noted in the expression of the gene Sik1, which codes for a protein containing both an ubiquitin binding domain (UBD) and an ubiquitin-associated (UBA) domain. 
    Most of the significant differences in these genes were prevented by surgical removal, or disconnection, of the superior cervical ganglia, and most were responsive, in vitro, to treatment with a cyclic AMP analog, and norepinephrine. All previously described 24-hour rhythms in the pineal require an intact sympathetic input from the superior cervical ganglia. The Hartley dataset thus provides evidence that the pineal gland is a highly useful model for studying adrenergically dependent mechanisms regulating variations in ubiquitin ligases, ubiquitin conjugases, and deubiquitinases, mechanisms that may be physiologically relevant not only in the pineal gland, but in all adrenergically innervated tissue.

  14. The pineal gland: A model for adrenergic modulation of ubiquitin ligases

    PubMed Central

    Liu, Wenjun; Reiter, Russel J.

    2017-01-01

    Introduction A recent study of the pineal gland of the rat found that the expression of more than 3000 genes showed significant day/night variations (The Hartley dataset). The investigators of this report made available a supplemental table in which they tabulated the expression of many genes that they did not discuss, including those coding for components of the ubiquitin proteasome system. Herein we identify the genes of the ubiquitin proteasome system whose expression were significantly influenced by environmental lighting in the Hartley dataset, those that were stimulated by DBcAMP in pineal glands in culture, and those that were stimulated by norepinephrine. Purpose Using the Ubiquitin and Ubiquitin-like Conjugation Database (UUCA) we identified ubiquitin ligases and conjugases, and deubiquitinases in the Hartley dataset for the purpose of determining whether expression of genes of the ubiquitin proteasome pathway were significantly influenced by day/night variations and if these variations were regulated by autonomic innervation of the pineal gland from the superior cervical ganglia. Methods In the Hartley experiments pineal glands groups of rats sacrificed during the day and groups sacrificed during the night were examined for gene expression. Additional groups of rats had their superior cervical ganglia removed surgically or surgically decentralized and the pineal glands likewise examined for gene expression. Results The genes with at least a 2-fold day/night significant difference in expression included genes for 5 ubiquitin conjugating enzymes, genes for 58 ubiquitin E3 ligases and genes for 6 deubiquitinases. A 35-fold day/night difference was noted in the expression of the gene Sik1, which codes for a protein containing both an ubiquitin binding domain (UBD) and an ubiquitin-associated (UBA) domain. 
    Most of the significant differences in these genes were prevented by surgical removal, or disconnection, of the superior cervical ganglia, and most were responsive, in vitro, to treatment with a cyclic AMP analog, and norepinephrine. All previously described 24-hour rhythms in the pineal require an intact sympathetic input from the superior cervical ganglia. Conclusions The Hartley dataset thus provides evidence that the pineal gland is a highly useful model for studying adrenergically dependent mechanisms regulating variations in ubiquitin ligases, ubiquitin conjugases, and deubiquitinases, mechanisms that may be physiologically relevant not only in the pineal gland, but in all adrenergically innervated tissue. PMID:28212404

  15. Methods to increase reproducibility in differential gene expression via meta-analysis

    PubMed Central

    Sweeney, Timothy E.; Haynes, Winston A.; Vallania, Francesco; Ioannidis, John P.; Khatri, Purvesh

    2017-01-01

    Findings from clinical and biological studies are often not reproducible when tested in independent cohorts. Due to the testing of a large number of hypotheses and relatively small sample sizes, results from whole-genome expression studies in particular are often not reproducible. Compared to single-study analysis, gene expression meta-analysis can improve reproducibility by integrating data from multiple studies. However, there are multiple choices in designing and carrying out a meta-analysis. Yet, clear guidelines on best practices are scarce. Here, we hypothesized that studying subsets of very large meta-analyses would allow for systematic identification of best practices to improve reproducibility. We therefore constructed three very large gene expression meta-analyses from clinical samples, and then examined meta-analyses of subsets of the datasets (all combinations of datasets with up to N/2 samples and K/2 datasets) compared to a ‘silver standard’ of differentially expressed genes found in the entire cohort. We tested three random-effects meta-analysis models using this procedure. We showed relatively greater reproducibility with more-stringent effect size thresholds with relaxed significance thresholds; relatively lower reproducibility when imposing extraneous constraints on residual heterogeneity; and an underestimation of actual false positive rate by Benjamini–Hochberg correction. In addition, multivariate regression showed that the accuracy of a meta-analysis increased significantly with more included datasets even when controlling for sample size. PMID:27634930
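
    As an illustration of the random-effects pooling that record 15 evaluates, the sketch below implements the DerSimonian-Laird estimator in plain Python. The abstract does not name the three random-effects models that were tested, so the choice of estimator, the function name `dersimonian_laird`, and the input format (one effect size and one sampling variance per study, for a single gene) are assumptions made for illustration only.

```python
import math


def dersimonian_laird(effects, variances):
    """Pool per-study effect sizes for one gene (hypothetical helper).

    effects:   list of per-study effect sizes (e.g. Hedges' g)
    variances: list of the matching within-study sampling variances
    Returns (pooled_effect, standard_error, tau_squared).
    """
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two studies with matching variances")

    # Fixed-effect (inverse-variance) weights and pooled estimate.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w

    # Cochran's Q and the method-of-moments between-study variance.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0

    # Random-effects weights add tau2 to every within-study variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau2
```

    A gene would then be called differentially expressed only if the pooled effect passes an effect-size threshold and its multiple-testing-corrected p value passes a significance threshold; the thresholds themselves are the design choices the abstract compares.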

  16. Integrated Quantitative Transcriptome Maps of Human Trisomy 21 Tissues and Cells

    PubMed Central

    Pelleri, Maria Chiara; Cattani, Chiara; Vitale, Lorenza; Antonaros, Francesca; Strippoli, Pierluigi; Locatelli, Chiara; Cocchi, Guido; Piovesan, Allison; Caracausi, Maria

    2018-01-01

    Down syndrome (DS) is due to the presence of an extra full or partial chromosome 21 (Hsa21). The identification of genes contributing to DS pathogenesis could be the key to any rational therapy of the associated intellectual disability. We aim at generating quantitative transcriptome maps in DS integrating all gene expression profile datasets available for any cell type or tissue, to obtain a complete model of the transcriptome in terms of both expression values for each gene and segmental trend of gene expression along each chromosome. We used the TRAM (Transcriptome Mapper) software for this meta-analysis, comparing transcript expression levels and profiles between DS and normal brain, lymphoblastoid cell lines, blood cells, fibroblasts, thymus and induced pluripotent stem cells, respectively. TRAM combined, normalized, and integrated datasets from different sources and across diverse experimental platforms. The main output was a linear expression value that may be used as a reference for each of up to 37,181 mapped transcripts analyzed, related to both known genes and expression sequence tag (EST) clusters. An independent example in vitro validation of fibroblast transcriptome map data was performed through “Real-Time” reverse transcription polymerase chain reaction showing an excellent correlation coefficient (r = 0.93, p < 0.0001) with data obtained in silico. The availability of linear expression values for each gene allowed the testing of the gene dosage hypothesis of the expected 3:2 DS/normal ratio for Hsa21 as well as other human genes in DS, in addition to listing genes differentially expressed with statistical significance. Although a fraction of Hsa21 genes escapes dosage effects, Hsa21 genes are selectively over-expressed in DS samples compared to genes from other chromosomes, reflecting a decisive role in the pathogenesis of the syndrome. 
    Finally, the analysis of chromosomal segments reveals a high prevalence of Hsa21 over-expressed segments over the other genomic regions, suggesting, in particular, a specific region on Hsa21 that appears to be frequently over-expressed (21q22). Our complete datasets are released as a new framework to investigate transcription in DS for individual genes as well as chromosomal segments in different cell types and tissues. PMID:29740474

  17. Association of Protein Translation and Extracellular Matrix Gene Sets with Breast Cancer Metastasis: Findings Uncovered on Analysis of Multiple Publicly Available Datasets Using Individual Patient Data Approach.

    PubMed

    Chowdhury, Nilotpal; Sapru, Shantanu

    2015-01-01

    Microarray analysis has revolutionized the role of genomic prognostication in breast cancer. However, most studies are single series studies, and suffer from methodological problems. We sought to use a meta-analytic approach in combining multiple publicly available datasets, while correcting for batch effects, to reach a more robust oncogenomic analysis. The aim of the present study was to find gene sets associated with distant metastasis free survival (DMFS) in systemically untreated, node-negative breast cancer patients, from publicly available genomic microarray datasets. Four microarray series (having 742 patients) were selected after a systematic search and combined. Cox regression for each gene was done for the combined dataset (univariate, as well as multivariate - adjusted for expression of Cell cycle related genes) and for the 4 major molecular subtypes. The centre and microarray batch effects were adjusted by including them as random effects variables. The Cox regression coefficients for each analysis were then ranked and subjected to a Gene Set Enrichment Analysis (GSEA). Gene sets representing protein translation were independently negatively associated with metastasis in the Luminal A and Luminal B subtypes, but positively associated with metastasis in Basal tumors. Proteinaceous extracellular matrix (ECM) gene set expression was positively associated with metastasis, after adjustment for expression of cell cycle related genes on the combined dataset. Finally, the positive association of the proliferation-related genes with metastases was confirmed. To the best of our knowledge, the results depicting mixed prognostic significance of protein translation in breast cancer subtypes are being reported for the first time. We attribute this to our study combining multiple series and performing a more robust meta-analytic Cox regression modeling on the combined dataset, thus discovering 'hidden' associations. 
    This methodology seems to yield new and interesting results and may be used as a tool to guide new research.
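
    Record 17 ranks per-gene Cox regression coefficients and passes the ranking to Gene Set Enrichment Analysis. The sketch below shows only the unweighted running-sum enrichment score on such a ranking. It is a simplification under stated assumptions: the published GSEA procedure also weights hits by the ranking statistic and assesses significance by permutation, both of which are omitted here, and the function name `enrichment_score` and its inputs are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Signed maximum deviation of the unweighted running sum.

    ranked_genes: gene identifiers sorted by the ranking statistic
                  (assumed unique), e.g. by Cox regression coefficient
    gene_set:     identifiers belonging to the gene set under test
    """
    members = set(gene_set) & set(ranked_genes)
    n_hit = len(members)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")

    running = 0.0
    best = 0.0
    for gene in ranked_genes:
        # Step up on a gene-set member, down on any other gene.
        if gene in members:
            running += 1.0 / n_hit
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best
```

    A positive score means the set is concentrated at the top of the ranking and a negative score means it is concentrated at the bottom, which is how a gene set can be positively associated with metastasis in one subtype and negatively in another.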

  18. Association of Protein Translation and Extracellular Matrix Gene Sets with Breast Cancer Metastasis: Findings Uncovered on Analysis of Multiple Publicly Available Datasets Using Individual Patient Data Approach

    PubMed Central

    Chowdhury, Nilotpal; Sapru, Shantanu

    2015-01-01

    Introduction Microarray analysis has revolutionized the role of genomic prognostication in breast cancer. However, most studies are single series studies, and suffer from methodological problems. We sought to use a meta-analytic approach in combining multiple publicly available datasets, while correcting for batch effects, to reach a more robust oncogenomic analysis. Aim The aim of the present study was to find gene sets associated with distant metastasis free survival (DMFS) in systemically untreated, node-negative breast cancer patients, from publicly available genomic microarray datasets. Methods Four microarray series (having 742 patients) were selected after a systematic search and combined. Cox regression for each gene was done for the combined dataset (univariate, as well as multivariate – adjusted for expression of Cell cycle related genes) and for the 4 major molecular subtypes. The centre and microarray batch effects were adjusted by including them as random effects variables. The Cox regression coefficients for each analysis were then ranked and subjected to a Gene Set Enrichment Analysis (GSEA). Results Gene sets representing protein translation were independently negatively associated with metastasis in the Luminal A and Luminal B subtypes, but positively associated with metastasis in Basal tumors. Proteinaceous extracellular matrix (ECM) gene set expression was positively associated with metastasis, after adjustment for expression of cell cycle related genes on the combined dataset. Finally, the positive association of the proliferation-related genes with metastases was confirmed. Conclusion To the best of our knowledge, the results depicting mixed prognostic significance of protein translation in breast cancer subtypes are being reported for the first time. We attribute this to our study combining multiple series and performing a more robust meta-analytic Cox regression modeling on the combined dataset, thus discovering 'hidden' associations. 
    This methodology seems to yield new and interesting results and may be used as a tool to guide new research. PMID:26080057

  19. Pathway activity inference for multiclass disease classification through a mathematical programming optimisation framework.

    PubMed

    Yang, Lingjian; Ainali, Chrysanthi; Tsoka, Sophia; Papageorgiou, Lazaros G

    2014-12-05

    Applying machine learning methods on microarray gene expression profiles for disease classification problems is a popular method to derive biomarkers, i.e. sets of genes that can predict disease state or outcome. Traditional approaches where expression of genes were treated independently suffer from low prediction accuracy and difficulty of biological interpretation. Current research efforts focus on integrating information on protein interactions through biochemical pathway datasets with expression profiles to propose pathway-based classifiers that can enhance disease diagnosis and prognosis. As most of the pathway activity inference methods in literature are either unsupervised or applied on two-class datasets, there is good scope to address such limitations by proposing novel methodologies. A supervised multiclass pathway activity inference method using optimisation techniques is reported. For each pathway expression dataset, patterns of its constituent genes are summarised into one composite feature, termed pathway activity, and a novel mathematical programming model is proposed to infer this feature as a weighted linear summation of expression of its constituent genes. Gene weights are determined by the optimisation model, in a way that the resulting pathway activity has the optimal discriminative power with regards to disease phenotypes. Classification is then performed on the resulting low-dimensional pathway activity profile. The model was evaluated through a variety of published gene expression profiles that cover different types of disease. We show that not only does it improve classification accuracy, but it can also perform well in multiclass disease datasets, a limitation of other approaches from the literature. Desirable features of the model include the ability to control the maximum number of genes that may participate in determining pathway activity, which may be pre-specified by the user. 
    Overall, this work highlights the potential of building pathway-based multi-phenotype classifiers for accurate disease diagnosis and prognosis problems.
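
    Record 19 describes pathway activity as a weighted linear summation of the expression of a pathway's constituent genes. The sketch below shows only that summation step. The weights are supplied by hand here as placeholders; in the paper they come from the mathematical programming model, which is not reproduced. The function name `pathway_activity` and the dictionary-based input format are assumptions.

```python
def pathway_activity(expression, weights):
    """Return one activity value per sample: sum over genes of weight * expression.

    expression: dict mapping gene -> list of per-sample expression values
    weights:    dict mapping gene -> weight (hypothetical values; the paper
                obtains them from an optimisation model not shown here)
    """
    genes = list(weights)
    if not genes:
        raise ValueError("weights must name at least one gene")
    n_samples = len(expression[genes[0]])
    if any(len(expression[g]) != n_samples for g in genes):
        raise ValueError("all genes need the same number of samples")

    # A gene with weight 0 simply drops out of the composite feature.
    return [
        sum(weights[g] * expression[g][s] for g in genes)
        for s in range(n_samples)
    ]
```

    The resulting vector has one entry per sample, so a pathway with many genes is reduced to a single feature that a downstream classifier can use.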

  20. Gene expression profiles reveal key genes for early diagnosis and treatment of adamantinomatous craniopharyngioma.

    PubMed

    Yang, Jun; Hou, Ziming; Wang, Changjiang; Wang, Hao; Zhang, Hongbing

    2018-04-23

    Adamantinomatous craniopharyngioma (ACP) is an aggressive brain tumor that occurs predominantly in the pediatric population. Conventional diagnosis method and standard therapy cannot treat ACPs effectively. In this paper, we aimed to identify key genes for ACP early diagnosis and treatment. Datasets GSE94349 and GSE68015 were obtained from Gene Expression Omnibus database. Consensus clustering was applied to discover the gene clusters in the expression data of GSE94349 and functional enrichment analysis was performed on gene set in each cluster. The protein-protein interaction (PPI) network was built by the Search Tool for the Retrieval of Interacting Genes, and hubs were selected. Support vector machine (SVM) model was built based on the signature genes identified from enrichment analysis and PPI network. Dataset GSE94349 was used for training and testing, and GSE68015 was used for validation. Besides, RT-qPCR analysis was performed to analyze the expression of signature genes in ACP samples compared with normal controls. Seven gene clusters were discovered in the differentially expressed genes identified from GSE94349 dataset. Enrichment analysis of each cluster identified 25 pathways that highly associated with ACP. PPI network was built and 46 hubs were determined. Twenty-five pathway-related genes that overlapped with the hubs in PPI network were used as signatures to establish the SVM diagnosis model for ACP. The prediction accuracy of SVM model for training, testing, and validation data were 94, 85, and 74%, respectively. The expression of CDH1, CCL2, ITGA2, COL8A1, COL6A2, and COL6A3 were significantly upregulated in ACP tumor samples, while CAMK2A, RIMS1, NEFL, SYT1, and STX1A were significantly downregulated, which were consistent with the differentially expressed gene analysis. SVM model is a promising classification tool for screening and early diagnosis of ACP. 
    The ACP-related pathways and signature genes will advance our knowledge of ACP pathogenesis and benefit the therapy improvement.

  1. dbMDEGA: a database for meta-analysis of differentially expressed genes in autism spectrum disorder.

    PubMed

    Zhang, Shuyun; Deng, Libin; Jia, Qiyue; Huang, Shaoting; Gu, Junwang; Zhou, Fankun; Gao, Meng; Sun, Xinyi; Feng, Chang; Fan, Guangqin

    2017-11-16

    Autism spectrum disorders (ASD) are hereditary, heterogeneous and biologically complex neurodevelopmental disorders. Individual studies on gene expression in ASD cannot provide clear consensus conclusions. Therefore, a systematic review to synthesize the current findings from brain tissues and a search tool to share the meta-analysis results are urgently needed. Here, we conducted a meta-analysis of brain gene expression profiles in the current reported human ASD expression datasets (with 84 frozen male cortex samples, 17 female cortex samples, 32 cerebellum samples and 4 formalin fixed samples) and knock-out mouse ASD model expression datasets (with 80 collective brain samples). Then, we applied R language software and developed an interactive shared and updated database (dbMDEGA) displaying the results of meta-analysis of data from ASD studies regarding differentially expressed genes (DEGs) in the brain. This database, dbMDEGA ( https://dbmdega.shinyapps.io/dbMDEGA/ ), is a publicly available web-portal for manual annotation and visualization of DEGs in the brain from data from ASD studies. This database uniquely presents meta-analysis values and homologous forest plots of DEGs in brain tissues. Gene entries are annotated with meta-values, statistical values and forest plots of DEGs in brain samples. This database aims to provide searchable meta-analysis results based on the current reported brain gene expression datasets of ASD to help detect candidate genes underlying this disorder. This new analytical tool may provide valuable assistance in the discovery of DEGs and the elucidation of the molecular pathogenicity of ASD. This database model may be replicated to study other disorders.

  2. Bayesian median regression for temporal gene expression data

    NASA Astrophysics Data System (ADS)

    Yu, Keming; Vinciotti, Veronica; Liu, Xiaohui; 't Hoen, Peter A. C.

    2007-09-01

    Most of the existing methods for the identification of biologically interesting genes in a temporal expression profiling dataset do not fully exploit the temporal ordering in the dataset and are based on normality assumptions for the gene expression. In this paper, we introduce a Bayesian median regression model to detect genes whose temporal profile is significantly different across a number of biological conditions. The regression model is defined by a polynomial function where both time and condition effects as well as interactions between the two are included. MCMC-based inference returns the posterior distribution of the polynomial coefficients. From this a simple Bayes factor test is proposed to test for significance. The estimation of the median rather than the mean, and within a Bayesian framework, increases the robustness of the method compared to a Hotelling T2-test previously suggested. This is shown on simulated data and on muscular dystrophy gene expression data.

  3. Accurate and fast multiple-testing correction in eQTL studies.

    PubMed

    Sul, Jae Hoon; Raj, Towfique; de Jong, Simone; de Bakker, Paul I W; Raychaudhuri, Soumya; Ophoff, Roel A; Stranger, Barbara E; Eskin, Eleazar; Han, Buhm

    2015-06-04

    In studies of expression quantitative trait loci (eQTLs), it is of increasing interest to identify eGenes, the genes whose expression levels are associated with variation at a particular genetic variant. Detecting eGenes is important for follow-up analyses and prioritization because genes are the main entities in biological processes. To detect eGenes, one typically focuses on the genetic variant with the minimum p value among all variants in cis with a gene and corrects for multiple testing to obtain a gene-level p value. For performing multiple-testing correction, a permutation test is widely used. Because of growing sample sizes of eQTL studies, however, the permutation test has become a computational bottleneck in eQTL studies. In this paper, we propose an efficient approach for correcting for multiple testing and assess eGene p values by utilizing a multivariate normal distribution. Our approach properly takes into account the linkage-disequilibrium structure among variants, and its time complexity is independent of sample size. By applying our small-sample correction techniques, our method achieves high accuracy in both small and large studies. We have shown that our method consistently produces extremely accurate p values (accuracy > 98%) for three human eQTL datasets with different sample sizes and SNP densities: the Genotype-Tissue Expression pilot dataset, the multi-region brain dataset, and the HapMap 3 dataset. Copyright © 2015 The American Society of Human Genetics. Published by Elsevier Inc. All rights reserved.
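
    To make the idea in record 3 concrete, the sketch below estimates a gene-level p value by drawing test statistics from a multivariate normal distribution whose correlation matrix is the linkage-disequilibrium (LD) correlation among the cis variants. This is a generic Monte Carlo illustration, not the authors' algorithm: their small-sample correction and any speed-ups are not reproduced, and the names `cholesky`, `gene_level_p`, `ld_corr`, and `n_draws` are hypothetical.

```python
import math
import random


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a positive definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                d = matrix[i][i] - s
                if d <= 0.0:
                    raise ValueError("matrix must be positive definite")
                lower[i][j] = math.sqrt(d)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def gene_level_p(observed_max_abs_z, ld_corr, n_draws=10000, seed=0):
    """Monte Carlo p value for the strongest association among cis variants.

    observed_max_abs_z: largest |z| statistic observed across the variants
    ld_corr:            LD correlation matrix among those variants
    """
    rng = random.Random(seed)
    lower = cholesky(ld_corr)
    n = len(ld_corr)
    exceed = 0
    for _ in range(n_draws):
        # Correlated null statistics: z = L e, with e standard normal.
        e = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = [sum(lower[i][k] * e[k] for k in range(i + 1)) for i in range(n)]
        if max(abs(v) for v in z) >= observed_max_abs_z:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_draws + 1)
```

    The cost of each draw depends on the number of variants, not on the number of individuals, which is consistent with the abstract's statement that the time complexity is independent of sample size.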

  4. ExpTreeDB: web-based query and visualization of manually annotated gene expression profiling experiments of human and mouse from GEO.

    PubMed

    Ni, Ming; Ye, Fuqiang; Zhu, Juanjuan; Li, Zongwei; Yang, Shuai; Yang, Bite; Han, Lu; Wu, Yongge; Chen, Ying; Li, Fei; Wang, Shengqi; Bo, Xiaochen

    2014-12-01

    Numerous public microarray datasets are valuable resources for the scientific communities. Several online tools have made great steps to use these data by querying related datasets with users' own gene signatures or expression profiles. However, dataset annotation and result exhibition still need to be improved. ExpTreeDB is a database that allows for queries on human and mouse microarray experiments from Gene Expression Omnibus with gene signatures or profiles. Compared with similar applications, ExpTreeDB pays more attention to dataset annotations and result visualization. We introduced a multiple-level annotation system to depict and organize original experiments. For example, a tamoxifen-treated cell line experiment is hierarchically annotated as 'agent→drug→estrogen receptor antagonist→tamoxifen'. Consequently, retrieved results are exhibited by an interactive tree-structured graphics, which provide an overview for related experiments and might enlighten users on key items of interest. The database is freely available at http://biotech.bmi.ac.cn/ExpTreeDB. Web site is implemented in Perl, PHP, R, MySQL and Apache. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  5. Reproducibility-optimized test statistic for ranking genes in microarray studies.

    PubMed

    Elo, Laura L; Filén, Sanna; Lahesmaa, Riitta; Aittokallio, Tero

    2008-01-01

    A principal goal of microarray studies is to identify the genes showing differential expression under distinct conditions. In such studies, the selection of an optimal test statistic is a crucial challenge, which depends on the type and amount of data under analysis. While previous studies on simulated or spike-in datasets do not provide practical guidance on how to choose the best method for a given real dataset, we introduce an enhanced reproducibility-optimization procedure, which enables the selection of a suitable gene-ranking statistic directly from the data. In comparison with existing ranking methods, the reproducibility-optimized statistic shows good performance consistently under various simulated conditions and on the Affymetrix spike-in dataset. Further, the feasibility of the novel statistic is confirmed in a practical research setting using data from an in-house cDNA microarray study of asthma-related gene expression changes. These results suggest that the procedure facilitates the selection of an appropriate test statistic for a given dataset without relying on a priori assumptions, which may bias the findings and their interpretation. Moreover, the general reproducibility-optimization procedure is not limited to detecting differential expression only but could be extended to a wide range of other applications as well.

  6. A Filter Feature Selection Method Based on MFA Score and Redundancy Excluding and It's Application to Tumor Gene Expression Data Analysis.

    PubMed

    Li, Jiangeng; Su, Lei; Pang, Zenan

    2015-12-01

    Feature selection techniques have been widely applied to tumor gene expression data analysis in recent years. A filter feature selection method named marginal Fisher analysis score (MFA score) which is based on graph embedding has been proposed, and it has been widely used mainly because it is superior to Fisher score. Considering the heavy redundancy in gene expression data, we proposed a new filter feature selection technique in this paper. It is named MFA score+ and is based on MFA score and redundancy excluding. We applied it to an artificial dataset and eight tumor gene expression datasets to select important features and then used support vector machine as the classifier to classify the samples. Compared with MFA score, t test and Fisher score, it achieved higher classification accuracy.
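
    Record 6 uses the Fisher score as one of its baseline filter methods. The sketch below computes the standard Fisher score for a single gene: the between-class scatter of the class means divided by the pooled within-class variance. It does not implement the MFA score or the redundancy-excluding step of MFA score+, and the function name `fisher_score` and its list-based inputs are assumptions.

```python
def fisher_score(values, labels):
    """Fisher score of one feature (gene) across labelled samples.

    values: expression values of the gene, one per sample
    labels: class label of each sample (e.g. tumor subtype)
    """
    n = len(values)
    if n == 0 or n != len(labels):
        raise ValueError("values and labels must be non-empty and aligned")
    overall_mean = sum(values) / n

    # Group the expression values by class label.
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)

    between = 0.0
    within = 0.0
    for members in groups.values():
        size = len(members)
        mean = sum(members) / size
        var = sum((v - mean) ** 2 for v in members) / size
        between += size * (mean - overall_mean) ** 2
        within += size * var

    # A gene that is constant within every class separates them perfectly.
    return between / within if within > 0 else float("inf")
```

    Genes would be ranked by this score and the top-ranked ones passed to a classifier such as a support vector machine. Because each gene is scored independently, highly correlated genes can all rank near the top, which is the redundancy the abstract sets out to remove.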

  7. A-MADMAN: Annotation-based microarray data meta-analysis tool

    PubMed Central

    Bisognin, Andrea; Coppe, Alessandro; Ferrari, Francesco; Risso, Davide; Romualdi, Chiara; Bicciato, Silvio; Bortoluzzi, Stefania

    2009-01-01

    Background Publicly available datasets of microarray gene expression signals represent an unprecedented opportunity for extracting genomic relevant information and validating biological hypotheses. However, the exploitation of this exceptionally rich mine of information is still hampered by the lack of appropriate computational tools, able to overcome the critical issues raised by meta-analysis. Results This work presents A-MADMAN, an open source web application which allows the retrieval, annotation, organization and meta-analysis of gene expression datasets obtained from Gene Expression Omnibus. A-MADMAN addresses and resolves several open issues in the meta-analysis of gene expression data. Conclusion A-MADMAN allows i) the batch retrieval from Gene Expression Omnibus and the local organization of raw data files and of any related meta-information, ii) the re-annotation of samples to fix incomplete, or otherwise inadequate, metadata and to create user-defined batches of data, iii) the integrative analysis of data obtained from different Affymetrix platforms through custom chip definition files and meta-normalization. Software and documentation are available on-line at . PMID:19563634

  8. Quantitative comparison of microarray experiments with published leukemia related gene expression signatures.

    PubMed

    Klein, Hans-Ulrich; Ruckert, Christian; Kohlmann, Alexander; Bullinger, Lars; Thiede, Christian; Haferlach, Torsten; Dugas, Martin

    2009-12-15

    Multiple gene expression signatures derived from microarray experiments have been published in the field of leukemia research. A comparison of these signatures with results from new experiments is useful for verification as well as for interpretation of the results obtained. Currently, the percentage of overlapping genes is frequently used to compare published gene signatures against a signature derived from a new experiment. However, it has been shown that the percentage of overlapping genes is of limited use for comparing two experiments due to the variability of gene signatures caused by different array platforms or assay-specific influencing parameters. Here, we present a robust approach for a systematic and quantitative comparison of published gene expression signatures with an exemplary query dataset. A database storing 138 leukemia-related published gene signatures was designed. Each gene signature was manually annotated with terms according to a leukemia-specific taxonomy. Two analysis steps are implemented to compare a new microarray dataset with the results from previous experiments stored and curated in the database. First, the global test method is applied to assess gene signatures and to constitute a ranking among them. In a subsequent analysis step, the focus is shifted from single gene signatures to chromosomal aberrations or molecular mutations as modeled in the taxonomy. Potentially interesting disease characteristics are detected based on the ranking of gene signatures associated with these aberrations stored in the database. Two example analyses are presented. An implementation of the approach is freely available as web-based application. The presented approach helps researchers to systematically integrate the knowledge derived from numerous microarray experiments into the analysis of a new dataset. 
By means of example leukemia datasets we demonstrate that this approach detects related experiments as well as related molecular mutations and may help to interpret new microarray data.

  9. A gene expression estimator of intramuscular fat percentage for use in both cattle and sheep

    PubMed Central

    2014-01-01

    Background The expression of genes encoding proteins involved in triacylglyceride and fatty acid synthesis and storage in cattle muscle is correlated with intramuscular fat (IMF)%. Are the same genes also correlated with IMF% in sheep muscle, and can the same set of genes be used to estimate IMF% in both species? Results The correlation between gene expression (microarray) and IMF% in the longissimus muscle (LM) of twenty sheep was calculated. An integrated analysis of this dataset with an equivalent cattle correlation dataset and a cattle differential expression dataset was undertaken. A total of 30 genes were identified to be strongly correlated with IMF% in both cattle and sheep. The overlap of genes was highly significant, 8 of the 13 genes in the TAG gene set and 8 of the 13 genes in the FA gene set were in the top 100 and 500 genes respectively most correlated with IMF% in sheep, P-value = 0. Of the 30 genes, CIDEA, THRSP, ACSM1, DGAT2 and FABP4 had the highest average rank in both species. Using the data from two small groups of Brahman cattle (control and hormone growth promotant-treated [known to decrease IMF% in muscle]) and 22 animals in total, the utility of a direct measure and different estimators of IMF% (ultrasound and gene expression) to differentiate between the two groups was examined. Directly measured IMF% and IMF% estimated from ultrasound scanning could not discriminate between the two groups. However, using gene expression to estimate IMF% discriminated between the two groups. Increasing the number of genes used to estimate IMF% from one to five significantly increased the discrimination power; but increasing the number of genes to 15 resulted in little further improvement. Conclusion We have demonstrated the utility of a comparative approach to identify robust estimators of IMF% in the LM in cattle and sheep. 
We have also demonstrated a number of approaches (potentially applicable to much smaller groups of animals than conventional methods) to using gene expression to rank animals for IMF% within a single farm/treatment, or to estimate differences in IMF% between two farms/treatments. PMID:25028604

  10. Systematic analysis of microarray datasets to identify Parkinson's disease‑associated pathways and genes.

    PubMed

    Feng, Yinling; Wang, Xuefeng

    2017-03-01

    In order to investigate commonly disturbed genes and pathways in various brain regions of patients with Parkinson's disease (PD), microarray datasets from previous studies were collected and systematically analyzed. Different normalization methods were applied to microarray datasets from different platforms. A strategy combining gene co‑expression networks and clinical information was adopted, using weighted gene co‑expression network analysis (WGCNA) to screen for commonly disturbed genes in different brain regions of patients with PD. Functional enrichment analysis of commonly disturbed genes was performed using the Database for Annotation, Visualization, and Integrated Discovery (DAVID). Co‑pathway relationships were identified with Pearson's correlation coefficient tests and a hypergeometric distribution‑based test. Common genes in pathway pairs were selected out and regarded as risk genes. A total of 17 microarray datasets from 7 platforms were retained for further analysis. Five gene coexpression modules were identified, containing 9,745, 736, 233, 101 and 93 genes, respectively. One module was significantly correlated with PD samples and thus the 736 genes it contained were considered to be candidate PD‑associated genes. Functional enrichment analysis demonstrated that these genes were implicated in oxidative phosphorylation and PD. A total of 44 pathway pairs and 52 risk genes were revealed, and a risk gene pathway relationship network was constructed. Eight modules were identified and were revealed to be associated with PD, cancers and metabolism. A number of disturbed pathways and risk genes were unveiled in PD, and these findings may help advance understanding of PD pathogenesis.
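    The hypergeometric distribution-based test mentioned in the abstract above can be illustrated with a minimal sketch. It computes the upper-tail probability of seeing at least k shared genes between two pathways by chance. The function name and the parameterization (N background genes, K genes in one pathway, n in the other) are assumptions made for illustration; the abstract does not describe the authors' exact procedure.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """Return P(X >= k) for a hypergeometric variable X.

    N: size of the gene background (assumed parameterization)
    K: number of genes in pathway A
    n: number of genes in pathway B
    k: observed number of genes shared by A and B
    """
    lower = max(k, 0)
    upper = min(K, n)
    total = comb(N, n)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible overlaps contribute nothing to the sum.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(lower, upper + 1))
    return favourable / total
```

    A small tail probability suggests that the two pathways share more genes than expected by chance; in practice such p-values would still need correction for multiple testing.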

  11. paraGSEA: a scalable approach for large-scale gene expression profiling

    PubMed Central

    Peng, Shaoliang; Yang, Shunyun

    2017-01-01

    Abstract More studies have been conducted using gene expression similarity to identify functional connections among genes, diseases and drugs. Gene Set Enrichment Analysis (GSEA) is a powerful analytical method for interpreting gene expression data. However, due to its enormous computational overhead in the significance-level estimation and multiple hypothesis testing steps, its computational scalability and efficiency are poor on large-scale datasets. We proposed paraGSEA for efficient large-scale transcriptome data analysis. By optimization, the overall time complexity of paraGSEA is reduced from O(mn) to O(m+n), where m is the length of the gene sets and n is the length of the gene expression profiles, which yields a more than 100-fold increase in performance compared with other popular GSEA implementations such as GSEA-P, SAM-GS and GSEA2. By further parallelization, a near-linear speed-up is gained on both workstations and clusters, with high scalability and performance on large-scale datasets. The analysis time of the whole LINCS phase I dataset (GSE92742) was reduced to nearly half an hour on a 1000-node cluster on Tianhe-2, or within 120 hours on a 96-core workstation. The source code of paraGSEA is licensed under the GPLv3 and available at http://github.com/ysycloud/paraGSEA. PMID:28973463
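    To make the O(m+n) figure above concrete, the following is a minimal sketch of an unweighted (Kolmogorov-Smirnov style) GSEA enrichment score computed in a single pass. It is not taken from the paraGSEA source: the function name is hypothetical, and the unweighted running sum is a simplifying assumption. Building a hash set of the gene set costs O(m), and scanning the ranked profile costs O(n).

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score (illustrative sketch).

    ranked_genes: gene identifiers ordered by expression change (length n)
    gene_set:     gene identifiers forming the set of interest (length m)
    """
    members = set(gene_set)  # O(m) to build, O(1) average lookup
    n = len(ranked_genes)
    hits = sum(1 for g in ranked_genes if g in members)
    if hits == 0 or hits == n:
        return 0.0  # score is undefined without both hits and misses
    step_up = 1.0 / hits
    step_down = 1.0 / (n - hits)
    running = 0.0
    best = 0.0
    for g in ranked_genes:  # single O(n) scan
        running += step_up if g in members else -step_down
        if abs(running) > abs(best):
            best = running  # keep the maximum deviation from zero
    return best
```

    A positive score indicates that the set is concentrated near the top of the ranking and a negative score near the bottom. Significance estimation by permutation, which the abstract identifies as the expensive step, is not covered by this sketch.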

  12. Identification and validation of differentially expressed transcripts by RNA-sequencing of formalin-fixed, paraffin-embedded (FFPE) lung tissue from patients with Idiopathic Pulmonary Fibrosis.

    PubMed

    Vukmirovic, Milica; Herazo-Maya, Jose D; Blackmon, John; Skodric-Trifunovic, Vesna; Jovanovic, Dragana; Pavlovic, Sonja; Stojsic, Jelena; Zeljkovic, Vesna; Yan, Xiting; Homer, Robert; Stefanovic, Branko; Kaminski, Naftali

    2017-01-12

    Idiopathic Pulmonary Fibrosis (IPF) is a lethal lung disease of unknown etiology. A major limitation in transcriptomic profiling of lung tissue in IPF has been a dependence on snap-frozen fresh tissues (FF). In this project we sought to determine whether genome scale transcript profiling using RNA Sequencing (RNA-Seq) could be applied to archived Formalin-Fixed Paraffin-Embedded (FFPE) IPF tissues. We isolated total RNA from 7 IPF and 5 control FFPE lung tissues and performed 50 base pair paired-end sequencing on Illumina HiSeq 2000. TopHat2 was used to map sequencing reads to the human genome. On average ~62 million reads (53.4% of ~116 million reads) were mapped per sample. 4,131 genes were differentially expressed between IPF and controls (1,920 increased and 2,211 decreased; FDR < 0.05). We compared our results to differentially expressed genes calculated from a previously published dataset generated from FF tissues analyzed on Agilent microarrays (GSE47460). The overlap of differentially expressed genes was very high (760 increased and 1,413 decreased, FDR < 0.05). Only 92 differentially expressed genes changed in opposite directions. Pathway enrichment analysis performed using MetaCore confirmed numerous IPF relevant genes and pathways including extracellular remodeling, TGF-beta, and WNT. Gene network analysis of MMP7, a highly differentially expressed gene in both datasets, revealed the same canonical pathways and gene network candidates in RNA-Seq and microarray data. For validation by NanoString nCounter® we selected 35 genes that had a fold change of 2 in at least one dataset (10 discordant, 10 significantly differentially expressed in one dataset only and 15 concordant genes). High concordance of fold change and FDR was observed for each type of the samples (FF vs FFPE) with both microarrays (r = 0.92) and RNA-Seq (r = 0.90) and the number of discordant genes was reduced to four. 
Our results demonstrate that RNA sequencing of RNA obtained from archived FFPE lung tissues is feasible. The results obtained from FFPE tissue are highly comparable to FF tissues. The ability to perform RNA-Seq on archived FFPE IPF tissues should greatly enhance the availability of tissue biopsies for research in IPF.

  13. MiSTIC, an integrated platform for the analysis of heterogeneity in large tumour transcriptome datasets

    PubMed Central

    Sargeant, Tobias; Laperrière, David; Ismail, Houssam; Boucher, Geneviève; Rozendaal, Marieke; Lavallée, Vincent-Philippe; Ashton-Beaucage, Dariel; Wilhelm, Brian; Hébert, Josée; Hilton, Douglas J.

    2017-01-01

    Abstract Genome-wide transcriptome profiling has enabled non-supervised classification of tumours, revealing different sub-groups characterized by specific gene expression features. However, the biological significance of these subtypes remains for the most part unclear. We describe herein an interactive platform, Minimum Spanning Trees Inferred Clustering (MiSTIC), that integrates the direct visualization and comparison of the gene correlation structure between datasets, the analysis of the molecular causes underlying co-variations in gene expression in cancer samples, and the clinical annotation of tumour sets defined by the combined expression of selected biomarkers. We have used MiSTIC to highlight the roles of specific transcription factors in breast cancer subtype specification, to compare the aspects of tumour heterogeneity targeted by different prognostic signatures, and to highlight biomarker interactions in AML. A version of MiSTIC preloaded with datasets described herein can be accessed through a public web server (http://mistic.iric.ca); in addition, the MiSTIC software package can be obtained (github.com/iric-soft/MiSTIC) for local use with personalized datasets. PMID:28472340

  14. Low-rank regularization for learning gene expression programs.

    PubMed

    Ye, Guibo; Tang, Mengfan; Cai, Jian-Feng; Nie, Qing; Xie, Xiaohui

    2013-01-01

    Learning gene expression programs directly from a set of observations is challenging due to the complexity of gene regulation, high noise of experimental measurements, and insufficient number of experimental measurements. Imposing additional constraints with strong and biologically motivated regularizations is critical in developing reliable and effective algorithms for inferring gene expression programs. Here we propose a new form of regulation that constrains the number of independent connectivity patterns between regulators and targets, motivated by the modular design of gene regulatory programs and the belief that the total number of independent regulatory modules should be small. We formulate a multi-target linear regression framework to incorporate this type of regulation, in which the number of independent connectivity patterns is expressed as the rank of the connectivity matrix between regulators and targets. We then generalize the linear framework to nonlinear cases, and prove that the generalized low-rank regularization model is still convex. Efficient algorithms are derived to solve both the linear and nonlinear low-rank regularized problems. Finally, we test the algorithms on three gene expression datasets, and show that the low-rank regularization improves the accuracy of gene expression prediction in these three datasets.

  15. Preliminary characterization of IL32 in basal-like/triple negative compared to other types of breast cell lines and tissues

    PubMed Central

    2014-01-01

    Background Triple negative breast cancer (TNBC) and often basal-like cancers are defined as negative for estrogen receptor, progesterone receptor and Her2 gene expression. Over the past few years an incredible amount of data has been generated defining the molecular characteristics of both cancers. The aim of these studies is to better understand the cancers and identify genes and molecular pathways that might be useful as targeted therapies. In an attempt to contribute to the understanding of basal-like/TNBC, we examined the Gene Expression Omnibus (GEO) public datasets in search of genes that might define basal-like/TNBC. The IL32 gene was identified as a candidate. Findings Analysis of several GEO datasets showed differential expression of IL32 in patient samples previously designated as basal and/or TNBC compared to normal and luminal breast samples. As validation of the GEO results, RNA and protein expression levels were examined using MCF7 and MDA MB231 cell lines and tissue microarrays (TMAs). IL32 gene expression levels were higher in MDA MB231 compared to MCF7. Analysis of TMAs showed 42% of TNBC tissues and 25% of the non-TNBC were positive for IL32, while non-malignant patient samples and all but one hyperplastic tissue sample demonstrated lower levels of IL32 protein expression. Conclusion Data obtained from several publicly available GEO datasets showed overexpression of the IL32 gene in basal-like/TNBC samples compared to normal and luminal samples. In support of these data, analysis of TMA clinical samples demonstrated a particular pattern of IL32 differential expression. Considered together, these data suggest IL32 is a candidate suitable for further study. PMID:25100201

  16. A Meta-Analysis: Identification of Common Mir-145 Target Genes that have Similar Behavior in Different GEO Datasets.

    PubMed

    Pashaei, Elnaz; Guzel, Esra; Ozgurses, Mete Emir; Demirel, Goksun; Aydin, Nizamettin; Ozen, Mustafa

    MicroRNAs, which are small regulatory RNAs, post-transcriptionally regulate gene expression by binding the 3'-UTR of their mRNA targets. Their deregulation has been shown to cause increased proliferation, migration, invasion, and apoptosis. miR-145, an important tumor suppressor microRNA, has been shown to be downregulated in many cancer types and has crucial roles in tumor initiation, progression, metastasis, invasion, recurrence, and chemo-radioresistance. Our aim is to investigate potential common target genes of miR-145, and to help in understanding the underlying molecular pathways of tumor pathogenesis in association with those common target genes. Eight published microarray datasets, where targets of mir-145 were investigated in cell lines upon mir-145 over expression, were included in this study for meta-analysis. Inter group variabilities were assessed by box-plot analysis. Microarray datasets were analyzed using the GEOquery package in Bioconductor 3.2 with R version 3.2.2 and two-way Hierarchical Clustering was used for gene expression data analysis. Meta-analysis of different GEO datasets showed that UNG, FUCA2, DERA, GMFB, TF, and SNX2 were commonly downregulated genes, whereas MYL9 and TAGLN were found to be commonly upregulated upon mir-145 over expression in prostate, breast, esophageal, bladder cancer, and head and neck squamous cell carcinoma. Biological process, molecular function, and pathway analysis of these potential targets of mir-145 through functional enrichments in PPI network demonstrated that those genes are significantly involved in telomere maintenance, DNA binding and repair mechanisms. In conclusion, our results indicated that mir-145, through targeting its common potential targets, may significantly contribute to tumor pathogenesis in distinct cancer types and might serve as an important target for cancer therapy.

  17. Gene expression inference with deep learning.

    PubMed

    Chen, Yifei; Li, Yi; Narayan, Rajiv; Subramanian, Aravind; Xie, Xiaohui

    2016-06-15

    Large-scale gene expression profiling has been widely used to characterize cellular states in response to various disease conditions, genetic perturbations, etc. Although the cost of whole-genome expression profiles has been dropping steadily, generating a compendium of expression profiling over thousands of samples is still very expensive. Recognizing that gene expressions are often highly correlated, researchers from the NIH LINCS program have developed a cost-effective strategy of profiling only ∼1000 carefully selected landmark genes and relying on computational methods to infer the expression of remaining target genes. However, the computational approach adopted by the LINCS program is currently based on linear regression (LR), limiting its accuracy since it does not capture complex nonlinear relationship between expressions of genes. We present a deep learning method (abbreviated as D-GEX) to infer the expression of target genes from the expression of landmark genes. We used the microarray-based Gene Expression Omnibus dataset, consisting of 111K expression profiles, to train our model and compare its performance to those from other methods. In terms of mean absolute error averaged across all genes, deep learning significantly outperforms LR with 15.33% relative improvement. A gene-wise comparative analysis shows that deep learning achieves lower error than LR in 99.97% of the target genes. We also tested the performance of our learned model on an independent RNA-Seq-based GTEx dataset, which consists of 2921 expression profiles. Deep learning still outperforms LR with 6.57% relative improvement, and achieves lower error in 81.31% of the target genes. D-GEX is available at https://github.com/uci-cbcl/D-GEX. Contact: xhx@ics.uci.edu. Supplementary data are available at Bioinformatics online.

  18. Gene expression inference with deep learning

    PubMed Central

    Chen, Yifei; Li, Yi; Narayan, Rajiv; Subramanian, Aravind; Xie, Xiaohui

    2016-01-01

    Motivation: Large-scale gene expression profiling has been widely used to characterize cellular states in response to various disease conditions, genetic perturbations, etc. Although the cost of whole-genome expression profiles has been dropping steadily, generating a compendium of expression profiling over thousands of samples is still very expensive. Recognizing that gene expressions are often highly correlated, researchers from the NIH LINCS program have developed a cost-effective strategy of profiling only ∼1000 carefully selected landmark genes and relying on computational methods to infer the expression of remaining target genes. However, the computational approach adopted by the LINCS program is currently based on linear regression (LR), limiting its accuracy since it does not capture complex nonlinear relationship between expressions of genes. Results: We present a deep learning method (abbreviated as D-GEX) to infer the expression of target genes from the expression of landmark genes. We used the microarray-based Gene Expression Omnibus dataset, consisting of 111K expression profiles, to train our model and compare its performance to those from other methods. In terms of mean absolute error averaged across all genes, deep learning significantly outperforms LR with 15.33% relative improvement. A gene-wise comparative analysis shows that deep learning achieves lower error than LR in 99.97% of the target genes. We also tested the performance of our learned model on an independent RNA-Seq-based GTEx dataset, which consists of 2921 expression profiles. Deep learning still outperforms LR with 6.57% relative improvement, and achieves lower error in 81.31% of the target genes. Availability and implementation: D-GEX is available at https://github.com/uci-cbcl/D-GEX. Contact: xhx@ics.uci.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:26873929

  19. An efficient method to identify differentially expressed genes in microarray experiments

    PubMed Central

    Qin, Huaizhen; Feng, Tao; Harding, Scott A.; Tsai, Chung-Jui; Zhang, Shuanglin

    2013-01-01

    Motivation Microarray experiments typically analyze thousands to tens of thousands of genes from small numbers of biological replicates. The fact that genes are normally expressed in functionally relevant patterns suggests that gene-expression data can be stratified and clustered into relatively homogenous groups. Cluster-wise dimensionality reduction should make it feasible to improve screening power while minimizing information loss. Results We propose a powerful and computationally simple method for finding differentially expressed genes in small microarray experiments. The method incorporates a novel stratification-based tight clustering algorithm, principal component analysis and information pooling. Comprehensive simulations show that our method is substantially more powerful than the popular SAM and eBayes approaches. We applied the method to three real microarray datasets: one from a Populus nitrogen stress experiment with 3 biological replicates; and two from public microarray datasets of human cancers with 10 to 40 biological replicates. In all three analyses, our method proved more robust than the popular alternatives for identification of differentially expressed genes. Availability The C++ code to implement the proposed method is available upon request for academic use. PMID:18453554

  20. Array data extractor (ADE): a LabVIEW program to extract and merge gene array data.

    PubMed

    Kurtenbach, Stefan; Kurtenbach, Sarah; Zoidl, Georg

    2013-12-01

    Large data sets from gene expression array studies are publicly available offering information highly valuable for research across many disciplines ranging from fundamental to clinical research. Highly advanced bioinformatics tools have been made available to researchers, but a demand for user-friendly software allowing researchers to quickly extract expression information for multiple genes from multiple studies persists. Here, we present a user-friendly LabVIEW program to automatically extract gene expression data for a list of genes from multiple normalized microarray datasets. Functionality was tested for 288 class A G protein-coupled receptors (GPCRs) and expression data from 12 studies comparing normal and diseased human hearts. Results confirmed known regulation of a beta 1 adrenergic receptor and further indicate novel research targets. Although existing software allows for complex data analyses, the LabVIEW based program presented here, "Array Data Extractor (ADE)", provides users with a tool to retrieve meaningful information from multiple normalized gene expression datasets in a fast and easy way. Further, the graphical programming language used in LabVIEW allows applying changes to the program without the need of advanced programming knowledge.

  1. Clinical Value of Prognosis Gene Expression Signatures in Colorectal Cancer: A Systematic Review

    PubMed Central

    Cordero, David; Riccadonna, Samantha; Solé, Xavier; Crous-Bou, Marta; Guinó, Elisabet; Sanjuan, Xavier; Biondo, Sebastiano; Soriano, Antonio; Jurman, Giuseppe; Capella, Gabriel; Furlanello, Cesare; Moreno, Victor

    2012-01-01

    Introduction The traditional staging system is inadequate to identify those patients with stage II colorectal cancer (CRC) at high risk of recurrence or with stage III CRC at low risk. A number of gene expression signatures to predict CRC prognosis have been proposed, but none is routinely used in the clinic. The aim of this work was to assess the prediction ability and potential clinical usefulness of these signatures in a series of independent datasets. Methods A literature review identified 31 gene expression signatures that used gene expression data to predict prognosis in CRC tissue. The search was based on the PubMed database and was restricted to papers published from January 2004 to December 2011. Eleven CRC gene expression datasets with outcome information were identified and downloaded from public repositories. Random Forest classifier was used to build predictors from the gene lists. Matthews correlation coefficient was chosen as a measure of classification accuracy and its associated p-value was used to assess association with prognosis. For clinical usefulness evaluation, positive and negative post-tests probabilities were computed in stage II and III samples. Results Five gene signatures showed significant association with prognosis and provided reasonable prediction accuracy in their own training datasets. Nevertheless, all signatures showed low reproducibility in independent data. Stratified analyses by stage or microsatellite instability status showed significant association but limited discrimination ability, especially in stage II tumors. From a clinical perspective, the most predictive signatures showed a minor but significant improvement over the classical staging system. Conclusions The published signatures show low prediction accuracy but moderate clinical usefulness. Although gene expression data may inform prognosis, better strategies for signature validation are needed to encourage their widespread use in the clinic. PMID:23145004
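    The Matthews correlation coefficient used above as the accuracy measure can be computed directly from the four confusion-matrix counts. The sketch below is a minimal, dependency-free illustration for binary labels; the function name `mcc` and the 0/1 label encoding are assumptions, and the review's own pipeline is not described in enough detail to reproduce here.

```python
from math import sqrt


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the coefficient is set to 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```

    The coefficient ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction), and unlike raw accuracy it stays informative when the two outcome classes are unbalanced.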

  2. Gene-Based Genome-Wide Association Analysis in European and Asian Populations Identified Novel Genes for Rheumatoid Arthritis.

    PubMed

    Zhu, Hong; Xia, Wei; Mo, Xing-Bo; Lin, Xiang; Qiu, Ying-Hua; Yi, Neng-Jun; Zhang, Yong-Hong; Deng, Fei-Yan; Lei, Shu-Feng

    2016-01-01

    Rheumatoid arthritis (RA) is a complex autoimmune disease. Using a gene-based association research strategy, the present study aims to detect unknown susceptibility to RA and to address the ethnic differences in genetic susceptibility to RA between European and Asian populations. Gene-based association analyses were performed with KGG 2.5 by using publicly available large RA datasets (14,361 RA cases and 43,923 controls of European subjects, 4,873 RA cases and 17,642 controls of Asian Subjects). For the newly identified RA-associated genes, gene set enrichment analyses and protein-protein interactions analyses were carried out with DAVID and STRING version 10.0, respectively. Differential expression verification was conducted using 4 GEO datasets. The expression levels of three selected 'highly verified' genes were measured by ELISA among our in-house RA cases and controls. A total of 221 RA-associated genes were newly identified by gene-based association study, including 71 'overlapped', 76 'European-specific' and 74 'Asian-specific' genes. Among them, 105 genes had significant differential expressions between RA patients and healthy controls in at least one dataset, especially for 20 genes including 11 'overlapped' (ABCF1, FLOT1, HLA-F, IER3, TUBB, ZKSCAN4, BTN3A3, HSP90AB1, CUTA, BRD2, HLA-DMA), 5 'European-specific' (PHTF1, RPS18, BAK1, TNFRSF14, SUOX) and 4 'Asian-specific' (RNASET2, HFE, BTN2A2, MAPK13) genes whose differential expressions were significant in at least three datasets. The protein expressions of two selected genes FLOT1 (P value = 1.70E-02) and HLA-DMA (P value = 4.70E-02) in plasma were significantly different in our in-house samples. Our study identified 221 novel RA-associated genes and especially highlighted the importance of 20 candidate genes on RA. The results addressed ethnic genetic background differences for RA susceptibility between European and Asian populations and detected a long list of overlapped or ethnic specific RA genes. 
The study not only greatly increases our understanding of genetic susceptibility to RA, but also provides important insights into the ethno-genetic homogeneity and heterogeneity of RA in both ethnicities.

  3. Harnessing Diversity towards the Reconstructing of Large Scale Gene Regulatory Networks

    PubMed Central

    Yamanaka, Ryota; Kitano, Hiroaki

    2013-01-01

    Elucidating gene regulatory networks (GRNs) from large scale experimental data remains a central challenge in systems biology. Recently, numerous techniques, particularly consensus driven approaches combining different algorithms, have become a potentially promising strategy to infer accurate GRNs. Here, we develop a novel consensus inference algorithm, TopkNet, that can integrate multiple algorithms to infer GRNs. Comprehensive performance benchmarking on a cloud computing framework demonstrated that (i) a simple strategy to combine many algorithms does not always lead to performance improvement compared to the cost of consensus and (ii) TopkNet integrating only high-performance algorithms provides significant performance improvement compared to the best individual algorithms and community prediction. These results suggest that a priori determination of high-performance algorithms is a key to reconstructing an unknown regulatory network. Similarity among gene-expression datasets can be useful to determine potential optimal algorithms for reconstruction of unknown regulatory networks, i.e., if expression data associated with a known regulatory network is similar to that associated with an unknown regulatory network, optimal algorithms determined for the known regulatory network can be repurposed to infer the unknown regulatory network. Based on this observation, we developed a quantitative measure of similarity among gene-expression datasets and demonstrated that, if similarity between the two expression datasets is high, TopkNet integrating algorithms that are optimal for the known dataset performs well on the unknown dataset. The consensus framework, TopkNet, together with the similarity measure proposed in this study provides a powerful strategy towards harnessing the wisdom of the crowds in reconstruction of unknown regulatory networks. PMID:24278007

  4. Two-pass imputation algorithm for missing value estimation in gene expression time series.

    PubMed

    Tsiporkova, Elena; Boeva, Veselka

    2007-10-01

    Gene expression microarray experiments frequently generate datasets with multiple values missing. However, most of the analysis, mining, and classification methods for gene expression data require a complete matrix of gene array values. Therefore, the accurate estimation of missing values in such datasets has been recognized as an important issue, and several imputation algorithms have already been proposed to the biological community. Most of these approaches, however, are not particularly suitable for time series expression profiles. In view of this, we propose a novel imputation algorithm, which is specially suited for the estimation of missing values in gene expression time series data. The algorithm utilizes Dynamic Time Warping (DTW) distance in order to measure the similarity between time expression profiles, and subsequently selects for each gene expression profile with missing values a dedicated set of candidate profiles for estimation. Three different DTW-based imputation (DTWimpute) algorithms have been considered: position-wise, neighborhood-wise, and two-pass imputation. These have initially been prototyped in Perl, and their accuracy has been evaluated on yeast expression time series data using several different parameter settings. The experiments have shown that the two-pass algorithm consistently outperforms, in particular for datasets with a higher level of missing entries, the neighborhood-wise and the position-wise algorithms. The performance of the two-pass DTWimpute algorithm has further been benchmarked against the weighted K-Nearest Neighbors algorithm, which is widely used in the biological community; the former algorithm has appeared superior to the latter one. Motivated by these findings, indicating clearly the added value of the DTW techniques for missing value estimation in time series data, we have built an optimized C++ implementation of the two-pass DTWimpute algorithm. 
The software also provides for a choice between three different initial rough imputation methods.
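    The Dynamic Time Warping distance that the abstract above relies on can be sketched with the classic dynamic-programming recurrence. This is only an illustration of the distance itself, assuming an absolute-difference local cost and no warping-window constraint; it is not the DTWimpute software, and the function name is hypothetical.

```python
def dtw_distance(a, b):
    """Classic DTW distance between two numeric sequences.

    Uses |x - y| as the local cost and keeps only two rows of the
    cumulative-cost table, so memory is O(len(b)).
    """
    inf = float("inf")
    prev = [inf] * (len(b) + 1)
    prev[0] = 0.0  # empty prefix aligned with empty prefix costs nothing
    for x in a:
        cur = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            cost = abs(x - y)
            # extend the cheapest of: insertion, deletion, or match
            cur[j] = cost + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[len(b)]
```

    Because the alignment may stretch or compress the time axis, two expression profiles with the same shape but shifted timing obtain a small distance, which is what makes the measure attractive for selecting candidate profiles in time series data.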

  5. Arabidopsis Gene Family Profiler (aGFP)--user-oriented transcriptomic database with easy-to-use graphic interface.

    PubMed

    Dupľáková, Nikoleta; Renák, David; Hovanec, Patrik; Honysová, Barbora; Twell, David; Honys, David

    2007-07-23

    Microarray technologies now belong to the standard functional genomics toolbox and have undergone massive development leading to increased genome coverage, accuracy and reliability. The number of experiments exploiting microarray technology has markedly increased in recent years. In parallel with the rapid accumulation of transcriptomic data, on-line analysis tools are being introduced to simplify their use. Global statistical data analysis methods contribute to the development of overall concepts about gene expression patterns and to query and compose working hypotheses. More recently, these applications are being supplemented with more specialized products offering visualization and specific data mining tools. We present a curated gene family-oriented gene expression database, Arabidopsis Gene Family Profiler (aGFP; http://agfp.ueb.cas.cz), which gives the user access to a large collection of normalised Affymetrix ATH1 microarray datasets. The database currently contains NASC Array and AtGenExpress transcriptomic datasets for various tissues at different developmental stages of wild type plants gathered from nearly 350 gene chips. The Arabidopsis GFP database has been designed as an easy-to-use tool for users needing an easily accessible resource for expression data of single genes, pre-defined gene families or custom gene sets, with the further possibility of keyword search. Arabidopsis Gene Family Profiler presents a user-friendly web interface using both graphic and text output. Data are stored at the MySQL server and individual queries are created in PHP script. 
The most distinctive features of the Arabidopsis Gene Family Profiler database are: 1) the presentation of normalized datasets (Affymetrix MAS algorithm and calculation of model-based gene-expression values based on the Perfect Match-only model); 2) the choice between two different normalization algorithms (Affymetrix MAS4 or MAS5 algorithms); 3) an intuitive interface; 4) an interactive "virtual plant" visualizing the spatial and developmental expression profiles of both gene families and individual genes. Arabidopsis GFP gives users the possibility to analyze current Arabidopsis developmental transcriptomic data starting with simple global queries that can be expanded and further refined to visualize comparative and highly selective gene expression profiles.

  6. Impact of sequencing depth and read length on single cell RNA sequencing data of T cells.

    PubMed

    Rizzetto, Simone; Eltahla, Auda A; Lin, Peijie; Bull, Rowena; Lloyd, Andrew R; Ho, Joshua W K; Venturi, Vanessa; Luciani, Fabio

    2017-10-06

    Single cell RNA sequencing (scRNA-seq) provides great potential in measuring the gene expression profiles of heterogeneous cell populations. In immunology, scRNA-seq allowed the characterisation of transcript sequence diversity of functionally relevant T cell subsets, and the identification of the full length T cell receptor (TCRαβ), which defines the specificity against cognate antigens. Several factors, e.g. RNA library capture, cell quality, and sequencing output, affect the quality of scRNA-seq data. We studied the effects of read length and sequencing depth on the quality of gene expression profiles, cell type identification, and TCRαβ reconstruction, utilising 1,305 single cells from 8 publicly available scRNA-seq datasets, and simulation-based analyses. Gene expression was characterised by an increased number of unique genes identified with short read lengths (<50 bp), but these featured higher technical variability compared to profiles from longer reads. Successful TCRαβ reconstruction was achieved for 6 datasets (81%-100%) with at least 0.25 million (PE) reads of length >50 bp, while it failed for datasets with <30 bp reads. Sufficient read length and sequencing depth can control technical noise to enable accurate identification of TCRαβ and gene expression profiles from scRNA-seq data of T cells.

  7. An RNA-Seq based gene expression atlas of the common bean.

    PubMed

    O'Rourke, Jamie A; Iniguez, Luis P; Fu, Fengli; Bucciarelli, Bruna; Miller, Susan S; Jackson, Scott A; McClean, Philip E; Li, Jun; Dai, Xinbin; Zhao, Patrick X; Hernandez, Georgina; Vance, Carroll P

    2014-10-06

    Common bean (Phaseolus vulgaris) is grown throughout the world and comprises roughly 50% of the grain legumes consumed worldwide. Despite this, genetic resources for common beans have been lacking. Next-generation sequencing has facilitated our investigation of the gene expression profiles associated with biologically important traits in common bean. An increased understanding of gene expression in common bean will improve our understanding of gene expression patterns in other legume species. Combining recently developed genomic resources for Phaseolus vulgaris, including predicted gene calls, with RNA-Seq technology, we measured the gene expression patterns from 24 samples collected from seven tissues at developmentally important stages and from three nitrogen treatments. Gene expression patterns throughout the plant were analyzed to better understand changes due to nodulation, seed development, and nitrogen utilization. We have identified 11,010 genes differentially expressed with a fold change ≥ 2 and a P-value < 0.05 between different tissues at the same time point, 15,752 genes differentially expressed within a tissue due to changes in development, and 2,315 genes expressed only in a single tissue. These analyses identified 2,970 genes with expression patterns that appear to be directly dependent on the source of available nitrogen. Finally, we have assembled these data in a publicly available database, The Phaseolus vulgaris Gene Expression Atlas (Pv GEA), http://plantgrn.noble.org/PvGEA/ . Using the website, researchers can query gene expression profiles of their gene of interest, search for genes expressed in different tissues, or download the dataset in a tabular form. These data provide the basis for a gene expression atlas, which will facilitate functional genomic studies in common bean. 
Analysis of this dataset has identified genes important in regulating seed composition and has increased our understanding of nodulation and impact of the nitrogen source on assimilation and distribution throughout the plant.
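
    The selection rule quoted above (fold change ≥ 2 and P-value < 0.05) can be written as a simple filter. This is a sketch under stated assumptions: the function name is hypothetical, fold change is taken symmetrically (up or down), and non-positive expression values are rejected; the abstract does not say how the authors handled either point or how the P-values were computed.

```python
def is_differentially_expressed(expr_a, expr_b, p_value, min_fold=2.0, alpha=0.05):
    """Apply the two thresholds from the abstract to one gene (hypothetical helper)."""
    # Assumption: a ratio is undefined for non-positive expression, so such genes are not called
    if expr_a <= 0 or expr_b <= 0:
        return False
    # Assumption: a change in either direction counts toward the fold-change threshold
    fold = max(expr_a / expr_b, expr_b / expr_a)
    return fold >= min_fold and p_value < alpha

# Toy values, invented for illustration only
calls = [
    is_differentially_expressed(10.0, 5.0, 0.01),
    is_differentially_expressed(10.0, 6.0, 0.01),
    is_differentially_expressed(5.0, 10.0, 0.01),
]
```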

  8. A Gene Expression Profile of BRCAness That Predicts for Responsiveness to Platinum and PARP Inhibitors

    DTIC Science & Technology

    2017-02-01

    To) 15 July 2010 – 2 Nov.2016 4 . TITLE AND SUBTITLE A Gene Expression Profile of BRCAness That Predicts for Responsiveness to Platinum and PARP...resistance in vitro, and to investigate the mechanism for this effect. The major goal for Aim 4 was to determine the reproducibility of the BRCAness...we used the epithelial ovarian cancer (EOC) dataset from The Cancer Genome Atlas (TCGA) ( 4 ). The TCGA dataset is a unique tool for these studies as

  9. Co-LncRNA: investigating the lncRNA combinatorial effects in GO annotations and KEGG pathways based on human RNA-Seq data

    PubMed Central

    Zhao, Zheng; Bai, Jing; Wu, Aiwei; Wang, Yuan; Zhang, Jinwen; Wang, Zishan; Li, Yongsheng; Xu, Juan; Li, Xia

    2015-01-01

    Long non-coding RNAs (lncRNAs) are emerging as key regulators of diverse biological processes and diseases. However, the combinatorial effects of these molecules in a specific biological function are poorly understood. Identifying co-expressed protein-coding genes of lncRNAs would provide ample insight into lncRNA functions. To facilitate such an effort, we have developed Co-LncRNA, which is a web-based computational tool that allows users to identify GO annotations and KEGG pathways that may be affected by co-expressed protein-coding genes of a single or multiple lncRNAs. LncRNA co-expressed protein-coding genes were first identified in publicly available human RNA-Seq datasets, including 241 datasets across 6560 total individuals representing 28 tissue types/cell lines. Then, the lncRNA combinatorial effects in a given GO annotations or KEGG pathways are taken into account by the simultaneous analysis of multiple lncRNAs in user-selected individual or multiple datasets, which is realized by enrichment analysis. In addition, this software provides a graphical overview of pathways that are modulated by lncRNAs, as well as a specific tool to display the relevant networks between lncRNAs and their co-expressed protein-coding genes. Co-LncRNA also supports users in uploading their own lncRNA and protein-coding gene expression profiles to investigate the lncRNA combinatorial effects. It will be continuously updated with more human RNA-Seq datasets on an annual basis. Taken together, Co-LncRNA provides a web-based application for investigating lncRNA combinatorial effects, which could shed light on their biological roles and could be a valuable resource for this community. Database URL: http://www.bio-bigdata.com/Co-LncRNA/ PMID:26363020

  10. SOURCES OF VARIATION IN BASELINE GENE EXPRESSION LEVELS FROM TOXICOGENOMIC STUDY CONTROL ANIMALS ACROSS MULTIPLE LABORATORIES

    EPA Science Inventory

    Variations in study design are typical for toxicogenomic studies, but their impact on gene expression in control animals has not been well characterized. A dataset of control animal microarray expression data was assembled by a working group of the Health and Environmental Scienc...

  11. Classification of Time Series Gene Expression in Clinical Studies via Integration of Biological Network

    PubMed Central

    Qian, Liwei; Zheng, Haoran; Zhou, Hong; Qin, Ruibin; Li, Jinlong

    2013-01-01

    The increasing availability of time series expression datasets, although promising, raises a number of new computational challenges. Accordingly, the development of suitable classification methods to make reliable and sound predictions is becoming a pressing issue. We propose, here, a new method to classify time series gene expression via integration of biological networks. We evaluated our approach on 2 different datasets and showed that the use of a hidden Markov model/Gaussian mixture models hybrid explores the time-dependence of the expression data, thereby leading to better prediction results. We demonstrated that the biclustering procedure identifies function-related genes as a whole, giving rise to high accordance in prognosis prediction across independent time series datasets. In addition, we showed that integration of biological networks into our method significantly improves prediction performance. Moreover, we compared our approach with several state-of-the-art algorithms and found that our method outperformed previous approaches with regard to various criteria. Finally, our approach achieved better prediction results on early-stage data, implying the potential of our method for practical prediction. PMID:23516469

  12. University of Texas Southwestern Medical Center: Functional Signature Ontology Tool: Triplicate Measurements of Reporter Gene Expression in Response to Individual Genetic and Chemical Perturbations in HCT116 Cells | Office of Cancer Genomics

    Cancer.gov

    The goal of this project is to use an eight-gene expression profile to define functional signatures for small molecules and natural products with heretofore undefined mechanism of action. Two genes in the eight gene set are used as internal controls and do not vary across gene expression array data collected from the public domain. The remaining six genes are found to vary independently across a large collection of publicly available gene expression array datasets.

  13. Identifying candidate genes for Type 2 Diabetes Mellitus and obesity through gene expression profiling in multiple tissues or cells.

    PubMed

    Chen, Junhui; Meng, Yuhuan; Zhou, Jinghui; Zhuo, Min; Ling, Fei; Zhang, Yu; Du, Hongli; Wang, Xiaoning

    2013-01-01

    Type 2 Diabetes Mellitus (T2DM) and obesity have become increasingly prevalent in recent years. Recent studies have focused on identifying causal variations or candidate genes for obesity and T2DM via analysis of expression quantitative trait loci (eQTL) within a single tissue. T2DM and obesity are affected by comprehensive sets of genes in multiple tissues. In the current study, gene expression levels in multiple human tissues from GEO datasets were analyzed, and 21 candidate genes displaying high percentages of differential expression were filtered out. Specifically, DENND1B, LYN, MRPL30, POC1B, PRKCB, RP4-655J12.3, HIBADH, and TMBIM4 were identified from the T2DM-control study, and BCAT1, BMP2K, CSRNP2, MYNN, NCKAP5L, SAP30BP, SLC35B4, SP1, BAP1, GRB14, HSP90AB1, ITGA5, and TOMM5 were identified from the obesity-control study. The majority of these genes are known to be involved in T2DM and obesity. Therefore, analysis of gene expression in various tissues using GEO datasets may be an effective and feasible method to determine novel or causal genes associated with T2DM and obesity.

  14. An efficient annotation and gene-expression derivation tool for Illumina Solexa datasets.

    PubMed

    Hosseini, Parsa; Tremblay, Arianne; Matthews, Benjamin F; Alkharouf, Nadim W

    2010-07-02

    An Illumina flow cell with all eight lanes occupied produces well over a terabyte of images, with gigabytes of reads following sequence alignment. The ability to translate such reads into meaningful annotation is therefore of great concern and importance. One can very easily be flooded with such a great volume of textual, unannotated data irrespective of read quality or size. CASAVA, an optional analysis tool for Illumina sequencing experiments, provides INDEL detection, SNP information, and allele calling. Extracting from such analysis a measure of gene expression in the form of tag-counts, and furthermore annotating the reads, is therefore of significant value. We developed TASE (Tag counting and Analysis of Solexa Experiments), a rapid tag-counting and annotation software tool specifically designed for Illumina CASAVA sequencing datasets. Developed in Java and deployed using the jTDS JDBC driver and a SQL Server backend, TASE provides an extremely fast means of calculating gene expression through tag-counts while annotating sequenced reads with the gene's presumed function, from any given CASAVA-build. Such a build is generated for both DNA and RNA sequencing. Analysis is broken into two distinct components: DNA sequence or read concatenation, followed by tag-counting and annotation. The end result is output containing the homology-based functional annotation and the respective gene expression measure, signifying how many times sequenced reads were found within the genomic ranges of functional annotations. TASE is a powerful tool to facilitate the process of annotating a given Illumina Solexa sequencing dataset. Our results indicate that both homology-based annotation and tag-count analysis are achieved in very efficient times, allowing researchers to delve deep into a given CASAVA-build and maximize information extraction from a sequencing dataset. 
TASE is specially designed to translate sequence data in a CASAVA-build into functional annotations while producing corresponding gene expression measurements. Achieving such analysis is executed in an ultrafast and highly efficient manner, whether the analysis be a single-read or paired-end sequencing experiment. TASE is a user-friendly and freely available application, allowing rapid analysis and annotation of any given Illumina Solexa sequencing dataset with ease.
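
    The tag-count measure described above (how many sequenced reads fall within the genomic range of each annotation) can be illustrated with a short sketch. This is not TASE, which the abstract describes as a Java tool with a SQL Server backend; the function name, the single-coordinate read representation, and the inclusive range convention are assumptions made for the example.

```python
from collections import Counter

def count_tags(read_positions, annotations):
    """Count reads falling inside each annotated range (hypothetical helper).

    read_positions: iterable of integer read coordinates on one sequence.
    annotations: list of (name, start, end) tuples with inclusive bounds (assumed).
    """
    # Start every annotation at zero so unhit annotations still appear in the result
    counts = Counter({name: 0 for name, _, _ in annotations})
    for pos in read_positions:
        for name, start, end in annotations:
            if start <= pos <= end:
                counts[name] += 1
    return dict(counts)

# Toy data, invented for illustration only
tag_counts = count_tags([5, 15, 20, 120, 500], [("geneA", 1, 50), ("geneB", 100, 200)])
```

    The nested loop is quadratic in the worst case; for realistic read volumes an interval index or a sorted sweep would be the more practical design.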

  15. Gene Expression Analysis to Assess the Relevance of Rodent Models to Human Lung Injury.

    PubMed

    Sweeney, Timothy E; Lofgren, Shane; Khatri, Purvesh; Rogers, Angela J

    2017-08-01

    The relevance of animal models to human diseases is an area of intense scientific debate. The degree to which mouse models of lung injury recapitulate human lung injury has never been assessed. Integrating data from both human and animal expression studies allows for increased statistical power and identification of conserved differential gene expression across organisms and conditions. We sought comprehensive integration of gene expression data in experimental acute lung injury (ALI) in rodents compared with humans. We performed two separate gene expression multicohort analyses to determine differential gene expression in experimental animal and human lung injury. We used correlational and pathway analyses combined with external in vitro gene expression data to identify both potential drivers of underlying inflammation and therapeutic drug candidates. We identified 21 animal lung tissue datasets and three human lung injury bronchoalveolar lavage datasets. We show that the metasignatures of animal and human experimental ALI are significantly correlated despite these widely varying experimental conditions. The gene expression changes among mice and rats across diverse injury models (ozone, ventilator-induced lung injury, LPS) are significantly correlated with human models of lung injury (Pearson r = 0.33-0.45, P < 1E-16). Neutrophil signatures are enriched in both animal and human lung injury. Predicted therapeutic targets, peptide ligand signatures, and pathway analyses are also all highly overlapping. Gene expression changes are similar in animal and human experimental ALI, and provide several physiologic and therapeutic insights to the disease.

  16. Hidden treasures in "ancient" microarrays: gene-expression portrays biology and potential resistance pathways of major lung cancer subtypes and normal tissue.

    PubMed

    Kerkentzes, Konstantinos; Lagani, Vincenzo; Tsamardinos, Ioannis; Vyberg, Mogens; Røe, Oluf Dimitri

    2014-01-01

    Novel statistical methods and increasingly more accurate gene annotations can transform "old" biological data into a renewed source of knowledge with potential clinical relevance. Here, we provide an in silico proof-of-concept by extracting novel information from a high-quality mRNA expression dataset, originally published in 2001, using state-of-the-art bioinformatics approaches. The dataset consists of histologically defined cases of lung adenocarcinoma (AD), squamous (SQ) cell carcinoma, small-cell lung cancer, carcinoid, metastasis (breast and colon AD), and normal lung specimens (203 samples in total). A battery of statistical tests was used for identifying differential gene expressions, diagnostic and prognostic genes, enriched gene ontologies, and signaling pathways. Our results showed that gene expressions faithfully recapitulate immunohistochemical subtype markers, as chromogranin A in carcinoids, cytokeratin 5, p63 in SQ, and TTF1 in non-squamous types. Moreover, biological information with putative clinical relevance was revealed as potentially novel diagnostic genes for each subtype with specificity 93-100% (AUC = 0.93-1.00). Cancer subtypes were characterized by (a) differential expression of treatment target genes as TYMS, HER2, and HER3 and (b) overrepresentation of treatment-related pathways like cell cycle, DNA repair, and ERBB pathways. The vascular smooth muscle contraction, leukocyte trans-endothelial migration, and actin cytoskeleton pathways were overexpressed in normal tissue. Reanalysis of this public dataset displayed the known biological features of lung cancer subtypes and revealed novel pathways of potentially clinical importance. The findings also support our hypothesis that even old omics data of high quality can be a source of significant biological information when appropriate bioinformatics methods are used.

  17. Emory University: High-Throughput Protein-Protein Interaction Dataset for Lung Cancer-Associated Genes | Office of Cancer Genomics

    Cancer.gov

    To discover novel PPI signaling hubs for lung cancer, the CTD2 Center at Emory utilized large-scale genomics datasets and literature to compile a set of lung cancer-associated genes. A library of expression vectors was generated for these genes and utilized for detecting pairwise PPIs with cell lysate-based TR-FRET assays in high-throughput screening format.

  18. MiSTIC, an integrated platform for the analysis of heterogeneity in large tumour transcriptome datasets.

    PubMed

    Lemieux, Sebastien; Sargeant, Tobias; Laperrière, David; Ismail, Houssam; Boucher, Geneviève; Rozendaal, Marieke; Lavallée, Vincent-Philippe; Ashton-Beaucage, Dariel; Wilhelm, Brian; Hébert, Josée; Hilton, Douglas J; Mader, Sylvie; Sauvageau, Guy

    2017-07-27

    Genome-wide transcriptome profiling has enabled non-supervised classification of tumours, revealing different sub-groups characterized by specific gene expression features. However, the biological significance of these subtypes remains for the most part unclear. We describe herein an interactive platform, Minimum Spanning Trees Inferred Clustering (MiSTIC), that integrates the direct visualization and comparison of the gene correlation structure between datasets, the analysis of the molecular causes underlying co-variations in gene expression in cancer samples, and the clinical annotation of tumour sets defined by the combined expression of selected biomarkers. We have used MiSTIC to highlight the roles of specific transcription factors in breast cancer subtype specification, to compare the aspects of tumour heterogeneity targeted by different prognostic signatures, and to highlight biomarker interactions in AML. A version of MiSTIC preloaded with datasets described herein can be accessed through a public web server (http://mistic.iric.ca); in addition, the MiSTIC software package can be obtained (github.com/iric-soft/MiSTIC) for local use with personalized datasets. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  19. Integrative sparse principal component analysis of gene expression data.

    PubMed

    Liu, Mengque; Fan, Xinyan; Fang, Kuangnan; Zhang, Qingzhao; Ma, Shuangge

    2017-12-01

    In the analysis of gene expression data, dimension reduction techniques have been extensively adopted. The most popular one is perhaps the PCA (principal component analysis). To generate more reliable and more interpretable results, the SPCA (sparse PCA) technique has been developed. With the "small sample size, high dimensionality" characteristic of gene expression data, the analysis results generated from a single dataset are often unsatisfactory. Under contexts other than dimension reduction, integrative analysis techniques, which jointly analyze the raw data of multiple independent datasets, have been developed and shown to outperform "classic" meta-analysis and other multidatasets techniques and single-dataset analysis. In this study, we conduct integrative analysis by developing the iSPCA (integrative SPCA) method. iSPCA achieves the selection and estimation of sparse loadings using a group penalty. To take advantage of the similarity across datasets and generate more accurate results, we further impose contrasted penalties. Different penalties are proposed to accommodate different data conditions. Extensive simulations show that iSPCA outperforms the alternatives under a wide spectrum of settings. The analysis of breast cancer and pancreatic cancer data further shows iSPCA's satisfactory performance. © 2017 WILEY PERIODICALS, INC.

  20. Unmasking Upstream Gene Expression Regulators with miRNA-corrected mRNA Data

    PubMed Central

    Bollmann, Stephanie; Bu, Dengpan; Wang, Jiaqi; Bionaz, Massimo

    2015-01-01

    Expressed micro-RNA (miRNA) affects messenger RNA (mRNA) abundance, hindering the accuracy of upstream regulator analysis. Our objective was to provide an algorithm to correct such bias. Large mRNA and miRNA analyses were performed on RNA extracted from bovine liver and mammary tissue. Using four levels of target scores from TargetScan (all miRNA:mRNA target gene pairs or only the top 25%, 50%, or 75%) and four levels of the magnitude of miRNA effect (ME) on mRNA expression (30%, 50%, 75%, and 83% mRNA reduction), we generated 17 different datasets (including the original dataset). For each dataset, we performed upstream regulator analysis using two bioinformatics tools. We detected an increased effect on the upstream regulator analysis with larger miRNA:mRNA pair bins and higher ME. The miRNA correction allowed identification of several upstream regulators not present in the analysis of the original dataset. Thus, the proposed algorithm improved the prediction of upstream regulators. PMID:27279737

  1. Combining Shapley value and statistics to the analysis of gene expression data in children exposed to air pollution

    PubMed Central

    Moretti, Stefano; van Leeuwen, Danitsja; Gmuender, Hans; Bonassi, Stefano; van Delft, Joost; Kleinjans, Jos; Patrone, Fioravante; Merlo, Domenico Franco

    2008-01-01

    Background In gene expression analysis, statistical tests for differential gene expression provide lists of candidate genes having, individually, a sufficiently low p-value. However, the interpretation of each single p-value within complex systems involving several interacting genes is problematic. In parallel, in the last sixty years, game theory has been applied to political and social problems to assess the power of interacting agents in forcing a decision and, more recently, to represent the relevance of genes in response to certain conditions. Results In this paper we introduce a Bootstrap procedure to test the null hypothesis that each gene has the same relevance between two conditions, where the relevance is represented by the Shapley value of a particular coalitional game defined on a microarray data-set. This method, which is called Comparative Analysis of Shapley value (CASh for short), is applied to data concerning the gene expression in children differentially exposed to air pollution. The results provided by CASh are compared with the results from a parametric statistical test for testing differential gene expression. Both lists of genes provided by CASh and t-test are informative enough to discriminate exposed subjects on the basis of their gene expression profiles. While many genes are selected in common by CASh and the parametric test, it turns out that the biological interpretation of the differences between these two selections is more interesting, suggesting a different interpretation of the main biological pathways in gene expression regulation for exposed individuals. A simulation study suggests that CASh offers more power than t-test for the detection of differential gene expression variability. Conclusion CASh is successfully applied to gene expression analysis of a data-set where the joint expression behavior of genes may be critical to characterize the expression response to air pollution. 
We demonstrate a synergistic effect between coalitional games and statistics that resulted in a selection of genes with a potential impact in the regulation of complex pathways. PMID:18764936
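
    For readers unfamiliar with the Shapley value on which CASh builds, the following sketch computes it exactly for a tiny coalitional game by averaging each player's marginal contribution over all orderings. It illustrates the general definition only: the microarray game and the Bootstrap test used by CASh are not specified in the abstract and are not reproduced, and the function name and the three-player toy game are assumptions.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: mean marginal contribution over all player orderings.

    value: function mapping a frozenset of players to the worth of that coalition.
    Feasible only for a handful of players, since the number of orderings is n!.
    """
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            grown = coalition | {p}
            # marginal contribution of p when joining the players that precede it
            totals[p] += value(grown) - value(coalition)
            coalition = grown
    return {p: t / len(orderings) for p, t in totals.items()}

# Toy game, invented for illustration: a coalition "wins" (worth 1) if it
# contains A together with at least one of B or C.
def toy_value(coalition):
    return 1.0 if "A" in coalition and ("B" in coalition or "C" in coalition) else 0.0

phi = shapley_values(["A", "B", "C"], toy_value)
```

    In this toy game A receives 2/3 and B and C receive 1/6 each, and the values sum to the worth of the full coalition, which is the efficiency property of the Shapley value.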

  2. From DNA Copy Number to Gene Expression: Local aberrations, Trisomies and Monosomies

    NASA Astrophysics Data System (ADS)

    Shay, Tal

    The goal of my PhD research was to study the effect of DNA copy number changes on gene expression. DNA copy number aberrations may be local, encompassing several genes, or on the level of an entire chromosome, such as trisomy and monosomy. The main dataset I studied was of Glioblastoma, obtained in the framework of a collaboration, but I worked also with public datasets of cancer and Down's Syndrome. The molecular basis of expression changes in Glioblastoma. Glioblastoma is the most common and aggressive type of primary brain tumors in adults. In collaboration with Prof. Hegi (CHUV, Switzerland), we analyzed a rich Glioblastoma dataset including clinical information, DNA copy number (array CGH) and expression profiles. We explored the correlation between DNA copy number and gene expression at the level of chromosomal arms and local genomic aberrations. We detected known amplification and over expression of oncogenes, as well as deletion and down-regulation of tumor suppressor genes. We exploited that information to map alterations of pathways that are known to be disrupted in Glioblastoma, and tried to characterize samples that have no known alteration in any of the studied pathways. Identifying local DNA aberrations of biological significance. Many types of tumors exhibit chromosomal losses or gains and local amplifications and deletions. A region that is aberrant in many tumors, or whose copy number change is stronger, is more likely to be clinically relevant, and not just a by-product of genetic instability. We developed a novel method that defines and prioritizes aberrations by formalizing these intuitions. The method scores each aberration by the fraction of patients harboring it, its length and its amplitude, and assesses the significance of the score by comparing it to a null distribution obtained by permutations. This approach detects genetic locations that are significantly aberrant, generating a 'genomic aberration profile' for each sample. 
The 'genomic aberration profile' is then combined with chromosomal arm status (gain/loss) to define a succinct genomic signature for each tumor. Unsupervised clustering of the samples based on these genomic signatures can reveal novel tumor subtypes. This approach was applied to datasets from three types of brain tumors: Glioblastoma, Medulloblastoma and Neuroblastoma, and identified a new subtype in Medulloblastoma, characterized by many chromosomal aberrations. Elucidating the transcriptional effect of monosomy and trisomy. Trisomy and monosomy are expected to impact the expression of genes that are located on the affected chromosome. Analysis of several cancer datasets revealed that not all the genes on the aberrant chromosome are affected by the change of copy number. Affected genes exhibit a wide range of expression changes with varying penetrance. Specifically, (1) The effect of trisomy is much more conserved among individuals than the effect of monosomy and (2) the expression level of a gene in the diploid is significantly correlated with the level of change between the diploid and the trisomy or monosomy.
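
    The idea of assessing an aberration score against a null distribution obtained by permutations can be sketched as below. This is an illustrative simplification, not the method of the thesis: here the score is only the fraction of samples aberrant at one position (length and amplitude are ignored), each sample's profile is shuffled independently, and the null statistic is the maximum score over all positions. All of these choices, and the function name, are assumptions.

```python
import random

def permutation_p_value(profiles, position, n_perm=200, seed=0):
    """Permutation p-value for recurrence of an aberration at one position.

    profiles: list of per-sample lists of 0/1 flags (1 = aberrant at that probe).
    """
    rng = random.Random(seed)
    n = len(profiles)
    # observed score: fraction of samples aberrant at the tested position
    observed = sum(row[position] for row in profiles) / n
    hits = 0
    for _ in range(n_perm):
        # shuffle probe positions within each sample, breaking any recurrence
        shuffled = [rng.sample(row, len(row)) for row in profiles]
        # assumed null statistic: best recurrence found anywhere after shuffling
        null_best = max(sum(col) / n for col in zip(*shuffled))
        if null_best >= observed:
            hits += 1
    # add-one correction keeps the estimate away from an impossible p-value of zero
    return (1 + hits) / (1 + n_perm)

# Toy data, invented for illustration: 8 samples, 20 probes, all aberrant at probe 5
toy = [[1 if i == 5 else 0 for i in range(20)] for _ in range(8)]
p_recurrent = permutation_p_value(toy, position=5)
p_background = permutation_p_value(toy, position=0)
```

    Taking the maximum over positions in the null makes the test conservative across the whole profile; a per-position null would be a different and equally plausible design.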

  3. Defining global neuroendocrine gene expression patterns associated with reproductive seasonality in fish.

    PubMed

    Zhang, Dapeng; Xiong, Huiling; Mennigen, Jan A; Popesku, Jason T; Marlatt, Vicki L; Martyniuk, Christopher J; Crump, Kate; Cossins, Andrew R; Xia, Xuhua; Trudeau, Vance L

    2009-06-05

    Many vertebrates, including the goldfish, exhibit seasonal reproductive rhythms, which are a result of interactions between external environmental stimuli and internal endocrine systems in the hypothalamo-pituitary-gonadal axis. While it has long been believed that differential expression of neuroendocrine genes contributes to establishing seasonal reproductive rhythms, no systems-level investigation has yet been conducted. In the present study, by analyzing multiple female goldfish brain microarray datasets, we have characterized global gene expression patterns for a seasonal cycle. A core set of genes (873 genes) in the hypothalamus were identified to be differentially expressed between May, August and December, which correspond to physiologically distinct stages that are sexually mature (prespawning), sexual regression, and early gonadal redevelopment, respectively. Expression changes of these genes are also shared by another brain region, the telencephalon, as revealed by multivariate analysis. More importantly, by examining one dataset obtained from fish in October that were kept under a long-daylength photoperiod (16 h) typical of the springtime breeding season (May), we observed that the expression of identified genes appears regulated by photoperiod, a major factor controlling vertebrate reproductive cyclicity. Gene ontology analysis revealed that hormone genes and genes functionally involved in G-protein coupled receptor signaling pathway and transmission of nerve impulses are significantly enriched in an expression pattern, whose transition is located between prespawning and sexually regressed stages. The existence of seasonal expression patterns was verified for several genes including isotocin, ependymin II, GABA(A) gamma2 receptor, calmodulin, and aromatase b by independent samplings of goldfish brains from six seasonal time points and real-time PCR assays. 
    Using both theoretical and experimental strategies, we report for the first time global gene expression patterns throughout a breeding season which may account for dynamic neuroendocrine regulation of seasonal reproductive development.

  4. Defining Global Neuroendocrine Gene Expression Patterns Associated with Reproductive Seasonality in Fish

    PubMed Central

    Mennigen, Jan A.; Popesku, Jason T.; Marlatt, Vicki L.; Martyniuk, Christopher J.; Crump, Kate; Cossins, Andrew R.; Xia, Xuhua; Trudeau, Vance L.

    2009-01-01

    Background: Many vertebrates, including the goldfish, exhibit seasonal reproductive rhythms, which are a result of interactions between external environmental stimuli and internal endocrine systems in the hypothalamo-pituitary-gonadal axis. While it is long believed that differential expression of neuroendocrine genes contributes to establishing seasonal reproductive rhythms, no systems-level investigation has yet been conducted. Methodology/Principal Findings: In the present study, by analyzing multiple female goldfish brain microarray datasets, we have characterized global gene expression patterns for a seasonal cycle. A core set of genes (873 genes) in the hypothalamus were identified to be differentially expressed between May, August and December, which correspond to physiologically distinct stages that are sexually mature (prespawning), sexual regression, and early gonadal redevelopment, respectively. Expression changes of these genes are also shared by another brain region, the telencephalon, as revealed by multivariate analysis. More importantly, by examining one dataset obtained from fish in October who were kept under long-daylength photoperiod (16 h) typical of the springtime breeding season (May), we observed that the expression of identified genes appears regulated by photoperiod, a major factor controlling vertebrate reproductive cyclicity. Gene ontology analysis revealed that hormone genes and genes functionally involved in G-protein coupled receptor signaling pathway and transmission of nerve impulses are significantly enriched in an expression pattern, whose transition is located between prespawning and sexually regressed stages. The existence of seasonal expression patterns was verified for several genes including isotocin, ependymin II, GABAA gamma2 receptor, calmodulin, and aromatase b by independent samplings of goldfish brains from six seasonal time points and real-time PCR assays.
    Conclusions/Significance: Using both theoretical and experimental strategies, we report for the first time global gene expression patterns throughout a breeding season which may account for dynamic neuroendocrine regulation of seasonal reproductive development. PMID:19503831

  5. dynGENIE3: dynamical GENIE3 for the inference of gene networks from time series expression data.

    PubMed

    Huynh-Thu, Vân Anh; Geurts, Pierre

    2018-02-21

    The elucidation of gene regulatory networks is one of the major challenges of systems biology. Measurements about genes that are exploited by network inference methods are typically available either in the form of steady-state expression vectors or time series expression data. In our previous work, we proposed the GENIE3 method that exploits variable importance scores derived from Random forests to identify the regulators of each target gene. This method provided state-of-the-art performance on several benchmark datasets, but it could however not specifically be applied to time series expression data. We propose here an adaptation of the GENIE3 method, called dynamical GENIE3 (dynGENIE3), for handling both time series and steady-state expression data. The proposed method is evaluated extensively on the artificial DREAM4 benchmarks and on three real time series expression datasets. Although dynGENIE3 does not systematically yield the best performance on each and every network, it is competitive with diverse methods from the literature, while preserving the main advantages of GENIE3 in terms of scalability.

  6. Novel candidate genes of the PARK7 interactome as mediators of apoptosis and acetylation in multiple sclerosis: An in silico analysis.

    PubMed

    Vavougios, George D; Zarogiannis, Sotirios G; Krogfelt, Karen Angeliki; Gourgoulianis, Konstantinos; Mitsikostas, Dimos Dimitrios; Hadjigeorgiou, Georgios

    2018-01-01

    Currently, only 4 studies have explored the potential role of PARK7's dysregulation in MS pathophysiology, and no study has evaluated the potential role of the PARK7 interactome in MS. The aim of our study was to assess the differential expression of PARK7 mRNA in peripheral blood mononuclear cells (PBMCs) donated from MS versus healthy patients using data mining techniques. The PARK7 interactome data from the GDS3920 profile were scrutinized for differentially expressed genes (DEGs); Gene Enrichment Analysis (GEA) was used to detect significantly enriched biological functions. 27 differentially expressed genes in the MS dataset were detected; 12 of these (NDUFA4, UBA2, TDP2, NPM1, NDUFS3, SUMO1, PIAS2, KIAA0101, RBBP4, NONO, RBBP7 and HSPA4) are reported for the first time in MS. Stepwise Linear Discriminant Function Analysis constructed a predictive model (Wilks' λ = 0.176, χ² = 45.204, p = 1.5275e-10) with 2 variables (TDP2, RBBP4) that achieved 96.6% accuracy when discriminating between patients and controls. Gene Enrichment Analysis revealed that induction and regulation of programmed / intrinsic cell death represented the most salient Gene Ontology annotations. Cross-validation on systemic lupus erythematosus and ischemic stroke datasets revealed that these functions are unique to the MS dataset. Based on our results, novel potential target genes are revealed; these differentially expressed genes regulate epigenetic and apoptotic pathways that may further elucidate underlying mechanisms of autoreactivity in MS. Copyright © 2017 Elsevier B.V. All rights reserved.

  7. Array data extractor (ADE): a LabVIEW program to extract and merge gene array data

    PubMed Central

    2013-01-01

    Background: Large data sets from gene expression array studies are publicly available offering information highly valuable for research across many disciplines ranging from fundamental to clinical research. Highly advanced bioinformatics tools have been made available to researchers, but a demand for user-friendly software allowing researchers to quickly extract expression information for multiple genes from multiple studies persists. Findings: Here, we present a user-friendly LabVIEW program to automatically extract gene expression data for a list of genes from multiple normalized microarray datasets. Functionality was tested for 288 class A G protein-coupled receptors (GPCRs) and expression data from 12 studies comparing normal and diseased human hearts. Results confirmed known regulation of a beta 1 adrenergic receptor and further indicate novel research targets. Conclusions: Although existing software allows for complex data analyses, the LabVIEW based program presented here, “Array Data Extractor (ADE)”, provides users with a tool to retrieve meaningful information from multiple normalized gene expression datasets in a fast and easy way. Further, the graphical programming language used in LabVIEW allows applying changes to the program without the need of advanced programming knowledge. PMID:24289243

  8. Modelling gene expression profiles related to prostate tumor progression using binary states

    PubMed Central

    2013-01-01

    Background: Cancer is a complex disease commonly characterized by the disrupted activity of several cancer-related genes such as oncogenes and tumor-suppressor genes. Previous studies suggest that the process of tumor progression to malignancy is dynamic and can be traced by changes in gene expression. Despite the enormous efforts made for differential expression detection and biomarker discovery, few methods have been designed to model the gene expression level to tumor stage during malignancy progression. Such models could help us understand the dynamics and simplify or reveal the complexity of tumor progression. Methods: We have modeled an on-off state of gene activation per sample then per stage to select gene expression profiles associated to tumor progression. The selection is guided by statistical significance of profiles based on random permutated datasets. Results: We show that our method identifies expected profiles corresponding to oncogenes and tumor suppressor genes in a prostate tumor progression dataset. Comparisons with other methods support our findings and indicate that a considerable proportion of significant profiles is not found by other statistical tests commonly used to detect differential expression between tumor stages nor found by other tailored methods. Ontology and pathway analysis concurred with these findings. Conclusions: Results suggest that our methodology may be a valuable tool to study tumor malignancy progression, which might reveal novel cancer therapies. PMID:23721350

  9. Physiologically Shrinking the Solution Space of a Saccharomyces cerevisiae Genome-Scale Model Suggests the Role of the Metabolic Network in Shaping Gene Expression Noise.

    PubMed

    Chi, Baofang; Tao, Shiheng; Liu, Yanlin

    2015-01-01

    Sampling the solution space of genome-scale models is generally conducted to determine the feasible region for metabolic flux distribution. Because the region for actual metabolic states resides only in a small fraction of the entire space, it is necessary to shrink the solution space to improve the predictive power of a model. A common strategy is to constrain models by integrating extra datasets such as high-throughput datasets and C13-labeled flux datasets. However, studies refining these approaches by performing a meta-analysis of massive experimental metabolic flux measurements, which are closely linked to cellular phenotypes, are limited. In the present study, experimentally identified metabolic flux data from 96 published reports were systematically reviewed. Several strong associations among metabolic flux phenotypes were observed. These phenotype-phenotype associations at the flux level were quantified and integrated into a Saccharomyces cerevisiae genome-scale model as extra physiological constraints. By sampling the shrunken solution space of the model, the metabolic flux fluctuation level, which is an intrinsic trait of metabolic reactions determined by the network, was estimated and utilized to explore its relationship to gene expression noise. Although no correlation was observed in all enzyme-coding genes, a relationship between metabolic flux fluctuation and expression noise of genes associated with enzyme-dosage sensitive reactions was detected, suggesting that the metabolic network plays a role in shaping gene expression noise. Such correlation was mainly attributed to the genes corresponding to non-essential reactions, rather than essential ones. This was at least partially, due to regulations underlying the flux phenotype-phenotype associations. Altogether, this study proposes a new approach in shrinking the solution space of a genome-scale model, of which sampling provides new insights into gene expression noise.

  10. Querying Co-regulated Genes on Diverse Gene Expression Datasets Via Biclustering.

    PubMed

    Deveci, Mehmet; Küçüktunç, Onur; Eren, Kemal; Bozdağ, Doruk; Kaya, Kamer; Çatalyürek, Ümit V

    2016-01-01

    Rapid development and increasing popularity of gene expression microarrays have resulted in a number of studies on the discovery of co-regulated genes. One important way of discovering such co-regulations is the query-based search since gene co-expressions may indicate a shared role in a biological process. Although there exist promising query-driven search methods adapting clustering, they fail to capture many genes that function in the same biological pathway because microarray datasets are fraught with spurious samples or samples of diverse origin, or the pathways might be regulated under only a subset of samples. On the other hand, a class of clustering algorithms known as biclustering algorithms which simultaneously cluster both the items and their features are useful while analyzing gene expression data, or any data in which items are related in only a subset of their samples. This means that genes need not be related in all samples to be clustered together. Because many genes only interact under specific circumstances, biclustering may recover the relationships that traditional clustering algorithms can easily miss. In this chapter, we briefly summarize the literature using biclustering for querying co-regulated genes. Then we present a novel biclustering approach and evaluate its performance by a thorough experimental analysis.

  11. University of Texas Southwestern Medical Center (UTSW): Functional Signature Ontology Tool: Triplicate Measurements of Reporter Gene Expression in Response to Individual Genetic and Chemical Perturbations in HCT116 Cells | Office of Cancer Genomics

    Cancer.gov

    The goal of this project is to use an eight-gene expression profile to define functional signatures for small molecules and natural products with heretofore undefined mechanism of action. Two genes in the eight-gene set are used as internal controls and do not vary across gene expression array data collected from the public domain. The remaining six genes are found to vary independently across a large collection of publicly available gene expression array datasets.

  12. Pan- and core- network analysis of co-expression genes in a model plant

    DOE PAGES

    He, Fei; Maslov, Sergei

    2016-12-16

    Genome-wide gene expression experiments have been performed using the model plant Arabidopsis during the last decade. Some studies involved construction of coexpression networks, a popular technique used to identify groups of co-regulated genes, to infer unknown gene functions. One approach is to construct a single coexpression network by combining multiple expression datasets generated in different labs. We advocate a complementary approach in which we construct a large collection of 134 coexpression networks based on expression datasets reported in individual publications. To this end we reanalyzed public expression data. To describe this collection of networks we introduced concepts of ‘pan-network’ and ‘core-network’ representing union and intersection between a sizeable fraction of individual networks, respectively. Here, we showed that these two types of networks are different both in terms of their topology and biological function of interacting genes. For example, the modules of the pan-network are enriched in regulatory and signaling functions, while the modules of the core-network tend to include components of large macromolecular complexes such as ribosomes and photosynthetic machinery. Our analysis is aimed to help the plant research community to better explore the information contained within the existing vast collection of gene expression data in Arabidopsis.
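    The pan-network and core-network idea in the record above can be illustrated with a minimal Python sketch. It assumes each coexpression network is given as a set of undirected gene-gene edges, and that "a sizeable fraction" is expressed as a cutoff on the share of networks containing an edge. The function name, the cutoff values and the toy gene identifiers are hypothetical and are not taken from the paper.

```python
from collections import Counter


def edge(gene_a, gene_b):
    """Represent an undirected edge as an order-independent pair."""
    return frozenset((gene_a, gene_b))


def pan_and_core(networks, pan_fraction=0.1, core_fraction=0.9):
    """Split edges into a pan-network and a core-network.

    networks: list of sets of edges, one set per individual network.
    pan_fraction / core_fraction: assumed cutoffs on the share of
    networks that must contain an edge (illustrative values only).
    """
    counts = Counter(e for net in networks for e in net)
    n = len(networks)
    pan = {e for e, c in counts.items() if c / n >= pan_fraction}
    core = {e for e, c in counts.items() if c / n >= core_fraction}
    return pan, core


# Toy input: three small networks over hypothetical genes g1..g4.
nets = [
    {edge("g1", "g2"), edge("g2", "g3")},
    {edge("g1", "g2"), edge("g3", "g4")},
    {edge("g1", "g2"), edge("g2", "g3")},
]
pan, core = pan_and_core(nets, pan_fraction=0.3, core_fraction=1.0)
```

    With a low cutoff the pan-network behaves like a union of the individual networks, and with a cutoff of 1.0 the core-network is their strict intersection.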

  13. Pan- and core- network analysis of co-expression genes in a model plant

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    He, Fei; Maslov, Sergei

    Genome-wide gene expression experiments have been performed using the model plant Arabidopsis during the last decade. Some studies involved construction of coexpression networks, a popular technique used to identify groups of co-regulated genes, to infer unknown gene functions. One approach is to construct a single coexpression network by combining multiple expression datasets generated in different labs. We advocate a complementary approach in which we construct a large collection of 134 coexpression networks based on expression datasets reported in individual publications. To this end we reanalyzed public expression data. To describe this collection of networks we introduced concepts of ‘pan-network’ and ‘core-network’ representing union and intersection between a sizeable fraction of individual networks, respectively. Here, we showed that these two types of networks are different both in terms of their topology and biological function of interacting genes. For example, the modules of the pan-network are enriched in regulatory and signaling functions, while the modules of the core-network tend to include components of large macromolecular complexes such as ribosomes and photosynthetic machinery. Our analysis is aimed to help the plant research community to better explore the information contained within the existing vast collection of gene expression data in Arabidopsis.

  14. Gene Expression Signatures Diagnose Influenza and Other Symptomatic Respiratory Viral Infection in Humans

    PubMed Central

    Zaas, Aimee K.; Chen, Minhua; Varkey, Jay; Veldman, Timothy; Hero, Alfred O.; Lucas, Joseph; Huang, Yongsheng; Turner, Ronald; Gilbert, Anthony; Lambkin-Williams, Robert; Øien, N. Christine; Nicholson, Bradly; Kingsmore, Stephen; Carin, Lawrence; Woods, Christopher W.; Ginsburg, Geoffrey S.

    2010-01-01

    Summary: Acute respiratory infections (ARI) are a common reason for seeking medical attention and the threat of pandemic influenza will likely add to these numbers. Using human viral challenge studies with live rhinovirus, respiratory syncytial virus, and influenza A, we developed peripheral blood gene expression signatures that distinguish individuals with symptomatic ARI from uninfected individuals with > 95% accuracy. We validated this “acute respiratory viral” signature - encompassing genes with a known role in host defense against viral infections - across each viral challenge. We also validated the signature in an independently acquired dataset for influenza A and classified infected individuals from healthy controls with 100% accuracy. In the same dataset, we could also distinguish viral from bacterial ARIs (93% accuracy). These results demonstrate that ARIs induce changes in human peripheral blood gene expression that can be used to diagnose a viral etiology of respiratory infection and triage symptomatic individuals. PMID:19664979

  15. Identifying key genes in rheumatoid arthritis by weighted gene co-expression network analysis.

    PubMed

    Ma, Chunhui; Lv, Qi; Teng, Songsong; Yu, Yinxian; Niu, Kerun; Yi, Chengqin

    2017-08-01

    This study aimed to identify rheumatoid arthritis (RA) related genes based on microarray data using the WGCNA (weighted gene co-expression network analysis) method. Two gene expression profile datasets GSE55235 (10 RA samples and 10 healthy controls) and GSE77298 (16 RA samples and seven healthy controls) were downloaded from Gene Expression Omnibus database. Characteristic genes were identified using metaDE package. WGCNA was used to find disease-related networks based on gene expression correlation coefficients, and module significance was defined as the average gene significance of all genes used to assess the correlation between the module and RA status. Genes in the disease-related gene co-expression network were subject to functional annotation and pathway enrichment analysis using Database for Annotation Visualization and Integrated Discovery. Characteristic genes were also mapped to the Connectivity Map to screen small molecules. A total of 599 characteristic genes were identified. For each dataset, characteristic genes in the green, red and turquoise modules were most closely associated with RA, with gene numbers of 54, 43 and 79, respectively. These genes were enriched in a total of 17 Gene Ontology terms, mainly related to immune response (CD97, FYB, CXCL1, IKBKE, CCR1, etc.), inflammatory response (CD97, CXCL1, C3AR1, CCR1, LYZ, etc.) and homeostasis (C3AR1, CCR1, PLN, CCL19, PPT1, etc.). Two small-molecule drugs sanguinarine and papaverine were predicted to have a therapeutic effect against RA. Genes related to immune response, inflammatory response and homeostasis presumably have critical roles in RA pathogenesis. Sanguinarine and papaverine have a potential therapeutic effect against RA. © 2017 Asia Pacific League of Associations for Rheumatology and John Wiley & Sons Australia, Ltd.
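    The module-significance definition in the record above (the average gene significance over the genes in a module) can be sketched in a few lines of Python. This sketch assumes gene significance is the absolute Pearson correlation between a gene's expression and a 0/1 disease-status vector, which is a common WGCNA convention but is not stated in the abstract. The function names, gene labels and values are hypothetical, and this is not the WGCNA package API.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def module_significance(expression, status, module_genes):
    """Average gene significance over the genes of one module.

    Gene significance is assumed here to be |corr(expression, status)|.
    """
    gene_sig = [abs(pearson(expression[g], status)) for g in module_genes]
    return sum(gene_sig) / len(gene_sig)


# Toy data: 1 = RA sample, 0 = healthy control (hypothetical values).
status = [1, 1, 0, 0]
expression = {
    "gene1": [2.0, 2.0, 1.0, 1.0],  # higher in RA samples
    "gene2": [1.0, 1.0, 2.0, 2.0],  # lower in RA samples
    "gene3": [1.0, 2.0, 1.0, 2.0],  # unrelated to status
}
ms_pair = module_significance(expression, status, ["gene1", "gene2"])
ms_all = module_significance(expression, status, ["gene1", "gene2", "gene3"])
```

    Taking the absolute value means that up- and down-regulated genes both count toward a module's association with disease status.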

  16. Genome-wide screen identifies a novel prognostic signature for breast cancer survival

    DOE PAGES

    Mao, Xuan Y.; Lee, Matthew J.; Zhu, Jeffrey; ...

    2017-01-21

    Large genomic datasets in combination with clinical data can be used as an unbiased tool to identify genes important in patient survival and discover potential therapeutic targets. We used a genome-wide screen to identify 587 genes significantly and robustly deregulated across four independent breast cancer (BC) datasets compared to normal breast tissue. Gene expression of 381 genes was significantly associated with relapse-free survival (RFS) in BC patients. We used a gene co-expression network approach to visualize the genetic architecture in normal breast and BCs. In normal breast tissue, co-expression cliques were identified enriched for cell cycle, gene transcription, cell adhesion, cytoskeletal organization and metabolism. In contrast, in BC, only two major co-expression cliques were identified enriched for cell cycle-related processes or blood vessel development, cell adhesion and mammary gland development processes. Interestingly, gene expression levels of 7 genes were found to be negatively correlated with many cell cycle related genes, highlighting these genes as potential tumor suppressors and novel therapeutic targets. A forward-conditional Cox regression analysis was used to identify a 12-gene signature associated with RFS. A prognostic scoring system was created based on the 12-gene signature. This scoring system robustly predicted BC patient RFS in 60 sampling test sets and was further validated in TCGA and METABRIC BC data. Our integrated study identified a 12-gene prognostic signature that could guide adjuvant therapy for BC patients and includes novel potential molecular targets for therapy.

  17. Genome-wide screen identifies a novel prognostic signature for breast cancer survival

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Mao, Xuan Y.; Lee, Matthew J.; Zhu, Jeffrey

    Large genomic datasets in combination with clinical data can be used as an unbiased tool to identify genes important in patient survival and discover potential therapeutic targets. We used a genome-wide screen to identify 587 genes significantly and robustly deregulated across four independent breast cancer (BC) datasets compared to normal breast tissue. Gene expression of 381 genes was significantly associated with relapse-free survival (RFS) in BC patients. We used a gene co-expression network approach to visualize the genetic architecture in normal breast and BCs. In normal breast tissue, co-expression cliques were identified enriched for cell cycle, gene transcription, cell adhesion, cytoskeletal organization and metabolism. In contrast, in BC, only two major co-expression cliques were identified enriched for cell cycle-related processes or blood vessel development, cell adhesion and mammary gland development processes. Interestingly, gene expression levels of 7 genes were found to be negatively correlated with many cell cycle related genes, highlighting these genes as potential tumor suppressors and novel therapeutic targets. A forward-conditional Cox regression analysis was used to identify a 12-gene signature associated with RFS. A prognostic scoring system was created based on the 12-gene signature. This scoring system robustly predicted BC patient RFS in 60 sampling test sets and was further validated in TCGA and METABRIC BC data. Our integrated study identified a 12-gene prognostic signature that could guide adjuvant therapy for BC patients and includes novel potential molecular targets for therapy.

  18. Inflammatory Gene Regulatory Networks in Amnion Cells Following Cytokine Stimulation: Translational Systems Approach to Modeling Human Parturition

    PubMed Central

    Summerfield, Taryn L.; Yu, Lianbo; Gulati, Parul; Zhang, Jie; Huang, Kun; Romero, Roberto; Kniss, Douglas A.

    2011-01-01

    A majority of the studies examining the molecular regulation of human labor have been conducted using single gene approaches. While the technology to produce multi-dimensional datasets is readily available, the means for facile analysis of such data are limited. The objective of this study was to develop a systems approach to infer regulatory mechanisms governing global gene expression in cytokine-challenged cells in vitro, and to apply these methods to predict gene regulatory networks (GRNs) in intrauterine tissues during term parturition. To this end, microarray analysis was applied to human amnion mesenchymal cells (AMCs) stimulated with interleukin-1β, and differentially expressed transcripts were subjected to hierarchical clustering, temporal expression profiling, and motif enrichment analysis, from which a GRN was constructed. These methods were then applied to fetal membrane specimens collected in the absence or presence of spontaneous term labor. Analysis of cytokine-responsive genes in AMCs revealed a sterile immune response signature, with promoters enriched in response elements for several inflammation-associated transcription factors. In comparison to the fetal membrane dataset, there were 34 genes commonly upregulated, many of which were part of an acute inflammation gene expression signature. Binding motifs for nuclear factor-κB were prominent in the gene interaction and regulatory networks for both datasets; however, we found little evidence to support the utilization of pathogen-associated molecular pattern (PAMP) signaling. The tissue specimens were also enriched for transcripts governed by hypoxia-inducible factor. The approach presented here provides an uncomplicated means to infer global relationships among gene clusters involved in cellular responses to labor-associated signals. PMID:21655103

  19. Clusternomics: Integrative context-dependent clustering for heterogeneous datasets

    PubMed Central

    Wernisch, Lorenz

    2017-01-01

    Integrative clustering is used to identify groups of samples by jointly analysing multiple datasets describing the same set of biological samples, such as gene expression, copy number, methylation etc. Most existing algorithms for integrative clustering assume that there is a shared consistent set of clusters across all datasets, and most of the data samples follow this structure. However in practice, the structure across heterogeneous datasets can be more varied, with clusters being joined in some datasets and separated in others. In this paper, we present a probabilistic clustering method to identify groups across datasets that do not share the same cluster structure. The proposed algorithm, Clusternomics, identifies groups of samples that share their global behaviour across heterogeneous datasets. The algorithm models clusters on the level of individual datasets, while also extracting global structure that arises from the local cluster assignments. Clusters on both the local and the global level are modelled using a hierarchical Dirichlet mixture model to identify structure on both levels. We evaluated the model both on simulated and on real-world datasets. The simulated data exemplifies datasets with varying degrees of common structure. In such a setting Clusternomics outperforms existing algorithms for integrative and consensus clustering. In a real-world application, we used the algorithm for cancer subtyping, identifying subtypes of cancer from heterogeneous datasets. We applied the algorithm to TCGA breast cancer dataset, integrating gene expression, miRNA expression, DNA methylation and proteomics. The algorithm extracted clinically meaningful clusters with significantly different survival probabilities. We also evaluated the algorithm on lung and kidney cancer TCGA datasets with high dimensionality, again showing clinically significant results and scalability of the algorithm. PMID:29036190

  20. Clusternomics: Integrative context-dependent clustering for heterogeneous datasets.

    PubMed

    Gabasova, Evelina; Reid, John; Wernisch, Lorenz

    2017-10-01

    Integrative clustering is used to identify groups of samples by jointly analysing multiple datasets describing the same set of biological samples, such as gene expression, copy number, methylation etc. Most existing algorithms for integrative clustering assume that there is a shared consistent set of clusters across all datasets, and most of the data samples follow this structure. However in practice, the structure across heterogeneous datasets can be more varied, with clusters being joined in some datasets and separated in others. In this paper, we present a probabilistic clustering method to identify groups across datasets that do not share the same cluster structure. The proposed algorithm, Clusternomics, identifies groups of samples that share their global behaviour across heterogeneous datasets. The algorithm models clusters on the level of individual datasets, while also extracting global structure that arises from the local cluster assignments. Clusters on both the local and the global level are modelled using a hierarchical Dirichlet mixture model to identify structure on both levels. We evaluated the model both on simulated and on real-world datasets. The simulated data exemplifies datasets with varying degrees of common structure. In such a setting Clusternomics outperforms existing algorithms for integrative and consensus clustering. In a real-world application, we used the algorithm for cancer subtyping, identifying subtypes of cancer from heterogeneous datasets. We applied the algorithm to TCGA breast cancer dataset, integrating gene expression, miRNA expression, DNA methylation and proteomics. The algorithm extracted clinically meaningful clusters with significantly different survival probabilities. We also evaluated the algorithm on lung and kidney cancer TCGA datasets with high dimensionality, again showing clinically significant results and scalability of the algorithm.

  1. THD-Module Extractor: An Application for CEN Module Extraction and Interesting Gene Identification for Alzheimer's Disease.

    PubMed

    Kakati, Tulika; Kashyap, Hirak; Bhattacharyya, Dhruba K

    2016-11-30

    There exist many tools and methods for construction of co-expression network from gene expression data and for extraction of densely connected gene modules. In this paper, a method is introduced to construct co-expression network and to extract co-expressed modules having high biological significance. The proposed method has been validated on several well known microarray datasets extracted from a diverse set of species, using statistical measures, such as p and q values. The modules obtained in these studies are found to be biologically significant based on Gene Ontology enrichment analysis, pathway analysis, and KEGG enrichment analysis. Further, the method was applied on an Alzheimer's disease dataset and some interesting genes are found, which have high semantic similarity among them, but are not significantly correlated in terms of expression similarity. Some of these interesting genes, such as MAPT, CASP2, and PSEN2, are linked with important aspects of Alzheimer's disease, such as dementia, increased cell death, and deposition of amyloid-beta proteins in Alzheimer's disease brains. The biological pathways associated with Alzheimer's disease, such as Wnt signaling, Apoptosis, p53 signaling, and Notch signaling, incorporate these interesting genes. The proposed method is evaluated in regard to existing literature.

  2. THD-Module Extractor: An Application for CEN Module Extraction and Interesting Gene Identification for Alzheimer’s Disease

    PubMed Central

    Kakati, Tulika; Kashyap, Hirak; Bhattacharyya, Dhruba K.

    2016-01-01

    Many tools and methods exist for constructing co-expression networks from gene expression data and for extracting densely connected gene modules. In this paper, a method is introduced to construct a co-expression network and to extract co-expressed modules having high biological significance. The proposed method has been validated on several well-known microarray datasets extracted from a diverse set of species, using statistical measures such as p and q values. The modules obtained in these studies are found to be biologically significant based on Gene Ontology enrichment analysis, pathway analysis, and KEGG enrichment analysis. Further, the method was applied to an Alzheimer’s disease dataset and some interesting genes were found, which have high semantic similarity among them but are not significantly correlated in terms of expression similarity. Some of these interesting genes, such as MAPT, CASP2, and PSEN2, are linked with important aspects of Alzheimer’s disease, such as dementia, increased cell death, and deposition of amyloid-beta proteins in Alzheimer’s disease brains. The biological pathways associated with Alzheimer’s disease, such as Wnt signaling, apoptosis, p53 signaling, and Notch signaling, incorporate these interesting genes. The proposed method is evaluated in regard to existing literature. PMID:27901073

  3. Genome-wide identification of suitable zebrafish Danio rerio reference genes for normalization of gene expression data by RT-qPCR.

    PubMed

    Xu, H; Li, C; Zeng, Q; Agrawal, I; Zhu, X; Gong, Z

    2016-06-01

    In this study, to systematically identify the most stably expressed genes for internal reference in zebrafish Danio rerio investigations, 37 D. rerio transcriptomic datasets (both RNA sequencing and microarray data) were collected from the Gene Expression Omnibus (GEO) database and unpublished data, and gene expression variations were analysed under three experimental conditions: tissue types, developmental stages and chemical treatments. Forty-four putative candidate genes were identified with a c.v. <0·2 across all datasets. Following clustering into different functional groups, 21 genes, in addition to four conventional housekeeping genes (eef1a1l1, b2m, hprt1l and actb1), were selected from different functional groups for further quantitative real-time (qrt-)PCR validation using 25 RNA samples from different adult tissues, developmental stages and chemical treatments. The qrt-PCR data were then analysed using the statistical algorithm refFinder for gene expression stability. Several new candidate genes showed better expression stability than the conventional housekeeping genes in all three categories. It was found that sep15 and metap1 were the top two stable genes for tissue types, ube2a and tmem50a the top two for different developmental stages, and rpl13a and rplp0 the top two for chemical treatments. Thus, based on the extensive transcriptomic analyses and qrt-PCR validation, these new reference genes are recommended for normalization of D. rerio qrt-PCR data for the three respective experimental conditions. © 2016 The Fisheries Society of the British Isles.
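
    The c.v. screen described above can be illustrated with a minimal sketch. It assumes that the coefficient of variation is the population standard deviation divided by the mean, and that a candidate must pass the threshold in every dataset; the abstract states neither detail. The function names and the toy values are hypothetical.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Standard deviation divided by mean for one gene's expression values."""
    m = mean(values)
    if m == 0:
        raise ValueError("mean expression is zero; c.v. is undefined")
    return pstdev(values) / m


def stable_genes(datasets, threshold=0.2):
    """Keep genes whose c.v. is below the threshold in every dataset.

    datasets: list of dicts mapping gene name -> list of expression values.
    Only genes present in all datasets are considered (an assumption).
    """
    shared = set.intersection(*(set(d) for d in datasets))
    return sorted(
        g for g in shared
        if all(coefficient_of_variation(d[g]) < threshold for d in datasets)
    )


# Toy values, invented for illustration only.
toy = [
    {"geneA": [10.0, 11.0, 9.0], "geneB": [1.0, 10.0, 30.0]},
    {"geneA": [20.0, 21.0, 19.0], "geneB": [5.0, 5.5, 4.5]},
]
print(stable_genes(toy))
```

    In this toy input geneB is stable in the second dataset but highly variable in the first, so only geneA survives the screen.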

  4. iCOSSY: An Online Tool for Context-Specific Subnetwork Discovery from Gene Expression Data

    PubMed Central

    Saha, Ashis; Jeon, Minji; Tan, Aik Choon; Kang, Jaewoo

    2015-01-01

    Pathway analyses help reveal underlying molecular mechanisms of complex biological phenotypes. Biologists tend to perform multiple pathway analyses on the same dataset, as there is no single answer. It is often inefficient for them to implement and/or install all the algorithms by themselves. Online tools can help the community in this regard. Here we present an online gene expression analytical tool called iCOSSY which implements a novel pathway-based COntext-specific Subnetwork discoverY (COSSY) algorithm. iCOSSY also includes a few modifications of COSSY to increase its reliability and interpretability. Users can upload their gene expression datasets, and discover important subnetworks of closely interacting molecules to differentiate between two phenotypes (context). They can also interactively visualize the resulting subnetworks. iCOSSY is a web server that finds subnetworks that are differentially expressed in two phenotypes. Users can visualize the subnetworks to understand the biology of the difference. PMID:26147457

  5. Connectivity Mapping for Candidate Therapeutics Identification Using Next Generation Sequencing RNA-Seq Data

    PubMed Central

    McArt, Darragh G.; Dunne, Philip D.; Blayney, Jaine K.; Salto-Tellez, Manuel; Van Schaeybroeck, Sandra; Hamilton, Peter W.; Zhang, Shu-Dong

    2013-01-01

    The advent of next generation sequencing technologies (NGS) has expanded the area of genomic research, offering high coverage and increased sensitivity over older microarray platforms. Although the current cost of next generation sequencing is still exceeding that of microarray approaches, the rapid advances in NGS will likely make it the platform of choice for future research in differential gene expression. Connectivity mapping is a procedure for examining the connections among diseases, genes and drugs by differential gene expression initially based on microarray technology, with which a large collection of compound-induced reference gene expression profiles have been accumulated. In this work, we aim to test the feasibility of incorporating NGS RNA-Seq data into the current connectivity mapping framework by utilizing the microarray based reference profiles and the construction of a differentially expressed gene signature from a NGS dataset. This would allow for the establishment of connections between the NGS gene signature and those microarray reference profiles, alleviating the associated incurring cost of re-creating drug profiles with NGS technology. We examined the connectivity mapping approach on a publicly available NGS dataset with androgen stimulation of LNCaP cells in order to extract candidate compounds that could inhibit the proliferative phenotype of LNCaP cells and to elucidate their potential in a laboratory setting. In addition, we also analyzed an independent microarray dataset of similar experimental settings. We found a high level of concordance between the top compounds identified using the gene signatures from the two datasets. The nicotine derivative cotinine was returned as the top candidate among the overlapping compounds with potential to suppress this proliferative phenotype. Subsequent lab experiments validated this connectivity mapping hit, showing that cotinine inhibits cell proliferation in an androgen dependent manner. 
    Thus the results in this study suggest a promising prospect of integrating NGS data with connectivity mapping. PMID:23840550

  6. Chondrocyte channel transcriptomics

    PubMed Central

    Lewis, Rebecca; May, Hannah; Mobasheri, Ali; Barrett-Jolley, Richard

    2013-01-01

    To date, a range of ion channels have been identified in chondrocytes using a number of different techniques, predominantly electrophysiological and/or biomolecular; each of these has its advantages and disadvantages. Here we aim to compare and contrast the data available from biophysical and microarray experiments. This letter analyses recent transcriptomics datasets from chondrocytes, accessible from the European Bioinformatics Institute (EBI). We discuss whether such bioinformatic analysis of microarray datasets can potentially accelerate identification and discovery of ion channels in chondrocytes. The ion channels which appear most frequently across these microarray datasets are discussed, along with their possible functions. We discuss whether functional or protein data exist which support the microarray data. A microarray experiment comparing gene expression in osteoarthritis and healthy cartilage is also discussed and we verify the differential expression of 2 of these genes, namely the genes encoding large calcium-activated potassium (BK) and aquaporin channels. PMID:23995703

  7. Gene-expression signature regulated by the KEAP1-NRF2-CUL3 axis is associated with a poor prognosis in head and neck squamous cell cancer.

    PubMed

    Namani, Akhileshwar; Matiur Rahaman, Md; Chen, Ming; Tang, Xiuwen

    2018-01-06

    NRF2 is the key regulator of oxidative stress in normal cells and aberrant expression of the NRF2 pathway due to genetic alterations in the KEAP1 (Kelch-like ECH-associated protein 1)-NRF2 (nuclear factor erythroid 2 like 2)-CUL3 (cullin 3) axis leads to tumorigenesis and drug resistance in many cancers including head and neck squamous cell cancer (HNSCC). The main goal of this study was to identify specific genes regulated by the KEAP1-NRF2-CUL3 axis in HNSCC patients, to assess the prognostic value of this gene signature in different cohorts, and to reveal potential biomarkers. RNA-Seq V2 level 3 data from 279 tumor samples along with 37 adjacent normal samples from patients enrolled in the The Cancer Genome Atlas (TCGA)-HNSCC study were used to identify upregulated genes using two methods (altered KEAP1-NRF2-CUL3 versus normal, and altered KEAP1-NRF2-CUL3 versus wild-type). We then used a new approach to identify the combined gene signature by integrating both datasets and subsequently tested this signature in 4 independent HNSCC datasets to assess its prognostic value. In addition, functional annotation using the DAVID v6.8 database and protein-protein interaction (PPI) analysis using the STRING v10 database were performed on the signature. A signature composed of a subset of 17 genes regulated by the KEAP1-NRF2-CUL3 axis was identified by overlapping both the upregulated genes of altered versus normal (251 genes) and altered versus wild-type (25 genes) datasets. We showed that increased expression was significantly associated with poor survival in 4 independent HNSCC datasets, including the TCGA-HNSCC dataset. Furthermore, Gene Ontology, Kyoto Encyclopedia of Genes and Genomes, and PPI analysis revealed that most of the genes in this signature are associated with drug metabolism and glutathione metabolic pathways. 
    Altogether, our study emphasizes the discovery of a gene signature regulated by the KEAP1-NRF2-CUL3 axis which is strongly associated with tumorigenesis and drug resistance in HNSCC. This 17-gene signature provides potential biomarkers and therapeutic targets for HNSCC cases in which the NRF2 pathway is activated.
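
    The overlap step in this abstract (genes upregulated in both comparisons) amounts to a set intersection. The sketch below assumes each comparison has already produced a list of upregulated gene symbols; the function name and the placeholder symbols are hypothetical and are not genes from the study.

```python
def overlap_signature(up_vs_normal, up_vs_wildtype):
    """Return the genes that are upregulated in both comparisons, sorted by name."""
    return sorted(set(up_vs_normal) & set(up_vs_wildtype))


# Placeholder symbols, invented for illustration only.
altered_vs_normal = ["GENE1", "GENE2", "GENE3", "GENE4"]
altered_vs_wildtype = ["GENE3", "GENE2", "GENE9"]
print(overlap_signature(altered_vs_normal, altered_vs_wildtype))
```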

  8. Exploring Transcription Factors-microRNAs Co-regulation Networks in Schizophrenia.

    PubMed

    Xu, Yong; Yue, Weihua; Yao Shugart, Yin; Li, Sheng; Cai, Lei; Li, Qiang; Cheng, Zaohuo; Wang, Guoqiang; Zhou, Zhenhe; Jin, Chunhui; Yuan, Jianmin; Tian, Lin; Wang, Jun; Zhang, Kai; Zhang, Kerang; Liu, Sha; Song, Yuqing; Zhang, Fuquan

    2016-07-01

    Transcriptional factors (TFs) and microRNAs (miRNAs) have been recognized as 2 classes of principal gene regulators that may be responsible for genome coexpression changes observed in schizophrenia (SZ). This study aims to (1) identify differentially coexpressed genes (DCGs) in 3 mRNA expression microarray datasets; (2) explore potential interactions among the DCGs, and differentially expressed miRNAs identified in our dataset composed of early-onset SZ patients and healthy controls; (3) validate expression levels of some key transcripts; and (4) explore the druggability of DCGs using the curated database. We detected a differential coexpression network associated with SZ and found that 9 out of the 12 regulators were replicated in either of the 2 other datasets. Leveraging the differentially expressed miRNAs identified in our previous dataset, we constructed a miRNA-TF-gene network relevant to SZ, including an EGR1-miR-124-3p-SKIL feed-forward loop. Our real-time quantitative PCR analysis indicated the overexpression of miR-124-3p, the under expression of SKIL and EGR1 in the blood of SZ patients compared with controls, and the direction of change of miR-124-3p and SKIL mRNA levels in SZ cases were reversed after a 12-week treatment cycle. Our druggability analysis revealed that many of these genes have the potential to be drug targets. Together, our results suggest that coexpression network abnormalities driven by combinatorial and interactive action from TFs and miRNAs may contribute to the development of SZ and be relevant to the clinical treatment of the disease. © The Author 2015. Published by Oxford University Press on behalf of the Maryland Psychiatric Research Center. All rights reserved. For permissions, please email: journals.permissions@oup.com.

  9. Exploring Transcription Factors-microRNAs Co-regulation Networks in Schizophrenia

    PubMed Central

    Xu, Yong; Yue, Weihua; Yao Shugart, Yin; Li, Sheng; Cai, Lei; Li, Qiang; Cheng, Zaohuo; Wang, Guoqiang; Zhou, Zhenhe; Jin, Chunhui; Yuan, Jianmin; Tian, Lin; Wang, Jun; Zhang, Kai; Zhang, Kerang; Liu, Sha; Song, Yuqing; Zhang, Fuquan

    2016-01-01

    Background: Transcriptional factors (TFs) and microRNAs (miRNAs) have been recognized as 2 classes of principal gene regulators that may be responsible for genome coexpression changes observed in schizophrenia (SZ). Methods: This study aims to (1) identify differentially coexpressed genes (DCGs) in 3 mRNA expression microarray datasets; (2) explore potential interactions among the DCGs, and differentially expressed miRNAs identified in our dataset composed of early-onset SZ patients and healthy controls; (3) validate expression levels of some key transcripts; and (4) explore the druggability of DCGs using the curated database. Results: We detected a differential coexpression network associated with SZ and found that 9 out of the 12 regulators were replicated in either of the 2 other datasets. Leveraging the differentially expressed miRNAs identified in our previous dataset, we constructed a miRNA–TF–gene network relevant to SZ, including an EGR1–miR-124-3p–SKIL feed-forward loop. Our real-time quantitative PCR analysis indicated the overexpression of miR-124-3p, the under expression of SKIL and EGR1 in the blood of SZ patients compared with controls, and the direction of change of miR-124-3p and SKIL mRNA levels in SZ cases were reversed after a 12-week treatment cycle. Our druggability analysis revealed that many of these genes have the potential to be drug targets. Conclusions: Together, our results suggest that coexpression network abnormalities driven by combinatorial and interactive action from TFs and miRNAs may contribute to the development of SZ and be relevant to the clinical treatment of the disease. PMID:26609121

  10. Differential prioritization between relevance and redundancy in correlation-based feature selection techniques for multiclass gene expression data.

    PubMed

    Ooi, Chia Huey; Chetty, Madhu; Teng, Shyh Wei

    2006-06-23

    Due to the large number of genes in a typical microarray dataset, feature selection looks set to play an important role in reducing noise and computational cost in gene expression-based tissue classification while improving accuracy at the same time. Surprisingly, this does not appear to be the case for all multiclass microarray datasets. The reason is that many feature selection techniques applied on microarray datasets are either rank-based and hence do not take into account correlations between genes, or are wrapper-based, which require high computational cost, and often yield difficult-to-reproduce results. In studies where correlations between genes are considered, attempts to establish the merit of the proposed techniques are hampered by evaluation procedures which are less than meticulous, resulting in overly optimistic estimates of accuracy. We present two realistically evaluated correlation-based feature selection techniques which incorporate, in addition to the two existing criteria involved in forming a predictor set (relevance and redundancy), a third criterion called the degree of differential prioritization (DDP). DDP functions as a parameter to strike the balance between relevance and redundancy, providing our techniques with the novel ability to differentially prioritize the optimization of relevance against redundancy (and vice versa). This ability proves useful in producing optimal classification accuracy while using reasonably small predictor set sizes for nine well-known multiclass microarray datasets. For multiclass microarray datasets, especially the GCM and NCI60 datasets, DDP enables our filter-based techniques to produce accuracies better than those reported in previous studies which employed similarly realistic evaluation procedures.
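
    The idea of a single parameter that trades relevance against redundancy can be sketched as a greedy forward selection. This is not the authors' exact criterion: the scoring form (relevance raised to alpha, times antiredundancy raised to 1 - alpha), the use of mean absolute Pearson correlation as the redundancy measure, and all names and toy values are assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def select_genes(expr, relevance, k, alpha):
    """Greedily pick k genes, weighting relevance against redundancy.

    expr: gene -> expression values across samples.
    relevance: gene -> non-negative relevance score (assumed precomputed).
    alpha: 1.0 uses relevance only; smaller values favour low redundancy.
    """
    selected = [max(relevance, key=relevance.get)]
    while len(selected) < k:
        def score(g):
            mean_abs_corr = sum(
                abs(pearson(expr[g], expr[s])) for s in selected
            ) / len(selected)
            # Clamp to guard against tiny negative values from rounding.
            anti = max(0.0, 1.0 - mean_abs_corr)
            return (relevance[g] ** alpha) * (anti ** (1.0 - alpha))

        candidates = [g for g in expr if g not in selected]
        selected.append(max(candidates, key=score))
    return selected


# Toy data, invented for illustration: g2 is nearly a copy of g1.
expr = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.1],
    "g3": [4.0, 1.0, 3.0, 2.0],
}
relevance = {"g1": 0.9, "g2": 0.85, "g3": 0.5}
print(select_genes(expr, relevance, k=2, alpha=1.0))
print(select_genes(expr, relevance, k=2, alpha=0.5))
```

    With alpha = 1.0 the redundant but highly relevant g2 is chosen second; with alpha = 0.5 the less relevant but less correlated g3 is chosen instead.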

  11. Structure and transcriptional regulation of the major intrinsic protein gene family in grapevine.

    PubMed

    Wong, Darren Chern Jan; Zhang, Li; Merlin, Isabelle; Castellarin, Simone D; Gambetta, Gregory A

    2018-04-11

    The major intrinsic protein (MIP) family is a family of proteins, including aquaporins, which facilitate water and small molecule transport across plasma membranes. In plants, MIPs function in a huge variety of processes including water transport, growth, stress response, and fruit development. In this study, we characterize the structure and transcriptional regulation of the MIP family in grapevine, describing the putative genome duplication events leading to the family structure and characterizing the family's tissue and developmental specific expression patterns across numerous preexisting microarray and RNAseq datasets. Gene co-expression network (GCN) analyses were carried out across these datasets and the promoters of each family member were analyzed for cis-regulatory element structure in order to provide insight into their transcriptional regulation. A total of 29 Vitis vinifera MIP family members (excluding putative pseudogenes) were identified of which all but two were mapped onto Vitis vinifera chromosomes. In this study, segmental duplication events were identified for five plasma membrane intrinsic protein (PIP) and four tonoplast intrinsic protein (TIP) genes, contributing to the expansion of PIPs and TIPs in grapevine. Grapevine MIP family members have distinct tissue and developmental expression patterns and hierarchical clustering revealed two primary groups regardless of the datasets analyzed. Composite microarray and RNA-seq gene co-expression networks (GCNs) highlighted the relationships between MIP genes and functional categories involved in cell wall modification and transport, as well as with other MIPs revealing a strong co-regulation within the family itself. Some duplicated MIP family members have undergone sub-functionalization and exhibit distinct expression patterns and GCNs. Cis-regulatory element (CRE) analyses of the MIP promoters and their associated GCN members revealed enrichment for numerous CREs including AP2/ERFs and NACs. 
    Combining phylogenetic analyses, gene expression profiling, gene co-expression network analyses, and cis-regulatory element enrichment, this study provides a comprehensive overview of the structure and transcriptional regulation of the grapevine MIP family. The study highlights the duplication and sub-functionalization of the family, its strong coordinated expression with genes involved in growth and transport, and the putative classes of TFs responsible for its regulation.

  12. Genomic pathways modulated by Twist in breast cancer.

    PubMed

    Vesuna, Farhad; Bergman, Yehudit; Raman, Venu

    2017-01-13

    The basic helix-loop-helix transcription factor TWIST1 (Twist) is involved in embryonic cell lineage determination and mesodermal differentiation. There is evidence to indicate that Twist expression plays a role in breast tumor formation and metastasis, but the role of Twist in dysregulating pathways that drive the metastatic cascade is unclear. Moreover, many of the genes and pathways dysregulated by Twist in cell lines and mouse models have not been validated against data obtained from larger, independent datasets of breast cancer patients. We over-expressed the human Twist gene in non-metastatic MCF-7 breast cancer cells to generate the estrogen-independent metastatic breast cancer cell line MCF-7/Twist. These cells were inoculated in the mammary fat pad of female severe combined immunodeficient mice, which subsequently formed xenograft tumors that metastasized to the lungs. Microarray data was collected from both in vitro (MCF-7 and MCF-7/Twist cell lines) and in vivo (primary tumors and lung metastases) models of Twist expression. Our data was compared to several gene datasets of various subtypes, classes, and grades of human breast cancers. Our data establishes a Twist over-expressing mouse model of breast cancer, which metastasizes to the lung and replicates some of the ontogeny of human breast cancer progression. Gene profiling data, following Twist expression, exhibited novel metastasis driver genes as well as cellular maintenance genes that were synonymous with the metastatic process. We demonstrated that the genes and pathways altered in the transgenic cell line and metastatic animal models parallel many of the dysregulated gene pathways observed in human breast cancers. Analogous gene expression patterns were observed in both in vitro and in vivo Twist preclinical models of breast cancer metastasis and breast cancer patient datasets, supporting the functional role of Twist in promoting breast cancer metastasis. The data suggests that genetic dysregulation of Twist at the cellular level drives alterations in gene pathways in the Twist metastatic mouse model which are comparable to changes seen in human breast cancers. Lastly, we have identified novel genes and pathways that could be further investigated as targets for drugs to treat metastatic breast cancer.

  13. A ground truth based comparative study on clustering of gene expression data.

    PubMed

    Zhu, Yitan; Wang, Zuyi; Miller, David J; Clarke, Robert; Xuan, Jianhua; Hoffman, Eric P; Wang, Yue

    2008-05-01

    Given the variety of available clustering methods for gene expression data analysis, it is important to develop an appropriate and rigorous validation scheme to assess the performance and limitations of the most widely used clustering algorithms. In this paper, we present a ground truth based comparative study on the functionality, accuracy, and stability of five data clustering methods, namely hierarchical clustering, K-means clustering, self-organizing maps, standard finite normal mixture fitting, and a caBIG toolkit (VIsual Statistical Data Analyzer--VISDA), tested on sample clustering of seven published microarray gene expression datasets and one synthetic dataset. We examined the performance of these algorithms in both data-sufficient and data-insufficient cases using quantitative performance measures, including cluster number detection accuracy and mean and standard deviation of partition accuracy. The experimental results showed that VISDA, an interactive coarse-to-fine maximum likelihood fitting algorithm, is a solid performer on most of the datasets, while K-means clustering and self-organizing maps optimized by the mean squared compactness criterion generally produce more stable solutions than the other methods.

  14. Co-LncRNA: investigating the lncRNA combinatorial effects in GO annotations and KEGG pathways based on human RNA-Seq data.

    PubMed

    Zhao, Zheng; Bai, Jing; Wu, Aiwei; Wang, Yuan; Zhang, Jinwen; Wang, Zishan; Li, Yongsheng; Xu, Juan; Li, Xia

    2015-01-01

    Long non-coding RNAs (lncRNAs) are emerging as key regulators of diverse biological processes and diseases. However, the combinatorial effects of these molecules in a specific biological function are poorly understood. Identifying co-expressed protein-coding genes of lncRNAs would provide ample insight into lncRNA functions. To facilitate such an effort, we have developed Co-LncRNA, which is a web-based computational tool that allows users to identify GO annotations and KEGG pathways that may be affected by co-expressed protein-coding genes of a single or multiple lncRNAs. LncRNA co-expressed protein-coding genes were first identified in publicly available human RNA-Seq datasets, including 241 datasets across 6560 total individuals representing 28 tissue types/cell lines. Then, the lncRNA combinatorial effects in a given GO annotations or KEGG pathways are taken into account by the simultaneous analysis of multiple lncRNAs in user-selected individual or multiple datasets, which is realized by enrichment analysis. In addition, this software provides a graphical overview of pathways that are modulated by lncRNAs, as well as a specific tool to display the relevant networks between lncRNAs and their co-expressed protein-coding genes. Co-LncRNA also supports users in uploading their own lncRNA and protein-coding gene expression profiles to investigate the lncRNA combinatorial effects. It will be continuously updated with more human RNA-Seq datasets on an annual basis. Taken together, Co-LncRNA provides a web-based application for investigating lncRNA combinatorial effects, which could shed light on their biological roles and could be a valuable resource for this community. Database URL: http://www.bio-bigdata.com/Co-LncRNA/. © The Author(s) 2015. Published by Oxford University Press.

  15. Genome wide transcriptome profiling reveals differential gene expression in secondary metabolite pathway of Cymbopogon winterianus.

    PubMed

    Devi, Kamalakshi; Mishra, Surajit K; Sahu, Jagajjit; Panda, Debashis; Modi, Mahendra K; Sen, Priyabrata

    2016-02-15

    Advances in transcriptome sequencing provide a fast, cost-effective and reliable approach to generating large expression datasets, one especially suitable for identifying putative genes, key pathways and regulatory mechanisms in non-model species. Citronella (Cymbopogon winterianus) is an aromatic medicinal grass used for its anti-tumoral, antibacterial, anti-fungal, antiviral, detoxifying and natural insect repellent properties. Despite its many uses, the genes involved in the terpene biosynthetic pathway have not yet been clearly elucidated. The present study is a pioneering attempt to generate exhaustive molecular information on the secondary metabolite pathway and to increase genomic resources in Citronella. Using high-throughput RNA-Seq technology, the root and leaf transcriptomes were analysed at an unprecedented depth (11.7 Gb). Targeted searches identified the majority of the genes associated with metabolic pathways and other natural product pathways, viz. antibiotic synthesis, along with many novel genes. Comparative expression results for terpenoid biosynthesis genes were validated for 15 unigenes by RT-PCR and qRT-PCR. Thus the coverage of this transcriptome is comprehensive enough to discover all known genes of major metabolic pathways. This transcriptome dataset can serve as important public information for gene expression, genomics and functional genomics studies in Citronella and shall act as a benchmark for future improvement of the crop.

  16. A biclustering algorithm for extracting bit-patterns from binary datasets.

    PubMed

    Rodriguez-Baena, Domingo S; Perez-Pulido, Antonio J; Aguilar-Ruiz, Jesus S

    2011-10-01

    Binary datasets represent a compact and simple way to store data about the relationships between a group of objects and their possible properties. In the last few years, different biclustering algorithms have been specially developed to be applied to binary datasets. Several approaches based on matrix factorization, suffix trees or divide-and-conquer techniques have been proposed to extract useful biclusters from binary data, and these approaches provide information about the distribution of patterns and intrinsic correlations. A novel approach to extracting biclusters from binary datasets, BiBit, is introduced here. The results obtained from different experiments with synthetic data reveal the excellent performance and the robustness of BiBit to density and size of input data. Also, BiBit is applied to a central nervous system embryonic tumor gene expression dataset to test the quality of the results. A novel gene expression preprocessing methodology, based on expression level layers, and the selective search performed by BiBit, based on a very fast bit-pattern processing technique, provide very satisfactory results in quality and computational cost. The power of biclustering in finding genes involved simultaneously in different cancer processes is also shown. Finally, a comparison with Bimax, one of the most cited binary biclustering algorithms, shows that BiBit is faster while providing essentially the same results. Availability: the source and binary codes, the datasets used in the experiments and the results can be found at http://www.upo.es/eps/bigs/BiBit.html. Contact: dsrodbae@upo.es. Supplementary data are available at Bioinformatics online.
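
    The bit-pattern idea can be sketched as follows. This is a simplified illustration, not the published BiBit implementation: it assumes each binary row fits in one Python integer, takes the bitwise AND of every pair of rows as a candidate pattern, and collects all rows that contain that pattern. The function name, thresholds and toy matrix are hypothetical.

```python
from itertools import combinations


def bit_pattern_biclusters(rows, min_rows=2, min_cols=2):
    """Find (row indices, column pattern) pairs from integer-encoded binary rows.

    rows: list of ints, each encoding one binary row as a bit pattern.
    """
    seen = set()
    biclusters = []
    for i, j in combinations(range(len(rows)), 2):
        pattern = rows[i] & rows[j]  # columns set to 1 in both rows
        if pattern in seen or bin(pattern).count("1") < min_cols:
            continue
        seen.add(pattern)
        # A row belongs to the bicluster if it has a 1 in every pattern column.
        members = [r for r, bits in enumerate(rows) if (bits & pattern) == pattern]
        if len(members) >= min_rows:
            biclusters.append((members, pattern))
    return biclusters


# Toy 4 x 4 binary matrix, invented for illustration only.
toy_rows = [0b1101, 0b1100, 0b0111, 0b1111]
found = bit_pattern_biclusters(toy_rows)
for members, pattern in found:
    print(members, format(pattern, "04b"))
```

    Encoding rows as machine words lets one AND operation compare many columns at once, which is the property the abstract credits for the method's speed.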

  17. An efficient annotation and gene-expression derivation tool for Illumina Solexa datasets

    PubMed Central

    2010-01-01

    Background: An Illumina flow cell with all eight lanes occupied produces well over a terabyte of images, with gigabytes of reads following sequence alignment. The ability to translate such reads into meaningful annotation is therefore of great concern and importance. One can very easily be flooded with such a great volume of textual, unannotated data, irrespective of read quality or size. CASAVA, an optional analysis tool for Illumina sequencing experiments, enables INDEL detection, SNP information, and allele calling. Extracting from such analysis a measure of gene expression in the form of tag-counts, and furthermore annotating such reads, is therefore of significant value. Findings: We developed TASE (Tag counting and Analysis of Solexa Experiments), a rapid tag-counting and annotation software tool specifically designed for Illumina CASAVA sequencing datasets. Developed in Java and deployed using the jTDS JDBC driver and a SQL Server backend, TASE provides an extremely fast means of calculating gene expression through tag-counts while annotating sequenced reads with the gene's presumed function, from any given CASAVA-build. Such a build is generated for both DNA and RNA sequencing. Analysis is broken into two distinct components: DNA sequence or read concatenation, followed by tag-counting and annotation. The end result is output containing the homology-based functional annotation and the respective gene expression measure, signifying how many times sequenced reads were found within the genomic ranges of functional annotations. Conclusions: TASE is a powerful tool to facilitate the process of annotating a given Illumina Solexa sequencing dataset. Our results indicate that both homology-based annotation and tag-count analysis are achieved in very efficient times, allowing researchers to delve deep into a given CASAVA-build and maximize information extraction from a sequencing dataset. TASE is specially designed to translate sequence data in a CASAVA-build into functional annotations while producing corresponding gene expression measurements. Such analysis is executed in an ultrafast and highly efficient manner, whether for a single-read or a paired-end sequencing experiment. TASE is a user-friendly and freely available application, allowing rapid analysis and annotation of any given Illumina Solexa sequencing dataset with ease. PMID:20598141
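
    Tag-counting as described here (how many reads fall within the genomic range of each annotation) can be sketched in a few lines. This is not TASE's code, which the abstract describes as Java with a SQL Server backend; the function name, the inclusive-coordinate convention, the use of a single position per read, and the toy data are all assumptions.

```python
def count_tags(read_positions, annotations):
    """Count reads falling inside each annotated range.

    read_positions: list of int positions, one per aligned read.
    annotations: list of (name, start, end) tuples with inclusive coordinates.
    """
    counts = {name: 0 for name, _, _ in annotations}
    for pos in read_positions:
        for name, start, end in annotations:
            if start <= pos <= end:
                counts[name] += 1
    return counts


# Toy coordinates, invented for illustration only.
toy_annotations = [("geneX", 100, 200), ("geneY", 300, 400)]
toy_reads = [150, 160, 350, 500]
print(count_tags(toy_reads, toy_annotations))
```

    The nested loop is quadratic and only suitable for small inputs; a sorted interval index would be the natural replacement at sequencing scale.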

  18. Genotype-based gene signature of glioma risk.

    PubMed

    Huang, Yen-Tsung; Zhang, Yi; Wu, Zhijin; Michaud, Dominique S

    2017-07-01

    Glioma accounts for 80% of malignant brain tumors, but its etiologic determinants remain elusive. Despite genetic susceptibility loci identified by genome-wide association study (GWAS), the agnostic approach leaves open the possibility that other susceptibility genes remain to be discovered. Here we conduct a gene-centric integrative GWAS (iGWAS) of glioma risk that combines transcriptomics and genetics. We synthesized a brain transcriptomics dataset (n = 354), a GWAS dataset (n = 4203), and an advanced glioma tumor transcriptomic dataset (n = 483) to conduct an iGWAS. Using the expression quantitative trait loci (eQTL) dataset, we built models to predict gene expression for the GWAS data, based on eQTL genotypes. With the predicted gene expression, iGWAS analyses were performed using a novel statistical method. Gene signature risk score was constructed using a penalized logistic regression model. A total of 30527 transcripts were analyzed using the iGWAS approach. Four novel glioma susceptibility genes were identified with internal and external validation, including DRD5 (P = 3.0 × 10^-79), WDR1 (P = 8.4 × 10^-77), NOMO1 (P = 1.3 × 10^-25), and PDXDC1 (P = 8.3 × 10^-24). The genotype-predicted transcription pattern between cases and controls is consistent with that between tumor and its matched normal tissue. The genotype-based 4-gene signature improved the classification between glioma cases and controls based on age, gender, and population stratification, with area under the receiver operating characteristic curve increasing from 0.77 to 0.85 (P = 8.1 × 10^-23). A new genotype-based gene signature of glioma was identified using a novel iGWAS approach, which integrates multiplatform genomic data as well as different genetic association studies. © The Author(s) 2017. Published by Oxford University Press on behalf of the Society for Neuro-Oncology. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com

  19. Mining microarray datasets in nutrition: expression of the GPR120 (n-3 fatty acid receptor/sensor) gene is down-regulated in human adipocytes by macrophage secretions.

    PubMed

    Trayhurn, Paul; Denyer, Gareth

    2012-01-01

    Microarray datasets are a rich source of information in nutritional investigation. Targeted mining of microarray data following initial, non-biased bioinformatic analysis can provide key insight into specific genes and metabolic processes of interest. Microarrays from human adipocytes were examined to explore the effects of macrophage secretions on the expression of the G-protein-coupled receptor (GPR) genes that encode fatty acid receptors/sensors. Exposure of the adipocytes to macrophage-conditioned medium for 4 or 24 h had no effect on GPR40 and GPR43 expression, but there was a marked stimulation of GPR84 expression (receptor for medium-chain fatty acids), the mRNA level increasing 13·5-fold at 24 h relative to unconditioned medium. Importantly, expression of GPR120, which encodes an n-3 PUFA receptor/sensor, was strongly inhibited by the conditioned medium (15-fold decrease in mRNA at 24 h). Macrophage secretions have major effects on the expression of fatty acid receptor/sensor genes in human adipocytes, which may lead to an augmentation of the inflammatory response in adipose tissue in obesity.

  20. Mining microarray datasets in nutrition: expression of the GPR120 (n-3 fatty acid receptor/sensor) gene is down-regulated in human adipocytes by macrophage secretions

    PubMed Central

    Trayhurn, Paul; Denyer, Gareth

    2012-01-01

    Microarray datasets are a rich source of information in nutritional investigation. Targeted mining of microarray data following initial, non-biased bioinformatic analysis can provide key insight into specific genes and metabolic processes of interest. Microarrays from human adipocytes were examined to explore the effects of macrophage secretions on the expression of the G-protein-coupled receptor (GPR) genes that encode fatty acid receptors/sensors. Exposure of the adipocytes to macrophage-conditioned medium for 4 or 24 h had no effect on GPR40 and GPR43 expression, but there was a marked stimulation of GPR84 expression (receptor for medium-chain fatty acids), the mRNA level increasing 13·5-fold at 24 h relative to unconditioned medium. Importantly, expression of GPR120, which encodes an n-3 PUFA receptor/sensor, was strongly inhibited by the conditioned medium (15-fold decrease in mRNA at 24 h). Macrophage secretions have major effects on the expression of fatty acid receptor/sensor genes in human adipocytes, which may lead to an augmentation of the inflammatory response in adipose tissue in obesity. PMID:25191551

  1. A Pathway Based Classification Method for Analyzing Gene Expression for Alzheimer's Disease Diagnosis.

    PubMed

    Voyle, Nicola; Keohane, Aoife; Newhouse, Stephen; Lunnon, Katie; Johnston, Caroline; Soininen, Hilkka; Kloszewska, Iwona; Mecocci, Patrizia; Tsolaki, Magda; Vellas, Bruno; Lovestone, Simon; Hodges, Angela; Kiddle, Steven; Dobson, Richard Jb

    2016-01-01

    Recent studies indicate that gene expression levels in blood may be able to differentiate subjects with Alzheimer's disease (AD) from normal elderly controls and mild cognitively impaired (MCI) subjects. However, there is limited replicability at the single marker level. A pathway-based interpretation of gene expression may prove more robust. This study aimed to investigate whether a case/control classification model built on pathway level data was more robust than a gene level model and may consequently perform better in test data. The study used two batches of gene expression data from the AddNeuroMed (ANM) and Dementia Case Registry (DCR) cohorts. Our study used Illumina Human HT-12 Expression BeadChips to collect gene expression from blood samples. Random forest modeling with recursive feature elimination was used to predict case/control status. Age and APOE ɛ4 status were used as covariates for all analysis. Gene and pathway level models performed similarly to each other and to a model based on demographic information only. Any potential increase in concordance from the novel pathway level approach used here has not led to a greater predictive ability in these datasets. However, we have only tested one method for creating pathway level scores. Further, we have been able to benchmark pathways against genes in datasets that had been extensively harmonized. Further work should focus on the use of alternative methods for creating pathway level scores, in particular those that incorporate pathway topology, and the use of an endophenotype based approach.

  2. Estimating mutual information using B-spline functions – an improved similarity measure for analysing gene expression data

    PubMed Central

    Daub, Carsten O; Steuer, Ralf; Selbig, Joachim; Kloska, Sebastian

    2004-01-01

    Background: The information theoretic concept of mutual information provides a general framework to evaluate dependencies between variables. In the context of the clustering of genes with similar patterns of expression it has been suggested as a general quantity of similarity to extend commonly used linear measures. Since mutual information is defined in terms of discrete variables, its application to continuous data requires the use of binning procedures, which can lead to significant numerical errors for datasets of small or moderate size. Results: In this work, we propose a method for the numerical estimation of mutual information from continuous data. We investigate the characteristic properties arising from the application of our algorithm and show that our approach outperforms commonly used algorithms: The significance, as a measure of the power of distinction from random correlation, is significantly increased. This concept is subsequently illustrated on two large-scale gene expression datasets and the results are compared to those obtained using other similarity measures. A C++ source code of our algorithm is available for non-commercial use from kloska@scienion.de upon request. Conclusion: The utilisation of mutual information as similarity measure enables the detection of non-linear correlations in gene expression datasets. Frequently applied linear correlation measures, which are often used on an ad-hoc basis without further justification, are thereby extended. PMID:15339346
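
    The binning procedure that this record describes as the conventional starting point can be sketched as follows. This is the plain equal-width histogram estimate, not the authors' B-spline estimator (their C++ code is available only on request). The function names and the default of four bins are assumptions made for illustration.

```python
import math
from collections import Counter


def bin_index(value, lo, hi, n_bins):
    """Map a value in [lo, hi] to an equal-width bin index in 0..n_bins-1."""
    if hi == lo:
        return 0
    idx = int((value - lo) / (hi - lo) * n_bins)
    return min(idx, n_bins - 1)  # the maximum value falls into the last bin


def binned_mutual_information(x, y, n_bins=4):
    """Histogram estimate of the mutual information (in bits) between x and y."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    bx = [bin_index(v, min(x), max(x), n_bins) for v in x]
    by = [bin_index(v, min(y), max(y), n_bins) for v in y]
    joint = Counter(zip(bx, by))  # counts per (bin of x, bin of y)
    px = Counter(bx)              # marginal counts for x
    py = Counter(by)              # marginal counts for y
    mi = 0.0
    for (i, j), count in joint.items():
        p_ij = count / n
        # p_ij / (p_i * p_j) rewritten in terms of raw counts
        mi += p_ij * math.log2(count * n / (px[i] * py[j]))
    return mi
```

    Each data point is assigned to exactly one bin here, which is the source of the numerical error the abstract mentions for small samples: a value near a bin boundary can switch bins under small perturbations and change the estimate.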

  3. Impact of missing data imputation methods on gene expression clustering and classification.

    PubMed

    de Souto, Marcilio C P; Jaskowiak, Pablo A; Costa, Ivan G

    2015-02-26

    Several missing value imputation methods for gene expression data have been proposed in the literature. In the past few years, researchers have been putting a great deal of effort into presenting systematic evaluations of the different imputation algorithms. Initially, most algorithms were assessed with an emphasis on the accuracy of the imputation, using metrics such as the root mean squared error. However, it has become clear that the success of the estimation of the expression value should be evaluated in more practical terms as well. One can consider, for example, the ability of the method to preserve the significant genes in the dataset, or its discriminative/predictive power for classification/clustering purposes. We performed a broad analysis of the impact of five well-known missing value imputation methods on three clustering and four classification methods, in the context of 12 cancer gene expression datasets. We employed a statistical framework, for the first time in this field, to assess whether different imputation methods improve the performance of the clustering/classification methods. Our results suggest that the imputation methods evaluated have a minor impact on the classification and downstream clustering analyses. Simple methods such as replacing the missing values by mean or the median values performed as well as more complex strategies. The datasets analyzed in this study are available at http://costalab.org/Imputation/.
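
    The simplest strategy named in this record, replacing missing values by the mean, together with the root mean squared error used to score imputation accuracy, can be sketched as below. The abstract does not say whether means are taken per gene or per sample; the per-row (per-gene) choice, the use of `None` for missing entries, and both function names are assumptions for illustration.

```python
import math


def impute_row_means(matrix):
    """Replace None entries with the mean of the observed values in the same row."""
    imputed = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        if not observed:
            raise ValueError("cannot impute a row with no observed values")
        mean = sum(observed) / len(observed)
        imputed.append([mean if v is None else v for v in row])
    return imputed


def rmse_on_missing(truth, masked, imputed):
    """Root mean squared error over the entries that were masked (None)."""
    errors = [
        (t - e) ** 2
        for t_row, m_row, e_row in zip(truth, masked, imputed)
        for t, m, e in zip(t_row, m_row, e_row)
        if m is None
    ]
    if not errors:
        raise ValueError("no masked entries to evaluate")
    return math.sqrt(sum(errors) / len(errors))


# Toy matrix (rows = genes, columns = samples); one value is hidden on purpose.
truth = [[1.0, 5.0, 3.0]]
masked = [[1.0, None, 3.0]]
filled = impute_row_means(masked)
```

    Scoring only the masked entries mirrors the usual evaluation design, in which known values are hidden and then compared with their imputed replacements.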

  4. Identification of Differentially Expressed Genes through Integrated Study of Alzheimer's Disease Affected Brain Regions.

    PubMed

    Puthiyedth, Nisha; Riveros, Carlos; Berretta, Regina; Moscato, Pablo

    2016-01-01

    Alzheimer's disease (AD) is the most common form of dementia in older adults that damages the brain and results in impaired memory, thinking and behaviour. The identification of differentially expressed genes and related pathways among affected brain regions can provide more information on the mechanisms of AD. In the past decade, several studies have reported many genes that are associated with AD. This wealth of information has become difficult to follow and interpret as most of the results are conflicting. In that case, it is worth doing an integrated study of multiple datasets that helps to increase the total number of samples and the statistical power in detecting biomarkers. In this study, we present an integrated analysis of five different brain region datasets and introduce new genes that warrant further investigation. The aim of our study is to apply a novel combinatorial optimisation based meta-analysis approach to identify differentially expressed genes that are associated with AD across brain regions. In this study, microarray gene expression data from 161 samples (74 non-demented controls, 87 AD) from the Entorhinal Cortex (EC), Hippocampus (HIP), Middle temporal gyrus (MTG), Posterior cingulate cortex (PC), Superior frontal gyrus (SFG) and visual cortex (VCX) brain regions were integrated and analysed using our method. The results are then compared to two popular meta-analysis methods, RankProd and GeneMeta, and to what can be obtained by analysing the individual datasets. We find genes related with AD that are consistent with existing studies, and new candidate genes not previously related with AD. Our study confirms the up-regulation of INFAR2 and PTMA along with the down-regulation of GPHN, RAB2A, PSMD14 and FGF. Novel genes PSMB2, WNK1, RPL15, SEMA4C, RWDD2A and LARGE are found to be differentially expressed across all brain regions. Further investigation on these genes may provide new insights into the development of AD. In addition, we identified the presence of 23 non-coding features, including four miRNA precursors (miR-7, miR570, miR-1229 and miR-6821), dysregulated across the brain regions. Furthermore, we compared our results with two popular meta-analysis methods RankProd and GeneMeta to validate our findings and performed a sensitivity analysis by removing one dataset at a time to assess the robustness of our results. These new findings may provide new insights into the disease mechanisms and thus make a significant contribution in the near future towards understanding, prevention and cure of AD.

  5. The Physcomitrella patens gene atlas project: large-scale RNA-seq based expression data.

    PubMed

    Perroud, Pierre-François; Haas, Fabian B; Hiss, Manuel; Ullrich, Kristian K; Alboresi, Alessandro; Amirebrahimi, Mojgan; Barry, Kerrie; Bassi, Roberto; Bonhomme, Sandrine; Chen, Haodong; Coates, Juliet C; Fujita, Tomomichi; Guyon-Debast, Anouchka; Lang, Daniel; Lin, Junyan; Lipzen, Anna; Nogué, Fabien; Oliver, Melvin J; Ponce de León, Inés; Quatrano, Ralph S; Rameau, Catherine; Reiss, Bernd; Reski, Ralf; Ricca, Mariana; Saidi, Younousse; Sun, Ning; Szövényi, Péter; Sreedasyam, Avinash; Grimwood, Jane; Stacey, Gary; Schmutz, Jeremy; Rensing, Stefan A

    2018-07-01

    High-throughput RNA sequencing (RNA-seq) has recently become the method of choice to define and analyze transcriptomes. For the model moss Physcomitrella patens, although this method has been used to help analyze specific perturbations, no overall reference dataset has yet been established. In the framework of the Gene Atlas project, the Joint Genome Institute selected P. patens as a flagship genome, opening the way to generate the first comprehensive transcriptome dataset for this moss. The first round of sequencing described here is composed of 99 independent libraries spanning 34 different developmental stages and conditions. Upon dataset quality control and processing through read mapping, 28 509 of the 34 361 v3.3 gene models (83%) were detected to be expressed across the samples. Differentially expressed genes (DEGs) were calculated across the dataset to permit perturbation comparisons between conditions. The analysis of the three most distinct and abundant P. patens growth stages - protonema, gametophore and sporophyte - allowed us to define both general transcriptional patterns and stage-specific transcripts. As an example of variation of physico-chemical growth conditions, we detail here the impact of ammonium supplementation under standard growth conditions on the protonemal transcriptome. Finally, the cooperative nature of this project allowed us to analyze inter-laboratory variation, as 13 different laboratories around the world provided samples. We compare differences in the replication of experiments in a single laboratory and between different laboratories. © 2018 The Authors The Plant Journal © 2018 John Wiley & Sons Ltd.

  6. The Ability of Different Imputation Methods to Preserve the Significant Genes and Pathways in Cancer.

    PubMed

    Aghdam, Rosa; Baghfalaki, Taban; Khosravi, Pegah; Saberi Ansari, Elnaz

    2017-12-01

    Deciphering important genes and pathways from incomplete gene expression data could facilitate a better understanding of cancer. Different imputation methods can be applied to estimate the missing values. In our study, we evaluated various imputation methods for their performance in preserving significant genes and pathways. In the first step, 5% genes are considered in random for two types of ignorable and non-ignorable missingness mechanisms with various missing rates. Next, 10 well-known imputation methods were applied to the complete datasets. The significance analysis of microarrays (SAM) method was applied to detect the significant genes in rectal and lung cancers to showcase the utility of imputation approaches in preserving significant genes. To determine the impact of different imputation methods on the identification of important genes, the chi-squared test was used to compare the proportions of overlaps between significant genes detected from original data and those detected from the imputed datasets. Additionally, the significant genes are tested for their enrichment in important pathways, using the ConsensusPathDB. Our results showed that almost all the significant genes and pathways of the original dataset can be detected in all imputed datasets, indicating that there is no significant difference in the performance of various imputation methods tested. The source code and selected datasets are available on http://profiles.bs.ipm.ir/softwares/imputation_methods/. Copyright © 2017. Production and hosting by Elsevier B.V.

  7. Digital gene expression profiling analysis and its application in the identification of genes associated with improved response to neoadjuvant chemotherapy in breast cancer.

    PubMed

    Liu, Xiaozhen; Jin, Gan; Qian, Jiacheng; Yang, Hongjian; Tang, Hongchao; Meng, Xuli; Li, Yongfeng

    2018-04-23

    This study aimed to screen sensitive biomarkers for the efficacy evaluation of neoadjuvant chemotherapy in breast cancer. In this study, Illumina digital gene expression sequencing technology was applied and differentially expressed genes (DEGs) between patients presenting pathological complete response (pCR) and non-pathological complete response (NpCR) were identified. Further, gene ontology and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis were then performed. The genes in significant enriched pathways were finally quantified by quantitative real-time PCR (qRT-PCR) to confirm that they were differentially expressed. Additionally, GSE23988 from Gene Expression Omnibus database was used as the validation dataset to confirm the DEGs. After removing the low-quality reads, 715 DEGs were finally detected. After mapping to KEGG pathways, 10 DEGs belonging to the ubiquitin proteasome pathway (HECTD3, PSMB10, UBD, UBE2C, and UBE2S) and cytokine-cytokine receptor interactions (CCL2, CCR1, CXCL10, CXCL11, and IL2RG) were selected for further analysis. These 10 genes were finally quantified by qRT-PCR to confirm that they were differentially expressed (the log2 fold changes of selected genes were -5.34, 7.81, 6.88, 5.74, 3.11, 19.58, 8.73, 8.88, 7.42, and 34.61 for HECTD3, PSMB10, UBD, UBE2C, UBE2S, CCL2, CCR1, CXCL10, CXCL11, and IL2RG, respectively). Moreover, 53 common genes were confirmed by the validation dataset, including downregulated UBE2C and UBE2S. Our results suggested that these 10 genes belonging to these two pathways might be useful as sensitive biomarkers for the efficacy evaluation of neoadjuvant chemotherapy in breast cancer.

  8. Robust diagnosis of non-Hodgkin lymphoma phenotypes validated on gene expression data from different laboratories.

    PubMed

    Bhanot, Gyan; Alexe, Gabriela; Levine, Arnold J; Stolovitzky, Gustavo

    2005-01-01

    A major challenge in cancer diagnosis from microarray data is the need for robust, accurate, classification models which are independent of the analysis techniques used and can combine data from different laboratories. We propose such a classification scheme originally developed for phenotype identification from mass spectrometry data. The method uses a robust multivariate gene selection procedure and combines the results of several machine learning tools trained on raw and pattern data to produce an accurate meta-classifier. We illustrate and validate our method by applying it to gene expression datasets: the oligonucleotide HuGeneFL microarray dataset of Shipp et al. (www.genome.wi.mit.du/MPR/lymphoma) and the Hu95Av2 Affymetrix dataset (DallaFavera's laboratory, Columbia University). Our pattern-based meta-classification technique achieves higher predictive accuracies than each of the individual classifiers, is robust against data perturbations and provides subsets of related predictive genes. Our techniques predict that combinations of some genes in the p53 pathway are highly predictive of phenotype. In particular, we find that in 80% of DLBCL cases the mRNA level of at least one of the three genes p53, PLK1 and CDK2 is elevated, while in 80% of FL cases, the mRNA level of at most one of them is elevated.

  9. A gene expression inflammatory signature specifically predicts multiple myeloma evolution and patients survival.

    PubMed

    Botta, C; Di Martino, M T; Ciliberto, D; Cucè, M; Correale, P; Rossi, M; Tagliaferri, P; Tassone, P

    2016-12-16

    Multiple myeloma (MM) is closely dependent on cross-talk between malignant plasma cells and cellular components of the inflammatory/immunosuppressive bone marrow milieu, which promotes disease progression, drug resistance, neo-angiogenesis, bone destruction and immune-impairment. We investigated the relevance of inflammatory genes in predicting disease evolution and patient survival. A bioinformatics study by Ingenuity Pathway Analysis on gene expression profiling dataset of monoclonal gammopathy of undetermined significance, smoldering and symptomatic-MM, identified inflammatory and cytokine/chemokine pathways as the most progressively affected during disease evolution. We then selected 20 candidate genes involved in B-cell inflammation and we investigated their role in predicting clinical outcome, through univariate and multivariate analyses (log-rank test, logistic regression and Cox-regression model). We defined an 8-genes signature (IL8, IL10, IL17A, CCL3, CCL5, VEGFA, EBI3 and NOS2) identifying each condition (MGUS/smoldering/symptomatic-MM) with 84% accuracy. Moreover, six genes (IFNG, IL2, LTA, CCL2, VEGFA, CCL3) were found independently correlated with patients' survival. Patients whose MM cells expressed high levels of Th1 cytokines (IFNG/LTA/IL2/CCL2) and low levels of CCL3 and VEGFA, experienced the longest survival. On these six genes, we built a prognostic risk score that was validated in three additional independent datasets. In this study, we provide proof-of-concept that inflammation has a critical role in MM patient progression and survival. The inflammatory-gene prognostic signature validated in different datasets clearly indicates novel opportunities for personalized anti-MM treatment.

  10. Diurnal Transcriptome and Gene Network Represented through Sparse Modeling in Brachypodium distachyon.

    PubMed

    Koda, Satoru; Onda, Yoshihiko; Matsui, Hidetoshi; Takahagi, Kotaro; Yamaguchi-Uehara, Yukiko; Shimizu, Minami; Inoue, Komaki; Yoshida, Takuhiro; Sakurai, Tetsuya; Honda, Hiroshi; Eguchi, Shinto; Nishii, Ryuei; Mochida, Keiichi

    2017-01-01

    We report the comprehensive identification of periodic genes and their network inference, based on a gene co-expression analysis and an Auto-Regressive eXogenous (ARX) model with a group smoothly clipped absolute deviation (SCAD) method using a time-series transcriptome dataset in a model grass, Brachypodium distachyon. To reveal the diurnal changes in the transcriptome in B. distachyon, we performed RNA-seq analysis of its leaves sampled through a diurnal cycle of over 48 h at 4 h intervals using three biological replications, and identified 3,621 periodic genes through our wavelet analysis. The expression data are feasible to infer network sparsity based on ARX models. We found that genes involved in biological processes such as transcriptional regulation, protein degradation, and post-transcriptional modification and photosynthesis are significantly enriched in the periodic genes, suggesting that these processes might be regulated by circadian rhythm in B. distachyon. On the basis of the time-series expression patterns of the periodic genes, we constructed a chronological gene co-expression network and identified putative transcription factors encoding genes that might be involved in the time-specific regulatory transcriptional network. Moreover, we inferred a transcriptional network composed of the periodic genes in B. distachyon, aiming to identify genes associated with other genes through variable selection by grouping time points for each gene. Based on the ARX model with the group SCAD regularization using our time-series expression datasets of the periodic genes, we constructed gene networks and found that the networks represent typical scale-free structure. Our findings demonstrate that the diurnal changes in the transcriptome in B. distachyon leaves have a sparse network structure, demonstrating the spatiotemporal gene regulatory network over the cyclic phase transitions in B. distachyon diurnal growth.

  11. FastGCN: A GPU Accelerated Tool for Fast Gene Co-Expression Networks

    PubMed Central

    Liang, Meimei; Zhang, Futao; Jin, Gulei; Zhu, Jun

    2015-01-01

    Gene co-expression networks comprise one type of valuable biological networks. Many methods and tools have been published to construct gene co-expression networks; however, most of these tools and methods are inconvenient and time consuming for large datasets. We have developed a user-friendly, accelerated and optimized tool for constructing gene co-expression networks that can fully harness the parallel nature of GPU (Graphic Processing Unit) architectures. Genetic entropies were exploited to filter out genes with no or small expression changes in the raw data preprocessing step. Pearson correlation coefficients were then calculated. After that, we normalized these coefficients and employed the False Discovery Rate to control the multiple tests. At last, modules identification was conducted to construct the co-expression networks. All of these calculations were implemented on a GPU. We also compressed the coefficient matrix to save space. We compared the performance of the GPU implementation with those of multi-core CPU implementations with 16 CPU threads, single-thread C/C++ implementation and single-thread R implementation. Our results show that GPU implementation largely outperforms single-thread C/C++ implementation and single-thread R implementation, and GPU implementation outperforms multi-core CPU implementation when the number of genes increases. With the test dataset containing 16,000 genes and 590 individuals, we can achieve greater than 63 times the speed using a GPU implementation compared with a single-thread R implementation when 50 percent of genes were filtered out and about 80 times the speed when no genes were filtered out. PMID:25602758
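
    The core of the pipeline described in this record (Pearson correlation, a normalization of the coefficients, then False Discovery Rate control) can be sketched on the CPU as below. This is not FastGCN's GPU code. The abstract does not say which normalization is used, so the Fisher z-transformation here is an assumption, as are the Benjamini-Hochberg procedure for the FDR step and all function names.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant profile has undefined correlation")
    return sxy / math.sqrt(sxx * syy)


def fisher_z_pvalue(r, n):
    """Two-sided normal-approximation p-value for r from n samples (assumed normalization)."""
    if n <= 3:
        raise ValueError("Fisher z-transformation needs more than 3 samples")
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def benjamini_hochberg(pvalues, alpha=0.05):
    """Return booleans marking which hypotheses are rejected at FDR level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            cutoff_rank = rank  # step-up: keep the largest rank that passes
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[i] = True
    return rejected
```

    Gene pairs whose correlation survives the FDR step would become edges of the co-expression network; the all-pairs correlation is the part that is naturally parallel and that the record reports moving to the GPU.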

  12. FastGCN: a GPU accelerated tool for fast gene co-expression networks.

    PubMed

    Liang, Meimei; Zhang, Futao; Jin, Gulei; Zhu, Jun

    2015-01-01

    Gene co-expression networks comprise one type of valuable biological networks. Many methods and tools have been published to construct gene co-expression networks; however, most of these tools and methods are inconvenient and time consuming for large datasets. We have developed a user-friendly, accelerated and optimized tool for constructing gene co-expression networks that can fully harness the parallel nature of GPU (Graphic Processing Unit) architectures. Genetic entropies were exploited to filter out genes with no or small expression changes in the raw data preprocessing step. Pearson correlation coefficients were then calculated. After that, we normalized these coefficients and employed the False Discovery Rate to control the multiple tests. At last, modules identification was conducted to construct the co-expression networks. All of these calculations were implemented on a GPU. We also compressed the coefficient matrix to save space. We compared the performance of the GPU implementation with those of multi-core CPU implementations with 16 CPU threads, single-thread C/C++ implementation and single-thread R implementation. Our results show that GPU implementation largely outperforms single-thread C/C++ implementation and single-thread R implementation, and GPU implementation outperforms multi-core CPU implementation when the number of genes increases. With the test dataset containing 16,000 genes and 590 individuals, we can achieve greater than 63 times the speed using a GPU implementation compared with a single-thread R implementation when 50 percent of genes were filtered out and about 80 times the speed when no genes were filtered out.

  13. Meta-Analysis of Multiple Sclerosis Microarray Data Reveals Dysregulation in RNA Splicing Regulatory Genes.

    PubMed

    Paraboschi, Elvezia Maria; Cardamone, Giulia; Rimoldi, Valeria; Gemmati, Donato; Spreafico, Marta; Duga, Stefano; Soldà, Giulia; Asselta, Rosanna

    2015-09-30

    Abnormalities in RNA metabolism and alternative splicing (AS) are emerging as important players in complex disease phenotypes. In particular, accumulating evidence suggests the existence of pathogenic links between multiple sclerosis (MS) and altered AS, including functional studies showing that an imbalance in alternatively-spliced isoforms may contribute to disease etiology. Here, we tested whether the altered expression of AS-related genes represents a MS-specific signature. A comprehensive comparative analysis of gene expression profiles of publicly-available microarray datasets (190 MS cases, 182 controls), followed by gene-ontology enrichment analysis, highlighted a significant enrichment for differentially-expressed genes involved in RNA metabolism/AS. In detail, a total of 17 genes were found to be differentially expressed in MS in multiple datasets, with CELF1 being dysregulated in five out of seven studies. We confirmed CELF1 downregulation in MS (p=0.0015) by real-time RT-PCRs on RNA extracted from blood cells of 30 cases and 30 controls. As a proof of concept, we experimentally verified the unbalance in alternatively-spliced isoforms in MS of the NFAT5 gene, a putative CELF1 target. In conclusion, for the first time we provide evidence of a consistent dysregulation of splicing-related genes in MS and we discuss its possible implications in modulating specific AS events in MS susceptibility genes.

  14. Genetic regulation of gene expression in the lung identifies CST3 and CD22 as potential causal genes for airflow obstruction.

    PubMed

    Lamontagne, Maxime; Timens, Wim; Hao, Ke; Bossé, Yohan; Laviolette, Michel; Steiling, Katrina; Campbell, Joshua D; Couture, Christian; Conti, Massimo; Sherwood, Karen; Hogg, James C; Brandsma, Corry-Anke; van den Berge, Maarten; Sandford, Andrew; Lam, Stephen; Lenburg, Marc E; Spira, Avrum; Paré, Peter D; Nickle, David; Sin, Don D; Postma, Dirkje S

    2014-11-01

    COPD is a complex chronic disease with poorly understood pathogenesis. Integrative genomic approaches have the potential to elucidate the biological networks underlying COPD and lung function. We recently combined genome-wide genotyping and gene expression in 1111 human lung specimens to map expression quantitative trait loci (eQTL). To determine causal associations between COPD and lung function-associated single nucleotide polymorphisms (SNPs) and lung tissue gene expression changes in our lung eQTL dataset. We evaluated causality between SNPs and gene expression for three COPD phenotypes: FEV(1)% predicted, FEV(1)/FVC and COPD as a categorical variable. Different models were assessed in the three cohorts independently and in a meta-analysis. SNPs associated with a COPD phenotype and gene expression were subjected to causal pathway modelling and manual curation. In silico analyses evaluated functional enrichment of biological pathways among newly identified causal genes. Biologically relevant causal genes were validated in two separate gene expression datasets of lung tissues and bronchial airway brushings. High reliability causal relations were found in SNP-mRNA-phenotype triplets for FEV(1)% predicted (n=169) and FEV(1)/FVC (n=80). Several genes of potential biological relevance for COPD were revealed. eQTL-SNPs upregulating cystatin C (CST3) and CD22 were associated with worse lung function. Signalling pathways enriched with causal genes included xenobiotic metabolism, apoptosis, protease-antiprotease and oxidant-antioxidant balance. By using integrative genomics and analysing the relationships of COPD phenotypes with SNPs and gene expression in lung tissue, we identified CST3 and CD22 as potential causal genes for airflow obstruction. This study also augmented the understanding of previously described COPD pathways. Published by the BMJ Publishing Group Limited. 

  15. MiRNA-TF-gene network analysis through ranking of biomolecules for multi-informative uterine leiomyoma dataset.

    PubMed

    Mallik, Saurav; Maulik, Ujjwal

    2015-10-01

    Gene ranking is an important problem in bioinformatics. Here, we propose a new framework for ranking biomolecules (viz., miRNAs, transcription-factors/TFs and genes) in a multi-informative uterine leiomyoma dataset having both gene expression and methylation data using (statistical) eigenvector centrality based approach. At first, genes that are both differentially expressed and methylated, are identified using Limma statistical test. A network, comprising these genes, corresponding TFs from TRANSFAC and ITFP databases, and targeter miRNAs from miRWalk database, is then built. The biomolecules are then ranked based on eigenvector centrality. Our proposed method provides better average accuracy in hub gene and non-hub gene classifications than other methods. Furthermore, pre-ranked Gene set enrichment analysis is applied on the pathway database as well as GO-term databases of Molecular Signatures Database with providing a pre-ranked gene-list based on different centrality values for comparing among the ranking methods. Finally, top novel potential gene-markers for the uterine leiomyoma are provided. Copyright © 2015 Elsevier Inc. All rights reserved.
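
    The ranking step in this record relies on eigenvector centrality, which scores a node highly when its neighbours are themselves highly scored. A minimal power-iteration sketch follows, assuming an undirected, connected network stored as a dictionary of neighbour sets. The node names in the example are hypothetical placeholders, and this is not the authors' implementation.

```python
import math


def eigenvector_centrality(adjacency, max_iter=1000, tol=1e-10):
    """Eigenvector centrality for an undirected graph given as {node: set(neighbours)}."""
    nodes = sorted(adjacency)
    score = {v: 1.0 for v in nodes}
    for _ in range(max_iter):
        # Iterate with (A + I): it has the same eigenvectors as A, but the
        # iteration does not oscillate on bipartite graphs.
        new = {v: score[v] + sum(score[u] for u in adjacency[v]) for v in nodes}
        norm = math.sqrt(sum(s * s for s in new.values()))
        new = {v: s / norm for v, s in new.items()}
        if max(abs(new[v] - score[v]) for v in nodes) < tol:
            return new
        score = new
    raise RuntimeError("power iteration did not converge")


# Hypothetical toy network: one transcription factor linked to three genes.
toy_network = {
    "TF_a": {"gene_1", "gene_2", "gene_3"},
    "gene_1": {"TF_a"},
    "gene_2": {"TF_a"},
    "gene_3": {"TF_a"},
}
centrality = eigenvector_centrality(toy_network)
ranking = sorted(centrality, key=centrality.get, reverse=True)
```

    Sorting nodes by the returned score gives a ranking in which the hub of the toy network comes first.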

  16. Meta-analysis of gene expression profiles associated with histological classification and survival in 829 ovarian cancer samples.

    PubMed

    Fekete, Tibor; Rásó, Erzsébet; Pete, Imre; Tegze, Bálint; Liko, István; Munkácsy, Gyöngyi; Sipos, Norbert; Rigó, János; Györffy, Balázs

    2012-07-01

    Transcriptomic analysis of global gene expression in ovarian carcinoma can identify dysregulated genes capable of serving as molecular markers for histology subtypes and survival. The aim of our study was to validate previous candidate signatures in an independent setting and to identify single genes capable of serving as biomarkers for ovarian cancer progression. As several datasets are available in the GEO today, we were able to perform a true meta-analysis. First, 829 samples (11 datasets) were downloaded, and the predictive power of 16 previously published gene sets was assessed. Of these, eight were capable of discriminating histology subtypes, and none was capable of predicting survival. To overcome the differences in previous studies, we used the 829 samples to identify new predictors. Then, we collected 64 ovarian cancer samples (median relapse-free survival 24.5 months) and performed TaqMan Real Time Polymerase Chain Reaction (RT-PCR) analysis for the best 40 genes associated with histology subtypes and survival. Over 90% of subtype-associated genes were confirmed. Overall survival was effectively predicted by hormone receptors (PGR and ESR2) and by TSPAN8. Relapse-free survival was predicted by MAPT and SNCG. In summary, we successfully validated several gene sets in a meta-analysis in large datasets of ovarian samples. Additionally, several individual genes identified were validated in a clinical cohort. Copyright © 2011 UICC.

  17. EG-05 COMBINATION OF GENE COPY GAIN AND EPIGENETIC DEREGULATION ARE ASSOCIATED WITH THE ABERRANT EXPRESSION OF A STEM CELL RELATED HOX-SIGNATURE IN GLIOBLASTOMA

    PubMed Central

    Kurscheid, Sebastian; Bady, Pierre; Sciuscio, Davide; Samarzija, Ivana; Shay, Tal; Vassallo, Irene; Van Criekinge, Wim; Domany, Eytan; Stupp, Roger; Delorenzi, Mauro; Hegi, Monika

    2014-01-01

    We previously reported a stem cell related HOX gene signature associated with resistance to chemo-radiotherapy (TMZ/RT→TMZ) in glioblastoma. However, underlying mechanisms triggering overexpression remain mostly elusive. Interestingly, HOX genes are neither involved in the developing brain, nor expressed in normal brain, suggestive of an acquired gene expression signature during gliomagenesis. HOXA genes are located on chromosome 7, which displays trisomy in most glioblastoma; this strongly impacts gene expression on this chromosome, modulated by local regulatory elements. Furthermore, we observed more pronounced DNA methylation across the HOXA locus as compared to non-tumoral brain (Human methylation 450K BeadChip Illumina; 59 glioblastoma, 5 non-tumoral brain samples). CpG probes annotated for HOX-signature genes, contributing most to the variability, served as input into the analysis of DNA methylation and expression to identify key regulatory regions. The structural similarity of the observed correlation matrices between DNA methylation and gene expression in our cohort and an independent data-set from TCGA (106 glioblastoma) was remarkable (RV-coefficient, 0.84; p-value < 0.0001). We identified a CpG located in the promoter region of the HOXA10 locus exerting the strongest mean negative correlation between methylation and expression of the whole HOX-signature. Applying this analysis, the same CpG emerged in the external set. We then determined the contribution of both gene copy aberration (CNA) and methylation at the selected probe to explain expression of the HOX-signature using a linear model. Statistically significant results suggested an additive effect between gene dosage and methylation at the key CpG identified. Similarly, such an additive effect was also observed in the external data-set. 
Taken together, we hypothesize that overexpression of the stem-cell related HOX signature is triggered by gain of trisomy 7 and escape from compensatory DNA methylation at positions controlling the effect of enhanced gene dose on expression.

  18. Acute hypoxia stress induced abundant differential expression genes and alternative splicing events in heart of tilapia.

    PubMed

    Xia, Jun Hong; Li, Hong Lian; Li, Bi Jun; Gu, Xiao Hui; Lin, Hao Ran

    2018-01-10

    Hypoxia is one of the critical environmental stressors for fish in aquatic environments. Although accumulating evidence indicates that gene expression is regulated by hypoxia stress in fish, how genes undergo differential expression and/or alternative splicing (AS) in the heart in response to hypoxia stress is not well understood. Using RNA-seq, we surveyed and detected 289 differentially expressed genes (DEG) and 103 genes that undergo differential usage of exons and splice junctions events (DUES) in the heart of a hypoxia-tolerant fish, Nile tilapia (Oreochromis niloticus), following 12h hypoxic treatment. The spatio-temporal expression analysis validated the significant association of differential exon usages in two randomly selected DUES genes (fam162a and ndrg2) in 5 tissues (heart, liver, brain, gill and spleen) sampled at three time points (6h, 12h, and 24h) under acute hypoxia treatment. Functional analysis significantly associated the differentially expressed genes with the categories related to energy conservation, protein synthesis and immune response. Different enrichment categories were found between the DEG and DUES datasets. The Isomerase activity, Oxidoreductase activity, Glycolysis and Oxidative stress process were significantly enriched for the DEG gene dataset, but the Structural constituent of ribosome and Structural molecule activity, Ribosomal protein and RNA binding protein were significantly enriched only for the DUES genes. Our comparative transcriptomic analysis reveals abundant stress-responsive genes and their differential regulation function in the heart tissues of Nile tilapia under acute hypoxia stress. Our findings will facilitate future investigation on transcriptome complexity and AS regulation during hypoxia stress in fish. Copyright © 2017 Elsevier B.V. All rights reserved.

  19. Integrated Analyses of Gene Expression Profiles Digs out Common Markers for Rheumatic Diseases

    PubMed Central

    Wang, Lan; Wu, Long-Fei; Lu, Xin; Mo, Xing-Bo; Tang, Zai-Xiang; Lei, Shu-Feng; Deng, Fei-Yan

    2015-01-01

    Objective Rheumatic diseases have some common symptoms. Extensive gene expression studies, accumulated thus far, have successfully identified signature molecules for each rheumatic disease, individually. However, whether there exist shared factors across rheumatic diseases has yet to be tested. Methods We collected and utilized 6 public microarray datasets covering 4 types of representative rheumatic diseases including rheumatoid arthritis, systemic lupus erythematosus, ankylosing spondylitis, and osteoarthritis. Then we detected overlaps of differentially expressed genes across datasets and performed a meta-analysis aiming at identifying common differentially expressed genes that discriminate between pathological cases and normal controls. To further gain insights into the functions of the identified common differentially expressed genes, we conducted gene ontology enrichment analysis and protein-protein interaction analysis. Results We identified a total of eight differentially expressed genes (TNFSF10, CX3CR1, LY96, TLR5, TXN, TIA1, PRKCH, PRF1), each associated with at least 3 of the 4 studied rheumatic diseases. Meta-analysis warranted the significance of the eight genes and highlighted the general significance of four genes (CX3CR1, LY96, TLR5, and PRF1). Protein-protein interaction and gene ontology enrichment analyses indicated that the eight genes interact with each other to exert functions related to immune response and immune regulation. Conclusion The findings support that there exist common factors underlying rheumatic diseases. For rheumatoid arthritis, systemic lupus erythematosus, ankylosing spondylitis and osteoarthritis diseases, those common factors include TNFSF10, CX3CR1, LY96, TLR5, TXN, TIA1, PRKCH, and PRF1. In-depth studies on these common factors may provide keys to understanding the pathogenesis and developing intervention strategies for rheumatic diseases. PMID:26352601
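
    The overlap step in the Methods above (keeping genes that are differentially expressed in at least 3 of the 4 diseases) can be sketched as a simple counting exercise. The gene labels and set memberships below are hypothetical placeholders, not the study's data, and the threshold parameter name is an assumption made for illustration.

```python
from collections import Counter


def shared_genes(deg_sets, min_diseases=3):
    """Return genes listed as differentially expressed in >= min_diseases sets."""
    counts = Counter(gene for genes in deg_sets.values() for gene in set(genes))
    return sorted(gene for gene, n in counts.items() if n >= min_diseases)


# Hypothetical DEG lists per disease (placeholder gene labels only).
toy_deg_sets = {
    "rheumatoid arthritis": {"GENE_A", "GENE_B", "GENE_C"},
    "systemic lupus erythematosus": {"GENE_A", "GENE_B", "GENE_D"},
    "ankylosing spondylitis": {"GENE_A", "GENE_C", "GENE_D"},
    "osteoarthritis": {"GENE_A", "GENE_B"},
}
common = shared_genes(toy_deg_sets)
print(common)
```

    A count-based overlap like this only nominates candidates; the abstract describes a separate meta-analysis to assess their significance.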

  20. Inferring Time-Varying Network Topologies from Gene Expression Data

    PubMed Central

    Rao, Arvind; Hero, Alfred O; States, David J; Engel, James Douglas

    2007-01-01

    Most current methods for gene regulatory network identification lead to the inference of steady-state networks, that is, networks prevalent over all times, a hypothesis which has been challenged. There has been a need to infer and represent networks in a dynamic, that is, time-varying fashion, in order to account for different cellular states affecting the interactions amongst genes. In this work, we present an approach, regime-SSM, to understand gene regulatory networks within such a dynamic setting. The approach uses a clustering method based on these underlying dynamics, followed by system identification using a state-space model for each learnt cluster, to infer a network adjacency matrix. We finally indicate our results on the mouse embryonic kidney dataset as well as the T-cell activation-based expression dataset and demonstrate conformity with reported experimental evidence. PMID:18309363

  2. Effectively identifying regulatory hotspots while capturing expression heterogeneity in gene expression studies

    PubMed Central

    2014-01-01

    Expression quantitative trait loci (eQTL) mapping is a tool that can systematically identify genetic variation affecting gene expression. eQTL mapping studies have shown that certain genomic locations, referred to as regulatory hotspots, may affect the expression levels of many genes. Recently, studies have shown that various confounding factors may induce spurious regulatory hotspots. Here, we introduce a novel statistical method that effectively eliminates spurious hotspots while retaining genuine hotspots. Applied to simulated and real datasets, we validate that our method achieves greater sensitivity while retaining low false discovery rates compared to previous methods. PMID:24708878

  3. Coagulation factor VII is regulated by androgen receptor in breast cancer.

    PubMed

    Naderi, Ali

    2015-02-01

    Androgen receptor (AR) is widely expressed in breast cancer; however, there is limited information on the key molecular functions and gene targets of AR in this disease. In this study, gene expression data from a cohort of 52 breast cancer cell lines was analyzed to identify a network of AR co-expressed genes. A total of 300 genes, which were significantly enriched for cell cycle and metabolic functions, showed absolute correlation coefficients (|CC|) of more than 0.5 with AR expression across the dataset. In this network, a subset of 35 "AR-signature" genes were highly co-expressed with AR (|CC|>0.6) that included transcriptional regulators PATZ1, NFATC4, and SPDEF. Furthermore, gene encoding coagulation factor VII (F7) demonstrated the closest expression pattern with AR (CC=0.716) in the dataset and factor VII protein expression was significantly associated to that of AR in a cohort of 209 breast tumors. Moreover, functional studies demonstrated that AR activation results in the induction of factor VII expression at both transcript and protein levels and AR directly binds to a proximal region of F7 promoter in breast cancer cells. Importantly, AR activation in breast cancer cells induced endogenous factor VII activity to convert factor X to Xa in conjunction with tissue factor. In summary, F7 is a novel AR target gene and AR activation regulates the ectopic expression and activity of factor VII in breast cancer cells. These findings have functional implications in the pathobiology of thromboembolic events and regulation of factor VII/tissue factor signaling in breast cancer. Copyright © 2014 Elsevier Inc. All rights reserved.
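
    The co-expression screen described above (genes whose absolute correlation coefficient with AR expression exceeds 0.5) can be sketched as follows. The abstract does not state which correlation measure was used, so Pearson correlation is an assumption here, and the expression values and gene labels are invented toy data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def coexpressed(target, profiles, threshold=0.5):
    """Genes whose |correlation| with the target profile exceeds threshold."""
    hits = {}
    for gene, values in profiles.items():
        cc = pearson(target, values)
        if abs(cc) > threshold:
            hits[gene] = cc
    return hits


# Hypothetical expression of a target gene across five cell lines.
target_profile = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_profiles = {
    "gene_up": [2.0, 4.0, 6.0, 8.0, 10.0],   # rises with the target
    "gene_down": [5.0, 4.0, 3.0, 2.0, 1.0],  # falls as the target rises
    "gene_flat": [3.0, 1.0, 3.0, 1.0, 3.0],  # unrelated to the target
}
hits = coexpressed(target_profile, toy_profiles)
print(hits)
```

    Using the absolute value keeps both positively and negatively co-expressed genes, matching the |CC| notation in the abstract.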

  4. Transcriptome database resource and gene expression atlas for the rose

    PubMed Central

    2012-01-01

    Background For centuries roses have been selected based on a number of traits. Little information exists on the genetic and molecular basis that contributes to these traits, mainly because information on expressed genes for this economically important ornamental plant is scarce. Results Here, we used a combination of Illumina and 454 sequencing technologies to generate information on Rosa sp. transcripts using RNA from various tissues and in response to biotic and abiotic stresses. A total of 80714 transcript clusters were identified and 76611 peptides have been predicted among which 20997 have been clustered into 13900 protein families. BLASTp hits in closely related Rosaceae species revealed that about half of the predicted peptides in the strawberry and peach genomes have orthologs in Rosa dataset. Digital expression was obtained using RNA samples from organs at different development stages and under different stress conditions. qPCR validated the digital expression data for a selection of 23 genes with high or low expression levels. Comparative gene expression analyses between the different tissues and organs allowed the identification of clusters that are highly enriched in given tissues or under particular conditions, demonstrating the usefulness of the digital gene expression analysis. A web interface ROSAseq was created that allows data interrogation by BLAST, subsequent analysis of DNA clusters and access to thorough transcript annotation including best BLAST matches on Fragaria vesca, Prunus persica and Arabidopsis. The rose peptides dataset was used to create the ROSAcyc resource pathway database that allows access to the putative genes and enzymatic pathways. Conclusions The study provides useful information on Rosa expressed genes, with thorough annotation and an overview of expression patterns for transcripts with good accuracy. PMID:23164410

  5. GEsture: an online hand-drawing tool for gene expression pattern search.

    PubMed

    Wang, Chunyan; Xu, Yiqing; Wang, Xuelin; Zhang, Li; Wei, Suyun; Ye, Qiaolin; Zhu, Youxiang; Yin, Hengfu; Nainwal, Manoj; Tanon-Reyes, Luis; Cheng, Feng; Yin, Tongming; Ye, Ning

    2018-01-01

    Gene expression profiling data provide useful information for the investigation of biological function and process. However, identifying a specific expression pattern from extensive time series gene expression data is not an easy task. Clustering, a popular method, is often used to group genes with similar expression; however, genes with a 'desirable' or 'user-defined' pattern cannot be efficiently detected by clustering methods. To address these limitations, we developed an online tool called GEsture. Users can draw or graph a curve using a mouse instead of inputting abstract parameters of clustering methods. GEsture explores genes showing similar, opposite and time-delay expression patterns with a gene expression curve as input from time series datasets. We presented three examples that illustrate the capacity of GEsture in gene hunting while following users' requirements. GEsture also provides visualization tools (such as expression pattern figure, heat map and correlation network) to display the searching results. The result outputs may provide useful information for researchers to understand the targets, function and biological processes of the involved genes.
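
    One way to picture the three pattern types mentioned above (similar, opposite and time-delay) is with a lagged correlation against a query curve. The sketch below is an illustration under that assumption only; the abstract does not say how GEsture scores matches, and the function names and toy curves are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either sequence is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def lagged_correlation(query, profile, lag):
    """Correlate the query with the profile shifted later by `lag` time points."""
    n = len(query)
    return pearson(query[:n - lag], profile[lag:])


def best_lag(query, profile, max_lag=3):
    """Lag (0..max_lag) at which the profile best matches the query curve."""
    return max(range(max_lag + 1),
               key=lambda lag: lagged_correlation(query, profile, lag))


# Hypothetical hand-drawn query curve and two candidate gene profiles.
query_curve = [0, 1, 2, 3, 2, 1, 0, 0]
delayed_gene = [0, 0, 1, 2, 3, 2, 1, 0]    # same shape, one time point later
opposite_gene = [-v for v in query_curve]  # mirror image of the query

print(best_lag(query_curve, delayed_gene))
print(lagged_correlation(query_curve, opposite_gene, 0))
```

    A correlation near +1 at lag 0 would indicate a similar pattern, near -1 an opposite pattern, and a high correlation at a positive lag a time-delayed pattern.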

  6. SigEMD: A powerful method for differential gene expression analysis in single-cell RNA sequencing data.

    PubMed

    Wang, Tianyu; Nabavi, Sheida

    2018-04-24

    Differential gene expression analysis is one of the significant efforts in single cell RNA sequencing (scRNAseq) analysis to discover the specific changes in expression levels of individual cell types. Since scRNAseq exhibits multimodality, large amounts of zero counts, and sparsity, it is different from the traditional bulk RNA sequencing (RNAseq) data. The new challenges of scRNAseq data promote the development of new methods for identifying differentially expressed (DE) genes. In this study, we proposed a new method, SigEMD, that combines a data imputation approach, a logistic regression model and a nonparametric method based on the Earth Mover's Distance, to precisely and efficiently identify DE genes in scRNAseq data. The regression model and data imputation are used to reduce the impact of large amounts of zero counts, and the nonparametric method is used to improve the sensitivity of detecting DE genes from multimodal scRNAseq data. By additionally employing gene interaction network information to adjust the final states of DE genes, we further reduce the false positives of calling DE genes. We used simulated datasets and real datasets to evaluate the detection accuracy of the proposed method and to compare its performance with those of other differential expression analysis methods. Results indicate that the proposed method has an overall powerful performance in terms of precision in detection, sensitivity, and specificity. Copyright © 2018 Elsevier Inc. All rights reserved.
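
    The distance at the core of the nonparametric step above can be illustrated for a single gene. This is a minimal sketch of the one-dimensional Earth Mover's Distance between two empirical distributions, computed as the area between their cumulative distribution functions. It covers only the distance itself, not SigEMD's imputation, logistic regression, significance testing or network adjustment, and the expression values are invented.

```python
def emd_1d(a, b):
    """Earth Mover's Distance between two 1-D samples (sizes may differ)."""
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    distance = 0.0
    i = j = 0
    for left, right in zip(points, points[1:]):
        # Advance to the number of observations <= left in each sample.
        while i < len(a) and a[i] <= left:
            i += 1
        while j < len(b) and b[j] <= left:
            j += 1
        # Area between the two empirical CDFs over [left, right).
        distance += abs(i / len(a) - j / len(b)) * (right - left)
    return distance


# Hypothetical expression of one gene in two groups of cells.
group_1 = [0, 0, 0]
group_2 = [1, 1, 1]
print(emd_1d(group_1, group_2))
```

    Because the distance compares whole distributions rather than means, it can separate groups whose expression differs in shape (for example, bimodal versus unimodal) even when their averages are close.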

  7. The Importance of Normalization on Large and Heterogeneous Microarray Datasets

    EPA Science Inventory

    DNA microarray technology is a powerful functional genomics tool increasingly used for investigating global gene expression in environmental studies. Microarrays can also be used in identifying biological networks, as they give insight on the complex gene-to-gene interactions, ne...

  8. Intra- and interspecies gene expression models for predicting drug response in canine osteosarcoma.

    PubMed

    Fowles, Jared S; Brown, Kristen C; Hess, Ann M; Duval, Dawn L; Gustafson, Daniel L

    2016-02-19

    Genomics-based predictors of drug response have the potential to improve outcomes associated with cancer therapy. Osteosarcoma (OS), the most common primary bone cancer in dogs, is commonly treated with adjuvant doxorubicin or carboplatin following amputation of the affected limb. We evaluated the use of gene-expression based models built in an intra- or interspecies manner to predict chemosensitivity and treatment outcome in canine OS. Models were built and evaluated using microarray gene expression and drug sensitivity data from human and canine cancer cell lines, and canine OS tumor datasets. The "COXEN" method was utilized to filter gene signatures between human and dog datasets based on strong co-expression patterns. Models were built using linear discriminant analysis via the misclassification penalized posterior algorithm. The best doxorubicin model involved genes identified in human lines that were co-expressed and trained on canine OS tumor data, which accurately predicted clinical outcome in 73 % of dogs (p = 0.0262, binomial). The best carboplatin model utilized canine lines for gene identification and model training, with canine OS tumor data for co-expression. Dogs whose treatment matched our predictions had significantly better clinical outcomes than those that didn't (p = 0.0006, Log Rank), and this predictor significantly associated with longer disease free intervals in a Cox multivariate analysis (hazard ratio = 0.3102, p = 0.0124). Our data show that intra- and interspecies gene expression models can successfully predict response in canine OS, which may improve outcome in dogs and serve as pre-clinical validation for similar methods in human cancer research.

  9. BubbleGUM: automatic extraction of phenotype molecular signatures and comprehensive visualization of multiple Gene Set Enrichment Analyses.

    PubMed

    Spinelli, Lionel; Carpentier, Sabrina; Montañana Sanchis, Frédéric; Dalod, Marc; Vu Manh, Thien-Phong

    2015-10-19

    Recent advances in the analysis of high-throughput expression data have led to the development of tools that scaled-up their focus from single-gene to gene set level. For example, the popular Gene Set Enrichment Analysis (GSEA) algorithm can detect moderate but coordinated expression changes of groups of presumably related genes between pairs of experimental conditions. This considerably improves extraction of information from high-throughput gene expression data. However, although many gene sets covering a large panel of biological fields are available in public databases, the ability to generate home-made gene sets relevant to one's biological question is crucial but remains a substantial challenge to most biologists lacking statistic or bioinformatic expertise. This is all the more the case when attempting to define a gene set specific of one condition compared to many other ones. Thus, there is a crucial need for an easy-to-use software for generation of relevant home-made gene sets from complex datasets, their use in GSEA, and the correction of the results when applied to multiple comparisons of many experimental conditions. We developed BubbleGUM (GSEA Unlimited Map), a tool that allows to automatically extract molecular signatures from transcriptomic data and perform exhaustive GSEA with multiple testing correction. One original feature of BubbleGUM notably resides in its capacity to integrate and compare numerous GSEA results into an easy-to-grasp graphical representation. We applied our method to generate transcriptomic fingerprints for murine cell types and to assess their enrichments in human cell types. This analysis allowed us to confirm homologies between mouse and human immunocytes. BubbleGUM is an open-source software that allows to automatically generate molecular signatures out of complex expression datasets and to assess directly their enrichment by GSEA on independent datasets. 
Enrichments are displayed in a graphical output that helps interpreting the results. This innovative methodology has recently been used to answer important questions in functional genomics, such as the degree of similarities between microarray datasets from different laboratories or with different experimental models or clinical cohorts. BubbleGUM is executable through an intuitive interface so that both bioinformaticians and biologists can use it. It is available at http://www.ciml.univ-mrs.fr/applications/BubbleGUM/index.html .
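
    For readers unfamiliar with the GSEA statistic that BubbleGUM builds on, the following is a minimal sketch of a running-sum enrichment score. It uses the unweighted (Kolmogorov-Smirnov-like) form, which is a simplification of standard GSEA: there is no correlation weighting, permutation testing or multiple testing correction. It is not BubbleGUM's code, and the gene names are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score over a ranked gene list."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")

    running, best = 0.0, 0.0
    for is_hit in hits:
        # Step up at gene-set members, step down at all other genes.
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running  # keep the largest deviation from zero
    return best


# Hypothetical ranking from most up-regulated to most down-regulated.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
print(enrichment_score(ranked, {"g1", "g2"}))  # set at the top of the list
print(enrichment_score(ranked, {"g5", "g6"}))  # set at the bottom of the list
```

    A positive score indicates that the gene set is concentrated at the top of the ranking and a negative score that it is concentrated at the bottom.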

  10. A formal concept analysis approach to consensus clustering of multi-experiment expression data

    PubMed Central

    2014-01-01

    Background Presently, with the increasing number and complexity of available gene expression datasets, the combination of data from multiple microarray studies addressing a similar biological question is gaining importance. The analysis and integration of multiple datasets are expected to yield more reliable and robust results since they are based on a larger number of samples and the effects of the individual study-specific biases are diminished. This is supported by recent studies suggesting that important biological signals are often preserved or enhanced by multiple experiments. An approach to combining data from different experiments is the aggregation of their clusterings into a consensus or representative clustering solution which increases the confidence in the common features of all the datasets and reveals the important differences among them. Results We propose a novel generic consensus clustering technique that applies Formal Concept Analysis (FCA) approach for the consolidation and analysis of clustering solutions derived from several microarray datasets. These datasets are initially divided into groups of related experiments with respect to a predefined criterion. Subsequently, a consensus clustering algorithm is applied to each group resulting in a clustering solution per group. These solutions are pooled together and further analysed by employing FCA which allows extracting valuable insights from the data and generating a gene partition over all the experiments. In order to validate the FCA-enhanced approach two consensus clustering algorithms are adapted to incorporate the FCA analysis. Their performance is evaluated on gene expression data from multi-experiment study examining the global cell-cycle control of fission yeast. 
The FCA results derived from both methods demonstrate that, although both algorithms optimize different clustering characteristics, FCA is able to overcome and diminish these differences and preserve some relevant biological signals. Conclusions The proposed FCA-enhanced consensus clustering technique is a general approach to the combination of clustering algorithms with FCA for deriving clustering solutions from multiple gene expression matrices. The experimental results presented herein demonstrate that it is a robust data integration technique able to produce good quality clustering solution that is representative for the whole set of expression matrices. PMID:24885407

  11. Computational deconvolution of genome wide expression data from Parkinson's and Huntington's disease brain tissues using population-specific expression analysis

    PubMed Central

    Capurro, Alberto; Bodea, Liviu-Gabriel; Schaefer, Patrick; Luthi-Carter, Ruth; Perreau, Victoria M.

    2015-01-01

    The characterization of molecular changes in diseased tissues gives insight into pathophysiological mechanisms and is important for therapeutic development. Genome-wide gene expression analysis has proven valuable for identifying biological processes in neurodegenerative diseases using post mortem human brain tissue and numerous datasets are publically available. However, many studies utilize heterogeneous tissue samples consisting of multiple cell types, all of which contribute to global gene expression values, confounding biological interpretation of the data. In particular, changes in numbers of neuronal and glial cells occurring in neurodegeneration confound transcriptomic analyses, particularly in human brain tissues where sample availability and controls are limited. To identify cell specific gene expression changes in neurodegenerative disease, we have applied our recently published computational deconvolution method, population specific expression analysis (PSEA). PSEA estimates cell-type-specific expression values using reference expression measures, which in the case of brain tissue comprises mRNAs with cell-type-specific expression in neurons, astrocytes, oligodendrocytes and microglia. As an exercise in PSEA implementation and hypothesis development regarding neurodegenerative diseases, we applied PSEA to Parkinson's and Huntington's disease (PD, HD) datasets. Genes identified as differentially expressed in substantia nigra pars compacta neurons by PSEA were validated using external laser capture microdissection data. Network analysis and Annotation Clustering (DAVID) identified molecular processes implicated by differential gene expression in specific cell types. The results of these analyses provided new insights into the implementation of PSEA in brain tissues and additional refinement of molecular signatures in human HD and PD. PMID:25620908

  12. Function Clustering Self-Organization Maps (FCSOMs) for mining differentially expressed genes in Drosophila and its correlation with the growth medium.

    PubMed

    Liu, L L; Liu, M J; Ma, M

    2015-09-28

    The central task of this study was to mine the gene-to-medium relationship. Adequate knowledge of this relationship could potentially improve the accuracy of differentially expressed gene mining. One of the approaches to differentially expressed gene mining uses conventional clustering algorithms to identify the gene-to-medium relationship. Compared to conventional clustering algorithms, self-organization maps (SOMs) identify the nonlinear aspects of the gene-to-medium relationships by mapping the input space into another higher dimensional feature space. However, SOMs are not suitable for huge datasets consisting of millions of samples. Therefore, a new computational model, the Function Clustering Self-Organization Maps (FCSOMs), was developed. FCSOMs take advantage of the theory of granular computing as well as advanced statistical learning methodologies, and are built specifically for each information granule (a function cluster of genes), which are intelligently partitioned by the clustering algorithm provided by the DAVID_6.7 software platform. However, only the gene functions, and not their expression values, are considered in the fuzzy clustering algorithm of DAVID. Compared to the clustering algorithm of DAVID, these experimental results show a marked improvement in the accuracy of classification with the application of FCSOMs. FCSOMs can handle huge datasets and their complex classification problems, as each FCSOM (modeled for each function cluster) can be easily parallelized.

  13. Cross-platform normalization of microarray and RNA-seq data for machine learning applications

    PubMed Central

    Thompson, Jeffrey A.; Tan, Jie

    2016-01-01

    Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exist in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data. We developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models constructed from legacy platforms. We evaluated TDM, as well as quantile normalization, nonparanormal transformation, and a simple log2 transformation, on both simulated and biological datasets of gene expression. Our evaluation included both supervised and unsupervised machine learning approaches. We found that TDM exhibited consistently strong performance across settings and that quantile normalization also performed well in many circumstances. We also provide a TDM package for the R programming language. PMID:26844019
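
    Two of the baseline transformations evaluated above, quantile normalization and a simple log2 transformation, can be sketched in a few lines. This is a generic textbook version under simplifying assumptions (ties are broken by position, and a pseudocount of 1 is used for the log); it is not the TDM method and not the authors' R package, and the numbers are toy values.

```python
from math import log2


def log2_transform(values, pseudocount=1.0):
    """log2(x + pseudocount), a common way to compress count data."""
    return [log2(v + pseudocount) for v in values]


def quantile_normalize(samples):
    """Give every sample the same value distribution.

    samples: list of equal-length lists, one per sample, genes in the same order.
    """
    n_genes = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Mean across samples of the r-th smallest value, for each rank r.
    rank_means = [sum(s[r] for s in sorted_samples) / len(samples)
                  for r in range(n_genes)]
    normalized = []
    for sample in samples:
        order = sorted(range(n_genes), key=lambda g: sample[g])
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            out[gene] = rank_means[rank]  # replace value by its rank mean
        normalized.append(out)
    return normalized


# Hypothetical expression of three genes in two samples.
toy_samples = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(toy_samples)
print(normalized)
print(log2_transform([0, 1, 3, 7]))
```

    After quantile normalization each sample contains exactly the same set of values, so only the within-sample ordering of genes is preserved.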

  14. Novel harmonic regularization approach for variable selection in Cox's proportional hazards model.

    PubMed

    Chu, Ge-Jin; Liang, Yong; Wang, Jia-Xuan

    2014-01-01

    Variable selection is an important issue in regression and a number of variable selection methods have been proposed involving nonconvex penalty functions. In this paper, we investigate a novel harmonic regularization method, which can approximate nonconvex Lq (1/2 < q < 1) regularizations, to select key risk factors in the Cox's proportional hazards model using microarray gene expression data. The harmonic regularization method can be efficiently solved using our proposed direct path seeking approach, which can produce solutions that closely approximate those for the convex loss function and the nonconvex regularization. Simulation results based on the artificial datasets and four real microarray gene expression datasets, such as real diffuse large B-cell lymphoma (DLBCL), the lung cancer, and the AML datasets, show that the harmonic regularization method can be more accurate for variable selection than existing Lasso series methods.

  15. Bi-Force: large-scale bicluster editing and its application to gene expression data biclustering

    PubMed Central

    Sun, Peng; Speicher, Nora K.; Röttger, Richard; Guo, Jiong; Baumbach, Jan

    2014-01-01

    The explosion of the biological data has dramatically reformed today's biological research. The need to integrate and analyze high-dimensional biological data on a large scale is driving the development of novel bioinformatics approaches. Biclustering, also known as ‘simultaneous clustering’ or ‘co-clustering’, has been successfully utilized to discover local patterns in gene expression data and similar biomedical data types. Here, we contribute a new heuristic: ‘Bi-Force’. It is based on the weighted bicluster editing model, to perform biclustering on arbitrary sets of biological entities, given any kind of pairwise similarities. We first evaluated the power of Bi-Force to solve dedicated bicluster editing problems by comparing Bi-Force with two existing algorithms in the BiCluE software package. We then followed a biclustering evaluation protocol in a recent review paper from Eren et al. (2013) (A comparative analysis of biclustering algorithms for gene expression data. Brief. Bioinform., 14:279–292.) and compared Bi-Force against eight existing tools: FABIA, QUBIC, Cheng and Church, Plaid, BiMax, Spectral, xMOTIFs and ISA. To this end, a suite of synthetic datasets as well as nine large gene expression datasets from Gene Expression Omnibus were analyzed. All resulting biclusters were subsequently investigated by Gene Ontology enrichment analysis to evaluate their biological relevance. The distinct theoretical foundation of Bi-Force (bicluster editing) is more powerful than strict biclustering. We thus outperformed existing tools with Bi-Force at least when following the evaluation protocols from Eren et al. Bi-Force is implemented in Java and integrated into the open source software package of BiCluE. The software as well as all used datasets are publicly available at http://biclue.mpi-inf.mpg.de. PMID:24682815

  16. Downstream targets of HOXB4 in a cell line model of primitive hematopoietic progenitor cells.

    PubMed

    Lee, Han M; Zhang, Hui; Schulz, Vincent; Tuck, David P; Forget, Bernard G

    2010-08-05

    Enforced expression of the homeobox transcription factor HOXB4 has been shown to enhance hematopoietic stem cell self-renewal and expansion ex vivo and in vivo. To investigate the downstream targets of HOXB4 in hematopoietic progenitor cells, HOXB4 was constitutively overexpressed in the primitive hematopoietic progenitor cell line EML. Two genome-wide analytical techniques were used: RNA expression profiling using microarrays and chromatin immunoprecipitation (ChIP)-chip. RNA expression profiling revealed that 465 gene transcripts were differentially expressed in KLS (c-Kit(+), Lin(-), Sca-1(+))-EML cells that overexpressed HOXB4 (KLS-EML-HOXB4) compared with control KLS-EML cells that were transduced with vector alone. In particular, erythroid-specific gene transcripts were observed to be highly down-regulated in KLS-EML-HOXB4 cells. ChIP-chip analysis revealed that the promoter region for 1910 genes, such as CD34, Sox4, and B220, were occupied by HOXB4 in KLS-EML-HOXB4 cells. Side-by-side comparison of the ChIP-chip and RNA expression profiling datasets provided correlative information and identified Gp49a and Laptm4b as candidate "stemness-related" genes. Both genes were highly ranked in both dataset lists and have been previously shown to be preferentially expressed in hematopoietic stem cells and down-regulated in mature hematopoietic cells, thus making them attractive candidates for future functional studies in hematopoietic cells.

  17. An enhanced deterministic K-Means clustering algorithm for cancer subtype prediction from gene expression data.

    PubMed

    Nidheesh, N; Abdul Nazeer, K A; Ameer, P M

    2017-12-01

    Clustering algorithms with steps involving randomness usually give different results on different executions for the same dataset. This non-deterministic nature of algorithms such as the K-Means clustering algorithm limits their applicability in areas such as cancer subtype prediction using gene expression data. It is hard to sensibly compare the results of such algorithms with those of other algorithms. The non-deterministic nature of K-Means is due to its random selection of data points as initial centroids. We propose an improved, density based version of K-Means, which involves a novel and systematic method for selecting initial centroids. The key idea of the algorithm is to select data points which belong to dense regions and which are adequately separated in feature space as the initial centroids. We compared the proposed algorithm to a set of eleven widely used single clustering algorithms and a prominent ensemble clustering algorithm which is being used for cancer data classification, based on the performances on a set of datasets comprising ten cancer gene expression datasets. The proposed algorithm has shown better overall performance than the others. There is a pressing need in the Biomedical domain for simple, easy-to-use and more accurate Machine Learning tools for cancer subtype prediction. The proposed algorithm is simple, easy-to-use and gives stable results. Moreover, it provides comparatively better predictions of cancer subtypes from gene expression data. Copyright © 2017 Elsevier Ltd. All rights reserved.
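
    The key idea described above, choosing initial centroids that lie in dense regions and are well separated, can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' algorithm: the neighbour-count density, the `radius` and `min_sep` parameters, and the function names are all hypothetical choices.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def density_seeds(points, k, radius, min_sep):
    """Pick up to k initial centroids deterministically.

    Density of a point = number of other points within `radius`
    (an assumed definition). Points are visited from densest to
    sparsest, with ties broken by input order, and a point is
    accepted only if it is at least `min_sep` away from every
    centroid accepted so far.
    """
    density = [
        sum(1 for j, q in enumerate(points) if j != i and euclidean(p, q) <= radius)
        for i, p in enumerate(points)
    ]
    order = sorted(range(len(points)), key=lambda i: (-density[i], i))
    chosen = []
    for i in order:
        if all(euclidean(points[i], points[s]) >= min_sep for s in chosen):
            chosen.append(i)
        if len(chosen) == k:
            break
    return [points[i] for i in chosen]


# Two tight groups and one isolated point (toy data).
samples = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (5, 5)]
print(density_seeds(samples, k=2, radius=1.5, min_sep=5.0))
```

    Because no random numbers are involved, repeated runs on the same data return the same seeds, which is the property the abstract emphasizes.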

  18. Broad Integration of Expression Maps and Co-Expression Networks Compassing Novel Gene Functions in the Brain

    PubMed Central

    Okamura-Oho, Yuko; Shimokawa, Kazuro; Nishimura, Masaomi; Takemoto, Satoko; Sato, Akira; Furuichi, Teiichi; Yokota, Hideo

    2014-01-01

    Using a recently invented technique for gene expression mapping in the whole-anatomy context, termed transcriptome tomography, we have generated a dataset of 36,000 maps of overall gene expression in the adult-mouse brain. Here, using an informatics approach, we identified a broad co-expression network that follows an inverse power law and is rich in functional interaction and gene-ontology terms. Our framework for the integrated analysis of expression maps and graphs of co-expression networks revealed that groups of combinatorially expressed genes, which regulate cell differentiation during development, were present in the adult brain and each of these groups was associated with a discrete cell type. These groups included non-coding genes of unknown function. We found that these genes specifically linked developmentally conserved groups in the network. A previously unrecognized robust expression pattern covering the whole brain was related to the molecular anatomy of key biological processes occurring in particular areas. PMID:25382412

  19. Meta-analysis of human gene expression in response to Mycobacterium tuberculosis infection reveals potential therapeutic targets.

    PubMed

    Wang, Zhang; Arat, Seda; Magid-Slav, Michal; Brown, James R

    2018-01-10

    With the global emergence of multi-drug resistant strains of Mycobacterium tuberculosis, new strategies to treat tuberculosis are urgently needed such as therapeutics targeting potential human host factors. Here we performed a statistical meta-analysis of human gene expression in response to both latent and active pulmonary tuberculosis infections from nine published datasets. We found 1655 genes that were significantly differentially expressed during active tuberculosis infection. In contrast, no gene was significant for latent tuberculosis. Pathway enrichment analysis identified 90 significant canonical human pathways, including several pathways more commonly related to non-infectious diseases such as the LRRK2 pathway in Parkinson's disease, and PD-1/PD-L1 signaling pathway important for new immuno-oncology therapies. The analysis of human genome-wide association studies datasets revealed tuberculosis-associated genetic variants proximal to several genes in major histocompatibility complex for antigen presentation. We propose several new targets and drug-repurposing opportunities including intravenous immunoglobulin, ion-channel blockers and cancer immuno-therapeutics for development as combination therapeutics with anti-mycobacterial agents. Our meta-analysis provides novel insights into host genes and pathways important for tuberculosis and brings forth potential drug repurposing opportunities for host-directed therapies.

  20. Release of (and lessons learned from mining) a pioneering large toxicogenomics database.

    PubMed

    Sandhu, Komal S; Veeramachaneni, Vamsi; Yao, Xiang; Nie, Alex; Lord, Peter; Amaratunga, Dhammika; McMillian, Michael K; Verheyen, Geert R

    2015-07-01

    We release the Janssen Toxicogenomics database. This rat liver gene-expression database was generated using Codelink microarrays, and has been used over the past years within Janssen to derive signatures for multiple end points and to classify proprietary compounds. The release consists of gene-expression responses to 124 compounds, selected to give a broad coverage of liver-active compounds. A selection of the compounds were also analyzed on Affymetrix microarrays. The release includes results of an in-house reannotation pipeline to Entrez gene annotations, to classify probes into different confidence classes. High confidence unambiguously annotated probes were used to create gene-level data which served as starting point for cross-platform comparisons. Connectivity map-based similarity methods show excellent agreement between Codelink and Affymetrix runs of the same samples. We also compared our dataset with the Japanese Toxicogenomics Project and observed reasonable agreement, especially for compounds with stronger gene signatures. We describe an R-package containing the gene-level data and show how it can be used for expression-based similarity searches. Comparing the same biological samples run on the Affymetrix and the Codelink platform, good correspondence is observed using connectivity mapping approaches. As expected, this correspondence is smaller when the data are compared with an independent dataset such as TG-GATE. We hope that this collection of gene-expression profiles will be incorporated in toxicogenomics pipelines of users.

  1. Inference of Gene Regulatory Networks Using Bayesian Nonparametric Regression and Topology Information.

    PubMed

    Fan, Yue; Wang, Xiao; Peng, Qinke

    2017-01-01

    Gene regulatory networks (GRNs) play an important role in cellular systems and are important for understanding biological processes. Many algorithms have been developed to infer the GRNs. However, most algorithms only pay attention to the gene expression data but do not consider the topology information in their inference process, while incorporating this information can partially compensate for the lack of reliable expression data. Here we develop a Bayesian group lasso with spike and slab priors to perform gene selection and estimation for nonparametric models. B-spline basis functions are used to capture the nonlinear relationships flexibly and penalties are used to avoid overfitting. Further, we incorporate the topology information into the Bayesian method as a prior. We present the application of our method on DREAM3 and DREAM4 datasets and two real biological datasets. The results show that our method performs better than existing methods and the topology information prior can improve the result.

  2. Identification of Differentially Expressed Genes through Integrated Study of Alzheimer’s Disease Affected Brain Regions

    PubMed Central

    Berretta, Regina; Moscato, Pablo

    2016-01-01

    Background: Alzheimer’s disease (AD) is the most common form of dementia in older adults that damages the brain and results in impaired memory, thinking and behaviour. The identification of differentially expressed genes and related pathways among affected brain regions can provide more information on the mechanisms of AD. In the past decade, several studies have reported many genes that are associated with AD. This wealth of information has become difficult to follow and interpret as most of the results are conflicting. In that case, it is worth doing an integrated study of multiple datasets that helps to increase the total number of samples and the statistical power in detecting biomarkers. In this study, we present an integrated analysis of five different brain region datasets and introduce new genes that warrant further investigation. Methods: The aim of our study is to apply a novel combinatorial optimisation based meta-analysis approach to identify differentially expressed genes that are associated with AD across brain regions. In this study, microarray gene expression data from 161 samples (74 non-demented controls, 87 AD) from the Entorhinal Cortex (EC), Hippocampus (HIP), Middle temporal gyrus (MTG), Posterior cingulate cortex (PC), Superior frontal gyrus (SFG) and visual cortex (VCX) brain regions were integrated and analysed using our method. The results are then compared to two popular meta-analysis methods, RankProd and GeneMeta, and to what can be obtained by analysing the individual datasets. Results: We find genes related to AD that are consistent with existing studies, and new candidate genes not previously related to AD. Our study confirms the up-regulation of INFAR2 and PTMA along with the down-regulation of GPHN, RAB2A, PSMD14 and FGF. Novel genes PSMB2, WNK1, RPL15, SEMA4C, RWDD2A and LARGE are found to be differentially expressed across all brain regions. Further investigation on these genes may provide new insights into the development of AD. In addition, we identified the presence of 23 non-coding features, including four miRNA precursors (miR-7, miR570, miR-1229 and miR-6821), dysregulated across the brain regions. Furthermore, we compared our results with two popular meta-analysis methods RankProd and GeneMeta to validate our findings and performed a sensitivity analysis by removing one dataset at a time to assess the robustness of our results. These new findings may provide new insights into the disease mechanisms and thus make a significant contribution in the near future towards understanding, prevention and cure of AD. PMID:27050411

  3. Cross-species comparison of the gut: Differential gene expression sheds light on biological differences in closely related tenebrionids.

    PubMed

    Oppert, Brenda; Perkin, Lindsey; Martynov, Alexander G; Elpidina, Elena N

    2018-04-01

    The gut is one of the primary interfaces between an insect and its environment. Understanding gene expression profiles in the insect gut can provide insight into interactions with the environment as well as identify potential control methods for pests. We compared the expression profiles of transcripts from the gut of larval stages of two coleopteran insects, Tenebrio molitor and Tribolium castaneum. These tenebrionids have different life cycles, varying in the duration and number of larval instars. T. castaneum has a sequenced genome and has been a model for coleopterans, and we recently obtained a draft genome for T. molitor. We assembled gut transcriptome reads from each insect to their respective genomes and filtered mapped reads to RPKM>1, yielding 11,521 and 17,871 genes in the T. castaneum and T. molitor datasets, respectively. There were identical GO terms in each dataset, and enrichment analyses also identified shared GO terms. From these datasets, we compiled an ortholog list of 6907 genes; 45% of the total assembled reads from T. castaneum were found in the top 25 orthologs, but only 27% of assembled reads were found in the top 25 T. molitor orthologs. There were 2281 genes unique to T. castaneum, and 2088 predicted genes unique to T. molitor, although improvements to the T. molitor genome will likely reduce these numbers as more orthologs are identified. We highlight a few unique genes in T. castaneum or T. molitor that may relate to distinct biological functions. A large number of putative genes expressed in the larval gut with uncharacterized functions (36 and 68% from T. castaneum and T. molitor, respectively) support the need for further research. These data are the first step in building a comprehensive understanding of the physiology of the gut in tenebrionid insects, illustrating commonalities and differences that may be related to speciation and environmental adaptation. Published by Elsevier Ltd.
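
    The RPKM>1 filter mentioned above can be illustrated with the standard RPKM definition (reads per kilobase of transcript per million mapped reads). The sketch below is not the authors' pipeline; the function names, the input layout, and the toy numbers are assumptions.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def filter_expressed(genes, total_mapped_reads, threshold=1.0):
    """Keep genes whose RPKM exceeds the threshold.

    `genes` maps a gene name to (read_count, gene_length_bp);
    this layout is a hypothetical convenience for the sketch.
    Returns a dict of gene name -> RPKM for the retained genes.
    """
    kept = {}
    for name, (reads, length_bp) in genes.items():
        value = rpkm(reads, length_bp, total_mapped_reads)
        if value > threshold:
            kept[name] = value
    return kept


# Toy library of one million mapped reads with made-up gene names.
toy_genes = {"gene_a": (100, 1000), "gene_b": (1, 2000)}
print(filter_expressed(toy_genes, total_mapped_reads=1_000_000))
```

    With these toy numbers, gene_a has an RPKM of 100 and is retained, while gene_b has an RPKM of 0.5 and is filtered out.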

  4. Super-delta: a new differential gene expression analysis procedure with robust data normalization.

    PubMed

    Liu, Yuhang; Zhang, Jinfeng; Qiu, Xing

    2017-12-21

    Normalization is an important data preparation step in gene expression analyses, designed to remove various systematic noise. Sample variance is greatly reduced after normalization, hence the power of subsequent statistical analyses is likely to increase. On the other hand, variance reduction is made possible by borrowing information across all genes, including differentially expressed genes (DEGs) and outliers, which will inevitably introduce some bias. This bias typically inflates type I error and can reduce statistical power in certain situations. In this study we propose a new differential expression analysis pipeline, dubbed super-delta, that consists of a multivariate extension of the global normalization and a modified t-test. A robust procedure is designed to minimize the bias introduced by DEGs in the normalization step. The modified t-test is derived based on asymptotic theory for hypothesis testing that suitably pairs with the proposed robust normalization. We first compared super-delta with four commonly used normalization methods: global, median-IQR, quantile, and cyclic loess normalization in simulation studies. Super-delta was shown to have better statistical power with tighter control of type I error rate than its competitors. In many cases, the performance of super-delta is close to that of an oracle test in which datasets without technical noise were used. We then applied all methods to a collection of gene expression datasets on breast cancer patients who received neoadjuvant chemotherapy. While there is a substantial overlap of the DEGs identified by all of them, super-delta was able to identify comparatively more DEGs than its competitors. Downstream gene set enrichment analysis confirmed that all these methods selected largely consistent pathways. Detailed investigations on the relatively small differences showed that pathways identified by super-delta have better connections to breast cancer than other methods. As a new pipeline, super-delta provides new insights to the area of differential gene expression analysis. Solid theoretical foundation supports its asymptotic unbiasedness and technical noise-free properties. Implementation on real and simulated datasets demonstrates its decent performance compared with state-of-the-art procedures. It also has the potential of expansion to be incorporated with other data types and/or more general between-group comparison problems.

  5. Application of machine learning on brain cancer multiclass classification

    NASA Astrophysics Data System (ADS)

    Panca, V.; Rustam, Z.

    2017-07-01

    Classification of brain cancer is a problem of multiclass classification. One approach to solve this problem is by first transforming it into several binary problems. The microarray gene expression dataset has the two main characteristics of medical data: extremely many features (genes) and only a small number of samples. The application of machine learning to microarray gene expression datasets mainly consists of two steps: feature selection and classification. In this paper, the features are selected using a method based on the support vector machine recursive feature elimination (SVM-RFE) principle, which is improved to solve multiclass classification and is called multiple multiclass SVM-RFE. Instead of using only the selected features on a single classifier, this method combines the result of multiple classifiers. The features are divided into subsets and SVM-RFE is used on each subset. Then, the selected features on each subset are put on separate classifiers. This method enhances the feature selection ability of each single SVM-RFE. Twin support vector machine (TWSVM) is used as the method of the classifier to reduce computational complexity. While ordinary SVM finds a single optimum hyperplane, the main objective of Twin SVM is to find two non-parallel optimum hyperplanes. The experiment on the brain cancer microarray gene expression dataset shows this method could classify 71.4% of the overall test data correctly, using 100 and 1000 genes selected from multiple multiclass SVM-RFE feature selection method. Furthermore, the per class results show that this method could classify data of normal and MD class with 100% accuracy.

  6. Network information improves cancer outcome prediction.

    PubMed

    Roy, Janine; Winter, Christof; Isik, Zerrin; Schroeder, Michael

    2014-07-01

    Disease progression in cancer can vary substantially between patients. Yet, patients often receive the same treatment. Recently, there has been much work on predicting disease progression and patient outcome variables from gene expression in order to personalize treatment options. Despite first diagnostic kits in the market, there are open problems such as the choice of random gene signatures or noisy expression data. One approach to deal with these two problems employs protein-protein interaction networks and ranks genes using the random surfer model of Google's PageRank algorithm. In this work, we created a benchmark dataset collection comprising 25 cancer outcome prediction datasets from literature and systematically evaluated the use of networks and a PageRank derivative, NetRank, for signature identification. We show that the NetRank performs significantly better than classical methods such as fold change or t-test. Despite an order of magnitude difference in network size, a regulatory and protein-protein interaction network perform equally well. Experimental evaluation on cancer outcome prediction in all of the 25 underlying datasets suggests that the network-based methodology identifies highly overlapping signatures over all cancer types, in contrast to classical methods that fail to identify highly common gene sets across the same cancer types. Integration of network information into gene expression analysis allows the identification of more reliable and accurate biomarkers and provides a deeper understanding of processes occurring in cancer development and progression. © The Author 2012. Published by Oxford University Press. For Permissions, please email: journals.permissions@oup.com.
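
    The general idea of ranking genes with a PageRank-style random surfer that is biased by expression-derived scores can be sketched as below. This is a generic personalized PageRank iteration, not the published NetRank implementation; the damping value, the normalization of the scores, and the toy network are assumptions, and the network is assumed to be undirected with no isolated genes.

```python
def network_rank(adjacency, scores, damping=0.5, iterations=100):
    """Personalized PageRank-style gene ranking by power iteration.

    adjacency: dict gene -> list of neighbouring genes (symmetric).
    scores: dict gene -> non-negative relevance score, for example an
    absolute correlation with outcome (an assumption of this sketch).
    Each gene keeps a (1 - damping) share of its normalized score and
    receives a damping share of the rank flowing in from neighbours.
    """
    genes = sorted(adjacency)
    total = sum(scores[g] for g in genes)
    prior = {g: scores[g] / total for g in genes}
    rank = dict(prior)
    for _ in range(iterations):
        updated = {}
        for g in genes:
            # Each neighbour splits its current rank evenly over its links.
            incoming = sum(rank[n] / len(adjacency[n]) for n in adjacency[g])
            updated[g] = (1 - damping) * prior[g] + damping * incoming
        rank = updated
    return rank


# Toy chain network A - B - C with made-up scores.
toy_network = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
toy_scores = {"A": 1.0, "B": 1.0, "C": 2.0}
ranks = network_rank(toy_network, toy_scores)
print(sorted(ranks, key=ranks.get, reverse=True))
```

    In this toy example the hub gene B ends up ranked above C even though C has the larger individual score, which illustrates how network context can change a purely score-based ordering.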

  7. Dynamic regulation of genetic pathways and targets during aging in Caenorhabditis elegans.

    PubMed

    He, Kan; Zhou, Tao; Shao, Jiaofang; Ren, Xiaoliang; Zhao, Zhongying; Liu, Dahai

    2014-03-01

    Numerous genetic targets and some individual pathways associated with aging have been identified using the worm model. However, less is known about the genetic mechanisms of aging genome-wide, particularly at the level of multiple pathways as well as the regulatory networks during aging. Here, we employed the gene expression datasets of three time points during aging in Caenorhabditis elegans (C. elegans) and performed gene set enrichment analysis (GSEA) on each dataset between adjacent stages. As a result, multiple genetic pathways and targets were identified as significantly down- or up-regulated. Among them, 5 truly aging-dependent signaling pathways including MAPK signaling pathway, mTOR signaling pathway, Wnt signaling pathway, TGF-beta signaling pathway and ErbB signaling pathway as well as 12 significantly associated genes were identified with dynamic expression pattern during aging. On the other hand, the continued declines in the regulation of several metabolic pathways have been demonstrated to display age-related changes. Furthermore, the reconstructed regulatory networks based on three aging-related chromatin immunoprecipitation followed by sequencing (ChIP-seq) datasets and the expression matrices of 154 involved genes in the above signaling pathways provide new insights into aging at the multiple pathways level. The combination of multiple genetic pathways and targets needs to be taken into consideration in future studies of aging, in which the dynamic regulation would be uncovered.

  8. A Cancer Gene Selection Algorithm Based on the K-S Test and CFS.

    PubMed

    Su, Qiang; Wang, Yina; Jiang, Xiaobing; Chen, Fuxue; Lu, Wen-Cong

    2017-01-01

    To address the challenging problem of selecting distinguished genes from cancer gene expression datasets, this paper presents a gene subset selection algorithm based on the Kolmogorov-Smirnov (K-S) test and correlation-based feature selection (CFS) principles. The algorithm selects distinguished genes first using the K-S test, and then, it uses CFS to select genes from those selected by the K-S test. We adopted support vector machines (SVM) as the classification tool and used the criteria of accuracy to evaluate the performance of the classifiers on the selected gene subsets. This approach compared the proposed gene subset selection algorithm with the K-S test, CFS, minimum-redundancy maximum-relevancy (mRMR), and ReliefF algorithms. The average experimental results of the aforementioned gene selection algorithms for 5 gene expression datasets demonstrate that, based on accuracy, the performance of the new K-S and CFS-based algorithm is better than those of the K-S test, CFS, mRMR, and ReliefF algorithms. The experimental results show that the K-S test-CFS gene selection algorithm is a very effective and promising approach compared to the K-S test, CFS, mRMR, and ReliefF algorithms.
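
    The first stage described above, ranking genes by the two-sample Kolmogorov-Smirnov statistic between classes, can be sketched as below. The CFS stage is not shown, and this is not the authors' code; the function names, the 0/1 label encoding, and the toy expression values are assumptions.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample K-S statistic: the largest gap between empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    for v in sorted(set(a_sorted) | set(b_sorted)):
        cdf_a = bisect_right(a_sorted, v) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, v) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def top_genes_by_ks(expression, labels, n_top):
    """Return the n_top genes with the largest K-S statistic.

    expression: dict gene -> list of values, one per sample.
    labels: list of 0/1 class labels in the same sample order.
    Ties are broken alphabetically so the result is deterministic.
    """
    scored = []
    for gene, values in expression.items():
        group0 = [v for v, y in zip(values, labels) if y == 0]
        group1 = [v for v, y in zip(values, labels) if y == 1]
        scored.append((ks_statistic(group0, group1), gene))
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [gene for _, gene in scored[:n_top]]


# Toy data: gene_a separates the classes, gene_b does not.
toy_expression = {
    "gene_a": [1, 2, 3, 7, 8, 9],
    "gene_b": [1, 5, 3, 2, 4, 6],
}
toy_labels = [0, 0, 0, 1, 1, 1]
print(top_genes_by_ks(toy_expression, toy_labels, n_top=1))
```

    In a full pipeline of the kind the abstract describes, the retained genes would then be passed to a second, redundancy-aware selection step before classification.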

  9. Enhancer Linking by Methylation/Expression Relationships (ELMER) | Informatics Technology for Cancer Research (ITCR)

    Cancer.gov

    R tool for analysis of DNA methylation and expression datasets. Integrative analysis allows reconstruction of in vivo transcription factor networks altered in cancer along with identification of the underlying gene regulatory sequences.

  10. Revealing complex function, process and pathway interactions with high-throughput expression and biological annotation data.

    PubMed

    Singh, Nitesh Kumar; Ernst, Mathias; Liebscher, Volkmar; Fuellen, Georg; Taher, Leila

    2016-10-20

    The biological relationships both between and within the functions, processes and pathways that operate within complex biological systems are only poorly characterized, making the interpretation of large scale gene expression datasets extremely challenging. Here, we present an approach that integrates gene expression and biological annotation data to identify and describe the interactions between biological functions, processes and pathways that govern a phenotype of interest. The product is a global, interconnected network, not of genes but of functions, processes and pathways, that represents the biological relationships within the system. We validated our approach on two high-throughput expression datasets describing organismal and organ development. Our findings are well supported by the available literature, confirming that developmental processes and apoptosis play key roles in cell differentiation. Furthermore, our results suggest that processes related to pluripotency and lineage commitment, which are known to be critical for development, interact mainly indirectly, through genes implicated in more general biological processes. Moreover, we provide evidence that supports the relevance of cell spatial organization in the developing liver for proper liver function. Our strategy can be viewed as an abstraction that is useful to interpret high-throughput data and devise further experiments.

  11. Identification of druggable cancer driver genes amplified across TCGA datasets.

    PubMed

    Chen, Ying; McGee, Jeremy; Chen, Xianming; Doman, Thompson N; Gong, Xueqian; Zhang, Youyan; Hamm, Nicole; Ma, Xiwen; Higgs, Richard E; Bhagwat, Shripad V; Buchanan, Sean; Peng, Sheng-Bin; Staschke, Kirk A; Yadav, Vipin; Yue, Yong; Kouros-Mehr, Hosein

    2014-01-01

    The Cancer Genome Atlas (TCGA) projects have advanced our understanding of the driver mutations, genetic backgrounds, and key pathways activated across cancer types. Analysis of TCGA datasets has mostly focused on somatic mutations and translocations, with less emphasis placed on gene amplifications. Here we describe a bioinformatics screening strategy to identify putative cancer driver genes amplified across TCGA datasets. We carried out GISTIC2 analysis of TCGA datasets spanning 16 cancer subtypes and identified 486 genes that were amplified in two or more datasets. The list was narrowed to 75 cancer-associated genes with potential "druggable" properties. The majority of the genes were localized to 14 amplicons spread across the genome. To identify potential cancer driver genes, we analyzed gene copy number and mRNA expression data from individual patient samples and identified 42 putative cancer driver genes linked to diverse oncogenic processes. Oncogenic activity was further validated by siRNA/shRNA knockdown and by referencing the Project Achilles datasets. The amplified genes represented a number of gene families, including epigenetic regulators, cell cycle-associated genes, DNA damage response/repair genes, metabolic regulators, and genes linked to the Wnt, Notch, Hedgehog, JAK/STAT, NF-KB and MAPK signaling pathways. Among the 42 putative driver genes were known driver genes, such as EGFR, ERBB2 and PIK3CA. Wild-type KRAS was amplified in several cancer types, and KRAS-amplified cancer cell lines were most sensitive to KRAS shRNA, suggesting that KRAS amplification was an independent oncogenic event. A number of MAP kinase adapters were co-amplified with their receptor tyrosine kinases, such as the FGFR adapter FRS2 and the EGFR family adapters GRB2 and GRB7. The ubiquitin-like ligase DCUN1D1 and the histone methyltransferase NSD3 were also identified as novel putative cancer driver genes. We discuss the patient tailoring implications for existing cancer drug targets and we further discuss potential novel opportunities for drug discovery efforts.

  12. Identification of Druggable Cancer Driver Genes Amplified across TCGA Datasets

    PubMed Central

    Chen, Ying; McGee, Jeremy; Chen, Xianming; Doman, Thompson N.; Gong, Xueqian; Zhang, Youyan; Hamm, Nicole; Ma, Xiwen; Higgs, Richard E.; Bhagwat, Shripad V.; Buchanan, Sean; Peng, Sheng-Bin; Staschke, Kirk A.; Yadav, Vipin; Yue, Yong; Kouros-Mehr, Hosein

    2014-01-01

    The Cancer Genome Atlas (TCGA) projects have advanced our understanding of the driver mutations, genetic backgrounds, and key pathways activated across cancer types. Analysis of TCGA datasets has mostly focused on somatic mutations and translocations, with less emphasis placed on gene amplifications. Here we describe a bioinformatics screening strategy to identify putative cancer driver genes amplified across TCGA datasets. We carried out GISTIC2 analysis of TCGA datasets spanning 14 cancer subtypes and identified 461 genes that were amplified in two or more datasets. The list was narrowed to 73 cancer-associated genes with potential “druggable” properties. The majority of the genes were localized to 14 amplicons spread across the genome. To identify potential cancer driver genes, we analyzed gene copy number and mRNA expression data from individual patient samples and identified 40 putative cancer driver genes linked to diverse oncogenic processes. Oncogenic activity was further validated by siRNA/shRNA knockdown and by referencing the Project Achilles datasets. The amplified genes represented a number of gene families, including epigenetic regulators, cell cycle-associated genes, DNA damage response/repair genes, metabolic regulators, and genes linked to the Wnt, Notch, Hedgehog, JAK/STAT, NF-κB and MAPK signaling pathways. Among the 40 putative driver genes were known driver genes, such as EGFR, ERBB2 and PIK3CA. Wild-type KRAS was amplified in several cancer types, and KRAS-amplified cancer cell lines were most sensitive to KRAS shRNA, suggesting that KRAS amplification was an independent oncogenic event. A number of MAP kinase adapters were co-amplified with their receptor tyrosine kinases, such as the FGFR adapter FRS2 and the EGFR family adapter GRB7. The ubiquitin-like ligase DCUN1D1 and the histone methyltransferase NSD3 were also identified as novel putative cancer driver genes. We discuss the patient tailoring implications for existing cancer drug targets and we further discuss potential novel opportunities for drug discovery efforts. PMID:24874471
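
    The recurrence filter described in this abstract (keeping genes amplified in two or more datasets) can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name, the input layout (dataset label mapped to a list of amplified genes) and the toy gene labels are all assumptions made for illustration, and the per-dataset amplification calls are taken as given rather than computed.

```python
from collections import Counter

def recurrently_amplified(amplified_by_dataset, min_datasets=2):
    """Return genes called amplified in at least `min_datasets` datasets.

    `amplified_by_dataset` is a hypothetical mapping: dataset label -> genes
    called amplified in that dataset (e.g. by an upstream copy-number tool).
    """
    counts = Counter()
    for genes in amplified_by_dataset.values():
        counts.update(set(genes))  # count each gene at most once per dataset
    return sorted(g for g, n in counts.items() if n >= min_datasets)

# Toy, made-up input; these labels are placeholders, not results from the study.
toy_calls = {
    "dataset_A": ["geneA", "geneB", "geneC"],
    "dataset_B": ["geneB", "geneD"],
    "dataset_C": ["geneC", "geneB", "geneE"],
}
recurrent = recurrently_amplified(toy_calls)
print(recurrent)
```

    Raising `min_datasets` makes the filter stricter; the abstract's later steps (druggability and expression-based prioritization) are not covered by this sketch.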

  13. Novel Harmonic Regularization Approach for Variable Selection in Cox's Proportional Hazards Model

    PubMed Central

    Chu, Ge-Jin; Liang, Yong; Wang, Jia-Xuan

    2014-01-01

    Variable selection is an important issue in regression and a number of variable selection methods have been proposed involving nonconvex penalty functions. In this paper, we investigate a novel harmonic regularization method, which can approximate nonconvex Lq (1/2 < q < 1) regularizations, to select key risk factors in Cox's proportional hazards model using microarray gene expression data. The harmonic regularization method can be efficiently solved using our proposed direct path seeking approach, which can produce solutions that closely approximate those for the convex loss function and the nonconvex regularization. Simulation results based on artificial datasets and four real microarray gene expression datasets, including the diffuse large B-cell lymphoma (DLBCL), lung cancer, and AML datasets, show that the harmonic regularization method can be more accurate for variable selection than existing Lasso series methods. PMID:25506389

  14. Identification of upstream transcription factors (TFs) for expression signature genes in breast cancer.

    PubMed

    Zang, Hongyan; Li, Ning; Pan, Yuling; Hao, Jingguang

    2017-03-01

    Breast cancer is a common malignancy among women with a rising incidence. Our intention was to detect transcription factors (TFs) for deeper understanding of the underlying mechanisms of breast cancer. Integrated analysis of gene expression datasets of breast cancer was performed. Then, functional annotation of differentially expressed genes (DEGs) was conducted, including Gene Ontology (GO) enrichment and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment. Furthermore, TFs were identified and a global transcriptional regulatory network was constructed. Seven publicly available GEO datasets were obtained, and a set of 1196 DEGs was identified (460 up-regulated and 736 down-regulated). Functional annotation results showed that cell cycle was the most significantly enriched pathway, which was consistent with the fact that cell cycle is closely related to various tumors. Fifty-three differentially expressed TFs were identified, and the regulatory networks consisted of 817 TF-target interactions between 46 TFs and 602 DEGs in the context of breast cancer. The top 10 TFs covering the most downstream DEGs were SOX10, NFATC2, ZNF354C, ARID3A, BRCA1, FOXO3, GATA3, ZEB1, HOXA5 and EGR1. The transcriptional regulatory networks could enable a better understanding of regulatory mechanisms of breast cancer pathology and provide an opportunity for the development of potential therapy.
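
    Ranking TFs by how many downstream DEGs they cover, as in this abstract, amounts to counting distinct targets per TF in a list of TF-target interactions. The sketch below is only an illustration under that assumption: the function name, the edge-list format and the placeholder labels are hypothetical and do not come from the study.

```python
from collections import defaultdict

def rank_tfs_by_targets(edges, top_n=10):
    """Rank TFs by the number of distinct target genes in an edge list.

    `edges` is a hypothetical iterable of (tf, target_gene) pairs.
    Ties are broken alphabetically by TF name so the output is deterministic.
    """
    targets = defaultdict(set)
    for tf, gene in edges:
        targets[tf].add(gene)  # a set ignores duplicated interactions
    ranked = sorted(targets.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(tf, len(genes)) for tf, genes in ranked[:top_n]]

# Toy, made-up interactions with placeholder names (not data from the study).
toy_edges = [
    ("TF_A", "gene1"), ("TF_A", "gene2"), ("TF_A", "gene2"),
    ("TF_B", "gene2"),
    ("TF_C", "gene1"), ("TF_C", "gene3"), ("TF_C", "gene4"),
]
print(rank_tfs_by_targets(toy_edges, top_n=2))
```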

  15. Comparison of alternative approaches for analysing multi-level RNA-seq data

    PubMed Central

    Mohorianu, Irina; Bretman, Amanda; Smith, Damian T.; Fowler, Emily K.; Dalmay, Tamas

    2017-01-01

    RNA sequencing (RNA-seq) is widely used for RNA quantification in the environmental, biological and medical sciences. It enables the description of genome-wide patterns of expression and the identification of regulatory interactions and networks. The aim of RNA-seq data analyses is to achieve rigorous quantification of genes/transcripts to allow a reliable prediction of differential expression (DE), despite variation in levels of noise and inherent biases in sequencing data. This can be especially challenging for datasets in which gene expression differences are subtle, as in the behavioural transcriptomics test dataset from D. melanogaster that we used here. We investigated the power of existing approaches for quality checking mRNA-seq data and explored additional, quantitative quality checks. To accommodate nested, multi-level experimental designs, we incorporated sample layout into our analyses. We employed a subsampling without replacement-based normalization and an identification of DE that accounted for the hierarchy and amplitude of effect sizes within samples, then evaluated the resulting differential expression call in comparison to existing approaches. In a final step to test for broader applicability, we applied our approaches to a published set of H. sapiens mRNA-seq samples. The dataset-tailored methods improved sample comparability and delivered a robust prediction of subtle gene expression changes. The proposed approaches have the potential to improve key steps in the analysis of RNA-seq data by incorporating the structure and characteristics of biological experiments. PMID:28792517

  16. ConGEMs: Condensed Gene Co-Expression Module Discovery Through Rule-Based Clustering and Its Application to Carcinogenesis.

    PubMed

    Mallik, Saurav; Zhao, Zhongming

    2017-12-28

    For transcriptomic analysis, there are numerous microarray-based genomic data, especially those generated for cancer research. The typical analysis measures the difference between a cancer sample-group and a matched control group for each transcript or gene. Association rule mining is used to discover interesting item sets through rule-based methodology. Thus, it has advantages in finding causal relationships between the transcripts. In this work, we introduce two new rule-based similarity measures, the weighted rank-based Jaccard and Cosine measures, and then propose a novel computational framework to detect condensed gene co-expression modules (ConGEMs) through the association rule-based learning system and the weighted similarity scores. In practice, the list of evolved condensed markers that consists of both singular and complex markers in nature depends on the corresponding condensed gene sets in either antecedent or consequent of the rules of the resultant modules. In our evaluation, these markers could be supported by literature evidence, KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway and Gene Ontology annotations. Specifically, we preliminarily identified differentially expressed genes using an empirical Bayes test. A recently developed algorithm, RANWAR, was then utilized to determine the association rules from these genes. Based on that, we computed the integrated similarity scores of these rule-based similarity measures between each rule-pair, and the resultant scores were used for clustering to identify the co-expressed rule-modules. We applied our method to a gene expression dataset for lung squamous cell carcinoma and a genome methylation dataset for uterine cervical carcinogenesis. Our proposed module discovery method produced better results than the traditional gene-module discovery measures. In summary, our proposed rule-based method is useful for exploring biomarker modules from transcriptomic data.

  17. Causes and Consequences of Genetic Background Effects Illuminated by Integrative Genomic Analysis

    PubMed Central

    Chandler, Christopher H.; Chari, Sudarshan; Dworkin, Ian

    2014-01-01

    The phenotypic consequences of individual mutations are modulated by the wild-type genetic background in which they occur. Although such background dependence is widely observed, we do not know whether general patterns across species and traits exist or about the mechanisms underlying it. We also lack knowledge on how mutations interact with genetic background to influence gene expression and how this in turn mediates mutant phenotypes. Furthermore, how genetic background influences patterns of epistasis remains unclear. To investigate the genetic basis and genomic consequences of genetic background dependence of the scallopedE3 allele on the Drosophila melanogaster wing, we generated multiple novel genome-level datasets from a mapping-by-introgression experiment and a tagged RNA gene expression dataset. In addition we used whole genome resequencing of the parental lines—two commonly used laboratory strains—to predict polymorphic transcription factor binding sites for SD. We integrated these data with previously published genomic datasets from expression microarrays and a modifier mutation screen. By searching for genes showing a congruent signal across multiple datasets, we were able to identify a robust set of candidate loci contributing to the background-dependent effects of mutations in sd. We also show that the majority of background-dependent modifiers previously reported are caused by higher-order epistasis, not quantitative noncomplementation. These findings provide a useful foundation for more detailed investigations of genetic background dependence in this system, and this approach is likely to prove useful in exploring the genetic basis of other traits as well. PMID:24504186

  18. Systems Level Analysis of Systemic Sclerosis Shows a Network of Immune and Profibrotic Pathways Connected with Genetic Polymorphisms

    PubMed Central

    Mahoney, J. Matthew; Taroni, Jaclyn; Martyanov, Viktor; Wood, Tammara A.; Greene, Casey S.; Pioli, Patricia A.; Hinchcliff, Monique E.; Whitfield, Michael L.

    2015-01-01

    Systemic sclerosis (SSc) is a rare systemic autoimmune disease characterized by skin and organ fibrosis. The pathogenesis of SSc and its progression are poorly understood. The SSc intrinsic gene expression subsets (inflammatory, fibroproliferative, normal-like, and limited) are observed in multiple clinical cohorts of patients with SSc. Analysis of longitudinal skin biopsies suggests that a patient's subset assignment is stable over 6–12 months. Genetically, SSc is multi-factorial with many genetic risk loci for SSc generally and for specific clinical manifestations. Here we identify the genes consistently associated with the intrinsic subsets across three independent cohorts, show the relationship between these genes using a gene-gene interaction network, and place the genetic risk loci in the context of the intrinsic subsets. To identify gene expression modules common to three independent datasets from three different clinical centers, we developed a consensus clustering procedure based on mutual information of partitions, an information theory concept, and performed a meta-analysis of these genome-wide gene expression datasets. We created a gene-gene interaction network of the conserved molecular features across the intrinsic subsets and analyzed their connections with SSc-associated genetic polymorphisms. The network is composed of distinct, but interconnected, components related to interferon activation, M2 macrophages, adaptive immunity, extracellular matrix remodeling, and cell proliferation. The network shows extensive connections between the inflammatory- and fibroproliferative-specific genes. The network also shows connections between these subset-specific genes and 30 SSc-associated polymorphic genes including STAT4, BLK, IRF7, NOTCH4, PLAUR, CSK, IRAK1, and several human leukocyte antigen (HLA) genes. 
    Our analyses suggest that the gene expression changes underlying the SSc subsets may be long-lived, but mechanistically interconnected and related to a patient's underlying genetic risk. PMID:25569146

  19. A Protocol for Using Gene Set Enrichment Analysis to Identify the Appropriate Animal Model for Translational Research.

    PubMed

    Weidner, Christopher; Steinfath, Matthias; Wistorf, Elisa; Oelgeschläger, Michael; Schneider, Marlon R; Schönfelder, Gilbert

    2017-08-16

    Recent studies that compared transcriptomic datasets of human diseases with datasets from mouse models using traditional gene-to-gene comparison techniques resulted in contradictory conclusions regarding the relevance of animal models for translational research. A major reason for the discrepancies between different gene expression analyses is the arbitrary filtering of differentially expressed genes. Furthermore, the comparison of single genes between different species and platforms often is limited by technical variance, leading to misinterpretation of the con/discordance between data from human and animal models. Thus, standardized approaches for systematic data analysis are needed. To overcome subjective gene filtering and ineffective gene-to-gene comparisons, we recently demonstrated that gene set enrichment analysis (GSEA) has the potential to avoid these problems. Therefore, we developed a standardized protocol for the use of GSEA to distinguish between appropriate and inappropriate animal models for translational research. This protocol is not suitable to predict how to design new model systems a-priori, as it requires existing experimental omics data. However, the protocol describes how to interpret existing data in a standardized manner in order to select the most suitable animal model, thus avoiding unnecessary animal experiments and misleading translational studies.

  20. Mining microbial metatranscriptomes for expression of antibiotic resistance genes under natural conditions.

    PubMed

    Versluis, Dennis; D'Andrea, Marco Maria; Ramiro Garcia, Javier; Leimena, Milkha M; Hugenholtz, Floor; Zhang, Jing; Öztürk, Başak; Nylund, Lotta; Sipkema, Detmer; van Schaik, Willem; de Vos, Willem M; Kleerebezem, Michiel; Smidt, Hauke; van Passel, Mark W J

    2015-07-08

    Antibiotic resistance genes are found in a broad range of ecological niches associated with complex microbiota. Here we investigated if resistance genes are not only present, but also transcribed under natural conditions. Furthermore, we examined the potential for antibiotic production by assessing the expression of associated secondary metabolite biosynthesis gene clusters. Metatranscriptome datasets from intestinal microbiota of four human adults, one human infant, 15 mice and six pigs, of which only the latter have received antibiotics prior to the study, as well as from sea bacterioplankton, a marine sponge, forest soil and sub-seafloor sediment, were investigated. We found that resistance genes are expressed in all studied ecological niches, albeit with niche-specific differences in relative expression levels and diversity of transcripts. For example, in mice and human infant microbiota predominantly tetracycline resistance genes were expressed while in human adult microbiota the spectrum of expressed genes was more diverse, and also included β-lactam, aminoglycoside and macrolide resistance genes. Resistance gene expression could result from the presence of natural antibiotics in the environment, although we could not link it to expression of corresponding secondary metabolites biosynthesis clusters. Alternatively, resistance gene expression could be constitutive, or these genes serve alternative roles besides antibiotic resistance.

  1. Mining microbial metatranscriptomes for expression of antibiotic resistance genes under natural conditions

    NASA Astrophysics Data System (ADS)

    Versluis, Dennis; D'Andrea, Marco Maria; Ramiro Garcia, Javier; Leimena, Milkha M.; Hugenholtz, Floor; Zhang, Jing; Öztürk, Başak; Nylund, Lotta; Sipkema, Detmer; Schaik, Willem Van; de Vos, Willem M.; Kleerebezem, Michiel; Smidt, Hauke; Passel, Mark W. J. Van

    2015-07-01

    Antibiotic resistance genes are found in a broad range of ecological niches associated with complex microbiota. Here we investigated if resistance genes are not only present, but also transcribed under natural conditions. Furthermore, we examined the potential for antibiotic production by assessing the expression of associated secondary metabolite biosynthesis gene clusters. Metatranscriptome datasets from intestinal microbiota of four human adults, one human infant, 15 mice and six pigs, of which only the latter have received antibiotics prior to the study, as well as from sea bacterioplankton, a marine sponge, forest soil and sub-seafloor sediment, were investigated. We found that resistance genes are expressed in all studied ecological niches, albeit with niche-specific differences in relative expression levels and diversity of transcripts. For example, in mice and human infant microbiota predominantly tetracycline resistance genes were expressed while in human adult microbiota the spectrum of expressed genes was more diverse, and also included β-lactam, aminoglycoside and macrolide resistance genes. Resistance gene expression could result from the presence of natural antibiotics in the environment, although we could not link it to expression of corresponding secondary metabolites biosynthesis clusters. Alternatively, resistance gene expression could be constitutive, or these genes serve alternative roles besides antibiotic resistance.

  2. CrossLink: a novel method for cross-condition classification of cancer subtypes.

    PubMed

    Ma, Chifeng; Sastry, Konduru S; Flore, Mario; Gehani, Salah; Al-Bozom, Issam; Feng, Yusheng; Serpedin, Erchin; Chouchane, Lotfi; Chen, Yidong; Huang, Yufei

    2016-08-22

    We considered the prediction of cancer classes (e.g. subtypes) using patient gene expression profiles that contain both systematic and condition-specific biases when compared with the training reference dataset. The conventional normalization-based approaches cannot guarantee that the gene signatures in the reference and prediction datasets always have the same distribution for all different conditions as the class-specific gene signatures change with the condition. Therefore, the trained classifier would work well under one condition but not under another. To address the problem of current normalization approaches, we propose a novel algorithm called CrossLink (CL). CL recognizes that there is no universal, condition-independent normalization mapping of signatures. Instead, it exploits the fact that the signature is unique to its associated class under any condition and thus employs an unsupervised clustering algorithm to discover this unique signature. We assessed the performance of CL for cross-condition predictions of PAM50 subtypes of breast cancer by using a simulated dataset modeled after TCGA BRCA tumor samples with a cross-validation scheme, and datasets with known and unknown PAM50 classification. CL achieved prediction accuracy >73%, the highest among the methods we evaluated. We also applied the algorithm to a set of breast cancer tumors derived from an Arabic population to assign a PAM50 classification to each tumor based on their gene expression profiles. A novel algorithm CrossLink for cross-condition prediction of cancer classes was proposed. In all test datasets, CL showed robust and consistent improvement in prediction performance over other state-of-the-art normalization and classification algorithms.

  3. Robust transcriptional tumor signatures applicable to both formalin-fixed paraffin-embedded and fresh-frozen samples

    PubMed Central

    Cheng, Jun; He, Jun; Liu, Huaping; Cai, Hao; Hong, Guini; Zhang, Jiahui; Li, Na; Ao, Lu; Guo, Zheng

    2017-01-01

    Formalin-fixed paraffin-embedded (FFPE) samples represent a valuable resource for clinical research. However, FFPE samples are usually considered an unreliable source for gene expression analysis due to partial RNA degradation. In this study, through comparing gene expression profiles between FFPE samples and paired fresh-frozen (FF) samples for three cancer types, we first showed that expression measurements of thousands of genes had at least two-fold change in FFPE samples compared with paired FF samples. Therefore, for a transcriptional signature based on risk scores summarized from the expression levels of the signature genes, the risk score thresholds trained from FFPE (or FF) samples could not be applied to FF (or FFPE) samples. On the other hand, we found that more than 90% of the relative expression orderings (REOs) of gene pairs in the FF samples were maintained in their paired FFPE samples and largely unaffected by the storage time. The result suggested that the REOs of gene pairs were highly robust against partial RNA degradation in FFPE samples. Finally, as a case study, we developed a REOs-based signature to distinguish liver cirrhosis from hepatocellular carcinoma (HCC) using FFPE samples. The signature was validated in four datasets of FFPE samples and eight datasets of FF samples. In conclusion, the valuable FFPE samples can be fully exploited to identify REOs-based diagnostic and prognostic signatures which could be robustly applicable to both FF samples and FFPE samples with degraded RNA. PMID:28036264
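
    The idea of a relative expression ordering (REO) being "maintained" between paired samples can be made concrete with a small sketch: for every gene pair, check whether the gene with the higher value in one profile is also higher in the other. This is an assumed, simplified reading of the abstract, not the authors' implementation; the function name, the dictionary input format, the handling of ties and the toy values are all hypothetical.

```python
from itertools import combinations

def reo_concordance(expr_a, expr_b):
    """Fraction of gene pairs whose expression ordering agrees in both profiles.

    `expr_a` and `expr_b` map gene -> expression value (hypothetical format).
    Pairs tied in either profile are skipped, since they define no ordering.
    """
    genes = sorted(set(expr_a) & set(expr_b))
    kept = total = 0
    for g1, g2 in combinations(genes, 2):
        diff_a = expr_a[g1] - expr_a[g2]
        diff_b = expr_b[g1] - expr_b[g2]
        if diff_a == 0 or diff_b == 0:
            continue
        total += 1
        if (diff_a > 0) == (diff_b > 0):
            kept += 1
    return kept / total if total else float("nan")

# Toy, made-up paired profiles: absolute values shift a lot, orderings mostly hold.
ff_profile = {"g1": 10.0, "g2": 6.0, "g3": 2.0, "g4": 1.0}
ffpe_profile = {"g1": 4.0, "g2": 3.0, "g3": 0.5, "g4": 0.8}
print(reo_concordance(ff_profile, ffpe_profile))
```

    In the toy profiles every value changes between the two samples, yet only the g3/g4 pair flips order, which is the kind of robustness the abstract attributes to REOs.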

  4. Role of miR-452-5p in the tumorigenesis of prostate cancer: A study based on the Cancer Genome Atlas (TCGA), Gene Expression Omnibus (GEO), and bioinformatics analysis.

    PubMed

    Gao, Li; Zhang, Li-Jie; Li, Sheng-Hua; Wei, Li-Li; Luo, Bin; He, Rong-Quan; Xia, Shuang

    2018-03-06

    MiR-452-5p has been reported to be down-regulated in prostate cancer, affecting the development of this type of cancer. However, the molecular mechanism of miR-452-5p in prostate cancer remains unclear. Therefore, we investigated the network of target genes of miR-452-5p in prostate cancer using bioinformatics analyses. We first analyzed the expression profiles and prognostic value of miR-452-5p in prostate cancer tissues from a public database. Gene Ontology (GO), the Kyoto Encyclopedia of Genes and Genomes (KEGG), PANTHER pathway analyses, and a disease ontology (DO) analysis were performed to find the molecular functions of the target genes from GSE datasets and miRWalk. Finally, we validated hub genes from the protein-protein interaction (PPI) networks of the target genes in the Human Protein Atlas (HPA) database and Gene Expression Profiling Interactive Analysis (GEPIA). Narrowing down the optimal target genes was conducted by seeking the common parts of up-regulated genes from GEPIA, down-regulated genes from GSE datasets, and predicted genes in miRWalk. Based on mining of GEO and ArrayExpress microarray chips and miRNA-Seq data in the TCGA database, which includes 1007 prostate cancer samples and 387 non-cancer samples, miR-452-5p is shown to be down-regulated in prostate cancer. GO, KEGG, and PANTHER pathway analyses suggested that the target genes might participate in important biological processes, such as transforming growth factor beta signaling and the positive regulation of brown fat cell differentiation and mesenchymal cell differentiation, as well as the Ras signaling pathway and pathways regulating the pluripotency of stem cells and arrhythmogenic right ventricular cardiomyopathy (ARVC). Nine genes (GABBR, PNISR, NTSR1, DOCK1, EREG, SFRP1, PTGS2, LEF1, and BMP2) were defined as hub genes in the PPI network. Three genes (FAM174B, SLC30A4, and SLIT1) were jointly shared by GEPIA, the GSE datasets, and miRWalk. Down-regulated miR-452-5p might play an essential role in the tumorigenesis of prostate cancer. Copyright © 2018. Published by Elsevier GmbH.

  5. Microarray-based characterization of differential gene expression during vocal fold wound healing in rats

    PubMed Central

    Welham, Nathan V.; Ling, Changying; Dawson, John A.; Kendziorski, Christina; Thibeault, Susan L.; Yamashita, Masaru

    2015-01-01

    The vocal fold (VF) mucosa confers elegant biomechanical function for voice production but is susceptible to scar formation following injury. Current understanding of VF wound healing is hindered by a paucity of data and is therefore often generalized from research conducted in skin and other mucosal systems. Here, using a previously validated rat injury model, expression microarray technology and an empirical Bayes analysis approach, we generated a VF-specific transcriptome dataset to better capture the system-level complexity of wound healing in this specialized tissue. We measured differential gene expression at 3, 14 and 60 days post-injury compared to experimentally naïve controls, pursued functional enrichment analyses to refine and add greater biological definition to the previously proposed temporal phases of VF wound healing, and validated the expression and localization of a subset of previously unidentified repair- and regeneration-related genes at the protein level. Our microarray dataset is a resource for the wider research community and has the potential to stimulate new hypotheses and avenues of investigation, improve biological and mechanistic insight, and accelerate the identification of novel therapeutic targets. PMID:25592437

  6. Predictive Models of Cognitive Outcomes of Developmental Insults

    NASA Astrophysics Data System (ADS)

    Chan, Yupo; Bouaynaya, Nidhal; Chowdhury, Parimal; Leszczynska, Danuta; Patterson, Tucker A.; Tarasenko, Olga

    2010-04-01

    Representatives of Arkansas medical, research and educational institutions have gathered over the past four years to discuss the relationship between functional developmental perturbations and their neurological consequences. We wish to track the effect on the nervous system by developmental perturbations over time and across species. Except for perturbations, the sequence of events that occur during neural development was found to be remarkably conserved across mammalian species. The tracking includes consequences on anatomical regions and behavioral changes. The ultimate goal is to develop a predictive model of long-term genotypic and phenotypic outcomes that includes developmental insults. Such a model can subsequently be fostered into an educated intervention for therapeutic purposes. Several datasets were identified to test plausible hypotheses, ranging from evoked potential datasets to sleep-disorder datasets. An initial model may be mathematical and conceptual. However, we expect to see rapid progress as large-scale gene expression studies in the mammalian brain permit genome-wide searches to discover genes that are uniquely expressed in brain circuits and regions. These genes ultimately control behavior. By using a validated model we endeavor to make useful predictions.

  7. GTA: a game theoretic approach to identifying cancer subnetwork markers.

    PubMed

    Farahmand, S; Goliaei, S; Ansari-Pour, N; Razaghi-Moghadam, Z

    2016-03-01

    The identification of genetic markers (e.g. genes, pathways and subnetworks) for cancer has been one of the most challenging research areas in recent years. A subset of these studies attempts to analyze genome-wide expression profiles to identify markers with high reliability and reusability across independent whole-transcriptome microarray datasets. Therefore, the functional relationships of genes are integrated with their expression data. However, for a more accurate representation of the functional relationships among genes, utilization of the protein-protein interaction network (PPIN) seems to be necessary. Herein, a novel game theoretic approach (GTA) is proposed for the identification of cancer subnetwork markers by integrating genome-wide expression profiles and PPIN. The GTA method was applied to three distinct whole-transcriptome breast cancer datasets to identify the subnetwork markers associated with metastasis. To evaluate the performance of our approach, the identified subnetwork markers were compared with gene-based, pathway-based and network-based markers. We show that GTA is not only capable of identifying robust metastatic markers but also provides a higher classification performance. In addition, based on these GTA-based subnetworks, we identified a new bona fide candidate gene for breast cancer susceptibility.

  8. Exploring homogeneity of correlation structures of gene expression datasets within and between etiological disease categories.

    PubMed

    Jong, Victor L; Novianti, Putri W; Roes, Kit C B; Eijkemans, Marinus J C

    2014-12-01

    The literature shows that classifiers perform differently across datasets and that correlations within datasets affect the performance of classifiers. The question that arises is whether the correlation structure within datasets differs significantly across diseases. In this study, we evaluated the homogeneity of correlation structures within and between datasets of six etiological disease categories: inflammatory, immune, infectious, degenerative, hereditary and acute myeloid leukemia (AML). We also assessed the effect of two filtering methods, detection call and variance filtering, on correlation structures. We downloaded microarray datasets from ArrayExpress for experiments meeting predefined criteria and ended up with 12 datasets for non-cancerous diseases and six for AML. The datasets were preprocessed by a common procedure incorporating platform-specific recommendations and the two filtering methods mentioned above. Homogeneity of correlation matrices between and within datasets of etiological diseases was assessed using the Box's M statistic on permuted samples. We found that correlation structures significantly differ between datasets of the same and/or different etiological disease categories and that variance filtering eliminates more uncorrelated probesets than detection call filtering and thus renders the data highly correlated.

  9. Bayesian test for colocalisation between pairs of genetic association studies using summary statistics.

    PubMed

    Giambartolomei, Claudia; Vukcevic, Damjan; Schadt, Eric E; Franke, Lude; Hingorani, Aroon D; Wallace, Chris; Plagnol, Vincent

    2014-05-01

    Genetic association studies, in particular the genome-wide association study (GWAS) design, have provided a wealth of novel insights into the aetiology of a wide range of human diseases and traits, in particular cardiovascular diseases and lipid biomarkers. The next challenge consists of understanding the molecular basis of these associations. The integration of multiple association datasets, including gene expression datasets, can contribute to this goal. We have developed a novel statistical methodology to assess whether two association signals are consistent with a shared causal variant. An application is the integration of disease scans with expression quantitative trait locus (eQTL) studies, but any pair of GWAS datasets can be integrated in this framework. We demonstrate the value of the approach by re-analysing a gene expression dataset in 966 liver samples with a published meta-analysis of lipid traits including >100,000 individuals of European ancestry. Combining all lipid biomarkers, our re-analysis supported 26 out of 38 reported colocalisation results with eQTLs and identified 14 new colocalisation results, hence highlighting the value of a formal statistical test. In three cases of reported eQTL-lipid pairs (SYPL2, IFT172, TBKBP1) for which our analysis suggests that the eQTL pattern is not consistent with the lipid association, we identify alternative colocalisation results with SORT1, GCKR, and KPNB1, indicating that these genes are more likely to be causal in these genomic intervals. A key feature of the method is the ability to derive the output statistics from single SNP summary statistics, hence making it possible to perform systematic meta-analysis type comparisons across multiple GWAS datasets (implemented online at http://coloc.cs.ucl.ac.uk/coloc/). 
Our methodology provides information about candidate causal genes in associated intervals and has direct implications for the understanding of complex diseases as well as the design of drugs to target disease pathways.

  10. Gene expression metadata analysis reveals molecular mechanisms employed by Phanerochaete chrysosporium during lignin degradation and detoxification of plant extractives.

    PubMed

    Kameshwar, Ayyappa Kumar Sista; Qin, Wensheng

    2017-10-01

    Lignin, the most complex and abundant biopolymer on the earth's surface, attains its stability from intricate polyphenolic units and non-phenolic bonds, making it difficult to depolymerize or separate from other units of biomass. Eccentric lignin degrading ability and availability of an annotated genome make Phanerochaete chrysosporium ideal for studying lignin degrading mechanisms. Decoding and understanding the molecular mechanisms underlying the process of lignin degradation will significantly aid the progressing biofuel industries and lead to the production of commercially vital platform chemicals. In this study, we have performed a large-scale metadata analysis to understand the common gene expression patterns of P. chrysosporium during lignin degradation. Gene expression datasets were retrieved from NCBI GEO database and analyzed using GEO2R and Bioconductor packages. Commonly expressed statistically significant genes among different datasets were further considered to understand their involvement in lignin degradation and detoxification mechanisms. We have observed three sets of enzymes commonly expressed during ligninolytic conditions which were later classified into primary ligninolytic, aromatic compound-degrading and other necessary enzymes. Similarly, we have observed three sets of genes coding for detoxification and stress-responsive, phase I and phase II metabolic enzymes. Results obtained in this study indicate the coordinated action of enzymes involved in lignin depolymerization and detoxification-stress responses under ligninolytic conditions. We have developed a tentative network of genes and enzymes involved in lignin degradation and detoxification mechanisms by P. chrysosporium based on the literature and results obtained in this study. However, ambiguity arising from higher expression of several uncharacterized proteins necessitates further proteomic studies in P. chrysosporium.

  11. Novel molecular subtypes of serous and endometrioid ovarian cancer linked to clinical outcome.

    PubMed

    Tothill, Richard W; Tinker, Anna V; George, Joshy; Brown, Robert; Fox, Stephen B; Lade, Stephen; Johnson, Daryl S; Trivett, Melanie K; Etemadmoghadam, Dariush; Locandro, Bianca; Traficante, Nadia; Fereday, Sian; Hung, Jillian A; Chiew, Yoke-Eng; Haviv, Izhak; Gertig, Dorota; DeFazio, Anna; Bowtell, David D L

    2008-08-15

    The study aimed to identify novel molecular subtypes of ovarian cancer by gene expression profiling with linkage to clinical and pathologic features. Microarray gene expression profiling was done on 285 serous and endometrioid tumors of the ovary, peritoneum, and fallopian tube. K-means clustering was applied to identify robust molecular subtypes. Statistical analysis identified differentially expressed genes, pathways, and gene ontologies. Laser capture microdissection, pathology review, and immunohistochemistry validated the array-based findings. Patient survival within k-means groups was evaluated using Cox proportional hazards models. Class prediction validated k-means groups in an independent dataset. A semisupervised survival analysis of the array data was used to compare against unsupervised clustering results. Optimal clustering of array data identified six molecular subtypes. Two subtypes represented predominantly serous low malignant potential and low-grade endometrioid subtypes, respectively. The remaining four subtypes represented higher grade and advanced stage cancers of serous and endometrioid morphology. A novel subtype of high-grade serous cancers reflected a mesenchymal cell type, characterized by overexpression of N-cadherin and P-cadherin and low expression of differentiation markers, including CA125 and MUC1. A poor prognosis subtype was defined by a reactive stroma gene expression signature, correlating with extensive desmoplasia in such samples. A similar poor prognosis signature could be found using a semisupervised analysis. Each subtype displayed distinct levels and patterns of immune cell infiltration. Class prediction identified similar subtypes in an independent ovarian dataset with similar prognostic trends. Gene expression profiling identified molecular subtypes of ovarian cancer of biological and clinical importance.
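
    The k-means step mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the two-gene toy profiles, the function name kmeans, and the naive "first k samples" initialization are all assumptions made for illustration. A real analysis would use many genes, several random restarts, and a principled choice of k.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; returns (labels, centroids)."""
    # Naive deterministic start (an assumption): the first k points are the initial centroids.
    centroids = [list(p) for p in points[:k]]
    labels = []
    for _ in range(iterations):
        # Assignment step: each sample joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical expression profiles (two genes per sample), not data from the study.
profiles = [(1.0, 1.1), (5.0, 5.2), (0.9, 1.0), (5.1, 4.9), (1.1, 0.9), (4.9, 5.0)]
labels, centroids = kmeans(profiles, k=2)
print(labels)
```

    Each sample receives the index of the cluster it ends up in; survival or class-prediction analyses such as those described above would then be run per cluster.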

  12. A computational approach to identify cellular heterogeneity and tissue-specific gene regulatory networks.

    PubMed

    Jambusaria, Ankit; Klomp, Jeff; Hong, Zhigang; Rafii, Shahin; Dai, Yang; Malik, Asrar B; Rehman, Jalees

    2018-06-07

    The heterogeneity of cells across tissue types represents a major challenge for studying biological mechanisms as well as for therapeutic targeting of distinct tissues. Computational prediction of tissue-specific gene regulatory networks may provide important insights into the mechanisms underlying the cellular heterogeneity of cells in distinct organs and tissues. Using three pathway analysis techniques, gene set enrichment analysis (GSEA), parametric analysis of gene set enrichment (PGSEA), alongside our novel model (HeteroPath), which assesses heterogeneously upregulated and downregulated genes within the context of pathways, we generated distinct tissue-specific gene regulatory networks. We analyzed gene expression data derived from freshly isolated heart, brain, and lung endothelial cells and populations of neurons in the hippocampus, cingulate cortex, and amygdala. In both datasets, we found that HeteroPath segregated the distinct cellular populations by identifying regulatory pathways that were not identified by GSEA or PGSEA. Using simulated datasets, HeteroPath demonstrated robustness that was comparable to what was seen using existing gene set enrichment methods. Furthermore, we generated tissue-specific gene regulatory networks involved in vascular heterogeneity and neuronal heterogeneity by performing motif enrichment of the heterogeneous genes identified by HeteroPath and linking the enriched motifs to regulatory transcription factors in the ENCODE database. HeteroPath assesses contextual bidirectional gene expression within pathways and thus allows for transcriptomic assessment of cellular heterogeneity. Unraveling tissue-specific heterogeneity of gene expression can lead to a better understanding of the molecular underpinnings of tissue-specific phenotypes.

  13. The Regulation of Cytokine Networks in Hippocampal CA1 Differentiates Extinction from Those Required for the Maintenance of Contextual Fear Memory after Recall

    PubMed Central

    Scholz, Birger; Doidge, Amie N.; Barnes, Philip; Hall, Jeremy; Wilkinson, Lawrence S.; Thomas, Kerrie L.

    2016-01-01

    We investigated the distinctiveness of gene regulatory networks in CA1 associated with the extinction of contextual fear memory (CFM) after recall using Affymetrix GeneChip Rat Genome 230 2.0 Arrays. These data were compared to previously published retrieval and reconsolidation-attributed, and consolidation datasets. A stringent dual normalization and pareto-scaled orthogonal partial least-square discriminant multivariate analysis together with a jack-knifing-based cross-validation approach was used on all datasets to reduce false positives. Consolidation, retrieval and extinction were correlated with distinct patterns of gene expression 2 hours later. Extinction-related gene expression was most distinct from the profile accompanying consolidation. A highly specific feature was the discrete regulation of neuroimmunological gene expression associated with retrieval and extinction. Immunity-associated genes of the tyrosine kinase receptor TGFβ and PDGF, and TNF families characterized extinction. Cytokines and proinflammatory interleukins of the IL-1 and IL-6 families were enriched with the no-extinction retrieval condition. We used comparative genomics to predict transcription factor binding sites in proximal promoter regions of the retrieval-regulated genes. Retrieval that does not lead to extinction was associated with NF-κB-mediated gene expression. We confirmed differential NF-κBp65 expression, and activity in all of a representative sample of our candidate genes in the no-extinction condition. The differential regulation of cytokine networks after the acquisition and retrieval of CFM identifies the important contribution that neuroimmune signalling plays in normal hippocampal function. Further, targeting cytokine signalling upon retrieval offers a therapeutic strategy to promote extinction mechanisms in human disorders characterised by dysregulation of associative memory. PMID:27224427

  14. Genevar: a database and Java application for the analysis and visualization of SNP-gene associations in eQTL studies.

    PubMed

    Yang, Tsun-Po; Beazley, Claude; Montgomery, Stephen B; Dimas, Antigone S; Gutierrez-Arcelus, Maria; Stranger, Barbara E; Deloukas, Panos; Dermitzakis, Emmanouil T

    2010-10-01

    Genevar (GENe Expression VARiation) is a database and Java tool designed to integrate multiple datasets, and provides analysis and visualization of associations between sequence variation and gene expression. Genevar allows researchers to investigate expression quantitative trait loci (eQTL) associations within a gene locus of interest in real time. The database and application can be installed on a standard computer in database mode and, in addition, on a server to share discoveries among affiliations or the broader community over the Internet via web services protocols. http://www.sanger.ac.uk/resources/software/genevar.

  15. A Stromal Immune Module Correlated with the Response to Neoadjuvant Chemotherapy, Prognosis and Lymphocyte Infiltration in HER2-Positive Breast Carcinoma Is Inversely Correlated with Hormonal Pathways

    PubMed Central

    Lae, Marick; Moarii, Matahi; Sadacca, Benjamin; Pinheiro, Alice; Galliot, Marion; Abecassis, Judith; Laurent, Cecile; Reyal, Fabien

    2016-01-01

    Introduction HER2-positive breast cancer (BC) is a heterogeneous group of aggressive breast cancers, the prognosis of which has greatly improved since the introduction of treatments targeting HER2. However, these tumors may display intrinsic or acquired resistance to treatment, and classifiers of HER2-positive tumors are required to improve the prediction of prognosis and to develop novel therapeutic interventions. Methods We analyzed 2893 primary human breast cancer samples from 21 publicly available datasets and developed a six-metagene signature on a training set of 448 HER2-positive BC. We then used external public datasets to assess the ability of these metagenes to predict the response to chemotherapy (Ignatiadis dataset), and prognosis (METABRIC dataset). Results We identified a six-metagene signature (138 genes) containing metagenes enriched in different gene ontologies. The gene clusters were named as follows: Immunity, Tumor suppressors/proliferation, Interferon, Signal transduction, Hormone/survival and Matrix clusters. In all datasets, the Immunity metagene was less strongly expressed in ER-positive than in ER-negative tumors, and was inversely correlated with the Hormonal/survival metagene. Within the signature, multivariate analyses showed that strong expression of the “Immunity” metagene was associated with higher pCR rates after NAC (OR = 3.71[1.28–11.91], p = 0.019) than weak expression, and with a better prognosis in HER2-positive/ER-negative breast cancers (HR = 0.58 [0.36–0.94], p = 0.026). Immunity metagene expression was associated with the presence of tumor-infiltrating lymphocytes (TILs). Conclusion The identification of a predictive and prognostic immune module in HER2-positive BC confirms the need for clinical testing for immune checkpoint modulators and vaccines for this specific subtype. The inverse correlation between Immunity and hormone pathways opens research perspectives and deserves further investigation. PMID:28005906

  16. Cross-platform method for identifying candidate network biomarkers for prostate cancer.

    PubMed

    Jin, G; Zhou, X; Cui, K; Zhang, X-S; Chen, L; Wong, S T C

    2009-11-01

    Discovering biomarkers using mass spectrometry (MS) and microarray expression profiles is a promising strategy in molecular diagnosis. Here, the authors proposed a new pipeline for biomarker discovery that integrates disease information for proteins and genes, expression profiles at both genomic and proteomic levels, and protein-protein interactions (PPIs) to discover high confidence network biomarkers. Using this pipeline, a total of 474 molecules (genes and proteins) related to prostate cancer were identified and a prostate-cancer-related network (PCRN) was derived from the integrative information. Thus, a set of candidate network biomarkers was identified from multiple expression profiles composed of eight microarray datasets and one proteomics dataset. The network biomarkers with PPIs can accurately distinguish prostate cancer patients from normal individuals, potentially providing more reliable biomarker candidates than conventional biomarker discovery methods.

  17. CMIP: a software package capable of reconstructing genome-wide regulatory networks using gene expression data.

    PubMed

    Zheng, Guangyong; Xu, Yaochen; Zhang, Xiujun; Liu, Zhi-Ping; Wang, Zhuo; Chen, Luonan; Zhu, Xin-Guang

    2016-12-23

    A gene regulatory network (GRN) represents interactions of genes inside a cell or tissue, in which vertices and edges stand for genes and their regulatory interactions respectively. Reconstruction of gene regulatory networks, in particular genome-scale networks, is essential for comparative exploration of different species and mechanistic investigation of biological processes. Currently, most network inference methods are computationally intensive; they are usually effective for small-scale tasks (e.g., networks with a few hundred genes) but struggle to construct GRNs at genome scale. Here, we present a software package for gene regulatory network reconstruction at a genomic level, in which gene interaction is measured by the conditional mutual information measurement using a parallel computing framework (so the package is named CMIP). The package is a greatly improved implementation of our previous PCA-CMI algorithm. In CMIP, we provide not only an automatic threshold determination method but also an effective parallel computing framework for network inference. Performance tests on benchmark datasets show that the accuracy of CMIP is comparable to most current network inference methods. Moreover, running tests on synthetic datasets demonstrate that CMIP can handle large datasets, especially genome-wide datasets, within an acceptable time period. In addition, successful application on a real genomic dataset confirms the practical applicability of the package. This new software package provides a powerful tool for genomic network reconstruction to the biological community. The software can be accessed at http://www.picb.ac.cn/CMIP/ .
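
    To make the conditional mutual information (CMI) measure concrete, here is a minimal sketch for a single conditioning gene. It assumes jointly Gaussian expression values, in which case CMI(X;Y|Z) = -0.5 ln(1 - r^2) with r the partial correlation of X and Y given Z. The function names and toy vectors are hypothetical, and this is not CMIP's implementation, which also handles larger conditioning sets, thresholding, and parallel execution.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def gaussian_cmi(x, y, z):
    """CMI(X;Y|Z) in nats for one conditioning variable, assuming joint normality."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    # First-order partial correlation of x and y after removing z.
    partial = (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))
    return -0.5 * math.log(1 - partial ** 2)

# Hypothetical toy data: z drives both x and y; e1 and e2 are mutually orthogonal noise.
z = [1, 1, 1, 1, -1, -1, -1, -1]
e1 = [1, 1, -1, -1, 1, 1, -1, -1]
e2 = [1, -1, 1, -1, 1, -1, 1, -1]
x = [2 * a + b for a, b in zip(z, e1)]
y = [2 * a + b for a, b in zip(z, e2)]       # linked to x only through z
w = [a + b for a, b in zip(x, e2)]           # shares e1 with x beyond z

cmi_indirect = gaussian_cmi(x, y, z)
cmi_direct = gaussian_cmi(x, w, z)
print(cmi_indirect, cmi_direct)
```

    In this toy case x and y are correlated (r = 0.8) yet their CMI given z is essentially zero, so an edge between them would be pruned, whereas x and w keep a positive CMI and the edge would be retained.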

  18. Pattern identification in time-course gene expression data with the CoGAPS matrix factorization.

    PubMed

    Fertig, Elana J; Stein-O'Brien, Genevieve; Jaffe, Andrew; Colantuoni, Carlo

    2014-01-01

    Patterns in time-course gene expression data can represent the biological processes that are active over the measured time period. However, the orthogonality constraint in standard pattern-finding algorithms, including notably principal components analysis (PCA), confounds expression changes resulting from simultaneous, non-orthogonal biological processes. Previously, we have shown that Markov chain Monte Carlo nonnegative matrix factorization algorithms are particularly adept at distinguishing such concurrent patterns. One such matrix factorization is implemented in the software package CoGAPS. We describe the application of this software and several technical considerations for identification of age-related patterns in a public, prefrontal cortex gene expression dataset.

  19. Transcriptome-wide selection of a reliable set of reference genes for gene expression studies in potato cyst nematodes (Globodera spp.).

    PubMed

    Sabeh, Michael; Duceppe, Marc-Olivier; St-Arnaud, Marc; Mimee, Benjamin

    2018-01-01

    Relative gene expression analyses by qRT-PCR (quantitative reverse transcription PCR) require an internal control to normalize the expression data of genes of interest and eliminate the unwanted variation introduced by sample preparation. A perfect reference gene should have a constant expression level under all the experimental conditions. However, the same few housekeeping genes selected from the literature or successfully used in previous unrelated experiments are often routinely used in new conditions without proper validation of their stability across treatments. The advent of RNA-Seq and the availability of public datasets for numerous organisms are opening the way to finding better reference genes for expression studies. Globodera rostochiensis is a plant-parasitic nematode that is particularly yield-limiting for potato. The aim of our study was to identify a reliable set of reference genes to study G. rostochiensis gene expression. Gene expression levels from an RNA-Seq database were used to identify putative reference genes and were validated with qRT-PCR analysis. Three genes, GR, PMP-3, and aaRS, were found to be very stable within the experimental conditions of this study and are proposed as reference genes for future work.
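
    One simple way to screen expression data for stable candidates is to rank genes by their coefficient of variation (CV) across conditions. The sketch below assumes that proxy; the abstract does not state which stability measure the authors used, and the gene names and values are hypothetical rather than taken from the study.

```python
import statistics

def rank_by_stability(expression):
    """Return gene names ordered from most to least stable (lowest CV first)."""
    cv = {
        gene: statistics.stdev(values) / statistics.mean(values)
        for gene, values in expression.items()
    }
    return sorted(cv, key=cv.get)

# Hypothetical normalized expression levels across four conditions.
expression = {
    "geneA": [100.0, 102.0, 98.0, 101.0],
    "geneB": [50.0, 80.0, 20.0, 60.0],
    "geneC": [10.0, 10.5, 9.5, 10.0],
}
ranking = rank_by_stability(expression)
print(ranking)
```

    Genes at the top of such a ranking are only candidates; as the abstract notes, they still need validation by qRT-PCR under the conditions of interest.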

  20. Comprehensive single cell-resolution analysis of the role of chromatin regulators in early C. elegans embryogenesis.

    PubMed

    Krüger, Angela V; Jelier, Rob; Dzyubachyk, Oleh; Zimmerman, Timo; Meijering, Erik; Lehner, Ben

    2015-02-15

    Chromatin regulators are widely expressed proteins with diverse roles in gene expression, nuclear organization, cell cycle regulation, pluripotency, physiology and development, and are frequently mutated in human diseases such as cancer. Their inhibition often results in pleiotropic effects that are difficult to study using conventional approaches. We have developed a semi-automated nuclear tracking algorithm to quantify the divisions, movements and positions of all nuclei during the early development of Caenorhabditis elegans and have used it to systematically study the effects of inhibiting chromatin regulators. The resulting high dimensional datasets revealed that inhibition of multiple regulators, including F55A3.3 (encoding FACT subunit SUPT16H), lin-53 (RBBP4/7), rba-1 (RBBP4/7), set-16 (MLL2/3), hda-1 (HDAC1/2), swsn-7 (ARID2), and let-526 (ARID1A/1B) affected cell cycle progression and caused chromosome segregation defects. In contrast, inhibition of cir-1 (CIR1) accelerated cell division timing in specific cells of the AB lineage. The inhibition of RNA polymerase II also accelerated these division timings, suggesting that normal gene expression is required to delay cell cycle progression in multiple lineages in the early embryo. Quantitative analyses of the dataset suggested the existence of at least two functionally distinct SWI/SNF chromatin remodeling complex activities in the early embryo, and identified a redundant requirement for the egl-27 and lin-40 MTA orthologs in the development of endoderm and mesoderm lineages. Moreover, our dataset also revealed a characteristic rearrangement of chromatin to the nuclear periphery upon the inhibition of multiple general regulators of gene expression. Our systematic, comprehensive and quantitative datasets illustrate the power of single cell-resolution quantitative tracking and high dimensional phenotyping to investigate gene function. 
Furthermore, the results provide an overview of the functions of essential chromatin regulators during the early development of an animal.

  1. Defining the gene expression signature of rhabdomyosarcoma by meta-analysis

    PubMed Central

    Romualdi, Chiara; De Pittà, Cristiano; Tombolan, Lucia; Bortoluzzi, Stefania; Sartori, Francesca; Rosolen, Angelo; Lanfranchi, Gerolamo

    2006-01-01

    Background Rhabdomyosarcoma is a highly malignant soft tissue sarcoma in childhood and arises as a consequence of regulatory disruption of the growth and differentiation pathways of myogenic precursor cells. The pathogenic pathways involved in this tumor are mostly unknown and therefore a better characterization of the RMS gene expression profile would represent a considerable advance. The availability of public gene expression datasets has opened up new challenges, especially for the integration of data generated by different research groups and different array platforms with the purpose of obtaining new insights into the biological process investigated. Results In this work we performed a meta-analysis on four microarray and two SAGE datasets of gene expression data on RMS in order to evaluate the degree of agreement of the biological results obtained by these different studies and to identify common regulatory pathways that could be responsible for tumor growth. Significantly enriched regulatory pathways and biological processes were investigated, and a list of differential meta-profiles was identified as possible candidates for RMS aggressiveness. Conclusion Our results point to a general down regulation of the energy production pathways, suggesting a hypoxic physiology for RMS cells. This result agrees with the high malignancy of RMS and with its resistance to most of the therapeutic treatments. In this context, different isoforms of the ANT gene have been consistently identified for the first time as differentially expressed in RMS. This gene is involved in anti-apoptotic processes when cells grow in low oxygen conditions. These new insights into the biological processes responsible for RMS growth and development demonstrate the effective advantage of the use of integrated analysis of gene expression studies. PMID:17090319

  2. A 15-gene signature for prediction of colon cancer recurrence and prognosis based on SVM.

    PubMed

    Xu, Guangru; Zhang, Minghui; Zhu, Hongxing; Xu, Jinhua

    2017-03-10

    The aim was to screen for a gene signature that distinguishes patients at high risk from those at low risk of colon cancer recurrence and predicts their prognosis. Five microarray datasets of colon cancer samples were collected from the Gene Expression Omnibus database and one was obtained from The Cancer Genome Atlas (TCGA). After preprocessing, data in GSE17537 were analyzed using the Linear Models for Microarray data (LIMMA) method to identify the differentially expressed genes (DEGs). The DEGs further underwent PPI network-based neighborhood scoring and support vector machine (SVM) analyses to screen the feature genes associated with recurrence and prognosis, which were then validated by four datasets GSE38832, GSE17538, GSE28814 and TCGA using SVM and Cox regression analyses. A total of 1207 genes were identified as DEGs between recurrence and no-recurrence samples, including 726 downregulated and 481 upregulated genes. Using SVM analysis and confirmation in five gene expression profile datasets, a 15-gene signature (HES5, ZNF417, GLRA2, OR8D2, HOXA7, FABP6, MUSK, HTR6, GRIP2, KLRK1, VEGFA, AKAP12, RHEB, NCRNA00152 and PMEPA1) was identified as a predictor of recurrence risk and prognosis for colon cancer patients. Our identified 15-gene signature may be useful to classify colon cancer patients with different prognosis and some genes in this signature may represent new therapeutic targets.

  3. Genome-wide analysis of endogenously expressed ZEB2 binding sites reveals inverse correlations between ZEB2 and GalNAc-transferase GALNT3 in human tumors.

    PubMed

    Balcik-Ercin, Pelin; Cetin, Metin; Yalim-Camci, Irem; Odabas, Gorkem; Tokay, Nurettin; Sayan, A Emre; Yagci, Tamer

    2018-03-07

    ZEB2 is a transcriptional repressor that regulates epithelial-to-mesenchymal transition (EMT) through binding to bipartite E-box motifs in gene regulatory regions. Despite the abundant presence of E-boxes within the human genome and the multiplicity of pathophysiological processes regulated during ZEB2-induced EMT, only a small fraction of ZEB2 targets has been identified so far. Hence, we explored genome-wide ZEB2 binding by chromatin immunoprecipitation-sequencing (ChIP-seq) under endogenous ZEB2 expression conditions. For ChIP-Seq we used an anti-ZEB2 monoclonal antibody, clone 6E5, in SNU398 hepatocellular carcinoma cells exhibiting a high endogenous ZEB2 expression. The ChIP-Seq targets were validated using ChIP-qPCR, whereas ZEB2-dependent expression of target genes was assessed by RT-qPCR and Western blotting in shRNA-mediated ZEB2 silenced SNU398 cells and doxycycline-induced ZEB2 overexpressing colorectal carcinoma DLD1 cells. Changes in target gene expression were also assessed using primary human tumor cDNA arrays in conjunction with RT-qPCR. Additional differential expression and correlation analyses were performed using expO and Human Protein Atlas datasets. Over 500 ChIP-Seq positive genes were annotated, and intervals related to these genes were found to include the ZEB2 binding motif CACCTG according to TOMTOM motif analysis in the MEME Suite database. Assessment of ZEB2-dependent expression of target genes in ZEB2-silenced SNU398 cells and ZEB2-induced DLD1 cells revealed that the GALNT3 gene serves as a ZEB2 target with the highest, but inversely correlated, expression level. Remarkably, GALNT3 also exhibited the highest enrichment in the ChIP-qPCR validation assays. Through the analyses of primary tumor cDNA arrays and expO datasets a significant differential expression and a significant inverse correlation between ZEB2 and GALNT3 expression were detected in most of the tumors. 
We also explored ZEB2 and GALNT3 protein expression using the Human Protein Atlas dataset and, again, observed an inverse correlation in all analyzed tumor types, except malignant melanoma. In contrast to a generally negative or weak ZEB2 expression, we found that most tumor tissues exhibited a strong or moderate GALNT3 expression. Our observation that ZEB2 negatively regulates a GalNAc-transferase (GALNT3) that is involved in O-glycosylation adds another layer of complexity to the role of ZEB2 in cancer progression and metastasis. Proteins glycosylated by GALNT3 may be exploited as novel diagnostics and/or therapeutic targets.

  4. mRMR-ABC: A Hybrid Gene Selection Algorithm for Cancer Classification Using Microarray Gene Expression Profiling

    PubMed Central

    Alshamlan, Hala; Badr, Ghada; Alohali, Yousef

    2015-01-01

    An artificial bee colony (ABC) is a relatively recent swarm intelligence optimization approach. In this paper, we propose the first attempt at applying the ABC algorithm to the analysis of a microarray gene expression profile. In addition, we propose an innovative feature selection algorithm, minimum redundancy maximum relevance (mRMR), and combine it with an ABC algorithm, mRMR-ABC, to select informative genes from microarray profiles. The new approach is based on a support vector machine (SVM) algorithm to measure the classification accuracy for selected genes. We evaluate the performance of the proposed mRMR-ABC algorithm by conducting extensive experiments on six binary and multiclass gene expression microarray datasets. Furthermore, we compare our proposed mRMR-ABC algorithm with previously known techniques. We reimplemented two of these techniques for the sake of a fair comparison using the same parameters. These two techniques are mRMR when combined with a genetic algorithm (mRMR-GA) and mRMR when combined with a particle swarm optimization algorithm (mRMR-PSO). The experimental results prove that the proposed mRMR-ABC algorithm achieves accurate classification performance using a small number of predictive genes when tested using both datasets and compared to previously suggested methods. This shows that mRMR-ABC is a promising approach for solving gene selection and cancer classification problems. PMID:25961028
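
    The mRMR filter named in this abstract can be sketched as a greedy search that trades relevance to the class label against redundancy with genes already chosen. The sketch below assumes discretized expression values and the "relevance minus mean redundancy" criterion; the ABC search and the SVM evaluation described by the authors are not shown, and all names and data are hypothetical.

```python
import math
from collections import Counter

def mutual_information(a, b):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * math.log(c * n / (pa[x] * pb[y]))
        for (x, y), c in pab.items()
    )

def mrmr(features, labels, k):
    """Greedily pick k feature names by relevance minus mean redundancy."""
    relevance = {name: mutual_information(col, labels) for name, col in features.items()}
    selected = []
    while len(selected) < k:
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy
        best = max((f for f in features if f not in selected), key=score)
        selected.append(best)
    return selected

# Hypothetical discretized expression (0 = low, 1 = high) and four-class labels.
features = {
    "gene_a": [0, 0, 0, 0, 1, 1, 1, 1],
    "gene_a_dup": [0, 0, 0, 0, 1, 1, 1, 1],  # redundant copy of gene_a
    "gene_b": [0, 0, 1, 1, 0, 0, 1, 1],
}
labels = [0, 0, 1, 1, 2, 2, 3, 3]
selected = mrmr(features, labels, k=2)
print(selected)
```

    All three toy genes are equally relevant on their own, but the duplicate is skipped because it adds nothing beyond gene_a, which is the behavior the redundancy term is meant to produce.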

  6. The Arabidopsis translatome cell-specific mRNA atlas: Mining suberin and cutin lipid monomer biosynthesis genes as an example for data application.

    PubMed

    Mustroph, Angelika; Bailey-Serres, Julia

    2010-03-01

    Plants consist of distinct cell types distinguished by position, morphological features and metabolic activities. We recently developed a method to extract cell-type specific mRNA populations by immunopurification of ribosome-associated mRNAs. Microarray profiles of 21 cell-specific mRNA populations from seedling roots and shoots comprise the Arabidopsis Translatome dataset. This gene expression atlas provides a new tool for the study of cell-specific processes. Here we provide an example of how genes involved in a pathway limited to one or a few cell types can be further characterized and new candidate genes can be predicted. Cells of the root endodermis produce suberin as an inner barrier between the cortex and stele, whereas the shoot epidermal cells form cutin as a barrier to the external environment. Both polymers consist of fatty acid derivatives, and share biosynthetic origins. We use the Arabidopsis Translatome dataset to demonstrate the significant cell-specific expression patterns of genes involved in those biosynthetic processes and suggest new candidate genes in the biosynthesis of suberin and cutin.

  7. Influence of PCOS in Obese vs. Non-Obese women from Mesenchymal Progenitors Stem Cells and Other Endometrial Cells: An in silico biomarker discovery.

    PubMed

    Desai, Ashvini; Madar, Inamul Hasan; Asangani, Amjad Hussain; Ssadh, Hussain Al; Tayubi, Iftikhar Aslam

    2017-01-01

    Polycystic ovary syndrome (PCOS) is an endocrine system disease, affecting women aged 18 to 44, in which hormone levels are imbalanced. Recently it has also been reported to occur at an earlier age. Alteration of normal gene expression in PCOS has been shown to have negative effects on long-term health. PCOS is a cause of infertility in women of reproductive age. Early diagnosis and treatment can improve the health of women suffering from PCOS. Earlier studies show a significant correlation between PCOS and insulin resistance; the current study examines the linkage between PCOS and obesity by comparing obese and non-obese patients. Gene expression datasets (control and PCOS-affected patients) were downloaded from GEO. The datasets were normalized in R using RMA, and differentially expressed genes (DEGs) were selected on the basis of a p-value threshold of 0.05, followed by functional annotation of the selected genes using Enrichr and DAVID. The DEGs were significantly related to PCOS with obesity and to other risk factors involved in the disease. The gene enrichment analysis suggests alteration of genes and associated pathways in obesity. The current study provides groundwork for identifying specific biomarkers for the accurate diagnosis of PCOS and efficient targets for its treatment.

  8. A Systems Toxicology Approach Reveals Biological Pathways Dysregulated by Prenatal Arsenic Exposure

    PubMed Central

    Laine, Jessica E.; Fry, Rebecca C.

    2016-01-01

    BACKGROUND Prenatal exposure to inorganic arsenic (iAs) is associated with dysregulated gene and protein expression in the fetus, both evident at birth. Potential epigenetic mechanisms that underlie these changes include but are not limited to the methylation of cytosines (CpG). OBJECTIVE The aim of the present study was to compile datasets from studies on prenatal arsenic exposure to identify whether key genes, proteins, or both and their associated biological pathways are perturbed. METHODS We compiled datasets from 12 studies that analyzed the relationship between prenatal iAs exposure and fetal changes to the epigenome (5-methyl cytosine), transcriptome (mRNA expression), and/or proteome (protein expression changes). FINDINGS Across the 12 studies, a set of 845 unique genes was identified and found to enrich for their role in biological pathways, including those signaled by peroxisome proliferator-activated receptor, nuclear factor of kappa light polypeptide gene enhancer in B-cells inhibitor, and the glucocorticoid receptor. Tumor necrosis factor was identified as a putative cellular regulator underlying most (n = 277) of the identified iAs-associated genes or proteins. CONCLUSIONS Given their common identification across numerous human cohorts and their known toxicologic role in disease, the identified genes and pathways may underlie altered disease susceptibility associated with prenatal exposure to iAs. PMID:27325076

  9. A Leveraged Signal-to-Noise Ratio (LSTNR) Method to Extract Differentially Expressed Genes and Multivariate Patterns of Expression From Noisy and Low-Replication RNAseq Data

    PubMed Central

    Lozoya, Oswaldo A.; Santos, Janine H.; Woychik, Richard P.

    2018-01-01

    To life scientists, one important feature offered by RNAseq, a next-generation sequencing tool used to estimate changes in gene expression levels, lies in its unprecedented resolution. It can score countable differences in transcript numbers among thousands of genes and between experimental groups, all at once. However, its high cost limits experimental designs to very small sample sizes, usually N = 3, which often results in statistically underpowered analysis and poor reproducibility. All these issues are compounded by the presence of experimental noise, which is harder to distinguish from instrumental error when sample sizes are limiting (e.g., small-budget pilot tests), experimental populations exhibit biologically heterogeneous or diffuse expression phenotypes (e.g., patient samples), or when discriminating among transcriptional signatures of closely related experimental conditions (e.g., toxicological modes of action, or MOAs). Here, we present a leveraged signal-to-noise ratio (LSTNR) thresholding method, founded on generalized linear modeling (GLM) of aligned read detection limits to extract differentially expressed genes (DEGs) from noisy low-replication RNAseq data. The LSTNR method uses an agnostic independent filtering strategy to define the dynamic range of detected aggregate read counts per gene, and assigns statistical weights that prioritize genes with better sequencing resolution in differential expression analyses. 
    To assess its performance, we implemented the LSTNR method to analyze three separate datasets: first, using a systematically noisy in silico dataset, we demonstrated that LSTNR can extract pre-designed patterns of expression and discriminate between “noise” and “true” differentially expressed pseudogenes at a 100% success rate; then, we illustrated how the LSTNR method can assign patient-derived breast cancer specimens correctly to one out of their four reported molecular subtypes (luminal A, luminal B, Her2-enriched and basal-like); and last, we showed the ability to retrieve five different modes of action (MOA) elicited in livers of rats exposed to three toxicants under three nutritional routes by using the LSTNR method. By combining differential measurements with resolving power to detect DEGs, the LSTNR method offers an alternative approach to interrogate noisy and low-replication RNAseq datasets, which handles multiple biological conditions at once, and defines benchmarks to validate RNAseq experiments with standard benchtop assays. PMID:29868123

  10. Identification of Common Differentially Expressed Genes in Urinary Bladder Cancer

    PubMed Central

    Zaravinos, Apostolos; Lambrou, George I.; Boulalas, Ioannis; Delakas, Dimitris; Spandidos, Demetrios A.

    2011-01-01

    Background Current diagnosis and treatment of urinary bladder cancer (BC) has shown great progress with the utilization of microarrays. Purpose Our goal was to identify common differentially expressed (DE) genes among clinically relevant subclasses of BC using microarrays. Methodology/Principal Findings BC samples and controls, both experimental and publicly available datasets, were analyzed by whole genome microarrays. We grouped the samples according to their histology and defined the DE genes in each sample individually, as well as in each tumor group. A dual analysis strategy was followed. First, experimental samples were analyzed and conclusions were formulated; and second, experimental sets were combined with publicly available microarray datasets and were further analyzed in search of common DE genes. The experimental dataset identified 831 genes that were DE in all tumor samples, simultaneously. Moreover, 33 genes were up-regulated and 85 genes were down-regulated in all 10 BC samples compared to the 5 normal tissues, simultaneously. Hierarchical clustering partitioned tumor groups in accordance to their histology. K-means clustering of all genes and all samples, as well as clustering of tumor groups, presented 49 clusters. K-means clustering of common DE genes in all samples revealed 24 clusters. Genes manifested various differential patterns of expression, based on PCA. YY1 and NFκB were among the most common transcription factors that regulated the expression of the identified DE genes. Chromosome 1 contained 32 DE genes, followed by chromosomes 2 and 11, which contained 25 and 23 DE genes, respectively. Chromosome 21 had the least number of DE genes. 
    GO analysis revealed the prevalence of transport and binding genes in the common down-regulated DE genes; the prevalence of RNA metabolism and processing genes in the up-regulated DE genes; as well as the prevalence of genes responsible for cell communication and signal transduction in the DE genes that were down-regulated in T1-Grade III tumors and up-regulated in T2/T3-Grade III tumors. Combination of samples from all microarray platforms revealed 17 common DE genes (BMP4, CRYGD, DBH, GJB1, KRT83, MPZ, NHLH1, TACR3, ACTC1, MFAP4, SPARCL1, TAGLN, TPM2, CDC20, LHCGR, TM9SF1 and HCCS), 4 of which participate in numerous pathways. Conclusions/Significance The identification of the common DE genes among BC samples of different histology can provide further insight into the discovery of new putative markers. PMID:21483740

  11. RNA-seq based transcriptomic map reveals new insights into mouse salivary gland development and maturation.

    PubMed

    Gluck, Christian; Min, Sangwon; Oyelakin, Akinsola; Smalley, Kirsten; Sinha, Satrajit; Romano, Rose-Anne

    2016-11-16

    Mouse models have served a valuable role in deciphering various facets of Salivary Gland (SG) biology, from normal developmental programs to diseased states. To facilitate such studies, gene expression profiling maps have been generated for various stages of SG organogenesis. However, these prior studies fall short of capturing the transcriptional complexity due to the limited scope of gene-centric microarray-based technology. Compared to microarray, RNA-sequencing (RNA-seq) offers unbiased detection of novel transcripts, broader dynamic range and high specificity and sensitivity for detection of genes, transcripts, and differential gene expression. Although RNA-seq data, particularly under the auspices of the ENCODE project, have covered a large number of biological specimens, studies on the SG have been lacking. To better appreciate the wide spectrum of gene expression profiles, we isolated RNA from mouse submandibular salivary glands at different embryonic and adult stages. In parallel, we processed RNA-seq data for 24 organs and tissues obtained from the mouse ENCODE consortium and calculated the average gene expression values. To identify molecular players and pathways likely to be relevant for SG biology, we performed functional gene enrichment analysis, network construction and hierarchical clustering of the RNA-seq datasets obtained from different stages of SG development and maturation, and other mouse organs and tissues. Our bioinformatics-based data analysis not only reaffirmed known modulators of SG morphogenesis but revealed novel transcription factors and signaling pathways unique to mouse SG biology and function. Finally, we demonstrated that the unique SG gene signature obtained from our mouse studies is also well conserved and can demarcate features of the human SG transcriptome that are different from other tissues. Our RNA-seq based Atlas has revealed a high-resolution cartographic view of the dynamic transcriptomic landscape of the mouse SG at various stages. These RNA-seq datasets will complement pre-existing microarray based datasets, including the Salivary Gland Molecular Anatomy Project, by offering a broader systems-biology based perspective rather than the classical gene-centric view. Ultimately, such resources will be valuable in providing a useful toolkit to better understand how the diverse cell populations of the SG are organized and controlled during development and differentiation.

  12. A novel strategy of integrated microarray analysis identifies CENPA, CDK1 and CDC20 as a cluster of diagnostic biomarkers in lung adenocarcinoma.

    PubMed

    Liu, Wan-Ting; Wang, Yang; Zhang, Jing; Ye, Fei; Huang, Xiao-Hui; Li, Bin; He, Qing-Yu

    2018-07-01

    Lung adenocarcinoma (LAC) is the most lethal cancer and the leading cause of cancer-related death worldwide. The identification of meaningful clusters of co-expressed genes or representative biomarkers may help improve the accuracy of LAC diagnoses. Public databases, such as the Gene Expression Omnibus (GEO), provide rich resources of valuable information for clinics; however, the integration of multiple microarray datasets from various platforms and institutes remains a challenge. To determine potential indicators of LAC, we performed genome-wide relative significance (GWRS), genome-wide global significance (GWGS) and support vector machine (SVM) analyses progressively to identify robust gene biomarker signatures from 5 different microarray datasets that included 330 samples. The top 200 genes with robust signatures were selected for integrative analysis according to "guilt-by-association" methods, including protein-protein interaction (PPI) analysis and gene co-expression analysis. Of these 200 genes, only 10 genes showed both intensive PPI network and high gene co-expression correlation (r > 0.8). IPA analysis of these regulatory networks suggested that the cell cycle process is a crucial determinant of LAC. CENPA, as well as two linked hub genes CDK1 and CDC20, were determined to be potential indicators of LAC. Immunohistochemical staining showed that CENPA, CDK1 and CDC20 were highly expressed in LAC cancer tissue with co-expression patterns. A Cox regression model indicated that LAC patients with CENPA+/CDK1+ and CENPA+/CDC20+ were high-risk groups in terms of overall survival. In conclusion, our integrated microarray analysis demonstrated that CENPA, CDK1 and CDC20 might serve as a novel cluster of prognostic biomarkers for LAC, and the cooperative unit of three genes provides a technically simple approach for identification of LAC patients. Copyright © 2018 Elsevier B.V. All rights reserved.
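
    The co-expression filter mentioned in this abstract (gene pairs with r > 0.8) can be sketched as follows. This is a minimal illustration and not the authors' pipeline: it assumes Pearson correlation, and the gene labels, expression values, and the helper names pearson and coexpressed_pairs are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (assumes non-zero variance)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def coexpressed_pairs(expr, threshold=0.8):
    """Return (gene1, gene2, r) for every pair whose correlation exceeds threshold."""
    pairs = []
    for g1, g2 in combinations(sorted(expr), 2):
        r = pearson(expr[g1], expr[g2])
        if r > threshold:
            pairs.append((g1, g2, round(r, 3)))
    return pairs


# Hypothetical expression profiles across five samples (illustrative values only).
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.0, 4.0, 6.0, 8.0, 10.5],
    "geneC": [5.0, 3.0, 4.0, 1.0, 2.0],
}
pairs = coexpressed_pairs(expr)
print(pairs)
```

    Only positively correlated pairs pass this filter; a study that also wants strongly anti-correlated pairs would compare abs(r) against the threshold instead.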

  13. Identification of diagnostic markers in colorectal cancer via integrative epigenomics and genomics data

    PubMed Central

    KOK-SIN, TEOW; MOKHTAR, NORFILZA MOHD; HASSAN, NUR ZARINA ALI; SAGAP, ISMAIL; ROSE, ISA MOHAMED; HARUN, ROSLAN; JAMAL, RAHMAN

    2015-01-01

    Apart from genetic mutations, epigenetic alteration is a common phenomenon that contributes to neoplastic transformation in colorectal cancer. Transcriptional silencing of tumor-suppressor genes without changes in the DNA sequence is explained by the existence of promoter hypermethylation. To test this hypothesis, we integrated the epigenome and transcriptome data from a similar set of colorectal tissue samples. Methylation profiling was performed using the Illumina InfiniumHumanMethylation27 BeadChip on 55 paired cancer and adjacent normal epithelial cells. Fifteen of the 55 paired tissues were used for gene expression profiling using the Affymetrix GeneChip Human Gene 1.0 ST array. Validation was carried out on 150 colorectal tissues using the methylation-specific multiplex ligation-dependent probe amplification (MS-MLPA) technique. PCA and supervised hierarchical clustering in the two microarray datasets showed good separation between cancer and normal samples. Significant genes from the two analyses were obtained based on a ≥2-fold change and a false discovery rate (FDR) P-value of <0.05. We identified 1,081 differentially hypermethylated CpG sites and 36 hypomethylated CpG sites. We also found 709 upregulated and 699 downregulated genes from the gene expression profiling. A comparison of the two datasets revealed 32 overlapping genes with 27 being hypermethylated with downregulated expression and 4 hypermethylated with upregulated expression. One gene was found to be hypomethylated and downregulated. The most enriched molecular pathway identified was cell adhesion molecules that involved 4 overlapped genes, JAM2, NCAM1, ITGA8 and CNTN1. In the present study, we successfully identified a group of genes that showed methylation and gene expression changes in well-defined colorectal cancer tissues with high purity. 
    The integrated analysis gives additional insight regarding the regulation of colorectal cancer-associated genes and their underlying mechanisms that contribute to colorectal carcinogenesis. PMID:25997610
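
    The selection rule quoted in this abstract (a ≥2-fold change together with an FDR-adjusted P-value < 0.05) can be sketched as below. The abstract does not name the FDR procedure, so the Benjamini-Hochberg adjustment used here is an assumption, as are the linear-scale group means, the toy records, and the function names.

```python
import math


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


def select_genes(records, min_fold=2.0, max_fdr=0.05):
    """records: (gene, mean_cancer, mean_normal, p_value) with linear-scale means."""
    adjusted = benjamini_hochberg([r[3] for r in records])
    selected = []
    for (gene, cancer, normal, _), q in zip(records, adjusted):
        log2_fc = math.log2(cancer / normal)
        if abs(log2_fc) >= math.log2(min_fold) and q < max_fdr:
            selected.append((gene, "up" if log2_fc > 0 else "down"))
    return selected


# Hypothetical records (illustrative values only, not data from the study).
records = [
    ("G1", 8.0, 2.0, 0.005),
    ("G2", 1.0, 4.0, 0.01),
    ("G3", 3.0, 2.0, 0.03),
    ("G4", 10.0, 2.0, 0.5),
]
selected = select_genes(records)
print(selected)
```

    If the expression values are already log2-transformed, as is common for microarray data, the fold change becomes a difference of means rather than a ratio.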

  14. Gene Network Construction from Microarray Data Identifies a Key Network Module and Several Candidate Hub Genes in Age-Associated Spatial Learning Impairment

    PubMed Central

    Uddin, Raihan; Singh, Shiva M.

    2017-01-01

    As humans age many suffer from a decrease in normal brain functions including spatial learning impairments. This study aimed to better understand the molecular mechanisms in age-associated spatial learning impairment (ASLI). We used a mathematical modeling approach implemented in Weighted Gene Co-expression Network Analysis (WGCNA) to create and compare gene network models of young (learning unimpaired) and aged (predominantly learning impaired) brains from a set of exploratory datasets in rats in the context of ASLI. The major goal was to overcome some of the limitations previously observed in the traditional meta- and pathway analysis using these data, and identify novel ASLI related genes and their networks based on co-expression relationship of genes. This analysis identified a set of network modules in the young, each of which is highly enriched with genes functioning in broad but distinct GO functional categories or biological pathways. Interestingly, the analysis pointed to a single module that was highly enriched with genes functioning in “learning and memory” related functions and pathways. Subsequent differential network analysis of this “learning and memory” module in the aged (predominantly learning impaired) rats compared to the young learning unimpaired rats allowed us to identify a set of novel ASLI candidate hub genes. Some of these genes show significant repeatability in networks generated from independent young and aged validation datasets. These hub genes are highly co-expressed with other genes in the network, which not only show differential expression but also differential co-expression and differential connectivity across age and learning impairment. The known function of these hub genes indicate that they play key roles in critical pathways, including kinase and phosphatase signaling, in functions related to various ion channels, and in maintaining neuronal integrity relating to synaptic plasticity and memory formation. 
    Taken together, they provide new insight and generate new hypotheses into the molecular mechanisms responsible for age-associated learning impairment, including spatial learning. PMID:29066959

  15. Gene Network Construction from Microarray Data Identifies a Key Network Module and Several Candidate Hub Genes in Age-Associated Spatial Learning Impairment.

    PubMed

    Uddin, Raihan; Singh, Shiva M

    2017-01-01

    As humans age many suffer from a decrease in normal brain functions including spatial learning impairments. This study aimed to better understand the molecular mechanisms in age-associated spatial learning impairment (ASLI). We used a mathematical modeling approach implemented in Weighted Gene Co-expression Network Analysis (WGCNA) to create and compare gene network models of young (learning unimpaired) and aged (predominantly learning impaired) brains from a set of exploratory datasets in rats in the context of ASLI. The major goal was to overcome some of the limitations previously observed in the traditional meta- and pathway analysis using these data, and identify novel ASLI related genes and their networks based on co-expression relationship of genes. This analysis identified a set of network modules in the young, each of which is highly enriched with genes functioning in broad but distinct GO functional categories or biological pathways. Interestingly, the analysis pointed to a single module that was highly enriched with genes functioning in "learning and memory" related functions and pathways. Subsequent differential network analysis of this "learning and memory" module in the aged (predominantly learning impaired) rats compared to the young learning unimpaired rats allowed us to identify a set of novel ASLI candidate hub genes. Some of these genes show significant repeatability in networks generated from independent young and aged validation datasets. These hub genes are highly co-expressed with other genes in the network, which not only show differential expression but also differential co-expression and differential connectivity across age and learning impairment. The known function of these hub genes indicate that they play key roles in critical pathways, including kinase and phosphatase signaling, in functions related to various ion channels, and in maintaining neuronal integrity relating to synaptic plasticity and memory formation. 
    Taken together, they provide new insight and generate new hypotheses into the molecular mechanisms responsible for age-associated learning impairment, including spatial learning.

  16. Gene expression profiles of breast biopsies from healthy women identify a group with claudin-low features.

    PubMed

    Haakensen, Vilde D; Lingjaerde, Ole Christian; Lüders, Torben; Riis, Margit; Prat, Aleix; Troester, Melissa A; Holmen, Marit M; Frantzen, Jan Ole; Romundstad, Linda; Navjord, Dina; Bukholm, Ida K; Johannesen, Tom B; Perou, Charles M; Ursin, Giske; Kristensen, Vessela N; Børresen-Dale, Anne-Lise; Helland, Aslaug

    2011-11-01

    Increased understanding of the variability in normal breast biology will enable us to identify mechanisms of breast cancer initiation and the origin of different subtypes, and to better predict breast cancer risk. Gene expression patterns in breast biopsies from 79 healthy women referred to breast diagnostic centers in Norway were explored by unsupervised hierarchical clustering and supervised analyses, such as gene set enrichment analysis and gene ontology analysis and comparison with previously published genelists and independent datasets. Unsupervised hierarchical clustering identified two separate clusters of normal breast tissue based on gene-expression profiling, regardless of clustering algorithm and gene filtering used. Comparison of the expression profile of the two clusters with several published gene lists describing breast cells revealed that the samples in cluster 1 share characteristics with stromal cells and stem cells, and to a certain degree with mesenchymal cells and myoepithelial cells. The samples in cluster 1 also share many features with the newly identified claudin-low breast cancer intrinsic subtype, which also shows characteristics of stromal and stem cells. More women belonging to cluster 1 have a family history of breast cancer and there is a slight overrepresentation of nulliparous women in cluster 1. Similar findings were seen in a separate dataset consisting of histologically normal tissue from both breasts harboring breast cancer and from mammoplasty reductions. This is the first study to explore the variability of gene expression patterns in whole biopsies from normal breasts and identified distinct subtypes of normal breast tissue. Further studies are needed to determine the specific cell contribution to the variation in the biology of normal breasts, how the clusters identified relate to breast cancer risk and their possible link to the origin of the different molecular subtypes of breast cancer.

  17. Evolutionary Approach for Relative Gene Expression Algorithms

    PubMed Central

    Czajkowski, Marcin

    2014-01-01

    Relative Expression Analysis (RXA) uses ordering relationships in a small collection of genes and has been successfully applied to classification using microarray data. As checking all possible subsets of genes is computationally infeasible, the RXA algorithms require feature selection and multiple restrictive assumptions. Our main contribution is a specialized evolutionary algorithm (EA) for top-scoring pairs called EvoTSP, which allows finding more advanced gene relations. We managed to unify the major variants of relative expression algorithms through EA and introduce weights to the top-scoring pairs. Experimental validation of EvoTSP on publicly available microarray datasets showed that the proposed solution significantly outperforms other relative expression algorithms in terms of accuracy and allows exploring a much larger solution space. PMID:24790574
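
    For orientation, the classic top-scoring pair idea that this abstract builds on can be sketched as below. This is the plain exhaustive pair search, not the evolutionary EvoTSP algorithm: a pair of genes (i, j) is scored by how differently the ordering "gene i < gene j" occurs in the two classes. The sample values and function names are hypothetical.

```python
from itertools import combinations


def tsp_score(samples, labels, i, j):
    """|P(x_i < x_j | class A) - P(x_i < x_j | class B)| for a two-class problem."""
    def ordering_frequency(cls):
        rows = [s for s, label in zip(samples, labels) if label == cls]
        return sum(1 for s in rows if s[i] < s[j]) / len(rows)

    class_a, class_b = sorted(set(labels))
    return abs(ordering_frequency(class_a) - ordering_frequency(class_b))


def top_scoring_pair(samples, labels):
    """Exhaustively search all gene pairs; feasible only for small gene sets."""
    n_genes = len(samples[0])
    return max(
        combinations(range(n_genes), 2),
        key=lambda pair: tsp_score(samples, labels, *pair),
    )


# Hypothetical data: rows are samples, columns are three genes.
samples = [
    [1, 5, 3], [2, 6, 1], [1, 4, 9],   # class 0: gene 0 < gene 1
    [7, 2, 3], [8, 3, 9], [6, 1, 0],   # class 1: gene 0 > gene 1
]
labels = [0, 0, 0, 1, 1, 1]
best = top_scoring_pair(samples, labels)
print(best, tsp_score(samples, labels, *best))
```

    The number of pairs grows quadratically with the number of genes, which is the kind of search cost that motivates heuristic search strategies such as the evolutionary approach described above.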

  18. Functional networks inference from rule-based machine learning models.

    PubMed

    Lazzarini, Nicola; Widera, Paweł; Williamson, Stuart; Heer, Rakesh; Krasnogor, Natalio; Bacardit, Jaume

    2016-01-01

    Functional networks play an important role in the analysis of biological processes and systems. The inference of these networks from high-throughput (-omics) data is an area of intense research. So far, the similarity-based inference paradigm (e.g. gene co-expression) has been the most popular approach. It assumes a functional relationship between genes which are expressed at similar levels across different samples. An alternative to this paradigm is the inference of relationships from the structure of machine learning models. These models are able to capture complex relationships between variables that are often different from, or complementary to, those captured by similarity-based methods. We propose a protocol to infer functional networks from machine learning models, called FuNeL. It assumes that genes used together within a rule-based machine learning model to classify the samples might also be functionally related at a biological level. The protocol is first tested on synthetic datasets and then evaluated on a test suite of 8 real-world datasets related to human cancer. The networks inferred from the real-world data are compared against gene co-expression networks of equal size, generated with 3 different methods. The comparison is performed from two different points of view. We analyse the enriched biological terms in the set of network nodes and the relationships between known disease-associated genes in the context of the network topology. The comparison confirms both the biological relevance and the complementary character of the knowledge captured by the FuNeL networks in relation to similarity-based methods and demonstrates its potential to identify known disease associations as core elements of the network. Finally, using a prostate cancer dataset as a case study, we confirm that the biological knowledge captured by our method is relevant to the disease and consistent with the specialised literature and with an independent dataset not used in the inference process. The implementation of our network inference protocol is available at: http://ico2s.org/software/funel.html.
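
    The central assumption stated in this abstract, that genes used together within one classification rule may be functionally related, can be illustrated with a small co-occurrence count. This sketch is not the FuNeL implementation: the rule representation (each rule reduced to the list of genes it tests), the gene labels, and the function name are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_network(rules):
    """Count how often each pair of genes appears together in the same rule."""
    edges = Counter()
    for genes in rules:
        # Deduplicate and sort so each undirected edge has one canonical key.
        for a, b in combinations(sorted(set(genes)), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical rule set: each rule is reduced to the genes it uses.
rules = [
    ["g1", "g2"],
    ["g2", "g1", "g3"],
    ["g4"],
]
network = cooccurrence_network(rules)
print(dict(network))
```

    Edge counts can then serve as weights, so that pairs reused across many rules stand out from pairs that co-occur only once.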

  19. Configurable pattern-based evolutionary biclustering of gene expression data

    PubMed Central

    2013-01-01

    Background Biclustering algorithms for microarray data aim at discovering functionally related gene sets under different subsets of experimental conditions. Due to the problem complexity and the characteristics of microarray datasets, heuristic searches are usually used instead of exhaustive algorithms. Also, the comparison among different techniques is still a challenge. The obtained results vary in relevant features such as the number of genes or conditions, which makes it difficult to carry out a fair comparison. Moreover, existing approaches do not allow the user to specify any preferences on these properties. Results Here, we present the first biclustering algorithm in which it is possible to customize several bicluster features in terms of different objectives. This can be done by tuning the specified features in the algorithm or also by incorporating new objectives into the search. Furthermore, our approach bases the bicluster evaluation on the use of expression patterns, being able to recognize both shifting and scaling patterns either simultaneously or not. Evolutionary computation has been chosen as the search strategy, hence the name of our proposal, Evo-Bexpa (Evolutionary Biclustering based in Expression Patterns). Conclusions We have conducted experiments on both synthetic and real datasets demonstrating Evo-Bexpa's ability to obtain meaningful biclusters. Synthetic experiments have been designed in order to compare Evo-Bexpa performance with other approaches when looking for perfect patterns. Experiments with four different real datasets also confirm the proper performance of our algorithm, whose results have been biologically validated through Gene Ontology. PMID:23433178

  20. Bi-Force: large-scale bicluster editing and its application to gene expression data biclustering.

    PubMed

    Sun, Peng; Speicher, Nora K; Röttger, Richard; Guo, Jiong; Baumbach, Jan

    2014-05-01

    The explosion of biological data has dramatically reformed today's biological research. The need to integrate and analyze high-dimensional biological data on a large scale is driving the development of novel bioinformatics approaches. Biclustering, also known as 'simultaneous clustering' or 'co-clustering', has been successfully utilized to discover local patterns in gene expression data and similar biomedical data types. Here, we contribute a new heuristic: 'Bi-Force'. It is based on the weighted bicluster editing model, to perform biclustering on arbitrary sets of biological entities, given any kind of pairwise similarities. We first evaluated the power of Bi-Force to solve dedicated bicluster editing problems by comparing Bi-Force with two existing algorithms in the BiCluE software package. We then followed a biclustering evaluation protocol in a recent review paper from Eren et al. (2013) (A comparative analysis of biclustering algorithms for gene expression data. Brief. Bioinform., 14:279-292.) and compared Bi-Force against eight existing tools: FABIA, QUBIC, Cheng and Church, Plaid, BiMax, Spectral, xMOTIFs and ISA. To this end, a suite of synthetic datasets as well as nine large gene expression datasets from Gene Expression Omnibus were analyzed. All resulting biclusters were subsequently investigated by Gene Ontology enrichment analysis to evaluate their biological relevance. The distinct theoretical foundation of Bi-Force (bicluster editing) is more powerful than strict biclustering. We thus outperformed existing tools with Bi-Force at least when following the evaluation protocols from Eren et al. Bi-Force is implemented in Java and integrated into the open source software package of BiCluE. The software as well as all used datasets are publicly available at http://biclue.mpi-inf.mpg.de. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.

  1. GESearch: An Interactive GUI Tool for Identifying Gene Expression Signature.

    PubMed

    Ye, Ning; Yin, Hengfu; Liu, Jingjing; Dai, Xiaogang; Yin, Tongming

    2015-01-01

    The huge amount of gene expression data generated by microarray and next-generation sequencing technologies presents challenges for exploiting its biological meaning. When searching for coexpressed genes, the data mining process is largely affected by the selection of algorithms. Thus, it is highly desirable to provide multiple algorithm options in a user-friendly analytical toolkit for exploring gene expression signatures. For this purpose, we developed GESearch, an interactive graphical user interface (GUI) toolkit, which is written in MATLAB and supports a variety of gene expression data files. This analytical toolkit provides four models, including the mean, the regression, the delegate, and the ensemble models, to identify coexpressed genes, and enables users to filter data and to select gene expression patterns by browsing the display window or by importing knowledge-based genes. Subsequently, the utility of this analytical toolkit is demonstrated by analyzing two real-life microarray datasets from cell-cycle experiments. Overall, we have developed an interactive GUI toolkit that allows for choosing multiple algorithms for analyzing gene expression signatures.

  2. Ion channel gene expression predicts survival in glioma patients

    PubMed Central

    Wang, Rong; Gurguis, Christopher I.; Gu, Wanjun; Ko, Eun A; Lim, Inja; Bang, Hyoweon; Zhou, Tong; Ko, Jae-Hong

    2015-01-01

    Ion channels are important regulators in cell proliferation, migration, and apoptosis. The malfunction and/or aberrant expression of ion channels may disrupt these important biological processes and influence cancer progression. In this study, we investigate the expression pattern of ion channel genes in glioma. We designate 18 ion channel genes that are differentially expressed in high-grade glioma as a prognostic molecular signature. This ion channel gene expression based signature predicts glioma outcome in three independent validation cohorts. Interestingly, 16 of these 18 genes were down-regulated in high-grade glioma. This signature is independent of traditional clinical, molecular, and histological factors. Resampling tests indicate that the prognostic power of the signature outperforms random gene sets selected from the human genome in all the validation cohorts. More importantly, this signature performs better than the random gene signatures selected from glioma-associated genes in two out of three validation datasets. This study implicates ion channels in brain cancer, thus expanding on knowledge of their roles in other cancers. Individualized profiling of ion channel gene expression serves as a superior and independent prognostic tool for glioma patients. PMID:26235283

  3. Building gene co-expression networks using transcriptomics data for systems biology investigations: Comparison of methods using microarray data

    PubMed Central

    Kadarmideen, Haja N; Watson-Haigh, Nathan S

    2012-01-01

    Gene co-expression networks (GCN), built using high-throughput gene expression data, are fundamental aspects of systems biology. The main aim of this study was to compare two popular approaches to building and analysing GCN. We use real ovine microarray transcriptomics datasets representing four different treatments with Metyrapone, an inhibitor of cortisol biosynthesis. We conducted several microarray quality control checks before applying GCN methods to filtered datasets. Then we compared the outputs of two methods using connectivity as a criterion, as it measures how well a node (gene) is connected within a network. The two GCN construction methods used were Weighted Gene Co-expression Network Analysis (WGCNA) and Partial Correlation and Information Theory (PCIT). Nodes were ranked based on their connectivity measures in each of the four different networks created by WGCNA and PCIT, and node ranks in the two methods were compared to identify those nodes which are highly differentially ranked (HDR). A total of 1,017 HDR nodes were identified across one or more of four networks. We investigated HDR nodes by gene enrichment analyses in relation to their biological relevance to phenotypes. We observed that, in contrast to the WGCNA method, the PCIT algorithm removes many of the edges of the most highly interconnected nodes. Removal of edges of the most highly connected nodes or hub genes will have consequences for downstream analyses and biological interpretations. In general, for large GCN construction (with > 20000 genes), access to large computer clusters, particularly those with larger amounts of shared memory, is recommended. PMID:23144540
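
    The comparison described in this abstract (rank nodes by connectivity in each network, then flag nodes whose ranks differ strongly between methods) can be sketched as follows. The abstract does not give the exact connectivity measure or the cut-off that defines a highly differentially ranked node, so the summed edge weight, the min_rank_gap parameter, the toy networks, and all function names are assumptions made for illustration.

```python
def connectivity(adjacency):
    """Connectivity of each node, taken here as the sum of its edge weights."""
    return {node: sum(neighbours.values()) for node, neighbours in adjacency.items()}


def ranks(scores):
    """Rank 1 is the most connected node; ties are broken by node name."""
    ordered = sorted(scores, key=lambda node: (-scores[node], node))
    return {node: rank for rank, node in enumerate(ordered, start=1)}


def differentially_ranked(adj_a, adj_b, min_rank_gap):
    """Nodes whose connectivity rank differs by at least min_rank_gap.

    Assumes both networks contain the same set of nodes.
    """
    rank_a = ranks(connectivity(adj_a))
    rank_b = ranks(connectivity(adj_b))
    return sorted(n for n in rank_a if abs(rank_a[n] - rank_b[n]) >= min_rank_gap)


# Hypothetical weighted networks over the same four genes.
network_a = {
    "g1": {"g2": 0.9, "g3": 0.8, "g4": 0.7},
    "g2": {"g1": 0.9},
    "g3": {"g1": 0.8},
    "g4": {"g1": 0.7},
}
# Second network in which most edges of the hub g1 have been removed.
network_b = {
    "g1": {"g4": 0.1},
    "g2": {"g3": 0.9},
    "g3": {"g2": 0.9, "g4": 0.5},
    "g4": {"g3": 0.5, "g1": 0.1},
}
hdr = differentially_ranked(network_a, network_b, min_rank_gap=2)
print(hdr)
```

    In the toy data the hub of the first network drops to the last rank in the second, which mirrors the abstract's observation that pruning hub edges changes which genes look most connected.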

  4. Network Analysis of Rodent Transcriptomes in Spaceflight

    NASA Technical Reports Server (NTRS)

    Ramachandran, Maya; Fogle, Homer; Costes, Sylvain

    2017-01-01

    Network analysis methods leverage prior knowledge of cellular systems and the statistical and conceptual relationships between analyte measurements to determine gene connectivity. Correlation and conditional metrics are used to infer a network topology and provide a systems-level context for cellular responses. Integration across multiple experimental conditions and omics domains can reveal the regulatory mechanisms that underlie gene expression. GeneLab has assembled rich multi-omic (transcriptomics, proteomics, epigenomics, and epitranscriptomics) datasets for multiple murine tissues from the Rodent Research 1 (RR-1) experiment. RR-1 assesses the impact of 37 days of spaceflight on gene expression across a variety of tissue types, such as adrenal glands, quadriceps, gastrocnemius, tibialis anterior, extensor digitorum longus, soleus, eye, and kidney. Network analysis is particularly useful for RR-1 -omics datasets because it reinforces subtle relationships that may be overlooked in isolated analyses and subdues confounding factors. Our objective is to use network analysis to determine potential target nodes for therapeutic intervention and identify similarities with existing disease models. Multiple network algorithms are used for a higher confidence consensus.

  5. MARQ: an online tool to mine GEO for experiments with similar or opposite gene expression signatures.

    PubMed

    Vazquez, Miguel; Nogales-Cadenas, Ruben; Arroyo, Javier; Botías, Pedro; García, Raul; Carazo, Jose M; Tirado, Francisco; Pascual-Montano, Alberto; Carmona-Saez, Pedro

    2010-07-01

    The enormous amount of data available in public gene expression repositories such as Gene Expression Omnibus (GEO) offers an inestimable resource to explore gene expression programs across several organisms and conditions. This information can be used to discover experiments that induce similar or opposite gene expression patterns to a given query, which in turn may lead to the discovery of new relationships among diseases, drugs or pathways, as well as the generation of new hypotheses. In this work, we present MARQ, a web-based application that allows researchers to compare a query set of genes, e.g. a set of over- and under-expressed genes, against a signature database built from GEO datasets for different organisms and platforms. MARQ offers an easy-to-use and integrated environment to mine GEO, in order to identify conditions that induce similar or opposite gene expression patterns to a given experimental condition. MARQ also includes additional functionalities for the exploration of the results, including a meta-analysis pipeline to find genes that are differentially expressed across different experiments. The application is freely available at http://marq.dacya.ucm.es.
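
    The idea of matching a query of over- and under-expressed genes against stored signatures can be sketched with a simple overlap score. This is an assumed toy measure for illustration only, not MARQ's actual scoring method; the function and argument names are hypothetical.

```python
def signature_score(query_up, query_down, exp_up, exp_down):
    """Score in [-1, 1]: positive for similar, negative for opposite patterns.

    Each argument is a collection of gene identifiers. Genes regulated in
    the same direction count +1, genes regulated in opposite directions -1.
    """
    query_up, query_down = set(query_up), set(query_down)
    exp_up, exp_down = set(exp_up), set(exp_down)
    concordant = len(query_up & exp_up) + len(query_down & exp_down)
    discordant = len(query_up & exp_down) + len(query_down & exp_up)
    total = len(query_up) + len(query_down)
    # An empty query carries no information, so it scores zero.
    return (concordant - discordant) / total if total else 0.0
```

    Sorting a signature database by this score would put experiments with similar patterns at the top and those with opposite patterns at the bottom.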

  6. Integrative analysis for identification of shared markers from various functional cells/tissues for rheumatoid arthritis.

    PubMed

    Xia, Wei; Wu, Jian; Deng, Fei-Yan; Wu, Long-Fei; Zhang, Yong-Hong; Guo, Yu-Fan; Lei, Shu-Feng

    2017-02-01

    Rheumatoid arthritis (RA) is a systemic autoimmune disease. So far, it is unclear whether there exist common RA-related genes shared in different tissues/cells. In this study, we conducted an integrative analysis on multiple datasets to identify potential shared genes that are significant in multiple tissues/cells for RA. Seven microarray gene expression datasets representing various RA-related tissues/cells were downloaded from the Gene Expression Omnibus (GEO). Statistical analyses, testing both marginal and joint effects, were conducted to identify significant genes shared in various samples. Follow-up analyses were conducted on functional annotation clustering analysis, protein-protein interaction (PPI) analysis, gene-based association analysis, and ELISA validation analysis in in-house samples. We identified 18 shared significant genes, which were mainly involved in the immune response and chemokine signaling pathway. Among the 18 genes, eight genes (PPBP, PF4, HLA-F, S100A8, RNASEH2A, P2RY6, JAG2, and PCBP1) interact with known RA genes. Two genes (HLA-F and PCBP1) are significant in gene-based association analysis (P = 1.03E-31, P = 1.30E-2, respectively). Additionally, PCBP1 also showed differential protein expression levels in in-house case-control plasma samples (P = 2.60E-2). This study represented the first effort to identify shared RA markers from different functional cells or tissues. The results suggested that one of the shared genes, i.e., PCBP1, is a promising biomarker for RA.

  7. Identification of Human HK Genes and Gene Expression Regulation Study in Cancer from Transcriptomics Data Analysis

    PubMed Central

    Zhang, Zhang; Liu, Jingxing; Wu, Jiayan; Yu, Jun

    2013-01-01

    The regulation of gene expression is essential for eukaryotes, as it drives the processes of cellular differentiation and morphogenesis, leading to the creation of different cell types in multicellular organisms. RNA-Sequencing (RNA-Seq) provides researchers with a powerful toolbox for characterization and quantification of the transcriptome. Many different human tissue/cell transcriptome datasets from RNA-Seq technology are available in public data resources. The fundamental issue here is how to develop an effective analysis method to estimate expression pattern similarities between different tumor tissues and their corresponding normal tissues. We define the gene expression pattern from three directions: 1) expression breadth, which reflects gene expression on/off status, and mainly concerns ubiquitously expressed genes; 2) low/high or constant/variable expression genes, based on gene expression level and variation; and 3) the regulation of gene expression at the gene structure level. The cluster analysis indicates that gene expression pattern is more closely related to physiological condition than to tissue spatial distance. Two sets of human housekeeping (HK) genes are defined according to cell/tissue types, respectively. To characterize the gene expression pattern in gene expression level and variation, we first apply an improved K-means algorithm and a gene expression variance model. We find that cancer-associated HK genes (HK genes specific to the cancer group and not the normal group) are expressed at higher and more variable levels in the cancer condition than in the normal condition. Cancer-associated HK genes tend to be AT-rich, and they are enriched in cell cycle regulation related functions and constitute some cancer signatures. The expression of large genes is also avoided in the cancer group. These studies will help us understand which cell type-specific patterns of gene expression differ among different cell types, and particularly for cancer. PMID:23382867

  8. Mining microbial metatranscriptomes for expression of antibiotic resistance genes under natural conditions

    PubMed Central

    Versluis, Dennis; D’Andrea, Marco Maria; Ramiro Garcia, Javier; Leimena, Milkha M.; Hugenholtz, Floor; Zhang, Jing; Öztürk, Başak; Nylund, Lotta; Sipkema, Detmer; Schaik, Willem van; de Vos, Willem M.; Kleerebezem, Michiel; Smidt, Hauke; Passel, Mark W.J. van

    2015-01-01

    Antibiotic resistance genes are found in a broad range of ecological niches associated with complex microbiota. Here we investigated if resistance genes are not only present, but also transcribed under natural conditions. Furthermore, we examined the potential for antibiotic production by assessing the expression of associated secondary metabolite biosynthesis gene clusters. Metatranscriptome datasets from intestinal microbiota of four human adults, one human infant, 15 mice and six pigs, of which only the latter have received antibiotics prior to the study, as well as from sea bacterioplankton, a marine sponge, forest soil and sub-seafloor sediment, were investigated. We found that resistance genes are expressed in all studied ecological niches, albeit with niche-specific differences in relative expression levels and diversity of transcripts. For example, in mice and human infant microbiota predominantly tetracycline resistance genes were expressed while in human adult microbiota the spectrum of expressed genes was more diverse, and also included β-lactam, aminoglycoside and macrolide resistance genes. Resistance gene expression could result from the presence of natural antibiotics in the environment, although we could not link it to expression of corresponding secondary metabolites biosynthesis clusters. Alternatively, resistance gene expression could be constitutive, or these genes serve alternative roles besides antibiotic resistance. PMID:26153129

  9. Time-series RNA-seq analysis package (TRAP) and its application to the analysis of rice, Oryza sativa L. ssp. Japonica, upon drought stress.

    PubMed

    Jo, Kyuri; Kwon, Hawk-Bin; Kim, Sun

    2014-06-01

    Measuring expression levels of genes at the whole genome level can be useful for many purposes, especially for revealing biological pathways underlying specific phenotype conditions. When gene expression is measured over a time period, we have opportunities to understand how organisms react to stress conditions over time. Thus many biologists routinely measure whole genome level gene expressions at multiple time points. However, there are several technical difficulties for analyzing such whole genome expression data. In addition, these days gene expression data is often measured by using RNA-sequencing rather than microarray technologies and then analysis of expression data is much more complicated since the analysis process should start with mapping short reads and produce differentially activated pathways and also possibly interactions among pathways. In addition, many useful tools for analyzing microarray gene expression data are not applicable for the RNA-seq data. Thus a comprehensive package for analyzing time series transcriptome data is much needed. In this article, we present a comprehensive package, Time-series RNA-seq Analysis Package (TRAP), integrating all necessary tasks such as mapping short reads, measuring gene expression levels, finding differentially expressed genes (DEGs), clustering and pathway analysis for time-series data in a single environment. In addition to implementing useful algorithms that are not available for RNA-seq data, we extended existing pathway analysis methods, ORA and SPIA, for time series analysis and estimates statistical values for combined dataset by an advanced metric. TRAP also produces visual summary of pathway interactions. Gene expression change labeling, a practical clustering method used in TRAP, enables more accurate interpretation of the data when combined with pathway analysis. We applied our methods on a real dataset for the analysis of rice (Oryza sativa L. Japonica nipponbare) upon drought stress. 
The result showed that TRAP was able to detect pathways more accurately than several existing methods. TRAP is available at http://biohealth.snu.ac.kr/software/TRAP/.

  10. Analyzing gene expression data in mice with the Neuro Behavior Ontology.

    PubMed

    Hoehndorf, Robert; Hancock, John M; Hardy, Nigel W; Mallon, Ann-Marie; Schofield, Paul N; Gkoutos, Georgios V

    2014-02-01

    We have applied the Neuro Behavior Ontology (NBO), an ontology for the annotation of behavioral gene functions and behavioral phenotypes, to the annotation of more than 1,000 genes in the mouse that are known to play a role in behavior. These annotations can be explored by researchers interested in genes involved in particular behaviors and used computationally to provide insights into the behavioral phenotypes resulting from differences in gene expression. We developed the OntoFUNC tool and have applied it to enrichment analyses over the NBO to provide high-level behavioral interpretations of gene expression datasets. The resulting increase in the number of gene annotations facilitates the identification of behavioral or neurologic processes by assisting the formulation of hypotheses about the relationships between gene, processes, and phenotypic manifestations resulting from behavioral observations.

  11. BEAT: Bioinformatics Exon Array Tool to store, analyze and visualize Affymetrix GeneChip Human Exon Array data from disease experiments

    PubMed Central

    2012-01-01

    Background It is known from recent studies that more than 90% of human multi-exon genes are subject to Alternative Splicing (AS), a key molecular mechanism in which multiple transcripts may be generated from a single gene. It is widely recognized that a breakdown in AS mechanisms plays an important role in cellular differentiation and pathologies. Polymerase Chain Reactions, microarrays and sequencing technologies have been applied to the study of transcript diversity arising from alternative expression. Last generation Affymetrix GeneChip Human Exon 1.0 ST Arrays offer a more detailed view of the gene expression profile providing information on the AS patterns. The exon array technology, with more than five million data points, can detect approximately one million exons, and it allows performing analyses at both gene and exon level. In this paper we describe BEAT, an integrated user-friendly bioinformatics framework to store, analyze and visualize exon arrays datasets. It combines a data warehouse approach with some rigorous statistical methods for assessing the AS of genes involved in diseases. Meta statistics are proposed as a novel approach to explore the analysis results. BEAT is available at http://beat.ba.itb.cnr.it. Results BEAT is a web tool which allows uploading and analyzing exon array datasets using standard statistical methods and an easy-to-use graphical web front-end. BEAT has been tested on a dataset with 173 samples and tuned using new datasets of exon array experiments from 28 colorectal cancer and 26 renal cell cancer samples produced at the Medical Genetics Unit of IRCCS Casa Sollievo della Sofferenza. To highlight all possible AS events, alternative names, accession Ids, Gene Ontology terms and biochemical pathways annotations are integrated with exon and gene level expression plots. 
The user can customize the results choosing custom thresholds for the statistical parameters and exploiting the available clinical data of the samples for a multivariate AS analysis. Conclusions Despite exon array chips being widely used for transcriptomics studies, there is a lack of analysis tools offering advanced statistical features and requiring no programming knowledge. BEAT provides a user-friendly platform for a comprehensive study of AS events in human diseases, displaying the analysis results with easily interpretable and interactive tables and graphics. PMID:22536968

  12. Predicting ionizing radiation exposure using biochemically-inspired genomic machine learning.

    PubMed

    Zhao, Jonathan Z L; Mucaki, Eliseos J; Rogan, Peter K

    2018-01-01

    Background: Gene signatures derived from transcriptomic data using machine learning methods have shown promise for biodosimetry testing. These signatures may not be sufficiently robust for large scale testing, as their performance has not been adequately validated on external, independent datasets. The present study develops human and murine signatures with biochemically-inspired machine learning that are strictly validated using k-fold and traditional approaches. Methods: Gene Expression Omnibus (GEO) datasets of exposed human and murine lymphocytes were preprocessed via nearest neighbor imputation and expression of genes implicated in the literature to be responsive to radiation exposure (n=998) were then ranked by Minimum Redundancy Maximum Relevance (mRMR). Optimal signatures were derived by backward, complete, and forward sequential feature selection using Support Vector Machines (SVM), and validated using k-fold or traditional validation on independent datasets. Results: The best human signatures we derived exhibit k-fold validation accuracies of up to 98% (DDB2, PRKDC, TPP2, PTPRE, and GADD45A) when validated over 209 samples and traditional validation accuracies of up to 92% (DDB2, CD8A, TALDO1, PCNA, EIF4G2, LCN2, CDKN1A, PRKCH, ENO1, and PPM1D) when validated over 85 samples. Some human signatures are specific enough to differentiate between chemotherapy and radiotherapy. Certain multi-class murine signatures have sufficient granularity in dose estimation to inform eligibility for cytokine therapy (assuming these signatures could be translated to humans). We compiled a list of the most frequently appearing genes in the top 20 human and mouse signatures. More frequently appearing genes among an ensemble of signatures may indicate greater impact of these genes on the performance of individual signatures. Several genes in the signatures we derived are present in previously proposed signatures. 
Conclusions: Gene signatures for ionizing radiation exposure derived by machine learning have low error rates in externally validated, independent datasets, and exhibit high specificity and granularity for dose estimation.

  13. Genevar: a database and Java application for the analysis and visualization of SNP-gene associations in eQTL studies

    PubMed Central

    Yang, Tsun-Po; Beazley, Claude; Montgomery, Stephen B.; Dimas, Antigone S.; Gutierrez-Arcelus, Maria; Stranger, Barbara E.; Deloukas, Panos; Dermitzakis, Emmanouil T.

    2010-01-01

    Summary: Genevar (GENe Expression VARiation) is a database and Java tool designed to integrate multiple datasets, and provides analysis and visualization of associations between sequence variation and gene expression. Genevar allows researchers to investigate expression quantitative trait loci (eQTL) associations within a gene locus of interest in real time. The database and application can be installed on a standard computer in database mode and, in addition, on a server to share discoveries among affiliations or the broader community over the Internet via web services protocols. Availability: http://www.sanger.ac.uk/resources/software/genevar Contact: emmanouil.dermitzakis@unige.ch PMID:20702402

  14. A predictive signature gene set for discriminating active from latent tuberculosis in Warao Amerindian children.

    PubMed

    Verhagen, Lilly M; Zomer, Aldert; Maes, Mailis; Villalba, Julian A; Del Nogal, Berenice; Eleveld, Marc; van Hijum, Sacha Aft; de Waard, Jacobus H; Hermans, Peter Wm

    2013-02-01

    Tuberculosis (TB) continues to cause a high toll of disease and death among children worldwide. The diagnosis of childhood TB is challenged by the paucibacillary nature of the disease and the difficulties in obtaining specimens. Whereas scientific and clinical research efforts to develop novel diagnostic tools have focused on TB in adults, childhood TB has been relatively neglected. Blood transcriptional profiling has improved our understanding of disease pathogenesis of adult TB and may offer future leads for diagnosis and treatment. No studies applying gene expression profiling of children with TB have been published so far. We identified a 116-gene signature set that showed an average prediction error of 11% for TB vs. latent TB infection (LTBI) and for TB vs. LTBI vs. healthy controls (HC) in our dataset. A minimal gene set of only 9 genes showed the same prediction error of 11% for TB vs. LTBI in our dataset. Furthermore, this minimal set showed a significant discriminatory value for TB vs. LTBI for all previously published adult studies using whole blood gene expression, with average prediction errors between 17% and 23%. In order to identify a robust representative gene set that would perform well in populations of different genetic backgrounds, we selected ten genes that were highly discriminative between TB, LTBI and HC in all literature datasets as well as in our dataset. Functional annotation of these genes highlights a possible role for genes involved in calcium signaling and calcium metabolism as biomarkers for active TB. These ten genes were validated by quantitative real-time polymerase chain reaction in an additional cohort of 54 Warao Amerindian children with LTBI, HC and non-TB pneumonia. Decision tree analysis indicated that five of the ten genes were sufficient to classify 78% of the TB cases correctly with no LTBI subjects wrongly classified as TB (100% specificity). 
Our data justify the further exploration of our signature set as biomarkers for potential childhood TB diagnosis. We show that, as the identification of different biomarkers in ethnically distinct cohorts is apparent, it is important to cross-validate newly identified markers in all available cohorts.
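
    The sensitivity and specificity figures quoted for the decision tree follow from a standard confusion-matrix calculation, sketched below. The function name and the label strings are assumptions for illustration, and any labels passed in would be toy data, not the study's results.

```python
def sensitivity_specificity(true_labels, predicted_labels, positive="TB"):
    """Return (sensitivity, specificity) treating `positive` as the case class."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == positive:
            if pred == positive:
                tp += 1   # case correctly called
            else:
                fn += 1   # case missed
        else:
            if pred == positive:
                fp += 1   # non-case wrongly called a case
            else:
                tn += 1   # non-case correctly rejected
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity
```

    In these terms, "no LTBI subjects wrongly classified as TB" means zero false positives, which gives a specificity of 1.0.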

  15. A predictive signature gene set for discriminating active from latent tuberculosis in Warao Amerindian children

    PubMed Central

    2013-01-01

    Background Tuberculosis (TB) continues to cause a high toll of disease and death among children worldwide. The diagnosis of childhood TB is challenged by the paucibacillary nature of the disease and the difficulties in obtaining specimens. Whereas scientific and clinical research efforts to develop novel diagnostic tools have focused on TB in adults, childhood TB has been relatively neglected. Blood transcriptional profiling has improved our understanding of disease pathogenesis of adult TB and may offer future leads for diagnosis and treatment. No studies applying gene expression profiling of children with TB have been published so far. Results We identified a 116-gene signature set that showed an average prediction error of 11% for TB vs. latent TB infection (LTBI) and for TB vs. LTBI vs. healthy controls (HC) in our dataset. A minimal gene set of only 9 genes showed the same prediction error of 11% for TB vs. LTBI in our dataset. Furthermore, this minimal set showed a significant discriminatory value for TB vs. LTBI for all previously published adult studies using whole blood gene expression, with average prediction errors between 17% and 23%. In order to identify a robust representative gene set that would perform well in populations of different genetic backgrounds, we selected ten genes that were highly discriminative between TB, LTBI and HC in all literature datasets as well as in our dataset. Functional annotation of these genes highlights a possible role for genes involved in calcium signaling and calcium metabolism as biomarkers for active TB. These ten genes were validated by quantitative real-time polymerase chain reaction in an additional cohort of 54 Warao Amerindian children with LTBI, HC and non-TB pneumonia. Decision tree analysis indicated that five of the ten genes were sufficient to classify 78% of the TB cases correctly with no LTBI subjects wrongly classified as TB (100% specificity). 
Conclusions Our data justify the further exploration of our signature set as biomarkers for potential childhood TB diagnosis. We show that, as the identification of different biomarkers in ethnically distinct cohorts is apparent, it is important to cross-validate newly identified markers in all available cohorts. PMID:23375113

  16. Evolution and cell-type specificity of human-specific genes preferentially expressed in progenitors of fetal neocortex.

    PubMed

    Florio, Marta; Heide, Michael; Pinson, Anneline; Brandl, Holger; Albert, Mareike; Winkler, Sylke; Wimberger, Pauline; Huttner, Wieland B; Hiller, Michael

    2018-03-21

    Understanding the molecular basis that underlies the expansion of the neocortex during primate, and notably human, evolution requires the identification of genes that are particularly active in the neural stem and progenitor cells of the developing neocortex. Here, we have used existing transcriptome datasets to carry out a comprehensive screen for protein-coding genes preferentially expressed in progenitors of fetal human neocortex. We show that 15 human-specific genes exhibit such expression, and many of them evolved distinct neural progenitor cell-type expression profiles and levels compared to their ancestral paralogs. Functional studies on one such gene, NOTCH2NL, demonstrate its ability to promote basal progenitor proliferation in mice. An additional 35 human genes with progenitor-enriched expression are shown to have orthologs only in primates. Our study provides a resource of genes that are promising candidates to exert specific, and novel, roles in neocortical development during primate, and notably human, evolution.

  17. Evolution and cell-type specificity of human-specific genes preferentially expressed in progenitors of fetal neocortex

    PubMed Central

    Pinson, Anneline; Brandl, Holger; Albert, Mareike; Winkler, Sylke; Wimberger, Pauline

    2018-01-01

    Understanding the molecular basis that underlies the expansion of the neocortex during primate, and notably human, evolution requires the identification of genes that are particularly active in the neural stem and progenitor cells of the developing neocortex. Here, we have used existing transcriptome datasets to carry out a comprehensive screen for protein-coding genes preferentially expressed in progenitors of fetal human neocortex. We show that 15 human-specific genes exhibit such expression, and many of them evolved distinct neural progenitor cell-type expression profiles and levels compared to their ancestral paralogs. Functional studies on one such gene, NOTCH2NL, demonstrate its ability to promote basal progenitor proliferation in mice. An additional 35 human genes with progenitor-enriched expression are shown to have orthologs only in primates. Our study provides a resource of genes that are promising candidates to exert specific, and novel, roles in neocortical development during primate, and notably human, evolution. PMID:29561261

  18. From Saccharomyces cerevisiae to human: The important gene co-expression modules.

    PubMed

    Liu, Wei; Li, Li; Ye, Hua; Chen, Haiwei; Shen, Weibiao; Zhong, Yuexian; Tian, Tian; He, Huaqin

    2017-08-01

    Network-based systems biology has become an important method for analyzing high-throughput gene expression data and gene function mining. Yeast has long been a popular model organism for biomedical research. In the current study, a weighted gene co-expression network analysis algorithm was applied to construct a gene co-expression network in Saccharomyces cerevisiae. Seventeen stable gene co-expression modules were detected from 2,814 S. cerevisiae microarray data. Further characterization of these modules with the Database for Annotation, Visualization and Integrated Discovery tool indicated that these modules were associated with certain biological processes, such as heat response, cell cycle, translational regulation, mitochondrion oxidative phosphorylation, amino acid metabolism and autophagy. Hub genes were also screened by intra-modular connectivity. Finally, the module conservation was evaluated in a human disease microarray dataset. Functional modules were identified in budding yeast, some of which are associated with patient survival. The current study provided a paradigm for single cell microorganisms and potentially other organisms.

  19. Isoform-level gene expression patterns in single-cell RNA-sequencing data.

    PubMed

    Vu, Trung Nghia; Wills, Quin F; Kalari, Krishna R; Niu, Nifang; Wang, Liewei; Pawitan, Yudi; Rantalainen, Mattias

    2018-02-27

    RNA sequencing of single cells enables characterization of transcriptional heterogeneity in seemingly homogeneous cell populations. Single-cell sequencing has been applied in a wide range of research fields. However, few studies have focused on characterization of isoform-level expression patterns at the single-cell level. In this study we propose and apply a novel method, ISOform-Patterns (ISOP), based on mixture modeling, to characterize the expression patterns of isoform pairs from the same gene in single-cell isoform-level expression data. We define six principal patterns of isoform expression relationships and describe a method for differential-pattern analysis. We demonstrate ISOP through analysis of single-cell RNA-sequencing data from a breast cancer cell line, with replication in three independent datasets. We assigned the pattern types to each of 16,562 isoform-pairs from 4,929 genes. Among those, 26% of the discovered patterns were significant (p<0.05), while the remaining patterns are possibly effects of transcriptional bursting, drop-out and stochastic biological heterogeneity. Furthermore, 32% of genes discovered through differential-pattern analysis were not detected by differential-expression analysis. The effect of drop-out events, mean expression level, and properties of the expression distribution on the performance of ISOP were also investigated through simulated datasets. To conclude, ISOP provides a novel approach for characterization of isoform-level preference, commitment and heterogeneity in single-cell RNA-sequencing data. The ISOP method has been implemented as an R package and is available at https://github.com/nghiavtr/ISOP under a GPL-3 license. Contact: mattias.rantalainen@ki.se. Supplementary data are available at Bioinformatics online.

  20. MEXPRESS: visualizing expression, DNA methylation and clinical TCGA data.

    PubMed

    Koch, Alexander; De Meyer, Tim; Jeschke, Jana; Van Criekinge, Wim

    2015-08-26

    In recent years, increasing amounts of genomic and clinical cancer data have become publicly available through large-scale collaborative projects such as The Cancer Genome Atlas (TCGA). However, as long as these datasets are difficult to access and interpret, they are essentially useless for a major part of the research community and their scientific potential will not be fully realized. To address these issues we developed MEXPRESS, a straightforward and easy-to-use web tool for the integration and visualization of the expression, DNA methylation and clinical TCGA data on a single-gene level (http://mexpress.be). In comparison to existing tools, MEXPRESS allows researchers to quickly visualize and interpret the different TCGA datasets and their relationships for a single gene, as demonstrated for GSTP1 in prostate adenocarcinoma. We also used MEXPRESS to reveal the differences in the DNA methylation status of the PAM50 marker gene MLPH between the breast cancer subtypes and how these differences were linked to the expression of MLPH. We have created a user-friendly tool for the visualization and interpretation of TCGA data, offering clinical researchers a simple way to evaluate the TCGA data for their genes or candidate biomarkers of interest.

  1. Asymmetric latent semantic indexing for gene expression experiments visualization.

    PubMed

    González, Javier; Muñoz, Alberto; Martos, Gabriel

    2016-08-01

    We propose a new method to visualize gene expression experiments inspired by the latent semantic indexing technique originally proposed in the textual analysis context. By using the correspondence word-gene document-experiment, we define an asymmetric similarity measure of association for genes that accounts for potential hierarchies in the data, the key to obtain meaningful gene mappings. We use the polar decomposition to obtain the sources of asymmetry of the similarity matrix, which are later combined with previous knowledge. Genetic classes of genes are identified by means of a mixture model applied in the genes latent space. We describe the steps of the procedure and we show its utility in the Human Cancer dataset.

  2. The dynamics of gene expression changes in a mouse model of oral tumorigenesis may help refine prevention and treatment strategies in patients with oral cancer.

    PubMed

    Foy, Jean-Philippe; Tortereau, Antonin; Caulin, Carlos; Le Texier, Vincent; Lavergne, Emilie; Thomas, Emilie; Chabaud, Sylvie; Perol, David; Lachuer, Joël; Lang, Wenhua; Hong, Waun Ki; Goudot, Patrick; Lippman, Scott M; Bertolus, Chloé; Saintigny, Pierre

    2016-06-14

    A better understanding of the dynamics of molecular changes occurring during the early stages of oral tumorigenesis may help refine prevention and treatment strategies. We generated genome-wide expression profiles of microdissected normal mucosa, hyperplasia, dysplasia and tumors derived from the 4-NQO mouse model of oral tumorigenesis. Genes differentially expressed between tumor and normal mucosa defined the "tumor gene set" (TGS), including 4 non-overlapping gene subsets that characterize the dynamics of gene expression changes through different stages of disease progression. The majority of gene expression changes occurred early or progressively. The relevance of these mouse gene sets to human disease was tested in multiple datasets including the TCGA and the Genomics of Drug Sensitivity in Cancer project. The TGS was able to discriminate oral squamous cell carcinoma (OSCC) from normal oral mucosa in 3 independent datasets. The OSCC samples enriched in the mouse TGS displayed high frequency of CASP8 mutations, 11q13.3 amplifications and low frequency of PIK3CA mutations. Early changes observed in the 4-NQO model were associated with a trend toward a shorter oral cancer-free survival in patients with oral preneoplasia that was not seen in multivariate analysis. Progressive changes observed in the 4-NQO model were associated with an increased sensitivity to 4 different MEK inhibitors in a panel of 51 squamous cell carcinoma cell lines of the aerodigestive tract. In conclusion, the dynamics of molecular changes in the 4-NQO model reveal that MEK inhibition may be relevant to prevention and treatment of a specific molecularly-defined subgroup of OSCC.

  3. Biclustering sparse binary genomic data.

    PubMed

    van Uitert, Miranda; Meuleman, Wouter; Wessels, Lodewyk

    2008-12-01

    Genomic datasets often consist of large, binary, sparse data matrices. In such a dataset, one is often interested in finding contiguous blocks that (mostly) contain ones. This is a biclustering problem, and while many algorithms have been proposed to deal with gene expression data, only two algorithms have been proposed that specifically deal with binary matrices. None of the gene expression biclustering algorithms can handle the large number of zeros in sparse binary matrices. The two proposed binary algorithms failed to produce meaningful results. In this article, we present a new algorithm that is able to extract biclusters from sparse, binary datasets. A powerful feature is that biclusters with different numbers of rows and columns can be detected, varying from many rows to few columns and few rows to many columns. It allows the user to guide the search towards biclusters of specific dimensions. When applying our algorithm to an input matrix derived from TRANSFAC, we find transcription factors with distinctly dissimilar binding motifs, but a clear set of common targets that are significantly enriched for GO categories.

  4. Selenium-binding protein 1 in head and neck cancer is low-expression and associates with the prognosis of nasopharyngeal carcinoma

    PubMed Central

    Chen, Fasheng; Chen, Chen; Qu, Yangang; Xiang, Hua; Ai, Qingxiu; Yang, Fei; Tan, Xueping; Zhou, Yi; Jiang, Guang; Zhang, Zixiong

    2016-01-01

    Background: Selenium-binding protein 1 (SELENBP1) expression is markedly reduced in many types of cancer, and low SELENBP1 expression levels are associated with poor patient prognosis. Methods: SELENBP1 gene expression in head and neck squamous cell carcinoma (HNSCC) was analyzed using a GEO dataset, and the characteristics of SELENBP1 expression in paraffin-embedded tissue were summarized. Expression of SELENBP1 in nasopharyngeal carcinoma (NPC), laryngeal cancer, oral cancer, tonsil cancer, hypopharyngeal cancer and normal tissues was detected using immunohistochemistry. Finally, 99 NPC patients were followed up for more than 5 years and the prognostic significance of SELENBP1 was analyzed. Results: Analysis of the GEO dataset showed that SELENBP1 gene expression in HNSCC was lower than that in normal tissue (P < 0.01), but that SELENBP1 gene expression did not differ significantly between T-stages or N-stages (P > 0.05). Analysis of pathological sections showed that SELENBP1 expression is low in the majority of HNSCC and lower in cancer nests than in the surrounding normal tissue, and that it is associated with the degree of malignancy of the tumor. Further study indicated that NPC patients in the low SELENBP1 expression group had poor overall survival, which differed significantly from that of the high expression group. Conclusion: SELENBP1 expression was down-regulated in HNSCC but was not associated with the T-stage or N-stage of the tumor. NPC patients with low SELENBP1 expression had poor overall survival, so SELENBP1 could be a novel biomarker for predicting prognosis. PMID:27583873

  5. Conserved Non-Coding Regulatory Signatures in Arabidopsis Co-Expressed Gene Modules

    PubMed Central

    Spangler, Jacob B.; Ficklin, Stephen P.; Luo, Feng; Freeling, Michael; Feltus, F. Alex

    2012-01-01

    Complex traits and other polygenic processes require coordinated gene expression. Co-expression networks model mRNA co-expression: the product of gene regulatory networks. To identify regulatory mechanisms underlying coordinated gene expression in a tissue-enriched context, ten Arabidopsis thaliana co-expression networks were constructed after manually sorting 4,566 RNA profiling datasets into aerial, flower, leaf, root, rosette, seedling, seed, shoot, whole plant, and global (all samples combined) groups. Collectively, the ten networks contained 30% of the measurable genes of Arabidopsis and were circumscribed into 5,491 modules. Modules were scrutinized for cis regulatory mechanisms putatively encoded in conserved non-coding sequences (CNSs) previously identified as remnants of a whole genome duplication event. We determined the non-random association of 1,361 unique CNSs to 1,904 co-expression network gene modules. Furthermore, the CNS elements were placed in the context of known gene regulatory networks (GRNs) by connecting 250 CNS motifs with known GRN cis elements. Our results provide support for a regulatory role of some CNS elements and suggest the functional consequences of CNS activation of co-expression in specific gene sets dispersed throughout the genome. PMID:23024789

  6. Conserved non-coding regulatory signatures in Arabidopsis co-expressed gene modules.

    PubMed

    Spangler, Jacob B; Ficklin, Stephen P; Luo, Feng; Freeling, Michael; Feltus, F Alex

    2012-01-01

    Complex traits and other polygenic processes require coordinated gene expression. Co-expression networks model mRNA co-expression: the product of gene regulatory networks. To identify regulatory mechanisms underlying coordinated gene expression in a tissue-enriched context, ten Arabidopsis thaliana co-expression networks were constructed after manually sorting 4,566 RNA profiling datasets into aerial, flower, leaf, root, rosette, seedling, seed, shoot, whole plant, and global (all samples combined) groups. Collectively, the ten networks contained 30% of the measurable genes of Arabidopsis and were circumscribed into 5,491 modules. Modules were scrutinized for cis regulatory mechanisms putatively encoded in conserved non-coding sequences (CNSs) previously identified as remnants of a whole genome duplication event. We determined the non-random association of 1,361 unique CNSs to 1,904 co-expression network gene modules. Furthermore, the CNS elements were placed in the context of known gene regulatory networks (GRNs) by connecting 250 CNS motifs with known GRN cis elements. Our results provide support for a regulatory role of some CNS elements and suggest the functional consequences of CNS activation of co-expression in specific gene sets dispersed throughout the genome.

  7. ANISEED 2017: extending the integrated ascidian database to the exploration and evolutionary comparison of genome-scale datasets

    PubMed Central

    Brozovic, Matija; Dantec, Christelle; Dardaillon, Justine; Dauga, Delphine; Faure, Emmanuel; Gineste, Mathieu; Louis, Alexandra; Naville, Magali; Nitta, Kazuhiro R; Piette, Jacques; Reeves, Wendy; Scornavacca, Céline; Simion, Paul; Vincentelli, Renaud; Bellec, Maelle; Aicha, Sameh Ben; Fagotto, Marie; Guéroult-Bellone, Marion; Haeussler, Maximilian; Jacox, Edwin; Lowe, Elijah K; Mendez, Mickael; Roberge, Alexis; Stolfi, Alberto; Yokomori, Rui; Cambillau, Christian; Christiaen, Lionel; Delsuc, Frédéric; Douzery, Emmanuel; Dumollard, Rémi; Kusakabe, Takehiro; Nakai, Kenta; Nishida, Hiroki; Satou, Yutaka; Swalla, Billie; Veeman, Michael; Volff, Jean-Nicolas

    2018-01-01

    ANISEED (www.aniseed.cnrs.fr) is the main model organism database for tunicates, the sister-group of vertebrates. This release gives access to annotated genomes, gene expression patterns, and anatomical descriptions for nine ascidian species. It provides increased integration with external molecular and taxonomy databases, better support for epigenomics datasets, in particular RNA-seq, ChIP-seq and SELEX-seq, and features novel interactive interfaces for existing and novel datatypes. In particular, the cross-species navigation and comparison is enhanced through a novel taxonomy section describing each represented species and through the implementation of interactive phylogenetic gene trees for 60% of tunicate genes. The gene expression section displays the results of RNA-seq experiments for the three major model species of solitary ascidians. Gene expression is controlled by the binding of transcription factors to cis-regulatory sequences. A high-resolution description of the DNA-binding specificity for 131 Ciona robusta (formerly C. intestinalis type A) transcription factors by SELEX-seq is provided and used to map candidate binding sites across the Ciona robusta and Phallusia mammillata genomes. Finally, use of a WashU Epigenome browser enhances genome navigation, while a Genomicus server was set up to explore microsynteny relationships within tunicates and with vertebrates, Amphioxus, echinoderms and hemichordates. PMID:29149270

  8. Mutual information estimation reveals global associations between stimuli and biological processes

    PubMed Central

    Suzuki, Taiji; Sugiyama, Masashi; Kanamori, Takafumi; Sese, Jun

    2009-01-01

    Background: Although microarray gene expression analysis has become popular, it remains difficult to interpret the biological changes caused by stimuli or variation of conditions. Clustering genes and associating each group with biological functions are commonly used methods. However, such methods only detect partial changes within cell processes. Herein, we propose a method for discovering global changes within a cell by associating observed conditions of gene expression with gene functions. Results: To elucidate the association, we introduce a novel feature selection method called Least-Squares Mutual Information (LSMI), which computes mutual information without density estimation, and therefore LSMI can detect nonlinear associations within a cell. We demonstrate the effectiveness of LSMI through comparison with existing methods. The results of the application to yeast microarray datasets reveal that non-natural stimuli affect various biological processes, whereas others have no significant relation to specific cell processes. Furthermore, we discover that biological processes can be categorized into four types according to the responses of various stimuli: DNA/RNA metabolism, gene expression, protein metabolism, and protein localization. Conclusion: We proposed a novel feature selection method called LSMI, and applied LSMI to mining the association between conditions of yeast and biological processes through microarray datasets. In fact, LSMI allows us to elucidate the global organization of cellular process control. PMID:19208155
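
    For intuition about the quantity that LSMI estimates, mutual information between a discrete stimulus label and a discrete gene-function label can be computed directly from counts. The sketch below is the naive plug-in estimator, not LSMI itself (LSMI avoids density estimation); the function name `mutual_information` and the toy labels are assumptions made for illustration.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits for two paired discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # counts of each (x, y) pair
    px = Counter(xs)               # marginal counts of x
    py = Counter(ys)               # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi

# Hypothetical toy data: a stimulus that fully determines the response.
stimulus = ["stress", "stress", "control", "control"]
response = ["up", "up", "down", "down"]
mi_dependent = mutual_information(stimulus, response)

# Hypothetical toy data: stimulus and response are independent.
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

    A perfectly informative binary stimulus yields 1 bit, and independent variables yield 0 bits, which is the contrast a feature-selection criterion of this kind exploits.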

  9. Exploring Plant Co-Expression and Gene-Gene Interactions with CORNET 3.0.

    PubMed

    Van Bel, Michiel; Coppens, Frederik

    2017-01-01

    Selecting and filtering a reference expression and interaction dataset when studying specific pathways and regulatory interactions can be a very time-consuming and error-prone task. In order to reduce the duplicated efforts required to amass such datasets, we have created the CORNET (CORrelation NETworks) platform which allows for easy access to a wide variety of data types: coexpression data, protein-protein interactions, regulatory interactions, and functional annotations. The CORNET platform outputs its results in either text format or through the Cytoscape framework, which is automatically launched by the CORNET website. CORNET 3.0 is the third iteration of the web platform designed for the user exploration of the coexpression space of plant genomes, with a focus on the model species Arabidopsis thaliana. Here we describe the platform: the tools, data, and best practices when using the platform. We indicate how the platform can be used to infer networks from a set of input genes, such as upregulated genes from an expression experiment. By exploring the network, new target and regulator genes can be discovered, allowing for follow-up experiments and more in-depth study. We also indicate how to avoid common pitfalls when evaluating the networks and how to avoid overinterpretation of the results. All CORNET versions are available at http://bioinformatics.psb.ugent.be/cornet/ .

  10. Integrated pathway-based transcription regulation network mining and visualization based on gene expression profiles.

    PubMed

    Kibinge, Nelson; Ono, Naoaki; Horie, Masafumi; Sato, Tetsuo; Sugiura, Tadao; Altaf-Ul-Amin, Md; Saito, Akira; Kanaya, Shigehiko

    2016-06-01

    Conventionally, workflows examining transcription regulation networks from gene expression data involve distinct analytical steps. There is a need for pipelines that unify data mining and inference deduction into a singular framework to enhance interpretation and hypotheses generation. We propose a workflow that merges network construction with gene expression data mining focusing on regulation processes in the context of transcription factor driven gene regulation. The pipeline implements pathway-based modularization of expression profiles into functional units to improve biological interpretation. The integrated workflow was implemented as a web application (TransReguloNet) with functions that enable pathway visualization and comparison of transcription factor activity between sample conditions defined in the experimental design. The pipeline merges differential expression, network construction, pathway-based abstraction, clustering and visualization. The framework was applied in analysis of actual expression datasets related to lung, breast and prostate cancer.

  11. Identification of pathogenic genes and upstream regulators in age-related macular degeneration.

    PubMed

    Zhao, Bin; Wang, Mengya; Xu, Jing; Li, Min; Yu, Yuhui

    2017-06-26

    Age-related macular degeneration (AMD) is the leading cause of irreversible blindness in older individuals. Our study aims to identify the key genes and upstream regulators in AMD. To screen pathogenic genes of AMD, an integrated analysis was performed using the AMD microarray datasets available in the Gene Expression Omnibus (GEO) database. The functional annotation and potential pathways of differentially expressed genes (DEGs) were further explored by Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis. We constructed the AMD-specific transcriptional regulatory network to find the crucial transcription factors (TFs) which target the DEGs in AMD. Quantitative real time polymerase chain reaction (qRT-PCR) was performed to verify the DEGs and TFs obtained by integrated analysis. From the two GEO datasets obtained, we identified 1280 DEGs (730 up-regulated and 550 down-regulated genes) between AMD and normal control (NC). KEGG analysis showed that steroid biosynthesis was a significantly enriched pathway for DEGs. The expression of 8 genes (TNC, GRP, TRAF6, ADAMTS5, GPX3, FAP, DHCR7 and FDFT1) was detected. Except for TNC and GPX3, the other 6 genes showed the same expression pattern in qRT-PCR as in our integrated analysis. The dysregulation of these eight genes may be involved in the process of AMD. Two crucial transcription factors (c-rel and myogenin) were concluded to play a role in AMD. In particular, myogenin was associated with AMD by regulating TNC, GRP and FAP. Our findings can contribute to developing new potential biomarkers, revealing the underlying pathogenesis, and identifying new therapeutic targets for AMD.

  12. brain-coX: investigating and visualising gene co-expression in seven human brain transcriptomic datasets.

    PubMed

    Freytag, Saskia; Burgess, Rosemary; Oliver, Karen L; Bahlo, Melanie

    2017-06-08

    The pathogenesis of neurological and mental health disorders often involves multiple genes, complex interactions, as well as brain- and development-specific biological mechanisms. These characteristics make identification of disease genes for such disorders challenging, as conventional prioritisation tools are not specifically tailored to deal with the complexity of the human brain. Thus, we developed a novel web application, brain-coX, that offers gene prioritisation with accompanying visualisations based on seven gene expression datasets in the post-mortem human brain, the largest such resource ever assembled. We tested whether our tool can correctly prioritise known genes from 37 brain-specific KEGG pathways and 17 psychiatric conditions. We achieved average sensitivity of nearly 50%, at the same time reaching a specificity of approximately 75%. We also compared brain-coX's performance to that of its main competitors, Endeavour and ToppGene, focusing on the ability to discover novel associations. Using a subset of the curated SFARI autism gene collection we show that brain-coX's prioritisations are most similar to SFARI's own curated gene classifications. brain-coX is the first prioritisation and visualisation web-tool targeted to the human brain and can be freely accessed via http://shiny.bioinf.wehi.edu.au/freytag.s/ .
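
    The sensitivity and specificity quoted above follow the standard confusion-matrix definitions. A minimal sketch, assuming hypothetical gene identifiers and a hypothetical helper `sensitivity_specificity` (this is not brain-coX's own code), and assuming both input sets are subsets of the gene universe:

```python
def sensitivity_specificity(predicted, known, universe):
    """Return (sensitivity, specificity) for a set of prioritised genes.

    predicted: genes the tool prioritised
    known:     genes truly associated with the pathway or condition
    universe:  every gene that was considered
    """
    predicted, known, universe = set(predicted), set(known), set(universe)
    tp = len(predicted & known)             # known genes that were prioritised
    fn = len(known - predicted)             # known genes that were missed
    fp = len(predicted - known)             # prioritised genes that are not known
    tn = len(universe - predicted - known)  # correctly ignored genes
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity

# Hypothetical toy example: 8 genes, 4 truly associated, 3 prioritised.
universe = ["G1", "G2", "G3", "G4", "G5", "G6", "G7", "G8"]
known = {"G1", "G2", "G3", "G4"}
predicted = {"G1", "G2", "G5"}
sens, spec = sensitivity_specificity(predicted, known, universe)
```

    In this toy case two of the four known genes are recovered (sensitivity 0.5) and three of the four unassociated genes are correctly left out (specificity 0.75).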

  13. CoINcIDE: A framework for discovery of patient subtypes across multiple datasets.

    PubMed

    Planey, Catherine R; Gevaert, Olivier

    2016-03-09

    Patient disease subtypes have the potential to transform personalized medicine. However, many patient subtypes derived from unsupervised clustering analyses on high-dimensional datasets are not replicable across multiple datasets, limiting their clinical utility. We present CoINcIDE, a novel methodological framework for the discovery of patient subtypes across multiple datasets that requires no between-dataset transformations. We also present a high-quality database collection, curatedBreastData, with over 2,500 breast cancer gene expression samples. We use CoINcIDE to discover novel breast and ovarian cancer subtypes with prognostic significance and novel hypothesized ovarian therapeutic targets across multiple datasets. CoINcIDE and curatedBreastData are available as R packages.

  14. Characteristics of allelic gene expression in human brain cells from single-cell RNA-seq data analysis.

    PubMed

    Zhao, Dejian; Lin, Mingyan; Pedrosa, Erika; Lachman, Herbert M; Zheng, Deyou

    2017-11-10

    Monoallelic expression of autosomal genes has been implicated in human psychiatric disorders. However, there is a paucity of allelic expression studies in human brain cells at the single cell and genome wide levels. In this report, we reanalyzed a previously published single-cell RNA-seq dataset from several postmortem human brains and observed pervasive monoallelic expression in individual cells, largely in a random manner. Examining single nucleotide variants with a predicted functional disruption, we found that the "damaged" alleles were overall expressed in fewer brain cells than their counterparts, and at a lower level in cells where their expression was detected. We also identified many brain cell type-specific monoallelically expressed genes. Interestingly, many of these cell type-specific monoallelically expressed genes were enriched for functions important for those brain cell types. In addition, function analysis showed that genes displaying monoallelic expression and correlated expression across neuronal cells from different individual brains were implicated in the regulation of synaptic function. Our findings suggest that monoallelic gene expression is prevalent in human brain cells, which may play a role in generating cellular identity and neuronal diversity and thus increasing the complexity and diversity of brain cell functions.

  15. Integrative analysis of multi-omics data for identifying multi-markers for diagnosing pancreatic cancer

    PubMed Central

    2015-01-01

    Background: microRNA (miRNA) expression plays an influential role in cancer classification and malignancy, and miRNAs are feasible as alternative diagnostic markers for pancreatic cancer, a highly aggressive neoplasm with silent early symptoms, high metastatic potential, and resistance to conventional therapies. Methods: In this study, we evaluated the benefits of multi-omics data analysis by integrating miRNA and mRNA expression data in pancreatic cancer. Using support vector machine (SVM) modelling and leave-one-out cross validation (LOOCV), we evaluated the diagnostic performance of single- or multi-markers based on miRNA and mRNA expression profiles from 104 PDAC tissues and 17 benign pancreatic tissues. To select even more reliable and robust markers, we performed validation using independent datasets from the Gene Expression Omnibus (GEO) and the Cancer Genome Atlas (TCGA) data depositories. For validation, miRNA activity was estimated from miRNA-target gene interactions and mRNA expression datasets in pancreatic cancer. Results: Using a comprehensive identification approach, we successfully identified 705 multi-markers having powerful diagnostic performance for PDAC. In addition, these marker candidates were annotated with cancer pathways using gene ontology analysis. Conclusions: Our prediction models have strong potential for the diagnosis of pancreatic cancer. PMID:26328610
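
    Leave-one-out cross validation, as used above, holds out each sample once, trains on the remainder, and scores the held-out prediction. The sketch below illustrates only that loop; a simple nearest-centroid classifier is swapped in for the SVM to keep the example dependency-free, and the function names and expression values are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, sample):
    """Predict the label whose class centroid is closest to `sample`."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return min(centroids, key=lambda label: sq_dist(centroids[label], sample))

def loocv_accuracy(features, labels, predict=nearest_centroid_predict):
    """Hold out each sample once, train on the rest, and score the prediction."""
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)

# Hypothetical expression values for two markers (columns) in six samples.
features = [[5.1, 0.9], [4.8, 1.2], [5.3, 1.0],
            [1.0, 4.9], [1.2, 5.2], [0.9, 5.0]]
labels = ["tumor", "tumor", "tumor", "benign", "benign", "benign"]
accuracy = loocv_accuracy(features, labels)
```

    Any classifier with the same `predict(train_x, train_y, sample)` shape could be passed in instead, which is how an SVM would slot into the same loop.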

  16. Protein-coding genes combined with long noncoding RNA as a novel transcriptome molecular staging model to predict the survival of patients with esophageal squamous cell carcinoma.

    PubMed

    Guo, Jin-Cheng; Wu, Yang; Chen, Yang; Pan, Feng; Wu, Zhi-Yong; Zhang, Jia-Sheng; Wu, Jian-Yi; Xu, Xiu-E; Zhao, Jian-Mei; Li, En-Min; Zhao, Yi; Xu, Li-Yan

    2018-04-09

    Esophageal squamous cell carcinoma (ESCC) is the predominant subtype of esophageal carcinoma in China. This study aimed to develop a staging model to predict outcomes of patients with ESCC. Using Cox regression analysis, principal component analysis (PCA), partitioning clustering, Kaplan-Meier analysis, receiver operating characteristic (ROC) curve analysis, and classification and regression tree (CART) analysis, we mined the Gene Expression Omnibus database to determine the expression profiles of genes in 179 patients with ESCC from the GSE63624 and GSE63622 datasets. Univariate Cox regression analysis of the GSE63624 dataset revealed that 2404 protein-coding genes (PCGs) and 635 long non-coding RNAs (lncRNAs) were associated with the survival of patients with ESCC. PCA categorized these PCGs and lncRNAs into three principal components (PCs), which were used to cluster the patients into three groups. ROC analysis demonstrated that the predictive ability of PCG-lncRNA PCs when applied to new patients was better than that of the tumor-node-metastasis staging (area under ROC curve [AUC]: 0.69 vs. 0.65, P < 0.05). Accordingly, we constructed a molecular disaggregated model comprising one lncRNA and two PCGs, which we designated as the LSB staging model using CART analysis in the GSE63624 dataset. This LSB staging model classified the GSE63622 dataset of patients into three different groups, and its effectiveness was validated by analysis of another cohort of 105 patients. The LSB staging model has clinical significance for the prognosis prediction of patients with ESCC and may serve as a three-gene staging microarray.
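
    The AUC values compared above can be read as the probability that a randomly chosen patient with the event receives a higher risk score than a randomly chosen patient without it. A minimal sketch of that pairwise definition, assuming a hypothetical helper `roc_auc` and made-up scores (not the study's data or code):

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: 1 for patients with the event, 0 otherwise.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical risk scores for five patients.
scores = [0.9, 0.7, 0.4, 0.6, 0.2]
labels = [1, 1, 1, 0, 0]
auc = roc_auc(scores, labels)
```

    The pairwise form is quadratic in the number of samples, which is acceptable for cohorts of a few hundred patients; rank-based formulas give the same value faster.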

  17. Network regularised Cox regression and multiplex network models to predict disease comorbidities and survival of cancer.

    PubMed

    Xu, Haoming; Moni, Mohammad Ali; Liò, Pietro

    2015-12-01

    In cancer genomics, gene expression levels provide important molecular signatures for all types of cancer, and this could be very useful for predicting the survival of cancer patients. However, the main challenge of gene expression data analysis is high dimensionality, and microarray data are characterised by a small number of samples with a large number of genes. To overcome this problem, a variety of penalised Cox proportional hazard models have been proposed. We introduce a novel network regularised Cox proportional hazard model and a novel multiplex network model to measure the disease comorbidities and to predict survival of the cancer patient. Our methods are applied to analyse seven microarray cancer gene expression datasets: breast cancer, ovarian cancer, lung cancer, liver cancer, renal cancer and osteosarcoma. Firstly, we applied a principal component analysis to reduce the dimensionality of original gene expression data. Secondly, we applied a network regularised Cox regression model on the reduced gene expression datasets. By using normalised mutual information method and multiplex network model, we predict the comorbidities for the liver cancer based on the integration of diverse set of omics and clinical data, and we find the diseasome associations (disease-gene association) among different cancers based on the identified common significant genes. Finally, we evaluated the precision of the approach with respect to the accuracy of survival prediction using ROC curves. We report that colon cancer, liver cancer and renal cancer share the CXCL5 gene, and breast cancer, ovarian cancer and renal cancer share the CCND2 gene. Our methods are useful to predict survival of the patient and disease comorbidities more accurately and helpful for improvement of the care of patients with comorbidity. Software in Matlab and R is available on our GitHub page: https://github.com/ssnhcom/NetworkRegularisedCox.git.

  18. Rrp1b, a New Candidate Susceptibility Gene for Breast Cancer Progression and Metastasis

    PubMed Central

    Crawford, Nigel P. S; Qian, Xiaolan; Ziogas, Argyrios; Papageorge, Alex G; Boersma, Brenda J; Walker, Renard C; Lukes, Luanne; Rowe, William L; Zhang, Jinghui; Ambs, Stefan; Lowy, Douglas R; Anton-Culver, Hoda; Hunter, Kent W

    2007-01-01

    A novel candidate metastasis modifier, ribosomal RNA processing 1 homolog B (Rrp1b), was identified through two independent approaches. First, yeast two-hybrid, immunoprecipitation, and functional assays demonstrated a physical and functional interaction between Rrp1b and the previously identified metastasis modifier Sipa1. In parallel, using mouse and human metastasis gene expression data it was observed that extracellular matrix (ECM) genes are common components of metastasis predictive signatures, suggesting that ECM genes are either important markers or causal factors in metastasis. To investigate the relationship between ECM genes and poor prognosis in breast cancer, expression quantitative trait locus analysis of polyoma middle-T transgene-induced mammary tumor was performed. ECM gene expression was found to be consistently associated with Rrp1b expression. In vitro expression of Rrp1b significantly altered ECM gene expression, tumor growth, and dissemination in metastasis assays. Furthermore, a gene signature induced by ectopic expression of Rrp1b in tumor cells predicted survival in a human breast cancer gene expression dataset. Finally, constitutional polymorphism within RRP1B was found to be significantly associated with tumor progression in two independent breast cancer cohorts. These data suggest that RRP1B may be a novel susceptibility gene for breast cancer progression and metastasis. PMID:18081427

  19. From gene networks to drugs: systems pharmacology approaches for AUD.

    PubMed

    Ferguson, Laura B; Harris, R Adron; Mayfield, Roy Dayne

    2018-06-01

    The alcohol research field has amassed an impressive number of gene expression datasets spanning key brain areas for addiction, species (humans as well as multiple animal models), and stages in the addiction cycle (binge/intoxication, withdrawal/negative affect, and preoccupation/anticipation). These data have improved our understanding of the molecular adaptations that eventually lead to dysregulation of brain function and the chronic, relapsing disorder of addiction. Identification of new medications to treat alcohol use disorder (AUD) will likely benefit from the integration of genetic, genomic, and behavioral information included in these important datasets. Systems pharmacology considers drug effects as the outcome of the complex network of interactions a drug has rather than a single drug-molecule interaction. Computational strategies based on this principle that integrate gene expression signatures of pharmaceuticals and disease states have shown promise for identifying treatments that ameliorate disease symptoms (called in silico gene mapping or connectivity mapping). In this review, we suggest that gene expression profiling for in silico mapping is critical to improve drug repurposing and discovery for AUD and other psychiatric illnesses. We highlight studies that successfully apply gene mapping computational approaches to identify or repurpose pharmaceutical treatments for psychiatric illnesses. Furthermore, we address important challenges that must be overcome to maximize the potential of these strategies to translate to the clinic and improve healthcare outcomes.

  20. ExpressionDB: An open source platform for distributing genome-scale datasets.

    PubMed

    Hughes, Laura D; Lewis, Scott A; Hughes, Michael E

    2017-01-01

    RNA-sequencing (RNA-seq) and microarrays are methods for measuring gene expression across the entire transcriptome. Recent advances have made these techniques practical and affordable for essentially any laboratory with experience in molecular biology. A variety of computational methods have been developed to decrease the amount of bioinformatics expertise necessary to analyze these data. Nevertheless, many barriers persist which discourage new labs from using functional genomics approaches. Since high-quality gene expression studies have enduring value as resources to the entire research community, it is of particular importance that small labs have the capacity to share their analyzed datasets with the research community. Here we introduce ExpressionDB, an open source platform for visualizing RNA-seq and microarray data accommodating virtually any number of different samples. ExpressionDB is based on Shiny, a customizable web application which allows data sharing locally and online with customizable code written in R. ExpressionDB allows intuitive searches based on gene symbols, descriptions, or gene ontology terms, and it includes tools for dynamically filtering results based on expression level, fold change, and false-discovery rates. Built-in visualization tools include heatmaps, volcano plots, and principal component analysis, ensuring streamlined and consistent visualization to all users. All of the scripts for building an ExpressionDB with user-supplied data are freely available on GitHub, and the Creative Commons license allows fully open customization by end-users. We estimate that a demo database can be created in under one hour with minimal programming experience, and that a new database with user-supplied expression data can be completed and online in less than one day.
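
    The filtering described above (by expression level, fold change, and false-discovery rate) can be sketched in a few lines. This is an illustration in Python, not ExpressionDB's R/Shiny code; the function names, thresholds, field names and values are all hypothetical, expression values are assumed to be strictly positive, and the Benjamini-Hochberg procedure is assumed as the FDR method.

```python
import math

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def filter_genes(rows, min_expr=1.0, min_abs_log2fc=1.0, max_fdr=0.05):
    """Keep genes passing expression, fold-change and FDR thresholds."""
    fdr = benjamini_hochberg([r["p"] for r in rows])
    kept = []
    for r, q in zip(rows, fdr):
        log2fc = math.log2(r["treated"] / r["control"])
        expressed = max(r["treated"], r["control"]) >= min_expr
        if expressed and abs(log2fc) >= min_abs_log2fc and q <= max_fdr:
            kept.append(r["gene"])
    return kept

# Hypothetical records: mean expression per condition and a raw p-value.
rows = [
    {"gene": "A", "control": 10.0, "treated": 40.0, "p": 0.001},
    {"gene": "B", "control": 10.0, "treated": 12.0, "p": 0.002},  # small fold change
    {"gene": "C", "control": 0.2, "treated": 0.8, "p": 0.003},    # barely expressed
    {"gene": "D", "control": 8.0, "treated": 2.0, "p": 0.5},      # not significant
]
kept = filter_genes(rows)
```

    Only gene A passes all three filters here; each of the other genes fails exactly one criterion, which makes the effect of each threshold easy to see.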

  1. Research Resource: A Reference Transcriptome for Constitutive Androstane Receptor and Pregnane X Receptor Xenobiotic Signaling

    PubMed Central

    Ochsner, Scott A.; Tsimelzon, Anna; Dong, Jianrong; Coarfa, Cristian

    2016-01-01

    The pregnane X receptor (PXR) (PXR/NR1I2) and constitutive androstane receptor (CAR) (CAR/NR1I3) members of the nuclear receptor (NR) superfamily of ligand-regulated transcription factors are well-characterized mediators of xenobiotic and endocrine-disrupting chemical signaling. The Nuclear Receptor Signaling Atlas maintains a growing library of transcriptomic datasets involving perturbations of NR signaling pathways, many of which involve perturbations relevant to PXR and CAR xenobiotic signaling. Here, we generated a reference transcriptome based on the frequency of differential expression of genes across 159 experiments compiled from 22 datasets involving perturbations of CAR and PXR signaling pathways. In addition to the anticipated overrepresentation in the reference transcriptome of genes encoding components of the xenobiotic stress response, the ranking of genes involved in carbohydrate metabolism and gonadotropin action sheds mechanistic light on the suspected role of xenobiotics in metabolic syndrome and reproductive disorders. Gene Set Enrichment Analysis showed that although acetaminophen, chlorpromazine, and phenobarbital impacted many similar gene sets, differences in direction of regulation were evident in a variety of processes. Strikingly, gene sets representing genes linked to Parkinson's, Huntington's, and Alzheimer's diseases were enriched in all 3 transcriptomes. The reference xenobiotic transcriptome will be supplemented with additional future datasets to provide the community with a continually updated reference transcriptomic dataset for CAR- and PXR-mediated xenobiotic signaling. Our study demonstrates how aggregating and annotating transcriptomic datasets, and making them available for routine data mining, facilitates research into the mechanisms by which xenobiotics and endocrine-disrupting chemicals subvert conventional NR signaling modalities. PMID:27409825
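
    Ranking genes by how often they are differentially expressed across experiments, as this abstract describes, can be sketched as a simple counting exercise. The sketch assumes that each experiment has already been reduced to a set of differentially expressed genes; the thresholds and any weighting the authors applied are not stated in the abstract and are not modelled here. The function name and gene labels are hypothetical.

```python
from collections import Counter


def rank_by_de_frequency(experiments):
    """Return (gene, fraction of experiments in which it is DE), most frequent first."""
    counts = Counter()
    for de_genes in experiments:
        counts.update(set(de_genes))  # count each gene at most once per experiment
    total = len(experiments)
    # Sort by descending frequency, breaking ties alphabetically for determinism.
    return sorted(
        ((gene, n / total) for gene, n in counts.items()),
        key=lambda item: (-item[1], item[0]),
    )


# Hypothetical sets of differentially expressed genes, one per experiment.
experiments = [
    {"GENE_A", "GENE_B", "GENE_C"},
    {"GENE_A", "GENE_B"},
    {"GENE_A", "GENE_B", "GENE_C"},
    {"GENE_A", "GENE_D"},
]

reference = rank_by_de_frequency(experiments)
for gene, freq in reference:
    print(f"{gene}: differentially expressed in {freq:.0%} of experiments")
```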

  2. Research Resource: A Reference Transcriptome for Constitutive Androstane Receptor and Pregnane X Receptor Xenobiotic Signaling.

    PubMed

    Ochsner, Scott A; Tsimelzon, Anna; Dong, Jianrong; Coarfa, Cristian; McKenna, Neil J

    2016-08-01

    The pregnane X receptor (PXR) (PXR/NR1I2) and constitutive androstane receptor (CAR) (CAR/NR1I3) members of the nuclear receptor (NR) superfamily of ligand-regulated transcription factors are well-characterized mediators of xenobiotic and endocrine-disrupting chemical signaling. The Nuclear Receptor Signaling Atlas maintains a growing library of transcriptomic datasets involving perturbations of NR signaling pathways, many of which involve perturbations relevant to PXR and CAR xenobiotic signaling. Here, we generated a reference transcriptome based on the frequency of differential expression of genes across 159 experiments compiled from 22 datasets involving perturbations of CAR and PXR signaling pathways. In addition to the anticipated overrepresentation in the reference transcriptome of genes encoding components of the xenobiotic stress response, the ranking of genes involved in carbohydrate metabolism and gonadotropin action sheds mechanistic light on the suspected role of xenobiotics in metabolic syndrome and reproductive disorders. Gene Set Enrichment Analysis showed that although acetaminophen, chlorpromazine, and phenobarbital impacted many similar gene sets, differences in direction of regulation were evident in a variety of processes. Strikingly, gene sets representing genes linked to Parkinson's, Huntington's, and Alzheimer's diseases were enriched in all 3 transcriptomes. The reference xenobiotic transcriptome will be supplemented with additional future datasets to provide the community with a continually updated reference transcriptomic dataset for CAR- and PXR-mediated xenobiotic signaling. Our study demonstrates how aggregating and annotating transcriptomic datasets, and making them available for routine data mining, facilitates research into the mechanisms by which xenobiotics and endocrine-disrupting chemicals subvert conventional NR signaling modalities.

  3. Identification of pathogenic genes related to rheumatoid arthritis through integrated analysis of DNA methylation and gene expression profiling.

    PubMed

    Zhang, Lei; Ma, Shiyun; Wang, Huailiang; Su, Hang; Su, Ke; Li, Longjie

    2017-11-15

    The purpose of our study was to identify new pathogenic genes used for exploring the pathogenesis of rheumatoid arthritis (RA). To screen pathogenic genes of RA, an integrated analysis was performed by using the microarray datasets in RA derived from the Gene Expression Omnibus (GEO) database. The functional annotation and potential pathways of differentially expressed genes (DEGs) were further discovered by Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis. Afterwards, the integrated analysis of DNA methylation and gene expression profiling was used to screen crucial genes. In addition, we used RT-PCR and MSP to verify the expression levels and methylation status of these crucial genes in 20 synovial biopsy samples obtained from 10 RA model mice and 10 normal mice. BCL11B, CCDC88C, FCRLA and APOL6 were both up-regulated and hypomethylated in RA according to integrated analysis, RT-PCR and MSP verification. Four crucial genes (BCL11B, CCDC88C, FCRLA and APOL6) identified and analyzed in this study might be closely connected with the pathogenesis of RA. Copyright © 2017. Published by Elsevier B.V.

  4. Data Mining of Gene Arrays for Biomarkers of Survival in Ovarian Cancer

    PubMed Central

    Coveney, Clare; Boocock, David J.; Rees, Robert C.; Deen, Suha; Ball, Graham R.

    2015-01-01

    The expected five-year survival rate from a stage III ovarian cancer diagnosis is a mere 22%; this applies to the 7000 new cases diagnosed yearly in the UK. Stratification of patients with this heterogeneous disease, based on active molecular pathways, would aid a targeted treatment improving the prognosis for many cases. While hundreds of genes have been associated with ovarian cancer, few have yet been verified by peer research for clinical significance. Here, a meta-analysis approach was applied to two carefully selected gene expression microarray datasets. Artificial neural networks, Cox univariate survival analyses and t-tests identified genes whose expression was consistently and significantly associated with patient survival. The rigor of this experimental design increases confidence in the genes found to be of interest. A list of 56 genes was distilled from a potential 37,000 as significantly related to survival in both datasets, with an FDR of 1.39859 × 10⁻¹¹; their identities both verify genes already implicated in this disease and provide novel genes and pathways to pursue. Further investigation and validation of these may lead to clinical insights and have potential to predict a patient’s response to treatment or be used as a novel target for therapy. PMID:27600227

  5. Hybrid genetic algorithm-neural network: feature extraction for unpreprocessed microarray data.

    PubMed

    Tong, Dong Ling; Schierz, Amanda C

    2011-09-01

    Suitable techniques for microarray analysis have been widely researched, particularly for the study of marker genes expressed to a specific type of cancer. Most of the machine learning methods that have been applied to significant gene selection focus on the classification ability rather than the selection ability of the method. These methods also require the microarray data to be preprocessed before analysis takes place. The objective of this study is to develop a hybrid genetic algorithm-neural network (GANN) model that emphasises feature selection and can operate on unpreprocessed microarray data. The GANN is a hybrid model where the fitness value of the genetic algorithm (GA) is based upon the number of samples correctly labelled by a standard feedforward artificial neural network (ANN). The model is evaluated by using two benchmark microarray datasets with different array platforms and differing number of classes (a 2-class oligonucleotide microarray data for acute leukaemia and a 4-class complementary DNA (cDNA) microarray dataset for SRBCTs (small round blue cell tumours)). The underlying concept of the GANN algorithm is to select highly informative genes by co-evolving both the GA fitness function and the ANN weights at the same time. The novel GANN selected approximately 50% of the same genes as the original studies. This may indicate that these common genes are more biologically significant than other genes in the datasets. The remaining 50% of the significant genes identified were used to build predictive models and for both datasets, the models based on the set of genes extracted by the GANN method produced more accurate results. The results also suggest that the GANN method not only can detect genes that are exclusively associated with a single cancer type but can also explore the genes that are differentially expressed in multiple cancer types. 
The results show that the GANN model has successfully extracted statistically significant genes from the unpreprocessed microarray data as well as extracting known biologically significant genes. We also show that assessing the biological significance of genes based on classification accuracy may be misleading and though the GANN's set of extra genes prove to be more statistically significant than those selected by other methods, a biological assessment of these genes is highly recommended to confirm their functionality. Copyright © 2011 Elsevier B.V. All rights reserved.

  6. Identifying novel glioma associated pathways based on systems biology level meta-analysis.

    PubMed

    Hu, Yangfan; Li, Jinquan; Yan, Wenying; Chen, Jiajia; Li, Yin; Hu, Guang; Shen, Bairong

    2013-01-01

    Recent advances in microarray technology, spanning genomics, proteomics, and metabolomics, pose a great challenge for integrating these "-omics" data to analyze complex disease. Glioma is an extremely aggressive and lethal form of brain tumor, and thus the study of the molecular mechanism underlying glioma remains very important. To date, most studies have focused on detecting the differentially expressed genes in glioma. However, meta-analysis for pathway analysis based on multiple microarray datasets has not been systematically pursued. In this study, we therefore developed a systems biology based approach that integrates three types of omics data to identify common pathways in glioma. Firstly, a meta-analysis was performed to study the overlap of signatures at different levels based on the microarray gene expression data of glioma. Among these gene expression datasets, 12 pathways in the GeneGO database were found to be shared by four stages. Then, microRNA expression profiles and ChIP-seq data were integrated for further pathway enrichment analysis. As a result, we suggest that 5 of these pathways could serve as putative pathways in glioma. Among them, the pathway of TGF-beta-dependent induction of EMT via SMAD is of particular importance. Our results demonstrate that meta-analysis at the systems biology level provides a more useful approach to study the molecular mechanism of complex disease. The integration of different types of omics data, including gene expression microarrays, microRNA and ChIP-seq data, suggests some common pathways correlated with glioma. These findings will offer useful potential candidates for targeted therapeutic intervention of glioma.

  7. Tissue Non-Specific Genes and Pathways Associated with Diabetes: An Expression Meta-Analysis.

    PubMed

    Mei, Hao; Li, Lianna; Liu, Shijian; Jiang, Fan; Griswold, Michael; Mosley, Thomas

    2017-01-21

    We performed expression studies to identify tissue non-specific genes and pathways of diabetes by meta-analysis. We searched curated datasets of the Gene Expression Omnibus (GEO) database and identified 13 and five expression studies of diabetes and insulin responses at various tissues, respectively. We tested differential gene expression by empirical Bayes-based linear method and investigated gene set expression association by knowledge-based enrichment analysis. Meta-analysis by different methods was applied to identify tissue non-specific genes and gene sets. We also proposed pathway mapping analysis to infer functions of the identified gene sets, and correlation and independent analysis to evaluate expression association profile of genes and gene sets between studies and tissues. Our analysis showed that PGRMC1 and HADH genes were significant over diabetes studies, while IRS1 and MPST genes were significant over insulin response studies, and joint analysis showed that HADH and MPST genes were significant over all combined data sets. The pathway analysis identified six significant gene sets over all studies. The KEGG pathway mapping indicated that the significant gene sets are related to diabetes pathogenesis. The results also presented that 12.8% and 59.0% pairwise studies had significantly correlated expression association for genes and gene sets, respectively; moreover, 12.8% pairwise studies had independent expression association for genes, but no studies were observed significantly different for expression association of gene sets. Our analysis indicated that there are both tissue specific and non-specific genes and pathways associated with diabetes pathogenesis. Compared to the gene expression, pathway association tends to be tissue non-specific, and a common pathway influencing diabetes development is activated through different genes at different tissues.

  8. Heterogeneous data fusion for brain tumor classification.

    PubMed

    Metsis, Vangelis; Huang, Heng; Andronesi, Ovidiu C; Makedon, Fillia; Tzika, Aria

    2012-10-01

    Current research in biomedical informatics involves analysis of multiple heterogeneous data sets. This includes patient demographics, clinical and pathology data, treatment history, patient outcomes as well as gene expression, DNA sequences and other information sources such as gene ontology. Analysis of these data sets could lead to better disease diagnosis, prognosis, treatment and drug discovery. In this report, we present a novel machine learning framework for brain tumor classification based on heterogeneous data fusion of metabolic and molecular datasets, including state-of-the-art high-resolution magic angle spinning (HRMAS) proton (1H) magnetic resonance spectroscopy and gene transcriptome profiling, obtained from intact brain tumor biopsies. Our experimental results show that our novel framework outperforms any analysis using individual dataset.

  9. ImpulseDE: detection of differentially expressed genes in time series data using impulse models.

    PubMed

    Sander, Jil; Schultze, Joachim L; Yosef, Nir

    2017-03-01

    Perturbations in the environment lead to distinctive gene expression changes within a cell. Observed over time, those variations can be characterized by single impulse-like progression patterns. ImpulseDE is an R package suited to capture these patterns in high throughput time series datasets. By fitting a representative impulse model to each gene, it reports differentially expressed genes across time points from a single or between two time courses from two experiments. To optimize running time, the code uses clustering and multi-threading. By applying ImpulseDE, we demonstrate its power to represent underlying biology of gene expression in microarray and RNA-Seq data. ImpulseDE is available on Bioconductor (https://bioconductor.org/packages/ImpulseDE/). Contact: niryosef@berkeley.edu. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
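
    The abstract does not state the equation of the impulse model. As an illustration only, the sketch below assumes one commonly used parameterisation, a product of two sigmoids with an initial level h0, a peak level h1, a final level h2, two transition times t1 and t2, and a shared slope beta. The function and parameter names are hypothetical and are not the ImpulseDE API, which is an R package.

```python
import math


def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def impulse(t, h0, h1, h2, t1, t2, beta):
    """Assumed impulse shape: rise from h0 to h1 near t1, then move to h2 near t2.

    h1 must be non-zero because it is used to normalise the product.
    """
    rise = h0 + (h1 - h0) * sigmoid(beta * (t - t1))
    fall = h2 + (h1 - h2) * sigmoid(-beta * (t - t2))
    return rise * fall / h1


# Hypothetical parameters: baseline 1, peak 5, final level 2.
params = dict(h0=1.0, h1=5.0, h2=2.0, t1=5.0, t2=15.0, beta=2.0)
curve = [(t, impulse(float(t), **params)) for t in range(0, 21, 5)]
for t, value in curve:
    print(f"t={t:2d}  expression={value:.3f}")
```

    Under this assumed form, the curve sits near h0 well before t1, near h1 between the two transitions, and near h2 well after t2, which is the single impulse-like pattern the abstract refers to.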

  10. Gene selection for cancer classification with the help of bees.

    PubMed

    Moosa, Johra Muhammad; Shakur, Rameen; Kaykobad, Mohammad; Rahman, Mohammad Sohel

    2016-08-10

    Development of biologically relevant models from gene expression data, notably microarray data, has become a topic of great interest in the field of bioinformatics and clinical genetics and oncology. Only a small number of gene expression data compared to the total number of genes explored possess a significant correlation with a certain phenotype. Gene selection enables researchers to obtain substantial insight into the genetic nature of the disease and the mechanisms responsible for it. Besides improving the performance of cancer classification, it can also cut down the time and cost of medical diagnoses. This study presents a modified Artificial Bee Colony Algorithm (ABC) to select a minimum number of genes that are deemed to be significant for cancer while improving predictive accuracy. The search equation of ABC is believed to be good at exploration but poor at exploitation. To overcome this limitation we have modified the ABC algorithm by incorporating the concept of pheromones, which is one of the major components of the Ant Colony Optimization (ACO) algorithm, and a new operation in which successive bees communicate to share their findings. The proposed algorithm is evaluated using a suite of ten publicly available datasets after the parameters are tuned scientifically with one of the datasets. Obtained results are compared to other works that used the same datasets. The performance of the proposed method is shown to be superior. The method presented in this paper can provide a subset of genes leading to more accurate classification results while the number of selected genes is smaller. Additionally, the proposed modified Artificial Bee Colony Algorithm could conceivably be applied to problems in other areas as well.

  11. GO-PCA: An Unsupervised Method to Explore Gene Expression Data Using Prior Knowledge

    PubMed Central

    Wagner, Florian

    2015-01-01

    Method: Genome-wide expression profiling is a widely used approach for characterizing heterogeneous populations of cells, tissues, biopsies, or other biological specimen. The exploratory analysis of such data typically relies on generic unsupervised methods, e.g. principal component analysis (PCA) or hierarchical clustering. However, generic methods fail to exploit prior knowledge about the molecular functions of genes. Here, I introduce GO-PCA, an unsupervised method that combines PCA with nonparametric GO enrichment analysis, in order to systematically search for sets of genes that are both strongly correlated and closely functionally related. These gene sets are then used to automatically generate expression signatures with functional labels, which collectively aim to provide a readily interpretable representation of biologically relevant similarities and differences. The robustness of the results obtained can be assessed by bootstrapping. Results: I first applied GO-PCA to datasets containing diverse hematopoietic cell types from human and mouse, respectively. In both cases, GO-PCA generated a small number of signatures that represented the majority of lineages present, and whose labels reflected their respective biological characteristics. I then applied GO-PCA to human glioblastoma (GBM) data, and recovered signatures associated with four out of five previously defined GBM subtypes. My results demonstrate that GO-PCA is a powerful and versatile exploratory method that reduces an expression matrix containing thousands of genes to a much smaller set of interpretable signatures. In this way, GO-PCA aims to facilitate hypothesis generation, design of further analyses, and functional comparisons across datasets. PMID:26575370

  12. GO-PCA: An Unsupervised Method to Explore Gene Expression Data Using Prior Knowledge.

    PubMed

    Wagner, Florian

    2015-01-01

    Genome-wide expression profiling is a widely used approach for characterizing heterogeneous populations of cells, tissues, biopsies, or other biological specimen. The exploratory analysis of such data typically relies on generic unsupervised methods, e.g. principal component analysis (PCA) or hierarchical clustering. However, generic methods fail to exploit prior knowledge about the molecular functions of genes. Here, I introduce GO-PCA, an unsupervised method that combines PCA with nonparametric GO enrichment analysis, in order to systematically search for sets of genes that are both strongly correlated and closely functionally related. These gene sets are then used to automatically generate expression signatures with functional labels, which collectively aim to provide a readily interpretable representation of biologically relevant similarities and differences. The robustness of the results obtained can be assessed by bootstrapping. I first applied GO-PCA to datasets containing diverse hematopoietic cell types from human and mouse, respectively. In both cases, GO-PCA generated a small number of signatures that represented the majority of lineages present, and whose labels reflected their respective biological characteristics. I then applied GO-PCA to human glioblastoma (GBM) data, and recovered signatures associated with four out of five previously defined GBM subtypes. My results demonstrate that GO-PCA is a powerful and versatile exploratory method that reduces an expression matrix containing thousands of genes to a much smaller set of interpretable signatures. In this way, GO-PCA aims to facilitate hypothesis generation, design of further analyses, and functional comparisons across datasets.

  13. Who shares? Who doesn't? Factors associated with openly archiving raw research data.

    PubMed

    Piwowar, Heather A

    2011-01-01

    Many initiatives encourage investigators to share their raw datasets in hopes of increasing research efficiency and quality. Despite these investments of time and money, we do not have a firm grasp of who openly shares raw research data, who doesn't, and which initiatives are correlated with high rates of data sharing. In this analysis I use bibliometric methods to identify patterns in the frequency with which investigators openly archive their raw gene expression microarray datasets after study publication. Automated methods identified 11,603 articles published between 2000 and 2009 that describe the creation of gene expression microarray data. Associated datasets in best-practice repositories were found for 25% of these articles, increasing from less than 5% in 2001 to 30%-35% in 2007-2009. Accounting for sensitivity of the automated methods, approximately 45% of recent gene expression studies made their data publicly available. First-order factor analysis on 124 diverse bibliometric attributes of the data creation articles revealed 15 factors describing authorship, funding, institution, publication, and domain environments. In multivariate regression, authors were most likely to share data if they had prior experience sharing or reusing data, if their study was published in an open access journal or a journal with a relatively strong data sharing policy, or if the study was funded by a large number of NIH grants. Authors of studies on cancer and human subjects were least likely to make their datasets available. These results suggest research data sharing levels are still low and increasing only slowly, and data is least available in areas where it could make the biggest impact. Let's learn from those with high rates of sharing to embrace the full potential of our research output.

  14. RNA-seq reveals more consistent reference genes for gene expression studies in human non-melanoma skin cancers

    PubMed Central

    Tan, Jean-Marie; Payne, Elizabeth J.; Lin, Lynlee L.; Sinnya, Sudipta; Raphael, Anthony P.; Lambie, Duncan; Frazer, Ian H.; Dinger, Marcel E.; Soyer, H. Peter

    2017-01-01

    Identification of appropriate reference genes (RGs) is critical to accurate data interpretation in quantitative real-time PCR (qPCR) experiments. In this study, we have utilised next generation RNA sequencing (RNA-seq) to analyse the transcriptome of a panel of non-melanoma skin cancer lesions, identifying genes that are consistently expressed across all samples. Genes encoding ribosomal proteins were amongst the most stable in this dataset. Validation of this RNA-seq data was examined using qPCR to confirm the suitability of a set of highly stable genes for use as qPCR RGs. These genes will provide a valuable resource for the normalisation of qPCR data for the analysis of non-melanoma skin cancer. PMID:28852586
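
    One simple way to look for consistently expressed genes, in the spirit of this abstract, is to rank genes by their coefficient of variation across samples. This is an assumed criterion for illustration; the abstract does not say which stability measure the authors used. The gene labels and expression values are hypothetical.

```python
from statistics import mean, pstdev


def rank_by_stability(expression):
    """Rank genes by coefficient of variation (std / mean), most stable first."""
    cvs = {gene: pstdev(values) / mean(values) for gene, values in expression.items()}
    return sorted(cvs.items(), key=lambda item: item[1])


# Hypothetical normalised expression values across four samples.
expression = {
    "GENE_STABLE": [100.0, 102.0, 98.0, 101.0],
    "GENE_VARIABLE": [20.0, 80.0, 5.0, 150.0],
    "GENE_MODERATE": [50.0, 60.0, 45.0, 55.0],
}

stability = rank_by_stability(expression)
for gene, cv in stability:
    print(f"{gene}: CV = {cv:.3f}")
```

    Genes at the top of such a ranking would be the candidates to carry forward for qPCR validation as reference genes.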

  15. Genome-wide characterization of differential transcript usage in Arabidopsis thaliana.

    PubMed

    Vaneechoutte, Dries; Estrada, April R; Lin, Ying-Chen; Loraine, Ann E; Vandepoele, Klaas

    2017-12-01

    Alternative splicing and the usage of alternate transcription start- or stop sites allow a single gene to produce multiple transcript isoforms. Most plant genes express certain isoforms at a significantly higher level than others, but under specific conditions this expression dominance can change, resulting in a different set of dominant isoforms. These events of differential transcript usage (DTU) have been observed for thousands of Arabidopsis thaliana, Zea mays and Vitis vinifera genes, and have been linked to development and stress response. However, neither the characteristics of these genes, nor the implications of DTU on their protein coding sequences or functions, are currently well understood. Here we present a dataset of isoform dominance and DTU for all genes in the AtRTD2 reference transcriptome based on a protocol that was benchmarked on simulated data and validated through comparison with a published reverse transcriptase-polymerase chain reaction panel. We report DTU events for 8148 genes across 206 public RNA-Seq samples, and find that protein sequences are affected in 22% of the cases. The observed DTU events show high consistency across replicates, and reveal reproducible patterns in response to treatment and development. We also demonstrate that genes with different evolutionary ages, expression breadths and functions show large differences in the frequency at which they undergo DTU, and in the effect that these events have on their protein sequences. Finally, we showcase how the generated dataset can be used to explore DTU events for genes of interest or to find genes with specific DTU in samples of interest. © 2017 The Authors The Plant Journal © 2017 John Wiley & Sons Ltd.

  16. Modifier locus mapping of a transgenic F2 mouse population identifies CCDC115 as a novel aggressive prostate cancer modifier gene in humans.

    PubMed

    Winter, Jean M; Curry, Natasha L; Gildea, Derek M; Williams, Kendra A; Lee, Minnkyong; Hu, Ying; Crawford, Nigel P S

    2018-06-11

    It is well known that development of prostate cancer (PC) can be attributed to somatic mutations of the genome, acquired within proto-oncogenes or tumor-suppressor genes. What is less well understood is how germline variation contributes to disease aggressiveness in PC patients. To map germline modifiers of aggressive neuroendocrine PC, we generated a genetically diverse F2 intercross population using the transgenic TRAMP mouse model and the wild-derived WSB/EiJ (WSB) strain. The relevance of germline modifiers of aggressive PC identified in these mice was extensively correlated in human PC datasets and functionally validated in cell lines. Aggressive PC traits were quantified in a population of 30 week old (TRAMP x WSB) F2 mice (n = 307). Correlation of germline genotype with aggressive disease phenotype revealed seven modifier loci that were significantly associated with aggressive disease. RNA-seq were analyzed using cis-eQTL and trait correlation analyses to identify candidate genes within each of these loci. Analysis of 92 (TRAMP x WSB) F2 prostates revealed 25 candidate genes that harbored both a significant cis-eQTL and mRNA expression correlations with an aggressive PC trait. We further delineated these candidate genes based on their clinical relevance, by interrogating human PC GWAS and PC tumor gene expression datasets. We identified four genes (CCDC115, DNAJC10, RNF149, and STYXL1), which encompassed all of the following characteristics: 1) one or more germline variants associated with aggressive PC traits; 2) differential mRNA levels associated with aggressive PC traits; and 3) differential mRNA expression between normal and tumor tissue. Functional validation studies of these four genes using the human LNCaP prostate adenocarcinoma cell line revealed ectopic overexpression of CCDC115 can significantly impede cell growth in vitro and tumor growth in vivo. 
Furthermore, CCDC115 human prostate tumor expression was associated with better survival outcomes. We have demonstrated how modifier locus mapping in mouse models of PC, coupled with in silico analyses of human PC datasets, can reveal novel germline modifier genes of aggressive PC. We have also characterized CCDC115 as being associated with less aggressive PC in humans, placing it as a potential prognostic marker of aggressive PC.

  17. LINC00472 expression is regulated by promoter methylation and associated with disease-free survival in patients with grade 2 breast cancer

    PubMed Central

    Shen, Yi; Wang, Zhanwei; Loo, Lenora WM; Ni, Yan; Jia, Wei; Fei, Peiwen; Risch, Harvey A.; Katsaros, Dionyssios; Yu, Herbert

    2015-01-01

    Long non-coding RNAs (lncRNAs) are a class of newly recognized DNA transcripts that have diverse biological activities. Dysregulation of lncRNAs may be involved in many pathogenic processes including cancer. Recently, we found an intergenic lncRNA, LINC00472, whose expression was correlated with breast cancer progression and patient survival. Our findings were consistent across multiple clinical datasets and supported by results from in vitro experiments. To evaluate further the role of LINC00472 in breast cancer, we used various online databases to investigate possible mechanisms that might affect LINC00472 expression in breast cancer. We also analyzed associations of LINC00472 with estrogen receptor, tumor grade, and molecular subtypes in additional online datasets generated by microarray platforms different from the one we investigated previously. We found that LINC00472 expression in breast cancer was regulated more possibly by promoter methylation than by the alteration of gene copy number. Analysis of additional datasets confirmed our previous findings of high expression of LINC00472 associated with ER-positive and low-grade tumors and favorable molecular subtypes. Finally, in nine datasets, we examined the association of LINC00472 expression with disease-free survival in patients with grade 2 tumors. Meta-analysis of the datasets showed that LINC00472 expression in breast tumors predicted the recurrence of breast cancer in patients with grade 2 tumors. In summary, our analyses confirm that LINC00472 is functionally a tumor suppressor, and that assessing its expression in breast tumors may have clinical implications in breast cancer management. PMID:26564482

  18. LINC00472 expression is regulated by promoter methylation and associated with disease-free survival in patients with grade 2 breast cancer.

    PubMed

    Shen, Yi; Wang, Zhanwei; Loo, Lenora W M; Ni, Yan; Jia, Wei; Fei, Peiwen; Risch, Harvey A; Katsaros, Dionyssios; Yu, Herbert

    2015-12-01

    Long non-coding RNAs (lncRNAs) are a class of newly recognized DNA transcripts that have diverse biological activities. Dysregulation of lncRNAs may be involved in many pathogenic processes including cancer. Recently, we found an intergenic lncRNA, LINC00472, whose expression was correlated with breast cancer progression and patient survival. Our findings were consistent across multiple clinical datasets and supported by results from in vitro experiments. To evaluate further the role of LINC00472 in breast cancer, we used various online databases to investigate possible mechanisms that might affect LINC00472 expression in breast cancer. We also analyzed associations of LINC00472 with estrogen receptor, tumor grade, and molecular subtypes in additional online datasets generated by microarray platforms different from the one we investigated previously. We found that LINC00472 expression in breast cancer was regulated more possibly by promoter methylation than by the alteration of gene copy number. Analysis of additional datasets confirmed our previous findings of high expression of LINC00472 associated with ER-positive and low-grade tumors and favorable molecular subtypes. Finally, in nine datasets, we examined the association of LINC00472 expression with disease-free survival in patients with grade 2 tumors. Meta-analysis of the datasets showed that LINC00472 expression in breast tumors predicted the recurrence of breast cancer in patients with grade 2 tumors. In summary, our analyses confirm that LINC00472 is functionally a tumor suppressor, and that assessing its expression in breast tumors may have clinical implications in breast cancer management.

  19. Serum-based six-miRNA signature as a potential marker for EC diagnosis: Comparison with TCGA miRNAseq dataset and identification of miRNA-mRNA target pairs by integrated analysis of TCGA miRNAseq and RNAseq datasets.

    PubMed

    Sharma, Priyanka; Saraya, Anoop; Sharma, Rinu

    2018-01-30

    This study evaluated the diagnostic potential of a six-microRNA (miRNA) panel consisting of miR-21, miR-144, miR-107, miR-342, miR-93 and miR-152 for esophageal cancer (EC) detection. The expression of miRNAs was analyzed in EC sera samples using quantitative real-time PCR. Risk score analysis was performed and linear regression models were then fitted to generate the six-miRNA panel. In addition, we made an effort to identify significantly dysregulated miRNAs and mRNAs in EC using the Cancer Genome Atlas (TCGA) miRNAseq and RNAseq datasets, respectively. Further, we identified significantly correlated miRNA-mRNA target pairs by integrating the TCGA EC miRNAseq dataset with the RNAseq dataset. The panel of circulating miRNAs showed enhanced sensitivity (87.5%) and specificity (90.48%) in terms of discriminating EC patients from normal subjects (area under the curve [AUC] = 0.968). Pathway enrichment analysis for potential targets of the six miRNAs revealed 48 significant (P < 0.05) pathways, viz. pathways in cancer, mRNA surveillance, MAPK, Wnt, mTOR signaling, and so on. The expression data for mRNAs and miRNAs, downloaded from the TCGA database, led to the identification of 2309 differentially expressed genes and 189 miRNAs. Gene ontology and pathway enrichment analysis showed that cell-cycle processes were most significantly enriched for differentially expressed mRNAs. Integrated analysis of TCGA miRNAseq and RNAseq datasets resulted in the identification of 53 063 significantly and negatively correlated miRNA-mRNA pairs. In summary, a novel and highly sensitive signature of serum miRNAs was identified for EC detection. Moreover, this is the first report identifying miRNA-mRNA target pairs from the EC TCGA dataset, thus providing a comprehensive resource for understanding the interactions existing between miRNAs and their target mRNAs in EC. © 2018 John Wiley & Sons Australia, Ltd.

  20. Deep sequencing of the Camellia sinensis transcriptome revealed candidate genes for major metabolic pathways of tea-specific compounds

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shi, CY; Yang, H; Wei, CL

    Tea is one of the most popular non-alcoholic beverages worldwide. However, the tea plant, Camellia sinensis, is difficult to culture in vitro, to transform, and has a large genome, rendering little genomic information available. Recent advances in large-scale RNA sequencing (RNA-seq) provide a fast, cost-effective, and reliable approach to generate large expression datasets for functional genomic analysis, which is especially suitable for non-model species with un-sequenced genomes. Using high-throughput Illumina RNA-seq, the transcriptome from poly (A)+ RNA of C. sinensis was analyzed at an unprecedented depth (2.59 gigabase pairs). Approximately 34.5 million reads were obtained, trimmed, and assembled into 127,094 unigenes, with an average length of 355 bp and an N50 of 506 bp, which consisted of 788 contig clusters and 126,306 singletons. This number of unigenes was 10-fold higher than existing C. sinensis sequences deposited in GenBank (as of August 2010). Sequence similarity analyses against six public databases (Uniprot, NR and COGs at NCBI, Pfam, InterPro and KEGG) found 55,088 unigenes that could be annotated with gene descriptions, conserved protein domains, or gene ontology terms. Some of the unigenes were assigned to putative metabolic pathways. Targeted searches using these annotations identified the majority of genes associated with several primary metabolic pathways and natural product pathways that are important to tea quality, such as flavonoid, theanine and caffeine biosynthesis pathways. Novel candidate genes of these secondary pathways were discovered. Comparisons with four previously prepared cDNA libraries revealed that this transcriptome dataset has both a high degree of consistency with previous EST data and an approximately 20-fold increase in coverage. Thirteen unigenes related to theanine and flavonoid synthesis were validated. Their expression patterns in different organs of the tea plant were analyzed by RT-PCR and quantitative real time PCR (qRT-PCR). An extensive transcriptome dataset has been obtained from the deep sequencing of tea plant. The coverage of the transcriptome is comprehensive enough to discover all known genes of several major metabolic pathways. This transcriptome dataset can serve as an important public information platform for gene expression, genomics, and functional genomic studies in C. sinensis.

  1. Deep sequencing of the Camellia sinensis transcriptome revealed candidate genes for major metabolic pathways of tea-specific compounds

    PubMed Central

    2011-01-01

    Background Tea is one of the most popular non-alcoholic beverages worldwide. However, the tea plant, Camellia sinensis, is difficult to culture in vitro, to transform, and has a large genome, rendering little genomic information available. Recent advances in large-scale RNA sequencing (RNA-seq) provide a fast, cost-effective, and reliable approach to generate large expression datasets for functional genomic analysis, which is especially suitable for non-model species with un-sequenced genomes. Results Using high-throughput Illumina RNA-seq, the transcriptome from poly (A)+ RNA of C. sinensis was analyzed at an unprecedented depth (2.59 gigabase pairs). Approximately 34.5 million reads were obtained, trimmed, and assembled into 127,094 unigenes, with an average length of 355 bp and an N50 of 506 bp, which consisted of 788 contig clusters and 126,306 singletons. This number of unigenes was 10-fold higher than existing C. sinensis sequences deposited in GenBank (as of August 2010). Sequence similarity analyses against six public databases (Uniprot, NR and COGs at NCBI, Pfam, InterPro and KEGG) found 55,088 unigenes that could be annotated with gene descriptions, conserved protein domains, or gene ontology terms. Some of the unigenes were assigned to putative metabolic pathways. Targeted searches using these annotations identified the majority of genes associated with several primary metabolic pathways and natural product pathways that are important to tea quality, such as flavonoid, theanine and caffeine biosynthesis pathways. Novel candidate genes of these secondary pathways were discovered. Comparisons with four previously prepared cDNA libraries revealed that this transcriptome dataset has both a high degree of consistency with previous EST data and an approximately 20-fold increase in coverage. Thirteen unigenes related to theanine and flavonoid synthesis were validated. Their expression patterns in different organs of the tea plant were analyzed by RT-PCR and quantitative real time PCR (qRT-PCR). Conclusions An extensive transcriptome dataset has been obtained from the deep sequencing of tea plant. The coverage of the transcriptome is comprehensive enough to discover all known genes of several major metabolic pathways. This transcriptome dataset can serve as an important public information platform for gene expression, genomics, and functional genomic studies in C. sinensis. PMID:21356090
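
    The record above reports an N50 of 506 bp for the assembled unigenes. As a reminder of what that statistic measures, here is a minimal sketch of the usual N50 definition: the length L such that sequences of length L or longer contain at least half of all assembled bases. The function name and the toy lengths are illustrative assumptions and are not taken from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are visited from longest to shortest; the N50 is the length
    of the sequence at which the running total first reaches half of the
    summed length of all sequences.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assembly (not data from the record above).
toy_lengths = [100, 200, 300, 400, 500]
print(n50(toy_lengths))
```

    Because the statistic is weighted by length, a handful of long sequences can hold the N50 above the mean length, which is consistent with the record reporting an N50 (506 bp) larger than the average unigene length (355 bp).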

  2. A Predictive Model of the Oxygen and Heme Regulatory Network in Yeast

    PubMed Central

    Kundaje, Anshul; Xin, Xiantong; Lan, Changgui; Lianoglou, Steve; Zhou, Mei; Zhang, Li; Leslie, Christina

    2008-01-01

    Deciphering gene regulatory mechanisms through the analysis of high-throughput expression data is a challenging computational problem. Previous computational studies have used large expression datasets in order to resolve fine patterns of coexpression, producing clusters or modules of potentially coregulated genes. These methods typically examine promoter sequence information, such as DNA motifs or transcription factor occupancy data, in a separate step after clustering. We needed an alternative and more integrative approach to study the oxygen regulatory network in Saccharomyces cerevisiae using a small dataset of perturbation experiments. Mechanisms of oxygen sensing and regulation underlie many physiological and pathological processes, and only a handful of oxygen regulators have been identified in previous studies. We used a new machine learning algorithm called MEDUSA to uncover detailed information about the oxygen regulatory network using genome-wide expression changes in response to perturbations in the levels of oxygen, heme, Hap1, and Co2+. MEDUSA integrates mRNA expression, promoter sequence, and ChIP-chip occupancy data to learn a model that accurately predicts the differential expression of target genes in held-out data. We used a novel margin-based score to extract significant condition-specific regulators and assemble a global map of the oxygen sensing and regulatory network. This network includes both known oxygen and heme regulators, such as Hap1, Mga2, Hap4, and Upc2, as well as many new candidate regulators. MEDUSA also identified many DNA motifs that are consistent with previous experimentally identified transcription factor binding sites. Because MEDUSA's regulatory program associates regulators to target genes through their promoter sequences, we directly tested the predicted regulators for OLE1, a gene specifically induced under hypoxia, by experimental analysis of the activity of its promoter. 
In each case, deletion of the candidate regulator resulted in the predicted effect on promoter activity, confirming that several novel regulators identified by MEDUSA are indeed involved in oxygen regulation. MEDUSA can reveal important information from a small dataset and generate testable hypotheses for further experimental analysis. Supplemental data are included. PMID:19008939

  3. Reduction in expression of the benign AR transcriptome is a hallmark of localised prostate cancer progression.

    PubMed

    Stuchbery, Ryan; Macintyre, Geoff; Cmero, Marek; Harewood, Laurence M; Peters, Justin S; Costello, Anthony J; Hovens, Christopher M; Corcoran, Niall M

    2016-05-24

    Despite the importance of androgen receptor (AR) signalling to prostate cancer development, little is known about how this signalling pathway changes with increasing grade and stage of the disease. This study explored changes in the normal AR transcriptome in localised prostate cancer, and their relation to adverse pathological features and disease recurrence. The analysis used publicly accessible human prostate cancer expression arrays as well as RNA sequencing data from the prostate TCGA. Tumour-associated PSA and PSAD were calculated for a large cohort of men (n=1108) undergoing prostatectomy. We performed a meta-analysis of the expression of an androgen-regulated gene set across datasets using Oncomine. Differential expression of selected genes in the prostate TCGA database was probed using the edgeR Bioconductor package. Changes in tumour PSA density with stage and grade were assessed by Student's t-test, and its association with biochemical recurrence explored by Kaplan-Meier curves and Cox regression. Meta-analysis revealed a systematic decline in the expression of a previously identified benign prostate androgen-regulated gene set with increasing tumour grade, reaching significance in nine of 25 genes tested despite increasing AR expression. These results were confirmed in a large independent dataset from the TCGA. At the protein level, when serum PSA was corrected for tumour volume, significantly lower levels were observed with increasing tumour grade and stage, and predicted disease recurrence. Lower PSA secretion-per-tumour-volume is associated with increasing grade and stage of prostate cancer, has prognostic relevance, and reflects a systematic perturbation of androgen signalling.

  4. An interactive web application for the dissemination of human systems immunology data.

    PubMed

    Speake, Cate; Presnell, Scott; Domico, Kelly; Zeitner, Brad; Bjork, Anna; Anderson, David; Mason, Michael J; Whalen, Elizabeth; Vargas, Olivia; Popov, Dimitry; Rinchai, Darawan; Jourde-Chiche, Noemie; Chiche, Laurent; Quinn, Charlie; Chaussabel, Damien

    2015-06-19

    Systems immunology approaches have proven invaluable in translational research settings. The current rate at which large-scale datasets are generated presents unique challenges and opportunities. Mining aggregates of these datasets could accelerate the pace of discovery, but new solutions are needed to integrate the heterogeneous data types with the contextual information that is necessary for interpretation. In addition, enabling tools and technologies facilitating investigators' interaction with large-scale datasets must be developed in order to promote insight and foster knowledge discovery. State of the art application programming was employed to develop an interactive web application for browsing and visualizing large and complex datasets. A collection of human immune transcriptome datasets were loaded alongside contextual information about the samples. We provide a resource enabling interactive query and navigation of transcriptome datasets relevant to human immunology research. Detailed information about studies and samples are displayed dynamically; if desired the associated data can be downloaded. Custom interactive visualizations of the data can be shared via email or social media. This application can be used to browse context-rich systems-scale data within and across systems immunology studies. This resource is publicly available online at [Gene Expression Browser Landing Page ( https://gxb.benaroyaresearch.org/dm3/landing.gsp )]. The source code is also available openly [Gene Expression Browser Source Code ( https://github.com/BenaroyaResearch/gxbrowser )]. We have developed a data browsing and visualization application capable of navigating increasingly large and complex datasets generated in the context of immunological studies. This intuitive tool ensures that, whether taken individually or as a whole, such datasets generated at great effort and expense remain interpretable and a ready source of insight for years to come.

  5. Differential co-expression analysis reveals a novel prognostic gene module in ovarian cancer.

    PubMed

    Gov, Esra; Arga, Kazim Yalcin

    2017-07-10

    Ovarian cancer is one of the most significant diseases among the gynecological disorders that women have suffered from over the centuries. However, disease-specific and effective biomarkers are still not available, since studies have focused on individual genes associated with ovarian cancer, ignoring the interactions and associations among the gene products. Here, ovarian cancer differential co-expression networks were reconstructed via meta-analysis of gene expression data, and co-expressed gene modules were identified in epithelial cells from ovarian tumor and healthy ovarian surface epithelial samples to propose ovarian cancer associated genes and their interactions. We propose a novel, highly interconnected, differentially co-expressed, and co-regulated gene module in ovarian cancer consisting of 84 prognostic genes. Furthermore, the specificity of the module to ovarian cancer was shown through analyses of datasets in nine other cancers. These observations underscore the importance of transcriptome-based systems biomarkers research in deciphering the elusive pathophysiology of ovarian cancer, and here, we present reciprocal interplay between candidate ovarian cancer genes and their transcriptional regulatory dynamics. The corresponding gene module might provide new insights on ovarian cancer prognosis and treatment strategies that continue to place a significant burden on global health.

  6. Feature genes in metastatic breast cancer identified by MetaDE and SVM classifier methods.

    PubMed

    Tuo, Youlin; An, Ning; Zhang, Ming

    2018-03-01

    The aim of the present study was to investigate the feature genes in metastatic breast cancer samples. A total of 5 expression profiles of metastatic breast cancer samples were downloaded from the Gene Expression Omnibus database, which were then analyzed using the MetaQC and MetaDE packages in R language. The feature genes between metastasis and non‑metastasis samples were screened under the threshold of P<0.05. Based on the protein‑protein interactions (PPIs) in the Biological General Repository for Interaction Datasets, Human Protein Reference Database and Biomolecular Interaction Network Database, the PPI network of the feature genes was constructed. The feature genes identified by topological characteristics were then used for support vector machine (SVM) classifier training and verification. The accuracy of the SVM classifier was then evaluated using another independent dataset from The Cancer Genome Atlas database. Finally, function and pathway enrichment analyses for genes in the SVM classifier were performed. A total of 541 feature genes were identified between metastatic and non‑metastatic samples. The top 10 genes with the highest betweenness centrality values in the PPI network of feature genes were Nuclear RNA Export Factor 1, cyclin‑dependent kinase 2 (CDK2), myelocytomatosis proto‑oncogene protein (MYC), Cullin 5, SHC Adaptor Protein 1, Clathrin heavy chain, Nucleolin, WD repeat domain 1, proteasome 26S subunit non‑ATPase 2 and telomeric repeat binding factor 2. The cyclin‑dependent kinase inhibitor 1A (CDKN1A), E2F transcription factor 1 (E2F1), and MYC interacted with CDK2. The SVM classifier constructed by the top 30 feature genes was able to distinguish metastatic samples from non‑metastatic samples [correct rate, specificity, positive predictive value and negative predictive value >0.89; sensitivity >0.84; area under the receiver operating characteristic curve (AUROC) >0.96]. 
The verification of the SVM classifier in an independent dataset (35 metastatic samples and 143 non‑metastatic samples) revealed an accuracy of 94.38% and AUROC of 0.958. Cell cycle associated functions and pathways were the most significant terms of the 30 feature genes. A SVM classifier was constructed to assess the possibility of breast cancer metastasis, which presented high accuracy in several independent datasets. CDK2, CDKN1A, E2F1 and MYC were indicated as the potential feature genes in metastatic breast cancer.

  7. Identification of crucial genes related to postmenopausal osteoporosis using gene expression profiling.

    PubMed

    Ma, Min; Chen, Xiaofei; Lu, Liangyu; Yuan, Feng; Zeng, Wen; Luo, Shulin; Yin, Feng; Cai, Junfeng

    2016-12-01

    Postmenopausal osteoporosis is a common bone disease characterized by low bone mineral density. This study aimed to reveal key genes associated with postmenopausal osteoporosis (PMO), and provide a theoretical basis for subsequent experiments. The dataset GSE7429 was obtained from Gene Expression Omnibus. A total of 20 B cell samples (ten each from postmenopausal women with low or high bone mineral density (BMD)) were included in this dataset. Following screening of differentially expressed genes (DEGs), coexpression analysis of all genes was performed, and key genes in the coexpression network were screened using the random walk algorithm. Afterwards, functional and pathway analyses were conducted. Additionally, protein-protein interactions (PPIs) between DEGs and key genes were analyzed. A set of 308 DEGs (170 up-regulated and 138 down-regulated) between low BMD and high BMD samples was identified, and 101 key genes in the coexpression network were screened out. In the coexpression network, some genes had a higher score and degree, such as CSTA. The key genes in the coexpression network were mainly enriched in GO terms of the defense response (e.g., SERPINA1 and CST3) and immune response (e.g., IL32 and CLEC7A), while the DEGs were mainly enriched in structural constituent of cytoskeleton (e.g., CYLC2 and TUBA1B) and membrane-enclosed lumen (e.g., CCNE1 and INTS5). In the PPI network, CCNE1 interacted with REL, and TUBA1B interacted with ESR1. A series of interactions, such as CSTA/TYROBP, CCNE1/REL and TUBA1B/ESR1, might play pivotal roles in the occurrence and development of PMO.

  8. TIPMaP: a web server to establish transcript isoform profiles from reliable microarray probes.

    PubMed

    Chitturi, Neelima; Balagannavar, Govindkumar; Chandrashekar, Darshan S; Abinaya, Sadashivam; Srini, Vasan S; Acharya, Kshitish K

    2013-12-27

    Standard 3' Affymetrix gene expression arrays have contributed a significantly higher volume of existing gene expression data than other microarray platforms. These arrays were designed to identify differentially expressed genes, but not their alternatively spliced transcript forms. No resource can currently identify expression pattern of specific mRNA forms using these microarray data, even though it is possible to do this. We report a web server for expression profiling of alternatively spliced transcripts using microarray data sets from 31 standard 3' Affymetrix arrays for human, mouse and rat species. The tool has been experimentally validated for mRNAs transcribed or not-detected in a human disease condition (non-obstructive azoospermia, a male infertility condition). About 4000 gene expression datasets were downloaded from a public repository. 'Good probes' with complete coverage and identity to latest reference transcript sequences were first identified. Using them, 'Transcript specific probe-clusters' were derived for each platform and used to identify expression status of possible transcripts. The web server can lead the user to datasets corresponding to specific tissues, conditions via identifiers of the microarray studies or hybridizations, keywords, official gene symbols or reference transcript identifiers. It can identify, in the tissues and conditions of interest, about 40% of known transcripts as 'transcribed', 'not-detected' or 'differentially regulated'. Corresponding additional information for probes, genes, transcripts and proteins can be viewed too. We identified the expression of transcripts in a specific clinical condition and validated a few of these transcripts by experiments (using reverse transcription followed by polymerase chain reaction). The experimental observations indicated higher agreements with the web server results, than contradictions. The tool is accessible at http://resource.ibab.ac.in/TIPMaP. 
The newly developed online tool forms a reliable means for identification of alternatively spliced transcript-isoforms that may be differentially expressed in various tissues, cell types or physiological conditions. Thus, by making better use of existing data, TIPMaP avoids the dependence on precious tissue-samples, in experiments with a goal to establish expression profiles of alternative splice forms--at least in some cases.

  9. Effect of the absolute statistic on gene-sampling gene-set analysis methods.

    PubMed

    Nam, Dougu

    2017-06-01

    Gene-set enrichment analysis and its modified versions have commonly been used for identifying altered functions or pathways in disease from microarray data. In particular, the simple gene-sampling gene-set analysis methods have been heavily used for datasets with only a few sample replicates. The biggest problem with this approach is the highly inflated false-positive rate. In this paper, the effect of absolute gene statistic on gene-sampling gene-set analysis methods is systematically investigated. Thus far, the absolute gene statistic has merely been regarded as a supplementary method for capturing the bidirectional changes in each gene set. Here, it is shown that incorporating the absolute gene statistic in gene-sampling gene-set analysis substantially reduces the false-positive rate and improves the overall discriminatory ability. Its effect was investigated by power, false-positive rate, and receiver operating curve for a number of simulated and real datasets. The performances of gene-set analysis methods in one-tailed (genome-wide association study) and two-tailed (gene expression data) tests were also compared and discussed.
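
    The mechanics of a gene-sampling test with a signed versus an absolute gene statistic can be sketched as follows. This is an illustrative toy under stated assumptions, not the method or data of the record above: the gene names, the gene-level statistics, the two set scores and the function `gene_sampling_pvalue` are all hypothetical. The null distribution is built by drawing random gene sets of the same size from the full gene list.

```python
import random
from statistics import mean


def gene_sampling_pvalue(stats, gene_set, score, n_samples=1000, seed=0):
    """Empirical p-value of a gene-set score under a gene-sampling null.

    stats    : dict mapping gene name -> gene-level statistic (e.g. a t-value)
    gene_set : list of gene names in the set of interest
    score    : function turning a list of gene statistics into a set score
    """
    rng = random.Random(seed)
    genes = list(stats)
    observed = score([stats[g] for g in gene_set])
    hits = 0
    for _ in range(n_samples):
        # Null set: same size, genes drawn at random without replacement.
        sampled = rng.sample(genes, len(gene_set))
        if score([stats[g] for g in sampled]) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_samples + 1)


def signed_score(values):
    """Two-sided score on the signed statistics: |mean(t)|."""
    return abs(mean(values))


def absolute_score(values):
    """Score on the absolute statistics: mean(|t|)."""
    return mean(abs(v) for v in values)


# Hypothetical data: 200 genes; the first 10 form a set whose members change
# strongly but in opposite directions, so their signed statistics cancel.
stats = {
    f"g{i}": (3.0 if i % 2 == 0 else -3.0) if i < 10 else ((i % 7) - 3) * 0.1
    for i in range(200)
}
gene_set = [f"g{i}" for i in range(10)]

p_signed = gene_sampling_pvalue(stats, gene_set, signed_score)
p_abs = gene_sampling_pvalue(stats, gene_set, absolute_score)
print(p_signed, p_abs)
```

    In this toy the signed score misses the bidirectional set entirely while the absolute score flags it. The sketch illustrates only that supplementary role of the absolute statistic; it does not reproduce the false-positive-rate comparison reported in the record.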

  10. Validation of reference genes for normalization of qPCR gene expression data from Coffea spp. hypocotyls inoculated with Colletotrichum kahawae

    PubMed Central

    2013-01-01

    Background Coffee production in Africa represents a significant share of the total export revenues and influences the lives of millions of people, yet severe socio-economic repercussions are felt annually as a result of the overall losses caused by the coffee berry disease (CBD). This quarantine disease is caused by the fungus Colletotrichum kahawae Waller and Bridge, which remains one of the most devastating threats to Coffea arabica production in Africa at high altitude, and its dispersal to Latin America and Asia represents a serious concern. Understanding the molecular genetic basis of coffee resistance to this disease is of high priority to support breeding strategies. Selection and validation of suitable reference genes presenting stable expression in the system studied is the first step in gene expression profiling studies. Results In this study, a set of ten genes (S24, 14-3-3, RPL7, GAPDH, UBQ9, VATP16, SAND, UQCC, IDE and β-Tub9) was evaluated to identify reference genes during the first hours of interaction (12, 48 and 72 hpi) between resistant and susceptible coffee genotypes and C. kahawae. Three analyses were done for the selection of these genes, considering the entire dataset and the two genotypes (resistant and susceptible) separately. The three statistical methods applied (GeNorm, NormFinder, and BestKeeper) identified IDE as one of the most stable genes for all datasets analysed and, in contrast, GAPDH and UBQ9 as the least stable ones. In addition, the expression of two defense-related transcripts, encoding a receptor-like kinase and a pathogenesis-related protein 10, was used to validate the reference genes selected. Conclusion Taken together, our results provide guidelines for reference gene(s) selection towards a more accurate and widespread use of qPCR to study the interaction between Coffea spp. and C. kahawae. PMID:24073624
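
    One of the three methods named above, geNorm, ranks candidate reference genes by a stability value M. As commonly described, M for a gene is the mean, over all other candidates, of the standard deviation of the log2 expression ratio across samples; lower M means more stable. The sketch below computes only that single-pass M value. It omits geNorm's stepwise exclusion of the least stable gene, and the gene names and relative quantities are hypothetical, not values from the study.

```python
from math import log2
from statistics import mean, stdev


def genorm_m(expression):
    """Single-pass geNorm-style stability value M for each candidate gene.

    expression : dict mapping gene name -> list of relative quantities
                 (linear scale, one value per sample, same sample order).
    """
    genes = list(expression)
    m_values = {}
    for j in genes:
        pairwise_variation = []
        for k in genes:
            if k == j:
                continue
            # Log2 ratio of gene j to gene k in every sample.
            ratios = [log2(a / b) for a, b in zip(expression[j], expression[k])]
            # Pairwise variation: spread of that ratio across samples.
            pairwise_variation.append(stdev(ratios))
        m_values[j] = mean(pairwise_variation)
    return m_values


# Hypothetical relative quantities for three candidate genes in four samples.
toy_expression = {
    "refA": [1.0, 2.0, 4.0, 8.0],
    "refB": [2.0, 4.0, 8.0, 16.0],   # constant ratio to refA
    "noisy": [1.0, 8.0, 2.0, 4.0],   # ratio to the others varies by sample
}
m_values = genorm_m(toy_expression)
for gene, m in sorted(m_values.items(), key=lambda item: item[1]):
    print(gene, round(m, 3))
```

    In this toy, refA and refB keep a constant ratio to each other and therefore share the lowest M, while the gene whose ratio to the others fluctuates gets the highest M and would be ranked least stable.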

  11. Molecular differential diagnosis of follicular thyroid carcinoma and adenoma based on gene expression profiling by using formalin-fixed paraffin-embedded tissues

    PubMed Central

    2013-01-01

    Background Differential diagnosis between malignant follicular thyroid cancer (FTC) and benign follicular thyroid adenoma (FTA) is a great challenge for even an experienced pathologist and requires special effort. Molecular markers may potentially support a differential diagnosis between FTC and FTA in postoperative specimens. The purpose of this study was to derive molecular support for differential post-operative diagnosis, in the form of a simple multigene mRNA-based classifier that would differentiate between FTC and FTA tissue samples. Methods A molecular classifier was created based on a combined analysis of two microarray datasets (using 66 thyroid samples). The performance of the classifier was assessed using an independent dataset comprising 71 formalin-fixed paraffin-embedded (FFPE) samples (31 FTC and 40 FTA), which were analysed by quantitative real-time PCR (qPCR). In addition, three other microarray datasets (62 samples) were used to confirm the utility of the classifier. Results Five of 8 genes selected from training datasets (ELMO1, EMCN, ITIH5, KCNAB1, SLCO2A1) were amplified by qPCR in FFPE material from an independent sample set. Three other genes did not amplify in FFPE material, probably due to low abundance. All 5 analysed genes were downregulated in FTC compared to FTA. The sensitivity and specificity of the 5-gene classifier tested on the FFPE dataset were 71% and 72%, respectively. Conclusions The proposed approach could support histopathological examination: the 5-gene classifier may aid in molecular discrimination between FTC and FTA in FFPE material. PMID:24099521

  12. Characterization of differential gene expression in adrenocortical tumors harboring beta-catenin (CTNNB1) mutations.

    PubMed

    Durand, Julien; Lampron, Antoine; Mazzuco, Tania L; Chapman, Audrey; Bourdeau, Isabelle

    2011-07-01

    Mutations of β-catenin gene (CTNNB1) are frequent in adrenocortical adenomas (AA) and adrenocortical carcinomas (ACC). However, the target genes of β-catenin have not yet been identified in adrenocortical tumors. Our objective was to identify genes deregulated in adrenocortical tumors harboring CTNNB1 genetic alterations and nuclear accumulation of β-catenin. Microarray analysis identified a dataset of genes that were differently expressed between AA with CTNNB1 mutations and wild-type (WT) tumors. Within this dataset, the expression profiles of five genes were validated by real time-PCR (RT-PCR) in a cohort of 34 adrenocortical tissues (six AA and one ACC with CTNNB1 mutations, 13 AA and four ACC with WT CTNNB1, and 10 normal adrenal glands) and two human ACC cell lines. We then studied the effects of suppressing β-catenin transcriptional activity with the T-cell factor/β-catenin inhibitors PKF115-584 and PNU74654 on gene expression in H295R and SW13 cells. RT-PCR analysis confirmed the overexpression of ISM1, RALBP1, and PDE2A and the down-regulation of PHYHIP in five of six AA harboring CTNNB1 mutations compared with WT AA (n = 13) and normal adrenal glands (n = 10). RALBP1 and PDE2A overexpression was also confirmed at the protein level by Western blotting analysis in mutated tumors. ENC1 was specifically overexpressed in three of three AA harboring CTNNB1 point mutations. mRNA expression and protein levels of RALBP1, PDE2A, and ENC1 were decreased in a dose-dependent manner in H295R cells after treatment with PKF115-584 or PNU74654. This study identified candidate genes deregulated in CTNNB1-mutated adrenocortical tumors that may lead to a better understanding of the role of the Wnt-β-catenin pathway in adrenocortical tumorigenesis.

  13. Gene expression profiles of breast biopsies from healthy women identify a group with claudin-low features

    PubMed Central

    2011-01-01

    Background: Increased understanding of the variability in normal breast biology will enable us to identify mechanisms of breast cancer initiation and the origin of different subtypes, and to better predict breast cancer risk. Methods: Gene expression patterns in breast biopsies from 79 healthy women referred to breast diagnostic centers in Norway were explored by unsupervised hierarchical clustering and supervised analyses, such as gene set enrichment analysis and gene ontology analysis, and by comparison with previously published gene lists and independent datasets. Results: Unsupervised hierarchical clustering identified two separate clusters of normal breast tissue based on gene-expression profiling, regardless of clustering algorithm and gene filtering used. Comparison of the expression profile of the two clusters with several published gene lists describing breast cells revealed that the samples in cluster 1 share characteristics with stromal cells and stem cells, and to a certain degree with mesenchymal cells and myoepithelial cells. The samples in cluster 1 also share many features with the newly identified claudin-low breast cancer intrinsic subtype, which also shows characteristics of stromal and stem cells. More women belonging to cluster 1 have a family history of breast cancer and there is a slight overrepresentation of nulliparous women in cluster 1. Similar findings were seen in a separate dataset consisting of histologically normal tissue from both breasts harboring breast cancer and from mammoplasty reductions. Conclusion: This is the first study to explore the variability of gene expression patterns in whole biopsies from normal breasts, and it identified distinct subtypes of normal breast tissue. Further studies are needed to determine the specific cell contribution to the variation in the biology of normal breasts, how the clusters identified relate to breast cancer risk and their possible link to the origin of the different molecular subtypes of breast cancer. PMID:22044755

  14. Genome-wide prediction and analysis of human tissue-selective genes using microarray expression data

    PubMed Central

    2013-01-01

    Background: Understanding how genes are expressed specifically in particular tissues is a fundamental question in developmental biology. Many tissue-specific genes are involved in the pathogenesis of complex human diseases. However, experimental identification of tissue-specific genes is time consuming and difficult. Accurate prediction of tissue-specific gene targets could provide useful information for biomarker development and drug target identification. Results: In this study, we have developed a machine learning approach for predicting human tissue-specific genes using microarray expression data. The lists of known tissue-specific genes for different tissues were collected from the UniProt database, and the expression data retrieved from the previously compiled dataset according to the lists were used for input vector encoding. Random Forests (RFs) and Support Vector Machines (SVMs) were used to construct accurate classifiers. The RF classifiers were found to outperform SVM models for tissue-specific gene prediction. The results suggest that the candidate genes for brain- or liver-specific expression can provide valuable information for further experimental studies. Our approach was also applied to identify tissue-selective gene targets for different types of tissues. Conclusions: A machine learning approach has been developed for accurately identifying candidate genes for tissue-specific/selective expression. The approach provides an efficient way to select genes of interest for developing new biomedical markers and to improve our knowledge of tissue-specific expression. PMID:23369200

  15. sscMap: an extensible Java application for connecting small-molecule drugs using gene-expression signatures.

    PubMed

    Zhang, Shu-Dong; Gant, Timothy W

    2009-07-31

    Connectivity mapping is a process to recognize novel pharmacological and toxicological properties in small molecules by comparing their gene expression signatures with others in a database. A simple and robust method for connectivity mapping with increased specificity and sensitivity was recently developed, and its utility demonstrated using experimentally derived gene signatures. This paper introduces sscMap (statistically significant connections' map), a Java application designed to undertake connectivity mapping tasks using the recently published method. The software is bundled with a default collection of reference gene-expression profiles based on the publicly available dataset from the Broad Institute Connectivity Map 02, which includes data from over 7000 Affymetrix microarrays, for over 1000 small-molecule compounds, and 6100 treatment instances in 5 human cell lines. In addition, the application allows users to add their custom collections of reference profiles and is applicable to a wide range of other 'omics technologies. The utility of sscMap is twofold. First, it serves to make statistically significant connections between a user-supplied gene signature and the 6100 core reference profiles based on the Broad Institute expanded dataset. Second, it allows users to apply the same improved method to custom-built reference profiles, which can be added to the database for future referencing. The software can be freely downloaded from http://purl.oclc.org/NET/sscMap.

  16. A nonparametric mean-variance smoothing method to assess Arabidopsis cold stress transcriptional regulator CBF2 overexpression microarray data.

    PubMed

    Hu, Pingsha; Maiti, Tapabrata

    2011-01-01

    Microarray is a powerful tool for genome-wide gene expression analysis. In microarray expression data, often mean and variance have certain relationships. We present a non-parametric mean-variance smoothing method (NPMVS) to analyze differentially expressed genes. In this method, a nonlinear smoothing curve is fitted to estimate the relationship between mean and variance. Inference is then made upon shrinkage estimation of posterior means assuming variances are known. Different methods have been applied to simulated datasets, in which a variety of mean and variance relationships were imposed. The simulation study showed that NPMVS outperformed the other two popular shrinkage estimation methods in some mean-variance relationships; and NPMVS was competitive with the two methods in other relationships. A real biological dataset, in which a cold stress transcription factor gene, CBF2, was overexpressed, has also been analyzed with the three methods. Gene ontology and cis-element analysis showed that NPMVS identified more cold and stress responsive genes than the other two methods did. The good performance of NPMVS is mainly due to its shrinkage estimation for both means and variances. In addition, NPMVS exploits a non-parametric regression between mean and variance, instead of assuming a specific parametric relationship between mean and variance. The source code written in R is available from the authors on request.

  17. A Nonparametric Mean-Variance Smoothing Method to Assess Arabidopsis Cold Stress Transcriptional Regulator CBF2 Overexpression Microarray Data

    PubMed Central

    Hu, Pingsha; Maiti, Tapabrata

    2011-01-01

    Microarray is a powerful tool for genome-wide gene expression analysis. In microarray expression data, often mean and variance have certain relationships. We present a non-parametric mean-variance smoothing method (NPMVS) to analyze differentially expressed genes. In this method, a nonlinear smoothing curve is fitted to estimate the relationship between mean and variance. Inference is then made upon shrinkage estimation of posterior means assuming variances are known. Different methods have been applied to simulated datasets, in which a variety of mean and variance relationships were imposed. The simulation study showed that NPMVS outperformed the other two popular shrinkage estimation methods in some mean-variance relationships; and NPMVS was competitive with the two methods in other relationships. A real biological dataset, in which a cold stress transcription factor gene, CBF2, was overexpressed, has also been analyzed with the three methods. Gene ontology and cis-element analysis showed that NPMVS identified more cold and stress responsive genes than the other two methods did. The good performance of NPMVS is mainly due to its shrinkage estimation for both means and variances. In addition, NPMVS exploits a non-parametric regression between mean and variance, instead of assuming a specific parametric relationship between mean and variance. The source code written in R is available from the authors on request. PMID:21611181

  18. ANISEED 2017: extending the integrated ascidian database to the exploration and evolutionary comparison of genome-scale datasets.

    PubMed

    Brozovic, Matija; Dantec, Christelle; Dardaillon, Justine; Dauga, Delphine; Faure, Emmanuel; Gineste, Mathieu; Louis, Alexandra; Naville, Magali; Nitta, Kazuhiro R; Piette, Jacques; Reeves, Wendy; Scornavacca, Céline; Simion, Paul; Vincentelli, Renaud; Bellec, Maelle; Aicha, Sameh Ben; Fagotto, Marie; Guéroult-Bellone, Marion; Haeussler, Maximilian; Jacox, Edwin; Lowe, Elijah K; Mendez, Mickael; Roberge, Alexis; Stolfi, Alberto; Yokomori, Rui; Brown, C Titus; Cambillau, Christian; Christiaen, Lionel; Delsuc, Frédéric; Douzery, Emmanuel; Dumollard, Rémi; Kusakabe, Takehiro; Nakai, Kenta; Nishida, Hiroki; Satou, Yutaka; Swalla, Billie; Veeman, Michael; Volff, Jean-Nicolas; Lemaire, Patrick

    2018-01-04

    ANISEED (www.aniseed.cnrs.fr) is the main model organism database for tunicates, the sister-group of vertebrates. This release gives access to annotated genomes, gene expression patterns, and anatomical descriptions for nine ascidian species. It provides increased integration with external molecular and taxonomy databases, better support for epigenomics datasets, in particular RNA-seq, ChIP-seq and SELEX-seq, and features novel interactive interfaces for existing and novel datatypes. In particular, the cross-species navigation and comparison is enhanced through a novel taxonomy section describing each represented species and through the implementation of interactive phylogenetic gene trees for 60% of tunicate genes. The gene expression section displays the results of RNA-seq experiments for the three major model species of solitary ascidians. Gene expression is controlled by the binding of transcription factors to cis-regulatory sequences. A high-resolution description of the DNA-binding specificity for 131 Ciona robusta (formerly C. intestinalis type A) transcription factors by SELEX-seq is provided and used to map candidate binding sites across the Ciona robusta and Phallusia mammillata genomes. Finally, use of a WashU Epigenome browser enhances genome navigation, while a Genomicus server was set up to explore microsynteny relationships within tunicates and with vertebrates, Amphioxus, echinoderms and hemichordates. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  19. GiniClust: detecting rare cell types from single-cell gene expression data with Gini index.

    PubMed

    Jiang, Lan; Chen, Huidong; Pinello, Luca; Yuan, Guo-Cheng

    2016-07-01

    High-throughput single-cell technologies have great potential to discover new cell types; however, it remains challenging to detect rare cell types that are distinct from a large population. We present a novel computational method, called GiniClust, to overcome this challenge. Validation against a benchmark dataset indicates that GiniClust achieves high sensitivity and specificity. Application of GiniClust to public single-cell RNA-seq datasets uncovers previously unrecognized rare cell types, including Zscan4-expressing cells within mouse embryonic stem cells and hemoglobin-expressing cells in the mouse cortex and hippocampus. GiniClust also correctly detects a small number of normal cells that are mixed in a cancer cell population.
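
    The abstract above names the Gini index as the statistic behind GiniClust but does not spell it out. The following is a minimal sketch of the plain Gini coefficient applied to one gene's expression across cells; the function name and the two toy expression vectors are hypothetical, and the published tool may normalize or post-process the index in ways not shown here.

```python
def gini_index(values):
    """Gini coefficient of non-negative values.

    0 means the total is spread evenly; values near 1 mean the total
    is concentrated in very few entries.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Closed form on sorted data: G = sum_i (2i - n - 1) * x_i / (n * sum(x))
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


# Hypothetical expression values of two genes across ten cells.
housekeeping = [5, 6, 5, 7, 6, 5, 6, 7, 5, 6]   # expressed everywhere
rare_marker = [0, 0, 0, 0, 0, 0, 0, 0, 0, 40]   # expressed in one cell only

print(gini_index(housekeeping))
print(gini_index(rare_marker))
```

    A gene expressed in only a few cells scores close to 1, while a uniformly expressed gene scores close to 0, which is why a high Gini index is a natural flag for candidate rare-cell-type markers.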

  20. Identification of Disease Critical Genes Using Collective Meta-heuristic Approaches: An Application to Preeclampsia.

    PubMed

    Biswas, Surama; Dutta, Subarna; Acharyya, Sriyankar

    2017-12-01

    Identifying a small subset of disease-critical genes in a large microarray gene expression dataset is a challenge in computational life sciences. This paper applies four meta-heuristic algorithms, namely honey bee mating optimization (HBMO), harmony search (HS), differential evolution (DE) and a basic genetic algorithm (GA), to find disease-critical genes of preeclampsia, which affects women during gestation. Two hybrid algorithms, HBMO-kNN and HS-kNN, are newly proposed here, in which kNN (the k-nearest-neighbor classifier) is used for sample classification. The performance of these new approaches is compared with that of two other hybrid algorithms, DE-kNN and SGA-kNN. Three datasets of different sizes are used. Within a dataset, the genes common to the output of every algorithm are considered disease-critical genes. Across datasets, the classification accuracy of the meta-heuristic algorithms varied between 92.46 and 100%. HBMO-kNN has the best performance (99.64-100%) in almost all datasets. DE-kNN takes second position (99.42-100%). The disease-critical genes obtained here largely match clinically identified preeclampsia genes.
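
    The abstract above pairs each search algorithm with a kNN classifier. One plausible way to score a candidate gene subset, sketched below, is the leave-one-out accuracy of kNN restricted to those genes; this scoring choice, the function names and the toy data are assumptions for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training samples closest to query."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loo_accuracy(samples, labels, gene_subset, k=3):
    """Leave-one-out kNN accuracy using only the genes in gene_subset."""
    reduced = [[row[g] for g in gene_subset] for row in samples]
    correct = 0
    for i, query in enumerate(reduced):
        train_x = reduced[:i] + reduced[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if knn_predict(train_x, train_y, query, k) == labels[i]:
            correct += 1
    return correct / len(reduced)


# Hypothetical data: 6 samples x 3 genes; only gene 0 separates the classes.
samples = [
    [1.0, 5.0, 2.0],
    [1.2, 1.0, 9.0],
    [0.9, 8.0, 4.0],
    [5.0, 5.1, 2.1],
    [5.3, 1.1, 9.2],
    [4.8, 8.2, 4.1],
]
labels = ["control", "control", "control", "case", "case", "case"]

print(loo_accuracy(samples, labels, [0], k=1))     # informative gene
print(loo_accuracy(samples, labels, [1, 2], k=1))  # uninformative genes
```

    A meta-heuristic such as HBMO or HS would then search over gene subsets and keep those with the highest score.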

  1. Clinical value of miR-198-5p in lung squamous cell carcinoma assessed using microarray and RT-qPCR.

    PubMed

    Liang, Yue-Ya; Huang, Jia-Cheng; Tang, Rui-Xue; Chen, Wen-Jie; Chen, Peng; Cen, Wei-Luan; Shi, Ke; Gao, Li; Gao, Xiang; Liu, An-Gui; Peng, Xiao-Tong; Chen, Gang; Huang, Su-Ning; Fang, Ye-Ying; Gu, Yong-Yao

    2018-02-02

    The aim of this study was to examine the clinical value of miR-198-5p in lung squamous cell carcinoma (LUSC). Gene Expression Omnibus (GEO) microarray datasets were used to explore miR-198-5p expression and its diagnostic value in LUSC. Real-time reverse transcription quantitative polymerase chain reaction (RT-qPCR) was used to evaluate the expression of miR-198-5p in 23 formalin-fixed, paraffin-embedded (FFPE) LUSC tissues and corresponding non-cancerous tissues. The correlation between miR-198-5p expression and clinicopathological features was assessed. Meanwhile, putative target messenger RNAs of miR-198-5p were identified based on the analysis of differentially expressed genes in the Cancer Genome Atlas (TCGA) and 12 miRNA prediction tools. Subsequently, the putative target genes were subjected to Gene Ontology and Kyoto Encyclopedia of Genes and Genomes pathway analyses. MiR-198-5p was expressed at low levels in LUSC tissues. The combined standardized mean difference (SMD) values of miR-198-5p expression based on GEO datasets were -0.30 (95% confidence interval (CI) -0.54, -0.06) and -0.39 (95% CI -0.83, 0.05) using the fixed effect model and the random effect model, respectively. The sensitivity and specificity were not sufficiently high, as the area under the curve (AUC) was 0.7749 (Q* = 0.7143) based on summary receiver operating characteristic (SROC) curves constructed using GEO datasets. Based on the in-house RT-qPCR, miR-198-5p expression was 4.3826 ± 1.7660 in LUSC tissues and 4.4522 ± 1.8263 in adjacent normal tissues (P = 0.885). The expression of miR-198-5p was significantly higher in patients with early TNM stages (I-II) than in those with advanced TNM stages (III-IV) (5.4400 ± 1.5277 vs 3.5690 ± 1.5228, P = 0.008). Continuous variable-based meta-analysis of GEO and PCR data gave SMD values of -0.26 (95% CI -0.48, -0.04) and -0.34 (95% CI -0.71, 0.04) based on fixed and random effect models, respectively. As for the diagnostic value of miR-198-5p, the AUC based on the SROC curve using GEO and PCR data was 0.7351 (Q* = 0.6812). In total, 542 genes were identified as the targets of miR-198-5p. The most enriched Gene Ontology terms were epidermis development among biological processes, cell junction among cellular components, and protein dimerization activity among molecular functions. The pathway of non-small cell lung cancer was the most significant pathway identified using Kyoto Encyclopedia of Genes and Genomes analysis. The expression of miR-198-5p is related to the TNM stage. Thus, miR-198-5p might play an important role via its target genes in LUSC.
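
    The abstract above reports pooled SMD values under fixed and random effect models, with a wider interval for the latter. A minimal sketch of how such pooling is commonly done follows, using inverse-variance weights and the DerSimonian-Laird estimate of between-study variance; the abstract does not state which estimator was used, and the per-study effects and variances below are hypothetical, not the study's data.

```python
import math


def pool(effects, variances, tau2=0.0):
    """Inverse-variance pooled estimate with a 95% confidence interval.

    tau2 = 0 gives the fixed effect model; tau2 > 0 adds between-study
    variance to every study, which is the random effect model.
    """
    weights = [1.0 / (v + tau2) for v in variances]
    total = sum(weights)
    estimate = sum(w * d for w, d in zip(weights, effects)) / total
    half_width = 1.96 * math.sqrt(1.0 / total)
    return estimate, (estimate - half_width, estimate + half_width)


def dersimonian_laird_tau2(effects, variances):
    """Method-of-moments estimate of between-study variance (truncated at 0)."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    fixed = sum(w * d for w, d in zip(weights, effects)) / total
    q = sum(w * (d - fixed) ** 2 for w, d in zip(weights, effects))
    c = total - sum(w * w for w in weights) / total
    return max(0.0, (q - (len(effects) - 1)) / c)


# Hypothetical per-study SMDs and their variances.
effects = [-0.6, -0.1, -0.4]
variances = [0.04, 0.02, 0.05]

tau2 = dersimonian_laird_tau2(effects, variances)
print(pool(effects, variances))        # fixed effect
print(pool(effects, variances, tau2))  # random effect
```

    Because the random effect weights include the extra variance term, the resulting interval is never narrower than the fixed effect one, which is consistent with the pattern of intervals reported above.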

  2. Research resource: Tissue-specific transcriptomics and cistromics of nuclear receptor signaling: a web research resource.

    PubMed

    Ochsner, Scott A; Watkins, Christopher M; LaGrone, Benjamin S; Steffen, David L; McKenna, Neil J

    2010-10-01

    Nuclear receptors (NRs) are ligand-regulated transcription factors that recruit coregulators and other transcription factors to gene promoters to effect regulation of tissue-specific transcriptomes. The prodigious rate at which the NR signaling field has generated high-content gene expression and, more recently, genome-wide location analysis datasets has not been matched by a committed effort to archive this information for routine access by bench and clinical scientists. As a first step towards this goal, we searched the MEDLINE database for studies that referenced expression microarray and/or genome-wide location analysis datasets in which an NR or NR ligand was an experimental variable. A total of 1122 studies encompassing 325 unique organs, tissues, primary cells, and cell lines, 35 NRs, and 91 NR ligands were retrieved and annotated. The data were incorporated into a new section of the Nuclear Receptor Signaling Atlas Molecule Pages, Transcriptomics and Cistromics, for which we designed an intuitive, freely accessible user interface to browse the studies. Each study links to an abstract, the MEDLINE record, and, where available, Gene Expression Omnibus and ArrayExpress records. The resource will be updated on a regular basis to provide a current and comprehensive entrée into the sum of transcriptomic and cistromic research in this field.

  3. MPIGeneNet: Parallel Calculation of Gene Co-Expression Networks on Multicore Clusters.

    PubMed

    Gonzalez-Dominguez, Jorge; Martin, Maria J

    2017-10-10

    In this work we present MPIGeneNet, a parallel tool that applies Pearson's correlation and Random Matrix Theory to construct gene co-expression networks. It is based on the state-of-the-art sequential tool RMTGeneNet, which provides networks with high robustness and sensitivity at the expense of relatively long runtimes for large-scale input datasets. MPIGeneNet returns the same results as RMTGeneNet but improves the memory management, reduces the I/O cost, and accelerates the two most computationally demanding steps of co-expression network construction by exploiting the compute capabilities of common multicore CPU clusters. Our performance evaluation on two different systems using three typical input datasets shows that MPIGeneNet is significantly faster than RMTGeneNet. As an example, our tool is up to 175.41 times faster on a cluster with eight nodes, each one containing two 12-core Intel Haswell processors. The source code of MPIGeneNet and a reference manual are available at https://sourceforge.net/projects/mpigenenet/.
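
    To make the first stage of the construction above concrete, here is a small sequential sketch that computes pairwise Pearson correlations and keeps the gene pairs whose absolute correlation reaches a cutoff. The fixed cutoff is a placeholder assumption: the tools described above select the threshold with Random Matrix Theory, which is not shown, and the gene names and values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant profile has no defined correlation
    return cov / (norm_x * norm_y)


def coexpression_edges(expression, threshold):
    """Return (gene_a, gene_b, r) for every pair with |r| >= threshold."""
    genes = sorted(expression)
    edges = []
    for i, gene_a in enumerate(genes):
        for gene_b in genes[i + 1:]:
            r = pearson(expression[gene_a], expression[gene_b])
            if abs(r) >= threshold:
                edges.append((gene_a, gene_b, r))
    return edges


# Hypothetical expression profiles across four samples.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 1.0, 3.0, 2.0],
}

for edge in coexpression_edges(expression, threshold=0.8):
    print(edge)
```

    The nested loop visits every gene pair, so the cost grows quadratically with the number of genes; each pair is independent of the others, which is what makes this step a natural candidate for distribution across cores and nodes.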

  4. Iterative local Gaussian clustering for expressed genes identification linked to malignancy of human colorectal carcinoma

    PubMed Central

    Wasito, Ito; Hashim, Siti Zaiton M; Sukmaningrum, Sri

    2007-01-01

    Gene expression profiling plays an important role in the identification of biological and clinical properties of human solid tumors such as colorectal carcinoma. Profiling is required to reveal underlying molecular features for diagnostic and therapeutic purposes. A non-parametric, density-estimation-based approach called iterative local Gaussian clustering (ILGC) was used to identify clusters of expressed genes. We used experimental data from a previous study by Muro and others consisting of 1,536 genes in 100 colorectal cancer and 11 normal tissues. In this dataset, ILGC finds three gene clusters, two large and one small, similar to their results, which used Gaussian mixture clustering. The correlation between each gene cluster and the clinical properties of malignancy of human colorectal cancer was analysed for the existence of tumor or normal tissue, the existence of distant metastasis and the existence of lymph node metastasis. PMID:18305825

  5. Iterative local Gaussian clustering for expressed genes identification linked to malignancy of human colorectal carcinoma.

    PubMed

    Wasito, Ito; Hashim, Siti Zaiton M; Sukmaningrum, Sri

    2007-12-30

    Gene expression profiling plays an important role in the identification of biological and clinical properties of human solid tumors such as colorectal carcinoma. Profiling is required to reveal underlying molecular features for diagnostic and therapeutic purposes. A non-parametric, density-estimation-based approach called iterative local Gaussian clustering (ILGC) was used to identify clusters of expressed genes. We used experimental data from a previous study by Muro and others consisting of 1,536 genes in 100 colorectal cancer and 11 normal tissues. In this dataset, ILGC finds three gene clusters, two large and one small, similar to their results, which used Gaussian mixture clustering. The correlation between each gene cluster and the clinical properties of malignancy of human colorectal cancer was analysed for the existence of tumor or normal tissue, the existence of distant metastasis and the existence of lymph node metastasis.

  6. Sex Determination in Ceratopteris richardii Is Accompanied by Transcriptome Changes That Drive Epigenetic Reprogramming of the Young Gametophyte.

    PubMed

    Atallah, Nadia M; Vitek, Olga; Gaiti, Federico; Tanurdzic, Milos; Banks, Jo Ann

    2018-05-02

    The fern Ceratopteris richardii is an important model for studies of sex determination and gamete differentiation in homosporous plants. Here we use RNA-seq to de novo assemble a transcriptome and identify genes differentially expressed in young gametophytes as their sex is determined by the presence or absence of the male-inducing pheromone called antheridiogen. Of the 1,163 consensus differentially expressed genes (DEGs) identified, the vast majority (1,030) are up-regulated in gametophytes treated with antheridiogen. GO term enrichment analyses of these DEGs reveal that a large number of genes involved in epigenetic reprogramming of the gametophyte genome are up-regulated by the pheromone. Additional hormone response and development genes are also up-regulated by the pheromone. This C. richardii gametophyte transcriptome and gene expression dataset will prove useful for studies focusing on sex determination and differentiation in plants. Copyright © 2018, G3: Genes, Genomes, Genetics.

  7. Inflammatory and mitochondrial gene expression data in GPER-deficient cardiomyocytes from male and female mice.

    PubMed

    Wang, Hao; Sun, Xuming; Chou, Jeff; Lin, Marina; Ferrario, Carlos M; Zapata-Sudo, Gisele; Groban, Leanne

    2017-02-01

    We previously showed that cardiomyocyte-specific G protein-coupled estrogen receptor (GPER) gene deletion leads to sex-specific adverse effects on cardiac structure and function; alterations which may be due to distinct differences in mitochondrial and inflammatory processes between sexes. Here, we provide the results of Gene Set Enrichment Analysis (GSEA) based on the DNA microarray data from GPER-knockout versus GPER-intact (intact) cardiomyocytes. This article contains complete data on the mitochondrial and inflammatory response-related gene expression changes that were significant in GPER knockout versus intact cardiomyocytes from adult male and female mice. The data are supplemental to our original research article "Cardiomyocyte-specific deletion of the G protein-coupled estrogen receptor (GPER) leads to left ventricular dysfunction and adverse remodeling: a sex-specific gene profiling" (Wang et al., 2016) [1]. Data have been deposited to the Gene Expression Omnibus (GEO) database repository with the dataset identifier GSE86843.

  8. Unity in defence: honeybee workers exhibit conserved molecular responses to diverse pathogens.

    PubMed

    Doublet, Vincent; Poeschl, Yvonne; Gogol-Döring, Andreas; Alaux, Cédric; Annoscia, Desiderato; Aurori, Christian; Barribeau, Seth M; Bedoya-Reina, Oscar C; Brown, Mark J F; Bull, James C; Flenniken, Michelle L; Galbraith, David A; Genersch, Elke; Gisder, Sebastian; Grosse, Ivo; Holt, Holly L; Hultmark, Dan; Lattorff, H Michael G; Le Conte, Yves; Manfredini, Fabio; McMahon, Dino P; Moritz, Robin F A; Nazzi, Francesco; Niño, Elina L; Nowick, Katja; van Rij, Ronald P; Paxton, Robert J; Grozinger, Christina M

    2017-03-02

    Organisms typically face infection by diverse pathogens, and hosts are thought to have developed specific responses to each type of pathogen they encounter. The advent of transcriptomics now makes it possible to test this hypothesis and compare host gene expression responses to multiple pathogens at a genome-wide scale. Here, we performed a meta-analysis of multiple published and new transcriptomes using a newly developed bioinformatics approach that filters genes based on their expression profile across datasets. Thereby, we identified common and unique molecular responses of a model host species, the honey bee (Apis mellifera), to its major pathogens and parasites: the Microsporidia Nosema apis and Nosema ceranae, RNA viruses, and the ectoparasitic mite Varroa destructor, which transmits viruses. We identified a common suite of genes and conserved molecular pathways that respond to all investigated pathogens, a result that suggests a commonality in response mechanisms to diverse pathogens. We found that genes differentially expressed after infection exhibit a higher evolutionary rate than non-differentially expressed genes. Using our new bioinformatics approach, we unveiled additional pathogen-specific responses of honey bees; we found that apoptosis appeared to be an important response following microsporidian infection, while genes from the immune signalling pathways, Toll and Imd, were differentially expressed after Varroa/virus infection. Finally, we applied our bioinformatics approach and generated a gene co-expression network to identify highly connected (hub) genes that may represent important mediators and regulators of anti-pathogen responses. Our meta-analysis generated a comprehensive overview of the host metabolic and other biological processes that mediate interactions between insects and their pathogens. 
We identified key host genes and pathways that respond to phylogenetically diverse pathogens, representing an important source for future functional studies as well as offering new routes to identify or generate pathogen resilient honey bee stocks. The statistical and bioinformatics approaches that were developed for this study are broadly applicable to synthesize information across transcriptomic datasets. These approaches will likely have utility in addressing a variety of biological questions.

  9. A deep auto-encoder model for gene expression prediction.

    PubMed

    Xie, Rui; Wen, Jia; Quitadamo, Andrew; Cheng, Jianlin; Shi, Xinghua

    2017-11-17

    Gene expression is a key intermediate level through which genotypes lead to a particular trait. Gene expression is affected by various factors, including the genotypes of genetic variants. With the aim of delineating the genetic impact on gene expression, we build a deep auto-encoder model to assess how well genetic variants contribute to gene expression changes. This new deep learning model is a regression-based predictive model based on the MultiLayer Perceptron and Stacked Denoising Auto-encoder (MLP-SAE). The model is trained using a stacked denoising auto-encoder for feature selection and a multilayer perceptron framework for backpropagation. We further improve the model by introducing dropout to prevent overfitting and improve performance. To demonstrate the usage of this model, we apply MLP-SAE to a real genomic dataset with genotypes and gene expression profiles measured in yeast. Our results show that the MLP-SAE model with dropout outperforms other models including Lasso, Random Forests and the MLP-SAE model without dropout. Using the MLP-SAE model with dropout, we show that gene expression quantifications predicted by the model solely from genotypes align well with true gene expression patterns. We provide a deep auto-encoder model for predicting gene expression from SNP genotypes. This study demonstrates that deep learning is appropriate for tackling another genomic problem, i.e., building predictive models to understand genotypes' contribution to gene expression. With the emerging availability of richer genomic data, we anticipate that deep learning models will play a bigger role in modeling and interpreting genomics.

  10. A systems biology pipeline identifies new immune and disease related molecular signatures and networks in human cells during microgravity exposure

    NASA Astrophysics Data System (ADS)

    Mukhopadhyay, Sayak; Saha, Rohini; Palanisamy, Anbarasi; Ghosh, Madhurima; Biswas, Anupriya; Roy, Saheli; Pal, Arijit; Sarkar, Kathakali; Bagh, Sangram

    2016-05-01

    Microgravity is a prominent health hazard for astronauts, yet we understand little about its effect at the molecular systems level. In this study, we have integrated a set of systems-biology tools and databases and have analysed more than 8000 molecular pathways on published global gene expression datasets of human cells in microgravity. Hundreds of new pathways have been identified with statistical confidence for each dataset and, despite the difference in cell types and experiments, around 100 of the new pathways appear to be common across the datasets. They are related to reduced inflammation, autoimmunity, diabetes and asthma. We have identified downregulation of the NfκB pathway via Notch1 signalling as a new pathway for reduced immunity in microgravity. Induction of a few cancer types, including liver cancer and leukaemia, and increased drug response to cancer in microgravity are also found. An increase in olfactory signal transduction is also identified. Genes are clustered based on their expression pattern, and mathematically stable clusters are identified. The network mapping of genes within a cluster indicates the plausible functional connections in microgravity. This pipeline gives a new systems-level picture of human cells under microgravity, generates testable hypotheses and may help in estimating risk and developing medicine for space missions.

  11. A systems biology pipeline identifies new immune and disease related molecular signatures and networks in human cells during microgravity exposure.

    PubMed

    Mukhopadhyay, Sayak; Saha, Rohini; Palanisamy, Anbarasi; Ghosh, Madhurima; Biswas, Anupriya; Roy, Saheli; Pal, Arijit; Sarkar, Kathakali; Bagh, Sangram

    2016-05-17

    Microgravity is a prominent health hazard for astronauts, yet we understand little about its effect at the molecular systems level. In this study, we have integrated a set of systems-biology tools and databases and have analysed more than 8000 molecular pathways on published global gene expression datasets of human cells in microgravity. Hundreds of new pathways have been identified with statistical confidence for each dataset and, despite the difference in cell types and experiments, around 100 of the new pathways appear to be common across the datasets. They are related to reduced inflammation, autoimmunity, diabetes and asthma. We have identified downregulation of the NfκB pathway via Notch1 signalling as a new pathway for reduced immunity in microgravity. Induction of a few cancer types, including liver cancer and leukaemia, and increased drug response to cancer in microgravity are also found. An increase in olfactory signal transduction is also identified. Genes are clustered based on their expression pattern, and mathematically stable clusters are identified. The network mapping of genes within a cluster indicates the plausible functional connections in microgravity. This pipeline gives a new systems-level picture of human cells under microgravity, generates testable hypotheses and may help in estimating risk and developing medicine for space missions.

  12. Transcriptomic correlates of neuron electrophysiological diversity

    PubMed Central

    Li, Brenna; Crichlow, Cindy-Lee; Mancarci, B. Ogan; Pavlidis, Paul

    2017-01-01

    How neuronal diversity emerges from complex patterns of gene expression remains poorly understood. Here we present an approach to understand electrophysiological diversity through gene expression by integrating pooled- and single-cell transcriptomics with intracellular electrophysiology. Using neuroinformatics methods, we compiled a brain-wide dataset of 34 neuron types with paired gene expression and intrinsic electrophysiological features from publicly accessible sources, the largest such collection to date. We identified 420 genes whose expression levels significantly correlated with variability in one or more of 11 physiological parameters. We next trained statistical models to infer cellular features from multivariate gene expression patterns. Such models were predictive of gene-electrophysiological relationships in an independent collection of 12 visual cortex cell types from the Allen Institute, suggesting that these correlations might reflect general principles relating expression patterns to phenotypic diversity across very different cell types. Many associations reported here have the potential to provide new insights into how neurons generate functional diversity, and correlations of ion channel genes like Gabrd and Scn1a (Nav1.1) with resting potential and spiking frequency are consistent with known causal mechanisms. Our work highlights the promise and inherent challenges in using cell type-specific transcriptomics to understand the mechanistic origins of neuronal diversity. PMID:29069078

  13. A simple implementation of a normal mixture approach to differential gene expression in multiclass microarrays.

    PubMed

    McLachlan, G J; Bean, R W; Jones, L Ben-Tovim

    2006-07-01

    An important problem in microarray experiments is the detection of genes that are differentially expressed in a given number of classes. We provide a straightforward and easily implemented method for estimating the posterior probability that an individual gene is null. The problem can be expressed in a two-component mixture framework, using an empirical Bayes approach. Current methods of implementing this approach either have some limitations due to the minimal assumptions made or with more specific assumptions are computationally intensive. By converting to a z-score the value of the test statistic used to test the significance of each gene, we propose a simple two-component normal mixture that models adequately the distribution of this score. The usefulness of our approach is demonstrated on three real datasets.
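
    The two-component mixture described above can be sketched in a few lines. The Python example below is a minimal illustration and not the authors' implementation: it assumes a theoretical N(0, 1) null component, fits the mixing proportion and a normal non-null component by EM, and reports the posterior probability that a gene is null. The function names (fit_null_mixture, posterior_null), the starting values and the iteration count are illustrative assumptions.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    u = (x - mu) / sigma
    return math.exp(-0.5 * u * u) / (sigma * math.sqrt(2.0 * math.pi))


def posterior_null(z, pi0, mu1, sigma1):
    """Posterior probability that a gene with z-score z belongs to the null component."""
    null = pi0 * normal_pdf(z, 0.0, 1.0)
    alt = (1.0 - pi0) * normal_pdf(z, mu1, sigma1)
    return null / (null + alt)


def fit_null_mixture(z_scores, n_iter=200):
    """EM for pi0*N(0,1) + (1-pi0)*N(mu1, sigma1^2); the null is held fixed (assumption)."""
    # Hypothetical starting values: most genes null, non-null mean from the top decile.
    top = sorted(z_scores)[-max(1, len(z_scores) // 10):]
    pi0, mu1, sigma1 = 0.9, sum(top) / len(top), 1.0
    for _ in range(n_iter):
        # E-step: posterior null probability of each gene.
        tau0 = [posterior_null(z, pi0, mu1, sigma1) for z in z_scores]
        # M-step: update the mixing proportion and the non-null component.
        weights = [1.0 - t for t in tau0]
        total = sum(weights)
        if total < 1e-12:  # no evidence of a non-null component
            break
        pi0 = sum(tau0) / len(z_scores)
        mu1 = sum(w * z for w, z in zip(weights, z_scores)) / total
        var1 = sum(w * (z - mu1) ** 2 for w, z in zip(weights, z_scores)) / total
        sigma1 = max(math.sqrt(var1), 1e-3)  # floor avoids a degenerate component
    return pi0, mu1, sigma1
```

    A gene would typically be flagged as differentially expressed when its posterior null probability falls below a chosen threshold; the threshold is a design choice that this sketch leaves open.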

  14. Comparison of Expression Profiles in Ovarian Epithelium In Vivo and Ovarian Cancer Identifies Novel Candidate Genes Involved in Disease Pathogenesis

    PubMed Central

    Emmanuel, Catherine; Gava, Natalie; Kennedy, Catherine; Balleine, Rosemary L.; Sharma, Raghwa; Wain, Gerard; Brand, Alison; Hogg, Russell; Etemadmoghadam, Dariush; George, Joshy; Birrer, Michael J.; Clarke, Christine L.; Chenevix-Trench, Georgia; Bowtell, David D. L.; Harnett, Paul R.; deFazio, Anna

    2011-01-01

    Molecular events leading to epithelial ovarian cancer are poorly understood but ovulatory hormones and a high number of life-time ovulations with concomitant proliferation, apoptosis, and inflammation, increases risk. We identified genes that are regulated during the estrous cycle in murine ovarian surface epithelium and analysed these profiles to identify genes dysregulated in human ovarian cancer, using publically available datasets. We identified 338 genes that are regulated in murine ovarian surface epithelium during the estrous cycle and dysregulated in ovarian cancer. Six of seven candidates selected for immunohistochemical validation were expressed in serous ovarian cancer, inclusion cysts, ovarian surface epithelium and in fallopian tube epithelium. Most were overexpressed in ovarian cancer compared with ovarian surface epithelium and/or inclusion cysts (EpCAM, EZH2, BIRC5) although BIRC5 and EZH2 were expressed as highly in fallopian tube epithelium as in ovarian cancer. We prioritised the 338 genes for those likely to be important for ovarian cancer development by in silico analyses of copy number aberration and mutation using publically available datasets and identified genes with established roles in ovarian cancer as well as novel genes for which we have evidence for involvement in ovarian cancer. Chromosome segregation emerged as an important process in which genes from our list of 338 were over-represented including two (BUB1, NCAPD2) for which there is evidence of amplification and mutation. NUAK2, upregulated in ovarian surface epithelium in proestrus and predicted to have a driver mutation in ovarian cancer, was examined in a larger cohort of serous ovarian cancer where patients with lower NUAK2 expression had shorter overall survival. 
    In conclusion, defining genes that are activated in normal epithelium in the course of ovulation that are also dysregulated in cancer has identified a number of pathways and novel candidate genes that may contribute to the development of ovarian cancer. PMID:21423607

  15. DAPK1 as an independent prognostic marker in liver cancer.

    PubMed

    Li, Ling; Guo, Libin; Wang, Qingshui; Liu, Xiaolong; Zeng, Yongyi; Wen, Qing; Zhang, Shudong; Kwok, Hang Fai; Lin, Yao; Liu, Jingfeng

    2017-01-01

    The death-associated protein kinase 1 (DAPK1) can act as an oncogene or a tumor suppressor gene depending on the cellular context as well as external stimuli. Our study aims to investigate the prognostic significance of DAPK1 in liver cancer at both the mRNA and protein levels. The mRNA expression of DAPK1 was extracted from the Gene Expression Omnibus database in three independent liver cancer datasets, while protein expression of DAPK1 was detected by immunohistochemistry in our Chinese liver cancer patient cohort. The associations between DAPK1 expression and clinical characteristics were tested. DAPK1 mRNA expression was down-regulated in liver cancer. Low levels of DAPK1 mRNA were associated with shorter survival in a liver cancer patient cohort (n = 115; p = 0.041), while negative staining of DAPK1 protein was significantly correlated with shorter time to progression (p = 0.002) and overall survival (p = 0.02). DAPK1 was an independent prognostic marker for both time to progression and overall survival by multivariate analysis. Liver cancer with the β-catenin mutation has a lower DAPK1 expression, suggesting that DAPK1 may be regulated under the β-catenin pathway. In addition, we also identified genes that are co-regulated with DAPK1. DAPK1 expression was positively correlated with IRF2, IL7R, PCOLCE and ZBTB16, and negatively correlated with SLC16A3 in both liver cancer datasets. Among these genes, PCOLCE and ZBTB16 were significantly down-regulated, while SLC16A3 was significantly upregulated in liver cancer. By using connectivity mapping of these co-regulated genes, we have identified amcinonide and sulpiride as small molecules that could potentially reverse DAPK1/PCOLCE/ZBTB16/SLC16A3 expression. Our study demonstrated for the first time that both DAPK1 mRNA and protein expression levels are important prognostic markers in liver cancer, and has identified genes that may contribute to DAPK1-mediated liver carcinogenesis.

  16. Bioinformatics approach to evaluate differential gene expression of M1/M2 macrophage phenotypes and antioxidant genes in atherosclerosis.

    PubMed

    da Rocha, Ricardo Fagundes; De Bastiani, Marco Antônio; Klamt, Fábio

    2014-11-01

    Atherosclerosis is a pro-inflammatory process intrinsically related to systemic redox impairments. Macrophages play a major role in disease development. The specific involvement of classically activated M1 (pro-inflammatory) or alternatively activated M2 (anti-inflammatory) macrophages in plaque formation and disease progression is still not established. Thus, based on meta-data analysis of public micro-array datasets, we compared differential gene expression levels of the human antioxidant genes (HAG) and M1/M2 genes between early and advanced human atherosclerotic plaques, and among peripheral macrophages (with or without foam cell induction by oxidized low density lipoprotein, oxLDL) from healthy and atherosclerotic subjects. Two independent datasets, GSE28829 and GSE9874, were selected from the Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/) repository. Functional interactions were obtained with STRING (http://string-db.org/) and Medusa (http://coot.embl.de/medusa/). Statistical analysis was performed with ViaComplex® (http://lief.if.ufrgs.br/pub/biosoftwares/viacomplex/) and gene set enrichment analysis (http://www.broadinstitute.org/gsea/index.jsp). Bootstrap analysis demonstrated that the activity (expression) of the HAG and M1 gene sets was significantly increased in advanced compared to early atherosclerotic plaques. Increased expression of the HAG, M1, and M2 gene sets was found in peripheral macrophages from atherosclerotic subjects compared to peripheral macrophages from healthy subjects, while only the M1 gene set was increased in foam cells from atherosclerotic subjects compared to foam cells from healthy subjects. However, the M1 gene set was decreased in foam cells from healthy subjects compared to peripheral macrophages from healthy subjects, while no differences were found in foam cells from atherosclerotic subjects compared to peripheral macrophages from atherosclerotic subjects.
    Our data suggest that, unlike in cancer, there is no M1 or M2 polarization of macrophages in atherosclerosis. Instead, M1 and M2 phenotypes are equally induced, which is an important aspect for better understanding disease progression and can help in developing new therapeutic approaches.

  17. Exploiting the full power of temporal gene expression profiling through a new statistical test: application to the analysis of muscular dystrophy data.

    PubMed

    Vinciotti, Veronica; Liu, Xiaohui; Turk, Rolf; de Meijer, Emile J; 't Hoen, Peter A C

    2006-04-03

    The identification of biologically interesting genes in a temporal expression profiling dataset is challenging and complicated by high levels of experimental noise. Most statistical methods used in the literature do not fully exploit the temporal ordering in the dataset and are not suited to the case where temporal profiles are measured for a number of different biological conditions. We present a statistical test that makes explicit use of the temporal order in the data by fitting polynomial functions to the temporal profile of each gene and for each biological condition. A Hotelling T2-statistic is derived to detect the genes for which the parameters of these polynomials are significantly different from each other. We validate the temporal Hotelling T2-test on muscular gene expression data from four mouse strains which were profiled at different ages: dystrophin-, beta-sarcoglycan and gamma-sarcoglycan deficient mice, and wild-type mice. The first three are animal models for different muscular dystrophies. Extensive biological validation shows that the method is capable of finding genes with temporal profiles significantly different across the four strains, as well as identifying potential biomarkers for each form of the disease. The added value of the temporal test compared to an identical test which does not make use of temporal ordering is demonstrated via a simulation study, and through confirmation of the expression profiles from selected genes by quantitative PCR experiments. The proposed method maximises the detection of the biologically interesting genes, whilst minimising false detections. The temporal Hotelling T2-test is capable of finding relatively small and robust sets of genes that display different temporal profiles between the conditions of interest. 
    The test is simple, it can be used on gene expression data generated from any experimental design and for any number of conditions, and it allows fast interpretation of the temporal behaviour of genes. The R code is available from V.V. The microarray data have been submitted to GEO under series GSE1574 and GSE3523.

  18. Exploiting the full power of temporal gene expression profiling through a new statistical test: Application to the analysis of muscular dystrophy data

    PubMed Central

    Vinciotti, Veronica; Liu, Xiaohui; Turk, Rolf; de Meijer, Emile J; 't Hoen, Peter AC

    2006-01-01

    Background The identification of biologically interesting genes in a temporal expression profiling dataset is challenging and complicated by high levels of experimental noise. Most statistical methods used in the literature do not fully exploit the temporal ordering in the dataset and are not suited to the case where temporal profiles are measured for a number of different biological conditions. We present a statistical test that makes explicit use of the temporal order in the data by fitting polynomial functions to the temporal profile of each gene and for each biological condition. A Hotelling T2-statistic is derived to detect the genes for which the parameters of these polynomials are significantly different from each other. Results We validate the temporal Hotelling T2-test on muscular gene expression data from four mouse strains which were profiled at different ages: dystrophin-, beta-sarcoglycan and gamma-sarcoglycan deficient mice, and wild-type mice. The first three are animal models for different muscular dystrophies. Extensive biological validation shows that the method is capable of finding genes with temporal profiles significantly different across the four strains, as well as identifying potential biomarkers for each form of the disease. The added value of the temporal test compared to an identical test which does not make use of temporal ordering is demonstrated via a simulation study, and through confirmation of the expression profiles from selected genes by quantitative PCR experiments. The proposed method maximises the detection of the biologically interesting genes, whilst minimising false detections. Conclusion The temporal Hotelling T2-test is capable of finding relatively small and robust sets of genes that display different temporal profiles between the conditions of interest. 
    The test is simple, it can be used on gene expression data generated from any experimental design and for any number of conditions, and it allows fast interpretation of the temporal behaviour of genes. The R code is available from V.V. The microarray data have been submitted to GEO under series GSE1574 and GSE3523. PMID:16584545

  19. Employing conservation of co-expression to improve functional inference

    PubMed Central

    Daub, Carsten O; Sonnhammer, Erik LL

    2008-01-01

    Background Observing co-expression between genes suggests that they are functionally coupled. Co-expression of orthologous gene pairs across species may improve function prediction beyond the level achieved in a single species. Results We used orthology between genes of the three species S. cerevisiae, D. melanogaster, and C. elegans to combine co-expression across two species at a time. This led to increased function prediction accuracy when we incorporated expression data from either of the other two species, and accuracy increased even further when conservation across both of the other species was considered at the same time. Employing the conservation across species to incorporate abundant model organism data for the prediction of protein interactions in poorly characterized species constitutes a very powerful annotation method. Conclusion To be able to employ the most suitable co-expression distance measure for our analysis, we evaluated the ability of four popular gene co-expression distance measures to detect biologically relevant interactions between pairs of genes. For the expression datasets employed in our co-expression conservation analysis above, we used the GO and the KEGG PATHWAY databases as gold standards. While the differences between distance measures were small, Spearman correlation gave the most robust results. PMID:18808668
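
    As an illustration of the measure that performed best here, a Spearman-based co-expression distance between two expression profiles can be sketched as below. This is a minimal Python example; defining the distance as 1 - rho and the helper names (average_ranks, spearman_distance) are assumptions made for illustration and are not taken from the paper.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


def spearman_distance(profile_a, profile_b):
    """Co-expression distance 1 - rho, where rho is the Spearman rank correlation."""
    rho = pearson(average_ranks(profile_a), average_ranks(profile_b))
    return 1.0 - rho
```

    Because only ranks enter the calculation, any monotone relationship between two genes yields a distance of 0, which is one plausible reason a rank-based measure can be more robust to outliers than Pearson correlation on raw intensities.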

  20. A prior-based integrative framework for functional transcriptional regulatory network inference

    PubMed Central

    Siahpirani, Alireza F.

    2017-01-01

    Transcriptional regulatory networks specify regulatory proteins controlling the context-specific expression levels of genes. Inference of genome-wide regulatory networks is central to understanding gene regulation, but remains an open challenge. Expression-based network inference is among the most popular methods to infer regulatory networks; however, networks inferred from such methods have low overlap with experimentally derived (e.g. ChIP-chip and transcription factor (TF) knockouts) networks. Currently we have a limited understanding of this discrepancy. To address this gap, we first develop a regulatory network inference algorithm, based on probabilistic graphical models, to integrate expression with auxiliary datasets supporting a regulatory edge. Second, we comprehensively analyze our and other state-of-the-art methods on different expression perturbation datasets. Networks inferred by integrating sequence-specific motifs with expression have substantially greater agreement with experimentally derived networks, while remaining more predictive of expression than motif-based networks. Our analysis suggests natural genetic variation as the most informative perturbation for network inference, and identifies core TFs whose targets are predictable from expression. Multiple reasons make the identification of targets of other TFs difficult, including network architecture and insufficient variation of TF mRNA level. Finally, we demonstrate the utility of our inference algorithm to infer stress-specific regulatory networks and for regulator prioritization. PMID:27794550

  1. A gene-signature progression approach to identifying candidate small-molecule cancer therapeutics with connectivity mapping.

    PubMed

    Wen, Qing; Kim, Chang-Sik; Hamilton, Peter W; Zhang, Shu-Dong

    2016-05-11

    Gene expression connectivity mapping has gained much popularity recently with a number of successful applications in biomedical research testifying its utility and promise. Previously methodological research in connectivity mapping mainly focused on two of the key components in the framework, namely, the reference gene expression profiles and the connectivity mapping algorithms. The other key component in this framework, the query gene signature, has been left to users to construct without much consensus on how this should be done, albeit it has been an issue most relevant to end users. As a key input to the connectivity mapping process, gene signature is crucially important in returning biologically meaningful and relevant results. This paper intends to formulate a standardized procedure for constructing high quality gene signatures from a user's perspective. We describe a two-stage process for making quality gene signatures using gene expression data as initial inputs. First, a differential gene expression analysis comparing two distinct biological states; only the genes that have passed stringent statistical criteria are considered in the second stage of the process, which involves ranking genes based on statistical as well as biological significance. We introduce a "gene signature progression" method as a standard procedure in connectivity mapping. Starting from the highest ranked gene, we progressively determine the minimum length of the gene signature that allows connections to the reference profiles (drugs) being established with a preset target false discovery rate. We use a lung cancer dataset and a breast cancer dataset as two case studies to demonstrate how this standardized procedure works, and we show that highly relevant and interesting biological connections are returned. Of particular note is gefitinib, identified as among the candidate therapeutics in our lung cancer case study. 
    Our gene signature was based on gene expression data from Taiwan female non-smoker lung cancer patients, while there is evidence from independent studies that gefitinib is highly effective in treating women, non-smoker or former light smoker, advanced non-small cell lung cancer patients of Asian origin. In summary, we introduced a gene signature progression method into connectivity mapping, which enables a standardized procedure for constructing high quality gene signatures. This progression method is particularly useful when the number of differentially expressed genes identified is large, and when there is a need to prioritize them to be included in the query signature. The results from two case studies demonstrate that the approach we have developed is capable of obtaining pertinent candidate drugs with high precision.

  2. Gene expression profile predicting the response to anti-TNF treatment in patients with rheumatoid arthritis; analysis of GEO datasets.

    PubMed

    Kim, Tae-Hwan; Choi, Sung Jae; Lee, Young Ho; Song, Gwan Gyu; Ji, Jong Dae

    2014-07-01

    Anti-tumor necrosis factor (TNF) therapy is the treatment of choice for rheumatoid arthritis (RA) patients in whom standard disease-modifying anti-rheumatic drugs are ineffective. However, a substantial proportion of RA patients treated with anti-TNF agents do not show a significant clinical response. Therefore, biomarkers predicting response to anti-TNF agents are needed. Recently, gene expression profiling has been applied in research for developing such biomarkers. We compared gene expression profiles reported by previous studies dealing with the responsiveness to anti-TNF therapy in RA patients and attempted to identify differentially expressed genes (DEGs) that discriminated between responders and non-responders to anti-TNF therapy. We used microarray datasets available at the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO). This analysis included 6 studies and 5 sets of microarray data that used peripheral blood samples for identification of DEGs predicting response to anti-TNF therapy. We found little overlap in the DEGs that were highly ranked in each study. Three DEGs (IL2RB, SH2D2A and G0S2) appeared in more than one study. In addition, a meta-analysis designed to increase statistical power found one DEG, G0S2, by Fisher's method. Our finding suggests that G0S2 may serve as a biomarker to predict response to anti-TNF therapy in patients with rheumatoid arthritis. Further investigations based on larger studies are therefore needed to confirm the significance of G0S2 in predicting response to anti-TNF therapy. Copyright © 2014 Société française de rhumatologie. Published by Elsevier SAS. All rights reserved.
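
    Fisher's method, mentioned above, combines k independent p-values for the same gene into the statistic X = -2 Σ ln(p_i), which follows a chi-square distribution with 2k degrees of freedom under the null hypothesis. The Python sketch below is a minimal illustration that assumes the per-study p-values are independent; the function name fisher_combined_p is hypothetical and the example is not the authors' pipeline. It uses the closed-form chi-square tail for even degrees of freedom, so no external library is needed.

```python
import math


def fisher_combined_p(p_values):
    """Return (statistic, combined p-value) for independent p-values via Fisher's method."""
    if not p_values or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # X / 2, with X = -2 * sum(ln p)
    # Chi-square survival function with 2k degrees of freedom:
    # P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    series = 1.0
    for i in range(1, k):
        term *= half_stat / i
        series += term
    return 2.0 * half_stat, math.exp(-half_stat) * series
```

    In a meta-analysis setting the function would be applied gene by gene across studies, followed by a multiple-testing correction over genes; the choice of correction is left open here.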

  3. Meta-analysis of expression of l(3)mbt tumor-associated germline genes supports the model that a soma-to-germline transition is a hallmark of human cancers.

    PubMed

    Feichtinger, Julia; Larcombe, Lee; McFarlane, Ramsay J

    2014-05-15

    Evidence is starting to emerge indicating that tumorigenesis in metazoans involves a soma-to-germline transition, which may contribute to the acquisition of neoplastic characteristics. Here, we have meta-analyzed gene expression profiles of the human orthologs of Drosophila melanogaster germline genes that are ectopically expressed in l(3)mbt brain tumors using gene expression datasets derived from a large cohort of human tumors. We find these germline genes, some of which drive oncogenesis in D. melanogaster, are similarly ectopically activated in a wide range of human cancers. Some of these genes normally have expression restricted to the germline, making them of particular clinical interest. Importantly, these analyses provide additional support to the emerging model that proposes a soma-to-germline transition is a general hallmark of a wide range of human tumors. This has implications for our understanding of human oncogenesis and the development of new therapeutic and biomarker targets with clinical potential. © 2013 The Authors. Published by Wiley Periodicals, Inc. on behalf of UICC.

  4. Clustering approaches to identifying gene expression patterns from DNA microarray data.

    PubMed

    Do, Jin Hwan; Choi, Dong-Kug

    2008-04-30

    The analysis of microarray data is essential for making sense of large amounts of gene expression data. In this review we focus on clustering techniques. The biological rationale for this approach is the fact that many co-expressed genes are co-regulated, and identifying co-expressed genes could aid in functional annotation of novel genes, de novo identification of transcription factor binding sites and elucidation of complex biological pathways. Co-expressed genes are usually identified in microarray experiments by clustering techniques. There are many such methods, and the results obtained even for the same datasets may vary considerably depending on the algorithms and metrics for dissimilarity measures used, as well as on user-selectable parameters such as desired number of clusters and initial values. Therefore, biologists who want to interpret microarray data should be aware of the strengths and weaknesses of the clustering methods used. In this review, we survey the basic principles of clustering of DNA microarray data from crisp clustering algorithms such as hierarchical clustering, K-means and self-organizing maps, to complex clustering algorithms like fuzzy clustering.
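
    The dependence on user-selectable parameters noted above can be seen in even the simplest crisp method. The Python sketch below is a minimal K-means on expression profiles (one list of values per gene), written for illustration only: the function names, the squared Euclidean dissimilarity, the random initialisation and the seed are all assumptions, and the review itself does not prescribe an implementation.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(profiles, k, n_iter=100, seed=0):
    """Cluster profiles into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Initial centroids are k distinct profiles chosen at random (assumption).
    centroids = [list(p) for p in rng.sample(profiles, k)]
    labels = [None] * len(profiles)
    for _ in range(n_iter):
        # Assignment step: each gene joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in profiles
        ]
        if new_labels == labels:  # assignments stable, so the algorithm has converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean profile of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

    Changing k, the seed or the dissimilarity function can change the resulting clusters on real data, which is the sensitivity the review warns about.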

  5. Gene expression links functional networks across cortex and striatum.

    PubMed

    Anderson, Kevin M; Krienen, Fenna M; Choi, Eun Young; Reinen, Jenna M; Yeo, B T Thomas; Holmes, Avram J

    2018-04-12

    The human brain is comprised of a complex web of functional networks that link anatomically distinct regions. However, the biological mechanisms supporting network organization remain elusive, particularly across cortical and subcortical territories with vastly divergent cellular and molecular properties. Here, using human and primate brain transcriptional atlases, we demonstrate that spatial patterns of gene expression show strong correspondence with limbic and somato/motor cortico-striatal functional networks. Network-associated expression is consistent across independent human datasets and evolutionarily conserved in non-human primates. Genes preferentially expressed within the limbic network (encompassing nucleus accumbens, orbital/ventromedial prefrontal cortex, and temporal pole) relate to risk for psychiatric illness, chloride channel complexes, and markers of somatostatin neurons. Somato/motor associated genes are enriched for oligodendrocytes and markers of parvalbumin neurons. These analyses indicate that parallel cortico-striatal processing channels possess dissociable genetic signatures that recapitulate distributed functional networks, and nominate molecular mechanisms supporting cortico-striatal circuitry in health and disease.

  6. Identifying gene coexpression networks underlying the dynamic regulation of wood-forming tissues in Populus under diverse environmental conditions.

    PubMed

    Zinkgraf, Matthew; Liu, Lijun; Groover, Andrew; Filkov, Vladimir

    2017-06-01

    Trees modify wood formation through integration of environmental and developmental signals in complex but poorly defined transcriptional networks, allowing trees to produce woody tissues appropriate to diverse environmental conditions. In order to identify relationships among genes expressed during wood formation, we integrated data from new and publically available datasets in Populus. These datasets were generated from woody tissue and include transcriptome profiling, transcription factor binding, DNA accessibility and genome-wide association mapping experiments. Coexpression modules were calculated, each of which contains genes showing similar expression patterns across experimental conditions, genotypes and treatments. Conserved gene coexpression modules (four modules totaling 8398 genes) were identified that were highly preserved across diverse environmental conditions and genetic backgrounds. Functional annotations as well as correlations with specific experimental treatments associated individual conserved modules with distinct biological processes underlying wood formation, such as cell-wall biosynthesis, meristem development and epigenetic pathways. Module genes were also enriched for DNase I hypersensitivity footprints and binding from four transcription factors associated with wood formation. The conserved modules are excellent candidates for modeling core developmental pathways common to wood formation in diverse environments and genotypes, and serve as testbeds for hypothesis generation and testing for future studies. No claim to original US government works. New Phytologist © 2017 New Phytologist Trust.

  7. Integrating genome-wide association study and expression quantitative trait loci data identifies multiple genes and gene set associated with neuroticism.

    PubMed

    Fan, Qianrui; Wang, Wenyu; Hao, Jingcan; He, Awen; Wen, Yan; Guo, Xiong; Wu, Cuiyan; Ning, Yujie; Wang, Xi; Wang, Sen; Zhang, Feng

    2017-08-01

    Neuroticism is a fundamental personality trait with a significant genetic determinant. To identify novel susceptibility genes for neuroticism, we conducted an integrative analysis of genomic and transcriptomic data from genome-wide association study (GWAS) and expression quantitative trait locus (eQTL) studies. GWAS summary data were derived from published studies of neuroticism involving a total of 170,906 subjects. An eQTL dataset containing 927,753 eQTLs was obtained from an eQTL meta-analysis of 5311 samples. Integrative analysis of GWAS and eQTL data was conducted with summary data-based Mendelian randomization (SMR) analysis software. To identify neuroticism-associated gene sets, the SMR analysis results were further subjected to gene set enrichment analysis (GSEA). The gene set annotation dataset (containing 13,311 annotated gene sets) of the GSEA Molecular Signatures Database was used. SMR single gene analysis identified 6 significant genes for neuroticism, including MSRA (p value=2.27×10⁻¹⁰), MGC57346 (p value=6.92×10⁻⁷), BLK (p value=1.01×10⁻⁶), XKR6 (p value=1.11×10⁻⁶), C17ORF69 (p value=1.12×10⁻⁶) and KIAA1267 (p value=4.00×10⁻⁶). Gene set enrichment analysis observed a significant association for the Chr8p23 gene set (false discovery rate=0.033). Our results provide novel clues for studies of the genetic mechanism of neuroticism. Copyright © 2017. Published by Elsevier Inc.

  8. The large-scale investigation of gene expression in Leymus chinensis stigmas provides a valuable resource for understanding the mechanisms of poaceae self-incompatibility.

    PubMed

    Zhou, Qingyuan; Jia, Junting; Huang, Xing; Yan, Xueqing; Cheng, Liqin; Chen, Shuangyan; Li, Xiaoxia; Peng, Xianjun; Liu, Gongshe

    2014-05-26

    Many Poaceae species show a gametophytic self-incompatibility (GSI) system, which is controlled by at least two independent and multiallelic loci, S and Z. Until now, the gene products for S and Z were unknown. The stigmas of SI grasses discriminate among pollen grains that land on their surface: they support compatible pollen tube growth and penetration into the stigma, while recognizing incompatible pollen and inhibiting its pollination behaviors. Leymus chinensis (Trin.) Tzvel. (sheepgrass) is a Poaceae SI species. A comprehensive analysis of the sheepgrass stigma transcriptome may provide valuable information for understanding the mechanism of pollen-stigma interactions and grass SI. The transcript abundance profiles of mature stigmas, mature ovaries and leaves were examined using high-throughput next generation sequencing technology. A comparative transcriptomic analysis of these tissues identified 1,025 specifically or preferentially expressed genes in sheepgrass stigmas. These genes contained a significant proportion of genes predicted to function in cell-cell communication and signal transduction. We identified 111 putative transcription factor (TF) genes, and the most abundant groups were MYB, C2H2, C3H, FAR1 and MADS. Comparative analysis of the sheepgrass, rice and Arabidopsis stigma-specific or preferential datasets showed broad similarities and some differences in the proportion of genes in the Gene Ontology (GO) functional categories. Potential SI candidate genes identified in other grasses were also detected in the sheepgrass stigma-specific or preferential dataset. Quantitative real-time PCR experiments validated the expression pattern of stigma preferential genes including homologous grass SI candidate genes. This study represents the first large-scale investigation of gene expression in the stigmas of an SI grass species.
    We uncovered many notable genes that are potentially involved in pollen-stigma interactions and SI mechanisms, including genes encoding receptor-like protein kinases (RLK), CBL (calcineurin B-like proteins) interacting protein kinases, calcium-dependent protein kinase, expansins, pectinesterase, peroxidases and various transcription factors. The availability of a pool of stigma-specific or preferential genes for L. chinensis offers an opportunity to elucidate the mechanisms of SI in Poaceae.

  9. Mega-analysis of Odds Ratio: A Convergent Method for a Deep Understanding of the Genetic Evidence in Schizophrenia.

    PubMed

    Jia, Peilin; Chen, Xiangning; Xie, Wei; Kendler, Kenneth S; Zhao, Zhongming

    2018-06-20

    Numerous high-throughput omics studies have been conducted in schizophrenia, providing an accumulated catalog of susceptible variants and genes. The results from these studies, however, are highly heterogeneous. The variants and genes nominated by different omics studies often have limited overlap with each other. There is thus a pressing need for integrative analysis to unify the different types of data and provide a convergent view of schizophrenia candidate genes (SZgenes). In this study, we collected a comprehensive, multidimensional dataset, including 7819 brain-expressed genes. The data hosted genome-wide association evidence in genetics (eg, genotyping data, copy number variations, de novo mutations), epigenetics, transcriptomics, and literature mining. We developed a method named mega-analysis of odds ratio (MegaOR) to prioritize SZgenes. Application of MegaOR in the multidimensional data resulted in consensus sets of SZgenes (up to 530), each enriched with dense, multidimensional evidence. We proved that these SZgenes had highly tissue-specific expression in brain and nerve and had intensive interactions that were significantly stronger than chance expectation. Furthermore, we found these SZgenes were involved in human brain development by showing strong spatiotemporal expression patterns; these characteristics were replicated in independent brain expression datasets. Finally, we found the SZgenes were enriched in critical functional gene sets involved in neuronal activities, ligand gated ion signaling, and fragile X mental retardation protein targets. In summary, MegaOR analysis reported consensus sets of SZgenes with enriched association evidence to schizophrenia, providing insights into the pathophysiology underlying schizophrenia.
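    The abstract above does not spell out how MegaOR combines evidence, so the following is only a minimal sketch of the underlying quantity, an odds ratio from a 2x2 table, with a Wald confidence interval. The function names, the zero-cell correction and the example counts are assumptions made for illustration and are not taken from the paper.

```python
import math


def _corrected(a, b, c, d, correction):
    # Haldane-Anscombe style correction: add a small constant to every cell
    # when any cell is zero, so the ratio and its variance stay finite.
    if 0 in (a, b, c, d):
        return tuple(x + correction for x in (a, b, c, d))
    return a, b, c, d


def odds_ratio(a, b, c, d, correction=0.5):
    """Odds ratio for the 2x2 table [[a, b], [c, d]].

    a: candidate genes with evidence     b: candidate genes without evidence
    c: background genes with evidence    d: background genes without evidence
    """
    a, b, c, d = _corrected(a, b, c, d, correction)
    return (a * d) / (b * c)


def odds_ratio_ci(a, b, c, d, z=1.96, correction=0.5):
    """Wald confidence interval computed on the log odds ratio scale."""
    a, b, c, d = _corrected(a, b, c, d, correction)
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)


if __name__ == "__main__":
    # Hypothetical counts, not data from the study.
    print(odds_ratio(20, 80, 10, 190))
    print(odds_ratio_ci(20, 80, 10, 190))
```

    An odds ratio above 1 whose interval excludes 1 would suggest that the candidate set is enriched for the evidence type in question; how several such ratios are merged into one ranking is specific to the published method.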

  10. TP53 mutation-correlated genes predict the risk of tumor relapse and identify MPS1 as a potential therapeutic kinase in TP53-mutated breast cancers.

    PubMed

    Győrffy, Balázs; Bottai, Giulia; Lehmann-Che, Jacqueline; Kéri, György; Orfi, László; Iwamoto, Takayuki; Desmedt, Christine; Bianchini, Giampaolo; Turner, Nicholas C; de Thè, Hugues; André, Fabrice; Sotiriou, Christos; Hortobagyi, Gabriel N; Di Leo, Angelo; Pusztai, Lajos; Santarpia, Libero

    2014-05-01

    Breast cancers (BC) carry a complex set of gene mutations that can influence their gene expression and clinical behavior. We aimed to identify genes driven by the TP53 mutation status and assess their clinical relevance in estrogen receptor (ER)-positive and ER-negative BC, and their potential as targets for patients with TP53 mutated tumors. Separate ROC analyses of each gene expression according to TP53 mutation status were performed. The prognostic value of genes with the highest AUC was assessed in a large dataset of untreated and neoadjuvant chemotherapy-treated patients. The mitotic checkpoint gene MPS1 was the most significant gene correlated with TP53 status, and the most significant prognostic marker in all ER-positive BC datasets. MPS1 retained its prognostic value independently from the type of treatment administered. The biological functions of MPS1 were investigated in different BC cell lines. We also assessed the effects of a potent small molecule inhibitor of MPS1, SP600125, alone and in combination with chemotherapy. Consistent with the gene expression profiling and siRNA assays, the inhibition of MPS1 by SP600125 led to a reduction in cell viability and a significant increase in cell death, selectively in TP53-mutated BC cells. Furthermore, the chemical inhibition of MPS1 sensitized BC cells to conventional chemotherapy, particularly taxanes. Our results collectively demonstrate that the TP53-correlated kinase MPS1 is a potential therapeutic target in BC patients with TP53 mutated tumors, and that SP600125 warrants further development in future clinical trials. Copyright © 2014 Federation of European Biochemical Societies. Published by Elsevier B.V. All rights reserved.
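    The per-gene ROC analysis described above can be illustrated with a small sketch. It computes the area under the ROC curve as the probability that a mutated sample has higher expression than a wild-type sample (the Mann-Whitney formulation), then ranks genes by that value. The gene names and expression values are invented for the example, and the authors' actual pipeline may differ.

```python
def auc(scores_pos, scores_neg):
    """AUC as P(positive score > negative score); ties count as one half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def rank_genes_by_auc(expression, is_mutated):
    """Rank genes by how well their expression separates mutated samples.

    expression: dict mapping gene name -> list of expression values
    is_mutated: list of booleans, one per sample, in the same order
    """
    ranked = []
    for gene, values in expression.items():
        pos = [v for v, m in zip(values, is_mutated) if m]
        neg = [v for v, m in zip(values, is_mutated) if not m]
        ranked.append((gene, auc(pos, neg)))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical toy data: three mutated and three wild-type samples.
    status = [True, True, True, False, False, False]
    toy = {"geneA": [5, 6, 7, 1, 2, 3], "geneB": [1, 5, 2, 6, 3, 4]}
    for gene, value in rank_genes_by_auc(toy, status):
        print(gene, round(value, 3))
```

    An AUC far below 0.5 is also informative, since it indicates lower expression in mutated samples; a ranking that should capture both directions could sort by the distance from 0.5 instead.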

  11. An integrative machine learning strategy for improved prediction of essential genes in Escherichia coli metabolism using flux-coupled features.

    PubMed

    Nandi, Sutanu; Subramanian, Abhishek; Sarkar, Ram Rup

    2017-07-25

    Prediction of essential genes helps to identify a minimal set of genes that are absolutely required for the appropriate functioning and survival of a cell. The available machine learning techniques for essential gene prediction have inherent problems, like imbalanced provision of training datasets, biased choice of the best model for a given balanced dataset, choice of a complex machine learning algorithm, and data-based automated selection of biologically relevant features for classification. Here, we propose a simple support vector machine-based learning strategy for the prediction of essential genes in Escherichia coli K-12 MG1655 metabolism that integrates a non-conventional combination of an appropriate sample balanced training set, a unique organism-specific genotype, phenotype attributes that characterize essential genes, and optimal parameters of the learning algorithm to generate the best machine learning model (the model with the highest accuracy among all the models trained for different sample training sets). For the first time, we also introduce flux-coupled metabolic subnetwork-based features for enhancing the classification performance. Our strategy proves to be superior as compared to previous SVM-based strategies in obtaining a biologically relevant classification of genes with high sensitivity and specificity. This methodology was also trained with datasets of other recent supervised classification techniques for essential gene classification and tested using reported test datasets. The testing accuracy was always high as compared to the known techniques, proving that our method outperforms known methods. Observations from our study indicate that essential genes are conserved among homologous bacterial species, demonstrate high codon usage bias, GC content and gene expression, and predominantly possess a tendency to form physiological flux modules in metabolism.

  12. IDPT: Insights into potential intrinsically disordered proteins through transcriptomic analysis of genes for prostate carcinoma epigenetic data.

    PubMed

    Mallik, Saurav; Sen, Sagnik; Maulik, Ujjwal

    2016-07-15

    Involvement of intrinsically disordered proteins (IDPs) with various dreadful diseases like cancer is an interesting research topic. In order to gain novel insights into the regulation of IDPs, in this article, we perform a transcriptomic analysis of mRNAs (genes) for transcripts encoding IDPs on a human multi-omics prostate carcinoma dataset having both gene expression and methylation data. In this regard, firstly the genes that consist of both the expression and methylation data, and that are corresponding to the cancer-related prostate-tissue-specific disordered proteins of MobiDb database, are selected. We apply standard t-test for determining differentially expressed genes as well as differentially methylated genes. A network having these genes and their targeter miRNAs from Diana Tarbase v7.0 database and corresponding Transcription Factors from TRANSFAC and ITFP databases, is then built. Thereafter, we perform literature search, and KEGG pathway and Gene Ontology analyses using DAVID database. Finally, we report several significant potential gene-markers (with the corresponding IDPs) that have inverse relationship between differential expression and methylation patterns, and that are hub genes of the TF-miRNA-gene network. Copyright © 2016 Elsevier B.V. All rights reserved.
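    As a rough illustration of the t-test step and the inverse expression/methylation pattern mentioned above, the sketch below computes a pooled-variance (Student) t statistic for two groups and flags a gene whose expression and methylation change in opposite directions. The function names and sample values are hypothetical; no p-value or multiple-testing correction is computed here, which a real analysis would need.

```python
import math
from statistics import mean, variance


def student_t(group_a, group_b):
    """Two-sample t statistic with pooled variance (equal-variance t-test)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (
        n_a + n_b - 2
    )
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled * (1 / n_a + 1 / n_b))


def has_inverse_pattern(expr_tumor, expr_normal, meth_tumor, meth_normal):
    """True when expression and methylation shift in opposite directions."""
    t_expr = student_t(expr_tumor, expr_normal)
    t_meth = student_t(meth_tumor, meth_normal)
    return t_expr * t_meth < 0


if __name__ == "__main__":
    # Hypothetical values: expression goes up while methylation goes down.
    print(student_t([5, 6, 7], [1, 2, 3]))
    print(has_inverse_pattern([5, 6, 7], [1, 2, 3], [0.2, 0.3, 0.25], [0.7, 0.8, 0.75]))
```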

  13. State Space Model with hidden variables for reconstruction of gene regulatory networks.

    PubMed

    Wu, Xi; Li, Peng; Wang, Nan; Gong, Ping; Perkins, Edward J; Deng, Youping; Zhang, Chaoyang

    2011-01-01

    State Space Model (SSM) is a relatively new approach to inferring gene regulatory networks. It requires less computational time than Dynamic Bayesian Networks (DBN). There are two types of variables in the linear SSM, observed variables and hidden variables. SSM uses an iterative method, namely Expectation-Maximization, to infer regulatory relationships from microarray datasets. The hidden variables cannot be directly observed from experiments. How to determine the number of hidden variables has a significant impact on the accuracy of network inference. In this study, we used SSM to infer gene regulatory networks (GRNs) from synthetic time series datasets, investigated Bayesian Information Criterion (BIC) and Principal Component Analysis (PCA) approaches to determining the number of hidden variables in SSM, and evaluated the performance of SSM in comparison with DBN. True GRNs and synthetic gene expression datasets were generated using GeneNetWeaver. Both DBN and linear SSM were used to infer GRNs from the synthetic datasets. The inferred networks were compared with the true networks. Our results show that inference precision varied with the number of hidden variables. For some regulatory networks, the inference precision of DBN was higher but SSM performed better in other cases. Although the overall performance of the two approaches is comparable, SSM is much faster and capable of inferring much larger networks than DBN. This study provides useful information in handling the hidden variables and improving the inference precision.
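    A minimal sketch of BIC-based selection of the number of hidden variables follows. It assumes that a model has already been fitted for each candidate number of hidden variables and that the maximized log-likelihood and the number of free parameters are available; the numbers used here are invented and the fitting step itself (for example by Expectation-Maximization) is not shown.

```python
import math


def bic(log_likelihood, n_params, n_observations):
    """Bayesian Information Criterion: k * ln(n) - 2 * ln(L). Lower is better."""
    return n_params * math.log(n_observations) - 2.0 * log_likelihood


def select_hidden_dimension(candidates, n_observations):
    """Pick the candidate with the lowest BIC.

    candidates: dict mapping number of hidden variables ->
                (maximized log-likelihood, number of free parameters)
    """
    scores = {
        k: bic(log_lik, n_params, n_observations)
        for k, (log_lik, n_params) in candidates.items()
    }
    best = min(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    # Hypothetical fits for 1, 2 and 3 hidden variables on 100 observations.
    fits = {1: (-520.0, 10), 2: (-480.0, 22), 3: (-475.0, 36)}
    print(select_hidden_dimension(fits, 100))
```

    In this toy example the third hidden variable improves the likelihood only slightly, so the parameter penalty outweighs the gain and two hidden variables are selected.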

  14. The green impact: bacterioplankton response toward a phytoplankton spring bloom in the southern North Sea assessed by comparative metagenomic and metatranscriptomic approaches

    PubMed Central

    Wemheuer, Bernd; Wemheuer, Franziska; Hollensteiner, Jacqueline; Meyer, Frauke-Dorothee; Voget, Sonja; Daniel, Rolf

    2015-01-01

    Phytoplankton blooms exhibit a severe impact on bacterioplankton communities as they change nutrient availabilities and other environmental factors. In the current study, the response of a bacterioplankton community to a Phaeocystis globosa spring bloom was investigated in the southern North Sea. For this purpose, water samples were taken inside and reference samples outside of an algal spring bloom. Structural changes of the bacterioplankton community were assessed by amplicon-based analysis of 16S rRNA genes and transcripts generated from environmental DNA and RNA, respectively. Several marine groups responded to bloom presence. The abundance of the Roseobacter RCA cluster and the SAR92 clade significantly increased in bloom presence in the total and active fraction of the bacterial community. Functional changes were investigated by direct sequencing of environmental DNA and mRNA. The corresponding datasets comprised more than 500 million sequences across all samples. Metatranscriptomic data sets were mapped on representative genomes of abundant marine groups present in the samples and on assembled metagenomic and metatranscriptomic datasets. Differences in gene expression profiles between non-bloom and bloom samples were recorded. The genome-wide gene expression level of Planktomarina temperata, an abundant member of the Roseobacter RCA cluster, was higher inside the bloom. Genes that were differently expressed included transposases, which showed increased expression levels inside the bloom. This might contribute to the adaptation of this organism toward environmental stresses through genome reorganization. In addition, several genes affiliated to the SAR92 clade were significantly upregulated inside the bloom, including genes encoding proteins involved in isoleucine and leucine incorporation. The results obtained provide novel insights into compositional and functional variations of marine bacterioplankton communities in response to a phytoplankton bloom. PMID:26322028

  15. Microarray-based cancer prediction using soft computing approach.

    PubMed

    Wang, Xiaosheng; Gotoh, Osamu

    2009-05-26

    One of the difficulties in using gene expression profiles to predict cancer is how to effectively select a few informative genes to construct accurate prediction models from thousands or tens of thousands of genes. We screen highly discriminative genes and gene pairs to create simple prediction models involving single genes or gene pairs on the basis of a soft computing approach and rough set theory. Accurate cancer prediction is obtained when we apply the simple prediction models to four cancer gene expression datasets: CNS tumor, colon tumor, lung cancer and DLBCL. Some genes closely correlated with the pathogenesis of specific or general cancers are identified. In contrast with other models, our models are simple, effective and robust. Meanwhile, our models are interpretable because they are based on decision rules. Our results demonstrate that very simple models may perform well on molecular cancer prediction and that important gene markers of cancer can be detected if the gene selection approach is chosen reasonably.

  16. SiBIC: a web server for generating gene set networks based on biclusters obtained by maximal frequent itemset mining.

    PubMed

    Takahashi, Kei-ichiro; Takigawa, Ichigaku; Mamitsuka, Hiroshi

    2013-01-01

    Detecting biclusters from expression data is useful, since biclusters are coexpressed genes under only part of all given experimental conditions. We present a software called SiBIC, which from a given expression dataset, first exhaustively enumerates biclusters, which are then merged into rather independent biclusters, which finally are used to generate gene set networks, in which a gene set assigned to one node has coexpressed genes. We evaluated each step of this procedure: 1) significance of the generated biclusters biologically and statistically, 2) biological quality of merged biclusters, and 3) biological significance of gene set networks. We emphasize that gene set networks, in which nodes are not genes but gene sets, can be more compact than usual gene networks, meaning that gene set networks are more comprehensible. SiBIC is available at http://utrecht.kuicr.kyoto-u.ac.jp:8080/miami/faces/index.jsp.

  17. Comparison of TCDD-elicited genome-wide hepatic gene expression in Sprague–Dawley rats and C57BL/6 mice

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Nault, Rance; Kim, Suntae; Zacharewski, Timothy R., E-mail: tzachare@msu.edu

    2013-03-01

    Although the structure and function of the AhR are conserved, emerging evidence suggests that downstream effects are species-specific. In this study, rat hepatic gene expression data from the DrugMatrix database (National Toxicology Program) were compared to mouse hepatic whole-genome gene expression data following treatment with 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD). For the DrugMatrix study, male Sprague–Dawley rats were gavaged daily with 20 μg/kg TCDD for 1, 3 and 5 days, while female C57BL/6 ovariectomized mice were examined 1, 3 and 7 days after a single oral gavage of 30 μg/kg TCDD. A total of 649 rat and 1386 mouse genes (|fold change| ≥ 1.5, P1(t) ≥ 0.99) were differentially expressed following treatment. HomoloGene identified 11,708 orthologs represented across the rat Affymetrix 230 2.0 GeneChip (12,310 total orthologs), and the mouse 4 × 44K v.1 Agilent oligonucleotide array (17,578 total orthologs). Comparative analysis found 563 and 922 orthologs differentially expressed in response to TCDD in the rat and mouse, respectively, with 70 responses associated with immune function and lipid metabolism in common to both. Moreover, QRTPCR analysis of Ceacam1 showed divergent expression (induced in rat; repressed in mouse) functionally consistent with TCDD-elicited hepatic steatosis in the mouse but not the rat. Functional analysis identified orthologs involved in nucleotide binding and acetyltransferase activity in rat, while mouse-specific responses were associated with steroid, phospholipid, fatty acid, and carbohydrate metabolism. These results provide further evidence that TCDD elicits species-specific regulation of distinct gene networks, and outline considerations for future comparisons of publicly available microarray datasets. - Highlights: ► We performed a whole-genome comparison of TCDD-regulated genes in mice and rats. ► Previous species comparisons were extended using data from the DrugMatrix database. ► Less than 15% of TCDD-regulated orthologs were common to mice and rats. ► Considerations for the comparison of publicly available datasets are described.
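    The filtering and cross-species comparison described in this record can be sketched as follows. The thresholds (|fold change| ≥ 1.5 and P1(t) ≥ 0.99) come from the abstract; everything else, including the function names, the convention that repression is written as a negative fold change, the gene identifiers and the ortholog map, is a hypothetical stand-in for the real DrugMatrix, Agilent and HomoloGene data.

```python
def differentially_expressed(results, fc_cutoff=1.5, p1t_cutoff=0.99):
    """Return genes passing both cutoffs.

    results: dict mapping gene -> (signed fold change, P1(t) posterior probability)
    """
    return {
        gene
        for gene, (fold_change, p1t) in results.items()
        if abs(fold_change) >= fc_cutoff and p1t >= p1t_cutoff
    }


def shared_orthologs(rat_hits, mouse_hits, rat_to_mouse):
    """Pairs (rat gene, mouse gene) where both orthologs are differentially expressed."""
    return {
        (rat_gene, rat_to_mouse[rat_gene])
        for rat_gene in rat_hits
        if rat_to_mouse.get(rat_gene) in mouse_hits
    }


if __name__ == "__main__":
    # Hypothetical toy inputs, not values from the study.
    rat = {"r1": (2.0, 0.995), "r2": (-1.8, 0.999), "r3": (1.2, 0.999), "r4": (3.0, 0.5)}
    mouse = {"m1": (-2.5, 0.999), "m3": (1.6, 0.99)}
    orthologs = {"r1": "m1", "r2": "m2"}
    rat_hits = differentially_expressed(rat)
    mouse_hits = differentially_expressed(mouse)
    print(shared_orthologs(rat_hits, mouse_hits, orthologs))
```

    Keeping the sign of the fold change in the input makes it straightforward to extend the comparison to divergent responses, such as a gene induced in one species and repressed in the other.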

  18. Analyzing Large Gene Expression and Methylation Data Profiles Using StatBicRM: Statistical Biclustering-Based Rule Mining

    PubMed Central

    Maulik, Ujjwal; Mallik, Saurav; Mukhopadhyay, Anirban; Bandyopadhyay, Sanghamitra

    2015-01-01

    Microarray and beadchip are two most efficient techniques for measuring gene expression and methylation data in bioinformatics. Biclustering deals with the simultaneous clustering of genes and samples. In this article, we propose a computational rule mining framework, StatBicRM (i.e., statistical biclustering-based rule mining) to identify special type of rules and potential biomarkers using integrated approaches of statistical and binary inclusion-maximal biclustering techniques from the biological datasets. At first, a novel statistical strategy has been utilized to eliminate the insignificant/low-significant/redundant genes in such way that significance level must satisfy the data distribution property (viz., either normal distribution or non-normal distribution). The data is then discretized and post-discretized, consecutively. Thereafter, the biclustering technique is applied to identify maximal frequent closed homogeneous itemsets. Corresponding special type of rules are then extracted from the selected itemsets. Our proposed rule mining method performs better than the other rule mining algorithms as it generates maximal frequent closed homogeneous itemsets instead of frequent itemsets. Thus, it saves elapsed time, and can work on big dataset. Pathway and Gene Ontology analyses are conducted on the genes of the evolved rules using David database. Frequency analysis of the genes appearing in the evolved rules is performed to determine potential biomarkers. Furthermore, we also classify the data to know how much the evolved rules are able to describe accurately the remaining test (unknown) data. Subsequently, we also compare the average classification accuracy, and other related factors with other rule-based classifiers. Statistical significance tests are also performed for verifying the statistical relevance of the comparative results. 
Here, each of the other rule mining methods or rule-based classifiers is also starting with the same post-discretized data-matrix. Finally, we have also included the integrated analysis of gene expression and methylation for determining epigenetic effect (viz., effect of methylation) on gene expression level. PMID:25830807

  19. Analyzing large gene expression and methylation data profiles using StatBicRM: statistical biclustering-based rule mining.

    PubMed

    Maulik, Ujjwal; Mallik, Saurav; Mukhopadhyay, Anirban; Bandyopadhyay, Sanghamitra

    2015-01-01

    Microarray and beadchip are two most efficient techniques for measuring gene expression and methylation data in bioinformatics. Biclustering deals with the simultaneous clustering of genes and samples. In this article, we propose a computational rule mining framework, StatBicRM (i.e., statistical biclustering-based rule mining) to identify special type of rules and potential biomarkers using integrated approaches of statistical and binary inclusion-maximal biclustering techniques from the biological datasets. At first, a novel statistical strategy has been utilized to eliminate the insignificant/low-significant/redundant genes in such way that significance level must satisfy the data distribution property (viz., either normal distribution or non-normal distribution). The data is then discretized and post-discretized, consecutively. Thereafter, the biclustering technique is applied to identify maximal frequent closed homogeneous itemsets. Corresponding special type of rules are then extracted from the selected itemsets. Our proposed rule mining method performs better than the other rule mining algorithms as it generates maximal frequent closed homogeneous itemsets instead of frequent itemsets. Thus, it saves elapsed time, and can work on big dataset. Pathway and Gene Ontology analyses are conducted on the genes of the evolved rules using David database. Frequency analysis of the genes appearing in the evolved rules is performed to determine potential biomarkers. Furthermore, we also classify the data to know how much the evolved rules are able to describe accurately the remaining test (unknown) data. Subsequently, we also compare the average classification accuracy, and other related factors with other rule-based classifiers. Statistical significance tests are also performed for verifying the statistical relevance of the comparative results. 
Here, each of the other rule mining methods or rule-based classifiers is also starting with the same post-discretized data-matrix. Finally, we have also included the integrated analysis of gene expression and methylation for determining epigenetic effect (viz., effect of methylation) on gene expression level.

  20. Biological classification with RNA-Seq data: Can alternatively spliced transcript expression enhance machine learning classifier?

    PubMed

    Johnson, Nathan T; Dhroso, Andi; Hughes, Katelyn J; Korkin, Dmitry

    2018-06-25

    The extent to which the genes are expressed in the cell can be simplistically defined as a function of one or more factors of the environment, lifestyle, and genetics. RNA sequencing (RNA-Seq) is becoming a prevalent approach to quantify gene expression, and is expected to gain better insights to a number of biological and biomedical questions, compared to the DNA microarrays. Most importantly, RNA-Seq allows to quantify expression at the gene and alternative splicing isoform levels. However, leveraging the RNA-Seq data requires development of new data mining and analytics methods. Supervised machine learning methods are commonly used approaches for biological data analysis, and have recently gained attention for their applications to the RNA-Seq data. In this work, we assess the utility of supervised learning methods trained on RNA-Seq data for a diverse range of biological classification tasks. We hypothesize that the isoform-level expression data is more informative for biological classification tasks than the gene-level expression data. Our large-scale assessment is done through utilizing multiple datasets, organisms, lab groups, and RNA-Seq analysis pipelines. Overall, we performed and assessed 61 biological classification problems that leverage three independent RNA-Seq datasets and include over 2,000 samples that come from multiple organisms, lab groups, and RNA-Seq analyses. These 61 problems include predictions of the tissue type, sex, or age of the sample, healthy or cancerous phenotypes and, the pathological tumor stage for the samples from the cancerous tissue. For each classification problem, the performance of three normalization techniques and six machine learning classifiers was explored. We find that for every single classification problem, the isoform-based classifiers outperform or are comparable with gene expression based methods. 
The top-performing supervised learning techniques reached a near perfect classification accuracy, demonstrating the utility of supervised learning for RNA-Seq based data analysis. Published by Cold Spring Harbor Laboratory Press for the RNA Society.

  1. dictyExpress: a Dictyostelium discoideum gene expression database with an explorative data analysis web-based interface.

    PubMed

    Rot, Gregor; Parikh, Anup; Curk, Tomaz; Kuspa, Adam; Shaulsky, Gad; Zupan, Blaz

    2009-08-25

    Bioinformatics often leverages on recent advancements in computer science to support biologists in their scientific discovery process. Such efforts include the development of easy-to-use web interfaces to biomedical databases. Recent advancements in interactive web technologies require us to rethink the standard submit-and-wait paradigm, and craft bioinformatics web applications that share analytical and interactive power with their desktop relatives, while retaining simplicity and availability. We have developed dictyExpress, a web application that features a graphical, highly interactive explorative interface to our database that consists of more than 1000 Dictyostelium discoideum gene expression experiments. In dictyExpress, the user can select experiments and genes, perform gene clustering, view gene expression profiles across time, view gene co-expression networks, perform analyses of Gene Ontology term enrichment, and simultaneously display expression profiles for a selected gene in various experiments. Most importantly, these tasks are achieved through web applications whose components are seamlessly interlinked and immediately respond to events triggered by the user, thus providing a powerful explorative data analysis environment. dictyExpress is a precursor for a new generation of web-based bioinformatics applications with simple but powerful interactive interfaces that resemble that of the modern desktop. While dictyExpress serves mainly the Dictyostelium research community, it is relatively easy to adapt it to other datasets. We propose that the design ideas behind dictyExpress will influence the development of similar applications for other model organisms.

  2. dictyExpress: a Dictyostelium discoideum gene expression database with an explorative data analysis web-based interface

    PubMed Central

    Rot, Gregor; Parikh, Anup; Curk, Tomaz; Kuspa, Adam; Shaulsky, Gad; Zupan, Blaz

    2009-01-01

    Background: Bioinformatics often leverages on recent advancements in computer science to support biologists in their scientific discovery process. Such efforts include the development of easy-to-use web interfaces to biomedical databases. Recent advancements in interactive web technologies require us to rethink the standard submit-and-wait paradigm, and craft bioinformatics web applications that share analytical and interactive power with their desktop relatives, while retaining simplicity and availability. Results: We have developed dictyExpress, a web application that features a graphical, highly interactive explorative interface to our database that consists of more than 1000 Dictyostelium discoideum gene expression experiments. In dictyExpress, the user can select experiments and genes, perform gene clustering, view gene expression profiles across time, view gene co-expression networks, perform analyses of Gene Ontology term enrichment, and simultaneously display expression profiles for a selected gene in various experiments. Most importantly, these tasks are achieved through web applications whose components are seamlessly interlinked and immediately respond to events triggered by the user, thus providing a powerful explorative data analysis environment. Conclusion: dictyExpress is a precursor for a new generation of web-based bioinformatics applications with simple but powerful interactive interfaces that resemble that of the modern desktop. While dictyExpress serves mainly the Dictyostelium research community, it is relatively easy to adapt it to other datasets. We propose that the design ideas behind dictyExpress will influence the development of similar applications for other model organisms. PMID:19706156

  3. Open source machine-learning algorithms for the prediction of optimal cancer drug therapies.

    PubMed

    Huang, Cai; Mezencev, Roman; McDonald, John F; Vannberg, Fredrik

    2017-01-01

    Precision medicine is a rapidly growing area of modern medical science, and open source machine-learning codes promise to be a critical component for the successful development of standardized and automated analysis of patient data. One important goal of precision cancer medicine is the accurate prediction of optimal drug therapies from the genomic profiles of individual patient tumors. We introduce here an open source software platform that employs a highly versatile support vector machine (SVM) algorithm combined with a standard recursive feature elimination (RFE) approach to predict personalized drug responses from gene expression profiles. Drug specific models were built using gene expression and drug response data from the National Cancer Institute panel of 60 human cancer cell lines (NCI-60). The models are highly accurate in predicting the drug responsiveness of a variety of cancer cell lines including those comprising the recent NCI-DREAM Challenge. We demonstrate that predictive accuracy is optimized when the learning dataset utilizes all probe-set expression values from a diversity of cancer cell types without pre-filtering for genes generally considered to be "drivers" of cancer onset/progression. Application of our models to publicly available ovarian cancer (OC) patient gene expression datasets generated predictions consistent with observed responses previously reported in the literature. By making our algorithm "open source", we hope to facilitate its testing in a variety of cancer types and contexts leading to community-driven improvements and refinements in subsequent applications.

  4. Comprehensive curation and analysis of global interaction networks in Saccharomyces cerevisiae

    PubMed Central

    Reguly, Teresa; Breitkreutz, Ashton; Boucher, Lorrie; Breitkreutz, Bobby-Joe; Hon, Gary C; Myers, Chad L; Parsons, Ainslie; Friesen, Helena; Oughtred, Rose; Tong, Amy; Stark, Chris; Ho, Yuen; Botstein, David; Andrews, Brenda; Boone, Charles; Troyanskaya, Olga G; Ideker, Trey; Dolinski, Kara; Batada, Nizar N; Tyers, Mike

    2006-01-01

    Background: The study of complex biological networks and prediction of gene function has been enabled by high-throughput (HTP) methods for detection of genetic and protein interactions. Sparse coverage in HTP datasets may, however, distort network properties and confound predictions. Although a vast number of well substantiated interactions are recorded in the scientific literature, these data have not yet been distilled into networks that enable system-level inference. Results: We describe here a comprehensive database of genetic and protein interactions, and associated experimental evidence, for the budding yeast Saccharomyces cerevisiae, as manually curated from over 31,793 abstracts and online publications. This literature-curated (LC) dataset contains 33,311 interactions, on the order of all extant HTP datasets combined. Surprisingly, HTP protein-interaction datasets currently achieve only around 14% coverage of the interactions in the literature. The LC network nevertheless shares attributes with HTP networks, including scale-free connectivity and correlations between interactions, abundance, localization, and expression. We find that essential genes or proteins are enriched for interactions with other essential genes or proteins, suggesting that the global network may be functionally unified. This interconnectivity is supported by a substantial overlap of protein and genetic interactions in the LC dataset. We show that the LC dataset considerably improves the predictive power of network-analysis approaches. The full LC dataset is available at the BioGRID and SGD databases. Conclusion: Comprehensive datasets of biological interactions derived from the primary literature provide critical benchmarks for HTP methods, augment functional prediction, and reveal system-level attributes of biological networks. PMID:16762047

  5. Geoseq: a tool for dissecting deep-sequencing datasets.

    PubMed

    Gurtowski, James; Cancio, Anthony; Shah, Hardik; Levovitz, Chaya; George, Ajish; Homann, Robert; Sachidanandam, Ravi

    2010-10-12

    Datasets generated on deep-sequencing platforms have been deposited in various public repositories such as the Gene Expression Omnibus (GEO), the Sequence Read Archive (SRA) hosted by the NCBI, or the DNA Data Bank of Japan (DDBJ). Despite being rich data sources, they have not been used much due to the difficulty in locating and analyzing datasets of interest. Geoseq (http://geoseq.mssm.edu) provides a new method of analyzing short reads from deep sequencing experiments. Instead of mapping the reads to reference genomes or sequences, Geoseq maps a reference sequence against the sequencing data. It is web-based, and holds pre-computed data from public libraries. The analysis reduces the input sequence to tiles and measures the coverage of each tile in a sequence library through the use of suffix arrays. The user can upload custom target sequences or use gene/miRNA names for the search and get back results as plots and spreadsheet files. Geoseq organizes the public sequencing data using a controlled vocabulary, allowing identification of relevant libraries by organism, tissue and type of experiment. Analysis of small sets of sequences against deep-sequencing datasets, as well as identification of public datasets of interest, is simplified by Geoseq. We applied Geoseq to (a) identify differential isoform expression in mRNA-seq datasets, (b) identify miRNAs (microRNAs) in libraries and identify mature and star sequences in miRNAs, and (c) identify potentially mis-annotated miRNAs. The ease of using Geoseq for these analyses suggests its utility and uniqueness as an analysis tool.
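
    The tiling and suffix-array lookup described in the Geoseq record above can be sketched in a few lines of Python. This is a toy illustration only, not Geoseq's implementation: the function names, the tile length, the '$' read separator, and the naive suffix-array construction are all assumptions made for the example.

```python
def build_suffix_array(text):
    """Naive suffix array: start positions of all suffixes, sorted lexicographically.
    Fine for a toy library; real tools use far more efficient constructions."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text, suffix_array, pattern):
    """Count occurrences of `pattern` in `text` with two binary searches
    over the sorted suffixes (lower bound and upper bound)."""
    m = len(pattern)
    lo, hi = 0, len(suffix_array)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(suffix_array)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo - start


def tile_coverage(reference, reads, tile_len=4, step=1):
    """Cut `reference` into tiles and count each tile's occurrences in the reads.
    Reads are joined with '$' (hypothetical separator) so that a tile made of
    A/C/G/T can never match across a read boundary."""
    library = "$".join(reads)
    suffix_array = build_suffix_array(library)
    tiles = [reference[i:i + tile_len]
             for i in range(0, len(reference) - tile_len + 1, step)]
    return [(tile, count_occurrences(library, suffix_array, tile)) for tile in tiles]
```

    For instance, `tile_coverage("ACGTA", ["ACGTAC", "GTACGT"], tile_len=4)` returns `[("ACGT", 2), ("CGTA", 1)]`: the first tile occurs once in each read, the second only in the first read.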

  6. Determining Physical Mechanisms of Gene Expression Regulation from Single Cell Gene Expression Data.

    PubMed

    Ezer, Daphne; Moignard, Victoria; Göttgens, Berthold; Adryan, Boris

    2016-08-01

    Many genes are expressed in bursts, which can contribute to cell-to-cell heterogeneity. It is now possible to measure this heterogeneity with high throughput single cell gene expression assays (single cell qPCR and RNA-seq). These experimental approaches generate gene expression distributions which can be used to estimate the kinetic parameters of gene expression bursting, namely the rate that genes turn on, the rate that genes turn off, and the rate of transcription. We construct a complete pipeline for the analysis of single cell qPCR data that uses the mathematics behind bursty expression to develop more accurate and robust algorithms for analyzing the origin of heterogeneity in experimental samples, specifically an algorithm for clustering cells by their bursting behavior (Simulated Annealing for Bursty Expression Clustering, SABEC) and a statistical tool for comparing the kinetic parameters of bursty expression across populations of cells (Estimation of Parameter changes in Kinetics, EPiK). We applied these methods to hematopoiesis, including a new single cell dataset in which transcription factors (TFs) involved in the earliest branchpoint of blood differentiation were individually up- and down-regulated. We could identify two unique sub-populations within a seemingly homogenous group of hematopoietic stem cells. In addition, we could predict regulatory mechanisms controlling the expression levels of eighteen key hematopoietic transcription factors throughout differentiation. Detailed information about gene regulatory mechanisms can therefore be obtained simply from high throughput single cell gene expression data, which should be widely applicable given the rapid expansion of single cell genomics.
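
    The three kinetic parameters named in the record above (the rate at which a gene turns on, the rate at which it turns off, and the rate of transcription) are those of the standard two-state "telegraph" model of bursty expression. The following is a minimal forward simulation of that model, assuming a Gillespie-style stochastic simulation and an mRNA degradation rate `k_deg` that the abstract does not mention. It is not the SABEC or EPiK algorithm; all names are hypothetical.

```python
import random


def simulate_telegraph(k_on, k_off, k_tx, k_deg, t_end, seed=0):
    """Simulate one cell under the two-state bursting model and return the
    mRNA count at time t_end. The gene starts in the 'off' state with no mRNA."""
    rng = random.Random(seed)
    t, gene_on, mrna = 0.0, False, 0
    while True:
        rates = [
            0.0 if gene_on else k_on,   # gene switches on
            k_off if gene_on else 0.0,  # gene switches off
            k_tx if gene_on else 0.0,   # one transcript is produced
            k_deg * mrna,               # one transcript is degraded (assumed process)
        ]
        total = sum(rates)
        if total == 0.0:
            return mrna                 # nothing can happen any more
        t += rng.expovariate(total)     # waiting time to the next event
        if t > t_end:
            return mrna
        r = rng.random() * total        # pick which event fired
        if r < rates[0]:
            gene_on = True
        elif r < rates[0] + rates[1]:
            gene_on = False
        elif r < rates[0] + rates[1] + rates[2]:
            mrna += 1
        else:
            mrna = max(mrna - 1, 0)
```

    Running the function over many seeds gives a simulated expression distribution across cells. At steady state the mean of this model is (k_tx / k_deg) * k_on / (k_on + k_off), which is a convenient sanity check before attempting any parameter estimation from real single cell data.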

  7. Constrained clusters of gene expression profiles with pathological features.

    PubMed

    Sese, Jun; Kurokawa, Yukinori; Monden, Morito; Kato, Kikuya; Morishita, Shinichi

    2004-11-22

    Gene expression profiles should be useful in distinguishing variations in disease, since they reflect accurately the status of cells. The primary clustering of gene expression reveals the genotypes that are responsible for the proximity of members within each cluster, while further clustering elucidates the pathological features of the individual members of each cluster. However, since the first clustering process and the second classification step, in which the features are associated with clusters, are performed independently, the initial set of clusters may omit genes that are associated with pathologically meaningful features. Therefore, it is important to devise a way of identifying gene expression clusters that are associated with pathological features. We present the novel technique of 'itemset constrained clustering' (IC-Clustering), which computes the optimal cluster that maximizes the interclass variance of gene expression between groups, which are divided according to the restriction that only divisions that can be expressed using common features are allowed. This constraint automatically labels each cluster with a set of pathological features which characterize that cluster. When applied to liver cancer datasets, IC-Clustering revealed informative gene expression clusters, which could be annotated with various pathological features, such as 'tumor' and 'man', or 'except tumor' and 'normal liver function'. In contrast, the k-means method overlooked these clusters.
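
    The idea of restricting splits to those expressible by shared features, and scoring each split by interclass variance, can be illustrated with a brute-force sketch. This is not the IC-Clustering algorithm from the record above; it simply enumerates small itemsets, and the function names, the two-group split, and the `max_size` limit are assumptions for illustration.

```python
from itertools import combinations


def mean_vector(rows):
    """Column-wise mean of a list of equal-length expression vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def interclass_variance(groups):
    """Sum over groups of (group size) x squared distance between the
    group mean and the grand mean."""
    grand = mean_vector([row for group in groups for row in group])
    total = 0.0
    for group in groups:
        mu = mean_vector(group)
        total += len(group) * sum((a - b) ** 2 for a, b in zip(mu, grand))
    return total


def best_itemset_split(expression, features, max_size=2):
    """Try every itemset of up to `max_size` features; samples carrying all
    features in the itemset form one group, the rest form the other.
    Returns (best score, itemset that labels the cluster)."""
    items = sorted(set().union(*features))
    best_score, best_itemset = float("-inf"), None
    for size in range(1, max_size + 1):
        for itemset in combinations(items, size):
            required = set(itemset)
            inside = [e for e, f in zip(expression, features) if required <= f]
            outside = [e for e, f in zip(expression, features) if not required <= f]
            if not inside or not outside:
                continue  # not a real split
            score = interclass_variance([inside, outside])
            if score > best_score:
                best_score, best_itemset = score, itemset
    return best_score, best_itemset
```

    Because every candidate cluster is defined by an itemset, the winning cluster comes with its own label (for example `("tumor",)`), which mirrors the automatic annotation the abstract describes.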

  8. MAAMD: a workflow to standardize meta-analyses and comparison of affymetrix microarray data

    PubMed Central

    2014-01-01

    Background Mandatory deposit of raw microarray data files for public access, prior to study publication, provides significant opportunities to conduct new bioinformatics analyses within and across multiple datasets. Analysis of raw microarray data files (e.g. Affymetrix CEL files) can be time consuming and complex, and requires fundamental computational and bioinformatics skills. The development of analytical workflows to automate these tasks simplifies the processing of, improves the efficiency of, and serves to standardize multiple and sequential analyses. Once installed, workflows facilitate the tedious steps required to run rapid intra- and inter-dataset comparisons. Results We developed a workflow to facilitate and standardize Meta-Analysis of Affymetrix Microarray Data analysis (MAAMD) in Kepler. Two freely available stand-alone software tools, R and AltAnalyze, were embedded in MAAMD. The inputs of MAAMD are user-editable csv files, which contain sample information and parameters describing the locations of input files and required tools. MAAMD was tested by analyzing 4 different GEO datasets from mice and Drosophila. MAAMD automates data downloading, data organization, data quality control assessment, differential gene expression analysis, clustering analysis, pathway visualization, gene-set enrichment analysis, and cross-species orthologous-gene comparisons. MAAMD was utilized to identify gene orthologues responding to hypoxia or hyperoxia in both mice and Drosophila. The entire set of analyses for 4 datasets (34 total microarrays) finished in about one hour. Conclusions MAAMD saves time, minimizes the required computer skills, and offers a standardized procedure for users to analyze microarray datasets and make new intra- and inter-dataset comparisons. PMID:24621103

  9. Comprehensive Evaluation of the Contribution of X Chromosome Genes to Platinum Sensitivity

    PubMed Central

    Gamazon, Eric R.; Im, Hae Kyung; O’Donnell, Peter H.; Ziliak, Dana; Stark, Amy L.; Cox, Nancy J.; Dolan, M. Eileen; Huang, Rong Stephanie

    2011-01-01

    Utilizing a genome-wide gene expression dataset generated from Affymetrix GeneChip® Human Exon 1.0ST array, we comprehensively surveyed the role of 322 X chromosome gene expression traits on cellular sensitivity to cisplatin and carboplatin. We identified 31 and 17 X chromosome genes whose expression levels are significantly correlated (after multiple testing correction) with sensitivity to carboplatin and cisplatin, respectively, in the combined HapMap CEU and YRI populations (false discovery rate, FDR<0.05). Of those, 14 overlap for both cisplatin and carboplatin. Employing an independent gene expression quantification method, the Illumina Sentrix Human-6 Expression BeadChip, measured on the same HapMap cell lines, we found that 4 and 2 of these genes are significantly associated with carboplatin and cisplatin sensitivity respectively in both analyses. Two genes, CTPS2 and DLG3, were identified by both genome-wide gene expression analyses as correlated with cellular sensitivity to both platinating agents. The expression of DLG3 gene was also found to correlate with cellular sensitivity to platinating agents in NCI60 cancer cell lines. In addition, we evaluated the role of X chromosome gene expression to the observed differences in sensitivity to the platinums between CEU and YRI derived cell lines. Of the 34 distinct genes significantly correlated with either carboplatin or cisplatin sensitivity, 14 are differentially expressed (defined as p<0.05) between CEU and YRI. Thus, sex chromosome genes play a role in cellular sensitivity to platinating agents and differences in the expression level of these genes are an important source of variation that should be included in comprehensive pharmacogenomic studies. PMID:21252287
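
    The record above reports genes at a false discovery rate below 0.05 after multiple testing correction but does not say which procedure was used. As one common possibility, the Benjamini-Hochberg step-up adjustment is sketched here; treating it as the method behind these results is an assumption, and the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the same order as the input.
    The adjusted value for the p-value of rank r (1 = smallest) among m tests is
    min over ranks >= r of p * m / rank, capped at 1."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value to the smallest
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

    A gene would then be called significant at FDR < 0.05 when its adjusted value is below 0.05, for example `[g for g, q in zip(genes, benjamini_hochberg(pvals)) if q < 0.05]` with `genes` and `pvals` standing in for the per-gene test results.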

  10. Identifying candidate drivers of drug response in heterogeneous cancer by mining high throughput genomics data.

    PubMed

    Nabavi, Sheida

    2016-08-15

    With advances in technologies, huge amounts of multiple types of high-throughput genomics data are available. These data have tremendous potential to identify new and clinically valuable biomarkers to guide the diagnosis, assessment of prognosis, and treatment of complex diseases, such as cancer. Integrating, analyzing, and interpreting big and noisy genomics data to obtain biologically meaningful results, however, remains highly challenging. Mining genomics datasets by utilizing advanced computational methods can help to address these issues. To facilitate the identification of a short list of biologically meaningful genes as candidate drivers of anti-cancer drug resistance from an enormous amount of heterogeneous data, we employed statistical machine-learning techniques and integrated genomics datasets. We developed a computational method that integrates gene expression, somatic mutation, and copy number aberration data of sensitive and resistant tumors. In this method, an integrative approach based on module network analysis is applied to identify potential driver genes. This is followed by cross-validation and a comparison of the results of the sensitive and resistant groups to obtain the final list of candidate biomarkers. We applied this method to the ovarian cancer data from The Cancer Genome Atlas. The final result contains biologically relevant genes, such as COL11A1, which has been reported as a cis-platinum resistance biomarker for epithelial ovarian carcinoma in several recent studies. The described method yields a short list of aberrant genes that also control the expression of their co-regulated genes. The results suggest that the unbiased data-driven computational method can identify biologically relevant candidate biomarkers. It can be utilized in a wide range of applications that compare two conditions with highly heterogeneous datasets.

  11. Temporal Expression of Peripheral Blood Leukocyte Biomarkers in a Macaca fascicularis Infection Model of Tuberculosis; Comparison with Human Datasets and Analysis with Parametric/Non-parametric Tools for Improved Diagnostic Biomarker Identification

    PubMed Central

    Wareham, Alice; Lewandowski, Kuiama S.; Williams, Ann; Dennis, Michael J.; Sharpe, Sally; Vipond, Richard; Silman, Nigel; Ball, Graham

    2016-01-01

    A temporal study of gene expression in peripheral blood leukocytes (PBLs) from a Mycobacterium tuberculosis primary, pulmonary challenge model Macaca fascicularis has been conducted. PBL samples were taken prior to challenge and at one, two, four and six weeks post-challenge and labelled, purified RNAs hybridised to Operon Human Genome AROS V4.0 slides. Data analyses revealed a large number of differentially regulated gene entities, which exhibited temporal profiles of expression across the time course study. Further data refinements identified groups of key markers showing group-specific expression patterns, with a substantial reprogramming event evident at the four to six week interval. Selected statistically-significant gene entities from this study and other immune and apoptotic markers were validated using qPCR, which confirmed many of the results obtained using microarray hybridisation. These showed evidence of a step-change in gene expression from an ‘early’ FOS-associated response, to a ‘late’ predominantly type I interferon-driven response, with coincident reduction of expression of other markers. Loss of T-cell-associated marker expression was observed in responsive animals, with concordant elevation of markers which may be associated with a myeloid suppressor cell phenotype e.g. CD163. The animals in the study were of different lineages and these Chinese and Mauritian cynomolgus macaque lines showed clear evidence of differing susceptibilities to tuberculosis challenge. We determined a number of key differences in response profiles between the groups, particularly in expression of T-cell and apoptotic markers, amongst others. These have provided interesting insights into innate susceptibility related to different host phenotypes. Using a combination of parametric and non-parametric artificial neural network analyses we have identified key genes and regulatory pathways which may be important in early and adaptive responses to TB. Using comparisons between data outputs of each analytical pipeline and comparisons with previously published Human TB datasets, we have delineated a subset of gene entities which may be of use for biomarker diagnostic test development. PMID:27228113

  12. Integrative Analysis of Cancer Diagnosis Studies with Composite Penalization

    PubMed Central

    Liu, Jin; Huang, Jian; Ma, Shuangge

    2013-01-01

    Summary In cancer diagnosis studies, high-throughput gene profiling has been extensively conducted, searching for genes whose expressions may serve as markers. Data generated from such studies have the “large d, small n” feature, with the number of genes profiled much larger than the sample size. Penalization has been extensively adopted for simultaneous estimation and marker selection. Because of small sample sizes, markers identified from the analysis of single datasets can be unsatisfactory. A cost-effective remedy is to conduct integrative analysis of multiple heterogeneous datasets. In this article, we investigate composite penalization methods for estimation and marker selection in integrative analysis. The proposed methods use the minimax concave penalty (MCP) as the outer penalty. Under the homogeneity model, the ridge penalty is adopted as the inner penalty. Under the heterogeneity model, the Lasso penalty and MCP are adopted as the inner penalty. Effective computational algorithms based on coordinate descent are developed. Numerical studies, including simulation and analysis of practical cancer datasets, show satisfactory performance of the proposed methods. PMID:24578589
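
    For readers unfamiliar with the minimax concave penalty named in the record above, the sketch below evaluates the standard MCP with regularization parameter `lam` and concavity parameter `gamma`. The second function shows one plausible way an outer MCP could be combined with a ridge-type (Euclidean norm) inner penalty over one gene's coefficients across datasets; the abstract does not give the exact composite form, so that combination and both function names are assumptions.

```python
import math


def mcp_penalty(t, lam, gamma):
    """Minimax concave penalty:
    lam*|t| - t^2 / (2*gamma)   if |t| <= gamma*lam
    gamma * lam^2 / 2           otherwise (the penalty flattens out)."""
    a = abs(t)
    if a <= gamma * lam:
        return lam * a - a * a / (2.0 * gamma)
    return 0.5 * gamma * lam * lam


def group_mcp(coefs, lam, gamma):
    """Assumed composite form: outer MCP applied to the Euclidean norm of one
    gene's coefficients across the datasets being integrated."""
    return mcp_penalty(math.sqrt(sum(b * b for b in coefs)), lam, gamma)
```

    Unlike the Lasso penalty, which keeps growing linearly, the MCP becomes constant once |t| exceeds gamma*lam, so large coefficients are not shrunk further.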

  13. A database for the analysis of immunity genes in Drosophila: PADMA database.

    PubMed

    Lee, Mark J; Mondal, Ariful; Small, Chiyedza; Paddibhatla, Indira; Kawaguchi, Akira; Govind, Shubha

    2011-01-01

    While microarray experiments generate voluminous data, discerning trends that support an existing or alternative paradigm is challenging. To synergize hypothesis building and testing, we designed the Pathogen Associated Drosophila MicroArray (PADMA) database for easy retrieval and comparison of microarray results from immunity-related experiments (www.padmadatabase.org). PADMA also allows biologists to upload their microarray results and compare them with datasets housed within PADMA. We tested PADMA using a preliminary dataset from Ganaspis xanthopoda-infected fly larvae, and uncovered unexpected trends in gene expression, reshaping our hypothesis. Thus, the PADMA database will be a useful resource for fly researchers to evaluate, revise, and refine hypotheses.

  14. T-cell lymphomas associated gene expression signature: Bioinformatics analysis based on gene expression Omnibus.

    PubMed

    Zhou, Lei-Lei; Xu, Xiao-Yue; Ni, Jie; Zhao, Xia; Zhou, Jian-Wei; Feng, Ji-Feng

    2018-06-01

    Due to the low incidence and the heterogeneity of subtypes, the biological process of T-cell lymphomas is largely unknown. Although many genes have been detected in T-cell lymphomas, the role of these genes in the biological process of T-cell lymphomas was not further analyzed. Two qualified datasets were downloaded from the Gene Expression Omnibus database. The biological functions of differentially expressed genes were evaluated by gene ontology enrichment and KEGG pathway analysis. The network for intersection genes was constructed with the Cytoscape v3.0 software. Kaplan-Meier survival curves and the log-rank test were employed to assess the association between differentially expressed genes and clinical characteristics. The intersection mRNAs were shown to be associated with fundamental processes of T-cell lymphoma cells. These intersection mRNAs were involved in the activation of some cancer-related pathways, including the PI3K/AKT, Ras, JAK-STAT, and NF-kappa B signaling pathways. PDGFRA, CXCL12, and CCL19 were the most significant central genes in the signal-net analysis. The results of survival analysis are not entirely credible. Our findings uncovered aberrantly expressed genes and a complex RNA signal network in T-cell lymphomas and indicated cancer-related pathways involved in disease initiation and progression, providing a new insight for biotargeted therapy in T-cell lymphomas. © 2018 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.
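
    The Kaplan-Meier curves mentioned in the record above are built with the product-limit estimator. A minimal sketch follows, assuming right-censored data given as two parallel lists (the names `times` and `events` are hypothetical); the log-rank test and any grouping by gene expression are left out.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.
    times:  follow-up time for each patient
    events: 1 if the event (e.g. death) was observed at that time, 0 if censored
    Returns a list of (time, survival probability) at each distinct event time."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        while i < len(data) and data[i][0] == t:  # gather ties at time t
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed  # events and censored patients both leave the risk set
    return curve
```

    Censored patients contribute to the number at risk up to their last follow-up but never trigger a drop in the curve, which is why small or heavily censored cohorts give wide uncertainty around the estimate.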

  15. DGEM--a microarray gene expression database for primary human disease tissues.

    PubMed

    Xia, Yuni; Campen, Andrew; Rigsby, Dan; Guo, Ying; Feng, Xingdong; Su, Eric W; Palakal, Mathew; Li, Shuyu

    2007-01-01

    Gene expression patterns can reflect gene regulations in human tissues under normal or pathologic conditions. Gene expression profiling data from studies of primary human disease samples are particularly valuable since these studies often span many years in order to collect patient clinical information and achieve a large sample size. Disease-to-Gene Expression Mapper (DGEM) provides a beneficial community resource to access and analyze these data; it currently includes Affymetrix oligonucleotide array datasets for more than 40 human diseases and 1400 samples. The data are normalized to the same scale and stored in a relational database. A statistical-analysis pipeline was implemented to identify genes abnormally expressed in disease tissues or genes whose expressions are associated with clinical parameters such as cancer patient survival. Data-mining results can be queried through a web-based interface at http://dgem.dhcp.iupui.edu/. The query tool enables dynamic generation of graphs and tables that are further linked to major gene and pathway resources that connect the data to relevant biology, including Entrez Gene and Kyoto Encyclopedia of Genes and Genomes (KEGG). In summary, DGEM provides scientists and physicians a valuable tool to study disease mechanisms, to discover potential disease biomarkers for diagnosis and prognosis, and to identify novel gene targets for drug discovery. The source code is freely available for non-profit use, on request to the authors.

  16. In silico gene expression profiling in Cannabis sativa.

    PubMed

    Massimino, Luca

    2017-01-01

    The cannabis plant and its active ingredients (i.e., cannabinoids and terpenoids) have been socially stigmatized for half a century. Luckily, with more than 430,000 published scientific papers and about 600 ongoing and completed clinical trials, nowadays cannabis is employed for the treatment of many different medical conditions. Nevertheless, even if a large amount of high-throughput functional genomic data exists, most researchers have a strong background in molecular biology but lack advanced bioinformatics skills. In this work, publicly available gene expression datasets have been analyzed, giving rise to a total of 40,224 gene expression profiles taken from cannabis plant tissue at different developmental stages. The resource presented here will provide researchers with a starting point for future investigations with Cannabis sativa.

  17. Picking Cell Lines for High-Throughput Transcriptomic Toxicity ...

    EPA Pesticide Factsheets

    High throughput, whole genome transcriptomic profiling is a promising approach to comprehensively evaluate chemicals for potential biological effects. To be useful for in vitro toxicity screening, gene expression must be quantified in a set of representative cell types that captures the diversity of potential responses across chemicals. The ideal dataset to select these cell types would consist of hundreds of cell types treated with thousands of chemicals, but does not yet exist. However, basal gene expression data may be useful as a surrogate for representing the relevant biological space necessary for cell type selection. The goal of this study was to identify a small (< 20) number of cell types that capture a large, quantifiable fraction of basal gene expression diversity. Three publicly available collections of Affymetrix U133+2.0 cellular gene expression data were used: 1) 59 cell lines from the NCI60 set; 2) 303 primary cell types from the Mabbott et al (2013) expression atlas; and 3) 1036 cell lines from the Cancer Cell Line Encyclopedia. The data were RMA normalized, log-transformed, and the probe sets mapped to HUGO gene identifiers. The results showed that <20 cell lines capture only a small fraction of the total diversity in basal gene expression when evaluated using either the entire set of 20960 HUGO genes or a subset of druggable genes likely to be chemical targets. The fraction of the total gene expression variation explained was consistent when

  18. Comparisons between Arabidopsis thaliana and Drosophila melanogaster in relation to Coding and Noncoding Sequence Length and Gene Expression

    PubMed Central

    Caldwell, Rachel; Lin, Yan-Xia; Zhang, Ren

    2015-01-01

    There is a continuing interest in the analysis of gene architecture and gene expression to determine the relationship that may exist. Advances in high-quality sequencing technologies and large-scale resource datasets have increased the understanding of relationships and cross-referencing of expression data to the large genome data. Although a negative correlation between expression level and gene (especially transcript) length has been generally accepted, there have been some conflicting results arising from the literature concerning the impacts of different regions of genes, and the underlying reason is not well understood. The research aims to apply quantile regression techniques for statistical analysis of coding and noncoding sequence length and gene expression data in the plant, Arabidopsis thaliana, and fruit fly, Drosophila melanogaster, to determine if a relationship exists and if there is any variation or similarities between these species. The quantile regression analysis found that the coding sequence length and gene expression correlations varied, and similarities emerged for the noncoding sequence length (5′ and 3′ UTRs) between animal and plant species. In conclusion, the information described in this study provides the basis for further exploration into gene regulation with regard to coding and noncoding sequence length. PMID:26114098

  19. Floral pathway integrator gene expression mediates gradual transmission of environmental and endogenous cues to flowering time.

    PubMed

    van Dijk, Aalt D J; Molenaar, Jaap

    2017-01-01

    The appropriate timing of flowering is crucial for the reproductive success of plants. Hence, intricate genetic networks integrate various environmental and endogenous cues such as temperature or hormonal status. These signals integrate into a network of floral pathway integrator genes. At a quantitative level, it is currently unclear how the impact of genetic variation in signaling pathways on flowering time is mediated by floral pathway integrator genes. Here, using datasets available from literature, we connect Arabidopsis thaliana flowering time in genetic backgrounds varying in upstream signalling components with the expression levels of floral pathway integrator genes in these genetic backgrounds. Our modelling results indicate that flowering time depends in a quite linear way on expression levels of floral pathway integrator genes. This gradual, proportional response of flowering time to upstream changes enables a gradual adaptation to changing environmental factors such as temperature and light.

  20. Hybrid Binary Imperialist Competition Algorithm and Tabu Search Approach for Feature Selection Using Gene Expression Data.

    PubMed

    Wang, Shuaiqun; Aorigele; Kong, Wei; Zeng, Weiming; Hong, Xiaomin

    2016-01-01

    Gene expression data composed of thousands of genes play an important role in classification platforms and disease diagnosis. Hence, it is vital to select a small subset of salient features from a large number of gene expression data. Lately, many researchers have devoted themselves to feature selection using diverse computational intelligence methods. However, in the process of selecting informative genes, many computational methods face difficulties in selecting small subsets for cancer classification due to the huge number of genes (high dimension) compared to the small number of samples, noisy genes, and irrelevant genes. In this paper, we propose a new hybrid algorithm, HICATS, incorporating the imperialist competition algorithm (ICA), which performs global search, and tabu search (TS), which conducts fine-tuned search. In order to verify the performance of the proposed algorithm HICATS, we have tested it on 10 well-known benchmark gene expression classification datasets with dimensions varying from 2308 to 12600. The performance of our proposed method proved to be superior to other related works, including the conventional version of the binary optimization algorithm, in terms of classification accuracy and the number of selected genes.

  1. Hybrid Binary Imperialist Competition Algorithm and Tabu Search Approach for Feature Selection Using Gene Expression Data

    PubMed Central

    Aorigele; Zeng, Weiming; Hong, Xiaomin

    2016-01-01

    Gene expression data composed of thousands of genes play an important role in classification platforms and disease diagnosis. Hence, it is vital to select a small subset of salient features from a large number of gene expression data. Lately, many researchers have devoted themselves to feature selection using diverse computational intelligence methods. However, in the process of selecting informative genes, many computational methods face difficulties in selecting small subsets for cancer classification due to the huge number of genes (high dimension) compared to the small number of samples, noisy genes, and irrelevant genes. In this paper, we propose a new hybrid algorithm, HICATS, incorporating the imperialist competition algorithm (ICA), which performs global search, and tabu search (TS), which conducts fine-tuned search. In order to verify the performance of the proposed algorithm HICATS, we have tested it on 10 well-known benchmark gene expression classification datasets with dimensions varying from 2308 to 12600. The performance of our proposed method proved to be superior to other related works, including the conventional version of the binary optimization algorithm, in terms of classification accuracy and the number of selected genes. PMID:27579323

  2. A Pipeline for High-Throughput Concentration Response Modeling of Gene Expression for Toxicogenomics

    PubMed Central

    House, John S.; Grimm, Fabian A.; Jima, Dereje D.; Zhou, Yi-Hui; Rusyn, Ivan; Wright, Fred A.

    2017-01-01

    Cell-based assays are an attractive option to measure gene expression response to exposure, but the cost of whole-transcriptome RNA sequencing has been a barrier to the use of gene expression profiling for in vitro toxicity screening. In addition, standard RNA sequencing adds variability due to variable transcript length and amplification. Targeted probe-sequencing technologies such as TempO-Seq, with transcriptomic representation that can vary from hundreds of genes to the entire transcriptome, may reduce some components of variation. Analyses of high-throughput toxicogenomics data require renewed attention to read-calling algorithms and simplified dose–response modeling for datasets with relatively few samples. Using data from induced pluripotent stem cell-derived cardiomyocytes treated with chemicals at varying concentrations, we describe here and make available a pipeline for handling expression data generated by TempO-Seq to align reads, clean and normalize raw count data, identify differentially expressed genes, and calculate transcriptomic concentration–response points of departure. The methods are extensible to other forms of concentration–response gene-expression data, and we discuss the utility of the methods for assessing variation in susceptibility and the diseased cellular state. PMID:29163636

  3. A curated compendium of monocyte transcriptome datasets of relevance to human monocyte immunobiology research

    PubMed Central

    Rinchai, Darawan; Boughorbel, Sabri; Presnell, Scott; Quinn, Charlie; Chaussabel, Damien

    2016-01-01

    Systems-scale profiling approaches have become widely used in translational research settings. The resulting accumulation of large-scale datasets in public repositories represents a critical opportunity to promote insight and foster knowledge discovery. However, resources that can serve as an interface between biomedical researchers and such vast and heterogeneous dataset collections are needed in order to fulfill this potential. Recently, we have developed an interactive data browsing and visualization web application, the Gene Expression Browser (GXB). This tool can be used to overlay deep molecular phenotyping data with rich contextual information about analytes, samples and studies along with ancillary clinical or immunological profiling data. In this note, we describe a curated compendium of 93 public datasets generated in the context of human monocyte immunological studies, representing a total of 4,516 transcriptome profiles. Datasets were uploaded to an instance of GXB along with study description and sample annotations. Study samples were arranged in different groups. Ranked gene lists were generated based on relevant group comparisons. This resource is publicly available online at http://monocyte.gxbsidra.org/dm3/landing.gsp. PMID:27158452

  4. TESTING HIGH-DIMENSIONAL COVARIANCE MATRICES, WITH APPLICATION TO DETECTING SCHIZOPHRENIA RISK GENES

    PubMed Central

    Zhu, Lingxue; Lei, Jing; Devlin, Bernie; Roeder, Kathryn

    2017-01-01

    Scientists routinely compare gene expression levels in cases versus controls in part to determine genes associated with a disease. Similarly, detecting case-control differences in co-expression among genes can be critical to understanding complex human diseases; however, statistical methods have been limited by the high dimensional nature of this problem. In this paper, we construct a sparse-Leading-Eigenvalue-Driven (sLED) test for comparing two high-dimensional covariance matrices. By focusing on the spectrum of the differential matrix, sLED provides a novel perspective that accommodates what we assume to be common, namely sparse and weak signals in gene expression data, and it is closely related to Sparse Principal Component Analysis. We prove that sLED achieves full power asymptotically under mild assumptions, and simulation studies verify that it outperforms other existing procedures under many biologically plausible scenarios. Applying sLED to the largest gene-expression dataset obtained from post-mortem brain tissue from Schizophrenia patients and controls, we provide a novel list of genes implicated in Schizophrenia and reveal intriguing patterns in gene co-expression change for Schizophrenia subjects. We also illustrate that sLED can be generalized to compare other gene-gene “relationship” matrices that are of practical interest, such as the weighted adjacency matrices. PMID:29081874
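
    The core idea in the record above, testing two covariance matrices through the spectrum of their difference, can be sketched without the sparsity constraint. The code below is a simplified, non-sparse stand-in and not the sLED procedure itself: it uses the largest absolute eigenvalue of the differential matrix as the test statistic and a permutation of group labels to obtain a p-value. The permutation scheme, the function names, and the use of plain power iteration are assumptions for illustration.

```python
import random


def covariance(rows):
    """Sample covariance matrix of a list of equal-length observation vectors."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(p)] for i in range(p)]


def spectral_norm(matrix, iterations=200, seed=0):
    """Largest absolute eigenvalue of a symmetric matrix, found by power
    iteration on matrix^2 (which avoids sign oscillation for negative eigenvalues)."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in matrix]

    def matvec(x):
        return [sum(a * b for a, b in zip(row, x)) for row in matrix]

    for _ in range(iterations):
        w = matvec(matvec(v))
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    return sum(x * x for x in matvec(v)) ** 0.5


def differential_statistic(cases, controls):
    """Spectral norm of D = Cov(cases) - Cov(controls)."""
    cov_a, cov_b = covariance(cases), covariance(controls)
    diff = [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(cov_a, cov_b)]
    return spectral_norm(diff)


def permutation_test(cases, controls, n_perm=200, seed=0):
    """Return (observed statistic, permutation p-value)."""
    rng = random.Random(seed)
    observed = differential_statistic(cases, controls)
    pooled = list(cases) + list(controls)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign case/control labels at random
        stat = differential_statistic(pooled[:len(cases)], pooled[len(cases):])
        if stat >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)
```

    With thousands of genes and few samples, the unconstrained leading eigenvalue is dominated by noise, which is the motivation the abstract gives for restricting attention to sparse directions.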

  5. TESTING HIGH-DIMENSIONAL COVARIANCE MATRICES, WITH APPLICATION TO DETECTING SCHIZOPHRENIA RISK GENES.

    PubMed

    Zhu, Lingxue; Lei, Jing; Devlin, Bernie; Roeder, Kathryn

    2017-09-01

    Scientists routinely compare gene expression levels in cases versus controls in part to determine genes associated with a disease. Similarly, detecting case-control differences in co-expression among genes can be critical to understanding complex human diseases; however statistical methods have been limited by the high dimensional nature of this problem. In this paper, we construct a sparse-Leading-Eigenvalue-Driven (sLED) test for comparing two high-dimensional covariance matrices. By focusing on the spectrum of the differential matrix, sLED provides a novel perspective that accommodates what we assume to be common, namely sparse and weak signals in gene expression data, and it is closely related with Sparse Principal Component Analysis. We prove that sLED achieves full power asymptotically under mild assumptions, and simulation studies verify that it outperforms other existing procedures under many biologically plausible scenarios. Applying sLED to the largest gene-expression dataset obtained from post-mortem brain tissue from Schizophrenia patients and controls, we provide a novel list of genes implicated in Schizophrenia and reveal intriguing patterns in gene co-expression change for Schizophrenia subjects. We also illustrate that sLED can be generalized to compare other gene-gene "relationship" matrices that are of practical interest, such as the weighted adjacency matrices.

  6. Comparative analysis of human protein-coding and noncoding RNAs between brain and 10 mixed cell lines by RNA-Seq.

    PubMed

    Chen, Geng; Yin, Kangping; Shi, Leming; Fang, Yuanzhang; Qi, Ya; Li, Peng; Luo, Jian; He, Bing; Liu, Mingyao; Shi, Tieliu

    2011-01-01

    In their expression process, different genes can generate diverse functional products, including various protein-coding or noncoding RNAs. Here, we investigated the protein-coding capacities and the expression levels of their isoforms for human known genes, the conservation and disease association of long noncoding RNAs (ncRNAs) with two transcriptome sequencing datasets from human brain tissues and 10 mixed cell lines. Comparative analysis revealed that about two-thirds of the genes expressed between brain and cell lines are the same, but less than one-third of their isoforms are identical. Besides those genes specially expressed in brain and cell lines, about 66% of genes expressed in common encoded different isoforms. Moreover, most genes dominantly expressed one isoform and some genes only generated protein-coding (or noncoding) RNAs in one sample but not in another. We found 282 human genes could encode both protein-coding and noncoding RNAs through alternative splicing in the two samples. We also identified more than 1,000 long ncRNAs, and most of those long ncRNAs contain conserved elements across either 46 vertebrates or 33 placental mammals or 10 primates. Further analysis showed that some long ncRNAs differentially expressed in human breast cancer or lung cancer, several of those differentially expressed long ncRNAs were validated by RT-PCR. In addition, those validated differentially expressed long ncRNAs were found significantly correlated with certain breast cancer or lung cancer related genes, indicating the important biological relevance between long ncRNAs and human cancers. Our findings reveal that the differences of gene expression profile between samples mainly result from the expressed gene isoforms, and highlight the importance of studying genes at the isoform level for completely illustrating the intricate transcriptome.

  7. Beta-Poisson model for single-cell RNA-seq data analyses.

    PubMed

    Vu, Trung Nghia; Wills, Quin F; Kalari, Krishna R; Niu, Nifang; Wang, Liewei; Rantalainen, Mattias; Pawitan, Yudi

    2016-07-15

    Single-cell RNA-sequencing technology allows detection of gene expression at the single-cell level. One typical feature of the data is a bimodality in the cellular distribution even for highly expressed genes, primarily caused by a proportion of non-expressing cells. The standard and the over-dispersed gamma-Poisson models that are commonly used in bulk-cell RNA-sequencing are not able to capture this property. We introduce a beta-Poisson mixture model that can capture the bimodality of the single-cell gene expression distribution. We further integrate the model into the generalized linear model framework in order to perform differential expression analyses. The whole analytical procedure is called BPSC. The results from several real single-cell RNA-seq datasets indicate that ∼90% of the transcripts are well characterized by the beta-Poisson model; the model-fit from BPSC is better than the fit of the standard gamma-Poisson model in > 80% of the transcripts. Moreover, in differential expression analyses of simulated and real datasets, BPSC performs well against edgeR, a conventional method widely used in bulk-cell RNA-sequencing data, and against scde and MAST, two recent methods specifically designed for single-cell RNA-seq data. An R package BPSC for model fitting and differential expression analyses of single-cell RNA-seq data is available under GPL-3 license at https://github.com/nghiavtr/BPSC. Contact: yudi.pawitan@ki.se or mattias.rantalainen@ki.se. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
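
    A small simulation can show why a beta-Poisson mixture produces the bimodality mentioned above. The sketch assumes a basic three-parameter form in which each cell draws an activity level from Beta(alpha, beta) and its count from Poisson(lam × activity). This parameterization, the function names, and the parameter values are illustrative assumptions and are not taken from the BPSC package.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson(rate) count with Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_beta_poisson(alpha, beta, lam, n_cells, seed=0):
    """Simulate counts for one gene across cells (assumed 3-parameter form)."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_cells):
        activity = rng.betavariate(alpha, beta)  # per-cell expression activity
        counts.append(sample_poisson(lam * activity, rng))
    return counts


# With alpha and beta below 1 the Beta density is U-shaped, so cells pile up
# near "off" (counts close to 0) and near "on" (counts close to lam).
counts = sample_beta_poisson(alpha=0.3, beta=0.3, lam=50.0, n_cells=5000)
zero_fraction = sum(c == 0 for c in counts) / len(counts)
mean_count = sum(counts) / len(counts)
print(f"fraction of zero counts: {zero_fraction:.3f}")
print(f"mean count: {mean_count:.2f}")
```

    A plain Poisson model with the same mean would produce essentially no zero counts, which is the mismatch that motivates the mixture.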

  8. Differential reconstructed gene interaction networks for deriving toxicity threshold in chemical risk assessment.

    PubMed

    Yang, Yi; Maxwell, Andrew; Zhang, Xiaowei; Wang, Nan; Perkins, Edward J; Zhang, Chaoyang; Gong, Ping

    2013-01-01

    Pathway alterations reflected as changes in gene expression regulation and gene interaction can result from cellular exposure to toxicants. Such information is often used to elucidate toxicological modes of action. From a risk assessment perspective, alterations in biological pathways are a rich resource for setting toxicant thresholds, which may be more sensitive and mechanism-informed than traditional toxicity endpoints. Here we developed a novel differential networks (DNs) approach to connect pathway perturbation with toxicity threshold setting. Our DNs approach consists of 6 steps: time-series gene expression data collection, identification of altered genes, gene interaction network reconstruction, differential edge inference, mapping of genes with differential edges to pathways, and establishment of causal relationships between chemical concentration and perturbed pathways. A one-sample Gaussian process model and a linear regression model were used to identify genes that exhibited significant profile changes across an entire time course and between treatments, respectively. Interaction networks of differentially expressed (DE) genes were reconstructed for different treatments using a state space model and then compared to infer differential edges/interactions. DE genes possessing differential edges were mapped to biological pathways in databases such as KEGG pathways. Using the DNs approach, we analyzed a time-series Escherichia coli live cell gene expression dataset consisting of 4 treatments (control, 10, 100, 1000 mg/L naphthenic acids, NAs) and 18 time points. Through comparison of reconstructed networks and construction of differential networks, 80 genes were identified as DE genes with a significant number of differential edges, and 22 KEGG pathways were altered in a concentration-dependent manner. 
    Some of these pathways were perturbed to a degree as high as 70% even at the lowest exposure concentration, implying a high sensitivity of our DNs approach. Findings from this proof-of-concept study suggest that our approach has a great potential in providing a novel and sensitive tool for threshold setting in chemical risk assessment. In future work, we plan to analyze more time-series datasets with a full spectrum of concentrations and sufficient replications per treatment. The pathway alteration-derived thresholds will also be compared with those derived from apical endpoints such as cell growth rate.

  9. Transcriptional over-expression of chloride intracellular channels 3 and 4 in malignant pleural mesothelioma.

    PubMed

    Tasiopoulou, Vasiliki; Magouliotis, Dimitrios; Solenov, Evgeniy I; Vavougios, Georgios; Molyvdas, Paschalis-Adam; Gourgoulianis, Konstantinos I; Hatzoglou, Chrissi; Zarogiannis, Sotirios G

    2015-12-01

    Chloride Intracellular Channels (CLICs) contribute to the regulation of multiple cellular functions. CLICs have been found over-expressed in several malignancies, and therefore they are currently considered as potential drug targets. The goal of our study was to assess the gene expression levels of CLICs 1-6 in malignant pleural mesothelioma (MPM) as compared to controls. We used gene expression data from a publicly available microarray dataset comparing MPM versus healthy tissue in order to investigate the differential expression profile of CLIC 1-6. False discovery rates were calculated, the interactome of the significantly differentially expressed CLICs was constructed, and Functional Enrichment Analysis for Gene Ontologies (FEAGO) was performed. In MPM, the gene expressions of CLIC3 and CLIC4 were significantly increased compared to controls (p=0.001 and p<0.001 respectively). A significant positive correlation between the gene expressions of CLIC3 and CLIC4 (p=0.0008 and Pearson's r=0.51) was found. Deming regression analysis provided an association equation between the CLIC3 and CLIC4 gene expressions: CLIC3 = 4.42 × CLIC4 − 10.07. Our results indicate that CLIC3 and CLIC4 are over-expressed in human MPM. Moreover, their expressions correlate, suggesting that they either share common gene expression inducers or that their products act synergistically. FEAGO showed that the CLIC interactome might contribute to TGF beta signaling and water transport. Copyright © 2015 Elsevier Ltd. All rights reserved.
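
    For readers unfamiliar with Deming regression, the following sketch shows the closed-form fit, which allows for measurement error in both variables. The abstract does not state the error-variance ratio that was used, so `delta = 1` (orthogonal regression) is an assumption. The data points are synthetic and were placed exactly on the reported line purely to check the formula; they are not the study's measurements.

```python
import math


def deming_fit(x, y, delta=1.0):
    """Closed-form Deming regression of y on x.

    delta is the ratio of the error variance in y to the error variance
    in x; delta = 1 gives orthogonal regression (an assumption here).
    Returns (slope, intercept).
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xx = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    s_yy = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    if s_xy == 0:
        raise ValueError("x and y are uncorrelated; slope is undefined")
    diff = s_yy - delta * s_xx
    slope = (diff + math.sqrt(diff * diff + 4 * delta * s_xy * s_xy)) / (2 * s_xy)
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic CLIC4 values (hypothetical) and CLIC3 values lying on the
# line reported in the abstract.
clic4 = [2.0, 3.0, 4.0, 5.0]
clic3 = [4.42 * value - 10.07 for value in clic4]

slope, intercept = deming_fit(clic4, clic3)
print(f"CLIC3 = {slope:.2f} * CLIC4 + ({intercept:.2f})")
```

    Unlike ordinary least squares, swapping the roles of the two genes (and inverting `delta`) yields the same fitted line, which is why the method suits two noisy expression measurements.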

  10. Altered metabolic pathways in clear cell renal cell carcinoma: A meta-analysis and validation study focused on the deregulated genes and their associated networks

    PubMed Central

    Zaravinos, Apostolos; Pieri, Myrtani; Mourmouras, Nikos; Anastasiadou, Natassa; Zouvani, Ioanna; Delakas, Dimitris; Deltas, Constantinos

    2014-01-01

    Clear cell renal cell carcinoma (ccRCC) is the predominant subtype of renal cell carcinoma (RCC). It is one of the most therapy-resistant carcinomas, responding very poorly or not at all to radiotherapy, hormonal therapy and chemotherapy. A more comprehensive understanding of the deregulated pathways in ccRCC can lead to the development of new therapies and prognostic markers. We performed a meta-analysis of 5 publicly available gene expression datasets and identified a list of co-deregulated genes, for which we performed extensive bioinformatic analysis coupled with experimental validation on the mRNA level. Gene ontology enrichment showed that many proteins are involved in response to hypoxia/oxygen levels and positive regulation of the VEGFR signaling pathway. KEGG analysis revealed that metabolic pathways are mostly altered in ccRCC. Similarly, Ingenuity Pathway Analysis showed that the antigen presentation, inositol metabolism, pentose phosphate, glycolysis/gluconeogenesis and fructose/mannose metabolism pathways are altered in the disease. Cellular growth, proliferation and carbohydrate metabolism were among the top molecular and cellular functions of the co-deregulated genes. qRT-PCR validated the deregulated expression of several genes in Caki-2 and ACHN cell lines and in a cohort of ccRCC tissues. Increased expression of NNMT and NR3C1 was evident by immunohistochemistry in ccRCC biopsies from patients. ROC curves evaluated the diagnostic performance of the top deregulated genes in each dataset. We show that metabolic pathways are mostly deregulated in ccRCC and we highlight those most responsible for its formation. We suggest that these genes are candidate predictive markers of the disease. PMID:25594006

  11. Statistical modeling of isoform splicing dynamics from RNA-seq time series data.

    PubMed

    Huang, Yuanhua; Sanguinetti, Guido

    2016-10-01

    Isoform quantification is an important goal of RNA-seq experiments, yet it remains problematic for genes with low expression or several isoforms. These difficulties may in principle be ameliorated by exploiting correlated experimental designs, such as time series or dosage response experiments. Time series RNA-seq experiments, in particular, are becoming increasingly popular, yet there are no methods that explicitly leverage the experimental design to improve isoform quantification. Here, we present DICEseq, the first isoform quantification method tailored to correlated RNA-seq experiments. DICEseq explicitly models the correlations between different RNA-seq experiments to aid the quantification of isoforms across experiments. Numerical experiments on simulated datasets show that DICEseq yields more accurate results than state-of-the-art methods, an advantage that can become considerable at low coverage levels. On real datasets, our results show that DICEseq provides substantially more reproducible and robust quantifications, increasing the correlation of estimates from replicate datasets by up to 10% on genes with low or moderate expression levels (bottom third of all genes). Furthermore, DICEseq makes it possible to quantify the trade-off between temporal sampling of RNA and depth of sequencing, frequently an important choice when planning experiments. Our results have strong implications for the design of RNA-seq experiments, and offer a novel tool for improved analysis of such datasets. Python code is freely available at http://diceseq.sf.net. Contact: G.Sanguinetti@ed.ac.uk. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  12. Identifying Epigenetic Biomarkers using Maximal Relevance and Minimal Redundancy Based Feature Selection for Multi-Omics Data.

    PubMed

    Mallik, Saurav; Bhadra, Tapas; Maulik, Ujjwal

    2017-01-01

    Epigenetic Biomarker discovery is an important task in bioinformatics. In this article, we develop a new framework of identifying statistically significant epigenetic biomarkers using maximal-relevance and minimal-redundancy criterion based feature (gene) selection for multi-omics dataset. Firstly, we determine the genes that have both expression as well as methylation values, and follow normal distribution. Similarly, we identify the genes which consist of both expression and methylation values, but do not follow normal distribution. For each case, we utilize a gene-selection method that provides maximal-relevant, but variable-weighted minimum-redundant genes as top ranked genes. For statistical validation, we apply t-test on both the expression and methylation data consisting of only the normally distributed top ranked genes to determine how many of them are both differentially expressed and methylated. Similarly, we utilize Limma package for performing non-parametric Empirical Bayes test on both expression and methylation data comprising only the non-normally distributed top ranked genes to identify how many of them are both differentially expressed and methylated. We finally report the top-ranking significant gene-markers with biological validation. Moreover, our framework improves positive predictive rate and reduces false positive rate in marker identification. In addition, we provide a comparative analysis of our gene-selection method as well as other methods based on classification performances obtained using several well-known classifiers.

  13. Analysis of multiplex gene expression maps obtained by voxelation.

    PubMed

    An, Li; Xie, Hongbo; Chin, Mark H; Obradovic, Zoran; Smith, Desmond J; Megalooikonomou, Vasileios

    2009-04-29

    Gene expression signatures in the mammalian brain hold the key to understanding neural development and neurological disease. Researchers have previously used voxelation in combination with microarrays for acquisition of genome-wide atlases of expression patterns in the mouse brain. On the other hand, some work has been performed on studying gene functions, without taking into account the location information of a gene's expression in a mouse brain. In this paper, we present an approach for identifying the relation between gene expression maps obtained by voxelation and gene functions. To analyze the dataset, we chose typical genes as queries and aimed at discovering similar gene groups. Gene similarity was determined by using the wavelet features extracted from the left and right hemispheres averaged gene expression maps, and by the Euclidean distance between each pair of feature vectors. We also performed a multiple clustering approach on the gene expression maps, combined with hierarchical clustering. Among each group of similar genes and clusters, the gene function similarity was measured by calculating the average gene function distances in the gene ontology structure. By applying our methodology to find similar genes to certain target genes we were able to improve our understanding of gene expression patterns and gene functions. By applying the clustering analysis method, we obtained significant clusters, which have both very similar gene expression maps and very similar gene functions respectively to their corresponding gene ontologies. The cellular component ontology resulted in prominent clusters expressed in cortex and corpus callosum. The molecular function ontology gave prominent clusters in cortex, corpus callosum and hypothalamus. The biological process ontology resulted in clusters in cortex, hypothalamus and choroid plexus. Clusters from all three ontologies combined were most prominently expressed in cortex and corpus callosum. 
    The experimental results confirm the hypothesis that genes with similar gene expression maps might have similar gene functions. The voxelation data takes into account the location information of gene expression level in mouse brain, which is novel in related research. The proposed approach can potentially be used to predict gene functions and provide helpful suggestions to biologists.

  14. Evaluating methods of inferring gene regulatory networks highlights their lack of performance for single cell gene expression data.

    PubMed

    Chen, Shuonan; Mar, Jessica C

    2018-06-19

    A fundamental fact in biology states that genes do not operate in isolation, and yet, methods that infer regulatory networks for single cell gene expression data have been slow to emerge. With single cell sequencing methods now becoming accessible, general network inference algorithms that were initially developed for data collected from bulk samples may not be suitable for single cells. Meanwhile, although methods that are specific for single cell data are now emerging, whether they have improved performance over general methods is unknown. In this study, we evaluate the applicability of five general methods and three single cell methods for inferring gene regulatory networks from both experimental single cell gene expression data and in silico simulated data. Standard evaluation metrics using ROC curves and Precision-Recall curves against reference sets sourced from the literature demonstrated that most of the methods performed poorly when they were applied to either experimental single cell data, or simulated single cell data, which demonstrates their lack of performance for this task. Using default settings, network methods were applied to the same datasets. Comparisons of the learned networks highlighted the uniqueness of some predicted edges for each method. The fact that different methods infer networks that vary substantially reflects the underlying mathematical rationale and assumptions that distinguish network methods from each other. This study provides a comprehensive evaluation of network modeling algorithms applied to experimental single cell gene expression data and in silico simulated datasets where the network structure is known. Comparisons demonstrate that most of these assessed network methods are not able to predict network structures from single cell expression data accurately, even if they are specifically developed for single cell methods. 
    Also, single cell methods, which usually depend on more elaborate algorithms, in general have less similarity to each other in the sets of edges detected. The results from this study emphasize the importance of developing more accurate optimized network modeling methods that are compatible with single cell data. Newly-developed single cell methods may uniquely capture particular features of potential gene-gene relationships, and caution should be taken when we interpret these results.

  15. Multiple hot-deck imputation for network inference from RNA sequencing data.

    PubMed

    Imbert, Alyssa; Valsesia, Armand; Le Gall, Caroline; Armenise, Claudia; Lefebvre, Gregory; Gourraud, Pierre-Antoine; Viguerie, Nathalie; Villa-Vialaneix, Nathalie

    2018-05-15

    Network inference provides a global view of the relations existing between gene expression in a given transcriptomic experiment (often only for a restricted list of chosen genes). However, it is still a challenging problem: even if the cost of sequencing techniques has decreased over the last years, the number of samples in a given experiment is still (very) small compared to the number of genes. We propose a method to increase the reliability of the inference when RNA-seq expression data have been measured together with an auxiliary dataset that can provide external information on gene expression similarity between samples. Our statistical approach, hd-MI, is based on imputation for samples without available RNA-seq data that are considered as missing data but are observed on the secondary dataset. hd-MI can improve the reliability of the inference for missing rates up to 30% and provides more stable networks with a smaller number of false positive edges. From a biological point of view, hd-MI was also found relevant to infer networks from RNA-seq data acquired in adipose tissue during a nutritional intervention in obese individuals. In these networks, novel links between genes were highlighted, as well as an improved comparability between the two steps of the nutritional intervention. Software and sample data are available as an R package, RNAseqNet, that can be downloaded from the Comprehensive R Archive Network (CRAN). Contact: alyssa.imbert@inra.fr or nathalie.villa-vialaneix@inra.fr. Supplementary data are available at Bioinformatics online.
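
    The general mechanism of multiple hot-deck imputation can be sketched as follows. This is a generic nearest-donor scheme, not the hd-MI algorithm or the RNAseqNet API: a sample without RNA-seq data borrows the expression profile of a randomly chosen donor among the samples closest to it on the auxiliary dataset, and the draw is repeated to obtain several completed datasets. The function name, the Euclidean distance, the donor-pool size, and the toy values are all assumptions.

```python
import math
import random


def hot_deck_impute(expression, auxiliary, n_donors=2, n_imputations=5, seed=0):
    """Return a list of completed expression datasets.

    expression: dict sample -> expression vector (sequenced samples only)
    auxiliary:  dict sample -> auxiliary vector (all samples)
    """
    rng = random.Random(seed)
    observed = sorted(expression)
    missing = sorted(s for s in auxiliary if s not in expression)
    completed = []
    for _ in range(n_imputations):
        filled = {s: list(values) for s, values in expression.items()}
        for sample in missing:
            # Rank potential donors by similarity on the auxiliary data.
            ranked = sorted(
                observed,
                key=lambda donor: math.dist(auxiliary[sample], auxiliary[donor]),
            )
            donor = rng.choice(ranked[:n_donors])
            filled[sample] = list(expression[donor])
        completed.append(filled)
    return completed


# Toy example (hypothetical values): s4 has auxiliary data but no RNA-seq.
toy_expression = {"s1": [5.0, 1.0], "s2": [6.0, 2.0], "s3": [0.0, 9.0]}
toy_auxiliary = {"s1": [0.1], "s2": [0.2], "s3": [5.0], "s4": [0.15]}

datasets = hot_deck_impute(toy_expression, toy_auxiliary)
for index, data in enumerate(datasets):
    print(index, data["s4"])
```

    In a multiple-imputation workflow, a network would then be inferred on each completed dataset and only edges that recur across imputations would be retained; that aggregation step is left out of the sketch.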

  16. VaDiR: an integrated approach to Variant Detection in RNA.

    PubMed

    Neums, Lisa; Suenaga, Seiji; Beyerlein, Peter; Anders, Sara; Koestler, Devin; Mariani, Andrea; Chien, Jeremy

    2018-02-01

    Advances in next-generation DNA sequencing technologies are now enabling detailed characterization of sequence variations in cancer genomes. With whole-genome sequencing, variations in coding and non-coding sequences can be discovered. But the cost associated with it is currently limiting its general use in research. Whole-exome sequencing is used to characterize sequence variations in coding regions, but the cost associated with capture reagents and biases in capture rate limit its full use in research. Additional limitations include uncertainty in assigning the functional significance of the mutations when these mutations are observed in the non-coding region or in genes that are not expressed in cancer tissue. We investigated the feasibility of uncovering mutations from expressed genes using RNA sequencing datasets with a method called Variant Detection in RNA (VaDiR) that integrates 3 variant callers, namely: SNPiR, RVBoost, and MuTect2. The combination of all 3 methods, which we called Tier 1 variants, produced the highest precision with true positive mutations from RNA-seq that could be validated at the DNA level. We also found that the integration of Tier 1 variants with those called by MuTect2 and SNPiR produced the highest recall with acceptable precision. Finally, we observed a higher rate of mutation discovery in genes that are expressed at higher levels. Our method, VaDiR, provides a possibility of uncovering mutations from RNA sequencing datasets that could be useful in further functional analysis. In addition, our approach allows orthogonal validation of DNA-based mutation discovery by providing complementary sequence variation analysis from paired RNA/DNA sequencing datasets.
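
    The precision/recall trade-off between a strict consensus and a more permissive combination of callers can be illustrated with plain set operations. The sketch below is not the VaDiR implementation: the variant tuples are made up, and because the abstract does not fully specify how the higher-recall set is composed, the union of the MuTect2 and SNPiR calls is used as one possible reading.

```python
def consensus(*call_sets):
    """Variants reported by every caller (a Tier 1-style intersection)."""
    return set.intersection(*(set(calls) for calls in call_sets))


def precision_recall(called, truth):
    """Precision and recall of a call set against validated variants."""
    true_positives = len(called & truth)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical variants as (chromosome, position, ref, alt) tuples.
A = ("chr1", 100, "A", "G")
B = ("chr2", 200, "C", "T")
C = ("chr3", 300, "G", "A")
D = ("chr4", 400, "T", "C")
E = ("chr5", 500, "A", "T")
F = ("chr6", 600, "C", "G")
G = ("chr7", 700, "G", "T")

validated = {A, B, C, D}   # stand-in for DNA-level validation
snpir_calls = {A, B, C, E}
rvboost_calls = {A, B, F}
mutect2_calls = {A, B, D, G}

tier1 = consensus(snpir_calls, rvboost_calls, mutect2_calls)
relaxed = snpir_calls | mutect2_calls  # assumed reading of the relaxed set

tier1_scores = precision_recall(tier1, validated)
relaxed_scores = precision_recall(relaxed, validated)
print("Tier 1 (precision, recall):", tier1_scores)
print("Relaxed (precision, recall):", relaxed_scores)
```

    Requiring agreement among all callers removes caller-specific false positives at the cost of missing variants that only some callers detect, which matches the direction of the trade-off reported in the abstract.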

  17. Differential Connectivity in Colorectal Cancer Gene Expression Network

    PubMed

    Izadi, Fereshteh

    2018-05-30

    Colorectal cancer (CRC) is one of the challenging types of cancer; thus, exploring effective biomarkers related to colorectal cancer could lead to significant progress toward the treatment of this disease. In the present study, CRC gene expression datasets have been reanalyzed. Mutual differentially expressed genes across 294 normal mucosa and adjacent tumoral samples were then utilized in order to build two independent transcriptional regulatory networks. By analyzing the networks topologically, genes with differential global connectivity related to cancer state were determined for which the potential transcriptional regulators including transcription factors were identified. The majority of differentially connected genes (DCGs) were up-regulated in colorectal transcriptome experiments. Moreover, a number of these genes have been experimentally validated as cancer or CRC-associated genes. The DCGs, including GART, TGFB1, ITGA2, SLC16A5, SOX9, and MMP7, were investigated across 12 cancer types. Functional enrichment analysis followed by detailed data mining showed that these candidate genes could be related to CRC by mediating the metastatic cascade, in addition to sharing pathways with 12 cancer types by triggering inflammatory events. Our study uncovered correlated alterations in gene expression related to CRC susceptibility and progression, for which the candidate biomarkers could provide a link to disease.

  18. Structured association analysis leads to insight into Saccharomyces cerevisiae gene regulation by finding multiple contributing eQTL hotspots associated with functional gene modules.

    PubMed

    Curtis, Ross E; Kim, Seyoung; Woolford, John L; Xu, Wenjie; Xing, Eric P

    2013-03-21

    Association analysis using genome-wide expression quantitative trait locus (eQTL) data investigates the effect that genetic variation has on cellular pathways and leads to the discovery of candidate regulators. Traditional analysis of eQTL data via pairwise statistical significance tests or linear regression does not leverage the availability of the structural information of the transcriptome, such as presence of gene networks that reveal correlation and potentially regulatory relationships among the study genes. We employ a new eQTL mapping algorithm, GFlasso, which we have previously developed for sparse structured regression, to reanalyze a genome-wide yeast dataset. GFlasso fully takes into account the dependencies among expression traits to suppress false positives and to enhance the signal/noise ratio. Thus, GFlasso leverages the gene-interaction network to discover the pleiotropic effects of genetic loci that perturb the expression level of multiple (rather than individual) genes, which enables us to gain more power in detecting previously neglected signals that are marginally weak but pleiotropically significant. While eQTL hotspots in yeast have been reported previously as genomic regions controlling multiple genes, our analysis reveals additional novel eQTL hotspots and, more interestingly, uncovers groups of multiple contributing eQTL hotspots that affect the expression level of functional gene modules. To our knowledge, our study is the first to report this type of gene regulation stemming from multiple eQTL hotspots. Additionally, we report the results from in-depth bioinformatics analysis for three groups of these eQTL hotspots: ribosome biogenesis, telomere silencing, and retrotransposon biology. We suggest candidate regulators for the functional gene modules that map to each group of hotspots. 
    Not only do we find that many of these candidate regulators contain mutations in the promoter and coding regions of the genes, in the case of the Ribi group, we provide experimental evidence suggesting that the identified candidates do regulate the target genes predicted by GFlasso. Thus, this structured association analysis of a yeast eQTL dataset via GFlasso, coupled with extensive bioinformatics analysis, discovers a novel regulation pattern between multiple eQTL hotspots and functional gene modules. Furthermore, this analysis demonstrates the potential of GFlasso as a powerful computational tool for eQTL studies that exploit the rich structural information among expression traits due to correlation, regulation, or other forms of biological dependencies.

  19. An approach for reduction of false predictions in reverse engineering of gene regulatory networks.

    PubMed

    Khan, Abhinandan; Saha, Goutam; Pal, Rajat Kumar

    2018-05-14

    A gene regulatory network discloses the regulatory interactions amongst genes, at a particular condition of the human body. The accurate reconstruction of such networks from time-series genetic expression data using computational tools offers a stiff challenge for contemporary computer scientists. This is crucial to facilitate the understanding of the proper functioning of a living organism. Unfortunately, the computational methods produce many false predictions along with the correct predictions, which is unwanted. Investigations in the domain focus on the identification of as many correct regulations as possible in the reverse engineering of gene regulatory networks to make it more reliable and biologically relevant. One way to achieve this is to reduce the number of incorrect predictions in the reconstructed networks. In the present investigation, we have proposed a novel scheme to decrease the number of false predictions by suitably combining several metaheuristic techniques. We have implemented the same using a dataset ensemble approach (i.e. combining multiple datasets) also. We have employed the proposed methodology on real-world experimental datasets of the SOS DNA Repair network of Escherichia coli and the IMRA network of Saccharomyces cerevisiae. Subsequently, we have experimented upon somewhat larger, in silico networks, namely, DREAM3 and DREAM4 Challenge networks, and 15-gene and 20-gene networks extracted from the GeneNetWeaver database. To study the effect of multiple datasets on the quality of the inferred networks, we have used four datasets in each experiment. The obtained results are encouraging enough as the proposed methodology can reduce the number of false predictions significantly, without using any supplementary prior biological information for larger gene regulatory networks. It is also observed that if a small amount of prior biological information is incorporated here, the results improve further w.r.t. the prediction of true positives. Copyright © 2018 Elsevier Ltd. All rights reserved.

  20. Gene expression profiling of long-lived dwarf mice: longevity-associated genes and relationships with diet, gender and aging

    PubMed Central

    Swindell, William R

    2007-01-01

    Background: Long-lived strains of dwarf mice carry mutations that suppress growth hormone (GH) and insulin-like growth factor I (IGF-I) signaling. The downstream effects of these endocrine abnormalities, however, are not well understood and it is unclear how these processes interact with aging mechanisms. This study presents a comparative analysis of microarray experiments that have measured hepatic gene expression levels in long-lived strains carrying one of four mutations (Prop1df/df, Pit1dw/dw, Ghrhrlit/lit, GHR-KO) and describes how the effects of these mutations relate to one another at the transcriptional level. Points of overlap with the effects of calorie restriction (CR), CR mimetic compounds, low fat diets, gender dimorphism and aging were also examined. Results: All dwarf mutations had larger and more consistent effects on IGF-I expression than dietary treatments. In comparison to dwarf mutations, however, the transcriptional effects of CR (and some CR mimetics) overlapped more strongly with those of aging. Surprisingly, the Ghrhrlit/lit mutation had much larger effects on gene expression than the GHR-KO mutation, even though both mutations affect the same endocrine pathway. Several genes potentially regulated or co-regulated with the IGF-I transcript in liver tissue were identified, including a DNA repair gene (Snm1) that is upregulated in proportion to IGF-I inhibition. A total of 13 genes exhibiting parallel differential expression patterns among all four strains of long-lived dwarf mice were identified, in addition to 30 genes with matching differential expression patterns in multiple long-lived dwarf strains and under CR. Conclusion: Comparative analysis of microarray datasets can identify patterns and consistencies not discernable from any one dataset individually. This study implements new analytical approaches to provide a detailed comparison among the effects of life-extending mutations, dietary treatments, gender and aging. This comparison provides insight into a broad range of issues relevant to the study of mammalian aging. In this context, 43 longevity-associated genes are identified and individual genes with the highest level of support among all microarray experiments are highlighted. These results provide promising targets for future experimental investigation as well as potential clues for understanding the functional basis of lifespan extension in mammalian systems. PMID:17915019

  1. Exploring the key genes and pathways in enchondromas using a gene expression microarray.

    PubMed

    Shi, Zhongju; Zhou, Hengxing; Pan, Bin; Lu, Lu; Kang, Yi; Liu, Lu; Wei, Zhijian; Feng, Shiqing

    2017-07-04

    Enchondromas are the most common primary benign osseous neoplasms that occur in the medullary bone; they can undergo malignant transformation into chondrosarcoma. However, enchondromas are always undetected in patients, and the molecular mechanism is unclear. To identify key genes and pathways associated with the occurrence and development of enchondromas, we downloaded the gene expression dataset GSE22855 and obtained the differentially expressed genes (DEGs) by analyzing high-throughput gene expression in enchondromas. In total, 635 genes were identified as DEGs. Of these, 225 genes (35.43%) were up-regulated, and the remaining 410 genes (64.57%) were down-regulated. We identified the predominant gene ontology (GO) categories and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways that were significantly over-represented in the enchondromas samples compared with the control samples. Subsequently the top 10 core genes were identified from the protein-protein interaction (PPI) network. The enrichment analyses of the genes mainly involved in two significant modules showed that the DEGs were principally related to ribosomes, protein digestion and absorption, ECM-receptor interaction, focal adhesion, amoebiasis and the PI3K-Akt signaling pathway. Together, these data elucidate the molecular mechanisms underlying the occurrence and development of enchondromas and provide promising candidates for therapeutic intervention and prognostic evaluation. However, further experimental studies are needed to confirm these results.
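
    The up- and down-regulated shares quoted in this abstract follow directly from the reported counts. A minimal arithmetic check, using only the numbers stated above (the variable names are illustrative):

```python
# Counts reported in the abstract.
total_degs = 635
up_regulated = 225
down_regulated = total_degs - up_regulated  # the "remaining" genes

# Shares of the DEG list, rounded to two decimals as in the abstract.
up_share = round(100 * up_regulated / total_degs, 2)
down_share = round(100 * down_regulated / total_degs, 2)

print(down_regulated, up_share, down_share)
```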

  2. Identification of ELF3 as an early transcriptional regulator of human urothelium.

    PubMed

    Böck, Matthias; Hinley, Jennifer; Schmitt, Constanze; Wahlicht, Tom; Kramer, Stefan; Southgate, Jennifer

    2014-02-15

    Despite major advances in high-throughput and computational modelling techniques, understanding of the mechanisms regulating tissue specification and differentiation in higher eukaryotes, particularly man, remains limited. Microarray technology has been explored exhaustively in recent years and several standard approaches have been established to analyse the resultant datasets on a genome-wide scale. Gene expression time series offer a valuable opportunity to define temporal hierarchies and gain insight into the regulatory relationships of biological processes. However, unless datasets are exactly synchronous, time points cannot be compared directly. Here we present a data-driven analysis of regulatory elements from a microarray time series that tracked the differentiation of non-immortalised normal human urothelial (NHU) cells grown in culture. The datasets were obtained by harvesting differentiating and control cultures from finite bladder- and ureter-derived NHU cell lines at different time points using two previously validated, independent differentiation-inducing protocols. Due to the asynchronous nature of the data, a novel ranking analysis approach was adopted whereby we compared changes in the amplitude of experiment and control time series to identify common regulatory elements. Our approach offers a simple, fast and effective ranking method for genes that can be applied to other time series. The analysis identified ELF3 as a candidate transcriptional regulator involved in human urothelial cytodifferentiation. Differentiation-associated expression of ELF3 was confirmed in cell culture experiments and by immunohistochemical demonstration in situ. The importance of ELF3 in urothelial differentiation was verified by knockdown in NHU cells, which led to reduced expression of FOXA1 and GRHL3 transcription factors in response to PPARγ activation. 
The consequences of this were seen in the repressed expression of late/terminal differentiation-associated uroplakin 3a gene expression and in the compromised development and regeneration of urothelial barrier function. Copyright © 2014 Elsevier Inc. All rights reserved.
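
    The abstract describes ranking genes by comparing changes in the amplitude of the experiment and control time series. The sketch below is one simple reading of that idea, not the authors' implementation: it assumes "amplitude" means the range (maximum minus minimum) of a gene's expression over its time points, and scores each gene by the difference between the two ranges. The function names and the toy values are hypothetical.

```python
def amplitude(series):
    """Range of a time series (assumed definition of amplitude)."""
    return max(series) - min(series)


def rank_by_amplitude_change(experiment, control):
    """Rank genes by amplitude(experiment) - amplitude(control), largest first.

    Both arguments map gene name -> list of expression values. The two
    series may have different numbers of time points, so asynchronous
    sampling does not matter for this score.
    """
    scores = {
        gene: amplitude(experiment[gene]) - amplitude(control[gene])
        for gene in experiment
        if gene in control
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data (made up for illustration only).
experiment = {"geneA": [1.0, 3.0, 6.0], "geneB": [2.0, 2.5, 2.0], "geneC": [5.0, 4.0, 4.5]}
control = {"geneA": [1.0, 1.5, 1.0, 1.0], "geneB": [2.0, 2.0, 2.5, 2.0], "geneC": [5.0, 3.0, 4.0, 5.0]}

ranking = rank_by_amplitude_change(experiment, control)
```

    Because the score uses only per-series summaries, time points never have to be matched between the experiment and the control, which is the property the abstract emphasises for asynchronous data.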

  3. Differentially expressed microRNAs in lung adenocarcinoma invert effects of copy number aberrations of prognostic genes

    PubMed Central

    Tokar, Tomas; Pastrello, Chiara; Ramnarine, Varune R.; Zhu, Chang-Qi; Craddock, Kenneth J.; Pikor, Larrisa A.; Vucic, Emily A.; Vary, Simon; Shepherd, Frances A.; Tsao, Ming-Sound; Lam, Wan L.; Jurisica, Igor

    2018-01-01

    In many cancers, significantly down- or upregulated genes are found within chromosomal regions with DNA copy number alteration opposite to the expression changes. Generally, this paradox has been overlooked as noise, but can potentially be a consequence of interference of epigenetic regulatory mechanisms, including microRNA-mediated control of mRNA levels. To explore potential associations between microRNAs and paradoxes in non-small-cell lung cancer (NSCLC) we curated and analyzed lung adenocarcinoma (LUAD) data, comprising gene expressions, copy number aberrations (CNAs) and microRNA expressions. We integrated data from 1,062 tumor samples and 241 normal lung samples, including newly-generated array comparative genomic hybridization (aCGH) data from 63 LUAD samples. We identified 85 “paradoxical” genes whose differential expression consistently contrasted with aberrations of their copy numbers. Paradoxical status of 70 out of 85 genes was validated on sample-wise basis using The Cancer Genome Atlas (TCGA) LUAD data. Of these, 41 genes are prognostic and form a clinically relevant signature, which we validated on three independent datasets. By meta-analysis of results from 9 LUAD microRNA expression studies we identified 24 consistently-deregulated microRNAs. Using TCGA-LUAD data we showed that deregulation of 19 of these microRNAs explains differential expression of the paradoxical genes. Our results show that deregulation of paradoxical genes is crucial in LUAD and their expression pattern is maintained epigenetically, defying gene copy number status. PMID:29507679
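
    The notion of a "paradoxical" gene used above (expression change opposite in direction to the copy number change) can be illustrated with a small sketch. It is an assumption-laden simplification: it treats each gene as having a single signed expression change and a single signed copy number change, and it ignores the consistency and significance criteria the study would have applied. All names and values are hypothetical.

```python
def find_paradoxical_genes(expression_change, copy_number_change):
    """Return genes whose expression change has the opposite sign to their
    copy number change. Genes missing from either input, or with a zero
    change in either, are skipped."""
    paradoxical = []
    for gene, expr in expression_change.items():
        cn = copy_number_change.get(gene)
        if cn is None or expr == 0 or cn == 0:
            continue
        if (expr > 0) != (cn > 0):
            paradoxical.append(gene)
    return sorted(paradoxical)


# Made-up signed changes (e.g. log2 fold change, log2 copy ratio).
expression_change = {"G1": 1.5, "G2": -2.0, "G3": 0.8, "G4": -1.1}
copy_number_change = {"G1": -0.6, "G2": -0.4, "G3": 0.5, "G4": 0.7}

paradoxical = find_paradoxical_genes(expression_change, copy_number_change)
```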

  4. Wide-Open: Accelerating public data release by automating detection of overdue datasets

    PubMed Central

    Poon, Hoifung; Howe, Bill

    2017-01-01

    Open data is a vital pillar of open science and a key enabler for reproducibility, data reuse, and novel discoveries. Enforcement of open-data policies, however, largely relies on manual efforts, which invariably lag behind the increasingly automated generation of biological data. To address this problem, we developed a general approach to automatically identify datasets overdue for public release by applying text mining to identify dataset references in published articles and parse query results from repositories to determine if the datasets remain private. We demonstrate the effectiveness of this approach on 2 popular National Center for Biotechnology Information (NCBI) repositories: Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA). Our Wide-Open system identified a large number of overdue datasets, which spurred administrators to respond directly by releasing 400 datasets in one week. PMID:28594819
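
    The two steps named in this abstract (finding dataset references in article text, then checking whether those datasets are still private) can be sketched as follows. This is not the Wide-Open code: the regular expression covers only GEO series accessions (GSE followed by digits) and a few SRA accession prefixes, the article text is invented, and the repository lookup is replaced by a hypothetical, pre-fetched set of accessions known to be public.

```python
import re

# GEO series (GSE...) and SRA study/run/sample/experiment accessions.
ACCESSION_PATTERN = re.compile(r"\b(?:GSE\d+|SR[PRSX]\d+)\b")


def find_accessions(text):
    """Return accession strings in order of first appearance, without repeats."""
    found = []
    for match in ACCESSION_PATTERN.findall(text):
        if match not in found:
            found.append(match)
    return found


def overdue(referenced, public):
    """Accessions cited in a published article but not publicly released.

    `public` stands in for the result of querying the repository; how that
    query is made is outside this sketch.
    """
    return [acc for acc in referenced if acc not in public]


# Invented example text and lookup result.
article_text = (
    "Data were deposited in GEO under GSE12345 and in SRA under "
    "SRP067890; see also GSE12345 for processed values."
)
referenced = find_accessions(article_text)
still_private = overdue(referenced, public={"GSE12345"})
```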

  5. Wide-Open: Accelerating public data release by automating detection of overdue datasets.

    PubMed

    Grechkin, Maxim; Poon, Hoifung; Howe, Bill

    2017-06-01

    Open data is a vital pillar of open science and a key enabler for reproducibility, data reuse, and novel discoveries. Enforcement of open-data policies, however, largely relies on manual efforts, which invariably lag behind the increasingly automated generation of biological data. To address this problem, we developed a general approach to automatically identify datasets overdue for public release by applying text mining to identify dataset references in published articles and parse query results from repositories to determine if the datasets remain private. We demonstrate the effectiveness of this approach on 2 popular National Center for Biotechnology Information (NCBI) repositories: Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA). Our Wide-Open system identified a large number of overdue datasets, which spurred administrators to respond directly by releasing 400 datasets in one week.

  6. Gene expression distribution deconvolution in single-cell RNA sequencing.

    PubMed

    Wang, Jingshu; Huang, Mo; Torre, Eduardo; Dueck, Hannah; Shaffer, Sydney; Murray, John; Raj, Arjun; Li, Mingyao; Zhang, Nancy R

    2018-06-26

    Single-cell RNA sequencing (scRNA-seq) enables the quantification of each gene's expression distribution across cells, thus allowing the assessment of the dispersion, nonzero fraction, and other aspects of its distribution beyond the mean. These statistical characterizations of the gene expression distribution are critical for understanding expression variation and for selecting marker genes for population heterogeneity. However, scRNA-seq data are noisy, with each cell typically sequenced at low coverage, thus making it difficult to infer properties of the gene expression distribution from raw counts. Based on a reexamination of nine public datasets, we propose a simple technical noise model for scRNA-seq data with unique molecular identifiers (UMI). We develop deconvolution of single-cell expression distribution (DESCEND), a method that deconvolves the true cross-cell gene expression distribution from observed scRNA-seq counts, leading to improved estimates of properties of the distribution such as dispersion and nonzero fraction. DESCEND can adjust for cell-level covariates such as cell size, cell cycle, and batch effects. DESCEND's noise model and estimation accuracy are further evaluated through comparisons to RNA FISH data, through data splitting and simulations and through its effectiveness in removing known batch effects. We demonstrate how DESCEND can clarify and improve downstream analyses such as finding differentially expressed genes, identifying cell types, and selecting differentiation markers. Copyright © 2018 the Author(s). Published by PNAS.

  7. Gene expression profiling in the adult Down syndrome brain.

    PubMed

    Lockstone, H E; Harris, L W; Swatton, J E; Wayland, M T; Holland, A J; Bahn, S

    2007-12-01

    The mechanisms by which trisomy 21 leads to the characteristic Down syndrome (DS) phenotype are unclear. We used whole genome microarrays to characterize for the first time the transcriptome of human adult brain tissue (dorsolateral prefrontal cortex) from seven DS subjects and eight controls. These data were coanalyzed with a publicly available dataset from fetal DS tissue and functional profiling was performed to identify the biological processes central to DS and those that may be related to late onset pathologies, particularly Alzheimer disease neuropathology. A total of 685 probe sets were differentially expressed between adult DS and control brains at a stringent significance threshold (adjusted p value (q) < 0.005), 70% of these being up-regulated in DS. Over 25% of genes on chromosome 21 were differentially expressed in comparison to a median of 4.4% for all chromosomes. The unique profile of up-regulation on chromosome 21, consistent with primary dosage effects, was accompanied by widespread transcriptional disruption. The critical Alzheimer disease gene, APP, located on chromosome 21, was not found to be up-regulated in adult brain by microarray or QPCR analysis. However, numerous other genes functionally linked to APP processing were dysregulated. Functional profiling of genes dysregulated in both fetal and adult datasets identified categories including development (notably Notch signaling and Dlx family genes), lipid transport, and cellular proliferation. In the adult brain these processes were concomitant with cytoskeletal regulation and vesicle trafficking categories, and increased immune response and oxidative stress response, which are likely linked to the development of Alzheimer pathology in individuals with DS.

  8. Analyzing Kernel Matrices for the Identification of Differentially Expressed Genes

    PubMed Central

    Xia, Xiao-Lei; Xing, Huanlai; Liu, Xueqin

    2013-01-01

    One of the most important applications of microarray data is the class prediction of biological samples. For this purpose, statistical tests have often been applied to identify the differentially expressed genes (DEGs), followed by the employment of the state-of-the-art learning machines including the Support Vector Machines (SVM) in particular. The SVM is a typical sample-based classifier whose performance comes down to how discriminant samples are. However, DEGs identified by statistical tests are not guaranteed to result in a training dataset composed of discriminant samples. To tackle this problem, a novel gene ranking method namely the Kernel Matrix Gene Selection (KMGS) is proposed. The rationale of the method, which roots in the fundamental ideas of the SVM algorithm, is described. The notion of "the separability of a sample", which is estimated by performing t-like statistics on each column of the kernel matrix, is first introduced. The separability of a classification problem is then measured, from which the significance of a specific gene is deduced. Also described is a method of Kernel Matrix Sequential Forward Selection (KMSFS) which shares the KMGS method's essential ideas but proceeds in a greedy manner. On three public microarray datasets, our proposed algorithms achieved noticeably competitive performance in terms of the B.632+ error rate. PMID:24349110

  9. eMBI: Boosting Gene Expression-based Clustering for Cancer Subtypes.

    PubMed

    Chang, Zheng; Wang, Zhenjia; Ashby, Cody; Zhou, Chuan; Li, Guojun; Zhang, Shuzhong; Huang, Xiuzhen

    2014-01-01

    Identifying clinically relevant subtypes of a cancer using gene expression data is a challenging and important problem in medicine, and is a necessary premise to provide specific and efficient treatments for patients of different subtypes. Matrix factorization provides a solution by finding checker-board patterns in the matrices of gene expression data. In the context of gene expression profiles of cancer patients, these checkerboard patterns correspond to genes that are up- or down-regulated in patients with particular cancer subtypes. Recently, a new matrix factorization framework for biclustering called Maximum Block Improvement (MBI) is proposed; however, it still suffers several problems when applied to cancer gene expression data analysis. In this study, we developed many effective strategies to improve MBI and designed a new program called enhanced MBI (eMBI), which is more effective and efficient to identify cancer subtypes. Our tests on several gene expression profiling datasets of cancer patients consistently indicate that eMBI achieves significant improvements in comparison with MBI, in terms of cancer subtype prediction accuracy, robustness, and running time. In addition, the performance of eMBI is much better than another widely used matrix factorization method called nonnegative matrix factorization (NMF) and the method of hierarchical clustering, which is often the first choice of clinical analysts in practice.

  10. eMBI: Boosting Gene Expression-based Clustering for Cancer Subtypes

    PubMed Central

    Chang, Zheng; Wang, Zhenjia; Ashby, Cody; Zhou, Chuan; Li, Guojun; Zhang, Shuzhong; Huang, Xiuzhen

    2014-01-01

    Identifying clinically relevant subtypes of a cancer using gene expression data is a challenging and important problem in medicine, and is a necessary premise to provide specific and efficient treatments for patients of different subtypes. Matrix factorization provides a solution by finding checker-board patterns in the matrices of gene expression data. In the context of gene expression profiles of cancer patients, these checkerboard patterns correspond to genes that are up- or down-regulated in patients with particular cancer subtypes. Recently, a new matrix factorization framework for biclustering called Maximum Block Improvement (MBI) is proposed; however, it still suffers several problems when applied to cancer gene expression data analysis. In this study, we developed many effective strategies to improve MBI and designed a new program called enhanced MBI (eMBI), which is more effective and efficient to identify cancer subtypes. Our tests on several gene expression profiling datasets of cancer patients consistently indicate that eMBI achieves significant improvements in comparison with MBI, in terms of cancer subtype prediction accuracy, robustness, and running time. In addition, the performance of eMBI is much better than another widely used matrix factorization method called nonnegative matrix factorization (NMF) and the method of hierarchical clustering, which is often the first choice of clinical analysts in practice. PMID:25374455

  11. Estimating replicate time shifts using Gaussian process regression

    PubMed Central

    Liu, Qiang; Andersen, Bogi; Smyth, Padhraic; Ihler, Alexander

    2010-01-01

    Motivation: Time-course gene expression datasets provide important insights into dynamic aspects of biological processes, such as circadian rhythms, cell cycle and organ development. In a typical microarray time-course experiment, measurements are obtained at each time point from multiple replicate samples. Accurately recovering the gene expression patterns from experimental observations is made challenging by both measurement noise and variation among replicates' rates of development. Prior work on this topic has focused on inference of expression patterns assuming that the replicate times are synchronized. We develop a statistical approach that simultaneously infers both (i) the underlying (hidden) expression profile for each gene, as well as (ii) the biological time for each individual replicate. Our approach is based on Gaussian process regression (GPR) combined with a probabilistic model that accounts for uncertainty about the biological development time of each replicate. Results: We apply GPR with uncertain measurement times to a microarray dataset of mRNA expression for the hair-growth cycle in mouse back skin, predicting both profile shapes and biological times for each replicate. The predicted time shifts show high consistency with independently obtained morphological estimates of relative development. We also show that the method systematically reduces prediction error on out-of-sample data, significantly reducing the mean squared error in a cross-validation study. Availability: Matlab code for GPR with uncertain time shifts is available at http://sli.ics.uci.edu/Code/GPRTimeshift/ Contact: ihler@ics.uci.edu PMID:20147305
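
    The model described in this abstract can be written down compactly. The block below is a hedged sketch rather than the paper's exact formulation: the squared-exponential kernel, the Gaussian prior on the shifts, and all symbol names are assumptions chosen for illustration. Observation j of replicate r, nominally taken at time t_{r,j}, is modelled as a noisy reading of a hidden profile f evaluated at a shifted (biological) time.

```latex
% Observation model: replicate r carries an unknown time shift \delta_r
y_{r,j} = f\!\left(t_{r,j} + \delta_r\right) + \varepsilon_{r,j},
\qquad \varepsilon_{r,j} \sim \mathcal{N}(0, \sigma^2)

% Gaussian process prior on the hidden profile (kernel choice assumed)
f \sim \mathcal{GP}\!\left(0, k\right),
\qquad k(t, t') = \sigma_f^2 \exp\!\left(-\frac{(t - t')^2}{2\ell^2}\right)

% Prior expressing uncertainty about each replicate's biological time (assumed form)
\delta_r \sim \mathcal{N}(0, \sigma_\delta^2)

% For fixed shifts, the standard GP posterior mean at a new time t_*
\bar{f}(t_*) = \mathbf{k}_*^{\top} \left(K + \sigma^2 I\right)^{-1} \mathbf{y},
\qquad K_{ab} = k(\tau_a, \tau_b), \quad \tau = t + \delta
```

    With the shifts held fixed, this reduces to ordinary GP regression on the shifted times; inferring the shifts and the profile together is what distinguishes the approach from methods that assume synchronized replicates.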

  12. Multi-tissue analysis of co-expression networks by higher-order generalized singular value decomposition identifies functionally coherent transcriptional modules.

    PubMed

    Xiao, Xiaolin; Moreno-Moral, Aida; Rotival, Maxime; Bottolo, Leonardo; Petretto, Enrico

    2014-01-01

    Recent high-throughput efforts such as ENCODE have generated a large body of genome-scale transcriptional data in multiple conditions (e.g., cell-types and disease states). Leveraging these data is especially important for network-based approaches to human disease, for instance to identify coherent transcriptional modules (subnetworks) that can inform functional disease mechanisms and pathological pathways. Yet, genome-scale network analysis across conditions is significantly hampered by the paucity of robust and computationally-efficient methods. Building on the Higher-Order Generalized Singular Value Decomposition, we introduce a new algorithmic approach for efficient, parameter-free and reproducible identification of network-modules simultaneously across multiple conditions. Our method can accommodate weighted (and unweighted) networks of any size and can similarly use co-expression or raw gene expression input data, without hinging upon the definition and stability of the correlation used to assess gene co-expression. In simulation studies, we demonstrated distinctive advantages of our method over existing methods, which was able to recover accurately both common and condition-specific network-modules without entailing ad-hoc input parameters as required by other approaches. We applied our method to genome-scale and multi-tissue transcriptomic datasets from rats (microarray-based) and humans (mRNA-sequencing-based) and identified several common and tissue-specific subnetworks with functional significance, which were not detected by other methods. In humans we recapitulated the crosstalk between cell-cycle progression and cell-extracellular matrix interactions processes in ventricular zones during neocortex expansion and further, we uncovered pathways related to development of later cognitive functions in the cortical plate of the developing brain which were previously unappreciated. 
Analyses of seven rat tissues identified a multi-tissue subnetwork of co-expressed heat shock protein (Hsp) and cardiomyopathy genes (Bag3, Cryab, Kras, Emd, Plec), which was significantly replicated using separate failing heart and liver gene expression datasets in humans, thus revealing a conserved functional role for Hsp genes in cardiovascular disease.

  13. Forager bees (Apis mellifera) highly express immune and detoxification genes in tissues associated with nectar processing.

    PubMed

    Vannette, Rachel L; Mohamed, Abbas; Johnson, Brian R

    2015-11-09

    Pollinators, including honey bees, routinely encounter potentially harmful microorganisms and phytochemicals during foraging. However, the mechanisms by which honey bees manage these potential threats are poorly understood. In this study, we examine the expression of antimicrobial, immune and detoxification genes in Apis mellifera and compare between forager and nurse bees using tissue-specific RNA-seq and qPCR. Our analysis revealed extensive tissue-specific expression of antimicrobial, immune signaling, and detoxification genes. Variation in gene expression between worker stages was pronounced in the mandibular and hypopharyngeal gland (HPG), where foragers were enriched in transcripts that encode antimicrobial peptides (AMPs) and immune response. Additionally, forager HPGs and mandibular glands were enriched in transcripts encoding detoxification enzymes, including some associated with xenobiotic metabolism. Using qPCR on an independent dataset, we verified differential expression of three AMP and three P450 genes between foragers and nurses. High expression of AMP genes in nectar-processing tissues suggests that these peptides may contribute to antimicrobial properties of honey or to honey bee defense against environmentally-acquired microorganisms. Together, these results suggest that worker role and tissue-specific expression of AMPs, and immune and detoxification enzymes may contribute to defense against microorganisms and xenobiotic compounds acquired while foraging.

  14. Forager bees (Apis mellifera) highly express immune and detoxification genes in tissues associated with nectar processing

    PubMed Central

    Vannette, Rachel L.; Mohamed, Abbas; Johnson, Brian R.

    2015-01-01

    Pollinators, including honey bees, routinely encounter potentially harmful microorganisms and phytochemicals during foraging. However, the mechanisms by which honey bees manage these potential threats are poorly understood. In this study, we examine the expression of antimicrobial, immune and detoxification genes in Apis mellifera and compare between forager and nurse bees using tissue-specific RNA-seq and qPCR. Our analysis revealed extensive tissue-specific expression of antimicrobial, immune signaling, and detoxification genes. Variation in gene expression between worker stages was pronounced in the mandibular and hypopharyngeal gland (HPG), where foragers were enriched in transcripts that encode antimicrobial peptides (AMPs) and immune response. Additionally, forager HPGs and mandibular glands were enriched in transcripts encoding detoxification enzymes, including some associated with xenobiotic metabolism. Using qPCR on an independent dataset, we verified differential expression of three AMP and three P450 genes between foragers and nurses. High expression of AMP genes in nectar-processing tissues suggests that these peptides may contribute to antimicrobial properties of honey or to honey bee defense against environmentally-acquired microorganisms. Together, these results suggest that worker role and tissue-specific expression of AMPs, and immune and detoxification enzymes may contribute to defense against microorganisms and xenobiotic compounds acquired while foraging. PMID:26549293

  15. Transcription Factor Map Alignment of Promoter Regions

    PubMed Central

    Blanco, Enrique; Messeguer, Xavier; Smith, Temple F; Guigó, Roderic

    2006-01-01

    We address the problem of comparing and characterizing the promoter regions of genes with similar expression patterns. This remains a challenging problem in sequence analysis, because often the promoter regions of co-expressed genes do not show discernible sequence conservation. In our approach, thus, we have not directly compared the nucleotide sequence of promoters. Instead, we have obtained predictions of transcription factor binding sites, annotated the predicted sites with the labels of the corresponding binding factors, and aligned the resulting sequences of labels—to which we refer here as transcription factor maps (TF-maps). To obtain the global pairwise alignment of two TF-maps, we have adapted an algorithm initially developed to align restriction enzyme maps. We have optimized the parameters of the algorithm in a small, but well-curated, collection of human–mouse orthologous gene pairs. Results in this dataset, as well as in an independent much larger dataset from the CISRED database, indicate that TF-map alignments are able to uncover conserved regulatory elements, which cannot be detected by the typical sequence alignments. PMID:16733547

  16. Survival, gene and metabolite responses of Litoria verreauxii alpina frogs to fungal disease chytridiomycosis

    NASA Astrophysics Data System (ADS)

    Grogan, Laura F.; Mulvenna, Jason; Gummer, Joel P. A.; Scheele, Ben C.; Berger, Lee; Cashins, Scott D.; McFadden, Michael S.; Harlow, Peter; Hunter, David A.; Trengove, Robert D.; Skerratt, Lee F.

    2018-03-01

    The fungal skin disease chytridiomycosis has caused the devastating decline and extinction of hundreds of amphibian species globally, yet the potential for evolving resistance, and the underlying pathophysiological mechanisms remain poorly understood. We exposed 406 naïve, captive-raised alpine tree frogs (Litoria verreauxii alpina) from multiple populations (one evolutionarily naïve to chytridiomycosis) to the aetiological agent Batrachochytrium dendrobatidis in two concurrent and controlled infection experiments. We investigated (A) survival outcomes and clinical pathogen burdens between populations and clutches, and (B) individual host tissue responses to chytridiomycosis. Here we present multiple interrelated datasets associated with these exposure experiments, including animal signalment, survival and pathogen burden of 355 animals from Experiment A, and the following datasets related to 61 animals from Experiment B: animal signalment and pathogen burden; raw RNA-Seq reads from skin, liver and spleen tissues; de novo assembled transcriptomes for each tissue type; raw gene expression data; annotation data for each gene; and raw metabolite expression data from skin and liver tissues. These data provide an extensive baseline for future analyses.

  17. Integrating Data Clustering and Visualization for the Analysis of 3D Gene Expression Data

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Data Analysis and Visualization; International Research Training Group "Visualization of Large and Unstructured Data Sets," University of Kaiserslautern, Germany; Computational Research Division, Lawrence Berkeley National Laboratory, One Cyclotron Road, Berkeley, CA 94720, USA

    2008-05-12

    The recent development of methods for extracting precise measurements of spatial gene expression patterns from three-dimensional (3D) image data opens the way for new analyses of the complex gene regulatory networks controlling animal development. We present an integrated visualization and analysis framework that supports user-guided data clustering to aid exploration of these new complex datasets. The interplay of data visualization and clustering-based data classification leads to improved visualization and enables a more detailed analysis than previously possible. We discuss (i) integration of data clustering and visualization into one framework; (ii) application of data clustering to 3D gene expression data; (iii) evaluation of the number of clusters k in the context of 3D gene expression clustering; and (iv) improvement of overall analysis quality via dedicated post-processing of clustering results based on visualization. We discuss the use of this framework to objectively define spatial pattern boundaries and temporal profiles of genes and to analyze how mRNA patterns are controlled by their regulatory transcription factors.

  18. Multiparametric Determination of Radiation Risk

    NASA Technical Reports Server (NTRS)

    Richmond, Robert C.

    2003-01-01

    Predicting risk of human cancer following exposure to ionizing space radiation is challenging in part because uncertainties of low-dose distribution amongst cells, of unknown potentially synergistic effects of microgravity upon cellular protein-expression, and of processing dose-related damage within cells to produce rare and late-appearing malignant transformation degrade the confidence of cancer risk-estimates. The NASA-specific responsibility to estimate the risks of radiogenic cancer in a limited number of astronauts is not amenable to epidemiologic study, thereby increasing this challenge. Developing adequately sensitive cellular biodosimeters that simultaneously report 1) the quantity of absorbed dose after exposure to ionizing radiation, 2) the quality of radiation delivering that dose, and 3) the risk of developing malignant transformation by the cells absorbing that dose could be useful for resolving these challenges. Use of a multiparametric cellular biodosimeter is suggested using analyses of gene-expression and protein-expression whereby large datasets of cellular response to radiation-induced damage are obtained and analyzed for expression-profiles correlated with established end points and molecular markers predictive for cancer-risk. Analytical techniques of genomics and proteomics may be used to establish dose-dependency of multiple gene- and protein-expressions resulting from radiation-induced cellular damage. Furthermore, gene- and protein-expression from cells in microgravity are known to be altered relative to cells grown on the ground at 1g. Therefore, hypotheses are proposed that 1) macromolecular expression caused by radiation-induced damage in cells in microgravity may be different than on the ground, and 2) different patterns of macromolecular expression in microgravity may alter human radiogenic cancer risk relative to radiation exposure on Earth. 
A new paradigm is accordingly suggested as a national database wherein genomic and proteomic datasets are registered and interrogated in order to provide statistically significant dose-dependent risk estimation of radiogenic cancer in astronauts.

  19. PhenomeExpress: a refined network analysis of expression datasets by inclusion of known disease phenotypes.

    PubMed

    Soul, Jamie; Hardingham, Timothy E; Boot-Handford, Raymond P; Schwartz, Jean-Marc

    2015-01-29

    We describe a new method, PhenomeExpress, for the analysis of transcriptomic datasets to identify pathogenic disease mechanisms. Our analysis method includes input from both protein-protein interaction and phenotype similarity networks. This introduces valuable information from disease relevant phenotypes, which aids the identification of sub-networks that are significantly enriched in differentially expressed genes and are related to the disease relevant phenotypes. This contrasts with many active sub-network detection methods, which rely solely on protein-protein interaction networks derived from compounded data of many unrelated biological conditions and which are therefore not specific to the context of the experiment. PhenomeExpress thus exploits readily available animal model and human disease phenotype information. It combines this prior evidence of disease phenotypes with the experimentally derived disease data sets to provide a more targeted analysis. Two case studies, in subchondral bone in osteoarthritis and in Pax5 in acute lymphoblastic leukaemia, demonstrate that PhenomeExpress identifies core disease pathways in both mouse and human disease expression datasets derived from different technologies. We also validate the approach by comparison to state-of-the-art active sub-network detection methods, which reveals how it may enhance the detection of molecular phenotypes and provide a more detailed context to those previously identified as possible candidates.

  20. Towards systems genetic analyses in barley: Integration of phenotypic, expression and genotype data into GeneNetwork.

    PubMed

    Druka, Arnis; Druka, Ilze; Centeno, Arthur G; Li, Hongqiang; Sun, Zhaohui; Thomas, William T B; Bonar, Nicola; Steffenson, Brian J; Ullrich, Steven E; Kleinhofs, Andris; Wise, Roger P; Close, Timothy J; Potokina, Elena; Luo, Zewei; Wagner, Carola; Schweizer, Günther F; Marshall, David F; Kearsey, Michael J; Williams, Robert W; Waugh, Robbie

    2008-11-18

    A typical genetical genomics experiment results in four separate data sets; genotype, gene expression, higher-order phenotypic data and metadata that describe the protocols, processing and the array platform. Used in concert, these data sets provide the opportunity to perform genetic analysis at a systems level. Their predictive power is largely determined by the gene expression dataset where tens of millions of data points can be generated using currently available mRNA profiling technologies. Such large, multidimensional data sets often have value beyond that extracted during their initial analysis and interpretation, particularly if conducted on widely distributed reference genetic materials. Besides quality and scale, access to the data is of primary importance as accessibility potentially allows the extraction of considerable added value from the same primary dataset by the wider research community. Although the number of genetical genomics experiments in different plant species is rapidly increasing, none to date has been presented in a form that allows quick and efficient on-line testing for possible associations between genes, loci and traits of interest by an entire research community. Using a reference population of 150 recombinant doubled haploid barley lines we generated novel phenotypic, mRNA abundance and SNP-based genotyping data sets, added them to a considerable volume of legacy trait data and entered them into the GeneNetwork http://www.genenetwork.org. GeneNetwork is a unified on-line analytical environment that enables the user to test genetic hypotheses about how component traits, such as mRNA abundance, may interact to condition more complex biological phenotypes (higher-order traits). Here we describe these barley data sets and demonstrate some of the functionalities GeneNetwork provides as an easily accessible and integrated analytical environment for exploring them. 
By integrating barley genotypic, phenotypic and mRNA abundance data sets directly within GeneNetwork's analytical environment we provide simple web access to the data for the research community. In this environment, a combination of correlation analysis and linkage mapping provides the potential to identify and substantiate gene targets for saturation mapping and positional cloning. By integrating datasets from an unsequenced crop plant (barley) in a database that has been designed for an animal model species (mouse) with a well established genome sequence, we prove the importance of the concept and practice of modular development and interoperability of software engineering for biological data sets.

  1. A Canonical Correlation Analysis of AIDS Restriction Genes and Metabolic Pathways Identifies Purine Metabolism as a Key Cooperator.

    PubMed

    Ye, Hanhui; Yuan, Jinjin; Wang, Zhengwu; Huang, Aiqiong; Liu, Xiaolong; Han, Xiao; Chen, Yahong

    2016-01-01

    Human immunodeficiency virus causes a severe disease in humans, referred to as acquired immune deficiency syndrome (AIDS). Studies on the interaction between host genetic factors and the virus have revealed dozens of genes that impact diverse processes in AIDS. To resolve more genetic factors related to AIDS, a canonical correlation analysis was used to determine the correlation between AIDS restriction and metabolic pathway gene expression. The results show that HIV-1 postentry cellular viral cofactors from AIDS restriction genes are coexpressed in human transcriptome microarray datasets. Further, the purine metabolism pathway comprises novel host factors that are coexpressed with AIDS restriction genes. Using a canonical correlation analysis for expression is a reliable approach to exploring the mechanism underlying AIDS.

  2. Extensive variation between tissues in allele specific expression in an outbred mammal.

    PubMed

    Chamberlain, Amanda J; Vander Jagt, Christy J; Hayes, Benjamin J; Khansefid, Majid; Marett, Leah C; Millen, Catriona A; Nguyen, Thuy T T; Goddard, Michael E

    2015-11-23

    Allele specific gene expression (ASE), with the paternal allele more expressed than the maternal allele or vice versa, appears to be a common phenomenon in humans and mice. In other species the extent of ASE is unknown, and even in humans and mice there are several outstanding questions. These include: to what extent is ASE tissue specific; how often does the direction of allele expression imbalance reverse between tissues; how often is only one of the two alleles expressed; is there a genome-wide bias towards expression of the paternal or maternal allele; and, finally, do genes that are nearby on a chromosome share the same direction of ASE? Here we use gene expression data (RNASeq) from 18 tissues from a single cow to investigate each of these questions in turn, and then validate some of these findings in two tissues from 20 cows. Between 40 and 100 million sequence reads were generated per tissue across three replicate samples for each of the eighteen tissues from the single cow (the discovery dataset). A bovine gene expression atlas was created (the first from RNASeq data), and differentially expressed genes in each tissue were identified. To analyse ASE, we had access to unambiguously phased genotypes for all heterozygous variants in the cow's whole genome sequence, where these variants were homozygous in the whole genome sequence of her sire, and as a result we were able to map reads to parental genomes, to determine SNPs and genes showing ASE in each tissue. In total 25,251 heterozygous SNPs within 7985 genes were tested for ASE in at least one tissue. ASE was pervasive: 89% of genes tested had significant ASE in at least one tissue. This large proportion of genes displaying ASE was confirmed in the two tissues in a validation dataset. For individual tissues the proportion of genes showing significant ASE varied from as low as 8-16% of those tested in thymus to as high as 71-82% of those tested in lung.
There were a number of cases where the direction of allele expression imbalance reversed between tissues. For example the gene SPTY2D1 showed almost complete paternal allele expression in kidney and thymus, and almost complete maternal allele expression in the brain caudal lobe and brain cerebellum. Mono allelic expression (MAE) was common, with 1349 of 4856 genes (28 %) tested with more than one heterozygous SNP showing MAE. Across all tissues, 54.17 % of all genes with ASE favoured the paternal allele. Genes that are closely linked on the chromosome were more likely to show higher expression of the same allele (paternal or maternal) than expected by chance. We identified several long runs of neighbouring genes that showed either paternal or maternal ASE, one example was five adjacent genes (GIMAP8, GIMAP7 copy1, GIMAP4, GIMAP7 copy 2 and GIMAP5) that showed almost exclusive paternal expression in brain caudal lobe. Investigating the extent of ASE across 18 bovine tissues in one cow and two tissues in 20 cows demonstrated 1) ASE is pervasive in cattle, 2) the ASE is often MAE but ranges from MAE to slight overexpression of the major allele, 3) the ASE is most often tissue specific and that more than half the time displays divergent allele specific expression patterns across tissues, 4) across all genes there is a slight bias towards expression of the paternal allele and 5) genes expressing the same parental allele are clustered together more than expected by chance, and there are several runs of large numbers of genes expressing the same parental allele.
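The abstract does not say which statistical test was used to call significant ASE. As a minimal sketch, assuming a simple exact two-sided binomial test of paternal versus maternal read counts at one heterozygous SNP against a 50:50 expectation, the idea can be illustrated as follows. The function names and the read counts are hypothetical and are not taken from the study.

```python
from math import exp, lgamma, log


def binomial_pmf(i, n, p):
    """Binomial probability of i successes in n trials, computed in log space."""
    log_pmf = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
               + i * log(p) + (n - i) * log(1 - p))
    return exp(log_pmf)


def ase_p_value(paternal_reads, maternal_reads, expected=0.5):
    """Exact two-sided binomial p-value for allelic imbalance at one SNP.

    Sums the probabilities of all outcomes that are no more likely than
    the observed paternal read count (assumed test, not the authors' method).
    """
    n = paternal_reads + maternal_reads
    if n == 0:
        return 1.0  # no reads, no evidence of imbalance
    probs = [binomial_pmf(i, n, expected) for i in range(n + 1)]
    observed = probs[paternal_reads]
    # Small relative tolerance so symmetric outcomes are counted together.
    tail = sum(q for q in probs if q <= observed * (1 + 1e-7))
    return min(1.0, tail)


if __name__ == "__main__":
    # Hypothetical counts: 30 paternal reads and 10 maternal reads.
    print(ase_p_value(30, 10))
```

In practice a per-gene call would combine several SNPs and correct for multiple testing across genes and tissues; this sketch covers only the single-SNP step.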

  3. ADGO: analysis of differentially expressed gene sets using composite GO annotation.

    PubMed

    Nam, Dougu; Kim, Sang-Bae; Kim, Seon-Kyu; Yang, Sungjin; Kim, Seon-Young; Chu, In-Sun

    2006-09-15

    Genes are typically expressed in modular manners in biological processes. Recent studies reflect such features in analyzing gene expression patterns by directly scoring gene sets. Gene annotations have been used to define the gene sets, which have served to reveal specific biological themes from expression data. However, current annotations have limited analytical power, because they are classified by single categories providing only unary information for the gene sets. Here we propose a method for discovering composite biological themes from expression data. We intersected two annotated gene sets from different categories of Gene Ontology (GO). We then scored the expression changes of all the single and intersected sets. In this way, we were able to uncover, for example, a gene set with the molecular function F and the cellular component C that showed significant expression change, while the changes in individual gene sets were not significant. We provided an exemplary analysis for HIV-1 immune response. In addition, we tested the method on 20 public datasets where we found many 'filtered' composite terms the number of which reached approximately 34% (a strong criterion, 5% significance) of the number of significant unary terms on average. By using composite annotation, we can derive new and improved information about disease and biological processes from expression data. We provide a web application (ADGO: http://array.kobic.re.kr/ADGO) for the analysis of differentially expressed gene sets with composite GO annotations. The user can analyze Affymetrix and dual channel array (spotted cDNA and spotted oligo microarray) data for four species: human, mouse, rat and yeast. Contact: chu@kribb.re.kr.
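The composite-annotation idea described above (intersect gene sets from two GO categories, then score single and intersected sets) can be sketched as follows. This is an illustration only: the z-score used here is an assumed scoring rule, not necessarily the statistic ADGO uses, and all gene names, set names, and scores are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def set_z_score(gene_scores, gene_set):
    """Z-score of the mean score of a gene set against all scored genes."""
    members = [gene_scores[g] for g in gene_set if g in gene_scores]
    all_scores = list(gene_scores.values())
    sigma = pstdev(all_scores)
    if not members or sigma == 0:
        return None  # nothing to score
    return (mean(members) - mean(all_scores)) / (sigma / sqrt(len(members)))


def composite_scores(gene_scores, sets_a, sets_b, min_size=2):
    """Score every intersection of a set from category A with one from B."""
    results = {}
    for name_a, genes_a in sets_a.items():
        for name_b, genes_b in sets_b.items():
            shared = genes_a & genes_b
            if len(shared) >= min_size:
                results[(name_a, name_b)] = set_z_score(gene_scores, shared)
    return results


# Hypothetical per-gene expression-change scores and annotations.
gene_scores = {"g1": 2.0, "g2": 2.0, "g3": 0.0, "g4": 0.0, "g5": -2.0, "g6": -2.0}
function_sets = {"F": {"g1", "g2", "g5"}}   # molecular function category
component_sets = {"C": {"g1", "g2", "g6"}}  # cellular component category

if __name__ == "__main__":
    print(set_z_score(gene_scores, function_sets["F"]))
    print(composite_scores(gene_scores, function_sets, component_sets))
```

With these toy numbers the intersected set F and C scores higher than F alone, because the intersection drops the gene that moves in the opposite direction. That mirrors the case the abstract describes, where a composite set shows a change that the individual sets do not.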

  4. An expanded maize gene expression atlas based on RNA sequencing and its use to explore root development

    DOE PAGES

    Stelpflug, Scott C.; Sekhon, Rajandeep S.; Vaillancourt, Brieanne; ...

    2015-12-30

    Comprehensive and systematic transcriptome profiling provides valuable insight into biological and developmental processes that occur throughout the life cycle of a plant. We have enhanced our previously published microarray-based gene atlas of maize (Zea mays L.) inbred B73 to now include 79 distinct replicated samples that have been interrogated using RNA sequencing (RNA-seq). The current version of the atlas includes 50 original array-based gene atlas samples, a time-course of 12 stalk and leaf samples postflowering, and an additional set of 17 samples from the maize seedling and adult root system. The entire dataset contains 4.6 billion mapped reads, with an average of 20.5 million mapped reads per biological replicate, allowing for detection of genes with lower transcript abundance. As the new root samples represent key additions to the previously examined tissues, we highlight insights into the root transcriptome, which is represented by 28,894 (73.2%) annotated genes in maize. Additionally, we observed remarkable expression differences across both the longitudinal (four zones) and radial gradients (cortical parenchyma and stele) of the primary root supported by fourfold differential expression of 9353 and 4728 genes, respectively. Among the latter were 1110 genes that encode transcription factors, some of which are orthologs of previously characterized transcription factors known to regulate root development in Arabidopsis thaliana (L.) Heynh., while most are novel, and represent attractive targets for reverse genetics approaches to determine their roles in this important organ. As a result, this comprehensive transcriptome dataset is a powerful tool toward understanding maize development, physiology, and phenotypic diversity.

  5. An expanded maize gene expression atlas based on RNA sequencing and its use to explore root development

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Stelpflug, Scott C.; Sekhon, Rajandeep S.; Vaillancourt, Brieanne

    Comprehensive and systematic transcriptome profiling provides valuable insight into biological and developmental processes that occur throughout the life cycle of a plant. We have enhanced our previously published microarray-based gene atlas of maize (Zea mays L.) inbred B73 to now include 79 distinct replicated samples that have been interrogated using RNA sequencing (RNA-seq). The current version of the atlas includes 50 original array-based gene atlas samples, a time-course of 12 stalk and leaf samples postflowering, and an additional set of 17 samples from the maize seedling and adult root system. The entire dataset contains 4.6 billion mapped reads, with an average of 20.5 million mapped reads per biological replicate, allowing for detection of genes with lower transcript abundance. As the new root samples represent key additions to the previously examined tissues, we highlight insights into the root transcriptome, which is represented by 28,894 (73.2%) annotated genes in maize. Additionally, we observed remarkable expression differences across both the longitudinal (four zones) and radial gradients (cortical parenchyma and stele) of the primary root supported by fourfold differential expression of 9353 and 4728 genes, respectively. Among the latter were 1110 genes that encode transcription factors, some of which are orthologs of previously characterized transcription factors known to regulate root development in Arabidopsis thaliana (L.) Heynh., while most are novel, and represent attractive targets for reverse genetics approaches to determine their roles in this important organ. As a result, this comprehensive transcriptome dataset is a powerful tool toward understanding maize development, physiology, and phenotypic diversity.

  6. Parkinson's disease candidate gene prioritization based on expression profile of midbrain dopaminergic neurons

    PubMed Central

    2010-01-01

    Background: Parkinson's disease is the second most common neurodegenerative disorder. The pathological hallmark of the disease is degeneration of midbrain dopaminergic neurons. Genetic association studies have linked 13 human chromosomal loci to Parkinson's disease. Identification of gene(s), as part of the etiology of Parkinson's disease, within the large number of genes residing in these loci can be achieved through several approaches, including screening methods, and considering appropriate criteria. Since several of the identified Parkinson's disease genes are expressed in substantia nigra pars compacta of the midbrain, expression within the neurons of this area could be a suitable criterion to limit the number of candidates and identify PD genes. Methods: In this work we have used the combination of findings from six rodent transcriptome analysis studies on the gene expression profile of midbrain dopaminergic neurons and the PARK loci in OMIM (Online Mendelian Inheritance in Man) database, to identify new candidate genes for Parkinson's disease. Results: Merging the two datasets, we identified 20 genes within PARK loci, 7 of which are located in an orphan Parkinson's disease locus and one of which had been identified as a disease gene. In addition to identifying a set of candidates for further genetic association studies, these results show that the criterion of expression in midbrain dopaminergic neurons may be used to narrow down the number of genes in PARK loci for such studies. PMID:20716345

  7. KCNN4 and S100A14 act as predictors of recurrence in optimally debulked patients with serous ovarian cancer

    PubMed Central

    Hu, Ting; Sun, Qian; Wu, Jianli; Lin, Xingguang; Luo, Danfeng; Sun, Chaoyang; Wang, Changyu; Zhou, Bo; Li, Na; Xia, Meng; Lu, Hao; Meng, Li; Xu, Xiaoyan; Hu, Junbo; Ma, Ding; Chen, Gang; Zhu, Tao

    2016-01-01

    Approximately 50-75% of patients with serous ovarian carcinoma (SOC) experience recurrence within 18 months after first-line treatment. Current clinical indicators are inadequate for predicting the risk of recurrence. In this study, we used 7 publicly available microarray datasets to identify gene signatures related to recurrence in optimally debulked SOC patients, and validated their expression in an independent clinical cohort of 127 patients using immunohistochemistry (IHC). We identified a two-gene signature including KCNN4 and S100A14 which was related to recurrence in optimally debulked SOC patients. Their mRNA expression levels were positively correlated and regulated by DNA copy number alterations (CNA) (KCNN4: p=1.918e-05) and DNA promoter methylation (KCNN4: p=0.0179; S100A14: p=2.787e-13). Recurrence prediction models built in the TCGA dataset based on KCNN4 and S100A14 individually and in combination showed good prediction performance in the other 6 datasets (AUC: 0.5442-0.9524). The independent cohort supported the expression difference between SOC recurrences. Also, a KCNN4 and S100A14-centered protein interaction subnetwork was built from the STRING database, and the shortest regulation path between them, called the KCNN4-UBA52-KLF4-S100A14 axis, was identified. This discovery might facilitate individualized treatment of SOC. PMID:27270322

  8. Identification of Differentially Expressed Genes in Breast Muscle and Skin Fat of Postnatal Pekin Duck

    PubMed Central

    Schachtschneider, Kyle Michael; Liu, Xiaolin; Huang, Wei; Xie, Ming; Hou, Shuisheng

    2014-01-01

    Lean-type Pekin duck is a commercial breed that has been obtained through long-term selection. Investigation of the differentially expressed genes in breast muscle and skin fat at different developmental stages will contribute to a comprehensive understanding of the potential mechanisms underlying the lean-type Pekin duck phenotype. In the present study, RNA-seq was performed on breast muscle and skin fat at 2-, 4- and 6-weeks of age. More than 89% of the annotated duck genes were covered by our RNA-seq dataset. Thousands of differentially expressed genes, including many important genes involved in the regulation of muscle development and fat deposition, were detected through comparison of the expression levels in the muscle and skin fat of the same time point, or the same tissue at different time points. KEGG pathway analysis showed that the differentially expressed genes clustered significantly in many muscle development and fat deposition related pathways such as MAPK signaling pathway, PPAR signaling pathway, Calcium signaling pathway, Fat digestion and absorption, and TGF-beta signaling pathway. The results presented here could provide a basis for further investigation of the mechanisms involved in muscle development and fat deposition in Pekin duck. PMID:25264787

  9. iGC-an integrated analysis package of gene expression and copy number alteration.

    PubMed

    Lai, Yi-Pin; Wang, Liang-Bo; Wang, Wei-An; Lai, Liang-Chuan; Tsai, Mong-Hsun; Lu, Tzu-Pin; Chuang, Eric Y

    2017-01-14

    With the advancement in high-throughput technologies, researchers can simultaneously investigate gene expression and copy number alteration (CNA) data from individual patients at a lower cost. Traditional analysis methods analyze each type of data individually and integrate their results using Venn diagrams. Challenges arise, however, when the results are irreproducible and inconsistent across multiple platforms. To address these issues, one possible approach is to concurrently analyze both gene expression profiling and CNAs in the same individual. We have developed an open-source R/Bioconductor package (iGC). Multiple input formats are supported and users can define their own criteria for identifying differentially expressed genes driven by CNAs. The analysis of two real microarray datasets demonstrated that the CNA-driven genes identified by the iGC package showed significantly higher Pearson correlation coefficients with their gene expression levels and copy numbers than those genes located in a genomic region with CNA. Compared with the Venn diagram approach, the iGC package showed better performance. The iGC package is effective and useful for identifying CNA-driven genes. By simultaneously considering both comparative genomic and transcriptomic data, it can provide better understanding of biological and medical questions. The iGC package's source code and manual are freely available at https://www.bioconductor.org/packages/release/bioc/html/iGC.html .
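The evaluation above rests on the Pearson correlation between a gene's expression level and its copy number across samples. A minimal sketch of that per-gene calculation is given below. It does not use or reproduce the iGC package's API; the function name and the sample values are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # a constant profile has no defined correlation
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    # Hypothetical values for one gene across five samples.
    copy_number = [2, 2, 3, 4, 4]
    expression = [5.1, 4.9, 6.8, 8.2, 8.0]
    print(pearson_r(copy_number, expression))
```

A gene whose expression tracks its copy number across samples gives a coefficient near 1, which is the pattern the abstract reports for CNA-driven genes.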

  10. Temporal and spatial transcriptomic and microRNA dynamics of CAM photosynthesis in pineapple.

    PubMed

    Wai, Ching M; VanBuren, Robert; Zhang, Jisen; Huang, Lixian; Miao, Wenjing; Edger, Patrick P; Yim, Won C; Priest, Henry D; Meyers, Blake C; Mockler, Todd; Smith, J Andrew C; Cushman, John C; Ming, Ray

    2017-10-01

    The altered carbon assimilation pathway of crassulacean acid metabolism (CAM) photosynthesis results in an up to 80% higher water-use efficiency than C3 photosynthesis in plants, making it a potentially useful pathway for engineering crop plants with improved drought tolerance. Here we surveyed detailed temporal (diel time course) and spatial (across a leaf gradient) gene and microRNA (miRNA) expression patterns in the obligate CAM plant pineapple [Ananas comosus (L.) Merr.]. The high-resolution transcriptome atlas allowed us to distinguish between CAM-related and non-CAM gene copies. A differential gene co-expression network across green and white leaf diel datasets identified genes with circadian oscillation, CAM-related functions, and source-sink relations. Gene co-expression clusters containing CAM pathway genes are enriched with clock-associated cis-elements, suggesting circadian regulation of CAM. About 20% of pineapple microRNAs have diel expression patterns, with several that target key CAM-related genes. Expression and physiology data provide a model for CAM-specific carbohydrate flux and long-distance hexose transport. Together these resources provide a list of candidate genes for targeted engineering of CAM into C3 photosynthesis crop species. © 2017 The Authors The Plant Journal © 2017 John Wiley & Sons Ltd.

  11. Automated Discovery of Functional Generality of Human Gene Expression Programs

    PubMed Central

    Gerber, Georg K; Dowell, Robin D; Jaakkola, Tommi S; Gifford, David K

    2007-01-01

    An important research problem in computational biology is the identification of expression programs, sets of co-expressed genes orchestrating normal or pathological processes, and the characterization of the functional breadth of these programs. The use of human expression data compendia for discovery of such programs presents several challenges including cellular inhomogeneity within samples, genetic and environmental variation across samples, uncertainty in the numbers of programs and sample populations, and temporal behavior. We developed GeneProgram, a new unsupervised computational framework based on Hierarchical Dirichlet Processes that addresses each of the above challenges. GeneProgram uses expression data to simultaneously organize tissues into groups and genes into overlapping programs with consistent temporal behavior, to produce maps of expression programs, which are sorted by generality scores that exploit the automatically learned groupings. Using synthetic and real gene expression data, we showed that GeneProgram outperformed several popular expression analysis methods. We applied GeneProgram to a compendium of 62 short time-series gene expression datasets exploring the responses of human cells to infectious agents and immune-modulating molecules. GeneProgram produced a map of 104 expression programs, a substantial number of which were significantly enriched for genes involved in key signaling pathways and/or bound by NF-κB transcription factors in genome-wide experiments. Further, GeneProgram discovered expression programs that appear to implicate surprising signaling pathways or receptor types in the response to infection, including Wnt signaling and neurotransmitter receptors. 
We believe the discovered map of expression programs involved in the response to infection will be useful for guiding future biological experiments; genes from programs with low generality scores might serve as new drug targets that exhibit minimal “cross-talk,” and genes from high generality programs may maintain common physiological responses that go awry in disease states. Further, our method is multipurpose, and can be applied readily to novel compendia of biological data. PMID:17696603

  12. Analysis of microarray leukemia data using an efficient MapReduce-based K-nearest-neighbor classifier.

    PubMed

    Kumar, Mukesh; Rath, Nitish Kumar; Rath, Santanu Kumar

    2016-04-01

    Microarray-based gene expression profiling has emerged as an efficient technique for classification, prognosis, diagnosis, and treatment of cancer. Frequent changes in the behavior of this disease generate an enormous volume of data. Microarray data satisfies both the veracity and velocity properties of big data, as it keeps changing with time. Therefore, the analysis of microarray datasets in a small amount of time is essential. These datasets often contain a large amount of expression data, but only a fraction of it comprises genes that are significantly expressed. The precise identification of genes of interest that are responsible for causing cancer is imperative in microarray data analysis. Most existing schemes employ a two-phase process such as feature selection/extraction followed by classification. In this paper, various statistical methods (tests) based on MapReduce are proposed for selecting relevant features. After feature selection, a MapReduce-based K-nearest neighbor (mrKNN) classifier is also employed to classify microarray data. These algorithms are successfully implemented in a Hadoop framework. A comparative analysis is done on these MapReduce-based models using microarray datasets of various dimensions. From the obtained results, it is observed that these models consume much less execution time than conventional models in processing big data. Copyright © 2016 Elsevier Inc. All rights reserved.
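The abstract gives no implementation details for the mrKNN classifier. One common way to split K-nearest-neighbor classification into map and reduce steps is sketched below, as an assumption rather than the authors' design: each data partition reports its own k nearest samples, and a reduce step merges them and takes a majority vote. The sketch runs serially in plain Python rather than on Hadoop, and the sample values and class labels are hypothetical.

```python
import heapq
from collections import Counter
from math import dist


def map_local_neighbors(partition, query, k):
    """Map step: the k nearest (distance, label) pairs within one partition."""
    return heapq.nsmallest(
        k, ((dist(features, query), label) for features, label in partition)
    )


def reduce_vote(local_results, k):
    """Reduce step: merge local candidates, keep the global k nearest, vote."""
    merged = heapq.nsmallest(
        k, (pair for result in local_results for pair in result)
    )
    return Counter(label for _, label in merged).most_common(1)[0][0]


def knn_classify(partitions, query, k=3):
    """Run the map step on every partition, then a single reduce step."""
    local_results = [map_local_neighbors(p, query, k) for p in partitions]
    return reduce_vote(local_results, k)


# Hypothetical two-feature expression profiles split across two partitions.
partitions = [
    [((0.0, 0.0), "ALL"), ((0.1, 0.2), "ALL")],
    [((5.0, 5.0), "AML"), ((0.2, 0.1), "ALL"), ((5.1, 4.9), "AML")],
]

if __name__ == "__main__":
    print(knn_classify(partitions, (0.0, 0.1), k=3))
```

Because every partition returns at most k candidates, the reduce step handles only k times the number of partitions, independent of the dataset size. The global k nearest neighbors are always among those candidates, so the result matches a single-machine KNN.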

  13. Association of HADHA expression with the risk of breast cancer: targeted subset analysis and meta-analysis of microarray data

    PubMed Central

    2012-01-01

    Background The role of n-3 fatty acids in prevention of breast cancer is well recognized, but the underlying molecular mechanisms are still unclear. In view of the growing need for early detection of breast cancer, Graham et al. (2010) studied the microarray gene expression in histologically normal epithelium of subjects with or without breast cancer. We conducted a secondary analysis of this dataset with a focus on the genes (n = 47) involved in fat and lipid metabolism. We used stepwise multivariate logistic regression analyses, volcano plots and false discovery rates for association analyses. We also conducted meta-analyses of other microarray studies using random effects models for three outcomes--risk of breast cancer (380 breast cancer patients and 240 normal subjects), risk of metastasis (430 metastatic compared to 1104 non-metastatic breast cancers) and risk of recurrence (484 recurring versus 890 non-recurring breast cancers). Results The HADHA gene [hydroxyacyl-CoA dehydrogenase/3-ketoacyl-CoA thiolase/enoyl-CoA hydratase (trifunctional protein), alpha subunit] was significantly under-expressed in breast cancer; more so in those with estrogen receptor-negative status. Our meta-analysis showed an 18.4%-26% reduction in HADHA expression in breast cancer. Also, there was an inconclusive but consistent under-expression of HADHA in subjects with metastatic and recurring breast cancers. Conclusions Involvement of mitochondria and the mitochondrial trifunctional protein (encoded by HADHA gene) in breast carcinogenesis is known. Our results lend additional support to the possibility of this involvement. Further, our results suggest that targeted subset analysis of large genome-based datasets can provide interesting association signals. PMID:22240105
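The abstract mentions meta-analyses with random effects models but does not name the estimator. As a minimal sketch, assuming the widely used DerSimonian-Laird method, study-level effects and their variances can be pooled as follows. The function name and the input numbers are hypothetical and are not data from the study.

```python
from math import sqrt


def dersimonian_laird(effects, variances):
    """Pool study effects with a DerSimonian-Laird random-effects model.

    Returns (pooled_effect, standard_error, tau_squared), where tau_squared
    is the estimated between-study variance.
    """
    k = len(effects)
    if k != len(variances) or k < 2:
        raise ValueError("need at least two studies with matching variances")
    weights = [1.0 / v for v in variances]
    total_w = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, effects)) / total_w
    # Cochran's Q measures heterogeneity around the fixed-effect estimate.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    c = total_w - sum(w * w for w in weights) / total_w
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(star, effects)) / sum(star)
    return pooled, sqrt(1.0 / sum(star)), tau2


if __name__ == "__main__":
    # Hypothetical log fold changes and their variances from three studies.
    print(dersimonian_laird([-0.30, -0.22, -0.41], [0.010, 0.020, 0.015]))
```

When the studies agree closely, the between-study variance estimate is zero and the result reduces to the fixed-effect inverse-variance average.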

  14. Integrative Exploratory Analysis of Two or More Genomic Datasets.

    PubMed

    Meng, Chen; Culhane, Aedin

    2016-01-01

    Exploratory analysis is an essential step in the analysis of high throughput data. Multivariate approaches such as correspondence analysis (CA), principal component analysis, and multidimensional scaling are widely used in the exploratory analysis of a single dataset. Modern biological studies often assay multiple types of biological molecules (e.g., mRNA, protein, phosphoproteins) on the same set of biological samples, thereby creating multiple different types of omics data or multiassay data. Integrative exploratory analysis of these multiple omics data is required to leverage the potential of multiple omics studies. In this chapter, we describe the application of co-inertia analysis (CIA; for analyzing two datasets) and multiple co-inertia analysis (MCIA; for three or more datasets) to address this problem. These methods are powerful yet simple multivariate approaches that represent samples using a lower number of variables, allowing easier identification of the correlated structure in and between multiple high dimensional datasets. Graphical representations can be employed for this purpose. In addition, the methods simultaneously project samples and variables (genes, proteins) onto the same lower dimensional space, so the most variant variables from each dataset can be selected and associated with samples, which can be further used to facilitate biological interpretation and pathway analysis. We applied CIA to explore the concordance between mRNA and protein expression in a panel of 60 tumor cell lines from the National Cancer Institute. In the same 60 cell lines, we used MCIA to perform a cross-platform comparison of mRNA gene expression profiles obtained on four different microarray platforms. Last, as an example of integrative analysis of multiassay or multi-omics data, we analyzed transcriptomic, proteomic, and phosphoproteomic data from pluripotent (iPS) and embryonic stem (ES) cell lines.

  15. Gene expression profiling in multipotent DFAT cells derived from mature adipocytes

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ono, Hiromasa; Database Center for Life Science; Oki, Yoshinao

    2011-04-15

    Highlights: Adipocyte dedifferentiation is evident in a significant decrease in typical genes. Cell proliferation is strongly related to adipocyte dedifferentiation. Dedifferentiated adipocytes express several lineage-specific genes. Comparative analyses using publicly available datasets boost the interpretation. Abstract: Cellular dedifferentiation signifies the withdrawal of cells from a specific differentiated state to a stem cell-like undifferentiated state. However, the mechanism of dedifferentiation remains obscure. Here we performed comparative transcriptome analyses during dedifferentiation in mature adipocytes (MAs) to identify the transcriptional signatures of multipotent dedifferentiated fat (DFAT) cells derived from MAs. Using microarray systems, we explored similarly expressed as well as significantly differentially expressed genes in MAs during dedifferentiation. This analysis revealed significant changes in gene expression during this process, including a significant reduction in expression of genes for lipid metabolism concomitantly with a significant increase in expression of genes for cell movement, cell migration, tissue developmental processes, cell growth, cell proliferation, cell morphogenesis, altered cell shape, and cell differentiation. Our observations indicate that the transcriptional signatures of DFAT cells derived from MAs are summarized in terms of a significant decrease in functional phenotype-related genes and a parallel increase in cell proliferation, altered cell morphology, and regulation of the differentiation of related genes. A better understanding of the mechanisms involved in dedifferentiation may enable scientists to control and possibly alter the plasticity of the differentiated state, which may lead to benefits not only in stem cell research but also in regenerative medicine.

  16. Identification and expression analysis of cold and freezing stress responsive genes of Brassica oleracea.

    PubMed

    Ahmed, Nasar Uddin; Jung, Hee-Jeong; Park, Jong-In; Cho, Yong-Gu; Hur, Yoonkang; Nou, Ill-Sup

    2015-01-10

    Cold and freezing stress is a major environmental constraint to the production of Brassica crops. Enhancement of tolerance by exploiting cold and freezing tolerance related genes offers the most efficient approach to address this problem. Cold-induced transcriptional profiling is a promising approach to the identification of potential genes related to cold and freezing stress tolerance. In this study, 99 highly expressed genes were identified from a whole genome microarray dataset of Brassica rapa. Blast search analysis of the Brassica oleracea database revealed the corresponding homologous genes. To validate their expression, pre-selected cold tolerant and susceptible cabbage lines were analyzed. Out of 99 BoCRGs, 43 were differentially expressed in response to varying degrees of cold and freezing stress in the contrasting cabbage lines. Among the differentially expressed genes, 18 were highly up-regulated in the tolerant lines, which is consistent with their microarray expression. Additionally, 12 BoCRGs were expressed differentially after cold stress treatment in two contrasting cabbage lines, and BoCRG54, 56, 59, 62, 70, 72 and 99 were predicted to be involved in cold regulatory pathways. Taken together, the cold-responsive genes identified in this study provide additional direction for elucidating the regulatory network of low temperature stress tolerance and developing cold and freezing stress resistant Brassica crops. Copyright © 2014 Elsevier B.V. All rights reserved.

  17. Identification of stable reference genes in differentiating human pluripotent stem cells.

    PubMed

    Holmgren, Gustav; Ghosheh, Nidal; Zeng, Xianmin; Bogestål, Yalda; Sartipy, Peter; Synnergren, Jane

    2015-06-01

    Reference genes, often referred to as housekeeping genes (HKGs), are frequently used to normalize gene expression data based on the assumption that they are expressed at a constant level in the cells. However, several studies have shown that there may be a large variability in the gene expression levels of HKGs in various cell types. In a previous study, employing human embryonic stem cells (hESCs) subjected to spontaneous differentiation, we observed that the expression of commonly used HKG varied to a degree that rendered them inappropriate to use as reference genes under those experimental settings. Here we present a substantially extended study of the HKG signature in human pluripotent stem cells (hPSC), including nine global gene expression datasets from both hESC and human induced pluripotent stem cells, obtained during directed differentiation toward endoderm-, mesoderm-, and ectoderm derivatives. Sets of stably expressed genes were compiled, and a handful of genes (e.g., EID2, ZNF324B, CAPN10, and RABEP2) were identified as generally applicable reference genes in hPSCs across all cell lines and experimental conditions. The stability in gene expression profiles was confirmed by reverse transcription quantitative PCR analysis. Taken together, the current results suggest that differentiating hPSCs have a distinct HKG signature, which in some aspects is different from somatic cell types, and underscore the necessity to validate the stability of reference genes under the actual experimental setup used. In addition, the novel putative HKGs identified in this study can preferentially be used for normalization of gene expression data obtained from differentiating hPSCs. Copyright © 2015 the American Physiological Society.
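
    One simple way to screen for stably expressed genes across datasets is to rank genes by their coefficient of variation (CV). The sketch below illustrates that idea only; the abstract does not state which stability measure the study used, so the CV criterion, the function name `rank_by_stability`, and the gene labels in the usage note are all assumptions made for illustration.

```python
import numpy as np

def rank_by_stability(expr, gene_names):
    """Rank genes from most to least stable by coefficient of variation.

    expr: (genes x samples) array of positive, linear-scale expression values.
    gene_names: sequence of labels, one per row of expr.
    Returns a list of (gene_name, cv) tuples, lowest CV first.
    """
    expr = np.asarray(expr, dtype=float)
    mean = expr.mean(axis=1)
    sd = expr.std(axis=1, ddof=1)    # sample standard deviation per gene
    cv = sd / mean                   # relative variability across samples
    order = np.argsort(cv)
    return [(gene_names[i], float(cv[i])) for i in order]
```

    For example, `rank_by_stability([[10, 10, 10, 10], [5, 10, 15, 20]], ["geneA", "geneB"])` places the hypothetical `geneA` first because its expression does not vary. As the abstract stresses, any candidate chosen this way would still need validation under the actual experimental setup.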

  18. Genome-wide methylomic and transcriptomic analyses identify subtype-specific epigenetic signatures commonly dysregulated in glioma stem cells and glioblastoma.

    PubMed

    Pangeni, Rajendra P; Zhang, Zhou; Alvarez, Angel A; Wan, Xuechao; Sastry, Namratha; Lu, Songjian; Shi, Taiping; Huang, Tianzhi; Lei, Charles X; James, C David; Kessler, John A; Brennan, Cameron W; Nakano, Ichiro; Lu, Xinghua; Hu, Bo; Zhang, Wei; Cheng, Shi-Yuan

    2018-06-21

    Glioma stem cells (GSCs), a subpopulation of tumor cells, contribute to tumor heterogeneity and therapy resistance. Gene expression profiling classified glioblastoma (GBM) and GSCs into four transcriptomically-defined subtypes. Here, we determined the DNA methylation signatures in transcriptomically pre-classified GSC and GBM bulk tumors subtypes. We hypothesized that these DNA methylation signatures correlate with gene expression and are uniquely associated either with only GSCs or only GBM bulk tumors. Additional methylation signatures may be commonly associated with both GSCs and GBM bulk tumors, i.e., common to non-stem-like and stem-like tumor cell populations and correlating with the clinical prognosis of glioma patients. We analyzed Illumina 450K methylation array and expression data from a panel of 23 patient-derived GSCs. We referenced these results with The Cancer Genome Atlas (TCGA) GBM datasets to generate methylomic and transcriptomic signatures for GSCs and GBM bulk tumors of each transcriptomically pre-defined tumor subtype. Survival analyses were carried out for these signature genes using publicly available datasets, including from TCGA. We report that DNA methylation signatures in proneural and mesenchymal tumor subtypes are either unique to GSCs, unique to GBM bulk tumors, or common to both. Further, dysregulated DNA methylation correlates with gene expression and clinical prognoses. Additionally, many previously identified transcriptionally-regulated markers are also dysregulated due to DNA methylation. The subtype-specific DNA methylation signatures described in this study could be useful for refining GBM sub-classification, improving prognostic accuracy, and making therapeutic decisions.

  19. Microarray-based gene expression profiling in patients with cryopyrin-associated periodic syndromes defines a disease-related signature and IL-1-responsive transcripts.

    PubMed

    Balow, James E; Ryan, John G; Chae, Jae Jin; Booty, Matthew G; Bulua, Ariel; Stone, Deborah; Sun, Hong-Wei; Greene, James; Barham, Beverly; Goldbach-Mansky, Raphaela; Kastner, Daniel L; Aksentijevich, Ivona

    2013-06-01

    To analyse gene expression patterns and to define a specific gene expression signature in patients with the severe end of the spectrum of cryopyrin-associated periodic syndromes (CAPS). The molecular consequences of interleukin 1 inhibition were examined by comparing gene expression patterns in 16 CAPS patients before and after treatment with anakinra. We collected peripheral blood mononuclear cells from 22 CAPS patients with active disease and from 14 healthy children. Transcripts that passed stringent filtering criteria (p values≤false discovery rate 1%) were considered as differentially expressed genes (DEG). A set of DEG was validated by quantitative reverse transcription PCR and functional studies with primary cells from CAPS patients and healthy controls. We used 17 CAPS and 66 non-CAPS patient samples to create a set of gene expression models that differentiates CAPS patients from controls and from patients with other autoinflammatory conditions. Many DEG include transcripts related to the regulation of innate and adaptive immune responses, oxidative stress, cell death, cell adhesion and motility. A set of gene expression-based models comprising the CAPS-specific gene expression signature correctly classified all 17 samples from an independent dataset. This classifier also correctly identified 15 of 16 post-anakinra CAPS samples despite the fact that these CAPS patients were in clinical remission. We identified a gene expression signature that clearly distinguished CAPS patients from controls. A number of DEG were in common with other systemic inflammatory diseases such as systemic onset juvenile idiopathic arthritis. The CAPS-specific gene expression classifiers also suggest incomplete suppression of inflammation at low doses of anakinra.

  20. Microarray-based gene expression profiling in patients with cryopyrin-associated periodic syndromes defines a disease-related signature and IL-1-responsive transcripts

    PubMed Central

    Balow, James E; Ryan, John G; Chae, Jae Jin; Booty, Matthew G; Bulua, Ariel; Stone, Deborah; Sun, Hong-Wei; Greene, James; Barham, Beverly; Goldbach-Mansky, Raphaela; Kastner, Daniel L; Aksentijevich, Ivona

    2014-01-01

    Objective To analyse gene expression patterns and to define a specific gene expression signature in patients with the severe end of the spectrum of cryopyrin-associated periodic syndromes (CAPS). The molecular consequences of interleukin 1 inhibition were examined by comparing gene expression patterns in 16 CAPS patients before and after treatment with anakinra. Methods We collected peripheral blood mononuclear cells from 22 CAPS patients with active disease and from 14 healthy children. Transcripts that passed stringent filtering criteria (p values ≤ false discovery rate 1%) were considered as differentially expressed genes (DEG). A set of DEG was validated by quantitative reverse transcription PCR and functional studies with primary cells from CAPS patients and healthy controls. We used 17 CAPS and 66 non-CAPS patient samples to create a set of gene expression models that differentiates CAPS patients from controls and from patients with other autoinflammatory conditions. Results Many DEG include transcripts related to the regulation of innate and adaptive immune responses, oxidative stress, cell death, cell adhesion and motility. A set of gene expression-based models comprising the CAPS-specific gene expression signature correctly classified all 17 samples from an independent dataset. This classifier also correctly identified 15 of 16 postanakinra CAPS samples despite the fact that these CAPS patients were in clinical remission. Conclusions We identified a gene expression signature that clearly distinguished CAPS patients from controls. A number of DEG were in common with other systemic inflammatory diseases such as systemic onset juvenile idiopathic arthritis. The CAPS-specific gene expression classifiers also suggest incomplete suppression of inflammation at low doses of anakinra. PMID:23223423
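
    Filtering transcripts at a 1% false discovery rate is commonly done with the Benjamini-Hochberg step-up procedure. The abstract does not say which FDR method was used, so the following is a generic sketch of that procedure rather than the authors' pipeline; the function name `benjamini_hochberg` is hypothetical.

```python
import numpy as np

def benjamini_hochberg(p_values, fdr=0.01):
    """Return a boolean mask marking tests significant at the given FDR."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)            # indices of p-values, smallest first
    ranked = p[order]

    # Step-up thresholds: the i-th smallest p-value is compared to (i / m) * fdr
    thresholds = (np.arange(1, m + 1) / m) * fdr
    passed = ranked <= thresholds

    mask = np.zeros(m, dtype=bool)
    if passed.any():
        # Reject every hypothesis up to the largest rank that meets its threshold
        k = np.nonzero(passed)[0].max()
        mask[order[: k + 1]] = True
    return mask
```

    Note the step-up behaviour: a p-value that misses its own threshold is still declared significant if a larger p-value further down the ranking meets its threshold.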

  1. De novo characterization of the Chinese fir (Cunninghamia lanceolata) transcriptome and analysis of candidate genes involved in cellulose and lignin biosynthesis

    PubMed Central

    2012-01-01

    Background Chinese fir (Cunninghamia lanceolata) is an important timber species that accounts for 20–30% of the total commercial timber production in China. However, the available genomic information of Chinese fir is limited, and this severely encumbers functional genomic analysis and molecular breeding in Chinese fir. Recently, major advances in transcriptome sequencing have provided fast and cost-effective approaches to generate large expression datasets that have proven to be powerful tools to profile the transcriptomes of non-model organisms with undetermined genomes. Results In this study, the transcriptomes of nine tissues from Chinese fir were analyzed using the Illumina HiSeq™ 2000 sequencing platform. Approximately 40 million paired-end reads were obtained, generating 3.62 gigabase pairs of sequencing data. These reads were assembled into 83,248 unique sequences (i.e. Unigenes) with an average length of 449 bp, amounting to 37.40 Mb. A total of 73,779 Unigenes were supported by more than 5 reads, and 42,663 (57.83%) had homologs in the NCBI non-redundant and Swiss-Prot protein databases, corresponding to 27,224 unique protein entries. Of these Unigenes, 16,750 were assigned to Gene Ontology classes, and 14,877 were clustered into orthologous groups. A total of 21,689 (29.40%) were mapped to 119 pathways by BLAST comparison against the Kyoto Encyclopedia of Genes and Genomes (KEGG) database. The majority of the genes encoding the enzymes in the biosynthetic pathways of cellulose and lignin were identified in the Unigene dataset by targeted searches of their annotations. In addition, a number of candidate Chinese fir genes in the two metabolic pathways were discovered for the first time. Eighteen genes related to cellulose and lignin biosynthesis were cloned for experimental validation of the transcriptome data. Overall, 49 Unigenes, covering different regions of these selected genes, were found by alignment.
    Their expression patterns in different tissues were analyzed by qRT-PCR to explore their putative functions. Conclusions A substantial fraction of transcript sequences was obtained from the deep sequencing of Chinese fir. The assembled Unigene dataset was used to discover candidate genes of cellulose and lignin biosynthesis. This transcriptome dataset will provide a comprehensive sequence resource for molecular genetics research of C. lanceolata. PMID:23171398

  2. TRACING CO-REGULATORY NETWORK DYNAMICS IN NOISY, SINGLE-CELL TRANSCRIPTOME TRAJECTORIES.

    PubMed

    Cordero, Pablo; Stuart, Joshua M

    2017-01-01

    The availability of gene expression data at the single cell level makes it possible to probe the molecular underpinnings of complex biological processes such as differentiation and oncogenesis. Promising new methods have emerged for reconstructing a progression 'trajectory' from static single-cell transcriptome measurements. However, it remains unclear how to adequately model the appreciable level of noise in these data to elucidate gene regulatory network rewiring. Here, we present a framework called Single Cell Inference of MorphIng Trajectories and their Associated Regulation (SCIMITAR) that infers progressions from static single-cell transcriptomes by employing a continuous parametrization of Gaussian mixtures in high-dimensional curves. SCIMITAR yields rich models from the data that highlight genes with expression and co-expression patterns that are associated with the inferred progression. Further, SCIMITAR extracts regulatory states from the implicated trajectory-evolving co-expression networks. We benchmark the method on simulated data to show that it yields accurate cell ordering and gene network inferences. Applied to the interpretation of a single-cell human fetal neuron dataset, SCIMITAR finds progression-associated genes in cornerstone neural differentiation pathways missed by standard differential expression tests. Finally, by leveraging the rewiring of gene-gene co-expression relations across the progression, the method reveals the rise and fall of co-regulatory states and trajectory-dependent gene modules. These analyses implicate new transcription factors in neural differentiation including putative co-factors for the multi-functional NFAT pathway.

  3. Cancer-cell intrinsic gene expression signatures overcome intratumoural heterogeneity bias in colorectal cancer patient classification

    PubMed Central

    Dunne, Philip D.; Alderdice, Matthew; O'Reilly, Paul G.; Roddy, Aideen C.; McCorry, Amy M. B.; Richman, Susan; Maughan, Tim; McDade, Simon S.; Johnston, Patrick G.; Longley, Daniel B.; Kay, Elaine; McArt, Darragh G.; Lawler, Mark

    2017-01-01

    Stromal-derived intratumoural heterogeneity (ITH) has been shown to undermine molecular stratification of patients into appropriate prognostic/predictive subgroups. Here, using several clinically relevant colorectal cancer (CRC) gene expression signatures, we assessed the susceptibility of these signatures to the confounding effects of ITH using gene expression microarray data obtained from multiple tumour regions of a cohort of 24 patients, including central tumour, the tumour invasive front and lymph node metastasis. Sample clustering alongside correlative assessment revealed variation in the ability of each signature to cluster samples according to patient-of-origin rather than region-of-origin within the multi-region dataset. Signatures focused on cancer-cell intrinsic gene expression were found to produce more clinically useful, patient-centred classifiers, as exemplified by the CRC intrinsic signature (CRIS), which robustly clustered samples by patient-of-origin rather than region-of-origin. These findings highlight the potential of cancer-cell intrinsic signatures to reliably stratify CRC patients by minimising the confounding effects of stromal-derived ITH. PMID:28561046

  4. Novel prediction of anticancer drug chemosensitivity in cancer cell lines: evidence of moderation by microRNA expressions.

    PubMed

    Yang, Daniel S

    2014-01-01

    The objectives of this study are (1) to develop a novel "moderation" model of drug chemosensitivity and (2) to investigate if miRNA expression moderates the relationship between gene expression and drug chemosensitivity, specifically for HSP90 inhibitors applied to human cancer cell lines. A moderation model integrating the interaction between miRNA and gene expressions was developed to examine if miRNA expression affects the strength of the relationship between gene expression and chemosensitivity. Comprehensive datasets on miRNA expressions, gene expressions, and drug chemosensitivities were obtained from National Cancer Institute's NCI-60 cell lines including nine different cancer types. A workflow including steps of selecting genes, miRNAs, and compounds, correlating gene expression with chemosensitivity, and performing multivariate analysis was utilized to test the proposed model. The proposed moderation model identified 12 significantly-moderating miRNAs: miR-15b*, miR-16-2*, miR-9, miR-126*, miR-129*, miR-138, miR-519e*, miR-624*, miR-26b, miR-30e*, miR-32, and miR-196a, as well as two genes ERCC2 and SF3B1 which affect chemosensitivities of Tanespimycin and Alvespimycin - both HSP90 inhibitors. A bootstrap resampling of 2,500 times validates the significance of all 12 identified miRNAs. The results confirm that certain miRNA and gene expressions interact to produce an effect on drug response. The lack of correlation between miRNA and gene expression themselves suggests that miRNA transmits its effect through translation inhibition/control rather than mRNA degradation. The results suggest that miRNAs could serve not only as prognostic biomarkers for cancer treatment outcome but also as interventional agents to modulate desired chemosensitivity.
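
    A moderation effect of this kind is usually tested by adding a product (interaction) term to a regression and checking whether its coefficient differs from zero. The sketch below shows that general pattern with ordinary least squares and a percentile bootstrap; it is an illustration under stated assumptions, not the study's exact model. The function names `fit_moderation` and `bootstrap_interaction` are hypothetical, and only the default of 2,500 resamples is taken from the abstract.

```python
import numpy as np

def fit_moderation(gene, mirna, response):
    """OLS fit of response = b0 + b1*gene + b2*mirna + b3*(gene*mirna).

    b3 is the moderation term: it measures how the gene-response slope
    changes with miRNA expression. Returns [b0, b1, b2, b3].
    """
    g = np.asarray(gene, dtype=float)
    m = np.asarray(mirna, dtype=float)
    y = np.asarray(response, dtype=float)
    X = np.column_stack([np.ones_like(g), g, m, g * m])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def bootstrap_interaction(gene, mirna, response, n_boot=2500, seed=0):
    """Percentile 95% interval for b3 from resampling cell lines with replacement."""
    g = np.asarray(gene, dtype=float)
    m = np.asarray(mirna, dtype=float)
    y = np.asarray(response, dtype=float)
    rng = np.random.default_rng(seed)
    n = y.size
    b3 = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)      # one bootstrap resample
        b3[i] = fit_moderation(g[idx], m[idx], y[idx])[3]
    lower, upper = np.percentile(b3, [2.5, 97.5])
    return lower, upper
```

    If the bootstrap interval for `b3` excludes zero, the data are consistent with the miRNA moderating the gene-chemosensitivity relationship; in practice the many miRNA-gene-compound combinations tested would also call for a multiple-testing correction.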

  5. GeneNetFinder2: Improved Inference of Dynamic Gene Regulatory Relations with Multiple Regulators.

    PubMed

    Han, Kyungsook; Lee, Jeonghoon

    2016-01-01

    A gene involved in complex regulatory interactions may have multiple regulators since gene expression in such interactions is often controlled by more than one gene. Another thing that makes gene regulatory interactions complicated is that regulatory interactions are not static, but change over time during the cell cycle. Most research so far has focused on identifying gene regulatory relations between individual genes in a particular stage of the cell cycle. In this study we developed a method for identifying dynamic gene regulations of several types from the time-series gene expression data. The method can find gene regulations with multiple regulators that work in combination or individually as well as those with single regulators. The method has been implemented as the second version of GeneNetFinder (hereafter called GeneNetFinder2) and tested on several gene expression datasets. Experimental results with gene expression data revealed the existence of genes that are not regulated by individual genes but rather by a combination of several genes. Such gene regulatory relations cannot be found by conventional methods. Our method finds such regulatory relations as well as those with multiple, independent regulators or single regulators, and represents gene regulatory relations as a dynamic network in which different gene regulatory relations are shown in different stages of the cell cycle. GeneNetFinder2 is available at http://bclab.inha.ac.kr/GeneNetFinder and will be useful for modeling dynamic gene regulations with multiple regulators.

  6. A Bootstrap Based Measure Robust to the Choice of Normalization Methods for Detecting Rhythmic Features in High Dimensional Data.

    PubMed

    Larriba, Yolanda; Rueda, Cristina; Fernández, Miguel A; Peddada, Shyamal D

    2018-01-01

    Motivation: Gene-expression data obtained from high throughput technologies are subject to various sources of noise and accordingly the raw data are pre-processed before formally analyzed. Normalization of the data is a key pre-processing step, since it removes systematic variations across arrays. There are numerous normalization methods available in the literature. Based on our experience, in the context of oscillatory systems, such as cell-cycle, circadian clock, etc., the choice of the normalization method may substantially impact the determination of a gene to be rhythmic. Thus rhythmicity of a gene can purely be an artifact of how the data were normalized. Since the determination of rhythmic genes is an important component of modern toxicological and pharmacological studies, it is important to determine truly rhythmic genes that are robust to the choice of a normalization method. Results: In this paper we introduce a rhythmicity measure and a bootstrap methodology to detect rhythmic genes in an oscillatory system. Although the proposed methodology can be used for any high-throughput gene expression data, in this paper we illustrate the proposed methodology using several publicly available circadian clock microarray gene-expression datasets. We demonstrate that the choice of normalization method has very little effect on the proposed methodology. Specifically, for any pair of normalization methods considered in this paper, the resulting values of the rhythmicity measure are highly correlated. Thus it suggests that the proposed measure is robust to the choice of a normalization method. Consequently, the rhythmicity of a gene is potentially not a mere artifact of the normalization method used. Lastly, as demonstrated in the paper, the proposed bootstrap methodology can also be used for simulating data for genes participating in an oscillatory system using a reference dataset. 
    Availability: User-friendly code implemented in R can be downloaded from http://www.eio.uva.es/~miguel/robustdetectionprocedure.html.

  7. A Bootstrap Based Measure Robust to the Choice of Normalization Methods for Detecting Rhythmic Features in High Dimensional Data

    PubMed Central

    Larriba, Yolanda; Rueda, Cristina; Fernández, Miguel A.; Peddada, Shyamal D.

    2018-01-01

    Motivation: Gene-expression data obtained from high throughput technologies are subject to various sources of noise and accordingly the raw data are pre-processed before formally analyzed. Normalization of the data is a key pre-processing step, since it removes systematic variations across arrays. There are numerous normalization methods available in the literature. Based on our experience, in the context of oscillatory systems, such as cell-cycle, circadian clock, etc., the choice of the normalization method may substantially impact the determination of a gene to be rhythmic. Thus rhythmicity of a gene can purely be an artifact of how the data were normalized. Since the determination of rhythmic genes is an important component of modern toxicological and pharmacological studies, it is important to determine truly rhythmic genes that are robust to the choice of a normalization method. Results: In this paper we introduce a rhythmicity measure and a bootstrap methodology to detect rhythmic genes in an oscillatory system. Although the proposed methodology can be used for any high-throughput gene expression data, in this paper we illustrate the proposed methodology using several publicly available circadian clock microarray gene-expression datasets. We demonstrate that the choice of normalization method has very little effect on the proposed methodology. Specifically, for any pair of normalization methods considered in this paper, the resulting values of the rhythmicity measure are highly correlated. Thus it suggests that the proposed measure is robust to the choice of a normalization method. Consequently, the rhythmicity of a gene is potentially not a mere artifact of the normalization method used. Lastly, as demonstrated in the paper, the proposed bootstrap methodology can also be used for simulating data for genes participating in an oscillatory system using a reference dataset. 
    Availability: User-friendly code implemented in R can be downloaded from http://www.eio.uva.es/~miguel/robustdetectionprocedure.html PMID:29456555

  8. Coral life history and symbiosis: Functional genomic resources for two reef building Caribbean corals, Acropora palmata and Montastraea faveolata

    PubMed Central

    Schwarz, Jodi A; Brokstein, Peter B; Voolstra, Christian; Terry, Astrid Y; Miller, David J; Szmant, Alina M; Coffroth, Mary Alice; Medina, Mónica

    2008-01-01

    Background Scleractinian corals are the foundation of reef ecosystems in tropical marine environments. Their great success is due to interactions with endosymbiotic dinoflagellates (Symbiodinium spp.), with which they are obligately symbiotic. To develop a foundation for studying coral biology and coral symbiosis, we have constructed a set of cDNA libraries and generated and annotated ESTs from two species of corals, Acropora palmata and Montastraea faveolata. Results We generated 14,588 (Ap) and 3,854 (Mf) high quality ESTs from five life history/symbiosis stages (spawned eggs, early-stage planula larvae, late-stage planula larvae either infected with symbionts or uninfected, and adult coral). The ESTs assembled into a set of primarily stage-specific clusters, producing 4,980 (Ap), and 1,732 (Mf) unigenes. The egg stage library, relative to the other developmental stages, was enriched in genes functioning in cell division and proliferation, transcription, signal transduction, and regulation of protein function. Fifteen unigenes were identified as candidate symbiosis-related genes as they were expressed in all libraries constructed from the symbiotic stages and were absent from all of the non symbiotic stages. These include several DNA interacting proteins, and one highly expressed unigene (containing 17 cDNAs) with no significant protein-coding region. A significant number of unigenes (25) encode potential pattern recognition receptors (lectins, scavenger receptors, and others), as well as genes that may function in signaling pathways involved in innate immune responses (toll-like signaling, NFkB p105, and MAP kinases). Comparison between the A. palmata and an A. millepora EST dataset identified ferritin as a highly expressed gene in both datasets that appears to be undergoing adaptive evolution. 
    Five unigenes appear to be restricted to the Scleractinia, as they had no homology to any sequences in the nr databases nor to the non-scleractinian cnidarians Nematostella vectensis and Hydra magnipapillata. Conclusion Partial sequencing of 5 cDNA libraries each for A. palmata and M. faveolata has produced a rich set of candidate genes (4,980 genes from A. palmata, and 1,732 genes from M. faveolata) that we can use as a starting point for examining the life history and symbiosis of these two species, as well as to further expand the dataset of cnidarian genes for comparative genomics and evolutionary studies. PMID:18298846

  9. Coral Life History and Symbiosis: functional genomic resources for two reef building Caribbean corals, Acropora palmata and Montastraea faveolata

    DOE PAGES

    Schwarz, Jodi A.; Brokstein, Peter B.; Voolstra, Christian R.; ...

    2008-02-25

    Scleractinian corals are the foundation of reef ecosystems in tropical marine environments. Their great success is due to interactions with endosymbiotic dinoflagellates (Symbiodinium spp.), with which they are obligately symbiotic. To develop a foundation for studying coral biology and coral symbiosis, we have constructed a set of cDNA libraries and generated and annotated ESTs from two species of corals, Acropora palmata and Montastraea faveolata. Here we generated 14,588 (Ap) and 3,854 (Mf) high quality ESTs from five life history/symbiosis stages (spawned eggs, early-stage planula larvae, late-stage planula larvae either infected with symbionts or uninfected, and adult coral). The ESTs assembled into a set of primarily stage-specific clusters, producing 4,980 (Ap), and 1,732 (Mf) unigenes. The egg stage library, relative to the other developmental stages, was enriched in genes functioning in cell division and proliferation, transcription, signal transduction, and regulation of protein function. Fifteen unigenes were identified as candidate symbiosis-related genes as they were expressed in all libraries constructed from the symbiotic stages and were absent from all of the non symbiotic stages. These include several DNA interacting proteins, and one highly expressed unigene (containing 17 cDNAs) with no significant protein-coding region. A significant number of unigenes (25) encode potential pattern recognition receptors (lectins, scavenger receptors, and others), as well as genes that may function in signaling pathways involved in innate immune responses (toll-like signaling, NFkB p105, and MAP kinases). Comparison between the A. palmata and an A. millepora EST dataset identified ferritin as a highly expressed gene in both datasets that appears to be undergoing adaptive evolution.
    Five unigenes appear to be restricted to the Scleractinia, as they had no homology to any sequences in the nr databases nor to the non-scleractinian cnidarians Nematostella vectensis and Hydra magnipapillata. In conclusion, partial sequencing of 5 cDNA libraries each for A. palmata and M. faveolata has produced a rich set of candidate genes (4,980 genes from A. palmata, and 1,732 genes from M. faveolata) that we can use as a starting point for examining the life history and symbiosis of these two species, as well as to further expand the dataset of cnidarian genes for comparative genomics and evolutionary studies.

  10. Coral Life History and Symbiosis: functional genomic resources for two reef building Caribbean corals, Acropora palmata and Montastraea faveolata

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Schwarz, Jodi A.; Brokstein, Peter B.; Voolstra, Christian R.

    Scleractinian corals are the foundation of reef ecosystems in tropical marine environments. Their great success is due to interactions with endosymbiotic dinoflagellates (Symbiodinium spp.), with which they are obligately symbiotic. To develop a foundation for studying coral biology and coral symbiosis, we have constructed a set of cDNA libraries and generated and annotated ESTs from two species of corals, Acropora palmata and Montastraea faveolata. Here we generated 14,588 (Ap) and 3,854 (Mf) high quality ESTs from five life history/symbiosis stages (spawned eggs, early-stage planula larvae, late-stage planula larvae either infected with symbionts or uninfected, and adult coral). The ESTs assembled into a set of primarily stage-specific clusters, producing 4,980 (Ap), and 1,732 (Mf) unigenes. The egg stage library, relative to the other developmental stages, was enriched in genes functioning in cell division and proliferation, transcription, signal transduction, and regulation of protein function. Fifteen unigenes were identified as candidate symbiosis-related genes as they were expressed in all libraries constructed from the symbiotic stages and were absent from all of the non symbiotic stages. These include several DNA interacting proteins, and one highly expressed unigene (containing 17 cDNAs) with no significant protein-coding region. A significant number of unigenes (25) encode potential pattern recognition receptors (lectins, scavenger receptors, and others), as well as genes that may function in signaling pathways involved in innate immune responses (toll-like signaling, NFkB p105, and MAP kinases). Comparison between the A. palmata and an A. millepora EST dataset identified ferritin as a highly expressed gene in both datasets that appears to be undergoing adaptive evolution. Five unigenes appear to be restricted to the Scleractinia, as they had no homology to any sequences in the nr databases nor to the non-scleractinian cnidarians Nematostella vectensis and Hydra magnipapillata. In conclusion, partial sequencing of 5 cDNA libraries each for A. palmata and M. faveolata has produced a rich set of candidate genes (4,980 genes from A. palmata, and 1,732 genes from M. faveolata) that we can use as a starting point for examining the life history and symbiosis of these two species, as well as to further expand the dataset of cnidarian genes for comparative genomics and evolutionary studies.

  11. Inference of combinatorial Boolean rules of synergistic gene sets from cancer microarray datasets.

    PubMed

    Park, Inho; Lee, Kwang H; Lee, Doheon

    2010-06-15

    Gene set analysis has become an important tool for the functional interpretation of high-throughput gene expression datasets. Moreover, pattern analyses based on inferred gene set activities of individual samples have shown the ability to identify more robust disease signatures than individual gene-based pattern analyses. Although a number of approaches have been proposed for gene set-based pattern analysis, the combinatorial influence of deregulated gene sets on disease phenotype classification has not been studied sufficiently. We propose a new approach for inferring combinatorial Boolean rules of gene sets for a better understanding of cancer transcriptome and cancer classification. To reduce the search space of the possible Boolean rules, we identify small groups of gene sets that synergistically contribute to the classification of samples into their corresponding phenotypic groups (such as normal and cancer). We then measure the significance of the candidate Boolean rules derived from each group of gene sets; the level of significance is based on the class entropy of the samples selected in accordance with the rules. By applying the present approach to publicly available prostate cancer datasets, we identified 72 significant Boolean rules. Finally, we discuss several identified Boolean rules, such as the rule of glutathione metabolism (down) and prostaglandin synthesis regulation (down), which are consistent with known prostate cancer biology. Scripts written in Python and R are available at http://biosoft.kaist.ac.kr/~ihpark/. The refined gene sets and the full list of the identified Boolean rules are provided in the Supplementary Material. Supplementary data are available at Bioinformatics online.
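
    As a rough illustration of the scoring idea in this abstract, the sketch below computes the class entropy of the samples selected by one Boolean rule over gene set states. The data layout, the function names (`class_entropy`, `rule_entropy`) and the example rule are assumptions made for illustration; the abstract does not say how the entropy is turned into a significance level, so that step is left out.

```python
import math
from collections import Counter


def class_entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    # Sum of p * log2(1/p) over the observed classes.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def rule_entropy(samples, rule):
    """Entropy of the class labels among samples that satisfy `rule`.

    samples: list of (states, label) pairs, where `states` maps a gene set
             name to an inferred activity such as "up" or "down"
             (a hypothetical layout, not the authors' data format).
    rule:    a function taking `states` and returning True or False.
    Returns (entropy, number_of_selected_samples).
    """
    selected = [label for states, label in samples if rule(states)]
    return class_entropy(selected), len(selected)


# Hypothetical rule in the spirit of the abstract's example:
# gene set "gs1" is down AND gene set "gs2" is down.
def both_down(states):
    return states.get("gs1") == "down" and states.get("gs2") == "down"
```

    A lower entropy means the rule selects samples that mostly belong to a single phenotype; the number of selected samples is returned alongside it because a rule that matches very few samples can reach low entropy by chance.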

  12. Pattern Genes Suggest Functional Connectivity of Organs

    NASA Astrophysics Data System (ADS)

    Qin, Yangmei; Pan, Jianbo; Cai, Meichun; Yao, Lixia; Ji, Zhiliang

    2016-05-01

    The human organ, as the basic structural and functional unit of the human body, is made up of a large community of different cell types that are organically bound together. Each organ usually exerts a highly specific physiological function, while several related organs work together to perform complicated body functions. In this study, we present a computational effort to understand the roles of genes in building functional connections between organs. More specifically, we mined multiple transcriptome datasets sampled from 36 human organs and tissues, and quantitatively identified 3,149 genes whose expression showed consensus modular patterns: specific to one organ/tissue, selectively expressed in several functionally related tissues, or ubiquitously expressed. These pattern genes imply intrinsic connections between organs. According to the expression abundance of the 766 selective genes, we consistently cluster the 36 human organs/tissues into seven functional groups: adipose & gland, brain, muscle, immune, metabolism, mucoid and nerve conduction. The organs and tissues in each group either work together to form organ systems or coordinate to perform particular body functions. The particular roles of specific genes and selective genes suggest that they could not only be used to mechanistically explore organ functions, but also be developed as selective biomarkers and therapeutic targets.

  13. Characteristics of carboxylesterase genes and their expression-level between acaricide-susceptible and resistant Tetranychus cinnabarinus (Boisduval).

    PubMed

    Wei, Peng; Shi, Li; Shen, Guangmao; Xu, Zhifeng; Liu, Jialu; Pan, Yu; He, Lin

    2016-07-01

    Carboxylesterases (CarEs) play important roles in metabolism and detoxification of dietary and environmental xenobiotics in insects and mites. On the basis of the Tetranychus cinnabarinus transcriptome dataset, 23 CarE genes (6 genes are full sequence and 17 genes are partial sequence) were identified. Synergist bioassay showed that CarEs were involved in acaricide detoxification and resistance in fenpropathrin- (FeR) and cyflumetofen-resistant (CyR) strains. In order to further reveal the relationship between CarE gene's expression and acaricide-resistance in T. cinnabarinus, we profiled their expression in susceptible (SS) and resistant strains (FeR, and CyR). There were 8 and 4 over-expressed carboxylesterase genes in FeR and CyR, respectively, from which the over-expressions were detected at mRNA level, but not DNA level. Pesticide induction experiment elucidated that 4 of 8 and 2 of 4 up-regulated genes were inducible with significance in FeR and CyR strains, respectively, but they could not be induced in SS strain, which indicated that these genes became more enhanced and effective to withstand the pesticides' stress in resistant T. cinnabarinus. Most expression-changed and all inducible genes possess the Abhydrolase_3 motif, which is a catalytic domain for hydrolyzing. As a whole, these findings in current study provide clues for further elucidating the function and regulation mechanism of these carboxylesterase genes in T. cinnabarinus' resistance formation. Copyright © 2015 Elsevier B.V. All rights reserved.

  14. Integrated pathway-based approach identifies association between genomic regions at CTCF and CACNB2 and schizophrenia.

    PubMed

    Juraeva, Dilafruz; Haenisch, Britta; Zapatka, Marc; Frank, Josef; Witt, Stephanie H; Mühleisen, Thomas W; Treutlein, Jens; Strohmaier, Jana; Meier, Sandra; Degenhardt, Franziska; Giegling, Ina; Ripke, Stephan; Leber, Markus; Lange, Christoph; Schulze, Thomas G; Mössner, Rainald; Nenadic, Igor; Sauer, Heinrich; Rujescu, Dan; Maier, Wolfgang; Børglum, Anders; Ophoff, Roel; Cichon, Sven; Nöthen, Markus M; Rietschel, Marcella; Mattheisen, Manuel; Brors, Benedikt

    2014-06-01

    In the present study, an integrated hierarchical approach was applied to: (1) identify pathways associated with susceptibility to schizophrenia; (2) detect genes that may be potentially affected in these pathways since they contain an associated polymorphism; and (3) annotate the functional consequences of such single-nucleotide polymorphisms (SNPs) in the affected genes or their regulatory regions. The Global Test was applied to detect schizophrenia-associated pathways using discovery and replication datasets comprising 5,040 and 5,082 individuals of European ancestry, respectively. Information concerning functional gene-sets was retrieved from the Kyoto Encyclopedia of Genes and Genomes, Gene Ontology, and the Molecular Signatures Database. Fourteen of the gene-sets or pathways identified in the discovery dataset were confirmed in the replication dataset. These include functional processes involved in transcriptional regulation and gene expression, synapse organization, cell adhesion, and apoptosis. For two genes, i.e. CTCF and CACNB2, evidence for association with schizophrenia was available (at the gene-level) in both the discovery study and published data from the Psychiatric Genomics Consortium schizophrenia study. Furthermore, these genes mapped to four of the 14 presently identified pathways. Several of the SNPs assigned to CTCF and CACNB2 have potential functional consequences, and a gene in close proximity to CACNB2, i.e. ARL5B, was identified as a potential gene of interest. Application of the present hierarchical approach thus allowed: (1) identification of novel biological gene-sets or pathways with potential involvement in the etiology of schizophrenia, as well as replication of these findings in an independent cohort; (2) detection of genes of interest for future follow-up studies; and (3) the highlighting of novel genes in previously reported candidate regions for schizophrenia.

  15. Gene expression meta-analysis identifies chromosomal regions and candidate genes involved in breast cancer metastasis.

    PubMed

    Thomassen, Mads; Tan, Qihua; Kruse, Torben A

    2009-01-01

    Breast cancer cells exhibit complex karyotypic alterations causing deregulation of numerous genes. Some of these genes are probably causal for cancer formation and local growth whereas others are causal for the various steps of metastasis. In a fraction of tumors deregulation of the same genes might be caused by epigenetic modulations, point mutations or the influence of other genes. We have investigated the relation of gene expression and chromosomal position, using eight datasets including more than 1200 breast tumors, to identify chromosomal regions and candidate genes possibly causal for breast cancer metastasis. By use of "Gene Set Enrichment Analysis" we have ranked chromosomal regions according to their relation to metastasis. Overrepresentation analysis identified regions with increased expression for chromosome 1q41-42, 8q24, 12q14, 16q22, 16q24, 17q12-21.2, 17q21-23, 17q25, 20q11, and 20q13 among metastasizing tumors and reduced gene expression at 1p31-21, 8p22-21, and 14q24. By analysis of genes with extremely imbalanced expression in these regions we identified DIRAS3 at 1p31, PSD3, LPL, EPHX2 at 8p21-22, and FOS at 14q24 as candidate metastasis suppressor genes. Potential metastasis promoting genes includes RECQL4 at 8q24, PRMT7 at 16q22, GINS2 at 16q24, and AURKA at 20q13.

  16. Inferring causal genomic alterations in breast cancer using gene expression data

    PubMed Central

    2011-01-01

    Background: One of the primary objectives in cancer research is to identify causal genomic alterations, such as somatic copy number variation (CNV) and somatic mutations, during tumor development. Many valuable studies lack genomic data to detect CNV; therefore, methods that are able to infer CNVs from gene expression data would help maximize the value of these studies. Results: We developed a framework for identifying recurrent regions of CNV and distinguishing the cancer driver genes from the passenger genes in the regions. By inferring CNV regions across many datasets we were able to identify 109 recurrent amplified/deleted CNV regions. Many of these regions are enriched for genes involved in many important processes associated with tumorigenesis and cancer progression. Genes in these recurrent CNV regions were then examined in the context of gene regulatory networks to prioritize putative cancer driver genes. The cancer driver genes uncovered by the framework include not only well-known oncogenes but also a number of novel cancer susceptibility genes validated via siRNA experiments. Conclusions: To our knowledge, this is the first effort to systematically identify and validate drivers for expression based CNV regions in breast cancer. The framework where the wavelet analysis of copy number alteration based on expression coupled with the gene regulatory network analysis, provides a blueprint for leveraging genomic data to identify key regulatory components and gene targets. This integrative approach can be applied to many other large-scale gene expression studies and other novel types of cancer data such as next-generation sequencing based expression (RNA-Seq) as well as CNV data. PMID:21806811

  17. Classification based upon gene expression data: bias and precision of error rates.

    PubMed

    Wood, Ian A; Visscher, Peter M; Mengersen, Kerrie L

    2007-06-01

    Gene expression data offer a large number of potentially useful predictors for the classification of tissue samples into classes, such as diseased and non-diseased. The predictive error rate of classifiers can be estimated using methods such as cross-validation. We have investigated issues of interpretation and potential bias in the reporting of error rate estimates. The issues considered here are optimization and selection biases, sampling effects, measures of misclassification rate, baseline error rates, two-level external cross-validation and a novel proposal for detection of bias using the permutation mean. Reporting an optimal estimated error rate incurs an optimization bias. Downward bias of 3-5% was found in an existing study of classification based on gene expression data and may be endemic in similar studies. Using a simulated non-informative dataset and two example datasets from existing studies, we show how bias can be detected through the use of label permutations and avoided using two-level external cross-validation. Some studies avoid optimization bias by using single-level cross-validation and a test set, but error rates can be more accurately estimated via two-level cross-validation. In addition to estimating the simple overall error rate, we recommend reporting class error rates plus where possible the conditional risk incorporating prior class probabilities and a misclassification cost matrix. We also describe baseline error rates derived from three trivial classifiers which ignore the predictors. R code which implements two-level external cross-validation with the PAMR package, experiment code, dataset details and additional figures are freely available for non-commercial use from http://www.maths.qut.edu.au/profiles/wood/permr.jsp

  18. Identification of the Consistently Altered Metabolic Targets in Human Hepatocellular Carcinoma.

    PubMed

    Nwosu, Zeribe Chike; Megger, Dominik Andre; Hammad, Seddik; Sitek, Barbara; Roessler, Stephanie; Ebert, Matthias Philip; Meyer, Christoph; Dooley, Steven

    2017-09-01

    Cancer cells rely on metabolic alterations to enhance proliferation and survival. Metabolic gene alterations that repeatedly occur in liver cancer are largely unknown. We aimed to identify metabolic genes that are consistently deregulated, and are of potential clinical significance in human hepatocellular carcinoma (HCC). We studied the expression of 2,761 metabolic genes in 8 microarray datasets comprising 521 human HCC tissues. Genes exclusively up-regulated or down-regulated in 6 or more datasets were defined as consistently deregulated. The consistent genes that correlated with tumor progression markers (ECM2 and MMP9) (Pearson correlation P < .05) were used for Kaplan-Meier overall survival analysis in a patient cohort. We further compared proteomic expression of metabolic genes in 19 tumors vs adjacent normal liver tissues. We identified 634 consistent metabolic genes, ∼60% of which are not yet described in HCC. The down-regulated genes (n = 350) are mostly involved in physiologic hepatocyte metabolic functions (eg, xenobiotic, fatty acid, and amino acid metabolism). In contrast, among consistently up-regulated metabolic genes (n = 284) are those involved in glycolysis, pentose phosphate pathway, nucleotide biosynthesis, tricarboxylic acid cycle, oxidative phosphorylation, proton transport, membrane lipid, and glycan metabolism. Several metabolic genes (n = 434) correlated with progression markers, and of these, 201 predicted overall survival outcome in the patient cohort analyzed. Over 90% of the metabolic targets significantly altered at the protein level were similarly up- or down-regulated as in genomic profile. We provide the first exposition of the consistently altered metabolic genes in HCC and show that these genes are potentially relevant targets for onward studies in preclinical and clinical contexts.
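
    The consistency rule stated in this abstract (exclusively up- or down-regulated in 6 or more of the 8 datasets) can be sketched as follows. The input layout, the function name `consistently_deregulated`, and the reading of "exclusively" as "never called in the opposite direction" are assumptions, not details taken from the study.

```python
from collections import Counter


def consistently_deregulated(calls_per_dataset, min_datasets=6):
    """Return {gene: "up" | "down"} for genes deregulated in one direction only.

    calls_per_dataset: list with one dict per dataset, mapping a gene name to
                       "up" or "down"; genes that are not differentially
                       expressed in a dataset are simply absent from its dict
                       (a hypothetical layout chosen for this sketch).
    min_datasets:      minimum number of datasets that must agree
                       (6 in the abstract, out of 8 datasets).
    """
    up, down = Counter(), Counter()
    for calls in calls_per_dataset:
        for gene, direction in calls.items():
            if direction == "up":
                up[gene] += 1
            elif direction == "down":
                down[gene] += 1

    result = {}
    for gene in set(up) | set(down):
        # Assumption: "exclusively" means no dataset calls the gene
        # in the opposite direction.
        if up[gene] >= min_datasets and down[gene] == 0:
            result[gene] = "up"
        elif down[gene] >= min_datasets and up[gene] == 0:
            result[gene] = "down"
    return result
```

    Under this reading, a gene called up in six datasets and down in a seventh is excluded, while a gene called up in six datasets and unchanged in the other two is kept.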

  19. Network-Based Integration of GWAS and Gene Expression Identifies a HOX-Centric Network Associated with Serous Ovarian Cancer Risk.

    PubMed

    Kar, Siddhartha P; Tyrer, Jonathan P; Li, Qiyuan; Lawrenson, Kate; Aben, Katja K H; Anton-Culver, Hoda; Antonenkova, Natalia; Chenevix-Trench, Georgia; Baker, Helen; Bandera, Elisa V; Bean, Yukie T; Beckmann, Matthias W; Berchuck, Andrew; Bisogna, Maria; Bjørge, Line; Bogdanova, Natalia; Brinton, Louise; Brooks-Wilson, Angela; Butzow, Ralf; Campbell, Ian; Carty, Karen; Chang-Claude, Jenny; Chen, Yian Ann; Chen, Zhihua; Cook, Linda S; Cramer, Daniel; Cunningham, Julie M; Cybulski, Cezary; Dansonka-Mieszkowska, Agnieszka; Dennis, Joe; Dicks, Ed; Doherty, Jennifer A; Dörk, Thilo; du Bois, Andreas; Dürst, Matthias; Eccles, Diana; Easton, Douglas F; Edwards, Robert P; Ekici, Arif B; Fasching, Peter A; Fridley, Brooke L; Gao, Yu-Tang; Gentry-Maharaj, Aleksandra; Giles, Graham G; Glasspool, Rosalind; Goode, Ellen L; Goodman, Marc T; Grownwald, Jacek; Harrington, Patricia; Harter, Philipp; Hein, Alexander; Heitz, Florian; Hildebrandt, Michelle A T; Hillemanns, Peter; Hogdall, Estrid; Hogdall, Claus K; Hosono, Satoyo; Iversen, Edwin S; Jakubowska, Anna; Paul, James; Jensen, Allan; Ji, Bu-Tian; Karlan, Beth Y; Kjaer, Susanne K; Kelemen, Linda E; Kellar, Melissa; Kelley, Joseph; Kiemeney, Lambertus A; Krakstad, Camilla; Kupryjanczyk, Jolanta; Lambrechts, Diether; Lambrechts, Sandrina; Le, Nhu D; Lee, Alice W; Lele, Shashi; Leminen, Arto; Lester, Jenny; Levine, Douglas A; Liang, Dong; Lissowska, Jolanta; Lu, Karen; Lubinski, Jan; Lundvall, Lene; Massuger, Leon; Matsuo, Keitaro; McGuire, Valerie; McLaughlin, John R; McNeish, Iain A; Menon, Usha; Modugno, Francesmary; Moysich, Kirsten B; Narod, Steven A; Nedergaard, Lotte; Ness, Roberta B; Nevanlinna, Heli; Odunsi, Kunle; Olson, Sara H; Orlow, Irene; Orsulic, Sandra; Weber, Rachel Palmieri; Pearce, Celeste Leigh; Pejovic, Tanja; Pelttari, Liisa M; Permuth-Wey, Jennifer; Phelan, Catherine M; Pike, Malcolm C; Poole, Elizabeth M; Ramus, Susan J; Risch, Harvey A; Rosen, Barry; Rossing, Mary Anne; Rothstein, Joseph H; Rudolph, Anja; Runnebaum, Ingo B; Rzepecka, Iwona K; Salvesen, Helga B; Schildkraut, Joellen M; Schwaab, Ira; Shu, Xiao-Ou; Shvetsov, Yurii B; Siddiqui, Nadeem; Sieh, Weiva; Song, Honglin; Southey, Melissa C; Sucheston-Campbell, Lara E; Tangen, Ingvild L; Teo, Soo-Hwang; Terry, Kathryn L; Thompson, Pamela J; Timorek, Agnieszka; Tsai, Ya-Yu; Tworoger, Shelley S; van Altena, Anne M; Van Nieuwenhuysen, Els; Vergote, Ignace; Vierkant, Robert A; Wang-Gohrke, Shan; Walsh, Christine; Wentzensen, Nicolas; Whittemore, Alice S; Wicklund, Kristine G; Wilkens, Lynne R; Woo, Yin-Ling; Wu, Xifeng; Wu, Anna; Yang, Hannah; Zheng, Wei; Ziogas, Argyrios; Sellers, Thomas A; Monteiro, Alvaro N A; Freedman, Matthew L; Gayther, Simon A; Pharoah, Paul D P

    2015-10-01

    Genome-wide association studies (GWAS) have so far reported 12 loci associated with serous epithelial ovarian cancer (EOC) risk. We hypothesized that some of these loci function through nearby transcription factor (TF) genes and that putative target genes of these TFs as identified by coexpression may also be enriched for additional EOC risk associations. We selected TF genes within 1 Mb of the top signal at the 12 genome-wide significant risk loci. Mutual information, a form of correlation, was used to build networks of genes strongly coexpressed with each selected TF gene in the unified microarray dataset of 489 serous EOC tumors from The Cancer Genome Atlas. Genes represented in this dataset were subsequently ranked using a gene-level test based on results for germline SNPs from a serous EOC GWAS meta-analysis (2,196 cases/4,396 controls). Gene set enrichment analysis identified six networks centered on TF genes (HOXB2, HOXB5, HOXB6, HOXB7 at 17q21.32 and HOXD1, HOXD3 at 2q31) that were significantly enriched for genes from the risk-associated end of the ranked list (P < 0.05 and FDR < 0.05). These results were replicated (P < 0.05) using an independent association study (7,035 cases/21,693 controls). Genes underlying enrichment in the six networks were pooled into a combined network. We identified a HOX-centric network associated with serous EOC risk containing several genes with known or emerging roles in serous EOC development. Network analysis integrating large, context-specific datasets has the potential to offer mechanistic insights into cancer susceptibility and prioritize genes for experimental characterization. ©2015 American Association for Cancer Research.

  20. Cancer cell redirection biomarker discovery using a mutual information approach.

    PubMed

    Roche, Kimberly; Feltus, F Alex; Park, Jang Pyo; Coissieux, Marie-May; Chang, Chenyan; Chan, Vera B S; Bentires-Alj, Mohamed; Booth, Brian W

    2017-01-01

    Introducing tumor-derived cells into normal mammary stem cell niches at a sufficiently high ratio of normal to tumorous cells causes those tumor cells to undergo a change to normal mammary phenotype and yield normal mammary progeny. This phenomenon has been termed cancer cell redirection. We have developed an in vitro model that mimics in vivo redirection of cancer cells by the normal mammary microenvironment. Using the RNA profiling data from this cellular model, we examined high-level characteristics of the normal, redirected, and tumor transcriptomes and found the global expression profiles clearly distinguish the three expression states. To identify potential redirection biomarkers that cause the redirected state to shift toward the normal expression pattern, we used mutual information relationships between normal, redirected, and tumor cell groups. Mutual information relationship analysis reduced a dataset of over 35,000 gene expression measurements spread over 13,000 curated gene sets to a set of 20 significant molecular signatures totaling 906 unique loci. Several of these molecular signatures are hallmark drivers of the tumor state. Using differential expression as a guide, we further refined the gene set to 120 core redirection biomarker genes. The expression levels of these core biomarkers are sufficient to make the normal and redirected gene expression states indistinguishable from each other but radically different from the tumor state.
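
    To make the mutual information idea in this abstract concrete, the sketch below shows the standard plug-in estimate for two discrete variables, here a discretized expression level and a cell-group label. The abstract does not describe the estimator or any discretization the authors used, so the function name `mutual_information`, the "high"/"low" binning and the group labels are all assumptions for illustration.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in mutual information (in bits) between two discrete sequences.

    xs, ys: equal-length sequences of hashable values, for example a
            discretized expression level per sample and the sample's group.
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n == 0:
        return 0.0
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts.
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


# Hypothetical toy input: one gene's discretized expression across four samples.
expression = ["high", "high", "low", "low"]
groups = ["tumor", "tumor", "normal", "normal"]
score = mutual_information(expression, groups)
```

    A score near zero indicates that the expression level carries little information about the group label, while larger values indicate a stronger dependence; with small sample sizes the plug-in estimate is biased upward, so it is best read as a ranking aid rather than an absolute measure.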

  1. Cancer cell redirection biomarker discovery using a mutual information approach

    PubMed Central

    Roche, Kimberly; Feltus, F. Alex; Park, Jang Pyo; Coissieux, Marie-May; Chang, Chenyan; Chan, Vera B. S.; Bentires-Alj, Mohamed

    2017-01-01

    Introducing tumor-derived cells into normal mammary stem cell niches at a sufficiently high ratio of normal to tumorous cells causes those tumor cells to undergo a change to normal mammary phenotype and yield normal mammary progeny. This phenomenon has been termed cancer cell redirection. We have developed an in vitro model that mimics in vivo redirection of cancer cells by the normal mammary microenvironment. Using the RNA profiling data from this cellular model, we examined high-level characteristics of the normal, redirected, and tumor transcriptomes and found the global expression profiles clearly distinguish the three expression states. To identify potential redirection biomarkers that cause the redirected state to shift toward the normal expression pattern, we used mutual information relationships between normal, redirected, and tumor cell groups. Mutual information relationship analysis reduced a dataset of over 35,000 gene expression measurements spread over 13,000 curated gene sets to a set of 20 significant molecular signatures totaling 906 unique loci. Several of these molecular signatures are hallmark drivers of the tumor state. Using differential expression as a guide, we further refined the gene set to 120 core redirection biomarker genes. The expression levels of these core biomarkers are sufficient to make the normal and redirected gene expression states indistinguishable from each other but radically different from the tumor state. PMID:28594912

  2. Combined Use of Gene Expression Modeling and siRNA Screening Identifies Genes and Pathways Which Enhance the Activity of Cisplatin When Added at No Effect Levels to Non-Small Cell Lung Cancer Cells In Vitro

    PubMed Central

    Leung, Ada W. Y.; Hung, Stacy S.; Backstrom, Ian; Ricaurte, Daniel; Kwok, Brian; Poon, Steven; McKinney, Steven; Segovia, Romulo; Rawji, Jenna; Qadir, Mohammed A.; Aparicio, Samuel; Stirling, Peter C.; Steidl, Christian; Bally, Marcel B.

    2016-01-01

    Platinum-based combination chemotherapy is the standard treatment for advanced non-small cell lung cancer (NSCLC). While cisplatin is effective, its use is not curative and resistance often emerges. As a consequence of microenvironmental heterogeneity, many tumour cells are exposed to sub-lethal doses of cisplatin. Further, genomic heterogeneity and unique tumor cell sub-populations with reduced sensitivities to cisplatin play a role in its effectiveness within a site of tumor growth. Being exposed to sub-lethal doses will induce changes in gene expression that contribute to the tumour cell’s ability to survive and eventually contribute to the selective pressures leading to cisplatin resistance. Such changes in gene expression, therefore, may contribute to cytoprotective mechanisms. Here, we report on studies designed to uncover how tumour cells respond to sub-lethal doses of cisplatin. A microarray study revealed changes in gene expressions that occurred when A549 cells were exposed to a no-observed-effect level (NOEL) of cisplatin (e.g. the IC10). These data were integrated with results from a genome-wide siRNA screen looking for novel therapeutic targets that when inhibited transformed a NOEL of cisplatin into one that induced significant increases in lethality. Pathway analyses were performed to identify pathways that could be targeted to enhance cisplatin activity. We found that over 100 genes were differentially expressed when A549 cells were exposed to a NOEL of cisplatin. Pathways associated with apoptosis and DNA repair were activated. The siRNA screen revealed the importance of the hedgehog, cell cycle regulation, and insulin action pathways in A549 cell survival and response to cisplatin treatment. Results from both datasets suggest that RRM2B, CABYR, ALDH3A1, and FHL2 could be further explored as cisplatin-enhancing gene targets. 
Finally, pathways involved in repairing double-strand DNA breaks and INO80 chromatin remodeling were enriched in both datasets, warranting further research into combinations of cisplatin and therapeutics targeting these pathways. PMID:26938915

  3. Combined Use of Gene Expression Modeling and siRNA Screening Identifies Genes and Pathways Which Enhance the Activity of Cisplatin When Added at No Effect Levels to Non-Small Cell Lung Cancer Cells In Vitro.

    PubMed

    Leung, Ada W Y; Hung, Stacy S; Backstrom, Ian; Ricaurte, Daniel; Kwok, Brian; Poon, Steven; McKinney, Steven; Segovia, Romulo; Rawji, Jenna; Qadir, Mohammed A; Aparicio, Samuel; Stirling, Peter C; Steidl, Christian; Bally, Marcel B

    2016-01-01

    Platinum-based combination chemotherapy is the standard treatment for advanced non-small cell lung cancer (NSCLC). While cisplatin is effective, its use is not curative and resistance often emerges. As a consequence of microenvironmental heterogeneity, many tumour cells are exposed to sub-lethal doses of cisplatin. Further, genomic heterogeneity and unique tumor cell sub-populations with reduced sensitivities to cisplatin play a role in its effectiveness within a site of tumor growth. Being exposed to sub-lethal doses will induce changes in gene expression that contribute to the tumour cell's ability to survive and eventually contribute to the selective pressures leading to cisplatin resistance. Such changes in gene expression, therefore, may contribute to cytoprotective mechanisms. Here, we report on studies designed to uncover how tumour cells respond to sub-lethal doses of cisplatin. A microarray study revealed changes in gene expressions that occurred when A549 cells were exposed to a no-observed-effect level (NOEL) of cisplatin (e.g. the IC10). These data were integrated with results from a genome-wide siRNA screen looking for novel therapeutic targets that when inhibited transformed a NOEL of cisplatin into one that induced significant increases in lethality. Pathway analyses were performed to identify pathways that could be targeted to enhance cisplatin activity. We found that over 100 genes were differentially expressed when A549 cells were exposed to a NOEL of cisplatin. Pathways associated with apoptosis and DNA repair were activated. The siRNA screen revealed the importance of the hedgehog, cell cycle regulation, and insulin action pathways in A549 cell survival and response to cisplatin treatment. Results from both datasets suggest that RRM2B, CABYR, ALDH3A1, and FHL2 could be further explored as cisplatin-enhancing gene targets. 
Finally, pathways involved in repairing double-strand DNA breaks and INO80 chromatin remodeling were enriched in both datasets, warranting further research into combinations of cisplatin and therapeutics targeting these pathways.

  4. Multiclass classification for skin cancer profiling based on the integration of heterogeneous gene expression series.

    PubMed

    Gálvez, Juan Manuel; Castillo, Daniel; Herrera, Luis Javier; San Román, Belén; Valenzuela, Olga; Ortuño, Francisco Manuel; Rojas, Ignacio

    2018-01-01

    Most of the research studies developed applying microarray technology to the characterization of different pathological states of any disease may fail in reaching statistically significant results. This is largely due to the small repertoire of analysed samples, and to the limitation in the number of states or pathologies usually addressed. Moreover, the influence of potential deviations on the gene expression quantification is usually disregarded. In spite of the continuous changes in omic sciences, reflected for instance in the emergence of new Next-Generation Sequencing-related technologies, the existing availability of a vast amount of gene expression microarray datasets should be properly exploited. Therefore, this work proposes a novel methodological approach involving the integration of several heterogeneous skin cancer series, and a later multiclass classifier design. This approach is thus a way to provide the clinicians with an intelligent diagnosis support tool based on the use of a robust set of selected biomarkers, which simultaneously distinguishes among different cancer-related skin states. To achieve this, a multi-platform combination of microarray datasets from Affymetrix and Illumina manufacturers was carried out. This integration is expected to strengthen the statistical robustness of the study as well as the finding of highly-reliable skin cancer biomarkers. Specifically, the designed operation pipeline has allowed the identification of a small subset of 17 differentially expressed genes (DEGs) from which to distinguish among 7 involved skin states. These genes were obtained from the assessment of a number of potential batch effects on the gene expression data. The biological interpretation of these genes was inspected in the specific literature to understand their underlying information in relation to skin cancer. 
Finally, in order to assess their possible effectiveness in cancer diagnosis, a cross-validation Support Vector Machines (SVM)-based classification including feature ranking was performed. The accuracy attained exceeded 92% in overall recognition of the 7 different cancer-related skin states. The proposed integration scheme is expected to allow the co-integration with other state-of-the-art technologies such as RNA-seq.

  5. The opportunities and challenges of large-scale molecular approaches to songbird neurobiology

    PubMed Central

    Mello, C.V.; Clayton, D.F.

    2014-01-01

    High-throughput methods for analyzing genome structure and function are having a large impact in songbird neurobiology. Methods include genome sequencing and annotation, comparative genomics, DNA microarrays and transcriptomics, and the development of a brain atlas of gene expression. Key emerging findings include the identification of complex transcriptional programs active during singing, the robust brain expression of non-coding RNAs, evidence of profound variations in gene expression across brain regions, and the identification of molecular specializations within song production and learning circuits. Current challenges include the statistical analysis of large datasets, effective genome curations, the efficient localization of gene expression changes to specific neuronal circuits and cells, and the dissection of behavioral and environmental factors that influence brain gene expression. The field requires efficient methods for comparisons with organisms like chicken, which offer important anatomical, functional and behavioral contrasts. As sequencing costs plummet, opportunities emerge for comparative approaches that may help reveal evolutionary transitions contributing to vocal learning, social behavior and other properties that make songbirds such compelling research subjects. PMID:25280907

  6. Identification of the Key Genes and Pathways in Esophageal Carcinoma.

    PubMed

    Su, Peng; Wen, Shiwang; Zhang, Yuefeng; Li, Yong; Xu, Yanzhao; Zhu, Yonggang; Lv, Huilai; Zhang, Fan; Wang, Mingbo; Tian, Ziqiang

    2016-01-01

    Objective. Esophageal carcinoma (EC) is a common gastrointestinal malignancy worldwide. This study aims to screen key genes and pathways in EC and elucidate its mechanism. Methods. Five microarray datasets of EC were downloaded from Gene Expression Omnibus. Differentially expressed genes (DEGs) were screened by bioinformatics analysis. Gene Ontology (GO) enrichment, Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment, and protein-protein interaction (PPI) network construction were performed to obtain the biological roles of DEGs in EC. Quantitative real-time polymerase chain reaction (qRT-PCR) was used to verify the expression level of DEGs in EC. Results. A total of 1955 genes were filtered as DEGs in EC. The upregulated genes were significantly enriched in cell cycle and the downregulated genes were significantly enriched in Endocytosis. The PPI network showed that CDK4 and CCT3 were hub proteins in the network. The expression level of 8 dysregulated DEGs including CDK4, CCT3, THSD4, SIM2, MYBL2, CENPF, CDCA3, and CDKN3 was validated in EC compared to adjacent nontumor tissues and the results matched the microarray analysis. Conclusion. The significant DEGs including CDK4, CCT3, THSD4, and SIM2 may play key roles in tumorigenesis and development of EC involved in cell cycle and Endocytosis.

  7. Identification of Gene Expression Signatures in the Chicken Intestinal Intraepithelial Lymphocytes in Response to Herb Additive Supplementations

    PubMed Central

    Won, Kyeong-Hye; Song, Ki-Duk; Park, Jong-Eun; Kim, Duk-Kyung; Na, Chong-Sam

    2016-01-01

    Anethole and garlic have immune modulatory effects on avian coccidiosis, and these effects are correlated with gene expression changes in intestinal epithelial lymphocytes (IELs). In this study, we integrated gene expression datasets from two independent experiments and investigated gene expression profile changes by anethole and garlic respectively, and identified gene expression signatures, which are common targets of these herbs as they might be used for the evaluation of the effect of plant herbs on immunity toward avian coccidiosis. We identified 4,382 and 371 genes, which were differentially expressed in IELs of chickens supplemented with garlic and anethole respectively. The gene ontology (GO) term of differentially expressed genes (DEGs) from garlic treatment resulted in the biological processes (BPs) related to proteolysis, e.g., “modification-dependent protein catabolic process”, “proteolysis involved in cellular protein catabolic process”, “cellular protein catabolic process”, “protein catabolic process”, and “ubiquitin-dependent protein catabolic process”. In GO analysis, one BP term, “Proteolysis”, was obtained. Among DEGs, 300 genes were differentially regulated in response to both garlic and anethole, and 234 and 59 genes were either up- or down-regulated in supplementation with both herbs. Pathway analysis resulted in enrichment of the pathways related to digestion such as “Starch and sucrose metabolism” and “Insulin signaling pathway”. Taken together, the results obtained in the present study could contribute to the effective development of an evaluation system for plant herbs based on molecular signatures related to their immunological functions in chicken IELs. PMID:26954117

  8. Gene Expression Profiling Predicts the Development of Oral Cancer

    PubMed Central

    Saintigny, Pierre; Zhang, Li; Fan, You-Hong; El-Naggar, Adel K.; Papadimitrakopoulou, Vali; Feng, Lei; Lee, J. Jack; Kim, Edward S.; Hong, Waun Ki; Mao, Li

    2011-01-01

    Patients with oral preneoplastic lesion (OPL) have a high risk of developing oral cancer. Although certain risk factors such as smoking status and histology are known, our ability to predict oral cancer risk remains poor. The study objective was to determine the value of gene expression profiling in predicting oral cancer development. Gene expression profile was measured in 86 of 162 OPL patients who were enrolled in a clinical chemoprevention trial that used the incidence of oral cancer development as a prespecified endpoint. The median follow-up time was 6.08 years and 35 of the 86 patients developed oral cancer over the course. Gene expression profiles were associated with oral cancer-free survival and used to develop multivariate predictive models for oral cancer prediction. We developed a 29-transcript predictive model which showed marked improvement in terms of prediction accuracy (with 8% predicting error rate) over the models using previously known clinico-pathological risk factors. Based on the gene expression profile data, we also identified 2182 transcripts significantly associated with oral cancer risk associated genes (P-value<0.01, single variate Cox proportional hazards model). Functional pathway analysis revealed proteasome machinery, MYC, and ribosome components as the top gene sets associated with oral cancer risk. In multiple independent datasets, the expression profiles of the genes can differentiate head and neck cancer from normal mucosa. Our results show that gene expression profiles may improve the prediction of oral cancer risk in OPL patients and the significant genes identified may serve as potential targets for oral cancer chemoprevention. PMID:21292635

  9. Clustering gene expression data based on predicted differential effects of GV interaction.

    PubMed

    Pan, Hai-Yan; Zhu, Jun; Han, Dan-Fu

    2005-02-01

    Microarray has become a popular biotechnology in biological and medical research. However, systematic and stochastic variabilities in microarray data are expected and unavoidable, resulting in the problem that the raw measurements have inherent "noise" within microarray experiments. Currently, logarithmic ratios are usually analyzed by various clustering methods directly, which may introduce biased interpretation in identifying groups of genes or samples. In this paper, a statistical method based on mixed model approaches was proposed for microarray data cluster analysis. The underlying rationale of this method is to partition the observed total gene expression level into various variations caused by different factors using an ANOVA model, and to predict the differential effects of GV (gene by variety) interaction using the adjusted unbiased prediction (AUP) method. The predicted GV interaction effects can then be used as the inputs of cluster analysis. We illustrated the application of our method with a gene expression dataset and elucidated the utility of our approach using an external validation.
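
    The abstract above does not state the model itself. As a rough illustration only, a generic two-factor ANOVA decomposition with a gene-by-variety interaction term could be written as follows; the symbols and the exact set of terms are assumptions, and the published model may well include further effects (for example array or dye terms):

```latex
% Assumed generic form, not taken from the paper:
% y_{ijk}  : observed (log) expression of gene i, variety j, replicate k
% \mu      : overall mean
% G_i      : main effect of gene i
% V_j      : main effect of variety j
% GV_{ij}  : gene-by-variety interaction (the quantity that is predicted and clustered)
% \varepsilon_{ijk} : residual error
y_{ijk} = \mu + G_i + V_j + GV_{ij} + \varepsilon_{ijk}
```

    Under this reading, clustering would operate on the predicted $GV_{ij}$ values rather than on raw log ratios.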

  10. Genome-wide expression profiling analysis to identify key genes in the anti-HIV mechanism of CD4+ and CD8+ T cells.

    PubMed

    Gao, Lijie; Wang, Yunqi; Li, Yi; Dong, Ya; Yang, Aimin; Zhang, Jie; Li, Fengying; Zhang, Rongqiang

    2018-07-01

    Comprehensive bioinformatics analyses were performed to explore the key biomarkers in response to HIV infection of CD4+ and CD8+ T cells. The numbers of CD4+ and CD8+ T cells of HIV infected individuals were analyzed and the GEO database (GSE6740) was screened for differentially expressed genes (DEGs) in HIV infected CD4+ and CD8+ T cells. Gene Ontology enrichment, KEGG pathway analyses, and protein-protein interaction (PPI) network were performed to identify the key pathway and core proteins in anti-HIV virus process of CD4+ and CD8+ T cells. Finally, we analyzed the expressions of key proteins in HIV-infected T cells (GSE6740 dataset) and peripheral blood mononuclear cells (PBMCs) (GSE511 dataset). 1) CD4+ T cell counts and the ratio of CD4+/CD8+ T cells decreased while CD8+ T cell counts increased in HIV positive individuals; 2) 517 DEGs were found in HIV infected CD4+ and CD8+ T cells at acute and chronic stage with the criteria of P-value <0.05 and fold change (FC) ≥2; 3) In acute HIV infection, type 1 interferon (IFN-1) pathway might play a critical role in response to HIV infection of T cells. The main biological processes of the DEGs were response to virus and defense response to virus. At chronic stage, ISG15 protein, in conjunction with IFN-1 pathway might play key roles in anti-HIV responses of CD4+ T cells; and 4) The expression of ISG15 increased in both T cells and PBMCs after HIV infection. Gene expression profile of CD4+ and CD8+ T cells changed significantly in HIV infection, in which ISG15 gene may play a central role in activating the natural antiviral process of immune cells. © 2018 Wiley Periodicals, Inc.
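
    The selection criteria quoted above (P-value <0.05 and fold change ≥2) can be sketched as a simple filter. This is a minimal illustration, not the authors' pipeline: the function name, the record layout, the gene labels and all numbers are hypothetical, and treating the fold-change cutoff symmetrically (2-fold up or 2-fold down) is an assumption.

```python
import math

def select_degs(records, p_cutoff=0.05, fc_cutoff=2.0):
    """Return names of genes with P < p_cutoff and at least fc_cutoff-fold change.

    records: iterable of (gene, mean_group_a, mean_group_b, p_value) tuples.
    The fold change is treated symmetrically (assumption): a gene passes if it
    is fc_cutoff-fold higher OR fc_cutoff-fold lower in group A than in group B.
    """
    log2_cutoff = math.log2(fc_cutoff)
    selected = []
    for gene, mean_a, mean_b, p_value in records:
        log2_fc = math.log2(mean_a / mean_b)
        if p_value < p_cutoff and abs(log2_fc) >= log2_cutoff:
            selected.append(gene)
    return selected

# Made-up example values (placeholder gene labels, not real measurements).
example = [
    ("GENE_A", 8.0, 2.0, 0.001),   # 4-fold up, significant: kept
    ("GENE_B", 3.0, 2.0, 0.001),   # 1.5-fold, below the fold-change cutoff
    ("GENE_C", 1.0, 4.0, 0.010),   # 4-fold down, significant: kept
    ("GENE_D", 10.0, 2.0, 0.200),  # large change but not significant
]

print(select_degs(example))
```

    In practice the P-values would come from a statistical test on replicate samples, usually followed by a multiple-testing adjustment, which this sketch leaves out.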

  11. SGDB: a database of synthetic genes re-designed for optimizing protein over-expression.

    PubMed

    Wu, Gang; Zheng, Yuanpu; Qureshi, Imran; Zin, Htar Thant; Beck, Tyler; Bulka, Blazej; Freeland, Stephen J

    2007-01-01

    Here we present the Synthetic Gene Database (SGDB): a relational database that houses sequences and associated experimental information on synthetic (artificially engineered) genes from all peer-reviewed studies published to date. At present, the database comprises information from more than 200 published experiments. This resource not only provides reference material to guide experimentalists in designing new genes that improve protein expression, but also offers a dataset for analysis by bioinformaticians who seek to test ideas regarding the underlying factors that influence gene expression. The SGDB was built on the MySQL database management system. We also offer an XML schema for standardized data description of synthetic genes. Users can access the database at http://www.evolvingcode.net/codon/sgdb/index.php, or batch download all information through XML files. Moreover, users may visually compare the coding sequences of a synthetic gene and its natural counterpart with an integrated web tool at http://www.evolvingcode.net/codon/sgdb/aligner.php, and discuss questions, findings and related information on an associated e-forum at http://www.evolvingcode.net/forum/viewforum.php?f=27.

  12. Comprehensive analysis of coding-lncRNA gene co-expression network uncovers conserved functional lncRNAs in zebrafish.

    PubMed

    Chen, Wen; Zhang, Xuan; Li, Jing; Huang, Shulan; Xiang, Shuanglin; Hu, Xiang; Liu, Changning

    2018-05-09

    Zebrafish is a fully developed model system for studying developmental processes and human disease. Recent deep sequencing studies have discovered a large number of long non-coding RNAs (lncRNAs) in zebrafish. However, only a few of them have been functionally characterized. Therefore, how to take advantage of the mature zebrafish system to deeply investigate the lncRNAs' function and conservation is really intriguing. We systematically collected and analyzed a series of zebrafish RNA-seq data, then combined them with resources from known databases and the literature. As a result, we obtained by far the most complete dataset of zebrafish lncRNAs, containing 13,604 lncRNA genes (21,128 transcripts) in total. Based on that, a co-expression network of zebrafish coding and lncRNA genes was constructed and analyzed, and used to predict the Gene Ontology (GO) and the KEGG annotation of lncRNA. Meanwhile, we made a conservation analysis on zebrafish lncRNA, identifying 1828 conserved zebrafish lncRNA genes (1890 transcripts) that have putative mammalian orthologs. We also found that zebrafish lncRNAs play important roles in regulation of the development and function of the nervous system; these conserved lncRNAs present significant sequence and functional conservation with their mammalian counterparts. By integrative data analysis and construction of a coding-lncRNA gene co-expression network, we gained the most comprehensive dataset of zebrafish lncRNAs to date, as well as their systematic annotations and comprehensive analyses on function and conservation. Our study provides a reliable zebrafish-based platform to deeply explore lncRNA function and mechanism, as well as the lncRNA commonality between zebrafish and human.

  13. Bayesian correlated clustering to integrate multiple datasets

    PubMed Central

    Kirk, Paul; Griffin, Jim E.; Savage, Richard S.; Ghahramani, Zoubin; Wild, David L.

    2012-01-01

    Motivation: The integration of multiple datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct—but often complementary—information. We present a Bayesian method for the unsupervised integrative modelling of multiple datasets, which we refer to as MDI (Multiple Dataset Integration). MDI can integrate information from a wide range of different datasets and data types simultaneously (including the ability to model time series data explicitly using Gaussian processes). Each dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture model, with dependencies between these models captured through parameters that describe the agreement among the datasets. Results: Using a set of six artificially constructed time series datasets, we show that MDI is able to integrate a significant number of datasets simultaneously, and that it successfully captures the underlying structural similarity between the datasets. We also analyse a variety of real Saccharomyces cerevisiae datasets. In the two-dataset case, we show that MDI’s performance is comparable with the present state-of-the-art. We then move beyond the capabilities of current approaches and integrate gene expression, chromatin immunoprecipitation–chip and protein–protein interaction data, to identify a set of protein complexes for which genes are co-regulated during the cell cycle. Comparisons to other unsupervised data integration techniques—as well as to non-integrative approaches—demonstrate that MDI is competitive, while also providing information that would be difficult or impossible to extract using other methods. Availability: A Matlab implementation of MDI is available from http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software/. Contact: D.L.Wild@warwick.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online. PMID:23047558

  14. Datasets2Tools, repository and search engine for bioinformatics datasets, tools and canned analyses

    PubMed Central

    Torre, Denis; Krawczuk, Patrycja; Jagodnik, Kathleen M.; Lachmann, Alexander; Wang, Zichen; Wang, Lily; Kuleshov, Maxim V.; Ma’ayan, Avi

    2018-01-01

    Biomedical data repositories such as the Gene Expression Omnibus (GEO) enable the search and discovery of relevant biomedical digital data objects. Similarly, resources such as OMICtools, index bioinformatics tools that can extract knowledge from these digital data objects. However, systematic access to pre-generated ‘canned’ analyses applied by bioinformatics tools to biomedical digital data objects is currently not available. Datasets2Tools is a repository indexing 31,473 canned bioinformatics analyses applied to 6,431 datasets. The Datasets2Tools repository also contains the indexing of 4,901 published bioinformatics software tools, and all the analyzed datasets. Datasets2Tools enables users to rapidly find datasets, tools, and canned analyses through an intuitive web interface, a Google Chrome extension, and an API. Furthermore, Datasets2Tools provides a platform for contributing canned analyses, datasets, and tools, as well as evaluating these digital objects according to their compliance with the findable, accessible, interoperable, and reusable (FAIR) principles. By incorporating community engagement, Datasets2Tools promotes sharing of digital resources to stimulate the extraction of knowledge from biomedical research data. Datasets2Tools is freely available from: http://amp.pharm.mssm.edu/datasets2tools. PMID:29485625

  15. Datasets2Tools, repository and search engine for bioinformatics datasets, tools and canned analyses.

    PubMed

    Torre, Denis; Krawczuk, Patrycja; Jagodnik, Kathleen M; Lachmann, Alexander; Wang, Zichen; Wang, Lily; Kuleshov, Maxim V; Ma'ayan, Avi

    2018-02-27

    Biomedical data repositories such as the Gene Expression Omnibus (GEO) enable the search and discovery of relevant biomedical digital data objects. Similarly, resources such as OMICtools, index bioinformatics tools that can extract knowledge from these digital data objects. However, systematic access to pre-generated 'canned' analyses applied by bioinformatics tools to biomedical digital data objects is currently not available. Datasets2Tools is a repository indexing 31,473 canned bioinformatics analyses applied to 6,431 datasets. The Datasets2Tools repository also contains the indexing of 4,901 published bioinformatics software tools, and all the analyzed datasets. Datasets2Tools enables users to rapidly find datasets, tools, and canned analyses through an intuitive web interface, a Google Chrome extension, and an API. Furthermore, Datasets2Tools provides a platform for contributing canned analyses, datasets, and tools, as well as evaluating these digital objects according to their compliance with the findable, accessible, interoperable, and reusable (FAIR) principles. By incorporating community engagement, Datasets2Tools promotes sharing of digital resources to stimulate the extraction of knowledge from biomedical research data. Datasets2Tools is freely available from: http://amp.pharm.mssm.edu/datasets2tools.

  16. Positive correlation between ADAR expression and its targets suggests a complex regulation mediated by RNA editing in the human brain

    PubMed Central

    Liscovitch, Noa; Bazak, Lily; Levanon, Erez Y; Chechik, Gal

    2014-01-01

    A-to-I RNA editing by adenosine deaminases acting on RNA is a post-transcriptional modification that is crucial for normal life and development in vertebrates. RNA editing has been shown to be very abundant in the human transcriptome, specifically at the primate-specific Alu elements. The functional role of this wide-spread effect is still not clear; it is believed that editing of transcripts is a mechanism for their down-regulation via processes such as nuclear retention or RNA degradation. Here we combine 2 neural gene expression datasets with genome-level editing information to examine the relation between the expression of ADAR genes with the expression of their target genes. Specifically, we computed the spatial correlation across structures of post-mortem human brains between ADAR and a large set of targets that were found to be edited in their Alu repeats. Surprisingly, we found that a large fraction of the edited genes are positively correlated with ADAR, opposing the assumption that editing would reduce expression. When considering the correlations between ADAR and its targets over development, 2 gene subsets emerge, positively correlated and negatively correlated with ADAR expression. Specifically, in embryonic time points, ADAR is positively correlated with many genes related to RNA processing and regulation of gene expression. These findings imply that the suggested mechanism of regulation of expression by editing is probably not a global one; ADAR expression does not have a genome wide effect reducing the expression of editing targets. It is possible, however, that RNA editing by ADAR in non-coding regions of the gene might be a part of a more complex expression regulation mechanism. PMID:25692240

  17. Positive correlation between ADAR expression and its targets suggests a complex regulation mediated by RNA editing in the human brain.

    PubMed

    Liscovitch, Noa; Bazak, Lily; Levanon, Erez Y; Chechik, Gal

    2014-01-01

    A-to-I RNA editing by adenosine deaminases acting on RNA is a post-transcriptional modification that is crucial for normal life and development in vertebrates. RNA editing has been shown to be very abundant in the human transcriptome, specifically at the primate-specific Alu elements. The functional role of this wide-spread effect is still not clear; it is believed that editing of transcripts is a mechanism for their down-regulation via processes such as nuclear retention or RNA degradation. Here we combine 2 neural gene expression datasets with genome-level editing information to examine the relation between the expression of ADAR genes with the expression of their target genes. Specifically, we computed the spatial correlation across structures of post-mortem human brains between ADAR and a large set of targets that were found to be edited in their Alu repeats. Surprisingly, we found that a large fraction of the edited genes are positively correlated with ADAR, opposing the assumption that editing would reduce expression. When considering the correlations between ADAR and its targets over development, 2 gene subsets emerge, positively correlated and negatively correlated with ADAR expression. Specifically, in embryonic time points, ADAR is positively correlated with many genes related to RNA processing and regulation of gene expression. These findings imply that the suggested mechanism of regulation of expression by editing is probably not a global one; ADAR expression does not have a genome wide effect reducing the expression of editing targets. It is possible, however, that RNA editing by ADAR in non-coding regions of the gene might be a part of a more complex expression regulation mechanism.

  18. Bioinformatic detection of E47, E2F1 and SREBP1 transcription factors as potential regulators of genes associated to acquisition of endometrial receptivity

    PubMed Central

    2011-01-01

    Background: The endometrium is a dynamic tissue whose changes are driven by the ovarian steroidal hormones. Its main function is to provide an adequate substrate for embryo implantation. Using microarray technology, several reports have provided the gene expression patterns of human endometrial tissue during the window of implantation. However it is required that biological connections be made across these genomic datasets to take full advantage of them. The objective of this work was to perform a research synthesis of available gene expression profiles related to acquisition of endometrial receptivity for embryo implantation, in order to gain insights into its molecular basis and regulation. Methods: Gene expression datasets were intersected to determine a consensus endometrial receptivity transcript list (CERTL). For this cluster of genes we determined their functional annotations using available web-based databases. In addition, promoter sequences were analyzed to identify putative transcription factor binding sites using bioinformatics tools and determined over-represented features. Results: We found 40 up- and 21 down-regulated transcripts in the CERTL. Those more consistently increased were C4BPA, SPP1, APOD, CD55, CFD, CLDN4, DKK1, ID4, IL15 and MAP3K5 whereas the more consistently decreased were OLFM1, CCNB1, CRABP2, EDN3, FGFR1, MSX1 and MSX2. Functional annotation of CERTL showed it was enriched with transcripts related to the immune response, complement activation and cell cycle regulation. Promoter sequence analysis of genes revealed that DNA binding sites for E47, E2F1 and SREBP1 transcription factors were the most consistently over-represented and in both up- and down-regulated genes during the window of implantation. Conclusions: Our research synthesis allowed organizing and mining high throughput data to explore endometrial receptivity and focus future research efforts on specific genes and pathways. 
The discovery of possible new transcription factors orchestrating the CERTL opens new alternatives for understanding gene expression regulation in uterine function. PMID:21272326
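
    The idea of intersecting several published gene lists into a consensus list can be sketched as below. This is only an illustration under assumptions: the abstract does not say how many studies a transcript had to appear in, so `min_studies` is a hypothetical parameter, and the transcript labels are placeholders rather than real data.

```python
from collections import Counter

def consensus_list(study_lists, min_studies):
    """Return transcripts reported by at least min_studies of the given studies.

    study_lists: iterable of collections of transcript identifiers, one per study.
    With min_studies equal to the number of studies this is a strict intersection.
    """
    counts = Counter()
    for transcripts in study_lists:
        counts.update(set(transcripts))  # count each transcript once per study
    return sorted(t for t, n in counts.items() if n >= min_studies)

# Placeholder identifiers standing in for per-study up-regulated transcript lists.
studies = [
    {"T1", "T2", "T3"},
    {"T2", "T3", "T4"},
    {"T3", "T4", "T5"},
]

print(consensus_list(studies, min_studies=3))  # strict intersection
print(consensus_list(studies, min_studies=2))  # reported by a majority
```

    Running the same function separately on up- and down-regulated lists keeps the direction of change consistent within each consensus set.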

  19. ESR1 Is Co-Expressed with Closely Adjacent Uncharacterised Genes Spanning a Breast Cancer Susceptibility Locus at 6q25.1

    PubMed Central

    Dunbier, Anita K.; Anderson, Helen; Ghazoui, Zara; Lopez-Knowles, Elena; Pancholi, Sunil; Ribas, Ricardo; Drury, Suzanne; Sidhu, Kally; Leary, Alexandra; Martin, Lesley-Ann; Dowsett, Mitch

    2011-01-01

    Approximately 80% of human breast carcinomas present as oestrogen receptor α-positive (ER+ve) disease, and ER status is a critical factor in treatment decision-making. Recently, single nucleotide polymorphisms (SNPs) in the region immediately upstream of the ER gene (ESR1) on 6q25.1 have been associated with breast cancer risk. Our investigation of factors associated with the level of expression of ESR1 in ER+ve tumours has revealed unexpected associations between genes in this region and ESR1 expression that are important to consider in studies of the genetic causes of breast cancer risk. RNA from tumour biopsies taken from 104 postmenopausal women before and after 2 weeks treatment with an aromatase (oestrogen synthase) inhibitor was analyzed on Illumina 48K microarrays. Multiple-testing corrected Spearman correlation revealed that three previously uncharacterized open reading frames (ORFs) located immediately upstream of ESR1, C6ORF96, C6ORF97, and C6ORF211 were highly correlated with ESR1 (Rs = 0.67, 0.64, and 0.55 respectively, FDR<1×10⁻⁷). Publicly available datasets confirmed this relationship in other groups of ER+ve tumours. DNA copy number changes did not account for the correlations. The correlations were maintained in cultured cells. An ERα antagonist did not affect the ORFs' expression or their correlation with ESR1, suggesting their transcriptional co-activation is not directly mediated by ERα. siRNA inhibition of C6ORF211 suppressed proliferation in MCF7 cells, and C6ORF211 positively correlated with a proliferation metagene in tumours. In contrast, C6ORF97 expression correlated negatively with the metagene and predicted for improved disease-free survival in a tamoxifen-treated published dataset, independently of ESR1. Our observations suggest that some of the biological effects previously attributed to ER could be mediated and/or modified by these co-expressed genes. 
The co-expression and function of these genes may be important influences on the recently identified relationship between SNPs in this region and breast cancer risk. PMID:21552322
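
    The Spearman correlation used above to relate ESR1 to neighbouring genes is the Pearson correlation of rank-transformed values. A minimal standard-library sketch follows; the function names are illustrative, the inputs are toy numbers rather than expression data, and the multiple-testing correction mentioned in the abstract is not included.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # 1-based average rank for the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy inputs: a monotone but non-linear relationship still gives a coefficient of 1.
print(spearman([1, 2, 3, 4], [1, 4, 9, 16]))
```

    Because only ranks enter the calculation, the coefficient is insensitive to monotone transformations such as taking logarithms of expression values.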

  20. Recursive feature selection with significant variables of support vectors.

    PubMed

    Tsai, Chen-An; Huang, Chien-Hsun; Chang, Ching-Wei; Chen, Chun-Houh

    2012-01-01

    The development of DNA microarrays allows researchers to screen thousands of genes simultaneously and also helps determine high- and low-expression level genes in normal and disease tissues. Selecting relevant genes for cancer classification is an important issue. Most gene selection methods use univariate ranking criteria and arbitrarily choose a threshold to select genes. However, the parameter setting may not be compatible with the selected classification algorithms. In this paper, we propose a new gene selection method (SVM-t) based on the use of t-statistics embedded in support vector machine. We compared its performance to two similar SVM-based methods: SVM recursive feature elimination (SVMRFE) and recursive support vector machine (RSVM). The three methods were compared based on extensive simulation experiments and analyses of two published microarray datasets. In the simulation experiments, we found that the proposed method is more robust in selecting informative genes than SVMRFE and RSVM and capable of attaining good classification performance when the variations of informative and noninformative genes are different. In the analysis of two microarray datasets, the proposed method yields better performance in identifying fewer genes with good prediction accuracy, compared to SVMRFE and RSVM.
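
    For context, the kind of univariate ranking criterion that the abstract contrasts with SVM-t can be sketched with a plain two-sample t-statistic per gene. This is not the SVM-t method, whose details are not given here; the use of Welch's unequal-variance form, the function names, and the toy values are all assumptions for illustration.

```python
from statistics import mean, variance

def welch_t(a, b):
    """Two-sample t-statistic with unequal variances (Welch form)."""
    standard_error = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / standard_error

def rank_genes(expr_a, expr_b):
    """Order genes by decreasing |t| between two sample classes.

    expr_a, expr_b: dicts mapping gene name -> list of expression values
    for class A and class B samples respectively (same keys in both).
    """
    scores = {g: abs(welch_t(expr_a[g], expr_b[g])) for g in expr_a}
    return sorted(scores, key=scores.get, reverse=True)

# Toy values (hypothetical): g1 separates the classes, g2 does not.
class_a = {"g1": [5.0, 5.2, 4.8], "g2": [2.0, 3.0, 1.0]}
class_b = {"g1": [1.0, 1.1, 0.9], "g2": [2.1, 2.9, 1.2]}

print(rank_genes(class_a, class_b))
```

    A threshold on such a ranking is the arbitrary choice the abstract criticizes; note also that a gene with zero variance in both classes would make this sketch divide by zero.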
