Johnstone, Daniel M.; Riveros, Carlos; Heidari, Moones; Graham, Ross M.; Trinder, Debbie; Berretta, Regina; Olynyk, John K.; Scott, Rodney J.; Moscato, Pablo; Milward, Elizabeth A.
2013-01-01
While Illumina microarrays can be used successfully for detecting small gene expression changes due to their high degree of technical replicability, there is little information on how different normalization and differential expression analysis strategies affect outcomes. To evaluate this, we assessed concordance across gene lists generated by applying different combinations of normalization strategy and analytical approach to two Illumina datasets with modest expression changes. In addition to using traditional statistical approaches, we also tested an approach based on combinatorial optimization. We found that the choice of both normalization strategy and analytical approach considerably affected outcomes, in some cases leading to substantial differences in gene lists and subsequent pathway analysis results. Our findings suggest that important biological phenomena may be overlooked when there is a routine practice of using only one approach to investigate all microarray datasets. Analytical artefacts of this kind are likely to be especially relevant for datasets involving small fold changes, where inherent technical variation—if not adequately minimized by effective normalization—may overshadow true biological variation. This report provides some basic guidelines for optimizing outcomes when working with Illumina datasets involving small expression changes. PMID:27605185
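One simple way to quantify concordance between gene lists is the Jaccard index of each pair of lists. The abstract does not state which concordance measure was used, so the choice of Jaccard, the function names, and the gene identifiers below are all illustrative assumptions.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index of two gene lists: shared genes / all genes (1.0 if both empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def pairwise_concordance(gene_lists):
    """Return {(name1, name2): jaccard} for every pair of named gene lists."""
    return {
        (n1, n2): jaccard(gene_lists[n1], gene_lists[n2])
        for n1, n2 in combinations(sorted(gene_lists), 2)
    }

# Hypothetical gene lists produced by three normalization/analysis combinations.
lists = {
    "quantile+ttest": ["g1", "g2", "g3", "g4"],
    "loess+ttest": ["g1", "g2", "g5"],
    "quantile+limma": ["g1", "g2", "g3", "g4"],
}
for pair, score in pairwise_concordance(lists).items():
    print(pair, round(score, 2))
```

Low pairwise scores would flag combinations whose gene lists, and therefore downstream pathway results, diverge.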
Moon, Myungjin; Nakai, Kenta
2018-04-01
Currently, cancer biomarker discovery is one of the important research topics worldwide. In particular, detecting significant genes related to cancer is an important task for early diagnosis and treatment of cancer. Conventional studies mostly focus on genes that are differentially expressed in different states of cancer; however, noise in gene expression datasets and insufficient information in limited datasets impede precise analysis of novel candidate biomarkers. In this study, we propose an integrative analysis of gene expression and DNA methylation using normalization and unsupervised feature extractions to identify candidate biomarkers of cancer using renal cell carcinoma RNA-seq datasets. Gene expression and DNA methylation datasets are normalized by Box-Cox transformation and integrated into a one-dimensional dataset that retains the major characteristics of the original datasets by unsupervised feature extraction methods, and differentially expressed genes are selected from the integrated dataset. Use of the integrated dataset demonstrated improved performance as compared with conventional approaches that utilize gene expression or DNA methylation datasets alone. Validation based on the literature showed that a considerable number of top-ranked genes from the integrated dataset have known relationships with cancer, implying that novel candidate biomarkers can also be acquired from the proposed analysis method. Furthermore, we expect that the proposed method can be expanded for applications involving various types of multi-omics datasets.
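The normalization and integration steps described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes a fixed Box-Cox lambda (rather than one estimated from the data), uses the first principal component of the two standardized variables as the "unsupervised feature extraction" step, and runs on made-up per-gene values.

```python
import math

def box_cox(values, lam):
    """Box-Cox transform of strictly positive values for a fixed lambda."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]

def standardize(values):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if sd == 0:
        raise ValueError("cannot standardize a constant vector")
    return [(v - mean) / sd for v in values]

def first_pc_of_pair(x, y):
    """Scores on the first principal component of two standardized variables.

    For standardized variables with correlation r the covariance matrix is
    [[1, r], [r, 1]]; its leading eigenvector is (1, 1)/sqrt(2) when r >= 0
    and (1, -1)/sqrt(2) when r < 0.
    """
    zx, zy = standardize(x), standardize(y)
    r = sum(a * b for a, b in zip(zx, zy)) / len(zx)
    sign = 1.0 if r >= 0 else -1.0
    return [(a + sign * b) / math.sqrt(2) for a, b in zip(zx, zy)]

# Hypothetical per-gene summaries, one value per gene in each data type.
expression = [0.5, 1.0, 2.0, 4.0, 8.0]
methylation = [1.8, 1.2, 1.0, 0.7, 0.4]

integrated = first_pc_of_pair(box_cox(expression, 0), box_cox(methylation, 0.5))
# Rank genes by the magnitude of their integrated score.
ranking = sorted(range(len(integrated)), key=lambda i: abs(integrated[i]), reverse=True)
print(ranking)
```

Each gene ends up with a single integrated score, which is the one-dimensional dataset from which candidate genes would then be selected.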
Xu, Huayong; Yu, Hui; Tu, Kang; Shi, Qianqian; Wei, Chaochun; Li, Yuan-Yuan; Li, Yi-Xue
2013-01-01
Methodologies for building combinatorial gene regulatory networks that involve both transcription factors (TFs) and microRNAs (miRNAs) are developing rapidly. A few tools are available for this task, but most are not easy to use and are not accessible online. A web server is especially needed so that users can upload experimental expression datasets and build combinatorial regulatory networks corresponding to their particular contexts. In this work, we compiled putative TF-gene, miRNA-gene and TF-miRNA regulatory relationships from forward-engineering pipelines and curated them as built-in data libraries. We streamlined the R code of our two separate forward-and-reverse engineering algorithms for combinatorial gene regulatory network construction and formalized them as two major functional modules. As a result, we released cGRNB (combinatorial Gene Regulatory Networks Builder), a web server for constructing combinatorial gene regulatory networks through integrated engineering of seed-matching sequence information and gene expression datasets. cGRNB provides two major network-building modules, one for MPGE (miRNA-perturbed gene expression) datasets and the other for parallel miRNA/mRNA expression datasets. The output of the first module is a miRNA-centered two-layer combinatorial regulatory cascade, and the output of the second module is a comprehensive genome-wide network involving all three types of combinatorial regulation (TF-gene, TF-miRNA, and miRNA-gene). Because parallel miRNA/mRNA expression datasets are accumulating rapidly with the advance of next-generation sequencing techniques, cGRNB will be a very useful tool for researchers building combinatorial gene regulatory networks from expression datasets.
The cGRNB web-server is free and available online at http://www.scbit.org/cgrnb.
Utility and Limitations of Using Gene Expression Data to Identify Functional Associations
Peng, Cheng; Shiu, Shin-Han
2016-01-01
Gene co-expression has been widely used to hypothesize gene function through guilt-by-association. However, it is not clear to what degree co-expression is informative, whether it can be applied to genes involved in different biological processes, and how the type of dataset impacts inferences about gene functions. Here our goal is to assess the utility and limitations of using co-expression as a criterion to recover functional associations between genes, using Arabidopsis thaliana as an example. By determining the percentage of gene pairs in a metabolic pathway with significant expression correlation, we found that many genes in the same pathway do not have similar transcript profiles, and that the choice of dataset, annotation quality, gene function, expression similarity measure, and clustering approach significantly impact the ability to recover functional associations between genes. Some datasets are more informative than others in capturing coordinated expression profiles, and larger datasets are not always better. In addition, to recover the maximum number of known pathways and identify candidate genes with similar functions, it is important to explore rather exhaustively multiple dataset combinations, similarity measures, clustering algorithms and parameters. Finally, we validated the biological relevance of co-expression cluster memberships with an independent phenomics dataset and found that genes that consistently cluster with leucine degradation genes tend to have similar leucine levels in mutants. This study provides a framework for obtaining gene functional associations by maximizing the information that can be obtained from gene expression datasets. PMID:27935950
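The per-pathway statistic described above (the percentage of gene pairs with correlated expression) can be sketched as below. As a simplifying assumption, a fixed Pearson correlation cutoff stands in for a formal significance test; the cutoff value, function names, and expression profiles are hypothetical.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def percent_coexpressed_pairs(profiles, pathway_genes, r_cutoff=0.8):
    """Percentage of gene pairs in a pathway whose correlation is at least r_cutoff."""
    pairs = list(combinations(sorted(pathway_genes), 2))
    if not pairs:
        return 0.0
    hits = sum(
        1 for g1, g2 in pairs if pearson(profiles[g1], profiles[g2]) >= r_cutoff
    )
    return 100.0 * hits / len(pairs)

# Hypothetical expression profiles across four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 1.0, 3.5, 1.2],
}
print(percent_coexpressed_pairs(profiles, ["geneA", "geneB", "geneC"]))
```

Repeating this calculation over different datasets and similarity measures is one way to compare how informative each is for a given pathway.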
Mukwaya, Anthony; Lindvall, Jessica M; Xeroudaki, Maria; Peebo, Beatrice; Ali, Zaheer; Lennikov, Anton; Jensen, Lasse Dahl Ejby; Lagali, Neil
2016-11-22
In angiogenesis with concurrent inflammation, many pathways are activated, some linked to VEGF and others largely VEGF-independent. Pathways involving inflammatory mediators, chemokines, and micro-RNAs may play important roles in maintaining a pro-angiogenic environment or mediating angiogenic regression. Here, we describe a gene expression dataset to facilitate exploration of pro-angiogenic, pro-inflammatory, and remodelling/normalization-associated genes during both an active capillary sprouting phase, and in the restoration of an avascular phenotype. The dataset was generated by microarray analysis of the whole transcriptome in a rat model of suture-induced inflammatory corneal neovascularisation. Regions of active capillary sprout growth or regression in the cornea were harvested and total RNA extracted from four biological replicates per group. High quality RNA was obtained for gene expression analysis using microarrays. Fold change of selected genes was validated by qPCR, and protein expression was evaluated by immunohistochemistry. We provide a gene expression dataset that may be re-used to investigate corneal neovascularisation, and may also have implications in other contexts of inflammation-mediated angiogenesis.
Cao, Huojun; Amendt, Brad A
2016-11-01
Developmental dental anomalies are common forms of congenital defects, but their molecular mechanisms are poorly understood. Systematic approaches such as clustering genes based on similar expression patterns could identify novel genes involved in dental anomalies and provide a framework for understanding the molecular regulatory mechanisms of these genes during tooth development (odontogenesis). A Python package (pySAPC) implementing a sparse affinity propagation clustering algorithm for large datasets was developed. Whole-genome pairwise expression pattern similarity was calculated from 45 microarrays covering several stages of odontogenesis. pySAPC identified 743 gene clusters based on expression pattern similarity during mouse tooth development. Three clusters are significantly enriched for genes associated with dental anomalies (FDR < 0.1), and these three clusters have distinct expression patterns during odontogenesis. Clustering genes based on similar expression profiles recovered several known regulatory relationships for genes involved in odontogenesis, as well as many novel genes that may be involved in the same genetic pathways as genes that have already been shown to contribute to dental defects. By using a sparse similarity matrix, pySAPC uses much less memory and CPU time than the original affinity propagation program, which uses a full similarity matrix. This Python package will be useful for many applications where datasets are too large to use a full similarity matrix. This article is part of a Special Issue entitled "System Genetics" Guest Editor: Dr. Yudong Cai and Dr. Tao Huang. Copyright © 2016. Published by Elsevier B.V.
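The memory saving from a sparse similarity matrix can be illustrated with a small sketch that stores only gene pairs whose correlation passes a cutoff. This is not pySAPC's API or its sparsification rule; the cutoff, the dictionary-based storage, and the profiles are assumptions made for illustration. Note that the sketch still computes every pairwise correlation and only the storage is sparse.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def sparse_similarity(profiles, min_r=0.9):
    """Return {(gene_i, gene_j): r} for pairs with r >= min_r, where gene_i < gene_j."""
    genes = sorted(profiles)
    sparse = {}
    for i, g1 in enumerate(genes):
        for g2 in genes[i + 1:]:
            r = pearson(profiles[g1], profiles[g2])
            if r >= min_r:
                sparse[(g1, g2)] = r
    return sparse

# Hypothetical profiles: g1/g2 rise together, g3/g4 fall together.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [8.0, 6.1, 4.0, 2.0],
}
kept = sparse_similarity(profiles)
n = len(profiles)
print(f"stored {len(kept)} of {n * (n - 1) // 2} possible pairs")
```

A clustering algorithm that only passes messages along the stored pairs then scales with the number of retained similarities rather than with the square of the number of genes.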
Shanley, Thomas P; Cvijanovich, Natalie; Lin, Richard; Allen, Geoffrey L; Thomas, Neal J; Doctor, Allan; Kalyanaraman, Meena; Tofil, Nancy M; Penfil, Scott; Monaco, Marie; Odoms, Kelli; Barnes, Michael; Sakthivel, Bhuvaneswari; Aronow, Bruce J; Wong, Hector R
2007-01-01
We have conducted longitudinal studies focused on the expression profiles of signaling pathways and gene networks in children with septic shock. Genome-level expression profiles were generated from whole blood-derived RNA of children with septic shock (n = 30) corresponding to day one and day three of septic shock, respectively. Based on sequential statistical and expression filters, day one and day three of septic shock were characterized by differential regulation of 2,142 and 2,504 gene probes, respectively, relative to controls (n = 15). Venn analysis demonstrated 239 unique genes in the day one dataset, 598 unique genes in the day three dataset, and 1,906 genes common to both datasets. Functional analyses demonstrated time-dependent, differential regulation of genes involved in multiple signaling pathways and gene networks primarily related to immunity and inflammation. Notably, multiple and distinct gene networks involving T cell- and MHC antigen-related biology were persistently downregulated on both day one and day three. Further analyses demonstrated large scale, persistent downregulation of genes corresponding to functional annotations related to zinc homeostasis. These data represent the largest reported cohort of patients with septic shock subjected to longitudinal genome-level expression profiling. The data further advance our genome-level understanding of pediatric septic shock and support novel hypotheses. PMID:17932561
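The Venn analysis above amounts to set differences and an intersection over the two lists of regulated genes. A minimal sketch, using hypothetical probe identifiers rather than the study's data:

```python
def venn_counts(day1, day3):
    """Count genes unique to each time point and genes common to both."""
    day1, day3 = set(day1), set(day3)
    return {
        "day1_only": len(day1 - day3),
        "day3_only": len(day3 - day1),
        "common": len(day1 & day3),
    }

# Hypothetical identifiers for genes regulated at each time point.
day1_genes = ["p1", "p2", "p3", "p4"]
day3_genes = ["p3", "p4", "p5", "p6", "p7"]
print(venn_counts(day1_genes, day3_genes))
```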
Shen, Yi; Wang, Zhanwei; Loo, Lenora W M; Ni, Yan; Jia, Wei; Fei, Peiwen; Risch, Harvey A; Katsaros, Dionyssios; Yu, Herbert
2015-12-01
Long non-coding RNAs (lncRNAs) are a class of newly recognized DNA transcripts that have diverse biological activities. Dysregulation of lncRNAs may be involved in many pathogenic processes including cancer. Recently, we found an intergenic lncRNA, LINC00472, whose expression was correlated with breast cancer progression and patient survival. Our findings were consistent across multiple clinical datasets and supported by results from in vitro experiments. To evaluate further the role of LINC00472 in breast cancer, we used various online databases to investigate possible mechanisms that might affect LINC00472 expression in breast cancer. We also analyzed associations of LINC00472 with estrogen receptor, tumor grade, and molecular subtypes in additional online datasets generated by microarray platforms different from the one we investigated previously. We found that LINC00472 expression in breast cancer was more likely regulated by promoter methylation than by alteration of gene copy number. Analysis of additional datasets confirmed our previous findings that high expression of LINC00472 is associated with ER-positive and low-grade tumors and favorable molecular subtypes. Finally, in nine datasets, we examined the association of LINC00472 expression with disease-free survival in patients with grade 2 tumors. Meta-analysis of the datasets showed that LINC00472 expression in breast tumors predicted the recurrence of breast cancer in patients with grade 2 tumors. In summary, our analyses confirm that LINC00472 is functionally a tumor suppressor, and that assessing its expression in breast tumors may have clinical implications in breast cancer management. PMID:26564482
Liu, Li-Zhi; Wu, Fang-Xiang; Zhang, Wen-Jun
2014-01-01
As an abstract mapping of gene regulation in the cell, a gene regulatory network is important to both biological research and practical applications. The reverse engineering of gene regulatory networks from microarray gene expression data is a challenging research problem in systems biology. With the development of biological technologies, multiple time-course gene expression datasets might be collected for a specific gene network under different circumstances. The inference of a gene regulatory network can be improved by integrating these multiple datasets. It is also known that gene expression data may be contaminated with large errors or outliers, which may affect the inference results. A novel method, Huber group LASSO, is proposed to infer the same underlying network topology from multiple time-course gene expression datasets while remaining robust to large errors or outliers. To solve the optimization problem involved in the proposed method, an efficient algorithm that combines the ideas of auxiliary function minimization and block descent is developed. A stability selection method is adapted to our method to find a network topology consisting of edges with scores. The proposed method is applied to both simulated and real experimental datasets. The results show that Huber group LASSO outperforms group LASSO in terms of both the area under the receiver operating characteristic curve and the area under the precision-recall curve. The convergence analysis shows theoretically that the sequence generated by the algorithm converges to the optimal solution of the problem. The simulation and real data examples demonstrate the effectiveness of Huber group LASSO in integrating multiple time-course gene expression datasets and improving the resistance to large errors or outliers.
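The two ingredients named in the method can be written down compactly: the Huber loss, which is quadratic for small residuals and linear for large ones, and the group LASSO penalty, which sums the Euclidean norms of coefficient groups. The sketch below shows only these standard definitions, not the authors' optimization algorithm; treating each group as one edge's coefficients across datasets is an assumption based on the abstract, and the parameter values are arbitrary.

```python
import math

def huber(residual, delta=1.0):
    """Huber loss: 0.5*r^2 if |r| <= delta, else delta*(|r| - 0.5*delta)."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * residual * residual
    return delta * (a - 0.5 * delta)

def group_lasso_penalty(groups, lam=1.0):
    """lam times the sum of Euclidean norms of the coefficient groups."""
    return lam * sum(math.sqrt(sum(c * c for c in g)) for g in groups)

# A large residual is penalized far less by Huber than by squared error.
for r in (0.5, 3.0, 10.0):
    print(r, 0.5 * r * r, huber(r))

# Assumed grouping: one group per candidate edge, one coefficient per dataset.
edge_groups = [[3.0, 4.0], [0.0, 0.0]]
print(group_lasso_penalty(edge_groups))
```

Because a whole group's norm is penalized, an edge tends to be kept or dropped in all datasets together, which matches the goal of a shared topology.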
Matott, L Shawn; Jiang, Zhengzheng; Rabideau, Alan J; Allen-King, Richelle M
2015-01-01
Numerous isotherm expressions have been developed for describing sorption of hydrophobic organic compounds (HOCs), including "dual-mode" approaches that combine nonlinear behavior with a linear partitioning component. Choosing among these alternative expressions for describing a given dataset is an important task that can significantly influence subsequent transport modeling and/or mechanistic interpretation. In this study, a series of numerical experiments were undertaken to identify "best-in-class" isotherms by refitting 10 alternative models to a suite of 13 previously published literature datasets. The corrected Akaike Information Criterion (AICc) was used for ranking these alternative fits and distinguishing between plausible and implausible isotherms for each dataset. The occurrence of multiple plausible isotherms was inversely correlated with dataset "richness", such that datasets with fewer observations and/or a narrow range of aqueous concentrations resulted in a greater number of plausible isotherms. Overall, only the Polanyi-partition dual-mode isotherm was classified as "plausible" across all 13 of the considered datasets, indicating substantial statistical support consistent with current advances in sorption theory. However, these findings are predicated on the use of the AICc measure as an unbiased ranking metric and the adoption of a subjective, but defensible, threshold for separating plausible and implausible isotherms. Copyright © 2015 Elsevier B.V. All rights reserved.
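For least-squares fits, the corrected Akaike Information Criterion is commonly computed from the residual sum of squares as AIC = n ln(RSS/n) + 2k, with AICc = AIC + 2k(k+1)/(n-k-1). The sketch below ranks candidate models by their AICc difference from the best model. The least-squares form, the parameter-counting convention for k, and the RSS values are assumptions for illustration and are not taken from the study.

```python
import math

def aicc(rss, n, k):
    """AICc for a least-squares fit with n observations and k estimated parameters."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    aic = n * math.log(rss / n) + 2 * k
    return aic + (2 * k * (k + 1)) / (n - k - 1)

def delta_aicc(fits, n):
    """Map {model: (rss, k)} to {model: AICc minus the smallest AICc}."""
    scores = {model: aicc(rss, n, k) for model, (rss, k) in fits.items()}
    best = min(scores.values())
    return {model: score - best for model, score in scores.items()}

# Hypothetical fits of three isotherm models to one dataset of 20 observations.
fits = {"linear": (4.0, 2), "freundlich": (1.0, 3), "dual_mode": (0.9, 5)}
deltas = delta_aicc(fits, n=20)
for model, delta in sorted(deltas.items(), key=lambda t: t[1]):
    print(model, round(delta, 2))
```

Models whose difference from the best score falls under a chosen cutoff would be treated as plausible; the cutoff itself is the subjective threshold the abstract refers to.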
Nidheesh, N; Abdul Nazeer, K A; Ameer, P M
2017-12-01
Clustering algorithms with steps involving randomness usually give different results on different executions for the same dataset. This non-deterministic nature of algorithms such as the K-Means clustering algorithm limits their applicability in areas such as cancer subtype prediction using gene expression data. It is hard to sensibly compare the results of such algorithms with those of other algorithms. The non-deterministic nature of K-Means is due to its random selection of data points as initial centroids. We propose an improved, density-based version of K-Means, which involves a novel and systematic method for selecting initial centroids. The key idea of the algorithm is to select, as the initial centroids, data points that belong to dense regions and are adequately separated in feature space. We compared the proposed algorithm with eleven widely used single clustering algorithms and a prominent ensemble clustering algorithm used for cancer data classification, based on their performance on ten cancer gene expression datasets. The proposed algorithm showed better overall performance than the others. There is a pressing need in the biomedical domain for simple, easy-to-use and more accurate machine learning tools for cancer subtype prediction. The proposed algorithm is simple, easy to use and gives stable results. Moreover, it provides comparatively better predictions of cancer subtypes from gene expression data. Copyright © 2017 Elsevier Ltd. All rights reserved.
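The key idea, choosing initial centroids from dense regions that are well separated, can be sketched deterministically as follows. This is not the authors' algorithm: the neighbor-count density estimate, the `radius` and `min_separation` parameters, and the sample points are all assumptions made for illustration.

```python
import math

def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def density_based_centroids(points, k, radius, min_separation):
    """Pick up to k initial centroids: densest points first, skipping any
    candidate closer than min_separation to an already chosen centroid."""
    densities = [
        sum(1 for j, q in enumerate(points) if j != i and dist(p, q) <= radius)
        for i, p in enumerate(points)
    ]
    # Sort by descending density; ties broken by index, so the result is deterministic.
    order = sorted(range(len(points)), key=lambda i: (-densities[i], i))
    chosen = []
    for i in order:
        if all(dist(points[i], points[c]) >= min_separation for c in chosen):
            chosen.append(i)
            if len(chosen) == k:
                break
    return [points[i] for i in chosen]

# Two hypothetical groups of samples in a 2-D feature space.
points = [(0, 0), (0, 1), (1, 0), (0.5, 0.5), (10, 10), (10, 11), (11, 10)]
print(density_based_centroids(points, k=2, radius=1.5, min_separation=5.0))
```

Since no random numbers are involved, repeated runs on the same data start K-Means from the same centroids and therefore give the same clustering.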
2010-01-01
Background: The development of DNA microarrays has facilitated the generation of hundreds of thousands of transcriptomic datasets. The use of a common reference microarray design allows existing transcriptomic data to be readily compared and re-analysed in the light of new data, and the combination of this design with large datasets is ideal for 'systems'-level analyses. One issue is that these datasets are typically collected over many years and may be heterogeneous in nature, containing different microarray file formats and gene array layouts, dye-swaps, and showing varying scales of log2-ratios of expression between microarrays. Excellent software exists for the normalisation and analysis of microarray data but many data have yet to be analysed as existing methods struggle with heterogeneous datasets; options include normalising microarrays on an individual or experimental group basis. Our solution was to develop the Batch Anti-Banana Algorithm in R (BABAR) algorithm and software package which uses cyclic loess to normalise across the complete dataset. We have already used BABAR to analyse the function of Salmonella genes involved in the process of infection of mammalian cells. Results: The only input required by BABAR is unprocessed GenePix or BlueFuse microarray data files. BABAR provides a combination of 'within' and 'between' microarray normalisation steps and diagnostic boxplots. When applied to a real heterogeneous dataset, BABAR normalised the dataset to produce a comparable scaling between the microarrays, with the microarray data in excellent agreement with RT-PCR analysis. When applied to a real non-heterogeneous dataset and a simulated dataset, BABAR's performance in identifying differentially expressed genes showed some benefits over standard techniques. Conclusions: BABAR is an easy-to-use software tool, simplifying the simultaneous normalisation of heterogeneous two-colour common reference design cDNA microarray-based transcriptomic datasets.
We show BABAR transforms real and simulated datasets to allow for the correct interpretation of these data, and is the ideal tool to facilitate the identification of differentially expressed genes or network inference analysis from transcriptomic datasets. PMID:20128918
Novel Harmonic Regularization Approach for Variable Selection in Cox's Proportional Hazards Model
Chu, Ge-Jin; Liang, Yong; Wang, Jia-Xuan
2014-01-01
Variable selection is an important issue in regression and a number of variable selection methods have been proposed involving nonconvex penalty functions. In this paper, we investigate a novel harmonic regularization method, which can approximate nonconvex Lq (1/2 < q < 1) regularizations, to select key risk factors in Cox's proportional hazards model using microarray gene expression data. The harmonic regularization method can be efficiently solved using our proposed direct path seeking approach, which can produce solutions that closely approximate those for the convex loss function and the nonconvex regularization. Simulation results based on artificial datasets and four real microarray gene expression datasets, including the diffuse large B-cell lymphoma (DLBCL), lung cancer, and AML datasets, show that the harmonic regularization method can be more accurate for variable selection than existing Lasso series methods. PMID:25506389
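To see why nonconvex Lq penalties with q < 1 favor sparser solutions than the Lasso (q = 1), it helps to evaluate the penalty on two coefficient vectors with the same L1 norm. The sketch below illustrates only the Lq penalty that the harmonic regularization is said to approximate, not the harmonic regularization itself; the coefficient vectors are hypothetical.

```python
def lq_penalty(coefs, q, lam=1.0):
    """lam times the sum of |beta_j|^q, for 0 < q <= 1."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    return lam * sum(abs(c) ** q for c in coefs)

sparse = [1.0, 0.0, 0.0, 0.0]      # one active coefficient
dense = [0.25, 0.25, 0.25, 0.25]   # same L1 norm spread over four coefficients

for q in (1.0, 0.75):
    print(q, lq_penalty(sparse, q), lq_penalty(dense, q))
```

With q = 1 the two vectors are penalized equally, while with q = 0.75 the dense vector costs more, so the penalty pushes toward selecting fewer variables.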
Kameshwar, Ayyappa Kumar Sista; Qin, Wensheng
2017-10-01
Lignin, among the most complex and abundant biopolymers on the earth's surface, derives its stability from intricate polyphenolic units and non-phenolic bonds, making it difficult to depolymerize or separate from other units of biomass. An exceptional lignin-degrading ability and the availability of an annotated genome make Phanerochaete chrysosporium ideal for studying lignin-degrading mechanisms. Decoding and understanding the molecular mechanisms underlying lignin degradation will significantly aid the growing biofuel industries and lead to the production of commercially vital platform chemicals. In this study, we performed a large-scale metadata analysis to understand the common gene expression patterns of P. chrysosporium during lignin degradation. Gene expression datasets were retrieved from the NCBI GEO database and analyzed using GEO2R and Bioconductor packages. Statistically significant genes commonly expressed among different datasets were further considered to understand their involvement in lignin degradation and detoxification mechanisms. We observed three sets of enzymes commonly expressed under ligninolytic conditions, which were classified into primary ligninolytic, aromatic compound-degrading and other necessary enzymes. Similarly, we observed three sets of genes coding for detoxification and stress-responsive, phase I and phase II metabolic enzymes. The results obtained in this study indicate the coordinated action of enzymes involved in lignin depolymerization and detoxification-stress responses under ligninolytic conditions. We developed a tentative network of genes and enzymes involved in lignin degradation and detoxification mechanisms in P. chrysosporium based on the literature and the results obtained in this study. However, the ambiguity arising from the higher expression of several uncharacterized proteins necessitates further proteomic studies in P. chrysosporium.
He, Hao; Zhang, Lei; Li, Jian; Wang, Yu-Ping; Zhang, Ji-Gang; Shen, Jie; Guo, Yan-Fang
2014-01-01
Context: To date, few systems genetics studies in the bone field have been performed. We designed our study from a systems-level perspective by integrating genome-wide association studies (GWASs), the human protein-protein interaction (PPI) network, and gene expression to identify gene modules contributing to osteoporosis risk. Methods: First we searched for modules significantly enriched with bone mineral density (BMD)-associated genes in the human PPI network by applying a dense module search algorithm to 2 large meta-analysis GWAS datasets. One included 7 individual GWAS samples (Meta7). The other was from the Genetic Factors for Osteoporosis Consortium (GEFOS2). One was assigned as a discovery dataset and the other as an evaluation dataset, and vice versa. Results: In total, 42 and 129 significant modules were identified in both the Meta7 and GEFOS2 datasets for femoral neck and spine BMD, respectively. There were 3340 modules identified for hip BMD only in Meta7. As candidate modules, they were assessed for biological relevance to BMD by gene set enrichment analysis in 2 expression profiles generated from circulating monocytes in subjects with low versus high BMD values. Interestingly, 2 modules were significantly enriched in monocytes from the low BMD group in both gene expression datasets (nominal P value <.05). Together, the 2 modules contained 16 nonredundant genes. Functional enrichment analysis revealed that both modules were enriched for genes involved in Wnt receptor signaling and osteoblast differentiation. Conclusion: We highlighted 2 modules and novel genes playing important roles in the regulation of bone mass, providing important clues for therapeutic approaches for osteoporosis. PMID:25119315
A biclustering algorithm for extracting bit-patterns from binary datasets.
Rodriguez-Baena, Domingo S; Perez-Pulido, Antonio J; Aguilar-Ruiz, Jesus S
2011-10-01
Binary datasets represent a compact and simple way to store data about the relationships between a group of objects and their possible properties. In the last few years, different biclustering algorithms have been specially developed to be applied to binary datasets. Several approaches based on matrix factorization, suffix trees or divide-and-conquer techniques have been proposed to extract useful biclusters from binary data, and these approaches provide information about the distribution of patterns and intrinsic correlations. A novel approach to extracting biclusters from binary datasets, BiBit, is introduced here. The results obtained from different experiments with synthetic data reveal the excellent performance and the robustness of BiBit to density and size of input data. Also, BiBit is applied to a central nervous system embryonic tumor gene expression dataset to test the quality of the results. A novel gene expression preprocessing methodology, based on expression level layers, and the selective search performed by BiBit, based on a very fast bit-pattern processing technique, provide very satisfactory results in quality and computational cost. The power of biclustering in finding genes involved simultaneously in different cancer processes is also shown. Finally, a comparison with Bimax, one of the most cited binary biclustering algorithms, shows that BiBit is faster while providing essentially the same results. Availability: The source and binary codes, the datasets used in the experiments and the results can be found at http://www.upo.es/eps/bigs/BiBit.html. Contact: dsrodbae@upo.es. Supplementary data are available at Bioinformatics online.
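The bit-pattern idea can be sketched by packing each binary row into an integer, taking the bitwise AND of every pair of rows as a candidate pattern, and collecting all rows that contain that pattern. This is a simplified reading of the approach and not the published BiBit implementation; the function names, the minimum-size parameters, and the matrix are assumptions.

```python
from itertools import combinations

def encode_rows(matrix):
    """Pack each 0/1 row into an int (column 0 is the most significant bit)."""
    return [int("".join(str(b) for b in row), 2) for row in matrix]

def bit_pattern_biclusters(matrix, min_rows=2, min_cols=2):
    """Return (row_indices, column_indices) pairs of all-ones submatrices
    seeded by the bitwise AND of each pair of rows."""
    if not matrix:
        return []
    rows = encode_rows(matrix)
    n_cols = len(matrix[0])
    seen = set()
    result = []
    for i, j in combinations(range(len(rows)), 2):
        pattern = rows[i] & rows[j]
        # Skip patterns already processed or with too few columns set.
        if pattern in seen or bin(pattern).count("1") < min_cols:
            continue
        seen.add(pattern)
        # A row belongs to the bicluster if it has a 1 wherever the pattern does.
        members = [r for r, bits in enumerate(rows) if (bits & pattern) == pattern]
        if len(members) >= min_rows:
            cols = [c for c in range(n_cols) if (pattern >> (n_cols - 1 - c)) & 1]
            result.append((members, cols))
    return result

# Hypothetical binarized expression matrix (rows: genes, columns: samples).
matrix = [
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [1, 1, 0, 1],
]
for members, cols in bit_pattern_biclusters(matrix):
    print("rows", members, "cols", cols)
```

Working on whole machine words with AND comparisons, rather than cell by cell, is what makes this style of search fast on large binary matrices.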
Ochsner, Scott A.; Tsimelzon, Anna; Dong, Jianrong; Coarfa, Cristian
2016-01-01
The pregnane X receptor (PXR) (PXR/NR1I3) and constitutive androstane receptor (CAR) (CAR/NR1I2) members of the nuclear receptor (NR) superfamily of ligand-regulated transcription factors are well-characterized mediators of xenobiotic and endocrine-disrupting chemical signaling. The Nuclear Receptor Signaling Atlas maintains a growing library of transcriptomic datasets involving perturbations of NR signaling pathways, many of which involve perturbations relevant to PXR and CAR xenobiotic signaling. Here, we generated a reference transcriptome based on the frequency of differential expression of genes across 159 experiments compiled from 22 datasets involving perturbations of CAR and PXR signaling pathways. In addition to the anticipated overrepresentation in the reference transcriptome of genes encoding components of the xenobiotic stress response, the ranking of genes involved in carbohydrate metabolism and gonadotropin action sheds mechanistic light on the suspected role of xenobiotics in metabolic syndrome and reproductive disorders. Gene Set Enrichment Analysis showed that although acetaminophen, chlorpromazine, and phenobarbital impacted many similar gene sets, differences in direction of regulation were evident in a variety of processes. Strikingly, gene sets representing genes linked to Parkinson's, Huntington's, and Alzheimer's diseases were enriched in all 3 transcriptomes. The reference xenobiotic transcriptome will be supplemented with additional future datasets to provide the community with a continually updated reference transcriptomic dataset for CAR- and PXR-mediated xenobiotic signaling. Our study demonstrates how aggregating and annotating transcriptomic datasets, and making them available for routine data mining, facilitates research into the mechanisms by which xenobiotics and endocrine-disrupting chemicals subvert conventional NR signaling modalities. PMID:27409825
Gurunathan, Rajalakshmi; Van Emden, Bernard; Panchanathan, Sethuraman; Kumar, Sudhir
2004-01-01
Background Modern developmental biology relies heavily on the analysis of embryonic gene expression patterns. Investigators manually inspect hundreds or thousands of expression patterns to identify those that are spatially similar and to ultimately infer potential gene interactions. However, the rapid accumulation of gene expression pattern data over the last two decades, facilitated by high-throughput techniques, has produced a need for the development of efficient approaches for direct comparison of images, rather than their textual descriptions, to identify spatially similar expression patterns. Results The effectiveness of the Binary Feature Vector (BFV) and Invariant Moment Vector (IMV) based digital representations of the gene expression patterns in finding biologically meaningful patterns was compared for a small (226 images) and a large (1819 images) dataset. For each dataset, an ordered list of images, with respect to a query image, was generated to identify overlapping and similar gene expression patterns, in a manner comparable to what a developmental biologist might do. The results showed that the BFV representation consistently outperforms the IMV representation in finding biologically meaningful matches when spatial overlap of the gene expression pattern and the genes involved are considered. Furthermore, we explored the value of conducting image-content based searches in a dataset where individual expression components (or domains) of multi-domain expression patterns were also included separately. We found that this technique improves performance of both IMV and BFV based searches. Conclusions We conclude that the BFV representation consistently produces a more extensive and better list of biologically useful patterns than the IMV representation. The high quality of results obtained scales well as the search database becomes larger, which encourages efforts to build automated image query and retrieval systems for spatial gene expression patterns. 
PMID:15603586
Nehme, A; Zibara, K; Cerutti, C; Bricca, G
2015-06-01
The implication of the renin-angiotensin-aldosterone system (RAAS) in atheroma development is well described. However, a complete view of the local RAAS in atheroma is still missing. In this study we aimed to reveal the organization of RAAS in atheroma at the transcriptomic level and identify the transcriptional regulators behind it. Extended RAAS (extRAAS) was defined as the set of 37 genes coding for classical and novel RAAS participants (Figure 1). Five microarray datasets containing overall 590 samples representing carotid and peripheral atheroma were downloaded from the GEO database. Correlation-based hierarchical clustering (R software) of extRAAS genes within each dataset allowed the identification of modules of co-expressed genes. Reproducible co-expression modules across datasets were then extracted. Transcription factors (TFs) having common binding sites (TFBSs) in the promoters of coordinated genes were identified using the Genomatix database tools and analyzed for their correlation with extRAAS genes in the microarray datasets. Expression data revealed the expressed extRAAS components and their relative abundance displaying the favored pathways in atheroma. Three co-expression modules with more than 80% reproducibility across datasets were extracted. Two of them (M1 and M2) contained genes coding for angiotensin metabolizing enzymes involved in different pathways: M1 included ACE, MME, RNPEP, and DPP3, in addition to 7 other genes; and M2 included CMA1, CTSG, and CPA3. The third module (M3) contained genes coding for receptors known to be implicated in atheroma (AGTR1, MR, GR, LNPEP, EGFR and GPER). M1 and M3 were negatively correlated in 3 of 5 datasets. We identified 19 TFs that have enriched TFBSs in the promoters of genes of M1, and two for M3, but none was found for M2. 
Among the extracted TFs, ELF1, MAX, and IRF5 showed significant positive correlations with peptidase-coding genes from M1 and negative correlations with receptor-coding genes from M3 (p < 0.05). The identified co-expression modules display the transcriptional organization of local extRAAS in human carotid atheroma. The identification of several TFs potentially associated with extRAAS genes may provide a frame for the discovery of atheroma-specific modulators of extRAAS activity. (Figure is included in full-text article.)
Abu-Jamous, Basel; Fa, Rui; Roberts, David J; Nandi, Asoke K
2015-06-04
Collective analysis of the growing number of gene expression datasets is required. The recently proposed binarisation of consensus partition matrices (Bi-CoPaM) method can combine clustering results from multiple datasets to identify the subsets of genes which are consistently co-expressed in all of the provided datasets in a tuneable manner. However, results validation and parameter setting are issues that complicate the design of such methods. Moreover, although it is a common practice to test methods by application to synthetic datasets, the mathematical models used to synthesise such datasets are usually based on approximations which may not always be sufficiently representative of real datasets. Here, we propose an unsupervised method for the unification of clustering results from multiple datasets using external specifications (UNCLES). This method has the ability to identify the subsets of genes consistently co-expressed in a subset of datasets while being poorly co-expressed in another subset of datasets, and to identify the subsets of genes consistently co-expressed in all given datasets. We also propose the M-N scatter plots validation technique and adopt it to set the parameters of UNCLES, such as the number of clusters, automatically. Additionally, we propose an approach for the synthesis of gene expression datasets using real data profiles in a way which combines the ground-truth-knowledge of synthetic data and the realistic expression values of real data, and therefore overcomes the problem of faithfulness of synthetic expression data modelling. By application to those datasets, we validate UNCLES while comparing it with other conventional clustering methods, and of particular relevance, biclustering methods. We further validate UNCLES by application to a set of 14 real genome-wide yeast datasets as it produces focused clusters that conform well to known biological facts. 
Furthermore, in-silico-based hypotheses regarding the function of a few previously unknown genes in those focused clusters are drawn. The UNCLES method, the M-N scatter plots technique, and the expression data synthesis approach will have wide application for the comprehensive analysis of genomic and other sources of multiple complex biological datasets. Moreover, the derived in-silico-based biological hypotheses represent subjects for future functional studies.
Structure and transcriptional regulation of the major intrinsic protein gene family in grapevine.
Wong, Darren Chern Jan; Zhang, Li; Merlin, Isabelle; Castellarin, Simone D; Gambetta, Gregory A
2018-04-11
The major intrinsic protein (MIP) family is a family of proteins, including aquaporins, which facilitate water and small molecule transport across plasma membranes. In plants, MIPs function in a huge variety of processes including water transport, growth, stress response, and fruit development. In this study, we characterize the structure and transcriptional regulation of the MIP family in grapevine, describing the putative genome duplication events leading to the family structure and characterizing the family's tissue and developmental specific expression patterns across numerous preexisting microarray and RNAseq datasets. Gene co-expression network (GCN) analyses were carried out across these datasets and the promoters of each family member were analyzed for cis-regulatory element structure in order to provide insight into their transcriptional regulation. A total of 29 Vitis vinifera MIP family members (excluding putative pseudogenes) were identified of which all but two were mapped onto Vitis vinifera chromosomes. In this study, segmental duplication events were identified for five plasma membrane intrinsic protein (PIP) and four tonoplast intrinsic protein (TIP) genes, contributing to the expansion of PIPs and TIPs in grapevine. Grapevine MIP family members have distinct tissue and developmental expression patterns and hierarchical clustering revealed two primary groups regardless of the datasets analyzed. Composite microarray and RNA-seq gene co-expression networks (GCNs) highlighted the relationships between MIP genes and functional categories involved in cell wall modification and transport, as well as with other MIPs revealing a strong co-regulation within the family itself. Some duplicated MIP family members have undergone sub-functionalization and exhibit distinct expression patterns and GCNs. Cis-regulatory element (CRE) analyses of the MIP promoters and their associated GCN members revealed enrichment for numerous CREs including AP2/ERFs and NACs. 
Combining phylogenetic analyses, gene expression profiling, gene co-expression network analyses, and cis-regulatory element enrichment, this study provides a comprehensive overview of the structure and transcriptional regulation of the grapevine MIP family. The study highlights the duplication and sub-functionalization of the family, its strong coordinated expression with genes involved in growth and transport, and the putative classes of TFs responsible for its regulation.
Dynamic Modularity of Host Protein Interaction Networks in Salmonella Typhi Infection
Dhal, Paltu Kumar; Barman, Ranjan Kumar; Saha, Sudipto; Das, Santasabuj
2014-01-01
Background Salmonella Typhi is a human-restricted pathogen, which causes typhoid fever and remains a global health problem in the developing countries. Although previously reported host expression datasets had identified putative biomarkers and therapeutic targets of typhoid fever, the underlying molecular mechanism of pathogenesis remains incompletely understood. Methods We used five gene expression datasets of human peripheral blood from patients suffering from S. Typhi or other bacteremic infections or non-infectious disease like leukemia. The expression datasets were merged into human protein interaction network (PIN) and the expression correlation between the hubs and their interacting proteins was measured by calculating Pearson Correlation Coefficient (PCC) values. The differences in the average PCC for each hub between the disease states and their respective controls were calculated for studied datasets. The individual hubs and their interactors with expression, PCC and average PCC values were treated as dynamic subnetworks. The hubs that showed unique trends of alterations specific to S. Typhi infection were identified. Results We identified S. Typhi infection-specific dynamic subnetworks of the host, which involve 81 hubs and 1343 interactions. The major enriched GO biological process terms in the identified subnetworks were regulation of apoptosis and biological adhesions, while the enriched pathways include cytokine signalling in the immune system and downstream TCR signalling. The dynamic nature of the hubs CCR1, IRS2 and PRKCA with their interactors was studied in detail. The difference in the dynamics of the subnetworks specific to S. Typhi infection suggests a potential molecular model of typhoid fever. Conclusions Hubs and their interactors of the S. Typhi infection-specific dynamic subnetworks carrying distinct PCC values compared with the non-typhoid and other disease conditions reveal new insight into the pathogenesis of S. Typhi. PMID:25144185
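The hub-centred correlation measure described in the Methods can be sketched as below. This is a minimal illustration under stated assumptions rather than the authors' code: expression data are assumed to be dictionaries mapping a protein name to a list of per-sample values, and the function names and toy numbers are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def average_hub_pcc(expression, hub, interactors):
    """Mean PCC between a hub and each of its interaction partners."""
    values = [pearson(expression[hub], expression[p]) for p in interactors]
    return sum(values) / len(values)


def hub_pcc_difference(disease, control, hub, interactors):
    """Average hub PCC in the disease state minus that in the control state."""
    return average_hub_pcc(disease, hub, interactors) - average_hub_pcc(
        control, hub, interactors
    )


# Made-up expression profiles for one hub and two partners.
control = {"hub": [1, 2, 3, 4], "p1": [2, 4, 6, 8], "p2": [1, 2, 3, 4]}
disease = {"hub": [1, 2, 3, 4], "p1": [8, 6, 4, 2], "p2": [1, 2, 3, 4]}
print(hub_pcc_difference(disease, control, "hub", ["p1", "p2"]))
```

A large shift in the average PCC between conditions flags a hub whose coordination with its partners changes, which is the sense in which the abstract calls the subnetworks dynamic.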
2012-01-01
Background The role of n-3 fatty acids in prevention of breast cancer is well recognized, but the underlying molecular mechanisms are still unclear. In view of the growing need for early detection of breast cancer, Graham et al. (2010) studied the microarray gene expression in histologically normal epithelium of subjects with or without breast cancer. We conducted a secondary analysis of this dataset with a focus on the genes (n = 47) involved in fat and lipid metabolism. We used stepwise multivariate logistic regression analyses, volcano plots and false discovery rates for association analyses. We also conducted meta-analyses of other microarray studies using random effects models for three outcomes--risk of breast cancer (380 breast cancer patients and 240 normal subjects), risk of metastasis (430 metastatic compared to 1104 non-metastatic breast cancers) and risk of recurrence (484 recurring versus 890 non-recurring breast cancers). Results The HADHA gene [hydroxyacyl-CoA dehydrogenase/3-ketoacyl-CoA thiolase/enoyl-CoA hydratase (trifunctional protein), alpha subunit] was significantly under-expressed in breast cancer; more so in those with estrogen receptor-negative status. Our meta-analysis showed an 18.4%-26% reduction in HADHA expression in breast cancer. Also, there was an inconclusive but consistent under-expression of HADHA in subjects with metastatic and recurring breast cancers. Conclusions Involvement of mitochondria and the mitochondrial trifunctional protein (encoded by HADHA gene) in breast carcinogenesis is known. Our results lend additional support to the possibility of this involvement. Further, our results suggest that targeted subset analysis of large genome-based datasets can provide interesting association signals. PMID:22240105
Reem, Nathan T; Chen, Han-Yi; Hur, Manhoi; Zhao, Xuefeng; Wurtele, Eve Syrkin; Li, Xu; Li, Ling; Zabotina, Olga
2018-03-01
This research provides new insights into plant response to cell wall perturbations through correlation of transcriptome and metabolome datasets obtained from transgenic plants expressing cell wall-modifying enzymes. Plants respond to changes in their cell walls in order to protect themselves from pathogens and other stresses. Cell wall modifications in Arabidopsis thaliana have profound effects on gene expression and defense response, but the cell signaling mechanisms underlying these responses are not well understood. Three transgenic Arabidopsis lines, two with reduced cell wall acetylation (AnAXE and AnRAE) and one with reduced feruloylation (AnFAE), were used in this study to investigate the plant responses to cell wall modifications. RNA-Seq in combination with untargeted metabolome was employed to assess differential gene expression and metabolite abundance. RNA-Seq results were correlated with metabolite abundances to determine the pathways involved in response to cell wall modifications introduced in each line. The resulting pathway enrichments revealed the deacetylation events in AnAXE and AnRAE plants induced similar responses, notably, upregulation of aromatic amino acid biosynthesis and changes in regulation of primary metabolic pathways that supply substrates to specialized metabolism, particularly those related to defense responses. In contrast, genes and metabolites of lipid biosynthetic pathways and peroxidases involved in lignin polymerization were downregulated in AnFAE plants. These results elucidate how primary metabolism responds to extracellular stimuli. Combining the transcriptomics and metabolomics datasets increased the power of pathway prediction, and demonstrated the complexity of pathways involved in cell wall-mediated signaling.
Mustroph, Angelika; Bailey-Serres, Julia
2010-03-01
Plants consist of distinct cell types distinguished by position, morphological features and metabolic activities. We recently developed a method to extract cell-type specific mRNA populations by immunopurification of ribosome-associated mRNAs. Microarray profiles of 21 cell-specific mRNA populations from seedling roots and shoots comprise the Arabidopsis Translatome dataset. This gene expression atlas provides a new tool for the study of cell-specific processes. Here we provide an example of how genes involved in a pathway limited to one or few cell-types can be further characterized and new candidate genes can be predicted. Cells of the root endodermis produce suberin as an inner barrier between the cortex and stele, whereas the shoot epidermal cells form cutin as a barrier to the external environment. Both polymers consist of fatty acid derivates, and share biosynthetic origins. We use the Arabidopsis Translatome dataset to demonstrate the significant cell-specific expression patterns of genes involved in those biosynthetic processes and suggest new candidate genes in the biosynthesis of suberin and cutin.
DigOut: viewing differential expression genes as outliers.
Yu, Hui; Tu, Kang; Xie, Lu; Li, Yuan-Yuan
2010-12-01
For well-replicated two-conditional microarray datasets, the selection of differentially expressed (DE) genes is a well-studied computational topic, but for multi-conditional microarray datasets with limited or no replication, the same task has not been properly addressed by previous studies. This paper adopts multivariate outlier analysis to analyze replication-lacking multi-conditional microarray datasets, finding that it performs significantly better than the widely used limit fold change (LFC) model in a simulated comparative experiment. Compared with the LFC model, the multivariate outlier analysis also demonstrates improved stability against sample variations in a series of manipulated real expression datasets. The reanalysis of a real non-replicated multi-conditional expression dataset series leads to satisfactory results. In conclusion, a multivariate outlier analysis algorithm, like DigOut, is particularly useful for selecting DE genes from non-replicated multi-conditional gene expression datasets.
Pashaei, Elnaz; Guzel, Esra; Ozgurses, Mete Emir; Demirel, Goksun; Aydin, Nizamettin; Ozen, Mustafa
MicroRNAs, which are small regulatory RNAs, post-transcriptionally regulate gene expression by binding the 3'-UTR of their mRNA targets. Their deregulation has been shown to cause increased proliferation, migration, invasion, and apoptosis. miR-145, an important tumor suppressor microRNA, has been shown to be downregulated in many cancer types and has crucial roles in tumor initiation, progression, metastasis, invasion, recurrence, and chemo-radioresistance. Our aim is to investigate potential common target genes of miR-145 and to help in understanding the underlying molecular pathways of tumor pathogenesis in association with those common target genes. Eight published microarray datasets, in which targets of miR-145 were investigated in cell lines upon miR-145 overexpression, were included in this study for meta-analysis. Inter-group variabilities were assessed by box-plot analysis. Microarray datasets were analyzed using the GEOquery package in Bioconductor 3.2 with R version 3.2.2, and two-way hierarchical clustering was used for gene expression data analysis. Meta-analysis of different GEO datasets showed that UNG, FUCA2, DERA, GMFB, TF, and SNX2 were commonly downregulated genes, whereas MYL9 and TAGLN were found to be commonly upregulated upon miR-145 overexpression in prostate, breast, esophageal, bladder cancer, and head and neck squamous cell carcinoma. Biological process, molecular function, and pathway analysis of these potential targets of miR-145 through functional enrichments in the PPI network demonstrated that those genes are significantly involved in telomere maintenance, DNA binding and repair mechanisms. In conclusion, our results indicated that miR-145, through targeting its common potential targets, may significantly contribute to tumor pathogenesis in distinct cancer types and might serve as an important target for cancer therapy.
Ooi, Chia Huey; Chetty, Madhu; Teng, Shyh Wei
2006-06-23
Due to the large number of genes in a typical microarray dataset, feature selection looks set to play an important role in reducing noise and computational cost in gene expression-based tissue classification while improving accuracy at the same time. Surprisingly, this does not appear to be the case for all multiclass microarray datasets. The reason is that many feature selection techniques applied on microarray datasets are either rank-based and hence do not take into account correlations between genes, or are wrapper-based, which require high computational cost, and often yield difficult-to-reproduce results. In studies where correlations between genes are considered, attempts to establish the merit of the proposed techniques are hampered by evaluation procedures which are less than meticulous, resulting in overly optimistic estimates of accuracy. We present two realistically evaluated correlation-based feature selection techniques which incorporate, in addition to the two existing criteria involved in forming a predictor set (relevance and redundancy), a third criterion called the degree of differential prioritization (DDP). DDP functions as a parameter to strike the balance between relevance and redundancy, providing our techniques with the novel ability to differentially prioritize the optimization of relevance against redundancy (and vice versa). This ability proves useful in producing optimal classification accuracy while using reasonably small predictor set sizes for nine well-known multiclass microarray datasets. For multiclass microarray datasets, especially the GCM and NCI60 datasets, DDP enables our filter-based techniques to produce accuracies better than those reported in previous studies which employed similarly realistic evaluation procedures.
Similarity of markers identified from cancer gene expression studies: observations from GEO.
Shi, Xingjie; Shen, Shihao; Liu, Jin; Huang, Jian; Zhou, Yong; Ma, Shuangge
2014-09-01
Gene expression profiling has been extensively conducted in cancer research. The analysis of multiple independent cancer gene expression datasets may provide additional information and complement single-dataset analysis. In this study, we conduct multi-dataset analysis and are interested in evaluating the similarity of cancer-associated genes identified from different datasets. The first objective of this study is to briefly review some statistical methods that can be used for such evaluation. Both marginal analysis and joint analysis methods are reviewed. The second objective is to apply those methods to 26 Gene Expression Omnibus (GEO) datasets on five types of cancers. Our analysis suggests that for the same cancer, the marker identification results may vary significantly across datasets, and different datasets share few common genes. In addition, datasets on different cancers share few common genes. The shared genetic basis of datasets on the same or different cancers, which has been suggested in the literature, is not observed in the analysis of GEO data.
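One simple way to quantify how few genes are shared between marker lists from different datasets is a pairwise overlap score. The abstract does not name the measure used, so the Jaccard index below is a swapped-in illustration rather than the authors' method; the dataset labels and gene names are placeholders.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_overlap(marker_sets):
    """Jaccard index for every pair of datasets, keyed by sorted dataset names."""
    names = sorted(marker_sets)
    return {
        (m, n): jaccard(marker_sets[m], marker_sets[n])
        for m, n in combinations(names, 2)
    }


# Placeholder marker lists; real lists would come from each dataset's analysis.
markers = {
    "dataset1": {"g1", "g2", "g3"},
    "dataset2": {"g2", "g3", "g4"},
    "dataset3": {"g5"},
}
for pair, score in pairwise_overlap(markers).items():
    print(pair, score)
```

Scores near zero across most pairs would correspond to the pattern the abstract reports, in which datasets share few common genes.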
Pan- and core- network analysis of co-expression genes in a model plant
He, Fei; Maslov, Sergei
2016-12-16
Genome-wide gene expression experiments have been performed using the model plant Arabidopsis during the last decade. Some studies involved construction of coexpression networks, a popular technique used to identify groups of co-regulated genes, to infer unknown gene functions. One approach is to construct a single coexpression network by combining multiple expression datasets generated in different labs. We advocate a complementary approach in which we construct a large collection of 134 coexpression networks based on expression datasets reported in individual publications. To this end we reanalyzed public expression data. To describe this collection of networks we introduced concepts of ‘pan-network’ and ‘core-network’ representing union and intersection between a sizeable fraction of individual networks, respectively. Here, we showed that these two types of networks are different both in terms of their topology and biological function of interacting genes. For example, the modules of the pan-network are enriched in regulatory and signaling functions, while the modules of the core-network tend to include components of large macromolecular complexes such as ribosomes and photosynthetic machinery. Our analysis is aimed to help the plant research community to better explore the information contained within the existing vast collection of gene expression data in Arabidopsis.
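The pan-network and core-network concepts can be illustrated by counting how many individual networks support each edge. The sketch below is an assumption-laden simplification: the abstract does not give the thresholds, so `pan_min_count` and `core_min_count` are hypothetical parameters, and the toy edges are invented.

```python
from collections import Counter


def edge_support(networks):
    """Count, for each undirected edge, the number of networks containing it."""
    counts = Counter()
    for edges in networks:
        # frozenset makes (a, b) and (b, a) the same edge; the set removes repeats.
        counts.update({frozenset(edge) for edge in edges})
    return counts


def pan_and_core(networks, pan_min_count, core_min_count):
    """Return (pan, core) edge sets using two support thresholds.

    A low pan_min_count approximates a union; a high core_min_count
    approximates an intersection. Both thresholds are assumptions.
    """
    counts = edge_support(networks)
    pan = {edge for edge, c in counts.items() if c >= pan_min_count}
    core = {edge for edge, c in counts.items() if c >= core_min_count}
    return pan, core


# Three toy coexpression networks given as edge lists.
toy_networks = [
    [("a", "b"), ("b", "c")],
    [("b", "a"), ("c", "d")],
    [("a", "b")],
]
pan, core = pan_and_core(toy_networks, pan_min_count=1, core_min_count=3)
print(sorted(sorted(edge) for edge in pan))
print(sorted(sorted(edge) for edge in core))
```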
Defining the gene expression signature of rhabdomyosarcoma by meta-analysis
Romualdi, Chiara; De Pittà, Cristiano; Tombolan, Lucia; Bortoluzzi, Stefania; Sartori, Francesca; Rosolen, Angelo; Lanfranchi, Gerolamo
2006-01-01
Background Rhabdomyosarcoma (RMS) is a highly malignant soft tissue sarcoma in childhood and arises as a consequence of regulatory disruption of the growth and differentiation pathways of myogenic precursor cells. The pathogenic pathways involved in this tumor are mostly unknown and therefore a better characterization of the RMS gene expression profile would represent a considerable advance. The availability of public gene expression datasets has opened up new challenges, especially for the integration of data generated by different research groups and different array platforms with the purpose of obtaining new insights into the biological process investigated. Results In this work we performed a meta-analysis on four microarray and two SAGE datasets of gene expression data on RMS in order to evaluate the degree of agreement of the biological results obtained by these different studies and to identify common regulatory pathways that could be responsible for tumor growth. Significantly enriched regulatory pathways and biological processes have been investigated, and a list of differential meta-profiles has been identified as possible candidates for aggressiveness of RMS. Conclusion Our results point to a general down regulation of the energy production pathways, suggesting a hypoxic physiology for RMS cells. This result agrees with the high malignancy of RMS and with its resistance to most of the therapeutic treatments. In this context, different isoforms of the ANT gene have been consistently identified for the first time as differentially expressed in RMS. This gene is involved in anti-apoptotic processes when cells grow in low oxygen conditions. These new insights into the biological processes responsible for RMS growth and development demonstrate the effective advantage of the use of integrated analysis of gene expression studies. PMID:17090319
Speaker emotion recognition: from classical classifiers to deep neural networks
NASA Astrophysics Data System (ADS)
Mezghani, Eya; Charfeddine, Maha; Nicolas, Henri; Ben Amar, Chokri
2018-04-01
Speaker emotion recognition has been considered among the most challenging tasks in recent years. Automatic systems for security, medicine or education can be improved by taking the affective state of speech into account. In this paper, a twofold approach for speech emotion classification is proposed. First, a relevant set of features is adopted; second, numerous supervised training techniques, involving classic methods as well as deep learning, are tested. Experimental results indicate that deep architectures can improve classification performance on two affective databases, the Berlin Dataset of Emotional Speech and the Surrey Audio-Visual Expressed Emotion (SAVEE) dataset.
Convergent Genetic and Expression Datasets Highlight TREM2 in Parkinson's Disease Susceptibility.
Liu, Guiyou; Liu, Yongquan; Jiang, Qinghua; Jiang, Yongshuai; Feng, Rennan; Zhang, Liangcai; Chen, Zugen; Li, Keshen; Liu, Jiafeng
2016-09-01
A rare TREM2 missense mutation (rs75932628-T) was reported to confer a significant Alzheimer's disease (AD) risk. A recent study indicated no evidence of the involvement of this variant in Parkinson's disease (PD). Here, we used the genetic and expression data to reinvestigate the potential association between TREM2 and PD susceptibility. In stage 1, using 10 independent studies (N = 89,157; 8787 cases and 80,370 controls), we conducted a subgroup meta-analysis. We identified a significant association between rs75932628 and PD (P = 3.10E-03, odds ratio (OR) = 3.88, 95 % confidence interval (CI) 1.58-9.54) in No-Northern Europe subgroup, and significantly increased PD risks (P = 0.01 for Mann-Whitney test) in No-Northern Europe subgroup than in Northern Europe subgroup. In stage 2, we used the summary results from a large-scale PD genome-wide association study (GWAS; N = 108,990; 13,708 cases and 95,282 controls) to search for other TREM2 variants contributing to PD susceptibility. We identified 14 single-nucleotide polymorphisms (SNPs) associated with PD within 50-kb upstream and downstream range of TREM2. In stage 3, using two brain expression GWAS datasets (N = 773), we identified 6 of the 14 SNPs regulating increased expression of TREM2. In stage 4, using the whole human genome microarray data (N = 50), we further identified significantly increased expression of TREM2 in PD cases compared with controls in human prefrontal cortex. In summary, convergent genetic and expression datasets demonstrate that TREM2 is a potent risk factor for PD and may be a therapeutic target in PD and other neurodegenerative diseases.
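The odds ratios and 95% confidence intervals quoted in this abstract follow a standard form that can be shown for a single 2x2 table. The sketch below uses the usual Wald interval on the log odds ratio; it is not the pooled subgroup meta-analysis the authors performed, and the counts are hypothetical.

```python
from math import exp, log, sqrt


def odds_ratio_with_ci(case_carriers, case_noncarriers,
                       control_carriers, control_noncarriers, z=1.96):
    """Odds ratio and Wald confidence interval for one 2x2 table.

    z = 1.96 gives an approximate 95% interval. All four counts must be
    positive; no continuity correction is applied in this sketch.
    """
    a, b, c, d = case_carriers, case_noncarriers, control_carriers, control_noncarriers
    odds_ratio = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts: 20 of 100 cases and 10 of 100 controls carry the variant.
estimate, low, high = odds_ratio_with_ci(20, 80, 10, 90)
print(round(estimate, 2), round(low, 2), round(high, 2))
```

An interval that excludes 1, as in the range reported in the abstract (1.58-9.54), indicates an association at the chosen confidence level; the toy interval above does not.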
Wong, Kah Keng; Ch'ng, Ewe Seng; Loo, Suet Kee; Husin, Azlan; Muruzabal, María Arestin; Møller, Michael B; Pedersen, Lars M; Pomposo, María Puente; Gaafar, Ayman; Banham, Alison H; Green, Tina M; Lawrie, Charles H
2015-12-01
Huntingtin-interacting protein 1-related (HIP1R) is an endocytic protein involved in receptor trafficking, including regulating cell surface expression of receptor tyrosine kinases. We have previously shown that low HIP1R protein expression was associated with poorer survival in diffuse large B-cell lymphoma (DLBCL) patients from Denmark treated with R-CHOP (rituximab, cyclophosphamide, doxorubicin, vincristine, prednisone). In this multicenter study, we extend these findings and validate the prognostic and subtyping utility of HIP1R expression at both transcript and protein level. Using data mining on three independent transcriptomic datasets of DLBCL, HIP1R transcript was preferentially expressed in germinal center B-cell (GCB)-like DLBCL subtype (P<0.01 in all three datasets), and lower expression was correlated with worse overall survival (OS; P<0.01) and progression-free survival (PFS; P<0.05) in a microarray-profiled DLBCL dataset. At the protein level examined by immunohistochemistry, HIP1R expression at 30% cut-off was associated with GCB-DLBCL molecular subtype (P=0.0004; n=42), and predictive of OS (P=0.0006) and PFS (P=0.0230) in de novo DLBCL patients treated with R-CHOP (n=73). Cases with high FOXP1 and low HIP1R expression frequency (FOXP1(hi)/HIP1R(lo) phenotype) exhibited poorer OS (P=0.0038) and PFS (P=0.0134). Multivariate analysis showed that HIP1R<30% or FOXP1(hi)/HIP1R(lo) subgroup of patients exhibited inferior OS and PFS (P<0.05) independently of the International Prognostic Index. We conclude that HIP1R expression is strongly indicative of survival when utilized on its own or in combination with FOXP1, and the molecule is potentially applicable for subtyping of DLBCL cases. Copyright © 2015 Elsevier Inc. All rights reserved.
VTCdb: a gene co-expression database for the crop species Vitis vinifera (grapevine).
Wong, Darren C J; Sweetman, Crystal; Drew, Damian P; Ford, Christopher M
2013-12-16
Gene expression datasets in model plants such as Arabidopsis have contributed to our understanding of gene function and how a single underlying biological process can be governed by a diverse network of genes. The accumulation of publicly available microarray data encompassing a wide range of biological and environmental conditions has enabled the development of additional capabilities including gene co-expression analysis (GCA). GCA is based on the understanding that genes encoding proteins involved in similar and/or related biological processes may exhibit comparable expression patterns over a range of experimental conditions, developmental stages and tissues. We present an open access database for the investigation of gene co-expression networks within the cultivated grapevine, Vitis vinifera. The new gene co-expression database, VTCdb (http://vtcdb.adelaide.edu.au/Home.aspx), offers an online platform for transcriptional regulatory inference in the cultivated grapevine. Using condition-independent and condition-dependent approaches, grapevine co-expression networks were constructed using the latest publicly available microarray datasets from diverse experimental series, utilising the Affymetrix Vitis vinifera GeneChip (16 K) and the NimbleGen Grape Whole-genome microarray chip (29 K), thus making it possible to profile approximately 29,000 genes (95% of the predicted grapevine transcriptome). Applications available with the online platform include the use of gene names, probesets, modules or biological processes to query the co-expression networks, with the option to choose between Affymetrix or Nimblegen datasets and between multiple co-expression measures. Alternatively, the user can browse existing network modules using interactive network visualisation and analysis via CytoscapeWeb. 
To demonstrate the utility of the database, we present examples from three fundamental biological processes (berry development, photosynthesis and flavonoid biosynthesis) whereby the recovered sub-networks reconfirm established plant gene functions and also identify novel associations. Together, we present valuable insights into grapevine transcriptional regulation by developing network models applicable to researchers in their prioritisation of gene candidates, for on-going study of biological processes related to grapevine development, metabolism and stress responses.
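The co-expression idea behind this database (genes with similar expression profiles across conditions are linked) can be illustrated with a toy calculation. This is a minimal sketch and not VTCdb's implementation: it assumes Pearson correlation as the co-expression measure and an arbitrary cutoff of 0.9, and the gene names and expression values are invented for illustration.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical expression values for four genes across five conditions.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.9],   # rises with geneA
    "geneC": [5.0, 4.0, 3.0, 2.0, 1.0],   # mirror image of geneA
    "geneD": [3.0, 1.0, 4.0, 1.0, 5.0],   # no clear relationship
}

def coexpression_edges(expr, cutoff=0.9):
    """Return gene pairs whose correlation is at or above the cutoff."""
    edges = []
    for g1, g2 in combinations(sorted(expr), 2):
        r = pearson(expr[g1], expr[g2])
        if r >= cutoff:
            edges.append((g1, g2, round(r, 3)))
    return edges

edges = coexpression_edges(expression)
print(edges)
```

With this toy data only the geneA/geneB pair passes the cutoff. A real network would typically also consider rank-based measures and negative correlations, which is why the platform described above offers a choice between multiple co-expression measures.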
Dynamic regulation of genetic pathways and targets during aging in Caenorhabditis elegans.
He, Kan; Zhou, Tao; Shao, Jiaofang; Ren, Xiaoliang; Zhao, Zhongying; Liu, Dahai
2014-03-01
Numerous genetic targets and some individual pathways associated with aging have been identified using the worm model. However, less is known about the genetic mechanisms of aging at the genome-wide level, particularly at the level of multiple pathways and of the regulatory networks active during aging. Here, we used gene expression datasets from three time points during aging in Caenorhabditis elegans (C. elegans) and performed gene set enrichment analysis (GSEA) on each dataset between adjacent stages. As a result, multiple genetic pathways and targets were identified as significantly down- or up-regulated. Among them, 5 truly aging-dependent signaling pathways, including the MAPK, mTOR, Wnt, TGF-beta and ErbB signaling pathways, as well as 12 significantly associated genes were identified with dynamic expression patterns during aging. In addition, several metabolic pathways showed continued declines in regulation with age. Furthermore, the regulatory networks reconstructed from three aging-related chromatin immunoprecipitation followed by sequencing (ChIP-seq) datasets and the expression matrices of 154 genes involved in the above signaling pathways provide new insights into aging at the level of multiple pathways. The combination of multiple genetic pathways and targets needs to be taken into consideration in future studies of aging, in which the dynamic regulation would be uncovered.
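For readers unfamiliar with GSEA, its core statistic is a running sum over a ranked gene list. The sketch below shows only the unweighted (Kolmogorov-Smirnov-like) form of that enrichment score; it omits the correlation weighting and permutation-based significance testing of the full method, and it is not the authors' pipeline. The ranked list and gene sets are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score.

    Walk down the ranked list, stepping up at each gene in the set and
    down at each gene outside it; return the largest deviation from 0.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the list only partially")
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

# Hypothetical ranking, e.g. from most up- to most down-regulated
# between two adjacent time points.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8", "g9", "g10"]
es_top = enrichment_score(ranked, {"g1", "g2", "g3"})      # set at the top
es_bottom = enrichment_score(ranked, {"g8", "g9", "g10"})  # set at the bottom
print(es_top, es_bottom)
```

A score near +1 indicates a gene set concentrated at the top of the ranking (up-regulated) and a score near -1 one concentrated at the bottom (down-regulated), which corresponds to the up- and down-regulated pathways reported above.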
Sethuraman, Sunantha; Thomas, Merin; Gay, Lauren A; Renne, Rolf
2018-05-29
Ribonomics experiments involving crosslinking and immuno-precipitation (CLIP) of Ago proteins have expanded the understanding of the miRNA targetome of several organisms. These techniques, collectively referred to as CLIP-seq, have been applied to identifying the mRNA targets of miRNAs expressed by Kaposi's Sarcoma-associated herpes virus (KSHV) and Epstein-Barr virus (EBV). However, these studies focused on identifying only those RNA targets of KSHV and EBV miRNAs that are known to encode proteins. Recent studies have demonstrated that long non-coding RNAs (lncRNAs) are also targeted by miRNAs. In this study, we performed a systematic re-analysis of published datasets from KSHV- and EBV-driven cancers. We used CLIP-seq data from lymphoma cells or EBV-transformed B cells, and a crosslinking, ligation and sequencing of hybrids dataset from KSHV-infected endothelial cells, to identify novel lncRNA targets of viral miRNAs. Here, we catalog the lncRNA targetome of KSHV and EBV miRNAs, and provide a detailed in silico analysis of lncRNA-miRNA binding interactions. Viral miRNAs target several hundred lncRNAs, including a subset previously shown to be aberrantly expressed in human malignancies. In addition, we identified thousands of lncRNAs to be putative targets of human miRNAs, suggesting that miRNA-lncRNA interactions broadly contribute to the regulation of gene expression.
Chen, Junhui; Meng, Yuhuan; Zhou, Jinghui; Zhuo, Min; Ling, Fei; Zhang, Yu; Du, Hongli; Wang, Xiaoning
2013-01-01
Type 2 Diabetes Mellitus (T2DM) and obesity have become increasingly prevalent in recent years. Recent studies have focused on identifying causal variations or candidate genes for obesity and T2DM via analysis of expression quantitative trait loci (eQTL) within a single tissue. However, T2DM and obesity are affected by comprehensive sets of genes in multiple tissues. In the current study, gene expression levels in multiple human tissues from GEO datasets were analyzed, and 21 candidate genes displaying high percentages of differential expression were selected. Specifically, DENND1B, LYN, MRPL30, POC1B, PRKCB, RP4-655J12.3, HIBADH, and TMBIM4 were identified from the T2DM-control study, and BCAT1, BMP2K, CSRNP2, MYNN, NCKAP5L, SAP30BP, SLC35B4, SP1, BAP1, GRB14, HSP90AB1, ITGA5, and TOMM5 were identified from the obesity-control study. The majority of these genes are known to be involved in T2DM and obesity. Therefore, analysis of gene expression in various tissues using GEO datasets may be an effective and feasible method to determine novel or causal genes associated with T2DM and obesity.
Multimedia Content Development as a Facial Expression Datasets for Recognition of Human Emotions
NASA Astrophysics Data System (ADS)
Mamonto, N. E.; Maulana, H.; Liliana, D. Y.; Basaruddin, T.
2018-02-01
Previously developed datasets contain facial expressions of foreign subjects. The development of this multimedia content aims to address the problems experienced by the research team and by other researchers who will conduct similar research. The method used to develop the multimedia content as a facial expression dataset for human emotion recognition is the Villamil-Molina version of the multimedia development method. The multimedia content was developed with 10 subjects (talents), each performing 3 shots, with the talent demonstrating 19 facial expressions in each shot. After editing and rendering, tests were carried out, with the conclusion that the multimedia content can be used as a facial expression dataset for recognition of human emotions.
Koda, Satoru; Onda, Yoshihiko; Matsui, Hidetoshi; Takahagi, Kotaro; Yamaguchi-Uehara, Yukiko; Shimizu, Minami; Inoue, Komaki; Yoshida, Takuhiro; Sakurai, Tetsuya; Honda, Hiroshi; Eguchi, Shinto; Nishii, Ryuei; Mochida, Keiichi
2017-01-01
We report the comprehensive identification of periodic genes and their network inference, based on a gene co-expression analysis and an Auto-Regressive eXogenous (ARX) model with a group smoothly clipped absolute deviation (SCAD) method using a time-series transcriptome dataset in a model grass, Brachypodium distachyon. To reveal the diurnal changes in the transcriptome in B. distachyon, we performed RNA-seq analysis of its leaves sampled through a diurnal cycle of over 48 h at 4 h intervals using three biological replications, and identified 3,621 periodic genes through our wavelet analysis. The expression data make it feasible to infer network sparsity based on ARX models. We found that genes involved in biological processes such as transcriptional regulation, protein degradation, post-transcriptional modification and photosynthesis are significantly enriched in the periodic genes, suggesting that these processes might be regulated by circadian rhythm in B. distachyon. On the basis of the time-series expression patterns of the periodic genes, we constructed a chronological gene co-expression network and identified putative transcription factor-encoding genes that might be involved in the time-specific regulatory transcriptional network. Moreover, we inferred a transcriptional network composed of the periodic genes in B. distachyon, aiming to identify genes associated with other genes through variable selection by grouping time points for each gene. Based on the ARX model with the group SCAD regularization using our time-series expression datasets of the periodic genes, we constructed gene networks and found that the networks exhibit a typical scale-free structure. Our findings demonstrate that the diurnal changes in the transcriptome in B. distachyon leaves have a sparse network structure and reveal the spatiotemporal gene regulatory network over the cyclic phase transitions in B. distachyon diurnal growth.
Heterogeneous data fusion for brain tumor classification.
Metsis, Vangelis; Huang, Heng; Andronesi, Ovidiu C; Makedon, Fillia; Tzika, Aria
2012-10-01
Current research in biomedical informatics involves analysis of multiple heterogeneous data sets. This includes patient demographics, clinical and pathology data, treatment history, patient outcomes as well as gene expression, DNA sequences and other information sources such as gene ontology. Analysis of these data sets could lead to better disease diagnosis, prognosis, treatment and drug discovery. In this report, we present a novel machine learning framework for brain tumor classification based on heterogeneous data fusion of metabolic and molecular datasets, including state-of-the-art high-resolution magic angle spinning (HRMAS) proton (1H) magnetic resonance spectroscopy and gene transcriptome profiling, obtained from intact brain tumor biopsies. Our experimental results show that our novel framework outperforms any analysis using an individual dataset.
Jia, Peilin; Chen, Xiangning; Xie, Wei; Kendler, Kenneth S; Zhao, Zhongming
2018-06-20
Numerous high-throughput omics studies have been conducted in schizophrenia, providing an accumulated catalog of susceptible variants and genes. The results from these studies, however, are highly heterogeneous. The variants and genes nominated by different omics studies often have limited overlap with each other. There is thus a pressing need for integrative analysis to unify the different types of data and provide a convergent view of schizophrenia candidate genes (SZgenes). In this study, we collected a comprehensive, multidimensional dataset, including 7819 brain-expressed genes. The data hosted genome-wide association evidence in genetics (eg, genotyping data, copy number variations, de novo mutations), epigenetics, transcriptomics, and literature mining. We developed a method named mega-analysis of odds ratio (MegaOR) to prioritize SZgenes. Application of MegaOR in the multidimensional data resulted in consensus sets of SZgenes (up to 530), each enriched with dense, multidimensional evidence. We proved that these SZgenes had highly tissue-specific expression in brain and nerve and had intensive interactions that were significantly stronger than chance expectation. Furthermore, we found these SZgenes were involved in human brain development by showing strong spatiotemporal expression patterns; these characteristics were replicated in independent brain expression datasets. Finally, we found the SZgenes were enriched in critical functional gene sets involved in neuronal activities, ligand gated ion signaling, and fragile X mental retardation protein targets. In summary, MegaOR analysis reported consensus sets of SZgenes with enriched association evidence to schizophrenia, providing insights into the pathophysiology underlying schizophrenia.
Meta-Analysis of Tumor Stem-Like Breast Cancer Cells Using Gene Set and Network Analysis
Lee, Won Jun; Kim, Sang Cheol; Yoon, Jung-Ho; Yoon, Sang Jun; Lim, Johan; Kim, You-Sun; Kwon, Sung Won; Park, Jeong Hill
2016-01-01
Generally, cancer stem cells have epithelial-to-mesenchymal-transition characteristics and other aggressive properties that cause metastasis. However, there have been no reliable markers for the identification of cancer stem cells, and comparative methods examining adherent and sphere cells are widely used to investigate the mechanisms underlying cancer stem cells, because sphere cells are known to maintain cancer stem cell characteristics. In this study, we conducted a meta-analysis that combined gene expression profiles from several studies that utilized tumorsphere technology to investigate tumor stem-like breast cancer cells. We used our own gene expression profiles along with three different gene expression profiles from the Gene Expression Omnibus, which we combined using the ComBat method, and obtained significant gene sets using gene set analysis of our datasets and the combined dataset. This experiment focused on four gene sets, such as cytokine-cytokine receptor interaction, that demonstrated significance in both datasets. Our observations demonstrated that among the genes of the four significant gene sets, six genes were consistently up-regulated and satisfied the p-value of < 0.05, and our network analysis showed high connectivity in five genes. From these results, we established CXCR4, CXCL1 and HMGCS1, the intersecting genes of the datasets with high connectivity and p-value of < 0.05, as significant genes in the identification of cancer stem cells. An additional experiment using quantitative reverse transcription-polymerase chain reaction showed significant up-regulation in MCF-7 derived sphere cells and confirmed the importance of these three genes. Taken together, using meta-analysis that combines gene set and network analysis, we suggested CXCR4, CXCL1 and HMGCS1 as candidates involved in tumor stem-like breast cancer cells.
Distinct from other meta-analyses, by using gene set analysis, we selected possible markers which can explain the biological mechanisms and suggested network analysis as an additional criterion for selecting candidates. PMID:26870956
Wolff, Alexander; Bayerlová, Michaela; Gaedcke, Jochen; Kube, Dieter; Beißbarth, Tim
2018-01-01
Pipeline comparisons for gene expression data are highly valuable for applied real data analyses, as they enable the selection of suitable analysis strategies for the dataset at hand. Such pipelines for RNA-Seq data should include mapping of reads, counting and differential gene expression analysis, or preprocessing, normalization and differential gene expression in the case of microarray analysis, in order to give a global insight into pipeline performances. Four commonly used RNA-Seq pipelines (STAR/HTSeq-Count/edgeR, STAR/RSEM/edgeR, Sailfish/edgeR, TopHat2/Cufflinks/CuffDiff) were investigated on multiple levels (alignment and counting) and cross-compared with the microarray counterpart on the level of gene expression and gene ontology enrichment. For these comparisons we generated two matched microarray and RNA-Seq datasets: Burkitt Lymphoma cell line data and rectal cancer patient data. The overall mapping rate of STAR was 98.98% for the cell line dataset and 98.49% for the patient dataset. TopHat2's overall mapping rate was 97.02% and 96.73%, respectively, while Sailfish had an overall mapping rate of only 84.81% and 54.44%. The correlation of gene expression in microarray and RNA-Seq data was moderately worse for the patient dataset (ρ = 0.67-0.69) than for the cell line dataset (ρ = 0.87-0.88). The exception was the correlation results of Cufflinks, which were substantially lower (ρ = 0.21-0.29 and 0.34-0.53). For both datasets we identified very low numbers of differentially expressed genes using the microarray platform. For RNA-Seq we checked the agreement of differentially expressed genes identified in the different pipelines and of GO-term enrichment results. In conclusion, the combination of the STAR aligner with HTSeq-Count, followed by the STAR aligner with RSEM and by Sailfish, generated differentially expressed genes best suited for the dataset at hand and in agreement with most of the other transcriptomics pipelines.
Devi, Kamalakshi; Mishra, Surajit K; Sahu, Jagajjit; Panda, Debashis; Modi, Mahendra K; Sen, Priyabrata
2016-02-15
Advances in transcriptome sequencing provide a fast, cost-effective and reliable approach to generating large expression datasets, especially suitable for non-model species, to identify putative genes, key pathways and regulatory mechanisms. Citronella (Cymbopogon winterianus) is an aromatic medicinal grass used for its anti-tumoral, antibacterial, anti-fungal, antiviral, detoxifying and natural insect repellent properties. Despite its many uses, the genes involved in the terpene biosynthetic pathway have not yet been clearly elucidated. The present study is a pioneering attempt to generate exhaustive molecular information on the secondary metabolite pathway and to increase genomic resources in Citronella. Using high-throughput RNA-Seq technology, the root and leaf transcriptome was analysed at an unprecedented depth (11.7 Gb). Targeted searches identified the majority of the genes associated with metabolic pathways and other natural product pathways, viz. antibiotics synthesis, along with many novel genes. Comparative expression results for terpenoid biosynthesis genes were validated for 15 unigenes by RT-PCR and qRT-PCR. Thus the coverage of this transcriptome is comprehensive enough to discover all known genes of major metabolic pathways. This transcriptome dataset can serve as important public information for gene expression, genomics and functional genomics studies in Citronella and shall act as a benchmark for future improvement of the crop.
CLIC, a tool for expanding biological pathways based on co-expression across thousands of datasets
Li, Yang; Liu, Jun S.; Mootha, Vamsi K.
2017-01-01
In recent years, there has been a huge rise in the number of publicly available transcriptional profiling datasets. These massive compendia comprise billions of measurements and provide a special opportunity to predict the function of unstudied genes based on co-expression to well-studied pathways. Such analyses can be very challenging, however, since biological pathways are modular and may exhibit co-expression only in specific contexts. To overcome these challenges we introduce CLIC, CLustering by Inferred Co-expression. CLIC accepts as input a pathway consisting of two or more genes. It then uses a Bayesian partition model to simultaneously partition the input gene set into coherent co-expressed modules (CEMs), while assigning the posterior probability for each dataset in support of each CEM. CLIC then expands each CEM by scanning the transcriptome for additional co-expressed genes, quantified by an integrated log-likelihood ratio (LLR) score weighted for each dataset. As a byproduct, CLIC automatically learns the conditions (datasets) within which a CEM is operative. We implemented CLIC using a compendium of 1774 mouse microarray datasets (28628 microarrays) or 1887 human microarray datasets (45158 microarrays). CLIC analysis reveals that of 910 canonical biological pathways, 30% consist of strongly co-expressed gene modules for which new members are predicted. For example, CLIC predicts a functional connection between protein C7orf55 (FMC1) and the mitochondrial ATP synthase complex that we have experimentally validated. CLIC is freely available at www.gene-clic.org. We anticipate that CLIC will be valuable both for revealing new components of biological pathways as well as the conditions in which they are active. PMID:28719601
Kibinge, Nelson; Ono, Naoaki; Horie, Masafumi; Sato, Tetsuo; Sugiura, Tadao; Altaf-Ul-Amin, Md; Saito, Akira; Kanaya, Shigehiko
2016-06-01
Conventionally, workflows examining transcription regulation networks from gene expression data involve distinct analytical steps. There is a need for pipelines that unify data mining and inference deduction into a single framework to enhance interpretation and hypothesis generation. We propose a workflow that merges network construction with gene expression data mining, focusing on regulation processes in the context of transcription factor driven gene regulation. The pipeline implements pathway-based modularization of expression profiles into functional units to improve biological interpretation. The integrated workflow was implemented as a web application (TransReguloNet) with functions that enable pathway visualization and comparison of transcription factor activity between sample conditions defined in the experimental design. The pipeline merges differential expression, network construction, pathway-based abstraction, clustering and visualization. The framework was applied in the analysis of actual expression datasets related to lung, breast and prostate cancer. Copyright © 2016 Elsevier Inc. All rights reserved.
Target of obstructive sleep apnea syndrome merge lung cancer: based on big data platform.
Li, Lifeng; Lu, Jingli; Xue, Wenhua; Wang, Liping; Zhai, Yunkai; Fan, Zhirui; Wu, Ge; Fan, Feifei; Li, Jieyao; Zhang, Chaoqi; Zhang, Yi; Zhao, Jie
2017-03-28
Based on our hospital database, the incidence of lung cancer diagnoses was similar in patients with obstructive sleep apnea syndrome (OSAS) and in the general hospital population; among individuals with a diagnosis of lung cancer, the presence of OSAS was associated with an increased risk of mortality. At the gene expression and network level, we revealed significant alterations of molecules related to HIF1 and metabolic pathways in hypoxia-conditioned lung cancer cells. We also observed that GBE1 and HK2 are downstream of the HIF1 pathway and important in hypoxia-conditioned lung cancer cells. Furthermore, we used publicly available datasets to validate that late-stage lung adenocarcinoma patients showed higher expression of HK2 and GBE1 than early-stage ones. In terms of prognostic features, a survival analysis revealed that the high GBE1 and HK2 expression group exhibited poorer survival in lung adenocarcinoma patients. By analyzing and integrating multiple datasets, we identify molecular convergence between hypoxia and lung cancer that reflects their clinical profiles and reveals molecular pathways involved in hypoxia-induced lung cancer progression. In conclusion, we show that OSAS severity appears to increase the risk of lung cancer mortality.
Summerfield, Taryn L.; Yu, Lianbo; Gulati, Parul; Zhang, Jie; Huang, Kun; Romero, Roberto; Kniss, Douglas A.
2011-01-01
A majority of the studies examining the molecular regulation of human labor have been conducted using single gene approaches. While the technology to produce multi-dimensional datasets is readily available, the means for facile analysis of such data are limited. The objective of this study was to develop a systems approach to infer regulatory mechanisms governing global gene expression in cytokine-challenged cells in vitro, and to apply these methods to predict gene regulatory networks (GRNs) in intrauterine tissues during term parturition. To this end, microarray analysis was applied to human amnion mesenchymal cells (AMCs) stimulated with interleukin-1β, and differentially expressed transcripts were subjected to hierarchical clustering, temporal expression profiling, and motif enrichment analysis, from which a GRN was constructed. These methods were then applied to fetal membrane specimens collected in the absence or presence of spontaneous term labor. Analysis of cytokine-responsive genes in AMCs revealed a sterile immune response signature, with promoters enriched in response elements for several inflammation-associated transcription factors. In comparison to the fetal membrane dataset, there were 34 genes commonly upregulated, many of which were part of an acute inflammation gene expression signature. Binding motifs for nuclear factor-κB were prominent in the gene interaction and regulatory networks for both datasets; however, we found little evidence to support the utilization of pathogen-associated molecular pattern (PAMP) signaling. The tissue specimens were also enriched for transcripts governed by hypoxia-inducible factor. The approach presented here provides an uncomplicated means to infer global relationships among gene clusters involved in cellular responses to labor-associated signals. PMID:21655103
Richard, Arianne C; Lyons, Paul A; Peters, James E; Biasci, Daniele; Flint, Shaun M; Lee, James C; McKinney, Eoin F; Siegel, Richard M; Smith, Kenneth G C
2014-08-04
Although numerous investigations have compared gene expression microarray platforms, preprocessing methods and batch correction algorithms using constructed spike-in or dilution datasets, there remains a paucity of studies examining the properties of microarray data using diverse biological samples. Most microarray experiments seek to identify subtle differences between samples with variable background noise, a scenario poorly represented by constructed datasets. Thus, microarray users lack important information regarding the complexities introduced in real-world experimental settings. The recent development of a multiplexed, digital technology for nucleic acid measurement enables counting of individual RNA molecules without amplification and, for the first time, permits such a study. Using a set of human leukocyte subset RNA samples, we compared previously acquired microarray expression values with RNA molecule counts determined by the nCounter Analysis System (NanoString Technologies) in selected genes. We found that gene measurements across samples correlated well between the two platforms, particularly for high-variance genes, while genes deemed unexpressed by the nCounter generally had both low expression and low variance on the microarray. Confirming previous findings from spike-in and dilution datasets, this "gold-standard" comparison demonstrated signal compression that varied dramatically by expression level and, to a lesser extent, by dataset. Most importantly, examination of three different cell types revealed that noise levels differed across tissues. Microarray measurements generally correlate with relative RNA molecule counts within optimal ranges but suffer from expression-dependent accuracy bias and precision that varies across datasets. We urge microarray users to consider expression-level effects in signal interpretation and to evaluate noise properties in each dataset independently.
GEsture: an online hand-drawing tool for gene expression pattern search.
Wang, Chunyan; Xu, Yiqing; Wang, Xuelin; Zhang, Li; Wei, Suyun; Ye, Qiaolin; Zhu, Youxiang; Yin, Hengfu; Nainwal, Manoj; Tanon-Reyes, Luis; Cheng, Feng; Yin, Tongming; Ye, Ning
2018-01-01
Gene expression profiling data provide useful information for the investigation of biological function and process. However, identifying a specific expression pattern from extensive time series gene expression data is not an easy task. Clustering, a popular method, is often used to group genes with similar expression; however, genes with a 'desirable' or 'user-defined' pattern cannot be efficiently detected by clustering methods. To address these limitations, we developed an online tool called GEsture. Users can draw a curve using a mouse instead of inputting abstract parameters of clustering methods. Taking a gene expression curve as input, GEsture searches time series datasets for genes showing similar, opposite and time-delayed expression patterns. We present three examples that illustrate the capacity of GEsture in gene hunting while following users' requirements. GEsture also provides visualization tools (such as expression pattern figures, heat maps and correlation networks) to display the search results. The outputs may provide useful information for researchers to understand the targets, function and biological processes of the genes involved.
Intra- and interspecies gene expression models for predicting drug response in canine osteosarcoma.
Fowles, Jared S; Brown, Kristen C; Hess, Ann M; Duval, Dawn L; Gustafson, Daniel L
2016-02-19
Genomics-based predictors of drug response have the potential to improve outcomes associated with cancer therapy. Osteosarcoma (OS), the most common primary bone cancer in dogs, is commonly treated with adjuvant doxorubicin or carboplatin following amputation of the affected limb. We evaluated the use of gene-expression based models built in an intra- or interspecies manner to predict chemosensitivity and treatment outcome in canine OS. Models were built and evaluated using microarray gene expression and drug sensitivity data from human and canine cancer cell lines, and canine OS tumor datasets. The "COXEN" method was utilized to filter gene signatures between human and dog datasets based on strong co-expression patterns. Models were built using linear discriminant analysis via the misclassification penalized posterior algorithm. The best doxorubicin model involved genes identified in human lines that were co-expressed and trained on canine OS tumor data, which accurately predicted clinical outcome in 73 % of dogs (p = 0.0262, binomial). The best carboplatin model utilized canine lines for gene identification and model training, with canine OS tumor data for co-expression. Dogs whose treatment matched our predictions had significantly better clinical outcomes than those that didn't (p = 0.0006, Log Rank), and this predictor significantly associated with longer disease free intervals in a Cox multivariate analysis (hazard ratio = 0.3102, p = 0.0124). Our data show that intra- and interspecies gene expression models can successfully predict response in canine OS, which may improve outcome in dogs and serve as pre-clinical validation for similar methods in human cancer research.
Zhang, Dapeng; Xiong, Huiling; Mennigen, Jan A; Popesku, Jason T; Marlatt, Vicki L; Martyniuk, Christopher J; Crump, Kate; Cossins, Andrew R; Xia, Xuhua; Trudeau, Vance L
2009-06-05
Many vertebrates, including the goldfish, exhibit seasonal reproductive rhythms, which are a result of interactions between external environmental stimuli and internal endocrine systems in the hypothalamo-pituitary-gonadal axis. While it is long believed that differential expression of neuroendocrine genes contributes to establishing seasonal reproductive rhythms, no systems-level investigation has yet been conducted. In the present study, by analyzing multiple female goldfish brain microarray datasets, we have characterized global gene expression patterns for a seasonal cycle. A core set of genes (873 genes) in the hypothalamus were identified to be differentially expressed between May, August and December, which correspond to physiologically distinct stages that are sexually mature (prespawning), sexual regression, and early gonadal redevelopment, respectively. Expression changes of these genes are also shared by another brain region, the telencephalon, as revealed by multivariate analysis. More importantly, by examining one dataset obtained from fish in October who were kept under long-daylength photoperiod (16 h) typical of the springtime breeding season (May), we observed that the expression of identified genes appears regulated by photoperiod, a major factor controlling vertebrate reproductive cyclicity. Gene ontology analysis revealed that hormone genes and genes functionally involved in G-protein coupled receptor signaling pathway and transmission of nerve impulses are significantly enriched in an expression pattern, whose transition is located between prespawning and sexually regressed stages. The existence of seasonal expression patterns was verified for several genes including isotocin, ependymin II, GABA(A) gamma2 receptor, calmodulin, and aromatase b by independent samplings of goldfish brains from six seasonal time points and real-time PCR assays. 
Using both theoretical and experimental strategies, we report for the first time global gene expression patterns throughout a breeding season which may account for dynamic neuroendocrine regulation of seasonal reproductive development.
Mennigen, Jan A.; Popesku, Jason T.; Marlatt, Vicki L.; Martyniuk, Christopher J.; Crump, Kate; Cossins, Andrew R.; Xia, Xuhua; Trudeau, Vance L.
2009-01-01
Background: Many vertebrates, including the goldfish, exhibit seasonal reproductive rhythms, which are a result of interactions between external environmental stimuli and internal endocrine systems in the hypothalamo-pituitary-gonadal axis. While it is long believed that differential expression of neuroendocrine genes contributes to establishing seasonal reproductive rhythms, no systems-level investigation has yet been conducted. Methodology/Principal Findings: In the present study, by analyzing multiple female goldfish brain microarray datasets, we have characterized global gene expression patterns for a seasonal cycle. A core set of genes (873 genes) in the hypothalamus were identified to be differentially expressed between May, August and December, which correspond to physiologically distinct stages that are sexually mature (prespawning), sexual regression, and early gonadal redevelopment, respectively. Expression changes of these genes are also shared by another brain region, the telencephalon, as revealed by multivariate analysis. More importantly, by examining one dataset obtained from fish in October who were kept under long-daylength photoperiod (16 h) typical of the springtime breeding season (May), we observed that the expression of identified genes appears regulated by photoperiod, a major factor controlling vertebrate reproductive cyclicity. Gene ontology analysis revealed that hormone genes and genes functionally involved in G-protein coupled receptor signaling pathway and transmission of nerve impulses are significantly enriched in an expression pattern, whose transition is located between prespawning and sexually regressed stages. The existence of seasonal expression patterns was verified for several genes including isotocin, ependymin II, GABA(A) gamma2 receptor, calmodulin, and aromatase b by independent samplings of goldfish brains from six seasonal time points and real-time PCR assays. 
Conclusions/Significance: Using both theoretical and experimental strategies, we report for the first time global gene expression patterns throughout a breeding season which may account for dynamic neuroendocrine regulation of seasonal reproductive development. PMID:19503831
Schachtschneider, Kyle Michael; Liu, Xiaolin; Huang, Wei; Xie, Ming; Hou, Shuisheng
2014-01-01
Lean-type Pekin duck is a commercial breed that has been obtained through long-term selection. Investigation of the differentially expressed genes in breast muscle and skin fat at different developmental stages will contribute to a comprehensive understanding of the potential mechanisms underlying the lean-type Pekin duck phenotype. In the present study, RNA-seq was performed on breast muscle and skin fat at 2, 4 and 6 weeks of age. More than 89% of the annotated duck genes were covered by our RNA-seq dataset. Thousands of differentially expressed genes, including many important genes involved in the regulation of muscle development and fat deposition, were detected by comparing expression levels between muscle and skin fat at the same time point, or within the same tissue at different time points. KEGG pathway analysis showed that the differentially expressed genes clustered significantly in many pathways related to muscle development and fat deposition, such as the MAPK signaling pathway, PPAR signaling pathway, Calcium signaling pathway, Fat digestion and absorption, and TGF-beta signaling pathway. The results presented here could provide a basis for further investigation of the mechanisms involved in muscle development and fat deposition in Pekin duck. PMID:25264787
Altered lipid metabolism in the aging kidney identified by three layered omic analysis
Braun, Fabian; Rinschen, Markus M.; Bartels, Valerie; Frommolt, Peter; Habermann, Bianca; Hoeijmakers, Jan H.J.; Schumacher, Björn; Dollé, Martijn E.T.; Müller, Roman-Ulrich; Benzing, Thomas; Schermer, Bernhard; Kurschat, Christine E.
2016-01-01
Aging-associated diseases and their comorbidities affect the life of a constantly growing proportion of the population in developed countries. At the center of these comorbidities are changes of kidney structure and function as age-related chronic kidney disease predisposes to the development of cardiovascular diseases such as stroke, myocardial infarction or heart failure. To detect molecular mechanisms involved in kidney aging, we analyzed gene expression profiles of kidneys from adult and aged wild-type mice by transcriptomic, proteomic and targeted lipidomic methodologies. Interestingly, transcriptome and proteome analyses revealed differential expression of genes primarily involved in lipid metabolism and immune response. Additional lipidomic analyses uncovered significant age-related differences in the total amount of phosphatidylethanolamines, phosphatidylcholines and sphingomyelins as well as in subspecies of phosphatidylserines and ceramides with age. By integration of these datasets we identified Aldh1a1, a key enzyme in vitamin A metabolism specifically expressed in the medullary ascending limb, as one of the most prominent upregulated proteins in old kidneys. Moreover, ceramidase Asah1 was highly expressed in aged kidneys, consistent with a decrease in ceramide C16. In summary, our data suggest that changes in lipid metabolism are involved in the process of kidney aging and in the development of chronic kidney disease. PMID:26886165
Altered lipid metabolism in the aging kidney identified by three layered omic analysis.
Braun, Fabian; Rinschen, Markus M; Bartels, Valerie; Frommolt, Peter; Habermann, Bianca; Hoeijmakers, Jan H J; Schumacher, Björn; Dollé, Martijn E T; Müller, Roman-Ulrich; Benzing, Thomas; Schermer, Bernhard; Kurschat, Christine E
2016-03-01
Aging-associated diseases and their comorbidities affect the life of a constantly growing proportion of the population in developed countries. At the center of these comorbidities are changes of kidney structure and function as age-related chronic kidney disease predisposes to the development of cardiovascular diseases such as stroke, myocardial infarction or heart failure. To detect molecular mechanisms involved in kidney aging, we analyzed gene expression profiles of kidneys from adult and aged wild-type mice by transcriptomic, proteomic and targeted lipidomic methodologies. Interestingly, transcriptome and proteome analyses revealed differential expression of genes primarily involved in lipid metabolism and immune response. Additional lipidomic analyses uncovered significant age-related differences in the total amount of phosphatidylethanolamines, phosphatidylcholines and sphingomyelins as well as in subspecies of phosphatidylserines and ceramides with age. By integration of these datasets we identified Aldh1a1, a key enzyme in vitamin A metabolism specifically expressed in the medullary ascending limb, as one of the most prominent upregulated proteins in old kidneys. Moreover, ceramidase Asah1 was highly expressed in aged kidneys, consistent with a decrease in ceramide C16. In summary, our data suggest that changes in lipid metabolism are involved in the process of kidney aging and in the development of chronic kidney disease.
Proteomic changes during intestinal cell maturation in vivo
Chang, Jinsook; Chance, Mark R.; Nicholas, Courtney; Ahmed, Naseem; Guilmeau, Sandra; Flandez, Marta; Wang, Donghai; Byun, Do-Sun; Nasser, Shannon; Albanese, Joseph M.; Corner, Georgia A.; Heerdt, Barbara G.; Wilson, Andrew J.; Augenlicht, Leonard H.; Mariadason, John M.
2008-01-01
Intestinal epithelial cells undergo progressive cell maturation as they migrate along the crypt-villus axis. To determine molecular signatures that define this process, proteins differentially expressed between the crypt and villus were identified by 2D-DIGE and MALDI-MS. Forty-six differentially expressed proteins were identified, several of which were validated by immunohistochemistry. Proteins upregulated in the villus were enriched for those involved in brush border assembly and lipid uptake, established features of differentiated intestinal epithelial cells. Multiple proteins involved in glycolysis were also upregulated in the villus, suggesting increased glycolysis is a feature of intestinal cell differentiation. Conversely, proteins involved in nucleotide metabolism, and protein processing and folding were increased in the crypt, consistent with functions associated with cell proliferation. Three novel paneth cell markers, AGR2, HSPA5 and RRBP1 were also identified. Notably, significant correlation was observed between overall proteomic changes and corresponding gene expression changes along the crypt-villus axis, indicating intestinal cell maturation is primarily regulated at the transcriptional level. This proteomic profiling analysis identified several novel proteins and functional processes differentially induced during intestinal cell maturation in vivo. Integration of proteomic, immunohistochemical, and parallel gene expression datasets demonstrate the coordinated manner in which intestinal cell maturation is regulated. PMID:18824147
Botta, C; Di Martino, M T; Ciliberto, D; Cucè, M; Correale, P; Rossi, M; Tagliaferri, P; Tassone, P
2016-12-16
Multiple myeloma (MM) is closely dependent on cross-talk between malignant plasma cells and cellular components of the inflammatory/immunosuppressive bone marrow milieu, which promotes disease progression, drug resistance, neo-angiogenesis, bone destruction and immune-impairment. We investigated the relevance of inflammatory genes in predicting disease evolution and patient survival. A bioinformatics study by Ingenuity Pathway Analysis on gene expression profiling dataset of monoclonal gammopathy of undetermined significance, smoldering and symptomatic-MM, identified inflammatory and cytokine/chemokine pathways as the most progressively affected during disease evolution. We then selected 20 candidate genes involved in B-cell inflammation and we investigated their role in predicting clinical outcome, through univariate and multivariate analyses (log-rank test, logistic regression and Cox-regression model). We defined an 8-genes signature (IL8, IL10, IL17A, CCL3, CCL5, VEGFA, EBI3 and NOS2) identifying each condition (MGUS/smoldering/symptomatic-MM) with 84% accuracy. Moreover, six genes (IFNG, IL2, LTA, CCL2, VEGFA, CCL3) were found independently correlated with patients' survival. Patients whose MM cells expressed high levels of Th1 cytokines (IFNG/LTA/IL2/CCL2) and low levels of CCL3 and VEGFA, experienced the longest survival. On these six genes, we built a prognostic risk score that was validated in three additional independent datasets. In this study, we provide proof-of-concept that inflammation has a critical role in MM patient progression and survival. The inflammatory-gene prognostic signature validated in different datasets clearly indicates novel opportunities for personalized anti-MM treatment.
Identification of ELF3 as an early transcriptional regulator of human urothelium.
Böck, Matthias; Hinley, Jennifer; Schmitt, Constanze; Wahlicht, Tom; Kramer, Stefan; Southgate, Jennifer
2014-02-15
Despite major advances in high-throughput and computational modelling techniques, understanding of the mechanisms regulating tissue specification and differentiation in higher eukaryotes, particularly man, remains limited. Microarray technology has been explored exhaustively in recent years and several standard approaches have been established to analyse the resultant datasets on a genome-wide scale. Gene expression time series offer a valuable opportunity to define temporal hierarchies and gain insight into the regulatory relationships of biological processes. However, unless datasets are exactly synchronous, time points cannot be compared directly. Here we present a data-driven analysis of regulatory elements from a microarray time series that tracked the differentiation of non-immortalised normal human urothelial (NHU) cells grown in culture. The datasets were obtained by harvesting differentiating and control cultures from finite bladder- and ureter-derived NHU cell lines at different time points using two previously validated, independent differentiation-inducing protocols. Due to the asynchronous nature of the data, a novel ranking analysis approach was adopted whereby we compared changes in the amplitude of experiment and control time series to identify common regulatory elements. Our approach offers a simple, fast and effective ranking method for genes that can be applied to other time series. The analysis identified ELF3 as a candidate transcriptional regulator involved in human urothelial cytodifferentiation. Differentiation-associated expression of ELF3 was confirmed in cell culture experiments and by immunohistochemical demonstration in situ. The importance of ELF3 in urothelial differentiation was verified by knockdown in NHU cells, which led to reduced expression of FOXA1 and GRHL3 transcription factors in response to PPARγ activation. 
The consequences of this were seen in the repressed expression of late/terminal differentiation-associated uroplakin 3a gene expression and in the compromised development and regeneration of urothelial barrier function. Copyright © 2014 Elsevier Inc. All rights reserved.
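The abstract describes ranking genes by comparing changes in the amplitude of experiment and control time series, but gives no formula. A minimal sketch, assuming amplitude means the range (maximum minus minimum) of a gene's series and that genes are scored by the difference between experiment and control amplitude, is shown below. Both definitions and all names are assumptions for illustration, not the authors' published method.

```python
def amplitude(series):
    """Range of a time series: an assumed definition of 'amplitude'."""
    return max(series) - min(series)

def rank_by_amplitude_change(experiment, control):
    """Rank genes by (experiment amplitude - control amplitude), largest first.

    `experiment` and `control` map gene name -> list of expression values.
    The two series for a gene need not share time points, because only the
    per-series range is compared. Ties are broken alphabetically.
    """
    shared = [g for g in experiment if g in control]
    scores = {g: amplitude(experiment[g]) - amplitude(control[g]) for g in shared}
    return sorted(shared, key=lambda g: (-scores[g], g))

# Toy values with placeholder gene names, invented for illustration only.
experiment = {"a": [1, 5, 9], "b": [2, 3, 2], "c": [0, 4, 1]}
control = {"a": [1, 2, 1], "b": [2, 2, 3], "c": [0, 4, 0]}
ranking = rank_by_amplitude_change(experiment, control)
```

Summarizing each series by a single amplitude avoids aligning individual time points, which is the constraint the abstract raises for asynchronous datasets.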
A gene expression estimator of intramuscular fat percentage for use in both cattle and sheep
2014-01-01
Background: The expression of genes encoding proteins involved in triacylglyceride and fatty acid synthesis and storage in cattle muscle is correlated with intramuscular fat (IMF)%. Are the same genes also correlated with IMF% in sheep muscle, and can the same set of genes be used to estimate IMF% in both species? Results: The correlation between gene expression (microarray) and IMF% in the longissimus muscle (LM) of twenty sheep was calculated. An integrated analysis of this dataset with an equivalent cattle correlation dataset and a cattle differential expression dataset was undertaken. A total of 30 genes were identified to be strongly correlated with IMF% in both cattle and sheep. The overlap of genes was highly significant: 8 of the 13 genes in the TAG gene set and 8 of the 13 genes in the FA gene set were in the top 100 and 500 genes, respectively, most correlated with IMF% in sheep, P-value = 0. Of the 30 genes, CIDEA, THRSP, ACSM1, DGAT2 and FABP4 had the highest average rank in both species. Using the data from two small groups of Brahman cattle (control and hormone growth promotant-treated [known to decrease IMF% in muscle]), 22 animals in total, the utility of a direct measure and of different estimators of IMF% (ultrasound and gene expression) to differentiate between the two groups was examined. Directly measured IMF% and IMF% estimated from ultrasound scanning could not discriminate between the two groups. However, using gene expression to estimate IMF% discriminated between the two groups. Increasing the number of genes used to estimate IMF% from one to five significantly increased the discrimination power, but increasing the number of genes to 15 resulted in little further improvement. Conclusion: We have demonstrated the utility of a comparative approach to identify robust estimators of IMF% in the LM in cattle and sheep. 
We have also demonstrated a number of approaches (potentially applicable to much smaller groups of animals than conventional methods) to using gene expression to rank animals for IMF% within a single farm/treatment, or to estimate differences in IMF% between two farms/treatments. PMID:25028604
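The cross-species comparison above ranks genes by their correlation with IMF% in each species and then looks at the average rank. A minimal sketch of that averaging step follows, assuming the per-gene correlations have already been computed; the gene names and correlation values are placeholders invented for illustration, not data from the study.

```python
def rank_by_correlation(corr):
    """Map gene -> rank, where rank 1 is the highest correlation."""
    ordered = sorted(corr, key=lambda g: (-corr[g], g))
    return {gene: position + 1 for position, gene in enumerate(ordered)}

def average_rank(corr_a, corr_b):
    """Average each shared gene's rank across two species.

    Only genes present in both inputs are ranked. Returns a list of
    (gene, mean_rank) pairs sorted from best (lowest) to worst mean rank.
    """
    shared = set(corr_a) & set(corr_b)
    rank_a = rank_by_correlation({g: corr_a[g] for g in shared})
    rank_b = rank_by_correlation({g: corr_b[g] for g in shared})
    mean_rank = {g: (rank_a[g] + rank_b[g]) / 2 for g in shared}
    return sorted(mean_rank.items(), key=lambda item: (item[1], item[0]))

# Placeholder genes and correlations with the trait (hypothetical values).
cattle = {"g1": 0.9, "g2": 0.5, "g3": 0.7, "g4": 0.1}
sheep = {"g1": 0.8, "g2": 0.6, "g3": 0.2, "g4": 0.3, "g5": 0.99}
combined = average_rank(cattle, sheep)
```

Averaging ranks rather than raw correlations keeps the two species on a common scale even when their correlation distributions differ.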
2013-10-01
experiments, a statistically significant data is not yet available. Additional experiments are needed for us to be able to draw conclu comb Figure...well-defined stress pathways, UPR and autophagy, are involved breast involution regulation. Using published gene expression array datasets from...performed involution time-course experiments using both low-dose drug interventions and an autophagy-related gene 7 (ATG7) deletion mouse model
Liu, Yan-Xia; Li, Fen-Xiang; Liu, Zhuan-Zhuan; Jia, Zhi-Rong; Zhou, Yan-He; Zhang, Hao; Yan, Hui; Zhou, Xian-Qiang; Chen, Xiao-Guang
2016-06-01
Mosquito microRNAs (miRNAs) are involved in host-virus interaction, and have been reported to be altered by dengue virus (DENV) infection in Aedes albopictus (Diptera: Culicidae). However, little is known about the molecular mechanisms by which the Aedes albopictus midgut, the first organ to interact with DENV, resists DENV. Here we used high-throughput sequencing to characterize miRNA and messenger RNA (mRNA) expression patterns in Aedes albopictus midgut in response to dengue virus serotype 2. A total of three miRNAs and 777 mRNAs were identified as differentially expressed upon DENV infection. Among the mRNAs, we identified 198 immune-related genes, 31 of which were differentially expressed. Gene Ontology and Kyoto Encyclopedia of Genes and Genomes enrichment analyses also showed that the differentially expressed immune-related genes were involved in immune response. The differential expression patterns of six immune-related genes and three miRNAs were then confirmed by real-time reverse transcription polymerase chain reaction. Furthermore, seven known miRNA-mRNA interaction pairs were identified by aligning our two datasets. These analyses of miRNA and mRNA transcriptomes provide valuable information for uncovering the DENV response genes and provide a basis for future study of the resistance mechanisms in Aedes albopictus midgut. © 2016 Institute of Zoology, Chinese Academy of Sciences.
Chen, Hao; Sun, Wei; Zhang, Xian Sheng
2013-01-01
Pollination is the first crucial step of sexual reproduction in flowering plants, and it requires communication and coordination between the pollen and the stigma. Maize (Zea mays) is a model monocot with extraordinarily long silks and a fully sequenced genome, but little is known about the mechanism of its pollen–stigma interactions. In this study, the dynamic gene expression of silks at four different stages before and after pollination was analyzed. The expression profiles of immature silks (IMS), mature silks (MS), and silks at 20 minutes and 3 hours after pollination (20MAP and 3HAP, respectively) were compared. In total, we identified 6,337 differentially expressed genes in silks (SDEG) at the four stages. Among them, the expression of 172 genes was induced upon pollination, most of which participated in RNA binding, processing and transcription, signal transduction, and lipid metabolism processes. Genes in the SDEG dataset could be divided into 12 time-course clusters according to their expression patterns. Gene Ontology (GO) enrichment analysis revealed that many genes involved in microtubule-based movement, ubiquitin-mediated protein degradation, and transport were predominantly expressed at specific stages, indicating that they might play important roles in the pollination process of maize. These results add to current knowledge about the pollination process of grasses and provide a foundation for future studies on key genes involved in the pollen–silk interaction in maize. PMID:23301084
Clusternomics: Integrative context-dependent clustering for heterogeneous datasets
Wernisch, Lorenz
2017-01-01
Integrative clustering is used to identify groups of samples by jointly analysing multiple datasets describing the same set of biological samples, such as gene expression, copy number, methylation etc. Most existing algorithms for integrative clustering assume that there is a shared consistent set of clusters across all datasets, and most of the data samples follow this structure. However in practice, the structure across heterogeneous datasets can be more varied, with clusters being joined in some datasets and separated in others. In this paper, we present a probabilistic clustering method to identify groups across datasets that do not share the same cluster structure. The proposed algorithm, Clusternomics, identifies groups of samples that share their global behaviour across heterogeneous datasets. The algorithm models clusters on the level of individual datasets, while also extracting global structure that arises from the local cluster assignments. Clusters on both the local and the global level are modelled using a hierarchical Dirichlet mixture model to identify structure on both levels. We evaluated the model both on simulated and on real-world datasets. The simulated data exemplifies datasets with varying degrees of common structure. In such a setting Clusternomics outperforms existing algorithms for integrative and consensus clustering. In a real-world application, we used the algorithm for cancer subtyping, identifying subtypes of cancer from heterogeneous datasets. We applied the algorithm to TCGA breast cancer dataset, integrating gene expression, miRNA expression, DNA methylation and proteomics. The algorithm extracted clinically meaningful clusters with significantly different survival probabilities. We also evaluated the algorithm on lung and kidney cancer TCGA datasets with high dimensionality, again showing clinically significant results and scalability of the algorithm. PMID:29036190
Clusternomics: Integrative context-dependent clustering for heterogeneous datasets.
Gabasova, Evelina; Reid, John; Wernisch, Lorenz
2017-10-01
Integrative clustering is used to identify groups of samples by jointly analysing multiple datasets describing the same set of biological samples, such as gene expression, copy number, methylation etc. Most existing algorithms for integrative clustering assume that there is a shared consistent set of clusters across all datasets, and most of the data samples follow this structure. However in practice, the structure across heterogeneous datasets can be more varied, with clusters being joined in some datasets and separated in others. In this paper, we present a probabilistic clustering method to identify groups across datasets that do not share the same cluster structure. The proposed algorithm, Clusternomics, identifies groups of samples that share their global behaviour across heterogeneous datasets. The algorithm models clusters on the level of individual datasets, while also extracting global structure that arises from the local cluster assignments. Clusters on both the local and the global level are modelled using a hierarchical Dirichlet mixture model to identify structure on both levels. We evaluated the model both on simulated and on real-world datasets. The simulated data exemplifies datasets with varying degrees of common structure. In such a setting Clusternomics outperforms existing algorithms for integrative and consensus clustering. In a real-world application, we used the algorithm for cancer subtyping, identifying subtypes of cancer from heterogeneous datasets. We applied the algorithm to TCGA breast cancer dataset, integrating gene expression, miRNA expression, DNA methylation and proteomics. The algorithm extracted clinically meaningful clusters with significantly different survival probabilities. We also evaluated the algorithm on lung and kidney cancer TCGA datasets with high dimensionality, again showing clinically significant results and scalability of the algorithm.
Paraboschi, Elvezia Maria; Cardamone, Giulia; Rimoldi, Valeria; Gemmati, Donato; Spreafico, Marta; Duga, Stefano; Soldà, Giulia; Asselta, Rosanna
2015-09-30
Abnormalities in RNA metabolism and alternative splicing (AS) are emerging as important players in complex disease phenotypes. In particular, accumulating evidence suggests the existence of pathogenic links between multiple sclerosis (MS) and altered AS, including functional studies showing that an imbalance in alternatively-spliced isoforms may contribute to disease etiology. Here, we tested whether the altered expression of AS-related genes represents an MS-specific signature. A comprehensive comparative analysis of gene expression profiles of publicly-available microarray datasets (190 MS cases, 182 controls), followed by gene-ontology enrichment analysis, highlighted a significant enrichment for differentially-expressed genes involved in RNA metabolism/AS. In detail, a total of 17 genes were found to be differentially expressed in MS in multiple datasets, with CELF1 being dysregulated in five out of seven studies. We confirmed CELF1 downregulation in MS (p=0.0015) by real-time RT-PCR on RNA extracted from blood cells of 30 cases and 30 controls. As a proof of concept, we experimentally verified the imbalance in alternatively-spliced isoforms of the NFAT5 gene, a putative CELF1 target, in MS. In conclusion, for the first time we provide evidence of a consistent dysregulation of splicing-related genes in MS and we discuss its possible implications in modulating specific AS events in MS susceptibility genes.
Hierarchical cortical transcriptome disorganization in autism.
Lombardo, Michael V; Courchesne, Eric; Lewis, Nathan E; Pramparo, Tiziano
2017-01-01
Autism spectrum disorders (ASD) are etiologically heterogeneous and complex. Functional genomics work has begun to identify a diverse array of dysregulated transcriptomic programs (e.g., synaptic, immune, cell cycle, DNA damage, WNT signaling, cortical patterning and differentiation) potentially involved in ASD brain abnormalities during childhood and adulthood. However, it remains unclear whether such diverse dysregulated pathways are independent of each other or instead reflect coordinated hierarchical systems-level pathology. Two ASD cortical transcriptome datasets were re-analyzed using consensus weighted gene co-expression network analysis (WGCNA) to identify common co-expression modules across datasets. Linear mixed-effect models and Bayesian replication statistics were used to identify replicable differentially expressed modules. Eigengene network analysis was then utilized to identify between-group differences in how co-expression modules interact and cluster into hierarchical meta-modular organization. Protein-protein interaction analyses were also used to determine whether dysregulated co-expression modules show enhanced interactions. We find replicable evidence for 10 gene co-expression modules that are differentially expressed in ASD cortex. Rather than being independent non-interacting sources of pathology, these dysregulated co-expression modules work in synergy and physically interact at the protein level. These systems-level transcriptional signals are characterized by downregulation of synaptic processes coordinated with upregulation of immune/inflammation, response to other organism, catabolism, viral processes, translation, protein targeting and localization, cell proliferation, and vasculature development. Hierarchical organization of meta-modules (clusters of highly correlated modules) is also highly affected in ASD. 
These findings highlight that dysregulation of the ASD cortical transcriptome is characterized by the dysregulation of multiple coordinated transcriptional programs producing synergistic systems-level effects that cannot be fully appreciated by studying the individual component biological processes in isolation.
Mackeh, Rafah; Boughorbel, Sabri; Chaussabel, Damien; Kino, Tomoshige
2017-01-01
The collection of large-scale datasets available in public repositories is rapidly growing and providing opportunities to identify and fill gaps in different fields of biomedical research. However, users of these datasets should be able to selectively browse datasets related to their field of interest. Here we made available a collection of transcriptome datasets related to human follicular cells from normal individuals or patients with polycystic ovary syndrome, in the process of their development, during in vitro fertilization. After RNA-seq dataset exclusion and careful selection based on study description and sample information, 12 datasets, encompassing a total of 85 unique transcriptome profiles, were identified in NCBI Gene Expression Omnibus and uploaded to the Gene Expression Browser (GXB), a web application specifically designed for interactive query and visualization of integrated large-scale data. Once the datasets were annotated in GXB, multiple sample groupings were made in order to create rank lists that allow easy data interpretation and comparison. The GXB tool also allows users to browse a single gene across multiple projects to evaluate its expression profiles in multiple biological systems/conditions in web-based customized graphical views. The curated dataset is accessible at the following link: http://ivf.gxbsidra.org/dm3/landing.gsp. PMID:28413616
Aguirre-Gamboa, Raul; Gomez-Rueda, Hugo; Martínez-Ledesma, Emmanuel; Martínez-Torteya, Antonio; Chacolla-Huaringa, Rafael; Rodriguez-Barrientos, Alberto; Tamez-Peña, José G.; Treviño, Victor
2013-01-01
Validation of multi-gene biomarkers for clinical outcomes is one of the most important issues for cancer prognosis. An important source of information for virtual validation is the high number of available cancer datasets. Nevertheless, assessing the prognostic performance of a gene expression signature across datasets is a difficult task for biologists and physicians and also time-consuming for statisticians and bioinformaticians. Therefore, to facilitate performance comparisons and validations of survival biomarkers for cancer outcomes, we developed SurvExpress, a cancer-wide gene expression database with clinical outcomes and a web-based tool that provides survival analysis and risk assessment of cancer datasets. The only input SurvExpress requires is the biomarker gene list. We generated a cancer database collecting more than 20,000 samples and 130 datasets with censored clinical information covering tumors in more than 20 tissues. We implemented a web interface to perform biomarker validation and comparisons in this database, where a multivariate survival analysis can be accomplished in about one minute. We show the utility and simplicity of SurvExpress in two biomarker applications for breast and lung cancer. Compared to other tools, SurvExpress is the largest, most versatile, and quickest free tool available. The SurvExpress website can be accessed at http://bioinformatica.mty.itesm.mx/SurvExpress (a tutorial is included). The website was implemented in JSP, JavaScript, MySQL, and R. PMID:24066126
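To make "survival analysis of censored clinical information" concrete, the sketch below computes a Kaplan-Meier survival curve in plain Python. It is only an illustration: the abstract does not say which estimators SurvExpress uses internally, and the function name, follow-up times, and event flags are hypothetical.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct observed event time.

    times  -- follow-up time for each patient
    events -- 1 if the event (e.g. death) was observed, 0 if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    exits = Counter(times)        # everyone leaving the risk set at time t
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Multiply by the conditional probability of surviving time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]       # drop both events and censored patients
    return curve

# Hypothetical follow-up times (months) and event flags (assumed data)
times = [2, 3, 3, 5, 8, 8, 9, 12]
events = [1, 1, 0, 1, 0, 1, 1, 0]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"t={t:>2}  S(t)={s:.3f}")
```

Censored patients never lower the curve directly; they only shrink the risk set used at later event times.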
A Self-Directed Method for Cell-Type Identification and Separation of Gene Expression Microarrays
Zuckerman, Neta S.; Noam, Yair; Goldsmith, Andrea J.; Lee, Peter P.
2013-01-01
Gene expression analysis is generally performed on heterogeneous tissue samples consisting of multiple cell types. Current methods developed to separate heterogeneous gene expression rely on prior knowledge of the cell-type composition and/or signatures, which are not available in most public datasets. We present a novel method to identify the cell-type composition, signatures and proportions per sample without the need for a priori information. The method was successfully tested on controlled and semi-controlled datasets and performed as accurately as current methods that do require additional information. As such, this method enables the analysis of cell-type specific gene expression using existing large pools of publicly available microarray datasets. PMID:23990767
Hierarchical Recognition Scheme for Human Facial Expression Recognition Systems
Siddiqi, Muhammad Hameed; Lee, Sungyoung; Lee, Young-Koo; Khan, Adil Mehmood; Truc, Phan Tran Ho
2013-01-01
Over the last decade, human facial expressions recognition (FER) has emerged as an important research area. Several factors make FER a challenging research problem. These include varying light conditions in training and test images; need for automatic and accurate face detection before feature extraction; and high similarity among different expressions that makes it difficult to distinguish these expressions with a high accuracy. This work implements a hierarchical linear discriminant analysis-based facial expressions recognition (HL-FER) system to tackle these problems. Unlike the previous systems, the HL-FER uses a pre-processing step to eliminate light effects, incorporates a new automatic face detection scheme, employs methods to extract both global and local features, and utilizes a HL-FER to overcome the problem of high similarity among different expressions. Unlike most of the previous works that were evaluated using a single dataset, the performance of the HL-FER is assessed using three publicly available datasets under three different experimental settings: n-fold cross validation based on subjects for each dataset separately; n-fold cross validation rule based on datasets; and, finally, a last set of experiments to assess the effectiveness of each module of the HL-FER separately. Weighted average recognition accuracy of 98.7% across three different datasets, using three classifiers, indicates the success of employing the HL-FER for human FER. PMID:24316568
Jong, Victor L; Novianti, Putri W; Roes, Kit C B; Eijkemans, Marinus J C
2014-12-01
The literature shows that classifiers perform differently across datasets and that correlations within datasets affect the performance of classifiers. The question that arises is whether the correlation structure within datasets differs significantly across diseases. In this study, we evaluated the homogeneity of correlation structures within and between datasets of six etiological disease categories: inflammatory, immune, infectious, degenerative, hereditary and acute myeloid leukemia (AML). We also assessed the effect of two filtering methods, detection call and variance filtering, on correlation structures. We downloaded microarray datasets from ArrayExpress for experiments meeting predefined criteria and ended up with 12 datasets for non-cancerous diseases and six for AML. The datasets were preprocessed by a common procedure incorporating platform-specific recommendations and the two filtering methods mentioned above. Homogeneity of correlation matrices between and within datasets of etiological diseases was assessed using Box's M statistic on permuted samples. We found that correlation structures differ significantly between datasets of the same and/or different etiological disease categories and that variance filtering eliminates more uncorrelated probesets than detection call filtering and thus renders the data highly correlated.
Desai, Ashvini; Madar, Inamul Hasan; Asangani, Amjad Hussain; Ssadh, Hussain Al; Tayubi, Iftikhar Aslam
2017-01-01
Polycystic ovary syndrome (PCOS) is an endocrine system disease that affects women aged 18 to 44, in which the women's hormones are imbalanced. Recently it has been reported to occur at an early age. Alteration of normal gene expression in PCOS has shown negative effects on long-term health. PCOS has been a responsible factor for infertility in women of reproductive age. Early diagnosis and treatment can improve the health of women suffering from PCOS. Earlier studies show a significant correlation between PCOS and insulin resistance; the current study examines the linkage of PCOS with obese and non-obese patients. Gene expression datasets (control and PCOS-affected patients) were downloaded from GEO. The datasets were normalized in R using RMA, and differentially expressed genes (DEGs) were selected on the basis of a p-value of 0.05, followed by functional annotation of the selected genes using Enrichr and DAVID. The DEGs were significantly related to PCOS with obesity and other risk factors involved in the disease. The gene enrichment analysis suggests alteration of genes and associated pathways in the case of obesity. The current study provides productive groundwork for identifying specific biomarkers for the accurate diagnosis of PCOS and efficient targets for its treatment.
Logotheti, Marianthi; Papadodima, Olga; Venizelos, Nikolaos; Chatziioannou, Aristotelis; Kolisis, Fragiskos
2013-01-01
Schizophrenia affecting almost 1% and bipolar disorder affecting almost 3%–5% of the global population constitute two severe mental disorders. The catecholaminergic and the serotonergic pathways have been proved to play an important role in the development of schizophrenia, bipolar disorder, and other related psychiatric disorders. The aim of the study was to perform and interpret the results of a comparative genomic profiling study in schizophrenic patients as well as in healthy controls and in patients with bipolar disorder and try to relate and integrate our results with an aberrant amino acid transport through cell membranes. In particular we have focused on genes and mechanisms involved in amino acid transport through cell membranes from whole genome expression profiling data. We performed bioinformatic analysis on raw data derived from four different published studies. In two studies postmortem samples from prefrontal cortices, derived from patients with bipolar disorder, schizophrenia, and control subjects, have been used. In another study we used samples from postmortem orbitofrontal cortex of bipolar subjects while the final study was performed based on raw data from a gene expression profiling dataset in the postmortem superior temporal cortex of schizophrenics. The data were downloaded from NCBI's GEO datasets. PMID:23554570
GSNFS: Gene subnetwork biomarker identification of lung cancer expression data.
Doungpan, Narumol; Engchuan, Worrawat; Chan, Jonathan H; Meechai, Asawin
2016-12-05
Gene expression has been used to identify disease gene biomarkers, but there are ongoing challenges. Single gene or gene-set biomarkers are inadequate to provide sufficient understanding of complex disease mechanisms and the relationship among those genes. Network-based methods have thus been considered for inferring the interaction within a group of genes to further study the disease mechanism. Recently, the Gene-Network-based Feature Set (GNFS), which is capable of handling case-control and multiclass expression for gene biomarker identification, has been proposed, partly taking network topology into account. However, its performance relies on a greedy search for building subnetworks and thus requires further improvement. In this work, we establish a new approach named Gene Sub-Network-based Feature Selection (GSNFS) by implementing the GNFS framework with two proposed searching and scoring algorithms, namely gene-set-based (GS) search and parent-node-based (PN) search, to identify subnetworks. An additional dataset is used to validate the results. The two proposed searching algorithms of the GSNFS method for subnetwork expansion are concerned with the degree of connectivity and the scoring scheme for building subnetworks and their topology. In each iteration of expansion, the neighbour genes of the current subnetwork whose expression data improve the overall subnetwork score are recruited. While the GS search calculates the subnetwork score using an activity score of the current subnetwork and the gene expression values of its neighbours, the PN search uses the expression value of the corresponding parent of each neighbour gene. Four lung cancer expression datasets were used for subnetwork identification. In addition, the use of pathway data and protein-protein interactions as network data, in order to consider the interactions among significant genes, is discussed.
Classification was performed to compare the performance of the identified gene subnetworks with three subnetwork identification algorithms. The two searching algorithms resulted in better classification and gene/gene-set agreement compared to the original greedy search of the GNFS method. The identified lung cancer subnetwork using the proposed searching algorithm resulted in an improvement of the cross-dataset validation and an increase in the consistency of findings between two independent datasets. The homogeneity measurement of the datasets was conducted to assess dataset compatibility in cross-dataset validation. The lung cancer dataset with higher homogeneity showed a better result when using the GS search while the dataset with low homogeneity showed a better result when using the PN search. The 10-fold cross-dataset validation on the independent lung cancer datasets showed higher classification performance of the proposed algorithms when compared with the greedy search in the original GNFS method. The proposed searching algorithms provide a higher number of genes in the subnetwork expansion step than the greedy algorithm. As a result, the performance of the subnetworks identified from the GSNFS method was improved in terms of classification performance and gene/gene-set level agreement depending on the homogeneity of the datasets used in the analysis. Some common genes obtained from the four datasets using different searching algorithms are genes known to play a role in lung cancer. The improvement of classification performance and the gene/gene-set level agreement, and the biological relevance indicated the effectiveness of the GSNFS method for gene subnetwork identification using expression data.
Genomic pathways modulated by Twist in breast cancer.
Vesuna, Farhad; Bergman, Yehudit; Raman, Venu
2017-01-13
The basic helix-loop-helix transcription factor TWIST1 (Twist) is involved in embryonic cell lineage determination and mesodermal differentiation. There is evidence to indicate that Twist expression plays a role in breast tumor formation and metastasis, but the role of Twist in dysregulating pathways that drive the metastatic cascade is unclear. Moreover, many of the genes and pathways dysregulated by Twist in cell lines and mouse models have not been validated against data obtained from larger, independent datasets of breast cancer patients. We over-expressed the human Twist gene in non-metastatic MCF-7 breast cancer cells to generate the estrogen-independent metastatic breast cancer cell line MCF-7/Twist. These cells were inoculated in the mammary fat pad of female severe compromised immunodeficient mice, which subsequently formed xenograft tumors that metastasized to the lungs. Microarray data was collected from both in vitro (MCF-7 and MCF-7/Twist cell lines) and in vivo (primary tumors and lung metastases) models of Twist expression. Our data was compared to several gene datasets of various subtypes, classes, and grades of human breast cancers. Our data establishes a Twist over-expressing mouse model of breast cancer, which metastasizes to the lung and replicates some of the ontogeny of human breast cancer progression. Gene profiling data, following Twist expression, exhibited novel metastasis driver genes as well as cellular maintenance genes that were synonymous with the metastatic process. We demonstrated that the genes and pathways altered in the transgenic cell line and metastatic animal models parallel many of the dysregulated gene pathways observed in human breast cancers. Analogous gene expression patterns were observed in both in vitro and in vivo Twist preclinical models of breast cancer metastasis and breast cancer patient datasets supporting the functional role of Twist in promoting breast cancer metastasis.
The data suggests that genetic dysregulation of Twist at the cellular level drives alterations in gene pathways in the Twist metastatic mouse model which are comparable to changes seen in human breast cancers. Lastly, we have identified novel genes and pathways that could be further investigated as targets for drugs to treat metastatic breast cancer.
Lemieux, Sébastien
2006-08-25
The identification of differentially expressed genes (DEGs) from Affymetrix GeneChip arrays is currently done by first computing expression levels from the low-level probe intensities, then deriving significance by comparing these expression levels between conditions. The proposed PL-LM (Probe-Level Linear Model) method implements a linear model applied to the probe-level data to directly estimate the treatment effect. A finite mixture of Gaussian components is then used to identify DEGs using the coefficients estimated by the linear model. This approach can readily be applied to experimental designs with or without replication. On a wholly defined dataset, the PL-LM method was able to identify 75% of the differentially expressed genes within 10% of false positives. This accuracy was achieved both using the three replicates per condition available in the dataset and using only one replicate per condition. The method achieves, on this dataset, a higher accuracy than the best set of tools identified by the authors of the dataset, and does so using only one replicate per condition.
Gupta, Sanjay K.; Dahiya, Saurabh; Lundy, Robert F.; Kumar, Ashok
2010-01-01
Background: Skeletal muscle wasting is a debilitating consequence of a large number of disease states and conditions. Tumor necrosis factor-α (TNF-α) is one of the most important muscle-wasting cytokines, elevated levels of which cause significant muscular abnormalities. However, the underpinning molecular mechanisms by which TNF-α causes skeletal muscle wasting are less well understood. Methodology/Principal Findings: We have used microarray, quantitative real-time PCR (QRT-PCR), Western blot, and bioinformatics tools to study the effects of TNF-α on various molecular pathways and gene networks in C2C12 cells (a mouse myoblastic cell line). Microarray analyses of C2C12 myotubes treated with TNF-α (10 ng/ml) for 18 h showed differential expression of a number of genes involved in distinct molecular pathways. The genes involved in nuclear factor-kappa B (NF-kappaB) signaling, 26s proteasome pathway, Notch1 signaling, and chemokine networks are the most important ones affected by TNF-α. The expression of some of the genes in the microarray dataset showed good correlation in independent QRT-PCR and Western blot assays. Analysis of TNF-treated myotubes showed that TNF-α augments the activity of both canonical and alternative NF-κB signaling pathways in myotubes. Bioinformatics analyses of the microarray dataset revealed that TNF-α affects the activity of several important pathways including those involved in oxidative stress, hepatic fibrosis, mitochondrial dysfunction, cholesterol biosynthesis, and TGF-β signaling. Furthermore, TNF-α was found to affect the gene networks related to drug metabolism, cell cycle, cancer, neurological disease, organismal injury, and abnormalities in myotubes. Conclusions: TNF-α regulates the expression of multiple genes involved in various toxic pathways which may be responsible for TNF-induced muscle loss in catabolic conditions.
Our study suggests that TNF-α activates both canonical and alternative NF-κB signaling pathways in a time-dependent manner in skeletal muscle cells. The study provides novel insight into the mechanisms of action of TNF-α in skeletal muscle cells. PMID:20967264
Mining Gene Regulatory Networks by Neural Modeling of Expression Time-Series.
Rubiolo, Mariano; Milone, Diego H; Stegmayer, Georgina
2015-01-01
Discovering gene regulatory networks from data is one of the most studied topics in recent years. Neural networks can be successfully used to infer an underlying gene network by modeling expression profiles as time series. This work proposes a novel method based on a pool of neural networks for obtaining a gene regulatory network from a gene expression dataset. They are used for modeling each possible interaction between pairs of genes in the dataset, and a set of mining rules is applied to accurately detect the subjacent relations among genes. The results obtained on artificial and real datasets confirm the method's effectiveness for discovering regulatory networks from a proper modeling of the temporal dynamics of gene expression profiles.
Kidd, Mark; Modlin, Irvin M; Drozdov, Ignat
2014-07-15
Tumor transcriptomes contain information of critical value to understanding the different capacities of a cell at both a physiological and pathological level. In terms of clinical relevance, they provide information regarding the cellular "toolbox" e.g., pathways associated with malignancy and metastasis or drug dependency. Exploration of this resource can therefore be leveraged as a translational tool to better manage and assess neoplastic behavior. The availability of public genome-wide expression datasets provides an opportunity to reassess neuroendocrine tumors at a more fundamental level. We hypothesized that stringent analysis of expression profiles as well as regulatory networks of the neoplastic cell would provide novel information that facilitates further delineation of the genomic basis of small intestinal neuroendocrine tumors. We re-analyzed two publicly available small intestinal tumor transcriptomes using stringent quality control parameters and network-based approaches and validated expression of core secretory regulatory elements e.g., CPE, PCSK1, secretogranins, including genes involved in depolarization e.g., SCN3A, as well as transcription factors associated with neurodevelopment (NKX2-2, NeuroD1, INSM1) and glucose homeostasis (APLP1). The candidate metastasis-associated transcription factor, ST18, was highly expressed (>14-fold, p < 0.004). Genes previously associated with neoplasia, CEBPA and SDHD, were decreased in expression (-1.5 - -2, p < 0.02). Genomic interrogation indicated that intestinal tumors may consist of two different subtypes, serotonin-producing neoplasms and serotonin/substance P/tachykinin lesions. QPCR validation in an independent dataset (n = 13 neuroendocrine tumors) confirmed up-regulated expression of 87% of genes (13/15).
An integrated cellular transcriptomic analysis of small intestinal neuroendocrine tumors identified that they are regulated at a developmental level, have key activation of hypoxic pathways (a known regulator of malignant stem cell phenotypes) as well as activation of genes involved in apoptosis and proliferation. Further refinement of these analyses by RNAseq studies of large-scale databases will enable definition of individual master regulators and facilitate the development of novel tissue and blood-based tools to better understand, diagnose, and treat tumors.
Soh, Jung; Turinsky, Andrei L; Trinh, Quang M; Chang, Jasmine; Sabhaney, Ajay; Dong, Xiaoli; Gordon, Paul Mk; Janzen, Ryan Pw; Hau, David; Xia, Jianguo; Wishart, David S; Sensen, Christoph W
2009-01-01
We have developed a computational framework for spatiotemporal integration of molecular and anatomical datasets in a virtual reality environment. Using two case studies involving gene expression data and pharmacokinetic data, respectively, we demonstrate how existing knowledge bases for molecular data can be semantically mapped onto a standardized anatomical context of the human body. Our data mapping methodology uses ontological representations of heterogeneous biomedical datasets and an ontology reasoner to create complex semantic descriptions of biomedical processes. This framework provides a means to systematically combine an increasing amount of biomedical imaging and numerical data into spatiotemporally coherent graphical representations. Our work enables medical researchers with different expertise to simulate complex phenomena visually and to develop insights through the use of shared data, thus paving the way for pathological inference, developmental pattern discovery and biomedical hypothesis testing.
Moretti, Stefano; van Leeuwen, Danitsja; Gmuender, Hans; Bonassi, Stefano; van Delft, Joost; Kleinjans, Jos; Patrone, Fioravante; Merlo, Domenico Franco
2008-01-01
Background: In gene expression analysis, statistical tests for differential gene expression provide lists of candidate genes having, individually, a sufficiently low p-value. However, the interpretation of each single p-value within complex systems involving several interacting genes is problematic. In parallel, in the last sixty years, game theory has been applied to political and social problems to assess the power of interacting agents in forcing a decision and, more recently, to represent the relevance of genes in response to certain conditions. Results: In this paper we introduce a Bootstrap procedure to test the null hypothesis that each gene has the same relevance between two conditions, where the relevance is represented by the Shapley value of a particular coalitional game defined on a microarray data-set. This method, which is called Comparative Analysis of Shapley value (CASh for short), is applied to data concerning the gene expression in children differentially exposed to air pollution. The results provided by CASh are compared with the results from a parametric statistical test for testing differential gene expression. Both lists of genes provided by CASh and t-test are informative enough to discriminate exposed subjects on the basis of their gene expression profiles. While many genes are selected in common by CASh and the parametric test, it turns out that the biological interpretation of the differences between these two selections is more interesting, suggesting a different interpretation of the main biological pathways in gene expression regulation for exposed individuals. A simulation study suggests that CASh offers more power than t-test for the detection of differential gene expression variability. Conclusion: CASh is successfully applied to gene expression analysis of a data-set where the joint expression behavior of genes may be critical to characterize the expression response to air pollution.
We demonstrate a synergistic effect between coalitional games and statistics that resulted in a selection of genes with a potential impact in the regulation of complex pathways. PMID:18764936
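As a rough illustration of what "the Shapley value of a coalitional game defined on a microarray data-set" can mean, the sketch below computes exact Shapley values by averaging each gene's marginal contribution over all orderings. The abstract does not define the game, so the characteristic function here (the fraction of samples whose abnormally expressed genes all lie inside the coalition) is an assumption, and the gene names and sample sets are hypothetical. The bootstrap test that CASh adds on top is not shown.

```python
from itertools import permutations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = set()
        before = value(frozenset(coalition))
        for p in order:
            coalition.add(p)
            after = value(frozenset(coalition))
            totals[p] += after - before
            before = after
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in totals.items()}

# Hypothetical data: for each sample, the genes judged abnormally expressed.
abnormal_sets = [{"g1"}, {"g1", "g2"}, {"g2", "g3"}, {"g1", "g2", "g3"}]

def game(coalition):
    # Assumed game: share of samples fully "explained" by the coalition.
    return sum(s <= coalition for s in abnormal_sets) / len(abnormal_sets)

phi = shapley_values(["g1", "g2", "g3"], game)
for gene, score in phi.items():
    print(gene, round(score, 4))
```

Enumerating all orderings costs n! evaluations, so this form is only practical for a handful of genes; it is meant to show the definition, not to scale.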
Proteomic dataset of the sea urchin Paracentrotus lividus adhesive organs and secreted adhesive
Lebesgue, Nicolas; da Costa, Gonçalo; Ribeiro, Raquel Mesquita; Ribeiro-Silva, Cristina; Martins, Gabriel G.; Matranga, Valeria; Scholten, Arjen; Cordeiro, Carlos; Heck, Albert J.R.; Santos, Romana
2016-01-01
Sea urchins have specialized adhesive organs called tube feet, which mediate strong but reversible adhesion. Tube feet are composed of a disc, producing adhesive and de-adhesive secretions for substratum attachment, and a stem for movement. After detachment the secreted adhesive remains bound to the substratum as a footprint. Recently, a label-free quantitative proteomic approach coupled with the latest mass-spectrometry technology was used to analyze the differential proteome of the Paracentrotus lividus adhesive organ, comparing protein expression levels in the tube feet adhesive part (the disc) versus the non-adhesive part (the stem), and also to profile the proteome of the secreted adhesive (glue). This data article contains complementary figures and results related to the research article “Deciphering the molecular mechanisms underlying sea urchin reversible adhesion: a quantitative proteomics approach” (Lebesgue et al., 2016) [1]. Here we provide a dataset of 1384 non-redundant proteins, their fragmented peptides and expression levels, resulting from the analysis of the tube feet differential proteome. Of these, 163 highly over-expressed tube feet disc proteins (>3-fold), likely representing the most relevant proteins for sea urchin reversible adhesion, were further annotated in order to determine their potential functions. In addition, we provide a dataset of 611 non-redundant proteins identified in the secreted adhesive proteome, as well as their functional annotation and grouping in 5 major protein groups related to adhesive exocytosis and microbial protection. This list was further analyzed to identify the most abundant protein groups and pinpoint putative adhesive proteins, such as Nectin, the most abundant adhesive protein in sea urchin glue. The obtained data uncover the key proteins involved in sea urchin reversible adhesion, representing a step forward in the development of new wet-effective bio-inspired adhesives. PMID:27182547
Liu, Guiyou; Zhang, Fang; Jiang, Yongshuai; Hu, Yang; Gong, Zhongying; Liu, Shoufeng; Chen, Xiuju; Jiang, Qinghua; Hao, Junwei
2017-02-01
Much effort has been expended on identifying the genetic determinants of multiple sclerosis (MS). Existing large-scale genome-wide association study (GWAS) datasets provide strong support for using pathway and network-based analysis methods to investigate the mechanisms underlying MS. However, no shared genetic pathways have been identified to date. We hypothesize that shared genetic pathways may indeed exist in different MS-GWAS datasets. Here, we report results from a three-stage analysis of GWAS and expression datasets. In stage 1, we conducted multiple pathway analyses of two MS-GWAS datasets. In stage 2, we performed a candidate pathway analysis of the large-scale MS-GWAS dataset. In stage 3, we performed a pathway analysis using the dysregulated MS gene list from seven human MS case-control expression datasets. In stage 1, we identified 15 shared pathways. In stage 2, we successfully replicated 14 of these 15 significant pathways. In stage 3, we found that dysregulated MS genes were significantly enriched in 10 of 15 MS risk pathways identified in stages 1 and 2. We report shared genetic pathways in different MS-GWAS datasets and highlight some new MS risk pathways. Our findings provide new insights on the genetic determinants of MS.
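The stage 3 question, whether dysregulated genes are over-represented in a pathway, is often framed as a hypergeometric tail probability. The sketch below shows that calculation. It is an assumption made for illustration: the abstract does not state which enrichment test the authors used, and the function name and counts are hypothetical.

```python
from math import comb

def enrichment_p_value(universe, pathway, hits, overlap):
    """P(X >= overlap) when `hits` genes are drawn without replacement
    from `universe` genes, `pathway` of which belong to the pathway."""
    upper = min(pathway, hits)
    total = comb(universe, hits)
    tail = sum(
        comb(pathway, k) * comb(universe - pathway, hits - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# Hypothetical counts: 20 genes measured, 5 in the pathway,
# 4 dysregulated genes, 3 of which fall in the pathway.
p = enrichment_p_value(universe=20, pathway=5, hits=4, overlap=3)
print(f"enrichment p-value: {p:.4f}")
```

With many pathways tested, the resulting p-values would normally be corrected for multiple testing before any pathway is called enriched.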
Complex nature of SNP genotype effects on gene expression in primary human leucocytes.
Heap, Graham A; Trynka, Gosia; Jansen, Ritsert C; Bruinenberg, Marcel; Swertz, Morris A; Dinesen, Lotte C; Hunt, Karen A; Wijmenga, Cisca; Vanheel, David A; Franke, Lude
2009-01-07
Genome wide association studies have been hugely successful in identifying disease risk variants, yet most variants do not lead to coding changes and how variants influence biological function is usually unknown. We correlated gene expression and genetic variation in untouched primary leucocytes (n = 110) from individuals with celiac disease - a common condition with multiple risk variants identified. We compared our observations with an EBV-transformed HapMap B cell line dataset (n = 90), and performed a meta-analysis to increase power to detect non-tissue specific effects. In celiac peripheral blood, 2,315 SNP variants influenced gene expression at 765 different transcripts (< 250 kb from SNP, at FDR = 0.05, cis expression quantitative trait loci, eQTLs). 135 of the detected SNP-probe effects (reflecting 51 unique probes) were also detected in a HapMap B cell line published dataset, all with effects in the same allelic direction. Overall gene expression differences within the two datasets predominantly explain the limited overlap in observed cis-eQTLs. Celiac associated risk variants from two regions, containing genes IL18RAP and CCR3, showed significant cis genotype-expression correlations in the peripheral blood but not in the B cell line datasets. We identified 14 genes where a SNP affected the expression of different probes within the same gene, but in opposite allelic directions. By incorporating genetic variation in co-expression analyses, functional relationships between genes can be more significantly detected. In conclusion, the complex nature of genotypic effects in human populations makes the use of a relevant tissue, large datasets, and analysis of different exons essential to enable the identification of the function for many genetic risk variants in common diseases.
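The two filters quoted in the abstract above (a cis window of 250 kb around each SNP and an FDR of 0.05) can be illustrated with a minimal sketch. The abstract does not say which FDR procedure was used; the Benjamini-Hochberg step-up rule below is an assumption, and the function names `is_cis` and `benjamini_hochberg` are hypothetical.

```python
import numpy as np

def is_cis(snp_pos, probe_pos, window=250_000):
    """True when the probe lies within `window` bp of the SNP (same chromosome assumed)."""
    return abs(snp_pos - probe_pos) < window

def benjamini_hochberg(p_values, fdr=0.05):
    """Boolean mask of tests declared significant by the BH step-up rule.

    Assumption: BH is used only as a common way to control the FDR;
    the abstract does not name the procedure.
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                      # indices of p-values, smallest first
    thresholds = fdr * np.arange(1, m + 1) / m  # rank-dependent cut-offs
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()        # largest rank that passes
        reject[order[:k + 1]] = True           # reject everything up to that rank
    return reject
```

In practice one would first keep only SNP-probe pairs for which `is_cis` holds and then apply `benjamini_hochberg` to the p-values of the remaining pairs.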
Genomics of apicomplexan parasites.
Swapna, Lakshmipuram Seshadri; Parkinson, John
2017-06-01
The increasing prevalence of infections involving intracellular apicomplexan parasites such as Plasmodium, Toxoplasma, and Cryptosporidium (the causative agents of malaria, toxoplasmosis, and cryptosporidiosis, respectively) represents a significant global healthcare burden. Despite their significance, few treatments are available, a situation that is likely to deteriorate with the emergence of new resistant strains of parasites. To lay the foundation for programs of drug discovery and vaccine development, genome sequences for many of these organisms have been generated, together with large-scale expression and proteomic datasets. Comparative analyses of these datasets are beginning to identify the molecular innovations supporting both conserved processes mediating fundamental roles in parasite survival and persistence, as well as lineage-specific adaptations associated with divergent life-cycle strategies. The challenge is how best to exploit these data to derive insights into parasite virulence and identify those genes representing the most amenable targets. In this review, we outline genomic datasets currently available for apicomplexans and discuss biological insights that have emerged as a consequence of their analysis. Of particular interest are systems-based resources, focusing on areas of metabolism and host invasion that are opening up opportunities for discovering new therapeutic targets.
Chang, Chia-Ming; Yang, Yi-Ping; Chuang, Jen-Hua; Chuang, Chi-Mu; Lin, Tzu-Wei; Wang, Peng-Hui; Yu, Mu-Hsien
2017-01-01
The clinical characteristics of clear cell carcinoma (CCC) and endometrioid carcinoma (EC) are concomitant with endometriosis (ES), which leads to the postulation of malignant transformation of ES to endometriosis-associated ovarian carcinoma (EAOC). Different deregulated functional areas were proposed accounting for the pathogenesis of EAOC transformation, and there is still a lack of a data-driven analysis with the accumulated experimental data in publicly-available databases to incorporate the deregulated functions involved in the malignant transformation of EAOC. We used the microarray gene expression datasets of ES, CCC and EC downloaded from the National Center for Biotechnology Information Gene Expression Omnibus (NCBI GEO) database. Then, we investigated the pathogenesis of EAOC by a data-driven, function-based analytic model with the quantified molecular functions defined by 1454 Gene Ontology (GO) term gene sets. This model converts the gene expression profiles to the functionome consisting of 1454 quantified GO functions, and then, the key functions involving the malignant transformation of EAOC can be extracted by a series of filters. Our results demonstrate that the deregulated oxidoreductase activity, metabolism, hormone activity, inflammatory response, innate immune response and cell-cell signaling play the key roles in the malignant transformation of EAOC. These results provide the evidence supporting the specific molecular pathways involved in the malignant transformation of EAOC. PMID:29113136
The immune gene repertoire of an important viral reservoir, the Australian black flying fox.
Papenfuss, Anthony T; Baker, Michelle L; Feng, Zhi-Ping; Tachedjian, Mary; Crameri, Gary; Cowled, Chris; Ng, Justin; Janardhana, Vijaya; Field, Hume E; Wang, Lin-Fa
2012-06-20
Bats are the natural reservoir host for a range of emerging and re-emerging viruses, including SARS-like coronaviruses, Ebola viruses, henipaviruses and Rabies viruses. However, the mechanisms responsible for the control of viral replication in bats are not understood and there is little information available on any aspect of antiviral immunity in bats. Massively parallel sequencing of the bat transcriptome provides the opportunity for rapid gene discovery. Although the genomes of one megabat and one microbat have now been sequenced to low coverage, no transcriptomic datasets have been reported from any bat species. In this study, we describe the immune transcriptome of the Australian flying fox, Pteropus alecto, providing an important resource for identification of genes involved in a range of activities including antiviral immunity. Towards understanding the adaptations that have allowed bats to coexist with viruses, we have de novo assembled transcriptome sequence from immune tissues and stimulated cells from P. alecto. We identified about 18,600 genes involved in a broad range of activities with the most highly expressed genes involved in cell growth and maintenance, enzyme activity, cellular components and metabolism and energy pathways. 3.5% of the bat transcribed genes corresponded to immune genes and a total of about 500 immune genes were identified, providing an overview of both innate and adaptive immunity. A small proportion of transcripts found no match with annotated sequences in any of the public databases and may represent bat-specific transcripts. This study represents the first reported bat transcriptome dataset and provides a survey of expressed bat genes that complement existing bat genomic data. In addition, these data provide insight into genes relevant to the antiviral responses of bats, and form a basis for examining the roles of these molecules in immune response to viral infection.
Emmanuel, Catherine; Gava, Natalie; Kennedy, Catherine; Balleine, Rosemary L.; Sharma, Raghwa; Wain, Gerard; Brand, Alison; Hogg, Russell; Etemadmoghadam, Dariush; George, Joshy; Birrer, Michael J.; Clarke, Christine L.; Chenevix-Trench, Georgia; Bowtell, David D. L.; Harnett, Paul R.; deFazio, Anna
2011-01-01
Molecular events leading to epithelial ovarian cancer are poorly understood but ovulatory hormones and a high number of life-time ovulations with concomitant proliferation, apoptosis, and inflammation, increase risk. We identified genes that are regulated during the estrous cycle in murine ovarian surface epithelium and analysed these profiles to identify genes dysregulated in human ovarian cancer, using publicly available datasets. We identified 338 genes that are regulated in murine ovarian surface epithelium during the estrous cycle and dysregulated in ovarian cancer. Six of seven candidates selected for immunohistochemical validation were expressed in serous ovarian cancer, inclusion cysts, ovarian surface epithelium and in fallopian tube epithelium. Most were overexpressed in ovarian cancer compared with ovarian surface epithelium and/or inclusion cysts (EpCAM, EZH2, BIRC5) although BIRC5 and EZH2 were expressed as highly in fallopian tube epithelium as in ovarian cancer. We prioritised the 338 genes for those likely to be important for ovarian cancer development by in silico analyses of copy number aberration and mutation using publicly available datasets and identified genes with established roles in ovarian cancer as well as novel genes for which we have evidence for involvement in ovarian cancer. Chromosome segregation emerged as an important process in which genes from our list of 338 were over-represented including two (BUB1, NCAPD2) for which there is evidence of amplification and mutation. NUAK2, upregulated in ovarian surface epithelium in proestrus and predicted to have a driver mutation in ovarian cancer, was examined in a larger cohort of serous ovarian cancer where patients with lower NUAK2 expression had shorter overall survival.
In conclusion, defining genes that are activated in normal epithelium in the course of ovulation that are also dysregulated in cancer has identified a number of pathways and novel candidate genes that may contribute to the development of ovarian cancer. PMID:21423607
Caracciolo, Daniele; Agnelli, Luca; Neri, Antonino; Walker, Brian A.; Morgan, Gareth J.; Cannataro, Mario
2015-01-01
Multiple Myeloma (MM) is a malignancy characterized by the hyperdiploid (HD-MM) and the non-hyperdiploid (nHD-MM) subtypes. To shed light on the molecular architecture of these subtypes, we used a novel integromics approach. Using annotated MM patient mRNA/microRNA (miRNA) datasets, we investigated mRNA and miRNA profiles in relation to changes in the expression of transcriptional regulators. We found that HD-MM displays specific gene and miRNA expression profiles, involving the Signal Transducer and Activator of Transcription (STAT)3 pathway as well as the Transforming Growth Factor–beta (TGFβ) and the transcription regulator Nuclear Protein-1 (NUPR1). Our data define specific molecular features of HD-MM that may translate into the identification of novel relevant druggable targets. PMID:26056083
Analyzing gene expression data in mice with the Neuro Behavior Ontology.
Hoehndorf, Robert; Hancock, John M; Hardy, Nigel W; Mallon, Ann-Marie; Schofield, Paul N; Gkoutos, Georgios V
2014-02-01
We have applied the Neuro Behavior Ontology (NBO), an ontology for the annotation of behavioral gene functions and behavioral phenotypes, to the annotation of more than 1,000 genes in the mouse that are known to play a role in behavior. These annotations can be explored by researchers interested in genes involved in particular behaviors and used computationally to provide insights into the behavioral phenotypes resulting from differences in gene expression. We developed the OntoFUNC tool and have applied it to enrichment analyses over the NBO to provide high-level behavioral interpretations of gene expression datasets. The resulting increase in the number of gene annotations facilitates the identification of behavioral or neurologic processes by assisting the formulation of hypotheses about the relationships between gene, processes, and phenotypic manifestations resulting from behavioral observations.
Salem, Saeed; Ozcaglar, Cagri
2014-01-01
Advances in genomic technologies have enabled the accumulation of vast amounts of genomic data, including gene expression data for multiple species under various biological and environmental conditions. Integration of these gene expression datasets is a promising strategy to alleviate the challenges of protein functional annotation and biological module discovery based on a single gene expression dataset, which suffers from spurious coexpression. We propose a joint mining algorithm that constructs a weighted hybrid similarity graph whose nodes are the coexpression links. The weight of an edge between two coexpression links in this hybrid graph is a linear combination of the topological similarities and co-appearance similarities of the corresponding two coexpression links. Clustering the weighted hybrid similarity graph yields recurrent coexpression link clusters (modules). Experimental results on Human gene expression datasets show that the reported modules are functionally homogeneous as evidenced by their enrichment with biological process GO terms and KEGG pathways.
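The edge weighting described in the abstract above can be sketched as follows. The abstract does not define the two similarity measures, so this sketch assumes a Jaccard index for both: over the genes of the two links for the topological term, and over the sets of datasets in which each link appears for the co-appearance term. The mixing parameter `alpha` and all function names are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index of two collections; 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def hybrid_weight(link_a, link_b, appearances, alpha=0.5):
    """Linear combination of topological and co-appearance similarity.

    link_a, link_b: frozensets holding the two genes of each coexpression link.
    appearances:    dict mapping each link to the set of dataset ids it occurs in.
    Both similarity definitions are assumptions made for illustration.
    """
    topological = jaccard(link_a, link_b)
    co_appearance = jaccard(appearances[link_a], appearances[link_b])
    return alpha * topological + (1 - alpha) * co_appearance

def build_hybrid_graph(appearances, alpha=0.5):
    """Return {(link_a, link_b): weight} for every pair with non-zero weight."""
    edges = {}
    links = sorted(appearances, key=sorted)  # deterministic ordering of links
    for a, b in combinations(links, 2):
        weight = hybrid_weight(a, b, appearances, alpha)
        if weight > 0:
            edges[(a, b)] = weight
    return edges
```

The resulting weighted edge list can be passed to any graph clustering routine; the choice of clustering method is left open here because the abstract does not name one.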
Li, Qi; Jia, Hongmei; Li, Haowen; Dong, Chengya; Wang, Yajie; Zou, Zhongmei
2016-11-01
Glioblastoma multiforme (GBM) is the most common brain malignancy. Long non-coding RNAs (lncRNAs) are aberrantly expressed in many cancers and are involved in their cell proliferation, apoptosis, angiogenesis, and invasion. The functional roles of lncRNAs in GBM are less known. We analyzed a cohort of exon microarray datasets from The Cancer Genome Atlas. The differentially expressed lncRNAs and mRNAs were used to construct a lncRNA-mRNA co-expression network. Probable functions for lncRNAs were predicted according to the lncRNA-mRNA network and genomic adjacency by GO and pathway analysis. The expression of lncRNAs and mRNAs in GBM tissues versus normal brain tissues was examined by quantitative reverse transcription polymerase chain reaction. A total of 398 lncRNAs and 1995 mRNAs were identified as differentially expressed in GBM. Probable functional roles for 98 lncRNAs were involved in 30 pathways and 32 gene functions related to tumorigenesis, development, and metastasis. The identified sets of key lncRNAs specific to GBM were subsequently verified by experiment in GBM tissues. Our reports predict the biological functions of a multitude of lncRNAs in GBM that could be potential diagnostic and prognostic biomarkers as well as therapeutic targets. Moreover, our research provides a road map for the identification and analysis of lncRNAs in tumors.
Zhou, Lei-Lei; Xu, Xiao-Yue; Ni, Jie; Zhao, Xia; Zhou, Jian-Wei; Feng, Ji-Feng
2018-06-01
Due to the low incidence and the heterogeneity of subtypes, the biological process of T-cell lymphomas is largely unknown. Although many genes have been detected in T-cell lymphomas, the role of these genes in the biological process of T-cell lymphomas was not further analyzed. Two qualified datasets were downloaded from the Gene Expression Omnibus database. The biological functions of differentially expressed genes were evaluated by gene ontology enrichment and KEGG pathway analysis. The network for intersection genes was constructed with the Cytoscape v3.0 software. Kaplan-Meier survival curves and the log-rank test were employed to assess the association between differentially expressed genes and clinical characteristics. The intersection mRNAs were shown to be associated with fundamental processes of T-cell lymphoma cells. These intersection mRNAs were involved in the activation of some cancer-related pathways, including the PI3K/AKT, Ras, JAK-STAT, and NF-kappa B signaling pathways. PDGFRA, CXCL12, and CCL19 were the most significant central genes in the signal-net analysis. The results of survival analysis are not entirely credible. Our findings uncovered aberrantly expressed genes and a complex RNA signal network in T-cell lymphomas and indicated cancer-related pathways involved in disease initiation and progression, providing a new insight for biotargeted therapy in T-cell lymphomas.
Cancer Detection in Microarray Data Using a Modified Cat Swarm Optimization Clustering Approach
M, Pandi; R, Balamurugan; N, Sadhasivam
2017-12-29
Objective: A better understanding of functional genomics can be obtained by extracting patterns hidden in gene expression data. This could have paramount implications for cancer diagnosis, gene treatments and other domains. Clustering may reveal natural structures and identify interesting patterns in underlying data. The main objective of this research was to derive a heuristic approach to detection of highly co-expressed genes related to cancer from gene expression data with minimum Mean Squared Error (MSE). Methods: A modified CSO algorithm using Harmony Search (MCSO-HS) for clustering cancer gene expression data was applied. Experiment results are analyzed using two cancer gene expression benchmark datasets, namely for leukaemia and for breast cancer. Result: The results indicated MCSO-HS to be better than HS and CSO by 13% and 9%, respectively, with the leukaemia dataset. For the breast cancer dataset, the improvement was 22% and 17%, respectively, in terms of MSE. Conclusion: The results showed MCSO-HS to outperform HS and CSO with both benchmark datasets. To validate the clustering results, this work was tested with internal and external cluster validation indices. This work also points to biological validation of clusters with gene ontology in terms of function, process and component.
Haram, Kerstyn M; Peltier, Heidi J; Lu, Bin; Bhasin, Manoj; Otu, Hasan H; Choy, Bob; Regan, Meredith; Libermann, Towia A; Latham, Gary J; Sanda, Martin G; Arredouani, Mohamed S
2008-10-01
Translation of preclinical studies into effective human cancer therapy is hampered by the lack of defined molecular expression patterns in mouse models that correspond to the human counterpart. We sought to generate an open source TRAMP mouse microarray dataset and to use this array to identify differentially expressed genes from human prostate cancer (PCa) that have concordant expression in TRAMP tumors, and thereby represent lead targets for preclinical therapy development. We performed microarrays on total RNA extracted and amplified from eight TRAMP tumors and nine normal prostates. A subset of differentially expressed genes was validated by QRT-PCR. Differentially expressed TRAMP genes were analyzed for concordant expression in publicly available human prostate array datasets and a subset of resulting genes was analyzed by QRT-PCR. Cross-referencing differentially expressed TRAMP genes to public human prostate array datasets revealed 66 genes with concordant expression in mouse and human PCa; 56 between metastases and normal and 10 between primary tumor and normal tissues. Of these 10 genes, two, Sox4 and Tubb2a, were validated by QRT-PCR. Our analysis also revealed various dysregulations in major biologic pathways in the TRAMP prostates. We report a TRAMP microarray dataset of which a gene subset was validated by QRT-PCR with expression patterns consistent with previous gene-specific TRAMP studies. Concordance analysis between TRAMP and human PCa associated genes supports the utility of the model and suggests several novel molecular targets for preclinical therapy.
Polyester: simulating RNA-seq datasets with differential transcript expression.
Frazee, Alyssa C; Jaffe, Andrew E; Langmead, Ben; Leek, Jeffrey T
2015-09-01
Statistical methods development for differential expression analysis of RNA sequencing (RNA-seq) requires software tools to assess accuracy and error rate control. Since true differential expression status is often unknown in experimental datasets, artificially constructed datasets must be utilized, either by generating costly spike-in experiments or by simulating RNA-seq data. Polyester is an R package designed to simulate RNA-seq data, beginning with an experimental design and ending with collections of RNA-seq reads. Its main advantage is the ability to simulate reads indicating isoform-level differential expression across biological replicates for a variety of experimental designs. Data generated by Polyester is a reasonable approximation to real RNA-seq data and standard differential expression workflows can recover differential expression set in the simulation by the user. Polyester is freely available from Bioconductor (http://bioconductor.org/).
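The idea of simulating data with a known differential expression status can be illustrated at the count level. The sketch below is not the Polyester API and does not generate reads; it is a simplified stand-in that draws negative binomial counts for two groups, with user-set fold changes applied to the second group. The function name, the single shared `dispersion` value and the two-group design are all assumptions.

```python
import numpy as np

def simulate_counts(base_means, fold_changes, n_per_group=3, dispersion=0.1, seed=0):
    """Simulate a transcripts x samples count matrix for a two-group design.

    base_means:   expected count of each transcript in group 1.
    fold_changes: multiplier applied to each transcript's mean in group 2
                  (1.0 means no differential expression).
    Counts follow a negative binomial with variance mu + dispersion * mu**2.
    """
    rng = np.random.default_rng(seed)
    base_means = np.asarray(base_means, dtype=float)
    fold_changes = np.asarray(fold_changes, dtype=float)
    size = 1.0 / dispersion  # NB "number of successes" parameter
    groups = []
    for mu in (base_means, base_means * fold_changes):
        p = size / (size + mu)  # success probability that gives mean mu
        counts = rng.negative_binomial(size, p[:, None], size=(mu.size, n_per_group))
        groups.append(counts)
    return np.hstack(groups)  # group 1 columns first, then group 2
```

Because the fold changes are set by the user, the output of a differential expression workflow run on such a matrix can be compared against the known truth.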
Kurscheid, Sebastian; Bady, Pierre; Sciuscio, Davide; Samarzija, Ivana; Shay, Tal; Vassallo, Irene; Van Criekinge, Wim; Domany, Eytan; Stupp, Roger; Delorenzi, Mauro; Hegi, Monika
2014-01-01
We previously reported a stem cell related HOX gene signature associated with resistance to chemo-radiotherapy (TMZ/RT→TMZ) in glioblastoma. However, underlying mechanisms triggering overexpression remain mostly elusive. Interestingly, HOX genes are neither involved in the developing brain, nor expressed in normal brain, suggestive of an acquired gene expression signature during gliomagenesis. HOXA genes are located on CHR 7 that displays trisomy in most glioblastoma which strongly impacts gene expression on this chromosome, modulated by local regulatory elements. Furthermore we observed more pronounced DNA methylation across the HOXA locus as compared to non-tumoral brain (Human methylation 450K BeadChip Illumina; 59 glioblastoma, 5 non-tumoral brain samples). CpG probes annotated for HOX-signature genes, contributing most to the variability, served as input into the analysis of DNA methylation and expression to identify key regulatory regions. The structural similarity of the observed correlation matrices between DNA methylation and gene expression in our cohort and an independent data-set from TCGA (106 glioblastoma) was remarkable (RV-coefficient, 0.84; p-value < 0.0001). We identified a CpG located in the promoter region of the HOXA10 locus exerting the strongest mean negative correlation between methylation and expression of the whole HOX-signature. Applying this analysis the same CpG emerged in the external set. We then determined the contribution of both gene copy aberration (CNA) and methylation at the selected probe to explain expression of the HOX-signature using a linear model. Statistically significant results suggested an additive effect between gene dosage and methylation at the key CpG identified. Similarly, such an additive effect was also observed in the external data-set.
Taken together, we hypothesize that overexpression of the stem-cell related HOX signature is triggered by gain of trisomy 7 and escape from compensatory DNA methylation at positions controlling the effect of enhanced gene dose on expression.
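The linear model mentioned in the abstract above, in which gene dosage and methylation contribute additively to expression, can be sketched as an ordinary least-squares fit. The abstract gives no model specification, so the form expression = b0 + b1 * copy_number + b2 * methylation, the absence of an interaction term and the function name are assumptions.

```python
import numpy as np

def fit_additive_model(expression, copy_number, methylation):
    """Least-squares fit of expression ~ intercept + copy_number + methylation.

    An additive model has no copy_number x methylation interaction term;
    each predictor shifts the expected expression independently.
    """
    expression = np.asarray(expression, dtype=float)
    design = np.column_stack([
        np.ones_like(expression),                 # intercept column
        np.asarray(copy_number, dtype=float),
        np.asarray(methylation, dtype=float),
    ])
    coefficients, *_ = np.linalg.lstsq(design, expression, rcond=None)
    names = ["intercept", "copy_number", "methylation"]
    return dict(zip(names, coefficients))
```

Under the hypothesis stated in the abstract, one would expect a positive coefficient for copy number and a negative coefficient for methylation at the selected CpG.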
SPP1 and AGER as potential prognostic biomarkers for lung adenocarcinoma.
Zhang, Weiguo; Fan, Junli; Chen, Qiang; Lei, Caipeng; Qiao, Bin; Liu, Qin
2018-05-01
Overdue treatment and prognostic evaluation lead to low survival rates in patients with lung adenocarcinoma (LUAD). To date, effective biomarkers for prognosis are still required. The aim of the present study was to screen differentially expressed genes (DEGs) as biomarkers for prognostic evaluation of LUAD. DEGs in tumor and normal samples were identified and analyzed for Kyoto Encyclopedia of Genes and Genomes/Gene Ontology functional enrichments. The common genes that are up and downregulated were selected for prognostic analysis using RNAseq data in The Cancer Genome Atlas. Differential expression analysis was performed with 164 samples in GSE10072 and GSE7670 datasets. A total of 484 DEGs that were present in GSE10072 and GSE7670 datasets were screened, including secreted phosphoprotein 1 (SPP1) that was highly expressed and DEGs ficolin 3, advanced glycosylation end-product specific receptor (AGER), transmembrane protein 100 that were lowly expressed in tumor tissues. These four key genes were subsequently verified using an independent dataset, GSE19804. The gene expression model was consistent with GSE10072 and GSE7670 datasets. The dysregulation of highly expressed SPP1 and lowly expressed AGER significantly reduced the median survival time of patients with LUAD. These findings suggest that SPP1 and AGER are risk factors for LUAD, and these two genes may be utilized in the prognostic evaluation of patients with LUAD. Additionally, the key genes and functional enrichments may provide a reference for investigating the molecular expression mechanisms underlying LUAD.
Data Quality Screening Service
NASA Technical Reports Server (NTRS)
Strub, Richard; Lynnes, Christopher; Hearty, Thomas; Won, Young-In; Fox, Peter; Zednik, Stephan
2013-01-01
A report describes the Data Quality Screening Service (DQSS), which is designed to help automate the filtering of remote sensing data on behalf of science users. Whereas this process often involves much research through quality documents followed by laborious coding, the DQSS is a Web Service that provides data users with data pre-filtered to their particular criteria, while at the same time guiding the user with filtering recommendations of the cognizant data experts. The DQSS design is based on a formal semantic Web ontology that describes data fields and the quality fields for applying quality control within a data product. The accompanying code base handles several remote sensing datasets and quality control schemes for data products stored in Hierarchical Data Format (HDF), a common format for NASA remote sensing data. Together, the ontology and code support a variety of quality control schemes through the implementation of the Boolean expression with simple, reusable conditional expressions as operands. Additional datasets are added to the DQSS simply by registering instances in the ontology if they follow a quality scheme that is already modeled in the ontology. New quality schemes are added by extending the ontology and adding code for each new scheme.
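The notion of a Boolean expression built from simple, reusable conditional expressions can be sketched as small predicate functions combined with AND and OR. This is an illustration of the general idea only, not the DQSS code base: the field names (`quality_flag`, `cloud_fraction`), the helper names and the use of `None` for screened-out values are all hypothetical.

```python
from typing import Callable, List, Mapping, Optional

Condition = Callable[[Mapping[str, float]], bool]

def equals(field: str, value: float) -> Condition:
    """Reusable operand: the quality field must equal `value`."""
    return lambda record: record[field] == value

def at_most(field: str, threshold: float) -> Condition:
    """Reusable operand: the quality field must not exceed `threshold`."""
    return lambda record: record[field] <= threshold

def all_of(*conditions: Condition) -> Condition:
    """Boolean AND over any number of operands."""
    return lambda record: all(c(record) for c in conditions)

def any_of(*conditions: Condition) -> Condition:
    """Boolean OR over any number of operands."""
    return lambda record: any(c(record) for c in conditions)

def screen(records: List[Mapping[str, float]], keep: Condition,
           data_field: str = "value") -> List[Optional[float]]:
    """Return the data value where `keep` holds and None where it does not."""
    return [r[data_field] if keep(r) else None for r in records]
```

A new quality scheme that fits this pattern needs only a new combination of existing operands, which mirrors the abstract's point that datasets following an already modeled scheme can be added by registration alone.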
A Predictive Model of the Oxygen and Heme Regulatory Network in Yeast
Kundaje, Anshul; Xin, Xiantong; Lan, Changgui; Lianoglou, Steve; Zhou, Mei; Zhang, Li; Leslie, Christina
2008-01-01
Deciphering gene regulatory mechanisms through the analysis of high-throughput expression data is a challenging computational problem. Previous computational studies have used large expression datasets in order to resolve fine patterns of coexpression, producing clusters or modules of potentially coregulated genes. These methods typically examine promoter sequence information, such as DNA motifs or transcription factor occupancy data, in a separate step after clustering. We needed an alternative and more integrative approach to study the oxygen regulatory network in Saccharomyces cerevisiae using a small dataset of perturbation experiments. Mechanisms of oxygen sensing and regulation underlie many physiological and pathological processes, and only a handful of oxygen regulators have been identified in previous studies. We used a new machine learning algorithm called MEDUSA to uncover detailed information about the oxygen regulatory network using genome-wide expression changes in response to perturbations in the levels of oxygen, heme, Hap1, and Co2+. MEDUSA integrates mRNA expression, promoter sequence, and ChIP-chip occupancy data to learn a model that accurately predicts the differential expression of target genes in held-out data. We used a novel margin-based score to extract significant condition-specific regulators and assemble a global map of the oxygen sensing and regulatory network. This network includes both known oxygen and heme regulators, such as Hap1, Mga2, Hap4, and Upc2, as well as many new candidate regulators. MEDUSA also identified many DNA motifs that are consistent with previous experimentally identified transcription factor binding sites. Because MEDUSA's regulatory program associates regulators to target genes through their promoter sequences, we directly tested the predicted regulators for OLE1, a gene specifically induced under hypoxia, by experimental analysis of the activity of its promoter. 
In each case, deletion of the candidate regulator resulted in the predicted effect on promoter activity, confirming that several novel regulators identified by MEDUSA are indeed involved in oxygen regulation. MEDUSA can reveal important information from a small dataset and generate testable hypotheses for further experimental analysis. Supplemental data are included. PMID:19008939
Balcik-Ercin, Pelin; Cetin, Metin; Yalim-Camci, Irem; Odabas, Gorkem; Tokay, Nurettin; Sayan, A Emre; Yagci, Tamer
2018-03-07
ZEB2 is a transcriptional repressor that regulates epithelial-to-mesenchymal transition (EMT) through binding to bipartite E-box motifs in gene regulatory regions. Despite the abundant presence of E-boxes within the human genome and the multiplicity of pathophysiological processes regulated during ZEB2-induced EMT, only a small fraction of ZEB2 targets has been identified so far. Hence, we explored genome-wide ZEB2 binding by chromatin immunoprecipitation-sequencing (ChIP-seq) under endogenous ZEB2 expression conditions. For ChIP-Seq we used an anti-ZEB2 monoclonal antibody, clone 6E5, in SNU398 hepatocellular carcinoma cells exhibiting a high endogenous ZEB2 expression. The ChIP-Seq targets were validated using ChIP-qPCR, whereas ZEB2-dependent expression of target genes was assessed by RT-qPCR and Western blotting in shRNA-mediated ZEB2 silenced SNU398 cells and doxycycline-induced ZEB2 overexpressing colorectal carcinoma DLD1 cells. Changes in target gene expression were also assessed using primary human tumor cDNA arrays in conjunction with RT-qPCR. Additional differential expression and correlation analyses were performed using expO and Human Protein Atlas datasets. Over 500 ChIP-Seq positive genes were annotated, and intervals related to these genes were found to include the ZEB2 binding motif CACCTG according to TOMTOM motif analysis in the MEME Suite database. Assessment of ZEB2-dependent expression of target genes in ZEB2-silenced SNU398 cells and ZEB2-induced DLD1 cells revealed that the GALNT3 gene serves as a ZEB2 target with the highest, but inversely correlated, expression level. Remarkably, GALNT3 also exhibited the highest enrichment in the ChIP-qPCR validation assays. Through the analyses of primary tumor cDNA arrays and expO datasets a significant differential expression and a significant inverse correlation between ZEB2 and GALNT3 expression were detected in most of the tumors. 
We also explored ZEB2 and GALNT3 protein expression using the Human Protein Atlas dataset and, again, observed an inverse correlation in all analyzed tumor types, except malignant melanoma. In contrast to a generally negative or weak ZEB2 expression, we found that most tumor tissues exhibited a strong or moderate GALNT3 expression. Our observation that ZEB2 negatively regulates a GalNAc-transferase (GALNT3) that is involved in O-glycosylation adds another layer of complexity to the role of ZEB2 in cancer progression and metastasis. Proteins glycosylated by GALNT3 may be exploited as novel diagnostics and/or therapeutic targets.
A method for generating new datasets based on copy number for cancer analysis.
Kim, Shinuk; Kon, Mark; Kang, Hyunsik
2015-01-01
New data sources for the analysis of cancer data are rapidly supplementing the large number of gene-expression markers used for current methods of analysis. Significant among these new sources are copy number variation (CNV) datasets, which typically enumerate several hundred thousand CNVs distributed throughout the genome. Several useful algorithms allow systems-level analyses of such datasets. However, these rich data sources have not yet been analyzed as deeply as gene-expression data. To address this issue, the extensive toolsets used for analyzing expression data in cancerous and noncancerous tissue (e.g., gene set enrichment analysis and phenotype prediction) could be redirected to extract a great deal of predictive information from CNV data, in particular those derived from cancers. Here we present a software package capable of preprocessing standard Agilent copy number datasets into a form to which essentially all expression analysis tools can be applied. We illustrate the use of this toolset in predicting the survival time of patients with ovarian cancer or glioblastoma multiforme and also provide an analysis of gene- and pathway-level deletions in these two types of cancer.
Dynamic association rules for gene expression data analysis.
Chen, Shu-Chuan; Tsai, Tsung-Hsien; Chung, Cheng-Han; Li, Wen-Hsiung
2015-10-14
The purpose of gene expression analysis is to look for the association between regulation of gene expression levels and phenotypic variations. This association based on gene expression profile has been used to determine whether the induction/repression of genes corresponds to phenotypic variations including cell regulations, clinical diagnoses and drug development. Statistical analyses on microarray data have been developed to resolve the gene selection issue. However, these methods do not inform us of causality between genes and phenotypes. In this paper, we propose the dynamic association rule algorithm (DAR algorithm), which helps one to efficiently select a subset of significant genes for subsequent analysis. The DAR algorithm is based on association rules from market basket analysis in marketing. We first propose a statistical way, based on constructing a one-sided confidence interval and hypothesis testing, to determine if an association rule is meaningful. Based on the proposed statistical method, we then developed the DAR algorithm for gene expression data analysis. The method was applied to analyze four microarray datasets and one Next Generation Sequencing (NGS) dataset: the Mice Apo A1 dataset, the whole genome expression dataset of mouse embryonic stem cells, expression profiling of the bone marrow of Leukemia patients, Microarray Quality Control (MAQC) data set and the RNA-seq dataset of a mouse genomic imprinting study. A comparison of the proposed method with the t-test on the expression profiling of the bone marrow of Leukemia patients was conducted. We developed a statistical way, based on the concept of confidence interval, to determine the minimum support and minimum confidence for mining association relationships among items. With the minimum support and minimum confidence, one can find significant rules in one single step. The DAR algorithm was then developed for gene expression data analysis. 
Four gene expression datasets showed that the proposed DAR algorithm not only was able to identify a set of differentially expressed genes that largely agreed with that of other methods, but also provided an efficient and accurate way to find influential genes of a disease. In the paper, the well-established association rule mining technique from marketing has been successfully modified to determine the minimum support and minimum confidence based on the concept of confidence interval and hypothesis testing. It can be applied to gene expression data to mine significant association rules between gene regulation and phenotype. The proposed DAR algorithm provides an efficient way to find influential genes that underlie the phenotypic variance.
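The support and confidence measures referred to in the abstract above can be sketched as follows. This is not the DAR algorithm itself: the abstract does not give its exact interval construction, so the one-sided lower bound here uses a plain normal (Wald) approximation as an assumption, and the function name, item labels and toy samples are hypothetical.

```python
import math
from statistics import NormalDist


def rule_stats(samples, antecedent, consequent, alpha=0.05):
    """Support, confidence and a one-sided lower bound for antecedent -> consequent.

    samples: list of sets of items (e.g. {"GENE1_up", "disease"}).
    The lower bound is a Wald approximation, assumed here for illustration only.
    """
    n = len(samples)
    n_a = sum(1 for s in samples if antecedent in s)
    n_ab = sum(1 for s in samples if antecedent in s and consequent in s)
    support = n_ab / n if n else 0.0
    confidence = n_ab / n_a if n_a else 0.0
    if n_a:
        z = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
        half_width = z * math.sqrt(confidence * (1 - confidence) / n_a)
        lower = max(0.0, confidence - half_width)
    else:
        lower = 0.0
    return support, confidence, lower


# Hypothetical toy data: 8 samples, 5 with GENE1 up-regulated, 4 of those diseased.
samples = (
    [{"GENE1_up", "disease"}] * 4
    + [{"GENE1_up"}]
    + [{"disease"}]
    + [set()] * 2
)
support, confidence, lower = rule_stats(samples, "GENE1_up", "disease")
print(support, confidence, round(lower, 3))
```

Under this sketch a rule would be kept only if the lower bound clears a chosen minimum confidence; how the published method sets that threshold is not specified in the abstract.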
Carroll, Adam J; Badger, Murray R; Harvey Millar, A
2010-07-14
Standardization of analytical approaches and reporting methods via community-wide collaboration can work synergistically with web-tool development to result in rapid community-driven expansion of online data repositories suitable for data mining and meta-analysis. In metabolomics, the inter-laboratory reproducibility of gas-chromatography/mass-spectrometry (GC/MS) makes it an obvious target for such development. While a number of web-tools offer access to datasets and/or tools for raw data processing and statistical analysis, none of these systems are currently set up to act as a public repository by easily accepting, processing and presenting publicly submitted GC/MS metabolomics datasets for public re-analysis. Here, we present MetabolomeExpress, a new File Transfer Protocol (FTP) server and web-tool for the online storage, processing, visualisation and statistical re-analysis of publicly submitted GC/MS metabolomics datasets. Users may search a quality-controlled database of metabolite response statistics from publicly submitted datasets by a number of parameters (e.g., metabolite, species, organ/biofluid). Users may also perform meta-analysis comparisons of multiple independent experiments or re-analyse public primary datasets via user-friendly tools for t-test, principal components analysis, hierarchical cluster analysis and correlation analysis. They may interact with chromatograms, mass spectra and peak detection results via an integrated raw data viewer. Researchers who register for a free account may upload (via FTP) their own data to the server for online processing via a novel raw data processing pipeline. MetabolomeExpress (https://www.metabolome-express.org) provides a new opportunity for the general metabolomics community to transparently present online the raw and processed GC/MS data underlying their metabolomics publications. 
Transparent sharing of these data will allow researchers to assess data quality and draw their own insights from published metabolomics datasets.
Gálvez, Juan Manuel; Castillo, Daniel; Herrera, Luis Javier; San Román, Belén; Valenzuela, Olga; Ortuño, Francisco Manuel; Rojas, Ignacio
2018-01-01
Most of the research studies developed applying microarray technology to the characterization of different pathological states of any disease may fail in reaching statistically significant results. This is largely due to the small repertoire of analysed samples, and to the limitation in the number of states or pathologies usually addressed. Moreover, the influence of potential deviations on the gene expression quantification is usually disregarded. In spite of the continuous changes in omic sciences, reflected for instance in the emergence of new Next-Generation Sequencing-related technologies, the existing availability of a vast amount of gene expression microarray datasets should be properly exploited. Therefore, this work proposes a novel methodological approach involving the integration of several heterogeneous skin cancer series, and a later multiclass classifier design. This approach is thus a way to provide the clinicians with an intelligent diagnosis support tool based on the use of a robust set of selected biomarkers, which simultaneously distinguishes among different cancer-related skin states. To achieve this, a multi-platform combination of microarray datasets from Affymetrix and Illumina manufacturers was carried out. This integration is expected to strengthen the statistical robustness of the study as well as the finding of highly-reliable skin cancer biomarkers. Specifically, the designed operation pipeline has allowed the identification of a small subset of 17 differentially expressed genes (DEGs) from which to distinguish among 7 involved skin states. These genes were obtained from the assessment of a number of potential batch effects on the gene expression data. The biological interpretation of these genes was inspected in the specific literature to understand their underlying information in relation to skin cancer. 
Finally, in order to assess their possible effectiveness in cancer diagnosis, a cross-validated Support Vector Machine (SVM)-based classification including feature ranking was performed. The accuracy attained exceeded 92% in overall recognition of the 7 different cancer-related skin states. The proposed integration scheme is expected to allow co-integration with other state-of-the-art technologies such as RNA-seq.
A high resolution atlas of gene expression in the domestic sheep (Ovis aries)
Farquhar, Iseabail L.; Young, Rachel; Lefevre, Lucas; Pridans, Clare; Tsang, Hiu G.; Afrasiabi, Cyrus; Watson, Mick; Whitelaw, C. Bruce; Freeman, Tom C.; Archibald, Alan L.; Hume, David A.
2017-01-01
Sheep are a key source of meat, milk and fibre for the global livestock sector, and an important biomedical model. Global analysis of gene expression across multiple tissues has aided genome annotation and supported functional annotation of mammalian genes. We present a large-scale RNA-Seq dataset representing all the major organ systems from adult sheep and from several juvenile, neonatal and prenatal developmental time points. The Ovis aries reference genome (Oar v3.1) includes 27,504 genes (20,921 protein coding), of which 25,350 (19,921 protein coding) had detectable expression in at least one tissue in the sheep gene expression atlas dataset. Network-based cluster analysis of this dataset grouped genes according to their expression pattern. The principle of ‘guilt by association’ was used to infer the function of uncharacterised genes from their co-expression with genes of known function. We describe the overall transcriptional signatures present in the sheep gene expression atlas and assign those signatures, where possible, to specific cell populations or pathways. The findings are related to innate immunity by focusing on clusters with an immune signature, and to the advantages of cross-breeding by examining the patterns of genes exhibiting the greatest expression differences between purebred and crossbred animals. This high-resolution gene expression atlas for sheep is, to our knowledge, the largest transcriptomic dataset from any livestock species to date. It provides a resource to improve the annotation of the current reference genome for sheep, presenting a model transcriptome for ruminants and insight into gene, cell and tissue function at multiple developmental stages. PMID:28915238
A high resolution atlas of gene expression in the domestic sheep (Ovis aries).
Clark, Emily L; Bush, Stephen J; McCulloch, Mary E B; Farquhar, Iseabail L; Young, Rachel; Lefevre, Lucas; Pridans, Clare; Tsang, Hiu G; Wu, Chunlei; Afrasiabi, Cyrus; Watson, Mick; Whitelaw, C Bruce; Freeman, Tom C; Summers, Kim M; Archibald, Alan L; Hume, David A
2017-09-01
Sheep are a key source of meat, milk and fibre for the global livestock sector, and an important biomedical model. Global analysis of gene expression across multiple tissues has aided genome annotation and supported functional annotation of mammalian genes. We present a large-scale RNA-Seq dataset representing all the major organ systems from adult sheep and from several juvenile, neonatal and prenatal developmental time points. The Ovis aries reference genome (Oar v3.1) includes 27,504 genes (20,921 protein coding), of which 25,350 (19,921 protein coding) had detectable expression in at least one tissue in the sheep gene expression atlas dataset. Network-based cluster analysis of this dataset grouped genes according to their expression pattern. The principle of 'guilt by association' was used to infer the function of uncharacterised genes from their co-expression with genes of known function. We describe the overall transcriptional signatures present in the sheep gene expression atlas and assign those signatures, where possible, to specific cell populations or pathways. The findings are related to innate immunity by focusing on clusters with an immune signature, and to the advantages of cross-breeding by examining the patterns of genes exhibiting the greatest expression differences between purebred and crossbred animals. This high-resolution gene expression atlas for sheep is, to our knowledge, the largest transcriptomic dataset from any livestock species to date. It provides a resource to improve the annotation of the current reference genome for sheep, presenting a model transcriptome for ruminants and insight into gene, cell and tissue function at multiple developmental stages.
New Statistics for Testing Differential Expression of Pathways from Microarray Data
NASA Astrophysics Data System (ADS)
Siu, Hoicheong; Dong, Hua; Jin, Li; Xiong, Momiao
Exploring biological meaning from microarray data is very important but remains a great challenge. Here, we developed three new statistics (the linear combination test, the quadratic test and the de-correlation test) to identify differentially expressed pathways from gene expression profiles. We apply our statistics to two rheumatoid arthritis datasets. Notably, our results reveal three significant pathways and 275 genes in common between the two datasets. The pathways we found help to uncover the disease mechanisms of rheumatoid arthritis, which implies that our statistics are a powerful tool in functional analysis of gene expression data.
Atallah, Nadia M; Vitek, Olga; Gaiti, Federico; Tanurdzic, Milos; Banks, Jo Ann
2018-05-02
The fern Ceratopteris richardii is an important model for studies of sex determination and gamete differentiation in homosporous plants. Here we use RNA-seq to de novo assemble a transcriptome and identify genes differentially expressed in young gametophytes as their sex is determined by the presence or absence of the male-inducing pheromone called antheridiogen. Of the 1,163 consensus differentially expressed genes identified, the vast majority (1,030) are up-regulated in gametophytes treated with antheridiogen. GO term enrichment analyses of these DEGs reveal that a large number of genes involved in epigenetic reprogramming of the gametophyte genome are up-regulated by the pheromone. Additional hormone response and development genes are also up-regulated by the pheromone. This C. richardii gametophyte transcriptome and gene expression dataset will prove useful for studies focusing on sex determination and differentiation in plants. Copyright © 2018, G3: Genes, Genomes, Genetics.
Affective State Level Recognition in Naturalistic Facial and Vocal Expressions.
Meng, Hongying; Bianchi-Berthouze, Nadia
2014-03-01
Naturalistic affective expressions change at a rate much slower than the typical rate at which video or audio is recorded. This increases the probability that consecutive recorded instants of expressions represent the same affective content. In this paper, we exploit such a relationship to improve the recognition performance of continuous naturalistic affective expressions. Using datasets of naturalistic affective expressions (AVEC 2011 audio and video dataset, PAINFUL video dataset) continuously labeled over time and over different dimensions, we analyze the transitions between levels of those dimensions (e.g., transitions in pain intensity level). We use an information theory approach to show that the transitions occur very slowly and hence suggest modeling them as first-order Markov models. The dimension levels are considered to be the hidden states in the Hidden Markov Model (HMM) framework. Their discrete transition and emission matrices are trained by using the labels provided with the training set. The recognition problem is converted into a best path-finding problem to obtain the best hidden states sequence in HMMs. This is a key difference from previous use of HMMs as classifiers. Modeling of the transitions between dimension levels is integrated in a multistage approach, where the first level performs a mapping between the affective expression features and a soft decision value (e.g., an affective dimension level), and further classification stages are modeled as HMMs that refine that mapping by taking into account the temporal relationships between the output decision labels. The experimental results for each of the unimodal datasets show overall performance to be significantly above that of a standard classification system that does not take into account temporal relationships. In particular, the results on the AVEC 2011 audio dataset outperform all other systems presented at the international competition.
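The best-path decoding step described in the abstract above can be sketched with a standard log-space Viterbi routine. This is a generic textbook implementation, not the authors' code; the two-level model, its "sticky" transition matrix and the observation sequence are hypothetical values chosen only to show how slow transitions smooth out an isolated noisy decision.

```python
import math


def viterbi(observations, start_p, trans_p, emit_p):
    """Most likely hidden-state path for a discrete HMM (log-space Viterbi).

    start_p[s], trans_p[r][s] and emit_p[s][o] are probabilities indexed by
    integer state and observation symbols.
    """
    n_states = len(start_p)

    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # score[s]: best log-probability of any path that ends in state s.
    score = [log(start_p[s]) + log(emit_p[s][observations[0]])
             for s in range(n_states)]
    back = []  # back[t][s]: best predecessor of state s at step t + 1
    for obs in observations[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda r: score[r] + log(trans_p[r][s]))
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log(trans_p[best_prev][s])
                             + log(emit_p[s][obs]))
        score = new_score
        back.append(pointers)

    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical two-level model (0 = low, 1 = high) with slow transitions.
start = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[0.8, 0.2], [0.2, 0.8]]  # first-stage decision given the true level
print(viterbi([0, 0, 1, 0, 0], start, trans, emit))  # [0, 0, 0, 0, 0]
```

In the example the single "1" in the first-stage decisions is overridden because a brief jump to the high level and back is less probable than one misclassified frame.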
Validation of MIMGO: a method to identify differentially expressed GO terms in a microarray dataset
2012-01-01
Background We previously proposed an algorithm for the identification of GO terms that commonly annotate genes whose expression is upregulated or downregulated in some microarray data compared with in other microarray data. We call these “differentially expressed GO terms” and have named the algorithm “matrix-assisted identification method of differentially expressed GO terms” (MIMGO). MIMGO can also identify microarray data in which genes annotated with a differentially expressed GO term are upregulated or downregulated. However, MIMGO has not yet been validated on a real microarray dataset using all available GO terms. Findings We combined Gene Set Enrichment Analysis (GSEA) with MIMGO to identify differentially expressed GO terms in a yeast cell cycle microarray dataset. GSEA followed by MIMGO (GSEA + MIMGO) correctly identified (p < 0.05) microarray data in which genes annotated to differentially expressed GO terms are upregulated. We found that GSEA + MIMGO was slightly less effective than, or comparable to, GSEA (Pearson), a method that uses Pearson’s correlation as a metric, at detecting true differentially expressed GO terms. However, unlike other methods including GSEA (Pearson), GSEA + MIMGO can comprehensively identify the microarray data in which genes annotated with a differentially expressed GO term are upregulated or downregulated. Conclusions MIMGO is a reliable method to identify differentially expressed GO terms comprehensively. PMID:23232071
Zaag, Rim; Tamby, Jean Philippe; Guichard, Cécile; Tariq, Zakia; Rigaill, Guillem; Delannoy, Etienne; Renou, Jean-Pierre; Balzergue, Sandrine; Mary-Huard, Tristan; Aubourg, Sébastien; Martin-Magniette, Marie-Laure; Brunaud, Véronique
2015-01-01
CATdb (http://urgv.evry.inra.fr/CATdb) is a database providing public access to a large collection of transcriptomic data, mainly for Arabidopsis but also for other plants. This resource has the rare advantage of containing several thousand microarray experiments obtained with the same technical protocol and analyzed by the same statistical pipelines. In this paper, we present GEM2Net, a new module of CATdb that takes advantage of this homogeneous dataset to mine co-expression units and decipher Arabidopsis gene functions. GEM2Net explores 387 stress conditions organized into 18 biotic and abiotic stress categories. For each one, a model-based clustering is applied on expression differences to identify clusters of co-expressed genes. To characterize functions associated with these clusters, various resources are analyzed and integrated: Gene Ontology, subcellular localization of proteins, Hormone Families, Transcription Factor Families and a refined stress-related gene list associated with publications. Exploiting protein-protein interactions and transcription factor-target interactions makes it possible to display gene networks. GEM2Net presents the analysis of the 18 stress categories, in which 17,264 genes are involved and organized within 681 co-expression clusters. The meta-data analyses were stored and organized to compose a dynamic Web resource. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
2014-01-01
Background Advances in genomic technologies have enabled the accumulation of vast amounts of genomic data, including gene expression data for multiple species under various biological and environmental conditions. Integration of these gene expression datasets is a promising strategy to alleviate the challenges of protein functional annotation and biological module discovery based on a single gene expression dataset, which suffers from spurious coexpression. Results We propose a joint mining algorithm that constructs a weighted hybrid similarity graph whose nodes are the coexpression links. The weight of an edge between two coexpression links in this hybrid graph is a linear combination of the topological similarities and co-appearance similarities of the corresponding two coexpression links. Clustering the weighted hybrid similarity graph yields recurrent coexpression link clusters (modules). Experimental results on human gene expression datasets show that the reported modules are functionally homogeneous, as evidenced by their enrichment with biological process GO terms and KEGG pathways. PMID:25221624
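The edge weight described in the abstract above, a linear combination of two similarities between coexpression links, can be sketched as follows. The abstract does not define either similarity, so both are assumed here to be Jaccard indices (over shared genes and over the datasets in which each link appears); the mixing parameter alpha, the function names and the toy data are likewise hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two collections; 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hybrid_weight(link1, link2, appearances, alpha=0.5):
    """Weight of the hybrid-graph edge between two coexpression links.

    link1, link2: gene pairs such as ("A", "B").
    appearances: maps each link to the set of datasets in which it is observed.
    Both similarity definitions are assumptions made for illustration.
    """
    topological = jaccard(link1, link2)  # assumed: overlap of endpoint genes
    co_appearance = jaccard(appearances[link1], appearances[link2])
    return alpha * topological + (1 - alpha) * co_appearance


# Hypothetical links and dataset memberships.
appearances = {
    ("A", "B"): {"d1", "d2", "d3"},
    ("B", "C"): {"d2", "d3", "d4"},
}
print(hybrid_weight(("A", "B"), ("B", "C"), appearances))
```

With these toy values the topological term is 1/3 and the co-appearance term is 1/2, so an equal mix gives 5/12; the resulting weighted graph would then be passed to a clustering step.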
Androgen receptor (AR) cistrome in prostate differentiation and cancer progression.
Wang, Fengtian; Koul, Hari K
2017-01-01
Despite the progress in development of better AR-targeted therapies for prostate cancer (PCa), there is no curative therapy for castration-resistant prostate cancer (CRPC). Therapeutic resistance in PCa can be characterized in two broad categories of AR therapy resistance: the first and most prevalent one involves restoration of AR activity despite AR targeted therapy, and the second one involves tumor progression despite blockade of AR activity. As such AR remains the most attractive drug target for CRPC. Despite its oncogenic role, AR signaling also contributes to the maturation and differentiation of prostate luminal cells during development. Recent evidence suggests that AR cistrome is altered in advanced PCa. Alteration in AR may result from AR amplification, alternative splicing, mutations, post-translational modification of AR, and altered expression of AR co-factors. We reasoned that such alterations would result in the transcription of disparate AR target genes and as such may contribute to the emergence of castration-resistance. In the present study, we evaluated the expression of genes associated with canonical or non-canonical AR cistrome in relationship with PCa progression and prostate development by analyzing publicly available datasets. We discovered a transcription switch from canonical AR cistrome target genes to the non-canonical AR cistrome target genes during PCa progression. Using Gene Set Enrichment Analysis (GSEA), we discovered that canonical AR cistrome target genes are enriched in indolent PCa patients and the loss of canonical AR cistrome is associated with tumor metastasis and poor clinical outcome. Analysis of the datasets involving prostate development, revealed that canonical AR cistrome target genes are significantly enriched in prostate luminal cells and can distinguish luminal cells from basal cells, suggesting a pivotal role for canonical AR cistrome driven genes in prostate development. 
These data suggest that the expression of canonical AR cistrome related genes plays an important role in maintaining the prostate luminal cell identity and might restrict the lineage plasticity observed in lethal PCa. Understanding the molecular mechanisms that dictate AR cistrome may lead to development of new therapeutic strategies aimed at restoring canonical AR cistrome, rewiring the oncogenic AR signaling and overcoming resistance to AR targeted therapies.
Yılmaz Isıkhan, Selen; Karabulut, Erdem; Alpar, Celal Reha
2016-01-01
Background/Aim. Evaluating the success of dose prediction based on genetic or clinical data has substantially advanced recently. The aim of this study is to predict various clinical dose values from DNA gene expression datasets using data mining techniques. Materials and Methods. Eleven real gene expression datasets containing dose values were included. First, important genes for dose prediction were selected using iterative sure independence screening. Then, the performances of regression trees (RTs), support vector regression (SVR), RT bagging, SVR bagging, and RT boosting were examined. Results. The results demonstrated that a regression-based feature selection method substantially reduced the number of irrelevant genes from raw datasets. Overall, the best prediction performance in nine of 11 datasets was achieved using SVR; the second most accurate performance was provided using a gradient-boosting machine (GBM). Conclusion. Analysis of various dose values based on microarray gene expression data identified common genes found in our study and the referenced studies. According to our findings, SVR and GBM can be good predictors of dose-gene datasets. Another result of the study was to identify the sample size of n = 25 as a cutoff point for RT bagging to outperform a single RT.
2013-01-01
Background The availability of gene expression data that corresponds to pig immune response challenges provides compelling material for the understanding of the host immune system. Meta-analysis offers the opportunity to confirm and expand our knowledge by combining and studying at one time a vast set of independent studies creating large datasets with increased statistical power. In this study, we performed two meta-analyses of porcine transcriptomic data: i) scrutinized the global immune response to different challenges, and ii) determined the specific response to Porcine Reproductive and Respiratory Syndrome Virus (PRRSV) infection. To gain an in-depth knowledge of the pig response to PRRSV infection, we used an original approach comparing and eliminating the common genes from both meta-analyses in order to identify genes and pathways specifically involved in the PRRSV immune response. The software Pointillist was used to cope with the highly disparate data, circumventing the biases generated by the specific responses linked to single studies. Next, we used the Ingenuity Pathways Analysis (IPA) software to survey the canonical pathways, biological functions and transcription factors found to be significantly involved in the pig immune response. We used 779 chips corresponding to 29 datasets for the pig global immune response and 279 chips obtained from 6 datasets for the pig response to PRRSV infection, respectively. Results The pig global immune response analysis showed interconnected canonical pathways involved in the regulation of translation and mitochondrial energy metabolism. Biological functions revealed in this meta-analysis were centred around translation regulation, which included protein synthesis, RNA-post transcriptional gene expression and cellular growth and proliferation. Furthermore, the oxidative phosphorylation and mitochondria dysfunctions, associated with stress signalling, were highly regulated. 
Transcription factors such as MYCN, MYC and NFE2L2 were found in this analysis to be potentially involved in the regulation of the immune response. The host specific response to PRRSV infection engendered the activation of well-defined canonical pathways in response to pathogen challenge such as TREM1, toll-like receptor and hyper-cytokinemia/ hyper-chemokinemia signalling. Furthermore, this analysis brought forth the central role of the crosstalk between innate and adaptive immune response and the regulation of anti-inflammatory response. The most significant transcription factor potentially involved in this analysis was HMGB1, which is required for the innate recognition of viral nucleic acids. Other transcription factors like interferon regulatory factors IRF1, IRF3, IRF5 and IRF8 were also involved in the pig specific response to PRRSV infection. Conclusions This work reveals key genes, canonical pathways and biological functions involved in the pig global immune response to diverse challenges, including PRRSV infection. The powerful statistical approach led us to consolidate previous findings as well as to gain new insights into the pig immune response either to common stimuli or specifically to PRRSV infection. PMID:23552196
2011-01-01
Background Bupleurum chinense DC. is a widely used traditional Chinese medicinal plant. Saikosaponins are the major bioactive constituents of B. chinense, but relatively little is known about saikosaponin biosynthesis. The 454 pyrosequencing technology provides a promising opportunity for finding novel genes that participate in plant metabolism. Consequently, this technology may help to identify the candidate genes involved in the saikosaponin biosynthetic pathway. Results One-quarter of the 454 pyrosequencing runs produced a total of 195,088 high-quality reads, with an average read length of 356 bases (NCBI SRA accession SRA039388). A de novo assembly generated 24,037 unique sequences (22,748 contigs and 1,289 singletons), 12,649 (52.6%) of which were annotated against three public protein databases using a basic local alignment search tool (E-value ≤1e-10). All unique sequences were compared with NCBI expressed sequence tags (ESTs) (237) and encoding sequences (44) from the Bupleurum genus, and with a Sanger-sequenced EST dataset (3,111). The 23,173 (96.4%) unique sequences obtained in the present study represent novel Bupleurum genes. The ESTs of genes related to saikosaponin biosynthesis were found to encode known enzymes that catalyze the formation of the saikosaponin backbone; 246 cytochrome P450 (P450s) and 102 glycosyltransferases (GTs) unique sequences were also found in the 454 dataset. Full length cDNAs of 7 P450s and 7 uridine diphosphate GTs (UGTs) were verified by reverse transcriptase polymerase chain reaction or by cloning using 5' and/or 3' rapid amplification of cDNA ends. Two P450s and three UGTs were identified as the most likely candidates involved in saikosaponin biosynthesis. This finding was based on the coordinate up-regulation of their expression with β-AS in methyl jasmonate-treated adventitious roots and on their similar expression patterns with β-AS in various B. chinense tissues. 
Conclusions A collection of high-quality ESTs for B. chinense obtained by 454 pyrosequencing is provided here for the first time. These data should aid further research on the functional genomics of B. chinense and other Bupleurum species. The candidate genes for enzymes involved in saikosaponin biosynthesis, especially the P450s and UGTs, that were revealed provide a substantial foundation for follow-up research on the metabolism and regulation of the saikosaponins. PMID:22047182
Wagner, Wolfgang; Feldmann, Robert E; Seckinger, Anja; Maurer, Martin H; Wein, Frederik; Blake, Jonathon; Krause, Ulf; Kalenka, Armin; Bürgers, Heinrich F; Saffrich, Rainer; Wuchter, Patrick; Kuschinsky, Wolfgang; Ho, Anthony D
2006-04-01
Mesenchymal stem cells (MSC) raise high hopes in clinical applications. However, the lack of common standards and a precise definition of MSC preparations remains a major obstacle in research and application of MSC. Whereas surface antigen markers have failed to precisely define this population, a combination of proteomic data and microarray data provides a new dimension for the definition of MSC preparations. In our continuing effort to characterize MSC, we have analyzed the differential transcriptome and proteome expression profiles of MSC preparations isolated from human bone marrow under two different expansion media (BM-MSC-M1 and BM-MSC-M2). In proteomics, 136 protein spots were unambiguously identified by MALDI-TOF-MS and corresponding cDNA spots were selected on our "Human Transcriptome cDNA Microarray." Combination of datasets revealed a correlation in differential gene expression and protein expression of BM-MSC-M1 vs BM-MSC-M2. Genes involved in metabolism were more highly expressed in BM-MSC-M1, whereas genes involved in development, morphogenesis, extracellular matrix, and differentiation were more highly expressed in BM-MSC-M2. Interchanging culture conditions for 8 days revealed that differential expression was retained in several genes whereas it was altered in others. Our results have provided evidence that homogeneous BM-MSC preparations can reproducibly be isolated under standardized conditions, whereas culture conditions exert a prominent impact on transcriptome, proteome, and cellular organization of BM-MSC.
USDA-ARS's Scientific Manuscript database
A GeXP multiplex, RT-PCR assay was developed and optimized that simultaneously measures expression of a suite of immune-relevant genes in rainbow trout (Oncorhynchus mykiss), concentrating on tumor necrosis factor and interleukin-1 ligand/receptor systems and acute phase response genes. The dataset ...
Microarray Data Mining for Potential Selenium Targets in Chemoprevention of Prostate Cancer
ZHANG, HAITAO; DONG, YAN; ZHAO, HONGJUAN; BROOKS, JAMES D.; HAWTHORN, LESLEYANN; NOWAK, NORMA; MARSHALL, JAMES R.; GAO, ALLEN C.; IP, CLEMENT
2008-01-01
Background A previous clinical trial showed that selenium supplementation significantly reduced the incidence of prostate cancer. We report here a bioinformatics approach to gain new insights into selenium molecular targets that might be relevant to prostate cancer chemoprevention. Materials and Methods We first performed data mining analysis to identify genes which are consistently dysregulated in prostate cancer using published datasets from gene expression profiling of clinical prostate specimens. We then devised a method to systematically analyze three selenium microarray datasets from the LNCaP human prostate cancer cells, and to match the analysis to the cohort of genes implicated in prostate carcinogenesis. Moreover, we compared the selenium datasets with two datasets obtained from expression profiling of androgen-stimulated LNCaP cells. Results We found that selenium reverses the expression of genes implicated in prostate carcinogenesis. In addition, we found that selenium could counteract the effect of androgen on the expression of a subset of androgen-regulated genes. Conclusions The above information provides us with a wealth of new clues to investigate the mechanism of selenium chemoprevention of prostate cancer. Furthermore, these selenium target genes could also serve as biomarkers in future clinical trials to gauge the efficacy of selenium intervention. PMID:18548127
SpeCond: a method to detect condition-specific gene expression
2011-01-01
Transcriptomic studies routinely measure expression levels across numerous conditions. These datasets allow identification of genes that are specifically expressed in a small number of conditions. However, there are currently no statistically robust methods for identifying such genes. Here we present SpeCond, a method to detect condition-specific genes that outperforms alternative approaches. We apply the method to a dataset of 32 human tissues to determine 2,673 specifically expressed genes. An implementation of SpeCond is freely available as a Bioconductor package at http://www.bioconductor.org/packages/release/bioc/html/SpeCond.html. PMID:22008066
RefEx, a reference gene expression dataset as a web tool for the functional analysis of genes
Ono, Hiromasa; Ogasawara, Osamu; Okubo, Kosaku; Bono, Hidemasa
2017-01-01
Gene expression data are exponentially accumulating; thus, the functional annotation of such sequence data from metadata is urgently required. However, life scientists have difficulty utilizing the available data due to its sheer magnitude and complicated access. We have developed a web tool for browsing reference gene expression pattern of mammalian tissues and cell lines measured using different methods, which should facilitate the reuse of the precious data archived in several public databases. The web tool is called Reference Expression dataset (RefEx), and RefEx allows users to search by the gene name, various types of IDs, chromosomal regions in genetic maps, gene family based on InterPro, gene expression patterns, or biological categories based on Gene Ontology. RefEx also provides information about genes with tissue-specific expression, and the relative gene expression values are shown as choropleth maps on 3D human body images from BodyParts3D. Combined with the newly incorporated Functional Annotation of Mammals (FANTOM) dataset, RefEx provides insight regarding the functional interpretation of unfamiliar genes. RefEx is publicly available at http://refex.dbcls.jp/. PMID:28850115
Neuroligin 4X overexpression in human breast cancer is associated with poor relapse-free survival.
Henderson, Henry J; Karanam, Balasubramanyam; Samant, Rajeev; Vig, Komal; Singh, Shree R; Yates, Clayton; Bedi, Deepa
2017-01-01
The molecular mechanisms involved in breast cancer progression and metastasis remain unclear. It is a heterogeneous disease featuring several different phenotypes with consistently different biological characteristics. Neuroligins are neural cell adhesion molecules that have been implicated in heterotopic cell adhesion. In humans, alterations in neuroligin genes are implicated in autism and other cognitive diseases. More recently, neuroligins have been shown to be abundantly expressed in blood vessels and have also been implicated in the growth of glioma cells. Here we report increased expression of neuroligin 4X (NLGN4X) in breast cancer. We found NLGN4X was abundantly expressed in breast cancer tissues. NLGN4X expression data for all breast cancer cell lines in the Cancer Cell Line Encyclopedia (CCLE) were analyzed. Correlations between NLGN4X levels and clinicopathologic parameters were analyzed within Oncomine datasets. Evaluation of these bioinformatic datasets revealed that NLGN4X expression was higher in triple-negative breast cancer (TNBC) cells and tissues, particularly the basal subtype, than in non-triple-negative sets. Its level was also observed to be higher in metastatic tissues. RT-PCR, flow cytometry and immunofluorescence study of MDA-MB-231 and MCF-7 breast cancer cells validated that NLGN4X was increased in MDA-MB-231. Knockdown of NLGN4X expression by siRNA decreased cell proliferation and migration significantly in MDA-MB-231 breast cancer cells. NLGN4X knockdown in MDA-MB-231 cells resulted in induction of apoptosis as determined by annexin staining, elevated caspase 3/7 and cleaved PARP by flow cytometry. High NLGN4X expression correlated strongly with decreased relapse-free survival in TNBC. NLGN4X might represent a novel biomarker and therapeutic target for breast cancer. Inhibition of NLGN4X may be a new approach to the prevention and treatment of breast cancer.
The transcriptional landscape of αβ T cell differentiation
Mingueneau, Michael; Kreslavsky, Taras; Gray, Daniel; Heng, Tracy; Cruse, Richard; Ericson, Jeffrey; Bendall, Sean; Spitzer, Matt; Nolan, Garry; Kobayashi, Koichi; von Boehmer, Harald; Mathis, Diane; Benoist, Christophe
2013-01-01
αβT cell differentiation from thymic precursors is a complex process, explored here with the breadth of ImmGen expression datasets, analyzing how differentiation of thymic precursors gives rise to transcriptomes. After surprisingly gradual changes through early T commitment, transit through the CD4+CD8+ stage involves a shutdown of rare breadth, correlating tightly with MYC. MHC-driven selection promotes a large-scale transcriptional reactivation. We identify distinct signatures that mark cells destined for positive selection versus apoptotic deletion. Differential expression of surprisingly few genes accompanies CD4 or CD8 commitment, a similarity that carries through to peripheral T cells and their activation, revealed by mass cytometry phosphoproteomics. The novel transcripts identified as candidate mediators of key transitions help define the “known unknown” of thymocyte differentiation. PMID:23644507
Feichtinger, Julia; Larcombe, Lee; McFarlane, Ramsay J
2014-05-15
Evidence is starting to emerge indicating that tumorigenesis in metazoans involves a soma-to-germline transition, which may contribute to the acquisition of neoplastic characteristics. Here, we have meta-analyzed gene expression profiles of the human orthologs of Drosophila melanogaster germline genes that are ectopically expressed in l(3)mbt brain tumors using gene expression datasets derived from a large cohort of human tumors. We find these germline genes, some of which drive oncogenesis in D. melanogaster, are similarly ectopically activated in a wide range of human cancers. Some of these genes normally have expression restricted to the germline, making them of particular clinical interest. Importantly, these analyses provide additional support to the emerging model that proposes a soma-to-germline transition is a general hallmark of a wide range of human tumors. This has implications for our understanding of human oncogenesis and the development of new therapeutic and biomarker targets with clinical potential. © 2013 The Authors. Published by Wiley Periodicals, Inc. on behalf of UICC.
Prom-On, Santitham; Chanthaphan, Atthawut; Chan, Jonathan Hoyin; Meechai, Asawin
2011-02-01
Relationships among gene expression levels may be associated with the mechanisms of a disease. While identifying a direct association such as a difference in expression levels between case and control groups links genes to disease mechanisms, uncovering an indirect association in the form of a network structure may help reveal the underlying functional module associated with the disease under scrutiny. This paper presents a method to improve the biological relevance of functional module identification from gene expression microarray data by enhancing the structure of a weighted gene co-expression network using a minimum spanning tree. The enhanced network, which is called a backbone network, contains only the essential structural information needed to represent the gene co-expression network. The entire backbone network is decoupled into a number of coherent sub-networks, and then the functional modules are reconstructed from these sub-networks to ensure minimum redundancy. The method was tested with a simulated gene expression dataset and case-control expression datasets of autism spectrum disorder and colorectal cancer studies. The results indicate that the proposed method can accurately identify clusters in the simulated dataset, and the functional modules of the backbone network are more biologically relevant than those obtained from the original approach.
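As a rough illustration of the backbone idea in the abstract above, the sketch below extracts a minimum spanning tree from a small co-expression network with Kruskal's algorithm. The edge distance 1 - |r| (with r the Pearson correlation), the function names, and the toy expression values are assumptions made for illustration; they are not the authors' weighting scheme or data.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mst_backbone(expr):
    """Return MST edges (gene_a, gene_b, distance) of the co-expression graph.

    expr maps gene name -> list of expression values (hypothetical input).
    Distance is 1 - |r|, an assumed choice; strongly co-expressed genes are close.
    """
    genes = sorted(expr)
    edges = sorted(
        (1.0 - abs(pearson(expr[a], expr[b])), a, b)
        for a, b in combinations(genes, 2)
    )
    parent = {g: g for g in genes}  # union-find forest for Kruskal's algorithm

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path compression
            g = parent[g]
        return g

    backbone = []
    for dist, a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:  # keep the edge only if it joins two separate components
            parent[ra] = rb
            backbone.append((a, b, dist))
    return backbone


# Toy, made-up expression profiles for four hypothetical genes.
toy = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.1, 5.9, 8.0],
    "g3": [4.0, 3.0, 2.5, 1.0],
    "g4": [1.0, 3.0, 1.0, 3.0],
}
backbone = mst_backbone(toy)
```

A spanning tree over n genes always has n - 1 edges, so the backbone keeps every gene connected while discarding most of the pairwise edges; splitting it into sub-networks, as the abstract describes, would be a separate step.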
Methods to increase reproducibility in differential gene expression via meta-analysis
Sweeney, Timothy E.; Haynes, Winston A.; Vallania, Francesco; Ioannidis, John P.; Khatri, Purvesh
2017-01-01
Findings from clinical and biological studies are often not reproducible when tested in independent cohorts. Due to the testing of a large number of hypotheses and relatively small sample sizes, results from whole-genome expression studies in particular are often not reproducible. Compared to single-study analysis, gene expression meta-analysis can improve reproducibility by integrating data from multiple studies. However, there are multiple choices in designing and carrying out a meta-analysis. Yet, clear guidelines on best practices are scarce. Here, we hypothesized that studying subsets of very large meta-analyses would allow for systematic identification of best practices to improve reproducibility. We therefore constructed three very large gene expression meta-analyses from clinical samples, and then examined meta-analyses of subsets of the datasets (all combinations of datasets with up to N/2 samples and K/2 datasets) compared to a ‘silver standard’ of differentially expressed genes found in the entire cohort. We tested three random-effects meta-analysis models using this procedure. We showed relatively greater reproducibility with more-stringent effect size thresholds with relaxed significance thresholds; relatively lower reproducibility when imposing extraneous constraints on residual heterogeneity; and an underestimation of actual false positive rate by Benjamini–Hochberg correction. In addition, multivariate regression showed that the accuracy of a meta-analysis increased significantly with more included datasets even when controlling for sample size. PMID:27634930
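The abstract above does not name the three random-effects models that were tested. As one widely used example of such a model, the sketch below implements the DerSimonian-Laird estimator for a single gene; the function name and the input numbers are hypothetical, and this should not be read as the procedure used in the study.

```python
def dersimonian_laird(effects, variances):
    """Pool per-study effect sizes with the DerSimonian-Laird estimator.

    effects:   per-study effect sizes for one gene (hypothetical inputs)
    variances: within-study variances of those effects
    Returns (pooled_effect, standard_error, tau_squared).
    """
    k = len(effects)
    w = [1.0 / v for v in variances]              # fixed-effect weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    # Cochran's Q measures between-study heterogeneity.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw
    # Method-of-moments between-study variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]  # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = (1.0 / sum(w_star)) ** 0.5
    return pooled, se, tau2


# Made-up example: two studies that disagree, with equal precision.
pooled, se, tau2 = dersimonian_laird([0.0, 1.0], [0.01, 0.01])
```

When the studies disagree more than their within-study variances can explain, tau2 becomes positive and the standard error of the pooled effect widens accordingly.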
Jambusaria, Ankit; Klomp, Jeff; Hong, Zhigang; Rafii, Shahin; Dai, Yang; Malik, Asrar B; Rehman, Jalees
2018-06-07
The heterogeneity of cells across tissue types represents a major challenge for studying biological mechanisms as well as for therapeutic targeting of distinct tissues. Computational prediction of tissue-specific gene regulatory networks may provide important insights into the mechanisms underlying the cellular heterogeneity of cells in distinct organs and tissues. Using three pathway analysis techniques, gene set enrichment analysis (GSEA), parametric analysis of gene set enrichment (PGSEA), alongside our novel model (HeteroPath), which assesses heterogeneously upregulated and downregulated genes within the context of pathways, we generated distinct tissue-specific gene regulatory networks. We analyzed gene expression data derived from freshly isolated heart, brain, and lung endothelial cells and populations of neurons in the hippocampus, cingulate cortex, and amygdala. In both datasets, we found that HeteroPath segregated the distinct cellular populations by identifying regulatory pathways that were not identified by GSEA or PGSEA. Using simulated datasets, HeteroPath demonstrated robustness that was comparable to what was seen using existing gene set enrichment methods. Furthermore, we generated tissue-specific gene regulatory networks involved in vascular heterogeneity and neuronal heterogeneity by performing motif enrichment of the heterogeneous genes identified by HeteroPath and linking the enriched motifs to regulatory transcription factors in the ENCODE database. HeteroPath assesses contextual bidirectional gene expression within pathways and thus allows for transcriptomic assessment of cellular heterogeneity. Unraveling tissue-specific heterogeneity of gene expression can lead to a better understanding of the molecular underpinnings of tissue-specific phenotypes.
Coalescence computations for large samples drawn from populations of time-varying sizes
Polanski, Andrzej; Szczesna, Agnieszka; Garbulowski, Mateusz; Kimmel, Marek
2017-01-01
We present new results concerning probability distributions of times in the coalescence tree and expected allele frequencies for coalescent with large sample size. The obtained results are based on computational methodologies, which involve combining coalescence time scale changes with techniques of integral transformations and using analytical formulae for infinite products. We show applications of the proposed methodologies for computing probability distributions of times in the coalescence tree and their limits, for evaluation of accuracy of approximate expressions for times in the coalescence tree and expected allele frequencies, and for analysis of large human mitochondrial DNA dataset. PMID:28170404
Freytag, Saskia; Burgess, Rosemary; Oliver, Karen L; Bahlo, Melanie
2017-06-08
The pathogenesis of neurological and mental health disorders often involves multiple genes, complex interactions, as well as brain- and development-specific biological mechanisms. These characteristics make identification of disease genes for such disorders challenging, as conventional prioritisation tools are not specifically tailored to deal with the complexity of the human brain. Thus, we developed a novel web application, brain-coX, that offers gene prioritisation with accompanying visualisations based on seven gene expression datasets in the post-mortem human brain, the largest such resource ever assembled. We tested whether our tool can correctly prioritise known genes from 37 brain-specific KEGG pathways and 17 psychiatric conditions. We achieved average sensitivity of nearly 50%, at the same time reaching a specificity of approximately 75%. We also compared brain-coX's performance to that of its main competitors, Endeavour and ToppGene, focusing on the ability to discover novel associations. Using a subset of the curated SFARI autism gene collection we show that brain-coX's prioritisations are most similar to SFARI's own curated gene classifications. brain-coX is the first prioritisation and visualisation web-tool targeted to the human brain and can be freely accessed via http://shiny.bioinf.wehi.edu.au/freytag.s/.
An indicator of cancer: downregulation of monoamine oxidase-A in multiple organs and species.
Rybaczyk, Leszek A; Bashaw, Meredith J; Pathak, Dorothy R; Huang, Kun
2008-03-20
Identifying consistent changes in cellular function that occur in multiple types of cancer could revolutionize the way cancer is treated. Previous work has produced promising results such as the identification of p53. Recently, drugs that affect serotonin reuptake were shown to reduce the risk of colon cancer in humans. Here, we analyze an ensemble of cancer datasets focusing on genes involved in the serotonergic pathway. Genechip datasets consisting of cancerous tissue from human, mouse, rat, or zebrafish were extracted from the GEO database. We first compared gene expression between cancerous tissues and normal tissues for each type of cancer and then identified changes that were common to a variety of cancer types. Our analysis found that significant downregulation of MAO-A, the enzyme that metabolizes serotonin, occurred in multiple tissues from humans, rodents, and fish. MAO-A expression was decreased in 95.4% of human cancer patients and 94.2% of animal cancer cases compared to the non-cancerous controls. These are the first findings that identify a single reliable change in so many different cancers. Future studies should investigate links between MAO-A suppression and the development of cancer to determine the extent to which MAO-A suppression contributes to increased cancer risk.
Molecular Subtypes of Glioblastoma Are Relevant to Lower Grade Glioma
Sloan, Andrew E.; Chen, Yanwen; Brat, Daniel J.; O’Neill, Brian Patrick; de Groot, John; Yust-Katz, Shlomit; Yung, Wai-Kwan Alfred; Cohen, Mark L.; Aldape, Kenneth D.; Rosenfeld, Steven; Verhaak, Roeland G. W.; Barnholtz-Sloan, Jill S.
2014-01-01
Background Gliomas are the most common primary malignant brain tumors in adults with great heterogeneity in histopathology and clinical course. The intent was to evaluate the relevance of known glioblastoma (GBM) expression and methylation-based subtypes to grade II and III gliomas (i.e., lower grade gliomas). Methods Gene expression array, single nucleotide polymorphism (SNP) array and clinical data were obtained for 228 GBMs and 176 grade II/III gliomas (GII/III) from the publicly available Rembrandt dataset. Two additional datasets with IDH1 mutation status were utilized as validation datasets (one publicly available dataset and one newly generated dataset from MD Anderson). Unsupervised clustering was performed and compared to gene expression subtypes assigned using the Verhaak et al. 840-gene classifier. The glioma-CpG Island Methylator Phenotype (G-CIMP) was assigned using prediction models by Fine et al. Results Unsupervised clustering by gene expression aligned with the Verhaak 840-gene subtype group assignments. GII/IIIs were preferentially assigned to the proneural subtype with IDH1 mutation and G-CIMP. GBMs were evenly distributed among the four subtypes. Proneural, IDH1 mutant, G-CIMP GII/IIIs had significantly better survival than other molecular subtypes. Only 6% of GBMs were proneural and had either IDH1 mutation or G-CIMP but these tumors had significantly better survival than other GBMs. Copy number changes in chromosomes 1p and 19q were associated with GII/IIIs, while these changes in CDKN2A, PTEN and EGFR were more commonly associated with GBMs. Conclusions GBM gene-expression and methylation-based subtypes are relevant for GII/IIIs and associate with overall survival differences. A better understanding of the association between these subtypes and GII/IIIs could further knowledge regarding prognosis and mechanisms of glioma progression. PMID:24614622
Liu, Xuewu; Huang, Yuxiao; Liang, Jiao; Zhang, Shuai; Li, Yinghui; Wang, Jun; Shen, Yan; Xu, Zhikai; Zhao, Ya
2014-11-30
The invasion of red blood cells (RBCs) by malarial parasites is an essential step in the life cycle of Plasmodium falciparum. Human-parasite surface protein interactions play a critical role in this process. Although several interactions between human and parasite proteins have been discovered, the mechanism related to invasion remains poorly understood because numerous human-parasite protein interactions have not yet been identified. High-throughput screening experiments are not feasible for malarial parasites due to difficulty in expressing the parasite proteins. Here, we performed computational prediction of the PPIs involved in malaria parasite invasion to elucidate the mechanism by which invasion occurs. In this study, an expectation maximization algorithm was used to estimate the probabilities of domain-domain interactions (DDIs). Estimates of DDI probabilities were then used to infer PPI probabilities. We found that our prediction performance was better than that based on the information of D. melanogaster alone when information related to the six species was used. Prediction performance was assessed using protein interaction data from S. cerevisiae, indicating that the predicted results were reliable. We then used the estimates of DDI probabilities to infer interactions between 490 parasite and 3,787 human membrane proteins. A small-scale dataset was used to illustrate the usability of our method in predicting interactions between human and parasite proteins. The positive predictive value (PPV) was lower than that observed in S. cerevisiae. We integrated gene expression data to improve prediction accuracy and to reduce false positives. We identified 80 membrane proteins highly expressed in the schizont stage by fast Fourier transform method. Approximately 221 erythrocyte membrane proteins were identified using published mass spectral datasets. A network consisting of 205 interactions was predicted. 
Results of network analysis suggest that SNARE proteins of parasites and APP of humans may function in the invasion of RBCs by parasites. We predicted a small-scale PPI network that may be involved in parasite invasion of RBCs by integrating DDI information and expression profiles. Experimental studies should be conducted to validate the predicted interactions. The predicted PPIs help elucidate the mechanism of parasite invasion and provide directions for future experimental investigations.
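The abstract above does not give the formula used to turn DDI probabilities into PPI probabilities. One common modelling choice, shown here purely as an assumption, is a noisy-OR rule: two proteins are taken to interact if at least one of their domain pairs interacts, with domain pairs treated as independent. The domain names and probabilities below are invented for illustration.

```python
def ppi_probability(domains_a, domains_b, ddi_prob):
    """Probability that two proteins interact under a noisy-OR assumption.

    domains_a, domains_b: iterables of domain identifiers for each protein
    ddi_prob: dict mapping frozenset({domain_1, domain_2}) -> estimated DDI
              probability; pairs missing from the dict count as 0.0
    """
    p_no_interaction = 1.0
    for da in domains_a:
        for db in domains_b:
            lam = ddi_prob.get(frozenset((da, db)), 0.0)
            p_no_interaction *= 1.0 - lam  # this domain pair fails to interact
    return 1.0 - p_no_interaction


# Hypothetical DDI estimates (e.g. as produced by an EM fit on known PPIs).
toy_ddi = {
    frozenset(("D1", "D2")): 0.5,
    frozenset(("D1", "D3")): 0.2,
}
p = ppi_probability(["D1"], ["D2", "D3"], toy_ddi)  # 1 - (0.5 * 0.8)
```

Under this rule, proteins with more candidate domain pairs accumulate a higher interaction probability, which is one reason additional filters such as expression data can help limit false positives.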
Rai, Amit; Nakaya, Taiki; Shimizu, Yohei; Rai, Megha; Nakamura, Michimi; Suzuki, Hideyuki; Saito, Kazuki; Yamazaki, Mami
2018-05-29
Lithospermum officinale is a valuable source of bioactive metabolites with medicinal and industrial values. However, little is known about genes involved in the biosynthesis of these metabolites, primarily due to the lack of genome or transcriptome resources. This study presents the first effort to establish and characterize a de novo transcriptome assembly resource for L. officinale and expression analysis for three of its tissues, namely leaf, stem, and root. Using over 4 Gbp of RNA-sequencing data, we obtained a de novo transcriptome assembly of L. officinale, consisting of 77,047 unigenes with an assembly N50 of 1524 bp. Based on transcriptome annotation and functional classification, 52,766 unigenes were assigned putative gene functions, gene ontology terms, and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. KEGG pathway and gene ontology enrichment analysis using highly expressed unigenes across three tissues and targeted metabolome analysis showed active secondary metabolic processes enriched specifically in the root of L. officinale. Using co-expression analysis, we also identified 20 and 48 unigenes representing different enzymes of lithospermic/chlorogenic acid and shikonin biosynthesis pathways, respectively. We further identified 15 candidate unigenes annotated as cytochrome P450 with the highest expression in the root of L. officinale as novel genes with a role in key biochemical reactions toward shikonin biosynthesis. Thus, through this study, we not only generated a high-quality genomic resource for L. officinale but also propose candidate genes to be involved in shikonin biosynthesis pathways for further functional characterization. Georg Thieme Verlag KG Stuttgart · New York.
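For readers unfamiliar with the N50 statistic quoted above, the following minimal sketch shows how it is conventionally computed from a list of contig or unigene lengths. The lengths used here are made up and unrelated to the L. officinale assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2.0
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() requires at least one sequence length")


# Made-up lengths: total 54, so the threshold is 27 (10 + 9 + 8 reaches it).
example_n50 = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
```

Because N50 weights sequences by length, a few long unigenes can raise it considerably, so it is usually reported alongside the total number of unigenes, as the abstract does.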
The immune gene repertoire of an important viral reservoir, the Australian black flying fox
2012-01-01
Background Bats are the natural reservoir host for a range of emerging and re-emerging viruses, including SARS-like coronaviruses, Ebola viruses, henipaviruses and Rabies viruses. However, the mechanisms responsible for the control of viral replication in bats are not understood and there is little information available on any aspect of antiviral immunity in bats. Massively parallel sequencing of the bat transcriptome provides the opportunity for rapid gene discovery. Although the genomes of one megabat and one microbat have now been sequenced to low coverage, no transcriptomic datasets have been reported from any bat species. In this study, we describe the immune transcriptome of the Australian flying fox, Pteropus alecto, providing an important resource for identification of genes involved in a range of activities including antiviral immunity. Results Towards understanding the adaptations that have allowed bats to coexist with viruses, we have de novo assembled transcriptome sequence from immune tissues and stimulated cells from P. alecto. We identified about 18,600 genes involved in a broad range of activities with the most highly expressed genes involved in cell growth and maintenance, enzyme activity, cellular components and metabolism and energy pathways. 3.5% of the bat transcribed genes corresponded to immune genes and a total of about 500 immune genes were identified, providing an overview of both innate and adaptive immunity. A small proportion of transcripts found no match with annotated sequences in any of the public databases and may represent bat-specific transcripts. Conclusions This study represents the first reported bat transcriptome dataset and provides a survey of expressed bat genes that complement existing bat genomic data. In addition, these data provide insight into genes relevant to the antiviral responses of bats, and form a basis for examining the roles of these molecules in immune response to viral infection. PMID:22716473
Anjanappa, Ravi B; Mehta, Devang; Okoniewski, Michal J; Szabelska-Berȩsewicz, Alicja; Gruissem, Wilhelm; Vanderschuren, Hervé
2018-02-01
Cassava brown streak virus (CBSV) and Ugandan cassava brown streak virus (UCBSV) are responsible for significant cassava yield losses in eastern sub-Saharan Africa. To study the possible mechanisms of plant resistance to CBSVs, we inoculated CBSV-susceptible and CBSV-resistant cassava varieties with a mixed infection of CBSVs using top-cleft grafting. Transcriptome profiling of the two cassava varieties was performed at the earliest time point of full infection (28 days after grafting) in the susceptible scions. The expression of genes encoding proteins in RNA silencing, salicylic acid pathways and callose deposition was altered in the susceptible cassava variety, but transcriptional changes were limited in the resistant variety. In total, the expression of 585 genes was altered in the resistant variety and 1292 in the susceptible variety. Transcriptional changes led to the activation of β-1,3-glucanase enzymatic activity and a reduction in callose deposition in the susceptible cassava variety. Time course analysis also showed that CBSV replication in susceptible cassava induced a strong up-regulation of RDR1, a gene previously reported to be a susceptibility factor in other potyvirus-host pathosystems. The differences in the transcriptional responses to CBSV infection indicated that susceptibility involves the restriction of callose deposition at plasmodesmata. Aniline blue staining of callose deposits also indicated that the resistant variety displays a moderate, but significant, increase in callose deposition at the plasmodesmata. Transcriptome data suggested that resistance does not involve typical antiviral defence responses (i.e. RNA silencing and salicylic acid). A meta-analysis of the current RNA-sequencing (RNA-seq) dataset and selected potyvirus-host and virus-cassava RNA-seq datasets revealed that the conservation of the host response across pathosystems is restricted to genes involved in developmental processes. © 2017 THE AUTHORS. 
MOLECULAR PLANT PATHOLOGY PUBLISHED BY BRITISH SOCIETY FOR PLANT PATHOLOGY AND JOHN WILEY & SONS LTD.
Sargeant, Tobias; Laperrière, David; Ismail, Houssam; Boucher, Geneviève; Rozendaal, Marieke; Lavallée, Vincent-Philippe; Ashton-Beaucage, Dariel; Wilhelm, Brian; Hébert, Josée; Hilton, Douglas J.
2017-01-01
Genome-wide transcriptome profiling has enabled non-supervised classification of tumours, revealing different sub-groups characterized by specific gene expression features. However, the biological significance of these subtypes remains for the most part unclear. We describe herein an interactive platform, Minimum Spanning Trees Inferred Clustering (MiSTIC), that integrates the direct visualization and comparison of the gene correlation structure between datasets, the analysis of the molecular causes underlying co-variations in gene expression in cancer samples, and the clinical annotation of tumour sets defined by the combined expression of selected biomarkers. We have used MiSTIC to highlight the roles of specific transcription factors in breast cancer subtype specification, to compare the aspects of tumour heterogeneity targeted by different prognostic signatures, and to highlight biomarker interactions in AML. A version of MiSTIC preloaded with datasets described herein can be accessed through a public web server (http://mistic.iric.ca); in addition, the MiSTIC software package can be obtained (github.com/iric-soft/MiSTIC) for local use with personalized datasets. PMID:28472340
In search of druggable targets for GBM amino acid metabolism.
Panosyan, Eduard H; Lin, Henry J; Koster, Jan; Lasky, Joseph L
2017-02-28
Amino acid (AA) pathways may contain druggable targets for glioblastoma (GBM). Literature reviews and GBM database (http://r2.amc.nl) analyses were carried out to screen for such targets among 95 AA related enzymes. First, we identified the genes that were differentially expressed in GBMs (3 datasets) compared to non-GBM brain tissues (5 datasets), or were associated with survival differences. Further, protein expression for these enzymes was also analyzed in high grade gliomas (HGGs) (proteinatlas.org). Finally, AA enzyme and gene expression were compared among the 4 TCGA (The Cancer Genome Atlas) subtypes of GBMs. We detected differences in enzymes involved in glutamate and urea cycle metabolism in GBM. For example, expression levels of BCAT1 (branched chain amino acid transferase 1) and ASL (argininosuccinate lyase) were high, but ASS1 (argininosuccinate synthase 1) was low in GBM. Proneural and neural TCGA subtypes had low expression of all three. High expression of all three correlated with worse outcome. ASL and ASS1 protein levels were mostly undetected in high grade gliomas, whereas BCAT1 was high. GSS (glutathione synthetase) was not differentially expressed, but higher levels were linked to poor progression free survival. ASPA (aspartoacylase) and GOT1 (glutamic-oxaloacetic transaminase 1) had lower expression in GBM (associated with poor outcomes). All three GABA related genes -- glutamate decarboxylase 1 (GAD1) and 2 (GAD2) and 4-aminobutyrate aminotransferase (ABAT) -- were lower in mesenchymal tumors, which in contrast showed higher IDO1 (indoleamine 2,3-dioxygenase 1) and TDO2 (tryptophan 2,3-dioxygenase). Expression of PRODH (proline dehydrogenase), a putative tumor suppressor, was lower in GBM. Higher levels predicted poor survival. Several AA-metabolizing enzymes that are higher in GBM are also linked to poor outcome (such as BCAT1), which makes them potential targets for therapeutic inhibition. 
Moreover, existing drugs that deplete asparagine and arginine may be effective against brain tumors, and should be studied in conjunction with chemotherapy. Last, AA metabolism is heterogeneous in TCGA subtypes of GBM (as well as medulloblastomas and other pediatric tumors), which may translate to variable responses to AA targeted therapies.
Li, Mingjie; Yang, Yanhui; Li, Xinyu; Gu, Li; Wang, Fengji; Feng, Fajie; Tian, Yunhe; Wang, Fengqing; Wang, Xiaoran; Lin, Wenxiong; Chen, Xinjian; Zhang, Zhongyi
2015-09-01
All tuberous roots in Rehmannia glutinosa originate from the expansion of fibrous roots (FRs), but not all FRs can successfully transform into tuberous roots. This study identified differentially expressed genes and proteins associated with the expansion of FRs, by comparing the tuberous root at expansion stages (initiated tuberous root, ITRs) and FRs at the seedling stage (initiated FRs, IFRs). The role of miRNAs in the expansion of FRs was also explored using the sRNA transcriptome and degradome to identify miRNAs and their target genes that were differentially expressed between ITRs and FRs at the mature stage (unexpanded FRs, UFRs, which are unable to expand into ITRs). A total of 6032 genes and 450 proteins were differentially expressed between ITRs and IFRs. Integrated analyses of these data revealed several genes and proteins involved in light signalling, hormone response, and signal transduction that might participate in the induction of tuberous root formation. Several genes related to cell division and cell wall metabolism were involved in initiating the expansion of IFRs. Of 135 miRNAs differentially expressed between ITRs and UFRs, there were 27 miRNAs whose targets were specifically identified in the degradome. Analysis of target genes showed that several miRNAs specifically expressed in UFRs were involved in the degradation of key genes required for the formation of tuberous roots. As far as could be ascertained, this is the first time that the miRNAs that control the transition of FRs to tuberous roots in R. glutinosa have been identified. This comprehensive analysis of 'omics' data sheds new light on the mechanisms involved in the regulation of tuberous roots formation. © The Author 2015. Published by Oxford University Press on behalf of the Society for Experimental Biology. All rights reserved. For permissions, please email: journals.permissions@oup.com.
Aspesi, Anna; Pavesi, Elisa; Robotti, Elisa; Crescitelli, Rossella; Boria, Ilenia; Avondo, Federica; Moniz, Hélène; Da Costa, Lydie; Mohandas, Narla; Roncaglia, Paola; Ramenghi, Ugo; Ronchi, Antonella; Gustincich, Stefano; Merlin, Simone; Marengo, Emilio; Ellis, Steven R.; Follenzi, Antonia; Santoro, Claudio; Dianzani, Irma
2014-01-01
Defects in genes encoding ribosomal proteins cause Diamond Blackfan Anemia (DBA), a red cell aplasia often associated with physical abnormalities. Other bone marrow failure syndromes have been attributed to defects in ribosomal components but the link between erythropoiesis and the ribosome remains to be fully defined. Several lines of evidence suggest that defects in ribosome synthesis lead to “ribosomal stress” with p53 activation and either cell cycle arrest or induction of apoptosis. Pathways independent of p53 have also been proposed to play a role in DBA pathogenesis. We took an unbiased approach to identify p53-independent pathways activated by defects in ribosome synthesis by analyzing global gene expression in various cellular models of DBA. Ranking-Principal Component Analysis (Ranking-PCA) was applied to the identified datasets to determine whether there are common sets of genes whose expression is altered in these different cellular models. We observed consistent changes in the expression of genes involved in cellular amino acid metabolic process, negative regulation of cell proliferation and cell redox homeostasis. These data indicate that cells respond to defects in ribosome synthesis by changing the level of expression of a limited subset of genes involved in critical cellular processes. Moreover, our data support a role for p53-independent pathways in the pathophysiology of DBA. PMID:24835311
Gao, Jing; Li, Yuhong; Wang, Tongmei; Shi, Zhuo; Zhang, Yiqi; Liu, Shuang; Wen, Pushuai; Ma, Chunyan
2018-03-06
The aim of this study was to identify the key genes involved in cardiac hypertrophy (CH) induced by pressure overload. The mRNA microarray datasets GSE5500 and GSE18801 were downloaded from the GEO database, and differentially expressed genes (DEGs) were screened using the Limma package; functional and pathway enrichment analyses were then performed for the common DEGs using the DAVID database. The top DEGs were further validated using qPCR in hypertrophic heart tissue induced by isoprenaline (ISO). A total of 113 common DEGs with absolute fold change >0.5, including 60 significantly up-regulated and 53 down-regulated DEGs, were obtained. GO term enrichment analysis suggested that the common up-regulated DEGs were mainly enriched in neutrophil chemotaxis, extracellular fibril organization and cell proliferation, and that the common down-regulated DEGs were significantly enriched in ion transport, endoplasmic reticulum and dendritic spine. KEGG pathway analysis found that the common DEGs were mainly enriched in ECM-receptor interaction, phagosome, and focal adhesion. Additionally, the expression of Mfap4, Ltbp2, Aspn, Serpina3n, and Cnksr1 was up-regulated in the model of cardiac hypertrophy, while the expression of Anp32a was down-regulated. The current study identified key deregulated genes and pathways involved in CH, which could shed new light on the mechanism of CH.
NASA Astrophysics Data System (ADS)
Sun, Min; Ting Li, Yi; Liu, Yang; Chin Lee, Shao; Wang, Lan
2016-01-01
Cadmium (Cd) pollution is a serious global problem that causes irreversible toxic effects in animals. The freshwater crab Sinopotamon henanense is a useful environmental indicator since it is widely distributed in benthic habitats, where it tends to accumulate Cd and other toxicants. However, its molecular responses to Cd toxicity remain unclear. In this study, we performed transcriptome sequencing and gene expression analyses of its hepatopancreas with and without Cd treatments. A total of 7.78 G clean reads were obtained from the pooled samples, and 68,648 unigenes with an average size of 622 bp were assembled, of which 5,436 were metabolism-associated and 2,728 were stimulus response-associated, the latter including 380 immunity-related unigenes. Expression profile analysis demonstrated that most genes involved in macromolecular metabolism, oxidative phosphorylation, detoxification and anti-oxidant defense were up-regulated by Cd exposure, whereas immunity-related genes were down-regulated, with the exception of genes involved in phagocytosis, which were up-regulated. The current data indicate that Cd exposure alters gene expression in a concentration-dependent manner. Our results provide the first comprehensive S. henanense transcriptome dataset, which is useful for biological and ecotoxicological studies of this crab and its related species at the molecular level, and some key Cd-responsive genes may provide candidate biomarkers for monitoring aquatic pollution by heavy metals.
Mukherjee, Angana; Jones, Jacqueline; Karanam, Balasubramanyam; Davis, Melissa; Jaynes, Jesse; Reams, R. Renee; Dean-Colomb, Windy; Yates, Clayton
2016-01-01
Kaiso, a bi-modal transcription factor, regulates gene expression, and is elevated in breast, prostate, and colon cancers. Depletion of Kaiso in other cancer types leads to a reduction in markers for the epithelial–mesenchymal transition (EMT) (Jones et al., 2014), however its clinical implications in pancreatic ductal adenocarcinoma (PDCA) have not been widely explored. PDCA is rarely detected at an early stage but is characterized by rapid progression and invasiveness. We now report the significance of the subcellular localization of Kaiso in PDCAs from African Americans. Kaiso expression is higher in the cytoplasm of invasive and metastatic pancreatic cancers. In males, cytoplasmic expression of Kaiso correlates with cancer grade and lymph node positivity. In male and female patients, cytoplasmic Kaiso expression correlates with invasiveness. Also, nuclear expression of Kaiso increases with increased invasiveness and lymph node positivity. Further, analysis of the largest PDCA dataset available on ONCOMINE shows that as Kaiso increases, there is an overall increase in Zeb1, which is the inverse for E-cadherin. Hence, these findings suggest a role for Kaiso in the progression of PDCAs, involving the EMT markers, E-cadherin and Zeb1. PMID:27424525
Epigenetic determinants of ovarian clear cell carcinoma biology
Yamaguchi, Ken; Huang, Zhiqing; Matsumura, Noriomi; Mandai, Masaki; Okamoto, Takako; Baba, Tsukasa; Konishi, Ikuo; Berchuck, Andrew; Murphy, Susan K.
2015-01-01
Targeted approaches have revealed frequent epigenetic alterations in ovarian cancer, but the scope and relation of these changes to histologic subtype of disease is unclear. Genome-wide methylation and expression data for 14 clear cell carcinoma (CCC), 32 non-CCC, and 4 corresponding normal cell lines were generated to determine how methylation profiles differ between cells of different histological derivations of ovarian cancer. Consensus clustering showed that CCC is epigenetically distinct. Inverse relationships between expression and methylation in CCC were identified, suggesting functional regulation by methylation, and included 22 hypomethylated (UM) genes and 276 hypermethylated (HM) genes. Categorical and pathway analyses indicated that the CCC-specific UM genes were involved in response to stress and many contain hepatocyte nuclear factor (HNF) 1 binding sites, while the CCC-specific HM genes included members of the estrogen receptor alpha (ERalpha) network and genes involved in tumor development. We independently validated the methylation status of 17 of these pathway-specific genes, and confirmed increased expression of HNF1 network genes and repression of ERalpha pathway genes in CCC cell lines and primary cancer tissues relative to non-CCC specimens. Treatment of three CCC cell lines with the demethylating agent Decitabine significantly induced expression for all five genes analyzed. Coordinate changes in pathway expression were confirmed using two primary ovarian cancer datasets (p<0.0001 for both). Our results suggest that methylation regulates specific pathways and biological functions in CCC, with hypomethylation influencing the characteristic biology of the disease while hypermethylation contributes to the carcinogenic process. PMID:24382740
Soul, Jamie; Hardingham, Timothy E; Boot-Handford, Raymond P; Schwartz, Jean-Marc
2015-01-29
We describe a new method, PhenomeExpress, for the analysis of transcriptomic datasets to identify pathogenic disease mechanisms. Our analysis method includes input from both protein-protein interaction and phenotype similarity networks. This introduces valuable information from disease relevant phenotypes, which aids the identification of sub-networks that are significantly enriched in differentially expressed genes and are related to the disease relevant phenotypes. This contrasts with many active sub-network detection methods, which rely solely on protein-protein interaction networks derived from compounded data of many unrelated biological conditions and which are therefore not specific to the context of the experiment. PhenomeExpress thus exploits readily available animal model and human disease phenotype information. It combines this prior evidence of disease phenotypes with the experimentally derived disease data sets to provide a more targeted analysis. Two case studies, in subchondral bone in osteoarthritis and in Pax5 in acute lymphoblastic leukaemia, demonstrate that PhenomeExpress identifies core disease pathways in both mouse and human disease expression datasets derived from different technologies. We also validate the approach by comparison to state-of-the-art active sub-network detection methods, which reveals how it may enhance the detection of molecular phenotypes and provide a more detailed context to those previously identified as possible candidates.
Investigating the Control of Chlorophyll Degradation by Genomic Correlation Mining.
Ghandchi, Frederick P; Caetano-Anolles, Gustavo; Clough, Steven J; Ort, Donald R
2016-01-01
Chlorophyll degradation is an intricate process that is critical in a variety of plant tissues at different times during the plant life cycle. Many of the photoactive chlorophyll degradation intermediates are exceptionally cytotoxic, necessitating that the pathway be carefully coordinated and regulated. The primary regulatory step in the chlorophyll degradation pathway involves the enzyme pheophorbide a oxygenase (PAO), which oxidizes the chlorophyll intermediate pheophorbide a, which is eventually converted to non-fluorescent chlorophyll catabolites. There is evidence that PAO is differentially regulated across different environmental and developmental conditions, with both transcriptional and post-transcriptional components, but the regulatory elements involved are uncertain or unknown. We hypothesized that transcription factors modulate PAO expression across different environmental conditions, such as cold and drought, as well as during developmental transitions to leaf senescence and maturation of green seeds. To test these hypotheses, several sets of Arabidopsis genomic and bioinformatic experiments were investigated and re-analyzed using computational approaches. PAO expression was compared across varied environmental conditions in three separate datasets using regression modeling and correlation mining to identify gene elements co-expressed with PAO. Their functions were investigated as candidate upstream transcription factors or other regulatory elements that may regulate PAO expression. PAO transcript expression was found to be significantly up-regulated in warm conditions, during leaf senescence, and in drought conditions, and in all three conditions it was significantly positively correlated with expression of the transcription factor Arabidopsis thaliana activating factor 1 (ATAF1), suggesting that ATAF1 is triggered in the plant response to these processes or abiotic stresses and as a result up-regulates PAO expression. The proposed regulatory network includes the freezing, senescence, and drought stress-modulating factor ATAF1 and various other transcription factors and pathways, which in turn act to regulate chlorophyll degradation by up-regulating PAO expression.
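The correlation-mining step described in this abstract can be illustrated with a minimal sketch. It assumes that "correlation mining" here means ranking candidate genes by their Pearson correlation with a target gene's expression profile, which is one common reading and not necessarily the authors' exact procedure. The gene labels (TF_A, TF_B, TF_C) and all expression values are hypothetical and are not data from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("sequences must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # A flat profile carries no correlation information; report 0.0.
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def rank_coexpressed(target, candidates):
    """Rank candidate genes by correlation with the target, highest first."""
    scores = {name: pearson(target, values) for name, values in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical expression of the target gene across six conditions.
pao = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
# Hypothetical candidate transcription factors measured in the same conditions.
candidates = {
    "TF_A": [2.0, 4.1, 6.2, 7.9, 10.0, 12.1],  # rises with the target
    "TF_B": [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],    # falls as the target rises
    "TF_C": [3.0, 3.0, 3.0, 3.0, 3.0, 3.0],    # flat profile
}
ranked = rank_coexpressed(pao, candidates)
```

A strongly positive score only flags a candidate for follow-up; correlation alone cannot establish that a transcription factor acts upstream of the target.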
Li, Yongsheng; Chen, Juan; Zhang, Jinwen; Wang, Zishan; Shao, Tingting; Jiang, Chunjie; Xu, Juan; Li, Xia
2015-09-22
Long non-coding RNAs (lncRNAs) play key roles in diverse biological processes. Moreover, the development and progression of cancer often involve the combined actions of several lncRNAs. Here we propose a multi-step method for constructing lncRNA-lncRNA functional synergistic networks (LFSNs) through co-regulation of functional modules having three features: common coexpressed genes of lncRNA pairs, enrichment in the same functional category and close proximity within protein interaction networks. Applying the method to three cancers, we constructed cancer-specific LFSNs and found that they exhibit a scale-free and modular architecture. In addition, cancer-associated lncRNAs tend to be hubs and are enriched within modules. Although there is little synergistic pairing of lncRNAs across cancers, lncRNA pairs are involved in the same cancer hallmarks by regulating the same or different biological processes. Finally, we identify prognostic biomarkers within cancer lncRNA expression datasets using modules derived from LFSNs. In summary, this proof-of-principle study indicates that synergistic lncRNA pairs can be identified through integrative analysis of genome-wide expression data sets and functional information.
Functional annotation of the vlinc class of non-coding RNAs using systems biology approach
Laurent, Georges St.; Vyatkin, Yuri; Antonets, Denis; Ri, Maxim; Qi, Yao; Saik, Olga; Shtokalo, Dmitry; de Hoon, Michiel J.L.; Kawaji, Hideya; Itoh, Masayoshi; Lassmann, Timo; Arner, Erik; Forrest, Alistair R.R.; Nicolas, Estelle; McCaffrey, Timothy A.; Carninci, Piero; Hayashizaki, Yoshihide; Wahlestedt, Claes; Kapranov, Philipp
2016-01-01
Determining the functionality of the non-coding transcripts encoded by the human genome is a coveted goal of modern genomics research. While this work has commonly relied on the classical methods of forward genetics, integration of different genomics datasets in a global Systems Biology fashion presents a more productive avenue for achieving this very complex aim. Here we report the application of a Systems Biology-based approach to dissect the functionality of a newly identified vast class of very long intergenic non-coding (vlinc) RNAs. Using the highly quantitative FANTOM5 CAGE dataset, we show that these RNAs could be grouped into 1542 novel human genes based on analysis of insulators that we show here indeed function as genomic barrier elements. We show that vlincRNA genes likely function in cis to activate nearby genes. This effect, while most pronounced in closely spaced vlincRNA–gene pairs, can be detected over relatively large genomic distances. Furthermore, we identified 101 vlincRNA genes likely involved in early embryogenesis based on patterns of their expression and regulation. We also found another 109 such genes potentially involved in cellular functions that also occur at early stages of development, such as proliferation, migration and apoptosis. Overall, we show that Systems Biology-based methods have great promise for functional annotation of non-coding RNAs. PMID:27001520
Wemheuer, Bernd; Wemheuer, Franziska; Hollensteiner, Jacqueline; Meyer, Frauke-Dorothee; Voget, Sonja; Daniel, Rolf
2015-01-01
Phytoplankton blooms exert a severe impact on bacterioplankton communities as they change nutrient availabilities and other environmental factors. In the current study, the response of a bacterioplankton community to a Phaeocystis globosa spring bloom was investigated in the southern North Sea. For this purpose, water samples were taken inside an algal spring bloom and reference samples outside it. Structural changes of the bacterioplankton community were assessed by amplicon-based analysis of 16S rRNA genes and transcripts generated from environmental DNA and RNA, respectively. Several marine groups responded to bloom presence. The abundance of the Roseobacter RCA cluster and the SAR92 clade significantly increased in bloom presence in the total and active fraction of the bacterial community. Functional changes were investigated by direct sequencing of environmental DNA and mRNA. The corresponding datasets comprised more than 500 million sequences across all samples. Metatranscriptomic datasets were mapped on representative genomes of abundant marine groups present in the samples and on assembled metagenomic and metatranscriptomic datasets. Differences in gene expression profiles between non-bloom and bloom samples were recorded. The genome-wide gene expression level of Planktomarina temperata, an abundant member of the Roseobacter RCA cluster, was higher inside the bloom. Genes that were differentially expressed included transposases, which showed increased expression levels inside the bloom. This might contribute to the adaptation of this organism toward environmental stresses through genome reorganization. In addition, several genes affiliated with the SAR92 clade were significantly upregulated inside the bloom, including genes encoding proteins involved in isoleucine and leucine incorporation. The results provide novel insights into compositional and functional variations of marine bacterioplankton communities in response to a phytoplankton bloom. PMID:26322028
Fasting and Fast Food Diet Play an Opposite Role in Mice Brain Aging.
Castrogiovanni, Paola; Li Volti, Giovanni; Sanfilippo, Cristina; Tibullo, Daniele; Galvano, Fabio; Vecchio, Michele; Avola, Roberto; Barbagallo, Ignazio; Malaguarnera, Lucia; Castorina, Sergio; Musumeci, Giuseppe; Imbesi, Rosa; Di Rosa, Michelino
2018-01-20
Fasting may be exploited as a possible strategy for prevention and treatment of several diseases such as diabetes, obesity, and aging. On the other hand, a high-fat diet (HFD) represents a risk factor for several diseases and increased mortality. The aim of the present study was to evaluate the impact of fasting on the mouse brain aging transcriptome and how HFD regulates such pathways. We used the NCBI Gene Expression Omnibus (GEO) database to identify suitable microarray datasets comparing the mouse brain transcriptome under fasting or HFD with the aged mouse brain transcriptome. Three microarray datasets were selected for this study, GSE24504, GSE6285, and GSE8150, and the principal molecular mechanisms involved in this process were evaluated. This analysis showed that, regardless of fasting duration, 21 genes were significantly upregulated and 30 significantly downregulated in mouse brain. The biological processes involved were related to cell cycle arrest, cell death inhibition, and regulation of cellular metabolism. Comparing the mouse brain transcriptome under fasting and aged conditions, we found that the number of genes in common increased with the duration of fasting, peaking at 72 h (222 genes). In addition, the mouse brain transcriptome under HFD is 30% similar to that of aged mice. Furthermore, several molecular processes were found to be shared between HFD and aging. In conclusion, we suggest that fasting and HFD play opposite roles in the brain transcriptome of aged mice. Therefore, an intermittent diet could represent a possible clinical strategy to counteract aging, loss of memory, and neuroinflammation. Furthermore, a low-fat diet leads to the inactivation of brain degenerative processes triggered by aging.
Identification of the Consistently Altered Metabolic Targets in Human Hepatocellular Carcinoma.
Nwosu, Zeribe Chike; Megger, Dominik Andre; Hammad, Seddik; Sitek, Barbara; Roessler, Stephanie; Ebert, Matthias Philip; Meyer, Christoph; Dooley, Steven
2017-09-01
Cancer cells rely on metabolic alterations to enhance proliferation and survival. Metabolic gene alterations that repeatedly occur in liver cancer are largely unknown. We aimed to identify metabolic genes that are consistently deregulated and are of potential clinical significance in human hepatocellular carcinoma (HCC). We studied the expression of 2,761 metabolic genes in 8 microarray datasets comprising 521 human HCC tissues. Genes exclusively up-regulated or down-regulated in 6 or more datasets were defined as consistently deregulated. The consistent genes that correlated with tumor progression markers (ECM2 and MMP9) (Pearson correlation P < .05) were used for Kaplan-Meier overall survival analysis in a patient cohort. We further compared proteomic expression of metabolic genes in 19 tumors vs adjacent normal liver tissues. We identified 634 consistent metabolic genes, ∼60% of which are not yet described in HCC. The down-regulated genes (n = 350) are mostly involved in physiologic hepatocyte metabolic functions (eg, xenobiotic, fatty acid, and amino acid metabolism). In contrast, the consistently up-regulated metabolic genes (n = 284) include those involved in glycolysis, pentose phosphate pathway, nucleotide biosynthesis, tricarboxylic acid cycle, oxidative phosphorylation, proton transport, membrane lipid, and glycan metabolism. Several metabolic genes (n = 434) correlated with progression markers, and of these, 201 predicted overall survival outcome in the patient cohort analyzed. Over 90% of the metabolic targets significantly altered at the protein level were similarly up- or down-regulated as in the genomic profile. We provide the first exposition of the consistently altered metabolic genes in HCC and show that these genes are potentially relevant targets for onward studies in preclinical and clinical contexts.
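The consistency rule stated in this abstract (deregulated in one direction in 6 or more datasets) can be sketched as a simple vote count. The sketch reads "exclusively" as meaning that a gene must never be called in the opposite direction in any dataset; that reading, the function name, the input layout, and the gene labels are all assumptions made for illustration.

```python
from collections import Counter


def consistent_genes(calls_per_dataset, min_datasets=6):
    """Return genes called in one direction in at least `min_datasets`
    datasets and never called in the opposite direction.

    `calls_per_dataset` is a list with one dict per dataset, mapping a
    gene name to "up" or "down"; genes absent from a dict are treated
    as unchanged in that dataset.
    """
    up = Counter()
    down = Counter()
    for calls in calls_per_dataset:
        for gene, direction in calls.items():
            if direction == "up":
                up[gene] += 1
            elif direction == "down":
                down[gene] += 1
    result = {}
    for gene in set(up) | set(down):
        if up[gene] >= min_datasets and down[gene] == 0:
            result[gene] = "up"
        elif down[gene] >= min_datasets and up[gene] == 0:
            result[gene] = "down"
    return result


# Hypothetical calls across 8 datasets (labels are placeholders, not study genes).
datasets = [{} for _ in range(8)]
for i in range(7):
    datasets[i]["GENE_A"] = "up"      # up in 7 of 8 datasets
for i in range(6):
    datasets[i]["GENE_B"] = "down"    # down in 6 of 8 datasets
for i in range(6):
    datasets[i]["GENE_C"] = "up"      # up in 6 datasets ...
datasets[6]["GENE_C"] = "down"        # ... but down in one, so not exclusive
for i in range(3):
    datasets[i]["GENE_D"] = "up"      # up in only 3 datasets

consistent = consistent_genes(datasets)
```

With this input only GENE_A and GENE_B pass: GENE_C meets the count but is called in both directions, and GENE_D falls short of the threshold.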
2012-01-01
Background: It is known from recent studies that more than 90% of human multi-exon genes are subject to Alternative Splicing (AS), a key molecular mechanism in which multiple transcripts may be generated from a single gene. It is widely recognized that a breakdown in AS mechanisms plays an important role in cellular differentiation and pathologies. Polymerase Chain Reactions, microarrays and sequencing technologies have been applied to the study of transcript diversity arising from alternative expression. The latest-generation Affymetrix GeneChip Human Exon 1.0 ST Arrays offer a more detailed view of the gene expression profile, providing information on the AS patterns. The exon array technology, with more than five million data points, can detect approximately one million exons, and it allows analyses at both gene and exon level. In this paper we describe BEAT, an integrated user-friendly bioinformatics framework to store, analyze and visualize exon array datasets. It combines a data warehouse approach with some rigorous statistical methods for assessing the AS of genes involved in diseases. Meta statistics are proposed as a novel approach to explore the analysis results. BEAT is available at http://beat.ba.itb.cnr.it. Results: BEAT is a web tool which allows uploading and analyzing exon array datasets using standard statistical methods and an easy-to-use graphical web front-end. BEAT has been tested on a dataset with 173 samples and tuned using new datasets of exon array experiments from 28 colorectal cancer and 26 renal cell cancer samples produced at the Medical Genetics Unit of IRCCS Casa Sollievo della Sofferenza. To highlight all possible AS events, alternative names, accession IDs, Gene Ontology terms and biochemical pathway annotations are integrated with exon and gene level expression plots. The user can customize the results by choosing custom thresholds for the statistical parameters and exploiting the available clinical data of the samples for a multivariate AS analysis. Conclusions: Despite exon array chips being widely used for transcriptomics studies, there is a lack of analysis tools offering advanced statistical features and requiring no programming knowledge. BEAT provides a user-friendly platform for a comprehensive study of AS events in human diseases, displaying the analysis results with easily interpretable and interactive tables and graphics. PMID:22536968
Lu, Yan; Wang, Liang; Liu, Pengyuan; Yang, Ping; You, Ming
2012-01-01
About 30% of stage I non-small cell lung cancer (NSCLC) patients undergoing resection will recur. Robust prognostic markers are required to better manage therapy options. The purpose of this study is to develop and validate a novel gene-expression signature that can predict tumor recurrence of stage I NSCLC patients. Cox proportional hazards regression analysis was performed to identify recurrence-related genes and a partial Cox regression model was used to generate a gene signature of recurrence in the training dataset: 142 stage I lung adenocarcinomas without adjunctive therapy from the Director's Challenge Consortium. Four independent validation datasets, including GSE5843, GSE8894, and two other datasets provided by Mayo Clinic and Washington University, were used to assess the prediction accuracy by calculating the correlation between the risk score estimated from gene expression and real recurrence-free survival time, and the AUC of time-dependent ROC analysis. Pathway-based survival analyses were also performed. A total of 104 probesets correlated with recurrence in the training dataset. They are enriched in cell adhesion, apoptosis and regulation of cell proliferation. A 51-gene expression signature was identified to distinguish patients likely to develop tumor recurrence (Dxy = −0.83, P<1e-16) and this signature was validated in four independent datasets with AUC >85%. Multiple pathways including leukocyte transendothelial migration and cell adhesion were highly correlated with recurrence-free survival. The gene signature is highly predictive of recurrence in stage I NSCLC patients, which has important prognostic and therapeutic implications for the future management of these patients. PMID:22292069
Harnessing Diversity towards the Reconstructing of Large Scale Gene Regulatory Networks
Yamanaka, Ryota; Kitano, Hiroaki
2013-01-01
Elucidating gene regulatory networks (GRNs) from large scale experimental data remains a central challenge in systems biology. Recently, numerous techniques, particularly consensus-driven approaches combining different algorithms, have emerged as a potentially promising strategy to infer accurate GRNs. Here, we develop a novel consensus inference algorithm, TopkNet, that can integrate multiple algorithms to infer GRNs. Comprehensive performance benchmarking on a cloud computing framework demonstrated that (i) a simple strategy to combine many algorithms does not always lead to performance improvement compared to the cost of consensus and (ii) TopkNet integrating only high-performance algorithms provides significant performance improvement compared to the best individual algorithms and community prediction. These results suggest that a priori determination of high-performance algorithms is key to reconstructing an unknown regulatory network. Similarity among gene-expression datasets can be useful to determine potential optimal algorithms for reconstruction of unknown regulatory networks, i.e., if expression data associated with a known regulatory network are similar to those associated with an unknown regulatory network, optimal algorithms determined for the known regulatory network can be repurposed to infer the unknown regulatory network. Based on this observation, we developed a quantitative measure of similarity among gene-expression datasets and demonstrated that, if similarity between the two expression datasets is high, TopkNet integrating algorithms that are optimal for the known dataset performs well on the unknown dataset. The consensus framework, TopkNet, together with the similarity measure proposed in this study provides a powerful strategy towards harnessing the wisdom of the crowds in reconstruction of unknown regulatory networks. PMID:24278007
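The idea of combining predictions from several inference algorithms, or from only a chosen subset of them, can be sketched with a generic average-rank consensus. This is not the TopkNet algorithm, whose details the abstract does not give; it is a commonly used baseline shown only to make the notion of "consensus over selected algorithms" concrete. The algorithm names, edge labels, and confidence scores are hypothetical.

```python
def average_rank_consensus(scores_by_algorithm, selected=None):
    """Order candidate edges by their mean rank across algorithms.

    `scores_by_algorithm` maps an algorithm name to a dict of
    {(regulator, target): confidence}. `selected` optionally restricts
    the consensus to a subset of algorithm names. Edges missing from an
    algorithm's output are scored 0.0 for that algorithm.
    """
    names = list(selected) if selected is not None else list(scores_by_algorithm)
    edges = set()
    for name in names:
        edges.update(scores_by_algorithm[name])
    rank_sum = {edge: 0.0 for edge in edges}
    for name in names:
        scores = scores_by_algorithm[name]
        # Highest confidence gets rank 1; ties are broken by edge label.
        ordered = sorted(edges, key=lambda e: (-scores.get(e, 0.0), e))
        for rank, edge in enumerate(ordered, start=1):
            rank_sum[edge] += rank
    return sorted(edges, key=lambda e: (rank_sum[e] / len(names), e))


# Hypothetical confidence scores from three inference algorithms.
scores = {
    "alg1": {("g1", "g2"): 0.9, ("g1", "g3"): 0.2, ("g2", "g3"): 0.5},
    "alg2": {("g1", "g2"): 0.8, ("g1", "g3"): 0.1, ("g2", "g3"): 0.6},
    "alg3": {("g1", "g2"): 0.1, ("g1", "g3"): 0.9, ("g2", "g3"): 0.2},
}
all_algorithms = average_rank_consensus(scores)
only_alg3 = average_rank_consensus(scores, selected=["alg3"])
```

Comparing `all_algorithms` with `only_alg3` shows how the choice of which algorithms enter the consensus changes the final edge ordering, which is the sensitivity the abstract attributes to including or excluding lower-performing methods.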
Muldoon, P P; Jackson, K J; Perez, E; Harenza, J L; Molas, S; Rais, B; Anwar, H; Zaveri, N T; Maldonado, R; Maskos, U; McIntosh, J M; Dierssen, M; Miles, M F; Chen, X; De Biasi, M; Damaj, M I
2014-08-01
Recent data have indicated that α3β4* neuronal nicotinic (n) ACh receptors may play a role in morphine dependence. Here we investigated if nACh receptors modulate morphine physical withdrawal. To assess the role of α3β4* nACh receptors in morphine withdrawal, we used a genetic correlation approach using publicly available datasets within the GeneNetwork web resource, genetic knockout and pharmacological tools. Male and female European-American (n = 2772) and African-American (n = 1309) subjects from the Study of Addiction: Genetics and Environment dataset were assessed for possible associations of polymorphisms in the 15q25 gene cluster and opioid dependence. BXD recombinant mouse lines demonstrated an increased expression of α3, β4 and α5 nACh receptor mRNA in the forebrain and midbrain, which significantly correlated with increased defecation in mice undergoing morphine withdrawal. Mice overexpressing the gene cluster CHRNA5/A3/B4 exhibited increased somatic signs of withdrawal. Furthermore, α5 and β4 nACh receptor knockout mice expressed decreased somatic withdrawal signs compared with their wild-type counterparts. Moreover, selective α3β4* nACh receptor antagonists, α-conotoxin AuIB and AT-1001, attenuated somatic signs of morphine withdrawal in a dose-related manner. In addition, two human datasets revealed a protective role for variants in the CHRNA3 gene, which codes for the α3 nACh receptor subunit, in opioid dependence and withdrawal. In contrast, we found that the α4β2* nACh receptor subtype is not involved in morphine somatic withdrawal signs. Overall, our findings suggest an important role for the α3β4* nACh receptor subtype in morphine physical dependence. © 2014 The British Pharmacological Society.
MicroRNA Array Normalization: An Evaluation Using a Randomized Dataset as the Benchmark
Qin, Li-Xuan; Zhou, Qin
2014-01-01
MicroRNA arrays possess a number of unique data features that challenge the assumption key to many normalization methods. We assessed the performance of existing normalization methods using two microRNA array datasets derived from the same set of tumor samples: one dataset was generated using a blocked randomization design when assigning arrays to samples and hence was free of confounding array effects; the second dataset was generated without blocking or randomization and exhibited array effects. The randomized dataset was assessed for differential expression between two tumor groups and treated as the benchmark. The non-randomized dataset was assessed for differential expression after normalization and compared against the benchmark. Normalization improved the true positive rate significantly in the non-randomized data but still possessed a false discovery rate as high as 50%. Adding a batch adjustment step before normalization further reduced the number of false positive markers while maintaining a similar number of true positive markers, which resulted in a false discovery rate of 32% to 48%, depending on the specific normalization method. We concluded the paper with some insights on possible causes of false discoveries to shed light on how to improve normalization for microRNA arrays. PMID:24905456
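The comparison against a benchmark described in this abstract rests on two standard quantities: the true positive rate (the share of benchmark markers that were recovered) and the false discovery rate (the share of called markers that are not in the benchmark). A minimal sketch, assuming each analysis yields a set of marker identifiers; the function name and the marker labels are hypothetical and are not data from the study.

```python
def benchmark_rates(called, benchmark):
    """Compare called markers with a benchmark set.

    Returns the counts of true and false positives, the false discovery
    rate (false positives / all called), and the true positive rate
    (true positives / all benchmark markers).
    """
    called = set(called)
    benchmark = set(benchmark)
    true_positives = len(called & benchmark)
    false_positives = len(called - benchmark)
    fdr = false_positives / len(called) if called else 0.0
    tpr = true_positives / len(benchmark) if benchmark else 0.0
    return {
        "true_positives": true_positives,
        "false_positives": false_positives,
        "fdr": fdr,
        "tpr": tpr,
    }


# Hypothetical marker sets: the benchmark comes from the randomized design,
# the called set from the non-randomized data after normalization.
benchmark_markers = {"m1", "m2", "m3", "m4"}
called_markers = {"m1", "m2", "m3", "m7", "m8", "m9"}
rates = benchmark_rates(called_markers, benchmark_markers)
```

In this toy case three of the four benchmark markers are recovered, yet half of the calls are false, so a high true positive rate and a high false discovery rate can coexist, as the abstract reports.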
Automated Discovery of Functional Generality of Human Gene Expression Programs
Gerber, Georg K; Dowell, Robin D; Jaakkola, Tommi S; Gifford, David K
2007-01-01
An important research problem in computational biology is the identification of expression programs, sets of co-expressed genes orchestrating normal or pathological processes, and the characterization of the functional breadth of these programs. The use of human expression data compendia for discovery of such programs presents several challenges including cellular inhomogeneity within samples, genetic and environmental variation across samples, uncertainty in the numbers of programs and sample populations, and temporal behavior. We developed GeneProgram, a new unsupervised computational framework based on Hierarchical Dirichlet Processes that addresses each of the above challenges. GeneProgram uses expression data to simultaneously organize tissues into groups and genes into overlapping programs with consistent temporal behavior, to produce maps of expression programs, which are sorted by generality scores that exploit the automatically learned groupings. Using synthetic and real gene expression data, we showed that GeneProgram outperformed several popular expression analysis methods. We applied GeneProgram to a compendium of 62 short time-series gene expression datasets exploring the responses of human cells to infectious agents and immune-modulating molecules. GeneProgram produced a map of 104 expression programs, a substantial number of which were significantly enriched for genes involved in key signaling pathways and/or bound by NF-κB transcription factors in genome-wide experiments. Further, GeneProgram discovered expression programs that appear to implicate surprising signaling pathways or receptor types in the response to infection, including Wnt signaling and neurotransmitter receptors. 
We believe the discovered map of expression programs involved in the response to infection will be useful for guiding future biological experiments; genes from programs with low generality scores might serve as new drug targets that exhibit minimal “cross-talk,” and genes from high generality programs may maintain common physiological responses that go awry in disease states. Further, our method is multipurpose, and can be applied readily to novel compendia of biological data. PMID:17696603
A Convex Formulation for Learning a Shared Predictive Structure from Multiple Tasks
Chen, Jianhui; Tang, Lei; Liu, Jun; Ye, Jieping
2013-01-01
In this paper, we consider the problem of learning from multiple related tasks for improved generalization performance by extracting their shared structures. The alternating structure optimization (ASO) algorithm, which couples all tasks using a shared feature representation, has been successfully applied in various multitask learning problems. However, ASO is nonconvex and the alternating algorithm only finds a local solution. We first present an improved ASO formulation (iASO) for multitask learning based on a new regularizer. We then convert iASO, a nonconvex formulation, into a relaxed convex one (rASO). Interestingly, our theoretical analysis reveals that rASO finds a globally optimal solution to its nonconvex counterpart iASO under certain conditions. rASO can be equivalently reformulated as a semidefinite program (SDP), which is, however, not scalable to large datasets. We propose to employ the block coordinate descent (BCD) method and the accelerated projected gradient (APG) algorithm separately to find the globally optimal solution to rASO; we also develop efficient algorithms for solving the key subproblems involved in BCD and APG. The experiments on the Yahoo webpages datasets and the Drosophila gene expression pattern images datasets demonstrate the effectiveness and efficiency of the proposed algorithms and confirm our theoretical analysis. PMID:23520249
Mazzarelli, Joan M; Brestelli, John; Gorski, Regina K; Liu, Junmin; Manduchi, Elisabetta; Pinney, Deborah F; Schug, Jonathan; White, Peter; Kaestner, Klaus H; Stoeckert, Christian J
2007-01-01
EPConDB (http://www.cbil.upenn.edu/EPConDB) is a public web site that supports research in diabetes, pancreatic development and beta-cell function by providing information about genes expressed in cells of the pancreas. EPConDB displays expression profiles for individual genes and information about transcripts, promoter elements and transcription factor binding sites. Gene expression results are obtained from studies examining tissue expression, pancreatic development and growth, differentiation of insulin-producing cells, islet or beta-cell injury, and genetic models of impaired beta-cell function. The expression datasets are derived using different microarray platforms, including the BCBC PancChips and Affymetrix gene expression arrays. Other datasets include semi-quantitative RT-PCR and MPSS expression studies. For selected microarray studies, lists of differentially expressed genes, derived from PaGE analysis, are displayed on the site. EPConDB provides database queries and tools to examine the relationship between a gene, its transcriptional regulation, protein function and expression in pancreatic tissues.
A Morpholino-based screen to identify novel genes involved in craniofacial morphogenesis
Melvin, Vida Senkus; Feng, Weiguo; Hernandez-Lagunas, Laura; Artinger, Kristin Bruk; Williams, Trevor
2014-01-01
BACKGROUND The regulatory mechanisms underpinning facial development are conserved between diverse species. Therefore, results from model systems provide insight into the genetic causes of human craniofacial defects. Previously, we generated a comprehensive dataset examining gene expression during development and fusion of the mouse facial prominences. Here, we used this resource to identify genes that have dynamic expression patterns in the facial prominences, but for which only limited information exists concerning developmental function. RESULTS This set of ~80 genes was used for a high throughput functional analysis in the zebrafish system using Morpholino gene knockdown technology. This screen revealed three classes of cranial cartilage phenotypes depending upon whether knockdown of the gene affected the neurocranium, viscerocranium, or both. The targeted genes that produced consistent phenotypes encoded proteins linked to transcription (meis1, meis2a, tshz2, vgll4l), signaling (pkdcc, vlk, macc1, wu:fb16h09), and extracellular matrix function (smoc2). The majority of these phenotypes were not altered by reduction of p53 levels, demonstrating that both p53 dependent and independent mechanisms were involved in the craniofacial abnormalities. CONCLUSIONS This Morpholino-based screen highlights new genes involved in development of the zebrafish craniofacial skeleton with wider relevance to formation of the face in other species, particularly mouse and human. PMID:23559552
Kaushik, Abhinav; Ali, Shakir; Gupta, Dinesh
2017-01-01
Gene connection rewiring is an essential feature of gene network dynamics. Apart from its normal functional role, it may also lead to dysregulated functional states by disturbing pathway homeostasis. Very few computational tools measure rewiring within gene co-expression and its corresponding regulatory networks in order to identify and prioritize altered pathways which may or may not be differentially regulated. We have developed Altered Pathway Analyzer (APA), a microarray dataset analysis tool for identification and prioritization of altered pathways, including those which are differentially regulated by TFs, by quantifying rewired sub-network topology. Moreover, APA also helps in re-prioritization of APA shortlisted altered pathways enriched with context-specific genes. We performed APA analysis of simulated datasets and p53 status NCI-60 cell line microarray data to demonstrate potential of APA for identification of several case-specific altered pathways. APA analysis reveals several altered pathways not detected by other tools evaluated by us. APA analysis of unrelated prostate cancer datasets identifies sample-specific as well as conserved altered biological processes, mainly associated with lipid metabolism, cellular differentiation and proliferation. APA is designed as a cross platform tool which may be transparently customized to perform pathway analysis in different gene expression datasets. APA is freely available at http://bioinfo.icgeb.res.in/APA. PMID:28084397
CoINcIDE: A framework for discovery of patient subtypes across multiple datasets.
Planey, Catherine R; Gevaert, Olivier
2016-03-09
Patient disease subtypes have the potential to transform personalized medicine. However, many patient subtypes derived from unsupervised clustering analyses on high-dimensional datasets are not replicable across multiple datasets, limiting their clinical utility. We present CoINcIDE, a novel methodological framework for the discovery of patient subtypes across multiple datasets that requires no between-dataset transformations. We also present a high-quality database collection, curatedBreastData, with over 2,500 breast cancer gene expression samples. We use CoINcIDE to discover novel breast and ovarian cancer subtypes with prognostic significance and novel hypothesized ovarian therapeutic targets across multiple datasets. CoINcIDE and curatedBreastData are available as R packages.
Evidence of function for conserved noncoding sequences in Arabidopsis thaliana.
Spangler, Jacob B; Subramaniam, Sabarinath; Freeling, Michael; Feltus, F Alex
2012-01-01
• Whole genome duplication events provide a lineage with a large reservoir of genes that can be molded by evolutionary forces into phenotypes that fit alternative environments. A well-studied whole genome duplication, the α-event, occurred in an ancestor of the model plant Arabidopsis thaliana. Retained segments of the α-event have been defined in recent years in the form of duplicate protein coding sequences (α-pairs) and associated conserved noncoding DNA sequences (CNSs). Our aim was to identify any association between CNSs and α-pair co-functionality at the gene expression level. • Here, we tested for correlation between CNS counts and α-pair co-expression and expression intensity across nine expression datasets: aerial tissue, flowers, leaves, roots, rosettes, seedlings, seeds, shoots and whole plants. • We provide evidence for a putative regulatory role of the CNSs. The association of CNSs with α-pair co-expression and expression intensity varied by gene function, subgene position and the presence of transcription factor binding motifs. A range of possible CNS regulatory mechanisms, including intron-mediated enhancement, messenger RNA fold stability and transcriptional regulation, are discussed. • This study provides a framework to understand how CNS motifs are involved in the maintenance of gene expression after a whole genome duplication event. © 2011 The Authors. New Phytologist © 2011 New Phytologist Trust.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Corbin, Cyrielle; Drouet, Samantha; Markulin, Lucija
Identification of DIR encoding genes in flax genome. Analysis of phylogeny, gene/protein structures and evolution. Identification of new conserved motifs linked to biochemical functions. Investigation of spatio-temporal gene expression and response to stress. Dirigent proteins (DIRs) were discovered during 8-8' lignan biosynthesis studies, through identification of stereoselective coupling to afford either (+)- or (-)-pinoresinols from E-coniferyl alcohol. DIRs are also involved or potentially involved in terpenoid, allyl/propenyl phenol lignan, pterocarpan and lignin biosynthesis. DIRs have very large multigene families in different vascular plants including flax, with most still of unknown function. DIR studies typically focus on a small subset of genes and identification of biochemical/physiological functions. Herein, a genome-wide analysis and characterization of the predicted flax DIR 44-membered multigene family was performed, this species being a rich natural grain source of 8-8' linked secoisolariciresinol-derived lignan oligomers. All predicted DIR sequences, including their promoters, were analyzed together with their public gene expression datasets. Expression patterns of selected DIRs were examined using qPCR, as well as through clustering analysis of DIR gene expression. These analyses further implicated roles for specific DIRs in (-)-pinoresinol formation in seed-coats, as well as (+)-pinoresinol in vegetative organs and/or specific responses to stress. Phylogeny and gene expression analysis segregated flax DIRs into six distinct clusters with new cluster-specific motifs identified. We propose that these findings can serve as a foundation to further systematically determine functions of DIRs, i.e. other than those already known in lignan biosynthesis in flax and other species. 
Given the differential expression profiles and inducibility of the flax DIR family, we provisionally propose that some DIR genes of unknown function could be involved in different aspects of secondary cell wall biosynthesis and plant defense.
Corbin, Cyrielle; Drouet, Samantha; Markulin, Lucija; Auguin, Daniel; Lainé, Éric; Davin, Laurence B; Cort, John R; Lewis, Norman G; Hano, Christophe
2018-05-01
Identification of DIR encoding genes in flax genome. Analysis of phylogeny, gene/protein structures and evolution. Identification of new conserved motifs linked to biochemical functions. Investigation of spatio-temporal gene expression and response to stress. Dirigent proteins (DIRs) were discovered during 8-8' lignan biosynthesis studies, through identification of stereoselective coupling to afford either (+)- or (-)-pinoresinols from E-coniferyl alcohol. DIRs are also involved or potentially involved in terpenoid, allyl/propenyl phenol lignan, pterocarpan and lignin biosynthesis. DIRs have very large multigene families in different vascular plants including flax, with most still of unknown function. DIR studies typically focus on a small subset of genes and identification of biochemical/physiological functions. Herein, a genome-wide analysis and characterization of the predicted flax DIR 44-membered multigene family was performed, this species being a rich natural grain source of 8-8' linked secoisolariciresinol-derived lignan oligomers. All predicted DIR sequences, including their promoters, were analyzed together with their public gene expression datasets. Expression patterns of selected DIRs were examined using qPCR, as well as through clustering analysis of DIR gene expression. These analyses further implicated roles for specific DIRs in (-)-pinoresinol formation in seed-coats, as well as (+)-pinoresinol in vegetative organs and/or specific responses to stress. Phylogeny and gene expression analysis segregated flax DIRs into six distinct clusters with new cluster-specific motifs identified. We propose that these findings can serve as a foundation to further systematically determine functions of DIRs, i.e. other than those already known in lignan biosynthesis in flax and other species. 
Given the differential expression profiles and inducibility of the flax DIR family, we provisionally propose that some DIR genes of unknown function could be involved in different aspects of secondary cell wall biosynthesis and plant defense.
Sharma, Priyanka; Saraya, Anoop; Sharma, Rinu
2018-01-30
To evaluate the diagnostic potential of a six-microRNA (miRNA) panel consisting of miR-21, miR-144, miR-107, miR-342, miR-93 and miR-152 for esophageal cancer (EC) detection. The expression of miRNAs was analyzed in EC sera samples using quantitative real-time PCR. Risk score analysis was performed and linear regression models were then fitted to generate the six-miRNA panel. In addition, we made an effort to identify significantly dysregulated miRNAs and mRNAs in EC using the Cancer Genome Atlas (TCGA) miRNAseq and RNAseq datasets, respectively. Further, we identified significantly correlated miRNA-mRNA target pairs by integrating the TCGA EC miRNAseq dataset with the RNAseq dataset. The panel of circulating miRNAs showed enhanced sensitivity (87.5%) and specificity (90.48%) in terms of discriminating EC patients from normal subjects (area under the curve [AUC] = 0.968). Pathway enrichment analysis for potential targets of the six miRNAs revealed 48 significant (P < 0.05) pathways, viz. pathways in cancer, mRNA surveillance, MAPK, Wnt, mTOR signaling, and so on. The expression data for mRNAs and miRNAs, downloaded from the TCGA database, led to the identification of 2309 differentially expressed genes and 189 miRNAs. Gene ontology and pathway enrichment analysis showed that cell-cycle processes were most significantly enriched for differentially expressed mRNA. Integrated analysis of TCGA miRNAseq and RNAseq datasets resulted in identification of 53 063 significantly and negatively correlated miRNA-mRNA pairs. In summary, a novel and highly sensitive signature of serum miRNAs was identified for EC detection. Moreover, this is the first report identifying miRNA-mRNA target pairs from the EC TCGA dataset, thus providing a comprehensive resource for understanding the interactions existing between miRNAs and their target mRNAs in EC. © 2018 John Wiley & Sons Australia, Ltd.
Juraeva, Dilafruz; Haenisch, Britta; Zapatka, Marc; Frank, Josef; Witt, Stephanie H; Mühleisen, Thomas W; Treutlein, Jens; Strohmaier, Jana; Meier, Sandra; Degenhardt, Franziska; Giegling, Ina; Ripke, Stephan; Leber, Markus; Lange, Christoph; Schulze, Thomas G; Mössner, Rainald; Nenadic, Igor; Sauer, Heinrich; Rujescu, Dan; Maier, Wolfgang; Børglum, Anders; Ophoff, Roel; Cichon, Sven; Nöthen, Markus M; Rietschel, Marcella; Mattheisen, Manuel; Brors, Benedikt
2014-06-01
In the present study, an integrated hierarchical approach was applied to: (1) identify pathways associated with susceptibility to schizophrenia; (2) detect genes that may be potentially affected in these pathways since they contain an associated polymorphism; and (3) annotate the functional consequences of such single-nucleotide polymorphisms (SNPs) in the affected genes or their regulatory regions. The Global Test was applied to detect schizophrenia-associated pathways using discovery and replication datasets comprising 5,040 and 5,082 individuals of European ancestry, respectively. Information concerning functional gene-sets was retrieved from the Kyoto Encyclopedia of Genes and Genomes, Gene Ontology, and the Molecular Signatures Database. Fourteen of the gene-sets or pathways identified in the discovery dataset were confirmed in the replication dataset. These include functional processes involved in transcriptional regulation and gene expression, synapse organization, cell adhesion, and apoptosis. For two genes, i.e. CTCF and CACNB2, evidence for association with schizophrenia was available (at the gene-level) in both the discovery study and published data from the Psychiatric Genomics Consortium schizophrenia study. Furthermore, these genes mapped to four of the 14 presently identified pathways. Several of the SNPs assigned to CTCF and CACNB2 have potential functional consequences, and a gene in close proximity to CACNB2, i.e. ARL5B, was identified as a potential gene of interest. Application of the present hierarchical approach thus allowed: (1) identification of novel biological gene-sets or pathways with potential involvement in the etiology of schizophrenia, as well as replication of these findings in an independent cohort; (2) detection of genes of interest for future follow-up studies; and (3) the highlighting of novel genes in previously reported candidate regions for schizophrenia.
Genomic convergence to identify candidate genes for Alzheimer disease on chromosome 10
Liang, Xueying; Slifer, Michael; Martin, Eden R.; Schnetz-Boutaud, Nathalie; Bartlett, Jackie; Anderson, Brent; Züchner, Stephan; Gwirtsman, Harry; Gilbert, John R.; Pericak-Vance, Margaret A.; Haines, Jonathan L.
2009-01-01
A broad region of chromosome 10 (chr10) has engendered continued interest in the etiology of late-onset Alzheimer Disease (LOAD) from both linkage and candidate gene studies. However, there is very extensive heterogeneity on chr10. We converged linkage analysis and gene expression data using the concept of genomic convergence, which suggests that genes showing positive results across multiple different data types are more likely to be involved in AD. We identified and examined 28 genes on chr10 for association with AD in a Caucasian case-control dataset of 506 cases and 558 controls with substantial clinical information. The cases were all LOAD (minimum age at onset ≥ 60 years). Both single marker and haplotypic associations were tested in the overall dataset and 8 subsets defined by age, gender, ApoE and clinical status. PTPLA showed allelic, genotypic and haplotypic association in the overall dataset. SORCS1 was significant in the overall dataset (p=0.0025) and most significant in the female subset (allelic association p=0.00002, a 3-locus haplotype had p=0.0005). The odds ratio for SORCS1 in the female subset was 1.7 (p<0.0001). SORCS1 is an interesting candidate gene involved in the Aβ pathway. Therefore, genetic variations in PTPLA and SORCS1 may be associated with, and have a modest effect on, the risk of AD by affecting the Aβ pathway. Replication of the effect of these genes in different study populations, a search for susceptibility variants, and functional studies of these genes are necessary to better understand the roles of the genes in Alzheimer disease. PMID:19241460
SPICE for ESA Planetary Missions
NASA Astrophysics Data System (ADS)
Costa, M.
2017-09-01
SPICE is an information system that provides the geometry needed to plan scientific observations and to analyze the data obtained. The ESA SPICE Service generates the SPICE Kernel datasets for all the active ESA missions. This contribution describes the current status of the datasets, the extended services and the SPICE support provided to the ESA Planetary Missions (Mars-Express, ExoMars2016, BepiColombo, JUICE, Rosetta, Venus-Express and SMART-1) for the benefit of the science community.
Phosphoproteome and transcriptome analyses of ErbB ligand-stimulated MCF-7 cells.
Nagashima, Takeshi; Oyama, Masaaki; Kozuka-Hata, Hiroko; Yumoto, Noriko; Sakaki, Yoshiyuki; Hatakeyama, Mariko
2008-01-01
Cellular signal transduction pathways and gene expression are tightly regulated to accommodate changes in response to physiological environments. In the current study, molecules were identified that are activated as a result of intracellular signaling and immediately expressed as mRNA in MCF-7 breast cancer cells shortly after stimulation with the ErbB receptor ligands epidermal growth factor (EGF) or heregulin (HRG). For the identification of tyrosine-phosphorylated proteins and expressed genes, a SILAC (stable isotopic labeling using amino acids in cell culture) method and the Affymetrix gene expression array system, respectively, were used. Unexpectedly, the overlap of genes appearing in the two experimental datasets was very low for HRG (43 hits in the proteome data, 1,655 in the transcriptome data, and 5 hits common to both datasets), while no overlapping gene was detected for EGF (15 hits in the proteome data, 211 hits in the transcriptome data, and no hits common to both datasets). The HRG overlapping genes included ERBB2, NEDD9, MAPK3, JUP and EPHA2. Biological pathway analysis indicated that HRG-stimulated molecular activation is significantly related to cancer pathways including bladder cancer, chronic myeloid leukemia and pancreatic cancer (p < 0.05). The proteome datasets of EGF and HRG contain molecules that are related to axon guidance, ErbB signaling and VEGF signaling at a high rate.
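To put the reported overlap in perspective, the hit counts given in this abstract can be turned into a Jaccard index (shared hits divided by the size of the union). This is only an illustrative calculation. The abstract does not say that the authors used this measure, and the function name is hypothetical.

```python
def jaccard_from_counts(n_a, n_b, n_common):
    """Jaccard index computed from two set sizes and the size of their overlap."""
    union = n_a + n_b - n_common
    return n_common / union if union else 0.0


# Counts as reported in the abstract: proteome hits, transcriptome hits, shared hits.
hrg = jaccard_from_counts(43, 1655, 5)
egf = jaccard_from_counts(15, 211, 0)
print(f"HRG: {hrg:.4f}  EGF: {egf:.4f}")
```

For HRG the union contains 43 + 1,655 - 5 = 1,693 genes, so the five shared genes correspond to an index of roughly 0.003.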
CHAI, Lian En; LAW, Chow Kuan; MOHAMAD, Mohd Saberi; CHONG, Chuii Khim; CHOON, Yee Wen; DERIS, Safaai; ILLIAS, Rosli Md
2014-01-01
Background: Gene expression data often contain missing expression values. Therefore, several imputation methods have been applied to solve the missing values, which include k-nearest neighbour (kNN), local least squares (LLS), and Bayesian principal component analysis (BPCA). However, the effects of these imputation methods on the modelling of gene regulatory networks from gene expression data have rarely been investigated and analysed using a dynamic Bayesian network (DBN). Methods: In the present study, we separately imputed datasets of the Escherichia coli S.O.S. DNA repair pathway and the Saccharomyces cerevisiae cell cycle pathway with kNN, LLS, and BPCA, and subsequently used these to generate gene regulatory networks (GRNs) using a discrete DBN. We made comparisons on the basis of previous studies in order to select the gene network with the least error. Results: We found that BPCA and LLS performed better on larger networks (based on the S. cerevisiae dataset), whereas kNN performed better on smaller networks (based on the E. coli dataset). Conclusion: The results suggest that the performance of each imputation method is dependent on the size of the dataset, and this subsequently affects the modelling of the resultant GRNs using a DBN. In addition, on the basis of these results, a DBN has the capacity to discover potential edges, as well as display interactions, between genes. PMID:24876803
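As a rough illustration of the first of the three imputation methods named above, the following is a minimal k-nearest-neighbour sketch in plain Python. It assumes that rows are genes, columns are samples, and missing values are stored as None. It also assumes a root-mean-square distance over the columns two genes share. None of these choices is taken from the paper, and this is not the authors' implementation.

```python
import math


def knn_impute(matrix, k=2):
    """Fill None entries with the mean of that column in the k nearest rows."""
    filled = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for other_i, other in enumerate(matrix):
                # A neighbour must be a different gene with column j observed.
                if other_i == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                # Root-mean-square distance over the columns both genes have.
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


# Toy expression matrix, for illustration only.
expr = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 5.0],
    [9.0, 9.0, 9.0],
]
print(knn_impute(expr, k=2)[0])
```

In the toy matrix the two genes closest to the first row have values 3.0 and 5.0 in the missing column, so the gap is filled with their mean, 4.0.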
Lemieux, Sebastien; Sargeant, Tobias; Laperrière, David; Ismail, Houssam; Boucher, Geneviève; Rozendaal, Marieke; Lavallée, Vincent-Philippe; Ashton-Beaucage, Dariel; Wilhelm, Brian; Hébert, Josée; Hilton, Douglas J; Mader, Sylvie; Sauvageau, Guy
2017-07-27
Genome-wide transcriptome profiling has enabled non-supervised classification of tumours, revealing different sub-groups characterized by specific gene expression features. However, the biological significance of these subtypes remains for the most part unclear. We describe herein an interactive platform, Minimum Spanning Trees Inferred Clustering (MiSTIC), that integrates the direct visualization and comparison of the gene correlation structure between datasets, the analysis of the molecular causes underlying co-variations in gene expression in cancer samples, and the clinical annotation of tumour sets defined by the combined expression of selected biomarkers. We have used MiSTIC to highlight the roles of specific transcription factors in breast cancer subtype specification, to compare the aspects of tumour heterogeneity targeted by different prognostic signatures, and to highlight biomarker interactions in AML. A version of MiSTIC preloaded with datasets described herein can be accessed through a public web server (http://mistic.iric.ca); in addition, the MiSTIC software package can be obtained (github.com/iric-soft/MiSTIC) for local use with personalized datasets. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
Ni, Ming; Ye, Fuqiang; Zhu, Juanjuan; Li, Zongwei; Yang, Shuai; Yang, Bite; Han, Lu; Wu, Yongge; Chen, Ying; Li, Fei; Wang, Shengqi; Bo, Xiaochen
2014-12-01
Numerous public microarray datasets are valuable resources for the scientific communities. Several online tools have made great strides in using these data by querying related datasets with users' own gene signatures or expression profiles. However, dataset annotation and result presentation still need to be improved. ExpTreeDB is a database that allows for queries on human and mouse microarray experiments from Gene Expression Omnibus with gene signatures or profiles. Compared with similar applications, ExpTreeDB pays more attention to dataset annotations and result visualization. We introduced a multiple-level annotation system to depict and organize original experiments. For example, a tamoxifen-treated cell line experiment is hierarchically annotated as 'agent→drug→estrogen receptor antagonist→tamoxifen'. Consequently, retrieved results are displayed as interactive tree-structured graphics, which provide an overview of related experiments and might alert users to key items of interest. The database is freely available at http://biotech.bmi.ac.cn/ExpTreeDB. The web site is implemented in Perl, PHP, R, MySQL and Apache. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Integrative sparse principal component analysis of gene expression data.
Liu, Mengque; Fan, Xinyan; Fang, Kuangnan; Zhang, Qingzhao; Ma, Shuangge
2017-12-01
In the analysis of gene expression data, dimension reduction techniques have been extensively adopted. The most popular one is perhaps the PCA (principal component analysis). To generate more reliable and more interpretable results, the SPCA (sparse PCA) technique has been developed. With the "small sample size, high dimensionality" characteristic of gene expression data, the analysis results generated from a single dataset are often unsatisfactory. Under contexts other than dimension reduction, integrative analysis techniques, which jointly analyze the raw data of multiple independent datasets, have been developed and shown to outperform "classic" meta-analysis and other multidatasets techniques and single-dataset analysis. In this study, we conduct integrative analysis by developing the iSPCA (integrative SPCA) method. iSPCA achieves the selection and estimation of sparse loadings using a group penalty. To take advantage of the similarity across datasets and generate more accurate results, we further impose contrasted penalties. Different penalties are proposed to accommodate different data conditions. Extensive simulations show that iSPCA outperforms the alternatives under a wide spectrum of settings. The analysis of breast cancer and pancreatic cancer data further shows iSPCA's satisfactory performance. © 2017 WILEY PERIODICALS, INC.
Reproducibility-optimized test statistic for ranking genes in microarray studies.
Elo, Laura L; Filén, Sanna; Lahesmaa, Riitta; Aittokallio, Tero
2008-01-01
A principal goal of microarray studies is to identify the genes showing differential expression under distinct conditions. In such studies, the selection of an optimal test statistic is a crucial challenge, which depends on the type and amount of data under analysis. While previous studies on simulated or spike-in datasets do not provide practical guidance on how to choose the best method for a given real dataset, we introduce an enhanced reproducibility-optimization procedure, which enables the selection of a suitable gene-ranking statistic directly from the data. In comparison with existing ranking methods, the reproducibility-optimized statistic shows good performance consistently under various simulated conditions and on the Affymetrix spike-in dataset. Further, the feasibility of the novel statistic is confirmed in a practical research setting using data from an in-house cDNA microarray study of asthma-related gene expression changes. These results suggest that the procedure facilitates the selection of an appropriate test statistic for a given dataset without relying on a priori assumptions, which may bias the findings and their interpretation. Moreover, the general reproducibility-optimization procedure is not limited to detecting differential expression only but could be extended to a wide range of other applications as well.
Unmasking Upstream Gene Expression Regulators with miRNA-corrected mRNA Data
Bollmann, Stephanie; Bu, Dengpan; Wang, Jiaqi; Bionaz, Massimo
2015-01-01
Expressed micro-RNA (miRNA) affects messenger RNA (mRNA) abundance, hindering the accuracy of upstream regulator analysis. Our objective was to provide an algorithm to correct such bias. Large mRNA and miRNA analyses were performed on RNA extracted from bovine liver and mammary tissue. Using four levels of target scores from TargetScan (all miRNA:mRNA target gene pairs or only the top 25%, 50%, or 75%) and four levels of the magnitude of miRNA effect (ME) on mRNA expression (30%, 50%, 75%, and 83% mRNA reduction), we generated 17 different datasets (including the original dataset). For each dataset, we performed upstream regulator analysis using two bioinformatics tools. We detected an increased effect on the upstream regulator analysis with larger miRNA:mRNA pair bins and higher ME. The miRNA correction allowed identification of several upstream regulators not present in the analysis of the original dataset. Thus, the proposed algorithm improved the prediction of upstream regulators. PMID:27279737
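The count of 17 datasets follows from crossing the four target-score levels with the four ME levels and adding the uncorrected original (4 × 4 + 1). The sketch below enumerates that grid. The labels are hypothetical, and the `corrected_abundance` helper is only an assumed form of correction (undoing a fractional reduction), since the abstract does not give the actual formula.

```python
from itertools import product

# Levels as listed in the abstract; the label strings are illustrative.
TARGET_SCORE_BINS = ["all", "top25", "top50", "top75"]
MIRNA_EFFECTS = [0.30, 0.50, 0.75, 0.83]  # fractional mRNA reduction (ME)


def corrected_abundance(observed, reduction):
    """Assumed correction: undo a fractional reduction of mRNA abundance.
    This formula is an illustration, not the published algorithm."""
    return observed / (1.0 - reduction)


# One uncorrected dataset plus every (target-score bin, ME) combination.
datasets = [("original", None)] + list(product(TARGET_SCORE_BINS, MIRNA_EFFECTS))
print(len(datasets))
```

Under this assumed correction, an observed abundance of 50 with a 50% reduction would be restored to 100.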
Irigoyen, Antonio; Jimenez-Luna, Cristina; Benavides, Manuel; Caba, Octavio; Gallego, Javier; Ortuño, Francisco Manuel; Guillen-Ponce, Carmen; Rojas, Ignacio; Aranda, Enrique; Torres, Carolina; Prados, Jose
2018-01-01
Applying differentially expressed genes (DEGs) to identify feasible biomarkers in diseases can be a hard task when working with heterogeneous datasets. Expression data are strongly influenced by technology, sample preparation processes, and/or labeling methods. The proliferation of different microarray platforms for measuring gene expression increases the need to develop models able to compare their results, especially when different technologies can lead to signal values that vary greatly. Integrative meta-analysis can significantly improve the reliability and robustness of DEG detection. The objective of this work was to develop an integrative approach for identifying potential cancer biomarkers by integrating gene expression data from two different platforms. Pancreatic ductal adenocarcinoma (PDAC), where there is an urgent need to find new biomarkers due to its late diagnosis, is an ideal candidate for testing this technology. Expression data from two different datasets, namely Affymetrix and Illumina (18 and 36 PDAC patients, respectively), as well as from 18 healthy controls, were used for this study. A meta-analysis based on an empirical Bayesian methodology (ComBat) was then proposed to integrate these datasets. DEGs were finally identified from the integrated data by using the statistical programming language R. After our integrative meta-analysis, 5 genes were commonly identified within the individual analyses of the independent datasets. In addition, 28 novel genes that were not reported by the individual analyses ('gained' genes) were discovered. Several of these gained genes have already been related to other gastroenterological tumors. The proposed integrative meta-analysis has revealed novel DEGs that may play an important role in PDAC and could be potential biomarkers for diagnosing the disease.
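To illustrate why cross-platform integration needs a batch adjustment step, the sketch below applies per-batch mean centering to one gene. This is a deliberately simpler stand-in and is not ComBat, which additionally adjusts scale and shrinks batch parameters with an empirical Bayes model. The function name, platform labels, and toy values are hypothetical.

```python
from statistics import mean


def center_by_batch(values, batches):
    """Subtract each batch's mean from its samples, for a single gene.

    values  -- expression values for one gene, one per sample
    batches -- batch label per sample (for example "affy" or "illumina")
    """
    batch_means = {
        b: mean(v for v, lab in zip(values, batches) if lab == b)
        for b in set(batches)
    }
    return [v - batch_means[lab] for v, lab in zip(values, batches)]


# Toy values: the second platform reads about 4 units higher overall.
gene = [5.0, 7.0, 9.0, 11.0]
platform = ["affy", "affy", "illumina", "illumina"]
adjusted = center_by_batch(gene, platform)
```

Plain centering like this can also remove real biological signal when disease status is unevenly distributed across platforms, which is one reason model-based methods that include the condition as a covariate are generally preferred.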
Lu, Chenqi; Liu, Xiaoqin; Wang, Lin; Jiang, Ning; Yu, Jun; Zhao, Xiaobo; Hu, Hairong; Zheng, Saihua; Li, Xuelian; Wang, Guiying
2017-01-10
Due to genetic heterogeneity and variable diagnostic criteria, genetic studies of polycystic ovary syndrome are particularly challenging. Furthermore, lack of sufficiently large cohorts limits the identification of susceptibility genes contributing to polycystic ovary syndrome. Here, we carried out a systematic search of studies deposited in the Gene Expression Omnibus database through August 31, 2016. The present analyses included studies with: 1) patients with polycystic ovary syndrome and normal controls, 2) gene expression profiling of messenger RNA, and 3) sufficient data for our analysis. Ultimately, a total of 9 studies with 13 datasets met the inclusion criteria and were included in the subsequent integrated analyses. Through comprehensive analyses, 13 genetic factors overlapped in all datasets and were identified as significant specific genes for polycystic ovary syndrome. After quality control assessment, six datasets remained. Further gene ontology enrichment and pathway analyses suggested that differentially expressed genes were mainly enriched in oocyte pathways. These findings provide potential molecular markers for diagnosis and prognosis of polycystic ovary syndrome, and in-depth studies are needed on their exact function and mechanism in polycystic ovary syndrome.
Qian, Liwei; Zheng, Haoran; Zhou, Hong; Qin, Ruibin; Li, Jinlong
2013-01-01
The increasing availability of time series expression datasets, although promising, raises a number of new computational challenges. Accordingly, the development of suitable classification methods to make reliable and sound predictions is becoming a pressing issue. We propose, here, a new method to classify time series gene expression via integration of biological networks. We evaluated our approach on 2 different datasets and showed that the use of a hidden Markov model/Gaussian mixture models hybrid explores the time-dependence of the expression data, thereby leading to better prediction results. We demonstrated that the biclustering procedure identifies function-related genes as a whole, giving rise to high accordance in prognosis prediction across independent time series datasets. In addition, we showed that integration of biological networks into our method significantly improves prediction performance. Moreover, we compared our approach with several state-of-the-art algorithms and found that our method outperformed previous approaches with regard to various criteria. Finally, our approach achieved better prediction results on early-stage data, implying the potential of our method for practical prediction. PMID:23516469
Zhao, Ying; Thammannagowda, Shivegowda; Staton, Margaret; Tang, Sha; Xia, Xinli; Yin, Weilun; Liang, Haiying
2013-03-01
The "living fossil" Metasequoia glyptostroboides Hu et Cheng, commonly known as dawn redwood or Chinese redwood, is the only living species in the genus and is valued for its essential oil and crude extracts that have great potential for anti-fungal activity. Despite its paleontological significance and economical value as a rare relict species, genomic resources of Metasequoia are very limited. In order to gain insight into the molecular mechanisms behind the formation of reproductive buds and the transition from vegetative phase to reproductive phase in Metasequoia, we performed sequencing of expressed sequence tags from Metasequoia vegetative buds and female buds. By using the 454 pyrosequencing technology, a total of 1,571,764 high-quality reads were generated, among which 733,128 were from vegetative buds and 775,636 were from female buds. These EST reads were clustered and assembled into 114,124 putative unique transcripts (PUTs) with an average length of 536 bp. The 97,565 PUTs that were at least 100 bp in length were functionally annotated by a similarity search against public databases and assigned with Gene Ontology (GO) terms. A total of 59 known floral gene families and 190 isotigs involved in hormone regulation were captured in the dataset. Furthermore, a set of PUTs differentially expressed in vegetative and reproductive buds, as well as SSR motifs and high confidence SNPs, were identified. This is the first large-scale expressed sequence tags ever generated in Metasequoia and the first evidence for floral genes in this critically endangered deciduous conifer species.
ITGBL1 promotes migration, invasion and predicts a poor prognosis in colorectal cancer.
Qiu, Xiao; Feng, Jue-Rong; Qiu, Jun; Liu, Lan; Xie, Yang; Zhang, Yu-Peng; Liu, Jing; Zhao, Qiu
2018-05-14
Colorectal cancer (CRC) is one of the most common malignancies worldwide; its progression and prognosis are associated with oncogenes. The present study aimed to identify differentially expressed genes (DEGs) and explore the role and potential mechanism of integrin subunit β like 1 (ITGBL1) in CRC. The microarray dataset GSE41258 was used to screen DEGs involved in CRC. Survival analysis was performed to predict the prognosis of CRC patients. To validate ITGBL1 expression, immunohistochemistry, quantitative real-time PCR and western blotting were performed in CRC tissues and cells. Subsequently, the effects of ITGBL1 were evaluated through colony formation, cell proliferation, migration and invasion assays. Finally, we took advantage of Gene Ontology (GO) analysis and Gene Set Enrichment Analysis (GSEA) to explore the potential function and mechanism of ITGBL1 in CRC. The GSE41258 dataset contained 182 primary CRC tissues and 54 normal colon tissues. A total of 318 DEGs were screened, among which ITGBL1 was found to be significantly up-regulated in CRC, and its high expression was associated with shortened survival of CRC patients. Moreover, knockdown of ITGBL1 inhibited CRC cell proliferation, migration and invasion. Finally, GO analysis revealed that ITGBL1 was associated with cell adhesion. GSEA indicated that ITGBL1 was enriched in ECM receptor interaction and focal adhesion. In conclusion, a novel oncogene ITGBL1 was identified and demonstrated to be associated with the progression and prognosis of CRC, which might be a potential therapeutic target and prognostic biomarker for CRC patients. Copyright © 2018 Elsevier Masson SAS. All rights reserved.
Xia, Wei; Wu, Jian; Deng, Fei-Yan; Wu, Long-Fei; Zhang, Yong-Hong; Guo, Yu-Fan; Lei, Shu-Feng
2017-02-01
Rheumatoid arthritis (RA) is a systemic autoimmune disease. So far, it is unclear whether there exist common RA-related genes shared in different tissues/cells. In this study, we conducted an integrative analysis on multiple datasets to identify potential shared genes that are significant in multiple tissues/cells for RA. Seven microarray gene expression datasets representing various RA-related tissues/cells were downloaded from the Gene Expression Omnibus (GEO). Statistical analyses, testing both marginal and joint effects, were conducted to identify significant genes shared in various samples. Follow-up analyses included functional annotation clustering analysis, protein-protein interaction (PPI) analysis, gene-based association analysis, and ELISA validation analysis in in-house samples. We identified 18 shared significant genes, which were mainly involved in the immune response and chemokine signaling pathway. Among the 18 genes, eight genes (PPBP, PF4, HLA-F, S100A8, RNASEH2A, P2RY6, JAG2, and PCBP1) interact with known RA genes. Two genes (HLA-F and PCBP1) are significant in gene-based association analysis (P = 1.03E-31, P = 1.30E-2, respectively). Additionally, PCBP1 also showed differential protein expression levels in in-house case-control plasma samples (P = 2.60E-2). This study represents the first effort to identify shared RA markers from different functional cells or tissues. The results suggest that one of the shared genes, i.e., PCBP1, is a promising biomarker for RA.
Identification of pathogenic genes and upstream regulators in age-related macular degeneration.
Zhao, Bin; Wang, Mengya; Xu, Jing; Li, Min; Yu, Yuhui
2017-06-26
Age-related macular degeneration (AMD) is the leading cause of irreversible blindness in older individuals. Our study aims to identify the key genes and upstream regulators in AMD. To screen pathogenic genes of AMD, an integrated analysis was performed by using the microarray datasets in AMD derived from the Gene Expression Omnibus (GEO) database. The functional annotation and potential pathways of differentially expressed genes (DEGs) were further discovered by Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis. We constructed the AMD-specific transcriptional regulatory network to find the crucial transcriptional factors (TFs) which target the DEGs in AMD. Quantitative real time polymerase chain reaction (qRT-PCR) was performed to verify the DEGs and TFs obtained by integrated analysis. From two GEO datasets obtained, we identified 1280 DEGs (730 up-regulated and 550 down-regulated genes) between AMD and normal control (NC). After KEGG analysis, steroid biosynthesis was found to be a significantly enriched pathway for DEGs. The expression of 8 genes (TNC, GRP, TRAF6, ADAMTS5, GPX3, FAP, DHCR7 and FDFT1) was detected. Except for TNC and GPX3, the other 6 genes showed the same expression pattern in qRT-PCR as in our integrated analysis. The dysregulation of these eight genes may be involved in the process of AMD. Two crucial transcription factors (c-rel and myogenin) were concluded to play a role in AMD. In particular, myogenin was associated with AMD by regulating TNC, GRP and FAP. Our findings can contribute to developing new potential biomarkers, revealing the underlying pathogenesis, and identifying new therapeutic targets for AMD.
KOK-SIN, TEOW; MOKHTAR, NORFILZA MOHD; HASSAN, NUR ZARINA ALI; SAGAP, ISMAIL; ROSE, ISA MOHAMED; HARUN, ROSLAN; JAMAL, RAHMAN
2015-01-01
Apart from genetic mutations, epigenetic alteration is a common phenomenon that contributes to neoplastic transformation in colorectal cancer. Transcriptional silencing of tumor-suppressor genes without changes in the DNA sequence is explained by the existence of promoter hypermethylation. To test this hypothesis, we integrated the epigenome and transcriptome data from a similar set of colorectal tissue samples. Methylation profiling was performed using the Illumina InfiniumHumanMethylation27 BeadChip on 55 paired cancer and adjacent normal epithelial cells. Fifteen of the 55 paired tissues were used for gene expression profiling using the Affymetrix GeneChip Human Gene 1.0 ST array. Validation was carried out on 150 colorectal tissues using the methylation-specific multiplex ligation-dependent probe amplification (MS-MLPA) technique. PCA and supervised hierarchical clustering in the two microarray datasets showed good separation between cancer and normal samples. Significant genes from the two analyses were obtained based on a ≥2-fold change and a false discovery rate (FDR) P-value of <0.05. We identified 1,081 differentially hypermethylated CpG sites and 36 hypomethylated CpG sites. We also found 709 upregulated and 699 downregulated genes from the gene expression profiling. A comparison of the two datasets revealed 32 overlapping genes with 27 being hypermethylated with downregulated expression and 4 hypermethylated with upregulated expression. One gene was found to be hypomethylated and downregulated. The most enriched molecular pathway identified was cell adhesion molecules that involved 4 overlapped genes, JAM2, NCAM1, ITGA8 and CNTN1. In the present study, we successfully identified a group of genes that showed methylation and gene expression changes in well-defined colorectal cancer tissues with high purity. 
The integrated analysis gives additional insight regarding the regulation of colorectal cancer-associated genes and their underlying mechanisms that contribute to colorectal carcinogenesis. PMID:25997610
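The overlap step described in this abstract amounts to intersecting two gene lists and tallying each shared gene by its methylation and expression direction. A minimal Python sketch follows; the function name and the placeholder gene identifiers (G1 to G5) are hypothetical and carry no biological meaning. The last line only checks that the three reported categories (27, 4 and 1) sum to the 32 overlapping genes.

```python
from collections import Counter


def classify_overlap(methylation, expression):
    """Count genes present in both maps by (methylation, expression) direction.

    methylation -- dict mapping gene -> "hyper" or "hypo"
    expression  -- dict mapping gene -> "up" or "down"
    """
    shared = methylation.keys() & expression.keys()
    return Counter((methylation[g], expression[g]) for g in shared)


# Placeholder toy input, not data from the study.
meth = {"G1": "hyper", "G2": "hyper", "G3": "hyper", "G4": "hypo"}
expr = {"G1": "down", "G2": "down", "G3": "up", "G5": "up"}
counts = classify_overlap(meth, expr)

# Arithmetic check on the categories reported in the abstract.
reported_total = 27 + 4 + 1
```

Genes that appear in only one of the two lists (G4 and G5 in the toy input) drop out of the tally, which mirrors how the reported overlap is much smaller than either individual list.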
Li, Wen-Xing; Dai, Shao-Xing; Liu, Jia-Qian; Wang, Qian; Li, Gong-Hua; Huang, Jing-Fei
2016-01-01
Alzheimer's disease (AD) and schizophrenia (SZ) are both accompanied by impaired learning and memory functions. This study aims to explore the expression profiles of learning or memory genes between AD and SZ. We downloaded 10 AD and 10 SZ datasets from GEO-NCBI for integrated analysis. These datasets were processed using the RMA algorithm and a global renormalization for all studies. An empirical Bayes algorithm was then used to find the differentially expressed genes between patients and controls. The results showed that most of the differentially expressed genes were related to AD, whereas the gene expression profile was little affected in SZ. Furthermore, in terms of the number of differentially expressed genes, the fold change and the brain region, there was a great difference in the expression of learning or memory related genes between AD and SZ. In AD, CALB1, GABRA5, and TAC1 were significantly downregulated in whole brain, frontal lobe, temporal lobe, and hippocampus. However, in SZ, only two genes, CRHBP and CX3CR1, were downregulated in hippocampus, and other brain regions were not affected. The effect of these genes on learning or memory impairment has been widely studied. It was suggested that these genes may play a crucial role in AD or SZ pathogenesis. The different gene expression patterns between AD and SZ on learning and memory functions in different brain regions revealed in our study may help to understand the different mechanisms of the two diseases.
Combining Genotype, Phenotype, and Environment to Infer Potential Candidate Genes.
Talbot, Benoit; Chen, Ting-Wen; Zimmerman, Shawna; Joost, Stéphane; Eckert, Andrew J; Crow, Taylor M; Semizer-Cuming, Devrim; Seshadri, Chitra; Manel, Stéphanie
2017-03-01
Population genomic analysis can be an important tool in understanding local adaptation. Identification of potential adaptive loci in such analyses is usually based on the survey of a large genomic dataset in combination with environmental variables. Phenotypic data are less commonly incorporated into such studies, although combining a genome scan analysis with a phenotypic trait analysis can greatly improve the insights obtained from each analysis individually. Here, we aimed to identify loci potentially involved in adaptation to climate in 283 Loblolly pine (Pinus taeda) samples from throughout the species' range in the southeastern United States. We analyzed associations between phenotypic, molecular, and environmental variables from datasets of 3082 single nucleotide polymorphism (SNP) loci and 3 categories of phenotypic traits (gene expression, metabolites, and whole-plant traits). We found only 6 SNP loci that displayed potential signals of local adaptation. Five of the 6 identified SNPs are linked to gene expression traits for lignin development, and 1 is linked with whole-plant traits. We subsequently compared the 6 candidate genes with environmental variables and found a high correlation in only 3 of them (R2 > 0.2). Our study highlights the need for a combination of genotypes, phenotypes, and environmental variables, and for an appropriate sampling scheme and study design, to improve confidence in the identification of potential candidate genes. © The American Genetic Association 2016. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
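The final filtering step in this abstract (keeping candidates whose correlation with an environmental variable exceeds R2 = 0.2) can be sketched as below. This assumes R2 is the squared Pearson correlation, which the abstract does not state; the function names, locus labels, and toy values are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


def filter_candidates(env, traits, threshold=0.2):
    """Keep loci whose values correlate with `env` above the R^2 threshold."""
    return [name for name, values in traits.items()
            if r_squared(env, values) > threshold]


# Toy data: one locus tracks the environmental gradient, the other does not.
environment = [0.0, 1.0, 2.0, 3.0]
loci = {
    "locus_a": [0.0, 2.0, 4.0, 6.0],    # perfectly linear in environment
    "locus_b": [1.0, -1.0, -1.0, 1.0],  # no linear relationship
}
kept = filter_candidates(environment, loci)
```

With only a handful of samples a threshold like 0.2 is easy to exceed by chance, so in practice such a filter would be paired with a significance test or permutation check.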
Divergent homologs of the predicted small RNA BpCand697 in Burkholderia spp.
NASA Astrophysics Data System (ADS)
Damiri, Nadzirah; Mohd-Padil, Hirzahida; Firdaus-Raih, Mohd
2015-09-01
The small RNA (sRNA) gene candidate BpCand697 was previously reported to be unique to Burkholderia spp. and is encoded at the 3' non-coding region of a putative AraC family transcription regulator gene. This study demonstrates the conservation of the BpCand697 sequence across 32 Burkholderia spp. including B. pseudomallei, B. mallei, B. thailandensis and Burkholderia sp. by integrating both sequence homology and secondary structural analyses of BpCand697 within the dataset. The divergent sequence of BpCand697 was also used as a discriminator in clustering the dataset according to the potential virulence of Burkholderia spp., showing that B. thailandensis was clearly separated from the virulent cluster of B. pseudomallei and B. mallei. Finally, the differential co-transcript expression of BpCand697 and its flanking gene, bpsl2391, was detected in Burkholderia pseudomallei D286 after growth under two different culture conditions using nutrient-rich and minimal media. It is hypothesized that the differential expression of BpCand697-bpsl2391 co-transcript between the two standard prepared media might correlate with nutrient availability in the culture media, suggesting that the physical co-localization of BpCand697 in B. pseudomallei D286 might be directly or indirectly involved with the transcript regulation of bpsl2391 under the selected in vitro culture conditions.
Li, Yanjie; Lu, Yue; Lin, Kevin; Hauser, Lauren A.; Lynch, David R.
2017-01-01
Friedreich's ataxia (FRDA) is an autosomal recessive neurodegenerative disease usually caused by large homozygous expansions of GAA repeat sequences in intron 1 of the frataxin (FXN) gene. FRDA patients homozygous for GAA expansions have low FXN mRNA and protein levels when compared with heterozygous carriers or healthy controls. Frataxin is a mitochondrial protein involved in iron–sulfur cluster synthesis, and many FRDA phenotypes result from deficiencies in cellular metabolism due to lowered expression of FXN. Presently, there is no effective treatment for FRDA, and biomarkers to measure therapeutic trial outcomes and/or to gauge disease progression are lacking. Peripheral tissues, including blood cells, buccal cells and skin fibroblasts, can readily be isolated from FRDA patients and used to define molecular hallmarks of disease pathogenesis. For instance, FXN mRNA and protein levels as well as FXN GAA-repeat tract lengths are routinely determined using all of these cell types. However, because these tissues are not directly involved in disease pathogenesis, their relevance as models of the molecular aspects of the disease is yet to be decided. Herein, we conducted unbiased RNA sequencing to profile the transcriptomes of fibroblast cell lines derived from 18 FRDA patients and 17 unaffected control individuals. Bioinformatic analyses revealed significantly upregulated expression of genes encoding plasma membrane solute carrier proteins in FRDA fibroblasts. Conversely, the expression of genes encoding accessory factors and enzymes involved in cytoplasmic and mitochondrial protein synthesis was consistently decreased in FRDA fibroblasts. Finally, comparison of genes differentially expressed in FRDA fibroblasts to three previously published gene expression signatures defined for FRDA blood cells showed substantial overlap between the independent datasets, including correspondingly deficient expression of antioxidant defense genes. 
Together, these results indicate that gene expression profiling of cells derived from peripheral tissues can, in fact, consistently reveal novel molecular pathways of the disease. When performed on statistically meaningful sample group sizes, unbiased global profiling analyses utilizing peripheral tissues are critical for the discovery and validation of FRDA disease biomarkers. PMID:29125828
Microarray analysis reveals key genes and pathways in Tetralogy of Fallot
He, Yue-E; Qiu, Hui-Xian; Jiang, Jian-Bing; Wu, Rong-Zhou; Xiang, Ru-Lian; Zhang, Yuan-Hai
2017-01-01
The aim of the present study was to identify key genes that may be involved in the pathogenesis of Tetralogy of Fallot (TOF) using bioinformatics methods. The GSE26125 microarray dataset, which includes cardiovascular tissue samples derived from 16 children with TOF and five healthy age-matched control infants, was downloaded from the Gene Expression Omnibus database. Differential expression analysis was performed between TOF and control samples to identify differentially expressed genes (DEGs) using Student's t-test and the R/limma package, with a log2 fold-change of >2 and a false discovery rate of <0.01 set as thresholds. The biological functions of DEGs were analyzed using the ToppGene database. The ReactomeFIViz application was used to construct functional interaction (FI) networks, and the genes in each module were subjected to pathway enrichment analysis. The iRegulon plugin was used to identify transcription factors predicted to regulate the DEGs in the FI network, and the gene-transcription factor pairs were then visualized using Cytoscape software. A total of 878 DEGs were identified, including 848 upregulated genes and 30 downregulated genes. The gene FI network contained seven function modules, which were all composed of upregulated genes. Genes in Module 1 were enriched in the following three neurological disorder-associated signaling pathways: Parkinson's disease, Alzheimer's disease and Huntington's disease. Genes in Modules 0, 3 and 5 were predominantly enriched in pathways associated with ribosomes and protein translation. The Xbox binding protein 1 transcription factor was demonstrated to be involved in the regulation of genes encoding the subunits of cytoplasmic and mitochondrial ribosomes, as well as genes involved in neurodegenerative disorders. Therefore, dysfunction of genes involved in signaling pathways associated with neurodegenerative disorders, ribosome function and protein translation may contribute to the pathogenesis of TOF. 
PMID:28713939
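The thresholding described in this abstract (log2 fold-change of >2 and false discovery rate of <0.01) can be sketched in Python as follows. Two assumptions are made that the abstract does not confirm: the FDR is computed with the Benjamini-Hochberg procedure, and the fold-change cut-off is applied to the absolute value so that downregulated genes can pass. The study itself used R/limma; the function names and toy numbers here are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_degs(genes, log2fc, pvalues, fc_cut=2.0, fdr_cut=0.01):
    """Keep genes with |log2FC| > fc_cut and BH-adjusted p-value < fdr_cut."""
    fdr = benjamini_hochberg(pvalues)
    return [g for g, fc, q in zip(genes, log2fc, fdr)
            if abs(fc) > fc_cut and q < fdr_cut]


# Toy input: g3 fails the fold-change cut, g4 fails the FDR cut.
genes = ["g1", "g2", "g3", "g4"]
log2fc = [2.5, -3.0, 0.5, 2.2]
pvals = [0.001, 0.002, 0.0005, 0.5]
degs = select_degs(genes, log2fc, pvals)
```

A log2 fold-change above 2 corresponds to more than a fourfold difference, which is a fairly strict cut-off for a dataset with only five control samples.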
2017-02-01
Dates covered: 15 July 2010 – 2 Nov. 2016. Title and subtitle: A Gene Expression Profile of BRCAness That Predicts for Responsiveness to Platinum and PARP...resistance in vitro, and to investigate the mechanism for this effect. The major goal for Aim 4 was to determine the reproducibility of the BRCAness...we used the epithelial ovarian cancer (EOC) dataset from The Cancer Genome Atlas (TCGA) (4). The TCGA dataset is a unique tool for these studies as
This file contains a link for Gene Expression Omnibus and the GSE designations for the publicly available gene expression data used in the study and reflected in Figures 6 and 7 of the Das et al., 2016 paper. This dataset is associated with the following publication: Das, K., C. Wood, M. Lin, A.A. Starkov, C. Lau, K.B. Wallace, C. Corton, and B. Abbott. Perfluoroalkyl acids-induced liver steatosis: Effects on genes controlling lipid homeostasis. TOXICOLOGY. Elsevier Science Ltd, New York, NY, USA, 378: 32-52, (2017).
Microarray Analysis of Iris Gene Expression in Mice with Mutations Influencing Pigmentation
Trantow, Colleen M.; Cuffy, Tryphena L.; Fingert, John H.; Kuehn, Markus H.
2011-01-01
Purpose. Several ocular diseases involve the iris, notably including oculocutaneous albinism, pigment dispersion syndrome, and exfoliation syndrome. To screen for candidate genes that may contribute to the pathogenesis of these diseases, genome-wide iris gene expression patterns were comparatively analyzed from mouse models of these conditions. Methods. Iris samples from albino mice with a Tyr mutation, pigment dispersion–prone mice with Tyrp1 and Gpnmb mutations, and mice resembling exfoliation syndrome with a Lyst mutation were compared with samples from wild-type mice. All mice were strain (C57BL/6J), age (60 days old), and sex (female) matched. Microarrays were used to compare transcriptional profiles, and differentially expressed transcripts were described by functional annotation clustering using DAVID Bioinformatics Resources. Quantitative real-time PCR was performed to validate a subset of identified changes. Results. Compared with wild-type C57BL/6J mice, each disease context exhibited a large number of statistically significant changes in gene expression, including 685 transcripts differentially expressed in albino irides, 403 in pigment dispersion–prone irides, and 460 in exfoliative-like irides. Conclusions. Functional annotation clusterings were particularly striking among the overrepresented genes, with albino and pigment dispersion–prone irides both exhibiting overall evidence of crystallin-mediated stress responses. Exfoliative-like irides from mice with a Lyst mutation showed overall evidence of involvement of genes that influence immune system processes, lytic vacuoles, and lysosomes. These findings have several biologically relevant implications, particularly with respect to secondary forms of glaucoma, and represent a useful resource as a hypothesis-generating dataset. PMID:20739468
TGFbeta and miRNA regulation in familial and sporadic breast cancer
Pinto, Rosamaria; Pilato, Brunella; Palumbo, Orazio; Carella, Massimo; Popescu, Ondina; Digennaro, Maria; Lacalamita, Rosanna; Tommasi, Stefania
2017-01-01
The term ‘BRCAness’ was introduced to identify sporadic malignant tumors sharing characteristics similar to those of germline BRCA-related tumors. Among all mechanisms attributable to BRCA1 expression silencing, a major role has been assigned to microRNAs. The role of microRNAs in familial and sporadic breast cancer has been explored, but few data are available about microRNA involvement in homologous recombination repair control in these breast cancer subgroups. Our aim was to seek microRNAs associated with pathways underlying DNA repair dysfunction in breast cancer according to a family history of the disease. Affymetrix GeneChip microRNA Arrays were used to perform microRNA expression analysis in familial and sporadic breast cancer. Pathway enrichment analysis and microRNA target prediction were carried out using the DIANA miRPath v.3 web-based computational tool and the miRWalk v.2 database. We analyzed an external gene expression dataset (E-GEOD-49481), including both familial and sporadic breast cancers. For microRNA validation, an independent set of 19 familial and 10 sporadic breast cancers was used. Microarray analysis identified a signature of 28 deregulated miRNAs. For our validation analyses by real time PCR, we focused on miR-92a-1*, miR-1184 and miR-943 because they are associated with the TGF-β signalling pathway and with ATM and BRCA1 gene expression. Our results highlighted alterations in miR-92a-1*, miR-1184 and miR-943 expression levels, suggesting their involvement in repair of DNA double-strand breaks through TGF-beta pathway control. PMID:28881597
Microarray-based cancer prediction using soft computing approach.
Wang, Xiaosheng; Gotoh, Osamu
2009-05-26
One of the difficulties in using gene expression profiles to predict cancer is how to effectively select a few informative genes to construct accurate prediction models from thousands or tens of thousands of genes. We screen highly discriminative genes and gene pairs to create simple prediction models involving single genes or gene pairs on the basis of a soft computing approach and rough set theory. Accurate cancer prediction is obtained when we apply the simple prediction models to four cancer gene expression datasets: CNS tumor, colon tumor, lung cancer and DLBCL. Some genes closely correlated with the pathogenesis of specific or general cancers are identified. In contrast with other models, our models are simple, effective and robust. Meanwhile, our models are interpretable because they are based on decision rules. Our results demonstrate that very simple models may perform well on cancer molecular prediction and that important gene markers of cancer can be detected if the gene selection approach is chosen reasonably.
Snezhkina, Anastasiya Vladimirovna; Krasnov, George Sergeevich; Zaretsky, Andrew Rostislavovich; Zhavoronkov, Alex; Nyushko, Kirill Mikhailovich; Moskalev, Alexey Alexandrovich; Karpova, Irina Yurievna; Afremova, Anastasiya Isaevna; Lipatova, Anastasiya Valerievna; Kochetkov, Dmitriy Vladimitovich; Fedorova, Maria Sergeena; Volchenko, Nadezhda Nikolaevna; Sadritdinova, Asiya Fayazovna; Melnikova, Nataliya Vladimirovna; Sidorov, Dmitry Vladimirovich; Popov, Anatoly Yurievich; Kalinin, Dmitry Valerievich; Kaprin, Andrey Dmitrievich; Alekseev, Boris Yakovlevich; Dmitriev, Alexey Alexandrovich; Kudryavtseva, Anna Viktorovna
2016-12-28
Colorectal cancer (CRC) is one of the most common malignant tumors worldwide. CRC molecular pathogenesis is heterogeneous and may be followed by mutations in oncogenes and tumor suppressor genes, chromosomal and microsatellite instability, alternative splicing alterations, hypermethylation of CpG islands, oxidative stress, impairment of different signaling pathways and energy metabolism. In the present work, we have studied the alterations of alternative splicing patterns of genes related to energy metabolism in CRC. Using CrossHub software, we analyzed The Cancer Genome Atlas (TCGA) RNA-Seq datasets derived from colon tumor and matched normal tissues. The expression of 1014 alternative mRNA isoforms involved in cell energy metabolism was examined. We found 7 genes with differentially expressed alternative transcripts whereas overall expression of these genes was not significantly altered in CRC. A set of 8 differentially expressed transcripts of interest has been validated by qPCR. These eight isoforms encoded by OGDH, COL6A3, ICAM1, PHPT1, PPP2R5D, SLC29A1, and TRIB3 genes were up-regulated in colorectal tumors, and this is in concordance with the bioinformatics data. The alternative transcript NM_057167 of COL6A3 was also strongly up-regulated in breast, lung, prostate, and kidney tumors. Alternative transcript of SLC29A1 (NM_001078177) was up-regulated only in CRC samples, but not in the other tested tumor types. We identified tumor-specific expression of alternative spliced transcripts of seven genes involved in energy metabolism in CRC. Our results bring new knowledge on alternative splicing in colorectal cancer and suggest a set of mRNA isoforms that could be used for cancer diagnosis and development of treatment methods.
Transcriptomic profiling of genes in matured dimorphic seeds of euhalophyte Suaeda salsa.
Xu, Yange; Zhao, Yuanqin; Duan, Huimin; Sui, Na; Yuan, Fang; Song, Jie
2017-09-13
Suaeda salsa (S. salsa) is a euhalophyte with high economic value. S. salsa can produce dimorphic seeds. Brown seeds are more salt tolerant, can germinate quickly and maintain the fitness of the species under high saline conditions. Black seeds are less salt tolerant, may become part of the seed bank and germinate when soil salinity is reduced. Previous reports have mainly focused on the ecophysiological traits of seed germination and production under saline conditions in this species. However, there is no information available on the molecular characteristics of S. salsa dimorphic seeds. In the present study, a total of 5825 differentially expressed genes were obtained; and 4648 differentially expressed genes were annotated based on a sequence similarity search, utilizing five public databases by transcriptome analysis. The different expression of these genes may be associated with embryo development, fatty acid, osmotic regulation substances and plant hormones in brown and black seeds. Compared to black seeds, most genes may relate to embryo development, and various genes that encode fatty acid desaturase and are involved in osmotic regulation substance synthesis or transport are upregulated in brown seeds. A large number of differentially expressed genes related to plant hormones were found in brown and black seeds, and their possible roles in regulating seed dormancy/germination were discussed. Upregulated genes involved in seed development and osmotic regulation substance accumulation may relate to bigger seed size and rapid seed germination in brown seeds, compared to black seeds. Differentially expressed genes of hormones may relate to seed dormancy/germination and the development of brown and black seeds. The transcriptome dataset will serve as a valuable resource to further understand gene expression and functional genomics in S. salsa dimorphic seeds.
Sun, Mei-Yu; Li, Jing-Yi; Li, Dong; Huang, Feng-Jie; Wang, Di; Li, Hui; Xing, Quan; Zhu, Hui-Bin; Shi, Lei
2018-04-12
Drynaria roosii (Nakaike) is a traditional Chinese medicinal fern, known as 'GuSuiBu'. Its effective components, naringin and neoeriocitrin, share highly similar chemical structures and medicinal functions. Our HPLC-MS/MS results showed that the accumulation of naringin/neoeriocitrin depended on specific tissues or ages. However, little was known about the expression patterns of naringin/neoeriocitrin-related genes involved in their regulatory pathways. Because basic genetic information was lacking, we applied a combination of SMRT sequencing and SGS to generate the complete and full-length transcriptome of D. roosii. According to the SGS data, the DEG-based heat map analysis revealed that naringin/neoeriocitrin-related gene expression exhibited obvious tissue- and time-specific transcriptomic differences. Using the systems biology method of modular organization analysis, we clustered 16,472 DEGs into 17 gene modules and studied the relationships between modules and tissue/time point samples, as well as modules and naringin/neoeriocitrin contents. Among these, naringin/neoeriocitrin-related DEGs were distributed in nine distinct modules, and DEGs in these modules showed significantly different patterns of transcript abundance linked with specific tissues or ages. Moreover, WGCNA results further identified that PAL, 4CL, C4H and C3H, HCT acted as the major hub genes involved in naringin and neoeriocitrin synthesis respectively and exhibited high co-expression with MYB- and bHLH-regulated genes. In this work, modular organization and co-expression networks elucidated the tissue- and time-specificity of gene expression patterns, as well as hub genes associated with naringin/neoeriocitrin synthesis in D. roosii. In addition, the comprehensive transcriptome dataset provides important genetic information for further research on D. roosii.
GC-Content Normalization for RNA-Seq Data
2011-01-01
Background Transcriptome sequencing (RNA-Seq) has become the assay of choice for high-throughput studies of gene expression. However, as is the case with microarrays, major technology-related artifacts and biases affect the resulting expression measures. Normalization is therefore essential to ensure accurate inference of expression levels and subsequent analyses thereof. Results We focus on biases related to GC-content and demonstrate the existence of strong sample-specific GC-content effects on RNA-Seq read counts, which can substantially bias differential expression analysis. We propose three simple within-lane gene-level GC-content normalization approaches and assess their performance on two different RNA-Seq datasets, involving different species and experimental designs. Our methods are compared to state-of-the-art normalization procedures in terms of bias and mean squared error for expression fold-change estimation and in terms of Type I error and p-value distributions for tests of differential expression. The exploratory data analysis and normalization methods proposed in this article are implemented in the open-source Bioconductor R package EDASeq. Conclusions Our within-lane normalization procedures, followed by between-lane normalization, reduce GC-content bias and lead to more accurate estimates of expression fold-changes and tests of differential expression. Such results are crucial for the biological interpretation of RNA-Seq experiments, where downstream analyses can be sensitive to the supplied lists of genes. PMID:22177264
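The idea of a within-lane, gene-level GC-content correction can be illustrated with a minimal sketch. It is not the EDASeq implementation and is not claimed to match any of the three approaches in the abstract exactly; it assumes one simple strategy, in which genes are grouped into equal-size GC-content bins and the counts in each bin are rescaled so that the bin median equals the overall median. The function name `gc_median_normalize` and the `n_bins` parameter are hypothetical.

```python
from statistics import median


def gc_median_normalize(counts, gc, n_bins=4):
    """Rescale read counts within GC-content bins (illustrative sketch only).

    counts: per-gene read counts for one lane/sample.
    gc:     per-gene GC-content fractions, same order as counts.
    Each bin's counts are multiplied by (overall median / bin median).
    """
    if len(counts) != len(gc):
        raise ValueError("counts and gc must have the same length")
    if not counts:
        return []
    # Order gene indices by GC-content so bins group genes of similar GC.
    order = sorted(range(len(counts)), key=lambda i: gc[i])
    overall = median(counts)
    normalized = [0.0] * len(counts)
    bin_size = -(-len(order) // n_bins)  # ceiling division
    for start in range(0, len(order), bin_size):
        idx = order[start:start + bin_size]
        bin_median = median(counts[i] for i in idx)
        # Leave a bin unscaled if its median is zero to avoid dividing by zero.
        scale = overall / bin_median if bin_median > 0 else 1.0
        for i in idx:
            normalized[i] = counts[i] * scale
    return normalized
```

After this step the low-GC and high-GC genes of a lane share the same median count, which is the kind of sample-specific GC effect the abstract describes removing before between-lane normalization.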
Cross-platform normalization of microarray and RNA-seq data for machine learning applications
Thompson, Jeffrey A.; Tan, Jie
2016-01-01
Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exist in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data. We developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models constructed from legacy platforms. We evaluated TDM, as well as quantile normalization, nonparanormal transformation, and a simple log2 transformation, on both simulated and biological datasets of gene expression. Our evaluation included both supervised and unsupervised machine learning approaches. We found that TDM exhibited consistently strong performance across settings and that quantile normalization also performed well in many circumstances. We also provide a TDM package for the R programming language. PMID:26844019
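Two of the baseline transformations compared in this abstract, the simple log2 transformation and quantile normalization, can be sketched in a few lines. This is a generic textbook version written for illustration, not code from the TDM package, and TDM itself is not reproduced here. The function names and the `pseudocount` parameter are assumptions, and ties are broken by sort order rather than averaged.

```python
import math


def log2_transform(values, pseudocount=1.0):
    """Return log2(value + pseudocount) for each expression value."""
    return [math.log2(v + pseudocount) for v in values]


def quantile_normalize(samples):
    """Quantile-normalize a list of samples (each a list of per-gene values).

    Every sample ends up with the same distribution: the mean of the
    sorted values across samples at each rank.
    """
    n = len(samples[0])
    if any(len(s) != n for s in samples):
        raise ValueError("all samples must have the same number of genes")
    sorted_samples = [sorted(s) for s in samples]
    # Reference distribution: mean across samples at each rank.
    reference = [sum(s[r] for s in sorted_samples) / len(samples) for r in range(n)]
    result = []
    for s in samples:
        order = sorted(range(n), key=lambda i: s[i])
        out = [0.0] * n
        for rank, gene_index in enumerate(order):
            out[gene_index] = reference[rank]
        result.append(out)
    return result
```

For example, `quantile_normalize([[5, 2, 3], [4, 1, 6]])` maps both samples onto the shared reference values 1.5, 3.5 and 5.5 while preserving each sample's gene ranking.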
Verhagen, Lilly M; Zomer, Aldert; Maes, Mailis; Villalba, Julian A; Del Nogal, Berenice; Eleveld, Marc; van Hijum, Sacha Aft; de Waard, Jacobus H; Hermans, Peter Wm
2013-02-01
Tuberculosis (TB) continues to cause a high toll of disease and death among children worldwide. The diagnosis of childhood TB is challenged by the paucibacillary nature of the disease and the difficulties in obtaining specimens. Whereas scientific and clinical research efforts to develop novel diagnostic tools have focused on TB in adults, childhood TB has been relatively neglected. Blood transcriptional profiling has improved our understanding of disease pathogenesis of adult TB and may offer future leads for diagnosis and treatment. No studies applying gene expression profiling of children with TB have been published so far. We identified a 116-gene signature set that showed an average prediction error of 11% for TB vs. latent TB infection (LTBI) and for TB vs. LTBI vs. healthy controls (HC) in our dataset. A minimal gene set of only 9 genes showed the same prediction error of 11% for TB vs. LTBI in our dataset. Furthermore, this minimal set showed a significant discriminatory value for TB vs. LTBI for all previously published adult studies using whole blood gene expression, with average prediction errors between 17% and 23%. In order to identify a robust representative gene set that would perform well in populations of different genetic backgrounds, we selected ten genes that were highly discriminative between TB, LTBI and HC in all literature datasets as well as in our dataset. Functional annotation of these genes highlights a possible role for genes involved in calcium signaling and calcium metabolism as biomarkers for active TB. These ten genes were validated by quantitative real-time polymerase chain reaction in an additional cohort of 54 Warao Amerindian children with LTBI, HC and non-TB pneumonia. Decision tree analysis indicated that five of the ten genes were sufficient to classify 78% of the TB cases correctly with no LTBI subjects wrongly classified as TB (100% specificity). 
Our data justify the further exploration of our signature set as biomarkers for potential childhood TB diagnosis. We show that, as the identification of different biomarkers in ethnically distinct cohorts is apparent, it is important to cross-validate newly identified markers in all available cohorts.
2013-01-01
Background Tuberculosis (TB) continues to cause a high toll of disease and death among children worldwide. The diagnosis of childhood TB is challenged by the paucibacillary nature of the disease and the difficulties in obtaining specimens. Whereas scientific and clinical research efforts to develop novel diagnostic tools have focused on TB in adults, childhood TB has been relatively neglected. Blood transcriptional profiling has improved our understanding of disease pathogenesis of adult TB and may offer future leads for diagnosis and treatment. No studies applying gene expression profiling of children with TB have been published so far. Results We identified a 116-gene signature set that showed an average prediction error of 11% for TB vs. latent TB infection (LTBI) and for TB vs. LTBI vs. healthy controls (HC) in our dataset. A minimal gene set of only 9 genes showed the same prediction error of 11% for TB vs. LTBI in our dataset. Furthermore, this minimal set showed a significant discriminatory value for TB vs. LTBI for all previously published adult studies using whole blood gene expression, with average prediction errors between 17% and 23%. In order to identify a robust representative gene set that would perform well in populations of different genetic backgrounds, we selected ten genes that were highly discriminative between TB, LTBI and HC in all literature datasets as well as in our dataset. Functional annotation of these genes highlights a possible role for genes involved in calcium signaling and calcium metabolism as biomarkers for active TB. These ten genes were validated by quantitative real-time polymerase chain reaction in an additional cohort of 54 Warao Amerindian children with LTBI, HC and non-TB pneumonia. Decision tree analysis indicated that five of the ten genes were sufficient to classify 78% of the TB cases correctly with no LTBI subjects wrongly classified as TB (100% specificity). 
Conclusions Our data justify the further exploration of our signature set as biomarkers for potential childhood TB diagnosis. We show that, as the identification of different biomarkers in ethnically distinct cohorts is apparent, it is important to cross-validate newly identified markers in all available cohorts. PMID:23375113
Ramus, Claire; Hovasse, Agnès; Marcellin, Marlène; Hesse, Anne-Marie; Mouton-Barbosa, Emmanuelle; Bouyssié, David; Vaca, Sebastian; Carapito, Christine; Chaoui, Karima; Bruley, Christophe; Garin, Jérôme; Cianférani, Sarah; Ferro, Myriam; Van Dorssaeler, Alain; Burlet-Schiltz, Odile; Schaeffer, Christine; Couté, Yohann; Gonzalez de Peredo, Anne
2016-01-30
Proteomic workflows based on nanoLC-MS/MS data-dependent-acquisition analysis have progressed tremendously in recent years. High-resolution and fast sequencing instruments have enabled the use of label-free quantitative methods, based either on spectral counting or on MS signal analysis, which appear to be an attractive way to analyze differential protein expression in complex biological samples. However, the computational processing of the data for label-free quantification still remains a challenge. Here, we used a proteomic standard composed of an equimolar mixture of 48 human proteins (Sigma UPS1) spiked at different concentrations into a background of yeast cell lysate to benchmark several label-free quantitative workflows, involving different software packages developed in recent years. This experimental design allowed us to finely assess their performance in terms of sensitivity and false discovery rate, by measuring the number of true and false positives (respectively, UPS1 or yeast background proteins found as differential). The spiked standard dataset has been deposited to the ProteomeXchange repository with the identifier PXD001819 and can be used to benchmark other label-free workflows, adjust software parameter settings, improve algorithms for extraction of the quantitative metrics from raw MS data, or evaluate downstream statistical methods. Bioinformatic pipelines for label-free quantitative analysis must be objectively evaluated in their ability to detect variant proteins with good sensitivity and low false discovery rate in large-scale proteomic studies. This can be done through the use of complex spiked samples, for which the "ground truth" of variant proteins is known, allowing a statistical evaluation of the performance of the data processing workflow.
We provide here such a controlled standard dataset and used it to evaluate the performances of several label-free bioinformatics tools (including MaxQuant, Skyline, MFPaQ, IRMa-hEIDI and Scaffold) in different workflows, for detection of variant proteins with different absolute expression levels and fold change values. The dataset presented here can be useful for tuning software tool parameters, and also testing new algorithms for label-free quantitative analysis, or for evaluation of downstream statistical methods. Copyright © 2015 Elsevier B.V. All rights reserved.
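The benchmarking logic described in this abstract, in which proteins called differential are scored against the known spiked set, can be written out as a small sketch. The function name `benchmark_calls` and the protein identifiers in the usage note are hypothetical; the only assumption taken from the abstract is that spiked (UPS1) proteins found as differential count as true positives and background (yeast) proteins found as differential count as false positives.

```python
def benchmark_calls(called, spiked):
    """Score a list of proteins called differential against the spiked ground truth.

    called: identifiers of proteins a workflow reported as differential.
    spiked: identifiers of proteins that truly vary (the spiked standard).
    """
    called, spiked = set(called), set(spiked)
    true_positives = len(called & spiked)
    false_positives = len(called - spiked)
    # Sensitivity: fraction of truly variant proteins that were detected.
    sensitivity = true_positives / len(spiked) if spiked else 0.0
    # False discovery rate: fraction of reported proteins that are background.
    fdr = false_positives / len(called) if called else 0.0
    return {
        "true_positives": true_positives,
        "false_positives": false_positives,
        "sensitivity": sensitivity,
        "fdr": fdr,
    }
```

With four spiked proteins and a workflow that reports three of them plus one background protein, this gives a sensitivity of 0.75 and a false discovery rate of 0.25.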
Zhuo, Yang Jia; Liu, Ze Zhen; Wan, Song; Cai, Zhi Duan; Xie, Jian Jiang; Cai, Zhou da; Song, Sheng da; Wan, Yue Ping; Hua, Wei; Zhong, Wei de; Wu, Chin Lee
2018-06-01
Serine/Arginine-Rich Protein-Specific Kinase-2 (SRSF protein kinase-2, SRPK2) is up-regulated in multiple human tumors. However, the expression, function and clinical significance of SRPK2 in prostate cancer (PCa) are not yet understood. We therefore aimed to determine the association of SRPK2 with tumor progression and metastasis in PCa patients in our present study. The expression of SRPK2 was examined in public datasets and validated using a clinical tissue microarray (TMA) by immunohistochemistry. The association of SRPK2 expression with various clinicopathological characteristics of PCa patients was subsequently statistically analyzed based on The Cancer Genome Atlas (TCGA) dataset and clinical TMA. The effects of SRPK2 on cancer cell proliferation, migration, invasion, cell cycle progression, apoptosis and tumor growth were then respectively investigated using in vitro and in vivo experiments. First, public datasets showed that SRPK2 expression was greater in PCa tissues when compared with non-cancerous tissues. Statistical analysis demonstrated that high expression of SRPK2 was significantly correlated with a higher Gleason Score, advanced pathological stage and the presence of tumor metastasis in the TCGA Dataset (all P < 0.01). Similar correlations between SRPK2 and a higher Gleason Score or advanced pathological stage were also identified in the TMA (P < 0.05). Kaplan-Meier curve analyses showed that the biochemical recurrence (BCR)-free time of PCa patients with SRPK2 high expression was shorter than for those with SRPK2 low expression (P < 0.05). Second, cell function experiments in PCa cell lines revealed that enhanced SRPK2 expression could promote cell proliferation, migration, invasion and cell cycle progression but suppress tumor cell apoptosis in vitro. Xenograft experiments showed that SRPK2 promoted tumor growth in vivo.
In conclusion, our data demonstrated that SRPK2 may play an important role in the progression and metastasis of PCa, which suggests that it might be a potential therapeutic target for PCa clinical therapy. Copyright © 2018 Elsevier Masson SAS. All rights reserved.
Semantic integration of gene expression analysis tools and data sources using software connectors
2013-01-01
Background The study and analysis of gene expression measurements is the primary focus of functional genomics. Once expression data is available, biologists are faced with the task of extracting (new) knowledge associated to the underlying biological phenomenon. Most often, in order to perform this task, biologists execute a number of analysis activities on the available gene expression dataset rather than a single analysis activity. The integration of heterogeneous tools and data sources to create an integrated analysis environment represents a challenging and error-prone task. Semantic integration enables the assignment of unambiguous meanings to data shared among different applications in an integrated environment, allowing the exchange of data in a semantically consistent and meaningful way. This work aims at developing an ontology-based methodology for the semantic integration of gene expression analysis tools and data sources. The proposed methodology relies on software connectors to support not only the access to heterogeneous data sources but also the definition of transformation rules on exchanged data. Results We have studied the different challenges involved in the integration of computer systems and the role software connectors play in this task. We have also studied a number of gene expression technologies, analysis tools and related ontologies in order to devise basic integration scenarios and propose a reference ontology for the gene expression domain. Then, we have defined a number of activities and associated guidelines to prescribe how the development of connectors should be carried out. Finally, we have applied the proposed methodology in the construction of three different integration scenarios involving the use of different tools for the analysis of different types of gene expression data.
Conclusions The proposed methodology facilitates the development of connectors capable of semantically integrating different gene expression analysis tools and data sources. The methodology can be used in the development of connectors supporting both simple and nontrivial processing requirements, thus assuring accurate data exchange and information interpretation from exchanged data. PMID:24341380
Semantic integration of gene expression analysis tools and data sources using software connectors.
Miyazaki, Flávia A; Guardia, Gabriela D A; Vêncio, Ricardo Z N; de Farias, Cléver R G
2013-10-25
The study and analysis of gene expression measurements is the primary focus of functional genomics. Once expression data is available, biologists are faced with the task of extracting (new) knowledge associated to the underlying biological phenomenon. Most often, in order to perform this task, biologists execute a number of analysis activities on the available gene expression dataset rather than a single analysis activity. The integration of heterogeneous tools and data sources to create an integrated analysis environment represents a challenging and error-prone task. Semantic integration enables the assignment of unambiguous meanings to data shared among different applications in an integrated environment, allowing the exchange of data in a semantically consistent and meaningful way. This work aims at developing an ontology-based methodology for the semantic integration of gene expression analysis tools and data sources. The proposed methodology relies on software connectors to support not only the access to heterogeneous data sources but also the definition of transformation rules on exchanged data. We have studied the different challenges involved in the integration of computer systems and the role software connectors play in this task. We have also studied a number of gene expression technologies, analysis tools and related ontologies in order to devise basic integration scenarios and propose a reference ontology for the gene expression domain. Then, we have defined a number of activities and associated guidelines to prescribe how the development of connectors should be carried out. Finally, we have applied the proposed methodology in the construction of three different integration scenarios involving the use of different tools for the analysis of different types of gene expression data. The proposed methodology facilitates the development of connectors capable of semantically integrating different gene expression analysis tools and data sources. 
The methodology can be used in the development of connectors supporting both simple and nontrivial processing requirements, thus assuring accurate data exchange and information interpretation from exchanged data.
Datasets2Tools, repository and search engine for bioinformatics datasets, tools and canned analyses
Torre, Denis; Krawczuk, Patrycja; Jagodnik, Kathleen M.; Lachmann, Alexander; Wang, Zichen; Wang, Lily; Kuleshov, Maxim V.; Ma’ayan, Avi
2018-01-01
Biomedical data repositories such as the Gene Expression Omnibus (GEO) enable the search and discovery of relevant biomedical digital data objects. Similarly, resources such as OMICtools index bioinformatics tools that can extract knowledge from these digital data objects. However, systematic access to pre-generated ‘canned’ analyses applied by bioinformatics tools to biomedical digital data objects is currently not available. Datasets2Tools is a repository indexing 31,473 canned bioinformatics analyses applied to 6,431 datasets. The Datasets2Tools repository also contains the indexing of 4,901 published bioinformatics software tools, and all the analyzed datasets. Datasets2Tools enables users to rapidly find datasets, tools, and canned analyses through an intuitive web interface, a Google Chrome extension, and an API. Furthermore, Datasets2Tools provides a platform for contributing canned analyses, datasets, and tools, as well as evaluating these digital objects according to their compliance with the findable, accessible, interoperable, and reusable (FAIR) principles. By incorporating community engagement, Datasets2Tools promotes sharing of digital resources to stimulate the extraction of knowledge from biomedical research data. Datasets2Tools is freely available from: http://amp.pharm.mssm.edu/datasets2tools. PMID:29485625
Datasets2Tools, repository and search engine for bioinformatics datasets, tools and canned analyses.
Torre, Denis; Krawczuk, Patrycja; Jagodnik, Kathleen M; Lachmann, Alexander; Wang, Zichen; Wang, Lily; Kuleshov, Maxim V; Ma'ayan, Avi
2018-02-27
Biomedical data repositories such as the Gene Expression Omnibus (GEO) enable the search and discovery of relevant biomedical digital data objects. Similarly, resources such as OMICtools index bioinformatics tools that can extract knowledge from these digital data objects. However, systematic access to pre-generated 'canned' analyses applied by bioinformatics tools to biomedical digital data objects is currently not available. Datasets2Tools is a repository indexing 31,473 canned bioinformatics analyses applied to 6,431 datasets. The Datasets2Tools repository also contains the indexing of 4,901 published bioinformatics software tools, and all the analyzed datasets. Datasets2Tools enables users to rapidly find datasets, tools, and canned analyses through an intuitive web interface, a Google Chrome extension, and an API. Furthermore, Datasets2Tools provides a platform for contributing canned analyses, datasets, and tools, as well as evaluating these digital objects according to their compliance with the findable, accessible, interoperable, and reusable (FAIR) principles. By incorporating community engagement, Datasets2Tools promotes sharing of digital resources to stimulate the extraction of knowledge from biomedical research data. Datasets2Tools is freely available from: http://amp.pharm.mssm.edu/datasets2tools.
Booma, P. M.; Prabhakaran, S.; Dhanalakshmi, R.
2014-01-01
Microarray gene expression datasets have attracted great attention among molecular biologists, statisticians, and computer scientists. Data mining that extracts the hidden and usual information from datasets fails to identify the most significant biological associations between genes. A heuristic search for standard biological processes measures only the gene expression level, threshold, and response time. Heuristic search identifies and mines the best biological solution, but the association process is not efficiently addressed. To monitor higher rates of expression levels between genes, a hierarchical clustering model was proposed, where the biological association between genes is measured simultaneously using a proximity measure of improved Pearson's correlation (PCPHC). Additionally, the Seed Augment algorithm adopts average linkage methods on rows and columns in order to expand a seed PCPHC model into a maximal global PCPHC (GL-PCPHC) model and to identify associations between the clusters. Moreover, GL-PCPHC applies a pattern-growing method to mine the PCPHC patterns. Compared to existing gene expression analysis, the PCPHC model achieves better performance. Experimental evaluations are conducted for the GL-PCPHC model with standard benchmark gene expression datasets extracted from the UCI repository and the GenBank database in terms of execution time, size of pattern, significance level, biological association efficiency, and pattern quality. PMID:25136661
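A correlation-based proximity measure of the kind this abstract builds on can be sketched as follows. The abstract does not specify how its "improved" Pearson's correlation differs from the standard coefficient, so the code below assumes the plain Pearson coefficient and the common distance 1 - r; it is not the PCPHC or GL-PCPHC method. The function names are hypothetical.

```python
import math


def pearson(x, y):
    """Standard Pearson correlation between two expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


def correlation_distance(x, y):
    """Distance in [0, 2]: 0 for perfectly co-expressed genes, 2 for opposite profiles."""
    return 1.0 - pearson(x, y)
```

A pairwise matrix of such distances is the usual input to average-linkage hierarchical clustering, where the distance between two clusters is the mean of the distances between their members.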
Functional annotation of the vlinc class of non-coding RNAs using systems biology approach.
St Laurent, Georges; Vyatkin, Yuri; Antonets, Denis; Ri, Maxim; Qi, Yao; Saik, Olga; Shtokalo, Dmitry; de Hoon, Michiel J L; Kawaji, Hideya; Itoh, Masayoshi; Lassmann, Timo; Arner, Erik; Forrest, Alistair R R; Nicolas, Estelle; McCaffrey, Timothy A; Carninci, Piero; Hayashizaki, Yoshihide; Wahlestedt, Claes; Kapranov, Philipp
2016-04-20
Determining the functionality of the non-coding transcripts encoded by the human genome is a coveted goal of modern genomics research. While this goal has commonly relied on the classical methods of forward genetics, integration of different genomics datasets in a global Systems Biology fashion presents a more productive avenue for achieving this very complex aim. Here we report application of a Systems Biology-based approach to dissect functionality of a newly identified vast class of very long intergenic non-coding (vlinc) RNAs. Using the highly quantitative FANTOM5 CAGE dataset, we show that these RNAs could be grouped into 1542 novel human genes based on analysis of insulators that we show here indeed function as genomic barrier elements. We show that vlinc RNA genes likely function in cis to activate nearby genes. This effect, while most pronounced in closely spaced vlinc RNA-gene pairs, can be detected over relatively large genomic distances. Furthermore, we identified 101 vlinc RNA genes likely involved in early embryogenesis based on patterns of their expression and regulation. We also found another 109 such genes potentially involved in cellular functions also happening at early stages of development such as proliferation, migration and apoptosis. Overall, we show that Systems Biology-based methods have great promise for functional annotation of non-coding RNAs. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Bessonov, Kyrylo; Walkey, Christopher J.; Shelp, Barry J.; van Vuuren, Hennie J. J.; Chiu, David; van der Merwe, George
2013-01-01
Analyzing time-course expression data captured in microarray datasets is a complex undertaking as the vast and complex data space is represented by a relatively low number of samples as compared to thousands of available genes. Here, we developed the Interdependent Correlation Clustering (ICC) method to analyze relationships that exist among genes conditioned on the expression of a specific target gene in microarray data. Based on Correlation Clustering, the ICC method analyzes a large set of correlation values related to gene expression profiles extracted from given microarray datasets. ICC can be applied to any microarray dataset and any target gene. We applied this method to microarray data generated from wine fermentations and selected NSF1, which encodes a C2H2 zinc finger-type transcription factor, as the target gene. The validity of the method was verified by accurate identifications of the previously known functional roles of NSF1. In addition, we identified and verified potential new functions for this gene; specifically, NSF1 is a negative regulator for the expression of sulfur metabolism genes, the nuclear localization of Nsf1 protein (Nsf1p) is controlled in a sulfur-dependent manner, and the transcription of NSF1 is regulated by Met4p, an important transcriptional activator of sulfur metabolism genes. The inter-disciplinary approach adopted here highlighted the accuracy and relevancy of the ICC method in mining for novel gene functions using complex microarray datasets with a limited number of samples. PMID:24130853
Stewart, Paul A; Parapatics, Katja; Welsh, Eric A; Müller, André C; Cao, Haoyun; Fang, Bin; Koomen, John M; Eschrich, Steven A; Bennett, Keiryn L; Haura, Eric B
2015-01-01
We performed a pilot proteogenomic study to compare lung adenocarcinoma to lung squamous cell carcinoma using quantitative proteomics (6-plex TMT) combined with a customized Affymetrix GeneChip. Using MaxQuant software, we identified 51,001 unique peptides that mapped to 7,241 unique proteins and from these identified 6,373 genes with matching protein expression for further analysis. We found a minor correlation between gene expression and protein expression; both datasets were able to independently recapitulate known differences between the adenocarcinoma and squamous cell carcinoma subtypes. We found 565 proteins and 629 genes to be differentially expressed between adenocarcinoma and squamous cell carcinoma, with 113 of these consistently differentially expressed at both the gene and protein levels. We then compared our results to published adenocarcinoma versus squamous cell carcinoma proteomic data that we also processed with MaxQuant. We selected two proteins consistently overexpressed in squamous cell carcinoma in all studies, MCT1 (SLC16A1) and GLUT1 (SLC2A1), for further investigation. We found differential expression of these same proteins at the gene level in our study as well as in other public gene expression datasets. These findings combined with survival analysis of public datasets suggest that MCT1 and GLUT1 may be potential prognostic markers in adenocarcinoma and druggable targets in squamous cell carcinoma. Data are available via ProteomeXchange with identifier PXD002622.
Farlora, Rodolfo; Araya-Garay, José; Gallardo-Escárate, Cristian
2014-06-01
Understanding the molecular underpinnings involved in the reproduction of the salmon louse is critical for designing novel strategies of pest management for this ectoparasite. However, genomic information on sex-related genes is still limited. In the present work, sex-specific gene transcription was revealed in the salmon louse Caligus rogercresseyi using high-throughput Illumina sequencing. A total of 30,191,914 and 32,292,250 high quality reads were generated for females and males, and these were de novo assembled into 32,173 and 38,177 contigs, respectively. Gene ontology analysis showed a pattern of higher expression in the female as compared to the male transcriptome. Based on our sequence analysis and known sex-related proteins, several genes putatively involved in sex differentiation, including Dmrt3, FOXL2, VASA, and FEM1, and other potentially significant candidate genes in C. rogercresseyi, were identified for the first time. In addition, the occurrence of SNPs in several differentially expressed contigs annotating for sex-related genes was found. This transcriptome dataset provides a useful resource for future functional analyses, opening new opportunities for sea lice pest control. Copyright © 2014 Elsevier B.V. All rights reserved.
Seq-ing answers: uncovering the unexpected in global gene regulation.
Otto, George Maxwell; Brar, Gloria Ann
2018-04-19
The development of techniques for measuring gene expression globally has greatly expanded our understanding of gene regulatory mechanisms in depth and scale. We can now quantify every intermediate and transition in the canonical pathway of gene expression (from DNA to mRNA to protein) genome-wide. Employing such measurements in parallel can produce rich datasets, but extracting the most information requires careful experimental design and analysis. Here, we argue for the value of genome-wide studies that measure multiple outputs of gene expression over many timepoints during the course of a natural developmental process. We discuss our findings from a highly parallel gene expression dataset of meiotic differentiation, and those of others, to illustrate how leveraging these features can provide new and surprising insight into fundamental mechanisms of gene regulation.
DMirNet: Inferring direct microRNA-mRNA association networks.
Lee, Minsu; Lee, HyungJune
2016-12-05
MicroRNAs (miRNAs) play important regulatory roles in the wide range of biological processes by inducing target mRNA degradation or translational repression. Based on the correlation between expression profiles of a miRNA and its target mRNA, various computational methods have previously been proposed to identify miRNA-mRNA association networks by incorporating the matched miRNA and mRNA expression profiles. However, there remain three major issues to be resolved in the conventional computation approaches for inferring miRNA-mRNA association networks from expression profiles. 1) Inferred correlations from the observed expression profiles using conventional correlation-based methods include numerous erroneous links or over-estimated edge weight due to the transitive information flow among direct associations. 2) Due to the high-dimension-low-sample-size problem on the microarray dataset, it is difficult to obtain an accurate and reliable estimate of the empirical correlations between all pairs of expression profiles. 3) Because the previously proposed computational methods usually suffer from varying performance across different datasets, a more reliable model that guarantees optimal or suboptimal performance across different datasets is highly needed. In this paper, we present DMirNet, a new framework for identifying direct miRNA-mRNA association networks. To tackle the aforementioned issues, DMirNet incorporates 1) three direct correlation estimation methods (namely Corpcor, SPACE, Network deconvolution) to infer direct miRNA-mRNA association networks, 2) the bootstrapping method to fully utilize insufficient training expression profiles, and 3) a rank-based Ensemble aggregation to build a reliable and robust model across different datasets. Our empirical experiments on three datasets demonstrate the combinatorial effects of necessary components in DMirNet. 
Additional performance comparison experiments show that DMirNet outperforms the state-of-the-art Ensemble-based model [1] which has shown the best performance across the same three datasets, with a factor of up to 1.29. Further, we identify 43 putative novel multi-cancer-related miRNA-mRNA association relationships from an inferred Top 1000 direct miRNA-mRNA association network. We believe that DMirNet is a promising method to identify novel direct miRNA-mRNA relations and to elucidate the direct miRNA-mRNA association networks. Since DMirNet infers direct relationships from the observed data, DMirNet can contribute to reconstructing various direct regulatory pathways, including, but not limited to, the direct miRNA-mRNA association networks.
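The rank-based ensemble step described in the DMirNet abstract can be illustrated with a minimal sketch. The abstract does not give the exact aggregation rule, so the mean-rank rule below is an assumption, and the function names, miRNA/gene labels, and scores are hypothetical. Each "method" is just a table of edge scores; in DMirNet these would come from the direct correlation estimators.

```python
def rank_edges(scores):
    """Rank edges by absolute score; the strongest association gets rank 1."""
    ordered = sorted(scores, key=lambda edge: abs(scores[edge]), reverse=True)
    return {edge: rank for rank, edge in enumerate(ordered, start=1)}


def aggregate_by_mean_rank(score_tables):
    """Order edges by their mean rank across methods (assumed rule, not DMirNet's own).

    Every table is assumed to score the same set of edges.
    """
    rankings = [rank_edges(table) for table in score_tables]
    edges = set(score_tables[0])
    mean_rank = {e: sum(r[e] for r in rankings) / len(rankings) for e in edges}
    # Ties are broken by edge name so the output is deterministic.
    return sorted(mean_rank, key=lambda e: (mean_rank[e], e))


# Hypothetical scores from three estimators for three miRNA-mRNA edges.
method_1 = {("miR-a", "G1"): -0.9, ("miR-a", "G2"): 0.2, ("miR-b", "G1"): -0.5}
method_2 = {("miR-a", "G1"): -0.7, ("miR-a", "G2"): 0.1, ("miR-b", "G1"): -0.8}
method_3 = {("miR-a", "G1"): -0.6, ("miR-a", "G2"): 0.3, ("miR-b", "G1"): -0.4}

consensus = aggregate_by_mean_rank([method_1, method_2, method_3])
```

Working on ranks rather than raw scores keeps estimators with different score scales comparable, which is one plausible reason to aggregate this way.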
The df: A proposed data format standard
NASA Technical Reports Server (NTRS)
Lait, Leslie R.; Nash, Eric R.; Newman, Paul A.
1993-01-01
A standard is proposed describing a portable format for electronic exchange of data in the physical sciences. Writing scientific data in a standard format has three basic advantages: portability; the ability to use metadata to aid in interpretation of the data (understandability); and reusability. An improperly formulated standard format tends towards four disadvantages: (1) it can be inflexible and fail to allow the user to express his data as needed; (2) reading and writing such datasets can involve high overhead in computing time and storage space; (3) the format may be accessible only on certain machines using certain languages; and (4) under some circumstances it may be uncertain whether a given dataset actually conforms to the standard. A format was designed which enhances these advantages and lessens the disadvantages. The fundamental approach is to allow the user to make her own choices regarding strategic tradeoffs to achieve the performance desired in her local environment. The choices made are encoded in a specific and portable way in a set of records. A fully detailed description and specification of the format is given, and examples are used to illustrate various concepts. Implementation is discussed.
SoFoCles: feature filtering for microarray classification based on gene ontology.
Papachristoudis, Georgios; Diplaris, Sotiris; Mitkas, Pericles A
2010-02-01
Marker gene selection has been an important research topic in the classification analysis of gene expression data. Current methods try to reduce the "curse of dimensionality" by using statistical intra-feature set calculations, or classifiers that are based on the given dataset. In this paper, we present SoFoCles, an interactive tool that enables semantic feature filtering in microarray classification problems with the use of external, well-defined knowledge retrieved from the Gene Ontology. The notion of semantic similarity is used to derive genes that are involved in the same biological path during the microarray experiment, by enriching a feature set that has been initially produced with legacy methods. Among its other functionalities, SoFoCles offers a large repository of semantic similarity methods that are used in order to derive feature sets and marker genes. The structure and functionality of the tool are discussed in detail, as well as its ability to improve classification accuracy. Through experimental evaluation, SoFoCles is shown to outperform other classification schemes in terms of classification accuracy in two real datasets using different semantic similarity computation approaches.
Beyond Reasonable Doubt: Evolution from DNA Sequences
Penny, David
2013-01-01
We demonstrate quantitatively that, as predicted by evolutionary theory, sequences of homologous proteins from different species converge as we go further and further back in time. The converse, a non-evolutionary model, can be expressed as probabilities, and the test works for chloroplast, nuclear and mitochondrial sequences, as well as for sequences that diverged at different time depths. Even on our conservative test, the probability that chance could produce the observed levels of ancestral convergence for just one of the eight datasets of 51 proteins is ≈1×10^-19 and combined over 8 datasets is ≈1×10^-132. By comparison, there are about 10^80 protons in the universe, hence the probability that the sequences could have been produced by a process involving unrelated ancestral sequences is about 10^50 times lower than picking, among all protons, the same proton at random twice in a row. A non-evolutionary control model shows no convergence, and only a small number of parameters are required to account for the observations. It is time that researchers insisted that doubters put up testable alternatives to evolution. PMID:23950906
Boo, Lily; Ho, Wan Yong; Mohd Ali, Norlaily; Yeap, Swee Keong; Ky, Huynh; Chan, Kok Gan; Yin, Wai Fong; Satharasinghe, Dilan Amila; Liew, Woan Charn; Tan, Sheau Wei; Cheong, Soon Keng; Ong, Han Kiat
2017-01-01
Breast cancer spheroids have been widely used as in vitro models of cancer stem cells (CSCs), yet little is known about their phenotypic characteristics and microRNAs (miRNAs) expression profiles. The objectives of this research were to evaluate the phenotypic characteristics of MDA-MB-231 spheroid-enriched cells for their CSCs properties and also to determine their miRNAs expression profile. Similar to our previously published MCF-7 spheroid, MDA-MB-231 spheroid also showed typical CSCs characteristics namely self-renewability, expression of putative CSCs-related surface markers and enhancement of drug resistance. From the miRNA profile, miR-15b, miR-34a, miR-148a, miR-628 and miR-196b were shown to be involved in CSCs-associated signalling pathways in both models of spheroids, which highlights the involvement of these miRNAs in maintaining the CSCs features. In addition, unique clusters of miRNAs namely miR-205, miR-181a and miR-204 were found in basal-like spheroid whereas miR-125, miR-760, miR-30c and miR-136 were identified in luminal-like spheroid. Our results highlight the roles of miRNAs as well as novel perspectives of the relevant pathways underlying spheroid-enriched CSCs in breast cancer.
Karnik, Rahul; Beer, Michael A.
2015-01-01
The generation of genomic binding or accessibility data from massively parallel sequencing technologies such as ChIP-seq and DNase-seq continues to accelerate. Yet state-of-the-art computational approaches for the identification of DNA binding motifs often yield motifs of weak predictive power. Here we present a novel computational algorithm called MotifSpec, designed to find predictive motifs, in contrast to over-represented sequence elements. The key distinguishing feature of this algorithm is that it uses a dynamic search space and a learned threshold to find discriminative motifs in combination with the modeling of motifs using a full PWM (position weight matrix) rather than k-mer words or regular expressions. We demonstrate that our approach finds motifs corresponding to known binding specificities in several mammalian ChIP-seq datasets, and that our PWMs classify the ChIP-seq signals with accuracy comparable to, or marginally better than motifs from the best existing algorithms. In other datasets, our algorithm identifies novel motifs where other methods fail. Finally, we apply this algorithm to detect motifs from expression datasets in C. elegans using a dynamic expression similarity metric rather than fixed expression clusters, and find novel predictive motifs. PMID:26465884
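To make the PWM idea in the abstract above concrete, here is a minimal sketch of scoring a sequence with a position weight matrix. It is not the MotifSpec algorithm (its dynamic search space and learned threshold are not reproduced); the count matrix, pseudocount, uniform background, and function names are all hypothetical choices for illustration.

```python
import math

# Hypothetical counts for a 4-position motif that prefers A, C, G, T in turn.
COUNTS = [
    {"A": 8, "C": 1, "G": 0, "T": 1},
    {"A": 1, "C": 8, "G": 1, "T": 0},
    {"A": 0, "C": 1, "G": 8, "T": 1},
    {"A": 1, "C": 0, "G": 1, "T": 8},
]


def build_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2-odds weights."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column[base] + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def score_window(pwm, window):
    """Sum the position-specific weights for one window of motif width."""
    return sum(pwm[i][base] for i, base in enumerate(window))


def best_hit(pwm, sequence):
    """Return (start, score) of the highest-scoring window in the sequence."""
    width = len(pwm)
    best = None
    for start in range(len(sequence) - width + 1):
        score = score_window(pwm, sequence[start:start + width])
        if best is None or score > best[1]:
            best = (start, score)
    return best


pwm = build_pwm(COUNTS)
hit = best_hit(pwm, "TTTACGTTT")
```

Unlike a k-mer word or a regular expression, the PWM assigns a graded score to every window, so a threshold on that score can be tuned for discrimination.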
Zhao, Zheng; Bai, Jing; Wu, Aiwei; Wang, Yuan; Zhang, Jinwen; Wang, Zishan; Li, Yongsheng; Xu, Juan; Li, Xia
2015-01-01
Long non-coding RNAs (lncRNAs) are emerging as key regulators of diverse biological processes and diseases. However, the combinatorial effects of these molecules in a specific biological function are poorly understood. Identifying co-expressed protein-coding genes of lncRNAs would provide ample insight into lncRNA functions. To facilitate such an effort, we have developed Co-LncRNA, which is a web-based computational tool that allows users to identify GO annotations and KEGG pathways that may be affected by co-expressed protein-coding genes of a single or multiple lncRNAs. LncRNA co-expressed protein-coding genes were first identified in publicly available human RNA-Seq datasets, including 241 datasets across 6560 total individuals representing 28 tissue types/cell lines. Then, the lncRNA combinatorial effects in a given GO annotations or KEGG pathways are taken into account by the simultaneous analysis of multiple lncRNAs in user-selected individual or multiple datasets, which is realized by enrichment analysis. In addition, this software provides a graphical overview of pathways that are modulated by lncRNAs, as well as a specific tool to display the relevant networks between lncRNAs and their co-expressed protein-coding genes. Co-LncRNA also supports users in uploading their own lncRNA and protein-coding gene expression profiles to investigate the lncRNA combinatorial effects. It will be continuously updated with more human RNA-Seq datasets on an annual basis. Taken together, Co-LncRNA provides a web-based application for investigating lncRNA combinatorial effects, which could shed light on their biological roles and could be a valuable resource for this community. Database URL: http://www.bio-bigdata.com/Co-LncRNA/ PMID:26363020
Exploring Transcription Factors-microRNAs Co-regulation Networks in Schizophrenia.
Xu, Yong; Yue, Weihua; Yao Shugart, Yin; Li, Sheng; Cai, Lei; Li, Qiang; Cheng, Zaohuo; Wang, Guoqiang; Zhou, Zhenhe; Jin, Chunhui; Yuan, Jianmin; Tian, Lin; Wang, Jun; Zhang, Kai; Zhang, Kerang; Liu, Sha; Song, Yuqing; Zhang, Fuquan
2016-07-01
Transcriptional factors (TFs) and microRNAs (miRNAs) have been recognized as 2 classes of principal gene regulators that may be responsible for genome coexpression changes observed in schizophrenia (SZ). This study aims to (1) identify differentially coexpressed genes (DCGs) in 3 mRNA expression microarray datasets; (2) explore potential interactions among the DCGs, and differentially expressed miRNAs identified in our dataset composed of early-onset SZ patients and healthy controls; (3) validate expression levels of some key transcripts; and (4) explore the druggability of DCGs using the curated database. We detected a differential coexpression network associated with SZ and found that 9 out of the 12 regulators were replicated in either of the 2 other datasets. Leveraging the differentially expressed miRNAs identified in our previous dataset, we constructed a miRNA-TF-gene network relevant to SZ, including an EGR1-miR-124-3p-SKIL feed-forward loop. Our real-time quantitative PCR analysis indicated the overexpression of miR-124-3p, the under expression of SKIL and EGR1 in the blood of SZ patients compared with controls, and the direction of change of miR-124-3p and SKIL mRNA levels in SZ cases were reversed after a 12-week treatment cycle. Our druggability analysis revealed that many of these genes have the potential to be drug targets. Together, our results suggest that coexpression network abnormalities driven by combinatorial and interactive action from TFs and miRNAs may contribute to the development of SZ and be relevant to the clinical treatment of the disease. © The Author 2015. Published by Oxford University Press on behalf of the Maryland Psychiatric Research Center. All rights reserved. For permissions, please email: journals.permissions@oup.com.
Exploring Transcription Factors-microRNAs Co-regulation Networks in Schizophrenia
Xu, Yong; Yue, Weihua; Yao Shugart, Yin; Li, Sheng; Cai, Lei; Li, Qiang; Cheng, Zaohuo; Wang, Guoqiang; Zhou, Zhenhe; Jin, Chunhui; Yuan, Jianmin; Tian, Lin; Wang, Jun; Zhang, Kai; Zhang, Kerang; Liu, Sha; Song, Yuqing; Zhang, Fuquan
2016-01-01
Background: Transcriptional factors (TFs) and microRNAs (miRNAs) have been recognized as 2 classes of principal gene regulators that may be responsible for genome coexpression changes observed in schizophrenia (SZ). Methods: This study aims to (1) identify differentially coexpressed genes (DCGs) in 3 mRNA expression microarray datasets; (2) explore potential interactions among the DCGs, and differentially expressed miRNAs identified in our dataset composed of early-onset SZ patients and healthy controls; (3) validate expression levels of some key transcripts; and (4) explore the druggability of DCGs using the curated database. Results: We detected a differential coexpression network associated with SZ and found that 9 out of the 12 regulators were replicated in either of the 2 other datasets. Leveraging the differentially expressed miRNAs identified in our previous dataset, we constructed a miRNA–TF–gene network relevant to SZ, including an EGR1–miR-124-3p–SKIL feed-forward loop. Our real-time quantitative PCR analysis indicated the overexpression of miR-124-3p, the under expression of SKIL and EGR1 in the blood of SZ patients compared with controls, and the direction of change of miR-124-3p and SKIL mRNA levels in SZ cases were reversed after a 12-week treatment cycle. Our druggability analysis revealed that many of these genes have the potential to be drug targets. Conclusions: Together, our results suggest that coexpression network abnormalities driven by combinatorial and interactive action from TFs and miRNAs may contribute to the development of SZ and be relevant to the clinical treatment of the disease. PMID:26609121
Chen, Zhenyu; Li, Jianping; Wei, Liwei
2007-10-01
Recently, gene expression profiling using microarray techniques has been shown to be a promising tool for improving the diagnosis and treatment of cancer. Gene expression data contain a high level of noise, and the number of genes is overwhelming relative to the number of available samples. This poses a great challenge for machine learning and statistical techniques. Support vector machines (SVMs) have been successfully used to classify gene expression data of cancer tissue. In the medical field, it is crucial to deliver a transparent decision process to the user. How to explain the computed solutions and present the extracted knowledge has become a main obstacle for SVM. A multiple kernel support vector machine (MK-SVM) scheme, consisting of feature selection, rule extraction and prediction modeling, is proposed to improve the explanation capacity of SVM. In this scheme, we show that the feature selection problem can be translated into an ordinary multiple-parameter learning problem, and a shrinkage approach, 1-norm based linear programming, is proposed to obtain the sparse parameters and the corresponding selected features. We propose a novel rule extraction approach using the information provided by the separating hyperplane and support vectors to improve the generalization capacity and comprehensibility of rules and reduce the computational complexity. Two public gene expression datasets, a leukemia dataset and a colon tumor dataset, are used to demonstrate the performance of this approach. Using a small number of selected genes, MK-SVM achieves encouraging classification accuracy: more than 90% on both datasets. Moreover, very simple rules with linguistic labels are extracted. The rule sets have high diagnostic power because of their good classification performance.
Damiani, Isabelle; Drain, Alice; Guichard, Marjorie; Balzergue, Sandrine; Boscari, Alexandre; Boyer, Jean-Christophe; Brunaud, Véronique; Cottaz, Sylvain; Rancurel, Corinne; Da Rocha, Martine; Fizames, Cécile; Fort, Sébastien; Gaillard, Isabelle; Maillol, Vincent; Danchin, Etienne G J; Rouached, Hatem; Samain, Eric; Su, Yan-Hua; Thouin, Julien; Touraine, Bruno; Puppo, Alain; Frachisse, Jean-Marie; Pauly, Nicolas; Sentenac, Hervé
2016-01-01
Root hairs are involved in water and nutrient uptake, and thereby in plant autotrophy. In legumes, they also play a crucial role in establishment of rhizobial symbiosis. To obtain a holistic view of Medicago truncatula genes expressed in root hairs and of their regulation during the first hours of the engagement in rhizobial symbiotic interaction, high-throughput RNA sequencing was carried out on isolated root hairs from roots challenged or not with lipochitooligosaccharide Nod factors (NF) for 4 or 20 h. This provided a repertoire of genes displaying expression in root hairs, responding or not to NF, and specific or not to legumes. In analyzing the transcriptome dataset, special attention was paid to pumps, transporters, or channels active at the plasma membrane, to other proteins likely to play a role in nutrient ion uptake, NF electrical and calcium signaling, control of the redox status or the dynamic reprogramming of root hair transcriptome induced by NF treatment, and to the identification of papilionoid legume-specific genes expressed in root hairs. About 10% of the root hair expressed genes were significantly up- or down-regulated by NF treatment, suggesting their involvement in remodeling plant functions to allow establishment of the symbiotic relationship. For instance, NF-induced changes in expression of genes encoding plasma membrane transport systems or disease response proteins indicate that root hairs reduce their involvement in nutrient ion absorption and adapt their immune system in order to engage in the symbiotic interaction. It also appears that the redox status of root hair cells is tuned in response to NF perception. In addition, 1176 genes that could be considered as "papilionoid legume-specific" were identified in the M. truncatula root hair transcriptome, of which 141 were found to possess an ortholog in each of the six legume genomes that we considered, suggesting their involvement in essential functions specific to legumes.
This transcriptome provides a valuable resource to investigate root hair biology in legumes and the roles that these cells play in rhizobial symbiosis establishment. These results could also contribute to the long-term objective of transferring this symbiotic capacity to non-legume plants.
GiniClust: detecting rare cell types from single-cell gene expression data with Gini index.
Jiang, Lan; Chen, Huidong; Pinello, Luca; Yuan, Guo-Cheng
2016-07-01
High-throughput single-cell technologies have great potential to discover new cell types; however, it remains challenging to detect rare cell types that are distinct from a large population. We present a novel computational method, called GiniClust, to overcome this challenge. Validation against a benchmark dataset indicates that GiniClust achieves high sensitivity and specificity. Application of GiniClust to public single-cell RNA-seq datasets uncovers previously unrecognized rare cell types, including Zscan4-expressing cells within mouse embryonic stem cells and hemoglobin-expressing cells in the mouse cortex and hippocampus. GiniClust also correctly detects a small number of normal cells that are mixed in a cancer cell population.
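The statistic named in the GiniClust title can be sketched in a few lines. This is only the Gini index itself, computed per gene across cells; it is not the GiniClust pipeline, whose normalization and clustering steps are not described in the abstract. The function name and the toy expression values are hypothetical.

```python
def gini_index(values):
    """Gini index of non-negative values (0 = perfectly even, near 1 = concentrated).

    Uses G = sum_i (2i - n - 1) * x_i / (n * sum(x)) over values sorted ascending.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


# Toy expression of two genes across four cells (hypothetical values).
housekeeping = [5, 5, 5, 5]    # expressed evenly in every cell
rare_marker = [0, 0, 0, 10]    # expressed in a single cell only

g_even = gini_index(housekeeping)
g_rare = gini_index(rare_marker)
```

A gene expressed in only a few cells yields a high Gini index, which is why the measure is a natural candidate for flagging markers of rare cell types.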
Cis-eQTL-based trans-ethnic meta-analysis reveals novel genes associated with breast cancer risk
Tai, Caroline G.; Passarelli, Michael N.; Hu, Donglei; Huntsman, Scott; Zaitlen, Noah; Ziv, Elad; Witte, John S.
2017-01-01
Breast cancer is the most common solid organ malignancy and the most frequent cause of cancer death among women worldwide. Previous research has yielded insights into its genetic etiology, but there remains a gap in the understanding of genetic factors that contribute to risk, and particularly in the biological mechanisms by which genetic variation modulates risk. The National Cancer Institute’s “Up for a Challenge” (U4C) competition provided an opportunity to further elucidate the genetic basis of the disease. Our group leveraged the seven datasets made available by the U4C organizers and data from the publicly available UK Biobank cohort to examine associations between imputed gene expression and breast cancer risk. In particular, we used reference datasets describing the breast tissue and whole blood transcriptomes to impute expression levels in breast cancer cases and controls. In trans-ethnic meta-analyses of U4C and UK Biobank data, we found significant associations between breast cancer risk and the expression of RCCD1 (joint p-value: 3.6×10^-6) and DHODH (p-value: 7.1×10^-6) in breast tissue, as well as a suggestive association for ANKLE1 (p-value: 9.3×10^-5). Expression of RCCD1 in whole blood was also suggestively associated with disease risk (p-value: 1.2×10^-5), as were expression of ACAP1 (p-value: 1.9×10^-5) and LRRC25 (p-value: 5.2×10^-5). While genome-wide association studies (GWAS) have implicated RCCD1 and ANKLE1 in breast cancer risk, they have not identified the remaining three genes. Among the genetic variants that contributed to the predicted expression of the five genes, we found 23 nominally (p-value < 0.05) associated with breast cancer risk, among which 15 are not in high linkage disequilibrium with risk variants previously identified by GWAS. In summary, we used a transcriptome-based approach to investigate the genetic underpinnings of breast carcinogenesis.
This approach provided an avenue for deciphering the functional relevance of genes and genetic variants involved in breast cancer. PMID:28362817
DOE Office of Scientific and Technical Information (OSTI.GOV)
Nault, Rance; Kim, Suntae; Zacharewski, Timothy R., E-mail: tzachare@msu.edu
2013-03-01
Although the structure and function of the AhR are conserved, emerging evidence suggests that downstream effects are species-specific. In this study, rat hepatic gene expression data from the DrugMatrix database (National Toxicology Program) were compared to mouse hepatic whole-genome gene expression data following treatment with 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD). For the DrugMatrix study, male Sprague–Dawley rats were gavaged daily with 20 μg/kg TCDD for 1, 3 and 5 days, while female C57BL/6 ovariectomized mice were examined 1, 3 and 7 days after a single oral gavage of 30 μg/kg TCDD. A total of 649 rat and 1386 mouse genes (|fold change| ≥ 1.5, P1(t) ≥ 0.99) were differentially expressed following treatment. HomoloGene identified 11,708 orthologs represented across the rat Affymetrix 230 2.0 GeneChip (12,310 total orthologs), and the mouse 4 × 44K v.1 Agilent oligonucleotide array (17,578 total orthologs). Comparative analysis found 563 and 922 orthologs differentially expressed in response to TCDD in the rat and mouse, respectively, with 70 responses associated with immune function and lipid metabolism in common to both. Moreover, QRTPCR analysis of Ceacam1 showed divergent expression (induced in rat; repressed in mouse) functionally consistent with TCDD-elicited hepatic steatosis in the mouse but not the rat. Functional analysis identified orthologs involved in nucleotide binding and acetyltransferase activity in rat, while mouse-specific responses were associated with steroid, phospholipid, fatty acid, and carbohydrate metabolism. These results provide further evidence that TCDD elicits species-specific regulation of distinct gene networks, and outline considerations for future comparisons of publicly available microarray datasets. - Highlights: ► We performed a whole-genome comparison of TCDD-regulated genes in mice and rats. ► Previous species comparisons were extended using data from the DrugMatrix database.
► Less than 15% of TCDD-regulated orthologs were common to mice and rats. ► Considerations for the comparison of publicly available datasets are described.
Ahmed, Nasar Uddin; Jung, Hee-Jeong; Park, Jong-In; Cho, Yong-Gu; Hur, Yoonkang; Nou, Ill-Sup
2015-01-10
Cold and freezing stress is a major environmental constraint to the production of Brassica crops. Enhancement of tolerance by exploiting cold and freezing tolerance related genes offers the most efficient approach to address this problem. Cold-induced transcriptional profiling is a promising approach to the identification of potential genes related to cold and freezing stress tolerance. In this study, 99 highly expressed genes were identified from a whole genome microarray dataset of Brassica rapa. Blast search analysis of the Brassica oleracea database revealed the corresponding homologous genes. To validate their expression, pre-selected cold tolerant and susceptible cabbage lines were analyzed. Out of 99 BoCRGs, 43 were differentially expressed in response to varying degrees of cold and freezing stress in the contrasting cabbage lines. Among the differentially expressed genes, 18 were highly up-regulated in the tolerant lines, which is consistent with their microarray expression. Additionally, 12 BoCRGs were expressed differentially after cold stress treatment in two contrasting cabbage lines, and BoCRG54, 56, 59, 62, 70, 72 and 99 were predicted to be involved in cold regulatory pathways. Taken together, the cold-responsive genes identified in this study provide additional direction for elucidating the regulatory network of low temperature stress tolerance and developing cold and freezing stress resistant Brassica crops. Copyright © 2014 Elsevier B.V. All rights reserved.
Genome-wide prediction and analysis of human tissue-selective genes using microarray expression data
2013-01-01
Background: Understanding how genes are expressed specifically in particular tissues is a fundamental question in developmental biology. Many tissue-specific genes are involved in the pathogenesis of complex human diseases. However, experimental identification of tissue-specific genes is time consuming and difficult. The accurate predictions of tissue-specific gene targets could provide useful information for biomarker development and drug target identification. Results: In this study, we have developed a machine learning approach for predicting the human tissue-specific genes using microarray expression data. The lists of known tissue-specific genes for different tissues were collected from UniProt database, and the expression data retrieved from the previously compiled dataset according to the lists were used for input vector encoding. Random Forests (RFs) and Support Vector Machines (SVMs) were used to construct accurate classifiers. The RF classifiers were found to outperform SVM models for tissue-specific gene prediction. The results suggest that the candidate genes for brain or liver specific expression can provide valuable information for further experimental studies. Our approach was also applied for identifying tissue-selective gene targets for different types of tissues. Conclusions: A machine learning approach has been developed for accurately identifying the candidate genes for tissue specific/selective expression. The approach provides an efficient way to select some interesting genes for developing new biomedical markers and improve our knowledge of tissue-specific expression. PMID:23369200
Rey, Benjamin; Dégletagne, Cyril; Duchamp, Claude
2016-12-01
In this article, we present differentially expressed gene profiles in the pectoralis muscle of wild juvenile king penguins that were either naturally acclimated to cold marine environment or experimentally immersed in cold water as compared with penguin juveniles that never experienced cold water immersion. Transcriptomic data were obtained by hybridizing penguin total cDNA on Affymetrix GeneChip Chicken Genome arrays and analyzed using the maxRS algorithm, "Transcriptome analysis in non-model species: a new method for the analysis of heterologous hybridization on microarrays" (Dégletagne et al., 2010) [1]. We focused on genes involved in multiple antioxidant pathways. For better clarity, these differentially expressed genes were clustered into six functional groups according to their role in controlling redox homeostasis. The data are related to a comprehensive research study on the ontogeny of antioxidant functions in king penguins, "Hormetic response triggers multifaceted anti-oxidant strategies in immature king penguins (Aptenodytes patagonicus)" (Rey et al., 2016) [2]. The raw microarray dataset supporting the present analyses has been deposited at the Gene Expression Omnibus (GEO) repository under accessions GEO: GSE17725 and GEO: GSE82344.
Coordinated transcriptional regulation patterns associated with infertility phenotypes in men
Ellis, Peter J I; Furlong, Robert A; Conner, Sarah J; Kirkman‐Brown, Jackson; Afnan, Masoud; Barratt, Christopher; Griffin, Darren K; Affara, Nabeel A
2007-01-01
Introduction Microarray gene‐expression profiling is a powerful tool for global analysis of the transcriptional consequences of disease phenotypes. Understanding the genetic correlates of particular pathological states is important for more accurate diagnosis and screening of patients, and thus for suggesting appropriate avenues of treatment. As yet, there has been little research describing gene‐expression profiling of infertile and subfertile men, and thus the underlying transcriptional events involved in loss of spermatogenesis remain unclear. Here we present the results of an initial screen of 33 patients with differing spermatogenic phenotypes. Methods Oligonucleotide array expression profiling was performed on testis biopsies for 33 patients presenting for testicular sperm extraction. Significantly regulated genes were selected using a mixed model analysis of variance. Principal components analysis and hierarchical clustering were used to interpret the resulting dataset with reference to the patient history, clinical findings and histological composition of the biopsies. Results Striking patterns of coordinated gene expression were found. The most significant contains multiple germ cell‐specific genes and corresponds to the degree of successful spermatogenesis in each patient, whereas a second pattern corresponds to inflammatory activity within the testis. Smaller‐scale patterns were also observed, relating to unique features of the individual biopsies. PMID:17496197
Mallik, Saurav; Sen, Sagnik; Maulik, Ujjwal
2016-07-15
The involvement of intrinsically disordered proteins (IDPs) in serious diseases such as cancer is an active research topic. In order to gain novel insights into the regulation of IDPs, in this article we perform a transcriptomic analysis of mRNAs (genes) for transcripts encoding IDPs on a human multi-omics prostate carcinoma dataset having both gene expression and methylation data. We first select the genes that have both expression and methylation data and that correspond to the cancer-related, prostate-tissue-specific disordered proteins of the MobiDb database. We apply a standard t-test to determine differentially expressed genes as well as differentially methylated genes. A network of these genes, their targeting miRNAs from the Diana Tarbase v7.0 database, and the corresponding Transcription Factors from the TRANSFAC and ITFP databases is then built. Thereafter, we perform a literature search, and KEGG pathway and Gene Ontology analyses using the DAVID database. Finally, we report several significant potential gene-markers (with the corresponding IDPs) that have an inverse relationship between differential expression and methylation patterns, and that are hub genes of the TF-miRNA-gene network. Copyright © 2016 Elsevier B.V. All rights reserved.
S100A8/A9 is associated with estrogen receptor loss in breast cancer.
Bao, Y I; Wang, Antao; Mo, Juanfen
2016-03-01
S100A8 and S100A9 are calcium-binding proteins that are secreted primarily by granulocytes and monocytes, and are upregulated during the inflammatory response. S100A8 and S100A9 have been identified to be expressed by epithelial cells involved in malignancy. In the present study, the transcriptional levels of S100A8 and S100A9 were investigated in various subtypes of breast cancer (BC), and the correlation with estrogen receptor 1 (ESR1) and GATA binding protein 3 (GATA3) gene expression was evaluated using microarray datasets. The expression of S100A8 and S100A9 in BC cells was assessed by reverse transcription-polymerase chain reaction (RT-PCR). The regulation of ESR1 and GATA3 by administration of recombinant S100A8/A9 was examined in the BC MCF-7 cell line using quantitative (q)PCR. The association between S100A8 and S100A9 and overall survival (OS) was investigated in GeneChip® data of BC. The expression levels of S100A8 and S100A9 were higher in human epidermal growth factor receptor 2 (Her2)-amplified and basal-like BC. The messenger (m)RNA levels of S100A8 and S100A9 were inversely correlated with ESR1 and GATA3 expression. S100A8/A9 induced a 10-fold decrease in the mRNA levels of ESR1 in MCF-7 cells. Poor OS was associated with high expression levels of S100A9, but not with high expression levels of S100A8 in BC. In conclusion, strong expression and secretion of S100A8/A9 may be associated with the loss of estrogen receptor in BC, and may be involved in the poor prognosis of Her2+/basal-like subtypes of BC.
S100A8/A9 is associated with estrogen receptor loss in breast cancer
BAO, YI; WANG, ANTAO; MO, JUANFEN
2016-01-01
S100A8 and S100A9 are calcium-binding proteins that are secreted primarily by granulocytes and monocytes, and are upregulated during the inflammatory response. S100A8 and S100A9 have been identified to be expressed by epithelial cells involved in malignancy. In the present study, the transcriptional levels of S100A8 and S100A9 were investigated in various subtypes of breast cancer (BC), and the correlation with estrogen receptor 1 (ESR1) and GATA binding protein 3 (GATA3) gene expression was evaluated using microarray datasets. The expression of S100A8 and S100A9 in BC cells was assessed by reverse transcription-polymerase chain reaction (RT-PCR). The regulation of ESR1 and GATA3 by administration of recombinant S100A8/A9 was examined in the BC MCF-7 cell line using quantitative (q)PCR. The association between S100A8 and S100A9 and overall survival (OS) was investigated in GeneChip® data of BC. The expression levels of S100A8 and S100A9 were higher in human epidermal growth factor receptor 2 (Her2)-amplified and basal-like BC. The messenger (m)RNA levels of S100A8 and S100A9 were inversely correlated with ESR1 and GATA3 expression. S100A8/A9 induced a 10-fold decrease in the mRNA levels of ESR1 in MCF-7 cells. Poor OS was associated with high expression levels of S100A9, but not with high expression levels of S100A8 in BC. In conclusion, strong expression and secretion of S100A8/A9 may be associated with the loss of estrogen receptor in BC, and may be involved in the poor prognosis of Her2+/basal-like subtypes of BC. PMID:26998104
Zaravinos, Apostolos; Pieri, Myrtani; Mourmouras, Nikos; Anastasiadou, Natassa; Zouvani, Ioanna; Delakas, Dimitris; Deltas, Constantinos
2014-01-01
Clear cell renal cell carcinoma (ccRCC) is the predominant subtype of renal cell carcinoma (RCC). It is one of the most therapy-resistant carcinomas, responding very poorly or not at all to radiotherapy, hormonal therapy and chemotherapy. A more comprehensive understanding of the deregulated pathways in ccRCC can lead to the development of new therapies and prognostic markers. We performed a meta-analysis of 5 publicly available gene expression datasets and identified a list of co-deregulated genes, for which we performed extensive bioinformatic analysis coupled with experimental validation on the mRNA level. Gene ontology enrichment showed that many proteins are involved in response to hypoxia/oxygen levels and positive regulation of the VEGFR signaling pathway. KEGG analysis revealed that metabolic pathways are mostly altered in ccRCC. Similarly, Ingenuity Pathway Analysis showed that the antigen presentation, inositol metabolism, pentose phosphate, glycolysis/gluconeogenesis and fructose/mannose metabolism pathways are altered in the disease. Cellular growth, proliferation and carbohydrate metabolism were among the top molecular and cellular functions of the co-deregulated genes. qRT-PCR validated the deregulated expression of several genes in Caki-2 and ACHN cell lines and in a cohort of ccRCC tissues. Increased NNMT and NR3C1 expression was evident by immunohistochemistry in ccRCC biopsies from patients. ROC curves evaluated the diagnostic performance of the top deregulated genes in each dataset. We show that metabolic pathways are mostly deregulated in ccRCC and we highlight those being most responsible in its formation. We suggest that these genes are candidate predictive markers of the disease. PMID:25594006
Predicting Response to Histone Deacetylase Inhibitors Using High-Throughput Genomics.
Geeleher, Paul; Loboda, Andrey; Lenkala, Divya; Wang, Fan; LaCroix, Bonnie; Karovic, Sanja; Wang, Jacqueline; Nebozhyn, Michael; Chisamore, Michael; Hardwick, James; Maitland, Michael L; Huang, R Stephanie
2015-11-01
Many disparate biomarkers have been proposed as predictors of response to histone deacetylase inhibitors (HDI); however, all have failed when applied clinically. Rather than this being entirely an issue of reproducibility, response to the HDI vorinostat may be determined by the additive effect of multiple molecular factors, many of which have previously been demonstrated. We conducted a large-scale gene expression analysis using the Cancer Genome Project for discovery and generated another large independent cancer cell line dataset across different cancers for validation. We compared different approaches in terms of how accurately vorinostat response can be predicted on an independent out-of-batch set of samples and applied the polygenic marker prediction principles in a clinical trial. Using machine learning, the small effects that aggregate, resulting in sensitivity or resistance, can be recovered from gene expression data in a large panel of cancer cell lines. This approach can predict vorinostat response accurately, whereas single gene or pathway markers cannot. Our analyses recapitulated and contextualized many previous findings and suggest an important role for processes such as chromatin remodeling, autophagy, and apoptosis. As a proof of concept, we also discovered a novel causative role for CHD4, a helicase involved in the histone deacetylase complex that is associated with poor clinical outcome. As a clinical validation, we demonstrated that a common dose-limiting toxicity of vorinostat, thrombocytopenia, can be predicted (r = 0.55, P = .004) several days before it is detected clinically. Our work suggests a paradigm shift from single-gene/pathway evaluation to simultaneously evaluating multiple independent high-throughput gene expression datasets, which can be easily extended to other investigational compounds where similar issues are hampering clinical adoption. © The Author 2015. Published by Oxford University Press. All rights reserved.
Acerbi, Enzo; Viganò, Elena; Poidinger, Michael; Mortellaro, Alessandra; Zelante, Teresa; Stella, Fabio
2016-01-01
T helper 17 (TH17) cells represent a pivotal adaptive cell subset involved in multiple immune disorders in mammalian species. Deciphering the molecular interactions regulating TH17 cell differentiation is particularly critical for novel drug target discovery designed to control maladaptive inflammatory conditions. Using continuous time Bayesian networks over a time-course gene expression dataset, we inferred the global regulatory network controlling TH17 differentiation. From the network, we identified the Prdm1 gene encoding the B lymphocyte-induced maturation protein 1 as a crucial negative regulator of human TH17 cell differentiation. The results have been validated by perturbing Prdm1 expression on freshly isolated CD4+ naïve T cells: reduction of Prdm1 expression leads to augmentation of IL-17 release. These data unravel a possible novel target to control TH17 polarization in inflammatory disorders. Furthermore, this study represents the first in vitro validation of continuous time Bayesian networks as gene network reconstruction method and as hypothesis generation tool for wet-lab biological experiments. PMID:26976045
Use of the Fluidigm C1 platform for RNA sequencing of single mouse pancreatic islet cells.
Xin, Yurong; Kim, Jinrang; Ni, Min; Wei, Yi; Okamoto, Haruka; Lee, Joseph; Adler, Christina; Cavino, Katie; Murphy, Andrew J; Yancopoulos, George D; Lin, Hsin Chieh; Gromada, Jesper
2016-03-22
This study provides an assessment of the Fluidigm C1 platform for RNA sequencing of single mouse pancreatic islet cells. The system combines microfluidic technology and nanoliter-scale reactions. We sequenced 622 cells, allowing identification of 341 islet cells with high-quality gene expression profiles. The cells clustered into populations of α-cells (5%), β-cells (92%), δ-cells (1%), and pancreatic polypeptide cells (2%). We identified cell-type-specific transcription factors and pathways primarily involved in nutrient sensing and oxidation and cell signaling. Unexpectedly, 281 cells had to be removed from the analysis due to low viability, low sequencing quality, or contamination resulting in the detection of more than one islet hormone. Collectively, we provide a resource for identification of high-quality gene expression datasets to help expand insights into genes and pathways characterizing islet cell types. We reveal limitations in the C1 Fluidigm cell capture process resulting in contaminated cells with altered gene expression patterns. This calls for caution when interpreting single-cell transcriptomics data using the C1 Fluidigm system.
An interactive web application for the dissemination of human systems immunology data.
Speake, Cate; Presnell, Scott; Domico, Kelly; Zeitner, Brad; Bjork, Anna; Anderson, David; Mason, Michael J; Whalen, Elizabeth; Vargas, Olivia; Popov, Dimitry; Rinchai, Darawan; Jourde-Chiche, Noemie; Chiche, Laurent; Quinn, Charlie; Chaussabel, Damien
2015-06-19
Systems immunology approaches have proven invaluable in translational research settings. The current rate at which large-scale datasets are generated presents unique challenges and opportunities. Mining aggregates of these datasets could accelerate the pace of discovery, but new solutions are needed to integrate the heterogeneous data types with the contextual information that is necessary for interpretation. In addition, enabling tools and technologies facilitating investigators' interaction with large-scale datasets must be developed in order to promote insight and foster knowledge discovery. State-of-the-art application programming was employed to develop an interactive web application for browsing and visualizing large and complex datasets. A collection of human immune transcriptome datasets was loaded alongside contextual information about the samples. We provide a resource enabling interactive query and navigation of transcriptome datasets relevant to human immunology research. Detailed information about studies and samples is displayed dynamically; if desired, the associated data can be downloaded. Custom interactive visualizations of the data can be shared via email or social media. This application can be used to browse context-rich systems-scale data within and across systems immunology studies. This resource is publicly available online at [Gene Expression Browser Landing Page (https://gxb.benaroyaresearch.org/dm3/landing.gsp)]. The source code is also available openly [Gene Expression Browser Source Code (https://github.com/BenaroyaResearch/gxbrowser)]. We have developed a data browsing and visualization application capable of navigating increasingly large and complex datasets generated in the context of immunological studies. This intuitive tool ensures that, whether taken individually or as a whole, such datasets generated at great effort and expense remain interpretable and a ready source of insight for years to come.
Yang, Lingjian; Ainali, Chrysanthi; Tsoka, Sophia; Papageorgiou, Lazaros G
2014-12-05
Applying machine learning methods on microarray gene expression profiles for disease classification problems is a popular method to derive biomarkers, i.e. sets of genes that can predict disease state or outcome. Traditional approaches where expression of genes were treated independently suffer from low prediction accuracy and difficulty of biological interpretation. Current research efforts focus on integrating information on protein interactions through biochemical pathway datasets with expression profiles to propose pathway-based classifiers that can enhance disease diagnosis and prognosis. As most of the pathway activity inference methods in literature are either unsupervised or applied on two-class datasets, there is good scope to address such limitations by proposing novel methodologies. A supervised multiclass pathway activity inference method using optimisation techniques is reported. For each pathway expression dataset, patterns of its constituent genes are summarised into one composite feature, termed pathway activity, and a novel mathematical programming model is proposed to infer this feature as a weighted linear summation of expression of its constituent genes. Gene weights are determined by the optimisation model, in a way that the resulting pathway activity has the optimal discriminative power with regards to disease phenotypes. Classification is then performed on the resulting low-dimensional pathway activity profile. The model was evaluated through a variety of published gene expression profiles that cover different types of disease. We show that not only does it improve classification accuracy, but it can also perform well in multiclass disease datasets, a limitation of other approaches from the literature. Desirable features of the model include the ability to control the maximum number of genes that may participate in determining pathway activity, which may be pre-specified by the user. 
Overall, this work highlights the potential of building pathway-based multi-phenotype classifiers for accurate disease diagnosis and prognosis problems.
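The composite feature described in this abstract, pathway activity as a weighted linear summation of the expression of a pathway's constituent genes, can be illustrated with a minimal Python sketch. The gene names, expression values and weights below are hypothetical and hand-picked; in the reported method the weights come from a mathematical programming model, which is not reproduced here.

```python
def pathway_activity(expression, weights):
    """Return one pathway activity value per sample.

    expression: dict mapping gene name -> list of per-sample expression values
    weights:    dict mapping gene name -> weight (hypothetical, hand-picked here;
                the abstract says an optimisation model determines them)
    """
    genes = list(weights)
    n_samples = len(expression[genes[0]])
    # Weighted linear summation over the pathway's constituent genes.
    return [
        sum(weights[g] * expression[g][s] for g in genes)
        for s in range(n_samples)
    ]


# Hypothetical three-gene pathway measured in three samples.
expression = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [0.5, 0.5, 2.0],
    "geneC": [4.0, 1.0, 0.0],
}
# A zero weight means the gene does not participate in the activity.
weights = {"geneA": 0.5, "geneB": -1.0, "geneC": 0.0}

activity = pathway_activity(expression, weights)
print(activity)
```

Setting a weight to zero is one simple way to picture the cap on the number of participating genes that the abstract mentions; how the model enforces that cap is not stated in the abstract.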
Kadakkuzha, Beena M.; Liu, Xin-An; McCrate, Jennifer; Shankar, Gautam; Rizzo, Valerio; Afinogenova, Alina; Young, Brandon; Fallahi, Mohammad; Carvalloza, Anthony C.; Raveendra, Bindu; Puthanveettil, Sathyanarayanan V.
2015-01-01
Despite the importance of the long non-coding RNAs (lncRNAs) in regulating biological functions, the expression profiles of lncRNAs in the sub-regions of the mammalian brain and neuronal populations remain largely uncharacterized. By analyzing RNASeq datasets, we demonstrate region specific enrichment of populations of lncRNAs and mRNAs in the mouse hippocampus and pre-frontal cortex (PFC), the two major regions of the brain involved in memory storage and neuropsychiatric disorders. We identified 2759 lncRNAs and 17,859 mRNAs in the hippocampus and 2561 lncRNAs and 17,464 mRNAs expressed in the PFC. The lncRNAs identified correspond to ~14% of the transcriptome of the hippocampus and PFC and ~70% of the lncRNAs annotated in the mouse genome (NCBIM37) and are localized along the chromosomes as varying numbers of clusters. Importantly, we also found that a few of the tested lncRNA-mRNA pairs that share a genomic locus display specific co-expression in a region-specific manner. Furthermore, we find that sub-regions of the brain and specific neuronal populations have characteristic lncRNA expression signatures. These results reveal an unexpected complexity of the lncRNA expression in the mouse brain. PMID:25798087
Pazhamala, Lekha T; Agarwal, Gaurav; Bajaj, Prasad; Kumar, Vinay; Kulshreshtha, Akanksha; Saxena, Rachit K; Varshney, Rajeev K
2016-01-01
Seed development is an important event in the plant life cycle that has interested humankind for ages, especially in crops of economic importance. Pigeonpea is an important grain legume of the semi-arid tropics, used mainly for its protein rich seeds. In order to understand the transcriptional programming during the pod and seed development, RNA-seq data was generated from embryo sac from the day of anthesis (0 DAA), seed and pod wall (5, 10, 20 and 30 DAA) of pigeonpea variety "Asha" (ICPL 87119) using Illumina HiSeq 2500. About 684 million sequencing reads have been generated from nine samples, which resulted in the identification of 27,441 expressed genes after sequence analysis. These genes have been studied for their differential expression, co-expression, temporal and spatial gene expression. We have also used the RNA-seq data to identify important seed-specific transcription factors, biological processes and associated pathways during seed development process in pigeonpea. The comprehensive gene expression study from flowering to mature pod development in pigeonpea would be crucial in identifying candidate genes involved in seed traits directly or indirectly related to yield and quality. The dataset will serve as an important resource for gene discovery and deciphering the molecular mechanisms underlying various seed related traits.
Pazhamala, Lekha T.; Agarwal, Gaurav; Bajaj, Prasad; Kumar, Vinay; Kulshreshtha, Akanksha; Saxena, Rachit K.; Varshney, Rajeev K.
2016-01-01
Seed development is an important event in the plant life cycle that has interested humankind for ages, especially in crops of economic importance. Pigeonpea is an important grain legume of the semi-arid tropics, used mainly for its protein rich seeds. In order to understand the transcriptional programming during the pod and seed development, RNA-seq data was generated from embryo sac from the day of anthesis (0 DAA), seed and pod wall (5, 10, 20 and 30 DAA) of pigeonpea variety “Asha” (ICPL 87119) using Illumina HiSeq 2500. About 684 million sequencing reads have been generated from nine samples, which resulted in the identification of 27,441 expressed genes after sequence analysis. These genes have been studied for their differential expression, co-expression, temporal and spatial gene expression. We have also used the RNA-seq data to identify important seed-specific transcription factors, biological processes and associated pathways during seed development process in pigeonpea. The comprehensive gene expression study from flowering to mature pod development in pigeonpea would be crucial in identifying candidate genes involved in seed traits directly or indirectly related to yield and quality. The dataset will serve as an important resource for gene discovery and deciphering the molecular mechanisms underlying various seed related traits. PMID:27760186
Wide-Open: Accelerating public data release by automating detection of overdue datasets
Poon, Hoifung; Howe, Bill
2017-01-01
Open data is a vital pillar of open science and a key enabler for reproducibility, data reuse, and novel discoveries. Enforcement of open-data policies, however, largely relies on manual efforts, which invariably lag behind the increasingly automated generation of biological data. To address this problem, we developed a general approach to automatically identify datasets overdue for public release by applying text mining to identify dataset references in published articles and parse query results from repositories to determine if the datasets remain private. We demonstrate the effectiveness of this approach on 2 popular National Center for Biotechnology Information (NCBI) repositories: Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA). Our Wide-Open system identified a large number of overdue datasets, which spurred administrators to respond directly by releasing 400 datasets in one week. PMID:28594819
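The first step of the approach, identifying dataset references in article text, might look roughly like the Python sketch below. It assumes that GEO series accessions take the form GSE followed by digits and SRA study accessions the form SRP followed by digits; the pattern, function name and sample sentence are illustrative and are not taken from the Wide-Open system. The second step, querying the repositories to check whether a dataset is still private, is not shown.

```python
import re

# Simplified assumption: GEO series look like "GSE12345" and SRA studies
# like "SRP12345". Real accession families are broader than this.
ACCESSION_PATTERN = re.compile(r"\b(GSE\d+|SRP\d+)\b")


def find_accessions(text):
    """Return the accession-like tokens in text, in order, without repeats."""
    seen = []
    for match in ACCESSION_PATTERN.findall(text):
        if match not in seen:
            seen.append(match)
    return seen


# Hypothetical sentence of the kind found in a data availability statement.
sample = (
    "The raw dataset has been deposited at GEO under accessions "
    "GSE17725 and GSE82344."
)
found = find_accessions(sample)
print(found)
```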
Wide-Open: Accelerating public data release by automating detection of overdue datasets.
Grechkin, Maxim; Poon, Hoifung; Howe, Bill
2017-06-01
Open data is a vital pillar of open science and a key enabler for reproducibility, data reuse, and novel discoveries. Enforcement of open-data policies, however, largely relies on manual efforts, which invariably lag behind the increasingly automated generation of biological data. To address this problem, we developed a general approach to automatically identify datasets overdue for public release by applying text mining to identify dataset references in published articles and parse query results from repositories to determine if the datasets remain private. We demonstrate the effectiveness of this approach on 2 popular National Center for Biotechnology Information (NCBI) repositories: Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA). Our Wide-Open system identified a large number of overdue datasets, which spurred administrators to respond directly by releasing 400 datasets in one week.
A formal concept analysis approach to consensus clustering of multi-experiment expression data
2014-01-01
Background Presently, with the increasing number and complexity of available gene expression datasets, the combination of data from multiple microarray studies addressing a similar biological question is gaining importance. The analysis and integration of multiple datasets are expected to yield more reliable and robust results since they are based on a larger number of samples and the effects of the individual study-specific biases are diminished. This is supported by recent studies suggesting that important biological signals are often preserved or enhanced by multiple experiments. An approach to combining data from different experiments is the aggregation of their clusterings into a consensus or representative clustering solution which increases the confidence in the common features of all the datasets and reveals the important differences among them. Results We propose a novel generic consensus clustering technique that applies the Formal Concept Analysis (FCA) approach for the consolidation and analysis of clustering solutions derived from several microarray datasets. These datasets are initially divided into groups of related experiments with respect to a predefined criterion. Subsequently, a consensus clustering algorithm is applied to each group, resulting in a clustering solution per group. These solutions are pooled together and further analysed by employing FCA, which allows extracting valuable insights from the data and generating a gene partition over all the experiments. In order to validate the FCA-enhanced approach, two consensus clustering algorithms are adapted to incorporate the FCA analysis. Their performance is evaluated on gene expression data from a multi-experiment study examining the global cell-cycle control of fission yeast. 
The FCA results derived from both methods demonstrate that, although the two algorithms optimize different clustering characteristics, FCA is able to overcome and diminish these differences and preserve some relevant biological signals. Conclusions The proposed FCA-enhanced consensus clustering technique is a general approach to the combination of clustering algorithms with FCA for deriving clustering solutions from multiple gene expression matrices. The experimental results presented herein demonstrate that it is a robust data integration technique able to produce a good-quality clustering solution that is representative of the whole set of expression matrices. PMID:24885407
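One way to picture how pooled clustering solutions can yield a single gene partition is sketched below in Python. It is a deliberately simplified stand-in, not the FCA procedure of the paper: genes are treated as objects, the cluster a gene falls into in each group-level solution is treated as an attribute, and genes sharing the same attribute set are grouped together. The function name and the toy solutions are hypothetical.

```python
from collections import defaultdict


def partition_by_signature(solutions):
    """Group genes that share the same cluster label in every solution.

    solutions: list of dicts, one per group of experiments,
               each mapping gene name -> cluster label.
    Returns a dict mapping the tuple of labels (the "signature")
    to the sorted list of genes carrying it.
    """
    groups = defaultdict(list)
    for gene in sorted(solutions[0]):
        signature = tuple(solution[gene] for solution in solutions)
        groups[signature].append(gene)
    return dict(groups)


# Hypothetical clustering solutions for two groups of experiments.
solution_a = {"g1": 0, "g2": 0, "g3": 1, "g4": 1}
solution_b = {"g1": 0, "g2": 0, "g3": 0, "g4": 1}

partition = partition_by_signature([solution_a, solution_b])
print(partition)
```

Genes that co-cluster in every solution stay together, while genes on which the solutions disagree (g3 and g4 here) are separated, which mirrors the idea of keeping common features and exposing differences.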
An efficient annotation and gene-expression derivation tool for Illumina Solexa datasets.
Hosseini, Parsa; Tremblay, Arianne; Matthews, Benjamin F; Alkharouf, Nadim W
2010-07-02
The data produced by an Illumina flow cell with all eight lanes occupied amount to well over a terabyte worth of images, with gigabytes of reads following sequence alignment. The ability to translate such reads into meaningful annotation is therefore of great concern and importance. Very easily, one can get flooded with such a great volume of textual, unannotated data irrespective of read quality or size. CASAVA, an optional analysis tool for Illumina sequencing experiments, makes it possible to understand INDEL detection, SNP information, and allele calling. Not only extracting from such analysis a measure of gene expression in the form of tag-counts, but also annotating such reads, is therefore of significant value. We developed TASE (Tag counting and Analysis of Solexa Experiments), a rapid tag-counting and annotation software tool specifically designed for Illumina CASAVA sequencing datasets. Developed in Java and deployed using the jTDS JDBC driver and a SQL Server backend, TASE provides an extremely fast means of calculating gene expression through tag-counts while annotating sequenced reads with the gene's presumed function, from any given CASAVA-build. Such a build is generated for both DNA and RNA sequencing. Analysis is broken into two distinct components: DNA sequence or read concatenation, followed by tag-counting and annotation. The end result is output containing the homology-based functional annotation and the respective gene expression measure, signifying how many times sequenced reads were found within the genomic ranges of functional annotations. TASE is a powerful tool to facilitate the process of annotating a given Illumina Solexa sequencing dataset. Our results indicate that both homology-based annotation and tag-count analysis are achieved in very efficient times, allowing researchers to delve deep into a given CASAVA-build and maximize information extraction from a sequencing dataset. 
TASE is specially designed to translate sequence data in a CASAVA-build into functional annotations while producing corresponding gene expression measurements. Such analysis is executed in an ultrafast and highly efficient manner, whether the experiment is single-read or paired-end sequencing. TASE is a user-friendly and freely available application, allowing rapid analysis and annotation of any given Illumina Solexa sequencing dataset with ease.
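The tag-count measure described here, how many sequenced reads fall within the genomic range of each functional annotation, can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions (a single chromosome, each read reduced to one start position, inclusive ranges); it is not TASE itself, which the abstract describes as a Java application with a SQL Server backend, and all names and coordinates are hypothetical.

```python
def count_tags(read_positions, annotations):
    """Count reads falling inside each annotated genomic range.

    read_positions: list of read start positions on one chromosome
    annotations:    list of (name, start, end) tuples, ranges inclusive
    """
    counts = {name: 0 for name, _, _ in annotations}
    for position in read_positions:
        for name, start, end in annotations:
            if start <= position <= end:
                counts[name] += 1
    return counts


# Hypothetical annotations and read positions.
annotations = [("gene1", 100, 200), ("gene2", 300, 400)]
reads = [150, 160, 250, 300, 400, 401]

tag_counts = count_tags(reads, annotations)
print(tag_counts)
```

The nested loop is fine for a toy example; a real dataset with gigabytes of reads would need an indexed lookup such as sorted ranges with binary search or a database query.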
Li, Jiangeng; Su, Lei; Pang, Zenan
2015-12-01
Feature selection techniques have been widely applied to tumor gene expression data analysis in recent years. A filter feature selection method named marginal Fisher analysis score (MFA score), which is based on graph embedding, has been proposed, and it has been widely used mainly because it is superior to Fisher score. Considering the heavy redundancy in gene expression data, we propose a new filter feature selection technique in this paper. It is named MFA score+ and is based on MFA score and redundancy exclusion. We applied it to an artificial dataset and eight tumor gene expression datasets to select important features and then used a support vector machine as the classifier to classify the samples. Compared with MFA score, t test and Fisher score, it achieved higher classification accuracy.
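For orientation, the Fisher score used as a baseline in this abstract can be computed per feature as the between-class scatter divided by the within-class scatter. The Python sketch below shows that standard filter score only; it does not implement MFA score or MFA score+, whose details are not given in the abstract, and the example values are hypothetical.

```python
def fisher_score(values, labels):
    """Fisher score of one feature: between-class over within-class scatter.

    values: list of feature values, one per sample
    labels: list of class labels, aligned with values
    """
    overall_mean = sum(values) / len(values)
    between = 0.0
    within = 0.0
    for cls in set(labels):
        group = [v for v, lab in zip(values, labels) if lab == cls]
        class_mean = sum(group) / len(group)
        class_var = sum((v - class_mean) ** 2 for v in group) / len(group)
        between += len(group) * (class_mean - overall_mean) ** 2
        within += len(group) * class_var
    # A feature that is constant within every class separates them perfectly.
    return between / within if within > 0 else float("inf")


labels = [0, 0, 1, 1]
informative = [1.0, 3.0, 7.0, 9.0]   # class means differ (2 vs 8)
uninformative = [1.0, 9.0, 3.0, 7.0]  # class means are equal (5 vs 5)

print(fisher_score(informative, labels))
print(fisher_score(uninformative, labels))
```

A filter method ranks genes by such a score and keeps the top ones; because each gene is scored independently, highly correlated genes can all rank well, which is the redundancy problem the abstract sets out to address.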
Gene networks specific for innate immunity define post-traumatic stress disorder.
Breen, M S; Maihofer, A X; Glatt, S J; Tylee, D S; Chandler, S D; Tsuang, M T; Risbrough, V B; Baker, D G; O'Connor, D T; Nievergelt, C M; Woelk, C H
2015-12-01
The molecular factors involved in the development of Post-Traumatic Stress Disorder (PTSD) remain poorly understood. Previous transcriptomic studies investigating the mechanisms of PTSD have applied targeted approaches to identify individual genes under a cross-sectional framework and lack a holistic view of the behaviours and properties of these genes at the system level. Here we sought to apply an unsupervised gene-network based approach to a prospective experimental design using whole-transcriptome RNA-Seq gene expression from peripheral blood leukocytes of U.S. Marines (N=188), obtained both pre- and post-deployment to conflict zones. We identified discrete groups of co-regulated genes (i.e., co-expression modules) and tested them for association with PTSD. We identified one module at both pre- and post-deployment containing putative causal signatures for PTSD development displaying an over-expression of genes enriched for functions of innate-immune response and interferon signalling (Type-I and Type-II). Importantly, these results were replicated in a second non-overlapping independent dataset of U.S. Marines (N=96), further outlining the role of innate immune and interferon signalling genes within co-expression modules to explain at least part of the causal pathophysiology for PTSD development. A second module, consequential of trauma exposure, contained PTSD resiliency signatures and an over-expression of genes involved in hemostasis and wound responsiveness, suggesting that chronic levels of stress impair proper wound healing during/after exposure to the battlefield while highlighting the role of the hemostatic system as a clinical indicator of chronic-based stress. These findings provide novel insights for early preventative measures and advanced PTSD detection, which may lead to interventions that delay or perhaps abrogate the development of PTSD.
Comparative transcriptome analysis of microsclerotia development in Nomuraea rileyi.
Song, Zhangyong; Yin, Youping; Jiang, Shasha; Liu, Juanjuan; Chen, Huan; Wang, Zhongkang
2013-06-19
Nomuraea rileyi is used as an environmentally friendly biopesticide. However, mass production and commercialization of this organism are limited due to its fastidious growth and sporulation requirements. When cultured in amended medium, we found that N. rileyi could produce microsclerotia bodies, replacing conidiophores as the infectious agent. However, little is known about the genes involved in microsclerotia development. In the present study, the transcriptomes were analyzed using next-generation sequencing technology to find the genes involved in microsclerotia development. A total of 4.69 Gb of clean nucleotides comprising 32,061 sequences was obtained, and 20,919 sequences were annotated (about 65%). Among the annotated sequences, only 5928 were annotated with 34 gene ontology (GO) functional categories, and 12,778 sequences were mapped to 165 pathways by searching against the Kyoto Encyclopedia of Genes and Genomes pathway (KEGG) database. Furthermore, we assessed the transcriptomic differences between cultures grown in minimal and amended medium. In total, 4808 sequences were found to be differentially expressed; 719 differentially expressed unigenes were assigned to 25 GO classes and 1888 differentially expressed unigenes were assigned to 161 KEGG pathways, including 25 enrichment pathways. Subsequently, we examined the up-regulated or uniquely expressed genes following amended medium treatment, which were also expressed on the enrichment pathway, and found that most of them participated in mediating oxidative stress homeostasis. To elucidate the role of oxidative stress in microsclerotia development, we analyzed the diversification of unigenes using quantitative reverse transcription-PCR (RT-qPCR). Our findings suggest that oxidative stress occurs during microsclerotia development, along with a broad metabolic activity change. Our data provide the most comprehensive sequence resource available for the study of N. rileyi.
We believe that the transcriptome datasets will serve as an important public information platform to accelerate studies on N. rileyi microsclerotia.
Ning, Tongbo; Cui, Hao; Sun, Feng; Zou, Jidian
2017-09-05
Glioblastoma represents one of the most aggressive malignant brain tumors, with high morbidity and mortality. Demethylation drugs have been developed for its treatment, but little efficacy has been observed. The purpose of this study was to screen therapeutic targets of demethylation drugs or bioactive molecules for glioblastoma through systemic bioinformatics analysis. We first downloaded genome-wide expression profiles from the Gene Expression Omnibus (GEO) and conducted the primary analysis in R, mainly including preprocessing of raw microarray data, conversion between probe IDs and gene symbols and identification of differentially expressed genes (DEGs). Second, functional enrichment analysis was conducted via the Database for Annotation, Visualization and Integrated Discovery (DAVID) to explore biological processes involved in the development of glioblastoma. Third, we constructed a protein-protein interaction (PPI) network of genes of interest and conducted cross analysis of multiple datasets to obtain potential therapeutic targets for glioblastoma. Finally, we further confirmed the therapeutic targets through real-time RT-PCR. As a result, biological processes related to cancer development, amino metabolism, immune response, etc. were found to be significantly enriched in genes that are differentially expressed in glioblastoma and regulated by 5'aza-dC. In addition, network and cross analysis identified ACAT2, UFC1 and CYB5R1 as novel therapeutic targets of demethylation drugs, which was also confirmed by real-time RT-PCR. In conclusion, our study identified several biological processes and genes that are involved in the development of glioblastoma and regulated by 5'aza-dC, which would be helpful for the treatment of glioblastoma. Copyright © 2017 Elsevier B.V. All rights reserved.
dbMDEGA: a database for meta-analysis of differentially expressed genes in autism spectrum disorder.
Zhang, Shuyun; Deng, Libin; Jia, Qiyue; Huang, Shaoting; Gu, Junwang; Zhou, Fankun; Gao, Meng; Sun, Xinyi; Feng, Chang; Fan, Guangqin
2017-11-16
Autism spectrum disorders (ASD) are hereditary, heterogeneous and biologically complex neurodevelopmental disorders. Individual studies on gene expression in ASD cannot provide clear consensus conclusions. Therefore, a systematic review to synthesize the current findings from brain tissues and a search tool to share the meta-analysis results are urgently needed. Here, we conducted a meta-analysis of brain gene expression profiles in the currently reported human ASD expression datasets (with 84 frozen male cortex samples, 17 female cortex samples, 32 cerebellum samples and 4 formalin fixed samples) and knock-out mouse ASD model expression datasets (with 80 collective brain samples). Then, we used the R language to develop an interactive, shared and updated database (dbMDEGA) displaying the results of the meta-analysis of data from ASD studies regarding differentially expressed genes (DEGs) in the brain. This database, dbMDEGA ( https://dbmdega.shinyapps.io/dbMDEGA/ ), is a publicly available web-portal for manual annotation and visualization of brain DEGs from ASD studies. This database uniquely presents meta-analysis values and homologous forest plots of DEGs in brain tissues. Gene entries are annotated with meta-values, statistical values and forest plots of DEGs in brain samples. This database aims to provide searchable meta-analysis results based on the currently reported brain gene expression datasets of ASD to help detect candidate genes underlying this disorder. This new analytical tool may provide valuable assistance in the discovery of DEGs and the elucidation of the molecular pathogenicity of ASD. This database model may be replicated to study other disorders.
Lepre, Jorge; Rice, J Jeremy; Tu, Yuhai; Stolovitzky, Gustavo
2004-05-01
Despite the growing literature devoted to finding differentially expressed genes in assays probing different tissue types, little attention has been paid to the combinatorial nature of feature selection inherent to large, high-dimensional gene expression datasets. New flexible data analysis approaches capable of searching relevant subgroups of genes and experiments are needed to understand multivariate associations of gene expression patterns with observed phenotypes. We present in detail a deterministic algorithm to discover patterns of multivariate gene associations in gene expression data. The patterns discovered are differential with respect to a control dataset. The algorithm is exhaustive and efficient, reporting all existing patterns that fit a given input parameter set while avoiding enumeration of the entire pattern space. The value of the pattern discovery approach is demonstrated by finding a set of genes that differentiate between two types of lymphoma. Moreover, these genes are found to behave consistently in an independent dataset produced in a different laboratory using different arrays, thus validating the genes selected using our algorithm. We show that the genes deemed significant in terms of their multivariate statistics will be missed using other methods. Our set of pattern discovery algorithms including a user interface is distributed as a package called Genes@Work. This package is freely available to non-commercial users and can be downloaded from our website (http://www.research.ibm.com/FunGen).
Lae, Marick; Moarii, Matahi; Sadacca, Benjamin; Pinheiro, Alice; Galliot, Marion; Abecassis, Judith; Laurent, Cecile; Reyal, Fabien
2016-01-01
Introduction HER2-positive breast cancer (BC) is a heterogeneous group of aggressive breast cancers, the prognosis of which has greatly improved since the introduction of treatments targeting HER2. However, these tumors may display intrinsic or acquired resistance to treatment, and classifiers of HER2-positive tumors are required to improve the prediction of prognosis and to develop novel therapeutic interventions. Methods We analyzed 2893 primary human breast cancer samples from 21 publicly available datasets and developed a six-metagene signature on a training set of 448 HER2-positive BC. We then used external public datasets to assess the ability of these metagenes to predict the response to chemotherapy (Ignatiadis dataset), and prognosis (METABRIC dataset). Results We identified a six-metagene signature (138 genes) containing metagenes enriched in different gene ontologies. The gene clusters were named as follows: Immunity, Tumor suppressors/proliferation, Interferon, Signal transduction, Hormone/survival and Matrix clusters. In all datasets, the Immunity metagene was less strongly expressed in ER-positive than in ER-negative tumors, and was inversely correlated with the Hormonal/survival metagene. Within the signature, multivariate analyses showed that strong expression of the “Immunity” metagene was associated with higher pCR rates after NAC (OR = 3.71[1.28–11.91], p = 0.019) than weak expression, and with a better prognosis in HER2-positive/ER-negative breast cancers (HR = 0.58 [0.36–0.94], p = 0.026). Immunity metagene expression was associated with the presence of tumor-infiltrating lymphocytes (TILs). Conclusion The identification of a predictive and prognostic immune module in HER2-positive BC confirms the need for clinical testing for immune checkpoint modulators and vaccines for this specific subtype. The inverse correlation between Immunity and hormone pathways opens research perspectives and deserves further investigation. PMID:28005906
Comparative Microbial Modules Resource: Generation and Visualization of Multi-species Biclusters
Bate, Ashley; Eichenberger, Patrick; Bonneau, Richard
2011-01-01
The increasing abundance of large-scale, high-throughput datasets for many closely related organisms provides opportunities for comparative analysis via the simultaneous biclustering of datasets from multiple species. These analyses require a reformulation of how to organize multi-species datasets and visualize comparative genomics data analyses results. Recently, we developed a method, multi-species cMonkey, which integrates heterogeneous high-throughput datatypes from multiple species to identify conserved regulatory modules. Here we present an integrated data visualization system, built upon the Gaggle, enabling exploration of our method's results (available at http://meatwad.bio.nyu.edu/cmmr.html). The system can also be used to explore other comparative genomics datasets and outputs from other data analysis procedures – results from other multiple-species clustering programs or from independent clustering of different single-species datasets. We provide an example use of our system for two bacteria, Escherichia coli and Salmonella Typhimurium. We illustrate the use of our system by exploring conserved biclusters involved in nitrogen metabolism, uncovering a putative function for yjjI, a currently uncharacterized gene that we predict to be involved in nitrogen assimilation. PMID:22144874
Comparative microbial modules resource: generation and visualization of multi-species biclusters.
Kacmarczyk, Thadeous; Waltman, Peter; Bate, Ashley; Eichenberger, Patrick; Bonneau, Richard
2011-12-01
The increasing abundance of large-scale, high-throughput datasets for many closely related organisms provides opportunities for comparative analysis via the simultaneous biclustering of datasets from multiple species. These analyses require a reformulation of how to organize multi-species datasets and visualize comparative genomics data analyses results. Recently, we developed a method, multi-species cMonkey, which integrates heterogeneous high-throughput datatypes from multiple species to identify conserved regulatory modules. Here we present an integrated data visualization system, built upon the Gaggle, enabling exploration of our method's results (available at http://meatwad.bio.nyu.edu/cmmr.html). The system can also be used to explore other comparative genomics datasets and outputs from other data analysis procedures - results from other multiple-species clustering programs or from independent clustering of different single-species datasets. We provide an example use of our system for two bacteria, Escherichia coli and Salmonella Typhimurium. We illustrate the use of our system by exploring conserved biclusters involved in nitrogen metabolism, uncovering a putative function for yjjI, a currently uncharacterized gene that we predict to be involved in nitrogen assimilation. © 2011 Kacmarczyk et al.
Fan, Qianrui; Wang, Wenyu; Hao, Jingcan; He, Awen; Wen, Yan; Guo, Xiong; Wu, Cuiyan; Ning, Yujie; Wang, Xi; Wang, Sen; Zhang, Feng
2017-08-01
Neuroticism is a fundamental personality trait with a significant genetic determinant. To identify novel susceptibility genes for neuroticism, we conducted an integrative analysis of genomic and transcriptomic data from genome wide association study (GWAS) and expression quantitative trait locus (eQTL) studies. GWAS summary data were derived from published studies of neuroticism, involving a total of 170,906 subjects. An eQTL dataset containing 927,753 eQTLs was obtained from an eQTL meta-analysis of 5311 samples. Integrative analysis of GWAS and eQTL data was conducted with the summary data-based Mendelian randomization (SMR) analysis software. To identify neuroticism associated gene sets, the SMR analysis results were further subjected to gene set enrichment analysis (GSEA). The gene set annotation dataset (containing 13,311 annotated gene sets) of the GSEA Molecular Signatures Database was used. SMR single gene analysis identified 6 significant genes for neuroticism, including MSRA (p value=2.27×10⁻¹⁰), MGC57346 (p value=6.92×10⁻⁷), BLK (p value=1.01×10⁻⁶), XKR6 (p value=1.11×10⁻⁶), C17ORF69 (p value=1.12×10⁻⁶) and KIAA1267 (p value=4.00×10⁻⁶). Gene set enrichment analysis detected a significant association for the Chr8p23 gene set (false discovery rate=0.033). Our results provide novel clues for studies of the genetic mechanism of neuroticism. Copyright © 2017. Published by Elsevier Inc.
A ground truth based comparative study on clustering of gene expression data.
Zhu, Yitan; Wang, Zuyi; Miller, David J; Clarke, Robert; Xuan, Jianhua; Hoffman, Eric P; Wang, Yue
2008-05-01
Given the variety of available clustering methods for gene expression data analysis, it is important to develop an appropriate and rigorous validation scheme to assess the performance and limitations of the most widely used clustering algorithms. In this paper, we present a ground truth based comparative study on the functionality, accuracy, and stability of five data clustering methods, namely hierarchical clustering, K-means clustering, self-organizing maps, standard finite normal mixture fitting, and a caBIG toolkit (VIsual Statistical Data Analyzer--VISDA), tested on sample clustering of seven published microarray gene expression datasets and one synthetic dataset. We examined the performance of these algorithms in both data-sufficient and data-insufficient cases using quantitative performance measures, including cluster number detection accuracy and mean and standard deviation of partition accuracy. The experimental results showed that VISDA, an interactive coarse-to-fine maximum likelihood fitting algorithm, is a solid performer on most of the datasets, while K-means clustering and self-organizing maps optimized by the mean squared compactness criterion generally produce more stable solutions than the other methods.
Two-pass imputation algorithm for missing value estimation in gene expression time series.
Tsiporkova, Elena; Boeva, Veselka
2007-10-01
Gene expression microarray experiments frequently generate datasets with multiple values missing. However, most of the analysis, mining, and classification methods for gene expression data require a complete matrix of gene array values. Therefore, the accurate estimation of missing values in such datasets has been recognized as an important issue, and several imputation algorithms have already been proposed to the biological community. Most of these approaches, however, are not particularly suitable for time series expression profiles. In view of this, we propose a novel imputation algorithm, which is specially suited for the estimation of missing values in gene expression time series data. The algorithm utilizes Dynamic Time Warping (DTW) distance in order to measure the similarity between time expression profiles, and subsequently selects for each gene expression profile with missing values a dedicated set of candidate profiles for estimation. Three different DTW-based imputation (DTWimpute) algorithms have been considered: position-wise, neighborhood-wise, and two-pass imputation. These have initially been prototyped in Perl, and their accuracy has been evaluated on yeast expression time series data using several different parameter settings. The experiments have shown that the two-pass algorithm consistently outperforms, in particular for datasets with a higher level of missing entries, the neighborhood-wise and the position-wise algorithms. The performance of the two-pass DTWimpute algorithm has further been benchmarked against the weighted K-Nearest Neighbors algorithm, which is widely used in the biological community; the former algorithm has appeared superior to the latter one. Motivated by these findings, indicating clearly the added value of the DTW techniques for missing value estimation in time series data, we have built an optimized C++ implementation of the two-pass DTWimpute algorithm. 
The software also provides for a choice between three different initial rough imputation methods.
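The Dynamic Time Warping distance on which these imputation algorithms rely can be written as a short dynamic program. The sketch below shows only the textbook DTW distance with an absolute-difference local cost and a ranking of candidate profiles by that distance; it is not the DTWimpute code (prototyped in Perl and reimplemented in C++), and the profile names and values are hypothetical.

```python
def dtw_distance(a, b):
    """Classic DTW distance between two numeric sequences.

    cost[i][j] holds the cheapest cumulative alignment cost of a[:i] and b[:j].
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])          # local cost (assumed metric)
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b repeats
                                 cost[i][j - 1],      # b advances, a repeats
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]

# Hypothetical expression time series: rank candidates by similarity to a target.
target = [0.1, 0.5, 0.9, 0.4]
candidates = {
    "profile_shifted": [0.1, 0.1, 0.5, 0.9, 0.4],  # same shape, delayed
    "profile_flat": [0.5, 0.5, 0.5, 0.5],
}
ranked = sorted(candidates, key=lambda name: dtw_distance(target, candidates[name]))
print(ranked)
```

Because the warping path may repeat time points, a delayed copy of a profile stays close to the original, which is the property that makes DTW attractive for selecting candidate profiles in time series.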
Huang, Lei; Zhao, Shuangping; Frasor, Jonna M.; Dai, Yang
2011-01-01
Approximately half of estrogen receptor (ER) positive breast tumors will fail to respond to endocrine therapy. Here we used an integrative bioinformatics approach to analyze three gene expression profiling data sets from breast tumors in an attempt to uncover underlying mechanisms contributing to the development of resistance and potential therapeutic strategies to counteract these mechanisms. Genes that are differentially expressed in tamoxifen resistant vs. sensitive breast tumors were identified from three different publicly available microarray datasets. These differentially expressed (DE) genes were analyzed using gene function and gene set enrichment and examined in intrinsic subtypes of breast tumors. The Connectivity Map analysis was utilized to link gene expression profiles of tamoxifen resistant tumors to small molecules and validation studies were carried out in a tamoxifen resistant cell line. Despite little overlap in genes that are differentially expressed in tamoxifen resistant vs. sensitive tumors, a high degree of functional similarity was observed among the three datasets. Tamoxifen resistant tumors displayed enriched expression of genes related to cell cycle and proliferation, as well as elevated activity of E2F transcription factors, and were highly correlated with a Luminal intrinsic subtype. A number of small molecules, including phenothiazines, were found that induced a gene signature in breast cancer cell lines opposite to that found in tamoxifen resistant vs. sensitive tumors and the ability of phenothiazines to down-regulate cyclin E2 and inhibit proliferation of tamoxifen resistant breast cancer cells was validated. Our findings demonstrate that an integrated bioinformatics approach to analyze gene expression profiles from multiple breast tumor datasets can identify important biological pathways and potentially novel therapeutic options for tamoxifen-resistant breast cancers. PMID:21789246
Annotation of gene function in citrus using gene expression information and co-expression networks
2014-01-01
Background The genus Citrus encompasses major cultivated plants such as sweet orange, mandarin, lemon and grapefruit, among the world’s most economically important fruit crops. With increasing volumes of transcriptomics data available for these species, Gene Co-expression Network (GCN) analysis is a viable option for predicting gene function at a genome-wide scale. GCN analysis is based on a “guilt-by-association” principle whereby genes encoding proteins involved in similar and/or related biological processes may exhibit similar expression patterns across diverse sets of experimental conditions. While bioinformatics resources such as GCN analysis are widely available for efficient gene function prediction in model plant species including Arabidopsis, soybean and rice, in citrus these tools are not yet developed. Results We have constructed a comprehensive GCN for citrus inferred from 297 publicly available Affymetrix Genechip Citrus Genome microarray datasets, providing gene co-expression relationships at a genome-wide scale (33,000 transcripts). The comprehensive citrus GCN consists of a global GCN (condition-independent) and four condition-dependent GCNs that survey the sweet orange species only, all citrus fruit tissues, all citrus leaf tissues, or stress-exposed plants. All of these GCNs are clustered using genome-wide, gene-centric (guide) and graph clustering algorithms for flexibility of gene function prediction. For each putative cluster, gene ontology (GO) enrichment and gene expression specificity analyses were performed to enhance gene function, expression and regulation pattern prediction. The guide-gene approach was used to infer novel roles of genes involved in disease susceptibility and vitamin C metabolism, and graph-clustering approaches were used to investigate isoprenoid/phenylpropanoid metabolism in citrus peel, and citric acid catabolism via the GABA shunt in citrus fruit. 
Conclusions Integration of citrus gene co-expression networks, functional enrichment analysis and gene expression information provides opportunities to infer gene function in citrus. We present a publicly accessible tool, Network Inference for Citrus Co-Expression (NICCE, http://citrus.adelaide.edu.au/nicce/home.aspx), for gene co-expression analysis in citrus. PMID:25023870
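The "guilt-by-association" principle behind a gene co-expression network can be made concrete with a small sketch: two genes are linked when their expression profiles across conditions are strongly correlated, and a guide gene's neighbours become candidates for sharing its function. The Pearson measure, the 0.8 threshold, the gene names and the expression values below are assumptions for illustration and are not taken from the NICCE tool.

```python
from itertools import combinations
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def coexpression_edges(expr, threshold=0.8):
    """Return (gene1, gene2, r) for every pair with |r| at or above the threshold."""
    edges = []
    for g1, g2 in combinations(sorted(expr), 2):
        r = pearson(expr[g1], expr[g2])
        if abs(r) >= threshold:
            edges.append((g1, g2, r))
    return edges

def neighbours(edges, guide):
    """Guide-gene lookup: genes directly linked to the guide gene."""
    return sorted({b if a == guide else a for a, b, _ in edges if guide in (a, b)})

# Hypothetical expression values across five conditions.
expr = {
    "geneA": [1, 2, 3, 4, 5],
    "geneB": [2, 4, 6, 8, 11],
    "geneC": [5, 1, 4, 2, 3],
}
edges = coexpression_edges(expr)
print(edges)
print(neighbours(edges, "geneA"))
```

A condition-dependent network in the sense of the abstract would correspond to running the same construction on a subset of the conditions, for example fruit tissues only.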
Lohmann, Ingrid
2012-01-01
In multi-cellular organisms, spatiotemporal activity of cis-regulatory DNA elements depends on their occupancy by different transcription factors (TFs). In recent years, genome-wide ChIP-on-Chip, ChIP-Seq and DamID assays have been extensively used to unravel the combinatorial interaction of TFs with cis-regulatory modules (CRMs) in the genome. Even though genome-wide binding profiles are increasingly becoming available for different TFs, single TF binding profiles are in most cases not sufficient for dissecting complex regulatory networks. Thus, potent computational tools detecting statistically significant and biologically relevant TF-motif co-occurrences in genome-wide datasets are essential for analyzing context-dependent transcriptional regulation. We have developed COPS (Co-Occurrence Pattern Search), a new bioinformatics tool based on a combination of association rules and Markov chain models, which detects co-occurring TF binding sites (BSs) on genomic regions of interest. COPS scans DNA sequences for frequent motif patterns using a Frequent-Pattern tree based data mining approach, which allows efficient performance of the software with respect to both data structure and implementation speed, in particular when mining large datasets. Since transcriptional gene regulation very often relies on the formation of regulatory protein complexes mediated by closely adjoining TF binding sites on CRMs, COPS additionally detects preferred short distance between co-occurring TF motifs. The performance of our software with respect to biological significance was evaluated using three published datasets containing genomic regions that are independently bound by several TFs involved in a defined biological process. In sum, COPS is a fast, efficient and user-friendly tool mining statistically and biologically significant TFBS co-occurrences and therefore allows the identification of TFs that combinatorially regulate gene expression. PMID:23272209
miRCat2: accurate prediction of plant and animal microRNAs from next-generation sequencing datasets
Paicu, Claudia; Mohorianu, Irina; Stocks, Matthew; Xu, Ping; Coince, Aurore; Billmeier, Martina; Dalmay, Tamas; Moulton, Vincent; Moxon, Simon
2017-01-01
Abstract Motivation MicroRNAs are a class of ∼21–22 nt small RNAs which are excised from a stable hairpin-like secondary structure. They have important gene regulatory functions and are involved in many pathways including developmental timing, organogenesis and development in eukaryotes. There are several computational tools for miRNA detection from next-generation sequencing datasets. However, many of these tools suffer from high false positive and false negative rates. Here we present a novel miRNA prediction algorithm, miRCat2. miRCat2 incorporates a new entropy-based approach to detect miRNA loci, which is designed to cope with the high sequencing depth of current next-generation sequencing datasets. It has a user-friendly interface and produces graphical representations of the hairpin structure and plots depicting the alignment of sequences on the secondary structure. Results We test miRCat2 on a number of animal and plant datasets and present a comparative analysis with miRCat, miRDeep2, miRPlant and miReap. We also use mutants in the miRNA biogenesis pathway to evaluate the predictions of these tools. Results indicate that miRCat2 has an improved accuracy compared with other methods tested. Moreover, miRCat2 predicts several new miRNAs that are differentially expressed in wild-type versus mutants in the miRNA biogenesis pathway. Availability and Implementation miRCat2 is part of the UEA small RNA Workbench and is freely available from http://srna-workbench.cmp.uea.ac.uk/. Contact v.moulton@uea.ac.uk or s.moxon@uea.ac.uk Supplementary information Supplementary data are available at Bioinformatics online. PMID:28407097
Accounting for one-channel depletion improves missing value imputation in 2-dye microarray data.
Ritz, Cecilia; Edén, Patrik
2008-01-19
For 2-dye microarray platforms, some missing values may arise from an un-measurably low RNA expression in one channel only. Information on such "one-channel depletion" is so far not included in algorithms for imputation of missing values. Calculating the mean deviation between imputed values and duplicate controls in five datasets, we show that KNN-based imputation gives a systematic bias of the imputed expression values of one-channel depleted spots. Evaluating the correction of this bias by cross-validation showed that the mean square deviation between imputed values and duplicates was reduced by up to 51%, depending on dataset. By including more information in the imputation step, we more accurately estimate missing expression values.
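For orientation, the baseline KNN-based imputation that the abstract evaluates can be sketched as below. This is a plain unweighted variant under stated assumptions: missing entries are encoded as `None`, distance is the root mean square difference over columns observed in both rows, and the imputed value is the unweighted mean of the k nearest rows. It does not include the one-channel depletion correction, and the name `knn_impute` and the data are hypothetical.

```python
def knn_impute(rows, target, col, k=2):
    """Estimate rows[target][col] from the k most similar rows that have a value there."""
    def dist(a, b):
        # Compare only columns observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return float("inf")
        return (sum((x - y) ** 2 for x, y in shared) / len(shared)) ** 0.5

    candidates = [r for i, r in enumerate(rows)
                  if i != target and r[col] is not None]
    if not candidates:
        raise ValueError("no row has an observed value in the requested column")
    nearest = sorted(candidates, key=lambda r: dist(rows[target], r))[:k]
    return sum(r[col] for r in nearest) / len(nearest)

# Hypothetical log-ratio matrix: genes in rows, arrays in columns.
rows = [
    [1.0, 2.0, None],   # gene with a missing spot
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 2.8],
    [5.0, 6.0, 9.0],    # dissimilar gene, should not be used with k=2
]
estimate = knn_impute(rows, target=0, col=2, k=2)
print(estimate)
```

Because the estimate is an average over neighbours with measurable values, a spot that is missing because one channel was depleted would tend to be pulled toward typical values, which is consistent with the systematic bias the abstract reports.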
Determining Physical Mechanisms of Gene Expression Regulation from Single Cell Gene Expression Data.
Ezer, Daphne; Moignard, Victoria; Göttgens, Berthold; Adryan, Boris
2016-08-01
Many genes are expressed in bursts, which can contribute to cell-to-cell heterogeneity. It is now possible to measure this heterogeneity with high throughput single cell gene expression assays (single cell qPCR and RNA-seq). These experimental approaches generate gene expression distributions which can be used to estimate the kinetic parameters of gene expression bursting, namely the rate that genes turn on, the rate that genes turn off, and the rate of transcription. We construct a complete pipeline for the analysis of single cell qPCR data that uses the mathematics behind bursty expression to develop more accurate and robust algorithms for analyzing the origin of heterogeneity in experimental samples, specifically an algorithm for clustering cells by their bursting behavior (Simulated Annealing for Bursty Expression Clustering, SABEC) and a statistical tool for comparing the kinetic parameters of bursty expression across populations of cells (Estimation of Parameter changes in Kinetics, EPiK). We applied these methods to hematopoiesis, including a new single cell dataset in which transcription factors (TFs) involved in the earliest branchpoint of blood differentiation were individually up- and down-regulated. We could identify two unique sub-populations within a seemingly homogenous group of hematopoietic stem cells. In addition, we could predict regulatory mechanisms controlling the expression levels of eighteen key hematopoietic transcription factors throughout differentiation. Detailed information about gene regulatory mechanisms can therefore be obtained simply from high throughput single cell gene expression data, which should be widely applicable given the rapid expansion of single cell genomics.
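The three kinetic parameters named above (the rate at which a gene turns on, the rate at which it turns off, and the rate of transcription) define a two-state promoter model that can be simulated directly. The sketch below is a generic Gillespie simulation of that model, not the SABEC or EPiK algorithms; the mRNA degradation rate `k_deg`, the parameter values and the function name `simulate_bursty` are assumptions added so that the simulation has a steady state.

```python
import random

def simulate_bursty(k_on, k_off, k_tx, k_deg, t_end, seed=0):
    """Simulate one cell and return its mRNA count at time t_end.

    k_on / k_off: promoter switching rates; k_tx: transcription rate while on;
    k_deg: per-molecule mRNA decay rate (assumed, not named in the abstract).
    """
    rng = random.Random(seed)
    t, active, mrna = 0.0, False, 0
    while True:
        rates = [
            k_off if active else k_on,  # promoter switches state
            k_tx if active else 0.0,    # transcription only while on
            k_deg * mrna,               # first-order mRNA decay
        ]
        total = sum(rates)
        t += rng.expovariate(total)     # waiting time to the next event
        if t > t_end:
            return mrna
        r = rng.uniform(0.0, total)     # pick which event fires
        if r < rates[0]:
            active = not active
        elif r < rates[0] + rates[1]:
            mrna += 1
        elif mrna > 0:
            mrna -= 1

# Hypothetical parameters: rare activation, short active periods, strong bursts.
cells = [simulate_bursty(0.1, 1.0, 20.0, 1.0, t_end=50.0, seed=s)
         for s in range(200)]
silent = sum(1 for c in cells if c == 0)
print(f"mean mRNA = {sum(cells) / len(cells):.2f}, cells with zero mRNA = {silent}")
```

Running many independent cells gives a simulated expression distribution of the kind that single cell assays measure; fitting such distributions is, in outline, how kinetic parameters can be estimated from data.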
Muldoon, P P; Jackson, K J; Perez, E; Harenza, J L; Molas, S; Rais, B; Anwar, H; Zaveri, N T; Maldonado, R; Maskos, U; McIntosh, J M; Dierssen, M; Miles, M F; Chen, X; De Biasi, M; Damaj, M I
2014-01-01
BACKGROUND AND PURPOSE Recent data have indicated that α3β4* neuronal nicotinic (n) ACh receptors may play a role in morphine dependence. Here we investigated if nACh receptors modulate morphine physical withdrawal. EXPERIMENTAL APPROACHES To assess the role of α3β4* nACh receptors in morphine withdrawal, we used a genetic correlation approach using publicly available datasets within the GeneNetwork web resource, genetic knockout and pharmacological tools. Male and female European-American (n = 2772) and African-American (n = 1309) subjects from the Study of Addiction: Genetics and Environment dataset were assessed for possible associations of polymorphisms in the 15q25 gene cluster and opioid dependence. KEY RESULTS BXD recombinant mouse lines demonstrated an increased expression of α3, β4 and α5 nACh receptor mRNA in the forebrain and midbrain, which significantly correlated with increased defecation in mice undergoing morphine withdrawal. Mice overexpressing the gene cluster CHRNA5/A3/B4 exhibited increased somatic signs of withdrawal. Furthermore, α5 and β4 nACh receptor knockout mice expressed decreased somatic withdrawal signs compared with their wild-type counterparts. Moreover, selective α3β4* nACh receptor antagonists, α-conotoxin AuIB and AT-1001, attenuated somatic signs of morphine withdrawal in a dose-related manner. In addition, two human datasets revealed a protective role for variants in the CHRNA3 gene, which codes for the α3 nACh receptor subunit, in opioid dependence and withdrawal. In contrast, we found that the α4β2* nACh receptor subtype is not involved in morphine somatic withdrawal signs. CONCLUSION AND IMPLICATIONS Overall, our findings suggest an important role for the α3β4* nACh receptor subtype in morphine physical dependence. PMID:24750073
ATP binding cassette (ABC) transporters: expression and clinical value in glioblastoma.
Dréan, Antonin; Rosenberg, Shai; Lejeune, François-Xavier; Goli, Larissa; Nadaradjane, Aravindan Arun; Guehennec, Jérémy; Schmitt, Charlotte; Verreault, Maïté; Bielle, Franck; Mokhtari, Karima; Sanson, Marc; Carpentier, Alexandre; Delattre, Jean-Yves; Idbaih, Ahmed
2018-03-08
ATP-binding cassette transporters (ABC transporters) regulate traffic of multiple compounds, including chemotherapeutic agents, through biological membranes. They are expressed by multiple cell types and have been implicated in the drug resistance of some cancer cells. Despite significant research in ABC transporters in the context of many diseases, little is known about their expression and clinical value in glioblastoma (GBM). We analyzed expression of 49 ABC transporters in both commercial and patient-derived GBM cell lines as well as from 51 human GBM tumor biopsies. Using The Cancer Genome Atlas (TCGA) cohort as a training dataset and our cohort as a validation dataset, we also investigated the prognostic value of these ABC transporters in newly diagnosed GBM patients, treated with the standard of care. In contrast to commercial GBM cell lines, GBM-patient derived cell lines (PDCL), grown as neurospheres in a serum-free medium, express ABC transporters similarly to parental tumors. Serum appeared to slightly increase resistance to temozolomide correlating with a tendency for an increased expression of ABCB1. Some differences were observed mainly due to expression of ABC transporters by microenvironmental cells. Together, our data suggest that the efficacy of chemotherapeutic agents may be misestimated in vitro if they are the targets of efflux pumps whose expression can be modulated by serum. Interestingly, several ABC transporters have prognostic value in the TCGA dataset. In our cohort of 51 GBM patients treated with radiation therapy with concurrent and adjuvant temozolomide, ABCA13 overexpression is associated with a decreased progression free survival in univariate (p < 0.01) and multivariate analyses including MGMT promoter methylation (p = 0.05) suggesting reduced sensitivity to temozolomide in ABCA13 overexpressing GBM. Expression of ABC transporters is: (i) detected in GBM and microenvironmental cells and (ii) better reproduced in GBM-PDCL. 
ABCA13 expression is an independent prognostic factor in newly diagnosed GBM patients. Further prospective studies are warranted to investigate whether ABCA13 expression can be used to further personalize treatments for GBM.
Meng, Xian-liang; Liu, Ping; Jia, Fu-long; Li, Jian; Gao, Bao-Quan
2015-01-01
The swimming crab Portunus trituberculatus is a commercially important crab species in East Asian countries. Gonadal development is a physiological process of great significance to reproduction as well as to commercial seed production for P. trituberculatus. However, little is currently known about the molecular mechanisms governing the developmental processes of gonads in this species. To open avenues of molecular research on P. trituberculatus gonadal development, Illumina paired-end sequencing technology was employed to develop deep-coverage transcriptome sequencing data for its gonads. Illumina sequencing generated 58,429,148 and 70,474,978 high-quality reads from the ovary and testis cDNA library, respectively. All these reads were assembled into 54,960 unigenes with an average sequence length of 879 bp, of which 12,340 unigenes (22.45% of the total) matched sequences in the GenBank non-redundant database. Based on our transcriptome analysis as well as published literature, a number of candidate genes potentially involved in the regulation of gonadal development of P. trituberculatus were identified, such as FAOMeT, mPRγ, PGMRC1, PGDS, PGER4, 3β-HSD and 17β-HSDs. Differential expression analysis generated 5,919 differentially expressed genes between ovary and testis, among which many genes related to gametogenesis and several genes previously reported to be critical in differentiation and development of gonads were found, including Foxl2, Wnt4, Fst, Fem-1 and Sox9. Furthermore, 28,534 SSRs and 111,646 high-quality SNPs were identified in this transcriptome dataset. This work represents the first transcriptome analysis of P. trituberculatus gonads using next-generation sequencing technology and provides a valuable dataset for understanding molecular mechanisms controlling development of gonads and facilitating future investigation of reproductive biology in this species. 
The molecular markers obtained in this study will provide a fundamental basis for population genetics and functional genomics in P. trituberculatus and other closely related species. PMID:26042806
Martini, Paolo; Risso, Davide; Sales, Gabriele; Romualdi, Chiara; Lanfranchi, Gerolamo; Cagnin, Stefano
2011-04-11
In recent decades, microarray technology has spread, leading to a dramatic increase in publicly available datasets. The first statistical tools developed were focused on the identification of significant differentially expressed genes. Later, researchers moved toward the systematic integration of gene expression profiles with additional biological information, such as chromosomal location, ontological annotations or sequence features. The analysis of gene expression linked to the physical location of genes on chromosomes allows the identification of transcriptionally imbalanced regions, while Gene Set Analysis focuses on the detection of coordinated changes in transcriptional levels among sets of biologically related genes. In this field, meta-analysis offers the possibility to compare different studies addressing the same biological question, so as to fully exploit public gene expression datasets. We describe STEPath, a method that starts from gene expression profiles and integrates the analysis of imbalanced regions as an a priori step before performing gene set analysis. The application of STEPath in individual studies produced gene set scores weighted by chromosomal activation. As a final step, we propose a way to compare these scores across different studies (meta-analysis) on related biological issues. One complication with meta-analysis is batch effects, which occur because molecular measurements are affected by laboratory conditions, reagent lots and personnel differences. Major problems occur when batch effects are correlated with an outcome of interest and lead to incorrect conclusions. We evaluated the power of combining chromosome mapping and gene set enrichment analysis, performing the analysis on a dataset of leukaemia (example of individual study) and on a dataset of skeletal muscle diseases (meta-analysis approach). 
In leukaemia, we identified the Hox gene set, a gene set closely related to the pathology that other algorithms of gene set analysis do not identify, while the meta-analysis approach on muscular disease discriminates between related pathologies and correlates similar ones from different studies. STEPath is a new method that integrates gene expression profiles, genomic co-expressed regions and the information about the biological function of genes. The usage of the STEPath-computed gene set scores overcomes batch effects in the meta-analysis approaches allowing the direct comparison of different pathologies and different studies on a gene set activation level.
Laituri, Tony R; Henry, Scott; El-Jawahri, Raed; Muralidharan, Nirmal; Li, Guosong; Nutt, Marvin
2015-11-01
A provisional, age-dependent thoracic risk equation (or, "risk curve") was derived to estimate moderate-to-fatal injury potential (AIS2+), pertaining to men with responses gauged by the advanced mid-sized male test dummy (THOR50). The derivation involved two distinct data sources: cases from real-world crashes (e.g., the National Automotive Sampling System, NASS) and cases involving post-mortem human subjects (PMHS). The derivation was therefore more comprehensive, as NASS datasets generally skew towards younger occupants, and PMHS datasets generally skew towards older occupants. However, known deficiencies had to be addressed (e.g., the NASS cases had unknown stimuli, and the PMHS tests required transformation of known stimuli into THOR50 stimuli). For the NASS portion of the analysis, chest-injury outcomes for adult male drivers about the size of the THOR50 were collected from real-world, 11-1 o'clock, full-engagement frontal crashes (NASS, 1995-2012 calendar years, 1985-2012 model-year light passenger vehicles). The screening for THOR50-sized men involved application of a set of newly-derived "correction" equations for self-reported height and weight data in NASS. Finally, THOR50 stimuli were estimated via field simulations involving attendant representative restraint systems, and those stimuli were then assigned to corresponding NASS cases (n=508). For the PMHS portion of the analysis, simulation-based closure equations were developed to convert PMHS stimuli into THOR50 stimuli. Specifically, closure equations were derived for the four measurement locations on the THOR50 chest by cross-correlating the results of matched-loading simulations between the test dummy and the age-dependent, Ford Human Body Model. The resulting closure equations demonstrated acceptable fidelity (n=75 matched simulations, R2≥0.99). These equations were applied to the THOR50-sized men in the PMHS dataset (n=20). 
The NASS and PMHS datasets were combined and subjected to survival analysis with event-frequency weighting and arbitrary censoring. The resulting risk curve--a function of peak THOR50 chest compression and age--demonstrated acceptable fidelity for recovering the AIS2+ chest injury rate of the combined dataset (i.e., IR_dataset=1.97% vs. curve-based IR_dataset=1.98%). Additional sensitivity analyses showed that (a) binary logistic regression yielded a risk curve with nearly-identical fidelity, (b) there was only a slight advantage of combining the small-sample PMHS dataset with the large-sample NASS dataset, (c) use of the PMHS-based risk curve for risk estimation of the combined dataset yielded relatively poor performance (194% difference), and (d) when controlling for the type of contact (lab-consistent or not), the resulting risk curves were similar.
Variations in study design are typical for toxicogenomic studies, but their impact on gene expression in control animals has not been well characterized. A dataset of control animal microarray expression data was assembled by a working group of the Health and Environmental Scienc...
Establishing a process for conducting cross-jurisdictional record linkage in Australia.
Moore, Hannah C; Guiver, Tenniel; Woollacott, Anthony; de Klerk, Nicholas; Gidding, Heather F
2016-04-01
To describe the realities of conducting a cross-jurisdictional data linkage project involving state and Australian Government-based data collections to inform future national data linkage programs of work. We outline the processes involved in conducting a Proof of Concept data linkage project including the implementation of national data integration principles, data custodian and ethical approval requirements, and establishment of data flows. The approval process involved nine approval and regulatory bodies and took more than two years. Data will be linked across 12 datasets involving three data linkage centres. A framework was established to allow data to flow between these centres while maintaining the separation principle that serves to protect the privacy of the individual. This will be the first project to link child immunisation records from an Australian Government dataset to other administrative health datasets for a population cohort covering 2 million births in two Australian states. Although the project experienced some delays, positive outcomes were realised, primarily the development of strong collaborations across key stakeholder groups including community engagement. We have identified several recommendations and enhancements to this now established framework to further streamline the process for data linkage studies involving Australian Government data. © 2015 Public Health Association of Australia.
VideoWeb Dataset for Multi-camera Activities and Non-verbal Communication
NASA Astrophysics Data System (ADS)
Denina, Giovanni; Bhanu, Bir; Nguyen, Hoang Thanh; Ding, Chong; Kamal, Ahmed; Ravishankar, Chinya; Roy-Chowdhury, Amit; Ivers, Allen; Varda, Brenda
Human-activity recognition is one of the most challenging problems in computer vision. Researchers from around the world have tried to solve this problem and have come a long way in recognizing simple motions and atomic activities. As the computer vision community heads toward fully recognizing human activities, a challenging and labeled dataset is needed. To respond to that need, we collected a dataset of realistic scenarios in a multi-camera network environment (VideoWeb) involving multiple persons performing dozens of different repetitive and non-repetitive activities. This chapter describes the details of the dataset. We believe that this VideoWeb Activities dataset is unique and it is one of the most challenging datasets available today. The dataset is publicly available online at http://vwdata.ee.ucr.edu/ along with the data annotation.
Farnell, D J J; Popat, H; Richmond, S
2016-06-01
Methods used in image processing should reflect any multilevel structures inherent in the image dataset or they run the risk of functioning inadequately. We wish to test the feasibility of multilevel principal components analysis (PCA) to build active shape models (ASMs) for cases relevant to medical and dental imaging. Multilevel PCA was used to carry out model fitting to sets of landmark points and it was compared to the results of "standard" (single-level) PCA. Proof of principle was tested by applying mPCA to model basic peri-oral expressions (happy, neutral, sad) approximated to the junction between the mouth/lips. Monte Carlo simulations were used to create this data which allowed exploration of practical implementation issues such as the number of landmark points, number of images, and number of groups (i.e., "expressions" for this example). To further test the robustness of the method, mPCA was subsequently applied to a dental imaging dataset utilising landmark points (placed by different clinicians) along the boundary of mandibular cortical bone in panoramic radiographs of the face. Changes of expression that varied between groups were modelled correctly at one level of the model and changes in lip width that varied within groups at another for the Monte Carlo dataset. Extreme cases in the test dataset were modelled adequately by mPCA but not by standard PCA. Similarly, variations in the shape of the cortical bone were modelled by one level of mPCA and variations between the experts at another for the panoramic radiographs dataset. Results for mPCA were found to be comparable to those of standard PCA for point-to-point errors via miss-one-out testing for this dataset. These errors reduce with increasing number of eigenvectors/values retained, as expected. We have shown that mPCA can be used in shape models for dental and medical image processing. mPCA was found to provide more control and flexibility when compared to standard "single-level" PCA. 
Specifically, mPCA is preferable to "standard" PCA when multiple levels occur naturally in the dataset. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
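The two-level idea described in this abstract can be illustrated with a minimal sketch. It assumes a simple formulation in which one covariance matrix is built from the group means (the between-group level) and a second, pooled one from each sample's deviation from its own group mean (the within-group level). The function name `multilevel_pca` and the returned keys are hypothetical, and the sketch is not claimed to reproduce the authors' implementation.

```python
import numpy as np


def multilevel_pca(shapes, groups):
    """Two-level PCA sketch: eigen-decompose between- and within-group covariance.

    shapes: array (n_samples, n_features), e.g. flattened landmark coordinates.
    groups: array (n_samples,) of group labels (at least two distinct groups).
    """
    shapes = np.asarray(shapes, dtype=float)
    labels, inverse = np.unique(np.asarray(groups), return_inverse=True)
    if len(labels) < 2:
        raise ValueError("need at least two groups")

    # Mean shape of every group, one row per group.
    group_means = np.stack([shapes[inverse == i].mean(axis=0)
                            for i in range(len(labels))])

    # Between-group level: covariance of the group means about their own mean.
    centred_means = group_means - group_means.mean(axis=0)
    between = centred_means.T @ centred_means / (len(labels) - 1)

    # Within-group level: pooled covariance of residuals about each group's mean.
    residuals = shapes - group_means[inverse]
    within = residuals.T @ residuals / (len(shapes) - len(labels))

    def sorted_eig(cov):
        # eigh returns ascending eigenvalues; flip to descending order.
        values, vectors = np.linalg.eigh(cov)
        return values[::-1], vectors[:, ::-1]

    b_vals, b_vecs = sorted_eig(between)
    w_vals, w_vecs = sorted_eig(within)
    return {"between_values": b_vals, "between_vectors": b_vecs,
            "within_values": w_vals, "within_vectors": w_vecs}
```

In this formulation, variation that separates groups (such as expression) loads on the between-group eigenvectors, while variation shared across groups (such as lip width) loads on the within-group eigenvectors.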
Wang, Yi Kan; Hurley, Daniel G.; Schnell, Santiago; Print, Cristin G.; Crampin, Edmund J.
2013-01-01
We develop a new regression algorithm, cMIKANA, for inference of gene regulatory networks from combinations of steady-state and time-series gene expression data. Using simulated gene expression datasets to assess the accuracy of reconstructing gene regulatory networks, we show that steady-state and time-series data sets can successfully be combined to identify gene regulatory interactions using the new algorithm. Inferring gene networks from combined data sets was found to be advantageous when using noisy measurements collected with either lower sampling rates or a limited number of experimental replicates. We illustrate our method by applying it to a microarray gene expression dataset from human umbilical vein endothelial cells (HUVECs) which combines time series data from treatment with growth factor TNF and steady state data from siRNA knockdown treatments. Our results suggest that the combination of steady-state and time-series datasets may provide better prediction of RNA-to-RNA interactions, and may also reveal biological features that cannot be identified from dynamic or steady state information alone. Finally, we consider the experimental design of genomics experiments for gene regulatory network inference and show that network inference can be improved by incorporating steady-state measurements with time-series data. PMID:23967277
Impact of sequencing depth and read length on single cell RNA sequencing data of T cells.
Rizzetto, Simone; Eltahla, Auda A; Lin, Peijie; Bull, Rowena; Lloyd, Andrew R; Ho, Joshua W K; Venturi, Vanessa; Luciani, Fabio
2017-10-06
Single cell RNA sequencing (scRNA-seq) provides great potential in measuring the gene expression profiles of heterogeneous cell populations. In immunology, scRNA-seq allowed the characterisation of transcript sequence diversity of functionally relevant T cell subsets, and the identification of the full length T cell receptor (TCRαβ), which defines the specificity against cognate antigens. Several factors, e.g. RNA library capture, cell quality, and sequencing output, affect the quality of scRNA-seq data. We studied the effects of read length and sequencing depth on the quality of gene expression profiles, cell type identification, and TCRαβ reconstruction, utilising 1,305 single cells from 8 publicly available scRNA-seq datasets, and simulation-based analyses. Gene expression was characterised by an increased number of unique genes identified with short read lengths (<50 bp), but these featured higher technical variability compared to profiles from longer reads. Successful TCRαβ reconstruction was achieved for 6 datasets (81% - 100%) with at least 0.25 million paired-end (PE) reads of length >50 bp, while it failed for datasets with <30 bp reads. Sufficient read length and sequencing depth can control technical noise to enable accurate identification of TCRαβ and gene expression profiles from scRNA-seq data of T cells.
Macromolecular Expression and Function: A New Paradigm for NASA Risk Assessment
NASA Technical Reports Server (NTRS)
Richmond, Robert
2003-01-01
Predicting risks in humans of either acute effects such as bone loss or muscle wasting, or late effects such as cancer, is challenging. To an approximation, this is because uncertainties of exposure to stress factors or toxic agents and the uniformity of processing subsequent damage at the cellular level within a complex set of biological variables degrade the confidence of predicting pathologic outcome. A cellular biodosimeter that simultaneously reports 1) the type of damage due to that exposure, 2) the quantity of damage incurred by that exposure, and 3) the dataset used to assess risk of developing pathologic outcome caused by that exposure would therefore be useful for predicting ultimate risks faced by an individual, such as an astronaut. It is suggested that such a biodosimeter can be based upon analyses of gene-expression and protein expression whereby large datasets of cellular response to damage are obtained and analyzed for expression-profiles correlated with established end points and molecular markers predictive for risks being assessed. The usefulness of multiparametric cellular biodosimeters could be realized by quantitatively profiling these datasets using techniques of bioinformatics. Such an approach contributes to the foundation of molecular epidemiology as a new scientific discipline, and represents a new paradigm of risk assessment.
Chowdhury, Nilotpal; Sapru, Shantanu
2015-01-01
Introduction Microarray analysis has revolutionized the role of genomic prognostication in breast cancer. However, most studies are single series studies, and suffer from methodological problems. We sought to use a meta-analytic approach in combining multiple publicly available datasets, while correcting for batch effects, to reach a more robust oncogenomic analysis. Aim The aim of the present study was to find gene sets associated with distant metastasis free survival (DMFS) in systemically untreated, node-negative breast cancer patients, from publicly available genomic microarray datasets. Methods Four microarray series (having 742 patients) were selected after a systematic search and combined. Cox regression for each gene was done for the combined dataset (univariate, as well as multivariate – adjusted for expression of Cell cycle related genes) and for the 4 major molecular subtypes. The centre and microarray batch effects were adjusted by including them as random effects variables. The Cox regression coefficients for each analysis were then ranked and subjected to a Gene Set Enrichment Analysis (GSEA). Results Gene sets representing protein translation were independently negatively associated with metastasis in the Luminal A and Luminal B subtypes, but positively associated with metastasis in Basal tumors. Proteinaceous extracellular matrix (ECM) gene set expression was positively associated with metastasis, after adjustment for expression of cell cycle related genes on the combined dataset. Finally, the positive association of the proliferation-related genes with metastases was confirmed. Conclusion To the best of our knowledge, the results depicting mixed prognostic significance of protein translation in breast cancer subtypes are being reported for the first time. We attribute this to our study combining multiple series and performing a more robust meta-analytic Cox regression modeling on the combined dataset, thus discovering 'hidden' associations. 
This methodology seems to yield new and interesting results and may be used as a tool to guide new research. PMID:26080057
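The step in which per-gene Cox regression coefficients are ranked and passed to GSEA can be sketched as follows. This is a minimal illustration of the classical running-sum enrichment statistic, assuming the coefficients are already available as a gene-to-value mapping; the function names, the `weight` parameter and the toy coefficients in the usage note are hypothetical and do not come from the study.

```python
import numpy as np


def rank_genes(coefficients):
    """Sort genes by coefficient, largest (most positive) first."""
    ranked = sorted(coefficients.items(), key=lambda item: item[1], reverse=True)
    genes = [gene for gene, _ in ranked]
    scores = np.array([score for _, score in ranked], dtype=float)
    return genes, scores


def enrichment_score(genes, scores, gene_set, weight=1.0):
    """Running-sum enrichment score of one gene set over a ranked gene list."""
    in_set = np.array([gene in gene_set for gene in genes])
    if not in_set.any() or in_set.all():
        raise ValueError("gene set must cover some but not all ranked genes")

    # Hits step the running sum up in proportion to |score|**weight.
    hit_weights = np.where(in_set, np.abs(scores) ** weight, 0.0)
    if hit_weights.sum() == 0:
        raise ValueError("all genes in the set have a zero score")
    step_up = hit_weights / hit_weights.sum()

    # Misses step it down by an equal share each.
    step_down = (~in_set) / (~in_set).sum()

    running = np.cumsum(step_up - step_down)
    # The score is the running-sum value with the largest absolute deviation from zero.
    return running[np.argmax(np.abs(running))]
```

With made-up coefficients such as `{"g1": 3, "g2": 2, "g3": 1, "g4": -1, "g5": -2, "g6": -3}`, a set containing the two top-ranked genes scores positively and a set containing the two bottom-ranked genes scores negatively. A real analysis would also need permutation-based significance testing, which is omitted here.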
Identifying pathogenic processes by integrating microarray data with prior knowledge
2014-01-01
Background It is of great importance to identify molecular processes and pathways that are involved in disease etiology. Although there has been an extensive use of various high-throughput methods for this task, pathogenic pathways are still not completely understood. Often the set of genes or proteins identified as altered in genome-wide screens show a poor overlap with canonical disease pathways. These findings are difficult to interpret, yet crucial in order to improve the understanding of the molecular processes underlying the disease progression. We present a novel method for identifying groups of connected molecules from a set of differentially expressed genes. These groups represent functional modules sharing common cellular function and involve signaling and regulatory events. Specifically, our method makes use of Bayesian statistics to identify groups of co-regulated genes based on the microarray data, where external information about molecular interactions and connections are used as priors in the group assignments. Markov chain Monte Carlo sampling is used to search for the most reliable grouping. Results Simulation results showed that the method improved the ability of identifying correct groups compared to traditional clustering, especially for small sample sizes. Applied to a microarray heart failure dataset the method found one large cluster with several genes important for the structure of the extracellular matrix and a smaller group with many genes involved in carbohydrate metabolism. The method was also applied to a microarray dataset on melanoma cancer patients with or without metastasis, where the main cluster was dominated by genes related to keratinocyte differentiation. Conclusion Our method found clusters overlapping with known pathogenic processes, but also pointed to new connections extending beyond the classical pathways. PMID:24758699
Secure Multiparty Computation for Cooperative Cyber Risk Assessment
2016-11-01
the scope of data available; the more attacks that are represented in the dataset the easier it will be to determine which vulnerabilities are most...assessments by pooling their data, as a dataset that covers the infrastructure of multiple institutions would allow each of them to account for...attacks that others had experienced [4]. Sharing information to produce a broad dataset would greatly improve the ability of each organization involved to
de Santiago-Martín, Ana; van Oort, Folkert; González, Concepción; Quintana, José R; Lafuente, Antonio L; Lamy, Isabelle
2015-01-01
The contribution of the nature instead of the total content of soil parameters relevant to metal bioavailability in lettuce was tested using a series of low-polluted Mediterranean agricultural calcareous soils offering natural gradients in the content and composition of carbonate, organic, and oxide fractions. Two datasets were compared by canonical ordination based on redundancy analysis: total concentrations (TC dataset) of main soil parameters (constituents, phases, or elements) involved in metal retention and bioavailability; and chemically defined reactive fractions of these parameters (RF dataset). The metal bioavailability patterns were satisfactorily explained only when the RF dataset was used, and the results showed that the proportion of crystalline Fe oxides, dissolved organic C, diethylene-triamine-pentaacetic acid (DTPA)-extractable Cu and Zn, and a labile organic pool accounted for 76% of the variance. In addition, 2 multipollution scenarios by metal spiking were tested that showed better relationships with the RF dataset than with the TC dataset (up to 17% more) and new reactive fractions involved. For Mediterranean calcareous soils, the use of reactive pools of soil parameters rather than their total contents improved the relationships between soil constituents and metal bioavailability. Such pool determinations should be systematically included in studies dealing with bioavailability or risk assessment. © 2014 SETAC.
ISRUC-Sleep: A comprehensive public dataset for sleep researchers.
Khalighi, Sirvan; Sousa, Teresa; Santos, José Moutinho; Nunes, Urbano
2016-02-01
To facilitate performance comparison of new methods for sleep pattern analysis, publicly available datasets with quality content are very important and useful. We introduce an open-access comprehensive sleep dataset, called ISRUC-Sleep. The data were obtained from human adults, including healthy subjects, subjects with sleep disorders, and subjects under the effect of sleep medication. Each recording was randomly selected from among the polysomnography (PSG) recordings acquired by the Sleep Medicine Centre of the Hospital of Coimbra University (CHUC). The dataset comprises three groups of data: (1) data concerning 100 subjects, with one recording session per subject; (2) data gathered from 8 subjects, with two recording sessions per subject; and (3) data collected from one recording session for each of 10 healthy subjects. The PSG recordings associated with each subject were visually scored by two human experts. Compared with the existing sleep-related public datasets, ISRUC-Sleep provides data from a reasonable number of subjects with different characteristics, such as data useful for studies involving changes in the PSG signals over time, and data from healthy subjects useful for studies comparing healthy subjects with patients suffering from sleep disorders. This dataset was created to complement existing datasets by providing easy-to-apply data collection with some characteristics not yet covered. ISRUC-Sleep can be useful for analysis of new contributions: (i) in biomedical signal processing; (ii) in development of automatic sleep stage classification (ASSC) methods; and (iii) in sleep physiology studies. To evaluate and compare new contributions that use this dataset as a benchmark, results of applying a subject-independent ASSC method to the ISRUC-Sleep dataset are presented. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.
R tool for analysis of DNA methylation and expression datasets. Integrative analysis allows reconstruction of in vivo transcription factor networks altered in cancer along with identification of the underlying gene regulatory sequences.
iCOSSY: An Online Tool for Context-Specific Subnetwork Discovery from Gene Expression Data
Saha, Ashis; Jeon, Minji; Tan, Aik Choon; Kang, Jaewoo
2015-01-01
Pathway analyses help reveal underlying molecular mechanisms of complex biological phenotypes. Biologists tend to perform multiple pathway analyses on the same dataset, as there is no single answer. It is often inefficient for them to implement and/or install all the algorithms by themselves. Online tools can help the community in this regard. Here we present an online gene expression analytical tool called iCOSSY which implements a novel pathway-based COntext-specific Subnetwork discoverY (COSSY) algorithm. iCOSSY also includes a few modifications of COSSY to increase its reliability and interpretability. Users can upload their gene expression datasets, and discover important subnetworks of closely interacting molecules to differentiate between two phenotypes (context). They can also interactively visualize the resulting subnetworks. iCOSSY is a web server that finds subnetworks that are differentially expressed in two phenotypes. Users can visualize the subnetworks to understand the biology of the difference. PMID:26147457
Bayesian median regression for temporal gene expression data
NASA Astrophysics Data System (ADS)
Yu, Keming; Vinciotti, Veronica; Liu, Xiaohui; 't Hoen, Peter A. C.
2007-09-01
Most of the existing methods for the identification of biologically interesting genes in a temporal expression profiling dataset do not fully exploit the temporal ordering in the dataset and are based on normality assumptions for the gene expression. In this paper, we introduce a Bayesian median regression model to detect genes whose temporal profile is significantly different across a number of biological conditions. The regression model is defined by a polynomial function where both time and condition effects as well as interactions between the two are included. MCMC-based inference returns the posterior distribution of the polynomial coefficients. From this a simple Bayes factor test is proposed to test for significance. The estimation of the median rather than the mean, and within a Bayesian framework, increases the robustness of the method compared to a Hotelling T2-test previously suggested. This is shown on simulated data and on muscular dystrophy gene expression data.
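The robustness argument in the abstract above (estimating a median rather than a mean) can be illustrated with a minimal Python sketch. This is not the authors' Bayesian regression model; the expression values and the `summarize` helper are hypothetical and only show how a single outlying measurement shifts the mean but not the median.

```python
from statistics import mean, median

def summarize(values):
    """Return (mean, median) of a list of expression values (hypothetical helper)."""
    return mean(values), median(values)

# Made-up expression values for one gene at one time point.
clean = [1.0, 1.1, 0.9, 1.0, 1.0]
with_outlier = [1.0, 1.1, 0.9, 1.0, 9.0]  # last replicate is an outlier

if __name__ == "__main__":
    print("clean:       ", summarize(clean))
    print("with outlier:", summarize(with_outlier))
```

In this toy case the outlier pulls the mean well away from the bulk of the data while the median stays at the typical value, which is the intuition behind preferring median-based estimation for noisy expression profiles.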
Inference of the oxidative stress network in Anopheles stephensi upon Plasmodium infection.
Shrinet, Jatin; Nandal, Umesh Kumar; Adak, Tridibes; Bhatnagar, Raj K; Sunil, Sujatha
2014-01-01
Ookinete invasion of Anopheles midgut is a critical step for malaria transmission; the parasite numbers drop drastically and practically reach a minimum during the parasite's whole life cycle. At this stage, the parasite as well as the vector undergoes immense oxidative stress. Thereafter, the vector undergoes oxidative stress at different time points as the parasite invades its tissues during the parasite development. The present study was undertaken to reconstruct the network of differentially expressed genes involved in oxidative stress in Anopheles stephensi during Plasmodium development and maturation in the midgut. Using high throughput next generation sequencing methods, we generated the transcriptome of the An. stephensi midgut during Plasmodium vinckei petteri oocyst invasion of the midgut epithelium. Further, we utilized large datasets available on public domain on Anopheles during Plasmodium ookinete invasion and Drosophila datasets and arrived upon clusters of genes that may play a role in oxidative stress. Finally, we used support vector machines for the functional prediction of the un-annotated genes of An. stephensi. Integrating the results from all the different data analyses, we identified a total of 516 genes that were involved in oxidative stress in An. stephensi during Plasmodium development. The significantly regulated genes were further extracted from this gene cluster and used to infer an oxidative stress network of An. stephensi. Using system biology approaches, we have been able to ascertain the role of several putative genes in An. stephensi with respect to oxidative stress. Further experimental validations of these genes are underway.
Chondrocyte channel transcriptomics
Lewis, Rebecca; May, Hannah; Mobasheri, Ali; Barrett-Jolley, Richard
2013-01-01
To date, a range of ion channels have been identified in chondrocytes using a number of different techniques, predominantly electrophysiological and/or biomolecular; each of these has its advantages and disadvantages. Here we aim to compare and contrast the data available from biophysical and microarray experiments. This letter analyses recent transcriptomics datasets from chondrocytes, accessible from the European Bioinformatics Institute (EBI). We discuss whether such bioinformatic analysis of microarray datasets can potentially accelerate identification and discovery of ion channels in chondrocytes. The ion channels which appear most frequently across these microarray datasets are discussed, along with their possible functions. We discuss whether functional or protein data exist which support the microarray data. A microarray experiment comparing gene expression in osteoarthritis and healthy cartilage is also discussed and we verify the differential expression of 2 of these genes, namely the genes encoding large calcium-activated potassium (BK) and aquaporin channels. PMID:23995703
A-MADMAN: Annotation-based microarray data meta-analysis tool
Bisognin, Andrea; Coppe, Alessandro; Ferrari, Francesco; Risso, Davide; Romualdi, Chiara; Bicciato, Silvio; Bortoluzzi, Stefania
2009-01-01
Background Publicly available datasets of microarray gene expression signals represent an unprecedented opportunity for extracting genomic relevant information and validating biological hypotheses. However, the exploitation of this exceptionally rich mine of information is still hampered by the lack of appropriate computational tools, able to overcome the critical issues raised by meta-analysis. Results This work presents A-MADMAN, an open source web application which allows the retrieval, annotation, organization and meta-analysis of gene expression datasets obtained from Gene Expression Omnibus. A-MADMAN addresses and resolves several open issues in the meta-analysis of gene expression data. Conclusion A-MADMAN allows i) the batch retrieval from Gene Expression Omnibus and the local organization of raw data files and of any related meta-information, ii) the re-annotation of samples to fix incomplete, or otherwise inadequate, metadata and to create user-defined batches of data, iii) the integrative analysis of data obtained from different Affymetrix platforms through custom chip definition files and meta-normalization. Software and documentation are available on-line at . PMID:19563634
LEAP: biomarker inference through learning and evaluating association patterns.
Jiang, Xia; Neapolitan, Richard E
2015-03-01
Single nucleotide polymorphism (SNP) high-dimensional datasets are available from Genome Wide Association Studies (GWAS). Such data provide researchers opportunities to investigate the complex genetic basis of diseases. Much of the genetic risk might be due to undiscovered epistatic interactions, that is, interactions in which combinations of several genes affect disease. Research aimed at discovering interacting SNPs from GWAS datasets has proceeded in two directions. First, tools were developed to evaluate candidate interactions. Second, algorithms were developed to search over the space of candidate interactions. Another problem when learning interacting SNPs, which has not received much attention, is evaluating how likely it is that the learned SNPs are associated with the disease. A complete system should provide this information as well. We develop such a system. Our system, called LEAP, includes a new heuristic search algorithm for learning interacting SNPs, and a Bayesian network based algorithm for computing the probability of their association. We evaluated the performance of LEAP using 100 1,000-SNP simulated datasets, each of which contains 15 SNPs involved in interactions. When learning interacting SNPs from these datasets, LEAP outperformed seven other methods. Furthermore, only SNPs involved in interactions were found to be probable. We also used LEAP to analyze real Alzheimer's disease and breast cancer GWAS datasets. We obtained interesting and new results from the Alzheimer's dataset, but limited results from the breast cancer dataset. We conclude that our results support that LEAP is a useful tool for extracting candidate interacting SNPs from high-dimensional datasets and determining their probability. © 2015 The Authors. Genetic Epidemiology published by Wiley Periodicals, Inc.
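To give a sense of why a heuristic search over candidate interactions is needed, the following minimal Python sketch counts the distinct SNP sets in a dataset of 1,000 SNPs, the size used in the simulations described above. The helper name and the interaction orders shown are illustrative assumptions and are not part of LEAP.

```python
from math import comb

def candidate_interactions(n_snps: int, order: int) -> int:
    """Number of distinct SNP sets of a given size (hypothetical helper)."""
    return comb(n_snps, order)

if __name__ == "__main__":
    # Exhaustive evaluation grows combinatorially with the interaction order.
    for order in (2, 3, 4):
        print(f"order {order}: {candidate_interactions(1000, order):,} candidate sets")
```

Even at order 3 there are over 166 million candidate sets per dataset, so exhaustive scoring quickly becomes impractical as the order or the number of SNPs grows.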
da Rocha, Ricardo Fagundes; De Bastiani, Marco Antônio; Klamt, Fábio
2014-11-01
Atherosclerosis is a pro-inflammatory process intrinsically related to systemic redox impairments. Macrophages play a major role in disease development. The specific involvement of classically activated, M1 (pro-inflammatory), or alternatively activated, M2 (anti-inflammatory), macrophages in plaque formation and disease progression is still not established. Thus, based on meta-data analysis of public micro-array datasets, we compared differential gene expression levels of the human antioxidant genes (HAG) and M1/M2 genes between early and advanced human atherosclerotic plaques, and among peripheric macrophages (with or without foam cell induction by oxidized low density lipoprotein, oxLDL) from healthy and atherosclerotic subjects. Two independent datasets, GSE28829 and GSE9874, were selected from the Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/) repository. Functional interactions were obtained with STRING (http://string-db.org/) and Medusa (http://coot.embl.de/medusa/). Statistical analysis was performed with ViaComplex® (http://lief.if.ufrgs.br/pub/biosoftwares/viacomplex/) and gene score enrichment analysis (http://www.broadinstitute.org/gsea/index.jsp). Bootstrap analysis demonstrated that the activity (expression) of the HAG and M1 gene sets was significantly increased in advanced compared to early atherosclerotic plaque. Increased expression of the HAG, M1, and M2 gene sets was found in peripheric macrophages from atherosclerotic subjects compared to peripheric macrophages from healthy subjects, while only the M1 gene set was increased in foam cells from atherosclerotic subjects compared to foam cells from healthy subjects. However, the M1 gene set was decreased in foam cells from healthy subjects compared to peripheric macrophages from healthy subjects, while no differences were found in foam cells from atherosclerotic subjects compared to peripheric macrophages from atherosclerotic subjects.
Our data suggest that, unlike in cancer, there is no M1 or M2 polarization of macrophages in atherosclerosis. Instead, the M1 and M2 phenotypes are equally induced, which is an important aspect for better understanding disease progression and may help in developing new therapeutic approaches.
Leung, Ada W. Y.; Hung, Stacy S.; Backstrom, Ian; Ricaurte, Daniel; Kwok, Brian; Poon, Steven; McKinney, Steven; Segovia, Romulo; Rawji, Jenna; Qadir, Mohammed A.; Aparicio, Samuel; Stirling, Peter C.; Steidl, Christian; Bally, Marcel B.
2016-01-01
Platinum-based combination chemotherapy is the standard treatment for advanced non-small cell lung cancer (NSCLC). While cisplatin is effective, its use is not curative and resistance often emerges. As a consequence of microenvironmental heterogeneity, many tumour cells are exposed to sub-lethal doses of cisplatin. Further, genomic heterogeneity and unique tumor cell sub-populations with reduced sensitivities to cisplatin play a role in its effectiveness within a site of tumor growth. Being exposed to sub-lethal doses will induce changes in gene expression that contribute to the tumour cell’s ability to survive and eventually contribute to the selective pressures leading to cisplatin resistance. Such changes in gene expression, therefore, may contribute to cytoprotective mechanisms. Here, we report on studies designed to uncover how tumour cells respond to sub-lethal doses of cisplatin. A microarray study revealed changes in gene expressions that occurred when A549 cells were exposed to a no-observed-effect level (NOEL) of cisplatin (e.g. the IC10). These data were integrated with results from a genome-wide siRNA screen looking for novel therapeutic targets that when inhibited transformed a NOEL of cisplatin into one that induced significant increases in lethality. Pathway analyses were performed to identify pathways that could be targeted to enhance cisplatin activity. We found that over 100 genes were differentially expressed when A549 cells were exposed to a NOEL of cisplatin. Pathways associated with apoptosis and DNA repair were activated. The siRNA screen revealed the importance of the hedgehog, cell cycle regulation, and insulin action pathways in A549 cell survival and response to cisplatin treatment. Results from both datasets suggest that RRM2B, CABYR, ALDH3A1, and FHL2 could be further explored as cisplatin-enhancing gene targets. 
Finally, pathways involved in repairing double-strand DNA breaks and INO80 chromatin remodeling were enriched in both datasets, warranting further research into combinations of cisplatin and therapeutics targeting these pathways. PMID:26938915
Genome-Wide RNAi Ionomics Screen Reveals New Genes and Regulation of Human Trace Element Metabolism
Malinouski, Mikalai; Hasan, Nesrin M.; Zhang, Yan; Seravalli, Javier; Lin, Jie; Avanesov, Andrei; Lutsenko, Svetlana; Gladyshev, Vadim N.
2017-01-01
Trace elements are essential for human metabolism and dysregulation of their homeostasis is associated with numerous disorders. Here we characterize mechanisms that regulate trace elements in human cells by designing and performing a genome-wide high-throughput siRNA/ionomics screen, and examining top hits in cellular and biochemical assays. The screen reveals high stability of the ionomes, especially the zinc ionome, and yields known regulators and novel candidates. We further uncover fundamental differences in the regulation of different trace elements. Specifically, selenium levels are controlled through the selenocysteine machinery and expression of abundant selenoproteins; copper balance is affected by lipid metabolism and requires machinery involved in protein trafficking and posttranslational modifications; and the iron levels are influenced by iron import and expression of the iron/heme-containing enzymes. Our approach can be applied to a variety of disease models and/or nutritional conditions, and the generated dataset opens new directions for studies of human trace element metabolism. PMID:24522796
2014-01-01
Background Triple negative breast cancer (TNBC) and often basal-like cancers are defined as negative for estrogen receptor, progesterone receptor and Her2 gene expression. Over the past few years an incredible amount of data has been generated defining the molecular characteristics of both cancers. The aim of these studies is to better understand the cancers and identify genes and molecular pathways that might be useful as targeted therapies. In an attempt to contribute to the understanding of basal-like/TNBC, we examined the Gene Expression Omnibus (GEO) public datasets in search of genes that might define basal-like/TNBC. The IL32 gene was identified as a candidate. Findings Analysis of several GEO datasets showed differential expression of IL32 in patient samples previously designated as basal and/or TNBC compared to normal and luminal breast samples. As validation of the GEO results, RNA and protein expression levels were examined using MCF7 and MDA MB231 cell lines and tissue microarrays (TMAs). IL32 gene expression levels were higher in MDA MB231 compared to MCF7. Analysis of TMAs showed 42% of TNBC tissues and 25% of the non-TNBC were positive for IL32, while non-malignant patient samples and all but one hyperplastic tissue sample demonstrated lower levels of IL32 protein expression. Conclusion Data obtained from several publicly available GEO datasets showed overexpression of the IL32 gene in basal-like/TNBC samples compared to normal and luminal samples. In support of these data, analysis of TMA clinical samples demonstrated a particular pattern of IL32 differential expression. Considered together, these data suggest IL32 is a candidate suitable for further study. PMID:25100201
Yuan, Yuan; Wang, Zhouyong; Jiang, Chao; Wang, Xumin; Huang, Luqi
2014-01-25
Chlorogenic acids (CGAs) and luteolin are active compounds in Lonicera japonica, a plant of high medicinal value in traditional Chinese medicine. This study provides a comprehensive overview of gene families involved in chlorogenic acid and luteolin biosynthesis in L. japonica, as well as its substitutes Lonicera hypoglauca and Lonicera macranthoides. The gene sequence features and gene expression patterns in various tissues and buds of the species were characterized. Bioinformatics analysis identified 14 chlorogenic acid and luteolin biosynthesis-related genes in the L. japonica transcriptome assembly. Phylogenetic analyses suggested that the functions of individual genes may have differentiated, inducing active compound diversity. Their orthologous genes were also recognized in L. hypoglauca and L. macranthoides genomic datasets, except for LHCHS1 and LMC4H2. The expression patterns of these genes differ among the tissues of L. japonica, L. hypoglauca and L. macranthoides. Results also showed that CGAs were controlled in the first step of biosynthesis, whereas both steps controlled luteolin in the bud of L. japonica. The expression of LJFNS2 exhibited positive correlation with luteolin levels in L. japonica. This study provides significant information for understanding the functional diversity of gene families involved in chlorogenic acid and luteolin biosynthesis, the active compound diversity of L. japonica and its substitutes, and the different usages of the three species. Copyright © 2012. Published by Elsevier B.V.
Pichler, Martin; Stiegelbauer, Verena; Vychytilova-Faltejskova, Petra; Ivan, Cristina; Ling, Hui; Winter, Elke; Zhang, Xinna; Goblirsch, Matthew; Wulf-Goldenberg, Annika; Ohtsuka, Masahisa; Haybaeck, Johannes; Svoboda, Marek; Okugawa, Yoshinaga; Gerger, Armin; Hoefler, Gerald; Goel, Ajay; Slaby, Ondrej; Calin, George Adrian
2017-01-01
Purpose Characterization of colorectal cancer transcriptome by high-throughput techniques has enabled the discovery of several differentially expressed genes involving previously unreported miRNA abnormalities. Here, we followed a systematic approach on a global scale to identify miRNAs as clinical outcome predictors and further validated them in the clinical and experimental setting. Experimental Design Genome-wide miRNA sequencing data of 228 colorectal cancer patients from The Cancer Genome Atlas dataset were analyzed as a screening cohort to identify miRNAs significantly associated with survival according to stringent prespecified criteria. A panel of six miRNAs was further validated for their prognostic utility in a large independent validation cohort (n = 332). In situ hybridization and functional experiments in a panel of colorectal cancer cell lines and xenografts further clarified the role of clinical relevant miRNAs. Results Six miRNAs (miR-92b-3p, miR-188-3p, miR-221-5p, miR-331-3p, miR-425-3p, and miR-497-5p) were identified as strong predictors of survival in the screening cohort. High miR-188-3p expression proves to be an independent prognostic factor [screening cohort: HR = 4.137; 95% confidence interval (CI), 1.568–10.917; P = 0.004; validation cohort: HR = 1.538; 95% CI, 1.107–2.137; P = 0.010, respectively]. Forced miR-188-3p expression increased migratory behavior of colorectal cancer cells in vitro and metastases formation in vivo (P < 0.05). The promigratory role of miR-188-3p is mediated by direct interaction with MLLT4, a novel identified player involved in colorectal cancer cell migration. Conclusions miR-188-3p is a novel independent prognostic factor in colorectal cancer patients, which can be partly explained by its effect on MLLT4 expression and migration of cancer cells. PMID:27601590
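The hazard ratios, confidence intervals and P values reported above can be cross-checked with a short calculation. The sketch below assumes, as is common but not stated in the abstract, that each 95% CI is a Wald interval symmetric on the log hazard scale; the function name is hypothetical.

```python
from math import erfc, log, sqrt

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval

def wald_p_from_hr(hr, ci_low, ci_high):
    """Recover the log-HR standard error, z statistic and two-sided P value.

    Assumes the interval is exp(log(hr) +/- Z_95 * se); this is an assumption
    about how the published interval was constructed.
    """
    se = (log(ci_high) - log(ci_low)) / (2 * Z_95)
    z = log(hr) / se
    p = erfc(abs(z) / sqrt(2))  # two-sided normal tail probability
    return se, z, p

if __name__ == "__main__":
    for label, args in [("screening", (4.137, 1.568, 10.917)),
                        ("validation", (1.538, 1.107, 2.137))]:
        se, z, p = wald_p_from_hr(*args)
        print(f"{label}: se={se:.3f} z={z:.2f} p={p:.3f}")
```

Under that assumption the recovered P values agree with the reported P = 0.004 and P = 0.010 to three decimal places, which suggests the published figures are internally consistent.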
Prediction of novel target genes and pathways involved in bevacizumab-resistant colorectal cancer
Makondi, Precious Takondwa; Lee, Chia-Hwa; Huang, Chien-Yu; Chu, Chi-Ming; Chang, Yu-Jia
2018-01-01
Bevacizumab combined with cytotoxic chemotherapy is the backbone of metastatic colorectal cancer (mCRC) therapy; however, its treatment efficacy is hampered by therapeutic resistance. Therefore, understanding the mechanisms underlying bevacizumab resistance is crucial to increasing the therapeutic efficacy of bevacizumab. The Gene Expression Omnibus (GEO) database (dataset GSE86525) was used to identify the key genes and pathways involved in bevacizumab-resistant mCRC. The GEO2R web tool was used to identify differentially expressed genes (DEGs). Functional and pathway enrichment analyses of the DEGs were performed using the Database for Annotation, Visualization, and Integrated Discovery (DAVID). Protein–protein interaction (PPI) networks were established using the Search Tool for the Retrieval of Interacting Genes/Proteins database (STRING) and visualized using Cytoscape software. A total of 124 DEGs were obtained, 57 of which were upregulated and 67 downregulated. PPI network analysis showed that seven upregulated genes and nine downregulated genes exhibited high PPI degrees. In the functional enrichment, the DEGs were mainly enriched in negative regulation of phosphate metabolic process and positive regulation of cell cycle process gene ontologies (GOs); the enriched pathways were the phosphoinositide 3-kinase-serine/threonine kinase signaling pathway, bladder cancer, and microRNAs in cancer. Cyclin-dependent kinase inhibitor 1A (CDKN1A), toll-like receptor 4 (TLR4), CD19 molecule (CD19), breast cancer 1, early onset (BRCA1), platelet-derived growth factor subunit A (PDGFA), and matrix metallopeptidase 1 (MMP1) were the DEGs involved in the pathways and the PPIs. The clinical validation of the DEGs in mCRC (TNM clinical stages 3 and 4) revealed that high PDGFA expression levels were associated with poor overall survival, whereas high BRCA1 and MMP1 expression levels were associated with favorable progression-free survival (PFS).
The identified genes and pathways can be potential targets and predictors of therapeutic resistance and prognosis in bevacizumab-treated patients with mCRC. PMID:29342159
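The split of DEGs into upregulated and downregulated sets described in the abstract above can be sketched as follows. This is a minimal illustration, not the GEO2R procedure: the cutoffs (absolute log2 fold change of at least 1, adjusted P below 0.05), the function name, and the toy gene table are all assumptions, since the abstract does not state the thresholds used.

```python
def classify_degs(results, lfc_cutoff=1.0, padj_cutoff=0.05):
    """Split genes into up- and downregulated lists.

    results maps gene name -> (log2 fold change, adjusted P value).
    Both cutoffs are illustrative defaults, not values from the study.
    """
    up, down = [], []
    for gene, (log2fc, padj) in results.items():
        if padj >= padj_cutoff or abs(log2fc) < lfc_cutoff:
            continue  # not significant or effect too small
        (up if log2fc > 0 else down).append(gene)
    return up, down

# Hypothetical toy table; gene names and numbers are made up.
toy = {
    "geneA": (2.1, 0.001),    # up
    "geneB": (-1.4, 0.02),    # down
    "geneC": (0.3, 0.0005),   # significant but small effect
    "geneD": (-2.0, 0.30),    # large effect but not significant
}

if __name__ == "__main__":
    up, down = classify_degs(toy)
    print("up:", up, "down:", down)
```

Applied to a full differential expression table, the lengths of the two lists correspond to counts such as the 57 upregulated and 67 downregulated genes reported above, although the exact numbers depend on the chosen cutoffs.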
Reverse phase protein arrays in signaling pathways: a data integration perspective
Creighton, Chad J; Huang, Shixia
2015-01-01
The reverse phase protein array (RPPA) data platform provides expression data for a prespecified set of proteins, across a set of tissue or cell line samples. Being able to measure either total proteins or posttranslationally modified proteins, even ones present at lower abundances, RPPA represents an excellent way to capture the state of key signal transduction pathways in normal or diseased cells. RPPA data can be combined with those of other molecular profiling platforms, in order to obtain a more complete molecular picture of the cell. This review offers perspective on the use of RPPA as a component of integrative molecular analysis, using recent case examples from The Cancer Genome Atlas consortium, showing how RPPA may provide additional insight into cancer beyond what other data platforms may provide. There also exists a clear need for effective visualization approaches to RPPA-based proteomic results; this was highlighted by the recent challenge, put forth by the HPN-DREAM consortium, to develop visualization methods for a highly complex RPPA dataset involving many cancer cell lines, stimuli, and inhibitors applied over a time course. In this review, we put forth a number of general guidelines for effective visualization of complex molecular datasets, namely, showing the data, ordering data elements deliberately, enabling generalization, focusing on relevant specifics, and putting things into context. We give examples of how these principles can be utilized in visualizing the intrinsic subtypes of breast cancer and in meaningfully displaying the entire HPN-DREAM RPPA dataset within a single page. PMID:26185419
Gene expression changes governing extreme dehydration tolerance in an Antarctic insect
Teets, Nicholas M.; Peyton, Justin T.; Colinet, Herve; Renault, David; Kelley, Joanna L.; Kawarasaki, Yuta; Lee, Richard E.; Denlinger, David L.
2012-01-01
Among terrestrial organisms, arthropods are especially susceptible to dehydration, given their small body size and high surface area to volume ratio. This challenge is particularly acute for polar arthropods that face near-constant desiccating conditions, as water is frozen and thus unavailable for much of the year. The molecular mechanisms that govern extreme dehydration tolerance in insects remain largely undefined. In this study, we used RNA sequencing to quantify transcriptional mechanisms of extreme dehydration tolerance in the Antarctic midge, Belgica antarctica, the world’s southernmost insect and only insect endemic to Antarctica. Larvae of B. antarctica are remarkably tolerant of dehydration, surviving losses up to 70% of their body water. Gene expression changes in response to dehydration indicated up-regulation of cellular recycling pathways including the ubiquitin-mediated proteasome and autophagy, with concurrent down-regulation of genes involved in general metabolism and ATP production. Metabolomics results revealed shifts in metabolite pools that correlated closely with changes in gene expression, indicating that coordinated changes in gene expression and metabolism are a critical component of the dehydration response. Finally, using comparative genomics, we compared our gene expression results with a transcriptomic dataset for the Arctic collembolan, Megaphorura arctica. Although B. antarctica and M. arctica are adapted to similar environments, our analysis indicated very little overlap in expression profiles between these two arthropods. Whereas several orthologous genes showed similar expression patterns, transcriptional changes were largely species specific, indicating these polar arthropods have developed distinct transcriptional mechanisms to cope with similar desiccating conditions. PMID:23197828
Gene expression changes governing extreme dehydration tolerance in an Antarctic insect.
Teets, Nicholas M; Peyton, Justin T; Colinet, Herve; Renault, David; Kelley, Joanna L; Kawarasaki, Yuta; Lee, Richard E; Denlinger, David L
2012-12-11
Among terrestrial organisms, arthropods are especially susceptible to dehydration, given their small body size and high surface area to volume ratio. This challenge is particularly acute for polar arthropods that face near-constant desiccating conditions, as water is frozen and thus unavailable for much of the year. The molecular mechanisms that govern extreme dehydration tolerance in insects remain largely undefined. In this study, we used RNA sequencing to quantify transcriptional mechanisms of extreme dehydration tolerance in the Antarctic midge, Belgica antarctica, the world's southernmost insect and only insect endemic to Antarctica. Larvae of B. antarctica are remarkably tolerant of dehydration, surviving losses up to 70% of their body water. Gene expression changes in response to dehydration indicated up-regulation of cellular recycling pathways including the ubiquitin-mediated proteasome and autophagy, with concurrent down-regulation of genes involved in general metabolism and ATP production. Metabolomics results revealed shifts in metabolite pools that correlated closely with changes in gene expression, indicating that coordinated changes in gene expression and metabolism are a critical component of the dehydration response. Finally, using comparative genomics, we compared our gene expression results with a transcriptomic dataset for the Arctic collembolan, Megaphorura arctica. Although B. antarctica and M. arctica are adapted to similar environments, our analysis indicated very little overlap in expression profiles between these two arthropods. Whereas several orthologous genes showed similar expression patterns, transcriptional changes were largely species specific, indicating these polar arthropods have developed distinct transcriptional mechanisms to cope with similar desiccating conditions.
Gene expression profiling in multipotent DFAT cells derived from mature adipocytes
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ono, Hiromasa; Database Center for Life Science; Oki, Yoshinao
2011-04-15
Highlights: Adipocyte dedifferentiation is evident in a significant decrease in typical genes. Cell proliferation is strongly related to adipocyte dedifferentiation. Dedifferentiated adipocytes express several lineage-specific genes. Comparative analyses using publicly available datasets boost the interpretation. Abstract: Cellular dedifferentiation signifies the withdrawal of cells from a specific differentiated state to a stem cell-like undifferentiated state. However, the mechanism of dedifferentiation remains obscure. Here we performed comparative transcriptome analyses during dedifferentiation in mature adipocytes (MAs) to identify the transcriptional signatures of multipotent dedifferentiated fat (DFAT) cells derived from MAs. Using microarray systems, we explored similarly expressed as well as significantly differentially expressed genes in MAs during dedifferentiation. This analysis revealed significant changes in gene expression during this process, including a significant reduction in expression of genes for lipid metabolism concomitantly with a significant increase in expression of genes for cell movement, cell migration, tissue developmental processes, cell growth, cell proliferation, cell morphogenesis, altered cell shape, and cell differentiation. Our observations indicate that the transcriptional signatures of DFAT cells derived from MAs are summarized in terms of a significant decrease in functional phenotype-related genes and a parallel increase in cell proliferation, altered cell morphology, and regulation of the differentiation of related genes. A better understanding of the mechanisms involved in dedifferentiation may enable scientists to control and possibly alter the plasticity of the differentiated state, which may lead to benefits not only in stem cell research but also in regenerative medicine.
Transcriptome analysis of the sulfate deficiency response in the marine microalga Emiliania huxleyi.
Bochenek, Michal; Etherington, Graham J; Koprivova, Anna; Mugford, Sam T; Bell, Thomas G; Malin, Gill; Kopriva, Stanislav
2013-08-01
The response to sulfate deficiency of plants and freshwater green algae has been extensively analysed by system biology approaches. By contrast, seawater sulfate concentration is high and very little is known about the sulfur metabolism of marine organisms. Here, we used a combination of metabolite analysis and transcriptomics to analyse the response of the marine microalga Emiliania huxleyi as it acclimated to sulfate limitation. Lowering sulfate availability in artificial seawater from 25 to 5 mM resulted in significant reduction in growth and intracellular concentrations of dimethylsulfoniopropionate and glutathione. Sulfate-limited E. huxleyi cells showed increased sulfate uptake but sulfate reduction to sulfite did not seem to be regulated. Sulfate limitation in E. huxleyi affected expression of 1718 genes. The vast majority of these genes were upregulated, including genes involved in carbohydrate and lipid metabolism, and genes involved in the general stress response. The acclimation response of E. huxleyi to sulfate deficiency shows several similarities to the well-described responses of Arabidopsis and Chlamydomonas, but also has many unique features. This dataset shows that even though E. huxleyi is adapted to constitutively high sulfate concentration, it retains the ability to re-program its gene expression in response to reduced sulfate availability. © 2013 The Authors. New Phytologist © 2013 New Phytologist Trust.
Yu, Yao; Tu, Kang; Zheng, Siyuan; Li, Yun; Ding, Guohui; Ping, Jie; Hao, Pei; Li, Yixue
2009-08-25
In the post-genomic era, the development of high-throughput gene expression detection technology provides huge amounts of experimental data, which challenges the traditional pipelines for data processing and analyzing in scientific researches. In our work, we integrated gene expression information from Gene Expression Omnibus (GEO), biomedical ontology from Medical Subject Headings (MeSH) and signaling pathway knowledge from sigPathway entries to develop a context mining tool for gene expression analysis - GEOGLE. GEOGLE offers a rapid and convenient way for searching relevant experimental datasets, pathways and biological terms according to multiple types of queries: including biomedical vocabularies, GDS IDs, gene IDs, pathway names and signature list. Moreover, GEOGLE summarizes the signature genes from a subset of GDSes and estimates the correlation between gene expression and the phenotypic distinction with an integrated p value. This approach performing global searching of expression data may expand the traditional way of collecting heterogeneous gene expression experiment data. GEOGLE is a novel tool that provides researchers a quantitative way to understand the correlation between gene expression and phenotypic distinction through meta-analysis of gene expression datasets from different experiments, as well as the biological meaning behind. The web site and user guide of GEOGLE are available at: http://omics.biosino.org:14000/kweb/workflow.jsp?id=00020.
Large-Scale Pattern Discovery in Music
NASA Astrophysics Data System (ADS)
Bertin-Mahieux, Thierry
This work focuses on extracting patterns in musical data from very large collections. The problem is split in two parts. First, we build such a large collection, the Million Song Dataset, to provide researchers access to commercial-size datasets. Second, we use this collection to study cover song recognition which involves finding harmonic patterns from audio features. Regarding the Million Song Dataset, we detail how we built the original collection from an online API, and how we encouraged other organizations to participate in the project. The result is the largest research dataset with heterogeneous sources of data available to music technology researchers. We demonstrate some of its potential and discuss the impact it already has on the field. On cover song recognition, we must revisit the existing literature since there are no publicly available results on a dataset of more than a few thousand entries. We present two solutions to tackle the problem, one using a hashing method, and one using a higher-level feature computed from the chromagram (dubbed the 2DFTM). We further investigate the 2DFTM since it has potential to be a relevant representation for any task involving audio harmonic content. Finally, we discuss the future of the dataset and the hope of seeing more work making use of the different sources of data that are linked in the Million Song Dataset. Regarding cover songs, we explain how this might be a first step towards defining a harmonic manifold of music, a space where harmonic similarities between songs would be more apparent.
The pineal gland: A model for adrenergic modulation of ubiquitin ligases.
Vriend, Jerry; Liu, Wenjun; Reiter, Russel J
2017-01-01
A recent study of the pineal gland of the rat found that the expression of more than 3000 genes showed significant day/night variations (the Hartley dataset). The investigators of this report made available a supplemental table in which they tabulated the expression of many genes that they did not discuss, including those coding for components of the ubiquitin proteasome system. Herein we identify the genes of the ubiquitin proteasome system whose expression was significantly influenced by environmental lighting in the Hartley dataset, those that were stimulated by DBcAMP in pineal glands in culture, and those that were stimulated by norepinephrine. Using the Ubiquitin and Ubiquitin-like Conjugation Database (UUCA) we identified ubiquitin ligases and conjugases, and deubiquitinases in the Hartley dataset for the purpose of determining whether expression of genes of the ubiquitin proteasome pathway was significantly influenced by day/night variations and if these variations were regulated by autonomic innervation of the pineal gland from the superior cervical ganglia. In the Hartley experiments, pineal glands from groups of rats sacrificed during the day and groups sacrificed during the night were examined for gene expression. Additional groups of rats had their superior cervical ganglia removed surgically or surgically decentralized and the pineal glands likewise examined for gene expression. The genes with at least a 2-fold significant day/night difference in expression included genes for 5 ubiquitin conjugating enzymes, genes for 58 ubiquitin E3 ligases and genes for 6 deubiquitinases. A 35-fold day/night difference was noted in the expression of the gene Sik1, which codes for a protein containing both a ubiquitin binding domain (UBD) and a ubiquitin-associated (UBA) domain. 
Most of the significant differences in these genes were prevented by surgical removal, or disconnection, of the superior cervical ganglia, and most were responsive, in vitro, to treatment with a cyclic AMP analog, and norepinephrine. All previously described 24-hour rhythms in the pineal require an intact sympathetic input from the superior cervical ganglia. The Hartley dataset thus provides evidence that the pineal gland is a highly useful model for studying adrenergically dependent mechanisms regulating variations in ubiquitin ligases, ubiquitin conjugases, and deubiquitinases, mechanisms that may be physiologically relevant not only in the pineal gland, but in all adrenergically innervated tissue.
The pineal gland: A model for adrenergic modulation of ubiquitin ligases
Liu, Wenjun; Reiter, Russel J.
2017-01-01
Introduction A recent study of the pineal gland of the rat found that the expression of more than 3000 genes showed significant day/night variations (the Hartley dataset). The investigators of this report made available a supplemental table in which they tabulated the expression of many genes that they did not discuss, including those coding for components of the ubiquitin proteasome system. Herein we identify the genes of the ubiquitin proteasome system whose expression was significantly influenced by environmental lighting in the Hartley dataset, those that were stimulated by DBcAMP in pineal glands in culture, and those that were stimulated by norepinephrine. Purpose Using the Ubiquitin and Ubiquitin-like Conjugation Database (UUCA) we identified ubiquitin ligases and conjugases, and deubiquitinases in the Hartley dataset for the purpose of determining whether expression of genes of the ubiquitin proteasome pathway was significantly influenced by day/night variations and if these variations were regulated by autonomic innervation of the pineal gland from the superior cervical ganglia. Methods In the Hartley experiments, pineal glands from groups of rats sacrificed during the day and groups sacrificed during the night were examined for gene expression. Additional groups of rats had their superior cervical ganglia removed surgically or surgically decentralized and the pineal glands likewise examined for gene expression. Results The genes with at least a 2-fold significant day/night difference in expression included genes for 5 ubiquitin conjugating enzymes, genes for 58 ubiquitin E3 ligases and genes for 6 deubiquitinases. A 35-fold day/night difference was noted in the expression of the gene Sik1, which codes for a protein containing both a ubiquitin binding domain (UBD) and a ubiquitin-associated (UBA) domain. 
Most of the significant differences in these genes were prevented by surgical removal, or disconnection, of the superior cervical ganglia, and most were responsive, in vitro, to treatment with a cyclic AMP analog, and norepinephrine. All previously described 24-hour rhythms in the pineal require an intact sympathetic input from the superior cervical ganglia. Conclusions The Hartley dataset thus provides evidence that the pineal gland is a highly useful model for studying adrenergically dependent mechanisms regulating variations in ubiquitin ligases, ubiquitin conjugases, and deubiquitinases, mechanisms that may be physiologically relevant not only in the pineal gland, but in all adrenergically innervated tissue. PMID:28212404
dynGENIE3: dynamical GENIE3 for the inference of gene networks from time series expression data.
Huynh-Thu, Vân Anh; Geurts, Pierre
2018-02-21
The elucidation of gene regulatory networks is one of the major challenges of systems biology. Measurements about genes that are exploited by network inference methods are typically available either in the form of steady-state expression vectors or time series expression data. In our previous work, we proposed the GENIE3 method that exploits variable importance scores derived from Random forests to identify the regulators of each target gene. This method provided state-of-the-art performance on several benchmark datasets, but it could however not specifically be applied to time series expression data. We propose here an adaptation of the GENIE3 method, called dynamical GENIE3 (dynGENIE3), for handling both time series and steady-state expression data. The proposed method is evaluated extensively on the artificial DREAM4 benchmarks and on three real time series expression datasets. Although dynGENIE3 does not systematically yield the best performance on each and every network, it is competitive with diverse methods from the literature, while preserving the main advantages of GENIE3 in terms of scalability.
Zhao, Zheng; Bai, Jing; Wu, Aiwei; Wang, Yuan; Zhang, Jinwen; Wang, Zishan; Li, Yongsheng; Xu, Juan; Li, Xia
2015-01-01
Long non-coding RNAs (lncRNAs) are emerging as key regulators of diverse biological processes and diseases. However, the combinatorial effects of these molecules in a specific biological function are poorly understood. Identifying co-expressed protein-coding genes of lncRNAs would provide ample insight into lncRNA functions. To facilitate such an effort, we have developed Co-LncRNA, which is a web-based computational tool that allows users to identify GO annotations and KEGG pathways that may be affected by co-expressed protein-coding genes of a single or multiple lncRNAs. LncRNA co-expressed protein-coding genes were first identified in publicly available human RNA-Seq datasets, including 241 datasets across 6560 total individuals representing 28 tissue types/cell lines. Then, the lncRNA combinatorial effects in a given GO annotations or KEGG pathways are taken into account by the simultaneous analysis of multiple lncRNAs in user-selected individual or multiple datasets, which is realized by enrichment analysis. In addition, this software provides a graphical overview of pathways that are modulated by lncRNAs, as well as a specific tool to display the relevant networks between lncRNAs and their co-expressed protein-coding genes. Co-LncRNA also supports users in uploading their own lncRNA and protein-coding gene expression profiles to investigate the lncRNA combinatorial effects. It will be continuously updated with more human RNA-Seq datasets on an annual basis. Taken together, Co-LncRNA provides a web-based application for investigating lncRNA combinatorial effects, which could shed light on their biological roles and could be a valuable resource for this community. Database URL: http://www.bio-bigdata.com/Co-LncRNA/. © The Author(s) 2015. Published by Oxford University Press.
Comparison of alternative approaches for analysing multi-level RNA-seq data
Mohorianu, Irina; Bretman, Amanda; Smith, Damian T.; Fowler, Emily K.; Dalmay, Tamas
2017-01-01
RNA sequencing (RNA-seq) is widely used for RNA quantification in the environmental, biological and medical sciences. It enables the description of genome-wide patterns of expression and the identification of regulatory interactions and networks. The aim of RNA-seq data analyses is to achieve rigorous quantification of genes/transcripts to allow a reliable prediction of differential expression (DE), despite variation in levels of noise and inherent biases in sequencing data. This can be especially challenging for datasets in which gene expression differences are subtle, as in the behavioural transcriptomics test dataset from D. melanogaster that we used here. We investigated the power of existing approaches for quality checking mRNA-seq data and explored additional, quantitative quality checks. To accommodate nested, multi-level experimental designs, we incorporated sample layout into our analyses. We employed a subsampling without replacement-based normalization and an identification of DE that accounted for the hierarchy and amplitude of effect sizes within samples, then evaluated the resulting differential expression call in comparison to existing approaches. In a final step to test for broader applicability, we applied our approaches to a published set of H. sapiens mRNA-seq samples. The dataset-tailored methods improved sample comparability and delivered a robust prediction of subtle gene expression changes. The proposed approaches have the potential to improve key steps in the analysis of RNA-seq data by incorporating the structure and characteristics of biological experiments. PMID:28792517
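The subsampling-without-replacement normalization mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice of the smallest library as the common depth, and the toy gene names in the usage note are all assumptions made for illustration.

```python
import random
from collections import Counter


def subsample_counts(counts, target_depth, seed=0):
    """Downsample a {gene: read_count} mapping to target_depth reads.

    Reads are drawn without replacement, so no gene can end up with
    more reads than it had in the original library.
    """
    total = sum(counts.values())
    if target_depth > total:
        raise ValueError("target depth exceeds library size")
    # Expand to one list entry per read, then draw without replacement.
    reads = [gene for gene, n in counts.items() for _ in range(n)]
    rng = random.Random(seed)
    return Counter(rng.sample(reads, target_depth))


def normalize_to_common_depth(samples, seed=0):
    """Bring every sample down to the depth of the smallest library (an assumed choice)."""
    depth = min(sum(c.values()) for c in samples.values())
    return {name: subsample_counts(c, depth, seed) for name, c in samples.items()}
```

For example, `normalize_to_common_depth({"s1": {"geneA": 50, "geneB": 150}, "s2": {"geneA": 30, "geneB": 70}})` would return two samples of 100 reads each. Expanding counts into a per-read list is only practical for toy data; a real library would need a sampler that works on the counts directly.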
Peng, Lu; Wang, Lei; Yang, Yi-Fan; Zou, Ming-Min; He, Wei-Yi; Wang, Yue; Wang, Qing; Vasseur, Liette; You, Min-Sheng
2017-12-30
As a specialized organ, the insect ovary performs valuable functions by ensuring fecundity and population survival. Oogenesis is the complex physiological process resulting in the production of mature eggs, which are involved in epigenetic programming, germ cell behavior, cell cycle regulation, etc. Identification of the genes involved in ovary development and oogenesis is critical to better understand the reproductive biology and screening for the potential molecular targets in Plutella xylostella, a worldwide destructive pest of economically major crops. Based on transcriptome sequencing, a total of 7.88Gb clean nucleotides was obtained, with 19,934 genes and 1861 new transcripts being identified. Expression profiling indicated that 61.7% of the genes were expressed (FPKM≥1) in the P. xylostella ovary. GO annotation showed that the pathways of multicellular organism reproduction and multicellular organism reproduction process, as well as gamete generation and chorion were significantly enriched. Processes that were most likely relevant to reproduction included the spliceosome, ubiquitin mediated proteolysis, endocytosis, PI3K-Akt signaling pathway, insulin signaling pathway, cAMP signaling pathway, and focal adhesion were identified in the top 20 'highly represented' KEGG pathways. Functional genes involved in oogenesis were further analyzed and validated by qRT-PCR to show their potential predominant roles in P. xylostella reproduction. Our newly developed P. xylostella ovary transcriptome provides an overview of the gene expression profiling in this specialized tissue and the functional gene network closely related to the ovary development and oogenesis. This is the first genome-wide transcriptome dataset of P. xylostella ovary that includes a subset of functionally activated genes. This global approach will be the basis for further studies on molecular mechanisms of P. 
xylostella reproduction aimed at screening potential molecular targets for integrated pest management. Copyright © 2017 Elsevier B.V. All rights reserved.
2012-01-01
Background Chinese fir (Cunninghamia lanceolata) is an important timber species that accounts for 20–30% of the total commercial timber production in China. However, the available genomic information of Chinese fir is limited, and this severely encumbers functional genomic analysis and molecular breeding in Chinese fir. Recently, major advances in transcriptome sequencing have provided fast and cost-effective approaches to generate large expression datasets that have proven to be powerful tools to profile the transcriptomes of non-model organisms with undetermined genomes. Results In this study, the transcriptomes of nine tissues from Chinese fir were analyzed using the Illumina HiSeq™ 2000 sequencing platform. Approximately 40 million paired-end reads were obtained, generating 3.62 gigabase pairs of sequencing data. These reads were assembled into 83,248 unique sequences (i.e. Unigenes) with an average length of 449 bp, amounting to 37.40 Mb. A total of 73,779 Unigenes were supported by more than 5 reads, 42,663 (57.83%) had homologs in the NCBI non-redundant and Swiss-Prot protein databases, corresponding to 27,224 unique protein entries. Of these Unigenes, 16,750 were assigned to Gene Ontology classes, and 14,877 were clustered into orthologous groups. A total of 21,689 (29.40%) were mapped to 119 pathways by BLAST comparison against the Kyoto Encyclopedia of Genes and Genomes (KEGG) database. The majority of the genes encoding the enzymes in the biosynthetic pathways of cellulose and lignin were identified in the Unigene dataset by targeted searches of their annotations. A number of candidate Chinese fir genes in the two metabolic pathways were discovered for the first time. Eighteen genes related to cellulose and lignin biosynthesis were cloned for experimental validation of the transcriptome data. Overall 49 Unigenes, covering different regions of these selected genes, were found by alignment. 
Their expression patterns in different tissues were analyzed by qRT-PCR to explore their putative functions. Conclusions A substantial fraction of transcript sequences was obtained from the deep sequencing of Chinese fir. The assembled Unigene dataset was used to discover candidate genes of cellulose and lignin biosynthesis. This transcriptome dataset will provide a comprehensive sequence resource for molecular genetics research of C. lanceolata. PMID:23171398
From DNA Copy Number to Gene Expression: Local aberrations, Trisomies and Monosomies
NASA Astrophysics Data System (ADS)
Shay, Tal
The goal of my PhD research was to study the effect of DNA copy number changes on gene expression. DNA copy number aberrations may be local, encompassing several genes, or on the level of an entire chromosome, such as trisomy and monosomy. The main dataset I studied was of Glioblastoma, obtained in the framework of a collaboration, but I worked also with public datasets of cancer and Down's Syndrome. The molecular basis of expression changes in Glioblastoma. Glioblastoma is the most common and aggressive type of primary brain tumors in adults. In collaboration with Prof. Hegi (CHUV, Switzerland), we analyzed a rich Glioblastoma dataset including clinical information, DNA copy number (array CGH) and expression profiles. We explored the correlation between DNA copy number and gene expression at the level of chromosomal arms and local genomic aberrations. We detected known amplification and over expression of oncogenes, as well as deletion and down-regulation of tumor suppressor genes. We exploited that information to map alterations of pathways that are known to be disrupted in Glioblastoma, and tried to characterize samples that have no known alteration in any of the studied pathways. Identifying local DNA aberrations of biological significance. Many types of tumors exhibit chromosomal losses or gains and local amplifications and deletions. A region that is aberrant in many tumors, or whose copy number change is stronger, is more likely to be clinically relevant, and not just a by-product of genetic instability. We developed a novel method that defines and prioritizes aberrations by formalizing these intuitions. The method scores each aberration by the fraction of patients harboring it, its length and its amplitude, and assesses the significance of the score by comparing it to a null distribution obtained by permutations. This approach detects genetic locations that are significantly aberrant, generating a 'genomic aberration profile' for each sample. 
The 'genomic aberration profile' is then combined with chromosomal arm status (gain/loss) to define a succinct genomic signature for each tumor. Unsupervised clustering of the samples based on these genomic signatures can reveal novel tumor subtypes. This approach was applied to datasets from three types of brain tumors: Glioblastoma, Medulloblastoma and Neuroblastoma, and identified a new subtype in Medulloblastoma, characterized by many chromosomal aberrations. Elucidating the transcriptional effect of monosomy and trisomy. Trisomy and monosomy are expected to impact the expression of genes that are located on the affected chromosome. Analysis of several cancer datasets revealed that not all the genes on the aberrant chromosome are affected by the change of copy number. Affected genes exhibit a wide range of expression changes with varying penetrance. Specifically, (1) The effect of trisomy is much more conserved among individuals than the effect of monosomy and (2) the expression level of a gene in the diploid is significantly correlated with the level of change between the diploid and the trisomy or monosomy.
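The idea of scoring recurrent aberrations and judging them against a permutation null can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of the thesis: the score used here (fraction of patients above a threshold multiplied by their mean absolute amplitude) is an assumed formula, aberration length is left out for brevity, and the function names and the threshold parameter are hypothetical.

```python
import random


def locus_scores(profiles, threshold):
    """Score each locus across patients.

    profiles: list of per-patient lists of copy-number log-ratios,
    all of the same length. The score is an assumed formula:
    (fraction of patients with |value| >= threshold) * (their mean |value|).
    """
    n_patients = len(profiles)
    n_loci = len(profiles[0])
    scores = []
    for j in range(n_loci):
        hits = [abs(p[j]) for p in profiles if abs(p[j]) >= threshold]
        fraction = len(hits) / n_patients
        amplitude = sum(hits) / len(hits) if hits else 0.0
        scores.append(fraction * amplitude)
    return scores


def permutation_pvalues(profiles, threshold, n_perm=200, seed=0):
    """Compare observed scores with a null built by shuffling each patient's loci."""
    rng = random.Random(seed)
    observed = locus_scores(profiles, threshold)
    null_max = []
    for _ in range(n_perm):
        # Shuffling positions within each patient breaks recurrence across patients.
        shuffled = [rng.sample(p, len(p)) for p in profiles]
        null_max.append(max(locus_scores(shuffled, threshold)))
    # Add-one correction keeps p-values strictly positive.
    return [(1 + sum(m >= s for m in null_max)) / (n_perm + 1) for s in observed]
```

Taking the maximum score over loci in each permutation is one common way to account for testing many loci at once; whether the original work did this is not stated in the abstract.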
Enabling Open Research Data Discovery through a Recommender System
NASA Astrophysics Data System (ADS)
Devaraju, Anusuriya; Jayasinghe, Gaya; Klump, Jens; Hogan, Dominic
2017-04-01
Government agencies, universities, research and nonprofit organizations are increasingly publishing their datasets to promote transparency, induce new research and generate economic value through the development of new products or services. The datasets may be downloaded from various data portals (data repositories) which are general or domain-specific. The Registry of Research Data Repository (re3data.org) lists more than 2500 such data repositories from around the globe. Data portals allow keyword search and faceted navigation to facilitate discovery of research datasets. However, the volume and variety of datasets have made finding relevant datasets more difficult. Common dataset search mechanisms may be time consuming, may produce irrelevant results and are primarily suitable for users who are familiar with the general structure and contents of the respective database. Therefore, we need new approaches to support research data discovery. Recommender systems offer new possibilities for users to find datasets that are relevant to their research interests. This study presents a recommender system developed for the CSIRO Data Access Portal (DAP, http://data.csiro.au). The datasets hosted on the portal are diverse, published by researchers from 13 business units in the organisation. The goal of the study is not to replace the current search mechanisms on the data portal, but rather to extend the data discovery through an exploratory search, in this case by building a recommender system. We adopted a hybrid recommendation approach, comprising content-based filtering and item-item collaborative filtering. The content-based filtering computes similarities between datasets based on metadata such as title, keywords, descriptions, fields of research, location, contributors, etc. The collaborative filtering utilizes user search behaviour and download patterns derived from the server logs to determine similar datasets. 
Similarities above are then combined with different degrees of importance (weights) to determine the overall data similarity. We determined the similarity weights based on a survey involving 150 users of the portal. The recommender results for a given dataset are accessible programmatically via a RESTful web service. An offline evaluation involving data users demonstrates the ability of the recommender system to discover relevant and 'novel' datasets.
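A weighted hybrid of content-based and item-item collaborative similarity, as described in the abstract above, might look like the following minimal sketch. The similarity measures (Jaccard over keywords, cosine over binary download sets), the equal default weights, and all names are assumptions for illustration; the abstract does not specify the portal's actual measures or weights.

```python
import math


def jaccard(a, b):
    """Content-based similarity: overlap of two metadata keyword sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine_users(users_a, users_b):
    """Item-item collaborative similarity over binary user-download sets."""
    a, b = set(users_a), set(users_b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def recommend(target, keywords, downloads, w_content=0.5, w_collab=0.5, top_n=3):
    """Rank other datasets by a weighted sum of the two similarities.

    keywords:  {dataset_id: set of metadata keywords}
    downloads: {dataset_id: set of user ids who downloaded it}
    """
    scored = []
    for other in keywords:
        if other == target:
            continue
        content = jaccard(keywords[target], keywords[other])
        collab = cosine_users(downloads.get(target, ()), downloads.get(other, ()))
        scored.append((w_content * content + w_collab * collab, other))
    # Highest score first; ties broken by dataset id for determinism.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [name for _, name in scored[:top_n]]
```

The weights play the role of the "degrees of importance" in the abstract, where they were derived from a user survey; here they are plain parameters.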
Data reuse and the open data citation advantage
Vision, Todd J.
2013-01-01
Background. Attribution to the original contributor upon reuse of published data is important both as a reward for data creators and to document the provenance of research findings. Previous studies have found that papers with publicly available datasets receive a higher number of citations than similar studies without available data. However, few previous analyses have had the statistical power to control for the many variables known to predict citation rate, which has led to uncertain estimates of the “citation benefit”. Furthermore, little is known about patterns in data reuse over time and across datasets. Method and Results. Here, we look at citation rates while controlling for many known citation predictors and investigate the variability of data reuse. In a multivariate regression on 10,555 studies that created gene expression microarray data, we found that studies that made data available in a public repository received 9% (95% confidence interval: 5% to 13%) more citations than similar studies for which the data was not made available. Date of publication, journal impact factor, open access status, number of authors, first and last author publication history, corresponding author country, institution citation history, and study topic were included as covariates. The citation benefit varied with date of dataset deposition: a citation benefit was most clear for papers published in 2004 and 2005, at about 30%. Authors published most papers using their own datasets within two years of their first publication on the dataset, whereas data reuse papers published by third-party investigators continued to accumulate for at least six years. To study patterns of data reuse directly, we compiled 9,724 instances of third party data reuse via mention of GEO or ArrayExpress accession numbers in the full text of papers. 
The level of third-party data use was high: for 100 datasets deposited in year 0, we estimated that 40 papers in PubMed reused a dataset by year 2, 100 by year 4, and more than 150 data reuse papers had been published by year 5. Data reuse was distributed across a broad base of datasets: a very conservative estimate found that 20% of the datasets deposited between 2003 and 2007 had been reused at least once by third parties. Conclusion. After accounting for other factors affecting citation rate, we find a robust citation benefit from open data, although a smaller one than previously reported. We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data. Other factors that may also contribute to the citation benefit are considered. We further conclude that, at least for gene expression microarray data, a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003. PMID:24109559
Standards-based curation of a decade-old digital repository dataset of molecular information.
Harvey, Matthew J; Mason, Nicholas J; McLean, Andrew; Murray-Rust, Peter; Rzepa, Henry S; Stewart, James J P
2015-01-01
The desirable curation of 158,122 molecular geometries derived from the NCI set of reference molecules together with associated properties computed using the MOPAC semi-empirical quantum mechanical method and originally deposited in 2005 into the Cambridge DSpace repository as a data collection is reported. The procedures involved in the curation included annotation of the original data using new MOPAC methods, updating the syntax of the CML documents used to express the data to ensure schema conformance and adding new metadata describing the entries together with a XML schema transformation to map the metadata schema to that used by the DataCite organisation. We have adopted a granularity model in which a DataCite persistent identifier (DOI) is created for each individual molecule to enable data discovery and data metrics at this level using DataCite tools. We recommend that the future research data management (RDM) of the scientific and chemical data components associated with journal articles (the "supporting information") should be conducted in a manner that facilitates automatic periodic curation.
Telkom UData sentiment analysis using crowdsourcing and trust
NASA Astrophysics Data System (ADS)
Noer, Edvya; Sulistyo Kusumo, Dana; Rusmawati, Yanti
2018-03-01
Microblogging sites have millions of people sharing their thoughts daily because of their characteristically short and simple manner of expression. Sentiment analysis is often used to analyse customer opinions regarding brand images or products. For various reasons, not all sentiment generated using existing machine-based algorithms yields satisfying results. This is mostly due to the uniformity of the informal language used in social media sentences. This condition also occurred with Telkom UData in our preliminary study, where the machine-based approach provided less than optimal results in analysing the sentiment. This research offers an approach with human interaction using crowdsourcing, in which people are involved in analysing sentiments while forming a new training dataset at the same time. The results show that sarcastic and contradictory sentences can be recognized by humans and utilized as new training datasets for further machine learning. In these experiments, this approach is likely to increase the accuracy of the sentiments in UData, from neutral to positively or negatively polarized, up to 39%. We also simulated a trust concept through sociometry to ensure that the crowdsource workers are trusted and capable enough to analyse sentiments on social media.
Exploring the key genes and pathways in enchondromas using a gene expression microarray.
Shi, Zhongju; Zhou, Hengxing; Pan, Bin; Lu, Lu; Kang, Yi; Liu, Lu; Wei, Zhijian; Feng, Shiqing
2017-07-04
Enchondromas are the most common primary benign osseous neoplasms that occur in the medullary bone; they can undergo malignant transformation into chondrosarcoma. However, enchondromas are always undetected in patients, and the molecular mechanism is unclear. To identify key genes and pathways associated with the occurrence and development of enchondromas, we downloaded the gene expression dataset GSE22855 and obtained the differentially expressed genes (DEGs) by analyzing high-throughput gene expression in enchondromas. In total, 635 genes were identified as DEGs. Of these, 225 genes (35.43%) were up-regulated, and the remaining 410 genes (64.57%) were down-regulated. We identified the predominant gene ontology (GO) categories and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways that were significantly over-represented in the enchondroma samples compared with the control samples. Subsequently the top 10 core genes were identified from the protein-protein interaction (PPI) network. The enrichment analyses of the genes mainly involved in two significant modules showed that the DEGs were principally related to ribosomes, protein digestion and absorption, ECM-receptor interaction, focal adhesion, amoebiasis and the PI3K-Akt signaling pathway. Together, these data elucidate the molecular mechanisms underlying the occurrence and development of enchondromas and provide promising candidates for therapeutic intervention and prognostic evaluation. However, further experimental studies are needed to confirm these results.
Xu, Dongkui; Liu, Shikai; Zhang, Liang; Song, Lili
2017-04-01
The dysregulated molecules and their involvement in lymph node metastases of cervical cancer are far from being fully revealed. In this study, by reviewing MUC4 expression in The Human Protein Atlas and retrieving gene microarray data in a GEO dataset (No. GDS4664), we found that MUC4 upregulation is associated with lymph node metastasis in cervical cancer. Knockdown of MUC4 in HeLa and SiHa cells significantly reduced their invasion and also reduced their mesenchymal properties. By performing bioinformatics analysis, we observed that miR-211 is a potential suppressor of MUC4, which has a predicted highly conserved binding site in the 3'UTR of MUC4 among mammals. The subsequent assays confirmed that miR-211 can directly target the 3'UTR of MUC4 and inhibit its expression at both mRNA and protein levels. In addition, enforced miR-211 expression phenocopies the effects of MUC4 siRNA in inhibiting cervical cancer cell invasion and reversing EMT properties. Therefore, we infer that miR-211 is a novel miRNA with a suppressive effect on MUC4 expression and can inhibit cervical cancer cell invasion and EMT. Copyright © 2016. Published by Elsevier Inc.
Identification of the Key Genes and Pathways in Esophageal Carcinoma.
Su, Peng; Wen, Shiwang; Zhang, Yuefeng; Li, Yong; Xu, Yanzhao; Zhu, Yonggang; Lv, Huilai; Zhang, Fan; Wang, Mingbo; Tian, Ziqiang
2016-01-01
Objective. Esophageal carcinoma (EC) is a common gastrointestinal malignancy worldwide. This study aims to screen key genes and pathways in EC and elucidate its mechanism. Methods. Five microarray datasets of EC were downloaded from Gene Expression Omnibus. Differentially expressed genes (DEGs) were screened by bioinformatics analysis. Gene Ontology (GO) enrichment, Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment, and protein-protein interaction (PPI) network construction were performed to obtain the biological roles of DEGs in EC. Quantitative real-time polymerase chain reaction (qRT-PCR) was used to verify the expression level of DEGs in EC. Results. A total of 1955 genes were filtered as DEGs in EC. The upregulated genes were significantly enriched in cell cycle and the downregulated genes were significantly enriched in endocytosis. The PPI network showed that CDK4 and CCT3 were hub proteins in the network. The expression level of 8 dysregulated DEGs including CDK4, CCT3, THSD4, SIM2, MYBL2, CENPF, CDCA3, and CDKN3 was validated in EC compared to adjacent nontumor tissues, and the results matched the microarray analysis. Conclusion. The significant DEGs including CDK4, CCT3, THSD4, and SIM2 may play key roles in tumorigenesis and development of EC involved in cell cycle and endocytosis.
An, Ning; Yang, Xue; Cheng, Shujun; Wang, Guiqi; Zhang, Kaitai
2015-01-01
Carcinogenesis is an exceedingly complicated process, which involves multi-level dysregulations, including genomics (mainly caused by somatic mutation and copy number variation), DNA methylomics, and transcriptomics. Therefore, looking into only one molecular level of cancer is not sufficient to uncover the intricate underlying mechanisms. With the abundant resources of publicly available data in the Cancer Genome Atlas (TCGA) database, an integrative strategy was applied to systematically analyze the aberrant patterns of colorectal cancer on the basis of DNA copy number, promoter methylation, somatic mutation and gene expression. In this study, paired samples in each genomic level were retrieved to identify differentially expressed genes with corresponding genetic or epigenetic dysregulations. Notably, the result of gene ontology enrichment analysis indicated that the differentially expressed genes with corresponding aberrant promoter methylation or somatic mutation were both functionally concentrated upon developmental process, suggesting the intimate association between development and carcinogenesis. Thus, by means of random walk with restart, 37 significant development-related genes were retrieved from an a priori knowledge-based biological network. In five independent microarray datasets, Kaplan–Meier survival and Cox regression analyses both confirmed that the expression of these genes was significantly associated with overall survival of Stage III/IV colorectal cancer patients. PMID:26691761
The Physcomitrella patens gene atlas project: large-scale RNA-seq based expression data.
Perroud, Pierre-François; Haas, Fabian B; Hiss, Manuel; Ullrich, Kristian K; Alboresi, Alessandro; Amirebrahimi, Mojgan; Barry, Kerrie; Bassi, Roberto; Bonhomme, Sandrine; Chen, Haodong; Coates, Juliet C; Fujita, Tomomichi; Guyon-Debast, Anouchka; Lang, Daniel; Lin, Junyan; Lipzen, Anna; Nogué, Fabien; Oliver, Melvin J; Ponce de León, Inés; Quatrano, Ralph S; Rameau, Catherine; Reiss, Bernd; Reski, Ralf; Ricca, Mariana; Saidi, Younousse; Sun, Ning; Szövényi, Péter; Sreedasyam, Avinash; Grimwood, Jane; Stacey, Gary; Schmutz, Jeremy; Rensing, Stefan A
2018-07-01
High-throughput RNA sequencing (RNA-seq) has recently become the method of choice to define and analyze transcriptomes. For the model moss Physcomitrella patens, although this method has been used to help analyze specific perturbations, no overall reference dataset has yet been established. In the framework of the Gene Atlas project, the Joint Genome Institute selected P. patens as a flagship genome, opening the way to generate the first comprehensive transcriptome dataset for this moss. The first round of sequencing described here is composed of 99 independent libraries spanning 34 different developmental stages and conditions. Upon dataset quality control and processing through read mapping, 28 509 of the 34 361 v3.3 gene models (83%) were detected to be expressed across the samples. Differentially expressed genes (DEGs) were calculated across the dataset to permit perturbation comparisons between conditions. The analysis of the three most distinct and abundant P. patens growth stages - protonema, gametophore and sporophyte - allowed us to define both general transcriptional patterns and stage-specific transcripts. As an example of variation of physico-chemical growth conditions, we detail here the impact of ammonium supplementation under standard growth conditions on the protonemal transcriptome. Finally, the cooperative nature of this project allowed us to analyze inter-laboratory variation, as 13 different laboratories around the world provided samples. We compare differences in the replication of experiments in a single laboratory and between different laboratories. © 2018 The Authors The Plant Journal © 2018 John Wiley & Sons Ltd.
Causes and Consequences of Genetic Background Effects Illuminated by Integrative Genomic Analysis
Chandler, Christopher H.; Chari, Sudarshan; Dworkin, Ian
2014-01-01
The phenotypic consequences of individual mutations are modulated by the wild-type genetic background in which they occur. Although such background dependence is widely observed, we do not know whether general patterns across species and traits exist or about the mechanisms underlying it. We also lack knowledge on how mutations interact with genetic background to influence gene expression and how this in turn mediates mutant phenotypes. Furthermore, how genetic background influences patterns of epistasis remains unclear. To investigate the genetic basis and genomic consequences of genetic background dependence of the scallopedE3 allele on the Drosophila melanogaster wing, we generated multiple novel genome-level datasets from a mapping-by-introgression experiment and a tagged RNA gene expression dataset. In addition we used whole genome resequencing of the parental lines—two commonly used laboratory strains—to predict polymorphic transcription factor binding sites for SD. We integrated these data with previously published genomic datasets from expression microarrays and a modifier mutation screen. By searching for genes showing a congruent signal across multiple datasets, we were able to identify a robust set of candidate loci contributing to the background-dependent effects of mutations in sd. We also show that the majority of background-dependent modifiers previously reported are caused by higher-order epistasis, not quantitative noncomplementation. These findings provide a useful foundation for more detailed investigations of genetic background dependence in this system, and this approach is likely to prove useful in exploring the genetic basis of other traits as well. PMID:24504186
Accurate and fast multiple-testing correction in eQTL studies.
Sul, Jae Hoon; Raj, Towfique; de Jong, Simone; de Bakker, Paul I W; Raychaudhuri, Soumya; Ophoff, Roel A; Stranger, Barbara E; Eskin, Eleazar; Han, Buhm
2015-06-04
In studies of expression quantitative trait loci (eQTLs), it is of increasing interest to identify eGenes, the genes whose expression levels are associated with variation at a particular genetic variant. Detecting eGenes is important for follow-up analyses and prioritization because genes are the main entities in biological processes. To detect eGenes, one typically focuses on the genetic variant with the minimum p value among all variants in cis with a gene and corrects for multiple testing to obtain a gene-level p value. For performing multiple-testing correction, a permutation test is widely used. Because of growing sample sizes of eQTL studies, however, the permutation test has become a computational bottleneck in eQTL studies. In this paper, we propose an efficient approach for correcting for multiple testing and assess eGene p values by utilizing a multivariate normal distribution. Our approach properly takes into account the linkage-disequilibrium structure among variants, and its time complexity is independent of sample size. By applying our small-sample correction techniques, our method achieves high accuracy in both small and large studies. We have shown that our method consistently produces extremely accurate p values (accuracy > 98%) for three human eQTL datasets with different sample sizes and SNP densities: the Genotype-Tissue Expression pilot dataset, the multi-region brain dataset, and the HapMap 3 dataset. Copyright © 2015 The American Society of Human Genetics. Published by Elsevier Inc. All rights reserved.
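The permutation baseline described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' multivariate-normal method: the function names, the use of the largest absolute Pearson correlation as a stand-in for the minimum p value (the two are equivalent when every variant is tested on the same samples), and the assumption that every variant is polymorphic are choices made for this sketch only.

```python
import numpy as np

def max_abs_correlation(genotypes, expression):
    """Largest |Pearson r| between expression and any cis variant.

    With a fixed sample size, the variant with the largest |r| is also
    the variant with the smallest single-variant p value.
    Assumes every genotype column varies across samples.
    """
    g = genotypes - genotypes.mean(axis=0)
    e = expression - expression.mean()
    denom = np.sqrt((g ** 2).sum(axis=0) * (e ** 2).sum())
    r = (g * e[:, None]).sum(axis=0) / denom
    return np.abs(r).max()

def egene_permutation_p(genotypes, expression, n_perm=1000, seed=0):
    """Gene-level p value from permuting expression across samples."""
    rng = np.random.default_rng(seed)
    observed = max_abs_correlation(genotypes, expression)
    exceed = 0
    for _ in range(n_perm):
        permuted = rng.permutation(expression)
        if max_abs_correlation(genotypes, permuted) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_perm + 1)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    genotypes = rng.binomial(2, 0.3, size=(100, 20)).astype(float)
    expression = genotypes[:, 3] + rng.normal(size=100)
    print(egene_permutation_p(genotypes, expression, n_perm=200))
```

Each permutation recomputes statistics over all samples, so the cost grows with sample size, which is the bottleneck the abstract refers to.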
NASA Astrophysics Data System (ADS)
Silvestro, Francesco; Parodi, Antonio; Campo, Lorenzo
2017-04-01
The characterization of hydrometeorological extremes, in terms of both rainfall and streamflow, in a given region plays a key role in the environmental monitoring provided by flood alert services. In recent years, meteorological simulations (both near-real-time and historical reanalysis) have become available at increasing spatial and temporal resolutions, making possible long-period hydrological reanalyses in which the meteorological dataset is used as input to distributed hydrological models. In this work, a very high resolution meteorological reanalysis dataset, namely Express-Hydro (CIMA, ISAC-CNR, GAUSS Special Project PR45DE), was employed as input to the hydrological model Continuum in order to produce long time series of streamflow in the Liguria territory, located in the northern part of Italy. The original dataset covers the whole European territory for the 1979-2008 period, at a spatial resolution of 4 km and a time resolution of 3 hours. The rainfall estimated by the dataset was compared with observations (available from the local rain gauge network), and a bias correction was also performed in order to better match the observed climatology. Finally, an analysis of extremes was carried out on the streamflow time series obtained from the simulations, by comparing them with the results of the same hydrological model fed with the observed rainfall time series. The results of the analysis are shown and discussed.
An efficient annotation and gene-expression derivation tool for Illumina Solexa datasets
2010-01-01
Background An Illumina flow cell with all eight lanes occupied produces well over a terabyte of images, with gigabytes of reads following sequence alignment. The ability to translate such reads into meaningful annotation is therefore of great concern and importance. One can very easily be flooded with such a great volume of textual, unannotated data, irrespective of read quality or size. CASAVA, an optional analysis tool for Illumina sequencing experiments, provides INDEL detection, SNP information, and allele calling. Extracting a measure of gene expression from such analysis in the form of tag-counts, and furthermore annotating the reads, is therefore of significant value. Findings We developed TASE (Tag counting and Analysis of Solexa Experiments), a rapid tag-counting and annotation software tool specifically designed for Illumina CASAVA sequencing datasets. Developed in Java and deployed using the jTDS JDBC driver and a SQL Server backend, TASE provides an extremely fast means of calculating gene expression through tag-counts while annotating sequenced reads with the gene's presumed function, from any given CASAVA-build. Such a build is generated for both DNA and RNA sequencing. Analysis is broken into two distinct components: DNA sequence or read concatenation, followed by tag-counting and annotation. The end result is output containing the homology-based functional annotation and the respective gene expression measure, signifying how many times sequenced reads were found within the genomic ranges of functional annotations. Conclusions TASE is a powerful tool to facilitate the process of annotating a given Illumina Solexa sequencing dataset. Our results indicate that both homology-based annotation and tag-count analysis are achieved in very efficient times, allowing researchers to delve deep into a given CASAVA-build and maximize information extraction from a sequencing dataset.
TASE is specially designed to translate sequence data in a CASAVA-build into functional annotations while producing corresponding gene expression measurements. Such analysis is executed in an ultrafast and highly efficient manner, whether the analysis is a single-read or paired-end sequencing experiment. TASE is a user-friendly and freely available application, allowing rapid analysis and annotation of any given Illumina Solexa sequencing dataset with ease. PMID:20598141
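The tag-counting idea in the record above (counting how many reads fall within the genomic ranges of functional annotations) can be illustrated with a short sketch. It is not TASE's implementation, which the abstract describes as Java with a SQL Server backend; the function name, the tuple layout, and the assumption of non-overlapping inclusive ranges on a single chromosome are hypothetical.

```python
from bisect import bisect_right
from collections import Counter

def count_tags(annotations, read_positions):
    """Count reads per annotated gene.

    annotations: list of (start, end, gene) tuples with inclusive,
        non-overlapping ranges on one chromosome (an assumption of
        this sketch).
    read_positions: iterable of read start coordinates.
    """
    ordered = sorted(annotations)
    starts = [start for start, _, _ in ordered]
    counts = Counter()
    for pos in read_positions:
        # Index of the last annotation starting at or before pos.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos <= ordered[i][1]:
            counts[ordered[i][2]] += 1
    return counts

if __name__ == "__main__":
    genes = [(100, 200, "geneA"), (300, 400, "geneB")]
    reads = [150, 199, 200, 250, 300, 401, 50]
    print(count_tags(genes, reads))  # Counter({'geneA': 3, 'geneB': 1})
```

Sorting the annotations once and using binary search keeps each lookup logarithmic in the number of annotations.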
Schwarz, Jodi A; Brokstein, Peter B; Voolstra, Christian; Terry, Astrid Y; Miller, David J; Szmant, Alina M; Coffroth, Mary Alice; Medina, Mónica
2008-01-01
Background Scleractinian corals are the foundation of reef ecosystems in tropical marine environments. Their great success is due to interactions with endosymbiotic dinoflagellates (Symbiodinium spp.), with which they are obligately symbiotic. To develop a foundation for studying coral biology and coral symbiosis, we have constructed a set of cDNA libraries and generated and annotated ESTs from two species of corals, Acropora palmata and Montastraea faveolata. Results We generated 14,588 (Ap) and 3,854 (Mf) high quality ESTs from five life history/symbiosis stages (spawned eggs, early-stage planula larvae, late-stage planula larvae either infected with symbionts or uninfected, and adult coral). The ESTs assembled into a set of primarily stage-specific clusters, producing 4,980 (Ap), and 1,732 (Mf) unigenes. The egg stage library, relative to the other developmental stages, was enriched in genes functioning in cell division and proliferation, transcription, signal transduction, and regulation of protein function. Fifteen unigenes were identified as candidate symbiosis-related genes as they were expressed in all libraries constructed from the symbiotic stages and were absent from all of the non symbiotic stages. These include several DNA interacting proteins, and one highly expressed unigene (containing 17 cDNAs) with no significant protein-coding region. A significant number of unigenes (25) encode potential pattern recognition receptors (lectins, scavenger receptors, and others), as well as genes that may function in signaling pathways involved in innate immune responses (toll-like signaling, NFkB p105, and MAP kinases). Comparison between the A. palmata and an A. millepora EST dataset identified ferritin as a highly expressed gene in both datasets that appears to be undergoing adaptive evolution. 
Five unigenes appear to be restricted to the Scleractinia, as they had no homology to any sequences in the nr databases nor to the non-scleractinian cnidarians Nematostella vectensis and Hydra magnipapillata. Conclusion Partial sequencing of 5 cDNA libraries each for A. palmata and M. faveolata has produced a rich set of candidate genes (4,980 genes from A. palmata, and 1,732 genes from M. faveolata) that we can use as a starting point for examining the life history and symbiosis of these two species, as well as to further expand the dataset of cnidarian genes for comparative genomics and evolutionary studies. PMID:18298846
Schwarz, Jodi A.; Brokstein, Peter B.; Voolstra, Christian R.; ...
2008-02-25
Scleractinian corals are the foundation of reef ecosystems in tropical marine environments. Their great success is due to interactions with endosymbiotic dinoflagellates (Symbiodinium spp.), with which they are obligately symbiotic. To develop a foundation for studying coral biology and coral symbiosis, we have constructed a set of cDNA libraries and generated and annotated ESTs from two species of corals, Acropora palmata and Montastraea faveolata. Here we generated 14,588 (Ap) and 3,854 (Mf) high quality ESTs from five life history/symbiosis stages (spawned eggs, early-stage planula larvae, late-stage planula larvae either infected with symbionts or uninfected, and adult coral). The ESTs assembled into a set of primarily stage-specific clusters, producing 4,980 (Ap), and 1,732 (Mf) unigenes. The egg stage library, relative to the other developmental stages, was enriched in genes functioning in cell division and proliferation, transcription, signal transduction, and regulation of protein function. Fifteen unigenes were identified as candidate symbiosis-related genes as they were expressed in all libraries constructed from the symbiotic stages and were absent from all of the non symbiotic stages. These include several DNA interacting proteins, and one highly expressed unigene (containing 17 cDNAs) with no significant protein-coding region. A significant number of unigenes (25) encode potential pattern recognition receptors (lectins, scavenger receptors, and others), as well as genes that may function in signaling pathways involved in innate immune responses (toll-like signaling, NFkB p105, and MAP kinases). Comparison between the A. palmata and an A. millepora EST dataset identified ferritin as a highly expressed gene in both datasets that appears to be undergoing adaptive evolution.
Five unigenes appear to be restricted to the Scleractinia, as they had no homology to any sequences in the nr databases nor to the non-scleractinian cnidarians Nematostella vectensis and Hydra magnipapillata. In conclusion, partial sequencing of 5 cDNA libraries each for A. palmata and M. faveolata has produced a rich set of candidate genes (4,980 genes from A. palmata, and 1,732 genes from M. faveolata) that we can use as a starting point for examining the life history and symbiosis of these two species, as well as to further expand the dataset of cnidarian genes for comparative genomics and evolutionary studies.
Cell cycle gene expression networks discovered using systems biology: Significance in carcinogenesis
Scott, RE; Ghule, PN; Stein, JL; Stein, GS
2015-01-01
The early stages of carcinogenesis are linked to defects in the cell cycle. A series of cell cycle checkpoints are involved in this process. The G1/S checkpoint that serves to integrate the control of cell proliferation and differentiation is linked to carcinogenesis and the mitotic spindle checkpoint with the development of chromosomal instability. This paper presents the outcome of systems biology studies designed to evaluate if networks of covariate cell cycle gene transcripts exist in proliferative mammalian tissues including mice, rats and humans. The GeneNetwork website that contains numerous gene expression datasets from different species, sexes and tissues represents the foundational resource for these studies (www.genenetwork.org). In addition, WebGestalt, a gene ontology tool, facilitated the identification of expression networks of genes that co-vary with key cell cycle targets, especially Cdc20 and Plk1 (www.bioinfo.vanderbilt.edu/webgestalt). Cell cycle expression networks of such covariate mRNAs exist in multiple proliferative tissues including liver, lung, pituitary, adipose and lymphoid tissues among others but not in brain or retina that have low proliferative potential. Sixty-three covariate cell cycle gene transcripts (mRNAs) compose the average cell cycle network with p = e−13 to e−36. Cell cycle expression networks show species, sex and tissue variability and they are enriched in mRNA transcripts associated with mitosis many of which are associated with chromosomal instability. PMID:25808367
Efficient genotype compression and analysis of large genetic variation datasets
Layer, Ryan M.; Kindlon, Neil; Karczewski, Konrad J.; Quinlan, Aaron R.
2015-01-01
Genotype Query Tools (GQT) is a new indexing strategy that expedites analyses of genome variation datasets in VCF format based on sample genotypes, phenotypes and relationships. GQT’s compressed genotype index minimizes decompression for analysis, and performance relative to existing methods improves with cohort size. We show substantial (up to 443 fold) performance gains over existing methods and demonstrate GQT’s utility for exploring massive datasets involving thousands to millions of genomes. PMID:26550772
A multi-strategy approach to informative gene identification from gene expression data.
Liu, Ziying; Phan, Sieu; Famili, Fazel; Pan, Youlian; Lenferink, Anne E G; Cantin, Christiane; Collins, Catherine; O'Connor-McCourt, Maureen D
2010-02-01
An unsupervised multi-strategy approach has been developed to identify informative genes from high throughput genomic data. Several statistical methods have been used in the field to identify differentially expressed genes. Since different methods generate different lists of genes, it is very challenging to determine the most reliable gene list and the appropriate method. This paper presents a multi-strategy method, in which a combination of several data analysis techniques is applied to a given dataset and a confidence measure is established to select genes from the gene lists generated by these techniques to form the core of our final selection. The remainder of the genes, which form the peripheral region, are subject to exclusion or inclusion into the final selection. This paper demonstrates this methodology through its application to an in-house cancer genomics dataset and a public dataset. The results indicate that our method provides a more reliable list of genes, which are validated using biological knowledge, biological experiments, and literature search. We further evaluated our multi-strategy method by consolidating two pairs of independent datasets, each pair being for the same disease but generated by different labs using different platforms. The results showed that our method produced far better results.
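One way to picture the core and peripheral selection described above is a simple vote across methods. The abstract does not specify the confidence measure, so the vote fraction used here, the `core_fraction` threshold, and the function name are assumptions made purely for illustration.

```python
from collections import Counter

def consensus_genes(gene_lists, core_fraction=1.0):
    """Split genes into core and peripheral sets by agreement.

    gene_lists: one list of selected genes per analysis method.
    core_fraction: share of methods that must select a gene for it to
        be placed in the core (hypothetical confidence rule).
    """
    # Count each gene at most once per method.
    votes = Counter(g for genes in gene_lists for g in set(genes))
    n_methods = len(gene_lists)
    confidence = {g: v / n_methods for g, v in votes.items()}
    core = sorted(g for g, c in confidence.items() if c >= core_fraction)
    peripheral = sorted(g for g, c in confidence.items() if c < core_fraction)
    return core, peripheral, confidence

if __name__ == "__main__":
    lists = [["A", "B", "C"], ["A", "C", "D"], ["A", "C"]]
    core, peripheral, confidence = consensus_genes(lists)
    print(core)        # ['A', 'C']
    print(peripheral)  # ['B', 'D']
```

Genes in the peripheral set would then be reviewed for inclusion or exclusion, as the abstract describes.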
Damiani, Isabelle; Drain, Alice; Guichard, Marjorie; Balzergue, Sandrine; Boscari, Alexandre; Boyer, Jean-Christophe; Brunaud, Véronique; Cottaz, Sylvain; Rancurel, Corinne; Da Rocha, Martine; Fizames, Cécile; Fort, Sébastien; Gaillard, Isabelle; Maillol, Vincent; Danchin, Etienne G. J.; Rouached, Hatem; Samain, Eric; Su, Yan-Hua; Thouin, Julien; Touraine, Bruno; Puppo, Alain; Frachisse, Jean-Marie; Pauly, Nicolas; Sentenac, Hervé
2016-01-01
Root hairs are involved in water and nutrient uptake, and thereby in plant autotrophy. In legumes, they also play a crucial role in establishment of rhizobial symbiosis. To obtain a holistic view of Medicago truncatula genes expressed in root hairs and of their regulation during the first hours of the engagement in rhizobial symbiotic interaction, high-throughput RNA sequencing was carried out on isolated root hairs from roots challenged or not with lipochitooligosaccharide Nod factors (NF) for 4 or 20 h. This provided a repertoire of genes displaying expression in root hairs, responding or not to NF, and specific or not to legumes. In analyzing the transcriptome dataset, special attention was paid to pumps, transporters, or channels active at the plasma membrane, to other proteins likely to play a role in nutrient ion uptake, NF electrical and calcium signaling, control of the redox status or the dynamic reprogramming of root hair transcriptome induced by NF treatment, and to the identification of papilionoid legume-specific genes expressed in root hairs. About 10% of the root hair expressed genes were significantly up- or down-regulated by NF treatment, suggesting their involvement in remodeling plant functions to allow establishment of the symbiotic relationship. For instance, NF-induced changes in expression of genes encoding plasma membrane transport systems or disease response proteins indicate that root hairs reduce their involvement in nutrient ion absorption and adapt their immune system in order to engage in the symbiotic interaction. It also appears that the redox status of root hair cells is tuned in response to NF perception. In addition, 1176 genes that could be considered as “papilionoid legume-specific” were identified in the M. truncatula root hair transcriptome, of which 141 were found to possess an ortholog in each of the six legume genomes that we considered, suggesting their involvement in essential functions specific to legumes.
This transcriptome provides a valuable resource to investigate root hair biology in legumes and the roles that these cells play in rhizobial symbiosis establishment. These results could also contribute to the long-term objective of transferring this symbiotic capacity to non-legume plants. PMID:27375649
Geoseq: a tool for dissecting deep-sequencing datasets.
Gurtowski, James; Cancio, Anthony; Shah, Hardik; Levovitz, Chaya; George, Ajish; Homann, Robert; Sachidanandam, Ravi
2010-10-12
Datasets generated on deep-sequencing platforms have been deposited in various public repositories such as the Gene Expression Omnibus (GEO), Sequence Read Archive (SRA) hosted by the NCBI, or the DNA Data Bank of Japan (DDBJ). Despite being rich data sources, they have not been used much due to the difficulty in locating and analyzing datasets of interest. Geoseq (http://geoseq.mssm.edu) provides a new method of analyzing short reads from deep sequencing experiments. Instead of mapping the reads to reference genomes or sequences, Geoseq maps a reference sequence against the sequencing data. It is web-based, and holds pre-computed data from public libraries. The analysis reduces the input sequence to tiles and measures the coverage of each tile in a sequence library through the use of suffix arrays. The user can upload custom target sequences or use gene/miRNA names for the search and get back results as plots and spreadsheet files. Geoseq organizes the public sequencing data using a controlled vocabulary, allowing identification of relevant libraries by organism, tissue and type of experiment. Analysis of small sets of sequences against deep-sequencing datasets, as well as identification of public datasets of interest, is simplified by Geoseq. We applied Geoseq to (a) identify differential isoform expression in mRNA-seq datasets, (b) identify miRNAs (microRNAs) in libraries and identify mature and star sequences in miRNAs, and (c) identify potentially mis-annotated miRNAs. The ease of using Geoseq for these analyses suggests its utility and uniqueness as an analysis tool.
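The tiling step described above (reducing the input sequence to tiles and measuring the coverage of each tile in a sequence library) can be sketched as follows. Geoseq uses suffix arrays for the lookup; this illustration substitutes a naive substring search for brevity, and the function name, the tile size, and the definition of coverage as the number of reads containing the tile are assumptions of the sketch.

```python
def tile_coverage(reference, reads, tile_size=4):
    """Coverage of every overlapping tile of the reference sequence.

    Returns a list of (offset, tile, n_reads) tuples, where n_reads is
    the number of reads that contain the tile as an exact substring.
    A naive scan is used here; a suffix array over the library would
    answer the same query far faster on real data.
    """
    results = []
    for offset in range(len(reference) - tile_size + 1):
        tile = reference[offset:offset + tile_size]
        n_reads = sum(1 for read in reads if tile in read)
        results.append((offset, tile, n_reads))
    return results

if __name__ == "__main__":
    library = ["TTACGTAA", "CGTACC", "GGGG"]
    for offset, tile, n_reads in tile_coverage("ACGTAC", library):
        print(offset, tile, n_reads)
    # 0 ACGT 1
    # 1 CGTA 2
    # 2 GTAC 1
```

Querying a short reference against the library, rather than mapping every read to a genome, is the inversion the abstract highlights.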
Daub, Carsten O; Steuer, Ralf; Selbig, Joachim; Kloska, Sebastian
2004-01-01
Background The information theoretic concept of mutual information provides a general framework to evaluate dependencies between variables. In the context of the clustering of genes with similar patterns of expression it has been suggested as a general quantity of similarity to extend commonly used linear measures. Since mutual information is defined in terms of discrete variables, its application to continuous data requires the use of binning procedures, which can lead to significant numerical errors for datasets of small or moderate size. Results In this work, we propose a method for the numerical estimation of mutual information from continuous data. We investigate the characteristic properties arising from the application of our algorithm and show that our approach outperforms commonly used algorithms: The significance, as a measure of the power of distinction from random correlation, is significantly increased. This concept is subsequently illustrated on two large-scale gene expression datasets and the results are compared to those obtained using other similarity measures. A C++ source code of our algorithm is available for non-commercial use from kloska@scienion.de upon request. Conclusion The utilisation of mutual information as similarity measure enables the detection of non-linear correlations in gene expression datasets. Frequently applied linear correlation measures, which are often used on an ad-hoc basis without further justification, are thereby extended. PMID:15339346
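To make the binning issue concrete, the following is a minimal sketch of the naive equal-width binning estimate of mutual information that the abstract describes as error-prone for small datasets. It is not the authors' proposed estimator; the function name `binned_mutual_information` and the default of four bins are assumptions chosen for illustration.

```python
import math
from collections import Counter


def binned_mutual_information(x, y, bins=4):
    """Naive mutual information estimate (in bits) using equal-width bins.

    Illustrative only: this is the simple binning approach, not the
    improved estimator proposed in the abstract above.
    """
    def discretize(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0  # avoid zero width for constant data
        # Clamp the maximum value into the last bin.
        return [min(int((v - lo) / width), bins - 1) for v in values]

    bx, by = discretize(x), discretize(y)
    n = len(x)
    px, py, pxy = Counter(bx), Counter(by), Counter(zip(bx, by))
    mi = 0.0
    for (i, j), count in pxy.items():
        # p(i,j) * log2( p(i,j) / (p(i) * p(j)) ), written with raw counts
        mi += (count / n) * math.log2(count * n / (px[i] * py[j]))
    return mi


# A symmetric parabola has no linear correlation with x, yet the
# binned estimate still reports a dependency.
xs = [float(v) for v in range(8)]
ys = [(v - 3.5) ** 2 for v in xs]
print(binned_mutual_information(xs, ys))
```

With only a handful of samples per bin the estimate depends strongly on the number and placement of bins, which is the sensitivity the abstract refers to.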
paraGSEA: a scalable approach for large-scale gene expression profiling
Peng, Shaoliang; Yang, Shunyun
2017-01-01
Abstract More studies have been conducted using gene expression similarity to identify functional connections among genes, diseases and drugs. Gene Set Enrichment Analysis (GSEA) is a powerful analytical method for interpreting gene expression data. However, due to its enormous computational overhead in the significance-level estimation and multiple-hypothesis-testing steps, its computational scalability and efficiency are poor on large-scale datasets. We proposed paraGSEA for efficient large-scale transcriptome data analysis. By optimization, the overall time complexity of paraGSEA is reduced from O(mn) to O(m+n), where m is the length of the gene sets and n is the length of the gene expression profiles, which contributes a more than 100-fold increase in performance compared with other popular GSEA implementations such as GSEA-P, SAM-GS and GSEA2. By further parallelization, a near-linear speed-up is gained on both workstations and clusters in an efficient manner with high scalability and performance on large-scale datasets. The analysis time of the whole LINCS phase I dataset (GSE92742) was reduced to nearly half an hour on a 1000-node cluster on Tianhe-2, or within 120 hours on a 96-core workstation. The source code of paraGSEA is licensed under the GPLv3 and available at http://github.com/ysycloud/paraGSEA. PMID:28973463
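For readers unfamiliar with the statistic being accelerated, here is a minimal sketch of an unweighted, Kolmogorov-Smirnov-style running-sum enrichment score. It is an illustration of the general GSEA idea only, not paraGSEA's optimized implementation; the function name `enrichment_score` and the equal-step weighting are assumptions.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score for one gene set.

    ranked_genes: gene identifiers ordered from most up- to most
    down-regulated. gene_set: a set of identifiers, so each membership
    test is O(1) and one pass over the profile suffices.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0  # score is undefined; 0.0 is a convention used here
    running, best = 0.0, 0.0
    for is_hit in hits:
        # Step up on a set member, step down otherwise.
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running  # keep the maximum deviation from zero
    return best


ranking = ["g1", "g2", "g3", "g4", "g5", "g6"]
print(enrichment_score(ranking, {"g1", "g2"}))  # set concentrated at the top
print(enrichment_score(ranking, {"g5", "g6"}))  # set concentrated at the bottom
```

Significance is usually assessed by recomputing this score over many permutations, which is the costly step the abstract says paraGSEA targets.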
Impact of missing data imputation methods on gene expression clustering and classification.
de Souto, Marcilio C P; Jaskowiak, Pablo A; Costa, Ivan G
2015-02-26
Several missing value imputation methods for gene expression data have been proposed in the literature. In the past few years, researchers have been putting a great deal of effort into presenting systematic evaluations of the different imputation algorithms. Initially, most algorithms were assessed with an emphasis on the accuracy of the imputation, using metrics such as the root mean squared error. However, it has become clear that the success of the estimation of the expression value should be evaluated in more practical terms as well. One can consider, for example, the ability of the method to preserve the significant genes in the dataset, or its discriminative/predictive power for classification/clustering purposes. We performed a broad analysis of the impact of five well-known missing value imputation methods on three clustering and four classification methods, in the context of 12 cancer gene expression datasets. We employed a statistical framework, for the first time in this field, to assess whether different imputation methods improve the performance of the clustering/classification methods. Our results suggest that the imputation methods evaluated have a minor impact on the classification and downstream clustering analyses. Simple methods such as replacing the missing values by mean or the median values performed as well as more complex strategies. The datasets analyzed in this study are available at http://costalab.org/Imputation/ .
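The simple strategies mentioned in the abstract can be sketched in a few lines. The example below assumes a matrix stored as a list of rows (one row per gene) with `None` marking missing values; the function name `impute_rows` and this data layout are illustrative assumptions, not the evaluation code used in the study.

```python
from statistics import mean, median


def impute_rows(matrix, strategy="mean"):
    """Replace None entries in each row by that row's mean or median.

    Rows with no observed values are returned unchanged, since there is
    nothing to estimate the fill value from.
    """
    fill = mean if strategy == "mean" else median
    imputed = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        if not observed:
            imputed.append(list(row))
            continue
        value = fill(observed)
        imputed.append([value if v is None else v for v in row])
    return imputed


expression = [
    [1.0, None, 3.0],
    [1.0, None, 2.0, 10.0],
]
print(impute_rows(expression, "mean"))
print(impute_rows(expression, "median"))
```

The median is less sensitive to outliers such as the 10.0 in the second row, which is one reason both variants are commonly compared.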
2013-01-01
Background The brown planthopper (Nilaparvata lugens) is one of the most serious rice plant pests in Asia. N. lugens causes extensive rice damage by sucking rice phloem sap, which results in stunted plant growth and the transmission of plant viruses. Despite the importance of this insect pest, little is known about the immunological mechanisms occurring in this hemimetabolous insect species. Results In this study, we performed a genome- and transcriptome-wide analysis aiming at the immune-related genes. The transcriptome datasets include the N. lugens intestine, the developmental stage, wing formation, and sex-specific expression information that provided useful gene expression sequence data for the genome-wide analysis. As a result, we identified a large number of genes encoding N. lugens pattern recognition proteins, modulation proteins in the prophenoloxidase (proPO) activating cascade, immune effectors, and the signal transduction molecules involved in the immune pathways, including the Toll, Immune deficiency (Imd) and Janus kinase signal transducers and activators of transcription (JAK-STAT) pathways. The genome scale analysis revealed detailed information of the gene structure, distribution and transcription orientations in scaffolds. A comparison of the genome-available hemimetabolous and metabolous insect species indicate the differences in the immune-related gene constitution. We investigated the gene expression profiles with regards to how they responded to bacterial infections and tissue, as well as development and sex expression specificity. Conclusions The genome- and transcriptome-wide analysis of immune-related genes including pattern recognition and modulation molecules, immune effectors, and the signal transduction molecules involved in the immune pathways is an important step in determining the overall architecture and functional network of the immune components in N. lugens. 
Our findings provide the comprehensive gene sequence resource and expression profiles of the immune-related genes of N. lugens, which could facilitate the understanding of the innate immune mechanisms in the hemimetabolous insect species. These data give insight into clarifying the potential functional roles of the immune-related genes involved in the biological processes of development, reproduction, and virus transmission in N. lugens. PMID:23497397
Prasopdee, Sattrachai; Sotillo, Javier; Tesana, Smarn; Laha, Thewarach; Kulsantiwong, Jutharat; Nolan, Matthew J.
2014-01-01
Background Bithynia siamensis goniomphalos is the snail intermediate host of the liver fluke, Opisthorchis viverrini, the leading cause of cholangiocarcinoma (CCA) in the Greater Mekong sub-region of Thailand. Despite the severe public health impact of Opisthorchis-induced CCA, knowledge of the molecular interactions occurring between the parasite and its snail intermediate host is scant. The examination of differences in gene expression profiling between uninfected and O. viverrini-infected B. siamensis goniomphalos could provide clues on fundamental pathways involved in the regulation of snail-parasite interplay. Methodology/Principal Findings Using high-throughput (Illumina) sequencing and extensive bioinformatic analyses, we characterized the transcriptomes of uninfected and O. viverrini-infected B. siamensis goniomphalos. Comparative analyses of gene expression profiling allowed the identification of 7,655 differentially expressed genes (DEGs), associated with 43 distinct biological pathways, including pathways associated with immune defense mechanisms against parasites. Amongst the DEGs with immune functions, transcripts encoding distinct proteases displayed the highest down-regulation in Bithynia specimens infected by O. viverrini; conversely, transcription of genes encoding heat-shock proteins and actins was significantly up-regulated in parasite-infected snails when compared to the uninfected counterparts. Conclusions/Significance The present study lays the foundation for functional studies of genes and gene products potentially involved in immune-molecular mechanisms implicated in the ability of the parasite to successfully colonize its snail intermediate host. The annotated dataset provided herein represents a ready-to-use molecular resource for the discovery of molecular pathways underlying susceptibility and resistance mechanisms of B. siamensis goniomphalos to O. viverrini and for comparative analyses with pulmonate snail intermediate hosts of other platyhelminths including schistosomes. PMID:24676090
A signature inferred from Drosophila mitotic genes predicts survival of breast cancer patients.
Damasco, Christian; Lembo, Antonio; Somma, Maria Patrizia; Gatti, Maurizio; Di Cunto, Ferdinando; Provero, Paolo
2011-02-28
The classification of breast cancer patients into risk groups provides a powerful tool for the identification of patients who will benefit from aggressive systemic therapy. The analysis of microarray data has generated several gene expression signatures that improve diagnosis and allow risk assessment. There is also evidence that cell proliferation-related genes have a high predictive power within these signatures. We thus constructed a gene expression signature (the DM signature) using the human orthologues of 108 Drosophila melanogaster genes required for either the maintenance of chromosome integrity (36 genes) or mitotic division (72 genes). The DM signature has minimal overlap with the extant signatures and is highly predictive of survival in 5 large breast cancer datasets. In addition, we show that the DM signature outperforms many widely used breast cancer signatures in predictive power, and performs comparably to other proliferation-based signatures. For most genes of the DM signature, an increased expression is negatively correlated with patient survival. The genes that provide the highest contribution to the predictive power of the DM signature are those involved in cytokinesis. This finding highlights cytokinesis as an important marker in breast cancer prognosis and as a possible target for antimitotic therapies.
ZNF281 inhibits neuronal differentiation and is a prognostic marker for neuroblastoma.
Pieraccioli, Marco; Nicolai, Sara; Pitolli, Consuelo; Agostini, Massimiliano; Antonov, Alexey; Malewicz, Michal; Knight, Richard A; Raschellà, Giuseppe; Melino, Gerry
2018-06-25
Derangement of cellular differentiation because of mutation or inappropriate expression of specific genes is a common feature in tumors. Here, we show that the expression of ZNF281, a zinc finger factor involved in several cellular processes, decreases during terminal differentiation of murine cortical neurons and in retinoic acid-induced differentiation of neuroblastoma (NB) cells. The ectopic expression of ZNF281 inhibits the neuronal differentiation of murine cortical neurons and NB cells, whereas its silencing causes the opposite effect. Furthermore, TAp73 inhibits the expression of ZNF281 through miR34a. Conversely, MYCN promotes the expression of ZNF281 at least in part by inhibiting miR34a. These findings imply a functional network that includes p73, MYCN, and ZNF281 in NB cells, where ZNF281 acts by negatively affecting neuronal differentiation. Array analysis of NB cells silenced for ZNF281 expression identified GDNF and NRP2 as two transcriptional targets inhibited by ZNF281. Binding of ZNF281 to the promoters of these genes suggests a direct mechanism of repression. Bioinformatic analysis of NB datasets indicates that ZNF281 expression is higher in aggressive, undifferentiated stage 4 than in localized stage 1 tumors supporting a central role of ZNF281 in affecting the differentiation of NB. Furthermore, patients with NB with high expression of ZNF281 have a poor clinical outcome compared with low-expressors. These observations suggest that ZNF281 is a controller of neuronal differentiation that should be evaluated as a prognostic marker in NB. Copyright © 2018 the Author(s). Published by PNAS.
Chumnanpuen, Pramote; Nookaew, Intawat; Nielsen, Jens
2013-10-16
In the yeast Saccharomyces cerevisiae, genes containing UASINO sequences are regulated by the Ino2/Ino4 and Opi1 transcription factors, and this regulation controls lipid biosynthesis. The expression level of INO2 and INO4 genes (INO-level) at different nutrient limited conditions might lead to various responses in yeast lipid metabolism. In this study, we undertook a global study on how INO-levels (transcription level of INO2 and INO4) affect lipid metabolism in yeast and we also studied the effects of single and double deletions of the two INO-genes (deficient effect). Using 2 types of nutrient limitations (carbon and nitrogen) in chemostat cultures operated at a fixed specific growth rate of 0.1 h-1 and strains having different INO-level, we were able to see the effect on expression level of the genes involved in lipid biosynthesis and the fluxes towards the different lipid components. Through combined measurements of the transcriptome, metabolome, and lipidome it was possible to obtain a large dataset that could be used to identify how the INO-level controls lipid metabolism and also establish correlations between the different components.
Our analysis showed the strength of using a combination of transcriptome and lipidome analysis to illustrate the effect of INO-levels on phospholipid metabolism and based on our analysis we established a global regulatory map.
Liu, Bowen; Wang, Tianjiao; Wang, Huawei; Zhang, Lu; Xu, Feifei; Fang, Runping; Li, Leilei; Cai, Xiaoli; Wu, Yue; Zhang, Weiying; Ye, Lihong
2018-02-23
Resistance to tamoxifen (TAM) frequently occurs in the treatment of estrogen receptor positive (ER+) breast cancer. Accumulating evidence indicates that the transcription factor HOXB13 is of great significance in TAM resistance. However, the regulation of HOXB13 in TAM-resistant breast cancer remains largely unexplored. Here, we were interested in the potential effect of HBXIP, an oncoprotein involved in the acceleration of cancer progression, on the modulation of HOXB13 in TAM resistance of breast cancer. The Kaplan-Meier plotter cancer database and GEO dataset were used to analyze the association between HBXIP expression and relapse-free survival. The correlation of HBXIP and HOXB13 in ER+ breast cancer was assessed by human tissue microarray. Immunoblotting analysis, qRT-PCR assay, immunofluorescence staining, Co-IP assay, ChIP assay, luciferase reporter gene assay, cell viability assay, and colony formation assay were performed to explore the possible molecular mechanism by which HBXIP modulates HOXB13. Cell viability assay, xenograft assay, and immunohistochemistry staining analysis were utilized to evaluate the effect of the HBXIP/HOXB13 axis on the facilitation of TAM resistance in vitro and in vivo. The analysis of the Kaplan-Meier plotter and the GEO dataset showed that mono-TAM-treated breast cancer patients with higher HBXIP expression levels had shorter relapse-free survival than patients with lower HBXIP expression levels. Overexpression of HBXIP induced TAM resistance in ER+ breast cancer cells. The tissue microarray analysis revealed a positive association between the expression levels of HBXIP and HOXB13 in ER+ breast cancer patients. HBXIP elevated HOXB13 protein level in breast cancer cells. Mechanistically, HBXIP prevented chaperone-mediated autophagy (CMA)-dependent degradation of HOXB13 via enhancement of HOXB13 acetylation at the lysine 277 residue, causing the accumulation of HOXB13. 
Moreover, HBXIP was able to act as a co-activator of HOXB13 to stimulate interleukin (IL)-6 transcription in the promotion of TAM resistance. Interestingly, aspirin (ASA) suppressed the HBXIP/HOXB13 axis by decreasing HBXIP expression, overcoming TAM resistance in vitro and in vivo. Our study highlights that HBXIP enhances HOXB13 acetylation to prevent HOXB13 degradation and co-activates HOXB13 in the promotion of TAM resistance of breast cancer. Therapeutically, ASA can serve as a potential candidate for reversing TAM resistance by inhibiting HBXIP expression.
Differential privacy based on importance weighting
Ji, Zhanglong
2014-01-01
This paper analyzes a novel method for publishing data while still protecting privacy. The method is based on computing weights that make an existing dataset, for which there are no confidentiality issues, analogous to the dataset that must be kept private. The existing dataset may be genuine but public already, or it may be synthetic. The weights are importance sampling weights, but to protect privacy, they are regularized and have noise added. The weights allow statistical queries to be answered approximately while provably guaranteeing differential privacy. We derive an expression for the asymptotic variance of the approximate answers. Experiments show that the new mechanism performs well even when the privacy budget is small, and when the public and private datasets are drawn from different populations. PMID:24482559
Time Series Expression Analyses Using RNA-seq: A Statistical Approach
Oh, Sunghee; Song, Seongho; Grabowski, Gregory; Zhao, Hongyu; Noonan, James P.
2013-01-01
RNA-seq is becoming the de facto standard approach for transcriptome analysis with ever-reducing cost. It has considerable advantages over conventional technologies (microarrays) because it allows for direct identification and quantification of transcripts. Many time series RNA-seq datasets have been collected to study the dynamic regulations of transcripts. However, statistically rigorous and computationally efficient methods are needed to explore the time-dependent changes of gene expression in biological systems. These methods should explicitly account for the dependencies of expression patterns across time points. Here, we discuss several methods that can be applied to model timecourse RNA-seq data, including statistical evolutionary trajectory index (SETI), autoregressive time-lagged regression (AR(1)), and hidden Markov model (HMM) approaches. We use three real datasets and simulation studies to demonstrate the utility of these dynamic methods in temporal analysis. PMID:23586021
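As a small illustration of the autoregressive idea mentioned in the abstract, the sketch below fits a first-order model x_t = c + phi * x_(t-1) to a single expression time course by ordinary least squares. This is a generic AR(1) fit under simplifying assumptions (equally spaced time points, one gene, no count-specific noise model), not the procedure used by the authors; the function name `fit_ar1` is hypothetical.

```python
def fit_ar1(series):
    """Least-squares fit of x_t = c + phi * x_(t-1).

    Returns (c, phi). Requires at least three time points with
    non-constant lagged values.
    """
    lagged = series[:-1]   # x_(t-1)
    current = series[1:]   # x_t
    n = len(lagged)
    mean_lag = sum(lagged) / n
    mean_cur = sum(current) / n
    sxx = sum((a - mean_lag) ** 2 for a in lagged)
    sxy = sum((a - mean_lag) * (b - mean_cur) for a, b in zip(lagged, current))
    phi = sxy / sxx
    c = mean_cur - phi * mean_lag
    return c, phi


# Noise-free toy series generated with c = 1.0 and phi = 0.5.
toy = [0.0, 1.0, 1.5, 1.75, 1.875]
print(fit_ar1(toy))
```

A value of phi close to zero suggests little dependence between consecutive time points, while values near one indicate strong carry-over.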
Cross-platform method for identifying candidate network biomarkers for prostate cancer.
Jin, G; Zhou, X; Cui, K; Zhang, X-S; Chen, L; Wong, S T C
2009-11-01
Discovering biomarkers using mass spectrometry (MS) and microarray expression profiles is a promising strategy in molecular diagnosis. Here, the authors proposed a new pipeline for biomarker discovery that integrates disease information for proteins and genes, expression profiles at both the genomic and proteomic levels, and protein-protein interactions (PPIs) to discover high-confidence network biomarkers. Using this pipeline, a total of 474 molecules (genes and proteins) related to prostate cancer were identified and a prostate-cancer-related network (PCRN) was derived from the integrative information. Thus, a set of candidate network biomarkers was identified from multiple expression profiles composed of eight microarray datasets and one proteomics dataset. The network biomarkers with PPIs can accurately distinguish prostate cancer patients from normal controls, potentially providing more reliable hits of biomarker candidates than conventional biomarker discovery methods.
Integrated Quantitative Transcriptome Maps of Human Trisomy 21 Tissues and Cells
Pelleri, Maria Chiara; Cattani, Chiara; Vitale, Lorenza; Antonaros, Francesca; Strippoli, Pierluigi; Locatelli, Chiara; Cocchi, Guido; Piovesan, Allison; Caracausi, Maria
2018-01-01
Down syndrome (DS) is due to the presence of an extra full or partial chromosome 21 (Hsa21). The identification of genes contributing to DS pathogenesis could be the key to any rational therapy of the associated intellectual disability. We aim at generating quantitative transcriptome maps in DS integrating all gene expression profile datasets available for any cell type or tissue, to obtain a complete model of the transcriptome in terms of both expression values for each gene and segmental trend of gene expression along each chromosome. We used the TRAM (Transcriptome Mapper) software for this meta-analysis, comparing transcript expression levels and profiles between DS and normal brain, lymphoblastoid cell lines, blood cells, fibroblasts, thymus and induced pluripotent stem cells, respectively. TRAM combined, normalized, and integrated datasets from different sources and across diverse experimental platforms. The main output was a linear expression value that may be used as a reference for each of up to 37,181 mapped transcripts analyzed, related to both known genes and expression sequence tag (EST) clusters. An independent example in vitro validation of fibroblast transcriptome map data was performed through “Real-Time” reverse transcription polymerase chain reaction showing an excellent correlation coefficient (r = 0.93, p < 0.0001) with data obtained in silico. The availability of linear expression values for each gene allowed the testing of the gene dosage hypothesis of the expected 3:2 DS/normal ratio for Hsa21 as well as other human genes in DS, in addition to listing genes differentially expressed with statistical significance. Although a fraction of Hsa21 genes escapes dosage effects, Hsa21 genes are selectively over-expressed in DS samples compared to genes from other chromosomes, reflecting a decisive role in the pathogenesis of the syndrome. 
Finally, the analysis of chromosomal segments reveals a high prevalence of Hsa21 over-expressed segments over the other genomic regions, suggesting, in particular, a specific region on Hsa21 that appears to be frequently over-expressed (21q22). Our complete datasets are released as a new framework to investigate transcription in DS for individual genes as well as chromosomal segments in different cell types and tissues. PMID:29740474
Clinical Value of Prognosis Gene Expression Signatures in Colorectal Cancer: A Systematic Review
Cordero, David; Riccadonna, Samantha; Solé, Xavier; Crous-Bou, Marta; Guinó, Elisabet; Sanjuan, Xavier; Biondo, Sebastiano; Soriano, Antonio; Jurman, Giuseppe; Capella, Gabriel; Furlanello, Cesare; Moreno, Victor
2012-01-01
Introduction The traditional staging system is inadequate to identify those patients with stage II colorectal cancer (CRC) at high risk of recurrence or with stage III CRC at low risk. A number of gene expression signatures to predict CRC prognosis have been proposed, but none is routinely used in the clinic. The aim of this work was to assess the prediction ability and potential clinical usefulness of these signatures in a series of independent datasets. Methods A literature review identified 31 gene expression signatures that used gene expression data to predict prognosis in CRC tissue. The search was based on the PubMed database and was restricted to papers published from January 2004 to December 2011. Eleven CRC gene expression datasets with outcome information were identified and downloaded from public repositories. A Random Forest classifier was used to build predictors from the gene lists. The Matthews correlation coefficient was chosen as a measure of classification accuracy and its associated p-value was used to assess association with prognosis. For clinical usefulness evaluation, positive and negative post-test probabilities were computed in stage II and III samples. Results Five gene signatures showed significant association with prognosis and provided reasonable prediction accuracy in their own training datasets. Nevertheless, all signatures showed low reproducibility in independent data. Stratified analyses by stage or microsatellite instability status showed significant association but limited discrimination ability, especially in stage II tumors. From a clinical perspective, the most predictive signatures showed a minor but significant improvement over the classical staging system. Conclusions The published signatures show low prediction accuracy but moderate clinical usefulness. Although gene expression data may inform prognosis, better strategies for signature validation are needed to encourage their widespread use in the clinic. PMID:23145004
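The Matthews correlation coefficient used as the accuracy measure above can be computed directly from a two-class confusion matrix. The sketch below uses the standard definition; the function name `matthews_corrcoef` and the convention of returning 0.0 when a marginal total is zero are choices made for this illustration, not details taken from the review.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Ranges from -1 (total disagreement) through 0 (no better than
    chance) to +1 (perfect prediction).
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # common convention when a row or column is empty
    return (tp * tn - fp * fn) / denominator


print(matthews_corrcoef(tp=5, tn=5, fp=0, fn=0))  # perfect classifier
print(matthews_corrcoef(tp=5, tn=5, fp=5, fn=5))  # chance-level classifier
```

Unlike plain accuracy, the coefficient stays near zero for a classifier that always predicts the majority class, which matters when recurrence events are rare.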
Wei, Peng; Shi, Li; Shen, Guangmao; Xu, Zhifeng; Liu, Jialu; Pan, Yu; He, Lin
2016-07-01
Carboxylesterases (CarEs) play important roles in metabolism and detoxification of dietary and environmental xenobiotics in insects and mites. On the basis of the Tetranychus cinnabarinus transcriptome dataset, 23 CarE genes (6 full-length and 17 partial sequences) were identified. Synergist bioassays showed that CarEs were involved in acaricide detoxification and resistance in fenpropathrin-resistant (FeR) and cyflumetofen-resistant (CyR) strains. To further reveal the relationship between CarE gene expression and acaricide resistance in T. cinnabarinus, we profiled their expression in susceptible (SS) and resistant (FeR and CyR) strains. There were 8 and 4 over-expressed carboxylesterase genes in FeR and CyR, respectively; the over-expression was detected at the mRNA level but not at the DNA level. Pesticide induction experiments showed that 4 of 8 and 2 of 4 up-regulated genes were significantly inducible in the FeR and CyR strains, respectively, but could not be induced in the SS strain, indicating that these genes became more strongly expressed and more effective at withstanding pesticide stress in resistant T. cinnabarinus. Most expression-changed genes and all inducible genes possess the Abhydrolase_3 motif, which is a catalytic domain for hydrolysis. As a whole, the findings of the current study provide clues for further elucidating the function and regulation mechanism of these carboxylesterase genes in the formation of resistance in T. cinnabarinus. Copyright © 2015 Elsevier B.V. All rights reserved.
Li, Weicong; Zheng, Zaosong; Chen, Haicheng; Cai, Yuhong; Xie, Wenlian
2018-05-01
Previous years have witnessed the importance of long non-coding RNAs (lncRNAs) in cancer research. The lncRNA Pvt1 oncogene (non-protein coding) (PVT1) was revealed to be upregulated in various cancer types. The aim of the present study was to investigate the function of PVT1 in clear cell renal cell carcinoma (ccRCC). The expression of PVT1 in ccRCC was analyzed using reverse transcription-quantitative polymerase chain reaction, and it was revealed that PVT1 expression was upregulated in ccRCC tissues compared with that in normal adjacent tissues. Next, PVT1 expression from The Cancer Genome Atlas datasets was validated, and it was also revealed that the high expression of PVT1 was associated with advanced disease stage and a poor prognosis. Furthermore, the knockdown of PVT1 induced apoptosis by increasing the expression of poly ADP ribose polymerase and Bcl-2-associated X protein, and promoted cell cycle arrest at the G1 phase by decreasing the expression of cyclin D1. Study of the mechanism involved indicated that PVT1 promoted the progression of ccRCC partly through activation of the epidermal growth factor receptor pathway. Altogether, the results of the present study suggested that PVT1 serves oncogenic functions and may be a biomarker and therapeutic target in ccRCC.
Vukmirovic, Milica; Herazo-Maya, Jose D; Blackmon, John; Skodric-Trifunovic, Vesna; Jovanovic, Dragana; Pavlovic, Sonja; Stojsic, Jelena; Zeljkovic, Vesna; Yan, Xiting; Homer, Robert; Stefanovic, Branko; Kaminski, Naftali
2017-01-12
Idiopathic Pulmonary Fibrosis (IPF) is a lethal lung disease of unknown etiology. A major limitation in transcriptomic profiling of lung tissue in IPF has been a dependence on snap-frozen fresh tissues (FF). In this project we sought to determine whether genome scale transcript profiling using RNA Sequencing (RNA-Seq) could be applied to archived Formalin-Fixed Paraffin-Embedded (FFPE) IPF tissues. We isolated total RNA from 7 IPF and 5 control FFPE lung tissues and performed 50 base pair paired-end sequencing on an Illumina HiSeq 2000. TopHat2 was used to map sequencing reads to the human genome. On average ~62 million reads (53.4% of ~116 million reads) were mapped per sample. 4,131 genes were differentially expressed between IPF and controls (1,920 increased and 2,211 decreased; FDR < 0.05). We compared our results to differentially expressed genes calculated from a previously published dataset generated from FF tissues analyzed on Agilent microarrays (GSE47460). The overlap of differentially expressed genes was very high (760 increased and 1,413 decreased, FDR < 0.05). Only 92 differentially expressed genes changed in opposite directions. Pathway enrichment analysis performed using MetaCore confirmed numerous IPF relevant genes and pathways including extracellular remodeling, TGF-beta, and WNT. Gene network analysis of MMP7, a highly differentially expressed gene in both datasets, revealed the same canonical pathways and gene network candidates in RNA-Seq and microarray data. For validation by NanoString nCounter® we selected 35 genes that had a fold change of 2 in at least one dataset (10 discordant, 10 significantly differentially expressed in one dataset only and 15 concordant genes). High concordance of fold change and FDR was observed for each type of the samples (FF vs FFPE) with both microarrays (r = 0.92) and RNA-Seq (r = 0.90) and the number of discordant genes was reduced to four. 
Our results demonstrate that RNA sequencing of RNA obtained from archived FFPE lung tissues is feasible. The results obtained from FFPE tissue are highly comparable to FF tissues. The ability to perform RNA-Seq on archived FFPE IPF tissues should greatly enhance the availability of tissue biopsies for research in IPF.
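The comparison of differentially expressed gene lists described in this abstract (shared genes changing in the same or in opposite directions) can be sketched as follows. This is a minimal illustration only: the function name, the dictionary layout and the gene identifiers are hypothetical and are not taken from the study, which does not describe its comparison code.

```python
def compare_deg_lists(fc_a, fc_b):
    """Compare two differential expression results.

    fc_a and fc_b map gene identifiers to signed log fold changes for
    genes already called significant in each dataset (assumed input).
    """
    shared = set(fc_a) & set(fc_b)
    # Genes increased in both datasets
    both_up = {g for g in shared if fc_a[g] > 0 and fc_b[g] > 0}
    # Genes decreased in both datasets
    both_down = {g for g in shared if fc_a[g] < 0 and fc_b[g] < 0}
    # Genes whose fold changes have opposite signs
    opposite = {g for g in shared if fc_a[g] * fc_b[g] < 0}
    return {"both_up": both_up, "both_down": both_down, "opposite": opposite}


# Hypothetical toy input, not data from the paper
rnaseq = {"gene1": 1.5, "gene2": -2.0, "gene3": 0.8, "gene4": -1.1}
microarray = {"gene1": 1.2, "gene2": -1.4, "gene3": -0.9, "gene5": 2.0}
summary = compare_deg_lists(rnaseq, microarray)
```

Counting the members of each set gives the kind of tallies quoted in the abstract (increased in both, decreased in both, opposite direction).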
Inferring Boolean network states from partial information
2013-01-01
Networks of molecular interactions regulate key processes in living cells. Therefore, understanding their functionality is a high priority in advancing biological knowledge. Boolean networks are often used to describe cellular networks mathematically and are fitted to experimental datasets. The fitting often results in ambiguities since the interpretation of the measurements is not straightforward and since the data contain noise. In order to facilitate a more reliable mapping between datasets and Boolean networks, we develop an algorithm that infers network trajectories from a dataset distorted by noise. We analyze our algorithm theoretically and demonstrate its accuracy using simulation and microarray expression data. PMID:24006954
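For readers unfamiliar with the model class, the sketch below simulates a small synchronous Boolean network and enumerates the fully specified states that agree with a partial observation. The three-gene rules and all function names are hypothetical; the paper's own algorithm, which infers trajectories from noisy data, is not reproduced here.

```python
from itertools import product

# Hypothetical three-gene network: each rule maps the current state to the
# next value of one gene. These rules are illustrative only.
RULES = {
    "A": lambda s: s["B"] and not s["C"],
    "B": lambda s: s["A"],
    "C": lambda s: s["A"] or s["B"],
}


def step(state):
    """Apply all rules simultaneously (synchronous update)."""
    return {gene: bool(rule(state)) for gene, rule in RULES.items()}


def trajectory(state, n_steps):
    """Return the visited states, starting with the initial state."""
    states = [dict(state)]
    for _ in range(n_steps):
        states.append(step(states[-1]))
    return states


def consistent_states(partial):
    """Enumerate full states that agree with a partial observation.

    `partial` maps observed genes to True/False; unobserved genes are omitted.
    """
    genes = sorted(RULES)
    matches = []
    for values in product([False, True], repeat=len(genes)):
        candidate = dict(zip(genes, values))
        if all(candidate[g] == v for g, v in partial.items()):
            matches.append(candidate)
    return matches
```

With only gene A observed, `consistent_states({"A": True})` returns four candidate states; each could then be propagated with `trajectory` and compared against later measurements.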
PmiRExAt: plant miRNA expression atlas database and web applications
Gurjar, Anoop Kishor Singh; Panwar, Abhijeet Singh; Gupta, Rajinder; Mantri, Shrikant S.
2016-01-01
High-throughput small RNA (sRNA) sequencing technology enables an entirely new perspective for plant microRNA (miRNA) research and has immense potential to unravel regulatory networks. Novel insights gained through data mining of the rich, publicly available sRNA data resources will help in designing biotechnology-based approaches for crop improvement to enhance plant yield and nutritional value. Bioinformatics resources enabling meta-analysis of miRNA expression across multiple plant species are still evolving. Here, we report PmiRExAt, a new online database resource that serves as a plant miRNA expression atlas. The web-based repository comprises miRNA expression profiles and a query tool for 1859 wheat, 2330 rice and 283 maize miRNAs. The database interface offers open and easy access to miRNA expression profiles and helps in identifying tissue-preferential, differential and constitutively expressed miRNAs. A feature enabling expression study of conserved miRNAs across multiple species is also implemented. A custom expression analysis feature enables expression analysis of novel miRNAs in a total of 117 datasets. New sRNA datasets can also be uploaded for analysing miRNA expression profiles for 73 plant species. The PmiRExAt application program interface, a Simple Object Access Protocol web service, allows other programmers to remotely invoke the methods written for programmatic search operations on the PmiRExAt database. Database URL: http://pmirexat.nabi.res.in. PMID:27081157
An efficient method to identify differentially expressed genes in microarray experiments
Qin, Huaizhen; Feng, Tao; Harding, Scott A.; Tsai, Chung-Jui; Zhang, Shuanglin
2013-01-01
Motivation: Microarray experiments typically analyze thousands to tens of thousands of genes from small numbers of biological replicates. The fact that genes are normally expressed in functionally relevant patterns suggests that gene-expression data can be stratified and clustered into relatively homogenous groups. Cluster-wise dimensionality reduction should make it feasible to improve screening power while minimizing information loss. Results: We propose a powerful and computationally simple method for finding differentially expressed genes in small microarray experiments. The method incorporates a novel stratification-based tight clustering algorithm, principal component analysis and information pooling. Comprehensive simulations show that our method is substantially more powerful than the popular SAM and eBayes approaches. We applied the method to three real microarray datasets: one from a Populus nitrogen stress experiment with 3 biological replicates; and two from public microarray datasets of human cancers with 10 to 40 biological replicates. In all three analyses, our method proved more robust than the popular alternatives for identification of differentially expressed genes. Availability: The C++ code to implement the proposed method is available upon request for academic use. PMID:18453554
Array data extractor (ADE): a LabVIEW program to extract and merge gene array data.
Kurtenbach, Stefan; Kurtenbach, Sarah; Zoidl, Georg
2013-12-01
Large data sets from gene expression array studies are publicly available offering information highly valuable for research across many disciplines ranging from fundamental to clinical research. Highly advanced bioinformatics tools have been made available to researchers, but a demand for user-friendly software allowing researchers to quickly extract expression information for multiple genes from multiple studies persists. Here, we present a user-friendly LabVIEW program to automatically extract gene expression data for a list of genes from multiple normalized microarray datasets. Functionality was tested for 288 class A G protein-coupled receptors (GPCRs) and expression data from 12 studies comparing normal and diseased human hearts. Results confirmed known regulation of a beta 1 adrenergic receptor and further indicate novel research targets. Although existing software allows for complex data analyses, the LabVIEW based program presented here, "Array Data Extractor (ADE)", provides users with a tool to retrieve meaningful information from multiple normalized gene expression datasets in a fast and easy way. Further, the graphical programming language used in LabVIEW allows applying changes to the program without the need of advanced programming knowledge.
Decibel: The Relational Dataset Branching System
Maddox, Michael; Goehring, David; Elmore, Aaron J.; Madden, Samuel; Parameswaran, Aditya; Deshpande, Amol
2017-01-01
As scientific endeavors and data analysis become increasingly collaborative, there is a need for data management systems that natively support the versioning or branching of datasets to enable concurrent analysis, cleaning, integration, manipulation, or curation of data across teams of individuals. Common practice for sharing and collaborating on datasets involves creating or storing multiple copies of the dataset, one for each stage of analysis, with no provenance information tracking the relationships between these datasets. This results not only in wasted storage, but also makes it challenging to track and integrate modifications made by different users to the same dataset. In this paper, we introduce the Relational Dataset Branching System, Decibel, a new relational storage system with built-in version control designed to address these shortcomings. We present our initial design for Decibel and provide a thorough evaluation of three versioned storage engine designs that focus on efficient query processing with minimal storage overhead. We also develop an exhaustive benchmark to enable the rigorous testing of these and future versioned storage engine designs. PMID:28149668
Biclustering sparse binary genomic data.
van Uitert, Miranda; Meuleman, Wouter; Wessels, Lodewyk
2008-12-01
Genomic datasets often consist of large, binary, sparse data matrices. In such a dataset, one is often interested in finding contiguous blocks that (mostly) contain ones. This is a biclustering problem, and while many algorithms have been proposed to deal with gene expression data, only two algorithms have been proposed that specifically deal with binary matrices. None of the gene expression biclustering algorithms can handle the large number of zeros in sparse binary matrices. The two proposed binary algorithms failed to produce meaningful results. In this article, we present a new algorithm that is able to extract biclusters from sparse, binary datasets. A powerful feature is that biclusters with different numbers of rows and columns can be detected, varying from many rows to few columns and few rows to many columns. It allows the user to guide the search towards biclusters of specific dimensions. When applying our algorithm to an input matrix derived from TRANSFAC, we find transcription factors with distinctly dissimilar binding motifs, but a clear set of common targets that are significantly enriched for GO categories.
The Graduate Outcome Project: Using Data from the Integrated Data Infrastructure Project
ERIC Educational Resources Information Center
Rees, Malcolm
2014-01-01
This paper reports on progress to date with a project underway in New Zealand involving the extraction of data from multiple government agencies that is then combined into one comprehensive longitudinal integrated dataset and made available to trial participants in a way never previously thought possible. The dataset includes school leaver…
Trayhurn, Paul; Denyer, Gareth
2012-01-01
Microarray datasets are a rich source of information in nutritional investigation. Targeted mining of microarray data following initial, non-biased bioinformatic analysis can provide key insight into specific genes and metabolic processes of interest. Microarrays from human adipocytes were examined to explore the effects of macrophage secretions on the expression of the G-protein-coupled receptor (GPR) genes that encode fatty acid receptors/sensors. Exposure of the adipocytes to macrophage-conditioned medium for 4 or 24 h had no effect on GPR40 and GPR43 expression, but there was a marked stimulation of GPR84 expression (receptor for medium-chain fatty acids), the mRNA level increasing 13·5-fold at 24 h relative to unconditioned medium. Importantly, expression of GPR120, which encodes an n-3 PUFA receptor/sensor, was strongly inhibited by the conditioned medium (15-fold decrease in mRNA at 24 h). Macrophage secretions have major effects on the expression of fatty acid receptor/sensor genes in human adipocytes, which may lead to an augmentation of the inflammatory response in adipose tissue in obesity. PMID:25191551
Low-rank regularization for learning gene expression programs.
Ye, Guibo; Tang, Mengfan; Cai, Jian-Feng; Nie, Qing; Xie, Xiaohui
2013-01-01
Learning gene expression programs directly from a set of observations is challenging due to the complexity of gene regulation, high noise of experimental measurements, and insufficient number of experimental measurements. Imposing additional constraints with strong and biologically motivated regularizations is critical in developing reliable and effective algorithms for inferring gene expression programs. Here we propose a new form of regulation that constrains the number of independent connectivity patterns between regulators and targets, motivated by the modular design of gene regulatory programs and the belief that the total number of independent regulatory modules should be small. We formulate a multi-target linear regression framework to incorporate this type of regulation, in which the number of independent connectivity patterns is expressed as the rank of the connectivity matrix between regulators and targets. We then generalize the linear framework to nonlinear cases, and prove that the generalized low-rank regularization model is still convex. Efficient algorithms are derived to solve both the linear and nonlinear low-rank regularized problems. Finally, we test the algorithms on three gene expression datasets, and show that the low-rank regularization improves the accuracy of gene expression prediction in these three datasets.
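One common way to write a rank-constrained multi-target regression of this kind is shown below. It is offered as an assumption for orientation, not as the authors' exact formulation: the symbols X (regulator expression, n samples by p regulators), Y (target expression, n by q), W (the p by q connectivity matrix) and the penalty weight lambda are introduced here, and the nuclear norm is used as the standard convex surrogate for the rank.

```latex
\min_{W \in \mathbb{R}^{p \times q}} \;
  \tfrac{1}{2}\,\lVert Y - XW \rVert_F^{2} \;+\; \lambda\,\lVert W \rVert_{*},
\qquad
\lVert W \rVert_{*} \;=\; \sum_{i=1}^{\min(p,q)} \sigma_i(W)
```

Here the sigma_i(W) are the singular values of W. Penalizing their sum tends to drive many of them to zero, which lowers the rank of W and hence the number of independent connectivity patterns.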
2011-01-01
Background: Panax notoginseng (Burk) F.H. Chen is an important medicinal plant of the Araliaceae family. Triterpene saponins are the bioactive constituents in P. notoginseng. However, available genomic information regarding this plant is limited. Moreover, details of triterpene saponin biosynthesis in the Panax species are largely unknown. Results: Using the 454 pyrosequencing technology, a one-quarter GS FLX titanium run resulted in 188,185 reads with an average length of 410 bases for P. notoginseng root. These reads were processed and assembled by 454 GS De Novo Assembler software into 30,852 unique sequences. A total of 70.2% of unique sequences were annotated by Basic Local Alignment Search Tool (BLAST) similarity searches against public sequence databases. The Kyoto Encyclopedia of Genes and Genomes (KEGG) assignment discovered 41 unique sequences representing 11 genes involved in triterpene saponin backbone biosynthesis in the 454-EST dataset. In particular, the transcript encoding dammarenediol synthase (DS), which is the first committed enzyme in the biosynthetic pathway of major triterpene saponins, is highly expressed in the root of four-year-old P. notoginseng. It is worth emphasizing that the candidate cytochrome P450 (Pn02132 and Pn00158) and UDP-glycosyltransferase (Pn00082) genes most likely to be involved in hydroxylation or glycosylation of aglycones for triterpene saponin biosynthesis were discovered from 174 cytochrome P450s and 242 glycosyltransferases by phylogenetic analysis, respectively. Putative transcription factors were detected in 906 unique sequences, including Myb, homeobox, WRKY, basic helix-loop-helix (bHLH), and other family proteins. Additionally, a total of 2,772 simple sequence repeats (SSRs) were identified from 2,361 unique sequences, of which di-nucleotide motifs were the most abundant. Conclusion: This study is the first to present a large-scale EST dataset for P. notoginseng root acquired by next-generation sequencing (NGS) technology. The candidate genes involved in triterpene saponin biosynthesis, including the putative CYP450s and UGTs, were obtained in this study. Additionally, the identification of SSRs provided plenty of genetic markers for molecular breeding and genetics applications in this species. These data will provide information on gene discovery, transcriptional regulation and marker-assisted selection for P. notoginseng. The dataset establishes an important foundation for the study with the purpose of ensuring adequate drug resources for this species. PMID:22369100
An Ensemble Framework Coping with Instability in the Gene Selection Process.
Castellanos-Garzón, José A; Ramos, Juan; López-Sánchez, Daniel; de Paz, Juan F; Corchado, Juan M
2018-03-01
This paper proposes an ensemble framework for gene selection, which is aimed at addressing instability problems presented in the gene filtering task. The complex process of gene selection from gene expression data faces different instability problems from the informative gene subsets found by different filter methods. This makes the identification of significant genes by the experts difficult. The instability of results can come from filter methods, gene classifier methods, different datasets of the same disease and multiple valid groups of biomarkers. Even though there is a wide number of proposals, the complexity imposed by this problem remains a challenge today. This work proposes a framework involving five stages of gene filtering to discover biomarkers for diagnosis and classification tasks. This framework performs a process of stable feature selection, facing the problems above and, thus, providing a more suitable and reliable solution for clinical and research purposes. Our proposal involves a process of multistage gene filtering, in which several ensemble strategies for gene selection were added in such a way that different classifiers simultaneously assess gene subsets to face instability. Firstly, we apply an ensemble of recent gene selection methods to obtain diversity in the genes found (stability according to filter methods). Next, we apply an ensemble of known classifiers to filter genes relevant to all classifiers at a time (stability according to classification methods). The achieved results were evaluated in two different datasets of the same disease (pancreatic ductal adenocarcinoma), in search of stability according to the disease, for which promising results were achieved.
Feature weight estimation for gene selection: a local hyperlinear learning approach
2014-01-01
Background: Modeling high-dimensional data involving thousands of variables is particularly important for gene expression profiling experiments; nevertheless, it remains a challenging task. One of the challenges is to implement an effective method for selecting a small set of relevant genes buried in high-dimensional irrelevant noise. RELIEF is a popular and widely used approach for feature selection owing to its low computational cost and high accuracy. However, RELIEF-based methods suffer from instability, especially in the presence of noisy and/or high-dimensional outliers. Results: We propose an innovative feature weighting algorithm, called LHR, to select informative genes from highly noisy data. LHR is based on RELIEF for feature weighting using classical margin maximization. The key idea of LHR is to estimate the feature weights through local approximation rather than global measurement, which is typically used in existing methods. The weights obtained by our method are very robust in terms of degradation of noisy features, even those with vast dimensions. To demonstrate the performance of our method, extensive experiments involving classification tests have been carried out on both synthetic and real microarray benchmark datasets by combining the proposed technique with standard classifiers, including the support vector machine (SVM), k-nearest neighbor (KNN), hyperplane k-nearest neighbor (HKNN), linear discriminant analysis (LDA) and naive Bayes (NB). Conclusion: Experiments on both synthetic and real-world datasets demonstrate the superior performance of the proposed feature selection method combined with supervised learning in three aspects: 1) high classification accuracy, 2) excellent robustness to noise and 3) good stability with respect to various classification algorithms. PMID:24625071
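For context, the basic RELIEF weighting scheme that LHR builds on can be sketched as below. This is the classic nearest-hit/nearest-miss update in a common absolute-difference variant, not the LHR algorithm itself; the function name and the assumption that features are scaled to [0, 1] are choices made for this illustration.

```python
def relief_weights(X, y):
    """Basic RELIEF feature weights for a two-class problem.

    X is a list of feature vectors (floats, assumed scaled to [0, 1]) and
    y is the list of class labels. Every sample is used once as the query.
    """
    n_samples, n_features = len(X), len(X[0])
    weights = [0.0] * n_features

    def distance(a, b):
        # Manhattan distance over all features
        return sum(abs(p - q) for p, q in zip(a, b))

    for i in range(n_samples):
        hits = [j for j in range(n_samples) if j != i and y[j] == y[i]]
        misses = [j for j in range(n_samples) if y[j] != y[i]]
        if not hits or not misses:
            continue  # no same-class or other-class neighbour available
        near_hit = min(hits, key=lambda j: distance(X[i], X[j]))
        near_miss = min(misses, key=lambda j: distance(X[i], X[j]))
        for f in range(n_features):
            # Reward features that separate the classes and penalize
            # features that vary within a class.
            gain = abs(X[i][f] - X[near_miss][f]) - abs(X[i][f] - X[near_hit][f])
            weights[f] += gain / n_samples
    return weights


# Hypothetical toy data: feature 0 tracks the class, feature 1 is noise
X_toy = [[0.0, 0.3], [0.1, 0.9], [0.9, 0.2], [1.0, 0.8]]
y_toy = [0, 0, 1, 1]
w_toy = relief_weights(X_toy, y_toy)
```

On the toy data the informative feature receives a positive weight and the noise feature a negative one. Because each update depends on a single nearest hit and miss, a few noisy samples can shift the weights noticeably, which is the instability the abstract refers to.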
Involving youth in program decision-making: how common and what might it do for youth?
Akiva, Thomas; Cortina, Kai S; Smith, Charles
2014-11-01
The strategy of sharing program decision-making with youth in youth programs, a specific form of youth-adult partnership, is widely recommended in practitioner literature; however, empirical study is relatively limited. We investigated the prevalence and correlates of youth program decision-making practices (e.g., asking youth to help decide what activities are offered), using single-level and multilevel methods with a cross-sectional dataset of 979 youth attending 63 multipurpose after-school programs (average age of youth = 11.4, 53 % female). The prevalence of such practices was relatively high, particularly for forms that involved low power sharing such as involving youth in selecting the activities a program offers. Hierarchical linear modeling revealed positive associations between youth program decision-making practices and youth motivation to attend programs. We also found positive correlations between decision-making practices and youth problem-solving efficacy, expression efficacy, and empathy. Significant interactions with age suggest that correlations with problem solving and empathy are more pronounced for older youth. Overall, the findings suggest that involving youth in program decision-making is a promising strategy for promoting youth motivation and skill building, and in some cases this is particularly the case for older (high school-age) youth.
Won, Kyeong-Hye; Song, Ki-Duk; Park, Jong-Eun; Kim, Duk-Kyung; Na, Chong-Sam
2016-01-01
Anethole and garlic have immune modulatory effects on avian coccidiosis, and these effects are correlated with gene expression changes in intestinal epithelial lymphocytes (IELs). In this study, we integrated gene expression datasets from two independent experiments, investigated the gene expression profile changes induced by anethole and garlic, respectively, and identified gene expression signatures that are common targets of these herbs, as they might be used to evaluate the effect of plant herbs on immunity toward avian coccidiosis. We identified 4,382 and 371 genes that were differentially expressed in IELs of chickens supplemented with garlic and anethole, respectively. The gene ontology (GO) terms of differentially expressed genes (DEGs) from garlic treatment resulted in the biological processes (BPs) related to proteolysis, e.g., “modification-dependent protein catabolic process”, “proteolysis involved in cellular protein catabolic process”, “cellular protein catabolic process”, “protein catabolic process”, and “ubiquitin-dependent protein catabolic process”. In GO analysis, one BP term, “Proteolysis”, was obtained. Among DEGs, 300 genes were differentially regulated in response to both garlic and anethole, and 234 and 59 genes were either up- or down-regulated in supplementation with both herbs. Pathway analysis resulted in enrichment of the pathways related to digestion such as “Starch and sucrose metabolism” and “Insulin signaling pathway”. Taken together, the results obtained in the present study could contribute to the effective development of an evaluation system for plant herbs based on molecular signatures related to their immunological functions in chicken IELs. PMID:26954117
The multiple imputation method: a case study involving secondary data analysis.
Walani, Salimah R; Cleland, Charles M
2015-05-01
To illustrate with the example of a secondary data analysis study the use of the multiple imputation method to replace missing data. Most large public datasets have missing data, which need to be handled by researchers conducting secondary data analysis studies. Multiple imputation is a technique widely used to replace missing values while preserving the sample size and sampling variability of the data. The 2004 National Sample Survey of Registered Nurses. The authors created a model to impute missing values using the chained equation method. They used imputation diagnostics procedures and conducted regression analysis of imputed data to determine the differences between the log hourly wages of internationally educated and US-educated registered nurses. The authors used multiple imputation procedures to replace missing values in a large dataset with 29,059 observations. Five multiple imputed datasets were created. Imputation diagnostics using time series and density plots showed that imputation was successful. The authors also present an example of the use of multiple imputed datasets to conduct regression analysis to answer a substantive research question. Multiple imputation is a powerful technique for imputing missing values in large datasets while preserving the sample size and variance of the data. Even though the chained equation method involves complex statistical computations, recent innovations in software and computation have made it possible for researchers to conduct this technique on large datasets. The authors recommend nurse researchers use multiple imputation methods for handling missing data to improve the statistical power and external validity of their studies.
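Once an analysis has been run separately on each imputed dataset, the per-dataset estimates are usually combined with Rubin's rules. The sketch below shows that pooling step for a single regression coefficient; the function name and the five example values are hypothetical and do not come from the study.

```python
import math


def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed datasets using Rubin's rules.

    estimates: the point estimate from each imputed dataset
    variances: the squared standard error from each imputed dataset
    Returns (pooled estimate, total variance, pooled standard error).
    """
    m = len(estimates)
    q_bar = sum(estimates) / m                      # pooled point estimate
    u_bar = sum(variances) / m                      # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    total = u_bar + (1 + 1 / m) * b                 # total variance
    return q_bar, total, math.sqrt(total)


# Hypothetical coefficients and variances from five imputed datasets
pooled, total_var, pooled_se = pool_rubin(
    [0.10, 0.12, 0.11, 0.09, 0.13],
    [0.0004, 0.0004, 0.0004, 0.0004, 0.0004],
)
```

The between-imputation term is what carries the extra uncertainty due to the missing data; analysing a single imputed dataset would omit it and understate the standard error.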
Identifying key genes in glaucoma based on a benchmarked dataset and the gene regulatory network.
Chen, Xi; Wang, Qiao-Ling; Zhang, Meng-Hui
2017-10-01
The current study aimed to identify key genes in glaucoma based on a benchmarked dataset and gene regulatory network (GRN). Local and global noise was added to the gene expression dataset to produce a benchmarked dataset. Differentially-expressed genes (DEGs) between patients with glaucoma and normal controls were identified utilizing the Linear Models for Microarray Data (Limma) package based on the benchmarked dataset. A total of 5 GRN inference methods, including Zscore, GeneNet, context likelihood of relatedness (CLR) algorithm, Partial Correlation coefficient with Information Theory (PCIT) and GEne Network Inference with Ensemble of Trees (Genie3) were evaluated using receiver operating characteristic (ROC) and precision and recall (PR) curves. The inference method with the best performance was selected to construct the GRN. Subsequently, topological centrality analysis (degree, closeness and betweenness) was conducted to identify key genes in the GRN of glaucoma. Finally, the key genes were validated by performing reverse transcription-quantitative polymerase chain reaction (RT-qPCR). A total of 176 DEGs were detected from the benchmarked dataset. The ROC and PR curves of the 5 methods were analyzed and it was determined that Genie3 had a clear advantage over the other methods; thus, Genie3 was used to construct the GRN. Following topological centrality analysis, 14 key genes for glaucoma were identified, including IL6, EPHA2 and GSTT1, and 5 of these 14 key genes were validated by RT-qPCR. Therefore, the current study identified 14 key genes in glaucoma, which may be potential biomarkers to use in the diagnosis of glaucoma and aid in identifying the molecular mechanism of this disease.
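Two of the topological centrality measures named in this abstract, degree and closeness, can be computed on an undirected network as sketched below. The edge list uses placeholder node names rather than real genes, the function names are hypothetical, and betweenness is omitted for brevity; the study's own implementation is not described in the abstract.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from a list of (a, b) edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly linked to."""
    n = len(graph)
    return {node: len(nbrs) / (n - 1) for node, nbrs in graph.items()}


def closeness_centrality(graph):
    """Reachable-node count divided by the summed shortest-path distances.

    For a connected graph this equals (n - 1) / sum of distances.
    """
    result = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from the source node
            node = queue.popleft()
            for nbr in graph[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total else 0.0
    return result


# Placeholder star-shaped network; g1 is the hub
toy_graph = build_graph([("g1", "g2"), ("g1", "g3"), ("g1", "g4")])
toy_degree = degree_centrality(toy_graph)
toy_closeness = closeness_centrality(toy_graph)
```

Ranking nodes by such scores and keeping the top entries is one simple way to shortlist candidate key genes from an inferred network.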
Integrative Exploratory Analysis of Two or More Genomic Datasets.
Meng, Chen; Culhane, Aedin
2016-01-01
Exploratory analysis is an essential step in the analysis of high throughput data. Multivariate approaches such as correspondence analysis (CA), principal component analysis, and multidimensional scaling are widely used in the exploratory analysis of a single dataset. Modern biological studies often assay multiple types of biological molecules (e.g., mRNA, protein, phosphoproteins) on the same set of biological samples, thereby creating multiple different types of omics data or multiassay data. Integrative exploratory analysis of these multiple omics data is required to leverage the potential of multiple omics studies. In this chapter, we describe the application of co-inertia analysis (CIA; for analyzing two datasets) and multiple co-inertia analysis (MCIA; for three or more datasets) to address this problem. These methods are powerful yet simple multivariate approaches that represent samples using a lower number of variables, allowing easier identification of the correlated structure in and between multiple high dimensional datasets. Graphical representations can be employed for this purpose. In addition, the methods simultaneously project samples and variables (genes, proteins) onto the same lower dimensional space, so the most variant variables from each dataset can be selected and associated with samples, which can be further used to facilitate biological interpretation and pathway analysis. We applied CIA to explore the concordance between mRNA and protein expression in a panel of 60 tumor cell lines from the National Cancer Institute. In the same 60 cell lines, we used MCIA to perform a cross-platform comparison of mRNA gene expression profiles obtained on four different microarray platforms. Last, as an example of integrative analysis of multiassay or multi-omics data we analyzed transcriptomic, proteomic, and phosphoproteomic data from pluripotent (iPS) and embryonic stem (ES) cell lines.
Pairwise gene GO-based measures for biclustering of high-dimensional expression data.
Nepomuceno, Juan A; Troncoso, Alicia; Nepomuceno-Chamorro, Isabel A; Aguilar-Ruiz, Jesús S
2018-01-01
Biclustering algorithms search for groups of genes that share the same behavior under a subset of samples in gene expression data. Nowadays, the biological knowledge available in public repositories can be used to drive these algorithms to find biclusters composed of functionally coherent groups of genes. On the other hand, a distance among genes can be defined according to their information stored in Gene Ontology (GO). Gene pairwise GO semantic similarity measures report a value for each pair of genes which establishes their functional similarity. A scatter search-based algorithm that optimizes a merit function that integrates GO information is studied in this paper. This merit function uses a term that addresses the information through a GO measure. The effect of two possible different gene pairwise GO measures on the performance of the algorithm is analyzed. Firstly, three well known yeast datasets with approximately one thousand genes are studied. Secondly, a group of human datasets related to clinical data of cancer is also explored by the algorithm. Most of these data are high-dimensional datasets composed of a huge number of genes. The resultant biclusters reveal groups of genes linked by the same functionality when the search procedure is driven by one of the proposed GO measures. Furthermore, a qualitative biological study of a group of biclusters shows their relevance from a cancer disease perspective. It can be concluded that the integration of biological information improves the performance of the biclustering process. The two different GO measures studied show an improvement in the results obtained for the yeast dataset. However, if datasets are composed of a huge number of genes, only one of them really improves the algorithm performance. This second case constitutes a clear option to explore interesting datasets from a clinical point of view.
Varet, Hugo; Brillet-Guéguen, Loraine; Coppée, Jean-Yves; Dillies, Marie-Agnès
2016-01-01
Several R packages exist for the detection of differentially expressed genes from RNA-Seq data. The analysis process includes three main steps, namely normalization, dispersion estimation and test for differential expression. Quality control steps along this process are recommended but not mandatory, and failing to check the characteristics of the dataset may lead to spurious results. In addition, normalization methods and statistical models are not exchangeable across the packages without adequate transformations the users are often not aware of. Thus, dedicated analysis pipelines are needed to include systematic quality control steps and prevent errors from misusing the proposed methods. SARTools is an R pipeline for differential analysis of RNA-Seq count data. It can handle designs involving two or more conditions of a single biological factor with or without a blocking factor (such as a batch effect or a sample pairing). It is based on DESeq2 and edgeR and is composed of an R package and two R script templates (for DESeq2 and edgeR respectively). Tuning a small number of parameters and executing one of the R scripts, users have access to the full results of the analysis, including lists of differentially expressed genes and a HTML report that (i) displays diagnostic plots for quality control and model hypotheses checking and (ii) keeps track of the whole analysis process, parameter values and versions of the R packages used. SARTools provides systematic quality controls of the dataset as well as diagnostic plots that help to tune the model parameters. It gives access to the main parameters of DESeq2 and edgeR and prevents untrained users from misusing some functionalities of both packages. By keeping track of all the parameters of the analysis process it fits the requirements of reproducible research.
Far infrared promotes wound healing through activation of Notch1 signaling.
Hsu, Yung-Ho; Lin, Yuan-Feng; Chen, Cheng-Hsien; Chiu, Yu-Jhe; Chiu, Hui-Wen
2017-11-01
The Notch signaling pathway is critically involved in cell proliferation, differentiation, development, and homeostasis. Far infrared (FIR) has an effect that promotes wound healing. However, the underlying molecular mechanisms are unclear. In the present study, we employed in vivo and HaCaT (a human skin keratinocyte cell line) models to elucidate the role of Notch1 signaling in FIR-promoted wound healing. We found that FIR enhanced keratinocyte migration and proliferation. FIR induced the Notch1 signaling pathway in HaCaT cells and in a microarray dataset from the Gene Expression Omnibus database. We next determined the mRNA levels of NOTCH1 in paired normal and wound skin tissues derived from clinical patients using the microarray dataset and Ingenuity Pathway Analysis software. The result indicated that the Notch1/Twist1 axis plays important roles in wound healing and tissue repair. In addition, inhibiting Notch1 signaling decreased the FIR-enhanced proliferation and migration. In a full-thickness wound model in rats, the wounds healed more rapidly and the scar size was smaller in the FIR group than in the light group. Moreover, FIR could increase Notch1 and Delta1 in skin tissues. The activation of Notch1 signaling may be considered as a possible mechanism for the promoting effect of FIR on wound healing. FIR stimulates keratinocyte migration and proliferation. Notch1 in keratinocytes has an essential role in FIR-induced migration and proliferation. NOTCH1 promotes TWIST1-mediated gene expression to assist wound healing. FIR might promote skin wound healing in a rat model.
Wu, Zhaohua; Feng, Jiaxin; Qiao, Fangli; Tan, Zhe-Min
2016-04-13
In this big data era, it is more urgent than ever to solve two major issues: (i) fast data transmission methods that can facilitate access to data from non-local sources and (ii) fast and efficient data analysis methods that can reveal the key information from the available data for particular purposes. Although approaches in different fields to address these two questions may differ significantly, the common part must involve data compression techniques and a fast algorithm. This paper introduces the recently developed adaptive and spatio-temporally local analysis method, namely the fast multidimensional ensemble empirical mode decomposition (MEEMD), for the analysis of a large spatio-temporal dataset. The original MEEMD uses ensemble empirical mode decomposition to decompose time series at each spatial grid and then pieces together the temporal-spatial evolution of climate variability and change on naturally separated timescales, which is computationally expensive. By taking advantage of the high efficiency of the expression using principal component analysis/empirical orthogonal function analysis for spatio-temporally coherent data, we design a lossy compression method for climate data to facilitate its non-local transmission. We also explain the basic principles behind the fast MEEMD, which decomposes principal components instead of the original grid-wise time series to speed up the computation of MEEMD. Using a typical climate dataset as an example, we demonstrate that our newly designed methods can (i) compress data with a compression rate of one to two orders of magnitude; and (ii) speed up the MEEMD algorithm by one to two orders of magnitude.
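The lossy compression idea described here, keeping only the leading principal components and empirical orthogonal functions of a spatio-temporally coherent field, can be sketched as follows. This is a minimal illustration with NumPy, not the authors' implementation: the function names, the time-by-grid-point matrix layout, and the storage accounting in `compression_ratio` are assumptions, and the ensemble empirical mode decomposition step itself is not shown.

```python
import numpy as np


def pca_compress(field, k):
    """Keep the k leading PCA/EOF modes of a (time x grid point) matrix.

    Returns the principal-component time series, the spatial patterns
    (EOFs) and the time mean at each grid point.
    """
    mean = field.mean(axis=0)
    # Thin SVD of the anomalies; rows of vt are the spatial patterns.
    u, s, vt = np.linalg.svd(field - mean, full_matrices=False)
    pcs = u[:, :k] * s[:k]   # shape (time, k)
    eofs = vt[:k]            # shape (k, grid points)
    return pcs, eofs, mean


def pca_reconstruct(pcs, eofs, mean):
    """Approximate the original field from the stored modes."""
    return pcs @ eofs + mean


def compression_ratio(shape, k):
    """Values in the full field divided by values stored after compression."""
    t, n = shape
    return (t * n) / (k * (t + n) + n)
```

In a scheme like this, a time-series decomposition would only need to run on the `k` columns of `pcs` rather than on every grid point, which is where the speed-up described in the abstract would come from. How many modes to keep depends on how coherent the data are and is not specified here.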
Zhang, Shuwei; Ding, Feng; He, Xinhua; Luo, Cong; Huang, Guixiang; Hu, Ying
2015-02-01
Seedlessness is a desirable character in lemons and other citrus species. Seedless fruit can be induced in many ways, including through self-incompatibility (SI). SI is widely used as an intraspecific reproductive barrier that prevents self-fertilization in flowering plants. Although there have been many studies on SI, its mechanism remains unclear. The 'Xiangshui' lemon is an important seedless cultivar whose seedlessness has been caused by SI. It is essential to identify genes involved in SI in 'Xiangshui' lemon to clarify its molecular mechanism. In this study, candidate genes associated with SI were identified using high-throughput Illumina RNA sequencing (RNA-seq). A total of 61,224 unigenes were obtained (average, 948 bp; N50 of 1,457 bp), among which 47,260 unigenes were annotated by comparison to six public databases (Nr, Nt, Swiss-Prot, KEGG, COG, and GO). Differentially expressed genes were identified by comparing the transcriptomes of no-, self-, and cross-pollinated stigmas with styles of the 'Xiangshui' lemon. Several differentially expressed genes that might be associated with SI were identified, such as those involved in pollen tube growth, programmed cell death, signal transduction, and transcription. NADPH oxidase genes associated with apoptosis were highly upregulated in the self-pollinated transcriptome. The expression pattern of 12 genes was analyzed by quantitative real-time polymerase chain reaction. A putative S-RNase gene was identified that had not been previously associated with self-pollen rejection in lemon or citrus. This study provided a transcriptome dataset for further studies of SI and seedless lemon breeding.
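The N50 reported for the unigene assembly is a length-weighted median: the length L such that sequences of length L or longer contain at least half of all assembled bases. A short sketch of that calculation follows; the function name and the toy lengths are illustrative and are not taken from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Toy example: 32 bases in total, and the two longest sequences cover 16 of them.
toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
toy_n50 = n50(toy_lengths)                      # 8
toy_mean = sum(toy_lengths) / len(toy_lengths)  # 4.0
```

An N50 above the mean length, as in the figures quoted in the abstract, is the usual pattern when an assembly has many short sequences and a smaller number of long ones.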
Skeie, Jessica M; Aldrich, Benjamin T; Goldstein, Andrew S; Schmidt, Gregory A; Reed, Cynthia R; Greiner, Mark A
2018-01-01
The objective of this study was to characterize the proteome of the corneal endothelial cell layer and its basement membrane (Descemet membrane) in humans with various severities of type II diabetes mellitus compared to controls, and identify differentially expressed proteins across a range of diabetic disease severities that may influence corneal endothelial cell health. Endothelium-Descemet membrane complex tissues were peeled from transplant suitable donor corneas. Protein fractions were isolated from each sample and subjected to multidimensional liquid chromatography and tandem mass spectrometry. Peptide spectra were matched to the human proteome, assigned gene ontology, and grouped into protein signaling pathways unique to each of the disease states. We identified an average of 12,472 unique proteins in each of the endothelium-Descemet membrane complex tissue samples. There were 2,409 differentially expressed protein isoforms that included previously known risk factors for type II diabetes mellitus related to metabolic processes, oxidative stress, and inflammation. Gene ontology analysis demonstrated that diabetes progression has many protein footprints related to metabolic processes, binding, and catalysis. The most represented pathways involved in diabetes progression included mitochondrial dysfunction, cell-cell junction structure, and protein synthesis regulation. This proteomic dataset identifies novel corneal endothelial cell and Descemet membrane protein expression in various stages of diabetic disease. These findings give insight into the mechanisms involved in diabetes progression relevant to the corneal endothelium and its basement membrane, prioritize new pathways for therapeutic targeting, and provide insight into potential biomarkers for determining the health of this tissue.
Giambartolomei, Claudia; Vukcevic, Damjan; Schadt, Eric E; Franke, Lude; Hingorani, Aroon D; Wallace, Chris; Plagnol, Vincent
2014-05-01
Genetic association studies, in particular the genome-wide association study (GWAS) design, have provided a wealth of novel insights into the aetiology of a wide range of human diseases and traits, in particular cardiovascular diseases and lipid biomarkers. The next challenge consists of understanding the molecular basis of these associations. The integration of multiple association datasets, including gene expression datasets, can contribute to this goal. We have developed a novel statistical methodology to assess whether two association signals are consistent with a shared causal variant. An application is the integration of disease scans with expression quantitative trait locus (eQTL) studies, but any pair of GWAS datasets can be integrated in this framework. We demonstrate the value of the approach by re-analysing a gene expression dataset in 966 liver samples with a published meta-analysis of lipid traits including >100,000 individuals of European ancestry. Combining all lipid biomarkers, our re-analysis supported 26 out of 38 reported colocalisation results with eQTLs and identified 14 new colocalisation results, hence highlighting the value of a formal statistical test. In three cases of reported eQTL-lipid pairs (SYPL2, IFT172, TBKBP1) for which our analysis suggests that the eQTL pattern is not consistent with the lipid association, we identify alternative colocalisation results with SORT1, GCKR, and KPNB1, indicating that these genes are more likely to be causal in these genomic intervals. A key feature of the method is the ability to derive the output statistics from single SNP summary statistics, hence making it possible to perform systematic meta-analysis type comparisons across multiple GWAS datasets (implemented online at http://coloc.cs.ucl.ac.uk/coloc/). 
Our methodology provides information about candidate causal genes in associated intervals and has direct implications for the understanding of complex diseases as well as the design of drugs to target disease pathways.
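How posterior support for a shared causal variant can be derived from single-SNP summary statistics can be sketched as below. This is a minimal illustration, assuming Wakefield-style approximate Bayes factors, at most one causal variant per trait in the region, and five hypotheses (H0 no association, H1 and H2 association with one trait only, H3 two distinct causal variants, H4 one shared causal variant). The prior probabilities `p1`, `p2`, `p12` and the prior variance `w` are assumptions to be set by the user, and the function names are hypothetical.

```python
import math


def log_abf(z, v, w):
    """Log approximate Bayes factor for one SNP.

    z: z-score, v: variance of the effect estimate, w: prior variance.
    """
    r = w / (v + w)
    return 0.5 * (math.log(1.0 - r) + r * z * z)


def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def logdiffexp(a, b):
    """log(exp(a) - exp(b)), which requires a > b."""
    return a + math.log1p(-math.exp(b - a))


def coloc_posteriors(l1, l2, p1=1e-4, p2=1e-4, p12=1e-5):
    """Posterior probabilities of H0..H4 from per-SNP log Bayes factors.

    l1, l2: log ABFs for trait 1 and trait 2 over the same SNPs (two or more).
    """
    both = [a + b for a, b in zip(l1, l2)]
    s1, s2, s12 = logsumexp(l1), logsumexp(l2), logsumexp(both)
    log_h = [
        0.0,                                                      # H0
        math.log(p1) + s1,                                        # H1
        math.log(p2) + s2,                                        # H2
        math.log(p1) + math.log(p2) + logdiffexp(s1 + s2, s12),   # H3
        math.log(p12) + s12,                                      # H4
    ]
    total = logsumexp(log_h)
    return [math.exp(x - total) for x in log_h]
```

As a usage illustration, if both traits have their only strong signal at the same SNP (for instance z-scores `[6, 0, 0]` for each trait), nearly all posterior mass falls on H4; if the strong signals sit at different SNPs (`[6, 0, 0]` versus `[0, 6, 0]`), the mass shifts to H3.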
Hill, Katherine E; Kelly, Andrew D; Kuijjer, Marieke L; Barry, William; Rattani, Ahmed; Garbutt, Cassandra C; Kissick, Haydn; Janeway, Katherine; Perez-Atayde, Antonio; Goldsmith, Jeffrey; Gebhardt, Mark C; Arredouani, Mohamed S; Cote, Greg; Hornicek, Francis; Choy, Edwin; Duan, Zhenfeng; Quackenbush, John; Haibe-Kains, Benjamin; Spentzos, Dimitrios
2017-05-15
A microRNA (miRNA) collection on the imprinted 14q32 MEG3 region has been associated with outcome in osteosarcoma. We assessed the clinical utility of this miRNA set and their association with methylation status. We integrated coding and non-coding RNA data from three independent annotated clinical osteosarcoma cohorts (n = 65, n = 27, and n = 25) and miRNA and methylation data from one in vitro (19 cell lines) and one clinical (NCI Therapeutically Applicable Research to Generate Effective Treatments (TARGET) osteosarcoma dataset, n = 80) dataset. We used time-dependent receiver operating characteristic (tdROC) analysis to evaluate the clinical value of candidate miRNA profiles and machine learning approaches to compare the coding and non-coding transcriptional programs of high- and low-risk osteosarcoma tumors and high- versus low-aggressiveness cell lines. In the cell line and TARGET datasets, we also studied the methylation patterns of the MEG3 imprinting control region on 14q32 and their association with miRNA expression and tumor aggressiveness. In the tdROC analysis, miRNA sets on 14q32 showed strong discriminatory power for recurrence and survival in the three clinical datasets. High- or low-risk tumor classification was robust to using different microRNA sets or classification methods. Machine learning approaches showed that genome-wide miRNA profiles and miRNA regulatory networks were quite different between the two outcome groups and mRNA profiles categorized the samples in a manner concordant with the miRNAs, suggesting potential molecular subtypes. Further, miRNA expression patterns were reproducible in comparing high-aggressiveness versus low-aggressiveness cell lines. Methylation patterns in the MEG3 differentially methylated region (DMR) also distinguished high-aggressiveness from low-aggressiveness cell lines and were associated with expression of several 14q32 miRNAs in both the cell lines and the large TARGET clinical dataset. 
Within the limits of available CpG array coverage, we observed a potential methylation-sensitive regulation of the non-coding RNA cluster by CTCF, a known enhancer-blocking factor. Loss of imprinting/methylation changes in the 14q32 non-coding region defines reproducible previously unrecognized osteosarcoma subtypes with distinct transcriptional programs and biologic and clinical behavior. Future studies will define the precise relationship between 14q32 imprinting, non-coding RNA expression, genomic enhancer binding, and tumor aggressiveness, with possible therapeutic implications for both early- and advanced-stage patients.
Zhou, Qingyuan; Jia, Junting; Huang, Xing; Yan, Xueqing; Cheng, Liqin; Chen, Shuangyan; Li, Xiaoxia; Peng, Xianjun; Liu, Gongshe
2014-05-26
Many Poaceae species show a gametophytic self-incompatibility (GSI) system, which is controlled by at least two independent and multiallelic loci, S and Z. Until now, the gene products for S and Z have been unknown. Stigmas of SI grasses discriminate between the pollen grains that land on their surface: they support compatible pollen tube growth and penetration into the stigma, while recognizing incompatible pollen and inhibiting its pollination. Leymus chinensis (Trin.) Tzvel. (sheepgrass) is a Poaceae SI species. A comprehensive analysis of the sheepgrass stigma transcriptome may provide valuable information for understanding the mechanism of pollen-stigma interactions and grass SI. The transcript abundance profiles of mature stigmas, mature ovaries and leaves were examined using high-throughput next generation sequencing technology. A comparative transcriptomic analysis of these tissues identified 1,025 specifically or preferentially expressed genes in sheepgrass stigmas. A significant proportion of these genes were predicted to function in cell-cell communication and signal transduction. We identified 111 putative transcription factor (TF) genes, and the most abundant groups were MYB, C2H2, C3H, FAR1 and MADS. Comparative analysis of the sheepgrass, rice and Arabidopsis stigma-specific or preferential datasets showed broad similarities and some differences in the proportion of genes in the Gene Ontology (GO) functional categories. Potential SI candidate genes identified in other grasses were also detected in the sheepgrass stigma-specific or preferential dataset. Quantitative real-time PCR experiments validated the expression pattern of stigma preferential genes including homologous grass SI candidate genes. This study represents the first large-scale investigation of gene expression in the stigmas of an SI grass species. 
We uncovered many notable genes that are potentially involved in pollen-stigma interactions and SI mechanisms, including genes encoding receptor-like protein kinases (RLK), CBL (calcineurin B-like proteins) interacting protein kinases, calcium-dependent protein kinase, expansins, pectinesterase, peroxidases and various transcription factors. The availability of a pool of stigma-specific or preferential genes for L. chinensis offers an opportunity to elucidate the mechanisms of SI in Poaceae.
Zaas, Aimee K.; Chen, Minhua; Varkey, Jay; Veldman, Timothy; Hero, Alfred O.; Lucas, Joseph; Huang, Yongsheng; Turner, Ronald; Gilbert, Anthony; Lambkin-Williams, Robert; Øien, N. Christine; Nicholson, Bradly; Kingsmore, Stephen; Carin, Lawrence; Woods, Christopher W.; Ginsburg, Geoffrey S.
2010-01-01
Summary Acute respiratory infections (ARI) are a common reason for seeking medical attention and the threat of pandemic influenza will likely add to these numbers. Using human viral challenge studies with live rhinovirus, respiratory syncytial virus, and influenza A, we developed peripheral blood gene expression signatures that distinguish individuals with symptomatic ARI from uninfected individuals with > 95% accuracy. We validated this “acute respiratory viral” signature - encompassing genes with a known role in host defense against viral infections - across each viral challenge. We also validated the signature in an independently acquired dataset for influenza A and classified infected individuals from healthy controls with 100% accuracy. In the same dataset, we could also distinguish viral from bacterial ARIs (93% accuracy). These results demonstrate that ARIs induce changes in human peripheral blood gene expression that can be used to diagnose a viral etiology of respiratory infection and triage symptomatic individuals. PMID:19664979
A prior-based integrative framework for functional transcriptional regulatory network inference
Siahpirani, Alireza F.
2017-01-01
Abstract Transcriptional regulatory networks specify regulatory proteins controlling the context-specific expression levels of genes. Inference of genome-wide regulatory networks is central to understanding gene regulation, but remains an open challenge. Expression-based network inference is among the most popular methods to infer regulatory networks; however, networks inferred from such methods have low overlap with experimentally derived (e.g. ChIP-chip and transcription factor (TF) knockouts) networks. Currently we have a limited understanding of this discrepancy. To address this gap, we first develop a regulatory network inference algorithm, based on probabilistic graphical models, to integrate expression with auxiliary datasets supporting a regulatory edge. Second, we comprehensively analyze our and other state-of-the-art methods on different expression perturbation datasets. Networks inferred by integrating sequence-specific motifs with expression have substantially greater agreement with experimentally derived networks, while remaining more predictive of expression than motif-based networks. Our analysis suggests natural genetic variation as the most informative perturbation for network inference and identifies core TFs whose targets are predictable from expression. Multiple reasons make the identification of targets of other TFs difficult, including network architecture and insufficient variation of TF mRNA level. Finally, we demonstrate the utility of our inference algorithm to infer stress-specific regulatory networks and for regulator prioritization. PMID:27794550
Comparative transcriptome analysis of microsclerotia development in Nomuraea rileyi
2013-01-01
Background Nomuraea rileyi is used as an environmentally friendly biopesticide. However, mass production and commercialization of this organism are limited due to its fastidious growth and sporulation requirements. We found that N. rileyi, when cultured in amended medium, could produce microsclerotia bodies, replacing conidiophores as the infectious agent. However, little is known about the genes involved in microsclerotia development. In the present study, the transcriptomes were analyzed using next-generation sequencing technology to find the genes involved in microsclerotia development. Results A total of 4.69 Gb of clean nucleotides comprising 32,061 sequences was obtained, and 20,919 sequences were annotated (about 65%). Among the annotated sequences, only 5928 were annotated with 34 gene ontology (GO) functional categories, and 12,778 sequences were mapped to 165 pathways by searching against the Kyoto Encyclopedia of Genes and Genomes pathway (KEGG) database. Furthermore, we assessed the transcriptomic differences between cultures grown in minimal and amended medium. In total, 4808 sequences were found to be differentially expressed; 719 differentially expressed unigenes were assigned to 25 GO classes and 1888 differentially expressed unigenes were assigned to 161 KEGG pathways, including 25 enriched pathways. Subsequently, we examined the genes that were up-regulated or uniquely expressed following amended medium treatment and that also appeared in the enriched pathways, and found that most of them participated in mediating oxidative stress homeostasis. To elucidate the role of oxidative stress in microsclerotia development, we analyzed the diversification of unigenes using quantitative reverse transcription-PCR (RT-qPCR). Conclusion Our findings suggest that oxidative stress occurs during microsclerotia development, along with a broad metabolic activity change. Our data provide the most comprehensive sequence resource available for the study of N. rileyi. 
We believe that the transcriptome datasets will serve as an important public information platform to accelerate studies on N. rileyi microsclerotia. PMID:23777366
To discover novel PPI signaling hubs for lung cancer, CTD2 Center at Emory utilized large-scale genomics datasets and literature to compile a set of lung cancer-associated genes. A library of expression vectors were generated for these genes and utilized for detecting pairwise PPIs with cell lysate-based TR-FRET assays in high-throughput screening format.
Yang, Jun; Hou, Ziming; Wang, Changjiang; Wang, Hao; Zhang, Hongbing
2018-04-23
Adamantinomatous craniopharyngioma (ACP) is an aggressive brain tumor that occurs predominantly in the pediatric population. Conventional diagnostic methods and standard therapy cannot manage ACP effectively. In this paper, we aimed to identify key genes for ACP early diagnosis and treatment. Datasets GSE94349 and GSE68015 were obtained from the Gene Expression Omnibus database. Consensus clustering was applied to discover the gene clusters in the expression data of GSE94349, and functional enrichment analysis was performed on the gene set in each cluster. The protein-protein interaction (PPI) network was built with the Search Tool for the Retrieval of Interacting Genes, and hubs were selected. A support vector machine (SVM) model was built based on the signature genes identified from the enrichment analysis and PPI network. Dataset GSE94349 was used for training and testing, and GSE68015 was used for validation. In addition, RT-qPCR analysis was performed to analyze the expression of signature genes in ACP samples compared with normal controls. Seven gene clusters were discovered in the differentially expressed genes identified from the GSE94349 dataset. Enrichment analysis of each cluster identified 25 pathways that were highly associated with ACP. The PPI network was built and 46 hubs were determined. Twenty-five pathway-related genes that overlapped with the hubs in the PPI network were used as signatures to establish the SVM diagnosis model for ACP. The prediction accuracies of the SVM model for the training, testing, and validation data were 94%, 85%, and 74%, respectively. The expression of CDH1, CCL2, ITGA2, COL8A1, COL6A2, and COL6A3 was significantly upregulated in ACP tumor samples, while that of CAMK2A, RIMS1, NEFL, SYT1, and STX1A was significantly downregulated, consistent with the differentially expressed gene analysis. The SVM model is a promising classification tool for screening and early diagnosis of ACP. 
The ACP-related pathways and signature genes will advance our knowledge of ACP pathogenesis and help improve therapy.
Chen, Fasheng; Chen, Chen; Qu, Yangang; Xiang, Hua; Ai, Qingxiu; Yang, Fei; Tan, Xueping; Zhou, Yi; Jiang, Guang; Zhang, Zixiong
2016-01-01
Abstract Background: Selenium-binding protein 1 (SELENBP1) expression is markedly reduced in many types of cancer, and low SELENBP1 expression levels are associated with poor patient prognosis. Methods: SELENBP1 gene expression in head and neck squamous cell carcinoma (HNSCC) was analyzed using a GEO dataset, and the characteristics of SELENBP1 expression in paraffin-embedded tissue were summarized. Expression of SELENBP1 in nasopharyngeal carcinoma (NPC), laryngeal cancer, oral cancer, tonsil cancer, hypopharyngeal cancer and normal tissues was detected using immunohistochemistry. Finally, 99 NPC patients were followed up for more than 5 years to analyze the prognostic significance of SELENBP1. Results: Analysis of the GEO dataset showed that SELENBP1 gene expression in HNSCC was lower than in normal tissue (P < 0.01), but did not differ significantly across T-stages or N-stages (P > 0.05). Analysis of pathological sections showed that SELENBP1 expression was low in the majority of HNSCC, was lower in cancer nests than in the surrounding normal tissue, and was associated with the degree of tumor malignancy. Further study indicated that NPC patients in the low SELENBP1 expression group had poor overall survival, which differed significantly from that of the high expression group. Conclusion: SELENBP1 expression was down-regulated in HNSCC but was not associated with the T-stage or N-stage of the tumor. NPC patients with low SELENBP1 expression had poor overall survival, so SELENBP1 could be a novel biomarker for predicting prognosis. PMID:27583873
Orodu, Oyinkepreye D; Orodu, Kale B; Afolabi, Richard O; Dafe, Eboh A
2018-08-01
The dataset in this article is related to an experimental Enhanced Oil Recovery (EOR) scheme involving the use of dispersions containing Gum Arabic coated Alumina Nanoparticles (GCNPs) for Nigerian medium crude oil. The results contained in the dataset showed a 7.18% (5 wt% GCNPs), 7.81% (5 wt% GCNPs), and 5.61% (3 wt% GCNPs) improvement in oil recovery beyond the water flooding stage for core samples A, B, and C respectively. The dataset also shows that the GCNP dispersions improved recovery of the medium crude oil compared with Gum Arabic polymer flooding.
Krüger, Angela V; Jelier, Rob; Dzyubachyk, Oleh; Zimmerman, Timo; Meijering, Erik; Lehner, Ben
2015-02-15
Chromatin regulators are widely expressed proteins with diverse roles in gene expression, nuclear organization, cell cycle regulation, pluripotency, physiology and development, and are frequently mutated in human diseases such as cancer. Their inhibition often results in pleiotropic effects that are difficult to study using conventional approaches. We have developed a semi-automated nuclear tracking algorithm to quantify the divisions, movements and positions of all nuclei during the early development of Caenorhabditis elegans and have used it to systematically study the effects of inhibiting chromatin regulators. The resulting high dimensional datasets revealed that inhibition of multiple regulators, including F55A3.3 (encoding FACT subunit SUPT16H), lin-53 (RBBP4/7), rba-1 (RBBP4/7), set-16 (MLL2/3), hda-1 (HDAC1/2), swsn-7 (ARID2), and let-526 (ARID1A/1B) affected cell cycle progression and caused chromosome segregation defects. In contrast, inhibition of cir-1 (CIR1) accelerated cell division timing in specific cells of the AB lineage. The inhibition of RNA polymerase II also accelerated these division timings, suggesting that normal gene expression is required to delay cell cycle progression in multiple lineages in the early embryo. Quantitative analyses of the dataset suggested the existence of at least two functionally distinct SWI/SNF chromatin remodeling complex activities in the early embryo, and identified a redundant requirement for the egl-27 and lin-40 MTA orthologs in the development of endoderm and mesoderm lineages. Moreover, our dataset also revealed a characteristic rearrangement of chromatin to the nuclear periphery upon the inhibition of multiple general regulators of gene expression. Our systematic, comprehensive and quantitative datasets illustrate the power of single cell-resolution quantitative tracking and high dimensional phenotyping to investigate gene function. 
Furthermore, the results provide an overview of the functions of essential chromatin regulators during the early development of an animal.
McArt, Darragh G.; Dunne, Philip D.; Blayney, Jaine K.; Salto-Tellez, Manuel; Van Schaeybroeck, Sandra; Hamilton, Peter W.; Zhang, Shu-Dong
2013-01-01
The advent of next generation sequencing (NGS) technologies has expanded the area of genomic research, offering high coverage and increased sensitivity over older microarray platforms. Although the current cost of next generation sequencing still exceeds that of microarray approaches, the rapid advances in NGS will likely make it the platform of choice for future research in differential gene expression. Connectivity mapping is a procedure for examining the connections among diseases, genes and drugs through differential gene expression. It was initially based on microarray technology, with which a large collection of compound-induced reference gene expression profiles has been accumulated. In this work, we aim to test the feasibility of incorporating NGS RNA-Seq data into the current connectivity mapping framework by utilizing the microarray-based reference profiles and constructing a differentially expressed gene signature from an NGS dataset. This would allow connections to be established between the NGS gene signature and those microarray reference profiles, avoiding the cost of re-creating drug profiles with NGS technology. We examined the connectivity mapping approach on a publicly available NGS dataset with androgen stimulation of LNCaP cells in order to extract candidate compounds that could inhibit the proliferative phenotype of LNCaP cells and to elucidate their potential in a laboratory setting. In addition, we also analyzed an independent microarray dataset of similar experimental settings. We found a high level of concordance between the top compounds identified using the gene signatures from the two datasets. The nicotine derivative cotinine was returned as the top candidate among the overlapping compounds with potential to suppress this proliferative phenotype. Subsequent lab experiments validated this connectivity mapping hit, showing that cotinine inhibits cell proliferation in an androgen dependent manner. 
Thus the results in this study suggest a promising prospect of integrating NGS data with connectivity mapping. PMID:23840550
Wang, Shiqiang; Wang, Bin; Hua, Wenping; Niu, Junfeng; Dang, Kaikai; Qiang, Yi; Wang, Zhezhi
2017-09-12
Polygonatum sibiricum polysaccharides (PSPs) are used to improve immunity, alleviate dryness, promote the secretion of fluids, and quench thirst. However, the PSP biosynthetic pathway is largely unknown. Understanding the genetic background will help delineate that pathway at the molecular level so that researchers can develop better conservation strategies. After comparing the PSP contents among several different P. sibiricum germplasms, we selected two groups with the largest contrasts in contents and subjected them to HiSeq2500 transcriptome sequencing to identify the candidate genes involved in PSP biosynthesis. In all, 20 kinds of enzyme-encoding genes were related to PSP biosynthesis. The polysaccharide content was positively correlated with the expression patterns of β-fructofuranosidase (sacA), fructokinase (scrK), UDP-glucose 4-epimerase (GALE), Mannose-1-phosphate guanylyltransferase (GMPP), and UDP-glucose 6-dehydrogenase (UGDH), but negatively correlated with the expression of Hexokinase (HK). Through qRT-PCR validation and comprehensive analysis, we determined that sacA, HK, and GMPP are key genes for enzymes within the PSP metabolic pathway in P. sibiricum. Our results provide a public transcriptome dataset for this species and an outline of pathways for the production of polysaccharides in medicinal plants. They also present more information about the PSP biosynthesis pathway at the molecular level in P. sibiricum and lay the foundation for subsequent research of gene functions.
Identification of BAG3 target proteins in anaplastic thyroid cancer cells by proteomic analysis.
Galdiero, Francesca; Bello, Anna Maria; Spina, Anna; Capiluongo, Anna; Liuu, Sophie; De Marco, Margot; Rosati, Alessandra; Capunzo, Mario; Napolitano, Maria; Vuttariello, Emilia; Monaco, Mario; Califano, Daniela; Turco, Maria Caterina; Chiappetta, Gennaro; Vinh, Joëlle; Chiappetta, Giovanni
2018-01-30
BAG3 protein is an apoptosis inhibitor and is highly expressed in Anaplastic Thyroid Cancer. We investigated the entire set of proteins modulated by BAG3 silencing in the human anaplastic thyroid 8505C cancer cells by using the Stable-Isotope Labeling by Amino acids in Cell culture strategy combined with mass spectrometry analysis. By this approach we identified 37 up-regulated and 54 down-regulated proteins in BAG3-silenced cells. Many of these proteins are reportedly involved in tumor progression, invasiveness and resistance to therapies. We focused our attention on an oncogenic protein, CAV1, and a tumor suppressor protein, SERPINB2, that had not previously been reported to be modulated by BAG3. Their expression levels in BAG3-silenced cells were confirmed by qRT-PCR and western blot analyses, disclosing two novel targets of BAG3 pro-tumor activity. We also examined the dataset of proteins obtained by the quantitative proteomics analysis using two tools, Downstream Effect Analysis and Upstream Regulator Analysis of the Ingenuity Pathways Analysis software. Our analyses confirm the association of the proteome profile observed in BAG3-silenced cells with an increase in cell survival and a decrease in cell proliferation and invasion, and highlight the possible involvement of four tumor suppressor miRNAs and TP53/63 proteins in BAG3 activity.
Severe hypertriglyceridemia in Norway: prevalence, clinical and genetic characteristics.
Retterstøl, Kjetil; Narverud, Ingunn; Selmer, Randi; Berge, Knut E; Osnes, Ingvild V; Ulven, Stine M; Halvorsen, Bente; Aukrust, Pål; Holven, Kirsten B; Iversen, Per O
2017-06-12
There is a lack of comprehensive patient datasets regarding prevalence of severe hypertriglyceridemia (sHTG; triglycerides ≥10 mmol/L), frequency of co-morbidities, gene mutations, and gene characterization in sHTG. Using large surveys combined with detailed analysis of sub-cohorts of sHTG patients, we here sought to address these issues. We used data from several large Norwegian surveys that included 681,990 subjects to estimate the prevalence. Sixty-five sHTG patients were investigated to obtain clinical profiles and candidate disease genes. We obtained peripheral blood mononuclear cells (PBMC) from six male patients and nine healthy controls and examined expression of mRNAs involved in lipid metabolism. The prevalence of sHTG was 0.13% (95% CI 0.12-0.14%), and was highest in men aged 40-49 years and in women aged 60-69 years. Among the 65 sHTG patients, a possible genetic cause was found in four, and 11 had experienced acute pancreatitis. The mRNA expression levels of carnitine palmitoyltransferase (CPT)-1A, CPT2, and hormone-sensitive lipase were significantly higher in patients compared to controls, whereas those of ATP-binding cassette, sub-family G, member 1 were significantly lower. In Norway, sHTG is present in 0.1%, carries considerable co-morbidity and is associated with an imbalance of genes involved in lipid metabolism, all potentially contributing to increased cardiovascular morbidity in sHTG.
Wang, Shiqiang; Wang, Bin; Hua, Wenping; Niu, Junfeng; Dang, Kaikai; Qiang, Yi; Wang, Zhezhi
2017-01-01
Polygonatum sibiricum polysaccharides (PSPs) are used to improve immunity, alleviate dryness, promote the secretion of fluids, and quench thirst. However, the PSP biosynthetic pathway is largely unknown. Understanding the genetic background will help delineate that pathway at the molecular level so that researchers can develop better conservation strategies. After comparing the PSP contents among several different P. sibiricum germplasms, we selected two groups with the largest contrasts in contents and subjected them to HiSeq2500 transcriptome sequencing to identify the candidate genes involved in PSP biosynthesis. In all, 20 kinds of enzyme-encoding genes were related to PSP biosynthesis. The polysaccharide content was positively correlated with the expression patterns of β-fructofuranosidase (sacA), fructokinase (scrK), UDP-glucose 4-epimerase (GALE), Mannose-1-phosphate guanylyltransferase (GMPP), and UDP-glucose 6-dehydrogenase (UGDH), but negatively correlated with the expression of Hexokinase (HK). Through qRT-PCR validation and comprehensive analysis, we determined that sacA, HK, and GMPP are key genes for enzymes within the PSP metabolic pathway in P. sibiricum. Our results provide a public transcriptome dataset for this species and an outline of pathways for the production of polysaccharides in medicinal plants. They also present more information about the PSP biosynthesis pathway at the molecular level in P. sibiricum and lay the foundation for subsequent research of gene functions. PMID:28895881
Ensemble Methods for MiRNA Target Prediction from Expression Data.
Le, Thuc Duy; Zhang, Junpeng; Liu, Lin; Li, Jiuyong
2015-01-01
microRNAs (miRNAs) are short regulatory RNAs that are involved in several diseases, including cancers. Identifying miRNA functions is very important in understanding disease mechanisms and determining the efficacy of drugs. An increasing number of computational methods have been developed to explore miRNA functions by inferring the miRNA-mRNA regulatory relationships from data. Each of the methods is developed based on some assumptions and constraints, for instance, assuming linear relationships between variables. For such reasons, computational methods are often subject to the problem of inconsistent performance across different datasets. On the other hand, ensemble methods integrate the results from individual methods and have been proved to outperform each of their individual component methods in theory. In this paper, we investigate the performance of some ensemble methods over the commonly used miRNA target prediction methods. We apply eight different popular miRNA target prediction methods to three cancer datasets, and compare their performance with the ensemble methods which integrate the results from each combination of the individual methods. The validation results using experimentally confirmed databases show that the results of the ensemble methods complement those obtained by the individual methods and the ensemble methods perform better than the individual methods across different datasets. The ensemble method, Pearson+IDA+Lasso, which combines methods in different approaches, including a correlation method, a causal inference method, and a regression method, is the best-performing ensemble method in this study. Further analysis of the results of this ensemble method shows that the ensemble method can obtain more targets which could not be found by any of the single methods, and the discovered targets are more statistically significant and functionally enriched.
The source codes, datasets, miRNA target predictions by all methods, and the ground truth for validation are available in the Supplementary materials.
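The idea of integrating the results of several prediction methods can be illustrated with a small rank-aggregation sketch. The abstract does not state how the rankings are combined, so the Borda-style averaging below is an assumption chosen for illustration, and the gene names and the three input lists are hypothetical.

```python
from collections import defaultdict


def borda_ensemble(rankings):
    """Combine ranked target lists with an averaged Borda-style score.

    Each ranking is a list of candidate targets, best first. A target at
    position p (0-based) in a list of length n receives (n - p) / n points;
    targets missing from a list receive nothing from that list.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        n = len(ranking)
        for position, target in enumerate(ranking):
            scores[target] += (n - position) / n
    k = len(rankings)
    # Sort by descending average score, breaking ties alphabetically.
    return sorted(scores, key=lambda t: (-scores[t] / k, t))


# Hypothetical rankings from a correlation, a causal and a regression method.
pearson = ["geneA", "geneB", "geneC", "geneD"]
ida = ["geneB", "geneA", "geneD", "geneC"]
lasso = ["geneB", "geneC", "geneA", "geneD"]

consensus = borda_ensemble([pearson, ida, lasso])
print(consensus)  # ['geneB', 'geneA', 'geneC', 'geneD']
```

A target ranked highly by methods built on different assumptions rises to the top of the consensus list, which is the intuition behind combining a correlation, a causal inference and a regression method.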
Ensemble Methods for MiRNA Target Prediction from Expression Data
Le, Thuc Duy; Zhang, Junpeng; Liu, Lin; Li, Jiuyong
2015-01-01
Background: microRNAs (miRNAs) are short regulatory RNAs that are involved in several diseases, including cancers. Identifying miRNA functions is very important in understanding disease mechanisms and determining the efficacy of drugs. An increasing number of computational methods have been developed to explore miRNA functions by inferring the miRNA-mRNA regulatory relationships from data. Each of the methods is developed based on some assumptions and constraints, for instance, assuming linear relationships between variables. For such reasons, computational methods are often subject to the problem of inconsistent performance across different datasets. On the other hand, ensemble methods integrate the results from individual methods and have been proved to outperform each of their individual component methods in theory. Results: In this paper, we investigate the performance of some ensemble methods over the commonly used miRNA target prediction methods. We apply eight different popular miRNA target prediction methods to three cancer datasets, and compare their performance with the ensemble methods which integrate the results from each combination of the individual methods. The validation results using experimentally confirmed databases show that the results of the ensemble methods complement those obtained by the individual methods and the ensemble methods perform better than the individual methods across different datasets. The ensemble method, Pearson+IDA+Lasso, which combines methods in different approaches, including a correlation method, a causal inference method, and a regression method, is the best-performing ensemble method in this study. Further analysis of the results of this ensemble method shows that the ensemble method can obtain more targets which could not be found by any of the single methods, and the discovered targets are more statistically significant and functionally enriched.
The source codes, datasets, miRNA target predictions by all methods, and the ground truth for validation are available in the Supplementary materials. PMID:26114448
Student Activity and Profile Datasets from an Online Video-Based Collaborative Learning Experience
ERIC Educational Resources Information Center
Martín, Estefanía; Gértrudix, Manuel; Urquiza-Fuentes, Jaime; Haya, Pablo A.
2015-01-01
This paper describes two datasets extracted from a video-based educational experience using a social and collaborative platform. The length of the trial was 3 months. It involved 111 students from two different courses. Twenty-nine came from Computer Engineering (CE) course and 82 from Media and Communication (M&C) course. They were organised…
A new dataset validation system for the Planetary Science Archive
NASA Astrophysics Data System (ADS)
Manaud, N.; Zender, J.; Heather, D.; Martinez, S.
2007-08-01
The Planetary Science Archive is the official archive for the Mars Express mission. It received its first data at the end of 2004. These data are delivered by the PI teams to the PSA team as datasets, which are formatted to conform to the Planetary Data System (PDS). The PI teams are responsible for analyzing and calibrating the instrument data as well as for the production of reduced and calibrated data. They are also responsible for the scientific validation of these data. ESA is responsible for the long-term data archiving and distribution to the scientific community and must ensure, in this regard, that all archived products meet quality standards. To do so, an archive peer-review is used to control the quality of the Mars Express science data archiving process. However, a full validation of its content is missing. An independent review board recently recommended that the completeness of the archive as well as the consistency of the delivered data should be validated following well-defined procedures. A new validation software tool is being developed to complete the overall data quality control system functionality. This new tool aims to improve the quality of data and services provided to the scientific community through the PSA, and shall make it possible to track anomalies in datasets and to control their completeness. It shall ensure that the PSA end-users: (1) can rely on the result of their queries, (2) will get data products that are suitable for scientific analysis, (3) can find all science data acquired during a mission. We defined dataset validation as the verification and assessment process to check the dataset content against pre-defined top-level criteria, which represent the general characteristics of good quality datasets. The dataset content that is checked includes the data and all types of information that are essential in the process of deriving scientific results and those interfacing with the PSA database.
The validation software tool is a multi-mission tool that has been designed to provide the user with the flexibility of defining and implementing various types of validation criteria, to iteratively and incrementally validate datasets, and to generate validation reports.
Harries, Lorna W; Fellows, Alexander D; Pilling, Luke C; Hernandez, Dena; Singleton, Andrew; Bandinelli, Stefania; Guralnik, Jack; Powell, Jonathan; Ferrucci, Luigi; Melzer, David
2012-08-01
Interventions which inhibit TOR activity (including rapamycin and caloric restriction) lead to downstream gene expression changes and increased lifespan in laboratory models. However, the role of mTOR signaling in human aging is unclear. We tested the expression of mTOR-related transcripts in two independent study cohorts: the InCHIANTI population study of aging and the San Antonio Family Heart Study (SAFHS). Expression of 27/56 (InCHIANTI) and 19/44 (SAFHS) genes was associated with age after correction for multiple testing. Eight genes were robustly associated with age in both cohorts. Genes involved in insulin signaling (PTEN, PI3K, PDK1), ribosomal biogenesis (S6K), lipid metabolism (SREBF1), cellular apoptosis (SGK1), angiogenesis (VEGFB), insulin production and sensitivity (FOXO), cellular stress response (HIF1A) and cytoskeletal remodeling (PKC) were inversely correlated with age, whereas genes relating to inhibition of ribosomal components (4EBP1) and inflammatory mediators (STAT3) were positively associated with age in one or both datasets. We conclude that the expression of mTOR-related transcripts is associated with advancing age in humans. Changes seen are broadly similar to mTOR inhibition interventions associated with increased lifespan in animals. Work is needed to establish whether these changes are predictive of human longevity and whether further mTOR inhibition would be beneficial in older people. Copyright © 2012 Elsevier Ireland Ltd. All rights reserved.
Ning, Jinling; Shen, Ying; Wang, Ting; Wang, Mengru; Liu, Wei; Sun, Yonghu; Zhang, Furen; Chen, Lingling; Wang, Yiqiang
2018-05-21
Preliminary data mining performed with Gene Expression Omnibus datasets implied that psoriasis may involve matrix remodeling associated 7 (MXRA7), a gene about whose function little is yet known. To test that hypothesis, studies were performed in human samples and murine models. Immunohistochemistry in normal human skin showed that MXRA7 proteins were present across the full epidermal layer, with the highest expression level detected in the basal layer. In psoriatic samples, MXRA7 proteins were absent in the basal stem cell layer while suprabasal keratinocytes stained at a higher level than in normal tissues. In an imiquimod-induced psoriasis-like disease model in mice, diseased skin showed an MXRA7 expression pattern and change similar to those in human samples, and MXRA7-deficient mice developed more severe psoriasis-like disease than wild-type mice did. While levels of pro-psoriatic genes (e.g. IL17, IL22, IL23) in imiquimod-stimulated MXRA7-deficient mice were higher than in wild-type mice, keratinocytes isolated from MXRA7-deficient mice showed increased proliferation upon differentiation induction in culture. These data demonstrated that the MXRA7 gene might function as a negative modulator in psoriasis development when pro-psoriatic factors attack, presumably via expression alteration or redistribution of MXRA7 proteins in keratinocytes. This article is protected by copyright. All rights reserved.
Zhou, Xiaohong; Wang, Ke; Lv, Dongwen; Wu, Chengjun; Li, Jiarui; Zhao, Pei; Lin, Zhishan; Du, Lipu; Yan, Yueming; Ye, Xingguo
2013-01-01
Agrobacterium-mediated plant transformation is an extremely complex and evolved process involving genetic determinants of both the bacteria and the host plant cells. However, the mechanism of the determinants remains obscure, especially in some cereal crops such as wheat, which is recalcitrant to Agrobacterium-mediated transformation. In this study, differentially expressed genes (DEGs) and differentially expressed proteins (DEPs) were analyzed in wheat callus cells co-cultured with Agrobacterium by using RNA sequencing (RNA-seq) and two-dimensional electrophoresis (2-DE) in conjunction with mass spectrometry (MS). A total of 4,889 DEGs and 90 DEPs were identified. Most of them are related to metabolism, chromatin assembly or disassembly and immune defense. After comparative analysis, 24 of the 90 DEPs were detected in RNA-seq and proteomics datasets simultaneously. In addition, real-time RT-PCR experiments were performed to check the differential expression of the 24 genes, and the results were consistent with the RNA-seq data. According to gene ontology (GO) analysis, we found that a large proportion of these differentially expressed genes were related to the process of stress or immunity response. Several putative determinants and candidate effectors responsive to Agrobacterium-mediated transformation of wheat cells were discussed. We speculate that some of these genes are possibly related to Agrobacterium infection. Our results will help to understand the interaction between Agrobacterium and host cells, and may facilitate developing efficient transformation strategies in cereal crops. PMID:24278131
Thomassen, Mads; Tan, Qihua; Kruse, Torben A
2009-01-01
Breast cancer cells exhibit complex karyotypic alterations causing deregulation of numerous genes. Some of these genes are probably causal for cancer formation and local growth whereas others are causal for the various steps of metastasis. In a fraction of tumors deregulation of the same genes might be caused by epigenetic modulations, point mutations or the influence of other genes. We have investigated the relation of gene expression and chromosomal position, using eight datasets including more than 1200 breast tumors, to identify chromosomal regions and candidate genes possibly causal for breast cancer metastasis. By use of "Gene Set Enrichment Analysis" we have ranked chromosomal regions according to their relation to metastasis. Overrepresentation analysis identified regions with increased expression for chromosome 1q41-42, 8q24, 12q14, 16q22, 16q24, 17q12-21.2, 17q21-23, 17q25, 20q11, and 20q13 among metastasizing tumors and reduced gene expression at 1p31-21, 8p22-21, and 14q24. By analysis of genes with extremely imbalanced expression in these regions we identified DIRAS3 at 1p31, PSD3, LPL, EPHX2 at 8p21-22, and FOS at 14q24 as candidate metastasis suppressor genes. Potential metastasis-promoting genes include RECQL4 at 8q24, PRMT7 at 16q22, GINS2 at 16q24, and AURKA at 20q13.
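The abstract does not say which statistic underlies its overrepresentation analysis. One common choice for asking whether a chromosomal region holds more deregulated genes than chance would predict is a one-sided hypergeometric test; the sketch below assumes that choice, and all gene counts in it are hypothetical.

```python
from math import comb


def overrepresentation_p(total_genes, region_genes, selected_genes, overlap):
    """One-sided hypergeometric p-value P(X >= overlap).

    total_genes    -- genes measured overall
    region_genes   -- genes located in the chromosomal region of interest
    selected_genes -- genes flagged as deregulated
    overlap        -- flagged genes that fall inside the region
    """
    upper = min(region_genes, selected_genes)
    favourable = sum(
        comb(region_genes, i) * comb(total_genes - region_genes, selected_genes - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(total_genes, selected_genes)


# Hypothetical toy numbers: 10 genes in total, 5 in the region,
# 5 flagged as deregulated, and all 5 flagged genes lie in the region.
p_value = overrepresentation_p(total_genes=10, region_genes=5,
                               selected_genes=5, overlap=5)
print(p_value)  # 1/252, roughly 0.004
```

With many regions tested at once, p-values of this kind would normally be corrected for multiple testing before regions are reported.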
HOXB7 overexpression in lung cancer is a hallmark of acquired stem-like phenotype.
Monterisi, Simona; Lo Riso, Pietro; Russo, Karin; Bertalot, Giovanni; Vecchi, Manuela; Testa, Giuseppe; Di Fiore, Pier Paolo; Bianchi, Fabrizio
2018-03-26
HOXB7 is a homeodomain (HOX) transcription factor involved in regional body patterning of invertebrates and vertebrates. We previously identified HOXB7 within a ten-gene prognostic signature for lung adenocarcinoma, where increased expression of HOXB7 was associated with poor prognosis. This raises the question of how HOXB7 overexpression can influence the metastatic behavior of lung adenocarcinoma. Here, we analyzed publicly available microarray and RNA-seq lung cancer expression datasets and found that HOXB7-overexpressing tumors are enriched in gene signatures characterizing adult and embryonic stem cells (SC), and induced pluripotent stem cells (iPSC). Experimentally, we found that HOXB7 upregulates several canonical SC/iPSC markers and sustains the expansion of a subpopulation of cells with SC characteristics, through modulation of LIN28B, an emerging cancer gene and pluripotency factor, which we discovered to be a direct target of HOXB7. We validated this new circuit by showing that HOXB7 enhances reprogramming to iPSC with comparable efficiency to LIN28B or its target c-MYC, which is a canonical reprogramming factor.
Canovas, Sebastian; Ivanova, Elena; Romar, Raquel; García-Martínez, Soledad; Soriano-Úbeda, Cristina; García-Vázquez, Francisco A; Saadeh, Heba; Andrews, Simon; Kelsey, Gavin; Coy, Pilar
2017-01-01
The number of children born since the origin of Assisted Reproductive Technologies (ART) exceeds 5 million. The majority seem healthy, but a higher frequency of defects has been reported among ART-conceived infants, suggesting an epigenetic cost. We report the first whole-genome DNA methylation datasets from single pig blastocysts showing differences between in vivo and in vitro produced embryos. Blastocysts were produced in vitro either without (C-IVF) or in the presence of natural reproductive fluids (Natur-IVF). Natur-IVF embryos were of higher quality than C-IVF in terms of cell number and hatching ability. RNA-Seq and DNA methylation analyses showed that Natur-IVF embryos have expression and methylation patterns closer to in vivo blastocysts. Genes involved in reprogramming, imprinting and development were affected by culture, with fewer aberrations in Natur-IVF embryos. Methylation analysis detected methylated changes in C-IVF, but not in Natur-IVF, at genes whose methylation could be critical, such as IGF2R and NNAT. DOI: http://dx.doi.org/10.7554/eLife.23670.001 PMID:28134613
Zhang, Zhijun; Zhang, Pengjun; Li, Weidi; Zhang, Jinming; Huang, Fang; Yang, Jian; Bei, Yawei; Lu, Yaobin
2013-05-01
The western flower thrips (WFT), Frankliniella occidentalis, a world-wide invasive insect, causes agricultural damage by feeding directly and by indirectly vectoring Tospoviruses, such as Tomato spotted wilt virus (TSWV). We characterized the transcriptome of WFT and analyzed the global gene expression response of WFT to TSWV infection using the Illumina sequencing platform. We compiled 59,932 unigenes, and identified 36,339 unigenes by similarity analysis against public databases, most of which were annotated using gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis. Within these annotated transcripts, we collected 278 sequences related to insecticide resistance. GO and KEGG analysis of differentially expressed genes between TSWV-infected and non-infected WFT populations revealed that TSWV can regulate cellular processes and the immune response, which might lead to low virus titers in thrips cells and no detrimental effects on F. occidentalis. This dataset not only enriches genomic resources for WFT, but also benefits research into its molecular genetics and functional genomics. Copyright © 2013 Elsevier Inc. All rights reserved.
Strader, Marie E; Aglyamova, Galina V; Matz, Mikhail V
2018-01-04
Molecular mechanisms underlying coral larval competence, the ability of larvae to respond to settlement cues, determine their dispersal potential and are potential targets of natural selection. Here, we profiled competence, fluorescence and genome-wide gene expression in embryos and larvae of the reef-building coral Acropora millepora daily throughout 12 days post-fertilization. Gene expression associated with competence was positively correlated with transcriptomic response to the natural settlement cue, confirming that mature coral larvae are "primed" for settlement. Rise of competence through development was accompanied by up-regulation of sensory and signal transduction genes such as ion channels, genes involved in neuropeptide signaling, and G-protein coupled receptors (GPCRs). A drug screen targeting components of GPCR signaling pathways confirmed a role in larval settlement behavior and metamorphosis. These results give insight into the molecular complexity underlying these transitions and reveal receptors and pathways that, if altered by changing environments, could affect dispersal capabilities of reef-building corals. In addition, this dataset provides a toolkit for asking broad questions about sensory capacity in multicellular animals and the evolution of development.
Moulos, Panagiotis; Samiotaki, Martina; Panayotou, George; Dedos, Skarlatos G.
2016-01-01
The cells of prothoracic glands (PG) are the main site of synthesis and secretion of ecdysteroids, the biochemical products of cholesterol conversion to steroids that shape the morphogenic development of insects. Despite the availability of genome sequences from several insect species and the extensive knowledge of certain signalling pathways that underpin ecdysteroidogenesis, the spectrum of signalling molecules and ecdysteroidogenic cascades is still not fully comprehensive. To fill this gap and obtain the complete list of cell membrane receptors expressed in PG cells, we used combinatory bioinformatic, proteomic and transcriptomic analysis and quantitative PCR to annotate and determine the expression profiles of genes identified as putative cell membrane receptors of the model insect species, Bombyx mori, and subsequently enrich the repertoire of signalling pathways that are present in its PG cells. The genome annotation dataset we report here highlights modules and pathways that may be directly involved in ecdysteroidogenesis and aims to disseminate data and assist other researchers in the discovery of the role of such receptors and their ligands. PMID:27576083
TNF-alpha inhibits insulin action in liver and adipose tissue: A model of metabolic syndrome.
Solomon, S S; Odunusi, O; Carrigan, D; Majumdar, G; Kakoola, D; Lenchik, N I; Gerling, I C
2010-02-01
Several studies suggest that TNF-alpha contributes to the development of insulin resistance (IR). We compared transcriptional profiles of rat H-411E liver cells exposed to insulin in the absence or presence of TNF-alpha. We identified 33 genes whose expression was altered by insulin, and then reversed by TNF-alpha. Twenty-six of these 33 genes created a single network centered around: insulin, TNF-alpha, p38-MAPK, TGFb1; SMAD and STAT1; and enzymes and cytokines involved in apoptosis (CASP3, GADD45B, IL2, TNF-alpha, etc.). We analyzed our data together with other data of gene expression in adipocytes and found a number of processes common to both, for example, cell death and inflammation; intercellular signaling and metabolism; G-Protein, IL-10 and PTEN signaling. Moreover, the two datasets combined generated a single molecular network that further identified PTEN (a phosphatase) as a unique new link between insulin signaling, IR, and apoptosis reflecting the pathophysiology of "metabolic syndrome". Georg Thieme Verlag KG Stuttgart * New York.
Kerkentzes, Konstantinos; Lagani, Vincenzo; Tsamardinos, Ioannis; Vyberg, Mogens; Røe, Oluf Dimitri
2014-01-01
Novel statistical methods and increasingly more accurate gene annotations can transform "old" biological data into a renewed source of knowledge with potential clinical relevance. Here, we provide an in silico proof-of-concept by extracting novel information from a high-quality mRNA expression dataset, originally published in 2001, using state-of-the-art bioinformatics approaches. The dataset consists of histologically defined cases of lung adenocarcinoma (AD), squamous (SQ) cell carcinoma, small-cell lung cancer, carcinoid, metastasis (breast and colon AD), and normal lung specimens (203 samples in total). A battery of statistical tests was used for identifying differential gene expressions, diagnostic and prognostic genes, enriched gene ontologies, and signaling pathways. Our results showed that gene expressions faithfully recapitulate immunohistochemical subtype markers, as chromogranin A in carcinoids, cytokeratin 5, p63 in SQ, and TTF1 in non-squamous types. Moreover, biological information with putative clinical relevance was revealed as potentially novel diagnostic genes for each subtype with specificity 93-100% (AUC = 0.93-1.00). Cancer subtypes were characterized by (a) differential expression of treatment target genes as TYMS, HER2, and HER3 and (b) overrepresentation of treatment-related pathways like cell cycle, DNA repair, and ERBB pathways. The vascular smooth muscle contraction, leukocyte trans-endothelial migration, and actin cytoskeleton pathways were overexpressed in normal tissue. Reanalysis of this public dataset displayed the known biological features of lung cancer subtypes and revealed novel pathways of potentially clinical importance. The findings also support our hypothesis that even old omics data of high quality can be a source of significant biological information when appropriate bioinformatics methods are used.
Chi, Baofang; Tao, Shiheng; Liu, Yanlin
2015-01-01
Sampling the solution space of genome-scale models is generally conducted to determine the feasible region for metabolic flux distribution. Because the region for actual metabolic states resides only in a small fraction of the entire space, it is necessary to shrink the solution space to improve the predictive power of a model. A common strategy is to constrain models by integrating extra datasets such as high-throughput datasets and C13-labeled flux datasets. However, studies refining these approaches by performing a meta-analysis of massive experimental metabolic flux measurements, which are closely linked to cellular phenotypes, are limited. In the present study, experimentally identified metabolic flux data from 96 published reports were systematically reviewed. Several strong associations among metabolic flux phenotypes were observed. These phenotype-phenotype associations at the flux level were quantified and integrated into a Saccharomyces cerevisiae genome-scale model as extra physiological constraints. By sampling the shrunken solution space of the model, the metabolic flux fluctuation level, which is an intrinsic trait of metabolic reactions determined by the network, was estimated and utilized to explore its relationship to gene expression noise. Although no correlation was observed in all enzyme-coding genes, a relationship between metabolic flux fluctuation and expression noise of genes associated with enzyme-dosage sensitive reactions was detected, suggesting that the metabolic network plays a role in shaping gene expression noise. Such correlation was mainly attributed to the genes corresponding to non-essential reactions, rather than essential ones. This was at least partially, due to regulations underlying the flux phenotype-phenotype associations. Altogether, this study proposes a new approach in shrinking the solution space of a genome-scale model, of which sampling provides new insights into gene expression noise.
2017-01-01
The changes of protein expression that are monitored in proteomic experiments are a type of biological transformation that also involves changes in chemical composition. Accompanying the myriad molecular-level interactions that underlie any proteomic transformation, there is an overall thermodynamic potential that is sensitive to microenvironmental conditions, including local oxidation and hydration potential. Here, up- and down-expressed proteins identified in 71 comparative proteomics studies were analyzed using the average oxidation state of carbon (ZC) and water demand per residue ($\overline{n}_{\mathrm{H_2O}}$), calculated using elemental abundances and stoichiometric reactions to form proteins from basis species. Experimental lowering of oxygen availability (hypoxia) or water activity (hyperosmotic stress) generally results in decreased ZC or $\overline{n}_{\mathrm{H_2O}}$ of up-expressed compared to down-expressed proteins. This correspondence of chemical composition with experimental conditions provides evidence for attraction of the proteomes to a low-energy state. An opposite compositional change, toward higher average oxidation or hydration state, is found for proteomic transformations in colorectal and pancreatic cancer, and in two experiments for adipose-derived stem cells.
Calculations of chemical affinity were used to estimate the thermodynamic potentials for proteomic transformations as a function of fugacity of O2 and activity of H2O, which serve as scales of oxidation and hydration potential. Diagrams summarizing the relative potential for formation of up- and down-expressed proteins have predicted equipotential lines that cluster around particular values of oxygen fugacity and water activity for similar datasets. The changes in chemical composition of proteomes are likely linked with reactions among other cellular molecules. A redox balance calculation indicates that an increase in the lipid to protein ratio in cancer cells by 20% over hypoxic cells would generate a large enough electron sink for oxidation of the cancer proteomes. The datasets and computer code used here are made available in a new R package, canprot. PMID:28603672
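The average oxidation state of carbon can be computed directly from an elemental formula. The abstract does not spell out the formula, so the sketch below assumes the usual definition for a species with composition C_c H_h N_n O_o S_s and net charge Z, namely ZC = (Z - h + 3n + 2o + 2s) / c. The function name is hypothetical and is not taken from the canprot package.

```python
def carbon_oxidation_state(c, h, n, o, s, charge=0):
    """Average oxidation state of carbon for the formula C_c H_h N_n O_o S_s.

    Assumes the conventional oxidation states H = +1, N = -3, O = -2 and
    S = -2, so that carbon balances whatever remains of the net charge.
    """
    if c <= 0:
        raise ValueError("the formula must contain at least one carbon atom")
    return (charge - h + 3 * n + 2 * o + 2 * s) / c


# Small molecules with well-known answers serve as a sanity check.
print(carbon_oxidation_state(c=1, h=4, n=0, o=0, s=0))  # methane: -4.0
print(carbon_oxidation_state(c=1, h=0, n=0, o=2, s=0))  # carbon dioxide: 4.0
print(carbon_oxidation_state(c=3, h=7, n=1, o=2, s=0))  # alanine: 0.0
```

For a protein, the same expression would be applied to the summed elemental composition of its residues; comparing the values for up- and down-expressed proteins gives the kind of compositional difference the abstract describes.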
BCCTBbp: the Breast Cancer Campaign Tissue Bank bioinformatics portal.
Cutts, Rosalind J; Guerra-Assunção, José Afonso; Gadaleta, Emanuela; Dayem Ullah, Abu Z; Chelala, Claude
2015-01-01
BCCTBbp (http://bioinformatics.breastcancertissuebank.org) was initially developed as the data-mining portal of the Breast Cancer Campaign Tissue Bank (BCCTB), a vital resource of breast cancer tissue for researchers to support and promote cutting-edge research. BCCTBbp is dedicated to maximising research on patient tissues by initially storing genomics, methylomics, transcriptomics, proteomics and microRNA data that has been mined from the literature and linking to pathways and mechanisms involved in breast cancer. Currently, the portal holds 146 datasets comprising over 227,795 expression/genomic measurements from various breast tissues (e.g. normal, malignant or benign lesions), cell lines and body fluids. BCCTBbp can be used to build on breast cancer knowledge and maximise the value of existing research. By recording a large number of annotations on samples and studies, and linking to other databases, such as NCBI, Ensembl and Reactome, a wide variety of different investigations can be carried out. Additionally, BCCTBbp has a dedicated analytical layer allowing researchers to further analyse stored datasets. A future important role for BCCTBbp is to make available all data generated on BCCTB tissues thus building a valuable resource of information on the tissues in BCCTB that will save repetition of experiments and expand scientific knowledge. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
Can specific transcriptional regulators assemble a universal cancer signature?
NASA Astrophysics Data System (ADS)
Roy, Janine; Isik, Zerrin; Pilarsky, Christian; Schroeder, Michael
2013-10-01
Recently, there has been a lot of interest in using biomarker signatures derived from gene expression data to predict cancer progression. We assembled signatures of 25 published datasets covering 13 types of cancers. How do these signatures compare with each other? On one hand, signatures answering the same biological question should overlap, whereas signatures predicting different cancer types should differ. On the other hand, there could also be a Universal Cancer Signature that is predictive independently of the cancer type. Initially, we generate signatures for all datasets using classical approaches such as t-test and fold change, and then we explore signatures resulting from a network-based method that applies the random surfer model of Google's PageRank algorithm. We show that the signatures as published by the authors and the signatures generated with classical methods do not overlap - not even for the same cancer type - whereas the network-based signatures strongly overlap. Selecting 10 out of 37 universal cancer genes gives the optimal prediction for all cancers, thus taking a first step towards a Universal Cancer Signature. We furthermore analyze and discuss the involved genes in terms of the Hallmarks of cancer and in particular single out SP1, JUN/FOS and NFKB1 and examine their specific role in cancer progression.
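The abstract above names the random surfer model of PageRank but gives no implementation details. The following is a minimal sketch of generic PageRank by power iteration, not the authors' network-based method; the toy graph, the node names and the damping value of 0.85 are assumptions chosen only for illustration.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Generic PageRank by power iteration.

    graph maps each node to a list of nodes it links to; every linked
    node must also appear as a key. Returns a dict of node -> score.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Random jump: the surfer restarts at a uniformly chosen node.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, neighbours in graph.items():
            if neighbours:
                # Follow one outgoing link chosen uniformly at random.
                share = damping * rank[node] / len(neighbours)
                for neighbour in neighbours:
                    new_rank[neighbour] += share
            else:
                # Dangling node: spread its score evenly over all nodes.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-node network (not taken from the paper).
toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
scores = pagerank(toy_graph)
top_node = max(scores, key=scores.get)
```

In this toy graph the node with the most incoming links ends up with the highest score; in a signature-building setting the nodes would presumably be genes in an interaction network, but that mapping is an assumption here.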
Rapposelli, Simona; Coi, Alessio; Imbriani, Marcello; Bianucci, Anna Maria
2012-01-01
P-glycoprotein (P-gp) is an efflux pump involved in the protection of tissues of several organs by influencing xenobiotic disposition. P-gp plays a key role in multidrug resistance and in the progression of many neurodegenerative diseases. The development of new and more effective therapeutics targeting P-gp thus represents an intriguing challenge in drug discovery. P-gp inhibition may be considered as a valid approach to improve drug bioavailability as well as to overcome drug resistance to many kinds of tumours characterized by the over-expression of this protein. This study aims to develop classification models from a unique dataset of 59 compounds for which there were homogeneous experimental data on P-gp inhibition, ATPase activation and monolayer efflux. For each experiment, the dataset was split into a training and a test set comprising 39 and 20 molecules, respectively. Rational splitting was accomplished using a sphere-exclusion type algorithm. After a two-step (internal/external) validation, the best-performing classification models were used in a consensus predicting task for the identification of compounds named as "true" P-gp inhibitors, i.e., molecules able to inhibit P-gp without being effluxed by P-gp itself and simultaneously unable to activate the ATPase function.
Perualila-Tan, Nolen Joy; Shkedy, Ziv; Talloen, Willem; Göhlmann, Hinrich W H; Moerbeke, Marijke Van; Kasim, Adetayo
2016-08-01
The modern process of discovering candidate molecules in early drug discovery phase includes a wide range of approaches to extract vital information from the intersection of biology and chemistry. A typical strategy in compound selection involves compound clustering based on chemical similarity to obtain representative chemically diverse compounds (not incorporating potency information). In this paper, we propose an integrative clustering approach that makes use of both biological (compound efficacy) and chemical (structural features) data sources for the purpose of discovering a subset of compounds with aligned structural and biological properties. The datasets are integrated at the similarity level by assigning complementary weights to produce a weighted similarity matrix, serving as a generic input in any clustering algorithm. This new analysis workflow is a semi-supervised method since, after the determination of clusters, a secondary analysis is performed wherein it finds differentially expressed genes associated with the derived integrated cluster(s) to further explain the compound-induced biological effects inside the cell. In this paper, datasets from two drug development oncology projects are used to illustrate the usefulness of the weighted similarity-based clustering approach to integrate multi-source high-dimensional information to aid drug discovery. Compounds that are structurally and biologically similar to the reference compounds are discovered using this proposed integrative approach.
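The integration step described above, combining two similarity matrices with complementary weights, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function name `weighted_similarity`, the single weight `w` (with `1 - w` as its complement) and the toy matrices are all hypothetical.

```python
def weighted_similarity(sim_bio, sim_chem, weight):
    """Combine two equally sized similarity matrices.

    Each entry of the result is weight * biological similarity
    + (1 - weight) * chemical similarity.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if len(sim_bio) != len(sim_chem):
        raise ValueError("matrices must have the same number of rows")
    return [
        [weight * b + (1.0 - weight) * c for b, c in zip(row_b, row_c)]
        for row_b, row_c in zip(sim_bio, sim_chem)
    ]


# Made-up similarities for three compounds (not data from the paper).
sim_bio = [[1.0, 0.8, 0.2], [0.8, 1.0, 0.4], [0.2, 0.4, 1.0]]
sim_chem = [[1.0, 0.4, 0.6], [0.4, 1.0, 0.2], [0.6, 0.2, 1.0]]
combined = weighted_similarity(sim_bio, sim_chem, weight=0.5)
```

The combined matrix can then be handed to any clustering routine that accepts a precomputed similarity (or, after conversion, distance) matrix; how the weight should be chosen is not covered by this sketch.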
Zhao, Peng; Yang, Liping; Li, Jiansheng; Li, Ya; Tian, Yange; Li, Suyun
2016-01-01
Bufei Jianpi formula (BJF) has long been used as a therapeutic agent in the treatment of COPD. Systems pharmacology identified 145 active compounds and 175 potential targets of BJF in a previous study. Additionally, BJF was previously shown to effectively prevent COPD and its comorbidities, such as ventricular hypertrophy, by inhibition of inflammatory cytokine production, matrix metalloproteinases expression, and other cytokine production, in vivo. However, the system-level mechanism of BJF for the treatment of COPD is still unclear. The aim of this study was to gain insight into its system-level mechanisms by integrating transcriptomics, proteomics, and metabolomics together with systems pharmacology datasets. Using molecular function, pathway, and network analyses, the genes and proteins regulated in COPD rats and BJF-treated rats could be mainly attributed to oxidoreductase activity, antioxidant activity, focal adhesion, tight junction, or adherens junction. Furthermore, a comprehensive analysis of systems pharmacology, transcript, protein, and metabolite datasets is performed. The results showed that a number of genes, proteins, metabolites regulated in BJF-treated rats and potential target proteins of BJF were involved in lipid metabolism, cell junction, oxidative stress, and inflammatory response, which might be the system-level therapeutic mechanism of BJF treatment. PMID:27042044
Cell Specific eQTL Analysis without Sorting Cells
Esko, Tõnu; Peters, Marjolein J.; Schurmann, Claudia; Schramm, Katharina; Kettunen, Johannes; Yaghootkar, Hanieh; Fairfax, Benjamin P.; Andiappan, Anand Kumar; Li, Yang; Fu, Jingyuan; Karjalainen, Juha; Platteel, Mathieu; Visschedijk, Marijn; Weersma, Rinse K.; Kasela, Silva; Milani, Lili; Tserel, Liina; Peterson, Pärt; Reinmaa, Eva; Hofman, Albert; Uitterlinden, André G.; Rivadeneira, Fernando; Homuth, Georg; Petersmann, Astrid; Lorbeer, Roberto; Prokisch, Holger; Meitinger, Thomas; Herder, Christian; Roden, Michael; Grallert, Harald; Ripatti, Samuli; Perola, Markus; Wood, Andrew R.; Melzer, David; Ferrucci, Luigi; Singleton, Andrew B.; Hernandez, Dena G.; Knight, Julian C.; Melchiotti, Rossella; Lee, Bernett; Poidinger, Michael; Zolezzi, Francesca; Larbi, Anis; Wang, De Yun; van den Berg, Leonard H.; Veldink, Jan H.; Rotzschke, Olaf; Makino, Seiko; Salomaa, Veikko; Strauch, Konstantin; Völker, Uwe; van Meurs, Joyce B. J.; Metspalu, Andres; Wijmenga, Cisca; Jansen, Ritsert C.; Franke, Lude
2015-01-01
The functional consequences of trait associated SNPs are often investigated using expression quantitative trait locus (eQTL) mapping. While trait-associated variants may operate in a cell-type specific manner, eQTL datasets for such cell-types may not always be available. We performed a genome-environment interaction (GxE) meta-analysis on data from 5,683 samples to infer the cell type specificity of whole blood cis-eQTLs. We demonstrate that this method is able to predict neutrophil and lymphocyte specific cis-eQTLs and replicate these predictions in independent cell-type specific datasets. Finally, we show that SNPs associated with Crohn’s disease preferentially affect gene expression within neutrophils, including the archetypal NOD2 locus. PMID:25955312
NASA Astrophysics Data System (ADS)
Akhir, Nor Azurah Mat; Nadzirin, Nurul; Mohamed, Rahmah; Firdaus-Raih, Mohd
2015-09-01
Hypothetical proteins of bacterial pathogens represent a large number of novel biological mechanisms which could belong to essential pathways in the bacteria. They lack functional characterizations mainly due to the inability of sequence homology based methods to detect functional relationships in the absence of detectable sequence similarity. The dataset derived from this study showed 550 candidates conserved in genomes that have pathogenicity information and that are present only in the Burkholderiales order. The dataset has been narrowed down to taxonomic clusters. Ten proteins were selected for ORF amplification; seven of them were successfully amplified, and only four proteins were successfully expressed. These proteins will be great candidates for determining the true function via structural biology.
2013-01-01
Background Qualitative alterations or abnormal expression of microRNAs (miRNAs) in colon cancer have mainly been demonstrated in primary tumors. Poorly overlapping sets of oncomiRs, tumor suppressor miRNAs and metastamiRs have been linked with distinct stages in the progression of colorectal cancer. To identify changes in both miRNA and gene expression levels among normal colon mucosa, primary tumor and liver metastasis samples, and to classify miRNAs into functional networks, in this work miRNA and gene expression profiles in 158 samples from 46 patients were analysed. Results Most changes in miRNA and gene expression levels had already manifested in the primary tumors while these levels were almost stably maintained in the subsequent primary tumor-to-metastasis transition. In addition, comparing normal tissue, tumor and metastasis, we did not observe general impairment or any rise in miRNA biogenesis. While only few mRNAs were found to be differentially expressed between primary colorectal carcinoma and liver metastases, miRNA expression profiles can classify primary tumors and metastases well, including differential expression of miR-10b, miR-210 and miR-708. Of 82 miRNAs that were modulated during tumor progression, 22 were involved in EMT. qRT-PCR confirmed the down-regulation of miR-150 and miR-10b in both primary tumor and metastasis compared to normal mucosa and of miR-146a in metastases compared to primary tumor. The upregulation of miR-201 in metastasis compared both with normal and primary tumour was also confirmed. A preliminary survival analysis considering differentially expressed miRNAs suggested a possible link between miR-10b expression in metastasis and patient survival. By integrating miRNA and target gene expression data, we identified a combination of interconnected miRNAs, which are organized into sub-networks, including several regulatory relationships with differentially expressed genes. 
Key regulatory interactions were validated experimentally. Specific mixed circuits involving miRNAs and transcription factors were identified and deserve further investigation. The suppressor activity of miR-182 on ENTPD5 gene was identified for the first time and confirmed in an independent set of samples. Conclusions Using a large dataset of CRC miRNA and gene expression profiles, we describe the interplay of miRNA groups in regulating gene expression, which in turn affects modulated pathways that are important for tumor development. PMID:23987127
Bayesian correlated clustering to integrate multiple datasets
Kirk, Paul; Griffin, Jim E.; Savage, Richard S.; Ghahramani, Zoubin; Wild, David L.
2012-01-01
Motivation: The integration of multiple datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct—but often complementary—information. We present a Bayesian method for the unsupervised integrative modelling of multiple datasets, which we refer to as MDI (Multiple Dataset Integration). MDI can integrate information from a wide range of different datasets and data types simultaneously (including the ability to model time series data explicitly using Gaussian processes). Each dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture model, with dependencies between these models captured through parameters that describe the agreement among the datasets. Results: Using a set of six artificially constructed time series datasets, we show that MDI is able to integrate a significant number of datasets simultaneously, and that it successfully captures the underlying structural similarity between the datasets. We also analyse a variety of real Saccharomyces cerevisiae datasets. In the two-dataset case, we show that MDI’s performance is comparable with the present state-of-the-art. We then move beyond the capabilities of current approaches and integrate gene expression, chromatin immunoprecipitation–chip and protein–protein interaction data, to identify a set of protein complexes for which genes are co-regulated during the cell cycle. Comparisons to other unsupervised data integration techniques—as well as to non-integrative approaches—demonstrate that MDI is competitive, while also providing information that would be difficult or impossible to extract using other methods. Availability: A Matlab implementation of MDI is available from http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software/. Contact: D.L.Wild@warwick.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online. PMID:23047558
Grassi, Angela; Di Camillo, Barbara; Ciccarese, Francesco; Agnusdei, Valentina; Zanovello, Paola; Amadori, Alberto; Finesso, Lorenzo; Indraccolo, Stefano; Toffolo, Gianna Maria
2016-03-12
Inference of gene regulation from expression data may help to unravel regulatory mechanisms involved in complex diseases or in the action of specific drugs. A challenging task for many researchers working in the field of systems biology is to build up an experiment with a limited budget and produce a dataset suitable to reconstruct putative regulatory modules worthy of biological validation. Here, we focus on small-scale gene expression screens and we introduce a novel experimental set-up and a customized method of analysis to make inference on regulatory modules starting from genetic perturbation data, e.g. knockdown and overexpression data. To illustrate the utility of our strategy, it was applied to produce and analyze a dataset of quantitative real-time RT-PCR data, in which interferon-α (IFN-α) transcriptional response in endothelial cells is investigated by RNA silencing of two candidate IFN-α modulators, STAT1 and IFIH1. A putative regulatory module was reconstructed by our method, revealing an intriguing feed-forward loop, in which STAT1 regulates IFIH1 and they both negatively regulate IFNAR1. STAT1 regulation of IFNAR1 was the object of experimental validation at the protein level. A detailed description of the experimental set-up and of the analysis procedure is reported, with the intent of inspiring other scientists who want to carry out similar experiments to reconstruct gene regulatory modules starting from perturbations of possible regulators. Application of our approach to the study of IFN-α transcriptional response modulators in endothelial cells has led to many interesting novel findings and new biological hypotheses worthy of validation.
Wheat EST resources for functional genomics of abiotic stress
Houde, Mario; Belcaid, Mahdi; Ouellet, François; Danyluk, Jean; Monroy, Antonio F; Dryanova, Ani; Gulick, Patrick; Bergeron, Anne; Laroche, André; Links, Matthew G; MacCarthy, Luke; Crosby, William L; Sarhan, Fathey
2006-01-01
Background Wheat is an excellent species to study freezing tolerance and other abiotic stresses. However, the sequence of the wheat genome has not been completely characterized due to its complexity and large size. To circumvent this obstacle and identify genes involved in cold acclimation and associated stresses, a large scale EST sequencing approach was undertaken by the Functional Genomics of Abiotic Stress (FGAS) project. Results We generated 73,521 quality-filtered ESTs from eleven cDNA libraries constructed from wheat plants exposed to various abiotic stresses and at different developmental stages. In addition, 196,041 ESTs for which tracefiles were available from the National Science Foundation wheat EST sequencing program and DuPont were also quality-filtered and used in the analysis. Clustering of the combined ESTs with d2_cluster and TGICL yielded a few large clusters containing several thousand ESTs that were refractory to routine clustering techniques. To resolve this problem, the sequence proximity and "bridges" were identified by an e-value distance graph to manually break clusters into smaller groups. Assembly of the resolved ESTs generated a 75,488 unique sequence set (31,580 contigs and 43,908 singletons/singlets). Digital expression analyses indicated that the FGAS dataset is enriched in stress-regulated genes compared to the other public datasets. Over 43% of the unique sequence set was annotated and classified into functional categories according to Gene Ontology. Conclusion We have annotated 29,556 different sequences, an almost 5-fold increase in annotated sequences compared to the available wheat public databases. Digital expression analysis combined with gene annotation helped in the identification of several pathways associated with abiotic stress. The genomic resources and knowledge developed by this project will contribute to a better understanding of the different mechanisms that govern stress tolerance in wheat and other cereals. 
PMID:16772040
Khan, Arshad M.
2013-01-01
Intracranial chemical injection (ICI) methods have been used to identify the locations in the brain where feeding behavior can be controlled acutely. Scientists conducting ICI studies often document their injection site locations, thereby leaving kernels of valuable location data for others to use to further characterize feeding control circuits. Unfortunately, this rich dataset has not yet been formally contextualized with other published neuroanatomical data. In particular, axonal tracing studies have delineated several neural circuits originating in the same areas where ICI injection feeding-control sites have been documented, but it remains unclear whether these circuits participate in feeding control. Comparing injection sites with other types of location data would require careful anatomical registration between the datasets. Here, a conceptual framework is presented for how such anatomical registration efforts can be performed. For example, by using a simple atlas alignment tool, a hypothalamic locus sensitive to the orexigenic effects of neuropeptide Y (NPY) can be aligned accurately with the locations of neurons labeled by anterograde tracers or those known to express NPY receptors or feeding-related peptides. This approach can also be applied to those intracranial “gene-directed” injection (IGI) methods (e.g., site-specific recombinase methods, RNA expression or interference, optogenetics, and pharmacosynthetics) that involve viral injections to targeted neuronal populations. Spatial alignment efforts can be accelerated if location data from ICI/IGI methods are mapped to stereotaxic brain atlases to allow powerful neuroinformatics tools to overlay different types of data in the same reference space. Atlas-based mapping will be critical for community-based sharing of location data for feeding control circuits, and will accelerate our understanding of structure-function relationships in the brain for mammalian models of obesity and metabolic disorders. 
PMID:24385950
Array data extractor (ADE): a LabVIEW program to extract and merge gene array data
2013-01-01
Background Large data sets from gene expression array studies are publicly available offering information highly valuable for research across many disciplines ranging from fundamental to clinical research. Highly advanced bioinformatics tools have been made available to researchers, but a demand for user-friendly software allowing researchers to quickly extract expression information for multiple genes from multiple studies persists. Findings Here, we present a user-friendly LabVIEW program to automatically extract gene expression data for a list of genes from multiple normalized microarray datasets. Functionality was tested for 288 class A G protein-coupled receptors (GPCRs) and expression data from 12 studies comparing normal and diseased human hearts. Results confirmed known regulation of a beta 1 adrenergic receptor and further indicate novel research targets. Conclusions Although existing software allows for complex data analyses, the LabVIEW based program presented here, “Array Data Extractor (ADE)”, provides users with a tool to retrieve meaningful information from multiple normalized gene expression datasets in a fast and easy way. Further, the graphical programming language used in LabVIEW allows applying changes to the program without the need of advanced programming knowledge. PMID:24289243
Kadri, Sabah; Hinman, Veronica F.; Benos, Panayiotis V.
2011-01-01
microRNAs (miRNAs) are small (20–23 nt), non-coding single stranded RNA molecules that act as post-transcriptional regulators of mRNA gene expression. They have been implicated in regulation of developmental processes in diverse organisms. The echinoderms, Strongylocentrotus purpuratus (sea urchin) and Patiria miniata (sea star) are excellent model organisms for studying development with well-characterized transcriptional networks. However, to date, nothing is known about the role of miRNAs during development in these organisms, except that the genes that are involved in the miRNA biogenesis pathway are expressed during their developmental stages. In this paper, we used Illumina Genome Analyzer (Illumina, Inc.) to sequence small RNA libraries in mixed stage population of embryos from one to three days after fertilization of sea urchin and sea star (total of 22,670,000 reads). Analysis of these data revealed the miRNA populations in these two species. We found that 47 and 38 known miRNAs are expressed in sea urchin and sea star, respectively, during early development (32 in common). We also found 13 potentially novel miRNAs in the sea urchin embryonic library. miRNA expression is generally conserved between the two species during development, but 7 miRNAs are highly expressed in only one species. We expect that our two datasets will be a valuable resource for everyone working in the field of developmental biology and the regulatory networks that affect it. The computational pipeline to analyze Illumina reads is available at http://www.benoslab.pitt.edu/services.html. PMID:22216218
Inferring causal genomic alterations in breast cancer using gene expression data
2011-01-01
Background One of the primary objectives in cancer research is to identify causal genomic alterations, such as somatic copy number variation (CNV) and somatic mutations, during tumor development. Many valuable studies lack genomic data to detect CNV; therefore, methods that are able to infer CNVs from gene expression data would help maximize the value of these studies. Results We developed a framework for identifying recurrent regions of CNV and distinguishing the cancer driver genes from the passenger genes in the regions. By inferring CNV regions across many datasets we were able to identify 109 recurrent amplified/deleted CNV regions. Many of these regions are enriched for genes involved in many important processes associated with tumorigenesis and cancer progression. Genes in these recurrent CNV regions were then examined in the context of gene regulatory networks to prioritize putative cancer driver genes. The cancer driver genes uncovered by the framework include not only well-known oncogenes but also a number of novel cancer susceptibility genes validated via siRNA experiments. Conclusions To our knowledge, this is the first effort to systematically identify and validate drivers for expression based CNV regions in breast cancer. The framework where the wavelet analysis of copy number alteration based on expression coupled with the gene regulatory network analysis, provides a blueprint for leveraging genomic data to identify key regulatory components and gene targets. This integrative approach can be applied to many other large-scale gene expression studies and other novel types of cancer data such as next-generation sequencing based expression (RNA-Seq) as well as CNV data. PMID:21806811
Klein, Hans-Ulrich; Ruckert, Christian; Kohlmann, Alexander; Bullinger, Lars; Thiede, Christian; Haferlach, Torsten; Dugas, Martin
2009-12-15
Multiple gene expression signatures derived from microarray experiments have been published in the field of leukemia research. A comparison of these signatures with results from new experiments is useful for verification as well as for interpretation of the results obtained. Currently, the percentage of overlapping genes is frequently used to compare published gene signatures against a signature derived from a new experiment. However, it has been shown that the percentage of overlapping genes is of limited use for comparing two experiments due to the variability of gene signatures caused by different array platforms or assay-specific influencing parameters. Here, we present a robust approach for a systematic and quantitative comparison of published gene expression signatures with an exemplary query dataset. A database storing 138 leukemia-related published gene signatures was designed. Each gene signature was manually annotated with terms according to a leukemia-specific taxonomy. Two analysis steps are implemented to compare a new microarray dataset with the results from previous experiments stored and curated in the database. First, the global test method is applied to assess gene signatures and to constitute a ranking among them. In a subsequent analysis step, the focus is shifted from single gene signatures to chromosomal aberrations or molecular mutations as modeled in the taxonomy. Potentially interesting disease characteristics are detected based on the ranking of gene signatures associated with these aberrations stored in the database. Two example analyses are presented. An implementation of the approach is freely available as web-based application. The presented approach helps researchers to systematically integrate the knowledge derived from numerous microarray experiments into the analysis of a new dataset. 
By means of example leukemia datasets we demonstrate that this approach detects related experiments as well as related molecular mutations and may help to interpret new microarray data.
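The overlap percentage that the abstract above calls a limited measure can be made concrete with a short sketch. The definition used here (shared genes relative to the smaller signature) is one common choice and an assumption, since the abstract does not state a formula; the function name and the two example gene sets are hypothetical and do not come from the database described.

```python
def percent_overlap(signature_a, signature_b):
    """Percentage of shared genes, relative to the smaller signature."""
    genes_a, genes_b = set(signature_a), set(signature_b)
    if not genes_a or not genes_b:
        return 0.0
    shared = genes_a & genes_b
    return 100.0 * len(shared) / min(len(genes_a), len(genes_b))


# Illustrative gene sets only; not signatures from any cited study.
published_signature = {"FLT3", "NPM1", "CEBPA", "KIT"}
query_signature = {"NPM1", "KIT", "RUNX1"}
overlap = percent_overlap(published_signature, query_signature)
```

Because the value depends only on which gene identifiers match, two signatures measured on different array platforms can score low even when they reflect the same biology, which is the weakness the abstract points to.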
Who shares? Who doesn't? Factors associated with openly archiving raw research data.
Piwowar, Heather A
2011-01-01
Many initiatives encourage investigators to share their raw datasets in hopes of increasing research efficiency and quality. Despite these investments of time and money, we do not have a firm grasp of who openly shares raw research data, who doesn't, and which initiatives are correlated with high rates of data sharing. In this analysis I use bibliometric methods to identify patterns in the frequency with which investigators openly archive their raw gene expression microarray datasets after study publication. Automated methods identified 11,603 articles published between 2000 and 2009 that describe the creation of gene expression microarray data. Associated datasets in best-practice repositories were found for 25% of these articles, increasing from less than 5% in 2001 to 30%-35% in 2007-2009. Accounting for sensitivity of the automated methods, approximately 45% of recent gene expression studies made their data publicly available. First-order factor analysis on 124 diverse bibliometric attributes of the data creation articles revealed 15 factors describing authorship, funding, institution, publication, and domain environments. In multivariate regression, authors were most likely to share data if they had prior experience sharing or reusing data, if their study was published in an open access journal or a journal with a relatively strong data sharing policy, or if the study was funded by a large number of NIH grants. Authors of studies on cancer and human subjects were least likely to make their datasets available. These results suggest research data sharing levels are still low and increasing only slowly, and data is least available in areas where it could make the biggest impact. Let's learn from those with high rates of sharing to embrace the full potential of our research output.
DESNT: A Poor Prognosis Category of Human Prostate Cancer.
Luca, Bogdan-Alexandru; Brewer, Daniel S; Edwards, Dylan R; Edwards, Sandra; Whitaker, Hayley C; Merson, Sue; Dennis, Nening; Cooper, Rosalin A; Hazell, Steven; Warren, Anne Y; Eeles, Rosalind; Lynch, Andy G; Ross-Adams, Helen; Lamb, Alastair D; Neal, David E; Sethia, Krishna; Mills, Robert D; Ball, Richard Y; Curley, Helen; Clark, Jeremy; Moulton, Vincent; Cooper, Colin S
2017-03-06
A critical problem in the clinical management of prostate cancer is that it is highly heterogeneous. Accurate prediction of individual cancer behaviour is therefore not achievable at the time of diagnosis leading to substantial overtreatment. It remains an enigma that, in contrast to breast cancer, unsupervised analyses of global expression profiles have not currently defined robust categories of prostate cancer with distinct clinical outcomes. To devise a novel classification framework for human prostate cancer based on unsupervised mathematical approaches. Our analyses are based on the hypothesis that previous attempts to classify prostate cancer have been unsuccessful because individual samples of prostate cancer frequently have heterogeneous compositions. To address this issue, we applied an unsupervised Bayesian procedure called Latent Process Decomposition to four independent prostate cancer transcriptome datasets obtained using samples from prostatectomy patients and containing between 78 and 182 participants. Biochemical failure was assessed using log-rank analysis and Cox regression analysis. Application of Latent Process Decomposition identified a common process in all four independent datasets examined. Cancers assigned to this process (designated DESNT cancers) are characterized by low expression of a core set of 45 genes, many encoding proteins involved in the cytoskeleton machinery, ion transport, and cell adhesion. For the three datasets with linked prostate-specific antigen failure data following prostatectomy, patients with DESNT cancer exhibited poor outcome relative to other patients (p=2.65×10⁻⁵, p=4.28×10⁻⁵, and p=2.98×10⁻⁸). When these three datasets were combined the independent predictive value of DESNT membership was p=1.61×10⁻⁷ compared with p=1.00×10⁻⁵ for Gleason sum. A limitation of the study is that only prediction of prostate-specific antigen failure was examined.
Our results demonstrate the existence of a novel poor prognosis category of human prostate cancer and will assist in the targeting of therapy, helping avoid treatment-associated morbidity in men with indolent disease. Prostate cancer, unlike breast cancer, does not have a robust classification framework. We propose that this failure has occurred because prostate cancer samples selected for analysis frequently have heterogeneous compositions (individual samples are made up of many different parts that each have different characteristics). Applying a mathematical approach that can overcome this problem we identify a novel poor prognosis category of human prostate cancer called DESNT. Copyright © 2017 European Association of Urology. Published by Elsevier B.V. All rights reserved.
Pichler, Martin; Stiegelbauer, Verena; Vychytilova-Faltejskova, Petra; Ivan, Cristina; Ling, Hui; Winter, Elke; Zhang, Xinna; Goblirsch, Matthew; Wulf-Goldenberg, Annika; Ohtsuka, Masahisa; Haybaeck, Johannes; Svoboda, Marek; Okugawa, Yoshinaga; Gerger, Armin; Hoefler, Gerald; Goel, Ajay; Slaby, Ondrej; Calin, George Adrian
2017-03-01
Purpose: Characterization of colorectal cancer transcriptome by high-throughput techniques has enabled the discovery of several differentially expressed genes involving previously unreported miRNA abnormalities. Here, we followed a systematic approach on a global scale to identify miRNAs as clinical outcome predictors and further validated them in the clinical and experimental setting. Experimental Design: Genome-wide miRNA sequencing data of 228 colorectal cancer patients from The Cancer Genome Atlas dataset were analyzed as a screening cohort to identify miRNAs significantly associated with survival according to stringent prespecified criteria. A panel of six miRNAs was further validated for their prognostic utility in a large independent validation cohort (n = 332). In situ hybridization and functional experiments in a panel of colorectal cancer cell lines and xenografts further clarified the role of clinically relevant miRNAs. Results: Six miRNAs (miR-92b-3p, miR-188-3p, miR-221-5p, miR-331-3p, miR-425-3p, and miR-497-5p) were identified as strong predictors of survival in the screening cohort. High miR-188-3p expression proved to be an independent prognostic factor [screening cohort: HR = 4.137; 95% confidence interval (CI), 1.568-10.917; P = 0.004; validation cohort: HR = 1.538; 95% CI, 1.107-2.137; P = 0.010]. Forced miR-188-3p expression increased migratory behavior of colorectal cancer cells in vitro and metastases formation in vivo (P < 0.05). The promigratory role of miR-188-3p is mediated by direct interaction with MLLT4, a newly identified player involved in colorectal cancer cell migration. Conclusions: miR-188-3p is a novel independent prognostic factor in colorectal cancer patients, which can be partly explained by its effect on MLLT4 expression and migration of cancer cells. Clin Cancer Res; 23(5); 1323-33. ©2016 American Association for Cancer Research.
CrossLink: a novel method for cross-condition classification of cancer subtypes.
Ma, Chifeng; Sastry, Konduru S; Flore, Mario; Gehani, Salah; Al-Bozom, Issam; Feng, Yusheng; Serpedin, Erchin; Chouchane, Lotfi; Chen, Yidong; Huang, Yufei
2016-08-22
We considered the prediction of cancer classes (e.g. subtypes) using patient gene expression profiles that contain both systematic and condition-specific biases when compared with the training reference dataset. The conventional normalization-based approaches cannot guarantee that the gene signatures in the reference and prediction datasets always have the same distribution for all different conditions, as the class-specific gene signatures change with the condition. Therefore, the trained classifier would work well under one condition but not under another. To address the problem of current normalization approaches, we propose a novel algorithm called CrossLink (CL). CL recognizes that there is no universal, condition-independent normalization mapping of signatures. Instead, it exploits the fact that the signature is unique to its associated class under any condition and thus employs an unsupervised clustering algorithm to discover this unique signature. We assessed the performance of CL for cross-condition predictions of PAM50 subtypes of breast cancer by using a simulated dataset modeled after TCGA BRCA tumor samples with a cross-validation scheme, and datasets with known and unknown PAM50 classification. CL achieved prediction accuracy >73%, the highest among the methods we evaluated. We also applied the algorithm to a set of breast cancer tumors derived from an Arabic population to assign a PAM50 classification to each tumor based on their gene expression profiles. A novel algorithm CrossLink for cross-condition prediction of cancer classes was proposed. In all test datasets, CL showed robust and consistent improvement in prediction performance over other state-of-the-art normalization and classification algorithms.
EBprot: Statistical analysis of labeling-based quantitative proteomics data.
Koh, Hiromi W L; Swa, Hannah L F; Fermin, Damian; Ler, Siok Ghee; Gunaratne, Jayantha; Choi, Hyungwon
2015-08-01
Labeling-based proteomics is a powerful method for detection of differentially expressed proteins (DEPs). The current data analysis platform typically relies on protein-level ratios, which are obtained by summarizing peptide-level ratios for each protein. In shotgun proteomics, however, some proteins are quantified with more peptides than others, and this reproducibility information is not incorporated into the differential expression (DE) analysis. Here, we propose a novel probabilistic framework EBprot that directly models the peptide-protein hierarchy and rewards the proteins with reproducible evidence of DE over multiple peptides. To evaluate its performance with known DE states, we conducted a simulation study to show that the peptide-level analysis of EBprot provides better receiver-operating characteristic and more accurate estimation of the false discovery rates than the methods based on protein-level ratios. We also demonstrate superior classification performance of peptide-level EBprot analysis in a spike-in dataset. To illustrate the wide applicability of EBprot in different experimental designs, we applied EBprot to a dataset for lung cancer subtype analysis with biological replicates and another dataset for time course phosphoproteome analysis of EGF-stimulated HeLa cells with multiplexed labeling. Through these examples, we show that the peptide-level analysis of EBprot is a robust alternative to the existing statistical methods for the DE analysis of labeling-based quantitative datasets. The software suite is freely available on the Sourceforge website http://ebprot.sourceforge.net/. All MS data have been deposited in the ProteomeXchange with identifier PXD001426 (http://proteomecentral.proteomexchange.org/dataset/PXD001426/). © 2015 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Paraskevas, K I; Kalmykov, E L; Naylor, A R
2016-01-01
Randomised trials have reported higher stroke/death rates after carotid artery stenting (CAS) versus carotid endarterectomy (CEA). Despite this, the 2011 American Heart Association (AHA) guidelines expanded CAS indications, partly because of the Carotid Revascularization Endarterectomy versus Stenting Trial, but also because of improving outcomes in industry sponsored CAS Registries. The aim of this systematic review was: (i) to compare stroke/death rates after CAS/CEA in contemporary dataset registries, (ii) to examine whether published stroke/death rates after CAS fall within AHA thresholds, and, (iii) to see if there had been a decline (over time) in procedural risk after CAS/CEA. PubMed/Medline, Embase, and Cochrane databases were systematically searched according to the recommendations of the PRISMA statement from January 1, 2008 until February 23, 2015 for administrative dataset registries reporting outcomes after both CEA and CAS. Twenty-one registries reported outcomes involving more than 1,500,000 procedures. Stroke/death after CAS was significantly higher than after CEA in 11/21 registries (52%) involving "average risk for CEA" asymptomatic patients and in 11/18 registries (61%) involving "average risk for CEA" symptomatic patients. In another five registries, CAS was associated with higher stroke/death rates than CEA for both symptomatic and asymptomatic patients, but formal statistical comparison was not reported. CAS was associated with stroke/death rates that exceeded risk thresholds recommended by the AHA in 9/21 registries (43%) involving "average risk for CEA" asymptomatic patients and in 13/18 registries (72%) involving "average risk for CEA" symptomatic patients. In 5/18 registries (28%), the procedural risk after CAS in "average risk" symptomatic patients exceeded 10%. 
Data from contemporary administrative dataset registries suggest that stroke/death rates following CAS remain significantly higher than after CEA and often exceed accepted AHA thresholds. There was no evidence of a sustained decline in procedural risk after CAS. Copyright © 2015 European Society for Vascular Surgery. Published by Elsevier Ltd. All rights reserved.
Welham, Nathan V.; Ling, Changying; Dawson, John A.; Kendziorski, Christina; Thibeault, Susan L.; Yamashita, Masaru
2015-01-01
The vocal fold (VF) mucosa confers elegant biomechanical function for voice production but is susceptible to scar formation following injury. Current understanding of VF wound healing is hindered by a paucity of data and is therefore often generalized from research conducted in skin and other mucosal systems. Here, using a previously validated rat injury model, expression microarray technology and an empirical Bayes analysis approach, we generated a VF-specific transcriptome dataset to better capture the system-level complexity of wound healing in this specialized tissue. We measured differential gene expression at 3, 14 and 60 days post-injury compared to experimentally naïve controls, pursued functional enrichment analyses to refine and add greater biological definition to the previously proposed temporal phases of VF wound healing, and validated the expression and localization of a subset of previously unidentified repair- and regeneration-related genes at the protein level. Our microarray dataset is a resource for the wider research community and has the potential to stimulate new hypotheses and avenues of investigation, improve biological and mechanistic insight, and accelerate the identification of novel therapeutic targets. PMID:25592437
Endale, Mehari; Ahlfeld, Shawn; Bao, Erik; Chen, Xiaoting; Green, Jenna; Bess, Zach; Weirauch, Matthew; Xu, Yan; Perl, Anne Karina
2017-08-01
The following data are derived from key stages of acinar lung development and define the developmental role of lung interstitial fibroblasts expressing platelet-derived growth factor receptor alpha (PDGFRα). This dataset is related to the research article entitled "Temporal, spatial, and phenotypical changes of PDGFRα expressing fibroblasts during late lung development" (Endale et al., 2017) [1]. At E16.5 (canalicular), E18.5 (saccular), P7 (early alveolar) and P28 (late alveolar), PDGFRα GFP mice, in conjunction with immunohistochemical markers, were utilized to define the spatiotemporal relationship of PDGFRα+ fibroblasts to endothelial, stromal and epithelial cells in both the proximal and distal acinar lung. Complementary analysis with flow cytometry was employed to determine changes in cellular proliferation, define lipofibroblast and myofibroblast populations via the presence of intracellular lipid or alpha smooth muscle actin (αSMA), and evaluate the expression of CD34, CD29, and Sca-1. Finally, PDGFRα+ cells isolated at each stage of acinar lung development were subjected to RNA-Seq analysis, and the data were subjected to Bayesian timeline analysis and transcription factor promoter enrichment analysis.
Efficient Spatio-Temporal Local Binary Patterns for Spontaneous Facial Micro-Expression Recognition
Wang, Yandan; See, John; Phan, Raphael C.-W.; Oh, Yee-Hui
2015-01-01
Micro-expression recognition is still in the preliminary stage, owing much to the numerous difficulties faced in the development of datasets. Since micro-expression is an important affective clue for clinical diagnosis and deceit analysis, much effort has gone into the creation of these datasets for research purposes. There are currently two publicly available spontaneous micro-expression datasets, SMIC and CASME II, both with baseline results released using the widely used dynamic texture descriptor LBP-TOP for feature extraction. Although LBP-TOP is popular and widely used, it is still not compact enough. In this paper, we draw further inspiration from the concept of LBP-TOP that considers three orthogonal planes by proposing two efficient approaches for feature extraction. The compact robust form described by the proposed LBP-Six Intersection Points (SIP) and a super-compact LBP-Three Mean Orthogonal Planes (MOP) not only preserve the essential patterns, but also reduce the redundancy that affects the discriminability of the encoded features. Through a comprehensive set of experiments, we demonstrate the strengths of our approaches in terms of recognition accuracy and efficiency. PMID:25993498
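For readers unfamiliar with the descriptor family discussed above, the following is a minimal sketch of the basic 8-neighbour local binary pattern on a single 2D plane. LBP-TOP applies this kind of coding on three orthogonal planes of a video volume and concatenates the resulting histograms; the sketch covers only the single-plane case and does not implement the proposed SIP or MOP variants. The function names and the neighbour ordering are illustrative assumptions.

```python
def lbp_code(image, row, col):
    """Return the 8-bit local binary pattern code of one interior pixel.

    image is a list of rows of grey-level values; (row, col) must not lie
    on the image border.
    """
    center = image[row][col]
    # Neighbours visited clockwise, starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        # A neighbour at least as bright as the centre sets its bit.
        if image[row + dr][col + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """Histogram of LBP codes over all interior pixels (256 bins)."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            hist[lbp_code(image, r, c)] += 1
    return hist
```

A feature vector for a video clip would then be built from such histograms, which is where the compactness concerns raised in the abstract come from: three planes of 256 bins each add up quickly once the clip is split into blocks.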
Chromatin regulators as a guide for cancer treatment choice
Gurard-Levin, Zachary A.; Wilson, Laurence O.W.; Pancaldi, Vera; Postel-Vinay, Sophie; Sousa, Fabricio G.; Reyes, Cecile; Marangoni, Elisabetta; Gentien, David; Valencia, Alfonso; Pommier, Yves; Cottu, Paul; Almouzni, Geneviève
2016-01-01
The limited capacity to predict a patient’s response to distinct chemotherapeutic agents is a major hurdle in cancer management. The efficiency of a large fraction of current cancer therapeutics (radio- and chemotherapies) is influenced by chromatin structure. Reciprocally, alterations in chromatin organization may impact resistance mechanisms. Here, we explore how the mis-expression of chromatin regulators (factors involved in the establishment and maintenance of functional chromatin domains) can inform about the extent of docetaxel response. We exploit Affymetrix and NanoString gene expression data for a set of chromatin regulators generated from breast cancer patient-derived xenograft (PDX) models and patient samples treated with docetaxel. Random Forest classification reveals specific panels of chromatin regulators, including key components of the SWI/SNF chromatin remodeler, which readily distinguish docetaxel high-responders and poor-responders. Further exploration of SWI/SNF components in the comprehensive NCI-60 dataset reveals that the expression inversely correlates with docetaxel sensitivity. Finally, we show that loss of the SWI/SNF subunit BRG1 (SMARCA4) in a model cell line leads to enhanced docetaxel sensitivity. Altogether, our findings point towards chromatin regulators as biomarkers for drug response as well as therapeutic targets to sensitize patients towards docetaxel and combat drug resistance. PMID:27196757
Gene Expression Analysis to Assess the Relevance of Rodent Models to Human Lung Injury.
Sweeney, Timothy E; Lofgren, Shane; Khatri, Purvesh; Rogers, Angela J
2017-08-01
The relevance of animal models to human diseases is an area of intense scientific debate. The degree to which mouse models of lung injury recapitulate human lung injury has never been assessed. Integrating data from both human and animal expression studies allows for increased statistical power and identification of conserved differential gene expression across organisms and conditions. We sought comprehensive integration of gene expression data in experimental acute lung injury (ALI) in rodents compared with humans. We performed two separate gene expression multicohort analyses to determine differential gene expression in experimental animal and human lung injury. We used correlational and pathway analyses combined with external in vitro gene expression data to identify both potential drivers of underlying inflammation and therapeutic drug candidates. We identified 21 animal lung tissue datasets and three human lung injury bronchoalveolar lavage datasets. We show that the metasignatures of animal and human experimental ALI are significantly correlated despite these widely varying experimental conditions. The gene expression changes among mice and rats across diverse injury models (ozone, ventilator-induced lung injury, LPS) are significantly correlated with human models of lung injury (Pearson r = 0.33-0.45, P < 1E-16). Neutrophil signatures are enriched in both animal and human lung injury. Predicted therapeutic targets, peptide ligand signatures, and pathway analyses are also all highly overlapping. Gene expression changes are similar in animal and human experimental ALI, and provide several physiologic and therapeutic insights to the disease.
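The cross-species comparison above is summarized by a Pearson correlation between gene expression changes. A minimal sketch of that statistic follows, assuming one effect size per shared gene in each species. The function name is illustrative and any values passed to it would be hypothetical, not data from the study; the multicohort analysis that produces the effect sizes is not shown.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squared deviations and of cross-products around the means.
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)
```

Here `x` would hold the per-gene effect sizes from the animal datasets and `y` those of the same genes, in the same order, from the human datasets.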
Predictive Models of Cognitive Outcomes of Developmental Insults
NASA Astrophysics Data System (ADS)
Chan, Yupo; Bouaynaya, Nidhal; Chowdhury, Parimal; Leszczynska, Danuta; Patterson, Tucker A.; Tarasenko, Olga
2010-04-01
Representatives of Arkansas medical, research and educational institutions have gathered over the past four years to discuss the relationship between functional developmental perturbations and their neurological consequences. We wish to track the effect on the nervous system by developmental perturbations over time and across species. Except for perturbations, the sequence of events that occur during neural development was found to be remarkably conserved across mammalian species. The tracking includes consequences on anatomical regions and behavioral changes. The ultimate goal is to develop a predictive model of long-term genotypic and phenotypic outcomes that includes developmental insults. Such a model can subsequently be fostered into an educated intervention for therapeutic purposes. Several datasets were identified to test plausible hypotheses, ranging from evoked potential datasets to sleep-disorder datasets. An initial model may be mathematical and conceptual. However, we expect to see rapid progress as large-scale gene expression studies in the mammalian brain permit genome-wide searches to discover genes that are uniquely expressed in brain circuits and regions. These genes ultimately control behavior. By using a validated model we endeavor to make useful predictions.
Sewer, Alain; Gubian, Sylvain; Kogel, Ulrike; Veljkovic, Emilija; Han, Wanjiang; Hengstermann, Arnd; Peitsch, Manuel C; Hoeng, Julia
2014-05-17
High-quality expression data are required to investigate the biological effects of microRNAs (miRNAs). The goal of this study was, first, to assess the quality of miRNA expression data based on microarray technologies and, second, to consolidate it by applying a novel normalization method. Indeed, because of significant differences in platform designs, miRNA raw data cannot be normalized blindly with standard methods developed for gene expression. This fundamental observation motivated the development of a novel multi-array normalization method based on controllable assumptions, which uses the spike-in control probes to adjust the measured intensities across arrays. Raw expression data were obtained with the Exiqon dual-channel miRCURY LNA™ platform in the "common reference design" and processed as "pseudo-single-channel". They were used to apply several quality metrics based on the coefficient of variation and to test the novel spike-in controls based normalization method. Most of the considerations presented here could be applied to raw data obtained with other platforms. To assess the normalization method, it was compared with 13 other available approaches from both data quality and biological outcome perspectives. The results showed that the novel multi-array normalization method reduced the data variability in the most consistent way. Further, the reliability of the obtained differential expression values was confirmed based on a quantitative reverse transcription-polymerase chain reaction experiment performed for a subset of miRNAs. The results reported here support the applicability of the novel normalization method, in particular to datasets that display global decreases in miRNA expression similarly to the cigarette smoke-exposed mouse lung dataset considered in this study. 
Quality metrics to assess between-array variability were used to confirm that the novel spike-in controls based normalization method provided high-quality miRNA expression data suitable for reliable downstream analysis. The multi-array miRNA raw data normalization method was implemented in an R software package called ExiMiR and deposited in the Bioconductor repository.
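Two ideas from the abstract above can be pictured with a short sketch: a coefficient of variation as a variability metric, and the use of spike-in control probes to adjust intensities across arrays. The code below is a deliberately simplified assumption, a constant per-array shift on log2 intensities so that the spike-in means coincide. It is not the algorithm implemented in ExiMiR, and the function names and probe identifiers are hypothetical.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)


def center_on_spike_ins(arrays, spike_ids):
    """Shift each array so that its mean spike-in intensity is shared.

    arrays    : list of dicts mapping probe id -> log2 intensity
    spike_ids : probe ids of the spike-in controls (hypothetical names)
    """
    spike_means = [statistics.mean([a[p] for p in spike_ids]) for a in arrays]
    target = statistics.mean(spike_means)
    # Apply one additive offset per array (an assumed, simplified model).
    return [{probe: value + (target - m) for probe, value in a.items()}
            for a, m in zip(arrays, spike_means)]
```

Because the spike-in probes are added in known amounts, differences between their measured intensities reflect technical rather than biological variation, which is the rationale for anchoring the adjustment on them.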
2014-01-01
Background High-quality expression data are required to investigate the biological effects of microRNAs (miRNAs). The goal of this study was, first, to assess the quality of miRNA expression data based on microarray technologies and, second, to consolidate it by applying a novel normalization method. Indeed, because of significant differences in platform designs, miRNA raw data cannot be normalized blindly with standard methods developed for gene expression. This fundamental observation motivated the development of a novel multi-array normalization method based on controllable assumptions, which uses the spike-in control probes to adjust the measured intensities across arrays. Results Raw expression data were obtained with the Exiqon dual-channel miRCURY LNA™ platform in the “common reference design” and processed as “pseudo-single-channel”. They were used to apply several quality metrics based on the coefficient of variation and to test the novel spike-in controls based normalization method. Most of the considerations presented here could be applied to raw data obtained with other platforms. To assess the normalization method, it was compared with 13 other available approaches from both data quality and biological outcome perspectives. The results showed that the novel multi-array normalization method reduced the data variability in the most consistent way. Further, the reliability of the obtained differential expression values was confirmed based on a quantitative reverse transcription–polymerase chain reaction experiment performed for a subset of miRNAs. The results reported here support the applicability of the novel normalization method, in particular to datasets that display global decreases in miRNA expression similarly to the cigarette smoke-exposed mouse lung dataset considered in this study. 
Conclusions Quality metrics to assess between-array variability were used to confirm that the novel spike-in controls based normalization method provided high-quality miRNA expression data suitable for reliable downstream analysis. The multi-array miRNA raw data normalization method was implemented in an R software package called ExiMiR and deposited in the Bioconductor repository. PMID:24886675
NASA Astrophysics Data System (ADS)
Nawir, Mukrimah; Amir, Amiza; Lynn, Ong Bi; Yaakob, Naimah; Badlishah Ahmad, R.
2018-05-01
The rapid growth of networked technologies exposes them to various network attacks, because data are frequently exchanged over the Internet and large-scale data must be handled. Moreover, network anomaly detection using machine learning is hampered by the scarcity of publicly available labelled network datasets, which has led many researchers to keep using the most common network dataset (KDDCup99), even though it is no longer relevant for evaluating machine learning (ML) classification algorithms. Several issues regarding the available labelled network datasets are discussed in this paper. The aim of this paper is to build a network anomaly detection system using machine learning algorithms that is efficient, effective and fast. The findings showed that the AODE algorithm performed well in terms of accuracy and processing time for binary classification on the UNSW-NB15 dataset.
Exploring Relationships in Big Data
NASA Astrophysics Data System (ADS)
Mahabal, A.; Djorgovski, S. G.; Crichton, D. J.; Cinquini, L.; Kelly, S.; Colbert, M. A.; Kincaid, H.
2015-12-01
Big Data are characterized by several different 'V's: Volume, Veracity, Volatility, Value and so on. For many datasets, Volumes inflated by redundant features often make the data noisier and make it harder to extract Value. This is especially true if one is comparing or combining different datasets whose metadata are diverse. We have been exploring ways to exploit such datasets through a variety of statistical machinery and visualization. We show how we have applied this to time-series from large astronomical sky-surveys. This was done in the Virtual Observatory framework. More recently we have been doing similar work in a completely different domain, namely biology and cancer. The methodology reuse involves application to diverse datasets gathered through the various centers associated with the Early Detection Research Network (EDRN) for cancer, an initiative of the National Cancer Institute (NCI). Application to Geo datasets is a natural extension.
Kavuluru, Ramakanth; Rios, Anthony; Lu, Yuan
2015-01-01
Background Diagnosis codes are assigned to medical records in healthcare facilities by trained coders by reviewing all physician authored documents associated with a patient's visit. This is a necessary and complex task involving coders adhering to coding guidelines and coding all assignable codes. With the popularity of electronic medical records (EMRs), computational approaches to code assignment have been proposed in the recent years. However, most efforts have focused on single and often short clinical narratives, while realistic scenarios warrant full EMR level analysis for code assignment. Objective We evaluate supervised learning approaches to automatically assign international classification of diseases (ninth revision) - clinical modification (ICD-9-CM) codes to EMRs by experimenting with a large realistic EMR dataset. The overall goal is to identify methods that offer superior performance in this task when considering such datasets. Methods We use a dataset of 71,463 EMRs corresponding to in-patient visits with discharge date falling in a two year period (2011–2012) from the University of Kentucky (UKY) Medical Center. We curate a smaller subset of this dataset and also use a third gold standard dataset of radiology reports. We conduct experiments using different problem transformation approaches with feature and data selection components and employing suitable label calibration and ranking methods with novel features involving code co-occurrence frequencies and latent code associations. Results Over all codes with at least 50 training examples we obtain a micro F-score of 0.48. On the set of codes that occur at least in 1% of the two year dataset, we achieve a micro F-score of 0.54. For the smaller radiology report dataset, the classifier chaining approach yields best results. For the smaller subset of the UKY dataset, feature selection, data selection, and label calibration offer best performance. 
Conclusions We show that datasets at different scale (size of the EMRs, number of distinct codes) and with different characteristics warrant different learning approaches. For shorter narratives pertaining to a particular medical subdomain (e.g., radiology, pathology), classifier chaining is ideal given the codes are highly related with each other. For realistic in-patient full EMRs, feature and data selection methods offer high performance for smaller datasets. However, for large EMR datasets, we observe that the binary relevance approach with learning-to-rank based code reranking offers the best performance. Regardless of the training dataset size, for general EMRs, label calibration to select the optimal number of labels is an indispensable final step. PMID:26054428
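The results above are reported as micro F-scores. For a multi-label task such as code assignment, the micro-averaged score pools true positives, false positives and false negatives over all records before computing precision and recall. A minimal sketch follows; the function name is illustrative, and the records in the usage note are hypothetical examples rather than data from the study.

```python
def micro_f1(true_sets, predicted_sets):
    """Micro-averaged F-score for multi-label predictions.

    true_sets, predicted_sets : lists of sets of codes, one set per record
    """
    tp = fp = fn = 0
    for truth, pred in zip(true_sets, predicted_sets):
        tp += len(truth & pred)   # codes correctly assigned
        fp += len(pred - truth)   # codes assigned but not in the gold set
        fn += len(truth - pred)   # gold codes that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For example, with gold codes `[{"401.9", "250.00"}, {"428.0"}]` and predictions `[{"401.9"}, {"428.0", "250.00"}]`, there are two true positives, one false positive and one false negative, so precision and recall are both 2/3. Because counts are pooled, frequent codes dominate the micro average, which is consistent with the higher score reported for codes occurring in at least 1% of the dataset.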
Kavuluru, Ramakanth; Rios, Anthony; Lu, Yuan
2015-10-01
Diagnosis codes are assigned to medical records in healthcare facilities by trained coders by reviewing all physician authored documents associated with a patient's visit. This is a necessary and complex task involving coders adhering to coding guidelines and coding all assignable codes. With the popularity of electronic medical records (EMRs), computational approaches to code assignment have been proposed in the recent years. However, most efforts have focused on single and often short clinical narratives, while realistic scenarios warrant full EMR level analysis for code assignment. We evaluate supervised learning approaches to automatically assign international classification of diseases (ninth revision) - clinical modification (ICD-9-CM) codes to EMRs by experimenting with a large realistic EMR dataset. The overall goal is to identify methods that offer superior performance in this task when considering such datasets. We use a dataset of 71,463 EMRs corresponding to in-patient visits with discharge date falling in a two year period (2011-2012) from the University of Kentucky (UKY) Medical Center. We curate a smaller subset of this dataset and also use a third gold standard dataset of radiology reports. We conduct experiments using different problem transformation approaches with feature and data selection components and employing suitable label calibration and ranking methods with novel features involving code co-occurrence frequencies and latent code associations. Over all codes with at least 50 training examples we obtain a micro F-score of 0.48. On the set of codes that occur at least in 1% of the two year dataset, we achieve a micro F-score of 0.54. For the smaller radiology report dataset, the classifier chaining approach yields best results. For the smaller subset of the UKY dataset, feature selection, data selection, and label calibration offer best performance. 
We show that datasets at different scale (size of the EMRs, number of distinct codes) and with different characteristics warrant different learning approaches. For shorter narratives pertaining to a particular medical subdomain (e.g., radiology, pathology), classifier chaining is ideal given the codes are highly related with each other. For realistic in-patient full EMRs, feature and data selection methods offer high performance for smaller datasets. However, for large EMR datasets, we observe that the binary relevance approach with learning-to-rank based code reranking offers the best performance. Regardless of the training dataset size, for general EMRs, label calibration to select the optimal number of labels is an indispensable final step. Copyright © 2015 Elsevier B.V. All rights reserved.
Capurro, Alberto; Bodea, Liviu-Gabriel; Schaefer, Patrick; Luthi-Carter, Ruth; Perreau, Victoria M.
2015-01-01
The characterization of molecular changes in diseased tissues gives insight into pathophysiological mechanisms and is important for therapeutic development. Genome-wide gene expression analysis has proven valuable for identifying biological processes in neurodegenerative diseases using post mortem human brain tissue and numerous datasets are publicly available. However, many studies utilize heterogeneous tissue samples consisting of multiple cell types, all of which contribute to global gene expression values, confounding biological interpretation of the data. In particular, changes in numbers of neuronal and glial cells occurring in neurodegeneration confound transcriptomic analyses, particularly in human brain tissues where sample availability and controls are limited. To identify cell specific gene expression changes in neurodegenerative disease, we have applied our recently published computational deconvolution method, population specific expression analysis (PSEA). PSEA estimates cell-type-specific expression values using reference expression measures, which in the case of brain tissue comprise mRNAs with cell-type-specific expression in neurons, astrocytes, oligodendrocytes and microglia. As an exercise in PSEA implementation and hypothesis development regarding neurodegenerative diseases, we applied PSEA to Parkinson's and Huntington's disease (PD, HD) datasets. Genes identified as differentially expressed in substantia nigra pars compacta neurons by PSEA were validated using external laser capture microdissection data. Network analysis and Annotation Clustering (DAVID) identified molecular processes implicated by differential gene expression in specific cell types. The results of these analyses provided new insights into the implementation of PSEA in brain tissues and additional refinement of molecular signatures in human HD and PD. PMID:25620908
Lin, Lin; Wang, Guangzhi; Ming, Jianguang; Meng, Xiangqi; Han, Bo; Sun, Bo; Cai, Jinquan; Jiang, Chuanlu
2016-11-01
Gliomas are the most common primary intracranial malignant tumors in adults. Surgical resection followed by optional radiotherapy and chemotherapy is the current standard therapy for glioma patients. Vimentin, a protein of the intermediate filament family, can maintain cellular integrity and participate in several cell signaling pathways to modulate the motility and invasion of cancer cells. The purpose of the present research was to identify the relationship between vimentin expression and clinical characteristics and to assess the prognostic and predictive ability of vimentin in patients with glioma. To determine the expression of vimentin in glioma tissues, paraffin-embedded blocks obtained from glioma patients by surgical resection were evaluated by immunohistochemistry. To further investigate the association of vimentin expression with survival, we used vimentin mRNA expression data from the Chinese Glioma Genome Atlas (CGGA) and the GSE16011 dataset. Kaplan-Meier analysis and a Cox regression model were used for statistical analysis. We detected positive vimentin staining in 84% of high-grade compared with 47% of low-grade glioma patients. Additionally, vimentin mRNA expression was correlated with glioma grade in both the CGGA and GSE16011 datasets. Patients with low vimentin expression had longer survival than those with high expression. In multivariate analysis, vimentin was an independent significant prognostic factor for high-grade glioma patients. We also found that glioblastoma patients with low vimentin expression had a better response to temozolomide therapy. Vimentin expression has a significant association with tumor grade and overall survival of high-grade glioma patients. Patients with low vimentin expression may benefit from temozolomide therapy.
Loh, Swee Cheng; Thottathil, Gincy P; Othman, Ahmad Sofiman
2016-10-01
The Para rubber tree, Hevea brasiliensis, is the main crop involved in industrial rubber production due to the superior quality of its natural rubber. The Hevea bark is commercially exploited to obtain latex, which is produced from the articulated secondary laticifer. The laticifer is well defined in terms of morphology; however, only some genes associated with its development have been reported. We successfully induced secondary laticifer in the jasmonic acid (JA)-treated and linolenic acid (LA)-treated Hevea bark, but secondary laticifer was not observed in the ethephon (ET)-treated and untreated Hevea bark. In this study, we analysed 27,195 gene models using NimbleGen microarrays based on the Hevea draft genome. 491 filtered differentially expressed (FDE) transcripts that are common to both JA- and LA-treated bark samples but not ET-treated bark samples were identified. In the Eukaryotic Orthologous Group (KOG) analysis, the 491 FDE transcripts belong to different functional categories that reflect the diverse processes and pathways involved in laticifer differentiation. In the Kyoto Encyclopedia of Genes and Genomes (KEGG) and KOG analysis, the profile of the FDE transcripts suggests that JA- and LA-treated bark samples have a sufficient molecular basis for secondary laticifer differentiation, especially regarding secondary metabolite metabolism. FDE genes in this category are from the cytochrome (CYP) P450 family, ATP-binding cassette (ABC) transporter family, short-chain dehydrogenase/reductase (SDR) family, or cinnamyl alcohol dehydrogenase (CAD) family. The data include many genes involved in cell division, cell wall synthesis, and cell differentiation. The most abundant transcript in the FDE list was SDR65C, reflecting its importance in laticifer differentiation. 
Using the Basic Local Alignment Search Tool (BLAST) as part of annotation and functional prediction, several characterised as well as uncharacterized transcription factors and genes were found in the dataset. Hence, the further characterization of these genes is necessary to unveil their role in laticifer differentiation. This study provides a platform for the further characterization and identification of the key genes involved in secondary laticifer differentiation. Copyright © 2016 Elsevier Masson SAS. All rights reserved.
Vavougios, George D; Zarogiannis, Sotirios G; Krogfelt, Karen Angeliki; Gourgoulianis, Konstantinos; Mitsikostas, Dimos Dimitrios; Hadjigeorgiou, Georgios
2018-01-01
Only 4 studies have explored the potential role of PARK7's dysregulation in MS pathophysiology, and no study has evaluated the potential role of the PARK7 interactome in MS. The aim of our study was to assess the differential expression of PARK7 mRNA in peripheral blood mononuclear cells (PBMCs) donated from MS patients versus healthy controls using data mining techniques. The PARK7 interactome data from the GDS3920 profile were scrutinized for differentially expressed genes (DEGs); Gene Enrichment Analysis (GEA) was used to detect significantly enriched biological functions. 27 differentially expressed genes in the MS dataset were detected; 12 of these (NDUFA4, UBA2, TDP2, NPM1, NDUFS3, SUMO1, PIAS2, KIAA0101, RBBP4, NONO, RBBP7 and HSPA4) are reported for the first time in MS. Stepwise Linear Discriminant Function Analysis constructed a predictive model (Wilks' λ = 0.176, χ² = 45.204, p = 1.5275e-10) with 2 variables (TIDP2, RBBP4) that achieved 96.6% accuracy when discriminating between patients and controls. Gene Enrichment Analysis revealed that induction and regulation of programmed / intrinsic cell death represented the most salient Gene Ontology annotations. Cross-validation on systemic lupus erythematosus and ischemic stroke datasets revealed that these functions are unique to the MS dataset. Based on our results, novel potential target genes are revealed; these differentially expressed genes regulate epigenetic and apoptotic pathways that may further elucidate underlying mechanisms of autoreactivity in MS. Copyright © 2017 Elsevier B.V. All rights reserved.
Li, Hong-Mei; Yang, Hong; Wen, Dong-Yue; Luo, Yi-Huan; Liang, Chun-Yan; Pan, Deng-Hua; Ma, Wei; Chen, Gang; He, Yun; Chen, Jun-Qiang
2017-05-01
The role of long non-coding RNA (lncRNA) HOX transcript antisense RNA (HOTAIR) in thyroid carcinoma (TC) remains unclear. The current study aimed to assess the clinical value of HOTAIR expression levels in TC based on publicly available data and to evaluate its potential signaling pathways. The expression data of HOTAIR and clinical information concerning TC were downloaded from The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO), respectively. Furthermore, 3 online biological databases, Starbase, Cbioportal, and Multi Experiment Matrix, were used to identify HOTAIR-related genes in TC. Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), and Panther pathway analyses were then undertaken to study the most enriched signaling pathways in TC (EASE score<0.1, Bonferroni<0.05). The TCGA results demonstrated that the expression level of HOTAIR in TC tissues was significantly increased compared with non-cancerous tissues (p<0.001). HOTAIR over-expression was significantly associated with poor survival in TC patients (p=0.03). Meta-analyses of GEO datasets revealed a trend consistent with the above results on HOTAIR expression levels in TC (SMD=0.23; 95%CI, 0.00-0.45; p=0.047). Finally, the results of functional analysis for HOTAIR-related genes indicated that HOTAIR might participate in tumorigenesis via the Wnt signaling pathway. In conclusion, our study demonstrates that HOTAIR may be involved in thyroid carcinogenesis, and the over-expression of HOTAIR could act as a biomarker associated with a poor outcome in TC patients. Moreover, the Wnt signaling pathway may be the key pathway regulated by HOTAIR in TC. © Georg Thieme Verlag KG Stuttgart · New York.
Hu, Ting; Sun, Qian; Wu, Jianli; Lin, Xingguang; Luo, Danfeng; Sun, Chaoyang; Wang, Changyu; Zhou, Bo; Li, Na; Xia, Meng; Lu, Hao; Meng, Li; Xu, Xiaoyan; Hu, Junbo; Ma, Ding; Chen, Gang; Zhu, Tao
2016-01-01
Approximately 50-75% of patients with serous ovarian carcinoma (SOC) experience recurrence within 18 months after first-line treatment. Current clinical indicators are inadequate for predicting the risk of recurrence. In this study, we used 7 publicly available microarray datasets to identify gene signatures related to recurrence in optimally debulked SOC patients, and validated their expression in an independent clinical cohort of 127 patients using immunohistochemistry (IHC). We identified a two-gene signature including KCNN4 and S100A14 which was related to recurrence in optimally debulked SOC patients. Their mRNA expression levels were positively correlated and regulated by DNA copy number alterations (CNA) (KCNN4: p=1.918e-05) and DNA promoter methylation (KCNN4: p=0.0179; S100A14: p=2.787e-13). Recurrence prediction models built in the TCGA dataset based on KCNN4 and S100A14 individually and in combination showed good prediction performance in the other 6 datasets (AUC: 0.5442-0.9524). The independent cohort supported the expression difference between SOC recurrences. Also, a KCNN4 and S100A14-centered protein interaction subnetwork was built from the STRING database, and the shortest regulation path between them, called the KCNN4-UBA52-KLF4-S100A14 axis, was identified. This discovery might facilitate individualized treatment of SOC. PMID:27270322
Sun, Jie; Chen, Xihai; Wang, Zhenzhen; Guo, Maoni; Shi, Hongbo; Wang, Xiaojun; Cheng, Liang; Zhou, Meng
2015-11-09
Long non-coding RNAs (lncRNAs) have been implicated in a variety of biological processes, and dysregulated lncRNAs have demonstrated potential roles as biomarkers and therapeutic targets for cancer prognosis and treatment. In this study, by repurposing microarray probes, we analyzed lncRNA expression profiles of 916 breast cancer patients from the Gene Expression Omnibus (GEO). Nine lncRNAs were identified to be significantly associated with metastasis-free survival (MFS) in the training dataset of 254 patients using the Cox proportional hazards regression model. These nine lncRNAs were then combined to form a single prognostic signature for predicting metastatic risk in breast cancer patients that was able to classify patients in the training dataset into high- and low-risk subgroups with significantly different MFSs (median 2.4 years versus 3.0 years, log-rank test p < 0.001). This nine-lncRNA signature was similarly effective for prognosis in a testing dataset and two independent datasets. Further analysis showed that the predictive ability of the signature was independent of clinical variables, including age, ER status, ESR1 status and ERBB2 status. Our results indicated that lncRNA signature could be a useful prognostic marker to predict metastatic risk in breast cancer patients and may improve upon our understanding of the molecular mechanisms underlying breast cancer metastasis.
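The abstract above does not state how the nine lncRNAs are combined into one signature. A minimal sketch of one common approach, assumed here for illustration only, is a risk score formed as a weighted sum of expression values (weights playing the role of Cox regression coefficients), with patients split at the median score. The gene names, coefficients and patient values below are hypothetical, and the `>` median cutoff is an assumption.

```python
from statistics import median

def risk_score(expression, coefficients):
    """Weighted sum of expression values; the weights stand in for Cox coefficients."""
    return sum(coefficients[gene] * expression[gene] for gene in coefficients)

def split_by_median(scores):
    """Label each patient 'high' if the score is above the cohort median, else 'low'."""
    cutoff = median(scores.values())
    return {pid: ("high" if score > cutoff else "low") for pid, score in scores.items()}

# Hypothetical two-gene signature and four patients (not data from the study).
coefficients = {"lnc1": 0.5, "lnc2": -1.0}
patients = {
    "P1": {"lnc1": 2.0, "lnc2": 1.0},
    "P2": {"lnc1": 4.0, "lnc2": 0.0},
    "P3": {"lnc1": 0.0, "lnc2": 2.0},
    "P4": {"lnc1": 6.0, "lnc2": 2.0},
}

scores = {pid: risk_score(expr, coefficients) for pid, expr in patients.items()}
groups = split_by_median(scores)
print(scores)
print(groups)
```

The two resulting groups would then be compared with a log-rank test, which is outside the scope of this sketch.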
Wen, Qing; Kim, Chang-Sik; Hamilton, Peter W; Zhang, Shu-Dong
2016-05-11
Gene expression connectivity mapping has gained much popularity recently, with a number of successful applications in biomedical research testifying to its utility and promise. Previous methodological research in connectivity mapping mainly focused on two of the key components in the framework, namely, the reference gene expression profiles and the connectivity mapping algorithms. The other key component in this framework, the query gene signature, has been left to users to construct without much consensus on how this should be done, although it is the issue most relevant to end users. As a key input to the connectivity mapping process, the gene signature is crucially important in returning biologically meaningful and relevant results. This paper intends to formulate a standardized procedure for constructing high quality gene signatures from a user's perspective. We describe a two-stage process for making quality gene signatures using gene expression data as initial inputs. The first stage is a differential gene expression analysis comparing two distinct biological states; only the genes that have passed stringent statistical criteria are considered in the second stage of the process, which involves ranking genes based on statistical as well as biological significance. We introduce a "gene signature progression" method as a standard procedure in connectivity mapping. Starting from the highest ranked gene, we progressively determine the minimum length of the gene signature that allows connections to the reference profiles (drugs) being established with a preset target false discovery rate. We use a lung cancer dataset and a breast cancer dataset as two case studies to demonstrate how this standardized procedure works, and we show that highly relevant and interesting biological connections are returned. Of particular note is gefitinib, identified as among the candidate therapeutics in our lung cancer case study. 
Our gene signature was based on gene expression data from Taiwan female non-smoker lung cancer patients, while there is evidence from independent studies that gefitinib is highly effective in treating women, non-smoker or former light smoker, advanced non-small cell lung cancer patients of Asian origin. In summary, we introduced a gene signature progression method into connectivity mapping, which enables a standardized procedure for constructing high quality gene signatures. This progression method is particularly useful when the number of differentially expressed genes identified is large, and when there is a need to prioritize them to be included in the query signature. The results from two case studies demonstrate that the approach we have developed is capable of obtaining pertinent candidate drugs with high precision.
Malouf, Gabriel G; Su, Xiaoping; Yao, Hui; Gao, Jianjun; Xiong, Liangwen; He, Qiuming; Compérat, Eva; Couturier, Jérôme; Molinié, Vincent; Escudier, Bernard; Camparo, Philippe; Doss, Denaha J; Thompson, Erika J; Khayat, David; Wood, Christopher G; Yu, Willie; Teh, Bin T; Weinstein, John; Tannir, Nizar M
2014-08-01
MITF/TFE translocation renal cell carcinoma (TRCC) is a rare subtype of kidney cancer. Its incidence and the genome-wide characterization of its genetic origin have not been fully elucidated. We performed RNA and exome sequencing on an exploratory set of TRCC (n = 7), and validated our findings using The Cancer Genome Atlas (TCGA) clear-cell RCC (ccRCC) dataset (n = 460). Using the TCGA dataset, we identified seven TRCC (1.5%) cases and determined their genomic profile. We discovered three novel partners of MITF/TFE (LUC7L3, KHSRP, and KHDRBS2) that are involved in RNA splicing. TRCC displayed a unique gene expression signature as compared with other RCC types, and showed activation of MITF, the transforming growth factor β1 and the PI3K complex targets. Genes differentially spliced between TRCC and other RCC types were enriched for MITF and ID2 targets. Exome sequencing of TRCC revealed a distinct mutational spectrum as compared with ccRCC, with frequent mutations in chromatin-remodeling genes (six of eight cases, three of which were from the TCGA). In two cases, we identified mutations in INO80D, an ATP-dependent chromatin-remodeling gene, previously shown to control the amplitude of the S phase. Knockdown of INO80D decreased cell proliferation in a novel cell line bearing LUC7L3-TFE3 translocation. This genome-wide study defines the incidence of TRCC within a ccRCC-directed project and expands the genomic spectrum of TRCC by identifying novel MITF/TFE partners involved in RNA splicing and frequent mutations in chromatin-remodeling genes. ©2014 American Association for Cancer Research.
Exploring the Transcriptome of Ciliated Cells Using In Silico Dissection of Human Tissues
Ivliev, Alexander E.; 't Hoen, Peter A. C.; van Roon-Mom, Willeke M. C.; Peters, Dorien J. M.; Sergeeva, Marina G.
2012-01-01
Cilia are cell organelles that play important roles in cell motility, sensory and developmental functions and are involved in a range of human diseases, known as ciliopathies. Here, we search for novel human genes related to cilia using a strategy that exploits the previously reported tendency of cell type-specific genes to be coexpressed in the transcriptome of complex tissues. Gene coexpression networks were constructed using the noise-resistant WGCNA algorithm in 12 publicly available microarray datasets from human tissues rich in motile cilia: airways, fallopian tubes and brain. A cilia-related coexpression module was detected in 10 out of the 12 datasets. A consensus analysis of this module's gene composition recapitulated 297 known and predicted 74 novel cilia-related genes. 82% of the novel candidates were supported by tissue-specificity expression data from GEO and/or proteomic data from the Human Protein Atlas. The novel findings included a set of genes (DCDC2, DYX1C1, KIAA0319) related to a neurological disease dyslexia suggesting their potential involvement in ciliary functions. Furthermore, we searched for differences in gene composition of the ciliary module between the tissues. A multidrug-and-toxin extrusion transporter MATE2 (SLC47A2) was found as a brain-specific central gene in the ciliary module. We confirm the localization of MATE2 in cilia by immunofluorescence staining using MDCK cells as a model. While MATE2 has previously gained attention as a pharmacologically relevant transporter, its potential relation to cilia is suggested for the first time. Taken together, our large-scale analysis of gene coexpression networks identifies novel genes related to human cell cilia. PMID:22558177
Integrative Analysis Reveals Relationships of Genetic and Epigenetic Alterations in Osteosarcoma
Skårn, Magne; Namløs, Heidi M.; Barragan-Polania, Ana H.; Cleton-Jansen, Anne-Marie; Serra, Massimo; Liestøl, Knut; Hogendoorn, Pancras C. W.; Hovig, Eivind; Myklebost, Ola; Meza-Zepeda, Leonardo A.
2012-01-01
Background Osteosarcomas are the most common non-haematological primary malignant tumours of bone, and all conventional osteosarcomas are high-grade tumours showing complex genomic aberrations. We have integrated genome-wide genetic and epigenetic profiles from the EuroBoNeT panel of 19 human osteosarcoma cell lines based on microarray technologies. Principal Findings The cell lines showed complex patterns of DNA copy number changes, where genomic copy number gains were significantly associated with gene-rich regions and losses with gene-poor regions. By integrating the datasets, 350 genes were identified as having two types of aberrations (gain/over-expression, hypo-methylation/over-expression, loss/under-expression or hyper-methylation/under-expression) using a recurrence threshold of 6/19 (>30%) cell lines. The genes showed in general alterations in either DNA copy number or DNA methylation, both within individual samples and across the sample panel. These 350 genes are involved in embryonic skeletal system development and morphogenesis, as well as remodelling of extracellular matrix. The aberrations of three selected genes, CXCL5, DLX5 and RUNX2, were validated in five cell lines and five tumour samples using PCR techniques. Several genes were hyper-methylated and under-expressed compared to normal osteoblasts, and expression could be reactivated by demethylation using 5-Aza-2′-deoxycytidine treatment for four genes tested; AKAP12, CXCL5, EFEMP1 and IL11RA. Globally, there was as expected a significant positive association between gain and over-expression, loss and under-expression as well as hyper-methylation and under-expression, but gain was also associated with hyper-methylation and under-expression, suggesting that hyper-methylation may oppose the effects of increased copy number for detrimental genes. 
Conclusions Integrative analysis of genome-wide genetic and epigenetic alterations identified dependencies and relationships between DNA copy number, DNA methylation and mRNA expression in osteosarcomas, contributing to better understanding of osteosarcoma biology. PMID:23144859
Genetic variants in the PIWI-piRNA pathway gene DCP1A predict melanoma disease-specific survival.
Zhang, Weikang; Liu, Hongliang; Yin, Jieyun; Wu, Wenting; Zhu, Dakai; Amos, Christopher I; Fang, Shenying; Lee, Jeffrey E; Li, Yi; Han, Jiali; Wei, Qingyi
2016-12-15
The PIWI-piRNA pathway is important for germ cell maintenance, genome integrity, DNA methylation and retrotransposon control and thus may be involved in cancer development. In this study, we comprehensively analyzed prognostic roles of 3,116 common SNPs in PIWI-piRNA pathway genes in melanoma disease-specific survival. A published genome-wide association study (GWAS) by The University of Texas M.D. Anderson Cancer Center was used to identify associated SNPs, which were later validated by another GWAS from the Harvard Nurses' Health Study and Health Professionals Follow-up Study. After multiple testing correction, we found that there were 27 common SNPs in two genes (PIWIL4 and DCP1A) with false discovery rate < 0.2 in the discovery dataset. Three tagSNPs (i.e., rs7933369 and rs508485 in PIWIL4; rs11551405 in DCP1A) were replicated. The rs11551405 A allele, located at the 3' UTR microRNA binding site of DCP1A, was associated with an increased risk of melanoma disease-specific death in both the discovery dataset [adjusted hazard ratio (HR) = 1.66, 95% confidence interval (CI) = 1.21-2.27, p = 1.50 × 10⁻³] and the validation dataset (HR = 1.55, 95% CI = 1.03-2.34, p = 0.038), compared with the C allele, and their meta-analysis showed an HR of 1.62 (95% CI, 1.26-2.08, p = 1.55 × 10⁻⁴). Using RNA-seq data from the 1000 Genomes Project, we found that DCP1A mRNA expression levels increased significantly with the A allele number of rs11551405. Additional large, prospective studies are needed to validate these findings. © 2016 UICC.
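The pooled hazard ratio quoted above can be approximated from the two reported estimates alone. The sketch below assumes a fixed-effect, inverse-variance meta-analysis on the log scale, with standard errors recovered from the 95% confidence intervals; the abstract does not say which pooling model the authors used, so this is an illustration rather than their procedure. Function names are hypothetical.

```python
import math

def log_hr_and_se(hr, ci_low, ci_high, z=1.96):
    """Return (log HR, standard error), with the SE recovered from a 95% CI."""
    log_hr = math.log(hr)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z)
    return log_hr, se

def fixed_effect_pool(studies, z=1.96):
    """Inverse-variance pooling of studies given as (hr, ci_low, ci_high) tuples."""
    total_weight = 0.0
    weighted_sum = 0.0
    for hr_i, low_i, high_i in studies:
        log_hr, se = log_hr_and_se(hr_i, low_i, high_i, z)
        weight = 1.0 / se ** 2          # precision of the study on the log scale
        total_weight += weight
        weighted_sum += weight * log_hr
    pooled = weighted_sum / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return (math.exp(pooled),
            math.exp(pooled - z * pooled_se),
            math.exp(pooled + z * pooled_se))

# Discovery and validation estimates as reported in the abstract.
studies = [(1.66, 1.21, 2.27), (1.55, 1.03, 2.34)]
hr, lo, hi = fixed_effect_pool(studies)
print(f"pooled HR = {hr:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

Under these assumptions the pooled estimate lands close to the reported HR of 1.62 (95% CI, 1.26-2.08), which suggests the published figure is consistent with a simple inverse-variance combination.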
NASA Astrophysics Data System (ADS)
Christensen, C.; Summa, B.; Scorzelli, G.; Lee, J. W.; Venkat, A.; Bremer, P. T.; Pascucci, V.
2017-12-01
Massive datasets are becoming more common due to increasingly detailed simulations and higher resolution acquisition devices. Yet accessing and processing these huge data collections for scientific analysis is still a significant challenge. Solutions that rely on extensive data transfers are increasingly untenable and often impossible due to lack of sufficient storage at the client side as well as insufficient bandwidth to conduct such large transfers, which in some cases could entail petabytes of data. Large-scale remote computing resources can be useful, but utilizing such systems typically entails some form of offline batch processing with long delays, data replications, and substantial cost for any mistakes. Both types of workflows can severely limit the flexible exploration and rapid evaluation of new hypotheses that are crucial to the scientific process and thereby impede scientific discovery. In order to facilitate interactivity in both analysis and visualization of these massive data ensembles, we introduce a dynamic runtime system suitable for progressive computation and interactive visualization of arbitrarily large, disparately located spatiotemporal datasets. Our system includes an embedded domain-specific language (EDSL) that allows users to express a wide range of data analysis operations in a simple and abstract manner. The underlying runtime system transparently resolves issues such as remote data access and resampling while at the same time maintaining interactivity through progressive and interruptible processing. Computations involving large amounts of data can be performed remotely in an incremental fashion that dramatically reduces data movement, while the client receives updates progressively, thereby remaining robust to fluctuating network latency or limited bandwidth. This system facilitates interactive, incremental analysis and visualization of massive remote datasets up to petabytes in size. 
Our system is now available for general use in the community through both Docker and Anaconda.
Autophagy-related prognostic signature for breast cancer.
Gu, Yunyan; Li, Pengfei; Peng, Fuduan; Zhang, Mengmeng; Zhang, Yuanyuan; Liang, Haihai; Zhao, Wenyuan; Qi, Lishuang; Wang, Hongwei; Wang, Chenguang; Guo, Zheng
2016-03-01
Autophagy is a process that degrades intracellular constituents, such as long-lived or damaged proteins and organelles, to buffer metabolic stress under starvation conditions. Deregulation of autophagy is involved in the progression of cancer. However, the predictive value of autophagy for breast cancer prognosis remains unclear. First, based on gene expression profiling, we found that autophagy genes were implicated in breast cancer. Then, using the Cox proportional hazard regression model, we detected an autophagy prognostic signature for breast cancer in a training dataset. We identified a set of eight autophagy genes (BCL2, BIRC5, EIF4EBP1, ERO1L, FOS, GAPDH, ITPR1 and VEGFA) that were significantly associated with overall survival in breast cancer. The eight autophagy genes were assigned as an autophagy-related prognostic signature for breast cancer. Based on the autophagy-related signature, the training dataset GSE21653 could be classified into high-risk and low-risk subgroups with significantly different survival times (HR = 2.72, 95% CI = (1.91, 3.87); P = 1.37 × 10⁻⁵). Inactivation of autophagy was associated with shortened survival of breast cancer patients. The prognostic value of the autophagy-related signature was confirmed in the testing datasets GSE3494 (HR = 2.12, 95% CI = (1.48, 3.03); P = 1.65 × 10⁻³) and GSE7390 (HR = 1.76, 95% CI = (1.22, 2.54); P = 9.95 × 10⁻⁴). Further analysis revealed that the prognostic value of the autophagy signature was independent of known clinical prognostic factors, including age, tumor size, grade, estrogen receptor status, progesterone receptor status, ERBB2 status, lymph node status and TP53 mutation status. Finally, we demonstrated that the autophagy signature could also predict distant metastasis-free survival for breast cancer. © 2015 Wiley Periodicals, Inc.
A Fast SVD-Hidden-nodes based Extreme Learning Machine for Large-Scale Data Analytics.
Deng, Wan-Yu; Bai, Zuo; Huang, Guang-Bin; Zheng, Qing-Hua
2016-05-01
Big dimensional data is a growing trend that is emerging in many real world contexts, extending from web mining, gene expression analysis, protein-protein interaction to high-frequency financial data. Nowadays, there is a growing consensus that increasing dimensionality impedes the performance of classifiers, which is termed the "peaking phenomenon" in the field of machine intelligence. To address the issue, dimensionality reduction is commonly employed as a preprocessing step on the Big dimensional data before building the classifiers. In this paper, we propose an Extreme Learning Machine (ELM) approach for large-scale data analytics. In contrast to existing approaches, we embed hidden nodes that are designed using singular value decomposition (SVD) into the classical ELM. These SVD nodes in the hidden layer are shown to capture the underlying characteristics of the Big dimensional data well, exhibiting excellent generalization performance. The drawback of using SVD on the entire dataset, however, is the high computational complexity involved. To address this, a fast divide and conquer approximation scheme is introduced to maintain computational tractability on high volume data. The resultant algorithm is labeled here as Fast Singular Value Decomposition-Hidden-nodes based Extreme Learning Machine, or FSVD-H-ELM for short. In FSVD-H-ELM, instead of identifying the SVD hidden nodes directly from the entire dataset, SVD hidden nodes are derived from multiple random subsets of data sampled from the original dataset. Comprehensive experiments and comparisons are conducted to assess the FSVD-H-ELM against other state-of-the-art algorithms. The results obtained demonstrated the superior generalization performance and efficiency of the FSVD-H-ELM. Copyright © 2016 Elsevier Ltd. All rights reserved.
Rabiee, Atefeh; Schwämmle, Veit; Sidoli, Simone; Dai, Jie; Rogowska-Wrzesinska, Adelina; Mandrup, Susanne; Jensen, Ole N
2017-03-01
Adipocytes (fat cells) are important endocrine and metabolic cells critical for systemic insulin sensitivity. Both adipose excess and insufficiency are associated with adverse metabolic function. Adipogenesis is the process whereby preadipocyte precursor cells differentiate into lipid-laden mature adipocytes. This process is driven by a network of transcriptional regulators (TRs). We hypothesized that protein PTMs, in particular phosphorylation, play a major role in activating and propagating signals within TR networks upon induction of adipogenesis by extracellular stimulus. We applied MS-based quantitative proteomics and phosphoproteomics to monitor the alteration of nuclear proteins during the early stages (4 h) of preadipocyte differentiation. We identified a total of 4072 proteins including 2434 phosphorylated proteins, a majority of which were assigned as regulators of gene expression. Our results demonstrate that adipogenic stimuli increase the nuclear abundance and/or the phosphorylation levels of proteins involved in gene expression, cell organization, and oxidation-reduction pathways. Furthermore, proteins acting as negative modulators involved in negative regulation of gene expression, insulin stimulated glucose uptake, and cytoskeletal organization showed a decrease in their nuclear abundance and/or phosphorylation levels during the first 4 h of adipogenesis. Among 288 identified TRs, 49 were regulated within 4 h of adipogenic stimulation including several known and many novel potential adipogenic regulators. We created a kinase-substrate database for 3T3-L1 preadipocytes by investigating the relationship between protein kinases and protein phosphorylation sites identified in our dataset. A majority of the putative protein kinases belong to the cyclin-dependent kinase family and the mitogen-activated protein kinase family including P38 and c-Jun N-terminal kinases, suggesting that these kinases act as orchestrators of early adipogenesis. 
© 2016 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Gomulski, Ludvik M; Dimopoulos, George; Xi, Zhiyong; Soares, Marcelo B; Bonaldo, Maria F; Malacrida, Anna R; Gasperi, Giuliano
2008-01-01
Background The medfly, Ceratitis capitata, is a highly invasive agricultural pest that has become a model insect for the development of biological control programs. Despite research into the behavior and classical and population genetics of this organism, the quantity of sequence data available is limited. We have utilized an expressed sequence tag (EST) approach to obtain detailed information on transcriptome signatures that relate to a variety of physiological systems in the medfly; this information emphasizes reproduction, sex determination, and chemosensory perception, since the study was based on normalized cDNA libraries from embryos and adult heads. Results A total of 21,253 high-quality ESTs were obtained from the embryo and head libraries. Clustering analyses performed separately for each library resulted in 5201 embryo and 6684 head transcripts. Considering an estimated 19% overlap in the transcriptomes of the two libraries, they represent about 9614 unique transcripts involved in a wide range of biological processes and molecular functions. Of particular interest are the sequences that share homology with Drosophila genes involved in sex determination, olfaction, and reproductive behavior. The medfly transformer2 (tra2) homolog was identified among the embryonic sequences, and its genomic organization and expression were characterized. Conclusion The sequences obtained in this study represent the first major dataset of expressed genes in a tephritid species of agricultural importance. This resource provides essential information to support the investigation of numerous questions regarding the biology of the medfly and other related species and also constitutes an invaluable tool for the annotation of complete genome sequences. 
Our study has revealed intriguing findings regarding the transcript regulation of tra2 and other sex determination genes, as well as insights into the comparative genomics of genes implicated in chemosensory reception and reproduction. PMID:18500975
A cross-country Exchange Market Pressure (EMP) dataset.
Desai, Mohit; Patnaik, Ila; Felman, Joshua; Shah, Ajay
2017-06-01
The data presented in this article are related to the research article titled "An exchange market pressure measure for cross country analysis" (Patnaik et al. [1]). In this article, we present the dataset of Exchange Market Pressure (EMP) values for 139 countries along with their conversion factors, ρ (rho). Exchange Market Pressure, expressed as a percentage change in the exchange rate, measures the change in exchange rate that would have taken place had the central bank not intervened. The conversion factor ρ can be interpreted as the change in exchange rate associated with $1 billion of intervention. Estimates of the conversion factor ρ allow us to calculate a monthly time series of EMP for 139 countries. Additionally, the dataset contains the 68% confidence interval (high and low values) for the point estimates of ρ. Using the standard errors of the estimates of ρ, we obtain one sigma intervals around mean estimates of EMP values. These values are also reported in the dataset.
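The description above implies a simple relation between the observed exchange rate change, the intervention, and ρ. The sketch below assumes EMP equals the observed percentage change plus ρ times the intervention in billions of US dollars; the abstract does not give the formula or the sign convention, so both are assumptions, and all numbers are hypothetical rather than values from the dataset.

```python
def exchange_market_pressure(pct_change_in_rate, intervention_bn_usd, rho):
    """EMP in percent: observed change plus the change absorbed by intervention.

    Assumed form: EMP = observed % change + rho * intervention ($ billion).
    """
    return pct_change_in_rate + rho * intervention_bn_usd

def emp_interval(pct_change_in_rate, intervention_bn_usd, rho_low, rho_high):
    """EMP range implied by the low and high ends of the interval for rho."""
    a = exchange_market_pressure(pct_change_in_rate, intervention_bn_usd, rho_low)
    b = exchange_market_pressure(pct_change_in_rate, intervention_bn_usd, rho_high)
    return min(a, b), max(a, b)

# Hypothetical month: 1.0% observed change, $2.0 billion of intervention,
# rho = 0.5 (% per $1 billion) with a 68% interval of 0.25 to 0.75.
emp = exchange_market_pressure(1.0, 2.0, 0.5)
emp_low, emp_high = emp_interval(1.0, 2.0, 0.25, 0.75)
print(emp, emp_low, emp_high)
```

Propagating only the interval for ρ, as done here, mirrors the one sigma bands described in the abstract but ignores any other source of uncertainty.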
A Study of the Effects of School Size and Single-Sex Education in English Schools
ERIC Educational Resources Information Center
Spielhofer, Thomas; Benton, Tom; Schagen, Sandie
2004-01-01
National value-added datasets have recently become available that record a pupil's progress from Key Stage 2 right through to GCSE. Such a dataset is clearly a useful tool for assessing the impact various characteristics of secondary schools have on pupil performance. This paper reports on a research project which involved the use of a variety of…
Chumnanpuen, Pramote; Zhang, Jie; Nookaew, Intawat; Nielsen, Jens
2012-07-01
In the yeast Saccharomyces cerevisiae many genes involved in lipid biosynthesis are transcriptionally controlled by inositol-choline and the protein kinase Snf1. Here we undertook a global study on how inositol-choline and Snf1 interact in controlling lipid metabolism in yeast. Using both a reference strain (CEN.PK113-7D) and a snf1Δ strain cultured at different nutrient limitations (carbon and nitrogen), at a fixed specific growth rate of 0.1 h(-1), and at different inositol choline concentrations, we quantified the expression of genes involved in lipid biosynthesis and the fluxes towards the different lipid components. Through integrated analysis of the transcriptome, the lipid profiling and the fluxome, it was possible to obtain a high quality, large-scale dataset that could be used to identify correlations and associations between the different components. At the transcription level, Snf1 and inositol-choline interact either directly through the main phospholipid-involving transcription factors (i.e. Ino2, Ino4, and Opi1) or through other transcription factors e.g. Gis1, Mga2, and Hac1. However, there seems to be flux regulation at the enzyme levels of several lipid involving enzymes. The analysis showed the strength of using both transcriptome and lipid profiling analysis for mapping the co-influence of inositol-choline and Snf1 on phospholipid metabolism.
MPIGeneNet: Parallel Calculation of Gene Co-Expression Networks on Multicore Clusters.
Gonzalez-Dominguez, Jorge; Martin, Maria J
2017-10-10
In this work we present MPIGeneNet, a parallel tool that applies Pearson's correlation and Random Matrix Theory to construct gene co-expression networks. It is based on the state-of-the-art sequential tool RMTGeneNet, which provides networks with high robustness and sensitivity at the expense of relatively long runtimes for large-scale input datasets. MPIGeneNet returns the same results as RMTGeneNet but improves the memory management, reduces the I/O cost, and accelerates the two most computationally demanding steps of co-expression network construction by exploiting the compute capabilities of common multicore CPU clusters. Our performance evaluation on two different systems using three typical input datasets shows that MPIGeneNet is significantly faster than RMTGeneNet. As an example, our tool is up to 175.41 times faster on a cluster with eight nodes, each one containing two 12-core Intel Haswell processors. Source code of MPIGeneNet, as well as a reference manual, is available at https://sourceforge.net/projects/mpigenenet/.
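To make the correlation step concrete, the following is a small sequential sketch of building a co-expression edge list from Pearson's correlation. It is not MPIGeneNet's code: the threshold here is picked by hand (the tool derives it with Random Matrix Theory), no MPI parallelism is shown, and the gene names and values are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    denom = sqrt(sxx * syy)
    return sxy / denom if denom else 0.0


def coexpression_edges(expr, threshold):
    """Return (gene1, gene2, r) for every gene pair with |r| >= threshold."""
    edges = []
    for g1, g2 in combinations(sorted(expr), 2):
        r = pearson(expr[g1], expr[g2])
        if abs(r) >= threshold:
            edges.append((g1, g2, r))
    return edges


# Hypothetical expression profiles across four samples.
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 1.0, 3.0, 2.0],
}
edges = coexpression_edges(expr, threshold=0.9)
print(edges)
```

The pairwise loop is quadratic in the number of genes, which is the kind of cost a parallel implementation would distribute across cores.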
Viscoplastic properties of laponite-CMC mixes.
Tarhini, Z; Jarny, S; Texier, A
2017-04-01
In this dataset, 15 samples of laponite-CMC mixes were prepared and their viscoplastic properties were determined. Rheological parameters are then expressed as a function of age and component concentrations.
Gene expression inference with deep learning.
Chen, Yifei; Li, Yi; Narayan, Rajiv; Subramanian, Aravind; Xie, Xiaohui
2016-06-15
Large-scale gene expression profiling has been widely used to characterize cellular states in response to various disease conditions, genetic perturbations, etc. Although the cost of whole-genome expression profiles has been dropping steadily, generating a compendium of expression profiling over thousands of samples is still very expensive. Recognizing that gene expressions are often highly correlated, researchers from the NIH LINCS program have developed a cost-effective strategy of profiling only ∼1000 carefully selected landmark genes and relying on computational methods to infer the expression of remaining target genes. However, the computational approach adopted by the LINCS program is currently based on linear regression (LR), limiting its accuracy since it does not capture complex nonlinear relationship between expressions of genes. We present a deep learning method (abbreviated as D-GEX) to infer the expression of target genes from the expression of landmark genes. We used the microarray-based Gene Expression Omnibus dataset, consisting of 111K expression profiles, to train our model and compare its performance to those from other methods. In terms of mean absolute error averaged across all genes, deep learning significantly outperforms LR with 15.33% relative improvement. A gene-wise comparative analysis shows that deep learning achieves lower error than LR in 99.97% of the target genes. We also tested the performance of our learned model on an independent RNA-Seq-based GTEx dataset, which consists of 2921 expression profiles. Deep learning still outperforms LR with 6.57% relative improvement, and achieves lower error in 81.31% of the target genes. D-GEX is available at https://github.com/uci-cbcl/D-GEX CONTACT: xhx@ics.uci.edu Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
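The comparison metric mentioned above, mean absolute error with a relative improvement over a baseline, can be sketched as follows. The definition of relative improvement used here, (baseline error − model error) / baseline error, is an assumption about how such a percentage is typically computed, and the values are hypothetical rather than taken from D-GEX.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between measured and inferred expression."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_improvement(baseline_error, model_error):
    """Percentage reduction in error relative to the baseline (assumed definition)."""
    return 100.0 * (baseline_error - model_error) / baseline_error


# Hypothetical target-gene expression and two sets of predictions.
measured = [1.0, 2.0, 3.0, 4.0]
linear_pred = [1.5, 2.5, 2.5, 4.5]      # stand-in for a linear regression baseline
deep_pred = [1.25, 2.25, 2.75, 4.25]    # stand-in for a nonlinear model

mae_lr = mean_absolute_error(measured, linear_pred)
mae_dl = mean_absolute_error(measured, deep_pred)
print(mae_lr, mae_dl, relative_improvement(mae_lr, mae_dl))
```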
Gene expression inference with deep learning
Chen, Yifei; Li, Yi; Narayan, Rajiv; Subramanian, Aravind; Xie, Xiaohui
2016-01-01
Motivation: Large-scale gene expression profiling has been widely used to characterize cellular states in response to various disease conditions, genetic perturbations, etc. Although the cost of whole-genome expression profiles has been dropping steadily, generating a compendium of expression profiling over thousands of samples is still very expensive. Recognizing that gene expressions are often highly correlated, researchers from the NIH LINCS program have developed a cost-effective strategy of profiling only ∼1000 carefully selected landmark genes and relying on computational methods to infer the expression of remaining target genes. However, the computational approach adopted by the LINCS program is currently based on linear regression (LR), limiting its accuracy since it does not capture complex nonlinear relationship between expressions of genes. Results: We present a deep learning method (abbreviated as D-GEX) to infer the expression of target genes from the expression of landmark genes. We used the microarray-based Gene Expression Omnibus dataset, consisting of 111K expression profiles, to train our model and compare its performance to those from other methods. In terms of mean absolute error averaged across all genes, deep learning significantly outperforms LR with 15.33% relative improvement. A gene-wise comparative analysis shows that deep learning achieves lower error than LR in 99.97% of the target genes. We also tested the performance of our learned model on an independent RNA-Seq-based GTEx dataset, which consists of 2921 expression profiles. Deep learning still outperforms LR with 6.57% relative improvement, and achieves lower error in 81.31% of the target genes. Availability and implementation: D-GEX is available at https://github.com/uci-cbcl/D-GEX. Contact: xhx@ics.uci.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:26873929
2011-01-01
Purpose To theoretically develop and experimentally validate a formulism based on a fractional order calculus (FC) diffusion model to characterize anomalous diffusion in brain tissues measured with a twice-refocused spin-echo (TRSE) pulse sequence. Materials and Methods The FC diffusion model is the fractional order generalization of the Bloch-Torrey equation. Using this model, an analytical expression was derived to describe the diffusion-induced signal attenuation in a TRSE pulse sequence. To experimentally validate this expression, a set of diffusion-weighted (DW) images was acquired at 3 Tesla from healthy human brains using a TRSE sequence with twelve b-values ranging from 0 to 2,600 s/mm2. For comparison, DW images were also acquired using a Stejskal-Tanner diffusion gradient in a single-shot spin-echo echo planar sequence. For both datasets, a Levenberg-Marquardt fitting algorithm was used to extract three parameters: diffusion coefficient D, fractional order derivative in space β, and a spatial parameter μ (in units of μm). Using adjusted R-squared values and standard deviations, D, β and μ values and the goodness-of-fit in three specific regions of interest (ROI) in white matter, gray matter, and cerebrospinal fluid were evaluated for each of the two datasets. In addition, spatially resolved parametric maps were assessed qualitatively. Results The analytical expression for the TRSE sequence, derived from the FC diffusion model, accurately characterized the diffusion-induced signal loss in brain tissues at high b-values. In the selected ROIs, the goodness-of-fit and standard deviations for the TRSE dataset were comparable with the results obtained from the Stejskal-Tanner dataset, demonstrating the robustness of the FC model across multiple data acquisition strategies. Qualitatively, the D, β, and μ maps from the TRSE dataset exhibited fewer artifacts, reflecting the improved immunity to eddy currents. 
Conclusion The diffusion-induced signal attenuation in a TRSE pulse sequence can be described by an FC diffusion model at high b-values. This model performs equally well for data acquired from the human brain tissues with a TRSE pulse sequence or a conventional Stejskal-Tanner sequence. PMID:21509877
Gao, Qing; Srinivasan, Girish; Magin, Richard L; Zhou, Xiaohong Joe
2011-05-01
To theoretically develop and experimentally validate a formulism based on a fractional order calculus (FC) diffusion model to characterize anomalous diffusion in brain tissues measured with a twice-refocused spin-echo (TRSE) pulse sequence. The FC diffusion model is the fractional order generalization of the Bloch-Torrey equation. Using this model, an analytical expression was derived to describe the diffusion-induced signal attenuation in a TRSE pulse sequence. To experimentally validate this expression, a set of diffusion-weighted (DW) images was acquired at 3 Tesla from healthy human brains using a TRSE sequence with twelve b-values ranging from 0 to 2600 s/mm(2). For comparison, DW images were also acquired using a Stejskal-Tanner diffusion gradient in a single-shot spin-echo echo planar sequence. For both datasets, a Levenberg-Marquardt fitting algorithm was used to extract three parameters: diffusion coefficient D, fractional order derivative in space β, and a spatial parameter μ (in units of μm). Using adjusted R-squared values and standard deviations, D, β, and μ values and the goodness-of-fit in three specific regions of interest (ROIs) in white matter, gray matter, and cerebrospinal fluid, respectively, were evaluated for each of the two datasets. In addition, spatially resolved parametric maps were assessed qualitatively. The analytical expression for the TRSE sequence, derived from the FC diffusion model, accurately characterized the diffusion-induced signal loss in brain tissues at high b-values. In the selected ROIs, the goodness-of-fit and standard deviations for the TRSE dataset were comparable with the results obtained from the Stejskal-Tanner dataset, demonstrating the robustness of the FC model across multiple data acquisition strategies. Qualitatively, the D, β, and μ maps from the TRSE dataset exhibited fewer artifacts, reflecting the improved immunity to eddy currents. 
The diffusion-induced signal attenuation in a TRSE pulse sequence can be described by an FC diffusion model at high b-values. This model performs equally well for data acquired from the human brain tissues with a TRSE pulse sequence or a conventional Stejskal-Tanner sequence. Copyright © 2011 Wiley-Liss, Inc.
Ma, Junshui; Bayram, Sevinç; Tao, Peining; Svetnik, Vladimir
2011-03-15
After a review of the ocular artifact reduction literature, a high-throughput method designed to reduce the ocular artifacts in multichannel continuous EEG recordings acquired at clinical EEG laboratories worldwide is proposed. The proposed method belongs to the category of component-based methods, and does not rely on any electrooculography (EOG) signals. Based on a concept that all ocular artifact components exist in a signal component subspace, the method can uniformly handle all types of ocular artifacts, including eye-blinks, saccades, and other eye movements, by automatically identifying ocular components from decomposed signal components. This study also proposes an improved strategy to objectively and quantitatively evaluate artifact reduction methods. The evaluation strategy uses real EEG signals to synthesize realistic simulated datasets with different amounts of ocular artifacts. The simulated datasets enable us to objectively demonstrate that the proposed method outperforms some existing methods when no high-quality EOG signals are available. Moreover, the results of the simulated datasets improve our understanding of the involved signal decomposition algorithms, and provide us with insights into the inconsistency regarding the performance of different methods in the literature. The proposed method was also applied to two independent clinical EEG datasets involving 28 volunteers and over 1000 EEG recordings. This effort further confirms that the proposed method can effectively reduce ocular artifacts in large clinical EEG datasets in a high-throughput fashion. Copyright © 2011 Elsevier B.V. All rights reserved.
Leveling data in geochemical mapping: scope of application, pros and cons of existing methods
NASA Astrophysics Data System (ADS)
Pereira, Benoît; Vandeuren, Aubry; Sonnet, Philippe
2017-04-01
Geochemical mapping successfully met a range of needs from mineral exploration to environmental management. In Europe and around the world numerous geochemical datasets already exist. These datasets may originate from geochemical mapping projects or from the collection of sample analyses requested by environmental protection regulatory bodies. Combining datasets can be highly beneficial for establishing geochemical maps with increased resolution and/or coverage area. However this practice requires assessing the equivalence between datasets and, if needed, applying data leveling to remove possible biases between datasets. In the literature, several procedures for assessing dataset equivalence and leveling data are proposed. Daneshfar & Cameron (1998) proposed a method for the leveling of two adjacent datasets while Pereira et al. (2016) proposed two methods for the leveling of datasets that contain records located within the same geographical area. Each discussed method requires its own set of assumptions (underlying populations of data, spatial distribution of data, etc.). Here we propose to discuss the scope of application, pros, cons and practical recommendations for each method. This work is illustrated with several case studies in Wallonia (Southern Belgium) and in Europe involving trace element geochemical datasets. References: Daneshfar, B. & Cameron, E. (1998), Leveling geochemical data between map sheets, Journal of Geochemical Exploration 63(3), 189-201. Pereira, B.; Vandeuren, A.; Govaerts, B. B. & Sonnet, P. (2016), Assessing dataset equivalence and leveling data in geochemical mapping, Journal of Geochemical Exploration 168, 36-48.
Micro RNA as a potential blood-based epigenetic biomarker for Alzheimer's disease.
Fransquet, Peter D; Ryan, Joanne
2018-06-06
As the prevalence of Alzheimer's disease (AD) increases, the search for a definitive, easy-to-access diagnostic biomarker has become increasingly important. Micro RNA (miRNA), involved in the epigenetic regulation of protein synthesis, is a biological mark which varies in association with a number of disease states, possibly including AD. Here we comprehensively review methods and findings from 26 studies comparing the measurement of miRNA in blood between AD cases and controls. Thirteen of these studies used receiver operator characteristic (ROC) analysis to determine the diagnostic accuracy of identified miRNA to predict AD, and three studies did this with a machine learning approach. Of 8098 individually measured miRNAs, 23 that were differentially expressed between AD cases and controls were found to be significant in two or more studies. Only six of these were consistent in their direction of expression between studies (miR-107, miR-125b, miR-146a, miR-181c, miR-29b, and miR-342), and they were all shown to be downregulated in individuals with AD compared to controls. Of these directionally concordant miRNAs, the strongest evidence was for miR-107, which has also been shown in previous studies to be involved in the dysregulation of proteins involved in aspects of AD pathology, as well as being consistently downregulated in studies of AD brains. We conclude that standardised methods of measurement, appropriate statistical analysis, utilization of large datasets with machine learning approaches, and comprehensive reporting of findings are urgently needed for the discovery of reliable and replicable miRNA biomarkers of AD. Copyright © 2017. Published by Elsevier Inc.
Zhu, Hong; Xia, Wei; Mo, Xing-Bo; Lin, Xiang; Qiu, Ying-Hua; Yi, Neng-Jun; Zhang, Yong-Hong; Deng, Fei-Yan; Lei, Shu-Feng
2016-01-01
Rheumatoid arthritis (RA) is a complex autoimmune disease. Using a gene-based association research strategy, the present study aims to detect unknown susceptibility to RA and to address the ethnic differences in genetic susceptibility to RA between European and Asian populations. Gene-based association analyses were performed with KGG 2.5 by using publicly available large RA datasets (14,361 RA cases and 43,923 controls of European subjects, 4,873 RA cases and 17,642 controls of Asian subjects). For the newly identified RA-associated genes, gene set enrichment analyses and protein-protein interaction analyses were carried out with DAVID and STRING version 10.0, respectively. Differential expression verification was conducted using 4 GEO datasets. The expression levels of three selected 'highly verified' genes were measured by ELISA among our in-house RA cases and controls. A total of 221 RA-associated genes were newly identified by gene-based association study, including 71 'overlapped', 76 'European-specific' and 74 'Asian-specific' genes. Among them, 105 genes had significant differential expression between RA patients and healthy controls in at least one dataset, especially 20 genes including 11 'overlapped' (ABCF1, FLOT1, HLA-F, IER3, TUBB, ZKSCAN4, BTN3A3, HSP90AB1, CUTA, BRD2, HLA-DMA), 5 'European-specific' (PHTF1, RPS18, BAK1, TNFRSF14, SUOX) and 4 'Asian-specific' (RNASET2, HFE, BTN2A2, MAPK13) genes whose differential expression was significant in at least three datasets. The protein expression of two selected genes, FLOT1 (P value = 1.70E-02) and HLA-DMA (P value = 4.70E-02), in plasma was significantly different in our in-house samples. Our study identified 221 novel RA-associated genes and especially highlighted the importance of 20 candidate genes on RA. The results addressed ethnic genetic background differences for RA susceptibility between European and Asian populations and detected a long list of overlapped or ethnic specific RA genes. 
The study not only greatly increases our understanding of genetic susceptibility to RA, but also provides important insights into the ethno-genetic homogeneity and heterogeneity of RA in both ethnicities.
Vasileva, Hristina; Butcher, Robert; Pickering, Harry; Sokana, Oliver; Jack, Kelvin; Solomon, Anthony W; Holland, Martin J; Roberts, Chrissy H
2018-02-21
Clinical signs of active (inflammatory) trachoma are found in many children in the Solomon Islands, but the majority of these individuals have no serological evidence of previous infection with Chlamydia trachomatis. In Temotu and Rennell and Bellona provinces, ocular infections with C. trachomatis were seldom detected among children with active trachoma; a similar lack of association was seen between active trachoma and other common bacterial and viral causes of follicular conjunctivitis. Here, we set out to characterise patterns of gene expression at the conjunctivae of children in these provinces with and without clinical signs of trachomatous inflammation-follicular (TF) and C. trachomatis infection. Purified RNA from children with and without active trachoma was run on Affymetrix GeneChip Human Transcriptome Array 2.0 microarrays. Profiles were compared between individuals with ocular C. trachomatis infection and TF (group DI; n = 6), individuals with TF but no C. trachomatis infection (group D; n = 7), and individuals without TF or C. trachomatis infection (group N; n = 7). Differential gene expression and gene set enrichment for pathway membership were assessed. Conjunctival gene expression profiles were more similar within-group than between-group. Principal components analysis indicated that the first and second principal components combined explained almost 50% of the variance in the dataset. When comparing the DI group to the N group, genes involved in T-cell proliferation, B-cell signalling and CD8+ T cell signalling pathways were differentially regulated. When comparing the DI group to the D group, CD8+ T-cell regulation, interferon-gamma and IL17 production pathways were enriched. Genes involved in RNA transcription and translation pathways were upregulated when comparing the D group to the N group. Gene expression profiles in children in the Solomon Islands indicate immune responses consistent with bacterial infection when TF and C. trachomatis infection are concurrent. The transcriptomes of children with TF but without identified infection were not consistent with allergic or viral conjunctivitis.
Wada, Masayoshi; Takahashi, Hiroki; Altaf-Ul-Amin, Md; Nakamura, Kensuke; Hirai, Masami Y; Ohta, Daisaku; Kanaya, Shigehiko
2012-07-15
Operon-like arrangements of genes occur in eukaryotes ranging from yeasts and filamentous fungi to nematodes, plants, and mammals. In plants, several examples of operon-like gene clusters involved in metabolic pathways have recently been characterized, e.g. the cyclic hydroxamic acid pathways in maize, the avenacin biosynthesis gene clusters in oat, the thalianol pathway in Arabidopsis thaliana, and the diterpenoid momilactone cluster in rice. Such operon-like gene clusters are defined by their co-regulation or neighboring positions within immediate vicinity of chromosomal regions. A comprehensive analysis of the expression of neighboring genes is therefore a crucial step in revealing the complete set of operon-like gene clusters within a genome. Genome-wide prediction of operon-like gene clusters should contribute to functional annotation efforts and also provide novel insight into evolutionary aspects of acquiring certain biological functions. We predicted co-expressed gene clusters by comparing the Pearson correlation coefficient of neighboring genes and randomly selected gene pairs, based on a statistical method that takes false discovery rate (FDR) into consideration for 1469 microarray gene expression datasets of A. thaliana. We estimated that A. thaliana contains 100 operon-like gene clusters in total. We predicted 34 statistically significant gene clusters consisting of 3 to 22 genes each, based on a stringent FDR threshold of 0.1. Functional relationships among genes in individual clusters were estimated by sequence similarity and functional annotation of genes. Duplicated gene pairs (determined based on BLAST with a cutoff of E < 10^-5) are included in 27 clusters. Five clusters are associated with metabolism, containing P450 genes restricted to the Brassica family and predicted to be involved in secondary metabolism. 
Operon-like clusters tend to include genes encoding bio-machinery associated with ribosomes, the ubiquitin/proteasome system, secondary metabolic pathways, lipid and fatty-acid metabolism, and the lipid transfer system. Copyright © 2012 Elsevier B.V. All rights reserved.
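One way to picture the comparison of neighboring-gene correlations against randomly selected pairs is an empirical false discovery rate at a given correlation cutoff. The sketch below is an illustrative assumption, not the statistical method used in the study: it treats random pairs as the null distribution and divides the fraction of random pairs passing the cutoff by the fraction of neighboring pairs passing it. All correlation values are hypothetical.

```python
def empirical_fdr(neighbor_r, random_r, cutoff):
    """Estimate FDR at a correlation cutoff, using random pairs as the null.

    Returns None when no neighboring pair passes the cutoff, because the
    ratio is undefined in that case.
    """
    frac_null = sum(r >= cutoff for r in random_r) / len(random_r)
    frac_obs = sum(r >= cutoff for r in neighbor_r) / len(neighbor_r)
    if frac_obs == 0:
        return None
    return min(1.0, frac_null / frac_obs)


# Hypothetical Pearson correlations for adjacent gene pairs and random pairs.
neighbor_r = [0.9, 0.8, 0.2, 0.85, 0.1]
random_r = [0.1, 0.85, -0.2, 0.3, 0.0, 0.05, 0.4, -0.1, 0.2, 0.15]

fdr = empirical_fdr(neighbor_r, random_r, cutoff=0.7)
print(fdr)
```

With an estimate like this, a cutoff would be accepted only if the resulting value stays at or below the chosen FDR threshold (0.1 in the abstract above).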
Dupl'áková, Nikoleta; Renák, David; Hovanec, Patrik; Honysová, Barbora; Twell, David; Honys, David
2007-07-23
Microarray technologies now belong to the standard functional genomics toolbox and have undergone massive development leading to increased genome coverage, accuracy and reliability. The number of experiments exploiting microarray technology has markedly increased in recent years. In parallel with the rapid accumulation of transcriptomic data, on-line analysis tools are being introduced to simplify their use. Global statistical data analysis methods contribute to the development of overall concepts about gene expression patterns and to query and compose working hypotheses. More recently, these applications are being supplemented with more specialized products offering visualization and specific data mining tools. We present a curated gene family-oriented gene expression database, Arabidopsis Gene Family Profiler (aGFP; http://agfp.ueb.cas.cz), which gives the user access to a large collection of normalised Affymetrix ATH1 microarray datasets. The database currently contains NASC Array and AtGenExpress transcriptomic datasets for various tissues at different developmental stages of wild type plants gathered from nearly 350 gene chips. The Arabidopsis GFP database has been designed as an easy-to-use tool for users needing an easily accessible resource for expression data of single genes, pre-defined gene families or custom gene sets, with the further possibility of keyword search. Arabidopsis Gene Family Profiler presents a user-friendly web interface using both graphic and text output. Data are stored at the MySQL server and individual queries are created in PHP script. 
The most distinguishable features of Arabidopsis Gene Family Profiler database are: 1) the presentation of normalized datasets (Affymetrix MAS algorithm and calculation of model-based gene-expression values based on the Perfect Match-only model); 2) the choice between two different normalization algorithms (Affymetrix MAS4 or MAS5 algorithms); 3) an intuitive interface; 4) an interactive "virtual plant" visualizing the spatial and developmental expression profiles of both gene families and individual genes. Arabidopsis GFP gives users the possibility to analyze current Arabidopsis developmental transcriptomic data starting with simple global queries that can be expanded and further refined to visualize comparative and highly selective gene expression profiles.
Xue, Linlin; Xie, Li; Song, Xingguo; Song, Xianrang
2018-04-17
Platelets have emerged as key players in tumorigenesis and tumor progression. The tumor-educated platelet (TEP) RNA profile has the potential to diagnose non-small-cell lung cancer (NSCLC). The objective of this study was to identify potential TEP RNA biomarkers for the diagnosis of NSCLC and to explore the mechanisms underlying alterations of the TEP RNA profile. The RNA-seq datasets GSE68086 and GSE89843 were downloaded from Gene Expression Omnibus DataSets (GEO DataSets). Then, the functional enrichment of the differentially expressed mRNAs was analyzed by the Database for Annotation Visualization and Integrated Discovery (DAVID). The miRNAs which regulated the differential mRNAs and the target mRNAs of miRNAs were identified by miRanda and miRDB. Then, the miRNA-mRNA regulatory network was visualized via Cytoscape software. Twenty consistently altered mRNAs (2 up-regulated and 18 down-regulated) were identified from the two GSE datasets, and they were significantly enriched in several biological processes, including transport and establishment of localization. Twenty identical miRNAs were found between the exosomal miRNA-seq dataset and the 229 miRNAs that regulated the 20 consistently differential mRNAs in platelets. We also analyzed 13 spliceosomal mRNAs and their miRNA predictions; there were 27 common miRNAs between 206 differential exosomal miRNAs and 338 miRNAs that regulated 13 distinct spliceosomal mRNAs. This study identified 20 potential TEP RNA biomarkers for the diagnosis of NSCLC by integrated bioinformatic analysis, and alterations in the TEP RNA profile may be related to post-transcriptional regulation and the splicing metabolism of the spliceosome. © 2018 Wiley Periodicals, Inc.
2014-01-01
Expression quantitative trait loci (eQTL) mapping is a tool that can systematically identify genetic variation affecting gene expression. eQTL mapping studies have shown that certain genomic locations, referred to as regulatory hotspots, may affect the expression levels of many genes. Recently, studies have shown that various confounding factors may induce spurious regulatory hotspots. Here, we introduce a novel statistical method that effectively eliminates spurious hotspots while retaining genuine hotspots. Applied to simulated and real datasets, we validate that our method achieves greater sensitivity while retaining low false discovery rates compared to previous methods. PMID:24708878
Peng, Li; Liu, Zhao-Yang; Li, Wen-Ling; Zhang, Chao-Yang; Zhang, Ya-Qin; Pan, Xi; Chen, Jun; Li, Yue-Hui
2017-01-01
Upregulation of lncRNA H19 expression is associated with an unfavorable prognosis in some cancers. However, the prognostic value of H19 in female-specific cancers has remained uncharacterized. In this study, the prognostic power of high H19 expression in female cancer patients from the TCGA datasets was analyzed using Kaplan-Meier survival curves and Cox's proportional hazard modeling. In addition, in a meta-analysis of non-female cancer patients from TCGA datasets and 12 independent studies, hazard ratios (HRs) with 95% confidence interval (CI) for overall survival (OS) and disease-free survival (DFS)/relapse-free survival (RFS)/metastasis-free survival (MFS)/progression-free survival (PFS) were pooled to assess the prognostic value of high H19 expression. Kaplan-Meier analysis revealed that patients with uterine corpus cancer and higher H19 expression had a shorter OS (HR=2.710, p<0.05), while females with cervical cancer and increased H19 expression had a shorter RFS (HR=2.261, p<0.05). Multivariate Cox regression analysis showed that high H19 expression could independently predict a poorer prognosis in cervical cancer patients (HR=4.099, p<0.05). In the meta-analysis, patients with high H19 expression showed a poorer outcome in non-female cancer (p<0.05). These results suggest that high lncRNA H19 expression is predictive of an unfavorable prognosis in two female cancers (uterine corpus endometrioid cancer and cervical cancer) as well as in non-female cancer patients. PMID:27926484
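For readers unfamiliar with the Kaplan-Meier curves referred to above, the product-limit estimator can be sketched in a few lines. This is a generic textbook illustration with hypothetical follow-up data, not a reproduction of the study's analysis; the function name and the input layout are assumptions.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times  : follow-up time for each patient
    events : 1 if the event (e.g. death) was observed, 0 if censored
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # events and censored patients both leave the risk set
    return curve


# Hypothetical follow-up (months) for five patients; 0 marks censoring.
curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
print(curve)
```

Comparing two such curves (for example high versus low expression groups) would additionally require a test such as the log-rank test, which is not shown here.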
Peng, Li; Yuan, Xiao-Qing; Liu, Zhao-Yang; Li, Wen-Ling; Zhang, Chao-Yang; Zhang, Ya-Qin; Pan, Xi; Chen, Jun; Li, Yue-Hui; Li, Guan-Cheng
2017-01-03
Upregulation of lncRNA H19 expression is associated with an unfavorable prognosis in some cancers. However, the prognostic value of H19 in female-specific cancers has remained uncharacterized. In this study, the prognostic power of high H19 expression in female cancer patients from the TCGA datasets was analyzed using Kaplan-Meier survival curves and Cox's proportional hazard modeling. In addition, in a meta-analysis of non-female cancer patients from TCGA datasets and 12 independent studies, hazard ratios (HRs) with 95% confidence interval (CI) for overall survival (OS) and disease-free survival (DFS)/relapse-free survival (RFS)/metastasis-free survival (MFS)/progression-free survival (PFS) were pooled to assess the prognostic value of high H19 expression. Kaplan-Meier analysis revealed that patients with uterine corpus cancer and higher H19 expression had a shorter OS (HR=2.710, p<0.05), while females with cervical cancer and increased H19 expression had a shorter RFS (HR=2.261, p<0.05). Multivariate Cox regression analysis showed that high H19 expression could independently predict a poorer prognosis in cervical cancer patients (HR=4.099, p<0.05). In the meta-analysis, patients with high H19 expression showed a poorer outcome in non-female cancer (p<0.05). These results suggest that high lncRNA H19 expression is predictive of an unfavorable prognosis in two female cancers (uterine corpus endometrioid cancer and cervical cancer) as well as in non-female cancer patients.
Mittal, Anuradha; Pachter, Lior; Nelson, J. Lee; Smed, Mette Kiel; Gildengorin, Virginia L.; Zoffmann, Vibeke; Hetland, Merete Lund; Jewell, Nicholas P.; Olsen, Jørn; Jawaheer, Damini
2015-01-01
Background: Pregnancy induces drastic biological changes systemically, and has a beneficial effect on some autoimmune conditions such as rheumatoid arthritis (RA). However, specific systemic changes that occur as a result of pregnancy have not been thoroughly examined in healthy women or women with RA. The goal of this study was to identify genes with expression patterns associated with pregnancy, compared to pre-pregnancy as baseline, and determine whether those associations are modified by presence of RA. Results: In our RNA sequencing (RNA-seq) dataset from 5 healthy women and 20 women with RA, normalized expression levels of 4,710 genes were significantly associated with pregnancy status (pre-pregnancy, first, second and third trimesters) over time, irrespective of presence of RA (False Discovery Rate (FDR)-adjusted p value < 0.05). These genes were enriched in pathways spanning multiple systems, as would be expected during pregnancy. A subset of these genes (n = 256) showed greater than two-fold change in expression during pregnancy compared to baseline levels, with distinct temporal trends through pregnancy. Another 98 genes involved in various biological processes including immune regulation exhibited expression patterns that were differentially associated with pregnancy in the presence or absence of RA. Conclusions: Our findings support the hypothesis that the maternal immune system plays an active role during pregnancy, and also provide insight into other systemic changes that occur in the maternal transcriptome during pregnancy compared to the pre-pregnancy state. Only a small proportion of genes modulated by pregnancy were influenced by presence of RA in our data. PMID:26683605
Mittal, Anuradha; Pachter, Lior; Nelson, J Lee; Kjærgaard, Hanne; Smed, Mette Kiel; Gildengorin, Virginia L; Zoffmann, Vibeke; Hetland, Merete Lund; Jewell, Nicholas P; Olsen, Jørn; Jawaheer, Damini
2015-01-01
Pregnancy induces drastic biological changes systemically, and has a beneficial effect on some autoimmune conditions such as rheumatoid arthritis (RA). However, specific systemic changes that occur as a result of pregnancy have not been thoroughly examined in healthy women or women with RA. The goal of this study was to identify genes with expression patterns associated with pregnancy, compared to pre-pregnancy as baseline and determine whether those associations are modified by presence of RA. In our RNA sequencing (RNA-seq) dataset from 5 healthy women and 20 women with RA, normalized expression levels of 4,710 genes were significantly associated with pregnancy status (pre-pregnancy, first, second and third trimesters) over time, irrespective of presence of RA (False Discovery Rate (FDR)-adjusted p value<0.05). These genes were enriched in pathways spanning multiple systems, as would be expected during pregnancy. A subset of these genes (n = 256) showed greater than two-fold change in expression during pregnancy compared to baseline levels, with distinct temporal trends through pregnancy. Another 98 genes involved in various biological processes including immune regulation exhibited expression patterns that were differentially associated with pregnancy in the presence or absence of RA. Our findings support the hypothesis that the maternal immune system plays an active role during pregnancy, and also provide insight into other systemic changes that occur in the maternal transcriptome during pregnancy compared to the pre-pregnancy state. Only a small proportion of genes modulated by pregnancy were influenced by presence of RA in our data.
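The FDR-adjusted p value threshold mentioned above can be illustrated with the Benjamini-Hochberg procedure, which is the most common way to compute FDR-adjusted p values. The abstract does not name the adjustment method, so Benjamini-Hochberg is an assumption here, and the function name benjamini_hochberg is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p values, in input order.

    Each raw p value is scaled by m / rank (rank 1 = smallest), and a
    running minimum taken from the largest rank downward keeps the
    adjusted values monotone. Assumed method; the source only says
    "FDR-adjusted".
    """
    m = len(pvalues)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

A gene would then be called significant when its adjusted value falls below 0.05, for example by filtering the output of benjamini_hochberg over the per-gene p values.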
Rinchai, Darawan; Boughorbel, Sabri; Presnell, Scott; Quinn, Charlie; Chaussabel, Damien
2016-01-01
Systems-scale profiling approaches have become widely used in translational research settings. The resulting accumulation of large-scale datasets in public repositories represents a critical opportunity to promote insight and foster knowledge discovery. However, resources that can serve as an interface between biomedical researchers and such vast and heterogeneous dataset collections are needed in order to fulfill this potential. Recently, we have developed an interactive data browsing and visualization web application, the Gene Expression Browser (GXB). This tool can be used to overlay deep molecular phenotyping data with rich contextual information about analytes, samples and studies along with ancillary clinical or immunological profiling data. In this note, we describe a curated compendium of 93 public datasets generated in the context of human monocyte immunological studies, representing a total of 4,516 transcriptome profiles. Datasets were uploaded to an instance of GXB along with study description and sample annotations. Study samples were arranged in different groups. Ranked gene lists were generated based on relevant group comparisons. This resource is publicly available online at http://monocyte.gxbsidra.org/dm3/landing.gsp. PMID:27158452
2013-01-01
Background: The regenerative response of Schwann cells after peripheral nerve injury is a critical process directly related to the pathophysiology of a number of neurodegenerative diseases. This Schwann cell injury response is dependent on an intricate gene regulatory program coordinated by a number of transcription factors and microRNAs, but the interactions among them remain largely unknown. Uncovering the transcriptional and post-transcriptional regulatory networks governing the Schwann cell injury response is a key step towards a better understanding of Schwann cell biology and may help develop novel therapies for related diseases. Performing such comprehensive network analysis requires systematic bioinformatics methods to integrate multiple genomic datasets. Results: In this study we present a computational pipeline to infer transcription factor and microRNA regulatory networks. Our approach combined mRNA and microRNA expression profiling data, ChIP-Seq data of transcription factors, and computational transcription factor and microRNA target prediction. Using mRNA and microRNA expression data collected in a Schwann cell injury model, we constructed a regulatory network and studied regulatory pathways involved in Schwann cell response to injury. Furthermore, we analyzed network motifs and obtained insights on cooperative regulation of transcription factors and microRNAs in Schwann cell injury recovery. Conclusions: This work demonstrates a systematic method for gene regulatory network inference that may be used to gain new information on gene regulation by transcription factors and microRNAs. PMID:23387820
Li, Lingli; Zhang, Hehua; Liu, Zhongshuai; Cui, Xiaoyue; Zhang, Tong; Li, Yanfang; Zhang, Lingyun
2016-10-12
Blueberry is an economically important fruit crop in the Ericaceae family. The substantial quantities of flavonoids in blueberry have been implicated in a broad range of health benefits. However, transcriptome-level information on fruit development and flavonoid metabolites is still limited. In the present study, transcriptome and gene expression profiling was carried out over berry development, especially during color development. Approximately 13.67 Gbp of data were obtained and assembled into 186,962 transcripts and 80,836 unigenes from three stages of blueberry fruit and color development. A large number of simple sequence repeats (SSRs) and candidate genes potentially involved in plant development and in metabolic and hormone pathways were identified. A total of 6429 sequences containing 8796 SSRs were characterized from 15,457 unigenes, and 1763 unigenes contained more than one SSR. The expression profiles of key genes involved in anthocyanin biosynthesis were also studied. In addition, a comparison between our dataset and other published results was carried out. The high-quality reads produced in this study are an important advancement and provide a new resource for the interpretation of high-throughput data for blueberry species, in terms of both sequencing depth and species coverage. This transcriptome data will serve as a valuable public information database for studies of the blueberry genome and should greatly boost research on fruit and color development, flavonoid metabolism and regulation, and the breeding of more healthful blueberries.
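To make the SSR identification step concrete, the sketch below scans a nucleotide sequence for perfect di- to hexanucleotide repeats with a regular expression. The abstract does not report which tool or thresholds were used, so the minimum repeat counts in MIN_REPEATS and the function names are assumptions chosen for illustration, and mononucleotide runs are deliberately ignored.

```python
import re

# Assumed minimum number of tandem copies per motif length (hypothetical
# thresholds; the source does not state the ones actually used).
MIN_REPEATS = {2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit."""
    n = len(motif)
    for k in range(1, n):
        if n % k == 0 and motif[:k] * (n // k) == motif:
            return False
    return True


def find_ssrs(sequence):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    sequence = sequence.upper()
    hits = []
    for motif_len, min_rep in MIN_REPEATS.items():
        # A motif of motif_len bases followed by at least min_rep - 1 copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_rep - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            # Skip e.g. "AA" or "AGAG", which are shorter motifs in disguise.
            if is_primitive(motif):
                hits.append((match.start(), motif, len(match.group(0)) // motif_len))
    return sorted(hits)
```

Applied to each unigene in turn, a function like find_ssrs would yield the per-sequence SSR counts from which totals such as "unigenes containing more than one SSR" can be tallied.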
Kakati, Tulika; Kashyap, Hirak; Bhattacharyya, Dhruba K
2016-11-30
Many tools and methods exist for constructing co-expression networks from gene expression data and for extracting densely connected gene modules. In this paper, a method is introduced to construct a co-expression network and to extract co-expressed modules having high biological significance. The proposed method has been validated on several well-known microarray datasets drawn from a diverse set of species, using statistical measures such as p and q values. The modules obtained in these studies are found to be biologically significant based on Gene Ontology enrichment analysis, pathway analysis, and KEGG enrichment analysis. Further, the method was applied to an Alzheimer's disease dataset, and some interesting genes were found that have high semantic similarity among them but are not significantly correlated in terms of expression similarity. Some of these interesting genes, such as MAPT, CASP2, and PSEN2, are linked with important aspects of Alzheimer's disease, such as dementia, increased cell death, and deposition of amyloid-beta proteins in Alzheimer's disease brains. The biological pathways associated with Alzheimer's disease, such as Wnt signaling, apoptosis, p53 signaling, and Notch signaling, incorporate these interesting genes. The proposed method is evaluated with regard to existing literature.
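For readers unfamiliar with co-expression networks, the sketch below shows one generic construction: connect two genes when the absolute Pearson correlation of their expression profiles exceeds a threshold, then report connected components as modules. This is not the method proposed in the paper, whose details the abstract does not give; the threshold of 0.9, the function names, and the gene labels in the usage note are all hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def coexpression_modules(expr, threshold=0.9):
    """Group genes into modules by thresholded correlation.

    expr: dict mapping gene name -> list of expression values.
    Returns a list of modules (sorted gene lists) with at least two genes.
    """
    # Build an undirected graph: edge when |r| >= threshold (assumed cutoff).
    adjacency = {gene: set() for gene in expr}
    for g1, g2 in combinations(expr, 2):
        if abs(pearson(expr[g1], expr[g2])) >= threshold:
            adjacency[g1].add(g2)
            adjacency[g2].add(g1)

    # Connected components via depth-first search.
    modules = []
    seen = set()
    for gene in expr:
        if gene in seen:
            continue
        stack, component = [gene], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        if len(component) > 1:
            modules.append(sorted(component))
    return modules
```

For example, with hypothetical profiles geneA = [1, 2, 3, 4], geneB = [2, 4, 6, 8], geneC = [4, 3, 2, 1] and geneD = [1, 3, 1, 3], the first three genes form one module and geneD is left out. Note that a purely correlation-based approach like this one would miss gene pairs that are semantically similar but not co-expressed, which is the kind of case the abstract highlights.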