Sample records for expression datasets implications

  1. Microarray Data Mining for Potential Selenium Targets in Chemoprevention of Prostate Cancer

    PubMed Central

    ZHANG, HAITAO; DONG, YAN; ZHAO, HONGJUAN; BROOKS, JAMES D.; HAWTHORN, LESLEYANN; NOWAK, NORMA; MARSHALL, JAMES R.; GAO, ALLEN C.; IP, CLEMENT

    2008-01-01

    Background: A previous clinical trial showed that selenium supplementation significantly reduced the incidence of prostate cancer. We report here a bioinformatics approach to gain new insights into selenium molecular targets that might be relevant to prostate cancer chemoprevention. Materials and Methods: We first performed data mining analysis to identify genes that are consistently dysregulated in prostate cancer, using published datasets from gene expression profiling of clinical prostate specimens. We then devised a method to systematically analyze three selenium microarray datasets from LNCaP human prostate cancer cells, and to match the analysis to the cohort of genes implicated in prostate carcinogenesis. Moreover, we compared the selenium datasets with two datasets obtained from expression profiling of androgen-stimulated LNCaP cells. Results: We found that selenium reverses the expression of genes implicated in prostate carcinogenesis. In addition, we found that selenium could counteract the effect of androgen on the expression of a subset of androgen-regulated genes. Conclusions: The above information provides a wealth of new clues for investigating the mechanism of selenium chemoprevention of prostate cancer. Furthermore, these selenium target genes could also serve as biomarkers in future clinical trials to gauge the efficacy of selenium intervention. PMID:18548127

  2. A microarray whole-genome gene expression dataset in a rat model of inflammatory corneal angiogenesis.

    PubMed

    Mukwaya, Anthony; Lindvall, Jessica M; Xeroudaki, Maria; Peebo, Beatrice; Ali, Zaheer; Lennikov, Anton; Jensen, Lasse Dahl Ejby; Lagali, Neil

    2016-11-22

    In angiogenesis with concurrent inflammation, many pathways are activated, some linked to VEGF and others largely VEGF-independent. Pathways involving inflammatory mediators, chemokines, and micro-RNAs may play important roles in maintaining a pro-angiogenic environment or mediating angiogenic regression. Here, we describe a gene expression dataset to facilitate exploration of pro-angiogenic, pro-inflammatory, and remodelling/normalization-associated genes during both an active capillary sprouting phase, and in the restoration of an avascular phenotype. The dataset was generated by microarray analysis of the whole transcriptome in a rat model of suture-induced inflammatory corneal neovascularisation. Regions of active capillary sprout growth or regression in the cornea were harvested and total RNA extracted from four biological replicates per group. High quality RNA was obtained for gene expression analysis using microarrays. Fold change of selected genes was validated by qPCR, and protein expression was evaluated by immunohistochemistry. We provide a gene expression dataset that may be re-used to investigate corneal neovascularisation, and may also have implications in other contexts of inflammation-mediated angiogenesis.

  3. Cancer Detection in Microarray Data Using a Modified Cat Swarm Optimization Clustering Approach

    PubMed

    M, Pandi; R, Balamurugan; N, Sadhasivam

    2017-12-29

    Objective: A better understanding of functional genomics can be obtained by extracting patterns hidden in gene expression data. This could have paramount implications for cancer diagnosis, gene treatments and other domains. Clustering may reveal natural structures and identify interesting patterns in underlying data. The main objective of this research was to derive a heuristic approach for detecting highly co-expressed genes related to cancer from gene expression data with minimum Mean Squared Error (MSE). Methods: A modified CSO algorithm using Harmony Search (MCSO-HS) was applied to cluster cancer gene expression data. Experimental results were analyzed using two cancer gene expression benchmark datasets, one for leukaemia and one for breast cancer. Result: In terms of MSE, MCSO-HS improved on HS and CSO by 13% and 9%, respectively, with the leukaemia dataset, and by 22% and 17%, respectively, with the breast cancer dataset. Conclusion: The results showed MCSO-HS to outperform HS and CSO with both benchmark datasets. To validate the clustering results, this work was tested with internal and external cluster validation indices. This work also points to biological validation of clusters with gene ontology in terms of function, process and component.
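
    The abstract above ranks clusterings by MSE but does not define it. A minimal sketch, assuming the usual within-cluster definition (mean squared Euclidean distance from each expression profile to its cluster centroid); the function name `cluster_mse` and the toy profiles are hypothetical and not taken from the study:

```python
from typing import Dict, List, Sequence


def cluster_mse(profiles: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean squared Euclidean distance of each profile to its cluster centroid.

    Assumed definition; the source abstract does not spell out its MSE formula.
    """
    if len(profiles) != len(labels) or not profiles:
        raise ValueError("profiles and labels must be non-empty and of equal length")
    # Group profile indices by cluster label.
    members: Dict[int, List[int]] = {}
    for idx, lab in enumerate(labels):
        members.setdefault(lab, []).append(idx)
    total = 0.0
    for idxs in members.values():
        dim = len(profiles[idxs[0]])
        # Centroid = per-dimension mean over the cluster's members.
        centroid = [sum(profiles[i][d] for i in idxs) / len(idxs) for d in range(dim)]
        for i in idxs:
            total += sum((profiles[i][d] - centroid[d]) ** 2 for d in range(dim))
    return total / len(profiles)


# Toy expression profiles (invented values): two tight groups of two genes each.
genes = [[1.0, 2.0], [1.0, 4.0], [10.0, 10.0], [12.0, 10.0]]
good = cluster_mse(genes, [0, 0, 1, 1])  # groups match the structure
bad = cluster_mse(genes, [0, 1, 0, 1])   # groups cut across the structure
```

    Under this definition a lower value means tighter clusters, so an optimizer such as the one described would prefer the first labelling over the second.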

  4. 8D.07: GENE EXPRESSION ANALYSIS AND BIOINFORMATICS REVEALED POTENTIAL TRANSCRIPTION FACTORS ASSOCIATED WITH RENIN-ANGIOTENSIN-ALDOSTERONE SYSTEM IN ATHEROMA.

    PubMed

    Nehme, A; Zibara, K; Cerutti, C; Bricca, G

    2015-06-01

    The implication of the renin-angiotensin-aldosterone system (RAAS) in atheroma development is well described. However, a complete view of the local RAAS in atheroma is still missing. In this study we aimed to reveal the organization of RAAS in atheroma at the transcriptomic level and identify the transcriptional regulators behind it. Extended RAAS (extRAAS) was defined as the set of 37 genes coding for classical and novel RAAS participants (Figure 1). Five microarray datasets containing overall 590 samples representing carotid and peripheral atheroma were downloaded from the GEO database. Correlation-based hierarchical clustering (R software) of extRAAS genes within each dataset allowed the identification of modules of co-expressed genes. Reproducible co-expression modules across datasets were then extracted. Transcription factors (TFs) having common binding sites (TFBSs) in the promoters of coordinated genes were identified using the Genomatix database tools and analyzed for their correlation with extRAAS genes in the microarray datasets. Expression data revealed the expressed extRAAS components and their relative abundance displaying the favored pathways in atheroma. Three co-expression modules with more than 80% reproducibility across datasets were extracted. Two of them (M1 and M2) contained genes coding for angiotensin metabolizing enzymes involved in different pathways: M1 included ACE, MME, RNPEP, and DPP3, in addition to 7 other genes; and M2 included CMA1, CTSG, and CPA3. The third module (M3) contained genes coding for receptors known to be implicated in atheroma (AGTR1, MR, GR, LNPEP, EGFR and GPER). M1 and M3 were negatively correlated in 3 of 5 datasets. We identified 19 TFs that have enriched TFBSs in the promoters of genes of M1, and two for M3, but none was found for M2. 
Among the extracted TFs, ELF1, MAX, and IRF5 showed significant positive correlations with peptidase-coding genes from M1 and negative correlations with receptor-coding genes from M3 (p < 0.05). The identified co-expression modules display the transcriptional organization of local extRAAS in human carotid atheroma. The identification of several TFs potentially associated with extRAAS genes may provide a framework for the discovery of atheroma-specific modulators of extRAAS activity. (Figure is included in full-text article.)
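
    The module-finding step above (correlation-based hierarchical clustering, done by the authors in R) can be illustrated with a small Python sketch. The linkage method (average), the distance (1 minus Pearson r), the cut-off `max_dist`, and all expression values below are assumptions for illustration; only the gene symbols come from the abstract:

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) for constant vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def coexpression_modules(expr: Dict[str, Sequence[float]], max_dist: float) -> List[List[str]]:
    """Average-linkage agglomerative clustering on the distance 1 - r."""
    genes = list(expr)
    dist = {
        frozenset(pair): 1.0 - pearson(expr[pair[0]], expr[pair[1]])
        for pair in combinations(genes, 2)
    }
    clusters: List[List[str]] = [[g] for g in genes]

    def linkage(a: List[str], b: List[str]) -> float:
        # Mean pairwise distance between the members of two clusters.
        return sum(dist[frozenset((g, h))] for g in a for h in b) / (len(a) * len(b))

    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        if linkage(clusters[i], clusters[j]) > max_dist:
            break  # the closest pair is already too dissimilar to merge
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(c) for c in clusters]


# Invented expression values across five samples (not data from the study).
expr = {
    "ACE": [1, 2, 3, 4, 5],
    "MME": [2, 4, 6, 8, 11],
    "AGTR1": [5, 4, 3, 2, 1],
    "EGFR": [10, 8, 7, 4, 2],
}
modules = coexpression_modules(expr, max_dist=0.5)
```

    With these toy values the two positively correlated pairs form separate modules, and the strong negative correlation between the pairs keeps them from merging.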

  5. LINC00472 expression is regulated by promoter methylation and associated with disease-free survival in patients with grade 2 breast cancer

    PubMed Central

    Shen, Yi; Wang, Zhanwei; Loo, Lenora WM; Ni, Yan; Jia, Wei; Fei, Peiwen; Risch, Harvey A.; Katsaros, Dionyssios; Yu, Herbert

    2015-01-01

    Long non-coding RNAs (lncRNAs) are a class of newly recognized DNA transcripts that have diverse biological activities. Dysregulation of lncRNAs may be involved in many pathogenic processes including cancer. Recently, we found an intergenic lncRNA, LINC00472, whose expression was correlated with breast cancer progression and patient survival. Our findings were consistent across multiple clinical datasets and supported by results from in vitro experiments. To evaluate further the role of LINC00472 in breast cancer, we used various online databases to investigate possible mechanisms that might affect LINC00472 expression in breast cancer. We also analyzed associations of LINC00472 with estrogen receptor, tumor grade, and molecular subtypes in additional online datasets generated by microarray platforms different from the one we investigated previously. We found that LINC00472 expression in breast cancer was more likely regulated by promoter methylation than by alteration of gene copy number. Analysis of additional datasets confirmed our previous findings that high expression of LINC00472 is associated with ER-positive and low-grade tumors and favorable molecular subtypes. Finally, in nine datasets, we examined the association of LINC00472 expression with disease-free survival in patients with grade 2 tumors. Meta-analysis of the datasets showed that LINC00472 expression in breast tumors predicted the recurrence of breast cancer in patients with grade 2 tumors. In summary, our analyses confirm that LINC00472 is functionally a tumor suppressor, and that assessing its expression in breast tumors may have clinical implications in breast cancer management. PMID:26564482

  6. LINC00472 expression is regulated by promoter methylation and associated with disease-free survival in patients with grade 2 breast cancer.

    PubMed

    Shen, Yi; Wang, Zhanwei; Loo, Lenora W M; Ni, Yan; Jia, Wei; Fei, Peiwen; Risch, Harvey A; Katsaros, Dionyssios; Yu, Herbert

    2015-12-01

    Long non-coding RNAs (lncRNAs) are a class of newly recognized DNA transcripts that have diverse biological activities. Dysregulation of lncRNAs may be involved in many pathogenic processes including cancer. Recently, we found an intergenic lncRNA, LINC00472, whose expression was correlated with breast cancer progression and patient survival. Our findings were consistent across multiple clinical datasets and supported by results from in vitro experiments. To evaluate further the role of LINC00472 in breast cancer, we used various online databases to investigate possible mechanisms that might affect LINC00472 expression in breast cancer. We also analyzed associations of LINC00472 with estrogen receptor, tumor grade, and molecular subtypes in additional online datasets generated by microarray platforms different from the one we investigated previously. We found that LINC00472 expression in breast cancer was more likely regulated by promoter methylation than by alteration of gene copy number. Analysis of additional datasets confirmed our previous findings that high expression of LINC00472 is associated with ER-positive and low-grade tumors and favorable molecular subtypes. Finally, in nine datasets, we examined the association of LINC00472 expression with disease-free survival in patients with grade 2 tumors. Meta-analysis of the datasets showed that LINC00472 expression in breast tumors predicted the recurrence of breast cancer in patients with grade 2 tumors. In summary, our analyses confirm that LINC00472 is functionally a tumor suppressor, and that assessing its expression in breast tumors may have clinical implications in breast cancer management.

  7. Gene-Expression Signature Predicts Postoperative Recurrence in Stage I Non-Small Cell Lung Cancer Patients

    PubMed Central

    Lu, Yan; Wang, Liang; Liu, Pengyuan; Yang, Ping; You, Ming

    2012-01-01

    About 30% of stage I non-small cell lung cancer (NSCLC) patients undergoing resection will have a recurrence. Robust prognostic markers are required to better manage therapy options. The purpose of this study is to develop and validate a novel gene-expression signature that can predict tumor recurrence in stage I NSCLC patients. Cox proportional hazards regression analysis was performed to identify recurrence-related genes, and a partial Cox regression model was used to generate a gene signature of recurrence in the training dataset, 142 stage I lung adenocarcinomas without adjunctive therapy from the Director's Challenge Consortium. Four independent validation datasets, including GSE5843, GSE8894, and two other datasets provided by Mayo Clinic and Washington University, were used to assess the prediction accuracy by calculating the correlation between the risk score estimated from gene expression and real recurrence-free survival time, and the AUC of time-dependent ROC analysis. Pathway-based survival analyses were also performed. 104 probesets correlated with recurrence in the training dataset. They are enriched in cell adhesion, apoptosis and regulation of cell proliferation. A 51-gene expression signature was identified to distinguish patients likely to develop tumor recurrence (Dxy = −0.83, P<1e-16) and this signature was validated in four independent datasets with AUC >85%. Multiple pathways including leukocyte transendothelial migration and cell adhesion were highly correlated with recurrence-free survival. The gene signature is highly predictive of recurrence in stage I NSCLC patients, which has important prognostic and therapeutic implications for the future management of these patients. PMID:22292069

  8. Neuroligin 4X overexpression in human breast cancer is associated with poor relapse-free survival.

    PubMed

    Henderson, Henry J; Karanam, Balasubramanyam; Samant, Rajeev; Vig, Komal; Singh, Shree R; Yates, Clayton; Bedi, Deepa

    2017-01-01

    The molecular mechanisms involved in breast cancer progression and metastasis remain unclear. Breast cancer is a heterogeneous disease featuring several different phenotypes with consistently different biological characteristics. Neuroligins are neural cell adhesion molecules that have been implicated in heterotopic cell adhesion. In humans, alterations in neuroligin genes are implicated in autism and other cognitive diseases. More recently, neuroligins have been shown to be abundantly expressed in blood vessels and to play a role in the growth of glioma cells. Here we report increased expression of neuroligin 4X (NLGN4X) in breast cancer. We found NLGN4X was abundantly expressed in breast cancer tissues. NLGN4X expression data for all breast cancer cell lines in the Cancer Cell Line Encyclopedia (CCLE) were analyzed. Correlations between NLGN4X levels and clinicopathologic parameters were analyzed within Oncomine datasets. Evaluation of these bioinformatic datasets revealed that NLGN4X expression was higher in triple-negative breast cancer cells and tissues, particularly the basal subtype, than in non-triple-negative sets. Its level was also observed to be higher in metastatic tissues. RT-PCR, flow cytometry and immunofluorescence study of MDA-MB-231 and MCF-7 breast cancer cells validated that NLGN4X was increased in MDA-MB-231. Knockdown of NLGN4X expression by siRNA decreased cell proliferation and migration significantly in MDA-MB-231 breast cancer cells. NLGN4X knockdown in MDA-MB-231 cells resulted in induction of apoptosis as determined by annexin staining, elevated caspase 3/7 and cleaved PARP by flow cytometry. High NLGN4X expression correlated strongly with decreased relapse-free survival in TNBC. NLGN4X might represent a novel biomarker and therapeutic target for breast cancer. Inhibition of NLGN4X may be a new strategy for the prevention and treatment of breast cancer.

  9. Characteristics of allelic gene expression in human brain cells from single-cell RNA-seq data analysis.

    PubMed

    Zhao, Dejian; Lin, Mingyan; Pedrosa, Erika; Lachman, Herbert M; Zheng, Deyou

    2017-11-10

    Monoallelic expression of autosomal genes has been implicated in human psychiatric disorders. However, there is a paucity of allelic expression studies in human brain cells at the single cell and genome wide levels. In this report, we reanalyzed a previously published single-cell RNA-seq dataset from several postmortem human brains and observed pervasive monoallelic expression in individual cells, largely in a random manner. Examining single nucleotide variants with a predicted functional disruption, we found that the "damaged" alleles were overall expressed in fewer brain cells than their counterparts, and at a lower level in cells where their expression was detected. We also identified many brain cell type-specific monoallelically expressed genes. Interestingly, many of these cell type-specific monoallelically expressed genes were enriched for functions important for those brain cell types. In addition, function analysis showed that genes displaying monoallelic expression and correlated expression across neuronal cells from different individual brains were implicated in the regulation of synaptic function. Our findings suggest that monoallelic gene expression is prevalent in human brain cells, which may play a role in generating cellular identity and neuronal diversity and thus increasing the complexity and diversity of brain cell functions.

  10. A methodology for translating positional error into measures of attribute error, and combining the two error sources

    Treesearch

    Yohay Carmel; Curtis Flather; Denis Dean

    2006-01-01

    This paper summarizes our efforts to investigate the nature, behavior, and implications of positional error and attribute error in spatiotemporal datasets. Estimating the combined influence of these errors on map analysis has been hindered by the fact that these two error types are traditionally expressed in different units (distance units, and categorical units,...

  11. ATP binding cassette (ABC) transporters: expression and clinical value in glioblastoma.

    PubMed

    Dréan, Antonin; Rosenberg, Shai; Lejeune, François-Xavier; Goli, Larissa; Nadaradjane, Aravindan Arun; Guehennec, Jérémy; Schmitt, Charlotte; Verreault, Maïté; Bielle, Franck; Mokhtari, Karima; Sanson, Marc; Carpentier, Alexandre; Delattre, Jean-Yves; Idbaih, Ahmed

    2018-03-08

    ATP-binding cassette transporters (ABC transporters) regulate traffic of multiple compounds, including chemotherapeutic agents, through biological membranes. They are expressed by multiple cell types and have been implicated in the drug resistance of some cancer cells. Despite significant research in ABC transporters in the context of many diseases, little is known about their expression and clinical value in glioblastoma (GBM). We analyzed expression of 49 ABC transporters in both commercial and patient-derived GBM cell lines as well as from 51 human GBM tumor biopsies. Using The Cancer Genome Atlas (TCGA) cohort as a training dataset and our cohort as a validation dataset, we also investigated the prognostic value of these ABC transporters in newly diagnosed GBM patients, treated with the standard of care. In contrast to commercial GBM cell lines, GBM-patient derived cell lines (PDCL), grown as neurospheres in a serum-free medium, express ABC transporters similarly to parental tumors. Serum appeared to slightly increase resistance to temozolomide correlating with a tendency for an increased expression of ABCB1. Some differences were observed mainly due to expression of ABC transporters by microenvironmental cells. Together, our data suggest that the efficacy of chemotherapeutic agents may be misestimated in vitro if they are the targets of efflux pumps whose expression can be modulated by serum. Interestingly, several ABC transporters have prognostic value in the TCGA dataset. In our cohort of 51 GBM patients treated with radiation therapy with concurrent and adjuvant temozolomide, ABCA13 overexpression is associated with a decreased progression free survival in univariate (p < 0.01) and multivariate analyses including MGMT promoter methylation (p = 0.05) suggesting reduced sensitivity to temozolomide in ABCA13 overexpressing GBM. Expression of ABC transporters is: (i) detected in GBM and microenvironmental cells and (ii) better reproduced in GBM-PDCL. 
ABCA13 expression is an independent prognostic factor in newly diagnosed GBM patients. Further prospective studies are warranted to investigate whether ABCA13 expression can be used to further personalize treatments for GBM.

  12. Overcoming the matched-sample bottleneck: an orthogonal approach to integrate omic data.

    PubMed

    Nguyen, Tin; Diaz, Diana; Tagett, Rebecca; Draghici, Sorin

    2016-07-12

    MicroRNAs (miRNAs) are small non-coding RNA molecules whose primary function is to regulate the expression of gene products via hybridization to mRNA transcripts, resulting in suppression of translation or mRNA degradation. Although miRNAs have been implicated in complex diseases, including cancer, their impact on distinct biological pathways and phenotypes is largely unknown. Current integration approaches require sample-matched miRNA/mRNA datasets, resulting in limited applicability in practice. Since these approaches cannot integrate heterogeneous information available across independent experiments, they neither account for bias inherent in individual studies, nor do they benefit from increased sample size. Here we present a novel framework able to integrate miRNA and mRNA data (vertical data integration) available in independent studies (horizontal meta-analysis) allowing for a comprehensive analysis of the given phenotypes. To demonstrate the utility of our method, we conducted a meta-analysis of pancreatic and colorectal cancer, using 1,471 samples from 15 mRNA and 14 miRNA expression datasets. Our two-dimensional data integration approach greatly increases the power of statistical analysis and correctly identifies pathways known to be implicated in the phenotypes. The proposed framework is sufficiently general to integrate other types of data obtained from high-throughput assays.

  13. Revealing complex function, process and pathway interactions with high-throughput expression and biological annotation data.

    PubMed

    Singh, Nitesh Kumar; Ernst, Mathias; Liebscher, Volkmar; Fuellen, Georg; Taher, Leila

    2016-10-20

    The biological relationships both between and within the functions, processes and pathways that operate within complex biological systems are only poorly characterized, making the interpretation of large scale gene expression datasets extremely challenging. Here, we present an approach that integrates gene expression and biological annotation data to identify and describe the interactions between biological functions, processes and pathways that govern a phenotype of interest. The product is a global, interconnected network, not of genes but of functions, processes and pathways, that represents the biological relationships within the system. We validated our approach on two high-throughput expression datasets describing organismal and organ development. Our findings are well supported by the available literature, confirming that developmental processes and apoptosis play key roles in cell differentiation. Furthermore, our results suggest that processes related to pluripotency and lineage commitment, which are known to be critical for development, interact mainly indirectly, through genes implicated in more general biological processes. Moreover, we provide evidence that supports the relevance of cell spatial organization in the developing liver for proper liver function. Our strategy can be viewed as an abstraction that is useful to interpret high-throughput data and devise further experiments.

  14. A compendium of multi-omic sequence information from the Saanich Inlet water column

    DOE PAGES

    Hawley, Alyse K.; Torres-Beltran, Monica; Zaikova, Elena; ...

    2017-10-31

    Microbial communities play vital roles in earth’s geochemical cycles. Within marine oxygen minimum zones (OMZs), gradients of oxygen, nitrate and sulfide create redox gradients that drive biogeochemical cycling of carbon, nitrogen and sulphur. Climate-change induced expansion and intensification of OMZs and associated biogeochemical activities have significant implications for greenhouse gas production, i.e. nitrous oxide and methane. Next generation sequencing technologies have enabled observations of changes in microbial community structure and expression of RNA and protein along these redox gradients within OMZs. Here, we present a multi-omic time series dataset from Saanich Inlet spanning six years, including high spatial resolution small subunit ribosomal RNA tags, metagenomes, metatranscriptomes, and metaproteomes. As a result, this compendium provides paired multi-omic datasets over multiple time points, providing a basis for exploring shifts in microbial community interactions and regulation of metabolic activities both along redox gradients and over time, with implications for global climate models.

  15. A compendium of multi-omic sequence information from the Saanich Inlet water column

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hawley, Alyse K.; Torres-Beltran, Monica; Zaikova, Elena

    Microbial communities play vital roles in earth’s geochemical cycles. Within marine oxygen minimum zones (OMZs), gradients of oxygen, nitrate and sulfide create redox gradients that drive biogeochemical cycling of carbon, nitrogen and sulphur. Climate-change induced expansion and intensification of OMZs and associated biogeochemical activities have significant implications for greenhouse gas production, i.e. nitrous oxide and methane. Next generation sequencing technologies have enabled observations of changes in microbial community structure and expression of RNA and protein along these redox gradients within OMZs. Here, we present a multi-omic time series dataset from Saanich Inlet spanning six years, including high spatial resolution small subunit ribosomal RNA tags, metagenomes, metatranscriptomes, and metaproteomes. As a result, this compendium provides paired multi-omic datasets over multiple time points, providing a basis for exploring shifts in microbial community interactions and regulation of metabolic activities both along redox gradients and over time, with implications for global climate models.

  16. A potential prognostic long non-coding RNA signature to predict metastasis-free survival of breast cancer patients.

    PubMed

    Sun, Jie; Chen, Xihai; Wang, Zhenzhen; Guo, Maoni; Shi, Hongbo; Wang, Xiaojun; Cheng, Liang; Zhou, Meng

    2015-11-09

    Long non-coding RNAs (lncRNAs) have been implicated in a variety of biological processes, and dysregulated lncRNAs have demonstrated potential roles as biomarkers and therapeutic targets for cancer prognosis and treatment. In this study, by repurposing microarray probes, we analyzed lncRNA expression profiles of 916 breast cancer patients from the Gene Expression Omnibus (GEO). Nine lncRNAs were identified to be significantly associated with metastasis-free survival (MFS) in the training dataset of 254 patients using the Cox proportional hazards regression model. These nine lncRNAs were then combined to form a single prognostic signature for predicting metastatic risk in breast cancer patients that was able to classify patients in the training dataset into high- and low-risk subgroups with significantly different MFSs (median 2.4 years versus 3.0 years, log-rank test p < 0.001). This nine-lncRNA signature was similarly effective for prognosis in a testing dataset and two independent datasets. Further analysis showed that the predictive ability of the signature was independent of clinical variables, including age, ER status, ESR1 status and ERBB2 status. Our results indicated that lncRNA signature could be a useful prognostic marker to predict metastatic risk in breast cancer patients and may improve upon our understanding of the molecular mechanisms underlying breast cancer metastasis.
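
    The signature described above combines several lncRNAs into one score and splits patients into high- and low-risk groups. A minimal sketch, assuming the score is the Cox linear predictor (a coefficient-weighted sum of expression values) and that the cut-off is the median training score; the abstract states neither choice, and the coefficients and expression values below are hypothetical:

```python
from statistics import median
from typing import List, Sequence


def risk_scores(expression: Sequence[Sequence[float]], coefficients: Sequence[float]) -> List[float]:
    """Linear predictor per patient: sum over genes of coefficient * expression."""
    return [sum(b * x for b, x in zip(coefficients, row)) for row in expression]


def split_by_cutoff(scores: Sequence[float], cutoff: float) -> List[str]:
    """Label a patient 'high' risk when the score exceeds the cutoff, else 'low'."""
    return ["high" if s > cutoff else "low" for s in scores]


# Hypothetical Cox coefficients for three lncRNAs (not values from the study).
betas = [0.8, -0.5, 0.3]
# Hypothetical expression matrix: one row per patient, one column per lncRNA.
training = [
    [2.0, 1.0, 0.0],
    [0.0, 2.0, 1.0],
    [1.0, 1.0, 1.0],
    [3.0, 0.0, 2.0],
]
scores = risk_scores(training, betas)
cutoff = median(scores)  # assumed cut-off rule, fixed on the training data
groups = split_by_cutoff(scores, cutoff)
```

    In a setting like the one described, the same coefficients and the same training-derived cut-off would then be applied unchanged to the testing and independent datasets, so that the validation is not tuned to those cohorts.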

  17. UNCLES: method for the identification of genes differentially consistently co-expressed in a specific subset of datasets.

    PubMed

    Abu-Jamous, Basel; Fa, Rui; Roberts, David J; Nandi, Asoke K

    2015-06-04

    Collective analysis of the increasingly emerging gene expression datasets is required. The recently proposed binarisation of consensus partition matrices (Bi-CoPaM) method can combine clustering results from multiple datasets to identify the subsets of genes which are consistently co-expressed in all of the provided datasets in a tuneable manner. However, results validation and parameter setting are issues that complicate the design of such methods. Moreover, although it is a common practice to test methods by application to synthetic datasets, the mathematical models used to synthesise such datasets are usually based on approximations which may not always be sufficiently representative of real datasets. Here, we propose an unsupervised method for the unification of clustering results from multiple datasets using external specifications (UNCLES). This method has the ability to identify the subsets of genes consistently co-expressed in a subset of datasets while being poorly co-expressed in another subset of datasets, and to identify the subsets of genes consistently co-expressed in all given datasets. We also propose the M-N scatter plots validation technique and adopt it to set the parameters of UNCLES, such as the number of clusters, automatically. Additionally, we propose an approach for the synthesis of gene expression datasets using real data profiles in a way which combines the ground-truth-knowledge of synthetic data and the realistic expression values of real data, and therefore overcomes the problem of faithfulness of synthetic expression data modelling. By application to those datasets, we validate UNCLES while comparing it with other conventional clustering methods, and of particular relevance, biclustering methods. We further validate UNCLES by application to a set of 14 real genome-wide yeast datasets as it produces focused clusters that conform well to known biological facts.
Furthermore, in-silico-based hypotheses regarding the function of a few previously unknown genes in those focused clusters are drawn. The UNCLES method, the M-N scatter plots technique, and the expression data synthesis approach will have wide application for the comprehensive analysis of genomic and other sources of multiple complex biological datasets. Moreover, the derived in-silico-based biological hypotheses represent subjects for future functional studies.

  18. Integrative analysis of gene expression and DNA methylation using unsupervised feature extraction for detecting candidate cancer biomarkers.

    PubMed

    Moon, Myungjin; Nakai, Kenta

    2018-04-01

    Currently, cancer biomarker discovery is one of the important research topics worldwide. In particular, detecting significant genes related to cancer is an important task for early diagnosis and treatment of cancer. Conventional studies mostly focus on genes that are differentially expressed in different states of cancer; however, noise in gene expression datasets and insufficient information in limited datasets impede precise analysis of novel candidate biomarkers. In this study, we propose an integrative analysis of gene expression and DNA methylation using normalization and unsupervised feature extractions to identify candidate biomarkers of cancer using renal cell carcinoma RNA-seq datasets. Gene expression and DNA methylation datasets are normalized by Box-Cox transformation and integrated into a one-dimensional dataset that retains the major characteristics of the original datasets by unsupervised feature extraction methods, and differentially expressed genes are selected from the integrated dataset. Use of the integrated dataset demonstrated improved performance as compared with conventional approaches that utilize gene expression or DNA methylation datasets alone. Validation based on the literature showed that a considerable number of top-ranked genes from the integrated dataset have known relationships with cancer, implying that novel candidate biomarkers can also be acquired from the proposed analysis method. Furthermore, we expect that the proposed method can be expanded for applications involving various types of multi-omics datasets.
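
    The normalization step named above is the Box-Cox transformation, which maps a positive value x to (x^λ − 1)/λ for λ ≠ 0 and to ln(x) for λ = 0. A minimal sketch with λ passed in explicitly; how the authors chose λ is not stated in the abstract, so the values used here are purely illustrative:

```python
from math import log
from typing import List, Sequence


def box_cox(values: Sequence[float], lam: float) -> List[float]:
    """Box-Cox power transform for strictly positive inputs.

    lam == 0 gives the natural log; otherwise (x**lam - 1) / lam.
    """
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


# Illustrative inputs only (not data from the study).
sqrt_like = box_cox([1.0, 4.0], 0.5)  # square-root-like compression
log_like = box_cox([1.0], 0)          # log case
```

    In practice λ is usually estimated from the data rather than fixed by hand; `scipy.stats.boxcox`, for example, fits it by maximum likelihood when no λ is supplied.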

  19. Computational deconvolution of genome wide expression data from Parkinson's and Huntington's disease brain tissues using population-specific expression analysis

    PubMed Central

    Capurro, Alberto; Bodea, Liviu-Gabriel; Schaefer, Patrick; Luthi-Carter, Ruth; Perreau, Victoria M.

    2015-01-01

    The characterization of molecular changes in diseased tissues gives insight into pathophysiological mechanisms and is important for therapeutic development. Genome-wide gene expression analysis has proven valuable for identifying biological processes in neurodegenerative diseases using post mortem human brain tissue, and numerous datasets are publicly available. However, many studies utilize heterogeneous tissue samples consisting of multiple cell types, all of which contribute to global gene expression values, confounding biological interpretation of the data. In particular, changes in numbers of neuronal and glial cells occurring in neurodegeneration confound transcriptomic analyses, particularly in human brain tissues where sample availability and controls are limited. To identify cell-specific gene expression changes in neurodegenerative disease, we have applied our recently published computational deconvolution method, population-specific expression analysis (PSEA). PSEA estimates cell-type-specific expression values using reference expression measures, which in the case of brain tissue comprise mRNAs with cell-type-specific expression in neurons, astrocytes, oligodendrocytes and microglia. As an exercise in PSEA implementation and hypothesis development regarding neurodegenerative diseases, we applied PSEA to Parkinson's and Huntington's disease (PD, HD) datasets. Genes identified as differentially expressed in substantia nigra pars compacta neurons by PSEA were validated using external laser capture microdissection data. Network analysis and Annotation Clustering (DAVID) identified molecular processes implicated by differential gene expression in specific cell types. The results of these analyses provided new insights into the implementation of PSEA in brain tissues and additional refinement of molecular signatures in human HD and PD. PMID:25620908

  20. Bayesian test for colocalisation between pairs of genetic association studies using summary statistics.

    PubMed

    Giambartolomei, Claudia; Vukcevic, Damjan; Schadt, Eric E; Franke, Lude; Hingorani, Aroon D; Wallace, Chris; Plagnol, Vincent

    2014-05-01

    Genetic association studies, in particular the genome-wide association study (GWAS) design, have provided a wealth of novel insights into the aetiology of a wide range of human diseases and traits, in particular cardiovascular diseases and lipid biomarkers. The next challenge consists of understanding the molecular basis of these associations. The integration of multiple association datasets, including gene expression datasets, can contribute to this goal. We have developed a novel statistical methodology to assess whether two association signals are consistent with a shared causal variant. An application is the integration of disease scans with expression quantitative trait locus (eQTL) studies, but any pair of GWAS datasets can be integrated in this framework. We demonstrate the value of the approach by re-analysing a gene expression dataset in 966 liver samples with a published meta-analysis of lipid traits including >100,000 individuals of European ancestry. Combining all lipid biomarkers, our re-analysis supported 26 out of 38 reported colocalisation results with eQTLs and identified 14 new colocalisation results, hence highlighting the value of a formal statistical test. In three cases of reported eQTL-lipid pairs (SYPL2, IFT172, TBKBP1) for which our analysis suggests that the eQTL pattern is not consistent with the lipid association, we identify alternative colocalisation results with SORT1, GCKR, and KPNB1, indicating that these genes are more likely to be causal in these genomic intervals. A key feature of the method is the ability to derive the output statistics from single SNP summary statistics, hence making it possible to perform systematic meta-analysis type comparisons across multiple GWAS datasets (implemented online at http://coloc.cs.ucl.ac.uk/coloc/). 
    Our methodology provides information about candidate causal genes in associated intervals and has direct implications for the understanding of complex diseases as well as the design of drugs to target disease pathways.

  1. Systematic analysis of microarray datasets to identify Parkinson's disease‑associated pathways and genes.

    PubMed

    Feng, Yinling; Wang, Xuefeng

    2017-03-01

    In order to investigate commonly disturbed genes and pathways in various brain regions of patients with Parkinson's disease (PD), microarray datasets from previous studies were collected and systematically analyzed. Different normalization methods were applied to microarray datasets from different platforms. A strategy combining gene co‑expression networks and clinical information was adopted, using weighted gene co‑expression network analysis (WGCNA) to screen for commonly disturbed genes in different brain regions of patients with PD. Functional enrichment analysis of commonly disturbed genes was performed using the Database for Annotation, Visualization, and Integrated Discovery (DAVID). Co‑pathway relationships were identified with Pearson's correlation coefficient tests and a hypergeometric distribution‑based test. Genes shared by the two pathways of a pathway pair were selected and regarded as risk genes. A total of 17 microarray datasets from 7 platforms were retained for further analysis. Five gene co‑expression modules were identified, containing 9,745, 736, 233, 101 and 93 genes, respectively. One module was significantly correlated with PD samples and thus the 736 genes it contained were considered to be candidate PD‑associated genes. Functional enrichment analysis demonstrated that these genes were implicated in oxidative phosphorylation and PD. A total of 44 pathway pairs and 52 risk genes were revealed, and a risk gene pathway relationship network was constructed. Eight modules were identified and were revealed to be associated with PD, cancers and metabolism. A number of disturbed pathways and risk genes were unveiled in PD, and these findings may help advance understanding of PD pathogenesis.
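
    A hypergeometric distribution-based test of the kind mentioned in this abstract can be sketched as follows. The abstract does not specify how the test was parameterized, so the sketch assumes it asks whether two pathways share more genes than expected by chance; the function name `hypergeom_upper_tail` and the example numbers are hypothetical.

```python
from math import comb

def hypergeom_upper_tail(population, successes, draws, observed):
    """P(X >= observed) for X ~ Hypergeometric(population, successes, draws).

    population: number of background genes
    successes:  genes annotated to pathway A
    draws:      genes annotated to pathway B
    observed:   genes shared by pathways A and B
    """
    total = comb(population, draws)
    upper = min(successes, draws)
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return tail / total

# Hypothetical example: 20,000 background genes, pathways of 40 and 30 genes
# that share 8 genes. A small p-value suggests a non-random overlap.
p_value = hypergeom_upper_tail(20000, 40, 30, 8)
```

    Exact integer arithmetic via `math.comb` avoids the rounding problems of factorial-based formulas; for very large gene universes a log-space implementation or a statistics library would be the more usual choice.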

  2. DigOut: viewing differential expression genes as outliers.

    PubMed

    Yu, Hui; Tu, Kang; Xie, Lu; Li, Yuan-Yuan

    2010-12-01

    For well-replicated two-conditional microarray datasets, the selection of differentially expressed (DE) genes is a well-studied computational topic, but for multi-conditional microarray datasets with limited or no replication, the same task has not been properly addressed by previous studies. This paper adopts multivariate outlier analysis to analyze replication-lacking multi-conditional microarray datasets, finding that it performs significantly better than the widely used limit fold change (LFC) model in a simulated comparative experiment. Compared with the LFC model, the multivariate outlier analysis also demonstrates improved stability against sample variations in a series of manipulated real expression datasets. The reanalysis of a real non-replicated multi-conditional expression dataset series leads to satisfactory results. In conclusion, a multivariate outlier analysis algorithm, like DigOut, is particularly useful for selecting DE genes from non-replicated multi-conditional gene expression datasets.

  3. Cross-species inference of long non-coding RNAs greatly expands the ruminant transcriptome.

    PubMed

    Bush, Stephen J; Muriuki, Charity; McCulloch, Mary E B; Farquhar, Iseabail L; Clark, Emily L; Hume, David A

    2018-04-24

    mRNA-like long non-coding RNAs (lncRNAs) are a significant component of mammalian transcriptomes, although most are expressed only at low levels, with high tissue-specificity and/or at specific developmental stages. Thus, in many cases lncRNA detection by RNA-sequencing (RNA-seq) is compromised by stochastic sampling. To account for this and create a catalogue of ruminant lncRNAs, we compared de novo assembled lncRNAs derived from large RNA-seq datasets in transcriptional atlas projects for sheep and goats with previous lncRNAs assembled in cattle and human. We then combined the novel lncRNAs with the sheep transcriptional atlas to identify co-regulated sets of protein-coding and non-coding loci. Few lncRNAs could be reproducibly assembled from a single dataset, even with deep sequencing of the same tissues from multiple animals. Furthermore, there was little sequence overlap between lncRNAs that were assembled from pooled RNA-seq data. We combined positional conservation (synteny) with cross-species mapping of candidate lncRNAs to identify a consensus set of ruminant lncRNAs and then used the RNA-seq data to demonstrate detectable and reproducible expression in each species. In sheep, 20 to 30% of lncRNAs were located close to protein-coding genes with which they are strongly co-expressed, which is consistent with the evolutionary origin of some ncRNAs in enhancer sequences. Nevertheless, most of the lncRNAs are not co-expressed with neighbouring protein-coding genes. Alongside substantially expanding the ruminant lncRNA repertoire, the outcomes of our analysis demonstrate that stochastic sampling can be partly overcome by combining RNA-seq datasets from related species. This has practical implications for the future discovery of lncRNAs in other species.

  4. Similarity of markers identified from cancer gene expression studies: observations from GEO.

    PubMed

    Shi, Xingjie; Shen, Shihao; Liu, Jin; Huang, Jian; Zhou, Yong; Ma, Shuangge

    2014-09-01

    Gene expression profiling has been extensively conducted in cancer research. The analysis of multiple independent cancer gene expression datasets may provide additional information and complement single-dataset analysis. In this study, we conduct multi-dataset analysis and are interested in evaluating the similarity of cancer-associated genes identified from different datasets. The first objective of this study is to briefly review some statistical methods that can be used for such evaluation. Both marginal analysis and joint analysis methods are reviewed. The second objective is to apply those methods to 26 Gene Expression Omnibus (GEO) datasets on five types of cancers. Our analysis suggests that for the same cancer, the marker identification results may vary significantly across datasets, and different datasets share few common genes. In addition, datasets on different cancers share few common genes. The shared genetic basis of datasets on the same or different cancers, which has been suggested in the literature, is not observed in the analysis of GEO data.

  5. Coagulation factor VII is regulated by androgen receptor in breast cancer.

    PubMed

    Naderi, Ali

    2015-02-01

    Androgen receptor (AR) is widely expressed in breast cancer; however, there is limited information on the key molecular functions and gene targets of AR in this disease. In this study, gene expression data from a cohort of 52 breast cancer cell lines was analyzed to identify a network of AR co-expressed genes. A total of 300 genes, which were significantly enriched for cell cycle and metabolic functions, showed absolute correlation coefficients (|CC|) of more than 0.5 with AR expression across the dataset. In this network, a subset of 35 "AR-signature" genes were highly co-expressed with AR (|CC|>0.6) that included transcriptional regulators PATZ1, NFATC4, and SPDEF. Furthermore, the gene encoding coagulation factor VII (F7) demonstrated the closest expression pattern with AR (CC=0.716) in the dataset and factor VII protein expression was significantly associated with that of AR in a cohort of 209 breast tumors. Moreover, functional studies demonstrated that AR activation results in the induction of factor VII expression at both transcript and protein levels and AR directly binds to a proximal region of the F7 promoter in breast cancer cells. Importantly, AR activation in breast cancer cells induced endogenous factor VII activity to convert factor X to Xa in conjunction with tissue factor. In summary, F7 is a novel AR target gene and AR activation regulates the ectopic expression and activity of factor VII in breast cancer cells. These findings have functional implications in the pathobiology of thromboembolic events and regulation of factor VII/tissue factor signaling in breast cancer.
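
    The co-expression screen described in this abstract (keeping genes whose absolute correlation coefficient with AR exceeds 0.5) can be sketched as below. The abstract does not name the correlation measure, so Pearson correlation is an assumption here; the gene names, expression values, and the helpers `pearson` and `coexpressed` are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def coexpressed(reference, profiles, threshold=0.5):
    """Return {gene: r} for genes with |r| above the threshold."""
    hits = {}
    for gene, values in profiles.items():
        r = pearson(reference, values)
        if abs(r) > threshold:
            hits[gene] = r
    return hits

# Hypothetical expression of the reference gene across five cell lines.
ar = [1.0, 2.0, 3.0, 4.0, 5.0]
profiles = {
    "GENE_A": [2.0, 4.1, 5.9, 8.2, 10.0],  # rises with the reference
    "GENE_B": [5.0, 4.0, 3.2, 2.1, 0.9],   # falls as the reference rises
    "GENE_C": [3.0, 1.0, 3.0, 1.0, 3.0],   # no linear relationship
}
hits = coexpressed(ar, profiles)
```

    Filtering on the absolute value keeps both positively and negatively co-expressed genes, which matches the use of |CC| in the abstract; raising `threshold` to 0.6 would correspond to the stricter "AR-signature" cut-off.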

  6. Statistical modeling of isoform splicing dynamics from RNA-seq time series data.

    PubMed

    Huang, Yuanhua; Sanguinetti, Guido

    2016-10-01

    Isoform quantification is an important goal of RNA-seq experiments, yet it remains problematic for genes with low expression or several isoforms. These difficulties may in principle be ameliorated by exploiting correlated experimental designs, such as time series or dosage response experiments. Time series RNA-seq experiments, in particular, are becoming increasingly popular, yet there are no methods that explicitly leverage the experimental design to improve isoform quantification. Here, we present DICEseq, the first isoform quantification method tailored to correlated RNA-seq experiments. DICEseq explicitly models the correlations between different RNA-seq experiments to aid the quantification of isoforms across experiments. Numerical experiments on simulated datasets show that DICEseq yields more accurate results than state-of-the-art methods, an advantage that can become considerable at low coverage levels. On real datasets, our results show that DICEseq provides substantially more reproducible and robust quantifications, increasing the correlation of estimates from replicate datasets by up to 10% on genes with low or moderate expression levels (bottom third of all genes). Furthermore, DICEseq makes it possible to quantify the trade-off between temporal sampling of RNA and depth of sequencing, frequently an important choice when planning experiments. Our results have strong implications for the design of RNA-seq experiments, and offer a novel tool for improved analysis of such datasets. Python code is freely available at http://diceseq.sf.net.

  7. The long non-coding RNA HOTAIR is transcriptionally activated by HOXA9 and is an independent prognostic marker in patients with malignant glioma

    PubMed Central

    Xavier-Magalhães, Ana; Gonçalves, Céline S.; Fogli, Anne; Lourenço, Tatiana; Pojo, Marta; Pereira, Bruno; Rocha, Miguel; Lopes, Maria Celeste; Crespo, Inês; Rebelo, Olinda; Tão, Herminio; Lima, João; Moreira, Ricardo; Pinto, Afonso A.; Jones, Chris; Reis, Rui M.; Costello, Joseph F.; Arnaud, Philippe; Sousa, Nuno; Costa, Bruno M.

    2018-01-01

    The lncRNA HOTAIR has been implicated in several human cancers. Here, we evaluated the molecular alterations and upstream regulatory mechanisms of HOTAIR in glioma, the most common primary brain tumor, and its clinical relevance. HOTAIR gene expression, methylation, copy-number and prognostic value were investigated in human gliomas integrating data from online datasets and our cohorts. High levels of HOTAIR were associated with higher grades of glioma, particularly IDH wild-type cases. Mechanistically, HOTAIR was overexpressed in a gene dosage-independent manner, while DNA methylation levels of particular CpGs in the HOTAIR locus were associated with HOTAIR expression levels in GBM clinical specimens and cell lines. Concordantly, the demethylating agent 5-Aza-2′-deoxycytidine affected HOTAIR transcriptional levels in a cell line-dependent manner. Importantly, HOTAIR was frequently co-expressed with HOXA9 in high-grade gliomas from TCGA, Oncomine, and our Portuguese and French datasets. Integrated in silico analyses, chromatin immunoprecipitation, and qPCR data showed that HOXA9 binds directly to the promoter of HOTAIR. Clinically, GBM patients with high HOTAIR expression had a significantly reduced overall survival, independently of other prognostic variables. In summary, this work reveals HOXA9 as a novel direct regulator of HOTAIR, and establishes HOTAIR as an independent prognostic marker, providing new therapeutic opportunities to treat this highly aggressive cancer. PMID:29644006

  8. Meta-Analysis of Multiple Sclerosis Microarray Data Reveals Dysregulation in RNA Splicing Regulatory Genes.

    PubMed

    Paraboschi, Elvezia Maria; Cardamone, Giulia; Rimoldi, Valeria; Gemmati, Donato; Spreafico, Marta; Duga, Stefano; Soldà, Giulia; Asselta, Rosanna

    2015-09-30

    Abnormalities in RNA metabolism and alternative splicing (AS) are emerging as important players in complex disease phenotypes. In particular, accumulating evidence suggests the existence of pathogenic links between multiple sclerosis (MS) and altered AS, including functional studies showing that an imbalance in alternatively-spliced isoforms may contribute to disease etiology. Here, we tested whether the altered expression of AS-related genes represents an MS-specific signature. A comprehensive comparative analysis of gene expression profiles of publicly-available microarray datasets (190 MS cases, 182 controls), followed by gene-ontology enrichment analysis, highlighted a significant enrichment for differentially-expressed genes involved in RNA metabolism/AS. In detail, a total of 17 genes were found to be differentially expressed in MS in multiple datasets, with CELF1 being dysregulated in five out of seven studies. We confirmed CELF1 downregulation in MS (p=0.0015) by real-time RT-PCR on RNA extracted from blood cells of 30 cases and 30 controls. As a proof of concept, we experimentally verified the imbalance in alternatively-spliced isoforms in MS of the NFAT5 gene, a putative CELF1 target. In conclusion, for the first time we provide evidence of a consistent dysregulation of splicing-related genes in MS and we discuss its possible implications in modulating specific AS events in MS susceptibility genes.

  9. TRACING CO-REGULATORY NETWORK DYNAMICS IN NOISY, SINGLE-CELL TRANSCRIPTOME TRAJECTORIES.

    PubMed

    Cordero, Pablo; Stuart, Joshua M

    2017-01-01

    The availability of gene expression data at the single cell level makes it possible to probe the molecular underpinnings of complex biological processes such as differentiation and oncogenesis. Promising new methods have emerged for reconstructing a progression 'trajectory' from static single-cell transcriptome measurements. However, it remains unclear how to adequately model the appreciable level of noise in these data to elucidate gene regulatory network rewiring. Here, we present a framework called Single Cell Inference of MorphIng Trajectories and their Associated Regulation (SCIMITAR) that infers progressions from static single-cell transcriptomes by employing a continuous parametrization of Gaussian mixtures in high-dimensional curves. SCIMITAR yields rich models from the data that highlight genes with expression and co-expression patterns that are associated with the inferred progression. Further, SCIMITAR extracts regulatory states from the implicated trajectory-evolving co-expression networks. We benchmark the method on simulated data to show that it yields accurate cell ordering and gene network inferences. Applied to the interpretation of a single-cell human fetal neuron dataset, SCIMITAR finds progression-associated genes in cornerstone neural differentiation pathways missed by standard differential expression tests. Finally, by leveraging the rewiring of gene-gene co-expression relations across the progression, the method reveals the rise and fall of co-regulatory states and trajectory-dependent gene modules. These analyses implicate new transcription factors in neural differentiation including putative co-factors for the multi-functional NFAT pathway.

  10. An imprinted non-coding genomic cluster at 14q32 defines clinically relevant molecular subtypes in osteosarcoma across multiple independent datasets.

    PubMed

    Hill, Katherine E; Kelly, Andrew D; Kuijjer, Marieke L; Barry, William; Rattani, Ahmed; Garbutt, Cassandra C; Kissick, Haydn; Janeway, Katherine; Perez-Atayde, Antonio; Goldsmith, Jeffrey; Gebhardt, Mark C; Arredouani, Mohamed S; Cote, Greg; Hornicek, Francis; Choy, Edwin; Duan, Zhenfeng; Quackenbush, John; Haibe-Kains, Benjamin; Spentzos, Dimitrios

    2017-05-15

    A microRNA (miRNA) collection on the imprinted 14q32 MEG3 region has been associated with outcome in osteosarcoma. We assessed the clinical utility of this miRNA set and their association with methylation status. We integrated coding and non-coding RNA data from three independent annotated clinical osteosarcoma cohorts (n = 65, n = 27, and n = 25) and miRNA and methylation data from one in vitro (19 cell lines) and one clinical (NCI Therapeutically Applicable Research to Generate Effective Treatments (TARGET) osteosarcoma dataset, n = 80) dataset. We used time-dependent receiver operating characteristic (tdROC) analysis to evaluate the clinical value of candidate miRNA profiles and machine learning approaches to compare the coding and non-coding transcriptional programs of high- and low-risk osteosarcoma tumors and high- versus low-aggressiveness cell lines. In the cell line and TARGET datasets, we also studied the methylation patterns of the MEG3 imprinting control region on 14q32 and their association with miRNA expression and tumor aggressiveness. In the tdROC analysis, miRNA sets on 14q32 showed strong discriminatory power for recurrence and survival in the three clinical datasets. High- or low-risk tumor classification was robust to using different microRNA sets or classification methods. Machine learning approaches showed that genome-wide miRNA profiles and miRNA regulatory networks were quite different between the two outcome groups and mRNA profiles categorized the samples in a manner concordant with the miRNAs, suggesting potential molecular subtypes. Further, miRNA expression patterns were reproducible in comparing high-aggressiveness versus low-aggressiveness cell lines. Methylation patterns in the MEG3 differentially methylated region (DMR) also distinguished high-aggressiveness from low-aggressiveness cell lines and were associated with expression of several 14q32 miRNAs in both the cell lines and the large TARGET clinical dataset. 
    Within the limits of available CpG array coverage, we observed a potential methylation-sensitive regulation of the non-coding RNA cluster by CTCF, a known enhancer-blocking factor. Loss of imprinting/methylation changes in the 14q32 non-coding region defines reproducible previously unrecognized osteosarcoma subtypes with distinct transcriptional programs and biologic and clinical behavior. Future studies will define the precise relationship between 14q32 imprinting, non-coding RNA expression, genomic enhancer binding, and tumor aggressiveness, with possible therapeutic implications for both early- and advanced-stage patients.

  11. A large dataset of protein dynamics in the mammalian heart proteome.

    PubMed

    Lau, Edward; Cao, Quan; Ng, Dominic C M; Bleakley, Brian J; Dincer, T Umut; Bot, Brian M; Wang, Ding; Liem, David A; Lam, Maggie P Y; Ge, Junbo; Ping, Peipei

    2016-03-15

    Protein stability is a major regulatory principle of protein function and cellular homeostasis. Despite limited understanding of the mechanisms, disruption of protein turnover is widely implicated in diverse pathologies from heart failure to neurodegeneration. Information on global protein dynamics therefore has the potential to expand the depth and scope of disease phenotyping and therapeutic strategies. Using an integrated platform of metabolic labeling, high-resolution mass spectrometry and computational analysis, we report here a comprehensive dataset of the in vivo half-life of 3,228 and the expression of 8,064 cardiac proteins, quantified under healthy and hypertrophic conditions across six mouse genetic strains commonly employed in biomedical research. We anticipate these data will aid in understanding key mitochondrial and metabolic pathways in heart diseases, and further serve as a reference for methodology development in dynamics studies in multiple organ systems.

  12. Multimedia Content Development as a Facial Expression Datasets for Recognition of Human Emotions

    NASA Astrophysics Data System (ADS)

    Mamonto, N. E.; Maulana, H.; Liliana, D. Y.; Basaruddin, T.

    2018-02-01

    Previously developed datasets contain facial expressions from foreign people. The development of this multimedia content aims to address that problem, which has been experienced by the research team and by other researchers conducting similar research. The multimedia content was developed as a facial expression dataset for human emotion recognition using the Villamil-Molina version of the multimedia development method. The content was produced with 10 subjects (talents); each talent performed 3 shots and had to demonstrate 19 facial expressions. After editing and rendering, tests were carried out, with the conclusion that the multimedia content can be used as a facial expression dataset for recognition of human emotions.

  13. Utility and Limitations of Using Gene Expression Data to Identify Functional Associations

    PubMed Central

    Peng, Cheng; Shiu, Shin-Han

    2016-01-01

    Gene co-expression has been widely used to hypothesize gene function through guilt-by-association. However, it is not clear to what degree co-expression is informative, whether it can be applied to genes involved in different biological processes, and how the type of dataset impacts inferences about gene functions. Here our goal is to assess the utility and limitations of using co-expression as a criterion to recover functional associations between genes. Using Arabidopsis thaliana as an example, and by determining the percentage of gene pairs in a metabolic pathway with significant expression correlation, we found that many genes in the same pathway do not have similar transcript profiles and that the choice of dataset, annotation quality, gene function, expression similarity measure, and clustering approach significantly impacts the ability to recover functional associations between genes. Some datasets are more informative in capturing coordinated expression profiles and larger data sets are not always better. In addition, to recover the maximum number of known pathways and identify candidate genes with similar functions, it is important to explore rather exhaustively multiple dataset combinations, similarity measures, clustering algorithms and parameters. Finally, we validated the biological relevance of co-expression cluster memberships with an independent phenomics dataset and found that genes that consistently cluster with leucine degradation genes tend to have similar leucine levels in mutants. This study provides a framework for obtaining gene functional associations by maximizing the information that can be obtained from gene expression datasets. PMID:27935950

  14. A large dataset of protein dynamics in the mammalian heart proteome

    PubMed Central

    Lau, Edward; Cao, Quan; Ng, Dominic C.M.; Bleakley, Brian J.; Dincer, T. Umut; Bot, Brian M.; Wang, Ding; Liem, David A.; Lam, Maggie P.Y.; Ge, Junbo; Ping, Peipei

    2016-01-01

    Protein stability is a major regulatory principle of protein function and cellular homeostasis. Despite limited understanding of the mechanisms, disruption of protein turnover is widely implicated in diverse pathologies from heart failure to neurodegeneration. Information on global protein dynamics therefore has the potential to expand the depth and scope of disease phenotyping and therapeutic strategies. Using an integrated platform of metabolic labeling, high-resolution mass spectrometry and computational analysis, we report here a comprehensive dataset of the in vivo half-life of 3,228 and the expression of 8,064 cardiac proteins, quantified under healthy and hypertrophic conditions across six mouse genetic strains commonly employed in biomedical research. We anticipate these data will aid in understanding key mitochondrial and metabolic pathways in heart diseases, and further serve as a reference for methodology development in dynamics studies in multiple organ systems. PMID:26977904

  15. Climatological Impact of Atmospheric River Based on NARCCAP and DRI-RCM Datasets

    NASA Astrophysics Data System (ADS)

    Mejia, J. F.; Perryman, N. M.

    2012-12-01

    This study evaluates spatial responses of extreme precipitation environments, typically associated with Atmospheric River events, using Regional Climate Model (RCM) output from NARCCAP dataset (50km grid size) and the Desert Research Institute-RCM simulations (36 and 12 km grid size). For this study, a pattern-detection algorithm was developed to characterize Atmospheric Rivers (ARs)-like features from climate models. Topological analysis of the enhanced elongated moisture flux (500-300hPa; daily means) cores is used to objectively characterize such AR features in two distinct groups: (i) zonal, north Pacific ARs, and (ii) subtropical ARs, also known as "Pineapple Express" events. We computed the climatological responses of the different RCMs upon these two AR groups, from which intricate differences among RCMs stand out. This study presents these climatological responses from historical and scenario driven simulations, as well as implications for precipitation extreme-value analyses.

  16. Identification of Genes Potentially Regulated by Human Polynucleotide Phosphorylase (hPNPaseold-35) Using Melanoma as a Model

    PubMed Central

    Sokhi, Upneet K.; Bacolod, Manny D.; Dasgupta, Santanu; Emdad, Luni; Das, Swadesh K.; Dumur, Catherine I.; Miles, Michael F.; Sarkar, Devanand; Fisher, Paul B.

    2013-01-01

    Human Polynucleotide Phosphorylase (hPNPaseold-35 or PNPT1) is an evolutionarily conserved 3′→5′ exoribonuclease implicated in the regulation of numerous physiological processes including maintenance of mitochondrial homeostasis, mtRNA import and aging-associated inflammation. From an RNase perspective, little is known about the RNA or miRNA species it targets for degradation or whose expression it regulates, except for c-myc and miR-221. To further elucidate the functional implications of hPNPaseold-35 in cellular physiology, we knocked down and overexpressed hPNPaseold-35 in human melanoma cells and performed gene expression analyses to identify differentially expressed transcripts. Ingenuity Pathway Analysis indicated that knockdown of hPNPaseold-35 resulted in significant gene expression changes associated with mitochondrial dysfunction and cholesterol biosynthesis, whereas overexpression of hPNPaseold-35 caused global changes in cell-cycle related functions. Additionally, comparative gene expression analyses between our hPNPaseold-35 knockdown and overexpression datasets allowed us to identify 77 potential “direct” and 61 potential “indirect” targets of hPNPaseold-35 which formed correlated networks enriched for cell-cycle and wound healing functional association, respectively. These results provide a comprehensive database of genes responsive to hPNPaseold-35 expression levels, along with the identification of new potential candidate genes offering fresh insight into cellular pathways regulated by PNPT1 and which may be used in the future for possible therapeutic intervention in mitochondrial- or inflammation-associated disease phenotypes. PMID:24143183

  17. Extraction of Molecular Features through Exome to Transcriptome Alignment

    PubMed Central

    Mudvari, Prakriti; Kowsari, Kamran; Cole, Charles; Mazumder, Raja; Horvath, Anelia

    2014-01-01

    Integrative Next Generation Sequencing (NGS) DNA and RNA analyses have very recently become feasible, and the studies published to date have discovered critical disease-implicated pathways, and diagnostic and therapeutic targets. A growing number of exomes, genomes and transcriptomes from the same individual are quickly accumulating, providing unique venues for mechanistic and regulatory features analysis, and, at the same time, requiring new exploration strategies. In this study, we have integrated variation and expression information of four NGS datasets from the same individual: normal and tumor breast exomes and transcriptomes. Focusing on SNP-centered variant allelic prevalence, we illustrate analytical algorithms that can be applied to extract or validate potential regulatory elements, such as expression or growth advantage, imprinting, loss of heterozygosity (LOH), somatic changes, and RNA editing. In addition, we point to some critical elements that might bias the output and recommend alternative measures to maximize the confidence of findings. The need for such strategies is especially recognized within the growing appreciation of the concept of systems biology: integrative exploration of genome and transcriptome features reveals mechanistic and regulatory insights that reach far beyond linear addition of the individual datasets. PMID:24791251

  18. A comparative study of RNA-Seq and microarray data analysis on the two examples of rectal-cancer patients and Burkitt Lymphoma cells.

    PubMed

    Wolff, Alexander; Bayerlová, Michaela; Gaedcke, Jochen; Kube, Dieter; Beißbarth, Tim

    2018-01-01

    Pipeline comparisons for gene expression data are highly valuable for applied real data analyses, as they enable the selection of suitable analysis strategies for the dataset at hand. Such pipelines for RNA-Seq data should include mapping of reads, counting and differential gene expression analysis or preprocessing, normalization and differential gene expression in case of microarray analysis, in order to give a global insight into pipeline performances. Four commonly used RNA-Seq pipelines (STAR/HTSeq-Count/edgeR, STAR/RSEM/edgeR, Sailfish/edgeR, TopHat2/Cufflinks/CuffDiff) were investigated on multiple levels (alignment and counting) and cross-compared with the microarray counterpart on the level of gene expression and gene ontology enrichment. For these comparisons we generated two matched microarray and RNA-Seq datasets: Burkitt Lymphoma cell line data and rectal cancer patient data. The overall mapping rate of STAR was 98.98% for the cell line dataset and 98.49% for the patient dataset. TopHat2's overall mapping rate was 97.02% and 96.73%, respectively, while Sailfish had an overall mapping rate of only 84.81% and 54.44%. The correlation of gene expression in microarray and RNA-Seq data was moderately worse for the patient dataset (ρ = 0.67-0.69) than for the cell line dataset (ρ = 0.87-0.88). The exception was Cufflinks, whose correlation results were substantially lower (ρ = 0.21-0.29 and 0.34-0.53). For both datasets we identified very low numbers of differentially expressed genes using the microarray platform. For RNA-Seq we checked the agreement of differentially expressed genes identified in the different pipelines and of GO-term enrichment results. In conclusion, the combination of STAR aligner with HTSeq-Count, followed by STAR aligner with RSEM and Sailfish, generated differentially expressed genes best suited for the dataset at hand and in agreement with most of the other transcriptomics pipelines.
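
    The platform-agreement figures above can be illustrated with a minimal sketch. It assumes that ρ denotes Spearman's rank correlation (the abstract does not say which coefficient was used) and that the two platforms' expression values are already matched gene by gene. The function names are hypothetical and are not taken from the study's pipeline.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences (non-constant input assumed)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

    Called as `spearman_rho(microarray_values, rnaseq_values)` on two equally long lists of per-gene expression values, this returns a coefficient between -1 and 1.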

  19. Ion channel gene expression predicts survival in glioma patients

    PubMed Central

    Wang, Rong; Gurguis, Christopher I.; Gu, Wanjun; Ko, Eun A; Lim, Inja; Bang, Hyoweon; Zhou, Tong; Ko, Jae-Hong

    2015-01-01

    Ion channels are important regulators in cell proliferation, migration, and apoptosis. The malfunction and/or aberrant expression of ion channels may disrupt these important biological processes and influence cancer progression. In this study, we investigate the expression pattern of ion channel genes in glioma. We designate 18 ion channel genes that are differentially expressed in high-grade glioma as a prognostic molecular signature. This ion channel gene expression based signature predicts glioma outcome in three independent validation cohorts. Interestingly, 16 of these 18 genes were down-regulated in high-grade glioma. This signature is independent of traditional clinical, molecular, and histological factors. Resampling tests indicate that the prognostic power of the signature outperforms random gene sets selected from human genome in all the validation cohorts. More importantly, this signature performs better than the random gene signatures selected from glioma-associated genes in two out of three validation datasets. This study implicates ion channels in brain cancer, thus expanding on knowledge of their roles in other cancers. Individualized profiling of ion channel gene expression serves as a superior and independent prognostic tool for glioma patients. PMID:26235283
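
    The resampling test mentioned above can be sketched as follows. This is an illustration only: the abstract does not describe how the signature was scored, so `score_fn` is a hypothetical placeholder for any function that maps a gene set to a prognostic score (higher meaning better), and the empirical p-value formula is a common convention rather than the authors' stated method.

```python
import random


def resampling_p_value(signature, background, score_fn, n_iter=1000, seed=0):
    """Empirical p-value for a gene signature against random gene sets.

    signature  -- list of gene IDs in the signature under test
    background -- list of gene IDs to draw random sets from
    score_fn   -- hypothetical scoring function: gene list -> float
    """
    rng = random.Random(seed)
    observed = score_fn(signature)
    hits = 0
    for _ in range(n_iter):
        # Draw a random gene set of the same size as the signature.
        random_set = rng.sample(background, len(signature))
        if score_fn(random_set) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_iter + 1)
```

    A small p-value means that few random gene sets of the same size score as well as the signature.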

  20. CLIC, a tool for expanding biological pathways based on co-expression across thousands of datasets

    PubMed Central

    Li, Yang; Liu, Jun S.; Mootha, Vamsi K.

    2017-01-01

    In recent years, there has been a huge rise in the number of publicly available transcriptional profiling datasets. These massive compendia comprise billions of measurements and provide a special opportunity to predict the function of unstudied genes based on co-expression to well-studied pathways. Such analyses can be very challenging, however, since biological pathways are modular and may exhibit co-expression only in specific contexts. To overcome these challenges we introduce CLIC, CLustering by Inferred Co-expression. CLIC accepts as input a pathway consisting of two or more genes. It then uses a Bayesian partition model to simultaneously partition the input gene set into coherent co-expressed modules (CEMs), while assigning the posterior probability for each dataset in support of each CEM. CLIC then expands each CEM by scanning the transcriptome for additional co-expressed genes, quantified by an integrated log-likelihood ratio (LLR) score weighted for each dataset. As a byproduct, CLIC automatically learns the conditions (datasets) within which a CEM is operative. We implemented CLIC using a compendium of 1774 mouse microarray datasets (28628 microarrays) or 1887 human microarray datasets (45158 microarrays). CLIC analysis reveals that of 910 canonical biological pathways, 30% consist of strongly co-expressed gene modules for which new members are predicted. For example, CLIC predicts a functional connection between protein C7orf55 (FMC1) and the mitochondrial ATP synthase complex that we have experimentally validated. CLIC is freely available at www.gene-clic.org. We anticipate that CLIC will be valuable both for revealing new components of biological pathways as well as the conditions in which they are active. PMID:28719601

  1. cGRNB: a web server for building combinatorial gene regulatory networks through integrated engineering of seed-matching sequence information and gene expression datasets.

    PubMed

    Xu, Huayong; Yu, Hui; Tu, Kang; Shi, Qianqian; Wei, Chaochun; Li, Yuan-Yuan; Li, Yi-Xue

    2013-01-01

    We are witnessing rapid progress in the development of methodologies for building the combinatorial gene regulatory networks involving both TFs (Transcription Factors) and miRNAs (microRNAs). There are a few tools available to do these jobs but most of them are not easy to use and not accessible online. A web server is especially needed in order to allow users to upload experimental expression datasets and build combinatorial regulatory networks corresponding to their particular contexts. In this work, we compiled putative TF-gene, miRNA-gene and TF-miRNA regulatory relationships from forward-engineering pipelines and curated them as built-in data libraries. We streamlined the R codes of our two separate forward-and-reverse engineering algorithms for combinatorial gene regulatory network construction and formalized them as two major functional modules. As a result, we released the cGRNB (combinatorial Gene Regulatory Networks Builder): a web server for constructing combinatorial gene regulatory networks through integrated engineering of seed-matching sequence information and gene expression datasets. The cGRNB enables two major network-building modules, one for MPGE (miRNA-perturbed gene expression) datasets and the other for parallel miRNA/mRNA expression datasets. A miRNA-centered two-layer combinatorial regulatory cascade is the output of the first module and a comprehensive genome-wide network involving all three types of combinatorial regulations (TF-gene, TF-miRNA, and miRNA-gene) is the output of the second module. In this article we propose cGRNB, a web server for building combinatorial gene regulatory networks through integrated engineering of seed-matching sequence information and gene expression datasets. Since parallel miRNA/mRNA expression datasets are rapidly accumulating with the advance of next-generation sequencing techniques, cGRNB will be a very useful tool for researchers to build combinatorial gene regulatory networks based on expression datasets. The cGRNB web server is free and available online at http://www.scbit.org/cgrnb.

  2. Meta-analysis of expression of l(3)mbt tumor-associated germline genes supports the model that a soma-to-germline transition is a hallmark of human cancers.

    PubMed

    Feichtinger, Julia; Larcombe, Lee; McFarlane, Ramsay J

    2014-05-15

    Evidence is starting to emerge indicating that tumorigenesis in metazoans involves a soma-to-germline transition, which may contribute to the acquisition of neoplastic characteristics. Here, we have meta-analyzed gene expression profiles of the human orthologs of Drosophila melanogaster germline genes that are ectopically expressed in l(3)mbt brain tumors using gene expression datasets derived from a large cohort of human tumors. We find these germline genes, some of which drive oncogenesis in D. melanogaster, are similarly ectopically activated in a wide range of human cancers. Some of these genes normally have expression restricted to the germline, making them of particular clinical interest. Importantly, these analyses provide additional support to the emerging model that proposes a soma-to-germline transition is a general hallmark of a wide range of human tumors. This has implications for our understanding of human oncogenesis and the development of new therapeutic and biomarker targets with clinical potential. © 2013 The Authors. Published by Wiley Periodicals, Inc. on behalf of UICC.

  3. Comparison of gene expression microarray data with count-based RNA measurements informs microarray interpretation.

    PubMed

    Richard, Arianne C; Lyons, Paul A; Peters, James E; Biasci, Daniele; Flint, Shaun M; Lee, James C; McKinney, Eoin F; Siegel, Richard M; Smith, Kenneth G C

    2014-08-04

    Although numerous investigations have compared gene expression microarray platforms, preprocessing methods and batch correction algorithms using constructed spike-in or dilution datasets, there remains a paucity of studies examining the properties of microarray data using diverse biological samples. Most microarray experiments seek to identify subtle differences between samples with variable background noise, a scenario poorly represented by constructed datasets. Thus, microarray users lack important information regarding the complexities introduced in real-world experimental settings. The recent development of a multiplexed, digital technology for nucleic acid measurement enables counting of individual RNA molecules without amplification and, for the first time, permits such a study. Using a set of human leukocyte subset RNA samples, we compared previously acquired microarray expression values with RNA molecule counts determined by the nCounter Analysis System (NanoString Technologies) in selected genes. We found that gene measurements across samples correlated well between the two platforms, particularly for high-variance genes, while genes deemed unexpressed by the nCounter generally had both low expression and low variance on the microarray. Confirming previous findings from spike-in and dilution datasets, this "gold-standard" comparison demonstrated signal compression that varied dramatically by expression level and, to a lesser extent, by dataset. Most importantly, examination of three different cell types revealed that noise levels differed across tissues. Microarray measurements generally correlate with relative RNA molecule counts within optimal ranges but suffer from expression-dependent accuracy bias and precision that varies across datasets. We urge microarray users to consider expression-level effects in signal interpretation and to evaluate noise properties in each dataset independently.

  4. Data Mining of Gene Arrays for Biomarkers of Survival in Ovarian Cancer

    PubMed Central

    Coveney, Clare; Boocock, David J.; Rees, Robert C.; Deen, Suha; Ball, Graham R.

    2015-01-01

    The expected five-year survival rate from a stage III ovarian cancer diagnosis is a mere 22%; this applies to the 7000 new cases diagnosed yearly in the UK. Stratification of patients with this heterogeneous disease, based on active molecular pathways, would aid a targeted treatment improving the prognosis for many cases. While hundreds of genes have been associated with ovarian cancer, few have yet been verified by peer research for clinical significance. Here, a meta-analysis approach was applied to two carefully selected gene expression microarray datasets. Artificial neural networks, Cox univariate survival analyses and T-tests identified genes whose expression was consistently and significantly associated with patient survival. The rigor of this experimental design increases confidence in the genes found to be of interest. A list of 56 genes was distilled from a potential 37,000 to be significantly related to survival in both datasets with an FDR of 1.39859 × 10⁻¹¹, the identities of which both verify genes already implicated in this disease and provide novel genes and pathways to pursue. Further investigation and validation of these may lead to clinical insights and have potential to predict a patient’s response to treatment or be used as a novel target for therapy. PMID:27600227
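
    For readers unfamiliar with false discovery rate (FDR) control, the sketch below shows the Benjamini-Hochberg adjustment. This is an assumption made for illustration: the abstract does not state which FDR procedure the authors used, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

    Genes whose adjusted p-value falls below a chosen threshold in every dataset would then form the candidate list.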

  5. Transcriptome-derived stromal and immune scores infer clinical outcomes of patients with cancer.

    PubMed

    Liu, Wei; Ye, Hua; Liu, Ying-Fu; Xu, Chao-Qun; Zhong, Yue-Xian; Tian, Tian; Ma, Shi-Wei; Tao, Huan; Li, Ling; Xue, Li-Chun; He, Hua-Qin

    2018-04-01

    The stromal and immune cells that form the tumor microenvironment serve a key role in the aggressiveness of tumors. Current tumor-centric interpretations of cancer transcriptome data ignore the roles of stromal and immune cells. The aim of the present study was to investigate the clinical utility of stromal and immune cells in tissue-based transcriptome data. The 'Estimation of STromal and Immune cells in MAlignant Tumor tissues using Expression data' (ESTIMATE) algorithm was used to probe diverse cancer datasets and the fraction of stromal and immune cells in tumor tissues was scored. The association between the ESTIMATE scores and patient survival data was assessed; it was indicated that the two scores have implications for patient survival, metastasis and recurrence. Analysis of a colorectal cancer progression dataset revealed that decreased levels of immune cells could serve an important role in cancer progression. The results of the present study indicated that transcriptome-derived stromal and immune scores may be a useful indicator of cancer prognosis.

  6. Clusternomics: Integrative context-dependent clustering for heterogeneous datasets

    PubMed Central

    Wernisch, Lorenz

    2017-01-01

    Integrative clustering is used to identify groups of samples by jointly analysing multiple datasets describing the same set of biological samples, such as gene expression, copy number, methylation etc. Most existing algorithms for integrative clustering assume that there is a shared consistent set of clusters across all datasets, and most of the data samples follow this structure. However in practice, the structure across heterogeneous datasets can be more varied, with clusters being joined in some datasets and separated in others. In this paper, we present a probabilistic clustering method to identify groups across datasets that do not share the same cluster structure. The proposed algorithm, Clusternomics, identifies groups of samples that share their global behaviour across heterogeneous datasets. The algorithm models clusters on the level of individual datasets, while also extracting global structure that arises from the local cluster assignments. Clusters on both the local and the global level are modelled using a hierarchical Dirichlet mixture model to identify structure on both levels. We evaluated the model both on simulated and on real-world datasets. The simulated data exemplifies datasets with varying degrees of common structure. In such a setting Clusternomics outperforms existing algorithms for integrative and consensus clustering. In a real-world application, we used the algorithm for cancer subtyping, identifying subtypes of cancer from heterogeneous datasets. We applied the algorithm to TCGA breast cancer dataset, integrating gene expression, miRNA expression, DNA methylation and proteomics. The algorithm extracted clinically meaningful clusters with significantly different survival probabilities. We also evaluated the algorithm on lung and kidney cancer TCGA datasets with high dimensionality, again showing clinically significant results and scalability of the algorithm. PMID:29036190

  7. Clusternomics: Integrative context-dependent clustering for heterogeneous datasets.

    PubMed

    Gabasova, Evelina; Reid, John; Wernisch, Lorenz

    2017-10-01

    Integrative clustering is used to identify groups of samples by jointly analysing multiple datasets describing the same set of biological samples, such as gene expression, copy number, methylation etc. Most existing algorithms for integrative clustering assume that there is a shared consistent set of clusters across all datasets, and most of the data samples follow this structure. However in practice, the structure across heterogeneous datasets can be more varied, with clusters being joined in some datasets and separated in others. In this paper, we present a probabilistic clustering method to identify groups across datasets that do not share the same cluster structure. The proposed algorithm, Clusternomics, identifies groups of samples that share their global behaviour across heterogeneous datasets. The algorithm models clusters on the level of individual datasets, while also extracting global structure that arises from the local cluster assignments. Clusters on both the local and the global level are modelled using a hierarchical Dirichlet mixture model to identify structure on both levels. We evaluated the model both on simulated and on real-world datasets. The simulated data exemplifies datasets with varying degrees of common structure. In such a setting Clusternomics outperforms existing algorithms for integrative and consensus clustering. In a real-world application, we used the algorithm for cancer subtyping, identifying subtypes of cancer from heterogeneous datasets. We applied the algorithm to TCGA breast cancer dataset, integrating gene expression, miRNA expression, DNA methylation and proteomics. The algorithm extracted clinically meaningful clusters with significantly different survival probabilities. We also evaluated the algorithm on lung and kidney cancer TCGA datasets with high dimensionality, again showing clinically significant results and scalability of the algorithm.

  8. African Americans with pancreatic ductal adenocarcinoma exhibit gender differences in Kaiso expression

    PubMed Central

    Mukherjee, Angana; Jones, Jacqueline; Karanam, Balasubramanyam; Davis, Melissa; Jaynes, Jesse; Reams, R. Renee; Dean-Colomb, Windy; Yates, Clayton

    2016-01-01

    Kaiso, a bi-modal transcription factor, regulates gene expression, and is elevated in breast, prostate, and colon cancers. Depletion of Kaiso in other cancer types leads to a reduction in markers for the epithelial–mesenchymal transition (EMT) (Jones et al., 2014), however its clinical implications in pancreatic ductal adenocarcinoma (PDCA) have not been widely explored. PDCA is rarely detected at an early stage but is characterized by rapid progression and invasiveness. We now report the significance of the subcellular localization of Kaiso in PDCAs from African Americans. Kaiso expression is higher in the cytoplasm of invasive and metastatic pancreatic cancers. In males, cytoplasmic expression of Kaiso correlates with cancer grade and lymph node positivity. In male and female patients, cytoplasmic Kaiso expression correlates with invasiveness. Also, nuclear expression of Kaiso increases with increased invasiveness and lymph node positivity. Further, analysis of the largest PDCA dataset available on ONCOMINE shows that as Kaiso increases, there is an overall increase in Zeb1, which is the inverse for E-cadherin. Hence, these findings suggest a role for Kaiso in the progression of PDCAs, involving the EMT markers, E-cadherin and Zeb1. PMID:27424525

  9. A curated transcriptomic dataset collection relevant to embryonic development associated with in vitro fertilization in healthy individuals and patients with polycystic ovary syndrome.

    PubMed

    Mackeh, Rafah; Boughorbel, Sabri; Chaussabel, Damien; Kino, Tomoshige

    2017-01-01

    The collection of large-scale datasets available in public repositories is rapidly growing and providing opportunities to identify and fill gaps in different fields of biomedical research. However, users of these datasets should be able to selectively browse datasets related to their field of interest. Here we make available a collection of transcriptome datasets related to human follicular cells from normal individuals or patients with polycystic ovary syndrome, in the process of their development, during in vitro fertilization. After RNA-seq dataset exclusion and careful selection based on study description and sample information, 12 datasets, encompassing a total of 85 unique transcriptome profiles, were identified in NCBI Gene Expression Omnibus and uploaded to the Gene Expression Browser (GXB), a web application specifically designed for interactive query and visualization of integrated large-scale data. Once the datasets were annotated in GXB, multiple sample groupings were made in order to create rank lists to allow easy data interpretation and comparison. The GXB tool also allows users to browse a single gene across multiple projects to evaluate its expression profiles in multiple biological systems/conditions in web-based, customized graphical views. The curated dataset is accessible at the following link: http://ivf.gxbsidra.org/dm3/landing.gsp.

  10. A curated transcriptomic dataset collection relevant to embryonic development associated with in vitro fertilization in healthy individuals and patients with polycystic ovary syndrome

    PubMed Central

    Mackeh, Rafah; Boughorbel, Sabri; Chaussabel, Damien; Kino, Tomoshige

    2017-01-01

    The collection of large-scale datasets available in public repositories is rapidly growing and providing opportunities to identify and fill gaps in different fields of biomedical research. However, users of these datasets should be able to selectively browse datasets related to their field of interest. Here we make available a collection of transcriptome datasets related to human follicular cells from normal individuals or patients with polycystic ovary syndrome, in the process of their development, during in vitro fertilization. After RNA-seq dataset exclusion and careful selection based on study description and sample information, 12 datasets, encompassing a total of 85 unique transcriptome profiles, were identified in NCBI Gene Expression Omnibus and uploaded to the Gene Expression Browser (GXB), a web application specifically designed for interactive query and visualization of integrated large-scale data. Once the datasets were annotated in GXB, multiple sample groupings were made in order to create rank lists to allow easy data interpretation and comparison. The GXB tool also allows users to browse a single gene across multiple projects to evaluate its expression profiles in multiple biological systems/conditions in web-based, customized graphical views. The curated dataset is accessible at the following link: http://ivf.gxbsidra.org/dm3/landing.gsp. PMID:28413616

  11. Predicting ionizing radiation exposure using biochemically-inspired genomic machine learning.

    PubMed

    Zhao, Jonathan Z L; Mucaki, Eliseos J; Rogan, Peter K

    2018-01-01

    Background: Gene signatures derived from transcriptomic data using machine learning methods have shown promise for biodosimetry testing. These signatures may not be sufficiently robust for large scale testing, as their performance has not been adequately validated on external, independent datasets. The present study develops human and murine signatures with biochemically-inspired machine learning that are strictly validated using k-fold and traditional approaches. Methods: Gene Expression Omnibus (GEO) datasets of exposed human and murine lymphocytes were preprocessed via nearest neighbor imputation and expression of genes implicated in the literature to be responsive to radiation exposure (n=998) were then ranked by Minimum Redundancy Maximum Relevance (mRMR). Optimal signatures were derived by backward, complete, and forward sequential feature selection using Support Vector Machines (SVM), and validated using k-fold or traditional validation on independent datasets. Results: The best human signatures we derived exhibit k-fold validation accuracies of up to 98% (DDB2, PRKDC, TPP2, PTPRE, and GADD45A) when validated over 209 samples and traditional validation accuracies of up to 92% (DDB2, CD8A, TALDO1, PCNA, EIF4G2, LCN2, CDKN1A, PRKCH, ENO1, and PPM1D) when validated over 85 samples. Some human signatures are specific enough to differentiate between chemotherapy and radiotherapy. Certain multi-class murine signatures have sufficient granularity in dose estimation to inform eligibility for cytokine therapy (assuming these signatures could be translated to humans). We compiled a list of the most frequently appearing genes in the top 20 human and mouse signatures. More frequently appearing genes among an ensemble of signatures may indicate greater impact of these genes on the performance of individual signatures. Several genes in the signatures we derived are present in previously proposed signatures. Conclusions: Gene signatures for ionizing radiation exposure derived by machine learning have low error rates in externally validated, independent datasets, and exhibit high specificity and granularity for dose estimation.
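
    The k-fold validation referred to above can be sketched as a simple index splitter. This is a generic illustration, not the authors' code: the value of k, any shuffling or stratification, and the function name are assumptions, and the classifier itself (an SVM in the study) is left out.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop
```

    Each sample appears in exactly one test fold; the reported accuracy would be the average over the k held-out folds.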

  12. SurvExpress: an online biomarker validation tool and database for cancer gene expression data using survival analysis.

    PubMed

    Aguirre-Gamboa, Raul; Gomez-Rueda, Hugo; Martínez-Ledesma, Emmanuel; Martínez-Torteya, Antonio; Chacolla-Huaringa, Rafael; Rodriguez-Barrientos, Alberto; Tamez-Peña, José G; Treviño, Victor

    2013-01-01

    Validation of multi-gene biomarkers for clinical outcomes is one of the most important issues for cancer prognosis. An important source of information for virtual validation is the high number of available cancer datasets. Nevertheless, assessing the prognostic performance of a gene expression signature across datasets is a difficult task for Biologists and Physicians and also time-consuming for Statisticians and Bioinformaticians. Therefore, to facilitate performance comparisons and validations of survival biomarkers for cancer outcomes, we developed SurvExpress, a cancer-wide gene expression database with clinical outcomes and a web-based tool that provides survival analysis and risk assessment of cancer datasets. The main input of SurvExpress is only the biomarker gene list. We generated a cancer database collecting more than 20,000 samples and 130 datasets with censored clinical information covering tumors over 20 tissues. We implemented a web interface to perform biomarker validation and comparisons in this database, where a multivariate survival analysis can be accomplished in about one minute. We show the utility and simplicity of SurvExpress in two biomarker applications for breast and lung cancer. Compared to other tools, SurvExpress is the largest, most versatile, and quickest free tool available. SurvExpress web can be accessed at http://bioinformatica.mty.itesm.mx/SurvExpress (a tutorial is included). The website was implemented in JSP, JavaScript, MySQL, and R.

  13. SurvExpress: An Online Biomarker Validation Tool and Database for Cancer Gene Expression Data Using Survival Analysis

    PubMed Central

    Aguirre-Gamboa, Raul; Gomez-Rueda, Hugo; Martínez-Ledesma, Emmanuel; Martínez-Torteya, Antonio; Chacolla-Huaringa, Rafael; Rodriguez-Barrientos, Alberto; Tamez-Peña, José G.; Treviño, Victor

    2013-01-01

    Validation of multi-gene biomarkers for clinical outcomes is one of the most important issues for cancer prognosis. An important source of information for virtual validation is the high number of available cancer datasets. Nevertheless, assessing the prognostic performance of a gene expression signature across datasets is a difficult task for Biologists and Physicians and also time-consuming for Statisticians and Bioinformaticians. Therefore, to facilitate performance comparisons and validations of survival biomarkers for cancer outcomes, we developed SurvExpress, a cancer-wide gene expression database with clinical outcomes and a web-based tool that provides survival analysis and risk assessment of cancer datasets. The main input of SurvExpress is only the biomarker gene list. We generated a cancer database collecting more than 20,000 samples and 130 datasets with censored clinical information covering tumors over 20 tissues. We implemented a web interface to perform biomarker validation and comparisons in this database, where a multivariate survival analysis can be accomplished in about one minute. We show the utility and simplicity of SurvExpress in two biomarker applications for breast and lung cancer. Compared to other tools, SurvExpress is the largest, most versatile, and quickest free tool available. SurvExpress web can be accessed at http://bioinformatica.mty.itesm.mx/SurvExpress (a tutorial is included). The website was implemented in JSP, JavaScript, MySQL, and R. PMID:24066126

  14. A Self-Directed Method for Cell-Type Identification and Separation of Gene Expression Microarrays

    PubMed Central

    Zuckerman, Neta S.; Noam, Yair; Goldsmith, Andrea J.; Lee, Peter P.

    2013-01-01

    Gene expression analysis is generally performed on heterogeneous tissue samples consisting of multiple cell types. Current methods developed to separate heterogeneous gene expression rely on prior knowledge of the cell-type composition and/or signatures, which are not available in most public datasets. We present a novel method to identify the cell-type composition, signatures and proportions per sample without need for a priori information. The method was successfully tested on controlled and semi-controlled datasets and performed as accurately as current methods that do require additional information. As such, this method enables the analysis of cell-type specific gene expression using existing large pools of publicly available microarray datasets. PMID:23990767

  15. Hierarchical Recognition Scheme for Human Facial Expression Recognition Systems

    PubMed Central

    Siddiqi, Muhammad Hameed; Lee, Sungyoung; Lee, Young-Koo; Khan, Adil Mehmood; Truc, Phan Tran Ho

    2013-01-01

    Over the last decade, human facial expressions recognition (FER) has emerged as an important research area. Several factors make FER a challenging research problem. These include varying light conditions in training and test images; need for automatic and accurate face detection before feature extraction; and high similarity among different expressions that makes it difficult to distinguish these expressions with a high accuracy. This work implements a hierarchical linear discriminant analysis-based facial expressions recognition (HL-FER) system to tackle these problems. Unlike the previous systems, the HL-FER uses a pre-processing step to eliminate light effects, incorporates a new automatic face detection scheme, employs methods to extract both global and local features, and utilizes a HL-FER to overcome the problem of high similarity among different expressions. Unlike most of the previous works that were evaluated using a single dataset, the performance of the HL-FER is assessed using three publicly available datasets under three different experimental settings: n-fold cross validation based on subjects for each dataset separately; n-fold cross validation rule based on datasets; and, finally, a last set of experiments to assess the effectiveness of each module of the HL-FER separately. Weighted average recognition accuracy of 98.7% across three different datasets, using three classifiers, indicates the success of employing the HL-FER for human FER. PMID:24316568

  16. Exploring homogeneity of correlation structures of gene expression datasets within and between etiological disease categories.

    PubMed

    Jong, Victor L; Novianti, Putri W; Roes, Kit C B; Eijkemans, Marinus J C

    2014-12-01

    The literature shows that classifiers perform differently across datasets and that correlations within datasets affect the performance of classifiers. The question that arises is whether the correlation structure within datasets differs significantly across diseases. In this study, we evaluated the homogeneity of correlation structures within and between datasets of six etiological disease categories: inflammatory, immune, infectious, degenerative, hereditary and acute myeloid leukemia (AML). We also assessed the effect of filtering (detection call and variance filtering) on correlation structures. We downloaded microarray datasets from ArrayExpress for experiments meeting predefined criteria and ended up with 12 datasets for non-cancerous diseases and six for AML. The datasets were preprocessed by a common procedure incorporating platform-specific recommendations and the two filtering methods mentioned above. Homogeneity of correlation matrices between and within datasets of etiological diseases was assessed using Box's M statistic on permuted samples. We found that correlation structures differ significantly between datasets of the same and/or different etiological disease categories and that variance filtering eliminates more uncorrelated probesets than detection call filtering and thus renders the data highly correlated.

  17. Unsupervised clustering of gene expression data points at hypoxia as possible trigger for metabolic syndrome.

    PubMed

    Ptitsyn, Andrey; Hulver, Matthew; Cefalu, William; York, David; Smith, Steven R

    2006-12-19

    Classification of large volumes of data produced in a microarray experiment allows for the extraction of important clues as to the nature of a disease. Using multi-dimensional unsupervised FOREL (FORmal ELement) algorithm we have re-analyzed three public datasets of skeletal muscle gene expression in connection with insulin resistance and type 2 diabetes (DM2). Our analysis revealed the major line of variation between expression profiles of normal, insulin resistant, and diabetic skeletal muscle. A cluster of most "metabolically sound" samples occupied one end of this line. The distance along this line coincided with the classic markers of diabetes risk, namely obesity and insulin resistance, but did not follow the accepted clinical diagnosis of DM2 as defined by the presence or absence of hyperglycemia. Genes implicated in this expression pattern are those controlling skeletal muscle fiber type and glycolytic metabolism. Additionally myoglobin and hemoglobin were upregulated and ribosomal genes deregulated in insulin resistant patients. Our findings are concordant with the changes seen in skeletal muscle with altitude hypoxia. This suggests that hypoxia and shift to glycolytic metabolism may also drive insulin resistance.

  18. A gene expression signature of retinoblastoma loss-of-function is a predictive biomarker of resistance to palbociclib in breast cancer cell lines and is prognostic in patients with ER positive early breast cancer.

    PubMed

    Malorni, Luca; Piazza, Silvano; Ciani, Yari; Guarducci, Cristina; Bonechi, Martina; Biagioni, Chiara; Hart, Christopher D; Verardo, Roberto; Di Leo, Angelo; Migliaccio, Ilenia

    2016-09-13

    Palbociclib is a CDK4/6 inhibitor that received FDA approval for treatment of hormone receptor positive (HR+) HER2 negative (HER2neg) advanced breast cancer. To better personalize patients' treatment it is critical to identify subgroups that would benefit most from it. We hypothesize that complex alterations of the Retinoblastoma (Rb) pathway might be implicated in resistance to CDK4/6 inhibitors and aim to investigate whether signatures of Rb loss-of-function would identify breast cancer cell lines resistant to palbociclib. We established a gene expression signature of Rb loss-of-function (RBsig) by identifying genes correlated with E2F1 and E2F2 expression in breast cancers within The Cancer Genome Atlas. We assessed the RBsig prognostic role in the METABRIC dataset and in a comprehensive breast cancer meta-dataset. Finally, we analyzed whether RBsig would discriminate palbociclib-sensitive and -resistant breast cancer cells in a large RNA sequencing-based dataset. The RBsig was associated with RB1 genetic status in all tumors (p < 7e-32) and in luminal or basal subtypes (p < 7e-11 and p < 0.002, respectively). The RBsig was prognostic in the METABRIC dataset (discovery: HR = 1.93 [1.5-2.4] p = 1.4e-08; validation: HR = 2.01 [1.6-2.5] p = 1.3e-09). Untreated and endocrine treated patients with estrogen receptor positive breast cancer expressing high RBsig had significantly worse recurrence free survival compared to those with low RBsig (HR = 2.37 [1.8-3.2] p = 1.87e-08 and HR = 2.62 [1.9-3.5] p = 8.6e-11, respectively). The RBsig was able to identify palbociclib resistant and sensitive breast cancer cells (ROC AUC = 0.7778). Signatures of RB loss might be helpful in personalizing treatment of patients with HR+/HER2neg breast cancer. Further validation in patients receiving palbociclib is warranted.
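
    The ROC AUC quoted above can be read as the probability that a randomly chosen resistant cell line has a higher signature score than a randomly chosen sensitive one. The sketch below shows that pairwise computation. Scoring a sample as the mean expression of the signature genes is an assumption made for illustration, not necessarily how RBsig is scored, and all names are hypothetical.

```python
def signature_score(expression, signature_genes):
    """Mean expression over the signature genes present in one sample.

    expression -- dict mapping gene symbol to expression value
    (Averaging is an assumed scoring rule for this sketch.)
    """
    values = [expression[g] for g in signature_genes if g in expression]
    return sum(values) / len(values)


def roc_auc(resistant_scores, sensitive_scores):
    """AUC as the fraction of (resistant, sensitive) pairs ranked correctly,
    counting ties as half a correct pair."""
    wins = 0.0
    for r in resistant_scores:
        for s in sensitive_scores:
            if r > s:
                wins += 1.0
            elif r == s:
                wins += 0.5
    return wins / (len(resistant_scores) * len(sensitive_scores))
```

    An AUC of 0.5 corresponds to no discrimination and 1.0 to perfect separation of the two groups.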

  19. Pathway-based outlier method reveals heterogeneous genomic structure of autism in blood transcriptome

    PubMed Central

    2013-01-01

    Background Decades of research strongly suggest that the genetic etiology of autism spectrum disorders (ASDs) is heterogeneous. However, most published studies focus on group differences between cases and controls. In contrast, we hypothesized that the heterogeneity of the disorder could be characterized by identifying pathways for which individuals are outliers rather than pathways representative of shared group differences of the ASD diagnosis. Methods Two previously published blood gene expression data sets – the Translational Genetics Research Institute (TGen) dataset (70 cases and 60 unrelated controls) and the Simons Simplex Consortium (Simons) dataset (221 probands and 191 unaffected family members) – were analyzed. All individuals of each dataset were projected to biological pathways, and each sample’s Mahalanobis distance from a pooled centroid was calculated to compare the number of case and control outliers for each pathway. Results Analysis of a set of blood gene expression profiles from 70 ASD and 60 unrelated controls revealed three pathways whose outliers were significantly overrepresented in the ASD cases: neuron development including axonogenesis and neurite development (29% of ASD, 3% of control), nitric oxide signaling (29%, 3%), and skeletal development (27%, 3%). Overall, 50% of cases and 8% of controls were outliers in one of these three pathways, which could not be identified using group comparison or gene-level outlier methods. In an independently collected data set consisting of 221 ASD and 191 unaffected family members, outliers in the neurogenesis pathway were heavily biased towards cases (20.8% of ASD, 12.0% of control). Interestingly, neurogenesis outliers were more common among unaffected family members (Simons) than unrelated controls (TGen), but the statistical significance of this effect was marginal (Chi squared P < 0.09). 
Conclusions Unlike group difference approaches, our analysis identified the samples within the case and control groups that manifested each expression signal, and showed that outlier groups were distinct for each implicated pathway. Moreover, our results suggest that by seeking heterogeneity, pathway-based outlier analysis can reveal expression signals that are not apparent when considering only shared group differences. PMID:24063311
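
    The outlier step described in the Methods (each sample's Mahalanobis distance from a pooled centroid) can be sketched as follows. To stay dependency-free, the example handles a hypothetical two-gene pathway so the covariance matrix can be inverted in closed form. The threshold is an arbitrary illustrative value, not the cutoff used in the study.

```python
import math


def mahalanobis_2d(samples):
    """Mahalanobis distance of each (x, y) sample from the pooled centroid,
    using the sample covariance of all samples (cases and controls pooled)."""
    n = len(samples)
    mx = sum(x for x, _ in samples) / n
    my = sum(y for _, y in samples) / n
    sxx = sum((x - mx) ** 2 for x, _ in samples) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in samples) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in samples) / (n - 1)
    det = sxx * syy - sxy ** 2  # determinant of the 2x2 covariance matrix
    distances = []
    for x, y in samples:
        dx, dy = x - mx, y - my
        # d^T S^{-1} d written out with the closed-form 2x2 inverse.
        d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(d2))
    return distances


def outlier_indices(samples, threshold):
    """Indices of samples whose distance exceeds a chosen threshold
    (the threshold is a hypothetical parameter for this sketch)."""
    return [i for i, d in enumerate(mahalanobis_2d(samples)) if d > threshold]
```

    A sample that breaks the correlation pattern between the two genes gets a large distance even when each gene's value alone looks unremarkable, which is why such outliers can be missed by gene-level methods.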

  20. A group LASSO-based method for robustly inferring gene regulatory networks from multiple time-course datasets.

    PubMed

    Liu, Li-Zhi; Wu, Fang-Xiang; Zhang, Wen-Jun

    2014-01-01

    As an abstract mapping of the gene regulations in the cell, a gene regulatory network is important to both biological research and practical applications. The reverse engineering of gene regulatory networks from microarray gene expression data is a challenging research problem in systems biology. With the development of biological technologies, multiple time-course gene expression datasets might be collected for a specific gene network under different circumstances. The inference of a gene regulatory network can be improved by integrating these multiple datasets. It is also known that gene expression data may be contaminated with large errors or outliers, which may affect the inference results. A novel method, Huber group LASSO, is proposed to infer the same underlying network topology from multiple time-course gene expression datasets while remaining robust to large errors or outliers. To solve the optimization problem involved in the proposed method, an efficient algorithm which combines the ideas of auxiliary function minimization and block descent is developed. A stability selection method is adapted to our method to find a network topology consisting of edges with scores. The proposed method is applied to both simulation datasets and real experimental datasets. The results show that Huber group LASSO outperforms the group LASSO in terms of both areas under receiver operating characteristic curves and areas under the precision-recall curves. The convergence analysis of the algorithm theoretically shows that the sequence generated from the algorithm converges to the optimal solution of the problem. The simulation and real data examples demonstrate the effectiveness of the Huber group LASSO in integrating multiple time-course gene expression datasets and improving the resistance to large errors or outliers.
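
    The two ingredients named in "Huber group LASSO" can be illustrated with their standard textbook definitions: the Huber loss, which is quadratic for small residuals and linear for large ones, and a group penalty that sums the Euclidean norms of coefficient groups. The sketch below is not the authors' algorithm. It assumes, for illustration only, that one group holds the coefficients of a single candidate edge across all datasets, and the parameter names are hypothetical.

```python
import math


def huber_loss(residual, delta=1.0):
    """Standard Huber loss: quadratic within delta, linear beyond it,
    so large errors contribute less than under squared error."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * residual ** 2
    return delta * (a - 0.5 * delta)


def group_lasso_penalty(groups, lam=1.0):
    """lam times the sum of Euclidean norms of the coefficient groups.

    groups -- list of coefficient lists; here each group is assumed to
              hold one candidate edge's coefficients across datasets.
    """
    return lam * sum(math.sqrt(sum(w * w for w in g)) for g in groups)
```

    Because the penalty acts on whole groups, an edge tends to be kept or dropped in all datasets together, which matches the goal of inferring a single shared topology.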

  1. GSNFS: Gene subnetwork biomarker identification of lung cancer expression data.

    PubMed

    Doungpan, Narumol; Engchuan, Worrawat; Chan, Jonathan H; Meechai, Asawin

    2016-12-05

    Gene expression has been used to identify disease gene biomarkers, but there are ongoing challenges. Single gene or gene-set biomarkers are inadequate to provide sufficient understanding of complex disease mechanisms and the relationship among those genes. Network-based methods have thus been considered for inferring the interaction within a group of genes to further study the disease mechanism. Recently, the Gene-Network-based Feature Set (GNFS), which is capable of handling case-control and multiclass expression for gene biomarker identification, has been proposed, partly taking network topology into account. However, its performance relies on a greedy search for building subnetworks and thus requires further improvement. In this work, we establish a new approach named Gene Sub-Network-based Feature Selection (GSNFS) by implementing the GNFS framework with two proposed searching and scoring algorithms, namely gene-set-based (GS) search and parent-node-based (PN) search, to identify subnetworks. An additional dataset is used to validate the results. The two proposed searching algorithms of the GSNFS method for subnetwork expansion are concerned with the degree of connectivity and the scoring scheme for building subnetworks and their topology. For each iteration of expansion, the neighbour genes of a current subnetwork whose expression data improve the overall subnetwork score are recruited. While the GS search calculates the subnetwork score using an activity score of a current subnetwork and the gene expression values of its neighbours, the PN search uses the expression value of the corresponding parent of each neighbour gene. Four lung cancer expression datasets were used for subnetwork identification. In addition, the use of pathway data and protein-protein interaction data as network data to consider the interactions among significant genes is discussed. 
Classification was performed to compare the performance of the identified gene subnetworks with three subnetwork identification algorithms. The two searching algorithms resulted in better classification and gene/gene-set agreement compared to the original greedy search of the GNFS method. The identified lung cancer subnetwork using the proposed searching algorithm resulted in an improvement of the cross-dataset validation and an increase in the consistency of findings between two independent datasets. The homogeneity measurement of the datasets was conducted to assess dataset compatibility in cross-dataset validation. The lung cancer dataset with higher homogeneity showed a better result when using the GS search while the dataset with low homogeneity showed a better result when using the PN search. The 10-fold cross-dataset validation on the independent lung cancer datasets showed higher classification performance of the proposed algorithms when compared with the greedy search in the original GNFS method. The proposed searching algorithms provide a higher number of genes in the subnetwork expansion step than the greedy algorithm. As a result, the performance of the subnetworks identified from the GSNFS method was improved in terms of classification performance and gene/gene-set level agreement depending on the homogeneity of the datasets used in the analysis. Some common genes obtained from the four datasets using different searching algorithms are genes known to play a role in lung cancer. The improvement of classification performance and the gene/gene-set level agreement, and the biological relevance indicated the effectiveness of the GSNFS method for gene subnetwork identification using expression data.

  2. Probe-level linear model fitting and mixture modeling results in high accuracy detection of differential gene expression.

    PubMed

    Lemieux, Sébastien

    2006-08-25

    The identification of differentially expressed genes (DEGs) from Affymetrix GeneChip arrays is currently done by first computing expression levels from the low-level probe intensities, then deriving significance by comparing these expression levels between conditions. The proposed PL-LM (Probe-Level Linear Model) method implements a linear model applied on the probe-level data to directly estimate the treatment effect. A finite mixture of Gaussian components is then used to identify DEGs using the coefficients estimated by the linear model. This approach can readily be applied to experimental designs with or without replication. On a wholly defined dataset, the PL-LM method was able to identify 75% of the differentially expressed genes within 10% of false positives. This accuracy was achieved both using the three replicates per condition available in the dataset and using only one replicate per condition. The method achieves, on this dataset, a higher accuracy than the best set of tools identified by the authors of the dataset, and does so using only one replicate per condition.

  3. Mining Gene Regulatory Networks by Neural Modeling of Expression Time-Series.

    PubMed

    Rubiolo, Mariano; Milone, Diego H; Stegmayer, Georgina

    2015-01-01

    Discovering gene regulatory networks from data is one of the most studied topics in recent years. Neural networks can be successfully used to infer an underlying gene network by modeling expression profiles as time series. This work proposes a novel method based on a pool of neural networks for obtaining a gene regulatory network from a gene expression dataset. They are used for modeling each possible interaction between pairs of genes in the dataset, and a set of mining rules is applied to accurately detect the subjacent relations among genes. The results obtained on artificial and real datasets confirm the method's effectiveness for discovering regulatory networks from a proper modeling of the temporal dynamics of gene expression profiles.

  4. Integrating genome-wide association studies and gene expression data highlights dysregulated multiple sclerosis risk pathways.

    PubMed

    Liu, Guiyou; Zhang, Fang; Jiang, Yongshuai; Hu, Yang; Gong, Zhongying; Liu, Shoufeng; Chen, Xiuju; Jiang, Qinghua; Hao, Junwei

    2017-02-01

    Much effort has been expended on identifying the genetic determinants of multiple sclerosis (MS). Existing large-scale genome-wide association study (GWAS) datasets provide strong support for using pathway and network-based analysis methods to investigate the mechanisms underlying MS. However, no shared genetic pathways have been identified to date. We hypothesize that shared genetic pathways may indeed exist in different MS-GWAS datasets. Here, we report results from a three-stage analysis of GWAS and expression datasets. In stage 1, we conducted multiple pathway analyses of two MS-GWAS datasets. In stage 2, we performed a candidate pathway analysis of the large-scale MS-GWAS dataset. In stage 3, we performed a pathway analysis using the dysregulated MS gene list from seven human MS case-control expression datasets. In stage 1, we identified 15 shared pathways. In stage 2, we successfully replicated 14 of these 15 significant pathways. In stage 3, we found that dysregulated MS genes were significantly enriched in 10 of 15 MS risk pathways identified in stages 1 and 2. We report shared genetic pathways in different MS-GWAS datasets and highlight some new MS risk pathways. Our findings provide new insights on the genetic determinants of MS.
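
    The abstract does not state which enrichment statistic was used in stage 3. A common choice for asking whether a gene list is over-represented in a pathway is the one-sided hypergeometric test, sketched below under that assumption. All parameter names are hypothetical.

```python
from math import comb


def enrichment_p_value(universe, pathway_size, hits_total, hits_in_pathway):
    """One-sided hypergeometric p-value: the probability of seeing at least
    `hits_in_pathway` pathway genes when `hits_total` genes are drawn at
    random from a universe containing `pathway_size` pathway genes."""
    total = comb(universe, hits_total)
    upper = min(pathway_size, hits_total)
    tail = sum(
        comb(pathway_size, k) * comb(universe - pathway_size, hits_total - k)
        for k in range(hits_in_pathway, upper + 1)
    )
    return tail / total
```

    When many pathways are tested, the resulting p-values would normally also be corrected for multiple testing.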

  5. Complex nature of SNP genotype effects on gene expression in primary human leucocytes.

    PubMed

    Heap, Graham A; Trynka, Gosia; Jansen, Ritsert C; Bruinenberg, Marcel; Swertz, Morris A; Dinesen, Lotte C; Hunt, Karen A; Wijmenga, Cisca; Vanheel, David A; Franke, Lude

    2009-01-07

    Genome wide association studies have been hugely successful in identifying disease risk variants, yet most variants do not lead to coding changes and how variants influence biological function is usually unknown. We correlated gene expression and genetic variation in untouched primary leucocytes (n = 110) from individuals with celiac disease - a common condition with multiple risk variants identified. We compared our observations with an EBV-transformed HapMap B cell line dataset (n = 90), and performed a meta-analysis to increase power to detect non-tissue specific effects. In celiac peripheral blood, 2,315 SNP variants influenced gene expression at 765 different transcripts (< 250 kb from SNP, at FDR = 0.05, cis expression quantitative trait loci, eQTLs). 135 of the detected SNP-probe effects (reflecting 51 unique probes) were also detected in a HapMap B cell line published dataset, all with effects in the same allelic direction. Overall gene expression differences within the two datasets predominantly explain the limited overlap in observed cis-eQTLs. Celiac associated risk variants from two regions, containing genes IL18RAP and CCR3, showed significant cis genotype-expression correlations in the peripheral blood but not in the B cell line datasets. We identified 14 genes where a SNP affected the expression of different probes within the same gene, but in opposite allelic directions. By incorporating genetic variation in co-expression analyses, functional relationships between genes can be more significantly detected. In conclusion, the complex nature of genotypic effects in human populations makes the use of a relevant tissue, large datasets, and analysis of different exons essential to enable the identification of the function for many genetic risk variants in common diseases.

  6. A Systems Biology Methodology Combining Transcriptome and Interactome Datasets to Assess the Implications of Cytokinin Signaling for Plant Immune Networks.

    PubMed

    Kunz, Meik; Dandekar, Thomas; Naseem, Muhammad

    2017-01-01

    Cytokinins (CKs) play an important role in plant growth and development. Also, several studies highlight the modulatory implications of CKs for plant-pathogen interaction. However, the underlying mechanisms of CK mediating immune networks in plants are still not fully understood. A detailed analysis of high-throughput transcriptome (RNA-Seq and microarrays) datasets under modulated conditions of plant CKs and its mergence with cellular interactome (large-scale protein-protein interaction data) has the potential to unlock the contribution of CKs to plant defense. Here, we specifically describe a detailed systems biology methodology pertinent to the acquisition and analysis of various omics datasets that delineate the role of plant CKs in impacting immune pathways in Arabidopsis.

  7. Genome-wide characterization of differential transcript usage in Arabidopsis thaliana.

    PubMed

    Vaneechoutte, Dries; Estrada, April R; Lin, Ying-Chen; Loraine, Ann E; Vandepoele, Klaas

    2017-12-01

    Alternative splicing and the usage of alternate transcription start or stop sites allow a single gene to produce multiple transcript isoforms. Most plant genes express certain isoforms at a significantly higher level than others, but under specific conditions this expression dominance can change, resulting in a different set of dominant isoforms. These events of differential transcript usage (DTU) have been observed for thousands of Arabidopsis thaliana, Zea mays and Vitis vinifera genes, and have been linked to development and stress response. However, neither the characteristics of these genes, nor the implications of DTU on their protein coding sequences or functions, are currently well understood. Here we present a dataset of isoform dominance and DTU for all genes in the AtRTD2 reference transcriptome based on a protocol that was benchmarked on simulated data and validated through comparison with a published reverse transcriptase-polymerase chain reaction panel. We report DTU events for 8148 genes across 206 public RNA-Seq samples, and find that protein sequences are affected in 22% of the cases. The observed DTU events show high consistency across replicates, and reveal reproducible patterns in response to treatment and development. We also demonstrate that genes with different evolutionary ages, expression breadths and functions show large differences in the frequency at which they undergo DTU, and in the effect that these events have on their protein sequences. Finally, we showcase how the generated dataset can be used to explore DTU events for genes of interest or to find genes with specific DTU in samples of interest. © 2017 The Authors The Plant Journal © 2017 John Wiley & Sons Ltd.

  8. Computational immune profiling in lung adenocarcinoma reveals reproducible prognostic associations with implications for immunotherapy

    PubMed Central

    Varn, Frederick S.; Tafe, Laura J.; Amos, Christopher I.; Cheng, Chao

    2018-01-01

    Non-small cell lung cancer is one of the leading causes of cancer-related death in the world. Lung adenocarcinoma, the most common type of non-small cell lung cancer, has been well characterized as having a dense lymphocytic infiltrate, suggesting that the immune system plays an active role in shaping this cancer's growth and development. Despite these findings, our understanding of how this infiltrate affects patient prognosis and its association with lung adenocarcinoma-specific clinical factors remains limited. To address these questions, we inferred the infiltration level of six distinct immune cell types from a series of four lung adenocarcinoma gene expression datasets. We found that naive B cell, CD8+ T cell, and myeloid cell-derived expression signals of immune infiltration were significantly predictive of patient survival in multiple independent datasets, with B cell and CD8+ T cell infiltration associated with prolonged survival and myeloid cell infiltration associated with shorter survival. These associations remained significant even after accounting for additional clinical variables. Patients stratified by smoking status exhibited decreased CD8+ T cell infiltration and altered prognostic associations, suggesting potential immunosuppressive mechanisms in smokers. Survival analyses accounting for immune checkpoint gene expression and cellular immune infiltrate indicated checkpoint protein-specific modulatory effects on CD8+ T cell and B cell function that may be associated with patient sensitivity to immunotherapy. Together, these analyses identified reproducible associations that can be used to better characterize the role of immune infiltration in lung adenocarcinoma and demonstrate the utility of using computational approaches to systematically characterize tissue-specific tumor-immune interactions. PMID:29872556

  9. Hybrid coexpression link similarity graph clustering for mining biological modules from multiple gene expression datasets.

    PubMed

    Salem, Saeed; Ozcaglar, Cagri

    2014-01-01

    Advances in genomic technologies have enabled the accumulation of vast amounts of genomic data, including gene expression data for multiple species under various biological and environmental conditions. Integration of these gene expression datasets is a promising strategy to alleviate the challenges of protein functional annotation and biological module discovery based on a single gene expression dataset, which suffers from spurious coexpression. We propose a joint mining algorithm that constructs a weighted hybrid similarity graph whose nodes are the coexpression links. The weight of an edge between two coexpression links in this hybrid graph is a linear combination of the topological similarities and co-appearance similarities of the corresponding two coexpression links. Clustering the weighted hybrid similarity graph yields recurrent coexpression link clusters (modules). Experimental results on human gene expression datasets show that the reported modules are functionally homogeneous, as evidenced by their enrichment with biological process GO terms and KEGG pathways.
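
    The edge weight described above is a linear combination of two similarities between a pair of coexpression links. The abstract does not define either similarity or the mixing coefficient, so the sketch below fills them in with assumptions: both similarities are taken as Jaccard indices (over the neighbouring links of each link, and over the sets of datasets in which each link appears), and `alpha` is a hypothetical mixing parameter.

```python
def jaccard(a, b):
    """Jaccard index of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hybrid_weight(topological_sim, coappearance_sim, alpha=0.5):
    """Linear combination of the two similarities.
    alpha is an assumed mixing parameter between 0 and 1."""
    return alpha * topological_sim + (1.0 - alpha) * coappearance_sim


def link_pair_weight(neighbours_1, neighbours_2, datasets_1, datasets_2, alpha=0.5):
    """Weight between two coexpression links under the assumptions above."""
    topo = jaccard(neighbours_1, neighbours_2)
    coapp = jaccard(datasets_1, datasets_2)
    return hybrid_weight(topo, coapp, alpha)
```

    Setting `alpha` to 1 would use topology alone, and setting it to 0 would use co-appearance across datasets alone.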

  10. Integrated genomic approaches identify major pathways and upstream regulators in late onset Alzheimer’s disease

    PubMed Central

    Li, Xinzhong; Long, Jintao; He, Taigang; Belshaw, Robert; Scott, James

    2015-01-01

    Previous studies have evaluated gene expression in Alzheimer’s disease (AD) brains to identify mechanistic processes, but have been limited by the size of the datasets studied. Here we have implemented a novel meta-analysis approach to identify differentially expressed genes (DEGs) in published datasets comprising 450 late onset AD (LOAD) brains and 212 controls. We found 3124 DEGs, many of which were highly correlated with Braak stage and cerebral atrophy. Pathway Analysis revealed the most perturbed pathways to be (a) nitric oxide and reactive oxygen species in macrophages (NOROS), (b) NFkB and (c) mitochondrial dysfunction. NOROS was also up-regulated, and mitochondrial dysfunction down-regulated, in healthy ageing subjects. Upstream regulator analysis predicted the TLR4 ligands, STAT3 and NFKBIA, for activated pathways and RICTOR for mitochondrial genes. Protein-protein interaction network analysis emphasised the role of NFKB; identified a key interaction of CLU with complement; and linked TYROBP, TREM2 and DOK3 to modulation of LPS signalling through TLR4 and to phosphatidylinositol metabolism. We suggest that NEUROD6, ZCCHC17, PPEF1 and MANBAL are potentially implicated in LOAD, with predicted links to calcium signalling and protein mannosylation. Our study demonstrates a highly injurious combination of TLR4-mediated NFKB signalling, NOROS inflammatory pathway activation, and mitochondrial dysfunction in LOAD. PMID:26202100

  11. Evaluation of Different Normalization and Analysis Procedures for Illumina Gene Expression Microarray Data Involving Small Changes

    PubMed Central

    Johnstone, Daniel M.; Riveros, Carlos; Heidari, Moones; Graham, Ross M.; Trinder, Debbie; Berretta, Regina; Olynyk, John K.; Scott, Rodney J.; Moscato, Pablo; Milward, Elizabeth A.

    2013-01-01

    While Illumina microarrays can be used successfully for detecting small gene expression changes due to their high degree of technical replicability, there is little information on how different normalization and differential expression analysis strategies affect outcomes. To evaluate this, we assessed concordance across gene lists generated by applying different combinations of normalization strategy and analytical approach to two Illumina datasets with modest expression changes. In addition to using traditional statistical approaches, we also tested an approach based on combinatorial optimization. We found that the choice of both normalization strategy and analytical approach considerably affected outcomes, in some cases leading to substantial differences in gene lists and subsequent pathway analysis results. Our findings suggest that important biological phenomena may be overlooked when there is a routine practice of using only one approach to investigate all microarray datasets. Analytical artefacts of this kind are likely to be especially relevant for datasets involving small fold changes, where inherent technical variation—if not adequately minimized by effective normalization—may overshadow true biological variation. This report provides some basic guidelines for optimizing outcomes when working with Illumina datasets involving small expression changes. PMID:27605185

  12. Identification of druggable cancer driver genes amplified across TCGA datasets.

    PubMed

    Chen, Ying; McGee, Jeremy; Chen, Xianming; Doman, Thompson N; Gong, Xueqian; Zhang, Youyan; Hamm, Nicole; Ma, Xiwen; Higgs, Richard E; Bhagwat, Shripad V; Buchanan, Sean; Peng, Sheng-Bin; Staschke, Kirk A; Yadav, Vipin; Yue, Yong; Kouros-Mehr, Hosein

    2014-01-01

    The Cancer Genome Atlas (TCGA) projects have advanced our understanding of the driver mutations, genetic backgrounds, and key pathways activated across cancer types. Analyses of TCGA datasets have mostly focused on somatic mutations and translocations, with less emphasis placed on gene amplifications. Here we describe a bioinformatics screening strategy to identify putative cancer driver genes amplified across TCGA datasets. We carried out GISTIC2 analysis of TCGA datasets spanning 16 cancer subtypes and identified 486 genes that were amplified in two or more datasets. The list was narrowed to 75 cancer-associated genes with potential "druggable" properties. The majority of the genes were localized to 14 amplicons spread across the genome. To identify potential cancer driver genes, we analyzed gene copy number and mRNA expression data from individual patient samples and identified 42 putative cancer driver genes linked to diverse oncogenic processes. Oncogenic activity was further validated by siRNA/shRNA knockdown and by referencing the Project Achilles datasets. The amplified genes represented a number of gene families, including epigenetic regulators, cell cycle-associated genes, DNA damage response/repair genes, metabolic regulators, and genes linked to the Wnt, Notch, Hedgehog, JAK/STAT, NF-KB and MAPK signaling pathways. Among the 42 putative driver genes were known driver genes, such as EGFR, ERBB2 and PIK3CA. Wild-type KRAS was amplified in several cancer types, and KRAS-amplified cancer cell lines were most sensitive to KRAS shRNA, suggesting that KRAS amplification was an independent oncogenic event. A number of MAP kinase adapters were co-amplified with their receptor tyrosine kinases, such as the FGFR adapter FRS2 and the EGFR family adapters GRB2 and GRB7. The ubiquitin-like ligase DCUN1D1 and the histone methyltransferase NSD3 were also identified as novel putative cancer driver genes. 
We discuss the patient tailoring implications for existing cancer drug targets and we further discuss potential novel opportunities for drug discovery efforts.

  13. Identification of Druggable Cancer Driver Genes Amplified across TCGA Datasets

    PubMed Central

    Chen, Ying; McGee, Jeremy; Chen, Xianming; Doman, Thompson N.; Gong, Xueqian; Zhang, Youyan; Hamm, Nicole; Ma, Xiwen; Higgs, Richard E.; Bhagwat, Shripad V.; Buchanan, Sean; Peng, Sheng-Bin; Staschke, Kirk A.; Yadav, Vipin; Yue, Yong; Kouros-Mehr, Hosein

    2014-01-01

    The Cancer Genome Atlas (TCGA) projects have advanced our understanding of the driver mutations, genetic backgrounds, and key pathways activated across cancer types. Analysis of TCGA datasets has mostly focused on somatic mutations and translocations, with less emphasis placed on gene amplifications. Here we describe a bioinformatics screening strategy to identify putative cancer driver genes amplified across TCGA datasets. We carried out GISTIC2 analysis of TCGA datasets spanning 16 cancer subtypes and identified 486 genes that were amplified in two or more datasets. The list was narrowed to 75 cancer-associated genes with potential “druggable” properties. The majority of the genes were localized to 14 amplicons spread across the genome. To identify potential cancer driver genes, we analyzed gene copy number and mRNA expression data from individual patient samples and identified 42 putative cancer driver genes linked to diverse oncogenic processes. Oncogenic activity was further validated by siRNA/shRNA knockdown and by referencing the Project Achilles datasets. The amplified genes represented a number of gene families, including epigenetic regulators, cell cycle-associated genes, DNA damage response/repair genes, metabolic regulators, and genes linked to the Wnt, Notch, Hedgehog, JAK/STAT, NF-κB and MAPK signaling pathways. Among the 42 putative driver genes were known driver genes, such as EGFR, ERBB2 and PIK3CA. Wild-type KRAS was amplified in several cancer types, and KRAS-amplified cancer cell lines were most sensitive to KRAS shRNA, suggesting that KRAS amplification was an independent oncogenic event. A number of MAP kinase adapters were co-amplified with their receptor tyrosine kinases, such as the FGFR adapter FRS2 and the EGFR family adapters GRB2 and GRB7. The ubiquitin-like ligase DCUN1D1 and the histone methyltransferase NSD3 were also identified as novel putative cancer driver genes. We discuss the patient tailoring implications for existing cancer drug targets and we further discuss potential novel opportunities for drug discovery efforts. PMID:24874471

  14. Gene expression profile of mouse prostate tumors reveals dysregulations in major biological processes and identifies potential murine targets for preclinical development of human prostate cancer therapy.

    PubMed

    Haram, Kerstyn M; Peltier, Heidi J; Lu, Bin; Bhasin, Manoj; Otu, Hasan H; Choy, Bob; Regan, Meredith; Libermann, Towia A; Latham, Gary J; Sanda, Martin G; Arredouani, Mohamed S

    2008-10-01

    Translation of preclinical studies into effective human cancer therapy is hampered by the lack of defined molecular expression patterns in mouse models that correspond to the human counterpart. We sought to generate an open source TRAMP mouse microarray dataset and to use this array to identify differentially expressed genes from human prostate cancer (PCa) that have concordant expression in TRAMP tumors, and thereby represent lead targets for preclinical therapy development. We performed microarrays on total RNA extracted and amplified from eight TRAMP tumors and nine normal prostates. A subset of differentially expressed genes was validated by QRT-PCR. Differentially expressed TRAMP genes were analyzed for concordant expression in publicly available human prostate array datasets and a subset of resulting genes was analyzed by QRT-PCR. Cross-referencing differentially expressed TRAMP genes to public human prostate array datasets revealed 66 genes with concordant expression in mouse and human PCa; 56 between metastases and normal and 10 between primary tumor and normal tissues. Of these 10 genes, two, Sox4 and Tubb2a, were validated by QRT-PCR. Our analysis also revealed various dysregulations in major biologic pathways in the TRAMP prostates. We report a TRAMP microarray dataset of which a gene subset was validated by QRT-PCR with expression patterns consistent with previous gene-specific TRAMP studies. Concordance analysis between TRAMP and human PCa associated genes supports the utility of the model and suggests several novel molecular targets for preclinical therapy.

  15. Polyester: simulating RNA-seq datasets with differential transcript expression.

    PubMed

    Frazee, Alyssa C; Jaffe, Andrew E; Langmead, Ben; Leek, Jeffrey T

    2015-09-01

    Statistical methods development for differential expression analysis of RNA sequencing (RNA-seq) requires software tools to assess accuracy and error rate control. Since true differential expression status is often unknown in experimental datasets, artificially constructed datasets must be utilized, either by generating costly spike-in experiments or by simulating RNA-seq data. Polyester is an R package designed to simulate RNA-seq data, beginning with an experimental design and ending with collections of RNA-seq reads. Its main advantage is the ability to simulate reads indicating isoform-level differential expression across biological replicates for a variety of experimental designs. Data generated by Polyester is a reasonable approximation to real RNA-seq data and standard differential expression workflows can recover differential expression set in the simulation by the user. Polyester is freely available from Bioconductor (http://bioconductor.org/). Contact: jtleek@gmail.com. Supplementary data are available at Bioinformatics online.
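    The idea of simulating data with a known differential expression status can be illustrated at the count level. The following is a minimal Python sketch, not Polyester itself (which is an R package and produces reads rather than counts). It assumes a negative binomial count model drawn as a gamma-Poisson mixture, which the abstract does not specify; the function names, the `size` dispersion parameter and the example means are all hypothetical.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw a Poisson variate with Knuth's multiplication method.

    Adequate for the modest means used in this sketch.
    """
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_negative_binomial(mean, size, rng):
    """Gamma-Poisson mixture with variance = mean + mean**2 / size."""
    lam = rng.gammavariate(size, mean / size)
    return sample_poisson(lam, rng)


def simulate_counts(base_means, fold_changes, reps_per_group, size=10.0, seed=0):
    """Return one row of counts per transcript: group 1 replicates, then group 2.

    base_means   -- expected count per transcript in group 1 (must be > 0)
    fold_changes -- multiplier applied to the mean in group 2 (1.0 = no change)
    """
    rng = random.Random(seed)
    counts = []
    for mean, fc in zip(base_means, fold_changes):
        group1 = [sample_negative_binomial(mean, size, rng)
                  for _ in range(reps_per_group)]
        group2 = [sample_negative_binomial(mean * fc, size, rng)
                  for _ in range(reps_per_group)]
        counts.append(group1 + group2)
    return counts


if __name__ == "__main__":
    # Transcript 1 is simulated as 4-fold up-regulated in group 2 (assumed values).
    table = simulate_counts([50.0, 50.0, 50.0], [1.0, 4.0, 1.0], reps_per_group=5)
    for row in table:
        print(row)
```

    Because the fold changes are set by the user, the output of any differential expression workflow run on such a table can be scored against the known truth.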

  16. SPP1 and AGER as potential prognostic biomarkers for lung adenocarcinoma.

    PubMed

    Zhang, Weiguo; Fan, Junli; Chen, Qiang; Lei, Caipeng; Qiao, Bin; Liu, Qin

    2018-05-01

    Overdue treatment and prognostic evaluation lead to low survival rates in patients with lung adenocarcinoma (LUAD). To date, effective biomarkers for prognosis are still required. The aim of the present study was to screen differentially expressed genes (DEGs) as biomarkers for prognostic evaluation of LUAD. DEGs in tumor and normal samples were identified and analyzed for Kyoto Encyclopedia of Genes and Genomes/Gene Ontology functional enrichments. The common genes that were up- and downregulated were selected for prognostic analysis using RNAseq data in The Cancer Genome Atlas. Differential expression analysis was performed with 164 samples in the GSE10072 and GSE7670 datasets. A total of 484 DEGs that were present in both the GSE10072 and GSE7670 datasets were screened, including secreted phosphoprotein 1 (SPP1), which was highly expressed in tumor tissues, and ficolin 3, advanced glycosylation end-product specific receptor (AGER) and transmembrane protein 100, which were lowly expressed in tumor tissues. These four key genes were subsequently verified using an independent dataset, GSE19804. The gene expression model was consistent with the GSE10072 and GSE7670 datasets. The dysregulation of highly expressed SPP1 and lowly expressed AGER significantly reduced the median survival time of patients with LUAD. These findings suggest that SPP1 and AGER are risk factors for LUAD, and these two genes may be utilized in the prognostic evaluation of patients with LUAD. Additionally, the key genes and functional enrichments may provide a reference for investigating the molecular expression mechanisms underlying LUAD.

  17. A method for generating new datasets based on copy number for cancer analysis.

    PubMed

    Kim, Shinuk; Kon, Mark; Kang, Hyunsik

    2015-01-01

    New data sources for the analysis of cancer data are rapidly supplementing the large number of gene-expression markers used for current methods of analysis. Significant among these new sources are copy number variation (CNV) datasets, which typically enumerate several hundred thousand CNVs distributed throughout the genome. Several useful algorithms allow systems-level analyses of such datasets. However, these rich data sources have not yet been analyzed as deeply as gene-expression data. To address this issue, the extensive toolsets used for analyzing expression data in cancerous and noncancerous tissue (e.g., gene set enrichment analysis and phenotype prediction) could be redirected to extract a great deal of predictive information from CNV data, in particular those derived from cancers. Here we present a software package capable of preprocessing standard Agilent copy number datasets into a form to which essentially all expression analysis tools can be applied. We illustrate the use of this toolset in predicting the survival time of patients with ovarian cancer or glioblastoma multiforme and also provide an analysis of gene- and pathway-level deletions in these two types of cancer.

  18. Dynamic association rules for gene expression data analysis.

    PubMed

    Chen, Shu-Chuan; Tsai, Tsung-Hsien; Chung, Cheng-Han; Li, Wen-Hsiung

    2015-10-14

    The purpose of gene expression analysis is to look for the association between regulation of gene expression levels and phenotypic variations. This association based on gene expression profile has been used to determine whether the induction/repression of genes corresponds to phenotypic variations including cell regulations, clinical diagnoses and drug development. Statistical analyses on microarray data have been developed to resolve the gene selection issue. However, these methods do not inform us of causality between genes and phenotypes. In this paper, we propose the dynamic association rule algorithm (DAR algorithm) which helps one to efficiently select a subset of significant genes for subsequent analysis. The DAR algorithm is based on association rules from market basket analysis in marketing. We first propose a statistical way, based on constructing a one-sided confidence interval and hypothesis testing, to determine if an association rule is meaningful. Based on the proposed statistical method, we then developed the DAR algorithm for gene expression data analysis. The method was applied to analyze four microarray datasets and one Next Generation Sequencing (NGS) dataset: the Mice Apo A1 dataset, the whole genome expression dataset of mouse embryonic stem cells, expression profiling of the bone marrow of Leukemia patients, the Microarray Quality Control (MAQC) dataset and the RNA-seq dataset of a mouse genomic imprinting study. A comparison of the proposed method with the t-test on the expression profiling of the bone marrow of Leukemia patients was conducted. We developed a statistical way, based on the concept of confidence interval, to determine the minimum support and minimum confidence for mining association relationships among items. With the minimum support and minimum confidence, one can find significant rules in one single step. The DAR algorithm was then developed for gene expression data analysis. Four gene expression datasets showed that the proposed DAR algorithm not only was able to identify a set of differentially expressed genes that largely agreed with that of other methods, but also provided an efficient and accurate way to find influential genes of a disease. In the paper, the well-established association rule mining technique from marketing has been successfully modified to determine the minimum support and minimum confidence based on the concept of confidence interval and hypothesis testing. It can be applied to gene expression data to mine significant association rules between gene regulation and phenotype. The proposed DAR algorithm provides an efficient way to find influential genes that underlie the phenotypic variance.
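    Support and confidence, and the use of a one-sided confidence bound to judge a rule, can be made concrete with a small example. The sketch below is one possible reading rather than the published DAR algorithm: the abstract does not give the exact construction, so the normal-approximation (Wald) lower bound, the function names and the toy samples are assumptions made for illustration.

```python
from statistics import NormalDist


def rule_stats(samples, antecedent, consequent):
    """Return (support, confidence, n, n_antecedent) for antecedent -> consequent.

    Each sample is a set of items, e.g. {"geneA_up", "case"}.
    """
    n = len(samples)
    n_a = sum(1 for s in samples if antecedent <= s)
    n_ac = sum(1 for s in samples if (antecedent | consequent) <= s)
    support = n_ac / n if n else 0.0
    confidence = n_ac / n_a if n_a else 0.0
    return support, confidence, n, n_a


def lower_bound(p, n, alpha=0.05):
    """One-sided lower confidence bound for a proportion (Wald approximation)."""
    if n == 0:
        return 0.0
    z = NormalDist().inv_cdf(1.0 - alpha)
    return max(0.0, p - z * (p * (1.0 - p) / n) ** 0.5)


def rule_is_meaningful(samples, antecedent, consequent,
                       min_support, min_confidence, alpha=0.05):
    """Accept the rule only if both lower bounds clear their thresholds."""
    support, confidence, n, n_a = rule_stats(samples, antecedent, consequent)
    return (lower_bound(support, n, alpha) > min_support
            and lower_bound(confidence, n_a, alpha) > min_confidence)


# Toy data (hypothetical): 5 samples with geneA up-regulated, 4 of them cases.
SAMPLES = [
    {"geneA_up", "case"}, {"geneA_up", "case"}, {"geneA_up", "case"},
    {"geneA_up", "case"}, {"geneA_up"},
    {"case"}, set(), set(), set(), set(),
]

if __name__ == "__main__":
    support, confidence, n, n_a = rule_stats(SAMPLES, {"geneA_up"}, {"case"})
    print(support, confidence)
    print(rule_is_meaningful(SAMPLES, {"geneA_up"}, {"case"}, 0.1, 0.5))
```

    With so few samples the lower bounds are wide, which is the point of testing the bound rather than the raw proportion.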

  19. The MetabolomeExpress Project: enabling web-based processing, analysis and transparent dissemination of GC/MS metabolomics datasets.

    PubMed

    Carroll, Adam J; Badger, Murray R; Harvey Millar, A

    2010-07-14

    Standardization of analytical approaches and reporting methods via community-wide collaboration can work synergistically with web-tool development to result in rapid community-driven expansion of online data repositories suitable for data mining and meta-analysis. In metabolomics, the inter-laboratory reproducibility of gas-chromatography/mass-spectrometry (GC/MS) makes it an obvious target for such development. While a number of web-tools offer access to datasets and/or tools for raw data processing and statistical analysis, none of these systems are currently set up to act as a public repository by easily accepting, processing and presenting publicly submitted GC/MS metabolomics datasets for public re-analysis. Here, we present MetabolomeExpress, a new File Transfer Protocol (FTP) server and web-tool for the online storage, processing, visualisation and statistical re-analysis of publicly submitted GC/MS metabolomics datasets. Users may search a quality-controlled database of metabolite response statistics from publicly submitted datasets by a number of parameters (e.g., metabolite, species, organ/biofluid, etc.). Users may also perform meta-analysis comparisons of multiple independent experiments or re-analyse public primary datasets via user-friendly tools for t-test, principal components analysis, hierarchical cluster analysis and correlation analysis. They may interact with chromatograms, mass spectra and peak detection results via an integrated raw data viewer. Researchers who register for a free account may upload (via FTP) their own data to the server for online processing via a novel raw data processing pipeline. MetabolomeExpress (https://www.metabolome-express.org) provides a new opportunity for the general metabolomics community to transparently present online the raw and processed GC/MS data underlying their metabolomics publications. Transparent sharing of these data will allow researchers to assess data quality and draw their own insights from published metabolomics datasets.

  20. A high resolution atlas of gene expression in the domestic sheep (Ovis aries)

    PubMed Central

    Clark, Emily L.; Bush, Stephen J.; McCulloch, Mary E. B.; Farquhar, Iseabail L.; Young, Rachel; Lefevre, Lucas; Pridans, Clare; Tsang, Hiu G.; Wu, Chunlei; Afrasiabi, Cyrus; Watson, Mick; Whitelaw, C. Bruce; Freeman, Tom C.; Summers, Kim M.; Archibald, Alan L.; Hume, David A.

    2017-01-01

    Sheep are a key source of meat, milk and fibre for the global livestock sector, and an important biomedical model. Global analysis of gene expression across multiple tissues has aided genome annotation and supported functional annotation of mammalian genes. We present a large-scale RNA-Seq dataset representing all the major organ systems from adult sheep and from several juvenile, neonatal and prenatal developmental time points. The Ovis aries reference genome (Oar v3.1) includes 27,504 genes (20,921 protein coding), of which 25,350 (19,921 protein coding) had detectable expression in at least one tissue in the sheep gene expression atlas dataset. Network-based cluster analysis of this dataset grouped genes according to their expression pattern. The principle of ‘guilt by association’ was used to infer the function of uncharacterised genes from their co-expression with genes of known function. We describe the overall transcriptional signatures present in the sheep gene expression atlas and assign those signatures, where possible, to specific cell populations or pathways. The findings are related to innate immunity by focusing on clusters with an immune signature, and to the advantages of cross-breeding by examining the patterns of genes exhibiting the greatest expression differences between purebred and crossbred animals. This high-resolution gene expression atlas for sheep is, to our knowledge, the largest transcriptomic dataset from any livestock species to date. It provides a resource to improve the annotation of the current reference genome for sheep, presenting a model transcriptome for ruminants and insight into gene, cell and tissue function at multiple developmental stages. PMID:28915238

  1. A high resolution atlas of gene expression in the domestic sheep (Ovis aries).

    PubMed

    Clark, Emily L; Bush, Stephen J; McCulloch, Mary E B; Farquhar, Iseabail L; Young, Rachel; Lefevre, Lucas; Pridans, Clare; Tsang, Hiu G; Wu, Chunlei; Afrasiabi, Cyrus; Watson, Mick; Whitelaw, C Bruce; Freeman, Tom C; Summers, Kim M; Archibald, Alan L; Hume, David A

    2017-09-01

    Sheep are a key source of meat, milk and fibre for the global livestock sector, and an important biomedical model. Global analysis of gene expression across multiple tissues has aided genome annotation and supported functional annotation of mammalian genes. We present a large-scale RNA-Seq dataset representing all the major organ systems from adult sheep and from several juvenile, neonatal and prenatal developmental time points. The Ovis aries reference genome (Oar v3.1) includes 27,504 genes (20,921 protein coding), of which 25,350 (19,921 protein coding) had detectable expression in at least one tissue in the sheep gene expression atlas dataset. Network-based cluster analysis of this dataset grouped genes according to their expression pattern. The principle of 'guilt by association' was used to infer the function of uncharacterised genes from their co-expression with genes of known function. We describe the overall transcriptional signatures present in the sheep gene expression atlas and assign those signatures, where possible, to specific cell populations or pathways. The findings are related to innate immunity by focusing on clusters with an immune signature, and to the advantages of cross-breeding by examining the patterns of genes exhibiting the greatest expression differences between purebred and crossbred animals. This high-resolution gene expression atlas for sheep is, to our knowledge, the largest transcriptomic dataset from any livestock species to date. It provides a resource to improve the annotation of the current reference genome for sheep, presenting a model transcriptome for ruminants and insight into gene, cell and tissue function at multiple developmental stages.

  2. New Statistics for Testing Differential Expression of Pathways from Microarray Data

    NASA Astrophysics Data System (ADS)

    Siu, Hoicheong; Dong, Hua; Jin, Li; Xiong, Momiao

    Exploring biological meaning from microarray data is very important but remains a great challenge. Here, we developed three new statistics: linear combination test, quadratic test and de-correlation test to identify differentially expressed pathways from gene expression profile. We apply our statistics to two rheumatoid arthritis datasets. Notably, our results reveal three significant pathways and 275 genes in common in two datasets. The pathways we found are meaningful to uncover the disease mechanisms of rheumatoid arthritis, which implies that our statistics are a powerful tool in functional analysis of gene expression data.

  3. BABAR: an R package to simplify the normalisation of common reference design microarray-based transcriptomic datasets

    PubMed Central

    2010-01-01

    Background The development of DNA microarrays has facilitated the generation of hundreds of thousands of transcriptomic datasets. The use of a common reference microarray design allows existing transcriptomic data to be readily compared and re-analysed in the light of new data, and the combination of this design with large datasets is ideal for 'systems'-level analyses. One issue is that these datasets are typically collected over many years and may be heterogeneous in nature, containing different microarray file formats and gene array layouts, dye-swaps, and showing varying scales of log2-ratios of expression between microarrays. Excellent software exists for the normalisation and analysis of microarray data but many data have yet to be analysed as existing methods struggle with heterogeneous datasets; options include normalising microarrays on an individual or experimental group basis. Our solution was to develop the Batch Anti-Banana Algorithm in R (BABAR) algorithm and software package which uses cyclic loess to normalise across the complete dataset. We have already used BABAR to analyse the function of Salmonella genes involved in the process of infection of mammalian cells. Results The only input required by BABAR is unprocessed GenePix or BlueFuse microarray data files. BABAR provides a combination of 'within' and 'between' microarray normalisation steps and diagnostic boxplots. When applied to a real heterogeneous dataset, BABAR normalised the dataset to produce a comparable scaling between the microarrays, with the microarray data in excellent agreement with RT-PCR analysis. When applied to a real non-heterogeneous dataset and a simulated dataset, BABAR's performance in identifying differentially expressed genes showed some benefits over standard techniques. Conclusions BABAR is an easy-to-use software tool, simplifying the simultaneous normalisation of heterogeneous two-colour common reference design cDNA microarray-based transcriptomic datasets. We show BABAR transforms real and simulated datasets to allow for the correct interpretation of these data, and is the ideal tool to facilitate the identification of differentially expressed genes or network inference analysis from transcriptomic datasets. PMID:20128918

  4. A Protocol for Epigenetic Imprinting Analysis with RNA-Seq Data.

    PubMed

    Zou, Jinfeng; Xiang, Daoquan; Datla, Raju; Wang, Edwin

    2018-01-01

    Genomic imprinting is an epigenetic regulatory mechanism that operates through expression of certain genes from the maternal or paternal allele in a parent-of-origin-specific manner. Imprinted genes have been identified in diverse biological systems that are implicated in some human diseases and in embryonic and seed developmental programs in plants. The molecular underpinning programs and mechanisms involved in imprinting are yet to be explored in depth in plants. The recent advances in RNA-Seq-based methods and technologies offer an opportunity to systematically analyze epigenetic imprinting that operates at the whole genome level in model and crop plants. We are interested in using the Arabidopsis model system to investigate gene expression patterns associated with parent of origin and their implications for imprinting during embryo and seed development. Toward this, we have generated early embryo development RNA-Seq-based transcriptome datasets in F1s from a genetic cross between two diverse Arabidopsis thaliana ecotypes, Col-0 and Tsu-1. With the data, we developed a protocol for evaluating the maternal and paternal contributions of genes during the early stages of embryo development after fertilization. This protocol is also designed to consider the contamination from other potential seed tissues, sequencing quality, proper processing of sequenced reads and variant calling, and appropriate inference of the parental contributions based on the parent-of-origin-specific single-nucleotide polymorphisms within the expressed genes. The approach, methods and the protocol developed in this study can be used for evaluating the effects of epigenetic imprinting in plants.

  5. Affective State Level Recognition in Naturalistic Facial and Vocal Expressions.

    PubMed

    Meng, Hongying; Bianchi-Berthouze, Nadia

    2014-03-01

    Naturalistic affective expressions change at a rate much slower than the typical rate at which video or audio is recorded. This increases the probability that consecutive recorded instants of expressions represent the same affective content. In this paper, we exploit such a relationship to improve the recognition performance of continuous naturalistic affective expressions. Using datasets of naturalistic affective expressions (AVEC 2011 audio and video dataset, PAINFUL video dataset) continuously labeled over time and over different dimensions, we analyze the transitions between levels of those dimensions (e.g., transitions in pain intensity level). We use an information theory approach to show that the transitions occur very slowly and hence suggest modeling them as first-order Markov models. The dimension levels are considered to be the hidden states in the Hidden Markov Model (HMM) framework. Their discrete transition and emission matrices are trained by using the labels provided with the training set. The recognition problem is converted into a best path-finding problem to obtain the best hidden states sequence in HMMs. This is a key difference from previous use of HMMs as classifiers. Modeling of the transitions between dimension levels is integrated in a multistage approach, where the first level performs a mapping between the affective expression features and a soft decision value (e.g., an affective dimension level), and further classification stages are modeled as HMMs that refine that mapping by taking into account the temporal relationships between the output decision labels. The experimental results for each of the unimodal datasets show overall performance to be significantly above that of a standard classification system that does not take into account temporal relationships. In particular, the results on the AVEC 2011 audio dataset outperform all other systems presented at the international competition.
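    The best-path step described above can be sketched with a small discrete HMM. This is a minimal illustration, not the authors' system: transition and emission matrices are estimated by counting over labelled sequences and a Viterbi pass recovers the most likely level sequence. Add-one smoothing, the two-level toy data and all function names are assumptions.

```python
import math


def train_hmm(level_seqs, obs_seqs, n_levels, n_obs, smoothing=1.0):
    """Estimate start, transition and emission probabilities by counting labels."""
    start = [smoothing] * n_levels
    trans = [[smoothing] * n_levels for _ in range(n_levels)]
    emit = [[smoothing] * n_obs for _ in range(n_levels)]
    for levels, obs in zip(level_seqs, obs_seqs):
        start[levels[0]] += 1
        for prev, cur in zip(levels, levels[1:]):
            trans[prev][cur] += 1
        for level, symbol in zip(levels, obs):
            emit[level][symbol] += 1

    def normalise(row):
        total = sum(row)
        return [value / total for value in row]

    return normalise(start), [normalise(r) for r in trans], [normalise(r) for r in emit]


def viterbi(obs, start, trans, emit):
    """Return the most probable hidden level sequence for the observations."""
    n = len(start)
    score = [math.log(start[s]) + math.log(emit[s][obs[0]]) for s in range(n)]
    back = []
    for symbol in obs[1:]:
        pointers, new_score = [], []
        for s in range(n):
            best_prev = max(range(n), key=lambda p: score[p] + math.log(trans[p][s]))
            pointers.append(best_prev)
            new_score.append(score[best_prev] + math.log(trans[best_prev][s])
                             + math.log(emit[s][symbol]))
        back.append(pointers)
        score = new_score
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy training data (hypothetical): levels change slowly, observations are noisy.
TRAIN_LEVELS = [[0] * 10 + [1] * 10]
TRAIN_OBS = [[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1]]

if __name__ == "__main__":
    start, trans, emit = train_hmm(TRAIN_LEVELS, TRAIN_OBS, n_levels=2, n_obs=2)
    print(viterbi([0, 0, 1, 0, 0], start, trans, emit))  # isolated glitch
    print(viterbi([0, 0, 1, 1, 1], start, trans, emit))  # sustained change
```

    Because the learned self-transition probabilities are high, an isolated conflicting observation is smoothed away while a sustained change is followed, which is the temporal behaviour the abstract relies on.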

  6. Validation of MIMGO: a method to identify differentially expressed GO terms in a microarray dataset

    PubMed Central

    2012-01-01

    Background We previously proposed an algorithm for the identification of GO terms that commonly annotate genes whose expression is upregulated or downregulated in some microarray data compared with in other microarray data. We call these “differentially expressed GO terms” and have named the algorithm “matrix-assisted identification method of differentially expressed GO terms” (MIMGO). MIMGO can also identify microarray data in which genes annotated with a differentially expressed GO term are upregulated or downregulated. However, MIMGO has not yet been validated on a real microarray dataset using all available GO terms. Findings We combined Gene Set Enrichment Analysis (GSEA) with MIMGO to identify differentially expressed GO terms in a yeast cell cycle microarray dataset. GSEA followed by MIMGO (GSEA + MIMGO) correctly identified (p < 0.05) microarray data in which genes annotated to differentially expressed GO terms are upregulated. We found that GSEA + MIMGO was slightly less effective than, or comparable to, GSEA (Pearson), a method that uses Pearson’s correlation as a metric, at detecting true differentially expressed GO terms. However, unlike other methods including GSEA (Pearson), GSEA + MIMGO can comprehensively identify the microarray data in which genes annotated with a differentially expressed GO term are upregulated or downregulated. Conclusions MIMGO is a reliable method to identify differentially expressed GO terms comprehensively. PMID:23232071

  7. Preventing Unintended Disclosure of Personally Identifiable Data Following Anonymisation.

    PubMed

    Smith, Chris

    2017-01-01

    Errors and anomalies during the capture and processing of health data have the potential to place personally identifiable values into attributes of a dataset that are expected to contain non-identifiable values. Anonymisation focuses on those attributes that have been judged to enable identification of individuals. Attributes that are judged to contain non-identifiable values are not considered, but may be included in datasets that are shared by organisations. Consequently, organisations are at risk of sharing datasets that unintendedly disclose personally identifiable values through these attributes. This would have ethical and legal implications for organisations and privacy implications for individuals whose personally identifiable values are disclosed. In this paper, we formulate the problem of unintended disclosure following anonymisation, describe the necessary steps to address this problem, and discuss some key challenges to applying these steps in practice.

  8. Hybrid coexpression link similarity graph clustering for mining biological modules from multiple gene expression datasets

    PubMed Central

    2014-01-01

    Background Advances in genomic technologies have enabled the accumulation of vast amounts of genomic data, including gene expression data for multiple species under various biological and environmental conditions. Integration of these gene expression datasets is a promising strategy to alleviate the challenges of protein functional annotation and biological module discovery based on a single gene expression dataset, which suffers from spurious coexpression. Results We propose a joint mining algorithm that constructs a weighted hybrid similarity graph whose nodes are the coexpression links. The weight of an edge between two coexpression links in this hybrid graph is a linear combination of the topological similarities and co-appearance similarities of the corresponding two coexpression links. Clustering the weighted hybrid similarity graph yields recurrent coexpression link clusters (modules). Experimental results on human gene expression datasets show that the reported modules are functionally homogeneous, as evidenced by their enrichment with biological process GO terms and KEGG pathways. PMID:25221624
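    The edge weight described above, a linear combination of two similarities between coexpression links, can be sketched as follows. The abstract does not define either similarity, so both definitions here are assumptions chosen for illustration: topological similarity is taken as the Jaccard overlap of the gene neighbourhoods of the two links, and co-appearance similarity as the Jaccard overlap of the datasets in which each link occurs. The mixing parameter `alpha` and all names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_neighbours(links):
    """Map each gene to the set of genes it shares a coexpression link with."""
    neighbours = {}
    for gene_u, gene_v in links:
        neighbours.setdefault(gene_u, set()).add(gene_v)
        neighbours.setdefault(gene_v, set()).add(gene_u)
    return neighbours


def link_neighbourhood(link, neighbours):
    """Genes on the link plus every gene adjacent to either endpoint."""
    gene_u, gene_v = link
    return {gene_u, gene_v} | neighbours.get(gene_u, set()) | neighbours.get(gene_v, set())


def hybrid_weight(link_a, link_b, neighbours, appearances, alpha=0.5):
    """alpha * topological similarity + (1 - alpha) * co-appearance similarity."""
    topological = jaccard(link_neighbourhood(link_a, neighbours),
                          link_neighbourhood(link_b, neighbours))
    co_appearance = jaccard(appearances[link_a], appearances[link_b])
    return alpha * topological + (1.0 - alpha) * co_appearance


# Toy input (hypothetical): two links and the datasets in which each was observed.
APPEARANCES = {
    ("g1", "g2"): {"d1", "d2", "d3"},
    ("g2", "g3"): {"d2", "d3", "d4"},
}

if __name__ == "__main__":
    neighbours = build_neighbours(APPEARANCES)
    print(hybrid_weight(("g1", "g2"), ("g2", "g3"), neighbours, APPEARANCES))
```

    Computing this weight for every pair of links yields the weighted graph that a clustering step would then partition into modules.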

  9. TESTING HIGH-DIMENSIONAL COVARIANCE MATRICES, WITH APPLICATION TO DETECTING SCHIZOPHRENIA RISK GENES

    PubMed Central

    Zhu, Lingxue; Lei, Jing; Devlin, Bernie; Roeder, Kathryn

    2017-01-01

    Scientists routinely compare gene expression levels in cases versus controls in part to determine genes associated with a disease. Similarly, detecting case-control differences in co-expression among genes can be critical to understanding complex human diseases; however statistical methods have been limited by the high dimensional nature of this problem. In this paper, we construct a sparse-Leading-Eigenvalue-Driven (sLED) test for comparing two high-dimensional covariance matrices. By focusing on the spectrum of the differential matrix, sLED provides a novel perspective that accommodates what we assume to be common, namely sparse and weak signals in gene expression data, and it is closely related with Sparse Principal Component Analysis. We prove that sLED achieves full power asymptotically under mild assumptions, and simulation studies verify that it outperforms other existing procedures under many biologically plausible scenarios. Applying sLED to the largest gene-expression dataset obtained from post-mortem brain tissue from Schizophrenia patients and controls, we provide a novel list of genes implicated in Schizophrenia and reveal intriguing patterns in gene co-expression change for Schizophrenia subjects. We also illustrate that sLED can be generalized to compare other gene-gene “relationship” matrices that are of practical interest, such as the weighted adjacency matrices. PMID:29081874

  10. TESTING HIGH-DIMENSIONAL COVARIANCE MATRICES, WITH APPLICATION TO DETECTING SCHIZOPHRENIA RISK GENES.

    PubMed

    Zhu, Lingxue; Lei, Jing; Devlin, Bernie; Roeder, Kathryn

    2017-09-01

    Scientists routinely compare gene expression levels in cases versus controls in part to determine genes associated with a disease. Similarly, detecting case-control differences in co-expression among genes can be critical to understanding complex human diseases; however statistical methods have been limited by the high dimensional nature of this problem. In this paper, we construct a sparse-Leading-Eigenvalue-Driven (sLED) test for comparing two high-dimensional covariance matrices. By focusing on the spectrum of the differential matrix, sLED provides a novel perspective that accommodates what we assume to be common, namely sparse and weak signals in gene expression data, and it is closely related with Sparse Principal Component Analysis. We prove that sLED achieves full power asymptotically under mild assumptions, and simulation studies verify that it outperforms other existing procedures under many biologically plausible scenarios. Applying sLED to the largest gene-expression dataset obtained from post-mortem brain tissue from Schizophrenia patients and controls, we provide a novel list of genes implicated in Schizophrenia and reveal intriguing patterns in gene co-expression change for Schizophrenia subjects. We also illustrate that sLED can be generalized to compare other gene-gene "relationship" matrices that are of practical interest, such as the weighted adjacency matrices.

  11. Determining Cutoff Point of Ensemble Trees Based on Sample Size in Predicting Clinical Dose with DNA Microarray Data.

    PubMed

    Yılmaz Isıkhan, Selen; Karabulut, Erdem; Alpar, Celal Reha

    2016-01-01

    Background/Aim. Evaluating the success of dose prediction based on genetic or clinical data has substantially advanced recently. The aim of this study is to predict various clinical dose values from DNA gene expression datasets using data mining techniques. Materials and Methods. Eleven real gene expression datasets containing dose values were included. First, important genes for dose prediction were selected using iterative sure independence screening. Then, the performances of regression trees (RTs), support vector regression (SVR), RT bagging, SVR bagging, and RT boosting were examined. Results. The results demonstrated that a regression-based feature selection method substantially reduced the number of irrelevant genes from raw datasets. Overall, the best prediction performance in nine of 11 datasets was achieved using SVR; the second most accurate performance was provided using a gradient-boosting machine (GBM). Conclusion. Analysis of various dose values based on microarray gene expression data identified common genes found in our study and the referenced studies. According to our findings, SVR and GBM can be good predictors of dose-gene datasets. Another result of the study was to identify the sample size of n = 25 as a cutoff point for RT bagging to outperform a single RT.

  12. Genomic Models of Short-Term Exposure Accurately Predict Long-Term Chemical Carcinogenicity and Identify Putative Mechanisms of Action

    PubMed Central

    Gusenleitner, Daniel; Auerbach, Scott S.; Melia, Tisha; Gómez, Harold F.; Sherr, David H.; Monti, Stefano

    2014-01-01

    Background Despite an overall decrease in incidence of and mortality from cancer, about 40% of Americans will be diagnosed with the disease in their lifetime, and around 20% will die of it. Current approaches to test carcinogenic chemicals adopt the 2-year rodent bioassay, which is costly and time-consuming. As a result, fewer than 2% of the chemicals on the market have actually been tested. However, evidence accumulated to date suggests that gene expression profiles from model organisms exposed to chemical compounds reflect underlying mechanisms of action, and that these toxicogenomic models could be used in the prediction of chemical carcinogenicity. Results In this study, we used a rat-based microarray dataset from the NTP DrugMatrix Database to test the ability of toxicogenomics to model carcinogenicity. We analyzed 1,221 gene-expression profiles obtained from rats treated with 127 well-characterized compounds, including genotoxic and non-genotoxic carcinogens. We built a classifier that predicts a chemical's carcinogenic potential with an AUC of 0.78, and validated it on an independent dataset from the Japanese Toxicogenomics Project consisting of 2,065 profiles from 72 compounds. Finally, we identified differentially expressed genes associated with chemical carcinogenesis, and developed novel data-driven approaches for the molecular characterization of the response to chemical stressors. Conclusion Here, we validate a toxicogenomic approach to predict carcinogenicity and provide strong evidence that, with a larger set of compounds, we should be able to improve the sensitivity and specificity of the predictions. We found that the prediction of carcinogenicity is tissue-dependent and that the results also confirm and expand upon previous studies implicating DNA damage, the peroxisome proliferator-activated receptor, the aryl hydrocarbon receptor, and regenerative pathology in the response to carcinogen exposure. PMID:25058030

  13. Dataset of proinflammatory cytokine and cytokine receptor gene expression in rainbow trout (Oncorhynchus mykiss) measured using a novel GeXP multiplex, RT-PCR assay

    USDA-ARS's Scientific Manuscript database

    A GeXP multiplex, RT-PCR assay was developed and optimized that simultaneously measures expression of a suite of immune-relevant genes in rainbow trout (Oncorhynchus mykiss), concentrating on tumor necrosis factor and interleukin-1 ligand/receptor systems and acute phase response genes. The dataset ...

  14. SpeCond: a method to detect condition-specific gene expression

    PubMed Central

    2011-01-01

    Transcriptomic studies routinely measure expression levels across numerous conditions. These datasets allow identification of genes that are specifically expressed in a small number of conditions. However, there are currently no statistically robust methods for identifying such genes. Here we present SpeCond, a method to detect condition-specific genes that outperforms alternative approaches. We apply the method to a dataset of 32 human tissues to determine 2,673 specifically expressed genes. An implementation of SpeCond is freely available as a Bioconductor package at http://www.bioconductor.org/packages/release/bioc/html/SpeCond.html. PMID:22008066

  15. RefEx, a reference gene expression dataset as a web tool for the functional analysis of genes.

    PubMed

    Ono, Hiromasa; Ogasawara, Osamu; Okubo, Kosaku; Bono, Hidemasa

    2017-08-29

    Gene expression data are exponentially accumulating; thus, the functional annotation of such sequence data from metadata is urgently required. However, life scientists have difficulty utilizing the available data due to its sheer magnitude and complicated access. We have developed a web tool for browsing reference gene expression pattern of mammalian tissues and cell lines measured using different methods, which should facilitate the reuse of the precious data archived in several public databases. The web tool is called Reference Expression dataset (RefEx), and RefEx allows users to search by the gene name, various types of IDs, chromosomal regions in genetic maps, gene family based on InterPro, gene expression patterns, or biological categories based on Gene Ontology. RefEx also provides information about genes with tissue-specific expression, and the relative gene expression values are shown as choropleth maps on 3D human body images from BodyParts3D. Combined with the newly incorporated Functional Annotation of Mammals (FANTOM) dataset, RefEx provides insight regarding the functional interpretation of unfamiliar genes. RefEx is publicly available at http://refex.dbcls.jp/.

  16. RefEx, a reference gene expression dataset as a web tool for the functional analysis of genes

    PubMed Central

    Ono, Hiromasa; Ogasawara, Osamu; Okubo, Kosaku; Bono, Hidemasa

    2017-01-01

    Gene expression data are exponentially accumulating; thus, the functional annotation of such sequence data from metadata is urgently required. However, life scientists have difficulty utilizing the available data due to its sheer magnitude and complicated access. We have developed a web tool for browsing reference gene expression pattern of mammalian tissues and cell lines measured using different methods, which should facilitate the reuse of the precious data archived in several public databases. The web tool is called Reference Expression dataset (RefEx), and RefEx allows users to search by the gene name, various types of IDs, chromosomal regions in genetic maps, gene family based on InterPro, gene expression patterns, or biological categories based on Gene Ontology. RefEx also provides information about genes with tissue-specific expression, and the relative gene expression values are shown as choropleth maps on 3D human body images from BodyParts3D. Combined with the newly incorporated Functional Annotation of Mammals (FANTOM) dataset, RefEx provides insight regarding the functional interpretation of unfamiliar genes. RefEx is publicly available at http://refex.dbcls.jp/. PMID:28850115

  17. Cis-eQTL-based trans-ethnic meta-analysis reveals novel genes associated with breast cancer risk

    PubMed Central

    Tai, Caroline G.; Passarelli, Michael N.; Hu, Donglei; Huntsman, Scott; Zaitlen, Noah; Ziv, Elad; Witte, John S.

    2017-01-01

    Breast cancer is the most common solid organ malignancy and the most frequent cause of cancer death among women worldwide. Previous research has yielded insights into its genetic etiology, but there remains a gap in the understanding of genetic factors that contribute to risk, and particularly in the biological mechanisms by which genetic variation modulates risk. The National Cancer Institute’s “Up for a Challenge” (U4C) competition provided an opportunity to further elucidate the genetic basis of the disease. Our group leveraged the seven datasets made available by the U4C organizers and data from the publicly available UK Biobank cohort to examine associations between imputed gene expression and breast cancer risk. In particular, we used reference datasets describing the breast tissue and whole blood transcriptomes to impute expression levels in breast cancer cases and controls. In trans-ethnic meta-analyses of U4C and UK Biobank data, we found significant associations between breast cancer risk and the expression of RCCD1 (joint p-value: 3.6 × 10^-6) and DHODH (p-value: 7.1 × 10^-6) in breast tissue, as well as a suggestive association for ANKLE1 (p-value: 9.3 × 10^-5). Expression of RCCD1 in whole blood was also suggestively associated with disease risk (p-value: 1.2 × 10^-5), as were expression of ACAP1 (p-value: 1.9 × 10^-5) and LRRC25 (p-value: 5.2 × 10^-5). While genome-wide association studies (GWAS) have implicated RCCD1 and ANKLE1 in breast cancer risk, they have not identified the remaining three genes. Among the genetic variants that contributed to the predicted expression of the five genes, we found 23 nominally (p-value < 0.05) associated with breast cancer risk, among which 15 are not in high linkage disequilibrium with risk variants previously identified by GWAS. In summary, we used a transcriptome-based approach to investigate the genetic underpinnings of breast carcinogenesis. This approach provided an avenue for deciphering the functional relevance of genes and genetic variants involved in breast cancer. PMID:28362817

  18. Uncovering Hidden Layers of Cell Cycle Regulation through Integrative Multi-omic Analysis

    PubMed Central

    Aviner, Ranen; Shenoy, Anjana; Elroy-Stein, Orna; Geiger, Tamar

    2015-01-01

    Studying the complex relationship between transcription, translation and protein degradation is essential to our understanding of biological processes in health and disease. The limited correlations observed between mRNA and protein abundance suggest pervasive regulation of post-transcriptional steps and support the importance of profiling mRNA levels in parallel to protein synthesis and degradation rates. In this work, we applied an integrative multi-omic approach to study gene expression along the mammalian cell cycle through side-by-side analysis of mRNA, translation and protein levels. Our analysis sheds new light on the significant contribution of both protein synthesis and degradation to the variance in protein expression. Furthermore, we find that translation regulation plays an important role at S-phase, while progression through mitosis is predominantly controlled by changes in either mRNA levels or protein stability. Specific molecular functions are found to be co-regulated and share similar patterns of mRNA, translation and protein expression along the cell cycle. Notably, these include genes and entire pathways not previously implicated in cell cycle progression, demonstrating the potential of this approach to identify novel regulatory mechanisms beyond those revealed by traditional expression profiling. Through this three-level analysis, we characterize different mechanisms of gene expression, discover new cycling gene products and highlight the importance and utility of combining datasets generated using different techniques that monitor distinct steps of gene expression. PMID:26439921

  19. Enhancing biological relevance of a weighted gene co-expression network for functional module identification.

    PubMed

    Prom-On, Santitham; Chanthaphan, Atthawut; Chan, Jonathan Hoyin; Meechai, Asawin

    2011-02-01

    Relationships among gene expression levels may be associated with the mechanisms of the disease. While identifying a direct association such as a difference in expression levels between case and control groups links genes to disease mechanisms, uncovering an indirect association in the form of a network structure may help reveal the underlying functional module associated with the disease under scrutiny. This paper presents a method to improve the biological relevance in functional module identification from the gene expression microarray data by enhancing the structure of a weighted gene co-expression network using minimum spanning tree. The enhanced network, which is called a backbone network, contains only the essential structural information to represent the gene co-expression network. The entire backbone network is decoupled into a number of coherent sub-networks, and then the functional modules are reconstructed from these sub-networks to ensure minimum redundancy. The method was tested with a simulated gene expression dataset and case-control expression datasets of autism spectrum disorder and colorectal cancer studies. The results indicate that the proposed method can accurately identify clusters in the simulated dataset, and the functional modules of the backbone network are more biologically relevant than those obtained from the original approach.
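
    The backbone idea in the record above can be illustrated with a minimal sketch: build a minimum spanning tree over a weighted co-expression graph with Kruskal's algorithm. This is not the authors' implementation. The distance 1 - |r|, the function names, and the correlation values are all assumptions made for illustration; the abstract does not state how edge weights are defined.

```python
def minimum_spanning_tree(n_nodes, edges):
    """Kruskal's algorithm. edges is a list of (weight, u, v) tuples."""
    parent = list(range(n_nodes))

    def find(x):
        # Find the component root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for weight, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two components, so keep it
            parent[root_u] = root_v
            tree.append((weight, u, v))
    return tree


def backbone_edges(n_genes, correlations):
    """correlations maps (u, v) gene-index pairs to a correlation r.

    Assumption: strongly co-expressed genes are 'close', so the edge
    weight is taken as 1 - |r|.
    """
    edges = [(1.0 - abs(r), u, v) for (u, v), r in correlations.items()]
    return [(u, v) for _, u, v in minimum_spanning_tree(n_genes, edges)]


# Hypothetical correlations among four genes (made-up values).
example = {(0, 1): 0.9, (0, 2): 0.2, (0, 3): 0.1,
           (1, 2): 0.8, (1, 3): 0.3, (2, 3): 0.7}
print(backbone_edges(4, example))
```

    A tree over n genes keeps n - 1 edges, which is why such a structure can serve as a compact summary of a dense co-expression network.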

  20. Methods to increase reproducibility in differential gene expression via meta-analysis

    PubMed Central

    Sweeney, Timothy E.; Haynes, Winston A.; Vallania, Francesco; Ioannidis, John P.; Khatri, Purvesh

    2017-01-01

    Findings from clinical and biological studies are often not reproducible when tested in independent cohorts. Due to the testing of a large number of hypotheses and relatively small sample sizes, results from whole-genome expression studies in particular are often not reproducible. Compared to single-study analysis, gene expression meta-analysis can improve reproducibility by integrating data from multiple studies. However, there are multiple choices in designing and carrying out a meta-analysis. Yet, clear guidelines on best practices are scarce. Here, we hypothesized that studying subsets of very large meta-analyses would allow for systematic identification of best practices to improve reproducibility. We therefore constructed three very large gene expression meta-analyses from clinical samples, and then examined meta-analyses of subsets of the datasets (all combinations of datasets with up to N/2 samples and K/2 datasets) compared to a ‘silver standard’ of differentially expressed genes found in the entire cohort. We tested three random-effects meta-analysis models using this procedure. We showed relatively greater reproducibility with more-stringent effect size thresholds with relaxed significance thresholds; relatively lower reproducibility when imposing extraneous constraints on residual heterogeneity; and an underestimation of actual false positive rate by Benjamini–Hochberg correction. In addition, multivariate regression showed that the accuracy of a meta-analysis increased significantly with more included datasets even when controlling for sample size. PMID:27634930
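
    For readers unfamiliar with the Benjamini–Hochberg correction named in the record above, the standard step-up adjustment can be sketched as follows. This is a generic textbook version, not code from the cited study; the function name and the example p-values are made up for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up p-values for four hypothetical genes.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
significant = [q <= 0.05 for q in adjusted]
print(adjusted, significant)
```

    Genes whose adjusted value falls at or below the chosen false discovery rate (0.05 here, an arbitrary choice) would be called significant.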

  1. Molecular Subtypes of Glioblastoma Are Relevant to Lower Grade Glioma

    PubMed Central

    Sloan, Andrew E.; Chen, Yanwen; Brat, Daniel J.; O’Neill, Brian Patrick; de Groot, John; Yust-Katz, Shlomit; Yung, Wai-Kwan Alfred; Cohen, Mark L.; Aldape, Kenneth D.; Rosenfeld, Steven; Verhaak, Roeland G. W.; Barnholtz-Sloan, Jill S.

    2014-01-01

    Background Gliomas are the most common primary malignant brain tumors in adults with great heterogeneity in histopathology and clinical course. The intent was to evaluate the relevance of known glioblastoma (GBM) expression and methylation based subtypes to grade II and III gliomas (i.e., lower grade gliomas). Methods Gene expression array, single nucleotide polymorphism (SNP) array and clinical data were obtained for 228 GBMs and 176 grade II/III gliomas (GII/III) from the publicly available Rembrandt dataset. Two additional datasets with IDH1 mutation status were utilized as validation datasets (one publicly available dataset and one newly generated dataset from MD Anderson). Unsupervised clustering was performed and compared to gene expression subtypes assigned using the Verhaak et al. 840-gene classifier. The glioma-CpG Island Methylator Phenotype (G-CIMP) was assigned using prediction models by Fine et al. Results Unsupervised clustering by gene expression aligned with the Verhaak 840-gene subtype group assignments. GII/IIIs were preferentially assigned to the proneural subtype with IDH1 mutation and G-CIMP. GBMs were evenly distributed among the four subtypes. Proneural, IDH1 mutant, G-CIMP GII/IIIs had significantly better survival than other molecular subtypes. Only 6% of GBMs were proneural and had either IDH1 mutation or G-CIMP but these tumors had significantly better survival than other GBMs. Copy number changes in chromosomes 1p and 19q were associated with GII/IIIs, while these changes in CDKN2A, PTEN and EGFR were more commonly associated with GBMs. Conclusions GBM gene-expression and methylation based subtypes are relevant for GII/IIIs and associate with overall survival differences. A better understanding of the association between these subtypes and GII/IIIs could further knowledge regarding prognosis and mechanisms of glioma progression. PMID:24614622

  2. The α3β4* nicotinic ACh receptor subtype mediates physical dependence to morphine: mouse and human studies

    PubMed Central

    Muldoon, P P; Jackson, K J; Perez, E; Harenza, J L; Molas, S; Rais, B; Anwar, H; Zaveri, N T; Maldonado, R; Maskos, U; McIntosh, J M; Dierssen, M; Miles, M F; Chen, X; De Biasi, M; Damaj, M I

    2014-01-01

    BACKGROUND AND PURPOSE Recent data have indicated that α3β4* neuronal nicotinic (n) ACh receptors may play a role in morphine dependence. Here we investigated if nACh receptors modulate morphine physical withdrawal. EXPERIMENTAL APPROACHES To assess the role of α3β4* nACh receptors in morphine withdrawal, we used a genetic correlation approach using publicly available datasets within the GeneNetwork web resource, genetic knockout and pharmacological tools. Male and female European-American (n = 2772) and African-American (n = 1309) subjects from the Study of Addiction: Genetics and Environment dataset were assessed for possible associations of polymorphisms in the 15q25 gene cluster and opioid dependence. KEY RESULTS BXD recombinant mouse lines demonstrated an increased expression of α3, β4 and α5 nACh receptor mRNA in the forebrain and midbrain, which significantly correlated with increased defecation in mice undergoing morphine withdrawal. Mice overexpressing the gene cluster CHRNA5/A3/B4 exhibited increased somatic signs of withdrawal. Furthermore, α5 and β4 nACh receptor knockout mice expressed decreased somatic withdrawal signs compared with their wild-type counterparts. Moreover, selective α3β4* nACh receptor antagonists, α-conotoxin AuIB and AT-1001, attenuated somatic signs of morphine withdrawal in a dose-related manner. In addition, two human datasets revealed a protective role for variants in the CHRNA3 gene, which codes for the α3 nACh receptor subunit, in opioid dependence and withdrawal. In contrast, we found that the α4β2* nACh receptor subtype is not involved in morphine somatic withdrawal signs. CONCLUSION AND IMPLICATIONS Overall, our findings suggest an important role for the α3β4* nACh receptor subtype in morphine physical dependence. PMID:24750073

  3. Stereotypes Possess Heterogeneous Directionality: A Theoretical and Empirical Exploration of Stereotype Structure and Content

    PubMed Central

    Cox, William T. L.; Devine, Patricia G.

    2015-01-01

    We advance a theory-driven approach to stereotype structure, informed by connectionist theories of cognition. Whereas traditional models define or tacitly assume that stereotypes possess inherently Group → Attribute activation directionality (e.g., Black activates criminal), our model predicts heterogeneous stereotype directionality. Alongside the classically studied Group → Attribute stereotypes, some stereotypes should be bidirectional (i.e., Group ⇄ Attribute) and others should have Attribute → Group unidirectionality (e.g., fashionable activates gay). We tested this prediction in several large-scale studies with human participants (NCombined = 4,817), assessing stereotypic inferences among various groups and attributes. Supporting predictions, we found heterogeneous directionality both among the stereotype links related to a given social group and also between the links of different social groups. These efforts yield rich datasets that map the networks of stereotype links related to several social groups. We make these datasets publicly available, enabling other researchers to explore a number of questions related to stereotypes and stereotyping. Stereotype directionality is an understudied feature of stereotypes and stereotyping with widespread implications for the development, measurement, maintenance, expression, and change of stereotypes, stereotyping, prejudice, and discrimination. PMID:25811181

  4. Stereotypes possess heterogeneous directionality: a theoretical and empirical exploration of stereotype structure and content.

    PubMed

    Cox, William T L; Devine, Patricia G

    2015-01-01

    We advance a theory-driven approach to stereotype structure, informed by connectionist theories of cognition. Whereas traditional models define or tacitly assume that stereotypes possess inherently Group → Attribute activation directionality (e.g., Black activates criminal), our model predicts heterogeneous stereotype directionality. Alongside the classically studied Group → Attribute stereotypes, some stereotypes should be bidirectional (i.e., Group ⇄ Attribute) and others should have Attribute → Group unidirectionality (e.g., fashionable activates gay). We tested this prediction in several large-scale studies with human participants (NCombined = 4,817), assessing stereotypic inferences among various groups and attributes. Supporting predictions, we found heterogeneous directionality both among the stereotype links related to a given social group and also between the links of different social groups. These efforts yield rich datasets that map the networks of stereotype links related to several social groups. We make these datasets publicly available, enabling other researchers to explore a number of questions related to stereotypes and stereotyping. Stereotype directionality is an understudied feature of stereotypes and stereotyping with widespread implications for the development, measurement, maintenance, expression, and change of stereotypes, stereotyping, prejudice, and discrimination.

  5. MiSTIC, an integrated platform for the analysis of heterogeneity in large tumour transcriptome datasets

    PubMed Central

    Sargeant, Tobias; Laperrière, David; Ismail, Houssam; Boucher, Geneviève; Rozendaal, Marieke; Lavallée, Vincent-Philippe; Ashton-Beaucage, Dariel; Wilhelm, Brian; Hébert, Josée; Hilton, Douglas J.

    2017-01-01

    Abstract Genome-wide transcriptome profiling has enabled non-supervised classification of tumours, revealing different sub-groups characterized by specific gene expression features. However, the biological significance of these subtypes remains for the most part unclear. We describe herein an interactive platform, Minimum Spanning Trees Inferred Clustering (MiSTIC), that integrates the direct visualization and comparison of the gene correlation structure between datasets, the analysis of the molecular causes underlying co-variations in gene expression in cancer samples, and the clinical annotation of tumour sets defined by the combined expression of selected biomarkers. We have used MiSTIC to highlight the roles of specific transcription factors in breast cancer subtype specification, to compare the aspects of tumour heterogeneity targeted by different prognostic signatures, and to highlight biomarker interactions in AML. A version of MiSTIC preloaded with datasets described herein can be accessed through a public web server (http://mistic.iric.ca); in addition, the MiSTIC software package can be obtained (github.com/iric-soft/MiSTIC) for local use with personalized datasets. PMID:28472340

  6. PhenomeExpress: a refined network analysis of expression datasets by inclusion of known disease phenotypes.

    PubMed

    Soul, Jamie; Hardingham, Timothy E; Boot-Handford, Raymond P; Schwartz, Jean-Marc

    2015-01-29

    We describe a new method, PhenomeExpress, for the analysis of transcriptomic datasets to identify pathogenic disease mechanisms. Our analysis method includes input from both protein-protein interaction and phenotype similarity networks. This introduces valuable information from disease relevant phenotypes, which aids the identification of sub-networks that are significantly enriched in differentially expressed genes and are related to the disease relevant phenotypes. This contrasts with many active sub-network detection methods, which rely solely on protein-protein interaction networks derived from compounded data of many unrelated biological conditions and which are therefore not specific to the context of the experiment. PhenomeExpress thus exploits readily available animal model and human disease phenotype information. It combines this prior evidence of disease phenotypes with the experimentally derived disease data sets to provide a more targeted analysis. Two case studies, in subchondral bone in osteoarthritis and in Pax5 in acute lymphoblastic leukaemia, demonstrate that PhenomeExpress identifies core disease pathways in both mouse and human disease expression datasets derived from different technologies. We also validate the approach by comparison to state-of-the-art active sub-network detection methods, which reveals how it may enhance the detection of molecular phenotypes and provide a more detailed context to those previously identified as possible candidates.

  7. Trefoil factor 3 is required for differentiation of thyroid follicular cells and acts as a context-dependent tumor suppressor.

    PubMed

    Abols, A; Ducena, K; Andrejeva, D; Sadovska, L; Zandberga, E; Vilmanis, J; Narbuts, Z; Tars, J; Eglitis, J; Pirags, V; Line, A

    2015-01-01

    Trefoil factor 3 (TFF3) is overexpressed in a variety of solid epithelial cancers, where it has been shown to promote migration, invasion, proliferation, survival and angiogenesis. On the contrary, in the majority of thyroid tumors, it is downregulated, yet its role in the development of thyroid cancer remains unknown. Here we show that TFF3 exhibits strong cytoplasmic staining of normal thyroid follicular cells and colloid and the staining is increased in hyperfunctioning thyroid nodules, while it is decreased in all thyroid cancers of follicular cell origin. By meta-analysis of gene expression datasets, we found that in the thyroid cancer, conversely to the breast cancer, the expression of TFF3 mRNA was downregulated by estrogen signaling and confirmed this by treating thyroid cancer cells with estradiol. Forced expression of TFF3 in anaplastic thyroid cancer cells resulted in decreased cell proliferation, clonal spheroid formation and entry into the S phase. Furthermore, it induced acquisition of epithelial-like cell morphology and expression of the differentiation markers of thyroid follicular cells and transcription factors implicated in the thyroid morphogenesis and function. Taken together, this study provides the first evidence that TFF3 may act as a tumor suppressor or an oncogene depending on the cellular context.

  8. Heart morphogenesis gene regulatory networks revealed by temporal expression analysis.

    PubMed

    Hill, Jonathon T; Demarest, Bradley; Gorsi, Bushra; Smith, Megan; Yost, H Joseph

    2017-10-01

    During embryogenesis the heart forms as a linear tube that then undergoes multiple simultaneous morphogenetic events to obtain its mature shape. To understand the gene regulatory networks (GRNs) driving this phase of heart development, during which many congenital heart disease malformations likely arise, we conducted an RNA-seq timecourse in zebrafish from 30 hpf to 72 hpf and identified 5861 genes with altered expression. We clustered the genes by temporal expression pattern, identified transcription factor binding motifs enriched in each cluster, and generated a model GRN for the major gene batteries in heart morphogenesis. This approach predicted hundreds of regulatory interactions and found batteries enriched in specific cell and tissue types, indicating that the approach can be used to narrow the search for novel genetic markers and regulatory interactions. Subsequent analyses confirmed the GRN using two mutants, Tbx5 and nkx2-5, and identified sets of duplicated zebrafish genes that do not show temporal subfunctionalization. This dataset provides an essential resource for future studies on the genetic/epigenetic pathways implicated in congenital heart defects and the mechanisms of cardiac transcriptional regulation. © 2017. Published by The Company of Biologists Ltd.

  9. Prediction of Human Disease Genes by Human-Mouse Conserved Coexpression Analysis

    PubMed Central

    Grassi, Elena; Damasco, Christian; Silengo, Lorenzo; Oti, Martin; Provero, Paolo; Di Cunto, Ferdinando

    2008-01-01

    Background Even in the post-genomic era, the identification of candidate genes within loci associated with human genetic diseases is a very demanding task, because the critical region may typically contain hundreds of positional candidates. Since genes implicated in similar phenotypes tend to share very similar expression profiles, high throughput gene expression data may represent a very important resource to identify the best candidates for sequencing. However, so far, gene coexpression has not been used very successfully to prioritize positional candidates. Methodology/Principal Findings We show that it is possible to reliably identify disease-relevant relationships among genes from massive microarray datasets by concentrating only on genes sharing similar expression profiles in both human and mouse. Moreover, we show systematically that the integration of human-mouse conserved coexpression with a phenotype similarity map allows the efficient identification of disease genes in large genomic regions. Finally, using this approach on 850 OMIM loci characterized by an unknown molecular basis, we propose high-probability candidates for 81 genetic diseases. Conclusion Our results demonstrate that conserved coexpression, even at the human-mouse phylogenetic distance, represents a very strong criterion to predict disease-relevant relationships among human genes. PMID:18369433

  10. Isotherm ranking and selection using thirteen literature datasets involving hydrophobic organic compounds.

    PubMed

    Matott, L Shawn; Jiang, Zhengzheng; Rabideau, Alan J; Allen-King, Richelle M

    2015-01-01

    Numerous isotherm expressions have been developed for describing sorption of hydrophobic organic compounds (HOCs), including "dual-mode" approaches that combine nonlinear behavior with a linear partitioning component. Choosing among these alternative expressions for describing a given dataset is an important task that can significantly influence subsequent transport modeling and/or mechanistic interpretation. In this study, a series of numerical experiments were undertaken to identify "best-in-class" isotherms by refitting 10 alternative models to a suite of 13 previously published literature datasets. The corrected Akaike Information Criterion (AICc) was used for ranking these alternative fits and distinguishing between plausible and implausible isotherms for each dataset. The occurrence of multiple plausible isotherms was inversely correlated with dataset "richness", such that datasets with fewer observations and/or a narrow range of aqueous concentrations resulted in a greater number of plausible isotherms. Overall, only the Polanyi-partition dual-mode isotherm was classified as "plausible" across all 13 of the considered datasets, indicating substantial statistical support consistent with current advances in sorption theory. However, these findings are predicated on the use of the AICc measure as an unbiased ranking metric and the adoption of a subjective, but defensible, threshold for separating plausible and implausible isotherms. Copyright © 2015 Elsevier B.V. All rights reserved.
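
    The AICc ranking described in the record above can be sketched as follows, assuming each isotherm was fitted by least squares with Gaussian errors, so that AIC can be computed from the residual sum of squares up to an additive constant. The model names, residual sums of squares, and parameter counts below are made up for illustration and are not taken from the study; whether k also counts the error variance is a convention that varies between authors.

```python
import math


def aicc_from_rss(rss, n, k):
    """AICc for a least-squares fit with n observations and k parameters.

    AIC  = n * ln(RSS / n) + 2k          (additive constants dropped)
    AICc = AIC + 2k(k + 1) / (n - k - 1)
    """
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    aic = n * math.log(rss / n) + 2 * k
    return aic + (2 * k * (k + 1)) / (n - k - 1)


def rank_models(fits, n):
    """fits maps a model name to (rss, k).

    Returns (name, delta_aicc) pairs sorted from best to worst, where
    delta_aicc is the gap to the lowest AICc among the candidates.
    """
    scores = {name: aicc_from_rss(rss, n, k) for name, (rss, k) in fits.items()}
    best = min(scores.values())
    return sorted(((name, score - best) for name, score in scores.items()),
                  key=lambda pair: pair[1])


# Hypothetical fits of three isotherm models to one 15-point dataset.
fits = {"linear": (12.0, 2), "freundlich": (8.0, 2), "dual_mode": (7.5, 4)}
for name, delta in rank_models(fits, n=15):
    print(f"{name}: delta AICc = {delta:.2f}")
```

    A threshold on the AICc difference then separates plausible from implausible candidates; the abstract describes its own threshold as subjective, so none is hard-coded here.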

  11. Characterization of distinct classes of differential gene expression in osteoblast cultures from non-syndromic craniosynostosis bone.

    PubMed

    Rojas-Peña, Monica L; Olivares-Navarrete, Rene; Hyzy, Sharon; Arafat, Dalia; Schwartz, Zvi; Boyan, Barbara D; Williams, Joseph; Gibson, Greg

    2014-01-01

    Craniosynostosis, the premature fusion of one or more skull sutures, occurs in approximately 1 in 2500 infants, with the majority of cases non-syndromic and of unknown etiology. Two common reasons proposed for premature suture fusion are abnormal compression forces on the skull and rare genetic abnormalities. Our goal was to evaluate whether different sub-classes of disease can be identified based on total gene expression profiles. RNA-Seq data were obtained from 31 human osteoblast cultures derived from bone biopsy samples collected between 2009 and 2011, representing 23 craniosynostosis fusions and 8 normal cranial bones or long bones. No differentiation between regions of the skull was detected, but variance component analysis of gene expression patterns nevertheless supports transcriptome-based classification of craniosynostosis. Cluster analysis showed 4 distinct groups of samples; 1 predominantly normal and 3 craniosynostosis subtypes. Similar constellations of sub-types were also observed upon re-analysis of a similar dataset of 199 calvarial osteoblast cultures. Annotation of gene function of differentially expressed transcripts strongly implicates physiological differences with respect to cell cycle and cell death, stromal cell differentiation, extracellular matrix (ECM) components, and ribosomal activity. Based on these results, we propose non-syndromic craniosynostosis cases can be classified by differences in their gene expression patterns and that these may provide targets for future clinical intervention.

  12. Characterization of Distinct Classes of Differential Gene Expression in Osteoblast Cultures from Non-Syndromic Craniosynostosis Bone

    PubMed Central

    Rojas-Peña, Monica L.; Olivares-Navarrete, Rene; Hyzy, Sharon; Arafat, Dalia; Schwartz, Zvi; Boyan, Barbara D.; Williams, Joseph; Gibson, Greg

    2014-01-01

    Craniosynostosis, the premature fusion of one or more skull sutures, occurs in approximately 1 in 2500 infants, with the majority of cases non-syndromic and of unknown etiology. Two common reasons proposed for premature suture fusion are abnormal compression forces on the skull and rare genetic abnormalities. Our goal was to evaluate whether different sub-classes of disease can be identified based on total gene expression profiles. RNA-Seq data were obtained from 31 human osteoblast cultures derived from bone biopsy samples collected between 2009 and 2011, representing 23 craniosynostosis fusions and 8 normal cranial bones or long bones. No differentiation between regions of the skull was detected, but variance component analysis of gene expression patterns nevertheless supports transcriptome-based classification of craniosynostosis. Cluster analysis showed 4 distinct groups of samples; 1 predominantly normal and 3 craniosynostosis subtypes. Similar constellations of sub-types were also observed upon re-analysis of a similar dataset of 199 calvarial osteoblast cultures. Annotation of gene function of differentially expressed transcripts strongly implicates physiological differences with respect to cell cycle and cell death, stromal cell differentiation, extracellular matrix (ECM) components, and ribosomal activity. Based on these results, we propose non-syndromic craniosynostosis cases can be classified by differences in their gene expression patterns and that these may provide targets for future clinical intervention. PMID:25184005

  13. Gasdermin-B promotes invasion and metastasis in breast cancer cells.

    PubMed

    Hergueta-Redondo, Marta; Sarrió, David; Molina-Crespo, Ángela; Megias, Diego; Mota, Alba; Rojo-Sebastian, Alejandro; García-Sanz, Pablo; Morales, Saleta; Abril, Sandra; Cano, Amparo; Peinado, Héctor; Moreno-Bueno, Gema

    2014-01-01

    Gasdermin B (GSDMB) belongs to the Gasdermin protein family that comprises four members (GSDMA-D). Gasdermin B expression has been detected in some tumor types such as hepatocarcinomas, gastric and cervical cancers, and its over-expression has been related to tumor progression. At least four splicing isoforms of GSDMB have been identified, which may play differential roles in cancer. However, the implication of GSDMB in carcinogenesis and tumor progression is not well understood. Here, we uncover for the first time the functional implication of GSDMB in breast cancer. Our data show that high levels of GSDMB expression are correlated with reduced survival and increased metastasis in breast cancer patients included in an expression dataset (>1,000 cases). We demonstrate that GSDMB is upregulated in breast carcinomas compared to normal breast tissue, with isoform 2 (GSDMB-2) being the most differentially expressed. In order to evaluate the functional role of GSDMB in breast cancer, two GSDMB isoforms were studied (GSDMB-1 and GSDMB-2). The overexpression of both isoforms in the MCF7 breast carcinoma cell line promotes cell motility and invasion, while GSDMB silencing in HCC1954 breast carcinoma cells decreases the migratory and invasive phenotype. Importantly, we demonstrate that both isoforms have a differential role in the activation of Rac-1 and Cdc-42 Rho-GTPases. Moreover, our data support that GSDMB-2 induces a pro-tumorigenic and pro-metastatic behavior in mouse xenograft models as compared to GSDMB-1. Finally, we observed that although both GSDMB isoforms interact in vitro with the chaperone Hsp90, only the GSDMB-2 isoform relies on this chaperone for its stability. Taken together, our results provide, for the first time, evidence that GSDMB-2 induces invasion, tumor progression and metastasis in MCF7 cells and that GSDMB can be considered as a new potential prognostic marker in breast cancer.

  14. Gasdermin-B Promotes Invasion and Metastasis in Breast Cancer Cells

    PubMed Central

    Hergueta-Redondo, Marta; Sarrió, David; Molina-Crespo, Ángela; Megias, Diego; Mota, Alba; Rojo-Sebastian, Alejandro; García-Sanz, Pablo; Morales, Saleta; Abril, Sandra; Cano, Amparo; Peinado, Héctor; Moreno-Bueno, Gema

    2014-01-01

    Gasdermin B (GSDMB) belongs to the Gasdermin protein family that comprises four members (GSDMA-D). Gasdermin B expression has been detected in some tumor types such as hepatocarcinomas, gastric and cervical cancers, and its over-expression has been related to tumor progression. At least four splicing isoforms of GSDMB have been identified, which may play differential roles in cancer. However, the implication of GSDMB in carcinogenesis and tumor progression is not well understood. Here, we uncover for the first time the functional implication of GSDMB in breast cancer. Our data show that high levels of GSDMB expression are correlated with reduced survival and increased metastasis in breast cancer patients included in an expression dataset (>1,000 cases). We demonstrate that GSDMB is upregulated in breast carcinomas compared to normal breast tissue, with isoform 2 (GSDMB-2) being the most differentially expressed. In order to evaluate the functional role of GSDMB in breast cancer, two GSDMB isoforms were studied (GSDMB-1 and GSDMB-2). The overexpression of both isoforms in the MCF7 breast carcinoma cell line promotes cell motility and invasion, while GSDMB silencing in HCC1954 breast carcinoma cells decreases the migratory and invasive phenotype. Importantly, we demonstrate that both isoforms have a differential role in the activation of Rac-1 and Cdc-42 Rho-GTPases. Moreover, our data support that GSDMB-2 induces a pro-tumorigenic and pro-metastatic behavior in mouse xenograft models as compared to GSDMB-1. Finally, we observed that although both GSDMB isoforms interact in vitro with the chaperone Hsp90, only the GSDMB-2 isoform relies on this chaperone for its stability. Taken together, our results provide, for the first time, evidence that GSDMB-2 induces invasion, tumor progression and metastasis in MCF7 cells and that GSDMB can be considered as a new potential prognostic marker in breast cancer. PMID:24675552

  15. Integrative Analysis of GWASs, Human Protein Interaction, and Gene Expression Identified Gene Modules Associated With BMDs

    PubMed Central

    He, Hao; Zhang, Lei; Li, Jian; Wang, Yu-Ping; Zhang, Ji-Gang; Shen, Jie; Guo, Yan-Fang

    2014-01-01

    Context: To date, few systems genetics studies in the bone field have been performed. We designed our study from a systems-level perspective by integrating genome-wide association studies (GWASs), human protein-protein interaction (PPI) network, and gene expression to identify gene modules contributing to osteoporosis risk. Methods: First we searched for modules significantly enriched with bone mineral density (BMD)-associated genes in human PPI network by using 2 large meta-analysis GWAS datasets through a dense module search algorithm. One included 7 individual GWAS samples (Meta7). The other was from the Genetic Factors for Osteoporosis Consortium (GEFOS2). One was assigned as a discovery dataset and the other as an evaluation dataset, and vice versa. Results: In total, 42 modules and 129 modules were identified significantly in both Meta7 and GEFOS2 datasets for femoral neck and spine BMD, respectively. There were 3340 modules identified for hip BMD only in Meta7. As candidate modules, they were assessed for the biological relevance to BMD by gene set enrichment analysis in 2 expression profiles generated from circulating monocytes in subjects with low versus high BMD values. Interestingly, there were 2 modules significantly enriched in monocytes from the low BMD group in both gene expression datasets (nominal P value <.05). Two modules had 16 nonredundant genes. Functional enrichment analysis revealed that both modules were enriched for genes involved in Wnt receptor signaling and osteoblast differentiation. Conclusion: We highlighted 2 modules and novel genes playing important roles in the regulation of bone mass, providing important clues for therapeutic approaches for osteoporosis. PMID:25119315

  16. Harnessing Diversity towards the Reconstructing of Large Scale Gene Regulatory Networks

    PubMed Central

    Yamanaka, Ryota; Kitano, Hiroaki

    2013-01-01

    Elucidating gene regulatory networks (GRNs) from large scale experimental data remains a central challenge in systems biology. Recently, numerous techniques, particularly consensus driven approaches combining different algorithms, have become a potentially promising strategy to infer accurate GRNs. Here, we develop a novel consensus inference algorithm, TopkNet, that can integrate multiple algorithms to infer GRNs. Comprehensive performance benchmarking on a cloud computing framework demonstrated that (i) a simple strategy to combine many algorithms does not always lead to performance improvement compared to the cost of consensus and (ii) TopkNet integrating only high-performance algorithms provides significant performance improvement compared to the best individual algorithms and community prediction. These results suggest that a priori determination of high-performance algorithms is key to reconstructing an unknown regulatory network. Similarity among gene-expression datasets can be useful to determine potential optimal algorithms for reconstruction of unknown regulatory networks, i.e., if expression data associated with a known regulatory network are similar to those associated with an unknown regulatory network, optimal algorithms determined for the known regulatory network can be repurposed to infer the unknown regulatory network. Based on this observation, we developed a quantitative measure of similarity among gene-expression datasets and demonstrated that, if similarity between the two expression datasets is high, TopkNet integrating algorithms that are optimal for the known dataset performs well on the unknown dataset. The consensus framework, TopkNet, together with the similarity measure proposed in this study provides a powerful strategy towards harnessing the wisdom of the crowds in reconstruction of unknown regulatory networks. PMID:24278007

  17. MicroRNA array normalization: an evaluation using a randomized dataset as the benchmark.

    PubMed

    Qin, Li-Xuan; Zhou, Qin

    2014-01-01

    MicroRNA arrays possess a number of unique data features that challenge the assumption key to many normalization methods. We assessed the performance of existing normalization methods using two microRNA array datasets derived from the same set of tumor samples: one dataset was generated using a blocked randomization design when assigning arrays to samples and hence was free of confounding array effects; the second dataset was generated without blocking or randomization and exhibited array effects. The randomized dataset was assessed for differential expression between two tumor groups and treated as the benchmark. The non-randomized dataset was assessed for differential expression after normalization and compared against the benchmark. Normalization improved the true positive rate significantly in the non-randomized data but still possessed a false discovery rate as high as 50%. Adding a batch adjustment step before normalization further reduced the number of false positive markers while maintaining a similar number of true positive markers, which resulted in a false discovery rate of 32% to 48%, depending on the specific normalization method. We concluded the paper with some insights on possible causes of false discoveries to shed light on how to improve normalization for microRNA arrays.

  18. MicroRNA Array Normalization: An Evaluation Using a Randomized Dataset as the Benchmark

    PubMed Central

    Qin, Li-Xuan; Zhou, Qin

    2014-01-01

    MicroRNA arrays possess a number of unique data features that challenge the assumption key to many normalization methods. We assessed the performance of existing normalization methods using two microRNA array datasets derived from the same set of tumor samples: one dataset was generated using a blocked randomization design when assigning arrays to samples and hence was free of confounding array effects; the second dataset was generated without blocking or randomization and exhibited array effects. The randomized dataset was assessed for differential expression between two tumor groups and treated as the benchmark. The non-randomized dataset was assessed for differential expression after normalization and compared against the benchmark. Normalization improved the true positive rate significantly in the non-randomized data but still possessed a false discovery rate as high as 50%. Adding a batch adjustment step before normalization further reduced the number of false positive markers while maintaining a similar number of true positive markers, which resulted in a false discovery rate of 32% to 48%, depending on the specific normalization method. We concluded the paper with some insights on possible causes of false discoveries to shed light on how to improve normalization for microRNA arrays. PMID:24905456
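
    The comparison against the benchmark comes down to set arithmetic on two marker lists. The sketch below treats the markers called in the randomized dataset as the truth, as the abstract describes; the function name and the marker identifiers are hypothetical.

```python
def compare_to_benchmark(called, benchmark):
    """Score a list of called markers against benchmark markers.

    True positive rate = benchmark markers recovered / benchmark markers.
    False discovery rate = called markers not in benchmark / called markers.
    """
    called, benchmark = set(called), set(benchmark)
    true_pos = called & benchmark
    false_pos = called - benchmark
    tpr = len(true_pos) / len(benchmark) if benchmark else 0.0
    fdr = len(false_pos) / len(called) if called else 0.0
    return {
        "true_positives": len(true_pos),
        "false_positives": len(false_pos),
        "tpr": tpr,
        "fdr": fdr,
    }


# Hypothetical marker lists: two of four calls are confirmed by the benchmark.
summary = compare_to_benchmark(
    called={"miR-a", "miR-b", "miR-c", "miR-d"},
    benchmark={"miR-a", "miR-b", "miR-e"},
)
```

    Running the same function on the calls obtained with and without a batch adjustment step would give the kind of side-by-side rates reported in the abstract.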

  19. EPConDB: a web resource for gene expression related to pancreatic development, beta-cell function and diabetes.

    PubMed

    Mazzarelli, Joan M; Brestelli, John; Gorski, Regina K; Liu, Junmin; Manduchi, Elisabetta; Pinney, Deborah F; Schug, Jonathan; White, Peter; Kaestner, Klaus H; Stoeckert, Christian J

    2007-01-01

    EPConDB (http://www.cbil.upenn.edu/EPConDB) is a public web site that supports research in diabetes, pancreatic development and beta-cell function by providing information about genes expressed in cells of the pancreas. EPConDB displays expression profiles for individual genes and information about transcripts, promoter elements and transcription factor binding sites. Gene expression results are obtained from studies examining tissue expression, pancreatic development and growth, differentiation of insulin-producing cells, islet or beta-cell injury, and genetic models of impaired beta-cell function. The expression datasets are derived using different microarray platforms, including the BCBC PancChips and Affymetrix gene expression arrays. Other datasets include semi-quantitative RT-PCR and MPSS expression studies. For selected microarray studies, lists of differentially expressed genes, derived from PaGE analysis, are displayed on the site. EPConDB provides database queries and tools to examine the relationship between a gene, its transcriptional regulation, protein function and expression in pancreatic tissues.

  20. Quantifying circular RNA expression from RNA-seq data using model-based framework.

    PubMed

    Li, Musheng; Xie, Xueying; Zhou, Jing; Sheng, Mengying; Yin, Xiaofeng; Ko, Eun-A; Zhou, Tong; Gu, Wanjun

    2017-07-15

    Circular RNAs (circRNAs) are a class of non-coding RNAs that are widely expressed in various cell lines and tissues of many organisms. Although the exact function of many circRNAs is largely unknown, the cell type- and tissue-specific expression of circRNAs implies crucial functions in many biological processes. Hence, the quantification of circRNA expression from high-throughput RNA-seq data is becoming increasingly important. Although many model-based methods have been developed to quantify linear RNA expression from RNA-seq data, these methods are not applicable to circRNA quantification. Here, we proposed a novel strategy that transforms circular transcripts to pseudo-linear transcripts and estimates the expression values of both circular and linear transcripts using an existing model-based algorithm, Sailfish. The new strategy can accurately estimate transcript expression of both linear and circular transcripts from RNA-seq data. Several factors, such as gene length, amount of expression and the ratio of circular to linear transcripts, had impacts on quantification performance of circular transcripts. In comparison to count-based tools, the new computational framework had superior performance in estimating the amount of circRNA expression from both simulated and real ribosomal RNA-depleted (rRNA-depleted) RNA-seq datasets. On the other hand, the consideration of circular transcripts in expression quantification from rRNA-depleted RNA-seq data showed substantially increased accuracy of linear transcript expression. Our proposed strategy was implemented in a program named Sailfish-cir. Sailfish-cir is freely available at https://github.com/zerodel/Sailfish-cir. Contact: tongz@medicine.nevada.edu or wanjun.gu@gmail.com. Supplementary data are available at Bioinformatics online. © The Author (2017). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com

  1. Altered Pathway Analyzer: A gene expression dataset analysis tool for identification and prioritization of differentially regulated and network rewired pathways

    PubMed Central

    Kaushik, Abhinav; Ali, Shakir; Gupta, Dinesh

    2017-01-01

    Gene connection rewiring is an essential feature of gene network dynamics. Apart from its normal functional role, it may also lead to dysregulated functional states by disturbing pathway homeostasis. Very few computational tools measure rewiring within gene co-expression and its corresponding regulatory networks in order to identify and prioritize altered pathways which may or may not be differentially regulated. We have developed Altered Pathway Analyzer (APA), a microarray dataset analysis tool for identification and prioritization of altered pathways, including those which are differentially regulated by TFs, by quantifying rewired sub-network topology. Moreover, APA also helps in re-prioritization of APA shortlisted altered pathways enriched with context-specific genes. We performed APA analysis of simulated datasets and p53 status NCI-60 cell line microarray data to demonstrate potential of APA for identification of several case-specific altered pathways. APA analysis reveals several altered pathways not detected by other tools evaluated by us. APA analysis of unrelated prostate cancer datasets identifies sample-specific as well as conserved altered biological processes, mainly associated with lipid metabolism, cellular differentiation and proliferation. APA is designed as a cross platform tool which may be transparently customized to perform pathway analysis in different gene expression datasets. APA is freely available at http://bioinfo.icgeb.res.in/APA. PMID:28084397

  2. CoINcIDE: A framework for discovery of patient subtypes across multiple datasets.

    PubMed

    Planey, Catherine R; Gevaert, Olivier

    2016-03-09

    Patient disease subtypes have the potential to transform personalized medicine. However, many patient subtypes derived from unsupervised clustering analyses on high-dimensional datasets are not replicable across multiple datasets, limiting their clinical utility. We present CoINcIDE, a novel methodological framework for the discovery of patient subtypes across multiple datasets that requires no between-dataset transformations. We also present a high-quality database collection, curatedBreastData, with over 2,500 breast cancer gene expression samples. We use CoINcIDE to discover novel breast and ovarian cancer subtypes with prognostic significance and novel hypothesized ovarian therapeutic targets across multiple datasets. CoINcIDE and curatedBreastData are available as R packages.

  3. A biclustering algorithm for extracting bit-patterns from binary datasets.

    PubMed

    Rodriguez-Baena, Domingo S; Perez-Pulido, Antonio J; Aguilar-Ruiz, Jesus S

    2011-10-01

    Binary datasets represent a compact and simple way to store data about the relationships between a group of objects and their possible properties. In the last few years, different biclustering algorithms have been specially developed to be applied to binary datasets. Several approaches based on matrix factorization, suffix trees or divide-and-conquer techniques have been proposed to extract useful biclusters from binary data, and these approaches provide information about the distribution of patterns and intrinsic correlations. A novel approach to extracting biclusters from binary datasets, BiBit, is introduced here. The results obtained from different experiments with synthetic data reveal the excellent performance and the robustness of BiBit to density and size of input data. Also, BiBit is applied to a central nervous system embryonic tumor gene expression dataset to test the quality of the results. A novel gene expression preprocessing methodology, based on expression level layers, and the selective search performed by BiBit, based on a very fast bit-pattern processing technique, provide very satisfactory results in quality and computational cost. The power of biclustering in finding genes involved simultaneously in different cancer processes is also shown. Finally, a comparison with Bimax, one of the most cited binary biclustering algorithms, shows that BiBit is faster while providing essentially the same results. The source and binary codes, the datasets used in the experiments and the results can be found at: http://www.upo.es/eps/bigs/BiBit.html. Contact: dsrodbae@upo.es. Supplementary data are available at Bioinformatics online.
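
    A bit-pattern search over a binary matrix can be sketched roughly as below. This is an illustration in the spirit of the abstract, not the published BiBit code: it assumes that each row is packed into an integer, that candidate patterns are seeded by the bitwise AND of row pairs, and that a bicluster is every row containing the pattern. The function name and the thresholds are hypothetical.

```python
from itertools import combinations


def bit_pattern_biclusters(rows, min_rows=2, min_cols=2):
    """Find (row indices, column indices) blocks that are all ones.

    rows is a list of equal-length lists of 0/1 values.
    """
    n_cols = len(rows[0])
    # Pack each row into one integer so a pattern test is a single AND.
    words = [int("".join(str(bit) for bit in row), 2) for row in rows]
    seen = set()
    biclusters = []
    for i, j in combinations(range(len(words)), 2):
        pattern = words[i] & words[j]  # columns set in both seed rows
        if pattern in seen or bin(pattern).count("1") < min_cols:
            continue
        seen.add(pattern)
        # A row belongs to the bicluster if it contains the whole pattern.
        members = [r for r, w in enumerate(words) if w & pattern == pattern]
        if len(members) >= min_rows:
            cols = [
                c for c in range(n_cols)
                if (pattern >> (n_cols - 1 - c)) & 1
            ]
            biclusters.append((members, cols))
    return biclusters


# Hypothetical 4 x 4 binary matrix.
found = bit_pattern_biclusters(
    [[1, 1, 0, 1], [1, 1, 0, 0], [0, 1, 1, 1], [1, 1, 0, 1]]
)
```

    Packing rows into machine words is what makes the membership test cheap; the same loop written column by column would do the same work with many more comparisons.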

  4. Serum-based six-miRNA signature as a potential marker for EC diagnosis: Comparison with TCGA miRNAseq dataset and identification of miRNA-mRNA target pairs by integrated analysis of TCGA miRNAseq and RNAseq datasets.

    PubMed

    Sharma, Priyanka; Saraya, Anoop; Sharma, Rinu

    2018-01-30

    The aim was to evaluate the diagnostic potential of a six-microRNA (miRNA) panel consisting of miR-21, miR-144, miR-107, miR-342, miR-93 and miR-152 for esophageal cancer (EC) detection. The expression of miRNAs was analyzed in EC serum samples using quantitative real-time PCR. Risk score analysis was performed and linear regression models were then fitted to generate the six-miRNA panel. In addition, we made an effort to identify significantly dysregulated miRNAs and mRNAs in EC using the Cancer Genome Atlas (TCGA) miRNAseq and RNAseq datasets, respectively. Further, we identified significantly correlated miRNA-mRNA target pairs by integrating the TCGA EC miRNAseq dataset with the RNAseq dataset. The panel of circulating miRNAs showed enhanced sensitivity (87.5%) and specificity (90.48%) in terms of discriminating EC patients from normal subjects (area under the curve [AUC] = 0.968). Pathway enrichment analysis for potential targets of the six miRNAs revealed 48 significant (P < 0.05) pathways, viz. pathways in cancer, mRNA surveillance, MAPK, Wnt, mTOR signaling, and so on. The expression data for mRNAs and miRNAs, downloaded from the TCGA database, led to the identification of 2309 differentially expressed genes and 189 miRNAs. Gene ontology and pathway enrichment analysis showed that cell-cycle processes were most significantly enriched for differentially expressed mRNAs. Integrated analysis of TCGA miRNAseq and RNAseq datasets resulted in the identification of 53 063 significantly and negatively correlated miRNA-mRNA pairs. In summary, a novel and highly sensitive signature of serum miRNAs was identified for EC detection. Moreover, this is the first report identifying miRNA-mRNA target pairs from the EC TCGA dataset, thus providing a comprehensive resource for understanding the interactions existing between miRNAs and their target mRNAs in EC. © 2018 John Wiley & Sons Australia, Ltd.
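
    Sensitivity, specificity and AUC for a risk score can be computed as in the following sketch. It assumes that a higher score indicates disease and that labels are coded 1 for patients and 0 for normal subjects; the scores, labels and cut-off are hypothetical and are not the study's data.

```python
def sensitivity_specificity(scores, labels, cutoff):
    """Classify score >= cutoff as positive and score the calls."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, l in pairs if l == 1 and s >= cutoff)
    fn = sum(1 for s, l in pairs if l == 1 and s < cutoff)
    tn = sum(1 for s, l in pairs if l == 0 and s < cutoff)
    fp = sum(1 for s, l in pairs if l == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a random patient scores higher than a
    random normal subject, counting ties as one half.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical risk scores for three patients and three normal subjects.
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]
sens, spec = sensitivity_specificity(scores, labels, cutoff=0.5)
area = auc(scores, labels)
```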

  5. SPICE for ESA Planetary Missions

    NASA Astrophysics Data System (ADS)

    Costa, M.

    2017-09-01

    SPICE is an information system that provides the geometry needed to plan scientific observations and to analyze the data obtained. The ESA SPICE Service generates the SPICE Kernel datasets for all the active ESA missions. This contribution describes the current status of the datasets, the extended services and the SPICE support provided to the ESA Planetary Missions (Mars-Express, ExoMars2016, BepiColombo, JUICE, Rosetta, Venus-Express and SMART-1) for the benefit of the science community.

  6. Phosphoproteome and transcriptome analyses of ErbB ligand-stimulated MCF-7 cells.

    PubMed

    Nagashima, Takeshi; Oyama, Masaaki; Kozuka-Hata, Hiroko; Yumoto, Noriko; Sakaki, Yoshiyuki; Hatakeyama, Mariko

    2008-01-01

    Cellular signal transduction pathways and gene expression are tightly regulated to accommodate changes in response to physiological environments. In the current study, molecules were identified that are activated as a result of intracellular signaling and immediately expressed as mRNA in MCF-7 breast cancer cells shortly after stimulation with the ErbB receptor ligands epidermal growth factor (EGF) or heregulin (HRG). For the identification of tyrosine-phosphorylated proteins and expressed genes, a SILAC (stable isotopic labeling using amino acids in cell culture) method and the Affymetrix gene expression array system, respectively, were used. Unexpectedly, the overlap between the genes appearing in the two experimental datasets was very low for HRG (43 hits in the proteome data, 1,655 in the transcriptome data, and 5 hits common to both datasets), while no overlapping genes were detected for EGF (15 hits in the proteome data, 211 hits in the transcriptome data, and no hits common to both datasets). The HRG overlapping genes included ERBB2, NEDD9, MAPK3, JUP and EPHA2. Biological pathway analysis indicated that HRG-stimulated molecular activation is significantly related to cancer pathways including bladder cancer, chronic myeloid leukemia and pancreatic cancer (p < 0.05). The proteome datasets of EGF and HRG contain molecules that are related to axon guidance, ErbB signaling and VEGF signaling at a high rate.

  7. Investigating the Effects of Imputation Methods for Modelling Gene Networks Using a Dynamic Bayesian Network from Gene Expression Data

    PubMed Central

    CHAI, Lian En; LAW, Chow Kuan; MOHAMAD, Mohd Saberi; CHONG, Chuii Khim; CHOON, Yee Wen; DERIS, Safaai; ILLIAS, Rosli Md

    2014-01-01

    Background: Gene expression data often contain missing expression values. Therefore, several imputation methods have been applied to solve the missing values, which include k-nearest neighbour (kNN), local least squares (LLS), and Bayesian principal component analysis (BPCA). However, the effects of these imputation methods on the modelling of gene regulatory networks from gene expression data have rarely been investigated and analysed using a dynamic Bayesian network (DBN). Methods: In the present study, we separately imputed datasets of the Escherichia coli S.O.S. DNA repair pathway and the Saccharomyces cerevisiae cell cycle pathway with kNN, LLS, and BPCA, and subsequently used these to generate gene regulatory networks (GRNs) using a discrete DBN. We made comparisons on the basis of previous studies in order to select the gene network with the least error. Results: We found that BPCA and LLS performed better on larger networks (based on the S. cerevisiae dataset), whereas kNN performed better on smaller networks (based on the E. coli dataset). Conclusion: The results suggest that the performance of each imputation method is dependent on the size of the dataset, and this subsequently affects the modelling of the resultant GRNs using a DBN. In addition, on the basis of these results, a DBN has the capacity to discover potential edges, as well as display interactions, between genes. PMID:24876803
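
    Of the three imputation methods named, k-nearest neighbour is the simplest to illustrate. The sketch below is a generic kNN imputer, not the implementation used in the study: it assumes rows are genes, missing values are marked with None, distance is the root mean squared difference over the columns both genes have observed, and the fill value is the unweighted mean of the neighbours. The function name and data are hypothetical.

```python
import math


def knn_impute(matrix, k=2):
    """Return a copy of matrix with None entries filled from k neighbours."""

    def distance(a, b):
        shared = [
            (x, y) for x, y in zip(a, b)
            if x is not None and y is not None
        ]
        if not shared:
            return math.inf  # no common observations, so not comparable
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    result = [list(row) for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbours must have an observed value in column j.
            candidates = sorted(
                (distance(row, other), other[j])
                for r, other in enumerate(matrix)
                if r != i and other[j] is not None
            )
            neighbours = [v for d, v in candidates[:k] if d != math.inf]
            if neighbours:
                result[i][j] = sum(neighbours) / len(neighbours)
    return result


# Hypothetical expression matrix: the first gene lacks its third value.
data = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 5.0],
    [9.0, 9.0, 100.0],
]
filled = knn_impute(data, k=2)
```

    The two closest genes contribute 3.0 and 5.0, so the gap is filled with their mean; the distant fourth gene is ignored, which is the behaviour that distinguishes kNN from a plain column average.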

  8. MiSTIC, an integrated platform for the analysis of heterogeneity in large tumour transcriptome datasets.

    PubMed

    Lemieux, Sebastien; Sargeant, Tobias; Laperrière, David; Ismail, Houssam; Boucher, Geneviève; Rozendaal, Marieke; Lavallée, Vincent-Philippe; Ashton-Beaucage, Dariel; Wilhelm, Brian; Hébert, Josée; Hilton, Douglas J; Mader, Sylvie; Sauvageau, Guy

    2017-07-27

    Genome-wide transcriptome profiling has enabled non-supervised classification of tumours, revealing different sub-groups characterized by specific gene expression features. However, the biological significance of these subtypes remains for the most part unclear. We describe herein an interactive platform, Minimum Spanning Trees Inferred Clustering (MiSTIC), that integrates the direct visualization and comparison of the gene correlation structure between datasets, the analysis of the molecular causes underlying co-variations in gene expression in cancer samples, and the clinical annotation of tumour sets defined by the combined expression of selected biomarkers. We have used MiSTIC to highlight the roles of specific transcription factors in breast cancer subtype specification, to compare the aspects of tumour heterogeneity targeted by different prognostic signatures, and to highlight biomarker interactions in AML. A version of MiSTIC preloaded with datasets described herein can be accessed through a public web server (http://mistic.iric.ca); in addition, the MiSTIC software package can be obtained (github.com/iric-soft/MiSTIC) for local use with personalized datasets. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  9. ExpTreeDB: web-based query and visualization of manually annotated gene expression profiling experiments of human and mouse from GEO.

    PubMed

    Ni, Ming; Ye, Fuqiang; Zhu, Juanjuan; Li, Zongwei; Yang, Shuai; Yang, Bite; Han, Lu; Wu, Yongge; Chen, Ying; Li, Fei; Wang, Shengqi; Bo, Xiaochen

    2014-12-01

    Numerous public microarray datasets are valuable resources for the scientific communities. Several online tools have made great steps to use these data by querying related datasets with users' own gene signatures or expression profiles. However, dataset annotation and result presentation still need to be improved. ExpTreeDB is a database that allows for queries on human and mouse microarray experiments from Gene Expression Omnibus with gene signatures or profiles. Compared with similar applications, ExpTreeDB pays more attention to dataset annotations and result visualization. We introduced a multiple-level annotation system to depict and organize original experiments. For example, a tamoxifen-treated cell line experiment is hierarchically annotated as 'agent→drug→estrogen receptor antagonist→tamoxifen'. Consequently, retrieved results are displayed as interactive tree-structured graphics, which provide an overview of related experiments and might direct users to key items of interest. The database is freely available at http://biotech.bmi.ac.cn/ExpTreeDB. The web site is implemented in Perl, PHP, R, MySQL and Apache. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  10. Integrative sparse principal component analysis of gene expression data.

    PubMed

    Liu, Mengque; Fan, Xinyan; Fang, Kuangnan; Zhang, Qingzhao; Ma, Shuangge

    2017-12-01

    In the analysis of gene expression data, dimension reduction techniques have been extensively adopted. The most popular one is perhaps the PCA (principal component analysis). To generate more reliable and more interpretable results, the SPCA (sparse PCA) technique has been developed. With the "small sample size, high dimensionality" characteristic of gene expression data, the analysis results generated from a single dataset are often unsatisfactory. Under contexts other than dimension reduction, integrative analysis techniques, which jointly analyze the raw data of multiple independent datasets, have been developed and shown to outperform "classic" meta-analysis and other multidatasets techniques and single-dataset analysis. In this study, we conduct integrative analysis by developing the iSPCA (integrative SPCA) method. iSPCA achieves the selection and estimation of sparse loadings using a group penalty. To take advantage of the similarity across datasets and generate more accurate results, we further impose contrasted penalties. Different penalties are proposed to accommodate different data conditions. Extensive simulations show that iSPCA outperforms the alternatives under a wide spectrum of settings. The analysis of breast cancer and pancreatic cancer data further shows iSPCA's satisfactory performance. © 2017 WILEY PERIODICALS, INC.

  11. Reproducibility-optimized test statistic for ranking genes in microarray studies.

    PubMed

    Elo, Laura L; Filén, Sanna; Lahesmaa, Riitta; Aittokallio, Tero

    2008-01-01

    A principal goal of microarray studies is to identify the genes showing differential expression under distinct conditions. In such studies, the selection of an optimal test statistic is a crucial challenge, which depends on the type and amount of data under analysis. While previous studies on simulated or spike-in datasets do not provide practical guidance on how to choose the best method for a given real dataset, we introduce an enhanced reproducibility-optimization procedure, which enables the selection of a suitable gene-ranking statistic directly from the data. In comparison with existing ranking methods, the reproducibility-optimized statistic shows good performance consistently under various simulated conditions and on the Affymetrix spike-in dataset. Further, the feasibility of the novel statistic is confirmed in a practical research setting using data from an in-house cDNA microarray study of asthma-related gene expression changes. These results suggest that the procedure facilitates the selection of an appropriate test statistic for a given dataset without relying on a priori assumptions, which may bias the findings and their interpretation. Moreover, the general reproducibility-optimization procedure is not limited to detecting differential expression only but could be extended to a wide range of other applications as well.

  12. Unmasking Upstream Gene Expression Regulators with miRNA-corrected mRNA Data

    PubMed Central

    Bollmann, Stephanie; Bu, Dengpan; Wang, Jiaqi; Bionaz, Massimo

    2015-01-01

    Expressed micro-RNA (miRNA) affects messenger RNA (mRNA) abundance, hindering the accuracy of upstream regulator analysis. Our objective was to provide an algorithm to correct such bias. Large mRNA and miRNA analyses were performed on RNA extracted from bovine liver and mammary tissue. Using four levels of target scores from TargetScan (all miRNA:mRNA target gene pairs or only the top 25%, 50%, or 75%) and four levels of the magnitude of miRNA effect (ME) on mRNA expression (30%, 50%, 75%, and 83% mRNA reduction), we generated 17 different datasets (including the original dataset). For each dataset, we performed upstream regulator analysis using two bioinformatics tools. We detected an increased effect on the upstream regulator analysis with larger miRNA:mRNA pair bins and higher ME. The miRNA correction allowed identification of several upstream regulators not present in the analysis of the original dataset. Thus, the proposed algorithm improved the prediction of upstream regulators. PMID:27279737
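
    The dataset count in this abstract follows from a 4 x 4 grid of settings plus the original data. The sketch below enumerates that grid and applies a correction to a single value. The correction formula (dividing by one minus the assumed fractional reduction) is an assumption made for illustration only; the abstract does not state the authors' actual algorithm, and the function name is hypothetical.

```python
from itertools import product

# The two sets of levels named in the abstract.
TARGET_SCORE_LEVELS = ["all pairs", "top 25%", "top 50%", "top 75%"]
MIRNA_EFFECTS = [0.30, 0.50, 0.75, 0.83]  # assumed fraction of mRNA removed


def correct_mrna(observed, effect):
    """Hypothetical correction: undo a fractional reduction in mRNA abundance."""
    return observed / (1.0 - effect)


grid = list(product(TARGET_SCORE_LEVELS, MIRNA_EFFECTS))
n_datasets = len(grid) + 1  # 16 corrected datasets plus the original one
print(n_datasets)  # 17
print(correct_mrna(50.0, 0.50))  # 100.0
```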

  13. Integrative multi-platform meta-analysis of gene expression profiles in pancreatic ductal adenocarcinoma patients for identifying novel diagnostic biomarkers.

    PubMed

    Irigoyen, Antonio; Jimenez-Luna, Cristina; Benavides, Manuel; Caba, Octavio; Gallego, Javier; Ortuño, Francisco Manuel; Guillen-Ponce, Carmen; Rojas, Ignacio; Aranda, Enrique; Torres, Carolina; Prados, Jose

    2018-01-01

    Applying differentially expressed genes (DEGs) to identify feasible biomarkers in diseases can be a hard task when working with heterogeneous datasets. Expression data are strongly influenced by technology, sample preparation processes, and/or labeling methods. The proliferation of different microarray platforms for measuring gene expression increases the need to develop models able to compare their results, especially when different technologies can lead to signal values that vary greatly. Integrative meta-analysis can significantly improve the reliability and robustness of DEG detection. The objective of this work was to develop an integrative approach for identifying potential cancer biomarkers by integrating gene expression data from two different platforms. Pancreatic ductal adenocarcinoma (PDAC), where there is an urgent need to find new biomarkers due to its late diagnosis, is an ideal candidate for testing this technology. Expression data from two different datasets, namely Affymetrix and Illumina (18 and 36 PDAC patients, respectively), as well as from 18 healthy controls, were used for this study. A meta-analysis based on an empirical Bayesian methodology (ComBat) was then proposed to integrate these datasets. DEGs were finally identified from the integrated data by using the statistical programming language R. After our integrative meta-analysis, 5 genes were commonly identified within the individual analyses of the independent datasets. In addition, 28 novel genes that were not reported by the individual analyses ('gained' genes) were discovered. Several of these gained genes have been already related to other gastroenterological tumors. The proposed integrative meta-analysis has revealed novel DEGs that may play an important role in PDAC and could be potential biomarkers for diagnosing the disease.

  14. Integrated analyses for genetic markers of polycystic ovary syndrome with 9 case-control studies of gene expression profiles.

    PubMed

    Lu, Chenqi; Liu, Xiaoqin; Wang, Lin; Jiang, Ning; Yu, Jun; Zhao, Xiaobo; Hu, Hairong; Zheng, Saihua; Li, Xuelian; Wang, Guiying

    2017-01-10

    Due to genetic heterogeneity and variable diagnostic criteria, genetic studies of polycystic ovary syndrome are particularly challenging. Furthermore, lack of sufficiently large cohorts limits the identification of susceptibility genes contributing to polycystic ovary syndrome. Here, we carried out a systematic search of studies deposited in the Gene Expression Omnibus database through August 31, 2016. The present analyses included studies with: 1) patients with polycystic ovary syndrome and normal controls, 2) gene expression profiling of messenger RNA, and 3) sufficient data for our analysis. Ultimately, a total of 9 studies with 13 datasets met the inclusion criteria and were included in the subsequent integrated analyses. Through comprehensive analyses, 13 genetic factors overlapped in all datasets and were identified as significant specific genes for polycystic ovary syndrome. After quality control assessment, six datasets remained. Further gene ontology enrichment and pathway analyses suggested that differentially expressed genes were mainly enriched in oocyte pathways. These findings provide potential molecular markers for diagnosis and prognosis of polycystic ovary syndrome, although in-depth studies are needed on their exact function and mechanism in polycystic ovary syndrome.

  15. Classification of Time Series Gene Expression in Clinical Studies via Integration of Biological Network

    PubMed Central

    Qian, Liwei; Zheng, Haoran; Zhou, Hong; Qin, Ruibin; Li, Jinlong

    2013-01-01

    The increasing availability of time series expression datasets, although promising, raises a number of new computational challenges. Accordingly, the development of suitable classification methods to make reliable and sound predictions is becoming a pressing issue. We propose, here, a new method to classify time series gene expression via integration of biological networks. We evaluated our approach on 2 different datasets and showed that the use of a hidden Markov model/Gaussian mixture models hybrid explores the time-dependence of the expression data, thereby leading to better prediction results. We demonstrated that the biclustering procedure identifies function-related genes as a whole, giving rise to high accordance in prognosis prediction across independent time series datasets. In addition, we showed that integration of biological networks into our method significantly improves prediction performance. Moreover, we compared our approach with several state-of-the-art algorithms and found that our method outperformed previous approaches with regard to various criteria. Finally, our approach achieved better prediction results on early-stage data, implying the potential of our method for practical prediction. PMID:23516469

  16. Integrated Analysis of Alzheimer's Disease and Schizophrenia Dataset Revealed Different Expression Pattern in Learning and Memory.

    PubMed

    Li, Wen-Xing; Dai, Shao-Xing; Liu, Jia-Qian; Wang, Qian; Li, Gong-Hua; Huang, Jing-Fei

    2016-01-01

    Alzheimer's disease (AD) and schizophrenia (SZ) are both accompanied by impaired learning and memory functions. This study aims to explore the expression profiles of learning or memory genes between AD and SZ. We downloaded 10 AD and 10 SZ datasets from GEO-NCBI for integrated analysis. These datasets were processed using the RMA algorithm and a global renormalization across all studies. An empirical Bayes algorithm was then used to find the differentially expressed genes between patients and controls. The results showed that most of the differentially expressed genes were related to AD, whereas the gene expression profile was little affected in SZ. Furthermore, in terms of the number of differentially expressed genes, the fold change, and the brain region, there was a great difference in the expression of learning- or memory-related genes between AD and SZ. In AD, CALB1, GABRA5, and TAC1 were significantly downregulated in whole brain, frontal lobe, temporal lobe, and hippocampus. However, in SZ, only two genes, CRHBP and CX3CR1, were downregulated in hippocampus, and other brain regions were not affected. The effect of these genes on learning or memory impairment has been widely studied. It was suggested that these genes may play a crucial role in AD or SZ pathogenesis. The different gene expression patterns between AD and SZ on learning and memory functions in different brain regions revealed in our study may help to understand the different mechanisms of the two diseases.

  17. A Gene Expression Profile of BRCAness That Predicts for Responsiveness to Platinum and PARP Inhibitors

    DTIC Science & Technology

    2017-02-01

    Dates covered: 15 July 2010 – 2 Nov. 2016. Title and subtitle: A Gene Expression Profile of BRCAness That Predicts for Responsiveness to Platinum and PARP...resistance in vitro, and to investigate the mechanism for this effect. The major goal for Aim 4 was to determine the reproducibility of the BRCAness...we used the epithelial ovarian cancer (EOC) dataset from The Cancer Genome Atlas (TCGA) (4). The TCGA dataset is a unique tool for these studies as

  18. Microarray Analysis Dataset

    EPA Pesticide Factsheets

    This file contains a link for Gene Expression Omnibus and the GSE designations for the publicly available gene expression data used in the study and reflected in Figures 6 and 7 for the Das et al., 2016 paper. This dataset is associated with the following publication: Das, K., C. Wood, M. Lin, A.A. Starkov, C. Lau, K.B. Wallace, C. Corton, and B. Abbott. Perfluoroalkyl acids-induced liver steatosis: Effects on genes controlling lipid homeostasis. TOXICOLOGY. Elsevier Science Ltd, New York, NY, USA, 378: 32-52, (2017).

  19. Identifying spatially similar gene expression patterns in early stage fruit fly embryo images: binary feature versus invariant moment digital representations

    PubMed Central

    Gurunathan, Rajalakshmi; Van Emden, Bernard; Panchanathan, Sethuraman; Kumar, Sudhir

    2004-01-01

    Background Modern developmental biology relies heavily on the analysis of embryonic gene expression patterns. Investigators manually inspect hundreds or thousands of expression patterns to identify those that are spatially similar and to ultimately infer potential gene interactions. However, the rapid accumulation of gene expression pattern data over the last two decades, facilitated by high-throughput techniques, has produced a need for the development of efficient approaches for direct comparison of images, rather than their textual descriptions, to identify spatially similar expression patterns. Results The effectiveness of the Binary Feature Vector (BFV) and Invariant Moment Vector (IMV) based digital representations of the gene expression patterns in finding biologically meaningful patterns was compared for a small (226 images) and a large (1819 images) dataset. For each dataset, an ordered list of images, with respect to a query image, was generated to identify overlapping and similar gene expression patterns, in a manner comparable to what a developmental biologist might do. The results showed that the BFV representation consistently outperforms the IMV representation in finding biologically meaningful matches when spatial overlap of the gene expression pattern and the genes involved are considered. Furthermore, we explored the value of conducting image-content based searches in a dataset where individual expression components (or domains) of multi-domain expression patterns were also included separately. We found that this technique improves performance of both IMV and BFV based searches. Conclusions We conclude that the BFV representation consistently produces a more extensive and better list of biologically useful patterns than the IMV representation. The high quality of results obtained scales well as the search database becomes larger, which encourages efforts to build automated image query and retrieval systems for spatial gene expression patterns. PMID:15603586

  20. Therapeutic Activity of Anti-AXL Antibody against Triple-Negative Breast Cancer Patient-Derived Xenografts and Metastasis.

    PubMed

    Leconet, Wilhem; Chentouf, Myriam; du Manoir, Stanislas; Chevalier, Clément; Sirvent, Audrey; Aït-Arsa, Imade; Busson, Muriel; Jarlier, Marta; Radosevic-Robin, Nina; Theillet, Charles; Chalbos, Dany; Pasquet, Jean-Max; Pèlegrin, André; Larbouret, Christel; Robert, Bruno

    2017-06-01

    Purpose: AXL receptor tyrosine kinase has been described as a relevant molecular marker and a key player in invasiveness, especially in triple-negative breast cancer (TNBC). Experimental Design: We evaluate the antitumor efficacy of the anti-AXL monoclonal antibody 20G7-D9 in several TNBC cell xenografts or patient-derived xenograft (PDX) models and decipher the underlying mechanisms. In a dataset of 254 basal-like breast cancer samples, genes correlated with AXL expression are enriched in EMT, migration, and invasion signaling pathways. Results: Treatment with 20G7-D9 inhibited tumor growth and bone metastasis formation in AXL-positive TNBC cell xenografts or PDX, but not in AXL-negative PDX, highlighting AXL role in cancer growth and invasion. In vitro stimulation of AXL-positive cancer cells by its ligand GAS6 induced the expression of several EMT-associated genes (SNAIL, SLUG, and VIM) through an intracellular signaling implicating the transcription factor FRA-1, important in cell invasion and plasticity, and increased their migration/invasion capacity. 20G7-D9 induced AXL degradation and inhibited all AXL/GAS6-dependent cell signaling implicated in EMT and in cell migration/invasion. Conclusions: The anti-AXL antibody 20G7-D9 represents a promising therapeutic strategy in TNBC with mesenchymal features by inhibiting AXL-dependent EMT, tumor growth, and metastasis formation. Clin Cancer Res; 23(11); 2806-16. ©2016 American Association for Cancer Research.

  1. Cross-platform normalization of microarray and RNA-seq data for machine learning applications

    PubMed Central

    Thompson, Jeffrey A.; Tan, Jie

    2016-01-01

    Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exist in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data. We developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models constructed from legacy platforms. We evaluated TDM, as well as quantile normalization, nonparanormal transformation, and a simple log2 transformation, on both simulated and biological datasets of gene expression. Our evaluation included both supervised and unsupervised machine learning approaches. We found that TDM exhibited consistently strong performance across settings and that quantile normalization also performed well in many circumstances. We also provide a TDM package for the R programming language. PMID:26844019
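
    Two of the simpler baselines named in this abstract, the log2 transformation and quantile normalization, can be sketched in a few lines. This is a minimal pure-Python illustration on toy numbers, not the TDM method itself (the authors provide that as an R package). The function names, the pseudocount, and the tie handling are assumptions.

```python
import math


def log2_transform(values, pseudocount=1.0):
    """log2(x + pseudocount); the pseudocount is an assumption to avoid log2(0)."""
    return [math.log2(v + pseudocount) for v in values]


def quantile_normalize(samples):
    """Force every sample onto one shared distribution.

    `samples` is a list of equal-length lists, one list per sample.
    Ties are broken by position, a simplification of the usual treatment.
    """
    n_genes = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = [
        sum(s[k] for s in sorted_samples) / len(samples) for k in range(n_genes)
    ]
    normalized = []
    for sample in samples:
        order = sorted(range(n_genes), key=lambda i: sample[i])
        out = [0.0] * n_genes
        for rank, gene_index in enumerate(order):
            out[gene_index] = reference[rank]
        normalized.append(out)
    return normalized


toy = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]  # two samples, three genes
normalized = quantile_normalize(toy)
print(normalized)  # [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
print(log2_transform([0.0, 1.0, 3.0]))  # [0.0, 1.0, 2.0]
```

    After quantile normalization both toy samples contain the same set of values (1.5, 3.5, 5.5), each assigned according to the gene's rank within its own sample.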

  2. Enhanced expression of SRPK2 contributes to aggressive progression and metastasis in prostate cancer.

    PubMed

    Zhuo, Yang Jia; Liu, Ze Zhen; Wan, Song; Cai, Zhi Duan; Xie, Jian Jiang; Cai, Zhou da; Song, Sheng da; Wan, Yue Ping; Hua, Wei; Zhong, Wei de; Wu, Chin Lee

    2018-06-01

    Serine/Arginine-Rich Protein-Specific Kinase-2 (SRSF protein kinase-2, SRPK2) is up-regulated in multiple human tumors. However, the expression, function and clinical significance of SRPK2 in prostate cancer (PCa) are not yet understood. We therefore aimed to determine the association of SRPK2 with tumor progression and metastasis in PCa patients in our present study. The expression of SRPK2 was detected in several public datasets and validated using a clinical tissue microarray (TMA) by immunohistochemistry. The association of SRPK2 expression with various clinicopathological characteristics of PCa patients was subsequently statistically analyzed based on The Cancer Genome Atlas (TCGA) dataset and clinical TMA. The effects of SRPK2 on cancer cell proliferation, migration, invasion, cell cycle progression, apoptosis and tumor growth were then respectively investigated using in vitro and in vivo experiments. First, public datasets showed that SRPK2 expression was greater in PCa tissues when compared with non-cancerous tissues. Statistical analysis demonstrated that high expression of SRPK2 was significantly correlated with a higher Gleason Score, advanced pathological stage and the presence of tumor metastasis in the TCGA Dataset (all P < 0.01). Similar correlations between SRPK2 and a higher Gleason Score or advanced pathological stage were also identified in the TMA (P < 0.05). Kaplan-Meier curve analyses showed that the biochemical recurrence (BCR)-free time of PCa patients with SRPK2 high expression was shorter than for those with SRPK2 low expression (P < 0.05). Second, cell function experiments in PCa cell lines revealed that enhanced SRPK2 expression could promote cell proliferation, migration, invasion and cell cycle progression but suppress tumor cell apoptosis in vitro. Xenograft experiments showed that SRPK2 promoted tumor growth in vivo. In conclusion, our data demonstrated that SRPK2 may play an important role in the progression and metastasis of PCa, which suggests that it might be a potential therapeutic target for PCa clinical therapy. Copyright © 2018 Elsevier Masson SAS. All rights reserved.

  3. Efficiently Identifying Significant Associations in Genome-wide Association Studies

    PubMed Central

    Eskin, Eleazar

    2013-01-01

    Abstract Over the past several years, genome-wide association studies (GWAS) have implicated hundreds of genes in common disease. More recently, the GWAS approach has been utilized to identify regions of the genome that harbor variation affecting gene expression or expression quantitative trait loci (eQTLs). Unlike GWAS applied to clinical traits, where only a handful of phenotypes are analyzed per study, in eQTL studies, tens of thousands of gene expression levels are measured, and the GWAS approach is applied to each gene expression level. This leads to computing billions of statistical tests and requires substantial computational resources, particularly when applying novel statistical methods such as mixed models. We introduce a novel two-stage testing procedure that identifies all of the significant associations more efficiently than testing all the single nucleotide polymorphisms (SNPs). In the first stage, a small number of informative SNPs, or proxies, across the genome are tested. Based on their observed associations, our approach locates the regions that may contain significant SNPs and only tests additional SNPs from those regions. We show through simulations and analysis of real GWAS datasets that the proposed two-stage procedure increases the computational speed by a factor of 10. Additionally, efficient implementation of our software increases the computational speed relative to the state-of-the-art testing approaches by a factor of 75. PMID:24033261
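
    The two-stage idea in this abstract (test a few proxy SNPs first, then test the remaining SNPs only in regions that look promising) can be sketched as below. This is a simplified illustration under stated assumptions: the fixed thresholds, the region layout, the SNP names, and the function name are all hypothetical, and unlike the published procedure this sketch makes no guarantee that every significant association is found.

```python
def two_stage_scan(regions, test, proxy_threshold, final_threshold):
    """Return (significant SNPs, number of tests performed).

    regions: dict mapping a region name to a list of SNP ids; the first
             id in each list is treated as that region's proxy.
    test:    callable returning an association statistic for a SNP id.
    """
    hits, n_tests = [], 0
    for snps in regions.values():
        proxy = snps[0]
        proxy_stat = test(proxy)  # stage one: proxy SNP only
        n_tests += 1
        if proxy_stat < proxy_threshold:
            continue  # region looks uninteresting, skip its other SNPs
        if proxy_stat >= final_threshold:
            hits.append(proxy)
        for snp in snps[1:]:  # stage two: remaining SNPs in the region
            n_tests += 1
            if test(snp) >= final_threshold:
                hits.append(snp)
    return hits, n_tests


# Toy statistics; in practice `test` would run the association model.
toy_stats = {"snp1": 0.5, "snp2": 0.7, "snp3": 3.0, "snp4": 6.0, "snp5": 2.0}
toy_regions = {"A": ["snp1", "snp2"], "B": ["snp3", "snp4", "snp5"]}
hits, n_tests = two_stage_scan(toy_regions, toy_stats.get, 2.0, 5.0)
print(hits, n_tests)  # ['snp4'] 4
```

    In this toy run four tests are performed instead of five, because region A is skipped after its proxy falls below the first-stage threshold.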

  4. Dissecting the transcriptional phenotype of ribosomal protein deficiency: implications for Diamond-Blackfan Anemia

    PubMed Central

    Aspesi, Anna; Pavesi, Elisa; Robotti, Elisa; Crescitelli, Rossella; Boria, Ilenia; Avondo, Federica; Moniz, Hélène; Da Costa, Lydie; Mohandas, Narla; Roncaglia, Paola; Ramenghi, Ugo; Ronchi, Antonella; Gustincich, Stefano; Merlin, Simone; Marengo, Emilio; Ellis, Steven R.; Follenzi, Antonia; Santoro, Claudio; Dianzani, Irma

    2014-01-01

    Defects in genes encoding ribosomal proteins cause Diamond Blackfan Anemia (DBA), a red cell aplasia often associated with physical abnormalities. Other bone marrow failure syndromes have been attributed to defects in ribosomal components but the link between erythropoiesis and the ribosome remains to be fully defined. Several lines of evidence suggest that defects in ribosome synthesis lead to “ribosomal stress” with p53 activation and either cell cycle arrest or induction of apoptosis. Pathways independent of p53 have also been proposed to play a role in DBA pathogenesis. We took an unbiased approach to identify p53-independent pathways activated by defects in ribosome synthesis by analyzing global gene expression in various cellular models of DBA. Ranking-Principal Component Analysis (Ranking-PCA) was applied to the identified datasets to determine whether there are common sets of genes whose expression is altered in these different cellular models. We observed consistent changes in the expression of genes involved in cellular amino acid metabolic process, negative regulation of cell proliferation and cell redox homeostasis. These data indicate that cells respond to defects in ribosome synthesis by changing the level of expression of a limited subset of genes involved in critical cellular processes. Moreover, our data support a role for p53-independent pathways in the pathophysiology of DBA. PMID:24835311

  5. Molecular profiling of ALDH1+ colorectal cancer stem cells reveals preferential activation of MAPK, FAK, and oxidative stress pro-survival signalling pathways.

    PubMed

    Vishnubalaji, Radhakrishnan; Manikandan, Muthurangan; Fahad, Mohamed; Hamam, Rimi; Alfayez, Musaad; Kassem, Moustapha; Aldahmash, Abdullah; Alajez, Nehad M

    2018-03-02

    Tumour heterogeneity leads to variable clinical response and inaccurate diagnostic and prognostic assessment. Cancer stem cells (CSCs) represent a subpopulation responsible for invasion, metastasis, therapeutic resistance, and recurrence in many human cancer types. However, the true identity of colorectal cancer (CRC) SCs remains elusive. Here, we aimed to characterize and define the gene expression portrait of CSCs in CRC-model SW403 cells. We found that ALDH+ cells are clonogenic and highly proliferative; their global gene expression profiling-based molecular signature revealed gene enrichment related to DNA damage, MAPK, FAK, oxidative stress response, and Wnt signalling. ALDH+ cells showed enhanced ROS stress resistance, whereas MAPK/FAK pathway pharmacologic inhibition limited their survival. Conversely, 5-fluorouracil increased the ALDH+ cell fraction among the SW403, HCT116 and SW620 CRC models. Notably, analysis of ALDH1A1 and POU5F1 expression levels in cohorts of 462 or 420 patients for overall (OS) or disease-free (DFS) survival, respectively, obtained from the Cancer Genome Atlas CRC dataset, revealed a strong association between elevated expression and poor OS (p = 0.006) and poor DFS (p = 0.05), thus implicating ALDH1A1 and POU5F1 in CRC prognosis. Our data reveal a distinct molecular signature of ALDH+ CSCs in CRC and suggest pathways relevant for successful targeted therapies and management of CRC.

  6. Datasets2Tools, repository and search engine for bioinformatics datasets, tools and canned analyses

    PubMed Central

    Torre, Denis; Krawczuk, Patrycja; Jagodnik, Kathleen M.; Lachmann, Alexander; Wang, Zichen; Wang, Lily; Kuleshov, Maxim V.; Ma’ayan, Avi

    2018-01-01

    Biomedical data repositories such as the Gene Expression Omnibus (GEO) enable the search and discovery of relevant biomedical digital data objects. Similarly, resources such as OMICtools, index bioinformatics tools that can extract knowledge from these digital data objects. However, systematic access to pre-generated ‘canned’ analyses applied by bioinformatics tools to biomedical digital data objects is currently not available. Datasets2Tools is a repository indexing 31,473 canned bioinformatics analyses applied to 6,431 datasets. The Datasets2Tools repository also contains the indexing of 4,901 published bioinformatics software tools, and all the analyzed datasets. Datasets2Tools enables users to rapidly find datasets, tools, and canned analyses through an intuitive web interface, a Google Chrome extension, and an API. Furthermore, Datasets2Tools provides a platform for contributing canned analyses, datasets, and tools, as well as evaluating these digital objects according to their compliance with the findable, accessible, interoperable, and reusable (FAIR) principles. By incorporating community engagement, Datasets2Tools promotes sharing of digital resources to stimulate the extraction of knowledge from biomedical research data. Datasets2Tools is freely available from: http://amp.pharm.mssm.edu/datasets2tools. PMID:29485625

  7. Datasets2Tools, repository and search engine for bioinformatics datasets, tools and canned analyses.

    PubMed

    Torre, Denis; Krawczuk, Patrycja; Jagodnik, Kathleen M; Lachmann, Alexander; Wang, Zichen; Wang, Lily; Kuleshov, Maxim V; Ma'ayan, Avi

    2018-02-27

    Biomedical data repositories such as the Gene Expression Omnibus (GEO) enable the search and discovery of relevant biomedical digital data objects. Similarly, resources such as OMICtools, index bioinformatics tools that can extract knowledge from these digital data objects. However, systematic access to pre-generated 'canned' analyses applied by bioinformatics tools to biomedical digital data objects is currently not available. Datasets2Tools is a repository indexing 31,473 canned bioinformatics analyses applied to 6,431 datasets. The Datasets2Tools repository also contains the indexing of 4,901 published bioinformatics software tools, and all the analyzed datasets. Datasets2Tools enables users to rapidly find datasets, tools, and canned analyses through an intuitive web interface, a Google Chrome extension, and an API. Furthermore, Datasets2Tools provides a platform for contributing canned analyses, datasets, and tools, as well as evaluating these digital objects according to their compliance with the findable, accessible, interoperable, and reusable (FAIR) principles. By incorporating community engagement, Datasets2Tools promotes sharing of digital resources to stimulate the extraction of knowledge from biomedical research data. Datasets2Tools is freely available from: http://amp.pharm.mssm.edu/datasets2tools.

  8. Genome-Level Longitudinal Expression of Signaling Pathways and Gene Networks in Pediatric Septic Shock

    PubMed Central

    Shanley, Thomas P; Cvijanovich, Natalie; Lin, Richard; Allen, Geoffrey L; Thomas, Neal J; Doctor, Allan; Kalyanaraman, Meena; Tofil, Nancy M; Penfil, Scott; Monaco, Marie; Odoms, Kelli; Barnes, Michael; Sakthivel, Bhuvaneswari; Aronow, Bruce J; Wong, Hector R

    2007-01-01

    We have conducted longitudinal studies focused on the expression profiles of signaling pathways and gene networks in children with septic shock. Genome-level expression profiles were generated from whole blood-derived RNA of children with septic shock (n = 30) corresponding to day one and day three of septic shock, respectively. Based on sequential statistical and expression filters, day one and day three of septic shock were characterized by differential regulation of 2,142 and 2,504 gene probes, respectively, relative to controls (n = 15). Venn analysis demonstrated 239 unique genes in the day one dataset, 598 unique genes in the day three dataset, and 1,906 genes common to both datasets. Functional analyses demonstrated time-dependent, differential regulation of genes involved in multiple signaling pathways and gene networks primarily related to immunity and inflammation. Notably, multiple and distinct gene networks involving T cell- and MHC antigen-related biology were persistently downregulated on both day one and day three. Further analyses demonstrated large scale, persistent downregulation of genes corresponding to functional annotations related to zinc homeostasis. These data represent the largest reported cohort of patients with septic shock subjected to longitudinal genome-level expression profiling. The data further advance our genome-level understanding of pediatric septic shock and support novel hypotheses. PMID:17932561
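
    The Venn analysis mentioned in this abstract amounts to splitting two gene lists into three groups: genes unique to day one, genes unique to day three, and genes common to both. A minimal sketch with Python sets follows; the gene names are placeholders rather than genes from the study, and the function name is hypothetical.

```python
def venn_two(set_a, set_b):
    """Split two collections into the three regions of a two-set Venn diagram."""
    a, b = set(set_a), set(set_b)
    return {"only_a": a - b, "only_b": b - a, "both": a & b}


# Placeholder gene lists standing in for the day one and day three results.
day_one = {"GENE1", "GENE2", "GENE3"}
day_three = {"GENE2", "GENE3", "GENE4", "GENE5"}
parts = venn_two(day_one, day_three)
print({region: len(genes) for region, genes in parts.items()})
# {'only_a': 1, 'only_b': 2, 'both': 2}
```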

  9. An improved Pearson's correlation proximity-based hierarchical clustering for mining biological association between genes.

    PubMed

    Booma, P M; Prabhakaran, S; Dhanalakshmi, R

    2014-01-01

    Microarray gene expression datasets have attracted great attention among molecular biologists, statisticians, and computer scientists. Data mining that extracts the hidden and usual information from datasets fails to identify the most significant biological associations between genes. A search made with heuristic for standard biological process measures only the gene expression level, threshold, and response time. Heuristic search identifies and mines the best biological solution, but the association process was not efficiently addressed. To monitor a higher rate of expression levels between genes, a hierarchical clustering model was proposed, where the biological association between genes is measured simultaneously using a proximity measure of improved Pearson's correlation (PCPHC). Additionally, the Seed Augment algorithm adopts average linkage methods on rows and columns in order to expand a seed PCPHC model into a maximal global PCPHC (GL-PCPHC) model and to identify association between the clusters. Moreover, the GL-PCPHC model applies a pattern-growing method to mine the PCPHC patterns. Compared to existing gene expression analysis, the PCPHC model achieves better performance. Experimental evaluations are conducted for the GL-PCPHC model with standard benchmark gene expression datasets extracted from the UCI repository and the GenBank database in terms of execution time, size of pattern, significance level, biological association efficiency, and pattern quality.

  10. An Improved Pearson's Correlation Proximity-Based Hierarchical Clustering for Mining Biological Association between Genes

    PubMed Central

    Booma, P. M.; Prabhakaran, S.; Dhanalakshmi, R.

    2014-01-01

    Microarray gene expression datasets have attracted great attention among molecular biologists, statisticians, and computer scientists. Data mining that extracts the hidden and usual information from datasets fails to identify the most significant biological associations between genes. A heuristic search for a standard biological process measures only the gene expression level, threshold, and response time. Heuristic search identifies and mines the best biological solution, but the association process was not efficiently addressed. To monitor a higher rate of expression levels between genes, a hierarchical clustering model was proposed, where the biological association between genes is measured simultaneously using a proximity measure based on an improved Pearson's correlation (PCPHC). Additionally, the Seed Augment algorithm adopts average linkage methods on rows and columns in order to expand a seed PCPHC model into a maximal global PCPHC (GL-PCPHC) model and to identify associations between the clusters. Moreover, the GL-PCPHC model applies a pattern-growing method to mine the PCPHC patterns. Compared to existing gene expression analysis, the PCPHC model achieves better performance. Experimental evaluations are conducted for the GL-PCPHC model with standard benchmark gene expression datasets extracted from the UCI repository and the GenBank database in terms of execution time, size of pattern, significance level, biological association efficiency, and pattern quality. PMID:25136661
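
    As a rough illustration of the general idea behind this record, the sketch below clusters genes by average linkage using the distance 1 - r, where r is the ordinary Pearson correlation. It does not reproduce the authors' "improved" correlation or the Seed Augment algorithm, whose details are not given in the abstract; the gene names and expression values are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # undefined for constant profiles


def average_linkage(profiles, n_clusters):
    """Agglomerative clustering with distance 1 - r and average linkage."""
    names = list(profiles)
    dist = {
        frozenset((a, b)): 1.0 - pearson(profiles[a], profiles[b])
        for a, b in combinations(names, 2)
    }

    def linkage(c1, c2):
        # Mean pairwise distance between members of the two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [[name] for name in names]
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


# Hypothetical expression profiles across four samples.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.0, 6.0, 4.5, 2.0],
}
print(average_linkage(profiles, n_clusters=2))
```

    Here the two rising profiles and the two falling profiles end up in separate clusters, because profiles with correlation near +1 have distance near 0 while anticorrelated profiles have distance near 2.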

  11. Functional Analyses of NSF1 in Wine Yeast Using Interconnected Correlation Clustering and Molecular Analyses

    PubMed Central

    Bessonov, Kyrylo; Walkey, Christopher J.; Shelp, Barry J.; van Vuuren, Hennie J. J.; Chiu, David; van der Merwe, George

    2013-01-01

    Analyzing time-course expression data captured in microarray datasets is a complex undertaking as the vast and complex data space is represented by a relatively low number of samples as compared to thousands of available genes. Here, we developed the Interdependent Correlation Clustering (ICC) method to analyze relationships that exist among genes conditioned on the expression of a specific target gene in microarray data. Based on Correlation Clustering, the ICC method analyzes a large set of correlation values related to gene expression profiles extracted from given microarray datasets. ICC can be applied to any microarray dataset and any target gene. We applied this method to microarray data generated from wine fermentations and selected NSF1, which encodes a C2H2 zinc finger-type transcription factor, as the target gene. The validity of the method was verified by accurate identifications of the previously known functional roles of NSF1. In addition, we identified and verified potential new functions for this gene; specifically, NSF1 is a negative regulator for the expression of sulfur metabolism genes, the nuclear localization of Nsf1 protein (Nsf1p) is controlled in a sulfur-dependent manner, and the transcription of NSF1 is regulated by Met4p, an important transcriptional activator of sulfur metabolism genes. The inter-disciplinary approach adopted here highlighted the accuracy and relevancy of the ICC method in mining for novel gene functions using complex microarray datasets with a limited number of samples. PMID:24130853

  12. A Pilot Proteogenomic Study with Data Integration Identifies MCT1 and GLUT1 as Prognostic Markers in Lung Adenocarcinoma.

    PubMed

    Stewart, Paul A; Parapatics, Katja; Welsh, Eric A; Müller, André C; Cao, Haoyun; Fang, Bin; Koomen, John M; Eschrich, Steven A; Bennett, Keiryn L; Haura, Eric B

    2015-01-01

    We performed a pilot proteogenomic study to compare lung adenocarcinoma to lung squamous cell carcinoma using quantitative proteomics (6-plex TMT) combined with a customized Affymetrix GeneChip. Using MaxQuant software, we identified 51,001 unique peptides that mapped to 7,241 unique proteins and from these identified 6,373 genes with matching protein expression for further analysis. We found a minor correlation between gene expression and protein expression; both datasets were able to independently recapitulate known differences between the adenocarcinoma and squamous cell carcinoma subtypes. We found 565 proteins and 629 genes to be differentially expressed between adenocarcinoma and squamous cell carcinoma, with 113 of these consistently differentially expressed at both the gene and protein levels. We then compared our results to published adenocarcinoma versus squamous cell carcinoma proteomic data that we also processed with MaxQuant. We selected two proteins consistently overexpressed in squamous cell carcinoma in all studies, MCT1 (SLC16A1) and GLUT1 (SLC2A1), for further investigation. We found differential expression of these same proteins at the gene level in our study as well as in other public gene expression datasets. These findings combined with survival analysis of public datasets suggest that MCT1 and GLUT1 may be potential prognostic markers in adenocarcinoma and druggable targets in squamous cell carcinoma. Data are available via ProteomeXchange with identifier PXD002622.

  13. Seq-ing answers: uncovering the unexpected in global gene regulation.

    PubMed

    Otto, George Maxwell; Brar, Gloria Ann

    2018-04-19

    The development of techniques for measuring gene expression globally has greatly expanded our understanding of gene regulatory mechanisms in depth and scale. We can now quantify every intermediate and transition in the canonical pathway of gene expression (from DNA to mRNA to protein) genome-wide. Employing such measurements in parallel can produce rich datasets, but extracting the most information requires careful experimental design and analysis. Here, we argue for the value of genome-wide studies that measure multiple outputs of gene expression over many timepoints during the course of a natural developmental process. We discuss our findings from a highly parallel gene expression dataset of meiotic differentiation, and those of others, to illustrate how leveraging these features can provide new and surprising insight into fundamental mechanisms of gene regulation.

  14. DMirNet: Inferring direct microRNA-mRNA association networks.

    PubMed

    Lee, Minsu; Lee, HyungJune

    2016-12-05

    MicroRNAs (miRNAs) play important regulatory roles in the wide range of biological processes by inducing target mRNA degradation or translational repression. Based on the correlation between expression profiles of a miRNA and its target mRNA, various computational methods have previously been proposed to identify miRNA-mRNA association networks by incorporating the matched miRNA and mRNA expression profiles. However, there remain three major issues to be resolved in the conventional computation approaches for inferring miRNA-mRNA association networks from expression profiles. 1) Inferred correlations from the observed expression profiles using conventional correlation-based methods include numerous erroneous links or over-estimated edge weight due to the transitive information flow among direct associations. 2) Due to the high-dimension-low-sample-size problem on the microarray dataset, it is difficult to obtain an accurate and reliable estimate of the empirical correlations between all pairs of expression profiles. 3) Because the previously proposed computational methods usually suffer from varying performance across different datasets, a more reliable model that guarantees optimal or suboptimal performance across different datasets is highly needed. In this paper, we present DMirNet, a new framework for identifying direct miRNA-mRNA association networks. To tackle the aforementioned issues, DMirNet incorporates 1) three direct correlation estimation methods (namely Corpcor, SPACE, Network deconvolution) to infer direct miRNA-mRNA association networks, 2) the bootstrapping method to fully utilize insufficient training expression profiles, and 3) a rank-based Ensemble aggregation to build a reliable and robust model across different datasets. Our empirical experiments on three datasets demonstrate the combinatorial effects of necessary components in DMirNet. Additional performance comparison experiments show that DMirNet outperforms the state-of-the-art Ensemble-based model [1] which has shown the best performance across the same three datasets, with a factor of up to 1.29. Further, we identify 43 putative novel multi-cancer-related miRNA-mRNA association relationships from an inferred Top 1000 direct miRNA-mRNA association network. We believe that DMirNet is a promising method to identify novel direct miRNA-mRNA relations and to elucidate the direct miRNA-mRNA association networks. Since DMirNet infers direct relationships from the observed data, DMirNet can contribute to reconstructing various direct regulatory pathways, including, but not limited to, the direct miRNA-mRNA association networks.
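
    The abstract does not specify how the rank-based ensemble aggregation is computed, so the following is only a sketch of one common choice, averaging each edge's rank across methods. The edge labels, scores, and function names are hypothetical and are not DMirNet's actual interface.

```python
def rank_edges(scores):
    """Map each edge to its rank (1 = largest absolute score)."""
    ordered = sorted(scores, key=lambda edge: abs(scores[edge]), reverse=True)
    return {edge: rank for rank, edge in enumerate(ordered, start=1)}


def aggregate_by_mean_rank(score_tables):
    """Order edges by mean rank; assumes every table scores the same edges."""
    ranks = [rank_edges(table) for table in score_tables]
    edges = set(ranks[0])
    mean_rank = {e: sum(r[e] for r in ranks) / len(ranks) for e in edges}
    # Ties are broken by edge label so the ordering is deterministic.
    return sorted(edges, key=lambda e: (mean_rank[e], e))


# Hypothetical direct-association scores from three estimation methods.
e1, e2, e3 = ("miR-x", "geneA"), ("miR-x", "geneB"), ("miR-y", "geneA")
method_a = {e1: 0.9, e2: 0.2, e3: -0.5}
method_b = {e1: 0.7, e2: 0.1, e3: -0.8}
method_c = {e1: 0.6, e2: 0.3, e3: -0.4}

print(aggregate_by_mean_rank([method_a, method_b, method_c]))
```

    Working with ranks rather than raw scores keeps methods with different score scales comparable, which is one plausible reason to prefer a rank-based ensemble.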

  15. Effects of Larval Density on Gene Regulation in Caenorhabditis elegans During Routine L1 Synchronization.

    PubMed

    Chan, Io Long; Rando, Oliver J; Conine, Colin C

    2018-05-04

    Bleaching gravid C. elegans followed by a short period of starvation of the L1 larvae is a routine method performed by worm researchers for generating synchronous populations for experiments. During the process of investigating dietary effects on gene regulation in L1 stage worms by single-worm RNA-Seq, we found that the density of resuspended L1 larvae affects expression of many mRNAs. Specifically, a number of genes related to metabolism and signaling are highly expressed in worms arrested at low density, but are repressed at higher arrest densities. We generated a GFP reporter strain based on one of the most density-dependent genes in our dataset - lips-15 - and confirmed that this reporter was expressed specifically in worms arrested at relatively low density. Finally, we show that conditioned media from high density L1 cultures was able to downregulate lips-15 even in L1 animals arrested at low density, and experiments using daf-22 mutant animals demonstrated that this effect is not mediated by the ascaroside family of signaling pheromones. Together, our data implicate a soluble signaling molecule in density sensing by L1 stage C. elegans, and provide guidance for design of experiments focused on early developmental gene regulation. Copyright © 2018 Chan et al.

  16. Identification of Predictive Cis-Regulatory Elements Using a Discriminative Objective Function and a Dynamic Search Space

    PubMed Central

    Karnik, Rahul; Beer, Michael A.

    2015-01-01

    The generation of genomic binding or accessibility data from massively parallel sequencing technologies such as ChIP-seq and DNase-seq continues to accelerate. Yet state-of-the-art computational approaches for the identification of DNA binding motifs often yield motifs of weak predictive power. Here we present a novel computational algorithm called MotifSpec, designed to find predictive motifs, in contrast to over-represented sequence elements. The key distinguishing feature of this algorithm is that it uses a dynamic search space and a learned threshold to find discriminative motifs in combination with the modeling of motifs using a full PWM (position weight matrix) rather than k-mer words or regular expressions. We demonstrate that our approach finds motifs corresponding to known binding specificities in several mammalian ChIP-seq datasets, and that our PWMs classify the ChIP-seq signals with accuracy comparable to, or marginally better than motifs from the best existing algorithms. In other datasets, our algorithm identifies novel motifs where other methods fail. Finally, we apply this algorithm to detect motifs from expression datasets in C. elegans using a dynamic expression similarity metric rather than fixed expression clusters, and find novel predictive motifs. PMID:26465884
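
    For readers unfamiliar with position weight matrices, the sketch below shows how a PWM scores every window of a sequence, in contrast to matching a fixed k-mer. It is not MotifSpec: the count matrix, pseudocount, uniform background, and function names are all illustrative assumptions, and no learned threshold or dynamic search space is modelled.

```python
from math import log2


def log_odds_pwm(counts, pseudocount=1.0, background=0.25):
    """Turn per-position base counts into log2-odds scores."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: log2(((column.get(base, 0) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def best_hit(pwm, seq):
    """Return (score, start) of the highest-scoring window in seq."""
    width = len(pwm)
    scored = [
        (sum(pwm[i][seq[start + i]] for i in range(width)), start)
        for start in range(len(seq) - width + 1)
    ]
    return max(scored)


# Hypothetical counts from eight aligned sites that all read "TGA".
counts = [{"T": 8}, {"G": 8}, {"A": 8}]
pwm = log_odds_pwm(counts)
score, start = best_hit(pwm, "CCTGACC")
print(start, round(score, 3))
```

    A classifier built on such scores would compare the best window score against a cutoff; how that cutoff is chosen is exactly where discriminative methods differ from over-representation methods.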

  17. Identification of Predictive Cis-Regulatory Elements Using a Discriminative Objective Function and a Dynamic Search Space.

    PubMed

    Karnik, Rahul; Beer, Michael A

    2015-01-01

    The generation of genomic binding or accessibility data from massively parallel sequencing technologies such as ChIP-seq and DNase-seq continues to accelerate. Yet state-of-the-art computational approaches for the identification of DNA binding motifs often yield motifs of weak predictive power. Here we present a novel computational algorithm called MotifSpec, designed to find predictive motifs, in contrast to over-represented sequence elements. The key distinguishing feature of this algorithm is that it uses a dynamic search space and a learned threshold to find discriminative motifs in combination with the modeling of motifs using a full PWM (position weight matrix) rather than k-mer words or regular expressions. We demonstrate that our approach finds motifs corresponding to known binding specificities in several mammalian ChIP-seq datasets, and that our PWMs classify the ChIP-seq signals with accuracy comparable to, or marginally better than motifs from the best existing algorithms. In other datasets, our algorithm identifies novel motifs where other methods fail. Finally, we apply this algorithm to detect motifs from expression datasets in C. elegans using a dynamic expression similarity metric rather than fixed expression clusters, and find novel predictive motifs.

  18. Co-LncRNA: investigating the lncRNA combinatorial effects in GO annotations and KEGG pathways based on human RNA-Seq data

    PubMed Central

    Zhao, Zheng; Bai, Jing; Wu, Aiwei; Wang, Yuan; Zhang, Jinwen; Wang, Zishan; Li, Yongsheng; Xu, Juan; Li, Xia

    2015-01-01

    Long non-coding RNAs (lncRNAs) are emerging as key regulators of diverse biological processes and diseases. However, the combinatorial effects of these molecules in a specific biological function are poorly understood. Identifying co-expressed protein-coding genes of lncRNAs would provide ample insight into lncRNA functions. To facilitate such an effort, we have developed Co-LncRNA, which is a web-based computational tool that allows users to identify GO annotations and KEGG pathways that may be affected by co-expressed protein-coding genes of a single or multiple lncRNAs. LncRNA co-expressed protein-coding genes were first identified in publicly available human RNA-Seq datasets, including 241 datasets across 6560 total individuals representing 28 tissue types/cell lines. Then, the lncRNA combinatorial effects in a given GO annotations or KEGG pathways are taken into account by the simultaneous analysis of multiple lncRNAs in user-selected individual or multiple datasets, which is realized by enrichment analysis. In addition, this software provides a graphical overview of pathways that are modulated by lncRNAs, as well as a specific tool to display the relevant networks between lncRNAs and their co-expressed protein-coding genes. Co-LncRNA also supports users in uploading their own lncRNA and protein-coding gene expression profiles to investigate the lncRNA combinatorial effects. It will be continuously updated with more human RNA-Seq datasets on an annual basis. Taken together, Co-LncRNA provides a web-based application for investigating lncRNA combinatorial effects, which could shed light on their biological roles and could be a valuable resource for this community. Database URL: http://www.bio-bigdata.com/Co-LncRNA/ PMID:26363020

  19. Dynamic Modularity of Host Protein Interaction Networks in Salmonella Typhi Infection

    PubMed Central

    Dhal, Paltu Kumar; Barman, Ranjan Kumar; Saha, Sudipto; Das, Santasabuj

    2014-01-01

    Background Salmonella Typhi is a human-restricted pathogen, which causes typhoid fever and remains a global health problem in the developing countries. Although previously reported host expression datasets had identified putative biomarkers and therapeutic targets of typhoid fever, the underlying molecular mechanism of pathogenesis remains incompletely understood. Methods We used five gene expression datasets of human peripheral blood from patients suffering from S. Typhi or other bacteremic infections or non-infectious disease like leukemia. The expression datasets were merged into human protein interaction network (PIN) and the expression correlation between the hubs and their interacting proteins was measured by calculating Pearson Correlation Coefficient (PCC) values. The differences in the average PCC for each hub between the disease states and their respective controls were calculated for studied datasets. The individual hubs and their interactors with expression, PCC and average PCC values were treated as dynamic subnetworks. The hubs that showed unique trends of alterations specific to S. Typhi infection were identified. Results We identified S. Typhi infection-specific dynamic subnetworks of the host, which involve 81 hubs and 1343 interactions. The major enriched GO biological process terms in the identified subnetworks were regulation of apoptosis and biological adhesions, while the enriched pathways include cytokine signalling in the immune system and downstream TCR signalling. The dynamic nature of the hubs CCR1, IRS2 and PRKCA with their interactors was studied in detail. The difference in the dynamics of the subnetworks specific to S. Typhi infection suggests a potential molecular model of typhoid fever. Conclusions Hubs and their interactors of the S. Typhi infection-specific dynamic subnetworks carrying distinct PCC values compared with the non-typhoid and other disease conditions reveal new insight into the pathogenesis of S. Typhi. PMID:25144185
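
    The hub-level statistic described in the Methods (average PCC between a hub and its interactors, compared between disease and control) can be sketched as follows. The protein names, expression values, and function names are hypothetical; the study's actual preprocessing and network are not reproduced.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def hub_avg_pcc(expr, hub, interactors):
    """Average PCC between a hub and each of its interactors."""
    return mean(pearson(expr[hub], expr[p]) for p in interactors)


def avg_pcc_difference(disease, control, hub, interactors):
    """Disease-minus-control difference in the hub's average PCC."""
    return (hub_avg_pcc(disease, hub, interactors)
            - hub_avg_pcc(control, hub, interactors))


# Hypothetical expression across four samples per condition.
control = {"HUB": [1, 2, 3, 4], "P1": [2, 4, 6, 8], "P2": [1, 2, 3, 4]}
disease = {"HUB": [1, 2, 3, 4], "P1": [4, 3, 2, 1], "P2": [2, 4, 6, 8]}

print(avg_pcc_difference(disease, control, "HUB", ["P1", "P2"]))
```

    In this toy case the hub is perfectly co-expressed with both partners in controls but loses coordination with one partner in disease, so the difference is close to -1.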

  20. Exploring Transcription Factors-microRNAs Co-regulation Networks in Schizophrenia.

    PubMed

    Xu, Yong; Yue, Weihua; Yao Shugart, Yin; Li, Sheng; Cai, Lei; Li, Qiang; Cheng, Zaohuo; Wang, Guoqiang; Zhou, Zhenhe; Jin, Chunhui; Yuan, Jianmin; Tian, Lin; Wang, Jun; Zhang, Kai; Zhang, Kerang; Liu, Sha; Song, Yuqing; Zhang, Fuquan

    2016-07-01

    Transcriptional factors (TFs) and microRNAs (miRNAs) have been recognized as 2 classes of principal gene regulators that may be responsible for genome coexpression changes observed in schizophrenia (SZ). This study aims to (1) identify differentially coexpressed genes (DCGs) in 3 mRNA expression microarray datasets; (2) explore potential interactions among the DCGs, and differentially expressed miRNAs identified in our dataset composed of early-onset SZ patients and healthy controls; (3) validate expression levels of some key transcripts; and (4) explore the druggability of DCGs using the curated database. We detected a differential coexpression network associated with SZ and found that 9 out of the 12 regulators were replicated in either of the 2 other datasets. Leveraging the differentially expressed miRNAs identified in our previous dataset, we constructed a miRNA-TF-gene network relevant to SZ, including an EGR1-miR-124-3p-SKIL feed-forward loop. Our real-time quantitative PCR analysis indicated the overexpression of miR-124-3p, the under expression of SKIL and EGR1 in the blood of SZ patients compared with controls, and the direction of change of miR-124-3p and SKIL mRNA levels in SZ cases were reversed after a 12-week treatment cycle. Our druggability analysis revealed that many of these genes have the potential to be drug targets. Together, our results suggest that coexpression network abnormalities driven by combinatorial and interactive action from TFs and miRNAs may contribute to the development of SZ and be relevant to the clinical treatment of the disease. © The Author 2015. Published by Oxford University Press on behalf of the Maryland Psychiatric Research Center. All rights reserved. For permissions, please email: journals.permissions@oup.com.

  1. Exploring Transcription Factors-microRNAs Co-regulation Networks in Schizophrenia

    PubMed Central

    Xu, Yong; Yue, Weihua; Yao Shugart, Yin; Li, Sheng; Cai, Lei; Li, Qiang; Cheng, Zaohuo; Wang, Guoqiang; Zhou, Zhenhe; Jin, Chunhui; Yuan, Jianmin; Tian, Lin; Wang, Jun; Zhang, Kai; Zhang, Kerang; Liu, Sha; Song, Yuqing; Zhang, Fuquan

    2016-01-01

    Background: Transcriptional factors (TFs) and microRNAs (miRNAs) have been recognized as 2 classes of principal gene regulators that may be responsible for genome coexpression changes observed in schizophrenia (SZ). Methods: This study aims to (1) identify differentially coexpressed genes (DCGs) in 3 mRNA expression microarray datasets; (2) explore potential interactions among the DCGs, and differentially expressed miRNAs identified in our dataset composed of early-onset SZ patients and healthy controls; (3) validate expression levels of some key transcripts; and (4) explore the druggability of DCGs using the curated database. Results: We detected a differential coexpression network associated with SZ and found that 9 out of the 12 regulators were replicated in either of the 2 other datasets. Leveraging the differentially expressed miRNAs identified in our previous dataset, we constructed a miRNA–TF–gene network relevant to SZ, including an EGR1–miR-124-3p–SKIL feed-forward loop. Our real-time quantitative PCR analysis indicated the overexpression of miR-124-3p, the under expression of SKIL and EGR1 in the blood of SZ patients compared with controls, and the direction of change of miR-124-3p and SKIL mRNA levels in SZ cases were reversed after a 12-week treatment cycle. Our druggability analysis revealed that many of these genes have the potential to be drug targets. Conclusions: Together, our results suggest that coexpression network abnormalities driven by combinatorial and interactive action from TFs and miRNAs may contribute to the development of SZ and be relevant to the clinical treatment of the disease. PMID:26609121

  2. A multiple kernel support vector machine scheme for feature selection and rule extraction from gene expression data of cancer tissue.

    PubMed

    Chen, Zhenyu; Li, Jianping; Wei, Liwei

    2007-10-01

    Recently, gene expression profiling using microarray techniques has been shown to be a promising tool to improve the diagnosis and treatment of cancer. Gene expression data contain a high level of noise and an overwhelming number of genes relative to the number of available samples. This poses a great challenge for machine learning and statistical techniques. Support vector machine (SVM) has been successfully used to classify gene expression data of cancer tissue. In the medical field, it is crucial to deliver the user a transparent decision process. How to explain the computed solutions and present the extracted knowledge becomes a main obstacle for SVM. A multiple kernel support vector machine (MK-SVM) scheme, consisting of feature selection, rule extraction and prediction modeling, is proposed to improve the explanation capacity of SVM. In this scheme, we show that the feature selection problem can be translated into an ordinary multiple parameters learning problem, and a shrinkage approach, 1-norm based linear programming, is proposed to obtain the sparse parameters and the corresponding selected features. We propose a novel rule extraction approach using the information provided by the separating hyperplane and support vectors to improve the generalization capacity and comprehensibility of rules and reduce the computational complexity. Two public gene expression datasets, a leukemia dataset and a colon tumor dataset, are used to demonstrate the performance of this approach. Using a small number of selected genes, MK-SVM achieves encouraging classification accuracy: more than 90% for both datasets. Moreover, very simple rules with linguistic labels are extracted. The rule sets have high diagnostic power because of their good classification performance.

  3. GiniClust: detecting rare cell types from single-cell gene expression data with Gini index.

    PubMed

    Jiang, Lan; Chen, Huidong; Pinello, Luca; Yuan, Guo-Cheng

    2016-07-01

    High-throughput single-cell technologies have great potential to discover new cell types; however, it remains challenging to detect rare cell types that are distinct from a large population. We present a novel computational method, called GiniClust, to overcome this challenge. Validation against a benchmark dataset indicates that GiniClust achieves high sensitivity and specificity. Application of GiniClust to public single-cell RNA-seq datasets uncovers previously unrecognized rare cell types, including Zscan4-expressing cells within mouse embryonic stem cells and hemoglobin-expressing cells in the mouse cortex and hippocampus. GiniClust also correctly detects a small number of normal cells that are mixed in a cancer cell population.
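
    The Gini index that gives the method its name measures how unevenly a quantity is spread across a population; a gene expressed in only a few cells scores high. The sketch below computes the standard Gini index for one gene's expression vector. It is not the GiniClust pipeline (gene selection, normalization, and clustering are omitted), and the example values are hypothetical.

```python
def gini(values):
    """Gini index of non-negative values (0 = even, near 1 = concentrated)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula on sorted data with 1-based positions.
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n


# Hypothetical expression of two genes across four cells.
housekeeping = [5, 5, 5, 5]   # evenly expressed
rare_marker = [0, 0, 0, 10]   # expressed in a single cell

print(gini(housekeeping), gini(rare_marker))
```

    For n cells the maximum value is (n - 1) / n, reached when a single cell carries all of the expression, which is why a rare-cell marker stands out on this scale.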

  4. RNA Deep Sequencing Reveals Differential MicroRNA Expression during Development of Sea Urchin and Sea Star

    PubMed Central

    Kadri, Sabah; Hinman, Veronica F.; Benos, Panayiotis V.

    2011-01-01

    microRNAs (miRNAs) are small (20–23 nt), non-coding single stranded RNA molecules that act as post-transcriptional regulators of mRNA gene expression. They have been implicated in regulation of developmental processes in diverse organisms. The echinoderms, Strongylocentrotus purpuratus (sea urchin) and Patiria miniata (sea star) are excellent model organisms for studying development with well-characterized transcriptional networks. However, to date, nothing is known about the role of miRNAs during development in these organisms, except that the genes that are involved in the miRNA biogenesis pathway are expressed during their developmental stages. In this paper, we used Illumina Genome Analyzer (Illumina, Inc.) to sequence small RNA libraries in mixed stage population of embryos from one to three days after fertilization of sea urchin and sea star (total of 22,670,000 reads). Analysis of these data revealed the miRNA populations in these two species. We found that 47 and 38 known miRNAs are expressed in sea urchin and sea star, respectively, during early development (32 in common). We also found 13 potentially novel miRNAs in the sea urchin embryonic library. miRNA expression is generally conserved between the two species during development, but 7 miRNAs are highly expressed in only one species. We expect that our two datasets will be a valuable resource for everyone working in the field of developmental biology and the regulatory networks that affect it. The computational pipeline to analyze Illumina reads is available at http://www.benoslab.pitt.edu/services.html. PMID:22216218

  5. Properties of Protein Drug Target Classes

    PubMed Central

    Bull, Simon C.; Doig, Andrew J.

    2015-01-01

    Accurate identification of drug targets is a crucial part of any drug development program. We mined the human proteome to discover properties of proteins that may be important in determining their suitability for pharmaceutical modulation. Data was gathered concerning each protein’s sequence, post-translational modifications, secondary structure, germline variants, expression profile and drug target status. The data was then analysed to determine features for which the target and non-target proteins had significantly different values. This analysis was repeated for subsets of the proteome consisting of all G-protein coupled receptors, ion channels, kinases and proteases, as well as proteins that are implicated in cancer. Machine learning was used to quantify the proteins in each dataset in terms of their potential to serve as a drug target. This was accomplished by first inducing a random forest that could distinguish between its targets and non-targets, and then using the random forest to quantify the drug target likeness of the non-targets. The properties that can best differentiate targets from non-targets were primarily those that are directly related to a protein’s sequence (e.g. secondary structure). Germline variants, expression levels and interactions between proteins had minimal discriminative power. Overall, the best indicators of drug target likeness were found to be the proteins’ hydrophobicities, in vivo half-lives, propensity for being membrane bound and the fraction of non-polar amino acids in their sequences. In terms of predicting potential targets, datasets of proteases, ion channels and cancer proteins were able to induce random forests that were highly capable of distinguishing between targets and non-targets. The non-target proteins predicted to be targets by these random forests comprise the set of the most suitable potential future drug targets, and should therefore be prioritised when building a drug development programme. PMID:25822509

  6. Insights on the Martian water cycle through the SPICAM/MEx retrievals of the H2O vertical distribution

    NASA Astrophysics Data System (ADS)

    Maltagliati, Luca; Montmessin, Franck; Fedorova, Anna; Bertaux, Jean-Loup; Korablev, Oleg

    In the pre-Mars Express era, only very sparse measurements of the vertical profile of water vapor existed, with limited temporal and spatial coverage. Thus, knowledge of the H2O distribution through the atmosphere relied almost exclusively on General Circulation Models. The vertical distribution of water vapor nonetheless provides otherwise unobtainable information on important characteristics of the Martian water cycle, such as the role of sources and sinks, phase changes, and the influence of clouds. Several other potentially significant phenomena, such as the presence of supersaturation, the deposition of water vapor in the layer just below the saturation height, and the formation of ice particles and water ice clouds, can be observed and studied in detail for the first time. The infrared channel of the SPICAM spectrometer onboard Mars Express, used in solar occultation mode, allows simultaneous retrieval of the vertical profile of H2O, CO2, and aerosol properties. This dataset is thus perfectly suited to enhance our knowledge of the vertical structure of the atmosphere of Mars, covering more than three full Martian years with good temporal and spatial distribution. We present the main results from the analysis of water vapor profiles, and their implication for the behavior of the water cycle on Mars. A comparison with the output from the state-of-the-art General Circulation Model developed at the Laboratoire de Météorologie Dynamique in Paris (LMD-GCM) is performed, in order to understand the consequences of this dataset for the current knowledge of the physics and microphysics of water in the Martian atmosphere. In particular, the currently accepted assumption that the distribution of water in the atmosphere is controlled by saturation physics is tested, and the consequences of the departure from this assumption are analysed in detail.

  7. CD147 expression predicts biochemical recurrence after prostatectomy independent of histologic and pathologic features.

    PubMed

    Bauman, Tyler M; Ewald, Jonathan A; Huang, Wei; Ricke, William A

    2015-07-25

    CD147 is an MMP-inducing protein often implicated in cancer progression. The purpose of this study was to investigate the expression of CD147 in prostate cancer (PCa) progression and the prognostic ability of CD147 in predicting biochemical recurrence after prostatectomy. Plasma membrane-localized CD147 protein expression was quantified in patient samples using immunohistochemistry and multispectral imaging, and expression was compared to clinico-pathological features (pathologic stage, Gleason score, tumor volume, preoperative PSA, lymph node status, surgical margins, biochemical recurrence status). CD147 specificity and expression were confirmed with immunoblotting of prostate cell lines, and CD147 mRNA expression was evaluated in public expression microarray datasets of patient prostate tumors. Expression of CD147 protein was significantly decreased in localized tumors (pT2; p = 0.02) and aggressive PCa (≥pT3; p = 0.004), and metastases (p = 0.001) compared to benign prostatic tissue. Decreased CD147 was associated with advanced pathologic stage (p = 0.009) and high Gleason score (p = 0.02), and low CD147 expression predicted biochemical recurrence (HR 0.55; 95% CI 0.31-0.97; p = 0.04) independent of clinico-pathologic features. Immunoblot bands were detected at 44 kDa and 66 kDa, representing non-glycosylated and glycosylated forms of CD147 protein, and CD147 expression was lower in tumorigenic T10 cells than non-tumorigenic BPH-1 cells (p = 0.02). Decreased CD147 mRNA expression was associated with increased Gleason score and pathologic stage in patient tumors but is not associated with recurrence status. Membrane-associated CD147 expression is significantly decreased in PCa compared to non-malignant prostate tissue and is associated with tumor progression, and low CD147 expression predicts biochemical recurrence after prostatectomy independent of pathologic stage, Gleason score, lymph node status, surgical margins, and tumor volume in multivariable analysis.

  8. An interactive web application for the dissemination of human systems immunology data.

    PubMed

    Speake, Cate; Presnell, Scott; Domico, Kelly; Zeitner, Brad; Bjork, Anna; Anderson, David; Mason, Michael J; Whalen, Elizabeth; Vargas, Olivia; Popov, Dimitry; Rinchai, Darawan; Jourde-Chiche, Noemie; Chiche, Laurent; Quinn, Charlie; Chaussabel, Damien

    2015-06-19

    Systems immunology approaches have proven invaluable in translational research settings. The current rate at which large-scale datasets are generated presents unique challenges and opportunities. Mining aggregates of these datasets could accelerate the pace of discovery, but new solutions are needed to integrate the heterogeneous data types with the contextual information that is necessary for interpretation. In addition, enabling tools and technologies facilitating investigators' interaction with large-scale datasets must be developed in order to promote insight and foster knowledge discovery. State-of-the-art application programming was employed to develop an interactive web application for browsing and visualizing large and complex datasets. A collection of human immune transcriptome datasets was loaded alongside contextual information about the samples. We provide a resource enabling interactive query and navigation of transcriptome datasets relevant to human immunology research. Detailed information about studies and samples is displayed dynamically; if desired, the associated data can be downloaded. Custom interactive visualizations of the data can be shared via email or social media. This application can be used to browse context-rich systems-scale data within and across systems immunology studies. This resource is publicly available online at the Gene Expression Browser landing page (https://gxb.benaroyaresearch.org/dm3/landing.gsp). The source code is also openly available (https://github.com/BenaroyaResearch/gxbrowser). We have developed a data browsing and visualization application capable of navigating increasingly large and complex datasets generated in the context of immunological studies. This intuitive tool ensures that, whether taken individually or as a whole, such datasets generated at great effort and expense remain interpretable and a ready source of insight for years to come.

  9. Pathway activity inference for multiclass disease classification through a mathematical programming optimisation framework.

    PubMed

    Yang, Lingjian; Ainali, Chrysanthi; Tsoka, Sophia; Papageorgiou, Lazaros G

    2014-12-05

    Applying machine learning methods to microarray gene expression profiles for disease classification problems is a popular way to derive biomarkers, i.e. sets of genes that can predict disease state or outcome. Traditional approaches, in which the expression of each gene is treated independently, suffer from low prediction accuracy and difficulty of biological interpretation. Current research efforts focus on integrating information on protein interactions through biochemical pathway datasets with expression profiles to propose pathway-based classifiers that can enhance disease diagnosis and prognosis. As most of the pathway activity inference methods in the literature are either unsupervised or applied to two-class datasets, there is good scope to address such limitations by proposing novel methodologies. A supervised multiclass pathway activity inference method using optimisation techniques is reported. For each pathway expression dataset, patterns of its constituent genes are summarised into one composite feature, termed pathway activity, and a novel mathematical programming model is proposed to infer this feature as a weighted linear summation of the expression of its constituent genes. Gene weights are determined by the optimisation model, in such a way that the resulting pathway activity has the optimal discriminative power with regard to disease phenotypes. Classification is then performed on the resulting low-dimensional pathway activity profile. The model was evaluated on a variety of published gene expression profiles that cover different types of disease. We show that not only does it improve classification accuracy, but it can also perform well on multiclass disease datasets, a limitation of other approaches from the literature. Desirable features of the model include the ability to control the maximum number of genes that may participate in determining pathway activity, which may be pre-specified by the user.
Overall, this work highlights the potential of building pathway-based multi-phenotype classifiers for accurate disease diagnosis and prognosis problems.
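
    The composite feature described above, pathway activity as a weighted linear summation of constituent-gene expression, can be illustrated with a minimal sketch. The function name, the example expression values, and the weights are hypothetical; in the reported method the weights come from the mathematical programming model, which is not reproduced here.

```python
def pathway_activity(expression, weights):
    """Return one pathway-activity value per sample.

    expression: list of samples, each a list of expression values for the
                pathway's constituent genes (same gene order in every sample).
    weights:    one weight per constituent gene (hypothetical values here;
                the abstract derives them from an optimisation model).
    """
    for sample in expression:
        if len(sample) != len(weights):
            raise ValueError("each sample needs one value per weighted gene")
    # Weighted linear summation over the constituent genes of the pathway.
    return [sum(w * x for w, x in zip(weights, sample)) for sample in expression]


# Hypothetical pathway with three genes and two samples.
expr = [[1.0, 2.0, 0.5],
        [0.0, 1.0, 2.0]]
w = [0.5, 0.0, -1.0]  # a zero weight leaves that gene out of the pathway activity
activity = pathway_activity(expr, w)
```

    A zero weight removes a gene from the composite feature, which is one simple way to picture the cap on participating genes mentioned in the abstract.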

  10. Wide-Open: Accelerating public data release by automating detection of overdue datasets

    PubMed Central

    Poon, Hoifung; Howe, Bill

    2017-01-01

    Open data is a vital pillar of open science and a key enabler for reproducibility, data reuse, and novel discoveries. Enforcement of open-data policies, however, largely relies on manual efforts, which invariably lag behind the increasingly automated generation of biological data. To address this problem, we developed a general approach to automatically identify datasets overdue for public release by applying text mining to identify dataset references in published articles and parse query results from repositories to determine if the datasets remain private. We demonstrate the effectiveness of this approach on 2 popular National Center for Biotechnology Information (NCBI) repositories: Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA). Our Wide-Open system identified a large number of overdue datasets, which spurred administrators to respond directly by releasing 400 datasets in one week. PMID:28594819
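
    The text-mining step, identifying dataset references in article text, can be sketched with a regular expression. This is an illustrative assumption rather than the Wide-Open implementation: the pattern covers only GEO series accessions (GSE plus digits) and a few SRA prefixes (SRP, SRR, SRX), and the follow-up repository query that checks whether a dataset is still private is not shown.

```python
import re

# Simplified accession pattern (assumption): GEO series and SRA study/run/experiment IDs.
ACCESSION_RE = re.compile(r"\b(?:GSE\d+|SR[PRX]\d+)\b")


def find_accessions(text):
    """Return the unique dataset accessions mentioned in text, sorted alphabetically."""
    return sorted(set(ACCESSION_RE.findall(text)))


# Hypothetical article sentence used only for illustration.
sentence = ("Data are available at GEO (GSE12345) and SRA (SRP067890); "
            "see also GSE12345.")
found = find_accessions(sentence)
```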

  11. Wide-Open: Accelerating public data release by automating detection of overdue datasets.

    PubMed

    Grechkin, Maxim; Poon, Hoifung; Howe, Bill

    2017-06-01

    Open data is a vital pillar of open science and a key enabler for reproducibility, data reuse, and novel discoveries. Enforcement of open-data policies, however, largely relies on manual efforts, which invariably lag behind the increasingly automated generation of biological data. To address this problem, we developed a general approach to automatically identify datasets overdue for public release by applying text mining to identify dataset references in published articles and parse query results from repositories to determine if the datasets remain private. We demonstrate the effectiveness of this approach on 2 popular National Center for Biotechnology Information (NCBI) repositories: Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA). Our Wide-Open system identified a large number of overdue datasets, which spurred administrators to respond directly by releasing 400 datasets in one week.

  12. A formal concept analysis approach to consensus clustering of multi-experiment expression data

    PubMed Central

    2014-01-01

    Background Presently, with the increasing number and complexity of available gene expression datasets, the combination of data from multiple microarray studies addressing a similar biological question is gaining importance. The analysis and integration of multiple datasets are expected to yield more reliable and robust results since they are based on a larger number of samples and the effects of the individual study-specific biases are diminished. This is supported by recent studies suggesting that important biological signals are often preserved or enhanced by multiple experiments. One approach to combining data from different experiments is the aggregation of their clusterings into a consensus or representative clustering solution, which increases the confidence in the common features of all the datasets and reveals the important differences among them. Results We propose a novel generic consensus clustering technique that applies a Formal Concept Analysis (FCA) approach for the consolidation and analysis of clustering solutions derived from several microarray datasets. These datasets are initially divided into groups of related experiments with respect to a predefined criterion. Subsequently, a consensus clustering algorithm is applied to each group, resulting in a clustering solution per group. These solutions are pooled together and further analysed by employing FCA, which allows extracting valuable insights from the data and generating a gene partition over all the experiments. In order to validate the FCA-enhanced approach, two consensus clustering algorithms are adapted to incorporate the FCA analysis. Their performance is evaluated on gene expression data from a multi-experiment study examining the global cell-cycle control of fission yeast.
The FCA results derived from both methods demonstrate that, although the two algorithms optimize different clustering characteristics, FCA is able to overcome and diminish these differences and preserve some relevant biological signals. Conclusions The proposed FCA-enhanced consensus clustering technique is a general approach to the combination of clustering algorithms with FCA for deriving clustering solutions from multiple gene expression matrices. The experimental results presented herein demonstrate that it is a robust data integration technique able to produce a good-quality clustering solution that is representative of the whole set of expression matrices. PMID:24885407

  13. An efficient annotation and gene-expression derivation tool for Illumina Solexa datasets.

    PubMed

    Hosseini, Parsa; Tremblay, Arianne; Matthews, Benjamin F; Alkharouf, Nadim W

    2010-07-02

    An Illumina flow cell with all eight lanes occupied produces well over a terabyte of images, with gigabytes of reads following sequence alignment. The ability to translate such reads into meaningful annotation is therefore of great concern and importance. One can very easily be flooded with such a great volume of textual, unannotated data irrespective of read quality or size. CASAVA, an optional analysis tool for Illumina sequencing experiments, makes it possible to examine INDEL detection, SNP information, and allele calling. Extracting from such analysis a measure of gene expression in the form of tag-counts, and furthermore annotating such reads, is therefore of significant value. We developed TASE (Tag counting and Analysis of Solexa Experiments), a rapid tag-counting and annotation software tool specifically designed for Illumina CASAVA sequencing datasets. Developed in Java and deployed using the jTDS JDBC driver and a SQL Server backend, TASE provides an extremely fast means of calculating gene expression through tag-counts while annotating sequenced reads with the gene's presumed function, from any given CASAVA-build. Such a build is generated for both DNA and RNA sequencing. Analysis is broken into two distinct components: DNA sequence or read concatenation, followed by tag-counting and annotation. The end result is output containing the homology-based functional annotation and the respective gene expression measure, signifying how many times sequenced reads were found within the genomic ranges of functional annotations. TASE is a powerful tool to facilitate the process of annotating a given Illumina Solexa sequencing dataset. Our results indicate that both homology-based annotation and tag-count analysis are achieved in very efficient times, allowing researchers to delve deep into a given CASAVA-build and maximize information extraction from a sequencing dataset.
TASE is specially designed to translate sequence data in a CASAVA-build into functional annotations while producing corresponding gene expression measurements. Such analysis is executed in an ultrafast and highly efficient manner, whether the experiment is single-read or paired-end. TASE is a user-friendly and freely available application, allowing rapid analysis and annotation of any given Illumina Solexa sequencing dataset with ease.
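
    The tag-count measure described above, how many sequenced reads fall within the genomic range of each functional annotation, can be sketched as follows. This is not TASE code (TASE is written in Java against a SQL Server backend); the function name, the single-chromosome simplification, and the example coordinates are assumptions for illustration.

```python
def count_tags(read_positions, annotations):
    """Count reads whose position lies inside each annotated range.

    read_positions: iterable of integer read start positions (one chromosome assumed).
    annotations:    list of (gene_name, start, end) tuples with inclusive coordinates.
    """
    counts = {gene: 0 for gene, _, _ in annotations}
    for pos in read_positions:
        for gene, start, end in annotations:
            if start <= pos <= end:
                counts[gene] += 1
    return counts


# Hypothetical annotations and read positions.
genes = [("geneA", 100, 200), ("geneB", 300, 400)]
reads = [150, 199, 250, 300, 401]
tag_counts = count_tags(reads, genes)
```

    The nested loop is quadratic in the worst case; a production tool would more plausibly use sorted intervals or a database index, which is outside the scope of this sketch.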

  14. A Filter Feature Selection Method Based on MFA Score and Redundancy Excluding and It's Application to Tumor Gene Expression Data Analysis.

    PubMed

    Li, Jiangeng; Su, Lei; Pang, Zenan

    2015-12-01

    Feature selection techniques have been widely applied to tumor gene expression data analysis in recent years. A filter feature selection method named marginal Fisher analysis score (MFA score) which is based on graph embedding has been proposed, and it has been widely used mainly because it is superior to Fisher score. Considering the heavy redundancy in gene expression data, we proposed a new filter feature selection technique in this paper. It is named MFA score+ and is based on MFA score and redundancy excluding. We applied it to an artificial dataset and eight tumor gene expression datasets to select important features and then used support vector machine as the classifier to classify the samples. Compared with MFA score, t test and Fisher score, it achieved higher classification accuracy.
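
    For orientation, the Fisher score used above as a comparison baseline can be computed per feature as the ratio of between-class scatter to within-class scatter. The sketch below uses one common formulation (class-size-weighted squared mean differences over class-size-weighted population variances); it is an assumption about the exact variant and does not implement MFA score or MFA score+.

```python
from statistics import fmean, pvariance


def fisher_score(values, labels):
    """Fisher score of a single feature given per-sample class labels."""
    overall_mean = fmean(values)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    # Between-class scatter: class size times squared distance of class mean to overall mean.
    between = sum(len(g) * (fmean(g) - overall_mean) ** 2 for g in groups.values())
    # Within-class scatter: class size times population variance inside the class.
    within = sum(len(g) * pvariance(g) for g in groups.values())
    return between / within if within > 0 else float("inf")


# Hypothetical expression values of one gene in two classes of three samples each.
score = fisher_score([1.0, 2.0, 3.0, 7.0, 8.0, 9.0], [0, 0, 0, 1, 1, 1])
```

    Ranking genes by this score ignores redundancy between genes, which is the gap the redundancy-excluding step in the abstract is meant to address.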

  15. Automated Discovery of Functional Generality of Human Gene Expression Programs

    PubMed Central

    Gerber, Georg K; Dowell, Robin D; Jaakkola, Tommi S; Gifford, David K

    2007-01-01

    An important research problem in computational biology is the identification of expression programs, sets of co-expressed genes orchestrating normal or pathological processes, and the characterization of the functional breadth of these programs. The use of human expression data compendia for discovery of such programs presents several challenges including cellular inhomogeneity within samples, genetic and environmental variation across samples, uncertainty in the numbers of programs and sample populations, and temporal behavior. We developed GeneProgram, a new unsupervised computational framework based on Hierarchical Dirichlet Processes that addresses each of the above challenges. GeneProgram uses expression data to simultaneously organize tissues into groups and genes into overlapping programs with consistent temporal behavior, to produce maps of expression programs, which are sorted by generality scores that exploit the automatically learned groupings. Using synthetic and real gene expression data, we showed that GeneProgram outperformed several popular expression analysis methods. We applied GeneProgram to a compendium of 62 short time-series gene expression datasets exploring the responses of human cells to infectious agents and immune-modulating molecules. GeneProgram produced a map of 104 expression programs, a substantial number of which were significantly enriched for genes involved in key signaling pathways and/or bound by NF-κB transcription factors in genome-wide experiments. Further, GeneProgram discovered expression programs that appear to implicate surprising signaling pathways or receptor types in the response to infection, including Wnt signaling and neurotransmitter receptors. 
We believe the discovered map of expression programs involved in the response to infection will be useful for guiding future biological experiments; genes from programs with low generality scores might serve as new drug targets that exhibit minimal “cross-talk,” and genes from high generality programs may maintain common physiological responses that go awry in disease states. Further, our method is multipurpose, and can be applied readily to novel compendia of biological data. PMID:17696603

  16. Novel harmonic regularization approach for variable selection in Cox's proportional hazards model.

    PubMed

    Chu, Ge-Jin; Liang, Yong; Wang, Jia-Xuan

    2014-01-01

    Variable selection is an important issue in regression, and a number of variable selection methods involving nonconvex penalty functions have been proposed. In this paper, we investigate a novel harmonic regularization method, which can approximate nonconvex Lq (1/2 < q < 1) regularizations, to select key risk factors in Cox's proportional hazards model using microarray gene expression data. The harmonic regularization method can be efficiently solved using our proposed direct path seeking approach, which can produce solutions that closely approximate those for the convex loss function and the nonconvex regularization. Simulation results based on artificial datasets and four real microarray gene expression datasets, including the diffuse large B-cell lymphoma (DLBCL), lung cancer, and AML datasets, show that the harmonic regularization method can be more accurate for variable selection than existing Lasso series methods.

  17. dbMDEGA: a database for meta-analysis of differentially expressed genes in autism spectrum disorder.

    PubMed

    Zhang, Shuyun; Deng, Libin; Jia, Qiyue; Huang, Shaoting; Gu, Junwang; Zhou, Fankun; Gao, Meng; Sun, Xinyi; Feng, Chang; Fan, Guangqin

    2017-11-16

    Autism spectrum disorders (ASD) are hereditary, heterogeneous and biologically complex neurodevelopmental disorders. Individual studies on gene expression in ASD cannot provide clear consensus conclusions. Therefore, a systematic review to synthesize the current findings from brain tissues and a search tool to share the meta-analysis results are urgently needed. Here, we conducted a meta-analysis of brain gene expression profiles in the currently reported human ASD expression datasets (with 84 frozen male cortex samples, 17 female cortex samples, 32 cerebellum samples and 4 formalin fixed samples) and knock-out mouse ASD model expression datasets (with 80 collective brain samples). Then, we applied R language software and developed an interactive, shared and updated database (dbMDEGA) displaying the results of meta-analysis of data from ASD studies regarding differentially expressed genes (DEGs) in the brain. This database, dbMDEGA ( https://dbmdega.shinyapps.io/dbMDEGA/ ), is a publicly available web-portal for manual annotation and visualization of DEGs in the brain using data from ASD studies. This database uniquely presents meta-analysis values and homologous forest plots of DEGs in brain tissues. Gene entries are annotated with meta-values, statistical values and forest plots of DEGs in brain samples. This database aims to provide searchable meta-analysis results based on the currently reported brain gene expression datasets of ASD to help detect candidate genes underlying this disorder. This new analytical tool may provide valuable assistance in the discovery of DEGs and the elucidation of the molecular pathogenicity of ASD. This database model may be replicated to study other disorders.

  18. Profiling RNA editing in human tissues: towards the inosinome Atlas

    PubMed Central

    Picardi, Ernesto; Manzari, Caterina; Mastropasqua, Francesca; Aiello, Italia; D’Erchia, Anna Maria; Pesole, Graziano

    2015-01-01

    Adenine to Inosine RNA editing is a widespread co- and post-transcriptional mechanism mediated by ADAR enzymes acting on double stranded RNA. It has a plethora of biological effects, appears to be particularly pervasive in humans with respect to other mammals, and is implicated in a number of diverse human pathologies. Here we present the first human inosinome atlas comprising 3,041,422 A-to-I events identified in six tissues from three healthy individuals. Matched directional total-RNA-Seq and whole genome sequence datasets were generated and analysed within a dedicated computational framework, also capable of detecting hyper-edited reads. Inosinome profiles are tissue specific and edited gene sets consistently show enrichment of genes involved in neurological disorders and cancer. Overall frequency of editing also varies, but is strongly correlated with ADAR expression levels. The inosinome database is available at: http://srv00.ibbe.cnr.it/editing/. PMID:26449202

  19. Nursery, gutter, or anatomy class? Obscene expression in consumer health

    PubMed Central

    Smith, Catherine Arnott

    2007-01-01

    This paper presents results of a consumer health vocabulary study of text appearing on Web-based bulletin boards. Consumers used obscenities and euphemisms to refer to certain body parts, functions, and behaviors. The female genitalia are the body region most often described with an obscenity (29% of all instances); male genitalia, in contrast, were rendered as obscene only 3% of the time. Consumers responding on the bulletin boards appear genuinely to prefer euphemistic slang and baby talk (62%) over obscenities (24%) when referring to the buttocks. From an anatomical perspective, this large dataset reveals a consumer health vocabulary of euphemisms and outright obscenities coexisting with professional medical terminology. The evident preference for euphemisms and slang for some anatomical parts has important implications for the design of health information controlled vocabularies and translation systems, faced with a lay language more informal than expected. PMID:18693922

  20. Association of tRNA methyltransferase NSUN2/IGF-II molecular signature with ovarian cancer survival.

    PubMed

    Yang, Jia-Cheng; Risch, Eric; Zhang, Meiqin; Huang, Chan; Huang, Huatian; Lu, Lingeng

    2017-09-01

    To investigate the association between NSUN2/IGF-II signature and ovarian cancer survival. Using a publicly accessible dataset of RNA sequencing and clinical follow-up data, we performed Classification and Regression Tree and survival analyses. Patients with NSUN2-high/IGF-II-low had significantly superior overall and disease progression-free survival, followed by NSUN2-low/IGF-II-low, NSUN2-high/IGF-II-high and NSUN2-low/IGF-II-high (p < 0.0001 for overall, p = 0.0024 for progression-free survival, respectively). The associations of NSUN2/IGF-II signature with the risks of death and relapse remained significant in multivariate Cox regression models. Random-effects meta-analyses show upregulated NSUN2 and IGF-II expression in ovarian cancer versus normal tissues. The NSUN2/IGF-II signature associates with heterogeneous outcome and may have clinical implications in managing ovarian cancer.

  1. Genes@Work: an efficient algorithm for pattern discovery and multivariate feature selection in gene expression data.

    PubMed

    Lepre, Jorge; Rice, J Jeremy; Tu, Yuhai; Stolovitzky, Gustavo

    2004-05-01

    Despite the growing literature devoted to finding differentially expressed genes in assays probing different tissue types, little attention has been paid to the combinatorial nature of feature selection inherent to large, high-dimensional gene expression datasets. New flexible data analysis approaches capable of searching relevant subgroups of genes and experiments are needed to understand multivariate associations of gene expression patterns with observed phenotypes. We present in detail a deterministic algorithm to discover patterns of multivariate gene associations in gene expression data. The patterns discovered are differential with respect to a control dataset. The algorithm is exhaustive and efficient, reporting all existing patterns that fit a given input parameter set while avoiding enumeration of the entire pattern space. The value of the pattern discovery approach is demonstrated by finding a set of genes that differentiate between two types of lymphoma. Moreover, these genes are found to behave consistently in an independent dataset produced in a different laboratory using different arrays, thus validating the genes selected using our algorithm. We show that the genes deemed significant in terms of their multivariate statistics will be missed using other methods. Our set of pattern discovery algorithms, including a user interface, is distributed as a package called Genes@Work. This package is freely available to non-commercial users and can be downloaded from our website (http://www.research.ibm.com/FunGen).

  2. A Stromal Immune Module Correlated with the Response to Neoadjuvant Chemotherapy, Prognosis and Lymphocyte Infiltration in HER2-Positive Breast Carcinoma Is Inversely Correlated with Hormonal Pathways

    PubMed Central

    Lae, Marick; Moarii, Matahi; Sadacca, Benjamin; Pinheiro, Alice; Galliot, Marion; Abecassis, Judith; Laurent, Cecile; Reyal, Fabien

    2016-01-01

    Introduction HER2-positive breast cancer (BC) is a heterogeneous group of aggressive breast cancers, the prognosis of which has greatly improved since the introduction of treatments targeting HER2. However, these tumors may display intrinsic or acquired resistance to treatment, and classifiers of HER2-positive tumors are required to improve the prediction of prognosis and to develop novel therapeutic interventions. Methods We analyzed 2893 primary human breast cancer samples from 21 publicly available datasets and developed a six-metagene signature on a training set of 448 HER2-positive BC. We then used external public datasets to assess the ability of these metagenes to predict the response to chemotherapy (Ignatiadis dataset), and prognosis (METABRIC dataset). Results We identified a six-metagene signature (138 genes) containing metagenes enriched in different gene ontologies. The gene clusters were named as follows: Immunity, Tumor suppressors/proliferation, Interferon, Signal transduction, Hormone/survival and Matrix clusters. In all datasets, the Immunity metagene was less strongly expressed in ER-positive than in ER-negative tumors, and was inversely correlated with the Hormonal/survival metagene. Within the signature, multivariate analyses showed that strong expression of the “Immunity” metagene was associated with higher pCR rates after NAC (OR = 3.71[1.28–11.91], p = 0.019) than weak expression, and with a better prognosis in HER2-positive/ER-negative breast cancers (HR = 0.58 [0.36–0.94], p = 0.026). Immunity metagene expression was associated with the presence of tumor-infiltrating lymphocytes (TILs). Conclusion The identification of a predictive and prognostic immune module in HER2-positive BC confirms the need for clinical testing for immune checkpoint modulators and vaccines for this specific subtype. The inverse correlation between Immunity and hormone pathways opens research perspectives and deserves further investigation. PMID:28005906

  3. A ground truth based comparative study on clustering of gene expression data.

    PubMed

    Zhu, Yitan; Wang, Zuyi; Miller, David J; Clarke, Robert; Xuan, Jianhua; Hoffman, Eric P; Wang, Yue

    2008-05-01

    Given the variety of available clustering methods for gene expression data analysis, it is important to develop an appropriate and rigorous validation scheme to assess the performance and limitations of the most widely used clustering algorithms. In this paper, we present a ground truth based comparative study on the functionality, accuracy, and stability of five data clustering methods, namely hierarchical clustering, K-means clustering, self-organizing maps, standard finite normal mixture fitting, and a caBIG toolkit (VIsual Statistical Data Analyzer--VISDA), tested on sample clustering of seven published microarray gene expression datasets and one synthetic dataset. We examined the performance of these algorithms in both data-sufficient and data-insufficient cases using quantitative performance measures, including cluster number detection accuracy and mean and standard deviation of partition accuracy. The experimental results showed that VISDA, an interactive coarse-to-fine maximum likelihood fitting algorithm, is a solid performer on most of the datasets, while K-means clustering and self-organizing maps optimized by the mean squared compactness criterion generally produce more stable solutions than the other methods.

  4. Two-pass imputation algorithm for missing value estimation in gene expression time series.

    PubMed

    Tsiporkova, Elena; Boeva, Veselka

    2007-10-01

    Gene expression microarray experiments frequently generate datasets with multiple values missing. However, most of the analysis, mining, and classification methods for gene expression data require a complete matrix of gene array values. Therefore, the accurate estimation of missing values in such datasets has been recognized as an important issue, and several imputation algorithms have already been proposed to the biological community. Most of these approaches, however, are not particularly suitable for time series expression profiles. In view of this, we propose a novel imputation algorithm, which is specially suited for the estimation of missing values in gene expression time series data. The algorithm utilizes Dynamic Time Warping (DTW) distance in order to measure the similarity between time expression profiles, and subsequently selects, for each gene expression profile with missing values, a dedicated set of candidate profiles for estimation. Three different DTW-based imputation (DTWimpute) algorithms have been considered: position-wise, neighborhood-wise, and two-pass imputation. These have initially been prototyped in Perl, and their accuracy has been evaluated on yeast expression time series data using several different parameter settings. The experiments have shown that the two-pass algorithm consistently outperforms the neighborhood-wise and the position-wise algorithms, in particular for datasets with a higher level of missing entries. The performance of the two-pass DTWimpute algorithm has further been benchmarked against the weighted K-Nearest Neighbors algorithm, which is widely used in the biological community; the former proved superior to the latter. Motivated by these findings, which clearly indicate the added value of the DTW techniques for missing value estimation in time series data, we have built an optimized C++ implementation of the two-pass DTWimpute algorithm.
The software also provides for a choice between three different initial rough imputation methods.
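
    The similarity measure named above, Dynamic Time Warping distance between two time expression profiles, can be sketched with the textbook dynamic-programming recurrence. This is a generic illustration, not the DTWimpute code: it assumes complete profiles, an absolute-difference local cost, and no warping-window constraint, and it does not show how missing values are handled or how candidate profiles are selected.

```python
def dtw_distance(a, b):
    """Classic DTW distance between two numeric sequences (absolute-difference cost)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a advances, b[j-1] is reused
                                     cost[i][j - 1],      # b advances, a[i-1] is reused
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]


# Hypothetical profiles: the second repeats one time point, so warping aligns them exactly.
d_same_shape = dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0])
d_shifted = dtw_distance([0.0, 0.0], [1.0, 1.0])
```

    Profiles with the smallest DTW distance to a gene with missing entries would be natural candidates for estimation, in the spirit of the candidate-set selection the abstract describes.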

  5. An Integrated Bioinformatics Approach Identifies Elevated Cyclin E2 Expression and E2F Activity as Distinct Features of Tamoxifen Resistant Breast Tumors

    PubMed Central

    Huang, Lei; Zhao, Shuangping; Frasor, Jonna M.; Dai, Yang

    2011-01-01

    Approximately half of estrogen receptor (ER) positive breast tumors will fail to respond to endocrine therapy. Here we used an integrative bioinformatics approach to analyze three gene expression profiling data sets from breast tumors in an attempt to uncover underlying mechanisms contributing to the development of resistance and potential therapeutic strategies to counteract these mechanisms. Genes that are differentially expressed in tamoxifen resistant vs. sensitive breast tumors were identified from three different publicly available microarray datasets. These differentially expressed (DE) genes were analyzed using gene function and gene set enrichment and examined in intrinsic subtypes of breast tumors. The Connectivity Map analysis was utilized to link gene expression profiles of tamoxifen resistant tumors to small molecules, and validation studies were carried out in a tamoxifen resistant cell line. Despite little overlap in genes that are differentially expressed in tamoxifen resistant vs. sensitive tumors, a high degree of functional similarity was observed among the three datasets. Tamoxifen resistant tumors displayed enriched expression of genes related to cell cycle and proliferation, as well as elevated activity of E2F transcription factors, and were highly correlated with a Luminal intrinsic subtype. A number of small molecules, including phenothiazines, were found that induced a gene signature in breast cancer cell lines opposite to that found in tamoxifen resistant vs. sensitive tumors, and the ability of phenothiazines to down-regulate cyclin E2 and inhibit proliferation of tamoxifen resistant breast cancer cells was validated. Our findings demonstrate that an integrated bioinformatics approach to analyze gene expression profiles from multiple breast tumor datasets can identify important biological pathways and potentially novel therapeutic options for tamoxifen-resistant breast cancers. PMID:21789246

  6. Accounting for one-channel depletion improves missing value imputation in 2-dye microarray data.

    PubMed

    Ritz, Cecilia; Edén, Patrik

    2008-01-19

    For 2-dye microarray platforms, some missing values may arise from an unmeasurably low RNA expression in one channel only. Information on such "one-channel depletion" has so far not been included in algorithms for imputation of missing values. Calculating the mean deviation between imputed values and duplicate controls in five datasets, we show that KNN-based imputation gives a systematic bias of the imputed expression values of one-channel depleted spots. Evaluating the correction of this bias by cross-validation showed that the mean square deviation between imputed values and duplicates was reduced by up to 51%, depending on dataset. By including more information in the imputation step, we more accurately estimate missing expression values.
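
    The KNN-based imputation that this abstract evaluates can be illustrated with a minimal sketch. It assumes Euclidean distance over the columns two rows share and an unweighted mean of the k nearest neighbours; the helper name knn_impute and the toy matrix are hypothetical, and the one-channel depletion correction described above is not implemented here.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of the k nearest rows (hypothetical helper)."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                # A neighbour must be a different row with an observed value in column j.
                if m == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                # Root mean squared difference over the columns both rows observe.
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled

# Toy expression matrix (made-up values); None marks a missing spot.
expression = [
    [1.0, 2.0, None],
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 2.8],
    [5.0, 6.0, 9.0],
]
imputed = knn_impute(expression, k=2)
```

    In this toy matrix the missing value is filled from the two rows with the most similar profile, and the distant fourth row is ignored.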

  7. Rapid ocean-atmosphere response to Southern Ocean freshening during the last glacial period

    NASA Astrophysics Data System (ADS)

    Turney, Christian; Jones, Richard; Phipps, Steven; Thomas, Zoë; Hogg, Alan; Kershaw, Peter; Fogwill, Christopher; Palmer, Jonathan; Bronk Ramsey, Christopher; Adolphi, Florian; Muscheler, Raimund; Hughen, Konrad; Staff, Richard; Grosvenor, Mark; Golledge, Nicholas; Rasmussen, Sune; Hutchinson, David; Haberle, Simon; Lorrey, Andrew; Boswijk, Gretel

    2017-04-01

    Contrasting Greenland and Antarctic temperature trends during the late last glacial period (60,000 to 11,703 years ago) are thought to be driven by imbalances in the rate of formation of North Atlantic and Antarctic Deep Water (the 'bipolar seesaw'), with cooling in the north leading the onset of warming in the south. Some events, however, appear to have occurred independently of changes in deep water formation but still have a southern expression, implying that an alternative mechanism may have driven some global climatic changes during the glacial. Testing these competing hypotheses is challenging given the relatively large uncertainties associated with correlating terrestrial, marine and ice core records of abrupt change. Here we exploit a bidecadally-resolved 14C calibration dataset obtained from New Zealand kauri (Agathis australis) to undertake high-precision alignment of key climate datasets spanning 28,400 to 30,400 years ago. We observe no divergence between terrestrial and marine 14C datasets implying limited impact of freshwater hosing on the Atlantic Meridional Overturning Circulation (AMOC). However, an ice-rafted debris event (SA2) in Southern Ocean waters appears to be associated with dramatic synchronous warming over the North Atlantic and contrasting precipitation patterns across the low latitudes. Using a fully coupled climate system model we undertook an ensemble of transient meltwater simulations and find that a southern salinity anomaly can trigger low-latitude temperature changes through barotropic and baroclinic oceanic waves that are atmospherically propagated globally via a Rossby wave train, consistent with contemporary modelling studies. Our results suggest the Antarctic ice sheets and Southern Ocean dynamics may have contributed to some global climatic changes through rapid ocean-atmospheric teleconnections, with implications for past (and future) change.

  8. Autophagy-related prognostic signature for breast cancer.

    PubMed

    Gu, Yunyan; Li, Pengfei; Peng, Fuduan; Zhang, Mengmeng; Zhang, Yuanyuan; Liang, Haihai; Zhao, Wenyuan; Qi, Lishuang; Wang, Hongwei; Wang, Chenguang; Guo, Zheng

    2016-03-01

    Autophagy is a process that degrades intracellular constituents, such as long-lived or damaged proteins and organelles, to buffer metabolic stress under starvation conditions. Deregulation of autophagy is involved in the progression of cancer. However, the predictive value of autophagy for breast cancer prognosis remains unclear. First, based on gene expression profiling, we found that autophagy genes were implicated in breast cancer. Then, using the Cox proportional hazard regression model, we detected an autophagy prognostic signature for breast cancer in a training dataset. We identified a set of eight autophagy genes (BCL2, BIRC5, EIF4EBP1, ERO1L, FOS, GAPDH, ITPR1 and VEGFA) that were significantly associated with overall survival in breast cancer. The eight autophagy genes were assigned as an autophagy-related prognostic signature for breast cancer. Based on the autophagy-related signature, the training dataset GSE21653 could be classified into high-risk and low-risk subgroups with significantly different survival times (HR = 2.72, 95% CI = (1.91, 3.87); P = 1.37 × 10^-5). Inactivation of autophagy was associated with shortened survival of breast cancer patients. The prognostic value of the autophagy-related signature was confirmed in the testing dataset GSE3494 (HR = 2.12, 95% CI = (1.48, 3.03); P = 1.65 × 10^-3) and GSE7390 (HR = 1.76, 95% CI = (1.22, 2.54); P = 9.95 × 10^-4). Further analysis revealed that the prognostic value of the autophagy signature was independent of known clinical prognostic factors, including age, tumor size, grade, estrogen receptor status, progesterone receptor status, ERBB2 status, lymph node status and TP53 mutation status. Finally, we demonstrated that the autophagy signature could also predict distant metastasis-free survival for breast cancer. © 2015 Wiley Periodicals, Inc.
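
    Signature-based stratification of this kind is often done by scoring each patient with a weighted sum of expression values and splitting the cohort at a cutoff. The sketch below assumes a median cutoff; the abstract reports neither the coefficients nor the cutoff, so the weights, placeholder gene names and patient values are all hypothetical.

```python
import statistics

# Hypothetical Cox coefficients for placeholder genes; not values from the study.
coefficients = {"GENE_A": 0.8, "GENE_B": -0.5, "GENE_C": 0.3}

def risk_score(expression):
    """Weighted sum of expression values using the signature coefficients."""
    return sum(coefficients[g] * expression[g] for g in coefficients)

# Made-up expression values for four patients.
patients = {
    "p1": {"GENE_A": 2.0, "GENE_B": 1.0, "GENE_C": 0.5},
    "p2": {"GENE_A": 0.5, "GENE_B": 2.0, "GENE_C": 1.0},
    "p3": {"GENE_A": 1.5, "GENE_B": 0.5, "GENE_C": 2.0},
    "p4": {"GENE_A": 0.2, "GENE_B": 1.5, "GENE_C": 0.1},
}

scores = {pid: risk_score(expr) for pid, expr in patients.items()}
cutoff = statistics.median(scores.values())  # assumed cutoff rule
groups = {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}
```

    The resulting high-risk and low-risk groups would then be compared with a survival analysis, which is outside the scope of this sketch.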

  9. Statistical procedures for analyzing mental health services data.

    PubMed

    Elhai, Jon D; Calhoun, Patrick S; Ford, Julian D

    2008-08-15

    In mental health services research, analyzing service utilization data often poses serious problems, given the presence of substantially skewed data distributions. This article presents a non-technical introduction to statistical methods specifically designed to handle the complexly distributed datasets that represent mental health service use, including Poisson, negative binomial, zero-inflated, and zero-truncated regression models. A flowchart is provided to assist the investigator in selecting the most appropriate method. Finally, a dataset of mental health service use reported by medical patients is described, and a comparison of results across several different statistical methods is presented. Implications of matching data analytic techniques appropriately with the often complexly distributed datasets of mental health services utilization variables are discussed.
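
    To illustrate why zero-inflated models suit service-use counts with many non-users, the following minimal sketch compares the probability of a zero count under a plain Poisson model and under a zero-inflated Poisson model. The rate lam and the structural-zero probability pi are hypothetical values chosen for illustration, not estimates from the article.

```python
import math

def poisson_pmf(k, lam):
    """Probability of observing count k under Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def zip_pmf(k, lam, pi):
    """Zero-inflated Poisson: a structural zero with probability pi, else Poisson(lam)."""
    base = (1.0 - pi) * poisson_pmf(k, lam)
    return pi + base if k == 0 else base

lam, pi = 3.0, 0.4  # hypothetical parameters
p_zero_poisson = poisson_pmf(0, lam)
p_zero_zip = zip_pmf(0, lam, pi)

# Sanity check: the zero-inflated probabilities still sum to (almost) one.
total = sum(zip_pmf(k, lam, pi) for k in range(60))
```

    With these parameters the zero-inflated model assigns far more mass to zero visits than the Poisson model with the same rate, which is the pattern skewed utilization data typically show.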

  10. Statistical Test of Expression Pattern (STEPath): a new strategy to integrate gene expression data with genomic information in individual and meta-analysis studies.

    PubMed

    Martini, Paolo; Risso, Davide; Sales, Gabriele; Romualdi, Chiara; Lanfranchi, Gerolamo; Cagnin, Stefano

    2011-04-11

    In the last decades, microarray technology has spread, leading to a dramatic increase of publicly available datasets. The first statistical tools developed were focused on the identification of significant differentially expressed genes. Later, researchers moved toward the systematic integration of gene expression profiles with additional biological information, such as chromosomal location, ontological annotations or sequence features. The analysis of gene expression linked to physical location of genes on chromosomes allows the identification of transcriptionally imbalanced regions, while Gene Set Analysis focuses on the detection of coordinated changes in transcriptional levels among sets of biologically related genes. In this field, meta-analysis offers the possibility to compare different studies, addressing the same biological question to fully exploit public gene expression datasets. We describe STEPath, a method that starts from gene expression profiles and integrates the analysis of imbalanced regions as an a priori step before performing gene set analysis. The application of STEPath in individual studies produced gene set scores weighted by chromosomal activation. As a final step, we propose a way to compare these scores across different studies (meta-analysis) on related biological issues. One complication with meta-analysis is batch effects, which occur because molecular measurements are affected by laboratory conditions, reagent lots and personnel differences. Major problems occur when batch effects are correlated with an outcome of interest and lead to incorrect conclusions. We evaluated the power of combining chromosome mapping and gene set enrichment analysis, performing the analysis on a dataset of leukaemia (example of individual study) and on a dataset of skeletal muscle diseases (meta-analysis approach).
In leukaemia, we identified the Hox gene set, a gene set closely related to the pathology that other algorithms of gene set analysis do not identify, while the meta-analysis approach on muscular disease discriminates between related pathologies and correlates similar ones from different studies. STEPath is a new method that integrates gene expression profiles, genomic co-expressed regions and the information about the biological function of genes. The usage of the STEPath-computed gene set scores overcomes batch effects in the meta-analysis approaches allowing the direct comparison of different pathologies and different studies on a gene set activation level.

  11. SOURCES OF VARIATION IN BASELINE GENE EXPRESSION LEVELS FROM TOXICOGENOMIC STUDY CONTROL ANIMALS ACROSS MULTIPLE LABORATORIES

    EPA Science Inventory

    Variations in study design are typical for toxicogenomic studies, but their impact on gene expression in control animals has not been well characterized. A dataset of control animal microarray expression data was assembled by a working group of the Health and Environmental Scienc...

  12. Multilevel principal component analysis (mPCA) in shape analysis: A feasibility study in medical and dental imaging.

    PubMed

    Farnell, D J J; Popat, H; Richmond, S

    2016-06-01

    Methods used in image processing should reflect any multilevel structures inherent in the image dataset or they run the risk of functioning inadequately. We wish to test the feasibility of multilevel principal components analysis (PCA) to build active shape models (ASMs) for cases relevant to medical and dental imaging. Multilevel PCA was used to carry out model fitting to sets of landmark points and it was compared to the results of "standard" (single-level) PCA. Proof of principle was tested by applying mPCA to model basic peri-oral expressions (happy, neutral, sad) approximated to the junction between the mouth/lips. Monte Carlo simulations were used to create this data which allowed exploration of practical implementation issues such as the number of landmark points, number of images, and number of groups (i.e., "expressions" for this example). To further test the robustness of the method, mPCA was subsequently applied to a dental imaging dataset utilising landmark points (placed by different clinicians) along the boundary of mandibular cortical bone in panoramic radiographs of the face. Changes of expression that varied between groups were modelled correctly at one level of the model and changes in lip width that varied within groups at another for the Monte Carlo dataset. Extreme cases in the test dataset were modelled adequately by mPCA but not by standard PCA. Similarly, variations in the shape of the cortical bone were modelled by one level of mPCA and variations between the experts at another for the panoramic radiographs dataset. Results for mPCA were found to be comparable to those of standard PCA for point-to-point errors via miss-one-out testing for this dataset. These errors reduce with increasing number of eigenvectors/values retained, as expected. We have shown that mPCA can be used in shape models for dental and medical image processing. mPCA was found to provide more control and flexibility when compared to standard "single-level" PCA. 
Specifically, mPCA is preferable to "standard" PCA when multiple levels occur naturally in the dataset. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
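
    The multilevel idea can be sketched by splitting each landmark vector into a grand mean, a between-group offset and a within-group residual; mPCA would then analyse the covariance at each level separately. The sketch below stops before any eigendecomposition, and the group labels and two-dimensional toy shapes are hypothetical.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# Toy landmark vectors grouped by expression (made-up values).
groups = {
    "happy": [[2.0, 1.0], [2.2, 1.2]],
    "sad": [[1.0, -1.0], [0.8, -1.2]],
}

group_means = {name: mean_vector(shapes) for name, shapes in groups.items()}
grand_mean = mean_vector(list(group_means.values()))

# Level 1: how each group's mean differs from the grand mean.
between = {
    name: [m - g for m, g in zip(mu, grand_mean)]
    for name, mu in group_means.items()
}

# Level 2: how each shape differs from its own group's mean.
within = {
    name: [[x - m for x, m in zip(shape, group_means[name])] for shape in shapes]
    for name, shapes in groups.items()
}
```

    Each original shape equals the grand mean plus its group's between-level offset plus its own within-level residual, so variation between expressions and variation within an expression are kept apart.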

  13. Integration of Steady-State and Temporal Gene Expression Data for the Inference of Gene Regulatory Networks

    PubMed Central

    Wang, Yi Kan; Hurley, Daniel G.; Schnell, Santiago; Print, Cristin G.; Crampin, Edmund J.

    2013-01-01

    We develop a new regression algorithm, cMIKANA, for inference of gene regulatory networks from combinations of steady-state and time-series gene expression data. Using simulated gene expression datasets to assess the accuracy of reconstructing gene regulatory networks, we show that steady-state and time-series data sets can successfully be combined to identify gene regulatory interactions using the new algorithm. Inferring gene networks from combined data sets was found to be advantageous when using noisy measurements collected with either lower sampling rates or a limited number of experimental replicates. We illustrate our method by applying it to a microarray gene expression dataset from human umbilical vein endothelial cells (HUVECs) which combines time series data from treatment with growth factor TNF and steady state data from siRNA knockdown treatments. Our results suggest that the combination of steady-state and time-series datasets may provide better prediction of RNA-to-RNA interactions, and may also reveal biological features that cannot be identified from dynamic or steady state information alone. Finally, we consider the experimental design of genomics experiments for gene regulatory network inference and show that network inference can be improved by incorporating steady-state measurements with time-series data. PMID:23967277

  14. Impact of sequencing depth and read length on single cell RNA sequencing data of T cells.

    PubMed

    Rizzetto, Simone; Eltahla, Auda A; Lin, Peijie; Bull, Rowena; Lloyd, Andrew R; Ho, Joshua W K; Venturi, Vanessa; Luciani, Fabio

    2017-10-06

    Single cell RNA sequencing (scRNA-seq) provides great potential in measuring the gene expression profiles of heterogeneous cell populations. In immunology, scRNA-seq allowed the characterisation of transcript sequence diversity of functionally relevant T cell subsets, and the identification of the full length T cell receptor (TCRαβ), which defines the specificity against cognate antigens. Several factors, e.g. RNA library capture, cell quality, and sequencing output affect the quality of scRNA-seq data. We studied the effects of read length and sequencing depth on the quality of gene expression profiles, cell type identification, and TCRαβ reconstruction, utilising 1,305 single cells from 8 publicly available scRNA-seq datasets, and simulation-based analyses. Gene expression was characterised by an increased number of unique genes identified with short read lengths (<50 bp), but these featured higher technical variability compared to profiles from longer reads. Successful TCRαβ reconstruction was achieved for 6 datasets (81% - 100%) with at least 0.25 million paired-end (PE) reads of length >50 bp, while it failed for datasets with <30 bp reads. Sufficient read length and sequencing depth can control technical noise to enable accurate identification of TCRαβ and gene expression profiles from scRNA-seq data of T cells.

  15. Macromolecular Expression and Function: A New Paradigm for NASA Risk Assessment

    NASA Technical Reports Server (NTRS)

    Richmond, Robert

    2003-01-01

    Predicting risks in humans of either acute effects such as bone loss or muscle wasting, or late effects such as cancer, is challenging. To an approximation, this is because uncertainties of exposure to stress factors or toxic agents and the uniformity of processing subsequent damage at the cellular level within a complex set of biological variables degrade the confidence of predicting pathologic outcome. A cellular biodosimeter that simultaneously reports 1) the type of damage due to that exposure, 2) the quantity of damage incurred by that exposure, and 3) the dataset used to assess risk of developing pathologic outcome caused by that exposure would therefore be useful for predicting ultimate risks faced by an individual, such as an astronaut. It is suggested that such a biodosimeter can be based upon analyses of gene-expression and protein expression whereby large datasets of cellular response to damage are obtained and analyzed for expression-profiles correlated with established end points and molecular markers predictive for risks being assessed. The usefulness of multiparametric cellular biodosimeters could be realized by quantitatively profiling these datasets using techniques of bioinformatics. Such an approach contributes to the foundation of molecular epidemiology as a new scientific discipline, and represents a new paradigm of risk assessment.

  16. Association of Protein Translation and Extracellular Matrix Gene Sets with Breast Cancer Metastasis: Findings Uncovered on Analysis of Multiple Publicly Available Datasets Using Individual Patient Data Approach.

    PubMed

    Chowdhury, Nilotpal; Sapru, Shantanu

    2015-01-01

    Microarray analysis has revolutionized the role of genomic prognostication in breast cancer. However, most studies are single series studies, and suffer from methodological problems. We sought to use a meta-analytic approach in combining multiple publicly available datasets, while correcting for batch effects, to reach a more robust oncogenomic analysis. The aim of the present study was to find gene sets associated with distant metastasis free survival (DMFS) in systemically untreated, node-negative breast cancer patients, from publicly available genomic microarray datasets. Four microarray series (having 742 patients) were selected after a systematic search and combined. Cox regression for each gene was done for the combined dataset (univariate, as well as multivariate - adjusted for expression of Cell cycle related genes) and for the 4 major molecular subtypes. The centre and microarray batch effects were adjusted by including them as random effects variables. The Cox regression coefficients for each analysis were then ranked and subjected to a Gene Set Enrichment Analysis (GSEA). Gene sets representing protein translation were independently negatively associated with metastasis in the Luminal A and Luminal B subtypes, but positively associated with metastasis in Basal tumors. Proteinaceous extracellular matrix (ECM) gene set expression was positively associated with metastasis, after adjustment for expression of cell cycle related genes on the combined dataset. Finally, the positive association of the proliferation-related genes with metastases was confirmed. To the best of our knowledge, the results depicting mixed prognostic significance of protein translation in breast cancer subtypes are being reported for the first time. We attribute this to our study combining multiple series and performing a more robust meta-analytic Cox regression modeling on the combined dataset, thus discovering 'hidden' associations. 
This methodology seems to yield new and interesting results and may be used as a tool to guide new research.
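
    The step of ranking genes by Cox coefficient and testing gene sets against that ranking can be illustrated with a simplified, unweighted running-sum enrichment score. This is only a sketch of the idea: the published GSEA method uses a weighted statistic and permutation-based significance, and the gene names and gene sets below are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum score: step up on set members, down on non-members."""
    hits = [g in gene_set for g in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranking without covering it")
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        # Keep the running sum with the largest absolute deviation from zero.
        if abs(running) > abs(best):
            best = running
    return best

# Genes ordered from the most positive to the most negative coefficient (made up).
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
es_top = enrichment_score(ranked, {"g1", "g2"})
es_bottom = enrichment_score(ranked, {"g5", "g6"})
```

    A set concentrated at the top of the ranking scores close to +1 and a set concentrated at the bottom scores close to -1, mirroring the positive and negative associations with metastasis described above.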

  17. Association of Protein Translation and Extracellular Matrix Gene Sets with Breast Cancer Metastasis: Findings Uncovered on Analysis of Multiple Publicly Available Datasets Using Individual Patient Data Approach

    PubMed Central

    Chowdhury, Nilotpal; Sapru, Shantanu

    2015-01-01

    Introduction Microarray analysis has revolutionized the role of genomic prognostication in breast cancer. However, most studies are single series studies, and suffer from methodological problems. We sought to use a meta-analytic approach in combining multiple publicly available datasets, while correcting for batch effects, to reach a more robust oncogenomic analysis. Aim The aim of the present study was to find gene sets associated with distant metastasis free survival (DMFS) in systemically untreated, node-negative breast cancer patients, from publicly available genomic microarray datasets. Methods Four microarray series (having 742 patients) were selected after a systematic search and combined. Cox regression for each gene was done for the combined dataset (univariate, as well as multivariate – adjusted for expression of Cell cycle related genes) and for the 4 major molecular subtypes. The centre and microarray batch effects were adjusted by including them as random effects variables. The Cox regression coefficients for each analysis were then ranked and subjected to a Gene Set Enrichment Analysis (GSEA). Results Gene sets representing protein translation were independently negatively associated with metastasis in the Luminal A and Luminal B subtypes, but positively associated with metastasis in Basal tumors. Proteinaceous extracellular matrix (ECM) gene set expression was positively associated with metastasis, after adjustment for expression of cell cycle related genes on the combined dataset. Finally, the positive association of the proliferation-related genes with metastases was confirmed. Conclusion To the best of our knowledge, the results depicting mixed prognostic significance of protein translation in breast cancer subtypes are being reported for the first time. We attribute this to our study combining multiple series and performing a more robust meta-analytic Cox regression modeling on the combined dataset, thus discovering 'hidden' associations. 
This methodology seems to yield new and interesting results and may be used as a tool to guide new research. PMID:26080057

  18. Low-carbon, low-water scenarios with life cycle water factors for ES&T paper

    EPA Pesticide Factsheets

    The dataset includes all data used in the creation of figures and graphs in the paper: Scenarios for low carbon and low water electric power plant operations: implications for upstream water use. Data includes regional electricity mixes, full life cycle water use, and water use for each life cycle stage. These encompass a range of scenarios out to 2050, and should not be used as predictions, forecasts or official baselines. The scenarios and results are for research purposes only, and do not represent current or future U.S. EPA policies or regulations. This dataset is associated with the following publication: Dodder, R., J. Barnwell, and W. Yelverton. Scenarios for low carbon and low water electric power plant operations: implications for upstream water use. ENVIRONMENTAL SCIENCE & TECHNOLOGY. American Chemical Society, Washington, DC, USA, 50(21): 11460-11470, (2016).

  19. Trade Study: Storing NASA HDF5/netCDF-4 Data in the Amazon Cloud and Retrieving Data via Hyrax Server / THREDDS Data Server

    NASA Technical Reports Server (NTRS)

    Habermann, Ted; Jelenak, Aleksander; Lee, Joe; Yang, Kent; Gallagher, James; Potter, Nathan

    2017-01-01

    As part of the overall effort to understand implications of migrating ESDIS data and services to the cloud we are testing several common OPeNDAP and HDF use cases against three architectures for general performance and cost characteristics. The architectures include retrieving entire files, retrieving datasets using HTTP range gets, and retrieving elements of datasets (chunks) with HTTP range gets. We will describe these architectures and discuss our approach to estimating cost.
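
    Retrieving a single chunk with an HTTP range get amounts to requesting a byte interval of the stored file. The minimal sketch below builds such a request with the Python standard library without sending it; the URL, byte offset and chunk size are hypothetical and would in practice come from the file's chunk index.

```python
import urllib.request

def chunk_request(url, offset, nbytes):
    """Build a GET request for nbytes starting at offset (hypothetical helper)."""
    last = offset + nbytes - 1  # the end of an HTTP byte range is inclusive
    request = urllib.request.Request(url)
    request.add_header("Range", f"bytes={offset}-{last}")
    return request

# Placeholder object URL; offset and size are made-up values.
req = chunk_request("https://example.com/data/granule.h5", offset=4096, nbytes=1024)
range_header = req.get_header("Range")
```

    A server that honours the header responds with status 206 and only the requested bytes, which is what distinguishes chunk-level retrieval from downloading entire files.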

  20. Systems Biomedicine of Rabies Delineates the Affected Signaling Pathways.

    PubMed

    Azimzadeh Jamalkandi, Sadegh; Mozhgani, Sayed-Hamidreza; Gholami Pourbadie, Hamid; Mirzaie, Mehdi; Noorbakhsh, Farshid; Vaziri, Behrouz; Gholami, Alireza; Ansari-Pour, Naser; Jafari, Mohieddin

    2016-01-01

    The prototypical neurotropic virus, rabies, is a member of the Rhabdoviridae family that causes lethal encephalomyelitis. Although there have been a plethora of studies investigating the etiological mechanism of the rabies virus and many precautionary methods have been implemented to avert the disease outbreak over the last century, the disease has surprisingly no definite remedy at its late stages. The psychological symptoms and the underlying etiology, as well as the rare survival rate from rabies encephalitis, have remained a mystery. We, therefore, undertook a systems biomedicine approach to identify the network of gene products implicated in rabies. This was done by meta-analyzing whole-transcriptome microarray datasets of the CNS infected by strain CVS-11, and integrating them with interactome data using computational and statistical methods. We first determined the differentially expressed genes (DEGs) in each study and horizontally integrated the results at the mRNA and microRNA levels separately. A total of 61 seed genes involved in signal propagation system were obtained by means of unifying mRNA and microRNA detected integrated DEGs. We then reconstructed a refined protein-protein interaction network (PPIN) of infected cells to elucidate the rabies-implicated signal transduction network (RISN). To validate our findings, we confirmed differential expression of randomly selected genes in the network using Real-time PCR. In conclusion, the identification of seed genes and their network neighborhood within the refined PPIN can be useful for demonstrating signaling pathways including interferon circumvent, toward proliferation and survival, and neuropathological clue, explaining the intricate underlying molecular neuropathology of rabies infection and thus rendering a molecular framework for predicting potential drug targets.

  1. Identification of miRNA-Mediated Core Gene Module for Glioma Patient Prediction by Integrating High-Throughput miRNA, mRNA Expression and Pathway Structure

    PubMed Central

    Han, Junwei; Shang, Desi; Zhang, Yunpeng; Zhang, Wei; Yao, Qianlan; Han, Lei; Xu, Yanjun; Yan, Wei; Bao, Zhaoshi; You, Gan; Jiang, Tao; Kang, Chunsheng; Li, Xia

    2014-01-01

    The prognosis of glioma patients is usually poor, especially in patients with glioblastoma (World Health Organization (WHO) grade IV). The regulatory functions of microRNA (miRNA) on genes have important implications in glioma cell survival. However, there are not many studies that have investigated glioma survival by integrating miRNAs and genes while also considering pathway structure. In this study, we performed sample-matched miRNA and mRNA expression profilings to systematically analyze glioma patient survival. During this analytical process, we developed pathway-based random walk to identify a glioma core miRNA-gene module, simultaneously considering pathway structure information and multi-level involvement of miRNAs and genes. The core miRNA-gene module we identified was comprised of four apparent sub-modules; all four sub-modules displayed a significant correlation with patient survival in the testing set (P-values≤0.001). Notably, one sub-module that consisted of 6 miRNAs and 26 genes also correlated with survival time in the high-grade subgroup (WHO grade III and IV), P-value = 0.0062. Furthermore, the 26-gene expression signature from this sub-module had robust predictive power in four independent, publicly available glioma datasets. Our findings suggested that the expression signatures, which were identified by integration of miRNA and gene level, were closely associated with overall survival among the glioma patients with various grades. PMID:24809850

  2. Molecular profiling of ALDH1+ colorectal cancer stem cells reveals preferential activation of MAPK, FAK, and oxidative stress pro-survival signalling pathways

    PubMed Central

    Vishnubalaji, Radhakrishnan; Manikandan, Muthurangan; Fahad, Mohamed; Hamam, Rimi; Alfayez, Musaad; Kassem, Moustapha; Aldahmash, Abdullah; Alajez, Nehad M.

    2018-01-01

    Tumour heterogeneity leads to variable clinical response and inaccurate diagnostic and prognostic assessment. Cancer stem cells (CSCs) represent a subpopulation responsible for invasion, metastasis, therapeutic resistance, and recurrence in many human cancer types. However, the true identity of colorectal cancer (CRC) SCs remains elusive. Here, we aimed to characterize and define the gene expression portrait of CSCs in CRC-model SW403 cells. We found that ALDH+ cells are clonogenic and highly proliferative; their global gene expression profiling-based molecular signature revealed gene enrichment related to DNA damage, MAPK, FAK, oxidative stress response, and Wnt signalling. ALDH+ cells showed enhanced ROS stress resistance, whereas MAPK/FAK pathway pharmacologic inhibition limited their survival. Conversely, 5-fluorouracil increased the ALDH+ cell fraction among the SW403, HCT116 and SW620 CRC models. Notably, analysis of ALDH1A1 and POU5F1 expression levels in cohorts of 462 or 420 patients for overall (OS) or disease-free (DFS) survival, respectively, obtained from the Cancer Genome Atlas CRC dataset, revealed strong association between elevated expression and poor OS (p = 0.006) and poor DFS (p = 0.05), thus implicating ALDH1A1 and POU5F1 in CRC prognosis. Our data reveal a distinct molecular signature of ALDH+ CSCs in CRC and suggest pathways relevant for successful targeted therapies and management of CRC. PMID:29568377

  3. Localization of migraine susceptibility genes in human brain by single-cell RNA sequencing.

    PubMed

    Renthal, William

    2018-01-01

    Background Migraine is a debilitating disorder characterized by severe headaches and associated neurological symptoms. A key challenge to understanding migraine has been the cellular complexity of the human brain and the multiple cell types implicated in its pathophysiology. The present study leverages recent advances in single-cell transcriptomics to localize the specific human brain cell types in which putative migraine susceptibility genes are expressed. Methods The cell-type specific expression of both familial and common migraine-associated genes was determined bioinformatically using data from 2,039 individual human brain cells across two published single-cell RNA sequencing datasets. Enrichment of migraine-associated genes was determined for each brain cell type. Results Analysis of single-brain cell RNA sequencing data from five major subtypes of cells in the human cortex (neurons, oligodendrocytes, astrocytes, microglia, and endothelial cells) indicates that over 40% of known migraine-associated genes are enriched in the expression profiles of a specific brain cell type. Further analysis of neuronal migraine-associated genes demonstrated that approximately 70% were significantly enriched in inhibitory neurons and 30% in excitatory neurons. Conclusions This study takes the next step in understanding the human brain cell types in which putative migraine susceptibility genes are expressed. Both familial and common migraine may arise from dysfunction of discrete cell types within the neurovascular unit, and localization of the affected cell type(s) in an individual patient may provide insight into their susceptibility to migraine.
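
    The abstract does not state which enrichment statistic was used. One common choice for asking whether a gene list is over-represented among a cell type's marker genes is a hypergeometric upper-tail test, sketched below under that assumption; all counts are hypothetical.

```python
from math import comb

def hypergeom_upper_tail(population, successes, draws, observed):
    """P(X >= observed) when drawing without replacement (hypothetical helper)."""
    upper = min(successes, draws)
    total = comb(population, draws)
    favourable = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return favourable / total

# Made-up counts: 20 genes in total, 5 of them disease-associated,
# 5 marker genes for one cell type, 3 of which are disease-associated.
p_value = hypergeom_upper_tail(population=20, successes=5, draws=5, observed=3)
```

    A small p-value would suggest that the cell type's markers contain more disease-associated genes than expected by chance; with counts this small the evidence is weak, which is why real analyses use genome-scale gene lists and multiple-testing correction.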

  4. Microarray Analysis of Iris Gene Expression in Mice with Mutations Influencing Pigmentation

    PubMed Central

    Trantow, Colleen M.; Cuffy, Tryphena L.; Fingert, John H.; Kuehn, Markus H.

    2011-01-01

    Purpose. Several ocular diseases involve the iris, notably including oculocutaneous albinism, pigment dispersion syndrome, and exfoliation syndrome. To screen for candidate genes that may contribute to the pathogenesis of these diseases, genome-wide iris gene expression patterns were comparatively analyzed from mouse models of these conditions. Methods. Iris samples from albino mice with a Tyr mutation, pigment dispersion–prone mice with Tyrp1 and Gpnmb mutations, and mice resembling exfoliation syndrome with a Lyst mutation were compared with samples from wild-type mice. All mice were strain (C57BL/6J), age (60 days old), and sex (female) matched. Microarrays were used to compare transcriptional profiles, and differentially expressed transcripts were described by functional annotation clustering using DAVID Bioinformatics Resources. Quantitative real-time PCR was performed to validate a subset of identified changes. Results. Compared with wild-type C57BL/6J mice, each disease context exhibited a large number of statistically significant changes in gene expression, including 685 transcripts differentially expressed in albino irides, 403 in pigment dispersion–prone irides, and 460 in exfoliative-like irides. Conclusions. Functional annotation clusterings were particularly striking among the overrepresented genes, with albino and pigment dispersion–prone irides both exhibiting overall evidence of crystallin-mediated stress responses. Exfoliative-like irides from mice with a Lyst mutation showed overall evidence of involvement of genes that influence immune system processes, lytic vacuoles, and lysosomes. These findings have several biologically relevant implications, particularly with respect to secondary forms of glaucoma, and represent a useful resource as a hypothesis-generating dataset. PMID:20739468

  5. Enhancer Linking by Methylation/Expression Relationships (ELMER) | Informatics Technology for Cancer Research (ITCR)

    Cancer.gov

    R tool for analysis of DNA methylation and expression datasets. Integrative analysis allows reconstruction of in vivo transcription factor networks altered in cancer along with identification of the underlying gene regulatory sequences.

  6. iCOSSY: An Online Tool for Context-Specific Subnetwork Discovery from Gene Expression Data

    PubMed Central

    Saha, Ashis; Jeon, Minji; Tan, Aik Choon; Kang, Jaewoo

    2015-01-01

    Pathway analyses help reveal underlying molecular mechanisms of complex biological phenotypes. Biologists tend to perform multiple pathway analyses on the same dataset, as there is no single answer. It is often inefficient for them to implement and/or install all the algorithms by themselves. Online tools can help the community in this regard. Here we present an online gene expression analytical tool called iCOSSY which implements a novel pathway-based COntext-specific Subnetwork discoverY (COSSY) algorithm. iCOSSY also includes a few modifications of COSSY to increase its reliability and interpretability. Users can upload their gene expression datasets, and discover important subnetworks of closely interacting molecules to differentiate between two phenotypes (context). They can also interactively visualize the resulting subnetworks. iCOSSY is a web server that finds subnetworks that are differentially expressed in two phenotypes. Users can visualize the subnetworks to understand the biology of the difference. PMID:26147457

  7. Bayesian median regression for temporal gene expression data

    NASA Astrophysics Data System (ADS)

    Yu, Keming; Vinciotti, Veronica; Liu, Xiaohui; 't Hoen, Peter A. C.

    2007-09-01

    Most of the existing methods for the identification of biologically interesting genes in a temporal expression profiling dataset do not fully exploit the temporal ordering in the dataset and are based on normality assumptions for the gene expression. In this paper, we introduce a Bayesian median regression model to detect genes whose temporal profile is significantly different across a number of biological conditions. The regression model is defined by a polynomial function where both time and condition effects as well as interactions between the two are included. MCMC-based inference returns the posterior distribution of the polynomial coefficients. From this a simple Bayes factor test is proposed to test for significance. The estimation of the median rather than the mean, and within a Bayesian framework, increases the robustness of the method compared to a Hotelling T2-test previously suggested. This is shown on simulated data and on muscular dystrophy gene expression data.
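
    The robustness argument above (estimating the median rather than the mean) can be illustrated with a minimal sketch. It is not the Bayesian model from the paper: there is no polynomial regression and no MCMC here, and the expression values are hypothetical. It only shows that the median minimizes the absolute loss and moves far less than the mean when one outlying measurement is added.

```python
from statistics import mean, median


def absolute_loss(center, values):
    """Sum of absolute deviations; the median minimizes this loss."""
    return sum(abs(v - center) for v in values)


# Hypothetical expression values for one gene at one time point.
clean = [2.0, 2.1, 1.9, 2.2, 2.0]
with_outlier = clean + [9.0]  # one aberrant array

# How far each location estimate moves when the outlier is added.
mean_shift = abs(mean(with_outlier) - mean(clean))
median_shift = abs(median(with_outlier) - median(clean))

print(f"shift of mean:   {mean_shift:.2f}")
print(f"shift of median: {median_shift:.2f}")
```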

  8. Novel Harmonic Regularization Approach for Variable Selection in Cox's Proportional Hazards Model

    PubMed Central

    Chu, Ge-Jin; Liang, Yong; Wang, Jia-Xuan

    2014-01-01

    Variable selection is an important issue in regression, and a number of variable selection methods involving nonconvex penalty functions have been proposed. In this paper, we investigate a novel harmonic regularization method, which can approximate nonconvex Lq (1/2 < q < 1) regularizations, to select key risk factors in Cox's proportional hazards model using microarray gene expression data. The harmonic regularization method can be efficiently solved using our proposed direct path seeking approach, which can produce solutions that closely approximate those for the convex loss function and the nonconvex regularization. Simulation results based on artificial datasets and four real microarray gene expression datasets, including the diffuse large B-cell lymphoma (DLBCL), lung cancer, and AML datasets, show that the harmonic regularization method can be more accurate for variable selection than existing Lasso series methods. PMID:25506389
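
    For orientation, the Lq penalty family that the harmonic regularization is said to approximate can be written down in a few lines. This is a sketch of the plain Lq penalty only, not of the harmonic regularization or the path seeking solver; the function name `lq_penalty`, the weight `lam`, and the coefficient values are illustrative assumptions.

```python
def lq_penalty(coefs, q, lam=1.0):
    """Return lam * sum(|b|^q); q = 1 gives the Lasso (L1) penalty."""
    if not 0 < q <= 1:
        raise ValueError("q must lie in (0, 1]")
    return lam * sum(abs(b) ** q for b in coefs)


# Hypothetical regression coefficients: one tiny, one large.
small, large = [0.01], [4.0]

for q in (1.0, 0.75):
    print(f"q={q}: small -> {lq_penalty(small, q):.4f}, "
          f"large -> {lq_penalty(large, q):.4f}")
```

    With q below 1 the penalty charges a tiny coefficient relatively more and a large coefficient relatively less than L1 does, which is the usual motivation for nonconvex penalties in sparse variable selection.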

  9. Chondrocyte channel transcriptomics

    PubMed Central

    Lewis, Rebecca; May, Hannah; Mobasheri, Ali; Barrett-Jolley, Richard

    2013-01-01

    To date, a range of ion channels have been identified in chondrocytes using a number of different techniques, predominantly electrophysiological and/or biomolecular; each of these has its advantages and disadvantages. Here we aim to compare and contrast the data available from biophysical and microarray experiments. This letter analyses recent transcriptomics datasets from chondrocytes, accessible from the European Bioinformatics Institute (EBI). We discuss whether such bioinformatic analysis of microarray datasets can potentially accelerate identification and discovery of ion channels in chondrocytes. The ion channels which appear most frequently across these microarray datasets are discussed, along with their possible functions. We discuss whether functional or protein data exist which support the microarray data. A microarray experiment comparing gene expression in osteoarthritis and healthy cartilage is also discussed and we verify the differential expression of 2 of these genes, namely the genes encoding large calcium-activated potassium (BK) and aquaporin channels. PMID:23995703

  10. A-MADMAN: Annotation-based microarray data meta-analysis tool

    PubMed Central

    Bisognin, Andrea; Coppe, Alessandro; Ferrari, Francesco; Risso, Davide; Romualdi, Chiara; Bicciato, Silvio; Bortoluzzi, Stefania

    2009-01-01

    Background Publicly available datasets of microarray gene expression signals represent an unprecedented opportunity for extracting genomic relevant information and validating biological hypotheses. However, the exploitation of this exceptionally rich mine of information is still hampered by the lack of appropriate computational tools, able to overcome the critical issues raised by meta-analysis. Results This work presents A-MADMAN, an open source web application which allows the retrieval, annotation, organization and meta-analysis of gene expression datasets obtained from Gene Expression Omnibus. A-MADMAN addresses and resolves several open issues in the meta-analysis of gene expression data. Conclusion A-MADMAN allows i) the batch retrieval from Gene Expression Omnibus and the local organization of raw data files and of any related meta-information, ii) the re-annotation of samples to fix incomplete, or otherwise inadequate, metadata and to create user-defined batches of data, iii) the integrative analysis of data obtained from different Affymetrix platforms through custom chip definition files and meta-normalization. Software and documentation are available on-line at . PMID:19563634

  11. R-Spondins Are Expressed by the Intestinal Stroma and are Differentially Regulated during Citrobacter rodentium- and DSS-Induced Colitis in Mice.

    PubMed

    Kang, Eugene; Yousefi, Mitra; Gruenheid, Samantha

    2016-01-01

    The R-spondin family of proteins has recently been described as secreted enhancers of β-catenin activation through the canonical Wnt signaling pathway. We previously reported that Rspo2 is a major determinant of susceptibility to Citrobacter rodentium-mediated colitis in mice and recent genome-wide association studies have revealed RSPO3 as a candidate Crohn's disease-specific inflammatory bowel disease susceptibility gene in humans. However, there is little information on the endogenous expression and cellular source of R-spondins in the colon at steady state and during intestinal inflammation. RNA sequencing and qRT-PCR were used to assess the expression of R-spondins at steady state and in two mouse models of colonic inflammation. The cellular source of R-spondins was assessed in specific colonic cell populations isolated by cell sorting. Data mining from publicly available datasets was used to assess the expression of R-spondins in the human colon. At steady state, colonic expression of R-spondins was found to be exclusive to non-epithelial CD45- lamina propria cells, and Rspo3/RSPO3 was the most highly expressed R-spondin in both mouse and human colon. R-spondin expression was found to be highly dynamic and differentially regulated during C. rodentium infection and dextran sodium sulfate (DSS) colitis, with notably high levels of Rspo3 expression during DSS colitis, and high levels of Rspo2 expression during C. rodentium infection, specifically in susceptible mice. Our data are consistent with the hypothesis that in the colon, R-spondins are expressed by subepithelial stromal cells, and that Rspo3/RSPO3 is the family member most implicated in colonic homeostasis. The differential regulation of the R-spondins in different models of intestinal inflammation indicates that they respond to specific pathogenic and inflammatory signals that differ in the two models, and provides further evidence that this family of proteins plays a key role in linking intestinal inflammation and homeostasis.

  12. pySAPC, a python package for sparse affinity propagation clustering: Application to odontogenesis whole genome time series gene-expression data.

    PubMed

    Cao, Huojun; Amendt, Brad A

    2016-11-01

    Developmental dental anomalies are common forms of congenital defects. The molecular mechanisms of dental anomalies are poorly understood. Systematic approaches such as clustering genes based on similar expression patterns could identify novel genes involved in dental anomalies and provide a framework for understanding molecular regulatory mechanisms of these genes during tooth development (odontogenesis). A python package (pySAPC) implementing a sparse affinity propagation clustering algorithm for large datasets was developed. Whole genome pair-wise similarity was calculated from expression patterns across 45 microarrays covering several stages of odontogenesis. pySAPC identified 743 gene clusters based on expression pattern similarity during mouse tooth development. Three clusters are significantly enriched for genes associated with dental anomalies (with FDR <0.1). The three clusters of genes have distinct expression patterns during odontogenesis. Clustering genes based on similar expression profiles recovered several known regulatory relationships for genes involved in odontogenesis, as well as many novel genes that may be involved with the same genetic pathways as genes that have already been shown to contribute to dental defects. By using a sparse similarity matrix, pySAPC uses much less memory and CPU time than the original affinity propagation program, which uses a full similarity matrix. This python package will be useful for many applications where datasets are too large to use a full similarity matrix. This article is part of a Special Issue entitled "System Genetics" Guest Editor: Dr. Yudong Cai and Dr. Tao Huang. Copyright © 2016. Published by Elsevier B.V.
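
    The memory saving from a sparse similarity matrix can be sketched as follows. This does not use or reproduce the pySAPC API. It assumes Pearson correlation as the similarity and a hypothetical cutoff of 0.9, and it stores only the gene pairs that pass the cutoff instead of all n*(n-1)/2 pairs. The gene names and profiles are made up.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a flat profile carries no pattern information
    return cov / (sx * sy)


def sparse_similarity(profiles, threshold=0.9):
    """Keep only gene pairs whose correlation reaches the threshold."""
    sim = {}
    for (g1, p1), (g2, p2) in combinations(profiles.items(), 2):
        r = pearson(p1, p2)
        if r >= threshold:
            sim[(g1, g2)] = r
    return sim


# Hypothetical profiles over four developmental stages.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
sparse = sparse_similarity(profiles)
print(sparse)
```

    A clustering step would then pass messages only along the stored pairs, which is where the reduction in memory and CPU time would come from.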

  13. Preliminary characterization of IL32 in basal-like/triple negative compared to other types of breast cell lines and tissues

    PubMed Central

    2014-01-01

    Background Triple negative breast cancer (TNBC) and often basal-like cancers are defined as negative for estrogen receptor, progesterone receptor and Her2 gene expression. Over the past few years an incredible amount of data has been generated defining the molecular characteristics of both cancers. The aim of these studies is to better understand the cancers and identify genes and molecular pathways that might be useful as targeted therapies. In an attempt to contribute to the understanding of basal-like/TNBC, we examined the Gene Expression Omnibus (GEO) public datasets in search of genes that might define basal-like/TNBC. The IL32 gene was identified as a candidate. Findings Analysis of several GEO datasets showed differential expression of IL32 in patient samples previously designated as basal and/or TNBC compared to normal and luminal breast samples. As validation of the GEO results, RNA and protein expression levels were examined using MCF7 and MDA MB231 cell lines and tissue microarrays (TMAs). IL32 gene expression levels were higher in MDA MB231 compared to MCF7. Analysis of TMAs showed 42% of TNBC tissues and 25% of the non-TNBC were positive for IL32, while non-malignant patient samples and all but one hyperplastic tissue sample demonstrated lower levels of IL32 protein expression. Conclusion Data obtained from several publicly available GEO datasets showed overexpression of the IL32 gene in basal-like/TNBC samples compared to normal and luminal samples. In support of these data, analysis of TMA clinical samples demonstrated a particular pattern of IL32 differential expression. Considered together, these data suggest IL32 is a candidate suitable for further study. PMID:25100201

  14. Variant effect prediction tools assessed using independent, functional assay-based datasets: implications for discovery and diagnostics.

    PubMed

    Mahmood, Khalid; Jung, Chol-Hee; Philip, Gayle; Georgeson, Peter; Chung, Jessica; Pope, Bernard J; Park, Daniel J

    2017-05-16

    Genetic variant effect prediction algorithms are used extensively in clinical genomics and research to determine the likely consequences of amino acid substitutions on protein function. It is vital that we better understand their accuracies and limitations because published performance metrics are confounded by serious problems of circularity and error propagation. Here, we derive three independent, functionally determined human mutation datasets, UniFun, BRCA1-DMS and TP53-TA, and employ them, alongside previously described datasets, to assess the pre-eminent variant effect prediction tools. Apparent accuracies of variant effect prediction tools were influenced significantly by the benchmarking dataset. Benchmarking with the assay-determined datasets UniFun and BRCA1-DMS yielded areas under the receiver operating characteristic curves in the modest ranges of 0.52 to 0.63 and 0.54 to 0.75, respectively, considerably lower than observed for other, potentially more conflicted datasets. These results raise concerns about how such algorithms should be employed, particularly in a clinical setting. Contemporary variant effect prediction tools are unlikely to be as accurate at the general prediction of functional impacts on proteins as reported prior. Use of functional assay-based datasets that avoid prior dependencies promises to be valuable for the ongoing development and accurate benchmarking of such tools.
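
    The benchmark figures quoted above are areas under the receiver operating characteristic curve (AUC). As a reminder of what that number measures, here is a minimal sketch that computes AUC as the probability that a randomly chosen damaging variant scores higher than a randomly chosen neutral one, with ties counted as one half. The labels and scores are hypothetical and are not taken from UniFun, BRCA1-DMS or TP53-TA.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical predictor scores; label 1 = functionally damaging.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.2, 0.1]
print(roc_auc(labels, scores))
```

    An AUC of 0.5 corresponds to random ranking, which is why values in the 0.52 to 0.63 range are described as modest.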

  15. Immune-Modulation by Epidermal Growth Factor Receptor Inhibitors: Implication on Anti-Tumor Immunity in Lung Cancer

    PubMed Central

    Herrmann, Amanda C.; Bernatchez, Chantale; Haymaker, Cara; Molldrem, Jeffrey J.; Hong, Waun Ki; Perez-Soler, Roman

    2016-01-01

    Skin toxicity is the most common toxicity caused by Epidermal Growth Factor Receptor (EGFR) inhibitors, and has been associated with clinical efficacy. As EGFR inhibitors enhance the expression of antigen presenting molecules in affected skin keratinocytes, they may concurrently facilitate neo-antigen presentation in lung cancer tumor cells, contributing to anti-tumor immunity. Here, we investigated the modulatory effect of the EGFR inhibitor erlotinib on antigen presenting molecules and PD-L1, a prominent immune checkpoint protein, in skin keratinocytes and lung cancer cell lines to delineate the link between EGFR signaling pathway inhibition and potential anti-tumor immunity. Erlotinib up-regulated MHC-I and MHC-II proteins on IFNγ treated keratinocytes but abrogated IFNγ-induced expression of PD-L1, suggesting the potential role of infiltrating autoreactive T cells in the damage of keratinocytes in affected skin. Interestingly, the surface expression of MHC-I, MHC-II, and PD-L1 was up-regulated in response to IFNγ more often in lung cancer cell lines sensitive to erlotinib, but only expression of PD-L1 was inhibited by erlotinib. Further, erlotinib significantly increased T cell mediated cytotoxicity on lung cancer cells. Lastly, analysis of a gene expression dataset of 186 lung cancer cell lines from the Cancer Cell Line Encyclopedia demonstrated that overexpression of PD-L1 was associated with sensitivity to erlotinib and higher expression of genes related to antigen presenting pathways and the IFNγ signaling pathway. Our findings suggest that EGFR inhibitors can facilitate anti-tumor adaptive immune responses by breaking tolerance, especially in EGFR driven lung cancers that are associated with overexpression of PD-L1 and genes related to antigen presentation and inflammation. PMID:27467256

  16. A Meta-Analysis: Identification of Common Mir-145 Target Genes that have Similar Behavior in Different GEO Datasets.

    PubMed

    Pashaei, Elnaz; Guzel, Esra; Ozgurses, Mete Emir; Demirel, Goksun; Aydin, Nizamettin; Ozen, Mustafa

    MicroRNAs, which are small regulatory RNAs, post-transcriptionally regulate gene expression by binding the 3'-UTR of their mRNA targets. Their deregulation has been shown to cause increased proliferation, migration, invasion, and apoptosis. miR-145, an important tumor suppressor microRNA, has been shown to be downregulated in many cancer types and has crucial roles in tumor initiation, progression, metastasis, invasion, recurrence, and chemo-radioresistance. Our aim is to investigate potential common target genes of miR-145, and to help understand the underlying molecular pathways of tumor pathogenesis in association with those common target genes. Eight published microarray datasets, in which targets of miR-145 were investigated in cell lines upon miR-145 overexpression, were included in this study for meta-analysis. Inter-group variabilities were assessed by box-plot analysis. Microarray datasets were analyzed using the GEOquery package in Bioconductor 3.2 with R version 3.2.2, and two-way hierarchical clustering was used for gene expression data analysis. Meta-analysis of different GEO datasets showed that UNG, FUCA2, DERA, GMFB, TF, and SNX2 were commonly downregulated genes, whereas MYL9 and TAGLN were found to be commonly upregulated upon miR-145 overexpression in prostate, breast, esophageal, and bladder cancer, and in head and neck squamous cell carcinoma. Biological process, molecular function, and pathway analysis of these potential targets of miR-145 through functional enrichments in a PPI network demonstrated that those genes are significantly involved in telomere maintenance, DNA binding and repair mechanisms. In conclusion, our results indicate that miR-145, through its common potential targets, may significantly contribute to tumor pathogenesis in distinct cancer types and might serve as an important target for cancer therapy.

  17. GEOGLE: context mining tool for the correlation between gene expression and the phenotypic distinction.

    PubMed

    Yu, Yao; Tu, Kang; Zheng, Siyuan; Li, Yun; Ding, Guohui; Ping, Jie; Hao, Pei; Li, Yixue

    2009-08-25

    In the post-genomic era, the development of high-throughput gene expression detection technology provides huge amounts of experimental data, which challenges the traditional pipelines for data processing and analyzing in scientific researches. In our work, we integrated gene expression information from Gene Expression Omnibus (GEO), biomedical ontology from Medical Subject Headings (MeSH) and signaling pathway knowledge from sigPathway entries to develop a context mining tool for gene expression analysis - GEOGLE. GEOGLE offers a rapid and convenient way for searching relevant experimental datasets, pathways and biological terms according to multiple types of queries: including biomedical vocabularies, GDS IDs, gene IDs, pathway names and signature list. Moreover, GEOGLE summarizes the signature genes from a subset of GDSes and estimates the correlation between gene expression and the phenotypic distinction with an integrated p value. This approach performing global searching of expression data may expand the traditional way of collecting heterogeneous gene expression experiment data. GEOGLE is a novel tool that provides researchers a quantitative way to understand the correlation between gene expression and phenotypic distinction through meta-analysis of gene expression datasets from different experiments, as well as the biological meaning behind. The web site and user guide of GEOGLE are available at: http://omics.biosino.org:14000/kweb/workflow.jsp?id=00020.
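
    The abstract does not say how GEOGLE computes its integrated p value. Purely as an illustration of how per-dataset p values can be combined, the sketch below uses Fisher's method, which is an assumption swapped in here and not a description of the tool. The statistic X = -2 * sum(ln p_i) follows a chi-square distribution with 2k degrees of freedom under the null hypothesis, and for even degrees of freedom the upper tail has the closed form used in the code. The input p values are hypothetical.

```python
from math import exp, log


def fisher_combined_p(p_values):
    """Combine independent p values with Fisher's method."""
    if not p_values or any(not 0 < p <= 1 for p in p_values):
        raise ValueError("p values must lie in (0, 1]")
    k = len(p_values)
    half_stat = -sum(log(p) for p in p_values)  # equals X / 2
    # Chi-square upper tail with 2k degrees of freedom:
    # exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return exp(-half_stat) * total


# Hypothetical p values for one gene in two independent datasets.
print(fisher_combined_p([0.05, 0.05]))
```

    Fisher's method assumes the datasets are independent; correlated experiments would call for a different combination rule.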

  18. The pineal gland: A model for adrenergic modulation of ubiquitin ligases.

    PubMed

    Vriend, Jerry; Liu, Wenjun; Reiter, Russel J

    2017-01-01

    A recent study of the pineal gland of the rat found that the expression of more than 3000 genes showed significant day/night variations (the Hartley dataset). The investigators of this report made available a supplemental table in which they tabulated the expression of many genes that they did not discuss, including those coding for components of the ubiquitin proteasome system. Herein we identify the genes of the ubiquitin proteasome system whose expression was significantly influenced by environmental lighting in the Hartley dataset, those that were stimulated by DBcAMP in pineal glands in culture, and those that were stimulated by norepinephrine. Using the Ubiquitin and Ubiquitin-like Conjugation Database (UUCA) we identified ubiquitin ligases and conjugases, and deubiquitinases in the Hartley dataset for the purpose of determining whether expression of genes of the ubiquitin proteasome pathway was significantly influenced by day/night variations and if these variations were regulated by autonomic innervation of the pineal gland from the superior cervical ganglia. In the Hartley experiments, pineal glands from groups of rats sacrificed during the day and groups sacrificed during the night were examined for gene expression. Additional groups of rats had their superior cervical ganglia removed surgically or surgically decentralized and the pineal glands likewise examined for gene expression. The genes with at least a 2-fold day/night significant difference in expression included genes for 5 ubiquitin conjugating enzymes, genes for 58 ubiquitin E3 ligases and genes for 6 deubiquitinases. A 35-fold day/night difference was noted in the expression of the gene Sik1, which codes for a protein containing both a ubiquitin binding domain (UBD) and a ubiquitin-associated (UBA) domain. Most of the significant differences in these genes were prevented by surgical removal, or disconnection, of the superior cervical ganglia, and most were responsive, in vitro, to treatment with a cyclic AMP analog and with norepinephrine. All previously described 24-hour rhythms in the pineal require an intact sympathetic input from the superior cervical ganglia. The Hartley dataset thus provides evidence that the pineal gland is a highly useful model for studying adrenergically dependent mechanisms regulating variations in ubiquitin ligases, ubiquitin conjugases, and deubiquitinases, mechanisms that may be physiologically relevant not only in the pineal gland, but in all adrenergically innervated tissue.

  19. The pineal gland: A model for adrenergic modulation of ubiquitin ligases

    PubMed Central

    Liu, Wenjun; Reiter, Russel J.

    2017-01-01

    Introduction A recent study of the pineal gland of the rat found that the expression of more than 3000 genes showed significant day/night variations (the Hartley dataset). The investigators of this report made available a supplemental table in which they tabulated the expression of many genes that they did not discuss, including those coding for components of the ubiquitin proteasome system. Herein we identify the genes of the ubiquitin proteasome system whose expression was significantly influenced by environmental lighting in the Hartley dataset, those that were stimulated by DBcAMP in pineal glands in culture, and those that were stimulated by norepinephrine. Purpose Using the Ubiquitin and Ubiquitin-like Conjugation Database (UUCA) we identified ubiquitin ligases and conjugases, and deubiquitinases in the Hartley dataset for the purpose of determining whether expression of genes of the ubiquitin proteasome pathway was significantly influenced by day/night variations and if these variations were regulated by autonomic innervation of the pineal gland from the superior cervical ganglia. Methods In the Hartley experiments, pineal glands from groups of rats sacrificed during the day and groups sacrificed during the night were examined for gene expression. Additional groups of rats had their superior cervical ganglia removed surgically or surgically decentralized and the pineal glands likewise examined for gene expression. Results The genes with at least a 2-fold day/night significant difference in expression included genes for 5 ubiquitin conjugating enzymes, genes for 58 ubiquitin E3 ligases and genes for 6 deubiquitinases. A 35-fold day/night difference was noted in the expression of the gene Sik1, which codes for a protein containing both a ubiquitin binding domain (UBD) and a ubiquitin-associated (UBA) domain. Most of the significant differences in these genes were prevented by surgical removal, or disconnection, of the superior cervical ganglia, and most were responsive, in vitro, to treatment with a cyclic AMP analog and with norepinephrine. All previously described 24-hour rhythms in the pineal require an intact sympathetic input from the superior cervical ganglia. Conclusions The Hartley dataset thus provides evidence that the pineal gland is a highly useful model for studying adrenergically dependent mechanisms regulating variations in ubiquitin ligases, ubiquitin conjugases, and deubiquitinases, mechanisms that may be physiologically relevant not only in the pineal gland, but in all adrenergically innervated tissue. PMID:28212404

  20. dynGENIE3: dynamical GENIE3 for the inference of gene networks from time series expression data.

    PubMed

    Huynh-Thu, Vân Anh; Geurts, Pierre

    2018-02-21

    The elucidation of gene regulatory networks is one of the major challenges of systems biology. Measurements about genes that are exploited by network inference methods are typically available either in the form of steady-state expression vectors or time series expression data. In our previous work, we proposed the GENIE3 method that exploits variable importance scores derived from Random forests to identify the regulators of each target gene. This method provided state-of-the-art performance on several benchmark datasets, but it could however not specifically be applied to time series expression data. We propose here an adaptation of the GENIE3 method, called dynamical GENIE3 (dynGENIE3), for handling both time series and steady-state expression data. The proposed method is evaluated extensively on the artificial DREAM4 benchmarks and on three real time series expression datasets. Although dynGENIE3 does not systematically yield the best performance on each and every network, it is competitive with diverse methods from the literature, while preserving the main advantages of GENIE3 in terms of scalability.

  1. Co-LncRNA: investigating the lncRNA combinatorial effects in GO annotations and KEGG pathways based on human RNA-Seq data.

    PubMed

    Zhao, Zheng; Bai, Jing; Wu, Aiwei; Wang, Yuan; Zhang, Jinwen; Wang, Zishan; Li, Yongsheng; Xu, Juan; Li, Xia

    2015-01-01

    Long non-coding RNAs (lncRNAs) are emerging as key regulators of diverse biological processes and diseases. However, the combinatorial effects of these molecules in a specific biological function are poorly understood. Identifying co-expressed protein-coding genes of lncRNAs would provide ample insight into lncRNA functions. To facilitate such an effort, we have developed Co-LncRNA, which is a web-based computational tool that allows users to identify GO annotations and KEGG pathways that may be affected by co-expressed protein-coding genes of a single or multiple lncRNAs. LncRNA co-expressed protein-coding genes were first identified in publicly available human RNA-Seq datasets, including 241 datasets across 6560 total individuals representing 28 tissue types/cell lines. Then, the lncRNA combinatorial effects in a given GO annotations or KEGG pathways are taken into account by the simultaneous analysis of multiple lncRNAs in user-selected individual or multiple datasets, which is realized by enrichment analysis. In addition, this software provides a graphical overview of pathways that are modulated by lncRNAs, as well as a specific tool to display the relevant networks between lncRNAs and their co-expressed protein-coding genes. Co-LncRNA also supports users in uploading their own lncRNA and protein-coding gene expression profiles to investigate the lncRNA combinatorial effects. It will be continuously updated with more human RNA-Seq datasets on an annual basis. Taken together, Co-LncRNA provides a web-based application for investigating lncRNA combinatorial effects, which could shed light on their biological roles and could be a valuable resource for this community. Database URL: http://www.bio-bigdata.com/Co-LncRNA/. © The Author(s) 2015. Published by Oxford University Press.
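
    The abstract says the combinatorial effects are assessed by enrichment analysis but does not name the test. A common choice for GO or KEGG enrichment is the one-sided hypergeometric test, sketched below as an assumption and not as Co-LncRNA's documented method. The function name and the gene counts in the example are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(universe, annotated, selected, overlap):
    """P(X >= overlap) for a hypergeometric draw.

    universe  -- total number of background genes
    annotated -- background genes carrying the GO term / pathway
    selected  -- size of the co-expressed gene set
    overlap   -- selected genes that carry the annotation
    """
    upper = min(annotated, selected)
    favourable = sum(
        comb(annotated, k) * comb(universe - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / comb(universe, selected)


# Hypothetical counts: 40 of 200 co-expressed genes fall in a pathway
# that covers 500 of 20000 background genes.
print(hypergeom_enrichment_p(20000, 500, 200, 40))
```

    In practice the resulting p values would be corrected for testing many GO terms or pathways at once.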

  2. Comparison of alternative approaches for analysing multi-level RNA-seq data

    PubMed Central

    Mohorianu, Irina; Bretman, Amanda; Smith, Damian T.; Fowler, Emily K.; Dalmay, Tamas

    2017-01-01

    RNA sequencing (RNA-seq) is widely used for RNA quantification in the environmental, biological and medical sciences. It enables the description of genome-wide patterns of expression and the identification of regulatory interactions and networks. The aim of RNA-seq data analyses is to achieve rigorous quantification of genes/transcripts to allow a reliable prediction of differential expression (DE), despite variation in levels of noise and inherent biases in sequencing data. This can be especially challenging for datasets in which gene expression differences are subtle, as in the behavioural transcriptomics test dataset from D. melanogaster that we used here. We investigated the power of existing approaches for quality checking mRNA-seq data and explored additional, quantitative quality checks. To accommodate nested, multi-level experimental designs, we incorporated sample layout into our analyses. We employed a subsampling without replacement-based normalization and an identification of DE that accounted for the hierarchy and amplitude of effect sizes within samples, then evaluated the resulting differential expression call in comparison to existing approaches. In a final step to test for broader applicability, we applied our approaches to a published set of H. sapiens mRNA-seq samples. The dataset-tailored methods improved sample comparability and delivered a robust prediction of subtle gene expression changes. The proposed approaches have the potential to improve key steps in the analysis of RNA-seq data by incorporating the structure and characteristics of biological experiments. PMID:28792517
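
    The idea behind subsampling without replacement can be sketched as follows: every sample is reduced to the same total number of reads by drawing reads at random, so that no read is counted twice. This is a toy illustration, not the authors' implementation; the function name, the fixed seed, and the counts are hypothetical, and expanding counts into a read list is only practical for small examples.

```python
import random
from collections import Counter


def subsample_counts(counts, target_total, seed=0):
    """Draw target_total reads without replacement from gene counts."""
    # Expand per-gene counts into one label per read.
    reads = [gene for gene, n in counts.items() for _ in range(n)]
    if target_total > len(reads):
        raise ValueError("target exceeds the number of available reads")
    rng = random.Random(seed)  # fixed seed for reproducibility
    sampled = Counter(rng.sample(reads, target_total))
    return {gene: sampled.get(gene, 0) for gene in counts}


# Hypothetical raw counts for one sample, reduced to 40 reads.
counts = {"g1": 50, "g2": 30, "g3": 20}
print(subsample_counts(counts, 40, seed=1))
```

    Applying the same target total to every sample makes library sizes equal before differential expression is called.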

  3. Sjögren's syndrome X-chromosome dose effect: An epigenetic perspective.

    PubMed

    Mougeot, J-Lc; Noll, B D; Bahrani Mougeot, F K

    2018-01-09

    Sjögren's syndrome (SS) is a chronic autoimmune disease affecting exocrine glands, leading to dryness of the mouth and eyes. The extent to which epigenetic DNA methylation changes are responsible for an X-chromosome dose effect has yet to be determined. Our objectives were to (i) describe how epigenetic DNA methylation changes could explain an X-chromosome dose effect in SS for women with normal 46,XX genotype and (ii) determine the relationships relevant to this dose effect between X-linked genes, genes controlling X-chromosome inactivation (XCI) and genes encoding associated transcription factors, all of which are differentially expressed and/or differentially methylated in the salivary glands of patients with SS. We identified 58 upregulated X-chromosome genes, including 22 genes previously shown to escape XCI, based on the analysis of SS patient salivary gland GEO2R gene expression datasets. Moreover, we found XIST and its cis regulators RLIM, FTX, and CHIC1, and polycomb repressor genes of the PRC1/2 complexes to be upregulated. Many of the X-chromosome genes implicated in SS pathogenesis can be regulated by transcription factors which we found to be overexpressed and/or differentially methylated in patients with SS. Determination of the mechanisms underlying methylation-dependent gene expression and impaired XCI is needed to further elucidate the etiopathogenesis of SS. © 2018 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd. All rights reserved.

  4. Functional relevance for type 1 diabetes mellitus-associated genetic variants by using integrative analyses.

    PubMed

    Qiu, Ying-Hua; Deng, Fei-Yan; Tang, Zai-Xiang; Jiang, Zhen-Huan; Lei, Shu-Feng

    2015-10-01

    Type 1 diabetes mellitus (type 1 DM) is an autoimmune disease. Although genome-wide association studies (GWAS) and meta-analyses have successfully identified numerous type 1 DM-associated susceptibility loci, the underlying mechanisms for these susceptibility loci are currently largely unclear. Based on publicly available datasets, we performed integrative analyses (i.e., integrated gene relationships among implicated loci, differential gene expression analysis, functional prediction and functional annotation clustering analysis) and combined them with expression quantitative trait loci (eQTL) results to further explore the functional mechanisms underlying the associations between genetic variants and type 1 DM. Among a total of 183 type 1 DM-associated SNPs, eQTL analysis identified 17 SNPs with cis-regulated eQTL effects on 9 genes. All 9 eQTL genes were enriched in immune-related pathways or Gene Ontology (GO) terms. Functional prediction analysis identified 5 SNPs located in transcription factor (TF) binding sites. Of the 9 eQTL genes, 6 (TAP2, HLA-DOB, HLA-DQB1, HLA-DQA1, HLA-DRB5 and CTSH) were differentially expressed in type 1 DM-related cells. In particular, rs3825932 in CTSH has integrative functional evidence supporting the association with type 1 DM. These findings indicate that integrative analyses can yield important functional information to link genetic variants and type 1 DM. Copyright © 2015 American Society for Histocompatibility and Immunogenetics. Published by Elsevier Inc. All rights reserved.

  5. Extensive shift in placental transcriptome profile in preeclampsia and placental origin of adverse pregnancy outcomes

    PubMed Central

    Sõber, Siim; Reiman, Mario; Kikas, Triin; Rull, Kristiina; Inno, Rain; Vaas, Pille; Teesalu, Pille; Marti, Jesus M. Lopez; Mattila, Pirkko; Laan, Maris

    2015-01-01

    One in five pregnant women suffer from gestational complications, prevalently driven by placental malfunction. Using RNASeq, we analyzed differential placental gene expression in cases of normal gestation, late-onset preeclampsia (LO-PE), gestational diabetes (GD) and pregnancies ending with the birth of small-for-gestational-age (SGA) or large-for-gestational-age (LGA) newborns (n = 8/group). In all groups, the highest expression was detected for small noncoding RNAs and genes specifically implicated in placental function and hormonal regulation. The transcriptome of LO-PE placentas was clearly distinct, showing statistically significant (after FDR) expressional disturbances for hundreds of genes. Taqman RT-qPCR validation of 45 genes in an extended sample (n = 24/group) provided concordant results. A limited number of transcription factors including LRF, SP1 and AP2 were identified as possible drivers of these changes. Notable differences were detected in differential expression signatures of LO-PE subtypes defined by the presence or absence of intrauterine growth restriction (IUGR). LO-PE with IUGR showed higher correlation with SGA and LO-PE without IUGR with LGA placentas. Whereas changes in placental transcriptome in SGA, LGA and GD cases were less prominent, the overall profiles of expressional disturbances overlapped among pregnancy complications, providing support for shared placental responses. The dataset represents a rich catalogue of potential biomarkers and therapeutic targets. PMID:26268791

  6. From DNA Copy Number to Gene Expression: Local aberrations, Trisomies and Monosomies

    NASA Astrophysics Data System (ADS)

    Shay, Tal

    The goal of my PhD research was to study the effect of DNA copy number changes on gene expression. DNA copy number aberrations may be local, encompassing several genes, or on the level of an entire chromosome, such as trisomy and monosomy. The main dataset I studied was of Glioblastoma, obtained in the framework of a collaboration, but I worked also with public datasets of cancer and Down's Syndrome. The molecular basis of expression changes in Glioblastoma. Glioblastoma is the most common and aggressive type of primary brain tumors in adults. In collaboration with Prof. Hegi (CHUV, Switzerland), we analyzed a rich Glioblastoma dataset including clinical information, DNA copy number (array CGH) and expression profiles. We explored the correlation between DNA copy number and gene expression at the level of chromosomal arms and local genomic aberrations. We detected known amplification and over expression of oncogenes, as well as deletion and down-regulation of tumor suppressor genes. We exploited that information to map alterations of pathways that are known to be disrupted in Glioblastoma, and tried to characterize samples that have no known alteration in any of the studied pathways. Identifying local DNA aberrations of biological significance. Many types of tumors exhibit chromosomal losses or gains and local amplifications and deletions. A region that is aberrant in many tumors, or whose copy number change is stronger, is more likely to be clinically relevant, and not just a by-product of genetic instability. We developed a novel method that defines and prioritizes aberrations by formalizing these intuitions. The method scores each aberration by the fraction of patients harboring it, its length and its amplitude, and assesses the significance of the score by comparing it to a null distribution obtained by permutations. This approach detects genetic locations that are significantly aberrant, generating a 'genomic aberration profile' for each sample. 
    The 'genomic aberration profile' is then combined with chromosomal arm status (gain/loss) to define a succinct genomic signature for each tumor. Unsupervised clustering of the samples based on these genomic signatures can reveal novel tumor subtypes. This approach was applied to datasets from three types of brain tumors: Glioblastoma, Medulloblastoma and Neuroblastoma, and identified a new subtype in Medulloblastoma, characterized by many chromosomal aberrations. Elucidating the transcriptional effect of monosomy and trisomy. Trisomy and monosomy are expected to impact the expression of genes that are located on the affected chromosome. Analysis of several cancer datasets revealed that not all the genes on the aberrant chromosome are affected by the change of copy number. Affected genes exhibit a wide range of expression changes with varying penetrance. Specifically, (1) the effect of trisomy is much more conserved among individuals than the effect of monosomy and (2) the expression level of a gene in the diploid is significantly correlated with the level of change between the diploid and the trisomy or monosomy.
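
    The aberration-scoring idea in this abstract (score by fraction of patients, length and amplitude, then compare with a permutation null) might be sketched as below. The abstract does not give the scoring formula, so the product used here, the carrier threshold and the within-patient shuffling scheme are illustrative assumptions, and all function names are hypothetical.

```python
import random


def region_score(profiles, start, end, threshold=0.3):
    """Score the probe range [start, end) across patients.

    profiles: list of per-patient lists of copy-number log ratios.
    Assumed score: fraction of carriers * region length * mean carrier amplitude.
    """
    length = end - start
    means = [sum(p[start:end]) / length for p in profiles]
    carriers = [m for m in means if abs(m) >= threshold]
    if not carriers:
        return 0.0
    fraction = len(carriers) / len(profiles)
    amplitude = sum(abs(m) for m in carriers) / len(carriers)
    return fraction * length * amplitude


def permutation_p_value(profiles, start, end, n_perm=200, seed=0):
    """Empirical p value of the region score against a permutation null.

    The null is built by shuffling probe order independently within each
    patient, which destroys any alignment of aberrations across patients.
    """
    rng = random.Random(seed)
    observed = region_score(profiles, start, end)
    hits = 0
    for _ in range(n_perm):
        shuffled = []
        for p in profiles:
            q = list(p)
            rng.shuffle(q)
            shuffled.append(q)
        if region_score(shuffled, start, end) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_perm + 1)
```

    A region shared by many patients scores far above anything the shuffled profiles produce, so its empirical p value approaches the floor of 1 / (n_perm + 1).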

  7. Data reuse and the open data citation advantage

    PubMed Central

    Vision, Todd J.

    2013-01-01

    Background. Attribution to the original contributor upon reuse of published data is important both as a reward for data creators and to document the provenance of research findings. Previous studies have found that papers with publicly available datasets receive a higher number of citations than similar studies without available data. However, few previous analyses have had the statistical power to control for the many variables known to predict citation rate, which has led to uncertain estimates of the “citation benefit”. Furthermore, little is known about patterns in data reuse over time and across datasets. Method and Results. Here, we look at citation rates while controlling for many known citation predictors and investigate the variability of data reuse. In a multivariate regression on 10,555 studies that created gene expression microarray data, we found that studies that made data available in a public repository received 9% (95% confidence interval: 5% to 13%) more citations than similar studies for which the data was not made available. Date of publication, journal impact factor, open access status, number of authors, first and last author publication history, corresponding author country, institution citation history, and study topic were included as covariates. The citation benefit varied with date of dataset deposition: a citation benefit was most clear for papers published in 2004 and 2005, at about 30%. Authors published most papers using their own datasets within two years of their first publication on the dataset, whereas data reuse papers published by third-party investigators continued to accumulate for at least six years. To study patterns of data reuse directly, we compiled 9,724 instances of third party data reuse via mention of GEO or ArrayExpress accession numbers in the full text of papers. 
    The level of third-party data use was high: for 100 datasets deposited in year 0, we estimated that 40 papers in PubMed reused a dataset by year 2, 100 by year 4, and more than 150 data reuse papers had been published by year 5. Data reuse was distributed across a broad base of datasets: a very conservative estimate found that 20% of the datasets deposited between 2003 and 2007 had been reused at least once by third parties. Conclusion. After accounting for other factors affecting citation rate, we find a robust citation benefit from open data, although a smaller one than previously reported. We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data. Other factors that may also contribute to the citation benefit are considered. We further conclude that, at least for gene expression microarray data, a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003. PMID:24109559

  8. Low HIP1R mRNA and protein expression are associated with worse survival in diffuse large B-cell lymphoma patients treated with R-CHOP.

    PubMed

    Wong, Kah Keng; Ch'ng, Ewe Seng; Loo, Suet Kee; Husin, Azlan; Muruzabal, María Arestin; Møller, Michael B; Pedersen, Lars M; Pomposo, María Puente; Gaafar, Ayman; Banham, Alison H; Green, Tina M; Lawrie, Charles H

    2015-12-01

    Huntingtin-interacting protein 1-related (HIP1R) is an endocytic protein involved in receptor trafficking, including regulating cell surface expression of receptor tyrosine kinases. We have previously shown that low HIP1R protein expression was associated with poorer survival in diffuse large B-cell lymphoma (DLBCL) patients from Denmark treated with R-CHOP (rituximab, cyclophosphamide, doxorubicin, vincristine, prednisone). In this multicenter study, we extend these findings and validate the prognostic and subtyping utility of HIP1R expression at both transcript and protein level. Using data mining on three independent transcriptomic datasets of DLBCL, HIP1R transcript was preferentially expressed in germinal center B-cell (GCB)-like DLBCL subtype (P<0.01 in all three datasets), and lower expression was correlated with worse overall survival (OS; P<0.01) and progression-free survival (PFS; P<0.05) in a microarray-profiled DLBCL dataset. At the protein level examined by immunohistochemistry, HIP1R expression at 30% cut-off was associated with GCB-DLBCL molecular subtype (P=0.0004; n=42), and predictive of OS (P=0.0006) and PFS (P=0.0230) in de novo DLBCL patients treated with R-CHOP (n=73). Cases with high FOXP1 and low HIP1R expression frequency (FOXP1(hi)/HIP1R(lo) phenotype) exhibited poorer OS (P=0.0038) and PFS (P=0.0134). Multivariate analysis showed that HIP1R<30% or FOXP1(hi)/HIP1R(lo) subgroup of patients exhibited inferior OS and PFS (P<0.05) independently of the International Prognostic Index. We conclude that HIP1R expression is strongly indicative of survival when utilized on its own or in combination with FOXP1, and the molecule is potentially applicable for subtyping of DLBCL cases. Copyright © 2015 Elsevier Inc. All rights reserved.

  9. VTCdb: a gene co-expression database for the crop species Vitis vinifera (grapevine).

    PubMed

    Wong, Darren C J; Sweetman, Crystal; Drew, Damian P; Ford, Christopher M

    2013-12-16

    Gene expression datasets in model plants such as Arabidopsis have contributed to our understanding of gene function and how a single underlying biological process can be governed by a diverse network of genes. The accumulation of publicly available microarray data encompassing a wide range of biological and environmental conditions has enabled the development of additional capabilities including gene co-expression analysis (GCA). GCA is based on the understanding that genes encoding proteins involved in similar and/or related biological processes may exhibit comparable expression patterns over a range of experimental conditions, developmental stages and tissues. We present an open access database for the investigation of gene co-expression networks within the cultivated grapevine, Vitis vinifera. The new gene co-expression database, VTCdb (http://vtcdb.adelaide.edu.au/Home.aspx), offers an online platform for transcriptional regulatory inference in the cultivated grapevine. Using condition-independent and condition-dependent approaches, grapevine co-expression networks were constructed using the latest publicly available microarray datasets from diverse experimental series, utilising the Affymetrix Vitis vinifera GeneChip (16 K) and the NimbleGen Grape Whole-genome microarray chip (29 K), thus making it possible to profile approximately 29,000 genes (95% of the predicted grapevine transcriptome). Applications available with the online platform include the use of gene names, probesets, modules or biological processes to query the co-expression networks, with the option to choose between Affymetrix or Nimblegen datasets and between multiple co-expression measures. Alternatively, the user can browse existing network modules using interactive network visualisation and analysis via CytoscapeWeb. 
    To demonstrate the utility of the database, we present examples from three fundamental biological processes (berry development, photosynthesis and flavonoid biosynthesis) whereby the recovered sub-networks reconfirm established plant gene functions and also identify novel associations. Together, we present valuable insights into grapevine transcriptional regulation by developing network models applicable to researchers in their prioritisation of gene candidates, for on-going study of biological processes related to grapevine development, metabolism and stress responses.

  10. The Physcomitrella patens gene atlas project: large-scale RNA-seq based expression data.

    PubMed

    Perroud, Pierre-François; Haas, Fabian B; Hiss, Manuel; Ullrich, Kristian K; Alboresi, Alessandro; Amirebrahimi, Mojgan; Barry, Kerrie; Bassi, Roberto; Bonhomme, Sandrine; Chen, Haodong; Coates, Juliet C; Fujita, Tomomichi; Guyon-Debast, Anouchka; Lang, Daniel; Lin, Junyan; Lipzen, Anna; Nogué, Fabien; Oliver, Melvin J; Ponce de León, Inés; Quatrano, Ralph S; Rameau, Catherine; Reiss, Bernd; Reski, Ralf; Ricca, Mariana; Saidi, Younousse; Sun, Ning; Szövényi, Péter; Sreedasyam, Avinash; Grimwood, Jane; Stacey, Gary; Schmutz, Jeremy; Rensing, Stefan A

    2018-07-01

    High-throughput RNA sequencing (RNA-seq) has recently become the method of choice to define and analyze transcriptomes. For the model moss Physcomitrella patens, although this method has been used to help analyze specific perturbations, no overall reference dataset has yet been established. In the framework of the Gene Atlas project, the Joint Genome Institute selected P. patens as a flagship genome, opening the way to generate the first comprehensive transcriptome dataset for this moss. The first round of sequencing described here is composed of 99 independent libraries spanning 34 different developmental stages and conditions. Upon dataset quality control and processing through read mapping, 28 509 of the 34 361 v3.3 gene models (83%) were detected to be expressed across the samples. Differentially expressed genes (DEGs) were calculated across the dataset to permit perturbation comparisons between conditions. The analysis of the three most distinct and abundant P. patens growth stages - protonema, gametophore and sporophyte - allowed us to define both general transcriptional patterns and stage-specific transcripts. As an example of variation of physico-chemical growth conditions, we detail here the impact of ammonium supplementation under standard growth conditions on the protonemal transcriptome. Finally, the cooperative nature of this project allowed us to analyze inter-laboratory variation, as 13 different laboratories around the world provided samples. We compare differences in the replication of experiments in a single laboratory and between different laboratories. © 2018 The Authors The Plant Journal © 2018 John Wiley & Sons Ltd.

  11. Causes and Consequences of Genetic Background Effects Illuminated by Integrative Genomic Analysis

    PubMed Central

    Chandler, Christopher H.; Chari, Sudarshan; Dworkin, Ian

    2014-01-01

    The phenotypic consequences of individual mutations are modulated by the wild-type genetic background in which they occur. Although such background dependence is widely observed, we do not know whether general patterns across species and traits exist or about the mechanisms underlying it. We also lack knowledge on how mutations interact with genetic background to influence gene expression and how this in turn mediates mutant phenotypes. Furthermore, how genetic background influences patterns of epistasis remains unclear. To investigate the genetic basis and genomic consequences of genetic background dependence of the scallopedE3 allele on the Drosophila melanogaster wing, we generated multiple novel genome-level datasets from a mapping-by-introgression experiment and a tagged RNA gene expression dataset. In addition we used whole genome resequencing of the parental lines—two commonly used laboratory strains—to predict polymorphic transcription factor binding sites for SD. We integrated these data with previously published genomic datasets from expression microarrays and a modifier mutation screen. By searching for genes showing a congruent signal across multiple datasets, we were able to identify a robust set of candidate loci contributing to the background-dependent effects of mutations in sd. We also show that the majority of background-dependent modifiers previously reported are caused by higher-order epistasis, not quantitative noncomplementation. These findings provide a useful foundation for more detailed investigations of genetic background dependence in this system, and this approach is likely to prove useful in exploring the genetic basis of other traits as well. PMID:24504186

  12. Accurate and fast multiple-testing correction in eQTL studies.

    PubMed

    Sul, Jae Hoon; Raj, Towfique; de Jong, Simone; de Bakker, Paul I W; Raychaudhuri, Soumya; Ophoff, Roel A; Stranger, Barbara E; Eskin, Eleazar; Han, Buhm

    2015-06-04

    In studies of expression quantitative trait loci (eQTLs), it is of increasing interest to identify eGenes, the genes whose expression levels are associated with variation at a particular genetic variant. Detecting eGenes is important for follow-up analyses and prioritization because genes are the main entities in biological processes. To detect eGenes, one typically focuses on the genetic variant with the minimum p value among all variants in cis with a gene and corrects for multiple testing to obtain a gene-level p value. For performing multiple-testing correction, a permutation test is widely used. Because of growing sample sizes of eQTL studies, however, the permutation test has become a computational bottleneck in eQTL studies. In this paper, we propose an efficient approach for correcting for multiple testing and assess eGene p values by utilizing a multivariate normal distribution. Our approach properly takes into account the linkage-disequilibrium structure among variants, and its time complexity is independent of sample size. By applying our small-sample correction techniques, our method achieves high accuracy in both small and large studies. We have shown that our method consistently produces extremely accurate p values (accuracy > 98%) for three human eQTL datasets with different sample sizes and SNP densities: the Genotype-Tissue Expression pilot dataset, the multi-region brain dataset, and the HapMap 3 dataset. Copyright © 2015 The American Society of Human Genetics. Published by Elsevier Inc. All rights reserved.
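
    The permutation baseline that this abstract describes as the computational bottleneck can be sketched as follows. This illustrates the conventional permutation test, not the multivariate-normal method the authors propose; the function names are hypothetical, and the maximum squared Pearson correlation is used as a stand-in for the minimum p value (for a fixed sample size the two rank variants identically).

```python
import random


def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def best_r2(genotypes, expression):
    """Strongest association over all cis variants of one gene."""
    return max(pearson_r(g, expression) ** 2 for g in genotypes)


def egene_p_value(genotypes, expression, n_perm=1000, seed=0):
    """Gene-level p value corrected for testing many cis variants.

    genotypes: list of per-variant genotype lists (one value per individual).
    Expression is permuted across individuals while the genotype matrix is
    left intact, so linkage disequilibrium among variants is preserved.
    """
    rng = random.Random(seed)
    observed = best_r2(genotypes, expression)
    permuted = list(expression)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(permuted)
        if best_r2(genotypes, permuted) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

    Each permutation costs one pass over every variant and every individual, which is why the running time grows with sample size and motivates the sampling-free alternative described above.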

  13. Analysis of the precipitation and streamflow extremes in Northern Italy using high resolution reanalysis dataset Express-Hydro

    NASA Astrophysics Data System (ADS)

    Silvestro, Francesco; Parodi, Antonio; Campo, Lorenzo

    2017-04-01

    The characterization of the hydrometeorological extremes, both in terms of rainfall and streamflow, in a given region plays a key role in the environmental monitoring provided by the flood alert services. In recent years, meteorological simulations (both near real-time and historical reanalysis) have become available at increasing spatial and temporal resolutions, making possible long-period hydrological reanalyses in which the meteorological dataset is used as input to distributed hydrological models. In this work, a very high resolution meteorological reanalysis dataset, namely Express-Hydro (CIMA, ISAC-CNR, GAUSS Special Project PR45DE), was employed as input to the hydrological model Continuum in order to produce long time series of streamflows in the Liguria territory, located in the northern part of Italy. The original dataset covers the whole of Europe for the 1979-2008 period, at 4 km spatial resolution and 3-hour temporal resolution. The rainfall estimated by the dataset was compared with observations (available from the local rain gauge network), and a bias correction was also performed in order to better match the observed climatology. Finally, an analysis of extremes was carried out on the streamflow time series obtained from the simulations, comparing them with the results of the same hydrological model fed with the observed time series of rainfall. The results of the analysis are shown and discussed.

  14. An efficient annotation and gene-expression derivation tool for Illumina Solexa datasets

    PubMed Central

    2010-01-01

    Background An Illumina flow cell with all eight lanes occupied produces well over a terabyte of images, with gigabytes of reads following sequence alignment. The ability to translate such reads into meaningful annotation is therefore of great concern and importance. One can very easily be flooded with such a great volume of textual, unannotated data irrespective of read quality or size. CASAVA, an optional analysis tool for Illumina sequencing experiments, provides INDEL detection, SNP information, and allele calling. Extracting from such analysis a measure of gene expression in the form of tag-counts, and furthermore annotating the reads, is therefore of significant value. Findings We developed TASE (Tag counting and Analysis of Solexa Experiments), a rapid tag-counting and annotation software tool specifically designed for Illumina CASAVA sequencing datasets. Developed in Java and deployed using jTDS JDBC driver and a SQL Server backend, TASE provides an extremely fast means of calculating gene expression through tag-counts while annotating sequenced reads with the gene's presumed function, from any given CASAVA-build. Such a build is generated for both DNA and RNA sequencing. Analysis is broken into two distinct components: DNA sequence or read concatenation, followed by tag-counting and annotation. The end result produces output containing the homology-based functional annotation and respective gene expression measure signifying how many times sequenced reads were found within the genomic ranges of functional annotations. Conclusions TASE is a powerful tool to facilitate the process of annotating a given Illumina Solexa sequencing dataset. Our results indicate that both homology-based annotation and tag-count analysis are achieved in very efficient times, allowing researchers to delve deep into a given CASAVA-build and maximize information extraction from a sequencing dataset. TASE is specially designed to translate sequence data in a CASAVA-build into functional annotations while producing corresponding gene expression measurements. The analysis is executed in an ultrafast and highly efficient manner, whether for a single-read or paired-end sequencing experiment. TASE is a user-friendly and freely available application, allowing rapid analysis and annotation of any given Illumina Solexa sequencing dataset with ease. PMID:20598141

  15. Demonstrating the robustness of population surveillance data: implications of error rates on demographic and mortality estimates.

    PubMed

    Fottrell, Edward; Byass, Peter; Berhane, Yemane

    2008-03-25

    As in any measurement process, a certain amount of error may be expected in routine population surveillance operations such as those in demographic surveillance sites (DSSs). Vital events are likely to be missed and errors made no matter what method of data capture is used or what quality control procedures are in place. The extent to which random errors in large, longitudinal datasets affect overall health and demographic profiles has important implications for the role of DSSs as platforms for public health research and clinical trials. Such knowledge is also of particular importance if the outputs of DSSs are to be extrapolated and aggregated with realistic margins of error and validity. This study uses the first 10-year dataset from the Butajira Rural Health Project (BRHP) DSS, Ethiopia, covering approximately 336,000 person-years of data. Simple programmes were written to introduce random errors and omissions into new versions of the definitive 10-year Butajira dataset. Key parameters of sex, age, death, literacy and roof material (an indicator of poverty) were selected for the introduction of errors based on their obvious importance in demographic and health surveillance and their established significant associations with mortality. Defining the original 10-year dataset as the 'gold standard' for the purposes of this investigation, population, age and sex compositions and Poisson regression models of mortality rate ratios were compared between each of the intentionally erroneous datasets and the original 'gold standard' 10-year data. The composition of the Butajira population was well represented despite introducing random errors, and differences between population pyramids based on the derived datasets were subtle. Regression analyses of well-established mortality risk factors were largely unaffected even by relatively high levels of random errors in the data. 
    The low sensitivity of parameter estimates and regression analyses to significant amounts of randomly introduced errors indicates a high level of robustness of the dataset. This apparent inertia of population parameter estimates to simulated errors is largely due to the size of the dataset. Tolerable margins of random error in DSS data may exceed 20%. While this is not an argument in favour of poor quality data, reducing the time and valuable resources spent on detecting and correcting random errors in routine DSS operations may be justifiable as the returns from such procedures diminish with increasing overall accuracy. The money and effort currently spent on endlessly correcting DSS datasets would perhaps be better spent on increasing the surveillance population size and geographic spread of DSSs and analysing and disseminating research findings.
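
    The error-injection experiment described in this abstract could be imitated along the following lines. This is a toy sketch on synthetic records, not the programmes written for the Butajira dataset; the record layout, field names and function names are hypothetical.

```python
import random


def inject_errors(records, field, values, error_rate, seed=0):
    """Return copies of the records with `field` randomly corrupted.

    With probability error_rate, a record's value is replaced by a different
    value drawn uniformly from `values`. The input records are not modified.
    """
    rng = random.Random(seed)
    corrupted = []
    for rec in records:
        rec = dict(rec)
        if rng.random() < error_rate:
            rec[field] = rng.choice([v for v in values if v != rec[field]])
        corrupted.append(rec)
    return corrupted


def proportion(records, field, value):
    """Share of records whose `field` equals `value`."""
    return sum(1 for r in records if r[field] == value) / len(records)


# Synthetic 'gold standard' with an even sex split (an assumption for illustration).
gold = [{"id": i, "sex": "F" if i % 2 == 0 else "M"} for i in range(10000)]
noisy = inject_errors(gold, "sex", ["F", "M"], error_rate=0.2)
changed = sum(1 for a, b in zip(gold, noisy) if a["sex"] != b["sex"])
```

    In this balanced toy case the aggregate proportion barely moves even though roughly a fifth of the individual records are wrong. Note that with an unbalanced field, random misclassification would pull the proportion toward an even split, so the example should not be read as a general guarantee.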

  16. On the mutual relationship between conceptual models and datasets in geophysical monitoring of volcanic systems

    NASA Astrophysics Data System (ADS)

    Neuberg, J. W.; Thomas, M.; Pascal, K.; Karl, S.

    2012-04-01

    Geophysical datasets are essential to guide forecasting of volcanic activity, particularly in the short term. Key parameters are derived from these datasets and interpreted in different ways; however, the biggest impact on the interpretation is not determined by the range of parameters but controlled through the parameterisation and the underlying conceptual model of the volcanic process. On the other hand, the increasing number of sophisticated geophysical models needs to be constrained by monitoring data to transform a merely numerical exercise into a useful forecasting tool. We utilise datasets from the "big three", seismology, deformation and gas emissions, to gain insight into the mutual relationship between conceptual models and constraining data. We show that, for example, the same seismic dataset can be interpreted with respect to a wide variety of different models with very different implications for forecasting. In turn, different data processing procedures lead to different outcomes even though they are based on the same conceptual model. Unsurprisingly, the most reliable interpretation will be achieved by employing multi-disciplinary models with overlapping constraints.

  17. A multi-strategy approach to informative gene identification from gene expression data.

    PubMed

    Liu, Ziying; Phan, Sieu; Famili, Fazel; Pan, Youlian; Lenferink, Anne E G; Cantin, Christiane; Collins, Catherine; O'Connor-McCourt, Maureen D

    2010-02-01

    An unsupervised multi-strategy approach has been developed to identify informative genes from high throughput genomic data. Several statistical methods have been used in the field to identify differentially expressed genes. Since different methods generate different lists of genes, it is very challenging to determine the most reliable gene list and the appropriate method. This paper presents a multi-strategy method, in which a combination of several data analysis techniques is applied to a given dataset and a confidence measure is established to select genes from the gene lists generated by these techniques to form the core of our final selection. The remainder of the genes, which form the peripheral region, are subject to exclusion from or inclusion in the final selection. This paper demonstrates this methodology through its application to an in-house cancer genomics dataset and a public dataset. The results indicate that our method provides a more reliable list of genes, which is validated using biological knowledge, biological experiments, and literature search. We further evaluated our multi-strategy method by consolidating two pairs of independent datasets, each pair representing the same disease but generated by different labs using different platforms. The results showed that our method produced far better results.

  18. Geoseq: a tool for dissecting deep-sequencing datasets.

    PubMed

    Gurtowski, James; Cancio, Anthony; Shah, Hardik; Levovitz, Chaya; George, Ajish; Homann, Robert; Sachidanandam, Ravi

    2010-10-12

    Datasets generated on deep-sequencing platforms have been deposited in various public repositories such as the Gene Expression Omnibus (GEO), Sequence Read Archive (SRA) hosted by the NCBI, or the DNA Data Bank of Japan (DDBJ). Despite being rich data sources, they have not been used much due to the difficulty in locating and analyzing datasets of interest. Geoseq (http://geoseq.mssm.edu) provides a new method of analyzing short reads from deep sequencing experiments. Instead of mapping the reads to reference genomes or sequences, Geoseq maps a reference sequence against the sequencing data. It is web-based, and holds pre-computed data from public libraries. The analysis reduces the input sequence to tiles and measures the coverage of each tile in a sequence library through the use of suffix arrays. The user can upload custom target sequences or use gene/miRNA names for the search and get back results as plots and spreadsheet files. Geoseq organizes the public sequencing data using a controlled vocabulary, allowing identification of relevant libraries by organism, tissue and type of experiment. Analysis of small sets of sequences against deep-sequencing datasets, as well as identification of public datasets of interest, is simplified by Geoseq. We applied Geoseq to (a) identify differential isoform expression in mRNA-seq datasets, (b) identify miRNAs (microRNAs) in libraries and identify mature and star sequences in miRNAs, and (c) identify potentially mis-annotated miRNAs. The ease of using Geoseq for these analyses suggests its utility and uniqueness as an analysis tool.
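
    The tiling step described in this abstract can be illustrated with a small sketch. Geoseq itself is reported to use suffix arrays; the version below swaps in a plain k-mer count table for brevity, ignores reverse complements, and uses hypothetical function names and a hypothetical tile length, so it shows the idea rather than the tool's implementation.

```python
from collections import Counter


def tile_sequence(seq, tile_len, step=1):
    """Cut a reference sequence into (position, tile) pairs of fixed length."""
    return [(i, seq[i:i + tile_len]) for i in range(0, len(seq) - tile_len + 1, step)]


def kmer_index(reads, k):
    """Count every length-k substring occurring in the read library."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def tile_coverage(reference, reads, tile_len=4, step=1):
    """Coverage of each reference tile, i.e. how often it occurs in the reads."""
    index = kmer_index(reads, tile_len)
    return [(pos, index[tile]) for pos, tile in tile_sequence(reference, tile_len, step)]
```

    Because the library index is built once and then queried per tile, the reference is effectively searched against the sequencing data rather than the reads being mapped to a genome, which mirrors the direction of the search described above.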

  19. Differential prioritization between relevance and redundancy in correlation-based feature selection techniques for multiclass gene expression data.

    PubMed

    Ooi, Chia Huey; Chetty, Madhu; Teng, Shyh Wei

    2006-06-23

    Due to the large number of genes in a typical microarray dataset, feature selection looks set to play an important role in reducing noise and computational cost in gene expression-based tissue classification while improving accuracy at the same time. Surprisingly, this does not appear to be the case for all multiclass microarray datasets. The reason is that many feature selection techniques applied on microarray datasets are either rank-based and hence do not take into account correlations between genes, or are wrapper-based, which require high computational cost, and often yield difficult-to-reproduce results. In studies where correlations between genes are considered, attempts to establish the merit of the proposed techniques are hampered by evaluation procedures which are less than meticulous, resulting in overly optimistic estimates of accuracy. We present two realistically evaluated correlation-based feature selection techniques which incorporate, in addition to the two existing criteria involved in forming a predictor set (relevance and redundancy), a third criterion called the degree of differential prioritization (DDP). DDP functions as a parameter to strike the balance between relevance and redundancy, providing our techniques with the novel ability to differentially prioritize the optimization of relevance against redundancy (and vice versa). This ability proves useful in producing optimal classification accuracy while using reasonably small predictor set sizes for nine well-known multiclass microarray datasets. For multiclass microarray datasets, especially the GCM and NCI60 datasets, DDP enables our filter-based techniques to produce accuracies better than those reported in previous studies which employed similarly realistic evaluation procedures.
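
    The abstract does not give the scoring formula, so the following is only a sketch of how a single parameter might trade relevance against redundancy. It assumes a hypothetical set score of the form (mean relevance)^alpha x (mean anti-redundancy)^(1 - alpha) and a greedy forward search; the function names, the inputs and the exact weighting are assumptions, not the authors' definitions.

```python
from itertools import combinations


def set_score(genes, relevance, abs_corr, alpha):
    """Hypothetical score: relevance and anti-redundancy balanced by alpha."""
    v = sum(relevance[g] for g in genes) / len(genes)
    pairs = list(combinations(genes, 2))
    # Anti-redundancy: mean of (1 - |correlation|) over all gene pairs.
    u = sum(1.0 - abs_corr[a][b] for a, b in pairs) / len(pairs) if pairs else 1.0
    return (v ** alpha) * (u ** (1.0 - alpha))


def greedy_select(relevance, abs_corr, size, alpha):
    """Greedy forward selection starting from the most relevant gene."""
    selected = [max(range(len(relevance)), key=lambda g: relevance[g])]
    while len(selected) < size:
        candidates = [g for g in range(len(relevance)) if g not in selected]
        best = max(candidates,
                   key=lambda g: set_score(selected + [g], relevance, abs_corr, alpha))
        selected.append(best)
    return selected


# Toy inputs (made up): genes 0 and 1 are both relevant but highly correlated.
relevance = [0.90, 0.85, 0.50]
abs_corr = [[1.00, 0.95, 0.10],
            [0.95, 1.00, 0.10],
            [0.10, 0.10, 1.00]]
print(greedy_select(relevance, abs_corr, size=2, alpha=1.0))  # relevance only
print(greedy_select(relevance, abs_corr, size=2, alpha=0.5))  # balanced
```

    With alpha = 1 the sketch ignores correlations and behaves like a rank-based filter; lowering alpha makes it prefer a less relevant but less redundant gene.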

  20. Aberrant Expression of COT Is Related to Recurrence of Papillary Thyroid Cancer

    PubMed Central

    Lee, Jandee; Jeong, Seonhyang; Park, Jae Hyun; Lee, Cho Rok; Ku, Cheol Ryong; Kang, Sang-Wook; Jeong, Jong Ju; Nam, Kee-Hyun; Shin, Dong Yeob; Lee, Eun Jig; Chung, Woong Youn; Jo, Young Suk

    2015-01-01

    Aberrant expression of Cancer Osaka Thyroid Oncogene mitogen-activated protein kinase kinase kinase 8 (COT) (MAP3K8) is a driver of resistance to B-RAF inhibition. However, the de novo expression and clinical implications of COT in papillary thyroid cancer (PTC) have not been investigated. The aim of this study is to investigate the expression of A-, B-, C-RAF, and COT in PTC (n = 167) and analyze the clinical implications of aberrant expression of these genes. Quantitative polymerase chain reaction (qPCR) and immunohistochemical staining (IHC) were performed on primary thyroid cancers. Expression of COT was compared with clinicopathological characteristics including recurrence-free survival. Datasets from a public repository (NCBI) were subjected to Gene Set Enrichment Analysis (GSEA). qPCR data showed that the relative mRNA expression of A-, B-, C-RAF and COT in PTC was higher than in normal tissues (all P < 0.01). In addition, the expression of COT mRNA in PTC showed positive correlation with A- (r = 0.4083, P < 0.001), B- (r = 0.2773, P = 0.0003), and C-RAF (r = 0.5954, P < 0.001). The mRNA expressions of A-, B-, and C-RAF were also correlated with each other (all P < 0.001). In IHC, the staining intensities of B-RAF and COT were higher in PTC than in normal tissue (P < 0.001). Interestingly, moderate-to-strong staining intensities of B-RAF and COT were more frequent in B-RAF V600E-positive PTC (P < 0.001, P = 0.013, respectively). In addition, aberrant expression of COT was related to old age at initial diagnosis (P = 0.045) and higher recurrence rate (P = 0.025). In multivariate analysis, tumor recurrence was persistently associated with moderate-to-strong staining of COT after adjusting for age, sex, extrathyroidal extension, multifocality, T-stage, N-stage, TNM stage, and B-RAF V600E mutation (odds ratio, 4.662; 95% confidence interval 1.066-21.609; P = 0.045). Moreover, moderate-to-strong COT expression in PTC was associated with shorter recurrence-free survival (mean follow-up duration, 14.2 ± 4.1 years; P = 0.0403). GSEA indicated that gene sets related to B-RAF-RAS (P < 0.0001, false discovery rate [FDR] q-value = 0.000) and thyroid differentiation (P = 0.048, FDR q-value = 0.05) scores were enriched in the lower COT expression group, and gene sets such as T-cell receptor signaling pathway and Toll-like receptor signaling pathway were coordinately upregulated in the higher COT expression group (both, P < 0.0001, FDR q-value = 0.000). Aberrant expression of A-, B-, and C-RAF, and COT is frequent in PTC; increased expression of COT is correlated with recurrence of PTC. PMID:25674762

  1. Aberrant expression of COT is related to recurrence of papillary thyroid cancer.

    PubMed

    Lee, Jandee; Jeong, Seonhyang; Park, Jae Hyun; Lee, Cho Rok; Ku, Cheol Ryong; Kang, Sang-Wook; Jeong, Jong Ju; Nam, Kee-Hyun; Shin, Dong Yeob; Lee, Eun Jig; Chung, Woong Youn; Jo, Young Suk

    2015-02-01

    Aberrant expression of Cancer Osaka Thyroid Oncogene mitogen-activated protein kinase kinase kinase 8 (COT) (MAP3K8) is a driver of resistance to B-RAF inhibition. However, the de novo expression and clinical implications of COT in papillary thyroid cancer (PTC) have not been investigated. The aim of this study is to investigate the expression of A-, B-, C-RAF, and COT in PTC (n = 167) and analyze the clinical implications of aberrant expression of these genes. Quantitative polymerase chain reaction (qPCR) and immunohistochemical staining (IHC) were performed on primary thyroid cancers. Expression of COT was compared with clinicopathological characteristics including recurrence-free survival. Datasets from a public repository (NCBI) were subjected to Gene Set Enrichment Analysis (GSEA). qPCR data showed that the relative mRNA expression of A-, B-, C-RAF and COT in PTC was higher than in normal tissues (all P < 0.01). In addition, the expression of COT mRNA in PTC showed positive correlation with A- (r = 0.4083, P < 0.001), B- (r = 0.2773, P = 0.0003), and C-RAF (r = 0.5954, P < 0.001). The mRNA expressions of A-, B-, and C-RAF were also correlated with each other (all P < 0.001). In IHC, the staining intensities of B-RAF and COT were higher in PTC than in normal tissue (P < 0.001). Interestingly, moderate-to-strong staining intensities of B-RAF and COT were more frequent in B-RAF-positive PTC (P < 0.001, P = 0.013, respectively). In addition, aberrant expression of COT was related to old age at initial diagnosis (P = 0.045) and higher recurrence rate (P = 0.025). In multivariate analysis, tumor recurrence was persistently associated with moderate-to-strong staining of COT after adjusting for age, sex, extrathyroidal extension, multifocality, T-stage, N-stage, TNM stage, and B-RAF mutation (odds ratio, 4.662; 95% confidence interval 1.066-21.609; P = 0.045). Moreover, moderate-to-strong COT expression in PTC was associated with shorter recurrence-free survival (mean follow-up duration, 14.2 ± 4.1 years; P = 0.0403). GSEA indicated that gene sets related to B-RAF-RAS (P < 0.0001, false discovery rate [FDR] q-value = 0.000) and thyroid differentiation (P = 0.048, FDR q-value = 0.05) scores were enriched in the lower COT expression group, and gene sets such as T-cell receptor signaling pathway and Toll-like receptor signaling pathway were coordinately upregulated in the higher COT expression group (both, P < 0.0001, FDR q-value = 0.000). Aberrant expression of A-, B-, and C-RAF, and COT is frequent in PTC; increased expression of COT is correlated with recurrence of PTC.

  2. Estimating mutual information using B-spline functions – an improved similarity measure for analysing gene expression data

    PubMed Central

    Daub, Carsten O; Steuer, Ralf; Selbig, Joachim; Kloska, Sebastian

    2004-01-01

    Background The information theoretic concept of mutual information provides a general framework to evaluate dependencies between variables. In the context of the clustering of genes with similar patterns of expression it has been suggested as a general quantity of similarity to extend commonly used linear measures. Since mutual information is defined in terms of discrete variables, its application to continuous data requires the use of binning procedures, which can lead to significant numerical errors for datasets of small or moderate size. Results In this work, we propose a method for the numerical estimation of mutual information from continuous data. We investigate the characteristic properties arising from the application of our algorithm and show that our approach outperforms commonly used algorithms: The significance, as a measure of the power of distinction from random correlation, is significantly increased. This concept is subsequently illustrated on two large-scale gene expression datasets and the results are compared to those obtained using other similarity measures. A C++ source code of our algorithm is available for non-commercial use from kloska@scienion.de upon request. Conclusion The utilisation of mutual information as similarity measure enables the detection of non-linear correlations in gene expression datasets. Frequently applied linear correlation measures, which are often used on an ad-hoc basis without further justification, are thereby extended. PMID:15339346
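
    For context, the binning procedure that this abstract identifies as a source of numerical error can be sketched as follows. This is the naive equal-width histogram estimator, not the B-spline method the authors propose; the function name and the choice of four bins are assumptions made for illustration.

```python
import math
from collections import Counter


def binned_mutual_information(x, y, bins=4):
    """Mutual information (in bits) from equal-width binning of two samples."""
    def to_bins(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0  # guard against constant input
        # Clamp the maximum value into the last bin.
        return [min(int((v - lo) / width), bins - 1) for v in values]

    bx, by = to_bins(x), to_bins(y)
    n = len(x)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    mi = 0.0
    for (i, j), count in joint.items():
        p_ij = count / n
        mi += p_ij * math.log2(p_ij / ((px[i] / n) * (py[j] / n)))
    return mi


# Toy data (made up): y depends on x through a symmetric, non-linear curve,
# so the linear correlation is zero while the dependency is strong.
x = list(range(8))
y = [(v - 3.5) ** 2 for v in x]
mi_value = binned_mutual_information(x, y, bins=4)
print(mi_value)
```

    Because every data point falls into exactly one bin, small shifts near bin edges can change the estimate noticeably on small samples, which is the weakness the abstract describes.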

  3. paraGSEA: a scalable approach for large-scale gene expression profiling

    PubMed Central

    Peng, Shaoliang; Yang, Shunyun

    2017-01-01

    More studies have been conducted using gene expression similarity to identify functional connections among genes, diseases and drugs. Gene Set Enrichment Analysis (GSEA) is a powerful analytical method for interpreting gene expression data. However, due to its enormous computational overhead in the estimation of significance level step and multiple hypothesis testing step, the computation scalability and efficiency are poor on large-scale datasets. We proposed paraGSEA for efficient large-scale transcriptome data analysis. By optimization, the overall time complexity of paraGSEA is reduced from O(mn) to O(m+n), where m is the length of the gene sets and n is the length of the gene expression profiles, which contributes more than 100-fold increase in performance compared with other popular GSEA implementations such as GSEA-P, SAM-GS and GSEA2. By further parallelization, a near-linear speed-up is gained on both workstations and clusters in an efficient manner with high scalability and performance on large-scale datasets. The analysis time of the whole LINCS phase I dataset (GSE92742) was reduced to nearly half an hour on a 1000-node cluster on Tianhe-2, or within 120 hours on a 96-core workstation. The source code of paraGSEA is licensed under the GPLv3 and available at http://github.com/ysycloud/paraGSEA. PMID:28973463
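
    As background to the method named in this abstract, the core enrichment score of GSEA can be sketched as a running sum over a ranked gene list. The sketch below is the unweighted variant (every hit counts equally) and is not paraGSEA's optimized code; the function name and the toy gene names are assumptions for illustration.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted GSEA-style enrichment score for one gene set.

    Walk down the ranked list, stepping up at members of the gene set and
    down otherwise; the score is the running sum's largest deviation from zero.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranked list only partially")
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Toy ranked list (made up), ordered from most up- to most down-regulated.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
es_top = enrichment_score(ranked, {"g1", "g2"})
es_bottom = enrichment_score(ranked, {"g5", "g6"})
print(es_top, es_bottom)
```

    A positive score means the set is concentrated at the top of the ranking and a negative score at the bottom. Significance is usually assessed by recomputing the score under many permutations, which is the costly step the abstract refers to.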

  4. Impact of missing data imputation methods on gene expression clustering and classification.

    PubMed

    de Souto, Marcilio C P; Jaskowiak, Pablo A; Costa, Ivan G

    2015-02-26

    Several missing value imputation methods for gene expression data have been proposed in the literature. In the past few years, researchers have been putting a great deal of effort into presenting systematic evaluations of the different imputation algorithms. Initially, most algorithms were assessed with an emphasis on the accuracy of the imputation, using metrics such as the root mean squared error. However, it has become clear that the success of the estimation of the expression value should be evaluated in more practical terms as well. One can consider, for example, the ability of the method to preserve the significant genes in the dataset, or its discriminative/predictive power for classification/clustering purposes. We performed a broad analysis of the impact of five well-known missing value imputation methods on three clustering and four classification methods, in the context of 12 cancer gene expression datasets. We employed a statistical framework, for the first time in this field, to assess whether different imputation methods improve the performance of the clustering/classification methods. Our results suggest that the imputation methods evaluated have a minor impact on the classification and downstream clustering analyses. Simple methods such as replacing the missing values with the mean or median values performed as well as more complex strategies. The datasets analyzed in this study are available at http://costalab.org/Imputation/ .
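
    The simple strategies mentioned at the end of this abstract can be sketched in a few lines. The sketch assumes a matrix with one row per gene, missing values stored as None, and imputation by the row-wise mean or median; these layout choices and the function name are assumptions, since the abstract does not say how the study applied them.

```python
import statistics


def impute_rows(matrix, method="mean"):
    """Replace None entries in each row with that row's mean or median.

    Assumption: rows are genes and each row has at least one observed value
    (statistics.mean/median raise StatisticsError on an empty list).
    """
    fill = statistics.mean if method == "mean" else statistics.median
    result = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        value = fill(observed)
        result.append([value if v is None else v for v in row])
    return result


# Toy expression matrix (made up) with one missing value per gene.
expression = [
    [1.0, None, 3.0, 8.0],
    [2.0, 4.0, None, 12.0],
]
mean_filled = impute_rows(expression, method="mean")
median_filled = impute_rows(expression, method="median")
print(mean_filled)
print(median_filled)
```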

  5. Elucidation of the genetic and epigenetic landscape alterations in RNA binding proteins in glioblastoma

    PubMed Central

    Mahalingam, Kulandaivelu; Somasundaram, Kumaravel

    2017-01-01

    RNA binding proteins (RBPs) have been implicated in cancer development. An integrated bioinformatics analysis of RBPs (n = 1756) in various datasets (n = 11) revealed several genetic and epigenetically altered events among RBPs in glioblastoma (GBM). We identified 13 mutated and 472 differentially regulated RBPs in GBM samples. Mutations in AHNAK predicted poor prognosis. Copy number variation (CNV), DNA methylation and miRNA targeting contributed to RBP differential regulation. Two sets of differentially regulated RBPs that may be implicated in initial astrocytic transformation and glioma progression were identified. We have also identified a four RBP (NOL3, SUCLG1, HERC5 and AFF3) signature, having a unique expression pattern in glioma stem-like cells (GSCs), to be an independent poor prognostic indicator in GBM. RBP risk score derived from the signature also stratified GBM into low-risk and high-risk groups with significant survival difference. Silencing NOL3, SUCLG1 and HERC5 inhibited GSC maintenance. Gene set enrichment analysis of differentially regulated genes between high-risk and low-risk underscored the importance of inflammation, EMT and hypoxia in high-risk GBM. Thus, we provide a comprehensive overview of genetic and epigenetic regulation of RBPs in glioma development and progression. PMID:28035070

  6. Prognostic value of MACC1 and proficient mismatch repair status for recurrence risk prediction in stage II colon cancer patients: the BIOGRID studies.

    PubMed

    Rohr, U-P; Herrmann, P; Ilm, K; Zhang, H; Lohmann, S; Reiser, A; Muranyi, A; Smith, J; Burock, S; Osterland, M; Leith, K; Singh, S; Brunhoeber, P; Bowermaster, R; Tie, J; Christie, M; Wong, H-L; Waring, P; Shanmugam, K; Gibbs, P; Stein, U

    2017-08-01

    We assessed the novel MACC1 gene to further stratify stage II colon cancer patients with proficient mismatch repair (pMMR). Four cohorts with 596 patients were analyzed: Charité 1 discovery cohort was assayed for MACC1 mRNA expression and MMR in cryo-preserved tumors. Charité 2 comparison cohort was used to translate MACC1 qRT-PCR analyses to FFPE samples. In the BIOGRID 1 training cohort MACC1 mRNA levels were related to MACC1 protein levels from immunohistochemistry in FFPE sections; also analyzed for MMR. Chemotherapy-naïve pMMR patients were stratified by MACC1 mRNA and protein expression to establish risk groups based on recurrence-free survival (RFS). Risk stratification from BIOGRID 1 was confirmed in the BIOGRID 2 validation cohort. Pooled BIOGRID datasets produced a best effect-size estimate. In BIOGRID 1, using qRT-PCR and immunohistochemistry for MACC1 detection, pMMR/MACC1-low patients had a lower recurrence probability versus pMMR/MACC1-high patients (5-year RFS of 92% and 67% versus 100% and 68%, respectively). In BIOGRID 2, longer RFS was confirmed for pMMR/MACC1-low versus pMMR/MACC1-high patients (5-year RFS of 100% versus 90%, respectively). In the pooled dataset, 6.5% of patients were pMMR/MACC1-low with no disease recurrence, resulting in a 17% higher 5-year RFS [95% confidence interval (CI) (12.6%-21.3%)] versus pMMR/MACC1-high patients (P = 0.037). Outcomes were similar for pMMR/MACC1-low and deficient MMR (dMMR) patients (5-year RFS of 100% and 96%, respectively). MACC1 expression stratifies colon cancer patients with unfavorable pMMR status. Stage II colon cancer patients with pMMR/MACC1-low tumors have a similar favorable prognosis to those with dMMR with potential implications for the role of adjuvant therapy. © The Author 2017. Published by Oxford University Press on behalf of the European Society for Medical Oncology. All rights reserved. For permissions, please email: journals.permissions@oup.com.

  7. Evaluating the Consistency of the 1982–1999 NDVI Trends in the Iberian Peninsula across Four Time-series Derived from the AVHRR Sensor: LTDR, GIMMS, FASIR, and PAL-II

    PubMed Central

    Alcaraz-Segura, Domingo; Liras, Elisa; Tabik, Siham; Paruelo, José; Cabello, Javier

    2010-01-01

    Successive efforts have processed the Advanced Very High Resolution Radiometer (AVHRR) sensor archive to produce Normalized Difference Vegetation Index (NDVI) datasets (i.e., PAL, FASIR, GIMMS, and LTDR) under different corrections and processing schemes. Since NDVI datasets are used to evaluate carbon gains, differences among them may affect nations’ carbon budgets in meeting international targets (such as the Kyoto Protocol). This study addresses the consistency across AVHRR NDVI datasets in the Iberian Peninsula (Spain and Portugal) by evaluating whether their 1982–1999 NDVI trends show similar spatial patterns. Significant trends were calculated with the seasonal Mann-Kendall trend test and their spatial consistency with partial Mantel tests. Over 23% of the Peninsula (N, E, and central mountain ranges) showed positive and significant NDVI trends across the four datasets and an additional 18% across three datasets. In 20% of Iberia (SW quadrant), the four datasets exhibited an absence of significant trends and an additional 22% across three datasets. Significant NDVI decreases were scarce (croplands in the Guadalquivir and Segura basins, La Mancha plains, and Valencia). Spatial consistency of significant trends across at least three datasets was observed in 83% of the Peninsula, but it decreased to 47% when comparing across the four datasets. FASIR, PAL, and LTDR were the most spatially similar datasets, while GIMMS was the most different. The differing ability of each AVHRR dataset to detect significant NDVI trends (e.g., LTDR detected greater significant trends, both positive and negative, and in 32% more pixels than GIMMS) has major implications for evaluating carbon budgets. The lack of spatial consistency across NDVI datasets derived from the same AVHRR sensor archive makes it advisable to evaluate carbon gain trends using several satellite datasets and, where possible, independent or additional data sources for comparison. PMID:22205868
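
    The seasonal Mann-Kendall test named in this abstract can be sketched as below. The sketch sums the Mann-Kendall S statistic and its variance over seasons and applies the usual continuity correction; it assumes no tied values and independence between seasons, and the function names and NDVI numbers are made up for illustration rather than taken from the study.

```python
import math


def mann_kendall_s(series):
    """Mann-Kendall S: number of increasing pairs minus decreasing pairs."""
    s = 0
    for i in range(len(series) - 1):
        for j in range(i + 1, len(series)):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)
    return s


def seasonal_mann_kendall_z(seasons):
    """Z statistic for a list of per-season series (e.g. one series per month).

    Assumption: no ties, so each season's variance is n(n-1)(2n+5)/18.
    """
    s_total = sum(mann_kendall_s(series) for series in seasons)
    var_total = sum(n * (n - 1) * (2 * n + 5) / 18 for n in map(len, seasons))
    if s_total > 0:
        return (s_total - 1) / math.sqrt(var_total)
    if s_total < 0:
        return (s_total + 1) / math.sqrt(var_total)
    return 0.0


# Toy NDVI values (made up): two seasons observed over five years.
spring = [0.30, 0.32, 0.35, 0.37, 0.40]
summer = [0.50, 0.49, 0.53, 0.55, 0.58]
z = seasonal_mann_kendall_z([spring, summer])
print(z)
```

    A value of |Z| above about 1.96 would indicate a significant monotonic trend at the 5% level for a two-sided test.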

  8. Evaluating the consistency of the 1982-1999 NDVI trends in the Iberian Peninsula across four time-series derived from the AVHRR sensor: LTDR, GIMMS, FASIR, and PAL-II.

    PubMed

    Alcaraz-Segura, Domingo; Liras, Elisa; Tabik, Siham; Paruelo, José; Cabello, Javier

    2010-01-01

    Successive efforts have processed the Advanced Very High Resolution Radiometer (AVHRR) sensor archive to produce Normalized Difference Vegetation Index (NDVI) datasets (i.e., PAL, FASIR, GIMMS, and LTDR) under different corrections and processing schemes. Since NDVI datasets are used to evaluate carbon gains, differences among them may affect nations' carbon budgets in meeting international targets (such as the Kyoto Protocol). This study addresses the consistency across AVHRR NDVI datasets in the Iberian Peninsula (Spain and Portugal) by evaluating whether their 1982-1999 NDVI trends show similar spatial patterns. Significant trends were calculated with the seasonal Mann-Kendall trend test and their spatial consistency with partial Mantel tests. Over 23% of the Peninsula (N, E, and central mountain ranges) showed positive and significant NDVI trends across the four datasets and an additional 18% across three datasets. In 20% of Iberia (SW quadrant), the four datasets exhibited an absence of significant trends and an additional 22% across three datasets. Significant NDVI decreases were scarce (croplands in the Guadalquivir and Segura basins, La Mancha plains, and Valencia). Spatial consistency of significant trends across at least three datasets was observed in 83% of the Peninsula, but it decreased to 47% when comparing across the four datasets. FASIR, PAL, and LTDR were the most spatially similar datasets, while GIMMS was the most different. The differing ability of each AVHRR dataset to detect significant NDVI trends (e.g., LTDR detected greater significant trends, both positive and negative, and in 32% more pixels than GIMMS) has major implications for evaluating carbon budgets. The lack of spatial consistency across NDVI datasets derived from the same AVHRR sensor archive makes it advisable to evaluate carbon gain trends using several satellite datasets and, where possible, independent or additional data sources for comparison.

  9. Differential privacy based on importance weighting

    PubMed Central

    Ji, Zhanglong

    2014-01-01

    This paper analyzes a novel method for publishing data while still protecting privacy. The method is based on computing weights that make an existing dataset, for which there are no confidentiality issues, analogous to the dataset that must be kept private. The existing dataset may be genuine but public already, or it may be synthetic. The weights are importance sampling weights, but to protect privacy, they are regularized and have noise added. The weights allow statistical queries to be answered approximately while provably guaranteeing differential privacy. We derive an expression for the asymptotic variance of the approximate answers. Experiments show that the new mechanism performs well even when the privacy budget is small, and when the public and private datasets are drawn from different populations. PMID:24482559

  10. Time Series Expression Analyses Using RNA-seq: A Statistical Approach

    PubMed Central

    Oh, Sunghee; Song, Seongho; Grabowski, Gregory; Zhao, Hongyu; Noonan, James P.

    2013-01-01

    RNA-seq is becoming the de facto standard approach for transcriptome analysis with ever-reducing cost. It has considerable advantages over conventional technologies (microarrays) because it allows for direct identification and quantification of transcripts. Many time series RNA-seq datasets have been collected to study the dynamic regulations of transcripts. However, statistically rigorous and computationally efficient methods are needed to explore the time-dependent changes of gene expression in biological systems. These methods should explicitly account for the dependencies of expression patterns across time points. Here, we discuss several methods that can be applied to model timecourse RNA-seq data, including statistical evolutionary trajectory index (SETI), autoregressive time-lagged regression (AR(1)), and hidden Markov model (HMM) approaches. We use three real datasets and simulation studies to demonstrate the utility of these dynamic methods in temporal analysis. PMID:23586021
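
    Of the approaches listed in this abstract, the first-order autoregressive idea is the simplest to illustrate. The sketch below fits x_t = c + phi * x_(t-1) by ordinary least squares on a single expression time course; it is a generic AR(1) fit, not the authors' specific time-lagged regression, and the function name and the data are made up for illustration.

```python
def fit_ar1(series):
    """Least-squares estimates (intercept, phi) for x_t = c + phi * x_(t-1)."""
    prev, curr = series[:-1], series[1:]
    n = len(prev)
    mean_prev = sum(prev) / n
    mean_curr = sum(curr) / n
    cov = sum((p - mean_prev) * (c - mean_curr) for p, c in zip(prev, curr))
    var = sum((p - mean_prev) ** 2 for p in prev)
    phi = cov / var  # assumes the lagged values are not all identical
    intercept = mean_curr - phi * mean_prev
    return intercept, phi


# Toy time course (made up) generated exactly by x_t = 1 + 0.5 * x_(t-1).
expression = [10.0, 6.0, 4.0, 3.0, 2.5, 2.25]
intercept, phi = fit_ar1(expression)
print(intercept, phi)
```

    A phi close to zero would suggest little dependence between consecutive time points, while values near one indicate strong persistence, which is the kind of dependency the abstract says these methods should account for.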

  11. Cross-platform method for identifying candidate network biomarkers for prostate cancer.

    PubMed

    Jin, G; Zhou, X; Cui, K; Zhang, X-S; Chen, L; Wong, S T C

    2009-11-01

    Discovering biomarkers using mass spectrometry (MS) and microarray expression profiles is a promising strategy in molecular diagnosis. Here, the authors proposed a new pipeline for biomarker discovery that integrates disease information for proteins and genes, expression profiles at both the genomic and proteomic levels, and protein-protein interactions (PPIs) to discover high confidence network biomarkers. Using this pipeline, a total of 474 molecules (genes and proteins) related to prostate cancer were identified and a prostate-cancer-related network (PCRN) was derived from the integrative information. Thus, a set of candidate network biomarkers was identified from multiple expression profiles comprising eight microarray datasets and one proteomics dataset. The network biomarkers with PPIs can accurately distinguish prostate cancer patients from normal controls, potentially providing more reliable biomarker candidates than conventional biomarker discovery methods.

  12. Time series expression analyses using RNA-seq: a statistical approach.

    PubMed

    Oh, Sunghee; Song, Seongho; Grabowski, Gregory; Zhao, Hongyu; Noonan, James P

    2013-01-01

    RNA-seq is becoming the de facto standard approach for transcriptome analysis with ever-reducing cost. It has considerable advantages over conventional technologies (microarrays) because it allows for direct identification and quantification of transcripts. Many time series RNA-seq datasets have been collected to study the dynamic regulations of transcripts. However, statistically rigorous and computationally efficient methods are needed to explore the time-dependent changes of gene expression in biological systems. These methods should explicitly account for the dependencies of expression patterns across time points. Here, we discuss several methods that can be applied to model timecourse RNA-seq data, including statistical evolutionary trajectory index (SETI), autoregressive time-lagged regression (AR(1)), and hidden Markov model (HMM) approaches. We use three real datasets and simulation studies to demonstrate the utility of these dynamic methods in temporal analysis.

  13. Integrated Quantitative Transcriptome Maps of Human Trisomy 21 Tissues and Cells

    PubMed Central

    Pelleri, Maria Chiara; Cattani, Chiara; Vitale, Lorenza; Antonaros, Francesca; Strippoli, Pierluigi; Locatelli, Chiara; Cocchi, Guido; Piovesan, Allison; Caracausi, Maria

    2018-01-01

    Down syndrome (DS) is due to the presence of an extra full or partial chromosome 21 (Hsa21). The identification of genes contributing to DS pathogenesis could be the key to any rational therapy of the associated intellectual disability. We aim at generating quantitative transcriptome maps in DS integrating all gene expression profile datasets available for any cell type or tissue, to obtain a complete model of the transcriptome in terms of both expression values for each gene and segmental trend of gene expression along each chromosome. We used the TRAM (Transcriptome Mapper) software for this meta-analysis, comparing transcript expression levels and profiles between DS and normal brain, lymphoblastoid cell lines, blood cells, fibroblasts, thymus and induced pluripotent stem cells, respectively. TRAM combined, normalized, and integrated datasets from different sources and across diverse experimental platforms. The main output was a linear expression value that may be used as a reference for each of up to 37,181 mapped transcripts analyzed, related to both known genes and expressed sequence tag (EST) clusters. An independent example of in vitro validation of fibroblast transcriptome map data was performed through “Real-Time” reverse transcription polymerase chain reaction, showing an excellent correlation coefficient (r = 0.93, p < 0.0001) with data obtained in silico. The availability of linear expression values for each gene allowed the testing of the gene dosage hypothesis of the expected 3:2 DS/normal ratio for Hsa21 as well as other human genes in DS, in addition to listing genes differentially expressed with statistical significance. Although a fraction of Hsa21 genes escapes dosage effects, Hsa21 genes are selectively over-expressed in DS samples compared to genes from other chromosomes, reflecting a decisive role in the pathogenesis of the syndrome. Finally, the analysis of chromosomal segments reveals a high prevalence of Hsa21 over-expressed segments over the other genomic regions, suggesting, in particular, a specific region on Hsa21 that appears to be frequently over-expressed (21q22). Our complete datasets are released as a new framework to investigate transcription in DS for individual genes as well as chromosomal segments in different cell types and tissues. PMID:29740474

  14. Convergent Genetic and Expression Datasets Highlight TREM2 in Parkinson's Disease Susceptibility.

    PubMed

    Liu, Guiyou; Liu, Yongquan; Jiang, Qinghua; Jiang, Yongshuai; Feng, Rennan; Zhang, Liangcai; Chen, Zugen; Li, Keshen; Liu, Jiafeng

    2016-09-01

    A rare TREM2 missense mutation (rs75932628-T) was reported to confer a significant Alzheimer's disease (AD) risk. A recent study indicated no evidence of the involvement of this variant in Parkinson's disease (PD). Here, we used the genetic and expression data to reinvestigate the potential association between TREM2 and PD susceptibility. In stage 1, using 10 independent studies (N = 89,157; 8787 cases and 80,370 controls), we conducted a subgroup meta-analysis. We identified a significant association between rs75932628 and PD (P = 3.10E-03, odds ratio (OR) = 3.88, 95% confidence interval (CI) 1.58-9.54) in the No-Northern Europe subgroup, and a significantly higher PD risk (P = 0.01, Mann-Whitney test) in the No-Northern Europe subgroup than in the Northern Europe subgroup. In stage 2, we used the summary results from a large-scale PD genome-wide association study (GWAS; N = 108,990; 13,708 cases and 95,282 controls) to search for other TREM2 variants contributing to PD susceptibility. We identified 14 single-nucleotide polymorphisms (SNPs) associated with PD within 50-kb upstream and downstream range of TREM2. In stage 3, using two brain expression GWAS datasets (N = 773), we identified 6 of the 14 SNPs regulating increased expression of TREM2. In stage 4, using the whole human genome microarray data (N = 50), we further identified significantly increased expression of TREM2 in PD cases compared with controls in human prefrontal cortex. In summary, convergent genetic and expression datasets demonstrate that TREM2 is a potent risk factor for PD and may be a therapeutic target in PD and other neurodegenerative diseases.

  15. Clinical Value of Prognosis Gene Expression Signatures in Colorectal Cancer: A Systematic Review

    PubMed Central

    Cordero, David; Riccadonna, Samantha; Solé, Xavier; Crous-Bou, Marta; Guinó, Elisabet; Sanjuan, Xavier; Biondo, Sebastiano; Soriano, Antonio; Jurman, Giuseppe; Capella, Gabriel; Furlanello, Cesare; Moreno, Victor

    2012-01-01

    Introduction The traditional staging system is inadequate to identify those patients with stage II colorectal cancer (CRC) at high risk of recurrence or with stage III CRC at low risk. A number of gene expression signatures to predict CRC prognosis have been proposed, but none is routinely used in the clinic. The aim of this work was to assess the prediction ability and potential clinical usefulness of these signatures in a series of independent datasets. Methods A literature review identified 31 gene expression signatures that used gene expression data to predict prognosis in CRC tissue. The search was based on the PubMed database and was restricted to papers published from January 2004 to December 2011. Eleven CRC gene expression datasets with outcome information were identified and downloaded from public repositories. A Random Forest classifier was used to build predictors from the gene lists. The Matthews correlation coefficient was chosen as a measure of classification accuracy and its associated p-value was used to assess association with prognosis. For clinical usefulness evaluation, positive and negative post-test probabilities were computed in stage II and III samples. Results Five gene signatures showed significant association with prognosis and provided reasonable prediction accuracy in their own training datasets. Nevertheless, all signatures showed low reproducibility in independent data. Stratified analyses by stage or microsatellite instability status showed significant association but limited discrimination ability, especially in stage II tumors. From a clinical perspective, the most predictive signatures showed a minor but significant improvement over the classical staging system. Conclusions The published signatures show low prediction accuracy but moderate clinical usefulness. Although gene expression data may inform prognosis, better strategies for signature validation are needed to encourage their widespread use in the clinic. PMID:23145004
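
    The accuracy measure named in this abstract has a standard closed form, sketched below for a binary classifier. The confusion-matrix counts are made up for illustration and do not come from the review; returning 0 when a margin is empty is a common convention assumed here.

```python
import math


def matthews_corrcoef(tp, fp, tn, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention when any row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


# Toy counts (made up): 30 patients predicted high risk, 70 predicted low risk.
mcc = matthews_corrcoef(tp=20, fp=10, tn=60, fn=10)
print(round(mcc, 3))
```

    The coefficient ranges from -1 to 1, with 0 corresponding to chance-level prediction, which makes it less sensitive to class imbalance than raw accuracy.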

  16. Identification and validation of differentially expressed transcripts by RNA-sequencing of formalin-fixed, paraffin-embedded (FFPE) lung tissue from patients with Idiopathic Pulmonary Fibrosis.

    PubMed

    Vukmirovic, Milica; Herazo-Maya, Jose D; Blackmon, John; Skodric-Trifunovic, Vesna; Jovanovic, Dragana; Pavlovic, Sonja; Stojsic, Jelena; Zeljkovic, Vesna; Yan, Xiting; Homer, Robert; Stefanovic, Branko; Kaminski, Naftali

    2017-01-12

    Idiopathic Pulmonary Fibrosis (IPF) is a lethal lung disease of unknown etiology. A major limitation in transcriptomic profiling of lung tissue in IPF has been a dependence on snap-frozen fresh tissues (FF). In this project we sought to determine whether genome scale transcript profiling using RNA Sequencing (RNA-Seq) could be applied to archived Formalin-Fixed Paraffin-Embedded (FFPE) IPF tissues. We isolated total RNA from 7 IPF and 5 control FFPE lung tissues and performed 50 base pair paired-end sequencing on Illumina HiSeq 2000. TopHat2 was used to map sequencing reads to the human genome. On average ~62 million reads (53.4% of ~116 million reads) were mapped per sample. 4,131 genes were differentially expressed between IPF and controls (1,920 increased and 2,211 decreased; FDR < 0.05). We compared our results to differentially expressed genes calculated from a previously published dataset generated from FF tissues analyzed on Agilent microarrays (GSE47460). The overlap of differentially expressed genes was very high (760 increased and 1,413 decreased, FDR < 0.05). Only 92 differentially expressed genes changed in opposite directions. Pathway enrichment analysis performed using MetaCore confirmed numerous IPF relevant genes and pathways including extracellular remodeling, TGF-beta, and WNT. Gene network analysis of MMP7, a highly differentially expressed gene in both datasets, revealed the same canonical pathways and gene network candidates in RNA-Seq and microarray data. For validation by NanoString nCounter® we selected 35 genes that had a fold change of 2 in at least one dataset (10 discordant, 10 significantly differentially expressed in one dataset only and 15 concordant genes). High concordance of fold change and FDR was observed for each type of the samples (FF vs FFPE) with both microarrays (r = 0.92) and RNA-Seq (r = 0.90) and the number of discordant genes was reduced to four. Our results demonstrate that RNA sequencing of RNA obtained from archived FFPE lung tissues is feasible. The results obtained from FFPE tissue are highly comparable to FF tissues. The ability to perform RNA-Seq on archived FFPE IPF tissues should greatly enhance the availability of tissue biopsies for research in IPF.

  17. The androgen receptor controls expression of the cancer-associated sTn antigen and cell adhesion through induction of ST6GalNAc1 in prostate cancer

    PubMed Central

    Munkley, Jennifer; Oltean, Sebastian; Vodák, Daniel; Wilson, Brian T.; Livermore, Karen E.; Zhou, Yan; Star, Eleanor; Floros, Vasileios I.; Johannessen, Bjarne; Knight, Bridget; McCullagh, Paul; McGrath, John; Crundwell, Malcolm; Skotheim, Rolf I.; Robson, Craig N.; Leung, Hing Y.; Harries, Lorna W.; Rajan, Prabhakar; Mills, Ian G.; Elliott, David J.

    2015-01-01

    Patterns of glycosylation are important in cancer, but the molecular mechanisms that drive changes are often poorly understood. The androgen receptor drives prostate cancer (PCa) development and progression to lethal metastatic castration-resistant disease. Here we used RNA-Seq coupled with bioinformatic analyses of androgen-receptor (AR) binding sites and clinical PCa expression array data to identify ST6GalNAc1 as a direct and rapidly activated target gene of the AR in PCa cells. ST6GalNAc1 encodes a sialyltransferase that catalyses formation of the cancer-associated sialyl-Tn antigen (sTn), which we find is also induced by androgen exposure. Androgens induce expression of a novel splice variant of the ST6GalNAc1 protein in PCa cells. This splice variant encodes a shorter protein isoform that is still fully functional as a sialyltransferase and able to induce expression of the sTn-antigen. Surprisingly, given its high expression in tumours, stable expression of ST6GalNAc1 in PCa cells reduced formation of stable tumours in mice, reduced cell adhesion and induced a switch towards a more mesenchymal-like cell phenotype in vitro. ST6GalNAc1 has a dynamic expression pattern in clinical datasets, being significantly up-regulated in primary prostate carcinoma but relatively down-regulated in established metastatic tissue. ST6GalNAc1 is frequently upregulated concurrently with another important glycosylation enzyme GCNT1 previously associated with prostate cancer progression and implicated in Sialyl Lewis X antigen synthesis. Together our data establish an androgen-dependent mechanism for sTn antigen expression in PCa, and are consistent with a general role for the androgen receptor in driving important coordinate changes to the glycoproteome during PCa progression. PMID:26452038

  18. Inferring Boolean network states from partial information

    PubMed Central

    2013-01-01

    Networks of molecular interactions regulate key processes in living cells. Therefore, understanding their functionality is a high priority in advancing biological knowledge. Boolean networks are often used to describe cellular networks mathematically and are fitted to experimental datasets. The fitting often results in ambiguities since the interpretation of the measurements is not straightforward and since the data contain noise. In order to facilitate a more reliable mapping between datasets and Boolean networks, we develop an algorithm that infers network trajectories from a dataset distorted by noise. We analyze our algorithm theoretically and demonstrate its accuracy using simulation and microarray expression data. PMID:24006954
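
    As background for the abstract above, the following Python sketch shows what a Boolean network and one of its trajectories look like. It illustrates only the forward simulation, not the inference algorithm the paper develops; the three-gene network, its rules, and the function names are hypothetical, and synchronous updating is an assumption.

```python
def step(state, rules):
    """Synchronous update: every node is recomputed from the previous state."""
    return {node: rule(state) for node, rule in rules.items()}


def trajectory(state, rules, n_steps):
    """Return the list of states visited, starting with the initial state."""
    states = [state]
    for _ in range(n_steps):
        state = step(state, rules)
        states.append(state)
    return states


# Hypothetical three-gene network, for illustration only.
rules = {
    "A": lambda s: not s["C"],         # C represses A
    "B": lambda s: s["A"],             # A activates B
    "C": lambda s: s["A"] and s["B"],  # A and B jointly activate C
}
start = {"A": True, "B": False, "C": False}
path = trajectory(start, rules, 5)
```

    With these rules the network returns to its starting state after five updates, so `path` traces one full cycle. Fitting such a model to data means choosing rules and states whose trajectories agree with noisy, partially observed measurements.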

  19. PmiRExAt: plant miRNA expression atlas database and web applications

    PubMed Central

    Gurjar, Anoop Kishor Singh; Panwar, Abhijeet Singh; Gupta, Rajinder; Mantri, Shrikant S.

    2016-01-01

    High-throughput small RNA (sRNA) sequencing technology enables an entirely new perspective for plant microRNA (miRNA) research and has immense potential to unravel regulatory networks. Novel insights gained through data mining in the publicly available rich resource of sRNA data will help in designing biotechnology-based approaches for crop improvement to enhance plant yield and nutritional value. Bioinformatics resources enabling meta-analysis of miRNA expression across multiple plant species are still evolving. Here, we report PmiRExAt, a new online database resource that provides a plant miRNA expression atlas. The web-based repository comprises miRNA expression profiles and a query tool for 1859 wheat, 2330 rice and 283 maize miRNAs. The database interface offers open and easy access to miRNA expression profiles and helps in identifying tissue-preferential, differential and constitutively expressed miRNAs. A feature enabling expression study of conserved miRNAs across multiple species is also implemented. A custom expression analysis feature enables expression analysis of novel miRNAs in a total of 117 datasets. New sRNA datasets can also be uploaded for analysing miRNA expression profiles for 73 plant species. The PmiRExAt application program interface, a simple object access protocol web service, allows other programmers to remotely invoke the methods written for doing programmatic search operations on the PmiRExAt database. Database URL: http://pmirexat.nabi.res.in. PMID:27081157

  20. An efficient method to identify differentially expressed genes in microarray experiments

    PubMed Central

    Qin, Huaizhen; Feng, Tao; Harding, Scott A.; Tsai, Chung-Jui; Zhang, Shuanglin

    2013-01-01

    Motivation Microarray experiments typically analyze thousands to tens of thousands of genes from small numbers of biological replicates. The fact that genes are normally expressed in functionally relevant patterns suggests that gene-expression data can be stratified and clustered into relatively homogenous groups. Cluster-wise dimensionality reduction should make it feasible to improve screening power while minimizing information loss. Results We propose a powerful and computationally simple method for finding differentially expressed genes in small microarray experiments. The method incorporates a novel stratification-based tight clustering algorithm, principal component analysis and information pooling. Comprehensive simulations show that our method is substantially more powerful than the popular SAM and eBayes approaches. We applied the method to three real microarray datasets: one from a Populus nitrogen stress experiment with 3 biological replicates; and two from public microarray datasets of human cancers with 10 to 40 biological replicates. In all three analyses, our method proved more robust than the popular alternatives for identification of differentially expressed genes. Availability The C++ code to implement the proposed method is available upon request for academic use. PMID:18453554

  1. Array data extractor (ADE): a LabVIEW program to extract and merge gene array data.

    PubMed

    Kurtenbach, Stefan; Kurtenbach, Sarah; Zoidl, Georg

    2013-12-01

    Large data sets from gene expression array studies are publicly available offering information highly valuable for research across many disciplines ranging from fundamental to clinical research. Highly advanced bioinformatics tools have been made available to researchers, but a demand for user-friendly software allowing researchers to quickly extract expression information for multiple genes from multiple studies persists. Here, we present a user-friendly LabVIEW program to automatically extract gene expression data for a list of genes from multiple normalized microarray datasets. Functionality was tested for 288 class A G protein-coupled receptors (GPCRs) and expression data from 12 studies comparing normal and diseased human hearts. Results confirmed known regulation of a beta 1 adrenergic receptor and further indicate novel research targets. Although existing software allows for complex data analyses, the LabVIEW based program presented here, "Array Data Extractor (ADE)", provides users with a tool to retrieve meaningful information from multiple normalized gene expression datasets in a fast and easy way. Further, the graphical programming language used in LabVIEW allows applying changes to the program without the need of advanced programming knowledge.
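
    The core operation described above, pulling a list of genes out of several normalized datasets and merging the values, can be sketched in a few lines of Python. This is not the ADE program, which is written in LabVIEW; the function name, the dataset layout (one mapping of gene symbol to value per study), and all values are hypothetical.

```python
def extract_genes(datasets, genes):
    """Build one row per gene with its value in each dataset (None if absent)."""
    return {
        gene: {name: table.get(gene) for name, table in datasets.items()}
        for gene in genes
    }


# Hypothetical normalized expression values, for illustration only.
datasets = {
    "study_1": {"ADRB1": 7.2, "GENE_X": 5.0},
    "study_2": {"ADRB1": 6.8},
}
merged = extract_genes(datasets, ["ADRB1", "GENE_X"])
```

    Recording a missing gene as `None` instead of dropping it keeps the merged table rectangular, which makes gaps in array coverage visible.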

  2. Biclustering sparse binary genomic data.

    PubMed

    van Uitert, Miranda; Meuleman, Wouter; Wessels, Lodewyk

    2008-12-01

    Genomic datasets often consist of large, binary, sparse data matrices. In such a dataset, one is often interested in finding contiguous blocks that (mostly) contain ones. This is a biclustering problem, and while many algorithms have been proposed to deal with gene expression data, only two algorithms have been proposed that specifically deal with binary matrices. None of the gene expression biclustering algorithms can handle the large number of zeros in sparse binary matrices. The two proposed binary algorithms failed to produce meaningful results. In this article, we present a new algorithm that is able to extract biclusters from sparse, binary datasets. A powerful feature is that biclusters with different numbers of rows and columns can be detected, varying from many rows to few columns and few rows to many columns. It allows the user to guide the search towards biclusters of specific dimensions. When applying our algorithm to an input matrix derived from TRANSFAC, we find transcription factors with distinctly dissimilar binding motifs, but a clear set of common targets that are significantly enriched for GO categories.

  3. Genome-scale analysis identifies GJB2 and ERO1LB as prognosis markers in patients with pancreatic cancer.

    PubMed

    Zhu, Tao; Gao, Yuan-Feng; Chen, Yi-Xin; Wang, Zhi-Bin; Yin, Ji-Ye; Mao, Xiao-Yuan; Li, Xi; Zhang, Wei; Zhou, Hong-Hao; Liu, Zhao-Qian

    2017-03-28

    Pancreatic cancer is a complex and heterogeneous disease with the etiology largely unknown. The deadly nature of pancreatic cancer, with an extremely low 5-year survival rate, renders urgent a better understanding of the molecular events underlying it. The aim of this study is to investigate the gene expression module of pancreatic adenocarcinoma and to identify differentially expressed genes (DEGs) with prognostic potential. Transcriptome microarray data of five GEO datasets (GSE15471, GSE16515, GSE18670, GSE32676, GSE71989), including 117 primary tumor samples and 73 normal pancreatic tissue samples, were utilized to identify DEGs. The five sets of DEGs had an overlapping subset consisting of 98 genes (90 up-regulated and 8 down-regulated), which were probably common to pancreatic cancer. Gene ontology (GO) analysis of the 98 DEGs showed that cell cycle and cell adhesion were the major enriched processes, and extracellular matrix (ECM)-receptor interaction and p53 signaling pathway were the most enriched pathways according to Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis. Elevated expression of gap junction protein beta 2 (GJB2) and reduced endoplasmic reticulum oxidoreductase 1-like beta (ERO1LB) expression were validated in an independent cohort. Kaplan-Meier survival analysis revealed that GJB2 and ERO1LB levels were significantly associated with the overall survival of pancreatic cancer patients. GJB2 and ERO1LB are implicated in pancreatic cancer progression and can be used to predict patient survival. Therapeutic strategies targeting GJB2 and facilitating ERO1LB expression may deserve evaluation to improve prognosis of pancreatic cancer patients.
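
    The overlap step, keeping only genes called differentially expressed in every dataset, can be sketched in Python as follows. The function name and gene symbols are placeholders, and requiring the same direction of change in all datasets is an assumption about how the overlap was defined, not something the abstract states.

```python
def common_degs(deg_tables):
    """Genes called differentially expressed, in the same direction, in every table.

    deg_tables: list of dicts mapping gene symbol -> "up" or "down".
    """
    if not deg_tables:
        return {}
    shared = set(deg_tables[0])
    for table in deg_tables[1:]:
        shared &= set(table)
    first = deg_tables[0]
    # Assumption: a gene counts only if its direction agrees across all tables.
    return {
        gene: first[gene]
        for gene in sorted(shared)
        if all(table[gene] == first[gene] for table in deg_tables)
    }


# Placeholder DEG calls from three hypothetical datasets.
tables = [
    {"GENE_A": "up", "GENE_B": "down", "GENE_C": "up"},
    {"GENE_A": "up", "GENE_B": "up"},
    {"GENE_A": "up", "GENE_B": "down", "GENE_D": "down"},
]
overlap = common_degs(tables)
```

    Here only `GENE_A` survives: `GENE_B` appears in all three tables but changes direction in one, and the other genes are missing from at least one table.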

  4. Immunological network analysis in HPV associated head and neck squamous cancer and implications for disease prognosis.

    PubMed

    Chen, Xiaohang; Yan, Bingqing; Lou, Huihuang; Shen, Zhenji; Tong, Fangjia; Zhai, Aixia; Wei, Lanlan; Zhang, Fengmin

    2018-04-01

    Human papillomavirus-positive (HPV+) head and neck squamous cell cancer (HNSCC) exhibits a better prognosis than HPV-negative (HPV-) HNSCC. This difference may in part be due to enhanced immune activation in the HPV+ HNSCC tumor microenvironment. To characterize differences in immune activation between HPV+ and HPV- HNSCC tumors, we identified and annotated differentially expressed genes based upon mRNA expression data from The Cancer Genome Atlas (TCGA). An immune network between immune cells and cytokines was constructed using single sample Gene Set Enrichment Analysis and conditional mutual information. Multivariate Cox regression analysis was used to determine the prognostic value of immune microenvironment characterization. A total of 1673 differentially expressed genes were functionally annotated. We found that genes upregulated in HPV+ HNSCC are enriched in immune-associated processes. The up-regulated gene sets were validated by Gene Set Enrichment Analysis. The microenvironment of HPV+ HNSCC exhibited greater numbers of infiltrating B and T cells and fewer neutrophils than HPV- HNSCC. These findings were validated in two independent datasets from the Gene Expression Omnibus (GEO) database. Further analyses of T cell subtypes revealed that cytotoxic T cell subtypes predominated in HPV+ HNSCC. In addition, the ratio of M1/M2 macrophages was much higher in HPV+ HNSCC. The infiltration of these immune cells was correlated with differentially expressed cytokine-associated genes. Enhanced infiltration of B cells and CD8+ T cells were identified as independent protective factors, while high neutrophil infiltration was a risk enhancing factor for HPV+ HNSCC patients. A schematic model of the immunological network was established for HPV+ HNSCC to summarize our findings. Copyright © 2018 Elsevier Ltd. All rights reserved.

  5. cMYC Expression in Infiltrating Gliomas: Associations with IDH1 Mutations, Clinicopathologic Features and Outcome

    PubMed Central

    Odia, Yazmin; Orr, Brent A.; Bell, W. Robert; Eberhart, Charles G.; Rodriguez, Fausto J.

    2013-01-01

    Gliomas are among the most frequent adult primary brain tumors. Mutations in IDH1, a metabolic enzyme, strongly correlate with secondary glioblastomas and increased survival. cMYC is an oncogene also implicated in aberrant metabolism, but its prognostic impact remains unclear. Recent genotyping studies also showed SNP variants near the cMYC gene locus associate with an increased risk for development of IDH1/2 mutant gliomas, suggesting a possible interaction between cMYC and IDH1. We evaluated nuclear cMYC protein levels and IDH1 (R132H) by immunohistochemistry in patients with oligodendroglioma/oligoastrocytomas (n=20), astrocytomas (grade II) (n=19), anaplastic astrocytomas (n=21) or glioblastomas (n=111). Of 158 tumors with sufficient tissue, 110 (70%) showed nuclear cMYC immunopositivity – most frequent (95%, χ² p=0.0248) and intense (mean 1.33, ANOVA p=0.0179) in anaplastic astrocytomas versus glioblastomas (63%) or low grade gliomas (74%). cMYC expression associated with younger age as well as p53 immunopositivity (OR=3.6, p=0.0332) and mutant IDH1 (R132H) (OR=7.4, p=0.06) among malignant gliomas in our cohort. Independent analysis of the publicly available TCGA glioblastoma dataset confirmed our strong association between cMYC and mutant IDH1 expression. Both IDH1 (R132H) and cMYC protein expression were associated with improved overall survival by univariate analysis. However, cMYC co-expression associated with shortened time to malignant transformation and overall survival among IDH1 (R132H) mutants in both univariate and multivariate analyses. In summary, our findings suggest that cMYC may be associated with a unique clinicopathologic and biologic group of infiltrating gliomas and help mediate the malignant transformation of IDH1 mutant gliomas. PMID:23934175

  6. SPICE: exploration and analysis of post-cytometric complex multivariate datasets.

    PubMed

    Roederer, Mario; Nozzi, Joshua L; Nason, Martha C

    2011-02-01

    Polychromatic flow cytometry results in complex, multivariate datasets. To date, tools for the aggregate analysis of these datasets across multiple specimens grouped by different categorical variables, such as demographic information, have not been optimized. Often, the exploration of such datasets is accomplished by visualization of patterns with pie charts or bar charts, without easy access to statistical comparisons of measurements that comprise multiple components. Here we report on algorithms and a graphical interface we developed for these purposes. In particular, we discuss thresholding necessary for accurate representation of data in pie charts, the implications for display and comparison of normalized versus unnormalized data, and the effects of averaging when samples with significant background noise are present. Finally, we define a statistic for the nonparametric comparison of complex distributions to test for difference between groups of samples based on multi-component measurements. While originally developed to support the analysis of T cell functional profiles, these techniques are amenable to a broad range of datatypes. Published 2011 Wiley-Liss, Inc.

  7. Shared regulatory sites are abundant in the human genome and shed light on genome evolution and disease pleiotropy.

    PubMed

    Tong, Pin; Monahan, Jack; Prendergast, James G D

    2017-03-01

    Large-scale gene expression datasets are providing an increasing understanding of the location of cis-eQTLs in the human genome and their role in disease. However, little is currently known regarding the extent of regulatory site-sharing between genes. This is despite it having potentially wide-ranging implications, from the determination of the way in which genetic variants may shape multiple phenotypes to the understanding of the evolution of human gene order. By first identifying the location of non-redundant cis-eQTLs, we show that regulatory site-sharing is a relatively common phenomenon in the human genome, with over 10% of non-redundant regulatory variants linked to the expression of multiple nearby genes. We show that these shared, local regulatory sites are linked to high levels of chromatin looping between the regulatory sites and their associated genes. In addition, these co-regulated gene modules are found to be strongly conserved across mammalian species, suggesting that shared regulatory sites have played an important role in shaping human gene order. The association of these shared cis-eQTLs with multiple genes means they also appear to be unusually important in understanding the genetics of human phenotypes and pleiotropy, with shared regulatory sites more often linked to multiple human phenotypes than other regulatory variants. This study shows that regulatory site-sharing is likely an underappreciated aspect of gene regulation and has important implications for the understanding of various biological phenomena, including how the two and three dimensional structures of the genome have been shaped and the potential causes of disease pleiotropy outside coding regions.

  8. Systems Biomedicine of Rabies Delineates the Affected Signaling Pathways

    PubMed Central

    Azimzadeh Jamalkandi, Sadegh; Mozhgani, Sayed-Hamidreza; Gholami Pourbadie, Hamid; Mirzaie, Mehdi; Noorbakhsh, Farshid; Vaziri, Behrouz; Gholami, Alireza; Ansari-Pour, Naser; Jafari, Mohieddin

    2016-01-01

    The prototypical neurotropic virus, rabies, is a member of the Rhabdoviridae family that causes lethal encephalomyelitis. Although there have been a plethora of studies investigating the etiological mechanism of the rabies virus and many precautionary methods have been implemented to avert the disease outbreak over the last century, the disease surprisingly has no definite remedy at its late stages. The psychological symptoms and the underlying etiology, as well as the rare survival rate from rabies encephalitis, remain a mystery. We, therefore, undertook a systems biomedicine approach to identify the network of gene products implicated in rabies. This was done by meta-analyzing whole-transcriptome microarray datasets of the CNS infected by strain CVS-11, and integrating them with interactome data using computational and statistical methods. We first determined the differentially expressed genes (DEGs) in each study and horizontally integrated the results at the mRNA and microRNA levels separately. A total of 61 seed genes involved in signal propagation system were obtained by means of unifying mRNA and microRNA detected integrated DEGs. We then reconstructed a refined protein–protein interaction network (PPIN) of infected cells to elucidate the rabies-implicated signal transduction network (RISN). To validate our findings, we confirmed differential expression of randomly selected genes in the network using Real-time PCR. In conclusion, the identification of seed genes and their network neighborhood within the refined PPIN can be useful for demonstrating signaling pathways including interferon circumvent, toward proliferation and survival, and neuropathological clue, explaining the intricate underlying molecular neuropathology of rabies infection and thus providing a molecular framework for predicting potential drug targets. PMID:27872612

  9. An informatics approach to integrating genetic and neurological data in speech and language neuroscience.

    PubMed

    Bohland, Jason W; Myers, Emma M; Kim, Esther

    2014-01-01

    A number of heritable disorders impair the normal development of speech and language processes and occur in large numbers within the general population. While candidate genes and loci have been identified, the gap between genotype and phenotype is vast, limiting current understanding of the biology of normal and disordered processes. This gap exists not only in our scientific knowledge, but also in our research communities, where genetics researchers and speech, language, and cognitive scientists tend to operate independently. Here we describe a web-based, domain-specific, curated database that represents information about genotype-phenotype relations specific to speech and language disorders, as well as neuroimaging results demonstrating focal brain differences in relevant patients versus controls. Bringing these two distinct data types into a common database ( http://neurospeech.org/sldb ) is a first step toward bringing molecular level information into cognitive and computational theories of speech and language function. One bridge between these data types is provided by densely sampled profiles of gene expression in the brain, such as those provided by the Allen Brain Atlases. Here we present results from exploratory analyses of human brain gene expression profiles for genes implicated in speech and language disorders, which are annotated in our database. We then discuss how such datasets can be useful in the development of computational models that bridge levels of analysis, necessary to provide a mechanistic understanding of heritable language disorders. We further describe our general approach to information integration, discuss important caveats and considerations, and offer a specific but speculative example based on genes implicated in stuttering and basal ganglia function in speech motor control.

  10. Transcriptomics and proteomics show that selenium affects inflammation, cytoskeleton, and cancer pathways in human rectal biopsies.

    PubMed

    Méplan, Catherine; Johnson, Ian T; Polley, Abigael C J; Cockell, Simon; Bradburn, David M; Commane, Daniel M; Arasaradnam, Ramesh P; Mulholland, Francis; Zupanic, Anze; Mathers, John C; Hesketh, John

    2016-08-01

    Epidemiologic studies highlight the potential role of dietary selenium (Se) in colorectal cancer prevention. Our goal was to elucidate whether expression of factors crucial for colorectal homoeostasis is affected by physiologic differences in Se status. Using transcriptomics and proteomics followed by pathway analysis, we identified pathways affected by Se status in rectal biopsies from 22 healthy adults, including 11 controls with optimal status (mean plasma Se = 1.43 μM) and 11 subjects with suboptimal status (mean plasma Se = 0.86 μM). We observed that 254 genes and 26 proteins implicated in cancer (80%), immune function and inflammatory response (40%), cell growth and proliferation (70%), cellular movement, and cell death (50%) were differentially expressed between the 2 groups. Expression of 69 genes, including selenoproteins W1 and K, which are genes involved in cytoskeleton remodelling and transcription factor NFκB signaling, correlated significantly with Se status. Integrating proteomics and transcriptomics datasets revealed reduced inflammatory and immune responses and cytoskeleton remodelling in the suboptimal Se status group. This is the first study combining omics technologies to describe the impact of differences in Se status on colorectal expression patterns, revealing that suboptimal Se status could alter inflammatory signaling and cytoskeleton in human rectal mucosa and so influence cancer risk. © The Author(s).

  11. Meta-Analysis of Maternal and Fetal Transcriptomic Data Elucidates the Role of Adaptive and Innate Immunity in Preterm Birth

    PubMed Central

    Vora, Bianca; Wang, Aolin; Kosti, Idit; Huang, Hongtai; Paranjpe, Ishan; Woodruff, Tracey J.; MacKenzie, Tippi; Sirota, Marina

    2018-01-01

    Preterm birth (PTB) is the leading cause of newborn deaths around the world. Spontaneous preterm birth (sPTB) accounts for two-thirds of all PTBs; however, there remains an unmet need of detecting and preventing sPTB. Although the dysregulation of the immune system has been implicated in various studies, small sizes and irreproducibility of results have limited identification of its role. Here, we present a cross-study meta-analysis to evaluate genome-wide differential gene expression signals in sPTB. A comprehensive search of the NIH genomic database for studies related to sPTB with maternal whole blood samples resulted in data from three separate studies consisting of 339 samples. After aggregating and normalizing these transcriptomic datasets and performing a meta-analysis, we identified 210 genes that were differentially expressed in sPTB relative to term birth. These genes were enriched in immune-related pathways, showing upregulation of innate immunity and downregulation of adaptive immunity in women who delivered preterm. An additional analysis found several of these differentially expressed at mid-gestation, suggesting their potential to be clinically relevant biomarkers. Furthermore, a complementary analysis identified 473 genes differentially expressed in preterm cord blood samples. However, these genes demonstrated downregulation of the innate immune system, a stark contrast to findings using maternal blood samples. These immune-related findings were further confirmed by cell deconvolution as well as upstream transcription and cytokine regulation analyses. Overall, this study identified a strong immune signature related to sPTB as well as several potential biomarkers that could be translated to clinical use.

  12. Experimentally-Derived Fibroblast Gene Signatures Identify Molecular Pathways Associated with Distinct Subsets of Systemic Sclerosis Patients in Three Independent Cohorts

    PubMed Central

    Johnson, Michael E.; Mahoney, J. Matthew; Taroni, Jaclyn; Sargent, Jennifer L.; Marmarelis, Eleni; Wu, Ming-Ru; Varga, John; Hinchcliff, Monique E.; Whitfield, Michael L.

    2015-01-01

    Genome-wide expression profiling in systemic sclerosis (SSc) has identified four ‘intrinsic’ subsets of disease (fibroproliferative, inflammatory, limited, and normal-like), each of which shows deregulation of distinct signaling pathways; however, the full set of pathways contributing to this differential gene expression has not been fully elucidated. Here we examine experimentally derived gene expression signatures in dermal fibroblasts for thirteen different signaling pathways implicated in SSc pathogenesis. These data show distinct and overlapping sets of genes induced by each pathway, allowing for a better understanding of the molecular relationship between profibrotic and immune signaling networks. Pathway-specific gene signatures were analyzed across a compendium of microarray datasets consisting of skin biopsies from three independent cohorts representing 80 SSc patients, 4 morphea, and 26 controls. IFNα signaling showed a strong association with early disease, while TGFβ signaling spanned the fibroproliferative and inflammatory subsets, was associated with worse MRSS, and was higher in lesional than non-lesional skin. The fibroproliferative subset was most strongly associated with PDGF signaling, while the inflammatory subset demonstrated strong activation of innate immune pathways including TLR signaling upstream of NF-κB. The limited and normal-like subsets did not show associations with fibrotic and inflammatory mediators such as TGFβ and TNFα. The normal-like subset showed high expression of genes associated with lipid signaling, which was absent in the inflammatory and limited subsets. Together, these data suggest a model by which IFNα is involved in early disease pathology, and disease severity is associated with active TGFβ signaling. PMID:25607805

  13. Mining microarray datasets in nutrition: expression of the GPR120 (n-3 fatty acid receptor/sensor) gene is down-regulated in human adipocytes by macrophage secretions.

    PubMed

    Trayhurn, Paul; Denyer, Gareth

    2012-01-01

    Microarray datasets are a rich source of information in nutritional investigation. Targeted mining of microarray data following initial, non-biased bioinformatic analysis can provide key insight into specific genes and metabolic processes of interest. Microarrays from human adipocytes were examined to explore the effects of macrophage secretions on the expression of the G-protein-coupled receptor (GPR) genes that encode fatty acid receptors/sensors. Exposure of the adipocytes to macrophage-conditioned medium for 4 or 24 h had no effect on GPR40 and GPR43 expression, but there was a marked stimulation of GPR84 expression (receptor for medium-chain fatty acids), the mRNA level increasing 13·5-fold at 24 h relative to unconditioned medium. Importantly, expression of GPR120, which encodes an n-3 PUFA receptor/sensor, was strongly inhibited by the conditioned medium (15-fold decrease in mRNA at 24 h). Macrophage secretions have major effects on the expression of fatty acid receptor/sensor genes in human adipocytes, which may lead to an augmentation of the inflammatory response in adipose tissue in obesity.

  14. Mining microarray datasets in nutrition: expression of the GPR120 (n-3 fatty acid receptor/sensor) gene is down-regulated in human adipocytes by macrophage secretions

    PubMed Central

    Trayhurn, Paul; Denyer, Gareth

    2012-01-01

    Microarray datasets are a rich source of information in nutritional investigation. Targeted mining of microarray data following initial, non-biased bioinformatic analysis can provide key insight into specific genes and metabolic processes of interest. Microarrays from human adipocytes were examined to explore the effects of macrophage secretions on the expression of the G-protein-coupled receptor (GPR) genes that encode fatty acid receptors/sensors. Exposure of the adipocytes to macrophage-conditioned medium for 4 or 24 h had no effect on GPR40 and GPR43 expression, but there was a marked stimulation of GPR84 expression (receptor for medium-chain fatty acids), the mRNA level increasing 13·5-fold at 24 h relative to unconditioned medium. Importantly, expression of GPR120, which encodes an n-3 PUFA receptor/sensor, was strongly inhibited by the conditioned medium (15-fold decrease in mRNA at 24 h). Macrophage secretions have major effects on the expression of fatty acid receptor/sensor genes in human adipocytes, which may lead to an augmentation of the inflammatory response in adipose tissue in obesity. PMID:25191551

  15. Low-rank regularization for learning gene expression programs.

    PubMed

    Ye, Guibo; Tang, Mengfan; Cai, Jian-Feng; Nie, Qing; Xie, Xiaohui

    2013-01-01

    Learning gene expression programs directly from a set of observations is challenging due to the complexity of gene regulation, the high noise of experimental measurements, and an insufficient number of experimental measurements. Imposing additional constraints with strong and biologically motivated regularizations is critical in developing reliable and effective algorithms for inferring gene expression programs. Here we propose a new form of regularization that constrains the number of independent connectivity patterns between regulators and targets, motivated by the modular design of gene regulatory programs and the belief that the total number of independent regulatory modules should be small. We formulate a multi-target linear regression framework to incorporate this type of regularization, in which the number of independent connectivity patterns is expressed as the rank of the connectivity matrix between regulators and targets. We then generalize the linear framework to nonlinear cases, and prove that the generalized low-rank regularization model is still convex. Efficient algorithms are derived to solve both the linear and nonlinear low-rank regularized problems. Finally, we test the algorithms on three gene expression datasets, and show that the low-rank regularization improves the accuracy of gene expression prediction in these three datasets.
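
    The multi-target regression described in this abstract can be written compactly. What follows is a sketch under stated assumptions, not the authors' exact formulation: the symbols X (n × p regulator expression), Y (n × q target expression), W (p × q connectivity matrix), the weight λ and the step size η are illustrative, and the nuclear norm is assumed here as the usual convex surrogate for matrix rank.

```latex
% Assumed objective: least-squares fit plus a nuclear-norm penalty on W
\min_{W \in \mathbb{R}^{p \times q}} \;
  \frac{1}{2}\,\lVert Y - XW \rVert_F^2 + \lambda\,\lVert W \rVert_*,
\qquad \lVert W \rVert_* = \sum_{i} \sigma_i(W)

% One proximal-gradient step with step size \eta (singular value thresholding)
W^{(t+1)} = \mathcal{S}_{\eta\lambda}\!\left( W^{(t)} - \eta\, X^{\top}\bigl(X W^{(t)} - Y\bigr) \right),
\qquad
\mathcal{S}_{\tau}\bigl(U \operatorname{diag}(\sigma_i)\, V^{\top}\bigr)
  = U \operatorname{diag}\bigl(\max(\sigma_i - \tau,\, 0)\bigr)\, V^{\top}
```

    Shrinking the singular values toward zero lowers the effective rank of W, which corresponds to keeping the number of independent connectivity patterns small.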

  16. Identifying key genes in glaucoma based on a benchmarked dataset and the gene regulatory network.

    PubMed

    Chen, Xi; Wang, Qiao-Ling; Zhang, Meng-Hui

    2017-10-01

    The current study aimed to identify key genes in glaucoma based on a benchmarked dataset and gene regulatory network (GRN). Local and global noise was added to the gene expression dataset to produce a benchmarked dataset. Differentially-expressed genes (DEGs) between patients with glaucoma and normal controls were identified utilizing the Linear Models for Microarray Data (Limma) package based on the benchmarked dataset. A total of 5 GRN inference methods, including Zscore, GeneNet, context likelihood of relatedness (CLR) algorithm, Partial Correlation coefficient with Information Theory (PCIT) and GEne Network Inference with Ensemble of Trees (Genie3) were evaluated using receiver operating characteristic (ROC) and precision and recall (PR) curves. The inference method with the best performance was selected to construct the GRN. Subsequently, topological centrality analysis (degree, closeness and betweenness) was conducted to identify key genes in the GRN of glaucoma. Finally, the key genes were validated by performing reverse transcription-quantitative polymerase chain reaction (RT-qPCR). A total of 176 DEGs were detected from the benchmarked dataset. The ROC and PR curves of the 5 methods were analyzed and it was determined that Genie3 had a clear advantage over the other methods; thus, Genie3 was used to construct the GRN. Following topological centrality analysis, 14 key genes for glaucoma were identified, including IL6, EPHA2 and GSTT1, and 5 of these 14 key genes were validated by RT-qPCR. Therefore, the current study identified 14 key genes in glaucoma, which may be potential biomarkers to use in the diagnosis of glaucoma and aid in identifying the molecular mechanism of this disease.
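
    The three centrality measures named in this abstract (degree, closeness and betweenness) can be illustrated on a toy undirected network. This is a minimal sketch, not the study's pipeline: the node names G1 to G5 and their edges are hypothetical, and betweenness is computed with Brandes' algorithm, which the abstract does not specify.

```python
from collections import deque

# Hypothetical toy network (adjacency lists); not taken from the study.
adj = {
    "G1": ["G2", "G3", "G4"],
    "G2": ["G1"],
    "G3": ["G1"],
    "G4": ["G1", "G5"],
    "G5": ["G4"],
}


def degree_centrality(adj):
    """Fraction of the other nodes each node is directly connected to."""
    n = len(adj)
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}


def closeness_centrality(adj):
    """Reachable nodes divided by the sum of shortest-path distances to them."""
    result = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from `source`
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total > 0 else 0.0
    return result


def betweenness_centrality(adj):
    """Unnormalized betweenness for an unweighted, undirected graph (Brandes)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}   # predecessors on shortest paths from s
        sigma = {v: 0 for v in adj}    # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:  # accumulate dependencies in reverse BFS order
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: b / 2 for v, b in score.items()}


degree = degree_centrality(adj)
closeness = closeness_centrality(adj)
betweenness = betweenness_centrality(adj)

# Rank nodes by betweenness, highest first.
ranked = sorted(adj, key=lambda v: -betweenness[v])
print(ranked)
```

    In this toy graph the hub G1 ranks first on all three measures; in practice, nodes that rank highly on several measures at once would be the candidates for "key genes".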

  17. Integrative Exploratory Analysis of Two or More Genomic Datasets.

    PubMed

    Meng, Chen; Culhane, Aedin

    2016-01-01

    Exploratory analysis is an essential step in the analysis of high throughput data. Multivariate approaches such as correspondence analysis (CA), principal component analysis, and multidimensional scaling are widely used in the exploratory analysis of a single dataset. Modern biological studies often assay multiple types of biological molecules (e.g., mRNA, protein, phosphoproteins) on the same set of biological samples, thereby creating multiple different types of omics data or multiassay data. Integrative exploratory analysis of these multiple omics data is required to leverage the potential of multiple omics studies. In this chapter, we describe the application of co-inertia analysis (CIA; for analyzing two datasets) and multiple co-inertia analysis (MCIA; for three or more datasets) to address this problem. These methods are powerful yet simple multivariate approaches that represent samples using a lower number of variables, allowing easier identification of the correlated structure in and between multiple high dimensional datasets. Graphical representations can be employed for this purpose. In addition, the methods simultaneously project samples and variables (genes, proteins) onto the same lower dimensional space, so the most variant variables from each dataset can be selected and associated with samples, which can be further used to facilitate biological interpretation and pathway analysis. We applied CIA to explore the concordance between mRNA and protein expression in a panel of 60 tumor cell lines from the National Cancer Institute. In the same 60 cell lines, we used MCIA to perform a cross-platform comparison of mRNA gene expression profiles obtained on four different microarray platforms. Last, as an example of integrative analysis of multiassay or multi-omics data, we analyzed transcriptomic, proteomic, and phosphoproteomic data from pluripotent (iPS) and embryonic stem (ES) cell lines.

  18. Pairwise gene GO-based measures for biclustering of high-dimensional expression data.

    PubMed

    Nepomuceno, Juan A; Troncoso, Alicia; Nepomuceno-Chamorro, Isabel A; Aguilar-Ruiz, Jesús S

    2018-01-01

    Biclustering algorithms search for groups of genes that share the same behavior under a subset of samples in gene expression data. Nowadays, the biological knowledge available in public repositories can be used to drive these algorithms to find biclusters composed of functionally coherent groups of genes. On the other hand, a distance among genes can be defined according to the information stored for them in Gene Ontology (GO). Gene pairwise GO semantic similarity measures report a value for each pair of genes which establishes their functional similarity. A scatter search-based algorithm that optimizes a merit function that integrates GO information is studied in this paper. This merit function uses a term that addresses the information through a GO measure. The effect of two possible different gene pairwise GO measures on the performance of the algorithm is analyzed. Firstly, three well-known yeast datasets with approximately one thousand genes are studied. Secondly, a group of human datasets related to clinical data of cancer is also explored by the algorithm. Most of these data are high-dimensional datasets composed of a huge number of genes. The resultant biclusters reveal groups of genes linked by the same functionality when the search procedure is driven by one of the proposed GO measures. Furthermore, a qualitative biological study of a group of biclusters shows their relevance from a cancer disease perspective. It can be concluded that the integration of biological information improves the performance of the biclustering process. The two different GO measures studied show an improvement in the results obtained for the yeast dataset. However, if datasets are composed of a huge number of genes, only one of them really improves the algorithm performance. This second case constitutes a clear option to explore interesting datasets from a clinical point of view.

  19. Gene Expression Signatures Diagnose Influenza and Other Symptomatic Respiratory Viral Infection in Humans

    PubMed Central

    Zaas, Aimee K.; Chen, Minhua; Varkey, Jay; Veldman, Timothy; Hero, Alfred O.; Lucas, Joseph; Huang, Yongsheng; Turner, Ronald; Gilbert, Anthony; Lambkin-Williams, Robert; Øien, N. Christine; Nicholson, Bradly; Kingsmore, Stephen; Carin, Lawrence; Woods, Christopher W.; Ginsburg, Geoffrey S.

    2010-01-01

    Summary Acute respiratory infections (ARI) are a common reason for seeking medical attention and the threat of pandemic influenza will likely add to these numbers. Using human viral challenge studies with live rhinovirus, respiratory syncytial virus, and influenza A, we developed peripheral blood gene expression signatures that distinguish individuals with symptomatic ARI from uninfected individuals with > 95% accuracy. We validated this “acute respiratory viral” signature - encompassing genes with a known role in host defense against viral infections - across each viral challenge. We also validated the signature in an independently acquired dataset for influenza A and classified infected individuals from healthy controls with 100% accuracy. In the same dataset, we could also distinguish viral from bacterial ARIs (93% accuracy). These results demonstrate that ARIs induce changes in human peripheral blood gene expression that can be used to diagnose a viral etiology of respiratory infection and triage symptomatic individuals. PMID:19664979

  20. A prior-based integrative framework for functional transcriptional regulatory network inference

    PubMed Central

    Siahpirani, Alireza F.

    2017-01-01

    Abstract Transcriptional regulatory networks specify regulatory proteins controlling the context-specific expression levels of genes. Inference of genome-wide regulatory networks is central to understanding gene regulation, but remains an open challenge. Expression-based network inference is among the most popular methods to infer regulatory networks; however, networks inferred from such methods have low overlap with experimentally derived (e.g. ChIP-chip and transcription factor (TF) knockouts) networks. Currently we have a limited understanding of this discrepancy. To address this gap, we first develop a regulatory network inference algorithm, based on probabilistic graphical models, to integrate expression with auxiliary datasets supporting a regulatory edge. Second, we comprehensively analyze our and other state-of-the-art methods on different expression perturbation datasets. Networks inferred by integrating sequence-specific motifs with expression have substantially greater agreement with experimentally derived networks, while remaining more predictive of expression than motif-based networks. Our analysis suggests natural genetic variation as the most informative perturbation for network inference, and identifies core TFs whose targets are predictable from expression. Multiple reasons make the identification of targets of other TFs difficult, including network architecture and insufficient variation of TF mRNA level. Finally, we demonstrate the utility of our inference algorithm to infer stress-specific regulatory networks and for regulator prioritization. PMID:27794550

  1. Cancer biomarker discovery: the entropic hallmark.

    PubMed

    Berretta, Regina; Moscato, Pablo

    2010-08-18

    It is a commonly accepted belief that cancer cells modify their transcriptional state during the progression of the disease. We propose that the progression of cancer cells towards malignant phenotypes can be efficiently tracked using high-throughput technologies that follow the gradual changes observed in the gene expression profiles by employing Shannon's mathematical theory of communication. Methods based on Information Theory can then quantify the divergence of cancer cells' transcriptional profiles from those of normally appearing cells of the originating tissues. The relevance of the proposed methods can be evaluated using microarray datasets available in the public domain but the method is in principle applicable to other high-throughput methods. Using melanoma and prostate cancer datasets we illustrate how it is possible to employ Shannon Entropy and the Jensen-Shannon divergence to trace the transcriptional changes that accompany the progression of the disease. We establish how the variations of these two measures correlate with established biomarkers of cancer progression. The Information Theory measures allow us to identify novel biomarkers for both progressive and relatively more sudden transcriptional changes leading to malignant phenotypes. At the same time, the methodology was able to validate a large number of genes and processes that seem to be implicated in the progression of melanoma and prostate cancer. We thus present a quantitative guiding rule, a new unifying hallmark of cancer: the cancer cell's transcriptome changes lead to measurable observed transitions of Normalized Shannon Entropy values (as measured by high-throughput technologies). At the same time, tumor cells increment their divergence from the normal tissue profile, increasing their disorder via creation of states that we might not directly measure. This unifying hallmark allows us, via the Jensen-Shannon divergence, to identify the arrow of time of the processes from the gene expression profiles, and helps to map the phenotypical and molecular hallmarks of specific cancer subtypes. The deep mathematical basis of the approach allows us to suggest that this principle is, hopefully, of general applicability for other diseases.
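
    The two information-theoretic measures named in this abstract can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' procedure: each expression profile is simply rescaled to sum to one so that it can be treated as a probability distribution, and the four-gene profiles below are made-up numbers.

```python
import math


def to_distribution(expression):
    """Rescale non-negative expression values so that they sum to 1."""
    total = sum(expression)
    return [value / total for value in expression]


def shannon_entropy(p):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def normalized_entropy(p):
    """Entropy divided by its maximum, log2(number of genes); lies in [0, 1]."""
    return shannon_entropy(p) / math.log2(len(p))


def jensen_shannon_divergence(p, q):
    """JSD(p, q) = H(m) - (H(p) + H(q)) / 2, where m is the average of p and q."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return shannon_entropy(m) - (shannon_entropy(p) + shannon_entropy(q)) / 2


# Hypothetical profiles over the same four genes (illustrative values only).
normal = to_distribution([10.0, 10.0, 10.0, 10.0])
tumor = to_distribution([70.0, 10.0, 10.0, 10.0])

h_normal = normalized_entropy(normal)
h_tumor = normalized_entropy(tumor)
jsd = jensen_shannon_divergence(normal, tumor)

print(h_normal, h_tumor, jsd)
```

    With base-2 logarithms the Jensen-Shannon divergence is symmetric and bounded between 0 and 1, which makes it convenient for comparing each sample against a reference (for example, normal-tissue) profile.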

  2. Dysregulation of the mitogen granulin in human cancer through the miR-15/107 microRNA gene group

    PubMed Central

    Wang, Wang-Xia; Kyprianou, Natasha; Wang, Xiaowei; Nelson, Peter T.

    2010-01-01

    Granulin (GRN) is a potent mitogen and growth factor implicated in many human cancers, but its regulation is poorly understood. Recent findings indicate that GRN is regulated strongly by the microRNA miR-107, which functionally overlaps with miR-15, miR-16, and miR-195 due to a common 5' sequence critical for target specificity. In this study, we asked whether miR-107 and its paralogs regulate GRN in human cancers. In cultured cells, anti-Argonaute RIP-ChIP experiments indicate that GRN mRNA is directly targeted by numerous miR-15/107 miRNAs. We further tested this association in human tumors. MiR-15 and miR-16 are known to be downregulated in chronic lymphocytic leukemia (CLL). Using pre-existing microarray datasets, we found that GRN expression is higher in CLL relative to non-neoplastic lymphocytes (P<0.00001). By contrast, other prospective miR-15/miR-16 targets in the dataset (BCL-2 and cyclin D1) were not up-regulated in CLL. Unlike in CLL, GRN was not up-regulated in chronic myelogenous leukemia (CML), where miR-107 paralogs are not known to be dysregulated. Prior studies have shown that GRN is also up-regulated, and miR-107 down-regulated, in prostate carcinoma. Our results indicate that multiple members of the miR-107 gene group indeed repress GRN protein levels when transfected into prostate cancer cells. At least a dozen distinct types of cancer show the pattern of increased GRN and decreased miR-107 expression. These findings indicate for the first time that the mitogen and growth factor GRN is dysregulated via the miR-15/107 gene group in multiple human cancers, which may provide a potential common therapeutic target. PMID:20884628

  3. An enhanced deterministic K-Means clustering algorithm for cancer subtype prediction from gene expression data.

    PubMed

    Nidheesh, N; Abdul Nazeer, K A; Ameer, P M

    2017-12-01

    Clustering algorithms with steps involving randomness usually give different results on different executions for the same dataset. This non-deterministic nature of algorithms such as the K-Means clustering algorithm limits their applicability in areas such as cancer subtype prediction using gene expression data. It is hard to sensibly compare the results of such algorithms with those of other algorithms. The non-deterministic nature of K-Means is due to its random selection of data points as initial centroids. We propose an improved, density based version of K-Means, which involves a novel and systematic method for selecting initial centroids. The key idea of the algorithm is to select data points which belong to dense regions and which are adequately separated in feature space as the initial centroids. We compared the proposed algorithm to a set of eleven widely used single clustering algorithms and a prominent ensemble clustering algorithm which is being used for cancer data classification, based on the performances on a set of datasets comprising ten cancer gene expression datasets. The proposed algorithm has shown better overall performance than the others. There is a pressing need in the Biomedical domain for simple, easy-to-use and more accurate Machine Learning tools for cancer subtype prediction. The proposed algorithm is simple, easy-to-use and gives stable results. Moreover, it provides comparatively better predictions of cancer subtypes from gene expression data. Copyright © 2017 Elsevier Ltd. All rights reserved.
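
    The key idea stated in this abstract, choosing initial centroids from dense regions that are well separated, can be sketched as follows. This is an illustration of the general idea only, not the published algorithm: the density estimate (neighbour count within a radius) and the parameters `radius` and `min_sep` are assumptions made for this example.

```python
import math


def density_seeds(points, k, radius, min_sep):
    """Pick k seeds: densest points first, skipping any too close to a chosen seed."""
    density = [sum(1 for q in points if math.dist(p, q) <= radius) for p in points]
    # Ties are broken by index, so the selection is fully deterministic.
    order = sorted(range(len(points)), key=lambda i: (-density[i], i))
    seeds = []
    for i in order:
        if all(math.dist(points[i], s) >= min_sep for s in seeds):
            seeds.append(points[i])
        if len(seeds) == k:
            return seeds
    raise ValueError("could not find k sufficiently separated seeds")


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for p in points
    ]


def kmeans(points, centroids, max_iter=100):
    """Standard Lloyd iterations starting from the given centroids."""
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(old)  # keep the centroid of an empty cluster
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


# Toy data: two well-separated groups of four points each (illustrative only).
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
]

seeds = density_seeds(points, k=2, radius=1.5, min_sep=5.0)
centroids, labels = kmeans(points, seeds)
print(centroids, labels)
```

    Because no random numbers are involved, repeated runs on the same data return the same clustering, which is the property the abstract emphasizes.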

  4. Emory University: High-Throughput Protein-Protein Interaction Dataset for Lung Cancer-Associated Genes | Office of Cancer Genomics

    Cancer.gov

    To discover novel PPI signaling hubs for lung cancer, the CTD2 Center at Emory utilized large-scale genomics datasets and literature to compile a set of lung cancer-associated genes. A library of expression vectors was generated for these genes and used to detect pairwise PPIs with cell lysate-based TR-FRET assays in a high-throughput screening format.

  5. metaseq: a Python package for integrative genome-wide analysis reveals relationships between chromatin insulators and associated nuclear mRNA.

    PubMed

    Dale, Ryan K; Matzat, Leah H; Lei, Elissa P

    2014-08-01

    Here we introduce metaseq, a software library written in Python, which enables loading multiple genomic data formats into standard Python data structures and allows flexible, customized manipulation and visualization of data from high-throughput sequencing studies. We demonstrate its practical use by analyzing multiple datasets related to chromatin insulators, which are DNA-protein complexes proposed to organize the genome into distinct transcriptional domains. Recent studies in Drosophila and mammals have implicated RNA in the regulation of chromatin insulator activities. Moreover, the Drosophila RNA-binding protein Shep has been shown to antagonize gypsy insulator activity in a tissue-specific manner, but the precise role of RNA in this process remains unclear. Better understanding of chromatin insulator regulation requires integration of multiple datasets, including those from chromatin-binding, RNA-binding, and gene expression experiments. We use metaseq to integrate RIP- and ChIP-seq data for Shep and the core gypsy insulator protein Su(Hw) in two different cell types, along with publicly available ChIP-chip and RNA-seq data. Based on the metaseq-enabled analysis presented here, we propose a model where Shep associates with chromatin cotranscriptionally, then is recruited to insulator complexes in trans where it plays a negative role in insulator activity. Published by Oxford University Press on behalf of Nucleic Acids Research 2014. This work is written by (a) US Government employee(s) and is in the public domain in the US.

  6. Murine colon proteome and characterization of the protein pathways

    PubMed Central

    2012-01-01

    Background Most current proteomic research on the colon focuses on proteome alterations due to pathological disorders (e.g., colorectal cancer) rather than on the normal healthy state. As a result, information on the normal whole-tissue colon proteome is lacking. Results We report here a detailed murine (mouse) whole-tissue colon protein reference dataset composed of 1237 confident proteins (FDR < 2) with comprehensive insight into peptide properties, cellular and subcellular localization, functional network GO annotation analysis, and relative abundances. The presented dataset covers a wide spectrum of pI and Mw, ranging from 3–12 and 4–600 kDa, respectively. GRAVY index scoring predicted 19.5% membranous and 80.5% globularly located proteins. GO hierarchies and functional network analysis illustrated protein functions together with their relevance and the implication of several candidates in malignancy, such as Mitogen-activated protein kinase (Mapk8, 9) in colorectal cancer, Fibroblast growth factor receptor (Fgfr 2), Glutathione S-transferase (Gstp1) in prostate cancer, and Cell division control protein (Cdc42), Ras-related protein (Rac1,2) in pancreatic cancer. Protein abundances calculated with 3 different algorithms (NSAF, PAF and emPAI) provide a relative quantification under normal conditions as guidance. Conclusions This high-confidence colon proteome catalogue will not only serve as a useful reference for further experiments characterizing differentially expressed proteins induced by diseased conditions, but will also aid in better understanding the ontology and functional absorptive mechanism of the colon. PMID:22929016

  7. Gene expression profiles reveal key genes for early diagnosis and treatment of adamantinomatous craniopharyngioma.

    PubMed

    Yang, Jun; Hou, Ziming; Wang, Changjiang; Wang, Hao; Zhang, Hongbing

    2018-04-23

    Adamantinomatous craniopharyngioma (ACP) is an aggressive brain tumor that occurs predominantly in the pediatric population. Conventional diagnostic methods and standard therapies cannot treat ACPs effectively. In this paper, we aimed to identify key genes for ACP early diagnosis and treatment. Datasets GSE94349 and GSE68015 were obtained from the Gene Expression Omnibus database. Consensus clustering was applied to discover the gene clusters in the expression data of GSE94349, and functional enrichment analysis was performed on the gene set in each cluster. The protein-protein interaction (PPI) network was built by the Search Tool for the Retrieval of Interacting Genes, and hubs were selected. A support vector machine (SVM) model was built based on the signature genes identified from enrichment analysis and the PPI network. Dataset GSE94349 was used for training and testing, and GSE68015 was used for validation. In addition, RT-qPCR analysis was performed to analyze the expression of signature genes in ACP samples compared with normal controls. Seven gene clusters were discovered in the differentially expressed genes identified from the GSE94349 dataset. Enrichment analysis of each cluster identified 25 pathways that were highly associated with ACP. The PPI network was built and 46 hubs were determined. Twenty-five pathway-related genes that overlapped with the hubs in the PPI network were used as signatures to establish the SVM diagnosis model for ACP. The prediction accuracies of the SVM model for training, testing, and validation data were 94, 85, and 74%, respectively. The expression of CDH1, CCL2, ITGA2, COL8A1, COL6A2, and COL6A3 was significantly upregulated in ACP tumor samples, while CAMK2A, RIMS1, NEFL, SYT1, and STX1A were significantly downregulated, which was consistent with the differentially expressed gene analysis. The SVM model is a promising classification tool for screening and early diagnosis of ACP. The ACP-related pathways and signature genes will advance our knowledge of ACP pathogenesis and benefit the improvement of therapy.

  8. Selenium-binding protein 1 in head and neck cancer is low-expression and associates with the prognosis of nasopharyngeal carcinoma

    PubMed Central

    Chen, Fasheng; Chen, Chen; Qu, Yangang; Xiang, Hua; Ai, Qingxiu; Yang, Fei; Tan, Xueping; Zhou, Yi; Jiang, Guang; Zhang, Zixiong

    2016-01-01

    Abstract Background: Selenium-binding protein 1 (SELENBP1) expression is markedly reduced in many types of cancer, and low SELENBP1 expression levels are associated with poor patient prognosis. Methods: SELENBP1 gene expression in head and neck squamous cell carcinoma (HNSCC) was analyzed using a GEO dataset, and the characteristics of SELENBP1 expression in paraffin-embedded tissue were summarized. Expression of SELENBP1 in nasopharyngeal carcinoma (NPC), laryngeal cancer, oral cancer, tonsil cancer, hypopharyngeal cancer and normal tissues was detected using immunohistochemistry. Finally, 99 NPC patients were followed up for more than 5 years and the prognostic significance of SELENBP1 was analyzed. Results: Analysis of the GEO dataset showed that SELENBP1 gene expression in HNSCC was lower than that in normal tissue (P < 0.01), but there was no significant difference in SELENBP1 gene expression between T-stages or N-stages (P > 0.05). Analysis of pathological sections showed that SELENBP1 expression is low in the majority of HNSCC and lower in cancer nests than in surrounding normal tissue, and is even associated with the degree of malignancy of the tumor. Further study indicated that NPC patients in the low SELENBP1 expression group had poor overall survival, significantly different from that of the high expression group. Conclusion: SELENBP1 expression was down-regulated in HNSCC but was not associated with the T-stage or N-stage of the tumor. Low expression of SELENBP1 in patients with NPC was associated with poor overall survival, so SELENBP1 could be a novel biomarker for predicting prognosis. PMID:27583873

  9. Fyn-Dependent Gene Networks in Acute Ethanol Sensitivity

    PubMed Central

    Farris, Sean P.; Miles, Michael F.

    2013-01-01

    Studies in humans and animal models document that acute behavioral responses to ethanol are a predisposing factor for the risk of long-term drinking behavior. Prior microarray data from our laboratory document strain- and brain region-specific variation in gene expression profile responses to acute ethanol that may be underlying regulators of ethanol behavioral phenotypes. The non-receptor tyrosine kinase Fyn has previously been mechanistically implicated in the sedative-hypnotic response to acute ethanol. To further understand how Fyn may modulate ethanol behaviors, we used whole-genome expression profiling. We characterized basal and acute ethanol-evoked (3 g/kg) gene expression patterns in nucleus accumbens (NAC), prefrontal cortex (PFC), and ventral midbrain (VMB) of control and Fyn knockout mice. Bioinformatics analysis identified a set of Fyn-related gene networks differently regulated by acute ethanol across the three brain regions. In particular, our analysis suggested a coordinate basal decrease in myelin-associated gene expression within NAC and PFC as an underlying factor in sensitivity of Fyn null animals to ethanol sedation. An in silico analysis across the BXD recombinant inbred (RI) strains of mice identified a significant correlation between Fyn expression and a previously published ethanol loss-of-righting-reflex (LORR) phenotype. By combining PFC gene expression correlates to Fyn and LORR across multiple genomic datasets, we identified robust Fyn-centric gene networks related to LORR. Our results thus suggest that multiple system-wide changes exist within specific brain regions of Fyn knockout mice, and that distinct Fyn-dependent expression networks within PFC may be important determinants of the LORR due to acute ethanol. These results add to the interpretation of acute ethanol behavioral sensitivity in Fyn kinase null animals, and identify Fyn-centric gene networks influencing variance in ethanol LORR. Such networks may also inform future design of pharmacotherapies for the treatment and prevention of alcohol use disorders. PMID:24312422

  10. Systems-based biological concordance and predictive reproducibility of gene set discovery methods in cardiovascular disease.

    PubMed

    Azuaje, Francisco; Zheng, Huiru; Camargo, Anyela; Wang, Haiying

    2011-08-01

    The discovery of novel disease biomarkers is a crucial challenge for translational bioinformatics. Demonstration of both their classification power and reproducibility across independent datasets are essential requirements to assess their potential clinical relevance. Small datasets and multiplicity of putative biomarker sets may explain lack of predictive reproducibility. Studies based on pathway-driven discovery approaches have suggested that, despite such discrepancies, the resulting putative biomarkers tend to be implicated in common biological processes. Investigations of this problem have been mainly focused on datasets derived from cancer research. We investigated the predictive and functional concordance of five methods for discovering putative biomarkers in four independently-generated datasets from the cardiovascular disease domain. A diversity of biosignatures was identified by the different methods. However, we found strong biological process concordance between them, especially in the case of methods based on gene set analysis. With a few exceptions, we observed lack of classification reproducibility using independent datasets. Partial overlaps between our putative sets of biomarkers and the primary studies exist. Despite the observed limitations, pathway-driven or gene set analysis can predict potentially novel biomarkers and can jointly point to biomedically-relevant underlying molecular mechanisms. Copyright © 2011 Elsevier Inc. All rights reserved.

  11. A four-gene signature predicts survival in clear-cell renal-cell carcinoma.

    PubMed

    Dai, Jun; Lu, Yuchao; Wang, Jinyu; Yang, Lili; Han, Yingyan; Wang, Ying; Yan, Dan; Ruan, Qiurong; Wang, Shaogang

    2016-12-13

    Clear-cell renal-cell carcinoma (ccRCC) is the most common pathological subtype of renal cell carcinoma (RCC), accounting for about 80% of RCC. In order to find potential prognostic biomarkers in ccRCC, we present a four-gene signature to evaluate the prognosis of ccRCC. SurvExpress and immunohistochemical (IHC) staining of tissue microarrays were used to analyze the association between the four genes and the prognosis of ccRCC. Data from the TCGA dataset revealed a prognostic value for the four genes (PTEN, PIK3C2A, ITPA and BCL3). Further analysis suggested that the four-gene signature predicted survival better than any of the four genes alone. Moreover, IHC staining gave results consistent with TCGA, indicating that the signature was an independent prognostic factor of survival in ccRCC. Univariate and multivariate Cox proportional hazard regression analyses were conducted to verify the association of clinicopathological variables and the four genes' expression levels with survival. The results further confirmed that the risk (four-gene signature) was an independent prognostic factor of both Overall Survival (OS) and Disease-free Survival (DFS) (P<0.05). In conclusion, the four-gene signature was correlated with the survival of ccRCC, and therefore may have significant clinical implications for predicting the prognosis of patients.

  12. LncSubpathway: a novel approach for identifying dysfunctional subpathways associated with risk lncRNAs by integrating lncRNA and mRNA expression profiles and pathway topologies.

    PubMed

    Xu, Yanjun; Li, Feng; Wu, Tan; Xu, Yingqi; Yang, Haixiu; Dong, Qun; Zheng, Meiyu; Shang, Desi; Zhang, Chunlong; Zhang, Yunpeng; Li, Xia

    2017-02-28

    Long non-coding RNAs (lncRNAs) play important roles in various biological processes, including the development of many diseases. Pathway analysis is a valuable aid for understanding the cellular functions of these transcripts. We have developed and characterized LncSubpathway, a novel method that integrates lncRNA and protein coding gene (PCG) expression with interactome data to identify disease risk subpathways that are functionally associated with risk lncRNAs. LncSubpathway identifies the most relevant regions, which are related to the risk lncRNA set and implicated in the study conditions, by simultaneously considering the extent of dysregulation of lncRNAs, PCGs and their correlations. Simulation studies demonstrated that the sensitivity and false positive rates of LncSubpathway were within acceptable ranges, and that LncSubpathway could accurately identify dysregulated regions within pathways that are related to disease risk lncRNAs. When LncSubpathway was applied to colorectal carcinoma and breast cancer subtype datasets, it identified meaningful subpathways related to cancer type and breast cancer subtype. Further, analysis of its robustness and reproducibility indicated that LncSubpathway was a reliable means of identifying subpathways that are functionally associated with lncRNAs. LncSubpathway is freely available at http://www.bio-bigdata.com/lncSubpathway/.

  13. LncSubpathway: a novel approach for identifying dysfunctional subpathways associated with risk lncRNAs by integrating lncRNA and mRNA expression profiles and pathway topologies

    PubMed Central

    Wu, Tan; Xu, Yingqi; Yang, Haixiu; Dong, Qun; Zheng, Meiyu; Shang, Desi; Zhang, Chunlong; Zhang, Yunpeng; Li, Xia

    2017-01-01

    Long non-coding RNAs (lncRNAs) play important roles in various biological processes, including the development of many diseases. Pathway analysis is a valuable aid for understanding the cellular functions of these transcripts. We have developed and characterized LncSubpathway, a novel method that integrates lncRNA and protein coding gene (PCG) expression with interactome data to identify disease risk subpathways that are functionally associated with risk lncRNAs. LncSubpathway identifies the most relevant regions, which are related to the risk lncRNA set and implicated in the study conditions, by simultaneously considering the extent of dysregulation of lncRNAs, PCGs and their correlations. Simulation studies demonstrated that the sensitivity and false positive rates of LncSubpathway were within acceptable ranges, and that LncSubpathway could accurately identify dysregulated regions within pathways that are related to disease risk lncRNAs. When LncSubpathway was applied to colorectal carcinoma and breast cancer subtype datasets, it identified meaningful subpathways related to cancer type and breast cancer subtype. Further, analysis of its robustness and reproducibility indicated that LncSubpathway was a reliable means of identifying subpathways that are functionally associated with lncRNAs. LncSubpathway is freely available at http://www.bio-bigdata.com/lncSubpathway/. PMID:28152521

  14. Comprehensive single cell-resolution analysis of the role of chromatin regulators in early C. elegans embryogenesis.

    PubMed

    Krüger, Angela V; Jelier, Rob; Dzyubachyk, Oleh; Zimmerman, Timo; Meijering, Erik; Lehner, Ben

    2015-02-15

    Chromatin regulators are widely expressed proteins with diverse roles in gene expression, nuclear organization, cell cycle regulation, pluripotency, physiology and development, and are frequently mutated in human diseases such as cancer. Their inhibition often results in pleiotropic effects that are difficult to study using conventional approaches. We have developed a semi-automated nuclear tracking algorithm to quantify the divisions, movements and positions of all nuclei during the early development of Caenorhabditis elegans and have used it to systematically study the effects of inhibiting chromatin regulators. The resulting high dimensional datasets revealed that inhibition of multiple regulators, including F55A3.3 (encoding FACT subunit SUPT16H), lin-53 (RBBP4/7), rba-1 (RBBP4/7), set-16 (MLL2/3), hda-1 (HDAC1/2), swsn-7 (ARID2), and let-526 (ARID1A/1B) affected cell cycle progression and caused chromosome segregation defects. In contrast, inhibition of cir-1 (CIR1) accelerated cell division timing in specific cells of the AB lineage. The inhibition of RNA polymerase II also accelerated these division timings, suggesting that normal gene expression is required to delay cell cycle progression in multiple lineages in the early embryo. Quantitative analyses of the dataset suggested the existence of at least two functionally distinct SWI/SNF chromatin remodeling complex activities in the early embryo, and identified a redundant requirement for the egl-27 and lin-40 MTA orthologs in the development of endoderm and mesoderm lineages. Moreover, our dataset also revealed a characteristic rearrangement of chromatin to the nuclear periphery upon the inhibition of multiple general regulators of gene expression. Our systematic, comprehensive and quantitative datasets illustrate the power of single cell-resolution quantitative tracking and high dimensional phenotyping to investigate gene function. 
    Furthermore, the results provide an overview of the functions of essential chromatin regulators during the early development of an animal. Copyright © 2014 Elsevier Inc. All rights reserved.

  15. Meta-Analysis of Tumor Stem-Like Breast Cancer Cells Using Gene Set and Network Analysis

    PubMed Central

    Lee, Won Jun; Kim, Sang Cheol; Yoon, Jung-Ho; Yoon, Sang Jun; Lim, Johan; Kim, You-Sun; Kwon, Sung Won; Park, Jeong Hill

    2016-01-01

    Generally, cancer stem cells have epithelial-to-mesenchymal-transition characteristics and other aggressive properties that cause metastasis. However, there have been no confident markers for the identification of cancer stem cells and comparative methods examining adherent and sphere cells are widely used to investigate mechanism underlying cancer stem cells, because sphere cells have been known to maintain cancer stem cell characteristics. In this study, we conducted a meta-analysis that combined gene expression profiles from several studies that utilized tumorsphere technology to investigate tumor stem-like breast cancer cells. We used our own gene expression profiles along with the three different gene expression profiles from the Gene Expression Omnibus, which we combined using the ComBat method, and obtained significant gene sets using the gene set analysis of our datasets and the combined dataset. This experiment focused on four gene sets such as cytokine-cytokine receptor interaction that demonstrated significance in both datasets. Our observations demonstrated that among the genes of four significant gene sets, six genes were consistently up-regulated and satisfied the p-value of < 0.05, and our network analysis showed high connectivity in five genes. From these results, we established CXCR4, CXCL1 and HMGCS1, the intersecting genes of the datasets with high connectivity and p-value of < 0.05, as significant genes in the identification of cancer stem cells. Additional experiment using quantitative reverse transcription-polymerase chain reaction showed significant up-regulation in MCF-7 derived sphere cells and confirmed the importance of these three genes. Taken together, using meta-analysis that combines gene set and network analysis, we suggested CXCR4, CXCL1 and HMGCS1 as candidates involved in tumor stem-like breast cancer cells. 
    Unlike other meta-analyses, by using gene set analysis, we selected possible markers that can explain the biological mechanisms, and we suggest network analysis as an additional criterion for selecting candidates. PMID:26870956

  16. Connectivity Mapping for Candidate Therapeutics Identification Using Next Generation Sequencing RNA-Seq Data

    PubMed Central

    McArt, Darragh G.; Dunne, Philip D.; Blayney, Jaine K.; Salto-Tellez, Manuel; Van Schaeybroeck, Sandra; Hamilton, Peter W.; Zhang, Shu-Dong

    2013-01-01

    The advent of next generation sequencing technologies (NGS) has expanded the area of genomic research, offering high coverage and increased sensitivity over older microarray platforms. Although the current cost of next generation sequencing is still exceeding that of microarray approaches, the rapid advances in NGS will likely make it the platform of choice for future research in differential gene expression. Connectivity mapping is a procedure for examining the connections among diseases, genes and drugs by differential gene expression initially based on microarray technology, with which a large collection of compound-induced reference gene expression profiles have been accumulated. In this work, we aim to test the feasibility of incorporating NGS RNA-Seq data into the current connectivity mapping framework by utilizing the microarray based reference profiles and the construction of a differentially expressed gene signature from a NGS dataset. This would allow for the establishment of connections between the NGS gene signature and those microarray reference profiles, alleviating the associated incurring cost of re-creating drug profiles with NGS technology. We examined the connectivity mapping approach on a publicly available NGS dataset with androgen stimulation of LNCaP cells in order to extract candidate compounds that could inhibit the proliferative phenotype of LNCaP cells and to elucidate their potential in a laboratory setting. In addition, we also analyzed an independent microarray dataset of similar experimental settings. We found a high level of concordance between the top compounds identified using the gene signatures from the two datasets. The nicotine derivative cotinine was returned as the top candidate among the overlapping compounds with potential to suppress this proliferative phenotype. Subsequent lab experiments validated this connectivity mapping hit, showing that cotinine inhibits cell proliferation in an androgen dependent manner. 
    Thus, the results of this study suggest a promising prospect of integrating NGS data with connectivity mapping. PMID:23840550

  17. Identification of mineral resources in Afghanistan-Detecting and mapping resource anomalies in prioritized areas using geophysical and remote sensing (ASTER and HyMap) data

    USGS Publications Warehouse

    King, Trude V.V.; Johnson, Michaela R.; Hubbard, Bernard E.; Drenth, Benjamin J.

    2011-01-01

    During the independent analysis of the geophysical, ASTER, and imaging spectrometer (HyMap) data by USGS scientists, previously unrecognized targets of potential mineralization were identified using evaluation criteria most suitable to the individual dataset. These anomalous zones offer targets of opportunity that warrant additional field verification. This report describes the standards used to define the anomalies, summarizes the results of the evaluations for each type of data, and discusses the importance and implications of regions of anomaly overlap between two or three of the datasets.

  18. Brain Growth Across the Life Span in Autism: Age-Specific Changes in Anatomical Pathology

    PubMed Central

    Courchesne, Eric; Campbell, Kathleen; Solso, Stephanie

    2014-01-01

    Autism is marked by overgrowth of the brain at the earliest ages but not at older ages, when decreases in structural volumes and neuron numbers are observed instead. This has led to the theory of age-specific anatomic abnormalities in autism. Here we report age-related changes in brain size in autistic and typical subjects from 12 months to 50 years of age based on analyses of 586 longitudinal and cross-sectional MRI scans. This dataset is several times larger than the largest autism study to date. Results demonstrate early brain overgrowth during infancy and the toddler years in autistic boys and girls, followed by an accelerated rate of decline in size and perhaps degeneration from adolescence to late middle age in this disorder. We theorize that underlying these age-specific changes in anatomic abnormalities in autism there may also be age-specific changes in gene expression, molecular, synaptic, cellular and circuit abnormalities. A peak age for detecting and studying the earliest fundamental biological underpinnings of autism is prenatal life and the first three postnatal years. Studies of the older autistic brain may not address original causes but are essential to discovering how best to help the older aging autistic person. Lastly, the theory of age-specific anatomic abnormalities in autism has broad implications for a wide range of work on the disorder including the design, validation and interpretation of animal model, lymphocyte gene expression, brain gene expression, and genotype/CNV-anatomic phenotype studies. PMID:20920490

  19. A new dataset validation system for the Planetary Science Archive

    NASA Astrophysics Data System (ADS)

    Manaud, N.; Zender, J.; Heather, D.; Martinez, S.

    2007-08-01

    The Planetary Science Archive is the official archive for the Mars Express mission. It received its first data at the end of 2004. These data are delivered by the PI teams to the PSA team as datasets, which are formatted to conform to the Planetary Data System (PDS). The PI teams are responsible for analyzing and calibrating the instrument data as well as for the production of reduced and calibrated data. They are also responsible for the scientific validation of these data. ESA is responsible for the long-term data archiving and distribution to the scientific community and must ensure, in this regard, that all archived products meet quality requirements. To do so, an archive peer-review is used to control the quality of the Mars Express science data archiving process. However, a full validation of its content is missing. An independent review board recently recommended that the completeness of the archive as well as the consistency of the delivered data should be validated following well-defined procedures. A new validation software tool is being developed to complete the overall data quality control system functionality. This new tool aims to improve the quality of data and services provided to the scientific community through the PSA, and shall make it possible to track anomalies in datasets and to control their completeness. It shall ensure that the PSA end-users: (1) can rely on the result of their queries, (2) will get data products that are suitable for scientific analysis, (3) can find all science data acquired during a mission. We defined dataset validation as the verification and assessment process to check the dataset content against pre-defined top-level criteria, which represent the general characteristics of good quality datasets. The dataset content that is checked includes the data and all types of information that are essential in the process of deriving scientific results and those interfacing with the PSA database. The validation software tool is a multi-mission tool that has been designed to provide the user with the flexibility of defining and implementing various types of validation criteria, to iteratively and incrementally validate datasets, and to generate validation reports.

  20. Phylogenetic conservatism in plant-soil feedback and its implications for plant abundance

    USDA-ARS's Scientific Manuscript database

    Plant interactions with macro-mutualists (e.g., seed dispersers, pollinators) and antagonists (e.g., herbivores, pathogens) often exhibit phylogenetic conservatism, but conservatism of interactions with soil microorganisms is understudied. We assembled one of the best available datasets to examine c...

  1. Size-based trends and management implications of microhabitat utilization by Brown Treesnakes, with an emphasis on juvenile snakes

    USGS Publications Warehouse

    Rodda, Gordon H.; Reed, Robert N.

    2007-01-01

    The brown treesnake (Boiga irregularis, or BTS), a costly invasive species, has been the subject of intensive research on Guam over the past two decades. The behavior and habitat use of hatchling and juvenile snakes, however, remain largely unknown. We used a long-term dataset of BTS captures (N = 2,415) and a dataset resulting from intensive sampling within and immediately around a 5-ha fenced population (N = 2,541) to examine habitat use of BTS. Small snakes were almost exclusively arboreal and appeared to prefer tangantangan (Leucaena leucocephala) habitats. In contrast, large snakes used arboreal and terrestrial habitats in roughly equal proportion, and were less frequently found in tangantangan. Among snakes found in trees, there were no clear size-based preferences for certain heights above ground or for perch diameters. We discuss these results as they relate to management and interdiction implications for brown treesnakes on Guam and in potential incipient populations on other islands.

  2. Hidden treasures in "ancient" microarrays: gene-expression portrays biology and potential resistance pathways of major lung cancer subtypes and normal tissue.

    PubMed

    Kerkentzes, Konstantinos; Lagani, Vincenzo; Tsamardinos, Ioannis; Vyberg, Mogens; Røe, Oluf Dimitri

    2014-01-01

    Novel statistical methods and increasingly more accurate gene annotations can transform "old" biological data into a renewed source of knowledge with potential clinical relevance. Here, we provide an in silico proof-of-concept by extracting novel information from a high-quality mRNA expression dataset, originally published in 2001, using state-of-the-art bioinformatics approaches. The dataset consists of histologically defined cases of lung adenocarcinoma (AD), squamous (SQ) cell carcinoma, small-cell lung cancer, carcinoid, metastasis (breast and colon AD), and normal lung specimens (203 samples in total). A battery of statistical tests was used for identifying differential gene expressions, diagnostic and prognostic genes, enriched gene ontologies, and signaling pathways. Our results showed that gene expressions faithfully recapitulate immunohistochemical subtype markers, as chromogranin A in carcinoids, cytokeratin 5, p63 in SQ, and TTF1 in non-squamous types. Moreover, biological information with putative clinical relevance was revealed as potentially novel diagnostic genes for each subtype with specificity 93-100% (AUC = 0.93-1.00). Cancer subtypes were characterized by (a) differential expression of treatment target genes as TYMS, HER2, and HER3 and (b) overrepresentation of treatment-related pathways like cell cycle, DNA repair, and ERBB pathways. The vascular smooth muscle contraction, leukocyte trans-endothelial migration, and actin cytoskeleton pathways were overexpressed in normal tissue. Reanalysis of this public dataset displayed the known biological features of lung cancer subtypes and revealed novel pathways of potentially clinical importance. The findings also support our hypothesis that even old omics data of high quality can be a source of significant biological information when appropriate bioinformatics methods are used.

  3. Physiologically Shrinking the Solution Space of a Saccharomyces cerevisiae Genome-Scale Model Suggests the Role of the Metabolic Network in Shaping Gene Expression Noise.

    PubMed

    Chi, Baofang; Tao, Shiheng; Liu, Yanlin

    2015-01-01

    Sampling the solution space of genome-scale models is generally conducted to determine the feasible region for metabolic flux distribution. Because the region for actual metabolic states resides only in a small fraction of the entire space, it is necessary to shrink the solution space to improve the predictive power of a model. A common strategy is to constrain models by integrating extra datasets such as high-throughput datasets and C13-labeled flux datasets. However, studies refining these approaches by performing a meta-analysis of massive experimental metabolic flux measurements, which are closely linked to cellular phenotypes, are limited. In the present study, experimentally identified metabolic flux data from 96 published reports were systematically reviewed. Several strong associations among metabolic flux phenotypes were observed. These phenotype-phenotype associations at the flux level were quantified and integrated into a Saccharomyces cerevisiae genome-scale model as extra physiological constraints. By sampling the shrunken solution space of the model, the metabolic flux fluctuation level, which is an intrinsic trait of metabolic reactions determined by the network, was estimated and utilized to explore its relationship to gene expression noise. Although no correlation was observed in all enzyme-coding genes, a relationship between metabolic flux fluctuation and expression noise of genes associated with enzyme-dosage sensitive reactions was detected, suggesting that the metabolic network plays a role in shaping gene expression noise. Such correlation was mainly attributed to the genes corresponding to non-essential reactions, rather than essential ones. This was at least partially, due to regulations underlying the flux phenotype-phenotype associations. Altogether, this study proposes a new approach in shrinking the solution space of a genome-scale model, of which sampling provides new insights into gene expression noise.

  4. R-Spondins Are Expressed by the Intestinal Stroma and are Differentially Regulated during Citrobacter rodentium- and DSS-Induced Colitis in Mice

    PubMed Central

    Kang, Eugene; Yousefi, Mitra; Gruenheid, Samantha

    2016-01-01

    The R-spondin family of proteins has recently been described as secreted enhancers of β-catenin activation through the canonical Wnt signaling pathway. We previously reported that Rspo2 is a major determinant of susceptibility to Citrobacter rodentium-mediated colitis in mice and recent genome-wide association studies have revealed RSPO3 as a candidate Crohn’s disease-specific inflammatory bowel disease susceptibility gene in humans. However, there is little information on the endogenous expression and cellular source of R-spondins in the colon at steady state and during intestinal inflammation. RNA sequencing and qRT-PCR were used to assess the expression of R-spondins at steady state and in two mouse models of colonic inflammation. The cellular source of R-spondins was assessed in specific colonic cell populations isolated by cell sorting. Data mining from publicly available datasets was used to assess the expression of R-spondins in the human colon. At steady state, colonic expression of R-spondins was found to be exclusive to non-epithelial CD45- lamina propria cells, and Rspo3/RSPO3 was the most highly expressed R-spondin in both mouse and human colon. R-spondin expression was found to be highly dynamic and differentially regulated during C. rodentium infection and dextran sodium sulfate (DSS) colitis, with notably high levels of Rspo3 expression during DSS colitis, and high levels of Rspo2 expression during C. rodentium infection, specifically in susceptible mice. Our data are consistent with the hypothesis that in the colon, R-spondins are expressed by subepithelial stromal cells, and that Rspo3/RSPO3 is the family member most implicated in colonic homeostasis. 
    The differential regulation of the R-spondins in different models of intestinal inflammation indicates that they respond to specific pathogenic and inflammatory signals that differ in the two models, and provides further evidence that this family of proteins plays a key role in linking intestinal inflammation and homeostasis. PMID:27046199

  5. Near-Surface and High Resolution Seismic Imaging of the Bennett Thrust Fault in the Indio Mountains of West Texas

    NASA Astrophysics Data System (ADS)

    Vennemann, Alan

    My research investigates the structure of the Indio Mountains in southwest Texas, 34 kilometers southwest of Van Horn, at the UTEP (University of Texas at El Paso) Field Station using newly acquired active-source seismic data. The area is underlain by deformed Cretaceous sedimentary rocks that represent a transgressive sequence nearly 2 km in total stratigraphic thickness. The rocks were deposited in mid Cretaceous extensional basins and later contracted into fold-thrust structures during Laramide orogenesis. The stratigraphic sequence is an analog for similar areas that are ideal for pre-salt petroleum reservoirs, such as reservoirs off the coasts of Brazil and Angola (Li, 2014; Fox, 2016; Kattah, 2017). The 1-km-long 2-D shallow seismic reflection survey that I planned and led during May 2016 was the first at the UTEP Field Station, providing critical subsurface information that was previously lacking. The data were processed with Landmark ProMAX seismic processing software to create a seismic reflection image of the Bennett Thrust Fault and additional imbricate faulting not expressed at the surface. Along the 1-km line, reflection data were recorded with 200 4.5 Hz geophones, using 100 150-gram explosive charges and 490 sledge-hammer blows for sources. A seismic reflection profile was produced using the lower frequency explosive dataset, which was used in the identification of the Bennett Thrust Fault and additional faulting and folding in the subsurface. This dataset provides three possible interpretations for the subsurface geometries of the faulting and folding present. However, producing a seismic reflection image with the higher frequency sledge-hammer sourced dataset for interpretation proved more challenging. 
    While there are no petroleum plays in the Indio Mountains region, imaging and understanding subsurface structural and lithological geometries and how that geometry directs potential fluid flow has implications for other regions with petroleum plays.

  6. Precipitation variability of the Grand Canyon region, 1893 through 2009, and its implications for studying effects of gullying of Holocene terraces and associated archeological sites in Grand Canyon, Arizona

    USGS Publications Warehouse

    Hereford, Richard; Bennett, Glenn E.; Fairley, Helen C.

    2014-01-01

    A daily precipitation dataset covering a large part of the American Southwest was compiled for online electronic distribution (http://pubs.usgs.gov/of/2014/1006/). The dataset contains 10.8 million observations spanning January 1893 through January 2009 from 846 weather stations in six states and 13 climate divisions. In addition to processing the data for distribution, water-year totals and other statistical parameters were calculated for each station with more than 2 years of observations. Division-wide total precipitation, expressed as the average deviation from the individual station means of a climate division, shows that the region—including the Grand Canyon, Arizona, area—has been affected by alternating multidecadal episodes of drought and wet conditions. In addition to compiling and analyzing the long-term regional precipitation data, a second dataset consisting of high-temporal-resolution precipitation measurements collected between November 2003 and January 2009 from 10 localities along the Colorado River in Grand Canyon was compiled. An exploratory study of these high-temporal-resolution precipitation measurements suggests that on a daily basis precipitation patterns are generally similar to those at a long-term weather station in the canyon, which in turn resembles the patterns at other long-term stations on the canyon rims; however, precipitation amounts recorded by the individual inner canyon weather stations can vary substantially from station to station. Daily and seasonal rainfall patterns apparent in these data are not random. For example, the inner canyon record, although short and fragmented, reveals three episodes of widespread, heavy precipitation in late summer 2004, early winter 2005, and summer 2007. The 2004 event and several others had sufficient rainfall to initiate potentially pervasive erosion of the late Holocene terraces and related archeological features located along the Colorado River in Grand Canyon.
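    The two summary statistics named in this abstract (water-year totals per station, and a division-wide value expressed as the average deviation from individual station means) can be sketched as follows. This is only an illustration under stated assumptions, not the report's own procedure: the abstract does not define the water year, so the sketch assumes the common convention of 1 October through 30 September, labelled by the year in which it ends. All function names and the input layout are hypothetical.

```python
from collections import defaultdict
from datetime import date


def water_year(day: date) -> int:
    # Assumed convention: a water year runs 1 Oct - 30 Sep and is
    # labelled by the calendar year in which it ends.
    return day.year + 1 if day.month >= 10 else day.year


def water_year_totals(observations):
    # observations: iterable of (date, precipitation_amount) for one station.
    totals = defaultdict(float)
    for day, amount in observations:
        totals[water_year(day)] += amount
    return dict(totals)


def division_deviation(station_totals):
    # station_totals: {station_id: {water_year: total}} for one climate division.
    # For each station, subtract that station's own mean from each yearly
    # total, then average the deviations across stations year by year.
    deviations = defaultdict(list)
    for totals in station_totals.values():
        station_mean = sum(totals.values()) / len(totals)
        for year, total in totals.items():
            deviations[year].append(total - station_mean)
    return {year: sum(vals) / len(vals) for year, vals in deviations.items()}
```

    Working with deviations from each station's own mean keeps wet and dry stations comparable within a division, which is presumably why the abstract expresses the division-wide value that way.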

  7. Cell Specific eQTL Analysis without Sorting Cells

    PubMed Central

    Esko, Tõnu; Peters, Marjolein J.; Schurmann, Claudia; Schramm, Katharina; Kettunen, Johannes; Yaghootkar, Hanieh; Fairfax, Benjamin P.; Andiappan, Anand Kumar; Li, Yang; Fu, Jingyuan; Karjalainen, Juha; Platteel, Mathieu; Visschedijk, Marijn; Weersma, Rinse K.; Kasela, Silva; Milani, Lili; Tserel, Liina; Peterson, Pärt; Reinmaa, Eva; Hofman, Albert; Uitterlinden, André G.; Rivadeneira, Fernando; Homuth, Georg; Petersmann, Astrid; Lorbeer, Roberto; Prokisch, Holger; Meitinger, Thomas; Herder, Christian; Roden, Michael; Grallert, Harald; Ripatti, Samuli; Perola, Markus; Wood, Andrew R.; Melzer, David; Ferrucci, Luigi; Singleton, Andrew B.; Hernandez, Dena G.; Knight, Julian C.; Melchiotti, Rossella; Lee, Bernett; Poidinger, Michael; Zolezzi, Francesca; Larbi, Anis; Wang, De Yun; van den Berg, Leonard H.; Veldink, Jan H.; Rotzschke, Olaf; Makino, Seiko; Salomaa, Veikko; Strauch, Konstantin; Völker, Uwe; van Meurs, Joyce B. J.; Metspalu, Andres; Wijmenga, Cisca; Jansen, Ritsert C.; Franke, Lude

    2015-01-01

    The functional consequences of trait associated SNPs are often investigated using expression quantitative trait locus (eQTL) mapping. While trait-associated variants may operate in a cell-type specific manner, eQTL datasets for such cell-types may not always be available. We performed a genome-environment interaction (GxE) meta-analysis on data from 5,683 samples to infer the cell type specificity of whole blood cis-eQTLs. We demonstrate that this method is able to predict neutrophil and lymphocyte specific cis-eQTLs and replicate these predictions in independent cell-type specific datasets. Finally, we show that SNPs associated with Crohn’s disease preferentially affect gene expression within neutrophils, including the archetypal NOD2 locus. PMID:25955312

  8. Screening and expression of selected taxonomically conserved and unique hypothetical proteins in Burkholderia pseudomallei K96243

    NASA Astrophysics Data System (ADS)

    Akhir, Nor Azurah Mat; Nadzirin, Nurul; Mohamed, Rahmah; Firdaus-Raih, Mohd

    2015-09-01

    Hypothetical proteins of bacterial pathogens represent a large number of novel biological mechanisms which could belong to essential pathways in the bacteria. They lack functional characterizations mainly due to the inability of sequence homology based methods to detect functional relationships in the absence of detectable sequence similarity. The dataset derived from this study showed 550 candidates conserved in genomes that have pathogenicity information and present only in the Burkholderiales order. The dataset has been narrowed down to taxonomic clusters. Ten proteins were selected for ORF amplification; seven of them were successfully amplified, and only four proteins were successfully expressed. These proteins will be great candidates for determining the true function via structural biology.

  9. Bayesian correlated clustering to integrate multiple datasets

    PubMed Central

    Kirk, Paul; Griffin, Jim E.; Savage, Richard S.; Ghahramani, Zoubin; Wild, David L.

    2012-01-01

    Motivation: The integration of multiple datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct—but often complementary—information. We present a Bayesian method for the unsupervised integrative modelling of multiple datasets, which we refer to as MDI (Multiple Dataset Integration). MDI can integrate information from a wide range of different datasets and data types simultaneously (including the ability to model time series data explicitly using Gaussian processes). Each dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture model, with dependencies between these models captured through parameters that describe the agreement among the datasets. Results: Using a set of six artificially constructed time series datasets, we show that MDI is able to integrate a significant number of datasets simultaneously, and that it successfully captures the underlying structural similarity between the datasets. We also analyse a variety of real Saccharomyces cerevisiae datasets. In the two-dataset case, we show that MDI’s performance is comparable with the present state-of-the-art. We then move beyond the capabilities of current approaches and integrate gene expression, chromatin immunoprecipitation–chip and protein–protein interaction data, to identify a set of protein complexes for which genes are co-regulated during the cell cycle. Comparisons to other unsupervised data integration techniques—as well as to non-integrative approaches—demonstrate that MDI is competitive, while also providing information that would be difficult or impossible to extract using other methods. Availability: A Matlab implementation of MDI is available from http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software/. Contact: D.L.Wild@warwick.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online. PMID:23047558

  10. Array data extractor (ADE): a LabVIEW program to extract and merge gene array data

    PubMed Central

    2013-01-01

    Background Large data sets from gene expression array studies are publicly available offering information highly valuable for research across many disciplines ranging from fundamental to clinical research. Highly advanced bioinformatics tools have been made available to researchers, but a demand for user-friendly software allowing researchers to quickly extract expression information for multiple genes from multiple studies persists. Findings Here, we present a user-friendly LabVIEW program to automatically extract gene expression data for a list of genes from multiple normalized microarray datasets. Functionality was tested for 288 class A G protein-coupled receptors (GPCRs) and expression data from 12 studies comparing normal and diseased human hearts. Results confirmed known regulation of a beta 1 adrenergic receptor and further indicate novel research targets. Conclusions Although existing software allows for complex data analyses, the LabVIEW based program presented here, “Array Data Extractor (ADE)”, provides users with a tool to retrieve meaningful information from multiple normalized gene expression datasets in a fast and easy way. Further, the graphical programming language used in LabVIEW allows applying changes to the program without the need of advanced programming knowledge. PMID:24289243

  11. Quantitative comparison of microarray experiments with published leukemia related gene expression signatures.

    PubMed

    Klein, Hans-Ulrich; Ruckert, Christian; Kohlmann, Alexander; Bullinger, Lars; Thiede, Christian; Haferlach, Torsten; Dugas, Martin

    2009-12-15

    Multiple gene expression signatures derived from microarray experiments have been published in the field of leukemia research. A comparison of these signatures with results from new experiments is useful for verification as well as for interpretation of the results obtained. Currently, the percentage of overlapping genes is frequently used to compare published gene signatures against a signature derived from a new experiment. However, it has been shown that the percentage of overlapping genes is of limited use for comparing two experiments due to the variability of gene signatures caused by different array platforms or assay-specific influencing parameters. Here, we present a robust approach for a systematic and quantitative comparison of published gene expression signatures with an exemplary query dataset. A database storing 138 leukemia-related published gene signatures was designed. Each gene signature was manually annotated with terms according to a leukemia-specific taxonomy. Two analysis steps are implemented to compare a new microarray dataset with the results from previous experiments stored and curated in the database. First, the global test method is applied to assess gene signatures and to constitute a ranking among them. In a subsequent analysis step, the focus is shifted from single gene signatures to chromosomal aberrations or molecular mutations as modeled in the taxonomy. Potentially interesting disease characteristics are detected based on the ranking of gene signatures associated with these aberrations stored in the database. Two example analyses are presented. An implementation of the approach is freely available as web-based application. The presented approach helps researchers to systematically integrate the knowledge derived from numerous microarray experiments into the analysis of a new dataset. 
    By means of example leukemia datasets we demonstrate that this approach detects related experiments as well as related molecular mutations and may help to interpret new microarray data.
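
    The baseline measure that this abstract describes as being of limited use, the percentage of overlapping genes between a published signature and a query signature, can be sketched as follows. This is only an illustration of that simple overlap measure, not of the global test method the authors apply; the function name and the gene lists are hypothetical.

```python
def percent_overlap(query_signature, published_signature):
    """Percentage of genes in the published signature that also occur in the query signature."""
    # Compare gene symbols case-insensitively and ignore duplicates.
    query = {gene.upper() for gene in query_signature}
    published = {gene.upper() for gene in published_signature}
    if not published:
        raise ValueError("published signature is empty")
    return 100.0 * len(query & published) / len(published)


# Hypothetical gene lists, for illustration only.
query = ["FLT3", "NPM1", "CEBPA", "RUNX1"]
published = ["NPM1", "RUNX1", "KIT", "MYH11", "CBFB"]
overlap = percent_overlap(query, published)  # 2 of 5 published genes are shared
```

    Because such a percentage depends on which genes each platform can measure, it is easy to see why the abstract argues for a more robust comparison.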

  12. Who shares? Who doesn't? Factors associated with openly archiving raw research data.

    PubMed

    Piwowar, Heather A

    2011-01-01

    Many initiatives encourage investigators to share their raw datasets in hopes of increasing research efficiency and quality. Despite these investments of time and money, we do not have a firm grasp of who openly shares raw research data, who doesn't, and which initiatives are correlated with high rates of data sharing. In this analysis I use bibliometric methods to identify patterns in the frequency with which investigators openly archive their raw gene expression microarray datasets after study publication. Automated methods identified 11,603 articles published between 2000 and 2009 that describe the creation of gene expression microarray data. Associated datasets in best-practice repositories were found for 25% of these articles, increasing from less than 5% in 2001 to 30%-35% in 2007-2009. Accounting for sensitivity of the automated methods, approximately 45% of recent gene expression studies made their data publicly available. First-order factor analysis on 124 diverse bibliometric attributes of the data creation articles revealed 15 factors describing authorship, funding, institution, publication, and domain environments. In multivariate regression, authors were most likely to share data if they had prior experience sharing or reusing data, if their study was published in an open access journal or a journal with a relatively strong data sharing policy, or if the study was funded by a large number of NIH grants. Authors of studies on cancer and human subjects were least likely to make their datasets available. These results suggest research data sharing levels are still low and increasing only slowly, and data is least available in areas where it could make the biggest impact. Let's learn from those with high rates of sharing to embrace the full potential of our research output.

  13. CrossLink: a novel method for cross-condition classification of cancer subtypes.

    PubMed

    Ma, Chifeng; Sastry, Konduru S; Flore, Mario; Gehani, Salah; Al-Bozom, Issam; Feng, Yusheng; Serpedin, Erchin; Chouchane, Lotfi; Chen, Yidong; Huang, Yufei

    2016-08-22

    We considered the prediction of cancer classes (e.g. subtypes) using patient gene expression profiles that contain both systematic and condition-specific biases when compared with the training reference dataset. The conventional normalization-based approaches cannot guarantee that the gene signatures in the reference and prediction datasets always have the same distribution for all different conditions, as the class-specific gene signatures change with the condition. Therefore, the trained classifier would work well under one condition but not under another. To address the problem of current normalization approaches, we propose a novel algorithm called CrossLink (CL). CL recognizes that there is no universal, condition-independent normalization mapping of signatures. Instead, it exploits the fact that the signature is unique to its associated class under any condition and thus employs an unsupervised clustering algorithm to discover this unique signature. We assessed the performance of CL for cross-condition predictions of PAM50 subtypes of breast cancer by using a simulated dataset modeled after TCGA BRCA tumor samples with a cross-validation scheme, and datasets with known and unknown PAM50 classification. CL achieved prediction accuracy >73%, the highest among the methods we evaluated. We also applied the algorithm to a set of breast cancer tumors derived from an Arab population to assign a PAM50 classification to each tumor based on their gene expression profiles. A novel algorithm CrossLink for cross-condition prediction of cancer classes was proposed. In all test datasets, CL showed robust and consistent improvement in prediction performance over other state-of-the-art normalization and classification algorithms.

  14. EBprot: Statistical analysis of labeling-based quantitative proteomics data.

    PubMed

    Koh, Hiromi W L; Swa, Hannah L F; Fermin, Damian; Ler, Siok Ghee; Gunaratne, Jayantha; Choi, Hyungwon

    2015-08-01

    Labeling-based proteomics is a powerful method for detection of differentially expressed proteins (DEPs). The current data analysis platform typically relies on protein-level ratios, which are obtained by summarizing peptide-level ratios for each protein. In shotgun proteomics, however, some proteins are quantified with more peptides than others, and this reproducibility information is not incorporated into the differential expression (DE) analysis. Here, we propose a novel probabilistic framework EBprot that directly models the peptide-protein hierarchy and rewards the proteins with reproducible evidence of DE over multiple peptides. To evaluate its performance with known DE states, we conducted a simulation study to show that the peptide-level analysis of EBprot provides better receiver-operating characteristic and more accurate estimation of the false discovery rates than the methods based on protein-level ratios. We also demonstrate superior classification performance of peptide-level EBprot analysis in a spike-in dataset. To illustrate the wide applicability of EBprot in different experimental designs, we applied EBprot to a dataset for lung cancer subtype analysis with biological replicates and another dataset for time course phosphoproteome analysis of EGF-stimulated HeLa cells with multiplexed labeling. Through these examples, we show that the peptide-level analysis of EBprot is a robust alternative to the existing statistical methods for the DE analysis of labeling-based quantitative datasets. The software suite is freely available on the Sourceforge website http://ebprot.sourceforge.net/. All MS data have been deposited in the ProteomeXchange with identifier PXD001426 (http://proteomecentral.proteomexchange.org/dataset/PXD001426/). © 2015 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
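
    The protein-level summarization that this abstract contrasts with EBprot can be sketched roughly as below. The sketch assumes the median of log2 peptide ratios as the summary statistic, which the abstract does not specify, and it is not the EBprot model itself; the function name and the data are hypothetical.

```python
from collections import defaultdict
from statistics import median


def protein_level_ratios(peptide_log_ratios):
    """Collapse (protein, log2 ratio) peptide records into {protein: (median ratio, n_peptides)}."""
    by_protein = defaultdict(list)
    for protein, log_ratio in peptide_log_ratios:
        by_protein[protein].append(log_ratio)
    # The median is an assumed summary; keeping the peptide count preserves
    # the reproducibility information that a bare ratio would discard.
    return {p: (median(values), len(values)) for p, values in by_protein.items()}


# Hypothetical peptide-level log2 ratios, for illustration only.
peptides = [("P1", 1.0), ("P1", 1.2), ("P1", 0.8), ("P2", 1.0)]
summary = protein_level_ratios(peptides)
```

    Here both proteins receive the same summarized ratio although one is supported by three peptides and the other by a single peptide, which is the kind of information the abstract says ratio-based analyses leave unused.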

  15. Pan- and core- network analysis of co-expression genes in a model plant

    DOE PAGES

    He, Fei; Maslov, Sergei

    2016-12-16

    Genome-wide gene expression experiments have been performed using the model plant Arabidopsis during the last decade. Some studies involved construction of coexpression networks, a popular technique used to identify groups of co-regulated genes, to infer unknown gene functions. One approach is to construct a single coexpression network by combining multiple expression datasets generated in different labs. We advocate a complementary approach in which we construct a large collection of 134 coexpression networks based on expression datasets reported in individual publications. To this end we reanalyzed public expression data. To describe this collection of networks we introduced concepts of ‘pan-network’ and ‘core-network’ representing union and intersection between a sizeable fraction of individual networks, respectively. Here, we showed that these two types of networks are different both in terms of their topology and biological function of interacting genes. For example, the modules of the pan-network are enriched in regulatory and signaling functions, while the modules of the core-network tend to include components of large macromolecular complexes such as ribosomes and photosynthetic machinery. Our analysis is aimed to help the plant research community to better explore the information contained within the existing vast collection of gene expression data in Arabidopsis.
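
    One possible reading of the pan-network and core-network idea is to count how often each edge appears across the individual networks and keep edges above a frequency threshold. The sketch below follows that reading; the thresholds, the function name and the toy networks are assumptions for illustration, and the abstract does not state the exact definition the authors used.

```python
from collections import Counter


def pan_and_core(networks, pan_fraction, core_fraction):
    """Return (pan, core) edge sets from a list of undirected edge lists.

    An edge enters the pan set if it occurs in at least `pan_fraction` of the
    networks and the core set if it occurs in at least `core_fraction`.
    Both thresholds are hypothetical parameters.
    """
    counts = Counter()
    for edges in networks:
        # frozenset makes ("A", "B") and ("B", "A") the same undirected edge;
        # the set comprehension counts each edge once per network.
        counts.update({frozenset(edge) for edge in edges})
    total = len(networks)
    pan = {edge for edge, c in counts.items() if c / total >= pan_fraction}
    core = {edge for edge, c in counts.items() if c / total >= core_fraction}
    return pan, core


# Four toy networks, for illustration only.
networks = [
    [("A", "B"), ("B", "C")],
    [("B", "A"), ("C", "D")],
    [("A", "B"), ("B", "C")],
    [("A", "B")],
]
pan, core = pan_and_core(networks, pan_fraction=0.25, core_fraction=0.75)
```

    With a low pan threshold the result approaches the union of all networks, and with a high core threshold it approaches their intersection.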

  16. Pan- and core- network analysis of co-expression genes in a model plant

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    He, Fei; Maslov, Sergei

    Genome-wide gene expression experiments have been performed using the model plant Arabidopsis during the last decade. Some studies involved construction of coexpression networks, a popular technique used to identify groups of co-regulated genes, to infer unknown gene functions. One approach is to construct a single coexpression network by combining multiple expression datasets generated in different labs. We advocate a complementary approach in which we construct a large collection of 134 coexpression networks based on expression datasets reported in individual publications. To this end we reanalyzed public expression data. To describe this collection of networks we introduced concepts of ‘pan-network’ and ‘core-network’ representing union and intersection between a sizeable fraction of individual networks, respectively. Here, we showed that these two types of networks are different both in terms of their topology and biological function of interacting genes. For example, the modules of the pan-network are enriched in regulatory and signaling functions, while the modules of the core-network tend to include components of large macromolecular complexes such as ribosomes and photosynthetic machinery. Our analysis is aimed to help the plant research community to better explore the information contained within the existing vast collection of gene expression data in Arabidopsis.

  17. Microarray-based characterization of differential gene expression during vocal fold wound healing in rats

    PubMed Central

    Welham, Nathan V.; Ling, Changying; Dawson, John A.; Kendziorski, Christina; Thibeault, Susan L.; Yamashita, Masaru

    2015-01-01

    The vocal fold (VF) mucosa confers elegant biomechanical function for voice production but is susceptible to scar formation following injury. Current understanding of VF wound healing is hindered by a paucity of data and is therefore often generalized from research conducted in skin and other mucosal systems. Here, using a previously validated rat injury model, expression microarray technology and an empirical Bayes analysis approach, we generated a VF-specific transcriptome dataset to better capture the system-level complexity of wound healing in this specialized tissue. We measured differential gene expression at 3, 14 and 60 days post-injury compared to experimentally naïve controls, pursued functional enrichment analyses to refine and add greater biological definition to the previously proposed temporal phases of VF wound healing, and validated the expression and localization of a subset of previously unidentified repair- and regeneration-related genes at the protein level. Our microarray dataset is a resource for the wider research community and has the potential to stimulate new hypotheses and avenues of investigation, improve biological and mechanistic insight, and accelerate the identification of novel therapeutic targets. PMID:25592437

  18. Dataset on transcriptional profiles and the developmental characteristics of PDGFRα expressing lung fibroblasts.

    PubMed

    Endale, Mehari; Ahlfeld, Shawn; Bao, Erik; Chen, Xiaoting; Green, Jenna; Bess, Zach; Weirauch, Matthew; Xu, Yan; Perl, Anne Karina

    2017-08-01

    The following data are derived from key stages of acinar lung development and define the developmental role of lung interstitial fibroblasts expressing platelet-derived growth factor receptor alpha (PDGFRα). This dataset is related to the research article entitled "Temporal, spatial, and phenotypical changes of PDGFRα expressing fibroblasts during late lung development" (Endale et al., 2017) [1]. At E16.5 (canalicular), E18.5 (saccular), P7 (early alveolar) and P28 (late alveolar), PDGFRα GFP mice, in conjunction with immunohistochemical markers, were utilized to define the spatiotemporal relationship of PDGFRα+ fibroblasts to endothelial, stromal and epithelial cells in both the proximal and distal acinar lung. Complementary analysis with flow cytometry was employed to determine changes in cellular proliferation, define lipofibroblast and myofibroblast populations via the presence of intracellular lipid or alpha smooth muscle actin (αSMA), and evaluate the expression of CD34, CD29, and Sca-1. Finally, PDGFRα+ cells isolated at each stage of acinar lung development were subjected to RNA-Seq analysis, and the data were subjected to Bayesian timeline analysis and transcription factor promoter enrichment analysis.

  19. Efficient Spatio-Temporal Local Binary Patterns for Spontaneous Facial Micro-Expression Recognition

    PubMed Central

    Wang, Yandan; See, John; Phan, Raphael C.-W.; Oh, Yee-Hui

    2015-01-01

    Micro-expression recognition is still in the preliminary stage, owing much to the numerous difficulties faced in the development of datasets. Since micro-expression is an important affective clue for clinical diagnosis and deceit analysis, much effort has gone into the creation of these datasets for research purposes. There are currently two publicly available spontaneous micro-expression datasets, SMIC and CASME II, both with baseline results released using the widely used dynamic texture descriptor LBP-TOP for feature extraction. Although LBP-TOP is popular and widely used, it is still not compact enough. In this paper, we draw further inspiration from the concept of LBP-TOP that considers three orthogonal planes by proposing two efficient approaches for feature extraction. The compact robust form described by the proposed LBP-Six Intersection Points (SIP) and a super-compact LBP-Three Mean Orthogonal Planes (MOP) not only preserves the essential patterns, but also reduces the redundancy that affects the discriminability of the encoded features. Through a comprehensive set of experiments, we demonstrate the strengths of our approaches in terms of recognition accuracy and efficiency. PMID:25993498

  20. In with the new, out with the old? Auto-extraction for remote sensing archaeology

    NASA Astrophysics Data System (ADS)

    Cowley, David C.

    2012-09-01

    This paper explores aspects of the inter-relationships between traditional archaeological interpretation of remotely sensed data (principally visual examination of aerial photographs and satellite imagery) and those drawing on automated feature extraction and processing. Established approaches to archaeological interpretation of aerial photographs are heavily reliant on individual observation (eye/brain) in an experience and knowledge-based process. Increasingly, however, much more complex and extensive datasets are becoming available to archaeology and these require critical reflection on analytical and interpretative processes. Archaeological applications of Airborne Laser Scanning (ALS) are becoming increasingly routine, and as the spatial resolution of hyper-spectral data improves, its potentially massive implications for archaeological site detection may prove to be a sea-change. These complex datasets demand new approaches, as traditional methods based on direct observation by an archaeological interpreter will never do more than scratch the surface, and will fail to fully extend the boundaries of knowledge. Inevitably, changing analytical and interpretative processes can create tensions, especially, as has been the case in archaeology, when the innovations in data and analysis come from outside the discipline. These tensions often centre on the character of the information produced, and a lack of clarity on the place of archaeological interpretation in the workflow. This is especially true for ALS data and autoextraction techniques, and carries implications for all forms of remotely sensed archaeological datasets, including hyperspectral data and aerial photographs.

  1. Gene Expression Analysis to Assess the Relevance of Rodent Models to Human Lung Injury.

    PubMed

    Sweeney, Timothy E; Lofgren, Shane; Khatri, Purvesh; Rogers, Angela J

    2017-08-01

    The relevance of animal models to human diseases is an area of intense scientific debate. The degree to which mouse models of lung injury recapitulate human lung injury has never been assessed. Integrating data from both human and animal expression studies allows for increased statistical power and identification of conserved differential gene expression across organisms and conditions. We sought comprehensive integration of gene expression data in experimental acute lung injury (ALI) in rodents compared with humans. We performed two separate gene expression multicohort analyses to determine differential gene expression in experimental animal and human lung injury. We used correlational and pathway analyses combined with external in vitro gene expression data to identify both potential drivers of underlying inflammation and therapeutic drug candidates. We identified 21 animal lung tissue datasets and three human lung injury bronchoalveolar lavage datasets. We show that the metasignatures of animal and human experimental ALI are significantly correlated despite these widely varying experimental conditions. The gene expression changes among mice and rats across diverse injury models (ozone, ventilator-induced lung injury, LPS) are significantly correlated with human models of lung injury (Pearson r = 0.33-0.45, P < 1E-16). Neutrophil signatures are enriched in both animal and human lung injury. Predicted therapeutic targets, peptide ligand signatures, and pathway analyses are also all highly overlapping. Gene expression changes are similar in animal and human experimental ALI, and provide several physiologic and therapeutic insights to the disease.
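
    The Pearson correlation reported in this abstract compares gene expression changes between animal and human lung injury. A minimal sketch of that statistic, assuming one effect size per shared gene in each species, is shown below; the function name and the effect sizes are hypothetical and do not reproduce the study's data or its multicohort pipeline.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / (spread_x * spread_y)


# Hypothetical per-gene effect sizes for the same genes in two species.
animal_effects = [1.0, 2.0, 3.0, 4.0]
human_effects = [2.0, 4.0, 6.0, 8.0]
r = pearson_r(animal_effects, human_effects)
```

    Real effect sizes would give values well below the perfect correlation of this toy input, in line with the range quoted in the abstract.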

  2. Predictive Models of Cognitive Outcomes of Developmental Insults

    NASA Astrophysics Data System (ADS)

    Chan, Yupo; Bouaynaya, Nidhal; Chowdhury, Parimal; Leszczynska, Danuta; Patterson, Tucker A.; Tarasenko, Olga

    2010-04-01

    Representatives of Arkansas medical, research and educational institutions have gathered over the past four years to discuss the relationship between functional developmental perturbations and their neurological consequences. We wish to track the effects of developmental perturbations on the nervous system over time and across species. Except for perturbations, the sequence of events that occur during neural development was found to be remarkably conserved across mammalian species. The tracking includes consequences on anatomical regions and behavioral changes. The ultimate goal is to develop a predictive model of long-term genotypic and phenotypic outcomes that includes developmental insults. Such a model can subsequently be fostered into an educated intervention for therapeutic purposes. Several datasets were identified to test plausible hypotheses, ranging from evoked potential datasets to sleep-disorder datasets. An initial model may be mathematical and conceptual. However, we expect to see rapid progress as large-scale gene expression studies in the mammalian brain permit genome-wide searches to discover genes that are uniquely expressed in brain circuits and regions. These genes ultimately control behavior. By using a validated model we endeavor to make useful predictions.

  3. Assessment of a novel multi-array normalization method based on spike-in control probes suitable for microRNA datasets with global decreases in expression.

    PubMed

    Sewer, Alain; Gubian, Sylvain; Kogel, Ulrike; Veljkovic, Emilija; Han, Wanjiang; Hengstermann, Arnd; Peitsch, Manuel C; Hoeng, Julia

    2014-05-17

    High-quality expression data are required to investigate the biological effects of microRNAs (miRNAs). The goal of this study was, first, to assess the quality of miRNA expression data based on microarray technologies and, second, to consolidate it by applying a novel normalization method. Indeed, because of significant differences in platform designs, miRNA raw data cannot be normalized blindly with standard methods developed for gene expression. This fundamental observation motivated the development of a novel multi-array normalization method based on controllable assumptions, which uses the spike-in control probes to adjust the measured intensities across arrays. Raw expression data were obtained with the Exiqon dual-channel miRCURY LNA™ platform in the "common reference design" and processed as "pseudo-single-channel". They were used to apply several quality metrics based on the coefficient of variation and to test the novel spike-in controls based normalization method. Most of the considerations presented here could be applied to raw data obtained with other platforms. To assess the normalization method, it was compared with 13 other available approaches from both data quality and biological outcome perspectives. The results showed that the novel multi-array normalization method reduced the data variability in the most consistent way. Further, the reliability of the obtained differential expression values was confirmed based on a quantitative reverse transcription-polymerase chain reaction experiment performed for a subset of miRNAs. The results reported here support the applicability of the novel normalization method, in particular to datasets that display global decreases in miRNA expression similarly to the cigarette smoke-exposed mouse lung dataset considered in this study. 
    Quality metrics to assess between-array variability were used to confirm that the novel spike-in controls based normalization method provided high-quality miRNA expression data suitable for reliable downstream analysis. The multi-array miRNA raw data normalization method was implemented in an R software package called ExiMiR and deposited in the Bioconductor repository.
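
    The abstract mentions quality metrics based on the coefficient of variation. A minimal sketch of that quantity for one spike-in control probe measured across several arrays is given below. It is written in Python rather than R and does not use or reproduce the ExiMiR package; the function name and the intensities are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean of the values."""
    average = mean(values)
    if average == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return stdev(values) / average


# Hypothetical intensities of one spike-in probe on three arrays.
spike_in_intensities = [100.0, 110.0, 90.0]
cv = coefficient_of_variation(spike_in_intensities)  # sample SD 10 over mean 100
```

    A lower value after normalization than before would indicate reduced between-array variability for that probe.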

  4. Assessment of a novel multi-array normalization method based on spike-in control probes suitable for microRNA datasets with global decreases in expression

    PubMed Central

    2014-01-01

    Background High-quality expression data are required to investigate the biological effects of microRNAs (miRNAs). The goal of this study was, first, to assess the quality of miRNA expression data based on microarray technologies and, second, to consolidate it by applying a novel normalization method. Indeed, because of significant differences in platform designs, miRNA raw data cannot be normalized blindly with standard methods developed for gene expression. This fundamental observation motivated the development of a novel multi-array normalization method based on controllable assumptions, which uses the spike-in control probes to adjust the measured intensities across arrays. Results Raw expression data were obtained with the Exiqon dual-channel miRCURY LNA™ platform in the “common reference design” and processed as “pseudo-single-channel”. They were used to apply several quality metrics based on the coefficient of variation and to test the novel spike-in controls based normalization method. Most of the considerations presented here could be applied to raw data obtained with other platforms. To assess the normalization method, it was compared with 13 other available approaches from both data quality and biological outcome perspectives. The results showed that the novel multi-array normalization method reduced the data variability in the most consistent way. Further, the reliability of the obtained differential expression values was confirmed based on a quantitative reverse transcription–polymerase chain reaction experiment performed for a subset of miRNAs. The results reported here support the applicability of the novel normalization method, in particular to datasets that display global decreases in miRNA expression similarly to the cigarette smoke-exposed mouse lung dataset considered in this study. 
    Conclusions Quality metrics to assess between-array variability were used to confirm that the novel spike-in controls based normalization method provided high-quality miRNA expression data suitable for reliable downstream analysis. The multi-array miRNA raw data normalization method was implemented in an R software package called ExiMiR and deposited in the Bioconductor repository. PMID:24886675

  5. Structure and transcriptional regulation of the major intrinsic protein gene family in grapevine.

    PubMed

    Wong, Darren Chern Jan; Zhang, Li; Merlin, Isabelle; Castellarin, Simone D; Gambetta, Gregory A

    2018-04-11

    The major intrinsic protein (MIP) family is a family of proteins, including aquaporins, which facilitate water and small molecule transport across plasma membranes. In plants, MIPs function in a huge variety of processes including water transport, growth, stress response, and fruit development. In this study, we characterize the structure and transcriptional regulation of the MIP family in grapevine, describing the putative genome duplication events leading to the family structure and characterizing the family's tissue and developmental specific expression patterns across numerous preexisting microarray and RNAseq datasets. Gene co-expression network (GCN) analyses were carried out across these datasets and the promoters of each family member were analyzed for cis-regulatory element structure in order to provide insight into their transcriptional regulation. A total of 29 Vitis vinifera MIP family members (excluding putative pseudogenes) were identified of which all but two were mapped onto Vitis vinifera chromosomes. In this study, segmental duplication events were identified for five plasma membrane intrinsic protein (PIP) and four tonoplast intrinsic protein (TIP) genes, contributing to the expansion of PIPs and TIPs in grapevine. Grapevine MIP family members have distinct tissue and developmental expression patterns and hierarchical clustering revealed two primary groups regardless of the datasets analyzed. Composite microarray and RNA-seq gene co-expression networks (GCNs) highlighted the relationships between MIP genes and functional categories involved in cell wall modification and transport, as well as with other MIPs revealing a strong co-regulation within the family itself. Some duplicated MIP family members have undergone sub-functionalization and exhibit distinct expression patterns and GCNs. Cis-regulatory element (CRE) analyses of the MIP promoters and their associated GCN members revealed enrichment for numerous CREs including AP2/ERFs and NACs. 
    Combining phylogenetic analyses, gene expression profiling, gene co-expression network analyses, and cis-regulatory element enrichment, this study provides a comprehensive overview of the structure and transcriptional regulation of the grapevine MIP family. The study highlights the duplication and sub-functionalization of the family, its strong coordinated expression with genes involved in growth and transport, and the putative classes of TFs responsible for its regulation.

  6. Analysis of expression and prognostic significance of vimentin and the response to temozolomide in glioma patients.

    PubMed

    Lin, Lin; Wang, Guangzhi; Ming, Jianguang; Meng, Xiangqi; Han, Bo; Sun, Bo; Cai, Jinquan; Jiang, Chuanlu

    2016-11-01

    Gliomas are the most common primary intracranial malignant tumors in adults. Surgical resection followed by optional radiotherapy and chemotherapy is the current standard therapy for glioma patients. Vimentin, a protein of the intermediate filament family, could maintain cellular integrity and participate in several cell signaling pathways to modulate the motility and invasion of cancer cells. The purpose of the present research was to identify the relationship between vimentin expression and clinical characteristics and to assess the prognostic and predictive ability of vimentin in patients with glioma. To determine the expression of vimentin in glioma tissues, paraffin-embedded blocks from glioma patients who underwent surgical resection were obtained and evaluated by immunohistochemistry. To further investigate the association of vimentin expression with survival, we used vimentin mRNA expression data from the Chinese Glioma Genome Atlas (CGGA) and the GSE16011 dataset. Kaplan-Meier analysis and a Cox regression model were used for statistical analysis. We detected positive vimentin staining in 84% of high-grade compared to 47% of low-grade glioma patients. Additionally, vimentin mRNA expression was correlated with glioma grade in both the CGGA and GSE16011 datasets. Patients with low vimentin expression had longer survival than those with high expression. In multivariate analysis, vimentin was an independent significant prognostic factor for high-grade glioma patients. We also identified that glioblastoma patients with low vimentin expression had a better response to temozolomide therapy. Vimentin expression has a significant association with tumor grade and overall survival of high-grade glioma patients. Patients with low vimentin expression may benefit from temozolomide therapy.

  7. Systems toxicology identifies mechanistic impacts of 2-amino-4,6-dinitrotoluene (2A-DNT) exposure in Northern Bobwhite.

    PubMed

    Gust, Kurt A; Nanduri, Bindu; Rawat, Arun; Wilbanks, Mitchell S; Ang, Choo Yaw; Johnson, David R; Pendarvis, Ken; Chen, Xianfeng; Quinn, Michael J; Johnson, Mark S; Burgess, Shane C; Perkins, Edward J

    2015-08-07

    A systems toxicology investigation comparing and integrating transcriptomic and proteomic results was conducted to develop holistic effects characterizations for the wildlife bird model, Northern bobwhite (Colinus virginianus) dosed with the explosives degradation product 2-amino-4,6-dinitrotoluene (2A-DNT). A subchronic 60 d toxicology bioassay was leveraged where both sexes were dosed via daily gavage with 0, 3, 14, or 30 mg/kg-d 2A-DNT. Effects on global transcript expression were investigated in liver and kidney tissue using custom microarrays for C. virginianus in both sexes at all doses, while effects on proteome expression were investigated in liver for both sexes and kidney in males, at 30 mg/kg-d. As expected, transcript expression was not directly indicative of protein expression in response to 2A-DNT. However, a high degree of correspondence was observed among gene and protein expression when investigating higher-order functional responses including statistically enriched gene networks and canonical pathways, especially when connected to toxicological outcomes of 2A-DNT exposure. Analysis of networks statistically enriched for both transcripts and proteins demonstrated common responses including inhibition of programmed cell death and arrest of cell cycle in liver tissues at 2A-DNT doses that caused liver necrosis and death in females. Additionally, both transcript and protein expression in liver tissue was indicative of induced phase I and II xenobiotic metabolism potentially as a mechanism to detoxify and excrete 2A-DNT. Nuclear signaling assays, transcript expression and protein expression each implicated peroxisome proliferator-activated receptor (PPAR) nuclear signaling as a primary molecular target in the 2A-DNT exposure with significant downstream enrichment of PPAR-regulated pathways including lipid metabolic pathways and gluconeogenesis suggesting impaired bioenergetic potential. 
Although the differential expression of transcripts and proteins was largely unique, the consensus of functional pathways and gene networks enriched among transcriptomic and proteomic datasets provided the identification of many critical metabolic functions underlying 2A-DNT toxicity as well as impaired PPAR signaling, a key molecular initiating event known to be affected in di- and trinitrotoluene exposures.

  8. Novel candidate genes of the PARK7 interactome as mediators of apoptosis and acetylation in multiple sclerosis: An in silico analysis.

    PubMed

    Vavougios, George D; Zarogiannis, Sotirios G; Krogfelt, Karen Angeliki; Gourgoulianis, Konstantinos; Mitsikostas, Dimos Dimitrios; Hadjigeorgiou, Georgios

    2018-01-01

    Currently, only 4 studies have explored the potential role of PARK7's dysregulation in MS pathophysiology, and no study has evaluated the potential role of the PARK7 interactome in MS. The aim of our study was to assess the differential expression of PARK7 mRNA in peripheral blood mononuclear cells (PBMCs) donated from MS versus healthy patients using data mining techniques. The PARK7 interactome data from the GDS3920 profile were scrutinized for differentially expressed genes (DEGs); Gene Enrichment Analysis (GEA) was used to detect significantly enriched biological functions. 27 differentially expressed genes in the MS dataset were detected; 12 of these (NDUFA4, UBA2, TDP2, NPM1, NDUFS3, SUMO1, PIAS2, KIAA0101, RBBP4, NONO, RBBP7 and HSPA4) are reported for the first time in MS. Stepwise Linear Discriminant Function Analysis constructed a predictive model (Wilks' λ = 0.176, χ² = 45.204, p = 1.5275e-10) with 2 variables (TIDP2, RBBP4) that achieved 96.6% accuracy when discriminating between patients and controls. Gene Enrichment Analysis revealed that induction and regulation of programmed / intrinsic cell death represented the most salient Gene Ontology annotations. Cross-validation on systemic lupus erythematosus and ischemic stroke datasets revealed that these functions are unique to the MS dataset. Based on our results, novel potential target genes are revealed; these differentially expressed genes regulate epigenetic and apoptotic pathways that may further elucidate underlying mechanisms of autoreactivity in MS.

  9. Deregulation of polycomb repressor complex 1 modifier AUTS2 in T-cell leukemia.

    PubMed

    Nagel, Stefan; Pommerenke, Claudia; Meyer, Corinna; Kaufmann, Maren; Drexler, Hans G; MacLeod, Roderick A F

    2016-07-19

    Recently, we identified deregulated expression of the B-cell specific transcription factor MEF2C in T-cell acute lymphoid leukemia (T-ALL). Here, we performed sequence analysis of a regulatory upstream section of MEF2C in T-ALL cell lines which, however, proved devoid of mutations. Unexpectedly, we found strong conservation between the regulatory upstream region of MEF2C (located at chromosomal band 5q14) and an intergenic stretch at 7q11 located between STAG3L4 and AUTS2, covering nearly 20 kb. While the non-coding gene STAG3L4 was inconspicuously expressed, AUTS2 was aberrantly upregulated in 6% of T-ALL patients (public dataset GSE42038) and in 3/24 T-ALL cell lines, two of which represented very immature differentiation stages. AUTS2 expression was higher in normal B-cells than in T-cells, indicating lineage-specific activity in lymphopoiesis. While excluding chromosomal aberrations, examinations of AUTS2 transcriptional regulation in T-ALL cells revealed activation by IL7-IL7R-STAT5-signalling and MEF2C. AUTS2 protein has been shown to interact with polycomb repressor complex 1 subtype 5 (PRC1.5), transforming this particular complex into an activator. Accordingly, expression profiling and functional analyses demonstrated that AUTS2 activated while PCGF5 repressed transcription of NKL homeobox gene MSX1 in T-ALL cells. Forced expression and pharmacological inhibition of EZH2 in addition to H3K27me3 analysis indicated that PRC2 repressed MSX1 as well. Taken together, we found that AUTS2 and MEF2C, despite lying on different chromosomes, share strikingly similar regulatory upstream regions and aberrant expression in T-ALL subsets. Our data implicate chromatin complexes PRC1/AUTS2 and PRC2 in a gene network in T-ALL regulating early lymphoid differentiation.

  10. Integrative analysis of RUNX1 downstream pathways and target genes

    PubMed Central

    Michaud, Joëlle; Simpson, Ken M; Escher, Robert; Buchet-Poyau, Karine; Beissbarth, Tim; Carmichael, Catherine; Ritchie, Matthew E; Schütz, Frédéric; Cannon, Ping; Liu, Marjorie; Shen, Xiaofeng; Ito, Yoshiaki; Raskind, Wendy H; Horwitz, Marshall S; Osato, Motomi; Turner, David R; Speed, Terence P; Kavallaris, Maria; Smyth, Gordon K; Scott, Hamish S

    2008-01-01

    Background The RUNX1 transcription factor gene is frequently mutated in sporadic myeloid and lymphoid leukemia through translocation, point mutation or amplification. It is also responsible for a familial platelet disorder with predisposition to acute myeloid leukemia (FPD-AML). The disruption of the largely unknown biological pathways controlled by RUNX1 is likely to be responsible for the development of leukemia. We have used multiple microarray platforms and bioinformatic techniques to help identify these biological pathways to aid in the understanding of why RUNX1 mutations lead to leukemia. Results Here we report genes regulated either directly or indirectly by RUNX1 based on the study of gene expression profiles generated from 3 different human and mouse platforms. The platforms used were global gene expression profiling of: 1) cell lines with RUNX1 mutations from FPD-AML patients, 2) over-expression of RUNX1 and CBFβ, and 3) Runx1 knockout mouse embryos using either cDNA or Affymetrix microarrays. We observe that our datasets (lists of differentially expressed genes) significantly correlate with published microarray data from sporadic AML patients with mutations in either RUNX1 or its cofactor, CBFβ. A number of biological processes were identified among the differentially expressed genes and functional assays suggest that heterozygous RUNX1 point mutations in patients with FPD-AML impair cell proliferation, microtubule dynamics and possibly genetic stability. In addition, analysis of the regulatory regions of the differentially expressed genes has for the first time systematically identified numerous potential novel RUNX1 target genes. Conclusion This work is the first large-scale study attempting to identify the genetic networks regulated by RUNX1, a master regulator in the development of the hematopoietic system and leukemia. 
The biological pathways and target genes controlled by RUNX1 will have considerable importance in disease progression in both familial and sporadic leukemia as well as therapeutic implications. PMID:18671852

  11. A genome-wide analysis of the flax (Linum usitatissimum L.) dirigent protein family: from gene identification and evolution to differential regulation

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Corbin, Cyrielle; Drouet, Samantha; Markulin, Lucija

    Identification of DIR encoding genes in flax genome. Analysis of phylogeny, gene/protein structures and evolution. Identification of new conserved motifs linked to biochemical functions. Investigation of spatio-temporal gene expression and response to stress. Dirigent proteins (DIRs) were discovered during 8-8' lignan biosynthesis studies, through identification of stereoselective coupling to afford either (+)- or (-)-pinoresinols from E-coniferyl alcohol. DIRs are also involved or potentially involved in terpenoid, allyl/propenyl phenol lignan, pterocarpan and lignin biosynthesis. DIRs have very large multigene families in different vascular plants including flax, with most still of unknown function. DIR studies typically focus on a small subset of genes and identification of biochemical/physiological functions. Herein, a genome-wide analysis and characterization of the predicted flax DIR 44-membered multigene family was performed, this species being a rich natural grain source of 8-8' linked secoisolariciresinol-derived lignan oligomers. All predicted DIR sequences, including their promoters, were analyzed together with their public gene expression datasets. Expression patterns of selected DIRs were examined using qPCR, as well as through clustering analysis of DIR gene expression. These analyses further implicated roles for specific DIRs in (-)-pinoresinol formation in seed-coats, as well as (+)-pinoresinol in vegetative organs and/or specific responses to stress. Phylogeny and gene expression analysis segregated flax DIRs into six distinct clusters with new cluster-specific motifs identified. We propose that these findings can serve as a foundation to further systematically determine functions of DIRs, i.e. other than those already known in lignan biosynthesis in flax and other species.
Given the differential expression profiles and inducibility of the flax DIR family, we provisionally propose that some DIR genes of unknown function could be involved in different aspects of secondary cell wall biosynthesis and plant defense.

  12. A genome-wide analysis of the flax (Linum usitatissimum L.) dirigent protein family: from gene identification and evolution to differential regulation.

    PubMed

    Corbin, Cyrielle; Drouet, Samantha; Markulin, Lucija; Auguin, Daniel; Lainé, Éric; Davin, Laurence B; Cort, John R; Lewis, Norman G; Hano, Christophe

    2018-05-01

    Identification of DIR encoding genes in flax genome. Analysis of phylogeny, gene/protein structures and evolution. Identification of new conserved motifs linked to biochemical functions. Investigation of spatio-temporal gene expression and response to stress. Dirigent proteins (DIRs) were discovered during 8-8' lignan biosynthesis studies, through identification of stereoselective coupling to afford either (+)- or (-)-pinoresinols from E-coniferyl alcohol. DIRs are also involved or potentially involved in terpenoid, allyl/propenyl phenol lignan, pterocarpan and lignin biosynthesis. DIRs have very large multigene families in different vascular plants including flax, with most still of unknown function. DIR studies typically focus on a small subset of genes and identification of biochemical/physiological functions. Herein, a genome-wide analysis and characterization of the predicted flax DIR 44-membered multigene family was performed, this species being a rich natural grain source of 8-8' linked secoisolariciresinol-derived lignan oligomers. All predicted DIR sequences, including their promoters, were analyzed together with their public gene expression datasets. Expression patterns of selected DIRs were examined using qPCR, as well as through clustering analysis of DIR gene expression. These analyses further implicated roles for specific DIRs in (-)-pinoresinol formation in seed-coats, as well as (+)-pinoresinol in vegetative organs and/or specific responses to stress. Phylogeny and gene expression analysis segregated flax DIRs into six distinct clusters with new cluster-specific motifs identified. We propose that these findings can serve as a foundation to further systematically determine functions of DIRs, i.e. other than those already known in lignan biosynthesis in flax and other species. 
Given the differential expression profiles and inducibility of the flax DIR family, we provisionally propose that some DIR genes of unknown function could be involved in different aspects of secondary cell wall biosynthesis and plant defense.

  13. Dynamic regulation of genetic pathways and targets during aging in Caenorhabditis elegans.

    PubMed

    He, Kan; Zhou, Tao; Shao, Jiaofang; Ren, Xiaoliang; Zhao, Zhongying; Liu, Dahai

    2014-03-01

    Numerous genetic targets and some individual pathways associated with aging have been identified using the worm model. However, less is known about the genome-wide genetic mechanisms of aging, particularly at the level of multiple pathways and the regulatory networks active during aging. Here, we used gene expression datasets from three time points during aging in Caenorhabditis elegans (C. elegans) and performed gene set enrichment analysis (GSEA) on each dataset between adjacent stages. As a result, multiple genetic pathways and targets were identified as significantly down- or up-regulated. Among them, 5 truly aging-dependent signaling pathways (the MAPK, mTOR, Wnt, TGF-beta and ErbB signaling pathways) as well as 12 significantly associated genes were identified with dynamic expression patterns during aging. On the other hand, continued declines in the regulation of several metabolic pathways were shown to be age-related. Furthermore, regulatory networks reconstructed from three aging-related chromatin immunoprecipitation followed by sequencing (ChIP-seq) datasets and the expression matrices of the 154 genes involved in the above signaling pathways provide new insights into aging at the multiple-pathway level. The combination of multiple genetic pathways and targets needs to be taken into consideration in future studies of aging, in which the dynamic regulation would be uncovered.

  14. KCNN4 and S100A14 act as predictors of recurrence in optimally debulked patients with serous ovarian cancer

    PubMed Central

    Hu, Ting; Sun, Qian; Wu, Jianli; Lin, Xingguang; Luo, Danfeng; Sun, Chaoyang; Wang, Changyu; Zhou, Bo; Li, Na; Xia, Meng; Lu, Hao; Meng, Li; Xu, Xiaoyan; Hu, Junbo; Ma, Ding; Chen, Gang; Zhu, Tao

    2016-01-01

    Approximately 50-75% of patients with serous ovarian carcinoma (SOC) experience recurrence within 18 months after first-line treatment. Current clinical indicators are inadequate for predicting the risk of recurrence. In this study, we used 7 publicly available microarray datasets to identify gene signatures related to recurrence in optimally debulked SOC patients, and validated their expression in an independent clinical cohort of 127 patients using immunohistochemistry (IHC). We identified a two-gene signature including KCNN4 and S100A14 which was related to recurrence in optimally debulked SOC patients. Their mRNA expression levels were positively correlated and regulated by DNA copy number alterations (CNA) (KCNN4: p=1.918e-05) and DNA promoter methylation (KCNN4: p=0.0179; S100A14: p=2.787e-13). Recurrence prediction models built in the TCGA dataset based on KCNN4 and S100A14 individually and in combination showed good prediction performance in the other 6 datasets (AUC: 0.5442-0.9524). The independent cohort supported the expression difference between SOC recurrences. Also, a KCNN4 and S100A14-centered protein interaction subnetwork was built from the STRING database, and the shortest regulation path between them, called the KCNN4-UBA52-KLF4-S100A14 axis, was identified. This discovery might facilitate individualized treatment of SOC. PMID:27270322
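
    The AUC values reported in the record above summarize how well a score separates recurrent from non-recurrent cases. As a rough sketch (not the authors' pipeline; the function name `auc` and the toy scores are assumptions for illustration), the AUC can be computed as the fraction of positive/negative pairs that the score ranks correctly:

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores : predicted risk scores (higher = more likely positive)
    labels : 1 for positive cases (e.g. recurrence), 0 for negatives
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy scores: three of the four positive/negative pairs
# are ordered correctly.
print(auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]))
```

    An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is why the lower end of the reported range (0.5442) indicates weak discrimination in at least one dataset.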

  15. Computational analysis of ribonomics datasets identifies long non-coding RNA targets of γ-herpesviral miRNAs.

    PubMed

    Sethuraman, Sunantha; Thomas, Merin; Gay, Lauren A; Renne, Rolf

    2018-05-29

    Ribonomics experiments involving crosslinking and immuno-precipitation (CLIP) of Ago proteins have expanded the understanding of the miRNA targetome of several organisms. These techniques, collectively referred to as CLIP-seq, have been applied to identifying the mRNA targets of miRNAs expressed by Kaposi's Sarcoma-associated herpes virus (KSHV) and Epstein-Barr virus (EBV). However, these studies focused on identifying only those RNA targets of KSHV and EBV miRNAs that are known to encode proteins. Recent studies have demonstrated that long non-coding RNAs (lncRNAs) are also targeted by miRNAs. In this study, we performed a systematic re-analysis of published datasets from KSHV- and EBV-driven cancers. We used CLIP-seq data from lymphoma cells or EBV-transformed B cells, and a crosslinking, ligation and sequencing of hybrids dataset from KSHV-infected endothelial cells, to identify novel lncRNA targets of viral miRNAs. Here, we catalog the lncRNA targetome of KSHV and EBV miRNAs, and provide a detailed in silico analysis of lncRNA-miRNA binding interactions. Viral miRNAs target several hundred lncRNAs, including a subset previously shown to be aberrantly expressed in human malignancies. In addition, we identified thousands of lncRNAs to be putative targets of human miRNAs, suggesting that miRNA-lncRNA interactions broadly contribute to the regulation of gene expression.

  16. Integrative functional analyses using rainbow trout selected for tolerance to plant diets reveal nutrigenomic signatures for soy utilization without the concurrence of enteritis.

    PubMed

    Abernathy, Jason; Brezas, Andreas; Snekvik, Kevin R; Hardy, Ronald W; Overturf, Ken

    2017-01-01

    Finding suitable alternative protein sources for diets of carnivorous fish species remains a major concern for sustainable aquaculture. Through genetic selection, we created a strain of rainbow trout that outperforms parental lines in utilizing an all-plant protein diet and does not develop enteritis in the distal intestine, as is typical with salmonids on long-term plant protein-based feeds. By incorporating this strain into functional analyses, we set out to determine which genes are critical to plant protein utilization in the absence of gut inflammation. After a 12-week feeding trial with our selected strain and a control trout strain fed either a fishmeal-based diet or an all-plant protein diet, high-throughput RNA sequencing was completed on both liver and muscle tissues. Differential gene expression analyses, weighted correlation network analyses and further functional characterization were performed. A strain-by-diet design revealed differential expression ranging from a few dozen to over one thousand genes among the various comparisons and tissues. Major gene ontology groups identified between comparisons included those encompassing central, intermediary and foreign molecule metabolism, associated biosynthetic pathways as well as immunity. A systems approach indicated that genes involved in purine metabolism were highly perturbed. Systems analysis among the tissues tested further suggests the interplay between selection for growth, dietary utilization and protein tolerance may also have implications for nonspecific immunity. By combining data from differential gene expression and co-expression networks using selected trout, along with ontology and pathway analyses, a set of 63 candidate genes for plant diet tolerance was found. Risk loci in human inflammatory bowel diseases were also found in our datasets, indicating rainbow trout selected for plant-diet tolerance may have added utility as a potential biomedical model.

  17. Using the Positive and Negative Syndrome Scale (PANSS) to Define Different Domains of Negative Symptoms

    PubMed Central

    Khan, Anzalee; Keefe, Richard S. E.

    2017-01-01

    Background: Reduced emotional experience and expression are two domains of negative symptoms. The authors assessed these two domains of negative symptoms using previously developed Positive and Negative Syndrome Scale (PANSS) factors. Using an existing dataset, the authors predicted three different elements of everyday functioning (social, vocational, and everyday activities) with these two factors, as well as with performance on measures of functional capacity. Methods: A large (n=630) sample of people with schizophrenia was used as the data source of this study. Using regression analyses, the authors predicted the three different aspects of everyday functioning, first with just the two Positive and Negative Syndrome Scale factors and then with a global negative symptom factor. Finally, we added neurocognitive performance and functional capacity as predictors. Results: The Positive and Negative Syndrome Scale reduced emotional experience factor accounted for 21 percent of the variance in everyday social functioning, while reduced emotional expression accounted for no variance. The total Positive and Negative Syndrome Scale negative symptom factor accounted for less variance (19%) than the reduced experience factor alone. The Positive and Negative Syndrome Scale expression factor accounted for, at most, one percent of the variance in any of the functional outcomes, with or without the addition of other predictors. Implications: Reduced emotional experience measured with the Positive and Negative Syndrome Scale, often referred to as “avolition and anhedonia,” specifically predicted impairments in social outcomes. Further, reduced experience predicted social impairments better than emotional expression or the total Positive and Negative Syndrome Scale negative symptom factor. 
In this cross-sectional study, reduced emotional experience was specifically related with social outcomes, accounting for essentially no variance in work or everyday activities, and being the sole meaningful predictor of impairment in social outcomes. PMID:29410933

  18. RNA-Seq Reveals Infection-Induced Gene Expression Changes in the Snail Intermediate Host of the Carcinogenic Liver Fluke, Opisthorchis viverrini

    PubMed Central

    Prasopdee, Sattrachai; Sotillo, Javier; Tesana, Smarn; Laha, Thewarach; Kulsantiwong, Jutharat; Nolan, Matthew J.

    2014-01-01

    Background Bithynia siamensis goniomphalos is the snail intermediate host of the liver fluke, Opisthorchis viverrini, the leading cause of cholangiocarcinoma (CCA) in the Greater Mekong sub-region of Thailand. Despite the severe public health impact of Opisthorchis-induced CCA, knowledge of the molecular interactions occurring between the parasite and its snail intermediate host is scant. The examination of differences in gene expression profiling between uninfected and O. viverrini-infected B. siamensis goniomphalos could provide clues on fundamental pathways involved in the regulation of snail-parasite interplay. Methodology/Principal Findings Using high-throughput (Illumina) sequencing and extensive bioinformatic analyses, we characterized the transcriptomes of uninfected and O. viverrini-infected B. siamensis goniomphalos. Comparative analyses of gene expression profiling allowed the identification of 7,655 differentially expressed genes (DEGs), associated to 43 distinct biological pathways, including pathways associated with immune defense mechanisms against parasites. Amongst the DEGs with immune functions, transcripts encoding distinct proteases displayed the highest down-regulation in Bithynia specimens infected by O. viverrini; conversely, transcription of genes encoding heat-shock proteins and actins was significantly up-regulated in parasite-infected snails when compared to the uninfected counterparts. Conclusions/Significance The present study lays the foundation for functional studies of genes and gene products potentially involved in immune-molecular mechanisms implicated in the ability of the parasite to successfully colonize its snail intermediate host. The annotated dataset provided herein represents a ready-to-use molecular resource for the discovery of molecular pathways underlying susceptibility and resistance mechanisms of B. siamensis goniomphalos to O. 
viverrini and for comparative analyses with pulmonate snail intermediate hosts of other platyhelminths including schistosomes. PMID:24676090

  19. The "Discouraged-Business-Major" Hypothesis: Policy Implications

    ERIC Educational Resources Information Center

    Marangos, John

    2012-01-01

    This paper uses a relatively large dataset of the stated academic major preferences of economics majors at a relatively large, not highly selective, public university in the USA to identify the "discouraged-business-majors" (DBMs). The DBM hypothesis addresses the phenomenon where students who are screened out of the business curriculum often…

  20. Towards an effective data peer review

    NASA Astrophysics Data System (ADS)

    Düsterhus, André; Hense, Andreas

    2014-05-01

    Peer review is an established procedure to ensure the quality of scientific publications and is currently used as a prerequisite for acceptance of papers in the scientific community. In recent years the publication of raw data and its metadata has received increased attention, which led to the idea of bringing it to the same standards that journals apply to traditional publications. One missing element to achieve this is a comparable peer review scheme. This contribution introduces the idea of a quality evaluation process, which is designed to analyse the technical quality as well as the content of a dataset. It is based on quality tests, whose results are evaluated with the help of expert knowledge. The results of the tests and the expert knowledge are evaluated probabilistically and combined statistically. As a result the quality of a dataset is estimated with a single value only. This approach allows the reviewer to quickly identify the potential weaknesses of a dataset and generate a transparent and comprehensible report. To demonstrate the scheme, an application on a large meteorological dataset will be shown. Furthermore, the potentials and risks of such a scheme will be introduced and practical implications for its possible introduction to data centres investigated. In particular, the effects of reducing the estimate of quality of a dataset to a single number will be critically discussed.

  1. A cross-country Exchange Market Pressure (EMP) dataset.

    PubMed

    Desai, Mohit; Patnaik, Ila; Felman, Joshua; Shah, Ajay

    2017-06-01

    The data presented in this article are related to the research article titled "An exchange market pressure measure for cross country analysis" (Patnaik et al. [1]). In this article, we present the dataset of Exchange Market Pressure (EMP) values for 139 countries along with their conversion factors, ρ (rho). Exchange Market Pressure, expressed as a percentage change in the exchange rate, measures the change in exchange rate that would have taken place had the central bank not intervened. The conversion factor ρ can be interpreted as the change in exchange rate associated with $1 billion of intervention. Estimates of the conversion factor ρ allow us to calculate a monthly time series of EMP for 139 countries. Additionally, the dataset contains the 68% confidence interval (high and low values) for the point estimates of ρ. Using the standard errors of the estimates of ρ, we obtain one-sigma intervals around the mean estimates of EMP values. These values are also reported in the dataset.
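
    One way to read the description above is that EMP adds back to the observed exchange-rate change the movement that intervention absorbed. The sketch below encodes that reading; the additive form, the sign convention (intervention counted as positive when it prevented a rise in the rate), the function name and all numbers are assumptions for illustration and are not taken from the dataset or the underlying article.

```python
def exchange_market_pressure(pct_change_rate, intervention_bn, rho):
    """Counterfactual exchange-rate change without intervention.

    pct_change_rate : observed percentage change in the exchange rate
    intervention_bn : intervention in billions of USD (assumed sign
                      convention: positive when it offset a rise)
    rho             : percentage change in the exchange rate associated
                      with $1 billion of intervention
    """
    return pct_change_rate + rho * intervention_bn


# Hypothetical month: the rate moved 1.0 %, the central bank intervened
# with $2 billion, and rho is estimated at 0.5 % per $1 billion.
emp = exchange_market_pressure(1.0, 2.0, 0.5)

# Low and high values of rho give an interval around the EMP estimate.
emp_low = exchange_market_pressure(1.0, 2.0, 0.25)
emp_high = exchange_market_pressure(1.0, 2.0, 0.75)
print(emp_low, emp, emp_high)
```

    The interval here simply propagates a low and a high value of ρ through the same expression, mirroring how the record describes intervals around the EMP estimates.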

  2. MPIGeneNet: Parallel Calculation of Gene Co-Expression Networks on Multicore Clusters.

    PubMed

    Gonzalez-Dominguez, Jorge; Martin, Maria J

    2017-10-10

    In this work we present MPIGeneNet, a parallel tool that applies Pearson's correlation and Random Matrix Theory to construct gene co-expression networks. It is based on the state-of-the-art sequential tool RMTGeneNet, which provides networks with high robustness and sensitivity at the expense of relatively long runtimes for large-scale input datasets. MPIGeneNet returns the same results as RMTGeneNet but improves the memory management, reduces the I/O cost, and accelerates the two most computationally demanding steps of co-expression network construction by exploiting the compute capabilities of common multicore CPU clusters. Our performance evaluation on two different systems using three typical input datasets shows that MPIGeneNet is significantly faster than RMTGeneNet. As an example, our tool is up to 175.41 times faster on a cluster with eight nodes, each one containing two 12-core Intel Haswell processors. Source code of MPIGeneNet, as well as a reference manual, are available at https://sourceforge.net/projects/mpigenenet/.
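
    The core idea of a correlation-based co-expression network can be sketched in a few lines. This is a sequential toy illustration, not MPIGeneNet or RMTGeneNet code: the function names, the gene names and the fixed cutoff are assumptions, and in particular the fixed threshold stands in for the cutoff that the tools described above derive with Random Matrix Theory.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles.

    Raises ZeroDivisionError if either profile is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def coexpression_edges(expr, threshold):
    """Return (gene, gene, r) for every pair with |r| >= threshold.

    expr      : dict mapping gene name -> list of expression values
    threshold : assumed fixed cutoff (placeholder for an RMT-derived one)
    """
    genes = sorted(expr)
    edges = []
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            r = pearson(expr[g], expr[h])
            if abs(r) >= threshold:
                edges.append((g, h, r))
    return edges


# Hypothetical expression profiles across four samples.
toy_expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [1.0, -1.0, 1.0, -1.0],
}
print(coexpression_edges(toy_expr, threshold=0.9))
```

    The all-pairs loop is quadratic in the number of genes, which gives a sense of why the correlation step is a natural target for parallelization on large inputs.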

  3. Viscoplastic properties of laponite-CMC mixes.

    PubMed

    Tarhini, Z; Jarny, S; Texier, A

    2017-04-01

    In this dataset, 15 samples of laponite-CMC mixes were prepared and their viscoplastic properties determined. Rheological parameters are then expressed as a function of age and component concentrations.

  4. Gene expression inference with deep learning.

    PubMed

    Chen, Yifei; Li, Yi; Narayan, Rajiv; Subramanian, Aravind; Xie, Xiaohui

    2016-06-15

    Large-scale gene expression profiling has been widely used to characterize cellular states in response to various disease conditions, genetic perturbations, etc. Although the cost of whole-genome expression profiles has been dropping steadily, generating a compendium of expression profiling over thousands of samples is still very expensive. Recognizing that gene expressions are often highly correlated, researchers from the NIH LINCS program have developed a cost-effective strategy of profiling only ∼1000 carefully selected landmark genes and relying on computational methods to infer the expression of remaining target genes. However, the computational approach adopted by the LINCS program is currently based on linear regression (LR), limiting its accuracy since it does not capture complex nonlinear relationship between expressions of genes. We present a deep learning method (abbreviated as D-GEX) to infer the expression of target genes from the expression of landmark genes. We used the microarray-based Gene Expression Omnibus dataset, consisting of 111K expression profiles, to train our model and compare its performance to those from other methods. In terms of mean absolute error averaged across all genes, deep learning significantly outperforms LR with 15.33% relative improvement. A gene-wise comparative analysis shows that deep learning achieves lower error than LR in 99.97% of the target genes. We also tested the performance of our learned model on an independent RNA-Seq-based GTEx dataset, which consists of 2921 expression profiles. Deep learning still outperforms LR with 6.57% relative improvement, and achieves lower error in 81.31% of the target genes. D-GEX is available at https://github.com/uci-cbcl/D-GEX. Contact: xhx@ics.uci.edu. Supplementary data are available at Bioinformatics online.

  5. Gene expression inference with deep learning

    PubMed Central

    Chen, Yifei; Li, Yi; Narayan, Rajiv; Subramanian, Aravind; Xie, Xiaohui

    2016-01-01

    Motivation: Large-scale gene expression profiling has been widely used to characterize cellular states in response to various disease conditions, genetic perturbations, etc. Although the cost of whole-genome expression profiles has been dropping steadily, generating a compendium of expression profiling over thousands of samples is still very expensive. Recognizing that gene expressions are often highly correlated, researchers from the NIH LINCS program have developed a cost-effective strategy of profiling only ∼1000 carefully selected landmark genes and relying on computational methods to infer the expression of remaining target genes. However, the computational approach adopted by the LINCS program is currently based on linear regression (LR), limiting its accuracy since it does not capture complex nonlinear relationships between expressions of genes. Results: We present a deep learning method (abbreviated as D-GEX) to infer the expression of target genes from the expression of landmark genes. We used the microarray-based Gene Expression Omnibus dataset, consisting of 111K expression profiles, to train our model and compare its performance to that of other methods. In terms of mean absolute error averaged across all genes, deep learning significantly outperforms LR with 15.33% relative improvement. A gene-wise comparative analysis shows that deep learning achieves lower error than LR in 99.97% of the target genes. We also tested the performance of our learned model on an independent RNA-Seq-based GTEx dataset, which consists of 2921 expression profiles. Deep learning still outperforms LR with 6.57% relative improvement, and achieves lower error in 81.31% of the target genes. Availability and implementation: D-GEX is available at https://github.com/uci-cbcl/D-GEX. Contact: xhx@ics.uci.edu. Supplementary information: Supplementary data are available at Bioinformatics online. PMID:26873929
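
    The two summary statistics quoted in this abstract (relative improvement in gene-averaged mean absolute error, and the share of target genes on which one model beats the other) can be illustrated with a minimal sketch. The function names and the toy numbers are hypothetical, and the definition of relative improvement as (baseline error minus model error) divided by baseline error is an assumption, since the abstract does not spell it out.

```python
from statistics import mean


def mae_per_gene(y_true, y_pred):
    """Mean absolute error of each gene (column) over all samples (rows)."""
    n_genes = len(y_true[0])
    return [
        mean(abs(t[g] - p[g]) for t, p in zip(y_true, y_pred))
        for g in range(n_genes)
    ]


def relative_improvement(baseline_mae, model_mae):
    """Percent reduction of the gene-averaged MAE relative to the baseline.

    Assumed definition: 100 * (baseline - model) / baseline.
    """
    base = mean(baseline_mae)
    return 100.0 * (base - mean(model_mae)) / base


def percent_genes_improved(baseline_mae, model_mae):
    """Percent of genes on which the model has strictly lower MAE."""
    wins = sum(m < b for b, m in zip(baseline_mae, model_mae))
    return 100.0 * wins / len(baseline_mae)


# Toy data (invented): 2 samples x 2 target genes.
y_true = [[1.0, 2.0], [3.0, 4.0]]
baseline_pred = [[1.5, 2.5], [2.5, 4.5]]   # stands in for the LR baseline
model_pred = [[1.25, 2.5], [2.75, 4.5]]    # stands in for the nonlinear model

baseline_mae = mae_per_gene(y_true, baseline_pred)
model_mae = mae_per_gene(y_true, model_pred)
print(relative_improvement(baseline_mae, model_mae))    # 25.0
print(percent_genes_improved(baseline_mae, model_mae))  # 50.0
```

    Reporting both numbers is useful because a large average improvement can in principle be driven by a few genes; the gene-wise percentage shows how broadly the gain is distributed.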

  6. Anomalous Diffusion Measured by a Twice-Refocused Spin Echo Pulse Sequence: Analysis Using Fractional Order Calculus

    PubMed Central

    2011-01-01

    Purpose To theoretically develop and experimentally validate a formalism based on a fractional order calculus (FC) diffusion model to characterize anomalous diffusion in brain tissues measured with a twice-refocused spin-echo (TRSE) pulse sequence. Materials and Methods The FC diffusion model is the fractional order generalization of the Bloch-Torrey equation. Using this model, an analytical expression was derived to describe the diffusion-induced signal attenuation in a TRSE pulse sequence. To experimentally validate this expression, a set of diffusion-weighted (DW) images was acquired at 3 Tesla from healthy human brains using a TRSE sequence with twelve b-values ranging from 0 to 2,600 s/mm². For comparison, DW images were also acquired using a Stejskal-Tanner diffusion gradient in a single-shot spin-echo echo planar sequence. For both datasets, a Levenberg-Marquardt fitting algorithm was used to extract three parameters: diffusion coefficient D, fractional order derivative in space β, and a spatial parameter μ (in units of μm). Using adjusted R-squared values and standard deviations, D, β and μ values and the goodness-of-fit in three specific regions of interest (ROI) in white matter, gray matter, and cerebrospinal fluid were evaluated for each of the two datasets. In addition, spatially resolved parametric maps were assessed qualitatively. Results The analytical expression for the TRSE sequence, derived from the FC diffusion model, accurately characterized the diffusion-induced signal loss in brain tissues at high b-values. In the selected ROIs, the goodness-of-fit and standard deviations for the TRSE dataset were comparable with the results obtained from the Stejskal-Tanner dataset, demonstrating the robustness of the FC model across multiple data acquisition strategies. Qualitatively, the D, β, and μ maps from the TRSE dataset exhibited fewer artifacts, reflecting the improved immunity to eddy currents.
Conclusion The diffusion-induced signal attenuation in a TRSE pulse sequence can be described by an FC diffusion model at high b-values. This model performs equally well for data acquired from the human brain tissues with a TRSE pulse sequence or a conventional Stejskal-Tanner sequence. PMID:21509877

  7. Anomalous diffusion measured by a twice-refocused spin echo pulse sequence: analysis using fractional order calculus.

    PubMed

    Gao, Qing; Srinivasan, Girish; Magin, Richard L; Zhou, Xiaohong Joe

    2011-05-01

    To theoretically develop and experimentally validate a formalism based on a fractional order calculus (FC) diffusion model to characterize anomalous diffusion in brain tissues measured with a twice-refocused spin-echo (TRSE) pulse sequence. The FC diffusion model is the fractional order generalization of the Bloch-Torrey equation. Using this model, an analytical expression was derived to describe the diffusion-induced signal attenuation in a TRSE pulse sequence. To experimentally validate this expression, a set of diffusion-weighted (DW) images was acquired at 3 Tesla from healthy human brains using a TRSE sequence with twelve b-values ranging from 0 to 2600 s/mm². For comparison, DW images were also acquired using a Stejskal-Tanner diffusion gradient in a single-shot spin-echo echo planar sequence. For both datasets, a Levenberg-Marquardt fitting algorithm was used to extract three parameters: diffusion coefficient D, fractional order derivative in space β, and a spatial parameter μ (in units of μm). Using adjusted R-squared values and standard deviations, D, β, and μ values and the goodness-of-fit in three specific regions of interest (ROIs) in white matter, gray matter, and cerebrospinal fluid, respectively, were evaluated for each of the two datasets. In addition, spatially resolved parametric maps were assessed qualitatively. The analytical expression for the TRSE sequence, derived from the FC diffusion model, accurately characterized the diffusion-induced signal loss in brain tissues at high b-values. In the selected ROIs, the goodness-of-fit and standard deviations for the TRSE dataset were comparable with the results obtained from the Stejskal-Tanner dataset, demonstrating the robustness of the FC model across multiple data acquisition strategies. Qualitatively, the D, β, and μ maps from the TRSE dataset exhibited fewer artifacts, reflecting the improved immunity to eddy currents.
The diffusion-induced signal attenuation in a TRSE pulse sequence can be described by an FC diffusion model at high b-values. This model performs equally well for data acquired from the human brain tissues with a TRSE pulse sequence or a conventional Stejskal-Tanner sequence.

  8. Gene-Based Genome-Wide Association Analysis in European and Asian Populations Identified Novel Genes for Rheumatoid Arthritis.

    PubMed

    Zhu, Hong; Xia, Wei; Mo, Xing-Bo; Lin, Xiang; Qiu, Ying-Hua; Yi, Neng-Jun; Zhang, Yong-Hong; Deng, Fei-Yan; Lei, Shu-Feng

    2016-01-01

    Rheumatoid arthritis (RA) is a complex autoimmune disease. Using a gene-based association research strategy, the present study aims to detect unknown susceptibility to RA and to address the ethnic differences in genetic susceptibility to RA between European and Asian populations. Gene-based association analyses were performed with KGG 2.5 by using publicly available large RA datasets (14,361 RA cases and 43,923 controls of European subjects, 4,873 RA cases and 17,642 controls of Asian subjects). For the newly identified RA-associated genes, gene set enrichment analyses and protein-protein interactions analyses were carried out with DAVID and STRING version 10.0, respectively. Differential expression verification was conducted using 4 GEO datasets. The expression levels of three selected 'highly verified' genes were measured by ELISA among our in-house RA cases and controls. A total of 221 RA-associated genes were newly identified by gene-based association study, including 71 'overlapped', 76 'European-specific' and 74 'Asian-specific' genes. Among them, 105 genes had significant differential expressions between RA patients and healthy controls in at least one dataset, especially for 20 genes including 11 'overlapped' (ABCF1, FLOT1, HLA-F, IER3, TUBB, ZKSCAN4, BTN3A3, HSP90AB1, CUTA, BRD2, HLA-DMA), 5 'European-specific' (PHTF1, RPS18, BAK1, TNFRSF14, SUOX) and 4 'Asian-specific' (RNASET2, HFE, BTN2A2, MAPK13) genes whose differential expressions were significant in at least three datasets. The protein expressions of two selected genes FLOT1 (P value = 1.70E-02) and HLA-DMA (P value = 4.70E-02) in plasma were significantly different in our in-house samples. Our study identified 221 novel RA-associated genes and especially highlighted the importance of 20 candidate genes on RA. The results addressed ethnic genetic background differences for RA susceptibility between European and Asian populations and detected a long list of overlapped or ethnic specific RA genes.
The study not only greatly increases our understanding of genetic susceptibility to RA, but also provides important insights into the ethno-genetic homogeneity and heterogeneity of RA in both ethnicities.

  9. Arabidopsis Gene Family Profiler (aGFP)--user-oriented transcriptomic database with easy-to-use graphic interface.

    PubMed

    Dupl'áková, Nikoleta; Renák, David; Hovanec, Patrik; Honysová, Barbora; Twell, David; Honys, David

    2007-07-23

    Microarray technologies now belong to the standard functional genomics toolbox and have undergone massive development leading to increased genome coverage, accuracy and reliability. The number of experiments exploiting microarray technology has markedly increased in recent years. In parallel with the rapid accumulation of transcriptomic data, on-line analysis tools are being introduced to simplify their use. Global statistical data analysis methods contribute to the development of overall concepts about gene expression patterns and to query and compose working hypotheses. More recently, these applications are being supplemented with more specialized products offering visualization and specific data mining tools. We present a curated gene family-oriented gene expression database, Arabidopsis Gene Family Profiler (aGFP; http://agfp.ueb.cas.cz), which gives the user access to a large collection of normalised Affymetrix ATH1 microarray datasets. The database currently contains NASC Array and AtGenExpress transcriptomic datasets for various tissues at different developmental stages of wild type plants gathered from nearly 350 gene chips. The Arabidopsis GFP database has been designed as an easy-to-use tool for users needing an easily accessible resource for expression data of single genes, pre-defined gene families or custom gene sets, with the further possibility of keyword search. Arabidopsis Gene Family Profiler presents a user-friendly web interface using both graphic and text output. Data are stored at the MySQL server and individual queries are created in PHP script. 
The most distinguishable features of Arabidopsis Gene Family Profiler database are: 1) the presentation of normalized datasets (Affymetrix MAS algorithm and calculation of model-based gene-expression values based on the Perfect Match-only model); 2) the choice between two different normalization algorithms (Affymetrix MAS4 or MAS5 algorithms); 3) an intuitive interface; 4) an interactive "virtual plant" visualizing the spatial and developmental expression profiles of both gene families and individual genes. Arabidopsis GFP gives users the possibility to analyze current Arabidopsis developmental transcriptomic data starting with simple global queries that can be expanded and further refined to visualize comparative and highly selective gene expression profiles.

  10. A gene expression estimator of intramuscular fat percentage for use in both cattle and sheep

    PubMed Central

    2014-01-01

    Background The expression of genes encoding proteins involved in triacylglyceride and fatty acid synthesis and storage in cattle muscle is correlated with intramuscular fat (IMF)%. Are the same genes also correlated with IMF% in sheep muscle, and can the same set of genes be used to estimate IMF% in both species? Results The correlation between gene expression (microarray) and IMF% in the longissimus muscle (LM) of twenty sheep was calculated. An integrated analysis of this dataset with an equivalent cattle correlation dataset and a cattle differential expression dataset was undertaken. A total of 30 genes were identified to be strongly correlated with IMF% in both cattle and sheep. The overlap of genes was highly significant, 8 of the 13 genes in the TAG gene set and 8 of the 13 genes in the FA gene set were in the top 100 and 500 genes respectively most correlated with IMF% in sheep, P-value = 0. Of the 30 genes, CIDEA, THRSP, ACSM1, DGAT2 and FABP4 had the highest average rank in both species. Using the data from two small groups of Brahman cattle (control and hormone growth promotant-treated [known to decrease IMF% in muscle]) and 22 animals in total, the utility of a direct measure and different estimators of IMF% (ultrasound and gene expression) to differentiate between the two groups was examined. Directly measured IMF% and IMF% estimated from ultrasound scanning could not discriminate between the two groups. However, using gene expression to estimate IMF% discriminated between the two groups. Increasing the number of genes used to estimate IMF% from one to five significantly increased the discrimination power; but increasing the number of genes to 15 resulted in little further improvement. Conclusion We have demonstrated the utility of a comparative approach to identify robust estimators of IMF% in the LM in cattle and sheep.
We have also demonstrated a number of approaches (potentially applicable to much smaller groups of animals than conventional methods) to using gene expression to rank animals for IMF% within a single farm/treatment, or to estimate differences in IMF% between two farms/treatments. PMID:25028604

  11. The hydroclimatology of UK droughts: evidence from newly recovered and reconstructed datasets from the late 19th century to present

    NASA Astrophysics Data System (ADS)

    Smith, K. A.; Hannaford, J.; Bloomfield, J.; McCarthy, M.; Parry, S.; Barker, L. J.; Svensson, C.; Tanguy, M.; Marchant, B.; McKenzie, A.; Legg, T.; Prudhomme, C.

    2017-12-01

    While the UK is regarded as a wet country, it has periodically suffered from major droughts which have caused serious environmental and societal impacts. Parts of the UK are water stressed and, in a warming world, changes to supply/demand balances could have major implications. There is a pressing need for improved tools for drought risk assessment, which is contingent on a proper understanding of past occurrence of droughts. However, our understanding of hydrological drought occurrence is grounded in the post-1960 period when most UK river flow and groundwater records commenced. As such, water resources planners would benefit from a more thorough assessment of historical drought characteristics and their variability. The multi-disciplinary `Historic Droughts' project thus aims to rigorously characterise droughts in the UK back to the 1890s to inform improved drought management. The foundation of this is a comprehensive characterisation of the hydroclimatology of UK droughts. Here, we present the results of this initiative, based on a hydrological reconstruction campaign of unparalleled scope and detail. Driven by rainfall and potential evapotranspiration data, extended in time using newly recovered observational records, hydro(geo)logical models are used to reconstruct, back to 1890, river flows for >300 catchments across the UK, and groundwater levels from >50 boreholes. The reconstructions are derived within a state-of-the-art modelling framework which allows a comprehensive assessment of uncertainty. A suite of indicators are then applied to these datasets to identify and characterise drought events, integrating precipitation, evapotranspiration, streamflow and groundwater. The work provides new insights into the spatial and temporal dynamics of hitherto poorly quantified late 19th and early 20th century droughts. 
Similarly, the assessment of temporal variability of drought characteristics benefits from the long timescale of the reconstructions, in turn allowing improved assessment of the large-scale climate drivers of UK droughts. The propagation of UK drought is analysed comprehensively for the first time, highlighting the differential spatio-temporal expression of meteorological, streamflow and groundwater droughts, with important implications for water resources management.

  12. Identification of potential tumor-educated platelets RNA biomarkers in non-small-cell lung cancer by integrated bioinformatical analysis.

    PubMed

    Xue, Linlin; Xie, Li; Song, Xingguo; Song, Xianrang

    2018-04-17

    Platelets have emerged as key players in tumorigenesis and tumor progression. Tumor-educated platelet (TEP) RNA profile has the potential to diagnose non-small-cell lung cancer (NSCLC). The objective of this study was to identify potential TEP RNA biomarkers for the diagnosis of NSCLC and to explore the mechanisms in alterations of TEP RNA profile. The RNA-seq datasets GSE68086 and GSE89843 were downloaded from Gene Expression Omnibus DataSets (GEO DataSets). Then, the functional enrichment of the differentially expressed mRNAs was analyzed by the Database for Annotation Visualization and Integrated Discovery (DAVID). The miRNAs which regulated the differential mRNAs and the target mRNAs of miRNAs were identified by miRanda and miRDB. Then, the miRNA-mRNA regulatory network was visualized via Cytoscape software. Twenty consistently altered mRNAs (2 up-regulated and 18 down-regulated) were identified from the two GSE datasets, and they were significantly enriched in several biological processes, including transport and establishment of localization. Twenty identical miRNAs were found between exosomal miRNA-seq dataset and 229 miRNAs that regulated 20 consistently differential mRNAs in platelets. We also analyzed 13 spliceosomal mRNAs and their miRNA predictions; there were 27 common miRNAs between 206 differential exosomal miRNAs and 338 miRNAs that regulated 13 distinct spliceosomal mRNAs. This study identified 20 potential TEP RNA biomarkers in NSCLC for diagnosis by integrated bioinformatical analysis, and alterations in TEP RNA profile may be related to the post-transcriptional regulation and the splicing metabolisms of spliceosome.

  13. Procollagen-lysine 2-oxoglutarate 5-dioxygenase 2 promotes hypoxia-induced glioma migration and invasion

    PubMed Central

    Xu, Yangyang; Zhang, Lin; Wei, Yuzhen; Zhang, Xin; Xu, Ran; Han, Mingzhi; Huang, Bing; Chen, Anjing; Li, Wenjie; Zhang, Qing; Li, Gang; Wang, Jian; Zhao, Peng; Li, Xingang

    2017-01-01

    Poor prognosis of glioblastoma multiforme is strongly associated with the ability of tumor cells to invade the brain parenchyma, which is believed to be the major factor responsible for glioblastoma recurrence. Therefore, identifying the molecular mechanisms driving invasion may lead to the development of improved therapies for glioblastoma patients. Here, we investigated the role of procollagen-lysine 2-oxoglutarate 5-dioxygenase 2 (PLOD2), an enzyme catalyzing collagen cross-linking, in the biology of glioblastoma invasion. PLOD2 mRNA was significantly overexpressed in glioblastoma compared to low-grade tumors based on the Oncomine datasets and REMBRANDT database for human gliomas. Kaplan-Meier estimates based on the TCGA dataset demonstrated that high PLOD2 expression was associated with poor prognosis. In vitro, hypoxia upregulated PLOD2 protein in U87 and U251 human glioma cell lines. siRNA knockdown of endogenous HIF-1α or treatment of cells with the HIF-1α inhibitor PX-478 largely abolished the hypoxia-mediated PLOD2 upregulation. Knockdown of PLOD2 in glioma cell lines led to decreases in migration and invasion under normoxia and hypoxia. In addition, levels of phosphorylated FAK (Tyr 397), an important kinase mediating cell adhesion, were reduced in U87-shPLOD2 and U251-shPLOD2 cells, particularly under hypoxic conditions. Finally, orthotopic U251-shPLOD2 xenografts were circumscribed rather than locally invasive. In conclusion, the results indicated that PLOD2 was a gene of clinical relevance with implications in glioblastoma invasion and treatment strategies. PMID:28423580

  14. Effectively identifying regulatory hotspots while capturing expression heterogeneity in gene expression studies

    PubMed Central

    2014-01-01

    Expression quantitative trait loci (eQTL) mapping is a tool that can systematically identify genetic variation affecting gene expression. eQTL mapping studies have shown that certain genomic locations, referred to as regulatory hotspots, may affect the expression levels of many genes. Recently, studies have shown that various confounding factors may induce spurious regulatory hotspots. Here, we introduce a novel statistical method that effectively eliminates spurious hotspots while retaining genuine hotspots. Applied to simulated and real datasets, we validate that our method achieves greater sensitivity while retaining low false discovery rates compared to previous methods. PMID:24708878

  15. High lncRNA H19 expression as prognostic indicator: data mining in female cancers and polling analysis in non-female cancers

    PubMed Central

    Peng, Li; Liu, Zhao-Yang; Li, Wen-Ling; Zhang, Chao-Yang; Zhang, Ya-Qin; Pan, Xi; Chen, Jun; Li, Yue-Hui

    2017-01-01

    Upregulation of lncRNA H19 expression is associated with an unfavorable prognosis in some cancers. However, the prognostic value of H19 in female-specific cancers has remained uncharacterized. In this study, the prognostic power of high H19 expression in female cancer patients from the TCGA datasets was analyzed using Kaplan-Meier survival curves and Cox's proportional hazard modeling. In addition, in a meta-analysis of non-female cancer patients from TCGA datasets and 12 independent studies, hazard ratios (HRs) with 95% confidence interval (CI) for overall survival (OS) and disease-free survival (DFS)/relapse-free survival (RFS)/metastasis-free survival (MFS)/progression-free survival (PFS) were pooled to assess the prognostic value of high H19 expression. Kaplan-Meier analysis revealed that patients with uterine corpus cancer and higher H19 expression had a shorter OS (HR=2.710, p<0.05), while females with cervical cancer and increased H19 expression had a shorter RFS (HR=2.261, p<0.05). Multivariate Cox regression analysis showed that high H19 expression could independently predict a poorer prognosis in cervical cancer patients (HR=4.099, p<0.05). In the meta-analysis, patients with high H19 expression showed a poorer outcome in non-female cancer (p<0.05). These results suggest that high lncRNA H19 expression is predictive of an unfavorable prognosis in two female cancers (uterine corpus endometrioid cancer and cervical cancer) as well as in non-female cancer patients. PMID:27926484

  16. High lncRNA H19 expression as prognostic indicator: data mining in female cancers and polling analysis in non-female cancers.

    PubMed

    Peng, Li; Yuan, Xiao-Qing; Liu, Zhao-Yang; Li, Wen-Ling; Zhang, Chao-Yang; Zhang, Ya-Qin; Pan, Xi; Chen, Jun; Li, Yue-Hui; Li, Guan-Cheng

    2017-01-03

    Upregulation of lncRNA H19 expression is associated with an unfavorable prognosis in some cancers. However, the prognostic value of H19 in female-specific cancers has remained uncharacterized. In this study, the prognostic power of high H19 expression in female cancer patients from the TCGA datasets was analyzed using Kaplan-Meier survival curves and Cox's proportional hazard modeling. In addition, in a meta-analysis of non-female cancer patients from TCGA datasets and 12 independent studies, hazard ratios (HRs) with 95% confidence interval (CI) for overall survival (OS) and disease-free survival (DFS)/relapse-free survival (RFS)/metastasis-free survival (MFS)/progression-free survival (PFS) were pooled to assess the prognostic value of high H19 expression. Kaplan-Meier analysis revealed that patients with uterine corpus cancer and higher H19 expression had a shorter OS (HR=2.710, p<0.05), while females with cervical cancer and increased H19 expression had a shorter RFS (HR=2.261, p<0.05). Multivariate Cox regression analysis showed that high H19 expression could independently predict a poorer prognosis in cervical cancer patients (HR=4.099, p<0.05). In the meta-analysis, patients with high H19 expression showed a poorer outcome in non-female cancer (p<0.05). These results suggest that high lncRNA H19 expression is predictive of an unfavorable prognosis in two female cancers (uterine corpus endometrioid cancer and cervical cancer) as well as in non-female cancer patients.
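
    The pooling step described in this abstract (combining hazard ratios and their 95% confidence intervals across studies) can be sketched as follows. This is a minimal fixed-effect, inverse-variance example; the abstract does not say whether a fixed-effect or random-effects model was used, so that choice is an assumption, and the function name and input values are hypothetical.

```python
from math import exp, log, sqrt

Z95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def pool_hazard_ratios(studies):
    """Fixed-effect inverse-variance pooling of hazard ratios.

    studies: iterable of (hr, ci_lower, ci_upper) tuples with 95% CIs.
    Returns the pooled (hr, ci_lower, ci_upper).
    """
    total_weight = 0.0
    weighted_sum = 0.0
    for hr, lower, upper in studies:
        # Recover the standard error of log(HR) from the CI width.
        se = (log(upper) - log(lower)) / (2 * Z95)
        weight = 1.0 / se ** 2
        total_weight += weight
        weighted_sum += weight * log(hr)
    pooled_log = weighted_sum / total_weight
    pooled_se = sqrt(1.0 / total_weight)
    return (
        exp(pooled_log),
        exp(pooled_log - Z95 * pooled_se),
        exp(pooled_log + Z95 * pooled_se),
    )


# Invented inputs: two studies reporting the same HR with the same CI.
hr, lower, upper = pool_hazard_ratios([(2.0, 1.0, 4.0), (2.0, 1.0, 4.0)])
print(round(hr, 3), round(lower, 3), round(upper, 3))
```

    With two identical studies the pooled estimate stays at the common hazard ratio while the interval narrows, which is the expected behaviour of inverse-variance weighting.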

  17. A curated compendium of monocyte transcriptome datasets of relevance to human monocyte immunobiology research

    PubMed Central

    Rinchai, Darawan; Boughorbel, Sabri; Presnell, Scott; Quinn, Charlie; Chaussabel, Damien

    2016-01-01

    Systems-scale profiling approaches have become widely used in translational research settings. The resulting accumulation of large-scale datasets in public repositories represents a critical opportunity to promote insight and foster knowledge discovery. However, resources that can serve as an interface between biomedical researchers and such vast and heterogeneous dataset collections are needed in order to fulfill this potential. Recently, we have developed an interactive data browsing and visualization web application, the Gene Expression Browser (GXB). This tool can be used to overlay deep molecular phenotyping data with rich contextual information about analytes, samples and studies along with ancillary clinical or immunological profiling data. In this note, we describe a curated compendium of 93 public datasets generated in the context of human monocyte immunological studies, representing a total of 4,516 transcriptome profiles. Datasets were uploaded to an instance of GXB along with study description and sample annotations. Study samples were arranged in different groups. Ranked gene lists were generated based on relevant group comparisons. This resource is publicly available online at http://monocyte.gxbsidra.org/dm3/landing.gsp. PMID:27158452

  18. THD-Module Extractor: An Application for CEN Module Extraction and Interesting Gene Identification for Alzheimer's Disease.

    PubMed

    Kakati, Tulika; Kashyap, Hirak; Bhattacharyya, Dhruba K

    2016-11-30

    There exist many tools and methods for construction of co-expression network from gene expression data and for extraction of densely connected gene modules. In this paper, a method is introduced to construct co-expression network and to extract co-expressed modules having high biological significance. The proposed method has been validated on several well known microarray datasets extracted from a diverse set of species, using statistical measures, such as p and q values. The modules obtained in these studies are found to be biologically significant based on Gene Ontology enrichment analysis, pathway analysis, and KEGG enrichment analysis. Further, the method was applied on an Alzheimer's disease dataset and some interesting genes are found, which have high semantic similarity among them, but are not significantly correlated in terms of expression similarity. Some of these interesting genes, such as MAPT, CASP2, and PSEN2, are linked with important aspects of Alzheimer's disease, such as dementia, increased cell death, and deposition of amyloid-beta proteins in Alzheimer's disease brains. The biological pathways associated with Alzheimer's disease, such as Wnt signaling, Apoptosis, p53 signaling, and Notch signaling, incorporate these interesting genes. The proposed method is evaluated in regard to existing literature.
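
    As a generic illustration of what constructing a co-expression network and extracting modules involves, the sketch below links genes whose expression profiles have an absolute Pearson correlation at or above a threshold and reports connected components as modules. This is not the THD-Module Extractor algorithm, which the abstract does not describe; the threshold, gene names and expression values are hypothetical, and profiles are assumed to be non-constant.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def coexpression_modules(expression, threshold=0.9):
    """Group genes into connected components of a thresholded network.

    expression: dict mapping gene name -> list of expression values.
    An edge joins two genes when |Pearson r| >= threshold.
    """
    genes = list(expression)
    neighbours = {g: set() for g in genes}
    for i, a in enumerate(genes):
        for b in genes[i + 1:]:
            if abs(pearson(expression[a], expression[b])) >= threshold:
                neighbours[a].add(b)
                neighbours[b].add(a)

    modules, seen = [], set()
    for gene in genes:
        if gene in seen:
            continue
        # Depth-first search collects everything reachable from this gene.
        stack, component = [gene], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        modules.append(component)
    return modules


# Invented profiles for three placeholder genes across four samples.
toy = {
    "gene_1": [1.0, 2.0, 3.0, 4.0],
    "gene_2": [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with gene_1
    "gene_3": [1.0, 3.0, 2.0, 4.0],   # r = 0.8 with gene_1, below threshold
}
print(coexpression_modules(toy))
```

    Connected components are the simplest possible module definition; methods aimed at densely connected modules typically apply stricter criteria than mere reachability.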

  19. Machine learning in computational biology to accelerate high-throughput protein expression.

    PubMed

    Sastry, Anand; Monk, Jonathan; Tegel, Hanna; Uhlen, Mathias; Palsson, Bernhard O; Rockberg, Johan; Brunk, Elizabeth

    2017-08-15

    The Human Protein Atlas (HPA) enables the simultaneous characterization of thousands of proteins across various tissues to pinpoint their spatial location in the human body. This has been achieved through transcriptomics and high-throughput immunohistochemistry-based approaches, where over 40 000 unique human protein fragments have been expressed in E. coli. These datasets enable quantitative tracking of entire cellular proteomes and present new avenues for understanding molecular-level properties influencing expression and solubility. Combining computational biology and machine learning identifies protein properties that hinder the HPA high-throughput antibody production pipeline. We predict protein expression and solubility with accuracies of 70% and 80%, respectively, based on a subset of key properties (aromaticity, hydropathy and isoelectric point). We guide the selection of protein fragments based on these characteristics to optimize high-throughput experimentation. We present the machine learning workflow as a series of IPython notebooks hosted on GitHub (https://github.com/SBRG/Protein_ML). The workflow can be used as a template for analysis of further expression and solubility datasets. Contact: ebrunk@ucsd.edu or johanr@biotech.kth.se. Supplementary data are available at Bioinformatics online.
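
    Two of the sequence properties named in this abstract can be computed directly from an amino acid sequence, as the sketch below shows. It assumes the common definitions: aromaticity as the fraction of F, W and Y residues, and hydropathy as the mean Kyte-Doolittle value (GRAVY). The abstract does not state which definitions the authors used, the isoelectric point is left out because it needs a pKa table and an iterative solve, and the example sequences are invented.

```python
# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def aromaticity(sequence):
    """Fraction of aromatic residues (F, W, Y) in the sequence."""
    sequence = sequence.upper()
    return sum(sequence.count(aa) for aa in "FWY") / len(sequence)


def gravy(sequence):
    """Mean Kyte-Doolittle hydropathy; positive values are hydrophobic."""
    sequence = sequence.upper()
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


# Invented fragments, not taken from the HPA dataset.
print(aromaticity("FWYA"))  # 0.75
print(gravy("AI"), gravy("RK"))
```

    Features like these are cheap to compute for every candidate fragment, which is what makes them practical as inputs to a classifier that screens fragments before expression is attempted.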

  20. Integrative analysis of multi-omics data for identifying multi-markers for diagnosing pancreatic cancer

    PubMed Central

    2015-01-01

    Background microRNA (miRNA) expression plays an influential role in cancer classification and malignancy, and miRNAs are feasible as alternative diagnostic markers for pancreatic cancer, a highly aggressive neoplasm with silent early symptoms, high metastatic potential, and resistance to conventional therapies. Methods In this study, we evaluated the benefits of multi-omics data analysis by integrating miRNA and mRNA expression data in pancreatic cancer. Using support vector machine (SVM) modelling and leave-one-out cross validation (LOOCV), we evaluated the diagnostic performance of single- or multi-markers based on miRNA and mRNA expression profiles from 104 PDAC tissues and 17 benign pancreatic tissues. For selecting even more reliable and robust markers, we performed validation by independent datasets from the Gene Expression Omnibus (GEO) and the Cancer Genome Atlas (TCGA) data depositories. For validation, miRNA activity was estimated by miRNA-target gene interaction and mRNA expression datasets in pancreatic cancer. Results Using a comprehensive identification approach, we successfully identified 705 multi-markers having powerful diagnostic performance for PDAC. In addition, these marker candidates were annotated with cancer pathways using gene ontology analysis. Conclusions Our prediction models have strong potential for the diagnosis of pancreatic cancer. PMID:26328610
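
    The leave-one-out cross validation (LOOCV) procedure mentioned in this abstract can be sketched in a few lines. To keep the example free of third-party dependencies, a nearest-centroid classifier stands in for the SVM used in the study, so this shows only the evaluation loop, not the authors' model. All names and feature values are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, sample):
    """Predict the label whose class centroid is closest to the sample.

    Stand-in classifier (not an SVM), used only to keep the sketch
    self-contained.
    """
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(sample, centroid))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def loocv_accuracy(features, labels, predict=nearest_centroid_predict):
    """Hold out each sample once, train on the rest, and score the hit rate."""
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)


# Invented two-marker expression values for six tissue samples.
features = [
    [1.0, 1.0], [1.2, 0.9], [0.9, 1.1],   # labelled "benign"
    [3.0, 3.1], [3.2, 2.9], [2.9, 3.0],   # labelled "tumor"
]
labels = ["benign"] * 3 + ["tumor"] * 3
print(loocv_accuracy(features, labels))  # 1.0
```

    LOOCV suits small cohorts because every sample is used for testing exactly once, but with unbalanced classes (as in a 104 versus 17 split) plain accuracy can be misleading, so sensitivity and specificity are usually reported alongside it.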

  1. Identifying candidate genes for Type 2 Diabetes Mellitus and obesity through gene expression profiling in multiple tissues or cells.

    PubMed

    Chen, Junhui; Meng, Yuhuan; Zhou, Jinghui; Zhuo, Min; Ling, Fei; Zhang, Yu; Du, Hongli; Wang, Xiaoning

    2013-01-01

    Type 2 Diabetes Mellitus (T2DM) and obesity have become increasingly prevalent in recent years. Recent studies have focused on identifying causal variations or candidate genes for obesity and T2DM via analysis of expression quantitative trait loci (eQTL) within a single tissue. T2DM and obesity are affected by comprehensive sets of genes in multiple tissues. In the current study, gene expression levels in multiple human tissues from GEO datasets were analyzed, and 21 candidate genes displaying high percentages of differential expression were selected. Specifically, DENND1B, LYN, MRPL30, POC1B, PRKCB, RP4-655J12.3, HIBADH, and TMBIM4 were identified from the T2DM-control study, and BCAT1, BMP2K, CSRNP2, MYNN, NCKAP5L, SAP30BP, SLC35B4, SP1, BAP1, GRB14, HSP90AB1, ITGA5, and TOMM5 were identified from the obesity-control study. The majority of these genes are known to be involved in T2DM and obesity. Therefore, analysis of gene expression in various tissues using GEO datasets may be an effective and feasible method to determine novel or causal genes associated with T2DM and obesity.

  2. THD-Module Extractor: An Application for CEN Module Extraction and Interesting Gene Identification for Alzheimer’s Disease

    PubMed Central

    Kakati, Tulika; Kashyap, Hirak; Bhattacharyya, Dhruba K.

    2016-01-01

    There exist many tools and methods for construction of co-expression networks from gene expression data and for extraction of densely connected gene modules. In this paper, a method is introduced to construct a co-expression network and to extract co-expressed modules having high biological significance. The proposed method has been validated on several well known microarray datasets extracted from a diverse set of species, using statistical measures, such as p and q values. The modules obtained in these studies are found to be biologically significant based on Gene Ontology enrichment analysis, pathway analysis, and KEGG enrichment analysis. Further, the method was applied on an Alzheimer’s disease dataset and some interesting genes are found, which have high semantic similarity among them, but are not significantly correlated in terms of expression similarity. Some of these interesting genes, such as MAPT, CASP2, and PSEN2, are linked with important aspects of Alzheimer’s disease, such as dementia, increased cell death, and deposition of amyloid-beta proteins in Alzheimer’s disease brains. The biological pathways associated with Alzheimer’s disease, such as Wnt signaling, Apoptosis, p53 signaling, and Notch signaling, incorporate these interesting genes. The proposed method is evaluated in regard to existing literature. PMID:27901073
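
    To make the idea of a co-expression network concrete, here is a minimal sketch that links two genes when the absolute Pearson correlation of their expression profiles reaches a threshold. The similarity measure and module-extraction procedure of THD-Module Extractor are not described in the abstract, so Pearson correlation and the 0.9 threshold are assumptions chosen for illustration.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return cov / math.sqrt(var_a * var_b)


def coexpression_edges(expr, threshold=0.9):
    """Return (gene1, gene2, r) for every gene pair with |r| >= threshold.

    `expr` maps a gene name to its list of expression values across
    the same ordered set of samples.
    """
    edges = []
    for g1, g2 in combinations(sorted(expr), 2):
        r = pearson(expr[g1], expr[g2])
        if abs(r) >= threshold:
            edges.append((g1, g2, r))
    return edges
```

    Densely connected groups of nodes in the resulting edge list are the candidate modules; how they are extracted is specific to each tool.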

  3. Evaluating the Prognostic Value of microRNA-203 in Solid Tumors Based on a Meta-Analysis and the Cancer Genome Atlas (TCGA) Datasets.

    PubMed

    Shao, Yingjie; Gu, Wendong; Ning, Zhonghua; Song, Xing; Pei, Honglei; Jiang, Jingting

    2017-01-01

    It has been reported that miR-203 expression was aberrant in various types of cancers, and it could be used as a prognostic biomarker. Therefore, in this study, we aimed to evaluate the prognostic value of miR-203 expression in solid tumors by using meta-analysis and The Cancer Genome Atlas (TCGA) datasets. By doing a literature search in PubMed, Embase and the Cochrane Library (last updated December 2016), we were able to identify the studies assessing the prognostic role of miR-203 in various tumors. We then used TCGA datasets to validate the results of meta-analysis. Thirty-three studies from 26 articles were qualified and enrolled in this meta-analysis. Pooled analyses showed that higher expression of miR-203 in tissues couldn't predict poor overall survival (OS) and progression-free survival (PFS) in solid tumors. However, the results of subgroup analyses revealed that the upregulation of tissue miR-203 expression was associated with poor OS in colorectal cancer (hazard ratio (HR)=1.81, 95% confidence intervals (CI) 1.31-2.49; P<0.001), pancreatic cancer (HR=1.19, 95% CI 1.09-1.31; P<0.001) and ovarian cancer (HR=1.85, 95% CI 1.45-2.37; P<0.001); but it had opposite association in liver cancer (HR=0.52, 95% CI 0.28-0.97; P=0.040) and esophageal cancer (HR=0.41, 95% CI 0.25-0.66; P<0.001). Based on TCGA datasets, we found the same results for pancreatic cancer and esophageal cancer, but not for colorectal cancer and liver cancer. Moreover, patients with high circulating miR-203 in blood had significantly poor OS and PFS in colorectal cancer and breast cancer. Our study showed that the prognostic values of tissue miR-203 varied in different tumor types. In addition, the upregulation of circulating miR-203 in blood was associated with poor prognosis in colorectal cancer and breast cancer. © 2017 The Author(s). Published by S. Karger AG, Basel.
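
    Pooled hazard ratios like those reported above are commonly obtained by inverse-variance weighting on the log scale, with each study's standard error recovered from its 95% confidence interval. The sketch below shows that generic fixed-effect calculation; whether the authors used a fixed- or random-effects model is not stated in the abstract, and the function names are illustrative.

```python
import math

Z_95 = 1.959963984540054  # two-sided 95% normal quantile


def log_hr_and_se(hr, ci_low, ci_high):
    """Recover log(HR) and its standard error from a reported 95% CI."""
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * Z_95)
    return math.log(hr), se


def pool_fixed_effect(studies):
    """Inverse-variance fixed-effect pooling.

    `studies` is a list of (hr, ci_low, ci_high) tuples.
    Returns (pooled_hr, pooled_ci_low, pooled_ci_high).
    """
    total_weight = 0.0
    weighted_sum = 0.0
    for hr, ci_low, ci_high in studies:
        log_hr, se = log_hr_and_se(hr, ci_low, ci_high)
        weight = 1.0 / se ** 2
        total_weight += weight
        weighted_sum += weight * log_hr
    pooled_log = weighted_sum / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return (
        math.exp(pooled_log),
        math.exp(pooled_log - Z_95 * pooled_se),
        math.exp(pooled_log + Z_95 * pooled_se),
    )
```

    For example, `log_hr_and_se(1.81, 1.31, 2.49)` recovers the log hazard ratio and standard error implied by the colorectal cancer estimate quoted in the abstract.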

  4. Toward Computational Cumulative Biology by Combining Models of Biological Datasets

    PubMed Central

    Faisal, Ali; Peltonen, Jaakko; Georgii, Elisabeth; Rung, Johan; Kaski, Samuel

    2014-01-01

    A main challenge of data-driven sciences is how to make maximal use of the progressively expanding databases of experimental datasets in order to keep research cumulative. We introduce the idea of a modeling-based dataset retrieval engine designed for relating a researcher's experimental dataset to earlier work in the field. The search is (i) data-driven to enable new findings, going beyond the state of the art of keyword searches in annotations, (ii) modeling-driven, to include both biological knowledge and insights learned from data, and (iii) scalable, as it is accomplished without building one unified grand model of all data. Assuming each dataset has been modeled beforehand, by the researchers or automatically by database managers, we apply a rapidly computable and optimizable combination model to decompose a new dataset into contributions from earlier relevant models. By using the data-driven decomposition, we identify a network of interrelated datasets from a large annotated human gene expression atlas. While tissue type and disease were major driving forces for determining relevant datasets, the found relationships were richer, and the model-based search was more accurate than the keyword search; moreover, it recovered biologically meaningful relationships that are not straightforwardly visible from annotations—for instance, between cells in different developmental stages such as thymocytes and T-cells. Data-driven links and citations matched to a large extent; the data-driven links even uncovered corrections to the publication data, as two of the most linked datasets were not highly cited and turned out to have wrong publication entries in the database. PMID:25427176

  5. Toward computational cumulative biology by combining models of biological datasets.

    PubMed

    Faisal, Ali; Peltonen, Jaakko; Georgii, Elisabeth; Rung, Johan; Kaski, Samuel

    2014-01-01

    A main challenge of data-driven sciences is how to make maximal use of the progressively expanding databases of experimental datasets in order to keep research cumulative. We introduce the idea of a modeling-based dataset retrieval engine designed for relating a researcher's experimental dataset to earlier work in the field. The search is (i) data-driven to enable new findings, going beyond the state of the art of keyword searches in annotations, (ii) modeling-driven, to include both biological knowledge and insights learned from data, and (iii) scalable, as it is accomplished without building one unified grand model of all data. Assuming each dataset has been modeled beforehand, by the researchers or automatically by database managers, we apply a rapidly computable and optimizable combination model to decompose a new dataset into contributions from earlier relevant models. By using the data-driven decomposition, we identify a network of interrelated datasets from a large annotated human gene expression atlas. While tissue type and disease were major driving forces for determining relevant datasets, the found relationships were richer, and the model-based search was more accurate than the keyword search; moreover, it recovered biologically meaningful relationships that are not straightforwardly visible from annotations, for instance between cells in different developmental stages such as thymocytes and T-cells. Data-driven links and citations matched to a large extent; the data-driven links even uncovered corrections to the publication data, as two of the most linked datasets were not highly cited and turned out to have wrong publication entries in the database.

  6. Thermodynamic Data Rescue and Informatics for Deep Carbon Science

    NASA Astrophysics Data System (ADS)

    Zhong, H.; Ma, X.; Prabhu, A.; Eleish, A.; Pan, F.; Parsons, M. A.; Ghiorso, M. S.; West, P.; Zednik, S.; Erickson, J. S.; Chen, Y.; Wang, H.; Fox, P. A.

    2017-12-01

    A large number of legacy datasets are contained in geoscience literature published between 1930 and 1980 and not expressed external to the publication text in digitized formats. Extracting, organizing, and reusing these "dark" datasets is highly valuable for many within the Earth and planetary science community. As a part of the Deep Carbon Observatory (DCO) data legacy missions, the DCO Data Science Team and Extreme Physics and Chemistry community identified thermodynamic datasets related to carbon, or more specifically datasets about the enthalpy and entropy of chemicals, as a proof of principle analysis. The data science team endeavored to develop a semi-automatic workflow, which includes identifying relevant publications, extracting contained datasets using OCR methods, collaborative reviewing, and registering the datasets via the DCO Data Portal where the 'Linked Data' feature of the data portal provides a mechanism for connecting rescued datasets beyond their individual data sources, to research domains, DCO Communities, and more, making data discovery and retrieval more effective. To date, the team has successfully rescued, deposited and registered additional datasets from publications with thermodynamic sources. These datasets contain 3 main types of data: (1) heat content or enthalpy data determined for a given compound as a function of temperature using high-temperature calorimetry, (2) heat content or enthalpy data determined for a given compound as a function of temperature using adiabatic calorimetry, and (3) direct determination of heat capacity of a compound as a function of temperature using differential scanning calorimetry. The data science team integrated these datasets and delivered a spectrum of data analytics including visualizations, which will lead to a comprehensive characterization of the thermodynamics of carbon and carbon-related materials.

  7. Modelling gene expression profiles related to prostate tumor progression using binary states

    PubMed Central

    2013-01-01

    Background Cancer is a complex disease commonly characterized by the disrupted activity of several cancer-related genes such as oncogenes and tumor-suppressor genes. Previous studies suggest that the process of tumor progression to malignancy is dynamic and can be traced by changes in gene expression. Despite the enormous efforts made for differential expression detection and biomarker discovery, few methods have been designed to model the gene expression level to tumor stage during malignancy progression. Such models could help us understand the dynamics and simplify or reveal the complexity of tumor progression. Methods We have modeled an on-off state of gene activation per sample then per stage to select gene expression profiles associated with tumor progression. The selection is guided by statistical significance of profiles based on randomly permuted datasets. Results We show that our method identifies expected profiles corresponding to oncogenes and tumor suppressor genes in a prostate tumor progression dataset. Comparisons with other methods support our findings and indicate that a considerable proportion of significant profiles is not found by other statistical tests commonly used to detect differential expression between tumor stages nor found by other tailored methods. Ontology and pathway analysis concurred with these findings. Conclusions Results suggest that our methodology may be a valuable tool to study tumor malignancy progression, which might reveal novel cancer therapies. PMID:23721350
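
    The "on-off state per sample then per stage" idea can be illustrated with a small sketch. The abstract does not say how states are assigned, so the fixed expression threshold and the majority vote per stage used here are assumptions, as are the stage names; the permutation-based significance step is left out.

```python
def stage_profile(values, stages, threshold, stage_order):
    """Binarize one gene's expression and summarize it per stage.

    values      -- expression of the gene in each sample
    stages      -- stage label of each sample (same order as values)
    threshold   -- assumed cut-off: a sample is "on" if value > threshold
    stage_order -- stages in order of progression

    Returns a tuple of 0/1 states, one per stage, where a stage is
    "on" when more than half of its samples are on (assumed rule).
    """
    on_counts = {s: 0 for s in stage_order}
    totals = {s: 0 for s in stage_order}
    for value, stage in zip(values, stages):
        totals[stage] += 1
        if value > threshold:
            on_counts[stage] += 1
    return tuple(
        1 if 2 * on_counts[s] > totals[s] else 0 for s in stage_order
    )
```

    Under these assumptions, a gene with profile `(0, 1, 1)` across normal, primary and metastatic stages would look like an oncogene-type switch, and `(1, 0, 0)` like a tumor-suppressor-type switch.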

  8. Downstream targets of HOXB4 in a cell line model of primitive hematopoietic progenitor cells.

    PubMed

    Lee, Han M; Zhang, Hui; Schulz, Vincent; Tuck, David P; Forget, Bernard G

    2010-08-05

    Enforced expression of the homeobox transcription factor HOXB4 has been shown to enhance hematopoietic stem cell self-renewal and expansion ex vivo and in vivo. To investigate the downstream targets of HOXB4 in hematopoietic progenitor cells, HOXB4 was constitutively overexpressed in the primitive hematopoietic progenitor cell line EML. Two genome-wide analytical techniques were used: RNA expression profiling using microarrays and chromatin immunoprecipitation (ChIP)-chip. RNA expression profiling revealed that 465 gene transcripts were differentially expressed in KLS (c-Kit(+), Lin(-), Sca-1(+))-EML cells that overexpressed HOXB4 (KLS-EML-HOXB4) compared with control KLS-EML cells that were transduced with vector alone. In particular, erythroid-specific gene transcripts were observed to be highly down-regulated in KLS-EML-HOXB4 cells. ChIP-chip analysis revealed that the promoter regions of 1910 genes, such as CD34, Sox4, and B220, were occupied by HOXB4 in KLS-EML-HOXB4 cells. Side-by-side comparison of the ChIP-chip and RNA expression profiling datasets provided correlative information and identified Gp49a and Laptm4b as candidate "stemness-related" genes. Both genes were highly ranked in both dataset lists and have been previously shown to be preferentially expressed in hematopoietic stem cells and down-regulated in mature hematopoietic cells, thus making them attractive candidates for future functional studies in hematopoietic cells.

  9. Brain growth across the life span in autism: age-specific changes in anatomical pathology.

    PubMed

    Courchesne, Eric; Campbell, Kathleen; Solso, Stephanie

    2011-03-22

    Autism is marked by overgrowth of the brain at the earliest ages but not at older ages when decreases in structural volumes and neuron numbers are observed instead. This has led to the theory of age-specific anatomic abnormalities in autism. Here we report age-related changes in brain size in autistic and typical subjects from 12 months to 50 years of age based on analyses of 586 longitudinal and cross-sectional MRI scans. This dataset is several times larger than the largest autism study to date. Results demonstrate early brain overgrowth during infancy and the toddler years in autistic boys and girls, followed by an accelerated rate of decline in size and perhaps degeneration from adolescence to late middle age in this disorder. We theorize that underlying these age-specific changes in anatomic abnormalities in autism, there may also be age-specific changes in gene expression, molecular, synaptic, cellular, and circuit abnormalities. A peak age for detecting and studying the earliest fundamental biological underpinnings of autism is prenatal life and the first three postnatal years. Studies of the older autistic brain may not address original causes but are essential to discovering how best to help the older aging autistic person. Lastly, the theory of age-specific anatomic abnormalities in autism has broad implications for a wide range of work on the disorder including the design, validation, and interpretation of animal model, lymphocyte gene expression, brain gene expression, and genotype/CNV-anatomic phenotype studies. Copyright © 2010 Elsevier B.V. All rights reserved.

  10. Ovarian Cancer Differential Interactome and Network Entropy Analysis Reveal New Candidate Biomarkers.

    PubMed

    Ayyildiz, Dilara; Gov, Esra; Sinha, Raghu; Arga, Kazim Yalcin

    2017-05-01

    Ovarian cancer is one of the most common cancers and has a high mortality rate due to insidious symptoms and lack of robust diagnostics. A hitherto understudied concept in cancer pathogenesis may offer new avenues for innovation in ovarian cancer biomarker development. Cancer cells are characterized by an increase in network entropy, and several studies have exploited this concept to identify disease-associated gene and protein modules. We report in this study the changes in protein-protein interactions (PPIs) in ovarian cancer within a differential network (interactome) analysis framework utilizing the entropy concept and gene expression data. A compendium of six transcriptome datasets that included 140 samples from laser microdissected epithelial cells of ovarian cancer patients and 51 samples from healthy population was obtained from Gene Expression Omnibus, and the high confidence human protein interactome (31,465 interactions among 10,681 proteins) was used. The uncertainties of the up- or downregulation of PPIs in ovarian cancer were estimated through an entropy formulation utilizing combined expression levels of genes, and the interacting protein pairs with minimum uncertainty were identified. We identified 105 proteins with differential PPI patterns scattered in 11 modules, each indicating significantly affected biological pathways in ovarian cancer such as DNA repair, cell proliferation-related mechanisms, nucleoplasmic translocation of estrogen receptor, extracellular matrix degradation, and inflammation response. In conclusion, we suggest several PPIs as biomarker candidates for ovarian cancer and discuss their future biological implications as potential molecular targets for pharmaceutical development as well. In addition, network entropy analysis is a concept that deserves greater research attention for diagnostic innovation in oncology and tumor pathogenesis.
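
    The notion of ranking interactions by "minimum uncertainty" can be illustrated with Shannon entropy. The abstract does not give the authors' entropy formulation, so the sketch below makes a simplifying assumption: each interaction is summarized by a single probability of being up-regulated in tumors, and its uncertainty is the binary entropy of that probability. All names and values are hypothetical.

```python
import math


def binary_entropy(p):
    """Shannon entropy (in bits) of a two-outcome event with P(up) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a certain outcome carries no uncertainty
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def rank_by_certainty(up_probabilities):
    """Sort interactions from least to most uncertain.

    `up_probabilities` maps an interaction, e.g. ("P1", "P2"), to the
    assumed probability that it is up-regulated in tumor samples.
    """
    return sorted(
        up_probabilities,
        key=lambda pair: binary_entropy(up_probabilities[pair]),
    )
```

    Interactions with probabilities near 0 or 1 have entropy near zero and rank first, matching the intuition that consistently down- or up-regulated pairs are the most informative candidates.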

  11. Comparability of Examinee Proficiency Scores on Computer Adaptive Tests Using Real and Simulated Data

    ERIC Educational Resources Information Center

    Evans, Josiah Jeremiah

    2010-01-01

    In measurement research, data simulations are a commonly used analytical technique. While simulation designs have many benefits, it is unclear if these artificially generated datasets are able to accurately capture real examinee item response behaviors. This potential lack of comparability may have important implications for administration of…

  12. Data mining in child welfare.

    PubMed

    Schoech, D; Quinn, A; Rycraft, J R

    2000-01-01

    Data mining is the sifting through of voluminous data to extract knowledge for decision making. This article illustrates the context, concepts, processes, techniques, and tools of data mining, using statistical and neural network analyses on a dataset concerning employee turnover. The resulting models and their predictive capability, advantages and disadvantages, and implications for decision support are highlighted.

  13. Data Mining and Privacy of Social Network Sites' Users: Implications of the Data Mining Problem.

    PubMed

    Al-Saggaf, Yeslam; Islam, Md Zahidul

    2015-08-01

    This paper explores the potential of data mining as a technique that could be used by malicious data miners to threaten the privacy of social network sites (SNS) users. It applies a data mining algorithm to a real dataset to provide empirically-based evidence of the ease with which characteristics about the SNS users can be discovered and used in a way that could invade their privacy. One major contribution of this article is the application of the decision forest data mining algorithm (SysFor) to the context of SNS; SysFor builds not just a single decision tree but a forest, allowing the exploration of more logic rules from a dataset. One logic rule that SysFor built in this study, for example, revealed that anyone having a profile picture showing just the face or a picture showing a family is less likely to be lonely. Another contribution of this article is the discussion of the implications of the data mining problem for governments, businesses, developers and the SNS users themselves.

  14. Worldwide Distribution of Cytochrome P450 Alleles: A Meta-analysis of Population-scale Sequencing Projects.

    PubMed

    Zhou, Y; Ingelman-Sundberg, M; Lauschke, V M

    2017-10-01

    Genetic polymorphisms in cytochrome P450 (CYP) genes can result in altered metabolic activity toward a plethora of clinically important medications. Thus, single nucleotide variants and copy number variations in CYP genes are major determinants of drug pharmacokinetics and toxicity and constitute pharmacogenetic biomarkers for drug dosing, efficacy, and safety. Strikingly, the distribution of CYP alleles differs considerably between populations with important implications for personalized drug therapy and healthcare programs. To provide a global distribution map of CYP alleles with clinical importance, we integrated whole-genome and exome sequencing data from 56,945 unrelated individuals of five major human populations. By combining this dataset with population-specific linkage information, we derive the frequencies of 176 CYP haplotypes, providing an extensive resource for major genetic determinants of drug metabolism. Furthermore, we aggregated this dataset into spectra of predicted functional variability in the respective populations and discuss the implications for population-adjusted pharmacological treatment strategies. © 2017 The Authors Clinical Pharmacology & Therapeutics published by Wiley Periodicals, Inc. on behalf of American Society for Clinical Pharmacology and Therapeutics.

  15. Comparative and quantitative proteomics reveal the adaptive strategies of oyster larvae to ocean acidification.

    PubMed

    Dineshram, R; Quan, Q; Sharma, Rakesh; Chandramouli, Kondethimmanahalli; Yalamanchili, Hari Krishna; Chu, Ivan; Thiyagarajan, Vengatesen

    2015-12-01

    Decreasing pH due to anthropogenic CO2 inputs, called ocean acidification (OA), can make coastal environments unfavorable for oysters. This is a serious socioeconomic issue for China which supplies >70% of the world's edible oysters. Here, we present an iTRAQ-based protein profiling approach for the detection and quantification of proteome changes under OA in the early life stage of a commercially important oyster, Crassostrea hongkongensis. Availability of complete genome sequence for the Pacific oyster (Crassostrea gigas) enabled us to confidently quantify over 1500 proteins in larval oysters. Over 7% of the proteome was altered in response to OA at pH(NBS) 7.6. Analysis of differentially expressed proteins and their associated functional pathways showed an upregulation of proteins involved in calcification, metabolic processes, and oxidative stress, each of which may be important in physiological adaptation of this species to OA. The downregulation of cytoskeletal and signal transduction proteins, on the other hand, might have impaired cellular dynamics and organelle development under OA. However, there were no significant detrimental effects in developmental processes such as metamorphic success. Implications of the differentially expressed proteins and metabolic pathways in the development of OA resistance in oyster larvae are discussed. The MS proteomics data have been deposited to the ProteomeXchange with identifiers PXD002138 (http://proteomecentral.proteomexchange.org/dataset/PXD002138). © 2015 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  16. Inflammatory Gene Regulatory Networks in Amnion Cells Following Cytokine Stimulation: Translational Systems Approach to Modeling Human Parturition

    PubMed Central

    Summerfield, Taryn L.; Yu, Lianbo; Gulati, Parul; Zhang, Jie; Huang, Kun; Romero, Roberto; Kniss, Douglas A.

    2011-01-01

    A majority of the studies examining the molecular regulation of human labor have been conducted using single gene approaches. While the technology to produce multi-dimensional datasets is readily available, the means for facile analysis of such data are limited. The objective of this study was to develop a systems approach to infer regulatory mechanisms governing global gene expression in cytokine-challenged cells in vitro, and to apply these methods to predict gene regulatory networks (GRNs) in intrauterine tissues during term parturition. To this end, microarray analysis was applied to human amnion mesenchymal cells (AMCs) stimulated with interleukin-1β, and differentially expressed transcripts were subjected to hierarchical clustering, temporal expression profiling, and motif enrichment analysis, from which a GRN was constructed. These methods were then applied to fetal membrane specimens collected in the absence or presence of spontaneous term labor. Analysis of cytokine-responsive genes in AMCs revealed a sterile immune response signature, with promoters enriched in response elements for several inflammation-associated transcription factors. In comparison to the fetal membrane dataset, there were 34 genes commonly upregulated, many of which were part of an acute inflammation gene expression signature. Binding motifs for nuclear factor-κB were prominent in the gene interaction and regulatory networks for both datasets; however, we found little evidence to support the utilization of pathogen-associated molecular pattern (PAMP) signaling. The tissue specimens were also enriched for transcripts governed by hypoxia-inducible factor. The approach presented here provides an uncomplicated means to infer global relationships among gene clusters involved in cellular responses to labor-associated signals. PMID:21655103

  17. Bi-Force: large-scale bicluster editing and its application to gene expression data biclustering

    PubMed Central

    Sun, Peng; Speicher, Nora K.; Röttger, Richard; Guo, Jiong; Baumbach, Jan

    2014-01-01

    The explosion of biological data has dramatically reformed today's biological research. The need to integrate and analyze high-dimensional biological data on a large scale is driving the development of novel bioinformatics approaches. Biclustering, also known as ‘simultaneous clustering’ or ‘co-clustering’, has been successfully utilized to discover local patterns in gene expression data and similar biomedical data types. Here, we contribute a new heuristic: ‘Bi-Force’. It is based on the weighted bicluster editing model, to perform biclustering on arbitrary sets of biological entities, given any kind of pairwise similarities. We first evaluated the power of Bi-Force to solve dedicated bicluster editing problems by comparing Bi-Force with two existing algorithms in the BiCluE software package. We then followed a biclustering evaluation protocol in a recent review paper from Eren et al. (2013) (A comparative analysis of biclustering algorithms for gene expression data. Brief. Bioinform., 14:279–292.) and compared Bi-Force against eight existing tools: FABIA, QUBIC, Cheng and Church, Plaid, BiMax, Spectral, xMOTIFs and ISA. To this end, a suite of synthetic datasets as well as nine large gene expression datasets from Gene Expression Omnibus were analyzed. All resulting biclusters were subsequently investigated by Gene Ontology enrichment analysis to evaluate their biological relevance. The distinct theoretical foundation of Bi-Force (bicluster editing) is more powerful than strict biclustering. We thus outperformed existing tools with Bi-Force at least when following the evaluation protocols from Eren et al. Bi-Force is implemented in Java and integrated into the open source software package of BiCluE. The software as well as all used datasets are publicly available at http://biclue.mpi-inf.mpg.de. PMID:24682815
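
    The weighted bicluster editing model mentioned above scores a candidate biclustering by how much the similarity data would have to be "edited" to agree with it. The sketch below computes such a cost under the usual formulation, in which a similarity above the threshold counts as an edge: it shows the objective only, not the Bi-Force heuristic itself, and the threshold and cluster encoding are assumptions for illustration.

```python
def editing_cost(sim, row_cluster, col_cluster, threshold):
    """Cost of turning the thresholded similarity graph into the
    disjoint bicliques described by the given cluster assignment.

    sim         -- matrix (list of lists) of row/column similarities
    row_cluster -- cluster id for each row entity (e.g. gene)
    col_cluster -- cluster id for each column entity (e.g. condition)
    threshold   -- similarities above this value count as edges
    """
    cost = 0.0
    for i, row in enumerate(sim):
        for j, s in enumerate(row):
            same = row_cluster[i] == col_cluster[j]
            if same and s < threshold:
                cost += threshold - s  # missing edge must be inserted
            elif not same and s > threshold:
                cost += s - threshold  # spurious edge must be deleted
    return cost
```

    A heuristic such as Bi-Force searches over cluster assignments for one with low cost; comparing the cost of two candidate assignments with this function shows which one fits the data better.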

  18. FGFR3, as a receptor tyrosine kinase, is associated with differentiated biological functions and improved survival of glioma patients.

    PubMed

    Wang, Zheng; Zhang, Chuanbao; Sun, Lihua; Liang, Jingshan; Liu, Xing; Li, Guanzhang; Yao, Kun; Zhang, Wei; Jiang, Tao

    2016-12-20

    Activation of receptor tyrosine kinases is common in malignancies. FGFR3 fusion with TACC3 has been reported to have transforming effects in primary glioblastoma and display oncogenic activity in vitro and in vivo. We set out to investigate the role of FGFR3 in glioma through transcriptomic analysis. FGFR3 increased in Classical subtype and Neural subtype consistently in CGGA and TCGA cohorts. Similar patterns of FGFR3 distribution through subtypes were observed in CGGA and TCGA samples. Gene ontology analysis was performed with genes that were significantly correlated with FGFR3 expression. We found that positively associated biological processes of FGFR3 were focused on differentiated cellular functions and neuronal activities, while negatively correlated biological processes focused on mitosis and cell cycle phase. Clinical investigation showed that higher FGFR3 expression predicted improved survival for glioma patients, especially in Proneural subtype. Moreover, FGFR3 showed very limited relevance with other receptor tyrosine kinases in glioma at transcriptome level. FGFR3 expression data of glioma was obtained from Chinese Glioma Genome Atlas (CGGA) and TCGA (The Cancer Genome Atlas). In total, RNA sequencing data of 325 glioma samples and mRNA microarray data of 301 samples from CGGA dataset were enrolled into this study. To consolidate the findings that we have revealed in CGGA dataset, RNA-seq data of 672 glioma samples from TCGA dataset were used as a validation cohort. R language was used as the main tool to perform statistical analysis and graphical work. FGFR3 expression increased in classical and neural subtypes and was associated with differentiated cellular functions. FGFR3 showed very limited correlation with other common receptor tyrosine kinases, and predicted improved survival for glioma patients.

  19. FGFR3, as a receptor tyrosine kinase, is associated with differentiated biological functions and improved survival of glioma patients

    PubMed Central

    Wang, Zheng; Zhang, Chuanbao; Sun, Lihua; Liang, Jingshan; Liu, Xing; Li, Guanzhang; Yao, Kun; Zhang, Wei; Jiang, Tao

    2016-01-01

    Background Activation of receptor tyrosine kinases is common in malignancies. FGFR3 fusion with TACC3 has been reported to have transforming effects in primary glioblastoma and display oncogenic activity in vitro and in vivo. We set out to investigate the role of FGFR3 in glioma through transcriptomic analysis. Results FGFR3 increased in Classical subtype and Neural subtype consistently in CGGA and TCGA cohorts. Similar patterns of FGFR3 distribution through subtypes were observed in CGGA and TCGA samples. Gene ontology analysis was performed with genes that were significantly correlated with FGFR3 expression. We found that positively associated biological processes of FGFR3 were focused on differentiated cellular functions and neuronal activities, while negatively correlated biological processes focused on mitosis and cell cycle phase. Clinical investigation showed that higher FGFR3 expression predicted improved survival for glioma patients, especially in Proneural subtype. Moreover, FGFR3 showed very limited relevance with other receptor tyrosine kinases in glioma at transcriptome level. Materials and Methods FGFR3 expression data of glioma was obtained from Chinese Glioma Genome Atlas (CGGA) and TCGA (The Cancer Genome Atlas). In total, RNA sequencing data of 325 glioma samples and mRNA microarray data of 301 samples from CGGA dataset were enrolled into this study. To consolidate the findings that we have revealed in CGGA dataset, RNA-seq data of 672 glioma samples from TCGA dataset were used as a validation cohort. R language was used as the main tool to perform statistical analysis and graphical work. Conclusions FGFR3 expression increased in classical and neural subtypes and was associated with differentiated cellular functions. FGFR3 showed very limited correlation with other common receptor tyrosine kinases, and predicted improved survival for glioma patients. PMID:27829236

  20. 3D Face Model Dataset: Automatic Detection of Facial Expressions and Emotions for Educational Environments

    ERIC Educational Resources Information Center

    Chickerur, Satyadhyan; Joshi, Kartik

    2015-01-01

    Emotion detection using facial images is a technique that researchers have been using for the last two decades to try to analyze a person's emotional state given his/her image. Detection of various kinds of emotion using facial expressions of students in educational environment is useful in providing insight into the effectiveness of tutoring…

  1. Immune Checkpoint Molecules on Tumor-Infiltrating Lymphocytes and Their Association with Tertiary Lymphoid Structures in Human Breast Cancer

    PubMed Central

    Solinas, Cinzia; Garaud, Soizic; De Silva, Pushpamali; Boisson, Anaïs; Van den Eynden, Gert; de Wind, Alexandre; Risso, Paolo; Rodrigues Vitória, Joel; Richard, François; Migliori, Edoardo; Noël, Grégory; Duvillier, Hugues; Craciun, Ligia; Veys, Isabelle; Awada, Ahmad; Detours, Vincent; Larsimont, Denis; Piccart-Gebhart, Martine; Willard-Gallo, Karen

    2017-01-01

    There is an exponentially growing interest in targeting immune checkpoint molecules in breast cancer (BC), particularly in the triple-negative subtype where unmet treatment needs remain. This study was designed to analyze the expression, localization, and prognostic role of PD-1, PD-L1, PD-L2, CTLA-4, LAG3, and TIM3 in primary BC. Gene expression analysis using the METABRIC microarray dataset found that all six immune checkpoint molecules are highly expressed in basal-like and HER2-enriched compared to the other BC molecular subtypes. Flow cytometric analysis of fresh tissue homogenates from untreated primary tumors showed that PD-1 is principally expressed on CD4+ or CD8+ T cells and CTLA-4 is expressed on CD4+ T cells. The global proportion of PD-L1+, PD-L2+, LAG3+, and TIM3+ tumor-infiltrating lymphocytes (TIL) was low and detectable in only a small number of tumors. Immunohistochemical staining of fixed tissues from the same tumors was employed to score TIL and tertiary lymphoid structures (TLS). PD-L1+, PD-L2+, LAG3+, and TIM3+ cells were detected in some TLS in a pattern that resembles secondary lymphoid organs. This observation suggests that TLS are important sites of immune activation and regulation, particularly in tumors with extensive baseline immune infiltration. Significantly improved overall survival was correlated with PD-1 expression in the HER2-enriched and PD-L1 or CTLA-4 expression in basal-like BC. PD-1 and CTLA-4 proteins were most frequently detected on TIL, which supports the correlations observed between their gene expression and improved long-term outcome in basal-like and HER2-enriched BC. PD-L1 expression by tumor or immune cells is uncommon in BC. Overall, the data presented here distinguish PD-1 as a marker of T cell activity in both the T and B cell areas of BC-associated TLS. We found that immune checkpoint molecule expression parallels the extent of TIL and TLS, although there is a noteworthy amount of heterogeneity between tumors even within the same molecular subtype. These data indicate that assessing the levels of immune checkpoint molecule expression in an individual patient has important implications for the success of therapeutically targeting them in BC. PMID:29163490

  2. Stormbow: A Cloud-Based Tool for Reads Mapping and Expression Quantification in Large-Scale RNA-Seq Studies

    PubMed Central

    Zhao, Shanrong; Prenger, Kurt; Smith, Lance

    2013-01-01

    RNA-Seq is becoming a promising replacement for microarrays in transcriptome profiling and differential gene expression studies. Technical improvements have decreased sequencing costs and, as a result, the size and number of RNA-Seq datasets have increased rapidly. However, the increasing volume of data from large-scale RNA-Seq studies poses a practical challenge for data analysis in a local environment. To meet this challenge, we developed Stormbow, a cloud-based software package, to process large volumes of RNA-Seq data in parallel. The performance of Stormbow was tested by applying it to analyse 178 RNA-Seq samples in the cloud. In our test, it took 6 to 8 hours to process an RNA-Seq sample with 100 million reads, and the average cost was $3.50 per sample. Using Amazon Web Services as the infrastructure for Stormbow allows us to easily scale up to handle large datasets with on-demand computational resources. Stormbow is a scalable, cost-effective tool for large-scale RNA-Seq data analysis, built on open-source software. Stormbow can be freely downloaded and can be used out of the box to process Illumina RNA-Seq datasets. PMID:25937948
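
    The per-sample figures in the abstract above can be turned into a rough budget for the whole test run. The sketch below is only back-of-the-envelope arithmetic: it assumes every one of the 178 samples costs the reported average and takes 6 to 8 hours, which the abstract states only for a sample with 100 million reads. The function name `estimate_run` is hypothetical and not part of Stormbow.

```python
def estimate_run(samples, cost_per_sample, hours_low, hours_high):
    """Return total cost and the serial (one-at-a-time) time range in hours.

    Assumption: every sample costs the same and takes between hours_low
    and hours_high hours; real runs vary with read depth.
    """
    total_cost = samples * cost_per_sample
    return total_cost, samples * hours_low, samples * hours_high


# Figures taken from the abstract: 178 samples, $3.50 each, 6 to 8 hours each.
total_cost, serial_low, serial_high = estimate_run(178, 3.50, 6, 8)
print(f"total cost: ${total_cost:.2f}")
print(f"serial processing time: {serial_low} to {serial_high} hours")
```

    The serial time range is the work that parallel processing spreads across machines; with one machine per sample, the wall-clock time would approach that of a single sample.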

  4. Inferring Time-Varying Network Topologies from Gene Expression Data

    PubMed Central

    2007-01-01

    Most current methods for gene regulatory network identification lead to the inference of steady-state networks, that is, networks prevalent over all times, a hypothesis which has been challenged. There has been a need to infer and represent networks in a dynamic, that is, time-varying fashion, in order to account for different cellular states affecting the interactions amongst genes. In this work, we present an approach, regime-SSM, to understand gene regulatory networks within such a dynamic setting. The approach uses a clustering method based on these underlying dynamics, followed by system identification using a state-space model for each learnt cluster—to infer a network adjacency matrix. We finally indicate our results on the mouse embryonic kidney dataset as well as the T-cell activation-based expression dataset and demonstrate conformity with reported experimental evidence. PMID:18309363

  5. Inferring time-varying network topologies from gene expression data.

    PubMed

    Rao, Arvind; Hero, Alfred O; States, David J; Engel, James Douglas

    2007-01-01

    Most current methods for gene regulatory network identification lead to the inference of steady-state networks, that is, networks prevalent over all times, a hypothesis which has been challenged. There has been a need to infer and represent networks in a dynamic, that is, time-varying fashion, in order to account for different cellular states affecting the interactions amongst genes. In this work, we present an approach, regime-SSM, to understand gene regulatory networks within such a dynamic setting. The approach uses a clustering method based on these underlying dynamics, followed by system identification using a state-space model for each learnt cluster--to infer a network adjacency matrix. We finally indicate our results on the mouse embryonic kidney dataset as well as the T-cell activation-based expression dataset and demonstrate conformity with reported experimental evidence.

  6. Application of machine learning on brain cancer multiclass classification

    NASA Astrophysics Data System (ADS)

    Panca, V.; Rustam, Z.

    2017-07-01

    Classification of brain cancer is a problem of multiclass classification. One approach to solving this problem is to first transform it into several binary problems. The microarray gene expression dataset has the two main characteristics of medical data: a very large number of features (genes) and only a small number of samples. The application of machine learning to microarray gene expression datasets mainly consists of two steps: feature selection and classification. In this paper, the features are selected using a method based on the support vector machine recursive feature elimination (SVM-RFE) principle, which is improved to solve multiclass classification and is called multiple multiclass SVM-RFE. Instead of using only the selected features on a single classifier, this method combines the results of multiple classifiers. The features are divided into subsets and SVM-RFE is used on each subset. Then, the selected features of each subset are put on separate classifiers. This method enhances the feature selection ability of each single SVM-RFE. Twin support vector machine (TWSVM) is used as the classifier to reduce computational complexity. While ordinary SVM finds a single optimum hyperplane, the main objective of Twin SVM is to find two non-parallel optimum hyperplanes. The experiment on the brain cancer microarray gene expression dataset shows that this method could classify 71.4% of the overall test data correctly, using 100 and 1000 genes selected by the multiple multiclass SVM-RFE feature selection method. Furthermore, the per-class results show that this method could classify data of the normal and MD classes with 100% accuracy.
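
    The abstract above mentions transforming a multiclass problem into several binary problems without saying which decomposition is used. The sketch below illustrates one common choice, one-vs-rest, purely as an assumption; the class name "other" and the classifier scores are made up for illustration, and no SVM is trained here.

```python
def one_vs_rest_labels(labels):
    """Build one binary (+1 / -1) relabelling of the samples per class."""
    classes = sorted(set(labels))
    return {c: [1 if y == c else -1 for y in labels] for c in classes}


def predict_multiclass(scores_by_class):
    """Pick the class whose binary classifier returned the highest score."""
    return max(scores_by_class, key=scores_by_class.get)


# Hypothetical sample labels; only "normal" and "MD" appear in the abstract.
labels = ["normal", "MD", "other", "MD", "normal"]
binary_problems = one_vs_rest_labels(labels)

# Hypothetical decision scores from three binary classifiers for one sample.
predicted = predict_multiclass({"normal": -0.2, "MD": 0.7, "other": 0.1})
print(binary_problems["MD"], predicted)
```

    Each entry of `binary_problems` could be handed to any binary classifier, and the per-class scores are then combined by taking the maximum.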

  7. Configurable pattern-based evolutionary biclustering of gene expression data

    PubMed Central

    2013-01-01

    Background Biclustering algorithms for microarray data aim at discovering functionally related gene sets under different subsets of experimental conditions. Due to the problem complexity and the characteristics of microarray datasets, heuristic searches are usually used instead of exhaustive algorithms. Also, the comparison among different techniques is still a challenge. The obtained results vary in relevant features such as the number of genes or conditions, which makes it difficult to carry out a fair comparison. Moreover, existing approaches do not allow the user to specify any preferences on these properties. Results Here, we present the first biclustering algorithm in which it is possible to particularize several bicluster features in terms of different objectives. This can be done by tuning the specified features in the algorithm or by incorporating new objectives into the search. Furthermore, our approach bases the bicluster evaluation on the use of expression patterns, and is able to recognize both shifting and scaling patterns, whether or not they occur simultaneously. Evolutionary computation has been chosen as the search strategy, and we therefore name our proposal Evo-Bexpa (Evolutionary Biclustering based in Expression Patterns). Conclusions We have conducted experiments on both synthetic and real datasets demonstrating the ability of Evo-Bexpa to obtain meaningful biclusters. Synthetic experiments have been designed in order to compare Evo-Bexpa performance with other approaches when looking for perfect patterns. Experiments with four different real datasets also confirm the good performance of our algorithm, whose results have been biologically validated through Gene Ontology. PMID:23433178
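
    The shifting and scaling patterns mentioned above are commonly understood as follows: every gene (row) in a perfect bicluster equals a shared pattern multiplied by a gene-specific scaling factor plus a gene-specific shift. The sketch below checks that property for a small matrix. It is an illustrative assumption about the definition, not the Evo-Bexpa evaluation function, and the name `fits_shift_scale` is hypothetical.

```python
def fits_shift_scale(rows, tol=1e-9):
    """Check whether every row equals alpha * rows[0] + beta for some
    row-specific alpha (scaling) and beta (shifting).

    Assumption: rows[0] is not constant, so it can serve as the pattern.
    """
    ref = rows[0]
    n = len(ref)
    mean_ref = sum(ref) / n
    var_ref = sum((v - mean_ref) ** 2 for v in ref)
    for row in rows[1:]:
        mean_row = sum(row) / n
        cov = sum((a - mean_ref) * (b - mean_row) for a, b in zip(ref, row))
        alpha = cov / var_ref              # least-squares scaling factor
        beta = mean_row - alpha * mean_ref  # least-squares shift
        if any(abs(alpha * a + beta - b) > tol for a, b in zip(ref, row)):
            return False
    return True


perfect = [[1.0, 2.0, 4.0], [3.0, 5.0, 9.0], [0.5, 1.0, 2.0]]
noisy = [[1.0, 2.0, 4.0], [1.0, 3.0, 4.0]]
print(fits_shift_scale(perfect), fits_shift_scale(noisy))
```

    In `perfect`, the second row is twice the first plus one and the third row is half the first, so both scaling and shifting are present at once.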

  8. Discovering motion primitives for unsupervised grouping and one-shot learning of human actions, gestures, and expressions.

    PubMed

    Yang, Yang; Saleemi, Imran; Shah, Mubarak

    2013-07-01

    This paper proposes a novel representation of articulated human actions and gestures and facial expressions. The main goals of the proposed approach are: 1) to enable recognition using very few examples, i.e., one or k-shot learning, and 2) meaningful organization of unlabeled datasets by unsupervised clustering. Our proposed representation is obtained by automatically discovering high-level subactions or motion primitives, by hierarchical clustering of observed optical flow in four-dimensional, spatial, and motion flow space. The completely unsupervised proposed method, in contrast to state-of-the-art representations like bag of video words, provides a meaningful representation conducive to visual interpretation and textual labeling. Each primitive action depicts an atomic subaction, like directional motion of a limb or the torso, and is represented by a mixture of four-dimensional Gaussian distributions. For one-shot and k-shot learning, the sequence of primitive labels discovered in a test video is labeled using KL divergence, and can then be represented as a string and matched against similar strings of training videos. The same sequence can also be collapsed into a histogram of primitives or be used to learn a Hidden Markov model to represent classes. We have performed extensive experiments on recognition by one and k-shot learning as well as unsupervised action clustering on six human actions and gesture datasets, a composite dataset, and a database of facial expressions. These experiments confirm the validity and discriminative nature of the proposed representation.

  9. Network information improves cancer outcome prediction.

    PubMed

    Roy, Janine; Winter, Christof; Isik, Zerrin; Schroeder, Michael

    2014-07-01

    Disease progression in cancer can vary substantially between patients. Yet, patients often receive the same treatment. Recently, there has been much work on predicting disease progression and patient outcome variables from gene expression in order to personalize treatment options. Despite first diagnostic kits in the market, there are open problems such as the choice of random gene signatures or noisy expression data. One approach to deal with these two problems employs protein-protein interaction networks and ranks genes using the random surfer model of Google's PageRank algorithm. In this work, we created a benchmark dataset collection comprising 25 cancer outcome prediction datasets from literature and systematically evaluated the use of networks and a PageRank derivative, NetRank, for signature identification. We show that the NetRank performs significantly better than classical methods such as fold change or t-test. Despite an order of magnitude difference in network size, a regulatory and protein-protein interaction network perform equally well. Experimental evaluation on cancer outcome prediction in all of the 25 underlying datasets suggests that the network-based methodology identifies highly overlapping signatures over all cancer types, in contrast to classical methods that fail to identify highly common gene sets across the same cancer types. Integration of network information into gene expression analysis allows the identification of more reliable and accurate biomarkers and provides a deeper understanding of processes occurring in cancer development and progression. © The Author 2012. Published by Oxford University Press.
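
    The idea of ranking genes with a random-surfer model on an interaction network can be sketched with a small power iteration. The code below is a generic PageRank-style illustration under stated assumptions, not the published NetRank implementation: the toy network, the per-gene scores, the damping value and the function name `network_rank` are all hypothetical.

```python
def network_rank(adjacency, scores, damping=0.3, iterations=100):
    """PageRank-style ranking on an undirected network.

    adjacency: dict mapping each gene to a list of its neighbours
               (assumed symmetric, every gene has at least one neighbour).
    scores:    dict mapping each gene to a non-negative score of its own,
               e.g. an association with outcome (hypothetical values here).
    """
    genes = list(adjacency)
    total = sum(scores[g] for g in genes)
    base = {g: scores[g] / total for g in genes}  # normalised own score
    rank = dict(base)
    for _ in range(iterations):
        new_rank = {}
        for g in genes:
            # Each neighbour passes on its rank, split evenly over its links.
            incoming = sum(rank[n] / len(adjacency[n]) for n in adjacency[g])
            new_rank[g] = (1 - damping) * base[g] + damping * incoming
        rank = new_rank
    return rank


# Toy star-shaped network: B interacts with A, C and D.
adjacency = {"A": ["B"], "B": ["A", "C", "D"], "C": ["B"], "D": ["B"]}
scores = {"A": 0.1, "B": 0.1, "C": 0.7, "D": 0.1}
rank = network_rank(adjacency, scores)
print(sorted(rank, key=rank.get, reverse=True))
```

    In this toy case, gene B has the same raw score as A and D but ends up ranked above them because it is connected to the high-scoring gene C, which is the kind of effect network information is meant to contribute.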

  10. EG-09 EPIGENETIC PROFILING REVEALS A CpG HYPERMETHYLATION PHENOTYPE (CIMP) ASSOCIATED WITH WORSE PROGRESSION-FREE SURVIVAL IN MENINGIOMA

    PubMed Central

    Olar, Adriana; Wani, Khalida; Mansouri, Alireza; Zadeh, Gelareh; Wilson, Charmaine; DeMonte, Franco; Fuller, Gregory; Jones, David; Pfister, Stefan; von Deimling, Andreas; Sulman, Erik; Aldape, Kenneth

    2014-01-01

    BACKGROUND: Methylation profiling of solid tumors has revealed biologic subtypes, often with clinical implications. Methylation profiles of meningioma and their clinical implications are not well understood. METHODS: Ninety-two meningioma samples (n = 44 test set and n = 48 validation set) were profiled using the Illumina HumanMethylation450 BeadChip. Unsupervised clustering and analyses for recurrence-free survival (RFS) were performed. RESULTS: Unsupervised clustering of the test set using approximately 900 highly variable markers identified two clearly defined methylation subgroups. One of the groups (n = 19) showed global hypermethylation of a set of markers, analogous to CpG island methylator phenotype (CIMP). These findings were reproducible in the validation set, with 18/48 samples showing the CIMP-positive phenotype. Importantly, of 347 highly variable markers common to both the test and validation set analyses, 107 defined CIMP in the test set and 94 defined CIMP in the validation set, with an overlap of 83 markers between the two datasets. This number is much greater than expected by chance, indicating reproducibility of the hypermethylated markers that define CIMP in meningioma. With respect to clinical correlation, the 37 CIMP-positive cases displayed significantly shorter RFS compared to the 55 non-CIMP cases (hazard ratio 2.9, p = 0.013). In an effort to develop a preliminary outcome predictor, a 155-marker subset correlated with RFS was identified in the test dataset. When interrogated in the validation dataset, this 155-marker subset showed a statistical trend (p < 0.1) towards distinguishing survival groups. CONCLUSIONS: This study defines the existence of a CIMP phenotype in meningioma, which involves a substantial proportion (37/92, 40%) of samples with clinical implications. Ongoing work will expand this cohort and examine identification of additional biologic differences (mutational and DNA copy number analysis) to further characterize the aberrant methylation subtype in meningioma. CIMP-positivity with aberrant methylation in recurrent/malignant meningioma suggests a potential therapeutic target for clinically aggressive cases.
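
    The statement that an overlap of 83 markers is much greater than expected by chance can be illustrated with a hypergeometric calculation on the counts given in the abstract (347 shared markers, 107 and 94 CIMP-defining markers, 83 in common). This is a sketch under the assumption that, by chance, the two marker sets would be independent uniform random draws from the 347 markers; the abstract does not say which test the authors used, and the function name `hypergeom_tail` is hypothetical.

```python
from math import comb


def hypergeom_tail(population, set_a, set_b, overlap):
    """P(X >= overlap) when set_b markers are drawn at random, without
    replacement, from a population in which set_a markers are 'marked'."""
    total = comb(population, set_b)
    upper = min(set_a, set_b)
    favourable = sum(
        comb(set_a, k) * comb(population - set_a, set_b - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / total


# Counts taken from the abstract.
expected_overlap = 107 * 94 / 347
p_value = hypergeom_tail(347, 107, 94, 83)
print(f"expected overlap by chance: {expected_overlap:.1f} markers")
print(f"P(overlap >= 83) under random selection: {p_value:.3g}")
```

    Under this null model the expected overlap is roughly 29 markers, far below the 83 observed, which is consistent with the abstract's claim.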

  11. Identification of ELF3 as an early transcriptional regulator of human urothelium.

    PubMed

    Böck, Matthias; Hinley, Jennifer; Schmitt, Constanze; Wahlicht, Tom; Kramer, Stefan; Southgate, Jennifer

    2014-02-15

    Despite major advances in high-throughput and computational modelling techniques, understanding of the mechanisms regulating tissue specification and differentiation in higher eukaryotes, particularly man, remains limited. Microarray technology has been explored exhaustively in recent years and several standard approaches have been established to analyse the resultant datasets on a genome-wide scale. Gene expression time series offer a valuable opportunity to define temporal hierarchies and gain insight into the regulatory relationships of biological processes. However, unless datasets are exactly synchronous, time points cannot be compared directly. Here we present a data-driven analysis of regulatory elements from a microarray time series that tracked the differentiation of non-immortalised normal human urothelial (NHU) cells grown in culture. The datasets were obtained by harvesting differentiating and control cultures from finite bladder- and ureter-derived NHU cell lines at different time points using two previously validated, independent differentiation-inducing protocols. Due to the asynchronous nature of the data, a novel ranking analysis approach was adopted whereby we compared changes in the amplitude of experiment and control time series to identify common regulatory elements. Our approach offers a simple, fast and effective ranking method for genes that can be applied to other time series. The analysis identified ELF3 as a candidate transcriptional regulator involved in human urothelial cytodifferentiation. Differentiation-associated expression of ELF3 was confirmed in cell culture experiments and by immunohistochemical demonstration in situ. The importance of ELF3 in urothelial differentiation was verified by knockdown in NHU cells, which led to reduced expression of FOXA1 and GRHL3 transcription factors in response to PPARγ activation. 
    The consequences of this were seen in the repressed expression of the late/terminal differentiation-associated uroplakin 3a gene and in the compromised development and regeneration of urothelial barrier function. Copyright © 2014 Elsevier Inc. All rights reserved.

  12. Using the Positive and Negative Syndrome Scale (PANSS) to Define Different Domains of Negative Symptoms: Prediction of Everyday Functioning by Impairments in Emotional Expression and Emotional Experience.

    PubMed

    Harvey, Philip D; Khan, Anzalee; Keefe, Richard S E

    2017-12-01

    Background: Reduced emotional experience and expression are two domains of negative symptoms. The authors assessed these two domains of negative symptoms using previously developed Positive and Negative Syndrome Scale (PANSS) factors. Using an existing dataset, the authors predicted three different elements of everyday functioning (social, vocational, and everyday activities) with these two factors, as well as with performance on measures of functional capacity. Methods: A large (n=630) sample of people with schizophrenia was used as the data source of this study. Using regression analyses, the authors predicted the three different aspects of everyday functioning, first with just the two Positive and Negative Syndrome Scale factors and then with a global negative symptom factor. Finally, we added neurocognitive performance and functional capacity as predictors. Results: The Positive and Negative Syndrome Scale reduced emotional experience factor accounted for 21 percent of the variance in everyday social functioning, while reduced emotional expression accounted for no variance. The total Positive and Negative Syndrome Scale negative symptom factor accounted for less variance (19%) than the reduced experience factor alone. The Positive and Negative Syndrome Scale expression factor accounted for, at most, one percent of the variance in any of the functional outcomes, with or without the addition of other predictors. Implications: Reduced emotional experience measured with the Positive and Negative Syndrome Scale, often referred to as "avolition and anhedonia," specifically predicted impairments in social outcomes. Further, reduced experience predicted social impairments better than emotional expression or the total Positive and Negative Syndrome Scale negative symptom factor. 
    In this cross-sectional study, reduced emotional experience was specifically related to social outcomes, accounting for essentially no variance in work or everyday activities, and being the sole meaningful predictor of impairment in social outcomes.

  13. Spatial aspects of building and population exposure data and their implications for global earthquake exposure modeling

    USGS Publications Warehouse

    Dell’Acqua, F.; Gamba, P.; Jaiswal, K.

    2012-01-01

    This paper discusses spatial aspects of the global exposure dataset and mapping needs for earthquake risk assessment. We discuss this in the context of development of a Global Exposure Database for the Global Earthquake Model (GED4GEM), which requires compilation of a multi-scale inventory of assets at risk, for example, buildings, populations, and economic exposure. After defining the relevant spatial and geographic scales of interest, different procedures are proposed to disaggregate coarse-resolution data, to map them, and if necessary to infer missing data by using proxies. We discuss the advantages and limitations of these methodologies and detail the potentials of utilizing remote-sensing data. The latter is used especially to homogenize an existing coarser dataset and, where possible, replace it with detailed information extracted from remote sensing using the built-up indicators for different environments. Present research shows that the spatial aspects of earthquake risk computation are tightly connected with the availability of datasets of the resolution necessary for producing sufficiently detailed exposure. The global exposure database designed by the GED4GEM project is able to manage datasets and queries of multiple spatial scales.

  14. Isoform-level gene expression patterns in single-cell RNA-sequencing data.

    PubMed

    Vu, Trung Nghia; Wills, Quin F; Kalari, Krishna R; Niu, Nifang; Wang, Liewei; Pawitan, Yudi; Rantalainen, Mattias

    2018-02-27

    RNA sequencing of single cells enables characterization of transcriptional heterogeneity in seemingly homogeneous cell populations. Single-cell sequencing has been applied in a wide range of research fields. However, few studies have focused on characterization of isoform-level expression patterns at the single-cell level. In this study we propose and apply a novel method, ISOform-Patterns (ISOP), based on mixture modeling, to characterize the expression patterns of isoform pairs from the same gene in single-cell isoform-level expression data. We define six principal patterns of isoform expression relationships and describe a method for differential-pattern analysis. We demonstrate ISOP through analysis of single-cell RNA-sequencing data from a breast cancer cell line, with replication in three independent datasets. We assigned the pattern types to each of 16,562 isoform-pairs from 4,929 genes. Among those, 26% of the discovered patterns were significant (p<0.05), while the remaining patterns are possibly effects of transcriptional bursting, drop-out and stochastic biological heterogeneity. Furthermore, 32% of genes discovered through differential-pattern analysis were not detected by differential-expression analysis. The effect of drop-out events, mean expression level, and properties of the expression distribution on the performance of ISOP was also investigated through simulated datasets. To conclude, ISOP provides a novel approach for characterization of isoform-level preference, commitment and heterogeneity in single-cell RNA-sequencing data. The ISOP method has been implemented as an R package and is available at https://github.com/nghiavtr/ISOP under a GPL-3 license. Contact: mattias.rantalainen@ki.se. Supplementary data are available at Bioinformatics online.

  15. A Leveraged Signal-to-Noise Ratio (LSTNR) Method to Extract Differentially Expressed Genes and Multivariate Patterns of Expression From Noisy and Low-Replication RNAseq Data

    PubMed Central

    Lozoya, Oswaldo A.; Santos, Janine H.; Woychik, Richard P.

    2018-01-01

    To life scientists, one important feature offered by RNAseq, a next-generation sequencing tool used to estimate changes in gene expression levels, lies in its unprecedented resolution. It can score countable differences in transcript numbers among thousands of genes and between experimental groups, all at once. However, its high cost limits experimental designs to very small sample sizes, usually N = 3, which often results in statistically underpowered analysis and poor reproducibility. All these issues are compounded by the presence of experimental noise, which is harder to distinguish from instrumental error when sample sizes are limiting (e.g., small-budget pilot tests), experimental populations exhibit biologically heterogeneous or diffuse expression phenotypes (e.g., patient samples), or when discriminating among transcriptional signatures of closely related experimental conditions (e.g., toxicological modes of action, or MOAs). Here, we present a leveraged signal-to-noise ratio (LSTNR) thresholding method, founded on generalized linear modeling (GLM) of aligned read detection limits to extract differentially expressed genes (DEGs) from noisy low-replication RNAseq data. The LSTNR method uses an agnostic independent filtering strategy to define the dynamic range of detected aggregate read counts per gene, and assigns statistical weights that prioritize genes with better sequencing resolution in differential expression analyses. 
    To assess its performance, we implemented the LSTNR method to analyze three separate datasets: first, using a systematically noisy in silico dataset, we demonstrated that LSTNR can extract pre-designed patterns of expression and discriminate between “noise” and “true” differentially expressed pseudogenes at a 100% success rate; then, we illustrated how the LSTNR method can assign patient-derived breast cancer specimens correctly to one out of their four reported molecular subtypes (luminal A, luminal B, Her2-enriched and basal-like); and last, we showed the ability to retrieve five different modes of action (MOA) elicited in livers of rats exposed to three toxicants under three nutritional routes by using the LSTNR method. By combining differential measurements with resolving power to detect DEGs, the LSTNR method offers an alternative approach to interrogate noisy and low-replication RNAseq datasets, which handles multiple biological conditions at once, and defines benchmarks to validate RNAseq experiments with standard benchtop assays. PMID:29868123

  16. From conservation genetics to conservation genomics: a genome-wide assessment of blue whales (Balaenoptera musculus) in Australian feeding aggregations

    PubMed Central

    Sandoval-Castillo, Jonathan; Jenner, K. Curt S.; Gill, Peter C.; Jenner, Micheline-Nicole M.; Morrice, Margaret G.

    2018-01-01

    Genetic datasets of tens of markers have been superseded through next-generation sequencing technology with genome-wide datasets of thousands of markers. Genomic datasets improve our power to detect low population structure and identify adaptive divergence. The increased population-level knowledge can inform the conservation management of endangered species, such as the blue whale (Balaenoptera musculus). In Australia, there are two known feeding aggregations of the pygmy blue whale (B. m. brevicauda) which have shown no evidence of genetic structure based on a small dataset of 10 microsatellites and mtDNA. Here, we develop and implement a high-resolution dataset of 8294 genome-wide filtered single nucleotide polymorphisms, the first of its kind for blue whales. We use these data to assess whether the Australian feeding aggregations constitute one population and to test for the first time whether there is adaptive divergence between the feeding aggregations. We found no evidence of neutral population structure and negligible evidence of adaptive divergence. We propose that individuals likely travel widely between feeding areas and to breeding areas, which would require them to be adapted to a wide range of environmental conditions. This has important implications for their conservation as this blue whale population is likely vulnerable to a range of anthropogenic threats both off Australia and elsewhere. PMID:29410806

  17. How does spatial extent of fMRI datasets affect independent component analysis decomposition?

    PubMed

    Aragri, Adriana; Scarabino, Tommaso; Seifritz, Erich; Comani, Silvia; Cirillo, Sossio; Tedeschi, Gioacchino; Esposito, Fabrizio; Di Salle, Francesco

    2006-09-01

    Spatial independent component analysis (sICA) of functional magnetic resonance imaging (fMRI) time series can generate meaningful activation maps and associated descriptive signals, which are useful to evaluate datasets of the entire brain or selected portions of it. Besides computational implications, variations in the input dataset combined with the multivariate nature of ICA may lead to different spatial or temporal readouts of brain activation phenomena. By reducing and increasing a volume of interest (VOI), we applied sICA to different datasets from real activation experiments with multislice acquisition and single or multiple sensory-motor task-induced blood oxygenation level-dependent (BOLD) signal sources with different spatial and temporal structure. Using receiver operating characteristics (ROC) methodology for accuracy evaluation and multiple regression analysis as benchmark, we compared sICA decompositions of reduced and increased VOI fMRI time-series containing auditory, motor and hemifield visual activation occurring separately or simultaneously in time. Both approaches yielded valid results; however, the results of the increased VOI approach were spatially more accurate compared to the results of the decreased VOI approach. This is consistent with the capability of sICA to take advantage of extended samples of statistical observations and suggests that sICA is more powerful with extended rather than reduced VOI datasets to delineate brain activity. (c) 2006 Wiley-Liss, Inc.

  18. Genome-wide analysis of endogenously expressed ZEB2 binding sites reveals inverse correlations between ZEB2 and GalNAc-transferase GALNT3 in human tumors.

    PubMed

    Balcik-Ercin, Pelin; Cetin, Metin; Yalim-Camci, Irem; Odabas, Gorkem; Tokay, Nurettin; Sayan, A Emre; Yagci, Tamer

    2018-03-07

    ZEB2 is a transcriptional repressor that regulates epithelial-to-mesenchymal transition (EMT) through binding to bipartite E-box motifs in gene regulatory regions. Despite the abundant presence of E-boxes within the human genome and the multiplicity of pathophysiological processes regulated during ZEB2-induced EMT, only a small fraction of ZEB2 targets has been identified so far. Hence, we explored genome-wide ZEB2 binding by chromatin immunoprecipitation-sequencing (ChIP-seq) under endogenous ZEB2 expression conditions. For ChIP-Seq we used an anti-ZEB2 monoclonal antibody, clone 6E5, in SNU398 hepatocellular carcinoma cells exhibiting a high endogenous ZEB2 expression. The ChIP-Seq targets were validated using ChIP-qPCR, whereas ZEB2-dependent expression of target genes was assessed by RT-qPCR and Western blotting in shRNA-mediated ZEB2 silenced SNU398 cells and doxycycline-induced ZEB2 overexpressing colorectal carcinoma DLD1 cells. Changes in target gene expression were also assessed using primary human tumor cDNA arrays in conjunction with RT-qPCR. Additional differential expression and correlation analyses were performed using expO and Human Protein Atlas datasets. Over 500 ChIP-Seq positive genes were annotated, and intervals related to these genes were found to include the ZEB2 binding motif CACCTG according to TOMTOM motif analysis in the MEME Suite database. Assessment of ZEB2-dependent expression of target genes in ZEB2-silenced SNU398 cells and ZEB2-induced DLD1 cells revealed that the GALNT3 gene serves as a ZEB2 target with the highest, but inversely correlated, expression level. Remarkably, GALNT3 also exhibited the highest enrichment in the ChIP-qPCR validation assays. Through the analyses of primary tumor cDNA arrays and expO datasets a significant differential expression and a significant inverse correlation between ZEB2 and GALNT3 expression were detected in most of the tumors. 
We also explored ZEB2 and GALNT3 protein expression using the Human Protein Atlas dataset and, again, observed an inverse correlation in all analyzed tumor types, except malignant melanoma. In contrast to a generally negative or weak ZEB2 expression, we found that most tumor tissues exhibited a strong or moderate GALNT3 expression. Our observation that ZEB2 negatively regulates a GalNAc-transferase (GALNT3) that is involved in O-glycosylation adds another layer of complexity to the role of ZEB2 in cancer progression and metastasis. Proteins glycosylated by GALNT3 may be exploited as novel diagnostics and/or therapeutic targets.

  19. An RNA-Seq based gene expression atlas of the common bean.

    PubMed

    O'Rourke, Jamie A; Iniguez, Luis P; Fu, Fengli; Bucciarelli, Bruna; Miller, Susan S; Jackson, Scott A; McClean, Philip E; Li, Jun; Dai, Xinbin; Zhao, Patrick X; Hernandez, Georgina; Vance, Carroll P

    2014-10-06

    Common bean (Phaseolus vulgaris) is grown throughout the world and comprises roughly 50% of the grain legumes consumed worldwide. Despite this, genetic resources for common beans have been lacking. Next generation sequencing has facilitated our investigation of the gene expression profiles associated with biologically important traits in common bean. An increased understanding of gene expression in common bean will improve our understanding of gene expression patterns in other legume species. Combining recently developed genomic resources for Phaseolus vulgaris, including predicted gene calls, with RNA-Seq technology, we measured the gene expression patterns from 24 samples collected from seven tissues at developmentally important stages and from three nitrogen treatments. Gene expression patterns throughout the plant were analyzed to better understand changes due to nodulation, seed development, and nitrogen utilization. We have identified 11,010 genes differentially expressed with a fold change ≥ 2 and a P-value < 0.05 between different tissues at the same time point, 15,752 genes differentially expressed within a tissue due to changes in development, and 2,315 genes expressed only in a single tissue. These analyses identified 2,970 genes with expression patterns that appear to be directly dependent on the source of available nitrogen. Finally, we have assembled these data in a publicly available database, The Phaseolus vulgaris Gene Expression Atlas (Pv GEA), http://plantgrn.noble.org/PvGEA/ . Using the website, researchers can query gene expression profiles of their gene of interest, search for genes expressed in different tissues, or download the dataset in a tabular form. These data provide the basis for a gene expression atlas, which will facilitate functional genomic studies in common bean. 
Analysis of this dataset has identified genes important in regulating seed composition and has increased our understanding of nodulation and impact of the nitrogen source on assimilation and distribution throughout the plant.
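
    The thresholds quoted in the abstract above (fold change ≥ 2 and P-value < 0.05) can be illustrated with a minimal sketch. It assumes the fold change is counted in either direction and that a P-value has already been computed for each gene; the gene names and numbers are hypothetical and are not taken from the Pv GEA.

```python
import math


def is_differentially_expressed(expr_a, expr_b, p_value, min_fold=2.0, alpha=0.05):
    """Return True when the fold change between two conditions is at least
    min_fold (up or down, an assumption) and the P-value is below alpha."""
    if expr_a <= 0 or expr_b <= 0:
        raise ValueError("expression values must be positive")
    log2_fold_change = math.log2(expr_b / expr_a)
    return abs(log2_fold_change) >= math.log2(min_fold) and p_value < alpha


# Hypothetical records: (gene id, expression in tissue A, expression in tissue B, P-value)
records = [
    ("gene_1", 10.0, 25.0, 0.01),   # 2.5-fold up, significant
    ("gene_2", 10.0, 15.0, 0.001),  # only 1.5-fold, fails the fold-change filter
    ("gene_3", 40.0, 10.0, 0.20),   # 4-fold down, fails the P-value filter
    ("gene_4", 40.0, 10.0, 0.03),   # 4-fold down, significant
]

selected = [gene for gene, a, b, p in records if is_differentially_expressed(a, b, p)]
print(selected)
```

    Working on the log2 scale treats up- and down-regulation symmetrically, which is why a 4-fold decrease passes the same filter as a 4-fold increase.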

  20. Genotype-based gene signature of glioma risk.

    PubMed

    Huang, Yen-Tsung; Zhang, Yi; Wu, Zhijin; Michaud, Dominique S

    2017-07-01

    Glioma accounts for 80% of malignant brain tumors, but its etiologic determinants remain elusive. Despite genetic susceptibility loci identified by genome-wide association study (GWAS), the agnostic approach leaves open the possibility that other susceptibility genes remain to be discovered. Here we conduct a gene-centric integrative GWAS (iGWAS) of glioma risk that combines transcriptomics and genetics. We synthesized a brain transcriptomics dataset (n = 354), a GWAS dataset (n = 4203), and an advanced glioma tumor transcriptomic dataset (n = 483) to conduct an iGWAS. Using the expression quantitative trait loci (eQTL) dataset, we built models to predict gene expression for the GWAS data, based on eQTL genotypes. With the predicted gene expression, iGWAS analyses were performed using a novel statistical method. Gene signature risk score was constructed using a penalized logistic regression model. A total of 30527 transcripts were analyzed using the iGWAS approach. Four novel glioma susceptibility genes were identified with internal and external validation, including DRD5 (P = 3.0 × 10^-79), WDR1 (P = 8.4 × 10^-77), NOMO1 (P = 1.3 × 10^-25), and PDXDC1 (P = 8.3 × 10^-24). The genotype-predicted transcription pattern between cases and controls is consistent with that between tumor and its matched normal tissue. The genotype-based 4-gene signature improved the classification between glioma cases and controls based on age, gender, and population stratification, with area under the receiver operating characteristic curve increasing from 0.77 to 0.85 (P = 8.1 × 10^-23). A new genotype-based gene signature of glioma was identified using a novel iGWAS approach, which integrates multiplatform genomic data as well as different genetic association studies. © The Author(s) 2017. Published by Oxford University Press on behalf of the Society for Neuro-Oncology. All rights reserved.
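
    The area under the receiver operating characteristic curve reported above can be read as the probability that a randomly chosen case receives a higher risk score than a randomly chosen control. The sketch below computes it that way; the scores and labels are hypothetical and do not reproduce the study's risk model.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of cases and controls.

    labels: 1 for a case, 0 for a control. Tied scores count as half a win.
    """
    case_scores = [s for s, y in zip(scores, labels) if y == 1]
    control_scores = [s for s, y in zip(scores, labels) if y == 0]
    if not case_scores or not control_scores:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical risk scores for three cases followed by three controls
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]
auc = roc_auc(scores, labels)
print(round(auc, 3))
```

    The pairwise form is quadratic in the number of samples; it is chosen here for readability rather than speed.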

  1. Functional networks inference from rule-based machine learning models.

    PubMed

    Lazzarini, Nicola; Widera, Paweł; Williamson, Stuart; Heer, Rakesh; Krasnogor, Natalio; Bacardit, Jaume

    2016-01-01

    Functional networks play an important role in the analysis of biological processes and systems. The inference of these networks from high-throughput (-omics) data is an area of intense research. So far, the similarity-based inference paradigm (e.g. gene co-expression) has been the most popular approach. It assumes a functional relationship between genes which are expressed at similar levels across different samples. An alternative to this paradigm is the inference of relationships from the structure of machine learning models. These models are able to capture complex relationships between variables that are often different from, or complementary to, those captured by similarity-based methods. We propose a protocol to infer functional networks from machine learning models, called FuNeL. It assumes that genes used together within a rule-based machine learning model to classify the samples might also be functionally related at a biological level. The protocol is first tested on synthetic datasets and then evaluated on a test suite of 8 real-world datasets related to human cancer. The networks inferred from the real-world data are compared against gene co-expression networks of equal size, generated with 3 different methods. The comparison is performed from two different points of view. We analyse the enriched biological terms in the set of network nodes and the relationships between known disease-associated genes in a context of the network topology. The comparison confirms both the biological relevance and the complementary character of the knowledge captured by the FuNeL networks in relation to similarity-based methods and demonstrates its potential to identify known disease associations as core elements of the network. Finally, using a prostate cancer dataset as a case study, we confirm that the biological knowledge captured by our method is relevant to the disease and consistent with the specialised literature and with an independent dataset not used in the inference process. 
The implementation of our network inference protocol is available at: http://ico2s.org/software/funel.html.

  2. Novel insights, challenges and practical implications of DOHaD-omics research.

    PubMed

    Hodyl, Nicolette A; Muhlhausler, Beverly

    2016-02-15

    Research investigating the developmental origins of health and disease (DOHaD) has never had the technology to investigate physiology in such a data-rich capacity and at such a microlevel as it does now. A symposium at the inaugural meeting of the DOHaD Society of Australia and New Zealand outlined the advantages and challenges of using "-omics" technologies in DOHaD research. DOHaD studies with -omics approaches to generate large, rich datasets were discussed. We discuss implications for policy and practice and make recommendations to facilitate successful translation of results of future DOHaD-omics studies.

  3. Corrigendum to "Three climatic cycles recorded in a loess-palaeosol sequence at Semlac (Romania)-Implications for dust accumulation in south-eastern Europe" [Quat. Sci. Rev. 154C (2016) 130-142]

    NASA Astrophysics Data System (ADS)

    Zeeden, C.; Kels, H.; Hambach, U.; Schulte, P.; Protze, J.; Eckmeier, E.; Marković, S. B.; Klasen, N.; Lehmkuhl, F.

    2018-05-01

    In the article 'Three climatic cycles recorded in a loess-palaeosol sequence at Semlac (Romania)-Implications for dust accumulation in south-eastern Europe' (Zeeden et al., 2016) we employed rock magnetic and grain size proxy data in combination with OSL- and correlative age models. The data and dating are combined to discuss glacial-interglacial paleoclimate variability in a Eurasian context. This dataset was also interpreted regarding the dust source in the eastern Carpathian (Middle Danube) Basin.

  4. Inferring the expression variability of human transposable element-derived exons by linear model analysis of deep RNA sequencing data.

    PubMed

    Zhang, Wensheng; Edwards, Andrea; Fan, Wei; Fang, Zhide; Deininger, Prescott; Zhang, Kun

    2013-08-28

    The exonization of transposable elements (TEs) has proven to be a significant mechanism for the creation of novel exons. Existing knowledge of the retention patterns of TE exons in mRNAs was mainly established by the analysis of Expressed Sequence Tag (EST) data and microarray data. This study seeks to validate and extend previous studies on the expression of TE exons by an integrative statistical analysis of high throughput RNA sequencing data. We collected 26 RNA-seq datasets spanning multiple tissues and cancer types. The exon-level digital expressions (indicating retention rates in mRNAs) were quantified by a double normalized measure, called the rescaled RPKM (Reads Per Kilobase of exon model per Million mapped reads). We analyzed the distribution profiles and the variability (across samples and between tissue/disease groups) of TE exon expressions, and compared them with those of other constitutive or cassette exons. We inferred the effects of four genomic factors, including the location, length, cognate TE family and TE nucleotide proportion (RTE, see Methods section) of a TE exon, on the exons' expression level and expression variability. We also investigated the biological implications of an assembly of highly-expressed TE exons. Our analysis confirmed prior studies from the following four aspects. First, with relatively high expression variability, most TE exons in mRNAs, especially those without exact counterparts in the UCSC RefSeq (Reference Sequence) gene tables, demonstrate low but still detectable expression levels in most tissue samples. Second, the TE exons in coding DNA sequences (CDSs) are less highly expressed than those in 3' (5') untranslated regions (UTRs). Third, the exons derived from chronologically ancient repeat elements, such as MIRs, tend to be highly expressed in comparison with those derived from younger TEs. 
Fourth, the previously observed negative relationship between the lengths of exons and the inclusion levels in transcripts is also true for exonized TEs. Furthermore, our study resulted in several novel findings. They include: (1) for the TE exons with non-zero expression and as shown in most of the studied biological samples, a high TE nucleotide proportion leads to their lower retention rates in mRNAs; (2) the considered genomic features (i.e. a continuous variable such as the exon length or a category indicator such as 3'UTR) influence the expression level and the expression variability (CV) of TE exons in an inverse manner; (3) not only the exons derived from Alu elements but also the exons from the TEs of other families were preferentially established in zinc finger (ZNF) genes.
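
    The two quantities named in the abstract above, RPKM and the coefficient of variation (CV), can be sketched as follows. This shows only the standard RPKM definition; the additional rescaling step that the authors call "rescaled RPKM" is not described in the abstract and is not reproduced here. Using the sample standard deviation for the CV is an assumption, and all numbers are hypothetical.

```python
import statistics


def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Standard RPKM: reads per kilobase of exon per million mapped reads."""
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and library size must be positive")
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


def coefficient_of_variation(values):
    """CV = sample standard deviation divided by the mean (sample SD is an assumption)."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / mean


# Hypothetical exon: 500 reads on a 2000 bp exon in a library of 10 million mapped reads
example_rpkm = rpkm(500, 2000, 10_000_000)
# Hypothetical expression of one exon across three samples
example_cv = coefficient_of_variation([10.0, 20.0, 30.0])
print(example_rpkm, example_cv)
```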

  5. ABC transporters and the proteasome complex are implicated in susceptibility to Stevens-Johnson syndrome and toxic epidermal necrolysis across multiple drugs.

    PubMed

    Nicoletti, Paola; Bansal, Mukesh; Lefebvre, Celine; Guarnieri, Paolo; Shen, Yufeng; Pe'er, Itsik; Califano, Andrea; Floratos, Aris

    2015-01-01

    Stevens-Johnson syndrome (SJS) and Toxic Epidermal Necrolysis (TEN) represent rare but serious adverse drug reactions (ADRs). Both are characterized by distinctive blistering lesions and significant mortality rates. While there is evidence for strong drug-specific genetic predisposition related to HLA alleles, recent genome wide association studies (GWAS) on European and Asian populations have failed to identify genetic susceptibility alleles that are common across multiple drugs. We hypothesize that this is a consequence of the low to moderate effect size of individual genetic risk factors. To test this hypothesis we developed Pointer, a new algorithm that assesses the aggregate effect of multiple low risk variants on a pathway using a gene set enrichment approach. A key advantage of our method is the capability to associate SNPs with genes by exploiting physical proximity as well as by using expression quantitative trait loci (eQTLs) that capture information about both cis- and trans-acting regulatory effects. We control for known bias-inducing aspects of enrichment based analyses, such as: 1) gene length, 2) gene set size, 3) presence of biologically related genes within the same linkage disequilibrium (LD) region, and, 4) genes shared among multiple gene sets. We applied this approach to publicly available SJS/TEN genome-wide genotype data and identified the ABC transporter and Proteasome pathways as potentially implicated in the genetic susceptibility of non-drug-specific SJS/TEN. We demonstrated that the innovative SNP-to-gene mapping phase of the method was essential in detecting the significant enrichment for those pathways. Analysis of an independent gene expression dataset provides supportive functional evidence for the involvement of Proteasome pathways in SJS/TEN cutaneous lesions. 
These results suggest that Pointer provides a useful framework for the integrative analysis of pharmacogenetic GWAS data, by increasing the power to detect aggregate effects of multiple low risk variants. The software is available for download at https://sourceforge.net/projects/pointergsa/.

  6. Predictors of needs for families of children with cerebral palsy.

    PubMed

    Almasri, Nihad A; O'Neil, Margaret; Palisano, Robert J

    2014-01-01

    This study examined child, family and service characteristics that predict the community, financial, family support and services needs of families of children with cerebral palsy (CP). CP is a non-progressive neurological condition caused by lesions in the central nervous system resulting in limitations in motor function and associated co-morbid conditions. Children with CP often require multiple health, rehabilitation, and community services. The aims were to identify risk and protective factors among predictors of needed resources and services (i.e. community, financial, family support) and to discuss implications for coordination of medical, rehabilitation, and community services for children with CP and their families. Secondary data analysis was conducted with a national dataset (n = 441) of mothers of children with CP. The average age of children was 10.7 years (SD = 4.5), and the children were distributed across the various Gross Motor Function Classification System levels. Four logistic regression models were conducted to examine predictive power of child, family and current service characteristics on needed resources and services. Limited child gross motor function was a risk factor (odds ratio (OR): 1.30-1.70) while perception of family-centered services (FCS) was a protective factor (OR: 0.57-0.63) in having the needs met. Mothers whose children with CP were able to walk, who reported strong family relationships, and who perceived services as need-oriented and family-centered expressed fewer needs for community, financial, family support and services resources. Implications for service providers are provided.
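
    As a reading aid for the odds ratios quoted above, the sketch below shows how a logistic regression coefficient maps to an odds ratio and how an odds ratio shifts a probability. The 50% baseline probability is purely hypothetical and is not reported in the study; only the odds ratios 1.30 and 0.57 come from the abstract.

```python
import math


def odds_ratio_from_coefficient(beta):
    """In logistic regression, the odds ratio for a one-unit change is exp(beta)."""
    return math.exp(beta)


def probability_after_odds_ratio(baseline_probability, odds_ratio):
    """Apply an odds ratio to a baseline probability and return the new probability."""
    if not 0.0 < baseline_probability < 1.0:
        raise ValueError("baseline probability must be strictly between 0 and 1")
    odds = baseline_probability / (1.0 - baseline_probability) * odds_ratio
    return odds / (1.0 + odds)


# Hypothetical baseline: 50% of families report a given need
baseline = 0.5
with_risk_factor = probability_after_odds_ratio(baseline, 1.30)        # lower end of the reported risk range
with_protective_factor = probability_after_odds_ratio(baseline, 0.57)  # lower end of the reported protective range
print(round(with_risk_factor, 3), round(with_protective_factor, 3))
```

    An odds ratio above 1 raises the probability relative to the baseline and one below 1 lowers it, but the size of the shift depends on the baseline, which is why the odds ratio alone does not give an absolute risk.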

  7. Categorisation of Colour Terms Using New Validation Tools: A Case Study and Implications

    PubMed Central

    Arbab, Shabnam; Brindle, Jonathan A.; Matusiak, Barbara S.; Klöckner, Christian A.

    2018-01-01

    This article elaborates on the results of a field experiment conducted among speakers of the Chakali language, spoken in northern Ghana. In the original study, the Color-aid Corporation Chart was used to perform the focal task in which consultants were asked to point at a single colour tile on the chart. However, data from the focal task could not be analysed since the Color-aid tiles had not yet been converted into numerical values set forth by the Commission internationale de l’éclairage (CIE). In this study, the full set of 314 Color-aid tiles was measured for chromaticity and converted into the CIE values at the Daylight Laboratory of the Norwegian University of Science and Technology. This article presents the conversion methodology and the results of the measurements, which are available in the Online Appendix. We argue that some visual-perception terms cannot be reliably ascribed to colour categories established by the Color-aid Corporation. This suggests that the ideophonic expressions in the dataset do not denote ‘colours’, as categorised in the Color-aid system, as it was impossible to average the consultants’ data into a CIE chromaticity diagram, illustrate the phenomena on the Natural Colour System (NCS) Circle and Triangle diagrams, and conduct a statistical analysis. One of the implications of this study is that a line between a visual-perception term and a colour term could be systematically established using a method with predefined categorical thresholds. PMID:29755718

  8. Categorisation of Colour Terms Using New Validation Tools: A Case Study and Implications.

    PubMed

    Arbab, Shabnam; Brindle, Jonathan A; Matusiak, Barbara S; Klöckner, Christian A

    2018-01-01

    This article elaborates on the results of a field experiment conducted among speakers of the Chakali language, spoken in northern Ghana. In the original study, the Color-aid Corporation Chart was used to perform the focal task in which consultants were asked to point at a single colour tile on the chart. However, data from the focal task could not be analysed since the Color-aid tiles had not yet been converted into numerical values set forth by the Commission internationale de l'éclairage (CIE). In this study, the full set of 314 Color-aid tiles was measured for chromaticity and converted into the CIE values at the Daylight Laboratory of the Norwegian University of Science and Technology. This article presents the conversion methodology and the results of the measurements, which are available in the Online Appendix. We argue that some visual-perception terms cannot be reliably ascribed to colour categories established by the Color-aid Corporation. This suggests that the ideophonic expressions in the dataset do not denote 'colours', as categorised in the Color-aid system, as it was impossible to average the consultants' data into a CIE chromaticity diagram, illustrate the phenomena on the Natural Colour System (NCS) Circle and Triangle diagrams, and conduct a statistical analysis. One of the implications of this study is that a line between a visual-perception term and a colour term could be systematically established using a method with predefined categorical thresholds.

  9. KCNN Genes that Encode Small-Conductance Ca2+-Activated K+ Channels Influence Alcohol and Drug Addiction.

    PubMed

    Padula, Audrey E; Griffin, William C; Lopez, Marcelo F; Nimitvilai, Sudarat; Cannady, Reginald; McGuier, Natalie S; Chesler, Elissa J; Miles, Michael F; Williams, Robert W; Randall, Patrick K; Woodward, John J; Becker, Howard C; Mulholland, Patrick J

    2015-07-01

    Small-conductance Ca(2+)-activated K(+) (KCa2) channels control neuronal excitability and synaptic plasticity, and have been implicated in substance abuse. However, it is unknown if genes that encode KCa2 channels (KCNN1-3) influence alcohol and drug addiction. In the present study, an integrative functional genomics approach shows that genetic datasets for alcohol, nicotine, and illicit drugs contain the family of KCNN genes. Alcohol preference and dependence QTLs contain KCNN2 and KCNN3, and Kcnn3 transcript levels in the nucleus accumbens (NAc) of genetically diverse BXD strains of mice predicted voluntary alcohol consumption. Transcript levels of Kcnn3 in the NAc negatively correlated with alcohol intake levels in BXD strains, and alcohol dependence enhanced the strength of this association. Microinjections of the KCa2 channel inhibitor apamin into the NAc increased alcohol intake in control C57BL/6J mice, while spontaneous seizures developed in alcohol-dependent mice following apamin injection. Consistent with this finding, alcohol dependence enhanced the intrinsic excitability of medium spiny neurons in the NAc core and reduced the function and protein expression of KCa2 channels in the NAc. Altogether, these data implicate the family of KCNN genes in alcohol, nicotine, and drug addiction, and identify KCNN3 as a mediator of voluntary and excessive alcohol consumption. KCa2.3 channels represent a promising novel target in the pharmacogenetic treatment of alcohol and drug addiction.

  10. Who, What, When, Where? Determining the Health Implications of Wildfire Smoke Exposure

    NASA Astrophysics Data System (ADS)

    Ford, B.; Lassman, W.; Gan, R.; Burke, M.; Pfister, G.; Magzamen, S.; Fischer, E. V.; Volckens, J.; Pierce, J. R.

    2016-12-01

    Exposure to poor air quality is associated with negative impacts on human health. A large natural source of PM in the western U.S. is from wildland fires. Accurately attributing health endpoints to wildland-fire smoke requires a determination of the exposed population. This is a difficult endeavor because most current methods for monitoring air quality are not at high temporal and spatial resolutions. Therefore, there is a growing effort to include multiple datasets and create blended products of smoke exposure that can exploit the strengths of each dataset. In this work, we combine model (WRF-Chem) simulations, NASA satellite (MODIS) observations, and in-situ surface monitors to improve exposure estimates. We will also introduce a social-media dataset of self-reported smoke/haze/pollution to improve population-level exposure estimates for the summer of 2015. Finally, we use these detailed exposure estimates in different epidemiologic study designs to provide an in-depth understanding of the role wildfire exposure plays on health outcomes.

  11. Super-delta: a new differential gene expression analysis procedure with robust data normalization.

    PubMed

    Liu, Yuhang; Zhang, Jinfeng; Qiu, Xing

    2017-12-21

    Normalization is an important data preparation step in gene expression analyses, designed to remove various kinds of systematic noise. Sample variance is greatly reduced after normalization, hence the power of subsequent statistical analyses is likely to increase. On the other hand, variance reduction is made possible by borrowing information across all genes, including differentially expressed genes (DEGs) and outliers, which will inevitably introduce some bias. This bias typically inflates type I error and can reduce statistical power in certain situations. In this study we propose a new differential expression analysis pipeline, dubbed super-delta, that consists of a multivariate extension of the global normalization and a modified t-test. A robust procedure is designed to minimize the bias introduced by DEGs in the normalization step. The modified t-test is derived based on asymptotic theory for hypothesis testing that suitably pairs with the proposed robust normalization. We first compared super-delta with four commonly used normalization methods: global, median-IQR, quantile, and cyclic loess normalization in simulation studies. Super-delta was shown to have better statistical power with tighter control of type I error rate than its competitors. In many cases, the performance of super-delta is close to that of an oracle test in which datasets without technical noise were used. We then applied all methods to a collection of gene expression datasets on breast cancer patients who received neoadjuvant chemotherapy. While there is a substantial overlap of the DEGs identified by all of them, super-delta was able to identify comparatively more DEGs than its competitors. Downstream gene set enrichment analysis confirmed that all these methods selected largely consistent pathways. Detailed investigations on the relatively small differences showed that pathways identified by super-delta have better connections to breast cancer than other methods. 
As a new pipeline, super-delta provides new insights into the area of differential gene expression analysis. Solid theoretical foundation supports its asymptotic unbiasedness and technical noise-free properties. Implementation on real and simulated datasets demonstrates its decent performance compared with state-of-the-art procedures. It also has the potential to be extended to other data types and/or more general between-group comparison problems.

  12. Genomic pathways modulated by Twist in breast cancer.

    PubMed

    Vesuna, Farhad; Bergman, Yehudit; Raman, Venu

    2017-01-13

    The basic helix-loop-helix transcription factor TWIST1 (Twist) is involved in embryonic cell lineage determination and mesodermal differentiation. There is evidence to indicate that Twist expression plays a role in breast tumor formation and metastasis, but the role of Twist in dysregulating pathways that drive the metastatic cascade is unclear. Moreover, many of the genes and pathways dysregulated by Twist in cell lines and mouse models have not been validated against data obtained from larger, independent datasets of breast cancer patients. We over-expressed the human Twist gene in non-metastatic MCF-7 breast cancer cells to generate the estrogen-independent metastatic breast cancer cell line MCF-7/Twist. These cells were inoculated in the mammary fat pad of female severe compromised immunodeficient mice, which subsequently formed xenograft tumors that metastasized to the lungs. Microarray data was collected from both in vitro (MCF-7 and MCF-7/Twist cell lines) and in vivo (primary tumors and lung metastases) models of Twist expression. Our data was compared to several gene datasets of various subtypes, classes, and grades of human breast cancers. Our data establishes a Twist over-expressing mouse model of breast cancer, which metastasizes to the lung and replicates some of the ontogeny of human breast cancer progression. Gene profiling data, following Twist expression, exhibited novel metastasis driver genes as well as cellular maintenance genes that were synonymous with the metastatic process. We demonstrated that the genes and pathways altered in the transgenic cell line and metastatic animal models parallel many of the dysregulated gene pathways observed in human breast cancers. Analogous gene expression patterns were observed in both in vitro and in vivo Twist preclinical models of breast cancer metastasis and breast cancer patient datasets supporting the functional role of Twist in promoting breast cancer metastasis. 
The data suggests that genetic dysregulation of Twist at the cellular level drives alterations in gene pathways in the Twist metastatic mouse model which are comparable to changes seen in human breast cancers. Lastly, we have identified novel genes and pathways that could be further investigated as targets for drugs to treat metastatic breast cancer.

  13. Classification and Weakly Supervised Pain Localization using Multiple Segment Representation.

    PubMed

    Sikka, Karan; Dhall, Abhinav; Bartlett, Marian Stewart

    2014-10-01

    Automatic pain recognition from videos is a vital clinical application and, owing to its spontaneous nature, poses interesting challenges to automatic facial expression recognition (AFER) research. Previous pain vs no-pain systems have highlighted two major challenges: (1) ground truth is provided for the sequence, but the presence or absence of the target expression for a given frame is unknown, and (2) the time point and the duration of the pain expression event(s) in each video are unknown. To address these issues we propose a novel framework (referred to as MS-MIL) where each sequence is represented as a bag containing multiple segments, and multiple instance learning (MIL) is employed to handle this weakly labeled data in the form of sequence level ground-truth. These segments are generated via multiple clustering of a sequence or running a multi-scale temporal scanning window, and are represented using a state-of-the-art Bag of Words (BoW) representation. This work extends the idea of detecting facial expressions through 'concept frames' to 'concept segments' and argues through extensive experiments that algorithms such as MIL are needed to reap the benefits of such representation. The key advantages of our approach are: (1) joint detection and localization of painful frames using only sequence-level ground-truth, (2) incorporation of temporal dynamics by representing the data not as individual frames but as segments, and (3) extraction of multiple segments, which is well suited to signals with uncertain temporal location and duration in the video. Extensive experiments on UNBC-McMaster Shoulder Pain dataset highlight the effectiveness of the approach by achieving competitive results on both tasks of pain classification and localization in videos. We also empirically evaluate the contributions of different components of MS-MIL. 
The paper also includes the visualization of discriminative facial patches, important for pain detection, as discovered by our algorithm and relates them to Action Units that have been associated with pain expression. We conclude the paper by demonstrating that MS-MIL yields a significant improvement on another spontaneous facial expression dataset, the FEEDTUM dataset.
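
    As a rough illustration of the bag-of-segments idea summarized in this abstract (not the authors' implementation), the sketch below generates segments with a multi-scale temporal scanning window and scores a sequence by its best-scoring segment, which is the usual max-pooling assumption in multiple instance learning. The function names, window sizes, per-frame scores, and the mean-based segment scorer are all hypothetical choices made for the example.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple

Segment = Tuple[int, int]  # (start, end) frame indices, end exclusive


def multiscale_segments(n_frames: int, window_sizes: Sequence[int],
                        stride: int = 1) -> List[Segment]:
    """Slide windows of several lengths over the sequence (multi-scale temporal scan)."""
    segments: List[Segment] = []
    for size in window_sizes:
        last_start = max(n_frames - size, 0)
        for start in range(0, last_start + 1, stride):
            segments.append((start, min(start + size, n_frames)))
    return segments


def bag_score(frame_scores: Sequence[float], segments: Sequence[Segment],
              segment_scorer: Callable[[Sequence[float]], float]) -> Tuple[float, Segment]:
    """MIL max-pooling: the bag (sequence) is as positive as its most positive segment.

    Returns the bag score together with the segment that produced it, so the
    same call yields both a sequence-level decision and a localization.
    """
    return max((segment_scorer(frame_scores[s:e]), (s, e)) for s, e in segments)


# Toy per-frame evidence (hypothetical values, not taken from any dataset).
frame_scores = [0.1, 0.0, 0.2, 0.9, 0.8, 0.7, 0.1, 0.0]
segments = multiscale_segments(len(frame_scores), window_sizes=(2, 4), stride=1)
score, (start, end) = bag_score(frame_scores, segments, mean)
print(f"sequence score {score:.2f}, localized to frames [{start}, {end})")
```

    On this toy input the highest-scoring segment covers frames 3 and 4, which shows how a sequence-level score can also point to where the expression occurs.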

  14. Robust continuous clustering

    PubMed Central

    Shah, Sohil Atul

    2017-01-01

    Clustering is a fundamental procedure in the analysis of scientific data. It is used ubiquitously across the sciences. Despite decades of research, existing clustering algorithms have limited effectiveness in high dimensions and often require tuning parameters for different domains and datasets. We present a clustering algorithm that achieves high accuracy across multiple domains and scales efficiently to high dimensions and large datasets. The presented algorithm optimizes a smooth continuous objective, which is based on robust statistics and allows heavily mixed clusters to be untangled. The continuous nature of the objective also allows clustering to be integrated as a module in end-to-end feature learning pipelines. We demonstrate this by extending the algorithm to perform joint clustering and dimensionality reduction by efficiently optimizing a continuous global objective. The presented approach is evaluated on large datasets of faces, hand-written digits, objects, newswire articles, sensor readings from the Space Shuttle, and protein expression levels. Our method achieves high accuracy across all datasets, outperforming the best prior algorithm by a factor of 3 in average rank. PMID:28851838

  15. Defining global neuroendocrine gene expression patterns associated with reproductive seasonality in fish.

    PubMed

    Zhang, Dapeng; Xiong, Huiling; Mennigen, Jan A; Popesku, Jason T; Marlatt, Vicki L; Martyniuk, Christopher J; Crump, Kate; Cossins, Andrew R; Xia, Xuhua; Trudeau, Vance L

    2009-06-05

    Many vertebrates, including the goldfish, exhibit seasonal reproductive rhythms, which are a result of interactions between external environmental stimuli and internal endocrine systems in the hypothalamo-pituitary-gonadal axis. While it is long believed that differential expression of neuroendocrine genes contributes to establishing seasonal reproductive rhythms, no systems-level investigation has yet been conducted. In the present study, by analyzing multiple female goldfish brain microarray datasets, we have characterized global gene expression patterns for a seasonal cycle. A core set of genes (873 genes) in the hypothalamus were identified to be differentially expressed between May, August and December, which correspond to physiologically distinct stages that are sexually mature (prespawning), sexual regression, and early gonadal redevelopment, respectively. Expression changes of these genes are also shared by another brain region, the telencephalon, as revealed by multivariate analysis. More importantly, by examining one dataset obtained from fish in October who were kept under long-daylength photoperiod (16 h) typical of the springtime breeding season (May), we observed that the expression of identified genes appears regulated by photoperiod, a major factor controlling vertebrate reproductive cyclicity. Gene ontology analysis revealed that hormone genes and genes functionally involved in G-protein coupled receptor signaling pathway and transmission of nerve impulses are significantly enriched in an expression pattern, whose transition is located between prespawning and sexually regressed stages. The existence of seasonal expression patterns was verified for several genes including isotocin, ependymin II, GABA(A) gamma2 receptor, calmodulin, and aromatase b by independent samplings of goldfish brains from six seasonal time points and real-time PCR assays. 
Using both theoretical and experimental strategies, we report for the first time global gene expression patterns throughout a breeding season which may account for dynamic neuroendocrine regulation of seasonal reproductive development.

  16. Defining Global Neuroendocrine Gene Expression Patterns Associated with Reproductive Seasonality in Fish

    PubMed Central

    Mennigen, Jan A.; Popesku, Jason T.; Marlatt, Vicki L.; Martyniuk, Christopher J.; Crump, Kate; Cossins, Andrew R.; Xia, Xuhua; Trudeau, Vance L.

    2009-01-01

    Background Many vertebrates, including the goldfish, exhibit seasonal reproductive rhythms, which are a result of interactions between external environmental stimuli and internal endocrine systems in the hypothalamo-pituitary-gonadal axis. While it is long believed that differential expression of neuroendocrine genes contributes to establishing seasonal reproductive rhythms, no systems-level investigation has yet been conducted. Methodology/Principal Findings In the present study, by analyzing multiple female goldfish brain microarray datasets, we have characterized global gene expression patterns for a seasonal cycle. A core set of genes (873 genes) in the hypothalamus were identified to be differentially expressed between May, August and December, which correspond to physiologically distinct stages that are sexually mature (prespawning), sexual regression, and early gonadal redevelopment, respectively. Expression changes of these genes are also shared by another brain region, the telencephalon, as revealed by multivariate analysis. More importantly, by examining one dataset obtained from fish in October who were kept under long-daylength photoperiod (16 h) typical of the springtime breeding season (May), we observed that the expression of identified genes appears regulated by photoperiod, a major factor controlling vertebrate reproductive cyclicity. Gene ontology analysis revealed that hormone genes and genes functionally involved in G-protein coupled receptor signaling pathway and transmission of nerve impulses are significantly enriched in an expression pattern, whose transition is located between prespawning and sexually regressed stages. The existence of seasonal expression patterns was verified for several genes including isotocin, ependymin II, GABAA gamma2 receptor, calmodulin, and aromatase b by independent samplings of goldfish brains from six seasonal time points and real-time PCR assays. 
Conclusions/Significance Using both theoretical and experimental strategies, we report for the first time global gene expression patterns throughout a breeding season which may account for dynamic neuroendocrine regulation of seasonal reproductive development. PMID:19503831

  17. Transcriptome database resource and gene expression atlas for the rose

    PubMed Central

    2012-01-01

    Background For centuries roses have been selected based on a number of traits. Little information exists on the genetic and molecular basis that contributes to these traits, mainly because information on expressed genes for this economically important ornamental plant is scarce. Results Here, we used a combination of Illumina and 454 sequencing technologies to generate information on Rosa sp. transcripts using RNA from various tissues and in response to biotic and abiotic stresses. A total of 80714 transcript clusters were identified and 76611 peptides have been predicted among which 20997 have been clustered into 13900 protein families. BLASTp hits in closely related Rosaceae species revealed that about half of the predicted peptides in the strawberry and peach genomes have orthologs in Rosa dataset. Digital expression was obtained using RNA samples from organs at different development stages and under different stress conditions. qPCR validated the digital expression data for a selection of 23 genes with high or low expression levels. Comparative gene expression analyses between the different tissues and organs allowed the identification of clusters that are highly enriched in given tissues or under particular conditions, demonstrating the usefulness of the digital gene expression analysis. A web interface ROSAseq was created that allows data interrogation by BLAST, subsequent analysis of DNA clusters and access to thorough transcript annotation including best BLAST matches on Fragaria vesca, Prunus persica and Arabidopsis. The rose peptides dataset was used to create the ROSAcyc resource pathway database that allows access to the putative genes and enzymatic pathways. Conclusions The study provides useful information on Rosa expressed genes, with thorough annotation and an overview of expression patterns for transcripts with good accuracy. PMID:23164410

  18. ExpressionDB: An open source platform for distributing genome-scale datasets.

    PubMed

    Hughes, Laura D; Lewis, Scott A; Hughes, Michael E

    2017-01-01

    RNA-sequencing (RNA-seq) and microarrays are methods for measuring gene expression across the entire transcriptome. Recent advances have made these techniques practical and affordable for essentially any laboratory with experience in molecular biology. A variety of computational methods have been developed to decrease the amount of bioinformatics expertise necessary to analyze these data. Nevertheless, many barriers persist which discourage new labs from using functional genomics approaches. Since high-quality gene expression studies have enduring value as resources to the entire research community, it is of particular importance that small labs have the capacity to share their analyzed datasets with the research community. Here we introduce ExpressionDB, an open source platform for visualizing RNA-seq and microarray data accommodating virtually any number of different samples. ExpressionDB is based on Shiny, a customizable web application which allows data sharing locally and online with customizable code written in R. ExpressionDB allows intuitive searches based on gene symbols, descriptions, or gene ontology terms, and it includes tools for dynamically filtering results based on expression level, fold change, and false-discovery rates. Built-in visualization tools include heatmaps, volcano plots, and principal component analysis, ensuring streamlined and consistent visualization to all users. All of the scripts for building an ExpressionDB with user-supplied data are freely available on GitHub, and the Creative Commons license allows fully open customization by end-users. We estimate that a demo database can be created in under one hour with minimal programming experience, and that a new database with user-supplied expression data can be completed and online in less than one day.
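
    The dynamic filtering described above (by expression level, fold change, and false-discovery rate) can be sketched in a few lines. ExpressionDB itself is written in R with Shiny; the Python below is only an illustrative stand-in, and the record fields, thresholds, and gene labels are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class GeneResult:
    symbol: str
    mean_expression: float   # e.g. mean normalized expression across samples
    log2_fold_change: float  # condition vs. control
    fdr: float               # false-discovery rate (adjusted p-value)


def filter_results(results: Iterable[GeneResult], min_expression: float = 1.0,
                   min_abs_log2fc: float = 1.0, max_fdr: float = 0.05) -> List[GeneResult]:
    """Keep genes that pass all three thresholds at once."""
    return [
        r for r in results
        if r.mean_expression >= min_expression
        and abs(r.log2_fold_change) >= min_abs_log2fc
        and r.fdr <= max_fdr
    ]


# Toy table with placeholder gene labels and made-up numbers.
table = [
    GeneResult("GENE_A", 12.0, 2.1, 0.001),   # passes every filter
    GeneResult("GENE_B", 0.2, 3.0, 0.001),    # expressed too weakly
    GeneResult("GENE_C", 8.0, 0.3, 0.001),    # fold change too small
    GeneResult("GENE_D", 5.0, -1.8, 0.04),    # passes (down-regulated)
    GeneResult("GENE_E", 9.0, 2.5, 0.30),     # not significant
]
kept = [r.symbol for r in filter_results(table)]
print(kept)
```

    Taking the absolute value of the log2 fold change keeps both up- and down-regulated genes; a real interface might expose the direction as a separate option.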

  19. Enabling systematic interrogation of protein-protein interactions in live cells with a versatile ultra-high-throughput biosensor platform | Office of Cancer Genomics

    Cancer.gov

    The vast datasets generated by next generation gene sequencing and expression profiling have transformed biological and translational research. However, technologies to produce large-scale functional genomics datasets, such as high-throughput detection of protein-protein interactions (PPIs), are still in early development. While a number of powerful technologies have been employed to detect PPIs, a singular PPI biosensor platform featured with both high sensitivity and robustness in a mammalian cell environment remains to be established.

  20. Who Teaches Leadership? A Comparative Analysis of Faculty and Student Affairs Leadership Educators and Implications for Leadership Learning

    ERIC Educational Resources Information Center

    Jenkins, Daniel M.; Owen, Julie E.

    2016-01-01

    This study combines multiple national datasets on leadership educator demographics, education, positions, and experiences, in order to answer the question: Who teaches leadership? Comparing leadership educators across both curricular and co-curricular contexts allows a snapshot of the diverse perspectives of leadership educators and informs a set…

  1. The Ethics of Big Data and Nursing Science.

    PubMed

    Milton, Constance L

    2017-10-01

    Big data is a scientific, social, and technological trend referring to the process and size of datasets available for analysis. Ethical implications arise as healthcare disciplines, including nursing, struggle over questions of informed consent, privacy, ownership of data, and its possible use in epistemology. The author offers straight-thinking possibilities for the use of big data in nursing science.

  2. Topological data analysis of Escherichia coli O157:H7 and non-O157 survival in soils

    USDA-ARS's Scientific Manuscript database

    Shiga toxin-producing E. coli O157:H7 and non-O157 have been implicated in many foodborne illnesses caused by the consumption of contaminated fresh produce. However, data on their persistence in major fresh produce-growing soils are limited due to the complexity in datasets generated from different ...

  3. Human Capital Linkages to Labour Productivity: Implications from Thai Manufacturers

    ERIC Educational Resources Information Center

    Rukumnuaykit, Pungpond; Pholphirul, Piriya

    2016-01-01

    Human capital investment is a necessary condition for improving labour market outcomes in most countries. Empirical studies to investigate human capital and its linkages on the labour demand side are, however, relatively scarce due to limitations of firm-level data-sets. Using firm-level data from the Thai manufacturing sector, this paper aims to…

  4. Management and land use implications of continuous nitrogen and phosphorus monitoring in a small non-karst catchment in southeastern PA

    USDA-ARS's Scientific Manuscript database

    Long-term climate and water quality monitoring data provide some of the most essential and informative information to the scientific community. These datasets however, are often incomplete and do not have frequent enough sampling to provide full explanations of trends. With the advent of continuous ...

  5. Comprehensive curation and analysis of global interaction networks in Saccharomyces cerevisiae

    PubMed Central

    Reguly, Teresa; Breitkreutz, Ashton; Boucher, Lorrie; Breitkreutz, Bobby-Joe; Hon, Gary C; Myers, Chad L; Parsons, Ainslie; Friesen, Helena; Oughtred, Rose; Tong, Amy; Stark, Chris; Ho, Yuen; Botstein, David; Andrews, Brenda; Boone, Charles; Troyanskaya, Olga G; Ideker, Trey; Dolinski, Kara; Batada, Nizar N; Tyers, Mike

    2006-01-01

    Background The study of complex biological networks and prediction of gene function has been enabled by high-throughput (HTP) methods for detection of genetic and protein interactions. Sparse coverage in HTP datasets may, however, distort network properties and confound predictions. Although a vast number of well substantiated interactions are recorded in the scientific literature, these data have not yet been distilled into networks that enable system-level inference. Results We describe here a comprehensive database of genetic and protein interactions, and associated experimental evidence, for the budding yeast Saccharomyces cerevisiae, as manually curated from over 31,793 abstracts and online publications. This literature-curated (LC) dataset contains 33,311 interactions, on the order of all extant HTP datasets combined. Surprisingly, HTP protein-interaction datasets currently achieve only around 14% coverage of the interactions in the literature. The LC network nevertheless shares attributes with HTP networks, including scale-free connectivity and correlations between interactions, abundance, localization, and expression. We find that essential genes or proteins are enriched for interactions with other essential genes or proteins, suggesting that the global network may be functionally unified. This interconnectivity is supported by a substantial overlap of protein and genetic interactions in the LC dataset. We show that the LC dataset considerably improves the predictive power of network-analysis approaches. The full LC dataset is available at the BioGRID and SGD databases. Conclusion Comprehensive datasets of biological interactions derived from the primary literature provide critical benchmarks for HTP methods, augment functional prediction, and reveal system-level attributes of biological networks. PMID:16762047

  6. Comparative analysis and assessment of M. tuberculosis H37Rv protein-protein interaction datasets

    PubMed Central

    2011-01-01

    Background M. tuberculosis is a formidable bacterial pathogen. There is thus an increasing demand for understanding the function and relationship of proteins in various strains of M. tuberculosis. Protein-protein interactions (PPIs) data are crucial for this kind of knowledge. However, the quality of the main available M. tuberculosis PPI datasets is unclear. This hampers the effectiveness of research works that rely on these PPI datasets. Here, we analyze the two main available M. tuberculosis H37Rv PPI datasets. The first dataset is the high-throughput B2H PPI dataset from Wang et al.'s recent paper in Journal of Proteome Research. The second dataset is from the STRING database, version 8.3, consisting entirely of H37Rv PPIs predicted using various methods. We find that these two datasets have a surprisingly low level of agreement. We postulate the following causes for this low level of agreement: (i) the H37Rv B2H PPI dataset is of low quality; (ii) the H37Rv STRING PPI dataset is of low quality; and/or (iii) the H37Rv STRING PPIs are predictions of other forms of functional associations rather than direct physical interactions. Results To test the quality of these two datasets, we evaluate them based on correlated gene expression profiles, coherent informative GO term annotations, and conservation in other organisms. We observe a significantly greater portion of PPIs in the H37Rv STRING PPI dataset (with score ≥ 770) having correlated gene expression profiles and coherent informative GO term annotations in both interaction partners than that in the H37Rv B2H PPI dataset. Predicted H37Rv interologs derived from non-M. tuberculosis experimental PPIs are much more similar to the H37Rv STRING functional associations dataset (with score ≥ 770) than the H37Rv B2H PPI dataset. H37Rv predicted physical interologs from IntAct also show extremely low similarity with the H37Rv B2H PPI dataset; and this similarity level is much lower than that between the S. aureus MRSA252 predicted physical interologs from IntAct and S. aureus MRSA252 pull-down PPIs.
Comparative analysis with several representative two-hybrid PPI datasets in other species further confirms that the H37Rv B2H PPI dataset is of low quality. Next, to test the possibility that the H37Rv STRING PPIs are not purely direct physical interactions, we compare M. tuberculosis H37Rv protein pairs that catalyze adjacent steps in enzymatic reactions to B2H PPIs and predicted PPIs in STRING, which shows it has much lower similarities with the B2H PPIs than with STRING PPIs. This result strongly suggests that the H37Rv STRING PPIs more likely correspond to indirect relationships between protein pairs than to B2H PPIs. For more precise support, we turn to S. cerevisiae for its comprehensively studied interactome. We compare S. cerevisiae predicted PPIs in STRING to three independent protein relationship datasets which respectively comprise PPIs reported in Y2H assays, protein pairs reported to be in the same protein complexes, and protein pairs that catalyze successive reaction steps in enzymatic reactions. Our analysis reveals that S. cerevisiae predicted STRING PPIs have much higher similarity to the latter two types of protein pairs than to two-hybrid PPIs. As H37Rv STRING PPIs are predicted using similar methods as S. cerevisiae predicted STRING PPIs, this suggests that these H37Rv STRING PPIs are more likely to correspond to the latter two types of protein pairs rather than to two-hybrid PPIs as well. Conclusions The H37Rv B2H PPI dataset has low quality. It should not be used as the gold standard to assess the quality of other (possibly predicted) H37Rv PPI datasets. The H37Rv STRING PPI dataset also has low quality; nevertheless, a subset consisting of STRING PPIs with score ≥770 has satisfactory quality.
However, these STRING “PPIs” should be interpreted as functional associations, which include a substantial portion of indirect protein interactions, rather than direct physical interactions. These two factors cause the strikingly low similarity between these two main H37Rv PPI datasets. The results and conclusions from this comparative analysis provide valuable guidance in using these M. tuberculosis H37Rv PPI datasets in subsequent studies for a wide range of purposes. PMID:22369691
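
    The abstract does not say which measure was used to quantify agreement between the two PPI datasets. As one plausible way to express such agreement, the sketch below computes the Jaccard index over unordered protein pairs; the choice of measure, the function names, and the placeholder protein identifiers are assumptions made for illustration.

```python
from typing import FrozenSet, Iterable, Set, Tuple

Pair = FrozenSet[str]


def as_unordered(pairs: Iterable[Tuple[str, str]]) -> Set[Pair]:
    """Treat (A, B) and (B, A) as the same interaction."""
    return {frozenset(pair) for pair in pairs}


def jaccard(a: Set[Pair], b: Set[Pair]) -> float:
    """Shared interactions divided by all interactions seen in either dataset."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Placeholder identifiers, not real H37Rv proteins.
dataset_one = as_unordered([("P1", "P2"), ("P2", "P3"), ("P4", "P5")])
dataset_two = as_unordered([("P2", "P1"), ("P3", "P6"), ("P5", "P4"), ("P7", "P8")])

agreement = jaccard(dataset_one, dataset_two)
print(f"Jaccard agreement: {agreement:.2f}")
```

    Storing each pair as a frozenset avoids undercounting agreement when the two sources list the same interaction in opposite order.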

  7. University of Texas Southwestern Medical Center: Functional Signature Ontology Tool: Triplicate Measurements of Reporter Gene Expression in Response to Individual Genetic and Chemical Perturbations in HCT116 Cells | Office of Cancer Genomics

    Cancer.gov

    The goal of this project is to use an eight-gene expression profile to define functional signatures for small molecules and natural products with heretofore undefined mechanism of action. Two genes in the eight gene set are used as internal controls and do not vary across gene expression array data collected from the public domain. The remaining six genes are found to vary independently across a large collection of publicly available gene expression array datasets.

  8. A Pathway Based Classification Method for Analyzing Gene Expression for Alzheimer's Disease Diagnosis.

    PubMed

    Voyle, Nicola; Keohane, Aoife; Newhouse, Stephen; Lunnon, Katie; Johnston, Caroline; Soininen, Hilkka; Kloszewska, Iwona; Mecocci, Patrizia; Tsolaki, Magda; Vellas, Bruno; Lovestone, Simon; Hodges, Angela; Kiddle, Steven; Dobson, Richard Jb

    2016-01-01

    Recent studies indicate that gene expression levels in blood may be able to differentiate subjects with Alzheimer's disease (AD) from normal elderly controls and mild cognitively impaired (MCI) subjects. However, there is limited replicability at the single marker level. A pathway-based interpretation of gene expression may prove more robust. This study aimed to investigate whether a case/control classification model built on pathway level data was more robust than a gene level model and may consequently perform better in test data. The study used two batches of gene expression data from the AddNeuroMed (ANM) and Dementia Case Registry (DCR) cohorts. Our study used Illumina Human HT-12 Expression BeadChips to collect gene expression from blood samples. Random forest modeling with recursive feature elimination was used to predict case/control status. Age and APOE ɛ4 status were used as covariates for all analyses. Gene and pathway level models performed similarly to each other and to a model based on demographic information only. Any potential increase in concordance from the novel pathway level approach used here has not led to a greater predictive ability in these datasets. However, we have only tested one method for creating pathway level scores. Further, we have been able to benchmark pathways against genes in datasets that had been extensively harmonized. Further work should focus on the use of alternative methods for creating pathway level scores, in particular those that incorporate pathway topology, and the use of an endophenotype based approach.

  9. Classification of foods by transferring knowledge from ImageNet dataset

    NASA Astrophysics Data System (ADS)

    Heravi, Elnaz J.; Aghdam, Hamed H.; Puig, Domenec

    2017-03-01

    Automatic classification of foods is a way to control food intake and tackle obesity. However, it is a challenging problem since foods are highly deformable and complex objects. Results on the ImageNet dataset have revealed that Convolutional Neural Networks have great expressive power to model natural objects. Nonetheless, it is not trivial to train a ConvNet from scratch for classification of foods. This is because ConvNets require large datasets and, to our knowledge, there is no large public dataset of food for this purpose. An alternative solution is to transfer knowledge from trained ConvNets to the domain of foods. In this work, we study how transferable state-of-the-art ConvNets are to the task of food classification. We also propose a method for transferring knowledge from a bigger ConvNet to a smaller ConvNet while keeping its accuracy similar to that of the bigger ConvNet. Our experiments on the UECFood256 dataset show that Googlenet, VGG and residual networks produce comparable results if we start transferring knowledge from an appropriate layer. In addition, we show that our method is able to effectively transfer knowledge to the smaller ConvNet using unlabeled samples.

  10. Inference of Gene Regulatory Networks Using Bayesian Nonparametric Regression and Topology Information.

    PubMed

    Fan, Yue; Wang, Xiao; Peng, Qinke

    2017-01-01

    Gene regulatory networks (GRNs) play an important role in cellular systems and are important for understanding biological processes. Many algorithms have been developed to infer the GRNs. However, most algorithms only pay attention to the gene expression data but do not consider the topology information in their inference process, while incorporating this information can partially compensate for the lack of reliable expression data. Here we develop a Bayesian group lasso with spike and slab priors to perform gene selection and estimation for nonparametric models. B-spline basis functions are used to capture the nonlinear relationships flexibly and penalties are used to avoid overfitting. Further, we incorporate the topology information into the Bayesian method as a prior. We present the application of our method on DREAM3 and DREAM4 datasets and two real biological datasets. The results show that our method performs better than existing methods and the topology information prior can improve the result.

  11. Beta-Poisson model for single-cell RNA-seq data analyses.

    PubMed

    Vu, Trung Nghia; Wills, Quin F; Kalari, Krishna R; Niu, Nifang; Wang, Liewei; Rantalainen, Mattias; Pawitan, Yudi

    2016-07-15

    Single-cell RNA-sequencing technology allows detection of gene expression at the single-cell level. One typical feature of the data is a bimodality in the cellular distribution even for highly expressed genes, primarily caused by a proportion of non-expressing cells. The standard and the over-dispersed gamma-Poisson models that are commonly used in bulk-cell RNA-sequencing are not able to capture this property. We introduce a beta-Poisson mixture model that can capture the bimodality of the single-cell gene expression distribution. We further integrate the model into the generalized linear model framework in order to perform differential expression analyses. The whole analytical procedure is called BPSC. The results from several real single-cell RNA-seq datasets indicate that ∼90% of the transcripts are well characterized by the beta-Poisson model; the model-fit from BPSC is better than the fit of the standard gamma-Poisson model in > 80% of the transcripts. Moreover, in differential expression analyses of simulated and real datasets, BPSC performs well against edgeR, a conventional method widely used in bulk-cell RNA-sequencing data, and against scde and MAST, two recent methods specifically designed for single-cell RNA-seq data. An R package BPSC for model fitting and differential expression analyses of single-cell RNA-seq data is available under GPL-3 license at https://github.com/nghiavtr/BPSC. Contact: yudi.pawitan@ki.se or mattias.rantalainen@ki.se. Supplementary data are available at Bioinformatics online.
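
    To make the bimodality argument concrete, the sketch below simulates counts from a basic beta-Poisson mixture: each cell draws an activity level p from Beta(alpha, beta) and then a count from Poisson(lam * p). This three-parameter form, the parameter values, and the function names are assumptions for illustration; the code is not taken from the BPSC package and may differ from its exact parameterization.

```python
import math
import random
from typing import List


def sample_poisson(rng: random.Random, mean: float) -> int:
    """Knuth's multiplication method; adequate for the modest means used here."""
    if mean <= 0:
        return 0
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def sample_beta_poisson(rng: random.Random, alpha: float, beta: float, lam: float) -> int:
    """One count: p ~ Beta(alpha, beta), then X ~ Poisson(lam * p)."""
    p = rng.betavariate(alpha, beta)
    return sample_poisson(rng, lam * p)


rng = random.Random(0)
alpha, beta, lam = 0.3, 0.3, 50.0  # alpha, beta < 1 gives a U-shaped activity distribution
counts: List[int] = [sample_beta_poisson(rng, alpha, beta, lam) for _ in range(20000)]

sample_mean = sum(counts) / len(counts)
zero_fraction = sum(c == 0 for c in counts) / len(counts)
print(f"mean count {sample_mean:.1f} (theory: {lam * alpha / (alpha + beta):.1f})")
print(f"fraction of zero counts {zero_fraction:.2f}")
```

    With both shape parameters below 1, many simulated cells sit near zero while others express strongly, so a sizeable share of zeros coexists with a high mean. A single Poisson with the same mean would produce almost no zeros.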

  12. Intra- and interspecies gene expression models for predicting drug response in canine osteosarcoma.

    PubMed

    Fowles, Jared S; Brown, Kristen C; Hess, Ann M; Duval, Dawn L; Gustafson, Daniel L

    2016-02-19

    Genomics-based predictors of drug response have the potential to improve outcomes associated with cancer therapy. Osteosarcoma (OS), the most common primary bone cancer in dogs, is commonly treated with adjuvant doxorubicin or carboplatin following amputation of the affected limb. We evaluated the use of gene-expression based models built in an intra- or interspecies manner to predict chemosensitivity and treatment outcome in canine OS. Models were built and evaluated using microarray gene expression and drug sensitivity data from human and canine cancer cell lines, and canine OS tumor datasets. The "COXEN" method was utilized to filter gene signatures between human and dog datasets based on strong co-expression patterns. Models were built using linear discriminant analysis via the misclassification penalized posterior algorithm. The best doxorubicin model involved genes identified in human lines that were co-expressed and trained on canine OS tumor data, which accurately predicted clinical outcome in 73% of dogs (p = 0.0262, binomial). The best carboplatin model utilized canine lines for gene identification and model training, with canine OS tumor data for co-expression. Dogs whose treatment matched our predictions had significantly better clinical outcomes than those whose treatment did not (p = 0.0006, Log Rank), and this predictor was significantly associated with longer disease free intervals in a Cox multivariate analysis (hazard ratio = 0.3102, p = 0.0124). Our data show that intra- and interspecies gene expression models can successfully predict response in canine OS, which may improve outcome in dogs and serve as pre-clinical validation for similar methods in human cancer research.

  13. Gene-expression signature regulated by the KEAP1-NRF2-CUL3 axis is associated with a poor prognosis in head and neck squamous cell cancer.

    PubMed

    Namani, Akhileshwar; Matiur Rahaman, Md; Chen, Ming; Tang, Xiuwen

    2018-01-06

    NRF2 is the key regulator of oxidative stress in normal cells and aberrant expression of the NRF2 pathway due to genetic alterations in the KEAP1 (Kelch-like ECH-associated protein 1)-NRF2 (nuclear factor erythroid 2 like 2)-CUL3 (cullin 3) axis leads to tumorigenesis and drug resistance in many cancers including head and neck squamous cell cancer (HNSCC). The main goal of this study was to identify specific genes regulated by the KEAP1-NRF2-CUL3 axis in HNSCC patients, to assess the prognostic value of this gene signature in different cohorts, and to reveal potential biomarkers. RNA-Seq V2 level 3 data from 279 tumor samples along with 37 adjacent normal samples from patients enrolled in The Cancer Genome Atlas (TCGA)-HNSCC study were used to identify upregulated genes using two methods (altered KEAP1-NRF2-CUL3 versus normal, and altered KEAP1-NRF2-CUL3 versus wild-type). We then used a new approach to identify the combined gene signature by integrating both datasets and subsequently tested this signature in 4 independent HNSCC datasets to assess its prognostic value. In addition, functional annotation using the DAVID v6.8 database and protein-protein interaction (PPI) analysis using the STRING v10 database were performed on the signature. A signature composed of a subset of 17 genes regulated by the KEAP1-NRF2-CUL3 axis was identified by overlapping both the upregulated genes of altered versus normal (251 genes) and altered versus wild-type (25 genes) datasets. We showed that increased expression was significantly associated with poor survival in 4 independent HNSCC datasets, including the TCGA-HNSCC dataset. Furthermore, Gene Ontology, Kyoto Encyclopedia of Genes and Genomes, and PPI analysis revealed that most of the genes in this signature are associated with drug metabolism and glutathione metabolic pathways.
Altogether, our study emphasizes the discovery of a gene signature regulated by the KEAP1-NRF2-CUL3 axis which is strongly associated with tumorigenesis and drug resistance in HNSCC. This 17-gene signature provides potential biomarkers and therapeutic targets for HNSCC cases in which the NRF2 pathway is activated.
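
    The overlap step described in this abstract (genes upregulated in both comparisons) amounts to a set intersection. The following is a minimal sketch under that assumption only; the function name and the gene identifiers are hypothetical placeholders, not the genes or code reported by the authors.

```python
def overlap_signature(up_vs_normal, up_vs_wildtype):
    """Return the genes present in both upregulated lists, sorted for stable output."""
    return sorted(set(up_vs_normal) & set(up_vs_wildtype))


# Placeholder identifiers only; these are NOT the genes from the study.
altered_vs_normal = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
altered_vs_wildtype = ["GENE_B", "GENE_D", "GENE_E"]

signature = overlap_signature(altered_vs_normal, altered_vs_wildtype)
print(signature)
```

    Sorting the result keeps the signature order reproducible across runs, which helps when the list is reused for validation in other cohorts.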

  14. Identification of Differentially Expressed Genes through Integrated Study of Alzheimer's Disease Affected Brain Regions.

    PubMed

    Puthiyedth, Nisha; Riveros, Carlos; Berretta, Regina; Moscato, Pablo

    2016-01-01

    Alzheimer's disease (AD) is the most common form of dementia in older adults that damages the brain and results in impaired memory, thinking and behaviour. The identification of differentially expressed genes and related pathways among affected brain regions can provide more information on the mechanisms of AD. In the past decade, several studies have reported many genes that are associated with AD. This wealth of information has become difficult to follow and interpret as most of the results are conflicting. In that case, it is worth doing an integrated study of multiple datasets that helps to increase the total number of samples and the statistical power in detecting biomarkers. In this study, we present an integrated analysis of five different brain region datasets and introduce new genes that warrant further investigation. The aim of our study is to apply a novel combinatorial optimisation based meta-analysis approach to identify differentially expressed genes that are associated with AD across brain regions. In this study, microarray gene expression data from 161 samples (74 non-demented controls, 87 AD) from the Entorhinal Cortex (EC), Hippocampus (HIP), Middle temporal gyrus (MTG), Posterior cingulate cortex (PC), Superior frontal gyrus (SFG) and visual cortex (VCX) brain regions were integrated and analysed using our method. The results are then compared to two popular meta-analysis methods, RankProd and GeneMeta, and to what can be obtained by analysing the individual datasets. We find genes related to AD that are consistent with existing studies, and new candidate genes not previously related to AD. Our study confirms the up-regulation of INFAR2 and PTMA along with the down-regulation of GPHN, RAB2A, PSMD14 and FGF. Novel genes PSMB2, WNK1, RPL15, SEMA4C, RWDD2A and LARGE are found to be differentially expressed across all brain regions. Further investigation on these genes may provide new insights into the development of AD.
In addition, we identified the presence of 23 non-coding features, including four miRNA precursors (miR-7, miR570, miR-1229 and miR-6821), dysregulated across the brain regions. Furthermore, we compared our results with two popular meta-analysis methods RankProd and GeneMeta to validate our findings and performed a sensitivity analysis by removing one dataset at a time to assess the robustness of our results. These new findings may provide new insights into the disease mechanisms and thus make a significant contribution in the near future towards understanding, prevention and cure of AD.
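
    The sensitivity analysis mentioned here (removing one dataset at a time) can be sketched as a leave-one-out loop. This is only an illustration: a plain set intersection stands in for the authors' combinatorial optimisation method, and the gene labels and function names are hypothetical.

```python
def consensus(gene_sets):
    """Genes called in every dataset (a simple stand-in for the real meta-analysis)."""
    sets = [set(s) for s in gene_sets]
    return set.intersection(*sets) if sets else set()


def leave_one_out(datasets):
    """Map each left-out dataset name to the consensus of the remaining datasets."""
    return {
        left_out: consensus(
            genes for name, genes in datasets.items() if name != left_out
        )
        for left_out in datasets
    }


# Hypothetical per-region calls; region names follow the abstract, genes are placeholders.
datasets = {
    "EC": {"G1", "G2", "G3"},
    "HIP": {"G1", "G2"},
    "MTG": {"G1", "G3"},
}

full = consensus(datasets.values())
loo = leave_one_out(datasets)
for region, genes in loo.items():
    # Genes that appear only once a region is dropped depend on that region's data.
    print(region, sorted(genes - full))
```

    A result that barely changes across the leave-one-out runs suggests that no single dataset drives the findings.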

  15. Analysis of copy number variations at 15 schizophrenia-associated loci.

    PubMed

    Rees, Elliott; Walters, James T R; Georgieva, Lyudmila; Isles, Anthony R; Chambert, Kimberly D; Richards, Alexander L; Mahoney-Davies, Gerwyn; Legge, Sophie E; Moran, Jennifer L; McCarroll, Steven A; O'Donovan, Michael C; Owen, Michael J; Kirov, George

    2014-02-01

    A number of copy number variants (CNVs) have been suggested as susceptibility factors for schizophrenia. For some of these the data remain equivocal, and the frequency in individuals with schizophrenia is uncertain. To determine the contribution of CNVs at 15 schizophrenia-associated loci (a) using a large new data-set of patients with schizophrenia (n = 6882) and controls (n = 6316), and (b) combining our results with those from previous studies. We used Illumina microarrays to analyse our data. Analyses were restricted to 520 766 probes common to all arrays used in the different data-sets. We found higher rates in participants with schizophrenia than in controls for 13 of the 15 previously implicated CNVs. Six were nominally significantly associated (P < 0.05) in this new data-set: deletions at 1q21.1, NRXN1, 15q11.2 and 22q11.2 and duplications at 16p11.2 and the Angelman/Prader-Willi Syndrome (AS/PWS) region. All eight AS/PWS duplications in patients were of maternal origin. When combined with published data, 11 of the 15 loci showed highly significant evidence for association with schizophrenia (P < 4.1 × 10^-4). We strengthen the support for the majority of the previously implicated CNVs in schizophrenia. About 2.5% of patients with schizophrenia and 0.9% of controls carry a large, detectable CNV at one of these loci. Routine CNV screening may be clinically appropriate given the high rate of known deleterious mutations in the disorder and the comorbidity associated with these heritable mutations.

  16. Electrically-Active Convection in Tropical Easterly Waves and Implications for Tropical Cyclogenesis in the Atlantic and East Pacific

    NASA Technical Reports Server (NTRS)

    Leppert, Kenneth D., II; Petersen, Walter A.; Cecil, Daniel J.

    2012-01-01

    In this study, we investigate the characteristics of tropical easterly wave convection and the possible implications of convective structure on tropical cyclogenesis and intensification over the Atlantic Ocean and East Pacific using data from the Tropical Rainfall Measuring Mission Microwave Imager, Precipitation Radar (PR), and Lightning Imaging Sensor as well as infrared (IR) brightness temperature data from the NASA global-merged IR brightness temperature dataset. Easterly waves were partitioned into northerly, southerly, trough, and ridge phases based on the 700-hPa meridional wind from the NCEP-NCAR reanalysis dataset. Waves were subsequently divided according to whether they did or did not develop tropical cyclones (i.e., developing and nondeveloping, respectively), and developing waves were further subdivided according to development location. Finally, composites as a function of wave phase and category were created using the various datasets. Results suggest that the convective characteristics that best distinguish developing from nondeveloping waves vary according to where developing waves spawn tropical cyclones. For waves that developed a cyclone in the Atlantic basin, coverage by IR brightness temperatures ≤240 K and ≤210 K provide the best distinction between developing and nondeveloping waves. In contrast, several variables provide a significant distinction between nondeveloping waves and waves that develop cyclones over the East Pacific as these waves near their genesis location including IR threshold coverage, lightning flash rates, and low-level (<4.5 km) PR reflectivity. Results of this study may be used to help develop thresholds to better distinguish developing from nondeveloping waves and serve as another aid for tropical cyclogenesis forecasting.

  17. Visualising nursing data using correspondence analysis.

    PubMed

    Kokol, Peter; Blažun Vošner, Helena; Železnik, Danica

    2016-09-01

    Digitally stored, large healthcare datasets enable nurses to use 'big data' techniques and tools in nursing research. Big data is complex and multi-dimensional, so visualisation may be a preferable approach to analyse and understand it. To demonstrate the use of visualisation of big data in a technique called correspondence analysis. In the authors' study, relations among data in a nursing dataset were shown visually in graphs using correspondence analysis. The case presented demonstrates that correspondence analysis is easy to use, shows relations between data visually in a form that is simple to interpret, and can reveal hidden associations between data. Correspondence analysis supports the discovery of new knowledge. Implications for practice Knowledge obtained using correspondence analysis can be transferred immediately into practice or used to foster further research.

  18. Enabling Data Fusion via a Common Data Model and Programming Interface

    NASA Astrophysics Data System (ADS)

    Lindholm, D. M.; Wilson, A.

    2011-12-01

    Much progress has been made in scientific data interoperability, especially in the areas of metadata and discovery. However, while a data user may have improved techniques for finding data, there is often a large chasm to span when it comes to acquiring the desired subsets of various datasets and integrating them into a data processing environment. Some tools such as OPeNDAP servers and the Unidata Common Data Model (CDM) have introduced improved abstractions for accessing data via a common interface, but they alone do not go far enough to enable fusion of data from multidisciplinary sources. Although data from various scientific disciplines may represent semantically similar concepts (e.g. time series), the user may face widely varying structural representations of the data (e.g. row versus column oriented), not to mention radically different storage formats. It is not enough to convert data to a common format. The key to fusing scientific data is to represent each dataset with consistent sampling. This can best be done by using a data model that expresses the functional relationship that each dataset represents. The domain of those functions determines how the data can be combined. The Visualization for Algorithm Development (VisAD) Java API has provided a sophisticated data model for representing the functional nature of scientific datasets for well over a decade. Because VisAD is largely designed for its visualization capabilities, the data model can be cumbersome to use for numerical computation, especially for those not comfortable with Java. Although both VisAD and the implementation of the CDM are written in Java, neither defines a pure Java interface that others could implement and program to, further limiting potential for interoperability. 
In this talk, we will present a solution for data integration based on a simple discipline-agnostic scientific data model and programming interface that enables a dataset to be defined in terms of three variable types: Scalar (a), Tuple (a,b), and Function (a -> b). These basic building blocks can be combined and nested to represent any arbitrarily complex dataset. For example, a time series of surface temperature and pressure could be represented as: time -> ((lon,lat) -> (T,P)). Our data model is expressed in UML and can be implemented in numerous programming languages. We will demonstrate an implementation of our data model and interface using the Scala programming language. Given its functional programming constructs, sophisticated type system, and other language features, Scala enables us to construct complex data structures that can be manipulated using natural mathematical expressions while taking advantage of the language's ability to operate on collections in parallel. This API will be applied to the problem of assimilating various measurements of the solar spectrum and other proxies from multiple sources to construct a composite Lyman-alpha irradiance dataset.
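
    The three variable types named in this abstract (Scalar, Tuple, Function) can be sketched in a few lines. The authors describe a Scala implementation; the Python below is only an illustrative analogue, and the class names, field names and the `describe` helper are assumptions rather than the authors' API.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Scalar:
    """A single named quantity, e.g. time or temperature."""
    name: str


@dataclass(frozen=True)
class TupleVar:
    """An ordered group of variables, e.g. (lon, lat)."""
    elements: tuple


@dataclass(frozen=True)
class Function:
    """A mapping from a domain variable to a codomain variable."""
    domain: "Variable"
    codomain: "Variable"


Variable = Union[Scalar, TupleVar, Function]


def describe(v):
    """Render a variable in the notation used in the abstract."""
    if isinstance(v, Scalar):
        return v.name
    if isinstance(v, TupleVar):
        return "(" + ",".join(describe(e) for e in v.elements) + ")"
    if isinstance(v, Function):
        rng = describe(v.codomain)
        if isinstance(v.codomain, Function):
            rng = f"({rng})"  # parenthesise a nested function for readability
        return f"{describe(v.domain)} -> {rng}"
    raise TypeError(f"unsupported variable: {v!r}")


# The time series of surface temperature and pressure from the abstract.
surface = Function(
    TupleVar((Scalar("lon"), Scalar("lat"))),
    TupleVar((Scalar("T"), Scalar("P"))),
)
dataset = Function(Scalar("time"), surface)
print(describe(dataset))  # time -> ((lon,lat) -> (T,P))
```

    Because the types nest freely, the same three classes can describe a plain time series, a gridded field, or a time series of gridded fields without any format-specific code.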

  19. Querying Co-regulated Genes on Diverse Gene Expression Datasets Via Biclustering.

    PubMed

    Deveci, Mehmet; Küçüktunç, Onur; Eren, Kemal; Bozdağ, Doruk; Kaya, Kamer; Çatalyürek, Ümit V

    2016-01-01

    Rapid development and increasing popularity of gene expression microarrays have resulted in a number of studies on the discovery of co-regulated genes. One important way of discovering such co-regulations is the query-based search since gene co-expressions may indicate a shared role in a biological process. Although there exist promising query-driven search methods adapting clustering, they fail to capture many genes that function in the same biological pathway because microarray datasets are fraught with spurious samples or samples of diverse origin, or the pathways might be regulated under only a subset of samples. On the other hand, a class of clustering algorithms known as biclustering algorithms which simultaneously cluster both the items and their features are useful while analyzing gene expression data, or any data in which items are related in only a subset of their samples. This means that genes need not be related in all samples to be clustered together. Because many genes only interact under specific circumstances, biclustering may recover the relationships that traditional clustering algorithms can easily miss. In this chapter, we briefly summarize the literature using biclustering for querying co-regulated genes. Then we present a novel biclustering approach and evaluate its performance by a thorough experimental analysis.

  20. Identification of upstream transcription factors (TFs) for expression signature genes in breast cancer.

    PubMed

    Zang, Hongyan; Li, Ning; Pan, Yuling; Hao, Jingguang

    2017-03-01

    Breast cancer is a common malignancy among women with a rising incidence. Our intention was to detect transcription factors (TFs) for deeper understanding of the underlying mechanisms of breast cancer. Integrated analysis of gene expression datasets of breast cancer was performed. Then, functional annotation of differentially expressed genes (DEGs) was conducted, including Gene Ontology (GO) enrichment and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment. Furthermore, TFs were identified and a global transcriptional regulatory network was constructed. Seven publicly available GEO datasets were obtained, and a set of 1196 DEGs were identified (460 up-regulated and 736 down-regulated). Functional annotation results showed that cell cycle was the most significantly enriched pathway, which was consistent with the fact that cell cycle is closely related to various tumors. Fifty-three differentially expressed TFs were identified, and the regulatory networks consisted of 817 TF-target interactions between 46 TFs and 602 DEGs in the context of breast cancer. Top 10 TFs covering the most downstream DEGs were SOX10, NFATC2, ZNF354C, ARID3A, BRCA1, FOXO3, GATA3, ZEB1, HOXA5 and EGR1. The transcriptional regulatory networks could enable a better understanding of regulatory mechanisms of breast cancer pathology and provide an opportunity for the development of potential therapy.
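
    Ranking TFs by the number of downstream DEGs they cover, as in the "top 10" list above, can be sketched as a count over a TF-target edge list. The edge list, the identifiers and the function name below are hypothetical; the sketch assumes only that the network is available as (TF, target) pairs.

```python
from collections import defaultdict


def rank_tfs_by_targets(edges, top_n=10):
    """Return (TF, target count) pairs, largest coverage first; ties break by name."""
    targets = defaultdict(set)
    for tf, gene in edges:
        targets[tf].add(gene)  # a set ignores duplicated interactions
    ranked = sorted(targets.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(tf, len(genes)) for tf, genes in ranked[:top_n]]


# Placeholder network, not the 817 interactions from the study.
edges = [
    ("TF1", "g1"), ("TF1", "g2"), ("TF1", "g2"),
    ("TF2", "g1"),
    ("TF3", "g3"), ("TF3", "g4"), ("TF3", "g5"),
]
print(rank_tfs_by_targets(edges, top_n=2))
```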

  1. Genome wide transcriptome profiling reveals differential gene expression in secondary metabolite pathway of Cymbopogon winterianus.

    PubMed

    Devi, Kamalakshi; Mishra, Surajit K; Sahu, Jagajjit; Panda, Debashis; Modi, Mahendra K; Sen, Priyabrata

    2016-02-15

    Advances in transcriptome sequencing provide a fast, cost-effective and reliable approach to generating large expression datasets, especially suitable for non-model species, to identify putative genes, key pathways and regulatory mechanisms. Citronella (Cymbopogon winterianus) is an aromatic medicinal grass used for its anti-tumoral, antibacterial, anti-fungal, antiviral, detoxifying and natural insect repellent properties. Despite its many uses, the genes involved in the terpene biosynthetic pathway have not yet been clearly elucidated. The present study is a pioneering attempt to generate exhaustive molecular information on the secondary metabolite pathway and to increase genomic resources in Citronella. Using high-throughput RNA-Seq technology, the root and leaf transcriptomes were analysed at an unprecedented depth (11.7 Gb). Targeted searches identified the majority of the genes associated with metabolic pathways and other natural product pathways, viz. antibiotic synthesis, along with many novel genes. Comparative expression results for terpenoid biosynthesis genes were validated for 15 unigenes by RT-PCR and qRT-PCR. Thus the coverage of this transcriptome is comprehensive enough to discover all known genes of major metabolic pathways. This transcriptome dataset can serve as important public information for gene expression, genomics and functional genomics studies in Citronella and shall act as a benchmark for future improvement of the crop.

  2. Research resource: Tissue-specific transcriptomics and cistromics of nuclear receptor signaling: a web research resource.

    PubMed

    Ochsner, Scott A; Watkins, Christopher M; LaGrone, Benjamin S; Steffen, David L; McKenna, Neil J

    2010-10-01

    Nuclear receptors (NRs) are ligand-regulated transcription factors that recruit coregulators and other transcription factors to gene promoters to effect regulation of tissue-specific transcriptomes. The prodigious rate at which the NR signaling field has generated high content gene expression and, more recently, genome-wide location analysis datasets has not been matched by a committed effort to archiving this information for routine access by bench and clinical scientists. As a first step towards this goal, we searched the MEDLINE database for studies, which referenced either expression microarray and/or genome-wide location analysis datasets in which a NR or NR ligand was an experimental variable. A total of 1122 studies encompassing 325 unique organs, tissues, primary cells, and cell lines, 35 NRs, and 91 NR ligands were retrieved and annotated. The data were incorporated into a new section of the Nuclear Receptor Signaling Atlas Molecule Pages, Transcriptomics and Cistromics, for which we designed an intuitive, freely accessible user interface to browse the studies. Each study links to an abstract, the MEDLINE record, and, where available, Gene Expression Omnibus and ArrayExpress records. The resource will be updated on a regular basis to provide a current and comprehensive entrez into the sum of transcriptomic and cistromic research in this field.

  3. Multiparametric Determination of Radiation Risk

    NASA Technical Reports Server (NTRS)

    Richmond, Robert C.

    2003-01-01

    Predicting risk of human cancer following exposure to ionizing space radiation is challenging in part because uncertainties about low-dose distribution amongst cells, about unknown and potentially synergistic effects of microgravity upon cellular protein expression, and about the processing of dose-related damage within cells to produce rare and late-appearing malignant transformation degrade the confidence of cancer risk estimates. The NASA-specific responsibility to estimate the risks of radiogenic cancer in a limited number of astronauts is not amenable to epidemiologic study, thereby increasing this challenge. Developing adequately sensitive cellular biodosimeters that simultaneously report 1) the quantity of absorbed dose after exposure to ionizing radiation, 2) the quality of radiation delivering that dose, and 3) the risk of developing malignant transformation by the cells absorbing that dose could be useful for resolving these challenges. Use of a multiparametric cellular biodosimeter is suggested using analyses of gene expression and protein expression whereby large datasets of cellular response to radiation-induced damage are obtained and analyzed for expression profiles correlated with established end points and molecular markers predictive for cancer risk. Analytical techniques of genomics and proteomics may be used to establish dose-dependency of multiple gene- and protein-expression responses resulting from radiation-induced cellular damage. Furthermore, gene and protein expression from cells in microgravity are known to be altered relative to cells grown on the ground at 1g. Therefore, hypotheses are proposed that 1) macromolecular expression caused by radiation-induced damage in cells in microgravity may be different than on the ground, and 2) different patterns of macromolecular expression in microgravity may alter human radiogenic cancer risk relative to radiation exposure on Earth.
A new paradigm is accordingly suggested as a national database wherein genomic and proteomic datasets are registered and interrogated in order to provide statistically significant dose-dependent risk estimation of radiogenic cancer in astronauts.

  4. Efficient experimental design and analysis strategies for the detection of differential expression using RNA-Sequencing

    PubMed Central

    2012-01-01

    Background RNA sequencing (RNA-Seq) has emerged as a powerful approach for the detection of differential gene expression with both high-throughput and high resolution capabilities possible depending upon the experimental design chosen. Multiplex experimental designs are now readily available; these can be utilised to increase the numbers of samples or replicates profiled at the cost of decreased sequencing depth generated per sample. These strategies impact on the power of the approach to accurately identify differential expression. This study presents a detailed analysis of the power to detect differential expression in a range of scenarios including simulated null and differential expression distributions with varying numbers of biological or technical replicates, sequencing depths and analysis methods. Results Differential and non-differential expression datasets were simulated using a combination of negative binomial and exponential distributions derived from real RNA-Seq data. These datasets were used to evaluate the performance of three commonly used differential expression analysis algorithms and to quantify the changes in power with respect to true and false positive rates when simulating variations in sequencing depth, biological replication and multiplex experimental design choices. Conclusions This work quantitatively explores comparisons between contemporary analysis tools and experimental design choices for the detection of differential expression using RNA-Seq. We found that the DESeq algorithm performs more conservatively than edgeR and NBPSeq. With regard to testing of various experimental designs, this work strongly suggests that greater power is gained through the use of biological replicates relative to library (technical) replicates and sequencing depth. Strikingly, sequencing depth could be reduced as low as 15% without substantial impacts on false positive or true positive rates. PMID:22985019

  5. Efficient experimental design and analysis strategies for the detection of differential expression using RNA-Sequencing.

    PubMed

    Robles, José A; Qureshi, Sumaira E; Stephen, Stuart J; Wilson, Susan R; Burden, Conrad J; Taylor, Jennifer M

    2012-09-17

    RNA sequencing (RNA-Seq) has emerged as a powerful approach for the detection of differential gene expression with both high-throughput and high resolution capabilities possible depending upon the experimental design chosen. Multiplex experimental designs are now readily available; these can be utilised to increase the numbers of samples or replicates profiled at the cost of decreased sequencing depth generated per sample. These strategies impact on the power of the approach to accurately identify differential expression. This study presents a detailed analysis of the power to detect differential expression in a range of scenarios including simulated null and differential expression distributions with varying numbers of biological or technical replicates, sequencing depths and analysis methods. Differential and non-differential expression datasets were simulated using a combination of negative binomial and exponential distributions derived from real RNA-Seq data. These datasets were used to evaluate the performance of three commonly used differential expression analysis algorithms and to quantify the changes in power with respect to true and false positive rates when simulating variations in sequencing depth, biological replication and multiplex experimental design choices. This work quantitatively explores comparisons between contemporary analysis tools and experimental design choices for the detection of differential expression using RNA-Seq. We found that the DESeq algorithm performs more conservatively than edgeR and NBPSeq. With regard to testing of various experimental designs, this work strongly suggests that greater power is gained through the use of biological replicates relative to library (technical) replicates and sequencing depth. Strikingly, sequencing depth could be reduced as low as 15% without substantial impacts on false positive or true positive rates.
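
    A reduction in sequencing depth, such as the 15% level mentioned above, can be mimicked on an existing count table by binomial thinning: each read is kept independently with a fixed probability. This is a minimal sketch of that generic technique, not the authors' simulation (which used negative binomial and exponential distributions); the function name, the seed and the counts are assumptions.

```python
import random


def downsample_counts(counts, fraction, seed=0):
    """Binomially thin read counts: keep each read with probability `fraction`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # seeded so the example is reproducible
    return [sum(1 for _ in range(c) if rng.random() < fraction) for c in counts]


# Hypothetical per-gene read counts for one sample.
full_depth = [1200, 340, 15, 0, 87]
reduced = downsample_counts(full_depth, fraction=0.15)
print(reduced)
```

    Running a differential expression tool on both the full and the thinned tables is one way to check how sensitive a given design is to depth.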

  6. Hierarchical cortical transcriptome disorganization in autism.

    PubMed

    Lombardo, Michael V; Courchesne, Eric; Lewis, Nathan E; Pramparo, Tiziano

    2017-01-01

    Autism spectrum disorders (ASD) are etiologically heterogeneous and complex. Functional genomics work has begun to identify a diverse array of dysregulated transcriptomic programs (e.g., synaptic, immune, cell cycle, DNA damage, WNT signaling, cortical patterning and differentiation) potentially involved in ASD brain abnormalities during childhood and adulthood. However, it remains unclear whether such diverse dysregulated pathways are independent of each other or instead reflect coordinated hierarchical systems-level pathology. Two ASD cortical transcriptome datasets were re-analyzed using consensus weighted gene co-expression network analysis (WGCNA) to identify common co-expression modules across datasets. Linear mixed-effect models and Bayesian replication statistics were used to identify replicable differentially expressed modules. Eigengene network analysis was then utilized to identify between-group differences in how co-expression modules interact and cluster into hierarchical meta-modular organization. Protein-protein interaction analyses were also used to determine whether dysregulated co-expression modules show enhanced interactions. We find replicable evidence for 10 gene co-expression modules that are differentially expressed in ASD cortex. Rather than being independent non-interacting sources of pathology, these dysregulated co-expression modules work in synergy and physically interact at the protein level. These systems-level transcriptional signals are characterized by downregulation of synaptic processes coordinated with upregulation of immune/inflammation, response to other organism, catabolism, viral processes, translation, protein targeting and localization, cell proliferation, and vasculature development. Hierarchical organization of meta-modules (clusters of highly correlated modules) is also highly affected in ASD. 
These findings highlight that dysregulation of the ASD cortical transcriptome is characterized by the dysregulation of multiple coordinated transcriptional programs producing synergistic systems-level effects that cannot be fully appreciated by studying the individual component biological processes in isolation.

  7. A gene expression inflammatory signature specifically predicts multiple myeloma evolution and patients survival.

    PubMed

    Botta, C; Di Martino, M T; Ciliberto, D; Cucè, M; Correale, P; Rossi, M; Tagliaferri, P; Tassone, P

    2016-12-16

    Multiple myeloma (MM) is closely dependent on cross-talk between malignant plasma cells and cellular components of the inflammatory/immunosuppressive bone marrow milieu, which promotes disease progression, drug resistance, neo-angiogenesis, bone destruction and immune-impairment. We investigated the relevance of inflammatory genes in predicting disease evolution and patient survival. A bioinformatics study by Ingenuity Pathway Analysis on a gene expression profiling dataset of monoclonal gammopathy of undetermined significance, smoldering and symptomatic-MM, identified inflammatory and cytokine/chemokine pathways as the most progressively affected during disease evolution. We then selected 20 candidate genes involved in B-cell inflammation and we investigated their role in predicting clinical outcome, through univariate and multivariate analyses (log-rank test, logistic regression and Cox-regression model). We defined an 8-gene signature (IL8, IL10, IL17A, CCL3, CCL5, VEGFA, EBI3 and NOS2) identifying each condition (MGUS/smoldering/symptomatic-MM) with 84% accuracy. Moreover, six genes (IFNG, IL2, LTA, CCL2, VEGFA, CCL3) were found independently correlated with patients' survival. Patients whose MM cells expressed high levels of Th1 cytokines (IFNG/LTA/IL2/CCL2) and low levels of CCL3 and VEGFA, experienced the longest survival. On these six genes, we built a prognostic risk score that was validated in three additional independent datasets. In this study, we provide proof-of-concept that inflammation has a critical role in MM patient progression and survival. The inflammatory-gene prognostic signature validated in different datasets clearly indicates novel opportunities for personalized anti-MM treatment.
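
    A gene-based prognostic risk score of the kind described is commonly computed as a weighted sum of expression values. The sketch below assumes that form; the weights are illustrative placeholders (only their signs follow the direction reported in the abstract), not coefficients from the study, and the expression values are invented.

```python
# Illustrative weights only: negative for genes whose high expression went with
# longer survival in the abstract, positive for CCL3 and VEGFA.
WEIGHTS = {
    "IFNG": -1.0, "IL2": -1.0, "LTA": -1.0, "CCL2": -1.0,
    "CCL3": 1.0, "VEGFA": 1.0,
}


def risk_score(expression):
    """Weighted sum of expression values; a higher score means a higher assumed risk."""
    missing = WEIGHTS.keys() - expression.keys()
    if missing:
        raise KeyError(f"missing genes: {sorted(missing)}")
    return sum(weight * expression[gene] for gene, weight in WEIGHTS.items())


# Hypothetical normalised expression profiles for two patients.
favourable = {"IFNG": 2.0, "IL2": 2.0, "LTA": 2.0, "CCL2": 2.0, "CCL3": 0.5, "VEGFA": 0.5}
unfavourable = {"IFNG": 0.5, "IL2": 0.5, "LTA": 0.5, "CCL2": 0.5, "CCL3": 2.0, "VEGFA": 2.0}

print(risk_score(favourable), risk_score(unfavourable))
```

    In practice the weights would come from a fitted survival model, and the score would be validated on independent cohorts as the abstract describes.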

  8. VaDiR: an integrated approach to Variant Detection in RNA.

    PubMed

    Neums, Lisa; Suenaga, Seiji; Beyerlein, Peter; Anders, Sara; Koestler, Devin; Mariani, Andrea; Chien, Jeremy

    2018-02-01

    Advances in next-generation DNA sequencing technologies are now enabling detailed characterization of sequence variations in cancer genomes. With whole-genome sequencing, variations in coding and non-coding sequences can be discovered. But the cost associated with it is currently limiting its general use in research. Whole-exome sequencing is used to characterize sequence variations in coding regions, but the cost associated with capture reagents and biases in capture rate limit its full use in research. Additional limitations include uncertainty in assigning the functional significance of the mutations when these mutations are observed in the non-coding region or in genes that are not expressed in cancer tissue. We investigated the feasibility of uncovering mutations from expressed genes using RNA sequencing datasets with a method called Variant Detection in RNA (VaDiR) that integrates 3 variant callers, namely: SNPiR, RVBoost, and MuTect2. The combination of all 3 methods, which we called Tier 1 variants, produced the highest precision with true positive mutations from RNA-seq that could be validated at the DNA level. We also found that the integration of Tier 1 variants with those called by MuTect2 and SNPiR produced the highest recall with acceptable precision. Finally, we observed a higher rate of mutation discovery in genes that are expressed at higher levels. Our method, VaDiR, provides a possibility of uncovering mutations from RNA sequencing datasets that could be useful in further functional analysis. In addition, our approach allows orthogonal validation of DNA-based mutation discovery by providing complementary sequence variation analysis from paired RNA/DNA sequencing datasets.
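
    The Tier 1 definition above (variants reported by all three callers) and its evaluation against DNA-validated calls can be sketched with set operations. This is not the VaDiR code: the function names, the variant tuples and the truth set below are hypothetical, and the sketch assumes each caller's output has already been reduced to (chromosome, position, ref, alt) keys.

```python
def tier1(snpir, rvboost, mutect2):
    """Variants reported by all three callers."""
    return set(snpir) & set(rvboost) & set(mutect2)


def precision_recall(called, truth):
    """Precision and recall of a call set against a validated truth set."""
    called, truth = set(called), set(truth)
    true_positives = len(called & truth)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical variant keys: (chromosome, position, ref, alt).
v1, v2, v3 = ("chr1", 100, "A", "G"), ("chr2", 200, "C", "T"), ("chr3", 300, "G", "A")
v4, v5 = ("chr4", 400, "T", "C"), ("chr5", 500, "A", "T")

snpir_calls = {v1, v2, v3}
rvboost_calls = {v1, v2, v4}
mutect2_calls = {v1, v2, v3, v5}
dna_validated = {v1, v2, v3}

tier1_calls = tier1(snpir_calls, rvboost_calls, mutect2_calls)
print(precision_recall(tier1_calls, dna_validated))
```

    Requiring agreement from every caller trades recall for precision, which matches the pattern the abstract reports for Tier 1 variants.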

  9. Functional Targets of the Monogenic Diabetes Transcription Factors HNF-1α and HNF-4α Are Highly Conserved Between Mice and Humans

    PubMed Central

    Boj, Sylvia F.; Servitja, Joan Marc; Martin, David; Rios, Martin; Talianidis, Iannis; Guigo, Roderic; Ferrer, Jorge

    2009-01-01

    OBJECTIVE The evolutionary conservation of transcriptional mechanisms has been widely exploited to understand human biology and disease. Recent findings, however, unexpectedly showed that the transcriptional regulators hepatocyte nuclear factor (HNF)-1α and -4α rarely bind to the same genes in mice and humans, leading to the proposal that tissue-specific transcriptional regulation has undergone extensive divergence in the two species. Such observations have major implications for the use of mouse models to understand HNF-1α– and HNF-4α–deficient diabetes. However, the significance of studies that assess binding without considering regulatory function is poorly understood. RESEARCH DESIGN AND METHODS We compared previously reported mouse and human HNF-1α and HNF-4α binding studies with independent binding experiments. We also integrated binding studies with mouse and human loss-of-function gene expression datasets. RESULTS First, we confirmed the existence of species-specific HNF-1α and -4α binding, yet observed incomplete detection of binding in the different datasets, causing an underestimation of binding conservation. Second, only a minor fraction of HNF-1α– and HNF-4α–bound genes were downregulated in the absence of these regulators. This subset of functional targets did not show evidence for evolutionary divergence of binding or binding sequence motifs. Finally, we observed differences between conserved and species-specific binding properties. For example, conserved binding was more frequently located near transcriptional start sites and was more likely to involve multiple binding events in the same gene. CONCLUSIONS Despite evolutionary changes in binding, essential direct transcriptional functions of HNF-1α and -4α are largely conserved between mice and humans. PMID:19188435

  10. Probabilistic and machine learning-based retrieval approaches for biomedical dataset retrieval

    PubMed Central

    Karisani, Payam; Qin, Zhaohui S; Agichtein, Eugene

    2018-01-01

    Abstract The bioCADDIE dataset retrieval challenge brought together different approaches to retrieval of biomedical datasets relevant to a user’s query, expressed as a text description of a needed dataset. We describe experiments in applying a data-driven, machine learning-based approach to biomedical dataset retrieval as part of this challenge. We report on a series of experiments carried out to evaluate the performance of both probabilistic and machine learning-driven techniques from information retrieval, as applied to this challenge. Our experiments with probabilistic information retrieval methods, such as query term weight optimization, automatic query expansion and simulated user relevance feedback, demonstrate that automatically boosting the weights of important keywords in a verbose query is more effective than other methods. We also show that although there is a rich space of potential representations and features available in this domain, machine learning-based re-ranking models are not able to improve on probabilistic information retrieval techniques with the currently available training data. The models and algorithms presented in this paper can serve as a viable implementation of a search engine to provide access to biomedical datasets. The retrieval performance is expected to be further improved by using additional training data that is created by expert annotation, or gathered through usage logs, clicks and other processes during natural operation of the system. Database URL: https://github.com/emory-irlab/biocaddie PMID:29688379

  11. A systems biology pipeline identifies new immune and disease related molecular signatures and networks in human cells during microgravity exposure

    NASA Astrophysics Data System (ADS)

    Mukhopadhyay, Sayak; Saha, Rohini; Palanisamy, Anbarasi; Ghosh, Madhurima; Biswas, Anupriya; Roy, Saheli; Pal, Arijit; Sarkar, Kathakali; Bagh, Sangram

    2016-05-01

    Microgravity is a prominent health hazard for astronauts, yet we understand little about its effect at the molecular systems level. In this study, we have integrated a set of systems-biology tools and databases and have analysed more than 8000 molecular pathways on published global gene expression datasets of human cells in microgravity. Hundreds of new pathways have been identified with statistical confidence for each dataset and, despite the difference in cell types and experiments, around 100 of the new pathways appear to be common across the datasets. They are related to reduced inflammation, autoimmunity, diabetes and asthma. We have identified downregulation of the NfκB pathway via Notch1 signalling as a new pathway for reduced immunity in microgravity. Induction of a few cancer types, including liver cancer and leukaemia, and increased drug response to cancer in microgravity are also found. An increase in olfactory signal transduction is also identified. Genes are clustered based on their expression pattern, and mathematically stable clusters are identified. The network mapping of genes within a cluster indicates the plausible functional connections in microgravity. This pipeline gives a new systems-level picture of human cells under microgravity, generates testable hypotheses and may help in estimating risk and developing medicine for space missions.

  12. ChloroSeq, an optimized chloroplast RNA-Seq bioinformatic pipeline, reveals remodeling of the organellar transcriptome under heat stress

    DOE PAGES

    Castandet, Benoît; Hotto, Amber M.; Strickler, Susan R.; ...

    2016-07-06

    Although RNA-Seq has revolutionized transcript analysis, organellar transcriptomes are rarely assessed even when present in published datasets. Here, we describe the development and application of a rapid and convenient method, ChloroSeq, to delineate qualitative and quantitative features of chloroplast RNA metabolism from strand-specific RNA-Seq datasets, including processing, editing, splicing, and relative transcript abundance. The use of a single experiment to analyze systematically chloroplast transcript maturation and abundance is of particular interest due to frequent pleiotropic effects observed in mutants that affect chloroplast gene expression and/or photosynthesis. To illustrate its utility, ChloroSeq was applied to published RNA-Seq datasets derived from Arabidopsis thaliana grown under control and abiotic stress conditions, where the organellar transcriptome had not been examined. The most appreciable effects were found for heat stress, which induces a global reduction in splicing and editing efficiency, and leads to increased abundance of chloroplast transcripts, including genic, intergenic, and antisense transcripts. Moreover, by concomitantly analyzing nuclear transcripts that encode chloroplast gene expression regulators from the same libraries, we demonstrate the possibility of achieving a holistic understanding of the nucleus-organelle system. In conclusion, ChloroSeq thus represents a unique method for streamlining RNA-Seq data interpretation of the chloroplast transcriptome and its regulators.

  13. ChloroSeq, an optimized chloroplast RNA-Seq bioinformatic pipeline, reveals remodeling of the organellar transcriptome under heat stress

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Castandet, Benoît; Hotto, Amber M.; Strickler, Susan R.

    Although RNA-Seq has revolutionized transcript analysis, organellar transcriptomes are rarely assessed even when present in published datasets. Here, we describe the development and application of a rapid and convenient method, ChloroSeq, to delineate qualitative and quantitative features of chloroplast RNA metabolism from strand-specific RNA-Seq datasets, including processing, editing, splicing, and relative transcript abundance. The use of a single experiment to analyze systematically chloroplast transcript maturation and abundance is of particular interest due to frequent pleiotropic effects observed in mutants that affect chloroplast gene expression and/or photosynthesis. To illustrate its utility, ChloroSeq was applied to published RNA-Seq datasets derived from Arabidopsis thaliana grown under control and abiotic stress conditions, where the organellar transcriptome had not been examined. The most appreciable effects were found for heat stress, which induces a global reduction in splicing and editing efficiency, and leads to increased abundance of chloroplast transcripts, including genic, intergenic, and antisense transcripts. Moreover, by concomitantly analyzing nuclear transcripts that encode chloroplast gene expression regulators from the same libraries, we demonstrate the possibility of achieving a holistic understanding of the nucleus-organelle system. In conclusion, ChloroSeq thus represents a unique method for streamlining RNA-Seq data interpretation of the chloroplast transcriptome and its regulators.

  14. Robust diagnosis of non-Hodgkin lymphoma phenotypes validated on gene expression data from different laboratories.

    PubMed

    Bhanot, Gyan; Alexe, Gabriela; Levine, Arnold J; Stolovitzky, Gustavo

    2005-01-01

    A major challenge in cancer diagnosis from microarray data is the need for robust, accurate classification models which are independent of the analysis techniques used and can combine data from different laboratories. We propose such a classification scheme, originally developed for phenotype identification from mass spectrometry data. The method uses a robust multivariate gene selection procedure and combines the results of several machine learning tools trained on raw and pattern data to produce an accurate meta-classifier. We illustrate and validate our method by applying it to gene expression datasets: the oligonucleotide HuGeneFL microarray dataset of Shipp et al. (www.genome.wi.mit.du/MPR/lymphoma) and the Hu95Av2 Affymetrix dataset (DallaFavera's laboratory, Columbia University). Our pattern-based meta-classification technique achieves higher predictive accuracies than each of the individual classifiers, is robust against data perturbations and provides subsets of related predictive genes. Our techniques predict that combinations of some genes in the p53 pathway are highly predictive of phenotype. In particular, we find that in 80% of DLBCL cases the mRNA level of at least one of the three genes p53, PLK1 and CDK2 is elevated, while in 80% of FL cases, the mRNA level of at most one of them is elevated.

  15. A systems biology pipeline identifies new immune and disease related molecular signatures and networks in human cells during microgravity exposure.

    PubMed

    Mukhopadhyay, Sayak; Saha, Rohini; Palanisamy, Anbarasi; Ghosh, Madhurima; Biswas, Anupriya; Roy, Saheli; Pal, Arijit; Sarkar, Kathakali; Bagh, Sangram

    2016-05-17

    Microgravity is a prominent health hazard for astronauts, yet we understand little about its effect at the molecular systems level. In this study, we have integrated a set of systems-biology tools and databases and have analysed more than 8000 molecular pathways on published global gene expression datasets of human cells in microgravity. Hundreds of new pathways have been identified with statistical confidence for each dataset and, despite the difference in cell types and experiments, around 100 of the new pathways appear to be common across the datasets. They are related to reduced inflammation, autoimmunity, diabetes and asthma. We have identified downregulation of the NfκB pathway via Notch1 signalling as a new pathway for reduced immunity in microgravity. Induction of a few cancer types, including liver cancer and leukaemia, and increased drug response to cancer in microgravity are also found. An increase in olfactory signal transduction is also identified. Genes are clustered based on their expression pattern, and mathematically stable clusters are identified. The network mapping of genes within a cluster indicates the plausible functional connections in microgravity. This pipeline gives a new systems-level picture of human cells under microgravity, generates testable hypotheses and may help in estimating risk and developing medicine for space missions.

  16. BubbleGUM: automatic extraction of phenotype molecular signatures and comprehensive visualization of multiple Gene Set Enrichment Analyses.

    PubMed

    Spinelli, Lionel; Carpentier, Sabrina; Montañana Sanchis, Frédéric; Dalod, Marc; Vu Manh, Thien-Phong

    2015-10-19

    Recent advances in the analysis of high-throughput expression data have led to the development of tools that scale up their focus from the single-gene to the gene-set level. For example, the popular Gene Set Enrichment Analysis (GSEA) algorithm can detect moderate but coordinated expression changes of groups of presumably related genes between pairs of experimental conditions. This considerably improves extraction of information from high-throughput gene expression data. However, although many gene sets covering a large panel of biological fields are available in public databases, the ability to generate home-made gene sets relevant to one's biological question is crucial but remains a substantial challenge to most biologists lacking statistical or bioinformatic expertise. This is all the more the case when attempting to define a gene set specific to one condition compared to many other ones. Thus, there is a crucial need for easy-to-use software for the generation of relevant home-made gene sets from complex datasets, their use in GSEA, and the correction of the results when applied to multiple comparisons of many experimental conditions. We developed BubbleGUM (GSEA Unlimited Map), a tool that automatically extracts molecular signatures from transcriptomic data and performs exhaustive GSEA with multiple testing correction. One original feature of BubbleGUM notably resides in its capacity to integrate and compare numerous GSEA results into an easy-to-grasp graphical representation. We applied our method to generate transcriptomic fingerprints for murine cell types and to assess their enrichments in human cell types. This analysis allowed us to confirm homologies between mouse and human immunocytes. BubbleGUM is open-source software that automatically generates molecular signatures from complex expression datasets and directly assesses their enrichment by GSEA on independent datasets. Enrichments are displayed in a graphical output that helps in interpreting the results. This innovative methodology has recently been used to answer important questions in functional genomics, such as the degree of similarity between microarray datasets from different laboratories or with different experimental models or clinical cohorts. BubbleGUM is executable through an intuitive interface so that both bioinformaticians and biologists can use it. It is available at http://www.ciml.univ-mrs.fr/applications/BubbleGUM/index.html .
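
    The running-sum statistic at the core of GSEA can be illustrated with a minimal sketch. It assumes the simplest, unweighted (Kolmogorov-Smirnov-like) form of the score and is not BubbleGUM's or the GSEA software's implementation; the function name and the toy gene labels are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score for one gene set.

    ranked_genes: genes ordered from most up- to most down-regulated.
    gene_set: collection of genes whose position in the ranking is tested.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    running, best = 0.0, 0.0
    for is_hit in hits:
        # Step up on a gene-set member, step down otherwise.
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        # Keep the largest deviation from zero seen so far.
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical ranking of six genes; the set sits at the top of the list.
ranking = ["a", "b", "c", "d", "e", "f"]
print(enrichment_score(ranking, {"a", "b"}))
print(enrichment_score(ranking, {"e", "f"}))
```

    A score near +1 means the set is concentrated at the top of the ranking and a score near -1 at the bottom. Significance testing and the multiple testing correction mentioned in the abstract are outside this sketch.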

  17. Gene discovery in an invasive tephritid model pest species, the Mediterranean fruit fly, Ceratitis capitata

    PubMed Central

    Gomulski, Ludvik M; Dimopoulos, George; Xi, Zhiyong; Soares, Marcelo B; Bonaldo, Maria F; Malacrida, Anna R; Gasperi, Giuliano

    2008-01-01

    Background The medfly, Ceratitis capitata, is a highly invasive agricultural pest that has become a model insect for the development of biological control programs. Despite research into the behavior and classical and population genetics of this organism, the quantity of sequence data available is limited. We have utilized an expressed sequence tag (EST) approach to obtain detailed information on transcriptome signatures that relate to a variety of physiological systems in the medfly; this information emphasizes reproduction, sex determination, and chemosensory perception, since the study was based on normalized cDNA libraries from embryos and adult heads. Results A total of 21,253 high-quality ESTs were obtained from the embryo and head libraries. Clustering analyses performed separately for each library resulted in 5201 embryo and 6684 head transcripts. Considering an estimated 19% overlap in the transcriptomes of the two libraries, they represent about 9614 unique transcripts involved in a wide range of biological processes and molecular functions. Of particular interest are the sequences that share homology with Drosophila genes involved in sex determination, olfaction, and reproductive behavior. The medfly transformer2 (tra2) homolog was identified among the embryonic sequences, and its genomic organization and expression were characterized. Conclusion The sequences obtained in this study represent the first major dataset of expressed genes in a tephritid species of agricultural importance. This resource provides essential information to support the investigation of numerous questions regarding the biology of the medfly and other related species and also constitutes an invaluable tool for the annotation of complete genome sequences. Our study has revealed intriguing findings regarding the transcript regulation of tra2 and other sex determination genes, as well as insights into the comparative genomics of genes implicated in chemosensory reception and reproduction. PMID:18500975

  18. Genome-Wide Temporal Expression Profiling in Caenorhabditis elegans Identifies a Core Gene Set Related to Long-Term Memory.

    PubMed

    Freytag, Virginie; Probst, Sabine; Hadziselimovic, Nils; Boglari, Csaba; Hauser, Yannick; Peter, Fabian; Gabor Fenyves, Bank; Milnik, Annette; Demougin, Philippe; Vukojevic, Vanja; de Quervain, Dominique J-F; Papassotiropoulos, Andreas; Stetak, Attila

    2017-07-12

    The identification of genes related to encoding, storage, and retrieval of memories is a major interest in neuroscience. In the current study, we analyzed the temporal gene expression changes in a neuronal mRNA pool during an olfactory long-term associative memory (LTAM) in Caenorhabditis elegans hermaphrodites. Here, we identified a core set of 712 (538 upregulated and 174 downregulated) genes that follows three distinct temporal peaks demonstrating multiple gene regulation waves in LTAM. Compared with the previously published positive LTAM gene set (Lakhina et al., 2015), 50% of the identified upregulated genes here overlap with the previous dataset, possibly representing stimulus-independent memory-related genes. On the other hand, the remaining genes were not previously identified in positive associative memory and may specifically regulate aversive LTAM. Our results suggest a multistep gene activation process during the formation and retrieval of long-term memory and define general memory-implicated genes as well as conditioning-type-dependent gene sets. SIGNIFICANCE STATEMENT The identification of genes regulating different steps of memory is of major interest in neuroscience. Identification of common memory genes across different learning paradigms and the temporal activation of the genes are poorly studied. Here, we investigated the temporal aspects of Caenorhabditis elegans gene expression changes using aversive olfactory associative long-term memory (LTAM) and identified three major gene activation waves. As in previous studies, aversive LTAM is also CREB dependent, and CREB activity is necessary immediately after training. Finally, we define a list of memory paradigm-independent core gene sets as well as conditioning-dependent genes.

  19. Assessing the Determinants and Implications of Teacher Layoffs. Working Paper 55

    ERIC Educational Resources Information Center

    Goldhaber, Dan; Theobald, Roddy

    2010-01-01

    Over 2000 teachers in the state of Washington received reduction-in-force (RIF) notices in the past two years. The authors link data on these RIF notices to a unique dataset that includes student, teacher, school, and district variables to determine the factors that predict the likelihood of a teacher receiving a RIF notice. They find a teacher's…

  20. Transmission Models of Historical Ebola Outbreaks

    PubMed Central

    Drake, John M.; Bakach, Iurii; Just, Matthew R.; O’Regan, Suzanne M.; Gambhir, Manoj

    2015-01-01

    To guide the collection of data under emergent epidemic conditions, we reviewed compartmental models of historical Ebola outbreaks to determine their implications and limitations. We identified future modeling directions and propose that the minimal epidemiologic dataset for Ebola model construction comprises duration of incubation period and symptomatic period, distribution of secondary cases by infection setting, and compliance with intervention recommendations. PMID:26196358
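
    As an illustration of the kind of compartmental model reviewed here, the following is a minimal deterministic SEIR sketch in which the incubation and symptomatic periods named in the abstract enter as rate parameters. The function name, the forward Euler scheme, and all numeric values are assumptions chosen for illustration; they are not taken from the reviewed models.

```python
def simulate_seir(population, initial_infected, r0,
                  incubation_days, symptomatic_days, days, dt=0.1):
    """Integrate a deterministic SEIR model with forward Euler steps."""
    sigma = 1.0 / incubation_days    # rate of leaving the exposed class
    gamma = 1.0 / symptomatic_days   # rate of leaving the infectious class
    beta = r0 * gamma                # transmission rate implied by R0

    s = float(population - initial_infected)
    e, i, r = 0.0, float(initial_infected), 0.0
    trajectory = [(0.0, s, e, i, r)]

    for step in range(1, int(round(days / dt)) + 1):
        new_exposed = beta * s * i / population * dt
        new_infectious = sigma * e * dt
        new_removed = gamma * i * dt
        s -= new_exposed
        e += new_exposed - new_infectious
        i += new_infectious - new_removed
        r += new_removed
        trajectory.append((step * dt, s, e, i, r))
    return trajectory


# Illustrative parameter values only (assumed, not estimates from the paper).
trajectory = simulate_seir(population=10_000, initial_infected=1, r0=1.8,
                           incubation_days=10.0, symptomatic_days=7.0, days=365)
_, s, e, i, r = trajectory[-1]
print(f"ever infected after one year: {e + i + r:.0f}")
```

    The other two items of the proposed minimal dataset, secondary cases by infection setting and compliance with interventions, would require setting-specific transmission terms that this sketch leaves out.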

  1. Combining Shapley value and statistics to the analysis of gene expression data in children exposed to air pollution

    PubMed Central

    Moretti, Stefano; van Leeuwen, Danitsja; Gmuender, Hans; Bonassi, Stefano; van Delft, Joost; Kleinjans, Jos; Patrone, Fioravante; Merlo, Domenico Franco

    2008-01-01

    Background In gene expression analysis, statistical tests for differential gene expression provide lists of candidate genes having, individually, a sufficiently low p-value. However, the interpretation of each single p-value within complex systems involving several interacting genes is problematic. In parallel, in the last sixty years, game theory has been applied to political and social problems to assess the power of interacting agents in forcing a decision and, more recently, to represent the relevance of genes in response to certain conditions. Results In this paper we introduce a Bootstrap procedure to test the null hypothesis that each gene has the same relevance between two conditions, where the relevance is represented by the Shapley value of a particular coalitional game defined on a microarray data-set. This method, which is called Comparative Analysis of Shapley value (shortly, CASh), is applied to data concerning the gene expression in children differentially exposed to air pollution. The results provided by CASh are compared with the results from a parametric statistical test for testing differential gene expression. Both lists of genes provided by CASh and t-test are informative enough to discriminate exposed subjects on the basis of their gene expression profiles. While many genes are selected in common by CASh and the parametric test, it turns out that the biological interpretation of the differences between these two selections is more interesting, suggesting a different interpretation of the main biological pathways in gene expression regulation for exposed individuals. A simulation study suggests that CASh offers more power than t-test for the detection of differential gene expression variability. Conclusion CASh is successfully applied to gene expression analysis of a data-set where the joint expression behavior of genes may be critical to characterize the expression response to air pollution. 
    We demonstrate a synergistic effect between coalitional games and statistics that resulted in a selection of genes with a potential impact in the regulation of complex pathways. PMID:18764936
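
    The Shapley value that CASh compares between conditions can be illustrated on a toy coalitional game. The sketch below computes exact Shapley values by enumerating all player orderings. The game definition (the fraction of samples whose flagged genes all lie inside the coalition) and the sample data are assumptions made for illustration, and the bootstrap test described in the abstract is not reproduced.

```python
from fractions import Fraction
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: Fraction(0) for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            extended = coalition | {p}
            totals[p] += value(extended) - value(coalition)
            coalition = extended
    return {p: total / len(orderings) for p, total in totals.items()}


# Hypothetical data: for each sample, the genes flagged as abnormally expressed.
samples = [{"g1"}, {"g1", "g2"}, {"g2", "g3"}, {"g1", "g2", "g3"}]


def toy_game(coalition):
    """Fraction of samples whose flagged genes are all inside the coalition."""
    covered = sum(1 for flagged in samples if flagged and flagged <= coalition)
    return Fraction(covered, len(samples))


genes = ["g1", "g2", "g3"]
phi = shapley_values(genes, toy_game)
for gene in genes:
    print(gene, phi[gene])
```

    Enumerating orderings is only feasible for a handful of players; it is used here because it follows the definition directly.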

  2. Being an honest broker of hydrology: Uncovering, communicating and addressing model error in a climate change streamflow dataset

    NASA Astrophysics Data System (ADS)

    Chegwidden, O.; Nijssen, B.; Pytlak, E.

    2017-12-01

    Any model simulation has errors, including errors in meteorological data, process understanding, model structure, and model parameters. These errors may express themselves as bias, timing lags, and differences in sensitivity between the model and the physical world. The evaluation and handling of these errors can greatly affect the legitimacy, validity and usefulness of the resulting scientific product. In this presentation we will discuss a case study of handling and communicating model errors during the development of a hydrologic climate change dataset for the Pacific Northwestern United States. The dataset was the result of a four-year collaboration between the University of Washington, Oregon State University, the Bonneville Power Administration, the United States Army Corps of Engineers and the Bureau of Reclamation. Along the way, the partnership facilitated the discovery of multiple systematic errors in the streamflow dataset. Through an iterative review process, some of those errors could be resolved. For the errors that remained, honest communication of the shortcomings promoted the dataset's legitimacy. Thoroughly explaining errors also improved ways in which the dataset would be used in follow-on impact studies. Finally, we will discuss the development of the "streamflow bias-correction" step often applied to climate change datasets that will be used in impact modeling contexts. We will describe the development of a series of bias-correction techniques through close collaboration among universities and stakeholders. Through that process, both universities and stakeholders learned about the others' expectations and workflows. This mutual learning process allowed for the development of methods that accommodated the stakeholders' specific engineering requirements. The iterative revision process also produced a functional and actionable dataset while preserving its scientific merit. 
    We will describe how encountering earlier techniques' pitfalls allowed us to develop improved methods for scientists and practitioners alike.

  3. State-of-the-Art Fusion-Finder Algorithms Sensitivity and Specificity

    PubMed Central

    Carrara, Matteo; Beccuti, Marco; Lazzarato, Fulvio; Cavallo, Federica; Cordero, Francesca; Donatelli, Susanna; Calogero, Raffaele A.

    2013-01-01

    Background. Gene fusions arising from chromosomal translocations have been implicated in cancer. RNA-seq has the potential to discover such rearrangements generating functional proteins (chimera/fusion). Recently, many methods for chimera detection have been published. However, the specificity and sensitivity of those tools have not been extensively investigated in a comparative way. Results. We tested eight fusion-detection tools (FusionHunter, FusionMap, FusionFinder, MapSplice, deFuse, Bellerophontes, ChimeraScan, and TopHat-fusion) to detect fusion events using synthetic and real datasets encompassing chimeras. A comparison run only on synthetic data could generate misleading results, since we found no counterpart on the real dataset. Furthermore, most tools report a very high number of false positive chimeras. In particular, the most sensitive tool, ChimeraScan, reports a large number of false positives that we were able to significantly reduce by devising and applying two filters to remove fusions not supported by fusion junction-spanning reads or encompassing large intronic regions. Conclusions. The discordant results obtained using synthetic and real datasets suggest that synthetic datasets encompassing fusion events may not fully capture the complexity of an RNA-seq experiment. Moreover, fusion detection tools are still limited in sensitivity or specificity; thus, there is room for further improvement in the fusion-finder algorithms. PMID:23555082

  4. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets.

    PubMed

    Saito, Takaya; Rehmsmeier, Marc

    2015-01-01

    Binary classifiers are routinely evaluated with performance measures such as sensitivity and specificity, and performance is frequently illustrated with Receiver Operating Characteristics (ROC) plots. Alternative measures such as positive predictive value (PPV) and the associated Precision/Recall (PRC) plots are used less frequently. Many bioinformatics studies develop and evaluate classifiers that are to be applied to strongly imbalanced datasets in which the number of negatives outweighs the number of positives significantly. While ROC plots are visually appealing and provide an overview of a classifier's performance across a wide range of specificities, one can ask whether ROC plots could be misleading when applied in imbalanced classification scenarios. We show here that the visual interpretability of ROC plots in the context of imbalanced datasets can be deceptive with respect to conclusions about the reliability of classification performance, owing to an intuitive but wrong interpretation of specificity. PRC plots, on the other hand, can provide the viewer with an accurate prediction of future classification performance due to the fact that they evaluate the fraction of true positives among positive predictions. Our findings have potential implications for the interpretation of a large number of studies that use ROC plots on imbalanced datasets.
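
    The effect described above can be reproduced with a few lines of arithmetic. The sketch below holds a classifier's sensitivity and specificity fixed, so its ROC operating point does not move, and shows how precision changes with class balance. The function name and the class sizes are assumptions chosen for illustration.

```python
def precision_at(sensitivity, specificity, n_pos, n_neg):
    """Precision (PPV) implied by a fixed sensitivity/specificity and class sizes."""
    true_pos = sensitivity * n_pos
    false_pos = (1.0 - specificity) * n_neg
    return true_pos / (true_pos + false_pos)


# Same operating point (sensitivity 0.8, specificity 0.9) on two datasets.
balanced = precision_at(0.8, 0.9, n_pos=1000, n_neg=1000)
imbalanced = precision_at(0.8, 0.9, n_pos=100, n_neg=10_000)

print(f"balanced   precision: {balanced:.3f}")
print(f"imbalanced precision: {imbalanced:.3f}")
```

    With 100 negatives per positive, most positive predictions are false positives even at 90% specificity, which is the information a precision-recall plot exposes and a ROC plot hides.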

  5. Parton Distributions based on a Maximally Consistent Dataset

    NASA Astrophysics Data System (ADS)

    Rojo, Juan

    2016-04-01

    The choice of data that enters a global QCD analysis can have a substantial impact on the resulting parton distributions and their predictions for collider observables. One of the main reasons for this has to do with the possible presence of inconsistencies, either internal within an experiment or external between different experiments. In order to assess the robustness of the global fit, different definitions of a conservative PDF set, that is, a PDF set based on a maximally consistent dataset, have been introduced. However, these approaches are typically affected by theory biases in the selection of the dataset. In this contribution, after a brief overview of recent NNPDF developments, we propose a new, fully objective, definition of a conservative PDF set, based on the Bayesian reweighting approach. Using the new NNPDF3.0 framework, we produce various conservative sets, which turn out to be mutually in agreement within the respective PDF uncertainties, as well as with the global fit. We explore some of their implications for LHC phenomenology, finding also good consistency with the global fit result. These results provide a non-trivial validation test of the new NNPDF3.0 fitting methodology, and indicate that possible inconsistencies in the fitted dataset do not affect substantially the global fit PDFs.

  6. NetCDF-U - Uncertainty conventions for netCDF datasets

    NASA Astrophysics Data System (ADS)

    Bigagli, Lorenzo; Nativi, Stefano; Domenico, Ben

    2013-04-01

    To facilitate the automated processing of uncertain data (e.g. uncertainty propagation in modeling applications), we have proposed a set of conventions for expressing uncertainty information within the netCDF data model and format: the NetCDF Uncertainty Conventions (NetCDF-U). From a theoretical perspective, it can be said that no dataset is a perfect representation of the reality it purports to represent. Inevitably, errors arise from the observation process, including the sensor system and subsequent processing, differences in scales of phenomena and the spatial support of the observation mechanism, lack of knowledge about the detailed conversion between the measured quantity and the target variable. This means that, in principle, all data should be treated as uncertain. The most natural representation of an uncertain quantity is in terms of random variables, with a probabilistic approach. However, it must be acknowledged that almost all existing data resources are not treated in this way. Most datasets come simply as a series of values, often without any uncertainty information. If uncertainty information is present, then it is typically within the metadata, as a data quality element. This is typically a global (dataset wide) representation of uncertainty, often derived through some form of validation process. Typically, it is a statistical measure of spread, for example the standard deviation of the residuals. The introduction of a mechanism by which such descriptions of uncertainty can be integrated into existing geospatial applications is considered a practical step towards a more accurate modeling of our uncertain understanding of any natural process. Given the generality and flexibility of the netCDF data model, conventions on naming, syntax, and semantics have been adopted by several communities of practice, as a means of improving data interoperability. 
    Some of the existing conventions include provisions on uncertain elements and concepts, but, to our knowledge, no general convention on the encoding of uncertainty has been proposed, to date. In particular, the netCDF Climate and Forecast Conventions (NetCDF-CF), a de-facto standard for a large amount of data in Fluid Earth Sciences, mention the issue and provide limited support for uncertainty representation. NetCDF-U is designed to be fully compatible with NetCDF-CF, where possible adopting the same mechanisms (e.g. using the same attributes name with compatible semantics). The rationale for this is that a probabilistic description of scientific quantities is a crosscutting aspect, which may be modularized (note that a netCDF dataset may be compliant with more than one convention). The scope of NetCDF-U is to extend and qualify the netCDF classic data model (also known as netCDF3), to capture the uncertainty related to geospatial information encoded in that format. In the future, a netCDF4 approach for uncertainty encoding will be investigated.
    The NetCDF-U Conventions have the following rationale:
    • Compatibility with netCDF-CF Conventions 1.5.
    • Human-readability of conforming datasets structure.
    • Minimal difference between certain/agnostic and uncertain representations of data (e.g. with respect to dataset structure).
    NetCDF-U is based on a generic mechanism for annotating netCDF data variables with probability theory semantics. The Uncertainty Markup Language (UncertML) 2.0 is used as a controlled conceptual model and vocabulary for NetCDF-U annotations. The proposed mechanism anticipates a generalized support for semantic annotations in netCDF. NetCDF-U defines syntactical conventions for encoding samples, summary statistics, and distributions, along with mechanisms for expressing dependency relationships among variables. The conventions were accepted as an Open Geospatial Consortium (OGC) Discussion Paper (OGC 11-163); related discussions are conducted on a public forum hosted by the OGC. NetCDF-U may have implications for future work directed at communicating geospatial data provenance and uncertainty in contexts other than netCDF. The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement n° 248488.

  7. Pattern identification in time-course gene expression data with the CoGAPS matrix factorization.

    PubMed

    Fertig, Elana J; Stein-O'Brien, Genevieve; Jaffe, Andrew; Colantuoni, Carlo

    2014-01-01

    Patterns in time-course gene expression data can represent the biological processes that are active over the measured time period. However, the orthogonality constraint in standard pattern-finding algorithms, including notably principal components analysis (PCA), confounds expression changes resulting from simultaneous, non-orthogonal biological processes. Previously, we have shown that Markov chain Monte Carlo nonnegative matrix factorization algorithms are particularly adept at distinguishing such concurrent patterns. One such matrix factorization is implemented in the software package CoGAPS. We describe the application of this software and several technical considerations for identification of age-related patterns in a public, prefrontal cortex gene expression dataset.
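
    The contrast drawn above between orthogonal components and nonnegative, possibly overlapping patterns can be illustrated with a small sketch. This is not the CoGAPS algorithm, which the abstract describes as a Markov chain Monte Carlo method; it swaps in the much simpler multiplicative-update form of nonnegative matrix factorization purely to show what "factor a genes-by-time matrix into nonnegative amplitudes and patterns" means. The toy matrix, the function names, and the choice of two patterns are all assumptions made for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Sum of squared differences between v and the product w * h."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factor nonnegative v (genes x time) into w (genes x rank) and h (rank x time).

    Uses multiplicative updates, which keep every entry nonnegative.
    This is a stand-in for illustration, not the MCMC sampler used by CoGAPS.
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # Update patterns: H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # Update amplitudes: W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h


# Hypothetical toy data: four genes over four time points, built from a rising
# and a falling pattern that overlap in time (so they are not orthogonal).
toy_expression = [
    [0.0, 1.0, 2.0, 3.0],   # rising only
    [3.0, 2.0, 1.0, 0.0],   # falling only
    [3.0, 3.0, 3.0, 3.0],   # rising + falling
    [3.0, 4.0, 5.0, 6.0],   # 2 * rising + falling
]

amplitudes, patterns = nmf(toy_expression, rank=2)
print("reconstruction error:", frobenius_error(toy_expression, amplitudes, patterns))
```

    Because both factors are constrained only to be nonnegative, the two recovered patterns are free to be active at the same time points, which an orthogonality constraint would penalize.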

  8. University of Texas Southwestern Medical Center (UTSW): Functional Signature Ontology Tool: Triplicate Measurements of Reporter Gene Expression in Response to Individual Genetic and Chemical Perturbations in HCT116 Cells | Office of Cancer Genomics

    Cancer.gov

    The goal of this project is to use an eight-gene expression profile to define functional signatures for small molecules and natural products with heretofore undefined mechanisms of action. Two genes in the eight-gene set are used as internal controls and do not vary across gene expression array data collected from the public domain. The remaining six genes are found to vary independently across a large collection of publicly available gene expression array datasets.

  9. Mars Global Geologic Mapping Progress and Suggested Geographic-Based Hierarchal Systems for Unit Grouping and Naming

    NASA Technical Reports Server (NTRS)

    Tanaka, K. L.; Dohm, J. M.; Irwin, R.; Kolb, E. J.; Skinner, J. A., Jr.; Hare, T. M.

    2010-01-01

    We are in the fourth year of a five-year effort to map the global geology of Mars at 1:20M scale using mainly Mars Global Surveyor, Mars Express, and Mars Odyssey image and altimetry datasets. Previously, we reported on details of project management, mapping datasets (local and regional), initial and anticipated mapping approaches, and tactics of map unit delineation and description [1-2]. Last year, we described mapping and unit delineation results thus far, a new unit identified in the northern plains, and remaining steps to complete the map [3].

  10. Clinical and Biological Relevance of Genomic Heterogeneity in Chronic Lymphocytic Leukemia

    PubMed Central

    Friedman, Daphne R.; Lucas, Joseph E.; Weinberg, J. Brice

    2013-01-01

    Background Chronic lymphocytic leukemia (CLL) is typically regarded as an indolent B-cell malignancy. However, there is wide variability with regards to need for therapy, time to progressive disease, and treatment response. This clinical variability is due, in part, to biological heterogeneity between individual patients’ leukemias. While much has been learned about this biological variation using genomic approaches, it is unclear whether such efforts have sufficiently evaluated biological and clinical heterogeneity in CLL. Methods To study the extent of genomic variability in CLL and the biological and clinical attributes of genomic classification in CLL, we evaluated 893 unique CLL samples from fifteen publicly available gene expression profiling datasets. We used unsupervised approaches to divide the data into subgroups, evaluated the biological pathways and genetic aberrations that were associated with the subgroups, and compared prognostic and clinical outcome data between the subgroups. Results Using an unsupervised approach, we determined that approximately 600 CLL samples are needed to define the spectrum of diversity in CLL genomic expression. We identified seven genomically-defined CLL subgroups that have distinct biological properties, are associated with specific chromosomal deletions and amplifications, and have marked differences in molecular prognostic markers and clinical outcomes. Conclusions Our results indicate that investigations focusing on small numbers of patient samples likely provide a biased outlook on CLL biology. These findings may have important implications in identifying patients who should be treated with specific targeted therapies, which could have efficacy against CLL cells that rely on specific biological pathways. PMID:23468975

  11. The Long Non-Coding RNA Transcriptome Landscape in CHO Cells Under Batch and Fed-Batch Conditions.

    PubMed

    Vito, Davide; Smales, C Mark

    2018-05-21

    The role of non-coding RNAs in determining growth, productivity and recombinant product quality attributes in Chinese hamster ovary (CHO) cells has received much attention in recent years, exemplified by studies into microRNAs in particular. However, other classes of non-coding RNAs have received less attention. One such class are the non-coding RNAs known collectively as long non-coding RNAs (lncRNAs). We have undertaken the first landscape analysis of the lncRNA transcriptome in CHO using a mouse based microarray that also provided for the surveillance of the coding transcriptome. We report on those lncRNAs present in a model host CHO cell line under batch and fed-batch conditions on two different days and relate the expression of different lncRNAs to each other. We demonstrate that the mouse microarray was suitable for the detection and analysis of thousands of CHO lncRNAs and validated a number of these by qRT-PCR. We then further analysed the data to identify those lncRNAs whose expression changed the most between growth and stationary phases of culture or between batch and fed-batch culture to identify potential lncRNA targets for further functional studies with regard to their role in controlling growth of CHO cells. We discuss the implications for the publication of this rich dataset and how this may be used by the community. This article is protected by copyright. All rights reserved.

  12. Clinical and biological relevance of genomic heterogeneity in chronic lymphocytic leukemia.

    PubMed

    Friedman, Daphne R; Lucas, Joseph E; Weinberg, J Brice

    2013-01-01

    Chronic lymphocytic leukemia (CLL) is typically regarded as an indolent B-cell malignancy. However, there is wide variability with regards to need for therapy, time to progressive disease, and treatment response. This clinical variability is due, in part, to biological heterogeneity between individual patients' leukemias. While much has been learned about this biological variation using genomic approaches, it is unclear whether such efforts have sufficiently evaluated biological and clinical heterogeneity in CLL. To study the extent of genomic variability in CLL and the biological and clinical attributes of genomic classification in CLL, we evaluated 893 unique CLL samples from fifteen publicly available gene expression profiling datasets. We used unsupervised approaches to divide the data into subgroups, evaluated the biological pathways and genetic aberrations that were associated with the subgroups, and compared prognostic and clinical outcome data between the subgroups. Using an unsupervised approach, we determined that approximately 600 CLL samples are needed to define the spectrum of diversity in CLL genomic expression. We identified seven genomically-defined CLL subgroups that have distinct biological properties, are associated with specific chromosomal deletions and amplifications, and have marked differences in molecular prognostic markers and clinical outcomes. Our results indicate that investigations focusing on small numbers of patient samples likely provide a biased outlook on CLL biology. These findings may have important implications in identifying patients who should be treated with specific targeted therapies, which could have efficacy against CLL cells that rely on specific biological pathways.

  13. Downregulated CDKN1C/p57kip2 drives tumorigenesis and associates with poor overall survival in breast cancer.

    PubMed

    Qiu, Zhu; Li, Yunhai; Zeng, Beilei; Guan, Xiaoqin; Li, Hongzhong

    2018-02-26

    CDKN1C, also known as p57kip2, is considered to be a potential tumor suppressor implicated in several kinds of human cancers. However, the current knowledge of CDKN1C in breast cancer remains obscure. In the present study, we demonstrated that CDKN1C was dramatically downregulated in breast cancer compared with normal tissues by using real-time quantitative polymerase chain reaction, western blot and two public data portals: The Cancer Genome Atlas (TCGA) and Oncomine datasets. Moreover, the expression of CDKN1C was correlated with age and tumor size in the TCGA cohort containing 708 cases of breast cancer. Low expression of CDKN1C was significantly associated with poor overall survival (OS) in the TCGA cohort and in a validation cohort composed of 1402 patients. Multivariate Cox regression analysis indicated that CDKN1C was an independent prognostic factor for worse OS (HR = 1.78, 95% CI: 1.09-2.89, p = 0.020). Furthermore, gene set enrichment analysis (GSEA) revealed that CDKN1C was significantly correlated with gene signatures involving DNA repair, cell cycle, glycolysis, adipogenesis, and two critical signaling pathways, mTORC1 and PI3K/Akt/mTOR. In conclusion, our data suggested an essential role of CDKN1C in the tumorigenesis of breast cancer. Targeting CDKN1C may be a promising strategy for anticancer therapeutics. Copyright © 2018 Elsevier Inc. All rights reserved.

  14. Comparative transcriptome sequencing and de novo analysis of Vaccinium corymbosum during fruit and color development.

    PubMed

    Li, Lingli; Zhang, Hehua; Liu, Zhongshuai; Cui, Xiaoyue; Zhang, Tong; Li, Yanfang; Zhang, Lingyun

    2016-10-12

    Blueberry is an economically important fruit crop in the Ericaceae family. The substantial quantities of flavonoids in blueberry have been implicated in a broad range of health benefits. However, transcriptome-level information on fruit development and flavonoid metabolites is still limited. In the present study, transcriptome and gene expression profiling over berry development, especially during color development, were carried out. A total of approximately 13.67 Gbp of data were obtained and assembled into 186,962 transcripts and 80,836 unigenes from three stages of blueberry fruit and color development. A large number of simple sequence repeats (SSRs) and candidate genes, which are potentially involved in plant development, metabolic and hormone pathways, were identified. A total of 6429 sequences containing 8796 SSRs were characterized from 15,457 unigenes and 1763 unigenes contained more than one SSR. The expression profiles of key genes involved in anthocyanin biosynthesis were also studied. In addition, a comparison between our dataset and other published results was carried out. The high-quality reads produced in this study are an important advancement and provide a new resource for the interpretation of high-throughput data for blueberry species, whether regarding sequencing data depth or species extension. This transcriptome data will serve as a valuable public information database for studies of the blueberry genome and would greatly boost research on fruit and color development, flavonoid metabolism and regulation, and breeding of more healthful blueberries.

  15. Decreased expression of serine protease inhibitor family G1 (SERPING1) in prostate cancer can help distinguish high-risk prostate cancer and predicts malignant progression.

    PubMed

    Peng, Shengmeng; Du, Tao; Wu, Wanhua; Chen, Xianju; Lai, Yiming; Zhu, Dingjun; Wang, Qiong; Ma, Xiaoming; Lin, Chunhao; Li, Zean; Guo, Zhenghui; Huang, Hai

    2018-06-11

    The aim of this study was to investigate the associations of serine proteinase inhibitor family G1 (SERPING1) down-regulation with poor prognosis in patients with prostate cancer (PCa). Furthermore, we aimed to find more novel and effective PCa molecular markers to provide an early screening of PCa, distinguish patients with aggressive PCa, predict the prognosis, or reduce the economic burden of PCa. SERPING1 protein expression in both human PCa and normal prostate tissues was detected by immunohistochemical staining, the intensity of which was analyzed in association with clinical pathological parameters such as Gleason score, pathological grade, clinical stage, tumor stage, lymph node metastasis, and distant metastasis. Moreover, we used The Cancer Genome Atlas (TCGA) Database, Taylor Database, and Oncomine dataset to validate our immunohistochemical results and investigated the value of SERPING1 in PCa at mRNA level. Kaplan-Meier analysis and Cox regression analysis were performed to evaluate the relationship between SERPING1 and prognosis of patients with PCa. The results showed that SERPING1 was expressed mainly in the cytoplasm of glandular cells of prostate tissue and was expressed at significantly lower levels in PCa (P<0.001). Furthermore, in the tissue microarray of our samples, decreasing expression of SERPING1 was correlated with higher Gleason score (P = 0.004), higher pathological grade (P = 0.01) and advanced tumor stage (P = 0.005) at protein level. In the TCGA dataset and Taylor dataset, low SERPING1 expression was correlated with younger patient age (P = 0.02 in TCGA, P = 0.044 in Taylor) and higher Gleason score (P = 0.019 in TCGA, P<0.001 in Taylor) at mRNA level. Kaplan-Meier analysis revealed that lower SERPING1 mRNA predicted lower overall survival (P = 0.027 in TCGA), lower disease-free survival (P = 0.029) and lower biochemical recurrence-free survival (P = 0.011 in Taylor). 
Data from the Oncomine database showed that low SERPING1 expression implied higher malignancy of prostate lesions. Using multivariate analysis, we also found that SERPING1 expression was an independent prognostic marker of poor disease-free survival and biochemical recurrence-free survival. SERPING1 may play an important role in PCa and can serve as a novel marker in diagnosis and prognostic prediction in PCa. In addition, levels of SERPING1 can help identify low-risk prostate cancer, providing a reference for patients with PCa to accept active surveillance and reduce overtreatment. Copyright © 2018 Elsevier Inc. All rights reserved.

  16. Cross-species comparison of the gut: Differential gene expression sheds light on biological differences in closely related tenebrionids.

    PubMed

    Oppert, Brenda; Perkin, Lindsey; Martynov, Alexander G; Elpidina, Elena N

    2018-04-01

    The gut is one of the primary interfaces between an insect and its environment. Understanding gene expression profiles in the insect gut can provide insight into interactions with the environment as well as identify potential control methods for pests. We compared the expression profiles of transcripts from the gut of larval stages of two coleopteran insects, Tenebrio molitor and Tribolium castaneum. These tenebrionids have different life cycles, varying in the duration and number of larval instars. T. castaneum has a sequenced genome and has been a model for coleopterans, and we recently obtained a draft genome for T. molitor. We assembled gut transcriptome reads from each insect to their respective genomes and filtered mapped reads to RPKM>1, yielding 11,521 and 17,871 genes in the T. castaneum and T. molitor datasets, respectively. There were identical GO terms in each dataset, and enrichment analyses also identified shared GO terms. From these datasets, we compiled an ortholog list of 6907 genes; 45% of the total assembled reads from T. castaneum were found in the top 25 orthologs, but only 27% of assembled reads were found in the top 25 T. molitor orthologs. There were 2281 genes unique to T. castaneum, and 2088 predicted genes unique to T. molitor, although improvements to the T. molitor genome will likely reduce these numbers as more orthologs are identified. We highlight a few unique genes in T. castaneum or T. molitor that may relate to distinct biological functions. A large number of putative genes expressed in the larval gut with uncharacterized functions (36 and 68% from T. castaneum and T. molitor, respectively) support the need for further research. These data are the first step in building a comprehensive understanding of the physiology of the gut in tenebrionid insects, illustrating commonalities and differences that may be related to speciation and environmental adaptation. Published by Elsevier Ltd.
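
    The RPKM>1 filter mentioned above can be sketched as follows. RPKM (reads per kilobase of transcript per million mapped reads) is a standard normalization; the gene names, read counts, and lengths below are hypothetical, and taking the total mapped reads as the sum of the per-gene counts is an assumption of this sketch rather than a detail reported in the abstract.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def filter_expressed(counts, lengths, min_rpkm=1.0):
    """Return {gene: RPKM} for genes whose RPKM is strictly above min_rpkm.

    counts maps gene -> mapped read count; lengths maps gene -> length in bp.
    Assumption: total mapped reads is the sum of all per-gene counts.
    """
    total = sum(counts.values())
    values = {g: rpkm(c, lengths[g], total) for g, c in counts.items()}
    return {g: v for g, v in values.items() if v > min_rpkm}


# Hypothetical toy inputs, not data from the study.
toy_counts = {"gene_a": 500, "gene_b": 1, "gene_c": 499_499}
toy_lengths = {"gene_a": 1000, "gene_b": 2000, "gene_c": 1500}

kept = filter_expressed(toy_counts, toy_lengths)
print(sorted(kept))
```

    In this toy input, gene_b has an RPKM of exactly 1 and is therefore dropped by the strict inequality, while the other two genes are retained.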

  17. Bi-Force: large-scale bicluster editing and its application to gene expression data biclustering.

    PubMed

    Sun, Peng; Speicher, Nora K; Röttger, Richard; Guo, Jiong; Baumbach, Jan

    2014-05-01

    The explosion of biological data has dramatically reformed today's biological research. The need to integrate and analyze high-dimensional biological data on a large scale is driving the development of novel bioinformatics approaches. Biclustering, also known as 'simultaneous clustering' or 'co-clustering', has been successfully utilized to discover local patterns in gene expression data and similar biomedical data types. Here, we contribute a new heuristic: 'Bi-Force'. It is based on the weighted bicluster editing model, to perform biclustering on arbitrary sets of biological entities, given any kind of pairwise similarities. We first evaluated the power of Bi-Force to solve dedicated bicluster editing problems by comparing Bi-Force with two existing algorithms in the BiCluE software package. We then followed a biclustering evaluation protocol in a recent review paper from Eren et al. (2013) (A comparative analysis of biclustering algorithms for gene expression data. Brief. Bioinform., 14:279-292.) and compared Bi-Force against eight existing tools: FABIA, QUBIC, Cheng and Church, Plaid, BiMax, Spectral, xMOTIFs and ISA. To this end, a suite of synthetic datasets as well as nine large gene expression datasets from Gene Expression Omnibus were analyzed. All resulting biclusters were subsequently investigated by Gene Ontology enrichment analysis to evaluate their biological relevance. The distinct theoretical foundation of Bi-Force (bicluster editing) is more powerful than strict biclustering. We thus outperformed existing tools with Bi-Force at least when following the evaluation protocols from Eren et al. Bi-Force is implemented in Java and integrated into the open source software package of BiCluE. The software as well as all used datasets are publicly available at http://biclue.mpi-inf.mpg.de. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.

  18. Apprentissage de l'expression orale en autonomie. Implications de l'approche fonctionelle (Learning Oral Expression in Independent Study. Implications of the Functional Approach). Melanges pedagogiques.

    ERIC Educational Resources Information Center

    Abe, D.; And Others

    Within the CRAPEL autonomous learning scheme, modular material is being developed for communicative oral expression. The purpose of this material is twofold: (1) to enable the learner to define his or her own needs in communicative terms, that is, to analyse a situation in terms of communicative acts needed in a given situation, the relationships…

  19. Genome-wide identification of suitable zebrafish Danio rerio reference genes for normalization of gene expression data by RT-qPCR.

    PubMed

    Xu, H; Li, C; Zeng, Q; Agrawal, I; Zhu, X; Gong, Z

    2016-06-01

    In this study, to systematically identify the most stably expressed genes for internal reference in zebrafish Danio rerio investigations, 37 D. rerio transcriptomic datasets (both RNA sequencing and microarray data) were collected from gene expression omnibus (GEO) database and unpublished data, and gene expression variations were analysed under three experimental conditions: tissue types, developmental stages and chemical treatments. Forty-four putative candidate genes were identified with the c.v. <0·2 from all datasets. Following clustering into different functional groups, 21 genes, in addition to four conventional housekeeping genes (eef1a1l1, b2m, hrpt1l and actb1), were selected from different functional groups for further quantitative real-time (qrt-)PCR validation using 25 RNA samples from different adult tissues, developmental stages and chemical treatments. The qrt-PCR data were then analysed using the statistical algorithm refFinder for gene expression stability. Several new candidate genes showed better expression stability than the conventional housekeeping genes in all three categories. It was found that sep15 and metap1 were the top two stable genes for tissue types, ube2a and tmem50a the top two for different developmental stages, and rpl13a and rp1p0 the top two for chemical treatments. Thus, based on the extensive transcriptomic analyses and qrt-PCR validation, these new reference genes are recommended for normalization of D. rerio qrt-PCR data respectively for the three different experimental conditions. © 2016 The Fisheries Society of the British Isles.
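
    The stability criterion described above (coefficient of variation below 0.2 across all datasets) can be sketched in a few lines. The coefficient of variation is the standard deviation divided by the mean; whether the study used the population or sample standard deviation is not stated, so the population form is assumed here, and the dataset and gene names and values are hypothetical.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean of the values."""
    m = mean(values)
    if m == 0:
        raise ValueError("mean expression is zero; c.v. is undefined")
    return pstdev(values) / m


def stable_genes(expression, threshold=0.2):
    """Return sorted names of genes whose c.v. is below threshold in every dataset.

    expression maps dataset name -> {gene name -> list of expression values}.
    A gene missing from any dataset is excluded by the set intersection.
    """
    candidates = None
    for genes in expression.values():
        passing = {g for g, v in genes.items()
                   if coefficient_of_variation(v) < threshold}
        candidates = passing if candidates is None else candidates & passing
    return sorted(candidates or [])


# Hypothetical toy values, not data from the study.
toy = {
    "dataset_a": {"gene_x": [10.0, 11.0, 9.0], "gene_y": [1.0, 10.0, 30.0]},
    "dataset_b": {"gene_x": [20.0, 21.0, 19.0], "gene_y": [5.0, 5.0, 5.0]},
}

print(stable_genes(toy))
```

    Here gene_y is perfectly stable in one toy dataset but highly variable in the other, so only gene_x survives the requirement of stability in all datasets.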

  20. Reduction in expression of the benign AR transcriptome is a hallmark of localised prostate cancer progression.

    PubMed

    Stuchbery, Ryan; Macintyre, Geoff; Cmero, Marek; Harewood, Laurence M; Peters, Justin S; Costello, Anthony J; Hovens, Christopher M; Corcoran, Niall M

    2016-05-24

    Despite the importance of androgen receptor (AR) signalling to prostate cancer development, little is known about how this signalling pathway changes with increasing grade and stage of the disease. To explore changes in the normal AR transcriptome in localised prostate cancer, and its relation to adverse pathological features and disease recurrence. Publicly accessible human prostate cancer expression arrays as well as RNA sequencing data from the prostate TCGA. Tumour associated PSA and PSAD were calculated for a large cohort of men (n=1108) undergoing prostatectomy. We performed a meta-analysis of the expression of an androgen-regulated gene set across datasets using Oncomine. Differential expression of selected genes in the prostate TCGA database was probed using the edgeR Bioconductor package. Changes in tumour PSA density with stage and grade were assessed by Student's t-test, and its association with biochemical recurrence explored by Kaplan-Meier curves and Cox regression. Meta-analysis revealed a systematic decline in the expression of a previously identified benign prostate androgen-regulated gene set with increasing tumour grade, reaching significance in nine of 25 genes tested despite increasing AR expression. These results were confirmed in a large independent dataset from the TCGA. At the protein level, when serum PSA was corrected for tumour volume, significantly lower levels were observed with increasing tumour grade and stage, and predicted disease recurrence. Lower PSA secretion-per-tumour-volume is associated with increasing grade and stage of prostate cancer, has prognostic relevance, and reflects a systematic perturbation of androgen signalling.

  1. Gene expression metadata analysis reveals molecular mechanisms employed by Phanerochaete chrysosporium during lignin degradation and detoxification of plant extractives.

    PubMed

    Kameshwar, Ayyappa Kumar Sista; Qin, Wensheng

    2017-10-01

    Lignin, the most complex and abundant biopolymer on the earth's surface, attains its stability from intricate polyphenolic units and non-phenolic bonds, making it difficult to depolymerize or separate from other units of biomass. Its exceptional lignin-degrading ability and the availability of an annotated genome make Phanerochaete chrysosporium ideal for studying lignin-degrading mechanisms. Decoding and understanding the molecular mechanisms underlying the process of lignin degradation will significantly aid the progressing biofuel industries and lead to the production of commercially vital platform chemicals. In this study, we have performed a large-scale metadata analysis to understand the common gene expression patterns of P. chrysosporium during lignin degradation. Gene expression datasets were retrieved from the NCBI GEO database and analyzed using GEO2R and Bioconductor packages. Commonly expressed statistically significant genes among different datasets were further considered to understand their involvement in lignin degradation and detoxification mechanisms. We have observed three sets of enzymes commonly expressed during ligninolytic conditions, which were later classified into primary ligninolytic, aromatic compound-degrading and other necessary enzymes. Similarly, we have observed three sets of genes coding for detoxification and stress-responsive, phase I and phase II metabolic enzymes. Results obtained in this study indicate the coordinated action of enzymes involved in lignin depolymerization and detoxification-stress responses under ligninolytic conditions. We have developed a tentative network of genes and enzymes involved in lignin degradation and detoxification mechanisms by P. chrysosporium based on the literature and results obtained in this study. However, the ambiguity arising from higher expression of several uncharacterized proteins necessitates further proteomic studies in P. chrysosporium.

  2. Novel molecular subtypes of serous and endometrioid ovarian cancer linked to clinical outcome.

    PubMed

    Tothill, Richard W; Tinker, Anna V; George, Joshy; Brown, Robert; Fox, Stephen B; Lade, Stephen; Johnson, Daryl S; Trivett, Melanie K; Etemadmoghadam, Dariush; Locandro, Bianca; Traficante, Nadia; Fereday, Sian; Hung, Jillian A; Chiew, Yoke-Eng; Haviv, Izhak; Gertig, Dorota; DeFazio, Anna; Bowtell, David D L

    2008-08-15

    The study aimed to identify novel molecular subtypes of ovarian cancer by gene expression profiling with linkage to clinical and pathologic features. Microarray gene expression profiling was done on 285 serous and endometrioid tumors of the ovary, peritoneum, and fallopian tube. K-means clustering was applied to identify robust molecular subtypes. Statistical analysis identified differentially expressed genes, pathways, and gene ontologies. Laser capture microdissection, pathology review, and immunohistochemistry validated the array-based findings. Patient survival within k-means groups was evaluated using Cox proportional hazards models. Class prediction validated k-means groups in an independent dataset. A semisupervised survival analysis of the array data was used to compare against unsupervised clustering results. Optimal clustering of array data identified six molecular subtypes. Two subtypes represented predominantly serous low malignant potential and low-grade endometrioid subtypes, respectively. The remaining four subtypes represented higher grade and advanced stage cancers of serous and endometrioid morphology. A novel subtype of high-grade serous cancers reflected a mesenchymal cell type, characterized by overexpression of N-cadherin and P-cadherin and low expression of differentiation markers, including CA125 and MUC1. A poor prognosis subtype was defined by a reactive stroma gene expression signature, correlating with extensive desmoplasia in such samples. A similar poor prognosis signature could be found using a semisupervised analysis. Each subtype displayed distinct levels and patterns of immune cell infiltration. Class prediction identified similar subtypes in an independent ovarian dataset with similar prognostic trends. Gene expression profiling identified molecular subtypes of ovarian cancer of biological and clinical importance.

  3. FastGCN: A GPU Accelerated Tool for Fast Gene Co-Expression Networks

    PubMed Central

    Liang, Meimei; Zhang, Futao; Jin, Gulei; Zhu, Jun

    2015-01-01

    Gene co-expression networks comprise one type of valuable biological network. Many methods and tools have been published to construct gene co-expression networks; however, most of these tools and methods are inconvenient and time consuming for large datasets. We have developed a user-friendly, accelerated and optimized tool for constructing gene co-expression networks that can fully harness the parallel nature of GPU (Graphic Processing Unit) architectures. Genetic entropies were exploited to filter out genes with no or small expression changes in the raw data preprocessing step. Pearson correlation coefficients were then calculated. After that, we normalized these coefficients and employed the False Discovery Rate to control for multiple testing. Finally, module identification was conducted to construct the co-expression networks. All of these calculations were implemented on a GPU. We also compressed the coefficient matrix to save space. We compared the performance of the GPU implementation with those of multi-core CPU implementations with 16 CPU threads, single-thread C/C++ implementation and single-thread R implementation. Our results show that GPU implementation largely outperforms single-thread C/C++ implementation and single-thread R implementation, and GPU implementation outperforms multi-core CPU implementation when the number of genes increases. With the test dataset containing 16,000 genes and 590 individuals, we can achieve greater than 63 times the speed using a GPU implementation compared with a single-thread R implementation when 50 percent of genes were filtered out and about 80 times the speed when no genes were filtered out. PMID:25602758
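
    The middle steps of the pipeline described above (pairwise Pearson correlation, followed by multiple-testing control at a chosen false discovery rate) can be sketched on the CPU in plain Python. This is not the FastGCN implementation: the abstract does not say how coefficients were normalized or which FDR procedure was used, so the Fisher z-transform with a normal approximation and the Benjamini-Hochberg procedure are assumptions chosen because they are common, and all function names here are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (constant genes would be removed
    by an upstream variability filter).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_p_value(r, n):
    """Two-sided p-value for r from n samples (n > 3) via the Fisher z-transform.

    Uses a normal approximation; an assumption of this sketch.
    """
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def benjamini_hochberg(p_values, alpha=0.05):
    """Return booleans marking which hypotheses are rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        # Largest rank whose p-value falls under the stepped-up threshold.
        if p_values[i] <= alpha * rank / m:
            cutoff = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[i] = True
    return rejected


def coexpression_edges(expression, alpha=0.05):
    """Return (gene_a, gene_b, r) for gene pairs significant after FDR control.

    expression maps gene name -> list of expression values across samples.
    """
    genes = sorted(expression)
    pairs = [(a, b) for i, a in enumerate(genes) for b in genes[i + 1:]]
    rs = [pearson(expression[a], expression[b]) for a, b in pairs]
    n = len(next(iter(expression.values())))
    ps = [fisher_p_value(r, n) for r in rs]
    keep = benjamini_hochberg(ps, alpha)
    return [(a, b, r) for (a, b), r, k in zip(pairs, rs, keep) if k]
```

    Every gene pair is independent of every other pair in the correlation step, which is the property that makes this workload a natural fit for the GPU parallelism the abstract describes.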

  4. FastGCN: a GPU accelerated tool for fast gene co-expression networks.

    PubMed

    Liang, Meimei; Zhang, Futao; Jin, Gulei; Zhu, Jun

    2015-01-01

    Gene co-expression networks comprise one type of valuable biological network. Many methods and tools have been published to construct gene co-expression networks; however, most of these tools and methods are inconvenient and time consuming for large datasets. We have developed a user-friendly, accelerated and optimized tool for constructing gene co-expression networks that can fully harness the parallel nature of GPU (Graphic Processing Unit) architectures. Genetic entropies were exploited to filter out genes with no or small expression changes in the raw data preprocessing step. Pearson correlation coefficients were then calculated. After that, we normalized these coefficients and employed the False Discovery Rate to control for multiple testing. Finally, module identification was conducted to construct the co-expression networks. All of these calculations were implemented on a GPU. We also compressed the coefficient matrix to save space. We compared the performance of the GPU implementation with those of multi-core CPU implementations with 16 CPU threads, single-thread C/C++ implementation and single-thread R implementation. Our results show that GPU implementation largely outperforms single-thread C/C++ implementation and single-thread R implementation, and GPU implementation outperforms multi-core CPU implementation when the number of genes increases. With the test dataset containing 16,000 genes and 590 individuals, we can achieve greater than 63 times the speed using a GPU implementation compared with a single-thread R implementation when 50 percent of genes were filtered out and about 80 times the speed when no genes were filtered out.

  5. Integrative functional analyses using rainbow trout selected for tolerance to plant diets reveal nutrigenomic signatures for soy utilization without the concurrence of enteritis

    PubMed Central

    Brezas, Andreas; Snekvik, Kevin R.; Hardy, Ronald W.; Overturf, Ken

    2017-01-01

    Finding suitable alternative protein sources for diets of carnivorous fish species remains a major concern for sustainable aquaculture. Through genetic selection, we created a strain of rainbow trout that outperforms parental lines in utilizing an all-plant protein diet and does not develop enteritis in the distal intestine, as is typical with salmonids on long-term plant protein-based feeds. By incorporating this strain into functional analyses, we set out to determine which genes are critical to plant protein utilization in the absence of gut inflammation. After a 12-week feeding trial with our selected strain and a control trout strain fed either a fishmeal-based diet or an all-plant protein diet, high-throughput RNA sequencing was completed on both liver and muscle tissues. Differential gene expression analyses, weighted correlation network analyses and further functional characterization were performed. A strain-by-diet design revealed differential expression ranging from a few dozen to over one thousand genes among the various comparisons and tissues. Major gene ontology groups identified between comparisons included those encompassing central, intermediary and foreign molecule metabolism, associated biosynthetic pathways as well as immunity. A systems approach indicated that genes involved in purine metabolism were highly perturbed. Systems analysis among the tissues tested further suggests the interplay between selection for growth, dietary utilization and protein tolerance may also have implications for nonspecific immunity. By combining data from differential gene expression and co-expression networks using selected trout, along with ontology and pathway analyses, a set of 63 candidate genes for plant diet tolerance was found. Risk loci in human inflammatory bowel diseases were also found in our datasets, indicating rainbow trout selected for plant-diet tolerance may have added utility as a potential biomedical model. PMID:28723948

  6. SMAD4 Loss Is Associated with Cetuximab Resistance and Induction of MAPK/JNK Activation in Head and Neck Cancer Cells.

    PubMed

    Ozawa, Hiroyuki; Ranaweera, Ruchira S; Izumchenko, Evgeny; Makarev, Eugene; Zhavoronkov, Alex; Fertig, Elana J; Howard, Jason D; Markovic, Ana; Bedi, Atul; Ravi, Rajani; Perez, Jimena; Le, Quynh-Thu; Kong, Christina S; Jordan, Richard C; Wang, Hao; Kang, Hyunseok; Quon, Harry; Sidransky, David; Chung, Christine H

    2017-09-01

    Purpose: We previously demonstrated an association between decreased SMAD4 expression and cetuximab resistance in head and neck squamous cell carcinoma (HNSCC). The purpose of this study was to further elucidate the clinical relevance of SMAD4 loss in HNSCC. Experimental Design: SMAD4 expression was assessed by IHC in 130 patients with newly diagnosed HNSCC and 43 patients with recurrent HNSCC. Correlative statistical analysis with clinicopathologic data was also performed. OncoFinder, a bioinformatics tool, was used to analyze molecular signaling in TCGA tumors with low or high SMAD4 mRNA levels. The role of SMAD4 was investigated by shRNA knockdown and gene reconstitution of HPV-negative HNSCC cell lines in vitro and in vivo. Results: Our analysis revealed that SMAD4 loss was associated with an aggressive, HPV-negative, cetuximab-resistant phenotype. We found a signature of prosurvival and antiapoptotic pathways that were commonly dysregulated in SMAD4-low cases derived from the TCGA-HNSCC dataset and an independent oral cavity squamous cell carcinoma (OSCC) cohort obtained from GEO. We show that SMAD4 depletion in an HNSCC cell line induces cetuximab resistance and results in worse survival in an orthotopic mouse model in vivo. We implicate JNK and MAPK activation as mediators of cetuximab resistance and provide the foundation for the concomitant EGFR and JNK/MAPK inhibition as a potential strategy for overcoming cetuximab resistance in HNSCCs with SMAD4 loss. Conclusions: Our study demonstrates that loss of SMAD4 expression is a signature characterizing the cetuximab-resistant phenotype and suggests that SMAD4 expression may be a determinant of sensitivity/resistance to EGFR/MAPK or EGFR/JNK inhibition in HPV-negative HNSCC tumors. Clin Cancer Res; 23(17); 5162-75. ©2017 AACR.

  7. Bayesian Network Webserver: a comprehensive tool for biological network modeling.

    PubMed

    Ziebarth, Jesse D; Bhattacharya, Anindya; Cui, Yan

    2013-11-01

    The Bayesian Network Webserver (BNW) is a platform for comprehensive network modeling of systems genetics and other biological datasets. It allows users to quickly and seamlessly upload a dataset, learn the structure of the network model that best explains the data and use the model to understand relationships between network variables. Many datasets, including those used to create genetic network models, contain both discrete (e.g. genotype) and continuous (e.g. gene expression traits) variables, and BNW allows for modeling hybrid datasets. Users of BNW can incorporate prior knowledge during structure learning through an easy-to-use structural constraint interface. After structure learning, users are immediately presented with an interactive network model, which can be used to make testable hypotheses about network relationships. BNW, including a downloadable structure learning package, is available at http://compbio.uthsc.edu/BNW. (The BNW interface for adding structural constraints uses HTML5 features that are not supported by the current version of Internet Explorer. We suggest using other browsers, e.g. Google Chrome or Mozilla Firefox, when accessing BNW.) Contact: ycui2@uthsc.edu. Supplementary data are available at Bioinformatics online.

  8. A Predictive Model of the Oxygen and Heme Regulatory Network in Yeast

    PubMed Central

    Kundaje, Anshul; Xin, Xiantong; Lan, Changgui; Lianoglou, Steve; Zhou, Mei; Zhang, Li; Leslie, Christina

    2008-01-01

    Deciphering gene regulatory mechanisms through the analysis of high-throughput expression data is a challenging computational problem. Previous computational studies have used large expression datasets in order to resolve fine patterns of coexpression, producing clusters or modules of potentially coregulated genes. These methods typically examine promoter sequence information, such as DNA motifs or transcription factor occupancy data, in a separate step after clustering. We needed an alternative and more integrative approach to study the oxygen regulatory network in Saccharomyces cerevisiae using a small dataset of perturbation experiments. Mechanisms of oxygen sensing and regulation underlie many physiological and pathological processes, and only a handful of oxygen regulators have been identified in previous studies. We used a new machine learning algorithm called MEDUSA to uncover detailed information about the oxygen regulatory network using genome-wide expression changes in response to perturbations in the levels of oxygen, heme, Hap1, and Co2+. MEDUSA integrates mRNA expression, promoter sequence, and ChIP-chip occupancy data to learn a model that accurately predicts the differential expression of target genes in held-out data. We used a novel margin-based score to extract significant condition-specific regulators and assemble a global map of the oxygen sensing and regulatory network. This network includes both known oxygen and heme regulators, such as Hap1, Mga2, Hap4, and Upc2, as well as many new candidate regulators. MEDUSA also identified many DNA motifs that are consistent with previous experimentally identified transcription factor binding sites. Because MEDUSA's regulatory program associates regulators to target genes through their promoter sequences, we directly tested the predicted regulators for OLE1, a gene specifically induced under hypoxia, by experimental analysis of the activity of its promoter. 
    In each case, deletion of the candidate regulator resulted in the predicted effect on promoter activity, confirming that several novel regulators identified by MEDUSA are indeed involved in oxygen regulation. MEDUSA can reveal important information from a small dataset and generate testable hypotheses for further experimental analysis. Supplemental data are included. PMID:19008939

  9. Loss of Dickkopf 3 Promotes the Tumorigenesis of Basal Breast Cancer

    PubMed Central

    Lorsy, Eva; Topuz, Aylin Sophie; Geisler, Cordelia; Stahl, Sarah; Garczyk, Stefan; von Stillfried, Saskia; Hoss, Mareike; Gluz, Oleg; Hartmann, Arndt; Knüchel, Ruth; Dahl, Edgar

    2016-01-01

    Dickkopf 3 (DKK3) has been associated with tumor suppression of various tumor entities including breast cancer. However, the functional impact of DKK3 on the tumorigenesis of distinct molecular breast cancer subtypes has not been considered so far. Therefore, we initiated a study analyzing the subtype-specific DKK3 expression pattern as well as its prognostic and functional impact with respect to breast cancer subtypes. Based on three independent tissue cohorts including one in silico dataset (n = 30, n = 463 and n = 791) we observed a clear down-regulation of DKK3 expression in breast cancer samples compared to healthy breast tissue controls on mRNA and protein level. Interestingly, most abundant reduction of DKK3 expression was detected in the highly aggressive basal breast cancer subtype. Analyzing a large in silico dataset comprising 3,554 cases showed that low DKK3 mRNA expression was significantly associated with reduced recurrence free survival (RFS) of luminal and basal-like breast cancer cases. Functionally, DKK3 re-expression in human breast cancer cell lines led to suppression of cell growth possibly mediated by up-regulation of apoptosis in basal-like but not in luminal-like breast cancer cell lines. Moreover, ectopic DKK3 expression in mesenchymal basal breast cancer cells resulted in partial restoration of epithelial cell morphology which was molecularly supported by higher expression of epithelial markers like E-Cadherin and down-regulation of mesenchymal markers such as Snail 1. Hence, we provide evidence that down-regulation of DKK3 especially promotes tumorigenesis of the aggressive basal breast cancer subtype. Further studies decoding the underlying molecular mechanisms of DKK3-mediated effects may help to identify novel targeted therapies for this clinically highly relevant breast cancer subtype. PMID:27467270

  10. The New Planetary Science Archive (PSA): Exploration and Discovery of Scientific Datasets from ESA's Planetary Missions

    NASA Astrophysics Data System (ADS)

    Heather, David; Besse, Sebastien; Vallat, Claire; Barbarisi, Isa; Arviset, Christophe; De Marchi, Guido; Barthelemy, Maud; Coia, Daniela; Costa, Marc; Docasal, Ruben; Fraga, Diego; Grotheer, Emmanuel; Lim, Tanya; MacFarlane, Alan; Martinez, Santa; Rios, Carlos; Vallejo, Fran; Saiz, Jaime

    2017-04-01

    The Planetary Science Archive (PSA) is the European Space Agency's (ESA) repository of science data from all planetary science and exploration missions. The PSA provides access to scientific datasets through various interfaces at http://psa.esa.int. All datasets are scientifically peer-reviewed by independent scientists, and are compliant with the Planetary Data System (PDS) standards. The PSA is currently implementing a number of significant improvements, mostly driven by the evolution of the PDS standard, and the growing need for better interfaces and advanced applications to support science exploitation. As of the end of 2016, the PSA is hosting data from all of ESA's planetary missions. This includes ESA's first planetary mission Giotto that encountered comet 1P/Halley in 1986 with a flyby at 800km. Science data from Venus Express, Mars Express, Huygens and the SMART-1 mission are also all available at the PSA. The PSA also contains all science data from Rosetta, which explored comet 67P/Churyumov-Gerasimenko and asteroids Steins and Lutetia. The year 2016 has seen the arrival of the ExoMars 2016 data in the archive. In the upcoming years, at least three new projects are foreseen to be fully archived at the PSA. The BepiColombo mission is scheduled for launch in 2018. Following that, the ExoMars Rover Surface Platform (RSP) in 2020, and then the JUpiter ICy moon Explorer (JUICE). All of these will archive their data in the PSA. In addition, a few ground-based support programmes are also available, especially for the Venus Express and Rosetta missions. The newly designed PSA will enhance the user experience and will significantly reduce the complexity for users to find their data promoting one-click access to the scientific datasets with more customized views when needed. This includes a better integration with Planetary GIS analysis tools and Planetary interoperability services (search and retrieve data, supporting e.g. PDAP, EPN-TAP). 
    It will also be up-to-date with versions 3 and 4 of the PDS standards, as PDS4 will be used for ESA's ExoMars and upcoming BepiColombo missions. Users will have direct access to documentation, information and tools that are relevant to the scientific use of the dataset, including ancillary datasets, Software Interface Specification (SIS) documents, and any tools/help that the PSA team can provide. The new PSA interface was released in January 2017. The home page provides direct and simple access to the scientific data, aiming to help scientists discover and explore its content. The archive can be explored through a set of parameters that allow the selection of products through space and time. Quick views provide information needed for the selection of appropriate scientific products. During 2017, the PSA team will focus their efforts on developing a map search interface using GIS technologies to display ESA planetary datasets, an image gallery providing navigation through images to explore the datasets, and interoperability with international partners. This will be done in parallel with additional metadata searchable through the interface (i.e., geometry), and with a dedication to improve the content of 20 years of space exploration.

  11. Multiclass cancer classification using a feature subset-based ensemble from microRNA expression profiles.

    PubMed

    Piao, Yongjun; Piao, Minghao; Ryu, Keun Ho

    2017-01-01

    Cancer classification has been a crucial topic of research in cancer treatment. In the last decade, messenger RNA (mRNA) expression profiles have been widely used to classify different types of cancers. With the discovery of a new class of small non-coding RNAs, known as microRNAs (miRNAs), various studies have shown that the expression patterns of miRNA can also accurately classify human cancers. Therefore, there is a great demand for the development of machine learning approaches to accurately classify various types of cancers using miRNA expression data. In this article, we propose a feature subset-based ensemble method in which each model is learned from a different projection of the original feature space to classify multiple cancers. In our method, the feature relevance and redundancy are considered to generate multiple feature subsets, the base classifiers are learned from each independent miRNA subset, and the average posterior probability is used to combine the base classifiers. To test the performance of our method, we used bead-based and sequence-based miRNA expression datasets and conducted 10-fold and leave-one-out cross validations. The experimental results show that the proposed method yields good results and has higher prediction accuracy than popular ensemble methods. The Java program and source code of the proposed method and the datasets in the experiments are freely available at https://sourceforge.net/projects/mirna-ensemble/. Copyright © 2016 Elsevier Ltd. All rights reserved.
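
    The combination rule named in this abstract, averaging the posterior probabilities of the base classifiers, can be illustrated as follows. This is a minimal sketch, not the authors' Java program: the three posterior dictionaries stand in for base classifiers trained on different miRNA subsets, and the class labels and probabilities are made up for illustration.

```python
def average_posteriors(posteriors):
    """Average class posteriors over base classifiers.

    posteriors is a list of dicts, one per base classifier, each mapping
    class label -> posterior probability for the same sample.
    """
    classes = posteriors[0].keys()
    k = len(posteriors)
    return {c: sum(p[c] for p in posteriors) / k for c in classes}


def ensemble_predict(posteriors):
    """Return the class with the highest averaged posterior."""
    avg = average_posteriors(posteriors)
    return max(avg, key=avg.get)


# Hypothetical outputs of three base classifiers for one sample.
base_outputs = [
    {"colon": 0.6, "lung": 0.4},
    {"colon": 0.3, "lung": 0.7},
    {"colon": 0.2, "lung": 0.8},
]
combined = average_posteriors(base_outputs)
label = ensemble_predict(base_outputs)
print(label)
```

    Averaging lets a confident minority outweigh a hesitant majority, which plain majority voting over hard labels cannot do.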

  12. SigEMD: A powerful method for differential gene expression analysis in single-cell RNA sequencing data.

    PubMed

    Wang, Tianyu; Nabavi, Sheida

    2018-04-24

    Differential gene expression analysis is one of the significant efforts in single cell RNA sequencing (scRNAseq) analysis to discover the specific changes in expression levels of individual cell types. Since scRNAseq exhibits multimodality, large amounts of zero counts, and sparsity, it is different from the traditional bulk RNA sequencing (RNAseq) data. The new challenges of scRNAseq data promote the development of new methods for identifying differentially expressed (DE) genes. In this study, we proposed a new method, SigEMD, that combines a data imputation approach, a logistic regression model and a nonparametric method based on the Earth Mover's Distance, to precisely and efficiently identify DE genes in scRNAseq data. The regression model and data imputation are used to reduce the impact of large amounts of zero counts, and the nonparametric method is used to improve the sensitivity of detecting DE genes from multimodal scRNAseq data. By additionally employing gene interaction network information to adjust the final states of DE genes, we further reduce the false positives of calling DE genes. We used simulated datasets and real datasets to evaluate the detection accuracy of the proposed method and to compare its performance with those of other differential expression analysis methods. Results indicate that the proposed method has an overall powerful performance in terms of precision in detection, sensitivity, and specificity. Copyright © 2018 Elsevier Inc. All rights reserved.
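
    The Earth Mover's Distance at the core of this method can be sketched for a single gene as follows. This covers only the distance statistic with a permutation test, which is an assumed choice of significance test; the imputation, logistic regression and gene-network adjustment steps from the abstract are not shown, and the function names and toy counts are hypothetical.

```python
import random


def emd_1d(x, y):
    """1-D Earth Mover's Distance: area between the two empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    points = sorted(set(xs) | set(ys))
    dist = 0.0
    i = j = 0
    for left, right in zip(points, points[1:]):
        # Advance to the number of observations <= left in each sample.
        while i < len(xs) and xs[i] <= left:
            i += 1
        while j < len(ys) and ys[j] <= left:
            j += 1
        dist += abs(i / len(xs) - j / len(ys)) * (right - left)
    return dist


def permutation_p_value(x, y, n_perm=200, seed=0):
    """Share of label shufflings whose EMD is at least the observed EMD."""
    rng = random.Random(seed)
    observed = emd_1d(x, y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if emd_1d(pooled[:len(x)], pooled[len(x):]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy expression of one gene in two groups of cells (hypothetical values).
group_a = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
group_b = [5, 5, 5, 5, 5, 5, 5, 5, 5, 5]
distance = emd_1d(group_a, group_b)
p_value = permutation_p_value(group_a, group_b)
print(distance, p_value)
```

    Because the distance compares whole distributions rather than means, it can separate groups that differ in shape, which is why it suits multimodal scRNAseq data.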

  13. Genome-wide screen identifies a novel prognostic signature for breast cancer survival

    DOE PAGES

    Mao, Xuan Y.; Lee, Matthew J.; Zhu, Jeffrey; ...

    2017-01-21

    Large genomic datasets in combination with clinical data can be used as an unbiased tool to identify genes important in patient survival and discover potential therapeutic targets. We used a genome-wide screen to identify 587 genes significantly and robustly deregulated across four independent breast cancer (BC) datasets compared to normal breast tissue. Gene expression of 381 genes was significantly associated with relapse-free survival (RFS) in BC patients. We used a gene co-expression network approach to visualize the genetic architecture in normal breast and BCs. In normal breast tissue, co-expression cliques were identified enriched for cell cycle, gene transcription, cell adhesion, cytoskeletal organization and metabolism. In contrast, in BC, only two major co-expression cliques were identified enriched for cell cycle-related processes or blood vessel development, cell adhesion and mammary gland development processes. Interestingly, gene expression levels of 7 genes were found to be negatively correlated with many cell cycle related genes, highlighting these genes as potential tumor suppressors and novel therapeutic targets. A forward-conditional Cox regression analysis was used to identify a 12-gene signature associated with RFS. A prognostic scoring system was created based on the 12-gene signature. This scoring system robustly predicted BC patient RFS in 60 sampling test sets and was further validated in TCGA and METABRIC BC data. Our integrated study identified a 12-gene prognostic signature that could guide adjuvant therapy for BC patients and includes novel potential molecular targets for therapy.

  14. Estimating replicate time shifts using Gaussian process regression

    PubMed Central

    Liu, Qiang; Andersen, Bogi; Smyth, Padhraic; Ihler, Alexander

    2010-01-01

    Motivation: Time-course gene expression datasets provide important insights into dynamic aspects of biological processes, such as circadian rhythms, cell cycle and organ development. In a typical microarray time-course experiment, measurements are obtained at each time point from multiple replicate samples. Accurately recovering the gene expression patterns from experimental observations is made challenging by both measurement noise and variation among replicates' rates of development. Prior work on this topic has focused on inference of expression patterns assuming that the replicate times are synchronized. We develop a statistical approach that simultaneously infers both (i) the underlying (hidden) expression profile for each gene, as well as (ii) the biological time for each individual replicate. Our approach is based on Gaussian process regression (GPR) combined with a probabilistic model that accounts for uncertainty about the biological development time of each replicate. Results: We apply GPR with uncertain measurement times to a microarray dataset of mRNA expression for the hair-growth cycle in mouse back skin, predicting both profile shapes and biological times for each replicate. The predicted time shifts show high consistency with independently obtained morphological estimates of relative development. We also show that the method systematically reduces prediction error on out-of-sample data, significantly reducing the mean squared error in a cross-validation study. Availability: Matlab code for GPR with uncertain time shifts is available at http://sli.ics.uci.edu/Code/GPRTimeshift/ Contact: ihler@ics.uci.edu PMID:20147305

  15. Robust transcriptional tumor signatures applicable to both formalin-fixed paraffin-embedded and fresh-frozen samples

    PubMed Central

    Cheng, Jun; He, Jun; Liu, Huaping; Cai, Hao; Hong, Guini; Zhang, Jiahui; Li, Na; Ao, Lu; Guo, Zheng

    2017-01-01

    Formalin-fixed paraffin-embedded (FFPE) samples represent a valuable resource for clinical research. However, FFPE samples are usually considered an unreliable source for gene expression analysis due to partial RNA degradation. In this study, through comparing gene expression profiles between FFPE samples and paired fresh-frozen (FF) samples for three cancer types, we first showed that expression measurements of thousands of genes had at least two-fold change in FFPE samples compared with paired FF samples. Therefore, for a transcriptional signature based on risk scores summarized from the expression levels of the signature genes, the risk score thresholds trained from FFPE (or FF) samples could not be applied to FF (or FFPE) samples. On the other hand, we found that more than 90% of the relative expression orderings (REOs) of gene pairs in the FF samples were maintained in their paired FFPE samples and largely unaffected by the storage time. The result suggested that the REOs of gene pairs were highly robust against partial RNA degradation in FFPE samples. Finally, as a case study, we developed an REO-based signature to distinguish liver cirrhosis from hepatocellular carcinoma (HCC) using FFPE samples. The signature was validated in four datasets of FFPE samples and eight datasets of FF samples. In conclusion, the valuable FFPE samples can be fully exploited to identify REO-based diagnostic and prognostic signatures which could be robustly applicable to both FF samples and FFPE samples with degraded RNA. PMID:28036264
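
    The idea of a relative expression ordering (REO) can be made concrete with a short sketch that measures how many gene-pair orderings survive between two paired samples. It is an illustration under stated assumptions, not the authors' pipeline: the function name, the handling of ties (tied pairs are skipped) and the toy expression values are all hypothetical.

```python
from itertools import combinations


def reo_concordance(sample_a, sample_b):
    """Fraction of gene pairs whose expression ordering agrees in both samples.

    Each sample maps gene name -> expression level. Pairs tied in either
    sample are skipped (an assumption of this sketch).
    """
    genes = sorted(set(sample_a) & set(sample_b))
    same = total = 0
    for g, h in combinations(genes, 2):
        if sample_a[g] == sample_a[h] or sample_b[g] == sample_b[h]:
            continue
        total += 1
        if (sample_a[g] > sample_a[h]) == (sample_b[g] > sample_b[h]):
            same += 1
    return same / total if total else float("nan")


# Toy paired profiles (hypothetical): absolute levels drop sharply in FFPE,
# yet two of the three pairwise orderings are preserved.
ff = {"A": 10.0, "B": 5.0, "C": 1.0}
ffpe = {"A": 4.0, "B": 2.5, "C": 3.0}
concordance = reo_concordance(ff, ffpe)
print(concordance)
```

    In the toy data, gene A falls more than two-fold between the samples while its ordering relative to B and C is unchanged, mirroring the abstract's contrast between unstable absolute levels and stable orderings.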

  16. Genome-wide screen identifies a novel prognostic signature for breast cancer survival

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Mao, Xuan Y.; Lee, Matthew J.; Zhu, Jeffrey

    Large genomic datasets in combination with clinical data can be used as an unbiased tool to identify genes important in patient survival and discover potential therapeutic targets. We used a genome-wide screen to identify 587 genes significantly and robustly deregulated across four independent breast cancer (BC) datasets compared to normal breast tissue. Gene expression of 381 genes was significantly associated with relapse-free survival (RFS) in BC patients. We used a gene co-expression network approach to visualize the genetic architecture in normal breast and BCs. In normal breast tissue, co-expression cliques were identified enriched for cell cycle, gene transcription, cell adhesion, cytoskeletal organization and metabolism. In contrast, in BC, only two major co-expression cliques were identified enriched for cell cycle-related processes or blood vessel development, cell adhesion and mammary gland development processes. Interestingly, gene expression levels of 7 genes were found to be negatively correlated with many cell cycle related genes, highlighting these genes as potential tumor suppressors and novel therapeutic targets. A forward-conditional Cox regression analysis was used to identify a 12-gene signature associated with RFS. A prognostic scoring system was created based on the 12-gene signature. This scoring system robustly predicted BC patient RFS in 60 sampling test sets and was further validated in TCGA and METABRIC BC data. Our integrated study identified a 12-gene prognostic signature that could guide adjuvant therapy for BC patients and includes novel potential molecular targets for therapy.

  17. Machine Learning, Sentiment Analysis, and Tweets: An Examination of Alzheimer's Disease Stigma on Twitter.

    PubMed

    Oscar, Nels; Fox, Pamela A; Croucher, Racheal; Wernick, Riana; Keune, Jessica; Hooker, Karen

    2017-09-01

    Social scientists need practical methods for harnessing large, publicly available datasets that inform the social context of aging. We describe our development of a semi-automated text coding method and use a content analysis of Alzheimer's disease (AD) and dementia portrayal on Twitter to demonstrate its use. The approach improves feasibility of examining large publicly available datasets. Machine learning techniques modeled stigmatization expressed in 31,150 AD-related tweets collected via Twitter's search API based on 9 AD-related keywords. Two researchers manually coded 311 random tweets on 6 dimensions. This input from 1% of the dataset was used to train a classifier against the tweet text and code the remaining 99% of the dataset. Our automated process identified that 21.13% of the AD-related tweets used AD-related keywords to perpetuate public stigma, which could impact stereotypes and negative expectations for individuals with the disease and increase "excess disability". This technique could be applied to questions in social gerontology related to how social media outlets reflect and shape attitudes bearing on other developmental outcomes. Recommendations for the collection and analysis of large Twitter datasets are discussed. © The Author 2017. Published by Oxford University Press on behalf of The Gerontological Society of America. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.

  18. MEXPRESS: visualizing expression, DNA methylation and clinical TCGA data.

    PubMed

    Koch, Alexander; De Meyer, Tim; Jeschke, Jana; Van Criekinge, Wim

    2015-08-26

    In recent years, increasing amounts of genomic and clinical cancer data have become publicly available through large-scale collaborative projects such as The Cancer Genome Atlas (TCGA). However, as long as these datasets are difficult to access and interpret, they are essentially useless for a major part of the research community and their scientific potential will not be fully realized. To address these issues we developed MEXPRESS, a straightforward and easy-to-use web tool for the integration and visualization of the expression, DNA methylation and clinical TCGA data on a single-gene level (http://mexpress.be). In comparison to existing tools, MEXPRESS allows researchers to quickly visualize and interpret the different TCGA datasets and their relationships for a single gene, as demonstrated for GSTP1 in prostate adenocarcinoma. We also used MEXPRESS to reveal the differences in the DNA methylation status of the PAM50 marker gene MLPH between the breast cancer subtypes and how these differences were linked to the expression of MLPH. We have created a user-friendly tool for the visualization and interpretation of TCGA data, offering clinical researchers a simple way to evaluate the TCGA data for their genes or candidate biomarkers of interest.

  19. Structural analysis of online handwritten mathematical symbols based on support vector machines

    NASA Astrophysics Data System (ADS)

    Simistira, Foteini; Papavassiliou, Vassilis; Katsouros, Vassilis; Carayannis, George

    2013-01-01

    Mathematical expression recognition is still a very challenging task for the research community, mainly because of the two-dimensional (2D) structure of mathematical expressions (MEs). In this paper, we present a novel approach for the structural analysis between two on-line handwritten mathematical symbols of an ME, based on spatial features of the symbols. We introduce six features to represent the spatial affinity of the symbols and compare two multi-class classification methods that employ support vector machines (SVMs), one based on the "one-against-one" technique and one based on the "one-against-all" technique, in identifying the relation between a pair of symbols (e.g. subscript, numerator, etc.). A dataset containing 1906 spatial relations derived from the Competition on Recognition of Online Handwritten Mathematical Expressions (CROHME) 2012 training dataset is constructed to evaluate the classifiers and compare them with the rule-based classifier of the ILSP-1 system that participated in the contest. The experimental results give an overall mean error rate of 2.61% for the "one-against-one" SVM approach, 6.57% for the "one-against-all" SVM technique and 12.31% for the ILSP-1 classifier.

  20. GTA: a game theoretic approach to identifying cancer subnetwork markers.

    PubMed

    Farahmand, S; Goliaei, S; Ansari-Pour, N; Razaghi-Moghadam, Z

    2016-03-01

    The identification of genetic markers (e.g. genes, pathways and subnetworks) for cancer has been one of the most challenging research areas in recent years. A subset of these studies attempts to analyze genome-wide expression profiles to identify markers with high reliability and reusability across independent whole-transcriptome microarray datasets. Therefore, the functional relationships of genes are integrated with their expression data. However, for a more accurate representation of the functional relationships among genes, utilization of the protein-protein interaction network (PPIN) seems to be necessary. Herein, a novel game theoretic approach (GTA) is proposed for the identification of cancer subnetwork markers by integrating genome-wide expression profiles and PPIN. The GTA method was applied to three distinct whole-transcriptome breast cancer datasets to identify the subnetwork markers associated with metastasis. To evaluate the performance of our approach, the identified subnetwork markers were compared with gene-based, pathway-based and network-based markers. We show that GTA is not only capable of identifying robust metastatic markers but also provides higher classification performance. In addition, based on these GTA-based subnetworks, we identified a new bona fide candidate gene for breast cancer susceptibility.
