Richard, Arianne C; Lyons, Paul A; Peters, James E; Biasci, Daniele; Flint, Shaun M; Lee, James C; McKinney, Eoin F; Siegel, Richard M; Smith, Kenneth G C
2014-08-04
Although numerous investigations have compared gene expression microarray platforms, preprocessing methods and batch correction algorithms using constructed spike-in or dilution datasets, there remains a paucity of studies examining the properties of microarray data using diverse biological samples. Most microarray experiments seek to identify subtle differences between samples with variable background noise, a scenario poorly represented by constructed datasets. Thus, microarray users lack important information regarding the complexities introduced in real-world experimental settings. The recent development of a multiplexed, digital technology for nucleic acid measurement enables counting of individual RNA molecules without amplification and, for the first time, permits such a study. Using a set of human leukocyte subset RNA samples, we compared previously acquired microarray expression values with RNA molecule counts determined by the nCounter Analysis System (NanoString Technologies) in selected genes. We found that gene measurements across samples correlated well between the two platforms, particularly for high-variance genes, while genes deemed unexpressed by the nCounter generally had both low expression and low variance on the microarray. Confirming previous findings from spike-in and dilution datasets, this "gold-standard" comparison demonstrated signal compression that varied dramatically by expression level and, to a lesser extent, by dataset. Most importantly, examination of three different cell types revealed that noise levels differed across tissues. Microarray measurements generally correlate with relative RNA molecule counts within optimal ranges but suffer from expression-dependent accuracy bias and precision that varies across datasets. We urge microarray users to consider expression-level effects in signal interpretation and to evaluate noise properties in each dataset independently.
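The two quantities this abstract relies on, cross-platform correlation and signal compression, can be illustrated with a minimal sketch. It assumes Pearson correlation and an ordinary least-squares slope on log-scale values; the abstract does not say which statistics the authors used, and the function name `correlation_and_slope` is hypothetical. A slope below 1 when regressing microarray values on molecule counts would indicate compression.

```python
import math

def correlation_and_slope(counts_log, array_log):
    """Return (Pearson r, least-squares slope) of array_log regressed on counts_log.

    Both inputs are equal-length lists of log-scale expression values for one gene
    across samples (an assumption made for this illustration).
    """
    n = len(counts_log)
    if n != len(array_log) or n < 2:
        raise ValueError("need two equal-length lists with at least two values")
    mean_x = sum(counts_log) / n
    mean_y = sum(array_log) / n
    sxx = sum((x - mean_x) ** 2 for x in counts_log)
    syy = sum((y - mean_y) ** 2 for y in array_log)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(counts_log, array_log))
    if sxx == 0 or syy == 0:
        raise ValueError("constant input: correlation is undefined")
    r = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx  # slope < 1 means the array compresses the true range
    return r, slope
```

In this toy setting a gene can correlate perfectly across platforms (r = 1) while still showing a slope of 0.5, which is why correlation alone does not reveal compression.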
Missing value imputation for microarray data: a comprehensive comparison study and a web tool.
Chiu, Chia-Chun; Chan, Shih-Yao; Wang, Chung-Ching; Wu, Wei-Sheng
2013-01-01
Microarray data are usually peppered with missing values for various reasons. However, most downstream analyses of microarray data require complete datasets. Accurate algorithms for missing value estimation are therefore needed to improve the performance of microarray data analyses. Although many algorithms have been developed, there is much debate over the selection of the optimal algorithm. Studies comparing the performance of different algorithms are still not comprehensive, especially in the number of benchmark datasets used, the number of algorithms compared, the rounds of simulation conducted, and the performance measures used. In this paper, we performed a comprehensive comparison using (I) thirteen datasets, (II) nine algorithms, (III) 110 independent runs of simulation, and (IV) three types of measures to evaluate the performance of each imputation algorithm fairly. First, the effects of different types of microarray datasets on the performance of each imputation algorithm were evaluated. Second, we discussed whether datasets from different species have a different impact on the performance of different algorithms. To assess the performance of each algorithm fairly, all evaluations were performed using three types of measures. Our results indicate that the performance of an imputation algorithm depends mainly on the type of dataset rather than on the species from which the samples come. In addition to the statistical measure, two other measures with biological meaning are useful for reflecting the impact of missing value imputation on downstream data analyses. Our study suggests that local-least-squares-based methods are good choices for handling missing values in most microarray datasets. In this work, we carried out a comprehensive comparison of the algorithms for microarray missing value imputation. Based on such a comprehensive comparison, researchers could choose the optimal algorithm for their datasets easily. 
Moreover, new imputation algorithms could be compared with the existing algorithms using this comparison strategy as a standard protocol. In addition, to assist researchers in dealing with missing values easily, we built a web-based and easy-to-use imputation tool, MissVIA (http://cosbi.ee.ncku.edu.tw/MissVIA), which supports many imputation algorithms. Once users upload a real microarray dataset and choose the imputation algorithms, MissVIA will determine the optimal algorithm for the users' data through a series of simulations, and then the imputed results can be downloaded for the downstream data analyses.
PMID:24565220
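The simulation strategy described in this record (hide known values, impute them, score the result) can be sketched as follows. This is a minimal illustration under stated assumptions: a simple k-nearest-neighbour imputer stands in for the nine algorithms compared, and normalised root-mean-square error (NRMSE) stands in for the statistical measure, which the abstract does not name. The function names `mask_entries`, `knn_impute` and `nrmse` are hypothetical and are not part of MissVIA.

```python
import math

def mask_entries(complete, positions):
    """Copy a complete matrix (rows = genes) and set the given (row, col) cells to None."""
    matrix = [row[:] for row in complete]
    for i, j in positions:
        matrix[i][j] = None
    return matrix

def knn_impute(matrix, k=2):
    """Fill each None cell with the mean of that column in the k most similar rows.

    Similarity is the root-mean-square difference over columns observed in both rows.
    Cells with no usable neighbour are left as None.
    """
    filled = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(matrix):
                if m == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled

def nrmse(truth, imputed, positions):
    """Root-mean-square imputation error divided by the standard deviation of the true values."""
    true_vals = [truth[i][j] for i, j in positions]
    mean_true = sum(true_vals) / len(true_vals)
    variance = sum((v - mean_true) ** 2 for v in true_vals) / len(true_vals)
    if variance == 0:
        raise ValueError("masked true values are constant: NRMSE is undefined")
    mse = sum((imputed[i][j] - truth[i][j]) ** 2 for i, j in positions) / len(positions)
    return math.sqrt(mse / variance)
```

Repeating the mask, impute and score cycle over many random masks, and over several imputers, would give the kind of per-dataset ranking the abstract describes; lower NRMSE is better.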
Mukwaya, Anthony; Lindvall, Jessica M; Xeroudaki, Maria; Peebo, Beatrice; Ali, Zaheer; Lennikov, Anton; Jensen, Lasse Dahl Ejby; Lagali, Neil
2016-11-22
In angiogenesis with concurrent inflammation, many pathways are activated, some linked to VEGF and others largely VEGF-independent. Pathways involving inflammatory mediators, chemokines, and micro-RNAs may play important roles in maintaining a pro-angiogenic environment or mediating angiogenic regression. Here, we describe a gene expression dataset to facilitate exploration of pro-angiogenic, pro-inflammatory, and remodelling/normalization-associated genes during both an active capillary sprouting phase, and in the restoration of an avascular phenotype. The dataset was generated by microarray analysis of the whole transcriptome in a rat model of suture-induced inflammatory corneal neovascularisation. Regions of active capillary sprout growth or regression in the cornea were harvested and total RNA extracted from four biological replicates per group. High quality RNA was obtained for gene expression analysis using microarrays. Fold change of selected genes was validated by qPCR, and protein expression was evaluated by immunohistochemistry. We provide a gene expression dataset that may be re-used to investigate corneal neovascularisation, and may also have implications in other contexts of inflammation-mediated angiogenesis.
Leung, Yuk Yee; Chang, Chun Qi; Hung, Yeung Sam
2012-01-01
Using a hybrid approach for gene selection and classification is common, as the results obtained are generally better than those from performing the two tasks independently. Yet, for some microarray datasets, both the classification accuracy and the stability of the gene sets obtained still have room for improvement. This may be due to the presence of samples with wrong class labels (i.e. outliers). Outlier detection algorithms proposed so far are either not suitable for microarray data, or only solve the outlier detection problem on its own. We tackle the outlier detection problem based on a previously proposed Multiple-Filter-Multiple-Wrapper (MFMW) model, which was demonstrated to yield promising results when compared to other hybrid approaches (Leung and Hung, 2010). To incorporate outlier detection and overcome limitations of the existing MFMW model, three new features are introduced in our proposed MFMW-outlier approach: 1) an unbiased external Leave-One-Out Cross-Validation framework is developed to replace internal cross-validation in the previous MFMW model; 2) wrongly labeled samples are identified within the MFMW-outlier model; and 3) a stable set of genes is selected using an L1-norm SVM that removes any redundant genes present. Six binary-class microarray datasets were tested. Compared with outlier detection studies on the same datasets, MFMW-outlier could detect all the outliers found in the original paper (for which the data was provided for analysis), and the genes selected after outlier removal were shown to have biological relevance. We also compared MFMW-outlier with PRAPIV (Zhang et al., 2006) on the same synthetic datasets. MFMW-outlier gave better average precision and recall values in three different settings. Lastly, artificially flipped microarray datasets were created by removing our detected outliers and flipping some of the remaining samples' labels. 
Almost all the 'wrong' (artificially flipped) samples were detected, suggesting that MFMW-outlier was sufficiently powerful to detect outliers in high-dimensional microarray datasets.
2010-01-01
Background The development of DNA microarrays has facilitated the generation of hundreds of thousands of transcriptomic datasets. The use of a common reference microarray design allows existing transcriptomic data to be readily compared and re-analysed in the light of new data, and the combination of this design with large datasets is ideal for 'systems'-level analyses. One issue is that these datasets are typically collected over many years and may be heterogeneous in nature, containing different microarray file formats and gene array layouts, dye-swaps, and showing varying scales of log2-ratios of expression between microarrays. Excellent software exists for the normalisation and analysis of microarray data, but many data have yet to be analysed because existing methods struggle with heterogeneous datasets; options include normalising microarrays on an individual or experimental group basis. Our solution was to develop the Batch Anti-Banana Algorithm in R (BABAR), an algorithm and software package that uses cyclic loess to normalise across the complete dataset. We have already used BABAR to analyse the function of Salmonella genes involved in the process of infection of mammalian cells. Results The only input required by BABAR is unprocessed GenePix or BlueFuse microarray data files. BABAR provides a combination of 'within' and 'between' microarray normalisation steps and diagnostic boxplots. When applied to a real heterogeneous dataset, BABAR normalised the dataset to produce a comparable scaling between the microarrays, with the microarray data in excellent agreement with RT-PCR analysis. When applied to a real non-heterogeneous dataset and a simulated dataset, BABAR's performance in identifying differentially expressed genes showed some benefits over standard techniques. Conclusions BABAR is an easy-to-use software tool, simplifying the simultaneous normalisation of heterogeneous two-colour common reference design cDNA microarray-based transcriptomic datasets. 
We show BABAR transforms real and simulated datasets to allow for the correct interpretation of these data, and is the ideal tool to facilitate the identification of differentially expressed genes or network inference analysis from transcriptomic datasets. PMID:20128918
McArt, Darragh G.; Dunne, Philip D.; Blayney, Jaine K.; Salto-Tellez, Manuel; Van Schaeybroeck, Sandra; Hamilton, Peter W.; Zhang, Shu-Dong
2013-01-01
The advent of next generation sequencing (NGS) technologies has expanded the area of genomic research, offering high coverage and increased sensitivity over older microarray platforms. Although the current cost of next generation sequencing still exceeds that of microarray approaches, the rapid advances in NGS will likely make it the platform of choice for future research in differential gene expression. Connectivity mapping is a procedure for examining the connections among diseases, genes and drugs by differential gene expression. It was initially based on microarray technology, with which a large collection of compound-induced reference gene expression profiles has been accumulated. In this work, we aim to test the feasibility of incorporating NGS RNA-Seq data into the current connectivity mapping framework by utilizing the microarray-based reference profiles and constructing a differentially expressed gene signature from an NGS dataset. This would allow connections to be established between the NGS gene signature and the microarray reference profiles, avoiding the cost of re-creating drug profiles with NGS technology. We examined the connectivity mapping approach on a publicly available NGS dataset with androgen stimulation of LNCaP cells in order to extract candidate compounds that could inhibit the proliferative phenotype of LNCaP cells and to elucidate their potential in a laboratory setting. In addition, we analyzed an independent microarray dataset with similar experimental settings. We found a high level of concordance between the top compounds identified using the gene signatures from the two datasets. The nicotine derivative cotinine was returned as the top candidate among the overlapping compounds with potential to suppress this proliferative phenotype. Subsequent lab experiments validated this connectivity mapping hit, showing that cotinine inhibits cell proliferation in an androgen-dependent manner. 
Thus the results in this study suggest a promising prospect of integrating NGS data with connectivity mapping. PMID:23840550
Chowdhury, Nilotpal; Sapru, Shantanu
2015-01-01
Microarray analysis has revolutionized the role of genomic prognostication in breast cancer. However, most studies are single series studies, and suffer from methodological problems. We sought to use a meta-analytic approach in combining multiple publicly available datasets, while correcting for batch effects, to reach a more robust oncogenomic analysis. The aim of the present study was to find gene sets associated with distant metastasis free survival (DMFS) in systemically untreated, node-negative breast cancer patients, from publicly available genomic microarray datasets. Four microarray series (having 742 patients) were selected after a systematic search and combined. Cox regression for each gene was done for the combined dataset (univariate, as well as multivariate - adjusted for expression of Cell cycle related genes) and for the 4 major molecular subtypes. The centre and microarray batch effects were adjusted by including them as random effects variables. The Cox regression coefficients for each analysis were then ranked and subjected to a Gene Set Enrichment Analysis (GSEA). Gene sets representing protein translation were independently negatively associated with metastasis in the Luminal A and Luminal B subtypes, but positively associated with metastasis in Basal tumors. Proteinaceous extracellular matrix (ECM) gene set expression was positively associated with metastasis, after adjustment for expression of cell cycle related genes on the combined dataset. Finally, the positive association of the proliferation-related genes with metastases was confirmed. To the best of our knowledge, the results depicting mixed prognostic significance of protein translation in breast cancer subtypes are being reported for the first time. We attribute this to our study combining multiple series and performing a more robust meta-analytic Cox regression modeling on the combined dataset, thus discovering 'hidden' associations. 
This methodology seems to yield new and interesting results and may be used as a tool to guide new research.
PMID:26080057
Yılmaz Isıkhan, Selen; Karabulut, Erdem; Alpar, Celal Reha
2016-01-01
Background/Aim. Evaluating the success of dose prediction based on genetic or clinical data has substantially advanced recently. The aim of this study is to predict various clinical dose values from DNA gene expression datasets using data mining techniques. Materials and Methods. Eleven real gene expression datasets containing dose values were included. First, important genes for dose prediction were selected using iterative sure independence screening. Then, the performances of regression trees (RTs), support vector regression (SVR), RT bagging, SVR bagging, and RT boosting were examined. Results. The results demonstrated that a regression-based feature selection method substantially reduced the number of irrelevant genes from raw datasets. Overall, the best prediction performance in nine of 11 datasets was achieved using SVR; the second most accurate performance was provided using a gradient-boosting machine (GBM). Conclusion. Analysis of various dose values based on microarray gene expression data identified common genes found in our study and the referenced studies. According to our findings, SVR and GBM can be good predictors of dose-gene datasets. Another result of the study was to identify the sample size of n = 25 as a cutoff point for RT bagging to outperform a single RT.
2013-01-01
Background Differential diagnosis between malignant follicular thyroid cancer (FTC) and benign follicular thyroid adenoma (FTA) is a great challenge for even an experienced pathologist and requires special effort. Molecular markers may potentially support a differential diagnosis between FTC and FTA in postoperative specimens. The purpose of this study was to derive molecular support for differential post-operative diagnosis, in the form of a simple multigene mRNA-based classifier that would differentiate between FTC and FTA tissue samples. Methods A molecular classifier was created based on a combined analysis of two microarray datasets (using 66 thyroid samples). The performance of the classifier was assessed using an independent dataset comprising 71 formalin-fixed paraffin-embedded (FFPE) samples (31 FTC and 40 FTA), which were analysed by quantitative real-time PCR (qPCR). In addition, three other microarray datasets (62 samples) were used to confirm the utility of the classifier. Results Five of 8 genes selected from training datasets (ELMO1, EMCN, ITIH5, KCNAB1, SLCO2A1) were amplified by qPCR in FFPE material from an independent sample set. Three other genes did not amplify in FFPE material, probably due to low abundance. All 5 analysed genes were downregulated in FTC compared to FTA. The sensitivity and specificity of the 5-gene classifier tested on the FFPE dataset were 71% and 72%, respectively. Conclusions The proposed approach could support histopathological examination: 5-gene classifier may aid in molecular discrimination between FTC and FTA in FFPE material. PMID:24099521
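For readers less familiar with the two performance figures reported for the classifier, a minimal sketch of how sensitivity and specificity are computed from labelled predictions follows. It assumes FTC is treated as the positive class; the function name `sensitivity_specificity` and the toy labels in the usage note are hypothetical and are not data from the study.

```python
def sensitivity_specificity(true_labels, predicted_labels, positive="FTC"):
    """Return (sensitivity, specificity) with `positive` treated as the positive class.

    sensitivity = TP / (TP + FN): fraction of true positives correctly called.
    specificity = TN / (TN + FP): fraction of true negatives correctly called.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    tp = fn = tn = fp = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == positive:
            if predicted == positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted == positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in true_labels")
    return tp / (tp + fn), tn / (tn + fp)
```

With toy labels `["FTC", "FTC", "FTA", "FTA"]` and predictions `["FTC", "FTA", "FTA", "FTA"]`, the function returns a sensitivity of 0.5 and a specificity of 1.0.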
Integrative missing value estimation for microarray data.
Hu, Jianjun; Li, Haifeng; Waterman, Michael S; Zhou, Xianghong Jasmine
2006-10-12
Missing value estimation is an important preprocessing step in microarray analysis. Although several methods have been developed to solve this problem, their performance is unsatisfactory for datasets with high rates of missing data, high measurement noise, or limited numbers of samples. In fact, more than 80% of the time-series datasets in the Stanford Microarray Database contain fewer than eight samples. We present the integrative Missing Value Estimation method (iMISS), which incorporates information from multiple reference microarray datasets to improve missing value estimation. For each gene with missing data, we derive a consistent neighbor-gene list by taking reference datasets into consideration. To determine whether the given reference datasets are sufficiently informative for integration, we use a submatrix imputation approach. Our experiments showed that iMISS can significantly and consistently improve the accuracy of the state-of-the-art Local Least Square (LLS) imputation algorithm, by up to 15% in our benchmark tests. We demonstrated that the order-statistics-based integrative imputation algorithms can achieve significant improvements over state-of-the-art missing value estimation approaches such as LLS and are especially good for imputing microarray datasets with a limited number of samples, high rates of missing data, or very noisy measurements. With the rapid accumulation of microarray datasets, the performance of our approach can be further improved by incorporating larger and more appropriate reference datasets.
DigOut: viewing differential expression genes as outliers.
Yu, Hui; Tu, Kang; Xie, Lu; Li, Yuan-Yuan
2010-12-01
With regard to well-replicated two-condition microarray datasets, the selection of differentially expressed (DE) genes is a well-studied computational topic, but for multi-condition microarray datasets with limited or no replication, the same task has not been properly addressed by previous studies. This paper adopts multivariate outlier analysis to analyze replication-lacking multi-condition microarray datasets, finding that it performs significantly better than the widely used limit fold change (LFC) model in a simulated comparative experiment. Compared with the LFC model, the multivariate outlier analysis also demonstrates improved stability against sample variations in a series of manipulated real expression datasets. The reanalysis of a real non-replicated multi-condition expression dataset series leads to satisfactory results. In conclusion, a multivariate outlier analysis algorithm, like DigOut, is particularly useful for selecting DE genes from non-replicated multi-condition gene expression datasets.
Chockalingam, Sriram; Aluru, Maneesha; Aluru, Srinivas
2016-09-19
Pre-processing of microarray data is a well-studied problem. Furthermore, all popular platforms come with their own recommended best practices for differential analysis of genes. However, for genome-scale network inference using microarray data collected from large public repositories, these methods filter out a considerable number of genes. This is primarily due to the effects of aggregating a diverse array of experiments with different technical and biological scenarios. Here we introduce a pre-processing pipeline suitable for inferring genome-scale gene networks from large microarray datasets. We show that partitioning of the available microarray datasets according to biological relevance into tissue- and process-specific categories significantly extends the limits of downstream network construction. We demonstrate the effectiveness of our pre-processing pipeline by inferring genome-scale networks for the model plant Arabidopsis thaliana using two different construction methods and a collection of 11,760 Affymetrix ATH1 microarray chips. Our pre-processing pipeline and the datasets used in this paper are made available at http://alurulab.cc.gatech.edu/microarray-pp.
ERIC Educational Resources Information Center
Grenville-Briggs, Laura J.; Stansfield, Ian
2011-01-01
This report describes a linked series of Masters-level computer practical workshops. They comprise an advanced functional genomics investigation, based upon analysis of a microarray dataset probing yeast DNA damage responses. The workshops require the students to analyse highly complex transcriptomics datasets, and were designed to stimulate…
MAAMD: a workflow to standardize meta-analyses and comparison of affymetrix microarray data
2014-01-01
Background Mandatory deposit of raw microarray data files for public access, prior to study publication, provides significant opportunities to conduct new bioinformatics analyses within and across multiple datasets. Analysis of raw microarray data files (e.g. Affymetrix CEL files) can be time consuming and complex, and requires fundamental computational and bioinformatics skills. The development of analytical workflows to automate these tasks simplifies processing, improves efficiency, and serves to standardize multiple and sequential analyses. Once installed, workflows facilitate the tedious steps required to run rapid intra- and inter-dataset comparisons. Results We developed a workflow to facilitate and standardize Meta-Analysis of Affymetrix Microarray Data analysis (MAAMD) in Kepler. Two freely available stand-alone software tools, R and AltAnalyze, were embedded in MAAMD. The inputs of MAAMD are user-editable csv files, which contain sample information and parameters describing the locations of input files and required tools. MAAMD was tested by analyzing 4 different GEO datasets from mice and drosophila. MAAMD automates data downloading, data organization, data quality control assessment, differential gene expression analysis, clustering analysis, pathway visualization, gene-set enrichment analysis, and cross-species orthologous-gene comparisons. MAAMD was utilized to identify gene orthologues responding to hypoxia or hyperoxia in both mice and drosophila. The entire set of analyses for 4 datasets (34 total microarrays) finished in about one hour. Conclusions MAAMD saves time, minimizes the required computer skills, and offers a standardized procedure for users to analyze microarray datasets and make new intra- and inter-dataset comparisons. PMID:24621103
Identification of consensus biomarkers for predicting non-genotoxic hepatocarcinogens
Huang, Shan-Han; Tung, Chun-Wei
2017-01-01
The assessment of non-genotoxic hepatocarcinogens (NGHCs) currently relies on two-year rodent bioassays. Toxicogenomics biomarkers provide a potential alternative method for the prioritization of NGHCs that could be useful for risk assessment. However, previous studies, which used inconsistently classified chemicals as the training set and a single microarray dataset, found no consensus biomarkers. In this study, 4 consensus biomarkers (A2m, Ca3, Cxcl1, and Cyp8b1) were identified from four large-scale microarray datasets of the one-day single maximum tolerated dose and a large set of chemicals without inconsistent classifications. Machine learning techniques were subsequently applied to develop prediction models for NGHCs. The final bagging decision tree models were constructed with an average AUC performance of 0.803 for an independent test. A set of 16 chemicals with controversial classifications were reclassified according to the consensus biomarkers. The developed prediction models and identified consensus biomarkers are expected to be potential alternative methods for prioritization of NGHCs for further experimental validation. PMID:28117354
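The AUC figure quoted for the prediction models can be read as the probability that a randomly chosen positive (here, an NGHC) receives a higher score than a randomly chosen negative. A minimal sketch of that rank-based definition follows; the function name `roc_auc` and the toy scores in the usage note are hypothetical, and this is not the authors' evaluation code.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: 1 for the positive class, 0 for the negative class.
    scores: predicted scores, higher meaning more likely positive.
    Ties between a positive and a negative count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

For labels `[1, 1, 0, 0]` with scores `[0.9, 0.4, 0.5, 0.1]`, three of the four positive-negative pairs are ordered correctly, so the function returns 0.75.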
Chondrocyte channel transcriptomics
Lewis, Rebecca; May, Hannah; Mobasheri, Ali; Barrett-Jolley, Richard
2013-01-01
To date, a range of ion channels have been identified in chondrocytes using a number of different techniques, predominantly electrophysiological and/or biomolecular; each of these has its advantages and disadvantages. Here we aim to compare and contrast the data available from biophysical and microarray experiments. This letter analyses recent transcriptomics datasets from chondrocytes, accessible from the European Bioinformatics Institute (EBI). We discuss whether such bioinformatic analysis of microarray datasets can potentially accelerate identification and discovery of ion channels in chondrocytes. The ion channels which appear most frequently across these microarray datasets are discussed, along with their possible functions. We discuss whether functional or protein data exist which support the microarray data. A microarray experiment comparing gene expression in osteoarthritis and healthy cartilage is also discussed and we verify the differential expression of 2 of these genes, namely the genes encoding large calcium-activated potassium (BK) and aquaporin channels. PMID:23995703
Giancarlo, Raffaele; Scaturro, Davide; Utro, Filippo
2008-10-29
Inferring cluster structure in microarray datasets is a fundamental task for the so-called -omic sciences. It is also a fundamental question in Statistics, Data Analysis and Classification, in particular with regard to the prediction of the number of clusters in a dataset, usually established via internal validation measures. Despite the wealth of internal measures available in the literature, new ones have been recently proposed, some of them specifically for microarray data. We consider five such measures: Clest, Consensus (Consensus Clustering), FOM (Figure of Merit), Gap (Gap Statistics) and ME (Model Explorer), in addition to the classic WCSS (Within Cluster Sum-of-Squares) and KL (Krzanowski and Lai index). We perform extensive experiments on six benchmark microarray datasets, using both Hierarchical and K-means clustering algorithms, and we provide an analysis assessing both the intrinsic ability of a measure to predict the correct number of clusters in a dataset and its merit relative to the other measures. We pay particular attention both to precision and speed. Moreover, we also provide various fast approximation algorithms for the computation of Gap, FOM and WCSS. The main result is a hierarchy of those measures in terms of precision and speed, highlighting some of their merits and limitations not reported before in the literature. Based on our analysis, we draw several conclusions for the use of those internal measures on microarray data. We report the main ones. Consensus is by far the best performer in terms of predictive power and remarkably algorithm-independent. Unfortunately, on large datasets, it may be of no use because of its non-trivial computer time demand (weeks on a state-of-the-art PC). FOM is the second best performer although, quite surprisingly, it may not be competitive in this scenario: it has essentially the same predictive power as WCSS but is from 6 to 100 times slower, depending on the dataset.
The approximation algorithms for the computation of FOM, Gap and WCSS perform very well, i.e., they are faster while still granting a very close approximation of FOM and WCSS. The approximation algorithm for the computation of Gap deserves to be singled out since it has far better predictive power than Gap and is competitive with the other measures, yet it is at least two orders of magnitude faster than Gap. Another important novel conclusion that can be drawn from our analysis is that all the measures we have considered show severe limitations on large datasets, either due to computational demand (Consensus, as already mentioned, Clest and Gap) or to lack of precision (all of the other measures, including their approximations). The software and datasets are available under the GNU GPL on the supplementary material web page.
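As a point of reference for the WCSS measure named in the abstract above, the following is a minimal sketch of how a within-cluster sum of squares can be computed for a given clustering. The function name `wcss` and the toy data are illustrative assumptions and are not taken from the authors' software; the sketch shows only the basic quantity, not the fast approximation algorithms the abstract describes.

```python
def wcss(points, labels):
    """Within-cluster sum of squares: total squared Euclidean distance
    of every point to the centroid of the cluster it is assigned to."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        # Centroid is the coordinate-wise mean of the cluster members.
        centroid = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        total += sum((m[d] - centroid[d]) ** 2 for m in members for d in range(dim))
    return total


# Toy data (hypothetical): two well-separated groups of two points each.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
one_cluster = wcss(points, [0, 0, 0, 0])   # everything in one cluster
two_clusters = wcss(points, [0, 0, 1, 1])  # the natural two-group split
print(one_cluster, two_clusters)
```

In this kind of analysis the WCSS is typically evaluated for an increasing number of clusters, and the point where the decrease levels off is taken as a hint at the number of clusters.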
Wolff, Alexander; Bayerlová, Michaela; Gaedcke, Jochen; Kube, Dieter; Beißbarth, Tim
2018-01-01
Pipeline comparisons for gene expression data are highly valuable for applied real data analyses, as they enable the selection of suitable analysis strategies for the dataset at hand. Such pipelines for RNA-Seq data should include mapping of reads, counting and differential gene expression analysis or preprocessing, normalization and differential gene expression in case of microarray analysis, in order to give a global insight into pipeline performances. Four commonly used RNA-Seq pipelines (STAR/HTSeq-Count/edgeR, STAR/RSEM/edgeR, Sailfish/edgeR, TopHat2/Cufflinks/CuffDiff) were investigated on multiple levels (alignment and counting) and cross-compared with the microarray counterpart on the level of gene expression and gene ontology enrichment. For these comparisons we generated two matched microarray and RNA-Seq datasets: Burkitt Lymphoma cell line data and rectal cancer patient data. The overall mapping rate of STAR was 98.98% for the cell line dataset and 98.49% for the patient dataset. TopHat2's overall mapping rate was 97.02% and 96.73%, respectively, while Sailfish had an overall mapping rate of only 84.81% and 54.44%. The correlation of gene expression in microarray and RNA-Seq data was moderately worse for the patient dataset (ρ = 0.67-0.69) than for the cell line dataset (ρ = 0.87-0.88). The exception was Cufflinks, whose correlation results were substantially lower (ρ = 0.21-0.29 and 0.34-0.53). For both datasets we identified very low numbers of differentially expressed genes using the microarray platform. For RNA-Seq we checked the agreement of differentially expressed genes identified in the different pipelines and of GO-term enrichment results. In conclusion, the combination of STAR aligner with HTSeq-Count, followed by STAR aligner with RSEM and Sailfish, generated differentially expressed genes best suited for the dataset at hand and in agreement with most of the other transcriptomics pipelines.
Grenville-Briggs, Laura J; Stansfield, Ian
2011-01-01
This report describes a linked series of Masters-level computer practical workshops. They comprise an advanced functional genomics investigation, based upon analysis of a microarray dataset probing yeast DNA damage responses. The workshops require the students to analyse highly complex transcriptomics datasets, and were designed to stimulate active learning through experience of current research methods in bioinformatics and functional genomics. They seek to closely mimic a realistic research environment, and require the students first to propose research hypotheses, then test those hypotheses using specific sections of the microarray dataset. The complexity of the microarray data provides students with the freedom to propose their own unique hypotheses, tested using appropriate sections of the microarray data. This research latitude was highly regarded by students and is a strength of this practical. In addition, the focus on DNA damage by radiation and mutagenic chemicals allows them to place their results in a human medical context, and successfully sparks broad interest in the subject material. In evaluation, 79% of students scored the practical workshops on a five-point scale as 4 or 5 (totally effective) for student learning. More broadly, the general use of microarray data as a "student research playground" is also discussed. Copyright © 2011 Wiley Periodicals, Inc.
Validation of MIMGO: a method to identify differentially expressed GO terms in a microarray dataset
2012-01-01
Background We previously proposed an algorithm for the identification of GO terms that commonly annotate genes whose expression is upregulated or downregulated in some microarray data compared with in other microarray data. We call these “differentially expressed GO terms” and have named the algorithm “matrix-assisted identification method of differentially expressed GO terms” (MIMGO). MIMGO can also identify microarray data in which genes annotated with a differentially expressed GO term are upregulated or downregulated. However, MIMGO has not yet been validated on a real microarray dataset using all available GO terms. Findings We combined Gene Set Enrichment Analysis (GSEA) with MIMGO to identify differentially expressed GO terms in a yeast cell cycle microarray dataset. GSEA followed by MIMGO (GSEA + MIMGO) correctly identified (p < 0.05) microarray data in which genes annotated to differentially expressed GO terms are upregulated. We found that GSEA + MIMGO was slightly less effective than, or comparable to, GSEA (Pearson), a method that uses Pearson’s correlation as a metric, at detecting true differentially expressed GO terms. However, unlike other methods including GSEA (Pearson), GSEA + MIMGO can comprehensively identify the microarray data in which genes annotated with a differentially expressed GO term are upregulated or downregulated. Conclusions MIMGO is a reliable method to identify differentially expressed GO terms comprehensively. PMID:23232071
USDA-ARS's Scientific Manuscript database
Phenotype microarrays were analyzed for 51 datasets derived from Salmonella enterica. The top 4 serovars associated with poultry products and one associated with turkey, respectively Typhimurium, Enteritidis, Heidelberg, Infantis and Senftenberg, were represented. Datasets were clustered into two ...
Molloy, Timothy J.; Roepman, Paul; Naume, Bjørn; van't Veer, Laura J.
2012-01-01
The detection of circulating tumor cells (CTCs) in the peripheral blood and microarray gene expression profiling of the primary tumor are two promising new technologies able to provide valuable prognostic data for patients with breast cancer. Meta-analyses of several established prognostic breast cancer gene expression profiles in large patient cohorts have demonstrated that despite sharing few genes, their delineation of patients into “good prognosis” or “poor prognosis” are frequently very highly correlated, and combining prognostic profiles does not increase prognostic power. In the current study, we aimed to develop a novel profile which provided independent prognostic data by building a signature predictive of CTC status rather than outcome. Microarray gene expression data from an initial training cohort of 72 breast cancer patients for which CTC status had been determined in a previous study using a multimarker QPCR-based assay was used to develop a CTC-predictive profile. The generated profile was validated in two independent datasets of 49 and 123 patients and confirmed to be both predictive of CTC status, and independently prognostic. Importantly, the “CTC profile” also provided prognostic information independent of the well-established and powerful ‘70-gene’ prognostic breast cancer signature. This profile therefore has the potential to not only add prognostic information to currently-available microarray tests but in some circumstances even replace blood-based prognostic CTC tests at time of diagnosis for those patients already undergoing testing by multigene assays. PMID:22384245
Ooi, Chia Huey; Chetty, Madhu; Teng, Shyh Wei
2006-06-23
Due to the large number of genes in a typical microarray dataset, feature selection looks set to play an important role in reducing noise and computational cost in gene expression-based tissue classification while improving accuracy at the same time. Surprisingly, this does not appear to be the case for all multiclass microarray datasets. The reason is that many feature selection techniques applied on microarray datasets are either rank-based and hence do not take into account correlations between genes, or are wrapper-based, which require high computational cost, and often yield difficult-to-reproduce results. In studies where correlations between genes are considered, attempts to establish the merit of the proposed techniques are hampered by evaluation procedures which are less than meticulous, resulting in overly optimistic estimates of accuracy. We present two realistically evaluated correlation-based feature selection techniques which incorporate, in addition to the two existing criteria involved in forming a predictor set (relevance and redundancy), a third criterion called the degree of differential prioritization (DDP). DDP functions as a parameter to strike the balance between relevance and redundancy, providing our techniques with the novel ability to differentially prioritize the optimization of relevance against redundancy (and vice versa). This ability proves useful in producing optimal classification accuracy while using reasonably small predictor set sizes for nine well-known multiclass microarray datasets. For multiclass microarray datasets, especially the GCM and NCI60 datasets, DDP enables our filter-based techniques to produce accuracies better than those reported in previous studies which employed similarly realistic evaluation procedures.
Bhanot, Gyan; Alexe, Gabriela; Levine, Arnold J; Stolovitzky, Gustavo
2005-01-01
A major challenge in cancer diagnosis from microarray data is the need for robust, accurate classification models which are independent of the analysis techniques used and can combine data from different laboratories. We propose such a classification scheme originally developed for phenotype identification from mass spectrometry data. The method uses a robust multivariate gene selection procedure and combines the results of several machine learning tools trained on raw and pattern data to produce an accurate meta-classifier. We illustrate and validate our method by applying it to gene expression datasets: the oligonucleotide HuGeneFL microarray dataset of Shipp et al. (www.genome.wi.mit.du/MPR/lymphoma) and the Hu95Av2 Affymetrix dataset (DallaFavera's laboratory, Columbia University). Our pattern-based meta-classification technique achieves higher predictive accuracies than each of the individual classifiers, is robust against data perturbations and provides subsets of related predictive genes. Our techniques predict that combinations of some genes in the p53 pathway are highly predictive of phenotype. In particular, we find that in 80% of DLBCL cases the mRNA level of at least one of the three genes p53, PLK1 and CDK2 is elevated, while in 80% of FL cases, the mRNA level of at most one of them is elevated.
A database for the analysis of immunity genes in Drosophila: PADMA database.
Lee, Mark J; Mondal, Ariful; Small, Chiyedza; Paddibhatla, Indira; Kawaguchi, Akira; Govind, Shubha
2011-01-01
While microarray experiments generate voluminous data, discerning trends that support an existing or alternative paradigm is challenging. To synergize hypothesis building and testing, we designed the Pathogen Associated Drosophila MicroArray (PADMA) database for easy retrieval and comparison of microarray results from immunity-related experiments (www.padmadatabase.org). PADMA also allows biologists to upload their microarray results and compare them with datasets housed within PADMA. We tested PADMA using a preliminary dataset from Ganaspis xanthopoda-infected fly larvae, and uncovered unexpected trends in gene expression, reshaping our hypothesis. Thus, the PADMA database will be a useful resource for fly researchers to evaluate, revise, and refine hypotheses.
Bessonov, Kyrylo; Walkey, Christopher J.; Shelp, Barry J.; van Vuuren, Hennie J. J.; Chiu, David; van der Merwe, George
2013-01-01
Analyzing time-course expression data captured in microarray datasets is a complex undertaking as the vast and complex data space is represented by a relatively low number of samples as compared to thousands of available genes. Here, we developed the Interdependent Correlation Clustering (ICC) method to analyze relationships that exist among genes conditioned on the expression of a specific target gene in microarray data. Based on Correlation Clustering, the ICC method analyzes a large set of correlation values related to gene expression profiles extracted from given microarray datasets. ICC can be applied to any microarray dataset and any target gene. We applied this method to microarray data generated from wine fermentations and selected NSF1, which encodes a C2H2 zinc finger-type transcription factor, as the target gene. The validity of the method was verified by accurate identifications of the previously known functional roles of NSF1. In addition, we identified and verified potential new functions for this gene; specifically, NSF1 is a negative regulator for the expression of sulfur metabolism genes, the nuclear localization of Nsf1 protein (Nsf1p) is controlled in a sulfur-dependent manner, and the transcription of NSF1 is regulated by Met4p, an important transcriptional activator of sulfur metabolism genes. The inter-disciplinary approach adopted here highlighted the accuracy and relevancy of the ICC method in mining for novel gene functions using complex microarray datasets with a limited number of samples. PMID:24130853
Shrinkage regression-based methods for microarray missing value imputation.
Wang, Hsiuying; Chiu, Chia-Chun; Wu, Yi-Ching; Wu, Wei-Sheng
2013-01-01
Missing values commonly occur in microarray data, which usually contain more than 5% missing values with up to 90% of genes affected. Inaccurate missing value estimation reduces the power of downstream microarray data analyses. Many types of methods have been developed to estimate missing values. Among them, the regression-based methods are very popular and have been shown to perform better than the other types of methods in many testing microarray datasets. To further improve the performance of the regression-based methods, we propose shrinkage regression-based methods. Our methods take advantage of the correlation structure in the microarray data and select similar genes for the target gene by Pearson correlation coefficients. In addition, our methods incorporate the least squares principle, utilize a shrinkage estimation approach to adjust the coefficients of the regression model, and then use the new coefficients to estimate missing values. Simulation results show that the proposed methods provide more accurate missing value estimation in six testing microarray datasets than the existing regression-based methods do. Imputation of missing values is a very important aspect of microarray data analyses because most of the downstream analyses require a complete dataset. Therefore, exploring accurate and efficient methods for estimating missing values has become an essential issue. Since our proposed shrinkage regression-based methods can provide accurate missing value estimation, they are competitive alternatives to the existing regression-based methods.
Yeh, Hsiang-Yuan; Cheng, Shih-Wu; Lin, Yu-Chun; Yeh, Cheng-Yu; Lin, Shih-Fang; Soo, Von-Wun
2009-12-21
Prostate cancer is a leading cancer worldwide and is characterized by its aggressive metastasis. Owing to its clinical heterogeneity, prostate cancer displays different stages and grades related to the aggressive metastatic disease. Although numerous studies have used microarray analysis and traditional clustering methods to identify the individual genes involved in the disease processes, the important gene regulations remain unclear. We present a computational method for automatically inferring genetic regulatory networks from microarray data with transcription factor analysis and conditional independence testing to explore the potentially significant gene regulatory networks that are correlated with cancer, tumor grade and stage in prostate cancer. To deal with missing values in microarray data, we used a K-nearest-neighbors (KNN) algorithm to determine the precise expression values. We applied web services technology to wrap the bioinformatics toolkits and databases to automatically extract the promoter regions of DNA sequences and predicted the transcription factors that regulate the gene expressions. We adopted microarray datasets consisting of 62 primary tumors and 41 normal prostate tissues from the Stanford Microarray Database (SMD) as a target dataset to evaluate our method. The predicted results identified possible biomarker genes related to cancer and indicated that androgen functions and processes may be involved in the development of prostate cancer and promote cell death in the cell cycle. Our predicted results showed that sub-networks of genes SREBF1, STAT6 and PBX1 are strongly related to a high extent while ETS transcription factors ELK1, JUN and EGR2 are related to a low extent. Gene SLC22A3 may explain clinically the differentiation associated with the high grade cancer compared with low grade cancer. Enhancer of Zeste Homolog 2 (EZH2) regulated by RUNX1 and STAT3 is correlated to the pathological stage.
We provide a computational framework to reconstruct the genetic regulatory network from the microarray data using biological knowledge and constraint-based inferences. Our method is helpful in verifying possible interaction relations in gene regulatory networks and filtering out incorrect relations inferred by imperfect methods. We not only predicted individual genes related to cancer but also discovered significant gene regulation networks. Our method is also validated in several enriched published papers and databases and the significant gene regulatory networks perform critical biological functions and processes including cell adhesion molecules, androgen and estrogen metabolism, smooth muscle contraction, and GO-annotated processes. Those significant gene regulations and the critical concept of tumor progression are useful to understand cancer biology and disease treatment.
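The abstract above mentions a K-nearest-neighbors (KNN) step for missing expression values without giving details. The sketch below shows one common variant of KNN imputation, assuming Euclidean distance over the observed columns and an unweighted mean over the k nearest complete rows. These choices, the function name `knn_impute` and the toy matrix are assumptions for illustration and may differ from the authors' procedure.

```python
import math


def knn_impute(matrix, k):
    """Fill None entries in each row with the mean of that column over the
    k nearest rows that have no missing values (Euclidean distance on the
    columns observed in the incomplete row)."""
    complete = [row for row in matrix if None not in row]
    filled = []
    for row in matrix:
        if None not in row:
            filled.append(list(row))
            continue
        observed = [j for j, value in enumerate(row) if value is not None]

        def distance(other):
            # Only columns observed in the incomplete row contribute.
            return math.sqrt(sum((row[j] - other[j]) ** 2 for j in observed))

        neighbors = sorted(complete, key=distance)[:k]
        filled.append([
            value if value is not None
            else sum(nb[j] for nb in neighbors) / len(neighbors)
            for j, value in enumerate(row)
        ])
    return filled


# Toy expression matrix (hypothetical): rows are genes, columns are samples.
expr = [
    [1.0, 2.0, 3.0],
    [1.0, 2.0, 5.0],
    [9.0, 9.0, 9.0],
    [1.0, 2.0, None],  # missing value to be estimated
]
imputed = knn_impute(expr, k=2)
print(imputed[3])
```

The sketch assumes at least k rows without missing values exist; real datasets usually need a fallback for genes with too few complete neighbors.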
The Importance of Normalization on Large and Heterogeneous Microarray Datasets
DNA microarray technology is a powerful functional genomics tool increasingly used for investigating global gene expression in environmental studies. Microarrays can also be used in identifying biological networks, as they give insight on the complex gene-to-gene interactions, ne...
GeneXplorer: an interactive web application for microarray data visualization and analysis.
Rees, Christian A; Demeter, Janos; Matese, John C; Botstein, David; Sherlock, Gavin
2004-10-01
When publishing large-scale microarray datasets, it is of great value to create supplemental websites where either the full data, or selected subsets corresponding to figures within the paper, can be browsed. We set out to create a CGI application containing many of the features of some of the existing standalone software for the visualization of clustered microarray data. We present GeneXplorer, a web application for interactive microarray data visualization and analysis in a web environment. GeneXplorer allows users to browse a microarray dataset in an intuitive fashion. It provides simple access to microarray data over the Internet and uses only HTML and JavaScript to display graphic and annotation information. It provides radar and zoom views of the data, allows display of the nearest neighbors to a gene expression vector based on their Pearson correlations and provides the ability to search gene annotation fields. The software is released under the permissive MIT Open Source license, and the complete documentation and the entire source code are freely available for download from CPAN http://search.cpan.org/dist/Microarray-GeneXplorer/.
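The nearest-neighbor display described above ranks genes by the Pearson correlation of their expression vectors with a query gene. A minimal sketch of that ranking follows. GeneXplorer itself is distributed via CPAN, so this Python version is only an illustration; the names `pearson` and `nearest_neighbors` and the toy profiles are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length vectors.
    Assumes neither vector is constant (non-zero variance)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def nearest_neighbors(query, profiles, k):
    """Return the names of the k genes most positively correlated with the query gene."""
    scored = [
        (pearson(profiles[query], vector), name)
        for name, vector in profiles.items()
        if name != query
    ]
    scored.sort(reverse=True)  # highest correlation first
    return [name for _, name in scored[:k]]


# Toy expression profiles (hypothetical gene names and values).
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],  # nearly proportional to geneA
    "geneC": [4.0, 3.0, 2.0, 1.0],  # anti-correlated with geneA
    "geneD": [1.0, 3.0, 2.0, 4.0],  # moderately correlated with geneA
}
neighbors = nearest_neighbors("geneA", profiles, 2)
print(neighbors)
```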
CLIC, a tool for expanding biological pathways based on co-expression across thousands of datasets
Li, Yang; Liu, Jun S.; Mootha, Vamsi K.
2017-01-01
In recent years, there has been a huge rise in the number of publicly available transcriptional profiling datasets. These massive compendia comprise billions of measurements and provide a special opportunity to predict the function of unstudied genes based on co-expression to well-studied pathways. Such analyses can be very challenging, however, since biological pathways are modular and may exhibit co-expression only in specific contexts. To overcome these challenges we introduce CLIC, CLustering by Inferred Co-expression. CLIC accepts as input a pathway consisting of two or more genes. It then uses a Bayesian partition model to simultaneously partition the input gene set into coherent co-expressed modules (CEMs), while assigning the posterior probability for each dataset in support of each CEM. CLIC then expands each CEM by scanning the transcriptome for additional co-expressed genes, quantified by an integrated log-likelihood ratio (LLR) score weighted for each dataset. As a byproduct, CLIC automatically learns the conditions (datasets) within which a CEM is operative. We implemented CLIC using a compendium of 1774 mouse microarray datasets (28628 microarrays) or 1887 human microarray datasets (45158 microarrays). CLIC analysis reveals that of 910 canonical biological pathways, 30% consist of strongly co-expressed gene modules for which new members are predicted. For example, CLIC predicts a functional connection between protein C7orf55 (FMC1) and the mitochondrial ATP synthase complex that we have experimentally validated. CLIC is freely available at www.gene-clic.org. We anticipate that CLIC will be valuable both for revealing new components of biological pathways as well as the conditions in which they are active. PMID:28719601
Prediction of clinical behaviour and treatment for cancers.
Futschik, Matthias E; Sullivan, Mike; Reeve, Anthony; Kasabov, Nikola
2003-01-01
Prediction of clinical behaviour and treatment for cancers is based on the integration of clinical and pathological parameters. Recent reports have demonstrated that gene expression profiling provides a powerful new approach for determining disease outcome. If clinical and microarray data each contain independent information then it should be possible to combine these datasets to gain more accurate prognostic information. Here, we have used existing clinical information and microarray data to generate a combined prognostic model for outcome prediction for diffuse large B-cell lymphoma (DLBCL). A prediction accuracy of 87.5% was achieved. This constitutes a significant improvement compared to the previously most accurate prognostic model with an accuracy of 77.6%. The model introduced here may be generally applicable to the combination of various types of molecular and clinical data for improving medical decision support systems and individualising patient care.
2009-01-01
Background Prostate cancer is a leading cancer worldwide and is characterized by its aggressive metastasis. Owing to its clinical heterogeneity, prostate cancer displays different stages and grades related to the aggressive metastatic disease. Although numerous studies have used microarray analysis and traditional clustering methods to identify the individual genes involved in the disease processes, the important gene regulations remain unclear. We present a computational method for automatically inferring genetic regulatory networks from microarray data with transcription factor analysis and conditional independence testing to explore the potentially significant gene regulatory networks that are correlated with cancer, tumor grade and stage in prostate cancer. Results To deal with missing values in microarray data, we used a K-nearest-neighbors (KNN) algorithm to determine the precise expression values. We applied web services technology to wrap the bioinformatics toolkits and databases to automatically extract the promoter regions of DNA sequences and predicted the transcription factors that regulate the gene expressions. We adopted microarray datasets consisting of 62 primary tumors and 41 normal prostate tissues from the Stanford Microarray Database (SMD) as a target dataset to evaluate our method. The predicted results identified possible biomarker genes related to cancer and indicated that androgen functions and processes may be involved in the development of prostate cancer and promote cell death in the cell cycle. Our predicted results showed that sub-networks of genes SREBF1, STAT6 and PBX1 are strongly related to a high extent while ETS transcription factors ELK1, JUN and EGR2 are related to a low extent. Gene SLC22A3 may explain clinically the differentiation associated with the high grade cancer compared with low grade cancer. Enhancer of Zeste Homolog 2 (EZH2) regulated by RUNX1 and STAT3 is correlated to the pathological stage.
Conclusions We provide a computational framework to reconstruct the genetic regulatory network from the microarray data using biological knowledge and constraint-based inferences. Our method is helpful in verifying possible interaction relations in gene regulatory networks and filtering out incorrect relations inferred by imperfect methods. We not only predicted individual genes related to cancer but also discovered significant gene regulation networks. Our method is also validated in several enriched published papers and databases and the significant gene regulatory networks perform critical biological functions and processes including cell adhesion molecules, androgen and estrogen metabolism, smooth muscle contraction, and GO-annotated processes. Those significant gene regulations and the critical concept of tumor progression are useful to understand cancer biology and disease treatment. PMID:20025723
Correcting for batch effects in case-control microbiome studies
Gibbons, Sean M.; Duvallet, Claire
2018-01-01
High-throughput data generation platforms, like mass-spectrometry, microarrays, and second-generation sequencing are susceptible to batch effects due to run-to-run variation in reagents, equipment, protocols, or personnel. Currently, batch correction methods are not commonly applied to microbiome sequencing datasets. In this paper, we compare different batch-correction methods applied to microbiome case-control studies. We introduce a model-free normalization procedure where features (i.e. bacterial taxa) in case samples are converted to percentiles of the equivalent features in control samples within a study prior to pooling data across studies. We look at how this percentile-normalization method compares to traditional meta-analysis methods for combining independent p-values and to limma and ComBat, widely used batch-correction models developed for RNA microarray data. Overall, we show that percentile-normalization is a simple, non-parametric approach for correcting batch effects and improving sensitivity in case-control meta-analyses. PMID:29684016
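The percentile-normalization idea described above (case values re-expressed as percentiles of the control distribution for the same feature within a study) can be sketched as follows. The tie handling used here (values equal to the score count half) is an assumption, as are the function names and the toy abundances; the authors' implementation may differ in these details.

```python
def percentile_of_score(reference, value):
    """Percentile rank of `value` within `reference`, on a 0-100 scale.
    Reference values equal to `value` count half (an assumed tie rule)."""
    below = sum(1 for r in reference if r < value)
    equal = sum(1 for r in reference if r == value)
    return 100.0 * (below + 0.5 * equal) / len(reference)


def percentile_normalize(case_values, control_values):
    """Convert each case value for one feature (e.g. one bacterial taxon)
    into its percentile within the control samples of the same study."""
    return [percentile_of_score(control_values, v) for v in case_values]


# Toy relative abundances for a single taxon in one study (hypothetical).
controls = [0.1, 0.2, 0.3, 0.4]
cases = [0.05, 0.25, 0.4, 0.9]
normalized = percentile_normalize(cases, controls)
print(normalized)
```

Because every study is mapped onto the same 0-100 scale relative to its own controls, the normalized values can then be pooled across studies, which is the step the abstract describes.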
Ma, Chifeng; Chen, Hung-I; Flores, Mario; Huang, Yufei; Chen, Yidong
2013-01-01
Connectivity map (cMap) is a recently developed dataset and algorithm for uncovering and understanding the treatment effect of small molecules on different cancer cell lines. It is widely used, but challenges remain for accurate predictions. Here, we propose BRCA-MoNet, a network of drug mode of action (MoA) specific to breast cancer, which is constructed based on the cMap dataset. A drug signature selection algorithm fitting the characteristics of cMap data, a quality control scheme, and a novel query algorithm based on BRCA-MoNet are developed for more effective prediction of drug effects. BRCA-MoNet was applied to three independent datasets obtained from the GEO database: an estradiol-treated MCF7 cell line, a BMS-754807-treated MCF7 cell line, and a breast cancer patient microarray dataset. In the first case, BRCA-MoNet could identify drug MoAs likely to share the same or the reverse treatment effect. In the second case, the result demonstrated the potential of BRCA-MoNet to reposition drugs and predict treatment effects for drugs not in cMap data. In the third case, a possible procedure of personalized drug selection is showcased. The results clearly demonstrated that the proposed BRCA-MoNet approach can provide increased prediction power to cMap and thus will be useful for identification of new therapeutic candidates.
Sun, Jie; Chen, Xihai; Wang, Zhenzhen; Guo, Maoni; Shi, Hongbo; Wang, Xiaojun; Cheng, Liang; Zhou, Meng
2015-11-09
Long non-coding RNAs (lncRNAs) have been implicated in a variety of biological processes, and dysregulated lncRNAs have demonstrated potential roles as biomarkers and therapeutic targets for cancer prognosis and treatment. In this study, by repurposing microarray probes, we analyzed lncRNA expression profiles of 916 breast cancer patients from the Gene Expression Omnibus (GEO). Nine lncRNAs were identified to be significantly associated with metastasis-free survival (MFS) in the training dataset of 254 patients using the Cox proportional hazards regression model. These nine lncRNAs were then combined to form a single prognostic signature for predicting metastatic risk in breast cancer patients that was able to classify patients in the training dataset into high- and low-risk subgroups with significantly different MFSs (median 2.4 years versus 3.0 years, log-rank test p < 0.001). This nine-lncRNA signature was similarly effective for prognosis in a testing dataset and two independent datasets. Further analysis showed that the predictive ability of the signature was independent of clinical variables, including age, ER status, ESR1 status and ERBB2 status. Our results indicate that this lncRNA signature could be a useful prognostic marker to predict metastatic risk in breast cancer patients and may improve our understanding of the molecular mechanisms underlying breast cancer metastasis.
VTCdb: a gene co-expression database for the crop species Vitis vinifera (grapevine).
Wong, Darren C J; Sweetman, Crystal; Drew, Damian P; Ford, Christopher M
2013-12-16
Gene expression datasets in model plants such as Arabidopsis have contributed to our understanding of gene function and how a single underlying biological process can be governed by a diverse network of genes. The accumulation of publicly available microarray data encompassing a wide range of biological and environmental conditions has enabled the development of additional capabilities including gene co-expression analysis (GCA). GCA is based on the understanding that genes encoding proteins involved in similar and/or related biological processes may exhibit comparable expression patterns over a range of experimental conditions, developmental stages and tissues. We present an open access database for the investigation of gene co-expression networks within the cultivated grapevine, Vitis vinifera. The new gene co-expression database, VTCdb (http://vtcdb.adelaide.edu.au/Home.aspx), offers an online platform for transcriptional regulatory inference in the cultivated grapevine. Using condition-independent and condition-dependent approaches, grapevine co-expression networks were constructed using the latest publicly available microarray datasets from diverse experimental series, utilising the Affymetrix Vitis vinifera GeneChip (16 K) and the NimbleGen Grape Whole-genome microarray chip (29 K), thus making it possible to profile approximately 29,000 genes (95% of the predicted grapevine transcriptome). Applications available with the online platform include the use of gene names, probesets, modules or biological processes to query the co-expression networks, with the option to choose between Affymetrix or Nimblegen datasets and between multiple co-expression measures. Alternatively, the user can browse existing network modules using interactive network visualisation and analysis via CytoscapeWeb. 
To demonstrate the utility of the database, we present examples from three fundamental biological processes (berry development, photosynthesis and flavonoid biosynthesis) whereby the recovered sub-networks reconfirm established plant gene functions and also identify novel associations. Together, we present valuable insights into grapevine transcriptional regulation by developing network models applicable to researchers in their prioritisation of gene candidates, for on-going study of biological processes related to grapevine development, metabolism and stress responses.
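The idea behind gene co-expression analysis can be sketched with a few lines of code. The example below uses the Pearson correlation coefficient as the co-expression measure and a fixed threshold to pick co-expressed genes; the abstract says only that VTCdb offers multiple co-expression measures, so the choice of Pearson correlation, the threshold of 0.9, and the gene names and profiles are assumptions for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def coexpressed(query, profiles, threshold=0.9):
    """Genes whose profile correlates with the query gene at or above the threshold."""
    return sorted(
        gene
        for gene, values in profiles.items()
        if gene != query and pearson(profiles[query], values) >= threshold
    )

# Hypothetical expression profiles across four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],   # rises with geneA
    "geneC": [4.0, 3.0, 2.0, 1.0],   # falls as geneA rises
}

partners = coexpressed("geneA", profiles)
```

Here `partners` contains only `geneB`; `geneC` is perfectly anti-correlated with `geneA` and so falls below the threshold.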
Accession numbers for microarray datasets used in Oshida et al. Chemical and Hormonal Effects on STAT5b-Dependent Sexual Dimorphism of the Liver Transcriptome. PLoS One. 2016 Mar 9;11(3):e0150284. This dataset is associated with the following publication: Oshida, K., D. Waxman, and C. Corton. Chemical and Hormonal Effects on STAT5b-Dependent Sexual Dimorphism of the Liver Transcriptome. PLoS ONE. Public Library of Science, San Francisco, CA, USA, 11(3): NA, (2016).
Gene selection for microarray data classification via subspace learning and manifold regularization.
Tang, Chang; Cao, Lijuan; Zheng, Xiao; Wang, Minhui
2017-12-19
With the rapid development of DNA microarray technology, a large amount of genomic data has been generated. Classification of these microarray data is a challenging task, since gene expression data often contain thousands of genes but only a small number of samples. In this paper, an effective gene selection method is proposed to select the best subset of genes for microarray data, with irrelevant and redundant genes removed. Compared with the original data, the selected gene subset can benefit the classification task. We formulate the gene selection task as a manifold regularized subspace learning problem. In detail, a projection matrix is used to project the original high-dimensional microarray data into a lower-dimensional subspace, with the constraint that the original genes can be well represented by the selected genes. Meanwhile, the local manifold structure of the original data is preserved by a Laplacian graph regularization term on the low-dimensional data space. The projection matrix can serve as an importance indicator of different genes. An iterative update algorithm is developed for solving the problem. Experimental results on six publicly available microarray datasets and one clinical dataset demonstrate that the proposed method performs better than other state-of-the-art methods in terms of microarray data classification.
Kumar, Mukesh; Rath, Nitish Kumar; Rath, Santanu Kumar
2016-04-01
Microarray-based gene expression profiling has emerged as an efficient technique for classification, prognosis, diagnosis, and treatment of cancer. Frequent changes in the behavior of this disease generate an enormous volume of data. Microarray data satisfies both the veracity and velocity properties of big data, as it keeps changing with time. Therefore, the analysis of microarray datasets in a small amount of time is essential. These datasets often contain a large amount of expression data, but only a fraction of it comprises genes that are significantly expressed. The precise identification of the genes of interest that are responsible for causing cancer is imperative in microarray data analysis. Most existing schemes employ a two-phase process, such as feature selection/extraction followed by classification. In this paper, various statistical methods (tests) based on MapReduce are proposed for selecting relevant features. After feature selection, a MapReduce-based K-nearest neighbor (mrKNN) classifier is also employed to classify microarray data. These algorithms are successfully implemented in a Hadoop framework. A comparative analysis is done on these MapReduce-based models using microarray datasets of various dimensions. From the obtained results, it is observed that these models consume much less execution time than conventional models in processing big data. Copyright © 2016 Elsevier Inc. All rights reserved.
Klein, Hans-Ulrich; Ruckert, Christian; Kohlmann, Alexander; Bullinger, Lars; Thiede, Christian; Haferlach, Torsten; Dugas, Martin
2009-12-15
Multiple gene expression signatures derived from microarray experiments have been published in the field of leukemia research. A comparison of these signatures with results from new experiments is useful for verification as well as for interpretation of the results obtained. Currently, the percentage of overlapping genes is frequently used to compare published gene signatures against a signature derived from a new experiment. However, it has been shown that the percentage of overlapping genes is of limited use for comparing two experiments due to the variability of gene signatures caused by different array platforms or assay-specific influencing parameters. Here, we present a robust approach for a systematic and quantitative comparison of published gene expression signatures with an exemplary query dataset. A database storing 138 leukemia-related published gene signatures was designed. Each gene signature was manually annotated with terms according to a leukemia-specific taxonomy. Two analysis steps are implemented to compare a new microarray dataset with the results from previous experiments stored and curated in the database. First, the global test method is applied to assess gene signatures and to establish a ranking among them. In a subsequent analysis step, the focus is shifted from single gene signatures to chromosomal aberrations or molecular mutations as modeled in the taxonomy. Potentially interesting disease characteristics are detected based on the ranking of gene signatures associated with these aberrations stored in the database. Two example analyses are presented. An implementation of the approach is freely available as a web-based application. The presented approach helps researchers to systematically integrate the knowledge derived from numerous microarray experiments into the analysis of a new dataset. 
By means of example leukemia datasets we demonstrate that this approach detects related experiments as well as related molecular mutations and may help to interpret new microarray data.
Cheng, Ningtao; Wu, Leihong; Cheng, Yiyu
2013-01-01
The promise of microarray technology in providing prediction classifiers for cancer outcome estimation has been confirmed by a number of demonstrable successes. However, the reliability of prediction results relies heavily on the accuracy of the statistical parameters involved in classifiers. These parameters cannot be reliably estimated with only a small number of training samples. Therefore, it is of vital importance to determine the minimum number of training samples and to ensure the clinical value of microarrays in cancer outcome prediction. We evaluated the impact of training sample size on model performance extensively, based on 3 large-scale cancer microarray datasets provided by the second phase of the MicroArray Quality Control project (MAQC-II). An SSNR-based (scale of signal-to-noise ratio) protocol was proposed in this study for minimum training sample size determination. External validation results based on another 3 cancer datasets confirmed that the SSNR-based approach could not only determine the minimum number of training samples efficiently, but also provide a valuable strategy for estimating the underlying performance of classifiers in advance. Once translated into clinical routine applications, the SSNR-based protocol would provide great convenience in microarray-based cancer outcome prediction by improving classifier reliability. PMID:23861920
Hybrid genetic algorithm-neural network: feature extraction for unpreprocessed microarray data.
Tong, Dong Ling; Schierz, Amanda C
2011-09-01
Suitable techniques for microarray analysis have been widely researched, particularly for the study of marker genes expressed in a specific type of cancer. Most of the machine learning methods that have been applied to significant gene selection focus on the classification ability rather than the selection ability of the method. These methods also require the microarray data to be preprocessed before analysis takes place. The objective of this study is to develop a hybrid genetic algorithm-neural network (GANN) model that emphasises feature selection and can operate on unpreprocessed microarray data. The GANN is a hybrid model where the fitness value of the genetic algorithm (GA) is based upon the number of samples correctly labelled by a standard feedforward artificial neural network (ANN). The model is evaluated by using two benchmark microarray datasets with different array platforms and differing numbers of classes (a 2-class oligonucleotide microarray dataset for acute leukaemia and a 4-class complementary DNA (cDNA) microarray dataset for SRBCTs (small round blue cell tumours)). The underlying concept of the GANN algorithm is to select highly informative genes by co-evolving both the GA fitness function and the ANN weights at the same time. The novel GANN selected approximately 50% of the same genes as the original studies. This may indicate that these common genes are more biologically significant than other genes in the datasets. The remaining 50% of the significant genes identified were used to build predictive models and, for both datasets, the models based on the set of genes extracted by the GANN method produced more accurate results. The results also suggest that the GANN method can not only detect genes that are exclusively associated with a single cancer type but can also explore the genes that are differentially expressed in multiple cancer types. 
The results show that the GANN model has successfully extracted statistically significant genes from the unpreprocessed microarray data as well as extracting known biologically significant genes. We also show that assessing the biological significance of genes based on classification accuracy may be misleading and though the GANN's set of extra genes prove to be more statistically significant than those selected by other methods, a biological assessment of these genes is highly recommended to confirm their functionality. Copyright © 2011 Elsevier B.V. All rights reserved.
De Hertogh, Benoît; De Meulder, Bertrand; Berger, Fabrice; Pierre, Michael; Bareke, Eric; Gaigneaux, Anthoula; Depiereux, Eric
2010-01-11
Recent reanalysis of spike-in datasets underscored the need for new and more accurate benchmark datasets for statistical microarray analysis. We present here a fresh method using biologically-relevant data to evaluate the performance of statistical methods. Our novel method ranks the probesets from a dataset composed of publicly-available biological microarray data and extracts subset matrices with precise information/noise ratios. Our method can be used to determine the capability of different methods to better estimate variance for a given number of replicates. The mean-variance and mean-fold change relationships of the matrices revealed a closer approximation of biological reality. Performance analysis refined the results from benchmarks published previously. We show that the Shrinkage t test (close to Limma) was the best of the methods tested, except when two replicates were examined, where the Regularized t test and the Window t test performed slightly better. The R scripts used for the analysis are available at http://urbm-cluster.urbm.fundp.ac.be/~bdemeulder/.
An efficient method to identify differentially expressed genes in microarray experiments
Qin, Huaizhen; Feng, Tao; Harding, Scott A.; Tsai, Chung-Jui; Zhang, Shuanglin
2013-01-01
Motivation: Microarray experiments typically analyze thousands to tens of thousands of genes from small numbers of biological replicates. The fact that genes are normally expressed in functionally relevant patterns suggests that gene-expression data can be stratified and clustered into relatively homogenous groups. Cluster-wise dimensionality reduction should make it feasible to improve screening power while minimizing information loss. Results: We propose a powerful and computationally simple method for finding differentially expressed genes in small microarray experiments. The method incorporates a novel stratification-based tight clustering algorithm, principal component analysis and information pooling. Comprehensive simulations show that our method is substantially more powerful than the popular SAM and eBayes approaches. We applied the method to three real microarray datasets: one from a Populus nitrogen stress experiment with 3 biological replicates; and two from public microarray datasets of human cancers with 10 to 40 biological replicates. In all three analyses, our method proved more robust than the popular alternatives for identification of differentially expressed genes. Availability: The C++ code to implement the proposed method is available upon request for academic use. PMID:18453554
A Self-Directed Method for Cell-Type Identification and Separation of Gene Expression Microarrays
Zuckerman, Neta S.; Noam, Yair; Goldsmith, Andrea J.; Lee, Peter P.
2013-01-01
Gene expression analysis is generally performed on heterogeneous tissue samples consisting of multiple cell types. Current methods developed to separate heterogeneous gene expression rely on prior knowledge of the cell-type composition and/or signatures, which are not available in most public datasets. We present a novel method to identify the cell-type composition, signatures and proportions per sample without the need for a priori information. The method was successfully tested on controlled and semi-controlled datasets and performed as accurately as current methods that do require additional information. As such, this method enables the analysis of cell-type specific gene expression using existing large pools of publicly available microarray datasets. PMID:23990767
Johnstone, Daniel M.; Riveros, Carlos; Heidari, Moones; Graham, Ross M.; Trinder, Debbie; Berretta, Regina; Olynyk, John K.; Scott, Rodney J.; Moscato, Pablo; Milward, Elizabeth A.
2013-01-01
While Illumina microarrays can be used successfully for detecting small gene expression changes due to their high degree of technical replicability, there is little information on how different normalization and differential expression analysis strategies affect outcomes. To evaluate this, we assessed concordance across gene lists generated by applying different combinations of normalization strategy and analytical approach to two Illumina datasets with modest expression changes. In addition to using traditional statistical approaches, we also tested an approach based on combinatorial optimization. We found that the choice of both normalization strategy and analytical approach considerably affected outcomes, in some cases leading to substantial differences in gene lists and subsequent pathway analysis results. Our findings suggest that important biological phenomena may be overlooked when there is a routine practice of using only one approach to investigate all microarray datasets. Analytical artefacts of this kind are likely to be especially relevant for datasets involving small fold changes, where inherent technical variation—if not adequately minimized by effective normalization—may overshadow true biological variation. This report provides some basic guidelines for optimizing outcomes when working with Illumina datasets involving small expression changes. PMID:27605185
Reproducibility-optimized test statistic for ranking genes in microarray studies.
Elo, Laura L; Filén, Sanna; Lahesmaa, Riitta; Aittokallio, Tero
2008-01-01
A principal goal of microarray studies is to identify the genes showing differential expression under distinct conditions. In such studies, the selection of an optimal test statistic is a crucial challenge, which depends on the type and amount of data under analysis. While previous studies on simulated or spike-in datasets do not provide practical guidance on how to choose the best method for a given real dataset, we introduce an enhanced reproducibility-optimization procedure, which enables the selection of a suitable gene-ranking statistic directly from the data. In comparison with existing ranking methods, the reproducibility-optimized statistic shows good performance consistently under various simulated conditions and on the Affymetrix spike-in dataset. Further, the feasibility of the novel statistic is confirmed in a practical research setting using data from an in-house cDNA microarray study of asthma-related gene expression changes. These results suggest that the procedure facilitates the selection of an appropriate test statistic for a given dataset without relying on a priori assumptions, which may bias the findings and their interpretation. Moreover, the general reproducibility-optimization procedure is not limited to detecting differential expression only but could be extended to a wide range of other applications as well.
Feng, Yinling; Wang, Xuefeng
2017-03-01
In order to investigate commonly disturbed genes and pathways in various brain regions of patients with Parkinson's disease (PD), microarray datasets from previous studies were collected and systematically analyzed. Different normalization methods were applied to microarray datasets from different platforms. A strategy combining gene co‑expression networks and clinical information was adopted, using weighted gene co‑expression network analysis (WGCNA) to screen for commonly disturbed genes in different brain regions of patients with PD. Functional enrichment analysis of commonly disturbed genes was performed using the Database for Annotation, Visualization, and Integrated Discovery (DAVID). Co‑pathway relationships were identified with Pearson's correlation coefficient tests and a hypergeometric distribution‑based test. Common genes in pathway pairs were selected out and regarded as risk genes. A total of 17 microarray datasets from 7 platforms were retained for further analysis. Five gene coexpression modules were identified, containing 9,745, 736, 233, 101 and 93 genes, respectively. One module was significantly correlated with PD samples and thus the 736 genes it contained were considered to be candidate PD‑associated genes. Functional enrichment analysis demonstrated that these genes were implicated in oxidative phosphorylation and PD. A total of 44 pathway pairs and 52 risk genes were revealed, and a risk gene pathway relationship network was constructed. Eight modules were identified and were revealed to be associated with PD, cancers and metabolism. A number of disturbed pathways and risk genes were unveiled in PD, and these findings may help advance understanding of PD pathogenesis.
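A hypergeometric test of the kind mentioned above asks whether two gene sets share more genes than expected by chance. The sketch below computes the upper-tail probability directly from binomial coefficients. How the study parameterised its test is not stated in the abstract, so the function name, its arguments, and the example numbers are assumptions for illustration.

```python
from math import comb

def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k), where X is the number of genes shared by chance.

    N: size of the gene universe
    K: number of genes in the first pathway
    n: number of genes in the second pathway
    k: observed number of shared genes
    """
    upper = min(K, n)
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total

# Hypothetical example: two pathways of 100 and 150 genes sharing 5 genes
# in a universe of 20,000 genes.
p_value = hypergeom_upper_tail(k=5, N=20000, K=100, n=150)
```

A small `p_value` would suggest that the overlap between the two pathways is unlikely to arise from random gene assignment alone.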
Animal Viruses Probe dataset (AVPDS) for microarray-based diagnosis and identification of viruses.
Yadav, Brijesh S; Pokhriyal, Mayank; Vasishtha, Dinesh P; Sharma, Bhaskar
2014-03-01
AVPDS (Animal Viruses Probe dataset) is a dataset of virus-specific and conserved oligonucleotides for identification and diagnosis of viruses infecting animals. The current dataset contains 20,619 virus-specific probes for 833 viruses and their subtypes and 3,988 conserved probes for 146 viral genera. The dataset of virus-specific probes is divided into two fields, namely virus name and probe sequence. Similarly, the table of conserved probes for virus genera has the fields genus, subgroup within genus, and probe sequence. The subgroups within a genus are artificial divisions with no taxonomic significance; each contains probes that identify viruses in that specific subgroup of the genus. Using this dataset we have successfully diagnosed the first case of Newcastle disease virus in sheep and reported a mixed infection of Bovine viral diarrhea and Bovine herpesvirus in cattle. The dataset also contains probes that cross-react across species experimentally even though they meet specifications computationally. These probes have been marked. We hope that this dataset will be useful in microarray-based detection of viruses. The dataset can be accessed through the link https://dl.dropboxusercontent.com/u/94060831/avpds/HOME.html.
FDA's Critical Path Initiative identifies pharmacogenomics and toxicogenomics as key opportunities in advancing medical product development and personalized medicine, and the Guidance for Industry: Pharmacogenomic Data Submissions has been released. Microarrays represent a co...
2013-01-01
Background: Connectivity map (cMap) is a recently developed dataset and algorithm for uncovering and understanding the treatment effect of small molecules on different cancer cell lines. It is widely used, but challenges remain for accurate predictions. Method: Here, we propose BRCA-MoNet, a network of drug mode of action (MoA) specific to breast cancer, which is constructed based on the cMap dataset. A drug signature selection algorithm fitting the characteristics of cMap data, a quality control scheme, and a novel query algorithm based on BRCA-MoNet are developed for more effective prediction of drug effects. Result: BRCA-MoNet was applied to three independent datasets obtained from the GEO database: an estradiol-treated MCF7 cell line, a BMS-754807-treated MCF7 cell line, and a breast cancer patient microarray dataset. In the first case, BRCA-MoNet could identify drug MoAs likely to share the same or the reverse treatment effect. In the second case, the result demonstrated the potential of BRCA-MoNet to reposition drugs and predict treatment effects for drugs not in the cMap data. In the third case, a possible procedure for personalized drug selection is showcased. Conclusions: The results clearly demonstrated that the proposed BRCA-MoNet approach can provide increased prediction power to cMap and thus will be useful for identification of new therapeutic candidates. Website: The web-based application can be accessed through the following link: http://compgenomics.utsa.edu/BRCAMoNet/ PMID:24564956
New Statistics for Testing Differential Expression of Pathways from Microarray Data
NASA Astrophysics Data System (ADS)
Siu, Hoicheong; Dong, Hua; Jin, Li; Xiong, Momiao
Exploring biological meaning from microarray data is very important but remains a great challenge. Here, we developed three new statistics (the linear combination test, the quadratic test and the de-correlation test) to identify differentially expressed pathways from gene expression profiles. We applied our statistics to two rheumatoid arthritis datasets. Notably, our results reveal three significant pathways and 275 genes common to the two datasets. The pathways we found are meaningful for uncovering the disease mechanisms of rheumatoid arthritis, which implies that our statistics are a powerful tool for functional analysis of gene expression data.
An Introduction to MAMA (Meta-Analysis of MicroArray data) System.
Zhang, Zhe; Fenstermacher, David
2005-01-01
Analyzing microarray data across multiple experiments has been proven advantageous. To support this kind of analysis, we are developing a software system called MAMA (Meta-Analysis of MicroArray data). MAMA utilizes a client-server architecture with a relational database on the server-side for the storage of microarray datasets collected from various resources. The client-side is an application running on the end user's computer that allows the user to manipulate microarray data and analytical results locally. MAMA implementation will integrate several analytical methods, including meta-analysis within an open-source framework offering other developers the flexibility to plug in additional statistical algorithms.
Use of Network Inference to Elucidate Common and Chemical-specific Effects on Steroidogenesis
Microarray data is a key source for modeling gene regulatory interactions. Regulatory network models based on multiple datasets are potentially more robust and can provide greater confidence. In this study, we used network modeling on microarray data generated by exposing the fat...
Yu, Hualong; Hong, Shufang; Yang, Xibei; Ni, Jun; Dan, Yuanyuan; Qin, Bin
2013-01-01
DNA microarray technology can measure the activities of tens of thousands of genes simultaneously, which provides an efficient way to diagnose cancer at the molecular level. Although this strategy has attracted significant research attention, most studies neglect an important problem, namely, that most DNA microarray datasets are skewed, which causes traditional learning algorithms to produce inaccurate results. Some studies have considered this problem, yet they focus only on the binary-class problem. In this paper, we address the multiclass imbalanced classification problem, as encountered in cancer DNA microarray data, by using ensemble learning. We utilized a one-against-all coding strategy to transform the multiclass problem into multiple binary-class problems, each of which applies feature subspace, an evolved version of random subspace that generates multiple diverse training subsets. Next, we introduced one of two different correction technologies, namely, decision threshold adjustment or random undersampling, into each training subset to alleviate the damage of class imbalance. Specifically, a support vector machine was used as the base classifier, and a novel voting rule called counter voting was presented for making a final decision. Experimental results on eight skewed multiclass cancer microarray datasets indicate that, unlike many traditional classification approaches, our methods are insensitive to class imbalance.
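Two of the ingredients named above, one-against-all decomposition and random undersampling, can be sketched briefly. The example below builds one balanced binary training subset per class from a skewed label list. It does not reproduce the paper's feature subspace step, its SVM base classifier, or its counter voting rule, and the function name, labels, and seed are assumptions for illustration.

```python
import random

def one_vs_all_undersampled(labels, target, rng):
    """Indices for the binary problem 'target vs rest', balanced by
    randomly undersampling whichever side is larger."""
    pos = [i for i, y in enumerate(labels) if y == target]
    neg = [i for i, y in enumerate(labels) if y != target]
    small, large = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = rng.sample(large, len(small))  # sample without replacement
    return sorted(small + kept)

# Hypothetical skewed three-class label vector: 10 x A, 3 x B, 2 x C.
labels = ["A"] * 10 + ["B"] * 3 + ["C"] * 2
rng = random.Random(0)  # fixed seed so the sketch is repeatable

subsets = {
    cls: one_vs_all_undersampled(labels, cls, rng)
    for cls in sorted(set(labels))
}
```

Each entry of `subsets` holds equal numbers of target and non-target samples, so a base classifier trained on it sees a balanced binary problem.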
Boolean dynamics of genetic regulatory networks inferred from microarray time series data
Martin, Shawn; Zhang, Zhaoduo; Martino, Anthony; ...
2007-01-31
Methods available for the inference of genetic regulatory networks strive to produce a single network, usually by optimizing some quantity to fit the experimental observations. In this paper we investigate the possibility that multiple networks can be inferred, all resulting in similar dynamics. This idea is motivated by theoretical work which suggests that biological networks are robust and adaptable to change, and that the overall behavior of a genetic regulatory network might be captured in terms of dynamical basins of attraction. We have developed and implemented a method for inferring genetic regulatory networks for time series microarray data. Our method first clusters and discretizes the gene expression data using k-means and support vector regression. We then enumerate Boolean activation–inhibition networks to match the discretized data. In conclusion, the dynamics of the Boolean networks are examined. We have tested our method on two immunology microarray datasets: an IL-2-stimulated T cell response dataset and a LPS-stimulated macrophage response dataset. In both cases, we discovered that many networks matched the data, and that most of these networks had similar dynamics.
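The notion of dynamical basins of attraction in a Boolean network can be made concrete with a small enumeration. The sketch below iterates a synchronous Boolean network from every possible state and records which attractor (fixed point or cycle) each state ends up in. The three-gene network and its update rules are hypothetical and are not taken from the paper; synchronous updating is likewise an assumption.

```python
from itertools import product

def step(state, rules):
    """Apply every update rule to the current state at the same time."""
    return tuple(rule(state) for rule in rules)

def attractors(rules, n):
    """Map each attractor (a frozenset of states) to the size of its basin."""
    basins = {}
    for start in product((0, 1), repeat=n):
        seen = []
        state = start
        while state not in seen:          # follow the trajectory until it repeats
            seen.append(state)
            state = step(state, rules)
        cycle = frozenset(seen[seen.index(state):])
        basins[cycle] = basins.get(cycle, 0) + 1
    return basins

# Hypothetical network: gene 2 inhibits gene 0, gene 0 activates gene 1,
# gene 1 activates gene 2.
rules = [
    lambda s: 1 - s[2],
    lambda s: s[0],
    lambda s: s[1],
]

basins = attractors(rules, 3)
```

For this toy network the eight states fall into two attractors, a cycle of six states and a cycle of two states. Two candidate networks could be compared by checking whether their attractors and basin sizes agree.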
Release of (and lessons learned from mining) a pioneering large toxicogenomics database.
Sandhu, Komal S; Veeramachaneni, Vamsi; Yao, Xiang; Nie, Alex; Lord, Peter; Amaratunga, Dhammika; McMillian, Michael K; Verheyen, Geert R
2015-07-01
We release the Janssen Toxicogenomics database. This rat liver gene-expression database was generated using Codelink microarrays, and has been used over the past years within Janssen to derive signatures for multiple end points and to classify proprietary compounds. The release consists of gene-expression responses to 124 compounds, selected to give a broad coverage of liver-active compounds. A selection of the compounds were also analyzed on Affymetrix microarrays. The release includes results of an in-house reannotation pipeline to Entrez gene annotations, to classify probes into different confidence classes. High confidence unambiguously annotated probes were used to create gene-level data which served as starting point for cross-platform comparisons. Connectivity map-based similarity methods show excellent agreement between Codelink and Affymetrix runs of the same samples. We also compared our dataset with the Japanese Toxicogenomics Project and observed reasonable agreement, especially for compounds with stronger gene signatures. We describe an R-package containing the gene-level data and show how it can be used for expression-based similarity searches. Comparing the same biological samples run on the Affymetrix and the Codelink platform, good correspondence is observed using connectivity mapping approaches. As expected, this correspondence is smaller when the data are compared with an independent dataset such as TG-GATE. We hope that this collection of gene-expression profiles will be incorporated in toxicogenomics pipelines of users.
Hu, Ting; Sun, Qian; Wu, Jianli; Lin, Xingguang; Luo, Danfeng; Sun, Chaoyang; Wang, Changyu; Zhou, Bo; Li, Na; Xia, Meng; Lu, Hao; Meng, Li; Xu, Xiaoyan; Hu, Junbo; Ma, Ding; Chen, Gang; Zhu, Tao
2016-01-01
Approximately 50-75% of patients with serous ovarian carcinoma (SOC) experience recurrence within 18 months after first-line treatment. Current clinical indicators are inadequate for predicting the risk of recurrence. In this study, we used 7 publicly available microarray datasets to identify gene signatures related to recurrence in optimally debulked SOC patients, and validated their expression in an independent clinical cohort of 127 patients using immunohistochemistry (IHC). We identified a two-gene signature including KCNN4 and S100A14 which was related to recurrence in optimally debulked SOC patients. Their mRNA expression levels were positively correlated and regulated by DNA copy number alterations (CNA) (KCNN4: p=1.918e-05) and DNA promoter methylation (KCNN4: p=0.0179; S100A14: p=2.787e-13). Recurrence prediction models built in the TCGA dataset based on KCNN4 and S100A14 individually and in combination showed good prediction performance in the other 6 datasets (AUC: 0.5442-0.9524). The independent cohort supported the expression difference between SOC recurrences. Also, a KCNN4 and S100A14-centered protein interaction subnetwork was built from the STRING database, and the shortest regulation path between them, called the KCNN4-UBA52-KLF4-S100A14 axis, was identified. This discovery might facilitate individualized treatment of SOC. PMID:27270322
Parham, Fred; Portier, Christopher J.; Chang, Xiaoqing; Mevissen, Meike
2016-01-01
Using in vitro data in human cell lines, several research groups have investigated changes in gene expression in cellular systems following exposure to extremely low frequency (ELF) and radiofrequency (RF) electromagnetic fields (EMF). For ELF EMF, we obtained five studies with complete microarray data and three studies with only lists of significantly altered genes. Likewise, for RF EMF, we obtained 13 complete microarray datasets and 5 limited datasets. Plausible linkages between exposure to ELF and RF EMF and human diseases were identified using a three-step process: (a) linking genes associated with classes of human diseases to molecular pathways, (b) linking pathways to ELF and RF EMF microarray data, and (c) identifying associations between human disease and EMF exposures where the pathways are significantly similar. A total of 60 pathways were associated with human diseases, mostly focused on basic cellular functions like JAK–STAT signaling or metabolic functions like xenobiotic metabolism by cytochrome P450 enzymes. ELF EMF datasets were sporadically linked to human diseases, but no clear pattern emerged. Individual datasets showed some linkage to cancer, chemical dependency, metabolic disorders, and neurological disorders. RF EMF datasets were not strongly linked to any disorders but strongly linked to changes in several pathways. Based on these analyses, the most promising area for further research would be to focus on EMF and neurological function and disorders. PMID:27656641
Analysis of baseline gene expression levels from ...
The use of gene expression profiling to predict chemical mode of action would be enhanced by better characterization of variance due to individual, environmental, and technical factors. Meta-analysis of microarray data from untreated or vehicle-treated animals within the control arm of toxicogenomics studies has yielded useful information on baseline fluctuations in gene expression. A dataset of control animal microarray expression data was assembled by a working group of the Health and Environmental Sciences Institute's Technical Committee on the Application of Genomics in Mechanism Based Risk Assessment in order to provide a public resource for assessments of variability in baseline gene expression. Data from over 500 Affymetrix microarrays from control rat liver and kidney were collected from 16 different institutions. Thirty-five biological and technical factors were obtained for each animal, describing a wide range of study characteristics, and a subset were evaluated in detail for their contribution to total variability using multivariate statistical and graphical techniques. The study factors that emerged as key sources of variability included gender, organ section, strain, and fasting state. These and other study factors were identified as key descriptors that should be included in the minimal information about a toxicogenomics study needed for interpretation of results by an independent source. Genes that are the most and least variable, gender-selectiv
Gupta, Sanjay K.; Dahiya, Saurabh; Lundy, Robert F.; Kumar, Ashok
2010-01-01
Background Skeletal muscle wasting is a debilitating consequence of a large number of disease states and conditions. Tumor necrosis factor-α (TNF-α) is one of the most important muscle-wasting cytokines, elevated levels of which cause significant muscular abnormalities. However, the underpinning molecular mechanisms by which TNF-α causes skeletal muscle wasting are less well-understood. Methodology/Principal Findings We have used microarray, quantitative real-time PCR (QRT-PCR), Western blot, and bioinformatics tools to study the effects of TNF-α on various molecular pathways and gene networks in C2C12 cells (a mouse myoblastic cell line). Microarray analyses of C2C12 myotubes treated with TNF-α (10 ng/ml) for 18h showed differential expression of a number of genes involved in distinct molecular pathways. The genes involved in nuclear factor-kappa B (NF-kappaB) signaling, 26s proteasome pathway, Notch1 signaling, and chemokine networks are the most important ones affected by TNF-α. The expression of some of the genes in the microarray dataset showed good correlation in independent QRT-PCR and Western blot assays. Analysis of TNF-treated myotubes showed that TNF-α augments the activity of both canonical and alternative NF-κB signaling pathways in myotubes. Bioinformatics analyses of the microarray dataset revealed that TNF-α affects the activity of several important pathways including those involved in oxidative stress, hepatic fibrosis, mitochondrial dysfunction, cholesterol biosynthesis, and TGF-β signaling. Furthermore, TNF-α was found to affect the gene networks related to drug metabolism, cell cycle, cancer, neurological disease, organismal injury, and abnormalities in myotubes. Conclusions TNF-α regulates the expression of multiple genes involved in various toxic pathways which may be responsible for TNF-induced muscle loss in catabolic conditions.
Our study suggests that TNF-α activates both canonical and alternative NF-κB signaling pathways in a time-dependent manner in skeletal muscle cells. The study provides novel insight into the mechanisms of action of TNF-α in skeletal muscle cells. PMID:20967264
Gan, Lin; Denecke, Bernd
2013-06-24
It came to our attention that a paper has recently been published concerning one of the GEO datasets (GSE34413) we cited in our published paper [1]. The original reference (reference 27) cited for this dataset leads to a paper about a similar study from the same research group [2]. In order to provide readers with exact citation information, we would like to update reference 27 in our previous paper to the new published paper concerning GSE34413 [3]. The authors apologize for this inconvenience. [...].
ArrayWiki: an enabling technology for sharing public microarray data repositories and meta-analyses
Stokes, Todd H; Torrance, JT; Li, Henry; Wang, May D
2008-01-01
Background A survey of microarray databases reveals that most of the repository contents and data models are heterogeneous (i.e., data obtained from different chip manufacturers), and that the repositories provide only basic biological keywords linking to PubMed. As a result, it is difficult to find datasets using research context or analysis parameters information beyond a few keywords. For example, to reduce the "curse-of-dimension" problem in microarray analysis, the number of samples is often increased by merging array data from different datasets. Knowing chip data parameters such as pre-processing steps (e.g., normalization, artefact removal, etc.), and knowing any previous biological validation of the dataset is essential due to the heterogeneity of the data. However, most of the microarray repositories do not have meta-data information in the first place, and do not have a mechanism to add or insert this information. Thus, there is a critical need to create "intelligent" microarray repositories that (1) enable update of meta-data with the raw array data, and (2) provide standardized archiving protocols to minimize bias from the raw data sources. Results To address the problems discussed, we have developed a community maintained system called ArrayWiki that unites disparate meta-data of microarray meta-experiments from multiple primary sources with four key features. First, ArrayWiki provides a user-friendly knowledge management interface in addition to a programmable interface using standards developed by Wikipedia. Second, ArrayWiki includes automated quality control processes (caCORRECT) and novel visualization methods (BioPNG, Gel Plots), which provide extra information about data quality unavailable in other microarray repositories. Third, it provides a user-curation capability through the familiar Wiki interface.
Fourth, ArrayWiki provides users with simple text-based searches across all experiment meta-data, and exposes data to search engine crawlers (Semantic Agents) such as Google to further enhance data discovery. Conclusions Microarray data and meta information in ArrayWiki are distributed and visualized using a novel and compact data storage format, BioPNG. Also, they are open to the research community for curation, modification, and contribution. By making a small investment of time to learn the syntax and structure common to all sites running MediaWiki software, domain scientists and practitioners can all contribute to make better use of microarray technologies in research and medical practices. ArrayWiki is available at . PMID:18541053
Estimating replicate time shifts using Gaussian process regression
Liu, Qiang; Andersen, Bogi; Smyth, Padhraic; Ihler, Alexander
2010-01-01
Motivation: Time-course gene expression datasets provide important insights into dynamic aspects of biological processes, such as circadian rhythms, cell cycle and organ development. In a typical microarray time-course experiment, measurements are obtained at each time point from multiple replicate samples. Accurately recovering the gene expression patterns from experimental observations is made challenging by both measurement noise and variation among replicates' rates of development. Prior work on this topic has focused on inference of expression patterns assuming that the replicate times are synchronized. We develop a statistical approach that simultaneously infers both (i) the underlying (hidden) expression profile for each gene, as well as (ii) the biological time for each individual replicate. Our approach is based on Gaussian process regression (GPR) combined with a probabilistic model that accounts for uncertainty about the biological development time of each replicate. Results: We apply GPR with uncertain measurement times to a microarray dataset of mRNA expression for the hair-growth cycle in mouse back skin, predicting both profile shapes and biological times for each replicate. The predicted time shifts show high consistency with independently obtained morphological estimates of relative development. We also show that the method systematically reduces prediction error on out-of-sample data, significantly reducing the mean squared error in a cross-validation study. Availability: Matlab code for GPR with uncertain time shifts is available at http://sli.ics.uci.edu/Code/GPRTimeshift/ Contact: ihler@ics.uci.edu PMID:20147305
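To make the idea in the abstract above concrete, here is a minimal sketch, not the authors' Matlab implementation. It fits a zero-mean Gaussian process with a squared-exponential kernel and chooses a time shift for one replicate by maximizing the GP marginal likelihood over a small grid of candidate shifts. The kernel, the hyperparameter values, the grid search, and every function name are assumptions made for illustration only.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance between 1-D time vectors a and b."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)

def gp_log_marginal(t, y, noise=0.1, **kern):
    """Log marginal likelihood of y observed at times t under a zero-mean GP."""
    K = rbf_kernel(t, t, **kern) + noise ** 2 * np.eye(len(t))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * (y @ alpha) - np.log(np.diag(L)).sum()
            - 0.5 * len(t) * np.log(2.0 * np.pi))

def gp_predict(t_train, y_train, t_test, noise=0.1, **kern):
    """Posterior mean of the GP at t_test."""
    K = rbf_kernel(t_train, t_train, **kern) + noise ** 2 * np.eye(len(t_train))
    Ks = rbf_kernel(t_test, t_train, **kern)
    return Ks @ np.linalg.solve(K, y_train)

def best_shift(t_ref, y_ref, t_rep, y_rep, candidates, **kw):
    """Hypothetical helper: grid-search the time shift of one replicate
    that makes the pooled data most probable under a single smooth profile."""
    scores = []
    for s in candidates:
        t_all = np.concatenate([t_ref, t_rep + s])
        y_all = np.concatenate([y_ref, y_rep])
        scores.append(gp_log_marginal(t_all, y_all, **kw))
    return candidates[int(np.argmax(scores))]
```

A full treatment would infer the shifts of all replicates and the hidden profile jointly, as the abstract describes; the grid search over a single replicate is only the simplest stand-in for that inference.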
Use of autocorrelation scanning in DNA copy number analysis.
Zhang, Liangcai; Zhang, Li
2013-11-01
Data quality is a critical issue in the analyses of DNA copy number alterations obtained from microarrays. It is commonly assumed that copy number alteration data can be modeled as piecewise constant and the measurement errors of different probes are independent. However, these assumptions do not always hold in practice. In some published datasets, we find that measurement errors are highly correlated between probes that interrogate nearby genomic loci, and the piecewise-constant model does not fit the data well. The correlated errors cause problems in downstream analysis, leading to a large number of DNA segments falsely identified as having copy number gains and losses. We developed a simple tool, called autocorrelation scanning profile, to assess the dependence of measurement error between neighboring probes. Autocorrelation scanning profile can be used to check data quality and refine the analysis of DNA copy number data, which we demonstrate in some typical datasets. Contact: lzhangli@mdanderson.org. Supplementary data are available at Bioinformatics online.
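The dependence between neighboring probes described above can be illustrated with a short sketch. It computes the lag-1 autocorrelation of probe log-ratios in sliding windows along genomic order. This is an assumed, simplified reading of an "autocorrelation scanning profile"; the window size, step, and function names are hypothetical and are not taken from the published tool.

```python
import numpy as np

def lag1_autocorr(x):
    """Lag-1 sample autocorrelation of a 1-D array (0.0 if x is constant)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    if denom == 0:
        return 0.0
    return float(np.dot(x[:-1], x[1:]) / denom)

def autocorr_scan(log_ratios, window=50, step=25):
    """Lag-1 autocorrelation in sliding windows over probes sorted by position."""
    log_ratios = np.asarray(log_ratios, dtype=float)
    starts = range(0, len(log_ratios) - window + 1, step)
    return np.array([lag1_autocorr(log_ratios[s:s + window]) for s in starts])
```

Under independent measurement errors the profile should hover near zero; sustained positive values across windows would suggest correlated errors between nearby probes.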
Novel Harmonic Regularization Approach for Variable Selection in Cox's Proportional Hazards Model
Chu, Ge-Jin; Liang, Yong; Wang, Jia-Xuan
2014-01-01
Variable selection is an important issue in regression and a number of variable selection methods have been proposed involving nonconvex penalty functions. In this paper, we investigate a novel harmonic regularization method, which can approximate nonconvex Lq (1/2 < q < 1) regularizations, to select key risk factors in the Cox's proportional hazards model using microarray gene expression data. The harmonic regularization method can be efficiently solved using our proposed direct path seeking approach, which can produce solutions that closely approximate those for the convex loss function and the nonconvex regularization. Simulation results based on the artificial datasets and four real microarray gene expression datasets, such as real diffuse large B-cell lymphoma (DLBCL), the lung cancer, and the AML datasets, show that the harmonic regularization method can be more accurate for variable selection than existing Lasso series methods. PMID:25506389
Trayhurn, Paul; Denyer, Gareth
2012-01-01
Microarray datasets are a rich source of information in nutritional investigation. Targeted mining of microarray data following initial, non-biased bioinformatic analysis can provide key insight into specific genes and metabolic processes of interest. Microarrays from human adipocytes were examined to explore the effects of macrophage secretions on the expression of the G-protein-coupled receptor (GPR) genes that encode fatty acid receptors/sensors. Exposure of the adipocytes to macrophage-conditioned medium for 4 or 24 h had no effect on GPR40 and GPR43 expression, but there was a marked stimulation of GPR84 expression (receptor for medium-chain fatty acids), the mRNA level increasing 13·5-fold at 24 h relative to unconditioned medium. Importantly, expression of GPR120, which encodes an n-3 PUFA receptor/sensor, was strongly inhibited by the conditioned medium (15-fold decrease in mRNA at 24 h). Macrophage secretions have major effects on the expression of fatty acid receptor/sensor genes in human adipocytes, which may lead to an augmentation of the inflammatory response in adipose tissue in obesity. PMID:25191551
Derivation of an artificial gene to improve classification accuracy upon gene selection.
Seo, Minseok; Oh, Sejong
2012-02-01
Classification analysis has been developed continuously since 1936. This research field has advanced as a result of the development of classifiers such as KNN, ANN, and SVM, as well as through data preprocessing areas. Feature (gene) selection is required for very high dimensional data such as microarray data before classification work. The goal of feature selection is to choose a subset of informative features that reduces processing time and provides higher classification accuracy. In this study, we devised a method of artificial gene making (AGM) for microarray data to improve classification accuracy. Our artificial gene was derived from a whole microarray dataset, and combined with a result of gene selection for classification analysis. We experimentally confirmed a clear improvement of classification accuracy after inserting the artificial gene. Our artificial gene worked well for popular feature (gene) selection algorithms and classifiers. The proposed approach can be applied to any type of high dimensional dataset. Copyright © 2011 Elsevier Ltd. All rights reserved.
Alshamlan, Hala M; Badr, Ghada H; Alohali, Yousef A
2015-06-01
Naturally inspired evolutionary algorithms prove effective when used for solving feature selection and classification problems. Artificial Bee Colony (ABC) is a relatively new swarm intelligence method. In this paper, we propose a new hybrid gene selection method, namely the Genetic Bee Colony (GBC) algorithm. The proposed algorithm combines the use of a Genetic Algorithm (GA) with the Artificial Bee Colony (ABC) algorithm. The goal is to integrate the advantages of both algorithms. The proposed algorithm is applied to a microarray gene expression profile in order to select the most predictive and informative genes for cancer classification. In order to test the accuracy performance of the proposed algorithm, extensive experiments were conducted. Three binary microarray datasets are used, which include: colon, leukemia, and lung. In addition, another three multi-class microarray datasets are used, which are: SRBCT, lymphoma, and leukemia. Results of the GBC algorithm are compared with our recently proposed technique: mRMR when combined with the Artificial Bee Colony algorithm (mRMR-ABC). We also compared the combination of mRMR with GA (mRMR-GA) and Particle Swarm Optimization (mRMR-PSO) algorithms. In addition, we compared the GBC algorithm with other related algorithms that have been recently published in the literature, using all benchmark datasets. The GBC algorithm shows superior performance as it achieved the highest classification accuracy along with the lowest average number of selected genes. This proves that the GBC algorithm is a promising approach for solving the gene selection problem in both binary and multi-class cancer classification. Copyright © 2015 Elsevier Ltd. All rights reserved.
Robust gene selection methods using weighting schemes for microarray data analysis.
Kang, Suyeon; Song, Jongwoo
2017-09-02
A common task in microarray data analysis is to identify informative genes that are differentially expressed between two different states. Owing to the high-dimensional nature of microarray data, identification of significant genes has been essential in analyzing the data. However, the performances of many gene selection techniques are highly dependent on the experimental conditions, such as the presence of measurement error or a limited number of sample replicates. We have proposed new filter-based gene selection techniques, by applying a simple modification to significance analysis of microarrays (SAM). To prove the effectiveness of the proposed method, we considered a series of synthetic datasets with different noise levels and sample sizes along with two real datasets. The following findings were made. First, our proposed methods outperform conventional methods for all simulation set-ups. In particular, our methods are much better when the given data are noisy and sample size is small. They showed relatively robust performance regardless of noise level and sample size, whereas the performance of SAM became significantly worse as the noise level became high or sample size decreased. When sufficient sample replicates were available, SAM and our methods showed similar performance. Finally, our proposed methods are competitive with traditional methods in classification tasks for microarrays. The results of simulation study and real data analysis have demonstrated that our proposed methods are effective for detecting significant genes and classification tasks, especially when the given data are noisy or have few sample replicates. By employing weighting schemes, we can obtain robust and reliable results for microarray data analysis.
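For readers unfamiliar with SAM, the sketch below computes the standard SAM relative-difference statistic, d = (mean1 - mean2) / (s + s0), for every gene. It does not reproduce the weighting schemes proposed in the abstract above, which are not specified there. Using the median of the per-gene standard errors as the default s0 is a simplifying assumption (SAM itself tunes s0 from the data), and the function name is hypothetical.

```python
import numpy as np

def sam_statistic(x1, x2, s0=None):
    """SAM-style relative difference per gene.

    x1, x2: genes x samples arrays for the two states.
    s0: small positive constant that damps genes with tiny variance.
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n1, n2 = x1.shape[1], x2.shape[1]
    diff = x1.mean(axis=1) - x2.mean(axis=1)
    # Pooled sum of squared deviations across both groups.
    ss = ((x1 - x1.mean(axis=1, keepdims=True)) ** 2).sum(axis=1) \
        + ((x2 - x2.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    s = np.sqrt((1.0 / n1 + 1.0 / n2) * ss / (n1 + n2 - 2))
    if s0 is None:
        s0 = np.median(s)  # assumption: simple default, not SAM's tuned value
    return diff / (s + s0)
```

The s0 term is what keeps low-variance genes from dominating the ranking, which is the sensitivity to noise and sample size that the abstract says the proposed weighting is meant to reduce.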
The use of open source bioinformatics tools to dissect transcriptomic data.
Nitsche, Benjamin M; Ram, Arthur F J; Meyer, Vera
2012-01-01
Microarrays are a valuable technology to study fungal physiology on a transcriptomic level. Various microarray platforms are available comprising both single and two channel arrays. Despite different technologies, preprocessing of microarray data generally includes quality control, background correction, normalization, and summarization of probe level data. Subsequently, depending on the experimental design, diverse statistical analysis can be performed, including the identification of differentially expressed genes and the construction of gene coexpression networks. We describe how Bioconductor, a collection of open source and open development packages for the statistical programming language R, can be used for dissecting microarray data. We provide fundamental details that facilitate the process of getting started with R and Bioconductor. Using two publicly available microarray datasets from Aspergillus niger, we give detailed protocols on how to identify differentially expressed genes and how to construct gene coexpression networks.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Martin, Katherine J.; Patrick, Denis R.; Bissell, Mina J.
2008-10-20
One of the major tenets in breast cancer research is that early detection is vital for patient survival by increasing treatment options. To that end, we have previously used a novel unsupervised approach to identify a set of genes whose expression predicts prognosis of breast cancer patients. The predictive genes were selected in a well-defined three dimensional (3D) cell culture model of non-malignant human mammary epithelial cell morphogenesis as down-regulated during breast epithelial cell acinar formation and cell cycle arrest. Here we examine the ability of this gene signature (3D-signature) to predict prognosis in three independent breast cancer microarray datasets having 295, 286, and 118 samples, respectively. Our results show that the 3D-signature accurately predicts prognosis in three unrelated patient datasets. At 10 years, the probability of positive outcome was 52, 51, and 47 percent in the group with a poor-prognosis signature and 91, 75, and 71 percent in the group with a good-prognosis signature for the three datasets, respectively (Kaplan-Meier survival analysis, p<0.05). Hazard ratios for poor outcome were 5.5 (95% CI 3.0 to 12.2, p<0.0001), 2.4 (95% CI 1.6 to 3.6, p<0.0001) and 1.9 (95% CI 1.1 to 3.2, p = 0.016) and remained significant for the two larger datasets when corrected for estrogen receptor (ER) status. Hence the 3D-signature accurately predicts breast cancer outcome in both ER-positive and ER-negative tumors, though individual genes differed in their prognostic ability in the two subtypes. Genes that were prognostic in ER+ patients are AURKA, CEP55, RRM2, EPHA2, FGFBP1, and VRK1, while genes prognostic in ER- patients include ACTB, FOXM1 and SERPINE2 (Kaplan-Meier p<0.05). Multivariable Cox regression analysis in the largest dataset showed that the 3D-signature was a strong independent factor in predicting breast cancer outcome.
The 3D-signature accurately predicts breast cancer outcome across multiple datasets and holds prognostic value for both ER-positive and ER-negative breast cancer. The signature was selected using a novel biological approach and hence holds promise to represent the key biological processes of breast cancer.
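The survival probabilities quoted above come from Kaplan-Meier analysis. As a reminder of how such an estimate is formed, the following sketch implements the product-limit estimator for right-censored data. It is a generic illustration with a hypothetical function name and is not the authors' analysis code.

```python
import numpy as np

def kaplan_meier(times, events):
    """Product-limit survival estimate for right-censored follow-up data.

    times:  follow-up time for each patient.
    events: 1 if the event was observed at that time, 0 if censored.
    Returns the distinct event times and the survival probability just
    after each of them.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    event_times = np.unique(times[events == 1])
    surv, s = [], 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)                  # still under observation
        d = np.sum((times == t) & (events == 1))      # events at exactly t
        s *= 1.0 - d / at_risk
        surv.append(s)
    return event_times, np.array(surv)
```

Computing one curve per signature group (poor versus good prognosis) and reading each curve at 10 years would yield figures of the kind reported in the abstract.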
cluML: A markup language for clustering and cluster validity assessment of microarray data.
Bolshakova, Nadia; Cunningham, Pádraig
2005-01-01
cluML is a new markup language for microarray data clustering and cluster validity assessment. The XML-based format has been designed to address some of the limitations observed in traditional formats, such as inability to store multiple clustering (including biclustering) and validation results within a dataset. cluML is an effective tool to support biomedical knowledge representation in gene expression data analysis. Although cluML was developed for DNA microarray analysis applications, it can be effectively used for the representation of clustering and for the validation of other biomedical and physical data that has no limitations.
Accounting for one-channel depletion improves missing value imputation in 2-dye microarray data.
Ritz, Cecilia; Edén, Patrik
2008-01-19
For 2-dye microarray platforms, some missing values may arise from an un-measurably low RNA expression in one channel only. Information on such "one-channel depletion" is so far not included in algorithms for imputation of missing values. Calculating the mean deviation between imputed values and duplicate controls in five datasets, we show that KNN-based imputation gives a systematic bias of the imputed expression values of one-channel depleted spots. Evaluating the correction of this bias by cross-validation showed that the mean square deviation between imputed values and duplicates was reduced by up to 51%, depending on dataset. By including more information in the imputation step, we more accurately estimate missing expression values.
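The KNN-based imputation that the abstract takes as its baseline can be sketched as follows. This is a plain nearest-neighbour imputation that averages the k most similar complete genes; it does not include the one-channel depletion correction proposed in the paper, and the function name and the restriction to complete genes as neighbours are simplifying assumptions.

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill NaNs in a genes x arrays matrix from the k most similar genes."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    rows_with_gaps = np.argwhere(np.isnan(X).any(axis=1)).ravel()
    for i in rows_with_gaps:
        missing = np.isnan(X[i])
        dists = []
        for j in range(X.shape[0]):
            if j == i or np.isnan(X[j]).any():
                continue  # simplification: only complete genes act as neighbours
            diff = X[i, ~missing] - X[j, ~missing]
            dists.append((float(np.sqrt(np.mean(diff ** 2))), j))
        neighbours = [j for _, j in sorted(dists)[:k]]
        if neighbours:
            # Average the neighbours' values in the columns this gene lacks.
            filled[i, missing] = X[neighbours][:, missing].mean(axis=0)
    return filled
```

Because the neighbours are chosen without regard to why a value is missing, a spot that is missing due to very low signal in one channel is filled with a typical value, which is one way to picture the systematic bias the abstract reports.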
Increasing consistency of disease biomarker prediction across datasets.
Chikina, Maria D; Sealfon, Stuart C
2014-01-01
Microarray studies with human subjects often have limited sample sizes which hampers the ability to detect reliable biomarkers associated with disease and motivates the need to aggregate data across studies. However, human gene expression measurements may be influenced by many non-random factors such as genetics, sample preparations, and tissue heterogeneity. These factors can contribute to a lack of agreement among related studies, limiting the utility of their aggregation. We show that it is feasible to carry out an automatic correction of individual datasets to reduce the effect of such 'latent variables' (without prior knowledge of the variables) in such a way that datasets addressing the same condition show better agreement once each is corrected. We build our approach on the method of surrogate variable analysis but we demonstrate that the original algorithm is unsuitable for the analysis of human tissue samples that are mixtures of different cell types. We propose a modification to SVA that is crucial to obtaining the improvement in agreement that we observe. We develop our method on a compendium of multiple sclerosis data and verify it on an independent compendium of Parkinson's disease datasets. In both cases, we show that our method is able to improve agreement across varying study designs, platforms, and tissues. This approach has the potential for wide applicability to any field where lack of inter-study agreement has been a concern.
Haram, Kerstyn M; Peltier, Heidi J; Lu, Bin; Bhasin, Manoj; Otu, Hasan H; Choy, Bob; Regan, Meredith; Libermann, Towia A; Latham, Gary J; Sanda, Martin G; Arredouani, Mohamed S
2008-10-01
Translation of preclinical studies into effective human cancer therapy is hampered by the lack of defined molecular expression patterns in mouse models that correspond to the human counterpart. We sought to generate an open source TRAMP mouse microarray dataset and to use this array to identify differentially expressed genes from human prostate cancer (PCa) that have concordant expression in TRAMP tumors, and thereby represent lead targets for preclinical therapy development. We performed microarrays on total RNA extracted and amplified from eight TRAMP tumors and nine normal prostates. A subset of differentially expressed genes was validated by QRT-PCR. Differentially expressed TRAMP genes were analyzed for concordant expression in publicly available human prostate array datasets and a subset of resulting genes was analyzed by QRT-PCR. Cross-referencing differentially expressed TRAMP genes to public human prostate array datasets revealed 66 genes with concordant expression in mouse and human PCa; 56 between metastases and normal and 10 between primary tumor and normal tissues. Of these 10 genes, two, Sox4 and Tubb2a, were validated by QRT-PCR. Our analysis also revealed various dysregulations in major biologic pathways in the TRAMP prostates. We report a TRAMP microarray dataset of which a gene subset was validated by QRT-PCR with expression patterns consistent with previous gene-specific TRAMP studies. Concordance analysis between TRAMP and human PCa associated genes supports the utility of the model and suggests several novel molecular targets for preclinical therapy.
Fuzzy support vector machine: an efficient rule-based classification technique for microarrays.
Hajiloo, Mohsen; Rabiee, Hamid R; Anooshahpour, Mahdi
2013-01-01
The abundance of gene expression microarray data has led to the development of machine learning algorithms applicable for tackling disease diagnosis, disease prognosis, and treatment selection problems. However, these algorithms often produce classifiers with weaknesses in terms of accuracy, robustness, and interpretability. This paper introduces fuzzy support vector machine which is a learning algorithm based on combination of fuzzy classifiers and kernel machines for microarray classification. Experimental results on public leukemia, prostate, and colon cancer datasets show that fuzzy support vector machine applied in combination with filter or wrapper feature selection methods develops a robust model with higher accuracy than the conventional microarray classification models such as support vector machine, artificial neural network, decision trees, k nearest neighbors, and diagonal linear discriminant analysis. Furthermore, the interpretable rule-base inferred from fuzzy support vector machine helps extracting biological knowledge from microarray data. Fuzzy support vector machine as a new classification model with high generalization power, robustness, and good interpretability seems to be a promising tool for gene expression microarray classification.
Identification of ELF3 as an early transcriptional regulator of human urothelium.
Böck, Matthias; Hinley, Jennifer; Schmitt, Constanze; Wahlicht, Tom; Kramer, Stefan; Southgate, Jennifer
2014-02-15
Despite major advances in high-throughput and computational modelling techniques, understanding of the mechanisms regulating tissue specification and differentiation in higher eukaryotes, particularly man, remains limited. Microarray technology has been explored exhaustively in recent years and several standard approaches have been established to analyse the resultant datasets on a genome-wide scale. Gene expression time series offer a valuable opportunity to define temporal hierarchies and gain insight into the regulatory relationships of biological processes. However, unless datasets are exactly synchronous, time points cannot be compared directly. Here we present a data-driven analysis of regulatory elements from a microarray time series that tracked the differentiation of non-immortalised normal human urothelial (NHU) cells grown in culture. The datasets were obtained by harvesting differentiating and control cultures from finite bladder- and ureter-derived NHU cell lines at different time points using two previously validated, independent differentiation-inducing protocols. Due to the asynchronous nature of the data, a novel ranking analysis approach was adopted whereby we compared changes in the amplitude of experiment and control time series to identify common regulatory elements. Our approach offers a simple, fast and effective ranking method for genes that can be applied to other time series. The analysis identified ELF3 as a candidate transcriptional regulator involved in human urothelial cytodifferentiation. Differentiation-associated expression of ELF3 was confirmed in cell culture experiments and by immunohistochemical demonstration in situ. The importance of ELF3 in urothelial differentiation was verified by knockdown in NHU cells, which led to reduced expression of FOXA1 and GRHL3 transcription factors in response to PPARγ activation. 
The consequences of this were seen in the repressed expression of late/terminal differentiation-associated uroplakin 3a gene expression and in the compromised development and regeneration of urothelial barrier function. Copyright © 2014 Elsevier Inc. All rights reserved.
CrossQuery: a web tool for easy associative querying of transcriptome data.
Wagner, Toni U; Fischer, Andreas; Thoma, Eva C; Schartl, Manfred
2011-01-01
Enormous amounts of data are being generated by modern methods such as transcriptome or exome sequencing and microarray profiling. Primary analyses such as quality control, normalization, statistics and mapping are highly complex and need to be performed by specialists. Thereafter, results are handed back to biomedical researchers, who are then confronted with complicated data lists. For rather simple tasks like data filtering, sorting and cross-association there is a need for new tools which can be used by non-specialists. Here, we describe CrossQuery, a web tool that enables straightforward, simple-syntax queries to be executed on transcriptome sequencing and microarray datasets. We provide deep-sequencing data sets of stem cell lines derived from the model fish Medaka and microarray data of human endothelial cells. In the example datasets provided, mRNA expression levels, gene, transcript and sample identification numbers, GO-terms and gene descriptions can be freely correlated, filtered and sorted. Queries can be saved for later reuse and results can be exported to standard formats that allow copy-and-paste to all widespread data visualization tools such as Microsoft Excel. CrossQuery enables researchers to quickly and freely work with transcriptome and microarray data sets requiring only minimal computer skills. Furthermore, CrossQuery allows growing association of multiple datasets as long as at least one common point of correlated information, such as transcript identification numbers or GO-terms, is shared between samples. For advanced users, the object-oriented plug-in and event-driven code design of both server-side and client-side scripts allow easy addition of new features, data sources and data types.
Cross-platform normalization of microarray and RNA-seq data for machine learning applications
Thompson, Jeffrey A.; Tan, Jie
2016-01-01
Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exist in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data. We developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models constructed from legacy platforms. We evaluated TDM, as well as quantile normalization, nonparanormal transformation, and a simple log2 transformation, on both simulated and biological datasets of gene expression. Our evaluation included both supervised and unsupervised machine learning approaches. We found that TDM exhibited consistently strong performance across settings and that quantile normalization also performed well in many circumstances. We also provide a TDM package for the R programming language. PMID:26844019
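Two of the baseline transformations named in the abstract above, the simple log2 transformation and quantile normalization, can be sketched in a few lines. This is a generic illustration and not the TDM method or its R package; the function names, the pseudocount of 1 and the toy values are assumptions made only for this example.

```python
import math


def log2_transform(samples, pseudocount=1.0):
    """Apply log2(x + pseudocount) to every value (pseudocount is an assumed choice)."""
    return [[math.log2(v + pseudocount) for v in sample] for sample in samples]


def quantile_normalize(samples):
    """Map every sample (a list of per-gene values) onto one shared distribution.

    The shared distribution is the mean of the sorted values across samples.
    Ties are broken by gene order, which is a simplification.
    """
    n_genes = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    reference = [
        sum(s[rank] for s in sorted_samples) / len(samples)
        for rank in range(n_genes)
    ]
    normalized = []
    for s in samples:
        # Gene indices ordered from lowest to highest value in this sample.
        order = sorted(range(n_genes), key=lambda g: s[g])
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            out[gene] = reference[rank]
        normalized.append(out)
    return normalized


# Toy data: three samples, four genes (values are made up).
samples = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(samples)
logged = log2_transform([[0.0, 1.0, 3.0]])
```

After normalization every sample contains exactly the same set of values, while the rank order of genes within each sample is preserved; that is the property that makes quantile normalization a natural candidate for bringing data from different platforms onto a common scale.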
Weighted analysis of paired microarray experiments.
Kristiansson, Erik; Sjögren, Anders; Rudemo, Mats; Nerman, Olle
2005-01-01
In microarray experiments quality often varies, for example between samples and between arrays. The need for quality control is therefore strong. A statistical model and a corresponding analysis method is suggested for experiments with pairing, including designs with individuals observed before and after treatment and many experiments with two-colour spotted arrays. The model is of mixed type with some parameters estimated by an empirical Bayes method. Differences in quality are modelled by individual variances and correlations between repetitions. The method is applied to three real and several simulated datasets. Two of the real datasets are of Affymetrix type with patients profiled before and after treatment, and the third dataset is of two-colour spotted cDNA type. In all cases, the patients or arrays had different estimated variances, leading to distinctly unequal weights in the analysis. We suggest also plots which illustrate the variances and correlations that affect the weights computed by our analysis method. For simulated data the improvement relative to previously published methods without weighting is shown to be substantial.
Performance analysis of clustering techniques over microarray data: A case study
NASA Astrophysics Data System (ADS)
Dash, Rasmita; Misra, Bijan Bihari
2018-03-01
Handling big data is one of the major issues in statistical data analysis, and cluster analysis plays a vital role in dealing with large-scale data. Many clustering techniques exist, each with a different approach to cluster analysis, but it is difficult to predict which approach suits a particular dataset. To address this problem, a grading approach is applied across many clustering techniques to identify a stable one. Because the grading depends on the characteristics of the dataset as well as on the validity indices, a two-stage grading approach is implemented. In this study, the grading approach is applied to five clustering techniques: hybrid swarm based clustering (HSC), k-means, partitioning around medoids (PAM), vector quantization (VQ) and agglomerative nesting (AGNES). The experiments are conducted on five microarray datasets with seven validity indices. The grading approach's finding that a clustering technique is significant is also confirmed by the Nemenyi post-hoc test.
Welham, Nathan V.; Ling, Changying; Dawson, John A.; Kendziorski, Christina; Thibeault, Susan L.; Yamashita, Masaru
2015-01-01
The vocal fold (VF) mucosa confers elegant biomechanical function for voice production but is susceptible to scar formation following injury. Current understanding of VF wound healing is hindered by a paucity of data and is therefore often generalized from research conducted in skin and other mucosal systems. Here, using a previously validated rat injury model, expression microarray technology and an empirical Bayes analysis approach, we generated a VF-specific transcriptome dataset to better capture the system-level complexity of wound healing in this specialized tissue. We measured differential gene expression at 3, 14 and 60 days post-injury compared to experimentally naïve controls, pursued functional enrichment analyses to refine and add greater biological definition to the previously proposed temporal phases of VF wound healing, and validated the expression and localization of a subset of previously unidentified repair- and regeneration-related genes at the protein level. Our microarray dataset is a resource for the wider research community and has the potential to stimulate new hypotheses and avenues of investigation, improve biological and mechanistic insight, and accelerate the identification of novel therapeutic targets. PMID:25592437
SoFoCles: feature filtering for microarray classification based on gene ontology.
Papachristoudis, Georgios; Diplaris, Sotiris; Mitkas, Pericles A
2010-02-01
Marker gene selection has been an important research topic in the classification analysis of gene expression data. Current methods try to reduce the "curse of dimensionality" by using statistical intra-feature set calculations, or classifiers that are based on the given dataset. In this paper, we present SoFoCles, an interactive tool that enables semantic feature filtering in microarray classification problems with the use of external, well-defined knowledge retrieved from the Gene Ontology. The notion of semantic similarity is used to derive genes that are involved in the same biological path during the microarray experiment, by enriching a feature set that has been initially produced with legacy methods. Among its other functionalities, SoFoCles offers a large repository of semantic similarity methods that are used in order to derive feature sets and marker genes. The structure and functionality of the tool are discussed in detail, as well as its ability to improve classification accuracy. Through experimental evaluation, SoFoCles is shown to outperform other classification schemes in terms of classification accuracy in two real datasets using different semantic similarity computation approaches.
A study of metaheuristic algorithms for high dimensional feature selection on microarray data
NASA Astrophysics Data System (ADS)
Dankolo, Muhammad Nasiru; Radzi, Nor Haizan Mohamed; Sallehuddin, Roselina; Mustaffa, Noorfa Haszlinna
2017-11-01
Microarray systems enable experts to examine gene profiles at the molecular level using machine learning algorithms, which increases the potential for classification and diagnosis of many diseases at the gene expression level. However, numerous difficulties may affect the efficiency of machine learning algorithms, including the vast number of gene features contained in the original data. Many of these features may be unrelated to the intended analysis, so feature selection needs to be performed during data pre-processing. Many feature selection algorithms have been developed and applied to microarray data, including metaheuristic optimization algorithms. This paper discusses the application of metaheuristic algorithms to feature selection in microarray datasets. The study reveals that these algorithms yield interesting results with limited resources, thereby reducing the computational expense of machine learning algorithms.
Microarray Data Mining for Potential Selenium Targets in Chemoprevention of Prostate Cancer
ZHANG, HAITAO; DONG, YAN; ZHAO, HONGJUAN; BROOKS, JAMES D.; HAWTHORN, LESLEYANN; NOWAK, NORMA; MARSHALL, JAMES R.; GAO, ALLEN C.; IP, CLEMENT
2008-01-01
Background: A previous clinical trial showed that selenium supplementation significantly reduced the incidence of prostate cancer. We report here a bioinformatics approach to gain new insights into selenium molecular targets that might be relevant to prostate cancer chemoprevention. Materials and Methods: We first performed data mining analysis to identify genes which are consistently dysregulated in prostate cancer using published datasets from gene expression profiling of clinical prostate specimens. We then devised a method to systematically analyze three selenium microarray datasets from the LNCaP human prostate cancer cells, and to match the analysis to the cohort of genes implicated in prostate carcinogenesis. Moreover, we compared the selenium datasets with two datasets obtained from expression profiling of androgen-stimulated LNCaP cells. Results: We found that selenium reverses the expression of genes implicated in prostate carcinogenesis. In addition, we found that selenium could counteract the effect of androgen on the expression of a subset obtained from androgen-regulated genes. Conclusions: The above information provides us with a treasure of new clues to investigate the mechanism of selenium chemoprevention of prostate cancer. Furthermore, these selenium target genes could also serve as biomarkers in future clinical trials to gauge the efficacy of selenium intervention. PMID:18548127
Ni, Ming; Ye, Fuqiang; Zhu, Juanjuan; Li, Zongwei; Yang, Shuai; Yang, Bite; Han, Lu; Wu, Yongge; Chen, Ying; Li, Fei; Wang, Shengqi; Bo, Xiaochen
2014-12-01
Numerous public microarray datasets are valuable resources for the scientific community. Several online tools have made great strides in using these data by querying related datasets with users' own gene signatures or expression profiles. However, dataset annotation and the presentation of results still need to be improved. ExpTreeDB is a database that allows for queries on human and mouse microarray experiments from Gene Expression Omnibus with gene signatures or profiles. Compared with similar applications, ExpTreeDB pays more attention to dataset annotations and result visualization. We introduced a multiple-level annotation system to depict and organize original experiments. For example, a tamoxifen-treated cell line experiment is hierarchically annotated as 'agent→drug→estrogen receptor antagonist→tamoxifen'. Consequently, retrieved results are displayed as an interactive tree-structured graphic, which provides an overview of related experiments and might point users to key items of interest. The database is freely available at http://biotech.bmi.ac.cn/ExpTreeDB. The web site is implemented in Perl, PHP, R, MySQL and Apache. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
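The multiple-level annotation described above can be pictured as a nested tree in which each experiment hangs off the last term of its path. The sketch below is only an illustration of that idea and says nothing about how ExpTreeDB is actually implemented; the function names, the experiment identifiers (EXP-1 and so on) and the second leaf term are hypothetical. Only the tamoxifen path comes from the abstract.

```python
def add_path(tree, path, experiment_id):
    """Insert one experiment under a hierarchical annotation path."""
    node = tree
    for term in path:
        node = node.setdefault("children", {}).setdefault(term, {})
    node.setdefault("experiments", []).append(experiment_id)


def count(node):
    """Number of experiments at this node and everywhere below it."""
    own = len(node.get("experiments", []))
    return own + sum(count(c) for c in node.get("children", {}).values())


def render(tree, depth=0):
    """Return indented lines, one per annotation term, with experiment counts."""
    lines = []
    for term in sorted(tree.get("children", {})):
        child = tree["children"][term]
        lines.append("  " * depth + f"{term} ({count(child)})")
        lines.extend(render(child, depth + 1))
    return lines


tree = {}
# Path taken from the abstract; experiment IDs are hypothetical.
tamoxifen = ["agent", "drug", "estrogen receptor antagonist", "tamoxifen"]
add_path(tree, tamoxifen, "EXP-1")
add_path(tree, tamoxifen, "EXP-2")
# Hypothetical sibling leaf, included only to show branching.
add_path(tree, tamoxifen[:-1] + ["example-compound-B"], "EXP-3")

lines = render(tree)
print("\n".join(lines))
```

Keeping counts on every level is what lets a tree view summarize how many retrieved experiments fall under each broader category before the user drills down.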
The hepatic transcriptome of young suckling and aging intrauterine growth restricted male rats
Freije, William A.; Thamotharan, Shanthie; Lee, Regina; Shin, Bo-Chul; Devaskar, Sherin U.
2015-01-01
Intrauterine growth restriction leads to the development of adult onset obesity/metabolic syndrome, diabetes mellitus, cardiovascular disease, hypertension, stroke, dyslipidemia, and non-alcoholic fatty liver disease/steatohepatitis. Continued postnatal growth restriction has been shown to ameliorate many of these sequelae. To further our understanding of the mechanism of how intrauterine and early postnatal growth affects adult health we have employed Affymetrix microarray-based expression profiling to characterize hepatic gene expression of male offspring in a rat model of maternal nutrient restriction in early and late life. At day 21 of life (p21) combined intrauterine and postnatal calorie restriction treatment led to expression changes in circadian, metabolic, and insulin-like growth factor genes as part of a larger transcriptional response that encompasses 144 genes. Independent and controlled experiments at p21 confirm the early life circadian, metabolic, and growth factor perturbations. In contrast to the p21 transcriptional response, at day 450 of life (d450) only seven genes, largely uncharacterized, were differentially expressed. This lack of a transcriptional response identifies non-transcriptional mechanisms mediating the adult sequelae of intrauterine growth restriction. Independent experiments at d450 identify a circadian defect as well as validate expression changes to four of the genes identified by the microarray screen which have a novel association with growth restriction. Emerging from this rich dataset is a portrait of how the liver responds to growth restriction through circadian dysregulation, energy/substrate management, and growth factor modulation. PMID:25371150
The hepatic transcriptome of young suckling and aging intrauterine growth restricted male rats.
Freije, William A; Thamotharan, Shanthie; Lee, Regina; Shin, Bo-Chul; Devaskar, Sherin U
2015-04-01
Intrauterine growth restriction leads to the development of adult onset obesity/metabolic syndrome, diabetes mellitus, cardiovascular disease, hypertension, stroke, dyslipidemia, and non-alcoholic fatty liver disease/steatohepatitis. Continued postnatal growth restriction has been shown to ameliorate many of these sequelae. To further our understanding of the mechanism of how intrauterine and early postnatal growth affects adult health we have employed Affymetrix microarray-based expression profiling to characterize hepatic gene expression of male offspring in a rat model of maternal nutrient restriction in early and late life. At day 21 of life (p21) combined intrauterine and postnatal calorie restriction treatment led to expression changes in circadian, metabolic, and insulin-like growth factor genes as part of a larger transcriptional response that encompasses 144 genes. Independent and controlled experiments at p21 confirm the early life circadian, metabolic, and growth factor perturbations. In contrast to the p21 transcriptional response, at day 450 of life (d450) only seven genes, largely uncharacterized, were differentially expressed. This lack of a transcriptional response identifies non-transcriptional mechanisms mediating the adult sequelae of intrauterine growth restriction. Independent experiments at d450 identify a circadian defect as well as validate expression changes to four of the genes identified by the microarray screen which have a novel association with growth restriction. Emerging from this rich dataset is a portrait of how the liver responds to growth restriction through circadian dysregulation, energy/substrate management, and growth factor modulation. © 2014 Wiley Periodicals, Inc.
Kitchen, Robert R; Sabine, Vicky S; Simen, Arthur A; Dixon, J Michael; Bartlett, John M S; Sims, Andrew H
2011-12-01
Systematic processing noise, which includes batch effects, is very common in microarray experiments but is often ignored despite its potential to confound or compromise experimental results. Compromised results are most likely when re-analysing or integrating datasets from public repositories due to the different conditions under which each dataset is generated. To better understand the relative noise-contributions of various factors in experimental-design, we assessed several Illumina and Affymetrix datasets for technical variation between replicate hybridisations of Universal Human Reference (UHRR) and individual or pooled breast-tumour RNA. A varying degree of systematic noise was observed in each of the datasets; however, in all cases the relative amount of variation between standard control RNA replicates was found to be greatest at earlier points in the sample-preparation workflow. For example, 40.6% of the total variation in reported expressions was attributed to replicate extractions, compared to 13.9% due to amplification/labelling and 10.8% between replicate hybridisations. Deliberate probe-wise batch-correction methods were effective in reducing the magnitude of this variation, although the level of improvement was dependent on the sources of noise included in the model. Systematic noise introduced at the chip, run, and experiment levels of a combined Illumina dataset was found to be highly dependent upon the experimental design. Both UHRR and pools of RNA, which were derived from the samples of interest, modelled technical variation well although the pools were significantly better correlated (4% average improvement) and better emulated the effects of systematic noise, over all probes, than the UHRRs. The effect of this noise was not uniform over all probes, with low GC-content probes found to be more vulnerable to batch variation than probes with a higher GC-content.
The magnitude of systematic processing noise in a microarray experiment is variable across probes and experiments, however it is generally the case that procedures earlier in the sample-preparation workflow are liable to introduce the most noise. Careful experimental design is important to protect against noise, detailed meta-data should always be provided, and diagnostic procedures should be routinely performed prior to downstream analyses for the detection of bias in microarray studies.
2011-01-01
Background: Systematic processing noise, which includes batch effects, is very common in microarray experiments but is often ignored despite its potential to confound or compromise experimental results. Compromised results are most likely when re-analysing or integrating datasets from public repositories due to the different conditions under which each dataset is generated. To better understand the relative noise-contributions of various factors in experimental-design, we assessed several Illumina and Affymetrix datasets for technical variation between replicate hybridisations of Universal Human Reference (UHRR) and individual or pooled breast-tumour RNA. Results: A varying degree of systematic noise was observed in each of the datasets; however, in all cases the relative amount of variation between standard control RNA replicates was found to be greatest at earlier points in the sample-preparation workflow. For example, 40.6% of the total variation in reported expressions was attributed to replicate extractions, compared to 13.9% due to amplification/labelling and 10.8% between replicate hybridisations. Deliberate probe-wise batch-correction methods were effective in reducing the magnitude of this variation, although the level of improvement was dependent on the sources of noise included in the model. Systematic noise introduced at the chip, run, and experiment levels of a combined Illumina dataset was found to be highly dependent upon the experimental design. Both UHRR and pools of RNA, which were derived from the samples of interest, modelled technical variation well although the pools were significantly better correlated (4% average improvement) and better emulated the effects of systematic noise, over all probes, than the UHRRs. The effect of this noise was not uniform over all probes, with low GC-content probes found to be more vulnerable to batch variation than probes with a higher GC-content.
Conclusions The magnitude of systematic processing noise in a microarray experiment is variable across probes and experiments, however it is generally the case that procedures earlier in the sample-preparation workflow are liable to introduce the most noise. Careful experimental design is important to protect against noise, detailed meta-data should always be provided, and diagnostic procedures should be routinely performed prior to downstream analyses for the detection of bias in microarray studies. PMID:22133085
Huerta, Mario; Munyi, Marc; Expósito, David; Querol, Enric; Cedano, Juan
2014-06-15
The number of microarray experiments performed by scientific teams is growing exponentially. These microarray data could be useful for researchers around the world, but unfortunately they are underused. To fully exploit these data, it is necessary (i) to extract these data from a repository of high-throughput gene expression data like Gene Expression Omnibus (GEO) and (ii) to make the data from different microarrays comparable with tools that are easy for scientists to use. We have developed these two solutions in our server, implementing a database of microarray marker genes (Marker Genes Data Base). This database contains the marker genes of all GEO microarray datasets and it is updated monthly with the new microarrays from GEO. Thus, researchers can see whether the marker genes of their microarray are marker genes in other microarrays in the database, expanding the analysis of their microarray to the rest of the public microarrays. This solution helps not only to corroborate the conclusions regarding a researcher's microarray but also to identify the phenotype of different subsets of individuals under investigation, to frame the results with microarray experiments from other species, pathologies or tissues, to search for drugs that promote the transition between the studied phenotypes, to detect undesirable side effects of the treatment applied, etc. Thus, the researcher can quickly add relevant information to his/her studies from all of the previous analyses performed in other studies as long as they have been deposited in public repositories. Marker-gene database tool: http://ibb.uab.es/mgdb © The Author 2014. Published by Oxford University Press.
Wu, Wei-Sheng; Jhou, Meng-Jhun
2017-01-13
Missing value imputation is important for microarray data analyses because missing values in microarray data can significantly degrade the performance of downstream analyses. Although many microarray missing value imputation algorithms have been developed, an objective and comprehensive performance comparison framework is still lacking. To solve this problem, we previously proposed a framework which can perform a comprehensive performance comparison of different existing algorithms. The performance of a new algorithm can also be evaluated with this framework. However, constructing the framework is not an easy task for interested researchers. To save researchers' time and effort, here we present an easy-to-use web tool named MVIAeval (Missing Value Imputation Algorithm evaluator) which implements our performance comparison framework. MVIAeval provides a user-friendly interface allowing users to upload the R code of their new algorithm and select (i) the test datasets among 20 benchmark microarray (time series and non-time series) datasets, (ii) the compared algorithms among 12 existing algorithms, (iii) the performance indices from three existing ones, (iv) the comprehensive performance scores from two possible choices, and (v) the number of simulation runs. The comprehensive performance comparison results are then generated and shown as both figures and tables. MVIAeval is a useful tool for researchers to easily conduct a comprehensive and objective performance evaluation of their newly developed missing value imputation algorithm for microarray data or any data which can be represented in matrix form (e.g. NGS data or proteomics data). Thus, MVIAeval will greatly expedite the progress in the research of missing value imputation algorithms.
2013-09-01
sequence dataset. All procedures were performed by personnel in the IIMT UT Southwestern Genomics and Microarray Core using standard protocols. More... sequencing run, samples were demultiplexed using standard algorithms in the Genomics and Microarray Core and processed into individual sample Illumina single... Sequencing (RNA-Seq), using Illumina’s multiplexing mRNA-Seq to generate full sequence libraries from the poly-A tailed RNA to a read depth of 30
Alshamlan, Hala; Badr, Ghada; Alohali, Yousef
2015-01-01
An artificial bee colony (ABC) is a relatively recent swarm intelligence optimization approach. In this paper, we propose the first attempt at applying the ABC algorithm to the analysis of a microarray gene expression profile. In addition, we propose an innovative feature selection algorithm, minimum redundancy maximum relevance (mRMR), and combine it with an ABC algorithm, mRMR-ABC, to select informative genes from a microarray profile. The new approach is based on a support vector machine (SVM) algorithm to measure the classification accuracy for selected genes. We evaluate the performance of the proposed mRMR-ABC algorithm by conducting extensive experiments on six binary and multiclass gene expression microarray datasets. Furthermore, we compare our proposed mRMR-ABC algorithm with previously known techniques. We reimplemented two of these techniques for the sake of a fair comparison using the same parameters. These two techniques are mRMR when combined with a genetic algorithm (mRMR-GA) and mRMR when combined with a particle swarm optimization algorithm (mRMR-PSO). The experimental results prove that the proposed mRMR-ABC algorithm achieves accurate classification performance using a small number of predictive genes when tested using both datasets and compared to previously suggested methods. This shows that mRMR-ABC is a promising approach for solving gene selection and cancer classification problems. PMID:25961028
Kaushik, Abhinav; Ali, Shakir; Gupta, Dinesh
2017-01-01
Gene connection rewiring is an essential feature of gene network dynamics. Apart from its normal functional role, it may also lead to dysregulated functional states by disturbing pathway homeostasis. Very few computational tools measure rewiring within gene co-expression and its corresponding regulatory networks in order to identify and prioritize altered pathways which may or may not be differentially regulated. We have developed Altered Pathway Analyzer (APA), a microarray dataset analysis tool for identification and prioritization of altered pathways, including those which are differentially regulated by TFs, by quantifying rewired sub-network topology. Moreover, APA also helps in re-prioritization of APA shortlisted altered pathways enriched with context-specific genes. We performed APA analysis of simulated datasets and p53 status NCI-60 cell line microarray data to demonstrate potential of APA for identification of several case-specific altered pathways. APA analysis reveals several altered pathways not detected by other tools evaluated by us. APA analysis of unrelated prostate cancer datasets identifies sample-specific as well as conserved altered biological processes, mainly associated with lipid metabolism, cellular differentiation and proliferation. APA is designed as a cross platform tool which may be transparently customized to perform pathway analysis in different gene expression datasets. APA is freely available at http://bioinfo.icgeb.res.in/APA. PMID:28084397
Autoregressive-model-based missing value estimation for DNA microarray time series data.
Choong, Miew Keen; Charbit, Maurice; Yan, Hong
2009-01-01
Missing value estimation is important in DNA microarray data analysis. A number of algorithms have been developed to solve this problem, but they have several limitations. Most existing algorithms are not able to deal with the situation where a particular time point (column) of the data is missing entirely. In this paper, we present an autoregressive-model-based missing value estimation method (ARLSimpute) that takes into account the dynamic property of microarray temporal data and the local similarity structures in the data. ARLSimpute is especially effective for the situation where a particular time point contains many missing values or where the entire time point is missing. Experiment results suggest that our proposed algorithm is an accurate missing value estimator in comparison with other imputation methods on simulated as well as real microarray time series datasets.
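To make the idea of autoregressive imputation concrete, the following is a deliberately minimal first-order sketch: it fits a single coefficient a in x[t] = a * x[t-1] from complete profiles and uses it to fill a time point that is missing for every gene. It is not the ARLSimpute algorithm, which the abstract says also exploits local similarity structures; the model order, the pooling over genes, the function names and the toy data are all assumptions for illustration.

```python
def fit_ar1(complete_profiles):
    """Least-squares estimate of a in x[t] = a * x[t-1], pooled over profiles."""
    num = 0.0
    den = 0.0
    for profile in complete_profiles:
        for prev, curr in zip(profile, profile[1:]):
            num += prev * curr
            den += prev * prev
    return num / den


def impute_ar1(profile, coeff):
    """Fill None entries from the preceding time point.

    Consecutive gaps are filled forward from the previously imputed value.
    A missing first time point is left as None because it has no predecessor.
    """
    filled = list(profile)
    for t in range(1, len(filled)):
        if filled[t] is None and filled[t - 1] is not None:
            filled[t] = coeff * filled[t - 1]
    return filled


# Toy complete profiles (made-up values that halve at each step).
complete = [
    [8.0, 4.0, 2.0, 1.0],
    [16.0, 8.0, 4.0, 2.0],
]
coeff = fit_ar1(complete)

# The third time point is missing for every gene in this toy matrix.
incomplete = [
    [4.0, 2.0, None, 0.5],
    [12.0, 6.0, None, 1.5],
]
imputed = [impute_ar1(p, coeff) for p in incomplete]
```

Because the prediction for a gene depends only on that gene's own earlier values, an entirely missing column poses no special problem here, which is the situation the abstract highlights as difficult for methods that rely on other values in the same column.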
Challenges in projecting clustering results across gene expression-profiling datasets.
Lusa, Lara; McShane, Lisa M; Reid, James F; De Cecco, Loris; Ambrogi, Federico; Biganzoli, Elia; Gariboldi, Manuela; Pierotti, Marco A
2007-11-21
Gene expression microarray studies for several types of cancer have been reported to identify previously unknown subtypes of tumors. For breast cancer, a molecular classification consisting of five subtypes based on gene expression microarray data has been proposed. These subtypes have been reported to exist across several breast cancer microarray studies, and they have demonstrated some association with clinical outcome. A classification rule based on the method of centroids has been proposed for identifying the subtypes in new collections of breast cancer samples; the method is based on the similarity of the new profiles to the mean expression profile of the previously identified subtypes. Previously identified centroids of five breast cancer subtypes were used to assign 99 breast cancer samples, including a subset of 65 estrogen receptor-positive (ER+) samples, to five breast cancer subtypes based on microarray data for the samples. The effect of mean centering the genes (i.e., transforming the expression of each gene so that its mean expression is equal to 0) on subtype assignment by method of centroids was assessed. Further studies of the effect of mean centering and of class prevalence in the test set on the accuracy of method of centroids classifications of ER status were carried out using training and test sets for which ER status had been independently determined by ligand-binding assay and for which the proportion of ER+ and ER- samples were systematically varied. When all 99 samples were considered, mean centering before application of the method of centroids appeared to be helpful for correctly assigning samples to subtypes, as evidenced by the expression of genes that had previously been used as markers to identify the subtypes. However, when only the 65 ER+ samples were considered for classification, many samples appeared to be misclassified, as evidenced by an unexpected distribution of ER+ samples among the resultant subtypes. 
When genes were mean centered before classification of samples for ER status, the accuracy of the ER subgroup assignments was highly dependent on the proportion of ER+ samples in the test set; this effect of subtype prevalence was not seen when gene expression data were not mean centered. Simple corrections such as mean centering of genes aimed at microarray platform or batch effect correction can have undesirable consequences because patient population effects can easily be confused with these assay-related effects. Careful thought should be given to the comparability of the patient populations before attempting to force data comparability for purposes of assigning subtypes to independent subjects.
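The prevalence effect described above can be illustrated with a toy, single-gene sketch. The centroid values and expression values below are hypothetical and are not taken from the study; the point is only that mean centering a test set made up entirely of ER+ samples pushes about half of them toward the ER- centroid.

```python
from statistics import mean

def nearest_centroid(value, centroids):
    """Return the label whose centroid is closest to the value."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))

def classify(values, centroids, center=False):
    """Classify each value, optionally mean centering the test set first."""
    shift = mean(values) if center else 0.0
    return [nearest_centroid(v - shift, centroids) for v in values]

# Hypothetical centroids learned on a balanced, mean-centered training set.
centroids = {"ER+": 1.0, "ER-": -1.0}

# Hypothetical test set in which every sample is truly ER+.
all_er_pos = [0.7, 0.9, 1.1, 1.3]

print(classify(all_er_pos, centroids))               # no centering
print(classify(all_er_pos, centroids, center=True))  # centered on the test set
```

Without centering, all four samples are assigned to ER+; with centering, the test-set mean is subtracted, the values straddle zero, and two of the four are assigned to ER-.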
Cross-platform method for identifying candidate network biomarkers for prostate cancer.
Jin, G; Zhou, X; Cui, K; Zhang, X-S; Chen, L; Wong, S T C
2009-11-01
Discovering biomarkers using mass spectrometry (MS) and microarray expression profiles is a promising strategy in molecular diagnosis. Here, the authors proposed a new pipeline for biomarker discovery that integrates disease information for proteins and genes, expression profiles at both genomic and proteomic levels, and protein-protein interactions (PPIs) to discover high-confidence network biomarkers. Using this pipeline, a total of 474 molecules (genes and proteins) related to prostate cancer were identified and a prostate-cancer-related network (PCRN) was derived from the integrative information. Thus, a set of candidate network biomarkers was identified from multiple expression profiles comprising eight microarray datasets and one proteomics dataset. The network biomarkers with PPIs can accurately distinguish prostate cancer patients from normal subjects, potentially providing more reliable hits of biomarker candidates than conventional biomarker discovery methods.
Application of machine learning on brain cancer multiclass classification
NASA Astrophysics Data System (ADS)
Panca, V.; Rustam, Z.
2017-07-01
Classification of brain cancer is a problem of multiclass classification. One approach to solve this problem is by first transforming it into several binary problems. The microarray gene expression dataset has the two main characteristics of medical data: a very large number of features (genes) and only a small number of samples. The application of machine learning on microarray gene expression dataset mainly consists of two steps: feature selection and classification. In this paper, the features are selected using a method based on the support vector machine recursive feature elimination (SVM-RFE) principle, which is improved to solve multiclass classification, called multiple multiclass SVM-RFE. Instead of using only the selected features on a single classifier, this method combines the result of multiple classifiers. The features are divided into subsets and SVM-RFE is used on each subset. Then, the selected features on each subset are put on separate classifiers. This method enhances the feature selection ability of each single SVM-RFE. Twin support vector machine (TWSVM) is used as the method of the classifier to reduce computational complexity. While ordinary SVM finds a single optimum hyperplane, the main objective of Twin SVM is to find two non-parallel optimum hyperplanes. The experiment on the brain cancer microarray gene expression dataset shows this method could classify 71.4% of the overall test data correctly, using 100 and 1000 genes selected from the multiple multiclass SVM-RFE feature selection method. Furthermore, the per class results show that this method could classify data of normal and MD class with 100% accuracy.
Multi-task feature selection in microarray data by binary integer programming.
Lan, Liang; Vucetic, Slobodan
2013-12-20
A major challenge in microarray classification is that the number of features is typically orders of magnitude larger than the number of examples. In this paper, we propose a novel feature filter algorithm to select the feature subset with maximal discriminative power and minimal redundancy by solving a quadratic objective function with binary integer constraints. To improve the computational efficiency, the binary integer constraints are relaxed and a low-rank approximation to the quadratic term is applied. The proposed feature selection algorithm was extended to solve multi-task microarray classification problems. We compared the single-task version of the proposed feature selection algorithm with 9 existing feature selection methods on 4 benchmark microarray data sets. The empirical results show that the proposed method achieved the most accurate predictions overall. We also evaluated the multi-task version of the proposed algorithm on 8 multi-task microarray datasets. The multi-task feature selection algorithm resulted in significantly higher accuracy than when using the single-task feature selection methods.
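As a rough illustration of selecting features for high relevance and low redundancy under binary constraints, the sketch below brute-forces a tiny instance. The objective form (summed relevance minus a penalty on pairwise redundancy, with exactly `k` features chosen), the function name `select_features`, and all numbers are assumptions made for illustration; the paper's relaxation and low-rank approximation are not reproduced here.

```python
from itertools import combinations

def select_features(relevance, redundancy, k, penalty=1.0):
    """Exhaustively pick the k features maximizing relevance minus redundancy.

    Equivalent to a binary quadratic program with x_i in {0, 1} and
    sum(x_i) == k; only feasible for a handful of features.
    """
    def score(subset):
        gain = sum(relevance[i] for i in subset)
        overlap = sum(redundancy[i][j] for i, j in combinations(subset, 2))
        return gain - penalty * overlap

    return max(combinations(range(len(relevance)), k), key=score)

# Hypothetical scores: features 0 and 1 are both informative but nearly redundant.
relevance = [0.9, 0.85, 0.5, 0.4]
redundancy = [
    [0.0, 0.95, 0.1, 0.1],
    [0.95, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.0, 0.1],
    [0.1, 0.1, 0.1, 0.0],
]

best = select_features(relevance, redundancy, k=2)
print(best)
```

Here the two most relevant features are not chosen together, because their redundancy penalty outweighs the gain; feature 0 is paired with the less relevant but complementary feature 2.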
Noda, Masaru; Okayama, Hirokazu; Tachibana, Kazunoshin; Sakamoto, Wataru; Saito, Katsuharu; Thar Min, Aung Kyi; Ashizawa, Mai; Nakajima, Takahiro; Aoto, Keita; Momma, Tomoyuki; Katakura, Kyoko; Ohki, Shinji; Kono, Koji
2018-05-29
We aimed to discover glycosyltransferase gene (glycogene)-derived molecular subtypes of colorectal cancer (CRC) associated with patient outcomes. Transcriptomic and epigenomic datasets of non-tumor, pre-cancerous, cancerous tissues and cell lines with somatic mutations, mismatch repair status, clinicopathological and survival information, were assembled (n=4223) and glycogene profiles were analyzed. Immunohistochemistry for a glycogene, GALNT6, was conducted in adenoma and carcinoma specimens (n=403). The functional role and cell surface glycan profiles were further investigated by in vitro loss-of-function assays and lectin microarray analysis. We initially developed and validated a 15-glycogene signature that can identify a poor-prognostic subtype, which was closely related to deficient mismatch repair (dMMR) and GALNT6 downregulation. The association of decreased GALNT6 with dMMR was confirmed in multiple datasets of tumors and cell lines, and was further recapitulated by immunohistochemistry, where approximately 15% of tumors exhibited loss of GALNT6 protein. GALNT6 mRNA and protein were expressed in premalignant/preinvasive lesions but were subsequently downregulated in a subset of carcinomas, possibly through epigenetic silencing. Decreased GALNT6 was independently associated with poor prognosis in the immunohistochemistry cohort and an additional microarray meta-cohort, by multivariate analyses, and its discriminative power of survival was particularly remarkable in stage III patients. GALNT6 silencing in SW480 cells promoted invasion, migration, chemoresistance and increased cell surface expression of a cancer-associated truncated O-glycan, Tn-antigen. The 15-glycogene signature and the expression levels of GALNT6 mRNA and protein each serve as a novel prognostic biomarker, highlighting the role of dysregulated glycogenes in cancer-associated glycan synthesis and poor prognosis.
Dashtban, M; Balafar, Mohammadali
2017-03-01
Gene selection is a demanding task for microarray data analysis. The diverse complexity of different cancers makes this issue still challenging. In this study, a novel evolutionary method based on genetic algorithms and artificial intelligence is proposed to identify predictive genes for cancer classification. A filter method was first applied to reduce the dimensionality of feature space followed by employing an integer-coded genetic algorithm with dynamic-length genotype, intelligent parameter settings, and modified operators. The algorithmic behaviors including convergence trends, mutation and crossover rate changes, and running time were studied, conceptually discussed, and shown to be consistent with literature findings. Two well-known filter methods, Laplacian and Fisher score, were examined considering similarities, the quality of selected genes, and their influences on the evolutionary approach. Several statistical tests concerning choice of classifier, choice of dataset, and choice of filter method were performed, and they revealed some significant differences between the performance of different classifiers and filter methods over datasets. The proposed method was benchmarked upon five popular high-dimensional cancer datasets; for each, top explored genes were reported. Comparing the experimental results with several state-of-the-art methods revealed that the proposed method outperforms previous methods on the DLBCL dataset.
Recursive feature selection with significant variables of support vectors.
Tsai, Chen-An; Huang, Chien-Hsun; Chang, Ching-Wei; Chen, Chun-Houh
2012-01-01
The development of DNA microarrays allows researchers to screen thousands of genes simultaneously and also helps determine high- and low-expression level genes in normal and disease tissues. Selecting relevant genes for cancer classification is an important issue. Most gene selection methods use univariate ranking criteria and arbitrarily choose a threshold to select genes. However, the parameter setting may not be compatible with the selected classification algorithms. In this paper, we propose a new gene selection method (SVM-t) based on the use of t-statistics embedded in support vector machine. We compared the performance to two similar SVM-based methods: SVM recursive feature elimination (SVMRFE) and recursive support vector machine (RSVM). The three methods were compared based on extensive simulation experiments and analyses of two published microarray datasets. In the simulation experiments, we found that the proposed method is more robust in selecting informative genes than SVMRFE and RSVM and capable of attaining good classification performance when the variations of informative and noninformative genes are different. In the analysis of two microarray datasets, the proposed method yields better performance in identifying fewer genes with good prediction accuracy, compared to SVMRFE and RSVM.
Wong, Gerard; Leckie, Christopher; Kowalczyk, Adam
2012-01-15
Feature selection is a key concept in machine learning for microarray datasets, where features represented by probesets are typically several orders of magnitude larger than the available sample size. Computational tractability is a key challenge for feature selection algorithms in handling very high-dimensional datasets beyond a hundred thousand features, such as in datasets produced on single nucleotide polymorphism microarrays. In this article, we present a novel feature set reduction approach that enables scalable feature selection on datasets with hundreds of thousands of features and beyond. Our approach enables more efficient handling of higher resolution datasets to achieve better disease subtype classification of samples for potentially more accurate diagnosis and prognosis, which allows clinicians to make more informed decisions in regards to patient treatment options. We applied our feature set reduction approach to several publicly available cancer single nucleotide polymorphism (SNP) array datasets and evaluated its performance in terms of its multiclass predictive classification accuracy over different cancer subtypes, its speedup in execution as well as its scalability with respect to sample size and array resolution. Feature Set Reduction (FSR) was able to reduce the dimensions of an SNP array dataset by more than two orders of magnitude while achieving at least equal, and in most cases superior predictive classification performance over that achieved on features selected by existing feature selection methods alone. An examination of the biological relevance of frequently selected features from FSR-reduced feature sets revealed strong enrichment in association with cancer. FSR was implemented in MATLAB R2010b and is available at http://ww2.cs.mu.oz.au/~gwong/FSR.
Empirical evaluation of data normalization methods for molecular classification.
Huang, Huei-Chung; Qin, Li-Xuan
2018-01-01
Data artifacts due to variations in experimental handling are ubiquitous in microarray studies, and they can lead to biased and irreproducible findings. A popular approach to correct for such artifacts is through post hoc data adjustment such as data normalization. Statistical methods for data normalization have been developed and evaluated primarily for the discovery of individual molecular biomarkers. Their performance has rarely been studied for the development of multi-marker molecular classifiers, an increasingly important application of microarrays in the era of personalized medicine. In this study, we set out to evaluate the performance of three commonly used methods for data normalization in the context of molecular classification, using extensive simulations based on re-sampling from a unique pair of microRNA microarray datasets for the same set of samples. The data and code for our simulations are freely available as R packages at GitHub. In the presence of confounding handling effects, all three normalization methods tended to improve the accuracy of the classifier when evaluated on independent test data. The level of improvement and the relative performance among the normalization methods depended on the relative level of molecular signal, the distributional pattern of handling effects (e.g., location shift vs scale change), and the statistical method used for building the classifier. In addition, cross-validation was associated with biased estimation of classification accuracy in the over-optimistic direction for all three normalization methods. Normalization may improve the accuracy of molecular classification for data with confounding handling effects; however, it cannot circumvent the over-optimistic findings associated with cross-validation for assessing classification accuracy.
Cai, Guoshuai; Xiao, Feifei; Cheng, Chao; Li, Yafang; Amos, Christopher I.; Whitfield, Michael L.
2017-01-01
Background We analyzed and integrated transcriptome data from two large studies of lung adenocarcinomas on distinct populations. Our goal was to investigate the variable gene expression alterations between paired tumor-normal tissues and prospectively identify those alterations that can reliably predict lung disease related outcomes across populations. Methods We developed a mixed model that combined the paired tumor-normal RNA-seq from two populations. Alterations in gene expression common to both populations were detected and validated in two independent DNA microarray datasets. A 10-gene prognosis signature was developed through an l1-penalized regression approach and its prognostic value was evaluated in a third independent microarray cohort. Results Deregulation of apoptosis pathways and increased expression of cell cycle pathways were identified in tumors of both Caucasian and Asian lung adenocarcinoma patients. We demonstrate that a 10-gene biomarker panel can predict prognosis of lung adenocarcinoma in both Caucasians and Asians. Compared to low risk groups, high risk groups showed significantly shorter overall survival time (Caucasian patients data: HR = 3.63, p-value = 0.007; Asian patients data: HR = 3.25, p-value = 0.001). Conclusions This study uses a statistical framework to detect differentially expressed genes (DEGs) between paired tumor and normal tissues that considers variances among patients and ethnicities, which will aid in understanding the common genes and signalling pathways with the largest effect sizes in ethnically diverse cohorts. We propose multifunctional markers for distinguishing tumor from normal tissue and prognosis for both populations studied. PMID:28426704
Cross-Study Homogeneity of Psoriasis Gene Expression in Skin across a Large Expression Range
Kerkof, Keith; Timour, Martin; Russell, Christopher B.
2013-01-01
Background In psoriasis, only limited overlap between sets of genes identified as differentially expressed (psoriatic lesional vs. psoriatic non-lesional) was found using statistical and fold-change cut-offs. To provide a framework for utilizing prior psoriasis data sets we sought to understand the consistency of those sets. Methodology/Principal Findings Microarray expression profiling and qRT-PCR were used to characterize gene expression in PP and PN skin from psoriasis patients. cDNA (three new data sets) and cRNA hybridization (four existing data sets) data were compared using a common analysis pipeline. Agreement between data sets was assessed using varying qualitative and quantitative cut-offs to generate a DEG list in a source data set and then using other data sets to validate the list. Concordance increased from 67% across all probe sets to over 99% across more than 10,000 probe sets when statistical filters were employed. The fold-change behavior of individual genes tended to be consistent across the multiple data sets. We found that genes with <2-fold change values were quantitatively reproducible between pairs of data-sets. In a subset of transcripts with a role in inflammation changes detected by microarray were confirmed by qRT-PCR with high concordance. For transcripts with both PN and PP levels within the microarray dynamic range, microarray and qRT-PCR were quantitatively reproducible, including minimal fold-changes in IL13, TNFSF11, and TNFRSF11B and genes with >10-fold changes in either direction such as CHRM3, IL12B and IFNG. Conclusions/Significance Gene expression changes in psoriatic lesions were consistent across different studies, despite differences in patient selection, sample handling, and microarray platforms but between-study comparisons showed stronger agreement within than between platforms. We could use cut-offs as low as log10(ratio) = 0.1 (fold-change = 1.26), generating larger gene lists that validate on independent data sets. 
The reproducibility of PP signatures across data sets suggests that different sample sets can be productively compared. PMID:23308107
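The cut-off quoted above, log10(ratio) = 0.1 corresponding to a fold-change of 1.26, follows directly from the definition of the logarithm. A short check, with helper names chosen here purely for illustration:

```python
import math

def fold_change_from_log10(log_ratio):
    """Convert a log10 expression ratio to a fold-change."""
    return 10 ** log_ratio

def log10_from_fold_change(fold_change):
    """Convert a fold-change back to a log10 expression ratio."""
    return math.log10(fold_change)

print(round(fold_change_from_log10(0.1), 2))  # 1.26
print(round(log10_from_fold_change(2.0), 3))  # 0.301
```

For comparison, the conventional 2-fold cut-off corresponds to a log10 ratio of about 0.301, roughly three times the threshold used in the abstract.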
Comparative study of classification algorithms for immunosignaturing data
2012-01-01
Background High-throughput technologies such as DNA, RNA, protein, antibody and peptide microarrays are often used to examine differences across drug treatments, diseases, transgenic animals, and others. Typically one trains a classification system by gathering large amounts of probe-level data, selecting informative features, and classifies test samples using a small number of features. As new microarrays are invented, classification systems that worked well for other array types may not be ideal. Expression microarrays, arguably one of the most prevalent array types, have been used for years to help develop classification algorithms. Many biological assumptions are built into classifiers that were designed for these types of data. One of the more problematic is the assumption of independence, both at the probe level and again at the biological level. Probes for RNA transcripts are designed to bind single transcripts. At the biological level, many genes have dependencies across transcriptional pathways where co-regulation of transcriptional units may make many genes appear as being completely dependent. Thus, algorithms that perform well for gene expression data may not be suitable when other technologies with different binding characteristics exist. The immunosignaturing microarray is based on complex mixtures of antibodies binding to arrays of random sequence peptides. It relies on many-to-many binding of antibodies to the random sequence peptides. Each peptide can bind multiple antibodies and each antibody can bind multiple peptides. This technology has been shown to be highly reproducible and appears promising for diagnosing a variety of disease states. However, it is not clear what is the optimal classification algorithm for analyzing this new type of data. Results We characterized several classification algorithms to analyze immunosignaturing data. 
We selected several datasets that range from easy to difficult to classify, from simple monoclonal binding to complex binding patterns in asthma patients. We then classified the biological samples using 17 different classification algorithms. Using a wide variety of assessment criteria, we found ‘Naïve Bayes’ far more useful than other widely used methods due to its simplicity, robustness, speed and accuracy. Conclusions ‘Naïve Bayes’ algorithm appears to accommodate the complex patterns hidden within multilayered immunosignaturing microarray data due to its fundamental mathematical properties. PMID:22720696
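For readers unfamiliar with the classifier named above, the following is a minimal Gaussian Naive Bayes sketch in plain Python. It assumes continuous features modeled by one normal distribution per class and feature; the two-feature intensities and class labels are made up for illustration and have no connection to the immunosignaturing datasets in the study.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate a log prior plus per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # A small constant keeps the variance strictly positive.
        variances = [
            sum((x - m) ** 2 for x in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / len(rows)), means, variances)
    return model

def predict(model, row):
    """Return the class with the highest log posterior, treating features as independent."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
            for x, m, v in zip(row, means, variances)
        )
    return max(model, key=log_posterior)

# Hypothetical two-peptide intensities for six samples.
rows = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [3.0, 0.5], [3.2, 0.4], [2.8, 0.6]]
labels = ["healthy"] * 3 + ["disease"] * 3
model = fit_gaussian_nb(rows, labels)

print(predict(model, [1.1, 2.0]))
print(predict(model, [3.1, 0.5]))
```

The independence assumption is visible in `log_posterior`, where per-feature log likelihoods are simply summed.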
Trivedi, Prinal; Edwards, Jode W; Wang, Jelai; Gadbury, Gary L; Srinivasasainagendra, Vinodh; Zakharkin, Stanislav O; Kim, Kyoungmi; Mehta, Tapan; Brand, Jacob P L; Patki, Amit; Page, Grier P; Allison, David B
2005-04-06
Many efforts in microarray data analysis are focused on providing tools and methods for the qualitative analysis of microarray data. HDBStat! (High-Dimensional Biology-Statistics) is a software package designed for analysis of high dimensional biology data such as microarray data. It was initially developed for the analysis of microarray gene expression data, but it can also be used for some applications in proteomics and other aspects of genomics. HDBStat! provides statisticians and biologists a flexible and easy-to-use interface to analyze complex microarray data using a variety of methods for data preprocessing, quality control analysis and hypothesis testing. Results generated from data preprocessing methods, quality control analysis and hypothesis testing methods are output in the form of Excel CSV tables, graphs and an HTML report summarizing data analysis. HDBStat! is platform-independent software that is freely available to academic institutions and non-profit organizations. It can be downloaded from our website http://www.soph.uab.edu/ssg_content.asp?id=1164.
Consistency of biological networks inferred from microarray and sequencing data.
Vinciotti, Veronica; Wit, Ernst C; Jansen, Rick; de Geus, Eco J C N; Penninx, Brenda W J H; Boomsma, Dorret I; 't Hoen, Peter A C
2016-06-24
Sparse Gaussian graphical models are popular for inferring biological networks, such as gene regulatory networks. In this paper, we investigate the consistency of these models across different data platforms, such as microarray and next generation sequencing, on the basis of a rich dataset containing samples that are profiled under both techniques as well as a large set of independent samples. Our analysis shows that individual node variances can have a remarkable effect on the connectivity of the resulting network. Their inconsistency across platforms and the fact that the variability level of a node may not be linked to its regulatory role mean that failing to scale the data prior to the network analysis leads to networks that are not reproducible across different platforms and that may be misleading. Moreover, we show how the reproducibility of networks across different platforms is significantly higher if networks are summarised in terms of enrichment amongst functional groups of interest, such as pathways, rather than at the level of individual edges. Careful pre-processing of transcriptional data and summaries of networks beyond individual edges can improve the consistency of network inference across platforms. However, caution is needed at this stage in the (over)interpretation of gene regulatory networks inferred from biological data.
Plantier, Laurent; Renaud, Hélène; Respaud, Renaud; Marchand-Adam, Sylvain; Crestani, Bruno
2016-12-13
Heritable profibrotic differentiation of lung fibroblasts is a key mechanism of idiopathic pulmonary fibrosis (IPF). Its mechanisms are yet to be fully understood. In this study, individual data from four independent microarray studies comparing the transcriptome of fibroblasts cultured in vitro from normal (total n = 20) and IPF (total n = 20) human lung were compiled for meta-analysis following normalization to z-scores. One hundred and thirteen transcripts were upregulated and 115 were downregulated in IPF fibroblasts using the Significance Analysis of Microarrays algorithm with a false discovery rate of 5%. Downregulated genes were highly enriched for Gene Ontology and Kyoto Encyclopedia of Genes and Genomes (KEGG) functional classes related to inflammation and immunity such as Defense response to virus, Influenza A, tumor necrosis factor (TNF) mediated signaling pathway, interferon-inducible absent in melanoma2 (AIM2) inflammasome as well as Apoptosis. Although upregulated genes were not enriched for any functional class, select factors known to play key roles in lung fibrogenesis were overexpressed in IPF fibroblasts, most notably connective tissue growth factor (CTGF) and serum response factor (SRF), supporting their role as drivers of IPF. The full data table is available as a supplement.
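The z-score normalization mentioned above can be sketched as follows. The abstract does not say exactly how the z-scores were computed, so this example assumes one gene standardized across the samples of each study separately; the intensity values are hypothetical.

```python
from statistics import mean, stdev

def zscores(values):
    """Standardize values to mean 0 and sample standard deviation 1."""
    mu = mean(values)
    sd = stdev(values)
    return [(v - mu) / sd for v in values]

# Hypothetical intensities for one gene in two studies run on different scales.
study_a = [2.0, 4.0, 6.0]
study_b = [200.0, 400.0, 600.0]

print(zscores(study_a))
print(zscores(study_b))
```

After standardization, both studies yield the same values despite a 100-fold difference in raw scale, which is what makes pooling across studies possible.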
A Comparison Study for DNA Motif Modeling on Protein Binding Microarray.
Wong, Ka-Chun; Li, Yue; Peng, Chengbin; Wong, Hau-San
2016-01-01
Transcription factor binding sites (TFBSs) are relatively short (5-15 bp) and degenerate. Identifying them is a computationally challenging task. In particular, protein binding microarray (PBM) is a high-throughput platform that can measure the DNA binding preference of a protein in a comprehensive and unbiased manner; for instance, a typical PBM experiment can measure binding signal intensities of a protein to all possible DNA k-mers (k = 8∼10). Since proteins can often bind to DNA with different binding intensities, one of the major challenges is to build TFBS (also known as DNA motif) models which can fully capture the quantitative binding affinity data. To learn DNA motif models from the non-convex objective function landscape, several optimization methods are compared and applied to the PBM motif model building problem. In particular, representative methods from different optimization paradigms have been chosen for modeling performance comparison on hundreds of PBM datasets. The results suggest that the multimodal optimization methods are very effective for capturing the binding preference information from PBM data. In particular, we observe a general performance improvement if choosing di-nucleotide modeling over mono-nucleotide modeling. In addition, the models learned by the best-performing method are applied to two independent applications: PBM probe rotation testing and ChIP-Seq peak sequence prediction, demonstrating its biological applicability.
A measure of the signal-to-noise ratio of microarray samples and studies using gene correlations.
Venet, David; Detours, Vincent; Bersini, Hugues
2012-01-01
The quality of gene expression data can vary dramatically from platform to platform, study to study, and sample to sample. As reliable statistical analysis rests on reliable data, determining such quality is of the utmost importance. Quality measures to spot problematic samples exist, but they are platform-specific, and cannot be used to compare studies. As a proxy for quality, we propose a signal-to-noise ratio for microarray data, the "Signal-to-Noise Applied to Gene Expression Experiments", or SNAGEE. SNAGEE is based on the consistency of gene-gene correlations. We applied SNAGEE to a compendium of 80 large datasets on 37 platforms, for a total of 24,380 samples, and assessed the signal-to-noise ratio of studies and samples. This allowed us to discover serious issues with three studies. We show that signal-to-noise ratios of both studies and samples are linked to the statistical significance of the biological results. We also show that SNAGEE is an effective way to measure data quality for most types of gene expression studies, and that it often outperforms existing techniques. Furthermore, SNAGEE is platform-independent and does not require raw data files. The SNAGEE R package is available in Bioconductor.
Includes 1) list of genes in the STAT5b biomarker and 2) list of accession numbers for microarray datasets used in study. This dataset is associated with the following publication: Oshida, K., N. Vasani, D. Waxman, and C. Corton. Disruption of STAT5b-Regulated Sexual Dimorphism of the Liver Transcriptome by Diverse Factors Is a Common Event. PLoS ONE. Public Library of Science, San Francisco, CA, USA, 11(3): NA, (2016).
Inferring Boolean network states from partial information
2013-01-01
Networks of molecular interactions regulate key processes in living cells. Therefore, understanding their functionality is a high priority in advancing biological knowledge. Boolean networks are often used to describe cellular networks mathematically and are fitted to experimental datasets. The fitting often results in ambiguities since the interpretation of the measurements is not straightforward and since the data contain noise. In order to facilitate a more reliable mapping between datasets and Boolean networks, we develop an algorithm that infers network trajectories from a dataset distorted by noise. We analyze our algorithm theoretically and demonstrate its accuracy using simulation and microarray expression data. PMID:24006954
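To make the idea of recovering a Boolean network trajectory from noisy data concrete, here is a toy sketch. It is not the algorithm developed in the paper: the three-node update rules are invented, and the inference step is a brute-force search for the noise-free trajectory with the smallest Hamming distance to the observations.

```python
from itertools import product

def step(state):
    """Synchronous update of a hypothetical 3-node network (A, B, C)."""
    a, b, c = state
    return (1 - c, a, a & b)  # A' = NOT C, B' = A, C' = A AND B

def trajectory(start, length):
    """Noise-free sequence of states of the given length, starting from `start`."""
    states = [start]
    while len(states) < length:
        states.append(step(states[-1]))
    return states

def hamming(traj_a, traj_b):
    """Number of node values that differ between two equal-length trajectories."""
    return sum(x != y for sa, sb in zip(traj_a, traj_b) for x, y in zip(sa, sb))

def closest_trajectory(observed):
    """Try every initial state and keep the trajectory nearest to the data."""
    candidates = (trajectory(s, len(observed)) for s in product((0, 1), repeat=3))
    return min(candidates, key=lambda t: hamming(t, observed))

truth = trajectory((1, 0, 0), 4)
# Same trajectory with one measurement flipped (node B at the third time point).
noisy = [(1, 0, 0), (1, 1, 0), (1, 0, 1), (0, 1, 1)]

print(closest_trajectory(noisy) == truth)
```

Exhaustive search scales as 2^n in the number of nodes, so it is only practical for very small networks; it is used here because it keeps the example short and exact.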
Huang, Chien-Hung; Peng, Huai-Shun; Ng, Ka-Lok
2015-01-01
Many proteins are known to be associated with cancer diseases. It is quite often that their precise functional role in disease pathogenesis remains unclear. A strategy to gain a better understanding of the function of these proteins is to make use of a combination of different aspects of proteomics data types. In this study, we extended Aragues's method by employing the protein-protein interaction (PPI) data, domain-domain interaction (DDI) data, weighted domain frequency score (DFS), and cancer linker degree (CLD) data to predict cancer proteins. Performances were benchmarked based on three kinds of experiments as follows: (I) using individual algorithm, (II) combining algorithms, and (III) combining the same classification types of algorithms. When compared with Aragues's method, our proposed methods, that is, machine learning algorithm and voting with the majority, are significantly superior in all seven performance measures. We demonstrated the accuracy of the proposed method on two independent datasets. The best algorithm can achieve a hit ratio of 89.4% and 72.8% for lung cancer dataset and lung cancer microarray study, respectively. It is anticipated that the current research could help understand disease mechanisms and diagnosis. PMID:25866773
GTA: a game theoretic approach to identifying cancer subnetwork markers.
Farahmand, S; Goliaei, S; Ansari-Pour, N; Razaghi-Moghadam, Z
2016-03-01
The identification of genetic markers (e.g. genes, pathways and subnetworks) for cancer has been one of the most challenging research areas in recent years. A subset of these studies attempts to analyze genome-wide expression profiles to identify markers with high reliability and reusability across independent whole-transcriptome microarray datasets. Therefore, the functional relationships of genes are integrated with their expression data. However, for a more accurate representation of the functional relationships among genes, utilization of the protein-protein interaction network (PPIN) seems to be necessary. Herein, a novel game theoretic approach (GTA) is proposed for the identification of cancer subnetwork markers by integrating genome-wide expression profiles and PPIN. The GTA method was applied to three distinct whole-transcriptome breast cancer datasets to identify the subnetwork markers associated with metastasis. To evaluate the performance of our approach, the identified subnetwork markers were compared with gene-based, pathway-based and network-based markers. We show that GTA is not only capable of identifying robust metastatic markers, but also provides a higher classification performance. In addition, based on these GTA-based subnetworks, we identified a new bona fide candidate gene for breast cancer susceptibility.
Clustering approaches to identifying gene expression patterns from DNA microarray data.
Do, Jin Hwan; Choi, Dong-Kug
2008-04-30
Careful analysis is essential for making sense of the large amounts of gene expression data that microarrays produce. In this review we focus on clustering techniques. The biological rationale for this approach is that many co-expressed genes are co-regulated, and identifying co-expressed genes could aid in functional annotation of novel genes, de novo identification of transcription factor binding sites and elucidation of complex biological pathways. Co-expressed genes are usually identified in microarray experiments by clustering techniques. There are many such methods, and the results obtained even for the same datasets may vary considerably depending on the algorithms and dissimilarity metrics used, as well as on user-selectable parameters such as the desired number of clusters and initial values. Therefore, biologists who want to interpret microarray data should be aware of the weaknesses and strengths of the clustering methods used. In this review, we survey the basic principles of clustering of DNA microarray data, from crisp clustering algorithms such as hierarchical clustering, K-means and self-organizing maps, to complex clustering algorithms like fuzzy clustering.
Hybrid feature selection algorithm using symmetrical uncertainty and a harmony search algorithm
NASA Astrophysics Data System (ADS)
Salameh Shreem, Salam; Abdullah, Salwani; Nazri, Mohd Zakree Ahmad
2016-04-01
Microarray technology can be used as an efficient diagnostic system to recognise diseases such as tumours or to discriminate between different types of cancers in normal tissues. This technology has received increasing attention from the bioinformatics community because of its potential in designing powerful decision-making tools for cancer diagnosis. However, the presence of thousands or tens of thousands of genes affects the predictive accuracy of this technology from the perspective of classification. Thus, a key issue in microarray data is identifying or selecting the smallest possible set of genes from the input data that can achieve good predictive accuracy for classification. In this work, we propose a two-stage selection algorithm for gene selection problems in microarray data-sets called the symmetrical uncertainty filter and harmony search algorithm wrapper (SU-HSA). Experimental results show that the SU-HSA is better than HSA in isolation for all data-sets in terms of the accuracy and achieves a lower number of genes on 6 out of 10 instances. Furthermore, the comparison with state-of-the-art methods shows that our proposed approach is able to obtain 5 (out of 10) new best results in terms of the number of selected genes and competitive results in terms of the classification accuracy.
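The filter stage named in the abstract above, symmetrical uncertainty (SU), can be illustrated with a short sketch. It assumes expression values have already been discretized, and it uses the standard definition SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)). The function names and the `top_k` cut-off are illustrative assumptions, not taken from the SU-HSA paper, and the harmony search wrapper stage is not shown.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)); 0 means independent, 1 means fully dependent."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables are constant
    hxy = entropy(list(zip(x, y)))  # joint entropy H(X, Y)
    mutual_info = hx + hy - hxy
    return 2.0 * mutual_info / (hx + hy)


def su_filter(genes, labels, top_k):
    """Rank discretized genes by SU with the class labels and keep the top_k names.

    `genes` maps a gene name to its list of discretized expression values
    (hypothetical input layout, chosen for this sketch).
    """
    ranked = sorted(
        genes,
        key=lambda g: symmetrical_uncertainty(genes[g], labels),
        reverse=True,
    )
    return ranked[:top_k]
```

In a two-stage scheme of this kind, the genes returned by such a filter would be the reduced search space handed to the wrapper.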
Mutual information estimation reveals global associations between stimuli and biological processes
Suzuki, Taiji; Sugiyama, Masashi; Kanamori, Takafumi; Sese, Jun
2009-01-01
Background Although microarray gene expression analysis has become popular, it remains difficult to interpret the biological changes caused by stimuli or variation of conditions. Clustering genes and associating each group with biological functions are commonly used methods. However, such methods only detect partial changes within cell processes. Herein, we propose a method for discovering global changes within a cell by associating observed conditions of gene expression with gene functions. Results To elucidate the association, we introduce a novel feature selection method called Least-Squares Mutual Information (LSMI), which computes mutual information without density estimation, and therefore LSMI can detect nonlinear associations within a cell. We demonstrate the effectiveness of LSMI through comparison with existing methods. The results of the application to yeast microarray datasets reveal that non-natural stimuli affect various biological processes, whereas others show no significant relation to specific cell processes. Furthermore, we discover that biological processes can be categorized into four types according to the responses to various stimuli: DNA/RNA metabolism, gene expression, protein metabolism, and protein localization. Conclusion We proposed a novel feature selection method called LSMI, and applied LSMI to mining the association between conditions of yeast and biological processes through microarray datasets. In fact, LSMI allows us to elucidate the global organization of cellular process control. PMID:19208155
Yang, Mingxing; Li, Xiumin; Li, Zhibin; Ou, Zhimin; Liu, Ming; Liu, Suhuan; Li, Xuejun; Yang, Shuyu
2013-01-01
DNA microarray analysis is characterized by obtaining a large number of gene variables from a small number of observations. Cluster analysis is widely used to analyze DNA microarray data to make classification and diagnosis of disease. Because there are so many irrelevant and insignificant genes in a dataset, a feature selection approach must be employed in data analysis. The performance of cluster analysis of this high-throughput data depends on whether the feature selection approach chooses the most relevant genes associated with disease classes. Here we proposed a new method using multiple Orthogonal Partial Least Squares-Discriminant Analysis (mOPLS-DA) models and S-plots to select the most relevant genes to conduct three-class disease classification and prediction. We tested our method using Golub's leukemia microarray data. For three classes with subtypes, we proposed hierarchical orthogonal partial least squares-discriminant analysis (OPLS-DA) models and S-plots to select features for two main classes and their subtypes. For three classes in parallel, we employed three OPLS-DA models and S-plots to choose marker genes for each class. The power of feature selection to classify and predict three-class disease was evaluated using cluster analysis. Further, the general performance of our method was tested using four public datasets and compared with those of four other feature selection methods. The results revealed that our method effectively selected the most relevant features for disease classification and prediction, and its performance was better than that of the other methods.
Identifying novel glioma associated pathways based on systems biology level meta-analysis.
Hu, Yangfan; Li, Jinquan; Yan, Wenying; Chen, Jiajia; Li, Yin; Hu, Guang; Shen, Bairong
2013-01-01
Recent advances in microarray technology, spanning genomics, proteomics, and metabolomics, pose a great challenge for integrating these "-omics" data to analyze complex disease. Glioma is an extremely aggressive and lethal form of brain tumor, and thus the study of the molecular mechanisms underlying glioma remains very important. To date, most studies have focused on detecting the differentially expressed genes in glioma. However, meta-analysis for pathway analysis based on multiple microarray datasets has not been systematically pursued. In this study, we therefore developed a systems biology based approach by integrating three types of omics data to identify common pathways in glioma. First, a meta-analysis was performed to study the overlap of signatures at different levels based on the microarray gene expression data of glioma. Among these gene expression datasets, 12 pathways shared by four stages were found in the GeneGO database. Then, microRNA expression profiles and ChIP-seq data were integrated for further pathway enrichment analysis. As a result, we suggest that 5 of these pathways could serve as putative pathways in glioma. Among them, the pathway of TGF-beta-dependent induction of EMT via SMAD is of particular importance. Our results demonstrate that meta-analysis at the systems biology level provides a more useful approach to studying the molecular mechanisms of complex disease. The integration of different types of omics data, including gene expression microarrays, microRNA and ChIP-seq data, suggests some common pathways correlated with glioma. These findings will offer useful potential candidates for targeted therapeutic intervention of glioma.
Kirby, Ralph; Herron, Paul; Hoskisson, Paul
2011-02-01
Based on available genome sequences, Actinomycetales show significant gene synteny across a wide range of species and genera. In addition, many genera show varying degrees of complex morphological development. Using the presence of gene synteny as a basis, it is clear that an analysis of gene conservation across the Streptomyces and various other Actinomycetales will provide information on both the importance of genes and gene clusters and the evolution of morphogenesis in these bacteria. Genome sequencing, although becoming cheaper, is still relatively expensive for comparing large numbers of strains. Thus, a heterologous DNA/DNA microarray hybridization dataset based on a Streptomyces coelicolor microarray allows a cheaper and greater depth of analysis of gene conservation. This study, using both bioinformatical and microarray approaches, was able to classify genes previously identified as involved in morphogenesis in Streptomyces into various subgroups in terms of conservation across species and genera. This will allow the targeting of genes for further study based on their importance at the species level and at higher evolutionary levels.
Cross species analysis of microarray expression data
Lu, Yong; Huggins, Peter; Bar-Joseph, Ziv
2009-01-01
Motivation: Many biological systems operate in a similar manner across a large number of species or conditions. Cross-species analysis of sequence and interaction data is often applied to determine the function of new genes. In contrast to these static measurements, microarrays measure the dynamic, condition-specific response of complex biological systems. The recent exponential growth in microarray expression datasets allows researchers to combine expression experiments from multiple species to identify genes that are not only conserved in sequence but also operate in a similar way in the different species studied. Results: In this review we discuss the computational and technical challenges associated with these studies, the approaches that have been developed to address these challenges and the advantages of cross-species analysis of microarray data. We show how successful application of these methods leads to insights that cannot be obtained when analyzing data from a single species. We also highlight current open problems and discuss possible ways to address them. Contact: zivbj@cs.cmu.edu PMID:19357096
Clustering gene expression data based on predicted differential effects of GV interaction.
Pan, Hai-Yan; Zhu, Jun; Han, Dan-Fu
2005-02-01
Microarrays have become a popular biotechnology in biological and medical research. However, systematic and stochastic variability in microarray data is expected and unavoidable, so the raw measurements carry inherent "noise" within microarray experiments. Currently, logarithmic ratios are usually analyzed directly by various clustering methods, which may introduce biased interpretation in identifying groups of genes or samples. In this paper, a statistical method based on mixed model approaches was proposed for microarray data cluster analysis. The underlying rationale of this method is to partition the observed total gene expression level into variations caused by different factors using an ANOVA model, and to predict the differential effects of GV (gene by variety) interaction using the adjusted unbiased prediction (AUP) method. The predicted GV interaction effects can then be used as the inputs of cluster analysis. We illustrated the application of our method with a gene expression dataset and elucidated the utility of our approach using an external validation.
Empirical evaluation of data normalization methods for molecular classification
Huang, Huei-Chung
2018-01-01
Background Data artifacts due to variations in experimental handling are ubiquitous in microarray studies, and they can lead to biased and irreproducible findings. A popular approach to correct for such artifacts is through post hoc data adjustment such as data normalization. Statistical methods for data normalization have been developed and evaluated primarily for the discovery of individual molecular biomarkers. Their performance has rarely been studied for the development of multi-marker molecular classifiers, an increasingly important application of microarrays in the era of personalized medicine. Methods In this study, we set out to evaluate the performance of three commonly used methods for data normalization in the context of molecular classification, using extensive simulations based on re-sampling from a unique pair of microRNA microarray datasets for the same set of samples. The data and code for our simulations are freely available as R packages at GitHub. Results In the presence of confounding handling effects, all three normalization methods tended to improve the accuracy of the classifier when evaluated on independent test data. The level of improvement and the relative performance among the normalization methods depended on the relative level of molecular signal, the distributional pattern of handling effects (e.g., location shift vs scale change), and the statistical method used for building the classifier. In addition, cross-validation was associated with biased estimation of classification accuracy in the over-optimistic direction for all three normalization methods. Conclusion Normalization may improve the accuracy of molecular classification for data with confounding handling effects; however, it cannot circumvent the over-optimistic findings associated with cross-validation for assessing classification accuracy. PMID:29666754
Identification of significant features by the Global Mean Rank test.
Klammer, Martin; Dybowski, J Nikolaj; Hoffmann, Daniel; Schaab, Christoph
2014-01-01
With the introduction of omics-technologies such as transcriptomics and proteomics, numerous methods for the reliable identification of significantly regulated features (genes, proteins, etc.) have been developed. Experimental practice requires these tests to successfully deal with conditions such as small numbers of replicates, missing values, non-normally distributed expression levels, and non-identical distributions of features. With the MeanRank test we aimed at developing a test that performs robustly under these conditions, while favorably scaling with the number of replicates. The test proposed here is a global one-sample location test, which is based on the mean ranks across replicates, and internally estimates and controls the false discovery rate. Furthermore, missing data is accounted for without the need of imputation. In extensive simulations comparing MeanRank to other frequently used methods, we found that it performs well with small and large numbers of replicates, feature dependent variance between replicates, and variable regulation across features on simulation data and a recent two-color microarray spike-in dataset. The tests were then used to identify significant changes in the phosphoproteomes of cancer cells induced by the kinase inhibitors erlotinib and 3-MB-PP1 in two independently published mass spectrometry-based studies. MeanRank outperformed the other global rank-based methods applied in this study. Compared to the popular Significance Analysis of Microarrays and Linear Models for Microarray methods, MeanRank performed similar or better. Furthermore, MeanRank exhibits more consistent behavior regarding the degree of regulation and is robust against the choice of preprocessing methods. MeanRank does not require any imputation of missing values, is easy to understand, and yields results that are easy to interpret. The software implementing the algorithm is freely available for academic and commercial use.
Jong, Victor L; Novianti, Putri W; Roes, Kit C B; Eijkemans, Marinus J C
2014-12-01
The literature shows that classifiers perform differently across datasets and that correlations within datasets affect the performance of classifiers. The question that arises is whether the correlation structure within datasets differs significantly across diseases. In this study, we evaluated the homogeneity of correlation structures within and between datasets of six etiological disease categories: inflammatory, immune, infectious, degenerative, hereditary and acute myeloid leukemia (AML). We also assessed the effect of two filtering methods, detection call filtering and variance filtering, on correlation structures. We downloaded microarray datasets from ArrayExpress for experiments meeting predefined criteria and ended up with 12 datasets for non-cancerous diseases and six for AML. The datasets were preprocessed by a common procedure incorporating platform-specific recommendations and the two filtering methods mentioned above. Homogeneity of correlation matrices between and within datasets of etiological diseases was assessed using Box's M statistic on permuted samples. We found that correlation structures differ significantly between datasets of the same and/or different etiological disease categories and that variance filtering eliminates more uncorrelated probesets than detection call filtering and thus renders the data highly correlated.
This file contains a link for Gene Expression Omnibus and the GSE designations for the publicly available gene expression data used in the study and reflected in Figures 6 and 7 for the Das et al., 2016 paper. This dataset is associated with the following publication: Das, K., C. Wood, M. Lin, A.A. Starkov, C. Lau, K.B. Wallace, C. Corton, and B. Abbott. Perfluoroalkyl acids-induced liver steatosis: Effects on genes controlling lipid homeostasis. TOXICOLOGY. Elsevier Science Ltd, New York, NY, USA, 378: 32-52, (2017).
Muller, Julius; Parizotto, Eneida; Antrobus, Richard; Francis, James; Bunce, Campbell; Stranks, Amanda; Nichols, Marshall; McClain, Micah; Hill, Adrian V S; Ramasamy, Adaikalavan; Gilbert, Sarah C
2017-06-08
Influenza challenge trials are important for vaccine efficacy testing. Currently, disease severity is determined by self-reported scores to a list of symptoms which can be highly subjective. A more objective measure would allow for improved data analysis. Twenty-one volunteers participated in an influenza challenge trial. We calculated the daily sum of scores (DSS) for a list of 16 influenza symptoms. Whole blood collected at baseline and 24, 48, 72 and 96 h post challenge was profiled on Illumina HT12v4 microarrays. Changes in gene expression most strongly correlated with DSS were selected to train a Random Forest model and tested on two independent test sets consisting of 41 individuals profiled on a different microarray platform and 33 volunteers assayed by qRT-PCR. 1456 probes are significantly associated with DSS at 1% false discovery rate. We selected 19 genes with the largest fold change to train a random forest model. We observed good concordance between predicted and actual scores in the first test set (r = 0.57; RMSE = -16.1%) with the greatest agreement achieved on samples collected approximately 72 h post challenge. Therefore, we assayed samples collected at baseline and 72 h post challenge in the second test set by qRT-PCR and observed good concordance (r = 0.81; RMSE = -36.1%). We developed a 19-gene qRT-PCR panel to predict DSS, validated on two independent datasets. A transcriptomics based panel could provide a more objective measure of symptom scoring in future influenza challenge studies. Trial registration Samples were obtained from a clinical trial with the ClinicalTrials.gov Identifier: NCT02014870, first registered on December 5, 2013.
Ludovini, Vienna; Bianconi, Fortunato; Siggillino, Annamaria; Piobbico, Danilo; Vannucci, Jacopo; Metro, Giulio; Chiari, Rita; Bellezza, Guido; Puma, Francesco; Della Fazia, Maria Agnese; Servillo, Giuseppe; Crinò, Lucio
2016-05-24
Risk assessment and treatment choice remain a challenge in early non-small-cell lung cancer (NSCLC). The aim of this study was to identify novel genes involved in the risk of early relapse (ER) compared to no relapse (NR) in resected lung adenocarcinoma (AD) patients using a combination of high throughput technology and computational analysis. We identified 18 patients (n.13 NR and n.5 ER) with stage I AD. Frozen samples of patients in ER, NR and corresponding normal lung (NL) were subjected to microarray technology and quantitative-PCR (Q-PCR). A gene network computational analysis was performed to select predictive genes. An independent set of 79 stage I AD samples was used to validate selected genes by Q-PCR. From microarray analysis we selected 50 genes, using the fold change ratio of ER versus NR. They were validated both in pool and individually in patient samples (ER and NR) by Q-PCR. Fourteen increased and 25 decreased genes showed concordance between the two methods. They were used to perform a computational gene network analysis that identified 4 increased (HOXA10, CLCA2, AKR1B10, FABP3) and 6 decreased (SCGB1A1, PGC, TFF1, PSCA, SPRR1B and PRSS1) genes. Moreover, in an independent dataset of AD samples, we showed that both high FABP3 expression and low SCGB1A1 expression were associated with a worse disease-free survival (DFS). Our results indicate that it is possible to define, through gene expression and computational analysis, a characteristic gene profile of patients with an increased risk of relapse that may become a tool for patient selection for adjuvant therapy.
An efficient pseudomedian filter for tiling microrrays.
Royce, Thomas E; Carriero, Nicholas J; Gerstein, Mark B
2007-06-07
Tiling microarrays are becoming an essential technology in the functional genomics toolbox. They have been applied to the tasks of novel transcript identification, elucidation of transcription factor binding sites, detection of methylated DNA and several other applications in several model organisms. These experiments are being conducted at increasingly finer resolutions as the microarray technology enjoys increasingly greater feature densities. The increased densities naturally lead to increased data analysis requirements. Specifically, the most widely employed algorithm for tiling array analysis involves smoothing observed signals by computing pseudomedians within sliding windows, an O(n^2 log n) calculation in each window. This poor time complexity is an issue for tiling array analysis and could prove to be a real bottleneck as tiling microarray experiments become grander in scope and finer in resolution. We therefore implemented Monahan's HLQEST algorithm that reduces the runtime complexity for computing the pseudomedian of n numbers to O(n log n) from O(n^2 log n). For a representative tiling microarray dataset, this modification reduced the smoothing procedure's runtime by nearly 90%. We then leveraged the fact that elements within sliding windows remain largely unchanged in overlapping windows (as one slides across genomic space) to further reduce computation by an additional 43%. This was achieved by the application of skip lists to maintaining a sorted list of values from window to window. This sorted list could be maintained with simple O(log n) inserts and deletes. We illustrate the favorable scaling properties of our algorithms with both time complexity analysis and benchmarking on synthetic datasets. Tiling microarray analyses that rely upon a sliding window pseudomedian calculation can require many hours of computation. We have eased this requirement significantly by implementing efficient algorithms that scale well with genomic feature density.
This result not only speeds the current standard analyses, but also makes possible ones where many iterations of the filter may be required, such as might be required in a bootstrap or parameter estimation setting. Source code and executables are available at http://tiling.gersteinlab.org/pseudomedian/.
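For orientation, the baseline that the abstract above improves upon can be sketched as follows. The pseudomedian is taken here as the Hodges-Lehmann estimate, the median of all pairwise averages including each value paired with itself (that pairing convention is an assumption of this sketch). The window is defined by probe index rather than genomic distance, which is a simplification, and `half_width` is a hypothetical parameter. This is the naive O(n^2 log n) per-window approach, not Monahan's HLQEST algorithm or the skip-list optimization.

```python
from statistics import median


def pseudomedian(values):
    """Hodges-Lehmann estimate: median of all pairwise averages with i <= j."""
    n = len(values)
    walsh = [(values[i] + values[j]) / 2.0 for i in range(n) for j in range(i, n)]
    # Sorting n(n+1)/2 averages inside median() gives the O(n^2 log n) cost.
    return median(walsh)


def smooth(signal, half_width):
    """Replace each probe's signal by the pseudomedian of its sliding window.

    The window spans `half_width` probes on each side and is truncated at
    the ends of the signal.
    """
    smoothed = []
    for k in range(len(signal)):
        lo = max(0, k - half_width)
        hi = min(len(signal), k + half_width + 1)
        smoothed.append(pseudomedian(signal[lo:hi]))
    return smoothed
```

Because consecutive windows share most of their elements, each call to `pseudomedian` here repeats work, which is the redundancy the abstract's sorted-list approach targets.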
An efficient pseudomedian filter for tiling microrrays
Royce, Thomas E; Carriero, Nicholas J; Gerstein, Mark B
2007-01-01
Background Tiling microarrays are becoming an essential technology in the functional genomics toolbox. They have been applied to the tasks of novel transcript identification, elucidation of transcription factor binding sites, detection of methylated DNA and several other applications in several model organisms. These experiments are being conducted at increasingly finer resolutions as the microarray technology enjoys increasingly greater feature densities. The increased densities naturally lead to increased data analysis requirements. Specifically, the most widely employed algorithm for tiling array analysis involves smoothing observed signals by computing pseudomedians within sliding windows, an O(n^2 log n) calculation in each window. This poor time complexity is an issue for tiling array analysis and could prove to be a real bottleneck as tiling microarray experiments become grander in scope and finer in resolution. Results We therefore implemented Monahan's HLQEST algorithm that reduces the runtime complexity for computing the pseudomedian of n numbers to O(n log n) from O(n^2 log n). For a representative tiling microarray dataset, this modification reduced the smoothing procedure's runtime by nearly 90%. We then leveraged the fact that elements within sliding windows remain largely unchanged in overlapping windows (as one slides across genomic space) to further reduce computation by an additional 43%. This was achieved by the application of skip lists to maintaining a sorted list of values from window to window. This sorted list could be maintained with simple O(log n) inserts and deletes. We illustrate the favorable scaling properties of our algorithms with both time complexity analysis and benchmarking on synthetic datasets. Conclusion Tiling microarray analyses that rely upon a sliding window pseudomedian calculation can require many hours of computation.
We have eased this requirement significantly by implementing efficient algorithms that scale well with genomic feature density. This result not only speeds the current standard analyses, but also makes possible ones where many iterations of the filter may be required, such as might be required in a bootstrap or parameter estimation setting. Source code and executables are available at . PMID:17555595
A-MADMAN: Annotation-based microarray data meta-analysis tool
Bisognin, Andrea; Coppe, Alessandro; Ferrari, Francesco; Risso, Davide; Romualdi, Chiara; Bicciato, Silvio; Bortoluzzi, Stefania
2009-01-01
Background Publicly available datasets of microarray gene expression signals represent an unprecedented opportunity for extracting genomic relevant information and validating biological hypotheses. However, the exploitation of this exceptionally rich mine of information is still hampered by the lack of appropriate computational tools, able to overcome the critical issues raised by meta-analysis. Results This work presents A-MADMAN, an open source web application which allows the retrieval, annotation, organization and meta-analysis of gene expression datasets obtained from Gene Expression Omnibus. A-MADMAN addresses and resolves several open issues in the meta-analysis of gene expression data. Conclusion A-MADMAN allows i) the batch retrieval from Gene Expression Omnibus and the local organization of raw data files and of any related meta-information, ii) the re-annotation of samples to fix incomplete, or otherwise inadequate, metadata and to create user-defined batches of data, iii) the integrative analysis of data obtained from different Affymetrix platforms through custom chip definition files and meta-normalization. Software and documentation are available on-line at . PMID:19563634
Mazzarelli, Joan M; Brestelli, John; Gorski, Regina K; Liu, Junmin; Manduchi, Elisabetta; Pinney, Deborah F; Schug, Jonathan; White, Peter; Kaestner, Klaus H; Stoeckert, Christian J
2007-01-01
EPConDB (http://www.cbil.upenn.edu/EPConDB) is a public web site that supports research in diabetes, pancreatic development and beta-cell function by providing information about genes expressed in cells of the pancreas. EPConDB displays expression profiles for individual genes and information about transcripts, promoter elements and transcription factor binding sites. Gene expression results are obtained from studies examining tissue expression, pancreatic development and growth, differentiation of insulin-producing cells, islet or beta-cell injury, and genetic models of impaired beta-cell function. The expression datasets are derived using different microarray platforms, including the BCBC PancChips and Affymetrix gene expression arrays. Other datasets include semi-quantitative RT-PCR and MPSS expression studies. For selected microarray studies, lists of differentially expressed genes, derived from PaGE analysis, are displayed on the site. EPConDB provides database queries and tools to examine the relationship between a gene, its transcriptional regulation, protein function and expression in pancreatic tissues.
Meyer, Patrick E; Lafitte, Frédéric; Bontempi, Gianluca
2008-10-29
This paper presents the R/Bioconductor package minet (version 1.1.6), which provides a set of functions to infer mutual information networks from a dataset. Once fed with a microarray dataset, the package returns a network where nodes denote genes, edges model statistical dependencies between genes and the weight of an edge quantifies the statistical evidence of a specific (e.g. transcriptional) gene-to-gene interaction. Four different entropy estimators are made available in the package minet (empirical, Miller-Madow, Schurmann-Grassberger and shrink) as well as four different inference methods, namely relevance networks, ARACNE, CLR and MRNET. Also, the package integrates accuracy assessment tools, like F-scores, PR-curves and ROC-curves, in order to compare the inferred network with a reference one. The package minet provides a series of tools for inferring transcriptional networks from microarray data. It is freely available from the Comprehensive R Archive Network (CRAN) as well as from the Bioconductor website.
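To make the idea of a mutual information network concrete, here is a minimal Python sketch of the simplest inference method the abstract lists, a relevance network: estimate the mutual information of every gene pair and keep the pairs above a threshold. It is not the minet API. All function names, the input layout and the `threshold` parameter are assumptions, expression values are assumed to be discretized already, and the Miller-Madow option uses the textbook bias correction (m - 1) / (2n) for m observed bins and n samples.

```python
from collections import Counter
from itertools import combinations
from math import log


def entropy(values, miller_madow=False):
    """Empirical entropy in nats, optionally with the Miller-Madow bias term."""
    n = len(values)
    counts = Counter(values)
    h = -sum((c / n) * log(c / n) for c in counts.values())
    if miller_madow:
        h += (len(counts) - 1) / (2.0 * n)  # (observed bins - 1) / (2 * samples)
    return h


def mutual_information(x, y, miller_madow=False):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) for two discretized profiles."""
    hx = entropy(x, miller_madow)
    hy = entropy(y, miller_madow)
    hxy = entropy(list(zip(x, y)), miller_madow)
    return hx + hy - hxy


def relevance_network(expr, threshold):
    """Return weighted edges {(gene_a, gene_b): MI} for pairs with MI above threshold.

    `expr` maps a gene name to its discretized expression profile
    (hypothetical layout for this sketch).
    """
    edges = {}
    for a, b in combinations(sorted(expr), 2):
        weight = mutual_information(expr[a], expr[b])
        if weight > threshold:
            edges[(a, b)] = weight
    return edges
```

Methods such as ARACNE, CLR and MRNET start from the same pairwise mutual information matrix and differ in how they prune or rescore its entries.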
Irigoyen, Antonio; Jimenez-Luna, Cristina; Benavides, Manuel; Caba, Octavio; Gallego, Javier; Ortuño, Francisco Manuel; Guillen-Ponce, Carmen; Rojas, Ignacio; Aranda, Enrique; Torres, Carolina; Prados, Jose
2018-01-01
Applying differentially expressed genes (DEGs) to identify feasible biomarkers in diseases can be a hard task when working with heterogeneous datasets. Expression data are strongly influenced by technology, sample preparation processes, and/or labeling methods. The proliferation of different microarray platforms for measuring gene expression increases the need to develop models able to compare their results, especially when different technologies can lead to signal values that vary greatly. Integrative meta-analysis can significantly improve the reliability and robustness of DEG detection. The objective of this work was to develop an integrative approach for identifying potential cancer biomarkers by integrating gene expression data from two different platforms. Pancreatic ductal adenocarcinoma (PDAC), where there is an urgent need to find new biomarkers due to its late diagnosis, is an ideal candidate for testing this technology. Expression data from two different datasets, namely Affymetrix and Illumina (18 and 36 PDAC patients, respectively), as well as from 18 healthy controls, were used for this study. A meta-analysis based on an empirical Bayesian methodology (ComBat) was then proposed to integrate these datasets. DEGs were finally identified from the integrated data by using the statistical programming language R. After our integrative meta-analysis, 5 genes were commonly identified within the individual analyses of the independent datasets. In addition, 28 novel genes that were not reported by the individual analyses ('gained' genes) were discovered. Several of these gained genes have already been related to other gastroenterological tumors. The proposed integrative meta-analysis has revealed novel DEGs that may play an important role in PDAC and could be potential biomarkers for diagnosing the disease.
An evaluation of two-channel ChIP-on-chip and DNA methylation microarray normalization strategies
2012-01-01
Background The combination of chromatin immunoprecipitation with two-channel microarray technology enables genome-wide mapping of binding sites of DNA-interacting proteins (ChIP-on-chip) or sites with methylated CpG di-nucleotides (DNA methylation microarray). These powerful tools are the gateway to understanding gene transcription regulation. Since the goals of such studies, the sample preparation procedures, the microarray content and study design are all different from transcriptomics microarrays, the data pre-processing strategies traditionally applied to transcriptomics microarrays may not be appropriate. Particularly, the main challenge of the normalization of "regulation microarrays" is (i) to make the data of individual microarrays quantitatively comparable and (ii) to keep the signals of the enriched probes, representing DNA sequences from the precipitate, as distinguishable as possible from the signals of the un-enriched probes, representing DNA sequences largely absent from the precipitate. Results We compare several widely used normalization approaches (VSN, LOWESS, quantile, T-quantile, Tukey's biweight scaling, Peng's method) applied to a selection of regulation microarray datasets, ranging from DNA methylation to transcription factor binding and histone modification studies. Through comparison of the data distributions of control probes and gene promoter probes before and after normalization, and assessment of the power to identify known enriched genomic regions after normalization, we demonstrate that there are clear differences in performance between normalization procedures. Conclusion T-quantile normalization applied separately on the channels and Tukey's biweight scaling outperform other methods in terms of the conservation of enriched and un-enriched signal separation, as well as in identification of genomic regions known to be enriched. T-quantile normalization is preferable as it additionally improves comparability between microarrays. 
In contrast, popular normalization approaches like quantile, LOWESS, Peng's method and VSN normalization alter the data distributions of regulation microarrays to such an extent that using these approaches will impact the reliability of the downstream analysis substantially. PMID:22276688
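To see why plain quantile normalization can erase the separation between enriched and un-enriched probes, it helps to look at what the procedure does. The following minimal sketch implements standard quantile normalization only (not the T-quantile variant favoured in the abstract); the function name is hypothetical and ties are broken by original order, which is a simplifying assumption.

```python
def quantile_normalize(columns):
    """Quantile-normalize equal-length lists, one list per array or channel.

    After normalization every column contains exactly the same set of
    values (the rank-wise means), so all columns share one distribution.
    """
    n = len(columns[0])

    # Sort each column, then average across columns at each rank.
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [sum(col[i] for col in sorted_cols) / len(columns) for i in range(n)]

    normalized = []
    for col in columns:
        # Indices of this column ordered from smallest to largest value.
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]  # replace value by the mean at its rank
        normalized.append(out)
    return normalized
```

Because every column is forced onto the same distribution, an array with a genuinely larger enriched fraction is made to look like all the others, which is consistent with the abstract's warning about regulation microarrays.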
Privacy Preserving PCA on Distributed Bioinformatics Datasets
ERIC Educational Resources Information Center
Li, Xin
2011-01-01
In recent years, new bioinformatics technologies, such as gene expression microarray, genome-wide association study, proteomics, and metabolomics, have been widely used to simultaneously identify a huge number of human genomic/genetic biomarkers, generate a tremendously large amount of data, and dramatically increase the knowledge on human…
A new normalizing algorithm for BAC CGH arrays with quality control metrics.
Miecznikowski, Jeffrey C; Gaile, Daniel P; Liu, Song; Shepherd, Lori; Nowak, Norma
2011-01-01
The main focus in pin-tip (or print-tip) microarray analysis is determining which probes, genes, or oligonucleotides are differentially expressed. Specifically in array comparative genomic hybridization (aCGH) experiments, researchers search for chromosomal imbalances in the genome. To model this data, scientists apply statistical methods to the structure of the experiment and assume that the data consist of the signal plus random noise. In this paper we propose "SmoothArray", a new method to preprocess comparative genomic hybridization (CGH) bacterial artificial chromosome (BAC) arrays and we show the effects on a cancer dataset. As part of our R software package "aCGHplus," this freely available algorithm removes the variation due to the intensity effects, pin/print-tip, the spatial location on the microarray chip, and the relative location from the well plate. Removal of this variation improves the downstream analysis and subsequent inferences made on the data. Further, we present measures to evaluate the quality of the dataset according to the arrayer pins, 384-well plates, plate rows, and plate columns. We compare our method against competing methods using several metrics to measure the biological signal. With this novel normalization algorithm and quality control measures, users can improve their inferences on datasets and pinpoint problems that may arise in their BAC aCGH technology.
Novel molecular subtypes of serous and endometrioid ovarian cancer linked to clinical outcome.
Tothill, Richard W; Tinker, Anna V; George, Joshy; Brown, Robert; Fox, Stephen B; Lade, Stephen; Johnson, Daryl S; Trivett, Melanie K; Etemadmoghadam, Dariush; Locandro, Bianca; Traficante, Nadia; Fereday, Sian; Hung, Jillian A; Chiew, Yoke-Eng; Haviv, Izhak; Gertig, Dorota; DeFazio, Anna; Bowtell, David D L
2008-08-15
The study aimed to identify novel molecular subtypes of ovarian cancer by gene expression profiling with linkage to clinical and pathologic features. Microarray gene expression profiling was done on 285 serous and endometrioid tumors of the ovary, peritoneum, and fallopian tube. K-means clustering was applied to identify robust molecular subtypes. Statistical analysis identified differentially expressed genes, pathways, and gene ontologies. Laser capture microdissection, pathology review, and immunohistochemistry validated the array-based findings. Patient survival within k-means groups was evaluated using Cox proportional hazards models. Class prediction validated k-means groups in an independent dataset. A semisupervised survival analysis of the array data was used to compare against unsupervised clustering results. Optimal clustering of array data identified six molecular subtypes. Two subtypes represented predominantly serous low malignant potential and low-grade endometrioid subtypes, respectively. The remaining four subtypes represented higher grade and advanced stage cancers of serous and endometrioid morphology. A novel subtype of high-grade serous cancers reflected a mesenchymal cell type, characterized by overexpression of N-cadherin and P-cadherin and low expression of differentiation markers, including CA125 and MUC1. A poor prognosis subtype was defined by a reactive stroma gene expression signature, correlating with extensive desmoplasia in such samples. A similar poor prognosis signature could be found using a semisupervised analysis. Each subtype displayed distinct levels and patterns of immune cell infiltration. Class prediction identified similar subtypes in an independent ovarian dataset with similar prognostic trends. Gene expression profiling identified molecular subtypes of ovarian cancer of biological and clinical importance.
Identifying pathogenic processes by integrating microarray data with prior knowledge
2014-01-01
Background It is of great importance to identify molecular processes and pathways that are involved in disease etiology. Although there has been an extensive use of various high-throughput methods for this task, pathogenic pathways are still not completely understood. Often the set of genes or proteins identified as altered in genome-wide screens show a poor overlap with canonical disease pathways. These findings are difficult to interpret, yet crucial in order to improve the understanding of the molecular processes underlying the disease progression. We present a novel method for identifying groups of connected molecules from a set of differentially expressed genes. These groups represent functional modules sharing common cellular function and involve signaling and regulatory events. Specifically, our method makes use of Bayesian statistics to identify groups of co-regulated genes based on the microarray data, where external information about molecular interactions and connections are used as priors in the group assignments. Markov chain Monte Carlo sampling is used to search for the most reliable grouping. Results Simulation results showed that the method improved the ability of identifying correct groups compared to traditional clustering, especially for small sample sizes. Applied to a microarray heart failure dataset the method found one large cluster with several genes important for the structure of the extracellular matrix and a smaller group with many genes involved in carbohydrate metabolism. The method was also applied to a microarray dataset on melanoma cancer patients with or without metastasis, where the main cluster was dominated by genes related to keratinocyte differentiation. Conclusion Our method found clusters overlapping with known pathogenic processes, but also pointed to new connections extending beyond the classical pathways. PMID:24758699
Dynamic association rules for gene expression data analysis.
Chen, Shu-Chuan; Tsai, Tsung-Hsien; Chung, Cheng-Han; Li, Wen-Hsiung
2015-10-14
The purpose of gene expression analysis is to look for the association between regulation of gene expression levels and phenotypic variations. This association, based on gene expression profiles, has been used to determine whether the induction/repression of genes corresponds to phenotypic variations including cell regulations, clinical diagnoses and drug development. Statistical analyses on microarray data have been developed to address the gene selection problem. However, these methods do not inform us of causality between genes and phenotypes. In this paper, we propose the dynamic association rule algorithm (DAR algorithm), which helps one to efficiently select a subset of significant genes for subsequent analysis. The DAR algorithm is based on association rules from market basket analysis in marketing. We first propose a statistical procedure, based on constructing a one-sided confidence interval and hypothesis testing, to determine if an association rule is meaningful. Based on the proposed statistical method, we then developed the DAR algorithm for gene expression data analysis. The method was applied to analyze four microarray datasets and one Next Generation Sequencing (NGS) dataset: the Mice Apo A1 dataset, the whole genome expression dataset of mouse embryonic stem cells, expression profiling of the bone marrow of Leukemia patients, the Microarray Quality Control (MAQC) data set and the RNA-seq dataset of a mouse genomic imprinting study. A comparison of the proposed method with the t-test on the expression profiling of the bone marrow of Leukemia patients was conducted. We developed a statistical procedure, based on the concept of confidence interval, to determine the minimum support and minimum confidence for mining association relationships among items. With the minimum support and minimum confidence, one can find significant rules in one single step. The DAR algorithm was then developed for gene expression data analysis. 
Four gene expression datasets showed that the proposed DAR algorithm not only was able to identify a set of differentially expressed genes that largely agreed with that of other methods, but also provided an efficient and accurate way to find influential genes of a disease. In the paper, the well-established association rule mining technique from marketing has been successfully modified to determine the minimum support and minimum confidence based on the concept of confidence interval and hypothesis testing. It can be applied to gene expression data to mine significant association rules between gene regulation and phenotype. The proposed DAR algorithm provides an efficient way to find influential genes that underlie the phenotypic variance.
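The support and confidence of an association rule, together with a one-sided lower confidence bound, can be sketched as follows. This is an illustrative assumption rather than the DAR algorithm: the abstract does not give the authors' exact interval, so a normal-approximation (Wald) bound is used here, and the function name, item labels, and `alpha` default are hypothetical.

```python
import math
from statistics import NormalDist


def rule_stats(transactions, antecedent, consequent, alpha=0.05):
    """Support, confidence and a one-sided lower bound for antecedent -> consequent.

    transactions -- list of sets of items (e.g. {"geneA_up", "tumor"})
    antecedent, consequent -- sets of items
    """
    n = len(transactions)
    n_a = sum(1 for t in transactions if antecedent <= t)
    n_ab = sum(1 for t in transactions if (antecedent | consequent) <= t)

    support = n_ab / n                          # P(antecedent and consequent)
    confidence = n_ab / n_a if n_a else 0.0     # P(consequent | antecedent)

    # One-sided lower bound on the confidence (Wald approximation, assumed here).
    if n_a:
        z = NormalDist().inv_cdf(1 - alpha)
        lower = confidence - z * math.sqrt(confidence * (1 - confidence) / n_a)
    else:
        lower = 0.0
    return support, confidence, max(lower, 0.0)
```

A rule would then be retained only if its lower bound exceeds a chosen threshold, for example the frequency of the consequent in the data; that threshold is a design choice of this sketch, not something stated in the abstract.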
Hu, Pingsha; Maiti, Tapabrata
2011-01-01
Microarray is a powerful tool for genome-wide gene expression analysis. In microarray expression data, often mean and variance have certain relationships. We present a non-parametric mean-variance smoothing method (NPMVS) to analyze differentially expressed genes. In this method, a nonlinear smoothing curve is fitted to estimate the relationship between mean and variance. Inference is then made upon shrinkage estimation of posterior means assuming variances are known. Different methods have been applied to simulated datasets, in which a variety of mean and variance relationships were imposed. The simulation study showed that NPMVS outperformed the other two popular shrinkage estimation methods in some mean-variance relationships; and NPMVS was competitive with the two methods in other relationships. A real biological dataset, in which a cold stress transcription factor gene, CBF2, was overexpressed, has also been analyzed with the three methods. Gene ontology and cis-element analysis showed that NPMVS identified more cold and stress responsive genes than the other two methods did. The good performance of NPMVS is mainly due to its shrinkage estimation for both means and variances. In addition, NPMVS exploits a non-parametric regression between mean and variance, instead of assuming a specific parametric relationship between mean and variance. The source code written in R is available from the authors on request. PMID:21611181
Classification of mislabelled microarrays using robust sparse logistic regression.
Bootkrajang, Jakramate; Kabán, Ata
2013-04-01
Previous studies reported that labelling errors are not uncommon in microarray datasets. In such cases, the training set may become misleading, and the ability of classifiers to make reliable inferences from the data is compromised. Yet, few methods are currently available in the bioinformatics literature to deal with this problem. The few existing methods focus on data cleansing alone, without reference to classification, and their performance crucially depends on some tuning parameters. In this article, we develop a new method to detect mislabelled arrays simultaneously with learning a sparse logistic regression classifier. Our method may be seen as a label-noise robust extension of the well-known and successful Bayesian logistic regression classifier. To account for possible mislabelling, we formulate a label-flipping process as part of the classifier. The regularization parameter is automatically set using Bayesian regularization, which not only saves the computation time that cross-validation would take, but also eliminates any unwanted effects of label noise when setting the regularization parameter. Extensive experiments with both synthetic data and real microarray datasets demonstrate that our approach is able to counter the bad effects of labelling errors in terms of predictive performance, it is effective at identifying marker genes and simultaneously it detects mislabelled arrays to high accuracy. The code is available from http://cs.bham.ac.uk/~jxb008. Supplementary data are available at Bioinformatics online.
Falgreen, Steffen; Ellern Bilgrau, Anders; Brøndum, Rasmus Froberg; Hjort Jakobsen, Lasse; Have, Jonas; Lindblad Nielsen, Kasper; El-Galaly, Tarec Christoffer; Bødker, Julie Støve; Schmitz, Alexander; H Young, Ken; Johnsen, Hans Erik; Dybkær, Karen; Bøgsted, Martin
2016-01-01
Dozens of omics-based cancer classification systems have been introduced with prognostic, diagnostic, and predictive capabilities. However, they often employ complex algorithms and are only applicable to whole cohorts of patients, making them difficult to apply in a personalized clinical setting. This prompted us to create hemaClass.org, an online web application providing an easy interface to one-by-one RMA normalization of microarrays and subsequent risk classifications of diffuse large B-cell lymphoma (DLBCL) into cell-of-origin and chemotherapeutic sensitivity classes. Classification results for one-by-one array pre-processing with and without a laboratory-specific RMA reference dataset were compared to cohort-based classifiers in 4 publicly available datasets. Classifications showed high agreement between one-by-one and whole-cohort pre-processed data when a laboratory-specific reference set was supplied. The website is essentially the R-package hemaClass accompanied by a Shiny web application. The well-documented package can be used to run the website locally or to use the developed methods programmatically. The website and R-package are relevant for biological and clinical lymphoma researchers using Affymetrix U-133 Plus 2 arrays, as they provide reliable and swift methods for calculation of disease subclasses. The proposed one-by-one pre-processing method is relevant for all researchers using microarrays.
Wong, Kah Keng; Ch'ng, Ewe Seng; Loo, Suet Kee; Husin, Azlan; Muruzabal, María Arestin; Møller, Michael B; Pedersen, Lars M; Pomposo, María Puente; Gaafar, Ayman; Banham, Alison H; Green, Tina M; Lawrie, Charles H
2015-12-01
Huntingtin-interacting protein 1-related (HIP1R) is an endocytic protein involved in receptor trafficking, including regulating cell surface expression of receptor tyrosine kinases. We have previously shown that low HIP1R protein expression was associated with poorer survival in diffuse large B-cell lymphoma (DLBCL) patients from Denmark treated with R-CHOP (rituximab, cyclophosphamide, doxorubicin, vincristine, prednisone). In this multicenter study, we extend these findings and validate the prognostic and subtyping utility of HIP1R expression at both transcript and protein level. Using data mining on three independent transcriptomic datasets of DLBCL, HIP1R transcript was preferentially expressed in germinal center B-cell (GCB)-like DLBCL subtype (P<0.01 in all three datasets), and lower expression was correlated with worse overall survival (OS; P<0.01) and progression-free survival (PFS; P<0.05) in a microarray-profiled DLBCL dataset. At the protein level examined by immunohistochemistry, HIP1R expression at 30% cut-off was associated with GCB-DLBCL molecular subtype (P=0.0004; n=42), and predictive of OS (P=0.0006) and PFS (P=0.0230) in de novo DLBCL patients treated with R-CHOP (n=73). Cases with high FOXP1 and low HIP1R expression frequency (FOXP1(hi)/HIP1R(lo) phenotype) exhibited poorer OS (P=0.0038) and PFS (P=0.0134). Multivariate analysis showed that HIP1R<30% or FOXP1(hi)/HIP1R(lo) subgroup of patients exhibited inferior OS and PFS (P<0.05) independently of the International Prognostic Index. We conclude that HIP1R expression is strongly indicative of survival when utilized on its own or in combination with FOXP1, and the molecule is potentially applicable for subtyping of DLBCL cases.
Identification of Common Differentially Expressed Genes in Urinary Bladder Cancer
Zaravinos, Apostolos; Lambrou, George I.; Boulalas, Ioannis; Delakas, Dimitris; Spandidos, Demetrios A.
2011-01-01
Background Current diagnosis and treatment of urinary bladder cancer (BC) have shown great progress with the utilization of microarrays. Purpose Our goal was to identify common differentially expressed (DE) genes among clinically relevant subclasses of BC using microarrays. Methodology/Principal Findings BC samples and controls, both experimental and publicly available datasets, were analyzed by whole genome microarrays. We grouped the samples according to their histology and defined the DE genes in each sample individually, as well as in each tumor group. A dual analysis strategy was followed. First, experimental samples were analyzed and conclusions were formulated; and second, experimental sets were combined with publicly available microarray datasets and were further analyzed in search of common DE genes. The experimental dataset identified 831 genes that were DE in all tumor samples, simultaneously. Moreover, 33 genes were up-regulated and 85 genes were down-regulated in all 10 BC samples compared to the 5 normal tissues, simultaneously. Hierarchical clustering partitioned tumor groups in accordance with their histology. K-means clustering of all genes and all samples, as well as clustering of tumor groups, presented 49 clusters. K-means clustering of common DE genes in all samples revealed 24 clusters. Genes manifested various differential patterns of expression, based on PCA. YY1 and NFκB were among the most common transcription factors that regulated the expression of the identified DE genes. Chromosome 1 contained 32 DE genes, followed by chromosomes 2 and 11, which contained 25 and 23 DE genes, respectively. Chromosome 21 had the fewest DE genes. 
GO analysis revealed the prevalence of transport and binding genes in the common down-regulated DE genes; the prevalence of RNA metabolism and processing genes in the up-regulated DE genes; as well as the prevalence of genes responsible for cell communication and signal transduction in the DE genes that were down-regulated in T1-Grade III tumors and up-regulated in T2/T3-Grade III tumors. Combination of samples from all microarray platforms revealed 17 common DE genes, (BMP4, CRYGD, DBH, GJB1, KRT83, MPZ, NHLH1, TACR3, ACTC1, MFAP4, SPARCL1, TAGLN, TPM2, CDC20, LHCGR, TM9SF1 and HCCS) 4 of which participate in numerous pathways. Conclusions/Significance The identification of the common DE genes among BC samples of different histology can provide further insight into the discovery of new putative markers. PMID:21483740
A proposed metric for assessing the measurement quality of individual microarrays
Kim, Kyoungmi; Page, Grier P; Beasley, T Mark; Barnes, Stephen; Scheirer, Katherine E; Allison, David B
2006-01-01
Background High-density microarray technology is increasingly applied to study gene expression levels on a large scale. Microarray experiments rely on several critical steps that may introduce error and uncertainty in analyses. These steps include mRNA sample extraction, amplification and labeling, hybridization, and scanning. In some cases this may be manifested as systematic spatial variation on the surface of microarray in which expression measurements within an individual array may vary as a function of geographic position on the array surface. Results We hypothesized that an index of the degree of spatiality of gene expression measurements associated with their physical geographic locations on an array could indicate the summary of the physical reliability of the microarray. We introduced a novel way to formulate this index using a statistical analysis tool. Our approach regressed gene expression intensity measurements on a polynomial response surface of the microarray's Cartesian coordinates. We demonstrated this method using a fixed model and presented results from real and simulated datasets. Conclusion We demonstrated the potential of such a quantitative metric for assessing the reliability of individual arrays. Moreover, we showed that this procedure can be incorporated into laboratory practice as a means to set quality control specifications and as a tool to determine whether an array has sufficient quality to be retained in terms of spatial correlation of gene expression measurements. PMID:16430768
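Regressing intensities on a polynomial surface of the array coordinates can be sketched as below. The abstract does not give the exact form of the proposed index, so this example assumes a second-order surface and uses the coefficient of determination (R squared) as the spatial summary; the function names are hypothetical and the solver assumes a full-rank design (spots spread over a two-dimensional grid).

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_spatial_surface(xs, ys, values):
    """Fit z ~ 1 + x + y + x^2 + x*y + y^2; return (coefficients, r_squared)."""
    rows = [[1.0, x, y, x * x, x * y, y * y] for x, y in zip(xs, ys)]
    p = 6
    # Normal equations: (X^T X) b = X^T z
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xtz = [sum(r[i] * z for r, z in zip(rows, values)) for i in range(p)]
    coeffs = solve_linear(xtx, xtz)

    fitted = [sum(c * f for c, f in zip(coeffs, r)) for r in rows]
    mean_z = sum(values) / len(values)
    ss_tot = sum((z - mean_z) ** 2 for z in values)
    ss_res = sum((z - f) ** 2 for z, f in zip(values, fitted))
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    return coeffs, r_squared
```

Under this sketch, an R squared near zero suggests that intensity is unrelated to position on the array, while a large value indicates that much of the signal variation follows a smooth spatial trend, the kind of artifact the abstract proposes to flag.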
Tojo, Axel; Malm, Johan; Marko-Varga, György; Lilja, Hans; Laurell, Thomas
2014-01-01
Antibody microarrays have become widespread, but their use for quantitative analyses in clinical samples has not yet been established. We investigated an immunoassay based on nanoporous silicon antibody microarrays for quantification of total prostate-specific-antigen (PSA) in 80 clinical plasma samples, and provide quantitative data from a duplex microarray assay that simultaneously quantifies free and total PSA in plasma. To further develop the assay, the porous silicon chips were placed into a standard 96-well microtiter plate for higher throughput analysis. The samples analyzed by this quantitative microarray were 80 plasma samples obtained from men undergoing clinical PSA testing (dynamic range: 0.14-44ng/ml, LOD: 0.14ng/ml). The second dataset, measuring free PSA (dynamic range: 0.40-74.9ng/ml, LOD: 0.47ng/ml) and total PSA (dynamic range: 0.87-295ng/ml, LOD: 0.76ng/ml), was also obtained from the clinical routine. The reference for the quantification was a commercially available assay, the ProStatus PSA Free/Total DELFIA. In an analysis of 80 plasma samples the microarray platform performs well across the range of total PSA levels. This assay might have the potential to substitute for the large-scale microtiter plate format in diagnostic applications. The duplex assay paves the way for a future quantitative multiplex assay, which analyses several prostate cancer biomarkers simultaneously. PMID:22921878
Wimmer, Isabella; Tröscher, Anna R; Brunner, Florian; Rubino, Stephen J; Bien, Christian G; Weiner, Howard L; Lassmann, Hans; Bauer, Jan
2018-04-20
Formalin-fixed paraffin-embedded (FFPE) tissues are valuable resources commonly used in pathology. However, formalin fixation modifies nucleic acids, which makes it challenging to isolate high-quality RNA for genetic profiling. Here, we assessed feasibility and reliability of microarray studies analysing transcriptome data from fresh, fresh-frozen (FF) and FFPE tissues. We show that reproducible microarray data can be generated from only 2 ng FFPE-derived RNA. For RNA quality assessment, fragment size distribution (DV200) and qPCR proved most suitable. During RNA isolation, extending tissue lysis time to 10 hours reduced high-molecular-weight species, while additional incubation at 70 °C markedly increased RNA yields. Since FF- and FFPE-derived microarrays constitute different data entities, we used indirect measures to investigate gene signal variation and relative gene expression. Whole-genome analyses revealed high concordance rates, while review on a single-gene basis showed higher data variation in FFPE than FF arrays. Using an experimental model, gene set enrichment analysis (GSEA) of FFPE-derived microarrays and fresh tissue-derived RNA-Seq datasets yielded similarly affected pathways, confirming the applicability of FFPE tissue in global gene expression analysis. Our study provides a workflow comprising RNA isolation, quality assessment and microarray profiling using minimal RNA input, thus enabling hypothesis-generating pathway analyses from limited amounts of precious, pathologically significant FFPE tissues.
Large-scale atlas of microarray data reveals biological landscape of gene expression in Arabidopsis
USDA-ARS's Scientific Manuscript database
Transcriptome datasets from thousands of samples of the model plant Arabidopsis thaliana have been collectively generated by multiple individual labs. Although integration and meta-analysis of these samples has become routine in the plant research community, it is often hampered by the lack of metad...
Variations in study design are typical for toxicogenomic studies, but their impact on gene expression in control animals has not been well characterized. A dataset of control animal microarray expression data was assembled by a working group of the Health and Environmental Scienc...
Adaptable gene-specific dye bias correction for two-channel DNA microarrays
Margaritis, Thanasis; Lijnzaad, Philip; van Leenen, Dik; Bouwmeester, Diane; Kemmeren, Patrick; van Hooff, Sander R; Holstege, Frank CP
2009-01-01
DNA microarray technology is a powerful tool for monitoring gene expression or for finding the location of DNA-bound proteins. DNA microarrays can suffer from gene-specific dye bias (GSDB), causing some probes to be affected more by the dye than by the sample. This results in large measurement errors, which vary considerably for different probes and also across different hybridizations. GSDB is not corrected by conventional normalization and has been difficult to address systematically because of its variance. We show that GSDB is influenced by label incorporation efficiency, explaining the variation of GSDB across different hybridizations. A correction method (Gene- And Slide-Specific Correction, GASSCO) is presented, whereby sequence-specific corrections are modulated by the overall bias of individual hybridizations. GASSCO outperforms earlier methods and works well on a variety of publically available datasets covering a range of platforms, organisms and applications, including ChIP on chip. A sequence-based model is also presented, which predicts which probes will suffer most from GSDB, useful for microarray probe design and correction of individual hybridizations. Software implementing the method is publicly available. PMID:19401678
Microarray data from independent labs and studies can be compared to potentially identify toxicologically and biologically relevant genes. The Baseline Animal Database working group of HESI was formed to assess baseline gene expression from microarray data derived from control or...
2014-01-01
Background Long noncoding RNAs (lncRNAs) constitute a major, but poorly characterized part of the human transcriptome. Recent evidence indicates that many lncRNAs are involved in cancer and can be used as predictive and prognostic biomarkers. A significant fraction of lncRNAs is represented on widely used microarray platforms; however, they have usually been ignored in cancer studies. Results We developed a computational pipeline to annotate lncRNAs on popular Affymetrix U133 microarrays, creating a resource allowing measurement of expression of 1581 lncRNAs. This resource can be utilized to interrogate existing microarray datasets for various lncRNA studies. We found that these lncRNAs fall into three distinct classes according to their statistical distribution by length. Remarkably, these three classes of lncRNAs were co-localized with protein coding genes exhibiting distinct gene ontology groups. This annotation was applied to microarray analysis which identified a 159-lncRNA signature that discriminates between localized and metastatic stages of neuroblastoma. Analysis of an independent patient cohort revealed that this signature also differentiates relapsing from non-relapsing primary tumors. This is the first example of a signature developed solely via the analysis of lncRNA expression. One of these lncRNAs, termed HOXD-AS1, is encoded in the HOXD cluster. HOXD-AS1 is evolutionarily conserved among hominids and has all bona fide features of a gene. Studying the retinoic acid (RA) response of the SH-SY5Y cell line, a model of human metastatic neuroblastoma, we found that HOXD-AS1 is subject to morphogenic regulation, is activated by the PI3K/Akt pathway and is itself involved in the control of RA-induced cell differentiation. Knock-down experiments revealed that HOXD-AS1 controls expression levels of clinically significant protein-coding genes involved in angiogenesis and inflammation, the hallmarks of metastatic cancer. 
Conclusions Our findings greatly extend the number of noncoding RNAs functionally implicated in tumor development and patient treatment and highlight their role as potential prognostic biomarkers of neuroblastomas. PMID:25522241
Pine, P S; Boedigheimer, M; Rosenzweig, B A; Turpaz, Y; He, Y D; Delenstarr, G; Ganter, B; Jarnagin, K; Jones, W D; Reid, L H; Thompson, K L
2008-11-01
Effective use of microarray technology in clinical and regulatory settings is contingent on the adoption of standard methods for assessing performance. The MicroArray Quality Control project evaluated the repeatability and comparability of microarray data on the major commercial platforms and laid the groundwork for the application of microarray technology to regulatory assessments. However, methods for assessing performance that are commonly applied to diagnostic assays used in laboratory medicine remain to be developed for microarray assays. A reference system for microarray performance evaluation and process improvement was developed that includes reference samples, metrics and reference datasets. The reference material is composed of two mixes of four different rat tissue RNAs that allow defined target ratios to be assayed using a set of tissue-selective analytes that are distributed along the dynamic range of measurement. The diagnostic accuracy of detected changes in expression ratios, measured as the area under the curve from receiver operating characteristic plots, provides a single commutable value for comparing assay specificity and sensitivity. The utility of this system for assessing overall performance was evaluated for relevant applications like multi-laboratory proficiency testing programs and single-laboratory process drift monitoring. The diagnostic accuracy of detection of a 1.5-fold change in signal level was found to be a sensitive metric for comparing overall performance. This test approaches the technical limit for reliable discrimination of differences between two samples using this technology. We describe a reference system that provides a mechanism for internal and external assessment of laboratory proficiency with microarray technology and is translatable to performance assessments on other whole-genome expression arrays used for basic and clinical research.
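As a rough illustration of the diagnostic-accuracy metric described above, the area under a ROC curve can be computed as the probability that a truly changed analyte receives a higher score than an unchanged one. The sketch below assumes that scores are absolute log2 ratios; the function name and the example values are hypothetical and are not taken from the reference datasets.

```python
def roc_auc(scores_changed, scores_unchanged):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen changed analyte scores
    higher than a randomly chosen unchanged one (ties count as one half).
    """
    if not scores_changed or not scores_unchanged:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for pos in scores_changed:
        for neg in scores_unchanged:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(scores_changed) * len(scores_unchanged))


# Hypothetical absolute log2 ratios: analytes with a 1.5-fold target
# (log2(1.5) is about 0.585) versus analytes with no expected change.
changed = [0.62, 0.55, 0.70, 0.30]
unchanged = [0.05, 0.10, 0.35, 0.02]
auc = roc_auc(changed, unchanged)  # 15 of 16 pairs are ordered correctly
```

An AUC of 1.0 would mean the assay separates changed from unchanged analytes perfectly, while 0.5 corresponds to chance.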
A hierarchical two-phase framework for selecting genes in cancer datasets with a neuro-fuzzy system.
Lim, Jongwoo; Wang, Bohyun; Lim, Joon S
2016-04-29
Finding the minimum number of appropriate biomarkers for specific targets such as lung cancer has been a challenging issue in bioinformatics. We propose a hierarchical two-phase framework for selecting appropriate biomarkers that extracts candidate biomarkers from the cancer microarray datasets and then selects the minimum number of appropriate biomarkers from the extracted candidate biomarker datasets with a specific neuro-fuzzy algorithm, called a neural network with weighted fuzzy membership function (NEWFM). In the first phase, the proposed framework extracts candidate biomarkers using a Bhattacharyya distance method that measures the similarity of two discrete probability distributions. Finally, the proposed framework is able to reduce the cost of finding biomarkers by not receiving medical supplements and improve the accuracy of the biomarkers in specific cancer target datasets.
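The Bhattacharyya distance mentioned above has a compact closed form for discrete distributions, D = -ln(sum_i sqrt(p_i * q_i)). A minimal sketch follows; the histograms are hypothetical, and how the authors bin expression values or rank genes by this distance is not stated in the abstract.

```python
import math


def bhattacharyya_distance(p, q):
    """Bhattacharyya distance between two discrete probability distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    # Bhattacharyya coefficient: 1 for identical distributions, 0 for disjoint ones.
    coefficient = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    if coefficient == 0.0:
        return math.inf  # the distributions do not overlap at all
    return -math.log(coefficient)


# Hypothetical per-class histograms of one gene's expression (fractions per bin).
tumour = [0.1, 0.2, 0.7]
normal = [0.6, 0.3, 0.1]
d_same = bhattacharyya_distance(tumour, tumour)  # approximately 0
d_diff = bhattacharyya_distance(tumour, normal)  # larger: classes are separated
```

Under this reading, a gene whose class-conditional histograms give a large distance would be a stronger candidate biomarker than one whose histograms nearly coincide.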
Microarray labeling extension values: laboratory signatures for Affymetrix GeneChips
Lee, Yun-Shien; Chen, Chun-Houh; Tsai, Chi-Neu; Tsai, Chia-Lung; Chao, Angel; Wang, Tzu-Hao
2009-01-01
Interlaboratory comparison of microarray data, even when using the same platform, poses several challenges for scientists. RNA quality, RNA labeling efficiency, hybridization procedures and data-mining tools can all contribute variations in each laboratory. In Affymetrix GeneChips, about 11–20 different 25-mer oligonucleotides are used to measure the level of each transcript. Here, we report that ‘labeling extension values (LEVs)’, which are correlation coefficients between probe intensities and probe positions, are highly correlated with the gene expression levels (GEVs) on eukaryotic Affymetrix microarray data. By analyzing LEVs and GEVs in the publicly available 2414 cel files of 20 Affymetrix microarray types covering 13 species, we found that correlations between LEVs and GEVs only exist in eukaryotic RNAs, but not in prokaryotic ones. Surprisingly, Affymetrix results of the same specimens that were analyzed in different laboratories could be clearly differentiated only by LEVs, leading to the identification of ‘laboratory signatures’. In the examined dataset, GSE10797, filtering out high-LEV genes did not compromise the discovery of biological processes that are constructed by differentially expressed genes. In conclusion, LEVs provide a new filtering parameter for microarray analysis of gene expression and may improve the inter- and intralaboratory comparability of Affymetrix GeneChips data. PMID:19295132
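Since an LEV is defined above as a correlation coefficient between probe intensities and probe positions, a per-probe-set value can be sketched as follows. This assumes a Pearson correlation, which the abstract does not specify, and the positions and intensities are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def labeling_extension_value(positions, intensities):
    """LEV of one probe set: correlation of probe position with probe intensity."""
    return pearson(positions, intensities)


# Hypothetical probe set: position of each probe along the transcript and
# the intensity measured for that probe.
positions = [100, 250, 400, 550, 700]
intensities = [120.0, 180.0, 260.0, 310.0, 400.0]
lev = labeling_extension_value(positions, intensities)  # close to +1
```

A value near +1 or -1 indicates that intensity changes systematically along the transcript, which is the kind of positional trend the abstract attributes to labeling.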
2010-01-01
Background The large amount of high-throughput genomic data has facilitated the discovery of the regulatory relationships between transcription factors and their target genes. While early methods for discovery of transcriptional regulation relationships from microarray data often focused on the high-throughput experimental data alone, more recent approaches have explored the integration of external knowledge bases of gene interactions. Results In this work, we develop an algorithm that provides improved performance in the prediction of transcriptional regulatory relationships by supplementing the analysis of microarray data with a new method of integrating information from an existing knowledge base. Using a well-known dataset of yeast microarrays and the Yeast Proteome Database, a comprehensive collection of known information of yeast genes, we show that knowledge-based predictions demonstrate better sensitivity and specificity in inferring new transcriptional interactions than predictions from microarray data alone. We also show that comprehensive, direct and high-quality knowledge bases provide better prediction performance. Comparison of our results with ChIP-chip data and growth fitness data suggests that our predicted genome-wide regulatory pairs in yeast are reasonable candidates for follow-up biological verification. Conclusion High quality, comprehensive, and direct knowledge bases, when combined with appropriate bioinformatic algorithms, can significantly improve the discovery of gene regulatory relationships from high throughput gene expression data. PMID:20122245
Seok, Junhee; Kaushal, Amit; Davis, Ronald W; Xiao, Wenzhong
2010-01-18
The large amount of high-throughput genomic data has facilitated the discovery of the regulatory relationships between transcription factors and their target genes. While early methods for discovery of transcriptional regulation relationships from microarray data often focused on the high-throughput experimental data alone, more recent approaches have explored the integration of external knowledge bases of gene interactions. In this work, we develop an algorithm that provides improved performance in the prediction of transcriptional regulatory relationships by supplementing the analysis of microarray data with a new method of integrating information from an existing knowledge base. Using a well-known dataset of yeast microarrays and the Yeast Proteome Database, a comprehensive collection of known information of yeast genes, we show that knowledge-based predictions demonstrate better sensitivity and specificity in inferring new transcriptional interactions than predictions from microarray data alone. We also show that comprehensive, direct and high-quality knowledge bases provide better prediction performance. Comparison of our results with ChIP-chip data and growth fitness data suggests that our predicted genome-wide regulatory pairs in yeast are reasonable candidates for follow-up biological verification. High quality, comprehensive, and direct knowledge bases, when combined with appropriate bioinformatic algorithms, can significantly improve the discovery of gene regulatory relationships from high throughput gene expression data.
Hu, Guohong; Wang, Hui-Yun; Greenawalt, Danielle M.; Azaro, Marco A.; Luo, Minjie; Tereshchenko, Irina V.; Cui, Xiangfeng; Yang, Qifeng; Gao, Richeng; Shen, Li; Li, Honghua
2006-01-01
Microarray-based analysis of single nucleotide polymorphisms (SNPs) has many applications in large-scale genetic studies. To minimize the influence of experimental variation, microarray data usually need to be processed in several ways, including background subtraction, normalization and low-signal filtering, before genotype determination. Although many sophisticated algorithms exist for these purposes, biases are still present. In the present paper, new algorithms for SNP microarray data analysis and the software, AccuTyping, developed based on these algorithms are described. The algorithms take advantage of a large number of SNPs included in each assay, and the fact that the top and bottom 20% of SNPs can be safely treated as homozygous after sorting based on their ratios between the signal intensities. These SNPs are then used as controls for color channel normalization and background subtraction. Genotype calls are made based on the logarithms of signal intensity ratios using two cutoff values, which were determined after training the program with a dataset of ∼160 000 genotypes and validated by non-microarray methods. AccuTyping was used to determine >300 000 genotypes of DNA and sperm samples. The accuracy was shown to be >99%. AccuTyping can be downloaded from . PMID:16982644
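The idea of using the extreme 20% of sorted ratios as homozygous controls and then calling genotypes with two cutoffs can be illustrated with a short sketch. This is not AccuTyping's actual procedure: the centring rule, the cutoff values of -1 and +1 and the example intensities are assumptions made for illustration, and background subtraction is omitted.

```python
import math
from statistics import median


def call_genotypes(signal_a, signal_b, low_cut=-1.0, high_cut=1.0, tail=0.2):
    """Call AA/AB/BB from two-channel SNP intensities (illustrative only)."""
    log_ratios = [math.log2(a / b) for a, b in zip(signal_a, signal_b)]
    ranked = sorted(log_ratios)
    k = max(1, int(len(ranked) * tail))
    # After sorting, the bottom and top 20% are treated as homozygous controls.
    bb_controls, aa_controls = ranked[:k], ranked[-k:]
    # Assumed normalization: shift so the two homozygous groups sit
    # symmetrically around zero, removing a constant channel bias.
    offset = (median(bb_controls) + median(aa_controls)) / 2
    calls = []
    for ratio in log_ratios:
        centred = ratio - offset
        if centred >= high_cut:
            calls.append("AA")
        elif centred <= low_cut:
            calls.append("BB")
        else:
            calls.append("AB")
    return calls


# Hypothetical intensities in which channel A reads twice as bright as it should.
signal_a = [8000, 8000, 2000, 2000, 500, 500, 8000, 2000, 500, 2000]
signal_b = [250, 250, 1000, 1000, 4000, 4000, 250, 1000, 4000, 1000]
calls = call_genotypes(signal_a, signal_b)
```

In this example the heterozygous SNPs have a raw log2 ratio of 1 because of the channel bias and would be miscalled as AA without the centring step; after centring they fall between the two cutoffs.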
Imholte, Gregory; Gottardo, Raphael
2017-01-01
Summary The peptide microarray immunoassay simultaneously screens sample serum against thousands of peptides, determining the presence of antibodies bound to array probes. Peptide microarrays tiling immunogenic regions of pathogens (e.g. envelope proteins of a virus) are an important high throughput tool for querying and mapping antibody binding. Because of the assay’s many steps, from probe synthesis to incubation, peptide microarray data can be noisy with extreme outliers. In addition, subjects may produce different antibody profiles in response to an identical vaccine stimulus or infection, due to variability among subjects’ immune systems. We present a robust Bayesian hierarchical model for peptide microarray experiments, pepBayes, to estimate the probability of antibody response for each subject/peptide combination. Heavy-tailed error distributions accommodate outliers and extreme responses, and tailored random effect terms automatically incorporate technical effects prevalent in the assay. We apply our model to two vaccine trial datasets to demonstrate model performance. Our approach enjoys high sensitivity and specificity when detecting vaccine induced antibody responses. A simulation study shows an adaptive thresholding classification method has appropriate false discovery rate control with high sensitivity, and receiver operating characteristics generated on vaccine trial data suggest that pepBayes clearly separates responses from non-responses. PMID:27061097
Ontology-based meta-analysis of global collections of high-throughput public data.
Kupershmidt, Ilya; Su, Qiaojuan Jane; Grewal, Anoop; Sundaresh, Suman; Halperin, Inbal; Flynn, James; Shekar, Mamatha; Wang, Helen; Park, Jenny; Cui, Wenwu; Wall, Gregory D; Wisotzkey, Robert; Alag, Satnam; Akhtari, Saeid; Ronaghi, Mostafa
2010-09-29
The investigation of the interconnections between the molecular and genetic events that govern biological systems is essential if we are to understand the development of disease and design effective novel treatments. Microarray and next-generation sequencing technologies have the potential to provide this information. However, taking full advantage of these approaches requires that biological connections be made across large quantities of highly heterogeneous genomic datasets. Leveraging the increasingly huge quantities of genomic data in the public domain is fast becoming one of the key challenges in the research community today. We have developed a novel data mining framework that enables researchers to use this growing collection of public high-throughput data to investigate any set of genes or proteins. The connectivity between molecular states across thousands of heterogeneous datasets from microarrays and other genomic platforms is determined through a combination of rank-based enrichment statistics, meta-analyses, and biomedical ontologies. We address data quality concerns through dataset replication and meta-analysis and ensure that the majority of the findings are derived using multiple lines of evidence. As an example of our strategy and the utility of this framework, we apply our data mining approach to explore the biology of brown fat within the context of the thousands of publicly available gene expression datasets. Our work presents a practical strategy for organizing, mining, and correlating global collections of large-scale genomic data to explore normal and disease biology. Using a hypothesis-free approach, we demonstrate how a data-driven analysis across very large collections of genomic data can reveal novel discoveries and evidence to support existing hypotheses.
Zhong, Qing; Guo, Tiannan; Rechsteiner, Markus; Rüschoff, Jan H.; Rupp, Niels; Fankhauser, Christian; Saba, Karim; Mortezavi, Ashkan; Poyet, Cédric; Hermanns, Thomas; Zhu, Yi; Moch, Holger; Aebersold, Ruedi; Wild, Peter J.
2017-01-01
Microscopy image data of human cancers provide detailed phenotypes of spatially and morphologically intact tissues at single-cell resolution, thus complementing large-scale molecular analyses, e.g., next generation sequencing or proteomic profiling. Here we describe a high-resolution tissue microarray (TMA) image dataset from a cohort of 71 prostate tissue samples, which was hybridized with bright-field dual colour chromogenic and silver in situ hybridization probes for the tumour suppressor gene PTEN. These tissue samples were digitized and supplemented with expert annotations, clinical information, statistical models of PTEN genetic status, and computer source codes. For validation, we constructed an additional TMA dataset for 424 prostate tissues, hybridized with FISH probes for PTEN, and performed survival analysis on a subset of 339 radical prostatectomy specimens with overall, disease-specific and recurrence-free survival (maximum 167 months). For application, we further produced 6,036 image patches derived from two whole slides. Our curated collection of prostate cancer data sets provides reuse potential for both biomedical and computational studies. PMID:28291248
Fuertes Marraco, Silvia A; Soneson, Charlotte; Delorenzi, Mauro; Speiser, Daniel E
2015-09-01
The live-attenuated Yellow Fever (YF) vaccine YF-17D induces a broad and polyfunctional CD8 T cell response in humans. Recently, we identified a population of stem cell-like memory CD8 T cells induced by YF-17D that persists at stable frequency for at least 25 years after vaccination. YF-17D is thus a model system of human CD8 T cell biology that also makes it possible to track and study long-lasting, antigen-specific human memory CD8 T cells. Here, we describe in detail the sample characteristics and preparation of a microarray dataset acquired for genome-wide gene expression profiling of long-lasting YF-specific stem cell-like memory CD8 T cells, compared to the reference CD8 T cell differentiation subsets from total CD8 T cells. We also describe the quality controls, annotations and exploratory analyses of the dataset. The microarray data are available from the Gene Expression Omnibus (GEO) public repository with accession number GSE65804.
Kunz, Meik; Dandekar, Thomas; Naseem, Muhammad
2017-01-01
Cytokinins (CKs) play an important role in plant growth and development. Several studies also highlight the modulatory implications of CKs for plant-pathogen interaction. However, the mechanisms underlying CK-mediated immune networks in plants are still not fully understood. A detailed analysis of high-throughput transcriptome (RNA-Seq and microarray) datasets under modulated conditions of plant CKs, merged with the cellular interactome (large-scale protein-protein interaction data), has the potential to unlock the contribution of CKs to plant defense. Here, we specifically describe a detailed systems biology methodology pertinent to the acquisition and analysis of various omics datasets that delineate the role of plant CKs in impacting immune pathways in Arabidopsis.
Translating standards into practice - one Semantic Web API for Gene Expression.
Deus, Helena F; Prud'hommeaux, Eric; Miller, Michael; Zhao, Jun; Malone, James; Adamusiak, Tomasz; McCusker, Jim; Das, Sudeshna; Rocca Serra, Philippe; Fox, Ronan; Marshall, M Scott
2012-08-01
Sharing and describing experimental results unambiguously with sufficient detail to enable replication of results is a fundamental tenet of scientific research. In today's cluttered world of "-omics" sciences, data standards and standardized use of terminologies and ontologies for biomedical informatics play an important role in reporting high-throughput experiment results in formats that can be interpreted by both researchers and analytical tools. Increasing adoption of Semantic Web and Linked Data technologies for the integration of heterogeneous and distributed health care and life sciences (HCLSs) datasets has made the reuse of standards even more pressing; dynamic semantic query federation can be used for integrative bioinformatics when ontologies and identifiers are reused across data instances. We present here a methodology to integrate the results and experimental context of three different representations of microarray-based transcriptomic experiments: the Gene Expression Atlas, the W3C BioRDF task force approach to reporting Provenance of Microarray Experiments, and the HSCI blood genomics project. Our approach does not attempt to improve the expressivity of existing standards for genomics but, instead, to enable integration of existing datasets published from microarray-based transcriptomic experiments. SPARQL Construct is used to create a posteriori mappings of concepts and properties and linking rules that match entities based on query constraints. We discuss how our integrative approach can encourage reuse of the Experimental Factor Ontology (EFO) and the Ontology for Biomedical Investigations (OBIs) for the reporting of experimental context and results of gene expression studies. Copyright © 2012 Elsevier Inc. All rights reserved.
TIPMaP: a web server to establish transcript isoform profiles from reliable microarray probes.
Chitturi, Neelima; Balagannavar, Govindkumar; Chandrashekar, Darshan S; Abinaya, Sadashivam; Srini, Vasan S; Acharya, Kshitish K
2013-12-27
Standard 3' Affymetrix gene expression arrays have contributed a significantly higher volume of existing gene expression data than other microarray platforms. These arrays were designed to identify differentially expressed genes, but not their alternatively spliced transcript forms. No resource can currently identify the expression pattern of specific mRNA forms using these microarray data, even though it is possible to do this. We report a web server for expression profiling of alternatively spliced transcripts using microarray data sets from 31 standard 3' Affymetrix arrays for human, mouse and rat species. The tool has been experimentally validated for mRNAs transcribed or not-detected in a human disease condition (non-obstructive azoospermia, a male infertility condition). About 4000 gene expression datasets were downloaded from a public repository. 'Good probes' with complete coverage and identity to latest reference transcript sequences were first identified. Using them, 'Transcript specific probe-clusters' were derived for each platform and used to identify expression status of possible transcripts. The web server can lead the user to datasets corresponding to specific tissues and conditions via identifiers of the microarray studies or hybridizations, keywords, official gene symbols or reference transcript identifiers. It can identify, in the tissues and conditions of interest, about 40% of known transcripts as 'transcribed', 'not-detected' or 'differentially regulated'. Corresponding additional information for probes, genes, transcripts and proteins can be viewed too. We identified the expression of transcripts in a specific clinical condition and validated a few of these transcripts by experiments (using reverse transcription followed by polymerase chain reaction). The experimental observations agreed with the web server results more often than they contradicted them. The tool is accessible at http://resource.ibab.ac.in/TIPMaP.
The newly developed online tool forms a reliable means for identification of alternatively spliced transcript-isoforms that may be differentially expressed in various tissues, cell types or physiological conditions. Thus, by making better use of existing data, TIPMaP avoids the dependence on precious tissue-samples, in experiments with a goal to establish expression profiles of alternative splice forms--at least in some cases.
Khan, Haseeb Ahmad
2004-01-01
The massive surge in the production of microarray data poses a great challenge for proper analysis and interpretation. In recent years numerous computational tools have been developed to extract meaningful interpretation of microarray gene expression data. However, a convenient tool for two-groups comparison of microarray data is still lacking and users have to rely on commercial statistical packages that might be costly and require special skills, in addition to extra time and effort for transferring data from one platform to another. Various statistical methods, including the t-test, analysis of variance, Pearson test and Mann-Whitney U test, have been reported for comparing microarray data, whereas the utilization of the Wilcoxon signed-rank test, which is an appropriate test for two-groups comparison of gene expression data, has largely been neglected in microarray studies. The aim of this investigation was to build an integrated tool, ArraySolver, for colour-coded graphical display and comparison of gene expression data using the Wilcoxon signed-rank test. The results of software validation showed similar outputs with ArraySolver and SPSS for large datasets, whereas the former program appeared to be more accurate for 25 or fewer pairs (n ≤ 25), suggesting its potential application in analysing molecular signatures that usually contain small numbers of genes. The main advantages of ArraySolver are easy data selection, convenient report format, accurate statistics and the familiar Excel platform.
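The Wilcoxon signed-rank test referred to above can be sketched in a few lines for small numbers of pairs. This is an illustrative implementation with an exact two-sided p-value obtained by enumerating all sign assignments; it is not ArraySolver's code, the paired values are hypothetical, and the enumeration grows as 2^n, so it is practical only for small n.

```python
from itertools import product


def signed_ranks(before, after):
    """Non-zero paired differences and the average ranks of their absolute values."""
    diffs = [a - b for b, a in zip(before, after) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return diffs, ranks


def wilcoxon_signed_rank(before, after):
    """Return (W, exact two-sided p) where W = min(W+, W-)."""
    diffs, ranks = signed_ranks(before, after)
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    w = min(w_plus, w_minus)
    total = sum(ranks)
    hits = 0
    # Under the null hypothesis every sign pattern is equally likely.
    for signs in product((0, 1), repeat=len(ranks)):
        s = sum(r for r, positive in zip(ranks, signs) if positive)
        if min(s, total - s) <= w:
            hits += 1
    return w, hits / 2 ** len(ranks)


# Hypothetical expression of six genes in two paired conditions.
before = [10, 12, 9, 14, 11, 13]
after = [13, 17, 10, 20, 15, 15]
w, p = wilcoxon_signed_rank(before, after)
```

With all six differences positive, W is 0 and only the two most extreme sign patterns out of 64 are at least as extreme, giving p = 2/64.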
2004-01-01
The massive surge in the production of microarray data poses a great challenge for proper analysis and interpretation. In recent years numerous computational tools have been developed to extract meaningful interpretation of microarray gene expression data. However, a convenient tool for two-groups comparison of microarray data is still lacking and users have to rely on commercial statistical packages that might be costly and require special skills, in addition to extra time and effort for transferring data from one platform to another. Various statistical methods, including the t-test, analysis of variance, Pearson test and Mann–Whitney U test, have been reported for comparing microarray data, whereas the utilization of the Wilcoxon signed-rank test, which is an appropriate test for two-groups comparison of gene expression data, has largely been neglected in microarray studies. The aim of this investigation was to build an integrated tool, ArraySolver, for colour-coded graphical display and comparison of gene expression data using the Wilcoxon signed-rank test. The results of software validation showed similar outputs with ArraySolver and SPSS for large datasets, whereas the former program appeared to be more accurate for 25 or fewer pairs (n ≤ 25), suggesting its potential application in analysing molecular signatures that usually contain small numbers of genes. The main advantages of ArraySolver are easy data selection, convenient report format, accurate statistics and the familiar Excel platform. PMID:18629036
Hou, Qi; Bing, Zhi-Tong; Hu, Cheng; Li, Mao-Yin; Yang, Ke-Hu; Mo, Zu; Xie, Xiang-Wei; Liao, Ji-Lin; Lu, Yan; Horie, Shigeo; Lou, Ming-Wu
2018-06-01
Prostate cancer (PCa) is the most commonly diagnosed cancer in males in the Western world. Although prostate-specific antigen (PSA) has been widely used as a biomarker for PCa diagnosis, its results can be controversial. Therefore, new biomarkers are needed to enhance the clinical management of PCa. From publicly available microarray data, differentially expressed genes (DEGs) were identified by meta-analysis with RankProd. A genetic algorithm-optimized artificial neural network (GA-ANN) was introduced to establish a diagnostic prediction model and to filter candidate genes. The diagnostic and prognostic capability of the prediction model and candidate genes was investigated in both GEO and TCGA datasets. Candidate genes were further validated by qPCR, Western Blot and Tissue microarray. By RankProd meta-analyses, 2306 significantly up-regulated and 1311 down-regulated probes were found in microarray data from 133 cases and 30 controls. The overall accuracy rate of the PCa diagnostic prediction model, consisting of a 15-gene signature, reached up to 100% in both the training and test datasets. The prediction model also showed good results for the diagnosis (AUC = 0.953) and prognosis (AUC of 5 years overall survival time = 0.808) of PCa in the TCGA database. The expression levels of three genes, FABP5, C1QTNF3 and LPHN3, were validated by qPCR. C1QTNF3 high expression was further validated in PCa tissue by Western Blot and Tissue microarray. In the GEO datasets, C1QTNF3 was a good predictor for the diagnosis of PCa (GSE6956: AUC = 0.791; GSE8218: AUC = 0.868; GSE26910: AUC = 0.972). In the TCGA database, C1QTNF3 was significantly associated with PCa patient recurrence free survival (P < .001, AUC = 0.57). In this study, we have developed a diagnostic and prognostic prediction model for PCa. C1QTNF3 was revealed as a promising biomarker for PCa.
This approach can be applied to other high-throughput data from different platforms for the discovery of oncogenes or biomarkers in different kinds of diseases. Copyright © 2018. Published by Elsevier B.V.
Hurley, Daniel; Araki, Hiromitsu; Tamada, Yoshinori; Dunmore, Ben; Sanders, Deborah; Humphreys, Sally; Affara, Muna; Imoto, Seiya; Yasuda, Kaori; Tomiyasu, Yuki; Tashiro, Kosuke; Savoie, Christopher; Cho, Vicky; Smith, Stephen; Kuhara, Satoru; Miyano, Satoru; Charnock-Jones, D. Stephen; Crampin, Edmund J.; Print, Cristin G.
2012-01-01
Gene regulatory networks inferred from RNA abundance data have generated significant interest, but despite this, gene network approaches are used infrequently and often require input from bioinformaticians. We have assembled a suite of tools for analysing regulatory networks, and we illustrate their use with microarray datasets generated in human endothelial cells. We infer a range of regulatory networks, and based on this analysis discuss the strengths and limitations of network inference from RNA abundance data. We welcome contact from researchers interested in using our inference and visualization tools to answer biological questions. PMID:22121215
Ensemble analyses improve signatures of tumour hypoxia and reveal inter-platform differences
2014-01-01
Background The reproducibility of transcriptomic biomarkers across datasets remains poor, limiting clinical application. We and others have suggested that this is in part caused by differential error structure between datasets, and its incomplete removal by pre-processing algorithms. Methods To test this hypothesis, we systematically assessed the effects of pre-processing on biomarker classification using 24 different pre-processing methods and 15 distinct signatures of tumour hypoxia in 10 datasets (2,143 patients). Results We confirm strong pre-processing effects for all datasets and signatures, and find that these differ between microarray versions. Importantly, exploiting different pre-processing techniques in an ensemble technique improved classification for a majority of signatures. Conclusions Assessing biomarkers using an ensemble of pre-processing techniques shows clear value across multiple diseases, datasets and biomarkers. Importantly, ensemble classification improves biomarkers with initially good results but does not result in spuriously improved performance for poor biomarkers. While further research is required, this approach has the potential to become a standard for transcriptomic biomarkers. PMID:24902696
Speeding up the Consensus Clustering methodology for microarray data analysis
2011-01-01
Background The inference of the number of clusters in a dataset, a fundamental problem in Statistics, Data Analysis and Classification, is usually addressed via internal validation measures. The stated problem is quite difficult, in particular for microarrays, since the inferred prediction must be sensible enough to capture the inherent biological structure in a dataset, e.g., functionally related genes. Despite the rich literature present in that area, the identification of an internal validation measure that is both fast and precise has proved to be elusive. In order to partially fill this gap, we propose a speed-up of Consensus (Consensus Clustering), a methodology whose purpose is the provision of a prediction of the number of clusters in a dataset, together with a dissimilarity matrix (the consensus matrix) that can be used by clustering algorithms. As detailed in the remainder of the paper, Consensus is a natural candidate for a speed-up. Results Since the time-precision performance of Consensus depends on two parameters, our first task is to show that a simple adjustment of the parameters is not enough to obtain a good precision-time trade-off. Our second task is to provide a fast approximation algorithm for Consensus: the closely related algorithm FC (Fast Consensus), which would have the same precision as Consensus with substantially better time performance. The performance of FC has been assessed via extensive experiments on twelve benchmark datasets that summarize key features of microarray applications, such as cancer studies, gene expression with up and down patterns, and a full spectrum of dimensionality up to over a thousand. Based on their outcome, compared with previous benchmarking results available in the literature, FC turns out to be among the fastest internal validation methods, while retaining the same outstanding precision of Consensus.
Moreover, it also provides a consensus matrix that can be used as a dissimilarity matrix, guaranteeing the same performance as the corresponding matrix produced by Consensus. We have also experimented with the use of Consensus and FC in conjunction with NMF (Nonnegative Matrix Factorization), in order to identify the correct number of clusters in a dataset. Although NMF is an increasingly popular technique for biological data mining, our results are somewhat disappointing and complement quite well the state of the art about NMF, shedding further light on its merits and limitations. Conclusions In summary, FC with a parameter setting that makes it robust with respect to small and medium-sized datasets, i.e., number of items to cluster in the hundreds and number of conditions up to a thousand, seems to be the internal validation measure of choice. Moreover, the technique we have developed here can be used in other contexts, in particular for the speed-up of stability-based validation measures. PMID:21235792
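The consensus matrix at the core of Consensus Clustering can be illustrated with a small sketch: the data are repeatedly subsampled and clustered, and for each pair of items the fraction of shared subsamples in which they land in the same cluster is recorded. The code below shows only this general resampling idea, not the FC speed-up, whose details are not given in the abstract; the toy clustering rule, the parameter values and the data are all assumptions.

```python
import random


def consensus_matrix(items, cluster_fn, n_runs=50, fraction=0.8, seed=0):
    """Fraction of co-sampled runs in which each pair of items shares a cluster."""
    rng = random.Random(seed)
    n = len(items)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    size = max(2, int(n * fraction))
    for _ in range(n_runs):
        idx = rng.sample(range(n), size)
        labels = cluster_fn([items[i] for i in idx])
        for a, label_a in zip(idx, labels):
            for b, label_b in zip(idx, labels):
                sampled[a][b] += 1
                if label_a == label_b:
                    together[a][b] += 1
    return [
        [together[i][j] / sampled[i][j] if sampled[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


def split_at_largest_gap(values):
    """Toy two-cluster rule for 1-D data: cut at the widest gap in sorted order."""
    ordered = sorted(values)
    gaps = [(ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1)]
    _, cut = max(gaps)
    threshold = (ordered[cut] + ordered[cut + 1]) / 2
    return [0 if v < threshold else 1 for v in values]


# Hypothetical expression values forming two well-separated groups.
expression = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]
matrix = consensus_matrix(expression, split_at_largest_gap)
```

Entries near 1 indicate pairs that are consistently clustered together, and subtracting each entry from 1 gives a dissimilarity that other clustering algorithms can consume. In the full methodology this matrix is computed for each candidate number of clusters and the most stable one is preferred.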
TAM: a method for enrichment and depletion analysis of a microRNA category in a list of microRNAs.
Lu, Ming; Shi, Bing; Wang, Juan; Cao, Qun; Cui, Qinghua
2010-08-09
MicroRNAs (miRNAs) are a class of important gene regulators. The number of identified miRNAs has been increasing dramatically in recent years. An emerging major challenge is the interpretation of genome-scale miRNA datasets, including those derived from microarray and deep-sequencing. It is interesting and important to know the common rules or patterns behind a list of miRNAs (e.g., the deregulated miRNAs resulting from a miRNA microarray or deep-sequencing experiment). For this purpose, this study presents a method and develops a tool (TAM) for annotating meaningful human miRNA categories. We first integrated miRNAs into various meaningful categories according to prior knowledge, such as miRNA family, miRNA cluster, miRNA function, miRNA-associated diseases, and tissue specificity. Using TAM, given lists of miRNAs can be rapidly annotated and summarized according to the integrated miRNA categorical data. Moreover, given a list of miRNAs, TAM can be used to predict novel related miRNAs. Finally, we confirmed the usefulness and reliability of TAM by applying it to deregulated miRNAs in acute myocardial infarction (AMI) from two independent experiments. TAM can efficiently identify meaningful categories for given miRNAs. In addition, TAM can be used to identify novel miRNA biomarkers. The TAM tool, source code, and miRNA category data are freely available at http://cmbi.bjmu.edu.cn/tam.
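The abstract does not state which statistic TAM uses, so the following is only a minimal sketch of how enrichment and depletion of one miRNA category in a query list could be scored, assuming a hypergeometric model. The function names and the toy labels used with them are hypothetical and are not part of TAM.

```python
from math import comb


def hypergeom_tail(k, K, n, N):
    """P(X >= k) when n items are drawn from N, of which K belong to the category."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def enrichment(query, category, background):
    """Return (overlap, p_enrichment, p_depletion) for one category.

    Assumption: the hypergeometric test stands in for whatever TAM
    actually computes; the source abstract does not specify it.
    """
    background = set(background)
    query = set(query) & background
    category = set(category) & background
    k = len(query & category)
    K, n, N = len(category), len(query), len(background)
    p_enrich = hypergeom_tail(k, K, n, N)            # P(X >= k)
    p_deplete = 1.0 - hypergeom_tail(k + 1, K, n, N)  # P(X <= k)
    return k, p_enrich, p_deplete
```

Calling `enrichment(query, category, background)` once per category and then correcting the resulting p-values for multiple testing would give a ranked list of over- and under-represented categories.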
Rafehi, Haloom; Kaspi, Antony; Ziemann, Mark; Okabe, Jun; Karagiannis, Tom C; El-Osta, Assam
2017-01-01
Given the skyrocketing costs to develop new drugs, repositioning of approved drugs, such as histone deacetylase (HDAC) inhibitors, may be a promising strategy to develop novel therapies. However, a gap exists in the understanding and advancement of these agents to meaningful translation for which new indications may emerge. To address this, we performed systems-level analyses of 33 independent HDAC inhibitor microarray studies. Based on network analysis, we identified enrichment for pathways implicated in metabolic syndrome and diabetes (insulin receptor signaling, lipid metabolism, immunity and trafficking). Integration with ENCODE ChIP-seq datasets identified suppression of EP300 target genes implicated in diabetes. Experimental validation indicates reversal of diabetes-associated EP300 target genes in primary vascular endothelial cells derived from a diabetic individual following inhibition of HDACs (by SAHA), EP300, or EP300 knockdown. Our computational systems biology approach provides an adaptable framework for the prediction of novel therapeutics for existing disease.
Pantazatos, Spiro P.; Li, Jianrong; Pavlidis, Paul; Lussier, Yves A.
2009-01-01
An approach towards heterogeneous neuroscience dataset integration is proposed that uses Natural Language Processing (NLP) and a knowledge-based phenotype organizer system (PhenOS) to link ontology-anchored terms to underlying data from each database, and then maps these terms based on a computable model of disease (SNOMED CT®). The approach was implemented using sample datasets from fMRIDC, GEO, The Whole Brain Atlas and Neuronames, and allowed for complex queries such as “List all disorders with a finding site of brain region X, and then find the semantically related references in all participating databases based on the ontological model of the disease or its anatomical and morphological attributes”. Precision of the NLP-derived coding of the unstructured phenotypes in each dataset was 88% (n = 50), and precision of the semantic mapping between these terms across datasets was 98% (n = 100). To our knowledge, this is the first example of the use of both semantic decomposition of disease relationships and hierarchical information found in ontologies to integrate heterogeneous phenotypes across clinical and molecular datasets. PMID:20495688
CEM-designer: design of custom expression microarrays in the post-ENCODE Era.
Arnold, Christian; Externbrink, Fabian; Hackermüller, Jörg; Reiche, Kristin
2014-11-10
Microarrays are widely used in gene expression studies, and custom expression microarrays are popular to monitor expression changes of a customer-defined set of genes. However, the complexity of transcriptomes uncovered recently makes custom expression microarray design a non-trivial task. Pervasive transcription and alternative processing of transcripts generate a wealth of interwoven transcripts, which requires well-considered probe design strategies and is largely neglected in existing approaches. We developed the web server CEM-Designer that facilitates microarray-platform-independent design of custom expression microarrays for complex transcriptomes. CEM-Designer covers (i) the collection and generation of a set of unique target sequences from different sources and (ii) the selection of a set of sensitive and specific probes that optimally represents the target sequences. Probe design itself is left to third party software to ensure that probes meet provider-specific constraints. CEM-Designer is available at http://designpipeline.bioinf.uni-leipzig.de. Copyright © 2014 Elsevier B.V. All rights reserved.
Estimating differential expression from multiple indicators
Ilmjärv, Sten; Hundahl, Christian Ansgar; Reimets, Riin; Niitsoo, Margus; Kolde, Raivo; Vilo, Jaak; Vasar, Eero; Luuk, Hendrik
2014-01-01
Despite the advent of high-throughput sequencing, microarrays remain central in current biomedical research. Conventional microarray analysis pipelines apply data reduction before the estimation of differential expression, which is likely to render the estimates susceptible to noise from signal summarization and reduce statistical power. We present a probe-level framework, which capitalizes on the high number of concurrent measurements to provide more robust differential expression estimates. The framework naturally extends to various experimental designs and target categories (e.g. transcripts, genes, genomic regions) as well as small sample sizes. Benchmarking in relation to popular microarray and RNA-sequencing data-analysis pipelines indicated high and stable performance on the Microarray Quality Control dataset and in a cell-culture model of hypoxia. Experimental data exhibiting long-range epigenetic silencing of gene expression was used to demonstrate the efficacy of detecting differential expression of genomic regions, a level of analysis not embraced by conventional workflows. Finally, we designed and conducted an experiment to identify hypothermia-responsive genes in terms of monotonic time-response. As a novel insight, hypothermia-dependent up-regulation of multiple genes of two major antioxidant pathways was identified and verified by quantitative real-time PCR. PMID:24586062
Tomato Expression Database (TED): a suite of data presentation and analysis tools
Fei, Zhangjun; Tang, Xuemei; Alba, Rob; Giovannoni, James
2006-01-01
The Tomato Expression Database (TED) includes three integrated components. The Tomato Microarray Data Warehouse serves as a central repository for raw gene expression data derived from the public tomato cDNA microarray. In addition to expression data, TED stores experimental design and array information in compliance with the MIAME guidelines and provides web interfaces for researchers to retrieve data for their own analysis and use. The Tomato Microarray Expression Database contains normalized and processed microarray data for ten time points with nine pair-wise comparisons during fruit development and ripening in a normal tomato variety and nearly isogenic single gene mutants impacting fruit development and ripening. Finally, the Tomato Digital Expression Database contains raw and normalized digital expression (EST abundance) data derived from analysis of the complete public tomato EST collection containing >150,000 ESTs derived from 27 different non-normalized EST libraries. This last component also includes tools for the comparison of tomato and Arabidopsis digital expression data. A set of query interfaces and analysis and visualization tools have been developed and incorporated into TED, which aid users in identifying and deciphering biologically important information from our datasets. TED can be accessed at http://ted.bti.cornell.edu. PMID:16381976
Ling, Zhi-Qiang; Wang, Yi; Mukaisho, Kenichi; Hattori, Takanori; Tatsuta, Takeshi; Ge, Ming-Hua; Jin, Li; Mao, Wei-Min; Sugihara, Hiroyuki
2010-06-01
Tests of differentially expressed genes (DEGs) from microarray experiments are based on the null hypothesis that genes that are irrelevant to the phenotype/stimulus are expressed equally in the target and control samples. However, this strict hypothesis is not always true, as there can be several transcriptomic background differences between target and control samples, including different cell/tissue types, different cell cycle stages and different biological donors. These differences lead to increased false positives, which have little biological/medical significance. In this article, we propose a statistical framework to identify DEGs between target and control samples from expression microarray data allowing transcriptomic background differences between these samples by introducing a modified null hypothesis that the gene expression background difference is normally distributed. We use an iterative procedure to perform robust estimation of the null hypothesis and identify DEGs as outliers. We evaluated our method using our own triplicate microarray experiment, followed by validations with reverse transcription-polymerase chain reaction (RT-PCR) and on the MicroArray Quality Control dataset. The evaluations suggest that our technique (i) results in fewer false positive and false negative results, as measured by the degree of agreement with RT-PCR of the same samples, (ii) can be applied to different microarray platforms and results in better reproducibility as measured by the degree of DEG identification concordance both intra- and inter-platforms and (iii) can be applied efficiently with only a few microarray replicates. Based on these evaluations, we propose that this method not only identifies more reliable and biologically/medically significant DEGs, but also reduces the power-cost tradeoff problem in the microarray field. Source code and binaries are freely available for download at http://comonca.org.cn/fdca/resources/softwares/deg.zip.
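The idea of fitting a normally distributed background difference iteratively and calling DEGs as outliers can be illustrated with a short sketch. It assumes simple iterative trimming on per-gene log-ratios with a z-score cut-off; the function name, the `z_cut` parameter and the stopping rule are assumptions for illustration, not the authors' published procedure.

```python
from statistics import mean, stdev


def find_deg_outliers(log_ratios, z_cut=3.0, max_iter=50):
    """Iteratively fit a normal background and flag genes outside it.

    log_ratios: one target-vs-control log-ratio per gene.
    Returns (outlier_indices, background_mean, background_sd).
    Assumption: plain mean/sd trimming stands in for the robust
    estimator used in the paper, which the abstract does not detail.
    """
    n = len(log_ratios)
    inliers = set(range(n))
    mu = sigma = 0.0
    for _ in range(max_iter):
        if len(inliers) < 2:
            break  # too few genes left to estimate a spread
        values = [log_ratios[i] for i in inliers]
        mu, sigma = mean(values), stdev(values)
        # Re-classify every gene against the current background estimate.
        new_inliers = {i for i in range(n)
                       if abs(log_ratios[i] - mu) <= z_cut * sigma}
        if new_inliers == inliers:
            break  # estimate has stabilised
        inliers = new_inliers
    outliers = sorted(set(range(n)) - inliers)
    return outliers, mu, sigma
```

Because the background mean is estimated rather than fixed at zero, a uniform shift between target and control samples is absorbed into the null model instead of producing false positives.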
Chavan, Shweta S; Bauer, Michael A; Peterson, Erich A; Heuck, Christoph J; Johann, Donald J
2013-01-01
Transcriptome analysis by microarrays has produced important advances in biomedicine. For instance in multiple myeloma (MM), microarray approaches led to the development of an effective disease subtyping via cluster assignment, and a 70 gene risk score. Both enabled an improved molecular understanding of MM, and have provided prognostic information for the purposes of clinical management. Many researchers are now transitioning to Next Generation Sequencing (NGS) approaches and RNA-seq in particular, due to its discovery-based nature, improved sensitivity, and dynamic range. Additionally, RNA-seq allows for the analysis of gene isoforms, splice variants, and novel gene fusions. Given the voluminous amounts of historical microarray data, there is now a need to associate and integrate microarray and RNA-seq data via advanced bioinformatic approaches. Custom software was developed following a model-view-controller (MVC) approach to integrate Affymetrix probe set-IDs, and gene annotation information from a variety of sources. The tool/approach employs an assortment of strategies to integrate, cross reference, and associate microarray and RNA-seq datasets. Output from a variety of transcriptome reconstruction and quantitation tools (e.g., Cufflinks) can be directly integrated, and/or associated with Affymetrix probe set data, as well as necessary gene identifiers and/or symbols from a diversity of sources. Strategies are employed to maximize the annotation and cross referencing process. Custom gene sets (e.g., MM 70 risk score (GEP-70)) can be specified, and the tool can be directly assimilated into an RNA-seq pipeline. A novel bioinformatic approach to aid in the facilitation of both annotation and association of historic microarray data, in conjunction with richer RNA-seq data, is now assisting with the study of MM cancer biology.
Discovering monotonic stemness marker genes from time-series stem cell microarray data.
Wang, Hsei-Wei; Sun, Hsing-Jen; Chang, Ting-Yu; Lo, Hung-Hao; Cheng, Wei-Chung; Tseng, George C; Lin, Chin-Teng; Chang, Shing-Jyh; Pal, Nikhil; Chung, I-Fang
2015-01-01
Identification of genes with ascending or descending monotonic expression patterns over time or stages of stem cells is an important issue in time-series microarray data analysis. We propose a method named Monotonic Feature Selector (MFSelector) based on a concept of total discriminating error (DEtotal) to identify monotonic genes. MFSelector considers various time stages in stage order (i.e., Stage One vs. other stages, Stages One and Two vs. remaining stages and so on) and computes DEtotal of each gene. MFSelector can successfully identify genes with monotonic characteristics. We have demonstrated the effectiveness of MFSelector on two synthetic data sets and two stem cell differentiation data sets: embryonic stem cell neurogenesis (ESCN) and embryonic stem cell vasculogenesis (ESCV) data sets. We have also performed extensive quantitative comparisons of the three monotonic gene selection approaches. Some of the monotonic marker genes such as OCT4, NANOG, BLBP, discovered from the ESCN dataset exhibit consistent behavior with that reported in other studies. The role of monotonic genes found by MFSelector in either stemness or differentiation is validated using information obtained from Gene Ontology analysis and other literature. We justify and demonstrate that descending genes are involved in the proliferation or self-renewal activity of stem cells, while ascending genes are involved in differentiation of stem cells into variant cell lineages. We have developed a novel system, easy to use even with no pre-existing knowledge, to identify gene sets with monotonic expression patterns in multi-stage as well as in time-series genomics matrices. The case studies on ESCN and ESCV have helped to get a better understanding of stemness and differentiation. The novel monotonic marker genes discovered from a data set are found to exhibit consistent behavior in another independent data set, demonstrating the utility of the proposed method. 
The MFSelector R function and data sets can be downloaded from: http://microarray.ym.edu.tw/tools/MFSelector/.
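The stage-order comparison described above (Stage One vs. the rest, Stages One and Two vs. the rest, and so on) can be sketched as follows. The abstract does not define the discriminating error, so this sketch assumes it is the smallest number of samples falling on the wrong side of a single cut-off at each split, summed over all splits. The names `split_error` and `de_total` are hypothetical and are not the MFSelector R API.

```python
def split_error(early, late):
    """Fewest samples on the wrong side of any single cut-off for an
    ascending pattern (early values expected at or below the cut-off,
    late values above it)."""
    best = min(len(early), len(late))  # cut-off placed outside the data range
    for t in sorted(set(early) | set(late)):
        errors = sum(x > t for x in early) + sum(x <= t for x in late)
        best = min(best, errors)
    return best


def de_total(stages):
    """Sum of split errors over every stage-ordered split.

    stages: list of lists, the expression values of one gene per stage,
    given in stage order. A value of 0 means perfectly ascending.
    """
    total = 0
    for s in range(1, len(stages)):
        early = [x for stage in stages[:s] for x in stage]
        late = [x for stage in stages[s:] for x in stage]
        total += split_error(early, late)
    return total


def de_total_descending(stages):
    """Score a descending pattern by negating the values."""
    return de_total([[-x for x in stage] for stage in stages])
```

Ranking genes by the smaller of the two scores would put the most cleanly monotonic genes first under this assumed definition.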
Hierarchical Gene Selection and Genetic Fuzzy System for Cancer Microarray Data Classification
Nguyen, Thanh; Khosravi, Abbas; Creighton, Douglas; Nahavandi, Saeid
2015-01-01
This paper introduces a novel approach to gene selection based on a substantial modification of analytic hierarchy process (AHP). The modified AHP systematically integrates outcomes of individual filter methods to select the most informative genes for microarray classification. Five individual ranking methods including t-test, entropy, receiver operating characteristic (ROC) curve, Wilcoxon and signal to noise ratio are employed to rank genes. These ranked genes are then considered as inputs for the modified AHP. Additionally, a method that uses fuzzy standard additive model (FSAM) for cancer classification based on genes selected by AHP is also proposed in this paper. Traditional FSAM learning is a hybrid process comprising unsupervised structure learning and supervised parameter tuning. A genetic algorithm (GA) is incorporated in-between unsupervised and supervised training to optimize the number of fuzzy rules. The integration of GA enables FSAM to deal with the high-dimensional-low-sample nature of microarray data and thus enhance the efficiency of the classification. Experiments are carried out on numerous microarray datasets. Results demonstrate that AHP-based gene selection outperforms the single ranking methods. Furthermore, the combination of AHP-FSAM shows high accuracy in microarray data classification compared to various competing classifiers. The proposed approach therefore is useful for medical practitioners and clinicians as a decision support system that can be implemented in real medical practice. PMID:25823003
Implementation of spectral clustering on microarray data of carcinoma using k-means algorithm
NASA Astrophysics Data System (ADS)
Frisca; Bustamam, Alhadi; Siswantining, Titin
2017-03-01
Clustering is a data analysis method that aims to group data with similar characteristics together. Spectral clustering, one of the most popular modern clustering algorithms, emerged from the concepts of spectral graph theory and requires a partitioning algorithm. Available partitioning methods include PAM, SOM, fuzzy c-means, and k-means. According to research by Capital and Choudhury in 2013, the k-means algorithm provides better accuracy than the PAM algorithm when Euclidean distance is used, so in this paper we use k-means as our partitioning algorithm. The major advantage of spectral clustering is that it reduces data dimension, in this case the dimension of a large microarray dataset. A microarray is a small chip made of a glass plate containing thousands or even tens of thousands of kinds of genes in DNA fragments derived from doubling cDNA. Microarray data are widely used to detect cancer, for example carcinoma, in which cancer cells express abnormalities in their genes. The purpose of this research is to place data with high similarity in the same group and data with low similarity in different groups. In this research, the carcinoma microarray data comprise 7457 genes. Partitioning with the k-means algorithm yields two clusters.
Harvey, Benjamin Simeon; Ji, Soo-Yeon
2017-01-01
As microarray data available to scientists continues to increase in size and complexity, it has become overwhelmingly important to find multiple ways to bring forth oncological inference to the bioinformatics community through the analysis of large-scale cancer genomic (LSCG) DNA and mRNA microarray data that is useful to scientists. Though there have been many attempts to elucidate the issue of bringing forth biological interpretation by means of wavelet preprocessing and classification, there has not been a research effort that focuses on a cloud-scale distributed parallel (CSDP) separable 1-D wavelet decomposition technique for denoising through differential expression thresholding and classification of LSCG microarray data. This research presents a novel methodology that utilizes a CSDP separable 1-D method for wavelet-based transformation in order to initialize a threshold which will retain significantly expressed genes through the denoising process for robust classification of cancer patients. Additionally, the overall study was implemented and encompassed within CSDP environment. The utilization of cloud computing and wavelet-based thresholding for denoising was used for the classification of samples within the Global Cancer Map, Cancer Cell Line Encyclopedia, and The Cancer Genome Atlas. The results proved that separable 1-D parallel distributed wavelet denoising in the cloud and differential expression thresholding increased the computational performance and enabled the generation of higher quality LSCG microarray datasets, which led to more accurate classification results.
Ghan, Ryan; Van Sluyter, Steven C; Hochberg, Uri; Degu, Asfaw; Hopper, Daniel W; Tillet, Richard L; Schlauch, Karen A; Haynes, Paul A; Fait, Aaron; Cramer, Grant R
2015-11-16
Grape cultivars and wines are distinguishable by their color, flavor and aroma profiles. Omic analyses (transcripts, proteins and metabolites) are powerful tools for assessing biochemical differences in biological systems. Berry skins of red- (Cabernet Sauvignon, Merlot, Pinot Noir) and white-skinned (Chardonnay, Semillon) wine grapes were harvested near optimum maturity (°Brix-to-titratable acidity ratio) from the same experimental vineyard. The cultivars were exposed to a mild, seasonal water-deficit treatment from fruit set until harvest in 2011. Identical sample aliquots were analyzed for transcripts by grapevine whole-genome oligonucleotide microarray and RNAseq technologies, proteins by nano-liquid chromatography-mass spectroscopy, and metabolites by gas chromatography-mass spectroscopy and liquid chromatography-mass spectroscopy. Principal components analysis of each of five Omic technologies showed similar results across cultivars in all Omic datasets. Comparison of the processed data of genes mapped in RNAseq and microarray data revealed a strong Pearson's correlation (0.80). The exclusion of probesets associated with genes with potential for cross-hybridization on the microarray improved the correlation to 0.93. The overall concordance of protein with transcript data was low with a Pearson's correlation of 0.27 and 0.24 for the RNAseq and microarray data, respectively. Integration of metabolite with protein and transcript data produced an expected model of phenylpropanoid biosynthesis, which distinguished red from white grapes, yet provided detail of individual cultivar differences. The mild water deficit treatment did not significantly alter the abundance of proteins or metabolites measured in the five cultivars, but did have a small effect on gene expression. The five Omic technologies were consistent in distinguishing cultivar variation. 
There was high concordance between transcriptomic technologies, but generally protein abundance did not correlate well with transcript abundance. The integration of multiple high-throughput Omic datasets revealed complex biochemical variation amongst five cultivars of an ancient and economically important crop species.
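The cross-platform comparison reported above, a Pearson correlation computed over genes shared by two platforms and then recomputed after excluding probesets prone to cross-hybridization, can be sketched as below. The function names and the dictionary-based input layout are assumptions for illustration; the study's own processing pipeline is not described in the abstract.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def platform_correlation(rnaseq, microarray, exclude=()):
    """Correlate two {gene: value} mappings over their shared genes.

    exclude: genes to skip, e.g. those flagged for potential
    cross-hybridization on the array (hypothetical flag list).
    """
    genes = sorted((set(rnaseq) & set(microarray)) - set(exclude))
    return pearson([rnaseq[g] for g in genes],
                   [microarray[g] for g in genes])
```

Comparing the value returned with and without the `exclude` list shows how much of the disagreement between platforms is attributable to the flagged probesets.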
Catto, James W F; Abbod, Maysam F; Wild, Peter J; Linkens, Derek A; Pilarsky, Christian; Rehman, Ishtiaq; Rosario, Derek J; Denzinger, Stefan; Burger, Maximilian; Stoehr, Robert; Knuechel, Ruth; Hartmann, Arndt; Hamdy, Freddie C
2010-03-01
New methods for identifying bladder cancer (BCa) progression are required. Gene expression microarrays can reveal insights into disease biology and identify novel biomarkers. However, these experiments produce large datasets that are difficult to interpret. Our objective was to develop a novel method of microarray analysis combining two forms of artificial intelligence (AI), neurofuzzy modelling (NFM) and artificial neural networks (ANN), and to validate it in a BCa cohort. We used AI and statistical analyses to identify progression-related genes in a microarray dataset (n=66 tumours, n=2800 genes). The AI-selected genes were then investigated in a second cohort (n=262 tumours) using immunohistochemistry. We compared the accuracy of AI and statistical approaches to identify tumour progression. AI identified 11 progression-associated genes (odds ratio [OR]: 0.70; 95% confidence interval [CI], 0.56-0.87; p=0.0004), and these were more discriminating than genes chosen using statistical analyses (OR: 1.24; 95% CI, 0.96-1.60; p=0.09). The expression of six AI-selected genes (LIG3, FAS, KRT18, ICAM1, DSG2, and BRCA2) was determined using commercial antibodies and successfully identified tumour progression (concordance index: 0.66; log-rank test: p=0.01). AI-selected genes were more discriminating than pathologic criteria at determining progression (Cox multivariate analysis: p=0.01). Limitations include the use of statistical correlation to identify 200 genes for AI analysis and that we did not compare regression-identified genes with immunohistochemistry. AI and statistical analyses use different techniques of inference to determine gene-phenotype associations and identify distinct prognostic gene signatures that are equally valid. We have identified a prognostic gene signature whose members reflect a variety of carcinogenic pathways that could identify progression in non-muscle-invasive BCa. 2009 European Association of Urology. Published by Elsevier B.V. All rights reserved.
Liu, Wan-Ting; Wang, Yang; Zhang, Jing; Ye, Fei; Huang, Xiao-Hui; Li, Bin; He, Qing-Yu
2018-07-01
Lung adenocarcinoma (LAC) is the most lethal cancer and the leading cause of cancer-related death worldwide. The identification of meaningful clusters of co-expressed genes or representative biomarkers may help improve the accuracy of LAC diagnoses. Public databases, such as the Gene Expression Omnibus (GEO), provide rich resources of valuable information for clinics; however, the integration of multiple microarray datasets from various platforms and institutes remains a challenge. To determine potential indicators of LAC, we performed genome-wide relative significance (GWRS), genome-wide global significance (GWGS) and support vector machine (SVM) analyses progressively to identify robust gene biomarker signatures from 5 different microarray datasets that included 330 samples. The top 200 genes with robust signatures were selected for integrative analysis according to "guilt-by-association" methods, including protein-protein interaction (PPI) analysis and gene co-expression analysis. Of these 200 genes, only 10 genes showed both an intensive PPI network and high gene co-expression correlation (r > 0.8). IPA analysis of this regulatory network suggested that the cell cycle process is a crucial determinant of LAC. CENPA, as well as two linked hub genes CDK1 and CDC20, were determined to be potential indicators of LAC. Immunohistochemical staining showed that CENPA, CDK1 and CDC20 were highly expressed in LAC cancer tissue with co-expression patterns. A Cox regression model indicated that LAC patients with CENPA+/CDK1+ and CENPA+/CDC20+ were high-risk groups in terms of overall survival. In conclusion, our integrated microarray analysis demonstrated that CENPA, CDK1 and CDC20 might serve as a novel cluster of prognostic biomarkers for LAC, and the cooperative unit of three genes provides a technically simple approach for identification of LAC patients. Copyright © 2018 Elsevier B.V. All rights reserved.
Polyadenylation state microarray (PASTA) analysis.
Beilharz, Traude H; Preiss, Thomas
2011-01-01
Nearly all eukaryotic mRNAs terminate in a poly(A) tail that serves important roles in mRNA utilization. In the cytoplasm, the poly(A) tail promotes both mRNA stability and translation, and these functions are frequently regulated through changes in tail length. To identify the scope of poly(A) tail length control in a transcriptome, we developed the polyadenylation state microarray (PASTA) method. It involves the purification of mRNA based on poly(A) tail length using thermal elution from poly(U) sepharose, followed by microarray analysis of the resulting fractions. In this chapter we detail our PASTA approach and describe some methods for bulk and mRNA-specific poly(A) tail length measurements of use to monitor the procedure and independently verify the microarray data.
Pashaei, Elnaz; Guzel, Esra; Ozgurses, Mete Emir; Demirel, Goksun; Aydin, Nizamettin; Ozen, Mustafa
MicroRNAs, which are small regulatory RNAs, post-transcriptionally regulate gene expression by binding the 3'-UTR of their mRNA targets. Their deregulation has been shown to cause increased proliferation, migration, invasion, and apoptosis. miR-145, an important tumor suppressor microRNA, has been shown to be downregulated in many cancer types and has crucial roles in tumor initiation, progression, metastasis, invasion, recurrence, and chemo-radioresistance. Our aim is to investigate potential common target genes of miR-145, and to help understand the underlying molecular pathways of tumor pathogenesis in association with those common target genes. Eight published microarray datasets, in which targets of miR-145 were investigated in cell lines upon miR-145 overexpression, were included in this study for meta-analysis. Inter-group variabilities were assessed by box-plot analysis. Microarray datasets were analyzed using the GEOquery package in Bioconductor 3.2 with R version 3.2.2, and two-way hierarchical clustering was used for gene expression data analysis. Meta-analysis of different GEO datasets showed that UNG, FUCA2, DERA, GMFB, TF, and SNX2 were commonly downregulated genes, whereas MYL9 and TAGLN were found to be commonly upregulated upon miR-145 overexpression in prostate, breast, esophageal, bladder cancer, and head and neck squamous cell carcinoma. Biological process, molecular function, and pathway analysis of these potential targets of miR-145 through functional enrichments in the PPI network demonstrated that those genes are significantly involved in telomere maintenance, DNA binding and repair mechanisms. In conclusion, our results indicated that miR-145, through targeting its common potential targets, may significantly contribute to tumor pathogenesis in distinct cancer types and might serve as an important target for cancer therapy.
Who shares? Who doesn't? Factors associated with openly archiving raw research data.
Piwowar, Heather A
2011-01-01
Many initiatives encourage investigators to share their raw datasets in hopes of increasing research efficiency and quality. Despite these investments of time and money, we do not have a firm grasp of who openly shares raw research data, who doesn't, and which initiatives are correlated with high rates of data sharing. In this analysis I use bibliometric methods to identify patterns in the frequency with which investigators openly archive their raw gene expression microarray datasets after study publication. Automated methods identified 11,603 articles published between 2000 and 2009 that describe the creation of gene expression microarray data. Associated datasets in best-practice repositories were found for 25% of these articles, increasing from less than 5% in 2001 to 30%-35% in 2007-2009. Accounting for sensitivity of the automated methods, approximately 45% of recent gene expression studies made their data publicly available. First-order factor analysis on 124 diverse bibliometric attributes of the data creation articles revealed 15 factors describing authorship, funding, institution, publication, and domain environments. In multivariate regression, authors were most likely to share data if they had prior experience sharing or reusing data, if their study was published in an open access journal or a journal with a relatively strong data sharing policy, or if the study was funded by a large number of NIH grants. Authors of studies on cancer and human subjects were least likely to make their datasets available. These results suggest research data sharing levels are still low and increasing only slowly, and data is least available in areas where it could make the biggest impact. Let's learn from those with high rates of sharing to embrace the full potential of our research output.
Guard, Jean; Rothrock, Michael J; Shah, Devendra H; Jones, Deana R; Gast, Richard K; Sanchez-Ingunza, Roxana; Madsen, Melissa; El-Attrache, John; Lungu, Bwalya
Phenotype microarrays were analyzed for 51 datasets derived from Salmonella enterica. The top 4 serotypes associated with poultry products and one associated with turkey, respectively Typhimurium, Enteritidis, Heidelberg, Infantis and Senftenberg, were represented. Datasets were partitioned initially into two clusters based on ranking by values at pH 4.5 (PM10 A03). Negative control wells were used to establish 90 respiratory units as the point differentiating acid resistance from sensitive strains. Thus, 24 isolates that appeared most acid-resistant were compared initially to 27 that appeared most acid-sensitive (24 × 27 format). Paired cluster analysis was also done and it included the 7 most acid-resistant and -sensitive datasets (7 × 7 format). Statistical analyses of ranked data were then calculated in order of standard deviation, probability value by the Student's t-test and a measure of the magnitude of difference called effect size. Data were reported as significant if, by order of filtering, the following parameters were calculated: i) a standard deviation of 24 respiratory units or greater from all datasets for each chemical, ii) a probability value of less than or equal to 0.03 between clusters and iii) an effect size of at least 0.50 or greater between clusters. Results suggest that between 7.89% and 23.16% of 950 chemicals differentiated acid-resistant isolates from sensitive ones, depending on the format applied. Differences were more evident at the extremes of phenotype using the subset of data in the paired 7 × 7 format. Results thus provide a strategy for selecting compounds for additional research, which may impede the emergence of acid-resistant Salmonella enterica in food. Published by Elsevier Masson SAS.
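The three-step filter described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the effect size is assumed to be Cohen's d with a pooled standard deviation (the abstract does not say which measure was used), and the t-test p-value is taken as an input rather than computed.

```python
import math
import statistics


def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation (assumed effect-size measure)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)


def passes_filters(resistant, sensitive, p_value,
                   min_sd=24.0, max_p=0.03, min_effect=0.50):
    """Apply the filters in the order given in the abstract for one chemical.

    resistant / sensitive: respiratory units for the two clusters.
    p_value: result of a Student's t-test between clusters, computed elsewhere.
    """
    # i) standard deviation across all datasets for this chemical
    overall_sd = statistics.stdev(list(resistant) + list(sensitive))
    if overall_sd < min_sd:
        return False
    # ii) probability value between clusters
    if p_value > max_p:
        return False
    # iii) magnitude of the difference between clusters
    return abs(cohens_d(resistant, sensitive)) >= min_effect
```

A chemical is kept only if all three checks pass, so the cheap variability check screens out uninformative wells before the between-cluster comparisons are considered.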
Genetic programming based ensemble system for microarray data classification.
Liu, Kun-Hong; Tong, Muchenxuan; Xie, Shu-Tong; Yee Ng, Vincent To
2015-01-01
Recently, more and more machine learning techniques have been applied to microarray data analysis. The aim of this study is to propose a genetic programming (GP) based new ensemble system (named GPES), which can be used to effectively classify different types of cancers. Decision trees are deployed as base classifiers in this ensemble framework with three operators: Min, Max, and Average. Each individual of the GP is an ensemble system, and they become more and more accurate in the evolutionary process. The feature selection technique and balanced subsampling technique are applied to increase the diversity in each ensemble system. The final ensemble committee is selected by a forward search algorithm, which is shown to be capable of fitting data automatically. The performance of GPES is evaluated using five binary class and six multiclass microarray datasets, and results show that the algorithm can achieve better results in most cases compared with some other ensemble systems. By using elaborate base classifiers or applying other sampling techniques, the performance of GPES may be further improved. PMID:25810748
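To make the Min, Max, and Average operators concrete, the sketch below fuses per-class scores from several base classifiers with a single operator. It is an assumption-laden illustration: the names `combine` and `predict` are hypothetical, and the way GPES nests these operators inside an evolved GP tree is not reproduced here.

```python
from statistics import mean

# The three fusion operators named in the abstract.
OPERATORS = {"min": min, "max": max, "average": mean}


def combine(score_vectors, operator):
    """Fuse per-class scores from several base classifiers, class by class."""
    fuse = OPERATORS[operator]
    return [fuse(scores) for scores in zip(*score_vectors)]


def predict(score_vectors, operator):
    """Return the index of the class with the highest fused score."""
    fused = combine(score_vectors, operator)
    return max(range(len(fused)), key=fused.__getitem__)


# Three hypothetical decision trees scoring two classes for one sample.
tree_scores = [[0.95, 0.05], [0.3, 0.7], [0.2, 0.8]]
```

With these illustrative scores, Min and Max both favor class 0 because one tree is very confident, while Average favors class 1, which shows why the choice of operator matters.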
BioconductorBuntu: a Linux distribution that implements a web-based DNA microarray analysis server.
Geeleher, Paul; Morris, Dermot; Hinde, John P; Golden, Aaron
2009-06-01
BioconductorBuntu is a custom distribution of Ubuntu Linux that automatically installs a server-side microarray processing environment, providing a user-friendly web-based GUI to many of the tools developed by the Bioconductor Project, accessible locally or across a network. System installation is via booting off a CD image or by using a Debian package provided to upgrade an existing Ubuntu installation. In its current version, several microarray analysis pipelines are supported, including oligonucleotide and dual- or single-dye experiments, as well as post-processing with Gene Set Enrichment Analysis. BioconductorBuntu is designed to be extensible, by server-side integration of further relevant Bioconductor modules as required, facilitated by its straightforward underlying Python-based infrastructure. BioconductorBuntu offers an ideal environment for the development of processing procedures to facilitate the analysis of next-generation sequencing datasets. BioconductorBuntu is available for download under a Creative Commons license along with additional documentation and a tutorial from (http://bioinf.nuigalway.ie).
Gong, Ping; Nan, Xiaofei; Barker, Natalie D; Boyd, Robert E; Chen, Yixin; Wilkins, Dawn E; Johnson, David R; Suedel, Burton C; Perkins, Edward J
2016-03-08
Chemical bioavailability is an important dose metric in environmental risk assessment. Although many approaches have been used to evaluate bioavailability, not a single approach is free from limitations. Previously, we developed a new genomics-based approach that integrated microarray technology and regression modeling for predicting bioavailability (tissue residue) of explosives compounds in exposed earthworms. In the present study, we further compared 18 different regression models and performed variable selection simultaneously with parameter estimation. This refined approach was applied to both previously collected and newly acquired earthworm microarray gene expression datasets for three explosive compounds. Our results demonstrate that a prediction accuracy of R(2) = 0.71-0.82 was achievable at a relatively low model complexity with as few as 3-10 predictor genes per model. These results are much more encouraging than our previous ones. This study has demonstrated that our approach is promising for bioavailability measurement, which warrants further studies of mixed contamination scenarios in field settings.
Vafaee Sharbaf, Fatemeh; Mosafer, Sara; Moattar, Mohammad Hossein
2016-06-01
This paper proposes an approach for gene selection in microarray data. The proposed approach consists of a primary filter step using the Fisher criterion, which reduces the initial set of genes and hence the search space and time complexity. Then, a wrapper approach based on cellular learning automata (CLA) optimized with the ant colony method (ACO) is used to find the set of features that improves classification accuracy. CLA is applied because of its capability to learn and model complicated relationships. The features selected in the last phase are evaluated using ROC curves, and the most effective yet smallest feature subset is determined. The classifiers evaluated in the proposed framework are K-nearest neighbor, support vector machine, and naïve Bayes. The proposed approach is evaluated on 4 microarray datasets. The evaluations confirm that the proposed approach can find the smallest subset of genes while approaching the maximum accuracy. Copyright © 2016 Elsevier Inc. All rights reserved.
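The primary filter step can be illustrated with a short sketch. The abstract does not state the exact formula, so a common two-class form of the Fisher criterion is assumed here (squared difference of class means divided by the sum of class variances); the function names and data layout are hypothetical, and the CLA/ACO wrapper stage is not shown.

```python
import statistics


def fisher_score(class_a, class_b):
    """Two-class Fisher criterion (assumed form): (mean_a - mean_b)^2 / (var_a + var_b)."""
    num = (statistics.mean(class_a) - statistics.mean(class_b)) ** 2
    den = statistics.pvariance(class_a) + statistics.pvariance(class_b)
    if den == 0:
        # Constant within both classes: perfectly separating or uninformative.
        return float("inf") if num > 0 else 0.0
    return num / den


def top_genes(expression, labels, k):
    """Rank genes by Fisher score and keep the k best.

    expression: dict mapping gene name -> list of values, one per sample.
    labels: list of 0/1 class labels aligned with the samples.
    """
    scores = {}
    for gene, values in expression.items():
        a = [v for v, y in zip(values, labels) if y == 0]
        b = [v for v, y in zip(values, labels) if y == 1]
        scores[gene] = fisher_score(a, b)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Keeping only the top-ranked genes shrinks the search space that any later wrapper method has to explore, which is the motivation the abstract gives for the filter.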
Vukmirovic, Milica; Herazo-Maya, Jose D; Blackmon, John; Skodric-Trifunovic, Vesna; Jovanovic, Dragana; Pavlovic, Sonja; Stojsic, Jelena; Zeljkovic, Vesna; Yan, Xiting; Homer, Robert; Stefanovic, Branko; Kaminski, Naftali
2017-01-12
Idiopathic Pulmonary Fibrosis (IPF) is a lethal lung disease of unknown etiology. A major limitation in transcriptomic profiling of lung tissue in IPF has been a dependence on snap-frozen fresh tissues (FF). In this project we sought to determine whether genome scale transcript profiling using RNA Sequencing (RNA-Seq) could be applied to archived Formalin-Fixed Paraffin-Embedded (FFPE) IPF tissues. We isolated total RNA from 7 IPF and 5 control FFPE lung tissues and performed 50 base pair paired-end sequencing on an Illumina HiSeq 2000. TopHat2 was used to map sequencing reads to the human genome. On average ~62 million reads (53.4% of ~116 million reads) were mapped per sample. 4,131 genes were differentially expressed between IPF and controls (1,920 increased and 2,211 decreased; FDR < 0.05). We compared our results to differentially expressed genes calculated from a previously published dataset generated from FF tissues analyzed on Agilent microarrays (GSE47460). The overlap of differentially expressed genes was very high (760 increased and 1,413 decreased, FDR < 0.05). Only 92 differentially expressed genes changed in opposite directions. Pathway enrichment analysis performed using MetaCore confirmed numerous IPF relevant genes and pathways including extracellular remodeling, TGF-beta, and WNT. Gene network analysis of MMP7, a highly differentially expressed gene in both datasets, revealed the same canonical pathways and gene network candidates in RNA-Seq and microarray data. For validation by NanoString nCounter® we selected 35 genes that had a fold change of 2 in at least one dataset (10 discordant, 10 significantly differentially expressed in one dataset only and 15 concordant genes). High concordance of fold change and FDR was observed for each type of the samples (FF vs FFPE) with both microarrays (r = 0.92) and RNA-Seq (r = 0.90) and the number of discordant genes was reduced to four.
Our results demonstrate that RNA sequencing of RNA obtained from archived FFPE lung tissues is feasible. The results obtained from FFPE tissue are highly comparable to FF tissues. The ability to perform RNA-Seq on archived FFPE IPF tissues should greatly enhance the availability of tissue biopsies for research in IPF.
Gregori, Josep; Villarreal, Laura; Sánchez, Alex; Baselga, José; Villanueva, Josep
2013-12-16
The microarray community has shown that the low reproducibility observed in gene expression-based biomarker discovery studies is partially due to relying solely on p-values to get the lists of differentially expressed genes. Their conclusions recommended complementing the p-value cutoff with the use of effect-size criteria. The aim of this work was to evaluate the influence of such an effect-size filter on spectral counting-based comparative proteomic analysis. The results proved that the filter increased the number of true positives and decreased the number of false positives and the false discovery rate of the dataset. These results were confirmed by simulation experiments where the effect size filter was used to evaluate systematically variable fractions of differentially expressed proteins. Our results suggest that relaxing the p-value cut-off followed by a post-test filter based on effect size and signal level thresholds can increase the reproducibility of statistical results obtained in comparative proteomic analysis. Based on our work, we recommend using a filter consisting of a minimum absolute log2 fold change of 0.8 and a minimum signal of 2-4 SpC on the most abundant condition for the general practice of comparative proteomics. The implementation of feature filtering approaches could improve proteomic biomarker discovery initiatives by increasing the reproducibility of the results obtained among independent laboratories and MS platforms. Quality control analysis of microarray-based gene expression studies pointed out that the low reproducibility observed in the lists of differentially expressed genes could be partially attributed to the fact that these lists are generated relying solely on p-values. Our study has established that the implementation of an effect size post-test filter improves the statistical results of spectral count-based quantitative proteomics. 
The results proved that the filter increased the number of true positives while decreasing the number of false positives and the false discovery rate of the datasets. The results presented here prove that a post-test filter applying reasonable effect size and signal level thresholds helps to increase the reproducibility of statistical results in comparative proteomic analysis. Furthermore, the implementation of feature filtering approaches could improve proteomic biomarker discovery initiatives by increasing the reproducibility of results obtained among independent laboratories and MS platforms. This article is part of a Special Issue entitled: Standardization and Quality Control in Proteomics. Copyright © 2013 Elsevier B.V. All rights reserved.
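The recommended post-test filter (a minimum absolute log2 fold change of 0.8 and a minimum signal of 2-4 SpC on the most abundant condition) can be sketched as below. The function name is hypothetical, the relaxed p-value cutoff is left as a required argument because the abstract gives no value, and the pseudocount used to avoid division by zero is an added assumption rather than part of the published method.

```python
import math


def passes_post_test_filter(mean_spc_a, mean_spc_b, p_value, p_cutoff,
                            min_abs_log2fc=0.8, min_signal=2.0,
                            pseudocount=0.5):
    """Return True if a protein survives the p-value, signal, and effect-size checks.

    mean_spc_a / mean_spc_b: mean spectral counts in the two conditions.
    p_cutoff: the (relaxed) significance threshold, chosen by the analyst.
    pseudocount: assumption added here so zero counts do not break the ratio.
    """
    if p_value > p_cutoff:
        return False
    # Signal threshold is applied to the most abundant condition.
    if max(mean_spc_a, mean_spc_b) < min_signal:
        return False
    log2fc = math.log2((mean_spc_a + pseudocount) / (mean_spc_b + pseudocount))
    return abs(log2fc) >= min_abs_log2fc
```

Raising `min_signal` toward 4 SpC, the upper end of the recommended range, makes the filter more conservative for low-abundance proteins.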
Kadarmideen, Haja N; Watson-Haigh, Nathan S
2012-01-01
Gene co-expression networks (GCN), built using high-throughput gene expression data, are a fundamental aspect of systems biology. The main aim of this study was to compare two popular approaches to building and analysing GCN. We use real ovine microarray transcriptomics datasets representing four different treatments with Metyrapone, an inhibitor of cortisol biosynthesis. We conducted several microarray quality control checks before applying GCN methods to filtered datasets. Then we compared the outputs of the two methods using connectivity as a criterion, as it measures how well a node (gene) is connected within a network. The two GCN construction methods used were Weighted Gene Co-expression Network Analysis (WGCNA) and Partial Correlation and Information Theory (PCIT). Nodes were ranked based on their connectivity measures in each of the four different networks created by WGCNA and PCIT, and node ranks in the two methods were compared to identify those nodes which are highly differentially ranked (HDR). A total of 1,017 HDR nodes were identified across one or more of the four networks. We investigated HDR nodes by gene enrichment analyses in relation to their biological relevance to phenotypes. We observed that, in contrast to the WGCNA method, the PCIT algorithm removes many of the edges of the most highly interconnected nodes. Removal of edges of the most highly connected nodes, or hub genes, will have consequences for downstream analyses and biological interpretations. In general, for large GCN construction (with > 20000 genes), access to large computer clusters, particularly those with larger amounts of shared memory, is recommended. PMID:23144540
A four-gene signature predicts survival in clear-cell renal-cell carcinoma.
Dai, Jun; Lu, Yuchao; Wang, Jinyu; Yang, Lili; Han, Yingyan; Wang, Ying; Yan, Dan; Ruan, Qiurong; Wang, Shaogang
2016-12-13
Clear-cell renal-cell carcinoma (ccRCC) is the most common pathological subtype of renal cell carcinoma (RCC), accounting for about 80% of RCC. In order to find potential prognostic biomarkers in ccRCC, we present a four-gene signature to evaluate the prognosis of ccRCC. SurvExpress and immunohistochemical (IHC) staining of tissue microarrays were used to analyze the association between the four genes and the prognosis of ccRCC. Data from the TCGA dataset revealed a prognostic role for the four genes (PTEN, PIK3C2A, ITPA and BCL3). Further analysis suggested that the four-gene signature predicted survival better than any of the four genes alone. Moreover, IHC staining produced results consistent with TCGA, indicating that the signature was an independent prognostic factor of survival in ccRCC. Univariate and multivariate Cox proportional hazard regression analyses were conducted to verify the association of clinicopathological variables and the four genes' expression levels with survival. The results further confirmed that the risk (four-gene signature) was an independent prognostic factor of both Overall Survival (OS) and Disease-free Survival (DFS) (P<0.05). In conclusion, the four-gene signature was correlated with the survival of ccRCC and may therefore have significant clinical implications for predicting the prognosis of patients.
Configurable pattern-based evolutionary biclustering of gene expression data
2013-01-01
Background Biclustering algorithms for microarray data aim at discovering functionally related gene sets under different subsets of experimental conditions. Due to the problem complexity and the characteristics of microarray datasets, heuristic searches are usually used instead of exhaustive algorithms. Also, the comparison among different techniques is still a challenge. The obtained results vary in relevant features such as the number of genes or conditions, which makes it difficult to carry out a fair comparison. Moreover, existing approaches do not allow the user to specify any preferences on these properties. Results Here, we present the first biclustering algorithm in which it is possible to particularize several biclusters features in terms of different objectives. This can be done by tuning the specified features in the algorithm or also by incorporating new objectives into the search. Furthermore, our approach bases the bicluster evaluation in the use of expression patterns, being able to recognize both shifting and scaling patterns either simultaneously or not. Evolutionary computation has been chosen as the search strategy, naming thus our proposal Evo-Bexpa (Evolutionary Biclustering based in Expression Patterns). Conclusions We have conducted experiments on both synthetic and real datasets demonstrating Evo-Bexpa abilities to obtain meaningful biclusters. Synthetic experiments have been designed in order to compare Evo-Bexpa performance with other approaches when looking for perfect patterns. Experiments with four different real datasets also confirm the proper performing of our algorithm, whose results have been biologically validated through Gene Ontology. PMID:23433178
Multi-test decision tree and its application to microarray data classification.
Czajkowski, Marcin; Grześ, Marek; Kretowski, Marek
2014-05-01
A desirable property of tools used to investigate biological data is that their models and predictive decisions are easy to understand. Decision trees are particularly promising in this regard due to their comprehensible nature that resembles the hierarchical process of human decision making. However, existing algorithms for learning decision trees have a tendency to underfit gene expression data. The main aim of this work is to improve the performance and stability of decision trees with only a small increase in their complexity. We propose a multi-test decision tree (MTDT); our main contribution is the application of several univariate tests in each non-terminal node of the decision tree. We also search for alternative, lower-ranked features in order to obtain more stable and reliable predictions. Experimental validation was performed on several real-life gene expression datasets. Comparison results with eight classifiers show that MTDT has a statistically significantly higher accuracy than popular decision tree classifiers, and it was highly competitive with ensemble learning algorithms. The proposed solution managed to outperform its baseline algorithm on 14 datasets by an average of 6%. A study performed on one of the datasets showed that the discovered genes used in the MTDT classification model are supported by biological evidence in the literature. This paper introduces a new type of decision tree which is more suitable for solving biological problems. MTDTs are relatively easy to analyze and much more powerful in modeling high dimensional microarray data than their popular counterparts. Copyright © 2014 Elsevier B.V. All rights reserved.
Yang, Lingjian; Ainali, Chrysanthi; Tsoka, Sophia; Papageorgiou, Lazaros G
2014-12-05
Applying machine learning methods on microarray gene expression profiles for disease classification problems is a popular method to derive biomarkers, i.e. sets of genes that can predict disease state or outcome. Traditional approaches where expression of genes were treated independently suffer from low prediction accuracy and difficulty of biological interpretation. Current research efforts focus on integrating information on protein interactions through biochemical pathway datasets with expression profiles to propose pathway-based classifiers that can enhance disease diagnosis and prognosis. As most of the pathway activity inference methods in literature are either unsupervised or applied on two-class datasets, there is good scope to address such limitations by proposing novel methodologies. A supervised multiclass pathway activity inference method using optimisation techniques is reported. For each pathway expression dataset, patterns of its constituent genes are summarised into one composite feature, termed pathway activity, and a novel mathematical programming model is proposed to infer this feature as a weighted linear summation of expression of its constituent genes. Gene weights are determined by the optimisation model, in a way that the resulting pathway activity has the optimal discriminative power with regards to disease phenotypes. Classification is then performed on the resulting low-dimensional pathway activity profile. The model was evaluated through a variety of published gene expression profiles that cover different types of disease. We show that not only does it improve classification accuracy, but it can also perform well in multiclass disease datasets, a limitation of other approaches from the literature. Desirable features of the model include the ability to control the maximum number of genes that may participate in determining pathway activity, which may be pre-specified by the user. 
Overall, this work highlights the potential of building pathway-based multi-phenotype classifiers for accurate disease diagnosis and prognosis problems.
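The composite feature described in this abstract, pathway activity as a weighted linear summation of constituent gene expression, can be written in a few lines. This sketch only evaluates the summation; the weights below are placeholders, since in the paper they come from a mathematical programming model that is not reproduced here, and the function name is hypothetical.

```python
def pathway_activity(sample_expression, gene_weights):
    """Weighted linear summation of the expression of a pathway's genes.

    sample_expression: dict mapping gene name -> expression value for one sample.
    gene_weights: dict mapping constituent gene -> weight (placeholders here;
                  the optimisation that would determine them is not shown).
    """
    return sum(w * sample_expression[g] for g, w in gene_weights.items())


# Illustrative values only: two genes participate, a third is ignored.
weights = {"GENE_A": 0.5, "GENE_B": -0.25}
sample = {"GENE_A": 4.0, "GENE_B": 2.0, "GENE_C": 9.0}
activity = pathway_activity(sample, weights)
```

Genes absent from `gene_weights` contribute nothing, which mirrors the idea of capping how many genes may participate in a pathway's activity.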
Evolutionary Approach for Relative Gene Expression Algorithms
Czajkowski, Marcin
2014-01-01
Relative Expression Analysis (RXA) uses ordering relationships in a small collection of genes and has been successfully applied to classification using microarray data. As checking all possible subsets of genes is computationally infeasible, RXA algorithms require feature selection and multiple restrictive assumptions. Our main contribution is a specialized evolutionary algorithm (EA) for top-scoring pairs, called EvoTSP, which allows finding more advanced gene relations. We managed to unify the major variants of relative expression algorithms through EA and introduce weights to the top-scoring pairs. Experimental validation of EvoTSP on publicly available microarray datasets showed that the proposed solution significantly outperforms other relative expression algorithms in terms of accuracy and allows exploring a much larger solution space. PMID:24790574
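For readers unfamiliar with top-scoring pairs, the sketch below shows the classic exhaustive form of the pair score: the absolute difference, between two classes, in how often gene i is expressed below gene j. This is the baseline idea only, with hypothetical function names; EvoTSP's evolutionary search and pair weights are not reproduced.

```python
from itertools import combinations


def pair_score(xi, xj, labels):
    """|P(xi < xj | class 0) - P(xi < xj | class 1)| for one gene pair."""
    def freq(cls):
        idx = [k for k, y in enumerate(labels) if y == cls]
        return sum(xi[k] < xj[k] for k in idx) / len(idx)
    return abs(freq(0) - freq(1))


def top_scoring_pair(expression, labels):
    """Exhaustively score every gene pair and return the best one.

    expression: dict mapping gene name -> list of values, one per sample.
    labels: list of 0/1 class labels aligned with the samples.
    """
    return max(
        combinations(sorted(expression), 2),
        key=lambda p: pair_score(expression[p[0]], expression[p[1]], labels),
    )
```

The exhaustive loop grows quadratically with the number of genes, which is the computational limit that motivates feature selection or a heuristic search such as an evolutionary algorithm.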
Xu, Jiucheng; Mu, Huiyu; Wang, Yun; Huang, Fangzhou
2018-01-01
The selection of feature genes with high recognition ability from the gene expression profiles has gained great significance in biology. However, most of the existing methods have a high time complexity and poor classification performance. Motivated by this, an effective feature selection method, called supervised locally linear embedding and Spearman's rank correlation coefficient (SLLE-SC2), is proposed which is based on the concept of locally linear embedding and correlation coefficient algorithms. Supervised locally linear embedding takes into account class label information and improves the classification performance. Furthermore, Spearman's rank correlation coefficient is used to remove the coexpression genes. The experiment results obtained on four public tumor microarray datasets illustrate that our method is valid and feasible. PMID:29666661
Genomic pathways modulated by Twist in breast cancer.
Vesuna, Farhad; Bergman, Yehudit; Raman, Venu
2017-01-13
The basic helix-loop-helix transcription factor TWIST1 (Twist) is involved in embryonic cell lineage determination and mesodermal differentiation. There is evidence to indicate that Twist expression plays a role in breast tumor formation and metastasis, but the role of Twist in dysregulating pathways that drive the metastatic cascade is unclear. Moreover, many of the genes and pathways dysregulated by Twist in cell lines and mouse models have not been validated against data obtained from larger, independent datasets of breast cancer patients. We over-expressed the human Twist gene in non-metastatic MCF-7 breast cancer cells to generate the estrogen-independent metastatic breast cancer cell line MCF-7/Twist. These cells were inoculated in the mammary fat pad of female severe combined immunodeficient mice, which subsequently formed xenograft tumors that metastasized to the lungs. Microarray data was collected from both in vitro (MCF-7 and MCF-7/Twist cell lines) and in vivo (primary tumors and lung metastases) models of Twist expression. Our data was compared to several gene datasets of various subtypes, classes, and grades of human breast cancers. Our data establishes a Twist over-expressing mouse model of breast cancer, which metastasizes to the lung and replicates some of the ontogeny of human breast cancer progression. Gene profiling data, following Twist expression, exhibited novel metastasis driver genes as well as cellular maintenance genes that were synonymous with the metastatic process. We demonstrated that the genes and pathways altered in the transgenic cell line and metastatic animal models parallel many of the dysregulated gene pathways observed in human breast cancers. Analogous gene expression patterns were observed in both in vitro and in vivo Twist preclinical models of breast cancer metastasis and breast cancer patient datasets, supporting the functional role of Twist in promoting breast cancer metastasis.
The data suggests that genetic dysregulation of Twist at the cellular level drives alterations in gene pathways in the Twist metastatic mouse model which are comparable to changes seen in human breast cancers. Lastly, we have identified novel genes and pathways that could be further investigated as targets for drugs to treat metastatic breast cancer.
A robust prognostic signature for hormone-positive node-negative breast cancer.
Griffith, Obi L; Pepin, François; Enache, Oana M; Heiser, Laura M; Collisson, Eric A; Spellman, Paul T; Gray, Joe W
2013-01-01
Systemic chemotherapy in the adjuvant setting can cure breast cancer in some patients that would otherwise recur with incurable, metastatic disease. However, since only a fraction of patients would have recurrence after surgery alone, the challenge is to stratify high-risk patients (who stand to benefit from systemic chemotherapy) from low-risk patients (who can safely be spared treatment related toxicities and costs). We focus here on risk stratification in node-negative, ER-positive, HER2-negative breast cancer. We use a large database of publicly available microarray datasets to build a random forests classifier and develop a robust multi-gene mRNA transcription-based predictor of relapse free survival at 10 years, which we call the Random Forests Relapse Score (RFRS). Performance was assessed by internal cross-validation, multiple independent data sets, and comparison to existing algorithms using receiver-operating characteristic and Kaplan-Meier survival analysis. Internal redundancy of features was determined using k-means clustering to define optimal signatures with smaller numbers of primary genes, each with multiple alternates. Internal OOB cross-validation for the initial (full-gene-set) model on training data reported an ROC AUC of 0.704, which was comparable to or better than those reported previously or obtained by applying existing methods to our dataset. Three risk groups with probability cutoffs for low, intermediate, and high-risk were defined. Survival analysis determined a highly significant difference in relapse rate between these risk groups. Validation of the models against independent test datasets showed highly similar results. Smaller 17-gene and 8-gene optimized models were also developed with minimal reduction in performance. Furthermore, the signature was shown to be almost equally effective on both hormone-treated and untreated patients. 
RFRS allows flexibility in both the number and identity of genes utilized from thousands to as few as 17 or eight genes, each with multiple alternatives. The RFRS reports a probability score strongly correlated with risk of relapse. This score could therefore be used to assign systemic chemotherapy specifically to those high-risk patients most likely to benefit from further treatment.
A robust prognostic signature for hormone-positive node-negative breast cancer
2013-01-01
Background Systemic chemotherapy in the adjuvant setting can cure breast cancer in some patients that would otherwise recur with incurable, metastatic disease. However, since only a fraction of patients would have recurrence after surgery alone, the challenge is to stratify high-risk patients (who stand to benefit from systemic chemotherapy) from low-risk patients (who can safely be spared treatment related toxicities and costs). Methods We focus here on risk stratification in node-negative, ER-positive, HER2-negative breast cancer. We use a large database of publicly available microarray datasets to build a random forests classifier and develop a robust multi-gene mRNA transcription-based predictor of relapse free survival at 10 years, which we call the Random Forests Relapse Score (RFRS). Performance was assessed by internal cross-validation, multiple independent data sets, and comparison to existing algorithms using receiver-operating characteristic and Kaplan-Meier survival analysis. Internal redundancy of features was determined using k-means clustering to define optimal signatures with smaller numbers of primary genes, each with multiple alternates. Results Internal OOB cross-validation for the initial (full-gene-set) model on training data reported an ROC AUC of 0.704, which was comparable to or better than those reported previously or obtained by applying existing methods to our dataset. Three risk groups with probability cutoffs for low, intermediate, and high-risk were defined. Survival analysis determined a highly significant difference in relapse rate between these risk groups. Validation of the models against independent test datasets showed highly similar results. Smaller 17-gene and 8-gene optimized models were also developed with minimal reduction in performance. Furthermore, the signature was shown to be almost equally effective on both hormone-treated and untreated patients. 
Conclusions RFRS allows flexibility in both the number and identity of genes utilized from thousands to as few as 17 or eight genes, each with multiple alternatives. The RFRS reports a probability score strongly correlated with risk of relapse. This score could therefore be used to assign systemic chemotherapy specifically to those high-risk patients most likely to benefit from further treatment. PMID:24112773
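The two evaluation ideas in this abstract, ROC AUC and probability cutoffs for three risk groups, can be illustrated with a minimal sketch. It does not train a random forest; the relapse probabilities and the cutoff values (0.3 and 0.6) are invented for illustration and are not the RFRS cutoffs, which the abstract does not state.

```python
def roc_auc(scores_relapse, scores_no_relapse):
    """AUC as the fraction of (relapse, no-relapse) pairs ranked correctly.

    Ties count as half a correct pair. This pairwise form is equivalent
    to the area under the ROC curve.
    """
    correct = 0.0
    for s_pos in scores_relapse:
        for s_neg in scores_no_relapse:
            if s_pos > s_neg:
                correct += 1.0
            elif s_pos == s_neg:
                correct += 0.5
    return correct / (len(scores_relapse) * len(scores_no_relapse))


def risk_group(prob, low_cut=0.3, high_cut=0.6):
    """Assign a risk group from a predicted relapse probability.

    low_cut and high_cut are hypothetical values, not those of the paper.
    """
    if prob < low_cut:
        return "low"
    if prob < high_cut:
        return "intermediate"
    return "high"


# Toy predicted relapse probabilities (made up).
relapsed = [0.8, 0.6, 0.4]
relapse_free = [0.7, 0.3, 0.2, 0.1]

auc = roc_auc(relapsed, relapse_free)
groups = [risk_group(p) for p in relapsed + relapse_free]
print(f"AUC = {auc:.3f}")
print(groups)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives a scale for reading the reported value of 0.704.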
Convergent Genetic and Expression Datasets Highlight TREM2 in Parkinson's Disease Susceptibility.
Liu, Guiyou; Liu, Yongquan; Jiang, Qinghua; Jiang, Yongshuai; Feng, Rennan; Zhang, Liangcai; Chen, Zugen; Li, Keshen; Liu, Jiafeng
2016-09-01
A rare TREM2 missense mutation (rs75932628-T) was reported to confer a significant Alzheimer's disease (AD) risk. A recent study indicated no evidence of the involvement of this variant in Parkinson's disease (PD). Here, we used genetic and expression data to reinvestigate the potential association between TREM2 and PD susceptibility. In stage 1, using 10 independent studies (N = 89,157; 8787 cases and 80,370 controls), we conducted a subgroup meta-analysis. We identified a significant association between rs75932628 and PD (P = 3.10E-03, odds ratio (OR) = 3.88, 95% confidence interval (CI) 1.58-9.54) in the No-Northern Europe subgroup, and a significantly higher PD risk (P = 0.01, Mann-Whitney test) in the No-Northern Europe subgroup than in the Northern Europe subgroup. In stage 2, we used the summary results from a large-scale PD genome-wide association study (GWAS; N = 108,990; 13,708 cases and 95,282 controls) to search for other TREM2 variants contributing to PD susceptibility. We identified 14 single-nucleotide polymorphisms (SNPs) associated with PD within a 50-kb range upstream and downstream of TREM2. In stage 3, using two brain expression GWAS datasets (N = 773), we identified 6 of the 14 SNPs regulating increased expression of TREM2. In stage 4, using whole human genome microarray data (N = 50), we further identified significantly increased expression of TREM2 in PD cases compared with controls in human prefrontal cortex. In summary, convergent genetic and expression datasets demonstrate that TREM2 is a potent risk factor for PD and may be a therapeutic target in PD and other neurodegenerative diseases.
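As a reading aid, the reported odds ratio, confidence interval and P value can be checked against one another. The sketch below assumes a Wald-type interval computed on the log-odds scale, which the abstract does not state; under that assumption the implied two-sided P value comes out close to the reported 3.10E-03.

```python
import math


def p_from_or_ci(odds_ratio, ci_low, ci_high, z_crit=1.959964):
    """Two-sided p-value implied by an odds ratio and its 95% CI.

    Assumes a symmetric (Wald) interval on the log scale:
    log(OR) +/- z_crit * SE.
    """
    se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * z_crit)
    z = math.log(odds_ratio) / se
    # Two-sided tail area of the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return se, z, p


# Values taken from the stage 1 result quoted above.
se, z, p = p_from_or_ci(3.88, 1.58, 9.54)
print(f"SE(log OR) = {se:.3f}, z = {z:.2f}, p = {p:.4f}")
```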
Kar, Siddhartha P; Tyrer, Jonathan P; Li, Qiyuan; Lawrenson, Kate; Aben, Katja K H; Anton-Culver, Hoda; Antonenkova, Natalia; Chenevix-Trench, Georgia; Baker, Helen; Bandera, Elisa V; Bean, Yukie T; Beckmann, Matthias W; Berchuck, Andrew; Bisogna, Maria; Bjørge, Line; Bogdanova, Natalia; Brinton, Louise; Brooks-Wilson, Angela; Butzow, Ralf; Campbell, Ian; Carty, Karen; Chang-Claude, Jenny; Chen, Yian Ann; Chen, Zhihua; Cook, Linda S; Cramer, Daniel; Cunningham, Julie M; Cybulski, Cezary; Dansonka-Mieszkowska, Agnieszka; Dennis, Joe; Dicks, Ed; Doherty, Jennifer A; Dörk, Thilo; du Bois, Andreas; Dürst, Matthias; Eccles, Diana; Easton, Douglas F; Edwards, Robert P; Ekici, Arif B; Fasching, Peter A; Fridley, Brooke L; Gao, Yu-Tang; Gentry-Maharaj, Aleksandra; Giles, Graham G; Glasspool, Rosalind; Goode, Ellen L; Goodman, Marc T; Grownwald, Jacek; Harrington, Patricia; Harter, Philipp; Hein, Alexander; Heitz, Florian; Hildebrandt, Michelle A T; Hillemanns, Peter; Hogdall, Estrid; Hogdall, Claus K; Hosono, Satoyo; Iversen, Edwin S; Jakubowska, Anna; Paul, James; Jensen, Allan; Ji, Bu-Tian; Karlan, Beth Y; Kjaer, Susanne K; Kelemen, Linda E; Kellar, Melissa; Kelley, Joseph; Kiemeney, Lambertus A; Krakstad, Camilla; Kupryjanczyk, Jolanta; Lambrechts, Diether; Lambrechts, Sandrina; Le, Nhu D; Lee, Alice W; Lele, Shashi; Leminen, Arto; Lester, Jenny; Levine, Douglas A; Liang, Dong; Lissowska, Jolanta; Lu, Karen; Lubinski, Jan; Lundvall, Lene; Massuger, Leon; Matsuo, Keitaro; McGuire, Valerie; McLaughlin, John R; McNeish, Iain A; Menon, Usha; Modugno, Francesmary; Moysich, Kirsten B; Narod, Steven A; Nedergaard, Lotte; Ness, Roberta B; Nevanlinna, Heli; Odunsi, Kunle; Olson, Sara H; Orlow, Irene; Orsulic, Sandra; Weber, Rachel Palmieri; Pearce, Celeste Leigh; Pejovic, Tanja; Pelttari, Liisa M; Permuth-Wey, Jennifer; Phelan, Catherine M; Pike, Malcolm C; Poole, Elizabeth M; Ramus, Susan J; Risch, Harvey A; Rosen, Barry; Rossing, Mary Anne; Rothstein, Joseph H; Rudolph, Anja; 
Runnebaum, Ingo B; Rzepecka, Iwona K; Salvesen, Helga B; Schildkraut, Joellen M; Schwaab, Ira; Shu, Xiao-Ou; Shvetsov, Yurii B; Siddiqui, Nadeem; Sieh, Weiva; Song, Honglin; Southey, Melissa C; Sucheston-Campbell, Lara E; Tangen, Ingvild L; Teo, Soo-Hwang; Terry, Kathryn L; Thompson, Pamela J; Timorek, Agnieszka; Tsai, Ya-Yu; Tworoger, Shelley S; van Altena, Anne M; Van Nieuwenhuysen, Els; Vergote, Ignace; Vierkant, Robert A; Wang-Gohrke, Shan; Walsh, Christine; Wentzensen, Nicolas; Whittemore, Alice S; Wicklund, Kristine G; Wilkens, Lynne R; Woo, Yin-Ling; Wu, Xifeng; Wu, Anna; Yang, Hannah; Zheng, Wei; Ziogas, Argyrios; Sellers, Thomas A; Monteiro, Alvaro N A; Freedman, Matthew L; Gayther, Simon A; Pharoah, Paul D P
2015-10-01
Genome-wide association studies (GWAS) have so far reported 12 loci associated with serous epithelial ovarian cancer (EOC) risk. We hypothesized that some of these loci function through nearby transcription factor (TF) genes and that putative target genes of these TFs as identified by coexpression may also be enriched for additional EOC risk associations. We selected TF genes within 1 Mb of the top signal at the 12 genome-wide significant risk loci. Mutual information, a form of correlation, was used to build networks of genes strongly coexpressed with each selected TF gene in the unified microarray dataset of 489 serous EOC tumors from The Cancer Genome Atlas. Genes represented in this dataset were subsequently ranked using a gene-level test based on results for germline SNPs from a serous EOC GWAS meta-analysis (2,196 cases/4,396 controls). Gene set enrichment analysis identified six networks centered on TF genes (HOXB2, HOXB5, HOXB6, HOXB7 at 17q21.32 and HOXD1, HOXD3 at 2q31) that were significantly enriched for genes from the risk-associated end of the ranked list (P < 0.05 and FDR < 0.05). These results were replicated (P < 0.05) using an independent association study (7,035 cases/21,693 controls). Genes underlying enrichment in the six networks were pooled into a combined network. We identified a HOX-centric network associated with serous EOC risk containing several genes with known or emerging roles in serous EOC development. Network analysis integrating large, context-specific datasets has the potential to offer mechanistic insights into cancer susceptibility and prioritize genes for experimental characterization. ©2015 American Association for Cancer Research.
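Mutual information between two genes can be sketched for discretized expression values. The example below is a minimal plug-in estimator over discrete labels. The binning of expression into "low"/"high" and the toy vectors are assumptions, and the abstract does not say which estimator the authors used.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Plug-in mutual information (in nats) of two discrete label sequences."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    count_x = Counter(x)
    count_y = Counter(y)
    count_xy = Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in count_xy.items():
        # p(a,b) * log( p(a,b) / (p(a) * p(b)) ), written with raw counts.
        mi += (c / n) * math.log(c * n / (count_x[a] * count_y[b]))
    return mi


# Toy discretized expression of a TF gene and two candidate target genes.
tf_gene = ["low", "low", "high", "high"]
coexpressed = ["low", "low", "high", "high"]    # tracks the TF exactly
unrelated = ["low", "high", "low", "high"]      # independent of the TF

mi_coexpressed = mutual_information(tf_gene, coexpressed)
mi_unrelated = mutual_information(tf_gene, unrelated)
print(mi_coexpressed, mi_unrelated)
```

Unlike a linear correlation coefficient, mutual information is zero only when the two variables are statistically independent, so it can also capture nonlinear coexpression.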
Kameue, Chiyoko; Tsukahara, Takamitsu; Ushida, Kazunari
2006-03-01
Butyrate induces apoptosis of various cancer cell lines in a p53-independent manner and inhibits the proliferation of cancer cells. We previously reported a significant reduction in tumor incidence in rat colon as a result of dietary sodium gluconate (GNA). The stimulation of apoptosis through enhanced butyrate production in the large intestine was involved in the antitumorigenic effect of GNA. In the present study, a cDNA microarray analysis was performed to investigate the particular mechanism involved in the antitumorigenic effect of GNA. Some up-regulated genes suggested by microarray analysis were further evaluated using real-time PCR. The microarray analysis revealed that GNA regulates the expression of retinoic acid receptor (RAR) and retinoid X receptor (RXR), and of several genes known to be targets of retinoids in cancer cells. In other words, the antitumorigenic effect of GNA may involve the regulation of the retinoid signaling pathway by butyrate in a retinoid-independent manner.
Classification of Microarray Data Using Kernel Fuzzy Inference System
Kumar Rath, Santanu
2014-01-01
DNA microarray classification has gained popularity in both research and practice. Real datasets, such as microarray data, contain a huge number of insignificant and irrelevant features that tend to obscure useful information. Features with high relevance to the classes and high significance are generally preferred for selection, since they determine the classification of samples into their respective classes. In this paper, the kernel fuzzy inference system (K-FIS) algorithm is applied to classify microarray data (leukemia) using the t-test as a feature selection method. Kernel functions are used to map original data points into a higher-dimensional (possibly infinite-dimensional) feature space defined by a (usually nonlinear) function ϕ through a mathematical process called the kernel trick. This paper also presents a comparative study of classification using K-FIS and support vector machine (SVM) for different sets of features (genes). Performance parameters available in the literature such as precision, recall, specificity, F-measure, ROC curve, and accuracy are considered to analyze the efficiency of the classification model. The results show that the K-FIS model obtains results similar to those of the SVM model. This is an indication that the proposed approach relies on the kernel function. PMID:27433543
Zhang, Min; Zhang, Lin; Zou, Jinfeng; Yao, Chen; Xiao, Hui; Liu, Qing; Wang, Jing; Wang, Dong; Wang, Chenguang; Guo, Zheng
2009-07-01
According to current consistency metrics such as percentage of overlapping genes (POG), lists of differentially expressed genes (DEGs) detected from different microarray studies for a complex disease are often highly inconsistent. This irreproducibility problem also exists in other high-throughput post-genomic areas such as proteomics and metabolism. A complex disease is often characterized by many coordinated molecular changes, which should be considered when evaluating the reproducibility of discovery lists from different studies. We proposed the metrics percentage of overlapping genes-related (POGR) and normalized POGR (nPOGR) to evaluate the consistency between two DEG lists for a complex disease, considering correlated molecular changes rather than only counting gene overlaps between the lists. Based on microarray datasets of three diseases, we showed that although the POG scores for DEG lists from different studies for each disease are extremely low, the POGR and nPOGR scores can be rather high, suggesting that the apparently inconsistent DEG lists may be highly reproducible in the sense that they are actually significantly correlated. Evaluating the different discovery results for a disease with the POGR and nPOGR scores can reduce the uncertainty of microarray studies. The proposed metrics could also be applicable in many other high-throughput post-genomic areas.
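The difference between counting only shared genes and also crediting correlated genes can be sketched as follows. The `pog` function uses the common definition of POG as the shared fraction of the first list. The `pogr_like` function is a simplified stand-in for the idea behind POGR, not the published formula, and the gene identifiers and the `related_pairs` set are hypothetical.

```python
def pog(list_a, list_b):
    """Fraction of genes in list_a that also appear in list_b."""
    a, b = set(list_a), set(list_b)
    if not a:
        raise ValueError("list_a must not be empty")
    return len(a & b) / len(a)


def pogr_like(list_a, list_b, related_pairs):
    """Like pog, but also credits genes in list_a that are related
    (for example, significantly coexpressed) to some gene in list_b.

    related_pairs is a set of unordered gene pairs given as tuples.
    Simplified illustration only, not the published POGR definition.
    """
    a, b = set(list_a), set(list_b)
    if not a:
        raise ValueError("list_a must not be empty")
    shared = a & b
    related = {
        g for g in a - shared
        if any((g, h) in related_pairs or (h, g) in related_pairs for h in b)
    }
    return (len(shared) + len(related)) / len(a)


# Toy DEG lists from two hypothetical studies of the same disease.
study_1 = ["g1", "g2", "g3", "g4"]
study_2 = ["g1", "g5", "g6", "g7"]
related_pairs = {("g2", "g5"), ("g6", "g3")}

pog_score = pog(study_1, study_2)
pogr_score = pogr_like(study_1, study_2, related_pairs)
print(pog_score, pogr_score)
```

In this toy case the overlap-only score is low while the score that credits related genes is high, which mirrors the pattern the abstract reports.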
Lovell, Peter V; Huizinga, Nicole A; Getachew, Abel; Mees, Brianna; Friedrich, Samantha R; Wirthlin, Morgan; Mello, Claudio V
2018-05-18
Zebra finches are a major model organism for investigating mechanisms of vocal learning, a trait that enables spoken language in humans. The development of cDNA collections with expressed sequence tags (ESTs) and microarrays has allowed for extensive molecular characterizations of circuitry underlying vocal learning and production. However, poor database curation can lead to errors in transcriptome and bioinformatics analyses, limiting the impact of these resources. Here we used genomic alignments and synteny analysis for orthology verification to curate and reannotate ~35% of the oligonucleotides and corresponding ESTs/cDNAs that make up Agilent microarrays for gene expression analysis in finches. We found that: (1) 5475 out of 43,084 oligos (a) failed to align to the zebra finch genome, (b) aligned to multiple loci, or (c) aligned to Chr_un only, and thus need to be flagged until a better genome assembly is available, or (d) reflect cloning artifacts; (2) out of 9635 valid oligos examined further, 3120 were incorrectly named, including 1533 with no known orthologs; and (3) 2635 oligos required a name update. The resulting curated dataset provides a reference for correcting gene identification errors in previous finch microarray studies, and for avoiding such errors in future studies.
Integrative prescreening in analysis of multiple cancer genomic studies
2012-01-01
Background In high throughput cancer genomic studies, results from the analysis of single datasets often suffer from a lack of reproducibility because of small sample sizes. Integrative analysis can effectively pool and analyze multiple datasets and provides a cost effective way to improve reproducibility. In integrative analysis, simultaneously analyzing all genes profiled may incur high computational cost. A computationally affordable remedy is prescreening, which fits marginal models, can be conducted in a parallel manner, and has low computational cost. Results An integrative prescreening approach is developed for the analysis of multiple cancer genomic datasets. Simulation shows that the proposed integrative prescreening has better performance than alternatives, particularly including prescreening with individual datasets, an intensity approach and meta-analysis. We also analyze multiple microarray gene profiling studies on liver and pancreatic cancers using the proposed approach. Conclusions The proposed integrative prescreening provides an effective way to reduce the dimensionality in cancer genomic studies. It can be coupled with existing analysis methods to identify cancer markers. PMID:22799431
Questioning the utility of pooling samples in microarray experiments with cell lines.
Lusa, L; Cappelletti, V; Gariboldi, M; Ferrario, C; De Cecco, L; Reid, J F; Toffanin, S; Gallus, G; McShane, L M; Daidone, M G; Pierotti, M A
2006-01-01
We describe a microarray experiment using the MCF-7 breast cancer cell line in two different experimental conditions for which the same number of independent pools as the number of individual samples was hybridized on Affymetrix GeneChips. Unexpectedly, when using individual samples, the number of probe sets found to be differentially expressed between treated and untreated cells was about three times greater than that found using pools. These findings indicate that pooling samples in microarray experiments where the biological variability is expected to be small might not be helpful and could even decrease one's ability to identify differentially expressed genes.
2016-01-01
Abstract Microarray gene expression data sets are jointly analyzed to increase statistical power. They could either be merged together or analyzed by meta-analysis. For a given ensemble of data sets, it cannot be foreseen which of these paradigms, merging or meta-analysis, works better. In this article, three joint analysis methods, Z-score normalization, ComBat and the inverse normal method (meta-analysis) were selected for survival prognosis and risk assessment of breast cancer patients. The methods were applied to eight microarray gene expression data sets, totaling 1324 patients with two clinical endpoints, overall survival and relapse-free survival. The performance derived from the joint analysis methods was evaluated using Cox regression for survival analysis and independent validation used as bias estimation. Overall, Z-score normalization had a better performance than ComBat and meta-analysis. Higher Area Under the Receiver Operating Characteristic curve and hazard ratio were also obtained when independent validation was used as bias estimation. With a lower time and memory complexity, Z-score normalization is a simple method for joint analysis of microarray gene expression data sets. The derived findings suggest further assessment of this method in future survival prediction and cancer classification applications. PMID:26504096
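Merging by Z-score normalization can be sketched as standardizing each gene within each data set and then concatenating samples across data sets. The gene names, expression values and the choice of the sample standard deviation below are assumptions made for illustration; the abstract does not give these details.

```python
from statistics import mean, stdev


def zscore_genes(dataset):
    """Standardize each gene within one data set.

    dataset maps a gene name to its expression values across samples.
    A gene with zero variance is mapped to all zeros.
    """
    normalized = {}
    for gene, values in dataset.items():
        mu = mean(values)
        sd = stdev(values)
        normalized[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return normalized


def merge_by_zscore(datasets):
    """Z-score each data set separately, then pool samples per shared gene."""
    common = set.intersection(*(set(d) for d in datasets))
    merged = {gene: [] for gene in sorted(common)}
    for dataset in datasets:
        normalized = zscore_genes(dataset)
        for gene in merged:
            merged[gene].extend(normalized[gene])
    return merged


# Two toy studies measured on different scales (values are made up).
study_a = {"ESR1": [8.0, 9.0, 10.0], "MKI67": [5.0, 5.5, 6.0]}
study_b = {"ESR1": [2.0, 3.0, 4.0], "MKI67": [1.0, 2.0, 3.0], "GATA3": [4.0, 4.0, 5.0]}

merged = merge_by_zscore([study_a, study_b])
print(merged["ESR1"])
```

After this step each gene has mean 0 and unit variance within every study, so study-specific offsets and scales no longer dominate the pooled data.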
Dupľáková, Nikoleta; Renák, David; Hovanec, Patrik; Honysová, Barbora; Twell, David; Honys, David
2007-07-23
Microarray technologies now belong to the standard functional genomics toolbox and have undergone massive development leading to increased genome coverage, accuracy and reliability. The number of experiments exploiting microarray technology has markedly increased in recent years. In parallel with the rapid accumulation of transcriptomic data, on-line analysis tools are being introduced to simplify their use. Global statistical data analysis methods contribute to the development of overall concepts about gene expression patterns and to query and compose working hypotheses. More recently, these applications are being supplemented with more specialized products offering visualization and specific data mining tools. We present a curated gene family-oriented gene expression database, Arabidopsis Gene Family Profiler (aGFP; http://agfp.ueb.cas.cz), which gives the user access to a large collection of normalised Affymetrix ATH1 microarray datasets. The database currently contains NASC Array and AtGenExpress transcriptomic datasets for various tissues at different developmental stages of wild type plants gathered from nearly 350 gene chips. The Arabidopsis GFP database has been designed as an easy-to-use tool for users needing an easily accessible resource for expression data of single genes, pre-defined gene families or custom gene sets, with the further possibility of keyword search. Arabidopsis Gene Family Profiler presents a user-friendly web interface using both graphic and text output. Data are stored at the MySQL server and individual queries are created in PHP script. 
The most distinguishable features of Arabidopsis Gene Family Profiler database are: 1) the presentation of normalized datasets (Affymetrix MAS algorithm and calculation of model-based gene-expression values based on the Perfect Match-only model); 2) the choice between two different normalization algorithms (Affymetrix MAS4 or MAS5 algorithms); 3) an intuitive interface; 4) an interactive "virtual plant" visualizing the spatial and developmental expression profiles of both gene families and individual genes. Arabidopsis GFP gives users the possibility to analyze current Arabidopsis developmental transcriptomic data starting with simple global queries that can be expanded and further refined to visualize comparative and highly selective gene expression profiles.
Gluck, Christian; Min, Sangwon; Oyelakin, Akinsola; Smalley, Kirsten; Sinha, Satrajit; Romano, Rose-Anne
2016-11-16
Mouse models have served a valuable role in deciphering various facets of Salivary Gland (SG) biology, from normal developmental programs to diseased states. To facilitate such studies, gene expression profiling maps have been generated for various stages of SG organogenesis. However, these prior studies fall short of capturing the transcriptional complexity due to the limited scope of gene-centric microarray-based technology. Compared to microarray, RNA-sequencing (RNA-seq) offers unbiased detection of novel transcripts, broader dynamic range and high specificity and sensitivity for detection of genes, transcripts, and differential gene expression. Although RNA-seq data, particularly under the auspices of the ENCODE project, have covered a large number of biological specimens, studies on the SG have been lacking. To better appreciate the wide spectrum of gene expression profiles, we isolated RNA from mouse submandibular salivary glands at different embryonic and adult stages. In parallel, we processed RNA-seq data for 24 organs and tissues obtained from the mouse ENCODE consortium and calculated the average gene expression values. To identify molecular players and pathways likely to be relevant for SG biology, we performed functional gene enrichment analysis, network construction and hierarchical clustering of the RNA-seq datasets obtained from different stages of SG development and maturation, and other mouse organs and tissues. Our bioinformatics-based data analysis not only reaffirmed known modulators of SG morphogenesis but also revealed novel transcription factors and signaling pathways unique to mouse SG biology and function. Finally, we demonstrated that the unique SG gene signature obtained from our mouse studies is also well conserved and can demarcate features of the human SG transcriptome that differ from other tissues. 
Our RNA-seq based Atlas has revealed a high-resolution cartographic view of the dynamic transcriptomic landscape of the mouse SG at various stages. These RNA-seq datasets will complement pre-existing microarray based datasets, including the Salivary Gland Molecular Anatomy Project, by offering a broader systems-biology based perspective rather than the classical gene-centric view. Ultimately, such resources will provide a useful toolkit to better understand how the diverse cell populations of the SG are organized and controlled during development and differentiation.
Compact cancer biomarkers discovery using a swarm intelligence feature selection algorithm.
Martinez, Emmanuel; Alvarez, Mario Moises; Trevino, Victor
2010-08-01
Biomarker discovery is a typical application of functional genomics. Due to the large number of genes studied simultaneously in microarray data, feature selection is a key step. Swarm intelligence has emerged as a solution for the feature selection problem. However, swarm intelligence settings for feature selection fail to select small feature subsets. We have proposed a swarm intelligence feature selection algorithm based on the initialization and update of only a subset of particles in the swarm. In this study, we tested our algorithm on 11 microarray datasets for brain, leukemia, lung, prostate, and others. We show that the proposed swarm intelligence algorithm successfully increases the classification accuracy and decreases the number of selected features compared to other swarm intelligence methods. Copyright © 2010 Elsevier Ltd. All rights reserved.
McLachlan, G J; Bean, R W; Jones, L Ben-Tovim
2006-07-01
An important problem in microarray experiments is the detection of genes that are differentially expressed in a given number of classes. We provide a straightforward and easily implemented method for estimating the posterior probability that an individual gene is null. The problem can be expressed in a two-component mixture framework, using an empirical Bayes approach. Current methods of implementing this approach either have some limitations due to the minimal assumptions made or with more specific assumptions are computationally intensive. By converting to a z-score the value of the test statistic used to test the significance of each gene, we propose a simple two-component normal mixture that models adequately the distribution of this score. The usefulness of our approach is demonstrated on three real datasets.
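The posterior probability that a gene is null under a two-component normal mixture follows from Bayes' rule. The sketch below assumes the null component is standard normal and uses made-up values for the mixing proportion and the non-null component. In practice these would be estimated from the z-scores, for example by the EM algorithm, a step not shown here.

```python
import math


def normal_pdf(z, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and s.d. sigma."""
    return math.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def posterior_null(z, pi0, mu1, sigma1):
    """Posterior probability that a gene with z-score z is null.

    Mixture density: f(z) = pi0 * N(z; 0, 1) + (1 - pi0) * N(z; mu1, sigma1^2).
    """
    null_part = pi0 * normal_pdf(z)
    alt_part = (1.0 - pi0) * normal_pdf(z, mu1, sigma1)
    return null_part / (null_part + alt_part)


# Hypothetical parameters: 90% of genes null, non-null z-scores ~ N(2.5, 1).
PI0, MU1, SIGMA1 = 0.9, 2.5, 1.0

tau_small = posterior_null(0.0, PI0, MU1, SIGMA1)
tau_large = posterior_null(4.0, PI0, MU1, SIGMA1)
print(f"z = 0.0 -> P(null) = {tau_small:.3f}")
print(f"z = 4.0 -> P(null) = {tau_large:.3f}")
```

Genes can then be ranked by this posterior probability, with small values indicating likely differential expression.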
Paiton, Dylan M.; Kenyon, Garrett T.; Brumby, Steven P.; Schultz, Peter F.; George, John S.
2015-07-28
An approach to detecting objects in an image dataset may combine texture/color detection, shape/contour detection, and/or motion detection using sparse, generative, hierarchical models with lateral and top-down connections. A first independent representation of objects in an image dataset may be produced using a color/texture detection algorithm. A second independent representation of objects in the image dataset may be produced using a shape/contour detection algorithm. A third independent representation of objects in the image dataset may be produced using a motion detection algorithm. The first, second, and third independent representations may then be combined into a single coherent output using a combinatorial algorithm.
Evaluation of artificial time series microarray data for dynamic gene regulatory network inference.
Xenitidis, P; Seimenis, I; Kakolyris, S; Adamopoulos, A
2017-08-07
High-throughput technology like microarrays is widely used in the inference of gene regulatory networks (GRNs). We focused on time series data since we are interested in the dynamics of GRNs and the identification of dynamic networks. We evaluated the amount of information that exists in artificial time series microarray data and the ability of an inference process to produce accurate models based on them. We used dynamic artificial gene regulatory networks in order to create artificial microarray data. Key features that characterize microarray data such as the time separation of directly triggered genes, the percentage of directly triggered genes and the triggering function type were altered in order to reveal the limits that are imposed by the nature of microarray data on the inference process. We examined the effect of various factors on the inference performance such as the network size, the presence of noise in microarray data, and the network sparseness. We used a system theory approach and examined the relationship between the pole placement of the inferred system and the inference performance. We examined the relationship between the inference performance in the time domain and the true system parameter identification. Simulation results indicated that time separation and the percentage of directly triggered genes are crucial factors. Also, network sparseness, the triggering function type and noise in input data affect the inference performance. When two factors were simultaneously varied, it was found that variation of one parameter significantly affects the dynamic response of the other. Crucial factors were also examined using a real GRN and acquired results confirmed simulation findings with artificial data. Different initial conditions were also used as an alternative triggering approach. Relevant results confirmed that the number of datasets constitutes the most significant parameter with regard to the inference performance. 
Copyright © 2017 Elsevier Ltd. All rights reserved.
2012-01-01
Background The role of n-3 fatty acids in prevention of breast cancer is well recognized, but the underlying molecular mechanisms are still unclear. In view of the growing need for early detection of breast cancer, Graham et al. (2010) studied the microarray gene expression in histologically normal epithelium of subjects with or without breast cancer. We conducted a secondary analysis of this dataset with a focus on the genes (n = 47) involved in fat and lipid metabolism. We used stepwise multivariate logistic regression analyses, volcano plots and false discovery rates for association analyses. We also conducted meta-analyses of other microarray studies using random effects models for three outcomes--risk of breast cancer (380 breast cancer patients and 240 normal subjects), risk of metastasis (430 metastatic compared to 1104 non-metastatic breast cancers) and risk of recurrence (484 recurring versus 890 non-recurring breast cancers). Results The HADHA gene [hydroxyacyl-CoA dehydrogenase/3-ketoacyl-CoA thiolase/enoyl-CoA hydratase (trifunctional protein), alpha subunit] was significantly under-expressed in breast cancer; more so in those with estrogen receptor-negative status. Our meta-analysis showed an 18.4%-26% reduction in HADHA expression in breast cancer. Also, there was an inconclusive but consistent under-expression of HADHA in subjects with metastatic and recurring breast cancers. Conclusions Involvement of mitochondria and the mitochondrial trifunctional protein (encoded by HADHA gene) in breast carcinogenesis is known. Our results lend additional support to the possibility of this involvement. Further, our results suggest that targeted subset analysis of large genome-based datasets can provide interesting association signals. PMID:22240105
Morine, Melissa J; McMonagle, Jolene; Toomey, Sinead; Reynolds, Clare M; Moloney, Aidan P; Gormley, Isobel C; Gaora, Peadar O; Roche, Helen M
2010-10-07
Currently, a number of bioinformatics methods are available to generate appropriate lists of genes from a microarray experiment. While these lists represent an accurate primary analysis of the data, fewer options exist to contextualise those lists. The development and validation of such methods is crucial to the wider application of microarray technology in the clinical setting. Two key challenges in clinical bioinformatics involve appropriate statistical modelling of dynamic transcriptomic changes, and extraction of clinically relevant meaning from very large datasets. Here, we apply an approach to gene set enrichment analysis that allows for detection of bi-directional enrichment within a gene set. Furthermore, we apply canonical correlation analysis and Fisher's exact test, using plasma marker data with known clinical relevance to aid identification of the most important gene and pathway changes in our transcriptomic dataset. After a 28-day dietary intervention with high-CLA beef, a range of plasma markers indicated a marked improvement in the metabolic health of genetically obese mice. Tissue transcriptomic profiles indicated that the effects were most dramatic in liver (1270 genes significantly changed; p < 0.05), followed by muscle (601 genes) and adipose (16 genes). Results from modified GSEA showed that the high-CLA beef diet affected diverse biological processes across the three tissues, and that the majority of pathway changes reached significance only with the bi-directional test. Combining the liver tissue microarray results with plasma marker data revealed 110 CLA-sensitive genes showing strong canonical correlation with one or more plasma markers of metabolic health, and 9 significantly overrepresented pathways among this set; each of these pathways was also significantly changed by the high-CLA diet. 
Closer inspection of two of these pathways--selenoamino acid metabolism and steroid biosynthesis--illustrated clear diet-sensitive changes in constituent genes, as well as strong correlations between gene expression and plasma markers of metabolic syndrome independent of the dietary effect. Bi-directional gene set enrichment analysis more accurately reflects dynamic regulatory behaviour in biochemical pathways, and as such highlighted biologically relevant changes that were not detected using a traditional approach. In such cases where transcriptomic response to treatment is exceptionally large, canonical correlation analysis in conjunction with Fisher's exact test highlights the subset of pathways showing strongest correlation with the clinical markers of interest. In this case, we have identified selenoamino acid metabolism and steroid biosynthesis as key pathways mediating the observed relationship between metabolic health and high-CLA beef. These results indicate that this type of analysis has the potential to generate novel transcriptome-based biomarkers of disease.
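Pathway overrepresentation with Fisher's exact test can be sketched as a one-sided hypergeometric tail probability. The selected-set size of 110 matches the CLA-sensitive gene count quoted above. The background size, pathway size and overlap are invented for illustration, and the authors' exact test configuration is not given in the abstract.

```python
from math import comb


def overrepresentation_p(k, n_selected, pathway_size, n_background):
    """One-sided Fisher's exact (hypergeometric) p-value.

    Probability of seeing at least k pathway genes among n_selected genes
    drawn without replacement from n_background genes, of which
    pathway_size belong to the pathway.
    """
    upper = min(n_selected, pathway_size)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(pathway_size, i) * comb(n_background - pathway_size, n_selected - i)
        for i in range(k, upper + 1)
    )
    return tail / total


# 110 selected genes (from the abstract); the other numbers are hypothetical.
p_value = overrepresentation_p(k=5, n_selected=110, pathway_size=40, n_background=10000)
print(f"p = {p_value:.2e}")
```

With these toy numbers only about 0.44 pathway genes would be expected by chance, so an overlap of 5 yields a small p-value.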
2010-01-01
Background Currently, a number of bioinformatics methods are available to generate appropriate lists of genes from a microarray experiment. While these lists represent an accurate primary analysis of the data, fewer options exist to contextualise those lists. The development and validation of such methods is crucial to the wider application of microarray technology in the clinical setting. Two key challenges in clinical bioinformatics involve appropriate statistical modelling of dynamic transcriptomic changes, and extraction of clinically relevant meaning from very large datasets. Results Here, we apply an approach to gene set enrichment analysis that allows for detection of bi-directional enrichment within a gene set. Furthermore, we apply canonical correlation analysis and Fisher's exact test, using plasma marker data with known clinical relevance to aid identification of the most important gene and pathway changes in our transcriptomic dataset. After a 28-day dietary intervention with high-CLA beef, a range of plasma markers indicated a marked improvement in the metabolic health of genetically obese mice. Tissue transcriptomic profiles indicated that the effects were most dramatic in liver (1270 genes significantly changed; p < 0.05), followed by muscle (601 genes) and adipose (16 genes). Results from modified GSEA showed that the high-CLA beef diet affected diverse biological processes across the three tissues, and that the majority of pathway changes reached significance only with the bi-directional test. Combining the liver tissue microarray results with plasma marker data revealed 110 CLA-sensitive genes showing strong canonical correlation with one or more plasma markers of metabolic health, and 9 significantly overrepresented pathways among this set; each of these pathways was also significantly changed by the high-CLA diet. 
Closer inspection of two of these pathways - selenoamino acid metabolism and steroid biosynthesis - illustrated clear diet-sensitive changes in constituent genes, as well as strong correlations between gene expression and plasma markers of metabolic syndrome independent of the dietary effect. Conclusion Bi-directional gene set enrichment analysis more accurately reflects dynamic regulatory behaviour in biochemical pathways, and as such highlighted biologically relevant changes that were not detected using a traditional approach. In such cases where transcriptomic response to treatment is exceptionally large, canonical correlation analysis in conjunction with Fisher's exact test highlights the subset of pathways showing strongest correlation with the clinical markers of interest. In this case, we have identified selenoamino acid metabolism and steroid biosynthesis as key pathways mediating the observed relationship between metabolic health and high-CLA beef. These results indicate that this type of analysis has the potential to generate novel transcriptome-based biomarkers of disease. PMID:20929581
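The pathway overrepresentation step described above relies on Fisher's exact test. A minimal sketch of the one-sided version is given below, assuming the usual 2×2 setup (genes in or out of the selected set, in or out of the pathway); the function name and all numbers other than the 110-gene set size are hypothetical and do not come from the study.

```python
from math import comb


def fisher_overrepresentation_p(hits, set_size, pathway_size, universe_size):
    """One-sided Fisher's exact p-value for pathway overrepresentation.

    Returns P(X >= hits), where X is the number of pathway genes found in a
    random draw of `set_size` genes from `universe_size` genes, of which
    `pathway_size` belong to the pathway (hypergeometric upper tail).
    """
    max_hits = min(set_size, pathway_size)
    total = comb(universe_size, set_size)
    tail = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, set_size - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / total


# Hypothetical illustration: 110 selected genes, a 40-gene pathway,
# 6 pathway genes among the selected ones, 12000 genes on the array.
p_value = fisher_overrepresentation_p(6, 110, 40, 12000)
print(f"p = {p_value:.3g}")
```

In practice the p-values from many pathways would also need a multiple-testing correction; the sketch leaves that out.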
Hidden discriminative features extraction for supervised high-order time series modeling.
Nguyen, Ngoc Anh Thi; Yang, Hyung-Jeong; Kim, Sunhee
2016-11-01
In this paper, an orthogonal Tucker-decomposition-based extraction of high-order discriminative subspaces from a tensor-based time series data structure is presented, named Tensor Discriminative Feature Extraction (TDFE). TDFE uses category information to maximize the between-class scatter and minimize the within-class scatter, extracting optimal hidden discriminative feature subspaces that are simultaneously spanned by every modality for supervised tensor modeling. In this context, the proposed tensor-decomposition method provides the following benefits: i) it reduces dimensionality while robustly mining the underlying discriminative features, ii) it yields effective, interpretable features that lead to improved classification and visualization, and iii) it reduces the processing time during the training stage and the filtering of the projection by solving the generalized eigenvalue problem at each alternation step. Two real third-order tensor-structured time series datasets (an epilepsy electroencephalogram (EEG) dataset modeled as channel×frequency bin×time frame and a microarray dataset modeled as gene×sample×time) were used for the evaluation of the TDFE. The experimental results corroborate the advantages of the proposed method, with average classification accuracies of 98.26% and 89.63% for the epilepsy dataset and the microarray dataset, respectively. These averages represent an improvement on those of the matrix-based algorithms and recent tensor-based, discriminant-decomposition approaches; this is especially the case considering the small number of samples that are used in practice.
Goonesekere, Nalin C W; Andersen, Wyatt; Smith, Alex; Wang, Xiaosheng
2018-02-01
The lack of specific symptoms at early tumor stages, together with the high biological aggressiveness of the tumor, contributes to the high mortality rate for pancreatic cancer (PC), which has a 5-year survival rate of about 7%. Recent failures of targeted therapies inhibiting kinase activity in clinical trials have highlighted the need for new approaches towards combating this deadly disease. In this study, we have identified genes that are significantly downregulated in PC, through a meta-analysis of a large number of microarray datasets. We have used qRT-PCR to confirm the downregulation of selected genes in a panel of PC cell lines. This study has yielded several novel candidate tumor-suppressor genes (TSGs) including GNMT, CEL, PLA2G1B and SERPINI2. We highlight the role of GNMT, a methyl transferase associated with the methylation potential of the cell, and CEL, a lipase, as potential therapeutic targets. We have uncovered genetic links to risk factors associated with PC such as smoking and obesity. Genes important for patient survival and prognosis are also discussed, and we confirm the dysregulation of metabolic pathways previously observed in PC. While many of the genes downregulated in our dataset are associated with protein products normally produced by the pancreas for excretion, we have uncovered some genes whose downregulation appears to play a more causal role in PC. These genes will assist in providing a better understanding of the disease etiology of PC, and in the search for new therapeutic targets and biomarkers.
Schönmann, Susan; Loy, Alexander; Wimmersberger, Céline; Sobek, Jens; Aquino, Catharine; Vandamme, Peter; Frey, Beat; Rehrauer, Hubert; Eberl, Leo
2009-04-01
For cultivation-independent and highly parallel analysis of members of the genus Burkholderia, an oligonucleotide microarray (phylochip) consisting of 131 hierarchically nested 16S rRNA gene-targeted oligonucleotide probes was developed. A novel primer pair was designed for selective amplification of a 1.3 kb 16S rRNA gene fragment of Burkholderia species prior to microarray analysis. The diagnostic performance of the microarray for identification and differentiation of Burkholderia species was tested with 44 reference strains of the genera Burkholderia, Pandoraea, Ralstonia and Limnobacter. Hybridization patterns based on presence/absence of probe signals were interpreted semi-automatically using the novel likelihood-based strategy of the web-tool PhyloDetect. Eighty-eight per cent of the reference strains were correctly identified at the species level. The evaluated microarray was applied to investigate shifts in the Burkholderia community structure in acidic forest soil upon addition of cadmium, a condition that selected for Burkholderia species. The microarray results were in agreement with those obtained from phylogenetic analysis of Burkholderia 16S rRNA gene sequences recovered from the same cadmium-contaminated soil, demonstrating the value of the Burkholderia phylochip for determinative and environmental studies.
A null model for Pearson coexpression networks.
Gobbi, Andrea; Jurman, Giuseppe
2015-01-01
Gene coexpression networks inferred by correlation from high-throughput profiling such as microarray data represent simple but effective structures for discovering and interpreting linear gene relationships. In recent years, several approaches have been proposed to tackle the problem of deciding when the resulting correlation values are statistically significant. This is most crucial when the number of samples is small, yielding a non-negligible chance that even high correlation values are due to random effects. Here we introduce a novel hard thresholding solution based on the assumption that a coexpression network inferred by randomly generated data is expected to be empty. The threshold is theoretically derived by means of an analytic approach and, as a deterministic independent null model, it depends only on the dimensions of the starting data matrix, with assumptions on the skewness of the data distribution compatible with the structure of gene expression levels data. We show, on synthetic and array datasets, that the proposed threshold is effective in eliminating all false positive links, with an offsetting cost in terms of false negative detected edges.
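The guiding idea above, that a coexpression network inferred from random data should be empty, can be illustrated with a small simulation. The sketch below is not the analytic threshold derived in the paper: it swaps in an empirical stand-in, assuming Gaussian random data, and takes the largest absolute Pearson correlation seen in random matrices of the same dimensions as a hard threshold. All function names are hypothetical.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def empirical_null_threshold(n_genes, n_samples, n_trials=20, seed=0):
    """Largest |r| observed among gene pairs in Gaussian random matrices.

    Assumption: simulation replaces the paper's analytic derivation; the
    result depends only on the matrix dimensions (and the random seed).
    """
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(n_trials):
        data = [[rng.gauss(0.0, 1.0) for _ in range(n_samples)]
                for _ in range(n_genes)]
        for i in range(n_genes):
            for j in range(i + 1, n_genes):
                worst = max(worst, abs(pearson(data[i], data[j])))
    return worst


def coexpression_edges(data, threshold):
    """Gene pairs (row indices) whose |r| exceeds the hard threshold."""
    return [(i, j)
            for i in range(len(data))
            for j in range(i + 1, len(data))
            if abs(pearson(data[i], data[j])) > threshold]
```

With few samples the simulated threshold comes out close to 1, which mirrors the abstract's point that high correlation values can arise by chance when the sample size is small, and that removing false positives costs some true edges.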
Workflows for microarray data processing in the Kepler environment.
Stropp, Thomas; McPhillips, Timothy; Ludäscher, Bertram; Bieda, Mark
2012-05-17
Microarray data analysis has been the subject of extensive and ongoing pipeline development due to its complexity, the availability of several options at each analysis step, and the development of new analysis demands, including integration with new data sources. Bioinformatics pipelines are usually custom built for different applications, making them typically difficult to modify, extend and repurpose. Scientific workflow systems are intended to address these issues by providing general-purpose frameworks in which to develop and execute such pipelines. The Kepler workflow environment is a well-established system under continual development that is employed in several areas of scientific research. Kepler provides a flexible graphical interface, featuring clear display of parameter values, for design and modification of workflows. It has capabilities for developing novel computational components in the R, Python, and Java programming languages, all of which are widely used for bioinformatics algorithm development, along with capabilities for invoking external applications and using web services. We developed a series of fully functional bioinformatics pipelines addressing common tasks in microarray processing in the Kepler workflow environment. These pipelines consist of a set of tools for GFF file processing of NimbleGen chromatin immunoprecipitation on microarray (ChIP-chip) datasets and more comprehensive workflows for Affymetrix gene expression microarray bioinformatics and basic primer design for PCR experiments, which are often used to validate microarray results. Although functional in themselves, these workflows can be easily customized, extended, or repurposed to match the needs of specific projects and are designed to be a toolkit and starting point for specific applications. 
These workflows illustrate a workflow programming paradigm focusing on local resources (programs and data) and therefore are close to traditional shell scripting or R/BioConductor scripting approaches to pipeline design. Finally, we suggest that microarray data processing task workflows may provide a basis for future example-based comparison of different workflow systems. We provide a set of tools and complete workflows for microarray data analysis in the Kepler environment, which has the advantages of offering graphical, clear display of conceptual steps and parameters and the ability to easily integrate other resources such as remote data and web services.
Data reuse and the open data citation advantage
Vision, Todd J.
2013-01-01
Background. Attribution to the original contributor upon reuse of published data is important both as a reward for data creators and to document the provenance of research findings. Previous studies have found that papers with publicly available datasets receive a higher number of citations than similar studies without available data. However, few previous analyses have had the statistical power to control for the many variables known to predict citation rate, which has led to uncertain estimates of the “citation benefit”. Furthermore, little is known about patterns in data reuse over time and across datasets. Method and Results. Here, we look at citation rates while controlling for many known citation predictors and investigate the variability of data reuse. In a multivariate regression on 10,555 studies that created gene expression microarray data, we found that studies that made data available in a public repository received 9% (95% confidence interval: 5% to 13%) more citations than similar studies for which the data was not made available. Date of publication, journal impact factor, open access status, number of authors, first and last author publication history, corresponding author country, institution citation history, and study topic were included as covariates. The citation benefit varied with date of dataset deposition: a citation benefit was most clear for papers published in 2004 and 2005, at about 30%. Authors published most papers using their own datasets within two years of their first publication on the dataset, whereas data reuse papers published by third-party investigators continued to accumulate for at least six years. To study patterns of data reuse directly, we compiled 9,724 instances of third party data reuse via mention of GEO or ArrayExpress accession numbers in the full text of papers. 
The level of third-party data use was high: for 100 datasets deposited in year 0, we estimated that 40 papers in PubMed reused a dataset by year 2, 100 by year 4, and more than 150 data reuse papers had been published by year 5. Data reuse was distributed across a broad base of datasets: a very conservative estimate found that 20% of the datasets deposited between 2003 and 2007 had been reused at least once by third parties. Conclusion. After accounting for other factors affecting citation rate, we find a robust citation benefit from open data, although a smaller one than previously reported. We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data. Other factors that may also contribute to the citation benefit are considered. We further conclude that, at least for gene expression microarray data, a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003. PMID:24109559
Kohlmann, Alexander; Kipps, Thomas J; Rassenti, Laura Z; Downing, James R; Shurtleff, Sheila A; Mills, Ken I; Gilkes, Amanda F; Hofmann, Wolf-Karsten; Basso, Giuseppe; Dell’Orto, Marta Campo; Foà, Robin; Chiaretti, Sabina; De Vos, John; Rauhut, Sonja; Papenhausen, Peter R; Hernández, Jesus M; Lumbreras, Eva; Yeoh, Allen E; Koay, Evelyn S; Li, Rachel; Liu, Wei-min; Williams, Paul M; Wieczorek, Lothar; Haferlach, Torsten
2008-01-01
Gene expression profiling has the potential to enhance current methods for the diagnosis of haematological malignancies. Here, we present data on 204 analyses from an international standardization programme that was conducted in 11 laboratories as a prephase to the Microarray Innovations in LEukemia (MILE) study. Each laboratory prepared two cell line samples, together with three replicate leukaemia patient lysates in two distinct stages: (i) a 5-d course of protocol training, and (ii) independent proficiency testing. Unsupervised, supervised, and r2 correlation analyses demonstrated that microarray analysis can be performed with remarkably high intra-laboratory reproducibility and with comparable quality and reliability. PMID:18573112
Molecular Targeted Therapies of Childhood Choroid Plexus Carcinoma
2013-10-01
Microarray intensities were analyzed in PGS, using the benign human choroid plexus papilloma (CPP) samples as an expression baseline reference. This...additional human and mouse CPC genomic profiles (timeframe: months 1-5). The goal of these studies is to expand our number of genomic profiles (DNA and...mRNA arrays) of both human and mouse CPCs to provide a comprehensive dataset with which to identify key candidate oncogenes, tumor suppressor genes
Molecular Targeted Therapies of Childhood Choroid Plexus Carcinoma
2012-10-01
Microarray intensities were analyzed in PGS, using the benign human choroid plexus papilloma (CPP) samples as an expression baseline reference...identify candidate drug targets of CPC. Task 1: Generation of additional human and mouse CPC genomic profiles (timeframe: months 1-5). The goal...of these studies is to expand our number of genomic profiles (DNA and mRNA arrays) of both human and mouse CPCs to provide a comprehensive dataset
DOE Office of Scientific and Technical Information (OSTI.GOV)
Paiton, Dylan M.; Kenyon, Garrett T.; Brumby, Steven P.
An approach to detecting objects in an image dataset may combine texture/color detection, shape/contour detection, and/or motion detection using sparse, generative, hierarchical models with lateral and top-down connections. A first independent representation of objects in an image dataset may be produced using a color/texture detection algorithm. A second independent representation of objects in the image dataset may be produced using a shape/contour detection algorithm. A third independent representation of objects in the image dataset may be produced using a motion detection algorithm. The first, second, and third independent representations may then be combined into a single coherent output using a combinatorial algorithm.
Querying Co-regulated Genes on Diverse Gene Expression Datasets Via Biclustering.
Deveci, Mehmet; Küçüktunç, Onur; Eren, Kemal; Bozdağ, Doruk; Kaya, Kamer; Çatalyürek, Ümit V
2016-01-01
Rapid development and increasing popularity of gene expression microarrays have resulted in a number of studies on the discovery of co-regulated genes. One important way of discovering such co-regulations is the query-based search since gene co-expressions may indicate a shared role in a biological process. Although there exist promising query-driven search methods adapting clustering, they fail to capture many genes that function in the same biological pathway because microarray datasets are fraught with spurious samples or samples of diverse origin, or the pathways might be regulated under only a subset of samples. On the other hand, a class of clustering algorithms known as biclustering algorithms which simultaneously cluster both the items and their features are useful while analyzing gene expression data, or any data in which items are related in only a subset of their samples. This means that genes need not be related in all samples to be clustered together. Because many genes only interact under specific circumstances, biclustering may recover the relationships that traditional clustering algorithms can easily miss. In this chapter, we briefly summarize the literature using biclustering for querying co-regulated genes. Then we present a novel biclustering approach and evaluate its performance by a thorough experimental analysis.
Analyzing Kernel Matrices for the Identification of Differentially Expressed Genes
Xia, Xiao-Lei; Xing, Huanlai; Liu, Xueqin
2013-01-01
One of the most important applications of microarray data is the class prediction of biological samples. For this purpose, statistical tests have often been applied to identify the differentially expressed genes (DEGs), followed by the use of state-of-the-art learning machines, the Support Vector Machine (SVM) in particular. The SVM is a typical sample-based classifier whose performance comes down to how discriminant the samples are. However, DEGs identified by statistical tests are not guaranteed to result in a training dataset composed of discriminant samples. To tackle this problem, a novel gene ranking method, the Kernel Matrix Gene Selection (KMGS), is proposed. The rationale of the method, which is rooted in the fundamental ideas of the SVM algorithm, is described. The notion of "the separability of a sample", which is estimated by performing -like statistics on each column of the kernel matrix, is first introduced. The separability of a classification problem is then measured, from which the significance of a specific gene is deduced. Also described is a method of Kernel Matrix Sequential Forward Selection (KMSFS), which shares the KMGS method's essential ideas but proceeds in a greedy manner. On three public microarray datasets, our proposed algorithms achieved noticeably competitive performance in terms of the B.632+ error rate. PMID:24349110
A formal concept analysis approach to consensus clustering of multi-experiment expression data
2014-01-01
Background Presently, with the increasing number and complexity of available gene expression datasets, the combination of data from multiple microarray studies addressing a similar biological question is gaining importance. The analysis and integration of multiple datasets are expected to yield more reliable and robust results since they are based on a larger number of samples and the effects of the individual study-specific biases are diminished. This is supported by recent studies suggesting that important biological signals are often preserved or enhanced by multiple experiments. One approach to combining data from different experiments is to aggregate their clusterings into a consensus or representative clustering solution, which increases the confidence in the common features of all the datasets and reveals the important differences among them. Results We propose a novel generic consensus clustering technique that applies the Formal Concept Analysis (FCA) approach to the consolidation and analysis of clustering solutions derived from several microarray datasets. These datasets are initially divided into groups of related experiments with respect to a predefined criterion. Subsequently, a consensus clustering algorithm is applied to each group, resulting in one clustering solution per group. These solutions are pooled together and further analysed by employing FCA, which allows valuable insights to be extracted from the data and a gene partition to be generated over all the experiments. In order to validate the FCA-enhanced approach, two consensus clustering algorithms are adapted to incorporate the FCA analysis. Their performance is evaluated on gene expression data from a multi-experiment study examining the global cell-cycle control of fission yeast. 
The FCA results derived from both methods demonstrate that, although the two algorithms optimize different clustering characteristics, FCA is able to overcome and diminish these differences and preserve some relevant biological signals. Conclusions The proposed FCA-enhanced consensus clustering technique is a general approach to the combination of clustering algorithms with FCA for deriving clustering solutions from multiple gene expression matrices. The experimental results presented herein demonstrate that it is a robust data integration technique able to produce a good-quality clustering solution that is representative of the whole set of expression matrices. PMID:24885407
Fuzzy support vector machine for microarray imbalanced data classification
NASA Astrophysics Data System (ADS)
Ladayya, Faroh; Purnami, Santi Wulan; Irhamah
2017-11-01
DNA microarray data contain gene expression measurements with small sample sizes and a large number of features. Furthermore, class imbalance is a common problem in microarray data: it occurs when a dataset is dominated by one class that has significantly more instances than the minority classes. A classification method is therefore needed that handles both high-dimensional and imbalanced data. The Support Vector Machine (SVM) is a classification method capable of handling large or small samples, nonlinearity, high dimensionality, overfitting and local-minimum issues. SVM has been widely applied to DNA microarray data classification, and it has been shown to provide the best performance among machine learning methods. However, imbalanced data remain a problem because SVM treats all samples as equally important, so the results are biased against the minority class. To overcome the imbalance, Fuzzy SVM (FSVM) is proposed. This method applies a fuzzy membership to each input point and reformulates the SVM so that different input points make different contributions to the classifier. The minority classes are given large fuzzy memberships, so FSVM pays more attention to the samples with larger fuzzy membership. Because DNA microarray data are high dimensional, with a very large number of features, feature selection is first performed using the Fast Correlation-Based Filter (FCBF). In this study, the data are analyzed by SVM and FSVM, each with and without FCBF, and their classification performance is compared. Based on the overall results, FSVM on selected features has the best classification performance compared to SVM.
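The fuzzy-membership idea described above can be sketched in a few lines. The abstract does not state which membership function the authors used, so the class-frequency scheme below is an assumption chosen for illustration: every minority-class sample gets membership 1.0 and samples of larger classes get proportionally smaller values. The function name is hypothetical.

```python
from collections import Counter


def class_balanced_memberships(labels):
    """Assign one fuzzy membership in (0, 1] to each sample.

    Assumed scheme (not taken from the study): membership equals the size of
    the smallest class divided by the size of the sample's own class, so the
    rarest class gets 1.0 and more frequent classes get smaller weights.
    """
    counts = Counter(labels)
    smallest = min(counts.values())
    return [smallest / counts[label] for label in labels]


# Toy example: 4 samples of class "normal", 1 of class "tumor".
toy_labels = ["normal", "normal", "normal", "normal", "tumor"]
memberships = class_balanced_memberships(toy_labels)
print(memberships)  # → [0.25, 0.25, 0.25, 0.25, 1.0]
```

In a fuzzy SVM these values scale each sample's misclassification penalty, so an error on a minority sample costs more than one on a majority sample. With an SVM implementation that accepts per-sample weights, the memberships could be supplied as those weights.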
2014-01-01
Background Uncovering the complex transcriptional regulatory networks (TRNs) that underlie plant and animal development remains a challenge. However, a vast amount of data from public microarray experiments is available, which can be subject to inference algorithms in order to recover reliable TRN architectures. Results In this study we present a simple bioinformatics methodology that uses public, carefully curated microarray data and the mutual information algorithm ARACNe in order to obtain a database of transcriptional interactions. We used data from Arabidopsis thaliana root samples to show that the transcriptional regulatory networks derived from this database successfully recover previously identified root transcriptional modules and to propose new transcription factors for the SHORT ROOT/SCARECROW and PLETHORA pathways. We further show that these networks are a powerful tool to integrate and analyze high-throughput expression data, as exemplified by our analysis of a SHORT ROOT induction time-course microarray dataset, and are a reliable source for the prediction of novel root gene functions. In particular, we used our database to predict novel genes involved in root secondary cell-wall synthesis and identified the MADS-box TF XAL1/AGL12 as an unexpected participant in this process. Conclusions This study demonstrates that network inference using carefully curated microarray data yields reliable TRN architectures. In contrast to previous efforts to obtain root TRNs, that have focused on particular functional modules or tissues, our root transcriptional interactions provide an overview of the transcriptional pathways present in Arabidopsis thaliana roots and will likely yield a plethora of novel hypotheses to be tested experimentally. PMID:24739361
Kimbung, Siker; Johansson, Ida; Danielsson, Anna; Veerla, Srinivas; Egyhazi Brage, Suzanne; Frostvik Stolt, Marianne; Skoog, Lambert; Carlsson, Lena; Einbeigi, Zakaria; Lidbrink, Elisabet; Linderholm, Barbro; Loman, Niklas; Malmström, Per-Olof; Söderberg, Martin; Walz, Thomas M; Fernö, Mårten; Hatschek, Thomas; Hedenfalk, Ingrid
2016-01-01
The complete molecular basis of the organ-specificity of metastasis is elusive. This study aimed to provide an independent characterization of the transcriptional landscape of breast cancer metastases with the specific objective to identify liver metastasis-selective genes of prognostic importance following primary tumor diagnosis. A cohort of 304 women with advanced breast cancer was studied. Associations between the site of recurrence and clinicopathologic features were investigated. Fine-needle aspirates of metastases (n = 91) were subjected to whole-genome transcriptional profiling. Liver metastasis-selective genes were identified by significance analysis of microarray (SAM) analyses and independently validated in external datasets. Finally, the prognostic relevance of the liver metastasis-selective genes in primary breast cancer was tested. Liver relapse was associated with estrogen receptor (ER) expression (P = 0.002), luminal B subtype (P = 0.01), and was prognostic for an inferior postrelapse survival (P = 0.01). The major variation in the transcriptional landscape of metastases was also associated with ER expression and molecular subtype. However, liver metastases displayed unique transcriptional fingerprints, characterized by downregulation of extracellular matrix (i.e., stromal) genes. Importantly, we identified a 17-gene liver metastasis-selective signature, which was significantly and independently prognostic for shorter relapse-free (P < 0.001) and overall (P = 0.001) survival in ER-positive tumors. Remarkably, this signature remained independently prognostic for shorter relapse-free survival (P = 0.001) among luminal A tumors. Extracellular matrix (stromal) genes can be used to partition breast cancer by site of relapse and may be used to further refine prognostication in ER positive primary breast cancer.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gee, Harriet E., E-mail: harriet.gee@sydney.edu.au; The Chris O'Brien Lifehouse, Missenden Road, Camperdown, NSW; Central Clinical School, Sydney Medical School, University of Sydney, NSW
Purpose: Local recurrence and distant failure after adjuvant radiation therapy for breast cancer remain significant clinical problems, incompletely predicted by conventional clinicopathologic markers. We had previously identified microRNA-139-5p and microRNA-1274a as key regulators of breast cancer radiation response in vitro. The purpose of this study was to investigate standard clinicopathologic markers of local recurrence in a contemporary series and to establish whether putative target genes of microRNAs involved in DNA repair and cell cycle control could better predict radiation therapy response in vivo. Methods and Materials: With institutional ethics board approval, local recurrence was measured in a contemporary, prospectively collected series of 458 patients treated with radiation therapy after breast-conserving surgery. Additionally, independent publicly available mRNA/microRNA microarray expression datasets totaling >1000 early-stage breast cancer patients, treated with adjuvant radiation therapy, with >10 years of follow-up, were analyzed. The expression of putative microRNA target biomarkers (TOP2A, POLQ, RAD54L, SKP2, PLK2, and RAG1) was correlated with standard clinicopathologic variables using 2-sided nonparametric tests, and with local/distant relapse and survival using Kaplan-Meier and Cox regression analysis. Results: We found a low rate of isolated local recurrence (1.95%) in our modern series, and that few clinicopathologic variables (such as lymphovascular invasion) were significantly predictive. In multiple independent datasets (n>1000), however, high expression of RAD54L, TOP2A, POLQ, and SKP2 significantly correlated with local recurrence, survival, or both in univariate and multivariate analyses (P<.001). Low RAG1 expression significantly correlated with local recurrence (multivariate, P=.008). 
Additionally, RAD54L, SKP2, and PLK2 may be predictive, being prognostic in radiation therapy–treated patients but not in untreated matched control individuals (n=107; P<.05). Conclusions: Biomarkers of DNA repair and cell cycle control can identify patients at high risk of treatment failure in those receiving radiation therapy for early breast cancer in independent cohorts. These should be further investigated prospectively, especially TOP2A and SKP2, for which targeted therapies are available.
Nehme, A; Zibara, K; Cerutti, C; Bricca, G
2015-06-01
The implication of the renin-angiotensin-aldosterone system (RAAS) in atheroma development is well described. However, a complete view of the local RAAS in atheroma is still missing. In this study we aimed to reveal the organization of RAAS in atheroma at the transcriptomic level and identify the transcriptional regulators behind it. Extended RAAS (extRAAS) was defined as the set of 37 genes coding for classical and novel RAAS participants (Figure 1). Five microarray datasets containing overall 590 samples representing carotid and peripheral atheroma were downloaded from the GEO database. Correlation-based hierarchical clustering (R software) of extRAAS genes within each dataset allowed the identification of modules of co-expressed genes. Reproducible co-expression modules across datasets were then extracted. Transcription factors (TFs) having common binding sites (TFBSs) in the promoters of coordinated genes were identified using the Genomatix database tools and analyzed for their correlation with extRAAS genes in the microarray datasets. Expression data revealed the expressed extRAAS components and their relative abundance displaying the favored pathways in atheroma. Three co-expression modules with more than 80% reproducibility across datasets were extracted. Two of them (M1 and M2) contained genes coding for angiotensin metabolizing enzymes involved in different pathways: M1 included ACE, MME, RNPEP, and DPP3, in addition to 7 other genes; and M2 included CMA1, CTSG, and CPA3. The third module (M3) contained genes coding for receptors known to be implicated in atheroma (AGTR1, MR, GR, LNPEP, EGFR and GPER). M1 and M3 were negatively correlated in 3 of 5 datasets. We identified 19 TFs that have enriched TFBSs in the promoters of genes of M1, and two for M3, but none was found for M2. 
Among the extracted TFs, ELF1, MAX, and IRF5 showed significant positive correlations with peptidase-coding genes from M1 and negative correlations with receptor-coding genes from M3 (p < 0.05). The identified co-expression modules display the transcriptional organization of local extRAAS in human carotid atheroma. The identification of several TFs potentially associated with extRAAS genes may provide a framework for the discovery of atheroma-specific modulators of extRAAS activity. (Figure is included in full-text article.)
Loy, Alexander; Lehner, Angelika; Lee, Natuschka; Adamczyk, Justyna; Meier, Harald; Ernst, Jens; Schleifer, Karl-Heinz; Wagner, Michael
2002-01-01
For cultivation-independent detection of sulfate-reducing prokaryotes (SRPs) an oligonucleotide microarray consisting of 132 16S rRNA gene-targeted oligonucleotide probes (18-mers) having hierarchical and parallel (identical) specificity for the detection of all known lineages of sulfate-reducing prokaryotes (SRP-PhyloChip) was designed and subsequently evaluated with 41 suitable pure cultures of SRPs. The applicability of SRP-PhyloChip for diversity screening of SRPs in environmental and clinical samples was tested by using samples from periodontal tooth pockets and from the chemocline of a hypersaline cyanobacterial mat from Solar Lake (Sinai, Egypt). Consistent with previous studies, SRP-PhyloChip indicated the occurrence of Desulfomicrobium spp. in the tooth pockets and the presence of Desulfonema- and Desulfomonile-like SRPs (together with other SRPs) in the chemocline of the mat. The SRP-PhyloChip results were confirmed by several DNA microarray-independent techniques, including specific PCR amplification, cloning, and sequencing of SRP 16S rRNA genes and the genes encoding the dissimilatory (bi)sulfite reductase (dsrAB). PMID:12324358
Bourguignon, Natalia; Bargiela, Rafael; Rojo, David; Chernikova, Tatyana N; de Rodas, Sara A López; García-Cantalejo, Jesús; Näther, Daniela J; Golyshin, Peter N; Barbas, Coral; Ferrero, Marcela; Ferrer, Manuel
2016-12-01
The analysis of catabolic capacities of microorganisms is currently often achieved by cultivation approaches and by the analysis of genomic or metagenomic datasets. Recently, a microarray system was designed from curated key aromatic catabolic gene families and key alkane degradation genes. The collection of genes in the microarray can be exploited to indicate whether a given microbe or microbial community is likely to be functionally connected with certain degradative phenotypes, without previous knowledge of genome data. Herein, this microarray was applied to capture new insights into the catabolic capacities of the copper-resistant actinomycete Amycolatopsis tucumanensis DSM 45259. The array data support the presumptive ability of the DSM 45259 strain to utilize single alkanes (n-decane and n-tetradecane) and aromatics such as benzoate, phthalate and phenol as sole carbon sources, which was experimentally validated by cultivation and mass spectrometry. Interestingly, while the alkB gene encoding an alkane hydroxylase in strain DSM 45259 is most likely highly similar to that found in other actinomycetes, the genes encoding benzoate 1,2-dioxygenase, phthalate 4,5-dioxygenase and phenol hydroxylase were homologous to proteobacterial genes. This suggests that strain DSM 45259 contains catabolic genes distantly related to those found in other actinomycetes. Together, this study not only provides new insight into the catabolic abilities of strain DSM 45259, but also suggests that this strain contains genes uncommon within actinomycetes.
Wang, Wen; Li, Hao; Zhao, Zheng; Wang, Haoyuan; Zhang, Dong; Zhang, Yan; Lan, Qing; Wang, Jiangfei; Cao, Yong; Zhao, Jizong
2018-04-01
Abdominal aortic aneurysms (AAAs) and intracranial saccular aneurysms (IAs) are the most common types of aneurysms. This study aimed to investigate the common pathogenesis shared between these two kinds of aneurysms. We collected 12 IA samples and 12 control arteries from the Beijing Tiantan Hospital and performed microarray analysis. In addition, we utilized the microarray datasets of IAs and AAAs from the Gene Expression Omnibus (GEO), in combination with our microarray results, to generate messenger RNA expression profiles for both AAAs and IAs. Functional exploration and protein-protein interaction (PPI) analysis were performed. A total of 727 common genes were differentially expressed (404 upregulated; 323 downregulated) in both AAAs and IAs. The GO and pathway analyses showed that the common dysregulated genes were mainly enriched in vascular smooth muscle contraction, muscle contraction, immune response, defense response, cell activation, IL-6 signaling, and chemokine signaling pathways. Further PPI analysis identified 35 hub nodes, including TNF, IL6, MAPK13, and CCL5. These hub node genes were enriched in inflammatory response, positive regulation of IL-6 production, chemokine signaling pathway, and T/B cell receptor signaling pathway. Our study provides new insight into the molecular mechanisms underlying the pathogenesis of both types of aneurysms and may provide new therapeutic targets for patients harboring AAAs and IAs.
An integrated method for cancer classification and rule extraction from microarray data
Huang, Liang-Tsung
2009-01-01
Different microarray techniques recently have been successfully used to investigate useful information for cancer diagnosis at the gene expression level due to their ability to measure thousands of gene expression levels in a massively parallel way. One important issue is to improve classification performance of microarray data. However, it would be ideal that influential genes and even interpretable rules can be explored at the same time to offer biological insight. Introducing the concepts of system design in software engineering, this paper has presented an integrated and effective method (named X-AI) for accurate cancer classification and the acquisition of knowledge from DNA microarray data. This method included a feature selector to systematically extract the relative important genes so as to reduce the dimension and retain as much as possible of the class discriminatory information. Next, diagonal quadratic discriminant analysis (DQDA) was combined to classify tumors, and generalized rule induction (GRI) was integrated to establish association rules which can give an understanding of the relationships between cancer classes and related genes. Two non-redundant datasets of acute leukemia were used to validate the proposed X-AI, showing significantly high accuracy for discriminating different classes. On the other hand, I have presented the abilities of X-AI to extract relevant genes, as well as to develop interpretable rules. Further, a web server has been established for cancer classification and it is freely available at . PMID:19272192
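The classification step named in the abstract above, diagonal quadratic discriminant analysis (DQDA), can be sketched in a few lines. This is a minimal illustration of the textbook DQDA rule (class-specific means and per-gene variances, no covariances), not the X-AI implementation; the function names `fit_dqda` and `predict_dqda` and the toy expression values are assumptions made for illustration only.

```python
import math


def fit_dqda(samples, labels):
    """Estimate per-class feature means, per-feature variances, and priors."""
    model = {}
    n_total = len(samples)
    for cls in sorted(set(labels)):
        rows = [s for s, y in zip(samples, labels) if y == cls]
        n = len(rows)  # each class needs at least two samples
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Unbiased per-feature variance; a small floor avoids division by zero.
        variances = [
            max(sum((v - m) ** 2 for v in col) / (n - 1), 1e-8)
            for col, m in zip(columns, means)
        ]
        model[cls] = (means, variances, n / n_total)
    return model


def predict_dqda(model, x):
    """Return the class with the smallest diagonal quadratic discriminant score."""
    def score(params):
        means, variances, prior = params
        quad = sum(
            (xi - m) ** 2 / v + math.log(v)
            for xi, m, v in zip(x, means, variances)
        )
        return quad - 2.0 * math.log(prior)

    return min(model, key=lambda cls: score(model[cls]))


# Hypothetical two-gene expression values for two leukemia classes.
train = [[1.0, 5.0], [1.2, 5.5], [0.8, 4.5], [3.0, 1.0], [3.3, 1.4], [2.7, 0.6]]
classes = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
dqda_model = fit_dqda(train, classes)
print(predict_dqda(dqda_model, [1.1, 5.1]))
```

Because only per-gene variances are estimated, the rule stays usable when genes far outnumber samples, which is the usual situation after feature selection on microarray data.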
An object model and database for functional genomics.
Jones, Andrew; Hunt, Ela; Wastling, Jonathan M; Pizarro, Angel; Stoeckert, Christian J
2004-07-10
Large-scale functional genomics analysis is now feasible and presents significant challenges in data analysis, storage and querying. Data standards are required to enable the development of public data repositories and to improve data sharing. There is an established data format for microarrays (microarray gene expression markup language, MAGE-ML) and a draft standard for proteomics (PEDRo). We believe that all types of functional genomics experiments should be annotated in a consistent manner, and we hope to open up new ways of comparing multiple datasets used in functional genomics. We have created a functional genomics experiment object model (FGE-OM), developed from the microarray model, MAGE-OM and two models for proteomics, PEDRo and our own model (Gla-PSI-Glasgow Proposal for the Proteomics Standards Initiative). FGE-OM comprises three namespaces representing (i) the parts of the model common to all functional genomics experiments; (ii) microarray-specific components; and (iii) proteomics-specific components. We believe that FGE-OM should initiate discussion about the contents and structure of the next version of MAGE and the future of proteomics standards. A prototype database called RNA And Protein Abundance Database (RAPAD), based on FGE-OM, has been implemented and populated with data from microbial pathogenesis. FGE-OM and the RAPAD schema are available from http://www.gusdb.org/fge.html, along with a set of more detailed diagrams. RAPAD can be accessed by registration at the site.
Cloud-scale genomic signals processing classification analysis for gene expression microarray data.
Harvey, Benjamin; Soo-Yeon Ji
2014-01-01
As microarray data available to scientists continues to increase in size and complexity, it has become overwhelmingly important to find multiple ways to bring inference through analysis of DNA/mRNA sequence data that is useful to scientists. Though there have been many attempts to elucidate the issue of bringing forth biological inference by means of wavelet preprocessing and classification, there has not been a research effort that focuses on a cloud-scale classification analysis of microarray data using wavelet thresholding in a cloud environment to identify significantly expressed features. This paper proposes a novel methodology that uses wavelet-based denoising to initialize a threshold for determination of significantly expressed genes for classification. Additionally, this research was implemented within a cloud-based distributed processing environment. Cloud computing and wavelet thresholding were used for the classification of 14 tumor classes from the Global Cancer Map (GCM). The results proved to be more accurate than using a predefined p-value for differential expression classification. This methodology analyzed wavelet-based threshold features of gene expression in a cloud environment and classified the expression of samples by analyzing gene patterns, which inform us of biological processes. It also enables researchers to face the present and forthcoming challenges that may arise in the analysis of large microarray datasets in functional genomics.
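The abstract above does not say which wavelet or thresholding rule was used, so the following is only a sketch of the general idea of wavelet-based thresholding. It assumes a single-level Haar transform, the universal threshold sigma * sqrt(2 ln n) with sigma estimated from the median absolute detail coefficient, and soft thresholding; all function names and the example profile are hypothetical.

```python
import math
import statistics


def haar_step(signal):
    """One level of the orthonormal Haar transform; expects an even-length list."""
    root2 = math.sqrt(2.0)
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [(a + b) / root2 for a, b in pairs]
    detail = [(a - b) / root2 for a, b in pairs]
    return approx, detail


def inverse_haar_step(approx, detail):
    """Invert haar_step, interleaving the reconstructed samples."""
    root2 = math.sqrt(2.0)
    signal = []
    for a, d in zip(approx, detail):
        signal.extend([(a + d) / root2, (a - d) / root2])
    return signal


def universal_threshold(detail, n):
    """sigma * sqrt(2 ln n), sigma from the median absolute detail coefficient."""
    sigma = statistics.median([abs(d) for d in detail]) / 0.6745
    return sigma * math.sqrt(2.0 * math.log(n))


def soft_threshold(value, threshold):
    """Shrink a coefficient toward zero by `threshold`."""
    return math.copysign(max(abs(value) - threshold, 0.0), value)


def denoise(signal):
    """Return (denoised signal, threshold) after one Haar level."""
    approx, detail = haar_step(signal)
    threshold = universal_threshold(detail, len(signal))
    shrunk = [soft_threshold(d, threshold) for d in detail]
    return inverse_haar_step(approx, shrunk), threshold


# Hypothetical expression profile with one sharp spike.
profile = [2.0, 2.1, 1.9, 2.0, 9.0, 2.0, 2.1, 1.9]
smoothed, threshold = denoise(profile)
print(threshold, smoothed)
```

In a pipeline of the kind described, the data-derived threshold would take the place of a fixed p-value cutoff when deciding which features count as significantly expressed; how that decision is made in the paper itself is not stated in the abstract.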
A new locally weighted K-means for cancer-aided microarray data analysis.
Iam-On, Natthakan; Boongoen, Tossapon
2012-11-01
Cancer has been identified as the leading cause of death. It is predicted that around 20-26 million people will be diagnosed with cancer by 2020. With this alarming rate, there is an urgent need for a more effective methodology to understand, prevent and cure cancer. Microarray technology provides a useful basis of achieving this goal, with cluster analysis of gene expression data leading to the discrimination of patients, identification of possible tumor subtypes and individualized treatment. Amongst clustering techniques, k-means is normally chosen for its simplicity and efficiency. However, it does not account for the different importance of data attributes. This paper presents a new locally weighted extension of k-means, which has proven more accurate across many published datasets than the original and other extensions found in the literature.
Rai, Muhammad Farooq; Tycksen, Eric D; Sandell, Linda J; Brophy, Robert H
2018-01-01
Microarrays and RNA-seq are at the forefront of high throughput transcriptome analyses. Since these methodologies are based on different principles, there are concerns about the concordance of data between the two techniques. The concordance of RNA-seq and microarrays for genome-wide analysis of differential gene expression has not been rigorously assessed in clinically derived ligament tissues. To demonstrate the concordance between RNA-seq and microarrays and to assess potential benefits of RNA-seq over microarrays, we assessed differences in transcript expression in anterior cruciate ligament (ACL) tissues based on time-from-injury. ACL remnants were collected from patients with an ACL tear at the time of ACL reconstruction. RNA prepared from torn ACL remnants was subjected to Agilent microarrays (N = 24) and RNA-seq (N = 8). The correlation of biological replicates in RNA-seq and microarrays data was similar (0.98 vs. 0.97), demonstrating that each platform has high internal reproducibility. Correlations between the RNA-seq data and the individual microarrays were low, but correlations between the RNA-seq values and the geometric mean of the microarrays values were moderate. The cross-platform concordance for differentially expressed transcripts or enriched pathways was linearly correlated (r = 0.64). RNA-Seq was superior in detecting low abundance transcripts and differentiating biologically critical isoforms. Additional independent validation of transcript expression was undertaken using microfluidic PCR for selected genes. PCR data showed 100% concordance (in expression pattern) with RNA-seq and microarrays data. These findings demonstrate that RNA-seq has advantages over microarrays for transcriptome profiling of ligament tissues when available and affordable. Furthermore, these findings are likely transferable to other musculoskeletal tissues where tissue collection is challenging and cells are in low abundance. © 2017 Orthopaedic Research Society. 
Published by Wiley Periodicals, Inc. J Orthop Res 36:484-497, 2018.
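The cross-platform comparison in the abstract above (RNA-seq values against the geometric mean of the microarray values) can be illustrated with a short sketch. It assumes the geometric mean is taken per gene across arrays and that agreement is measured with Pearson correlation; the helper names and the numbers are made up for illustration.

```python
import math


def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def platform_correlation(rnaseq, microarrays):
    """rnaseq: one value per gene; microarrays: per gene, values across arrays."""
    averaged = [geometric_mean(per_gene) for per_gene in microarrays]
    return pearson(rnaseq, averaged)


# Hypothetical data: three genes, two arrays per gene.
rnaseq_values = [1.0, 2.0, 3.0]
array_values = [[2.0, 8.0], [4.0, 16.0], [6.0, 24.0]]
print(platform_correlation(rnaseq_values, array_values))
```

Averaging across arrays before correlating reduces array-to-array noise, which is one plausible reason the averaged comparison looks better than comparisons against individual arrays.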
De Marni, Marzia L; Monegal, Ana; Venturini, Samuele; Vinati, Simone; Carbone, Roberta; de Marco, Ario
2012-02-01
The preparation of effective conventional antibody microarrays depends on the availability of high quality material and on the correct accessibility of the antibody active moieties following their immobilization on the support slide. We show that spotting bacteria that expose recombinant antibodies on their external surface directly on nanostructured-TiO2 or epoxy slides (purification-independent microarray, PIM) is a simple and reliable alternative for preparing sensitive and specific microarrays for antigen detection. Variable domains of single heavy-chain antibodies (VHHs) against fibroblast growth factor receptor 1 (FGFR1) were used to capture the antigen diluted in serum or BSA solution. FGFR1 detection was performed either by direct antigen labeling or by using a sandwich system in which FGFR1 was first bound to its antibody and subsequently identified using a labeled FGF. In both cases the signal distribution within each spot was uniform and spot morphology regular. The signal-to-noise ratio was very high and the specificity of the system was confirmed statistically. The limit of detection (LOD) of the system for the antigen was calculated to be 0.4 ng/mL, with a dynamic range between 0.4 ng/mL and 10 μg/mL. The microarrays prepared with bacteria exposing antibodies remain fully functional for at least 31 days after spotting. We finally demonstrated that the method is suitable for other antigen-antibody pairs and expect that it could be easily adapted to further applications such as the display of scFv and IgG antibodies or autoantibody detection using protein PIMs. Copyright © 2011. Published by Elsevier Inc.
Identification and Optimization of Classifier Genes from Multi-Class Earthworm Microarray Dataset
2010-10-28
rapid and accurate diagnostic assays. A variety of toxicological effects have been associated with explosive compounds TNT and RDX. One important goal of...analyze toxicological mechanisms for two military-unique explosive compounds 2,4,6-trinitrotoluene (TNT) and 1,3,5-trinitro-1,3,5-triazacyclohexane...also known as Royal Demolition eXplosive or RDX) [7,8]. These two compounds exhibit distinctive toxicological properties that are accompanied by
Discovering time-lagged rules from microarray data using gene profile classifiers
2011-01-01
Background Gene regulatory networks have an essential role in every process of life. In this regard, the amount of genome-wide time series data is becoming increasingly available, providing the opportunity to discover the time-delayed gene regulatory networks that govern the majority of these molecular processes. Results This paper aims at reconstructing gene regulatory networks from multiple genome-wide microarray time series datasets. In this sense, a new model-free algorithm called GRNCOP2 (Gene Regulatory Network inference by Combinatorial OPtimization 2), which is a significant evolution of the GRNCOP algorithm, was developed using combinatorial optimization of gene profile classifiers. The method is capable of inferring potential time-delay relationships with any span of time between genes from various time series datasets given as input. The proposed algorithm was applied to time series data composed of twenty yeast genes that are highly relevant for the cell-cycle study, and the results were compared against several related approaches. The outcomes have shown that GRNCOP2 outperforms the contrasted methods in terms of the proposed metrics, and that the results are consistent with previous biological knowledge. Additionally, a genome-wide study on multiple publicly available time series data was performed. In this case, the experimentation has exhibited the soundness and scalability of the new method which inferred highly-related statistically-significant gene associations. Conclusions A novel method for inferring time-delayed gene regulatory networks from genome-wide time series datasets is proposed in this paper. The method was carefully validated with several publicly available data sets. The results have demonstrated that the algorithm constitutes a usable model-free approach capable of predicting meaningful relationships between genes, revealing the time-trends of gene regulation. PMID:21524308
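To make the notion of a time-delayed relationship concrete, the sketch below scans candidate lags and reports the one at which a regulator profile best matches a target profile. This swaps in plain lagged Pearson correlation for illustration; it is not GRNCOP2, which the abstract describes as combinatorial optimization over gene profile classifiers. The function names and the toy series are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def lagged_correlation(regulator, target, lag):
    """Correlate regulator[t] with target[t + lag] over the overlapping window."""
    x = regulator[: len(regulator) - lag]
    y = target[lag:]
    return pearson(x, y)


def best_lag(regulator, target, max_lag):
    """Return (lag with the largest absolute correlation, all lag scores)."""
    scores = {
        lag: lagged_correlation(regulator, target, lag)
        for lag in range(max_lag + 1)
    }
    return max(scores, key=lambda lag: abs(scores[lag])), scores


# Hypothetical series: the target repeats the regulator two time points later.
regulator_series = [0.0, 1.0, 0.0, 2.0, 0.0, 3.0, 1.0, 0.0]
target_series = [5.0, 4.0, 0.0, 1.0, 0.0, 2.0, 0.0, 3.0]
lag, lag_scores = best_lag(regulator_series, target_series, max_lag=3)
print(lag, lag_scores[lag])
```

A correlation scan like this only flags pairwise candidates in a single dataset; combining evidence across several time series datasets, as the abstract describes, needs additional machinery.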
Bennet, Jaison; Ganaprakasam, Chilambuchelvan Arul; Arputharaj, Kannan
2014-01-01
Historically, cancer classification by doctors and radiologists was based on morphological and clinical features and had limited diagnostic ability. The arrival of DNA microarray technology has enabled the concurrent monitoring of thousands of gene expression levels on a single chip, which has stimulated progress in cancer classification. In this paper, we propose a hybrid approach for microarray data classification based on k-nearest neighbor (KNN), naive Bayes, and support vector machine (SVM) classifiers. Feature selection prior to classification plays a vital role, and a feature selection technique that combines the discrete wavelet transform (DWT) and the moving window technique (MWT) is used. The performance of the proposed method is compared with that of conventional classifiers such as support vector machine, nearest neighbor, and naive Bayes. Experiments have been conducted on both real and benchmark datasets, and the results indicate that the ensemble approach produces higher classification accuracy than the conventional classifiers. The proposed method can serve as an automated system for cancer classification that doctors can apply in real cases. It further reduces the misclassification of cancers, which is highly undesirable in cancer detection.
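The abstract above does not state how the three classifiers are combined. One common way to build such an ensemble is majority voting over per-classifier predictions, sketched below under that assumption; the function name and the label lists are hypothetical, and the base classifiers themselves are not implemented here.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists (all the same length) by majority vote."""
    combined = []
    for votes in zip(*predictions):
        # most_common(1) returns the label with the highest count for this sample.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Hypothetical predictions from three classifiers on three samples.
knn_labels = ["tumor", "normal", "tumor"]
bayes_labels = ["tumor", "tumor", "normal"]
svm_labels = ["normal", "normal", "normal"]
print(majority_vote([knn_labels, bayes_labels, svm_labels]))
```

With three voters and two classes a tie cannot occur; with more classes, a tie-breaking rule (for example, deferring to the most accurate base classifier) would have to be chosen.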
DFP: a Bioconductor package for fuzzy profile identification and gene reduction of microarray data
Glez-Peña, Daniel; Álvarez, Rodrigo; Díaz, Fernando; Fdez-Riverola, Florentino
2009-01-01
Background Expression profiling assays done by using DNA microarray technology generate enormous data sets that are not amenable to simple analysis. The greatest challenge in maximizing the use of this huge amount of data is to develop algorithms to interpret and interconnect results from different genes under different conditions. In this context, fuzzy logic can provide a systematic and unbiased way to both (i) find biologically significant insights relating to meaningful genes, thereby removing the need for expert knowledge in preliminary steps of microarray data analyses and (ii) reduce the cost and complexity of later applied machine learning techniques being able to achieve interpretable models. Results DFP is a new Bioconductor R package that implements a method for discretizing and selecting differentially expressed genes based on the application of fuzzy logic. DFP takes advantage of fuzzy membership functions to assign linguistic labels to gene expression levels. The technique builds a reduced set of relevant genes (FP, Fuzzy Pattern) able to summarize and represent each underlying class (pathology). A last step constructs a biased set of genes (DFP, Discriminant Fuzzy Pattern) by intersecting existing fuzzy patterns in order to detect discriminative elements. In addition, the software provides new functions and visualisation tools that summarize achieved results and aid in the interpretation of differentially expressed genes from multiple microarray experiments. Conclusion DFP integrates with other packages of the Bioconductor project, uses common data structures and is accompanied by ample documentation. It has the advantage that its parameters are highly configurable, facilitating the discovery of biologically relevant connections between sets of genes belonging to different pathologies. This information makes it possible to automatically filter irrelevant genes thereby reducing the large volume of data supplied by microarray experiments. 
Based on these contributions GENECBR, a successful tool for cancer diagnosis using microarray datasets, has recently been released. PMID:19178723
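The idea of assigning linguistic labels to expression levels through fuzzy membership functions can be sketched as follows. This is not the DFP package's API or its actual membership functions; it assumes three piecewise-linear functions anchored at hypothetical `low`, `mid`, and `high` centers (with low < mid < high) and picks the label with the highest membership degree.

```python
def ramp(x, start, end):
    """Linear ramp from 0 at `start` to 1 at `end`, clipped to [0, 1]."""
    return min(1.0, max(0.0, (x - start) / (end - start)))


def memberships(value, low, mid, high):
    """Degrees of membership in 'Low', 'Medium', and 'High' for one value."""
    if value <= mid:
        medium = ramp(value, low, mid)
    else:
        medium = 1.0 - ramp(value, mid, high)
    return {
        "Low": 1.0 - ramp(value, low, mid),
        "Medium": medium,
        "High": ramp(value, mid, high),
    }


def linguistic_label(value, low, mid, high):
    """Return the label with the largest membership degree."""
    degrees = memberships(value, low, mid, high)
    return max(degrees, key=degrees.get)


# Hypothetical centers for one gene's expression range.
for expression in (1.0, 6.0, 9.5):
    print(expression, linguistic_label(expression, low=2.0, mid=6.0, high=10.0))
```

Once every gene in every sample carries a label, a class-level pattern can be formed from the genes whose label is consistent within that class, which is the role the abstract assigns to the fuzzy pattern.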
Gene ARMADA: an integrated multi-analysis platform for microarray data implemented in MATLAB.
Chatziioannou, Aristotelis; Moulos, Panagiotis; Kolisis, Fragiskos N
2009-10-27
The microarray data analysis realm is ever growing through the development of various tools, open source and commercial. However there is absence of predefined rational algorithmic analysis workflows or batch standardized processing to incorporate all steps, from raw data import up to the derivation of significantly differentially expressed gene lists. This absence obfuscates the analytical procedure and obstructs the massive comparative processing of genomic microarray datasets. Moreover, the solutions provided, heavily depend on the programming skills of the user, whereas in the case of GUI embedded solutions, they do not provide direct support of various raw image analysis formats or a versatile and simultaneously flexible combination of signal processing methods. We describe here Gene ARMADA (Automated Robust MicroArray Data Analysis), a MATLAB implemented platform with a Graphical User Interface. This suite integrates all steps of microarray data analysis including automated data import, noise correction and filtering, normalization, statistical selection of differentially expressed genes, clustering, classification and annotation. In its current version, Gene ARMADA fully supports 2 coloured cDNA and Affymetrix oligonucleotide arrays, plus custom arrays for which experimental details are given in tabular form (Excel spreadsheet, comma separated values, tab-delimited text formats). It also supports the analysis of already processed results through its versatile import editor. Besides being fully automated, Gene ARMADA incorporates numerous functionalities of the Statistics and Bioinformatics Toolboxes of MATLAB. In addition, it provides numerous visualization and exploration tools plus customizable export data formats for seamless integration by other analysis tools or MATLAB, for further processing. Gene ARMADA requires MATLAB 7.4 (R2007a) or higher and is also distributed as a stand-alone application with MATLAB Component Runtime. 
Gene ARMADA provides a highly adaptable, integrative, yet flexible tool which can be used for automated quality control, analysis, annotation and visualization of microarray data, constituting a starting point for further data interpretation and integration with numerous other tools.
2009-01-01
Background Large discrepancies in signature composition and outcome concordance have been observed between different microarray breast cancer expression profiling studies. This is often ascribed to differences in array platform as well as biological variability. We conjecture that other reasons for the observed discrepancies are the measurement error associated with each feature and the choice of preprocessing method. Microarray data are known to be subject to technical variation and the confidence intervals around individual point estimates of expression levels can be wide. Furthermore, the estimated expression values also vary depending on the selected preprocessing scheme. In microarray breast cancer classification studies, however, these two forms of feature variability are almost always ignored and hence their exact role is unclear. Results We have performed a comprehensive sensitivity analysis of microarray breast cancer classification under the two types of feature variability mentioned above. We used data from six state-of-the-art preprocessing methods, using a compendium consisting of eight different datasets, involving 1131 hybridizations, containing data from both one- and two-color array technology. For a wide range of classifiers, we performed a joint study on performance, concordance and stability. In the stability analysis we explicitly tested classifiers for their noise tolerance by using perturbed expression profiles that are based on uncertainty information directly related to the preprocessing methods. Our results indicate that signature composition is strongly influenced by feature variability, even if the array platform and the stratification of patient samples are identical. In addition, we show that there is often a high level of discordance between individual class assignments for signatures constructed on data coming from different preprocessing schemes, even if the actual signature composition is identical.
Conclusion Feature variability can have a strong impact on breast cancer signature composition, as well as the classification of individual patient samples. We therefore strongly recommend that feature variability is considered in analyzing data from microarray breast cancer expression profiling experiments. PMID:19941644
DFP: a Bioconductor package for fuzzy profile identification and gene reduction of microarray data.
Glez-Peña, Daniel; Alvarez, Rodrigo; Díaz, Fernando; Fdez-Riverola, Florentino
2009-01-29
Expression profiling assays done by using DNA microarray technology generate enormous data sets that are not amenable to simple analysis. The greatest challenge in maximizing the use of this huge amount of data is to develop algorithms to interpret and interconnect results from different genes under different conditions. In this context, fuzzy logic can provide a systematic and unbiased way to both (i) find biologically significant insights relating to meaningful genes, thereby removing the need for expert knowledge in preliminary steps of microarray data analyses and (ii) reduce the cost and complexity of later applied machine learning techniques being able to achieve interpretable models. DFP is a new Bioconductor R package that implements a method for discretizing and selecting differentially expressed genes based on the application of fuzzy logic. DFP takes advantage of fuzzy membership functions to assign linguistic labels to gene expression levels. The technique builds a reduced set of relevant genes (FP, Fuzzy Pattern) able to summarize and represent each underlying class (pathology). A last step constructs a biased set of genes (DFP, Discriminant Fuzzy Pattern) by intersecting existing fuzzy patterns in order to detect discriminative elements. In addition, the software provides new functions and visualisation tools that summarize achieved results and aid in the interpretation of differentially expressed genes from multiple microarray experiments. DFP integrates with other packages of the Bioconductor project, uses common data structures and is accompanied by ample documentation. It has the advantage that its parameters are highly configurable, facilitating the discovery of biologically relevant connections between sets of genes belonging to different pathologies. This information makes it possible to automatically filter irrelevant genes thereby reducing the large volume of data supplied by microarray experiments. 
Based on these contributions GENECBR, a successful tool for cancer diagnosis using microarray datasets, has recently been released.
Giancarlo, R; Scaturro, D; Utro, F
2015-02-01
The prediction of the number of clusters in a dataset, in particular microarrays, is a fundamental task in biological data analysis, usually performed via validation measures. Unfortunately, it has received very little attention and in fact there is a growing need for software tools/libraries dedicated to it. Here we present ValWorkBench, a software library consisting of eleven well known validation measures, together with novel heuristic approximations for some of them. The main objective of this paper is to provide the interested researcher with the full software documentation of an open source cluster validation platform having the main features of being easily extendible in a homogeneous way and of offering software components that can be readily re-used. Consequently, the focus of the presentation is on the architecture of the library, since it provides an essential map that can be used to access the full software documentation, which is available at the supplementary material website [1]. The mentioned main features of ValWorkBench are also discussed and exemplified, with emphasis on software abstraction design and re-usability. A comparison with existing cluster validation software libraries, mainly in terms of the mentioned features, is also offered. It suggests that ValWorkBench is a much needed contribution to the microarray software development/algorithm engineering community. For completeness, it is important to mention that previous accurate algorithmic experimental analysis of the relative merits of each of the implemented measures [19,23,25], carried out specifically on microarray data, gives useful insights on the effectiveness of ValWorkBench for cluster validation to researchers in the microarray community interested in its use for the mentioned task. Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.
2010-01-01
Introduction Various multigene predictors of breast cancer clinical outcome have been commercialized, but proved to be prognostic only for hormone receptor (HR) subsets overexpressing estrogen or progesterone receptors. Hormone receptor negative (HRneg) breast cancers, particularly those lacking HER2/ErbB2 overexpression and known as triple-negative (Tneg) cases, are heterogeneous and generally aggressive breast cancer subsets in need of prognostic subclassification, since most early stage HRneg and Tneg breast cancer patients are cured with conservative treatment yet invariably receive aggressive adjuvant chemotherapy. Methods An unbiased search for genes predictive of distant metastatic relapse was undertaken using a training cohort of 199 node-negative, adjuvant treatment naïve HRneg (including 154 Tneg) breast cancer cases curated from three public microarray datasets. Prognostic gene candidates were subsequently validated using a different cohort of 75 node-negative, adjuvant naïve HRneg cases curated from three additional datasets. The HRneg/Tneg gene signature was prognostically compared with eight other previously reported gene signatures, and evaluated for cancer network associations by two commercial pathway analysis programs. Results A novel set of 14 prognostic gene candidates was identified as outcome predictors: CXCL13, CLIC5, RGS4, RPS28, RFX7, EXOC7, HAPLN1, ZNF3, SSX3, HRBL, PRRG3, ABO, PRTN3, MATN1. A composite HRneg/Tneg gene signature index proved more accurate than any individual candidate gene or other reported multigene predictors in identifying cases likely to remain free of metastatic relapse. Significant positive correlations between the HRneg/Tneg index and three independent immune-related signatures (STAT1, IFN, and IR) were observed, as were consistent negative associations between the three immune-related signatures and five other proliferation module-containing signatures (MS-14, ONCO-RS, GGI, CSR/wound and NKI-70). 
Network analysis identified 8 genes within the HRneg/Tneg signature as being functionally linked to immune/inflammatory chemokine regulation. Conclusions A multigene HRneg/Tneg signature linked to immune/inflammatory cytokine regulation was identified from pooled expression microarray data and shown to be superior to other reported gene signatures in predicting the metastatic outcome of early stage and conservatively managed HRneg and Tneg breast cancer. Further validation of this prognostic signature may lead to new therapeutic insights and spare many newly diagnosed breast cancer patients the need for aggressive adjuvant chemotherapy. PMID:20946665
flowVS: channel-specific variance stabilization in flow cytometry.
Azad, Ariful; Rajwa, Bartek; Pothen, Alex
2016-07-28
Comparing phenotypes of heterogeneous cell populations from multiple biological conditions is at the heart of scientific discovery based on flow cytometry (FC). When the biological signal is measured by the average expression of a biomarker, standard statistical methods require that variance be approximately stabilized in populations to be compared. Since the mean and variance of a cell population are often correlated in fluorescence-based FC measurements, a preprocessing step is needed to stabilize the within-population variances. We present a variance-stabilization algorithm, called flowVS, that removes the mean-variance correlations from cell populations identified in each fluorescence channel. flowVS transforms each channel from all samples of a data set by the inverse hyperbolic sine (asinh) transformation. For each channel, the parameters of the transformation are optimally selected by Bartlett's likelihood-ratio test so that the populations attain homogeneous variances. The optimum parameters are then used to transform the corresponding channels in every sample. flowVS is therefore an explicit variance-stabilization method that stabilizes within-population variances in each channel by evaluating the homoskedasticity of clusters with a likelihood-ratio test. With two publicly available datasets, we show that flowVS removes the mean-variance dependence from raw FC data and makes the within-population variance relatively homogeneous. We demonstrate that alternative transformation techniques such as flowTrans, flowScape, logicle, and FCSTrans might not stabilize variance. Besides flow cytometry, flowVS can also be applied to stabilize variance in microarray data. With a publicly available data set we demonstrate that flowVS performs as well as the VSN software, a state-of-the-art approach developed for microarrays. 
The homogeneity of variance in cell populations across FC samples is desirable when extracting features uniformly and comparing cell populations with different levels of marker expressions. The newly developed flowVS algorithm solves the variance-stabilization problem in FC and microarrays by optimally transforming data with the help of Bartlett's likelihood-ratio test. On two publicly available FC datasets, flowVS stabilizes within-population variances more evenly than the available transformation and normalization techniques. flowVS-based variance stabilization can help in performing comparison and alignment of phenotypically identical cell populations across different samples. flowVS and the datasets used in this paper are publicly available in Bioconductor.
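The selection step described above can be illustrated with a minimal sketch. It is not the flowVS implementation: it assumes the cell populations in a channel have already been identified, uses a hypothetical grid of candidate cofactors, and simply picks the cofactor c for which asinh(x / c) gives the smallest Bartlett statistic. The function names and the synthetic data are assumptions made for illustration.

```python
import math
import random


def sample_variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def bartlett_statistic(groups):
    """Bartlett's statistic; small values indicate similar group variances."""
    k = len(groups)
    dof = [len(g) - 1 for g in groups]
    total_dof = sum(dof)
    variances = [sample_variance(g) for g in groups]
    pooled = sum(d * v for d, v in zip(dof, variances)) / total_dof
    numerator = total_dof * math.log(pooled) - sum(
        d * math.log(v) for d, v in zip(dof, variances))
    correction = 1 + (sum(1 / d for d in dof) - 1 / total_dof) / (3 * (k - 1))
    return numerator / correction


def transform(populations, cofactor):
    """Apply asinh(x / cofactor) to every event of every population."""
    return [[math.asinh(x / cofactor) for x in pop] for pop in populations]


def best_cofactor(populations, candidates):
    """Return the candidate cofactor that minimizes Bartlett's statistic."""
    return min(candidates,
               key=lambda c: bartlett_statistic(transform(populations, c)))


# Synthetic channel (illustrative only): three populations whose spread
# grows with their mean, mimicking a mean-variance dependence.
rng = random.Random(1)
populations = [
    [rng.gauss(mean, math.hypot(20, 0.1 * mean)) for _ in range(200)]
    for mean in (50, 500, 5000)
]
cofactor = best_cofactor(populations, [1, 10, 50, 200, 1000, 10000])
raw_stat = bartlett_statistic(populations)
stabilized_stat = bartlett_statistic(transform(populations, cofactor))
print(cofactor, round(raw_stat, 1), round(stabilized_stat, 1))
```

A grid search keeps the sketch short; a continuous one-dimensional optimizer over the cofactor would be a natural alternative.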
MiMiR – an integrated platform for microarray data sharing, mining and analysis
Tomlinson, Chris; Thimma, Manjula; Alexandrakis, Stelios; Castillo, Tito; Dennis, Jayne L; Brooks, Anthony; Bradley, Thomas; Turnbull, Carly; Blaveri, Ekaterini; Barton, Geraint; Chiba, Norie; Maratou, Klio; Soutter, Pat; Aitman, Tim; Game, Laurence
2008-01-01
Background Despite considerable efforts within the microarray community for standardising data format, content and description, microarray technologies present major challenges in managing, sharing, analysing and re-using the large amount of data generated locally or internationally. Additionally, it is recognised that inconsistent and low quality experimental annotation in public data repositories significantly compromises the re-use of microarray data for meta-analysis. MiMiR, the Microarray data Mining Resource was designed to tackle some of these limitations and challenges. Here we present new software components and enhancements to the original infrastructure that increase accessibility, utility and opportunities for large scale mining of experimental and clinical data. Results A user friendly Online Annotation Tool allows researchers to submit detailed experimental information via the web at the time of data generation rather than at the time of publication. This ensures the easy access and high accuracy of meta-data collected. Experiments are programmatically built in the MiMiR database from the submitted information and details are systematically curated and further annotated by a team of trained annotators using a new Curation and Annotation Tool. Clinical information can be annotated and coded with a clinical Data Mapping Tool within an appropriate ethical framework. Users can visualise experimental annotation, assess data quality, download and share data via a web-based experiment browser called MiMiR Online. All requests to access data in MiMiR are routed through a sophisticated middleware security layer thereby allowing secure data access and sharing amongst MiMiR registered users prior to publication. Data in MiMiR can be mined and analysed using the integrated EMAAS open source analysis web portal or via export of data and meta-data into Rosetta Resolver data analysis package. 
Conclusion The new MiMiR suite of software enables systematic and effective capture of extensive experimental and clinical information with the highest MIAME score, and secure data sharing prior to publication. MiMiR currently contains more than 150 experiments corresponding to over 3000 hybridisations and supports the Microarray Centre's large microarray user community and two international consortia. The MiMiR flexible and scalable hardware and software architecture enables secure warehousing of thousands of datasets, including clinical studies, from microarray and potentially other -omics technologies. PMID:18801157
An immune-related lncRNA signature for patients with anaplastic gliomas.
Wang, Wen; Zhao, Zheng; Yang, Fan; Wang, Haoyuan; Wu, Fan; Liang, Tingyu; Yan, Xiaoyan; Li, Jiye; Lan, Qing; Wang, Jiangfei; Zhao, Jizong
2018-01-01
We investigated immune-related long non-coding RNAs (lncRNAs) that may be exploited as potential therapeutic targets in anaplastic gliomas. We obtained 572 lncRNAs and 317 immune genes from the Chinese Glioma Genome Atlas microarray and constructed immune-related lncRNA co-expression networks to identify immune-related lncRNAs. Two additional datasets (GSE16011, REMBRANDT) were used for validation. Gene set enrichment analysis and principal component analysis were used for functional annotation. A signature of nine immune-related lncRNAs (SNHG8, PGM5-AS1, ST20-AS1, LINC00937, AGAP2-AS1, MIR155HG, TUG1, MAPKAPK5-AS1, and HCG18) was identified in patients with anaplastic gliomas. Patients in the low-risk group showed longer overall survival (OS) and progression-free survival than those in the high-risk group (P < 0.0001; P < 0.0001). Additionally, patients in the high-risk group displayed no deletion of chromosomal arms 1p and/or 19q, isocitrate dehydrogenase wild-type status, classical and mesenchymal TCGA subtypes, G3 CGGA subtype, and lower Karnofsky performance score (KPS). Moreover, the signature was an independent prognostic factor and was significantly associated with OS (P = 0.000, hazard ratio (HR) = 1.434). These findings were further validated in the two additional datasets (GSE16011, REMBRANDT). Low-risk and high-risk groups displayed different immune status based on principal component analysis. Our results showed that the nine-lncRNA immune-related signature has prognostic value for anaplastic gliomas.
Paraboschi, Elvezia Maria; Cardamone, Giulia; Rimoldi, Valeria; Gemmati, Donato; Spreafico, Marta; Duga, Stefano; Soldà, Giulia; Asselta, Rosanna
2015-09-30
Abnormalities in RNA metabolism and alternative splicing (AS) are emerging as important players in complex disease phenotypes. In particular, accumulating evidence suggests the existence of pathogenic links between multiple sclerosis (MS) and altered AS, including functional studies showing that an imbalance in alternatively-spliced isoforms may contribute to disease etiology. Here, we tested whether the altered expression of AS-related genes represents an MS-specific signature. A comprehensive comparative analysis of gene expression profiles of publicly-available microarray datasets (190 MS cases, 182 controls), followed by gene-ontology enrichment analysis, highlighted a significant enrichment for differentially-expressed genes involved in RNA metabolism/AS. In detail, a total of 17 genes were found to be differentially expressed in MS in multiple datasets, with CELF1 being dysregulated in five out of seven studies. We confirmed CELF1 downregulation in MS (p=0.0015) by real-time RT-PCR on RNA extracted from blood cells of 30 cases and 30 controls. As a proof of concept, we experimentally verified the imbalance in alternatively-spliced isoforms in MS of the NFAT5 gene, a putative CELF1 target. In conclusion, for the first time we provide evidence of a consistent dysregulation of splicing-related genes in MS and we discuss its possible implications in modulating specific AS events in MS susceptibility genes.
Gerns Storey, Helen L; Richardson, Barbra A; Singa, Benson; Naulikha, Jackie; Prindle, Vivian C; Diaz-Ochoa, Vladimir E; Felgner, Phil L; Camerini, David; Horton, Helen; John-Stewart, Grace; Walson, Judd L
2014-01-01
The role of HIV-1-specific antibody responses in HIV disease progression is complex and would benefit from analysis techniques that examine clusterings of responses. Protein microarray platforms facilitate the simultaneous evaluation of numerous protein-specific antibody responses, though excessive data are cumbersome in analyses. Principal components analysis (PCA) reduces data dimensionality by generating fewer composite variables that maximally account for variance in a dataset. To identify clusters of antibody responses involved in disease control, we investigated the association of HIV-1-specific antibody responses by protein microarray, and assessed their association with disease progression using PCA in a nested cohort design. Associations observed among collections of antibody responses paralleled protein-specific responses. At baseline, greater antibody responses to the transmembrane glycoprotein (TM) and reverse transcriptase (RT) were associated with higher viral loads, while responses to the surface glycoprotein (SU), capsid (CA), matrix (MA), and integrase (IN) proteins were associated with lower viral loads. Over 12 months greater antibody responses were associated with smaller decreases in CD4 count (CA, MA, IN), and reduced likelihood of disease progression (CA, IN). PCA and protein microarray analyses highlighted a collection of HIV-specific antibody responses that together were associated with reduced disease progression, and may not have been identified by examining individual antibody responses. This technique may be useful to explore multifaceted host-disease interactions, such as HIV coinfections.
Benschop, Corina C G; Quaak, Frederike C A; Boon, Mathilde E; Sijen, Titia; Kuiper, Irene
2012-03-01
Forensic analysis of biological traces generally encompasses the investigation of both the person who contributed to the trace and the body site(s) from which the trace originates. For instance, for sexual assault cases, it can be beneficial to distinguish vaginal samples from skin or saliva samples. In this study, we explored the use of microbial flora to indicate vaginal origin. First, we explored the vaginal microbiome for a large set of clinical vaginal samples (n = 240) by next generation sequencing (n = 338,184 sequence reads) and found 1,619 different sequences. Next, we selected 389 candidate probes targeting genera or species and designed a microarray, with which we analysed a diverse set of samples; 43 DNA extracts from vaginal samples and 25 DNA extracts from samples from other body sites, including sites in close proximity of or in contact with the vagina. Finally, we used the microarray results and next generation sequencing dataset to assess the potential for a future approach that uses microbial markers to indicate vaginal origin. Since no candidate genera/species were found to positively identify all vaginal DNA extracts on their own, while excluding all non-vaginal DNA extracts, we deduce that a reliable statement about the cellular origin of a biological trace should be based on the detection of multiple species within various genera. Microarray analysis of a sample will then render a microbial flora pattern that is probably best analysed in a probabilistic approach.
LS Bound based gene selection for DNA microarray data.
Zhou, Xin; Mao, K Z
2005-04-15
One problem with discriminant analysis of DNA microarray data is that each sample is represented by quite a large number of genes, and many of them are irrelevant, insignificant or redundant to the discriminant problem at hand. Methods for selecting important genes are, therefore, of much significance in microarray data analysis. In the present study, a new criterion, called the LS Bound measure, is proposed to address the gene selection problem. The LS Bound measure is derived from the leave-one-out procedure of LS-SVMs (least squares support vector machines), and as the upper bound for leave-one-out classification results it reflects to some extent the generalization performance of gene subsets. We applied this LS Bound measure for gene selection on two benchmark microarray datasets: colon cancer and leukemia. We also compared the LS Bound measure with other evaluation criteria, including the well-known Fisher's ratio and Mahalanobis class separability measure, and other published gene selection algorithms, including Weighting factor and SVM Recursive Feature Elimination. The strength of the LS Bound measure is that it provides gene subsets leading to more accurate classification results than the filter method while its computational complexity is at the level of the filter method. A companion website can be accessed at http://www.ntu.edu.sg/home5/pg02776030/lsbound/. The website contains: (1) the source code of the gene selection algorithm; (2) the complete set of tables and figures regarding the experimental study; (3) proof of the inequality (9).
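For orientation, the Fisher's ratio baseline named above can be sketched in a few lines. This is the common two-class form, (difference of class means)^2 divided by the sum of class variances, and is not the LS Bound measure itself, whose definition is given in the paper. The function names and toy values are hypothetical.

```python
from statistics import mean, variance


def fisher_ratio(class_a, class_b):
    """Two-class Fisher's ratio for one gene (sample variances)."""
    return (mean(class_a) - mean(class_b)) ** 2 / (
        variance(class_a) + variance(class_b))


def rank_genes(expr_a, expr_b):
    """Rank gene indices from most to least discriminative.

    expr_a[g] and expr_b[g] hold gene g's expression values in the
    samples of class A and class B respectively.
    """
    scores = [fisher_ratio(a, b) for a, b in zip(expr_a, expr_b)]
    return sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)


# Toy data (made up): gene 0 overlaps between classes, gene 1 separates them.
expr_a = [[1.0, 2.0, 3.0], [1.0, 1.2, 0.8]]
expr_b = [[1.5, 2.5, 3.5], [5.0, 5.2, 4.8]]
ranking = rank_genes(expr_a, expr_b)
print(ranking)
```

A filter of this kind scores each gene independently, which is why it is cheap but blind to redundancy among the selected genes.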
Reboiro-Jato, Miguel; Arrais, Joel P; Oliveira, José Luis; Fdez-Riverola, Florentino
2014-01-30
The diagnosis and prognosis of several diseases can be shortened through the use of different large-scale genome experiments. In this context, microarrays can generate expression data for a huge set of genes. However, to obtain solid statistical evidence from the resulting data, it is necessary to train and to validate many classification techniques in order to find the best discriminative method. This is a time-consuming process that normally depends on intricate statistical tools. geneCommittee is a web-based interactive tool for routinely evaluating the discriminative classification power of custom hypotheses in the form of biologically relevant gene sets. While the user can work with different gene set collections and several microarray data files to configure specific classification experiments, the tool is able to run several tests in parallel. Provided with a straightforward and intuitive interface, geneCommittee is able to render valuable information for diagnostic analyses and clinical management decisions based on systematically evaluating custom hypotheses over different data sets using complementary classifiers, a key aspect in clinical research. geneCommittee allows the enrichment of raw microarray data with gene functional annotations, producing integrated datasets that simplify the construction of better discriminative hypotheses, and allows the creation of a set of complementary classifiers. The trained committees can then be used for clinical research and diagnosis. Full documentation including common use cases and guided analysis workflows is freely available at http://sing.ei.uvigo.es/GC/.
A sub-space greedy search method for efficient Bayesian Network inference.
Zhang, Qing; Cao, Yong; Li, Yong; Zhu, Yanming; Sun, Samuel S M; Guo, Dianjing
2011-09-01
Bayesian networks (BNs) have been successfully used to infer the regulatory relationships of genes from microarray datasets. However, one major limitation of the BN approach is its computational cost, because the calculation time grows more than exponentially with the dimension of the dataset. In this paper, we propose a sub-space greedy search method for efficient Bayesian network inference. In particular, this method limits the greedy search space by selecting only gene pairs with higher partial correlation coefficients. Using both synthetic and real data, we demonstrate that the proposed method achieved results comparable to those of the standard greedy search method yet saved ∼50% of the computational time. We believe that the sub-space search method can be widely used for efficient BN inference in systems biology.
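The pre-filtering idea can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: partial correlations are taken from the inverse covariance matrix (which requires more samples than genes), and a hypothetical fixed threshold decides which gene pairs remain candidates for the greedy search. All function names and the synthetic chain X -> Y -> Z are assumptions.

```python
import math
import random
from itertools import combinations


def covariance_matrix(columns):
    """Sample covariance matrix; each column is one gene's expression profile."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    return [[sum((a - ma) * (b - mb) for a, b in zip(ca, cb)) / (n - 1)
             for cb, mb in zip(columns, means)]
            for ca, ma in zip(columns, means)]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    size = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(size)]
           for i, row in enumerate(matrix)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(size):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[size:] for row in aug]


def candidate_pairs(columns, threshold):
    """Keep gene pairs whose absolute partial correlation reaches the threshold."""
    precision = invert(covariance_matrix(columns))
    pairs = []
    for i, j in combinations(range(len(columns)), 2):
        pcor = -precision[i][j] / math.sqrt(precision[i][i] * precision[j][j])
        if abs(pcor) >= threshold:
            pairs.append((i, j))
    return pairs


# Synthetic chain X -> Y -> Z: X and Z are correlated only through Y.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(500)]
y = [a + rng.gauss(0, 1) for a in x]
z = [b + rng.gauss(0, 1) for b in y]
pairs = candidate_pairs([x, y, z], threshold=0.3)
print(pairs)
```

In this toy chain the indirect pair (X, Z) is expected to fall below the threshold because its partial correlation given Y is close to zero, so a structure search restricted to the surviving pairs has fewer edges to score.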
A whole blood gene expression-based signature for smoking status
2012-01-01
Background Smoking is the leading cause of preventable death worldwide and has been shown to increase the risk of multiple diseases including coronary artery disease (CAD). We sought to identify genes whose levels of expression in whole blood correlate with self-reported smoking status. Methods Microarrays were used to identify gene expression changes in whole blood which correlated with self-reported smoking status; a set of significant genes from the microarray analysis were validated by qRT-PCR in an independent set of subjects. Stepwise forward logistic regression was performed using the qRT-PCR data to create a predictive model whose performance was validated in an independent set of subjects and compared to cotinine, a nicotine metabolite. Results Microarray analysis of whole blood RNA from 209 PREDICT subjects (41 current smokers, 4 quit ≤ 2 months, 64 quit > 2 months, 100 never smoked; NCT00500617) identified 4214 genes significantly correlated with self-reported smoking status. qRT-PCR was performed on 1,071 PREDICT subjects across 256 microarray genes significantly correlated with smoking or CAD. A five gene (CLDND1, LRRN3, MUC1, GOPC, LEF1) predictive model, derived from the qRT-PCR data using stepwise forward logistic regression, had a cross-validated mean AUC of 0.93 (sensitivity=0.78; specificity=0.95), and was validated using 180 independent PREDICT subjects (AUC=0.82, CI 0.69-0.94; sensitivity=0.63; specificity=0.94). Plasma from the 180 validation subjects was used to assess levels of cotinine; a model using a threshold of 10 ng/ml cotinine resulted in an AUC of 0.89 (CI 0.81-0.97; sensitivity=0.81; specificity=0.97; kappa with expression model = 0.53). Conclusion We have constructed and validated a whole blood gene expression score for the evaluation of smoking status, demonstrating that clinical and environmental factors contributing to cardiovascular disease risk can be assessed by gene expression. PMID:23210427
Cancer Detection in Microarray Data Using a Modified Cat Swarm Optimization Clustering Approach
M, Pandi; R, Balamurugan; N, Sadhasivam
2017-12-29
Objective: A better understanding of functional genomics can be obtained by extracting patterns hidden in gene expression data. This could have paramount implications for cancer diagnosis, gene treatments and other domains. Clustering may reveal natural structures and identify interesting patterns in underlying data. The main objective of this research was to derive a heuristic approach to detection of highly co-expressed genes related to cancer from gene expression data with minimum Mean Squared Error (MSE). Methods: A modified Cat Swarm Optimization (CSO) algorithm using Harmony Search (MCSO-HS) was applied to cluster cancer gene expression data. Experimental results were analyzed using two cancer gene expression benchmark datasets, one for leukaemia and one for breast cancer. Result: In terms of MSE, MCSO-HS was better than HS and CSO by 13% and 9%, respectively, on the leukaemia dataset, and by 22% and 17%, respectively, on the breast cancer dataset. Conclusion: The results showed MCSO-HS to outperform HS and CSO with both benchmark datasets. To validate the clustering results, this work was tested with internal and external cluster validation indices. This work also points to biological validation of clusters with gene ontology in terms of function, process and component.
On the relevance of glycolysis process on brain gliomas.
Kounelakis, M G; Zervakis, M E; Giakos, G C; Postma, G J; Buydens, L M C; Kotsiakis, X
2013-01-01
The proposed analysis considers aspects of both statistical and biological validation of the glycolysis effect on brain gliomas, at both the genomic and the metabolic level. In particular, two independent datasets are analyzed in parallel, one engaging genomic (Microarray Expression) data and the other metabolomic (Magnetic Resonance Spectroscopy Imaging) data. The aim of this study is twofold. First, to show that, apart from the already studied genes (markers), other genes such as those involved in human cell glycolysis contribute significantly to glioma discrimination. Second, to demonstrate how the glycolysis process can open new ways towards the design of patient-specific therapeutic protocols. The results of our analysis demonstrate that the combination of genes participating in the glycolytic process (ALDOA, ALDOC, ENO2, GAPDH, HK2, LDHA, LDHB, MDH1, PDHB, PFKM, PGI, PGK1, PGM1 and PKLR) with the already known tumor suppressors (PTEN, Rb, TP53), oncogenes (CDK4, EGFR, PDGF) and HIF-1 enhances the discrimination of low- versus high-grade gliomas, providing high prediction ability in a cross-validated framework. Following these results and supported by the biological effect of glycolytic genes on cancer cells, we address the study of glycolysis for the development of new treatment protocols.
White-Al Habeeb, Nicole M A; Ho, Linh T; Olkhov-Mitsel, Ekaterina; Kron, Ken; Pethe, Vaijayanti; Lehman, Melanie; Jovanovic, Lidija; Fleshner, Neil; van der Kwast, Theodorus; Nelson, Colleen C; Bapat, Bharati
2014-09-15
Epigenetic silencing mediated by CpG methylation is a common feature of many cancers. Characterizing aberrant DNA methylation changes associated with tumor progression may identify potential prognostic markers for prostate cancer (PCa). We treated two PCa cell lines, 22Rv1 and DU-145, with the demethylating agent 5-aza-2'-deoxycytidine (DAC), and global methylation status was analyzed using a methylation-sensitive restriction enzyme based differential methylation hybridization strategy followed by genome-wide CpG methylation array profiling. In addition, we examined gene expression changes using a custom microarray. Gene Set Enrichment Analysis (GSEA) identified the most significantly dysregulated pathways. We also assessed the methylation status of candidate genes that showed reduced CpG methylation and increased gene expression after DAC treatment in Gleason score (GS) 8 vs. GS6 patients, using three independent cohorts: the publicly available The Cancer Genome Atlas (TCGA) dataset and two separate patient cohorts. Our analysis, by integrating methylation and gene expression in PCa cell lines, combined with patient tumor data, identified novel potential biomarkers for PCa patients. These markers may help elucidate the pathogenesis of PCa and represent potential prognostic markers for PCa patients.
Various Cmap analyses within and across species and microarray platforms conducted and summarized to generate the tables in the publication. This dataset is associated with the following publication: Wang, R., A. Biales, N. Garcia-Reyero, E. Perkins, D. Villeneuve, G. Ankley, and D. Bencic. Fish Connectivity Mapping: Linking Chemical Stressors by Their MOA-Driven Transcriptomic Profiles. BMC Genomics. BioMed Central Ltd, London, UK, 17(84): 1-20, (2016).
Yi, Ming; Mudunuri, Uma; Che, Anney; Stephens, Robert M
2009-06-29
One of the challenges in the analysis of microarray data is to integrate and compare the selected (e.g., differential) gene lists from multiple experiments for common or unique underlying biological themes. A common way to approach this problem is to extract common genes from these gene lists and then subject these genes to enrichment analysis to reveal the underlying biology. However, the capacity of this approach is largely restricted by the limited number of common genes shared by datasets from multiple experiments, which could be caused by the complexity of the biological system itself. We now introduce a new Pathway Pattern Extraction Pipeline (PPEP), which extends the existing WPS application by providing a new pathway-level comparative analysis scheme. To facilitate comparing and correlating results from different studies and sources, PPEP contains new interfaces that allow evaluation of the pathway-level enrichment patterns across multiple gene lists. As an exploratory tool, this analysis pipeline may help reveal the underlying biological themes at both the pathway and gene levels. The analysis scheme provided by PPEP begins with multiple gene lists, which may be derived from different studies in terms of the biological contexts, applied technologies, or methodologies. These lists are then subjected to pathway-level comparative analysis for extraction of pathway-level patterns. This analysis pipeline helps to explore the commonality or uniqueness of these lists at the level of pathways or biological processes from different but relevant biological systems using a combination of statistical enrichment measurements, pathway-level pattern extraction, and graphical display of the relationships of genes and their associated pathways as Gene-Term Association Networks (GTANs) within the WPS platform. As a proof of concept, we have used the new method to analyze many datasets from our collaborators as well as some public microarray datasets. 
This tool provides a new pathway-level analysis scheme for integrative and comparative analysis of data derived from different but relevant systems. The tool is freely available as a Pathway Pattern Extraction Pipeline implemented in our existing software package WPS, which can be obtained at http://www.abcc.ncifcrf.gov/wps/wps_index.php.
NASA Astrophysics Data System (ADS)
Wisesty, Untari N.; Warastri, Riris S.; Puspitasari, Shinta Y.
2018-03-01
Cancer is one of the major causes of morbidity and mortality worldwide. There is therefore a need for a system that can analyze microarray data derived from a patient's deoxyribonucleic acid (DNA) and identify whether that person has cancer. Microarray data, however, have thousands of attributes, which makes processing them challenging; this is often referred to as the curse of dimensionality. In this study we built a system capable of detecting whether or not a patient has cancer. A Genetic Algorithm is used for feature selection and a Momentum Backpropagation Neural Network for classification, with data taken from the Kent Ridge Bio-medical Dataset. In testing, the system detected leukemia and colon tumor with a best accuracy of 98.33% for the colon tumor data and 100% for the leukemia data. Using the Genetic Algorithm for feature selection improved accuracy from 64.52% to 98.33% for the colon tumor data and from 65.28% to 100% for the leukemia data, and the momentum parameter accelerated convergence during neural network training.
Framework for Parallel Preprocessing of Microarray Data Using Hadoop
2018-01-01
Nowadays, microarray technology has become one of the popular ways to study gene expression and diagnose disease. The National Center for Biotechnology Information (NCBI) hosts public databases containing large volumes of biological data that must be preprocessed, since they carry high levels of noise and bias. Robust Multiarray Average (RMA) is one of the standard and popular methods used to preprocess the data and remove noise. Most preprocessing algorithms are time-consuming and unable to handle a large number of datasets with thousands of experiments. Parallel processing can be used to address these issues. Hadoop is a well-known distributed file system framework that provides a parallel environment in which to run the experiment. In this research, for the first time, the capability of Hadoop and the statistical power of R have been leveraged to parallelize the existing RMA preprocessing algorithm so that microarray data can be processed efficiently. The experiment was run on a cluster of 5 nodes, each with 16 cores and 16 GB of memory. It compares the efficiency and performance of parallelized RMA using Hadoop with parallelized RMA using the affyPara package as well as sequential RMA. The results show that the speed-up of the proposed approach exceeds that of both the sequential approach and the affyPara approach. PMID:29796018
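RMA consists of background correction, quantile normalization, and median-polish summarization. The sketch below covers only the quantile normalization step, in plain Python rather than R or Hadoop, to show what is being computed. The function name `quantile_normalize` is hypothetical, and ties are broken by position rather than averaged, which is a simplification.

```python
def quantile_normalize(arrays):
    """Force every array to share the same empirical distribution.

    arrays: list of equal-length lists, one list of probe intensities
    per microarray.
    """
    n_probes = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [
        sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n_probes)
    ]
    normalized = []
    for a in arrays:
        # Probe indices ordered from smallest to largest intensity.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = reference[rank]
        normalized.append(out)
    return normalized
```

Sorting and rank assignment are independent per array, while the reference distribution needs one reduction across arrays. That split is one plausible reason the step suits a map/reduce style of execution, though the abstract does not describe how the authors partitioned the work.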
Liao, J. G.; Mcmurry, Timothy; Berg, Arthur
2014-01-01
Empirical Bayes methods have been extensively used for microarray data analysis by modeling the large number of unknown parameters as random effects. Empirical Bayes allows borrowing information across genes and can automatically adjust for multiple testing and selection bias. However, the standard empirical Bayes model can perform poorly if the assumed working prior deviates from the true prior. This paper proposes a new rank-conditioned inference in which the shrinkage and confidence intervals are based on the distribution of the error conditioned on rank of the data. Our approach is in contrast to a Bayesian posterior, which conditions on the data themselves. The new method is almost as efficient as standard Bayesian methods when the working prior is close to the true prior, and it is much more robust when the working prior is not close. In addition, it allows a more accurate (but also more complex) non-parametric estimate of the prior to be easily incorporated, resulting in improved inference. The new method’s prior robustness is demonstrated via simulation experiments. Application to a breast cancer gene expression microarray dataset is presented. Our R package rank.Shrinkage provides a ready-to-use implementation of the proposed methodology. PMID:23934072
Brodsky, Leonid; Leontovich, Andrei; Shtutman, Michael; Feinstein, Elena
2004-01-01
Mathematical methods of analysis of microarray hybridizations deal with gene expression profiles as elementary units. However, some of these profiles do not reflect a biologically relevant transcriptional response, but rather stem from technical artifacts. Here, we describe two technically independent but rationally interconnected methods for identification of such artifactual profiles. Our diagnostics are based on detection of deviations from uniformity, which is assumed as the main underlying principle of microarray design. Method 1 is based on detection of non-uniformity of microarray distribution of printed genes that are clustered based on the similarity of their expression profiles. Method 2 is based on evaluation of the presence of gene-specific microarray spots within the slides’ areas characterized by an abnormal concentration of low/high differential expression values, which we define as ‘patterns of differentials’. Applying two novel algorithms, for nested clustering (method 1) and for pattern detection (method 2), we can make a dual estimation of the profile’s quality for almost every printed gene. Genes with artifactual profiles detected by method 1 may then be removed from further analysis. Suspicious differential expression values detected by method 2 may be either removed or weighted according to the probabilities of patterns that cover them, thus diminishing their input in any further data analysis. PMID:14999086
A novel feature extraction approach for microarray data based on multi-algorithm fusion
Jiang, Zhu; Xu, Rong
2015-01-01
Feature extraction is one of the most important and effective methods for reducing dimensionality in data mining, particularly with the emergence of high-dimensional data such as microarray gene expression data. Feature extraction for gene selection mainly serves two purposes. One is to identify certain disease-related genes. The other is to find a compact set of discriminative genes to build a pattern classifier with reduced complexity and improved generalization capabilities. Depending on the purpose of gene selection, two types of feature extraction algorithms, ranking-based and set-based, are employed in microarray gene expression data analysis. In ranking-based feature extraction, features are generally evaluated on an individual basis, without considering inter-relationships between features, while set-based feature extraction evaluates features based on their role in a feature set by taking into account dependencies between features. Like learning methods, feature extraction faces a generalization problem, namely robustness. However, the issue of robustness is often overlooked in feature extraction. To improve the accuracy and robustness of feature extraction for microarray data, a novel approach based on multi-algorithm fusion is proposed. By fusing different types of feature extraction algorithms to select features from the sample set, the proposed approach is able to improve feature extraction performance. The new approach is tested on gene expression datasets including Colon cancer data, CNS data, DLBCL data, and Leukemia data. The test results show that this algorithm performs better than existing solutions. PMID:25780277
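One simple way to fuse several feature extraction algorithms is to combine their gene rankings. The abstract does not give the fusion rule, so the summed-rank (Borda-style) aggregation below is an assumption chosen for illustration, and `fuse_rankings` is a hypothetical name.

```python
def fuse_rankings(score_tables, top_k):
    """Fuse several per-gene score tables into one gene selection.

    score_tables: list of dicts mapping gene -> score, one dict per
    algorithm, where a higher score means a more relevant gene.
    Returns the top_k genes with the smallest summed rank.
    """
    # Only genes scored by every algorithm can be ranked consistently.
    genes = sorted(set.intersection(*(set(t) for t in score_tables)))
    total_rank = {g: 0 for g in genes}
    for table in score_tables:
        ordered = sorted(genes, key=lambda g: -table[g])
        for rank, gene in enumerate(ordered, start=1):
            total_rank[gene] += rank
    # Ties on summed rank are broken alphabetically for reproducibility.
    return sorted(genes, key=lambda g: (total_rank[g], g))[:top_k]
```

A gene that one algorithm ranks highly by chance is pulled down by the others, which is the intuition behind expecting a fused selection to be more robust than any single ranking.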
Normal uniform mixture differential gene expression detection for cDNA microarrays
Dean, Nema; Raftery, Adrian E
2005-01-01
Background One of the primary tasks in analysing gene expression data is finding genes that are differentially expressed in different samples. Multiple testing issues due to the thousands of tests run make some of the more popular methods for doing this problematic. Results We propose a simple method, Normal Uniform Differential Gene Expression (NUDGE) detection for finding differentially expressed genes in cDNA microarrays. The method uses a simple univariate normal-uniform mixture model, in combination with new normalization methods for spread as well as mean that extend the lowess normalization of Dudoit, Yang, Callow and Speed (2002) [1]. It takes account of multiple testing, and gives probabilities of differential expression as part of its output. It can be applied to either single-slide or replicated experiments, and it is very fast. Three datasets are analyzed using NUDGE, and the results are compared to those given by other popular methods: unadjusted and Bonferroni-adjusted t tests, Significance Analysis of Microarrays (SAM), and Empirical Bayes for microarrays (EBarrays) with both Gamma-Gamma and Lognormal-Normal models. Conclusion The method gives a high probability of differential expression to genes known/suspected a priori to be differentially expressed and a low probability to the others. In terms of known false positives and false negatives, the method outperforms all multiple-replicate methods except for the Gamma-Gamma EBarrays method to which it offers comparable results with the added advantages of greater simplicity, speed, fewer assumptions and applicability to the single replicate case. An R package called nudge to implement the methods in this paper will be made available soon at . PMID:16011807
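The core of a normal-uniform mixture can be sketched with a short EM loop. This is a bare illustration, not the nudge package: it omits the normalization steps the abstract mentions, it assumes the uniform component spans the observed range of the data, and the names `posterior_de` and `fit_normal_uniform` and the starting value `pi_de=0.1` are hypothetical.

```python
import math


def posterior_de(values, mu, var, pi_de, uniform_density):
    """Posterior probability that each value comes from the uniform
    (differentially expressed) component rather than the normal one."""
    post = []
    for x in values:
        normal = math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(
            2.0 * math.pi * var
        )
        de = pi_de * uniform_density
        null = (1.0 - pi_de) * normal
        post.append(de / (de + null))
    return post


def fit_normal_uniform(values, n_iter=100, pi_de=0.1):
    """Fit the mixture by EM on normalized log ratios.

    Returns (mu, var, pi_de, posterior probabilities of differential
    expression).
    """
    uniform_density = 1.0 / (max(values) - min(values))
    n = len(values)
    mu = sum(values) / n
    var = sum((x - mu) ** 2 for x in values) / n
    for _ in range(n_iter):
        post = posterior_de(values, mu, var, pi_de, uniform_density)
        # M-step: the normal component is re-estimated with weights 1 - post.
        weights = [1.0 - p for p in post]
        total = sum(weights)
        mu = sum(w * x for w, x in zip(weights, values)) / total
        var = sum(w * (x - mu) ** 2 for w, x in zip(weights, values)) / total
        pi_de = sum(post) / n
    return mu, var, pi_de, posterior_de(values, mu, var, pi_de, uniform_density)
```

The posterior probabilities are what the abstract refers to as probabilities of differential expression; genes far from the bulk of the log ratios receive values close to 1.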
Time Series Expression Analyses Using RNA-seq: A Statistical Approach
Oh, Sunghee; Song, Seongho; Grabowski, Gregory; Zhao, Hongyu; Noonan, James P.
2013-01-01
RNA-seq is becoming the de facto standard approach for transcriptome analysis with ever-reducing cost. It has considerable advantages over conventional technologies (microarrays) because it allows for direct identification and quantification of transcripts. Many time series RNA-seq datasets have been collected to study the dynamic regulations of transcripts. However, statistically rigorous and computationally efficient methods are needed to explore the time-dependent changes of gene expression in biological systems. These methods should explicitly account for the dependencies of expression patterns across time points. Here, we discuss several methods that can be applied to model timecourse RNA-seq data, including statistical evolutionary trajectory index (SETI), autoregressive time-lagged regression (AR(1)), and hidden Markov model (HMM) approaches. We use three real datasets and simulation studies to demonstrate the utility of these dynamic methods in temporal analysis. PMID:23586021
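Of the methods listed, the AR(1) model is the simplest to sketch: each time point is regressed on the one before it. The code below is a minimal least-squares fit for a single gene, assuming evenly spaced time points and expression values that are already normalized. The function name `fit_ar1` is hypothetical, and the sketch leaves out the count-based noise modeling an RNA-seq analysis would need.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = intercept + phi * x[t-1] + noise.

    series: expression values of one gene at consecutive time points.
    Returns (intercept, phi).
    """
    previous = series[:-1]
    current = series[1:]
    n = len(previous)
    mean_prev = sum(previous) / n
    mean_curr = sum(current) / n
    sxx = sum((a - mean_prev) ** 2 for a in previous)
    sxy = sum((a - mean_prev) * (b - mean_curr) for a, b in zip(previous, current))
    phi = sxy / sxx
    intercept = mean_curr - phi * mean_prev
    return intercept, phi
```

A value of `phi` near zero suggests little dependence between adjacent time points, while a value near one indicates a slowly changing trajectory.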
Theory of impossible worlds: Toward a physics of information.
Buscema, Paolo Massimo; Sacco, Pier Luigi; Della Torre, Francesca; Massini, Giulia; Breda, Marco; Ferilli, Guido
2018-05-01
In this paper, we introduce an innovative approach to the fusion between datasets in terms of attributes and observations, even when they are not related at all. With our technique, starting from datasets representing independent worlds, it is possible to analyze a single global dataset, and transferring each dataset onto the others is always possible. This procedure allows a deeper perspective in the study of a problem, by offering the chance of looking into it from other, independent points of view. Even unrelated datasets create a metaphoric representation of the problem, useful in terms of speed of convergence and predictive results, preserving the fundamental relationships in the data. In order to extract such knowledge, we propose a new learning rule named double backpropagation, by which an auto-encoder concurrently codifies all the different worlds. We test our methodology on different datasets and different issues, to underline the power and flexibility of the Theory of Impossible Worlds.
Estimation of transformation parameters for microarray data.
Durbin, Blythe; Rocke, David M
2003-07-22
Durbin et al. (2002), Huber et al. (2002) and Munson (2001) independently introduced a family of transformations (the generalized-log family) which stabilizes the variance of microarray data up to the first order. We introduce a method for estimating the transformation parameter in tandem with a linear model based on the procedure outlined in Box and Cox (1964). We also discuss means of finding transformations within the generalized-log family which are optimal under other criteria, such as minimum residual skewness and minimum mean-variance dependency. R and Matlab code and test data are available from the authors on request.
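For reference, a commonly used member of the generalized-log family is h(y) = ln(y + sqrt(y^2 + lambda)). The sketch below implements that form. The parameter name `lam` is a label chosen here, the value is assumed to be already background-corrected, and the estimation of the parameter, which is the subject of the paper, is not shown.

```python
import math


def glog(y, lam):
    """Generalized-log transform ln(y + sqrt(y^2 + lam)), with lam > 0.

    For large y it behaves like ln(2y); near zero it is approximately
    linear, so low-intensity values are not spread out the way a plain
    logarithm would spread them.
    """
    return math.log(y + math.sqrt(y * y + lam))
```

Unlike the plain logarithm, the transform is defined at zero and for negative background-corrected intensities, although for large negative values the expression inside the logarithm becomes numerically delicate.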
DOE Office of Scientific and Technical Information (OSTI.GOV)
He, Fei; Maslov, Sergei; Yoo, Shinjae; ...
2016-05-25
Here, transcriptome datasets from thousands of samples of the model plant Arabidopsis thaliana have been collectively generated by multiple individual labs. Although integration and meta-analysis of these samples has become routine in the plant research community, it is often hampered by the lack of metadata or differences in annotation styles by different labs. In this study, we carefully selected and integrated 6,057 Arabidopsis microarray expression samples from 304 experiments deposited to NCBI GEO. Metadata such as tissue type, growth condition, and developmental stage were manually curated for each sample. We then studied the global expression landscape of the integrated dataset and found that samples of the same tissue tend to be more similar to each other than to samples of other tissues, even in different growth conditions or developmental stages. Root has the most distinct transcriptome compared to aerial tissues, but the transcriptome of cultured root is more similar to those of aerial tissues as the former samples lost their cellular identity. Using a simple computational classification method, we showed that the tissue type of a sample can be successfully predicted based on its expression profile, opening the door for automatic metadata extraction and facilitating re-use of plant transcriptome data. As a proof of principle we applied our automated annotation pipeline to 708 RNA-seq samples from public repositories and verified accuracy of our predictions with samples’ metadata provided by authors.
WholePathwayScope: a comprehensive pathway-based analysis tool for high-throughput data
Yi, Ming; Horton, Jay D; Cohen, Jonathan C; Hobbs, Helen H; Stephens, Robert M
2006-01-01
Background Analysis of High Throughput (HTP) Data such as microarray and proteomics data has provided a powerful methodology to study patterns of gene regulation at genome scale. A major unresolved problem in the post-genomic era is to assemble the large amounts of data generated into a meaningful biological context. We have developed a comprehensive software tool, WholePathwayScope (WPS), for deriving biological insights from analysis of HTP data. Result WPS extracts gene lists with shared biological themes through color cue templates. WPS statistically evaluates global functional category enrichment of gene lists and pathway-level pattern enrichment of data. WPS incorporates well-known biological pathways from KEGG (Kyoto Encyclopedia of Genes and Genomes) and Biocarta, GO (Gene Ontology) terms as well as user-defined pathways or relevant gene clusters or groups, and explores gene-term relationships within the derived gene-term association networks (GTANs). WPS simultaneously compares multiple datasets within biological contexts either as pathways or as association networks. WPS also integrates Genetic Association Database and Partial MedGene Database for disease-association information. We have used this program to analyze and compare microarray and proteomics datasets derived from a variety of biological systems. Application examples demonstrated the capacity of WPS to significantly facilitate the analysis of HTP data for integrative discovery. Conclusion This tool represents a pathway-based platform for discovery integration to maximize analysis power. The tool is freely available at . PMID:16423281
A 15-gene signature for prediction of colon cancer recurrence and prognosis based on SVM.
Xu, Guangru; Zhang, Minghui; Zhu, Hongxing; Xu, Jinhua
2017-03-10
The aim was to screen for a gene signature that distinguishes patients at high risk of colon cancer recurrence from those at low risk and predicts their prognosis. Five microarray datasets of colon cancer samples were collected from the Gene Expression Omnibus database and one was obtained from The Cancer Genome Atlas (TCGA). After preprocessing, data in GSE17537 were analyzed using the Linear Models for Microarray data (LIMMA) method to identify the differentially expressed genes (DEGs). The DEGs further underwent PPI network-based neighborhood scoring and support vector machine (SVM) analyses to screen the feature genes associated with recurrence and prognosis, which were then validated in four datasets (GSE38832, GSE17538, GSE28814 and TCGA) using SVM and Cox regression analyses. A total of 1207 genes were identified as DEGs between recurrence and no-recurrence samples, including 726 downregulated and 481 upregulated genes. Using SVM analysis and confirmation in five gene expression profile datasets, a 15-gene signature (HES5, ZNF417, GLRA2, OR8D2, HOXA7, FABP6, MUSK, HTR6, GRIP2, KLRK1, VEGFA, AKAP12, RHEB, NCRNA00152 and PMEPA1) was identified as a predictor of recurrence risk and prognosis for colon cancer patients. Our identified 15-gene signature may be useful for classifying colon cancer patients with different prognoses, and some genes in this signature may represent new therapeutic targets. Copyright © 2016. Published by Elsevier B.V.
Belciug, Smaranda; Gorunescu, Florin
2018-06-08
Methods based on microarrays (MA), mass spectrometry (MS), and machine learning (ML) algorithms have evolved rapidly in recent years, allowing for early detection of several types of cancer. A pitfall of these approaches, however, is the overfitting of data due to large number of attributes and small number of instances -- a phenomenon known as the 'curse of dimensionality'. A potentially fruitful idea to avoid this drawback is to develop algorithms that combine fast computation with a filtering module for the attributes. The goal of this paper is to propose a statistical strategy to initiate the hidden nodes of a single-hidden layer feedforward neural network (SLFN) by using both the knowledge embedded in data and a filtering mechanism for attribute relevance. In order to attest its feasibility, the proposed model has been tested on five publicly available high-dimensional datasets: breast, lung, colon, and ovarian cancer regarding gene expression and proteomic spectra provided by cDNA arrays, DNA microarray, and MS. The novel algorithm, called adaptive SLFN (aSLFN), has been compared with four major classification algorithms: traditional ELM, radial basis function network (RBF), single-hidden layer feedforward neural network trained by backpropagation algorithm (BP-SLFN), and support vector-machine (SVM). Experimental results showed that the classification performance of aSLFN is competitive with the comparison models. Copyright © 2018. Published by Elsevier Inc.
Aghdam, Rosa; Baghfalaki, Taban; Khosravi, Pegah; Saberi Ansari, Elnaz
2017-12-01
Deciphering important genes and pathways from incomplete gene expression data could facilitate a better understanding of cancer. Different imputation methods can be applied to estimate the missing values. In our study, we evaluated various imputation methods for their performance in preserving significant genes and pathways. In the first step, 5% of genes were chosen at random under two types of missingness mechanism, ignorable and non-ignorable, with various missing rates. Next, 10 well-known imputation methods were applied to the complete datasets. The significance analysis of microarrays (SAM) method was applied to detect the significant genes in rectal and lung cancers to showcase the utility of imputation approaches in preserving significant genes. To determine the impact of different imputation methods on the identification of important genes, the chi-squared test was used to compare the proportions of overlaps between significant genes detected from the original data and those detected from the imputed datasets. Additionally, the significant genes were tested for their enrichment in important pathways using ConsensusPathDB. Our results showed that almost all the significant genes and pathways of the original dataset can be detected in all imputed datasets, indicating that there is no significant difference in the performance of the various imputation methods tested. The source code and selected datasets are available at http://profiles.bs.ipm.ir/softwares/imputation_methods/. Copyright © 2017. Production and hosting by Elsevier B.V.
Currie, Richard A.; Peffer, Richard C.; Goetz, Amber K.; Omiecinski, Curtis J.; Goodman, Jay I.
2014-01-01
Toxicogenomics (TGx) is employed frequently to investigate underlying molecular mechanisms of the compound of interest and, thus, has become an aid to mode of action determination. However, the results and interpretation of a TGx dataset are influenced by the experimental design and methods of analysis employed. This article describes an evaluation and reanalysis, by two independent laboratories, of previously published TGx mouse liver microarray data for a triazole fungicide, propiconazole (PPZ), and the anticonvulsant drug phenobarbital (PB). Propiconazole produced an increased incidence of liver tumors in male CD-1 mice only at a dose that exceeded the maximum tolerated dose (2500 ppm). Firstly, we illustrate how experimental design differences between two in vivo studies with PPZ and PB may impact the comparisons of TGx results. Secondly, we demonstrate that different researchers using different pathway analysis tools can come to different conclusions on specific mechanistic pathways, even when using the same datasets. Finally, despite these differences, the results across three different analyses also show a striking degree of similarity between PPZ- and PB-treated livers when the expression data are viewed as major signaling pathways and cell processes affected. Additional studies described here show that the postulated key event of hepatocellular proliferation was observed in CD-1 mice for both PPZ and PB, and that PPZ is also a potent activator of the mouse CAR nuclear receptor. Thus, with regard to the events which are hallmarks of CAR-induced effects that are key events in the mode of action (MOA) of mouse liver carcinogenesis with PB, PPZ-induced tumors can be viewed as being promoted by a similar PB-like CAR-dependent MOA. PMID:24675475
Gusenleitner, Daniel; Auerbach, Scott S.; Melia, Tisha; Gómez, Harold F.; Sherr, David H.; Monti, Stefano
2014-01-01
Background Despite an overall decrease in incidence of and mortality from cancer, about 40% of Americans will be diagnosed with the disease in their lifetime, and around 20% will die of it. Current approaches to test carcinogenic chemicals adopt the 2-year rodent bioassay, which is costly and time-consuming. As a result, fewer than 2% of the chemicals on the market have actually been tested. However, evidence accumulated to date suggests that gene expression profiles from model organisms exposed to chemical compounds reflect underlying mechanisms of action, and that these toxicogenomic models could be used in the prediction of chemical carcinogenicity. Results In this study, we used a rat-based microarray dataset from the NTP DrugMatrix Database to test the ability of toxicogenomics to model carcinogenicity. We analyzed 1,221 gene-expression profiles obtained from rats treated with 127 well-characterized compounds, including genotoxic and non-genotoxic carcinogens. We built a classifier that predicts a chemical's carcinogenic potential with an AUC of 0.78, and validated it on an independent dataset from the Japanese Toxicogenomics Project consisting of 2,065 profiles from 72 compounds. Finally, we identified differentially expressed genes associated with chemical carcinogenesis, and developed novel data-driven approaches for the molecular characterization of the response to chemical stressors. Conclusion Here, we validate a toxicogenomic approach to predict carcinogenicity and provide strong evidence that, with a larger set of compounds, we should be able to improve the sensitivity and specificity of the predictions. We found that the prediction of carcinogenicity is tissue-dependent and that the results also confirm and expand upon previous studies implicating DNA damage, the peroxisome proliferator-activated receptor, the aryl hydrocarbon receptor, and regenerative pathology in the response to carcinogen exposure. PMID:25058030
Zhang, Dapeng; Xiong, Huiling; Mennigen, Jan A; Popesku, Jason T; Marlatt, Vicki L; Martyniuk, Christopher J; Crump, Kate; Cossins, Andrew R; Xia, Xuhua; Trudeau, Vance L
2009-06-05
Many vertebrates, including the goldfish, exhibit seasonal reproductive rhythms, which are a result of interactions between external environmental stimuli and internal endocrine systems in the hypothalamo-pituitary-gonadal axis. While it has long been believed that differential expression of neuroendocrine genes contributes to establishing seasonal reproductive rhythms, no systems-level investigation has yet been conducted. In the present study, by analyzing multiple female goldfish brain microarray datasets, we have characterized global gene expression patterns for a seasonal cycle. A core set of genes (873 genes) in the hypothalamus was identified as differentially expressed between May, August and December, which correspond to the physiologically distinct stages of sexual maturity (prespawning), sexual regression, and early gonadal redevelopment, respectively. Expression changes of these genes are also shared by another brain region, the telencephalon, as revealed by multivariate analysis. More importantly, by examining one dataset obtained from fish in October that were kept under a long-daylength photoperiod (16 h) typical of the springtime breeding season (May), we observed that the expression of the identified genes appears to be regulated by photoperiod, a major factor controlling vertebrate reproductive cyclicity. Gene ontology analysis revealed that hormone genes and genes functionally involved in the G-protein coupled receptor signaling pathway and transmission of nerve impulses are significantly enriched in an expression pattern whose transition is located between the prespawning and sexually regressed stages. The existence of seasonal expression patterns was verified for several genes including isotocin, ependymin II, GABA(A) gamma2 receptor, calmodulin, and aromatase b by independent samplings of goldfish brains from six seasonal time points and real-time PCR assays.
Using both theoretical and experimental strategies, we report for the first time global gene expression patterns throughout a breeding season which may account for dynamic neuroendocrine regulation of seasonal reproductive development. PMID:19503831
Gene expression inference with deep learning.
Chen, Yifei; Li, Yi; Narayan, Rajiv; Subramanian, Aravind; Xie, Xiaohui
2016-06-15
Large-scale gene expression profiling has been widely used to characterize cellular states in response to various disease conditions, genetic perturbations, etc. Although the cost of whole-genome expression profiles has been dropping steadily, generating a compendium of expression profiling over thousands of samples is still very expensive. Recognizing that gene expressions are often highly correlated, researchers from the NIH LINCS program have developed a cost-effective strategy of profiling only ∼1000 carefully selected landmark genes and relying on computational methods to infer the expression of remaining target genes. However, the computational approach adopted by the LINCS program is currently based on linear regression (LR), limiting its accuracy since it does not capture complex nonlinear relationship between expressions of genes. We present a deep learning method (abbreviated as D-GEX) to infer the expression of target genes from the expression of landmark genes. We used the microarray-based Gene Expression Omnibus dataset, consisting of 111K expression profiles, to train our model and compare its performance to those from other methods. In terms of mean absolute error averaged across all genes, deep learning significantly outperforms LR with 15.33% relative improvement. A gene-wise comparative analysis shows that deep learning achieves lower error than LR in 99.97% of the target genes. We also tested the performance of our learned model on an independent RNA-Seq-based GTEx dataset, which consists of 2921 expression profiles. Deep learning still outperforms LR with 6.57% relative improvement, and achieves lower error in 81.31% of the target genes. D-GEX is available at https://github.com/uci-cbcl/D-GEX CONTACT: xhx@ics.uci.edu Supplementary data are available at Bioinformatics online.
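The two headline metrics in this abstract, mean absolute error and relative improvement over a baseline, can be sketched in a few lines of Python. This is only an illustration: the function names and toy numbers are hypothetical, and defining relative improvement as (baseline error - new error) / baseline error is an assumption, since the abstract does not state the formula.

```python
# Illustrative sketch only; names and toy values are hypothetical, not from D-GEX.

def mean_absolute_error(predicted, observed):
    """Average absolute difference between predicted and observed expression."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


def relative_improvement(baseline_error, new_error):
    """Assumed definition: fraction by which new_error undercuts baseline_error."""
    return (baseline_error - new_error) / baseline_error


def fraction_of_genes_improved(baseline_errors, new_errors):
    """Share of target genes whose error is strictly lower than the baseline's."""
    wins = sum(1 for b, n in zip(baseline_errors, new_errors) if n < b)
    return wins / len(baseline_errors)


# Toy per-gene errors for three hypothetical target genes.
lr_errors = [0.50, 0.40, 0.30]   # stand-in for a linear-regression baseline
nn_errors = [0.40, 0.45, 0.20]   # stand-in for a nonlinear model
overall_gain = relative_improvement(sum(lr_errors) / 3, sum(nn_errors) / 3)
share_better = fraction_of_genes_improved(lr_errors, nn_errors)
```

Reporting both numbers is useful because an average gain can hide genes on which the new model is worse, as the second toy gene shows.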
Gene expression inference with deep learning
Chen, Yifei; Li, Yi; Narayan, Rajiv; Subramanian, Aravind; Xie, Xiaohui
2016-01-01
Motivation: Large-scale gene expression profiling has been widely used to characterize cellular states in response to various disease conditions, genetic perturbations, etc. Although the cost of whole-genome expression profiles has been dropping steadily, generating a compendium of expression profiling over thousands of samples is still very expensive. Recognizing that gene expressions are often highly correlated, researchers from the NIH LINCS program have developed a cost-effective strategy of profiling only ∼1000 carefully selected landmark genes and relying on computational methods to infer the expression of remaining target genes. However, the computational approach adopted by the LINCS program is currently based on linear regression (LR), limiting its accuracy since it does not capture complex nonlinear relationship between expressions of genes. Results: We present a deep learning method (abbreviated as D-GEX) to infer the expression of target genes from the expression of landmark genes. We used the microarray-based Gene Expression Omnibus dataset, consisting of 111K expression profiles, to train our model and compare its performance to those from other methods. In terms of mean absolute error averaged across all genes, deep learning significantly outperforms LR with 15.33% relative improvement. A gene-wise comparative analysis shows that deep learning achieves lower error than LR in 99.97% of the target genes. We also tested the performance of our learned model on an independent RNA-Seq-based GTEx dataset, which consists of 2921 expression profiles. Deep learning still outperforms LR with 6.57% relative improvement, and achieves lower error in 81.31% of the target genes. Availability and implementation: D-GEX is available at https://github.com/uci-cbcl/D-GEX. Contact: xhx@ics.uci.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:26873929
A ground truth based comparative study on clustering of gene expression data.
Zhu, Yitan; Wang, Zuyi; Miller, David J; Clarke, Robert; Xuan, Jianhua; Hoffman, Eric P; Wang, Yue
2008-05-01
Given the variety of available clustering methods for gene expression data analysis, it is important to develop an appropriate and rigorous validation scheme to assess the performance and limitations of the most widely used clustering algorithms. In this paper, we present a ground truth based comparative study on the functionality, accuracy, and stability of five data clustering methods, namely hierarchical clustering, K-means clustering, self-organizing maps, standard finite normal mixture fitting, and a caBIG toolkit (VIsual Statistical Data Analyzer--VISDA), tested on sample clustering of seven published microarray gene expression datasets and one synthetic dataset. We examined the performance of these algorithms in both data-sufficient and data-insufficient cases using quantitative performance measures, including cluster number detection accuracy and mean and standard deviation of partition accuracy. The experimental results showed that VISDA, an interactive coarse-to-fine maximum likelihood fitting algorithm, is a solid performer on most of the datasets, while K-means clustering and self-organizing maps optimized by the mean squared compactness criterion generally produce more stable solutions than the other methods.
Mansourian, Robert; Mutch, David M; Antille, Nicolas; Aubert, Jerome; Fogel, Paul; Le Goff, Jean-Marc; Moulin, Julie; Petrov, Anton; Rytz, Andreas; Voegel, Johannes J; Roberts, Matthew-Alan
2004-11-01
Microarray technology has become a powerful research tool in many fields of study; however, the cost of microarrays often results in the use of a low number of replicates (k). Under circumstances where k is low, it becomes difficult to perform standard statistical tests to extract the most biologically significant experimental results. Other more advanced statistical tests have been developed; however, their use and interpretation often remain difficult to implement in routine biological research. The present work outlines a method that achieves sufficient statistical power for selecting differentially expressed genes under conditions of low k, while remaining as an intuitive and computationally efficient procedure. The present study describes a Global Error Assessment (GEA) methodology to select differentially expressed genes in microarray datasets, and was developed using an in vitro experiment that compared control and interferon-gamma treated skin cells. In this experiment, up to nine replicates were used to confidently estimate error, thereby enabling methods of different statistical power to be compared. Gene expression results of a similar absolute expression are binned, so as to enable a highly accurate local estimate of the mean squared error within conditions. The model then relates variability of gene expression in each bin to absolute expression levels and uses this in a test derived from the classical ANOVA. The GEA selection method is compared with both the classical and permutational ANOVA tests, and demonstrates an increased stability, robustness and confidence in gene selection. A subset of the selected genes were validated by real-time reverse transcription-polymerase chain reaction (RT-PCR). All these results suggest that GEA methodology is (i) suitable for selection of differentially expressed genes in microarray data, (ii) intuitive and computationally efficient and (iii) especially advantageous under conditions of low k. 
The GEA code for R software is freely available upon request to authors.
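The binning idea described in this abstract (pooling genes of similar absolute expression to obtain a local error estimate) might look roughly like the following. This is a minimal sketch under stated assumptions, not the authors' GEA code: it handles a single condition, uses equal-sized bins, assumes every gene has the same number of replicates (so pooled variance is a plain average), and all names are hypothetical.

```python
import statistics


def binned_error_estimate(replicates_by_gene, bin_size):
    """Return (mean expression, pooled variance) for consecutive expression bins.

    replicates_by_gene: list of per-gene replicate measurements in one condition.
    bin_size: number of genes pooled into each bin (an assumed tuning choice).
    """
    # Per-gene mean and sample variance, sorted by mean expression.
    per_gene = sorted(
        (statistics.mean(values), statistics.variance(values))
        for values in replicates_by_gene
    )
    bins = []
    for start in range(0, len(per_gene), bin_size):
        chunk = per_gene[start:start + bin_size]
        bin_mean = statistics.mean(m for m, _ in chunk)
        # With equal replicate counts, pooling reduces to averaging variances.
        pooled_variance = statistics.mean(v for _, v in chunk)
        bins.append((bin_mean, pooled_variance))
    return bins


# Four hypothetical genes with two replicates each.
toy = [[1, 3], [10, 14], [2, 4], [11, 15]]
toy_bins = binned_error_estimate(toy, bin_size=2)
```

Pooling within a bin borrows information across genes, which is what makes an error estimate feasible when each gene has only a few replicates.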
Wieghaus, Kristen A.; Gianchandani, Erwin P.; Neal, Rebekah A.; Paige, Mikell A.; Brown, Milton L.; Papin, Jason A.; Botchwey, Edward A.
2009-01-01
We are creating synthetic pharmaceuticals with angiogenic activity and potential to promote vascular invasion. We previously demonstrated that one of these molecules, phthalimide neovascular factor 1 (PNF1), significantly expands microvascular networks in vivo following sustained release from poly(lactic-co-glycolic acid) (PLAGA) films. In addition, to probe PNF1 mode-of-action, we recently applied a novel pathway-based compendium analysis to a multi-timepoint, controlled microarray dataset of PNF1-treated (versus control) human microvascular endothelial cells (HMVECs), and we identified induction of tumor necrosis factor-alpha (TNF-α) and, subsequently, transforming growth factor-beta (TGF-β) signaling networks by PNF1. Here we validate this microarray data-set with quantitative real-time polymerase chain reaction (RT-PCR) analysis. Subsequently, we probe this dataset and identify three specific TGF-β-induced genes with regulation by PNF1 conserved over multiple timepoints—amyloid beta (A4) precursor protein (APP), early growth response 1 (EGR-1), and matrix metalloproteinase 14 (MMP14 or MT1-MMP)—that are also implicated in angiogenesis. We further focus on MMP14 given its unique role in angiogenesis, and we validate MT1-MMP modulation by PNF1 with an in vitro fluorescence assay that demonstrates the direct effects that PNF1 exerts on functional metalloproteinase activity. We also utilize endothelial cord formation in collagen gels to show that PNF1-induced stimulation of endothelial cord network formation in vitro is in some way MT1-MMP-dependent. Ultimately, this new network analysis of our transcriptional footprint characterizing PNF1 activity 1–48 h post-supplementation in HMVECs coupled with corresponding validating experiments suggests a key set of a few specific targets that are involved in PNF1 mode-of-action and important for successful promotion of the neovascularization that we have observed by the drug in vivo. PMID:19326468
Gene regulatory network identification from the yeast cell cycle based on a neuro-fuzzy system.
Wang, B H; Lim, J W; Lim, J S
2016-08-30
Many studies exist for reconstructing gene regulatory networks (GRNs). In this paper, we propose a method based on an advanced neuro-fuzzy system for gene regulatory network reconstruction from microarray time-series data. This approach uses a neural network with a weighted fuzzy function to model the relationships between genes. Fuzzy rules, which determine the regulators of genes, are greatly simplified through this method. Additionally, a regulator selection procedure is proposed, which extracts the exact dynamic relationship between genes using the information obtained from the weighted fuzzy function. Time-series related features are extracted from the original data to exploit the characteristics of temporal data that are useful for accurate GRN reconstruction. The microarray dataset of the yeast cell cycle was used for our study. We used the mean squared prediction error to assess the efficiency of the proposed approach and evaluated the accuracy in terms of precision, sensitivity, and F-score. The proposed method outperformed the other existing approaches.
Overcoming confounded controls in the analysis of gene expression data from microarray experiments.
Bhattacharya, Soumyaroop; Long, Dang; Lyons-Weiler, James
2003-01-01
A potential limitation of data from microarray experiments exists when improper control samples are used. In cancer research, comparing tumour expression profiles with those from normal samples is challenging due to tissue heterogeneity (mixed cell populations). A specific example exists in a published colon cancer dataset, in which tissue heterogeneity was reported among the normal samples. In this paper, we show how to overcome or avoid the problem of using normal samples that do not derive from the same tissue of origin as the tumour. We advocate an exploratory unsupervised bootstrap analysis that can reveal unexpected and undesired, but strongly supported, clusters of samples that reflect tissue differences instead of tumour versus normal differences. All of the algorithms used in the analysis, including the maximum difference subset algorithm, unsupervised bootstrap analysis, pooled variance t-test for finding differentially expressed genes and the jackknife to reduce false positives, are incorporated into our online Gene Expression Data Analyzer (http://bioinformatics.upmc.edu/GE2/GEDA.html).
From Saccharomyces cerevisiae to human: The important gene co-expression modules.
Liu, Wei; Li, Li; Ye, Hua; Chen, Haiwei; Shen, Weibiao; Zhong, Yuexian; Tian, Tian; He, Huaqin
2017-08-01
Network-based systems biology has become an important method for analyzing high-throughput gene expression data and gene function mining. Yeast has long been a popular model organism for biomedical research. In the current study, a weighted gene co-expression network analysis algorithm was applied to construct a gene co-expression network in Saccharomyces cerevisiae. Seventeen stable gene co-expression modules were detected from 2,814 S. cerevisiae microarray data. Further characterization of these modules with the Database for Annotation, Visualization and Integrated Discovery tool indicated that these modules were associated with certain biological processes, such as heat response, cell cycle, translational regulation, mitochondrion oxidative phosphorylation, amino acid metabolism and autophagy. Hub genes were also screened by intra-modular connectivity. Finally, the module conservation was evaluated in a human disease microarray dataset. Functional modules were identified in budding yeast, some of which are associated with patient survival. The current study provided a paradigm for single cell microorganisms and potentially other organisms.
New insights about host response to smallpox using microarray data.
Esteves, Gustavo H; Simoes, Ana C Q; Souza, Estevao; Dias, Rodrigo A; Ospina, Raydonal; Venancio, Thiago M
2007-08-24
Smallpox is a lethal disease that was endemic in many parts of the world until eradicated by massive immunization. Due to its lethality, there are serious concerns about its use as a bioweapon. Here we analyze publicly available microarray data to further understand survival of smallpox infected macaques, using systems biology approaches. Our goal is to improve the knowledge about the progression of this disease. We used KEGG pathways annotations to define groups of genes (or modules), and subsequently compared them to macaque survival times. This technique provided additional insights about the host response to this disease, such as increased expression of the cytokines and ECM receptors in the individuals with higher survival times. These results could indicate that these gene groups could influence an effective response from the host to smallpox. Macaques with higher survival times clearly express some specific pathways previously unidentified using regular gene-by-gene approaches. Our work also shows how third party analysis of public datasets can be important to support new hypotheses to relevant biological problems.
Kim, Eunjung; Kim, Eun Jung; Seo, Seung-Won; Hur, Cheol-Goo; McGregor, Robin A; Choi, Myung-Sook
2014-01-01
Worldwide obesity and related comorbidities are increasing, but identifying new therapeutic targets remains a challenge. A plethora of microarray studies in diet-induced obesity models has provided large datasets of obesity associated genes. In this review, we describe an approach to examine the underlying molecular network regulating obesity, and we discuss interactions between obesity candidate genes. We conducted network analysis on functional protein-protein interactions associated with 25 obesity candidate genes identified in a literature-driven approach based on published microarray studies of diet-induced obesity. The obesity candidate genes were closely associated with lipid metabolism and inflammation. Peroxisome proliferator activated receptor gamma (Pparg) appeared to be a core obesity gene, and obesity candidate genes were highly interconnected, suggesting a coordinately regulated molecular network in adipose tissue. In conclusion, the current network analysis approach may help elucidate the underlying molecular network regulating obesity and identify anti-obesity targets for therapeutic intervention.
Prediction of beta-turns with learning machines.
Cai, Yu-Dong; Liu, Xiao-Jun; Li, Yi-Xue; Xu, Xue-biao; Chou, Kuo-Chen
2003-05-01
The support vector machine approach was introduced to predict the beta-turns in proteins. The overall self-consistency rate by the re-substitution test for the training or learning dataset reached 100%. Both the training dataset and independent testing dataset were taken from Chou [J. Pept. Res. 49 (1997) 120]. The success prediction rates by the jackknife test for the beta-turn subset of 455 tetrapeptides and non-beta-turn subset of 3807 tetrapeptides in the training dataset were 58.1 and 98.4%, respectively. The success rates with the independent dataset test for the beta-turn subset of 110 tetrapeptides and non-beta-turn subset of 30,231 tetrapeptides were 69.1 and 97.3%, respectively. The results obtained from this study support the conclusion that the residue-coupled effect along a tetrapeptide is important for the formation of a beta-turn.
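Because the non-beta-turn subsets are far larger than the beta-turn subsets, a pooled accuracy is dominated by the non-beta-turn class. The sketch below works through that arithmetic from the subset sizes and rates quoted in the abstract. The pooled and balanced figures are derived here from rounded percentages, so they are approximate and are not values reported by the authors; the function names are hypothetical.

```python
def overall_accuracy(class_sizes, class_rates):
    """Pooled accuracy: correctly predicted items divided by all items."""
    correct = sum(n * r for n, r in zip(class_sizes, class_rates))
    return correct / sum(class_sizes)


def balanced_accuracy(class_rates):
    """Unweighted mean of the per-class success rates."""
    return sum(class_rates) / len(class_rates)


# Sizes and rates as quoted in the abstract (beta-turn, non-beta-turn).
jackknife_overall = overall_accuracy([455, 3807], [0.581, 0.984])
independent_overall = overall_accuracy([110, 30231], [0.691, 0.973])
jackknife_balanced = balanced_accuracy([0.581, 0.984])
```

Under these inputs the pooled jackknife figure comes out near 94% while the balanced figure is near 78%, which is why per-class rates are the more informative numbers for an imbalanced problem like this one.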
Spinelli, Lionel; Carpentier, Sabrina; Montañana Sanchis, Frédéric; Dalod, Marc; Vu Manh, Thien-Phong
2015-10-19
Recent advances in the analysis of high-throughput expression data have led to the development of tools that scaled up their focus from the single-gene to the gene-set level. For example, the popular Gene Set Enrichment Analysis (GSEA) algorithm can detect moderate but coordinated expression changes of groups of presumably related genes between pairs of experimental conditions. This considerably improves extraction of information from high-throughput gene expression data. However, although many gene sets covering a large panel of biological fields are available in public databases, the ability to generate home-made gene sets relevant to one's biological question is crucial but remains a substantial challenge to most biologists lacking statistical or bioinformatics expertise. This is all the more the case when attempting to define a gene set specific to one condition compared to many others. Thus, there is a crucial need for easy-to-use software for the generation of relevant home-made gene sets from complex datasets, their use in GSEA, and the correction of the results when applied to multiple comparisons of many experimental conditions. We developed BubbleGUM (GSEA Unlimited Map), a tool that automatically extracts molecular signatures from transcriptomic data and performs exhaustive GSEA with multiple testing correction. One original feature of BubbleGUM notably resides in its capacity to integrate and compare numerous GSEA results into an easy-to-grasp graphical representation. We applied our method to generate transcriptomic fingerprints for murine cell types and to assess their enrichments in human cell types. This analysis allowed us to confirm homologies between mouse and human immunocytes. BubbleGUM is open-source software that automatically generates molecular signatures from complex expression datasets and directly assesses their enrichment by GSEA on independent datasets.
Enrichments are displayed in a graphical output that helps in interpreting the results. This innovative methodology has recently been used to answer important questions in functional genomics, such as the degree of similarity between microarray datasets from different laboratories or with different experimental models or clinical cohorts. BubbleGUM is executable through an intuitive interface so that both bioinformaticians and biologists can use it. It is available at http://www.ciml.univ-mrs.fr/applications/BubbleGUM/index.html.
2014-01-01
Background: In complex large-scale experiments, in addition to simultaneously considering a large number of features, multiple hypotheses are often being tested for each feature. This leads to a problem of multi-dimensional multiple testing. For example, in gene expression studies over ordered categories (such as time-course or dose-response experiments), interest is often in testing differential expression across several categories for each gene. In this paper, we consider a framework for testing multiple sets of hypotheses, which can be applied to a wide range of problems. Results: We adopt the concept of the overall false discovery rate (OFDR) for controlling false discoveries on the hypothesis set level. Based on an existing procedure for identifying differentially expressed gene sets, we discuss a general two-step hierarchical hypothesis set testing procedure, which controls the overall false discovery rate under independence across hypothesis sets. In addition, we discuss the concept of the mixed-directional false discovery rate (mdFDR), and extend the general procedure to enable directional decisions for two-sided alternatives. We applied the framework to the case of microarray time-course/dose-response experiments, and proposed three procedures for testing differential expression and making multiple directional decisions for each gene. Simulation studies confirm the control of the OFDR and mdFDR by the proposed procedures under independence and positive correlations across genes. Simulation results also show that two of our new procedures achieve higher power than previous methods. Finally, the proposed methodology is applied to a microarray dose-response study, to identify 17β-estradiol-sensitive genes in breast cancer cells that are induced at low concentrations. Conclusions: The framework we discuss provides a platform for multiple testing procedures covering situations involving two (or potentially more) sources of multiplicity.
The framework is easy to use and adaptable to various practical settings that frequently occur in large-scale experiments. Procedures generated from the framework are shown to maintain control of the OFDR and mdFDR, quantities that are especially relevant in the case of multiple hypothesis set testing. The procedures work well in both simulations and real datasets, and are shown to have better power than existing methods. PMID:24731138
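For readers unfamiliar with false discovery rate control, the classical Benjamini-Hochberg step-up procedure is a common building block for screening a list of per-gene p-values. The sketch below is that generic procedure only. It is not the two-step OFDR/mdFDR procedure of this paper, whose details the abstract does not give, and the function name and toy p-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return sorted indices of hypotheses rejected by the BH step-up rule."""
    m = len(p_values)
    # Indices ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up: remember the largest rank whose p-value meets its threshold.
        if p_values[idx] <= alpha * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])


# Toy screening p-values for four hypothetical genes.
rejected = benjamini_hochberg([0.02, 0.03, 0.035, 0.5], alpha=0.05)
```

In the toy input the two smallest p-values miss their own thresholds but are still rejected because the third one passes, which is the defining step-up behaviour.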
Malinowski, Douglas P
2007-05-01
In recent years, the application of genomic and proteomic technologies to the problem of breast cancer prognosis and the prediction of therapy response have begun to yield encouraging results. Independent studies employing transcriptional profiling of primary breast cancer specimens using DNA microarrays have identified gene expression profiles that correlate with clinical outcome in primary breast biopsy specimens. Recent advances in microarray technology have demonstrated reproducibility, making clinical applications more achievable. In this regard, one such DNA microarray device based upon a 70-gene expression signature was recently cleared by the US FDA for application to breast cancer prognosis. These DNA microarrays often employ at least 70 gene targets for transcriptional profiling and prognostic assessment in breast cancer. The use of PCR-based methods utilizing a small subset of genes has recently demonstrated the ability to predict the clinical outcome in early-stage breast cancer. Furthermore, protein-based immunohistochemistry methods have progressed from using gene clusters and gene expression profiling to smaller subsets of expressed proteins to predict prognosis in early-stage breast cancer. Beyond prognostic applications, DNA microarray-based transcriptional profiling has demonstrated the ability to predict response to chemotherapy in early-stage breast cancer patients. In this review, recent advances in the use of multiple markers for prognosis of disease recurrence in early-stage breast cancer and the prediction of therapy response will be discussed.
Liu, Li-Zhi; Wu, Fang-Xiang; Zhang, Wen-Jun
2014-01-01
As an abstract mapping of the gene regulations in the cell, gene regulatory network is important to both biological research study and practical applications. The reverse engineering of gene regulatory networks from microarray gene expression data is a challenging research problem in systems biology. With the development of biological technologies, multiple time-course gene expression datasets might be collected for a specific gene network under different circumstances. The inference of a gene regulatory network can be improved by integrating these multiple datasets. It is also known that gene expression data may be contaminated with large errors or outliers, which may affect the inference results. A novel method, Huber group LASSO, is proposed to infer the same underlying network topology from multiple time-course gene expression datasets as well as to take the robustness to large error or outliers into account. To solve the optimization problem involved in the proposed method, an efficient algorithm which combines the ideas of auxiliary function minimization and block descent is developed. A stability selection method is adapted to our method to find a network topology consisting of edges with scores. The proposed method is applied to both simulation datasets and real experimental datasets. It shows that Huber group LASSO outperforms the group LASSO in terms of both areas under receiver operating characteristic curves and areas under the precision-recall curves. The convergence analysis of the algorithm theoretically shows that the sequence generated from the algorithm converges to the optimal solution of the problem. The simulation and real data examples demonstrate the effectiveness of the Huber group LASSO in integrating multiple time-course gene expression datasets and improving the resistance to large errors or outliers.
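The two ingredients named in this abstract, a Huber loss for robustness to outliers and a group LASSO penalty that ties coefficients together, can be illustrated as follows. This is a sketch under stated assumptions rather than the paper's exact objective: the scaling of the terms, the threshold `delta`, the weight `lam`, and the choice to treat one candidate edge's coefficients across datasets as a group are all illustrative.

```python
import math


def huber_loss(residual, delta=1.0):
    """Quadratic for small residuals, linear beyond delta (limits outlier influence)."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


def group_lasso_penalty(groups):
    """Sum of Euclidean norms; a whole group is driven to zero together."""
    return sum(math.sqrt(sum(w * w for w in group)) for group in groups)


def robust_objective(residuals, groups, lam, delta=1.0):
    """Assumed form: total Huber loss plus lam times the group penalty."""
    data_term = sum(huber_loss(r, delta) for r in residuals)
    return data_term + lam * group_lasso_penalty(groups)


# One small residual and one outlier; two hypothetical edge groups.
value = robust_objective([0.5, 3.0], [[3.0, 4.0], [0.0, 0.0]], lam=0.1)
```

With a squared loss the outlier residual of 3.0 would contribute 4.5 to the data term; under the Huber loss with `delta=1.0` it contributes 2.5, which is the robustness the abstract refers to.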
Reliable pre-eclampsia pathways based on multiple independent microarray data sets.
Kawasaki, Kaoru; Kondoh, Eiji; Chigusa, Yoshitsugu; Ujita, Mari; Murakami, Ryusuke; Mogami, Haruta; Brown, J B; Okuno, Yasushi; Konishi, Ikuo
2015-02-01
Pre-eclampsia is a multifactorial disorder characterized by heterogeneous clinical manifestations. Gene expression profiling of preeclamptic placenta has provided different and even opposite results, partly due to data compromised by various experimental artefacts. Here we aimed to identify reliable pre-eclampsia-specific pathways using multiple independent microarray data sets. Gene expression data of control and preeclamptic placentas were obtained from Gene Expression Omnibus. Single-sample gene-set enrichment analysis was performed to generate gene-set activation scores of 9707 pathways obtained from the Molecular Signatures Database. Candidate pathways were identified by t-test-based screening using the data sets GSE10588, GSE14722 and GSE25906. Additionally, recursive feature elimination was applied to arrive at a further reduced set of pathways. To assess the validity of the pre-eclampsia pathways, a statistically-validated protocol was executed using five data sets, including two other independent validation data sets, GSE30186 and GSE44711. Quantitative real-time PCR was performed for genes in a panel of potential pre-eclampsia pathways using placentas of 20 women with normal or severe preeclamptic singleton pregnancies (n = 10 each). A panel of ten pathways was found to discriminate women with pre-eclampsia from controls with high accuracy. Among these were pathways not previously associated with pre-eclampsia, such as the GABA receptor pathway, as well as pathways that have already been linked to pre-eclampsia, such as the glutathione and CDKN1C pathways. mRNA expression of GABRA3 (GABA receptor pathway), GCLC and GCLM (glutathione metabolic pathway), and CDKN1C was significantly reduced in the preeclamptic placentas. In conclusion, ten accurate and reliable pre-eclampsia pathways were identified based on multiple independent microarray data sets.
A pathway-based classification may be a worthwhile approach to elucidate the pathogenesis of pre-eclampsia.
Wisdom of crowds for robust gene network inference
Marbach, Daniel; Costello, James C.; Küffner, Robert; Vega, Nicci; Prill, Robert J.; Camacho, Diogo M.; Allison, Kyle R.; Kellis, Manolis; Collins, James J.; Stolovitzky, Gustavo
2012-01-01
Reconstructing gene regulatory networks from high-throughput data is a long-standing problem. Through the DREAM project (Dialogue on Reverse Engineering Assessment and Methods), we performed a comprehensive blind assessment of over thirty network inference methods on Escherichia coli, Staphylococcus aureus, Saccharomyces cerevisiae, and in silico microarray data. We characterize performance, data requirements, and inherent biases of different inference approaches offering guidelines for both algorithm application and development. We observe that no single inference method performs optimally across all datasets. In contrast, integration of predictions from multiple inference methods shows robust and high performance across diverse datasets. Thereby, we construct high-confidence networks for E. coli and S. aureus, each comprising ~1700 transcriptional interactions at an estimated precision of 50%. We experimentally test 53 novel interactions in E. coli, of which 23 were supported (43%). Our results establish community-based methods as a powerful and robust tool for the inference of transcriptional gene regulatory networks. PMID:22796662
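One simple way to integrate predictions from several inference methods is to average each candidate edge's rank across methods. The sketch below illustrates that idea only. Average-rank aggregation is an assumption here, since the abstract does not say how the community predictions were combined, and the gene names and scores are made up.

```python
def rank_edges(scores):
    """Map each edge to its rank (1 = most confident) under one method's scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {edge: position for position, edge in enumerate(ordered, start=1)}


def community_ranking(score_tables):
    """Order edges by mean rank across methods; ties are broken by edge name."""
    rank_tables = [rank_edges(table) for table in score_tables]
    edges = list(rank_tables[0])
    mean_rank = {
        edge: sum(table[edge] for table in rank_tables) / len(rank_tables)
        for edge in edges
    }
    return sorted(edges, key=lambda edge: (mean_rank[edge], edge))


# Confidence scores from three hypothetical methods for the same three edges.
method_a = {("g1", "g2"): 0.9, ("g1", "g3"): 0.2, ("g2", "g3"): 0.5}
method_b = {("g1", "g2"): 0.6, ("g1", "g3"): 0.1, ("g2", "g3"): 0.8}
method_c = {("g1", "g2"): 0.7, ("g1", "g3"): 0.4, ("g2", "g3"): 0.3}
consensus = community_ranking([method_a, method_b, method_c])
```

Working on ranks rather than raw scores sidesteps the fact that different methods report confidence on incomparable scales.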
Bayes multiple decision functions.
Wu, Wensong; Peña, Edsel A
2013-01-01
This paper deals with the problem of simultaneously making many (M) binary decisions based on one realization of a random data matrix X. M is typically large and X will usually have M rows associated with each of the M decisions to make, but for each row the data may be low dimensional. Such problems arise in many practical areas such as the biological and medical sciences, where the available dataset is from microarrays or other high-throughput technology, with the goal being to decide which among many genes are relevant with respect to some phenotype of interest; in the engineering and reliability sciences; in astronomy; in education; and in business. A Bayesian decision-theoretic approach to this problem is implemented with the overall loss function being a cost-weighted linear combination of Type I and Type II loss functions. The class of loss functions considered allows for use of the false discovery rate (FDR), false nondiscovery rate (FNR), and missed discovery rate (MDR) in assessing the quality of decision. Through this Bayesian paradigm, the Bayes multiple decision function (BMDF) is derived and an efficient algorithm to obtain the optimal Bayes action is described. In contrast to many works in the literature where the rows of the matrix X are assumed to be stochastically independent, we allow a dependent data structure with the associations obtained through a class of frailty-induced Archimedean copulas. In particular, non-Gaussian dependent data structure, which is typical with failure-time data, can be entertained. The numerical implementation of the determination of the Bayes optimal action is facilitated through sequential Monte Carlo techniques. The theory developed could also be extended to the problem of multiple hypotheses testing, multiple classification and prediction, and high-dimensional variable selection.
The proposed procedure is illustrated for the simple versus simple hypotheses setting and for the composite hypotheses setting through simulation studies. The procedure is also applied to a subset of a microarray data set from a colon cancer study.
Mayday - integrative analytics for expression data
2010-01-01
Background DNA microarrays have become the standard method for large-scale analyses of gene expression and epigenomics. The increasing complexity and inherent noisiness of the generated data make visual data exploration ever more important. Fast deployment of new methods as well as a combination of predefined, easy-to-apply methods with programmer's access to the data are important requirements for any analysis framework. Mayday is an open source platform with emphasis on visual data exploration and analysis. Many built-in methods for clustering, machine learning and classification are provided for dissecting complex datasets. Plugins can easily be written to extend Mayday's functionality in a large number of ways. As a Java program, Mayday is platform-independent and can be used as a Java WebStart application without any installation. Mayday can import data from several file formats, and database connectivity is included for efficient data organization. Numerous interactive visualization tools, including box plots, profile plots, principal component plots and a heatmap, are available; they can be enhanced with metadata and exported as publication-quality vector files. Results We have rewritten large parts of Mayday's core to make it more efficient and ready for future developments. Among the large number of new plugins are an automated processing framework, dynamic filtering, new and efficient clustering methods, a machine learning module and database connectivity. Extensive manual data analysis can be done using an inbuilt R terminal and an integrated SQL querying interface. Our visualization framework has become more powerful, new plot types have been added and existing plots improved. Conclusions We present a major extension of Mayday, a very versatile open-source framework for efficient microarray data analysis designed for biologists and bioinformaticians. Most everyday tasks are already covered.
The large number of available plugins as well as the extension possibilities using compiled plugins and ad-hoc scripting allow for the rapid adaptation of Mayday also to very specialized data exploration. Mayday is available at http://microarray-analysis.org. PMID:20214778
Ryan, Natalia; Chorley, Brian; Tice, Raymond R.; Judson, Richard; Corton, J. Christopher
2016-01-01
Microarray profiling of chemical-induced effects is being increasingly used in medium- and high-throughput formats. Computational methods are described here to identify molecular targets from whole-genome microarray data using as an example the estrogen receptor α (ERα), often modulated by potential endocrine disrupting chemicals. ERα biomarker genes were identified by their consistent expression after exposure to 7 structurally diverse ERα agonists and 3 ERα antagonists in ERα-positive MCF-7 cells. Most of the biomarker genes were shown to be directly regulated by ERα as determined by ESR1 gene knockdown using siRNA as well as through chromatin immunoprecipitation coupled with DNA sequencing analysis of ERα-DNA interactions. The biomarker was evaluated as a predictive tool using the fold-change rank-based Running Fisher algorithm by comparison to annotated gene expression datasets from experiments using MCF-7 cells, including those evaluating the transcriptional effects of hormones and chemicals. Using 141 comparisons from chemical- and hormone-treated cells, the biomarker gave a balanced accuracy for prediction of ERα activation or suppression of 94% and 93%, respectively. The biomarker was able to correctly classify 18 out of 21 (86%) ER reference chemicals including “very weak” agonists. Importantly, the biomarker predictions accurately replicated predictions based on 18 in vitro high-throughput screening assays that queried different steps in ERα signaling. For 114 chemicals, the balanced accuracies were 95% and 98% for activation or suppression, respectively. These results demonstrate that the ERα gene expression biomarker can accurately identify ERα modulators in large collections of microarray data derived from MCF-7 cells. PMID:26865669
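The abstract above reports balanced accuracies without defining the metric. A minimal sketch, assuming the usual definition (the mean of sensitivity and specificity), is shown below; the function name and the counts in the example are hypothetical and are not taken from the study.

```python
def balanced_accuracy(true_pos: int, false_neg: int,
                      true_neg: int, false_pos: int) -> float:
    """Mean of sensitivity and specificity for a binary classifier."""
    positives = true_pos + false_neg
    negatives = true_neg + false_pos
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative case")
    sensitivity = true_pos / positives   # fraction of true modulators detected
    specificity = true_neg / negatives   # fraction of inactive cases rejected
    return (sensitivity + specificity) / 2.0


if __name__ == "__main__":
    # Made-up confusion-matrix counts, for illustration only.
    print(balanced_accuracy(true_pos=18, false_neg=2, true_neg=90, false_pos=10))
```

Unlike plain accuracy, this metric is not dominated by the larger class, which matters when active chemicals are much rarer than inactive ones.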
Puthiyedth, Nisha; Riveros, Carlos; Berretta, Regina; Moscato, Pablo
2015-01-01
Background The joint study of multiple datasets has become a common technique for increasing statistical power in detecting biomarkers obtained from smaller studies. The approach generally followed is based on the fact that as the total number of samples increases, we expect to have greater power to detect associations of interest. This methodology has been applied to genome-wide association and transcriptomic studies due to the availability of datasets in the public domain. While this approach is well established in biostatistics, the introduction of new combinatorial optimization models to address this issue has not been explored in depth. In this study, we introduce a new model for the integration of multiple datasets and we show its application in transcriptomics. Methods We propose a new combinatorial optimization problem that addresses the core issue of biomarker detection in integrated datasets. Optimal solutions for this model deliver a feature selection from a panel of prospective biomarkers. The model we propose is a generalised version of the (α,β)-k-Feature Set problem. We illustrate the performance of this new methodology via a challenging meta-analysis task involving six prostate cancer microarray datasets. The results are then compared to the popular RankProd meta-analysis tool and to what can be obtained by analysing the individual datasets by statistical and combinatorial methods alone. Results Application of the integrated method resulted in a more informative signature than the rank-based meta-analysis or individual dataset results, and overcomes problems arising from real world datasets. The set of genes identified is highly significant in the context of prostate cancer. The method used does not rely on homogenisation or transformation of values to a common scale, and at the same time is able to capture markers associated with subgroups of the disease. PMID:26106884
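For orientation, the sketch below checks whether a candidate gene set satisfies the basic (α,β)-k-Feature Set condition as it is commonly stated for binary features: every pair of samples from different classes must differ in at least α of the selected features, and every pair from the same class must agree in at least β of them. This is an assumption about the base problem only; the generalised model proposed in the paper is not reproduced here, and all names are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def is_alpha_beta_feature_set(samples: Sequence[Sequence[int]],
                              labels: Sequence[int],
                              features: Sequence[int],
                              alpha: int,
                              beta: int) -> bool:
    """Check the (alpha, beta) condition for the selected feature indices."""
    for i, j in combinations(range(len(samples)), 2):
        # Number of selected features on which the two samples differ.
        differing = sum(samples[i][f] != samples[j][f] for f in features)
        agreeing = len(features) - differing
        if labels[i] != labels[j] and differing < alpha:
            return False  # pair from different classes is not separated enough
        if labels[i] == labels[j] and agreeing < beta:
            return False  # pair from the same class is not similar enough
    return True


if __name__ == "__main__":
    # Toy binarised expression matrix: rows are samples, columns are genes.
    toy_samples = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 0]]
    toy_labels = [1, 1, 0, 0]
    print(is_alpha_beta_feature_set(toy_samples, toy_labels, [0], alpha=1, beta=1))
```

The optimisation problem then asks for a set of k features passing this check; the checker itself is only the feasibility test.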
Johnson, Michael E.; Mahoney, J. Matthew; Taroni, Jaclyn; Sargent, Jennifer L.; Marmarelis, Eleni; Wu, Ming-Ru; Varga, John; Hinchcliff, Monique E.; Whitfield, Michael L.
2015-01-01
Genome-wide expression profiling in systemic sclerosis (SSc) has identified four ‘intrinsic’ subsets of disease (fibroproliferative, inflammatory, limited, and normal-like), each of which shows deregulation of distinct signaling pathways; however, the full set of pathways contributing to this differential gene expression has not been fully elucidated. Here we examine experimentally derived gene expression signatures in dermal fibroblasts for thirteen different signaling pathways implicated in SSc pathogenesis. These data show distinct and overlapping sets of genes induced by each pathway, allowing for a better understanding of the molecular relationship between profibrotic and immune signaling networks. Pathway-specific gene signatures were analyzed across a compendium of microarray datasets consisting of skin biopsies from three independent cohorts representing 80 SSc patients, 4 morphea, and 26 controls. IFNα signaling showed a strong association with early disease, while TGFβ signaling spanned the fibroproliferative and inflammatory subsets, was associated with worse MRSS, and was higher in lesional than non-lesional skin. The fibroproliferative subset was most strongly associated with PDGF signaling, while the inflammatory subset demonstrated strong activation of innate immune pathways including TLR signaling upstream of NF-κB. The limited and normal-like subsets did not show associations with fibrotic and inflammatory mediators such as TGFβ and TNFα. The normal-like subset showed high expression of genes associated with lipid signaling, which was absent in the inflammatory and limited subsets. Together, these data suggest a model by which IFNα is involved in early disease pathology, and disease severity is associated with active TGFβ signaling. PMID:25607805
Xu, Xie L; Kapoun, Ann M
2009-01-01
Background TGFβ has emerged as an attractive target for the therapeutic intervention of glioblastomas. Aberrant TGFβ overproduction in glioblastoma and other high-grade gliomas has been reported; however, to date, none of these reports has systematically examined the components of TGFβ signaling to gain a comprehensive view of TGFβ activation in large cohorts of human glioma patients. Methods TGFβ activation in mammalian cells leads to a transcriptional program that typically affects 5–10% of the genes in the genome. To systematically examine the status of TGFβ activation in high-grade glial tumors, we compiled a gene set of transcriptional response to TGFβ stimulation from tissue culture and in vivo animal studies. These genes were used to examine the status of TGFβ activation in high-grade gliomas including a large cohort of glioblastomas. Unsupervised and supervised classification analysis was performed in two independent, publicly available glioma microarray datasets. Results Unsupervised and supervised classification using the TGFβ-responsive gene list in two independent glial tumor gene expression data sets revealed various levels of TGFβ activation in these tumors. Among glioblastomas, one of the most devastating human cancers, two subgroups were identified that showed distinct TGFβ activation patterns as measured from transcriptional responses. Approximately 62% of glioblastoma samples analyzed showed strong TGFβ activation, while the rest showed a weak TGFβ transcriptional response. Conclusion Our findings suggest heterogeneous TGFβ activation in glioblastomas, which may cause potential differences in responses to anti-TGFβ therapies in these two distinct subgroups of glioblastoma patients. PMID:19192267
Azuaje, Francisco; Zheng, Huiru; Camargo, Anyela; Wang, Haiying
2011-08-01
The discovery of novel disease biomarkers is a crucial challenge for translational bioinformatics. Demonstration of both their classification power and reproducibility across independent datasets are essential requirements to assess their potential clinical relevance. Small datasets and multiplicity of putative biomarker sets may explain lack of predictive reproducibility. Studies based on pathway-driven discovery approaches have suggested that, despite such discrepancies, the resulting putative biomarkers tend to be implicated in common biological processes. Investigations of this problem have been mainly focused on datasets derived from cancer research. We investigated the predictive and functional concordance of five methods for discovering putative biomarkers in four independently-generated datasets from the cardiovascular disease domain. A diversity of biosignatures was identified by the different methods. However, we found strong biological process concordance between them, especially in the case of methods based on gene set analysis. With a few exceptions, we observed lack of classification reproducibility using independent datasets. Partial overlaps between our putative sets of biomarkers and the primary studies exist. Despite the observed limitations, pathway-driven or gene set analysis can predict potentially novel biomarkers and can jointly point to biomedically-relevant underlying molecular mechanisms. Copyright © 2011 Elsevier Inc. All rights reserved.
Reconstructing the temporal ordering of biological samples using microarray data.
Magwene, Paul M; Lizardi, Paul; Kim, Junhyong
2003-05-01
Accurate time series for biological processes are difficult to estimate due to problems of synchronization, temporal sampling and rate heterogeneity. Methods are needed that can utilize multi-dimensional data, such as those resulting from DNA microarray experiments, in order to reconstruct time series from unordered or poorly ordered sets of observations. We present a set of algorithms for estimating temporal orderings from unordered sets of sample elements. The techniques we describe are based on modifications of a minimum-spanning tree calculated from a weighted, undirected graph. We demonstrate the efficacy of our approach by applying these techniques to an artificial data set as well as several gene expression data sets derived from DNA microarray experiments. In addition to estimating orderings, the techniques we describe also provide useful heuristics for assessing relevant properties of sample datasets such as noise and sampling intensity, and we show how a data structure called a PQ-tree can be used to represent uncertainty in a reconstructed ordering. Academic implementations of the ordering algorithms are available as source code (in the programming language Python) on our web site, along with documentation on their use. The artificial 'jelly roll' data set upon which the algorithm was tested is also available from this web site. The publicly available gene expression data may be found at http://genome-www.stanford.edu/cellcycle/ and http://caulobacter.stanford.edu/CellCycle/.
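A rough sketch of the underlying idea follows: build a minimum spanning tree over the samples and read a candidate ordering off its longest path. It assumes Euclidean distance between expression profiles as the edge weight and measures path length in number of edges; both are assumptions, and the authors' specific modifications and PQ-tree representation are not reproduced. The function names are hypothetical.

```python
import math
from collections import defaultdict


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete graph with Euclidean edge weights."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known connection of each node to the tree
    parent = [-1] * n
    edges = []
    if n:
        best[0] = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges


def estimate_ordering(points):
    """Return sample indices along the longest path (in edges) of the MST."""
    if not points:
        return []
    adj = defaultdict(list)
    for a, b in minimum_spanning_tree(points):
        adj[a].append(b)
        adj[b].append(a)

    def farthest(start):
        # Breadth-first search; the last node visited is at maximum depth.
        parent = {start: None}
        order = [start]
        for node in order:
            for nb in adj[node]:
                if nb not in parent:
                    parent[nb] = node
                    order.append(nb)
        return order[-1], parent

    end_a, _ = farthest(0)
    end_b, parent = farthest(end_a)
    path = [end_b]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path[::-1]


if __name__ == "__main__":
    # Made-up samples lying roughly along a line, given in shuffled order.
    shuffled = [(0, 0), (3, 0.1), (1, 0.2), (4, 0), (2, -0.1)]
    print(estimate_ordering(shuffled))
```

The direction of the recovered ordering is arbitrary: the tree alone cannot tell which end of the path is the start of the process. Samples that hang off the main path are left out of this simple ordering.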
Swertz, Morris A; De Brock, E O; Van Hijum, Sacha A F T; De Jong, Anne; Buist, Girbe; Baerends, Richard J S; Kok, Jan; Kuipers, Oscar P; Jansen, Ritsert C
2004-09-01
Genomic research laboratories need adequate infrastructure to support management of their data production and research workflow. But what makes infrastructure adequate? A lack of appropriate criteria makes any decision on buying or developing a system difficult. Here, we report on the decision process for the case of a molecular genetics group establishing a microarray laboratory. Five typical requirements for experimental genomics database systems were identified: (i) evolution ability to keep up with the fast developing genomics field; (ii) a suitable data model to deal with local diversity; (iii) suitable storage of data files in the system; (iv) easy exchange with other software; and (v) low maintenance costs. The computer scientists and the researchers of the local microarray laboratory considered alternative solutions for these five requirements and chose the following options: (i) use of automatic code generation; (ii) a customized data model based on standards; (iii) storage of datasets as black boxes instead of decomposing them in database tables; (iv) loosely linking to other programs for improved flexibility; and (v) a low-maintenance web-based user interface. Our team evaluated existing microarray databases and then decided to build a new system, Molecular Genetics Information System (MOLGENIS), implemented using code generation in a period of three months. This case can provide valuable insights and lessons to both software developers and a user community embarking on large-scale genomic projects. http://www.molgenis.nl
Gálvez, Juan Manuel; Castillo, Daniel; Herrera, Luis Javier; San Román, Belén; Valenzuela, Olga; Ortuño, Francisco Manuel; Rojas, Ignacio
2018-01-01
Most of the research studies developed applying microarray technology to the characterization of different pathological states of any disease may fail in reaching statistically significant results. This is largely due to the small repertoire of analysed samples, and to the limitation in the number of states or pathologies usually addressed. Moreover, the influence of potential deviations on the gene expression quantification is usually disregarded. In spite of the continuous changes in omic sciences, reflected for instance in the emergence of new Next-Generation Sequencing-related technologies, the existing availability of a vast amount of gene expression microarray datasets should be properly exploited. Therefore, this work proposes a novel methodological approach involving the integration of several heterogeneous skin cancer series, and a later multiclass classifier design. This approach is thus a way to provide the clinicians with an intelligent diagnosis support tool based on the use of a robust set of selected biomarkers, which simultaneously distinguishes among different cancer-related skin states. To achieve this, a multi-platform combination of microarray datasets from Affymetrix and Illumina manufacturers was carried out. This integration is expected to strengthen the statistical robustness of the study as well as the finding of highly-reliable skin cancer biomarkers. Specifically, the designed operation pipeline has allowed the identification of a small subset of 17 differentially expressed genes (DEGs) from which to distinguish among 7 involved skin states. These genes were obtained from the assessment of a number of potential batch effects on the gene expression data. The biological interpretation of these genes was inspected in the specific literature to understand their underlying information in relation to skin cancer. 
Finally, in order to assess their possible effectiveness in cancer diagnosis, a cross-validated Support Vector Machine (SVM)-based classification including feature ranking was performed. The accuracy attained exceeded 92% in overall recognition of the 7 different cancer-related skin states. The proposed integration scheme is expected to allow the co-integration with other state-of-the-art technologies such as RNA-seq.
A novel approach for dimension reduction of microarray.
Aziz, Rabia; Verma, C K; Srivastava, Namita
2017-12-01
This paper proposes a new hybrid search technique for feature (gene) selection (FS) using Independent Component Analysis (ICA) and Artificial Bee Colony (ABC), called ICA+ABC, to select informative genes based on a Naïve Bayes (NB) algorithm. An important trait of this technique is the optimization of the ICA feature vector using ABC. ICA+ABC is a hybrid search algorithm that combines the benefits of the extraction approach, to reduce the size of the data, and the wrapper approach, to optimize the reduced feature vectors. The technique is assessed by evaluating the performance of ICA+ABC on six standard gene expression classification datasets. Extensive experiments were conducted to compare the performance of ICA+ABC with the results obtained from the recently published Minimum Redundancy Maximum Relevance (mRMR)+ABC algorithm for the NB classifier. In addition, to check how well ICA+ABC works as a feature selection method with the NB classifier, it was compared with combinations of ICA with popular filter techniques and with other similar bio-inspired algorithms such as Genetic Algorithm (GA) and Particle Swarm Optimization (PSO). The results show that ICA+ABC has a significant ability to generate small subsets of genes from the ICA feature vector that significantly improve the classification accuracy of the NB classifier compared to other previously suggested methods. Copyright © 2017 Elsevier Ltd. All rights reserved.
Interaction between dietary lipids and gut microbiota regulates hepatic cholesterol metabolism.
Caesar, Robert; Nygren, Heli; Orešič, Matej; Bäckhed, Fredrik
2016-03-01
The gut microbiota influences many aspects of host metabolism. We have previously shown that the presence of a gut microbiota remodels lipid composition. Here we investigated how interaction between gut microbiota and dietary lipids regulates lipid composition in the liver and plasma, and gene expression in the liver. Germ-free and conventionally raised mice were fed a lard or fish oil diet for 11 weeks. We performed lipidomics analysis of the liver and serum and microarray analysis of the liver. As expected, most of the variation in the lipidomics dataset was induced by the diet, and abundance of most lipid classes differed between mice fed lard and fish oil. However, the gut microbiota also affected lipid composition. The gut microbiota increased hepatic levels of cholesterol and cholesteryl esters in mice fed lard, but not in mice fed fish oil. Serum levels of cholesterol and cholesteryl esters were not affected by the gut microbiota. Genes encoding enzymes involved in cholesterol biosynthesis were downregulated by the gut microbiota in mice fed lard and were expressed at a low level in mice fed fish oil independent of microbial status. In summary, we show that gut microbiota-induced regulation of hepatic cholesterol metabolism is dependent on dietary lipid composition. Copyright © 2016 by the American Society for Biochemistry and Molecular Biology, Inc.
An, Ning; Yang, Xue; Cheng, Shujun; Wang, Guiqi; Zhang, Kaitai
2015-01-01
Carcinogenesis is an exceedingly complicated process, which involves multi-level dysregulations, including genomics (mainly caused by somatic mutation and copy number variation), DNA methylomics, and transcriptomics. Therefore, looking into only one molecular level of cancer is not sufficient to uncover the intricate underlying mechanisms. With the abundant resources of publicly available data in The Cancer Genome Atlas (TCGA) database, an integrative strategy was conducted to systematically analyze the aberrant patterns of colorectal cancer on the basis of DNA copy number, promoter methylation, somatic mutation and gene expression. In this study, paired samples in each genomic level were retrieved to identify differentially expressed genes with corresponding genetic or epigenetic dysregulations. Notably, the result of gene ontology enrichment analysis indicated that the differentially expressed genes with corresponding aberrant promoter methylation or somatic mutation were both functionally concentrated upon developmental process, suggesting the intimate association between development and carcinogenesis. Thus, by means of random walk with restart, 37 significant development-related genes were retrieved from an a priori knowledge-based biological network. In five independent microarray datasets, Kaplan–Meier survival and Cox regression analyses both confirmed that the expression of these genes was significantly associated with overall survival of Stage III/IV colorectal cancer patients. PMID:26691761
An, Ning; Yang, Xue; Cheng, Shujun; Wang, Guiqi; Zhang, Kaitai
2015-12-22
Carcinogenesis is an exceedingly complicated process, which involves multi-level dysregulations, including genomics (mainly caused by somatic mutation and copy number variation), DNA methylomics, and transcriptomics. Therefore, looking into only one molecular level of cancer is not sufficient to uncover the intricate underlying mechanisms. With the abundant resources of publicly available data in The Cancer Genome Atlas (TCGA) database, an integrative strategy was conducted to systematically analyze the aberrant patterns of colorectal cancer on the basis of DNA copy number, promoter methylation, somatic mutation and gene expression. In this study, paired samples in each genomic level were retrieved to identify differentially expressed genes with corresponding genetic or epigenetic dysregulations. Notably, the result of gene ontology enrichment analysis indicated that the differentially expressed genes with corresponding aberrant promoter methylation or somatic mutation were both functionally concentrated upon developmental process, suggesting the intimate association between development and carcinogenesis. Thus, by means of random walk with restart, 37 significant development-related genes were retrieved from an a priori knowledge-based biological network. In five independent microarray datasets, Kaplan-Meier survival and Cox regression analyses both confirmed that the expression of these genes was significantly associated with overall survival of Stage III/IV colorectal cancer patients.
Xiong, Dan-Dan; Feng, Zhen-Bo; Cen, Wei-Luan; Zeng, Jing-Jing; Liang, Lu; Tang, Rui-Xue; Gan, Xiao-Ning; Liang, Hai-Wei; Li, Zu-Yun; Chen, Gang; Luo, Dian-Zhong
2017-03-14
This comprehensive investigation was performed to evaluate the expression level and potential clinical value of NEAT1 in digestive system malignancies. A total of 57 lncRNA datasets of microarray or RNA-seq and 5 publications were included. The pooled standardized mean difference (SMD) indicated that NEAT1 was down-regulated in esophageal carcinoma (ESCA, SMD = -0.35, 95% CI: -0.5~-0.20, P < 0.0001) and hepatocellular carcinoma (HCC, SMD = -0.47, 95% CI: -0.60~-0.34, P < 0.0001), while in pancreatic cancer (PC), NEAT1 was up-regulated (SMD = 0.45, 95% CI: 0.2~0.71, P = 0.001). However, NEAT1 expression in gastric cancer (GC), colorectal cancer (CRC), biliary tract cancer (BTC) and gallbladder carcinoma (GBC) showed no significant difference between cancer and control groups. The pooled area under the curve values for ESCA, GC, CRC, PC and HCC were 0.60, 0.89, 0.81, 0.77 and 0.69, respectively. Furthermore, our results demonstrated that high expression of NEAT1 predicted an unfavorable prognosis in patients with digestive system malignancies (HR: 1.50, 95% CI: 1.28-1.76, P < 0.0001). Our study suggests that NEAT1 may play different roles in the initiation and progression of digestive system cancers and could be a potential diagnostic and prognostic biomarker in patients with digestive system carcinomas. Further and stricter studies with a larger number of cases are necessary to strengthen our conclusions.
Cen, Wei-Luan; Zeng, Jing-Jing; Liang, Lu; Tang, Rui-Xue; Gan, Xiao-Ning; Liang, Hai-Wei; Li, Zu-Yun; Chen, Gang; Luo, Dian-Zhong
2017-01-01
This comprehensive investigation was performed to evaluate the expression level and potential clinical value of NEAT1 in digestive system malignancies. A total of 57 lncRNA datasets of microarray or RNA-seq and 5 publications were included. The pooled standardized mean difference (SMD) indicated that NEAT1 was down-regulated in esophageal carcinoma (ESCA, SMD = −0.35, 95% CI: −0.5~−0.20, P < 0.0001) and hepatocellular carcinoma (HCC, SMD = −0.47, 95% CI: −0.60~−0.34, P < 0.0001), while in pancreatic cancer (PC), NEAT1 was up-regulated (SMD = 0.45, 95% CI: 0.2~0.71, P = 0.001). However, NEAT1 expression in gastric cancer (GC), colorectal cancer (CRC), biliary tract cancer (BTC) and gallbladder carcinoma (GBC) showed no significant difference between cancer and control groups. The pooled area under the curve values for ESCA, GC, CRC, PC and HCC were 0.60, 0.89, 0.81, 0.77 and 0.69, respectively. Furthermore, our results demonstrated that high expression of NEAT1 predicted an unfavorable prognosis in patients with digestive system malignancies (HR: 1.50, 95% CI: 1.28-1.76, P < 0.0001). Our study suggests that NEAT1 may play different roles in the initiation and progression of digestive system cancers and could be a potential diagnostic and prognostic biomarker in patients with digestive system carcinomas. Further and stricter studies with a larger number of cases are necessary to strengthen our conclusions. PMID:28118609
Gene expression profiling in the adult Down syndrome brain.
Lockstone, H E; Harris, L W; Swatton, J E; Wayland, M T; Holland, A J; Bahn, S
2007-12-01
The mechanisms by which trisomy 21 leads to the characteristic Down syndrome (DS) phenotype are unclear. We used whole genome microarrays to characterize for the first time the transcriptome of human adult brain tissue (dorsolateral prefrontal cortex) from seven DS subjects and eight controls. These data were coanalyzed with a publicly available dataset from fetal DS tissue and functional profiling was performed to identify the biological processes central to DS and those that may be related to late onset pathologies, particularly Alzheimer disease neuropathology. A total of 685 probe sets were differentially expressed between adult DS and control brains at a stringent significance threshold (adjusted p value (q) < 0.005), 70% of these being up-regulated in DS. Over 25% of genes on chromosome 21 were differentially expressed in comparison to a median of 4.4% for all chromosomes. The unique profile of up-regulation on chromosome 21, consistent with primary dosage effects, was accompanied by widespread transcriptional disruption. The critical Alzheimer disease gene, APP, located on chromosome 21, was not found to be up-regulated in adult brain by microarray or QPCR analysis. However, numerous other genes functionally linked to APP processing were dysregulated. Functional profiling of genes dysregulated in both fetal and adult datasets identified categories including development (notably Notch signaling and Dlx family genes), lipid transport, and cellular proliferation. In the adult brain these processes were concomitant with cytoskeletal regulation and vesicle trafficking categories, and increased immune response and oxidative stress response, which are likely linked to the development of Alzheimer pathology in individuals with DS.
Drost, Derek R; Novaes, Evandro; Boaventura-Novaes, Carolina; Benedict, Catherine I; Brown, Ryan S; Yin, Tongming; Tuskan, Gerald A; Kirst, Matias
2009-06-01
Microarrays have demonstrated significant power for genome-wide analyses of gene expression, and recently have also revolutionized the genetic analysis of segregating populations by genotyping thousands of loci in a single assay. Although microarray-based genotyping approaches have been successfully applied in yeast and several inbred plant species, their power has not been proven in an outcrossing species with extensive genetic diversity. Here we have developed methods for high-throughput microarray-based genotyping in such species using a pseudo-backcross progeny of 154 individuals of Populus trichocarpa and P. deltoides analyzed with long-oligonucleotide in situ-synthesized microarray probes. Our analysis resulted in high-confidence genotypes for 719 single-feature polymorphism (SFP) and 1014 gene expression marker (GEM) candidates. Using these genotypes and an established microsatellite (SSR) framework map, we produced a high-density genetic map comprising over 600 SFPs, GEMs and SSRs. The abundance of gene-based markers allowed us to localize over 35 million base pairs of previously unplaced whole-genome shotgun (WGS) scaffold sequence to putative locations in the genome of P. trichocarpa. A high proportion of sampled scaffolds could be verified for their placement with independently mapped SSRs, demonstrating the previously un-utilized power that high-density genotyping can provide in the context of map-based WGS sequence reassembly. Our results provide a substantial contribution to the continued improvement of the Populus genome assembly, while demonstrating the feasibility of microarray-based genotyping in a highly heterozygous population. The strategies presented are applicable to genetic mapping efforts in all plant species with similarly high levels of genetic diversity.
Shaw, Joseph R; Colbourne, John K; Davey, Jennifer C; Glaholt, Stephen P; Hampton, Thomas H; Chen, Celia Y; Folt, Carol L; Hamilton, Joshua W
2007-12-21
Genomic research tools such as microarrays are proving to be important resources to study the complex regulation of genes that respond to environmental perturbations. A first generation cDNA microarray was developed for the environmental indicator species Daphnia pulex, to identify genes whose regulation is modulated following exposure to the metal stressor cadmium. Our experiments revealed interesting changes in gene transcription that suggest their biological roles and their potentially toxicological features in responding to this important environmental contaminant. Our microarray identified genes reported in the literature to be regulated in response to cadmium exposure, suggested functional attributes for genes that share no sequence similarity to proteins in the public databases, and pointed to genes that are likely members of expanded gene families in the Daphnia genome. Genes identified on the microarray also were associated with cadmium induced phenotypes and population-level outcomes that we experimentally determined. A subset of genes regulated in response to cadmium exposure was independently validated using quantitative-realtime (Q-RT)-PCR. These microarray studies led to the discovery of three genes coding for the metal detoxication protein metallothionein (MT). The gene structures and predicted translated sequences of D. pulex MTs clearly place them in this gene family. Yet, they share little homology with previously characterized MTs. The genomic information obtained from this study represents an important first step in characterizing microarray patterns that may be diagnostic to specific environmental contaminants and give insights into their toxicological mechanisms, while also providing a practical tool for evolutionary, ecological, and toxicological functional gene discovery studies. Advances in Daphnia genomics will enable the further development of this species as a model organism for the environmental sciences.
PMID:18154678
Strauss, Christian; Endimiani, Andrea; Perreten, Vincent
2015-01-01
A rapid and simple DNA labeling system has been developed for disposable microarrays and has been validated for the detection of 117 antibiotic resistance genes abundant in Gram-positive bacteria. The DNA was fragmented and amplified using phi-29 polymerase and random primers with linkers. Labeling and further amplification were then performed by classic PCR amplification using biotinylated primers specific for the linkers. The microarray developed by Perreten et al. (Perreten, V., Vorlet-Fawer, L., Slickers, P., Ehricht, R., Kuhnert, P., Frey, J., 2005. Microarray-based detection of 90 antibiotic resistance genes of gram-positive bacteria. J.Clin.Microbiol. 43, 2291-2302.) was improved by additional oligonucleotides. A total of 244 oligonucleotides (26 to 37 nucleotide length and with similar melting temperatures) were spotted on the microarray, including genes conferring resistance to clinically important antibiotic classes like β-lactams, macrolides, aminoglycosides, glycopeptides and tetracyclines. Each antibiotic resistance gene is represented by at least 2 oligonucleotides designed from consensus sequences of gene families. The specificity of the oligonucleotides and the quality of the amplification and labeling were verified by analysis of a collection of 65 strains belonging to 24 species. Association between genotype and phenotype was verified for 6 antibiotics using 77 Staphylococcus strains belonging to different species and revealed 95% test specificity and a 93% predictive value of a positive test. The DNA labeling and amplification is independent of the species and of the target genes and could be used for different types of microarrays. This system has also the advantage to detect several genes within one bacterium at once, like in Staphylococcus aureus strain BM3318, in which up to 15 genes were detected. 
This new microarray-based detection system offers large potential for applications in clinical diagnostics, basic research, food safety and surveillance programs for antimicrobial resistance.
Optimization of single-base-pair mismatch discrimination in oligonucleotide microarrays
NASA Technical Reports Server (NTRS)
Urakawa, Hidetoshi; El Fantroussi, Said; Smidt, Hauke; Smoot, James C.; Tribou, Erik H.; Kelly, John J.; Noble, Peter A.; Stahl, David A.
2003-01-01
The discrimination between perfect-match and single-base-pair-mismatched nucleic acid duplexes was investigated by using oligonucleotide DNA microarrays and nonequilibrium dissociation rates (melting profiles). DNA and RNA versions of two synthetic targets corresponding to the 16S rRNA sequences of Staphylococcus epidermidis (38 nucleotides) and Nitrosomonas eutropha (39 nucleotides) were hybridized to perfect-match probes (18-mer and 19-mer) and to a set of probes having all possible single-base-pair mismatches. The melting profiles of all probe-target duplexes were determined in parallel by using an imposed temperature step gradient. We derived an optimum wash temperature for each probe and target by using a simple formula to calculate a discrimination index for each temperature of the step gradient. This optimum corresponded to the output of an independent analysis using a customized neural network program. These results together provide an experimental and analytical framework for optimizing mismatch discrimination among all probes on a DNA microarray.
Zhao, Shanrong; Prenger, Kurt; Smith, Lance
2013-01-01
RNA-Seq is becoming a promising replacement to microarrays in transcriptome profiling and differential gene expression study. Technical improvements have decreased sequencing costs and, as a result, the size and number of RNA-Seq datasets have increased rapidly. However, the increasing volume of data from large-scale RNA-Seq studies poses a practical challenge for data analysis in a local environment. To meet this challenge, we developed Stormbow, a cloud-based software package, to process large volumes of RNA-Seq data in parallel. The performance of Stormbow has been tested by practically applying it to analyse 178 RNA-Seq samples in the cloud. In our test, it took 6 to 8 hours to process an RNA-Seq sample with 100 million reads, and the average cost was $3.50 per sample. Utilizing Amazon Web Services as the infrastructure for Stormbow allows us to easily scale up to handle large datasets with on-demand computational resources. Stormbow is a scalable, cost effective, and open-source based tool for large-scale RNA-Seq data analysis. Stormbow can be freely downloaded and can be used out of box to process Illumina RNA-Seq datasets. PMID:25937948
State Space Model with hidden variables for reconstruction of gene regulatory networks.
Wu, Xi; Li, Peng; Wang, Nan; Gong, Ping; Perkins, Edward J; Deng, Youping; Zhang, Chaoyang
2011-01-01
State Space Model (SSM) is a relatively new approach to inferring gene regulatory networks. It requires less computational time than Dynamic Bayesian Networks (DBN). There are two types of variables in the linear SSM: observed variables and hidden variables. SSM uses an iterative method, namely Expectation-Maximization, to infer regulatory relationships from microarray datasets. The hidden variables cannot be directly observed from experiments. The choice of the number of hidden variables has a significant impact on the accuracy of network inference. In this study, we used SSM to infer gene regulatory networks (GRNs) from synthetic time series datasets, investigated Bayesian Information Criterion (BIC) and Principal Component Analysis (PCA) approaches to determining the number of hidden variables in SSM, and evaluated the performance of SSM in comparison with DBN. True GRNs and synthetic gene expression datasets were generated using GeneNetWeaver. Both DBN and linear SSM were used to infer GRNs from the synthetic datasets. The inferred networks were compared with the true networks. Our results show that inference precision varied with the number of hidden variables. For some regulatory networks, the inference precision of DBN was higher, but SSM performed better in other cases. Although the overall performance of the two approaches is comparable, SSM is much faster and capable of inferring much larger networks than DBN. This study provides useful information for handling the hidden variables and improving the inference precision.
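As a rough illustration of BIC-based selection of the number of hidden variables, the sketch below scores a few candidate model sizes and keeps the one with the lowest BIC. The log-likelihoods and parameter counts are invented placeholders, and the function names are hypothetical; the abstract does not describe how the authors computed these quantities.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion; lower values indicate a better trade-off."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


def pick_hidden_dim(candidates, n_obs):
    """candidates maps hidden-variable count -> (log-likelihood, parameter count)."""
    return min(candidates, key=lambda k: bic(candidates[k][0], candidates[k][1], n_obs))


# Hypothetical fits for 1, 2 and 3 hidden variables (values are made up).
fits = {1: (-520.0, 12), 2: (-480.0, 26), 3: (-475.0, 42)}
best = pick_hidden_dim(fits, n_obs=100)
print(best)
```

In this toy case the third hidden variable improves the likelihood only slightly, so the parameter penalty outweighs the gain and the two-variable model is preferred.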
Identifying key genes in glaucoma based on a benchmarked dataset and the gene regulatory network.
Chen, Xi; Wang, Qiao-Ling; Zhang, Meng-Hui
2017-10-01
The current study aimed to identify key genes in glaucoma based on a benchmarked dataset and gene regulatory network (GRN). Local and global noise was added to the gene expression dataset to produce a benchmarked dataset. Differentially-expressed genes (DEGs) between patients with glaucoma and normal controls were identified utilizing the Linear Models for Microarray Data (Limma) package based on the benchmarked dataset. A total of 5 GRN inference methods, including Zscore, GeneNet, context likelihood of relatedness (CLR) algorithm, Partial Correlation coefficient with Information Theory (PCIT) and GEne Network Inference with Ensemble of Trees (Genie3) were evaluated using receiver operating characteristic (ROC) and precision and recall (PR) curves. The inference method with the best performance was selected to construct the GRN. Subsequently, topological centrality analysis (degree, closeness and betweenness) was conducted to identify key genes in the GRN of glaucoma. Finally, the key genes were validated by performing reverse transcription-quantitative polymerase chain reaction (RT-qPCR). A total of 176 DEGs were detected from the benchmarked dataset. The ROC and PR curves of the 5 methods were analyzed and it was determined that Genie3 had a clear advantage over the other methods; thus, Genie3 was used to construct the GRN. Following topological centrality analysis, 14 key genes for glaucoma were identified, including IL6, EPHA2 and GSTT1, and 5 of these 14 key genes were validated by RT-qPCR. Therefore, the current study identified 14 key genes in glaucoma, which may be potential biomarkers to use in the diagnosis of glaucoma and aid in identifying the molecular mechanism of this disease.
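Of the three centrality measures named above, degree centrality is the simplest to sketch. The example below computes normalized degree on a small undirected network; the node names and edges are placeholders invented for illustration and do not come from the study, and the function name is hypothetical.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Normalized degree for an undirected graph: neighbours / (n - 1)."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        # A graph with fewer than two nodes has no meaningful normalization.
        return {node: 0.0 for node in neighbours}
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}


# Placeholder network (assumed for illustration only).
edges = [("G1", "G2"), ("G1", "G3"), ("G1", "G4"), ("G2", "G3")]
centrality = degree_centrality(edges)
hub = max(centrality, key=centrality.get)
print(hub, centrality[hub])
```

Closeness and betweenness need shortest-path computations and are usually taken from a graph library rather than written by hand.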
Integrative Exploratory Analysis of Two or More Genomic Datasets.
Meng, Chen; Culhane, Aedin
2016-01-01
Exploratory analysis is an essential step in the analysis of high throughput data. Multivariate approaches such as correspondence analysis (CA), principal component analysis, and multidimensional scaling are widely used in the exploratory analysis of a single dataset. Modern biological studies often assay multiple types of biological molecules (e.g., mRNA, protein, phosphoproteins) on the same set of biological samples, thereby creating multiple different types of omics data or multiassay data. Integrative exploratory analysis of these multiple omics data is required to leverage the potential of multiple omics studies. In this chapter, we describe the application of co-inertia analysis (CIA; for analyzing two datasets) and multiple co-inertia analysis (MCIA; for three or more datasets) to address this problem. These methods are powerful yet simple multivariate approaches that represent samples using a lower number of variables, allowing easier identification of the correlated structure in and between multiple high dimensional datasets. Graphical representations can be employed for this purpose. In addition, the methods simultaneously project samples and variables (genes, proteins) onto the same lower dimensional space, so the most variant variables from each dataset can be selected and associated with samples, which can be further used to facilitate biological interpretation and pathway analysis. We applied CIA to explore the concordance between mRNA and protein expression in a panel of 60 tumor cell lines from the National Cancer Institute. In the same 60 cell lines, we used MCIA to perform a cross-platform comparison of mRNA gene expression profiles obtained on four different microarray platforms. Last, as an example of integrative analysis of multiassay or multi-omics data, we analyzed transcriptomic, proteomic, and phosphoproteomic data from pluripotent (iPS) and embryonic stem (ES) cell lines.
Fish connectivity mapping intermediate data files and outputs
RLWrankedLists.tar.gz: These lists, linked to various chemical treatment conditions, serve as the target collection of Cmap. Probes of the entire microarray are sorted based on their log fold changes over control conditions. RLWsignatures2015.tar.gz: These signatures, linked to various chemical treatment conditions, serve as queries in Cmap. This dataset is associated with the following publication: Wang, R., A. Biales, N. Garcia-Reyero, E. Perkins, D. Villeneuve, G. Ankley, and D. Bencic. Fish Connectivity Mapping: Linking Chemical Stressors by Their MOA-Driven Transcriptomic Profiles. BMC Genomics. BioMed Central Ltd, London, UK, 17(84): 1-20, (2016).
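The ranked lists described above order probes by log fold change over control. A minimal sketch of that ordering follows; the probe names, intensities, base-2 logarithm and descending sort direction are all assumptions made for illustration, not details taken from the dataset.

```python
import math


def rank_probes_by_log_fold_change(treated, control):
    """Return probe IDs sorted from most up-regulated to most down-regulated."""
    lfc = {probe: math.log2(treated[probe] / control[probe]) for probe in treated}
    return sorted(lfc, key=lfc.get, reverse=True)


# Hypothetical intensities for three probes (values are made up).
treated = {"probe_a": 400.0, "probe_b": 100.0, "probe_c": 50.0}
control = {"probe_a": 100.0, "probe_b": 100.0, "probe_c": 200.0}
ranked = rank_probes_by_log_fold_change(treated, control)
print(ranked)
```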
Mining featured biomarkers associated with prostatic carcinoma based on bioinformatics.
Piao, Guanying; Wu, Jiarui
2013-11-01
To analyze the differentially expressed genes and identify featured biomarkers from prostatic carcinoma. The software "Significance Analysis of Microarray" (SAM) was used to identify the differentially coexpressed genes (DCGs). The DCGs present in two datasets were analyzed by GO (Gene Ontology) functional annotation. A total of 389 DCGs were obtained. By GO analysis, we found these DCGs were closely related to acinus development, TGF-β receptor and signal transduction pathways. Furthermore, five featured biomarkers were discovered by interaction analysis. These important signal pathways and oncogenes may provide potential therapeutic targets for prostatic carcinoma.
Nie, Zhi; Vairavan, Srinivasan; Narayan, Vaibhav A; Ye, Jieping; Li, Qingqin S
2018-01-01
Identification of risk factors of treatment resistance may be useful to guide treatment selection, avoid inefficient trial-and-error, and improve major depressive disorder (MDD) care. We extended the work in predictive modeling of treatment resistant depression (TRD) via partition of the data from the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) cohort into a training and a testing dataset. We also included data from a small yet completely independent cohort RIS-INT-93 as an external test dataset. We used features from enrollment and level 1 treatment (up to week 2 response only) of STAR*D to explore the feature space comprehensively and applied machine learning methods to model TRD outcome at level 2. For TRD defined using QIDS-C16 remission criteria, multiple machine learning models were internally cross-validated in the STAR*D training dataset and externally validated in both the STAR*D testing dataset and RIS-INT-93 independent dataset with an area under the receiver operating characteristic curve (AUC) of 0.70-0.78 and 0.72-0.77, respectively. The upper bound for the AUC achievable with the full set of features could be as high as 0.78 in the STAR*D testing dataset. Model developed using top 30 features identified using feature selection technique (k-means clustering followed by χ2 test) achieved an AUC of 0.77 in the STAR*D testing dataset. In addition, the model developed using overlapping features between STAR*D and RIS-INT-93, achieved an AUC of > 0.70 in both the STAR*D testing and RIS-INT-93 datasets. Among all the features explored in STAR*D and RIS-INT-93 datasets, the most important feature was early or initial treatment response or symptom severity at week 2. These results indicate that prediction of TRD prior to undergoing a second round of antidepressant treatment could be feasible even in the absence of biomarker data.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Achyuthan, Komandoor E.; Wheeler, David R.
2015-08-27
Evaluating the stability of coupling reagents, quality control (QC), and surface functionalization metrology are all critical to the production of high quality peptide microarrays. We describe a broadly applicable screening technique for evaluating the fidelity of solid phase peptide synthesis (SPPS), the stability of activation/coupling reagents, and a microarray surface metrology tool. This technique was used to assess the stability of the activation reagent 1-{[1-(Cyano-2-ethoxy-2-oxo-ethylidenaminooxy)dimethylamino-morpholinomethylene]}methaneaminium hexafluorophosphate (COMU) (Sigma-Aldrich, St. Louis, MO, USA) by SPPS of Leu-Enkephalin (YGGFL) or the coupling of commercially synthesized YGGFL peptides to (3-aminopropyl)triethoxysilane-modified glass surfaces. Coupling efficiency was quantitated by fluorescence signaling based on immunoreactivity of the YGGFL motif. It was concluded that COMU solutions should be prepared fresh and used within 5 h when stored at ~23 °C and not beyond 24 h if stored refrigerated, both in closed containers. Caveats to gauging COMU stability by absorption spectroscopy are discussed. Commercial YGGFL peptides needed independent QC, due to immunoreactivity variations for the same sequence synthesized by different vendors. This technique is useful in evaluating the stability of other activation/coupling reagents besides COMU and as a metrology tool for SPPS and peptide microarrays.
Ueda, Erica; Feng, Wenqian; Levkin, Pavel A
2016-10-01
High-density microarrays can screen thousands of genetic and chemical probes at once in a miniaturized and parallelized manner, and thus are a cost-effective alternative to microwell plates. Here, high-density cell microarrays are fabricated by creating superhydrophilic-superhydrophobic micropatterns in thin, nanoporous polymer substrates such that the superhydrophobic barriers confine both aqueous solutions and adherent cells within each superhydrophilic microspot. The superhydrophobic barriers confine and prevent the mixing of larger droplet volumes, and also control the spreading of droplets independent of the volume, minimizing the variability that arises due to different liquid and surface properties. Using a novel liposomal transfection reagent, ScreenFect A, the method of reverse cell transfection is optimized on the patterned substrates and several factors that affect transfection efficiency and cytotoxicity are identified. Higher levels of transfection are achieved on HOOC- versus NH2-functionalized superhydrophilic spots, as well as when gelatin and fibronectin are added to the transfection mixture, while minimizing the amount of transfection reagent improves cell viability. Almost no diffusion of the printed transfection mixtures to the neighboring microspots is detected. Thus, superhydrophilic-superhydrophobic patterned surfaces can be used as cell microarrays and for optimizing reverse cell transfection conditions before performing further cell screenings.
A distributed system for fast alignment of next-generation sequencing data.
Srimani, Jaydeep K; Wu, Po-Yen; Phan, John H; Wang, May D
2010-12-01
We developed a scalable distributed computing system using the Berkeley Open Interface for Network Computing (BOINC) to align next-generation sequencing (NGS) data quickly and accurately. NGS technology is emerging as a promising platform for gene expression analysis due to its high sensitivity compared to traditional genomic microarray technology. However, despite the benefits, NGS datasets can be prohibitively large, requiring significant computing resources to obtain sequence alignment results. Moreover, as the data and alignment algorithms become more prevalent, it will become necessary to examine the effect of the multitude of alignment parameters on various NGS systems. We validate the distributed software system by (1) computing simple timing results to show the speed-up gained by using multiple computers, (2) optimizing alignment parameters using simulated NGS data, and (3) computing NGS expression levels for a single biological sample using optimal parameters and comparing these expression levels to that of a microarray sample. Results indicate that the distributed alignment system achieves approximately a linear speed-up and correctly distributes sequence data to and gathers alignment results from multiple compute clients.
New insights about host response to smallpox using microarray data
Esteves, Gustavo H; Simoes, Ana CQ; Souza, Estevao; Dias, Rodrigo A; Ospina, Raydonal; Venancio, Thiago M
2007-01-01
Background Smallpox is a lethal disease that was endemic in many parts of the world until eradicated by massive immunization. Due to its lethality, there are serious concerns about its use as a bioweapon. Here we analyze publicly available microarray data to further understand survival of smallpox infected macaques, using systems biology approaches. Our goal is to improve the knowledge about the progression of this disease. Results We used KEGG pathways annotations to define groups of genes (or modules), and subsequently compared them to macaque survival times. This technique provided additional insights about the host response to this disease, such as increased expression of the cytokines and ECM receptors in the individuals with higher survival times. These results could indicate that these gene groups could influence an effective response from the host to smallpox. Conclusion Macaques with higher survival times clearly express some specific pathways previously unidentified using regular gene-by-gene approaches. Our work also shows how third party analysis of public datasets can be important to support new hypotheses to relevant biological problems. PMID:17718913
GESearch: An Interactive GUI Tool for Identifying Gene Expression Signature.
Ye, Ning; Yin, Hengfu; Liu, Jingjing; Dai, Xiaogang; Yin, Tongming
2015-01-01
The huge amount of gene expression data generated by microarray and next-generation sequencing technologies presents challenges for exploiting their biological meaning. When searching for coexpressed genes, the data mining process is largely affected by the selection of algorithms. Thus, it is highly desirable to provide multiple algorithm options in a user-friendly analytical toolkit to explore gene expression signatures. For this purpose, we developed GESearch, an interactive graphical user interface (GUI) toolkit, which is written in MATLAB and supports a variety of gene expression data files. This analytical toolkit provides four models, including the mean, the regression, the delegate, and the ensemble models, to identify coexpressed genes, and enables the users to filter data and to select gene expression patterns by browsing the display window or by importing knowledge-based genes. Subsequently, the utility of this analytical toolkit is demonstrated by analyzing two sets of real-life microarray datasets from cell-cycle experiments. Overall, we have developed an interactive GUI toolkit that allows for choosing multiple algorithms for analyzing gene expression signatures.
A Universal Genome Array and Transcriptome Atlas for Brachypodium Distachyon
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mockler, Todd
Brachypodium distachyon is the premier experimental model grass platform and is related to candidate feedstock crops for bioethanol production. Based on the DOE-JGI Brachypodium Bd21 genome sequence and annotation we designed a whole genome DNA microarray platform. The quality of this array platform is unprecedented due to the exceptional quality of the Brachypodium genome assembly and annotation and the stringent probe selection criteria employed in the design. We worked with members of the international community and the bioinformatics/design team at Affymetrix at all stages in the development of the array. We used the Brachypodium arrays to interrogate the transcriptomes of plants grown in a variety of environmental conditions including diurnal and circadian light/temperature conditions and under a variety of environmental conditions. We examined the transcriptional responses of Brachypodium seedlings subjected to various abiotic stresses including heat, cold, salt, and high intensity light. We generated a gene expression atlas representing various organs and developmental stages. The results of these efforts including all microarray datasets are published and available at online public databases.
Schüler, Susann; Wenz, Ingrid; Wiederanders, B; Slickers, P; Ehricht, R
2006-06-12
Recent developments in DNA microarray technology led to a variety of open and closed devices and systems including high and low density microarrays for high-throughput screening applications as well as microarrays of lower density for specific diagnostic purposes. Beside predefined microarrays for specific applications manufacturers offer the production of custom-designed microarrays adapted to customers' wishes. Array based assays demand complex procedures including several steps for sample preparation (RNA extraction, amplification and sample labelling), hybridization and detection, thus leading to a high variability between several approaches and resulting in the necessity of extensive standardization and normalization procedures. In the present work a custom designed human proteinase DNA microarray of lower density in ArrayTube format was established. This highly economic open platform only requires standard laboratory equipment and allows the study of the molecular regulation of cell behaviour by proteinases. We established a procedure for sample preparation and hybridization and verified the array based gene expression profile by quantitative real-time PCR (QRT-PCR). Moreover, we compared the results with the well established Affymetrix microarray. By application of standard labelling procedures with e.g. Klenow fragment exo-, single primer amplification (SPA) or In Vitro Transcription (IVT) we noticed a loss of signal conservation for some genes. To overcome this problem we developed a protocol in accordance with the SPA protocol, in which we included target specific primers designed individually for each spotted oligomer. Here we present a complete array based assay in which only the specific transcripts of interest are amplified in parallel and in a linear manner. The array represents a proof of principle which can be adapted to other species as well. 
As the designed protocol for amplifying mRNA starts from as little as 100 ng total RNA, it presents an alternative method for detecting even low expressed genes by microarray experiments in a highly reproducible and sensitive manner. Preservation of signal integrity is demonstrated by QRT-PCR measurements. The small amounts of total RNA necessary for the analyses make this method applicable for investigations with limited material as in clinical samples from, for example, organ or tumour biopsies. Those are arguments in favour of the high potential of our assay compared to established procedures for amplification within the field of diagnostic expression profiling. Nevertheless, the screening character of microarray data must be mentioned, and independent methods should verify the results.
Brase, Jan C.; Kronenwett, Ralf; Petry, Christoph; Denkert, Carsten; Schmidt, Marcus
2013-01-01
Several multigene tests have been developed for breast cancer patients to predict the individual risk of recurrence. Most of the first generation tests rely on proliferation-associated genes and are commonly carried out in central reference laboratories. Here, we describe the development of a second generation multigene assay, the EndoPredict test, a prognostic multigene expression test for estrogen receptor (ER) positive, human epidermal growth factor receptor (HER2) negative (ER+/HER2−) breast cancer patients. The EndoPredict gene signature was initially established in a large high-throughput microarray-based screening study. The key steps for biomarker identification are discussed in detail, in comparison to the establishment of other multigene signatures. After biomarker selection, genes and algorithms were transferred to a diagnostic platform (reverse transcription quantitative PCR (RT-qPCR)) to allow for assaying formalin-fixed, paraffin-embedded (FFPE) samples. A comprehensive analytical validation was performed and a prospective proficiency testing study with seven pathological laboratories finally proved that EndoPredict can be reliably used in the decentralized setting. Three independent large clinical validation studies (n = 2,257) demonstrated that EndoPredict offers independent prognostic information beyond current clinicopathological parameters and clinical guidelines. The review article summarizes several important steps that should be considered for the development process of a second generation multigene test and offers a means for transferring a microarray signature from the research laboratory to clinical practice. PMID:27605191
Galfalvy, Hanga C; Erraji-Benchekroun, Loubna; Smyrniotopoulos, Peggy; Pavlidis, Paul; Ellis, Steven P; Mann, J John; Sibille, Etienne; Arango, Victoria
2003-01-01
Background: Genomic studies of complex tissues pose unique analytical challenges for assessment of data quality, performance of statistical methods used for data extraction, and detection of differentially expressed genes. Ideally, to assess the accuracy of gene expression analysis methods, one needs a set of genes which are known to be differentially expressed in the samples and which can be used as a "gold standard". We introduce the idea of using sex-chromosome genes as an alternative to spiked-in control genes or simulations for assessment of microarray data and analysis methods. Results: Expression of sex-chromosome genes was used as a true internal biological control to compare alternate probe-level data extraction algorithms (Microarray Suite 5.0 [MAS5.0], Model Based Expression Index [MBEI] and Robust Multi-array Average [RMA]), to assess microarray data quality and to establish some statistical guidelines for analyzing large-scale gene expression. These approaches were implemented on a large new dataset of human brain samples. RMA-generated gene expression values were markedly less variable and more reliable than MAS5.0 and MBEI-derived values. A statistical technique controlling the false discovery rate was applied to adjust for multiple testing, as an alternative to the Bonferroni method, and showed no evidence of false negative results. Fourteen probesets, representing nine Y- and two X-chromosome linked genes, displayed significant sex differences in brain prefrontal cortex gene expression. Conclusion: In this study, we have demonstrated the use of sex genes as true biological internal controls for genomic analysis of complex tissues, and suggested analytical guidelines for testing alternate oligonucleotide microarray data extraction protocols and for adjusting multiple statistical analysis of differentially expressed genes.
Our results also provided evidence for sex differences in gene expression in the brain prefrontal cortex, supporting the notion of a putative direct role of sex-chromosome genes in differentiation and maintenance of sexual dimorphism of the central nervous system. Importantly, these analytical approaches are applicable to all microarray studies that include male and female human or animal subjects. PMID:12962547
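The multiple-testing comparison described in this abstract can be illustrated with a minimal sketch. The abstract does not name the false-discovery-rate procedure that was used, so the Benjamini-Hochberg step-up procedure is assumed here. The function names and the p-values in the usage note are hypothetical and are not taken from the study.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Assumption: Benjamini-Hochberg step-up procedure (the abstract only
    says "a statistical technique controlling the false discovery rate").
    """
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


def bonferroni(pvalues, alpha=0.05):
    """Reject only the p-values at or below alpha divided by the test count."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]
```

With hypothetical p-values such as `[0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]`, calling both functions shows how the FDR-based rule can retain probesets that the stricter Bonferroni threshold discards.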
Galfalvy, Hanga C; Erraji-Benchekroun, Loubna; Smyrniotopoulos, Peggy; Pavlidis, Paul; Ellis, Steven P; Mann, J John; Sibille, Etienne; Arango, Victoria
2003-09-08
Genomic studies of complex tissues pose unique analytical challenges for assessment of data quality, performance of statistical methods used for data extraction, and detection of differentially expressed genes. Ideally, to assess the accuracy of gene expression analysis methods, one needs a set of genes which are known to be differentially expressed in the samples and which can be used as a "gold standard". We introduce the idea of using sex-chromosome genes as an alternative to spiked-in control genes or simulations for assessment of microarray data and analysis methods. Expression of sex-chromosome genes was used as a true internal biological control to compare alternate probe-level data extraction algorithms (Microarray Suite 5.0 [MAS5.0], Model Based Expression Index [MBEI] and Robust Multi-array Average [RMA]), to assess microarray data quality and to establish some statistical guidelines for analyzing large-scale gene expression. These approaches were implemented on a large new dataset of human brain samples. RMA-generated gene expression values were markedly less variable and more reliable than MAS5.0 and MBEI-derived values. A statistical technique controlling the false discovery rate was applied to adjust for multiple testing, as an alternative to the Bonferroni method, and showed no evidence of false negative results. Fourteen probesets, representing nine Y- and two X-chromosome linked genes, displayed significant sex differences in brain prefrontal cortex gene expression. In this study, we have demonstrated the use of sex genes as true biological internal controls for genomic analysis of complex tissues, and suggested analytical guidelines for testing alternate oligonucleotide microarray data extraction protocols and for adjusting multiple statistical analysis of differentially expressed genes.
Our results also provided evidence for sex differences in gene expression in the brain prefrontal cortex, supporting the notion of a putative direct role of sex-chromosome genes in differentiation and maintenance of sexual dimorphism of the central nervous system. Importantly, these analytical approaches are applicable to all microarray studies that include male and female human or animal subjects.
Evaluation and inter-comparison of modern day reanalysis datasets over Africa and the Middle East
NASA Astrophysics Data System (ADS)
Shukla, S.; Arsenault, K. R.; Hobbins, M.; Peters-Lidard, C. D.; Verdin, J. P.
2015-12-01
Reanalysis datasets are potentially very valuable for otherwise data-sparse regions such as Africa and the Middle East. They are potentially useful for long-term climate and hydrologic analyses and, given their availability in real-time, they are particularly attractive for real-time hydrologic monitoring purposes (e.g. to monitor flood and drought events). Generally in data-sparse regions, reanalysis variables such as precipitation, temperature, radiation and humidity are used in conjunction with in-situ and/or satellite-based datasets to generate long-term gridded atmospheric forcing datasets. These atmospheric forcing datasets are used to drive offline land surface models and simulate soil moisture and runoff, which are natural indicators of hydrologic conditions. Therefore, any uncertainty or bias in the reanalysis datasets contributes to uncertainties in hydrologic monitoring estimates. In this presentation, we report on a comprehensive analysis that evaluates several modern-day reanalysis products (such as NASA's MERRA-1 and -2, ECMWF's ERA-Interim and NCEP's CFS Reanalysis) over Africa and the Middle East region. We compare the precipitation and temperature from the reanalysis products with other independent gridded datasets such as GPCC, CRU, and USGS/UCSB's CHIRPS precipitation datasets, and CRU's temperature datasets. The evaluations are conducted at a monthly time scale, since some of these independent datasets are only available at this temporal resolution. The evaluations range from the comparison of the monthly mean climatology to inter-annual variability and long-term changes. Finally, we also present the results of inter-comparisons of radiation and humidity variables from the different reanalysis datasets.
Generation of openEHR Test Datasets for Benchmarking.
El Helou, Samar; Karvonen, Tuukka; Yamamoto, Goshiro; Kume, Naoto; Kobayashi, Shinji; Kondo, Eiji; Hiragi, Shusuke; Okamoto, Kazuya; Tamura, Hiroshi; Kuroda, Tomohiro
2017-01-01
openEHR is a widely used EHR specification. Given its technology-independent nature, different approaches for implementing openEHR data repositories exist. Public openEHR datasets are needed to conduct benchmark analyses over different implementations. To address their current unavailability, we propose a method for generating openEHR test datasets that can be publicly shared and used.
Bauman, Tyler M; Ewald, Jonathan A; Huang, Wei; Ricke, William A
2015-07-25
CD147 is an MMP-inducing protein often implicated in cancer progression. The purpose of this study was to investigate the expression of CD147 in prostate cancer (PCa) progression and the prognostic ability of CD147 in predicting biochemical recurrence after prostatectomy. Plasma membrane-localized CD147 protein expression was quantified in patient samples using immunohistochemistry and multispectral imaging, and expression was compared to clinico-pathological features (pathologic stage, Gleason score, tumor volume, preoperative PSA, lymph node status, surgical margins, biochemical recurrence status). CD147 specificity and expression were confirmed with immunoblotting of prostate cell lines, and CD147 mRNA expression was evaluated in public expression microarray datasets of patient prostate tumors. Expression of CD147 protein was significantly decreased in localized tumors (pT2; p = 0.02) and aggressive PCa (≥pT3; p = 0.004), and metastases (p = 0.001) compared to benign prostatic tissue. Decreased CD147 was associated with advanced pathologic stage (p = 0.009) and high Gleason score (p = 0.02), and low CD147 expression predicted biochemical recurrence (HR 0.55; 95 % CI 0.31-0.97; p = 0.04) independent of clinico-pathologic features. Immunoblot bands were detected at 44 kDa and 66 kDa, representing non-glycosylated and glycosylated forms of CD147 protein, and CD147 expression was lower in tumorigenic T10 cells than non-tumorigenic BPH-1 cells (p = 0.02). Decreased CD147 mRNA expression was associated with increased Gleason score and pathologic stage in patient tumors but is not associated with recurrence status. Membrane-associated CD147 expression is significantly decreased in PCa compared to non-malignant prostate tissue and is associated with tumor progression, and low CD147 expression predicts biochemical recurrence after prostatectomy independent of pathologic stage, Gleason score, lymph node status, surgical margins, and tumor volume in multivariable analysis.
Zhou, Zhen; Wang, Jian-Bao; Zang, Yu-Feng; Pan, Gang
2018-01-01
Classification approaches have been increasingly applied to differentiate patients and normal controls using resting-state functional magnetic resonance imaging data (RS-fMRI). Although most previous classification studies have reported promising accuracy within individual datasets, achieving high levels of accuracy with multiple datasets remains challenging for two main reasons: high dimensionality, and high variability across subjects. We used two independent RS-fMRI datasets (n = 31, 46, respectively) both with eyes closed (EC) and eyes open (EO) conditions. For each dataset, we first reduced the number of features to a small number of brain regions with paired t-tests, using the amplitude of low frequency fluctuation (ALFF) as a metric. Second, we employed a new method for feature extraction, named the PAIR method, examining EC and EO as paired conditions rather than independent conditions. Specifically, for each dataset, we obtained EC minus EO (EC—EO) maps of ALFF from half of subjects (n = 15 for dataset-1, n = 23 for dataset-2) and obtained EO—EC maps from the other half (n = 16 for dataset-1, n = 23 for dataset-2). A support vector machine (SVM) method was used for classification of EC RS-fMRI mapping and EO mapping. The mean classification accuracy of the PAIR method was 91.40% for dataset-1, and 92.75% for dataset-2 in the conventional frequency band of 0.01–0.08 Hz. For cross-dataset validation, we applied the classifier from dataset-1 directly to dataset-2, and vice versa. The mean accuracy of cross-dataset validation was 94.93% for dataset-1 to dataset-2 and 90.32% for dataset-2 to dataset-1 in the 0.01–0.08 Hz range. For the UNPAIR method, classification accuracy was substantially lower (mean 69.89% for dataset-1 and 82.97% for dataset-2), and was much lower for cross-dataset validation (64.69% for dataset-1 to dataset-2 and 64.98% for dataset-2 to dataset-1) in the 0.01–0.08 Hz range. 
In conclusion, for within-group design studies (e.g., paired conditions or follow-up studies), we recommend the PAIR method for feature extraction. In addition, dimensionality reduction with strong prior knowledge of specific brain regions should also be considered for feature selection in neuroimaging studies. PMID:29375288
Galbiati, Silvia; Monguzzi, Alessandra; Damin, Francesco; Soriani, Nadia; Passiu, Marianna; Castellani, Carlo; Natacci, Federica; Curcio, Cristina; Seia, Manuela; Lalatta, Faustina; Chiari, Marcella; Ferrari, Maurizio; Cremonesi, Laura
2016-07-01
Until now, non-invasive prenatal diagnosis of genetic diseases has found only limited routine applications. In autosomal recessive diseases, it can be used to determine the carrier status of the fetus through the detection of a paternally inherited disease allele in cases where maternal and paternal mutated alleles differ. Conditions for non-invasive identification of fetal paternally inherited mutations in maternal plasma were developed by two independent approaches: coamplification at lower denaturation temperature-PCR (COLD-PCR) and highly sensitive microarrays. Assays were designed for identifying 14 mutations, 7 causing β-thalassaemia and 7 cystic fibrosis. In total, 87 non-invasive prenatal diagnoses were performed by COLD-PCR in 75 couples at risk for β-thalassaemia and 12 for cystic fibrosis. First, to identify the more appropriate methodology for the analysis of minority mutated fetal alleles in maternal plasma, both fast and full COLD-PCR protocols were developed for the most common Italian β-thalassaemia Cd39 and IVSI.110 mutations. In 5 out of 31 samples, no enrichment was obtained with the fast protocol, while full COLD-PCR provided the correct fetal genotypes. Thus, full COLD-PCR protocols were developed for all the remaining mutations and all analyses confirmed the fetal genotypes obtained by invasive prenatal diagnosis. Microarray analysis was performed on 40 samples from 28 couples at risk for β-thalassaemia and 12 for cystic fibrosis. Results were in complete concordance with those obtained by both COLD-PCR and invasive procedures. COLD-PCR and microarray approaches are inexpensive, simple to handle and fast, and can be easily set up in specialised clinical laboratories where prenatal diagnosis is routinely performed.
Shen, Yi; Wang, Zhanwei; Loo, Lenora WM; Ni, Yan; Jia, Wei; Fei, Peiwen; Risch, Harvey A.; Katsaros, Dionyssios; Yu, Herbert
2015-01-01
Long non-coding RNAs (lncRNAs) are a class of newly recognized DNA transcripts that have diverse biological activities. Dysregulation of lncRNAs may be involved in many pathogenic processes including cancer. Recently, we found an intergenic lncRNA, LINC00472, whose expression was correlated with breast cancer progression and patient survival. Our findings were consistent across multiple clinical datasets and supported by results from in vitro experiments. To evaluate further the role of LINC00472 in breast cancer, we used various online databases to investigate possible mechanisms that might affect LINC00472 expression in breast cancer. We also analyzed associations of LINC00472 with estrogen receptor, tumor grade, and molecular subtypes in additional online datasets generated by microarray platforms different from the one we investigated previously. We found that LINC00472 expression in breast cancer was more likely regulated by promoter methylation than by alteration of gene copy number. Analysis of additional datasets confirmed our previous findings that high expression of LINC00472 is associated with ER-positive and low-grade tumors and favorable molecular subtypes. Finally, in nine datasets, we examined the association of LINC00472 expression with disease-free survival in patients with grade 2 tumors. Meta-analysis of the datasets showed that LINC00472 expression in breast tumors predicted the recurrence of breast cancer in patients with grade 2 tumors. In summary, our analyses confirm that LINC00472 is functionally a tumor suppressor, and that assessing its expression in breast tumors may have clinical implications in breast cancer management. PMID:26564482
Shen, Yi; Wang, Zhanwei; Loo, Lenora W M; Ni, Yan; Jia, Wei; Fei, Peiwen; Risch, Harvey A; Katsaros, Dionyssios; Yu, Herbert
2015-12-01
Long non-coding RNAs (lncRNAs) are a class of newly recognized DNA transcripts that have diverse biological activities. Dysregulation of lncRNAs may be involved in many pathogenic processes including cancer. Recently, we found an intergenic lncRNA, LINC00472, whose expression was correlated with breast cancer progression and patient survival. Our findings were consistent across multiple clinical datasets and supported by results from in vitro experiments. To evaluate further the role of LINC00472 in breast cancer, we used various online databases to investigate possible mechanisms that might affect LINC00472 expression in breast cancer. We also analyzed associations of LINC00472 with estrogen receptor, tumor grade, and molecular subtypes in additional online datasets generated by microarray platforms different from the one we investigated previously. We found that LINC00472 expression in breast cancer was more likely regulated by promoter methylation than by alteration of gene copy number. Analysis of additional datasets confirmed our previous findings that high expression of LINC00472 is associated with ER-positive and low-grade tumors and favorable molecular subtypes. Finally, in nine datasets, we examined the association of LINC00472 expression with disease-free survival in patients with grade 2 tumors. Meta-analysis of the datasets showed that LINC00472 expression in breast tumors predicted the recurrence of breast cancer in patients with grade 2 tumors. In summary, our analyses confirm that LINC00472 is functionally a tumor suppressor, and that assessing its expression in breast tumors may have clinical implications in breast cancer management.
Structure and transcriptional regulation of the major intrinsic protein gene family in grapevine.
Wong, Darren Chern Jan; Zhang, Li; Merlin, Isabelle; Castellarin, Simone D; Gambetta, Gregory A
2018-04-11
The major intrinsic protein (MIP) family is a family of proteins, including aquaporins, which facilitate water and small molecule transport across plasma membranes. In plants, MIPs function in a huge variety of processes including water transport, growth, stress response, and fruit development. In this study, we characterize the structure and transcriptional regulation of the MIP family in grapevine, describing the putative genome duplication events leading to the family structure and characterizing the family's tissue and developmental specific expression patterns across numerous preexisting microarray and RNAseq datasets. Gene co-expression network (GCN) analyses were carried out across these datasets and the promoters of each family member were analyzed for cis-regulatory element structure in order to provide insight into their transcriptional regulation. A total of 29 Vitis vinifera MIP family members (excluding putative pseudogenes) were identified of which all but two were mapped onto Vitis vinifera chromosomes. In this study, segmental duplication events were identified for five plasma membrane intrinsic protein (PIP) and four tonoplast intrinsic protein (TIP) genes, contributing to the expansion of PIPs and TIPs in grapevine. Grapevine MIP family members have distinct tissue and developmental expression patterns and hierarchical clustering revealed two primary groups regardless of the datasets analyzed. Composite microarray and RNA-seq gene co-expression networks (GCNs) highlighted the relationships between MIP genes and functional categories involved in cell wall modification and transport, as well as with other MIPs revealing a strong co-regulation within the family itself. Some duplicated MIP family members have undergone sub-functionalization and exhibit distinct expression patterns and GCNs. Cis-regulatory element (CRE) analyses of the MIP promoters and their associated GCN members revealed enrichment for numerous CREs including AP2/ERFs and NACs. 
Combining phylogenetic analyses, gene expression profiling, gene co-expression network analyses, and cis-regulatory element enrichment, this study provides a comprehensive overview of the structure and transcriptional regulation of the grapevine MIP family. The study highlights the duplication and sub-functionalization of the family, its strong coordinated expression with genes involved in growth and transport, and the putative classes of TFs responsible for its regulation.
Mahmood, Khalid; Jung, Chol-Hee; Philip, Gayle; Georgeson, Peter; Chung, Jessica; Pope, Bernard J; Park, Daniel J
2017-05-16
Genetic variant effect prediction algorithms are used extensively in clinical genomics and research to determine the likely consequences of amino acid substitutions on protein function. It is vital that we better understand their accuracies and limitations because published performance metrics are confounded by serious problems of circularity and error propagation. Here, we derive three independent, functionally determined human mutation datasets, UniFun, BRCA1-DMS and TP53-TA, and employ them, alongside previously described datasets, to assess the pre-eminent variant effect prediction tools. Apparent accuracies of variant effect prediction tools were influenced significantly by the benchmarking dataset. Benchmarking with the assay-determined datasets UniFun and BRCA1-DMS yielded areas under the receiver operating characteristic curves in the modest ranges of 0.52 to 0.63 and 0.54 to 0.75, respectively, considerably lower than observed for other, potentially more conflicted datasets. These results raise concerns about how such algorithms should be employed, particularly in a clinical setting. Contemporary variant effect prediction tools are unlikely to be as accurate at the general prediction of functional impacts on proteins as reported prior. Use of functional assay-based datasets that avoid prior dependencies promises to be valuable for the ongoing development and accurate benchmarking of such tools.
Su, Hengchuan; Wang, Hongkai; Shi, Guohai; Zhang, Hailiang; Sun, Fukang; Ye, Dingwei
2018-06-01
In order to identify potential novel biomarkers of advanced clear cell renal cell carcinoma (ccRCC), we re-evaluated published long non-coding RNA (lncRNA) expression profiling data. The lncRNA expression profiles in ccRCC microarray dataset GSE47352 were analyzed and an independent cohort of 61 clinical samples including 21 advanced and 40 localized ccRCC patients was used to confirm the most statistically significant lncRNAs by real time PCR. Next, the relationships between the selected lncRNAs and ccRCC patients' clinicopathological features were investigated. The effects of lncRNAs on the invasion and proliferation of renal carcinoma cells were also investigated. The PCR results in a cohort of 21 advanced ccRCC and 40 localized ccRCC tissues were used for confirmation of the selected lncRNAs which were statistically most significant. The PCR results showed that the expression of three lncRNAs (ENSG00000241684, ENSG00000231721 and NEAT1) was significantly downregulated in advanced ccRCC. Kaplan-Meier analysis revealed that reduced expression of lncRNA ENSG00000241684 and NEAT1 was significantly associated with poor overall survival. The univariate and multivariate Cox regression indicated that lncRNA ENSG00000241684 had significant hazard ratios for predicting clinical outcome. LncRNA ENSG00000241684 expression was negatively correlated with pTNM stage. Overexpression of ENSG00000241684 significantly impaired cell proliferation and reduced the invasion ability in 786-O and ACHN cells. lncRNAs are involved in renal carcinogenesis and decreased lncRNA ENSG00000241684 expression may be an independent adverse prognostic factor in advanced ccRCC patients.
Fasting and Fast Food Diet Play an Opposite Role in Mice Brain Aging.
Castrogiovanni, Paola; Li Volti, Giovanni; Sanfilippo, Cristina; Tibullo, Daniele; Galvano, Fabio; Vecchio, Michele; Avola, Roberto; Barbagallo, Ignazio; Malaguarnera, Lucia; Castorina, Sergio; Musumeci, Giuseppe; Imbesi, Rosa; Di Rosa, Michelino
2018-01-20
Fasting may be exploited as a possible strategy for prevention and treatment of several diseases such as diabetes, obesity, and aging. On the other hand, high-fat diet (HFD) represents a risk factor for several diseases and increased mortality. The aim of the present study was to evaluate the impact of fasting on the transcriptome of the aging mouse brain and how HFD regulates such pathways. We used the NCBI Gene Expression Omnibus (GEO) database to identify suitable microarray datasets comparing the mouse brain transcriptome under fasting or HFD with the aged mouse brain transcriptome. Three microarray datasets were selected for this study, GSE24504, GSE6285, and GSE8150, and the principal molecular mechanisms involved in this process were evaluated. This analysis showed that, regardless of fasting duration, 21 genes were significantly upregulated and 30 significantly downregulated in mouse brain. The involved biological processes were related to cell cycle arrest, cell death inhibition, and regulation of cellular metabolism. Comparing the mouse brain transcriptome under fasting and aged conditions, we found that the number of genes in common increased with the duration of fasting (222 genes), peaking at 72 h. In addition, the mouse brain transcriptome under HFD showed a 30% resemblance to that of aged mice. Furthermore, several molecular processes were found to be shared between HFD and aging. In conclusion, we suggest that fasting and HFD play opposite roles in the brain transcriptome of aged mice. Therefore, an intermittent diet could represent a possible clinical strategy to counteract aging, loss of memory, and neuroinflammation. Furthermore, a low-fat diet leads to the inactivation of brain degenerative processes triggered by aging.
Kim, Tae-Hwan; Choi, Sung Jae; Lee, Young Ho; Song, Gwan Gyu; Ji, Jong Dae
2014-07-01
Anti-tumor necrosis factor (TNF) therapy is the treatment of choice for rheumatoid arthritis (RA) patients in whom standard disease-modifying anti-rheumatic drugs are ineffective. However, a substantial proportion of RA patients treated with anti-TNF agents do not show a significant clinical response. Therefore, biomarkers predicting response to anti-TNF agents are needed. Recently, gene expression profiling has been applied in research for developing such biomarkers. We compared gene expression profiles reported by previous studies dealing with the responsiveness of anti-TNF therapy in RA patients and attempted to identify differentially expressed genes (DEGs) that discriminated between responders and non-responders to anti-TNF therapy. We used microarray datasets available at the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO). This analysis included 6 studies and 5 sets of microarray data that used peripheral blood samples for identification of DEGs predicting response to anti-TNF therapy. We found little overlap in the DEGs that were highly ranked in each study. Three DEGs, IL2RB, SH2D2A and G0S2, appeared in more than 1 study. In addition, a meta-analysis designed to increase statistical power identified one DEG, G0S2, by Fisher's method. Our findings suggest that G0S2 may serve as a biomarker to predict response to anti-TNF therapy in patients with rheumatoid arthritis. Further investigations based on larger studies are needed to confirm the significance of G0S2 in predicting response to anti-TNF therapy.
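The meta-analysis step mentioned in this abstract relies on Fisher's method for combining p-values from independent studies. A minimal sketch of that calculation follows. The function name and the example inputs are hypothetical, and the code illustrates the general technique, not the authors' pipeline. It uses the closed-form chi-square tail probability for an even number of degrees of freedom, so only the standard library is needed.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values for one gene with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null, where k = len(pvalues).
    For even degrees of freedom the upper tail has the closed form
    exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!.
    """
    if not pvalues:
        raise ValueError("at least one p-value is required")
    if any(p <= 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("p-values must lie in the interval (0, 1]")
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # this is X / 2
    term = 1.0   # current series term (X/2)^i / i!, starting at i = 0
    total = 1.0  # running sum of the series
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total
```

For instance, `fisher_combined_pvalue([0.04, 0.09, 0.20])` would combine three hypothetical per-study p-values for a single gene into one value. The method assumes the studies are independent, which is worth checking when datasets share samples.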
ExpressionDB: An open source platform for distributing genome-scale datasets.
Hughes, Laura D; Lewis, Scott A; Hughes, Michael E
2017-01-01
RNA-sequencing (RNA-seq) and microarrays are methods for measuring gene expression across the entire transcriptome. Recent advances have made these techniques practical and affordable for essentially any laboratory with experience in molecular biology. A variety of computational methods have been developed to decrease the amount of bioinformatics expertise necessary to analyze these data. Nevertheless, many barriers persist which discourage new labs from using functional genomics approaches. Since high-quality gene expression studies have enduring value as resources to the entire research community, it is of particular importance that small labs have the capacity to share their analyzed datasets with the research community. Here we introduce ExpressionDB, an open source platform for visualizing RNA-seq and microarray data accommodating virtually any number of different samples. ExpressionDB is based on Shiny, a customizable web application which allows data sharing locally and online with customizable code written in R. ExpressionDB allows intuitive searches based on gene symbols, descriptions, or gene ontology terms, and it includes tools for dynamically filtering results based on expression level, fold change, and false-discovery rates. Built-in visualization tools include heatmaps, volcano plots, and principal component analysis, ensuring streamlined and consistent visualization to all users. All of the scripts for building an ExpressionDB with user-supplied data are freely available on GitHub, and the Creative Commons license allows fully open customization by end-users. We estimate that a demo database can be created in under one hour with minimal programming experience, and that a new database with user-supplied expression data can be completed and online in less than one day.
Heerema-McKenney, Amy; Wijnaendts, Liliane C D; Pulliam, Joseph F; Lopez-Terrada, Dolores; McKenney, Jesse K; Zhu, Shirley; Montgomery, Kelli; Mitchell, Janet; Marinelli, Robert J; Hart, Augustinus A M; van de Rijn, Matt; Linn, Sabine C
2008-10-01
The pathologic classification of rhabdomyosarcoma (RMS) into embryonal or alveolar subtype is an important prognostic factor guiding the therapeutic protocol chosen for an individual patient. Unfortunately, this classification is not always straightforward, and the diagnostic criteria are controversial in a subset of cases. Ancillary studies are used to aid in the classification, but their potential use as independent prognostic factors is rarely studied. The aim of this study is to identify immunohistochemical markers of potential prognostic significance in pediatric RMS and to correlate their expression with PAX-3/FKHR and PAX-7/FKHR fusion status. A single tissue microarray containing 71 paraffin-embedded pediatric RMSs was immunostained with antibodies against p53, bcl-2, Ki-67, CD44, myogenin, and MyoD1. The tissue microarray and whole paraffin blocks were studied for PAX-3/FKHR and PAX-7/FKHR gene fusions by fluorescence in situ hybridization and reverse transcription-polymerase chain reaction. Clinical follow-up data were available for each patient. Immunohistochemical staining results and translocation status were correlated with recurrence-free interval (RFI) and overall survival (OS) using the Kaplan-Meier method, the log-rank test, and Cox proportional hazard regression. The minimum clinical follow-up interval was 24 months (median follow-up=57 mo). On univariable analysis, immunohistochemical expression of myogenin, bcl-2, and identification of a gene fusion were associated with decreased 5-year RFI and 10-year OS (myogenin RFI P=0.0028, OS P=0.0021; bcl-2 RFI P=0.037, OS P=0.032; gene fusion RFI P=0.0001, OS P=0.0058). After adjustment for Intergroup Rhabdomyosarcoma Study-TNM stage, tumor site, age, tumor histology, and translocation status by multivariable analysis, only myogenin retained an independent association with RFI (P=0.034) and OS (P=0.0069). 
In this retrospective analysis, diffuse immunohistochemical reactivity for myogenin in RMS correlates with decreased RFI and OS, independent of histologic subtype, translocation status, tumor site, or stage.
Systematic Omics Analysis Review (SOAR) Tool to Support Risk Assessment
McConnell, Emma R.; Bell, Shannon M.; Cote, Ila; Wang, Rong-Lin; Perkins, Edward J.; Garcia-Reyero, Natàlia; Gong, Ping; Burgoon, Lyle D.
2014-01-01
Environmental health risk assessors are challenged to understand and incorporate new data streams as the field of toxicology continues to adopt new molecular and systems biology technologies. Systematic screening reviews can help risk assessors and assessment teams determine which studies to consider for inclusion in a human health assessment. A tool for systematic reviews should be standardized and transparent in order to consistently determine which studies meet minimum quality criteria prior to performing in-depth analyses of the data. The Systematic Omics Analysis Review (SOAR) tool is focused on assisting risk assessment support teams in performing systematic reviews of transcriptomic studies. SOAR is a spreadsheet tool of 35 objective questions developed by domain experts, focused on transcriptomic microarray studies, and including four main topics: test system, test substance, experimental design, and microarray data. The tool will be used as a guide to identify studies that meet basic published quality criteria, such as those defined by the Minimum Information About a Microarray Experiment standard and the Toxicological Data Reliability Assessment Tool. Seven scientists were recruited to test the tool by using it to independently rate 15 published manuscripts that study chemical exposures with microarrays. Using their feedback, questions were weighted based on importance of the information and a suitability cutoff was set for each of the four topic sections. The final validation resulted in 100% agreement between the users on four separate manuscripts, showing that the SOAR tool may be used to facilitate the standardized and transparent screening of microarray literature for environmental human health risk assessment. PMID:25531884
Genetics of PCOS: A systematic bioinformatics approach to unveil the proteins responsible for PCOS.
Panda, Pritam Kumar; Rane, Riya; Ravichandran, Rahul; Singh, Shrinkhla; Panchal, Hetalkumar
2016-06-01
Polycystic ovary syndrome (PCOS) is a hormonal imbalance in women that causes problems during the menstrual cycle and in pregnancy and sometimes results in fatality. Though the genetics of PCOS is not fully understood, early diagnosis and treatment can prevent long-term effects. In this study, we examined the proteins involved in PCOS and their structural aspects using computational tools. The proteins involved were modeled using Modeller 9v14 and ab initio programs. All 43 proteins responsible for PCOS were subjected to phylogenetic analysis to identify their relatedness. Further, PCOS microarray datasets downloaded from GEO were analyzed to find significant protein-coding genes responsible for PCOS, in addition to the protein-coding genes already reported. Various statistical analyses were done using R programming to gain insight into the structural aspects of PCOS that can be used as drug targets to treat PCOS and other related reproductive diseases.
NASA Astrophysics Data System (ADS)
Wiktor, Peter; Brunner, Al; Kahn, Peter; Qiu, Ji; Magee, Mitch; Bian, Xiaofang; Karthikeyan, Kailash; Labaer, Joshua
2015-03-01
We report a device to fill an array of small chemical reaction chambers (microreactors) with reagent and then seal them using pressurized viscous liquid acting through a flexible membrane. The device enables multiple, independent chemical reactions involving free floating intermediate molecules without interference from neighboring reactions or external environments. The device is validated by protein expressed in situ directly from DNA in a microarray of ~10,000 spots with no diffusion during three hours incubation. Using the device to probe for an autoantibody cancer biomarker in blood serum sample gave five times higher signal to background ratio compared to standard protein microarray expressed on a flat microscope slide. Physical design principles to effectively fill the array of microreactors with reagent and experimental results of alternate methods for sealing the microreactors are presented.
NASA Astrophysics Data System (ADS)
Ali, E. S. M.; Spencer, B.; McEwen, M. R.; Rogers, D. W. O.
2015-02-01
In this study, a quantitative estimate is derived for the uncertainty in the XCOM photon mass attenuation coefficients in the energy range of interest to external beam radiation therapy, i.e. 100 keV (orthovoltage) to 25 MeV, using direct comparisons of experimental data against Monte Carlo models and theoretical XCOM data. Two independent datasets are used. The first dataset is from our recent transmission measurements and the corresponding EGSnrc calculations (Ali et al 2012 Med. Phys. 39 5990-6003) for 10-30 MV photon beams from the research linac at the National Research Council Canada. The attenuators are graphite and lead, with a total of 140 data points and an experimental uncertainty of ~0.5% (k = 1). An optimum energy-independent cross section scaling factor that minimizes the discrepancies between measurements and calculations is used to deduce cross section uncertainty. The second dataset is from the aggregate of cross section measurements in the literature for graphite and lead (49 experiments, 288 data points). The dataset is compared to the sum of the XCOM data plus the IAEA photonuclear data. Again, an optimum energy-independent cross section scaling factor is used to deduce the cross section uncertainty. Using the average result from the two datasets, the energy-independent cross section uncertainty estimate is 0.5% (68% confidence) and 0.7% (95% confidence). The potential for energy-dependent errors is discussed. Photon cross section uncertainty is shown to be smaller than the current qualitative 'envelope of uncertainty' of the order of 1-2%, as given by Hubbell (1999 Phys. Med. Biol. 44 R1-22).
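The idea of an optimum energy-independent scaling factor can be illustrated with a deliberately simplified sketch. It assumes ideal narrow-beam attenuation, T = exp(-f·μt), and fits the single factor f by least squares on ln T; the study itself compares measurements with full EGSnrc calculations, so this model, the function name `best_scaling_factor` and the synthetic data are all assumptions made for illustration.

```python
import math


def best_scaling_factor(mu_t, measured_transmission):
    """Least-squares f minimising sum((ln T_meas + f * mu_t)^2).

    mu_t                  : calculated attenuation exponents (mu * thickness)
    measured_transmission : measured transmission for the same attenuators
    """
    numerator = sum(-math.log(T) * x
                    for x, T in zip(mu_t, measured_transmission))
    denominator = sum(x * x for x in mu_t)
    return numerator / denominator


# Synthetic example: pretend the true cross sections are 0.4% higher than
# the tabulated ones (value chosen arbitrarily for this sketch).
mu_t = [0.5, 1.0, 1.5, 2.0]
assumed_true_factor = 1.004
measured = [math.exp(-assumed_true_factor * x) for x in mu_t]

f = best_scaling_factor(mu_t, measured)
deviation_percent = (f - 1.0) * 100.0
print(f"fitted scaling factor: {f:.4f} ({deviation_percent:+.2f}%)")
```

The deviation of the fitted factor from unity is what such an analysis would read as an energy-independent cross section error; a real analysis would also propagate the experimental uncertainties.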
NASA Astrophysics Data System (ADS)
Dube, Timothy; Sibanda, Mbulisi; Shoko, Cletah; Mutanga, Onisimo
2017-10-01
Forest stand volume is a crucial stand parameter that influences the ability of forests to provide ecosystem goods and services. This study examined the potential of integrating a multispectral SPOT 5 image with ancillary data (forest age and rainfall metrics) to estimate stand volume in coppiced and planted Eucalyptus spp. in KwaZulu-Natal, South Africa. To achieve this objective, the Partial Least Squares Regression (PLSR) algorithm was used. PLSR was applied in three analysis stages: stage I used ancillary data as an independent dataset, stage II used SPOT 5 spectral bands as an independent dataset, and stage III combined SPOT 5 spectral bands with ancillary data. The results showed that the independent ancillary dataset better explained the volume of Eucalyptus spp. growing from coppices (adjusted R2 (R2Adj) = 0.54, RMSEP = 44.08 m3/ha) than that of planted stands (R2Adj = 0.43, RMSEP = 53.29 m3/ha). Similar results were observed when SPOT 5 spectral bands were applied as an independent dataset, whereas volume estimates improved when the combined dataset was used. For instance, planted Eucalyptus spp. were better predicted (R2 = 0.77, R2Adj = 0.59, RMSEP = 36.02 m3/ha) than those growing from coppices (R2 = 0.76, R2Adj = 0.46, RMSEP = 40.63 m3/ha). Overall, the findings demonstrate the relevance of multi-source data in ecosystem modelling.
Feltus, F Alex; Ficklin, Stephen P; Gibson, Scott M; Smith, Melissa C
2013-06-05
In genomics, highly relevant gene interaction (co-expression) networks have been constructed by finding significant pair-wise correlations between genes in expression datasets. These networks are then mined to elucidate biological function at the polygenic level. In some cases networks may be constructed from input samples that measure gene expression under a variety of different conditions, such as for different genotypes, environments, disease states and tissues. When large sets of samples are obtained from public repositories it is often unmanageable to associate samples into condition-specific groups, and combining samples from various conditions has a negative effect on network size. A fixed significance threshold is often applied, which also limits the size of the final network. Therefore, we propose pre-clustering of input expression samples to approximate condition-specific grouping of samples, and individual network construction for each group, as a means for dynamic significance thresholding. The net effect is increased sensitivity, thus maximizing the total co-expression relationships in the final co-expression network compendium. A total of 86 Arabidopsis thaliana co-expression networks were constructed after k-means partitioning of 7,105 publicly available ATH1 Affymetrix microarray samples. We term each pre-sorted network a Gene Interaction Layer (GIL). Random Matrix Theory (RMT), an un-supervised thresholding method, was used to threshold each of the 86 networks independently, effectively providing a dynamic (non-global) threshold for the network. The overall gene count across all GILs reached 19,588 genes (94.7% measured gene coverage) and 558,022 unique co-expression relationships. In comparison, network construction without pre-sorting of input samples yielded only 3,297 genes (15.9%) and 129,134 relationships in the global network.
Here we show that pre-clustering of microarray samples helps approximate condition-specific networks and allows for dynamic thresholding using un-supervised methods. Because RMT ensures only highly significant interactions are kept, the GIL compendium consists of 558,022 unique high quality A. thaliana co-expression relationships across almost all of the measurable genes on the ATH1 array. For A. thaliana, these networks represent the largest compendium to date of significant gene co-expression relationships, and are a means to explore complex pathway, polygenic, and pleiotropic relationships for this focal model plant. The networks can be explored at sysbio.genome.clemson.edu. Finally, this method is applicable to any large expression profile collection for any organism and is best suited where a knowledge-independent network construction method is desired. PMID:23738693
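Why building one network per sample group can recover relationships that a single global network misses can be shown with a toy sketch. The code below assumes the sample groups are already given (standing in for the k-means partition) and uses a fixed absolute Pearson cutoff in place of Random Matrix Theory thresholding, which is not implemented here; gene names, expression values and the 0.9 cutoff are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def network(expr, samples, threshold):
    """Gene pairs whose |r| over the given sample indices meets the cutoff."""
    edges = set()
    for g1, g2 in combinations(sorted(expr), 2):
        x = [expr[g1][s] for s in samples]
        y = [expr[g2][s] for s in samples]
        if abs(pearson(x, y)) >= threshold:
            edges.add((g1, g2))
    return edges


# Hypothetical expression values: 3 genes x 8 samples.
expr = {
    "geneA": [1, 2, 3, 4, 1, 2, 3, 4],
    "geneB": [1, 2, 3, 4, 4, 3, 2, 1],   # tracks geneA in group 0, opposes it in group 1
    "geneC": [3, 1, 4, 1, 5, 9, 2, 6],
}
groups = [[0, 1, 2, 3], [4, 5, 6, 7]]    # stand-in for a k-means partition
cutoff = 0.9                             # stand-in for an RMT-derived threshold

global_net = network(expr, range(8), cutoff)
layered_net = set().union(*(network(expr, g, cutoff) for g in groups))
print("global:", global_net)
print("layered:", layered_net)
```

In this toy data the geneA/geneB relationship is strong within each group but cancels when all samples are pooled, so it appears only in the union of the per-group networks.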
NASA Astrophysics Data System (ADS)
Tibbetts, Clark; Lichanska, Agnieszka M.; Borsuk, Lisa A.; Weslowski, Brian; Morris, Leah M.; Lorence, Matthew C.; Schafer, Klaus O.; Campos, Joseph; Sene, Mohamadou; Myers, Christopher A.; Faix, Dennis; Blair, Patrick J.; Brown, Jason; Metzgar, David
2010-04-01
High-density resequencing microarrays support simultaneous detection and identification of multiple viral and bacterial pathogens. Because detection and identification using RPM is based upon multiple specimen-specific target pathogen gene sequences generated in the individual test, the test results enable both a differential diagnostic analysis and epidemiological tracking of detected pathogen strains and variants from one specimen to the next. The RPM assay enables detection and identification of pathogen sequences that share as little as 80% sequence similarity to prototype target gene sequences represented as detector tiles on the array. This capability enables the RPM to detect and identify previously unknown strains and variants of a detected pathogen, as in sentinel cases associated with an infectious disease outbreak. We illustrate this capability using assay results from testing influenza A virus vaccines configured with strains that were first defined years after the design of the RPM microarray. Results are also presented from RPM-Flu testing of three specimens independently confirmed to be positive for the 2009 Novel H1N1 outbreak strain of influenza virus.
Development and validation of a flax (Linum usitatissimum L.) gene expression oligo microarray
Fenart, Stéphane; Ndong, Yves-Placide Assoumou; Duarte, Jorge; Rivière, Nathalie; Wilmer, Jeroen; van Wuytswinkel, Olivier; Lucau, Anca; Cariou, Emmanuelle; Neutelings, Godfrey; Gutierrez, Laurent; Chabbert, Brigitte; Guillot, Xavier; Tavernier, Reynald; Hawkins, Simon; Thomasset, Brigitte
2010-10-21
Flax (Linum usitatissimum L.) has been cultivated for around 9,000 years and is therefore one of the oldest cultivated species. Today, flax is still grown for its oil (oil-flax or linseed cultivars) and its cellulose-rich fibres (fibre-flax cultivars) used for high-value linen garments and composite materials. Despite the wide industrial use of flax-derived products and our current understanding of the regulation of both wood fibre production and oil biosynthesis, more information must be acquired in both domains. Recent advances in genomics are now providing opportunities to improve our fundamental knowledge of these complex processes. In this paper we report the development and validation of a high-density oligo microarray platform dedicated to gene expression analyses in flax. Nine different RNA samples obtained from flax inner- and outer-stems, seeds, leaves and roots were used to generate a collection of 1,066,481 ESTs by massively parallel pyrosequencing. Sequences were assembled into 59,626 unigenes, and 48,021 sequences were selected for oligo design and high-density microarray (Nimblegen 385K) fabrication with eight non-overlapping 25-mer oligos per unigene. Eighteen independent experiments were used to evaluate the hybridization quality, precision, specificity and accuracy, and all results confirmed the high technical quality of our microarray platform. Cross-validation of microarray data was carried out using quantitative RT-PCR (qRT-PCR). Nine target genes were selected on the basis of microarray results and reflected the whole range of fold change (both up-regulated and down-regulated genes in different samples). A statistically significant positive correlation was obtained comparing expression levels for each target gene across all biological replicates in both qRT-PCR and microarray results. Further experiments illustrated the capacity of our arrays to detect differential gene expression in a variety of flax tissues as well as between two contrasting flax varieties.
All results suggest that our high-density flax oligo-microarray platform can be used as a very sensitive tool for analyzing gene expression in a large variety of tissues as well as in different cultivars. Moreover, this highly reliable platform can also be used for quantitative mRNA transcriptional profiling in different flax tissues. PMID:20964859
Quantitative Missense Variant Effect Prediction Using Large-Scale Mutagenesis Data.
Gray, Vanessa E; Hause, Ronald J; Luebeck, Jens; Shendure, Jay; Fowler, Douglas M
2018-01-24
Large datasets describing the quantitative effects of mutations on protein function are becoming increasingly available. Here, we leverage these datasets to develop Envision, which predicts the magnitude of a missense variant's molecular effect. Envision combines 21,026 variant effect measurements from nine large-scale experimental mutagenesis datasets, a hitherto untapped training resource, with a supervised, stochastic gradient boosting learning algorithm. Envision outperforms other missense variant effect predictors both on large-scale mutagenesis data and on an independent test dataset comprising 2,312 TP53 variants whose effects were measured using a low-throughput approach. This dataset was never used for hyperparameter tuning or model training and thus serves as an independent validation set. Envision prediction accuracy is also more consistent across amino acids than other predictors. Finally, we demonstrate that Envision's performance improves as more large-scale mutagenesis data are incorporated. We precompute Envision predictions for every possible single amino acid variant in human, mouse, frog, zebrafish, fruit fly, worm, and yeast proteomes (https://envision.gs.washington.edu/). Copyright © 2017 Elsevier Inc. All rights reserved.
Zhou, Shiyong; Liu, Pengfei; Zhang, Huilai
2017-01-01
Acute myeloid leukemia (AML) is a frequently occurring malignant disease of the blood and may result from a variety of genetic disorders. The present study aimed to identify the underlying mechanisms associated with the therapeutic effects of decitabine and cytarabine on AML, using microarray analysis. The microarray datasets GSE40442 and GSE40870 were downloaded from the Gene Expression Omnibus database. Differentially expressed genes (DEGs) and differentially methylated sites were identified in AML cells treated with decitabine compared with those treated with cytarabine via the Linear Models for Microarray Data package, following data pre-processing. Gene Ontology (GO) analysis of DEGs was performed using the Database for Annotation, Visualization and Integrated Analysis Discovery. Genes corresponding to the differentially methylated sites were obtained using the annotation package of the methylation microarray platform. The overlapping genes were identified, which exhibited the opposite variation trend between gene expression and DNA methylation. Important transcription factor (TF)-gene pairs were screened out, and a regulated network subsequently constructed. A total of 190 DEGs and 540 differentially methylated sites were identified in AML cells treated with decitabine compared with those treated with cytarabine. A total of 36 GO terms of DEGs were enriched, including nucleosomes, protein-DNA complexes and the nucleosome assembly. The 540 differentially methylated sites were located on 240 genes, including the acid-repeat containing protein (ACRC) gene that was additionally differentially expressed. In addition, 60 TF pairs and overlapped methylated sites, and 140 TF-pairs and DEGs were screened out. The regulated network included 68 nodes and 140 TF-gene pairs. 
The present study identified various genes including ACRC and proliferating cell nuclear antigen, in addition to various TFs, including TATA-box binding protein associated factor 1 and CCCTC-binding factor, which may be potential therapeutic targets of AML. PMID:28498449
A comprehensive simulation study on classification of RNA-Seq data.
Zararsız, Gökmen; Goksuluk, Dincer; Korkmaz, Selcuk; Eldem, Vahap; Zararsiz, Gozde Erturk; Duru, Izzet Parug; Ozturk, Ahmet
2017-01-01
RNA sequencing (RNA-Seq) is a powerful technique for the gene-expression profiling of organisms that uses the capabilities of next-generation sequencing technologies. Developing gene-expression-based classification algorithms is an emerging powerful method for diagnosis, disease classification and monitoring at molecular level, as well as providing potential markers of diseases. Most of the statistical methods proposed for the classification of gene-expression data are either based on a continuous scale (eg. microarray data) or require a normal distribution assumption. Hence, these methods cannot be directly applied to RNA-Seq data since they violate both data structure and distributional assumptions. However, it is possible to apply these algorithms with appropriate modifications to RNA-Seq data. One way is to develop count-based classifiers, such as Poisson linear discriminant analysis and negative binomial linear discriminant analysis. Another way is to bring the data closer to microarrays and apply microarray-based classifiers. In this study, we compared several classifiers including PLDA with and without power transformation, NBLDA, single SVM, bagging SVM (bagSVM), classification and regression trees (CART), and random forests (RF). We also examined the effect of several parameters such as overdispersion, sample size, number of genes, number of classes, differential-expression rate, and the transformation method on model performances. A comprehensive simulation study is conducted and the results are compared with the results of two miRNA and two mRNA experimental datasets. The results revealed that increasing the sample size, differential-expression rate and decreasing the dispersion parameter and number of groups lead to an increase in classification accuracy. Similar with differential-expression studies, the classification of RNA-Seq data requires careful attention when handling data overdispersion. 
We conclude that, as a count-based classifier, the power transformed PLDA and, as a microarray-based classifier, vst or rlog transformed RF and SVM classifiers may be a good choice for classification. An R/BIOCONDUCTOR package, MLSeq, is freely available at https://www.bioconductor.org/packages/release/bioc/html/MLSeq.html.
The Porcelain Crab Transcriptome and PCAD, the Porcelain Crab Microarray and Sequence Database
DOE Office of Scientific and Technical Information (OSTI.GOV)
Tagmount, Abderrahmane; Wang, Mei; Lindquist, Erika
2010-01-27
Background: With the emergence of a completed genome sequence of the freshwater crustacean Daphnia pulex, construction of genomic-scale sequence databases for additional crustacean sequences is important for comparative genomics and annotation. Porcelain crabs, genus Petrolisthes, have been powerful crustacean models for environmental and evolutionary physiology with respect to thermal adaptation and understanding responses of marine organisms to climate change. Here, we present a large-scale EST sequencing and cDNA microarray database project for the porcelain crab Petrolisthes cinctipes. Methodology/Principal Findings: A set of ~30K unique sequences (UniSeqs) representing ~19K clusters were generated from ~98K high quality ESTs from a set of tissue specific non-normalized and mixed-tissue normalized cDNA libraries from the porcelain crab Petrolisthes cinctipes. Homology for each UniSeq was assessed using BLAST, InterProScan, GO and KEGG database searches. Approximately 66% of the UniSeqs had homology in at least one of the databases. All EST and UniSeq sequences along with annotation results and coordinated cDNA microarray datasets have been made publicly accessible at the Porcelain Crab Array Database (PCAD), a feature-enriched version of the Stanford and Longhorn Array Databases. Conclusions/Significance: The EST project presented here represents the third largest sequencing effort for any crustacean, and the largest effort for any crab species. Our assembly and clustering results suggest that our porcelain crab EST data set is equally diverse to the much larger EST set generated in the Daphnia pulex genome sequencing project, and thus will be an important resource to the Daphnia research community. Our homology results support the pancrustacea hypothesis and suggest that Malacostraca may be ancestral to Branchiopoda and Hexapoda.
Our results also suggest that our cDNA microarrays cover as much of the transcriptome as can reasonably be captured in EST library sequencing approaches, and thus represent a rich resource for studies of environmental genomics.
Ryan, Natalia; Chorley, Brian; Tice, Raymond R; Judson, Richard; Corton, J Christopher
2016-05-01
Microarray profiling of chemical-induced effects is being increasingly used in medium- and high-throughput formats. Computational methods are described here to identify molecular targets from whole-genome microarray data using as an example the estrogen receptor α (ERα), often modulated by potential endocrine disrupting chemicals. ERα biomarker genes were identified by their consistent expression after exposure to 7 structurally diverse ERα agonists and 3 ERα antagonists in ERα-positive MCF-7 cells. Most of the biomarker genes were shown to be directly regulated by ERα as determined by ESR1 gene knockdown using siRNA as well as through chromatin immunoprecipitation coupled with DNA sequencing analysis of ERα-DNA interactions. The biomarker was evaluated as a predictive tool using the fold-change rank-based Running Fisher algorithm by comparison to annotated gene expression datasets from experiments using MCF-7 cells, including those evaluating the transcriptional effects of hormones and chemicals. Using 141 comparisons from chemical- and hormone-treated cells, the biomarker gave a balanced accuracy for prediction of ERα activation or suppression of 94% and 93%, respectively. The biomarker was able to correctly classify 18 out of 21 (86%) ER reference chemicals including "very weak" agonists. Importantly, the biomarker predictions accurately replicated predictions based on 18 in vitro high-throughput screening assays that queried different steps in ERα signaling. For 114 chemicals, the balanced accuracies were 95% and 98% for activation or suppression, respectively. These results demonstrate that the ERα gene expression biomarker can accurately identify ERα modulators in large collections of microarray data derived from MCF-7 cells. Published by Oxford University Press on behalf of the Society of Toxicology 2016. This work is written by US Government employees and is in the public domain in the US.
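The abstract reports performance as balanced accuracy, which is the mean of sensitivity and specificity and is therefore less affected by class imbalance than plain accuracy. A minimal sketch of the calculation follows; the confusion-matrix counts are hypothetical and are not the study's results.

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity (true-positive rate) and specificity (true-negative rate)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Hypothetical counts for an "activation vs. no activation" call.
score = balanced_accuracy(tp=18, fn=2, tn=95, fp=5)
print(f"balanced accuracy: {score:.3f}")
```

With these made-up counts, plain accuracy would be dominated by the larger negative class, whereas the balanced figure weights both classes equally.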
Similar compounds searching system by using the gene expression microarray database.
Toyoshiba, Hiroyoshi; Sawada, Hiroshi; Naeshiro, Ichiro; Horinouchi, Akira
2009-04-10
Large numbers of microarrays have been examined, and several public and commercial databases have been developed. However, it is not easy to compare in-house microarray data with those in a database because of insufficient reproducibility due to differences in the experimental conditions. As one approach to using these databases, we developed the similar compounds searching system (SCSS) on a toxicogenomics database. The datasets of 55 compounds administered to rats in the Toxicogenomics Project (TGP) database in Japan were used in this study. Using the fold-change ranking method developed by Lamb et al. [Lamb, J., Crawford, E.D., Peck, D., Modell, J.W., Blat, I.C., Wrobel, M.J., Lerner, J., Brunet, J.P., Subramanian, A., Ross, K.N., Reich, M., Hieronymus, H., Wei, G., Armstrong, S.A., Haggarty, S.J., Clemons, P.A., Wei, R., Carr, S.A., Lander, E.S., Golub, T.R., 2006. The connectivity map: using gene-expression signatures to connect small molecules, genes, and disease. Science 313, 1929-1935] and a criterion called the hit ratio, the system lets us compare in-house microarray data with those in the database. In-house generated data for clofibrate, phenobarbital, and a proprietary compound were tested to evaluate the performance of the SCSS method. Phenobarbital and clofibrate, which were included in the TGP database, scored highest by the SCSS method. Other high-scoring compounds had effects similar to either phenobarbital (a cytochrome P450 inducer) or clofibrate (a peroxisome proliferator). Some of the high-scoring compounds identified using the proprietary compound-administered rats are known to cause similar toxicological changes in different species. Our results suggest that the SCSS method could be used in drug discovery and development. Moreover, this method may be a powerful tool to understand the mechanisms by which biological systems respond to various chemical compounds and may also predict adverse effects of new compounds.
Swindell, William R
2007-01-01
Background: Long-lived strains of dwarf mice carry mutations that suppress growth hormone (GH) and insulin-like growth factor I (IGF-I) signaling. The downstream effects of these endocrine abnormalities, however, are not well understood and it is unclear how these processes interact with aging mechanisms. This study presents a comparative analysis of microarray experiments that have measured hepatic gene expression levels in long-lived strains carrying one of four mutations (Prop1df/df, Pit1dw/dw, Ghrhrlit/lit, GHR-KO) and describes how the effects of these mutations relate to one another at the transcriptional level. Points of overlap with the effects of calorie restriction (CR), CR mimetic compounds, low fat diets, gender dimorphism and aging were also examined. Results: All dwarf mutations had larger and more consistent effects on IGF-I expression than dietary treatments. In comparison to dwarf mutations, however, the transcriptional effects of CR (and some CR mimetics) overlapped more strongly with those of aging. Surprisingly, the Ghrhrlit/lit mutation had much larger effects on gene expression than the GHR-KO mutation, even though both mutations affect the same endocrine pathway. Several genes potentially regulated or co-regulated with the IGF-I transcript in liver tissue were identified, including a DNA repair gene (Snm1) that is upregulated in proportion to IGF-I inhibition. A total of 13 genes exhibiting parallel differential expression patterns among all four strains of long-lived dwarf mice were identified, in addition to 30 genes with matching differential expression patterns in multiple long-lived dwarf strains and under CR. Conclusion: Comparative analysis of microarray datasets can identify patterns and consistencies not discernible from any one dataset individually. This study implements new analytical approaches to provide a detailed comparison among the effects of life-extending mutations, dietary treatments, gender and aging. 
This comparison provides insight into a broad range of issues relevant to the study of mammalian aging. In this context, 43 longevity-associated genes are identified and individual genes with the highest level of support among all microarray experiments are highlighted. These results provide promising targets for future experimental investigation as well as potential clues for understanding the functional basis of lifespan extension in mammalian systems. PMID:17915019
Causes and Consequences of Genetic Background Effects Illuminated by Integrative Genomic Analysis
Chandler, Christopher H.; Chari, Sudarshan; Dworkin, Ian
2014-01-01
The phenotypic consequences of individual mutations are modulated by the wild-type genetic background in which they occur. Although such background dependence is widely observed, we do not know whether general patterns across species and traits exist or about the mechanisms underlying it. We also lack knowledge on how mutations interact with genetic background to influence gene expression and how this in turn mediates mutant phenotypes. Furthermore, how genetic background influences patterns of epistasis remains unclear. To investigate the genetic basis and genomic consequences of genetic background dependence of the scallopedE3 allele on the Drosophila melanogaster wing, we generated multiple novel genome-level datasets from a mapping-by-introgression experiment and a tagged RNA gene expression dataset. In addition we used whole genome resequencing of the parental lines—two commonly used laboratory strains—to predict polymorphic transcription factor binding sites for SD. We integrated these data with previously published genomic datasets from expression microarrays and a modifier mutation screen. By searching for genes showing a congruent signal across multiple datasets, we were able to identify a robust set of candidate loci contributing to the background-dependent effects of mutations in sd. We also show that the majority of background-dependent modifiers previously reported are caused by higher-order epistasis, not quantitative noncomplementation. These findings provide a useful foundation for more detailed investigations of genetic background dependence in this system, and this approach is likely to prove useful in exploring the genetic basis of other traits as well. PMID:24504186
Tall, Ben Davies; Gangiredla, Jayanthi; Gopinath, Gopal R.; Yan, Qiongqiong; Chase, Hannah R.; Lee, Boram; Hwang, Seongeun; Trach, Larisa; Park, Eunbi; Yoo, YeonJoo; Chung, TaeJung; Jackson, Scott A.; Patel, Isha R.; Sathyamoorthy, Venugopal; Pava-Ripoll, Monica; Kotewicz, Michael L.; Carter, Laurenda; Iversen, Carol; Pagotto, Franco; Stephan, Roger; Lehner, Angelika; Fanning, Séamus; Grim, Christopher J.
2015-01-01
Cronobacter species cause infections in all age groups; however, neonates are at highest risk and remain the most susceptible age group for life-threatening invasive disease. The genus contains seven species: Cronobacter sakazakii, Cronobacter malonaticus, Cronobacter turicensis, Cronobacter muytjensii, Cronobacter dublinensis, Cronobacter universalis, and Cronobacter condimenti. Despite an abundance of published genomes of these species, genomics-based epidemiology of the genus is not well established. The gene content of a diverse group of 126 unique Cronobacter and taxonomically related isolates was determined using a pan-genomic DNA microarray as a genotyping tool and as a means to identify outbreak isolates for food safety, environmental, and clinical surveillance purposes. The microarray comprises 19,287 independent genes representing 15 Cronobacter genomes and 18 plasmids and 2,371 virulence factor genes of phylogenetically related Gram-negative bacteria. The Cronobacter microarray was able to distinguish the seven Cronobacter species from one another and from non-Cronobacter species; and within each species, strains grouped into distinct clusters based on their genomic diversity. These results also support the phylogenetic divergence of the genus and clearly highlight the genomic diversity among the members of the genus. The current study establishes a powerful platform for further genomics research of this diverse genus, an important prerequisite toward the development of future countermeasures against this foodborne pathogen in the food safety and clinical arenas. PMID:25984509
A 16-Gene Signature Distinguishes Anaplastic Astrocytoma from Glioblastoma
Rao, Soumya Alige Mahabala; Srinivasan, Sujaya; Patric, Irene Rosita Pia; Hegde, Alangar Sathyaranjandas; Chandramouli, Bangalore Ashwathnarayanara; Arimappamagan, Arivazhagan; Santosh, Vani; Kondaiah, Paturu; Rao, Manchanahalli R. Sathyanarayana; Somasundaram, Kumaravel
2014-01-01
Anaplastic astrocytoma (AA; Grade III) and glioblastoma (GBM; Grade IV) are diffusely infiltrating tumors and are called malignant astrocytomas. The treatment regimen and prognosis are distinctly different between anaplastic astrocytoma and glioblastoma patients. Although the current histopathology-based grading system is well accepted and largely reproducible, intratumoral histologic variations often lead to difficulties in classification of malignant astrocytoma samples. In order to obtain a more robust molecular classifier, we analysed RT-qPCR expression data of 175 differentially regulated genes across astrocytoma using Prediction Analysis of Microarrays (PAM) and found the most discriminatory 16-gene expression signature for the classification of anaplastic astrocytoma and glioblastoma. The 16-gene signature obtained in the training set was validated in the test set with diagnostic accuracy of 89%. Additionally, validation of the 16-gene signature in multiple independent cohorts revealed that the signature predicted anaplastic astrocytoma and glioblastoma samples with accuracy rates of 99%, 88%, and 92% in TCGA, GSE1993 and GSE4422 datasets, respectively. Protein-protein interaction network and pathway analysis of the 16 signature genes identified the epithelial-mesenchymal transition (EMT) pathway as the most differentially regulated pathway in glioblastoma compared to anaplastic astrocytoma. In addition to identifying a 16-gene classification signature, we also demonstrated that genes involved in epithelial-mesenchymal transition may play an important role in distinguishing glioblastoma from anaplastic astrocytoma. PMID:24475040
Brodsky, Alexander S.; Fischer, Andrew; Miller, Daniel H.; Vang, Souriya; MacLaughlan, Shannon; Wu, Hsin-Ta; Yu, Jovian; Steinhoff, Margaret; Collins, Colin; Smith, Peter J. S.; Raphael, Benjamin J.; Brard, Laurent
2014-01-01
The behavior and genetics of serous epithelial ovarian cancer (EOC) metastasis, the form of the disease lethal to patients, are poorly understood. Understanding the unique properties of metastases is critical to improving treatment of the disease that remains in patients after debulking surgery. We sought to identify the genetic and phenotypic landscape of metastatic progression of EOC to understand how metastases compare to primary tumors. DNA copy number and mRNA expression differences between matched primary human tumors and omental metastases, collected at the same time during debulking surgery before chemotherapy, were measured using microarrays. qPCR and immunohistochemistry validated findings. Pathway analysis of mRNA expression revealed metastatic cancer cells are more proliferative and less apoptotic than primary tumors, perhaps explaining the aggressive nature of these lesions. Most cases had copy number aberrations (CNAs) that differed between primary and metastatic tumors, but we did not detect CNAs that are recurrent across cases. A six-gene expression signature distinguishes primary from metastatic tumors and predicts overall survival in independent datasets. The genetic differences between primary and metastatic tumors, together with the common expression changes, suggest that the major clone in metastases is not the same as in primary tumors, but that the cancer cells adapt to the omentum similarly. Together, these data highlight how ovarian tumors develop into a distinct, more aggressive metastatic state that should be considered for therapy development. PMID:24732363
hsa-miR-135a-1 inhibits prostate cancer cell growth and migration by targeting EGFR.
Xu, Bin; Tao, Tao; Wang, Yiduo; Fang, Fang; Huang, Yeqing; Chen, Shuqiu; Zhu, Weidong; Chen, Ming
2016-10-01
Prostate cancer is one of the leading causes of death in men worldwide. Differentially expressed microRNAs (miRNAs) are associated with metastatic prostate cancer. However, their potential roles in prostate cancer initiation and progression remain largely unknown. Here, we examined the aberrant expression profiles of miRNAs in human metastatic prostate cancer tissues. We further validated our miRNA expression data using two large, independent clinical prostate cancer datasets from the Memorial Sloan Kettering Cancer Center (MSKCC) and The Cancer Genome Atlas (TCGA). Our data support a model in which hsa-miR-135a-1 acts as a potential tumor suppressor in metastatic prostate cancer. First, its downregulation was positively correlated with late TNM stage, high Gleason score, and adverse prognosis. Second, cell growth, cell cycle progression, cell migration and invasion, and xenograft tumor formation were dramatically inhibited by miR-135a overexpression. Third, in the microarray gene expression data analysis using Gene Set Enrichment Analysis (GSEA), Database for Annotation, Visualization and Integrated Discovery (DAVID) analysis, Ingenuity Pathway Analysis (IPA), and Oncomine concept analysis, we showed that miR-135a targets multiple oncogenic pathways including epidermal growth factor receptor (EGFR), which we verified using functional experimental assays. These results help advance our understanding of the function of miRNAs in metastatic prostate cancer and provide a basis for further clinical investigation.
Song, Zhonghua; Zhao, Wenhua; Cao, Danfeng; Zhang, Jinqing; Chen, Shouhua
2018-01-01
Gastric cancer (GC) is the fifth most common cancer and the third leading cause of cancer-related deaths worldwide. The high mortality might be attributed to delay in detection and is closely related to lymph node metastasis. Therefore, it is of great importance to explore the mechanism of lymph node metastasis and find strategies to block GC metastasis. Messenger RNA (mRNA), microRNA (miRNA) and long non-coding RNA (lncRNA) expression data and clinical data were downloaded from The Cancer Genome Atlas (TCGA) database. A total of 908 differentially expressed factors with variance >0.5, including 542 genes, 42 miRNAs, and 324 lncRNAs, were screened using the Significance Analysis of Microarrays (SAM) algorithm, and interaction networks were constructed using these differentially expressed factors. Furthermore, we conducted functional module analysis in the network, and found that the yellow and turquoise modules could separate samples efficiently. The groups classified in the yellow and turquoise modules had a significant difference in survival time, which was verified in another independent GC mRNA dataset (GSE62254). The results suggested that differentially expressed factors in the yellow and turquoise modules may participate in lymph node metastasis of GC and could be applied as potential biomarkers or therapeutic targets for GC. PMID:29489999
Cimmino, Flora; Pezone, Lucia; Avitabile, Marianna; Acierno, Giovanni; Andolfo, Immacolata; Capasso, Mario; Iolascon, Achille
2015-06-09
Neuroblastoma (NBL) is a heterogeneous tumor characterized by a wide range of clinical manifestations. A high tumor cell differentiation grade correlates with a favorable stage and positive outcome. Expression of the hypoxia inducible factors HIF1-α (HIF1A gene) and HIF2-α (EPAS1 gene) and/or hypoxia-regulated pathways has been shown to promote the undifferentiated phenotype of NBL cells. Our hypothesis is that HIF1A and EPAS1 expression represents one of the mechanisms responsible for the lack of responsiveness of NBL to differentiation therapy. Clinically, high levels of HIF1A and EPAS1 expression were associated with inferior survival in two NBL microarray datasets, and patient subgroups with lower expression of HIF1A and EPAS1 showed significant enrichment of pathways related to neuronal differentiation. In NBL cell lines, the combination of all-trans retinoic acid (ATRA) with HIF1A or EPAS1 silencing led to an acquired glial-cell phenotype and enhanced expression of glial-cell differentiation markers. Furthermore, HIF1A or EPAS1 silencing might promote cell senescence independent of ATRA treatment. Taken together, our data suggest that HIF inhibition coupled with ATRA treatment promotes differentiation into a more benign phenotype and cell senescence in vitro. These findings open the way for additional lines of attack in the treatment of NBL minimal residual disease. PMID:26057707
APELA promotes tumour growth and cell migration in ovarian cancer in a p53-dependent manner.
Yi, Yuyin; Tsai, Shu-Huei; Cheng, Jung-Chien; Wang, Evan Y; Anglesio, Michael S; Cochrane, Dawn R; Fuller, Megan; Gibb, Ewan A; Wei, Wei; Huntsman, David G; Karsan, Aly; Hoodless, Pamela A
2017-12-01
APELA is a small, secreted peptide that can function as a ligand for the G-protein coupled receptor, Apelin Receptor (APLNR, APJ). APELA plays an essential role in endoderm differentiation and cardiac development during embryogenesis. We investigated whether APELA exerts any functions in cancer progression. The Cancer Genome Atlas (TCGA) RNA sequencing datasets, microarray from an OCCC mouse model, and RNA isolated from fresh frozen and FFPE patient tissue were used to assess APELA expression. APELA knockout ovarian clear cell carcinoma (OCCC) cell lines were generated using CRISPR/Cas9. APELA was expressed in various ovarian cancer histotypes and was especially elevated in OCCC. Disruption of APELA expression in OCCC cell lines suppressed cell growth and migration, and altered cell-cycle progression. Moreover, addition of human recombinant APELA peptide to the OCCC cell line OVISE promoted cell growth and migration. Interestingly, OVISE cells do not express APLNR, suggesting that APELA can function through an APLNR-independent pathway. Furthermore, APELA affected cell growth and cell cycle progression in a p53-dependent manner. In addition, APELA knockdown induced p53 expression in cancer cell lines. Our findings uncover a potential oncogenic role for APELA in promoting ovarian tumour progression and provide a possible therapeutic strategy in ovarian cancer by targeting APELA. Copyright © 2017 Elsevier Inc. All rights reserved.
Girard, Laurie D; Boissinot, Karel; Peytavi, Régis; Boissinot, Maurice; Bergeron, Michel G
2015-02-07
The combination of molecular diagnostic technologies is increasingly used to overcome limitations on sensitivity, specificity or multiplexing capabilities, and provide efficient lab-on-chip devices. Two such techniques, PCR amplification and microarray hybridization, are used serially to take advantage of the high sensitivity and specificity of the former combined with the high multiplexing capacity of the latter. These methods are usually performed in different buffers and reaction chambers. However, these elaborate methods have high complexity and cost related to reagent requirements, liquid storage and the number of reaction chambers to integrate into automated devices. Furthermore, microarray hybridizations have a sequence-dependent efficiency that is not always predictable. In this work, we have developed the concept of a structured oligonucleotide probe which is activated by cleavage from polymerase exonuclease activity. This technology is called SCISSOHR for Structured Cleavage Induced Single-Stranded Oligonucleotide Hybridization Reaction. The SCISSOHR probes enable indexing the target sequence to a tag sequence. The SCISSOHR technology also allows the combination of nucleic acid amplification and microarray hybridization in a single vessel in the presence of the PCR buffer only. The SCISSOHR technology uses an amplification probe that is irreversibly modified in the presence of the target, releasing a single-stranded DNA tag for microarray hybridization. Each tag is composed of a 3-nucleotide sequence-dependent segment and a unique "target sequence-independent" 14-nucleotide segment allowing for optimal hybridization with minimal cross-hybridization. We evaluated the performance of five (5) PCR buffers to support microarray hybridization, compared to a conventional hybridization buffer. Finally, as a proof of concept, we developed a multiplexed assay for the amplification, detection, and identification of three (3) DNA targets. 
This new technology will facilitate the design of lab-on-chip microfluidic devices, while also reducing consumable costs. In the long term, it will allow the cost-effective automation of highly multiplexed assays for detection and identification of genetic targets.
Bengtsson, Henrik; Hössjer, Ola
2006-03-01
Low-level processing and normalization of microarray data are among the most important steps in microarray analysis and have a profound impact on downstream analysis. Multiple methods have been suggested to date, but it is not clear which is the best. It is therefore important to further study the different normalization methods in detail and the nature of microarray data in general. A methodological study of affine models for gene expression data is carried out. The focus is on two-channel comparative studies, but the findings also generalize to single- and multi-channel data. The discussion applies to spotted as well as in-situ synthesized microarray data. Existing normalization methods such as curve-fit ("lowess") normalization, parallel and perpendicular translation normalization, quantile normalization, and dye-swap normalization are revisited in the light of the affine model, and their strengths and weaknesses are investigated in this context. As a direct result of this study, we propose a robust non-parametric multi-dimensional affine normalization method, which can be applied to any number of microarrays with any number of channels either individually or all at once. A high-quality cDNA microarray data set with spike-in controls is used to demonstrate the power of the affine model and the proposed normalization method. We find that an affine model can explain non-linear intensity-dependent systematic effects in observed log-ratios. Affine normalization removes such artifacts for non-differentially expressed genes and ensures that symmetry between negative and positive log-ratios is obtained, which is fundamental when identifying differentially expressed genes. In addition, affine normalization makes the empirical distributions in different channels more equal, which is the purpose of quantile normalization, and may also explain why dye-swap normalization works or fails. 
All methods are made available in the aroma package, which is a platform-independent package for R.
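The claim that an affine model can explain intensity-dependent effects in log-ratios can be illustrated with a small numerical sketch. It assumes each channel reports offset + scale * true_signal; the offsets, scales, and function names below are hypothetical and do not come from the aroma package or the paper's data.

```python
import math


def observed(true_signal, offset, scale):
    """Affine measurement model: observed = offset + scale * true signal."""
    return offset + scale * true_signal


def log_ratio(true_signal, offset_r, offset_g, scale_r=1.0, scale_g=1.0):
    """log2(red/green) for a gene with identical true signal in both channels."""
    red = observed(true_signal, offset_r, scale_r)
    green = observed(true_signal, offset_g, scale_g)
    return math.log2(red / green)


# Hypothetical channel offsets; the gene is NOT differentially expressed.
for x in (50, 500, 5000, 50000):
    biased = log_ratio(x, offset_r=20.0, offset_g=5.0)
    corrected = log_ratio(x, offset_r=0.0, offset_g=0.0)  # offsets removed
    print(f"signal={x:>6}  biased={biased:+.3f}  corrected={corrected:+.3f}")
```

With unequal offsets the log-ratio of an unchanged gene drifts away from zero at low intensity and approaches zero at high intensity, which is the curved pattern that curve-fit normalization is usually asked to remove; once the offsets are removed, the log-ratio is zero at every intensity.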
Protein profiles associated with survival in lung adenocarcinoma
Chen, Guoan; Gharib, Tarek G; Wang, Hong; Huang, Chiang-Ching; Kuick, Rork; Thomas, Dafydd G.; Shedden, Kerby A.; Misek, David E.; Taylor, Jeremy M. G.; Giordano, Thomas J.; Kardia, Sharon L. R.; Iannettoni, Mark D.; Yee, John; Hogg, Philip J.; Orringer, Mark B.; Hanash, Samir M.; Beer, David G.
2003-01-01
Morphologic assessment of lung tumors is informative but insufficient to adequately predict patient outcome. We previously identified transcriptional profiles that predict patient survival, and here we identify proteins associated with patient survival in lung adenocarcinoma. A total of 682 individual protein spots were quantified in 90 lung adenocarcinomas by using quantitative two-dimensional polyacrylamide gel electrophoresis analysis. A leave-one-out cross-validation procedure using the top 20 survival-associated proteins identified by Cox modeling indicated that protein profiles as a whole can predict survival in stage I tumor patients (P = 0.01). Thirty-three of 46 survival-associated proteins were identified by using mass spectrometry. Expression of 12 candidate proteins was confirmed as tumor-derived with immunohistochemical analysis and tissue microarrays. Oligonucleotide microarray results from both the same tumors and from an independent study showed mRNAs associated with survival for 11 of 27 encoded genes. Combined analysis of protein and mRNA data revealed 11 components of the glycolysis pathway as associated with poor survival. Among these candidates, phosphoglycerate kinase 1 was associated with survival in the protein study, in both mRNA studies and in an independent validation set of 117 adenocarcinomas and squamous lung tumors using tissue microarrays. Elevated levels of phosphoglycerate kinase 1 in the serum were also significantly correlated with poor outcome in a validation set of 107 patients with lung adenocarcinomas using ELISA analysis. These studies identify new prognostic biomarkers and indicate that protein expression profiles can predict the outcome of patients with early-stage lung cancer. PMID:14573703
Uddin, Raihan; Singh, Shiva M.
2017-01-01
As humans age, many suffer from a decrease in normal brain functions including spatial learning impairments. This study aimed to better understand the molecular mechanisms in age-associated spatial learning impairment (ASLI). We used a mathematical modeling approach implemented in Weighted Gene Co-expression Network Analysis (WGCNA) to create and compare gene network models of young (learning unimpaired) and aged (predominantly learning impaired) brains from a set of exploratory datasets in rats in the context of ASLI. The major goal was to overcome some of the limitations previously observed in the traditional meta- and pathway analysis using these data, and identify novel ASLI-related genes and their networks based on co-expression relationships of genes. This analysis identified a set of network modules in the young, each of which is highly enriched with genes functioning in broad but distinct GO functional categories or biological pathways. Interestingly, the analysis pointed to a single module that was highly enriched with genes functioning in “learning and memory” related functions and pathways. Subsequent differential network analysis of this “learning and memory” module in the aged (predominantly learning impaired) rats compared to the young learning unimpaired rats allowed us to identify a set of novel ASLI candidate hub genes. Some of these genes show significant repeatability in networks generated from independent young and aged validation datasets. These hub genes are highly co-expressed with other genes in the network, which not only show differential expression but also differential co-expression and differential connectivity across age and learning impairment. The known functions of these hub genes indicate that they play key roles in critical pathways, including kinase and phosphatase signaling, in functions related to various ion channels, and in maintaining neuronal integrity relating to synaptic plasticity and memory formation. 
Taken together, they provide new insight and generate new hypotheses into the molecular mechanisms responsible for age-associated learning impairment, including spatial learning. PMID:29066959
Johnson, Nathan T; Dhroso, Andi; Hughes, Katelyn J; Korkin, Dmitry
2018-06-25
The extent to which genes are expressed in the cell can be simplistically defined as a function of one or more environmental, lifestyle, and genetic factors. RNA sequencing (RNA-Seq) is becoming a prevalent approach to quantify gene expression, and is expected to provide better insights into a number of biological and biomedical questions than DNA microarrays. Most importantly, RNA-Seq allows expression to be quantified at the gene and alternative splicing isoform levels. However, leveraging RNA-Seq data requires the development of new data mining and analytics methods. Supervised machine learning methods are commonly used for biological data analysis, and have recently gained attention for their applications to RNA-Seq data. In this work, we assess the utility of supervised learning methods trained on RNA-Seq data for a diverse range of biological classification tasks. We hypothesize that isoform-level expression data are more informative for biological classification tasks than gene-level expression data. Our large-scale assessment utilizes multiple datasets, organisms, lab groups, and RNA-Seq analysis pipelines. Overall, we performed and assessed 61 biological classification problems that leverage three independent RNA-Seq datasets and include over 2,000 samples that come from multiple organisms, lab groups, and RNA-Seq analyses. These 61 problems include predictions of the tissue type, sex, or age of the sample, healthy or cancerous phenotypes, and the pathological tumor stage for samples from cancerous tissue. For each classification problem, the performance of three normalization techniques and six machine learning classifiers was explored. We find that for every single classification problem, the isoform-based classifiers outperform or are comparable with gene expression based methods. 
The top-performing supervised learning techniques reached a near perfect classification accuracy, demonstrating the utility of supervised learning for RNA-Seq based data analysis. Published by Cold Spring Harbor Laboratory Press for the RNA Society.
Wang, Yi Kan; Hurley, Daniel G.; Schnell, Santiago; Print, Cristin G.; Crampin, Edmund J.
2013-01-01
We develop a new regression algorithm, cMIKANA, for inference of gene regulatory networks from combinations of steady-state and time-series gene expression data. Using simulated gene expression datasets to assess the accuracy of reconstructing gene regulatory networks, we show that steady-state and time-series data sets can successfully be combined to identify gene regulatory interactions using the new algorithm. Inferring gene networks from combined data sets was found to be advantageous when using noisy measurements collected with either lower sampling rates or a limited number of experimental replicates. We illustrate our method by applying it to a microarray gene expression dataset from human umbilical vein endothelial cells (HUVECs) which combines time series data from treatment with growth factor TNF and steady state data from siRNA knockdown treatments. Our results suggest that the combination of steady-state and time-series datasets may provide better prediction of RNA-to-RNA interactions, and may also reveal biological features that cannot be identified from dynamic or steady state information alone. Finally, we consider the experimental design of genomics experiments for gene regulatory network inference and show that network inference can be improved by incorporating steady-state measurements with time-series data. PMID:23967277
Biswas, Surama; Dutta, Subarna; Acharyya, Sriyankar
2017-12-01
Identifying a small subset of disease-critical genes from large microarray gene expression datasets is a challenge in computational life sciences. This paper applies four meta-heuristic algorithms, namely honey bee mating optimization (HBMO), harmony search (HS), differential evolution (DE) and a genetic algorithm (basic version, GA), to find disease-critical genes of preeclampsia, which affects women during gestation. Two hybrid algorithms, HBMO-kNN and HS-kNN, are newly proposed here, where kNN (k-nearest-neighbor classifier) is used for sample classification. The performance of these new approaches is compared with that of two other hybrid algorithms, DE-kNN and SGA-kNN. Three datasets of different sizes were used. Within a dataset, the set of genes common to the output of every algorithm is considered here to be the disease-critical genes. Across datasets, the classification accuracy of the meta-heuristic algorithms varied between 92.46% and 100%. HBMO-kNN has the best performance (99.64-100%) in almost all datasets. DE-kNN takes second place (99.42-100%). The disease-critical genes obtained here match clinically identified preeclampsia genes to a large extent.
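The role of kNN in the hybrid schemes described above can be illustrated with a minimal sketch: a candidate gene subset is scored by the leave-one-out accuracy of a kNN classifier restricted to those genes. The function names, the toy expression values, the Euclidean distance and k=3 are all assumptions made for illustration; none of them come from the paper, and the meta-heuristic search that proposes gene subsets is not shown.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training samples closest to `query`."""
    ranked = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def loo_accuracy(samples, labels, gene_idx, k=3):
    """Leave-one-out kNN accuracy using only the genes listed in `gene_idx`."""
    reduced = [[row[g] for g in gene_idx] for row in samples]
    correct = 0
    for i, (x, y) in enumerate(zip(reduced, labels)):
        rest_x = reduced[:i] + reduced[i + 1:]
        rest_y = labels[:i] + labels[i + 1:]
        if knn_predict(rest_x, rest_y, x, k) == y:
            correct += 1
    return correct / len(samples)

# Hypothetical toy data: 6 samples x 3 genes (made-up values).
samples = [
    [1.0, 5.0, 0.2],
    [1.1, 4.0, 0.9],
    [0.9, 6.0, 0.5],
    [3.0, 5.0, 0.3],
    [3.2, 4.5, 0.8],
    [2.9, 5.5, 0.1],
]
labels = ["control", "control", "control", "case", "case", "case"]

# Gene 0 separates the two groups in this toy set, so a subset containing
# only gene 0 scores perfectly; a search procedure would favour it.
print(loo_accuracy(samples, labels, gene_idx=[0]))
```

In a hybrid scheme of this kind, a score such as `loo_accuracy` would typically serve as the fitness value that the search algorithm tries to maximise over gene subsets.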
Booma, P. M.; Prabhakaran, S.; Dhanalakshmi, R.
2014-01-01
Microarray gene expression datasets have attracted great attention among molecular biologists, statisticians, and computer scientists. Data mining that extracts the hidden and usual information from datasets fails to identify the most significant biological associations between genes. A heuristic search for standard biological processes measures only the gene expression level, threshold, and response time. Heuristic search identifies and mines the best biological solution, but the association process was not efficiently addressed. To monitor higher rates of expression between genes, a hierarchical clustering model is proposed in which the biological association between genes is measured simultaneously using a proximity measure based on an improved Pearson's correlation (PCPHC). Additionally, the Seed Augment algorithm adopts average linkage methods on rows and columns in order to expand a seed PCPHC model into a maximal global PCPHC (GL-PCPHC) model and to identify associations between the clusters. Moreover, GL-PCPHC applies a pattern-growing method to mine the PCPHC patterns. Compared to existing gene expression analysis, the PCPHC model achieves better performance. Experimental evaluations of the GL-PCPHC model are conducted with standard benchmark gene expression datasets extracted from the UCI repository and the GenBank database in terms of execution time, size of pattern, significance level, biological association efficiency, and pattern quality. PMID:25136661
Prom-On, Santitham; Chanthaphan, Atthawut; Chan, Jonathan Hoyin; Meechai, Asawin
2011-02-01
Relationships among gene expression levels may be associated with the mechanisms of the disease. While identifying a direct association such as a difference in expression levels between case and control groups links genes to disease mechanisms, uncovering an indirect association in the form of a network structure may help reveal the underlying functional module associated with the disease under scrutiny. This paper presents a method to improve the biological relevance in functional module identification from the gene expression microarray data by enhancing the structure of a weighted gene co-expression network using minimum spanning tree. The enhanced network, which is called a backbone network, contains only the essential structural information to represent the gene co-expression network. The entire backbone network is decoupled into a number of coherent sub-networks, and then the functional modules are reconstructed from these sub-networks to ensure minimum redundancy. The method was tested with a simulated gene expression dataset and case-control expression datasets of autism spectrum disorder and colorectal cancer studies. The results indicate that the proposed method can accurately identify clusters in the simulated dataset, and the functional modules of the backbone network are more biologically relevant than those obtained from the original approach.
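The backbone idea described above can be sketched as follows: build a fully connected co-expression graph, weight each edge by a distance derived from the correlation, and keep only the edges of its minimum spanning tree. This is a minimal sketch under stated assumptions, not the authors' implementation: the distance 1 - |r|, the use of Kruskal's algorithm, the function names and the toy expression profiles are all hypothetical, and the later decoupling into sub-networks is not shown.

```python
import math
from itertools import combinations

def pearson(a, b):
    """Pearson correlation of two equal-length profiles (assumes non-constant input)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def mst_backbone(expr):
    """Kruskal's algorithm on the complete graph with edge distance 1 - |r|.

    Returns a list of (gene_a, gene_b, |r|) edges forming the spanning tree.
    """
    genes = list(expr)
    edges = sorted(
        (1.0 - abs(pearson(expr[a], expr[b])), a, b)
        for a, b in combinations(genes, 2)
    )
    parent = {g: g for g in genes}

    def find(g):
        # Union-find root lookup with path halving.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    backbone = []
    for dist, a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:          # the edge joins two separate components
            parent[root_a] = root_b
            backbone.append((a, b, 1.0 - dist))
    return backbone

# Hypothetical expression profiles: 4 genes x 5 samples (made-up values).
expr = {
    "g1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "g2": [2.0, 4.0, 6.0, 8.0, 10.5],
    "g3": [5.0, 4.0, 3.0, 2.0, 1.2],
    "g4": [1.0, 3.0, 2.0, 5.0, 4.0],
}
for a, b, strength in mst_backbone(expr):
    print(a, b, round(strength, 3))
```

A spanning tree over n genes always has n - 1 edges, which is why the backbone is so much sparser than the original weighted network while still connecting every gene.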
SBRML: a markup language for associating systems biology data with models.
Dada, Joseph O; Spasić, Irena; Paton, Norman W; Mendes, Pedro
2010-04-01
Research in systems biology is carried out through a combination of experiments and models. Several data standards have been adopted for representing models (Systems Biology Markup Language) and various types of relevant experimental data (such as FuGE and those of the Proteomics Standards Initiative). However, until now, there has been no standard way to associate a model and its entities to the corresponding datasets, or vice versa. Such a standard would provide a means to represent computational simulation results as well as to frame experimental data in the context of a particular model. Target applications include model-driven data analysis, parameter estimation, and sharing and archiving model simulations. We propose the Systems Biology Results Markup Language (SBRML), an XML-based language that associates a model with several datasets. Each dataset is represented as a series of values associated with model variables, and their corresponding parameter values. SBRML provides a flexible way of indexing the results to model parameter values, which supports both spreadsheet-like data and multidimensional data cubes. We present and discuss several examples of SBRML usage in applications such as enzyme kinetics, microarray gene expression and various types of simulation results. The XML Schema file for SBRML is available at http://www.comp-sys-bio.org/SBRML under the Academic Free License (AFL) v3.0.
Exudate-based diabetic macular edema detection in fundus images using publicly available datasets
DOE Office of Scientific and Technical Information (OSTI.GOV)
Giancardo, Luca; Meriaudeau, Fabrice; Karnowski, Thomas Paul
2011-01-01
Diabetic macular edema (DME) is a common vision-threatening complication of diabetic retinopathy. In a large-scale screening environment, DME can be assessed by detecting exudates (a type of bright lesion) in fundus images. In this work, we introduce a new methodology for diagnosis of DME using a novel set of features based on colour, wavelet decomposition and automatic lesion segmentation. These features are employed to train a classifier able to automatically diagnose DME through the presence of exudation. We present a new publicly available dataset with ground-truth data containing 169 patients from various ethnic groups and levels of DME. This and two other publicly available datasets are employed to evaluate our algorithm. We are able to achieve diagnosis performance comparable to retina experts on MESSIDOR (an independently labelled dataset with 1200 images) with cross-dataset testing (e.g., the classifier was trained on an independent dataset and tested on MESSIDOR). Our algorithm obtained an AUC between 0.88 and 0.94 depending on the dataset/features used. Additionally, it does not need ground truth at lesion level to reject false positives and is computationally efficient, as it generates a diagnosis in an average of 4.4 s (9.3 s, considering the optic nerve localization) per image on a 2.6 GHz platform with an unoptimized Matlab implementation.
Strand-specific transcriptome profiling with directly labeled RNA on genomic tiling microarrays
2011-01-01
Background With lower manufacturing cost, high spot density, and flexible probe design, genomic tiling microarrays are ideal for comprehensive transcriptome studies. Typically, transcriptome profiling using microarrays involves reverse transcription, which converts RNA to cDNA. The cDNA is then labeled and hybridized to the probes on the arrays, thus the RNA signals are detected indirectly. Reverse transcription is known to generate artifactual cDNA, in particular the synthesis of second-strand cDNA, leading to false discovery of antisense RNA. To address this issue, we have developed an effective method using RNA that is directly labeled, thus by-passing the cDNA generation. This paper describes this method and its application to the mapping of transcriptome profiles. Results RNA extracted from laboratory cultures of Porphyromonas gingivalis was fluorescently labeled with an alkylation reagent and hybridized directly to probes on genomic tiling microarrays specifically designed for this periodontal pathogen. The generated transcriptome profile was strand-specific and produced signals close to background level in most antisense regions of the genome. In contrast, high levels of signal were detected in the antisense regions when the hybridization was done with cDNA. Five antisense areas were tested with independent strand-specific RT-PCR and none to negligible amplification was detected, indicating that the strong antisense cDNA signals were experimental artifacts. Conclusions An efficient method was developed for mapping transcriptome profiles specific to both coding strands of a bacterial genome. This method chemically labels and uses extracted RNA directly in microarray hybridization. The generated transcriptome profile was free of cDNA artifactual signals. 
In addition, this method requires fewer processing steps and is potentially more sensitive in detecting small amounts of RNA than conventional end-labeling methods, owing to the incorporation of more fluorescent molecules per RNA fragment. PMID:21235785
Classifying post-stroke fatigue: Optimal cut-off on the Fatigue Assessment Scale.
Cumming, Toby B; Mead, Gillian
2017-12-01
Post-stroke fatigue is common and has debilitating effects on independence and quality of life. The Fatigue Assessment Scale (FAS) is a valid screening tool for fatigue after stroke, but there is no established cut-off. We sought to identify the optimal cut-off for classifying post-stroke fatigue on the FAS. In retrospective analysis of two independent datasets (the '2015' and '2007' studies), we evaluated the predictive validity of FAS score against a case definition of fatigue (the criterion standard). Area under the curve (AUC) and sensitivity and specificity at the optimal cut-off were established in the larger 2015 dataset (n=126), and then independently validated in the 2007 dataset (n=52). In the 2015 dataset, AUC was 0.78 (95% CI 0.70-0.86), with the optimal ≥24 cut-off giving a sensitivity of 0.82 and specificity of 0.66. The 2007 dataset had an AUC of 0.83 (95% CI 0.71-0.94), and applying the ≥24 cut-off gave a sensitivity of 0.84 and specificity of 0.67. Post-hoc analysis of the 2015 dataset revealed that using only the 3 most predictive FAS items together ('FAS-3') also yielded good validity: AUC 0.81 (95% CI 0.73-0.89), with sensitivity of 0.83 and specificity of 0.75 at the optimal ≥8 cut-off. We propose ≥24 as a cut-off for classifying post-stroke fatigue on the FAS. While further validation work is needed, this is a positive step towards a coherent approach to reporting fatigue prevalence using the FAS. Copyright © 2017 Elsevier Inc. All rights reserved.
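The cut-off evaluation described above can be sketched in a few lines: classify a patient as fatigued when the score is at or above a threshold, compute sensitivity and specificity against the case definition, and pick the threshold that scores best. The abstract does not say how "optimal" was defined, so maximising Youden's J (sensitivity + specificity - 1) is an assumption here, as are the function names. The scores and case labels are invented toy values, not study data.

```python
def sens_spec(scores, is_case, cutoff):
    """Sensitivity and specificity when `score >= cutoff` is classified as fatigued."""
    pairs = list(zip(scores, is_case))
    tp = sum(1 for s, c in pairs if s >= cutoff and c)
    fn = sum(1 for s, c in pairs if s < cutoff and c)
    tn = sum(1 for s, c in pairs if s < cutoff and not c)
    fp = sum(1 for s, c in pairs if s >= cutoff and not c)
    return tp / (tp + fn), tn / (tn + fp)

def best_cutoff(scores, is_case):
    """Observed score that maximises Youden's J (assumed optimality criterion)."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda c: sum(sens_spec(scores, is_case, c)) - 1)

# Invented toy data: questionnaire scores and whether each patient
# meets the case definition of fatigue.
scores = [12, 18, 20, 22, 24, 25, 26, 27, 30, 35]
is_case = [False, False, False, False, True, True, False, True, True, True]

cutoff = best_cutoff(scores, is_case)
print(cutoff, sens_spec(scores, is_case, cutoff))
```

Validating on an independent dataset, as the study did, amounts to calling `sens_spec` on the second dataset with the cut-off fixed from the first rather than re-running `best_cutoff`.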
Data submission and quality in microarray-based microRNA profiling
Witwer, Kenneth W.
2014-01-01
Background Public sharing of scientific data has assumed greater importance in the ‘omics’ era. Transparency is necessary for confirmation and validation, and multiple examiners aid in extracting maximal value from large datasets. Accordingly, database submission and provision of the Minimum Information About a Microarray Experiment (MIAME) are required by most journals as a prerequisite for review or acceptance. Methods In this study, the level of data submission and MIAME compliance was reviewed for 127 articles that included microarray-based microRNA profiling and that were published from July, 2011 through April, 2012 in the journals that published the largest number of such articles—PLOS ONE, the Journal of Biological Chemistry, Blood, and Oncogene—along with articles from nine other journals, including Clinical Chemistry, that published smaller numbers of array-based articles. Results Overall, data submission was reported at publication for less than 40% of all articles, and almost 75% of articles were MIAME-noncompliant. On average, articles that included full data submission scored significantly higher on a quality metric than articles with limited or no data submission, and studies with adequate description of methods disproportionately included larger numbers of experimental repeats. Finally, for several articles that were not MIAME-compliant, data re-analysis revealed less than complete support for the published conclusions, in one case leading to retraction. Conclusions These findings buttress the hypothesis that reluctance to share data is associated with low study quality and suggest that most miRNA array investigations are underpowered and/or potentially compromised by a lack of appropriate reporting and data submission. PMID:23358751
Graubner, Felix R.; Gram, Aykut; Kautz, Ewa; Bauersachs, Stefan; Aslan, Selim; Agaoglu, Ali R.; Boos, Alois
2017-01-01
In the dog, there is no luteolysis in the absence of pregnancy. Thus, this species lacks any anti-luteolytic endocrine signal as found in other species that modulate uterine function during the critical period of pregnancy establishment. Nevertheless, in the dog an embryo-maternal communication must occur in order to prevent rejection of embryos. Based on this hypothesis, we performed microarray analysis of canine uterine samples collected during pre-attachment phase (days 10-12) and in corresponding non-pregnant controls, in order to elucidate the embryo attachment signal. An additional goal was to identify differences in uterine responses to pre-attachment embryos between dogs and other mammalian species exhibiting different reproductive patterns with regard to luteolysis, implantation, and preparation for placentation. Therefore, the canine microarray data were compared with gene sets from pigs, cattle, horses, and humans. We found 412 genes differentially regulated between the two experimental groups. The functional terms most strongly enriched in response to pre-attachment embryos related to extracellular matrix function and remodeling, and to immune and inflammatory responses. Several candidate genes were validated by semi-quantitative PCR. When compared with other species, best matches were found with human and equine counterparts. Especially for the pig, the majority of overlapping genes showed opposite expression patterns. Interestingly, 1926 genes did not pair with any of the other gene sets. Using a microarray approach, we report the uterine changes in the dog driven by the presence of embryos and compare these results with datasets from other mammalian species, finding common-, contrary-, and exclusively canine-regulated genes. PMID:28651344
Soldà, Giulia; Merlino, Giuseppe; Fina, Emanuela; Brini, Elena; Moles, Anna; Cappelletti, Vera; Daidone, Maria Grazia
2016-01-01
Numerous studies have reported the existence of tumor-promoting cells (TPC) with self-renewal potential and a relevant role in drug resistance. However, pathways and modifications involved in the maintenance of such tumor subpopulations are still only partially understood. Sequencing-based approaches offer the opportunity for a detailed study of TPC including their transcriptome modulation. Using microarrays and RNA sequencing approaches, we compared the transcriptional profiles of parental MCF7 breast cancer cells with MCF7-derived TPC (i.e. MCFS). Data were explored using different bioinformatic approaches, and major findings were experimentally validated. The different analytical pipelines (Lifescope and Cufflinks based) yielded similar although not identical results. RNA sequencing data partially overlapped microarray results and displayed a higher dynamic range, although overall the two approaches concordantly predicted pathway modifications. Several biological functions were altered in TPC, ranging from production of inflammatory cytokines (i.e., IL-8 and MCP-1) to proliferation and response to steroid hormones. More than 300 non-coding RNAs were defined as differentially expressed, and 2,471 potential splicing events were identified. A consensus signature of genes up-regulated in TPC was derived and was found to be significantly associated with insensitivity to fulvestrant in a public breast cancer patient dataset. Overall, we obtained a detailed portrait of the transcriptome of a breast cancer TPC line, highlighted the role of non-coding RNAs and differential splicing, and identified a gene signature with a potential as a context-specific biomarker in patients receiving endocrine treatment. PMID:26556871
Wiktor, Peter; Brunner, Al; Kahn, Peter; Qiu, Ji; Magee, Mitch; Bian, Xiaofang; Karthikeyan, Kailash; LaBaer, Joshua
2015-01-01
We report a device to fill an array of small chemical reaction chambers (microreactors) with reagent and then seal them using pressurized viscous liquid acting through a flexible membrane. The device enables multiple, independent chemical reactions involving free floating intermediate molecules without interference from neighboring reactions or external environments. The device is validated by protein expressed in situ directly from DNA in a microarray of ~10,000 spots with no diffusion during three hours incubation. Using the device to probe for an autoantibody cancer biomarker in blood serum sample gave five times higher signal to background ratio compared to standard protein microarray expressed on a flat microscope slide. Physical design principles to effectively fill the array of microreactors with reagent and experimental results of alternate methods for sealing the microreactors are presented. PMID:25736721
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chin, Mark H.; Qian, Weijun; Wang, Haixing
2008-02-10
The molecular mechanisms underlying the changes in the nigrostriatal pathway in Parkinson disease (PD) are not completely understood. Here we use mass spectrometry and microarrays to study the proteomic and transcriptomic changes in the striatum of two mouse models of PD, induced by the distinct neurotoxins 1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine (MPTP) and methamphetamine (METH). Proteomic analyses resulted in the identification and relative quantification of 912 proteins with two or more unique peptides and 85 proteins with significant abundance changes following neurotoxin treatment. Similarly, microarray analyses revealed 181 genes with significant changes in mRNA following neurotoxin treatment. The combined protein and gene list provides a clearer picture of the potential mechanisms underlying neurodegeneration observed in PD. Functional analysis of this combined list revealed a number of significant categories, including mitochondrial dysfunction, oxidative stress response and apoptosis. Additionally, codon usage and miRNAs may play an important role in translational control in the striatum. These results constitute one of the largest datasets integrating protein and transcript changes for these neurotoxin models with many similar endpoint phenotypes but distinct mechanisms.
Pirooznia, Mehdi; Deng, Youping
2006-12-12
Graphical user interface (GUI) software promotes novelty by allowing users to extend the functionality. SVM Classifier is a cross-platform graphical application that handles very large datasets well. The purpose of this study is to create a GUI application that allows SVM users to perform SVM training, classification and prediction. The GUI provides user-friendly access to state-of-the-art SVM methods embodied in the LIBSVM implementation of the Support Vector Machine. We implemented the Java interface using standard Swing libraries. We used sample data from a breast cancer study to test classification accuracy. We achieved 100% accuracy in classification among the BRCA1-BRCA2 samples with the RBF kernel of SVM. We have developed a Java GUI application that allows SVM users to perform SVM training, classification and prediction. We have demonstrated that support vector machines can accurately classify genes into functional categories based upon expression data from DNA microarray hybridization experiments. Among the different kernel functions that we examined, the SVM that uses a radial basis kernel function provides the best performance. The SVM Classifier is available at http://mfgn.usm.edu/ebl/svm/.
Rey, Benjamin; Dégletagne, Cyril; Duchamp, Claude
2016-12-01
In this article, we present differentially expressed gene profiles in the pectoralis muscle of wild juvenile king penguins that were either naturally acclimated to a cold marine environment or experimentally immersed in cold water, as compared with penguin juveniles that never experienced cold water immersion. Transcriptomic data were obtained by hybridizing penguin total cDNA on Affymetrix GeneChip Chicken Genome arrays and analyzed using the maxRS algorithm, "Transcriptome analysis in non-model species: a new method for the analysis of heterologous hybridization on microarrays" (Dégletagne et al., 2010) [1]. We focused on genes involved in multiple antioxidant pathways. For better clarity, these differentially expressed genes were clustered into six functional groups according to their role in controlling redox homeostasis. The data are related to a comprehensive research study on the ontogeny of antioxidant functions in king penguins, "Hormetic response triggers multifaceted anti-oxidant strategies in immature king penguins (Aptenodytes patagonicus)" (Rey et al., 2016) [2]. The raw microarray dataset supporting the present analyses has been deposited at the Gene Expression Omnibus (GEO) repository under accessions GEO: GSE17725 and GEO: GSE82344.
Inferring Molecular Processes Heterogeneity from Transcriptional Data.
Gogolewski, Krzysztof; Wronowska, Weronika; Lech, Agnieszka; Lesyng, Bogdan; Gambin, Anna
2017-01-01
RNA microarrays and RNA-seq are nowadays standard technologies to study the transcriptional activity of cells. Most studies focus on tracking transcriptional changes caused by specific experimental conditions. Information referring to genes up- and downregulation is evaluated analyzing the behaviour of relatively large population of cells by averaging its properties. However, even assuming perfect sample homogeneity, different subpopulations of cells can exhibit diverse transcriptomic profiles, as they may follow different regulatory/signaling pathways. The purpose of this study is to provide a novel methodological scheme to account for possible internal, functional heterogeneity in homogeneous cell lines, including cancer ones. We propose a novel computational method to infer the proportion between subpopulations of cells that manifest various functional behaviour in a given sample. Our method was validated using two datasets from RNA microarray experiments. Both experiments aimed to examine cell viability in specific experimental conditions. The presented methodology can be easily extended to RNA-seq data as well as other molecular processes. Moreover, it complements standard tools to indicate most important networks from transcriptomic data and in particular could be useful in the analysis of cancer cell lines affected by biologically active compounds or drugs. PMID:29362714
Lubbock, Alexander L. R.; Katz, Elad; Harrison, David J.; Overton, Ian M.
2013-01-01
Tissue microarrays (TMAs) allow multiplexed analysis of tissue samples and are frequently used to estimate biomarker protein expression in tumour biopsies. TMA Navigator (www.tmanavigator.org) is an open access web application for analysis of TMA data and related information, accommodating categorical, semi-continuous and continuous expression scores. Non-biological variation, or batch effects, can hinder data analysis and may be mitigated using the ComBat algorithm, which is incorporated with enhancements for automated application to TMA data. Unsupervised grouping of samples (patients) is provided according to Gaussian mixture modelling of marker scores, with cardinality selected by Bayesian information criterion regularization. Kaplan–Meier survival analysis is available, including comparison of groups identified by mixture modelling using the Mantel-Cox log-rank test. TMA Navigator also supports network inference approaches useful for TMA datasets, which often constitute comparatively few markers. Tissue and cell-type specific networks derived from TMA expression data offer insights into the molecular logic underlying pathophenotypes, towards more effective and personalized medicine. Output is interactive, and results may be exported for use with external programs. Private anonymous access is available, and user accounts may be generated for easier data management. PMID:23761446
Privacy-Preserving Integration of Medical Data: A Practical Multiparty Private Set Intersection.
Miyaji, Atsuko; Nakasho, Kazuhisa; Nishida, Shohei
2017-03-01
Medical data are often maintained by different organizations. However, detailed analyses sometimes require these datasets to be integrated without violating patient or commercial privacy. Multiparty Private Set Intersection (MPSI), which is an important privacy-preserving protocol, computes an intersection of multiple private datasets. This approach ensures that only designated parties can identify the intersection. In this paper, we propose a practical MPSI that satisfies the following requirements: The size of the datasets maintained by the different parties is independent of the others, and the computational complexity of the dataset held by each party is independent of the number of parties. Our MPSI is based on the use of an outsourcing provider, who has no knowledge of the data inputs or outputs. This reduces the computational complexity. The performance of the proposed MPSI is evaluated by implementing a prototype on a virtual private network to enable parallel computation in multiple threads. Our protocol is confirmed to be more efficient than comparable existing approaches.
NASA Astrophysics Data System (ADS)
Merchant, C. J.; Hulley, G. C.
2013-12-01
There are many datasets describing the evolution of global sea surface temperature (SST) over recent decades -- so why make another one? Answer: to provide observations of SST that have particular qualities relevant to climate applications: independence, accuracy and stability. This has been done within the European Space Agency (ESA) Climate Change Initiative (CCI) project on SST. Independence refers to the fact that the new SST CCI dataset is not derived from or tuned to in situ observations. This matters for climate because the in situ observing network used to assess marine climate change (1) was not designed to monitor small changes over decadal timescales, and (2) has evolved significantly in its technology and mix of types of observation, even during the past 40 years. The potential for significant artefacts in our picture of global ocean surface warming is clear. Only by having an independent record can we confirm (or refute) that the work done to remove biases/trend artefacts in in-situ datasets has been successful. Accuracy is the degree to which SSTs are unbiased. For climate applications, a common accuracy target is 0.1 K for all regions of the ocean. Stability is the degree to which the bias, if any, in a dataset is constant over time. Long-term instability introduces trend artefacts. To observe trends of the magnitude of 'global warming', SST datasets need to be stable to <5 mK/year. The SST CCI project has produced a satellite-based dataset that addresses these characteristics relevant to climate applications. Satellite radiances (brightness temperatures) have been harmonised exploiting periods of overlapping observations between sensors. Less well-characterised sensors have had their calibration tuned to that of better characterised sensors (at radiance level). Non-conventional retrieval methods (optimal estimation) have been employed to reduce regional biases to the 0.1 K level, a target violated in most satellite SST datasets.
Models for quantifying uncertainty have been developed to attach uncertainty to SST across a range of space-time scales. The stability of the data has been validated.
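The stability criterion in the abstract above (bias constant over time to within 5 mK/year) can be made concrete with a small sketch. It assumes, hypothetically, that a yearly series of dataset-minus-reference bias values in kelvin is available; the function name and the numbers are illustrative and are not taken from the SST CCI project.

```python
def bias_trend_mk_per_year(years, bias_kelvin):
    """Ordinary least-squares slope of bias against time, in mK per year."""
    n = len(years)
    mean_t = sum(years) / n
    mean_b = sum(bias_kelvin) / n
    sxx = sum((t - mean_t) ** 2 for t in years)
    sxy = sum((t - mean_t) * (b - mean_b) for t, b in zip(years, bias_kelvin))
    return 1000.0 * sxy / sxx  # convert K/year to mK/year

# Hypothetical yearly mean bias (K) of a dataset against a reference
years = [2000, 2001, 2002, 2003, 2004]
bias = [0.050, 0.052, 0.054, 0.056, 0.058]

trend = bias_trend_mk_per_year(years, bias)
mean_bias = sum(bias) / len(bias)
print(f"mean bias = {mean_bias:.3f} K (accuracy target 0.1 K)")
print(f"bias trend = {trend:.2f} mK/year (stability target < 5 mK/year)")
```

A real assessment would also need an uncertainty on the slope; this sketch only shows how accuracy (mean bias) and stability (bias trend) are two separate quantities.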
Wang, Nizhuan; Chang, Chunqi; Zeng, Weiming; Shi, Yuhu; Yan, Hongjie
2017-01-01
Independent component analysis (ICA) has been widely used in functional magnetic resonance imaging (fMRI) data analysis to evaluate functional connectivity of the brain; however, there are still some limitations on ICA simultaneously handling neuroimaging datasets with diverse acquisition parameters, e.g., different repetition time, different scanner, etc. Therefore, it is difficult for the traditional ICA framework to effectively handle ever-increasingly big neuroimaging datasets. In this research, a novel feature-map based ICA framework (FMICA) was proposed to address the aforementioned deficiencies, which aimed at exploring brain functional networks (BFNs) at different scales, e.g., the first level (individual subject level), second level (intragroup level of subjects within a certain dataset) and third level (intergroup level of subjects across different datasets), based only on the feature maps extracted from the fMRI datasets. The FMICA was presented as a hierarchical framework, which effectively made ICA and constrained ICA as a whole to identify the BFNs from the feature maps. The simulated and real experimental results demonstrated that FMICA had the excellent ability to identify the intergroup BFNs and to characterize subject-specific and group-specific difference of BFNs from the independent component feature maps, which sharply reduced the size of fMRI datasets. Compared with traditional ICAs, FMICA as a more generalized framework could efficiently and simultaneously identify the variant BFNs at the subject-specific, intragroup, intragroup-specific and intergroup levels, implying that FMICA was able to handle big neuroimaging datasets in neuroscience research.
Nagy, Zsolt; Acs, Bence; Butz, Henriett; Feldman, Karolina; Marta, Alexa; Szabo, Peter M; Baghy, Kornelia; Pazmany, Tamas; Racz, Karoly; Liko, Istvan; Patocs, Attila
2016-01-01
The glucocorticoid receptor (GR) plays a crucial role in inflammatory responses. GR has several isoforms, of which the most deeply studied are the GRα and GRβ. Recently it has been suggested that in addition to its negative dominant effect on GRα, the GRβ may have a GRα-independent transcriptional activity. The GRβ isoform was found to be frequently overexpressed in various autoimmune diseases, including inflammatory bowel disease (IBD). In this study, we wished to test whether the gene expression profile found in a GRβ overexpressing intestinal cell line (Caco-2GRβ) might mimic the gene expression alterations found in patients with IBD. Whole genome microarray analysis was performed in both normal and GRβ overexpressing Caco-2 cell lines with and without dexamethasone treatment. IBD-related genes were identified from a meta-analysis of 245 microarrays available in online microarray deposits performed on intestinal mucosa samples from patients with IBD and healthy individuals. The differentially expressed genes were further studied using in silico pathway analysis. Overexpression of GRβ altered a large proportion of genes that were not regulated by dexamethasone suggesting that GRβ may have a GRα-independent role in the regulation of gene expression. About 10% of genes differentially expressed in colonic mucosa samples from IBD patients compared to normal subjects were also detected in Caco-2 GRβ intestinal cell line. Common genes are involved in cell adhesion and cell proliferation. Overexpression of GRβ in intestinal cells may affect appropriate mucosal repair and intact barrier function. The proposed novel role of GRβ in intestinal epithelium warrants further studies. Copyright © 2015 Elsevier Ltd. All rights reserved.
Massange-Sánchez, Julio A.; Palmeros-Suárez, Paola A.; Espitia-Rangel, Eduardo; Rodríguez-Arévalo, Isaac; Sánchez-Segura, Lino; Martínez-Gallardo, Norma A.; Alatorre-Cobos, Fulgencio; Tiessen, Axel; Délano-Frier, John P.
2016-01-01
Two grain amaranth transcription factor (TF) genes were overexpressed in Arabidopsis plants. The first, coding for a group VII ethylene response factor TF (i.e., AhERF-VII) conferred tolerance to water-deficit stress (WS) in transgenic Arabidopsis without affecting vegetative or reproductive growth. A significantly lower water-loss rate in detached leaves coupled to a reduced stomatal opening in leaves of plants subjected to WS was associated with this trait. WS tolerance was also associated with an increased antioxidant enzyme activity and the accumulation of putative stress-related secondary metabolites. However, microarray and GO data did not indicate an obvious correlation between WS tolerance, stomatal closure, and abscisic acid (ABA)-related signaling. This scenario suggested that stomatal closure during WS in these plants involved ABA-independent mechanisms, possibly involving reactive oxygen species (ROS). WS tolerance may have also involved other protective processes, such as those employed for methyl glyoxal detoxification. The second, coding for a class A and cluster I DNA binding with one finger TF (i.e., AhDof-AI) provided salt-stress (SS) tolerance with no evident fitness penalties. The lack of an obvious development-related phenotype contrasted with microarray and GO data showing an enrichment of categories and genes related to developmental processes, particularly flowering. SS tolerance also correlated with increased superoxide dismutase activity but not with augmented stomatal closure. Additionally, microarray and GO data indicated that, contrary to AhERF-VII, SS tolerance conferred by AhDof-AI in Arabidopsis involved ABA-dependent and ABA-independent stress amelioration mechanisms. PMID:27749893
Davies, John R; Chang, Yu-mei; Bishop, D Timothy; Armstrong, Bruce K; Bataille, Veronique; Bergman, Wilma; Berwick, Marianne; Bracci, Paige M; Elwood, J Mark; Ernstoff, Marc S; Green, Adele; Gruis, Nelleke A; Holly, Elizabeth A; Ingvar, Christian; Kanetsky, Peter A; Karagas, Margaret R; Lee, Tim K; Le Marchand, Loïc; Mackie, Rona M; Olsson, Håkan; Østerlind, Anne; Rebbeck, Timothy R; Reich, Kristian; Sasieni, Peter; Siskind, Victor; Swerdlow, Anthony J; Titus, Linda; Zens, Michael S; Ziegler, Andreas; Gallagher, Richard P.; Barrett, Jennifer H; Newton-Bishop, Julia
2015-01-01
Background We report the development of a cutaneous melanoma risk algorithm based upon 7 factors: hair colour, skin type, family history, freckling, nevus count, number of large nevi and history of sunburn, intended to form the basis of a self-assessment webtool for the general public. Methods Predicted odds of melanoma were estimated by analysing a pooled dataset from 16 case-control studies using logistic random coefficients models. Risk categories were defined based on the distribution of the predicted odds in the controls from these studies. Imputation was used to estimate missing data in the pooled datasets. The 30th, 60th and 90th centiles were used to distribute individuals into four risk groups for their age, sex and geographic location. Cross-validation was used to test the robustness of the thresholds for each group by leaving out each study one by one. Performance of the model was assessed in an independent UK case-control study dataset. Results Cross-validation confirmed the robustness of the threshold estimates. Cases and controls were well discriminated in the independent dataset (area under the curve 0.75, 95% CI 0.73-0.78). 29% of cases were in the highest risk group compared with 7% of controls, and 43% of controls were in the lowest risk group compared with 13% of cases. Conclusion We have identified a composite score representing an estimate of relative risk and successfully validated this score in an independent dataset. Impact This score may be a useful tool to inform members of the public about their melanoma risk. PMID:25713022
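The centile-based grouping described in the Methods above can be sketched as follows. This is an illustration only: the control odds are made-up numbers, the helper names are hypothetical, and the sketch ignores the stratification by age, sex and geographic location that the abstract describes.

```python
from bisect import bisect_right
from statistics import quantiles

def centile_cutoffs(control_odds, centiles=(30, 60, 90)):
    """Cut-offs taken from the distribution of predicted odds in controls."""
    cuts = quantiles(control_odds, n=100)  # 99 cut points; cuts[i - 1] is the i-th centile
    return [cuts[c - 1] for c in centiles]

def risk_group(odds, cutoffs):
    """Return 0 (lowest risk) to 3 (highest risk) for one individual's predicted odds."""
    return bisect_right(cutoffs, odds)

# Hypothetical predicted odds for 100 controls
controls = [float(i) for i in range(1, 101)]
cutoffs = centile_cutoffs(controls)

# Hypothetical individuals to classify
groups = [risk_group(x, cutoffs) for x in (10.0, 45.0, 75.0, 95.0)]
print("cut-offs:", cutoffs)
print("risk groups:", groups)
```

Three cut-offs give four groups, matching the four risk categories in the abstract; the exact centile values depend on the interpolation convention of `statistics.quantiles`.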
Pancoska, Petr; Moravek, Zdenek; Moll, Ute M
2004-01-01
Nucleic acids are molecules of choice for both established and emerging nanoscale technologies. These technologies benefit from large functional densities of 'DNA processing elements' that can be readily manufactured. To achieve the desired functionality, polynucleotide sequences are currently designed by a process that involves tedious and laborious filtering of potential candidates against a series of requirements and parameters. Here, we present a completely novel methodology for the rapid rational design of large sets of DNA sequences. This method allows for the direct implementation of very complex and detailed requirements for the generated sequences, thus avoiding 'brute force' filtering. At the same time, these sequences have narrow distributions of melting temperatures. The molecular part of the design process can be done without computer assistance, using an efficient 'human engineering' approach by drawing a single blueprint graph that represents all generated sequences. Moreover, the method eliminates the necessity for extensive thermodynamic calculations. Melting temperature can be calculated only once (or not at all). In addition, the isostability of the sequences is independent of the selection of a particular set of thermodynamic parameters. Applications are presented for DNA sequence designs for microarrays, universal microarray zip sequences and electron transfer experiments.
Shao, Ning; Jiang, Shi-Meng; Zhang, Miao; Wang, Jing; Guo, Shu-Juan; Li, Yang; Jiang, He-Wei; Liu, Cheng-Xi; Zhang, Da-Bing; Yang, Li-Tao; Tao, Sheng-Ce
2014-01-21
The monitoring of genetically modified organisms (GMOs) is a primary step of GMO regulation. However, there is presently a lack of effective and high-throughput methodologies for specifically and sensitively monitoring most of the commercialized GMOs. Herein, we developed a multiplex amplification on a chip with readout on an oligo microarray (MACRO) system specifically for convenient GMO monitoring. This system is composed of a microchip for multiplex amplification and an oligo microarray for the readout of multiple amplicons, containing a total of 91 targets (18 universal elements, 20 exogenous genes, 45 events, and 8 endogenous reference genes) that cover 97.1% of all GM events that have been commercialized up to 2012. We demonstrate that the specificity of MACRO is ~100%, with a limit of detection (LOD) that is suitable for real-world applications. Moreover, the results obtained with MACRO for simulated complex samples and blind samples were 100% consistent with expectations and the results of independently performed real-time PCRs, respectively. Thus, we believe MACRO is the first system that can be applied for effectively monitoring the majority of the commercialized GMOs in a single test.
Chiu, Charles Y
2015-01-01
Viral pathogen discovery is of critical importance to clinical microbiology, infectious diseases, and public health. Genomic approaches for pathogen discovery, including consensus polymerase chain reaction (PCR), microarrays, and unbiased next-generation sequencing (NGS), have the capacity to comprehensively identify novel microbes present in clinical samples. Although numerous challenges remain to be addressed, including the bioinformatics analysis and interpretation of large datasets, these technologies have been successful in rapidly identifying emerging outbreak threats, screening vaccines and other biological products for microbial contamination, and discovering novel viruses associated with both acute and chronic illnesses. Downstream studies such as genome assembly, epidemiologic screening, and a culture system or animal model of infection are necessary to establish an association of a candidate pathogen with disease. PMID:23725672
Decoding genes with coexpression networks and metabolomics - 'majority report by precogs'.
Saito, Kazuki; Hirai, Masami Y; Yonekura-Sakakibara, Keiko
2008-01-01
Following the sequencing of whole genomes of model plants, high-throughput decoding of gene function is a major challenge in modern plant biology. In view of remarkable technical advances in transcriptomics and metabolomics, integrated analysis of these 'omics' by data-mining informatics is an excellent tool for prediction and identification of gene function, particularly for genes involved in complicated metabolic pathways. The availability of Arabidopsis public transcriptome datasets containing data of >1000 microarrays reinforces the potential for prediction of gene function by transcriptome coexpression analysis. Here, we review the strategy of combining transcriptome and metabolome as a powerful technology for studying the functional genomics of model plants and also crop and medicinal plants.
ITALICS: an algorithm for normalization and DNA copy number calling for Affymetrix SNP arrays.
Rigaill, Guillem; Hupé, Philippe; Almeida, Anna; La Rosa, Philippe; Meyniel, Jean-Philippe; Decraene, Charles; Barillot, Emmanuel
2008-03-15
Affymetrix SNP arrays can be used to determine the DNA copy number measurement of 11 000-500 000 SNPs along the genome. Their high density facilitates the precise localization of genomic alterations and makes them a powerful tool for studies of cancers and copy number polymorphism. Like other microarray technologies, they are influenced by non-relevant sources of variation, requiring correction. Moreover, the amplitude of variation induced by non-relevant effects is similar to or greater than the biologically relevant effect (i.e. true copy number), making it difficult to estimate non-relevant effects accurately without including the biologically relevant effect. We addressed this problem by developing ITALICS, a normalization method that estimates both biological and non-relevant effects in an alternate, iterative manner, accurately eliminating irrelevant effects. We compared our normalization method with other existing and available methods, and found that ITALICS outperformed these methods for several in-house datasets and one public dataset. These results were validated biologically by quantitative PCR. The R package ITALICS (ITerative and Alternative normaLIzation and Copy number calling for affymetrix Snp arrays) has been submitted to Bioconductor.
Array data extractor (ADE): a LabVIEW program to extract and merge gene array data.
Kurtenbach, Stefan; Kurtenbach, Sarah; Zoidl, Georg
2013-12-01
Large data sets from gene expression array studies are publicly available offering information highly valuable for research across many disciplines ranging from fundamental to clinical research. Highly advanced bioinformatics tools have been made available to researchers, but a demand for user-friendly software allowing researchers to quickly extract expression information for multiple genes from multiple studies persists. Here, we present a user-friendly LabVIEW program to automatically extract gene expression data for a list of genes from multiple normalized microarray datasets. Functionality was tested for 288 class A G protein-coupled receptors (GPCRs) and expression data from 12 studies comparing normal and diseased human hearts. Results confirmed known regulation of a beta 1 adrenergic receptor and further indicate novel research targets. Although existing software allows for complex data analyses, the LabVIEW based program presented here, "Array Data Extractor (ADE)", provides users with a tool to retrieve meaningful information from multiple normalized gene expression datasets in a fast and easy way. Further, the graphical programming language used in LabVIEW allows applying changes to the program without the need of advanced programming knowledge.
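The extraction task described above (pulling expression values for a list of genes out of several normalized datasets and merging them) can be sketched in a few lines. ADE itself is a LabVIEW program; the Python below is only an illustration of the idea, with hypothetical study names, values and a hypothetical `extract_genes` helper.

```python
def extract_genes(datasets, genes):
    """Build one merged table: gene -> {dataset name -> value, or None if absent}."""
    return {
        gene: {name: data.get(gene) for name, data in datasets.items()}
        for gene in genes
    }

# Hypothetical normalized expression values from two studies
datasets = {
    "study_A": {"ADRB1": 7.2, "ADRB2": 6.1, "GAPDH": 12.4},
    "study_B": {"ADRB1": 6.8, "GAPDH": 12.9},
}

merged = extract_genes(datasets, ["ADRB1", "ADRB2"])
for gene, row in merged.items():
    print(gene, row)
```

Genes missing from a study are kept as `None` rather than dropped, so the merged table keeps the same columns for every gene.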
Effect of the absolute statistic on gene-sampling gene-set analysis methods.
Nam, Dougu
2017-06-01
Gene-set enrichment analysis and its modified versions have commonly been used for identifying altered functions or pathways in disease from microarray data. In particular, the simple gene-sampling gene-set analysis methods have been heavily used for datasets with only a few sample replicates. The biggest problem with this approach is the highly inflated false-positive rate. In this paper, the effect of absolute gene statistic on gene-sampling gene-set analysis methods is systematically investigated. Thus far, the absolute gene statistic has merely been regarded as a supplementary method for capturing the bidirectional changes in each gene set. Here, it is shown that incorporating the absolute gene statistic in gene-sampling gene-set analysis substantially reduces the false-positive rate and improves the overall discriminatory ability. Its effect was investigated by power, false-positive rate, and receiver operating curve for a number of simulated and real datasets. The performances of gene-set analysis methods in one-tailed (genome-wide association study) and two-tailed (gene expression data) tests were also compared and discussed.
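A minimal sketch of a gene-sampling test with and without the absolute gene statistic may help make the abstract above concrete. It is not the paper's implementation: the scores are made-up integers, the set statistic is a simple mean, and `gene_sampling_pvalue` is a hypothetical name. The example only shows the mechanics, using a set whose genes change in both directions.

```python
import random

def gene_sampling_pvalue(scores, gene_set, use_absolute, n_perm=1000, seed=0):
    """Upper-tail p-value of a gene set's mean score against random gene sets."""
    rng = random.Random(seed)
    transform = abs if use_absolute else (lambda x: x)
    values = {gene: transform(score) for gene, score in scores.items()}
    k = len(gene_set)
    observed = sum(values[g] for g in gene_set) / k
    genes = list(values)
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(genes, k)  # random gene set of the same size
        if sum(values[g] for g in sample) / k >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical gene-level statistics: 96 background genes with small scores,
# plus a 4-gene set with large changes in both directions
scores = {f"bg{i}": (1 if i % 2 == 0 else -1) for i in range(96)}
scores.update({"A": 30, "B": -30, "C": 30, "D": -30})
gene_set = ["A", "B", "C", "D"]

p_plain = gene_sampling_pvalue(scores, gene_set, use_absolute=False)
p_abs = gene_sampling_pvalue(scores, gene_set, use_absolute=True)
print(f"plain statistic:    p = {p_plain:.3f}")
print(f"absolute statistic: p = {p_abs:.3f}")
```

With the plain statistic the up- and downregulated genes cancel out, while the absolute statistic registers the set as strongly altered. The false-positive behaviour studied in the paper would need simulations with correlated genes, which this sketch does not attempt.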
Mustroph, Angelika; Bailey-Serres, Julia
2010-03-01
Plants consist of distinct cell types distinguished by position, morphological features and metabolic activities. We recently developed a method to extract cell-type specific mRNA populations by immunopurification of ribosome-associated mRNAs. Microarray profiles of 21 cell-specific mRNA populations from seedling roots and shoots comprise the Arabidopsis Translatome dataset. This gene expression atlas provides a new tool for the study of cell-specific processes. Here we provide an example of how genes involved in a pathway limited to one or few cell-types can be further characterized and new candidate genes can be predicted. Cells of the root endodermis produce suberin as an inner barrier between the cortex and stele, whereas the shoot epidermal cells form cutin as a barrier to the external environment. Both polymers consist of fatty acid derivates, and share biosynthetic origins. We use the Arabidopsis Translatome dataset to demonstrate the significant cell-specific expression patterns of genes involved in those biosynthetic processes and suggest new candidate genes in the biosynthesis of suberin and cutin.
Multi-spectrometer calibration transfer based on independent component analysis.
Liu, Yan; Xu, Hao; Xia, Zhenzhen; Gong, Zhiyong
2018-02-26
Calibration transfer is indispensable for practical applications of near infrared (NIR) spectroscopy due to the need for precise and consistent measurements across different spectrometers. In this work, a method for multi-spectrometer calibration transfer is described based on independent component analysis (ICA). A spectral matrix is first obtained by aligning the spectra measured on different spectrometers. Then, by using independent component analysis, the aligned spectral matrix is decomposed into the mixing matrix and the independent components of different spectrometers. These differing measurements between spectrometers can then be standardized by correcting the coefficients within the independent components. Two NIR datasets of corn and edible oil samples measured with three and four spectrometers, respectively, were used to test the reliability of this method. The results of both datasets reveal that spectra measurements across different spectrometers can be transferred simultaneously and that the partial least squares (PLS) models built with the measurements on one spectrometer can correctly predict the spectra transferred from another.
Deutsch, Eric W; Ball, Catherine A; Berman, Jules J; Bova, G Steven; Brazma, Alvis; Bumgarner, Roger E; Campbell, David; Causton, Helen C; Christiansen, Jeffrey H; Daian, Fabrice; Dauga, Delphine; Davidson, Duncan R; Gimenez, Gregory; Goo, Young Ah; Grimmond, Sean; Henrich, Thorsten; Herrmann, Bernhard G; Johnson, Michael H; Korb, Martin; Mills, Jason C; Oudes, Asa J; Parkinson, Helen E; Pascal, Laura E; Pollet, Nicolas; Quackenbush, John; Ramialison, Mirana; Ringwald, Martin; Salgado, David; Sansone, Susanna-Assunta; Sherlock, Gavin; Stoeckert, Christian J; Swedlow, Jason; Taylor, Ronald C; Walashek, Laura; Warford, Anthony; Wilkinson, David G; Zhou, Yi; Zon, Leonard I; Liu, Alvin Y; True, Lawrence D
2008-03-01
One purpose of the biomedical literature is to report results in sufficient detail that the methods of data collection and analysis can be independently replicated and verified. Here we present reporting guidelines for gene expression localization experiments: the minimum information specification for in situ hybridization and immunohistochemistry experiments (MISFISHIE). MISFISHIE is modeled after the Minimum Information About a Microarray Experiment (MIAME) specification for microarray experiments. Both guidelines define what information should be reported without dictating a format for encoding that information. MISFISHIE describes six types of information to be provided for each experiment: experimental design, biomaterials and treatments, reporters, staining, imaging data and image characterizations. This specification has benefited the consortium within which it was developed and is expected to benefit the wider research community. We welcome feedback from the scientific community to help improve our proposal.
Digital microarray analysis for digital artifact genomics
NASA Astrophysics Data System (ADS)
Jaenisch, Holger; Handley, James; Williams, Deborah
2013-06-01
We implement a Spatial Voting (SV) based analogy of microarray analysis for digital gene marker identification in malware code sections. We examine a famous set of malware formally analyzed by Mandiant and code named Advanced Persistent Threat (APT1). APT1 is a Chinese organization formed with specific intent to infiltrate and exploit US resources. Mandiant provided a detailed behavior and string analysis report for the 288 malware samples available. We performed an independent analysis using a new alternative to the traditional dynamic analysis and static analysis that we call Spatial Analysis (SA). We perform unsupervised SA on the APT1 originating malware code sections and report our findings. We also show the results of SA performed on some members of the families associated by Mandiant. We conclude that SV based SA is a practical fast alternative to dynamic analysis and static analysis.
Woodward, Richard B; Spanias, John A; Hargrove, Levi J
2016-08-01
Powered lower limb prostheses have the ability to provide greater mobility for amputee patients. Such prostheses often have pre-programmed modes which can allow activities such as climbing stairs and descending ramps, something which many amputees struggle with when using non-powered limbs. Previous literature has shown how pattern classification can allow seamless transitions between modes with a high accuracy and without any user interaction. Although accurate, training and testing each subject with their own dependent data is time consuming. By using subject independent datasets, whereby a unique subject is tested against a pooled dataset of other subjects, we believe subject training time can be reduced while still achieving an accurate classification. We present here an intent recognition system using an artificial neural network (ANN) with a scaled conjugate gradient learning algorithm to classify gait intention with user-dependent and independent datasets for six unilateral lower limb amputees. We compare these results against a linear discriminant analysis (LDA) classifier. The ANN was found to have significantly lower classification error (P<0.05) than LDA with all user-dependent step-types, as well as transitional steps for user-independent datasets. Both types of classifiers are capable of making fast decisions; 1.29 and 2.83 ms for the LDA and ANN respectively. These results suggest that ANNs can provide suitable and accurate offline classification in prosthesis gait prediction.
Exploring the reproducibility of functional connectivity alterations in Parkinson’s disease
Onu, Mihaela; Wu, Tao; Roceanu, Adina; Bajenaru, Ovidiu
2017-01-01
Since anatomic MRI is presently not able to directly discern neuronal loss in Parkinson’s Disease (PD), studying the associated functional connectivity (FC) changes seems a promising approach toward developing non-invasive and non-radioactive neuroimaging markers for this disease. While several groups have reported such FC changes in PD, there are also significant discrepancies between studies. Investigating the reproducibility of PD-related FC changes on independent datasets is therefore of crucial importance. We acquired resting-state fMRI scans for 43 subjects (27 patients and 16 normal controls, with 2 replicate scans per subject) and compared the observed FC changes with those obtained in two independent datasets, one made available by the PPMI consortium (91 patients, 18 controls) and a second one by the group of Tao Wu (20 patients, 20 controls). Unfortunately, PD-related functional connectivity changes turned out to be non-reproducible across datasets. This could be due to disease heterogeneity, but also to technical differences. To distinguish between the two, we devised a method to directly check for disease heterogeneity using random splits of a single dataset. Since we still observe non-reproducibility in a large fraction of random splits of the same dataset, we conclude that functional heterogeneity may be a dominating factor behind the lack of reproducibility of FC alterations in different rs-fMRI studies of PD. While global PD-related functional connectivity changes were non-reproducible across datasets, we identified a few individual brain region pairs with marginally consistent FC changes across all three datasets. However, training classifiers on each one of the three datasets to discriminate PD scans from controls produced only low accuracies on the remaining two test datasets. 
Moreover, classifiers trained and tested on random splits of the same dataset (which are technically homogeneous) also had low test accuracies, directly substantiating disease heterogeneity. PMID:29182621
Delineation of Two Clinically and Molecularly Distinct Subgroups of Posterior Fossa Ependymoma
Witt, Hendrik; Mack, Stephen C.; Ryzhova, Marina; Bender, Sebastian; Sill, Martin; Isserlin, Ruth; Benner, Axel; Hielscher, Thomas; Milde, Till; Remke, Marc; Jones, David T.W.; Northcott, Paul A.; Garzia, Livia; Bertrand, Kelsey C.; Wittmann, Andrea; Yao, Yuan; Roberts, Stephen S.; Massimi, Luca; Van Meter, Tim; Weiss, William A.; Gupta, Nalin; Grajkowska, Wiesia; Lach, Boleslaw; Cho, Yoon-Jae; von Deimling, Andreas; Kulozik, Andreas E.; Witt, Olaf; Bader, Gary D.; Hawkins, Cynthia E.; Tabori, Uri; Guha, Abhijit; Rutka, James T.; Lichter, Peter; Korshunov, Andrey
2014-01-01
Summary Despite the histological similarity of ependymomas from throughout the neuroaxis, the disease likely comprises multiple independent entities, each with a distinct molecular pathogenesis. Transcriptional profiling of two large independent cohorts of ependymoma reveals the existence of two demographically, transcriptionally, genetically, and clinically distinct groups of posterior fossa (PF) ependymomas. Group A patients are younger, have laterally located tumors with a balanced genome, and are much more likely to exhibit recurrence, metastasis at recurrence, and death compared with Group B patients. Identification and optimization of immunohistochemical (IHC) markers for PF ependymoma subgroups allowed validation of our findings on a third independent cohort, using a human ependymoma tissue microarray, and provides a tool for prospective prognostication and stratification of PF ependymoma patients. PMID:21840481
Graubner, Felix R; Gram, Aykut; Kautz, Ewa; Bauersachs, Stefan; Aslan, Selim; Agaoglu, Ali R; Boos, Alois; Kowalewski, Mariusz P
2017-08-01
In the dog, there is no luteolysis in the absence of pregnancy. Thus, this species lacks any anti-luteolytic endocrine signal as found in other species that modulate uterine function during the critical period of pregnancy establishment. Nevertheless, in the dog an embryo-maternal communication must occur in order to prevent rejection of embryos. Based on this hypothesis, we performed microarray analysis of canine uterine samples collected during pre-attachment phase (days 10-12) and in corresponding non-pregnant controls, in order to elucidate the embryo attachment signal. An additional goal was to identify differences in uterine responses to pre-attachment embryos between dogs and other mammalian species exhibiting different reproductive patterns with regard to luteolysis, implantation, and preparation for placentation. Therefore, the canine microarray data were compared with gene sets from pigs, cattle, horses, and humans. We found 412 genes differentially regulated between the two experimental groups. The functional terms most strongly enriched in response to pre-attachment embryos related to extracellular matrix function and remodeling, and to immune and inflammatory responses. Several candidate genes were validated by semi-quantitative PCR. When compared with other species, best matches were found with human and equine counterparts. Especially for the pig, the majority of overlapping genes showed opposite expression patterns. Interestingly, 1926 genes did not pair with any of the other gene sets. Using a microarray approach, we report the uterine changes in the dog driven by the presence of embryos and compare these results with datasets from other mammalian species, finding common-, contrary-, and exclusively canine-regulated genes. © The Authors 2017. Published by Oxford University Press on behalf of Society for the Study of Reproduction.
Non-Small-Cell Lung Cancer Molecular Signatures Recapitulate Lung Developmental Pathways
Borczuk, Alain C.; Gorenstein, Lyall; Walter, Kristin L.; Assaad, Adel A.; Wang, Liqun; Powell, Charles A.
2003-01-01
Current paradigms hold that lung carcinomas arise from pluripotent stem cells capable of differentiation into one or several histological types. These paradigms suggest lung tumor cell ontogeny is determined by consequences of gene expression that recapitulate events important in embryonic lung development. Using oligonucleotide microarrays, we acquired gene profiles from 32 microdissected non-small-cell lung tumors. We determined the 100 top-ranked marker genes for adenocarcinoma, squamous cell, large cell, and carcinoid using nearest neighbor analysis. Results were validated by immunostaining for 11 selected proteins using a tissue microarray representing 80 tumors. Gene expression data of lung development were accessed from a publicly available dataset generated with the murine Mu11k genome microarray. Self-organized mapping identified two temporally distinct clusters of murine orthologues. Supervised clustering of lung development data showed large-cell carcinoma gene orthologues were in a cluster expressed in pseudoglandular and canalicular stages whereas adenocarcinoma homologues were predominantly in a cluster expressed later in the terminal sac and alveolar stages of murine lung development. Representative large-cell genes (E2F3, MYBL2, HDAC2, CDK4, PCNA) are expressed in the nucleus and are associated with cell cycle and proliferation. In contrast, adenocarcinoma genes are associated with lung-specific transcription pathways (SFTPB, TTF-1), cell adhesion, and signal transduction. In sum, the gene profiles of non-small-cell lung tumor histologies suggest mechanisms relevant to ontogeny and clinical course. Adenocarcinoma genes are associated with differentiation and glandular formation whereas large-cell genes are associated with proliferation and differentiation arrest.
The identification of developmentally regulated pathways active in tumorigenesis provides insights into lung carcinogenesis and suggests early steps may differ according to the eventual tumor morphology. PMID:14578194
Liang, Xueying; Schnetz-Boutaud, Nathalie; Bartlett, Jackie; Allen, Melissa J; Gwirtsman, Harry; Schmechel, Don E; Carney, Regina M; Gilbert, John R; Pericak-Vance, Margaret A; Haines, Jonathan L
2008-01-01
SNP rs498055 in the predicted gene LOC439999 on chromosome 10 was recently identified as being strongly associated with late-onset Alzheimer disease (LOAD). This SNP falls within a chromosomal region that has engendered continued interest generated from both preliminary genetic linkage and candidate gene studies. To independently evaluate this interesting candidate SNP we examined four independent datasets, three family-based and one case-control. All the cases were late-onset AD Caucasian patients with minimum age at onset ≥ 60 years. None of the three family samples or the combined family-based dataset showed association in either allelic or genotypic family-based association tests at p < 0.05. Both original and OSA two-point LOD scores were calculated. However, there was no evidence indicating linkage no matter what covariates were applied (the highest LOD score was 0.82). The case-control dataset did not demonstrate any association between this SNP and AD (all p-values > 0.52). Our results do not confirm the previous association, but are consistent with a more recent negative association result that used family-based association tests to examine the effect of this SNP in two family datasets. Thus we conclude that rs498055 is not associated with an increased risk of LOAD.
Xu, Haoming; Moni, Mohammad Ali; Liò, Pietro
2015-12-01
In cancer genomics, gene expression levels provide important molecular signatures for all types of cancer, and this could be very useful for predicting the survival of cancer patients. However, the main challenge of gene expression data analysis is high dimensionality: microarray data are characterised by a small number of samples and a large number of genes. To overcome this problem, a variety of penalised Cox proportional hazard models have been proposed. We introduce a novel network regularised Cox proportional hazard model and a novel multiplex network model to measure disease comorbidities and to predict the survival of cancer patients. Our methods are applied to analyse seven microarray cancer gene expression datasets: breast cancer, ovarian cancer, lung cancer, liver cancer, renal cancer and osteosarcoma. Firstly, we applied a principal component analysis to reduce the dimensionality of the original gene expression data. Secondly, we applied a network regularised Cox regression model to the reduced gene expression datasets. By using the normalised mutual information method and the multiplex network model, we predict the comorbidities for liver cancer based on the integration of a diverse set of omics and clinical data, and we find the diseasome associations (disease-gene association) among different cancers based on the identified common significant genes. Finally, we evaluated the precision of the approach with respect to the accuracy of survival prediction using ROC curves. We report that colon cancer, liver cancer and renal cancer share the CXCL5 gene, and breast cancer, ovarian cancer and renal cancer share the CCND2 gene. Our methods are useful for predicting patient survival and disease comorbidities more accurately, and helpful for improving the care of patients with comorbidity. Software in Matlab and R is available on our GitHub page: https://github.com/ssnhcom/NetworkRegularisedCox.git. Copyright © 2015. Published by Elsevier Ltd.
Sehgal, Muhammad Shoaib B; Gondal, Iqbal; Dooley, Laurence S
2005-05-15
Microarray data are used in a range of application areas in biology, although they often contain considerable numbers of missing values. These missing values can significantly affect subsequent statistical analysis and machine learning algorithms, so there is a strong motivation to estimate these values as accurately as possible before using these algorithms. While many imputation algorithms have been proposed, more robust techniques need to be developed so that further analysis of biological data can be accurately undertaken. In this paper, an innovative missing value imputation algorithm called collateral missing value estimation (CMVE) is presented which uses multiple covariance-based imputation matrices for the final prediction of missing values. The matrices are computed and optimized using least square regression and linear programming methods. The new CMVE algorithm has been compared with existing estimation techniques including Bayesian principal component analysis imputation (BPCA), least square impute (LSImpute) and K-nearest neighbour (KNN). All these methods were rigorously tested to estimate missing values in three separate non-time series (ovarian cancer based) and one time series (yeast sporulation) dataset. Each method was quantitatively analyzed using the normalized root mean square (NRMS) error measure, covering a wide range of randomly introduced missing value probabilities from 0.01 to 0.2. Experiments were also undertaken on the yeast dataset, which comprised 1.7% actual missing values, to test the hypothesis that CMVE performed better not only for randomly occurring but also for a real distribution of missing values. The results confirmed that CMVE consistently demonstrated superior and robust estimation capability of missing values compared with other methods for both series types of data, for the same order of computational complexity.
A concise theoretical framework has also been formulated to validate the improved performance of the CMVE algorithm. The CMVE software is available upon request from the authors.
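The evaluation protocol described in this abstract (randomly introduce missing values, impute, score with NRMS error) can be sketched in Python as follows. This is only an illustration under stated assumptions: the imputer is a simple row-mean baseline standing in for any method under test (it is not CMVE), NRMS is assumed here to be the RMS of the estimation error divided by the RMS of the true values over the whole matrix, and all function names are hypothetical.

```python
import math
import random


def mask_entries(matrix, missing_prob, rng):
    """Copy the matrix, replacing each entry by None with probability missing_prob."""
    return [[None if rng.random() < missing_prob else v for v in row] for row in matrix]


def row_mean_impute(masked):
    """Baseline imputer (a stand-in, not CMVE): fill gaps with the mean of the observed row values."""
    filled = []
    for row in masked:
        observed = [v for v in row if v is not None]
        mean = sum(observed) / len(observed) if observed else 0.0
        filled.append([mean if v is None else v for v in row])
    return filled


def nrms_error(truth, estimate):
    """Assumed definition: RMS(estimate - truth) / RMS(truth), taken over all entries."""
    diff_sq = sum((e - t) ** 2 for tr, er in zip(truth, estimate) for t, e in zip(tr, er))
    true_sq = sum(t ** 2 for tr in truth for t in tr)
    return math.sqrt(diff_sq / true_sq)


# Toy expression matrix (50 genes x 20 samples) with synthetic values.
rng = random.Random(0)
truth = [[rng.gauss(5.0, 1.0) for _ in range(20)] for _ in range(50)]
for missing_prob in (0.01, 0.05, 0.2):
    imputed = row_mean_impute(mask_entries(truth, missing_prob, rng))
    print(missing_prob, round(nrms_error(truth, imputed), 4))
```

Because the true values are known before masking, the same loop can compare several imputers at each missing-value probability, which is the comparison the abstract reports across the 0.01 to 0.2 range.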
Reducing the time requirement of k-means algorithm.
Osamor, Victor Chukwudi; Adebiyi, Ezekiel Femi; Oyelade, Jelilli Olarenwaju; Doumbia, Seydou
2012-01-01
Traditional k-means and most k-means variants are still computationally expensive for large datasets, such as microarray data, which are large and have a large dimension size d. In k-means clustering, we are given a set of n data points in d-dimensional space R^d and an integer k. The problem is to determine a set of k points in R^d, called centers, so as to minimize the mean squared distance from each data point to its nearest center. In this work, we develop a novel k-means algorithm, which is simple but more efficient than the traditional k-means and the recent enhanced k-means. Our new algorithm is based on the recently established relationship between principal component analysis and k-means clustering. We provide a correctness proof for this algorithm. Results obtained from testing the algorithm on three biological datasets and six non-biological datasets (three of these are real, while the other three are simulated) also indicate that our algorithm is empirically faster than other known k-means algorithms. We assessed the quality of our algorithm's clusters against the clusters of a known structure using the Hubert-Arabie Adjusted Rand index (ARI_HA). We found that when k is close to d, the quality is good (ARI_HA > 0.8) and when k is not close to d, the quality of our new k-means algorithm is excellent (ARI_HA > 0.9). In this paper, the emphasis is on reducing the time requirement of the k-means algorithm and on its application to microarray data, motivated by the desire to create a tool for clustering and malaria research. However, the new clustering algorithm can be used for other clustering needs as long as an appropriate measure of distance between the centroids and the members is used. This has been demonstrated in this work on six non-biological datasets.
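The clustering objective stated in this abstract can be made concrete with a minimal Python sketch of the traditional Lloyd iteration. This is the classical baseline only, not the PCA-based algorithm the authors propose; the function names, the random initialisation, and the fixed iteration count are assumptions made for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_squared_distance(points, centers):
    """k-means objective: mean squared distance from each point to its nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points) / len(points)


def lloyd_kmeans(points, k, iterations=20, seed=0):
    """Traditional Lloyd iteration, starting from k randomly chosen data points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [
            [sum(coord) / len(cluster) for coord in zip(*cluster)] if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centers = lloyd_kmeans(points, k=2)
print(centers, mean_squared_distance(points, centers))
```

Each iteration costs on the order of n·k·d distance operations, which is why large n and large d make the traditional algorithm expensive on microarray-sized data.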
Unsupervised Bayesian linear unmixing of gene expression microarrays.
Bazot, Cécile; Dobigeon, Nicolas; Tourneret, Jean-Yves; Zaas, Aimee K; Ginsburg, Geoffrey S; Hero, Alfred O
2013-03-19
This paper introduces a new constrained model and the corresponding algorithm, called unsupervised Bayesian linear unmixing (uBLU), to identify biological signatures from high dimensional assays like gene expression microarrays. The basis for uBLU is a Bayesian model for the data samples which are represented as an additive mixture of random positive gene signatures, called factors, with random positive mixing coefficients, called factor scores, that specify the relative contribution of each signature to a specific sample. The particularity of the proposed method is that uBLU constrains the factor loadings to be non-negative and the factor scores to be probability distributions over the factors. Furthermore, it also provides estimates of the number of factors. A Gibbs sampling strategy is adopted here to generate random samples according to the posterior distribution of the factors, factor scores, and number of factors. These samples are then used to estimate all the unknown parameters. Firstly, the proposed uBLU method is applied to several simulated datasets with known ground truth and compared with previous factor decomposition methods, such as principal component analysis (PCA), non negative matrix factorization (NMF), Bayesian factor regression modeling (BFRM), and the gradient-based algorithm for general matrix factorization (GB-GMF). Secondly, we illustrate the application of uBLU on a real time-evolving gene expression dataset from a recent viral challenge study in which individuals have been inoculated with influenza A/H3N2/Wisconsin. We show that the uBLU method significantly outperforms the other methods on the simulated and real data sets considered here. The results obtained on synthetic and real data illustrate the accuracy of the proposed uBLU method when compared to other factor decomposition methods from the literature (PCA, NMF, BFRM, and GB-GMF). 
The uBLU method identifies an inflammatory component closely associated with clinical symptom scores collected during the study. Using a constrained model allows recovery of all the inflammatory genes in a single factor.
Reducing the Time Requirement of k-Means Algorithm
Osamor, Victor Chukwudi; Adebiyi, Ezekiel Femi; Oyelade, Jelilli Olarenwaju; Doumbia, Seydou
2012-01-01
Traditional k-means and most k-means variants are still computationally expensive for large datasets, such as microarray data, which are large and have a large dimension size d. In k-means clustering, we are given a set of n data points in d-dimensional space R^d and an integer k. The problem is to determine a set of k points in R^d, called centers, so as to minimize the mean squared distance from each data point to its nearest center. In this work, we develop a novel k-means algorithm, which is simple but more efficient than the traditional k-means and the recent enhanced k-means. Our new algorithm is based on the recently established relationship between principal component analysis and k-means clustering. We provide a correctness proof for this algorithm. Results obtained from testing the algorithm on three biological datasets and six non-biological datasets (three of these are real, while the other three are simulated) also indicate that our algorithm is empirically faster than other known k-means algorithms. We assessed the quality of our algorithm's clusters against the clusters of a known structure using the Hubert-Arabie Adjusted Rand index (ARI_HA). We found that when k is close to d, the quality is good (ARI_HA > 0.8) and when k is not close to d, the quality of our new k-means algorithm is excellent (ARI_HA > 0.9). In this paper, the emphasis is on reducing the time requirement of the k-means algorithm and on its application to microarray data, motivated by the desire to create a tool for clustering and malaria research. However, the new clustering algorithm can be used for other clustering needs as long as an appropriate measure of distance between the centroids and the members is used. This has been demonstrated in this work on six non-biological datasets. PMID:23239974
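The cluster-quality measure named in this abstract, the Hubert-Arabie Adjusted Rand index, can be computed from the contingency counts of two labelings. The Python sketch below uses the standard pair-counting formula; the function name and the handling of the degenerate case are assumptions, and at least two items are assumed.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Hubert-Arabie Adjusted Rand index between two labelings of the same items."""
    n = len(labels_true)
    # Pairs of items placed together in both partitions, in the first only, in the second only.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = together_true * together_pred / comb(n, 2)
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions put everything in one cluster
    return (together_both - expected) / (maximum - expected)


known = [0, 0, 0, 1, 1, 1]
found = [1, 1, 0, 0, 0, 0]
print(adjusted_rand_index(known, found))
```

The index is 1 for identical partitions regardless of how the cluster labels are named, and is close to 0 for agreement no better than chance, which is what makes the thresholds of 0.8 and 0.9 quoted above interpretable.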
Far infrared promotes wound healing through activation of Notch1 signaling.
Hsu, Yung-Ho; Lin, Yuan-Feng; Chen, Cheng-Hsien; Chiu, Yu-Jhe; Chiu, Hui-Wen
2017-11-01
The Notch signaling pathway is critically involved in cell proliferation, differentiation, development, and homeostasis. Far infrared (FIR) has an effect that promotes wound healing. However, the underlying molecular mechanisms are unclear. In the present study, we employed in vivo and HaCaT (a human skin keratinocyte cell line) models to elucidate the role of Notch1 signaling in FIR-promoted wound healing. We found that FIR enhanced keratinocyte migration and proliferation. FIR induced the Notch1 signaling pathway in HaCaT cells and in a microarray dataset from the Gene Expression Omnibus database. We next determined the mRNA levels of NOTCH1 in paired normal and wound skin tissues derived from clinical patients using the microarray dataset and Ingenuity Pathway Analysis software. The result indicated that the Notch1/Twist1 axis plays important roles in wound healing and tissue repair. In addition, inhibiting Notch1 signaling decreased the FIR-enhanced proliferation and migration. In a full-thickness wound model in rats, the wounds healed more rapidly and the scar size was smaller in the FIR group than in the light group. Moreover, FIR could increase Notch1 and Delta1 in skin tissues. The activation of Notch1 signaling may be considered as a possible mechanism for the promoting effect of FIR on wound healing. FIR stimulates keratinocyte migration and proliferation. Notch1 in keratinocytes has an essential role in FIR-induced migration and proliferation. NOTCH1 promotes TWIST1-mediated gene expression to assist wound healing. FIR might promote skin wound healing in a rat model.
Zhao, Linlu; Bracken, Michael B.; DeWan, Andrew T.
2013-01-01
Summary A genome-wide association study was undertaken to identify maternal single nucleotide polymorphisms (SNPs) and copy-number variants (CNVs) associated with preeclampsia. Case-control analysis was performed on 1070 Afro-Caribbean (n=21 cases and 1049 controls) and 723 Hispanic (n=62 cases and 661 controls) mothers and 1257 mothers of European ancestry (n=50 cases and 1207 controls) from the Hyperglycemia and Adverse Pregnancy Outcome (HAPO) study. European ancestry subjects were genotyped on Illumina Human610-Quad and Afro-Caribbean and Hispanic subjects were genotyped on Illumina Human1M-Duo BeadChip microarrays. Genome-wide SNP data were analyzed using PLINK. CNVs were called using three detection algorithms (GNOSIS, PennCNV, and QuantiSNP), merged using CNVision, and then screened using stringent criteria. SNP and CNV findings were compared to those of the Study of Pregnancy Hypertension in Iowa (SOPHIA), an independent preeclampsia case-control dataset of Caucasian mothers (n=177 cases and 116 controls). A list of top SNPs was identified for each of the HAPO ethnic groups, but none reached Bonferroni-corrected significance. Novel candidate CNVs showing enrichment among preeclampsia cases were also identified in each of the three ethnic groups. Several variants were suggestively replicated in SOPHIA. The discovered SNPs and copy-number variable regions present interesting candidate genetic variants for preeclampsia that warrant further replication and investigation. PMID:23551011
Gelernter, Joel; Sherva, Richard; Koesterer, Ryan; Almasy, Laura; Zhao, Hongyu; Kranzler, Henry R.; Farrer, Lindsay
2013-01-01
We report a GWAS for cocaine dependence (CD) in three sets of African- and European-American subjects (AAs and EAs, respectively), to identify pathways, genes, and alleles important in CD risk. The discovery GWAS dataset (n=5,697 subjects) was genotyped using the Illumina OmniQuad microarray (890,000 analyzed SNPs). Additional genotypes were imputed based on the 1000 Genomes reference panel. Top-ranked findings were evaluated by incorporating information from publicly available GWAS data from 4,063 subjects. Then, the most significant GWAS SNPs were genotyped in 2,549 independent subjects. We observed one genomewide-significant (GWS) result: rs7086629 at the FAM53B (“family with sequence similarity 53, member B”) locus. This was supported in both AAs and EAs; p-value (meta-analysis of all samples) = 4.28×10^−8. The gene maps to the same chromosomal region as the maximum peak we observed in a previous linkage study. NCOR2 (nuclear receptor corepressor 2) SNP rs150954431 was associated with p=1.19×10^−9 in the EA discovery sample. SNP rs2456778, which maps to CDK1 (“cyclin-dependent kinase 1”), was associated with cocaine-induced paranoia in AAs in the discovery sample only (p=4.68×10^−8). This is the first study to identify risk variants for CD using GWAS. Our results implicate novel risk loci and provide insights into potential therapeutic and prevention strategies. PMID:23958962
Bryan, Kenneth; Cunningham, Pádraig
2008-01-01
Background Microarrays have the capacity to measure the expression of thousands of genes in parallel over many experimental samples. The unsupervised classification technique of bicluster analysis has been employed previously to uncover gene expression correlations over subsets of samples with the aim of providing a more accurate model of the natural gene functional classes. This approach also has the potential to aid functional annotation of unclassified open reading frames (ORFs). Until now this aspect of biclustering has been under-explored. In this work we illustrate how bicluster analysis may be extended into a 'semi-supervised' ORF annotation approach referred to as BALBOA. Results The efficacy of the BALBOA ORF classification technique is first assessed via cross validation and compared to a multi-class k-Nearest Neighbour (kNN) benchmark across three independent gene expression datasets. BALBOA is then used to assign putative functional annotations to unclassified yeast ORFs. These predictions are evaluated using existing experimental and protein sequence information. Lastly, we employ a related semi-supervised method to predict the presence of novel functional modules within yeast. Conclusion In this paper we demonstrate how unsupervised classification methods, such as bicluster analysis, may be extended through the use of available annotations to form semi-supervised approaches within the gene expression analysis domain. We show that such methods have the potential to improve upon supervised approaches and shed new light on the functions of unclassified ORFs and their co-regulation. PMID:18831786
Bastani, Meysam; Vos, Larissa; Asgarian, Nasimeh; Deschenes, Jean; Graham, Kathryn; Mackey, John; Greiner, Russell
2013-01-01
Background Selecting the appropriate treatment for breast cancer requires accurately determining the estrogen receptor (ER) status of the tumor. However, the standard for determining this status, immunohistochemical analysis of formalin-fixed paraffin embedded samples, suffers from numerous technical and reproducibility issues. Assessment of ER-status based on RNA expression can provide more objective, quantitative and reproducible test results. Methods To learn a parsimonious RNA-based classifier of hormone receptor status, we applied a machine learning tool to a training dataset of gene expression microarray data obtained from 176 frozen breast tumors, whose ER-status was determined by applying ASCO-CAP guidelines to standardized immunohistochemical testing of formalin fixed tumor. Results This produced a three-gene classifier that can predict the ER-status of a novel tumor, with a cross-validation accuracy of 93.17±2.44%. When applied to an independent validation set and to four other public databases, some on different platforms, this classifier obtained over 90% accuracy in each. In addition, we found that this prediction rule separated the patients' recurrence-free survival curves with a hazard ratio lower than the one based on the IHC analysis of ER-status. Conclusions Our efficient and parsimonious classifier lends itself to high throughput, highly accurate and low-cost RNA-based assessments of ER-status, suitable for routine high-throughput clinical use. This analytic method provides a proof-of-principle that may be applicable to developing effective RNA-based tests for other biomarkers and conditions. PMID:24312637
Wu, Chengjiang; Zhao, Yangjing; Lin, Yu; Yang, Xinxin; Yan, Meina; Min, Yujiao; Pan, Zihui; Xia, Sheng; Shao, Qixiang
2018-01-01
DNA microarray and high-throughput sequencing have been widely used to identify the differentially expressed genes (DEGs) in systemic lupus erythematosus (SLE). However, the big data from gene microarrays are also challenging to work with in terms of analysis and processing. The present study combined data from the microarray expression profile (GSE65391) and bioinformatics analysis to identify the key genes and cellular pathways in SLE. Gene ontology (GO) and cellular pathway enrichment analyses of DEGs were performed to investigate significantly enriched pathways. A protein-protein interaction network was constructed to determine the key genes in the occurrence and development of SLE. A total of 310 DEGs were identified in SLE, including 193 upregulated genes and 117 downregulated genes. GO analysis revealed that the most significant biological process of DEGs was immune system process. Kyoto Encyclopedia of Genes and Genomes pathway analysis showed that these DEGs were enriched in signaling pathways associated with the immune system, including the RIG-I-like receptor signaling pathway, intestinal immune network for IgA production, antigen processing and presentation and the toll-like receptor signaling pathway. The current study screened the top 10 genes with higher degrees as hub genes, which included 2′-5′-oligoadenylate synthetase 1, MX dynamin like GTPase 2, interferon induced protein with tetratricopeptide repeats 1, interferon regulatory factor 7, interferon induced with helicase C domain 1, signal transducer and activator of transcription 1, ISG15 ubiquitin-like modifier, DExD/H-box helicase 58, interferon induced protein with tetratricopeptide repeats 3 and 2′-5′-oligoadenylate synthetase 2. Module analysis revealed that these hub genes were also involved in the RIG-I-like receptor signaling, cytosolic DNA-sensing, toll-like receptor signaling and ribosome biogenesis pathways.
In addition, these hub genes, from different probe sets, exhibited a significant co-expression tendency in multi-experiment microarray datasets (P<0.01). In conclusion, these key genes and cellular pathways may improve the current understanding of the underlying mechanism of development of SLE. These key genes may be potential biomarkers of diagnosis, therapy and prognosis for SLE. PMID:29257335
Validation of a Radiosensitivity Molecular Signature in Breast Cancer
Eschrich, Steven A.; Fulp, William J.; Pawitan, Yudi; Foekens, John A.; Smid, Marcel; Martens, John W. M.; Echevarria, Michelle; Kamath, Vidya; Lee, Ji-Hyun; Harris, Eleanor E.; Bergh, Jonas; Torres-Roca, Javier F.
2014-01-01
Purpose Previously, we developed a radiosensitivity molecular signature (RSI) that was clinically-validated in three independent datasets (rectal, esophageal, head and neck) in 118 patients. Here, we test RSI in radiotherapy (RT) treated breast cancer patients. Experimental Design RSI was tested in two previously published breast cancer datasets. Patients were treated at the Karolinska University Hospital (n=159) and Erasmus Medical Center (n=344). RSI was applied as previously described. Results We tested RSI in RT-treated patients (Karolinska). Patients predicted to be radiosensitive (RS) had an improved 5 yr relapse-free survival when compared with radioresistant (RR) patients (95% vs. 75%, p=0.0212) but there was no difference between RS/RR patients treated without RT (71% vs. 77%, p=0.6744), consistent with RSI being RT-specific (interaction term RSIxRT, p=0.05). Similarly, in the Erasmus dataset RT-treated RS patients had an improved 5-year distant-metastasis-free survival over RR patients (77% vs. 64%, p=0.0409) but no difference was observed in patients treated without RT (RS vs. RR, 80% vs. 81%, p=0.9425). Multivariable analysis showed RSI is the strongest variable in RT-treated patients (Karolinska, HR=5.53, p=0.0987, Erasmus, HR=1.64, p=0.0758) and in backward selection (removal alpha of 0.10) RSI was the only variable remaining in the final model. Finally, RSI is an independent predictor of outcome in RT-treated ER+ patients (Erasmus, multivariable analysis, HR=2.64, p=0.0085). Conclusions RSI is validated in two independent breast cancer datasets totaling 503 patients. Including prior data, RSI is validated in five independent cohorts (621 patients) and represents, to our knowledge, the most extensively validated molecular signature in radiation oncology. PMID:22832933
An empirical understanding of triple collocation evaluation measure
NASA Astrophysics Data System (ADS)
Scipal, Klaus; Doubkova, Marcela; Hegyova, Alena; Dorigo, Wouter; Wagner, Wolfgang
2013-04-01
The triple collocation method is an advanced evaluation method that has been used in the soil moisture field for only about half a decade. The method requires three datasets with independent error structures that represent an identical phenomenon. The main advantages of the method are that it a) does not require a reference dataset that has to be considered to represent the truth, b) limits the effect of random and systematic errors of the other two datasets, and c) simultaneously assesses the error of three datasets. The objective of this presentation is to assess the triple collocation error (Tc) of the ASAR Global Mode Surface Soil Moisture (GM SSM) 1 km dataset and highlight problems of the method related to its ability to cancel the effect of errors in ancillary datasets. In particular, the goal is a) to investigate trends in Tc related to the change in spatial resolution from 5 to 25 km, b) to investigate trends in Tc related to the choice of a hydrological model, and c) to study the relationship between Tc and other absolute evaluation methods (namely RMSE and Error Propagation, EP). The triple collocation method is implemented using ASAR GM, AMSR-E, and a model (either AWRA-L, GLDAS-NOAH, or ERA-Interim). First, the significance of the relationship between the three soil moisture datasets, which is a prerequisite for the triple collocation method, was tested. Second, the trends in Tc related to the choice of the third reference dataset and scale were assessed. For this purpose the triple collocation was repeated, replacing AWRA-L with two different globally available model reanalysis datasets operating at different spatial resolutions (ERA-Interim and GLDAS-NOAH). Finally, the retrieved results were compared to the results of the RMSE and EP evaluation measures. Our results demonstrate that the Tc method does not eliminate the random and time-variant systematic errors of the second and the third dataset used in the Tc.
The possible reasons include that a) the Tc method could not fully function with datasets acting at very different spatial resolutions, or b) the errors were not fully independent as initially assumed.
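For readers unfamiliar with the technique, one common covariance-based formulation of triple collocation can be sketched in Python as below. The abstract does not state which formulation the authors used, so this is an assumption; it further assumes that each dataset is linearly related to the same underlying signal and that the three error series are zero-mean, mutually independent, and independent of the signal. The function names and the synthetic data are hypothetical.

```python
import random


def covariance(a, b):
    """Sample covariance of two equally long series."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def triple_collocation_error_variances(x, y, z):
    """Estimate the error variance of each of three collocated series.

    Under the stated assumptions, C_xy * C_xz / C_yz isolates the signal
    variance contained in x, so subtracting it from C_xx leaves the error variance.
    """
    c_xy, c_xz, c_yz = covariance(x, y), covariance(x, z), covariance(y, z)
    return (
        covariance(x, x) - c_xy * c_xz / c_yz,
        covariance(y, y) - c_xy * c_yz / c_xz,
        covariance(z, z) - c_xz * c_yz / c_xy,
    )


# Synthetic check: one shared signal observed three times with independent noise.
rng = random.Random(42)
signal = [rng.gauss(0.0, 1.0) for _ in range(20000)]
x = [s + rng.gauss(0.0, 0.1) for s in signal]
y = [0.8 * s + rng.gauss(0.0, 0.2) for s in signal]
z = [1.2 * s + rng.gauss(0.0, 0.3) for s in signal]
print(triple_collocation_error_variances(x, y, z))
```

If the errors of two datasets are correlated, or the datasets do not observe the same signal (for example because of very different spatial resolutions), the cross-covariances no longer isolate the shared signal and the estimates become biased, which matches the two possible explanations given at the end of the abstract.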
Suh, Yun-Suhk; Yu, Jieun; Kim, Byung Chul; Choi, Boram; Han, Tae-Su; Ahn, Hye Seong; Kong, Seong-Ho; Lee, Hyuk-Joon; Kim, Woo Ho; Yang, Han-Kwang
2015-01-01
Purpose The purpose of this study is to investigate differentially expressed genes using DNA microarray between advanced gastric cancer (AGC) with aggressive lymph node (LN) metastasis and that with a more advanced tumor stage but without LN metastasis. Materials and Methods Five sample pairs of gastric cancer tissue and normal gastric mucosa were taken from three patients with T3N3 stage (highN) and two with T4N0 stage (lowN). Data from triplicate DNA microarray experiments were analyzed, and candidate genes were identified using a volcano plot that showed ≥ 2-fold differential expression and were significant by Welch's t test (p < 0.05) between highN and lowN. Those selected genes were validated independently by reverse-transcriptase–polymerase chain reaction (RT-PCR) using five AGC patients, and tissue-microarray (TMA) comprising 47 AGC patients. Results CFTR, LAMC2, SERPINE2, F2R, MMP7, FN1, TIMP1, plasminogen activator inhibitor-1 (PAI-1), ITGB8, SDS, and TMPRSS4 were commonly up-regulated over 2-fold in highN. REG3A, CD24, ITLN1, and WBP5 were commonly down-regulated over 2-fold in lowN. Among these genes, overexpression of PAI-1 was validated by RT-PCR, and TMA showed 16.7% (7/42) PAI-1 expression in T3N3, but none (0/5) in T4N0 (p=0.393). Conclusion DNA microarray analysis and validation by RT-PCR and TMA showed that overexpression of PAI-1 is related to aggressive LN metastasis in AGC. PMID:25687870
Similarity of markers identified from cancer gene expression studies: observations from GEO.
Shi, Xingjie; Shen, Shihao; Liu, Jin; Huang, Jian; Zhou, Yong; Ma, Shuangge
2014-09-01
Gene expression profiling has been extensively conducted in cancer research. The analysis of multiple independent cancer gene expression datasets may provide additional information and complement single-dataset analysis. In this study, we conduct multi-dataset analysis and are interested in evaluating the similarity of cancer-associated genes identified from different datasets. The first objective of this study is to briefly review some statistical methods that can be used for such evaluation. Both marginal analysis and joint analysis methods are reviewed. The second objective is to apply those methods to 26 Gene Expression Omnibus (GEO) datasets on five types of cancers. Our analysis suggests that for the same cancer, the marker identification results may vary significantly across datasets, and different datasets share few common genes. In addition, datasets on different cancers share few common genes. The shared genetic basis of datasets on the same or different cancers, which has been suggested in the literature, is not observed in the analysis of GEO data. © The Author 2013. Published by Oxford University Press.
EgoNet: identification of human disease ego-network modules
2014-01-01
Background Mining novel biomarkers from gene expression profiles for accurate disease classification is challenging due to small sample size and high noise in gene expression measurements. Several studies have proposed integrated analyses of microarray data and protein-protein interaction (PPI) networks to find diagnostic subnetwork markers. However, the neighborhood relationship among network member genes has not been fully considered by those methods, leaving many potential gene markers unidentified. The main idea of this study is to take full advantage of the biological observation that genes associated with the same or similar diseases commonly reside in the same neighborhood of molecular networks. Results We present EgoNet, a novel method based on egocentric network-analysis techniques, to exhaustively search and prioritize disease subnetworks and gene markers from a large-scale biological network. When applied to a triple-negative breast cancer (TNBC) microarray dataset, the top selected modules contain both known gene markers in TNBC and novel candidates, such as RAD51 and DOK1, which play a central role in their respective ego-networks by connecting many differentially expressed genes. Conclusions Our results suggest that EgoNet, which is based on the ego network concept, allows the identification of novel biomarkers and provides a deeper understanding of their roles in complex diseases. PMID:24773628
NeAT: a toolbox for the analysis of biological networks, clusters, classes and pathways.
Brohée, Sylvain; Faust, Karoline; Lima-Mendez, Gipsi; Sand, Olivier; Janky, Rekin's; Vanderstocken, Gilles; Deville, Yves; van Helden, Jacques
2008-07-01
The network analysis tools (NeAT) (http://rsat.ulb.ac.be/neat/) provide user-friendly web access to a collection of modular tools for the analysis of networks (graphs) and clusters (e.g. microarray clusters, functional classes, etc.). A first set of tools supports basic operations on graphs (comparison between two graphs, neighborhood of a set of input nodes, path finding and graph randomization). Another set of programs makes the connection between networks and clusters (graph-based clustering, clique discovery and mapping of clusters onto a network). The toolbox also includes programs for detecting significant intersections between clusters/classes (e.g. clusters of co-expression versus functional classes of genes). NeAT is designed to cope with large datasets and provides a flexible toolbox for analyzing biological networks stored in various databases (protein interactions, regulation and metabolism) or obtained from high-throughput experiments (two-hybrid, mass-spectrometry and microarrays). The web interface interconnects the programs in predefined analysis flows, making it possible to address a series of questions about networks of interest. Each tool can also be used separately by entering custom data for a specific analysis. NeAT can also be used as web services (SOAP/WSDL interface), in order to design programmatic workflows and integrate them with other available resources.
Yi, Ming; Stephens, Robert M.
2008-01-01
Analysis of microarray and other high throughput data often involves identification of genes consistently up or down-regulated across samples as the first step in extraction of biological meaning. This gene-level paradigm can be limited as a result of valid sample fluctuations and biological complexities. In this report, we describe a novel method, SLEPR, which eliminates this limitation by relying on pathway-level consistencies. Our method first selects the sample-level differentiated genes from each individual sample, capturing genes missed by other analysis methods, ascertains the enrichment levels of associated pathways from each of those lists, and then ranks annotated pathways based on the consistency of enrichment levels of individual samples from both sample classes. As a proof of concept, we have used this method to analyze three public microarray datasets with a direct comparison with the GSEA method, one of the most popular pathway-level analysis methods in the field. We found that our method not only reproduced the earlier observations with significant improvements in depth of coverage for validated or expected biological themes, but also produced additional insights that make biological sense. This new method extends existing analysis approaches and facilitates integration of different types of HTP data. PMID:18818771
Identification of the Key Genes and Pathways in Esophageal Carcinoma.
Su, Peng; Wen, Shiwang; Zhang, Yuefeng; Li, Yong; Xu, Yanzhao; Zhu, Yonggang; Lv, Huilai; Zhang, Fan; Wang, Mingbo; Tian, Ziqiang
2016-01-01
Objective. Esophageal carcinoma (EC) is a common gastrointestinal malignancy worldwide. This study aims to screen key genes and pathways in EC and elucidate its mechanism. Methods. Five microarray datasets of EC were downloaded from Gene Expression Omnibus. Differentially expressed genes (DEGs) were screened by bioinformatics analysis. Gene Ontology (GO) enrichment, Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment, and protein-protein interaction (PPI) network construction were performed to obtain the biological roles of DEGs in EC. Quantitative real-time polymerase chain reaction (qRT-PCR) was used to verify the expression level of DEGs in EC. Results. A total of 1955 genes were filtered as DEGs in EC. The upregulated genes were significantly enriched in cell cycle and the downregulated genes were significantly enriched in endocytosis. The PPI network showed that CDK4 and CCT3 were hub proteins. The expression level of 8 dysregulated DEGs, including CDK4, CCT3, THSD4, SIM2, MYBL2, CENPF, CDCA3, and CDKN3, was validated in EC compared to adjacent nontumor tissues, and the results matched the microarray analysis. Conclusion. The significant DEGs, including CDK4, CCT3, THSD4, and SIM2, may play key roles in the tumorigenesis and development of EC through involvement in cell cycle and endocytosis.
Severino, Patricia; Alvares, Adriana M; Michaluart, Pedro; Okamoto, Oswaldo K; Nunes, Fabio D; Moreira-Filho, Carlos A; Tajara, Eloiza H
2008-01-01
Background Oral squamous cell carcinoma (OSCC) is a frequent neoplasm, which is usually aggressive and has unpredictable biological behavior and unfavorable prognosis. The comprehension of the molecular basis of this variability should lead to the development of targeted therapies as well as to improvements in specificity and sensitivity of diagnosis. Results Samples of primary OSCCs and their corresponding surgical margins were obtained from male patients during surgery and their gene expression profiles were screened using whole-genome microarray technology. Hierarchical clustering and Principal Components Analysis were used for data visualization and One-way Analysis of Variance was used to identify differentially expressed genes. Samples clustered mostly according to disease subsite, suggesting molecular heterogeneity within tumor stages. In order to corroborate our results, two publicly available datasets of microarray experiments were assessed. We found significant molecular differences between OSCC anatomic subsites concerning groups of genes presently or potentially important for drug development, including mRNA processing, cytoskeleton organization and biogenesis, metabolic process, cell cycle and apoptosis. Conclusion Our results corroborate literature data on molecular heterogeneity of OSCCs. Differences between disease subsites and among samples belonging to the same TNM class highlight the importance of gene expression-based classification and challenge the development of targeted therapies. PMID:19014556
Use of lectin microarray to differentiate gastric cancer from gastric ulcer
Huang, Wei-Li; Li, Yang-Guang; Lv, Yong-Chen; Guan, Xiao-Hui; Ji, Hui-Fan; Chi, Bao-Rong
2014-01-01
AIM: To investigate the feasibility of lectin microarray for differentiating gastric cancer from gastric ulcer. METHODS: Twenty cases of human gastric cancer tissue and 20 cases of human gastric ulcer tissue were collected and processed. Protein was extracted from the frozen tissues and stored. The lectins were dissolved in buffer, and the sugar-binding specificities of lectins and the layout of the lectin microarray were summarized. The median of the effective data points for each lectin was globally normalized to the sum of medians of all effective data points for each lectin in one block. Formalin-fixed paraffin-embedded gastric cancer tissues and their corresponding gastric ulcer tissues were subjected to Ag retrieval. Biotinylated lectin was used as the primary antibody and HRP-streptavidin as the secondary antibody. The glycopatterns of glycoprotein in gastric cancer and gastric ulcer specimens were determined by lectin microarray, and then validated by lectin histochemistry. Data are presented as mean ± SD for the indicated number of independent experiments. RESULTS: The glycosylation level of gastric cancer was significantly higher than that in ulcer. In gastric cancer, most of the lectin binders showed positive signals and the intensity of the signals was stronger, whereas the opposite was the case for ulcers. Significant differences in the pathological score of the two lectins were apparent between ulcer and gastric cancer tissues using the same lectin. For MPL and VVA, all types of gastric cancer detected showed stronger staining and a higher positive rate in comparison with ulcer, especially in the case of signet ring cell carcinoma and intra-mucosal carcinoma. GalNAc bound to MPL showed a significant increase. A statistically significant association between MPL and gastric cancer was observed. As with MPL, there were significant differences in VVA staining between gastric cancer and ulcer. 
CONCLUSION: Lectin microarray can differentiate the different glycopatterns in gastric cancer and gastric ulcer, and the lectins MPL and VVA can be used as biomarkers. PMID:24833877
Deutsch, Eric W; Ball, Catherine A; Berman, Jules J; Bova, G Steven; Brazma, Alvis; Bumgarner, Roger E; Campbell, David; Causton, Helen C; Christiansen, Jeffrey H; Daian, Fabrice; Dauga, Delphine; Davidson, Duncan R; Gimenez, Gregory; Goo, Young Ah; Grimmond, Sean; Henrich, Thorsten; Herrmann, Bernhard G; Johnson, Michael H; Korb, Martin; Mills, Jason C; Oudes, Asa J; Parkinson, Helen E; Pascal, Laura E; Pollet, Nicolas; Quackenbush, John; Ramialison, Mirana; Ringwald, Martin; Salgado, David; Sansone, Susanna-Assunta; Sherlock, Gavin; Stoeckert, Christian J; Swedlow, Jason; Taylor, Ronald C; Walashek, Laura; Warford, Anthony; Wilkinson, David G; Zhou, Yi; Zon, Leonard I; Liu, Alvin Y; True, Lawrence D
2015-01-01
One purpose of the biomedical literature is to report results in sufficient detail so that the methods of data collection and analysis can be independently replicated and verified. Here we present for consideration a minimum information specification for gene expression localization experiments, called the “Minimum Information Specification For In Situ Hybridization and Immunohistochemistry Experiments (MISFISHIE)”. It is modelled after the MIAME (Minimum Information About a Microarray Experiment) specification for microarray experiments. Data specifications like MIAME and MISFISHIE specify the information content without dictating a format for encoding that information. The MISFISHIE specification describes six types of information that should be provided for each experiment: Experimental Design, Biomaterials and Treatments, Reporters, Staining, Imaging Data, and Image Characterizations. This specification has benefited the consortium within which it was initially developed and is expected to benefit the wider research community. We welcome feedback from the scientific community to help improve our proposal. PMID:18327244
Improved detection of DNA-binding proteins via compression technology on PSSM information.
Wang, Yubo; Ding, Yijie; Guo, Fei; Wei, Leyi; Tang, Jijun
2017-01-01
Since the importance of DNA-binding proteins in multiple biomolecular functions has been recognized, an increasing number of researchers are attempting to identify DNA-binding proteins. In recent years, as the volume of protein sequence data has soared, machine learning methods have become increasingly compelling because of their favorable speed and accuracy. In this paper, we extract three features from the protein sequence, namely NMBAC (Normalized Moreau-Broto Autocorrelation), PSSM-DWT (Position-specific scoring matrix-Discrete Wavelet Transform), and PSSM-DCT (Position-specific scoring matrix-Discrete Cosine Transform). We also employ a feature selection algorithm on these feature vectors. These features are then fed into a support vector machine (SVM) classifier to predict DNA-binding proteins. We use three datasets, namely PDB1075, PDB594 and PDB186, to evaluate the performance of our approach. The PDB1075 and PDB594 datasets are employed for the Jackknife test and the PDB186 dataset is used for the independent test. Our method achieves the best accuracy in the Jackknife test, from 79.20% to 86.23% and from 80.5% to 86.20% on the PDB1075 and PDB594 datasets, respectively. In the independent test, the accuracy of our method reaches 76.3%. The independent test performance also indicates that our method can be used effectively for DNA-binding protein prediction. The data and source code are at https://doi.org/10.6084/m9.figshare.5104084.
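To make the idea of compressing PSSM information concrete, the following is a minimal sketch of a DCT-based feature of the kind the abstract calls PSSM-DCT. It is an assumption-laden illustration, not the authors' implementation: the column-wise transform, the number of retained coefficients (`keep`) and the toy matrix are all hypothetical choices.

```python
import math


def dct2(signal):
    """Unnormalized DCT-II of a 1-D sequence."""
    n = len(signal)
    return [sum(x * math.cos(math.pi / n * (i + 0.5) * k)
                for i, x in enumerate(signal))
            for k in range(n)]


def pssm_dct_features(pssm, keep=2):
    """Transform each PSSM column and keep its first `keep` low-frequency terms.

    Proteins of different lengths (different numbers of rows) therefore map
    to feature vectors of the same length: n_columns * keep.
    """
    n_cols = len(pssm[0])
    features = []
    for c in range(n_cols):
        column = [row[c] for row in pssm]
        features.extend(dct2(column)[:keep])
    return features


# Hypothetical toy PSSM: 4 residues x 3 columns (a real PSSM has 20 columns).
toy_pssm = [
    [1.0, -2.0, 0.0],
    [1.0, 3.0, 1.0],
    [1.0, -1.0, 2.0],
    [1.0, 0.0, 3.0],
]
print(pssm_dct_features(toy_pssm, keep=2))
```

Keeping only the leading coefficients is what makes this a compression step: the low-frequency terms summarize the overall trend of each column along the sequence, while higher-frequency detail is discarded.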
Ding, Jiarui; Shah, Sohrab; Condon, Anne
2016-01-01
Motivation: Many biological data processing problems can be formalized as clustering problems to partition data points into sensible and biologically interpretable groups. Results: This article introduces densityCut, a novel density-based clustering algorithm, which is both time- and space-efficient and proceeds as follows: densityCut first roughly estimates the densities of data points from a K-nearest neighbour graph and then refines the densities via a random walk. A cluster consists of points falling into the basin of attraction of an estimated mode of the underlying density function. A post-processing step merges clusters and generates a hierarchical cluster tree. The number of clusters is selected from the most stable clustering in the hierarchical cluster tree. Experimental results on ten synthetic benchmark datasets and two microarray gene expression datasets demonstrate that densityCut performs better than state-of-the-art algorithms for clustering biological datasets. For applications, we focus on the recent cancer mutation clustering and single cell data analyses, namely to cluster variant allele frequencies of somatic mutations to reveal clonal architectures of individual tumours, to cluster single-cell gene expression data to uncover cell population compositions, and to cluster single-cell mass cytometry data to detect communities of cells of the same functional states or types. densityCut performs better than competing algorithms and is scalable to large datasets. Availability and Implementation: Data and the densityCut R package are available from https://bitbucket.org/jerry00/densitycut_dev. Contact: condon@cs.ubc.ca or sshah@bccrc.ca or jiaruid@cs.ubc.ca Supplementary information: Supplementary data are available at Bioinformatics online. PMID:27153661
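The first step described above, a rough density estimate from K-nearest neighbours, might be sketched as follows. This is a generic textbook K-nearest-neighbour estimator written for illustration only; it is not taken from the densityCut package, it omits the random-walk refinement and the cluster-tree steps, and the function name and toy points are hypothetical.

```python
import math


def knn_density(points, k):
    """Density at each point: k / (n * volume of the ball reaching the k-th neighbour)."""
    n, dim = len(points), len(points[0])
    # Volume of the unit ball in `dim` dimensions.
    unit_ball = math.pi ** (dim / 2) / math.gamma(dim / 2 + 1)
    densities = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        radius = dists[k - 1]  # distance to the k-th nearest neighbour
        densities.append(k / (n * unit_ball * radius ** dim))
    return densities


# Hypothetical 2-D data: a tight group of four points plus one outlier.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
for point, density in zip(points, knn_density(points, k=2)):
    print(point, round(density, 4))
```

Points inside the tight group receive much higher density values than the outlier, which is the signal a mode-seeking method then uses to assign points to basins of attraction. The sketch assumes no duplicate points, since a zero neighbour distance would make the estimate undefined.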
Exploring Transcription Factors-microRNAs Co-regulation Networks in Schizophrenia.
Xu, Yong; Yue, Weihua; Yao Shugart, Yin; Li, Sheng; Cai, Lei; Li, Qiang; Cheng, Zaohuo; Wang, Guoqiang; Zhou, Zhenhe; Jin, Chunhui; Yuan, Jianmin; Tian, Lin; Wang, Jun; Zhang, Kai; Zhang, Kerang; Liu, Sha; Song, Yuqing; Zhang, Fuquan
2016-07-01
Transcriptional factors (TFs) and microRNAs (miRNAs) have been recognized as 2 classes of principal gene regulators that may be responsible for genome coexpression changes observed in schizophrenia (SZ). This study aims to (1) identify differentially coexpressed genes (DCGs) in 3 mRNA expression microarray datasets; (2) explore potential interactions among the DCGs and differentially expressed miRNAs identified in our dataset composed of early-onset SZ patients and healthy controls; (3) validate expression levels of some key transcripts; and (4) explore the druggability of DCGs using the curated database. We detected a differential coexpression network associated with SZ and found that 9 out of the 12 regulators were replicated in either of the 2 other datasets. Leveraging the differentially expressed miRNAs identified in our previous dataset, we constructed a miRNA-TF-gene network relevant to SZ, including an EGR1-miR-124-3p-SKIL feed-forward loop. Our real-time quantitative PCR analysis indicated overexpression of miR-124-3p and underexpression of SKIL and EGR1 in the blood of SZ patients compared with controls, and the direction of change of miR-124-3p and SKIL mRNA levels in SZ cases was reversed after a 12-week treatment cycle. Our druggability analysis revealed that many of these genes have the potential to be drug targets. Together, our results suggest that coexpression network abnormalities driven by combinatorial and interactive action from TFs and miRNAs may contribute to the development of SZ and be relevant to the clinical treatment of the disease. © The Author 2015. Published by Oxford University Press on behalf of the Maryland Psychiatric Research Center. All rights reserved. For permissions, please email: journals.permissions@oup.com.
Exploring Transcription Factors-microRNAs Co-regulation Networks in Schizophrenia
Xu, Yong; Yue, Weihua; Yao Shugart, Yin; Li, Sheng; Cai, Lei; Li, Qiang; Cheng, Zaohuo; Wang, Guoqiang; Zhou, Zhenhe; Jin, Chunhui; Yuan, Jianmin; Tian, Lin; Wang, Jun; Zhang, Kai; Zhang, Kerang; Liu, Sha; Song, Yuqing; Zhang, Fuquan
2016-01-01
Background: Transcriptional factors (TFs) and microRNAs (miRNAs) have been recognized as 2 classes of principal gene regulators that may be responsible for genome coexpression changes observed in schizophrenia (SZ). Methods: This study aims to (1) identify differentially coexpressed genes (DCGs) in 3 mRNA expression microarray datasets; (2) explore potential interactions among the DCGs and differentially expressed miRNAs identified in our dataset composed of early-onset SZ patients and healthy controls; (3) validate expression levels of some key transcripts; and (4) explore the druggability of DCGs using the curated database. Results: We detected a differential coexpression network associated with SZ and found that 9 out of the 12 regulators were replicated in either of the 2 other datasets. Leveraging the differentially expressed miRNAs identified in our previous dataset, we constructed a miRNA–TF–gene network relevant to SZ, including an EGR1–miR-124-3p–SKIL feed-forward loop. Our real-time quantitative PCR analysis indicated overexpression of miR-124-3p and underexpression of SKIL and EGR1 in the blood of SZ patients compared with controls, and the direction of change of miR-124-3p and SKIL mRNA levels in SZ cases was reversed after a 12-week treatment cycle. Our druggability analysis revealed that many of these genes have the potential to be drug targets. Conclusions: Together, our results suggest that coexpression network abnormalities driven by combinatorial and interactive action from TFs and miRNAs may contribute to the development of SZ and be relevant to the clinical treatment of the disease. PMID:26609121
Guo, Jin-Cheng; Wu, Yang; Chen, Yang; Pan, Feng; Wu, Zhi-Yong; Zhang, Jia-Sheng; Wu, Jian-Yi; Xu, Xiu-E; Zhao, Jian-Mei; Li, En-Min; Zhao, Yi; Xu, Li-Yan
2018-04-09
Esophageal squamous cell carcinoma (ESCC) is the predominant subtype of esophageal carcinoma in China. This study aimed to develop a staging model to predict outcomes of patients with ESCC. Using Cox regression analysis, principal component analysis (PCA), partitioning clustering, Kaplan-Meier analysis, receiver operating characteristic (ROC) curve analysis, and classification and regression tree (CART) analysis, we mined the Gene Expression Omnibus database to determine the expression profiles of genes in 179 patients with ESCC from the GSE63624 and GSE63622 datasets. Univariate Cox regression analysis of the GSE63624 dataset revealed that 2404 protein-coding genes (PCGs) and 635 long non-coding RNAs (lncRNAs) were associated with the survival of patients with ESCC. PCA categorized these PCGs and lncRNAs into three principal components (PCs), which were used to cluster the patients into three groups. ROC analysis demonstrated that the predictive ability of PCG-lncRNA PCs when applied to new patients was better than that of the tumor-node-metastasis staging (area under ROC curve [AUC]: 0.69 vs. 0.65, P < 0.05). Accordingly, we constructed a molecular disaggregated model comprising one lncRNA and two PCGs, which we designated as the LSB staging model, using CART analysis in the GSE63624 dataset. This LSB staging model classified the GSE63622 dataset of patients into three different groups, and its effectiveness was validated by analysis of another cohort of 105 patients. The LSB staging model has clinical significance for the prognosis prediction of patients with ESCC and may serve as a three-gene staging microarray.
Chen, Zhenyu; Li, Jianping; Wei, Liwei
2007-10-01
Recently, gene expression profiling using microarray techniques has been shown to be a promising tool for improving the diagnosis and treatment of cancer. Gene expression data contain a high level of noise and an overwhelming number of genes relative to the number of available samples, which poses a great challenge for machine learning and statistical techniques. Support vector machine (SVM) has been successfully used to classify gene expression data of cancer tissue. In the medical field, it is crucial to deliver a transparent decision process to the user. How to explain the computed solutions and present the extracted knowledge becomes a main obstacle for SVM. A multiple kernel support vector machine (MK-SVM) scheme, consisting of feature selection, rule extraction and prediction modeling, is proposed to improve the explanation capacity of SVM. In this scheme, we show that the feature selection problem can be translated into an ordinary multiple-parameter learning problem, and a shrinkage approach, 1-norm based linear programming, is proposed to obtain the sparse parameters and the corresponding selected features. We propose a novel rule extraction approach using the information provided by the separating hyperplane and support vectors to improve the generalization capacity and comprehensibility of rules and reduce the computational complexity. Two public gene expression datasets, a leukemia dataset and a colon tumor dataset, are used to demonstrate the performance of this approach. Using a small number of selected genes, MK-SVM achieves encouraging classification accuracy: more than 90% for both datasets. Moreover, very simple rules with linguistic labels are extracted. The rule sets have high diagnostic power because of their good classification performance.
An ensemble of SVM classifiers based on gene pairs.
Tong, Muchenxuan; Liu, Kun-Hong; Xu, Chungui; Ju, Wenbin
2013-07-01
In this paper, a genetic algorithm (GA) based ensemble support vector machine (SVM) classifier built on gene pairs (GA-ESP) is proposed. The SVMs (base classifiers of the ensemble system) are trained on different informative gene pairs. These gene pairs are selected by the top scoring pair (TSP) criterion. Each of these pairs projects the original microarray expression onto a 2-D space. Extensive permutation of gene pairs may reveal more useful information and potentially lead to an ensemble classifier with satisfactory accuracy and interpretability. GA is further applied to select an optimized combination of base classifiers. The effectiveness of the GA-ESP classifier is evaluated on both binary-class and multi-class datasets. Copyright © 2013 Elsevier Ltd. All rights reserved.
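The top scoring pair (TSP) criterion mentioned above ranks a gene pair by how differently the ordering of its two expression values behaves in the two classes. The sketch below illustrates that score for a binary problem only; the expression matrix, labels and function names are hypothetical, and the GA-based selection of base classifiers and the SVM training described in the abstract are not shown.

```python
from itertools import combinations


def tsp_score(expr, labels, i, j):
    """|P(x_i < x_j | class 0) - P(x_i < x_j | class 1)| for genes i and j."""
    freq = []
    for cls in (0, 1):
        samples = [s for s, y in zip(expr, labels) if y == cls]
        freq.append(sum(s[i] < s[j] for s in samples) / len(samples))
    return abs(freq[0] - freq[1])


def top_scoring_pairs(expr, labels, n_pairs=1):
    """Score every gene pair and return the n_pairs best as (score, (i, j))."""
    n_genes = len(expr[0])
    scored = [(tsp_score(expr, labels, i, j), (i, j))
              for i, j in combinations(range(n_genes), 2)]
    scored.sort(key=lambda item: -item[0])  # highest score first
    return scored[:n_pairs]


# Hypothetical data: rows are samples, columns are genes 0..2.
expr = [
    [1, 5, 3],  # class 0
    [2, 6, 1],  # class 0
    [7, 3, 2],  # class 1
    [8, 2, 9],  # class 1
]
labels = [0, 0, 1, 1]
print(top_scoring_pairs(expr, labels))
```

In this toy matrix gene 0 is below gene 1 in every class-0 sample and above it in every class-1 sample, so the pair (0, 1) reaches the maximum score of 1.0. Each selected pair defines the 2-D projection on which one base classifier would be trained.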
The FLIGHT Drosophila RNAi database
Bursteinas, Borisas; Jain, Ekta; Gao, Qiong; Baum, Buzz; Zvelebil, Marketa
2010-01-01
FLIGHT (http://flight.icr.ac.uk/) is an online resource compiling data from high-throughput Drosophila in vivo and in vitro RNAi screens. FLIGHT includes details of RNAi reagents and their predicted off-target effects, alongside RNAi screen hits, scores and phenotypes, including images from high-content screens. The latest release of FLIGHT is designed to enable users to upload, analyze, integrate and share their own RNAi screens. Users can perform multiple normalizations, view quality control plots, detect and assign screen hits and compare hits from multiple screens using a variety of methods including hierarchical clustering. FLIGHT integrates RNAi screen data with microarray gene expression as well as genomic annotations and genetic/physical interaction datasets to provide a single interface for RNAi screen analysis and datamining in Drosophila. PMID:20855970
arrayCGHbase: an analysis platform for comparative genomic hybridization microarrays
Menten, Björn; Pattyn, Filip; De Preter, Katleen; Robbrecht, Piet; Michels, Evi; Buysse, Karen; Mortier, Geert; De Paepe, Anne; van Vooren, Steven; Vermeesch, Joris; Moreau, Yves; De Moor, Bart; Vermeulen, Stefan; Speleman, Frank; Vandesompele, Jo
2005-01-01
Background The availability of the human genome sequence as well as the large number of physically accessible oligonucleotides, cDNA, and BAC clones across the entire genome has triggered and accelerated the use of several platforms for analysis of DNA copy number changes, amongst others microarray comparative genomic hybridization (arrayCGH). One of the challenges inherent to this new technology is the management and analysis of large numbers of data points generated in each individual experiment. Results We have developed arrayCGHbase, a comprehensive analysis platform for arrayCGH experiments consisting of a MIAME (Minimal Information About a Microarray Experiment) supportive database using MySQL underlying a data mining web tool, to store, analyze, interpret, compare, and visualize arrayCGH results in a uniform and user-friendly format. Following its flexible design, arrayCGHbase is compatible with all existing and forthcoming arrayCGH platforms. Data can be exported in a multitude of formats, including BED files to map copy number information on the genome using the Ensembl or UCSC genome browser. Conclusion ArrayCGHbase is a web based and platform independent arrayCGH data analysis tool, that allows users to access the analysis suite through the internet or a local intranet after installation on a private server. ArrayCGHbase is available at . PMID:15910681
Constitutional downregulation of SEMA5A expression in autism.
Melin, M; Carlsson, B; Anckarsater, H; Rastam, M; Betancur, C; Isaksson, A; Gillberg, C; Dahl, N
2006-01-01
There is strong evidence for the importance of genetic factors in idiopathic autism. The results from independent twin and family studies suggest that the disorder is caused by the action of several genes, possibly acting epistatically. We have used cDNA microarray technology for the identification of constitutional changes in the gene expression profile associated with idiopathic autism. Samples were obtained and analyzed from 6 affected subjects belonging to multiplex autism families and from 6 healthy controls. We assessed the expression levels for approximately 7,700 genes by cDNA microarrays using mRNA derived from Epstein-Barr virus-transformed B lymphocytes. The microarray data were analyzed in order to identify up- or downregulation of specific genes. A common pattern with nine downregulated genes was identified among samples derived from individuals with autism when compared to controls. Four of these nine genes encode proteins involved in biological processes associated with brain function or the immune system, and are consequently considered as candidates for genes associated with autism. Quantitative real-time PCR confirms the downregulation of the gene encoding SEMA5A, a protein involved in axonal guidance. Epstein-Barr virus should be considered as a possible source for altered expression, but our consistent results make us suggest SEMA5A as a candidate gene in the etiology of idiopathic autism.
Constitutional downregulation of SEMA5A expression in autism
Melin, Malin; Carlsson, Birgit; Anckarsäter, Henrik; Rastam, Maria; Betancur, Catalina; Isaksson, Anders; Gillberg, Christopher; Dahl, Niklas
2006-01-01
There is strong evidence for the importance of genetic factors in idiopathic autism. The results from independent twin and family studies suggest that the disorder is caused by the action of several genes, possibly acting epistatically. We have used cDNA microarray technology for the identification of constitutional changes in the gene expression profile associated with idiopathic autism. Samples were obtained and analyzed from six affected subjects belonging to multiplex autism families and from six healthy controls. We assessed the expression levels for approximately 7,700 genes by cDNA microarrays using mRNA derived from Epstein Barr virus (EBV)-transformed B-lymphocytes. The microarray data were analyzed in order to identify up- or down-regulation of specific genes. A common pattern with nine down-regulated genes was identified among samples derived from individuals with autism when compared to controls. Four of these nine genes encode proteins involved in biological processes associated with brain function or the immune system, and are consequently considered as candidates for genes associated with autism. Quantitative real-time PCR confirms the down-regulation of the gene encoding SEMA5A, a protein involved in axonal guidance. EBV should be considered as a possible source for altered expression, but our consistent results make us suggest SEMA5A as a candidate gene in the etiology of idiopathic autism. PMID:17028446
Dynamic variable selection in SNP genotype autocalling from APEX microarray data.
Podder, Mohua; Welch, William J; Zamar, Ruben H; Tebbutt, Scott J
2006-11-30
Single nucleotide polymorphisms (SNPs) are DNA sequence variations, occurring when a single nucleotide--adenine (A), thymine (T), cytosine (C) or guanine (G)--is altered. Arguably, SNPs account for more than 90% of human genetic variation. Our laboratory has developed a highly redundant SNP genotyping assay consisting of multiple probes with signals from multiple channels for a single SNP, based on arrayed primer extension (APEX). This mini-sequencing method is a powerful combination of a highly parallel microarray with distinctive Sanger-based dideoxy terminator sequencing chemistry. Using this microarray platform, our current genotype calling system (known as SNP Chart) is capable of calling single SNP genotypes by manual inspection of the APEX data, which is time-consuming and exposed to user subjectivity bias. Using a set of 32 Coriell DNA samples plus three negative PCR controls as a training data set, we have developed a fully-automated genotyping algorithm based on simple linear discriminant analysis (LDA) using dynamic variable selection. The algorithm combines separate analyses based on the multiple probe sets to give a final posterior probability for each candidate genotype. We have tested our algorithm on a completely independent data set of 270 DNA samples, with validated genotypes, from patients admitted to the intensive care unit (ICU) of St. Paul's Hospital (plus one negative PCR control sample). Our method achieves a concordance rate of 98.9% with a 99.6% call rate for a set of 96 SNPs. By adjusting the threshold value for the final posterior probability of the called genotype, the call rate reduces to 94.9% with a higher concordance rate of 99.6%. We also reversed the two independent data sets in their training and testing roles, achieving a concordance rate up to 99.8%. The strength of this APEX chemistry-based platform is its unique redundancy having multiple probes for a single SNP. 
Our model-based genotype calling algorithm captures the redundancy in the system by considering all the underlying probe features of a particular SNP, automatically down-weighting any 'bad data' corresponding to image artifacts on the microarray slide or failure of a specific chemistry. In this regard, our method automatically selects the probes that work well and reduces the effect of poorly performing probes in a sample-specific manner, for any number of SNPs.
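The calling scheme described in this abstract (one LDA per probe set, a combined posterior per candidate genotype, and a posterior threshold that trades call rate against concordance) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the data are synthetic, the function names are hypothetical, and the per-probe-set posteriors are combined by a plain average, which is a stand-in for the authors' dynamic variable selection rather than a reproduction of it.

```python
import numpy as np

GENOTYPES = ["AA", "AB", "BB"]  # candidate genotypes for one biallelic SNP

def fit_lda(X, y, n_classes=3):
    """Fit LDA with a pooled covariance matrix shared by all classes."""
    means = np.array([X[y == k].mean(axis=0) for k in range(n_classes)])
    centered = X - means[y]
    cov = centered.T @ centered / (len(X) - n_classes)
    cov += 1e-6 * np.eye(X.shape[1])  # small ridge for numerical stability
    priors = np.bincount(y, minlength=n_classes) / len(y)
    return means, np.linalg.inv(cov), np.log(priors)

def lda_posterior(model, X):
    """Posterior probability of each genotype for every sample in X."""
    means, cov_inv, log_priors = model
    # Linear discriminant: x' S^-1 mu_k - 0.5 mu_k' S^-1 mu_k + log prior_k
    quad = 0.5 * np.sum(means @ cov_inv * means, axis=1)
    scores = X @ cov_inv @ means.T - quad + log_priors
    scores -= scores.max(axis=1, keepdims=True)  # avoid overflow in exp
    p = np.exp(scores)
    return p / p.sum(axis=1, keepdims=True)

def call_genotypes(posteriors_per_probe_set, threshold=0.9):
    """Average posteriors over probe sets (assumed rule); -1 marks a no-call."""
    combined = np.mean(posteriors_per_probe_set, axis=0)
    calls = combined.argmax(axis=1)
    calls[combined.max(axis=1) < threshold] = -1
    return calls, combined

# Synthetic two-channel signals; cluster centers are illustrative only.
rng = np.random.default_rng(0)
centers = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])

def simulate(n_per_class, noise):
    y = np.repeat(np.arange(3), n_per_class)
    return centers[y] + rng.normal(scale=noise, size=(len(y), 2)), y

# Two hypothetical probe sets for the same SNP; the second is noisier.
noise_levels = (0.03, 0.08)
train = [simulate(10, s) for s in noise_levels]
y_test = np.repeat(np.arange(3), 20)
tests = [centers[y_test] + rng.normal(scale=s, size=(60, 2)) for s in noise_levels]

posts = np.array([lda_posterior(fit_lda(Xtr, ytr), Xte)
                  for (Xtr, ytr), Xte in zip(train, tests)])
calls, combined = call_genotypes(posts, threshold=0.9)

called = calls >= 0
call_rate = called.mean()                                # fraction of samples called
concordance = (calls[called] == y_test[called]).mean()   # agreement among called samples
print(f"call rate {call_rate:.3f}, concordance {concordance:.3f}")
```

Raising `threshold` turns more low-confidence samples into no-calls, which is the mechanism behind the reported trade-off between a lower call rate and a higher concordance rate.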
Vlismas, Antonis; Bletsa, Ritsa; Mavrogianni, Despina; Mamali, Georgina; Pergamali, Maria; Dinopoulou, Vasiliki; Partsinevelos, George; Drakakis, Peter; Loutradis, Dimitris
2016-01-01
Previous microarray analyses of RNAs from 8-cell (8C) human embryos revealed a lack of cell cycle checkpoints and overexpression of core circadian oscillators and cell cycle drivers relative to pluripotent human stem cells [human embryonic stem cells/induced pluripotent stem (hES/iPS)] and fibroblasts, suggesting growth factor independence during early cleavage stages. To explore this possibility, we queried our combined microarray database for expression of 487 growth factors and receptors. Fifty-one gene elements were overdetected on the 8C arrays relative to hES/iPS cells, including 14 detected at least 80-fold higher, which annotated to multiple pathways: six cytokine family (CSF1R, IL2RG, IL3RA, IL4, IL17B, IL23R), four transforming growth factor beta (TGFB) family (BMP6, BMP15, GDF9, ENG), one fibroblast growth factor (FGF) family [FGF14(FHF4)], one epidermal growth factor member (GAB1), plus CD36, and CLEC10A. 8C-specific gene elements were enriched (73%) for reported circadian-controlled genes in mouse tissues. High-level detection of CSF1R, ENG, IL23R, and IL3RA specifically on the 8C arrays suggests the embryo plays an active role in blocking immune rejection and is poised for trophectoderm development; robust detection of NRG1, GAB1, -2, GRB7, and FGF14(FHF4) indicates novel roles in early development in addition to their known roles in later development. Forty-four gene elements were underdetected on the 8C arrays, including 11 at least 80-fold under the pluripotent cells: two cytokines (IFITM1, TNFRSF8), five TGFBs (BMP7, LEFTY1, LEFTY2, TDGF1, TDGF3), two FGFs (FGF2, FGF receptor 1), plus ING5, and WNT6. The microarray detection patterns suggest that hES/iPS cells exhibit suppressed circadian competence, underexpression of early differentiation markers, and more robust expression of generic pluripotency genes, in keeping with an artificial state of continual uncommitted cell division.
In contrast, gene expression patterns of the 8C embryo suggest that it is an independent circadian rhythm-competent equivalence group poised to signal its environment, defend against maternal immune rejection, and begin the rapid commitment events of early embryogenesis. PMID:26493868
Extraction of drainage networks from large terrain datasets using high throughput computing
NASA Astrophysics Data System (ADS)
Gong, Jianya; Xie, Jibo
2009-02-01
Advanced digital photogrammetry and remote sensing technology produces large terrain datasets (LTD). Processing and using these LTD has become a major challenge for GIS users. Extracting drainage networks, which are basic to hydrological applications, from LTD is a typical application of digital terrain analysis (DTA) in geographical information applications. Existing serial drainage algorithms cannot deal with large data volumes in a timely fashion, and few GIS platforms can process LTD beyond the GB size. High throughput computing (HTC), a distributed parallel computing mode, is proposed to improve the efficiency of drainage network extraction from LTD. Drainage network extraction using HTC involves two key issues: (1) how to decompose the large DEM datasets into independent computing units and (2) how to merge the separate outputs into a final result. A new decomposition method is presented in which the large datasets are partitioned into independent computing units using natural watershed boundaries instead of regular 1-dimensional (strip-wise) or 2-dimensional (block-wise) decomposition. Because the distribution of drainage networks is strongly related to watershed boundaries, the new decomposition method is more effective and natural. The method to extract natural watershed boundaries was improved by using multi-scale DEMs instead of single-scale DEMs. An HTC environment is employed to test the proposed methods with real datasets.
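The two issues named in this abstract, decomposing a DEM into watershed-bounded units and merging the per-unit outputs, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the watershed label grid is taken as given (the abstract derives it from multi-scale DEMs), a simple D8 flow accumulation stands in for the drainage algorithm, a local thread pool stands in for an HTC job scheduler, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def flow_accumulation(dem, mask):
    """D8 flow accumulation restricted to the cells where mask is True."""
    rows, cols = dem.shape
    acc = np.where(mask, 1, 0)  # every cell contributes itself
    cells = [(r, c) for r in range(rows) for c in range(cols) if mask[r, c]]
    # Visit cells from highest to lowest so upstream totals are final
    # before they are passed downstream.
    for r, c in sorted(cells, key=lambda rc: -dem[rc]):
        best, target = 0.0, None
        for dr, dc in NEIGHBORS:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and mask[nr, nc]:
                slope = (dem[r, c] - dem[nr, nc]) / np.hypot(dr, dc)
                if slope > best:  # steepest strictly downhill neighbor
                    best, target = slope, (nr, nc)
        if target is not None:
            acc[target] += acc[r, c]
    return acc

def decompose(dem, labels):
    """Cut one bounding-box window per watershed label, with its cell mask."""
    units = []
    for wid in np.unique(labels):
        rs, cs = np.nonzero(labels == wid)
        window = (slice(rs.min(), rs.max() + 1), slice(cs.min(), cs.max() + 1))
        units.append((window, dem[window], labels[window] == wid))
    return units

def extract_channels(dem, labels, threshold, workers=2):
    """Process each watershed independently, then merge into one grid."""
    units = decompose(dem, labels)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda u: flow_accumulation(u[1], u[2]), units))
    merged = np.zeros(dem.shape, dtype=int)
    for (window, _, mask), acc in zip(units, results):
        merged[window][mask] = acc[mask]  # write back only this watershed's cells
    return merged >= threshold, merged

# Toy DEM: two basins separated by a ridge between columns 2 and 3.
dem = np.tile(np.array([0, 1, 2, 2, 1, 0], dtype=float), (4, 1))
labels = np.tile(np.array([0, 0, 0, 1, 1, 1]), (4, 1))
channels, acc = extract_channels(dem, labels, threshold=3)

# Serial reference on the undivided grid, for comparison with the merged result.
serial = flow_accumulation(dem, np.ones(dem.shape, dtype=bool))
print(np.array_equal(acc, serial))
```

Because no flow crosses a true watershed boundary, each unit needs no data from its neighbors, which is why the merged grid can match a serial run on the whole DEM. Strip-wise or block-wise tiles would cut through flow paths and require an extra exchange step at tile edges.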