Gene Selection and Cancer Classification: A Rough Sets Based Approach
NASA Astrophysics Data System (ADS)
Sun, Lijun; Miao, Duoqian; Zhang, Hongyun
Identification of informative gene subsets responsible for discerning between available samples of gene expression data is an important task in bioinformatics. Reducts, from rough sets theory, corresponding to minimal sets of essential genes for discerning samples, are an efficient tool for gene selection. Due to the computational complexity of the existing reduct algorithms, feature ranking is usually used to narrow down the gene space as the first step, and top-ranked genes are selected. In this paper, we define a novel criterion based on the expression level difference between classes and contribution to classification of the gene for scoring genes and present an algorithm for generating all possible reducts from informative genes. The algorithm takes the whole attribute set into account and finds short reducts with a significant reduction in computational complexity. An exploration of this approach on benchmark gene expression data sets demonstrates that this approach is successful for selecting highly discriminative genes and the classification accuracy is impressive.
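To make the notion of a reduct concrete, the following is a minimal sketch, not the authors' algorithm: it checks whether a subset of discretized genes still separates all samples of different classes, and brute-forces the smallest such subsets. The toy table, the gene names, and the helper names `discerns` and `shortest_reducts` are hypothetical, and exhaustive search is only feasible for a handful of genes.

```python
from itertools import combinations

# Hypothetical discretized expression table: (gene values, class label) per sample.
SAMPLES = [
    ({"g1": 0, "g2": 1, "g3": 0}, "tumor"),
    ({"g1": 0, "g2": 0, "g3": 1}, "normal"),
    ({"g1": 1, "g2": 1, "g3": 1}, "tumor"),
    ({"g1": 1, "g2": 0, "g3": 0}, "normal"),
]

def discerns(genes, samples):
    """True if samples with identical values on `genes` always share a class."""
    seen = {}
    for values, label in samples:
        key = tuple(values[g] for g in genes)
        if seen.setdefault(key, label) != label:
            return False
    return True

def shortest_reducts(samples):
    """Brute-force the smallest gene subsets that preserve discernibility."""
    all_genes = sorted(samples[0][0])
    for size in range(1, len(all_genes) + 1):
        found = [set(c) for c in combinations(all_genes, size) if discerns(c, samples)]
        if found:
            return found
    return []  # the full table is itself inconsistent

reducts = shortest_reducts(SAMPLES)
```

In this toy table only `g2` separates the two classes on its own, so it is the single shortest reduct; a subset of minimum size is automatically minimal, because every proper subset is smaller and was already rejected.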
Clark, Neil R.; Szymkiewicz, Maciej; Wang, Zichen; Monteiro, Caroline D.; Jones, Matthew R.; Ma’ayan, Avi
2016-01-01
Gene set analysis of differential expression, which identifies collectively differentially expressed gene sets, has become an important tool for biology. The power of this approach lies in its reduction of the dimensionality of the statistical problem and its incorporation of biological interpretation by construction. Many approaches to gene set analysis have been proposed, but benchmarking their performance in the setting of real biological data is difficult due to the lack of a gold standard. In a previously published work we proposed a geometrical approach to differential expression which performed highly in benchmarking tests and compared well to the most popular methods of differential gene expression. As reported, this approach has a natural extension to gene set analysis which we call Principal Angle Enrichment Analysis (PAEA). PAEA employs dimensionality reduction and a multivariate approach for gene set enrichment analysis. However, the performance of this method has not been assessed nor its implementation as a web-based tool. Here we describe new benchmarking protocols for gene set analysis methods and find that PAEA performs highly. The PAEA method is implemented as a user-friendly web-based tool, which contains 70 gene set libraries and is freely available to the community. PMID:26848405
Clark, Neil R; Szymkiewicz, Maciej; Wang, Zichen; Monteiro, Caroline D; Jones, Matthew R; Ma'ayan, Avi
2015-11-01
Gene set analysis of differential expression, which identifies collectively differentially expressed gene sets, has become an important tool for biology. The power of this approach lies in its reduction of the dimensionality of the statistical problem and its incorporation of biological interpretation by construction. Many approaches to gene set analysis have been proposed, but benchmarking their performance in the setting of real biological data is difficult due to the lack of a gold standard. In a previously published work we proposed a geometrical approach to differential expression which performed highly in benchmarking tests and compared well to the most popular methods of differential gene expression. As reported, this approach has a natural extension to gene set analysis which we call Principal Angle Enrichment Analysis (PAEA). PAEA employs dimensionality reduction and a multivariate approach for gene set enrichment analysis. However, the performance of this method has not been assessed nor its implementation as a web-based tool. Here we describe new benchmarking protocols for gene set analysis methods and find that PAEA performs highly. The PAEA method is implemented as a user-friendly web-based tool, which contains 70 gene set libraries and is freely available to the community.
Distributed Function Mining for Gene Expression Programming Based on Fast Reduction.
Deng, Song; Yue, Dong; Yang, Le-chan; Fu, Xiong; Feng, Ya-zhou
2016-01-01
For high-dimensional and massive data sets, traditional centralized gene expression programming (GEP) or improved algorithms lead to increased run-time and decreased prediction accuracy. To solve this problem, this paper proposes a new improved algorithm called distributed function mining for gene expression programming based on fast reduction (DFMGEP-FR). In DFMGEP-FR, fast attribution reduction in binary search algorithms (FAR-BSA) is proposed to quickly find the optimal attribution set, and the function consistency replacement algorithm is given to solve integration of the local function model. Thorough comparative experiments for DFMGEP-FR, centralized GEP and the parallel gene expression programming algorithm based on simulated annealing (parallel GEPSA) are included in this paper. For the waveform, mushroom, connect-4 and musk datasets, the comparative results show that the average time-consumption of DFMGEP-FR drops by 89.09%, 88.85%, 85.79% and 93.06%, respectively, in contrast to centralized GEP and by 12.5%, 8.42%, 9.62% and 13.75%, respectively, compared with parallel GEPSA. Six well-studied UCI test data sets demonstrate the efficiency and capability of our proposed DFMGEP-FR algorithm for distributed function mining.
Case-based retrieval framework for gene expression data.
Anaissi, Ali; Goyal, Madhu; Catchpoole, Daniel R; Braytee, Ali; Kennedy, Paul J
2015-01-01
The process of retrieving similar cases in a case-based reasoning system is considered a big challenge for gene expression data sets. The huge number of gene expression values generated by microarray technology leads to complex data sets and similarity measures for high-dimensional data are problematic. Hence, gene expression similarity measurements require numerous machine-learning and data-mining techniques, such as feature selection and dimensionality reduction, to be incorporated into the retrieval process. This article proposes a case-based retrieval framework that uses a k-nearest-neighbor classifier with a weighted-feature-based similarity to retrieve previously treated patients based on their gene expression profiles. The herein-proposed methodology is validated on several data sets: a childhood leukemia data set collected from The Children's Hospital at Westmead, as well as the Colon cancer, the National Cancer Institute (NCI), and the Prostate cancer data sets. Results obtained by the proposed framework in retrieving patients of the data sets who are similar to new patients are as follows: 96% accuracy on the childhood leukemia data set, 95% on the NCI data set, 93% on the Colon cancer data set, and 98% on the Prostate cancer data set. The designed case-based retrieval framework is an appropriate choice for retrieving previous patients who are similar to a new patient, on the basis of their gene expression data, for better diagnosis and treatment of childhood leukemia. Moreover, this framework can be applied to other gene expression data sets using some or all of its steps.
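The retrieval step described above can be sketched as a k-nearest-neighbor lookup under a feature-weighted distance. This is an illustrative sketch only: the weighted Euclidean distance, the toy profiles, the weights, and the function names are assumptions, and the abstract does not specify how the framework derives its weights.

```python
import math

def weighted_distance(a, b, weights):
    """Euclidean distance with a per-gene weight on each squared difference."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

def retrieve(query, cases, weights, k=3):
    """Return the k stored cases closest to `query` under the weighted distance."""
    ranked = sorted(cases, key=lambda c: weighted_distance(query, c["profile"], weights))
    return ranked[:k]

def classify(query, cases, weights, k=3):
    """Majority vote over the labels of the retrieved cases."""
    votes = {}
    for case in retrieve(query, cases, weights, k):
        votes[case["label"]] = votes.get(case["label"], 0) + 1
    return max(votes, key=votes.get)

# Hypothetical stored patients with two-gene expression profiles.
CASES = [
    {"id": "p1", "profile": [1.0, 5.0], "label": "A"},
    {"id": "p2", "profile": [1.2, 0.0], "label": "A"},
    {"id": "p3", "profile": [3.0, 5.1], "label": "B"},
    {"id": "p4", "profile": [3.1, 0.2], "label": "B"},
]
WEIGHTS = [1.0, 0.0]  # assumed weights: only the first gene is informative

nearest = retrieve([1.1, 2.5], CASES, WEIGHTS, k=2)
predicted = classify([1.1, 2.5], CASES, WEIGHTS, k=2)
```

With the second gene weighted to zero, the query is matched to `p1` and `p2` even though their second-gene values differ widely, which is the effect a feature-weighted similarity is meant to have on noisy, high-dimensional profiles.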
Gene selection for tumor classification using neighborhood rough sets and entropy measures.
Chen, Yumin; Zhang, Zunjun; Zheng, Jianzhong; Ma, Ying; Xue, Yu
2017-03-01
With the development of bioinformatics, tumor classification from gene expression data has become an important technology for cancer diagnosis. Since gene expression data often contain thousands of genes and a small number of samples, gene selection from gene expression data becomes a key step for tumor classification. Attribute reduction of rough sets has been successfully applied to the gene selection field, as it is data-driven and requires no additional information. However, the traditional rough set method deals with discrete data only. Gene expression data containing real-valued or noisy data are usually subjected to discretization preprocessing, which may result in poor classification accuracy. In this paper, we propose a novel gene selection method based on the neighborhood rough set model, which can deal with real-valued data whilst maintaining the original gene classification information. Moreover, this paper introduces an entropy measure under the framework of neighborhood rough sets for tackling the uncertainty and noise of gene expression data. The utilization of this measure can bring about a discovery of compact gene subsets. Finally, a gene selection algorithm is designed based on neighborhood granules and the entropy measure. Experiments on two gene expression data sets show that the proposed gene selection is an effective method for improving the accuracy of tumor classification. Copyright © 2017 Elsevier Inc. All rights reserved.
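The idea of a neighborhood granule on real-valued data can be illustrated with a small sketch. It computes the standard neighborhood dependency (the share of samples whose whole neighborhood carries one class), not the entropy measure proposed in the paper. The Chebyshev distance, the radius `delta`, the toy data, and the function names are assumptions made for illustration.

```python
def neighborhood(i, data, genes, delta):
    """Indices of samples within Chebyshev distance `delta` of sample i on `genes`."""
    return [
        j for j, row in enumerate(data)
        if max(abs(row[g] - data[i][g]) for g in genes) <= delta
    ]

def dependency(data, labels, genes, delta):
    """Fraction of samples whose entire neighborhood shares their class label."""
    consistent = sum(
        all(labels[j] == labels[i] for j in neighborhood(i, data, genes, delta))
        for i in range(len(data))
    )
    return consistent / len(data)

# Hypothetical normalized expression values: rows are samples, columns are genes.
DATA = [[0.10, 0.50], [0.15, 0.90], [0.80, 0.55], [0.85, 0.10]]
LABELS = ["tumor", "tumor", "normal", "normal"]

dep_gene0 = dependency(DATA, LABELS, [0], delta=0.2)
dep_gene1 = dependency(DATA, LABELS, [1], delta=0.2)
```

Here gene 0 yields neighborhoods that are pure in class (dependency 1.0), while gene 1 mixes a tumor and a normal sample in one neighborhood (dependency 0.5), so a selection procedure built on this score would prefer gene 0 without any discretization step.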
Novel gene sets improve set-level classification of prokaryotic gene expression data.
Holec, Matěj; Kuželka, Ondřej; Železný, Filip
2015-10-28
Set-level classification of gene expression data has received significant attention recently. In this setting, high-dimensional vectors of features corresponding to genes are converted into lower-dimensional vectors of features corresponding to biologically interpretable gene sets. The dimensionality reduction brings the promise of a decreased risk of overfitting, potentially resulting in improved accuracy of the learned classifiers. However, recent empirical research has not confirmed this expectation. Here we hypothesize that the reported unfavorable classification results in the set-level framework were due to the adoption of unsuitable gene sets defined typically on the basis of the Gene Ontology and the KEGG database of metabolic networks. We explore an alternative approach to defining gene sets, based on regulatory interactions, which we expect to collect genes with more correlated expression. We hypothesize that such more correlated gene sets will enable the learning of more accurate classifiers. We define two families of gene sets using information on regulatory interactions, and evaluate them on phenotype-classification tasks using public prokaryotic gene expression data sets. From each of the two gene-set families, we first select the best-performing subtype. The two selected subtypes are then evaluated on independent (testing) data sets against state-of-the-art gene sets and against the conventional gene-level approach. The novel gene sets are indeed more correlated than the conventional ones, and lead to significantly more accurate classifiers. Novel gene sets defined on the basis of regulatory interactions improve set-level classification of gene expression data. The experimental scripts and other material needed to reproduce the experiments are available at http://ida.felk.cvut.cz/novelgenesets.tar.gz.
CATTAERT, TOM; CALLE, M. LUZ; DUDEK, SCOTT M.; MAHACHIE JOHN, JESTINAH M.; VAN LISHOUT, FRANÇOIS; URREA, VICTOR; RITCHIE, MARYLYN D.; VAN STEEN, KRISTEL
2010-01-01
SUMMARY Analyzing the combined effects of genes and/or environmental factors on the development of complex diseases is a great challenge from both the statistical and computational perspective, even using a relatively small number of genetic and non-genetic exposures. Several data mining methods have been proposed for interaction analysis, among them, the Multifactor Dimensionality Reduction Method (MDR), which has proven its utility in a variety of theoretical and practical settings. Model-Based Multifactor Dimensionality Reduction (MB-MDR), a relatively new MDR-based technique that is able to unify the best of both non-parametric and parametric worlds, was developed to address some of the remaining concerns that go along with an MDR-analysis. These include the restriction to univariate, dichotomous traits, the absence of flexible ways to adjust for lower-order effects and important confounders, and the difficulty to highlight epistasis effects when too many multi-locus genotype cells are pooled into two new genotype groups. Whereas the true value of MB-MDR can only reveal itself by extensive applications of the method in a variety of real-life scenarios, here we investigate the empirical power of MB-MDR to detect gene-gene interactions in the absence of any noise and in the presence of genotyping error, missing data, phenocopy, and genetic heterogeneity. For the considered simulation settings, we show that the power is generally higher for MB-MDR than for MDR, in particular in the presence of genetic heterogeneity, phenocopy, or low minor allele frequencies. PMID:21158747
Kent, Jack W
2016-02-03
New technologies for acquisition of genomic data, while offering unprecedented opportunities for genetic discovery, also impose severe burdens of interpretation and penalties for multiple testing. The Pathway-based Analyses Group of the Genetic Analysis Workshop 19 (GAW19) sought reduction of multiple-testing burden through various approaches to aggregation of high-dimensional data in pathways informed by prior biological knowledge. Experimental methods tested included the use of "synthetic pathways" (random sets of genes) to estimate power and false-positive error rate of methods applied to simulated data; data reduction via independent components analysis, single-nucleotide polymorphism (SNP)-SNP interaction, and use of gene sets to estimate genetic similarity; and general assessment of the efficacy of prior biological knowledge to reduce the dimensionality of complex genomic data. The work of this group explored several promising approaches to managing high-dimensional data, with the caveat that these methods are necessarily constrained by the quality of external bioinformatic annotation.
Ozerov, Ivan V; Lezhnina, Ksenia V; Izumchenko, Evgeny; Artemov, Artem V; Medintsev, Sergey; Vanhaelen, Quentin; Aliper, Alexander; Vijg, Jan; Osipov, Andreyan N; Labat, Ivan; West, Michael D; Buzdin, Anton; Cantor, Charles R; Nikolsky, Yuri; Borisov, Nikolay; Irincheeva, Irina; Khokhlovich, Edward; Sidransky, David; Camargo, Miguel Luiz; Zhavoronkov, Alex
2016-11-16
Signalling pathway activation analysis is a powerful approach for extracting biologically relevant features from large-scale transcriptomic and proteomic data. However, modern pathway-based methods often fail to provide stable pathway signatures of a specific phenotype or reliable disease biomarkers. In the present study, we introduce the in silico Pathway Activation Network Decomposition Analysis (iPANDA) as a scalable robust method for biomarker identification using gene expression data. The iPANDA method combines precalculated gene coexpression data with gene importance factors based on the degree of differential gene expression and pathway topology decomposition for obtaining pathway activation scores. Using Microarray Analysis Quality Control (MAQC) data sets and pretreatment data on Taxol-based neoadjuvant breast cancer therapy from multiple sources, we demonstrate that iPANDA provides significant noise reduction in transcriptomic data and identifies highly robust sets of biologically relevant pathway signatures. We successfully apply iPANDA for stratifying breast cancer patients according to their sensitivity to neoadjuvant therapy.
Ozerov, Ivan V.; Lezhnina, Ksenia V.; Izumchenko, Evgeny; Artemov, Artem V.; Medintsev, Sergey; Vanhaelen, Quentin; Aliper, Alexander; Vijg, Jan; Osipov, Andreyan N.; Labat, Ivan; West, Michael D.; Buzdin, Anton; Cantor, Charles R.; Nikolsky, Yuri; Borisov, Nikolay; Irincheeva, Irina; Khokhlovich, Edward; Sidransky, David; Camargo, Miguel Luiz; Zhavoronkov, Alex
2016-01-01
Signalling pathway activation analysis is a powerful approach for extracting biologically relevant features from large-scale transcriptomic and proteomic data. However, modern pathway-based methods often fail to provide stable pathway signatures of a specific phenotype or reliable disease biomarkers. In the present study, we introduce the in silico Pathway Activation Network Decomposition Analysis (iPANDA) as a scalable robust method for biomarker identification using gene expression data. The iPANDA method combines precalculated gene coexpression data with gene importance factors based on the degree of differential gene expression and pathway topology decomposition for obtaining pathway activation scores. Using Microarray Analysis Quality Control (MAQC) data sets and pretreatment data on Taxol-based neoadjuvant breast cancer therapy from multiple sources, we demonstrate that iPANDA provides significant noise reduction in transcriptomic data and identifies highly robust sets of biologically relevant pathway signatures. We successfully apply iPANDA for stratifying breast cancer patients according to their sensitivity to neoadjuvant therapy. PMID:27848968
Optimal selection of markers for validation or replication from genome-wide association studies.
Greenwood, Celia M T; Rangrej, Jagadish; Sun, Lei
2007-07-01
With reductions in genotyping costs and the fast pace of improvements in genotyping technology, it is not uncommon for the individuals in a single study to undergo genotyping using several different platforms, where each platform may contain different numbers of markers selected via different criteria. For example, a set of cases and controls may be genotyped at markers in a small set of carefully selected candidate genes, and shortly thereafter, the same cases and controls may be used for a genome-wide single nucleotide polymorphism (SNP) association study. After such initial investigations, often, a subset of "interesting" markers is selected for validation or replication. Specifically, by validation, we refer to the investigation of associations between the selected subset of markers and the disease in independent data. However, it is not obvious how to choose the best set of markers for this validation. There may be a prior expectation that some sets of genotyping data are more likely to contain real associations. For example, it may be more likely for markers in plausible candidate genes to show disease associations than markers in a genome-wide scan. Hence, it would be desirable to select proportionally more markers from the candidate gene set. When a fixed number of markers are selected for validation, we propose an approach for identifying an optimal marker-selection configuration by basing the approach on minimizing the stratified false discovery rate. We illustrate this approach using a case-control study of colorectal cancer from Ontario, Canada, and we show that this approach leads to substantial reductions in the estimated false discovery rates in the Ontario dataset for the selected markers, as well as reductions in the expected false discovery rates for the proposed validation dataset. Copyright 2007 Wiley-Liss, Inc.
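The stratified false discovery rate idea can be illustrated with a simplified sketch: Benjamini-Hochberg q-values are computed separately within each stratum, and the fixed number of markers with the smallest q-values is kept. This is a stand-in for intuition only and not the optimal-configuration procedure proposed in the paper; the marker names, p-values, and function names are hypothetical.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues

def select_markers(strata, n_select):
    """Compute q-values within each stratum, then keep the n smallest overall."""
    scored = []
    for name, markers in strata.items():
        qs = bh_qvalues([p for _, p in markers])
        scored.extend((q, marker, name) for (marker, _), q in zip(markers, qs))
    return sorted(scored)[:n_select]

# Hypothetical p-values: a small candidate-gene stratum and a larger scan stratum.
STRATA = {
    "candidate": [("rsA", 0.01), ("rsB", 0.04)],
    "genome_wide": [("rsC", 0.02), ("rsD", 0.2), ("rsE", 0.5), ("rsF", 0.9)],
}

chosen = select_markers(STRATA, n_select=2)
```

Because the multiplicity correction is applied per stratum, the small candidate-gene stratum pays a lighter penalty, and both selected markers come from it even though `rsC` has a smaller raw p-value than `rsB`. This mirrors the expectation in the text that proportionally more markers should be drawn from the stratum more likely to hold real associations.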
A Fast Multiple-Kernel Method With Applications to Detect Gene-Environment Interaction.
Marceau, Rachel; Lu, Wenbin; Holloway, Shannon; Sale, Michèle M; Worrall, Bradford B; Williams, Stephen R; Hsu, Fang-Chi; Tzeng, Jung-Ying
2015-09-01
Kernel machine (KM) models are a powerful tool for exploring associations between sets of genetic variants and complex traits. Although most KM methods use a single kernel function to assess the marginal effect of a variable set, KM analyses involving multiple kernels have become increasingly popular. Multikernel analysis allows researchers to study more complex problems, such as assessing gene-gene or gene-environment interactions, incorporating variance-component based methods for population substructure into rare-variant association testing, and assessing the conditional effects of a variable set adjusting for other variable sets. The KM framework is robust, powerful, and provides efficient dimension reduction for multifactor analyses, but requires the estimation of high dimensional nuisance parameters. Traditional estimation techniques, including regularization and the "expectation-maximization (EM)" algorithm, have a large computational cost and are not scalable to large sample sizes needed for rare variant analysis. Therefore, under the context of gene-environment interaction, we propose a computationally efficient and statistically rigorous "fastKM" algorithm for multikernel analysis that is based on a low-rank approximation to the nuisance effect kernel matrices. Our algorithm is applicable to various trait types (e.g., continuous, binary, and survival traits) and can be implemented using any existing single-kernel analysis software. Through extensive simulation studies, we show that our algorithm has similar performance to an EM-based KM approach for quantitative traits while running much faster. We also apply our method to the Vitamin Intervention for Stroke Prevention (VISP) clinical trial, examining gene-by-vitamin effects on recurrent stroke risk and gene-by-age effects on change in homocysteine level. © 2015 WILEY PERIODICALS, INC.
Design and verification of a pangenome microarray oligonucleotide probe set for Dehalococcoides spp.
Hug, Laura A; Salehi, Maryam; Nuin, Paulo; Tillier, Elisabeth R; Edwards, Elizabeth A
2011-08-01
Dehalococcoides spp. are an industrially relevant group of Chloroflexi bacteria capable of reductively dechlorinating contaminants in groundwater environments. Existing Dehalococcoides genomes revealed a high level of sequence identity within this group, including 98 to 100% 16S rRNA sequence identity between strains with diverse substrate specificities. Common molecular techniques for identification of microbial populations are often not applicable for distinguishing Dehalococcoides strains. Here we describe an oligonucleotide microarray probe set designed based on clustered Dehalococcoides genes from five different sources (strain DET195, CBDB1, BAV1, and VS genomes and the KB-1 metagenome). This "pangenome" probe set provides coverage of core Dehalococcoides genes as well as strain-specific genes while optimizing the potential for hybridization to closely related, previously unknown Dehalococcoides strains. The pangenome probe set was compared to probe sets designed independently for each of the five Dehalococcoides strains. The pangenome probe set demonstrated better predictability and higher detection of Dehalococcoides genes than strain-specific probe sets on nontarget strains with <99% average nucleotide identity. An in silico analysis of the expected probe hybridization against the recently released Dehalococcoides strain GT genome and additional KB-1 metagenome sequence data indicated that the pangenome probe set performs more robustly than the combined strain-specific probe sets in the detection of genes not included in the original design. The pangenome probe set represents a highly specific, universal tool for the detection and characterization of Dehalococcoides from contaminated sites. It has the potential to become a common platform for Dehalococcoides-focused research, allowing meaningful comparisons between microarray experiments regardless of the strain examined.
A Combinatorial Approach to Detecting Gene-Gene and Gene-Environment Interactions in Family Studies
Lou, Xiang-Yang; Chen, Guo-Bo; Yan, Lei; Ma, Jennie Z.; Mangold, Jamie E.; Zhu, Jun; Elston, Robert C.; Li, Ming D.
2008-01-01
Widespread multifactor interactions present a significant challenge in determining risk factors of complex diseases. Several combinatorial approaches, such as the multifactor dimensionality reduction (MDR) method, have emerged as a promising tool for better detecting gene-gene (G × G) and gene-environment (G × E) interactions. We recently developed a general combinatorial approach, namely the generalized multifactor dimensionality reduction (GMDR) method, which can entertain both qualitative and quantitative phenotypes and allows for both discrete and continuous covariates to detect G × G and G × E interactions in a sample of unrelated individuals. In this article, we report the development of an algorithm that can be used to study G × G and G × E interactions for family-based designs, called pedigree-based GMDR (PGMDR). Compared to the available method, our proposed method has several major improvements, including allowing for covariate adjustments and being applicable to arbitrary phenotypes, arbitrary pedigree structures, and arbitrary patterns of missing marker genotypes. Our Monte Carlo simulations provide evidence that the PGMDR method is superior in performance to identify epistatic loci compared to the MDR-pedigree disequilibrium test (PDT). Finally, we applied our proposed approach to a genetic data set on tobacco dependence and found a significant interaction between two taste receptor genes (i.e., TAS2R16 and TAS2R38) in affecting nicotine dependence. PMID:18834969
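For readers unfamiliar with the pooling step shared by MDR-type methods, the following minimal sketch shows the basic case-control version for unrelated individuals: each multi-locus genotype cell is labeled high or low risk by comparing its case:control ratio with the overall ratio. It does not implement the score-based GMDR or the pedigree-based PGMDR described above; the genotype coding, the toy data, and the function names are assumptions.

```python
from collections import defaultdict

def mdr_risk_cells(genotypes, status, loci):
    """Label each multi-locus genotype cell 'high' or 'low' risk."""
    cases = defaultdict(int)
    controls = defaultdict(int)
    for row, is_case in zip(genotypes, status):
        cell = tuple(row[locus] for locus in loci)
        if is_case:
            cases[cell] += 1
        else:
            controls[cell] += 1
    n_cases = sum(status)
    n_controls = len(status) - n_cases
    labels = {}
    for cell in set(cases) | set(controls):
        # Cross-multiplied ratio comparison avoids dividing by zero controls.
        high = cases[cell] * n_controls > controls[cell] * n_cases
        labels[cell] = "high" if high else "low"
    return labels

def mdr_accuracy(genotypes, status, loci):
    """Training accuracy of the pooled high/low-risk rule for the chosen loci."""
    labels = mdr_risk_cells(genotypes, status, loci)
    correct = sum(
        (labels[tuple(row[locus] for locus in loci)] == "high") == bool(is_case)
        for row, is_case in zip(genotypes, status)
    )
    return correct / len(status)

# Hypothetical data with a purely epistatic (XOR-like) effect of two loci.
GENOTYPES = [(0, 0), (0, 1), (1, 0), (1, 1), (0, 0), (0, 1), (1, 0), (1, 1)]
STATUS = [0, 1, 1, 0, 0, 1, 1, 0]  # 1 = case, 0 = control

acc_pair = mdr_accuracy(GENOTYPES, STATUS, loci=(0, 1))
acc_single = mdr_accuracy(GENOTYPES, STATUS, loci=(0,))
```

Neither locus is informative alone (accuracy 0.5), while the two-locus model classifies every individual correctly, which is the kind of interaction without marginal effects these combinatorial methods target. A real analysis would use cross-validation rather than training accuracy.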
DuanMu, Huizi; Wang, Yang; Bai, Xi; Cheng, Shufei; Deyholos, Michael K; Wong, Gane Ka-Shu; Li, Dan; Zhu, Dan; Li, Ran; Yu, Yang; Cao, Lei; Chen, Chao; Zhu, Yanming
2015-11-01
Soil alkalinity is an important environmental problem limiting agricultural productivity. Wild soybean (Glycine soja) shows strong alkaline stress tolerance, so it is an ideal plant candidate for studying the molecular mechanisms of alkaline tolerance and identifying alkaline stress-responsive genes. However, limited information is available about G. soja responses to alkaline stress on a genomic scale. Therefore, in the present study, we used RNA sequencing to compare transcript profiles of G. soja root responses to sodium bicarbonate (NaHCO3) at six time points, and a total of 68,138,478 pairs of clean reads were obtained using the Illumina GAIIX. Expression patterns of 46,404 G. soja genes were profiled in all six samples based on RNA-seq data using Cufflinks software. Then, 12 transcription factors from MYB, WRKY, NAC, bZIP, C2H2, HB, and TIFY families and 12 oxidation reduction related genes were chosen and verified to be induced in response to alkaline stress by using quantitative real-time polymerase chain reaction (qRT-PCR). The GO functional annotation analysis showed that besides "transcriptional regulation" and "oxidation reduction," these genes were involved in a variety of processes, such as "binding" and "response to stress." This is the first comprehensive transcriptome profiling analysis of wild soybean root under alkaline stress by RNA sequencing. Our results highlight changes in the gene expression patterns and identify a set of genes induced by NaHCO3 stress. These findings provide a base for the global analyses of G. soja alkaline stress tolerance mechanisms.
DFP: a Bioconductor package for fuzzy profile identification and gene reduction of microarray data
Glez-Peña, Daniel; Álvarez, Rodrigo; Díaz, Fernando; Fdez-Riverola, Florentino
2009-01-01
Background Expression profiling assays done by using DNA microarray technology generate enormous data sets that are not amenable to simple analysis. The greatest challenge in maximizing the use of this huge amount of data is to develop algorithms to interpret and interconnect results from different genes under different conditions. In this context, fuzzy logic can provide a systematic and unbiased way to both (i) find biologically significant insights relating to meaningful genes, thereby removing the need for expert knowledge in preliminary steps of microarray data analyses and (ii) reduce the cost and complexity of later applied machine learning techniques being able to achieve interpretable models. Results DFP is a new Bioconductor R package that implements a method for discretizing and selecting differentially expressed genes based on the application of fuzzy logic. DFP takes advantage of fuzzy membership functions to assign linguistic labels to gene expression levels. The technique builds a reduced set of relevant genes (FP, Fuzzy Pattern) able to summarize and represent each underlying class (pathology). A last step constructs a biased set of genes (DFP, Discriminant Fuzzy Pattern) by intersecting existing fuzzy patterns in order to detect discriminative elements. In addition, the software provides new functions and visualisation tools that summarize achieved results and aid in the interpretation of differentially expressed genes from multiple microarray experiments. Conclusion DFP integrates with other packages of the Bioconductor project, uses common data structures and is accompanied by ample documentation. It has the advantage that its parameters are highly configurable, facilitating the discovery of biologically relevant connections between sets of genes belonging to different pathologies. This information makes it possible to automatically filter irrelevant genes thereby reducing the large volume of data supplied by microarray experiments. 
Based on these contributions GENECBR, a successful tool for cancer diagnosis using microarray datasets, has recently been released. PMID:19178723
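The two ideas named in the abstract, assigning linguistic labels through fuzzy membership functions and keeping genes whose label is stable within a class, can be sketched as follows. This is a Python illustration under stated assumptions and does not reproduce the DFP R package or its API: the piecewise-linear membership shapes, the `low`/`high` anchors, the `min_share` threshold, and all function names are hypothetical.

```python
def clamp(value):
    """Restrict a membership degree to the interval [0, 1]."""
    return max(0.0, min(1.0, value))

def memberships(x, low, high):
    """Degrees of membership of expression level x in 'Low', 'Medium', 'High'."""
    mid = (low + high) / 2
    half = mid - low
    return {
        "Low": clamp((mid - x) / half),
        "Medium": clamp(1 - abs(x - mid) / half),
        "High": clamp((x - mid) / half),
    }

def label(x, low, high):
    """Linguistic label with the largest membership degree."""
    degrees = memberships(x, low, high)
    return max(degrees, key=degrees.get)

def fuzzy_pattern(expression_by_gene, low, high, min_share=0.75):
    """Keep genes whose most frequent label covers at least `min_share` of samples."""
    pattern = {}
    for gene, values in expression_by_gene.items():
        labels = [label(v, low, high) for v in values]
        top = max(set(labels), key=labels.count)
        if labels.count(top) / len(labels) >= min_share:
            pattern[gene] = top
    return pattern

# Hypothetical expression levels of two genes across four samples of one class.
CLASS_SAMPLES = {"g1": [1.0, 1.5, 2.0, 9.0], "g2": [1.0, 9.5, 6.0, 6.0]}

pattern = fuzzy_pattern(CLASS_SAMPLES, low=2.0, high=10.0)
```

In this toy class, `g1` is labeled "Low" in three of four samples and enters the pattern, while `g2` has no sufficiently dominant label and is filtered out. Comparing such per-class patterns across classes is the step the abstract describes for detecting discriminative genes.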
DFP: a Bioconductor package for fuzzy profile identification and gene reduction of microarray data.
Glez-Peña, Daniel; Alvarez, Rodrigo; Díaz, Fernando; Fdez-Riverola, Florentino
2009-01-29
Expression profiling assays done by using DNA microarray technology generate enormous data sets that are not amenable to simple analysis. The greatest challenge in maximizing the use of this huge amount of data is to develop algorithms to interpret and interconnect results from different genes under different conditions. In this context, fuzzy logic can provide a systematic and unbiased way to both (i) find biologically significant insights relating to meaningful genes, thereby removing the need for expert knowledge in preliminary steps of microarray data analyses and (ii) reduce the cost and complexity of later applied machine learning techniques being able to achieve interpretable models. DFP is a new Bioconductor R package that implements a method for discretizing and selecting differentially expressed genes based on the application of fuzzy logic. DFP takes advantage of fuzzy membership functions to assign linguistic labels to gene expression levels. The technique builds a reduced set of relevant genes (FP, Fuzzy Pattern) able to summarize and represent each underlying class (pathology). A last step constructs a biased set of genes (DFP, Discriminant Fuzzy Pattern) by intersecting existing fuzzy patterns in order to detect discriminative elements. In addition, the software provides new functions and visualisation tools that summarize achieved results and aid in the interpretation of differentially expressed genes from multiple microarray experiments. DFP integrates with other packages of the Bioconductor project, uses common data structures and is accompanied by ample documentation. It has the advantage that its parameters are highly configurable, facilitating the discovery of biologically relevant connections between sets of genes belonging to different pathologies. This information makes it possible to automatically filter irrelevant genes thereby reducing the large volume of data supplied by microarray experiments. 
Based on these contributions GENECBR, a successful tool for cancer diagnosis using microarray datasets, has recently been released.
Orthopoxvirus Genome Evolution: The Role of Gene Loss
Hendrickson, Robert Curtis; Wang, Chunlin; Hatcher, Eneida L.; Lefkowitz, Elliot J.
2010-01-01
Poxviruses are highly successful pathogens, known to infect a variety of hosts. The family Poxviridae includes Variola virus, the causative agent of smallpox, which has been eradicated as a public health threat but could potentially reemerge as a bioterrorist threat. The risk scenario includes other animal poxviruses and genetically engineered manipulations of poxviruses. Studies of orthologous gene sets have established the evolutionary relationships of members within the Poxviridae family. It is not clear, however, how variations between family members arose in the past, an important issue in understanding how these viruses may vary and possibly produce future threats. Using a newly developed poxvirus-specific tool, we predicted accurate gene sets for viruses with completely sequenced genomes in the genus Orthopoxvirus. Employing sensitive sequence comparison techniques together with comparison of syntenic gene maps, we established the relationships between all viral gene sets. These techniques allowed us to unambiguously identify the gene loss/gain events that have occurred over the course of orthopoxvirus evolution. It is clear that for all existing Orthopoxvirus species, no individual species has acquired protein-coding genes unique to that species. All existing species contain genes that are all present in members of the species Cowpox virus and that cowpox virus strains contain every gene present in any other orthopoxvirus strain. These results support a theory of reductive evolution in which the reduction in size of the core gene set of a putative ancestral virus played a critical role in speciation and confining any newly emerging virus species to a particular environmental (host or tissue) niche. PMID:21994715
Huang, Xiaoyun; Zang, Xiaonan; Wu, Fei; Jin, Yuming; Wang, Haitao; Liu, Chang; Ding, Yating; He, Bangxiang; Xiao, Dongfang; Song, Xinwei; Liu, Zhu
2017-01-01
Gracilariopsis lemaneiformis (aka Gracilaria lemaneiformis) is a red macroalga rich in phycoerythrin, which can capture light efficiently and transfer it to photosystem II. However, little is known about the synthesis of optically active phycoerythrin in G. lemaneiformis at the molecular level. With the advent of high-throughput sequencing technology, analysis of genetic information for G. lemaneiformis by transcriptome sequencing is an effective means to get a deeper insight into the molecular mechanism of phycoerythrin synthesis. Illumina technology was employed to sequence the transcriptome of two strains of G. lemaneiformis: the wild type and a green-pigmented mutant. We obtained a total of 86915 assembled unigenes as a reference gene set, and 42884 unigenes were annotated in at least one public database. Taking the above transcriptome sequencing as a reference gene set, 4041 differentially expressed genes were screened to analyze and compare the gene expression profiles of the wild type and green mutant. By GO and KEGG pathway analysis, we concluded that three factors, including a reduction in the expression level of apo-phycoerythrin, an increase of chlorophyll light-harvesting complex synthesis, and reduction of phycoerythrobilin by competitive inhibition, caused the reduction of optically active phycoerythrin in the green-pigmented mutant.
Random forests-based differential analysis of gene sets for gene expression data.
Hsueh, Huey-Miin; Zhou, Da-Wei; Tsai, Chen-An
2013-04-10
In DNA microarray studies, gene-set analysis (GSA) has become the focus of gene expression data analysis. GSA utilizes the gene expression profiles of functionally related gene sets in Gene Ontology (GO) categories or a priori defined biological classes to assess the significance of gene sets associated with clinical outcomes or phenotypes. Many statistical approaches have been proposed to determine whether such functionally related gene sets express differentially (enrichment and/or deletion) in variations of phenotypes. However, little attention has been given to the discriminatory power of gene sets and classification of patients. In this study, we propose a method of gene set analysis, in which gene sets are used to develop classifications of patients based on the Random Forest (RF) algorithm. The corresponding empirical p-value of an observed out-of-bag (OOB) error rate of the classifier is introduced to identify differentially expressed gene sets using an adequate resampling method. In addition, we discuss the impacts and correlations of genes within each gene set based on the measures of variable importance in the RF algorithm. Significant classifications are reported and visualized together with the underlying gene sets and their contribution to the phenotypes of interest. Numerical studies using both synthesized data and a series of publicly available gene expression data sets are conducted to evaluate the performance of the proposed methods. Compared with other hypothesis testing approaches, our proposed methods are reliable and successful in identifying enriched gene sets and in discovering the contributions of genes within a gene set. The classification results of identified gene sets can provide a valuable alternative to gene set testing to reveal the unknown, biologically relevant classes of samples or patients.
In summary, our proposed method allows one to simultaneously assess the discriminatory ability of gene sets and the importance of genes for interpretation of data in complex biological systems. The classifications of biologically defined gene sets can reveal the underlying interactions of gene sets associated with the phenotypes, and provide an insightful complement to conventional gene set analyses. Copyright © 2012 Elsevier B.V. All rights reserved.
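The resampling idea in this abstract can be sketched as a label-permutation test on a classifier's error rate. This is a minimal illustration, not the authors' implementation: to stay dependency-free it swaps the Random Forest OOB error for the leave-one-out error of a nearest-centroid classifier, and the function names, toy expression values, and permutation count are all assumptions.

```python
import random

def loo_centroid_error(X, y):
    """Leave-one-out error of a nearest-centroid classifier.

    Stand-in (assumption) for the Random Forest OOB error in the abstract.
    """
    errors = 0
    for i in range(len(X)):
        centroids = {}
        for label in sorted(set(y)):
            rows = [X[j] for j in range(len(X)) if j != i and y[j] == label]
            if rows:
                centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        # Predict the class whose centroid is closest in squared distance.
        pred = min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(X[i], centroids[c])),
        )
        errors += pred != y[i]
    return errors / len(X)

def empirical_p_value(X, y, n_perm=200, seed=0):
    """Share of label permutations whose error is at most the observed one."""
    rng = random.Random(seed)
    observed = loo_centroid_error(X, y)
    labels = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if loo_centroid_error(X, labels) <= observed:
            hits += 1
    # The +1 terms keep the estimate strictly positive.
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical gene set of two genes, measured in 12 samples (two phenotypes).
X = [[0.1 * i, 0.2 * i] for i in range(6)] + \
    [[10 + 0.1 * i, 10 + 0.2 * i] for i in range(6)]
y = [0] * 6 + [1] * 6
observed_error, p_value = empirical_p_value(X, y)
print(observed_error, p_value)
```

A small p-value here says the gene set separates the phenotypes better than would be expected under randomly assigned labels, which is the sense in which a gene set is called differentially expressed above.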
Beretta, Lorenzo; Santaniello, Alessandro; van Riel, Piet L C M; Coenen, Marieke J H; Scorza, Raffaella
2010-08-06
Epistasis is recognized as a fundamental part of the genetic architecture of individuals. Several computational approaches have been developed to model gene-gene interactions in case-control studies; however, none of them is suitable for time-dependent analysis. Herein we introduce the Survival Dimensionality Reduction (SDR) algorithm, a non-parametric method specifically designed to detect epistasis in lifetime datasets. The algorithm requires neither specification about the underlying survival distribution nor about the underlying interaction model and proved satisfactorily powerful to detect a set of causative genes in synthetic epistatic lifetime datasets with a limited number of samples and high degree of right-censorship (up to 70%). The SDR method was then applied to a series of 386 Dutch patients with active rheumatoid arthritis that were treated with anti-TNF biological agents. Among a set of 39 candidate genes, none of which showed a detectable marginal effect on anti-TNF responses, the SDR algorithm did find that the rs1801274 SNP in the Fc gamma RIIa gene and the rs10954213 SNP in the IRF5 gene non-linearly interact to predict clinical remission after anti-TNF biologicals. Simulation studies and application in a real-world setting support the capability of the SDR algorithm to model epistatic interactions in candidate-genes studies in presence of right-censored data. http://sourceforge.net/projects/sdrproject/.
Microarray missing data imputation based on a set theoretic framework and biological knowledge.
Gan, Xiangchao; Liew, Alan Wee-Chung; Yan, Hong
2006-01-01
Gene expressions measured using microarrays usually suffer from the missing value problem. However, in many data analysis methods, a complete data matrix is required. Although existing missing value imputation algorithms have shown good performance in dealing with missing values, they also have their limitations. For example, some algorithms have good performance only when strong local correlation exists in data while some provide the best estimate when data is dominated by global structure. In addition, these algorithms do not take into account any biological constraint in their imputation. In this paper, we propose a set theoretic framework based on projection onto convex sets (POCS) for missing data imputation. POCS allows us to incorporate different types of a priori knowledge about missing values into the estimation process. The main idea of POCS is to formulate every piece of prior knowledge into a corresponding convex set and then use a convergence-guaranteed iterative procedure to obtain a solution in the intersection of all these sets. In this work, we design several convex sets, taking into consideration the biological characteristics of the data: the first set mainly exploits the local correlation structure among genes in microarray data, while the second set captures the global correlation structure among arrays. The third set (actually a series of sets) exploits the biological phenomenon of synchronization loss in microarray experiments. In cyclic systems, synchronization loss is a common phenomenon and we construct a series of sets based on this phenomenon for our POCS imputation algorithm. Experiments show that our algorithm can achieve a significant reduction of error compared to the KNNimpute, SVDimpute and LSimpute methods.
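The core POCS iteration described above, cycling through projections onto each convex set until the estimate lands in their intersection, can be sketched generically. The two sets below (a box constraint and a hyperplane) are purely illustrative assumptions standing in for the paper's biologically motivated sets, which the abstract does not specify in enough detail to reproduce.

```python
def project_box(x, lo, hi):
    """Project onto the box {x : lo <= x_i <= hi} by clipping each coordinate."""
    return [min(max(v, lo), hi) for v in x]

def project_hyperplane(x, a, b):
    """Orthogonal projection onto the hyperplane {x : a . x = b}."""
    dot = sum(ai * xi for ai, xi in zip(a, x))
    norm2 = sum(ai * ai for ai in a)
    step = (dot - b) / norm2
    return [xi - step * ai for ai, xi in zip(a, x)]

def pocs(x0, projections, iters=200):
    """Cyclically apply each projection; converges when the sets intersect."""
    x = list(x0)
    for _ in range(iters):
        for project in projections:
            x = project(x)
    return x

# Hypothetical prior knowledge about two missing values:
#   set A: both values lie in [0, 1]
#   set B: the two values sum to 1.5
estimate = pocs(
    [5.0, -3.0],
    [
        lambda x: project_box(x, 0.0, 1.0),
        lambda x: project_hyperplane(x, [1.0, 1.0], 1.5),
    ],
)
print(estimate)
```

Each additional piece of prior knowledge would enter as one more projection function in the list, which is what makes the framework easy to extend.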
Fu, Guifang; Dai, Xiaotian; Symanzik, Jürgen; Bushman, Shaun
2017-01-01
Leaf shape traits have long been a focus of many disciplines, but the complex genetic and environmental interactive mechanisms regulating leaf shape variation have not yet been investigated in detail. The question of the respective roles of genes and environment and how they interact to modulate leaf shape is a thorny evolutionary problem, and sophisticated methodology is needed to address it. In this study, we investigated a framework-level approach that inputs shape image photographs and genetic and environmental data, and then outputs the relative importance ranks of all variables after integrating shape feature extraction, dimension reduction, and tree-based statistical models. The power of the proposed framework was confirmed by simulation and a Populus szechuanica var. tibetica data set. This new methodology resulted in the detection of novel shape characteristics, and also confirmed some previous findings. The quantitative modeling of a combination of polygenetic, plastic, epistatic, and gene-environment interactive effects, as investigated in this study, will improve the discernment of quantitative leaf shape characteristics, and the methods are ready to be applied to other leaf morphology data sets. Unlike the majority of approaches in the quantitative leaf shape literature, this framework-level approach is data-driven, without assuming any pre-known shape attributes, landmarks, or model structures. © 2016 The Authors. New Phytologist © 2016 New Phytologist Trust.
Vivar, Juan C; Pemu, Priscilla; McPherson, Ruth; Ghosh, Sujoy
2013-08-01
Unparalleled technological advances have fueled an explosive growth in the scope and scale of biological data and have propelled life sciences into the realm of "Big Data" that cannot be managed or analyzed by conventional approaches. Big Data in the life sciences are driven primarily via a diverse collection of 'omics'-based technologies, including genomics, proteomics, metabolomics, transcriptomics, metagenomics, and lipidomics. Gene-set enrichment analysis is a powerful approach for interrogating large 'omics' datasets, leading to the identification of biological mechanisms associated with observed outcomes. While several factors influence the results from such analysis, the impact from the contents of pathway databases is often under-appreciated. Pathway databases often contain variously named pathways that overlap with one another to varying degrees. Ignoring such redundancies during pathway analysis can lead to the designation of several pathways as being significant due to high content-similarity, rather than truly independent biological mechanisms. Statistically, such dependencies also result in correlated p values and overdispersion, leading to biased results. We investigated the level of redundancies in multiple pathway databases and observed large discrepancies in the nature and extent of pathway overlap. This prompted us to develop the application, ReCiPa (Redundancy Control in Pathway Databases), to control redundancies in pathway databases based on user-defined thresholds. Analysis of genomic and genetic datasets, using ReCiPa-generated overlap-controlled versions of KEGG and Reactome pathways, led to a reduction in redundancy among the top-scoring gene-sets and allowed for the inclusion of additional gene-sets representing possibly novel biological mechanisms.
Using obesity as an example, bioinformatic analysis further demonstrated that gene-sets identified from overlap-controlled pathway databases show stronger evidence of prior association to obesity compared to pathways identified from the original databases.
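The abstract does not spell out ReCiPa's merging rule, so the following is only an illustrative sketch of threshold-based redundancy control. It assumes that overlap is measured as shared genes divided by the size of the smaller pathway, and that any pair at or above the user-defined threshold is merged into one combined pathway; the pathway names and gene labels are hypothetical.

```python
def overlap(a, b):
    """Shared genes as a fraction of the smaller pathway (assumed measure)."""
    return len(a & b) / min(len(a), len(b))

def merge_redundant(pathways, threshold):
    """Greedily merge pathway pairs whose overlap reaches the threshold.

    Assumes every pathway contains at least one gene.
    """
    merged = {name: set(genes) for name, genes in pathways.items()}
    changed = True
    while changed:
        changed = False
        names = sorted(merged)
        for i, first in enumerate(names):
            for second in names[i + 1:]:
                if overlap(merged[first], merged[second]) >= threshold:
                    combined = merged.pop(first) | merged.pop(second)
                    merged[first + "+" + second] = combined
                    changed = True
                    break
            if changed:
                break  # restart the scan with the updated pathway list
    return merged

# Hypothetical pathway database with one redundant pair.
pathways = {
    "P1": {"geneA", "geneB", "geneC"},
    "P2": {"geneA", "geneB", "geneD"},
    "P3": {"geneX", "geneY"},
}
controlled = merge_redundant(pathways, threshold=0.6)
print(sorted(controlled))
```

Lowering the threshold merges more aggressively, which trades pathway granularity for independence among the gene sets that reach the top of an enrichment ranking.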
GO-based functional dissimilarity of gene sets.
Díaz-Díaz, Norberto; Aguilar-Ruiz, Jesús S
2011-09-01
The Gene Ontology (GO) provides a controlled vocabulary for describing the functions of genes and can be used to evaluate the functional coherence of gene sets. Many functional coherence measures consider each pair of gene functions in a set and produce an output based on all pairwise distances. A single gene can encode multiple proteins that may differ in function. For each functionality, other proteins that exhibit the same activity may also participate. Therefore, identifying the most common function for all of the genes involved in a biological process is important in evaluating the functional similarity of groups of genes, and a quantification of functional coherence can help to clarify the role of a group of genes working together. To implement this approach to functional assessment, we present GFD (GO-based Functional Dissimilarity), a novel dissimilarity measure for evaluating groups of genes based on the most relevant functions of the whole set. The measure assigns a numerical value to the gene set for each of the three GO sub-ontologies. Results show that GFD performs robustly when applied to gene sets of known functionality (extracted from KEGG). It performs particularly well on randomly generated gene sets. An ROC analysis reveals that the performance of GFD in evaluating the functional dissimilarity of gene sets is very satisfactory. A comparative analysis against other functional measures, such as GS2 and those presented by Resnik and Wang, also demonstrates the robustness of GFD.
Dong, Xinran; Hao, Yun; Wang, Xiao; Tian, Weidong
2016-01-01
Pathway or gene set over-representation analysis (ORA) has become a routine task in functional genomics studies. However, currently widely used ORA tools employ statistical methods such as Fisher’s exact test that reduce a pathway into a list of genes, ignoring the constitutive functional non-equivalent roles of genes and the complex gene-gene interactions. Here, we develop a novel method named LEGO (functional Link Enrichment of Gene Ontology or gene sets) that takes into consideration these two types of information by incorporating network-based gene weights in ORA analysis. In three benchmarks, LEGO achieves better performance than Fisher and three other network-based methods. To further evaluate LEGO’s usefulness, we compare LEGO with five gene expression-based and three pathway topology-based methods using a benchmark of 34 disease gene expression datasets compiled by a recent publication, and show that LEGO is among the top-ranked methods in terms of both sensitivity and prioritization for detecting target KEGG pathways. In addition, we develop a cluster-and-filter approach to reduce the redundancy among the enriched gene sets, making the results more interpretable to biologists. Finally, we apply LEGO to two lists of autism genes, and identify relevant gene sets to autism that could not be found by Fisher. PMID:26750448
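For context, the Fisher-style baseline that LEGO is compared against treats a pathway as a plain gene list and asks how surprising the overlap with the input genes is. A minimal sketch of that baseline (not of LEGO itself, whose network-based weights are not described in the abstract) is the upper-tail hypergeometric probability, which is the one-sided Fisher's exact test. The function and argument names are assumptions for illustration.

```python
from math import comb

def ora_p_value(universe_size, set_size, hits_size, overlap):
    """P(X >= overlap) when hits_size genes are drawn from universe_size genes,
    of which set_size belong to the pathway (one-sided Fisher's exact test)."""
    max_overlap = min(set_size, hits_size)
    total = comb(universe_size, hits_size)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, hits_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total

# Toy case: 10 genes in the universe, a 5-gene pathway, 5 input genes,
# all 5 of which fall inside the pathway.
print(ora_p_value(10, 5, 5, 5))
```

Because every gene contributes equally to this count, two pathways with the same overlap get the same p-value regardless of which genes overlap, which is the limitation the abstract points to.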
Dong, Xinran; Hao, Yun; Wang, Xiao; Tian, Weidong
2016-01-11
Pathway or gene set over-representation analysis (ORA) has become a routine task in functional genomics studies. However, currently widely used ORA tools employ statistical methods such as Fisher's exact test that reduce a pathway into a list of genes, ignoring the constitutive functional non-equivalent roles of genes and the complex gene-gene interactions. Here, we develop a novel method named LEGO (functional Link Enrichment of Gene Ontology or gene sets) that takes into consideration these two types of information by incorporating network-based gene weights in ORA analysis. In three benchmarks, LEGO achieves better performance than Fisher and three other network-based methods. To further evaluate LEGO's usefulness, we compare LEGO with five gene expression-based and three pathway topology-based methods using a benchmark of 34 disease gene expression datasets compiled by a recent publication, and show that LEGO is among the top-ranked methods in terms of both sensitivity and prioritization for detecting target KEGG pathways. In addition, we develop a cluster-and-filter approach to reduce the redundancy among the enriched gene sets, making the results more interpretable to biologists. Finally, we apply LEGO to two lists of autism genes, and identify relevant gene sets to autism that could not be found by Fisher.
Sun, Zhengda; Wang, Chih-Yang; Lawson, Devon A; Kwek, Serena; Velozo, Hugo Gonzalez; Owyong, Mark; Lai, Ming-Derg; Fong, Lawrence; Wilson, Mark; Su, Hua; Werb, Zena; Cooke, Daniel L
2018-02-16
Tumor endothelial cells (TEC) play an indispensable role in tumor growth and metastasis although much of the detailed mechanism still remains elusive. In this study we characterized and compared the global gene expression profiles of TECs and control ECs isolated from human breast cancerous tissues and reduction mammoplasty tissues respectively by single cell RNA sequencing (scRNA-seq). Based on the qualified scRNA-seq libraries that we made, we found that 1302 genes were differentially expressed between these two EC phenotypes. Both principal component analysis (PCA) and heat map-based hierarchical clustering separated the cancerous versus control ECs as two distinctive clusters, and MetaCore disease biomarker analysis indicated that these differentially expressed genes are highly correlated with breast neoplasm diseases. Gene Set Enrichment Analysis software (GSEA) enriched these genes to extracellular matrix (ECM) signal pathways and highlighted 127 ECM-associated genes. External validation verified some of these ECM-associated genes are not only generally overexpressed in various cancer tissues but also specifically overexpressed in colorectal cancer ECs and lymphoma ECs. In conclusion, our data demonstrated that ECM-associated genes play pivotal roles in breast cancer EC biology and some of them could serve as potential TEC biomarkers for various cancers.
New insights into old methods for identifying causal rare variants.
Wang, Haitian; Huang, Chien-Hsun; Lo, Shaw-Hwa; Zheng, Tian; Hu, Inchi
2011-11-29
The advance of high-throughput next-generation sequencing technology makes possible the analysis of rare variants. However, the investigation of rare variants in unrelated-individuals data sets faces the challenge of low power, and most methods circumvent the difficulty by using various collapsing procedures based on genes, pathways, or gene clusters. We suggest a new way to identify causal rare variants using the F-statistic and sliced inverse regression. The procedure is tested on the data set provided by the Genetic Analysis Workshop 17 (GAW17). After preliminary data reduction, we ranked markers according to their F-statistic values. Top-ranked markers were then subjected to sliced inverse regression, and those with higher absolute coefficients in the most significant sliced inverse regression direction were selected. The procedure yields good false discovery rates for the GAW17 data and thus is a promising method for future study on rare variants.
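The marker-ranking step can be illustrated with a one-way ANOVA F-statistic. This is a sketch under assumptions: the abstract does not say exactly how the F-statistic was set up, so here each marker's trait values are grouped by genotype, and the marker names and values are made up. The sliced inverse regression step is not shown.

```python
def f_statistic(groups):
    """One-way ANOVA F: between-group over within-group mean squares.

    Assumes at least two groups and non-zero within-group variance.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    return (between / (k - 1)) / (within / (n - k))

def rank_markers(marker_groups):
    """Return marker names sorted from highest to lowest F-statistic."""
    return sorted(marker_groups, key=lambda m: f_statistic(marker_groups[m]),
                  reverse=True)

# Hypothetical trait values, grouped by genotype at each marker.
marker_groups = {
    "marker1": [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]],
    "marker2": [[1.0, 2.0, 3.0], [7.0, 8.0, 9.0]],
}
ranked = rank_markers(marker_groups)
print(ranked)
```

Only the top of this ranking would be passed on to the next stage, which keeps the second, more expensive step to a manageable number of markers.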
2010-01-01
Background Epistasis is recognized as a fundamental part of the genetic architecture of individuals. Several computational approaches have been developed to model gene-gene interactions in case-control studies; however, none of them is suitable for time-dependent analysis. Herein we introduce the Survival Dimensionality Reduction (SDR) algorithm, a non-parametric method specifically designed to detect epistasis in lifetime datasets. Results The algorithm requires neither specification about the underlying survival distribution nor about the underlying interaction model and proved satisfactorily powerful to detect a set of causative genes in synthetic epistatic lifetime datasets with a limited number of samples and high degree of right-censorship (up to 70%). The SDR method was then applied to a series of 386 Dutch patients with active rheumatoid arthritis that were treated with anti-TNF biological agents. Among a set of 39 candidate genes, none of which showed a detectable marginal effect on anti-TNF responses, the SDR algorithm did find that the rs1801274 SNP in the FcγRIIa gene and the rs10954213 SNP in the IRF5 gene non-linearly interact to predict clinical remission after anti-TNF biologicals. Conclusions Simulation studies and application in a real-world setting support the capability of the SDR algorithm to model epistatic interactions in candidate-genes studies in presence of right-censored data. Availability: http://sourceforge.net/projects/sdrproject/ PMID:20691091
Thorup, Casper; Schramm, Andreas; Findlay, Alyssa J; Finster, Kai W; Schreiber, Lars
2017-07-18
This study demonstrates that the deltaproteobacterium Desulfurivibrio alkaliphilus can grow chemolithotrophically by coupling sulfide oxidation to the dissimilatory reduction of nitrate and nitrite to ammonium. Key genes of known sulfide oxidation pathways are absent from the genome of D. alkaliphilus. Instead, the genome contains all of the genes necessary for sulfate reduction, including a gene for a reductive-type dissimilatory bisulfite reductase (DSR). Despite this, growth by sulfate reduction was not observed. Transcriptomic analysis revealed a very high expression level of sulfate-reduction genes during growth by sulfide oxidation, while inhibition experiments with molybdate pointed to elemental sulfur/polysulfides as intermediates. Consequently, we propose that D. alkaliphilus initially oxidizes sulfide to elemental sulfur, which is then either disproportionated, or oxidized by a reversal of the sulfate reduction pathway. This is the first study providing evidence that a reductive-type DSR is involved in a sulfide oxidation pathway. Transcriptome sequencing further suggests that nitrate reduction to ammonium is performed by a novel type of periplasmic nitrate reductase and an unusual membrane-anchored nitrite reductase. IMPORTANCE Sulfide oxidation and sulfate reduction, the two major branches of the sulfur cycle, are usually ascribed to distinct sets of microbes with distinct diagnostic genes. Here we show a more complex picture, as D. alkaliphilus, with the genomic setup of a sulfate reducer, grows by sulfide oxidation. The high expression of genes typically involved in the sulfate reduction pathway suggests that these genes, including the reductive-type dissimilatory bisulfite reductases, are also involved in as-yet-unresolved sulfide oxidation pathways. Finally, D. alkaliphilus is closely related to cable bacteria, which grow by electrogenic sulfide oxidation. Since there are no pure cultures of cable bacteria, D. alkaliphilus may represent an exciting model organism in which to study the physiology of this process. Copyright © 2017 Thorup et al.
Zhang, Bing; Schmoyer, Denise; Kirov, Stefan; Snoddy, Jay
2004-01-01
Background Microarray and other high-throughput technologies are producing large sets of interesting genes that are difficult to analyze directly. Bioinformatics tools are needed to interpret the functional information in the gene sets. Results We have created a web-based tool for data analysis and data visualization for sets of genes called GOTree Machine (GOTM). This tool was originally intended to analyze sets of co-regulated genes identified from microarray analysis but is adaptable for use with other gene sets from other high-throughput analyses. GOTree Machine generates a GOTree, a tree-like structure to navigate the Gene Ontology Directed Acyclic Graph for input gene sets. This system provides user friendly data navigation and visualization. Statistical analysis helps users to identify the most important Gene Ontology categories for the input gene sets and suggests biological areas that warrant further study. GOTree Machine is available online at . Conclusion GOTree Machine has a broad application in functional genomic, proteomic and other high-throughput methods that generate large sets of interesting genes; its primary purpose is to help users sort for interesting patterns in gene sets. PMID:14975175
A Morpholino-based screen to identify novel genes involved in craniofacial morphogenesis
Melvin, Vida Senkus; Feng, Weiguo; Hernandez-Lagunas, Laura; Artinger, Kristin Bruk; Williams, Trevor
2014-01-01
BACKGROUND The regulatory mechanisms underpinning facial development are conserved between diverse species. Therefore, results from model systems provide insight into the genetic causes of human craniofacial defects. Previously, we generated a comprehensive dataset examining gene expression during development and fusion of the mouse facial prominences. Here, we used this resource to identify genes that have dynamic expression patterns in the facial prominences, but for which only limited information exists concerning developmental function. RESULTS This set of ~80 genes was used for a high throughput functional analysis in the zebrafish system using Morpholino gene knockdown technology. This screen revealed three classes of cranial cartilage phenotypes depending upon whether knockdown of the gene affected the neurocranium, viscerocranium, or both. The targeted genes that produced consistent phenotypes encoded proteins linked to transcription (meis1, meis2a, tshz2, vgll4l), signaling (pkdcc, vlk, macc1, wu:fb16h09), and extracellular matrix function (smoc2). The majority of these phenotypes were not altered by reduction of p53 levels, demonstrating that both p53 dependent and independent mechanisms were involved in the craniofacial abnormalities. CONCLUSIONS This Morpholino-based screen highlights new genes involved in development of the zebrafish craniofacial skeleton with wider relevance to formation of the face in other species, particularly mouse and human. PMID:23559552
Comparative study on gene set and pathway topology-based enrichment methods.
Bayerlová, Michaela; Jung, Klaus; Kramer, Frank; Klemm, Florian; Bleckmann, Annalen; Beißbarth, Tim
2015-10-22
Enrichment analysis is a popular approach to identify pathways or sets of genes which are significantly enriched in the context of differentially expressed genes. The traditional gene set enrichment approach considers a pathway as a simple gene list disregarding any knowledge of gene or protein interactions. In contrast, the new group of so called pathway topology-based methods integrates the topological structure of a pathway into the analysis. We comparatively investigated gene set and pathway topology-based enrichment approaches, considering three gene set and four topological methods. These methods were compared in two extensive simulation studies and on a benchmark of 36 real datasets, providing the same pathway input data for all methods. In the benchmark data analysis both types of methods showed a comparable ability to detect enriched pathways. The first simulation study was conducted with KEGG pathways, which showed considerable gene overlaps between each other. In this study with original KEGG pathways, none of the topology-based methods outperformed the gene set approach. Therefore, a second simulation study was performed on non-overlapping pathways created by unique gene IDs. Here, methods accounting for pathway topology reached higher accuracy than the gene set methods, however their sensitivity was lower. We conducted one of the first comprehensive comparative works on evaluating gene set against pathway topology-based enrichment methods. The topological methods showed better performance in the simulation scenarios with non-overlapping pathways, however, they were not conclusively better in the other scenarios. This suggests that simple gene set approach might be sufficient to detect an enriched pathway under realistic circumstances. Nevertheless, more extensive studies and further benchmark data are needed to systematically evaluate these methods and to assess what gain and cost pathway topology information introduces into enrichment analysis. 
Both types of methods for enrichment analysis require further improvements in order to deal with the problem of pathway overlaps.
Schelkunov, Mikhail I.; Shtratnikova, Viktoria Yu; Nuraliev, Maxim S.; Selosse, Marc-Andre; Penin, Aleksey A.; Logacheva, Maria D.
2015-01-01
The question of the patterns and limits of reduction of plastid genomes in nonphotosynthetic plants and the reasons for their conservation is one of the intriguing topics in plant genome evolution. Here, we report sequencing and analysis of the plastid genomes of the nonphotosynthetic orchids Epipogium aphyllum and Epipogium roseum, which, with sizes of 31 and 19 kbp, respectively, represent the smallest plastid genomes characterized to date. Besides drastic reduction, which is expected, we found several unusual features of these “minimal” plastomes: multiple rearrangements, highly biased nucleotide composition, and unprecedentedly high substitution rate. Only 27 and 29 genes remained intact in the plastomes of E. aphyllum and E. roseum, namely those encoding ribosomal components, transfer RNAs, and three additional housekeeping genes (infA, clpP, and accD). We found no signs of relaxed selection acting on these genes. We hypothesize that the main reason for retention of plastid genomes in Epipogium is the necessity to translate messenger RNAs (mRNAs) of accD and/or clpP proteins which are essential for cell metabolism. However, these genes are absent in plastomes of several plant species; their absence is compensated by the presence of a functional copy arisen by gene transfer from plastid to the nuclear genome. This suggests that there is no single set of plastid-encoded essential genes, but rather different sets for different species and that the retention of a gene in the plastome depends on the interaction between the nucleus and plastids. PMID:25635040
Orellana, Luis H.; Rodriguez-R, Luis M.; Konstantinidis, Konstantinos T.
2016-10-07
Functional annotation of metagenomic and metatranscriptomic data sets relies on similarity searches based on e-value thresholds resulting in an unknown number of false positive and negative matches. To overcome these limitations, we introduce ROCker, aimed at identifying position-specific, most-discriminant thresholds in sliding windows along the sequence of a target protein, accounting for non-discriminative domains shared by unrelated proteins. ROCker employs the receiver operating characteristic (ROC) curve to minimize false discovery rate (FDR) and calculate the best thresholds based on how simulated shotgun metagenomic reads of known composition map onto well-curated reference protein sequences and thus, differs from HMM profiles and related methods. We showcase ROCker using ammonia monooxygenase (amoA) and nitrous oxide reductase (nosZ) genes, mediating oxidation of ammonia and the reduction of the potent greenhouse gas, N2O, to inert N2, respectively. ROCker typically showed 60-fold lower FDR when compared to the common practice of using fixed e-values. Previously uncounted ‘atypical’ nosZ genes were found to be two times more abundant, on average, than their typical counterparts in most soil metagenomes and the abundance of bacterial amoA was quantified against the highly-related particulate methane monooxygenase (pmoA). Therefore, ROCker can reliably detect and quantify target genes in short-read metagenomes.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Orellana, Luis H.; Rodriguez-R, Luis M.; Konstantinidis, Konstantinos T.
Functional annotation of metagenomic and metatranscriptomic data sets relies on similarity searches based on e-value thresholds resulting in an unknown number of false positive and negative matches. To overcome these limitations, we introduce ROCker, aimed at identifying position-specific, most-discriminant thresholds in sliding windows along the sequence of a target protein, accounting for non-discriminative domains shared by unrelated proteins. ROCker employs the receiver operating characteristic (ROC) curve to minimize false discovery rate (FDR) and calculate the best thresholds based on how simulated shotgun metagenomic reads of known composition map onto well-curated reference protein sequences and thus, differs from HMM profiles and related methods. We showcase ROCker using ammonia monooxygenase (amoA) and nitrous oxide reductase (nosZ) genes, mediating oxidation of ammonia and the reduction of the potent greenhouse gas, N2O, to inert N2, respectively. ROCker typically showed 60-fold lower FDR when compared to the common practice of using fixed e-values. Previously uncounted ‘atypical’ nosZ genes were found to be two times more abundant, on average, than their typical counterparts in most soil metagenomes and the abundance of bacterial amoA was quantified against the highly-related particulate methane monooxygenase (pmoA). Therefore, ROCker can reliably detect and quantify target genes in short-read metagenomes.
2017-01-01
Functional annotation of metagenomic and metatranscriptomic data sets relies on similarity searches based on e-value thresholds resulting in an unknown number of false positive and negative matches. To overcome these limitations, we introduce ROCker, aimed at identifying position-specific, most-discriminant thresholds in sliding windows along the sequence of a target protein, accounting for non-discriminative domains shared by unrelated proteins. ROCker employs the receiver operating characteristic (ROC) curve to minimize false discovery rate (FDR) and calculate the best thresholds based on how simulated shotgun metagenomic reads of known composition map onto well-curated reference protein sequences and thus, differs from HMM profiles and related methods. We showcase ROCker using ammonia monooxygenase (amoA) and nitrous oxide reductase (nosZ) genes, mediating oxidation of ammonia and the reduction of the potent greenhouse gas, N2O, to inert N2, respectively. ROCker typically showed 60-fold lower FDR when compared to the common practice of using fixed e-values. Previously uncounted ‘atypical’ nosZ genes were found to be two times more abundant, on average, than their typical counterparts in most soil metagenomes and the abundance of bacterial amoA was quantified against the highly-related particulate methane monooxygenase (pmoA). Therefore, ROCker can reliably detect and quantify target genes in short-read metagenomes. PMID:28180325
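As a rough illustration of the thresholding idea in this abstract, the sketch below picks, for a single alignment window, the bit-score cutoff that maximizes the gap between true positive rate and false positive rate (Youden's J on the ROC curve). It is not ROCker's implementation: the function name, the toy scores, and the choice of Youden's J as the selection criterion are all assumptions made for illustration.

```python
def best_threshold(scores, labels):
    """Return (cutoff, J) maximizing TPR - FPR for one window.

    scores: bit scores of simulated reads aligned in this window (hypothetical data)
    labels: True if the read truly derives from the target gene, else False
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need both target and non-target reads")
    best = (None, -1.0)
    for cutoff in sorted(set(scores)):
        # A read is called a hit when its score is at or above the cutoff.
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and not y)
        j = tp / positives - fp / negatives
        if j > best[1]:
            best = (cutoff, j)
    return best


# Toy window: most target reads score high, reads from a related protein score lower.
window_scores = [62.0, 58.5, 55.0, 41.0, 39.5, 30.0]
window_labels = [True, True, True, False, True, False]
threshold, youden_j = best_threshold(window_scores, window_labels)
```

Repeating this per window along the reference protein would give a position-specific profile of cutoffs, in contrast to a single fixed e-value for the whole sequence.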
Inference of combinatorial Boolean rules of synergistic gene sets from cancer microarray datasets.
Park, Inho; Lee, Kwang H; Lee, Doheon
2010-06-15
Gene set analysis has become an important tool for the functional interpretation of high-throughput gene expression datasets. Moreover, pattern analyses based on inferred gene set activities of individual samples have shown the ability to identify more robust disease signatures than individual gene-based pattern analyses. Although a number of approaches have been proposed for gene set-based pattern analysis, the combinatorial influence of deregulated gene sets on disease phenotype classification has not been studied sufficiently. We propose a new approach for inferring combinatorial Boolean rules of gene sets for a better understanding of cancer transcriptome and cancer classification. To reduce the search space of the possible Boolean rules, we identify small groups of gene sets that synergistically contribute to the classification of samples into their corresponding phenotypic groups (such as normal and cancer). We then measure the significance of the candidate Boolean rules derived from each group of gene sets; the level of significance is based on the class entropy of the samples selected in accordance with the rules. By applying the present approach to publicly available prostate cancer datasets, we identified 72 significant Boolean rules. Finally, we discuss several identified Boolean rules, such as the rule of glutathione metabolism (down) and prostaglandin synthesis regulation (down), which are consistent with known prostate cancer biology. Scripts written in Python and R are available at http://biosoft.kaist.ac.kr/~ihpark/. The refined gene sets and the full list of the identified Boolean rules are provided in the Supplementary Material. Supplementary data are available at Bioinformatics online.
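To make the entropy-based scoring concrete, here is a minimal sketch of how the class entropy of the samples selected by a Boolean rule could be computed. The discretized "up"/"down" activity states, the sample records, and the function names are hypothetical; the authors' own scripts may define the rules and their significance differently.

```python
import math
from collections import Counter


def class_entropy(labels):
    """Shannon entropy (bits) of the class labels; 0.0 means a pure group."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def rule_entropy(samples, rule):
    """Entropy of the classes among the samples for which `rule` is true."""
    selected = [s["label"] for s in samples if rule(s)]
    if not selected:
        return None  # the rule selects no samples
    return class_entropy(selected)


# Hypothetical discretized gene set activities per sample.
samples = [
    {"glutathione": "down", "prostaglandin": "down", "label": "cancer"},
    {"glutathione": "down", "prostaglandin": "down", "label": "cancer"},
    {"glutathione": "down", "prostaglandin": "up", "label": "normal"},
    {"glutathione": "up", "prostaglandin": "down", "label": "normal"},
    {"glutathione": "up", "prostaglandin": "up", "label": "normal"},
]


def both_down(s):
    # Rule: glutathione metabolism (down) AND prostaglandin synthesis regulation (down).
    return s["glutathione"] == "down" and s["prostaglandin"] == "down"


def glutathione_down(s):
    # Single gene set rule, for comparison with the combined rule.
    return s["glutathione"] == "down"


combined = rule_entropy(samples, both_down)       # selects only cancer samples
single = rule_entropy(samples, glutathione_down)  # selects a mixed group
```

In this toy data the combined rule selects a purer group (lower entropy) than either gene set alone, which is the kind of synergy the abstract describes.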
Smoking-related microRNAs and mRNAs in human peripheral blood mononuclear cells
DOE Office of Scientific and Technical Information (OSTI.GOV)
Su, Ming-Wei
Teenager smoking is of great importance in public health. Functional roles of microRNAs have been documented in smoke-induced gene expression changes, but comprehensive mechanisms of microRNA-mRNA regulation and benefits remained poorly understood. We conducted the Teenager Smoking Reduction Trial (TSRT) to investigate the causal association between active smoking reduction and whole-genome microRNA and mRNA expression changes in human peripheral blood mononuclear cells (PBMC). A total of 12 teenagers with a substantial reduction in smoke quantity and a decrease in urine cotinine/creatinine ratio were enrolled in genomic analyses. In Gene Set Enrichment Analysis (GSEA) and Ingenuity Pathway Analysis (IPA), differentially expressed genes altered by smoke reduction were mainly associated with glucocorticoid receptor signaling pathway. The integrative analysis of microRNA and mRNA found eleven differentially expressed microRNAs negatively correlated with predicted target genes. CD83 molecule regulated by miR-4498 in human PBMC, was critical for the canonical pathway of communication between innate and adaptive immune cells. Our data demonstrated that microRNAs could regulate immune responses in human PBMC after habitual smokers quit smoking and support the potential translational value of microRNAs in regulating disease-relevant gene expression caused by tobacco smoke. Highlights: • We conducted a smoke reduction trial program and investigated the causal relationship between smoke and gene regulation. • MicroRNA and mRNA expression changes were examined in human PBMC. • MicroRNAs are important in regulating disease-causal genes after tobacco smoke reduction.
Kong, Xiang-Zhen; Liu, Jin-Xing; Zheng, Chun-Hou; Hou, Mi-Xiao; Wang, Juan
2017-07-01
High dimensionality has become a typical feature of biomolecular data. In this paper, a novel dimension reduction method named p-norm singular value decomposition (PSVD) is proposed to seek a low-rank approximation matrix for biomolecular data. To enhance robustness to outliers, the Lp-norm is taken as the error function and the Schatten p-norm is used as the regularization function in the optimization model. To evaluate the performance of PSVD, the K-means clustering method is then employed for tumor clustering based on the low-rank approximation matrix. Extensive experiments are carried out on five gene expression data sets, including two benchmark data sets and three higher dimensional data sets from The Cancer Genome Atlas. The experimental results demonstrate that the PSVD-based method outperforms many existing methods. In particular, the experiments show that the proposed method is more efficient for processing higher dimensional data, with good robustness, stability, and superior time performance.
Salnikova, Lyubov E; Smelaya, Tamara V; Golubev, Arkadiy M; Rubanovich, Alexander V; Moroz, Viktor V
2013-11-01
This study was conducted to establish the possible contribution of functional gene polymorphisms in detoxification/oxidative stress and vascular remodeling pathways to community-acquired pneumonia (CAP) susceptibility in the case-control study (350 CAP patients, 432 control subjects) and to predisposition to the development of CAP complications in the prospective study. All subjects were genotyped for 16 polymorphic variants in the 14 genes of xenobiotics detoxification CYP1A1, AhR, GSTM1, GSTT1, ABCB1, redox-status SOD2, CAT, GCLC, and vascular homeostasis ACE, AGT, AGTR1, NOS3, MTHFR, VEGFα. Risk of pulmonary complications (PC) in the single locus analysis was associated with CYP1A1, GCLC and AGTR1 genes. Extra PC (toxic shock syndrome and myocarditis) were not associated with these genes. We evaluated gene-gene interactions using multi-factor dimensionality reduction, and cumulative gene risk score approaches. The final model which included >5 risk alleles in the CYP1A1 (rs2606345, rs4646903, rs1048943), GCLC, AGT, and AGTR1 genes was associated with pleuritis, empyema, acute respiratory distress syndrome, all PC and acute respiratory failure (ARF). We considered CYP1A1, GCLC, AGT, AGTR1 gene set using Set Distiller mode implemented in GeneDecks for discovering gene-set relations via the degree of sharing descriptors within a given gene set. N-acetylcysteine and oxygen were defined by Set Distiller as the best descriptors for the gene set associated in the present study with PC and ARF. Results of the study are in line with literature data and suggest that genetically determined oxidative stress exacerbation may contribute to the progression of lung inflammation.
Schelkunov, Mikhail I; Shtratnikova, Viktoria Yu; Nuraliev, Maxim S; Selosse, Marc-Andre; Penin, Aleksey A; Logacheva, Maria D
2015-01-28
The question of the patterns and limits of reduction of plastid genomes in nonphotosynthetic plants and the reasons for their conservation is one of the intriguing topics in plant genome evolution. Here, we report sequencing and analysis of the plastid genomes of the nonphotosynthetic orchids Epipogium aphyllum and Epipogium roseum, which, with sizes of 31 and 19 kbp, respectively, represent the smallest plastid genomes characterized to date. Besides drastic reduction, which is expected, we found several unusual features of these "minimal" plastomes: multiple rearrangements, highly biased nucleotide composition, and unprecedentedly high substitution rate. Only 27 and 29 genes remained intact in the plastomes of E. aphyllum and E. roseum, namely those encoding ribosomal components, transfer RNAs, and three additional housekeeping genes (infA, clpP, and accD). We found no signs of relaxed selection acting on these genes. We hypothesize that the main reason for retention of plastid genomes in Epipogium is the necessity to translate messenger RNAs (mRNAs) of accD and/or clpP proteins which are essential for cell metabolism. However, these genes are absent in plastomes of several plant species; their absence is compensated by the presence of a functional copy arisen by gene transfer from plastid to the nuclear genome. This suggests that there is no single set of plastid-encoded essential genes, but rather different sets for different species and that the retention of a gene in the plastome depends on the interaction between the nucleus and plastids. © The Author(s) 2015. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
Analysis of genetic association using hierarchical clustering and cluster validation indices.
Pagnuco, Inti A; Pastore, Juan I; Abras, Guillermo; Brun, Marcel; Ballarin, Virginia L
2017-10-01
It is usually assumed that co-expressed genes suggest co-regulation in the underlying regulatory network. Determining sets of co-expressed genes is an important task, based on some criteria of similarity. This task is usually performed by clustering algorithms, where the genes are clustered into meaningful groups based on their expression values in a set of experiments. In this work, we propose a method to find sets of co-expressed genes, based on cluster validation indices as a measure of similarity for individual gene groups, and a combination of variants of hierarchical clustering to generate the candidate groups. We evaluated its ability to retrieve significant sets on simulated correlated and real genomics data, where the performance is measured based on its detection ability of co-regulated sets against a full search. Additionally, we analyzed the quality of the best ranked groups using an online bioinformatics tool that provides network information for the selected genes. Copyright © 2017 Elsevier Inc. All rights reserved.
Koster, Roelof; Mitra, Nandita; D'Andrea, Kurt; Vardhanabhuti, Saran; Chung, Charles C; Wang, Zhaoming; Loren Erickson, R; Vaughn, David J; Litchfield, Kevin; Rahman, Nazneen; Greene, Mark H; McGlynn, Katherine A; Turnbull, Clare; Chanock, Stephen J; Nathanson, Katherine L; Kanetsky, Peter A
2014-11-15
Genome-wide association (GWA) studies of testicular germ cell tumor (TGCT) have identified 18 susceptibility loci, some containing genes encoding proteins important in male germ cell development. Deletions of one of these genes, DMRT1, lead to male-to-female sex reversal and are associated with development of gonadoblastoma. To further explore genetic association with TGCT, we undertook a pathway-based analysis of SNP marker associations in the Penn GWAs (349 TGCT cases and 919 controls). We analyzed a custom-built sex determination gene set consisting of 32 genes using three different methods of pathway-based analysis. The sex determination gene set ranked highly compared with canonical gene sets, and it was associated with TGCT (FDRG = 2.28 × 10^-5, FDRM = 0.014 and FDRI = 0.008 for Gene Set Analysis-SNP (GSA-SNP), Meta-Analysis Gene Set Enrichment of Variant Associations (MAGENTA) and Improved Gene Set Enrichment Analysis for Genome-wide Association Study (i-GSEA4GWAS) analysis, respectively). The association remained after removal of DMRT1 from the gene set (FDRG = 0.0002, FDRM = 0.055 and FDRI = 0.009). Using data from the NCI GWA scan (582 TGCT cases and 1056 controls) and UK scan (986 TGCT cases and 4946 controls), we replicated these findings (NCI: FDRG = 0.006, FDRM = 0.014, FDRI = 0.033, and UK: FDRG = 1.04 × 10^-6, FDRM = 0.016, FDRI = 0.025). After removal of DMRT1 from the gene set, the sex determination gene set remains associated with TGCT in the NCI (FDRG = 0.039, FDRM = 0.050 and FDRI = 0.055) and UK scans (FDRG = 3.00 × 10^-5, FDRM = 0.056 and FDRI = 0.044). With the exception of DMRT1, genes in the sex determination gene set have not previously been identified as TGCT susceptibility loci in these GWA scans, demonstrating the complementary nature of a pathway-based approach for genome-wide analysis of TGCT. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
Spectral gene set enrichment (SGSE).
Frost, H Robert; Li, Zhigang; Moore, Jason H
2015-03-03
Gene set testing is typically performed in a supervised context to quantify the association between groups of genes and a clinical phenotype. In many cases, however, a gene set-based interpretation of genomic data is desired in the absence of a phenotype variable. Although methods exist for unsupervised gene set testing, they predominantly compute enrichment relative to clusters of the genomic variables with performance strongly dependent on the clustering algorithm and number of clusters. We propose a novel method, spectral gene set enrichment (SGSE), for unsupervised competitive testing of the association between gene sets and empirical data sources. SGSE first computes the statistical association between gene sets and principal components (PCs) using our principal component gene set enrichment (PCGSE) method. The overall statistical association between each gene set and the spectral structure of the data is then computed by combining the PC-level p-values using the weighted Z-method with weights set to the PC variance scaled by Tracy-Widom test p-values. Using simulated data, we show that the SGSE algorithm can accurately recover spectral features from noisy data. To illustrate the utility of our method on real data, we demonstrate the superior performance of the SGSE method relative to standard cluster-based techniques for testing the association between MSigDB gene sets and the variance structure of microarray gene expression data. Unsupervised gene set testing can provide important information about the biological signal held in high-dimensional genomic data sets. Because it uses the association between gene sets and sample PCs to generate a measure of unsupervised enrichment, the SGSE method is independent of cluster or network creation algorithms and, most importantly, is able to utilize the statistical significance of PC eigenvalues to ignore elements of the data most likely to represent noise.
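The combination step named in this abstract, the weighted Z-method (a weighted form of Stouffer's method), can be sketched in a few lines. This is an illustrative sketch only: the weights are passed in as plain numbers, and how SGSE derives them from PC variances and Tracy-Widom p-values is not reproduced here. The function name and the example values are assumptions.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def weighted_z_combine(p_values, weights):
    """Combine one-sided p-values with the weighted Z (Stouffer) method.

    Each p-value must lie strictly between 0 and 1, otherwise the
    inverse normal CDF is undefined and NormalDist.inv_cdf raises an error.
    """
    if not p_values or len(p_values) != len(weights):
        raise ValueError("p_values and weights must be non-empty and the same length")
    denominator = math.sqrt(sum(w * w for w in weights))
    if denominator == 0.0:
        raise ValueError("at least one weight must be non-zero")
    # Convert each p-value to a Z score, then take the weighted sum.
    z_scores = [_STD_NORMAL.inv_cdf(1.0 - p) for p in p_values]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    return 1.0 - _STD_NORMAL.cdf(numerator / denominator)


# Hypothetical PC-level p-values for one gene set, with hypothetical PC weights.
pc_p_values = [0.001, 0.9]
favoring_pc1 = weighted_z_combine(pc_p_values, [0.9, 0.1])
favoring_pc2 = weighted_z_combine(pc_p_values, [0.1, 0.9])
```

With most of the weight on the first PC, the strong PC-level association dominates the combined p-value; with the weights reversed, the same evidence is largely discounted, which is how low-weight (likely noise) components are kept from driving the result.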
Structure and Evolution of Chlorate Reduction Composite Transposons
Clark, Iain C.; Melnyk, Ryan A.; Engelbrektson, Anna; Coates, John D.
2013-01-01
The genes for chlorate reduction in six bacterial strains were analyzed in order to gain insight into the metabolism. A newly isolated chlorate-reducing bacterium (Shewanella algae ACDC) and three previously isolated strains (Ideonella dechloratans, Pseudomonas sp. strain PK, and Dechloromarinus chlorophilus NSS) were genome sequenced and compared to published sequences (Alicycliphilus denitrificans BC plasmid pALIDE01 and Pseudomonas chloritidismutans AW-1). De novo assembly of genomes failed to join regions adjacent to genes involved in chlorate reduction, suggesting the presence of repeat regions. Using a bioinformatics approach and finishing PCRs to connect fragmented contigs, we discovered that chlorate reduction genes are flanked by insertion sequences, forming composite transposons in all four newly sequenced strains. These insertion sequences delineate regions with the potential to move horizontally and define a set of genes that may be important for chlorate reduction. In addition to core metabolic components, we have highlighted several such genes through comparative analysis and visualization. Phylogenetic analysis places chlorate reductase within a functionally diverse clade of type II dimethyl sulfoxide (DMSO) reductases, part of a larger family of enzymes with reactivity toward chlorate. Nucleotide-level forensics of regions surrounding chlorite dismutase (cld), as well as its phylogenetic clustering in a betaproteobacterial Cld clade, indicate that cld has been mobilized at least once from a perchlorate reducer to build chlorate respiration. PMID:23919996
Evolutionary interrogation of human biology in well-annotated genomic framework of rhesus macaque.
Zhang, Shi-Jian; Liu, Chu-Jun; Yu, Peng; Zhong, Xiaoming; Chen, Jia-Yu; Yang, Xinzhuang; Peng, Jiguang; Yan, Shouyu; Wang, Chenqu; Zhu, Xiaotong; Xiong, Jingwei; Zhang, Yong E; Tan, Bertrand Chin-Ming; Li, Chuan-Yun
2014-05-01
With genome sequence and composition highly analogous to human, rhesus macaque represents a unique reference for evolutionary studies of human biology. Here, we developed a comprehensive genomic framework of rhesus macaque, the RhesusBase2, for evolutionary interrogation of human genes and the associated regulations. A total of 1,667 next-generation sequencing (NGS) data sets were processed, integrated, and evaluated, generating 51.2 million new functional annotation records. With extensive NGS annotations, RhesusBase2 refined the fine-scale structures in 30% of the macaque Ensembl transcripts, reporting an accurate, up-to-date set of macaque gene models. On the basis of these annotations and accurate macaque gene models, we further developed an NGS-oriented Molecular Evolution Gateway to access and visualize macaque annotations in reference to human orthologous genes and associated regulations (www.rhesusbase.org/molEvo). We highlighted the application of this well-annotated genomic framework in generating hypothetical link of human-biased regulations to human-specific traits, by using mechanistic characterization of the DIEXF gene as an example that provides novel clues to the understanding of digestive system reduction in human evolution. On a global scale, we also identified a catalog of 9,295 human-biased regulatory events, which may represent novel elements that have a substantial impact on shaping human transcriptome and possibly underpin recent human phenotypic evolution. Taken together, we provide an NGS data-driven, information-rich framework that will broadly benefit genomics research in general and serves as an important resource for in-depth evolutionary studies of human biology.
Tissue Non-Specific Genes and Pathways Associated with Diabetes: An Expression Meta-Analysis.
Mei, Hao; Li, Lianna; Liu, Shijian; Jiang, Fan; Griswold, Michael; Mosley, Thomas
2017-01-21
We performed expression studies to identify tissue non-specific genes and pathways of diabetes by meta-analysis. We searched curated datasets of the Gene Expression Omnibus (GEO) database and identified 13 and five expression studies of diabetes and insulin responses at various tissues, respectively. We tested differential gene expression by empirical Bayes-based linear method and investigated gene set expression association by knowledge-based enrichment analysis. Meta-analysis by different methods was applied to identify tissue non-specific genes and gene sets. We also proposed pathway mapping analysis to infer functions of the identified gene sets, and correlation and independent analysis to evaluate expression association profile of genes and gene sets between studies and tissues. Our analysis showed that PGRMC1 and HADH genes were significant over diabetes studies, while IRS1 and MPST genes were significant over insulin response studies, and joint analysis showed that HADH and MPST genes were significant over all combined data sets. The pathway analysis identified six significant gene sets over all studies. The KEGG pathway mapping indicated that the significant gene sets are related to diabetes pathogenesis. The results also presented that 12.8% and 59.0% pairwise studies had significantly correlated expression association for genes and gene sets, respectively; moreover, 12.8% pairwise studies had independent expression association for genes, but no studies were observed significantly different for expression association of gene sets. Our analysis indicated that there are both tissue specific and non-specific genes and pathways associated with diabetes pathogenesis. Compared to the gene expression, pathway association tends to be tissue non-specific, and a common pathway influencing diabetes development is activated through different genes at different tissues.
Blatti, Charles; Sinha, Saurabh
2016-07-15
Analysis of co-expressed gene sets typically involves testing for enrichment of different annotations or 'properties' such as biological processes, pathways, transcription factor binding sites, etc., one property at a time. This common approach ignores any known relationships among the properties or the genes themselves. It is believed that known biological relationships among genes and their many properties may be exploited to more accurately reveal commonalities of a gene set. Previous work has sought to achieve this by building biological networks that combine multiple types of gene-gene or gene-property relationships, and performing network analysis to identify other genes and properties most relevant to a given gene set. Most existing network-based approaches for recognizing genes or annotations relevant to a given gene set collapse information about different properties to simplify (homogenize) the networks. We present a network-based method for ranking genes or properties related to a given gene set. Such related genes or properties are identified from among the nodes of a large, heterogeneous network of biological information. Our method involves a random walk with restarts, performed on an initial network with multiple node and edge types that preserve more of the original, specific property information than current methods that operate on homogeneous networks. In this first stage of our algorithm, we find the properties that are the most relevant to the given gene set and extract a subnetwork of the original network, comprising only these relevant properties. We then re-rank genes by their similarity to the given gene set, based on a second random walk with restarts, performed on the above subnetwork. We demonstrate the effectiveness of this algorithm for ranking genes related to Drosophila embryonic development and aggressive responses in the brains of social animals. DRaWR was implemented as an R package available at veda.cs.illinois.edu/DRaWR. 
Contact: blatti@illinois.edu. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
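The core operation in the method described above, a random walk with restarts on a network mixing gene and property nodes, can be sketched as follows. This is not the DRaWR package: it shows a single walk on a tiny hypothetical graph, and the node names, edge weights, restart probability, and helper functions are all assumptions. The second stage described in the abstract (extracting a subnetwork of the top properties and walking again) is omitted.

```python
def build_undirected(edges):
    """Turn (node_a, node_b, weight) triples into a symmetric adjacency dict."""
    adjacency = {}
    for a, b, w in edges:
        adjacency.setdefault(a, {})[b] = w
        adjacency.setdefault(b, {})[a] = w
    return adjacency


def random_walk_with_restart(adjacency, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Return the stationary visiting probability of every node."""
    nodes = sorted(adjacency)
    # Restart distribution: uniform over the query (seed) nodes.
    restart_vec = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    prob = dict(restart_vec)
    for _ in range(max_iter):
        nxt = {n: restart * restart_vec[n] for n in nodes}
        for node in nodes:
            total = sum(adjacency[node].values())
            for neighbor, weight in adjacency[node].items():
                # Move along edges in proportion to their weights.
                nxt[neighbor] += (1.0 - restart) * prob[node] * weight / total
        delta = sum(abs(nxt[n] - prob[n]) for n in nodes)
        prob = nxt
        if delta < tol:
            break
    return prob


# Hypothetical heterogeneous network: genes g1..g4 linked to two property nodes.
network = build_undirected([
    ("g1", "pathway:A", 1.0),
    ("g2", "pathway:A", 1.0),
    ("g3", "pathway:A", 1.0),
    ("g3", "pathway:B", 1.0),
    ("g4", "pathway:B", 1.0),
])
scores = random_walk_with_restart(network, seeds={"g1", "g2"})
```

Here the query set {g1, g2} shares pathway:A with g3, so pathway:A outranks pathway:B and g3 outranks g4, mirroring how property nodes relevant to the gene set are identified before genes are re-ranked.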
An Independent Filter for Gene Set Testing Based on Spectral Enrichment.
Frost, H Robert; Li, Zhigang; Asselbergs, Folkert W; Moore, Jason H
2015-01-01
Gene set testing has become an indispensable tool for the analysis of high-dimensional genomic data. An important motivation for testing gene sets, rather than individual genomic variables, is to improve statistical power by reducing the number of tested hypotheses. Given the dramatic growth in common gene set collections, however, testing is often performed with nearly as many gene sets as underlying genomic variables. To address the challenge to statistical power posed by large gene set collections, we have developed spectral gene set filtering (SGSF), a novel technique for independent filtering of gene set collections prior to gene set testing. The SGSF method uses as a filter statistic the p-value measuring the statistical significance of the association between each gene set and the sample principal components (PCs), taking into account the significance of the associated eigenvalues. Because this filter statistic is independent of standard gene set test statistics under the null hypothesis but dependent under the alternative, the proportion of enriched gene sets is increased without impacting the type I error rate. As shown using simulated and real gene expression data, the SGSF algorithm accurately filters gene sets unrelated to the experimental outcome resulting in significantly increased gene set testing power.
Lai, Yinglei; Zhang, Fanni; Nayak, Tapan K; Modarres, Reza; Lee, Norman H; McCaffrey, Timothy A
2014-01-01
Gene set enrichment analysis (GSEA) is an important approach to the analysis of coordinate expression changes at a pathway level. Although many statistical and computational methods have been proposed for GSEA, the issue of a concordant integrative GSEA of multiple expression data sets has not been well addressed. Among different related data sets collected for the same or similar study purposes, it is important to identify pathways or gene sets with concordant enrichment. We categorize the underlying true states of differential expression into three representative categories: no change, positive change and negative change. Due to data noise, what we observe from experiments may not indicate the underlying truth. Although these categories are not observed in practice, they can be considered in a mixture model framework. Then, we define the mathematical concept of concordant gene set enrichment and calculate its related probability based on a three-component multivariate normal mixture model. The related false discovery rate can be calculated and used to rank different gene sets. We used three published lung cancer microarray gene expression data sets to illustrate our proposed method. One analysis based on the first two data sets was conducted to compare our result with a previous published result based on a GSEA conducted separately for each individual data set. This comparison illustrates the advantage of our proposed concordant integrative gene set enrichment analysis. Then, with a relatively new and larger pathway collection, we used our method to conduct an integrative analysis of the first two data sets and also all three data sets. Both results showed that many gene sets could be identified with low false discovery rates. A consistency between both results was also observed. A further exploration based on the KEGG cancer pathway collection showed that a majority of these pathways could be identified by our proposed method. 
This study illustrates that we can improve detection power and discovery consistency through a concordant integrative analysis of multiple large-scale two-sample gene expression data sets.
Query-based biclustering of gene expression data using Probabilistic Relational Models.
Zhao, Hui; Cloots, Lore; Van den Bulcke, Tim; Wu, Yan; De Smet, Riet; Storms, Valerie; Meysman, Pieter; Engelen, Kristof; Marchal, Kathleen
2011-02-15
With the availability of large scale expression compendia it is now possible to view one's own findings in the light of what is already available and retrieve genes with an expression profile similar to a set of genes of interest (i.e., a query or seed set) for a subset of conditions. To that end, a query-based strategy is needed that maximally exploits the coexpression behaviour of the seed genes to guide the biclustering, but that at the same time is robust against the presence of noisy genes in the seed set, as seed genes are often assumed, but not guaranteed, to be coexpressed in the queried compendium. Therefore, we developed ProBic, a query-based biclustering strategy based on Probabilistic Relational Models (PRMs) that exploits the use of prior distributions to extract the information contained within the seed set. We applied ProBic on a large scale Escherichia coli compendium to extend partially described regulons with potentially novel members. We compared ProBic's performance with previously published query-based biclustering algorithms, namely ISA and QDB, from the perspective of bicluster expression quality, robustness of the outcome against noisy seed sets and biological relevance. This comparison shows that ProBic is able to retrieve biologically relevant, high quality biclusters that retain their seed genes and that it is particularly strong in handling noisy seeds. ProBic is a query-based biclustering algorithm developed in a flexible framework, designed to detect biologically relevant, high quality biclusters that retain relevant seed genes even in the presence of noise or when dealing with low quality seed sets.
Tetteh, Antonia Y; Sun, Katherine H; Hung, Chiu-Yueh; Kittur, Farooqahmed S; Ibeanu, Gordon C; Williams, Daniel; Xie, Jiahua
2014-01-01
Bacteria can reduce toxic selenite into less toxic, elemental selenium (Se(0)), but the mechanism by which bacterial cells reduce selenite at the molecular level is still not clear. We used Escherichia coli strain K12, a common bacterial strain, as a model to study its growth response to sodium selenite (Na2SeO3) treatment and then used quantitative real-time PCR (qRT-PCR) to quantify transcript levels of three E. coli selenopolypeptide genes and a set of machinery genes for selenocysteine (SeCys) biosynthesis and incorporation into polypeptides, whose involvement in selenite reduction is largely unknown. We determined that 5 mM Na2SeO3 treatment inhibited growth by ∼50% while 0.001 to 0.01 mM treatments stimulated cell growth by ∼30%. Under 50% inhibitory or 30% stimulatory Na2SeO3 concentration, selenopolypeptide genes (fdnG, fdoG, and fdhF) whose products require SeCys, but not SeCys biosynthesis machinery genes, were found to be induced ≥2-fold. In addition, one sulfur (S) metabolic gene iscS and two previously reported selenite-responsive genes sodA and gutS were also induced ≥2-fold under 50% inhibitory concentration. Our findings provide insight into the detoxification of selenite in E. coli via induction of these genes involved in the selenite reduction process.
Dwivedi, Bhakti; Kowalski, Jeanne
2018-01-01
While many methods exist for integrating multi-omics data or defining gene sets, there is no one single tool that defines gene sets based on merging of multiple omics data sets. We present shinyGISPA, an open-source application with a user-friendly web-based interface to define genes according to their similarity in several molecular changes that are driving a disease phenotype. This tool was developed to help facilitate the usability of a previously published method, Gene Integrated Set Profile Analysis (GISPA), among researchers with limited computer-programming skills. The GISPA method allows the identification of multiple gene sets that may play a role in the characterization, clinical application, or functional relevance of a disease phenotype. The tool provides an automated workflow that is highly scalable and adaptable to applications that go beyond genomic data merging analysis. It is available at http://shinygispa.winship.emory.edu/shinyGISPA/. PMID:29415010
Castells, Xavier; Acebes, Juan José; Majós, Carles; Boluda, Susana; Julià-Sapé, Margarida; Candiota, Ana Paula; Ariño, Joaquín; Barceló, Anna; Arús, Carles
2015-01-01
Glioblastoma (Gb) is one of the most deadly tumors. Its molecular subtypes are yet to be fully characterized, while efforts toward personalized medicine need to be intensified in relation to glioblastoma diagnosis, treatment, and prognosis. Several molecular signatures based on gene expression microarrays have been reported, but the use of microarrays in routine clinical practice is hampered by their cost. Several authors have proposed discriminant equations based on RT-PCR. Still, the discriminant threshold is often incompletely described, which makes proper validation difficult. In previous work, we reported two Gb subtypes based on the expression levels of four genes: CHI3L1, LDHA, LGALS1, and IGFBP3. One Gb subtype presented with low expression of the four genes mentioned, and of MGMT in a large portion of the patients (with anticipated high methylation of its promoter), and mutated IDH1. Here, we evaluate the robustness of the equations fitted with these genes using RT-PCR values in a set of 64 cases and, importantly, define an unequivocal discriminant threshold with a view to prognostic implications. We developed two approaches to generate the discriminant equations: 1) using the expression level of the four genes mentioned above, and 2) using those genes displaying the highest correlation with survival among the aforementioned four, plus MGMT, in an attempt to further reduce the number of genes. The ease of applying the equations, the reduced cost of obtaining raw data, and the robustness in terms of resampling-based classification accuracy warrant further evaluation of these equations to discern Gb tumor biopsy heterogeneity at the molecular level, diagnose potential malignancy, and assess the prognosis of individual patients with glioblastomas.
Cha, Kihoon; Hwang, Taeho; Oh, Kimin; Yi, Gwan-Su
2015-01-01
Background: It has been reported that several brain diseases can be treated in a transnosological manner, implying a possible common molecular basis underlying those diseases. However, molecular-level commonality among those brain diseases has been largely unexplored. Gene expression analyses of the human brain have been used to find genes associated with brain diseases, but most of those studies were restricted either to an individual disease or to a couple of diseases. In addition, identifying significant genes in such brain diseases mostly failed when typical methods depending on differentially expressed genes were used. Results: In this study, we used a correlation-based biclustering approach to find coexpressed gene sets in five neurodegenerative diseases and three psychiatric disorders. By using biclustering analysis, we could efficiently and fairly identify various gene sets expressed specifically in both single and multiple brain diseases. We found 4,307 gene sets correlatively expressed in multiple brain diseases and 3,409 gene sets exclusively specified in individual brain diseases. The function enrichment analysis of those gene sets showed many new possible functional bases as well as neurological processes that are common or specific for those eight diseases. Conclusions: This study introduces possible common molecular bases for several brain diseases, which opens the opportunity to clarify the transnosological perspective assumed in brain diseases. It also showed the advantages of correlation-based biclustering analysis and accompanying function enrichment analysis for gene expression data in this type of investigation. PMID:26043779
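To illustrate what "correlation-based" means for a bicluster, the following is a minimal sketch and not the authors' algorithm. It assumes a bicluster is a pair (gene subset, condition subset) and calls it coherent when every pair of genes has a Pearson correlation above a cutoff over the selected conditions only. The function names, the dictionary layout of `expr`, and the 0.9 threshold are all hypothetical choices made for this example.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def min_pairwise_correlation(expr, genes, conditions):
    """Smallest gene-gene correlation, computed only over the chosen conditions."""
    sub = {g: [expr[g][c] for c in conditions] for g in genes}
    return min(pearson(sub[a], sub[b]) for a, b in combinations(genes, 2))


def is_coherent_bicluster(expr, genes, conditions, threshold=0.9):
    """Hypothetical coherence rule: all gene pairs correlate above `threshold`."""
    return min_pairwise_correlation(expr, genes, conditions) >= threshold


# Toy expression matrix: gene -> values across five conditions (made-up numbers).
expr = {
    "g1": [1, 2, 3, 9, 0],
    "g2": [2, 4, 6, 1, 7],
    "g3": [0, 1, 2, 5, 5],
}
genes = ["g1", "g2", "g3"]
print(is_coherent_bicluster(expr, genes, [0, 1, 2]))        # coexpressed on a subset
print(is_coherent_bicluster(expr, genes, [0, 1, 2, 3, 4]))  # not across all conditions
```

The toy data show why restricting to a condition subset matters: the three genes move together over the first three conditions but not over all five, which is the situation a differential-expression analysis over all samples would tend to miss.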
Kayano, Mitsunori; Matsui, Hidetoshi; Yamaguchi, Rui; Imoto, Seiya; Miyano, Satoru
2016-04-01
High-throughput time course expression profiles have been available in the last decade due to developments in measurement techniques and devices. Functional data analysis, which treats smoothed curves instead of originally observed discrete data, is effective for the time course expression profiles in terms of dimension reduction, robustness, and applicability to data measured at small and irregularly spaced time points. However, the statistical method of differential analysis for time course expression profiles has not been well established. We propose a functional logistic model based on elastic net regularization (F-Logistic) in order to identify the genes with dynamic alterations in case/control study. We employ a mixed model as a smoothing method to obtain functional data; then F-Logistic is applied to time course profiles measured at small and irregularly spaced time points. We evaluate the performance of F-Logistic in comparison with another functional data approach, i.e. functional ANOVA test (F-ANOVA), by applying the methods to real and synthetic time course data sets. The real data sets consist of the time course gene expression profiles for long-term effects of recombinant interferon β on disease progression in multiple sclerosis. F-Logistic distinguishes dynamic alterations, which cannot be found by competitive approaches such as F-ANOVA, in case/control study based on time course expression profiles. F-Logistic is effective for time-dependent biomarker detection, diagnosis, and therapy. © The Author 2015. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
2017-01-01
Although in recent years the study of gene expression variation in the absence of genetic or environmental cues, or gene expression heterogeneity, has intensified considerably, many basic and applied biological fields still remain unaware of how useful the study of gene expression heterogeneity patterns might be for the characterization of biological systems and/or processes. Largely based on the modulating effect that chromatin compaction has on gene expression heterogeneity, and on the extensive changes in chromatin compaction known to occur in specialized cells that are naturally or artificially induced to revert to less specialized states or dedifferentiate, I recently hypothesized that processes that concur with cell dedifferentiation would show an extensive reduction in gene expression heterogeneity. Confirmation of such a trend could be of wide interest because of the biomedical and biotechnological relevance of cell dedifferentiation-based processes, i.e., regenerative development, cancer, human induced pluripotent stem cells, or plant somatic embryogenesis. Here, I report the first empirical evidence consistent with the existence of an extensive reduction in gene expression heterogeneity for processes that concur with cell dedifferentiation by analyzing transcriptome dynamics along forearm regenerative development in Ambystoma mexicanum, or axolotl. I also briefly discuss the utility that the study of gene expression heterogeneity dynamics might have for the characterization of cell dedifferentiation-based processes and for the engineering of tools that afford better monitoring and modulation of such processes. Finally, I reflect on how a transitional reduction in gene expression heterogeneity for dedifferentiated cells can promote a long-term increase in phenotypic heterogeneity following cell dedifferentiation, with potential adverse effects for biomedical and biotechnological applications. PMID:29134148
Takahashi, Kei-ichiro; Takigawa, Ichigaku; Mamitsuka, Hiroshi
2013-01-01
Detecting biclusters from expression data is useful, since biclusters are sets of genes coexpressed under only part of all given experimental conditions. We present software called SiBIC, which, from a given expression dataset, first exhaustively enumerates biclusters. These are then merged into largely independent biclusters, which are finally used to generate gene set networks, in which the gene set assigned to each node consists of coexpressed genes. We evaluated each step of this procedure: 1) the biological and statistical significance of the generated biclusters, 2) the biological quality of merged biclusters, and 3) the biological significance of gene set networks. We emphasize that gene set networks, in which nodes are not genes but gene sets, can be more compact than usual gene networks, meaning that gene set networks are more comprehensible. SiBIC is available at http://utrecht.kuicr.kyoto-u.ac.jp:8080/miami/faces/index.jsp.
Teasdale, Luisa C; Köhler, Frank; Murray, Kevin D; O'Hara, Tim; Moussalli, Adnan
2016-09-01
The qualification of orthology is a significant challenge when developing large, multiloci phylogenetic data sets from assembled transcripts. Transcriptome assemblies have various attributes, such as fragmentation, frameshifts and mis-indexing, which pose problems to automated methods of orthology assessment. Here, we identify a set of orthologous single-copy genes from transcriptome assemblies for the land snails and slugs (Eupulmonata) using a thorough approach to orthology determination involving manual alignment curation, gene tree assessment and sequencing from genomic DNA. We qualified the orthology of 500 nuclear, protein-coding genes from the transcriptome assemblies of 21 eupulmonate species to produce the most complete phylogenetic data matrix for a major molluscan lineage to date, both in terms of taxon and character completeness. Exon capture targeting 490 of the 500 genes (those with at least one exon >120 bp) from 22 species of Australian Camaenidae successfully captured sequences of 2825 exons (representing all targeted genes), with only a 3.7% reduction in the data matrix due to the presence of putative paralogs or pseudogenes. The automated pipeline Agalma retrieved the majority of the manually qualified 500 single-copy gene set and identified a further 375 putative single-copy genes, although it failed to account for fragmented transcripts resulting in lower data matrix completeness when considering the original 500 genes. This could potentially explain the minor inconsistencies we observed in the supported topologies for the 21 eupulmonate species between the manually curated and 'Agalma-equivalent' data set (sharing 458 genes). Overall, our study confirms the utility of the 500 gene set to resolve phylogenetic relationships at a range of evolutionary depths and highlights the importance of addressing fragmentation at the homolog alignment stage for probe design. © 2016 John Wiley & Sons Ltd.
Tao, Yebin; Sánchez, Brisa N; Mukherjee, Bhramar
2015-03-30
Many existing cohort studies designed to investigate health effects of environmental exposures also collect data on genetic markers. The Early Life Exposures in Mexico to Environmental Toxicants project, for instance, has been genotyping single nucleotide polymorphisms on candidate genes involved in mental and nutrient metabolism and also in potentially shared metabolic pathways with the environmental exposures. Given the longitudinal nature of these cohort studies, rich exposure and outcome data are available to address novel questions regarding gene-environment interaction (G × E). Latent variable (LV) models have been effectively used for dimension reduction, helping with multiple testing and multicollinearity issues in the presence of correlated multivariate exposures and outcomes. In this paper, we first propose a modeling strategy, based on LV models, to examine the association between repeated outcome measures (e.g., child weight) and a set of correlated exposure biomarkers (e.g., prenatal lead exposure). We then construct novel tests for G × E effects within the LV framework to examine effect modification of outcome-exposure association by genetic factors (e.g., the hemochromatosis gene). We consider two scenarios: one allowing dependence of the LV models on genes and the other assuming independence between the LV models and genes. We combine the two sets of estimates by shrinkage estimation to trade off bias and efficiency in a data-adaptive way. Using simulations, we evaluate the properties of the shrinkage estimates, and in particular, we demonstrate the need for this data-adaptive shrinkage given repeated outcome measures, exposure measures possibly repeated and time-varying gene-environment association. Copyright © 2014 John Wiley & Sons, Ltd.
Down-weighting overlapping genes improves gene set analysis
2012-01-01
Background: The identification of gene sets that are significantly impacted in a given condition based on microarray data is a crucial step in current life science research. Most gene set analysis methods treat genes equally, regardless of how specific they are to a given gene set. Results: In this work we propose a new gene set analysis method that computes a gene set score as the mean of absolute values of weighted moderated gene t-scores. The gene weights are designed to emphasize the genes appearing in few gene sets, versus genes that appear in many gene sets. We demonstrate the usefulness of the method when analyzing gene sets that correspond to the KEGG pathways, and hence we called our method Pathway Analysis with Down-weighting of Overlapping Genes (PADOG). Unlike most gene set analysis methods, which are validated through the analysis of 2-3 data sets followed by a human interpretation of the results, the validation employed here uses 24 different data sets and a completely objective assessment scheme that makes minimal assumptions and eliminates the need for possibly biased human assessments of the analysis results. Conclusions: PADOG significantly improves gene set ranking and boosts sensitivity of analysis using information already available in the gene expression profiles and the collection of gene sets to be analyzed. The advantages of PADOG over other existing approaches are shown to be stable to changes in the database of gene sets to be analyzed. PADOG was implemented as an R package available at: http://bioinformaticsprb.med.wayne.edu/PADOG/ or http://www.bioconductor.org. PMID:22713124
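The scoring idea described above (mean of absolute weighted t-scores, with genes that appear in many sets down-weighted) can be sketched as follows. This is an illustration under stated assumptions, not the PADOG R package. The per-gene t-scores are assumed to be precomputed. The weight formula, which maps a gene's frequency across sets onto the range 1 to 2, is an assumption chosen for this example. The standardization and permutation steps a full method would need are omitted. All function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def gene_weights(gene_sets):
    """Weight each gene by how rarely it occurs across gene sets.

    Assumed form for illustration: w = 1 + sqrt((max_f - f) / (max_f - min_f)),
    so a gene in the fewest sets gets weight 2 and one in the most sets gets 1.
    """
    freq = Counter(g for genes in gene_sets.values() for g in set(genes))
    lo, hi = min(freq.values()), max(freq.values())
    if hi == lo:  # every gene is equally frequent: no down-weighting possible
        return {g: 1.0 for g in freq}
    return {g: 1.0 + sqrt((hi - f) / (hi - lo)) for g, f in freq.items()}


def weighted_set_score(t_scores, genes, weights):
    """Mean of |t| * weight over the genes of one set that have a t-score."""
    vals = [abs(t_scores[g]) * weights[g] for g in genes if g in t_scores]
    return sum(vals) / len(vals) if vals else 0.0


# Toy input: g2 is shared by all three sets, the other genes are set-specific.
gene_sets = {"A": ["g1", "g2"], "B": ["g2", "g3"], "C": ["g2", "g4"]}
t_scores = {"g1": -2.0, "g2": 3.0, "g3": 0.5, "g4": 1.0}

weights = gene_weights(gene_sets)
scores = {name: weighted_set_score(t_scores, genes, weights)
          for name, genes in gene_sets.items()}
print(weights)
print(scores)
```

In the toy input the shared gene g2 has the largest t-score, yet it contributes with weight 1 while the set-specific genes contribute with weight 2, so set A is separated from B and C by its specific gene rather than by the gene all three share.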
Pamukçu, Esra; Bozdogan, Hamparsum; Çalık, Sinan
2015-01-01
Gene expression data typically are large, complex, and highly noisy. Their dimension is high with several thousand genes (i.e., features) but with only a limited number of observations (i.e., samples). Although the classical principal component analysis (PCA) method is widely used as a first standard step in dimension reduction and in supervised and unsupervised classification, it suffers from several shortcomings in the case of data sets involving undersized samples, since the sample covariance matrix degenerates and becomes singular. In this paper we address these limitations within the context of probabilistic PCA (PPCA) by introducing and developing a new and novel approach using maximum entropy covariance matrix and its hybridized smoothed covariance estimators. To reduce the dimensionality of the data and to choose the number of probabilistic PCs (PPCs) to be retained, we further introduce and develop celebrated Akaike's information criterion (AIC), consistent Akaike's information criterion (CAIC), and the information theoretic measure of complexity (ICOMP) criterion of Bozdogan. Six publicly available undersized benchmark data sets were analyzed to show the utility, flexibility, and versatility of our approach with hybridized smoothed covariance matrix estimators, which do not degenerate to perform the PPCA to reduce the dimension and to carry out supervised classification of cancer groups in high dimensions. PMID:25838836
Synthetic and Evolutionary Construction of a Chlorate-Reducing Shewanella oneidensis MR-1.
Clark, Iain C; Melnyk, Ryan A; Youngblut, Matthew D; Carlson, Hans K; Iavarone, Anthony T; Coates, John D
2015-05-19
Despite evidence for the prevalence of horizontal gene transfer of respiratory genes, little is known about how pathways functionally integrate within new hosts. One example of a mobile respiratory metabolism is bacterial chlorate reduction, which is frequently encoded on composite transposons. This implies that the essential components of the metabolism are encoded on these mobile elements. To test this, we heterologously expressed genes for chlorate reduction from Shewanella algae ACDC in the non-chlorate-reducing Shewanella oneidensis MR-1. The construct that ultimately endowed robust growth on chlorate included cld, a cytochrome c gene, clrABDC, and two genes of unknown function. Although strain MR-1 was unable to grow on chlorate after initial insertion of these genes into the chromosome, 11 derived strains capable of chlorate respiration were obtained through adaptive evolution. Genome resequencing indicated that all of the evolved chlorate-reducing strains replicated a large genomic region containing chlorate reduction genes. Contraction in copy number and loss of the ability to reduce chlorate were also observed, indicating that this phenomenon was extremely dynamic. Although most strains contained more than six copies of the replicated region, a single strain with less duplication also grew rapidly. This strain contained three additional mutations that we hypothesized compensated for the low copy number. We remade the mutations combinatorially in the unevolved strain and determined that a single nucleotide polymorphism (SNP) upstream of cld enabled growth on chlorate and was epistatic to a second base pair change in the NarP binding sequence between narQP and nrfA that enhanced growth. The ability of chlorate reduction composite transposons to form functional metabolisms after transfer to a new host is an important part of their propagation. To study this phenomenon, we engineered Shewanella oneidensis MR-1 into a chlorate reducer. 
We defined a set of genes sufficient to endow growth on chlorate from a plasmid, but found that chromosomal insertion of these genes was nonfunctional. Evolution of this inoperative strain into a chlorate reducer showed that tandem duplication was a dominant mechanism of activation. While copy number changes are a relatively rapid way of increasing gene dosage, replicating almost 1 megabase of extra DNA is costly. Mutations that alleviate the need for high copy number are expected to arise and eventually predominate, and we identified a single nucleotide polymorphism (SNP) that relieved the copy number requirement. This study uses both rational and evolutionary approaches to gain insight into the evolution of a fascinating respiratory metabolism. Copyright © 2015 Clark et al.
Estimation of gene induction enables a relevance-based ranking of gene sets.
Bartholomé, Kilian; Kreutz, Clemens; Timmer, Jens
2009-07-01
In order to handle and interpret the vast amounts of data produced by microarray experiments, the analysis of sets of genes with a common biological functionality has been shown to be advantageous compared to single gene analyses. Some statistical methods have been proposed to analyse the differential gene expression of gene sets in microarray experiments. However, most of these methods either require threshold values to be chosen for the analysis, or they need some reference set for the determination of significance. We present a method that estimates the number of differentially expressed genes in a gene set without requiring a threshold value for significance of genes. The method is self-contained (i.e., it does not require a reference set for comparison). In contrast to other methods which are focused on significance, our approach emphasizes the relevance of the regulation of gene sets. The presented method measures the degree of regulation of a gene set and is a useful tool to compare the induction of different gene sets and place the results of microarray experiments into the biological context. An R-package is available.
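The abstract does not give the estimator itself, so the sketch below swaps in a different, well-known threshold-free idea purely to show what such an estimate can look like. Under the null hypothesis p-values are uniform with mean 0.5, so twice the mean p-value of a gene set is a conservative estimate of its fraction of unregulated genes. This is a stand-in, not the method of the paper, and the function name and example p-values are hypothetical.

```python
def estimate_num_regulated(p_values):
    """Conservative, threshold-free estimate of regulated genes in one set.

    Stand-in estimator (not the paper's): the null fraction is estimated as
    min(1, 2 * mean(p)), because null p-values average 0.5; the remainder
    of the set is counted as differentially expressed.
    """
    n = len(p_values)
    if n == 0:
        return 0.0
    null_fraction = min(1.0, 2.0 * sum(p_values) / n)
    return n * (1.0 - null_fraction)


# Made-up per-gene p-values for one gene set: three small, five unremarkable.
p_set = [0.001, 0.002, 0.01, 0.4, 0.6, 0.5, 0.3, 0.7]
print(round(estimate_num_regulated(p_set), 3))
```

No gene is ever declared significant or not, and no reference set is consulted, which is the sense in which such an estimate is both threshold-free and self-contained. The result is a degree of regulation that can be compared across gene sets of different sizes after dividing by the set size.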
Tuning CRISPR-Cas9 Gene Drives in Saccharomyces cerevisiae
Roggenkamp, Emily; Giersch, Rachael M.; Schrock, Madison N.; Turnquist, Emily; Halloran, Megan; Finnigan, Gregory C.
2018-01-01
Control of biological populations is an ongoing challenge in many fields, including agriculture, biodiversity, ecological preservation, pest control, and the spread of disease. In some cases, such as insects that harbor human pathogens (e.g., malaria), elimination or reduction of a small number of species would have a dramatic impact across the globe. Given the recent discovery and development of the CRISPR-Cas9 gene editing technology, a unique arrangement of this system, a nuclease-based “gene drive,” allows for the super-Mendelian spread and forced propagation of a genetic element through a population. Recent studies have demonstrated the ability of a gene drive to rapidly spread within and nearly eliminate insect populations in a laboratory setting. While there are still ongoing technical challenges to design of a more optimal gene drive to be used in wild populations, there are still serious ecological and ethical concerns surrounding the nature of this powerful biological agent. Here, we use budding yeast as a safe and fully contained model system to explore mechanisms that might allow for programmed regulation of gene drive activity. We describe four conserved features of all CRISPR-based drives and demonstrate the ability of each drive component—Cas9 protein level, sgRNA identity, Cas9 nucleocytoplasmic shuttling, and novel Cas9-Cas9 tandem fusions—to modulate drive activity within a population. PMID:29348295
A Risk Stratification Model for Lung Cancer Based on Gene Coexpression Network and Deep Learning
2018-01-01
Risk stratification models for lung cancer based on gene expression profiles are of great interest. Instead of previous models based on individual prognostic genes, we aimed to develop a novel system-level risk stratification model for lung adenocarcinoma based on gene coexpression networks. Using multiple microarray data sets, gene coexpression network analysis was performed to identify survival-related networks. A deep learning based risk stratification model was constructed with representative genes of these networks. The model was validated in two test sets. Survival analysis was performed using the output of the model to evaluate whether it could predict patients' survival independent of clinicopathological variables. Five networks were significantly associated with patients' survival. Considering prognostic significance and representativeness, genes of the two survival-related networks were selected for input of the model. The output of the model was significantly associated with patients' survival in the training set and in both test sets (p < 0.00001, p < 0.0001 and p = 0.02 for the training set and test sets 1 and 2, respectively). In multivariate analyses, the model was associated with patients' prognosis independent of other clinicopathological features. Our study presents a new perspective on incorporating gene coexpression networks into the gene expression signature and on the clinical application of deep learning in genomic data science for prognosis prediction. PMID:29581968
Joint amalgamation of most parsimonious reconciled gene trees
Scornavacca, Celine; Jacox, Edwin; Szöllősi, Gergely J.
2015-01-01
Motivation: Traditionally, gene phylogenies have been reconstructed solely on the basis of molecular sequences; this, however, often does not provide enough information to distinguish between statistically equivalent relationships. To address this problem, several recent methods have incorporated information on the species phylogeny in gene tree reconstruction, leading to dramatic improvements in accuracy. Probabilistic methods are able to estimate all model parameters but are computationally expensive, whereas parsimony methods, which are generally computationally more efficient, require a prior estimate of parameters and of the statistical support. Results: Here, we present the Tree Estimation using Reconciliation (TERA) algorithm, a parsimony-based, species-tree-aware method for gene tree reconstruction based on a scoring scheme combining duplication, transfer and loss costs with an estimate of the sequence likelihood. TERA explores all reconciled gene trees that can be amalgamated from a sample of gene trees. Using a large-scale simulated dataset, we demonstrate that TERA achieves the same accuracy as the corresponding probabilistic method while being faster, and outperforms other parsimony-based methods in both accuracy and speed. Running TERA on a set of 1099 homologous gene families from complete cyanobacterial genomes, we find that incorporating knowledge of the species tree results in a two-thirds reduction in the number of apparent transfer events. Availability and implementation: The algorithm is implemented in our program TERA, which is freely available from http://mbb.univ-montp2.fr/MBB/download_sources/16__TERA. Contact: celine.scornavacca@univ-montp2.fr, ssolo@angel.elte.hu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25380957
He, Peng; Zhang, Yun-Fei; Hong, Duan-Yang; Wang, Jun; Wang, Xing-Liang; Zuo, Ling-Hua; Tang, Xian-Fu; Xu, Wei-Ming; He, Ming
2017-03-01
Female moths synthesize species-specific sex pheromone components and release them to attract male moths, which depend on a precise sex pheromone chemosensory system to locate females. Two types of genes involved in the sex pheromone biosynthesis and degradation pathways play essential roles in this important moth behavior. To understand the function of genes in the sex pheromone pathway, this study investigated the genome-wide and digital gene expression of sex pheromone biosynthesis and degradation genes in various adult tissues in the diamondback moth (DBM), Plutella xylostella, which is a notorious vegetable pest worldwide. A massive transcriptome data set (at least 39.04 Gb) was generated by sequencing 6 adult tissues, including male antennae, female antennae, heads, legs, abdomen and female pheromone glands, from DBM by using Illumina 4000 next-generation sequencing and mapping to a published DBM genome. Bioinformatics analysis yielded a total of 89,332 unigenes, among which 87 transcripts were putatively related to seven gene families in the sex pheromone biosynthesis pathway. Among these, seven [two desaturases (DES), three fatty acyl-CoA reductases (FAR), one acetyltransferase (ACT) and one alcohol dehydrogenase (AD)] were mainly expressed in the pheromone glands, with likely function in the three essential sex pheromone biosynthesis steps: desaturation, reduction, and esterification. We also identified 210 odorant-degradation related genes (including sex pheromone-degradation related genes) from seven major enzyme groups. Among these genes, 100 are newly identified, and two aldehyde oxidases (AOXs), one aldehyde dehydrogenase (ALDH), five carboxyl/cholinesterases (CCEs), five UDP-glycosyltransferases (UGTs), eight cytochrome P450s (CYPs) and three glutathione S-transferases (GSTs) displayed more robust expression in the antennae, and thus are proposed to participate in the degradation of sex pheromone components and plant volatiles.
To date, this is the most comprehensive gene data set of sex pheromone biosynthesis and degradation enzyme related genes in DBM created by genome- and transcriptome-wide identification, characterization and expression profiling. Our findings provide a basis to better understand the function of genes with tissue enriched expression. The results also provide information on the genes involved in sex pheromone biosynthesis and degradation, and may be useful to identify potential gene targets for pest control strategies by disrupting the insect-insect communication using pheromone-based behavioral antagonists.
Combining multiple tools outperforms individual methods in gene set enrichment analyses.
Alhamdoosh, Monther; Ng, Milica; Wilson, Nicholas J; Sheridan, Julie M; Huynh, Huy; Wilson, Michael J; Ritchie, Matthew E
2017-02-01
Gene set enrichment (GSE) analysis allows researchers to efficiently extract biological insight from long lists of differentially expressed genes by interrogating them at a systems level. In recent years, there has been a proliferation of GSE analysis methods and hence it has become increasingly difficult for researchers to select an optimal GSE tool based on their particular dataset. Moreover, the majority of GSE analysis methods do not allow researchers to simultaneously compare gene set level results between multiple experimental conditions. The ensemble of genes set enrichment analyses (EGSEA) is a method developed for RNA-sequencing data that combines results from twelve algorithms and calculates collective gene set scores to improve the biological relevance of the highest ranked gene sets. EGSEA's gene set database contains around 25 000 gene sets from sixteen collections. It has multiple visualization capabilities that allow researchers to view gene sets at various levels of granularity. EGSEA has been tested on simulated data and on a number of human and mouse datasets and, based on biologists' feedback, consistently outperforms the individual tools that have been combined. Our evaluation demonstrates the superiority of the ensemble approach for GSE analysis, and its utility to effectively and efficiently extrapolate biological functions and potential involvement in disease processes from lists of differentially regulated genes. EGSEA is available as an R package at http://www.bioconductor.org/packages/EGSEA/. The gene set collections are available in the R package EGSEAdata from http://www.bioconductor.org/packages/EGSEAdata/. Contact: monther.alhamdoosh@csl.com.au, mritchie@wehi.edu.au. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
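The ensemble idea can be illustrated with a small rank-aggregation sketch. It is not the EGSEA package or its API. It assumes each method returns one p-value per gene set, ranks the gene sets within each method, and orders them by average rank across methods. The method names, p-values, and function names are hypothetical, and ties within a method are broken arbitrarily by sort order.

```python
def rank_gene_sets(p_values):
    """Rank gene sets within one method: smallest p-value gets rank 1."""
    ordered = sorted(p_values, key=p_values.get)
    return {gene_set: i + 1 for i, gene_set in enumerate(ordered)}


def average_rank(results_by_method):
    """Average each gene set's rank over all methods (assumes every method
    reports the same gene sets)."""
    per_method = [rank_gene_sets(p) for p in results_by_method.values()]
    gene_sets = per_method[0].keys()
    return {gs: sum(r[gs] for r in per_method) / len(per_method) for gs in gene_sets}


# Made-up p-values from three hypothetical enrichment methods.
results = {
    "method1": {"A": 0.01, "B": 0.20, "C": 0.50},
    "method2": {"A": 0.03, "B": 0.60, "C": 0.04},
    "method3": {"A": 0.20, "B": 0.10, "C": 0.90},
}

avg = average_rank(results)
ensemble_order = sorted(avg, key=avg.get)
print(avg)
print(ensemble_order)
```

Averaging ranks rather than raw p-values is a deliberate simplification here: ranks are comparable across methods whose p-values are calibrated differently, so no single method dominates the ensemble ordering.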
Lee, Won Jun; Kim, Sang Cheol; Lee, Seul Ji; Lee, Jeongmi; Park, Jeong Hill; Yu, Kyung-Sang; Lim, Johan; Kwon, Sung Won
2014-01-01
Based on the process of carcinogenesis, carcinogens are classified as either genotoxic or non-genotoxic. In contrast to non-genotoxic carcinogens, many genotoxic carcinogens have been reported to cause tumor in carcinogenic bioassays in animals. Thus evaluating the genotoxicity potential of chemicals is important to discriminate genotoxic from non-genotoxic carcinogens for health care and pharmaceutical industry safety. Additionally, investigating the difference between the mechanisms of genotoxic and non-genotoxic carcinogens could provide the foundation for a mechanism-based classification for unknown compounds. In this study, we investigated the gene expression of HepG2 cells treated with genotoxic or non-genotoxic carcinogens and compared their mechanisms of action. To enhance our understanding of the differences in the mechanisms of genotoxic and non-genotoxic carcinogens, we implemented a gene set analysis using 12 compounds for the training set (12, 24, 48 h) and validated significant gene sets using 22 compounds for the test set (24, 48 h). For a direct biological translation, we conducted a gene set analysis using Globaltest and selected significant gene sets. To validate the results, training and test compounds were predicted by the significant gene sets using a prediction analysis for microarrays (PAM). Finally, we obtained 6 gene sets, including sets enriched for genes involved in the adherens junction, bladder cancer, p53 signaling pathway, pathways in cancer, peroxisome and RNA degradation. Among the 6 gene sets, the bladder cancer and p53 signaling pathway sets were significant at 12, 24 and 48 h. We also found that the DDB2, RRM2B and GADD45A, genes related to the repair and damage prevention of DNA, were consistently up-regulated for genotoxic carcinogens. 
Our results suggest that a gene set analysis could provide a robust tool in the investigation of the different mechanisms of genotoxic and non-genotoxic carcinogens and construct a more detailed understanding of the perturbation of significant pathways. PMID:24497971
The limitations of simple gene set enrichment analysis assuming gene independence.
Tamayo, Pablo; Steinhardt, George; Liberzon, Arthur; Mesirov, Jill P
2016-02-01
Since its first publication in 2003, the Gene Set Enrichment Analysis method, based on the Kolmogorov-Smirnov statistic, has been heavily used, modified, and also questioned. Recently, a simplified approach that uses a one-sample t-test score to assess enrichment and ignores gene-gene correlations was proposed by Irizarry et al. (2009) as a serious contender. Its proponents criticize Gene Set Enrichment Analysis's nonparametric nature and its use of an empirical null distribution as unnecessary and hard to compute. We refute these claims by careful consideration of the assumptions of the simplified method and its results, including a comparison with those of Gene Set Enrichment Analysis on a large benchmark set of 50 datasets. Our results provide strong empirical evidence that gene-gene correlations cannot be ignored, because of the significant variance inflation they produce in the enrichment scores, and that they should be taken into account when estimating gene set enrichment significance. In addition, we discuss the challenges that the complex correlation structure and multi-modality of gene sets pose more generally for gene set enrichment methods. © The Author(s) 2012.
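The variance inflation argument can be made concrete with a small sketch. It assumes a gene set of m genes whose scores each have unit variance and share an average pairwise correlation rho, in which case the variance of the set mean is (1 + (m - 1) * rho) / m rather than 1 / m. The function names and the numbers are illustrative and do not come from the paper.

```python
import math

def variance_inflation(m, rho):
    """Factor by which the variance of a set-level mean score grows when
    m unit-variance gene scores share an average pairwise correlation rho."""
    return 1.0 + (m - 1) * rho

def naive_and_adjusted_z(mean_score, m, rho):
    """Z statistic of the set mean under independence and after inflation."""
    naive = mean_score * math.sqrt(m)  # assumes Var(mean) = 1 / m
    adjusted = naive / math.sqrt(variance_inflation(m, rho))
    return naive, adjusted

# Hypothetical set: 100 genes, mean score 0.3, average correlation 0.05.
naive, adjusted = naive_and_adjusted_z(mean_score=0.3, m=100, rho=0.05)
```

Even a modest average correlation of 0.05 inflates the variance almost sixfold for a 100-gene set, so a score that looks significant under the independence assumption may no longer be so once the correlation is accounted for.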
Multiconstrained gene clustering based on generalized projections
2010-01-01
Background Gene clustering for annotating gene functions is one of the fundamental issues in bioinformatics. The best clustering solution is often regularized by multiple constraints such as gene expressions, Gene Ontology (GO) annotations and gene network structures. How to integrate multiple pieces of constraints for an optimal clustering solution still remains an unsolved problem. Results We propose a novel multiconstrained gene clustering (MGC) method within the generalized projection onto convex sets (POCS) framework used widely in image reconstruction. Each constraint is formulated as a corresponding set. The generalized projector iteratively projects the clustering solution onto these sets in order to find a consistent solution included in the intersection set that satisfies all constraints. Compared with previous MGC methods, POCS can integrate multiple constraints from different nature without distorting the original constraints. To evaluate the clustering solution, we also propose a new performance measure referred to as Gene Log Likelihood (GLL) that considers genes having more than one function and hence in more than one cluster. Comparative experimental results show that our POCS-based gene clustering method outperforms current state-of-the-art MGC methods. Conclusions The POCS-based MGC method can successfully combine multiple constraints from different nature for gene clustering. Also, the proposed GLL is an effective performance measure for the soft clustering solutions. PMID:20356386
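The projection scheme described above can be sketched in a few lines. The two convex sets used here (a box constraint and a fixed-sum constraint on a soft membership vector) are assumptions chosen for illustration, not the constraint sets of the paper, and all names are hypothetical.

```python
def project_box(x, lo=0.0, hi=1.0):
    """Project onto the box lo <= x_i <= hi by clipping each coordinate."""
    return [min(max(v, lo), hi) for v in x]

def project_sum(x, total=1.0):
    """Project onto the hyperplane sum(x) == total by a uniform shift."""
    shift = (total - sum(x)) / len(x)
    return [v + shift for v in x]

def pocs(x, projections, iterations=200):
    """Cyclically apply each projection; for convex sets with a non-empty
    intersection the iterate approaches a point satisfying all constraints."""
    for _ in range(iterations):
        for project in projections:
            x = project(x)
    return x

# Hypothetical soft-membership vector of one gene over three clusters.
membership = pocs([1.7, -0.4, 0.2], [project_box, project_sum])
```

Each constraint only needs its own projection operator, which is why further constraints can be added to the list without modifying the existing ones.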
A novel feature extraction approach for microarray data based on multi-algorithm fusion
Jiang, Zhu; Xu, Rong
2015-01-01
Feature extraction is one of the most important and effective methods for reducing dimensionality in data mining, particularly with the emergence of high-dimensional data such as microarray gene expression data. Feature extraction for gene selection mainly serves two purposes. One is to identify certain disease-related genes. The other is to find a compact set of discriminative genes to build a pattern classifier with reduced complexity and improved generalization capabilities. Depending on the purpose of gene selection, two types of feature extraction algorithms, ranking-based and set-based, are employed in microarray gene expression data analysis. In ranking-based feature extraction, features are generally evaluated on an individual basis, without considering the inter-relationships between features, while set-based feature extraction evaluates features based on their role in a feature set by taking into account the dependency between features. Like learning methods, feature extraction faces a generalization problem, namely robustness. However, the issue of robustness is often overlooked in feature extraction. In order to improve the accuracy and robustness of feature extraction for microarray data, a novel approach based on multi-algorithm fusion is proposed. By fusing different types of feature extraction algorithms to select features from the sample set, the proposed approach is able to improve feature extraction performance. The new approach is tested on gene expression datasets including Colon cancer data, CNS data, DLBCL data, and Leukemia data. The results show that the algorithm performs better than existing solutions. PMID:25780277
Positive-unlabeled learning for disease gene identification
Yang, Peng; Li, Xiao-Li; Mei, Jian-Ping; Kwoh, Chee-Keong; Ng, See-Kiong
2012-01-01
Background: Identifying disease genes from the human genome is an important but challenging task in biomedical research. Machine learning methods can be applied to discover new disease genes based on the known ones. Existing machine learning methods typically use the known disease genes as the positive training set P and the unknown genes as the negative training set N (a non-disease gene set does not exist) to build classifiers to identify new disease genes from the unknown genes. However, such classifiers are actually built from a noisy negative set N, as there can be unknown disease genes in N itself. As a result, the classifiers do not perform as well as they could. Results: Instead of treating the unknown genes as negative examples in N, we treat them as an unlabeled set U. We design a novel positive-unlabeled (PU) learning algorithm PUDI (PU learning for disease gene identification) to build a classifier using P and U. We first partition U into four sets, namely, reliable negative set RN, likely positive set LP, likely negative set LN and weak negative set WN. Weighted support vector machines are then used to build a multi-level classifier based on the four training sets and the positive training set P to identify disease genes. Our experimental results demonstrate that our proposed PUDI algorithm outperformed the existing methods significantly. Conclusion: The proposed PUDI algorithm is able to identify disease genes more accurately by treating the unknown data more appropriately as unlabeled set U instead of negative set N. Given that many machine learning problems in biomedical research do involve positive and unlabeled data instead of negative data, it is possible that the machine learning methods for these problems can be further improved by adopting PU learning methods, as we have done here for disease gene identification.
Availability and implementation: The executable program and data are available at http://www1.i2r.a-star.edu.sg/∼xlli/PUDI/PUDI.html. Contact: xlli@i2r.a-star.edu.sg or yang0293@e.ntu.edu.sg Supplementary information: Supplementary Data are available at Bioinformatics online. PMID:22923290
Reboiro-Jato, Miguel; Arrais, Joel P; Oliveira, José Luis; Fdez-Riverola, Florentino
2014-01-30
The diagnosis and prognosis of several diseases can be shortened through the use of different large-scale genome experiments. In this context, microarrays can generate expression data for a huge set of genes. However, to obtain solid statistical evidence from the resulting data, it is necessary to train and to validate many classification techniques in order to find the best discriminative method. This is a time-consuming process that normally depends on intricate statistical tools. geneCommittee is a web-based interactive tool for routinely evaluating the discriminative classification power of custom hypotheses in the form of biologically relevant gene sets. While the user can work with different gene set collections and several microarray data files to configure specific classification experiments, the tool is able to run several tests in parallel. Provided with a straightforward and intuitive interface, geneCommittee is able to render valuable information for diagnostic analyses and clinical management decisions based on systematically evaluating custom hypotheses over different data sets using complementary classifiers, a key aspect in clinical research. geneCommittee allows the enrichment of raw microarray data with gene functional annotations, producing integrated datasets that simplify the construction of better discriminative hypotheses, and allows the creation of a set of complementary classifiers. The trained committees can then be used for clinical research and diagnosis. Full documentation including common use cases and guided analysis workflows is freely available at http://sing.ei.uvigo.es/GC/.
Faruki, Hawazin; Mayhew, Gregory M; Fan, Cheng; Wilkerson, Matthew D; Parker, Scott; Kam-Morgan, Lauren; Eisenberg, Marcia; Horten, Bruce; Hayes, D Neil; Perou, Charles M; Lai-Goldman, Myla
2016-06-01
Context: A histologic classification of lung cancer subtypes is essential in guiding therapeutic management. Objective: To complement morphology-based classification of lung tumors, a previously developed lung subtyping panel (LSP) of 57 genes was tested using multiple public fresh-frozen gene-expression data sets and a prospectively collected set of formalin-fixed, paraffin-embedded lung tumor samples. Design: The LSP gene-expression signature was evaluated in multiple lung cancer gene-expression data sets totaling 2177 patients collected from 4 platforms: Illumina RNAseq (San Diego, California), Agilent (Santa Clara, California) and Affymetrix (Santa Clara) microarrays, and quantitative reverse transcription-polymerase chain reaction. Gene centroids were calculated for each of 3 genomic-defined subtypes: adenocarcinoma, squamous cell carcinoma, and neuroendocrine, the latter of which encompassed both small cell carcinoma and carcinoid. Classification by LSP into 3 subtypes was evaluated in both fresh-frozen and formalin-fixed, paraffin-embedded tumor samples, and agreement with the original morphology-based diagnosis was determined. Results: The LSP-based classifications demonstrated overall agreement with the original clinical diagnosis ranging from 78% (251 of 322) to 91% (492 of 538 and 869 of 951) in the fresh-frozen public data sets and 84% (65 of 77) in the formalin-fixed, paraffin-embedded data set. The LSP performance was independent of tissue-preservation method and gene-expression platform. Secondary, blinded pathology review of formalin-fixed, paraffin-embedded samples demonstrated concordance of 82% (63 of 77) with the original morphology diagnosis. Conclusions: The LSP gene-expression signature is a reproducible and objective method for classifying lung tumors and demonstrates good concordance with morphology-based classification across multiple data sets.
The LSP panel can supplement morphologic assessment of lung cancers, particularly when classification by standard methods is challenging.
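Centroid-based subtype assignment, as described in the Design section above, can be sketched as follows. The abstract does not state the similarity measure, so Euclidean distance is an assumption here, and the three-gene toy profiles are invented purely for illustration (the actual panel has 57 genes).

```python
import math

def centroid(profiles):
    """Mean expression per gene over the training samples of one subtype."""
    return [sum(values) / len(values) for values in zip(*profiles)]

def classify(sample, centroids):
    """Return the subtype whose centroid is closest to the sample
    (Euclidean distance is an assumption of this sketch)."""
    return min(centroids, key=lambda name: math.dist(sample, centroids[name]))

# Hypothetical training profiles: two samples per subtype, three genes each.
training = {
    "adenocarcinoma": [[5.0, 1.0, 0.5], [4.6, 1.2, 0.7]],
    "squamous": [[1.0, 6.0, 0.4], [1.4, 5.6, 0.6]],
    "neuroendocrine": [[0.8, 0.9, 7.0], [1.0, 1.1, 6.6]],
}
centroids = {name: centroid(rows) for name, rows in training.items()}
predicted = classify([1.2, 5.1, 0.9], centroids)
```

Agreement with a morphology-based diagnosis would then be the fraction of samples for which the predicted label matches the pathologist's label.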
NASA Astrophysics Data System (ADS)
Pagnuco, Inti A.; Pastore, Juan I.; Abras, Guillermo; Brun, Marcel; Ballarin, Virginia L.
2016-04-01
It is usually assumed that co-expressed genes suggest co-regulation in the underlying regulatory network. Determining sets of co-expressed genes is an important task, in which significant groups of genes are defined based on some criteria. This task is usually performed by clustering algorithms, where the whole family of genes, or a subset of them, is clustered into meaningful groups based on their expression values in a set of experiments. In this work we used a methodology based on the Silhouette index as a measure of cluster quality for individual gene groups, and a combination of several variants of hierarchical clustering to generate the candidate groups, to obtain sets of co-expressed genes for two real data examples. We analyzed the quality of the best ranked groups obtained by the algorithm using an online bioinformatics tool that provides network information for the selected genes. Moreover, to verify the performance of the algorithm, considering the fact that it does not find all possible subsets, we compared its results against a full search, to determine the number of good co-regulated sets not detected.
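The Silhouette index used to score individual gene groups can be sketched directly from its standard definition: for each gene, a is the mean distance to the other members of its own group and b is the smallest mean distance to any other group, giving s = (b - a) / max(a, b). Euclidean distance and the toy profiles are assumptions of this sketch, and the function names are hypothetical.

```python
import math
from statistics import mean

def silhouette(index, labels, points):
    """Silhouette value of one point given a cluster labelling."""
    own = labels[index]
    same = [j for j, lab in enumerate(labels) if lab == own and j != index]
    if not same:
        return 0.0  # common convention for singleton clusters
    a = mean(math.dist(points[index], points[j]) for j in same)
    b = min(
        mean(math.dist(points[index], points[j])
             for j, lab in enumerate(labels) if lab == other)
        for other in set(labels) if other != own
    )
    return (b - a) / max(a, b)

def cluster_silhouette(cluster, labels, points):
    """Quality of a single group: mean silhouette of its members."""
    return mean(silhouette(i, labels, points)
                for i, lab in enumerate(labels) if lab == cluster)

# Hypothetical two-condition expression profiles for four genes.
profiles = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
good = cluster_silhouette(0, [0, 0, 1, 1], profiles)
bad = cluster_silhouette(0, [0, 1, 0, 1], profiles)
```

Scoring each candidate group separately, rather than the partition as a whole, is what allows groups produced by different hierarchical clustering variants to be ranked against one another.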
MAGMA: Generalized Gene-Set Analysis of GWAS Data
de Leeuw, Christiaan A.; Mooij, Joris M.; Heskes, Tom; Posthuma, Danielle
2015-01-01
By aggregating data for complex traits in a biologically meaningful way, gene and gene-set analysis constitute a valuable addition to single-marker analysis. However, although various methods for gene and gene-set analysis currently exist, they generally suffer from a number of issues. Statistical power for most methods is strongly affected by linkage disequilibrium between markers, multi-marker associations are often hard to detect, and the reliance on permutation to compute p-values tends to make the analysis computationally very expensive. To address these issues we have developed MAGMA, a novel tool for gene and gene-set analysis. The gene analysis is based on a multiple regression model, to provide better statistical performance. The gene-set analysis is built as a separate layer around the gene analysis for additional flexibility. This gene-set analysis also uses a regression structure to allow generalization to analysis of continuous properties of genes and simultaneous analysis of multiple gene sets and other gene properties. Simulations and an analysis of Crohn’s Disease data are used to evaluate the performance of MAGMA and to compare it to a number of other gene and gene-set analysis tools. The results show that MAGMA has significantly more power than other tools for both the gene and the gene-set analysis, identifying more genes and gene sets associated with Crohn’s Disease while maintaining a correct type 1 error rate. Moreover, the MAGMA analysis of the Crohn’s Disease data was found to be considerably faster as well. PMID:25885710
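The regression view of gene-set analysis can be illustrated with a deliberately reduced sketch: gene-level Z-scores are regressed on a 0/1 indicator of gene-set membership, and the slope estimates how much stronger the association is inside the set. This assumes a single predictor and independent genes. The actual MAGMA model is not reproduced here, and the function name and toy values are hypothetical.

```python
from statistics import mean

def set_effect(z_scores, in_set):
    """OLS slope of gene Z-scores on a 0/1 gene-set indicator.

    With a binary predictor the slope equals the mean Z-score of genes
    inside the set minus the mean Z-score of genes outside it.
    """
    x_bar = mean(in_set)
    z_bar = mean(z_scores)
    sxy = sum((x - x_bar) * (z - z_bar) for x, z in zip(in_set, z_scores))
    sxx = sum((x - x_bar) ** 2 for x in in_set)
    return sxy / sxx

# Hypothetical gene-level Z-scores; the first three genes belong to the set.
z = [2.1, 1.8, 2.4, 0.1, -0.3, 0.4, 0.0, -0.2]
member = [1, 1, 1, 0, 0, 0, 0, 0]
beta = set_effect(z, member)
```

Because the test is expressed as a regression, the indicator could in principle be replaced by a continuous gene property or extended with further predictors, which is the kind of generalization the abstract refers to.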
Xiong, Dong-Hai; Shen, Hui; Zhao, Lan-Juan; Xiao, Peng; Yang, Tie-Lin; Guo, Yan; Wang, Wei; Guo, Yan-Fang; Liu, Yong-Jun; Recker, Robert R; Deng, Hong-Wen
2007-01-01
Many “novel” osteoporosis candidate genes have been proposed in recent years. To advance our knowledge of their roles in osteoporosis, we screened 20 such genes using a set of high-density SNPs in a large family-based study. Our efforts led to the prioritization of those osteoporosis genes and the detection of gene–gene interactions. Introduction We performed large-scale family-based association analyses of 20 novel osteoporosis candidate genes using 277 single nucleotide polymorphisms (SNPs) for the quantitative trait BMD variation and the qualitative trait osteoporosis (OP) at three clinically important skeletal sites: spine, hip, and ultradistal radius (UD). Materials and Methods One thousand eight hundred seventy-three subjects from 405 white nuclear families were genotyped and analyzed with an average density of one SNP per 4 kb across the 20 genes. We conducted association analyses by SNP- and haplotype-based family-based association test (FBAT) and performed gene–gene interaction analyses using multianalytic approaches such as multifactor-dimensionality reduction (MDR) and conditional logistic regression. Results and Conclusions We detected four genes (DBP, LRP5, CYP17, and RANK) that showed highly suggestive associations (10,000-permutation derived empirical global p ≤ 0.01) with spine BMD/OP; four genes (CYP19, RANK, RANKL, and CYP17) highly suggestive for hip BMD/OP; and four genes (CYP19, BMP2, RANK, and TNFR2) highly suggestive for UD BMD/OP. The associations between BMP2 with UD BMD and those between RANK with OP at the spine, hip, and UD also met the experiment-wide stringent criterion (empirical global p ≤ 0.0007). Sex-stratified analyses further showed that some of the significant associations in the total sample were driven by either male or female subjects. In addition, we identified and validated a two-locus gene–gene interaction model involving GCR and ESR2, for which prior biological evidence exists. 
Our results suggested the prioritization of osteoporosis candidate genes from among the many proposed in recent years and revealed the significant gene–gene interaction effects influencing osteoporosis risk. PMID:17002564
Reduction of Cr(VI) and survival in Cr-contaminated sites by Caulobacter crescentus
NASA Astrophysics Data System (ADS)
Hu, P.; Chakraborty, R.; Brodie, E. L.; Andersen, G. L.; Hazen, T. C.
2008-12-01
Caulobacter spp. are known to be able to live in low-nutrient environments, a characteristic of most heavy metal-contaminated sites. Recent studies have shown that Caulobacter crescentus can grow in chemically defined medium containing up to 1 mM uranium. Whole-genome transcriptional analysis and electron microscopic imaging of heavy metal stresses in Caulobacter crescentus also provided insight and evidence that the bacterium used an array of defensive mechanisms to deal with heavy metal stresses. In addition to up-regulated enzymes protecting against oxidative stress, DNA repair and down-regulated potential chromium transport, one of the major gene groups responding to chromium stress is "electron transport process and cytochrome oxidases", including cytochrome c oxidases, raising the possibility that the cells can employ the cytochromes to reduce chromium. Analysis of the microbial community at the chromium-contaminated DOE site at Hanford, WA revealed the presence of Caulobacter spp. As an oligotroph, Caulobacter can play a significant role in chromium reduction in environments where nutrients are limited. This result was confirmed by both 16S rDNA based microarray (Phylochip) and MDA-based clone library data. Based on these results we further investigated the capability of this organism to reduce Cr(VI) using the well known model strain Caulobacter crescentus CB15N. Preliminary cell suspension experiments were set up with glucose as the electron donor and Cr(VI) as the electron acceptor in phosphate based M2 salts buffer. After 22 hours almost 27% of Cr(VI) was reduced in the incubations containing active cells relative to the controls containing heat killed cells. Also, in another set of controls with no electron acceptor added, cells showed no increase in cell density during that time, demonstrating that the reduction of Cr(VI) by cells of Caulobacter was due to biological activity.
Future experiments will investigate the components responsible and the mechanism of Cr(VI) reduction by Caulobacter crescentus.
Alteration of gene expression by alcohol exposure at early neurulation.
Zhou, Feng C; Zhao, Qianqian; Liu, Yunlong; Goodlett, Charles R; Liang, Tiebing; McClintick, Jeanette N; Edenberg, Howard J; Li, Lang
2011-02-21
We have previously demonstrated that alcohol exposure at early neurulation induces growth retardation, neural tube abnormalities, and alteration of DNA methylation. To explore the global gene expression changes which may underlie these developmental defects, microarray analyses were performed in a whole embryo mouse culture model that allows control over alcohol and embryonic variables. Alcohol caused teratogenesis in brain, heart, forelimb, and optic vesicle; a subset of the embryos also showed cranial neural tube defects. In microarray analysis (accession number GSM9545), adopting hypothesis-driven Gene Set Enrichment Analysis (GSEA) informatics and intersection analysis of two independent experiments, we found that there was a collective reduction in expression of neural specification genes (neurogenin, Sox5, Bhlhe22), neural growth factor genes [Igf1, Efemp1, Klf10 (Tieg), and Edil3], and alteration of genes involved in cell growth, apoptosis, histone variants, eye and heart development. There was also a reduction of retinol binding protein 1 (Rbp1), and de novo expression of aldehyde dehydrogenase 1B1 (Aldh1B1). Remarkably, four key hematopoiesis genes (glycophorin A, adducin 2, beta-2 microglobulin, and ceruloplasmin) were absent after alcohol treatment, and histone variant genes were reduced. The down-regulation of the neurospecification and the neurotrophic genes was further confirmed by quantitative RT-PCR. Furthermore, the gene expression profile demonstrated distinct subgroups which corresponded with two distinct alcohol-related neural tube phenotypes: an open (ALC-NTO) and a closed neural tube (ALC-NTC). Further, the epidermal growth factor signaling pathway and histone variants were specifically altered in ALC-NTO, and a greater number of neurotrophic/growth factor genes were down-regulated in the ALC-NTO than in the ALC-NTC embryos.
This study revealed a set of genes vulnerable to alcohol exposure and genes that were associated with neural tube defects during early neurulation.
Versican is a potential therapeutic target in docetaxel-resistant prostate cancer
Arichi, Naoko; Mitsui, Yozo; Hiraki, Miho; Nakamura, Sigenobu; Hiraoka, Takeo; Sumura, Masahiro; Hirata, Hiroshi; Tanaka, Yuichiro; Dahiya, Rajvir; Yasumoto, Hiroaki; Shiina, Hiroaki
2015-01-01
In the current study, we investigated a combination of docetaxel and thalidomide (DT therapy) in castration-resistant prostate cancer (CRPC) patients. We identified marker genes that predict the effect of DT therapy. Using an androgen-insensitive PC3 cell line, we established a docetaxel-resistant PC-3 cell line (DR-PC3). In DR-PC3 cells, DT therapy inhibited proliferation/viability more strongly than docetaxel alone. Based on gene ontology analysis, we identified versican as a selective gene. This result was consistent with the findings of cDNA microarray analysis and was validated by quantitative RT-PCR. In addition, the effect of DT therapy on cell viability was the same as the effect of docetaxel plus versican siRNA. In other words, silencing of versican can substitute for thalidomide. In the clinical setting, versican expression in prostate biopsy samples (before DT therapy) correlated with PSA reduction after DT therapy (p<0.05). Thus targeting versican is a potential therapeutic strategy in docetaxel-resistant prostate cancer. PMID:25859560
Jani, Saurin D; Argraves, Gary L; Barth, Jeremy L; Argraves, W Scott
2010-04-01
An important objective of DNA microarray-based gene expression experimentation is determining inter-relationships that exist between differentially expressed genes and biological processes, molecular functions, cellular components, signaling pathways, physiologic processes and diseases. Here we describe GeneMesh, a web-based program that facilitates analysis of DNA microarray gene expression data. GeneMesh relates genes in a query set to categories available in the Medical Subject Headings (MeSH) hierarchical index. The interface enables hypothesis driven relational analysis to a specific MeSH subcategory (e.g., Cardiovascular System, Genetic Processes, Immune System Diseases etc.) or unbiased relational analysis to broader MeSH categories (e.g., Anatomy, Biological Sciences, Disease etc.). Genes found associated with a given MeSH category are dynamically linked to facilitate tabular and graphical depiction of Entrez Gene information, Gene Ontology information, KEGG metabolic pathway diagrams and intermolecular interaction information. Expression intensity values of groups of genes that cluster in relation to a given MeSH category, gene ontology or pathway can be displayed as heat maps of Z score-normalized values. GeneMesh operates on gene expression data derived from a number of commercial microarray platforms including Affymetrix, Agilent and Illumina. GeneMesh is a versatile web-based tool for testing and developing new hypotheses through relating genes in a query set (e.g., differentially expressed genes from a DNA microarray experiment) to descriptors making up the hierarchical structure of the National Library of Medicine controlled vocabulary thesaurus, MeSH. 
The system further enhances the discovery process by providing links between sets of genes associated with a given MeSH category to a rich set of html linked tabular and graphic information including Entrez Gene summaries, gene ontologies, intermolecular interactions, overlays of genes onto KEGG pathway diagrams and heatmaps of expression intensity values. GeneMesh is freely available online at http://proteogenomics.musc.edu/genemesh/.
Ienasescu, Hans; Li, Kang; Andersson, Robin; Vitezic, Morana; Rennie, Sarah; Chen, Yun; Vitting-Seerup, Kristoffer; Lagoni, Emil; Boyd, Mette; Bornholdt, Jette; de Hoon, Michiel J. L.; Kawaji, Hideya; Lassmann, Timo; Hayashizaki, Yoshihide; Forrest, Alistair R. R.; Carninci, Piero; Sandelin, Albin
2016-01-01
Genomics consortia have produced large datasets profiling the expression of genes, microRNAs, enhancers and more across human tissues or cells. There is a need for intuitive tools to select the subsets of such data that are most relevant for specific studies. To this end, we present SlideBase, a web tool which offers a new way of selecting genes, promoters, enhancers and microRNAs that are preferentially expressed/used in a specified set of cells/tissues, based on the use of interactive sliders. With the help of sliders, SlideBase enables users to define custom expression thresholds for individual cell types/tissues, producing sets of genes, enhancers etc. which satisfy these constraints. Changes in slider settings result in simultaneous changes in the selected sets, updated in real time. SlideBase is linked to major databases from genomics consortia, including FANTOM, GTEx, The Human Protein Atlas and BioGPS. Database URL: http://slidebase.binf.ku.dk PMID:28025337
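The slider-based selection can be thought of as a per-tissue range filter. The sketch below assumes each slider defines a minimum and maximum allowed expression for one tissue and that a feature is kept only if it satisfies every range. The data layout, function name and values are hypothetical and are not taken from the SlideBase implementation.

```python
def select_features(expression, thresholds):
    """Return features whose expression satisfies every tissue constraint.

    expression: feature -> {tissue: expression value}
    thresholds: tissue -> (minimum, maximum) allowed expression
    """
    selected = []
    for feature, by_tissue in expression.items():
        if all(lo <= by_tissue[tissue] <= hi
               for tissue, (lo, hi) in thresholds.items()):
            selected.append(feature)
    return selected

# Hypothetical expression values for three features in two tissues.
expression = {
    "gene_a": {"liver": 9.0, "brain": 0.5},
    "gene_b": {"liver": 8.0, "brain": 6.0},
    "gene_c": {"liver": 1.0, "brain": 0.2},
}
# High in liver, low in brain.
liver_specific = select_features(
    expression, {"liver": (5.0, float("inf")), "brain": (0.0, 1.0)}
)
# Relaxing the brain slider widens the selection.
liver_high = select_features(
    expression, {"liver": (5.0, float("inf")), "brain": (0.0, 10.0)}
)
```

Re-running the filter whenever a range changes mirrors the real-time updating described in the abstract.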
Lan, Hui; Carson, Rachel; Provart, Nicholas J; Bonner, Anthony J
2007-09-21
Arabidopsis thaliana is the model species of current plant genomic research with a genome size of 125 Mb and approximately 28,000 genes. The function of half of these genes is currently unknown. The purpose of this study is to infer gene function in Arabidopsis using machine-learning algorithms applied to large-scale gene expression data sets, with the goal of identifying genes that are potentially involved in plant response to abiotic stress. Using in house and publicly available data, we assembled a large set of gene expression measurements for A. thaliana. Using those genes of known function, we first evaluated and compared the ability of basic machine-learning algorithms to predict which genes respond to stress. Predictive accuracy was measured using ROC50 and precision curves derived through cross validation. To improve accuracy, we developed a method for combining these classifiers using a weighted-voting scheme. The combined classifier was then trained on genes of known function and applied to genes of unknown function, identifying genes that potentially respond to stress. Visual evidence corroborating the predictions was obtained using electronic Northern analysis. Three of the predicted genes were chosen for biological validation. Gene knockout experiments confirmed that all three are involved in a variety of stress responses. The biological analysis of one of these genes (At1g16850) is presented here, where it is shown to be necessary for the normal response to temperature and NaCl. Supervised learning methods applied to large-scale gene expression measurements can be used to predict gene function. However, the ability of basic learning methods to predict stress response varies widely and depends heavily on how much dimensionality reduction is used. 
Our method of combining classifiers can improve the accuracy of such predictions - in this case, predictions of genes involved in stress response in plants - and it effectively chooses the appropriate amount of dimensionality reduction automatically. The method provides a useful means of identifying genes in A. thaliana that potentially respond to stress, and we expect it would be useful in other organisms and for other gene functions.
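The weighted-voting combination mentioned in this abstract can be illustrated with a small sketch. The abstract does not give the weighting rule, so the version below assumes each classifier outputs a score in [0, 1] and is weighted by a hypothetical cross-validated accuracy; the names `weighted_vote`, `weights` and `gene_scores` are illustrative only:

```python
def weighted_vote(scores, weights):
    """Combine per-classifier scores for one gene into a single score.

    scores  : classifier outputs in [0, 1] (higher = more likely stress-responsive)
    weights : non-negative weights, assumed here to reflect each
              classifier's cross-validated accuracy
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * s for w, s in zip(weights, scores)) / total

# Three hypothetical classifiers scoring two hypothetical genes.
weights = [0.9, 0.6, 0.5]
gene_scores = {
    "gene1": [0.8, 0.4, 0.9],
    "gene2": [0.1, 0.7, 0.2],
}
combined = {g: weighted_vote(s, weights) for g, s in gene_scores.items()}
# Genes whose combined score reaches an (assumed) 0.5 cut-off.
predicted_stress = [g for g, c in combined.items() if c >= 0.5]
```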
MAVTgsa: An R Package for Gene Set (Enrichment) Analysis
Chien, Chih-Yi; Chang, Ching-Wei; Tsai, Chen-An; ...
2014-01-01
Gene set analysis methods aim to determine whether an a priori defined set of genes shows a statistically significant difference in expression on either categorical or continuous outcomes. Although many methods for gene set analysis have been proposed, a systematic analysis tool for identification of different types of gene set significance modules has not been developed previously. This work presents an R package, called MAVTgsa, which includes three different methods for integrated gene set enrichment analysis. (1) The one-sided OLS (ordinary least squares) test detects coordinated changes of genes in a gene set in one direction, either up- or downregulation. (2) The two-sided MANOVA (multivariate analysis of variance) detects changes in both directions, up- and downregulation, for studying two or more experimental conditions. (3) A random forests-based procedure identifies gene sets that can accurately predict samples from different experimental conditions or are associated with continuous phenotypes. MAVTgsa computes the P values and FDR (false discovery rate) q-values for all gene sets in the study. Furthermore, MAVTgsa provides several visualization outputs to support and interpret the enrichment results. This package is available online.
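To make the FDR q-value step concrete, here is a minimal sketch of one common way to turn per-gene-set P values into q-values. It assumes the Benjamini-Hochberg procedure, which the abstract does not name and which may differ from what MAVTgsa actually implements; `bh_qvalues` is a hypothetical helper, not a function of the package:

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues

# Hypothetical P values for four gene sets.
pvals = [0.01, 0.04, 0.03, 0.20]
qvals = bh_qvalues(pvals)
```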
Windhorst, Dafna A; Mileva-Seitz, Viara R; Rippe, Ralph C A; Tiemeier, Henning; Jaddoe, Vincent W V; Verhulst, Frank C; van IJzendoorn, Marinus H; Bakermans-Kranenburg, Marian J
2016-08-01
In a longitudinal cohort study, we investigated the interplay of harsh parenting and genetic variation across a set of functionally related dopamine genes, in association with children's externalizing behavior. This is one of the first studies to employ gene-based and gene-set approaches in tests of Gene by Environment (G × E) effects on complex behavior. This approach can offer an important alternative or complement to candidate gene and genome-wide environmental interaction (GWEI) studies in the search for genetic variation underlying individual differences in behavior. Genetic variants in 12 autosomal dopaminergic genes were available in an ethnically homogenous part of a population-based cohort. Harsh parenting was assessed with maternal (n = 1881) and paternal (n = 1710) reports at age 3. Externalizing behavior was assessed with the Child Behavior Checklist (CBCL) at age 5 (71 ± 3.7 months). We conducted gene-set analyses of the association between variation in dopaminergic genes and externalizing behavior, stratified for harsh parenting. The association was statistically significant or approached significance for children without harsh parenting experiences, but was absent in the group with harsh parenting. Similarly, significant associations between single genes and externalizing behavior were only found in the group without harsh parenting. Effect sizes in the groups with and without harsh parenting did not differ significantly. Gene-environment interaction tests were conducted for individual genetic variants, resulting in two significant interaction effects (rs1497023 and rs4922132) after correction for multiple testing. Our findings are suggestive of G × E interplay, with associations between dopamine genes and externalizing behavior present in children without harsh parenting, but not in children with harsh parenting experiences. Harsh parenting may overrule the role of genetic factors in externalizing behavior. 
Gene-based and gene-set analyses offer promising new alternatives to analyses focusing on single candidate polymorphisms when examining the interplay between genetic and environmental factors.
Conley, P B; Lemaux, P G; Lomax, T L; Grossman, A R
1986-01-01
The polypeptide composition of the phycobilisome, the major light-harvesting complex of prokaryotic cyanobacteria and certain eukaryotic algae, can be modulated by different light qualities in cyanobacteria exhibiting chromatic adaptation. We have identified genomic fragments encoding a cluster of phycobilisome polypeptides (phycobiliproteins) from the chromatically adapting cyanobacterium Fremyella diplosiphon using previously characterized DNA fragments of phycobiliprotein genes from the eukaryotic alga Cyanophora paradoxa and from F. diplosiphon. Characterization of two lambda-EMBL3 clones containing overlapping genomic fragments indicates that three sets of phycobiliprotein genes--the alpha- and beta-allophycocyanin genes plus two sets of alpha- and beta-phycocyanin genes--are clustered within 13 kilobases on the cyanobacterial genome and transcribed off the same strand. The gene order (alpha-allophycocyanin followed by beta-allophycocyanin and beta-phycocyanin followed by alpha-phycocyanin) appears to be a conserved arrangement found previously in a eukaryotic alga and another cyanobacterium. We have reported that one set of phycocyanin genes is transcribed as two abundant red light-induced mRNAs (1600 and 3800 bases). We now present data showing that the allophycocyanin genes and a second set of phycocyanin genes are transcribed into major mRNAs of 1400 and 1600 bases, respectively. These transcripts are present in RNA isolated from cultures grown in red and green light, although lower levels of the 1600-base phycocyanin transcript are present in cells grown in green light. Furthermore, a larger transcript of 1750 bases hybridizes to the allophycocyanin genes and may be a precursor to the 1400-base species. PMID:3086870
ISAAC - InterSpecies Analysing Application using Containers.
Baier, Herbert; Schultz, Jörg
2014-01-15
Information about genes, transcripts and proteins is spread over a wide variety of databases. Different tools have been developed using these databases to identify biological signals in gene lists from large-scale analysis. Mostly, they search for enrichments of specific features. However, these tools do not allow an explorative walk through different views, nor do they let users change the gene lists as new stories emerge. To fill this niche, we have developed ISAAC, the InterSpecies Analysing Application using Containers. The central idea of this web-based tool is to enable the analysis of sets of genes, transcripts and proteins under different biological viewpoints and to interactively modify these sets at any point of the analysis. Detailed history and snapshot information allows tracing each action. Furthermore, one can easily switch back to previous states and perform new analyses. Currently, sets can be viewed in the context of genomes, protein functions, protein interactions, pathways, regulation, diseases and drugs. Additionally, users can switch between species with an automatic, orthology-based translation of existing gene sets. As today's research is usually performed in larger teams and consortia, ISAAC provides group-based functionalities. Here, sets as well as results of analyses can be exchanged between members of groups. ISAAC fills the gap between primary databases and tools for the analysis of large gene lists. With its highly modular, JavaEE-based design, the implementation of new modules is straightforward. Furthermore, ISAAC comes with an extensive web-based administration interface including tools for the integration of third-party data. Thus, a local installation is easily feasible. In summary, ISAAC is tailor-made for highly explorative interactive analyses of gene, transcript and protein sets in a collaborative environment.
Genes and Gut Bacteria Involved in Luminal Butyrate Reduction Caused by Diet and Loperamide.
Hwang, Nakwon; Eom, Taekil; Gupta, Sachin K; Jeong, Seong-Yeop; Jeong, Do-Youn; Kim, Yong Sung; Lee, Ji-Hoon; Sadowsky, Michael J; Unno, Tatsuya
2017-11-28
Unbalanced dietary habits and gut dysmotility are causative factors in metabolic and functional gut disorders, including obesity, diabetes, and constipation. Reduction in luminal butyrate synthesis is known to be associated with gut dysbioses, and studies have suggested that restoring butyrate formation in the colon may improve gut health. In contrast, shifts in different types of gut microbiota may inhibit luminal butyrate synthesis, requiring different treatments to restore colonic bacterial butyrate synthesis. We investigated the influence of high-fat diets (HFD) and low-fiber diets (LFD), and loperamide (LPM) administration, on key bacteria and genes involved in reduction of butyrate synthesis in mice. MiSeq-based microbiota analysis and HiSeq-based differential gene analysis indicated that different types of bacteria and genes were involved in butyrate metabolism in each treatment. Dietary modulation depleted butyrate kinase and phosphate butyryl transferase by decreasing members of the Bacteroidales and Parabacteroides. The HFD also depleted genes involved in succinate synthesis by decreasing Lactobacillus. The LFD and LPM treatments depleted genes involved in crotonoyl-CoA synthesis by decreasing Roseburia and Oscillibacter. Taken together, our results suggest that different types of bacteria and genes were involved in gut dysbiosis, and that selected treatments may be needed depending on the cause of gut dysfunction.
Nejat, Naghmeh; Cahill, David M; Vadamalai, Ganesan; Ziemann, Mark; Rookes, James; Naderali, Neda
2015-10-01
Invasive phytoplasmas wreak havoc on coconut palms worldwide, leading to high loss of income, food insecurity and extreme poverty of farmers in producing countries. Phytoplasmas as strictly biotrophic insect-transmitted bacterial pathogens instigate distinct changes in developmental processes and defence responses of the infected plants and manipulate plants to their own advantage; however, little is known about the cellular and molecular mechanisms underlying host-phytoplasma interactions. Further, phytoplasma-mediated transcriptional alterations in coconut palm genes have not yet been identified. This study evaluated the whole transcriptome profiles of naturally infected leaves of Cocos nucifera ecotype Malayan Red Dwarf in response to yellow decline phytoplasma from group 16SrXIV, using the RNA-Seq technique. The transcriptomics-based analysis reported here identified genes involved in coconut innate immunity. The number of down-regulated genes in response to phytoplasma infection exceeded the number of genes up-regulated. Of the 39,873 differentially expressed unigenes, 21,860 unigenes were suppressed and 18,013 were induced following infection. Comparative analysis revealed that genes associated with defence signalling against biotic stimuli were significantly overexpressed in phytoplasma-infected leaves versus healthy coconut leaves. Genes involving cell rescue and defence, cellular transport, oxidative stress, hormone stimulus and metabolism, photosynthesis reduction, transcription and biosynthesis of secondary metabolites were differentially represented. Our transcriptome analysis unveiled a core set of genes associated with defence of coconut in response to phytoplasma attack, although several novel defence response candidate genes with unknown function have also been identified. This study constitutes a valuable sequence resource for uncovering the resistance genes and/or susceptibility genes which can be used as genetic tools in disease resistance breeding.
Partitioning of functional gene expression data using principal points.
Kim, Jaehee; Kim, Haseong
2017-10-12
DNA microarrays offer motivation and hope for the simultaneous study of variations in multiple genes. Gene expression is a temporal process that allows variations in expression levels with a characterized gene function over a period of time. Temporal gene expression curves can be treated as functional data since they are considered as independent realizations of a stochastic process. This process requires appropriate models to identify patterns of gene functions. The partitioning of the functional data can find homogeneous subgroups of entities for the massive genes within the inherent biological networks. Therefore it can be a useful technique for the analysis of time-course gene expression data. We propose a new self-consistent partitioning method of functional coefficients for individual expression profiles based on the orthonormal basis system. A principal points based functional partitioning method is proposed for time-course gene expression data. The method explores the relationship between genes using Legendre coefficients as principal points to extract the features of gene functions. Our proposed method provides high connectivity in connectedness after clustering for simulated data and finds significant subsets of genes with increased connectivity. Our approach has the comparative advantages that fewer coefficients are used from the functional data and that the principal points used for partitioning are self-consistent. As real data applications, we are able to find partitioned genes through the gene expressions found in budding yeast data and Escherichia coli data. The proposed method benefits from the use of principal points, dimension reduction, and the choice of an orthogonal basis system, and provides appropriately connected genes in the resulting subsets. We illustrate our method by applying it to each set of cell-cycle-regulated time-course yeast genes and E. coli genes. 
The proposed method is able to identify highly connected genes and to explore the complex dynamics of biological systems in functional genomics.
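As a rough illustration of representing an expression curve by Legendre coefficients and assigning it to the nearest representative point, the sketch below projects a curve on [-1, 1] onto the first three Legendre polynomials and picks the closest of a set of given centers. It is a simplified stand-in, not the authors' self-consistent principal-points algorithm; the three-term basis, the trapezoid-rule projection, the centers and all function names are assumptions:

```python
def legendre_basis(t):
    """First three Legendre polynomials P0, P1, P2 evaluated at t in [-1, 1]."""
    return [1.0, t, 0.5 * (3.0 * t * t - 1.0)]

def legendre_coefficients(curve, n_grid=2001):
    """Project a function on [-1, 1] onto P0..P2 with the trapezoid rule.

    c_k = (2k + 1) / 2 * integral of curve(t) * P_k(t) dt
    """
    h = 2.0 / (n_grid - 1)
    coeffs = [0.0, 0.0, 0.0]
    for j in range(n_grid):
        t = -1.0 + j * h
        weight = h / 2.0 if j in (0, n_grid - 1) else h
        value = curve(t)
        for k, p in enumerate(legendre_basis(t)):
            coeffs[k] += (2 * k + 1) / 2.0 * weight * value * p
    return coeffs

def nearest_center(coeffs, centers):
    """Index of the closest center in coefficient space (squared Euclidean)."""
    dists = [sum((c - m) ** 2 for c, m in zip(coeffs, center)) for center in centers]
    return dists.index(min(dists))

# A hypothetical expression profile over rescaled time t in [-1, 1].
def profile(t):
    return 2.0 + 3.0 * t

coeffs = legendre_coefficients(profile)
centers = [[0.0, 0.0, 0.0], [2.0, 3.0, 0.0]]  # illustrative centers only
label = nearest_center(coeffs, centers)
```

Working in coefficient space means each gene is described by a handful of numbers rather than its full time course, which is the dimension-reduction benefit the abstract refers to.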
de Jong, Simone; Vidler, Lewis R; Mokrab, Younes; Collier, David A; Breen, Gerome
2016-08-01
Genome-wide association studies (GWAS) have identified thousands of novel genetic associations for complex genetic disorders, leading to the identification of potential pharmacological targets for novel drug development. In schizophrenia, 108 conservatively defined loci that meet genome-wide significance have been identified and hundreds of additional sub-threshold associations harbour information on the genetic aetiology of the disorder. In the present study, we used gene-set analysis based on the known binding targets of chemical compounds to identify the 'drug pathways' most strongly associated with schizophrenia-associated genes, with the aim of identifying potential drug repositioning opportunities and clues for novel treatment paradigms, especially in multi-target drug development. We compiled 9389 gene sets (2496 with unique gene content) and interrogated gene-based p-values from the PGC2-SCZ analysis. Although no single drug exceeded experiment-wide significance (corrected p<0.05), highly ranked gene sets reaching suggestive significance included the dopamine receptor antagonists metoclopramide and trifluoperazine and the tyrosine kinase inhibitor neratinib. This is a proof-of-principle analysis showing the potential utility of GWAS data of schizophrenia for the direct identification of candidate drugs and molecules that show polypharmacy. © The Author(s) 2016.
Pathway Distiller - multisource biological pathway consolidation
2012-01-01
Background: One method to understand and evaluate an experiment that produces a large set of genes, such as a gene expression microarray analysis, is to identify overrepresentation or enrichment for biological pathways. Because pathways are able to functionally describe the set of genes, much effort has been made to collect curated biological pathways into publicly accessible databases. When combining disparate databases, highly related or redundant pathways exist, making their consolidation into pathway concepts essential. This will facilitate unbiased, comprehensive yet streamlined analysis of experiments that result in large gene sets. Methods: After gene set enrichment finds representative pathways for large gene sets, pathways are consolidated into representative pathway concepts. Three complementary, but different methods of pathway consolidation are explored. Enrichment Consolidation combines the set of the pathways enriched for the signature gene list through iterative combining of enriched pathways with other pathways with similar signature gene sets; Weighted Consolidation utilizes a Protein-Protein Interaction network based gene-weighting approach that finds clusters of both enriched and non-enriched pathways limited to the experiments' resultant gene list; and finally the de novo Consolidation method uses several measurements of pathway similarity, that finds static pathway clusters independent of any given experiment. Results: We demonstrate that the three consolidation methods provide unified yet different functional insights of a resultant gene set derived from a genome-wide profiling experiment. Results from the methods are presented, demonstrating their applications in biological studies and comparing with a pathway web-based framework that also combines several pathway databases. 
Additionally a web-based consolidation framework that encompasses all three methods discussed in this paper, Pathway Distiller (http://cbbiweb.uthscsa.edu/PathwayDistiller), is established to allow researchers access to the methods and example microarray data described in this manuscript, and the ability to analyze their own gene list by using our unique consolidation methods. Conclusions: By combining several pathway systems, implementing different, but complementary pathway consolidation methods, and providing a user-friendly web-accessible tool, we have enabled users the ability to extract functional explanations of their genome wide experiments. PMID:23134636
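The idea of consolidating redundant pathways from different databases into pathway concepts can be sketched with a simple similarity-based merge. The abstract only says that several similarity measurements are used, so the Jaccard index, the 0.5 threshold, the greedy merge order and the example pathways below are all assumptions for illustration, not the Pathway Distiller implementation:

```python
def jaccard(a, b):
    """Jaccard similarity between two gene sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def consolidate(pathways, threshold=0.5):
    """Greedily merge pathways whose gene sets overlap above a threshold.

    pathways: dict name -> set of genes. Returns a list of
    (member names, merged gene set) pairs, one per pathway concept.
    """
    concepts = []
    for name in sorted(pathways):
        genes = pathways[name]
        for members, merged in concepts:
            if jaccard(genes, merged) >= threshold:
                members.append(name)
                merged |= genes  # grow the concept's gene set in place
                break
        else:
            # No sufficiently similar concept yet: start a new one.
            concepts.append(([name], set(genes)))
    return concepts

# Hypothetical pathways from two hypothetical databases.
pathways = {
    "db1:apoptosis": {"TP53", "BAX", "CASP3", "BCL2"},
    "db2:programmed_cell_death": {"TP53", "BAX", "CASP3", "CASP9"},
    "db1:glycolysis": {"HK1", "PFKM", "PKM"},
}
concepts = consolidate(pathways)
```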
Pathway Distiller - multisource biological pathway consolidation.
Doderer, Mark S; Anguiano, Zachry; Suresh, Uthra; Dashnamoorthy, Ravi; Bishop, Alexander J R; Chen, Yidong
2012-01-01
One method to understand and evaluate an experiment that produces a large set of genes, such as a gene expression microarray analysis, is to identify overrepresentation or enrichment for biological pathways. Because pathways are able to functionally describe the set of genes, much effort has been made to collect curated biological pathways into publicly accessible databases. When combining disparate databases, highly related or redundant pathways exist, making their consolidation into pathway concepts essential. This will facilitate unbiased, comprehensive yet streamlined analysis of experiments that result in large gene sets. After gene set enrichment finds representative pathways for large gene sets, pathways are consolidated into representative pathway concepts. Three complementary, but different methods of pathway consolidation are explored. Enrichment Consolidation combines the set of the pathways enriched for the signature gene list through iterative combining of enriched pathways with other pathways with similar signature gene sets; Weighted Consolidation utilizes a Protein-Protein Interaction network based gene-weighting approach that finds clusters of both enriched and non-enriched pathways limited to the experiments' resultant gene list; and finally the de novo Consolidation method uses several measurements of pathway similarity, that finds static pathway clusters independent of any given experiment. We demonstrate that the three consolidation methods provide unified yet different functional insights of a resultant gene set derived from a genome-wide profiling experiment. Results from the methods are presented, demonstrating their applications in biological studies and comparing with a pathway web-based framework that also combines several pathway databases. 
Additionally a web-based consolidation framework that encompasses all three methods discussed in this paper, Pathway Distiller (http://cbbiweb.uthscsa.edu/PathwayDistiller), is established to allow researchers access to the methods and example microarray data described in this manuscript, and the ability to analyze their own gene list by using our unique consolidation methods. By combining several pathway systems, implementing different, but complementary pathway consolidation methods, and providing a user-friendly web-accessible tool, we have enabled users the ability to extract functional explanations of their genome wide experiments.
The GENCODE exome: sequencing the complete human exome
Coffey, Alison J; Kokocinski, Felix; Calafato, Maria S; Scott, Carol E; Palta, Priit; Drury, Eleanor; Joyce, Christopher J; LeProust, Emily M; Harrow, Jen; Hunt, Sarah; Lehesjoki, Anna-Elina; Turner, Daniel J; Hubbard, Tim J; Palotie, Aarno
2011-01-01
Sequencing the coding regions, the exome, of the human genome is one of the major current strategies to identify low frequency and rare variants associated with human disease traits. So far, the most widely used commercial exome capture reagents have mainly targeted the consensus coding sequence (CCDS) database. We report the design of an extended set of targets for capturing the complete human exome, based on annotation from the GENCODE consortium. The extended set covers an additional 5594 genes and 10.3 Mb compared with the current CCDS-based sets. The additional regions include potential disease genes previously inaccessible to exome resequencing studies, such as 43 genes linked to ion channel activity and 70 genes linked to protein kinase activity. In total, the new GENCODE exome set developed here covers 47.9 Mb and performed well in sequence capture experiments. In the sample set used in this study, we identified over 5000 SNP variants more in the GENCODE exome target (24%) than in the CCDS-based exome sequencing. PMID:21364695
Fédrigo, Olivier; Haygood, Ralph; Mukherjee, Sayan; Wray, Gregory A.
2009-01-01
Variation in gene expression is an important contributor to phenotypic diversity within and between species. Although this variation often has a genetic component, identification of the genetic variants driving this relationship remains challenging. In particular, measurements of gene expression usually do not reveal whether the genetic basis for any observed variation lies in cis or in trans to the gene, a distinction that has direct relevance to the physical location of the underlying genetic variant, and which may also impact its evolutionary trajectory. Allelic imbalance measurements identify cis-acting genetic effects by assaying the relative contribution of the two alleles of a cis-regulatory region to gene expression within individuals. Identification of patterns that predict commonly imbalanced genes could therefore serve as a useful tool and also shed light on the evolution of cis-regulatory variation itself. Here, we show that sequence motifs, polymorphism levels, and divergence levels around a gene can be used to predict commonly imbalanced genes in a human data set. Reduction of this feature set to four factors revealed that only one factor significantly differentiated between commonly imbalanced and nonimbalanced genes. We demonstrate that these results are consistent between the original data set and a second published data set in humans obtained using different technical and statistical methods. Finally, we show that variation in the single allelic imbalance-associated factor is partially explained by the density of genes in the region of a target gene (allelic imbalance is less probable for genes in gene-dense regions), and, to a lesser extent, the evenness of expression of the gene across tissues and the magnitude of negative selection on putative regulatory regions of the gene. 
These results suggest that the genomic distribution of functional cis-regulatory variants in the human genome is nonrandom, perhaps due to local differences in evolutionary constraint. PMID:19506001
snpGeneSets: An R Package for Genome-Wide Study Annotation
Mei, Hao; Li, Lianna; Jiang, Fan; Simino, Jeannette; Griswold, Michael; Mosley, Thomas; Liu, Shijian
2016-01-01
Genome-wide studies (GWS) of SNP associations and differential gene expressions have generated abundant results; next-generation sequencing technology has further boosted the number of variants and genes identified. Effective interpretation requires massive annotation and downstream analysis of these genome-wide results, a computationally challenging task. We developed the snpGeneSets package to simplify annotation and analysis of GWS results. Our package integrates local copies of knowledge bases for SNPs, genes, and gene sets, and implements wrapper functions in the R language to enable transparent access to low-level databases for efficient annotation of large genomic data. The package contains functions that execute three types of annotations: (1) genomic mapping annotation for SNPs and genes and functional annotation for gene sets; (2) bidirectional mapping between SNPs and genes, and genes and gene sets; and (3) calculation of gene effect measures from SNP associations and performance of gene set enrichment analyses to identify functional pathways. We applied snpGeneSets to type 2 diabetes (T2D) results from the NHGRI genome-wide association study (GWAS) catalog, a Finnish GWAS, and a genome-wide expression study (GWES). These studies demonstrate the usefulness of snpGeneSets for annotating and performing enrichment analysis of GWS results. The package is open-source, free, and can be downloaded at: https://www.umc.edu/biostats_software/. PMID:27807048
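A gene set enrichment step of the kind mentioned above is often carried out with an over-representation test. The sketch below assumes an upper-tail hypergeometric test, which the abstract does not specify and which may not be what snpGeneSets uses; `enrichment_pvalue` and the counts are hypothetical:

```python
from math import comb

def enrichment_pvalue(n_universe, n_set, n_hits, n_overlap):
    """Upper-tail hypergeometric p-value: P(overlap >= n_overlap).

    n_universe : number of genes considered overall
    n_set      : size of the gene set (e.g. a pathway)
    n_hits     : number of genes flagged by the genome-wide study
    n_overlap  : flagged genes that fall inside the gene set
    """
    upper = min(n_set, n_hits)
    total = comb(n_universe, n_hits)
    tail = sum(comb(n_set, k) * comb(n_universe - n_set, n_hits - k)
               for k in range(n_overlap, upper + 1))
    return tail / total

# Toy numbers: 20 genes, a 5-gene set, 4 flagged genes, 3 of them in the set.
p = enrichment_pvalue(n_universe=20, n_set=5, n_hits=4, n_overlap=3)
```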
Escudero, Lorena V; Casamayor, Emilio O; Chong, Guillermo; Pedrós-Alió, Carles; Demergasso, Cecilia
2013-01-01
The presence of the arsenic oxidation, reduction, and extrusion genes arsC, arrA, aioA, and acr3 was explored in a range of natural environments in northern Chile, with arsenic concentrations spanning six orders of magnitude. A combination of primers from the literature and newly designed primers were used to explore the presence of the arsC gene, coding for the reduction of As (V) to As (III) in one of the most common detoxification mechanisms. Enterobacterial related arsC genes appeared only in the environments with the lowest As concentration, while Firmicutes-like genes were present throughout the range of As concentrations. The arrA gene, involved in anaerobic respiration using As (V) as electron acceptor, was found in all the systems studied. The As (III) oxidation gene aioA and the As (III) transport gene acr3 were tracked with two primer sets each and they were also found to be spread through the As concentration gradient. Sediment samples had a higher number of arsenic related genes than water samples. Considering the results of the bacterial community composition available for these samples, the higher microbial phylogenetic diversity of microbes inhabiting the sediments may explain the increased number of genetic resources found to cope with arsenic. Overall, the environmental distribution of arsenic related genes suggests that the occurrence of different ArsC families provides different degrees of protection against arsenic as previously described in laboratory strains, and that the glutaredoxin (Grx)-linked arsenate reductases related to Enterobacteria do not confer enough arsenic resistance to live above certain levels of As concentrations.
Soler-Bistué, Alfonso; Mondotte, Juan A.; Bland, Michael Jason; Val, Marie-Eve; Saleh, María-Carla; Mazel, Didier
2015-01-01
The effects on cell physiology of gene order within the bacterial chromosome are poorly understood. In silico approaches have shown that genes involved in transcription and translation processes, in particular ribosomal protein (RP) genes, localize near the replication origin (oriC) in fast-growing bacteria suggesting that such a positional bias is an evolutionarily conserved growth-optimization strategy. Such genomic localization could either provide a higher dosage of these genes during fast growth or facilitate the assembly of ribosomes and transcription foci by keeping physically close the many components of these macromolecular machines. To explore this, we used novel recombineering tools to create a set of Vibrio cholerae strains in which S10-spec-α (S10), a locus bearing half of the ribosomal protein genes, was systematically relocated to alternative genomic positions. We show that the relative distance of S10 to the origin of replication tightly correlated with a reduction of S10 dosage, mRNA abundance and growth rate within these otherwise isogenic strains. Furthermore, this was accompanied by a significant reduction in the host-invasion capacity in Drosophila melanogaster. Both phenotypes were rescued in strains bearing two S10 copies highly distal to oriC, demonstrating that replication-dependent gene dosage reduction is the main mechanism behind these alterations. Hence, S10 positioning connects genome structure to cell physiology in Vibrio cholerae. Our results show experimentally for the first time that genomic positioning of genes involved in the flux of genetic information conditions global growth control and hence bacterial physiology and potentially its evolution. PMID:25875621
2013-01-01
Background: Analysis of global gene expression by DNA microarrays is widely used in experimental molecular biology. However, the complexity of such high-dimensional data sets makes it difficult to fully understand the underlying biological features present in the data. The aim of this study is to introduce a method for DNA microarray analysis that provides an intuitive interpretation of data through dimension reduction and pattern recognition. We present the first “Archetypal Analysis” of global gene expression. The analysis is based on microarray data from five integrated studies of Pseudomonas aeruginosa isolated from the airways of cystic fibrosis patients. Results: Our analysis clustered samples into distinct groups with comprehensible characteristics since the archetypes representing the individual groups are closely related to samples present in the data set. Significant changes in gene expression between different groups identified adaptive changes of the bacteria residing in the cystic fibrosis lung. The analysis suggests a similar gene expression pattern between isolates with a high mutation rate (hypermutators) despite accumulation of different mutations for these isolates. This suggests positive selection in the cystic fibrosis lung environment, and changes in gene expression for these isolates are therefore most likely related to adaptation of the bacteria. Conclusions: Archetypal analysis succeeded in identifying adaptive changes of P. aeruginosa. The combination of clustering and matrix factorization made it possible to reveal minor similarities among different groups of data, which other analytical methods failed to identify. We suggest that this analysis could be used to supplement current methods used to analyze DNA microarray data. PMID:24059747
Fu, X; Sun, Y; Wang, J; Xing, Q; Zou, J; Li, R; Wang, Z; Wang, S; Hu, X; Zhang, L; Bao, Z
2014-01-01
Marine organisms are commonly exposed to variable environmental conditions, and many of them are under threat from increased sea temperatures caused by global climate change. Generating transcriptomic resources under different stress conditions is crucial for understanding molecular mechanisms underlying thermal adaptation. In this study, we conducted transcriptome-wide gene expression profiling of the scallop Chlamys farreri challenged by acute and chronic heat stress. Of the 13 953 unique tags, more than 850 were significantly differentially expressed at each time point after acute heat stress, which was more than the number of tags differentially expressed (320-350) under chronic heat stress. To obtain a systemic view of gene expression alterations during thermal stress, a weighted gene coexpression network was constructed. Six modules were identified as acute heat stress-responsive modules. Among them, four modules involved in apoptosis regulation, mRNA binding, mitochondrial envelope formation and oxidation reduction were downregulated. The remaining two modules were upregulated. One was enriched with chaperone and the other with microsatellite sequences, whose coexpression may originate from a transcription factor binding site. These results indicated that C. farreri triggered several cellular processes to acclimate to elevated temperature. No modules responded to chronic heat stress, suggesting that the scallops might have acclimated to elevated temperature within 3 days. This study represents the first sequencing-based gene network analysis in a nonmodel aquatic species and provides valuable gene resources for the study of thermal adaptation, which should assist in the development of heat-tolerant scallop lines for aquaculture. © 2013 John Wiley & Sons Ltd.
Approximate geodesic distances reveal biologically relevant structures in microarray data.
Nilsson, Jens; Fioretos, Thoas; Höglund, Mattias; Fontes, Magnus
2004-04-12
Genome-wide gene expression measurements, as currently determined by the microarray technology, can be represented mathematically as points in a high-dimensional gene expression space. Genes interact with each other in regulatory networks, restricting the cellular gene expression profiles to a certain manifold, or surface, in gene expression space. To obtain knowledge about this manifold, various dimensionality reduction methods and distance metrics are used. For data points distributed on curved manifolds, a sensible distance measure would be the geodesic distance along the manifold. In this work, we examine whether an approximate geodesic distance measure captures biological similarities better than the traditionally used Euclidean distance. We computed approximate geodesic distances, determined by the Isomap algorithm, for one set of lymphoma and one set of lung cancer microarray samples. Compared with the ordinary Euclidean distance metric, this distance measure produced more instructive, biologically relevant, visualizations when applying multidimensional scaling. This suggests the Isomap algorithm as a promising tool for the interpretation of microarray data. Furthermore, the results demonstrate the benefit and importance of taking nonlinearities in gene expression data into account.
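The approximate geodesic distance described in this abstract can be illustrated with a small sketch. It assumes the usual first stage of Isomap (connect each point to its nearest neighbours, then take shortest-path lengths through that graph); the toy points on a half circle, the neighbourhood size `k=2`, and all function names are illustrative assumptions, not the authors' data or code.

```python
import heapq
import math


def euclidean(a, b):
    """Ordinary straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_graph(points, k):
    """Symmetric k-nearest-neighbour graph weighted by Euclidean distance."""
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        nearest = sorted(
            (euclidean(points[i], points[j]), j) for j in range(n) if j != i
        )
        for d, j in nearest[:k]:
            graph[i][j] = d
            graph[j][i] = d
    return graph


def geodesic_distances(graph, source):
    """Dijkstra shortest paths: graph distances approximate geodesics."""
    dist = {node: math.inf for node in graph}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Toy "manifold": 21 points on a half circle of radius 1 (assumed data).
points = [(math.cos(math.pi * t / 20), math.sin(math.pi * t / 20)) for t in range(21)]
graph = knn_graph(points, k=2)
geo = geodesic_distances(graph, 0)
straight = euclidean(points[0], points[20])
print(f"Euclidean: {straight:.3f}, approximate geodesic: {geo[20]:.3f}")
```

On this curved toy data the graph distance between the two end points follows the arc and comes out close to pi, while the Euclidean distance cuts straight across. Full Isomap would then pass the matrix of graph distances to multidimensional scaling.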
Determination of the Core of a Minimal Bacterial Gene Set†
Gil, Rosario; Silva, Francisco J.; Peretó, Juli; Moya, Andrés
2004-01-01
The availability of a large number of complete genome sequences raises the question of how many genes are essential for cellular life. Trying to reconstruct the core of the protein-coding gene set for a hypothetical minimal bacterial cell, we have performed a computational comparative analysis of eight bacterial genomes. Six of the analyzed genomes are very small due to a dramatic genome size reduction process, while the other two, corresponding to free-living relatives, are larger. The available data from several systematic experimental approaches to define all the essential genes in some completely sequenced bacterial genomes were also considered, and a reconstruction of a minimal metabolic machinery necessary to sustain life was carried out. The proposed minimal genome contains 206 protein-coding genes with all the genetic information necessary for self-maintenance and reproduction in the presence of a full complement of essential nutrients and in the absence of environmental stress. The main features of such a minimal gene set, as well as the metabolic functions that must be present in the hypothetical minimal cell, are discussed. PMID:15353568
Bellucci, Elisa; Bitocchi, Elena; Ferrarini, Alberto; Benazzo, Andrea; Biagetti, Eleonora; Klie, Sebastian; Minio, Andrea; Rau, Domenico; Rodriguez, Monica; Panziera, Alex; Venturini, Luca; Attene, Giovanna; Albertini, Emidio; Jackson, Scott A.; Nanni, Laura; Fernie, Alisdair R.; Nikoloski, Zoran; Bertorelle, Giorgio; Delledonne, Massimo; Papa, Roberto
2014-01-01
Using RNA sequencing technology and de novo transcriptome assembly, we compared representative sets of wild and domesticated accessions of common bean (Phaseolus vulgaris) from Mesoamerica. RNA was extracted at the first true-leaf stage, and de novo assembly was used to develop a reference transcriptome; the final data set consists of ∼190,000 single nucleotide polymorphisms from 27,243 contigs in expressed genomic regions. A drastic reduction in nucleotide diversity (∼60%) is evident for the domesticated form, compared with the wild form, and almost 50% of the contigs that are polymorphic were brought to fixation by domestication. In parallel, the effects of domestication decreased the diversity of gene expression (18%). While the coexpression networks for the wild and domesticated accessions demonstrate similar seminal network properties, they show distinct community structures that are enriched for different molecular functions. After simulating the demographic dynamics during domestication, we found that 9% of the genes were actively selected during domestication. We also show that selection induced a further reduction in the diversity of gene expression (26%) and was associated with 5-fold enrichment of differentially expressed genes. While there is substantial evidence of positive selection associated with domestication, in a few cases, this selection has increased the nucleotide diversity in the domesticated pool at target loci associated with abiotic stress responses, flowering time, and morphology. PMID:24850850
The Gene Set Builder: collation, curation, and distribution of sets of genes
Yusuf, Dimas; Lim, Jonathan S; Wasserman, Wyeth W
2005-01-01
Background In bioinformatics and genomics, there are many applications designed to investigate the common properties for a set of genes. Often, these multi-gene analysis tools attempt to reveal sequential, functional, and expressional ties. However, while tremendous effort has been invested in developing tools that can analyze a set of genes, minimal effort has been invested in developing tools that can help researchers compile, store, and annotate gene sets in the first place. As a result, the process of making or accessing a set often involves tedious and time consuming steps such as finding identifiers for each individual gene. These steps are often repeated extensively to shift from one identifier type to another; or to recreate a published set. In this paper, we present a simple online tool which – with the help of the gene catalogs Ensembl and GeneLynx – can help researchers build and annotate sets of genes quickly and easily. Description The Gene Set Builder is a database-driven, web-based tool designed to help researchers compile, store, export, and share sets of genes. This application supports the 17 eukaryotic genomes found in version 32 of the Ensembl database, which includes species from yeast to human. User-created information such as sets and customized annotations are stored to facilitate easy access. Gene sets stored in the system can be "exported" in a variety of output formats – as lists of identifiers, in tables, or as sequences. In addition, gene sets can be "shared" with specific users to facilitate collaborations or fully released to provide access to published results. The application also features a Perl API (Application Programming Interface) for direct connectivity to custom analysis tools. A downloadable Quick Reference guide and an online tutorial are available to help new users learn its functionalities. 
Conclusion The Gene Set Builder is an Ensembl-facilitated online tool designed to help researchers compile and manage sets of genes in a user-friendly environment. The application can be accessed via . PMID:16371163
Saka, Ernur; Harrison, Benjamin J; West, Kirk; Petruska, Jeffrey C; Rouchka, Eric C
2017-12-06
Since the introduction of microarrays in 1995, researchers world-wide have used both commercial and custom-designed microarrays for understanding differential expression of transcribed genes. Public databases such as ArrayExpress and the Gene Expression Omnibus (GEO) have made millions of samples readily available. One main drawback to microarray data analysis involves the selection of probes to represent a specific transcript of interest, particularly in light of the fact that transcript-specific knowledge (notably alternative splicing) is dynamic in nature. We therefore developed a framework for reannotating and reassigning probe groups for Affymetrix® GeneChip® technology based on functional regions of interest. This framework addresses three issues of Affymetrix® GeneChip® data analyses: removing nonspecific probes, updating probe target mapping based on the latest genome knowledge and grouping probes into gene, transcript and region-based (UTR, individual exon, CDS) probe sets. Updated gene and transcript probe sets provide more specific analysis results based on current genomic and transcriptomic knowledge. The framework selects unique probes, aligns them to gene annotations and generates a custom Chip Description File (CDF). The analysis reveals only 87% of the Affymetrix® GeneChip® HG-U133 Plus 2 probes uniquely align to the current hg38 human assembly without mismatches. We also tested new mappings on the publicly available data series using rat and human data from GSE48611 and GSE72551 obtained from GEO, and illustrate that functional grouping allows for the subtle detection of regions of interest likely to have phenotypical consequences. Through reanalysis of the publicly available data series GSE48611 and GSE72551, we profiled the contribution of UTR and CDS regions to the gene expression levels globally. 
The comparison between region-based and gene-based results indicated that the expressed genes detected by gene-based and region-based CDFs are highly consistent, and that region-based results allow the detection of changes in transcript formation.
González-Martínez, Santiago C; Ersoz, Elhan; Brown, Garth R; Wheeler, Nicholas C; Neale, David B
2006-03-01
Genetic association studies are rapidly becoming the experimental approach of choice to dissect complex traits, including tolerance to drought stress, which is the most common cause of mortality and yield losses in forest trees. Optimization of association mapping requires knowledge of the patterns of nucleotide diversity and linkage disequilibrium and the selection of suitable polymorphisms for genotyping. Moreover, standard neutrality tests applied to DNA sequence variation data can be used to select candidate genes or amino acid sites that are putatively under selection for association mapping. In this article, we study the pattern of polymorphism of 18 candidate genes for drought-stress response in Pinus taeda L., an important tree crop. Data analyses based on a set of 21 putatively neutral nuclear microsatellites did not show population genetic structure or genomewide departures from neutrality. Candidate genes had moderate average nucleotide diversity at silent sites (pi(sil) = 0.00853), varying 100-fold among single genes. The level of within-gene LD was low, with an average pairwise r2 of 0.30, decaying rapidly from approximately 0.50 to approximately 0.20 at 800 bp. No apparent LD among genes was found. A selective sweep may have occurred at the early-response-to-drought-3 (erd3) gene, although population expansion can also explain our results and evidence for selection was not conclusive. One other gene, ccoaomt-1, a methylating enzyme involved in lignification, showed dimorphism (i.e., two highly divergent haplotype lineages at equal frequency), which is commonly associated with the long-term action of balancing selection. Finally, a set of haplotype-tagging SNPs (htSNPs) was selected. Using htSNPs, a reduction of genotyping effort of approximately 30-40%, while sampling most common allelic variants, can be gained in our ongoing association studies for drought tolerance in pine.
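The pairwise r2 statistic quoted in this abstract can be made concrete with a short sketch. It uses the standard definition for two biallelic sites, r2 = D^2 / (pA(1-pA) pB(1-pB)) with D = pAB - pA*pB, computed from phased haplotypes; the 0/1 haplotype encoding, the function name, and the toy data are assumptions for illustration only.

```python
def ld_r2(haplotypes, i, j):
    """Squared allelic correlation between sites i and j.

    haplotypes: list of tuples of 0/1 alleles, one tuple per phased haplotype.
    """
    n = len(haplotypes)
    p_a = sum(h[i] for h in haplotypes) / n          # allele frequency at site i
    p_b = sum(h[j] for h in haplotypes) / n          # allele frequency at site j
    p_ab = sum(h[i] * h[j] for h in haplotypes) / n  # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b                             # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both sites must be polymorphic")
    return d * d / denom


# Toy haplotypes (assumed data, not from the study).
complete_ld = [(1, 1), (1, 1), (0, 0), (0, 0)]
no_ld = [(1, 1), (1, 0), (0, 1), (0, 0)]
partial_ld = [(1, 1), (1, 1), (1, 0), (0, 0)]

print(ld_r2(complete_ld, 0, 1), ld_r2(no_ld, 0, 1), round(ld_r2(partial_ld, 0, 1), 3))
```

Averaging this quantity over all site pairs within a gene, and plotting it against the distance in base pairs between the sites, is one common way to summarise LD decay of the kind reported here.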
Gerber, Simon D.; Amann, Ruth; Wyder, Stefan; Trueb, Beat
2012-01-01
Fgfrl1 (fibroblast growth factor receptor-like 1) is a transmembrane receptor that is essential for the development of the metanephric kidney. It is expressed in all nascent nephrogenic structures and in the ureteric bud. Fgfrl1 null mice fail to develop the metanephric kidneys. Mutant kidney rudiments show a dramatic reduction of ureteric branching and a lack of mesenchymal-to-epithelial transition. Here, we compared the expression profiles of wildtype and Fgfrl1 mutant kidneys to identify genes that act downstream of Fgfrl1 signaling during the early steps of nephron formation. We detected 56 differentially expressed transcripts with 2-fold or greater reduction, among them many genes involved in Fgf, Wnt, Bmp, Notch, and Six/Eya/Dach signaling. We validated the microarray data by qPCR and whole-mount in situ hybridization and showed the expression pattern of candidate genes in normal kidneys. Some of these genes might play an important role during early nephron formation. Our study should help to define the minimal set of genes that is required to form a functional nephron. PMID:22432025
Reliable pre-eclampsia pathways based on multiple independent microarray data sets.
Kawasaki, Kaoru; Kondoh, Eiji; Chigusa, Yoshitsugu; Ujita, Mari; Murakami, Ryusuke; Mogami, Haruta; Brown, J B; Okuno, Yasushi; Konishi, Ikuo
2015-02-01
Pre-eclampsia is a multifactorial disorder characterized by heterogeneous clinical manifestations. Gene expression profiling of preeclamptic placenta has provided different and even opposite results, partly due to data compromised by various experimental artefacts. Here we aimed to identify reliable pre-eclampsia-specific pathways using multiple independent microarray data sets. Gene expression data of control and preeclamptic placentas were obtained from Gene Expression Omnibus. Single-sample gene-set enrichment analysis was performed to generate gene-set activation scores of 9707 pathways obtained from the Molecular Signatures Database. Candidate pathways were identified by t-test-based screening using data sets GSE10588, GSE14722 and GSE25906. Additionally, recursive feature elimination was applied to arrive at a further reduced set of pathways. To assess the validity of the pre-eclampsia pathways, a statistically-validated protocol was executed using five data sets including two other independent validation data sets, GSE30186 and GSE44711. Quantitative real-time PCR was performed for genes in a panel of potential pre-eclampsia pathways using placentas of 20 women with normal or severe preeclamptic singleton pregnancies (n = 10, respectively). A panel of ten pathways was found to discriminate women with pre-eclampsia from controls with high accuracy. Among these were pathways not previously associated with pre-eclampsia, such as the GABA receptor pathway, as well as pathways that have already been linked to pre-eclampsia, such as the glutathione and CDKN1C pathways. mRNA expression of GABRA3 (GABA receptor pathway), GCLC and GCLM (glutathione metabolic pathway), and CDKN1C was significantly reduced in the preeclamptic placentas. In conclusion, ten accurate and reliable pre-eclampsia pathways were identified based on multiple independent microarray data sets. 
A pathway-based classification may be a worthwhile approach to elucidate the pathogenesis of pre-eclampsia. © The Author 2014. Published by Oxford University Press on behalf of the European Society of Human Reproduction and Embryology. All rights reserved.
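The t-test-based screening step mentioned in this abstract can be sketched as follows. The example assumes that each pathway already has one activation score per sample and ranks pathways by a Welch t statistic; the pathway names, the scores, and the cutoff `|t| > 2` are hypothetical, and the abstract does not state which t-test variant or threshold the authors used.

```python
import math
import statistics


def welch_t(cases, controls):
    """Welch's t statistic for a difference in means (unequal variances)."""
    se = math.sqrt(
        statistics.variance(cases) / len(cases)
        + statistics.variance(controls) / len(controls)
    )
    return (statistics.mean(cases) - statistics.mean(controls)) / se


def screen_pathways(scores, threshold):
    """Keep pathways whose |t| exceeds the (assumed) cutoff."""
    kept = {}
    for name, (cases, controls) in scores.items():
        t = welch_t(cases, controls)
        if abs(t) > threshold:
            kept[name] = t
    return kept


# Hypothetical activation scores: (pre-eclampsia samples, control samples).
scores = {
    "pathway_A": ([2.0, 2.1, 1.9, 2.0], [1.0, 1.1, 0.9, 1.0]),
    "pathway_B": ([1.1, 1.9, 1.6, 1.3], [1.0, 2.0, 1.5, 1.2]),
}
candidates = screen_pathways(scores, threshold=2.0)
print(sorted(candidates))
```

In a real analysis the statistic would be converted to a p-value and corrected for the thousands of pathways tested; the fixed cutoff here only keeps the sketch dependency-free.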
2013-01-01
Background Differential gene expression (DGE) analysis is commonly used to reveal the deregulated molecular mechanisms of complex diseases. However, traditional DGE analysis (e.g., the t test or the rank sum test) tests each gene independently without considering interactions between them. Top-ranked differentially regulated genes prioritized by the analysis may not directly relate to the coherent molecular changes underlying complex diseases. Joint analyses of co-expression and DGE have been applied to reveal the deregulated molecular modules underlying complex diseases. Most of these methods consist of separate steps: first to identify gene-gene relationships under the studied phenotype, then to integrate them with gene expression changes for prioritizing signature genes, or vice versa. A method is therefore warranted that can simultaneously consider gene-gene co-expression strength and corresponding expression level changes so that both types of information can be leveraged optimally. Results In this paper, we develop a gene module based method for differential gene expression analysis, named network-based differential gene expression (nDGE) analysis, a one-step integrative process for prioritizing deregulated genes and grouping them into gene modules. We demonstrate that nDGE outperforms existing methods in prioritizing deregulated genes and discovering deregulated gene modules using simulated data sets. When tested on a series of smoker and non-smoker lung adenocarcinoma data sets, we show that top differentially regulated genes identified by the rank sum test in different sets are not consistent while top ranked genes defined by nDGE in different data sets significantly overlap. nDGE results suggest that a differentially regulated gene module, which is enriched for cell cycle related genes and E2F1 targeted genes, plays a role in the molecular differences between smoker and non-smoker lung adenocarcinoma. 
Conclusions In this paper, we develop nDGE to prioritize deregulated genes and group them into gene modules by simultaneously considering gene expression level changes and gene-gene co-regulations. When applied to both simulated and empirical data, nDGE outperforms the traditional DGE method. More specifically, when applied to smoker and non-smoker lung cancer sets, nDGE results illustrate the molecular differences between smoker and non-smoker lung cancer. PMID:24341432
Training set selection for the prediction of essential genes.
Cheng, Jian; Xu, Zhao; Wu, Wenwu; Zhao, Li; Li, Xiangchen; Liu, Yanlin; Tao, Shiheng
2014-01-01
Various computational models have been developed to transfer annotations of gene essentiality between organisms. However, despite the increasing number of microorganisms with well-characterized sets of essential genes, selection of appropriate training sets for predicting the essential genes of poorly-studied or newly sequenced organisms remains challenging. In this study, a machine learning approach was applied reciprocally to predict the essential genes in 21 microorganisms. Results showed that training set selection greatly influenced predictive accuracy. We determined four criteria for training set selection: (1) essential genes in the selected training set should be reliable; (2) the growth conditions in which essential genes are defined should be consistent in training and prediction sets; (3) species used as training set should be closely related to the target organism; and (4) organisms used as training and prediction sets should exhibit similar phenotypes or lifestyles. We then analyzed the performance of an incomplete training set and an integrated training set with multiple organisms. We found that the size of the training set should be at least 10% of the total genes to yield accurate predictions. Additionally, the integrated training sets exhibited remarkable increase in stability and accuracy compared with single sets. Finally, we compared the performance of the integrated training sets with the four criteria and with random selection. The results revealed that a rational selection of training sets based on our criteria yields better performance than random selection. Thus, our results provide empirical guidance on training set selection for the identification of essential genes on a genome-wide scale.
A robust prognostic signature for hormone-positive node-negative breast cancer.
Griffith, Obi L; Pepin, François; Enache, Oana M; Heiser, Laura M; Collisson, Eric A; Spellman, Paul T; Gray, Joe W
2013-01-01
Systemic chemotherapy in the adjuvant setting can cure breast cancer in some patients that would otherwise recur with incurable, metastatic disease. However, since only a fraction of patients would have recurrence after surgery alone, the challenge is to stratify high-risk patients (who stand to benefit from systemic chemotherapy) from low-risk patients (who can safely be spared treatment related toxicities and costs). We focus here on risk stratification in node-negative, ER-positive, HER2-negative breast cancer. We use a large database of publicly available microarray datasets to build a random forests classifier and develop a robust multi-gene mRNA transcription-based predictor of relapse free survival at 10 years, which we call the Random Forests Relapse Score (RFRS). Performance was assessed by internal cross-validation, multiple independent data sets, and comparison to existing algorithms using receiver-operating characteristic and Kaplan-Meier survival analysis. Internal redundancy of features was determined using k-means clustering to define optimal signatures with smaller numbers of primary genes, each with multiple alternates. Internal OOB cross-validation for the initial (full-gene-set) model on training data reported an ROC AUC of 0.704, which was comparable to or better than those reported previously or obtained by applying existing methods to our dataset. Three risk groups with probability cutoffs for low, intermediate, and high-risk were defined. Survival analysis determined a highly significant difference in relapse rate between these risk groups. Validation of the models against independent test datasets showed highly similar results. Smaller 17-gene and 8-gene optimized models were also developed with minimal reduction in performance. Furthermore, the signature was shown to be almost equally effective on both hormone-treated and untreated patients. 
RFRS allows flexibility in both the number and identity of genes utilized from thousands to as few as 17 or eight genes, each with multiple alternatives. The RFRS reports a probability score strongly correlated with risk of relapse. This score could therefore be used to assign systemic chemotherapy specifically to those high-risk patients most likely to benefit from further treatment.
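The ROC AUC used above to summarise classifier performance has a simple rank interpretation that a short sketch can show: it is the probability that a randomly chosen relapse case receives a higher score than a randomly chosen non-relapse case, with ties counted as one half. The function name and the toy labels and scores are assumptions; they are not RFRS outputs.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for the positive class (e.g. relapse), 0 for the negative class.
    scores: predicted probabilities or any score where higher means riskier.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0   # correctly ordered pair
            elif p == q:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Toy example (assumed data): three of the four case/control pairs are ordered correctly.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
print(roc_auc(labels, scores))
```

The quadratic loop is fine for illustration; for large cohorts a rank-based formula gives the same value in O(n log n).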
A robust prognostic signature for hormone-positive node-negative breast cancer
2013-01-01
Background Systemic chemotherapy in the adjuvant setting can cure breast cancer in some patients that would otherwise recur with incurable, metastatic disease. However, since only a fraction of patients would have recurrence after surgery alone, the challenge is to stratify high-risk patients (who stand to benefit from systemic chemotherapy) from low-risk patients (who can safely be spared treatment related toxicities and costs). Methods We focus here on risk stratification in node-negative, ER-positive, HER2-negative breast cancer. We use a large database of publicly available microarray datasets to build a random forests classifier and develop a robust multi-gene mRNA transcription-based predictor of relapse free survival at 10 years, which we call the Random Forests Relapse Score (RFRS). Performance was assessed by internal cross-validation, multiple independent data sets, and comparison to existing algorithms using receiver-operating characteristic and Kaplan-Meier survival analysis. Internal redundancy of features was determined using k-means clustering to define optimal signatures with smaller numbers of primary genes, each with multiple alternates. Results Internal OOB cross-validation for the initial (full-gene-set) model on training data reported an ROC AUC of 0.704, which was comparable to or better than those reported previously or obtained by applying existing methods to our dataset. Three risk groups with probability cutoffs for low, intermediate, and high-risk were defined. Survival analysis determined a highly significant difference in relapse rate between these risk groups. Validation of the models against independent test datasets showed highly similar results. Smaller 17-gene and 8-gene optimized models were also developed with minimal reduction in performance. Furthermore, the signature was shown to be almost equally effective on both hormone-treated and untreated patients. 
Conclusions RFRS allows flexibility in both the number and identity of genes utilized from thousands to as few as 17 or eight genes, each with multiple alternatives. The RFRS reports a probability score strongly correlated with risk of relapse. This score could therefore be used to assign systemic chemotherapy specifically to those high-risk patients most likely to benefit from further treatment. PMID:24112773
A global analysis of adaptive evolution of operons in cyanobacteria.
Memon, Danish; Singh, Abhay K; Pakrasi, Himadri B; Wangikar, Pramod P
2013-02-01
Operons are an important feature of prokaryotic genomes. Evolution of operons is hypothesized to be adaptive and has contributed significantly towards coordinated optimization of functions. Two conflicting theories, based on (i) in situ formation to achieve co-regulation and (ii) horizontal gene transfer of functionally linked gene clusters, are generally considered to explain why and how operons have evolved. Furthermore, effects of operon evolution on genomic traits such as intergenic spacing, operon size and co-regulation are relatively less explored. Based on the conservation level in a set of diverse prokaryotes, we categorize the operonic gene pair associations and in turn the operons as ancient and recently formed. This allowed us to perform a detailed analysis of operonic structure in cyanobacteria, a morphologically and physiologically diverse group of photoautotrophs. Clustering based on operon conservation showed significant similarity with the 16S rRNA-based phylogeny, which groups the cyanobacterial strains into three clades. Clade C, dominated by strains that are believed to have undergone genome reduction, shows a larger fraction of operonic genes that are tightly packed in larger sized operons. Ancient operons are in general larger, more tightly packed, better optimized for co-regulation and part of key cellular processes. A sub-clade within Clade B, which includes Synechocystis sp. PCC 6803, shows a reverse trend in intergenic spacing. Our results suggest that while in situ formation and vertical descent may be a dominant mechanism of operon evolution in cyanobacteria, optimization of intergenic spacing and co-regulation are part of an ongoing process in the life-cycle of operons.
Uchiyama, Ikuo
2008-10-31
Identifying the set of intrinsically conserved genes, or the genomic core, among related genomes is crucial for understanding prokaryotic genomes where horizontal gene transfers are common. Although core genome identification appears to be obvious among very closely related genomes, it becomes more difficult when more distantly related genomes are compared. Here, we consider the core structure as a set of sufficiently long segments in which gene orders are conserved so that they are likely to have been inherited mainly through vertical transfer, and developed a method for identifying the core structure by finding the order of pre-identified orthologous groups (OGs) that maximally retains the conserved gene orders. The method was applied to genome comparisons of two well-characterized families, Bacillaceae and Enterobacteriaceae, and identified their core structures comprising 1438 and 2125 OGs, respectively. The core sets contained most of the essential genes and their related genes, which were primarily included in the intersection of the two core sets comprising around 700 OGs. The definition of the genomic core based on gene order conservation was demonstrated to be more robust than the simpler approach based only on gene conservation. We also investigated the core structures in terms of G+C content homogeneity and phylogenetic congruence, and found that the core genes primarily exhibited the expected characteristic, i.e., being indigenous and sharing the same history, more than the non-core genes. The results demonstrate that our strategy of genome alignment based on gene order conservation can provide an effective approach to identify the genomic core among moderately related microbial genomes.
Lv, Yufeng; Wei, Wenhao; Huang, Zhong; Chen, Zhichao; Fang, Yuan; Pan, Lili; Han, Xueqiong; Xu, Zihai
2018-06-20
The aim of this study was to develop a novel long non-coding RNA (lncRNA) expression signature to accurately predict early recurrence for patients with hepatocellular carcinoma (HCC) after curative resection. Using expression profiles downloaded from The Cancer Genome Atlas database, we identified multiple lncRNAs with differential expression between the early recurrence (ER) group and the non-early recurrence (non-ER) group of HCC. Least absolute shrinkage and selection operator (LASSO) for logistic regression models was used to develop a lncRNA-based classifier for predicting ER in the training set. An independent test set was used to validate the predictive value of this classifier. Furthermore, a co-expression network based on these lncRNAs and their highly related genes was constructed, and Gene Ontology and Kyoto Encyclopedia of Genes and Genomes pathway enrichment analyses of genes in the network were performed. We identified 10 differentially expressed lncRNAs, including 3 that were upregulated and 7 that were downregulated in the ER group. The lncRNA-based classifier was constructed based on 7 lncRNAs (AL035661.1, PART1, AC011632.1, AC109588.1, AL365361.1, LINC00861 and LINC02084), and its accuracy was 0.83 in the training set, 0.87 in the test set and 0.84 in the total set. ROC curve analysis showed the AUROC was 0.741 in the training set, 0.824 in the test set and 0.765 in the total set. A functional enrichment analysis suggested that the genes highly related to 4 lncRNAs were involved in the immune system. This 7-lncRNA expression profile can effectively predict early recurrence after surgical resection for HCC.
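How a LASSO penalty turns a logistic regression into a feature selector can be sketched in a few lines. The example below is a minimal proximal-gradient (soft-thresholding) solver written from scratch, not the software used in the study; the two-feature toy data, the penalty values, the step size, and every name in it are illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink toward zero, clip to exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso_logistic(rows, labels, lam, step=0.1, n_iter=2000):
    """L1-penalised logistic regression; the intercept is left unpenalised."""
    n, p = len(rows), len(rows[0])
    weights, intercept = [0.0] * p, 0.0
    for _ in range(n_iter):
        # Residuals (predicted probability minus label) under the current model.
        residuals = [
            sigmoid(intercept + sum(w * x for w, x in zip(weights, row))) - y
            for row, y in zip(rows, labels)
        ]
        grad_b = sum(residuals) / n
        grads = [sum(r * row[j] for r, row in zip(residuals, rows)) / n for j in range(p)]
        intercept -= step * grad_b
        # Gradient step followed by soft-thresholding on each coefficient.
        weights = [soft_threshold(w - step * g, step * lam) for w, g in zip(weights, grads)]
    return weights, intercept


# Toy expression data (assumed): feature 0 separates the classes, feature 1 is noise.
rows = [(2.0, 0.1), (1.5, -0.2), (1.0, 0.3), (-1.0, -0.1), (-1.5, 0.2), (-2.0, -0.3)]
labels = [1, 1, 1, 0, 0, 0]

sparse_weights, _ = fit_lasso_logistic(rows, labels, lam=0.1)
null_weights, _ = fit_lasso_logistic(rows, labels, lam=1.0)
print(sparse_weights, null_weights)
```

With the moderate penalty the noise feature is driven to exactly zero while the informative one is retained; with the large penalty every coefficient is zero. That exact sparsity is what lets a LASSO fit reduce a candidate list to a short signature.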
Kim, Eunji; Ivanov, Ivan; Hua, Jianping; Lampe, Johanna W; Hullar, Meredith Aj; Chapkin, Robert S; Dougherty, Edward R
2017-01-01
Ranking feature sets for phenotype classification based on gene expression is a challenging issue in cancer bioinformatics. When the number of samples is small, all feature selection algorithms are known to be unreliable, producing significant error, and error estimators suffer from different degrees of imprecision. The problem is compounded by the fact that the accuracy of classification depends on the manner in which the phenomena are transformed into data by the measurement technology. Because next-generation sequencing technologies amount to a nonlinear transformation of the actual gene or RNA concentrations, they can potentially produce less discriminative data relative to the actual gene expression levels. In this study, we compare the performance of ranking feature sets derived from a model of RNA-Seq data with that of a multivariate normal model of gene concentrations using 3 measures: (1) ranking power, (2) length of extensions, and (3) Bayes features. This is the model-based study to examine the effectiveness of reporting lists of small feature sets using RNA-Seq data and the effects of different model parameters and error estimators. The results demonstrate that the general trends of the parameter effects on the ranking power of the underlying gene concentrations are preserved in the RNA-Seq data, whereas the power of finding a good feature set becomes weaker when gene concentrations are transformed by the sequencing machine.
An improved method for functional similarity analysis of genes based on Gene Ontology.
Tian, Zhen; Wang, Chunyu; Guo, Maozu; Liu, Xiaoyan; Teng, Zhixia
2016-12-23
Measures of gene functional similarity are essential tools for gene clustering, gene function prediction, evaluation of protein-protein interaction, disease gene prioritization and other applications. In recent years, many gene functional similarity methods have been proposed based on the semantic similarity of GO terms. However, these leading approaches may make error-prone judgments, especially when they measure the specificity of GO terms as well as the information content (IC) of a term set. Therefore, how to estimate gene functional similarity reliably is still a challenging problem. We propose WIS, an effective method to measure gene functional similarity. First, WIS computes the IC of a term by employing its depth, the number of its ancestors, and the topology of its descendants in the GO graph. Second, WIS calculates the IC of a term set by considering the weighted inherited semantics of terms. Finally, WIS estimates gene functional similarity based on the IC overlap ratio of term sets. WIS is superior to some other representative measures in experiments on functional classification of genes in a biological pathway, collaborative evaluation of GO-based semantic similarity measures, protein-protein interaction prediction and correlation with gene expression. Further analysis suggests that WIS fully takes into account the specificity of terms and the weighted inherited semantics between GO terms. The proposed WIS method is an effective and reliable way to compare gene function. The web service of WIS is freely available at http://nclab.hit.edu.cn/WIS/ .
Casel, Pierrot; Moreews, François; Lagarrigue, Sandrine; Klopp, Christophe
2009-07-16
Microarrays are a powerful technology for monitoring tens of thousands of genes in a single experiment. Most microarrays now use oligo-sets. The design of the oligo-nucleotides is time-consuming and error-prone. Genome-wide microarray oligo-sets are designed using as large a set of transcripts as possible in order to monitor as many genes as possible. Depending on the state of genome sequencing and assembly, the knowledge of the existing transcripts can be very different. This knowledge evolves with the different genome builds and gene builds. Once the design is done, the microarrays are often used for several years. The biologists working in EADGENE expressed the need for up-to-date annotation files for the oligo-sets they share, including information about the orthologous genes of model species, the Gene Ontology, the corresponding pathways and the chromosomal location. The results of SigReannot on a chicken micro-array used in the EADGENE project compared to the initial annotations show that 23% of the oligo-nucleotide gene annotations were not confirmed, 2% were modified and 1% were added. The value of this up-to-date annotation procedure is demonstrated through the analysis of previously published real data. SigReannot uses the oligo-nucleotide design procedure criteria to validate the probe-gene link and the Ensembl transcripts as reference for annotation. It therefore produces a high quality annotation based on reference gene sets.
Visualizing phylogenetic tree landscapes.
Wilgenbusch, James C; Huang, Wen; Gallivan, Kyle A
2017-02-02
Genomic-scale sequence alignments are increasingly used to infer phylogenies in order to better understand the processes and patterns of evolution. Different partitions within these new alignments (e.g., genes, codon positions, and structural features) often favor hundreds if not thousands of competing phylogenies. Summarizing and comparing phylogenies obtained from multi-source data sets using current consensus tree methods discards valuable information and can disguise potential methodological problems. Discovery of efficient and accurate dimensionality reduction methods used to display at once in 2 or 3 dimensions the relationship among these competing phylogenies will help practitioners diagnose the limits of current evolutionary models and potential problems with phylogenetic reconstruction methods when analyzing large multi-source data sets. We introduce several dimensionality reduction methods to visualize in 2 and 3 dimensions the relationship among competing phylogenies obtained from gene partitions found in three mid- to large-size mitochondrial genome alignments. We test the performance of these dimensionality reduction methods by applying several goodness-of-fit measures. The intrinsic dimensionality of each data set is also estimated to determine whether projections in 2 and 3 dimensions can be expected to reveal meaningful relationships among trees from different data partitions. Several new approaches to aid in the comparison of different phylogenetic landscapes are presented. Curvilinear Components Analysis (CCA) and a stochastic gradient descent (SGD) optimization method give the best representation of the original tree-to-tree distance matrix for each of the three mitochondrial genome alignments and greatly outperformed the method currently used to visualize tree landscapes. The CCA + SGD method converged at least as fast as previously applied methods for visualizing tree landscapes.
We demonstrate for all three mtDNA alignments that 3D projections significantly increase the fit between the tree-to-tree distances and can facilitate the interpretation of the relationship among phylogenetic trees. We demonstrate that the choice of dimensionality reduction method can significantly influence the spatial relationship among a large set of competing phylogenetic trees. We highlight the importance of selecting a dimensionality reduction method to visualize large multi-locus phylogenetic landscapes and demonstrate that 3D projections of mitochondrial tree landscapes better capture the relationship among the trees being compared.
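The comparison of 2D and 3D projections of a tree-to-tree distance matrix can be illustrated with a minimal sketch. Note the swap: the code below uses classical multidimensional scaling, not the CCA + SGD method of the abstract, and the stress value is just one plausible goodness-of-fit measure. The function names and the toy distance matrix are assumptions made for illustration.

```python
import numpy as np


def classical_mds(dist, k=2):
    """Embed n items in k dimensions from an n x n distance matrix."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering  # double-centred Gram matrix
    vals, vecs = np.linalg.eigh(gram)               # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:k]                # indices of the k largest
    scale = np.sqrt(np.clip(vals[top], 0.0, None))  # guard against tiny negatives
    return vecs[:, top] * scale


def stress(dist, coords):
    """Normalised misfit between original and embedded pairwise distances."""
    d = np.asarray(dist, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    embedded = np.sqrt((diff ** 2).sum(axis=-1))
    upper = np.triu_indices(d.shape[0], 1)
    return np.sqrt(((d[upper] - embedded[upper]) ** 2).sum() / (d[upper] ** 2).sum())


# Toy "landscape": four items that are all at distance 1 from each other.
# Such a configuration fits exactly in 3 dimensions but not in 2.
toy_dist = np.ones((4, 4)) - np.eye(4)
stress_2d = stress(toy_dist, classical_mds(toy_dist, k=2))
stress_3d = stress(toy_dist, classical_mds(toy_dist, k=3))
```

For this toy matrix the 3D embedding reproduces the distances essentially exactly, while the 2D embedding cannot, which mirrors in miniature the kind of gain in fit the abstract reports for 3D projections.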
shRNA-Induced Gene Knockdown In Vivo to Investigate Neutrophil Function.
Basit, Abdul; Tang, Wenwen; Wu, Dianqing
2016-01-01
To silence genes in neutrophils efficiently, we exploited RNA interference and developed an shRNA-based gene knockdown technique. This method involves transfection of mouse bone marrow-derived hematopoietic stem cells with a retroviral vector carrying shRNA directed at a specific gene. Transfected stem cells are then transplanted into irradiated wild-type mice. After engraftment of stem cells, the transplanted mice have two sets of circulating neutrophils. One set has a gene of interest knocked down while the other set has the full complement of expressed genes. This efficient technique provides a unique way to directly compare the response of neutrophils with a knocked-down gene to that of neutrophils with the full complement of expressed genes in the same environment.
Carrión, Javier
2011-09-01
The immune phenotype conferred by two different sets of histone genes (H2A-H2B or H3-H4) was assessed. BALB/c mice vaccinated with pcDNA3H2AH2B succumbed to progressive cutaneous leishmaniosis (CL), whereas vaccination with pcDNA3H3H4 resulted in partial resistance to Leishmania major challenge associated with the development of mixed T helper 1 (Th1)/Th2-type response and a reduction in parasite-specific Treg cells number at the site of infection. Therefore, the presence of histones H3 and H4 may be considered essential in the development of vaccine strategies against CL based on the Leishmania histones. Copyright © 2011 Elsevier Ltd. All rights reserved.
Parodi, Stefano; Manneschi, Chiara; Verda, Damiano; Ferrari, Enrico; Muselli, Marco
2018-03-01
This study evaluates the performance of a set of machine learning techniques in predicting the prognosis of Hodgkin's lymphoma using clinical factors and gene expression data. Analysed samples from 130 Hodgkin's lymphoma patients included a small set of clinical variables and more than 54,000 gene features. Machine learning classifiers included three black-box algorithms (k-nearest neighbour, Artificial Neural Network, and Support Vector Machine) and two methods based on intelligible rules (Decision Tree and the innovative Logic Learning Machine method). Support Vector Machine clearly outperformed any of the other methods. Among the two rule-based algorithms, Logic Learning Machine performed better and identified a set of simple intelligible rules based on a combination of clinical variables and gene expressions. Decision Tree identified a non-coding gene (XIST) involved in the early phases of X chromosome inactivation that was overexpressed in females and in non-relapsed patients. XIST expression might be responsible for the better prognosis of female Hodgkin's lymphoma patients.
Sanli, G; Blaber, S I; Blaber, M
2001-01-01
Corynebacteria codon usage exhibits an overall GC content of 67%, and a wobble-position GC content of 88%. Escherichia coli, on the other hand, has an overall GC content of 51%, and a wobble-position GC content of 55%. The high GC content of Corynebacteria genes results in an unfavorable codon preference for heterologous expression, and can present difficulties for polymerase-based manipulations due to secondary-structure effects. Since these characteristics are due primarily to base composition at the wobble-position, synthetic genes can, in principle, be designed to eliminate these problems and retain the wild-type amino acid sequence. Such genes would obviate the need for special additives or bases during in vitro polymerase-based manipulation and mutant host strains containing uncommon tRNAs for heterologous expression. We have evaluated synthetic genes with reduced wobble-position G/C content using two variants of the enzyme 2,5-diketo-D-gluconic acid reductase (2,5-DKGR A and B) from Corynebacterium. The wild-type genes are refractory to polymerase-based manipulations and exhibit poor heterologous expression in enteric bacteria. The results indicate that a subset of codons for five amino acids (alanine, arginine, glutamate, glycine and valine) make the greatest contribution to reduction in G/C content at the wobble-position. Furthermore, changes in codons for two amino acids (leucine and proline) enhance bias for expression in enteric bacteria without affecting the overall G/C content. The synthetic genes are readily amplified using polymerase-based methodologies, and exhibit high levels of heterologous expression in E. coli.
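The two quantities contrasted above, overall GC content and wobble-position (third codon position) GC content, can be computed with a short sketch like the one below. The function names and the toy sequence are illustrative assumptions; the sequence is three alanine codons and is not taken from the 2,5-DKGR genes.

```python
def gc_content(seq):
    """Fraction of G or C bases in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in seq) / len(seq)


def wobble_gc_content(cds):
    """GC fraction at the third position of each codon of an in-frame CDS."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return gc_content(cds[2::3])  # every third base, starting at position 3


# Three alanine codons (GCG, GCA, GCC) and a synonymous recoding to GCA.
wild_type = "GCGGCAGCC"
recoded = "GCAGCAGCA"
wt_overall, wt_wobble = gc_content(wild_type), wobble_gc_content(wild_type)
rc_overall, rc_wobble = gc_content(recoded), wobble_gc_content(recoded)
```

In this toy case the recoding leaves the encoded amino acids unchanged while lowering the wobble-position GC fraction from 2/3 to 0 and the overall GC fraction from 8/9 to 6/9.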
A statistical approach to identify, monitor, and manage incomplete curated data sets.
Howe, Douglas G
2018-04-02
Many biological knowledge bases gather data through expert curation of published literature. High data volume, selective partial curation, delays in access, and publication of data prior to the ability to curate it can result in incomplete curation of published data. Knowing which data sets are incomplete and how incomplete they are remains a challenge. Awareness that a data set may be incomplete is important for proper interpretation and for avoiding flawed hypothesis generation, and can justify further exploration of published literature for additional relevant data. Computational methods to assess data set completeness are needed. One such method is presented here. In this work, a multivariate linear regression model was used to identify genes in the Zebrafish Information Network (ZFIN) Database having incomplete curated gene expression data sets. Starting with 36,655 gene records from ZFIN, data aggregation, cleansing, and filtering reduced the set to 9870 gene records suitable for training and testing the model to predict the number of expression experiments per gene. Feature engineering and selection identified the following predictive variables: the number of journal publications; the number of journal publications already attributed for gene expression annotation; the percent of journal publications already attributed for expression data; the gene symbol; and the number of transgenic constructs associated with each gene. Twenty-five percent of the gene records (2483 genes) were used to train the model. The remaining 7387 genes were used to test the model. One hundred and twenty-two and 165 of the 7387 tested genes were identified as missing expression annotations based on their residuals being outside the model lower or upper 95% confidence interval respectively. The model had precision of 0.97 and recall of 0.71 at the negative 95% confidence interval and precision of 0.76 and recall of 0.73 at the positive 95% confidence interval.
This method can be used to identify data sets that are incompletely curated, as demonstrated using the gene expression data set from ZFIN. This information can help both database resources and data consumers gauge when it may be useful to look further for published data to augment the existing expertly curated information.
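The residual-based flagging idea can be sketched as follows. This is a minimal illustration with assumptions stated loudly: an ordinary least-squares fit via NumPy, a band of plus or minus 1.96 residual standard deviations estimated on the training data as a stand-in for the "95% confidence interval" of the abstract, and a single synthetic predictor. The actual ZFIN features, model and interval construction are not reproduced, and all names below are hypothetical.

```python
import numpy as np


def fit_ols(x, y):
    """Least-squares coefficients (intercept first) for predictors x."""
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict(coef, x):
    return np.column_stack([np.ones(len(x)), x]) @ coef


def flag_by_residual(coef, x_train, y_train, x_test, y_test, z=1.96):
    """Boolean masks of test records whose residual falls below / above the band."""
    train_resid = y_train - predict(coef, x_train)
    sd = np.std(train_resid, ddof=len(coef))  # residual standard deviation
    resid = y_test - predict(coef, x_test)
    return resid < -z * sd, resid > z * sd


# Synthetic data: y is roughly 2*x + 1 with small alternating noise.
x_train = np.arange(20, dtype=float).reshape(-1, 1)
y_train = 2.0 * x_train[:, 0] + 1.0 + 0.1 * (-1.0) ** np.arange(20)
coef = fit_ols(x_train, y_train)

x_test = np.array([[5.0], [6.0], [7.0]])
y_test = np.array([11.0, 3.0, 30.0])  # second is far too low, third far too high
below, above = flag_by_residual(coef, x_train, y_train, x_test, y_test)
```

Records flagged on either side of the band are candidates for a closer look; deciding what a flag means for curation completeness is left to the original study.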
Tiwari, Jagesh Kumar; Devi, Sapna; Sundaresha, S; Chandel, Poonam; Ali, Nilofer; Singh, Brajesh; Bhardwaj, Vinay; Singh, Bir Pal
2015-06-01
Genes involved in photoassimilate partitioning and changes in hormonal balance are important for potato tuberization. In the present study, we investigated gene expression patterns in the tuber-bearing potato somatic hybrid (E1-3) and control non-tuberous wild species Solanum etuberosum (Etb) by microarray. Plants were grown under controlled conditions and leaves were collected at eight tuber developmental stages for microarray analysis. A t-test analysis identified a total of 468 genes (94 up-regulated and 374 down-regulated) that were statistically significant (p ≤ 0.05) and differentially expressed in E1-3 and Etb. Gene Ontology (GO) characterization of the 468 genes revealed that 145 were annotated and 323 were of unknown function. Further, these 145 genes were grouped based on GO biological processes followed by molecular function and (or) PGSC description into 15 gene sets, namely (1) transport, (2) metabolic process, (3) biological process, (4) photosynthesis, (5) oxidation-reduction, (6) transcription, (7) translation, (8) binding, (9) protein phosphorylation, (10) protein folding, (11) ubiquitin-dependent protein catabolic process, (12) RNA processing, (13) negative regulation of protein, (14) methylation, and (15) mitosis. RT-PCR analysis of 10 selected highly significant genes (p ≤ 0.01) confirmed the microarray results. Overall, we show that candidate genes induced in leaves of E1-3 were implicated in tuberization processes such as transport, carbohydrate metabolism, phytohormones, and transcription/translation/binding functions. Hence, our results provide an insight into the candidate genes induced in leaf tissues during tuberization in E1-3.
Integrating high dimensional bi-directional parsing models for gene mention tagging.
Hsu, Chun-Nan; Chang, Yu-Ming; Kuo, Cheng-Ju; Lin, Yu-Shi; Huang, Han-Shen; Chung, I-Fang
2008-07-01
Tagging gene and gene product mentions in scientific text is an important initial step of literature mining. In this article, we describe in detail our gene mention tagger, which participated in the BioCreative 2 challenge, and analyze what contributes to its good performance. Our tagger is based on the conditional random fields (CRF) model, the most prevalent method for the gene mention tagging task in BioCreative 2. Our tagger is interesting because it achieved the highest F-score among CRF-based methods and the second highest overall. Moreover, we obtained our results mostly by applying open source packages, making it easy to duplicate our results. We first describe in detail how we developed our CRF-based tagger. We designed a very high dimensional feature set that includes most of the information that may be relevant. We trained bi-directional CRF models with the same set of features, one applying forward parsing and the other backward, and integrated the two models based on the output scores and dictionary filtering. One of the most prominent factors contributing to the good performance of our tagger is the integration of an additional backward parsing model. However, from the definition of CRF, it appears that a CRF model is symmetric and bi-directional parsing models will produce the same results. We show that, due to different feature settings, a CRF model can be asymmetric, and the feature setting for our tagger in BioCreative 2 not only produces different results but also gives backward parsing models a slight but constant advantage over forward parsing models. To fully explore the potential of integrating bi-directional parsing models, we applied different asymmetric feature settings to generate many bi-directional parsing models and integrated them based on the output scores. Experimental results show that this integrated model can achieve an even higher F-score based solely on the training corpus for gene mention tagging.
Data sets, programs and an on-line service of our gene mention tagger can be accessed at http://aiia.iis.sinica.edu.tw/biocreative2.htm.
Tramm, Trine; Mohammed, Hayat; Myhre, Simen; Kyndi, Marianne; Alsner, Jan; Børresen-Dale, Anne-Lise; Sørlie, Therese; Frigessi, Arnoldo; Overgaard, Jens
2014-10-15
To identify genes predicting benefit of radiotherapy in patients with high-risk breast cancer treated with systemic therapy and randomized to receive or not receive postmastectomy radiotherapy (PMRT). The study was based on the Danish Breast Cancer Cooperative Group (DBCG82bc) cohort. Gene-expression analysis was performed in a training set of frozen tumor tissue from 191 patients. Genes were identified through the Lasso method with the endpoint being locoregional recurrence (LRR). A weighted gene-expression index (DBCG-RT profile) was calculated and transferred to quantitative real-time PCR (qRT-PCR) in corresponding formalin-fixed, paraffin-embedded (FFPE) samples, before validation in FFPE from 112 additional patients. Seven genes were identified, and the derived DBCG-RT profile divided the 191 patients into "high LRR risk" and "low LRR risk" groups. PMRT significantly reduced risk of LRR in "high LRR risk" patients, whereas "low LRR risk" patients showed no additional reduction in LRR rate. Technical transfer of the DBCG-RT profile to FFPE/qRT-PCR was successful, and the predictive impact was successfully validated in another 112 patients. A DBCG-RT gene profile was identified and validated, identifying patients with very low risk of LRR and no benefit from PMRT. The profile may provide a method to individualize treatment with PMRT. ©2014 American Association for Cancer Research.
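A weighted gene-expression index of the kind described above can be sketched as a weighted sum followed by a cutoff. Everything specific in the sketch is hypothetical: the gene names, the weights, the cutoff, and the assumption that a higher index corresponds to the "high LRR risk" group are placeholders, not the seven genes or the coefficients of the DBCG-RT profile.

```python
def weighted_index(expression, weights):
    """Weighted sum of expression values over the genes in the profile."""
    return sum(weight * expression[gene] for gene, weight in weights.items())


def risk_group(index, cutoff):
    """Dichotomise the index; the direction of the rule is an assumption."""
    return "high LRR risk" if index > cutoff else "low LRR risk"


# Hypothetical two-gene profile (e.g. nonzero Lasso coefficients) and one sample.
toy_weights = {"geneA": 0.5, "geneB": -1.0}
toy_sample = {"geneA": 4.0, "geneB": 1.0, "geneC": 9.0}  # geneC is not in the profile

toy_index = weighted_index(toy_sample, toy_weights)
toy_group = risk_group(toy_index, cutoff=0.5)
```

Genes outside the profile are ignored, so the same function applies whether the expression values come from microarray or qRT-PCR measurements, provided they are on a scale for which the weights were derived.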
Gao, Shanwu; Tibiche, Chabane; Zou, Jinfeng; Zaman, Naif; Trifiro, Mark; O'Connor-McCourt, Maureen; Wang, Edwin
2016-01-01
Decisions regarding adjuvant therapy in patients with stage II colorectal cancer (CRC) have been among the most challenging and controversial in oncology over the past 20 years. To develop robust combinatory cancer hallmark-based gene signature sets (CSS sets) that more accurately predict prognosis and identify a subset of patients with stage II CRC who could gain survival benefits from adjuvant chemotherapy. Thirteen retrospective studies of patients with stage II CRC who had clinical follow-up and adjuvant chemotherapy were analyzed. Respective totals of 162 and 843 patients from 2 and 11 independent cohorts were used as the discovery and validation cohorts, respectively. A total of 1005 patients with stage II CRC were included in the 13 cohorts. Among them, 84 of 416 patients in 3 independent cohorts received fluorouracil-based adjuvant chemotherapy. Identification of CSS sets to predict relapse-free survival and identify a subset of patients with stage II CRC who could gain substantial survival benefits from fluorouracil-based adjuvant chemotherapy. Eight cancer hallmark-based gene signatures (30 genes each) were identified and used to construct CSS sets for determining prognosis. The CSS sets were validated in 11 independent cohorts of 767 patients with stage II CRC who did not receive adjuvant chemotherapy. The CSS sets accurately stratified patients into low-, intermediate-, and high-risk groups. Five-year relapse-free survival rates were 94%, 78%, and 45%, respectively, representing 60%, 28%, and 12% of patients with stage II disease. The 416 patients with CSS set-defined high-risk stage II CRC who received fluorouracil-based adjuvant chemotherapy showed a substantial gain in survival benefits from the treatment (ie, recurrence reduced by 30%-40% in 5 years). The CSS sets substantially outperformed other prognostic predictors of stage II CRC.
They are more accurate and robust for prognostic predictions and facilitate the identification of patients with stage II disease who could gain survival benefit from fluorouracil-based adjuvant chemotherapy.
Consistency of gene starts among Burkholderia genomes
2011-01-01
Background Evolutionary divergence in the position of the translational start site among orthologous genes can have significant functional impacts. Divergence can alter the translation rate, degradation rate, subcellular location, and function of the encoded proteins. Results Existing Genbank gene maps for Burkholderia genomes suggest that extensive divergence has occurred--53% of ortholog sets based on Genbank gene maps had inconsistent gene start sites. However, most of these inconsistencies appear to be gene-calling errors. Evolutionary divergence was the most plausible explanation for only 17% of the ortholog sets. Correcting probable errors in the Genbank gene maps decreased the percentage of ortholog sets with inconsistent starts by 68%, increased the percentage of ortholog sets with extractable upstream intergenic regions by 32%, increased the sequence similarity of intergenic regions and predicted proteins, and increased the number of proteins with identifiable signal peptides. Conclusions Our findings highlight an emerging problem in comparative genomics: single-digit percent errors in gene predictions can lead to double-digit percentages of inconsistent ortholog sets. The work demonstrates a simple approach to evaluate and improve the quality of gene maps. PMID:21342528
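The closing observation, that single-digit error rates per gene can translate into double-digit rates of inconsistent ortholog sets, follows from simple compounding. The sketch below works through the arithmetic under an explicit simplifying assumption that start-site errors occur independently in each genome with the same probability; the error rate and genome count used are illustrative values, not figures from the study.

```python
def inconsistent_set_fraction(per_gene_error, genomes):
    """Probability that at least one of `genomes` orthologs has a miscalled start.

    Assumes independent errors with identical probability in every genome,
    which is a simplification made for illustration only.
    """
    return 1.0 - (1.0 - per_gene_error) ** genomes


# Illustrative values: a 5% per-gene error rate across 10 genomes.
example_fraction = inconsistent_set_fraction(0.05, 10)
```

With these illustrative inputs roughly 40% of ortholog sets would contain at least one miscalled start, even though only 5% of individual gene calls are wrong.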
Phylogenomics from Whole Genome Sequences Using aTRAM.
Allen, Julie M; Boyd, Bret; Nguyen, Nam-Phuong; Vachaspati, Pranjal; Warnow, Tandy; Huang, Daisie I; Grady, Patrick G S; Bell, Kayce C; Cronk, Quentin C B; Mugisha, Lawrence; Pittendrigh, Barry R; Leonardi, M Soledad; Reed, David L; Johnson, Kevin P
2017-09-01
Novel sequencing technologies are rapidly expanding the size of data sets that can be applied to phylogenetic studies. Currently the most commonly used phylogenomic approaches involve some form of genome reduction. While these approaches make assembling phylogenomic data sets more economical for organisms with large genomes, they reduce the genomic coverage and thereby the long-term utility of the data. Currently, for organisms with moderate to small genomes (<1000 Mbp) it is feasible to sequence the entire genome at modest coverage (10-30×). Computational challenges for handling these large data sets can be alleviated by assembling targeted reads, rather than assembling the entire genome, to produce a phylogenomic data matrix. Here we demonstrate the use of automated Target Restricted Assembly Method (aTRAM) to assemble 1107 single-copy ortholog genes from whole genome sequencing of sucking lice (Anoplura) and out-groups. We developed a pipeline to extract exon sequences from the aTRAM assemblies by annotating them with respect to the original target protein. We aligned these protein sequences with the inferred amino acids and then performed phylogenetic analyses on both the concatenated matrix of genes and on each gene separately in a coalescent analysis. Finally, we tested the limits of successful assembly in aTRAM by assembling 100 genes from close- to distantly related taxa at high to low levels of coverage. Both the concatenated analysis and the coalescent-based analysis produced the same tree topology, which was consistent with previously published results and resolved weakly supported nodes. These results demonstrate that this approach is successful at developing phylogenomic data sets from raw genome sequencing reads. Further, we found that with coverages above 5-10×, aTRAM was successful at assembling 80-90% of the contigs for both close and distantly related taxa.
As sequencing costs continue to decline, we expect full genome sequencing will become more feasible for a wider array of organisms, and aTRAM will enable mining of these genomic data sets for an extensive variety of applications, including phylogenomics. [aTRAM; gene assembly; genome sequencing; phylogenomics.]. © The Author(s) 2017. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved.
GSNFS: Gene subnetwork biomarker identification of lung cancer expression data.
Doungpan, Narumol; Engchuan, Worrawat; Chan, Jonathan H; Meechai, Asawin
2016-12-05
Gene expression has been used to identify disease gene biomarkers, but there are ongoing challenges. Single gene or gene-set biomarkers are inadequate to provide sufficient understanding of complex disease mechanisms and the relationship among those genes. Network-based methods have thus been considered for inferring the interaction within a group of genes to further study the disease mechanism. Recently, the Gene-Network-based Feature Set (GNFS), which is capable of handling case-control and multiclass expression for gene biomarker identification, has been proposed, partly taking network topology into account. However, its performance relies on a greedy search for building subnetworks and thus requires further improvement. In this work, we establish a new approach named Gene Sub-Network-based Feature Selection (GSNFS) by implementing the GNFS framework with two proposed searching and scoring algorithms, namely gene-set-based (GS) search and parent-node-based (PN) search, to identify subnetworks. An additional dataset is used to validate the results. The two proposed searching algorithms of the GSNFS method for subnetwork expansion are concerned with the degree of connectivity and the scoring scheme for building subnetworks and their topology. For each iteration of expansion, the neighbour genes of the current subnetwork whose expression data improve the overall subnetwork score are recruited. While the GS search calculates the subnetwork score using an activity score of the current subnetwork and the gene expression values of its neighbours, the PN search uses the expression value of the corresponding parent of each neighbour gene. Four lung cancer expression datasets were used for subnetwork identification. In addition, the use of pathway data and protein-protein interactions as network data, in order to consider the interaction among significant genes, was discussed.
Classification was performed to compare the performance of the identified gene subnetworks with three subnetwork identification algorithms. The two searching algorithms resulted in better classification and gene/gene-set agreement compared to the original greedy search of the GNFS method. The identified lung cancer subnetwork using the proposed searching algorithm resulted in an improvement of the cross-dataset validation and an increase in the consistency of findings between two independent datasets. The homogeneity measurement of the datasets was conducted to assess dataset compatibility in cross-dataset validation. The lung cancer dataset with higher homogeneity showed a better result when using the GS search while the dataset with low homogeneity showed a better result when using the PN search. The 10-fold cross-dataset validation on the independent lung cancer datasets showed higher classification performance of the proposed algorithms when compared with the greedy search in the original GNFS method. The proposed searching algorithms provide a higher number of genes in the subnetwork expansion step than the greedy algorithm. As a result, the performance of the subnetworks identified from the GSNFS method was improved in terms of classification performance and gene/gene-set level agreement depending on the homogeneity of the datasets used in the analysis. Some common genes obtained from the four datasets using different searching algorithms are genes known to play a role in lung cancer. The improvement of classification performance and the gene/gene-set level agreement, and the biological relevance indicated the effectiveness of the GSNFS method for gene subnetwork identification using expression data.
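The expansion step described above, in which neighbours of the current subnetwork are recruited when they improve its score, can be sketched generically as below. This is not the GS or PN search: the scoring function here (summed per-gene weights divided by the square root of the subnetwork size) and the toy network are hypothetical stand-ins chosen only to make the loop concrete.

```python
import math


def expand_subnetwork(seed, neighbours, score):
    """Grow a subnetwork from `seed`, adding neighbours that raise the score."""
    subnetwork = {seed}
    best = score(subnetwork)
    improved = True
    while improved:
        improved = False
        # Genes adjacent to the current subnetwork but not yet part of it.
        frontier = set().union(*(neighbours.get(g, set()) for g in subnetwork)) - subnetwork
        for gene in sorted(frontier):  # sorted for deterministic behaviour
            candidate = score(subnetwork | {gene})
            if candidate > best:
                subnetwork.add(gene)
                best = candidate
                improved = True
    return subnetwork, best


# Hypothetical network and per-gene weights (e.g. some discriminative statistic).
toy_network = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"A"}, "D": {"B"}}
toy_gene_weight = {"A": 2.0, "B": 3.0, "C": -1.0, "D": 1.5}


def toy_score(genes):
    return sum(toy_gene_weight[g] for g in genes) / math.sqrt(len(genes))


toy_subnetwork, toy_best = expand_subnetwork("A", toy_network, toy_score)
```

Starting from gene A, the loop recruits B and then D, and leaves out C because adding it would lower the score at every step.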
Li, Chen; Shen, Weixing; Shen, Sheng; Ai, Zhilong
2013-12-01
To explore the molecular mechanisms of cholangiocarcinoma (CC), microarray technology was used to find biomarkers for early detection and diagnosis. The gene expression profiles from 6 patients with CC and 5 normal controls were downloaded from Gene Expression Omnibus and compared. As a result, 204 differentially co-expressed genes (DCGs) in CC patients compared to normal controls were identified using a computational bioinformatics analysis. These genes were mainly involved in coenzyme metabolic process, peptidase activity and oxidation reduction. A regulatory network was constructed by mapping the DCGs to known regulation data. Four transcription factors, FOXC1, ZIC2, NKX2-2 and GCGR, were hub nodes in the network. In conclusion, this study provides a set of targets useful for future investigations into molecular biomarker studies. Copyright © 2013 Elsevier Ltd. All rights reserved.
Sloan, Daniel B.; Nakabachi, Atsushi; Richards, Stephen; Qu, Jiaxin; Murali, Shwetha Canchi; Gibbs, Richard A.; Moran, Nancy A.
2014-01-01
Bacteria confined to intracellular environments experience extensive genome reduction. In extreme cases, insect endosymbionts have evolved genomes that are so gene-poor that they blur the distinction between bacteria and endosymbiotically derived organelles such as mitochondria and plastids. To understand the host’s role in this extreme gene loss, we analyzed gene content and expression in the nuclear genome of the psyllid Pachypsylla venusta, a sap-feeding insect that harbors an ancient endosymbiont (Carsonella) with one of the most reduced bacterial genomes ever identified. Carsonella retains many genes required for synthesis of essential amino acids that are scarce in plant sap, but most of these biosynthetic pathways have been disrupted by gene loss. Host genes that are upregulated in psyllid cells housing Carsonella appear to compensate for endosymbiont gene losses, resulting in highly integrated metabolic pathways that mirror those observed in other sap-feeding insects. The host contribution to these pathways is mediated by a combination of native eukaryotic genes and bacterial genes that were horizontally transferred from multiple donor lineages early in the evolution of psyllids, including one gene that appears to have been directly acquired from Carsonella. By comparing the psyllid genome to a recent analysis of mealybugs, we found that a remarkably similar set of functional pathways have been shaped by independent transfers of bacterial genes to the two hosts. These results show that horizontal gene transfer is an important and recurring mechanism driving coevolution between insects and their bacterial endosymbionts and highlight interesting similarities and contrasts with the evolutionary history of mitochondria and plastids. PMID:24398322
A mixture model-based approach to the clustering of microarray expression data.
McLachlan, G J; Bean, R W; Peel, D
2002-03-01
This paper introduces the software EMMIX-GENE that has been developed for the specific purpose of a model-based approach to the clustering of microarray expression data, in particular, of tissue samples on a very large number of genes. The latter is a nonstandard problem in parametric cluster analysis because the dimension of the feature space (the number of genes) is typically much greater than the number of tissues. A feasible approach is provided by first selecting a subset of the genes relevant for the clustering of the tissue samples by fitting mixtures of t distributions to rank the genes in order of increasing size of the likelihood ratio statistic for the test of one versus two components in the mixture model. The imposition of a threshold on the likelihood ratio statistic used in conjunction with a threshold on the size of a cluster allows the selection of a relevant set of genes. However, even this reduced set of genes will usually be too large for a normal mixture model to be fitted directly to the tissues, and so the use of mixtures of factor analyzers is exploited to reduce effectively the dimension of the feature space of genes. The usefulness of the EMMIX-GENE approach for the clustering of tissue samples is demonstrated on two well-known data sets on colon and leukaemia tissues. For both data sets, relevant subsets of the genes are able to be selected that reveal interesting clusterings of the tissues that are either consistent with the external classification of the tissues or with background and biological knowledge of these sets. EMMIX-GENE is available at http://www.maths.uq.edu.au/~gjm/emmix-gene/
Integrative omics analysis. A study based on Plasmodium falciparum mRNA and protein data.
Tomescu, Oana A; Mattanovich, Diethard; Thallinger, Gerhard G
2014-01-01
Technological improvements have shifted the focus from data generation to data analysis. The availability of large amounts of data from transcriptomics, proteomics and metabolomics experiments raises new questions concerning suitable integrative analysis methods. We compare three integrative analysis techniques (co-inertia analysis, generalized singular value decomposition and integrative biclustering) by applying them to gene and protein abundance data from the six life cycle stages of Plasmodium falciparum. Co-inertia analysis is an analysis method used to visualize and explore gene and protein data. The generalized singular value decomposition has shown its potential in the analysis of two transcriptome data sets. Integrative Biclustering applies biclustering to gene and protein data. Using CIA, we visualize the six life cycle stages of Plasmodium falciparum, as well as GO terms in a 2D plane and interpret the spatial configuration. With GSVD, we decompose the transcriptomic and proteomic data sets into matrices with biologically meaningful interpretations and explore the processes captured by the data sets. IBC identifies groups of genes, proteins, GO Terms and life cycle stages of Plasmodium falciparum. We show method-specific results as well as a network view of the life cycle stages based on the results common to all three methods. Additionally, by combining the results of the three methods, we create a three-fold validated network of life cycle stage specific GO terms: Sporozoites are associated with transcription and transport; merozoites with entry into host cell as well as biosynthetic and metabolic processes; rings with oxidation-reduction processes; trophozoites with glycolysis and energy production; schizonts with antigenic variation and immune response; gametocytes with DNA packaging and mitochondrial transport. Furthermore, the network connectivity underlines the separation of the intraerythrocytic cycle from the gametocyte and sporozoite stages.
Using integrative analysis techniques, we can integrate knowledge from different levels and obtain a wider view of the system under study. The overlap between method-specific and common results is considerable, even if the basic mathematical assumptions are very different. The three-fold validated network of life cycle stage characteristics of Plasmodium falciparum could identify a large amount of the known associations from literature in only one study.
Integrative omics analysis. A study based on Plasmodium falciparum mRNA and protein data
2014-01-01
Background: Technological improvements have shifted the focus from data generation to data analysis. The availability of large amounts of data from transcriptomics, proteomics and metabolomics experiments raises new questions concerning suitable integrative analysis methods. We compare three integrative analysis techniques (co-inertia analysis, CIA; generalized singular value decomposition, GSVD; and integrative biclustering, IBC) by applying them to gene and protein abundance data from the six life cycle stages of Plasmodium falciparum. Co-inertia analysis is an analysis method used to visualize and explore gene and protein data. The generalized singular value decomposition has shown its potential in the analysis of two transcriptome data sets. Integrative biclustering applies biclustering to gene and protein data. Results: Using CIA, we visualize the six life cycle stages of Plasmodium falciparum, as well as GO terms, in a 2D plane and interpret the spatial configuration. With GSVD, we decompose the transcriptomic and proteomic data sets into matrices with biologically meaningful interpretations and explore the processes captured by the data sets. IBC identifies groups of genes, proteins, GO terms and life cycle stages of Plasmodium falciparum. We show method-specific results as well as a network view of the life cycle stages based on the results common to all three methods. Additionally, by combining the results of the three methods, we create a three-fold validated network of life cycle stage specific GO terms: sporozoites are associated with transcription and transport; merozoites with entry into host cell as well as biosynthetic and metabolic processes; rings with oxidation-reduction processes; trophozoites with glycolysis and energy production; schizonts with antigenic variation and immune response; gametocytes with DNA packaging and mitochondrial transport.
Furthermore, the network connectivity underlines the separation of the intraerythrocytic cycle from the gametocyte and sporozoite stages. Conclusion: Using integrative analysis techniques, we can integrate knowledge from different levels and obtain a wider view of the system under study. The overlap between method-specific and common results is considerable, even though the basic mathematical assumptions are very different. The three-fold validated network of life cycle stage characteristics of Plasmodium falciparum could identify a large number of the known associations from literature in only one study. PMID:25033389
Kawai, Mikihiko; Futagami, Taiki; Toyoda, Atsushi; Takaki, Yoshihiro; Nishi, Shinro; Hori, Sayaka; Arai, Wataru; Tsubouchi, Taishi; Morono, Yuki; Uchiyama, Ikuo; Ito, Takehiko; Fujiyama, Asao; Inagaki, Fumio; Takami, Hideto
2014-01-01
Marine subsurface sediments on the Pacific margin harbor diverse microbial communities even at depths of several hundred meters below the seafloor (mbsf) or more. Previous PCR-based molecular analysis showed the presence of diverse reductive dehalogenase gene (rdhA) homologs in marine subsurface sediment, suggesting that anaerobic respiration of organohalides is one of the possible energy-yielding pathways in the organic-rich sedimentary habitat. However, primer-independent molecular characterization of rdhA remains to be demonstrated. Here, we studied the diversity and frequency of rdhA homologs by metagenomic analysis of five different depth horizons (0.8, 5.1, 18.6, 48.5, and 107.0 mbsf) at Site C9001 off the Shimokita Peninsula of Japan. In all metagenomic pools, remarkably diverse rdhA-homologous sequences, some of which are affiliated with novel clusters, were observed with high frequency. As a comparison, we also examined the frequency of dissimilatory sulfite reductase genes (dsrAB), key functional genes for microbial sulfate reduction. The dsrAB genes were also widely observed in the metagenomic pools, although their frequency was generally lower than that of rdhA-homologous genes. The phylogenetic composition of rdhA-homologous genes was similar among the five depth horizons. Our metagenomic data revealed that subseafloor rdhA homologs are more diverse than previously identified from PCR-based molecular studies. Spatial distribution of similar rdhA homologs across wide depositional ages indicates that the heterotrophic metabolic processes mediated by the genes can be ecologically important, functioning in the organic-rich subseafloor sedimentary biosphere. PMID:24624126
Molecular phylogeny and evolutionary timescale for the family of mammalian herpesviruses.
McGeoch, D J; Cook, S; Dolan, A; Jamieson, F E; Telford, E A
1995-03-31
A detailed phylogenetic analysis for mammalian members of the family Herpesviridae, based on molecular sequences, is reported. Sets of encoded amino acid sequences were collected for eight well conserved genes that are common to mammalian herpesviruses. Phylogenetic trees were inferred from alignments of these sequence sets using both maximum parsimony and distance methods, and evaluated by bootstrap analysis. In all cases the three recognised subfamilies (Alpha-, Beta- and Gammaherpesvirinae), and major sublineages in each subfamily, were clearly distinguished, but within sublineages some finer details of branching were incompletely resolved. Multiple-gene sets were assembled to give a broadly based tree. The root position of the tree was estimated by assuming a constant molecular clock and also by analysis of one herpesviral gene set (that encoding uracil-DNA glycosylase) using cellular homologues as outgroups. Both procedures placed the root between the Alphaherpesvirinae and the other two subfamilies. Substitution rates were calculated for the combined gene sets based on a previous estimate for alphaherpesviral UL27 genes, where the time base had been obtained according to the hypothesis of cospeciation of virus and host lineages. Assuming a constant molecular clock, it was then estimated that the three subfamilies arose approximately 180 to 220 million years ago, that major sublineages within subfamilies were probably generated before the mammalian radiation of 80 to 60 million years ago, and that speciations within sublineages took place in the last 80 million years, probably with a major component of cospeciation with host lineages.
Ma, Chuang; Xin, Mingming; Feldmann, Kenneth A.; Wang, Xiangfeng
2014-01-01
Machine learning (ML) is an intelligent data mining technique that builds a prediction model based on the learning of prior knowledge to recognize patterns in large-scale data sets. We present an ML-based methodology for transcriptome analysis via comparison of gene coexpression networks, implemented as an R package called machine learning–based differential network analysis (mlDNA), and apply this method to reanalyze a set of abiotic stress expression data in Arabidopsis thaliana. mlDNA first used an ML-based filtering process to remove nonexpressed, constitutively expressed, or non-stress-responsive “noninformative” genes prior to network construction, through learning the patterns of 32 expression characteristics of known stress-related genes. The retained “informative” genes were subsequently analyzed by ML-based network comparison to predict candidate stress-related genes showing expression and network differences between control and stress networks, based on 33 network topological characteristics. Comparative evaluation of the network-centric and gene-centric analytic methods showed that mlDNA substantially outperformed traditional statistical testing–based differential expression analysis at identifying stress-related genes, with markedly improved prediction accuracy. To experimentally validate the mlDNA predictions, we selected 89 candidates out of the 1784 predicted salt stress–related genes with available SALK T-DNA mutagenesis lines for phenotypic screening and identified two previously unreported genes, mutants of which showed salt-sensitive phenotypes. PMID:24520154
Robust Learning of High-dimensional Biological Networks with Bayesian Networks
NASA Astrophysics Data System (ADS)
Nägele, Andreas; Dejori, Mathäus; Stetter, Martin
Structure learning of Bayesian networks applied to gene expression data has become a potentially useful method to estimate interactions between genes. However, the NP-hardness of Bayesian network structure learning renders the reconstruction of the full genetic network with thousands of genes infeasible. Consequently, the maximal network size is usually restricted dramatically to a small set of genes (corresponding to variables in the Bayesian network). Although this feature reduction step makes structure learning computationally tractable, the learned structure might be adversely affected by the genes that are left out. Additionally, gene expression data are usually very sparse with respect to the number of samples, i.e., the number of genes is much greater than the number of different observations. Given these problems, learning robust network features from microarray data is a challenging task. This chapter presents several approaches tackling the robustness issue in order to obtain a more reliable estimation of learned network features.
Wang, W; Huang, S; Hou, W; Liu, Y; Fan, Q; He, A; Wen, Y; Hao, J; Guo, X; Zhang, F
2017-10-01
Objectives: Several genome-wide association studies (GWAS) of bone mineral density (BMD) have successfully identified multiple susceptibility genes, yet isolated susceptibility genes are often difficult to interpret biologically. The aim of this study was to unravel the genetic background of BMD at pathway level, by integrating BMD GWAS data with genome-wide expression quantitative trait loci (eQTLs) and methylation quantitative trait loci (meQTLs) data. Methods: We employed the GWAS datasets of BMD from the Genetic Factors for Osteoporosis Consortium (GEFOS), analysing patients' BMD. The areas studied included 32 735 femoral necks, 28 498 lumbar spines, and 8143 forearms. Genome-wide eQTLs (containing 923 021 eQTLs) and meQTLs (containing 683 152 unique methylation sites with local meQTLs) data sets were collected from recently published studies. Gene scores were first calculated by summary data-based Mendelian randomisation (SMR) software and meQTL-aligned GWAS results. Gene set enrichment analysis (GSEA) was then applied to identify BMD-associated gene sets with a predefined significance level of 0.05. Results: We identified multiple gene sets associated with BMD in one or more regions, including relevant known biological gene sets such as the Reactome Circadian Clock (GSEA p-value = 1.0 × 10⁻⁴ for lumbar spine and 2.7 × 10⁻² for femoral neck BMD in eQTLs-based GSEA) and insulin-like growth factor receptor binding (GSEA p-value = 5.0 × 10⁻⁴ for femoral neck and 2.6 × 10⁻² for lumbar spine BMD in meQTLs-based GSEA). Conclusion: Our results provided novel clues for subsequent functional analysis of bone metabolism, and illustrated the benefit of integrating eQTLs and meQTLs data into pathway association analysis for genetic studies of complex human diseases. Cite this article: W. Wang, S. Huang, W. Hou, Y. Liu, Q. Fan, A. He, Y. Wen, J. Hao, X. Guo, F. Zhang. Integrative analysis of GWAS, eQTLs and meQTLs data suggests that multiple gene sets are associated with bone mineral density.
Bone Joint Res 2017;6:572-576. © 2017 Wang et al.
Hu, Ting; Pan, Qinxin; Andrew, Angeline S; Langer, Jillian M; Cole, Michael D; Tomlinson, Craig R; Karagas, Margaret R; Moore, Jason H
2014-04-11
Several different genetic and environmental factors have been identified as independent risk factors for bladder cancer in population-based studies. Recent studies have turned to understanding the role of gene-gene and gene-environment interactions in determining risk. We previously developed the bioinformatics framework of statistical epistasis networks (SEN) to characterize the global structure of interacting genetic factors associated with a particular disease or clinical outcome. By applying SEN to a population-based study of bladder cancer among Caucasians in New Hampshire, we were able to identify a set of connected genetic factors with strong and significant interaction effects on bladder cancer susceptibility. To support our statistical findings using networks, in the present study, we performed pathway enrichment analyses on the set of genes identified using SEN, and found that they are associated with the carcinogen benzo[a]pyrene, a component of tobacco smoke. We further carried out an mRNA expression microarray experiment to validate statistical genetic interactions, and to determine if the set of genes identified in the SEN were differentially expressed in a normal bladder cell line and a bladder cancer cell line in the presence or absence of benzo[a]pyrene. Significant nonrandom sets of genes from the SEN were found to be differentially expressed in response to benzo[a]pyrene in both the normal bladder cells and the bladder cancer cells. In addition, the patterns of gene expression were significantly different between these two cell types. The enrichment analyses and the gene expression microarray results support the idea that SEN analysis of bladder cancer in population-based studies is able to identify biologically meaningful statistical patterns. These results bring us a step closer to a systems genetic approach to understanding cancer susceptibility that integrates population and laboratory-based studies.
A trace ratio maximization approach to multiple kernel-based dimensionality reduction.
Jiang, Wenhao; Chung, Fu-lai
2014-01-01
Most dimensionality reduction techniques are based on one metric or one kernel, hence it is necessary to select an appropriate kernel for kernel-based dimensionality reduction. Multiple kernel learning for dimensionality reduction (MKL-DR) has been recently proposed to learn a kernel from a set of base kernels which are seen as different descriptions of data. As MKL-DR does not involve regularization, it might be ill-posed under some conditions and consequently its applications are hindered. This paper proposes a multiple kernel learning framework for dimensionality reduction based on regularized trace ratio, termed as MKL-TR. Our method aims at learning a transformation into a space of lower dimension and a corresponding kernel from the given base kernels among which some may not be suitable for the given data. The solutions for the proposed framework can be found based on trace ratio maximization. The experimental results demonstrate its effectiveness in benchmark datasets, which include text, image and sound datasets, for supervised, unsupervised as well as semi-supervised settings. Copyright © 2013 Elsevier Ltd. All rights reserved.
Gao, Jianyong; Tian, Gang; Han, Xu; Zhu, Qiang
2018-01-01
Oral squamous cell carcinoma (OSCC) is the sixth most common type of cancer worldwide, with poor prognosis. The present study aimed to identify gene signatures that could classify OSCC and predict prognosis in different stages. A training data set (GSE41613) and two validation data sets (GSE42743 and GSE26549) were acquired from the online Gene Expression Omnibus database. In the training data set, patients were classified based on the tumor-node-metastasis staging system, and subsequently grouped into low stage (L) or high stage (H). Signature genes between L and H stages were selected by disparity index analysis, and classification was performed by the expression of these signature genes. The established classification was compared with the L and H classification, and fivefold cross validation was used to evaluate the stability. Enrichment analysis for the signature genes was implemented by the Database for Annotation, Visualization and Integrated Discovery. Two validation data sets were used to determine the precision of the classification. Survival analysis was conducted following each classification using the package ‘survival’ in R software. A set of 24 signature genes was identified based on the classification model with the Fi value of 0.47, which was used to distinguish OSCC samples in two different stages. Overall survival of patients in the H stage was higher than those in the L stage. Signature genes were primarily enriched in the ‘ether lipid metabolism’ pathway and biological processes such as ‘positive regulation of adaptive immune response’ and ‘apoptotic cell clearance’. The results provided a novel 24-gene set that may be used as biomarkers to predict OSCC prognosis with high accuracy, which may be used to determine an appropriate treatment program for patients with OSCC in addition to the traditional evaluation index. PMID:29257303
Fast Reduction Method in Dominance-Based Information Systems
NASA Astrophysics Data System (ADS)
Li, Yan; Zhou, Qinghua; Wen, Yongchuan
2018-01-01
In real-world applications, data often have continuous or preference-ordered values. Rough sets based on dominance relations can deal with these kinds of data effectively. Attribute reduction can be carried out within the dominance-relation based approach to better extract decision rules. However, the computational cost of the dominance classes greatly affects the efficiency of attribute reduction and rule extraction. This paper presents an efficient method of computing dominance classes, and compares it with the traditional method as the numbers of attributes and samples increase. Experiments on UCI data sets show that the proposed algorithm clearly improves on the efficiency of the traditional method, especially for large-scale data.
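To make the notion of a dominance class concrete, the following minimal Python sketch computes dominating and dominated sets by exhaustive pairwise comparison. It assumes the standard dominance-based rough set definition (object y dominates x when y is at least as good as x on every attribute, with larger values preferred) and corresponds to a naive quadratic baseline, not to the faster method proposed in the paper. The function name `dominance_classes` and the toy data are hypothetical.

```python
def dominance_classes(rows):
    """Naive O(n^2 * m) computation of dominance classes.

    rows: list of equal-length tuples of attribute values,
          assuming larger values are preferred on every attribute.
    Returns (dominating, dominated), where
      dominating[i] = indices of objects that dominate object i,
      dominated[i]  = indices of objects that object i dominates.
    """
    n = len(rows)
    dominating = {i: set() for i in range(n)}
    dominated = {i: set() for i in range(n)}
    for i, x in enumerate(rows):
        for j, y in enumerate(rows):
            # y dominates x if y >= x on every attribute
            if all(b >= a for a, b in zip(x, y)):
                dominating[i].add(j)
                dominated[j].add(i)
    return dominating, dominated


if __name__ == "__main__":
    # Hypothetical toy table: three objects, two attributes
    table = [(1, 2), (2, 3), (2, 1)]
    up, down = dominance_classes(table)
    print(up)
    print(down)
```

Every pair of objects is compared on every attribute, which is the cost that grows quickly with the number of samples and that an efficient method would aim to avoid.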
Radiation Quality Effects on Transcriptome Profiles in 3-D Cultures After Particle Irradiation
NASA Technical Reports Server (NTRS)
Patel, Z. S.; Kidane, Y. H.; Huff, J. L.
2014-01-01
In this work, we evaluate the differential effects of low- and high-LET radiation on 3-D organotypic cultures in order to investigate radiation quality impacts on gene expression and cellular responses. Reducing uncertainties in current risk models requires new knowledge on the fundamental differences in biological responses (the so-called radiation quality effects) triggered by heavy ion particle radiation versus low-LET radiation associated with Earth-based exposures. We are utilizing novel 3-D organotypic human tissue models that provide a format for study of human cells within a realistic tissue framework, thereby bridging the gap between 2-D monolayer culture and animal models for risk extrapolation to humans. To identify biological pathway signatures unique to heavy ion particle exposure, functional gene set enrichment analysis (GSEA) was used with whole transcriptome profiling. GSEA has been used extensively as a method to garner biological information in a variety of model systems but has not been commonly used to analyze radiation effects. It is a powerful approach for assessing the functional significance of radiation quality-dependent changes from datasets where the changes are subtle but broad, and where single gene based analysis using rankings of fold-change may not reveal important biological information. We identified 45 statistically significant gene sets at a q-value cutoff of 0.05, including 14 gene sets common to gamma and titanium irradiation, 19 gene sets specific to gamma irradiation, and 12 titanium-specific gene sets. Common gene sets largely align with DNA damage, cell cycle, early immune response, and inflammatory cytokine pathway activation. The top gene set enriched for the gamma- and titanium-irradiated samples involved KRAS pathway activation and genes activated in TNF-treated cells, respectively. Another difference noted for the high-LET samples was an apparent enrichment in gene sets involved in cell cycle/mitotic control.
It is plausible that the enrichment in these particular pathways results from the complex DNA damage resulting from high-LET exposure where repair processes are not completed during the same time scale as the less complex damage resulting from low-LET radiation.
When is hub gene selection better than standard meta-analysis?
Langfelder, Peter; Mischel, Paul S; Horvath, Steve
2013-01-01
Since hub nodes have been found to play important roles in many networks, highly connected hub genes are expected to play an important role in biology as well. However, the empirical evidence remains ambiguous. An open question is whether (or when) hub gene selection leads to more meaningful gene lists than a standard statistical analysis based on significance testing when analyzing genomic data sets (e.g., gene expression or DNA methylation data). Here we address this question for the special case when multiple genomic data sets are available. This is of great practical importance since for many research questions multiple data sets are publicly available. In this case, the data analyst can decide between a standard statistical approach (e.g., based on meta-analysis) and a co-expression network analysis approach that selects intramodular hubs in consensus modules. We assess the performance of these two types of approaches according to two criteria. The first criterion evaluates the biological insights gained and is relevant in basic research. The second criterion evaluates the validation success (reproducibility) in independent data sets and often applies in clinical diagnostic or prognostic applications. We compare meta-analysis with consensus network analysis based on weighted correlation network analysis (WGCNA) in three comprehensive and unbiased empirical studies: (1) finding genes predictive of lung cancer survival, (2) finding methylation markers related to age, and (3) finding mouse genes related to total cholesterol. The results demonstrate that intramodular hub gene status with respect to consensus modules is more useful than a meta-analysis p-value when identifying biologically meaningful gene lists (reflecting criterion 1). However, standard meta-analysis methods perform as well as (if not better than) a consensus network approach in terms of validation success (criterion 2).
The article also reports a comparison of meta-analysis techniques applied to gene expression data and presents novel R functions for carrying out consensus network analysis, network based screening, and meta analysis.
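As a rough illustration of what "highly connected hub gene" means in a weighted correlation network, the sketch below computes each gene's connectivity as the sum of soft-thresholded absolute Pearson correlations with all other genes. This is a simplified, single-data-set, whole-network version written in plain Python; it is not the consensus-module intramodular connectivity used in the study, and the function names, the power `beta=6`, and the toy expression profiles are assumptions chosen for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def connectivity(profiles, beta=6):
    """Sum of |cor|^beta between each gene and every other gene.

    profiles: dict mapping gene name -> list of expression values.
    beta: soft-thresholding power (6 is an assumed illustrative value).
    """
    genes = list(profiles)
    k = {g: 0.0 for g in genes}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            a = abs(pearson(profiles[g], profiles[h])) ** beta
            k[g] += a
            k[h] += a
    return k


if __name__ == "__main__":
    # Hypothetical toy expression profiles across four samples
    toy = {
        "g1": [1, 2, 3, 4],
        "g2": [2, 4, 6, 8],
        "g3": [1, 3, 2, 4],
        "g4": [4, 1, 3, 2],
    }
    k = connectivity(toy)
    for gene in sorted(k, key=k.get, reverse=True):
        print(gene, round(k[gene], 4))
```

Ranking genes by this connectivity yields a hub-based gene list, in contrast to a ranking by a significance-test p-value.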
Efficient Exploration of the Space of Reconciled Gene Trees
Szöllősi, Gergely J.; Rosikiewicz, Wojciech; Boussau, Bastien; Tannier, Eric; Daubin, Vincent
2013-01-01
Gene trees record the combination of gene-level events, such as duplication, transfer and loss (DTL), and species-level events, such as speciation and extinction. Gene tree–species tree reconciliation methods model these processes by drawing gene trees into the species tree using a series of gene and species-level events. The reconstruction of gene trees based on sequence alone almost always involves choosing between statistically equivalent or weakly distinguishable relationships that could be much better resolved based on a putative species tree. To exploit this potential for accurate reconstruction of gene trees, the space of reconciled gene trees must be explored according to a joint model of sequence evolution and gene tree–species tree reconciliation. Here we present amalgamated likelihood estimation (ALE), a probabilistic approach to exhaustively explore all reconciled gene trees that can be amalgamated as a combination of clades observed in a sample of gene trees. We implement the ALE approach in the context of a reconciliation model (Szöllősi et al. 2013), which allows for the DTL of genes. We use ALE to efficiently approximate the sum of the joint likelihood over amalgamations and to find the reconciled gene tree that maximizes the joint likelihood among all such trees. We demonstrate using simulations that gene trees reconstructed using the joint likelihood are substantially more accurate than those reconstructed using sequence alone. Using realistic gene tree topologies, branch lengths, and alignment sizes, we demonstrate that ALE produces more accurate gene trees even if the model of sequence evolution is greatly simplified. Finally, examining 1099 gene families from 36 cyanobacterial genomes, we find that joint likelihood-based inference results in a striking reduction in apparent phylogenetic discord, with 24%, 59%, and 46% reductions in the mean numbers of duplications, transfers, and losses per gene family, respectively.
The open source implementation of ALE is available from https://github.com/ssolo/ALE.git. [amalgamation; gene tree reconciliation; gene tree reconstruction; lateral gene transfer; phylogeny.] PMID:23925510
Ben-Ari Fuchs, Shani; Lieder, Iris; Stelzer, Gil; Mazor, Yaron; Buzhor, Ella; Kaplan, Sergey; Bogoch, Yoel; Plaschkes, Inbar; Shitrit, Alina; Rappaport, Noa; Kohn, Asher; Edgar, Ron; Shenhav, Liraz; Safran, Marilyn; Lancet, Doron; Guan-Golan, Yaron; Warshawsky, David; Shtrichman, Ronit
2016-03-01
Postgenomics data are produced in large volumes by life sciences and clinical applications of novel omics diagnostics and therapeutics for precision medicine. To move from "data-to-knowledge-to-innovation," a crucial missing step in the current era is, however, our limited understanding of biological and clinical contexts associated with data. Prominent among the emerging remedies to this challenge are the gene set enrichment tools. This study reports on GeneAnalytics™ (geneanalytics.genecards.org), a comprehensive and easy-to-apply gene set analysis tool for rapid contextualization of expression patterns and functional signatures embedded in the postgenomics Big Data domains, such as Next Generation Sequencing (NGS), RNAseq, and microarray experiments. GeneAnalytics' differentiating features include in-depth evidence-based scoring algorithms, an intuitive user interface and proprietary unified data. GeneAnalytics employs LifeMap Sciences' GeneCards suite, including GeneCards®, the human gene database; MalaCards, the human diseases database; and PathCards, the biological pathways database. Expression-based analysis in GeneAnalytics relies on LifeMap Discovery®, the embryonic development and stem cells database, which includes manually curated expression data for normal and diseased tissues, enabling an advanced matching algorithm for gene-tissue association. This assists in evaluating differentiation protocols and discovering biomarkers for tissues and cells. Results are directly linked to gene, disease, or cell "cards" in the GeneCards suite. Future developments aim to enhance the GeneAnalytics algorithm as well as visualizations, employing varied graphical display items.
Such attributes make GeneAnalytics a broadly applicable postgenomics data analyses and interpretation tool for translation of data to knowledge-based innovation in various Big Data fields such as precision medicine, ecogenomics, nutrigenomics, pharmacogenomics, vaccinomics, and others yet to emerge on the postgenomics horizon.
Deep Sequencing of Urinary RNAs for Bladder Cancer Molecular Diagnostics.
Sin, Mandy L Y; Mach, Kathleen E; Sinha, Rahul; Wu, Fan; Trivedi, Dharati R; Altobelli, Emanuela; Jensen, Kristin C; Sahoo, Debashis; Lu, Ying; Liao, Joseph C
2017-07-15
Purpose: The majority of bladder cancer patients present with localized disease and are managed by transurethral resection. However, the high rate of recurrence necessitates lifetime cystoscopic surveillance. Developing a sensitive and specific urine-based test would significantly improve bladder cancer screening, detection, and surveillance. Experimental Design: RNA-seq was used for biomarker discovery to directly assess the gene expression profile of exfoliated urothelial cells in urine derived from bladder cancer patients (n = 13) and controls (n = 10). Eight bladder cancer specific and 3 reference genes identified by RNA-seq were quantitated by qPCR in a training cohort of 102 urine samples. A diagnostic model based on the training cohort was constructed using multiple logistic regression. The model was further validated in an independent cohort of 101 urine samples. Results: A total of 418 genes were found to be differentially expressed between bladder cancer and controls. Validation of a subset of these genes was used to construct an equation for computing a probability of bladder cancer score (P_BC) based on expression of three markers (ROBO1, WNT5A, and CDC42BPB). Setting P_BC = 0.45 as the cutoff for a positive test, urine testing using the three-marker panel had overall 88% sensitivity and 92% specificity in the training cohort. The accuracy of the three-marker panel in the independent validation cohort yielded an AUC of 0.87 and overall 83% sensitivity and 89% specificity. Conclusions: Urine-based molecular diagnostics using this three-marker signature could provide a valuable adjunct to cystoscopy and may lead to a reduction of unnecessary procedures for bladder cancer diagnosis. Clin Cancer Res; 23(14); 3700-10. ©2017 AACR.
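The way a logistic-regression score and a fixed cutoff translate into sensitivity and specificity can be sketched in a few lines of Python. The abstract does not report the fitted coefficients, so the sketch takes them as parameters; the function names, the toy scores and labels, and the convention that a score equal to the cutoff counts as positive are all assumptions made for illustration. Only the cutoff value 0.45 comes from the abstract.

```python
from math import exp


def logistic_probability(intercept, coefs, values):
    """Probability from a logistic model: 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    z = intercept + sum(b * x for b, x in zip(coefs, values))
    return 1.0 / (1.0 + exp(-z))


def sensitivity_specificity(scores, labels, cutoff=0.45):
    """Sensitivity and specificity of the rule 'score >= cutoff is positive'.

    scores: predicted probabilities; labels: 1 = disease, 0 = control.
    Treating score == cutoff as positive is an assumption.
    """
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < cutoff)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    # Hypothetical scores and labels, not data from the study
    scores = [0.9, 0.6, 0.3, 0.2, 0.5, 0.1]
    labels = [1, 1, 1, 0, 0, 0]
    sens, spec = sensitivity_specificity(scores, labels)
    print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Raising the cutoff trades sensitivity for specificity, which is why a single operating point is reported alongside the threshold-independent AUC.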
NEAT: an efficient network enrichment analysis test.
Signorelli, Mirko; Vinciotti, Veronica; Wit, Ernst C
2016-09-05
Network enrichment analysis is a powerful method that integrates gene enrichment analysis with the information on relationships between genes provided by gene networks. Existing tests for network enrichment analysis deal only with undirected networks, can be computationally slow, and are based on normality assumptions. We propose NEAT, a test for network enrichment analysis. The test is based on the hypergeometric distribution, which naturally arises as the null distribution in this context. NEAT can be applied not only to undirected, but to directed and partially directed networks as well. Our simulations indicate that NEAT is considerably faster than alternative resampling-based methods, and that its capacity to detect enrichments is at least as good as that of alternative tests. We discuss applications of NEAT to network analyses in yeast by testing for enrichment of the Environmental Stress Response target gene set with GO Slim and KEGG functional gene sets, and also by inspecting associations between functional sets themselves. NEAT is a flexible and efficient test for network enrichment analysis that aims to overcome some limitations of existing resampling-based tests. The method is implemented in the R package neat, which can be freely downloaded from CRAN (https://cran.r-project.org/package=neat).
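For readers unfamiliar with hypergeometric tests, the sketch below computes a generic upper-tail hypergeometric p-value in Python using only the standard library. It illustrates the kind of null distribution the abstract refers to; it is not the neat package's implementation, and how NEAT maps network link counts onto the parameters is not reproduced here. The function name and the example numbers are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric.

    N: population size, K: number of 'successes' in the population,
    n: number of draws without replacement, k: observed successes.
    """
    lo = max(k, n - (N - K), 0)   # smallest attainable count that is >= k
    hi = min(K, n)                # largest attainable count
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(lo, hi + 1)) / total


if __name__ == "__main__":
    # Hypothetical numbers: 20 items, 5 marked, 4 drawn, 2 or more marked observed
    p = hypergeom_upper_tail(k=2, N=20, K=5, n=4)
    print(f"p-value = {p:.4f}")
```

Because the tail probability is available in closed form, no resampling is needed, which is consistent with the speed advantage over resampling-based tests described in the abstract.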
Arnardottir, Erna S.; Nikonova, Elena V.; Shockley, Keith R.; Podtelezhnikov, Alexei A.; Anafi, Ron C.; Tanis, Keith Q.; Maislin, Greg; Stone, David J.; Renger, John J.; Winrow, Christopher J.; Pack, Allan I.
2014-01-01
Study Objectives: To address whether changes in gene expression in blood cells with sleep loss are different in individuals resistant and sensitive to sleep deprivation. Design: Blood draws every 4 h during a 3-day study: 24-h normal baseline, 38 h of continuous wakefulness and subsequent recovery sleep, for a total of 19 time-points per subject, with every 2-h psychomotor vigilance task (PVT) assessment when awake. Setting: Sleep laboratory. Participants: Fourteen subjects who were previously identified as behaviorally resistant (n = 7) or sensitive (n = 7) to sleep deprivation by PVT. Intervention: Thirty-eight hours of continuous wakefulness. Measurements and Results: We found 4,481 unique genes with a significant 24-h diurnal rhythm during a normal sleep-wake cycle in blood (false discovery rate [FDR] < 5%). Biological pathways were enriched for biosynthetic processes during sleep. After accounting for circadian effects, two genes (SREBF1 and CPT1A, both involved in lipid metabolism) exhibited small, but significant, linear changes in expression with the duration of sleep deprivation (FDR < 5%). The main change with sleep deprivation was a reduction in the amplitude of the diurnal rhythm of expression of normally cycling probe sets. This reduction was noticeably higher in behaviorally resistant subjects than sensitive subjects, at any given P value. Furthermore, blood cell type enrichment analysis showed that the expression pattern difference between sensitive and resistant subjects is mainly found in cells of myeloid origin, such as monocytes. Conclusion: Individual differences in behavioral effects of sleep deprivation are associated with differences in diurnal amplitude of gene expression for genes that show circadian rhythmicity. Citation: Arnardottir ES, Nikonova EV, Shockley KR, Podtelezhnikov AA, Anafi RC, Tanis KQ, Maislin G, Stone DJ, Renger JJ, Winrow CJ, Pack AI. 
Blood-gene expression reveals reduced circadian rhythmicity in individuals resistant to sleep deprivation. SLEEP 2014;37(10):1589-1600. PMID:25197809
Fan, Qianrui; Wang, Wenyu; Hao, Jingcan; He, Awen; Wen, Yan; Guo, Xiong; Wu, Cuiyan; Ning, Yujie; Wang, Xi; Wang, Sen; Zhang, Feng
2017-08-01
Neuroticism is a fundamental personality trait with a significant genetic determinant. To identify novel susceptibility genes for neuroticism, we conducted an integrative analysis of genomic and transcriptomic data from a genome-wide association study (GWAS) and an expression quantitative trait locus (eQTL) study. GWAS summary data were derived from published studies of neuroticism, involving 170,906 subjects in total. An eQTL dataset containing 927,753 eQTLs was obtained from an eQTL meta-analysis of 5311 samples. Integrative analysis of GWAS and eQTL data was conducted with summary data-based Mendelian randomization (SMR) analysis software. To identify neuroticism-associated gene sets, the SMR analysis results were further subjected to gene set enrichment analysis (GSEA). The gene set annotation dataset (containing 13,311 annotated gene sets) of the GSEA Molecular Signatures Database was used. SMR single gene analysis identified 6 significant genes for neuroticism, including MSRA (p value=2.27×10^-10), MGC57346 (p value=6.92×10^-7), BLK (p value=1.01×10^-6), XKR6 (p value=1.11×10^-6), C17ORF69 (p value=1.12×10^-6) and KIAA1267 (p value=4.00×10^-6). Gene set enrichment analysis detected a significant association for the Chr8p23 gene set (false discovery rate=0.033). Our results provide novel clues for studies of the genetic mechanism of neuroticism. Copyright © 2017. Published by Elsevier Inc.
GeneTopics - interpretation of gene sets via literature-driven topic models
2013-01-01
Background Annotation of a set of genes is often accomplished through comparison to a library of labelled gene sets such as biological processes or canonical pathways. However, this approach might fail if the employed libraries are not up to date with the latest research, don't capture relevant biological themes or are curated at a different level of granularity than is required to appropriately analyze the input gene set. At the same time, the vast biomedical literature offers an unstructured repository of the latest research findings that can be tapped to provide thematic sub-groupings for any input gene set. Methods Our proposed method relies on a gene-specific text corpus and extracts commonalities between documents in an unsupervised manner using a topic model approach. We automatically determine the number of topics summarizing the corpus and calculate a gene relevancy score for each topic allowing us to eliminate non-specific topics. As a result we obtain a set of literature topics in which each topic is associated with a subset of the input genes providing directly interpretable keywords and corresponding documents for literature research. Results We validate our method based on labelled gene sets from the KEGG metabolic pathway collection and the genetic association database (GAD) and show that the approach is able to detect topics consistent with the labelled annotation. Furthermore, we discuss the results on three different types of experimentally derived gene sets, (1) differentially expressed genes from a cardiac hypertrophy experiment in mice, (2) altered transcript abundance in human pancreatic beta cells, and (3) genes implicated by GWA studies to be associated with metabolite levels in a healthy population. In all three cases, we are able to replicate findings from the original papers in a quick and semi-automated manner. 
Conclusions Our approach provides a novel way of automatically generating meaningful annotations for gene sets that are directly tied to relevant articles in the literature. Extending a general topic model method, the approach introduced here establishes a workflow for the interpretation of gene sets generated from diverse experimental scenarios that can complement the classical approach of comparison to reference gene sets. PMID:24564875
Lee, Chang Soo; Lee, Jiyoung
2010-09-01
A rapid and specific gyrB-based real-time PCR system has been developed for detecting Bacteroides fragilis as a human-specific marker of fecal contamination. Its specificity and sensitivity were evaluated by comparison with other 16S rRNA gene-based primers using closely related Bacteroides and Prevotella. Many studies have used 16S rRNA gene-based methods targeting Bacteroides because this genus is relatively abundant in human feces and is useful for microbial source tracking. However, 16S rRNA gene-based primers are evolutionarily too conserved among taxa to discriminate between human-specific species of Bacteroides and other closely related genera, such as Prevotella. Recently, one of the housekeeping genes, gyrB, has been used as an alternative target in multilocus sequence analysis (MLSA) to provide greater phylogenetic resolution. In this study, a new B. fragilis-specific primer set (Bf904F/Bf958R) was designed from alignments of 322 gyrB genes, and its performance was compared with that of the 16S rRNA gene-based primers in the presence of B. fragilis, Bacteroides ovatus and Prevotella melaninogenica. Amplicons were sequenced and a phylogenetic tree was constructed to confirm the specificity of the primers to B. fragilis. The gyrB-based primers successfully discriminated B. fragilis from B. ovatus and P. melaninogenica. Real-time PCR results showed that the gyrB primer set had a comparable sensitivity in the detection of B. fragilis when compared with the 16S rRNA primer set. The host-specificity of our gyrB-based primer set was validated with human, pig, cow, and dog fecal samples. The gyrB primer system had superior human-specificity. The gyrB-based system can rapidly detect human-specific fecal sources and can be used for improved source tracking of human contamination. (c) 2010 Elsevier B.V. All rights reserved.
A cis-regulatory logic simulator.
Zeigler, Robert D; Gertz, Jason; Cohen, Barak A
2007-07-27
A major goal of computational studies of gene regulation is to accurately predict the expression of genes based on the cis-regulatory content of their promoters. The development of computational methods to decode the interactions among cis-regulatory elements has been slow, in part, because it is difficult to know, without extensive experimental validation, whether a particular method identifies the correct cis-regulatory interactions that underlie a given set of expression data. There is an urgent need for test expression data in which the interactions among cis-regulatory sites that produce the data are known. The ability to rapidly generate such data sets would facilitate the development and comparison of computational methods that predict gene expression patterns from promoter sequence. We developed a gene expression simulator which generates expression data using user-defined interactions between cis-regulatory sites. The simulator can incorporate additive, cooperative, competitive, and synergistic interactions between regulatory elements. Constraints on the spacing, distance, and orientation of regulatory elements and their interactions may also be defined and Gaussian noise can be added to the expression values. The simulator allows for a data transformation that simulates the sigmoid shape of expression levels from real promoters. We found good agreement between sets of simulated promoters and predicted regulatory modules from real expression data. We present several data sets that may be useful for testing new methodologies for predicting gene expression from promoter sequence. We developed a flexible gene expression simulator that rapidly generates large numbers of simulated promoters and their corresponding transcriptional output based on specified interactions between cis-regulatory sites. When appropriate rule sets are used, the data generated by our simulator faithfully reproduces experimentally derived data sets. 
We anticipate that using simulated gene expression data sets will facilitate the direct comparison of computational strategies to predict gene expression from promoter sequence. The source code is available online and as additional material. The test sets are available as additional material.
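The kind of rule-based simulation described in this abstract can be sketched in a few lines of Python. This is not the authors' source code: the site names, weights, and the single cooperative pair are hypothetical, and applying the noise before the sigmoid transformation is an assumption, since the abstract does not state the order.

```python
import math
import random

# Hypothetical rule set: additive site weights and one cooperative pair.
ADDITIVE = {"A": 1.0, "B": 0.5, "R": -1.5}
COOPERATIVE = {("A", "B"): 2.0}  # extra contribution when both sites are present


def simulate_expression(sites, noise_sd=0.0, rng=None):
    """Return a simulated expression level in (0, 1) for one promoter.

    sites: iterable of site names present in the promoter.
    noise_sd: standard deviation of Gaussian noise added to the raw score.
    """
    rng = rng or random.Random()
    present = set(sites)
    # Additive contributions of individual sites.
    raw = sum(ADDITIVE.get(s, 0.0) for s in present)
    # Cooperative bonus when both members of a pair occur together.
    for (a, b), bonus in COOPERATIVE.items():
        if a in present and b in present:
            raw += bonus
    if noise_sd > 0:
        raw += rng.gauss(0.0, noise_sd)
    # Sigmoid transformation of the raw score.
    return 1.0 / (1.0 + math.exp(-raw))
```

Passing a seeded `random.Random` instance makes the noisy output reproducible, which is convenient when generating benchmark sets.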
Mechanisms of gap gene expression canalization in the Drosophila blastoderm.
Gursky, Vitaly V; Panok, Lena; Myasnikova, Ekaterina M; Manu; Samsonova, Maria G; Reinitz, John; Samsonov, Alexander M
2011-01-01
Extensive variation in early gap gene expression in the Drosophila blastoderm is reduced over time because of gap gene cross regulation. This phenomenon is a manifestation of canalization, the ability of an organism to produce a consistent phenotype despite variations in genotype or environment. The canalization of gap gene expression can be understood as arising from the actions of attractors in the gap gene dynamical system. In order to better understand the processes of developmental robustness and canalization in the early Drosophila embryo, we investigated the dynamical effects of varying spatial profiles of Bicoid protein concentration on the formation of the expression border of the gap gene hunchback. At several positions on the anterior-posterior axis of the embryo, we analyzed attractors and their basins of attraction in a dynamical model describing expression of four gap genes with the Bicoid concentration profile accounted as a given input in the model equations. This model was tested against a family of Bicoid gradients obtained from individual embryos. These gradients were normalized by two independent methods, which are based on distinct biological hypotheses and provide different magnitudes for Bicoid spatial variability. We showed how the border formation is dictated by the biological initial conditions (the concentration gradient of maternal Hunchback protein) being attracted to specific attracting sets in a local vicinity of the border. Different types of these attracting sets (point attractors or one dimensional attracting manifolds) define several possible mechanisms of border formation. The hunchback border formation is associated with intersection of the spatial gradient of the maternal Hunchback protein and a boundary between the attraction basins of two different point attractors. We demonstrated how the positional variability for hunchback is related to the corresponding variability of the basin boundaries. 
The observed reduction in variability of the hunchback gene expression can be accounted for by specific geometrical properties of the basin boundaries. We clarified the mechanisms of gap gene expression canalization in early Drosophila embryos. These mechanisms were specified in the case of hunchback in well defined terms of the dynamical system theory.
Kar, Siddhartha P.; Tyrer, Jonathan P.; Li, Qiyuan; Lawrenson, Kate; Aben, Katja K.H.; Anton-Culver, Hoda; Antonenkova, Natalia; Chenevix-Trench, Georgia; Baker, Helen; Bandera, Elisa V.; Bean, Yukie T.; Beckmann, Matthias W.; Berchuck, Andrew; Bisogna, Maria; Bjørge, Line; Bogdanova, Natalia; Brinton, Louise; Brooks-Wilson, Angela; Butzow, Ralf; Campbell, Ian; Carty, Karen; Chang-Claude, Jenny; Chen, Yian Ann; Chen, Zhihua; Cook, Linda S.; Cramer, Daniel; Cunningham, Julie M.; Cybulski, Cezary; Dansonka-Mieszkowska, Agnieszka; Dennis, Joe; Dicks, Ed; Doherty, Jennifer A.; Dörk, Thilo; du Bois, Andreas; Dürst, Matthias; Eccles, Diana; Easton, Douglas F.; Edwards, Robert P.; Ekici, Arif B.; Fasching, Peter A.; Fridley, Brooke L.; Gao, Yu-Tang; Gentry-Maharaj, Aleksandra; Giles, Graham G.; Glasspool, Rosalind; Goode, Ellen L.; Goodman, Marc T.; Grownwald, Jacek; Harrington, Patricia; Harter, Philipp; Hein, Alexander; Heitz, Florian; Hildebrandt, Michelle A.T.; Hillemanns, Peter; Hogdall, Estrid; Hogdall, Claus K.; Hosono, Satoyo; Iversen, Edwin S.; Jakubowska, Anna; Paul, James; Jensen, Allan; Ji, Bu-Tian; Karlan, Beth Y; Kjaer, Susanne K.; Kelemen, Linda E.; Kellar, Melissa; Kelley, Joseph; Kiemeney, Lambertus A.; Krakstad, Camilla; Kupryjanczyk, Jolanta; Lambrechts, Diether; Lambrechts, Sandrina; Le, Nhu D.; Lee, Alice W.; Lele, Shashi; Leminen, Arto; Lester, Jenny; Levine, Douglas A.; Liang, Dong; Lissowska, Jolanta; Lu, Karen; Lubinski, Jan; Lundvall, Lene; Massuger, Leon; Matsuo, Keitaro; McGuire, Valerie; McLaughlin, John R.; McNeish, Iain A.; Menon, Usha; Modugno, Francesmary; Moysich, Kirsten B.; Narod, Steven A.; Nedergaard, Lotte; Ness, Roberta B.; Nevanlinna, Heli; Odunsi, Kunle; Olson, Sara H.; Orlow, Irene; Orsulic, Sandra; Weber, Rachel Palmieri; Pearce, Celeste Leigh; Pejovic, Tanja; Pelttari, Liisa M.; Permuth-Wey, Jennifer; Phelan, Catherine M.; Pike, Malcolm C.; Poole, Elizabeth M.; Ramus, Susan J.; Risch, Harvey A.; Rosen, Barry; Rossing, Mary 
Anne; Rothstein, Joseph H.; Rudolph, Anja; Runnebaum, Ingo B.; Rzepecka, Iwona K.; Salvesen, Helga B.; Schildkraut, Joellen M.; Schwaab, Ira; Shu, Xiao-Ou; Shvetsov, Yurii B; Siddiqui, Nadeem; Sieh, Weiva; Song, Honglin; Southey, Melissa C.; Sucheston-Campbell, Lara E.; Tangen, Ingvild L.; Teo, Soo-Hwang; Terry, Kathryn L.; Thompson, Pamela J; Timorek, Agnieszka; Tsai, Ya-Yu; Tworoger, Shelley S.; van Altena, Anne M.; Van Nieuwenhuysen, Els; Vergote, Ignace; Vierkant, Robert A.; Wang-Gohrke, Shan; Walsh, Christine; Wentzensen, Nicolas; Whittemore, Alice S.; Wicklund, Kristine G.; Wilkens, Lynne R.; Woo, Yin-Ling; Wu, Xifeng; Wu, Anna; Yang, Hannah; Zheng, Wei; Ziogas, Argyrios; Sellers, Thomas A.; Monteiro, Alvaro N. A.; Freedman, Matthew L.; Gayther, Simon A.; Pharoah, Paul D. P.
2015-01-01
Background Genome-wide association studies (GWAS) have so far reported 12 loci associated with serous epithelial ovarian cancer (EOC) risk. We hypothesized that some of these loci function through nearby transcription factor (TF) genes and that putative target genes of these TFs as identified by co-expression may also be enriched for additional EOC risk associations. Methods We selected TF genes within 1 Mb of the top signal at the 12 genome-wide significant risk loci. Mutual information, a form of correlation, was used to build networks of genes strongly co-expressed with each selected TF gene in the unified microarray data set of 489 serous EOC tumors from The Cancer Genome Atlas. Genes represented in this data set were subsequently ranked using a gene-level test based on results for germline SNPs from a serous EOC GWAS meta-analysis (2,196 cases/4,396 controls). Results Gene set enrichment analysis identified six networks centered on TF genes (HOXB2, HOXB5, HOXB6, HOXB7 at 17q21.32 and HOXD1, HOXD3 at 2q31) that were significantly enriched for genes from the risk-associated end of the ranked list (P<0.05 and FDR<0.05). These results were replicated (P<0.05) using an independent association study (7,035 cases/21,693 controls). Genes underlying enrichment in the six networks were pooled into a combined network. Conclusion We identified a HOX-centric network associated with serous EOC risk containing several genes with known or emerging roles in serous EOC development. Impact Network analysis integrating large, context-specific data sets has the potential to offer mechanistic insights into cancer susceptibility and prioritize genes for experimental characterization. PMID:26209509
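Mutual information between two expression profiles can be estimated from discretized values, as in the Python sketch below. The abstract does not say how expression was discretized or which estimator was used, so the median split in `binarize` and the plug-in estimator are assumptions chosen for brevity.

```python
import math
from collections import Counter


def binarize(values):
    """Hypothetical discretization: 1 if at or above the upper median, else 0."""
    cut = sorted(values)[len(values) // 2]
    return [int(v >= cut) for v in values]


def mutual_information(x, y):
    """Plug-in mutual information (in bits) between two equally long
    discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts.
    return sum(
        (c / n) * math.log2(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )
```

Genes whose discretized profiles share high mutual information with a transcription factor gene would then be candidates for that gene's co-expression network.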
Liu, Qiang; Su, Rong-Chuan; Yi, Wen-Jing; Zheng, Li-Ting; Lu, Shan-Shan; Zhao, Zhi-Gang
2017-03-31
A series of tocopherol-based cationic lipids 3a-3f bearing a pH-sensitive imidazole moiety in the dipeptide headgroup and a reduction-responsive disulfide linkage were designed and synthesized. Acid-base titration of these lipids showed good buffering capacities. The liposomes formed from 3 and the co-lipid 1,2-dioleoyl-sn-glycero-3-phosphocholine (DOPC) could efficiently bind and condense DNA into nanoparticles. Gel binding and HPLC assays confirmed that the encapsulated DNA could be released from lipoplexes 3 upon addition of 10 mM glutathione (GSH). MTT assays in HEK 293 cells demonstrated that lipoplexes 3 had low cytotoxicity. The in vitro gene transfection studies showed that cationic dipeptide headgroups clearly affected the transfection efficiency (TE), and the arginine-histidine based dipeptide lipid 3f gave the best TE, which was 30.4 times higher than that of Lipofectamine 3000 in the presence of 10% serum. Cell-uptake assays indicated that dipeptide cationic lipids containing basic amino acids exhibited more efficient cell uptake than dipeptide lipids based on serine and aromatic amino acids. Confocal laser scanning microscopy (CLSM) studies corroborated that 3 could efficiently deliver and release DNA into the nuclei of HeLa cells. These results suggest that tocopherol-based dipeptide cationic lipids with pH and reduction dual-sensitive characteristics might be promising non-viral gene delivery vectors. Copyright © 2017 Elsevier Masson SAS. All rights reserved.
Construction of a minimal genome as a chassis for synthetic biology.
Sung, Bong Hyun; Choe, Donghui; Kim, Sun Chang; Cho, Byung-Kwan
2016-11-30
Microbial diversity and complexity pose challenges in understanding the voluminous genetic information produced from whole-genome sequences, bioinformatics and high-throughput '-omics' research. These challenges can be overcome by a core blueprint of a genome drawn with a minimal gene set, which is essential for life. Systems biology and large-scale gene inactivation studies have estimated the number of essential genes to be ∼300-500 in many microbial genomes. On the basis of the essential gene set information, minimal-genome strains have been generated using sophisticated genome engineering techniques, such as genome reduction and chemical genome synthesis. Current size-reduced genomes are not perfect minimal genomes, but chemically synthesized genomes have just been constructed. Some minimal genomes provide various desirable functions for bioindustry, such as improved genome stability, increased transformation efficacy and improved production of biomaterials. The minimal genome as a chassis genome for synthetic biology can be used to construct custom-designed genomes for various practical and industrial applications. © 2016 The Author(s). published by Portland Press Limited on behalf of the Biochemical Society.
Inference of Evolutionary Forces Acting on Human Biological Pathways
Daub, Josephine T.; Dupanloup, Isabelle; Robinson-Rechavi, Marc; Excoffier, Laurent
2015-01-01
Because natural selection is likely to act on multiple genes underlying a given phenotypic trait, we study here the potential effect of ongoing and past selection on the genetic diversity of human biological pathways. We first show that genes included in gene sets are generally under stronger selective constraints than other genes and that their evolutionary response is correlated. We then introduce a new procedure to detect selection at the pathway level based on a decomposition of the classical McDonald–Kreitman test extended to multiple genes. This new test, called 2DNS, detects outlier gene sets and takes into account past demographic effects and evolutionary constraints specific to gene sets. Selective forces acting on gene sets can be easily identified by a mere visual inspection of the position of the gene sets relative to their two-dimensional null distribution. We thus find several outlier gene sets that show signals of positive, balancing, or purifying selection but also others showing an ancient relaxation of selective constraints. The principle of the 2DNS test can also be applied to other genomic contrasts. For instance, the comparison of patterns of polymorphisms private to African and non-African populations reveals that most pathways show a higher proportion of nonsynonymous mutations in non-Africans than in Africans, potentially due to different demographic histories and selective pressures. PMID:25971280
Mechanism-based biomarker gene sets for glutathione depletion-related hepatotoxicity in rats
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gao Weihua; Mizukawa, Yumiko; Nakatsu, Noriyuki
Chemical-induced glutathione depletion is thought to be caused by two types of toxicological mechanisms: PHO-type glutathione depletion [glutathione conjugated with chemicals such as phorone (PHO) or diethyl maleate (DEM)], and BSO-type glutathione depletion [i.e., glutathione synthesis inhibited by chemicals such as L-buthionine-sulfoximine (BSO)]. In order to identify mechanism-based biomarker gene sets for glutathione depletion in rat liver, male SD rats were treated with various chemicals including PHO (40, 120 and 400 mg/kg), DEM (80, 240 and 800 mg/kg), BSO (150, 450 and 1500 mg/kg), and bromobenzene (BBZ, 10, 100 and 300 mg/kg). Liver samples were taken 3, 6, 9 and 24 h after administration and examined for hepatic glutathione content, physiological and pathological changes, and gene expression changes using Affymetrix GeneChip Arrays. To identify differentially expressed probe sets in response to glutathione depletion, we focused on the following two courses of events for the two types of mechanisms of glutathione depletion: a) gene expression changes occurring simultaneously in response to glutathione depletion, and b) gene expression changes after glutathione was depleted. The gene expression profiles of the identified probe sets for the two types of glutathione depletion differed markedly at times during and after glutathione depletion, whereas Srxn1 was markedly increased for both types as glutathione was depleted, suggesting that Srxn1 is a key molecule in oxidative stress related to glutathione. The extracted probe sets were refined and verified using various compounds including 13 additional positive or negative compounds, and they established two useful marker sets. One contained three probe sets (Akr7a3, Trib3 and Gstp1) that could detect conjugation-type glutathione depletors any time within 24 h after dosing, and the other contained 14 probe sets that could detect glutathione depletors by any mechanism. 
These two sets, with appropriate scoring systems, could be promising biomarkers for preclinical examination of hepatotoxicity.
Identification of an Efficient Gene Expression Panel for Glioblastoma Classification
Zelaya, Ivette; Laks, Dan R.; Zhao, Yining; Kawaguchi, Riki; Gao, Fuying; Kornblum, Harley I.; Coppola, Giovanni
2016-01-01
We present here a novel genetic algorithm-based random forest (GARF) modeling technique that enables a reduction in the complexity of large gene disease signatures to highly accurate, greatly simplified gene panels. When applied to 803 glioblastoma multiforme samples, this method allowed the 840-gene Verhaak et al. gene panel (the standard in the field) to be reduced to a 48-gene classifier, while retaining 90.91% classification accuracy, and outperforming the best available alternative methods. Additionally, using this approach we produced a 32-gene panel which allows for better consistency between RNA-seq and microarray-based classifications, improving cross-platform classification retention from 69.67% to 86.07%. A webpage producing these classifications is available at http://simplegbm.semel.ucla.edu. PMID:27855170
Gao, Qian; Liu, Lu; Li, Hai-Mei; Tang, Yi-Lang; Wu, Zhao-Min; Chen, Yun; Wang, Yu-Feng; Qian, Qiu-Jin
2015-01-01
As candidate genes for attention-deficit/hyperactivity disorder (ADHD), monoamine oxidase A (MAOA) and synaptophysin (SYP) are both on the X chromosome and have been suggested to be associated with the predominantly inattentive subtype (ADHD-I). The present study investigated the potential gene-gene interaction (G × G) between rs5905859 of MAOA and rs5906754 of SYP for ADHD in Chinese Han subjects. For the family-based association study, 177 female trios were included. For the case-control study, 1,462 probands and 807 normal controls were recruited. The ADHD Rating Scale-IV (ADHD-RS-IV) was used to evaluate ADHD symptoms. Pedigree-based generalized multifactor dimensionality reduction (PGMDR) for female ADHD trios indicated a significant gene interaction effect of rs5905859 and rs5906754. Generalized multifactor dimensionality reduction (GMDR) indicated potential gene-gene interplay on ADHD-RS-IV scores in female ADHD-I. No associations were observed in male subjects in the case-control analysis. In conclusion, our findings suggest that the interaction of MAOA and SYP may be involved in the genetic mechanism of the ADHD-I subtype and may predict ADHD symptoms. © 2014 Wiley Periodicals, Inc.
A Function for the hnRNP A1/A2 Proteins in Transcription Elongation.
Lemieux, Bruno; Blanchette, Marco; Monette, Anne; Mouland, Andrew J; Wellinger, Raymund J; Chabot, Benoit
2015-01-01
The hnRNP A1 and A2 proteins regulate processes such as alternative pre-mRNA splicing and mRNA stability. Here, we report that a reduction in the levels of hnRNP A1 and A2 by RNA interference or their cytoplasmic retention by osmotic stress drastically increases the transcription of a reporter gene. Based on previous work, we propose that this effect may be linked to a decrease in the activity of the transcription elongation factor P-TEFb. Consistent with this hypothesis, the transcription of the reporter gene was stimulated when the catalytic component of P-TEFb, CDK9, was inhibited with DRB. While low levels of A1/A2 stimulated the association of RNA polymerase II with the reporter gene, they also increased the association of CDK9 with the repressor 7SK RNA, and compromised the recovery of promoter-distal transcription on the Kitlg gene after the release of pausing. Transcriptome analysis revealed that more than 50% of the genes whose expression was affected by the siRNA-mediated depletion of A1/A2 were also affected by DRB. RNA polymerase II-chromatin immunoprecipitation assays on DRB-treated and A1/A2-depleted cells identified a common set of repressed genes displaying increased occupancy of polymerases at promoter-proximal locations, consistent with pausing. Overall, our results suggest that lowering the levels of hnRNP A1/A2 elicits defective transcription elongation on a fraction of P-TEFb-dependent genes, hence favoring the transcription of P-TEFb-independent genes.
dbWFA: a web-based database for functional annotation of Triticum aestivum transcripts
Vincent, Jonathan; Dai, Zhanwu; Ravel, Catherine; Choulet, Frédéric; Mouzeyar, Said; Bouzidi, M. Fouad; Agier, Marie; Martre, Pierre
2013-01-01
The functional annotation of genes based on sequence homology with genes from model species genomes is time-consuming because it is necessary to mine several unrelated databases. The aim of the present work was to develop a functional annotation database for common wheat Triticum aestivum (L.). The database, named dbWFA, is based on the reference NCBI UniGene set, an expressed gene catalogue built by expressed sequence tag clustering, and on full-length coding sequences retrieved from the TriFLDB database. Information from good-quality heterogeneous sources, including annotations for model plant species Arabidopsis thaliana (L.) Heynh. and Oryza sativa L., was gathered and linked to T. aestivum sequences through BLAST-based homology searches. Even though the complexity of the transcriptome cannot yet be fully appreciated, we developed a tool to easily and promptly obtain information from multiple functional annotation systems (Gene Ontology, MapMan bin codes, MIPS Functional Categories, PlantCyc pathway reactions and TAIR gene families). The use of dbWFA is illustrated here with several query examples. We were able to assign a putative function to 45% of the UniGenes and 81% of the full-length coding sequences from TriFLDB. Moreover, comparison of the annotation of the whole T. aestivum UniGene set along with curated annotations of the two model species assessed the accuracy of the annotation provided by dbWFA. To further illustrate the use of dbWFA, genes specifically expressed during the early cell division or late storage polymer accumulation phases of T. aestivum grain development were identified using a clustering analysis and then annotated using dbWFA. The annotation of these two sets of genes was consistent with previous analyses of T. aestivum grain transcriptomes and proteomes. Database URL: urgi.versailles.inra.fr/dbWFA/ PMID:23660284
Nandi, Sutanu; Subramanian, Abhishek; Sarkar, Ram Rup
2017-07-25
Prediction of essential genes helps to identify a minimal set of genes that are absolutely required for the appropriate functioning and survival of a cell. The available machine learning techniques for essential gene prediction have inherent problems, like imbalanced provision of training datasets, biased choice of the best model for a given balanced dataset, choice of a complex machine learning algorithm, and data-based automated selection of biologically relevant features for classification. Here, we propose a simple support vector machine-based learning strategy for the prediction of essential genes in Escherichia coli K-12 MG1655 metabolism that integrates a non-conventional combination of an appropriate sample balanced training set, a unique organism-specific genotype, phenotype attributes that characterize essential genes, and optimal parameters of the learning algorithm to generate the best machine learning model (the model with the highest accuracy among all the models trained for different sample training sets). For the first time, we also introduce flux-coupled metabolic subnetwork-based features for enhancing the classification performance. Our strategy proves to be superior as compared to previous SVM-based strategies in obtaining a biologically relevant classification of genes with high sensitivity and specificity. This methodology was also trained with datasets of other recent supervised classification techniques for essential gene classification and tested using reported test datasets. The testing accuracy was always high as compared to the known techniques, proving that our method outperforms known methods. Observations from our study indicate that essential genes are conserved among homologous bacterial species, demonstrate high codon usage bias, GC content and gene expression, and predominantly possess a tendency to form physiological flux modules in metabolism.
oPOSSUM: identification of over-represented transcription factor binding sites in co-expressed genes
Ho Sui, Shannan J.; Mortimer, James R.; Arenillas, David J.; Brumm, Jochen; Walsh, Christopher J.; Kennedy, Brian P.; Wasserman, Wyeth W.
2005-01-01
Targeted transcript profiling studies can identify sets of co-expressed genes; however, identification of the underlying functional mechanism(s) is a significant challenge. Established methods for the analysis of gene annotations, particularly those based on the Gene Ontology, can identify functional linkages between genes. Similar methods for the identification of over-represented transcription factor binding sites (TFBSs) have been successful in yeast, but extension to human genomics has largely proved ineffective. Creation of a system for the efficient identification of common regulatory mechanisms in a subset of co-expressed human genes promises to break a roadblock in functional genomics research. We have developed an integrated system that searches for evidence of co-regulation by one or more transcription factors (TFs). oPOSSUM combines a pre-computed database of conserved TFBSs in human and mouse promoters with statistical methods for identification of sites over-represented in a set of co-expressed genes. The algorithm successfully identified mediating TFs in control sets of tissue-specific genes and in sets of co-expressed genes from three transcript profiling studies. Simulation studies indicate that oPOSSUM produces few false positives using empirically defined thresholds and can tolerate up to 50% noise in a set of co-expressed genes. PMID:15933209
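The abstract above does not specify which statistics oPOSSUM uses to call a binding site over-represented. As a minimal sketch of one common choice, assuming a one-sided hypergeometric test, the tail probability can be computed as below. The function name and arguments are hypothetical and are not part of oPOSSUM.

```python
from math import comb

def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k): chance of seeing at least k site-carrying genes in a
    co-expressed set of n genes, drawn from a background of N genes of
    which K carry the site. Illustrative only; not oPOSSUM's own code."""
    total = comb(N, n)
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total

# Toy background of 10 genes, 3 with the site; both genes in a set of 2 carry it.
p = hypergeom_upper_tail(2, 3, 2, 10)
```

A small p here would suggest the site occurs in the gene set more often than the background predicts; a real analysis would also correct for the number of binding-site profiles tested.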
Ritchie, Marylyn D.; Hahn, Lance W.; Roodi, Nady; Bailey, L. Renee; Dupont, William D.; Parl, Fritz F.; Moore, Jason H.
2001-01-01
One of the greatest challenges facing human geneticists is the identification and characterization of susceptibility genes for common complex multifactorial human diseases. This challenge is partly due to the limitations of parametric-statistical methods for detection of gene effects that are dependent solely or partially on interactions with other genes and with environmental exposures. We introduce multifactor-dimensionality reduction (MDR) as a method for reducing the dimensionality of multilocus information, to improve the identification of polymorphism combinations associated with disease risk. The MDR method is nonparametric (i.e., no hypothesis about the value of a statistical parameter is made), is model-free (i.e., it assumes no particular inheritance model), and is directly applicable to case-control and discordant-sib-pair studies. Using simulated case-control data, we demonstrate that MDR has reasonable power to identify interactions among two or more loci in relatively small samples. When it was applied to a sporadic breast cancer case-control data set, in the absence of any statistically significant independent main effects, MDR identified a statistically significant high-order interaction among four polymorphisms from three different estrogen-metabolism genes. To our knowledge, this is the first report of a four-locus interaction associated with a common complex multifactorial disease. PMID:11404819
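The pooling step that gives MDR its name can be sketched as follows. This is a simplified illustration under stated assumptions: each multilocus genotype combination is labelled high risk when its case:control ratio meets or exceeds the overall ratio, and the cross-validation used in practice is omitted. All names are hypothetical.

```python
from collections import Counter

def mdr_high_risk_cells(genotypes, status, loci):
    """Return the genotype combinations (cells) at the chosen loci that
    are labelled high risk. status holds 1 for cases and 0 for controls."""
    cases, controls = Counter(), Counter()
    for row, is_case in zip(genotypes, status):
        cell = tuple(row[i] for i in loci)
        (cases if is_case else controls)[cell] += 1
    n_cases = sum(status)
    n_controls = len(status) - n_cases
    # Cross-multiplied comparison of the cell ratio with the overall ratio
    # avoids dividing by zero when a cell has no controls.
    return {cell for cell in set(cases) | set(controls)
            if cases[cell] * n_controls >= controls[cell] * n_cases}

def mdr_error(genotypes, status, loci):
    """Misclassification rate of the one-dimensional high/low risk rule."""
    high = mdr_high_risk_cells(genotypes, status, loci)
    wrong = sum((tuple(row[i] for i in loci) in high) != bool(is_case)
                for row, is_case in zip(genotypes, status))
    return wrong / len(status)

# Toy data with a pure interaction: a sample is a case only when the two
# loci differ, so neither locus shows a main effect on its own.
genotypes = [(0, 0), (0, 1), (1, 0), (1, 1)] * 5
status = [0, 1, 1, 0] * 5
```

On this toy data the two-locus model classifies every sample correctly, while either locus alone does no better than chance, which is the kind of interaction without main effects the abstract describes.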
Suzuki, Shigekatsu; Endoh, Rikiya; Manabe, Ri-Ichiroh; Ohkuma, Moriya; Hirakawa, Yoshihisa
2018-01-17
Autotrophic eukaryotes have evolved by the endosymbiotic uptake of photosynthetic organisms. Interestingly, many algae and plants have secondarily lost the photosynthetic activity despite its great advantages. Prototheca and Helicosporidium are non-photosynthetic green algae possessing colourless plastids. The plastid genomes of Prototheca wickerhamii and Helicosporidium sp. are highly reduced owing to the elimination of genes related to photosynthesis. To gain further insight into the reductive genome evolution during the shift from a photosynthetic to a heterotrophic lifestyle, we sequenced the plastid and nuclear genomes of two Prototheca species, P. cutis JCM 15793 and P. stagnora JCM 9641, and performed comparative genome analyses among trebouxiophytes. Our phylogenetic analyses using plastid- and nucleus-encoded proteins strongly suggest that independent losses of photosynthesis have occurred at least three times in the clade of Prototheca and Helicosporidium. Conserved gene content among these non-photosynthetic lineages suggests that the plastid and nuclear genomes have convergently eliminated a similar set of photosynthesis-related genes. Other than the photosynthetic genes, significant gene loss and gain were not observed in Prototheca compared to its closest photosynthetic relative Auxenochlorella. Although it remains unclear why loss of photosynthesis occurred in Prototheca, the mixotrophic capability of trebouxiophytes likely made it possible to eliminate photosynthesis.
Gene integrated set profile analysis: a context-based approach for inferring biological endpoints
Kowalski, Jeanne; Dwivedi, Bhakti; Newman, Scott; Switchenko, Jeffery M.; Pauly, Rini; Gutman, David A.; Arora, Jyoti; Gandhi, Khanjan; Ainslie, Kylie; Doho, Gregory; Qin, Zhaohui; Moreno, Carlos S.; Rossi, Michael R.; Vertino, Paula M.; Lonial, Sagar; Bernal-Mizrachi, Leon; Boise, Lawrence H.
2016-01-01
The identification of genes with specific patterns of change (e.g. down-regulated and methylated) as phenotype drivers or samples with similar profiles for a given gene set as drivers of clinical outcome, requires the integration of several genomic data types for which an ‘integrate by intersection’ (IBI) approach is often applied. In this approach, results from separate analyses of each data type are intersected, which has the limitation of a smaller intersection with more data types. We introduce a new method, GISPA (Gene Integrated Set Profile Analysis) for integrated genomic analysis and its variation, SISPA (Sample Integrated Set Profile Analysis) for defining respective genes and samples with the context of similar, a priori specified molecular profiles. With GISPA, the user defines a molecular profile that is compared among several classes and obtains ranked gene sets that satisfy the profile as drivers of each class. With SISPA, the user defines a gene set that satisfies a profile and obtains sample groups of profile activity. Our results from applying GISPA to human multiple myeloma (MM) cell lines contained genes of known profiles and importance, along with several novel targets, and their further SISPA application to MM coMMpass trial data showed clinical relevance. PMID:26826710
Discovery of cancer common and specific driver gene sets
2017-01-01
Cancer is known as a disease mainly caused by gene alterations. Discovery of mutated driver pathways or gene sets is becoming an important step in understanding the molecular mechanisms of carcinogenesis. However, systematically investigating the commonalities and specificities of driver gene sets among multiple cancer types remains a great challenge, although such an investigation would help in deciphering cancers and would be useful for personalized therapy and precision medicine in cancer treatment. In this study, we propose two optimization models to de novo discover common driver gene sets among multiple cancer types (ComMDP) and driver gene sets specific to one or multiple cancer types relative to other cancers (SpeMDP), respectively. We first apply ComMDP and SpeMDP to simulated data to validate their efficiency. Then, we further apply these methods to 12 cancer types from The Cancer Genome Atlas (TCGA) and obtain several biologically meaningful driver pathways. As examples, we construct a common cancer pathway model for BRCA and OV, infer a complex driver pathway model for BRCA carcinogenesis based on common driver gene sets of BRCA with eight cancer types, and investigate specific driver pathways of the liquid cancer lymphoblastic acute myeloid leukemia (LAML) versus other solid cancer types. In these processes more candidate cancer genes are also found. PMID:28168295
Refined NrfA phylogeny improves PCR-based nrfA gene detection
USDA-ARS's Scientific Manuscript database
Dissimilatory nitrate reduction to ammonium (DNRA) promotes N-retention in the terrestrial nitrogen- (N-) cycle. Respiratory nitrite reduction to ammonium is catalyzed by the nitrite reductase NrfA. Prior phylogenetic analyses showed that NrfA divided into 18 distinct clades amongst available sequenc...
A support vector machine based test for incongruence between sets of trees in tree space
2012-01-01
Background: The increased use of multi-locus data sets for phylogenetic reconstruction has increased the need to determine whether a set of gene trees significantly deviates from the phylogenetic patterns of other genes. Such unusual gene trees may have been influenced by other evolutionary processes such as selection, gene duplication, or horizontal gene transfer. Results: Motivated by this problem, we propose a nonparametric goodness-of-fit test for two empirical distributions of gene trees, and we developed the software GeneOut to estimate a p-value for the test. Our approach maps trees into a multi-dimensional vector space and then applies support vector machines (SVMs) to measure the separation between two sets of pre-defined trees. We use a permutation test to assess the significance of the SVM separation. To demonstrate the performance of GeneOut, we applied it to the comparison of gene trees simulated within different species trees across a range of species tree depths. Applied directly to sets of simulated gene trees with large sample sizes, GeneOut was able to detect very small differences between two sets of gene trees generated under different species trees. Our statistical test can also include tree reconstruction in its test framework through a variety of phylogenetic optimality criteria. When applied to DNA sequence data simulated from different sets of gene trees, results in the form of receiver operating characteristic (ROC) curves indicated that GeneOut performed well in the detection of differences between sets of trees with different distributions in a multi-dimensional space. Furthermore, it controlled false positive and false negative rates very well, indicating a high degree of accuracy. Conclusions: The non-parametric nature of our statistical test provides fast and efficient analyses, and makes it an applicable test for any scenario where evolutionary or other factors can lead to trees with different multi-dimensional distributions.
The software GeneOut is freely available under the GNU public license. PMID:22909268
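The permutation step described in this abstract can be sketched as follows. This is not GeneOut's implementation: to keep the example dependency-free, the SVM separation measure is replaced by the distance between group centroids, and the tree-to-vector mapping is assumed to have been done already. All names are hypothetical.

```python
import random

def centroid_distance(a, b):
    """Euclidean distance between the centroids of two sets of vectors.
    Stand-in for the SVM separation measure; an assumption of this sketch."""
    dim = len(a[0])
    ca = [sum(v[d] for v in a) / len(a) for d in range(dim)]
    cb = [sum(v[d] for v in b) / len(b) for d in range(dim)]
    return sum((x - y) ** 2 for x, y in zip(ca, cb)) ** 0.5

def permutation_p_value(a, b, n_perm=999, seed=0):
    """Fraction of label shufflings whose separation is at least as large
    as the observed one, with the usual +1 correction."""
    rng = random.Random(seed)
    observed = centroid_distance(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if centroid_distance(pooled[:len(a)], pooled[len(a):]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Two toy sets of tree vectors: set_b is set_a shifted far away.
set_a = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05)]
set_b = [(x + 5.0, y + 5.0) for x, y in set_a]
```

Shuffling the group labels approximates the distribution of the separation statistic under the null hypothesis that both sets of trees come from the same distribution, so no parametric form has to be assumed.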
Targeted exploration and analysis of large cross-platform human transcriptomic compendia
Zhu, Qian; Wong, Aaron K; Krishnan, Arjun; Aure, Miriam R; Tadych, Alicja; Zhang, Ran; Corney, David C; Greene, Casey S; Bongo, Lars A; Kristensen, Vessela N; Charikar, Moses; Li, Kai; Troyanskaya, Olga G.
2016-01-01
We present SEEK (http://seek.princeton.edu), a query-based search engine across very large transcriptomic data collections, including thousands of human data sets from almost 50 microarray and next-generation sequencing platforms. SEEK uses a novel query-level cross-validation-based algorithm to automatically prioritize data sets relevant to the query and a robust search approach to identify query-coregulated genes, pathways, and processes. SEEK provides cross-platform handling, multi-gene query search, iterative metadata-based search refinement, and extensive visualization-based analysis options. PMID:25581801
Xie, Xin-Ping; Xie, Yu-Feng; Wang, Hong-Qiang
2017-08-23
Large-scale accumulation of omics data poses a pressing challenge for the integrative analysis of multiple data sets in bioinformatics. An open question in such integrative analysis is how to pinpoint consistent but subtle gene activity patterns across studies. Study heterogeneity needs to be addressed carefully to reach this goal. This paper proposes a regulation probability model-based meta-analysis, jGRP, for identifying differentially expressed genes (DEGs). The method integrates multiple transcriptomics data sets in a gene regulatory space instead of in a gene expression space, which makes it easy to capture and manage data heterogeneity across studies from different laboratories or platforms. Specifically, we transform gene expression profiles into a united gene regulation profile across studies by mathematically defining two gene regulation events between two conditions and estimating their probabilities of occurrence in a sample. Finally, a novel differential expression statistic is established based on the gene regulation profiles, realizing accurate and flexible identification of DEGs in gene regulation space. We evaluated the proposed method on simulated data and real-world cancer datasets and showed the effectiveness and efficiency of jGRP in identifying DEGs in the context of meta-analysis. Data heterogeneity largely influences the performance of meta-analysis for DEG identification. Existing meta-analysis methods were found to exhibit very different degrees of sensitivity to study heterogeneity. The proposed method, jGRP, can be a standalone tool due to its united framework and controllable way of dealing with study heterogeneity.
Melroy-Greif, Whitney E; Simonson, Matthew A; Corley, Robin P; Lutz, Sharon M; Hokanson, John E; Ehringer, Marissa A
2017-04-01
Cigarette smoking is a physiologically harmful habit. Nicotinic acetylcholine receptors (nAChRs) are bound by nicotine and upregulated in response to chronic exposure to nicotine. It is known that upregulation of these receptors is not due to a change in mRNA of these genes; however, more precise details of the process are still uncertain, with several plausible hypotheses describing how nAChRs are upregulated. We have manually curated a set of genes believed to play a role in nicotine-induced nAChR upregulation. Here, we test the hypothesis that these genes are associated with and contribute risk for nicotine dependence (ND) and the number of cigarettes smoked per day (CPD). Studies with genotypic data on European and African Americans (EAs and AAs, respectively) were collected and a gene-based test was run to test for an association between each gene and ND and CPD. Although several novel genes were associated with CPD and ND at P < 0.05 in EAs and AAs, these associations did not survive correction for multiple testing. Previous associations between CHRNA3, CHRNA5, CHRNB4 and CPD in EAs were replicated. Our hypothesis-driven approach avoided many of the limitations inherent in pathway analyses and provided nominal evidence for association between cholinergic-related genes and nicotine behaviors. We evaluated the evidence for association between a manually curated set of genes and nicotine behaviors in European and African Americans. Although no genes were associated after multiple testing correction, this study has several strengths: by manually curating a set of genes we circumvented the limitations inherent in many pathway analyses and tested several genes that had not yet been examined in a human genetic study; gene-based tests are a useful way to test for association with a set of genes; and these genes were collected based on literature review and conversations with experts, highlighting the importance of scientific collaboration. © The Author 2016. 
Published by Oxford University Press on behalf of the Society for Research on Nicotine and Tobacco. All rights reserved.
SoFoCles: feature filtering for microarray classification based on gene ontology.
Papachristoudis, Georgios; Diplaris, Sotiris; Mitkas, Pericles A
2010-02-01
Marker gene selection has been an important research topic in the classification analysis of gene expression data. Current methods try to reduce the "curse of dimensionality" by using statistical intra-feature set calculations, or classifiers that are based on the given dataset. In this paper, we present SoFoCles, an interactive tool that enables semantic feature filtering in microarray classification problems with the use of external, well-defined knowledge retrieved from the Gene Ontology. The notion of semantic similarity is used to derive genes that are involved in the same biological path during the microarray experiment, by enriching a feature set that has been initially produced with legacy methods. Among its other functionalities, SoFoCles offers a large repository of semantic similarity methods that are used in order to derive feature sets and marker genes. The structure and functionality of the tool are discussed in detail, as well as its ability to improve classification accuracy. Through experimental evaluation, SoFoCles is shown to outperform other classification schemes in terms of classification accuracy in two real datasets using different semantic similarity computation approaches.
Baym, Michael; Shaket, Lev; Anzai, Isao A; Adesina, Oluwakemi; Barstow, Buz
2016-11-10
Whole-genome knockout collections are invaluable for connecting gene sequence to function, yet traditionally, their construction has required an extraordinary technical effort. Here we report a method for the construction and purification of a curated whole-genome collection of single-gene transposon disruption mutants termed Knockout Sudoku. Using simple combinatorial pooling, a highly oversampled collection of mutants is condensed into a next-generation sequencing library in a single day, a 30- to 100-fold improvement over prior methods. The identities of the mutants in the collection are then solved by a probabilistic algorithm that uses internal self-consistency within the sequencing data set, followed by rapid algorithmically guided condensation to a minimal representative set of mutants, validation, and curation. Starting from a progenitor collection of 39,918 mutants, we compile a quality-controlled knockout collection of the electroactive microbe Shewanella oneidensis MR-1 containing representatives for 3,667 genes that is functionally validated by high-throughput kinetic measurements of quinone reduction.
Finster, Kai Waldemar; Kjeldsen, Kasper Urup; Kube, Michael; Reinhardt, Richard; Mussmann, Marc; Amann, Rudolf; Schreiber, Lars
2013-04-15
Desulfocapsa sulfexigens SB164P1 (DSM 10523) belongs to the deltaproteobacterial family Desulfobulbaceae and is one of two validly described members of its genus. This strain was selected for genome sequencing because it is the first marine bacterium reported to thrive on the disproportionation of elemental sulfur, a process with an unresolved enzymatic pathway in which elemental sulfur serves both as electron donor and electron acceptor. Furthermore, in contrast to its phylogenetically closest relatives, which are dissimilatory sulfate-reducers, D. sulfexigens is unable to grow by sulfate reduction and appears metabolically specialized in growing by disproportionating elemental sulfur, sulfite or thiosulfate with CO2 as the sole carbon source. The genome of D. sulfexigens contains the set of genes that is required for nitrogen fixation. An acetylene assay showed that the strain reduces acetylene to ethylene, which is indicative of N-fixation. The circular chromosome of D. sulfexigens SB164P1 comprises 3,986,761 bp and harbors 3,551 protein-coding genes, of which 78% have a predicted function based on auto-annotation. The chromosome furthermore encodes 46 tRNA genes and 3 rRNA operons.
Finster, Kai Waldemar; Kjeldsen, Kasper Urup; Kube, Michael; Reinhardt, Richard; Mussmann, Marc; Amann, Rudolf; Schreiber, Lars
2013-01-01
Desulfocapsa sulfexigens SB164P1 (DSM 10523) belongs to the deltaproteobacterial family Desulfobulbaceae and is one of two validly described members of its genus. This strain was selected for genome sequencing because it is the first marine bacterium reported to thrive on the disproportionation of elemental sulfur, a process with an unresolved enzymatic pathway in which elemental sulfur serves both as electron donor and electron acceptor. Furthermore, in contrast to its phylogenetically closest relatives, which are dissimilatory sulfate-reducers, D. sulfexigens is unable to grow by sulfate reduction and appears metabolically specialized in growing by disproportionating elemental sulfur, sulfite or thiosulfate with CO2 as the sole carbon source. The genome of D. sulfexigens contains the set of genes that is required for nitrogen fixation. An acetylene assay showed that the strain reduces acetylene to ethylene, which is indicative of N-fixation. The circular chromosome of D. sulfexigens SB164P1 comprises 3,986,761 bp and harbors 3,551 protein-coding genes, of which 78% have a predicted function based on auto-annotation. The chromosome furthermore encodes 46 tRNA genes and 3 rRNA operons. PMID:23961312
The Cross-Entropy Based Multi-Filter Ensemble Method for Gene Selection.
Sun, Yingqiang; Lu, Chengbo; Li, Xiaobo
2018-05-17
The gene expression profile has the characteristics of a high dimension, low sample, and continuous type, and it is a great challenge to use gene expression profile data for the classification of tumor samples. This paper proposes a cross-entropy based multi-filter ensemble (CEMFE) method for microarray data classification. Firstly, multiple filters are used to select the microarray data in order to obtain a plurality of the pre-selected feature subsets with a different classification ability. The top N genes with the highest rank of each subset are integrated so as to form a new data set. Secondly, the cross-entropy algorithm is used to remove the redundant data in the data set. Finally, the wrapper method, which is based on forward feature selection, is used to select the best feature subset. The experimental results show that the proposed method is more efficient than other gene selection methods and that it can achieve a higher classification accuracy under fewer characteristic genes.
Impact of training sets on classification of high-throughput bacterial 16S rRNA gene surveys
Werner, Jeffrey J; Koren, Omry; Hugenholtz, Philip; DeSantis, Todd Z; Walters, William A; Caporaso, J Gregory; Angenent, Largus T; Knight, Rob; Ley, Ruth E
2012-01-01
Taxonomic classification of the thousands–millions of 16S rRNA gene sequences generated in microbiome studies is often achieved using a naïve Bayesian classifier (for example, the Ribosomal Database Project II (RDP) classifier), due to favorable trade-offs among automation, speed and accuracy. The resulting classification depends on the reference sequences and taxonomic hierarchy used to train the model; although the influence of primer sets and classification algorithms has been explored in detail, the influence of the training set has not been characterized. We compared classification results obtained using three different publicly available databases as training sets, applied to five different bacterial 16S rRNA gene pyrosequencing data sets (generated from human body, mouse gut, python gut, soil and anaerobic digester samples). We observed numerous advantages to using the largest, most diverse training set available, which we constructed from the Greengenes (GG) bacterial/archaeal 16S rRNA gene sequence database and the latest GG taxonomy. Phylogenetic clusters of previously unclassified experimental sequences were identified with notable improvements (for example, 50% reduction in reads unclassified at the phylum level in mouse gut, soil and anaerobic digester samples), especially for phylotypes belonging to specific phyla (Tenericutes, Chloroflexi, Synergistetes and Candidate phyla TM6, TM7). Trimming the reference sequences to the primer region resulted in systematic improvements in classification depth, with the greatest gains at higher confidence thresholds. Phylotypes unclassified at the genus level represented a greater proportion of the total community variation than classified operational taxonomic units in mouse gut and anaerobic digester samples, underscoring the need for greater diversity in existing reference databases. PMID:21716311
Xi, Zhenxiang; Liu, Liang; Davis, Charles C
2015-11-01
The development and application of coalescent methods are undergoing rapid changes. One little explored area that bears on the application of gene-tree-based coalescent methods to species tree estimation is gene informativeness. Here, we investigate the accuracy of these coalescent methods when genes have minimal phylogenetic information, including the implementation of the multilocus bootstrap approach. Using simulated DNA sequences, we demonstrate that genes with minimal phylogenetic information can produce unreliable gene trees (i.e., high error in gene tree estimation), which may in turn reduce the accuracy of species tree estimation using gene-tree-based coalescent methods. We demonstrate that this problem can be alleviated by sampling more genes, as is commonly done in large-scale phylogenomic analyses. This applies even when these genes are minimally informative. If gene tree estimation is biased, however, gene-tree-based coalescent analyses will produce inconsistent results, which cannot be remedied by increasing the number of genes. In this case, it is not the gene-tree-based coalescent methods that are flawed, but rather the input data (i.e., estimated gene trees). Along these lines, the commonly used program PhyML has a tendency to infer one particular bifurcating topology even though it is best represented as a polytomy. We additionally corroborate these findings by analyzing the 183-locus mammal data set assembled by McCormack et al. (2012) using ultra-conserved elements (UCEs) and flanking DNA. Lastly, we demonstrate that when employing the multilocus bootstrap approach on this 183-locus data set, there is no strong conflict between species trees estimated from concatenation and gene-tree-based coalescent analyses, as has been previously suggested by Gatesy and Springer (2014). Copyright © 2015 Elsevier Inc. All rights reserved.
USDA-ARS's Scientific Manuscript database
Butyrate is a nutritional element with strong epigenetic regulatory activity as an inhibitor of histone deacetylases (HDACs). Based on the analysis of differentially expressed genes induced by butyrate in the bovine epithelial cell using deep RNA-sequencing technology (RNA-seq), a set of unique gen...
Pastukh, Viktor; Roberts, Justin T.; Clark, David W.; Bardwell, Gina C.; Patel, Mita; Al-Mehdi, Abu-Bakr; Borchert, Glen M.
2015-01-01
In hypoxia, mitochondria-generated reactive oxygen species not only stimulate accumulation of the transcriptional regulator of hypoxic gene expression, hypoxia inducible factor-1 (Hif-1), but also cause oxidative base modifications in hypoxic response elements (HREs) of hypoxia-inducible genes. When the hypoxia-induced base modifications are suppressed, Hif-1 fails to associate with the HRE of the VEGF promoter, and VEGF mRNA accumulation is blunted. The mechanism linking base modifications to transcription is unknown. Here we determined whether recruitment of base excision DNA repair (BER) enzymes in response to hypoxia-induced promoter modifications was required for transcription complex assembly and VEGF mRNA expression. Using chromatin immunoprecipitation analyses in pulmonary artery endothelial cells, we found that hypoxia-mediated formation of the base oxidation product 8-oxoguanine (8-oxoG) in VEGF HREs was temporally associated with binding of Hif-1α and the BER enzymes 8-oxoguanine glycosylase 1 (Ogg1) and redox effector factor-1 (Ref-1)/apurinic/apyrimidinic endonuclease 1 (Ape1) and introduction of DNA strand breaks. Hif-1α colocalized with HRE sequences harboring Ref-1/Ape1, but not Ogg1. Inhibition of BER by small interfering RNA-mediated reduction in Ogg1 augmented hypoxia-induced 8-oxoG accumulation, attenuated Hif-1α and Ref-1/Ape1 binding to VEGF HRE sequences, and blunted VEGF mRNA expression. Chromatin immunoprecipitation-sequence analysis of 8-oxoG distribution in hypoxic pulmonary artery endothelial cells showed that most of the oxidized base was localized to promoters, with virtually no overlap between normoxic and hypoxic data sets. Transcription of genes whose promoters lost 8-oxoG during hypoxia was reduced, while transcription of those gaining 8-oxoG was elevated. Collectively, these findings suggest that the BER pathway links hypoxia-induced introduction of oxidative DNA modifications in promoters of hypoxia-inducible genes to transcriptional activation. PMID:26432868
Mallik, Saurav; Zhao, Zhongming
2017-12-28
For transcriptomic analysis, there are numerous microarray-based genomic data, especially those generated for cancer research. The typical analysis measures the difference between a cancer sample group and a matched control group for each transcript or gene. Association rule mining is used to discover interesting item sets through rule-based methodology. Thus, it has advantages in finding causal relationships between the transcripts. In this work, we introduce two new rule-based similarity measures (weighted rank-based Jaccard and Cosine measures) and then propose a novel computational framework to detect condensed gene co-expression modules (ConGEMs) through the association rule-based learning system and the weighted similarity scores. In practice, the list of evolved condensed markers, which consists of both singular and complex markers, depends on the corresponding condensed gene sets in either the antecedent or the consequent of the rules of the resultant modules. In our evaluation, these markers could be supported by literature evidence, KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway and Gene Ontology annotations. Specifically, we preliminarily identified differentially expressed genes using an empirical Bayes test. A recently developed algorithm, RANWAR, was then utilized to determine the association rules from these genes. Based on that, we computed the integrated similarity scores of these rule-based similarity measures between each rule-pair, and the resultant scores were used for clustering to identify the co-expressed rule-modules. We applied our method to a gene expression dataset for lung squamous cell carcinoma and a genome methylation dataset for uterine cervical carcinogenesis. Our proposed module discovery method produced better results than the traditional gene-module discovery measures. In summary, our proposed rule-based method is useful for exploring biomarker modules from transcriptomic data.
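The weighted rank-based forms of the two measures are defined in the paper and are not reproduced in the abstract. As a minimal sketch of the underlying idea, the plain (unweighted) Jaccard and cosine similarities between the gene sets of two rules can be computed as below. The function names and the example gene sets are illustrative assumptions.

```python
def jaccard(a, b):
    """Shared genes divided by all genes appearing in either rule."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cosine(a, b):
    """Cosine similarity of two gene sets treated as binary vectors."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / (len(a) * len(b)) ** 0.5

# Gene sets of two hypothetical rules that share one gene.
rule_1 = {"TP53", "EGFR"}
rule_2 = {"EGFR", "KRAS"}
```

Pairwise scores of this kind, computed for every rule pair, form the similarity matrix that a clustering step would then group into modules.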
Phthalate esters are a large family of compounds used in many industrial and commercial products. Based on numerous studies, phthalates such as diethyl hexyl phthalate (DEHP) produce reproductive malformations in male rodents through reduction of testosterone production and gene ...
Drug2Gene: an exhaustive resource to explore effectively the drug-target relation network.
Roider, Helge G; Pavlova, Nadia; Kirov, Ivaylo; Slavov, Stoyan; Slavov, Todor; Uzunov, Zlatyo; Weiss, Bertram
2014-03-11
Information about drug-target relations is at the heart of drug discovery. There are now dozens of databases providing drug-target interaction data with varying scope and focus. Because of this, and because of the large chemical space, the overlap of the different data sets is surprisingly small. As searching through these sources manually is cumbersome, time-consuming and error-prone, integrating all the data is highly desirable. Despite a few attempts, integration has been hampered by the diversity of descriptions of compounds, and by the fact that the reported activity values, coming from different data sets, are not always directly comparable due to the use of different metrics or data formats. We have built Drug2Gene, a knowledge base that combines the compound/drug-gene/protein information from 19 publicly available databases. A key feature is our rigorous unification and standardization process, which makes the data truly comparable on a large scale, allowing for the first time effective data mining in such a large knowledge corpus. As of version 3.2, Drug2Gene contains 4,372,290 unified relations between compounds and their targets, most of which include reported bioactivity data. We extend this set with putative (i.e. homology-inferred) relations where sufficient sequence homology between proteins suggests they may bind to similar compounds. Drug2Gene provides powerful search functionalities, very flexible export procedures, and a user-friendly web interface. Drug2Gene v3.2 has become a mature and comprehensive knowledge base providing unified, standardized drug-target related information gathered from publicly available data sources. It can be used to integrate proprietary data sets with publicly available data sets. Its main goal is to be a 'one-stop shop' for identifying tool compounds targeting a given gene product or for finding all known targets of a drug. 
Drug2Gene with its integrated data set of public compound-target relations is freely accessible without restrictions at http://www.drug2gene.com.
Mallik, Saurav; Bhadra, Tapas; Mukherji, Ayan
2018-04-01
Association rule mining is an important technique for identifying interesting relationships between gene pairs in a biological data set. Earlier methods basically work for a single biological data set, and, in most cases, only a single minimum support cutoff can be applied globally, i.e., across all genesets/itemsets. To overcome this limitation, in this paper, we propose a dynamic threshold-based FP-growth rule mining algorithm that integrates gene expression, methylation and protein-protein interaction profiles based on weighted shortest distance to find novel associations among different pairs of genes in multi-view data sets. For this purpose, we introduce three new thresholds, namely, Distance-based Variable/Dynamic Supports (DVS), Distance-based Variable Confidences (DVC), and Distance-based Variable Lifts (DVL) for each rule by integrating the co-expression, co-methylation, and protein-protein interactions existing in the multi-omics data set. We develop the proposed algorithm utilizing these three novel multiple threshold measures. In the proposed algorithm, the values of DVS, DVC, and DVL are computed for each rule separately, and it is then verified whether the support, confidence, and lift of each evolved rule are greater than or equal to the corresponding individual DVS, DVC, and DVL values, respectively. If all three conditions hold for a rule, the rule is treated as a resultant rule. One of the major advantages of the proposed method compared with other related state-of-the-art methods is that it considers both the quantitative and interactive significance among all pairwise genes belonging to each rule. Moreover, the proposed method generates fewer rules, takes less running time, and provides greater biological significance for the resultant top-ranking rules compared to previous methods.
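The per-rule acceptance test described above can be illustrated with a minimal Python sketch. It computes the standard support, confidence, and lift of a rule and compares them with rule-specific cutoffs. How DVS, DVC, and DVL are derived from the weighted shortest distances is not given in the abstract, so here they are simply passed in as numbers; the function names and the toy gene sets are hypothetical.

```python
from typing import FrozenSet, List, Tuple

def rule_metrics(transactions: List[FrozenSet[str]],
                 antecedent: FrozenSet[str],
                 consequent: FrozenSet[str]) -> Tuple[float, float, float]:
    """Return (support, confidence, lift) of the rule antecedent -> consequent."""
    n = len(transactions)
    n_a = sum(1 for t in transactions if antecedent <= t)
    n_c = sum(1 for t in transactions if consequent <= t)
    n_ac = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift

def passes_thresholds(metrics: Tuple[float, float, float],
                      dvs: float, dvc: float, dvl: float) -> bool:
    """Accept the rule only if all three metrics reach their own cutoffs.

    dvs, dvc, dvl stand in for the rule-specific DVS, DVC, DVL values;
    their distance-based computation is NOT reproduced here.
    """
    support, confidence, lift = metrics
    return support >= dvs and confidence >= dvc and lift >= dvl

# Toy data: each transaction is the set of genes "active" in one sample.
transactions = [
    frozenset({"g1", "g2"}),
    frozenset({"g1", "g2", "g3"}),
    frozenset({"g1"}),
    frozenset({"g2", "g3"}),
]
metrics = rule_metrics(transactions, frozenset({"g1"}), frozenset({"g2"}))
```

With per-rule cutoffs, the same rule can be accepted under one set of thresholds and rejected under another, which is the point of replacing a single global minimum support.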
Yao, Z; Peng, Y; Bi, J; Xie, C; Chen, X; Li, Y; Ye, X; Zhou, J
2016-03-01
Multidrug-resistant Pseudomonas aeruginosa (MDRPA) infections are major threats to healthcare-associated infection control and the intrinsic molecular mechanisms of MDRPA are also unclear. We examined 348 isolates of P. aeruginosa, including 188 MDRPA and 160 non-MDRPA, obtained from five tertiary-care hospitals in Guangzhou, China. Significant correlations were found between gene/enzyme carriage and increased rates of antimicrobial resistance (P < 0·01). gyrA mutation, OprD loss and metallo-β-lactamase (MBL) presence were identified as crucial molecular risk factors for MDRPA acquisition by a combination of univariate logistic regression and a multifactor dimensionality reduction approach. The MDRPA rate was also elevated with the increase in positive numbers of those three determinants (P < 0·001). Thus, gyrA mutation, OprD loss and MBL presence may serve as predictors for early screening of MDRPA infections in clinical settings.
2010-01-01
Background Some organisms can survive extreme desiccation by entering a state of suspended animation known as anhydrobiosis. The free-living mycophagous nematode Aphelenchus avenae can be induced to enter anhydrobiosis by pre-exposure to moderate reductions in relative humidity (RH) prior to extreme desiccation. This preconditioning phase is thought to allow modification of the transcriptome by activation of genes required for desiccation tolerance. Results To identify such genes, a panel of expressed sequence tags (ESTs) enriched for sequences upregulated in A. avenae during preconditioning was created. A subset of 30 genes with significant matches in databases, together with a number of apparently novel sequences, were chosen for further study. Several of the recognisable genes are associated with water stress, encoding, for example, two new hydrophilic proteins related to the late embryogenesis abundant (LEA) protein family. Expression studies confirmed EST panel members to be upregulated by evaporative water loss, and the majority of genes was also induced by osmotic stress and cold, but rather fewer by heat. We attempted to use RNA interference (RNAi) to demonstrate the importance of this gene set for anhydrobiosis, but found A. avenae to be recalcitrant with the techniques used. Instead, therefore, we developed a cross-species RNAi procedure using A. avenae sequences in another anhydrobiotic nematode, Panagrolaimus superbus, which is amenable to gene silencing. Of 20 A. avenae ESTs screened, a significant reduction in survival of desiccation in treated P. superbus populations was observed with two sequences, one of which was novel, while the other encoded a glutathione peroxidase. To confirm a role for glutathione peroxidases in anhydrobiosis, RNAi with cognate sequences from P. superbus was performed and was also shown to reduce desiccation tolerance in this species. 
Conclusions This study has identified and characterised the expression profiles of members of the anhydrobiotic gene set in A. avenae. It also demonstrates the potential of RNAi for the analysis of anhydrobiosis and provides the first genetic data to underline the importance of effective antioxidant systems in metazoan desiccation tolerance. PMID:20085654
APPRIS 2017: principal isoforms for multiple gene sets
Rodriguez-Rivas, Juan; Di Domenico, Tomás; Vázquez, Jesús; Valencia, Alfonso
2018-01-01
The APPRIS database (http://appris-tools.org) uses protein structural and functional features and information from cross-species conservation to annotate splice isoforms in protein-coding genes. APPRIS selects a single protein isoform, the ‘principal’ isoform, as the reference for each gene based on these annotations. A single main splice isoform reflects the biological reality for most protein-coding genes, and APPRIS principal isoforms are the best predictors of these main protein isoforms. Here, we present the updates to the database: new developments that include the addition of three new species (chimpanzee, Drosophila melanogaster and Caenorhabditis elegans), the expansion of APPRIS to cover the RefSeq gene set and the UniProtKB proteome for six species, and refinements in the core methods that make up the annotation pipeline. In addition, APPRIS now provides a measure of reliability for individual principal isoforms and updates with each release of the GENCODE/Ensembl and RefSeq reference sets. The individual GENCODE/Ensembl, RefSeq and UniProtKB reference gene sets for six organisms have been merged to produce common sets of splice variants. PMID:29069475
Schmid, Michael; Muri, Jonathan; Melidis, Damianos; Varadarajan, Adithi R.; Somerville, Vincent; Wicki, Adrian; Moser, Aline; Bourqui, Marc; Wenzel, Claudia; Eugster-Meier, Elisabeth; Frey, Juerg E.; Irmler, Stefan; Ahrens, Christian H.
2018-01-01
Although complete genome sequences hold particular value for an accurate description of core genomes, the identification of strain-specific genes, and as the optimal basis for functional genomics studies, they are still largely underrepresented in public repositories. Based on an assessment of the genome assembly complexity for all lactobacilli, we used Pacific Biosciences' long read technology to sequence and de novo assemble the genomes of three Lactobacillus helveticus starter strains, raising the number of completely sequenced strains to 12. The first comparative genomics study for L. helveticus—to our knowledge—identified a core genome of 988 genes and sets of unique, strain-specific genes ranging from about 30 to more than 200 genes. Importantly, the comparison of MiSeq- and PacBio-based assemblies uncovered that not only accessory but also core genes can be missed in incomplete genome assemblies based on short reads. Analysis of the three genomes revealed that a large number of pseudogenes were enriched for functional Gene Ontology categories such as amino acid transmembrane transport and carbohydrate metabolism, which is in line with a reductive genome evolution in the rich natural habitat of L. helveticus. Notably, the functional Clusters of Orthologous Groups of proteins categories “cell wall/membrane biogenesis” and “defense mechanisms” were found to be enriched among the strain-specific genes. A genome mining effort uncovered examples where an experimentally observed phenotype could be linked to the underlying genotype, such as for cell envelope proteinase PrtH3 of strain FAM8627. Another possible link identified for peptidoglycan hydrolases will require further experiments. Of note, strain FAM22155 did not harbor a CRISPR/Cas system; its loss was also observed in other L. helveticus strains and lactobacillus species, thus questioning the value of the CRISPR/Cas system for diagnostic purposes. 
Importantly, the complete genome sequences proved to be very useful for the analysis of natural whey starter cultures with metagenomics, as a larger percentage of the sequenced reads of these complex mixtures could be unambiguously assigned down to the strain level. PMID:29441050
Kim, Seungill; Kim, Myung-Shin; Kim, Yong-Min; Yeom, Seon-In; Cheong, Kyeongchae; Kim, Ki-Tae; Jeon, Jongbum; Kim, Sunggil; Kim, Do-Sun; Sohn, Seong-Han; Lee, Yong-Hwan; Choi, Doil
2015-01-01
The onion (Allium cepa L.) is one of the most widely cultivated and consumed vegetable crops in the world. Although a considerable amount of onion transcriptome data has been deposited into public databases, the sequences of the protein-coding genes are not accurate enough to be used, owing to non-coding sequences intermixed with the coding sequences. We generated a high-quality, annotated onion transcriptome from de novo sequence assembly and intensive structural annotation using the integrated structural gene annotation pipeline (ISGAP), which identified 54,165 protein-coding genes among 165,179 assembled transcripts totalling 203.0 Mb by eliminating the intron sequences. ISGAP performed reliable annotation, recognizing accurate gene structures based on reference proteins, and ab initio gene models of the assembled transcripts. Integrative functional annotation and gene-based SNP analysis revealed a whole biological repertoire of genes and transcriptomic variation in the onion. The method developed in this study provides a powerful tool for the construction of reference gene sets for organisms based solely on de novo transcriptome data. Furthermore, the reference genes and their variation described here for the onion represent essential tools for molecular breeding and gene cloning in Allium spp. PMID:25362073
Expression-based clustering of CAZyme-encoding genes of Aspergillus niger.
Gruben, Birgit S; Mäkelä, Miia R; Kowalczyk, Joanna E; Zhou, Miaomiao; Benoit-Gelber, Isabelle; De Vries, Ronald P
2017-11-23
The Aspergillus niger genome contains a large repertoire of genes encoding carbohydrate active enzymes (CAZymes) that are targeted to plant polysaccharide degradation, enabling A. niger to grow on a wide range of plant biomass substrates. Which genes need to be activated in certain environmental conditions depends on the composition of the available substrate. Previous studies have demonstrated the involvement of a number of transcriptional regulators in plant biomass degradation and have identified sets of target genes for each regulator. In this study, a broad transcriptional analysis was performed of the A. niger genes encoding (putative) plant polysaccharide degrading enzymes. Microarray data focusing on the initial response of A. niger to the presence of plant biomass related carbon sources were analyzed for the wild-type strain N402, grown on a large range of carbon sources, and for the regulatory mutant strains ΔxlnR, ΔaraR, ΔamyR, ΔrhaR and ΔgalX, grown on their specific inducing compounds. The cluster analysis of the expression data revealed several groups of co-regulated genes that go beyond the traditionally described co-regulated gene sets. Additional putative target genes of the selected regulators were identified, based on their expression profile. Notably, in several cases the expression profile calls into question the function assignment of uncharacterized genes that was based on homology searches, highlighting the need for more extensive biochemical studies into the substrate specificity of enzymes encoded by these non-characterized genes. The data also revealed sets of genes that were upregulated in the regulatory mutants, suggesting interaction between the regulatory systems and therefore an even more complex overall regulatory network than has been reported so far. Expression profiling on a large number of substrates provides better insight into the complex regulatory systems that drive the conversion of plant biomass by fungi.
In addition, the data provide further evidence both for and against the similarity-based functions assigned to uncharacterized genes.
Polonikov, Alexey V.; Ivanov, Vladimir P.; Bogomazov, Alexey D.; Freidin, Maxim B.; Illig, Thomas; Solodilova, Maria A.
2014-01-01
Oxidative stress resulting from an increased amount of reactive oxygen species and an imbalance between oxidants and antioxidants plays an important role in the pathogenesis of asthma. The present study tested the hypothesis that genetic susceptibility to allergic and nonallergic variants of asthma is determined by complex interactions between genes encoding antioxidant defense enzymes (ADE). We carried out a comprehensive analysis of the associations between adult asthma and 46 single nucleotide polymorphisms of 34 ADE genes and 12 other candidate genes of asthma in a Russian population using set association analysis and multifactor dimensionality reduction approaches. We found, for the first time, epistatic interactions between ADE genes underlying asthma susceptibility and genetic heterogeneity between allergic and nonallergic variants of the disease. We identified GSR (glutathione reductase) and PON2 (paraoxonase 2) as novel candidate genes for asthma susceptibility. We observed gender-specific effects of ADE genes on the risk of asthma. The results of the study demonstrate the complexity and diversity of interactions between genes involved in oxidative stress underlying susceptibility to allergic and nonallergic asthma. PMID:24895604
Mindfulness meditation-based stress reduction: experience with a bilingual inner-city program.
Roth, B; Creaser, T
1997-03-01
This article describes a bilingual mindfulness meditation-based stress reduction program in an inner-city setting. Mindfulness meditation is defined, and the practices of breathing meditation, eating meditation, walking meditation, and mindful yoga are described. Data analysis examined compliance, medical and psychologic symptom reduction, and changes in self-esteem, of English- and Spanish-speaking patients who completed the 8-week Stress Reduction and Relaxation Program at the Community Health Center in Meriden, Conn. Statistically significant decreases in medical and psychologic symptoms and improvement in self-esteem were found. Many program completers reported dramatic changes in attitudes, beliefs, habits, and behaviors. Despite the limitations of the research design, these findings suggest that a mindfulness meditation course can be an effective health care intervention when utilized by English- and Spanish-speaking patients in an inner-city community health center. The article includes a discussion of factors to be considered when establishing a mindfulness meditation-based stress reduction program in a health care setting.
Benchmarking of Methods for Genomic Taxonomy
Larsen, Mette V.; Cosentino, Salvatore; Lukjancenko, Oksana; ...
2014-02-26
One of the first issues that emerges when a prokaryotic organism of interest is encountered is the question of what it is, that is, which species it is. The 16S rRNA gene formed the basis of the first method for sequence-based taxonomy and has had a tremendous impact on the field of microbiology. Nevertheless, the method has been found to have a number of shortcomings. In this paper, we trained and benchmarked five methods for whole-genome sequence-based prokaryotic species identification on a common data set of complete genomes: (i) SpeciesFinder, which is based on the complete 16S rRNA gene; (ii) Reads2Type, which searches for species-specific 50-mers in either the 16S rRNA gene or the gyrB gene (for the Enterobacteriaceae family); (iii) the ribosomal multilocus sequence typing (rMLST) method, which samples up to 53 ribosomal genes; (iv) TaxonomyFinder, which is based on species-specific functional protein domain profiles; and finally (v) KmerFinder, which examines the number of co-occurring k-mers (substrings of k nucleotides in DNA sequence data). The performances of the methods were subsequently evaluated on three data sets of short sequence reads or draft genomes from public databases. In total, the evaluation sets constituted sequence data from more than 11,000 isolates covering 159 genera and 243 species. Our results indicate that methods that sample only chromosomal, core genes have difficulties in distinguishing closely related species which only recently diverged. Finally, the KmerFinder method had the overall highest accuracy and correctly identified from 93% to 97% of the isolates in the evaluation sets.
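To make the idea of counting co-occurring k-mers concrete, the following Python sketch scores a query sequence against a set of reference sequences by the number of distinct k-mers they share. This is only an illustration of the general principle, not the KmerFinder implementation; the function names, the toy sequences, and the choice of k are assumptions.

```python
from typing import Dict, Set, Tuple

def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all substrings of length k in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def best_match(query: str, references: Dict[str, str],
               k: int = 16) -> Tuple[str, Dict[str, int]]:
    """Score each reference by the number of distinct k-mers shared with the query."""
    query_kmers = kmers(query, k)
    scores = {name: len(query_kmers & kmers(ref, k))
              for name, ref in references.items()}
    return max(scores, key=scores.get), scores

# Toy reference "genomes" (hypothetical); a small k is used so the example is readable.
references = {
    "species_a": "ACGTACGTTGCA",
    "species_b": "TTTTGGGGCCCC",
}
winner, scores = best_match("CGTACGTT", references, k=4)
```

In this toy case all five distinct 4-mers of the query occur in `species_a` and none in `species_b`, so `species_a` is reported as the best match.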
Lee, Seungyeoun; Kim, Yongkang; Kwon, Min-Seok; Park, Taesung
2015-01-01
Genome-wide association studies (GWAS) have extensively analyzed single SNP effects on a wide variety of common and complex diseases and found many genetic variants associated with diseases. However, there is still a large portion of the genetic variants left unexplained. This missing heritability problem might be due to the analytical strategy that limits analyses to only single SNPs. One possible approach to the missing heritability problem is to identify multi-SNP effects or gene-gene interactions. The multifactor dimensionality reduction (MDR) method has been widely used to detect gene-gene interactions; it is based on constructive induction, classifying high-dimensional genotype combinations into a one-dimensional variable with two attributes, high risk and low risk, for the case-control study. Many modifications of MDR have been proposed, and the method has also been extended to the survival phenotype. In this study, we propose several extensions of MDR for the survival phenotype and compare the proposed extensions with earlier MDR through comprehensive simulation studies. PMID:26339630
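The constructive-induction step of MDR can be sketched in a few lines of Python. The sketch below covers only the basic case-control version: each multilocus genotype cell is labelled high risk when its case:control ratio reaches the overall case:control ratio. The survival extensions proposed in the study are not reproduced, and the function names and toy data are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Genotype = Tuple[int, ...]

def mdr_labels(genotypes: List[Genotype], status: List[int]) -> Dict[Genotype, str]:
    """Label each multilocus genotype cell as 'high' or 'low' risk.

    status is 1 for a case and 0 for a control.
    """
    cases = Counter(g for g, s in zip(genotypes, status) if s == 1)
    controls = Counter(g for g, s in zip(genotypes, status) if s == 0)
    n_cases, n_controls = sum(cases.values()), sum(controls.values())
    labels = {}
    for cell in set(cases) | set(controls):
        # cases/controls in the cell >= overall cases/controls,
        # cross-multiplied to avoid division by zero.
        high = cases[cell] * n_controls >= controls[cell] * n_cases
        labels[cell] = "high" if high else "low"
    return labels

def mdr_accuracy(genotypes: List[Genotype], status: List[int]) -> float:
    """Accuracy of the one-dimensional high/low variable as a classifier."""
    labels = mdr_labels(genotypes, status)
    correct = sum((labels[g] == "high") == (s == 1)
                  for g, s in zip(genotypes, status))
    return correct / len(status)

# Toy data for two SNPs, genotypes coded 0/1/2 (hypothetical values).
genotypes = [(0, 0)] * 4 + [(1, 1)] * 4 + [(0, 1)] * 4
status = [1, 0, 0, 0] + [1, 1, 1, 0] + [1, 1, 0, 0]
labels = mdr_labels(genotypes, status)
accuracy = mdr_accuracy(genotypes, status)
```

In a full MDR analysis this labelling is repeated for every candidate SNP combination, usually inside cross-validation, and the combination with the best accuracy is retained.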
Mining subspace clusters from DNA microarray data using large itemset techniques.
Chang, Ye-In; Chen, Jiun-Rung; Tsai, Yueh-Chi
2009-05-01
Mining subspace clusters from DNA microarrays could help researchers identify genes which commonly contribute to a disease, where a subspace cluster indicates a subset of genes whose expression levels are similar under a subset of conditions. Since the number of genes in a DNA microarray is far larger than the number of conditions, previously proposed algorithms, which compute the maximum dimension sets (MDSs) for any two genes, take a long time to mine subspace clusters. In this article, we propose the Large Itemset-Based Clustering (LISC) algorithm for mining subspace clusters. Instead of constructing MDSs for any two genes, we construct MDSs only for any two conditions. Then, we transform the task of finding the maximal possible gene sets into the problem of mining large itemsets from the condition-pair MDSs. Since we are only interested in subspace clusters with gene sets as large as possible, it is desirable to pay attention to gene sets which have reasonably large support values in the condition-pair MDSs. Our simulation results show that the proposed algorithm needs shorter processing time than the previously proposed algorithms that need to construct gene-pair MDSs.
Identifying prognostic signature in ovarian cancer using DirGenerank
Wang, Jian-Yong; Chen, Ling-Ling; Zhou, Xiong-Hui
2017-01-01
Identifying the prognostic genes in cancer is essential not only for the treatment of cancer patients, but also for drug discovery. However, it is still a big challenge to select the prognostic genes that can distinguish the risk of cancer patients across various data sets because of tumor heterogeneity. In this situation, the selected genes whose expression levels are statistically related to prognostic risks may be passengers. In this paper, based on gene expression data and prognostic data of ovarian cancer patients, we used conditional mutual information to construct a gene dependency network in which the nodes (genes) with more out-degrees have more chances to be the modulators of cancer prognosis. After that, we proposed the DirGenerank (Generank in directed network) algorithm, which considers both the gene dependency network and genes’ correlations to prognostic risks, to identify the gene signature that can predict the prognostic risks of ovarian cancer patients. Using the ovarian cancer data set from TCGA (The Cancer Genome Atlas) as the training data set, 40 genes with the highest importance were selected as the prognostic signature. Survival analysis of patients divided by the prognostic signature in the testing data set and four independent data sets showed that the signature can distinguish the prognostic risks of cancer patients significantly. Enrichment analysis of the signature with curated cancer genes and the drugs selected by CMAP showed that the genes in the signature may be drug targets for therapy. In summary, we have proposed a useful pipeline to identify prognostic genes of cancer patients. PMID:28615526
Jiang, Li; Edwards, Stefan M; Thomsen, Bo; Workman, Christopher T; Guldbrandtsen, Bernt; Sørensen, Peter
2014-09-24
Prioritizing genetic variants is a challenge because disease susceptibility loci are often located in genes of unknown function or the relationship with the corresponding phenotype is unclear. A global data-mining exercise on the biomedical literature can establish the phenotypic profile of genes with respect to their connection to disease phenotypes. The importance of protein-protein interaction networks in the genetic heterogeneity of common diseases or complex traits is becoming increasingly recognized. Thus, the development of a network-based approach combined with phenotypic profiling would be useful for disease gene prioritization. We developed a random-set scoring model and implemented it to quantify phenotype relevance in a network-based disease gene-prioritization approach. We validated our approach based on different gene phenotypic profiles, which were generated from PubMed abstracts, OMIM, and GeneRIF records. We also investigated the validity of several vocabulary filters and different likelihood thresholds for predicted protein-protein interactions in terms of their effect on the network-based gene-prioritization approach, which relies on text-mining of the phenotype data. Our method demonstrated good precision and sensitivity compared with those of two alternative complex-based prioritization approaches. We then conducted a global ranking of all human genes according to their relevance to a range of human diseases. The resulting accurate ranking of known causal genes supported the reliability of our approach. Moreover, these data suggest many promising novel candidate genes for human disorders that have a complex mode of inheritance. We have implemented and validated a network-based approach to prioritize genes for human diseases based on their phenotypic profile. We have devised a powerful and transparent tool to identify and rank candidate genes. 
Our global gene prioritization provides a unique resource for the biological interpretation of data from genome-wide association studies, and will help in the understanding of how the associated genetic variants influence disease or quantitative phenotypes.
Information-Theoretic Metrics for Visualizing Gene-Environment Interactions
Chanda, Pritam; Zhang, Aidong; Brazeau, Daniel; Sucheston, Lara; Freudenheim, Jo L.; Ambrosone, Christine; Ramanathan, Murali
2007-01-01
The purpose of our work was to develop heuristics for visualizing and interpreting gene-environment interactions (GEIs) and to assess the dependence of candidate visualization metrics on biological and study-design factors. Two information-theoretic metrics, the k-way interaction information (KWII) and the total correlation information (TCI), were investigated. The effectiveness of the KWII and TCI to detect GEIs in a diverse range of simulated data sets and a Crohn disease data set was assessed. The sensitivity of the KWII and TCI spectra to biological and study-design variables was determined. Head-to-head comparisons with the relevance-chain, multifactor dimensionality reduction, and the pedigree disequilibrium test (PDT) methods were obtained. The KWII and TCI spectra, which are graphical summaries of the KWII and TCI for each subset of environmental and genotype variables, were found to detect each known GEI in the simulated data sets. The patterns in the KWII and TCI spectra were informative for factors such as case-control misassignment, locus heterogeneity, allele frequencies, and linkage disequilibrium. The KWII and TCI spectra were found to have excellent sensitivity for identifying the key disease-associated genetic variations in the Crohn disease data set. In head-to-head comparisons with the relevance-chain, multifactor dimensionality reduction, and PDT methods, the results from visual interpretation of the KWII and TCI spectra performed satisfactorily. The KWII and TCI are promising metrics for visualizing GEIs. They are capable of detecting interactions among numerous single-nucleotide polymorphisms and environmental variables for a diverse range of GEI models. PMID:17924337
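Both metrics can be computed directly from empirical entropies. The Python sketch below assumes the usual definitions: the TCI of a variable set is the sum of the single-variable entropies minus the joint entropy, and the KWII is the negated alternating sum of the entropies of all non-empty subsets. The data layout (rows of discrete values) and the function names are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations
from math import log2
from typing import Sequence, Tuple

Row = Tuple[int, ...]

def entropy(rows: Sequence[Row], cols: Sequence[int]) -> float:
    """Empirical joint Shannon entropy (bits) of the selected columns."""
    if not cols:
        return 0.0
    counts = Counter(tuple(r[c] for c in cols) for r in rows)
    n = len(rows)
    return -sum((k / n) * log2(k / n) for k in counts.values())

def tci(rows: Sequence[Row], cols: Sequence[int]) -> float:
    """Total correlation: sum of marginal entropies minus the joint entropy."""
    return sum(entropy(rows, (c,)) for c in cols) - entropy(rows, cols)

def kwii(rows: Sequence[Row], cols: Sequence[int]) -> float:
    """K-way interaction information as an alternating sum over subsets."""
    m = len(cols)
    total = 0.0
    for size in range(1, m + 1):
        for subset in combinations(cols, size):
            total += (-1) ** (m - size) * entropy(rows, subset)
    return -total

# Toy example: columns 0 and 1 are independent bits, column 2 is their XOR,
# standing in for a phenotype driven purely by an interaction.
rows = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
tci_all = tci(rows, (0, 1, 2))
kwii_all = kwii(rows, (0, 1, 2))
kwii_pair = kwii(rows, (0, 2))
```

In the XOR example neither variable alone carries information about the third column (the pairwise KWII is 0), while the three-way KWII is positive, which is the kind of pattern a KWII spectrum is meant to expose.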
A human functional protein interaction network and its application to cancer data analysis
2010-01-01
Background One challenge facing biologists is to tease out useful information from massive data sets for further analysis. A pathway-based analysis may shed light by projecting candidate genes onto protein functional relationship networks. We are building such a pathway-based analysis system. Results We have constructed a protein functional interaction network by extending curated pathways with non-curated sources of information, including protein-protein interactions, gene coexpression, protein domain interaction, Gene Ontology (GO) annotations and text-mined protein interactions, which cover close to 50% of the human proteome. By applying this network to two glioblastoma multiforme (GBM) data sets and projecting cancer candidate genes onto the network, we found that the majority of GBM candidate genes form a cluster and are closer than expected by chance, and the majority of GBM samples have sequence-altered genes in two network modules, one mainly comprising genes whose products are localized in the cytoplasm and plasma membrane, and another comprising gene products in the nucleus. Both modules are highly enriched in known oncogenes, tumor suppressors and genes involved in signal transduction. Similar network patterns were also found in breast, colorectal and pancreatic cancers. Conclusions We have built a highly reliable functional interaction network upon expert-curated pathways and applied this network to the analysis of two genome-wide GBM and several other cancer data sets. The network patterns revealed from our results suggest common mechanisms in the cancer biology. Our system should provide a foundation for a network or pathway-based analysis platform for cancer and other diseases. PMID:20482850
Species tree inference by minimizing deep coalescences.
Than, Cuong; Nakhleh, Luay
2009-09-01
In a 1997 seminal paper, W. Maddison proposed minimizing deep coalescences, or MDC, as an optimization criterion for inferring the species tree from a set of incongruent gene trees, assuming the incongruence is exclusively due to lineage sorting. In a subsequent paper, Maddison and Knowles provided and implemented a search heuristic for optimizing the MDC criterion, given a set of gene trees. However, the heuristic is not guaranteed to compute optimal solutions, and its hill-climbing search makes it slow in practice. In this paper, we provide two exact solutions to the problem of inferring the species tree from a set of gene trees under the MDC criterion. In other words, our solutions are guaranteed to find the tree that minimizes the total number of deep coalescences from a set of gene trees. One solution is based on a novel integer linear programming (ILP) formulation, and the other is based on a simple dynamic programming (DP) approach. Powerful ILP solvers, such as CPLEX, make the first solution appealing, particularly for very large-scale instances of the problem, whereas the DP-based solution eliminates dependence on proprietary tools, and its simplicity makes it easy to integrate with other genomic events that may cause gene tree incongruence. Using the exact solutions, we analyze a data set of 106 loci from eight yeast species, a data set of 268 loci from eight Apicomplexan species, and several simulated data sets. We show that the MDC criterion provides very accurate estimates of the species tree topologies, and that our solutions are very fast, thus allowing for the accurate analysis of genome-scale data sets. Further, the efficiency of the solutions allows for quick exploration of sub-optimal solutions, which is important for a parsimony-based criterion such as MDC, as we show.
We show that searching for the species tree in the compatibility graph of the clusters induced by the gene trees may be sufficient in practice, a finding that helps ameliorate the computational requirements of optimization solutions. Further, we study the statistical consistency and convergence rate of the MDC criterion, as well as its optimality in inferring the species tree. Finally, we show how our solutions can be used to identify potential horizontal gene transfer events that may have caused some of the incongruence in the data, thus augmenting Maddison's original framework. We have implemented our solutions in the PhyloNet software package, which is freely available at: http://bioinfo.cs.rice.edu/phylonet.
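The MDC criterion described in this abstract can be illustrated with a minimal sketch. It is not the ILP or DP solution of the paper: it only scores a candidate species tree by counting extra lineages, assuming rooted trees written as nested tuples, one sampled allele per species, and the same leaf set in every tree. The function names are hypothetical.

```python
def leaf_set(tree):
    """Return the set of leaf labels below a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def clusters(tree):
    """Yield the leaf set of every node of the tree, leaves and root included."""
    yield leaf_set(tree)
    if not isinstance(tree, str):
        for child in tree:
            yield from clusters(child)


def count_maximal_clades(gene_tree, cluster):
    """Count the maximal gene-tree clades whose leaves all lie inside `cluster`."""
    if leaf_set(gene_tree) <= cluster:
        return 1
    if isinstance(gene_tree, str):
        return 0
    return sum(count_maximal_clades(child, cluster) for child in gene_tree)


def extra_lineages(species_tree, gene_tree):
    """Sum, over species-tree clusters, the gene lineages beyond the first."""
    return sum(count_maximal_clades(gene_tree, c) - 1
               for c in clusters(species_tree))


def mdc_cost(species_tree, gene_trees):
    """Total number of extra lineages over a collection of gene trees."""
    return sum(extra_lineages(species_tree, g) for g in gene_trees)


# Toy data: two gene trees and two candidate species trees.
gene_trees = [(("A", "B"), ("C", "D")), (("A", "C"), ("B", "D"))]
candidates = [(("A", "B"), ("C", "D")), ((("A", "B"), "C"), "D")]
best = min(candidates, key=lambda s: mdc_cost(s, gene_trees))
```

Picking the minimum over an explicit candidate list only works for toy inputs; the exact methods in the abstract exist precisely because the space of species trees is too large to enumerate.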
PathwaySplice: An R package for unbiased pathway analysis of alternative splicing in RNA-Seq data.
Yan, Aimin; Ban, Yuguang; Gao, Zhen; Chen, Xi; Wang, Lily
2018-04-24
Pathway analysis of alternative splicing would be biased without accounting for the different number of exons or junctions associated with each gene, because genes with higher number of exons or junctions are more likely to be included in the "significant" gene list in alternative splicing. We present PathwaySplice, an R package that (1) Performs pathway analysis that explicitly adjusts for the number of exons or junctions associated with each gene; (2) Visualizes selection bias due to different number of exons or junctions for each gene and formally tests for presence of bias using logistic regression; (3) Supports gene sets based on the Gene Ontology terms, as well as more broadly defined gene sets (e.g. MSigDB) or user defined gene sets; (4) Identifies the significant genes driving pathway significance and (5) Organizes significant pathways with an enrichment map, where pathways with large number of overlapping genes are grouped together in a network graph. https://bioconductor.org/packages/release/bioc/html/PathwaySplice.html. lily.wangg@gmail.com, xi.steven.chen@gmail.com.
Graupner, Nadine; Bock, Christina; Wodniok, Sabina; Grossmann, Lars; Vos, Matthijs; Sures, Bernd
2017-01-01
Background Chrysophytes are protist model species in ecology and ecophysiology and important grazers of bacteria-sized microorganisms and primary producers. However, they have not yet been investigated in detail at the molecular level, and no genomic and only little transcriptomic information is available. Chrysophytes exhibit different trophic modes: while phototrophic chrysophytes perform only photosynthesis, mixotrophs can gain carbon from bacterial food as well as from photosynthesis, and heterotrophs solely feed on bacteria-sized microorganisms. Recent phylogenies and megasystematics demonstrate an immense complexity of eukaryotic diversity with numerous transitions between phototrophic and heterotrophic organisms. The question we aim to answer is how the diverse nutritional strategies, accompanied or brought about by a reduction of the plasmid and size reduction in heterotrophic strains, affect physiology and molecular processes. Results We sequenced the mRNA of 18 chrysophyte strains on the Illumina HiSeq platform and analysed the transcriptomes to determine relations between the trophic mode (mixotrophic vs. heterotrophic) and gene expression. We observed an enrichment of genes for photosynthesis, porphyrin and chlorophyll metabolism for phototrophic and mixotrophic strains that can perform photosynthesis. Genes involved in nutrient absorption, environmental information processing and various transporters (e.g., monosaccharide, peptide, lipid transporters) were present or highly expressed only in heterotrophic strains that have to sense, digest and absorb bacterial food. We furthermore present a transcriptome-based alignment-free phylogeny construction approach using transcripts assembled from short reads to determine the evolutionary relationships between the strains and the possible influence of nutritional strategies on the reconstructed phylogeny. 
We discuss the resulting phylogenies in comparison to those from established approaches based on ribosomal RNA and orthologous genes. Finally, we make functionally annotated reference transcriptomes of each strain available to the community, significantly enhancing publicly available data on Chrysophyceae. Conclusions Our study is the first comprehensive transcriptomic characterisation of a diverse set of Chrysophyceaen strains. In addition, we showcase the possibility of inferring phylogenies from assembled transcriptomes using an alignment-free approach. The raw and functionally annotated data we provide will prove beneficial for further examination of the diversity within this taxon. Our molecular characterisation of different trophic modes presents a first such example. PMID:28097055
Model-based gene set analysis for Bioconductor.
Bauer, Sebastian; Robinson, Peter N; Gagneur, Julien
2011-07-01
Gene Ontology and other forms of gene-category analysis play a major role in the evaluation of high-throughput experiments in molecular biology. Single-category enrichment analysis procedures such as Fisher's exact test tend to flag large numbers of redundant categories as significant, which can complicate interpretation. We have recently developed an approach called model-based gene set analysis (MGSA), that substantially reduces the number of redundant categories returned by the gene-category analysis. In this work, we present the Bioconductor package mgsa, which makes the MGSA algorithm available to users of the R language. Our package provides a simple and flexible application programming interface for applying the approach. The mgsa package has been made available as part of Bioconductor 2.8. It is released under the conditions of the Artistic license 2.0. peter.robinson@charite.de; julien.gagneur@embl.de.
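For context, the single-category enrichment test that this abstract contrasts with MGSA can be sketched as a one-sided Fisher's exact test, which is the upper tail of a hypergeometric distribution. This is an illustration only and does not use the mgsa package; the function name and argument names are hypothetical.

```python
from math import comb


def enrichment_p_value(n_population, n_category, n_study, n_overlap):
    """One-sided Fisher's exact test for over-representation.

    Probability of seeing at least `n_overlap` category genes in a study set
    of size `n_study` drawn without replacement from `n_population` genes,
    of which `n_category` belong to the category.
    """
    total = comb(n_population, n_study)
    upper = min(n_category, n_study)
    tail = sum(
        comb(n_category, i) * comb(n_population - n_category, n_study - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy example: 10 genes, 5 in the category, all 5 study genes are in it.
p = enrichment_p_value(10, 5, 5, 5)
```

Because each category is tested on its own, two categories that share most of their genes will tend to be flagged together, which is the redundancy the abstract refers to.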
Fallaize, Rosalind; Celis-Morales, Carlos; Macready, Anna L; Marsaux, Cyril Fm; Forster, Hannah; O'Donovan, Clare; Woolhead, Clara; San-Cristobal, Rodrigo; Kolossa, Silvia; Hallmann, Jacqueline; Mavrogianni, Christina; Surwillo, Agnieszka; Livingstone, Katherine M; Moschonis, George; Navas-Carretero, Santiago; Walsh, Marianne C; Gibney, Eileen R; Brennan, Lorraine; Bouwman, Jildau; Grimaldi, Keith; Manios, Yannis; Traczyk, Iwona; Drevon, Christian A; Martinez, J Alfredo; Daniel, Hannelore; Saris, Wim Hm; Gibney, Michael J; Mathers, John C; Lovegrove, Julie A
2016-09-01
The apolipoprotein E (APOE) risk allele (ɛ4) is associated with higher total cholesterol (TC), amplified response to saturated fatty acid (SFA) reduction, and increased cardiovascular disease. Although knowledge of gene risk may enhance dietary change, it is unclear whether ɛ4 carriers would benefit from gene-based personalized nutrition (PN). The aims of this study were to 1) investigate interactions between APOE genotype and habitual dietary fat intake and modulations of fat intake on metabolic outcomes; 2) determine whether gene-based PN results in greater dietary change than do standard dietary advice (level 0) and nongene-based PN (levels 1-2); and 3) assess the impact of knowledge of APOE risk (risk: E4+, nonrisk: E4-) on dietary change after gene-based PN (level 3). Individuals (n = 1466) recruited into the Food4Me pan-European PN dietary intervention study were randomly assigned to 4 treatment arms and genotyped for APOE (rs429358 and rs7412). Diet and dried blood spot TC and ω-3 (n-3) index were determined at baseline and after a 6-mo intervention. Data were analyzed with the use of adjusted general linear models. Significantly higher TC concentrations were observed in E4+ participants than in E4- (P < 0.05). Although there were no significant differences in APOE response to gene-based PN (E4+ compared with E4-), both groups had a greater reduction in SFA (percentage of total energy) intake than at level 0 (mean ± SD: E4+, -0.72% ± 0.35% compared with -1.95% ± 0.45%, P = 0.035; E4-, -0.31% ± 0.20% compared with -1.68% ± 0.35%, P = 0.029). Gene-based PN was associated with a smaller reduction in SFA intake than in nongene-based PN (level 2) for E4- participants (-1.68% ± 0.35% compared with -2.56% ± 0.27%, P = 0.025). The APOE ɛ4 allele was associated with higher TC. Although gene-based PN targeted to APOE was more effective in reducing SFA intake than standard dietary advice, there was no difference between APOE "risk" and "nonrisk" groups. 
Furthermore, disclosure of APOE nonrisk may have weakened dietary response to PN. This trial was registered at clinicaltrials.gov as NCT01530139. © 2016 American Society for Nutrition.
2012-01-01
Background Geminiviruses are a large and important family of plant viruses that infect a wide range of crops throughout the world. The Begomovirus genus contains species that are transmitted by whiteflies and are distributed worldwide causing disease on an array of horticultural crops. Symptom remission, in which newly developed leaves of systemically infected plants exhibit a reduction in symptom severity (recovery), has been observed on pepper (Capsicum annuum) plants infected with Pepper golden mosaic virus (PepGMV). Previous studies have shown that transcriptional and post-transcriptional gene silencing mechanisms are involved in the reduction of viral nucleic acid concentration in recovered tissue. In this study, we employed deep transcriptome sequencing methods to assess transcriptional variation in healthy (mock), symptomatic, and recovered pepper leaves following PepGMV infection. Results Differential expression analyses of the pepper leaf transcriptome from symptomatic and recovered stages revealed a total of 309 differentially expressed genes between healthy (mock) and symptomatic or recovered tissues. Computational prediction of differential expression was validated using quantitative reverse-transcription PCR confirming the robustness of our bioinformatic methods. Within the set of differentially expressed genes associated with the recovery process were genes involved in defense responses including pathogenesis-related proteins, reactive oxygen species, systemic acquired resistance, jasmonic acid biosynthesis, and ethylene signaling. No major differences were found when compared the differentially expressed genes in symptomatic and recovered tissues. On the other hand, a set of genes with novel roles in defense responses was identified including genes involved in histone modification. This latter result suggested that post-transcriptional and transcriptional gene silencing may be one of the major mechanisms involved in the recovery process. 
Genes orthologous to the C. annuum proteins involved in the pepper-PepGMV recovery response were identified in both Solanum lycopersicum and Solanum tuberosum suggesting conservation of components of the viral recovery response in the Solanaceae. Conclusion These data provide a valuable source of information for improving our understanding of the underlying molecular mechanisms by which pepper leaves become symptomless following infection with geminiviruses. The identification of orthologs for the majority of genes differentially expressed in recovered tissues in two major solanaceous crop species provides the basis for future comparative analyses of the viral recovery process across related taxa. PMID:23185982
Góngora-Castillo, Elsa; Ibarra-Laclette, Enrique; Trejo-Saavedra, Diana L; Rivera-Bustamante, Rafael F
2012-11-27
Geminiviruses are a large and important family of plant viruses that infect a wide range of crops throughout the world. The Begomovirus genus contains species that are transmitted by whiteflies and are distributed worldwide causing disease on an array of horticultural crops. Symptom remission, in which newly developed leaves of systemically infected plants exhibit a reduction in symptom severity (recovery), has been observed on pepper (Capsicum annuum) plants infected with Pepper golden mosaic virus (PepGMV). Previous studies have shown that transcriptional and post-transcriptional gene silencing mechanisms are involved in the reduction of viral nucleic acid concentration in recovered tissue. In this study, we employed deep transcriptome sequencing methods to assess transcriptional variation in healthy (mock), symptomatic, and recovered pepper leaves following PepGMV infection. Differential expression analyses of the pepper leaf transcriptome from symptomatic and recovered stages revealed a total of 309 differentially expressed genes between healthy (mock) and symptomatic or recovered tissues. Computational prediction of differential expression was validated using quantitative reverse-transcription PCR confirming the robustness of our bioinformatic methods. Within the set of differentially expressed genes associated with the recovery process were genes involved in defense responses including pathogenesis-related proteins, reactive oxygen species, systemic acquired resistance, jasmonic acid biosynthesis, and ethylene signaling. No major differences were found when comparing the differentially expressed genes in symptomatic and recovered tissues. On the other hand, a set of genes with novel roles in defense responses was identified including genes involved in histone modification. This latter result suggested that post-transcriptional and transcriptional gene silencing may be one of the major mechanisms involved in the recovery process. Genes orthologous to the C. annuum proteins involved in the pepper-PepGMV recovery response were identified in both Solanum lycopersicum and Solanum tuberosum suggesting conservation of components of the viral recovery response in the Solanaceae. These data provide a valuable source of information for improving our understanding of the underlying molecular mechanisms by which pepper leaves become symptomless following infection with geminiviruses. The identification of orthologs for the majority of genes differentially expressed in recovered tissues in two major solanaceous crop species provides the basis for future comparative analyses of the viral recovery process across related taxa.
Naaijen, J; Bralten, J; Poelmans, G; Glennon, J C; Franke, B; Buitelaar, J K
2017-01-10
Attention-deficit/hyperactivity disorder (ADHD) and autism spectrum disorders (ASD) often co-occur. Both are highly heritable; however, it has been difficult to discover genetic risk variants. Glutamate and GABA are the main excitatory and inhibitory neurotransmitters in the brain; their balance is essential for proper brain development and functioning. In this study we investigated the role of glutamate and GABA genetics in ADHD severity, autism symptom severity and inhibitory performance, based on gene set analysis, an approach to investigate multiple genetic variants simultaneously. Common variants within glutamatergic and GABAergic genes were investigated using the MAGMA software in an ADHD case-only sample (n=931), in which we assessed ASD symptoms and response inhibition on a Stop task. Gene set analyses for ADHD symptom severity, divided into inattention and hyperactivity/impulsivity symptoms, autism symptom severity and inhibition were performed using principal component regression analyses. Subsequently, gene-wide association analyses were performed. The glutamate gene set showed an association with severity of hyperactivity/impulsivity (P=0.009), which was robust to correcting for genome-wide association levels. The GABA gene set showed nominally significant association with inhibition (P=0.04), but this did not survive correction for multiple comparisons. None of the single-gene or single-variant associations was significant on its own. By analyzing multiple genetic variants within candidate gene sets together, we were able to find genetic associations supporting the involvement of excitatory and inhibitory neurotransmitter systems in ADHD and ASD symptom severity in ADHD.
Gui, Jiang; Andrew, Angeline S.; Andrews, Peter; Nelson, Heather M.; Kelsey, Karl T.; Karagas, Margaret R.; Moore, Jason H.
2010-01-01
A central goal of human genetics is to identify and characterize susceptibility genes for common complex human diseases. An important challenge in this endeavor is the modeling of gene-gene interaction or epistasis that can result in non-additivity of genetic effects. The multifactor dimensionality reduction (MDR) method was developed as a machine learning alternative to parametric logistic regression for detecting interactions in the absence of significant marginal effects. The goal of MDR is to reduce the dimensionality inherent in modeling combinations of polymorphisms using a computational approach called constructive induction. Here, we propose a Robust Multifactor Dimensionality Reduction (RMDR) method that performs constructive induction using a Fisher’s Exact Test rather than a predetermined threshold. The advantage of this approach is that only those genotype combinations that are determined to be statistically significant are considered in the MDR analysis. We use two simulation studies to demonstrate that this approach will increase the success rate of MDR when there are only a few genotype combinations that are significantly associated with case-control status. We show that there is no loss of success rate when this is not the case. We then apply the RMDR method to the detection of gene-gene interactions in genotype data from a population-based study of bladder cancer in New Hampshire. PMID:21091664
Oberthuer, André; Berthold, Frank; Warnat, Patrick; Hero, Barbara; Kahlert, Yvonne; Spitz, Rüdiger; Ernestus, Karen; König, Rainer; Haas, Stefan; Eils, Roland; Schwab, Manfred; Brors, Benedikt; Westermann, Frank; Fischer, Matthias
2006-11-01
To develop a gene expression-based classifier for neuroblastoma patients that reliably predicts courses of the disease. Two hundred fifty-one neuroblastoma specimens were analyzed using a customized oligonucleotide microarray comprising 10,163 probes for transcripts with differential expression in clinical subgroups of the disease. Subsequently, the prediction analysis for microarrays (PAM) was applied to a first set of patients with maximally divergent clinical courses (n = 77). The classification accuracy was estimated by a complete 10-times-repeated 10-fold cross validation, and a 144-gene predictor was constructed from this set. This classifier's predictive power was evaluated in an independent second set (n = 174) by comparing results of the gene expression-based classification with those of risk stratification systems of current trials from Germany, Japan, and the United States. The first set of patients was accurately predicted by PAM (cross-validated accuracy, 99%). Within the second set, the PAM classifier significantly separated cohorts with distinct courses (3-year event-free survival [EFS] 0.86 +/- 0.03 [favorable; n = 115] v 0.52 +/- 0.07 [unfavorable; n = 59] and 3-year overall survival 0.99 +/- 0.01 v 0.84 +/- 0.05; both P < .0001) and separated risk groups of current neuroblastoma trials into subgroups with divergent outcome (NB2004: low-risk 3-year EFS 0.86 +/- 0.04 v 0.25 +/- 0.15, P < .0001; intermediate-risk 1.00 v 0.57 +/- 0.19, P = .018; high-risk 0.81 +/- 0.10 v 0.56 +/- 0.08, P = .06). In a multivariate Cox regression model, the PAM predictor classified patients of the second set more accurately than risk stratification of current trials from Germany, Japan, and the United States (P < .001; hazard ratio, 4.756 [95% CI, 2.544 to 8.893]). Integration of gene expression-based class prediction of neuroblastoma patients may improve risk estimation of current neuroblastoma trials.
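The 10-times-repeated 10-fold cross validation mentioned in this abstract can be sketched as a split generator. This is only an illustration of the resampling scheme, not of PAM itself; it assumes plain random (unstratified) folds, and the function name and seed are hypothetical.

```python
import random


def repeated_kfold(n_samples, k=10, repeats=10, seed=0):
    """Yield (repeat, fold, train_indices, test_indices) for repeated k-fold CV."""
    rng = random.Random(seed)
    for repeat in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh random partition for every repeat
        folds = [order[i::k] for i in range(k)]  # k nearly equal-sized folds
        for fold_index, test in enumerate(folds):
            train = [idx
                     for other, fold in enumerate(folds) if other != fold_index
                     for idx in fold]
            yield repeat, fold_index, train, test


# The first patient set in the abstract has n = 77, giving 10 x 10 = 100 splits.
splits = list(repeated_kfold(77))
```

For an unbiased accuracy estimate, any gene selection would need to be repeated on the training indices of each split rather than done once on all samples; that this is what "complete" means in the abstract is an assumption here.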
Azuaje, Francisco; Zheng, Huiru; Camargo, Anyela; Wang, Haiying
2011-08-01
The discovery of novel disease biomarkers is a crucial challenge for translational bioinformatics. Demonstration of both their classification power and reproducibility across independent datasets are essential requirements to assess their potential clinical relevance. Small datasets and multiplicity of putative biomarker sets may explain lack of predictive reproducibility. Studies based on pathway-driven discovery approaches have suggested that, despite such discrepancies, the resulting putative biomarkers tend to be implicated in common biological processes. Investigations of this problem have been mainly focused on datasets derived from cancer research. We investigated the predictive and functional concordance of five methods for discovering putative biomarkers in four independently-generated datasets from the cardiovascular disease domain. A diversity of biosignatures was identified by the different methods. However, we found strong biological process concordance between them, especially in the case of methods based on gene set analysis. With a few exceptions, we observed lack of classification reproducibility using independent datasets. Partial overlaps between our putative sets of biomarkers and the primary studies exist. Despite the observed limitations, pathway-driven or gene set analysis can predict potentially novel biomarkers and can jointly point to biomedically-relevant underlying molecular mechanisms. Copyright © 2011 Elsevier Inc. All rights reserved.
Creswell, J David; Irwin, Michael R; Burklund, Lisa J; Lieberman, Matthew D; Arevalo, Jesusa M G; Ma, Jeffrey; Breen, Elizabeth Crabb; Cole, Steven W
2012-10-01
Lonely older adults have increased expression of pro-inflammatory genes as well as increased risk for morbidity and mortality. Previous behavioral treatments have attempted to reduce loneliness and its concomitant health risks, but have had limited success. The present study tested whether the 8-week Mindfulness-Based Stress Reduction (MBSR) program (compared to a Wait-List control group) reduces loneliness and downregulates loneliness-related pro-inflammatory gene expression in older adults (N = 40). Consistent with study predictions, mixed effect linear models indicated that the MBSR program reduced loneliness, compared to small increases in loneliness in the control group (treatment condition × time interaction: F(1,35) = 7.86, p = .008). Moreover, at baseline, there was an association between reported loneliness and upregulated pro-inflammatory NF-κB-related gene expression in circulating leukocytes, and MBSR downregulated this NF-κB-associated gene expression profile at post-treatment. Finally, there was a trend for MBSR to reduce C Reactive Protein (treatment condition × time interaction: F(1,33) = 3.39, p = .075). This work provides an initial indication that MBSR may be a novel treatment approach for reducing loneliness and related pro-inflammatory gene expression in older adults. Copyright © 2012 Elsevier Inc. All rights reserved.
Creswell, J. David; Irwin, Michael R.; Burklund, Lisa J.; Lieberman, Matthew D.; Arevalo, Jesusa M. G.; Ma, Jeffrey; Breen, Elizabeth Crabb; Cole, Steven W.
2013-01-01
Lonely older adults have increased expression of pro-inflammatory genes as well as increased risk for morbidity and mortality. Previous behavioral treatments have attempted to reduce loneliness and its concomitant health risks, but have had limited success. The present study tested whether the 8-week Mindfulness-Based Stress Reduction (MBSR) program (compared to a Wait-List control group) reduces loneliness and downregulates loneliness-related pro-inflammatory gene expression in older adults (N=40). Consistent with study predictions, mixed effect linear models indicated that the MBSR program reduced loneliness, compared to small increases in loneliness in the control group (treatment condition × time interaction: F(1,35)=7.86, p=.008). Moreover, at baseline, there was an association between reported loneliness and upregulated pro-inflammatory NF-κB-related gene expression in circulating leukocytes, and MBSR downregulated this NF-κB-associated gene expression profile at post-treatment. Finally, there was a trend for MBSR to reduce C Reactive Protein (treatment condition × time interaction: F(1,33)=3.39, p=.075). This work provides an initial indication that MBSR may be a novel treatment approach for reducing loneliness and related pro-inflammatory gene expression in older adults. PMID:22820409
About miRNAs, miRNA seeds, target genes and target pathways.
Kehl, Tim; Backes, Christina; Kern, Fabian; Fehlmann, Tobias; Ludwig, Nicole; Meese, Eckart; Lenhof, Hans-Peter; Keller, Andreas
2017-12-05
miRNAs typically repress gene expression by binding to the 3' UTR, leading to degradation of the mRNA. This process is dominated by the eight-base seed region of the miRNA. Further, miRNAs are known not only to target genes but also to target significant parts of pathways. A logical line of thought is: miRNAs with similar (seed) sequence target similar sets of genes and thus similar sets of pathways. By calculating similarity scores for all 3.25 million pairs of 2,550 human miRNAs, we found that this pattern frequently holds, while we also observed exceptions. Respective results were obtained for both predicted target genes and experimentally validated targets. We note that miRNA target gene set similarity follows a bimodal distribution, pointing at a set of 282 miRNAs that seems to target genes with very high specificity. Further, we discuss miRNAs with different (seed) sequences that nonetheless regulate similar gene sets or pathways. Most intriguingly, we found miRNA pairs that regulate different gene sets but similar pathways such as miR-6886-5p and miR-3529-5p. These are jointly targeting different parts of the MAPK signaling cascade. The main goal of this study is to provide a general overview on the results, to highlight a selection of relevant results on miRNAs, miRNA seeds, target genes and target pathways and to raise awareness for artifacts in respective comparisons. The full set of information that allows detailed results on each miRNA to be inferred has been included in miRPathDB, the miRNA target pathway database (https://mpd.bioinf.uni-sb.de).
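The pair count quoted in this abstract follows directly from the number of miRNAs, and a pairwise target-set comparison can be sketched in a few lines. The abstract does not say which similarity score was used, so the Jaccard index below is an assumption, and the miRNA labels and target genes are made-up placeholders.

```python
from itertools import combinations
from math import comb


def jaccard(a, b):
    """Jaccard index of two target sets: |intersection| / |union|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Number of unordered pairs among 2,550 miRNAs: 2550 * 2549 / 2 = 3,249,975,
# which rounds to the 3.25 million pairs quoted in the abstract.
n_pairs = comb(2550, 2)

# Hypothetical toy target sets, for illustration only.
targets = {
    "miR-x": {"GENE1", "GENE2", "GENE3"},
    "miR-y": {"GENE2", "GENE3", "GENE4"},
    "miR-z": {"GENE5"},
}
scores = {
    (m1, m2): jaccard(targets[m1], targets[m2])
    for m1, m2 in combinations(sorted(targets), 2)
}
```

The same function could be applied to sets of pathways instead of genes, which is how two miRNAs can score low on gene overlap yet high on pathway overlap.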
Mitsui, Yuki; Setoguchi, Hiroaki
2012-12-28
Understanding demographic histories, such as divergence time, patterns of gene flow, and population size changes, in ecologically diverging lineages provides implications for the process and maintenance of population differentiation by ecological adaptation. This study addressed the demographic histories in two independently derived lineages of flood-resistant riparian plants and their non-riparian relatives [Ainsliaea linearis (riparian) and A. apiculata (non-riparian); A. oblonga (riparian) and A. macroclinidioides (non-riparian); Asteraceae] using an isolation-with-migration (IM) model based on variation at 10 nuclear DNA loci. The highest posterior probabilities of the divergence time parameters were estimated to be ca. 25,000 years ago for A. linearis and A. apiculata and ca. 9000 years ago for A. oblonga and A. macroclinidioides, although the confidence intervals of the parameters had broad ranges. The likelihood ratio tests detected evidence of historical gene flow between both riparian/non-riparian species pairs. The riparian populations showed lower levels of genetic diversity and a significant reduction in effective population sizes compared to the non-riparian populations and their ancestral populations. This study showed the recent origins of flood-resistant riparian plants, which are remarkable examples of plant ecological adaptation. The recent divergence and genetic signatures of historical gene flow among riparian/non-riparian species implied that they underwent morphological and ecological differentiation within short evolutionary timescales and have maintained their species boundaries in the face of gene flow. Comparative analyses of adaptive divergence in two sets of riparian/non-riparian lineages suggested that strong natural selection by flooding had frequently reduced the genetic diversity and size of riparian populations through genetic drift, possibly leading to fixation of adaptive traits in riparian populations.
The two sets of riparian/non-riparian lineages showed contrasting patterns of gene flow and genetic differentiation, implying that each lineage showed different degrees of reproductive isolation and that they had experienced unique evolutionary and demographic histories in the process of adaptive divergence.
Hosmani, Prashant S.; Villalobos-Ayala, Krystal; Miller, Sherry; Shippy, Teresa; Flores, Mirella; Rosendale, Andrew; Cordola, Chris; Bell, Tracey; Mann, Hannah; DeAvila, Gabe; DeAvila, Daniel; Moore, Zachary; Buller, Kyle; Ciolkevich, Kathryn; Nandyal, Samantha; Mahoney, Robert; Van Voorhis, Joshua; Dunlevy, Megan; Farrow, David; Hunter, David; Morgan, Taylar; Shore, Kayla; Guzman, Victoria; Izsak, Allison; Dixon, Danielle E.; Cridge, Andrew; Cano, Liliana; Cao, Xiaolong; Jiang, Haobo; Leng, Nan; Johnson, Shannon; Cantarel, Brandi L.; Richards, Stephen; English, Adam; Shatters, Robert G.; Childers, Chris; Chen, Mei-Ju; Hunter, Wayne; Cilia, Michelle; Mueller, Lukas A.; Munoz-Torres, Monica; Nelson, David; Poelchau, Monica F.; Benoit, Joshua B.; Wiersma-Koch, Helen; D’Elia, Tom; Brown, Susan J.
2017-01-01
Abstract The Asian citrus psyllid (Diaphorina citri Kuwayama) is the insect vector of the bacterium Candidatus Liberibacter asiaticus (CLas), the pathogen associated with citrus Huanglongbing (HLB, citrus greening). HLB threatens citrus production worldwide. Suppression or reduction of the insect vector using chemical insecticides has been the primary method to inhibit the spread of citrus greening disease. Accurate structural and functional annotation of the Asian citrus psyllid genome, as well as a clear understanding of the interactions between the insect and CLas, are required for development of new molecular-based HLB control methods. A draft assembly of the D. citri genome has been generated and annotated with automated pipelines. However, knowledge transfer from well-curated reference genomes such as that of Drosophila melanogaster to newly sequenced ones is challenging due to the complexity and diversity of insect genomes. To identify and improve gene models as potential targets for pest control, we manually curated several gene families with a focus on genes that have key functional roles in D. citri biology and CLas interactions. This community effort produced 530 manually curated gene models across developmental, physiological, RNAi regulatory and immunity-related pathways. As previously shown in the pea aphid, RNAi machinery genes putatively involved in the microRNA pathway have been specifically duplicated. A comprehensive transcriptome enabled us to identify a number of gene families that are either missing or misassembled in the draft genome. In order to develop biocuration as a training experience, we included undergraduate and graduate students from multiple institutions, as well as experienced annotators from the insect genomics research community. The resulting gene set (OGS v1.0) combines both automatically predicted and manually curated gene models. Database URL: https://citrusgreening.org/ PMID:29220441
Haakensen, Vilde D; Lingjaerde, Ole Christian; Lüders, Torben; Riis, Margit; Prat, Aleix; Troester, Melissa A; Holmen, Marit M; Frantzen, Jan Ole; Romundstad, Linda; Navjord, Dina; Bukholm, Ida K; Johannesen, Tom B; Perou, Charles M; Ursin, Giske; Kristensen, Vessela N; Børresen-Dale, Anne-Lise; Helland, Aslaug
2011-11-01
Increased understanding of the variability in normal breast biology will enable us to identify mechanisms of breast cancer initiation and the origin of different subtypes, and to better predict breast cancer risk. Gene expression patterns in breast biopsies from 79 healthy women referred to breast diagnostic centers in Norway were explored by unsupervised hierarchical clustering and supervised analyses, such as gene set enrichment analysis and gene ontology analysis and comparison with previously published genelists and independent datasets. Unsupervised hierarchical clustering identified two separate clusters of normal breast tissue based on gene-expression profiling, regardless of clustering algorithm and gene filtering used. Comparison of the expression profile of the two clusters with several published gene lists describing breast cells revealed that the samples in cluster 1 share characteristics with stromal cells and stem cells, and to a certain degree with mesenchymal cells and myoepithelial cells. The samples in cluster 1 also share many features with the newly identified claudin-low breast cancer intrinsic subtype, which also shows characteristics of stromal and stem cells. More women belonging to cluster 1 have a family history of breast cancer and there is a slight overrepresentation of nulliparous women in cluster 1. Similar findings were seen in a separate dataset consisting of histologically normal tissue from both breasts harboring breast cancer and from mammoplasty reductions. This is the first study to explore the variability of gene expression patterns in whole biopsies from normal breasts and identified distinct subtypes of normal breast tissue. Further studies are needed to determine the specific cell contribution to the variation in the biology of normal breasts, how the clusters identified relate to breast cancer risk and their possible link to the origin of the different molecular subtypes of breast cancer.
Role of DISC1 interacting proteins in schizophrenia risk from genome-wide analysis of missense SNPs.
Costas, Javier; Suárez-Rama, Jose Javier; Carrera, Noa; Paz, Eduardo; Páramo, Mario; Agra, Santiago; Brenlla, Julio; Ramos-Ríos, Ramón; Arrojo, Manuel
2013-11-01
A balanced translocation affecting DISC1 cosegregates with several psychiatric disorders, including schizophrenia, in a Scottish family. DISC1 is a hub protein of a network of protein-protein interactions involved in multiple developmental pathways within the brain. Gene set-based analysis has been proposed as an alternative to individual analysis of single nucleotide polymorphisms (SNPs) to get information from genome-wide association studies. In this work, we tested for an overrepresentation of the DISC1 interacting proteins within the top results of our ranked list of genes based on our previous genome-wide association study of missense SNPs in schizophrenia. Our data set consisted of 5100 common missense SNPs genotyped in 476 schizophrenic patients and 447 control subjects from Galicia, NW Spain. We used a modification of the Gene Set Enrichment Analysis adapted for SNPs, as implemented in the GenGen software. The analysis detected an overrepresentation of the DISC1 interacting proteins (permuted P-value=0.0158), indicative of the role of this gene set in schizophrenia risk. We identified seven leading-edge genes, MACF1, UTRN, DST, DISC1, KIF3A, SYNE1, and AKAP9, responsible for the overrepresentation. These genes are involved in neuronal cytoskeleton organization and intracellular transport through the microtubule cytoskeleton, suggesting that these processes may be impaired in schizophrenia. © 2013 John Wiley & Sons Ltd/University College London.
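The enrichment test described above asks whether members of a gene set cluster near the top of a ranked gene list. A minimal sketch of an unweighted running-sum (Kolmogorov-Smirnov-like) enrichment score follows. It is not the GenGen implementation: the function names are hypothetical, and the permutation step shuffles gene order as a simplification, whereas phenotype-label permutation is the more usual choice for GWAS data.

```python
import random

def enrichment_score(ranked_genes, gene_set):
    """Unweighted running sum: step up at set members, step down otherwise.
    Returns the running-sum value with the largest absolute deviation from 0."""
    hits = [g in gene_set for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

def permutation_p_value(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Share of gene-order shuffles scoring at least as high as the observed list
    (simplifying assumption: gene order is permuted, not phenotype labels)."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, gene_set)
    shuffled = list(ranked_genes)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if enrichment_score(shuffled, gene_set) >= observed:
            count += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (count + 1) / (n_perm + 1)
```

The genes whose steps occur before the running sum reaches its maximum correspond to what the abstract calls the leading-edge genes.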
Gene regulation is governed by a core network in hepatocellular carcinoma.
Gu, Zuguang; Zhang, Chenyu; Wang, Jin
2012-05-01
Hepatocellular carcinoma (HCC) is one of the most lethal cancers worldwide, and the mechanisms that lead to the disease are still relatively unclear. However, with the development of high-throughput technologies it is possible to gain a systematic view of biological systems to enhance the understanding of the roles of genes associated with HCC. Thus, analysis of the mechanism of molecule interactions in the context of gene regulatory networks can reveal specific sub-networks that lead to the development of HCC. In this study, we aimed to identify the most important gene regulations that are dysfunctional in HCC generation. Our method for constructing the gene regulatory network is based on predicted target interactions, experimentally supported interactions, and a co-expression model. Regulators in the network included both transcription factors and microRNAs to provide a complete view of gene regulation. Analysis of the gene regulatory network revealed that gene regulation in HCC is highly modular, with different sets of regulators taking charge of specific biological processes. We found that microRNAs mainly control biological functions related to mitochondria and oxidative reduction, while transcription factors control immune responses, extracellular activity and the cell cycle. At a higher level of gene regulation, there exists a core network that organizes regulations between different modules and maintains the robustness of the whole network. There is direct experimental evidence for most of the regulators in the core gene regulatory network relating to HCC. We infer that it is the central controller of gene regulation. Finally, we explored the influence of the core gene regulatory network on biological pathways. Our analysis provides insights into the mechanism of transcriptional and post-transcriptional control in HCC.
In particular, we highlight the importance of the core gene regulatory network; we propose that it is highly related to HCC and we believe further experimental validation is worthwhile.
Discovering monotonic stemness marker genes from time-series stem cell microarray data.
Wang, Hsei-Wei; Sun, Hsing-Jen; Chang, Ting-Yu; Lo, Hung-Hao; Cheng, Wei-Chung; Tseng, George C; Lin, Chin-Teng; Chang, Shing-Jyh; Pal, Nikhil; Chung, I-Fang
2015-01-01
Identification of genes with ascending or descending monotonic expression patterns over time or stages of stem cells is an important issue in time-series microarray data analysis. We propose a method named Monotonic Feature Selector (MFSelector) based on a concept of total discriminating error (DEtotal) to identify monotonic genes. MFSelector considers various time stages in stage order (i.e., Stage One vs. other stages, Stages One and Two vs. remaining stages and so on) and computes DEtotal of each gene. MFSelector can successfully identify genes with monotonic characteristics. We have demonstrated the effectiveness of MFSelector on two synthetic data sets and two stem cell differentiation data sets: embryonic stem cell neurogenesis (ESCN) and embryonic stem cell vasculogenesis (ESCV) data sets. We have also performed extensive quantitative comparisons of the three monotonic gene selection approaches. Some of the monotonic marker genes such as OCT4, NANOG, BLBP, discovered from the ESCN dataset exhibit consistent behavior with that reported in other studies. The role of monotonic genes found by MFSelector in either stemness or differentiation is validated using information obtained from Gene Ontology analysis and other literature. We justify and demonstrate that descending genes are involved in the proliferation or self-renewal activity of stem cells, while ascending genes are involved in differentiation of stem cells into variant cell lineages. We have developed a novel system, easy to use even with no pre-existing knowledge, to identify gene sets with monotonic expression patterns in multi-stage as well as in time-series genomics matrices. The case studies on ESCN and ESCV have helped to get a better understanding of stemness and differentiation. The novel monotonic marker genes discovered from a data set are found to exhibit consistent behavior in another independent data set, demonstrating the utility of the proposed method. 
The MFSelector R function and data sets can be downloaded from: http://microarray.ym.edu.tw/tools/MFSelector/.
Pavlidis, Paul; Qin, Jie; Arango, Victoria; Mann, John J; Sibille, Etienne
2004-06-01
One of the challenges in the analysis of gene expression data is placing the results in the context of other data available about genes and their relationships to each other. Here, we approach this problem in the study of gene expression changes associated with age in two areas of the human prefrontal cortex, comparing two computational methods. The first method, "overrepresentation analysis" (ORA), is based on statistically evaluating the fraction of genes in a particular gene ontology class found among the set of genes showing age-related changes in expression. The second method, "functional class scoring" (FCS), examines the statistical distribution of individual gene scores among all genes in the gene ontology class and does not involve an initial gene selection step. We find that FCS yields more consistent results than ORA, and the results of ORA depended strongly on the gene selection threshold. Our findings highlight the utility of functional class scoring for the analysis of complex expression data sets and emphasize the advantage of considering all available genomic information rather than sets of genes that pass a predetermined "threshold of significance."
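The contrast between the two methods can be made concrete with a small sketch. The code below is an illustration under simplifying assumptions, not the authors' procedure: `ora_fraction` and `fcs_score` are hypothetical names, the ORA step is reduced to the class fraction among selected genes, and the FCS step is reduced to a difference of mean scores without the resampling that would give it a p-value.

```python
from statistics import mean

def ora_fraction(scores, gene_class, threshold):
    """ORA ingredient: fraction of class members among genes passing a threshold."""
    selected = [g for g, s in scores.items() if s >= threshold]
    if not selected:
        return 0.0
    return sum(g in gene_class for g in selected) / len(selected)

def fcs_score(scores, gene_class):
    """FCS-style summary: mean score of the class minus the mean of all genes.
    No gene selection step is involved."""
    class_scores = [scores[g] for g in gene_class if g in scores]
    return mean(class_scores) - mean(list(scores.values()))

# Toy per-gene scores (hypothetical), e.g. -log10 p-values for an age effect.
toy_scores = {"g1": 3.0, "g2": 2.5, "g3": 2.0, "g4": 1.0, "g5": 0.5, "g6": 0.2}
toy_class = {"g1", "g3", "g5"}
```

With these toy values the ORA fraction moves from 1/1 to 1/2 to 2/3 as the threshold drops from 2.9 to 2.4 to 1.9, while the FCS-style score is a single number that does not depend on any cutoff. This mirrors the threshold sensitivity of ORA reported in the abstract.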
An integrated analysis of genes and functional pathways for aggression in human and rodent models.
Zhang-James, Yanli; Fernàndez-Castillo, Noèlia; Hess, Jonathan L; Malki, Karim; Glatt, Stephen J; Cormand, Bru; Faraone, Stephen V
2018-06-01
Human genome-wide association studies (GWAS), transcriptome analyses of animal models, and candidate gene studies have advanced our understanding of the genetic architecture of aggressive behaviors. However, each of these methods presents unique limitations. To generate a more confident and comprehensive view of the complex genetics underlying aggression, we undertook an integrated, cross-species approach. We focused on human and rodent models to derive eight gene lists from three main categories of genetic evidence: two sets of genes identified in GWAS studies, four sets implicated by transcriptome-wide studies of rodent models, and two sets of genes with causal evidence from online Mendelian inheritance in man (OMIM) and knockout (KO) mice reports. These gene sets were evaluated for overlap and pathway enrichment to extract their similarities and differences. We identified enriched common pathways such as the G-protein coupled receptor (GPCR) signaling pathway, axon guidance, reelin signaling in neurons, and ERK/MAPK signaling. Also, individual genes were ranked based on their cumulative weights to quantify their importance as risk factors for aggressive behavior, which resulted in 40 top-ranked and highly interconnected genes. The results of our cross-species and integrated approach provide insights into the genetic etiology of aggression.
Worm, Petra; Stams, Alfons J M; Cheng, Xu; Plugge, Caroline M
2011-01-01
Transcription of genes coding for formate dehydrogenases (fdh genes) and hydrogenases (hyd genes) in Syntrophobacter fumaroxidans and Methanospirillum hungatei was studied following growth under different conditions. Under all conditions tested, all fdh and hyd genes were transcribed. However, transcription levels of the individual genes varied depending on the substrate and growth conditions. Our results strongly suggest that in syntrophically grown S. fumaroxidans cells, the [FeFe]-hydrogenase (encoded by Sfum_844-46), FDH1 (Sfum_2703-06) and Hox (Sfum_2713-16) may confurcate electrons from NADH and ferredoxin to protons and carbon dioxide to produce hydrogen and formate, respectively. Based on bioinformatic analysis, a membrane-integrated energy-converting [NiFe]-hydrogenase (Mhun_1741-46) of M. hungatei might be involved in the energy-dependent reduction of CO2 to formylmethanofuran. The best candidates for F420-dependent N5,N10-methyl-H4MPT and N5,N10-methylene-H4MPT reduction are the cytoplasmic [NiFe]-hydrogenase and FDH1. 16S rRNA ratios indicate that in one of the triplicate co-cultures of S. fumaroxidans and M. hungatei, less energy was available for S. fumaroxidans. This led to enhanced transcription of genes coding for the Rnf-complex (Sfum_2694-99) and of several fdh and hyd genes. The Rnf-complex probably reoxidized NADH with ferredoxin reduction, followed by ferredoxin oxidation by the induced formate dehydrogenases and hydrogenases.
Combining Gene Signatures Improves Prediction of Breast Cancer Survival
Zhao, Xi; Naume, Bjørn; Langerød, Anita; Frigessi, Arnoldo; Kristensen, Vessela N.; Børresen-Dale, Anne-Lise; Lingjærde, Ole Christian
2011-01-01
Background Several gene sets for prediction of breast cancer survival have been derived from whole-genome mRNA expression profiles. Here, we develop a statistical framework to explore whether combination of the information from such sets may improve prediction of recurrence and breast cancer specific death in early-stage breast cancers. Microarray data from two clinically similar cohorts of breast cancer patients are used as training (n = 123) and test set (n = 81), respectively. Gene sets from eleven previously published gene signatures are included in the study. Principal Findings To investigate the relationship between breast cancer survival and gene expression on a particular gene set, a Cox proportional hazards model is applied using partial likelihood regression with an L2 penalty to avoid overfitting and using cross-validation to determine the penalty weight. The fitted models are applied to an independent test set to obtain a predicted risk for each individual and each gene set. Hierarchical clustering of the test individuals on the basis of the vector of predicted risks results in two clusters with distinct clinical characteristics in terms of the distribution of molecular subtypes, ER, PR status, TP53 mutation status and histological grade category, and associated with significantly different survival probabilities (recurrence: p = 0.005; breast cancer death: p = 0.014). Finally, principal components analysis of the gene signatures is used to derive combined predictors used to fit a new Cox model. This model classifies test individuals into two risk groups with distinct survival characteristics (recurrence: p = 0.003; breast cancer death: p = 0.001). The latter classifier outperforms all the individual gene signatures, as well as Cox models based on traditional clinical parameters and the Adjuvant! Online for survival prediction. Conclusion Combining the predictive strength of multiple gene signatures improves prediction of breast cancer survival. 
The presented methodology is broadly applicable to breast cancer risk assessment using any new identified gene set. PMID:21423775
Bushman, B Shaun; Amundsen, Keenan L; Warnke, Scott E; Robins, Joseph G; Johnson, Paul G
2016-01-13
Kentucky bluegrass (Poa pratensis L.) is a prominent turfgrass in the cool-season regions, but it is sensitive to salt stress. Previously, a relatively salt tolerant Kentucky bluegrass accession was identified that maintained green colour under consistent salt applications. In this study, a transcriptome study between the tolerant (PI 372742) accession and a salt susceptible (PI 368233) accession was conducted, under control and salt treatments, and in shoot and root tissues. Sample replicates grouped tightly by tissue and treatment, and fewer differentially expressed transcripts were detected in the tolerant PI 372742 samples compared to the susceptible PI 368233 samples, and in root tissues compared to shoot tissues. A de novo assembly resulted in 388,764 transcripts, with 36,587 detected as differentially expressed. Approximately 75 % of transcripts had homology based annotations, with several differences in GO terms enriched between the PI 368233 and PI 372742 samples. Gene expression profiling identified salt-responsive gene families that were consistently down-regulated in PI 372742 and unlikely to contribute to salt tolerance in Kentucky bluegrass. Gene expression profiling also identified sets of transcripts relating to transcription factors, ion and water transport genes, and oxidation-reduction process genes with likely roles in salt tolerance. The transcript assembly represents the first such assembly in the highly polyploid, facultative apomictic Kentucky bluegrass. The transcripts identified provide genetic information on how this plant responds to and tolerates salt stress in both shoot and root tissues, and can be used for further genetic testing and introgression.
Computing and Applying Atomic Regulons to Understand Gene Expression and Regulation
Faria, José P.; Davis, James J.; Edirisinghe, Janaka N.; Taylor, Ronald C.; Weisenhorn, Pamela; Olson, Robert D.; Stevens, Rick L.; Rocha, Miguel; Rocha, Isabel; Best, Aaron A.; DeJongh, Matthew; Tintle, Nathan L.; Parrello, Bruce; Overbeek, Ross; Henry, Christopher S.
2016-01-01
Understanding gene function and regulation is essential for the interpretation, prediction, and ultimate design of cell responses to changes in the environment. An important step toward meeting the challenge of understanding gene function and regulation is the identification of sets of genes that are always co-expressed. These gene sets, Atomic Regulons (ARs), represent fundamental units of function within a cell and could be used to associate genes of unknown function with cellular processes and to enable rational genetic engineering of cellular systems. Here, we describe an approach for inferring ARs that leverages large-scale expression data sets, gene context, and functional relationships among genes. We computed ARs for Escherichia coli based on 907 gene expression experiments and compared our results with gene clusters produced by two prevalent data-driven methods: Hierarchical clustering and k-means clustering. We compared ARs and purely data-driven gene clusters to the curated set of regulatory interactions for E. coli found in RegulonDB, showing that ARs are more consistent with gold standard regulons than are data-driven gene clusters. We further examined the consistency of ARs and data-driven gene clusters in the context of gene interactions predicted by Context Likelihood of Relatedness (CLR) analysis, finding that the ARs show better agreement with CLR predicted interactions. We determined the impact of increasing amounts of expression data on AR construction and find that while more data improve ARs, it is not necessary to use the full set of gene expression experiments available for E. coli to produce high quality ARs. In order to explore the conservation of co-regulated gene sets across different organisms, we computed ARs for Shewanella oneidensis, Pseudomonas aeruginosa, Thermus thermophilus, and Staphylococcus aureus, each of which represents increasing degrees of phylogenetic distance from E. coli. 
Comparison of the organism-specific ARs showed that the consistency of AR gene membership correlates with phylogenetic distance, but there is clear variability in the regulatory networks of closely related organisms. As large scale expression data sets become increasingly common for model and non-model organisms, comparative analyses of atomic regulons will provide valuable insights into fundamental regulatory modules used across the bacterial domain. PMID:27933038
Jung, Inuk; Jo, Kyuri; Kang, Hyejin; Ahn, Hongryul; Yu, Youngjae; Kim, Sun
2017-12-01
Identifying biologically meaningful gene expression patterns from time series gene expression data is important to understand the underlying biological mechanisms. To identify significantly perturbed gene sets between different phenotypes, analysis of time series transcriptome data requires consideration of time and sample dimensions. Thus, the analysis of such time series data searches for gene sets that exhibit similar or different expression patterns between two or more sample conditions, constituting three-dimensional data, i.e. gene-time-condition. The computational complexity of analyzing such data is very high, even compared with the already NP-hard two-dimensional biclustering problem. Because of this challenge, traditional time series clustering algorithms are designed to capture co-expressed genes with similar expression patterns in two sample conditions. We present a triclustering algorithm, TimesVector, specifically designed for clustering three-dimensional time series data to capture distinctively similar or different gene expression patterns between two or more sample conditions. TimesVector identifies clusters with distinctive expression patterns in three steps: (i) dimension reduction and clustering of time-condition concatenated vectors, (ii) post-processing clusters for detecting similar and distinct expression patterns and (iii) rescuing genes from unclassified clusters. Using four sets of time series gene expression data, generated by both microarray and high throughput sequencing platforms, we demonstrated that TimesVector successfully detected biologically meaningful clusters of high quality. TimesVector improved the clustering quality compared to existing triclustering tools and only TimesVector detected clusters with differential expression patterns across conditions successfully. The TimesVector software is available at http://biohealth.snu.ac.kr/software/TimesVector/. Contact: sunkim.bioinfo@snu.ac.kr.
Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press.
A closer look at cross-validation for assessing the accuracy of gene regulatory networks and models.
Tabe-Bordbar, Shayan; Emad, Amin; Zhao, Sihai Dave; Sinha, Saurabh
2018-04-26
Cross-validation (CV) is a technique to assess the generalizability of a model to unseen data. This technique relies on assumptions that may not be satisfied when studying genomics datasets. For example, random CV (RCV) assumes that a randomly selected set of samples, the test set, well represents unseen data. This assumption doesn't hold true where samples are obtained from different experimental conditions, and the goal is to learn regulatory relationships among the genes that generalize beyond the observed conditions. In this study, we investigated how the CV procedure affects the assessment of supervised learning methods used to learn gene regulatory networks (or in other applications). We compared the performance of a regression-based method for gene expression prediction estimated using RCV with that estimated using a clustering-based CV (CCV) procedure. Our analysis illustrates that RCV can produce over-optimistic estimates of the model's generalizability compared to CCV. Next, we defined the 'distinctness' of test set from training set and showed that this measure is predictive of performance of the regression method. Finally, we introduced a simulated annealing method to construct partitions with gradually increasing distinctness and showed that performance of different gene expression prediction methods can be better evaluated using this method.
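The difference between the two partitioning schemes can be sketched in a few lines. The code below is only an illustration under stated assumptions: it assumes the samples have already been clustered (the `cluster_labels` input is hypothetical), and the greedy balancing rule is a convenience chosen here, not a step taken from the study.

```python
import random
from collections import defaultdict

def random_folds(n_samples, n_folds, seed=0):
    """Random CV (RCV): shuffle sample indices and deal them into folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [sorted(idx[k::n_folds]) for k in range(n_folds)]

def cluster_folds(cluster_labels, n_folds):
    """Clustering-based CV (CCV): whole clusters are assigned to one fold, so a
    test sample never shares a cluster with a training sample."""
    members = defaultdict(list)
    for i, label in enumerate(cluster_labels):
        members[label].append(i)
    folds = [[] for _ in range(n_folds)]
    # Greedy balancing: largest cluster first, into the currently smallest fold.
    for label in sorted(members, key=lambda c: (-len(members[c]), str(c))):
        smallest = min(range(n_folds), key=lambda k: len(folds[k]))
        folds[smallest].extend(members[label])
    return [sorted(f) for f in folds]
```

Under RCV, samples from the same experimental condition can land in both the training and the test set, which is the situation the abstract identifies as a source of over-optimistic estimates.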
Kim, Seungill; Kim, Myung-Shin; Kim, Yong-Min; Yeom, Seon-In; Cheong, Kyeongchae; Kim, Ki-Tae; Jeon, Jongbum; Kim, Sunggil; Kim, Do-Sun; Sohn, Seong-Han; Lee, Yong-Hwan; Choi, Doil
2015-02-01
The onion (Allium cepa L.) is one of the most widely cultivated and consumed vegetable crops in the world. Although a considerable amount of onion transcriptome data has been deposited into public databases, the sequences of the protein-coding genes are not accurate enough to be used, owing to non-coding sequences intermixed with the coding sequences. We generated a high-quality, annotated onion transcriptome from de novo sequence assembly and intensive structural annotation using the integrated structural gene annotation pipeline (ISGAP), which identified 54,165 protein-coding genes among 165,179 assembled transcripts totalling 203.0 Mb by eliminating the intron sequences. ISGAP performed reliable annotation, recognizing accurate gene structures based on reference proteins, and ab initio gene models of the assembled transcripts. Integrative functional annotation and gene-based SNP analysis revealed a whole biological repertoire of genes and transcriptomic variation in the onion. The method developed in this study provides a powerful tool for the construction of reference gene sets for organisms based solely on de novo transcriptome data. Furthermore, the reference genes and their variation described here for the onion represent essential tools for molecular breeding and gene cloning in Allium spp. © The Author 2014. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.
Development of an expert data reduction assistant
NASA Technical Reports Server (NTRS)
Miller, Glenn E.; Johnston, Mark D.; Hanisch, Robert J.
1992-01-01
We propose the development of an expert system tool for the management and reduction of complex data sets. The proposed work is an extension of a successful prototype system for the calibration of CCD images developed by Dr. Johnston in 1987. The reduction of complex multi-parameter data sets presents severe challenges to a scientist. Not only must a particular data analysis system (e.g. IRAF/SDAS/MIDAS) be mastered, but large amounts of data can require many days of tedious work and supervision by the scientist for even the most straightforward reductions. The proposed Expert Data Reduction Assistant will help the scientist overcome these obstacles by developing a reduction plan based on the data at hand and producing a script for the reduction of the data in a target common language.
Grewal, Nivit; Singh, Shailendra; Chand, Trilok
2017-01-01
Owing to the innate noise in biological data sources, a single source or a single measure does not suffice for effective disease gene prioritization. The integration of multiple data sources or the aggregation of multiple measures is therefore needed. Aggregation operators combine multiple related data values into a single value such that the combined value reflects all the individual values. In this paper, we apply fuzzy aggregation to network-based disease gene prioritization and investigate its effect under noisy conditions. This study has been conducted for a set of 15 blood disorders by fusing four different network measures, computed from the protein interaction network, using a selected set of aggregation operators and ranking the genes on the basis of the aggregated value. The aggregation operator-based rankings have been compared with the "Random walk with restart" gene prioritization method. The impact of noise has also been investigated by adding varying proportions of noise to the seed set. The results reveal that for all the selected blood disorders, the Mean of Maximal operator outperformed the other aggregation operators for noisy as well as non-noisy data.
Contrasting X-Linked and Autosomal Diversity across 14 Human Populations
Arbiza, Leonardo; Gottipati, Srikanth; Siepel, Adam; Keinan, Alon
2014-01-01
Contrasting the genetic diversity of the human X chromosome (X) and autosomes has facilitated understanding historical differences between males and females and the influence of natural selection. Previous studies based on smaller data sets have left questions regarding how empirical patterns extend to additional populations and which forces can explain them. Here, we address these questions by analyzing the ratio of X-to-autosomal (X/A) nucleotide diversity with the complete genomes of 569 females from 14 populations. Results show that X/A diversity is similar within each continental group but notably lower in European (EUR) and East Asian (ASN) populations than in African (AFR) populations. X/A diversity increases in all populations with increasing distance from genes, highlighting the stronger impact of diversity-reducing selection on X than on the autosomes. However, relative X/A diversity (between two populations) is invariant with distance from genes, suggesting that selection does not drive the relative reduction in X/A diversity in non-Africans (0.842 ± 0.012 for EUR-to-AFR and 0.820 ± 0.032 for ASN-to-AFR comparisons). Finally, an array of models with varying population bottlenecks, expansions, and migration from the latest studies of human demographic history account for about half of the observed reduction in relative X/A diversity from the expected value of 1. They predict values between 0.91 and 0.94 for EUR-to-AFR comparisons and between 0.91 and 0.92 for ASN-to-AFR comparisons. Further reductions can be predicted by more extreme demographic events in excess of those captured by the latest studies but, in the absence of these, also by historical sex-biased demographic events or other processes. PMID:24836452
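The statement that demographic models account for about half of the observed reduction can be checked directly from the figures quoted above. The short calculation below is a worked illustration; the function name is hypothetical, and the inputs are the point estimates from the abstract with their uncertainty ignored.

```python
def fraction_explained(observed_ratio, predicted_ratio, expected=1.0):
    """Share of the observed reduction below `expected` that a model reproduces."""
    return (expected - predicted_ratio) / (expected - observed_ratio)

# EUR-to-AFR: observed 0.842, models predict 0.91 to 0.94.
eur_low = fraction_explained(0.842, 0.94)
eur_high = fraction_explained(0.842, 0.91)

# ASN-to-AFR: observed 0.820, models predict 0.91 to 0.92.
asn_low = fraction_explained(0.820, 0.92)
asn_high = fraction_explained(0.820, 0.91)
```

This gives roughly 0.38 to 0.57 for the EUR-to-AFR comparison and roughly 0.44 to 0.50 for the ASN-to-AFR comparison, consistent with "about half".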
Liu, Ying; Ciliax, Brian J; Borges, Karin; Dasigi, Venu; Ram, Ashwin; Navathe, Shamkant B; Dingledine, Ray
2004-01-01
One of the key challenges of microarray studies is to derive biological insights from the unprecedented quantities of data on gene-expression patterns. Clustering genes by functional keyword association can provide direct information about the nature of the functional links among genes within the derived clusters. However, the quality of the keyword lists extracted from biomedical literature for each gene significantly affects the clustering results. We extracted keywords from MEDLINE that describe the most prominent functions of the genes, and used the resulting weights of the keywords as feature vectors for gene clustering. By analyzing the resulting cluster quality, we compared two keyword weighting schemes: normalized z-score and term frequency-inverse document frequency (TFIDF). The best combination of background comparison set, stop list and stemming algorithm was selected based on precision and recall metrics. In a test set of four known gene groups, a hierarchical algorithm correctly assigned 25 of 26 genes to the appropriate clusters based on keywords extracted by the TFIDF weighting scheme, but only 23 of 26 with the z-score method. To evaluate the effectiveness of the weighting schemes for keyword extraction for gene clusters from microarray profiles, 44 yeast genes that are differentially expressed during the cell cycle were used as a second test set. Using established measures of cluster quality, the results produced from TFIDF-weighted keywords had higher purity, lower entropy, and higher mutual information than those produced from normalized z-score weighted keywords. The optimized algorithms should be useful for sorting genes from microarray lists into functionally discrete clusters.
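To illustrate how keyword weights can serve as feature vectors for gene clustering, here is a minimal TFIDF sketch. The abstract does not specify the exact TFIDF variant, so the formula below (relative term frequency times the log of inverse document frequency, with each gene's keyword list treated as one document) is an assumption, as are the function names and the toy keywords.

```python
import math
from collections import Counter

def tfidf_weights(gene_keywords):
    """gene_keywords: dict gene -> list of keywords (repeats allowed).
    Returns dict gene -> {keyword: tf * idf}."""
    n_genes = len(gene_keywords)
    doc_freq = Counter()
    for words in gene_keywords.values():
        doc_freq.update(set(words))  # count each keyword once per gene
    weights = {}
    for gene, words in gene_keywords.items():
        counts = Counter(words)
        total = sum(counts.values())
        weights[gene] = {
            w: (c / total) * math.log(n_genes / doc_freq[w])
            for w, c in counts.items()
        }
    return weights

def cosine(u, v):
    """Cosine similarity between two sparse keyword-weight vectors."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy input: three genes with literature-derived keywords.
toy_keywords = {
    "geneA": ["glutamate", "receptor", "synapse"],
    "geneB": ["glutamate", "receptor", "channel"],
    "geneC": ["cell", "cycle", "kinase"],
}
toy_weights = tfidf_weights(toy_keywords)
```

In this toy input geneA and geneB share two keywords and therefore have a positive cosine similarity, while geneA and geneC share none and have similarity zero. A hierarchical clustering algorithm would then operate on such pairwise similarities.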
Harm reduction principles for healthcare settings.
Hawk, Mary; Coulter, Robert W S; Egan, James E; Fisk, Stuart; Reuel Friedman, M; Tula, Monique; Kinsky, Suzanne
2017-10-24
Harm reduction refers to interventions aimed at reducing the negative effects of health behaviors without necessarily extinguishing the problematic health behaviors completely. The vast majority of the harm reduction literature focuses on the harms of drug use and on specific harm reduction strategies, such as syringe exchange, rather than on the harm reduction philosophy as a whole. Given that a harm reduction approach can address other risk behaviors that often occur alongside drug use and that harm reduction principles have been applied to harms such as sex work, eating disorders, and tobacco use, a natural evolution of the harm reduction philosophy is to extend it to other health risk behaviors and to a broader healthcare audience. Building on the extant literature, we used data from in-depth qualitative interviews with 23 patients and 17 staff members from an HIV clinic in the USA to describe harm reduction principles for use in healthcare settings. We defined six principles of harm reduction and generalized them for use in healthcare settings with patients beyond those who use illicit substances. The principles include humanism, pragmatism, individualism, autonomy, incrementalism, and accountability without termination. For each of these principles, we present a definition, a description of how healthcare providers can deliver interventions informed by the principle, and examples of how each principle may be applied in the healthcare setting. This paper is one of the first to provide a comprehensive set of principles for universal harm reduction as a conceptual approach for healthcare provision. Applying harm reduction principles in healthcare settings may improve clinical care outcomes given that the quality of the provider-patient relationship is known to impact health outcomes and treatment adherence.
Harm reduction can be a universal precaution applied to all individuals regardless of their disclosure of negative health behaviors, given that health behaviors are not binary or linear but operate along a continuum based on a variety of individual and social determinants.
Wang, Yi-Ting; Sung, Pei-Yuan; Lin, Peng-Lin; Yu, Ya-Wen; Chung, Ren-Hua
2015-05-15
Genome-wide association studies (GWAS) have become a common approach to identifying single nucleotide polymorphisms (SNPs) associated with complex diseases. As complex diseases are caused by the joint effects of multiple genes, while the effect of an individual gene or SNP is modest, a method considering the joint effects of multiple SNPs can be more powerful than testing individual SNPs. The multi-SNP analysis aims to test association based on a SNP set, usually defined based on biological knowledge such as gene or pathway, which may contain only a portion of SNPs with effects on the disease. Therefore, a challenge for the multi-SNP analysis is how to effectively select a subset of SNPs with promising association signals from the SNP set. We developed the Optimal P-value Threshold Pedigree Disequilibrium Test (OPTPDT). The OPTPDT uses general nuclear families. A variable p-value threshold algorithm is used to determine an optimal p-value threshold for selecting a subset of SNPs. A permutation procedure is used to assess the significance of the test. We used simulations to verify that the OPTPDT has correct type I error rates. Our power studies showed that the OPTPDT can be more powerful than the set-based test in PLINK, the multi-SNP FBAT test, and the p-value based test GATES. We applied the OPTPDT to a family-based autism GWAS dataset for gene-based association analysis and identified MACROD2-AS1 with genome-wide significance (p-value = 2.5 × 10^-6). Our simulation results suggested that the OPTPDT is a valid and powerful test. The OPTPDT will be helpful for gene-based or pathway association analysis. The method is ideal for the secondary analysis of existing GWAS datasets, which may identify a set of SNPs with joint effects on the disease.
GO-Bayes: Gene Ontology-based overrepresentation analysis using a Bayesian approach.
Zhang, Song; Cao, Jing; Kong, Y Megan; Scheuermann, Richard H
2010-04-01
A typical approach for the interpretation of high-throughput experiments, such as gene expression microarrays, is to produce groups of genes based on certain criteria (e.g. genes that are differentially expressed). To gain more mechanistic insights into the underlying biology, overrepresentation analysis (ORA) is often conducted to investigate whether gene sets associated with particular biological functions, for example, as represented by Gene Ontology (GO) annotations, are statistically overrepresented in the identified gene groups. However, the standard ORA, which is based on the hypergeometric test, analyzes each GO term in isolation and does not take into account the dependence structure of the GO-term hierarchy. We have developed a Bayesian approach (GO-Bayes) to measure overrepresentation of GO terms that incorporates the GO dependence structure by taking into account evidence not only from individual GO terms, but also from their related terms (i.e. parents, children, siblings, etc.). The Bayesian framework borrows information across related GO terms to strengthen the detection of overrepresentation signals. As a result, this method tends to identify sets of closely related GO terms rather than individual isolated GO terms. The advantage of the GO-Bayes approach is demonstrated with a simulation study and an application example.
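As a point of reference for the standard ORA that the abstract above contrasts with GO-Bayes, the following is a minimal sketch of a hypergeometric overrepresentation test for a single GO term. The function name and the example counts are hypothetical and are not taken from the paper; the sketch treats each term in isolation, which is exactly the limitation the abstract describes.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) for a hypergeometric draw.

    N: genes in the background
    K: background genes annotated with the GO term
    n: genes in the selected group (e.g. differentially expressed)
    k: selected genes annotated with the GO term
    """
    upper = min(K, n)
    total = comb(N, n)
    # Sum the probability mass for k, k+1, ..., min(K, n) annotated genes.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical counts, for illustration only.
p_value = hypergeom_upper_tail(k=8, K=40, n=100, N=10000)
print(f"ORA p-value for one GO term: {p_value:.3g}")
```

A small p-value suggests that the term is overrepresented in the selected group; in practice the test is repeated for every term, so a multiple-testing correction would normally follow.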
Welcome to pandoraviruses at the ‘Fourth TRUC’ club
Sharma, Vikas; Colson, Philippe; Chabrol, Olivier; Scheid, Patrick; Pontarotti, Pierre; Raoult, Didier
2015-01-01
Nucleocytoplasmic large DNA viruses, or representatives of the proposed order Megavirales, belong to families of giant viruses that infect a broad range of eukaryotic hosts. Megaviruses have been previously described to comprise a fourth monophylogenetic TRUC (things resisting uncompleted classification) together with cellular domains in the universal tree of life. Recently described pandoraviruses have large (1.9–2.5 MB) and highly divergent genomes. In the present study, we updated the classification of pandoraviruses and other reported giant viruses. Phylogenetic trees were constructed based on six informational genes. Hierarchical clustering was performed based on a set of informational genes from Megavirales members and cellular organisms. Homologous sequences were selected from cellular organisms using TimeTree software, comprising comprehensive, and representative sets of members from Bacteria, Archaea, and Eukarya. Phylogenetic analyses based on three conserved core genes clustered pandoraviruses with phycodnaviruses, exhibiting their close relatedness. Additionally, hierarchical clustering analyses based on informational genes grouped pandoraviruses with Megavirales members as a super group distinct from cellular organisms. Thus, the analyses based on core conserved genes revealed that pandoraviruses are new genuine members of the ‘Fourth TRUC’ club, encompassing distinct life forms compared with cellular organisms. PMID:26042093
Lee, Hyeonjeong; Shin, Miyoung
2017-01-01
The problem of discovering genetic markers as disease signatures is of great significance for the successful diagnosis, treatment, and prognosis of complex diseases. Even if many earlier studies worked on identifying disease markers from a variety of biological resources, they mostly focused on the markers of genes or gene-sets (i.e., pathways). However, these markers may not be enough to explain biological interactions between genetic variables that are related to diseases. Thus, in this study, our aim is to investigate distinctive associations among active pathways (i.e., pathway-sets) shown each in case and control samples which can be observed from gene expression and/or methylation data. The pathway-sets are obtained by identifying a set of associated pathways that are often active together over a significant number of class samples. For this purpose, gene expression or methylation profiles are first analyzed to identify significant (active) pathways via gene-set enrichment analysis. Then, regarding these active pathways, an association rule mining approach is applied to examine interesting pathway-sets in each class of samples (case or control). By doing so, the sets of associated pathways often working together in activity profiles are finally chosen as our distinctive signature of each class. The identified pathway-sets are aggregated into a pathway activity network (PAN), which facilitates the visualization of differential pathway associations between case and control samples. From our experiments with two publicly available datasets, we could find interesting PAN structures as the distinctive signatures of breast cancer and uterine leiomyoma cancer, respectively. Our pathway-set markers were shown to be superior or very comparable to other genetic markers (such as genes or gene-sets) in disease classification. 
Furthermore, the PAN structure, which can be constructed from the identified markers of pathway-sets, could provide deeper insights into distinctive associations between pathway activities in case and control samples.
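The abstract above does not specify which association rule mining algorithm was used. As a rough illustration of the underlying idea, the sketch below counts how often sets of pathways are active together across the samples of one class and keeps those above a support threshold. The function name, the pathway labels, and the threshold are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_pathway_sets(samples, size, min_support):
    """Return pathway-sets of a given size with support >= min_support.

    samples: list of sets, each holding the pathways active in one sample
    size: number of pathways per candidate set
    min_support: minimum fraction of samples in which the set is co-active
    """
    counts = Counter()
    for active in samples:
        # Sorting gives every pathway-set one canonical tuple key.
        for combo in combinations(sorted(active), size):
            counts[combo] += 1
    n = len(samples)
    return {combo: c / n for combo, c in counts.items() if c / n >= min_support}


# Hypothetical activity profiles for the samples of one class.
case_samples = [{"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"B"}]
print(frequent_pathway_sets(case_samples, size=2, min_support=0.5))
```

Running the same count separately on case and control samples and comparing the surviving sets gives a simple view of class-specific pathway associations. A full association rule miner would add rule confidence and candidate pruning on top of this counting step.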
Functional Abstraction as a Method to Discover Knowledge in Gene Ontologies
Ultsch, Alfred; Lötsch, Jörn
2014-01-01
Computational analyses of functions of gene sets obtained in microarray analyses or by topical database searches are increasingly important in biology. To understand their functions, the sets are usually mapped to Gene Ontology knowledge bases by means of over-representation analysis (ORA). Its result represents the specific knowledge of the functionality of the gene set. However, the specific ontology typically consists of many terms and relationships, hindering the understanding of the ‘main story’. We developed a methodology to identify a comprehensibly small number of GO terms as “headlines” of the specific ontology, making it possible to understand all central aspects of the roles of the involved genes. The Functional Abstraction method finds a set of headlines that is specific enough to cover all details of a specific ontology and is abstract enough for human comprehension. This method outperforms the classical approaches to ORA abstraction and, by focusing on information rather than decorrelation of GO terms, directly targets human comprehension. Functional abstraction provides, with a maximum of certainty, information value, coverage and conciseness, a representation of the biological functions in which a gene set plays a role. This is the necessary means to interpret complex Gene Ontology results, thus strengthening the role of functional genomics in biomarker and drug discovery. PMID:24587272
Kwon, Ji-Sun; Kim, Jihye; Nam, Dougu; Kim, Sangsoo
2012-06-01
Gene set analysis (GSA) is useful in interpreting a genome-wide association study (GWAS) result in terms of biological mechanism. We compared the performance of two different GSA implementations that accept GWAS p-values of single nucleotide polymorphisms (SNPs) or gene-by-gene summaries thereof, GSA-SNP and i-GSEA4GWAS, under the same settings of inputs and parameters. GSA runs were made with two sets of p-values from a Korean type 2 diabetes mellitus GWAS study: 259,188 and 1,152,947 SNPs of the original and imputed genotype datasets, respectively. When Gene Ontology terms were used as gene sets, i-GSEA4GWAS produced 283 and 1,070 hits for the unimputed and imputed datasets, respectively. On the other hand, GSA-SNP reported 94 and 38 hits for the unimputed and imputed datasets, respectively. Similar trends, but to a lesser degree, were observed with Kyoto Encyclopedia of Genes and Genomes (KEGG) gene sets as well. The huge number of hits by i-GSEA4GWAS for the imputed dataset was probably an artifact due to the scaling step in the algorithm. The decrease in hits by GSA-SNP for the imputed dataset may be due to the fact that it relies on Z-statistics, which are sensitive to variations in the background level of associations. Judicious evaluation of the GSA outcomes, perhaps based on multiple programs, is recommended.
The Role of Vitamin D in the Transcriptional Program of Human Pregnancy
Al-Garawi, Amal; Carey, Vincent J.; Chhabra, Divya; Morrow, Jarrett; Lasky-Su, Jessica; Qiu, Weiliang; Laranjo, Nancy; Litonjua, Augusto A.; Weiss, Scott T.
2016-01-01
Background Patterns of gene expression of human pregnancy are poorly understood. In a trial of vitamin D supplementation in pregnant women, peripheral blood transcriptomes were measured longitudinally on 30 women and used to characterize gene co-expression networks. Objective Studies suggest that increased maternal Vitamin D levels may reduce the risk of asthma in early life, yet the underlying mechanisms have not been examined. In this study, we used a network-based approach to examine changes in gene expression profiles during the course of normal pregnancy and evaluated their association with maternal Vitamin D levels. Design The VDAART study is a randomized clinical trial of vitamin D supplementation in pregnancy for reduction of pediatric asthma risk. The trial enrolled 881 women at 10–18 weeks of gestation. Longitudinal gene expression measures were obtained on thirty pregnant women, using RNA isolated from peripheral blood samples obtained in the first and third trimesters. Differentially expressed genes were identified using significance analysis of microarrays (SAM), and clustered using a weighted gene co-expression network analysis (WGCNA). Gene-set enrichment was performed to identify major biological pathways. Results Comparison of transcriptional profiles between first and third trimesters of pregnancy identified 5839 significantly differentially expressed genes (FDR<0.05). Weighted gene co-expression network analysis clustered these transcripts into 14 co-expression modules of which two showed significant correlation with maternal vitamin D levels. Pathway analysis of these two modules revealed genes enriched in immune defense pathways and extracellular matrix reorganization as well as genes enriched in notch signaling and transcription factor networks. Conclusion Our data show that gene expression profiles of healthy pregnant women change during the course of pregnancy and suggest that maternal Vitamin D levels influence transcriptional profiles. 
These alterations of the maternal transcriptome may contribute to fetal immune imprinting and reduce allergic sensitization in early life. Trial Registration clinicaltrials.gov NCT00920621 PMID:27711190
Microgravity and Immunity: Changes in Lymphocyte Gene Expression
NASA Technical Reports Server (NTRS)
Risin, D.; Pellis, N. R.; Ward, N. E.; Risin, S. A.
2006-01-01
Earlier studies had shown that modeled and true microgravity (MG) cause multiple direct effects on human lymphocytes. MG inhibits lymphocyte locomotion, suppresses polyclonal and antigen-specific activation, affects signal transduction mechanisms, as well as activation-induced apoptosis. In this study we assessed changes in gene expression associated with lymphocyte exposure to microgravity in an attempt to identify microgravity-sensitive genes (MGSG) in general and specifically those genes that might be responsible for the functional and structural changes observed earlier. Two sets of experiments targeting different goals were conducted. In the first set, T-lymphocytes from normal donors were activated with antiCD3 and IL2 and then cultured in 1g (static) and modeled MG (MMG) conditions (Rotating Wall Vessel bioreactor) for 24 hours. This setting allowed searching for MGSG by comparison of gene expression patterns in zero and 1 g gravity. In the second set - activated T-cells after culturing for 24 hours in 1g and MMG were exposed three hours before harvesting to a secondary activation stimulus (PHA) thus triggering the apoptotic pathway. Total RNA was extracted using the RNeasy isolation kit (Qiagen, Valencia, CA). Affymetrix Gene Chips (U133A), allowing testing for 18,400 human genes, were used for microarray analysis. In the first set of experiments MMG exposure resulted in altered expression of 89 genes, 10 of them were up-regulated and 79 down-regulated. In the second set, changes in expression were revealed in 85 genes, 20 were up-regulated and 65 were down-regulated. The analysis revealed that significant numbers of MGS genes are associated with signal transduction and apoptotic pathways. Interestingly, the majority of genes that responded by up- or down-regulation in the alternative sets of experiments were not the same, possibly reflecting different functional states of the examined T-lymphocyte populations. 
The responder genes (MGSG) might play an essential role in adaptation to MG and/or be responsible for pathologic changes encountered in Space and thus represent potential targets for molecular-based countermeasures.
Strategies to Explore Functional Genomics Data Sets in NCBI’s GEO Database
Wilhite, Stephen E.; Barrett, Tanya
2012-01-01
The Gene Expression Omnibus (GEO) database is a major repository that stores high-throughput functional genomics data sets that are generated using both microarray-based and sequence-based technologies. Data sets are submitted to GEO primarily by researchers who are publishing their results in journals that require original data to be made freely available for review and analysis. In addition to serving as a public archive for these data, GEO has a suite of tools that allow users to identify, analyze and visualize data relevant to their specific interests. These tools include sample comparison applications, gene expression profile charts, data set clusters, genome browser tracks, and a powerful search engine that enables users to construct complex queries. PMID:22130872
Ooi, Chia Huey; Chetty, Madhu; Teng, Shyh Wei
2006-06-23
Due to the large number of genes in a typical microarray dataset, feature selection looks set to play an important role in reducing noise and computational cost in gene expression-based tissue classification while improving accuracy at the same time. Surprisingly, this does not appear to be the case for all multiclass microarray datasets. The reason is that many feature selection techniques applied on microarray datasets are either rank-based and hence do not take into account correlations between genes, or are wrapper-based, which require high computational cost, and often yield difficult-to-reproduce results. In studies where correlations between genes are considered, attempts to establish the merit of the proposed techniques are hampered by evaluation procedures which are less than meticulous, resulting in overly optimistic estimates of accuracy. We present two realistically evaluated correlation-based feature selection techniques which incorporate, in addition to the two existing criteria involved in forming a predictor set (relevance and redundancy), a third criterion called the degree of differential prioritization (DDP). DDP functions as a parameter to strike the balance between relevance and redundancy, providing our techniques with the novel ability to differentially prioritize the optimization of relevance against redundancy (and vice versa). This ability proves useful in producing optimal classification accuracy while using reasonably small predictor set sizes for nine well-known multiclass microarray datasets. For multiclass microarray datasets, especially the GCM and NCI60 datasets, DDP enables our filter-based techniques to produce accuracies better than those reported in previous studies which employed similarly realistic evaluation procedures.
Reproducible detection of disease-associated markers from gene expression data.
Omae, Katsuhiro; Komori, Osamu; Eguchi, Shinto
2016-08-18
Detection of disease-associated markers plays a crucial role in gene screening for biological studies. Two-sample test statistics, such as the t-statistic, are widely used to rank genes based on gene expression data. However, the resultant gene ranking is often not reproducible among different data sets. Such irreproducibility may be caused by disease heterogeneity. When we divided data into two subsets, we found that the signs of the two t-statistics were often reversed. Focusing on such instability, we proposed a sign-sum statistic that counts the signs of the t-statistics for all possible subsets. The proposed method excludes genes affected by heterogeneity, thereby improving the reproducibility of gene ranking. We compared the sign-sum statistic with the t-statistic by a theoretical evaluation of the upper confidence limit. Through simulations and applications to real data sets, we show that the sign-sum statistic exhibits superior performance. We derive the sign-sum statistic to obtain a robust gene ranking. The sign-sum statistic gives a more reproducible ranking than the t-statistic. Using simulated data sets, we show that the sign-sum statistic excludes hetero-type genes well. For the real data sets as well, the sign-sum statistic performs well from the viewpoint of ranking reproducibility.
Inferring species trees from incongruent multi-copy gene trees using the Robinson-Foulds distance
2013-01-01
Background Constructing species trees from multi-copy gene trees remains a challenging problem in phylogenetics. One difficulty is that the underlying genes can be incongruent due to evolutionary processes such as gene duplication and loss, deep coalescence, or lateral gene transfer. Gene tree estimation errors may further exacerbate the difficulties of species tree estimation. Results We present a new approach for inferring species trees from incongruent multi-copy gene trees that is based on a generalization of the Robinson-Foulds (RF) distance measure to multi-labeled trees (mul-trees). We prove that it is NP-hard to compute the RF distance between two mul-trees; however, it is easy to calculate this distance between a mul-tree and a singly-labeled species tree. Motivated by this, we formulate the RF problem for mul-trees (MulRF) as follows: Given a collection of multi-copy gene trees, find a singly-labeled species tree that minimizes the total RF distance from the input mul-trees. We develop and implement a fast SPR-based heuristic algorithm for the NP-hard MulRF problem. We compare the performance of the MulRF method (available at http://genome.cs.iastate.edu/CBL/MulRF/) with several gene tree parsimony approaches using gene tree simulations that incorporate gene tree error, gene duplications and losses, and/or lateral transfer. The MulRF method produces more accurate species trees than gene tree parsimony approaches. We also demonstrate that the MulRF method infers in minutes a credible plant species tree from a collection of nearly 2,000 gene trees. Conclusions Our new phylogenetic inference method, based on a generalized RF distance, makes it possible to quickly estimate species trees from large genomic data sets. 
Since the MulRF method, unlike gene tree parsimony, is based on a generic tree distance measure, it is appealing for analyses of genomic data sets, in which many processes such as deep coalescence, recombination, gene duplication and losses as well as phylogenetic error may contribute to gene tree discord. In experiments, the MulRF method estimated species trees accurately and quickly, demonstrating MulRF as an efficient alternative approach for phylogenetic inference from large-scale genomic data sets. PMID:24180377
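To make the distance measure behind MulRF more concrete, here is a minimal sketch of the Robinson-Foulds distance in its simplest setting: two rooted, singly-labeled trees over the same leaf set, compared by the clusters (leaf sets under internal nodes) they do not share. This is not the mul-tree generalization described in the abstract, and the nested-tuple tree encoding and function names are assumptions made for illustration.

```python
def clusters(tree):
    """Collect the leaf set under every internal node of a rooted tree.

    A tree is a nested tuple; anything that is not a tuple is a leaf label.
    """
    found = set()

    def leaves(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    leaves(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Number of clusters present in exactly one of the two trees."""
    return len(clusters(tree_a) ^ clusters(tree_b))


# Two hypothetical four-taxon trees that disagree on both cherries.
t1 = (("a", "b"), ("c", "d"))
t2 = (("a", "c"), ("b", "d"))
print(rf_distance(t1, t2))
```

A species tree search of the kind the abstract describes would evaluate a total distance like this over all input gene trees for each candidate tree, which is why a cheap distance computation matters.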
Scuba: scalable kernel-based gene prioritization.
Zampieri, Guido; Tran, Dinh Van; Donini, Michele; Navarin, Nicolò; Aiolli, Fabio; Sperduti, Alessandro; Valle, Giorgio
2018-01-25
The uncovering of genes linked to human diseases is a pressing challenge in molecular biology and precision medicine. This task is often hindered by the large number of candidate genes and by the heterogeneity of the available information. Computational methods for the prioritization of candidate genes can help to cope with these problems. In particular, kernel-based methods are a powerful resource for the integration of heterogeneous biological knowledge; however, their practical implementation is often precluded by their limited scalability. We propose Scuba, a scalable kernel-based method for gene prioritization. It implements a novel multiple kernel learning approach, based on a semi-supervised perspective and on the optimization of the margin distribution. Scuba is optimized to cope with strongly unbalanced settings where known disease genes are few and large scale predictions are required. Importantly, it is able to deal efficiently both with a large number of candidate genes and with an arbitrary number of data sources. As a direct consequence of scalability, Scuba also integrates a new efficient strategy to select optimal kernel parameters for each data source. We performed cross-validation experiments and simulated a realistic usage setting, showing that Scuba outperforms a wide range of state-of-the-art methods. Scuba achieves state-of-the-art performance and has enhanced scalability compared to existing kernel-based approaches for genomic data. This method can be useful to prioritize candidate genes, particularly when their number is large or when input data is highly heterogeneous. The code is freely available at https://github.com/gzampieri/Scuba .
Inferring gene regression networks with model trees
2010-01-01
Background Novel strategies are required in order to handle the huge amount of data produced by microarray technologies. To infer gene regulatory networks, the first step is to find direct regulatory relationships between genes building the so-called gene co-expression networks. They are typically generated using correlation statistics as pairwise similarity measures. Correlation-based methods are very useful in order to determine whether two genes have a strong global similarity but do not detect local similarities. Results We propose model trees as a method to identify gene interaction networks. While correlation-based methods analyze each pair of genes, in our approach we generate a single regression tree for each gene from the remaining genes. Finally, a graph from all the relationships among output and input genes is built taking into account whether the pair of genes is statistically significant. For this reason we apply a statistical procedure to control the false discovery rate. The performance of our approach, named REGNET, is experimentally tested on two well-known data sets: Saccharomyces cerevisiae and E. coli data set. First, the biological coherence of the results is tested. Second, the E. coli transcriptional network (in the Regulon database) is used as control to compare the results to those of a correlation-based method. This experiment shows that REGNET performs more accurately at detecting true gene associations than the Pearson and Spearman zeroth and first-order correlation-based methods. Conclusions REGNET generates gene association networks from gene expression data, and differs from correlation-based methods in that the relationship between one gene and others is calculated simultaneously. Model trees are very useful techniques to estimate the numerical values for the target genes by linear regression functions. 
They are very often more precise than linear regression models because they can fit different linear regressions to separate areas of the search space, favoring the inference of localized similarities over a more global similarity. Furthermore, experimental results show the good performance of REGNET. PMID:20950452
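The abstract above mentions controlling the false discovery rate when deciding which gene pairs enter the network, but does not name the procedure. The Benjamini-Hochberg step-up procedure is one widely used option and is sketched here purely as an assumption; the function name and the p-values are hypothetical.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return the indices of hypotheses rejected at FDR level alpha."""
    m = len(pvalues)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up rule: remember the largest rank that passes its threshold.
        if pvalues[idx] <= rank / m * alpha:
            cutoff = rank
    return sorted(order[:cutoff])


# Hypothetical p-values for four candidate gene-gene edges.
edge_pvalues = [0.02, 0.024, 0.5, 0.9]
print(benjamini_hochberg(edge_pvalues))
```

Note that the first p-value fails its own threshold but is still rejected because a later rank passes; that is the step-up behavior that distinguishes this procedure from a simple per-test cutoff.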
Naaijen, J; Bralten, J; Poelmans, G; Faraone, Stephen; Asherson, Philip; Banaschewski, Tobias; Buitelaar, Jan; Franke, Barbara; P Ebstein, Richard; Gill, Michael; Miranda, Ana; D Oades, Robert; Roeyers, Herbert; Rothenberger, Aribert; Sergeant, Joseph; Sonuga-Barke, Edmund; Anney, Richard; Mulas, Fernando; Steinhausen, Hans-Christoph; Glennon, J C; Franke, B; Buitelaar, J K
2017-01-01
Attention-deficit/hyperactivity disorder (ADHD) and autism spectrum disorders (ASD) often co-occur. Both are highly heritable; however, it has been difficult to discover genetic risk variants. Glutamate and GABA are main excitatory and inhibitory neurotransmitters in the brain; their balance is essential for proper brain development and functioning. In this study we investigated the role of glutamate and GABA genetics in ADHD severity, autism symptom severity and inhibitory performance, based on gene set analysis, an approach to investigate multiple genetic variants simultaneously. Common variants within glutamatergic and GABAergic genes were investigated using the MAGMA software in an ADHD case-only sample (n=931), in which we assessed ASD symptoms and response inhibition on a Stop task. Gene set analyses for ADHD symptom severity, divided into inattention and hyperactivity/impulsivity symptoms, autism symptom severity and inhibition were performed using principal component regression analyses. Subsequently, gene-wide association analyses were performed. The glutamate gene set showed an association with severity of hyperactivity/impulsivity (P=0.009), which was robust to correcting for genome-wide association levels. The GABA gene set showed nominally significant association with inhibition (P=0.04), but this did not survive correction for multiple comparisons. None of the single-gene or single-variant associations was significant on its own. By analyzing multiple genetic variants within candidate gene sets together, we were able to find genetic associations supporting the involvement of excitatory and inhibitory neurotransmitter systems in ADHD and ASD symptom severity in ADHD. PMID:28072412
Fuzzy measures on the Gene Ontology for gene product similarity.
Popescu, Mihail; Keller, James M; Mitchell, Joyce A
2006-01-01
One of the most important objects in bioinformatics is a gene product (protein or RNA). For many gene products, functional information is summarized in a set of Gene Ontology (GO) annotations. For these genes, it is reasonable to include similarity measures based on the terms found in the GO or other taxonomy. In this paper, we introduce several novel measures for computing the similarity of two gene products annotated with GO terms. The fuzzy measure similarity (FMS) has the advantage that it takes into consideration the context of both complete sets of annotation terms when computing the similarity between two gene products. When the two gene products are not annotated by common taxonomy terms, we propose a method that avoids a zero similarity result. To account for the variations in the annotation reliability, we propose a similarity measure based on the Choquet integral. These similarity measures provide extra tools for the biologist in search of functional information for gene products. The initial testing on a group of 194 sequences representing three protein families shows a higher correlation of the FMS and Choquet similarities to the BLAST sequence similarities than the traditional similarity measures such as pairwise average or pairwise maximum.
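For context on the traditional baselines named at the end of the abstract above, the sketch below computes the pairwise average and pairwise maximum similarity between two gene products from a term-to-term similarity function. It does not implement the FMS or Choquet measures proposed in the paper; the term identifiers and similarity values are hypothetical.

```python
def pairwise_similarities(terms_a, terms_b, term_sim):
    """Return (pairwise average, pairwise maximum) over all term pairs.

    terms_a, terms_b: GO terms annotating each gene product
    term_sim: function giving the similarity of two GO terms
    """
    scores = [term_sim(a, b) for a in terms_a for b in terms_b]
    return sum(scores) / len(scores), max(scores)


# Hypothetical term-to-term similarities, for illustration only.
toy_table = {
    ("t1", "t3"): 0.2,
    ("t1", "t4"): 0.8,
    ("t2", "t3"): 0.4,
    ("t2", "t4"): 0.6,
}


def toy_sim(a, b):
    return toy_table[(a, b)]


average, maximum = pairwise_similarities(["t1", "t2"], ["t3", "t4"], toy_sim)
print(average, maximum)
```

Both baselines reduce the comparison to individual term pairs, which is the behavior the abstract contrasts with measures that consider the two complete annotation sets in context.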
Comparative physical mapping between wheat chromosome arm 2BL and rice chromosome 4.
Lee, Tong Geon; Lee, Yong Jin; Kim, Dae Yeon; Seo, Yong Weon
2010-12-01
Physical maps of chromosomes provide a framework for organizing and integrating diverse genetic information. DNA microarrays are a valuable technique for physical mapping and can also be used to facilitate the discovery of single feature polymorphisms (SFPs). Wheat chromosome arm 2BL was physically mapped using a Wheat Genome Array onto near-isogenic lines (NILs) with the aid of wheat-rice synteny and mapped wheat EST information. Using high variance probe set (HVP) analysis, 314 HVPs constituting genes present on 2BL were identified. The 314 HVPs were grouped into 3 categories: HVPs that match only rice chromosome 4 (298 HVPs), those that match only wheat ESTs mapped on 2BL (1), and those that match both rice chromosome 4 and wheat ESTs mapped on 2BL (15). All HVPs were converted into gene sets, which represented either unique rice gene models or mapped wheat ESTs that matched identified HVPs. Comparative physical maps were constructed for 16 wheat gene sets and 271 rice gene sets. Of the 271 rice gene sets, 257 were mapped to the 18-35 Mb regions on rice chromosome 4. Based on HVP analysis and sequence similarity between the gene models in the rice chromosomes and mapped wheat ESTs, the outermost rice gene model that limits the translocation breakpoint to orthologous regions was identified.
Carreón-Diazconti, Concepción; Santamaría, Johanna; Berkompas, Justin; Field, James A.; Brusseau, Mark L.
2010-01-01
Isotopic analysis and molecular-based bioassay methods were used in conjunction with geochemical data to assess intrinsic reductive dechlorination processes for a chlorinated-solvent contaminated site in Tucson, Arizona. Groundwater samples were obtained from monitoring wells within a contaminant plume comprising tetrachloroethene and its metabolites trichloroethene, cis-1,2-dichloroethene, vinyl chloride, and ethene, as well as compounds associated with free-phase diesel present at the site. Compound specific isotope (CSI) analysis was performed to characterize biotransformation processes influencing the transport and fate of the chlorinated contaminants. PCR analysis was used to assess the presence of indigenous reductive dechlorinators. The target regions employed were the 16S rRNA gene sequences of Dehalococcoides sp. and Desulfuromonas sp., and DNA sequences of genes pceA, tceA, bvcA, and vcrA, which encode reductive dehalogenases. The results of the analyses indicate that relevant microbial populations are present and that reductive dechlorination is presently occurring at the site. The results further show that potential degrader populations as well as biotransformation activity are non-uniformly distributed within the site. The results of laboratory microcosm studies conducted using groundwater collected from the field site confirmed the reductive dechlorination of tetrachloroethene to dichloroethene. This study illustrates the use of an integrated, multiple-method approach for assessing natural attenuation at a complex chlorinated-solvent contaminated site. PMID:19603638
5 CFR 9901.355 - Setting pay upon reduction in band.
Code of Federal Regulations, 2010 CFR
2010-01-01
... involuntarily, the setting of the employee's base salary rate is subject to the rules in this section. As..., the employee's base salary may be reduced, subject to the requirements in paragraph (b) of this section. The employee may be eligible for an increase to base salary, subject to the requirements in...
Kaushik, Abhinav; Bhatia, Yashuma; Ali, Shakir; Gupta, Dinesh
2015-01-01
Metastatic melanoma patients have a poor prognosis, mainly attributable to the underlying heterogeneity in melanoma driver genes and altered gene expression profiles. These characteristics of melanoma also make the development of drugs and identification of novel drug targets for metastatic melanoma a daunting task. Systems biology offers an alternative approach to re-explore the genes or gene sets that display dysregulated behaviour without being differentially expressed. In this study, we have performed systems biology studies to enhance our knowledge about the conserved property of disease genes or gene sets among mutually exclusive datasets representing melanoma progression. We meta-analysed 642 microarray samples to generate melanoma reconstructed networks representing four different stages of melanoma progression to extract genes with altered molecular circuitry wiring as compared to a normal cellular state. Intriguingly, a majority of the melanoma network-rewired genes are not differentially expressed and the disease genes involved in melanoma progression consistently modulate its activity by rewiring network connections. We found that the shortlisted disease genes in the study show strong and abnormal network connectivity, which increases with disease progression. Moreover, the deviated network properties of the disease gene sets allow ranking/prioritization of different enriched, dysregulated and conserved pathway terms in metastatic melanoma, in agreement with previous findings. Our analysis also reveals the presence of distinct network hubs in different stages of metastasizing tumor for the same set of pathways in the statistically conserved gene sets. The study results are also presented as a freely available database at http://bioinfo.icgeb.res.in/m3db/. The web-based database resource consists of results from the analysis presented here, integrated with Cytoscape Web and user-friendly tools for visualization, retrieval and further analysis. PMID:26558755
Muir, Lindsey A.; Murry, Charles E.
2016-01-01
In Duchenne muscular dystrophy (DMD) and other muscle wasting disorders, cell therapies are a promising route for promoting muscle regeneration by supplying a functional copy of the missing dystrophin gene and contributing new muscle fibers. The clinical application of cell-based therapies is resource intensive, and it will therefore be necessary to address key limitations that reduce cell engraftment into muscle tissue. A pressing issue is poor donor cell survival following transplantation, which in preclinical studies limits the ability to effectively test the impact of cell-based therapy on whole muscle function. We, therefore, sought to improve engraftment and the functional impact of in vivo myogenically converted dermal fibroblasts (dFbs) using a prosurvival cocktail (PSC) that includes heat shock followed by treatment with insulin-like growth factor-1, a caspase inhibitor, a Bcl-XL peptide, a KATP channel opener, basic fibroblast growth factor, Matrigel, and cyclosporine A. Advantages of dFbs include compatibility with the autologous setting, ease of isolation, and greater proliferative potential than DMD satellite cells. dFbs expressed tamoxifen-inducible MyoD and carried a mini-dystrophin gene driven by a muscle-specific promoter. After transplantation into muscles of mdx mice, a 70% reduction in donor cells was observed by day 5, and a 94% reduction by day 28. However, treatment with PSC gave a nearly three-fold increase in donor cells in early engraftment, and greatly increased the number of donor-contributed muscle fibers and total engrafted area in transplanted muscles. Furthermore, dystrophic muscles that received dFbs with PSC displayed reduced injury with eccentric contractions and an increase in maximum isometric force. Thus, enhancing survival of myogenic cells increases engraftment and improves structure and function of dystrophic muscle. PMID:27503462
Muir, Lindsey A; Murry, Charles E; Chamberlain, Jeffrey S
2016-09-07
In Duchenne muscular dystrophy (DMD) and other muscle wasting disorders, cell therapies are a promising route for promoting muscle regeneration by supplying a functional copy of the missing dystrophin gene and contributing new muscle fibers. The clinical application of cell-based therapies is resource intensive, and it will therefore be necessary to address key limitations that reduce cell engraftment into muscle tissue. A pressing issue is poor donor cell survival following transplantation, which in preclinical studies limits the ability to effectively test the impact of cell-based therapy on whole muscle function. We, therefore, sought to improve engraftment and the functional impact of in vivo myogenically converted dermal fibroblasts (dFbs) using a prosurvival cocktail (PSC) that includes heat shock followed by treatment with insulin-like growth factor-1, a caspase inhibitor, a Bcl-XL peptide, a KATP channel opener, basic fibroblast growth factor, Matrigel, and cyclosporine A. Advantages of dFbs include compatibility with the autologous setting, ease of isolation, and greater proliferative potential than DMD satellite cells. dFbs expressed tamoxifen-inducible MyoD and carried a mini-dystrophin gene driven by a muscle-specific promoter. After transplantation into muscles of mdx mice, a 70% reduction in donor cells was observed by day 5, and a 94% reduction by day 28. However, treatment with PSC gave a nearly three-fold increase in donor cells in early engraftment, and greatly increased the number of donor-contributed muscle fibers and total engrafted area in transplanted muscles. Furthermore, dystrophic muscles that received dFbs with PSC displayed reduced injury with eccentric contractions and an increase in maximum isometric force. Thus, enhancing survival of myogenic cells increases engraftment and improves structure and function of dystrophic muscle.
Jin, Yulan; Sharma, Ashok; Bai, Shan; Davis, Colleen; Liu, Haitao; Hopkins, Diane; Barriga, Kathy; Rewers, Marian; She, Jin-Xiong
2014-07-01
There is tremendous scientific and clinical value to further improving the predictive power of autoantibodies because autoantibody-positive (AbP) children have heterogeneous rates of progression to clinical diabetes. This study explored the potential of gene expression profiles as biomarkers for risk stratification among 104 AbP subjects from the Diabetes Autoimmunity Study in the Young (DAISY) using a discovery data set based on microarray and a validation data set based on real-time RT-PCR. The microarray data identified 454 candidate genes with expression levels associated with various type 1 diabetes (T1D) progression rates. RT-PCR analyses of the top-27 candidate genes confirmed 5 genes (BACH2, IGLL3, EIF3A, CDC20, and TXNDC5) associated with differential progression and implicated in lymphocyte activation and function. Multivariate analyses of these five genes in the discovery and validation data sets identified and confirmed four multigene models (BI, ICE, BICE, and BITE, with each letter representing a gene) that consistently stratify high- and low-risk subsets of AbP subjects with hazard ratios >6 (P < 0.01). The results suggest that these genes may be involved in T1D pathogenesis and potentially serve as excellent gene expression biomarkers to predict the risk of progression to clinical diabetes for AbP subjects. © 2014 by the American Diabetes Association.
Marbach, Daniel; Roy, Sushmita; Ay, Ferhat; Meyer, Patrick E.; Candeias, Rogerio; Kahveci, Tamer; Bristow, Christopher A.; Kellis, Manolis
2012-01-01
Gaining insights on gene regulation from large-scale functional data sets is a grand challenge in systems biology. In this article, we develop and apply methods for transcriptional regulatory network inference from diverse functional genomics data sets and demonstrate their value for gene function and gene expression prediction. We formulate the network inference problem in a machine-learning framework and use both supervised and unsupervised methods to predict regulatory edges by integrating transcription factor (TF) binding, evolutionarily conserved sequence motifs, gene expression, and chromatin modification data sets as input features. Applying these methods to Drosophila melanogaster, we predict ∼300,000 regulatory edges in a network of ∼600 TFs and 12,000 target genes. We validate our predictions using known regulatory interactions, gene functional annotations, tissue-specific expression, protein–protein interactions, and three-dimensional maps of chromosome conformation. We use the inferred network to identify putative functions for hundreds of previously uncharacterized genes, including many in nervous system development, which are independently confirmed based on their tissue-specific expression patterns. Last, we use the regulatory network to predict target gene expression levels as a function of TF expression, and find significantly higher predictive power for integrative networks than for motif or ChIP-based networks. Our work reveals the complementarity between physical evidence of regulatory interactions (TF binding, motif conservation) and functional evidence (coordinated expression or chromatin patterns) and demonstrates the power of data integration for network inference and studies of gene regulation at the systems level. PMID:22456606
Springer, M S; Amrine, H M; Burk, A; Stanhope, M J
1999-03-01
We concatenated sequences for four mitochondrial genes (12S rRNA, tRNA valine, 16S rRNA, cytochrome b) and four nuclear genes [aquaporin, alpha 2B adrenergic receptor (A2AB), interphotoreceptor retinoid-binding protein (IRBP), von Willebrand factor (vWF)] into a multigene data set representing 11 eutherian orders (Artiodactyla, Hyracoidea, Insectivora, Lagomorpha, Macroscelidea, Perissodactyla, Primates, Proboscidea, Rodentia, Sirenia, Tubulidentata). Within this data set, we recognized nine mitochondrial partitions (both stems and loops, for each of 12S rRNA, tRNA valine, and 16S rRNA; and first, second, and third codon positions of cytochrome b) and 12 nuclear partitions (first, second, and third codon positions, respectively, of each of the four nuclear genes). Four of the 21 partitions (third positions of cytochrome b, A2AB, IRBP, and vWF) showed significant heterogeneity in base composition across taxa. Phylogenetic analyses (parsimony, minimum evolution, maximum likelihood) based on sequences for all 21 partitions provide 99-100% bootstrap support for Afrotheria and Paenungulata. With the elimination of the four partitions exhibiting heterogeneity in base composition, there is also high bootstrap support (89-100%) for cow + horse. Statistical tests reject Altungulata, Anagalida, and Ungulata. Data set heterogeneity between mitochondrial and nuclear genes is most evident when all partitions are included in the phylogenetic analyses. Mitochondrial-gene trees associate cow with horse, whereas nuclear-gene trees associate cow with hedgehog and these two with horse. However, after eliminating third positions of A2AB, IRBP, and vWF, nuclear data agree with mitochondrial data in supporting cow + horse. Nuclear genes provide stronger support for both Afrotheria and Paenungulata. Removal of third positions of cytochrome b results in improved performance for the mitochondrial genes in recovering these clades.
Zwaenepoel, Arthur; Diels, Tim; Amar, David; Van Parys, Thomas; Shamir, Ron; Van de Peer, Yves; Tzfadia, Oren
2018-01-01
Recent times have seen an enormous growth of "omics" data, of which high-throughput gene expression data are arguably the most important from a functional perspective. Despite huge improvements in computational techniques for the functional classification of gene sequences, common similarity-based methods often fall short of providing full and reliable functional information. Recently, the combination of comparative genomics with approaches in functional genomics has received considerable interest for gene function analysis, leveraging both gene expression based guilt-by-association methods and annotation efforts in closely related model organisms. Besides the identification of missing genes in pathways, these methods also typically enable the discovery of biological regulators (i.e., transcription factors or signaling genes). A previously built guilt-by-association method is MORPH, which was proven to be an efficient algorithm that performs particularly well in identifying and prioritizing missing genes in plant metabolic pathways. Here, we present MorphDB, a resource where MORPH-based candidate genes for large-scale functional annotations (Gene Ontology, MapMan bins) are integrated across multiple plant species. Besides a gene centric query utility, we present a comparative network approach that enables researchers to efficiently browse MORPH predictions across functional gene sets and species, facilitating efficient gene discovery and candidate gene prioritization. MorphDB is available at http://bioinformatics.psb.ugent.be/webtools/morphdb/morphDB/index/. We also provide a toolkit, named "MORPH bulk" (https://github.com/arzwa/morph-bulk), for running MORPH in bulk mode on novel data sets, enabling researchers to apply MORPH to their own species of interest.
Disentangling the multigenic and pleiotropic nature of molecular function
2015-01-01
Background: Biological processes at the molecular level are usually represented by molecular interaction networks. Function is organised and modularity identified based on network topology; however, this approach often fails to account for the dynamic and multifunctional nature of molecular components. For example, a molecule engaging in spatially or temporally independent functions may be inappropriately clustered into a single functional module. To capture biologically meaningful sets of interacting molecules, we use experimentally defined pathways as spatial/temporal units of molecular activity. Results: We defined functional profiles of Saccharomyces cerevisiae based on a minimal set of Gene Ontology terms sufficient to represent each pathway's genes. The Gene Ontology terms were used to annotate 271 pathways, accounting for pathway multi-functionality and gene pleiotropy. Pathways were then arranged into a network, linked by shared functionality. Of the genes in our data set, 44% appeared in multiple pathways performing a diverse set of functions. Linking pathways by overlapping functionality revealed a modular network with energy metabolism forming a sparse centre, surrounded by several denser clusters composed of regulatory and metabolic pathways. Signalling pathways formed a relatively discrete cluster connected to the centre of the network. Genetic interactions were enriched within the clusters of pathways by a factor of 5.5, confirming that the organisation of our pathway network is biologically significant. Conclusions: Our representation of molecular function according to pathway relationships enables analysis of gene/protein activity in the context of specific functional roles, as an alternative to typical molecule-centric graph-based methods. The pathway network demonstrates the cooperation of multiple pathways to perform biological processes and organises pathways into functionally related clusters with interdependent outcomes. PMID:26678917
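As a rough illustration of linking pathways by shared functionality, the sketch below builds a small pathway network from annotation sets and computes the fraction of genes that appear in more than one pathway. All pathway names, gene names, and term labels are hypothetical placeholders, and the "share at least one term" rule is an assumption rather than the study's actual linking criterion.

```python
from itertools import combinations

# Hypothetical toy annotations (placeholders, not data from the study).
pathway_terms = {
    "glycolysis": {"term_carbohydrate", "term_energy"},
    "tca_cycle": {"term_organic_acid", "term_energy"},
    "mapk_signalling": {"term_signal"},
}
pathway_genes = {
    "glycolysis": {"geneA", "geneB", "geneC"},
    "tca_cycle": {"geneC", "geneD"},
    "mapk_signalling": {"geneE"},
}

def build_pathway_network(terms_by_pathway):
    """Return {(p, q): shared_terms} for pathway pairs sharing at least one term."""
    edges = {}
    for p, q in combinations(sorted(terms_by_pathway), 2):
        shared = terms_by_pathway[p] & terms_by_pathway[q]
        if shared:
            edges[(p, q)] = shared
    return edges

def fraction_multi_pathway_genes(genes_by_pathway):
    """Fraction of distinct genes that occur in more than one pathway."""
    counts = {}
    for genes in genes_by_pathway.values():
        for gene in genes:
            counts[gene] = counts.get(gene, 0) + 1
    return sum(1 for c in counts.values() if c > 1) / len(counts)

edges = build_pathway_network(pathway_terms)
multi_fraction = fraction_multi_pathway_genes(pathway_genes)
```

In this toy input only the two metabolic pathways are linked, and one of five genes is shared between pathways.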
Detecting discordance enrichment among a series of two-sample genome-wide expression data sets.
Lai, Yinglei; Zhang, Fanni; Nayak, Tapan K; Modarres, Reza; Lee, Norman H; McCaffrey, Timothy A
2017-01-25
With the current microarray and RNA-seq technologies, two-sample genome-wide expression data have been widely collected in biological and medical studies. The related differential expression analysis and gene set enrichment analysis have been frequently conducted. Integrative analysis can be conducted when multiple data sets are available. In practice, discordant molecular behaviors among a series of data sets can be of biological and clinical interest. In this study, a statistical method is proposed for detecting discordance gene set enrichment. Our method is based on a two-level multivariate normal mixture model. It is statistically efficient with linearly increased parameter space when the number of data sets is increased. The model-based probability of discordance enrichment can be calculated for gene set detection. We apply our method to a microarray expression data set collected from forty-five matched tumor/non-tumor pairs of tissues for studying pancreatic cancer. We divided the data set into a series of non-overlapping subsets according to the tumor/non-tumor paired expression ratio of gene PNLIP (pancreatic lipase, recently shown to be associated with pancreatic cancer). The log-ratio ranges from a negative value (i.e. more expressed in non-tumor tissue) to a positive value (i.e. more expressed in tumor tissue). Our purpose is to understand whether any gene sets are enriched in discordant behaviors among these subsets (when the log-ratio is increased from negative to positive). We focus on KEGG pathways. The detected pathways will be useful for our further understanding of the role of gene PNLIP in pancreatic cancer research. Among the top list of detected pathways, the neuroactive ligand receptor interaction and olfactory transduction pathways are the two most significant. Then, we consider gene TP53, which is well known for its role as a tumor suppressor in cancer research. The log-ratio also ranges from a negative value (i.e. more expressed in non-tumor tissue) to a positive value (i.e. more expressed in tumor tissue). We divided the microarray data set again according to the expression ratio of gene TP53. After the discordance enrichment analysis, we observed overall similar results and the above two pathways are still the most significant detections. More interestingly, only these two pathways have been identified for their association with pancreatic cancer in a pathway analysis of genome-wide association study (GWAS) data. This study illustrates that some disease-related pathways can be enriched in discordant molecular behaviors when an important disease-related gene changes its expression. Our proposed statistical method is useful in the detection of these pathways. Furthermore, our method can also be applied to genome-wide expression data collected by the recent RNA-seq technology.
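The subset construction described in this abstract can be sketched as follows: matched pairs are ranked by the log-ratio of tumor to non-tumor expression for one gene and then cut into contiguous, non-overlapping groups. The equal-size cut, the base-2 logarithm, and all sample values are assumptions made for illustration; the abstract does not state how the subset boundaries were chosen.

```python
from math import log2

def split_by_log_ratio(pairs, n_subsets):
    """Rank matched pairs by log2(tumor / non-tumor) and split into contiguous subsets.

    pairs: list of (pair_id, tumor_expression, non_tumor_expression).
    Returns a list of n_subsets lists of pair ids, ordered from the most
    negative log-ratio to the most positive.
    """
    ranked = sorted(pairs, key=lambda p: log2(p[1] / p[2]))
    size, extra = divmod(len(ranked), n_subsets)
    subsets, start = [], 0
    for k in range(n_subsets):
        # The first `extra` subsets take one additional pair each.
        end = start + size + (1 if k < extra else 0)
        subsets.append([p[0] for p in ranked[start:end]])
        start = end
    return subsets

# Hypothetical expression values for five matched pairs (not study data).
pairs = [("p1", 8, 2), ("p2", 1, 4), ("p3", 3, 3), ("p4", 2, 4), ("p5", 16, 2)]
subsets = split_by_log_ratio(pairs, n_subsets=2)
```

Each resulting subset could then be analysed as one two-sample data set in the series.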
Parallel Clustering Algorithm for Large-Scale Biological Data Sets
Wang, Minchao; Zhang, Wu; Ding, Wang; Dai, Dongbo; Zhang, Huiran; Xie, Hao; Chen, Luonan; Guo, Yike; Xie, Jiang
2014-01-01
Background: The recent explosion of biological data poses a great challenge for traditional clustering algorithms. As data sets grow, cluster identification requires much more memory and much longer runtimes. The affinity propagation algorithm outperforms many other classical clustering algorithms and is widely applied in biological research. However, its time and space complexity become a major bottleneck when handling large-scale data sets. Moreover, the similarity matrix, which takes a long time to construct, is required before running the affinity propagation algorithm, since the algorithm clusters data sets based on the similarities between data pairs. Methods: Two types of parallel architectures are proposed in this paper to accelerate the similarity matrix construction procedure and the affinity propagation algorithm. The memory-shared architecture is used to construct the similarity matrix, and the distributed system is used for the affinity propagation algorithm, because of its large memory size and great computing capacity. An appropriate method of data partition and reduction is designed in our method, in order to minimize the global communication cost among processes. Results: A speedup of 100 is gained with 128 cores. The runtime is reduced from several hours to a few seconds, which indicates that the parallel algorithm is capable of handling large-scale data sets effectively. The parallel affinity propagation also achieves good performance when clustering large-scale gene data (microarray) and detecting families in large protein superfamilies. PMID:24705246
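A minimal sketch of the row-partitioned similarity matrix construction is given below. It assumes the negative squared Euclidean distance as the similarity measure (a common choice for affinity propagation, not stated in the abstract) and uses a thread pool only to show how rows can be divided among workers that share memory; the function names and the interleaved partitioning are illustrative, not the paper's implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def similarity_rows(points, row_indices):
    """Compute the similarity matrix rows for the given row indices.

    Similarity is the negative squared Euclidean distance (an assumption).
    """
    rows = {}
    for i in row_indices:
        rows[i] = [-sum((a - b) ** 2 for a, b in zip(points[i], p)) for p in points]
    return rows

def build_similarity_matrix(points, n_workers=4):
    """Build the full n x n similarity matrix, one chunk of rows per worker."""
    n = len(points)
    # Interleaved partition of row indices: worker w gets rows w, w + n_workers, ...
    chunks = [list(range(w, n, n_workers)) for w in range(n_workers)]
    matrix = [None] * n
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for rows in pool.map(lambda idx: similarity_rows(points, idx), chunks):
            for i, row in rows.items():
                matrix[i] = row
    return matrix

# Three hypothetical 2-D data points.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
S = build_similarity_matrix(points, n_workers=2)
```

In CPython, threads do not speed up pure-Python arithmetic, so this only shows the partitioning structure; a real shared-memory implementation would rely on native threads or vectorised routines.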
Oleksiak, Marjorie F; Karchner, Sibel I; Jenny, Matthew J; Franks, Diana G; Welch, David B Mark; Hahn, Mark E
2011-05-24
Populations of Atlantic killifish (Fundulus heteroclitus) have evolved resistance to the embryotoxic effects of polychlorinated biphenyls (PCBs) and other halogenated and nonhalogenated aromatic hydrocarbons that act through an aryl hydrocarbon receptor (AHR)-dependent signaling pathway. The resistance is accompanied by reduced sensitivity to induction of cytochrome P450 1A (CYP1A), a widely used biomarker of aromatic hydrocarbon exposure and effect, but whether the reduced sensitivity is specific to CYP1A or reflects a genome-wide reduction in responsiveness to all AHR-mediated changes in gene expression is unknown. We compared gene expression profiles and the response to 3,3',4,4',5-pentachlorobiphenyl (PCB-126) exposure in embryos (5 and 10 dpf) and larvae (15 dpf) from F. heteroclitus populations inhabiting the New Bedford Harbor, Massachusetts (NBH) Superfund site (PCB-resistant) and a reference site, Scorton Creek, Massachusetts (SC; PCB-sensitive). Analysis using a 7,000-gene cDNA array revealed striking differences in responsiveness to PCB-126 between the populations; the differences occur at all three stages examined. There was a sizeable set of PCB-responsive genes in the sensitive SC population, a much smaller set of PCB-responsive genes in NBH fish, and few similarities in PCB-responsive genes between the two populations. Most of the array results were confirmed, and additional PCB-regulated genes identified, by RNA-Seq (deep pyrosequencing). The results suggest that NBH fish possess a gene regulatory defect that is not specific to one target gene such as CYP1A but rather lies in a regulatory pathway that controls the transcriptional response of multiple genes to PCB exposure. The results are consistent with genome-wide disruption of AHR-dependent signaling in NBH fish.
Maier, Uwe-G; Zauner, Stefan; Woehle, Christian; Bolte, Kathrin; Hempel, Franziska; Allen, John F.; Martin, William F.
2013-01-01
Plastid and mitochondrial genomes have undergone parallel evolution to encode the same functional set of genes. These encode conserved protein components of the electron transport chain in their respective bioenergetic membranes and genes for the ribosomes that express them. This highly convergent aspect of organelle genome evolution is partly explained by the redox regulation hypothesis, which predicts a separate plastid or mitochondrial location for genes encoding bioenergetic membrane proteins of either photosynthesis or respiration. Here we show that convergence in organelle genome evolution is far stronger than previously recognized, because the same set of genes for ribosomal proteins is independently retained by both plastid and mitochondrial genomes. A hitherto unrecognized selective pressure retains genes for the same ribosomal proteins in both organelles. On the Escherichia coli ribosome assembly map, the retained proteins are implicated in 30S and 50S ribosomal subunit assembly and initial rRNA binding. We suggest that ribosomal assembly imposes functional constraints that govern the retention of ribosomal protein coding genes in organelles. These constraints are subordinate to redox regulation for electron transport chain components, which anchor the ribosome to the organelle genome in the first place. As organelle genomes undergo reduction, the rRNAs also become smaller. Below size thresholds of approximately 1,300 nucleotides (16S rRNA) and 2,100 nucleotides (26S rRNA), all ribosomal protein coding genes are lost from organelles, while electron transport chain components remain organelle encoded as long as the organelles use redox chemistry to generate a proton motive force. PMID:24259312
Gu, Y R; Li, M Z; Zhang, K; Chen, L; Jiang, A A; Wang, J Y; Li, X W
2011-08-01
To normalize a set of quantitative real-time PCR (q-PCR) data, it is essential to determine an optimal number/set of housekeeping genes, as the abundance of housekeeping genes can vary across tissues or cells during different developmental stages, or even under certain environmental conditions. In this study, of the 20 commonly used endogenous control genes, 13, 18 and 17 genes exhibited credible stability in 56 different tissues, 10 types of adipose tissue and five types of muscle tissue, respectively. Our analysis clearly showed that three optimal housekeeping genes are adequate for an accurate normalization, which correlated well with the theoretical optimal number (r ≥ 0.94). In terms of economical and experimental feasibility, we recommend the use of the three most stable housekeeping genes for calculating the normalization factor. Based on our results, the three most stable housekeeping genes in all analysed samples (TOP2B, HSPCB and YWHAZ) are recommended for accurate normalization of q-PCR data. We also suggest that two different sets of housekeeping genes are appropriate for 10 types of adipose tissue (the HSPCB, ALDOA and GAPDH genes) and five types of muscle tissue (the TOP2B, HSPCB and YWHAZ genes), respectively. Our report will serve as a valuable reference for other studies aimed at measuring tissue-specific mRNA abundance in porcine samples. © 2011 Blackwell Verlag GmbH.
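The normalization step can be illustrated with a short sketch. It assumes the widely used convention of taking the geometric mean of the reference genes' relative quantities as the normalization factor; the abstract does not spell out the formula, and the quantities below are hypothetical.

```python
from statistics import geometric_mean

def normalization_factor(reference_quantities):
    """Geometric mean of the relative quantities of the reference genes (assumed convention)."""
    return geometric_mean(list(reference_quantities))

def normalize(target_quantity, reference_quantities):
    """Divide the target gene's relative quantity by the normalization factor."""
    return target_quantity / normalization_factor(reference_quantities)

# Hypothetical relative quantities for one sample (not values from the study).
refs = {"TOP2B": 2.0, "HSPCB": 4.0, "YWHAZ": 8.0}
nf = normalization_factor(refs.values())
normalized_target = normalize(12.0, refs.values())
```

A geometric mean is less sensitive than an arithmetic mean to a single reference gene with an unusually high quantity, which is one reason it is often preferred for this purpose.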
Maize GO annotation—methods, evaluation, and review (maize-GAMER)
USDA-ARS's Scientific Manuscript database
We created a new high-coverage, robust, and reproducible functional annotation of maize protein-coding genes based on Gene Ontology (GO) term assignments. Whereas the existing Phytozome and Gramene maize GO annotation sets only cover 41% and 56% of maize protein-coding genes, respectively, this stu...
Veturi, Yogasudha; Ritchie, Marylyn D
2018-01-01
Transcriptome-wide association studies (TWAS) have recently been employed as an approach that can draw upon the advantages of genome-wide association studies (GWAS) and gene expression studies to identify genes associated with complex traits. Unlike standard GWAS, summary level data suffices for TWAS and offers improved statistical power. Two popular TWAS methods include either (a) imputing the cis genetic component of gene expression from smaller sized studies (using multi-SNP prediction or MP) into much larger effective sample sizes afforded by GWAS - TWAS-MP or (b) using summary-based Mendelian randomization - TWAS-SMR. Although these methods have been effective at detecting functional variants, it remains unclear how extensive variability in the genetic architecture of complex traits and diseases impacts TWAS results. Our goal was to investigate the different scenarios under which these methods yielded enough power to detect significant expression-trait associations. In this study, we conducted extensive simulations based on 6000 randomly chosen, unrelated Caucasian males from Geisinger's MyCode population to compare the power to detect cis expression-trait associations (within 500 kb of a gene) using the above-described approaches. To test TWAS across varying genetic backgrounds we simulated gene expression and phenotype using different quantitative trait loci per gene and cis-expression/trait heritability under genetic models that differentiate the effect of causality from that of pleiotropy. For each gene, on a training set ranging from 100 to 1000 individuals, we either (a) estimated regression coefficients with gene expression as the response using five different methods: LASSO, elastic net, Bayesian LASSO, Bayesian spike-slab, and Bayesian ridge regression or (b) performed eQTL analysis. We then sampled with replacement 50,000, 150,000, and 300,000 individuals respectively from the testing set of the remaining 5000 individuals and conducted GWAS on each set. Subsequently, we integrated the GWAS summary statistics derived from the testing set with the weights (or eQTLs) derived from the training set to identify expression-trait associations using (a) TWAS-MP (b) TWAS-SMR (c) eQTL-based GWAS, or (d) standalone GWAS. Finally, we examined the power to detect functionally relevant genes using the different approaches under the considered simulation scenarios. In general, we observed great similarities among TWAS-MP methods although the Bayesian methods resulted in improved power in comparison to LASSO and elastic net as the trait architecture grew more complex while training sample sizes and expression heritability remained small. Finally, we observed high power under causality but very low to moderate power under pleiotropy.
Assessment of RNAi-induced silencing in banana (Musa spp.).
Dang, Tuong Vi T; Windelinckx, Saskia; Henry, Isabelle M; De Coninck, Barbara; Cammue, Bruno P A; Swennen, Rony; Remy, Serge
2014-09-18
In plants, RNA-based gene silencing mediated by small RNAs functions at the transcriptional or post-transcriptional level to negatively regulate target genes, repetitive sequences, viral RNAs and/or transposon elements. Post-transcriptional gene silencing (PTGS) or the RNA interference (RNAi) approach has been achieved in a wide range of plant species for inhibiting the expression of target genes by generating double-stranded RNA (dsRNA). However, to our knowledge, successful RNAi-application to knock-down endogenous genes has not been reported in the important staple food crop banana. Using embryogenic cell suspension (ECS) transformed with β-glucuronidase (GUS) as a model system, we assessed silencing of gusAINT using three intron-spliced hairpin RNA (ihpRNA) constructs containing gusAINT sequences of 299-nt, 26-nt and 19-nt, respectively. Their silencing potential was analysed in 2 different experimental set-ups. In the first, Agrobacterium-mediated co-transformation of banana ECS with a gusAINT containing vector and an ihpRNA construct resulted in a significantly reduced GUS enzyme activity 6-8 days after co-cultivation with either the 299-nt or the 19-nt ihpRNA vector. In the second approach, these ihpRNA constructs were transferred to stable GUS-expressing ECS and their silencing potential was evaluated in the regenerated in vitro plants. In comparison to control plants, transgenic plants transformed with the 299-nt gusAINT targeting sequence showed a 4.5 fold down-regulated gusA mRNA expression level, while GUS enzyme activity was reduced by 9 fold. Histochemical staining of plant tissues confirmed these findings. Northern blotting used to detect the expression of siRNA in the 299-nt ihpRNA vector transgenic in vitro plants revealed a negative relationship between siRNA expression and GUS enzyme activity. In contrast, no reduction in GUS activity or GUS mRNA expression occurred in the regenerated lines transformed with either of the two gusAINT oligo target sequences (26-nt and 19-nt). RNAi-induced silencing was achieved in banana, both at transient and stable level, resulting in significant reduction of gene expression and enzyme activity. The success of silencing was dependent on the targeted region of the target gene. The successful generation of transgenic ECS for second transformation with (an)other construct(s) can be of value for functional genomics research in banana.
Shen, Xing-Xing; Salichos, Leonidas; Rokas, Antonis
2016-09-02
Molecular phylogenetic inference is inherently dependent on choices in both methodology and data. Many insightful studies have shown how choices in methodology, such as the model of sequence evolution or optimality criterion used, can strongly influence inference. In contrast, much less is known about the impact of choices in the properties of the data, typically genes, on phylogenetic inference. We investigated the relationships between 52 gene properties (24 sequence-based, 19 function-based, and 9 tree-based) with each other and with three measures of phylogenetic signal in two assembled data sets of 2,832 yeast and 2,002 mammalian genes. We found that most gene properties, such as evolutionary rate (measured through the percent average of pairwise identity across taxa) and total tree length, were highly correlated with each other. Similarly, several gene properties, such as gene alignment length, Guanine-Cytosine content, and the proportion of tree distance on internal branches divided by relative composition variability (treeness/RCV), were strongly correlated with phylogenetic signal. Analysis of partial correlations between gene properties and phylogenetic signal, in which gene evolutionary rate and alignment length were simultaneously controlled, showed similar patterns of correlations, albeit weaker in strength. Examination of the relative importance of each gene property on phylogenetic signal identified gene alignment length, along with the number of parsimony-informative sites and variable sites, as the most important predictors. Interestingly, the subsets of gene properties that optimally predicted phylogenetic signal differed considerably across our three phylogenetic measures and two data sets; however, gene alignment length and RCV were consistently included as predictors of all three phylogenetic measures in both yeasts and mammals. These results suggest that a handful of sequence-based gene properties are reliable predictors of phylogenetic signal and could be useful in guiding the choice of phylogenetic markers. © The Author 2016. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
Integrative sparse principal component analysis of gene expression data.
Liu, Mengque; Fan, Xinyan; Fang, Kuangnan; Zhang, Qingzhao; Ma, Shuangge
2017-12-01
In the analysis of gene expression data, dimension reduction techniques have been extensively adopted. The most popular one is perhaps the PCA (principal component analysis). To generate more reliable and more interpretable results, the SPCA (sparse PCA) technique has been developed. With the "small sample size, high dimensionality" characteristic of gene expression data, the analysis results generated from a single dataset are often unsatisfactory. Under contexts other than dimension reduction, integrative analysis techniques, which jointly analyze the raw data of multiple independent datasets, have been developed and shown to outperform "classic" meta-analysis and other multidatasets techniques and single-dataset analysis. In this study, we conduct integrative analysis by developing the iSPCA (integrative SPCA) method. iSPCA achieves the selection and estimation of sparse loadings using a group penalty. To take advantage of the similarity across datasets and generate more accurate results, we further impose contrasted penalties. Different penalties are proposed to accommodate different data conditions. Extensive simulations show that iSPCA outperforms the alternatives under a wide spectrum of settings. The analysis of breast cancer and pancreatic cancer data further shows iSPCA's satisfactory performance. © 2017 WILEY PERIODICALS, INC.
Malki, Karim; Tosto, Maria Grazia; Mouriño-Talín, Héctor; Rodríguez-Lorenzo, Sabela; Pain, Oliver; Jumhaboy, Irfan; Liu, Tina; Parpas, Panos; Newman, Stuart; Malykh, Artem; Carboni, Lucia; Uher, Rudolf; McGuffin, Peter; Schalkwyk, Leonard C; Bryson, Kevin; Herbster, Mark
2017-04-01
Response to antidepressant (AD) treatment may be a more polygenic trait than previously hypothesized, with many genetic variants interacting in yet unclear ways. In this study we used methods that can automatically learn to detect patterns of statistical regularity from a sparsely distributed signal across hippocampal transcriptome measurements in a large-scale animal pharmacogenomic study to uncover genomic variations associated with AD. The study used four inbred mouse strains of both sexes, two drug treatments, and a control group (escitalopram, nortriptyline, and saline). Multi-class and binary classification using Machine Learning (ML) and regularization algorithms using iterative and univariate feature selection methods, including InfoGain, mRMR, ANOVA, and Chi Square, were used to uncover genomic markers associated with AD response. Relevant genes were selected based on Jaccard distance and carried forward for gene-network analysis. Linear association methods uncovered only one gene associated with drug treatment response. The implementation of ML algorithms, together with feature reduction methods, revealed a set of 204 genes associated with SSRI and 241 genes associated with NRI response. Although only 10% of genes overlapped across the two drugs, network analysis shows that both drugs modulated the CREB pathway, through different molecular mechanisms. Through careful implementation and optimisations, the algorithms detected a weak signal used to predict whether an animal was treated with nortriptyline (77%) or escitalopram (67%) on an independent testing set. The results from this study indicate that the molecular signature of AD treatment may include a much broader range of genomic markers than previously hypothesized, suggesting that response to medication may be as complex as the pathology. The search for biomarkers of antidepressant treatment response could therefore consider a higher number of genetic markers and their interactions. 
Through predominately different molecular targets and mechanisms of action, the two drugs modulate the same Creb1 pathway which plays a key role in neurotrophic responses and in inflammatory processes. © 2016 The Authors. American Journal of Medical Genetics Part B: Neuropsychiatric Genetics Published by Wiley Periodicals, Inc.
Feather Development Genes and Associated Regulatory Innovation Predate the Origin of Dinosauria
Lowe, Craig B.; Clarke, Julia A.; Baker, Allan J.; Haussler, David; Edwards, Scott V.
2015-01-01
The evolution of avian feathers has recently been illuminated by fossils and the identification of genes involved in feather patterning and morphogenesis. However, molecular studies have focused mainly on protein-coding genes. Using comparative genomics and more than 600,000 conserved regulatory elements, we show that patterns of genome evolution in the vicinity of feather genes are consistent with a major role for regulatory innovation in the evolution of feathers. Rates of innovation at feather regulatory elements exhibit an extended period of innovation with peaks in the ancestors of amniotes and archosaurs. We estimate that 86% of such regulatory elements and 100% of the nonkeratin feather gene set were present prior to the origin of Dinosauria. On the branch leading to modern birds, we detect a strong signal of regulatory innovation near insulin-like growth factor binding protein (IGFBP) 2 and IGFBP5, which have roles in body size reduction, and may represent a genomic signature for the miniaturization of dinosaurian body size preceding the origin of flight. PMID:25415961
Discovery of error-tolerant biclusters from noisy gene expression data.
Gupta, Rohit; Rao, Navneet; Kumar, Vipin
2011-11-24
An important analysis performed on microarray gene-expression data is to discover biclusters, which denote groups of genes that are coherently expressed for a subset of conditions. Various biclustering algorithms have been proposed to find different types of biclusters from these real-valued gene-expression data sets. However, these algorithms suffer from several limitations, such as inability to explicitly handle errors/noise in the data; difficulty in discovering small biclusters due to their top-down approach; and inability of some of the approaches to find overlapping biclusters, which is crucial as many genes participate in multiple biological processes. Association pattern mining also produces biclusters as its result and can naturally address some of these limitations. However, traditional association mining only finds exact biclusters, which limits its applicability in real-life data sets where the biclusters may be fragmented due to random noise/errors. Moreover, as these methods only work with binary or boolean attributes, their application to gene-expression data requires transforming real-valued attributes to binary attributes, which often results in loss of information. Many past approaches have tried to address the issues of noise and of handling real-valued attributes independently, but there is no systematic approach that addresses both of these issues together. In this paper, we first propose a novel error-tolerant biclustering model, 'ET-bicluster', and then propose a bottom-up heuristic-based mining algorithm to sequentially discover error-tolerant biclusters directly from real-valued gene-expression data. The efficacy of our proposed approach is illustrated by comparing it with a recent approach, RAP, in the context of two biological problems: discovery of functional modules and discovery of biomarkers. 
For the first problem, two real-valued S. cerevisiae microarray gene-expression data sets are used to demonstrate that the biclusters obtained from the ET-bicluster approach not only recover a larger set of genes than those obtained from the RAP approach but also have higher functional coherence, as evaluated using GO-based functional enrichment analysis. The statistical significance of the discovered error-tolerant biclusters, as estimated using two randomization tests, reveals that they are indeed biologically meaningful and statistically significant. For the second problem of biomarker discovery, we used four real-valued breast cancer microarray gene-expression data sets and evaluated the biomarkers obtained using MSigDB gene sets. The results obtained for both problems, functional module discovery and biomarker discovery, clearly signify the usefulness of the proposed ET-bicluster approach and illustrate the importance of explicitly incorporating noise/errors in discovering coherent groups of genes from gene-expression data.
Bockamp, Ernesto; Sprengel, Rolf; Eshkind, Leonid; Lehmann, Thomas; Braun, Jan M; Emmrich, Frank; Hengstler, Jan G
2008-03-01
Many mouse models are currently available, providing avenues to elucidate gene function and to recapitulate specific pathological conditions. To a large extent, successful translation of clinical evidence or analytical data into appropriate mouse models is possible through progress in transgenic or gene-targeting technology. Beginning with a review of standard mouse transgenics and conventional gene targeting, this article will move on to discussing the basics of conditional gene expression: the tetracycline (tet)-off and tet-on systems based on the transactivators tet-controlled transactivator (tTA) and reverse tet-on transactivator (rtTA) that allow downregulation or induction of gene expression; Cre or Flp recombinase-mediated modifications, including excision, inversion, insertion and interchromosomal translocation; combination of the tet and Cre systems, permitting inducible knockout, reporter gene activation or activation of point mutations; the avian retroviral system based on delivery of rtTA specifically into cells expressing the avian retroviral receptor, which enables cell type-specific, inducible gene expression; the tamoxifen system, one of the most frequently applied steroid receptor-based systems, which allows rapid activation of a fusion protein between the gene of interest and a mutant domain of the estrogen receptor, whereby activation does not depend on transcription; and techniques for cell type-specific ablation. The diphtheria toxin receptor system offers the advantage that it can be combined with the 'zoo' of Cre recombinase driver mice. Having described the basics we move on to the cutting edge: generation of genome-wide sets of conditional knockout mice. To this end, large ongoing projects apply two strategies: gene trapping based on random integration of trapping vectors into introns leading to truncation of the transcript, and gene targeting, representing the directed approach using homologous recombination. 
It can be expected that in the near future genome-wide sets of such mice will be available. Finally, the possibilities of conditional expression systems for investigating gene function in tissue regeneration will be illustrated by examples for neurodegenerative disease, liver regeneration and wound healing of the skin.
Genomic Analysis of wig-1 Pathways
Sedaghat, Yalda; Mazur, Curt; Sabripour, Mahyar; Hung, Gene; Monia, Brett P.
2012-01-01
Background: Wig-1 is a transcription factor regulated by p53 that can interact with hnRNP A2/B1, RNA Helicase A, and dsRNAs, and which plays an important role in RNA and protein stabilization. In vitro studies have shown that wig-1 binds p53 mRNA and stabilizes it by protecting it from deadenylation. Furthermore, p53 has been implicated as a causal factor in neurodegenerative diseases based in part on its selective regulatory function on gene expression, including genes which, in turn, also possess regulatory functions on gene expression. In this study we focused on the wig-1 transcription factor as a downstream p53 regulated gene and characterized the effects of wig-1 down regulation on gene expression in mouse liver and brain. Methods and Results: Antisense oligonucleotides (ASOs) were identified that specifically target mouse wig-1 mRNA and produce a dose-dependent reduction in wig-1 mRNA levels in cell culture. These wig-1 ASOs produced marked reductions in wig-1 levels in liver following intraperitoneal administration and in brain tissue following ASO administration through a single striatal bolus injection in FVB and BACHD mice. Wig-1 suppression was well tolerated and resulted in the reduction of mutant Htt protein levels in BACHD mouse brain but had no effect on normal Htt protein levels or on p53 mRNA or protein levels. Expression microarray analysis was employed to determine the effects of wig-1 suppression on genome-wide expression in mouse liver and brain. Reduction of wig-1 caused both down regulation and up regulation of several genes, and a number of wig-1 regulated genes were identified that potentially link wig-1 to various signaling pathways and diseases. Conclusion: Antisense oligonucleotides can effectively reduce wig-1 levels in mouse liver and brain, which results in specific changes in gene expression for pathways relevant to both the nervous system and cancer. PMID:22347364
Chi, Jingyun; Mahé, Frédéric; Loidl, Josef; Logsdon, John; Dunthorn, Micah
2014-03-01
To establish which meiosis genes are present in ciliates, and to look for clues as to which recombination pathways they may use, four genomes were inventoried for 11 meiosis-specific and 40 meiosis-related genes. We found that the set of meiosis genes shared by Tetrahymena thermophila, Paramecium tetraurelia, Ichthyophthirius multifiliis, and Oxytricha trifallax is consistent with the prevalence of a Mus81-dependent class II crossover pathway that is considered secondary in most model eukaryotes. There is little evidence for a canonical class I crossover pathway that requires the formation of a synaptonemal complex (SC). This gene inventory suggests that meiotic processes in ciliates largely depend on mitotic repair proteins for executing meiotic recombination. We propose that class I crossovers and SCs were reduced sometime during the evolution of ciliates. Consistent with this reduction, we provide microscopic evidence for the presence only of degenerate SCs in Stylonychia mytilus. In addition, lower nonsynonymous to synonymous mutation rates of some of the meiosis genes suggest that, in contrast to most other nuclear genes analyzed so far, meiosis genes in ciliates are largely evolving at a slower rate than those genes in fungi and animals.
Yu, Liang; Wang, Bingbo; Ma, Xiaoke; Gao, Lin
2016-12-23
Extracting drug-disease correlations is crucial in unveiling disease mechanisms, as well as in discovering new indications for available drugs (drug repositioning). Both the interactome and the knowledge of disease-associated and drug-associated genes remain incomplete. We present a new method to predict the associations between drugs and diseases. Our method is based on a module distance, which was originally proposed to calculate distances between modules in the incomplete human interactome. We first map all the disease genes and drug genes to a combined protein interaction network. Then, based on the module distance, we calculate the distances between drug gene sets and disease gene sets, and take the distances as the relationships of drug-disease pairs. We also filter possible false positive drug-disease correlations by p-value. Finally, we validate the top-100 drug-disease associations related to six drugs in the predicted results. The overlap between our predicted correlations and those reported in the Comparative Toxicogenomics Database (CTD) and the literature, and their enriched Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, demonstrate that our approach can not only effectively identify new drug indications, but also provide new insight into drug-disease discovery.
Integrating Multiple Data Sources for Combinatorial Marker Discovery: A Study in Tumorigenesis.
Bandyopadhyay, Sanghamitra; Mallik, Saurav
2018-01-01
Identification of combinatorial markers from multiple data sources is a challenging task in bioinformatics. Here, we propose a novel computational framework for identifying significant combinatorial markers using both gene expression and methylation data. The gene expression and methylation data are integrated into a single continuous dataset as well as a (post-discretized) boolean dataset based on their intrinsic (i.e., inverse) relationship. A novel combined score of methylation and expression data is introduced, which is computed on the integrated continuous data to identify an initial non-redundant set of genes. Thereafter, (maximal) frequent closed homogeneous genesets are identified using a well-known biclustering algorithm applied to the integrated boolean data of the determined non-redundant set of genes. A novel sample-based weighted support is then proposed that is consecutively calculated on the integrated boolean data of the determined non-redundant set of genes in order to identify the non-redundant significant genesets. The top few resulting genesets are identified as potential significant combinatorial markers. Since our proposed method generates a smaller number of significant non-redundant genesets than other popular methods, it is much faster than the others. Application of the proposed technique to expression and methylation data for uterine tumor or prostate carcinoma produces a set of significant combinations of markers. We expect that such a combination of markers will produce fewer false positives than individual markers.
Selection of Differential Isolates of Magnaporthe oryzae for Postulation of Blast Resistance Genes.
Fang, W W; Liu, C C; Zhang, H W; Xu, H; Zhou, S; Fang, K X; Peng, Y L; Zhao, W S
2018-05-21
A set of differential isolates of Magnaporthe oryzae is needed for the postulation of blast resistance genes in numerous rice varieties and breeding materials. In this study, the pathotypes of 1,377 M. oryzae isolates from different regions of China were determined by inoculating detached rice leaves of 24 monogenic lines. Among them, 25 isolates were selected as differential isolates based on the following characteristics: they had distinct responses on the monogenic lines, contained the minimum number of avirulence genes, were stable in pathogenicity and conidiation during consecutive culture, had a consistent colony growth rate, and, together, could differentiate combinations of the 24 major blast resistance genes. Seedlings of rice cultivars were inoculated with this differential set of isolates to postulate whether they contain one or more of the 24 blast resistance genes. The results were consistent with those from polymerase chain reaction analysis of target resistance genes. Establishment of a standard set of differential isolates will facilitate breeding for blast resistance and improved management of rice blast disease.
Primer sets for cloning the human repertoire of T cell Receptor Variable regions.
Boria, Ilenia; Cotella, Diego; Dianzani, Irma; Santoro, Claudio; Sblattero, Daniele
2008-08-29
Amplification and cloning of naïve T cell Receptor (TR) repertoires or antigen-specific TR is crucial to shape immune response and to develop immuno-based therapies. TR variable (V) regions are encoded by several genes that recombine during T cell development. The cloning of expressed genes as large diverse libraries from natural sources relies upon the availability of primers able to amplify as many V genes as possible. Here, we present a list of primers computationally designed on all functional TR V and J genes listed in the IMGT, the ImMunoGeneTics information system. The list consists of unambiguous or degenerate primers suitable to theoretically amplify and clone the entire TR repertoire. We show that it is possible to selectively amplify and clone expressed TR V genes in one single RT-PCR step and from as little as 1000 cells. This new primer set will facilitate the creation of more diverse TR libraries than has been possible using currently available primer sets.
Srivastava, Mousami; Khurana, Pankaj; Sugadev, Ragumani
2012-11-02
The tissue-specific Unigene Sets derived from more than one million expressed sequence tags (ESTs) in the NCBI GenBank database offer a platform for identifying significantly and differentially expressed tissue-specific genes by in-silico methods. Digital differential display (DDD) rapidly creates transcription profiles based on EST comparisons and numerically calculates, as a fraction of the pool of ESTs, the relative sequence abundance of known and novel genes. However, the process of identifying the most likely tissue for a specific disease in which to search for candidate genes from the pool of differentially expressed genes remains difficult. Therefore, we have used the 'Gene Ontology semantic similarity score' to measure the GO similarity between gene products of lung tissue-specific candidate genes from control (normal) and disease (cancer) sets. This semantic similarity score matrix is represented, based on hierarchical clustering, in the form of a dendrogram. The dendrogram cluster stability was assessed by multiple bootstrapping. Multiple bootstrapping also computes a p-value for each cluster and corrects the bias of the bootstrap probability. Subsequent hierarchical clustering by the multiple bootstrapping method (α = 0.95) identified seven clusters. The comparative, as well as subtractive, approach revealed a set of 38 biomarkers comprising four distinct lung cancer signature biomarker clusters (panel 1-4). Further gene enrichment analysis of the four panels revealed that each panel represents a set of lung cancer linked metastasis diagnostic biomarkers (panel 1), chemotherapy/drug resistance biomarkers (panel 2), hypoxia regulated biomarkers (panel 3) and lung extracellular matrix biomarkers (panel 4). Expression analysis reveals that hypoxia induced lung cancer related biomarkers (panel 3), HIF and its modulating proteins (TGM2, CSNK1A1, CTNNA1, NAMPT/Visfatin, TNFRSF1A, ETS1, SRC-1, FN1, APLP2, DMBT1/SAG, AIB1 and AZIN1) are significantly down regulated. 
All down regulated genes in this panel were highly up regulated in most other types of cancers. These panels of proteins may represent signature biomarkers for lung cancer and will aid in lung cancer diagnosis and disease monitoring as well as in the prediction of responses to therapeutics.
Nosil, P; Crespi, B J
2004-01-01
Population differentiation often reflects a balance between divergent natural selection and the opportunity for homogenizing gene flow to erode the effects of selection. However, during ecological speciation, trait divergence results in reproductive isolation and becomes a cause, rather than a consequence, of reductions in gene flow. To assess both the causes and the reproductive consequences of morphological differentiation, we examined morphological divergence and sexual isolation among 17 populations of Timema cristinae walking-sticks. Individuals from populations adapted to using Adenostoma as a host plant tended to exhibit smaller overall body size, wide heads, and short legs relative to individuals using Ceanothus as a host. However, there was also significant variation in morphology among populations within host-plant species. Mean trait values for each single population could be reliably predicted based upon host-plant used and the potential for homogenizing gene flow, inferred from the size of the neighboring population using the alternate host and mitochondrial DNA estimates of gene flow. Morphology did not influence the probability of copulation in between-population mating trials. Thus, morphological divergence is facilitated by reductions in gene flow, but does not cause reductions in gene flow via the evolution of sexual isolation. Combined with rearing data indicating that size and shape have a partial genetic basis, evidence for parallel origins of the host-associated forms, and inferences from functional morphology, these results indicate that morphological divergence in T. cristinae reflects a balance between the effects of host-specific natural selection and gene flow. Our findings illustrate how data on mating preferences can help determine the causal associations between trait divergence and levels of gene flow.
Hsiao, Tzu-Hung; Chiu, Yu-Chiao; Hsu, Pei-Yin; Lu, Tzu-Pin; Lai, Liang-Chuan; Tsai, Mong-Hsun; Huang, Tim H.-M.; Chuang, Eric Y.; Chen, Yidong
2016-01-01
Several mutual information (MI)-based algorithms have been developed to identify dynamic gene-gene and function-function interactions governed by key modulators (genes, proteins, etc.). Due to intensive computation, however, these methods rely heavily on prior knowledge and are limited in genome-wide analysis. We present the modulated gene/gene set interaction (MAGIC) analysis to systematically identify genome-wide modulation of interaction networks. Based on a novel statistical test employing conjugate Fisher transformations of correlation coefficients, MAGIC features fast computation and adaptation to variations of clinical cohorts. In simulated datasets MAGIC achieved greatly improved computation efficiency and overall performance superior to that of the MI-based method. We applied MAGIC to construct the estrogen receptor (ER) modulated gene and gene set (representing biological function) interaction networks in breast cancer. Several novel interaction hubs and functional interactions were discovered. ER+ dependent interaction between TGFβ and NFκB was further shown to be associated with patient survival. The findings were verified in independent datasets. Using MAGIC, we also assessed the essential roles of ER modulation in another hormonal cancer, ovarian cancer. Overall, MAGIC is a systematic framework for comprehensively identifying and constructing the modulated interaction networks in a whole-genome landscape. MATLAB implementation of MAGIC is available for academic uses at https://github.com/chiuyc/MAGIC. PMID:26972162
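The idea of testing whether a gene-gene correlation changes between two groups of samples defined by a modulator can be illustrated with the classical Fisher z-transformation. The sketch below is not the MAGIC statistic (the abstract does not spell out its "conjugate" form); it is the textbook two-sample comparison of Pearson correlations, and all function names and toy expression values are hypothetical.

```python
import math
from statistics import NormalDist

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fisher_z_difference(r1, n1, r2, n2):
    """Two-sided test that two independent correlations differ.

    Each r is mapped to z = atanh(r), which is approximately normal with
    variance 1 / (n - 3). Requires n1, n2 > 3 and |r| < 1.
    Returns (z statistic, p-value).
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical expression of two genes in modulator-positive samples...
gene_a_pos = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
gene_b_pos = [1.1, 2.3, 2.9, 4.2, 4.8, 6.1]
# ...and in modulator-negative samples.
gene_a_neg = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
gene_b_neg = [3.0, 1.0, 4.0, 2.0, 3.5, 2.5]

r_pos = pearson(gene_a_pos, gene_b_pos)
r_neg = pearson(gene_a_neg, gene_b_neg)
z_stat, p_value = fisher_z_difference(r_pos, len(gene_a_pos),
                                      r_neg, len(gene_a_neg))
print(f"r_pos={r_pos:.3f} r_neg={r_neg:.3f} z={z_stat:.2f} p={p_value:.4f}")
```

A small p-value here would suggest that the pair is correlated differently in the two groups; in a genome-wide setting such a test would be repeated for every gene pair and corrected for multiple testing.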
Literature-based compound profiling: application to toxicogenomics.
Frijters, Raoul; Verhoeven, Stefan; Alkema, Wynand; van Schaik, René; Polman, Jan
2007-11-01
To reduce continuously increasing costs in drug development, adverse effects of drugs need to be detected as early as possible in the process. In recent years, compound-induced gene expression profiling methodologies have been developed to assess compound toxicity, including Gene Ontology term and pathway over-representation analyses. The objective of this study was to introduce an additional approach, in which literature information is used for compound profiling to evaluate compound toxicity and mode of toxicity. Gene annotations were built by text mining in Medline abstracts for retrieval of co-publications between genes, pathology terms, biological processes and pathways. This literature information was used to generate compound-specific keyword fingerprints, representing over-represented keywords calculated in a set of regulated genes after compound administration. To see whether keyword fingerprints can be used for assessment of compound toxicity, we analyzed microarray data sets of rat liver treated with 11 hepatotoxicants. Analysis of keyword fingerprints of two genotoxic carcinogens, two nongenotoxic carcinogens, two peroxisome proliferators and two randomly generated gene sets, showed that each compound produced a specific keyword fingerprint that correlated with the experimentally observed histopathological events induced by the individual compounds. By contrast, the random sets produced a flat aspecific keyword profile, indicating that the fingerprints induced by the compounds reflect biological events rather than random noise. A more detailed analysis of the keyword profiles of diethylhexylphthalate, dimethylnitrosamine and methapyrilene (MPy) showed that the differences in the keyword fingerprints of these three compounds are based upon known distinct modes of action. 
Visualization of MPy-linked keywords and MPy-induced genes in a literature network enabled us to construct a mode of toxicity proposal for MPy, which is in agreement with known effects of MPy in literature. Compound keyword fingerprinting based on information retrieved from literature is a powerful approach for compound profiling, allowing evaluation of compound toxicity and analysis of the mode of action.
Analysis of high-throughput biological data using their rank values.
Dembélé, Doulaye
2018-01-01
High-throughput biological technologies are routinely used to generate gene expression profiling or cytogenetics data. To achieve high performance, methods available in the literature become more specialized and often require high computational resources. Here, we propose a new versatile method based on the data-ordering rank values. We use linear algebra and the Perron-Frobenius theorem, and also extend a method presented earlier for searching differentially expressed genes to the detection of recurrent copy number aberration. A result derived from the proposed method is a one-sample Student's t-test based on rank values. The proposed method is, to our knowledge, the only one that applies to both gene expression profiling and cytogenetics data sets. This new method is fast, deterministic, and requires a low computational load. Probabilities are associated with genes to allow a statistically significant subset selection in the data set. Stability scores are also introduced as quality parameters. The performance and comparative analyses were carried out using real data sets. The proposed method can be accessed through an R package available from the CRAN (Comprehensive R Archive Network) website: https://cran.r-project.org/web/packages/fcros .
Henry, S.; Bru, D.; Stres, B.; Hallet, S.; Philippot, L.
2006-01-01
Nitrous oxide (N2O) is an important greenhouse gas in the troposphere controlling ozone concentration in the stratosphere through nitric oxide production. In order to quantify bacteria capable of N2O reduction, we developed a SYBR green quantitative real-time PCR assay targeting the nosZ gene encoding the catalytic subunit of the nitrous oxide reductase. Two independent sets of nosZ primers flanking the nosZ fragment previously used in diversity studies were designed and tested (K. Kloos, A. Mergel, C. Rösch, and H. Bothe, Aust. J. Plant Physiol. 28:991-998, 2001). The utility of these real-time PCR assays was demonstrated by quantifying the nosZ gene present in six different soils. Detection limits were between 10^1 and 10^2 target molecules per reaction for all assays. Sequence analysis of 128 cloned quantitative PCR products confirmed the specificity of the designed primers. The abundance of nosZ genes ranged from 10^5 to 10^7 target copies g^-1 of dry soil, whereas genes for 16S rRNA were found at 10^8 to 10^9 target copies g^-1 of dry soil. The abundance of narG and nirK genes was within the upper and lower limits of the 16S rRNA and nosZ gene copy numbers. The two sets of nosZ primers gave similar gene copy numbers for all tested soils. The maximum abundance of nosZ and nirK relative to 16S rRNA was 5 to 6%, confirming the low proportion of denitrifiers to total bacteria in soils. PMID:16885263
Empuku, Shinichiro; Nakajima, Kentaro; Akagi, Tomonori; Kaneko, Kunihiko; Hijiya, Naoki; Etoh, Tsuyoshi; Shiraishi, Norio; Moriyama, Masatsugu; Inomata, Masafumi
2016-05-01
Preoperative chemoradiotherapy (CRT) for locally advanced rectal cancer not only improves the postoperative local control rate, but also induces downstaging. However, it has not been established how to individually select patients who will benefit from preoperative CRT. The aim of this study was to identify a predictor of response to preoperative CRT for locally advanced rectal cancer. This study is additional to our multicenter phase II study evaluating the safety and efficacy of preoperative CRT using oral fluorouracil (UMIN ID: 03396). From April 2009 to August 2011, 26 biopsy specimens obtained prior to CRT were analyzed by cyclopedic microarray analysis. Response to CRT was evaluated according to a histological grading system using surgically resected specimens. To decide on the number of genes for dividing cases into responder and non-responder groups, we statistically analyzed the data using a dimension reduction method, principal component analysis. Of the 26 cases, 11 were responders and 15 non-responders. No significant difference was found in clinical background data between the two groups. We determined that the optimal number of genes for the prediction of response was 80 of 40,000, and the functions of these genes were analyzed. When comparing non-responders with responders, genes expressed at a high level functioned in alternative splicing, whereas those expressed at a low level functioned in the septin complex. Thus, an 80-gene expression set that predicts response to preoperative CRT for locally advanced rectal cancer was identified using a novel statistical method.
Contribution of Stress Responses to Antibiotic Tolerance in Pseudomonas aeruginosa Biofilms
Franklin, Michael J.; Williamson, Kerry S.; Folsom, James P.; Boegli, Laura; James, Garth A.
2015-01-01
Enhanced tolerance of biofilm-associated bacteria to antibiotic treatments is likely due to a combination of factors, including changes in cell physiology as bacteria adapt to biofilm growth and the inherent physiological heterogeneity of biofilm bacteria. In this study, a transcriptomics approach was used to identify genes differentially expressed during biofilm growth of Pseudomonas aeruginosa. These genes were tested for statistically significant overlap, with independently compiled gene lists corresponding to stress responses and other putative antibiotic-protective mechanisms. Among the gene groups tested were those associated with biofilm response to tobramycin or ciprofloxacin, drug efflux pumps, acyl homoserine lactone quorum sensing, osmotic shock, heat shock, hypoxia stress, and stationary-phase growth. Regulons associated with Anr-mediated hypoxia stress, RpoS-regulated stationary-phase growth, and osmotic stress were significantly enriched in the set of genes induced in the biofilm. Mutant strains deficient in rpoS, relA and spoT, or anr were cultured in biofilms and challenged with ciprofloxacin and tobramycin. When challenged with ciprofloxacin, the mutant strain biofilms had 2.4- to 2.9-log reductions in viable cells compared to a 0.9-log reduction of the wild-type strain. Interestingly, none of the mutants exhibited a statistically significant alteration in tobramycin susceptibility compared to that with the wild-type biofilm. These results are consistent with a model in which multiple genes controlled by overlapping starvation or stress responses contribute to the protection of a P. aeruginosa biofilm from ciprofloxacin. A distinct and as yet undiscovered mechanism protects the biofilm bacteria from tobramycin. PMID:25870065
Arnardottir, Erna S; Nikonova, Elena V; Shockley, Keith R; Podtelezhnikov, Alexei A; Anafi, Ron C; Tanis, Keith Q; Maislin, Greg; Stone, David J; Renger, John J; Winrow, Christopher J; Pack, Allan I
2014-10-01
To address whether changes in gene expression in blood cells with sleep loss are different in individuals resistant and sensitive to sleep deprivation. Blood draws every 4 h during a 3-day study: 24-h normal baseline, 38 h of continuous wakefulness and subsequent recovery sleep, for a total of 19 time-points per subject, with every 2-h psychomotor vigilance task (PVT) assessment when awake. Sleep laboratory. Fourteen subjects who were previously identified as behaviorally resistant (n = 7) or sensitive (n = 7) to sleep deprivation by PVT. Thirty-eight hours of continuous wakefulness. We found 4,481 unique genes with a significant 24-h diurnal rhythm during a normal sleep-wake cycle in blood (false discovery rate [FDR] < 5%). Biological pathways were enriched for biosynthetic processes during sleep. After accounting for circadian effects, two genes (SREBF1 and CPT1A, both involved in lipid metabolism) exhibited small, but significant, linear changes in expression with the duration of sleep deprivation (FDR < 5%). The main change with sleep deprivation was a reduction in the amplitude of the diurnal rhythm of expression of normally cycling probe sets. This reduction was noticeably higher in behaviorally resistant subjects than sensitive subjects, at any given P value. Furthermore, blood cell type enrichment analysis showed that the expression pattern difference between sensitive and resistant subjects is mainly found in cells of myeloid origin, such as monocytes. Individual differences in behavioral effects of sleep deprivation are associated with differences in diurnal amplitude of gene expression for genes that show circadian rhythmicity. © 2014 Associated Professional Sleep Societies, LLC.
Van Vlierberghe, Pieter; van Grotel, Martine; Tchinda, Joëlle; Lee, Charles; Beverloo, H. Berna; van der Spek, Peter J.; Stubbs, Andrew; Cools, Jan; Nagata, Kyosuke; Fornerod, Maarten; Buijs-Gladdines, Jessica; Horstmann, Martin; van Wering, Elisabeth R.; Soulier, Jean; Pieters, Rob
2008-01-01
T-cell acute lymphoblastic leukemia (T-ALL) is mostly characterized by specific chromosomal abnormalities, some occurring in a mutually exclusive manner that possibly delineate specific T-ALL subgroups. One subgroup, including MLL-rearranged, CALM-AF10 or inv (7)(p15q34) patients, is characterized by elevated expression of HOXA genes. Using a gene expression–based clustering analysis of 67 T-ALL cases with recurrent molecular genetic abnormalities and 25 samples lacking apparent aberrations, we identified 5 new patients with elevated HOXA levels. Using microarray-based comparative genomic hybridization (array-CGH), a cryptic and recurrent deletion, del (9)(q34.11q34.13), was exclusively identified in 3 of these 5 patients. This deletion results in a conserved SET-NUP214 fusion product, which was also identified in the T-ALL cell line LOUCY. SET-NUP214 binds in the promoter regions of specific HOXA genes, where it interacts with CRM1 and DOT1L, which may transcriptionally activate specific members of the HOXA cluster. Targeted inhibition of SET-NUP214 by siRNA abolished expression of HOXA genes, inhibited proliferation, and induced differentiation in LOUCY but not in other T-ALL lines. We conclude that SET-NUP214 may contribute to the pathogenesis of T-ALL by enforcing T-cell differentiation arrest. PMID:18299449
Ghadie, Mohamed A; Japkowicz, Nathalie; Perkins, Theodore J
2015-08-15
Stem cell differentiation is largely guided by master transcriptional regulators, but it also depends on the expression of other types of genes, such as cell cycle genes, signaling genes, metabolic genes, trafficking genes, etc. Traditional approaches to understanding gene expression patterns across multiple conditions, such as principal components analysis or K-means clustering, can group cell types based on gene expression, but they do so without knowledge of the differentiation hierarchy. Hierarchical clustering can organize cell types into a tree, but in general this tree is different from the differentiation hierarchy itself. Given the differentiation hierarchy and gene expression data at each node, we construct a weighted Euclidean distance metric such that the minimum spanning tree with respect to that metric is precisely the given differentiation hierarchy. We provide a set of linear constraints that are provably sufficient for the desired construction and a linear programming approach to identify sparse sets of weights, effectively identifying genes that are most relevant for discriminating different parts of the tree. We apply our method to microarray gene expression data describing 38 cell types in the hematopoiesis hierarchy, constructing a weighted Euclidean metric that uses just 175 genes. However, we find that there are many alternative sets of weights that satisfy the linear constraints. Thus, in the style of random-forest training, we also construct metrics based on random subsets of the genes and compare them to the metric of 175 genes. We then report on the selected genes and their biological functions. Our approach offers a new way to identify genes that may have important roles in stem cell differentiation. Contact: tperkins@ohri.ca. Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
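The property described above, that gene weights determine whether the minimum spanning tree reproduces a given hierarchy, can be checked on toy data with a short Python sketch. It only verifies the property for hand-picked weights; it does not implement the linear programming step from the abstract. The cell types, expression values, weights, and function names are hypothetical.

```python
import math

def weighted_distance(x, y, weights):
    """Weighted Euclidean distance between two expression profiles."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))

def minimum_spanning_tree(profiles, weights):
    """Prim's algorithm on the complete graph; returns a set of sorted index pairs."""
    n = len(profiles)
    # best[v] = (distance to the closest node already in the tree, that node)
    best = {v: (weighted_distance(profiles[0], profiles[v], weights), 0)
            for v in range(1, n)}
    edges = set()
    while best:
        v = min(best, key=lambda u: best[u][0])
        _, parent = best.pop(v)
        edges.add((min(parent, v), max(parent, v)))
        for u in best:
            d = weighted_distance(profiles[v], profiles[u], weights)
            if d < best[u][0]:
                best[u] = (d, v)
    return edges

# Hypothetical cell types: 0 = stem cell, 1 = myeloid, 2 = lymphoid, 3 = T cell.
# Each profile holds two genes: an informative one and a misleading one.
profiles = [[0.0, 0.0], [1.0, 0.0], [-1.0, 3.0], [-2.0, 1.0]]

# Assumed differentiation hierarchy: stem -> myeloid, stem -> lymphoid, lymphoid -> T cell.
hierarchy = {(0, 1), (0, 2), (2, 3)}

tree_sparse = minimum_spanning_tree(profiles, [1.0, 0.0])   # only the first gene weighted
tree_uniform = minimum_spanning_tree(profiles, [1.0, 1.0])  # both genes weighted equally
```

Here `tree_sparse` equals `hierarchy`, whereas `tree_uniform` links the T cell directly to the stem cell, so the uniform weighting fails to recover the hierarchy. A sparse weight vector that zeroes out the misleading gene is the kind of solution the abstract's linear program is described as seeking.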
Shufran, K A; Mornhinweg, D W; Baker, C A; Porter, D R
2007-10-01
Biotypes are infraspecific classifications based on biological rather than morphological characteristics. Cereal aphids are managed primarily by host plant resistance, and they often develop biotypes that injure or kill previously resistant plants. Although molecular genetic variation within aphid biotypes has been well documented, little is known about phenotypic variation, especially virulence or the biotype's ability to cause injury to cultivars with specific resistance genes. Five clones (single maternal lineages) of Russian wheat aphid, Diuraphis noxia (Kurdjumov) (Homoptera: Aphididae), determined to be injurious to wheat, Triticum aestivum L., with the Dn4 gene, were evaluated on resistant and susceptible wheat and barley, Hordeum vulgare L., for their ability to cause chlorosis, reduction in plant height, and reduction in shoot dry weight. Variation to cause injury on resistant 'Halt' wheat, susceptible 'Jagger' wheat, and resistant 'STARS-9301B' barley was found among the Dn4 virulent clones. One clone caused up to 30.0 and 59.5% more reduction in plant height and shoot dry weight, respectively, on resistant Halt than other clones. It also caused up to 29.9 and 55.5% more reduction in plant height and shoot dry weight, respectively, on susceptible Jagger wheat. Although STARS-9301B barley exhibited an equal resistant response to feeding by all five clones based on chlorosis, two clones caused approximately 20% more reduction in plant height and shoot dry weight than three other clones. The most injurious clones on wheat were not the most injurious clones on barley. This is the first report of variation to cause varying degrees of plant damage within an aphid biotype virulent to a single host resistance gene. A single aphid clone may not accurately represent the true virulent nature of a biotype population in the field.
Aubry, Marc; Monnier, Annabelle; Chicault, Celine; de Tayrac, Marie; Galibert, Marie-Dominique; Burgun, Anita; Mosser, Jean
2006-01-01
Background Large-scale genomic studies based on transcriptome technologies provide clusters of genes that need to be functionally annotated. The Gene Ontology (GO) implements a controlled vocabulary organised into three hierarchies: cellular components, molecular functions and biological processes. This terminology allows a coherent and consistent description of the knowledge about gene functions. The GO terms related to genes come primarily from semi-automatic annotations made by trained biologists (annotation based on evidence) or text-mining of the published scientific literature (literature profiling). Results We report an original functional annotation method based on a combination of evidence and literature that overcomes the weaknesses and the limitations of each approach. It relies on the Gene Ontology Annotation database (GOA Human) and the PubGene biomedical literature index. We support these annotations with statistically associated GO terms and retrieve associative relations across the three GO hierarchies to emphasise the major pathways involved by a gene cluster. Both annotation methods and associative relations were quantitatively evaluated with a reference set of 7397 genes and a multi-cluster study of 14 clusters. We also validated the biological appropriateness of our hybrid method with the annotation of a single gene (cdc2) and that of a down-regulated cluster of 37 genes identified by a transcriptome study of an in vitro enterocyte differentiation model (CaCo-2 cells). Conclusion The combination of both approaches is more informative than either separate approach: literature mining can enrich an annotation based only on evidence. Text-mining of the literature can also find valuable associated MEDLINE references that confirm the relevance of the annotation. Eventually, GO terms networks can be built with associative relations in order to highlight cooperative and competitive pathways and their connected molecular functions. PMID:16674810
SynFind: Compiling Syntenic Regions across Any Set of Genomes on Demand.
Tang, Haibao; Bomhoff, Matthew D; Briones, Evan; Zhang, Liangsheng; Schnable, James C; Lyons, Eric
2015-11-11
The identification of conserved syntenic regions enables discovery of predicted locations for orthologous and homeologous genes, even when no such gene is present. This capability means that synteny-based methods are far more effective than sequence similarity-based methods in identifying true-negatives, a necessity for studying gene loss and gene transposition. However, the identification of syntenic regions requires complex analyses which must be repeated for pairwise comparisons between any two species. Therefore, as the number of published genomes increases, there is a growing demand for scalable, simple-to-use applications to perform comparative genomic analyses that cater to both gene family studies and genome-scale studies. We implemented SynFind, a web-based tool that addresses this need. Given one query genome, SynFind is capable of identifying conserved syntenic regions in any set of target genomes. SynFind is capable of reporting per-gene information, useful for researchers studying specific gene families, as well as genome-wide data sets of syntenic gene and predicted gene locations, critical for researchers focused on large-scale genomic analyses. Inference of syntenic homologs provides the basis for correlation of functional changes around genes of interest between related organisms. Deployed on the CoGe online platform, SynFind is connected to the genomic data from over 15,000 organisms from all domains of life and supports multiple releases of the same organism. SynFind makes use of a powerful job execution framework that promises scalability and reproducibility. SynFind can be accessed at http://genomevolution.org/CoGe/SynFind.pl. A video tutorial of SynFind using Phytophthora as an example is available at http://www.youtube.com/watch?v=2Agczny9Nyc. © The Author(s) 2015. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
Reference Gene Validation for RT-qPCR, a Note on Different Available Software Packages
De Spiegelaere, Ward; Dern-Wieloch, Jutta; Weigel, Roswitha; Schumacher, Valérie; Schorle, Hubert; Nettersheim, Daniel; Bergmann, Martin; Brehm, Ralph; Kliesch, Sabine; Vandekerckhove, Linos; Fink, Cornelia
2015-01-01
Background An appropriate normalization strategy is crucial for the analysis of data from real-time reverse transcription polymerase chain reactions (RT-qPCR). Identifying and validating stable reference genes is widely recommended, since no single biological gene is stably expressed between cell types or within cells under different conditions. Different algorithms exist to validate optimal reference genes for normalization. Using human cells, we here compare the three main methods to the RefFinder tool, available online, which integrates these algorithms, along with R-based software packages that include the NormFinder and geNorm algorithms. Results 14 candidate reference genes were assessed by RT-qPCR in two sample sets, i.e. a set of samples of human testicular tissue containing carcinoma in situ (CIS), and a set of samples from the human adult Sertoli cell line (FS1) either cultured alone or in co-culture with the seminoma-like cell line (TCam-2) or with equine bone marrow derived mesenchymal stem cells (eBM-MSC). Expression stabilities of the reference genes were evaluated using geNorm, NormFinder, and BestKeeper. Similar results were obtained by the three approaches for the most and least stably expressed genes. The R-based packages NormqPCR, SLqPCR and the NormFinder for R script gave identical gene rankings. Interestingly, different outputs were obtained between the original software packages and the RefFinder tool, which is based on raw Cq values for input. When the raw data were reanalysed assuming 100% efficiency for all genes, the outputs of the original software packages were similar to those of the RefFinder software, indicating that RefFinder outputs may be biased because PCR efficiencies are not taken into account. Conclusions This report shows that assay efficiency is an important parameter for reference gene validation. New software tools that incorporate these algorithms should be carefully validated prior to use. PMID:25825906
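The effect of ignoring PCR efficiency can be sketched with a common delta-Cq transformation, in which each Cq value is converted to a quantity relative to the sample with the lowest Cq. This is an illustrative Python sketch, not the code of any of the packages named above; the Cq values, the 80% efficiency, and the function name are assumptions.

```python
def relative_quantities(cq_values, efficiency):
    """Convert Cq values to quantities relative to the sample with the lowest Cq.

    `efficiency` is a fraction (1.0 means 100%), so the per-cycle
    amplification factor is 1 + efficiency.
    """
    factor = 1.0 + efficiency
    min_cq = min(cq_values)
    return [factor ** (min_cq - cq) for cq in cq_values]

# Hypothetical Cq values of one candidate reference gene in three samples.
cq = [20.0, 21.0, 22.0]

assumed_perfect = relative_quantities(cq, efficiency=1.0)  # 100% efficiency assumed
corrected = relative_quantities(cq, efficiency=0.8)        # hypothetical measured 80% efficiency

# Spread between the highest and lowest relative quantity under each assumption.
spread_perfect = max(assumed_perfect) / min(assumed_perfect)
spread_corrected = max(corrected) / min(corrected)
```

With the same raw Cq values, the apparent variation of the gene across samples is larger when 100% efficiency is assumed than when the lower efficiency is used. Because stability rankings depend on such variation, this is one way in which a tool working from raw Cq values could rank genes differently.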
50 CFR 229.37 - False Killer Whale Take Reduction Plan.
Code of Federal Regulations, 2014 CFR
2014-10-01
... Hawaii Pelagic and Hawaii Insular stocks of false killer whales in the Hawaii-based deep-set and shallow... section have the following meanings: (1) Deep-set or Deep-setting has the same meaning as the definition... this title. (c) Gear requirements. (1) While deep-setting, the owner and operator of a vessel...
50 CFR 229.37 - False Killer Whale Take Reduction Plan.
Code of Federal Regulations, 2013 CFR
2013-10-01
... Hawaii Pelagic and Hawaii Insular stocks of false killer whales in the Hawaii-based deep-set and shallow... section have the following meanings: (1) Deep-set or Deep-setting has the same meaning as the definition... this title. (c) Gear requirements. (1) While deep-setting, the owner and operator of a vessel...
Sulaiman, Irshad M.; Tang, Kevin; Osborne, John; Sammons, Scott; Wohlhueter, Robert M.
2007-01-01
We developed a set of seven resequencing GeneChips, based on the complete genome sequences of 24 strains of smallpox virus (variola virus), for rapid characterization of this human-pathogenic virus. Each GeneChip was designed to analyze a divergent segment of approximately 30,000 bases of the smallpox virus genome. This study includes the hybridization results of 14 smallpox virus strains. Of the 14 smallpox virus strains hybridized, only 7 had sequence information included in the design of the smallpox virus resequencing GeneChips; similar information for the remaining strains was not tiled as a reference in these GeneChips. By use of variola virus-specific primers and long-range PCR, 22 overlapping amplicons were amplified to cover nearly the complete genome and hybridized with the smallpox virus resequencing GeneChip set. These GeneChips were successful in generating nucleotide sequences for all 14 of the smallpox virus strains hybridized. Analysis of the data indicated that the GeneChip resequencing by hybridization was fast and reproducible and that the smallpox virus resequencing GeneChips could differentiate the 14 smallpox virus strains characterized. This study also suggests that high-density resequencing GeneChips have potential biodefense applications and may be used as an alternate tool for rapid identification of smallpox virus in the future. PMID:17182757
Ping, Yanyan; Deng, Yulan; Wang, Li; Zhang, Hongyi; Zhang, Yong; Xu, Chaohan; Zhao, Hongying; Fan, Huihui; Yu, Fulong; Xiao, Yun; Li, Xia
2015-01-01
Driver genetic aberrations collectively regulate core cellular processes underlying cancer development. However, identifying the modules of driver genetic alterations and characterizing their functional mechanisms are still major challenges for cancer studies. Here, we developed an integrative multi-omics method, CMDD, to identify driver modules and the dysregulated genes they affect by characterizing genetic alteration-induced dysregulated networks. Applied to glioblastoma (GBM), CMDD identified a core gene module of 17 genes, including seven known GBM drivers, and their dysregulated genes. The module showed significant association with shorter survival of GBM. When classifying driver genes in the module into two gene sets according to their genetic alteration patterns, we found that one gene set directly participated in the glioma pathway, while the other indirectly regulated the glioma pathway, mostly via their dysregulated genes. Both gene sets were significant contributors to survival and helpful for classifying GBM subtypes, suggesting their critical roles in GBM pathogenesis. Also, by applying CMDD to six other cancers, we identified some novel core modules associated with overall survival of patients. Together, these results demonstrate that integrative multi-omics data can identify driver modules and uncover their dysregulated genes, which is useful for interpreting the cancer genome. PMID:25653168
Combinatorial therapy discovery using mixed integer linear programming.
Pang, Kaifang; Wan, Ying-Wooi; Choi, William T; Donehower, Lawrence A; Sun, Jingchun; Pant, Dhruv; Liu, Zhandong
2014-05-15
Combinatorial therapies play increasingly important roles in combating complex diseases. Owing to the huge cost associated with experimental methods in identifying optimal drug combinations, computational approaches can provide a guide to limit the search space and reduce cost. However, few computational approaches have been developed for this purpose, and thus there is a great need for new algorithms for drug combination prediction. Here we propose to formulate the optimal combinatorial therapy problem as two complementary mathematical algorithms, Balanced Target Set Cover (BTSC) and Minimum Off-Target Set Cover (MOTSC). Given a disease gene set, BTSC seeks a balanced solution that maximizes the coverage of the disease genes and minimizes the off-target hits at the same time. MOTSC seeks full coverage of the disease gene set while minimizing the off-target set. In simulations, both BTSC and MOTSC ran much faster than exhaustive search with the same accuracy. When applied to real disease gene sets, our algorithms not only identified known drug combinations, but also predicted novel drug combinations that are worth further testing. In addition, we developed a web-based tool to allow users to iteratively search for optimal drug combinations given a user-defined gene set. Our tool is freely available for noncommercial use at http://www.drug.liuzlab.org/. Contact: zhandong.liu@bcm.edu. Supplementary data are available at Bioinformatics online.
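The MOTSC objective, full coverage of the disease genes with as few off-target genes as possible, can be illustrated with a small Python sketch. It uses exhaustive search on toy data, i.e. the baseline the abstract compares against, not the mixed integer linear programming formulation of the paper. All drug names, gene names, and the function name are hypothetical.

```python
from itertools import combinations

def min_off_target_cover(drug_targets, disease_genes):
    """Exhaustively search drug combinations that cover every disease gene
    and return the one hitting the fewest off-target genes."""
    best_combo, best_off = None, None
    drugs = sorted(drug_targets)
    for size in range(1, len(drugs) + 1):
        for combo in combinations(drugs, size):
            hit = set().union(*(drug_targets[d] for d in combo))
            if not disease_genes <= hit:
                continue  # this combination leaves some disease gene uncovered
            off = hit - disease_genes
            if best_off is None or len(off) < len(best_off):
                best_combo, best_off = combo, off
    return best_combo, best_off

# Hypothetical disease gene set (G*) and drug target sets; X* are off-target genes.
disease_genes = {"G1", "G2", "G3"}
drug_targets = {
    "drugA": {"G1", "G2", "X1", "X2"},
    "drugB": {"G3", "X3"},
    "drugC": {"G1", "X1"},
    "drugD": {"G2", "G3"},
    "drugE": {"G1", "G2", "G3", "X1", "X2", "X3", "X4"},
}

combo, off_targets = min_off_target_cover(drug_targets, disease_genes)
```

On this toy input the search selects drugC together with drugD, which cover all three disease genes while hitting a single off-target gene. The number of combinations grows exponentially with the number of drugs, which is consistent with the abstract's motivation for a faster formulation.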
Kochneva, G V; Kolosova, I V; Lupan, T A; Sivolobova, G F; Iudin, P V; Grazhdantseva, A A; Riabchikova, E I; Kandrina, N Iu; Shchelkunov, S N
2009-01-01
The mousepox (ectromelia) virus genome contains four genes encoding the kelch-like proteins EVM018, EVM027, EVM150 and EVM167. A complete set of insertion plasmids was constructed to allow the production of recombinant ectromelia viruses with targeted deletions of one to four genes of the kelch family, both individually (single mutants) and in different combinations (double, triple and quadruple mutants). It was shown that deletion of any of the three genes EVM018, EVM027 or EVM167 resulted in reduction of 50% lethal dose (LD50) by five or more orders of magnitude in outbred white mice infected intraperitoneally. Deletion of the mousepox kelch gene EVM150 did not influence the virulence of the virus. Deletion of two or more kelch genes also resulted in a high level of attenuation, which could evidently be due to the lack of the three genes EVM167, EVM018 and/or EVM027 identified as virulence factors. The local inflammatory process in the model of intradermal injection of mouse ear pinnae (vasodilatation level, hyperemia, cutaneous edema, arterial thrombosis) was significantly more intensive for the wild-type virus and the virulent mutant ΔEVM150 than for the avirulent mutant ΔEVM167.
Lee, Insuk; Li, Zhihua; Marcotte, Edward M.
2007-01-01
Background Probabilistic functional gene networks are powerful theoretical frameworks for integrating heterogeneous functional genomics and proteomics data into objective models of cellular systems. Such networks provide syntheses of millions of discrete experimental observations, spanning DNA microarray experiments, physical protein interactions, genetic interactions, and comparative genomics; the resulting networks can then be easily applied to generate testable hypotheses regarding specific gene functions and associations. Methodology/Principal Findings We report a significantly improved version (v. 2) of a probabilistic functional gene network [1] of the baker's yeast, Saccharomyces cerevisiae. We describe our optimization methods and illustrate their effects in three major areas: the reduction of functional bias in network training reference sets, the application of a probabilistic model for calculating confidences in pair-wise protein physical or genetic interactions, and the introduction of simple thresholds that eliminate many false positive mRNA co-expression relationships. Using the network, we predict and experimentally verify the function of the yeast RNA binding protein Puf6 in 60S ribosomal subunit biogenesis. Conclusions/Significance YeastNet v. 2, constructed using these optimizations together with additional data, shows significant reduction in bias and improvements in precision and recall, in total covering 102,803 linkages among 5,483 yeast proteins (95% of the validated proteome). YeastNet is available from http://www.yeastnet.org. PMID:17912365
DOE Office of Scientific and Technical Information (OSTI.GOV)
Van Nostrand, J.D.; Wu, W.-M.; Wu, L.
2009-07-15
A pilot-scale system was established for in situ biostimulation of U(VI) reduction by ethanol addition at the US Department of Energy's (DOE's) Field Research Center (Oak Ridge, TN). After achieving U(VI) reduction, stability of the bioreduced U(IV) was evaluated under conditions of (i) resting (no ethanol injection), (ii) reoxidation by introducing dissolved oxygen (DO), and (iii) reinjection of ethanol. GeoChip, a functional gene array with probes for N, S and C cycling, metal resistance and contaminant degradation genes, was used for monitoring groundwater microbial communities. High diversity of all major functional groups was observed during all experimental phases. The microbial community was extremely responsive to ethanol, showing a substantial change in community structure with increased gene number and diversity after ethanol injections resumed. While gene numbers showed considerable variations, the relative abundance (i.e. percentage of each gene category) of most gene groups changed little. During the reoxidation period, U(VI) increased, suggesting reoxidation of reduced U(IV). However, when introduction of DO was stopped, U(VI) reduction resumed and returned to pre-reoxidation levels. These findings suggest that the community in this system can be stimulated and that the ability to reduce U(VI) can be maintained by the addition of electron donors. This biostimulation approach may potentially offer an effective means for the bioremediation of U(VI)-contaminated sites.
White-Al Habeeb, Nicole M A; Garcia, Julia; Fleshner, Neil; Bapat, Bharati
2016-12-01
This study explored the biological effects of metformin on prostate cancer (PCa) cells and determined molecular pathways and epigenetic regulators implicated in its mechanism of action. We performed mRNA expression profiling in 22Rv1 cells following 2.5 mM and 5 mM metformin treatment. Genes significantly modified by metformin treatment were ranked based on altered expression, involvement with cancer-related processes, and reported dysregulation in PCa. The effects of the top ranked gene, MMSET, on the proliferative and invasive capabilities of PCa cells were investigated via siRNA knockdown alone and also combined with metformin treatment. Metformin treatment decreased cell growth of PCa cell line 22Rv1 and stalled cells at the G1/S checkpoint in a time- and dose-dependent manner, resulting in increased cells in G1 (P < 0.05) and decreased cells in S (P < 0.05) phase. Metformin activated the AMPK/mTOR signaling pathway as shown by increased p-AMPK and decreased p-p70S6K. mRNA expression profiling following metformin treatment identified significant changes in 136 chromatin-modifying genes. The top ranked gene, multiple myeloma SET domain (MMSET) showed increased expression in PCa cell lines (22Rv1 and DU145) when compared to the benign prostate epithelium-derived cell-line RWPE-1, and its expression was decreased upon metformin treatment. siRNA-mediated knockdown of MMSET showed decreased cellular migration and invasion in DU-145 cells. MMSET knockdown in combination with metformin treatment resulted in further reduction in the capacity of PCa cells to migrate and invade. These data suggest MMSET may play a role in the inhibitory effect of metformin on PCa and could serve as a potential novel therapeutic target for PCa. Prostate 76:1507-1518, 2016. © 2016 Wiley Periodicals, Inc.
A Rice PECTATE LYASE-LIKE Gene Is Required for Plant Growth and Leaf Senescence
Leng, Yujia; Yang, Yaolong; Ren, Deyong; Dai, Liping; Wang, Yuqiong; Chen, Long; Tu, Zhengjun; Gao, Yihong; Zhu, Li; Hu, Jiang; Gao, Zhenyu; Guo, Longbiao; Lin, Yongjun
2017-01-01
To better understand the molecular mechanisms behind plant growth and leaf senescence in monocot plants, we identified a mutant exhibiting dwarfism and an early-senescence leaf phenotype, termed dwarf and early-senescence leaf1 (del1). Histological analysis showed that the abnormal growth was caused by a reduction in cell number. Further investigation revealed that the decline in cell number in del1 was affected by the cell cycle. Physiological analysis, transmission electron microscopy, and TUNEL assays showed that leaf senescence was triggered by the accumulation of reactive oxygen species. The DEL1 gene was cloned using a map-based approach. It was shown to encode a pectate lyase (PEL) precursor that contains a PelC domain. DEL1 contains all the conserved residues of PEL and has strong similarity with plant PelC. DEL1 is expressed in all tissues but predominantly in elongating tissues. Functional analysis revealed that mutation of DEL1 decreased the total PEL enzymatic activity, increased the degree of methylesterified homogalacturonan, and altered the cell wall composition and structure. In addition, transcriptome assay revealed that a set of cell wall function- and senescence-related gene expression was altered in del1 plants. Our research indicates that DEL1 is involved in both the maintenance of normal cell division and the induction of leaf senescence. These findings reveal a new molecular mechanism for plant growth and leaf senescence mediated by PECTATE LYASE-LIKE genes. PMID:28455404
A Rice PECTATE LYASE-LIKE Gene Is Required for Plant Growth and Leaf Senescence.
Leng, Yujia; Yang, Yaolong; Ren, Deyong; Huang, Lichao; Dai, Liping; Wang, Yuqiong; Chen, Long; Tu, Zhengjun; Gao, Yihong; Li, Xueyong; Zhu, Li; Hu, Jiang; Zhang, Guangheng; Gao, Zhenyu; Guo, Longbiao; Kong, Zhaosheng; Lin, Yongjun; Qian, Qian; Zeng, Dali
2017-06-01
To better understand the molecular mechanisms behind plant growth and leaf senescence in monocot plants, we identified a mutant exhibiting dwarfism and an early-senescence leaf phenotype, termed dwarf and early-senescence leaf1 (del1). Histological analysis showed that the abnormal growth was caused by a reduction in cell number. Further investigation revealed that the decline in cell number in del1 was affected by the cell cycle. Physiological analysis, transmission electron microscopy, and TUNEL assays showed that leaf senescence was triggered by the accumulation of reactive oxygen species. The DEL1 gene was cloned using a map-based approach. It was shown to encode a pectate lyase (PEL) precursor that contains a PelC domain. DEL1 contains all the conserved residues of PEL and has strong similarity with plant PelC. DEL1 is expressed in all tissues but predominantly in elongating tissues. Functional analysis revealed that mutation of DEL1 decreased the total PEL enzymatic activity, increased the degree of methylesterified homogalacturonan, and altered the cell wall composition and structure. In addition, transcriptome assay revealed that a set of cell wall function- and senescence-related gene expression was altered in del1 plants. Our research indicates that DEL1 is involved in both the maintenance of normal cell division and the induction of leaf senescence. These findings reveal a new molecular mechanism for plant growth and leaf senescence mediated by PECTATE LYASE-LIKE genes. © 2017 American Society of Plant Biologists. All Rights Reserved.
Yang, Yi; Higgins, Steven A.; Yan, Jun; ...
2017-08-15
Here, organohalide-respiring bacteria play key roles in the natural chlorine cycle; however, most of the current knowledge is based on cultures from contaminated environments. We demonstrate that grape pomace compost without prior exposure to chlorinated solvents harbors a Dehalogenimonas (Dhgm) species capable of using chlorinated ethenes, including the human carcinogen and common groundwater pollutant vinyl chloride (VC), as electron acceptors. Grape pomace microcosms and derived solid-free enrichment cultures were able to dechlorinate trichloroethene (TCE) to less chlorinated daughter products including ethene. 16S rRNA gene amplicon and qPCR analyses revealed the predominance of Dhgm sequences, but no Dehalococcoides mccartyi (Dhc) biomarker genes were detected. The enumeration of Dhgm 16S rRNA genes demonstrated VC-dependent growth, and 6.55 ± 0.64 × 10^8 cells were produced per µmol of chloride released. Metagenome sequencing enabled the assembly of a Dhgm draft genome, and 52 putative reductive dehalogenase (RDase) genes were identified. Proteomic workflows identified a putative VC RDase with 49% and 56.1% amino acid similarity to the known VC RDases VcrA and BvcA, respectively. A survey of 1,173 groundwater samples collected from 111 chlorinated solvent-contaminated sites revealed that Dhgm 16S rRNA genes were frequently detected and outnumbered Dhc in 65% of the samples. Dhgm may be more relevant contributors to chlorinated solvent reductive dechlorination in contaminated aquifers than is currently recognized, and non-polluted environments are a source of strictly organohalide-respiring bacteria with novel RDase genes.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yang, Yi; Higgins, Steven A.; Yan, Jun
Here, organohalide-respiring bacteria play key roles in the natural chlorine cycle; however, most of the current knowledge is based on cultures from contaminated environments. We demonstrate that grape pomace compost without prior exposure to chlorinated solvents harbors a Dehalogenimonas (Dhgm) species capable of using chlorinated ethenes, including the human carcinogen and common groundwater pollutant vinyl chloride (VC), as electron acceptors. Grape pomace microcosms and derived solid-free enrichment cultures were able to dechlorinate trichloroethene (TCE) to less chlorinated daughter products including ethene. 16S rRNA gene amplicon and qPCR analyses revealed the predominance of Dhgm sequences, but no Dehalococcoides mccartyi (Dhc) biomarker genes were detected. The enumeration of Dhgm 16S rRNA genes demonstrated VC-dependent growth, and 6.55 ± 0.64 × 10^8 cells were produced per µmol of chloride released. Metagenome sequencing enabled the assembly of a Dhgm draft genome, and 52 putative reductive dehalogenase (RDase) genes were identified. Proteomic workflows identified a putative VC RDase with 49% and 56.1% amino acid similarity to the known VC RDases VcrA and BvcA, respectively. A survey of 1,173 groundwater samples collected from 111 chlorinated solvent-contaminated sites revealed that Dhgm 16S rRNA genes were frequently detected and outnumbered Dhc in 65% of the samples. Dhgm may be more relevant contributors to chlorinated solvent reductive dechlorination in contaminated aquifers than is currently recognized, and non-polluted environments are a source of strictly organohalide-respiring bacteria with novel RDase genes.
Synthetic and Evolutionary Construction of a Chlorate-Reducing Shewanella oneidensis MR-1
Clark, Iain C.; Melnyk, Ryan A.; Youngblut, Matthew D.; Carlson, Hans K.; Iavarone, Anthony T.
2015-01-01
Despite evidence for the prevalence of horizontal gene transfer of respiratory genes, little is known about how pathways functionally integrate within new hosts. One example of a mobile respiratory metabolism is bacterial chlorate reduction, which is frequently encoded on composite transposons. This implies that the essential components of the metabolism are encoded on these mobile elements. To test this, we heterologously expressed genes for chlorate reduction from Shewanella algae ACDC in the non-chlorate-reducing Shewanella oneidensis MR-1. The construct that ultimately endowed robust growth on chlorate included cld, a cytochrome c gene, clrABDC, and two genes of unknown function. Although strain MR-1 was unable to grow on chlorate after initial insertion of these genes into the chromosome, 11 derived strains capable of chlorate respiration were obtained through adaptive evolution. Genome resequencing indicated that all of the evolved chlorate-reducing strains replicated a large genomic region containing chlorate reduction genes. Contraction in copy number and loss of the ability to reduce chlorate were also observed, indicating that this phenomenon was extremely dynamic. Although most strains contained more than six copies of the replicated region, a single strain with less duplication also grew rapidly. This strain contained three additional mutations that we hypothesized compensated for the low copy number. We remade the mutations combinatorially in the unevolved strain and determined that a single nucleotide polymorphism (SNP) upstream of cld enabled growth on chlorate and was epistatic to a second base pair change in the NarP binding sequence between narQP and nrfA that enhanced growth. PMID:25991681
Bartsch, Georg; Mitra, Anirban P; Mitra, Sheetal A; Almal, Arpit A; Steven, Kenneth E; Skinner, Donald G; Fry, David W; Lenehan, Peter F; Worzel, William P; Cote, Richard J
2016-02-01
Due to the high recurrence risk of nonmuscle invasive urothelial carcinoma it is crucial to distinguish patients at high risk from those with indolent disease. In this study we used a machine learning algorithm to identify the genes in patients with nonmuscle invasive urothelial carcinoma at initial presentation that were most predictive of recurrence. We used the genes in a molecular signature to predict recurrence risk within 5 years after transurethral resection of bladder tumor. Whole genome profiling was performed on 112 frozen nonmuscle invasive urothelial carcinoma specimens obtained at first presentation on Human WG-6 BeadChips (Illumina®). A genetic programming algorithm was applied to evolve classifier mathematical models for outcome prediction. Cross-validation based resampling and gene use frequencies were used to identify the most prognostic genes, which were combined into rules used in a voting algorithm to predict the sample target class. Key genes were validated by quantitative polymerase chain reaction. The classifier set included 21 genes that predicted recurrence. Quantitative polymerase chain reaction was done for these genes in a subset of 100 patients. A 5-gene combined rule incorporating a voting algorithm yielded 77% sensitivity and 85% specificity to predict recurrence in the training set, and 69% and 62%, respectively, in the test set. A singular 3-gene rule was constructed that predicted recurrence with 80% sensitivity and 90% specificity in the training set, and 71% and 67%, respectively, in the test set. Using primary nonmuscle invasive urothelial carcinoma from initial occurrences genetic programming identified transcripts in reproducible fashion, which were predictive of recurrence. These findings could potentially impact nonmuscle invasive urothelial carcinoma management. Copyright © 2016 American Urological Association Education and Research, Inc. Published by Elsevier Inc. All rights reserved.
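The voting step and the reported sensitivity and specificity figures in the abstract above can be illustrated with a minimal Python sketch. Everything concrete here is an assumption for illustration: the gene names (`geneA`, `geneB`, `geneC`), the thresholds, and the toy samples are invented, and the rules are hand-written rather than evolved by genetic programming as in the study.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical rules: each maps one sample's expression values to a vote
# (True = "recurrence predicted"). Gene names and cut-offs are invented.
Rule = Callable[[Dict[str, float]], bool]

RULES: List[Rule] = [
    lambda s: s["geneA"] > 1.0,
    lambda s: s["geneB"] < 0.5,
    lambda s: s["geneC"] > s["geneA"],
]


def majority_vote(sample: Dict[str, float], rules: Sequence[Rule] = RULES) -> bool:
    """Predict recurrence when more than half of the rules vote for it."""
    votes = sum(1 for rule in rules if rule(sample))
    return votes * 2 > len(rules)


def sensitivity_specificity(
    predicted: Sequence[bool], actual: Sequence[bool]
) -> Tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    return tp / (tp + fn), tn / (tn + fp)


# Toy samples (invented) with known outcomes: True = tumor recurred.
SAMPLES = [
    {"geneA": 1.5, "geneB": 0.2, "geneC": 2.0},
    {"geneA": 0.5, "geneB": 0.9, "geneC": 0.4},
    {"geneA": 1.2, "geneB": 0.8, "geneC": 1.0},
    {"geneA": 0.8, "geneB": 0.3, "geneC": 0.9},
    {"geneA": 2.0, "geneB": 0.1, "geneC": 1.0},
    {"geneA": 0.2, "geneB": 0.7, "geneC": 0.1},
]
OUTCOMES = [True, False, True, False, True, False]

PREDICTIONS = [majority_vote(s) for s in SAMPLES]
SENS, SPEC = sensitivity_specificity(PREDICTIONS, OUTCOMES)
```

Computing the two metrics separately on a training set and a held-out test set, as the abstract reports, is what exposes the drop in performance between the two.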
Tan, Joon Liang; Khang, Tsung Fei; Ngeow, Yun Fong; Choo, Siew Woh
2013-12-13
Mycobacterium abscessus is a rapidly growing mycobacterium that is often associated with human infections. The taxonomy of this species has undergone several revisions and is still being debated. In this study, we sequenced the genomes of 12 M. abscessus strains and used phylogenomic analysis to perform subspecies classification. A data mining approach was used to rank and select informative genes based on the relative entropy metric for the construction of a phylogenetic tree. The resulting tree topology was similar to that generated using the concatenation of five classical housekeeping genes: rpoB, hsp65, secA, recA and sodA. Additional support for the reliability of the subspecies classification came from the analysis of erm41 and ITS gene sequences, single nucleotide polymorphisms (SNPs)-based classification and strain clustering demonstrated by a variable number tandem repeat (VNTR) assay and a multilocus sequence analysis (MLSA). We subsequently found that the concatenation of a minimal set of three median-ranked genes: DNA polymerase III subunit alpha (polC), 4-hydroxy-2-ketovalerate aldolase (Hoa) and cell division protein FtsZ (ftsZ), is sufficient to recover the same tree topology. PCR assays designed specifically for these genes showed that all three genes could be amplified in the reference strain of M. abscessus ATCC 19977T. This study provides proof of concept that whole-genome sequence-based data mining approach can provide confirmatory evidence of the phylogenetic informativeness of existing markers, as well as lead to the discovery of a more economical and informative set of markers that produces similar subspecies classification in M. abscessus. The systematic procedure used in this study to choose the informative minimal set of gene markers can potentially be applied to species or subspecies classification of other bacteria.
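The abstract above does not spell out how relative entropy was computed, so the following Python sketch shows only one plausible reading: score each gene by the mean per-column relative entropy (Kullback-Leibler divergence, in bits) of the observed nucleotide frequencies against a background distribution. The uniform background, the function names, the toy alignments, and the descending sort are all assumptions, not the authors' procedure.

```python
import math
from collections import Counter
from typing import Dict, List, Optional, Sequence

UNIFORM_BACKGROUND = {base: 0.25 for base in "ACGT"}  # assumed background


def column_relative_entropy(
    column: Sequence[str], background: Dict[str, float]
) -> float:
    """KL divergence (bits) of one alignment column against the background."""
    counts = Counter(column)
    n = len(column)
    return sum(
        (c / n) * math.log2((c / n) / background[base])
        for base, c in counts.items()
    )


def gene_score(
    alignment: Sequence[str], background: Optional[Dict[str, float]] = None
) -> float:
    """Mean per-column relative entropy over an aligned set of sequences."""
    background = background or UNIFORM_BACKGROUND
    scores = [column_relative_entropy(col, background) for col in zip(*alignment)]
    return sum(scores) / len(scores)


def rank_genes(alignments: Dict[str, Sequence[str]]) -> List[str]:
    """Gene names sorted by score, highest first (sort direction is a choice)."""
    return sorted(alignments, key=lambda g: gene_score(alignments[g]), reverse=True)


# Toy alignments (invented): one row per strain, one column per site.
TOY_ALIGNMENTS = {
    "conserved": ["ACGT", "ACGT", "ACGT", "ACGT"],
    "mixed": ["AC", "AC", "AG", "AG"],
    "variable": ["AAAA", "CCCC", "GGGG", "TTTT"],
}
RANKING = rank_genes(TOY_ALIGNMENTS)
```

Against a uniform background a fully conserved column scores 2 bits and a maximally variable one scores 0, so which end of the ranking counts as "informative" depends on the question being asked; the abstract reports that median-ranked genes were sufficient to recover the tree topology.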
Kumar, Gulshan; Rattan, Usha Kumari; Singh, Anil Kumar
2016-01-01
Winter dormancy is a well-known mechanism adopted by temperate plants to mitigate the chilling temperatures of winter. However, acquisition of sufficient chilling during winter dormancy ensures normal phenological traits in the subsequent growing period. Thus, low temperature appears to play crucial roles in the growth and development of temperate plants. Apple, being an important temperate fruit crop, also requires sufficient chilling to release winter dormancy and achieve normal phenological traits, which are often associated with yield and quality of fruits. DNA cytosine methylation is one of the important epigenetic modifications that remarkably affect gene expression during various developmental and adaptive processes. In the present study, methylation sensitive amplified polymorphism (MSAP) was employed to assess the changes in cytosine methylation during dormancy, active growth and fruit set in apple, under differential chilling conditions. Under high chill conditions, total methylation decreased from 27.2% in dormant bud to 21.0% in fruit set stage, while no significant reduction was found under low chill conditions. Moreover, demethylation was found to be decreased, while methylation increased, from dormant bud to fruit set stage under low chill as compared to high chill conditions. In addition, RNA-Seq analysis showed high expression of DNA methyltransferases and histone methyltransferases during dormancy and fruit set, and low expression of DNA glycosylases during active growth under low chill conditions, which was in accordance with changes in methylation patterns. The RNA-Seq data of 47 genes associated with MSAP fragments involved in cellular metabolism, stress response, antioxidant system and transcriptional regulation showed correlation between methylation and their expression. Similarly, bisulfite sequencing and qRT-PCR analysis of selected genes also showed correlation between gene body methylation and gene expression. Moreover, significant association between chilling and methylation changes was observed, which suggested that chilling acquisition during dormancy in apple is likely to affect the epigenetic regulation through DNA methylation. PMID:26901339
Balow, James E; Ryan, John G; Chae, Jae Jin; Booty, Matthew G; Bulua, Ariel; Stone, Deborah; Sun, Hong-Wei; Greene, James; Barham, Beverly; Goldbach-Mansky, Raphaela; Kastner, Daniel L; Aksentijevich, Ivona
2013-06-01
To analyse gene expression patterns and to define a specific gene expression signature in patients with the severe end of the spectrum of cryopyrin-associated periodic syndromes (CAPS). The molecular consequences of interleukin 1 inhibition were examined by comparing gene expression patterns in 16 CAPS patients before and after treatment with anakinra. We collected peripheral blood mononuclear cells from 22 CAPS patients with active disease and from 14 healthy children. Transcripts that passed stringent filtering criteria (p values ≤ false discovery rate 1%) were considered as differentially expressed genes (DEG). A set of DEG was validated by quantitative reverse transcription PCR and functional studies with primary cells from CAPS patients and healthy controls. We used 17 CAPS and 66 non-CAPS patient samples to create a set of gene expression models that differentiates CAPS patients from controls and from patients with other autoinflammatory conditions. Many DEG include transcripts related to the regulation of innate and adaptive immune responses, oxidative stress, cell death, cell adhesion and motility. A set of gene expression-based models comprising the CAPS-specific gene expression signature correctly classified all 17 samples from an independent dataset. This classifier also correctly identified 15 of 16 post-anakinra CAPS samples despite the fact that these CAPS patients were in clinical remission. We identified a gene expression signature that clearly distinguished CAPS patients from controls. A number of DEG were in common with other systemic inflammatory diseases such as systemic onset juvenile idiopathic arthritis. The CAPS-specific gene expression classifiers also suggest incomplete suppression of inflammation at low doses of anakinra.
Balow, James E; Ryan, John G; Chae, Jae Jin; Booty, Matthew G; Bulua, Ariel; Stone, Deborah; Sun, Hong-Wei; Greene, James; Barham, Beverly; Goldbach-Mansky, Raphaela; Kastner, Daniel L; Aksentijevich, Ivona
2014-01-01
Objective To analyse gene expression patterns and to define a specific gene expression signature in patients with the severe end of the spectrum of cryopyrin-associated periodic syndromes (CAPS). The molecular consequences of interleukin 1 inhibition were examined by comparing gene expression patterns in 16 CAPS patients before and after treatment with anakinra. Methods We collected peripheral blood mononuclear cells from 22 CAPS patients with active disease and from 14 healthy children. Transcripts that passed stringent filtering criteria (p values ≤ false discovery rate 1%) were considered as differentially expressed genes (DEG). A set of DEG was validated by quantitative reverse transcription PCR and functional studies with primary cells from CAPS patients and healthy controls. We used 17 CAPS and 66 non-CAPS patient samples to create a set of gene expression models that differentiates CAPS patients from controls and from patients with other autoinflammatory conditions. Results Many DEG include transcripts related to the regulation of innate and adaptive immune responses, oxidative stress, cell death, cell adhesion and motility. A set of gene expression-based models comprising the CAPS-specific gene expression signature correctly classified all 17 samples from an independent dataset. This classifier also correctly identified 15 of 16 postanakinra CAPS samples despite the fact that these CAPS patients were in clinical remission. Conclusions We identified a gene expression signature that clearly distinguished CAPS patients from controls. A number of DEG were in common with other systemic inflammatory diseases such as systemic onset juvenile idiopathic arthritis. The CAPS-specific gene expression classifiers also suggest incomplete suppression of inflammation at low doses of anakinra. PMID:23223423
The promises and pitfalls of RNA-interference-based therapeutics
Castanotto, Daniela; Rossi, John J.
2009-01-01
The discovery that gene expression can be controlled by the Watson–Crick base-pairing of small RNAs with messenger RNAs containing complementary sequence — a process known as RNA interference — has markedly advanced our understanding of eukaryotic gene regulation and function. The ability of short RNA sequences to modulate gene expression has provided a powerful tool with which to study gene function and is set to revolutionize the treatment of disease. Remarkably, despite being just one decade from its discovery, the phenomenon is already being used therapeutically in human clinical trials, and biotechnology companies that focus on RNA-interference-based therapeutics are already publicly traded. PMID:19158789
Grishok, Alla; Hoersch, Sebastian; Sharp, Phillip A
2008-12-23
In Caenorhabditis elegans, a vast number of endogenous short RNAs corresponding to thousands of genes have been discovered recently. This finding suggests that these short interfering RNAs (siRNAs) may contribute to regulation of many developmental and other signaling pathways in addition to silencing viruses and transposons. Here, we present a microarray analysis of gene expression in RNA interference (RNAi)-related mutants rde-4, zfp-1, and alg-1 and the retinoblastoma (Rb) mutant lin-35. We found that a component of Dicer complex RDE-4 and a chromatin-related zinc finger protein ZFP-1, not implicated in endogenous RNAi, regulate overlapping sets of genes. Notably, genes a) up-regulated in the rde-4 and zfp-1 mutants and b) up-regulated in the lin-35(Rb) mutant, but not the down-regulated genes are highly represented in the set of genes with corresponding endogenous siRNAs (endo-siRNAs). Our study suggests that endogenous siRNAs cooperate with chromatin factors, either C. elegans ortholog of acute lymphoblastic leukemia-1 (ALL-1)-fused gene from chromosome 10 (AF10), ZFP-1, or tumor suppressor Rb, to regulate overlapping sets of genes and predicts a large role for RNAi-based chromatin silencing in control of gene expression in C. elegans.
Grishok, Alla; Hoersch, Sebastian; Sharp, Phillip A.
2008-01-01
In Caenorhabditis elegans, a vast number of endogenous short RNAs corresponding to thousands of genes have been discovered recently. This finding suggests that these short interfering RNAs (siRNAs) may contribute to regulation of many developmental and other signaling pathways in addition to silencing viruses and transposons. Here, we present a microarray analysis of gene expression in RNA interference (RNAi)-related mutants rde-4, zfp-1, and alg-1 and the retinoblastoma (Rb) mutant lin-35. We found that a component of Dicer complex RDE-4 and a chromatin-related zinc finger protein ZFP-1, not implicated in endogenous RNAi, regulate overlapping sets of genes. Notably, genes a) up-regulated in the rde-4 and zfp-1 mutants and b) up-regulated in the lin-35(Rb) mutant, but not the down-regulated genes are highly represented in the set of genes with corresponding endogenous siRNAs (endo-siRNAs). Our study suggests that endogenous siRNAs cooperate with chromatin factors, either C. elegans ortholog of acute lymphoblastic leukemia-1 (ALL-1)-fused gene from chromosome 10 (AF10), ZFP-1, or tumor suppressor Rb, to regulate overlapping sets of genes and predicts a large role for RNAi-based chromatin silencing in control of gene expression in C. elegans. PMID:19073934
Gui, Jiang; Andrew, Angeline S.; Andrews, Peter; Nelson, Heather M.; Kelsey, Karl T.; Karagas, Margaret R.; Moore, Jason H.
2010-01-01
Epistasis or gene-gene interaction is a fundamental component of the genetic architecture of complex traits such as disease susceptibility. Multifactor dimensionality reduction (MDR) was developed as a nonparametric and model-free method to detect epistasis when there are no significant marginal genetic effects. However, in many studies of complex disease, other covariates like age of onset and smoking status could have a strong main effect and may potentially interfere with MDR's ability to achieve its goal. In this paper, we present a simple and computationally efficient sampling method to adjust for covariate effects in MDR. We use simulation to show that after adjustment, MDR has sufficient power to detect true gene-gene interactions. We also compare our method with the state-of-art technique in covariate adjustment. The results suggest that our proposed method performs similarly, but is more computationally efficient. We then apply this new method to an analysis of a population-based bladder cancer study in New Hampshire. PMID:20924193
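For readers unfamiliar with MDR, the core pooling step can be sketched in a few lines of Python. This is a simplified illustration of standard MDR only: it omits cross-validation and does not implement the covariate-adjusting sampling method the abstract proposes. The function names, the 0/1 genotype coding, and the toy data are assumptions.

```python
from collections import defaultdict
from itertools import combinations
from typing import List, Sequence, Set, Tuple

Genotypes = Sequence[Sequence[int]]  # one row per subject, one column per locus


def high_risk_cells(
    genotypes: Genotypes, labels: Sequence[int], loci: Tuple[int, ...]
) -> Set[Tuple[int, ...]]:
    """Label a multi-locus genotype cell high-risk when its case:control
    ratio exceeds the overall case:control ratio in the data."""
    n_cases = sum(labels)
    threshold = n_cases / (len(labels) - n_cases)
    counts = defaultdict(lambda: [0, 0])  # cell -> [controls, cases]
    for row, y in zip(genotypes, labels):
        counts[tuple(row[i] for i in loci)][y] += 1
    return {
        cell
        for cell, (controls, cases) in counts.items()
        if controls == 0 or cases / controls > threshold
    }


def balanced_accuracy(
    genotypes: Genotypes, labels: Sequence[int], loci: Tuple[int, ...]
) -> float:
    """Score the pooled high/low-risk variable as a one-dimensional classifier."""
    risky = high_risk_cells(genotypes, labels, loci)
    preds = [tuple(row[i] for i in loci) in risky for row in genotypes]
    tp = sum(1 for p, y in zip(preds, labels) if p and y == 1)
    tn = sum(1 for p, y in zip(preds, labels) if not p and y == 0)
    sensitivity = tp / sum(labels)
    specificity = tn / (len(labels) - sum(labels))
    return (sensitivity + specificity) / 2


def best_pair(genotypes: Genotypes, labels: Sequence[int]) -> Tuple[int, ...]:
    """Exhaustively search all two-locus combinations."""
    n_loci = len(genotypes[0])
    return max(
        combinations(range(n_loci), 2),
        key=lambda loci: balanced_accuracy(genotypes, labels, loci),
    )


# Toy data (invented): disease status is the XOR of loci 0 and 1, so neither
# locus has a marginal effect; locus 2 is noise. 1 = case, 0 = control.
TOY_GENOTYPES: List[Tuple[int, int, int]] = [
    (0, 1, 0), (1, 0, 1), (0, 1, 1), (1, 0, 0),
    (0, 0, 0), (1, 1, 1), (0, 0, 1), (1, 1, 0),
]
TOY_LABELS = [1, 1, 1, 1, 0, 0, 0, 0]
BEST = best_pair(TOY_GENOTYPES, TOY_LABELS)
```

The toy data shows why the pooling matters: each locus alone is uninformative, yet the pooled two-locus variable separates cases from controls. A strong covariate main effect could distort the per-cell case:control ratios, which is the problem the adjustment described in the abstract targets.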
Gruel, Jérémy; LeBorgne, Michel; LeMeur, Nolwenn; Théret, Nathalie
2011-09-12
Regulation of gene expression plays a pivotal role in cellular functions. However, understanding the dynamics of transcription remains a challenging task. A host of computational approaches have been developed to identify regulatory motifs, mainly based on the recognition of DNA sequences for transcription factor binding sites. Recent integration of additional data from genomic analyses or phylogenetic footprinting has significantly improved these methods. Here, we propose a different approach based on the compilation of Simple Shared Motifs (SSM), groups of sequences defined by their length and similarity and present in conserved sequences of gene promoters. We developed an original algorithm to search and count SSM in pairs of genes. An exceptional number of SSM is considered as a common regulatory pattern. The SSM approach is applied to a sample set of genes and validated using functional gene-set enrichment analyses. We demonstrate that the SSM approach selects genes that are over-represented in specific biological categories (Ontology and Pathways) and are enriched in co-expressed genes. Finally we show that genes co-expressed in the same tissue or involved in the same biological pathway have increased SSM values. Using unbiased clustering of genes, Simple Shared Motifs analysis constitutes an original contribution to provide a clearer definition of expression networks.
2011-01-01
Background Regulation of gene expression plays a pivotal role in cellular functions. However, understanding the dynamics of transcription remains a challenging task. A host of computational approaches have been developed to identify regulatory motifs, mainly based on the recognition of DNA sequences for transcription factor binding sites. Recent integration of additional data from genomic analyses or phylogenetic footprinting has significantly improved these methods. Results Here, we propose a different approach based on the compilation of Simple Shared Motifs (SSM), groups of sequences defined by their length and similarity and present in conserved sequences of gene promoters. We developed an original algorithm to search and count SSM in pairs of genes. An exceptional number of SSM is considered as a common regulatory pattern. The SSM approach is applied to a sample set of genes and validated using functional gene-set enrichment analyses. We demonstrate that the SSM approach selects genes that are over-represented in specific biological categories (Ontology and Pathways) and are enriched in co-expressed genes. Finally we show that genes co-expressed in the same tissue or involved in the same biological pathway have increased SSM values. Conclusions Using unbiased clustering of genes, Simple Shared Motifs analysis constitutes an original contribution to provide a clearer definition of expression networks. PMID:21910886
Liu, Rong; Guo, Cheng-Xian; Zhou, Hong-Hao
2015-01-01
This study aims to identify effective gene networks and prognostic biomarkers associated with estrogen receptor positive (ER+) breast cancer using human mRNA studies. Weighted gene coexpression network analysis was performed with a complex ER+ breast cancer transcriptome to investigate the function of networks and key genes in the prognosis of breast cancer. We found a significant correlation of an expression module with distant metastasis-free survival (HR = 2.25; 95% CI = 1.03-4.88 in discovery set; HR = 1.78; 95% CI = 1.07-2.93 in validation set). This module contained genes enriched in the biological process of the M phase. From this module, we further identified and validated 5 hub genes (CDK1, DLGAP5, MELK, NUSAP1, and RRM2), the expression levels of which were strongly associated with poor survival. Highly expressed MELK indicated poor survival in luminal A and luminal B breast cancer molecular subtypes. This gene was also found to be associated with tamoxifen resistance. Results indicated that a network-based approach may facilitate the discovery of biomarkers for the prognosis of ER+ breast cancer and may also be used as a basis for establishing personalized therapies. Nevertheless, before the application of this approach in clinical settings, in vivo and in vitro experiments and multi-center randomized controlled clinical trials are still needed.
Chen, Rui; Davis, Lea K; Guter, Stephen; Wei, Qiang; Jacob, Suma; Potter, Melissa H; Cox, Nancy J; Cook, Edwin H; Sutcliffe, James S; Li, Bingshan
2017-01-01
Autism spectrum disorder (ASD) is one of the most highly heritable neuropsychiatric disorders, but underlying molecular mechanisms are still unresolved due to extreme locus heterogeneity. Leveraging meaningful endophenotypes or biomarkers may be an effective strategy to reduce heterogeneity to identify novel ASD genes. Numerous lines of evidence suggest a link between hyperserotonemia, i.e., elevated serotonin (5-hydroxytryptamine or 5-HT) in whole blood, and ASD. However, the genetic determinants of blood 5-HT level and their relationship to ASD are largely unknown. In this study, pursuing the hypothesis that de novo variants (DNVs) and rare risk alleles acting in a recessive mode may play an important role in predisposition of hyperserotonemia in people with ASD, we carried out whole exome sequencing (WES) in 116 ASD parent-proband trios with most (107) probands having 5-HT measurements. Combined with published ASD DNVs, we identified USP15 as having recurrent de novo loss of function mutations and discovered evidence supporting two other known genes with recurrent DNVs ( FOXP1 and KDM5B ). Genes harboring functional DNVs significantly overlap with functional/disease gene sets known to be involved in ASD etiology, including FMRP targets and synaptic formation and transcriptional regulation genes. We grouped the probands into High-5HT and Normal-5HT groups based on normalized serotonin levels, and used network-based gene set enrichment analysis (NGSEA) to identify novel hyperserotonemia-related ASD genes based on LoF and missense DNVs. We found enrichment in the High-5HT group for a gene network module (DAWN-1) previously implicated in ASD, and this points to the TGF-β pathway and cell junction processes. Through analysis of rare recessively acting variants (RAVs), we also found that rare compound heterozygotes (CHs) in the High-5HT group were enriched for loci in an ASD-associated gene set. 
Finally, we carried out rare variant group-wise transmission disequilibrium tests (gTDT) and observed significant association of rare variants in genes encoding a subset of the serotonin pathway with ASD. Our study identified USP15 as a novel gene implicated in ASD based on recurrent DNVs. It also demonstrates the potential value of 5-HT as an effective endophenotype for gene discovery in ASD, and the effectiveness of this strategy needs to be further explored in studies of larger sample sizes.
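As a rough illustration of the transmission disequilibrium idea mentioned in the abstract above, the sketch below computes the classic single-variant TDT statistic from counts of transmitted and untransmitted alleles in heterozygous parents. It is not the group-wise gTDT used in the study; the function name and the counts are assumptions made for illustration only.

```python
import math

def tdt_statistic(transmitted, untransmitted):
    """Classic single-variant TDT (a McNemar-type test).

    transmitted / untransmitted: number of times heterozygous parents
    passed on, or did not pass on, the allele of interest.
    Returns the chi-square statistic (1 degree of freedom) and its p-value.
    """
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("no informative transmissions")
    chi2 = (transmitted - untransmitted) ** 2 / total
    # For 1 degree of freedom, P(chi-square > x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts, not taken from the study.
chi2, p_value = tdt_statistic(transmitted=30, untransmitted=10)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4g}")
```

A group-wise test would aggregate such evidence over the rare variants in a gene or pathway before testing; that aggregation step is deliberately left out here.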
Ooka, Hideshi; Hashimoto, Kazuhito; Nakamura, Ryuhei
2018-05-14
Understanding the design strategy of photosynthetic and respiratory enzymes is important to develop efficient artificial catalysts for oxygen evolution and reduction reactions. Here, based on a bioinformatic analysis of cyanobacterial oxygen evolution and reduction enzymes (photosystem II: PS II and cytochrome c oxidase: COX, respectively), the gene encoding the catalytic D1 subunit of PS II was found to be expressed individually across 38 phylogenetically diverse strains, which is in contrast to the operon structure of the genes encoding major COX subunits. Selective synthesis of the D1 subunit minimizes the repair cost of PS II, which allows compensation for its instability by lowering the turnover number required to generate a net positive energy yield. The different bioenergetics observed between PS II and COX suggest that in addition to the catalytic activity rationalized by the Sabatier principle, stability factors have also provided a major influence on the design strategy of biological multi-electron transfer enzymes. © 2018 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim.
Lacosamide improves outcome in a murine model of traumatic brain injury.
Wang, Bo; Dawson, Hana; Wang, Haichen; Kernagis, Dawn; Kolls, Brad J; Yao, Lucy; Laskowitz, Daniel T
2013-08-01
Use of antiepileptic drugs (AEDs) is common in the neurocritical care setting. However, there remains a great deal of controversy regarding the optimal agent. Studies associating the prophylactic use of AEDs with poor outcomes are heavily biased by the prevalent use of phenytoin, an agent highly associated with deleterious effects. In the current study, we evaluate lacosamide for neuroprotective properties in a murine model of closed head injury. Mice were subjected to moderate closed head injury using a pneumatic impactor, and then treated with either low-dose (6 mg/kg) or high-dose (30 mg/kg) lacosamide or vehicle at 30 min post-injury, and twice daily for 3 days after injury. Motor and cognitive functional assessments were performed following injury using rotarod and Morris Water Maze, respectively. Neuronal injury and microglial activation were measured by Fluoro-Jade B, NeuN, and F4/80 staining at 1 and 7 days post-injury. Timm's staining was also performed to assess lacosamide effects on mossy fiber axonal sprouting. To evaluate possible mechanisms of lacosamide effects on the inflammatory response to injury, an RNA expression array was used to evaluate alterations in differential gene expression patterns in injured mice following lacosamide or vehicle treatments. High-dose lacosamide was associated with improved functional outcome on both the rotarod and Morris Water Maze. High-dose lacosamide was also associated with a reduction of neuronal injury at 24 h post-injury. However, the reduction in neuronal loss observed early did not result in greater neuronal density at 31 days post-injury based on unbiased stereology of NeuN staining. High-dose lacosamide was also associated with a significant reduction in microglial activation at 7 days post-injury. The therapeutic effects of lacosamide are associated with a delay in injury-related changes in RNA expression of a subset of inflammatory mediator genes typically seen at 24 h post-injury.
Administration of lacosamide improves functional performance, and reduces histological evidence of acute neuronal injury and neuroinflammation in a murine model of closed head injury. Lacosamide effects appear to be mediated via a reduction or delay in the acute inflammatory response to injury. Prior clinical and animal studies have found antiepileptic treatment following injury to be detrimental, though these studies are biased by the common use of older medications such as phenytoin. Our current results as well as prior work on levetiracetam suggest the newer AEDs may be beneficial in the setting of acute brain injury.
2012-01-01
Background Gene Set Analysis (GSA) has proven to be a useful approach to microarray analysis. However, most of the method development for GSA has focused on the statistical tests to be used rather than on the generation of sets that will be tested. Existing methods of set generation are often overly simplistic. The creation of sets from individual pathways (in isolation) is a poor reflection of the complexity of the underlying metabolic network. We have developed a novel approach to set generation via the use of Principal Component Analysis of the Laplacian matrix of a metabolic network. We have analysed a relatively simple data set to show the difference in results between our method and the current state-of-the-art pathway-based sets. Results The sets generated with this method are semi-exhaustive and capture much of the topological complexity of the metabolic network. The semi-exhaustive nature of this method has also allowed us to design a hypergeometric enrichment test to determine which genes are likely responsible for set significance. We show that our method finds significant aspects of biology that would be missed (i.e. false negatives) and addresses the false positive rates found with the use of simple pathway-based sets. Conclusions The set generation step for GSA is often neglected but is a crucial part of the analysis as it defines the full context for the analysis. As such, set generation methods should be robust and yield as complete a representation of the extant biological knowledge as possible. The method reported here achieves this goal and is demonstrably superior to previous set analysis methods. PMID:22876834
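The abstract above mentions a hypergeometric enrichment test for finding which genes drive set significance. A minimal sketch of a generic one-sided hypergeometric test follows; the function name, parameter names, and example numbers are assumptions for illustration, and the authors' exact test design is not given in the abstract.

```python
from math import comb

def hypergeom_enrichment_p(overlap, universe, set_size, hits):
    """One-sided hypergeometric p-value P(X >= overlap).

    universe: total number of genes considered
    set_size: number of genes in the set being tested
    hits:     number of genes flagged as significant
    overlap:  number of flagged genes that fall inside the set
    """
    upper = min(set_size, hits)
    tail = sum(
        comb(set_size, i) * comb(universe - set_size, hits - i)
        for i in range(overlap, upper + 1)
    )
    return tail / comb(universe, hits)

# Hypothetical example: 1000 genes, a set of 40, 50 flagged genes, 8 in the set.
p = hypergeom_enrichment_p(overlap=8, universe=1000, set_size=40, hits=50)
print(f"enrichment p-value: {p:.3g}")
```

Exact integer arithmetic via `math.comb` avoids overflow for moderately sized universes; for genome-scale inputs a log-space or library implementation may be preferable.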
Hettne, Kristina M; Boorsma, André; van Dartel, Dorien A M; Goeman, Jelle J; de Jong, Esther; Piersma, Aldert H; Stierum, Rob H; Kleinjans, Jos C; Kors, Jan A
2013-01-29
Availability of chemical response-specific lists of genes (gene sets) for pharmacological and/or toxic effect prediction for compounds is limited. We hypothesize that more gene sets can be created by next-generation text mining (next-gen TM), and that these can be used with gene set analysis (GSA) methods for chemical treatment identification, for pharmacological mechanism elucidation, and for comparing compound toxicity profiles. We created 30,211 chemical response-specific gene sets for human and mouse by next-gen TM, and derived 1,189 (human) and 588 (mouse) gene sets from the Comparative Toxicogenomics Database (CTD). We tested for significant differential expression (SDE) (false discovery rate-corrected p-values < 0.05) of the next-gen TM-derived gene sets and the CTD-derived gene sets in gene expression (GE) data sets of five chemicals (from experimental models). We tested for SDE of gene sets for six fibrates in a peroxisome proliferator-activated receptor alpha (PPARA) knock-out GE dataset and compared to results from the Connectivity Map. We tested for SDE of 319 next-gen TM-derived gene sets for environmental toxicants in three GE data sets of triazoles, and tested for SDE of 442 gene sets associated with embryonic structures. We compared the gene sets to triazole effects seen in the Whole Embryo Culture (WEC), and used principal component analysis (PCA) to discriminate triazoles from other chemicals. Next-gen TM-derived gene sets matching the chemical treatment were significantly altered in three GE data sets, and the corresponding CTD-derived gene sets were significantly altered in five GE data sets. Six next-gen TM-derived and four CTD-derived fibrate gene sets were significantly altered in the PPARA knock-out GE dataset. None of the fibrate signatures in cMap scored significant against the PPARA GE signature. 33 environmental toxicant gene sets were significantly altered in the triazole GE data sets.
21 of these toxicants had a toxicity pattern similar to that of the triazoles. We confirmed embryotoxic effects, and discriminated triazoles from other chemicals. Gene set analysis with next-gen TM-derived chemical response-specific gene sets is a scalable method for identifying similarities in gene responses to other chemicals, from which one may infer potential mode of action and/or toxic effect.
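The abstract above calls gene sets significant when their false discovery rate-corrected p-values fall below 0.05. The abstract does not say which correction was applied; the sketch below assumes the widely used Benjamini-Hochberg step-up procedure, and the function name and p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical raw p-values for four gene sets.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
significant = [q < 0.05 for q in adj]
print(adj, significant)
```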
2013-01-01
Background Availability of chemical response-specific lists of genes (gene sets) for pharmacological and/or toxic effect prediction for compounds is limited. We hypothesize that more gene sets can be created by next-generation text mining (next-gen TM), and that these can be used with gene set analysis (GSA) methods for chemical treatment identification, for pharmacological mechanism elucidation, and for comparing compound toxicity profiles. Methods We created 30,211 chemical response-specific gene sets for human and mouse by next-gen TM, and derived 1,189 (human) and 588 (mouse) gene sets from the Comparative Toxicogenomics Database (CTD). We tested for significant differential expression (SDE) (false discovery rate-corrected p-values < 0.05) of the next-gen TM-derived gene sets and the CTD-derived gene sets in gene expression (GE) data sets of five chemicals (from experimental models). We tested for SDE of gene sets for six fibrates in a peroxisome proliferator-activated receptor alpha (PPARA) knock-out GE dataset and compared to results from the Connectivity Map. We tested for SDE of 319 next-gen TM-derived gene sets for environmental toxicants in three GE data sets of triazoles, and tested for SDE of 442 gene sets associated with embryonic structures. We compared the gene sets to triazole effects seen in the Whole Embryo Culture (WEC), and used principal component analysis (PCA) to discriminate triazoles from other chemicals. Results Next-gen TM-derived gene sets matching the chemical treatment were significantly altered in three GE data sets, and the corresponding CTD-derived gene sets were significantly altered in five GE data sets. Six next-gen TM-derived and four CTD-derived fibrate gene sets were significantly altered in the PPARA knock-out GE dataset. None of the fibrate signatures in cMap scored significant against the PPARA GE signature. 33 environmental toxicant gene sets were significantly altered in the triazole GE data sets.
21 of these toxicants had a toxicity pattern similar to that of the triazoles. We confirmed embryotoxic effects, and discriminated triazoles from other chemicals. Conclusions Gene set analysis with next-gen TM-derived chemical response-specific gene sets is a scalable method for identifying similarities in gene responses to other chemicals, from which one may infer potential mode of action and/or toxic effect. PMID:23356878
Bonfiglio, F; Henström, M; Nag, A; Hadizadeh, F; Zheng, T; Cenit, M C; Tigchelaar, E; Williams, F; Reznichenko, A; Ek, W E; Rivera, N V; Homuth, G; Aghdassi, A A; Kacprowski, T; Männikkö, M; Karhunen, V; Bujanda, L; Rafter, J; Wijmenga, C; Ronkainen, J; Hysi, P; Zhernakova, A; D'Amato, M
2018-04-19
Irritable bowel syndrome (IBS) shows genetic predisposition; however, large-scale, powered gene mapping studies are lacking. We sought to exploit existing genetic (genotype) and epidemiological (questionnaire) data from a series of population-based cohorts for IBS genome-wide association studies (GWAS) and their meta-analysis. Based on questionnaire data compatible with Rome III Criteria, we identified a total of 1335 IBS cases and 9768 asymptomatic individuals from 5 independent European genotyped cohorts. Individual GWAS were carried out with sex-adjusted logistic regression under an additive model, followed by meta-analysis using the inverse variance method. Functional annotation of significant results was obtained via a computational pipeline exploiting ontology and interaction networks, and tissue-specific and gene set enrichment analyses. Suggestive GWAS signals (P ≤ 5.0 × 10⁻⁶) were detected for 7 genomic regions, harboring 64 candidate genes that may affect IBS risk via functional or expression changes. Functional annotation of this gene set convincingly (best FDR-corrected P = 3.1 × 10⁻¹⁰) highlighted regulation of ion channel activity as the most plausible pathway affecting IBS risk. Our results confirm the feasibility of population-based studies for gene-discovery efforts in IBS, identify risk genes and loci to be prioritized in independent follow-ups, and pinpoint ion channels as important players and potential therapeutic targets warranting further investigation. © 2018 John Wiley & Sons Ltd.
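The inverse variance method named in the abstract above pools per-cohort effect estimates by weighting each with the reciprocal of its squared standard error. The sketch below shows the fixed-effect version for a single SNP; the function name and the per-cohort numbers are hypothetical and are not values from the study.

```python
import math

def inverse_variance_meta(betas, ses):
    """Fixed-effect inverse-variance pooling of per-cohort estimates.

    betas: per-cohort effect sizes (e.g. log odds ratios)
    ses:   matching standard errors
    Returns pooled effect, pooled standard error, z-score, two-sided p-value.
    """
    weights = [1.0 / se ** 2 for se in ses]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled_beta / pooled_se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, z, p_value

# Hypothetical log odds ratios and standard errors from five cohorts.
betas = [0.20, 0.15, 0.30, 0.10, 0.25]
ses = [0.10, 0.12, 0.15, 0.09, 0.11]
beta, se, z, p = inverse_variance_meta(betas, ses)
print(f"pooled beta = {beta:.3f}, SE = {se:.3f}, p = {p:.2g}")
```

Cohorts with smaller standard errors (typically the larger ones) dominate the pooled estimate, which is the intended behaviour of this weighting.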
Mengual, Lourdes; Burset, Moisès; Ribal, María José; Ars, Elisabet; Marín-Aguilera, Mercedes; Fernández, Manuel; Ingelmo-Torres, Mercedes; Villavicencio, Humberto; Alcaraz, Antonio
2010-05-01
To develop an accurate and noninvasive method for bladder cancer diagnosis and prediction of disease aggressiveness based on the gene expression patterns of urine samples. Gene expression patterns of 341 urine samples from bladder urothelial cell carcinoma (UCC) patients and 235 controls were analyzed via TaqMan Arrays. In a first phase of the study, three consecutive gene selection steps were done to identify a gene set expression signature to detect and stratify UCC in urine. Subsequently, those genes more informative for UCC diagnosis and prediction of tumor aggressiveness were combined to obtain a classification system of bladder cancer samples. In a second phase, the obtained gene set signature was evaluated in a routine clinical scenario analyzing only voided urine samples. We have identified a 12+2 gene expression signature for UCC diagnosis and prediction of tumor aggressiveness on urine samples. Overall, this gene set panel had 98% sensitivity (SN) and 99% specificity (SP) in discriminating between UCC and control samples and 79% SN and 92% SP in predicting tumor aggressiveness. The translation of the model to the clinically applicable format corroborates that the 12+2 gene set panel described maintains a high accuracy for UCC diagnosis (SN = 89% and SP = 95%) and tumor aggressiveness prediction (SN = 79% and SP = 91%) in voided urine samples. The 12+2 gene expression signature described in urine is able to identify patients suffering from UCC and predict tumor aggressiveness. We show that a panel of molecular markers may improve the schedule for diagnosis and follow-up in UCC patients. Copyright 2010 AACR.
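Sensitivity (SN) and specificity (SP), as reported in the abstract above, follow directly from a confusion matrix. The short sketch below shows the arithmetic; the counts are hypothetical and are not the study's data.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # fraction of true cases detected
    specificity = tn / (tn + fp)  # fraction of controls correctly cleared
    return sensitivity, specificity

# Hypothetical counts for a diagnostic panel.
sn, sp = sensitivity_specificity(tp=90, fn=10, tn=95, fp=5)
print(f"SN = {sn:.0%}, SP = {sp:.0%}")
```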
Bartlett, Thomas E.; Jones, Allison; Goode, Ellen L.; Fridley, Brooke L.; Cunningham, Julie M.; Berns, Els M. J. J.; Wik, Elisabeth; Salvesen, Helga B.; Davidson, Ben; Trope, Claes G.; Lambrechts, Sandrina; Vergote, Ignace; Widschwendter, Martin
2015-01-01
We introduce a novel per-gene measure of intra-gene DNA methylation variability (IGV) based on the Illumina Infinium HumanMethylation450 platform, which is prognostic independently of well-known predictors of clinical outcome. Using IGV, we derive a robust gene-panel prognostic signature for ovarian cancer (OC, n = 221), which validates in two independent data sets from Mayo Clinic (n = 198) and TCGA (n = 358), with significance of p = 0.004 in both sets. The OC prognostic signature gene-panel is comprised of four gene groups, which represent distinct biological processes. We show the IGV measurements of these gene groups are most likely a reflection of a mixture of intra-tumour heterogeneity and transcription factor (TF) binding/activity. IGV can be used to predict clinical outcome in patients individually, providing a surrogate read-out of hard-to-measure disease processes. PMID:26629914
Risk Classification with an Adaptive Naive Bayes Kernel Machine Model.
Minnier, Jessica; Yuan, Ming; Liu, Jun S; Cai, Tianxi
2015-04-22
Genetic studies of complex traits have uncovered only a small number of risk markers explaining a small fraction of heritability and adding little improvement to disease risk prediction. Standard single marker methods may lack power in selecting informative markers or estimating effects. Most existing methods also typically do not account for non-linearity. Identifying markers with weak signals and estimating their joint effects among many non-informative markers remains challenging. One potential approach is to group markers based on biological knowledge such as gene structure. If markers in a group tend to have similar effects, proper usage of the group structure could improve power and efficiency in estimation. We propose a two-stage method relating markers to disease risk by taking advantage of known gene-set structures. Imposing a naive Bayes kernel machine (KM) model, we estimate gene-set specific risk models that relate each gene-set to the outcome in stage I. The KM framework efficiently models potentially non-linear effects of predictors without requiring explicit specification of functional forms. In stage II, we aggregate information across gene-sets via a regularization procedure. Estimation and computational efficiency are further improved with kernel principal component analysis. Asymptotic results for model estimation and gene set selection are derived and numerical studies suggest that the proposed procedure could outperform existing procedures for constructing genetic risk models.
Relative codon adaptation: a generic codon bias index for prediction of gene expression.
Fox, Jesse M; Erill, Ivan
2010-06-01
The development of codon bias indices (CBIs) remains an active field of research due to their myriad applications in computational biology. Recently, the relative codon usage bias (RCBS) was introduced as a novel CBI able to estimate codon bias without using a reference set. The results of this new index when applied to Escherichia coli and Saccharomyces cerevisiae led the authors of the original publications to conclude that natural selection favours higher expression and enhanced codon usage optimization in short genes. Here, we show that this conclusion was flawed and based on the systematic oversight of an intrinsic bias for short sequences in the RCBS index and of biases in the small data sets used for validation in E. coli. Furthermore, we reveal how the RCBS can be corrected to produce useful results and how its underlying principle, which we here term relative codon adaptation (RCA), can be made into a powerful reference-set-based index that directly takes into account the genomic base composition. Finally, we show that RCA outperforms the codon adaptation index (CAI) as a predictor of gene expression when operating on the CAI reference set and that this improvement is significantly larger when analysing genomes with high mutational bias.
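For readers unfamiliar with reference-set-based indices, the sketch below computes the codon adaptation index (CAI), the comparator named in the abstract above. It does not implement RCA or RCBS, whose formulas are not given in the abstract. The helper names and toy sequences are assumptions, and codons unseen in the reference set are simply skipped, which is a simplification.

```python
import math
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def split_codons(seq):
    """Split an in-frame coding sequence into complete codons."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

def relative_adaptiveness(reference_seqs):
    """Weight of each codon = its count / count of the most used synonym."""
    counts = Counter(c for s in reference_seqs for c in split_codons(s))
    weights = {}
    for codon, aa in CODON_TO_AA.items():
        family_max = max(counts[c] for c, a in CODON_TO_AA.items() if a == aa)
        if counts[codon] > 0:
            weights[codon] = counts[codon] / family_max
    return weights

def cai(seq, weights):
    """Geometric mean of codon weights, ignoring Met, Trp and stop codons."""
    skipped = {"M", "W", "*"}
    logs = [
        math.log(weights[c])
        for c in split_codons(seq)
        if c in weights and CODON_TO_AA.get(c) not in skipped
    ]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")

# Toy reference set (hypothetical): alanine codon GCT is preferred over GCC.
weights = relative_adaptiveness(["GCTGCTGCTGCC"])
print(cai("GCTGCT", weights), cai("GCTGCC", weights))
```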
System Complexity Reduction via Feature Selection
ERIC Educational Resources Information Center
Deng, Houtao
2011-01-01
This dissertation transforms a set of system complexity reduction problems to feature selection problems. Three systems are considered: classification based on association rules, network structure learning, and time series classification. Furthermore, two variable importance measures are proposed to reduce the feature selection bias in tree…
Finding the missing honey bee genes: lessons learned from a genome upgrade.
Elsik, Christine G; Worley, Kim C; Bennett, Anna K; Beye, Martin; Camara, Francisco; Childers, Christopher P; de Graaf, Dirk C; Debyser, Griet; Deng, Jixin; Devreese, Bart; Elhaik, Eran; Evans, Jay D; Foster, Leonard J; Graur, Dan; Guigo, Roderic; Hoff, Katharina Jasmin; Holder, Michael E; Hudson, Matthew E; Hunt, Greg J; Jiang, Huaiyang; Joshi, Vandita; Khetani, Radhika S; Kosarev, Peter; Kovar, Christie L; Ma, Jian; Maleszka, Ryszard; Moritz, Robin F A; Munoz-Torres, Monica C; Murphy, Terence D; Muzny, Donna M; Newsham, Irene F; Reese, Justin T; Robertson, Hugh M; Robinson, Gene E; Rueppell, Olav; Solovyev, Victor; Stanke, Mario; Stolle, Eckart; Tsuruda, Jennifer M; Vaerenbergh, Matthias Van; Waterhouse, Robert M; Weaver, Daniel B; Whitfield, Charles W; Wu, Yuanqing; Zdobnov, Evgeny M; Zhang, Lan; Zhu, Dianhui; Gibbs, Richard A
2014-01-30
The first generation of genome sequence assemblies and annotations have had a significant impact upon our understanding of the biology of the sequenced species, the phylogenetic relationships among species, the study of populations within and across species, and have informed the biology of humans. As only a few Metazoan genomes are approaching finished quality (human, mouse, fly and worm), there is room for improvement of most genome assemblies. The honey bee (Apis mellifera) genome, published in 2006, was noted for its bimodal GC content distribution that affected the quality of the assembly in some regions and for fewer genes in the initial gene set (OGSv1.0) compared to what would be expected based on other sequenced insect genomes. Here, we report an improved honey bee genome assembly (Amel_4.5) with a new gene annotation set (OGSv3.2), and show that the honey bee genome contains a number of genes similar to that of other insect genomes, contrary to what was suggested in OGSv1.0. The new genome assembly is more contiguous and complete and the new gene set includes ~5000 more protein-coding genes, 50% more than previously reported. About 1/6 of the additional genes were due to improvements to the assembly, and the remaining were inferred based on new RNAseq and protein data. Lessons learned from this genome upgrade have important implications for future genome sequencing projects. Furthermore, the improvements significantly enhance genomic resources for the honey bee, a key model for social behavior and essential to global ecology through pollination.
Finding the missing honey bee genes: lessons learned from a genome upgrade
2014-01-01
Background The first generation of genome sequence assemblies and annotations have had a significant impact upon our understanding of the biology of the sequenced species, the phylogenetic relationships among species, the study of populations within and across species, and have informed the biology of humans. As only a few Metazoan genomes are approaching finished quality (human, mouse, fly and worm), there is room for improvement of most genome assemblies. The honey bee (Apis mellifera) genome, published in 2006, was noted for its bimodal GC content distribution that affected the quality of the assembly in some regions and for fewer genes in the initial gene set (OGSv1.0) compared to what would be expected based on other sequenced insect genomes. Results Here, we report an improved honey bee genome assembly (Amel_4.5) with a new gene annotation set (OGSv3.2), and show that the honey bee genome contains a number of genes similar to that of other insect genomes, contrary to what was suggested in OGSv1.0. The new genome assembly is more contiguous and complete and the new gene set includes ~5000 more protein-coding genes, 50% more than previously reported. About 1/6 of the additional genes were due to improvements to the assembly, and the remaining were inferred based on new RNAseq and protein data. Conclusions Lessons learned from this genome upgrade have important implications for future genome sequencing projects. Furthermore, the improvements significantly enhance genomic resources for the honey bee, a key model for social behavior and essential to global ecology through pollination. PMID:24479613
An, Yan; Zou, Zhihong; Li, Ranran
2014-01-01
A large number of parameters are acquired during practical water quality monitoring. If all the parameters are used in water quality assessment, the computational complexity will definitely increase. In order to reduce the input space dimensions, a fuzzy rough set was introduced to perform attribute reduction. Then, an attribute recognition theoretical model and entropy method were combined to assess water quality in the Harbin reach of the Songhuajiang River in China. A dataset consisting of ten parameters was collected from January to October in 2012. Fuzzy rough set was applied to reduce the ten parameters to four parameters: BOD5, NH3-N, TP, and F. coli (Reduct A). Considering that DO is a usual parameter in water quality assessment, another reduct, including DO, BOD5, NH3-N, TP, TN, F, and F. coli (Reduct B), was obtained. The assessment results of Reduct B show a good consistency with those of Reduct A, and this means that DO is not always necessary to assess water quality. The results with attribute reduction are not exactly the same as those without attribute reduction, which can be attributed to the α value decided by subjective experience. The assessment results gained by the fuzzy rough set obviously reduce computational complexity, and are acceptable and reliable. The model proposed in this paper enhances the water quality assessment system. PMID:24675643
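The abstract above combines an attribute recognition model with an entropy method. One common reading of "entropy method" is the entropy weight method, which gives larger weights to indicators whose values vary more across samples. The sketch below follows that reading as an assumption; the function name and the toy matrix are hypothetical, and the fuzzy rough set reduction itself is not shown.

```python
import math

def entropy_weights(rows):
    """Entropy weight method for a samples-by-indicators matrix.

    rows: list of samples, each a list of non-negative indicator values.
    Assumes at least two samples and that at least one indicator varies.
    """
    n_samples = len(rows)
    n_indicators = len(rows[0])
    k = 1.0 / math.log(n_samples)
    divergences = []
    for j in range(n_indicators):
        column = [row[j] for row in rows]
        total = sum(column)
        entropy = 0.0
        for value in column:
            p = value / total
            if p > 0:  # the term p * ln(p) is taken as 0 when p is 0
                entropy -= k * p * math.log(p)
        divergences.append(1.0 - entropy)
    scale = sum(divergences)
    return [d / scale for d in divergences]

# Toy data: the first indicator is constant, so it carries no information.
weights = entropy_weights([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
print(weights)
```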
Schaid, Daniel J; Sinnwell, Jason P; Jenkins, Gregory D; McDonnell, Shannon K; Ingle, James N; Kubo, Michiaki; Goss, Paul E; Costantino, Joseph P; Wickerham, D Lawrence; Weinshilboum, Richard M
2012-01-01
Gene-set analyses have been widely used in gene expression studies, and some of the developed methods have been extended to genome wide association studies (GWAS). Yet, complications due to linkage disequilibrium (LD) among single nucleotide polymorphisms (SNPs), and variable numbers of SNPs per gene and genes per gene-set, have plagued current approaches, often leading to ad hoc "fixes." To overcome some of the current limitations, we developed a general approach to scan GWAS SNP data for both gene-level and gene-set analyses, building on score statistics for generalized linear models, and taking advantage of the directed acyclic graph structure of the gene ontology when creating gene-sets. However, other types of gene-set structures can be used, such as the popular Kyoto Encyclopedia of Genes and Genomes (KEGG). Our approach combines SNPs into genes, and genes into gene-sets, but ensures that positive and negative effects of genes on a trait do not cancel. To control for multiple testing of many gene-sets, we use an efficient computational strategy that accounts for LD and provides accurate step-down adjusted P-values for each gene-set. Application of our methods to two different GWAS provides guidance on the potential strengths and weaknesses of our proposed gene-set analyses. © 2011 Wiley Periodicals, Inc.
Saklatvala, Jake R; Dand, Nick; Simpson, Michael A
2018-05-01
The genetic diagnosis of rare monogenic diseases using exome/genome sequencing requires the true causal variant(s) to be identified from tens of thousands of observed variants. Typically a virtual gene panel approach is taken whereby only variants in genes known to cause phenotypes resembling the patient under investigation are considered. With the number of known monogenic gene-disease pairs exceeding 5,000, manual curation of personalized virtual panels using exhaustive knowledge of the genetic basis of the human monogenic phenotypic spectrum is challenging. We present improved probabilistic methods for estimating phenotypic similarity based on Human Phenotype Ontology annotation. A limitation of existing methods for evaluating a disease's similarity to a reference set is that reference diseases are typically represented as a series of binary (present/absent) observations of phenotypic terms. We evaluate a quantified disease reference set, using term frequency in phenotypic text descriptions to approximate term relevance. We demonstrate an improved ability to identify related diseases through the use of a quantified reference set, and that vector space similarity measures perform better than established information content-based measures. These improvements enable the generation of bespoke virtual gene panels, facilitating more accurate and efficient interpretation of genomic variant profiles from individuals with rare Mendelian disorders. These methods are available online at https://atlas.genetics.kcl.ac.uk/~jake/cgi-bin/patient_sim.py. © 2018 Wiley Periodicals, Inc.
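The abstract above reports that vector space similarity over a quantified (term-frequency) reference set works better than binary term lists. A minimal sketch of one such measure, cosine similarity between sparse term-weight vectors, is given below. The term labels and weights are placeholders rather than real Human Phenotype Ontology identifiers, and the authors' exact weighting scheme is not specified in the abstract.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Placeholder phenotype terms; weights stand in for term frequencies.
patient = {"term_seizure": 1.0, "term_hypotonia": 1.0}
disease_x = {"term_seizure": 0.8, "term_hypotonia": 0.5, "term_ataxia": 0.1}
disease_y = {"term_hearing_loss": 0.9}

ranked = sorted(
    [("disease_x", disease_x), ("disease_y", disease_y)],
    key=lambda item: cosine_similarity(patient, item[1]),
    reverse=True,
)
print([name for name, _ in ranked])
```

Ranking reference diseases by similarity to the patient profile is one way a virtual gene panel could be assembled from the top-scoring diseases.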
Gene ontology analysis of pairwise genetic associations in two genome-wide studies of sporadic ALS.
Kim, Nora Chung; Andrews, Peter C; Asselbergs, Folkert W; Frost, H Robert; Williams, Scott M; Harris, Brent T; Read, Cynthia; Askland, Kathleen D; Moore, Jason H
2012-07-28
It is increasingly clear that common human diseases have a complex genetic architecture characterized by both additive and nonadditive genetic effects. The goal of the present study was to determine whether patterns of both additive and nonadditive genetic associations aggregate in specific functional groups as defined by the Gene Ontology (GO). We first estimated all pairwise additive and nonadditive genetic effects using the multifactor dimensionality reduction (MDR) method that makes few assumptions about the underlying genetic model. Statistical significance was evaluated using permutation testing in two genome-wide association studies of ALS. The detection data consisted of 276 subjects with ALS and 271 healthy controls while the replication data consisted of 221 subjects with ALS and 211 healthy controls. Both studies included genotypes from approximately 550,000 single-nucleotide polymorphisms (SNPs). Each SNP was mapped to a gene if it was within 500 kb of the start or end. Each SNP was assigned a p-value based on its strongest joint effect with the other SNPs. We then used the Exploratory Visual Analysis (EVA) method and software to assign a p-value to each gene based on the overabundance of significant SNPs at the α = 0.05 level in the gene. We also used EVA to assign p-values to each GO group based on the overabundance of significant genes at the α = 0.05 level. A GO category was determined to replicate if that category was significant at the α = 0.05 level in both studies. We found two GO categories that replicated in both studies. The first, 'Regulation of Cellular Component Organization and Biogenesis', a GO Biological Process, had p-values of 0.010 and 0.014 in the detection and replication studies, respectively. The second, 'Actin Cytoskeleton', a GO Cellular Component, had p-values of 0.040 and 0.046 in the detection and replication studies, respectively. 
Pathway analysis of pairwise genetic associations in two GWAS of sporadic ALS revealed a set of genes involved in cellular component organization and actin cytoskeleton, more specifically, that were not reported by prior GWAS. However, prior biological studies have implicated actin cytoskeleton in ALS and other motor neuron diseases. This study supports the idea that pathway-level analysis of GWAS data may discover important associations not revealed using conventional one-SNP-at-a-time approaches.
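The SNP-to-gene assignment rule described above (a SNP maps to a gene if it lies within 500 kb of the gene's start or end) can be sketched as a simple window check. The version below reads the rule as "inside the gene or within 500 kb of either boundary", which is an interpretation; the gene names and coordinates are hypothetical.

```python
def map_snp_to_genes(chrom, pos, genes, window=500_000):
    """Return the genes whose extended interval contains the SNP.

    genes: dict mapping gene name -> (chromosome, start, end)
    window: flank in base pairs added to both sides of each gene
    """
    return [
        name
        for name, (g_chrom, start, end) in genes.items()
        if g_chrom == chrom and start - window <= pos <= end + window
    ]

# Hypothetical gene coordinates.
genes = {
    "GENE_A": ("1", 1_000_000, 1_050_000),
    "GENE_B": ("1", 2_000_000, 2_100_000),
    "GENE_C": ("2", 1_000_000, 1_020_000),
}
print(map_snp_to_genes("1", 1_500_000, genes))
```

With a window this wide a SNP can map to several genes at once, so any gene-level p-value built on the mapping shares SNPs between neighbouring genes.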
GAVIN: Gene-Aware Variant INterpretation for medical sequencing.
van der Velde, K Joeri; de Boer, Eddy N; van Diemen, Cleo C; Sikkema-Raddatz, Birgit; Abbott, Kristin M; Knopperts, Alain; Franke, Lude; Sijmons, Rolf H; de Koning, Tom J; Wijmenga, Cisca; Sinke, Richard J; Swertz, Morris A
2017-01-16
We present Gene-Aware Variant INterpretation (GAVIN), a new method that accurately classifies variants for clinical diagnostic purposes. Classifications are based on gene-specific calibrations of allele frequencies from the ExAC database, likely variant impact using SnpEff, and estimated deleteriousness based on CADD scores for >3000 genes. In a benchmark on 18 clinical gene sets, we achieve a sensitivity of 91.4% and a specificity of 76.9%. This accuracy is unmatched by 12 other tools. We provide GAVIN as an online MOLGENIS service to annotate VCF files and as an open source executable for use in bioinformatic pipelines. It can be found at http://molgenis.org/gavin .
Primer sets for cloning the human repertoire of T cell Receptor Variable regions
Boria, Ilenia; Cotella, Diego; Dianzani, Irma; Santoro, Claudio; Sblattero, Daniele
2008-01-01
Background Amplification and cloning of naïve T cell Receptor (TR) repertoires or antigen-specific TR is crucial to shape immune response and to develop immuno-based therapies. TR variable (V) regions are encoded by several genes that recombine during T cell development. The cloning of expressed genes as large diverse libraries from natural sources relies upon the availability of primers able to amplify as many V genes as possible. Results Here, we present a list of primers computationally designed on all functional TR V and J genes listed in the IMGT®, the ImMunoGeneTics information system®. The list consists of unambiguous or degenerate primers suitable to theoretically amplify and clone the entire TR repertoire. We show that it is possible to selectively amplify and clone expressed TR V genes in one single RT-PCR step and from as little as 1000 cells. Conclusion This new primer set will facilitate the creation of more diverse TR libraries than has been possible using currently available primer sets. PMID:18759974
Combining Evidence of Preferential Gene-Tissue Relationships from Multiple Sources
Guo, Jing; Hammar, Mårten; Öberg, Lisa; Padmanabhuni, Shanmukha S.; Bjäreland, Marcus; Dalevi, Daniel
2013-01-01
An important challenge in drug discovery and disease prognosis is to predict genes that are preferentially expressed in one or a few tissues, i.e. showing considerably higher expression in those tissues than in the others. Although several data sources and methods have been published explicitly for this purpose, they often disagree and it is not evident how to retrieve these genes and how to distinguish true biological findings from those that are due to choice-of-method and/or experimental settings. In this work we have developed a computational approach that combines results from multiple methods and datasets with the aim to eliminate method/study-specific biases and to improve the predictability of preferentially expressed human genes. A rule-based score is used to merge and assign support to the results. Five sets of genes with known tissue specificity were used for parameter pruning and cross-validation. In total we identify 3434 tissue-specific genes. We compare the genes of highest scores with the public databases: PaGenBase (microarray), TiGER (EST) and HPA (protein expression data). The results have 85% overlap with PaGenBase, 71% with TiGER and only 28% with HPA. 99% of our predictions have support from at least one of these databases. Our approach also performs better than any of the databases on identifying drug targets and biomarkers with known tissue-specificity. PMID:23950964
Missing value imputation in DNA microarrays based on conjugate gradient method.
Dorri, Fatemeh; Azmi, Paeiz; Dorri, Faezeh
2012-02-01
Analysis of gene expression profiles needs a complete matrix of gene array values; consequently, imputation methods have been suggested. In this paper, an algorithm based on the conjugate gradient (CG) method is proposed to estimate missing values. The k-nearest neighbors of the missing entry are first selected based on the absolute values of their Pearson correlation coefficients. Then a subset of genes among the k-nearest neighbors is labeled as the most similar ones. The CG algorithm, with this subset as its input, is then used to estimate the missing values. Our proposed CG-based algorithm (CGimpute) is evaluated on different data sets. The results are compared with sequential local least squares (SLLSimpute), Bayesian principal component analysis (BPCAimpute), local least squares imputation (LLSimpute), iterated local least squares imputation (ILLSimpute) and adaptive k-nearest neighbors imputation (KNNKimpute) methods. The averages of normalized root mean squares error (NRMSE) and relative NRMSE in different data sets with various missing rates show that CGimpute outperforms the other methods. Copyright © 2011 Elsevier Ltd. All rights reserved.
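The two main steps named in this abstract, neighbor selection by absolute Pearson correlation followed by a conjugate gradient solve, can be sketched in plain Python. This is a simplified illustration rather than the published CGimpute: it assumes the candidate genes have no missing values, it skips the intermediate "most similar subset" step, and it uses CG on the least-squares normal equations. All function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0

def conjugate_gradient(M, rhs, iters=100, tol=1e-12):
    """Solve M w = rhs for a symmetric positive semi-definite matrix M."""
    n = len(rhs)
    w = [0.0] * n
    r = rhs[:]                      # residual for the starting point w = 0
    p = r[:]
    rs = sum(v * v for v in r)
    for _ in range(iters):
        if rs < tol:
            break
        Mp = [sum(M[i][j] * p[j] for j in range(n)) for i in range(n)]
        pMp = sum(p[i] * Mp[i] for i in range(n))
        if pMp <= 0:
            break
        alpha = rs / pMp
        w = [w[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Mp[i] for i in range(n)]
        rs_new = sum(v * v for v in r)
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return w

def impute(target, candidates, k=2):
    """Fill None entries of `target` from its k most correlated complete genes."""
    observed = [j for j, v in enumerate(target) if v is not None]
    missing = [j for j, v in enumerate(target) if v is None]
    b = [target[j] for j in observed]
    ranked = sorted(candidates,
                    key=lambda g: abs(pearson([g[j] for j in observed], b)),
                    reverse=True)
    neighbours = ranked[:k]
    A = [[g[j] for g in neighbours] for j in observed]
    # Normal equations (A^T A) w = A^T b, solved by CG
    M = [[sum(row[i] * row[j] for row in A) for j in range(k)] for i in range(k)]
    rhs = [sum(row[i] * bi for row, bi in zip(A, b)) for i in range(k)]
    w = conjugate_gradient(M, rhs)
    filled = target[:]
    for j in missing:
        filled[j] = sum(wi * g[j] for wi, g in zip(w, neighbours))
    return filled

# Toy data: the target gene is exactly twice the first candidate gene
candidates = [[1, 2, 3, 4, 5], [5, 1, 4, 2, 3], [2, 1, 4, 3, 6]]
target = [2, 4, None, 8, 10]
filled = impute(target, candidates)
print(filled)
```

In the toy data the missing entry is recovered as a value close to 6, because the target is an exact multiple of its best-correlated neighbor.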
Yielding physically-interpretable emulators - A Sparse PCA approach
NASA Astrophysics Data System (ADS)
Galelli, S.; Alsahaf, A.; Giuliani, M.; Castelletti, A.
2015-12-01
Projection-based techniques, such as Principal Orthogonal Decomposition (POD), are a common approach to surrogate high-fidelity process-based models by lower order dynamic emulators. With POD, the dimensionality reduction is achieved by using observations, or 'snapshots' - generated with the high-fidelity model -, to project the entire set of input and state variables of this model onto a smaller set of basis functions that account for most of the variability in the data. While reduction efficiency and variance control of POD techniques are usually very high, the resulting emulators are structurally highly complex and can hardly be given a physically meaningful interpretation as each basis is a projection of the entire set of inputs and states. In this work, we propose a novel approach based on Sparse Principal Component Analysis (SPCA) that combines the several assets of POD methods with the potential for ex-post interpretation of the emulator structure. SPCA reduces the number of non-zero coefficients in the basis functions by identifying a sparse matrix of coefficients. While the resulting set of basis functions may retain less variance of the snapshots, the presence of a few non-zero coefficients assists in the interpretation of the underlying physical processes. The SPCA approach is tested on the reduction of a 1D hydro-ecological model (DYRESM-CAEDYM) used to describe the main ecological and hydrodynamic processes in Tono Dam, Japan. An experimental comparison against a standard POD approach shows that SPCA achieves the same accuracy in emulating a given output variable - for the same level of dimensionality reduction - while yielding better insights of the main process dynamics.
Diarrhea as a cause of mortality in a mouse model of infectious colitis
Borenshtein, Diana; Fry, Rebecca C; Groff, Elizabeth B; Nambiar, Prashant R; Carey, Vincent J; Fox, James G; Schauer, David B
2008-01-01
Background Comparative characterization of genome-wide transcriptional changes during infection can help elucidate the mechanisms underlying host susceptibility. In this study, transcriptional profiling of the mouse colon was carried out in two cognate lines of mice that differ in their response to Citrobacter rodentium infection; susceptible inbred FVB/N and resistant outbred Swiss Webster mice. Gene expression in the distal colon was determined prior to infection, and at four and nine days post-inoculation using a whole mouse genome Affymetrix array. Results Computational analysis identified 462 probe sets more than 2-fold differentially expressed between uninoculated resistant and susceptible mice. In response to C. rodentium infection, 5,123 probe sets were differentially expressed in one or both lines of mice. Microarray data were validated by quantitative real-time RT-PCR for 35 selected genes and were found to have a 94% concordance rate. Transcripts represented by 1,547 probe sets were differentially expressed between susceptible and resistant mice regardless of infection status, a host effect. Genes associated with transport were over-represented to a greater extent than even immune response-related genes. Electrolyte analysis revealed reduction in serum levels of chloride and sodium in susceptible animals. Conclusion The results support the hypothesis that mortality in C. rodentium-infected susceptible mice is associated with impaired intestinal ion transport and development of fatal fluid loss and dehydration. These studies contribute to our understanding of the pathogenesis of C. rodentium and suggest novel strategies for the prevention and treatment of diarrhea associated with intestinal bacterial infections. PMID:18680595
Tsukamoto, Kenji; Panei, Carlos Javier; Javier, Panei Carlos; Shishido, Makiko; Noguchi, Daigo; Pearce, John; Kang, Hyun-Mi; Jeong, Ok Mi; Lee, Youn-Jeong; Nakanishi, Koji; Ashizawa, Takayoshi
2012-01-01
Continuing outbreaks of H5N1 highly pathogenic (HP) avian influenza virus (AIV) infections of wild birds and poultry worldwide emphasize the need for global surveillance of wild birds. To support the future surveillance activities, we developed a SYBR green-based, real-time reverse transcriptase PCR (rRT-PCR) for detecting nucleoprotein (NP) genes and subtyping 16 hemagglutinin (HA) and 9 neuraminidase (NA) genes simultaneously. Primers were improved by focusing on Eurasian or North American lineage genes; the number of mixed-base positions per primer was set to five or fewer, and the concentration of each primer set was optimized empirically. Also, 30 cycles of amplification of 1:10 dilutions of cDNAs from cultured viruses effectively reduced minor cross- or nonspecific reactions. Under these conditions, 346 HA and 345 NA genes of 349 AIVs were detected, with average sensitivities of NP, HA, and NA genes of 10^1.5, 10^2.3, and 10^3.1 50% egg infective doses, respectively. Utility of rRT-PCR for subtyping AIVs was compared with that of current standard serological tests by using 104 recent migratory duck virus isolates. As a result, all HA genes and 99% of the NA genes were genetically subtyped, while only 45% of HA genes and 74% of NA genes were serologically subtyped. Additionally, direct subtyping of AIVs in fecal samples was possible by 40 cycles of amplification: approximately 70% of HA and NA genes of NP gene-positive samples were successfully subtyped. This validation study indicates that rRT-PCR with optimized primers and reaction conditions is a powerful tool for subtyping varied AIVs in clinical and cultured samples.
Algorithm Sorts Groups Of Data
NASA Technical Reports Server (NTRS)
Evans, J. D.
1987-01-01
For efficient sorting, algorithm finds set containing minimum or maximum most significant data. Sets of data sorted as desired. Sorting process simplified by reduction of each multielement set of data to single representative number. First, each set of data expressed as polynomial with suitably chosen base, using elements of set as coefficients. Most significant element placed in term containing largest exponent. Base selected by examining range in value of data elements. Resulting series summed to yield single representative number. Numbers easily sorted, and each such number converted back to original set of data by successive division. Program written in BASIC.
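The encode, sort, and decode procedure described here can be sketched in Python (the original program was written in BASIC and is not reproduced in this record). The sketch assumes every set has the same length and holds non-negative integers, and it picks the base as one more than the largest element; the function names are hypothetical.

```python
def encode(group, base):
    """Evaluate the group as a polynomial in `base`; the first (most
    significant) element ends up in the term with the largest exponent."""
    number = 0
    for element in group:
        number = number * base + element
    return number

def decode(number, base, length):
    """Recover the original elements by successive division."""
    elements = []
    for _ in range(length):
        number, remainder = divmod(number, base)
        elements.append(remainder)
    return tuple(reversed(elements))

def sort_groups(groups):
    length = len(groups[0])
    base = max(max(g) for g in groups) + 1   # chosen from the range of the data
    keys = sorted(encode(g, base) for g in groups)
    return [decode(k, base, length) for k in keys]

groups = [(3, 1, 2), (1, 9, 9), (3, 0, 7), (1, 2, 0)]
print(sort_groups(groups))
# [(1, 2, 0), (1, 9, 9), (3, 0, 7), (3, 1, 2)]
```

Because the base exceeds every element, each representative number maps back to exactly one set, and ordering the numbers matches ordering the sets by their most significant element first.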
Garcia, Marlene; Mauro, James A; Ramsamooj, Michael; Blanck, George
2015-08-03
Apoptosis- and proliferation-effector genes are substantially regulated by the same transactivators, with E2F-1 and Oct-1 being notable examples. The larger proliferation-effector genes have more binding sites for the transactivators that regulate both sets of genes, and proliferation-effector genes have more regions of active chromatin, i.e., DNase I hypersensitive and histone 3, lysine-4 trimethylation sites. Thus, the size differences between the 2 classes of genes suggest a transcriptional regulation paradigm whereby the transcription factors that regulate both sets of genes, merely as an aspect of stochastic behavior, accumulate first on the larger proliferation-effector gene "traps," and then accumulate on the apoptosis-effector genes, thereby effecting sequential activation of the 2 different gene sets. As IRF-1 and p53 levels increase, tumor suppressor proteins are first activated, followed by the activation of apoptosis-effector genes, for example during S-phase pausing for DNA repair. Tumor suppressor genes are larger than apoptosis-effector genes and have more IRF-1 and p53 binding sites, thereby likewise suggesting a paradigm for transcription sequencing based on stochastic interactions of transcription factors with different gene classes. In this report, using the ENCODE database, we determined that tumor suppressor genes have a greater number of open chromatin regions and histone 3 lysine-4 trimethylation sites, consistent with the idea that a larger gene size can facilitate earlier transcriptional activation via the inclusion of more transactivator binding sites.
USDA-ARS's Scientific Manuscript database
Premise of the study: Reference genes are selected based on the assumption of temporal and spatial expression stability and on their widespread use in model species. They are often used in new target species without validation, presumed as stable. For barley, reference gene validation is lacking, bu...
MalaCards: an integrated compendium for diseases and their annotation
Rappaport, Noa; Nativ, Noam; Stelzer, Gil; Twik, Michal; Guan-Golan, Yaron; Iny Stein, Tsippi; Bahir, Iris; Belinky, Frida; Morrey, C. Paul; Safran, Marilyn; Lancet, Doron
2013-01-01
Comprehensive disease classification, integration and annotation are crucial for biomedical discovery. At present, disease compilation is incomplete, heterogeneous and often lacking systematic inquiry mechanisms. We introduce MalaCards, an integrated database of human maladies and their annotations, modeled on the architecture and strategy of the GeneCards database of human genes. MalaCards mines and merges 44 data sources to generate a computerized card for each of 16 919 human diseases. Each MalaCard contains disease-specific prioritized annotations, as well as inter-disease connections, empowered by the GeneCards relational database, its searches and GeneDecks set analyses. First, we generate a disease list from 15 ranked sources, using disease-name unification heuristics. Next, we use four schemes to populate MalaCards sections: (i) directly interrogating disease resources, to establish integrated disease names, synonyms, summaries, drugs/therapeutics, clinical features, genetic tests and anatomical context; (ii) searching GeneCards for related publications, and for associated genes with corresponding relevance scores; (iii) analyzing disease-associated gene sets in GeneDecks to yield affiliated pathways, phenotypes, compounds and GO terms, sorted by a composite relevance score and presented with GeneCards links; and (iv) searching within MalaCards itself, e.g. for additional related diseases and anatomical context. The latter forms the basis for the construction of a disease network, based on shared MalaCards annotations, embodying associations based on etiology, clinical features and clinical conditions. This broadly disposed network has a power-law degree distribution, suggesting that this might be an inherent property of such networks. Work in progress includes hierarchical malady classification, ontological mapping and disease set analyses, striving to make MalaCards an even more effective tool for biomedical research. 
Database URL: http://www.malacards.org/ PMID:23584832
Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool
2013-01-01
Background System-wide profiling of genes and proteins in mammalian cells produces lists of differentially expressed genes/proteins that need to be further analyzed for their collective functions in order to extract new knowledge. Once unbiased lists of genes or proteins are generated from such experiments, these lists are used as input for computing enrichment with existing lists created from prior knowledge organized into gene-set libraries. While many enrichment analysis tools and gene-set library databases have been developed, there is still room for improvement. Results Here, we present Enrichr, an integrative web-based and mobile software application that includes new gene-set libraries, an alternative approach to rank enriched terms, and various interactive visualization approaches to display enrichment results using the JavaScript library, Data Driven Documents (D3). The software can also be embedded into any tool that performs gene list analysis. We applied Enrichr to analyze nine cancer cell lines by comparing their enrichment signatures to the enrichment signatures of matched normal tissues. We observed a common pattern of upregulation of the polycomb group PRC2 and enrichment for the histone mark H3K27me3 in many cancer cell lines, as well as alterations in Toll-like receptor and interleukin signaling in K562 cells when compared with normal myeloid CD33+ cells. Such analyses provide global visualization of critical differences between normal tissues and cancer cell lines but can be applied to many other scenarios. Conclusions Enrichr is an easy to use intuitive enrichment analysis web-based tool providing various types of visualization summaries of collective functions of gene lists. Enrichr is open source and freely available online at: http://amp.pharm.mssm.edu/Enrichr. PMID:23586463
Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool.
Chen, Edward Y; Tan, Christopher M; Kou, Yan; Duan, Qiaonan; Wang, Zichen; Meirelles, Gabriela Vaz; Clark, Neil R; Ma'ayan, Avi
2013-04-15
System-wide profiling of genes and proteins in mammalian cells produces lists of differentially expressed genes/proteins that need to be further analyzed for their collective functions in order to extract new knowledge. Once unbiased lists of genes or proteins are generated from such experiments, these lists are used as input for computing enrichment with existing lists created from prior knowledge organized into gene-set libraries. While many enrichment analysis tools and gene-set library databases have been developed, there is still room for improvement. Here, we present Enrichr, an integrative web-based and mobile software application that includes new gene-set libraries, an alternative approach to rank enriched terms, and various interactive visualization approaches to display enrichment results using the JavaScript library, Data Driven Documents (D3). The software can also be embedded into any tool that performs gene list analysis. We applied Enrichr to analyze nine cancer cell lines by comparing their enrichment signatures to the enrichment signatures of matched normal tissues. We observed a common pattern of upregulation of the polycomb group PRC2 and enrichment for the histone mark H3K27me3 in many cancer cell lines, as well as alterations in Toll-like receptor and interleukin signaling in K562 cells when compared with normal myeloid CD33+ cells. Such analyses provide global visualization of critical differences between normal tissues and cancer cell lines but can be applied to many other scenarios. Enrichr is an easy to use intuitive enrichment analysis web-based tool providing various types of visualization summaries of collective functions of gene lists. Enrichr is open source and freely available online at: http://amp.pharm.mssm.edu/Enrichr.
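The core operation described in this record, computing enrichment of an input gene list against a gene-set library, can be illustrated with a plain hypergeometric test. This is a generic sketch and not Enrichr's code: the abstract mentions an alternative ranking approach that is not shown here, and the gene symbols, term names, and universe size below are hypothetical toy values.

```python
from math import comb

def hypergeom_pvalue(overlap, list_size, set_size, universe):
    """P(X >= overlap) when `list_size` genes are drawn without replacement
    from `universe` genes, of which `set_size` belong to the gene set."""
    total = comb(universe, list_size)
    upper = min(list_size, set_size)
    tail = sum(comb(set_size, k) * comb(universe - set_size, list_size - k)
               for k in range(overlap, upper + 1))
    return tail / total

def enrich(gene_list, library, universe_size):
    """Return (term, overlap, p-value) tuples sorted by ascending p-value."""
    genes = set(gene_list)
    results = []
    for term, members in library.items():
        overlap = len(genes & members)
        p = hypergeom_pvalue(overlap, len(genes), len(members), universe_size)
        results.append((term, overlap, p))
    return sorted(results, key=lambda r: r[2])

# Toy gene-set library with placeholder gene symbols
library = {
    "term_A": {"G1", "G2", "G3", "G4"},
    "term_B": {"G5", "G6", "G7", "G8"},
}
results = enrich(["G1", "G2", "G3", "G9"], library, universe_size=100)
for term, overlap, p in results:
    print(term, overlap, p)
```

The term sharing three genes with the input list ranks first, while the term with no overlap receives a p-value of 1.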
Discovering semantic features in the literature: a foundation for building functional associations
Chagoyen, Monica; Carmona-Saez, Pedro; Shatkay, Hagit; Carazo, Jose M; Pascual-Montano, Alberto
2006-01-01
Background Experimental techniques such as DNA microarray, serial analysis of gene expression (SAGE) and mass spectrometry proteomics, among others, are generating large amounts of data related to genes and proteins at different levels. As in any other experimental approach, it is necessary to analyze these data in the context of previously known information about the biological entities under study. The literature is a particularly valuable source of information for experiment validation and interpretation. Therefore, the development of automated text mining tools to assist in such interpretation is one of the main challenges in current bioinformatics research. Results We present a method to create literature profiles for large sets of genes or proteins based on common semantic features extracted from a corpus of relevant documents. These profiles can be used to establish pair-wise similarities among genes, utilized in gene/protein classification or can be even combined with experimental measurements. Semantic features can be used by researchers to facilitate the understanding of the commonalities indicated by experimental results. Our approach is based on non-negative matrix factorization (NMF), a machine-learning algorithm for data analysis, capable of identifying local patterns that characterize a subset of the data. The literature is thus used to establish putative relationships among subsets of genes or proteins and to provide coherent justification for this clustering into subsets. We demonstrate the utility of the method by applying it to two independent and vastly different sets of genes. Conclusion The presented method can create literature profiles from documents relevant to sets of genes. 
The representation of genes as additive linear combinations of semantic features allows for the exploration of functional associations as well as for clustering, suggesting a valuable methodology for the validation and interpretation of high-throughput experimental data. PMID:16438716
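Non-negative matrix factorization, the technique this record builds on, can be sketched with the standard multiplicative update rules in plain Python. This is a generic illustration, not the authors' implementation: the matrix below is a hypothetical gene-by-term count table, and the rank, iteration count, and seed are arbitrary choices.

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frobenius_error(V, W, H):
    """Squared Frobenius norm of V - W H."""
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx))

def nmf(V, rank, iters=200, seed=0, eps=1e-9):
    """Factor V (m x n, non-negative) into W (m x rank) and H (rank x n)
    using multiplicative updates for the Frobenius objective."""
    rng = random.Random(seed)
    m, n = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(rank)]
    for _ in range(iters):
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
    return W, H

# Hypothetical gene-by-term matrix with two blocks of co-occurring terms
V = [[2, 2, 0, 0],
     [3, 3, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 4, 4]]
W0, H0 = nmf(V, rank=2, iters=0)    # random starting point
W, H = nmf(V, rank=2)
print(frobenius_error(V, W0, H0), frobenius_error(V, W, H))
```

Each row of W expresses a gene as an additive, non-negative combination of the features in H, which is the property the abstract relies on for interpreting semantic features.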
2012-01-01
Background Understanding demographic histories, such as divergence time, patterns of gene flow, and population size changes, in ecologically diverging lineages provides insight into the process and maintenance of population differentiation by ecological adaptation. This study addressed the demographic histories in two independently derived lineages of flood-resistant riparian plants and their non-riparian relatives [Ainsliaea linearis (riparian) and A. apiculata (non-riparian); A. oblonga (riparian) and A. macroclinidioides (non-riparian); Asteraceae] using an isolation-with-migration (IM) model based on variation at 10 nuclear DNA loci. Results The highest posterior probabilities of the divergence time parameters were estimated to be ca. 25,000 years ago for A. linearis and A. apiculata and ca. 9000 years ago for A. oblonga and A. macroclinidioides, although the confidence intervals of the parameters had broad ranges. The likelihood ratio tests detected evidence of historical gene flow between both riparian/non-riparian species pairs. The riparian populations showed lower levels of genetic diversity and a significant reduction in effective population sizes compared to the non-riparian populations and their ancestral populations. Conclusions This study showed the recent origins of flood-resistant riparian plants, which are remarkable examples of plant ecological adaptation. The recent divergence and genetic signatures of historical gene flow among riparian/non-riparian species implied that they underwent morphological and ecological differentiation within short evolutionary timescales and have maintained their species boundaries in the face of gene flow.
Comparative analyses of adaptive divergence in two sets of riparian/non-riparian lineages suggested that strong natural selection by flooding had frequently reduced the genetic diversity and size of riparian populations through genetic drift, possibly leading to fixation of adaptive traits in riparian populations. The two sets of riparian/non-riparian lineages showed contrasting patterns of gene flow and genetic differentiation, implying that each lineage showed different degrees of reproductive isolation and that they had experienced unique evolutionary and demographic histories in the process of adaptive divergence. PMID:23273287
Li, Wen; Fan, Chun Chieh; Mäki-Marttunen, Tuomo; Thompson, Wesley K; Schork, Andrew J; Bettella, Francesco; Djurovic, Srdjan; Dale, Anders M; Andreassen, Ole A; Wang, Yunpeng
2018-06-01
Traditional genome-wide association studies (GWAS) have successfully detected genetic variants associated with schizophrenia. However, only a small fraction of heritability can be explained. Gene-set/pathway-based methods can overcome limitations arising from single nucleotide polymorphism (SNP)-based analysis, but most of them place constraints on size which may exclude highly specific and functional sets, like macromolecules. Voltage-gated calcium (Cav) channels, which are macromolecules, are composed of several subunits whose encoding genes are located far apart or even on different chromosomes. We combined information about such molecules with GWAS data to investigate how functional channels are associated with schizophrenia. We defined a biologically meaningful SNP-set based on channel structure and performed an association study by using a validated method: SNP-set (sequence) kernel association test. We identified eight subtypes of Cav channels significantly associated with schizophrenia from a subsample of published data (N = 56,605), including the L-type channels (Cav1.1, Cav1.2, Cav1.3), P-/Q-type Cav2.1, N-type Cav2.2, R-type Cav2.3, T-type Cav3.1, and Cav3.3. Only genes from Cav1.2 and Cav3.3 have been implicated by the largest GWAS (N = 82,315). Each subtype of Cav channels showed relatively high chip heritability, proportional to the size of its constituent gene regions. The results suggest that abnormalities of Cav channels may play an important role in the pathophysiology of schizophrenia and these channels may represent appropriate drug targets for therapeutics. Analyzing subunit-encoding genes of a macromolecule in aggregate is a complementary way to identify more genetic variants of polygenic diseases. This study offers increased power for discovering the biological mechanisms of schizophrenia. © 2018 Wiley Periodicals, Inc.
Crook, Nathan C; Schmitz, Alexander C; Alper, Hal S
2014-05-16
Reduction of endogenous gene expression is a fundamental operation of metabolic engineering, yet current methods for gene knockdown (i.e., genome editing) remain laborious and slow, especially in yeast. In contrast, RNA interference allows facile and tunable gene knockdown via a simple plasmid transformation step, enabling metabolic engineers to rapidly prototype knockdown strategies in multiple strains before expending significant cost to undertake genome editing. Although RNAi is naturally present in a myriad of eukaryotes, it has only been recently implemented in Saccharomyces cerevisiae as a heterologous pathway and so has not yet been optimized as a metabolic engineering tool. In this study, we elucidate a set of design principles for the construction of hairpin RNA expression cassettes in yeast and implement RNA interference to quickly identify routes for improvement of itaconic acid production in this organism. The approach developed here enables rapid prototyping of knockdown strategies and thus accelerates and reduces the cost of the design-build-test cycle in yeast.
2011-01-01
Background Increased understanding of the variability in normal breast biology will enable us to identify mechanisms of breast cancer initiation and the origin of different subtypes, and to better predict breast cancer risk. Methods Gene expression patterns in breast biopsies from 79 healthy women referred to breast diagnostic centers in Norway were explored by unsupervised hierarchical clustering and supervised analyses, such as gene set enrichment analysis and gene ontology analysis and comparison with previously published gene lists and independent datasets. Results Unsupervised hierarchical clustering identified two separate clusters of normal breast tissue based on gene-expression profiling, regardless of clustering algorithm and gene filtering used. Comparison of the expression profile of the two clusters with several published gene lists describing breast cells revealed that the samples in cluster 1 share characteristics with stromal cells and stem cells, and to a certain degree with mesenchymal cells and myoepithelial cells. The samples in cluster 1 also share many features with the newly identified claudin-low breast cancer intrinsic subtype, which also shows characteristics of stromal and stem cells. More women belonging to cluster 1 have a family history of breast cancer and there is a slight overrepresentation of nulliparous women in cluster 1. Similar findings were seen in a separate dataset consisting of histologically normal tissue from both breasts harboring breast cancer and from mammoplasty reductions. Conclusion This is the first study to explore the variability of gene expression patterns in whole biopsies from normal breasts and identified distinct subtypes of normal breast tissue.
Further studies are needed to determine the specific cell contribution to the variation in the biology of normal breasts, how the clusters identified relate to breast cancer risk and their possible link to the origin of the different molecular subtypes of breast cancer. PMID:22044755
Papaspyrou, Sokratis; Smith, Cindy J.; Dong, Liang F.; Whitby, Corinne; Dumbrell, Alex J.; Nedwell, David B.
2014-01-01
Denitrification and dissimilatory nitrate reduction to ammonium (DNRA) are processes occurring simultaneously under oxygen-limited or anaerobic conditions, where both compete for nitrate and organic carbon. Despite their ecological importance, there has been little investigation of how denitrification and DNRA potentials and related functional genes vary vertically with sediment depth. Nitrate reduction potentials measured in sediment depth profiles along the Colne estuary were in the upper range of nitrate reduction rates reported from other sediments and showed the existence of strong decreasing trends both with increasing depth and along the estuary. Denitrification potential decreased along the estuary, decreasing more rapidly with depth towards the estuary mouth. In contrast, DNRA potential increased along the estuary. Significant decreases in copy numbers of 16S rRNA and nitrate-reducing genes were observed along the estuary and from surface to deeper sediments. Both metabolic potentials and functional genes persisted at sediment depths where porewater nitrate was absent. Transport of nitrate by bioturbation, based on macrofauna distributions, could only account for the upper 10 cm depth of sediment. A severalfold higher combined freeze-lysable KCl-extractable nitrate pool compared to porewater nitrate was detected. We hypothesised that this could be attributed to intracellular nitrate pools from nitrate-accumulating microorganisms like Thioploca or Beggiatoa. However, pyrosequencing analysis did not detect any such organisms, leaving other bacteria, microbenthic algae, or foraminiferans, which have also been shown to accumulate nitrate, as possible candidates. The importance and bioavailability of a KCl-extractable nitrate sediment pool remains to be tested. The significant variation in the vertical pattern and abundance of the various nitrate-reducing gene phylotypes reasonably suggests differences in their activity throughout the sediment column.
This raises interesting questions as to what the alternative metabolic roles for the various nitrate reductases could be, analogous to the alternative metabolic roles found for nitrite reductases. PMID:24728381
Endeavour update: a web resource for gene prioritization in multiple species
Tranchevent, Léon-Charles; Barriot, Roland; Yu, Shi; Van Vooren, Steven; Van Loo, Peter; Coessens, Bert; De Moor, Bart; Aerts, Stein; Moreau, Yves
2008-01-01
Endeavour (http://www.esat.kuleuven.be/endeavourweb; this web site is free and open to all users and there is no login requirement) is a web resource for the prioritization of candidate genes. Using a training set of genes known to be involved in a biological process of interest, our approach consists of (i) inferring several models (based on various genomic data sources), (ii) applying each model to the candidate genes to rank those candidates against the profile of the known genes and (iii) merging the several rankings into a global ranking of the candidate genes. In the present article, we describe the latest developments of Endeavour. First, we provide a web-based user interface, besides our Java client, to make Endeavour more universally accessible. Second, we support multiple species: in addition to Homo sapiens, we now provide gene prioritization for three major model organisms: Mus musculus, Rattus norvegicus and Caenorhabditis elegans. Third, Endeavour makes use of additional data sources and is now including numerous databases: ontologies and annotations, protein–protein interactions, cis-regulatory information, gene expression data sets, sequence information and text-mining data. We tested the novel version of Endeavour on 32 recent disease gene associations from the literature. Additionally, we describe a number of recent independent studies that made use of Endeavour to prioritize candidate genes for obesity and Type II diabetes, cleft lip and cleft palate, and pulmonary fibrosis. PMID:18508807
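Step (iii) above, merging several per-source rankings into one global ranking, can be illustrated with a minimal sketch. The abstract does not say how Endeavour fuses its rankings, so the averaging of rank ratios used here is an assumption chosen for simplicity, and the gene and data-source names are hypothetical.

```python
def merge_rankings(rankings):
    """Merge several per-source rankings into one global ranking.

    rankings: dict mapping a data-source name to a list of genes, best first.
    Returns all genes sorted by mean rank ratio (lower is better).
    NOTE: mean rank ratio is an illustrative choice, not Endeavour's method.
    """
    ratios = {}
    for ranked in rankings.values():
        n = len(ranked)
        for position, gene in enumerate(ranked, start=1):
            # Rank ratio: position divided by list length, in (0, 1].
            ratios.setdefault(gene, []).append(position / n)
    mean_ratio = {gene: sum(r) / len(r) for gene, r in ratios.items()}
    # Break ties alphabetically so the result is deterministic.
    return sorted(mean_ratio, key=lambda gene: (mean_ratio[gene], gene))


# Hypothetical candidate genes ranked by three hypothetical data sources.
example = {
    "expression": ["GENE_A", "GENE_B", "GENE_C"],
    "interactions": ["GENE_B", "GENE_A", "GENE_C"],
    "text_mining": ["GENE_A", "GENE_C", "GENE_B"],
}
global_ranking = merge_rankings(example)
```

A gene that ranks near the top in most sources ends up near the top of the global list, even if one source ranks it poorly.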
Mining functionally relevant gene sets for analyzing physiologically novel clinical expression data.
Turcan, Sevin; Vetter, Douglas E; Maron, Jill L; Wei, Xintao; Slonim, Donna K
2011-01-01
Gene set analyses have become a standard approach for increasing the sensitivity of transcriptomic studies. However, analytical methods incorporating gene sets require the availability of pre-defined gene sets relevant to the underlying physiology being studied. For novel physiological problems, relevant gene sets may be unavailable or existing gene set databases may bias the results towards only the best-studied of the relevant biological processes. We describe a successful attempt to mine novel functional gene sets for translational projects where the underlying physiology is not necessarily well characterized in existing annotation databases. We choose targeted training data from public expression data repositories and define new criteria for selecting biclusters to serve as candidate gene sets. Many of the discovered gene sets show little or no enrichment for informative Gene Ontology terms or other functional annotation. However, we observe that such gene sets show coherent differential expression in new clinical test data sets, even if derived from different species, tissues, and disease states. We demonstrate the efficacy of this method on a human metabolic data set, where we discover novel, uncharacterized gene sets that are diagnostic of diabetes, and on additional data sets related to neuronal processes and human development. Our results suggest that our approach may be an efficient way to generate a collection of gene sets relevant to the analysis of data for novel clinical applications where existing functional annotation is relatively incomplete.
SFM: A novel sequence-based fusion method for disease genes identification and prioritization.
Yousef, Abdulaziz; Moghadam Charkari, Nasrollah
2015-10-21
The identification of disease genes from the human genome is of great importance for improving the diagnosis and treatment of disease. Several machine learning methods have been introduced to identify disease genes. However, these methods mostly differ in the prior knowledge used to construct the feature vector for each instance (gene), the ways of selecting negative data (non-disease genes), for which there is no investigational approach, and the classification methods used to make the final decision. In this work, a novel Sequence-based fusion method (SFM) is proposed to identify disease genes. Unlike existing methods, instead of using noisy and incomplete prior knowledge, SFM uses the amino acid sequences of the proteins, which are universally available data, to represent the genes (proteins) as four different feature vectors. To select more likely negative data from candidate genes, the intersection of four negative sets generated using a distance approach is considered. Then, a Decision Tree (C4.5) is applied as a fusion method to combine the results of four independent state-of-the-art predictors based on the support vector machine (SVM) algorithm, and to make the final decision. The experimental results of the proposed method have been evaluated by some standard measures. The results indicate a precision, recall and F-measure of 82.6%, 85.6% and 84, respectively. These results confirm the efficiency and validity of the proposed method. Copyright © 2015 Elsevier Ltd. All rights reserved.
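The three reported measures can be checked against one another with a short calculation. The sketch below assumes the F-measure is the balanced F1 score, the harmonic mean of precision and recall; the abstract does not state which weighting was used.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, as fractions.
f1 = f_measure(0.826, 0.856)
```

Under that assumption the result is about 0.84, which agrees with the reported F-measure.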
Harm reduction and viral hepatitis C in European prisons: a cross-sectional survey of 25 countries.
Bielen, Rob; Stumo, Samya R; Halford, Rachel; Werling, Klára; Reic, Tatjana; Stöver, Heino; Robaeys, Geert; Lazarus, Jeffrey V
2018-05-11
Current estimates suggest that 15% of all prisoners worldwide are chronically infected with the hepatitis C virus (HCV), and this number is even higher in regions with high rates of injecting drug use. Although harm reduction services such as opioid substitution therapy (OST) and needle and syringe programs (NSPs) are effective in preventing the further spread of HCV and HIV, the extent to which these are available in prisons varies significantly across countries. The Hep-CORE study surveyed liver patient groups from 25 European countries in 2016 and mid-2017 on national policies related to harm reduction, testing/screening, and treatment for HCV in prison settings. Results from the cross-sectional survey were compared to the data from available reports and the peer-reviewed literature to determine the overall degree to which European countries implement evidence-based HCV recommendations in prison settings. Patient groups in nine countries (36%) identified prisoners as a high-risk population target for HCV testing/screening. Twenty-one countries (84%) provide HCV treatment in prisons. However, the extent of coverage of these treatment programs varies widely. Two countries (8%) have NSPs officially available in prisons in all parts of the country. Eleven countries (44%) provide OST in prisons in all parts of the country without additional requirements. Despite the existence of evidence-based recommendations, infectious disease prevention measures such as harm reduction programs are inadequate in European prison settings. Harm reduction, HCV testing/screening, and treatment should be scaled up in prison settings in order to progress towards eliminating HCV as a public health threat.
Effects of warming and precipitation exclusion on soil N2O fluxes in subtropical forests.
Tang, Cai di; Zhang, Zheng; Cai, Xiao Zhen; Guo, Jian Fen; Yang, Yu Sheng
2017-10-01
In order to explore how soil warming and precipitation exclusion influence soil N2O fluxes, we used related functional genes as markers and set up four treatments: control (CT), soil warming (W, 5 °C above the ambient temperature of the control), 50% precipitation reduction (P), and soil warming plus 50% precipitation reduction (WP). The results showed that precipitation exclusion reduced soil ammonium nitrogen concentration significantly. Soil warming decreased soil N2O flux and soil denitrification potential significantly. Soil microbial biomass nitrogen (MBN) in warming treatment (W) and precipitation exclusion treatment (P) was significantly lower than that in the control. The amoA gene abundance of AOA was negatively correlated with MBN and ammonium nitrogen contents, but neither soil nitrification potential nor soil N2O flux was correlated with the amoA gene abundance of AOA. Path analysis showed that the denitrification potential affected soil N2O flux directly, while microbial biomass phosphorus (MBP) and warming affected soil N2O flux indirectly through their direct effects on denitrification potential. Temperature might be the main driver of N2O flux in subtropical forest soils. Global warming would reduce N2O emissions from subtropical forest soils.
Bouma, G; Baggen, J M; van Bodegraven, A A; Mulder, C J J; Kraal, G; Zwiers, A; Horrevoets, A J; van der Pouw Kraan, C T M
2013-07-01
Crohn's disease (CD) is characterized by chronic inflammation of the gastrointestinal tract, as a result of aberrant activation of the innate immune system through TLR stimulation by bacterial products. The conventional immunosuppressive thiopurine derivatives (azathioprine and mercaptopurine) are used to treat CD. The effects of thiopurines on circulating immune cells and TLR responsiveness are unknown. To obtain a global view of affected gene expression of the immune system in CD patients and the treatment effect of thiopurine derivatives, we performed genome-wide transcriptome analysis on whole blood samples from 20 CD patients in remission, of which 10 patients received thiopurine treatment, compared to 16 healthy controls, before and after TLR4 stimulation with LPS. Several immune abnormalities were observed, including increased baseline interferon activity, while baseline expression of ribosomal genes was reduced. After LPS stimulation, CD patients showed reduced cytokine and chemokine expression. None of these effects were related to treatment. Strikingly, only one highly correlated set of 69 genes was affected by treatment, not influenced by LPS stimulation and consisted of genes reminiscent of effector cytotoxic NK cells. The most reduced cytotoxicity-related gene in CD was the cell surface marker CD160. Concordantly, we could demonstrate an in vivo reduction of circulating CD160(+)CD3(-)CD8(-) cells in CD patients after treatment with thiopurine derivatives in an independent cohort. In conclusion, using genome-wide profiling, we identified a disturbed immune activation status in peripheral blood cells from CD patients and a clear treatment effect of thiopurine derivatives selectively affecting effector cytotoxic CD160-positive cells. Copyright © 2013 Elsevier Ltd. All rights reserved.
Haitsma, Jack J.; Furmli, Suleiman; Masoom, Hussain; Liu, Mingyao; Imai, Yumiko; Slutsky, Arthur S.; Beyene, Joseph; Greenwood, Celia M. T.; dos Santos, Claudia
2012-01-01
Objectives To perform a meta-analysis of gene expression microarray data from animal studies of lung injury, and to identify an injury-specific gene expression signature capable of predicting the development of lung injury in humans. Methods We performed a microarray meta-analysis using 77 microarray chips across six platforms, two species and different animal lung injury models exposed to lung injury with and/or without mechanical ventilation. Individual gene chips were classified and grouped based on the strategy used to induce lung injury. Effect size (change in gene expression) was calculated between non-injurious and injurious conditions comparing two main strategies to pool chips: (1) one-hit and (2) two-hit lung injury models. A random effects model was used to integrate individual effect sizes calculated from each experiment. Classification models were built using the gene expression signatures generated by the meta-analysis to predict the development of lung injury in human lung transplant recipients. Results Two injury-specific lists of differentially expressed genes generated from our meta-analysis of lung injury models were validated using external data sets and prospective data from animal models of ventilator-induced lung injury (VILI). Pathway analysis of gene sets revealed that both new and previously implicated VILI-related pathways are enriched with differentially regulated genes. A classification model based on gene expression signatures identified in animal models of lung injury predicted development of primary graft failure (PGF) in lung transplant recipients with greater than 80% accuracy based upon injury profiles from transplant donors. We also found that better classifier performance can be achieved by using meta-analysis to identify differentially expressed genes than by using single study-based differential analysis. 
Conclusion Taken together, our data suggest that microarray analysis of gene expression data allows for the detection of “injury” gene predictors that can classify lung injury samples and identify patients at risk for clinically relevant lung injury complications. PMID:23071521
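The Methods mention a random effects model for integrating per-experiment effect sizes. A minimal sketch of one common way to do this follows. It assumes the DerSimonian-Laird estimator of between-study variance, which the abstract does not name, and the input numbers are made up for illustration.

```python
def random_effects(effects, variances):
    """Pool per-study effect sizes under a random effects model.

    Uses the DerSimonian-Laird estimate of between-study variance (tau^2).
    This estimator is an assumption; the abstract does not name one.
    Returns (pooled_effect, standard_error, tau_squared).
    """
    k = len(effects)
    w = [1.0 / v for v in variances]  # inverse-variance (fixed effect) weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random effects weights add tau^2 to each within-study variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = (1.0 / sum(w_star)) ** 0.5
    return pooled, se, tau2


# Hypothetical effect sizes for one gene from three experiments.
pooled, se, tau2 = random_effects([0.0, 2.0, 4.0], [1.0, 1.0, 1.0])
```

When the experiments disagree more than their within-study variances explain, tau^2 becomes positive and the pooled estimate gets a wider standard error than a fixed effect model would give.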
Map-Based Cloning of Genes Important for Maize Anther Development
NASA Astrophysics Data System (ADS)
Anaya, Y.; Walbot, V.; Nan, G.
2012-12-01
Map-based cloning for maize mutant MS13. Scientists still do not understand what decides the fate of a cell in plants. Many maize genes are important for anther development, and when they are disrupted the anthers do not shed pollen, i.e. the plant is male sterile. Since the maize genome has been fully sequenced, we conduct map-based cloning using a bulk segregant analysis strategy. Using PCR (polymerase chain reaction), we look for biomarkers that are linked to our gene of interest, Male Sterile 13 (MS13). Recombination occurs more often the further a biomarker is from the gene; therefore we can estimate where the gene is and design more PCR primers to get closer to it. Genetic and molecular analysis will help distinguish the roles of key genes in setting cell fates before meiosis and in controlling the switch from mitosis to meiosis.
Zhang, Wensheng; Edwards, Andrea; Fan, Wei; Zhu, Dongxiao; Zhang, Kun
2010-06-22
Comparative analysis of gene expression profiling of multiple biological categories, such as different species of organisms or different kinds of tissue, promises to enhance the fundamental understanding of the universality as well as the specialization of mechanisms and related biological themes. Grouping genes with a similar expression pattern or exhibiting co-expression together is a starting point in understanding and analyzing gene expression data. In recent literature, gene module level analysis is advocated in order to understand biological network design and system behaviors in disease and life processes; however, practical difficulties often lie in the implementation of existing methods. Using the singular value decomposition (SVD) technique, we developed a new computational tool, named svdPPCS (SVD-based Pattern Pairing and Chart Splitting), to identify conserved and divergent co-expression modules of two sets of microarray experiments. In the proposed methods, gene modules are identified by splitting the two-way chart coordinated with a pair of left singular vectors factorized from the gene expression matrices of the two biological categories. Importantly, the cutoffs are determined by a data-driven algorithm using the well-defined statistic, SVD-p. The implementation was illustrated on two time series microarray data sets generated from the samples of accessory gland (ACG) and malpighian tubule (MT) tissues of the line W118 of M. drosophila. Two conserved modules and six divergent modules, each of which has a unique characteristic profile across tissue kinds and aging processes, were identified. The number of genes contained in these models ranged from five to a few hundred. Three to over a hundred GO terms were over-represented in individual modules with FDR < 0.1. One divergent module suggested the tissue-specific relationship between the expressions of mitochondrion-related genes and the aging process. 
This finding, together with others, may be of biological significance. The validity of the proposed SVD-based method was further verified by a simulation study, as well as the comparisons with regression analysis and cubic spline regression analysis plus PAM based clustering. svdPPCS is a novel computational tool for the comparative analysis of transcriptional profiling. It especially fits the comparison of time series data of related organisms or different tissues of the same organism under equivalent or similar experimental conditions. The general scheme can be directly extended to the comparisons of multiple data sets. It also can be applied to the integration of data sets from different platforms and of different sources.
Hwang, Hyonson; Bowen, Benjamin P.; Lefort, Natalie; Flynn, Charles R.; De Filippis, Elena A.; Roberts, Christine; Smoke, Christopher C.; Meyer, Christian; Højlund, Kurt; Yi, Zhengping; Mandarino, Lawrence J.
2010-01-01
OBJECTIVE Insulin resistance in skeletal muscle is an early phenomenon in the pathogenesis of type 2 diabetes. Studies of insulin resistance usually are highly focused. However, approaches that give a more global picture of abnormalities in insulin resistance are useful in pointing out new directions for research. In previous studies, gene expression analyses show a coordinated pattern of reduction in nuclear-encoded mitochondrial gene expression in insulin resistance. However, changes in mRNA levels may not predict changes in protein abundance. An approach to identify global protein abundance changes involving the use of proteomics was used here. RESEARCH DESIGN AND METHODS Muscle biopsies were obtained basally from lean, obese, and type 2 diabetic volunteers (n = 8 each); glucose clamps were used to assess insulin sensitivity. Muscle protein was subjected to mass spectrometry–based quantification using normalized spectral abundance factors. RESULTS Of 1,218 proteins assigned, 400 were present in at least half of all subjects. Of these, 92 were altered by a factor of 2 in insulin resistance, and of those, 15 were significantly increased or decreased by ANOVA (P < 0.05). Analysis of protein sets revealed patterns of decreased abundance in mitochondrial proteins and altered abundance of proteins involved with cytoskeletal structure (desmin and alpha actinin-2 both decreased), chaperone function (TCP-1 subunits increased), and proteasome subunits (increased). CONCLUSIONS The results confirm the reduction in mitochondrial proteins in insulin-resistant muscle and suggest that changes in muscle structure, protein degradation, and folding also characterize insulin resistance. PMID:19833877
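The quantification step relies on normalized spectral abundance factors (NSAF). A minimal sketch of the usual definition follows: each protein's spectral count is divided by its length, and the result is divided by the sum of that ratio over all proteins in the sample. The protein names, counts, and lengths below are hypothetical.

```python
def nsaf(spectral_counts, lengths):
    """Compute normalized spectral abundance factors for one sample.

    spectral_counts: dict protein -> number of spectra assigned to it.
    lengths: dict protein -> protein length in amino acids.
    Returns dict protein -> NSAF; the values sum to 1 within the sample.
    """
    # Spectral abundance factor: count corrected for protein length.
    saf = {p: spectral_counts[p] / lengths[p] for p in spectral_counts}
    total = sum(saf.values())
    return {p: value / total for p, value in saf.items()}


# Hypothetical proteins: the longer one has more spectra but equal abundance.
values = nsaf({"PROT_1": 10, "PROT_2": 30}, {"PROT_1": 100, "PROT_2": 300})
```

Dividing by length keeps long proteins, which produce more peptides, from appearing more abundant than they are.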
Microarray-based cancer prediction using soft computing approach.
Wang, Xiaosheng; Gotoh, Osamu
2009-05-26
One of the difficulties in using gene expression profiles to predict cancer is how to effectively select a few informative genes to construct accurate prediction models from thousands or tens of thousands of genes. We screen highly discriminative genes and gene pairs to create simple prediction models involving single genes or gene pairs on the basis of a soft computing approach and rough set theory. Accurate cancer prediction is obtained when we apply the simple prediction models to four cancer gene expression datasets: CNS tumor, colon tumor, lung cancer and DLBCL. Some genes closely correlated with the pathogenesis of specific or general cancers are identified. In contrast with other models, our models are simple, effective and robust. Our models are also interpretable because they are based on decision rules. Our results demonstrate that very simple models may perform well on molecular cancer prediction and that important gene markers of cancer can be detected if the gene selection approach is chosen reasonably.
Parrott, Roxanne; Kahl, Mary L; Ndiaye, Khadidiatou; Traeder, Tara
2012-08-01
This research examined the lay public's beliefs about genes and health that might be labeled deterministic. The goals of this research were to sort through the divergent and contested meanings of genetic determinism in an effort to suggest directions for public health genomic communication. A survey conducted in community-based settings of 717 participants included 267 who self-reported race as African American and 450 who self-reported race as Caucasian American. The survey results revealed that the structure of genetic determinism included 2 belief sets. One set aligned with perceived threat, encompassing susceptibility and severity beliefs linked to genes and health. The other set represents beliefs about biological essentialism linked to the role of genes for health. These concepts were found to be modestly positively related. Threat beliefs predicted perceived control over genes. Public health efforts to communicate about genes and health should consider effects of these messages for (a) perceived threat relating to susceptibility and severity and (b) perceptions of disease essentialism. Perceived threat may enhance motivation to act in health protective ways, whereas disease essentialist beliefs may contribute to a loss of motivation associated with control over health.
Mathieson, Luke; Mendes, Alexandre; Marsden, John; Pond, Jeffrey; Moscato, Pablo
2017-01-01
This chapter introduces a new method for knowledge extraction from databases for the purpose of finding a discriminative set of features that is also a robust set for within-class classification. Our method is generic and we introduce it here in the field of breast cancer diagnosis from digital mammography data. The mathematical formalism is based on a generalization of the k-Feature Set problem called (α, β)-k-Feature Set problem, introduced by Cotta and Moscato (J Comput Syst Sci 67(4):686-690, 2003). This method proceeds in two steps: first, an optimal (α, β)-k-feature set of minimum cardinality is identified and then, a set of classification rules using these features is obtained. We obtain the (α, β)-k-feature set in two phases; first a series of extremely powerful reduction techniques, which do not lose the optimal solution, are employed; and second, a metaheuristic search to identify the remaining features to be considered or disregarded. Two algorithms were tested with a public domain digital mammography dataset composed of 71 malignant and 75 benign cases. Based on the results provided by the algorithms, we obtain classification rules that employ only a subset of these features.
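To make the (α, β)-k-Feature Set condition concrete, the sketch below checks whether a candidate feature subset qualifies. It assumes the usual reading of the problem: every pair of samples from different classes must differ on at least α selected features, and every pair from the same class must agree on at least β selected features. It is only a checker, not the reduction rules or metaheuristic search described in the abstract, and the toy data are hypothetical.

```python
from itertools import combinations


def is_alpha_beta_feature_set(samples, labels, features, alpha, beta):
    """Check the (alpha, beta) condition for a candidate feature subset.

    samples: list of equal-length feature vectors (discrete values).
    labels: class label for each sample.
    features: indices of the selected features.
    """
    for i, j in combinations(range(len(samples)), 2):
        differing = sum(1 for f in features if samples[i][f] != samples[j][f])
        agreeing = len(features) - differing
        if labels[i] != labels[j]:
            # Different classes: need at least alpha discriminating features.
            if differing < alpha:
                return False
        elif agreeing < beta:
            # Same class: need at least beta features with equal values.
            return False
    return True


# Hypothetical binary data: two malignant and two benign cases.
samples = [[1, 1, 0], [1, 1, 1], [0, 0, 1], [0, 0, 0]]
labels = ["malignant", "malignant", "benign", "benign"]
good = is_alpha_beta_feature_set(samples, labels, [0, 1], alpha=2, beta=2)
bad = is_alpha_beta_feature_set(samples, labels, [2], alpha=1, beta=0)
```

Features 0 and 1 separate the two classes and are consistent within each class, so they pass; feature 2 alone fails because some cases from different classes share its value.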
Non-Invasive Delivery of dsRNA into De-Waxed Tick Eggs by Electroporation
Ruiz, Newton; de Abreu, Leonardo Araujo; Parizi, Luís Fernando; Kim, Tae Kwon; Mulenga, Albert; Braz, Gloria Regina Cardoso; Vaz, Itabajara da Silva; Logullo, Carlos
2015-01-01
RNA interference-mediated gene silencing was shown to be an efficient tool for validation of targets that may become anti-tick vaccine components. Here, we demonstrate the application of this approach in the validation of components of molecular signaling cascades, such as the Protein Kinase B (AKT) / Glycogen Synthase Kinase (GSK) axis during tick embryogenesis. It was shown that heptane and hypochlorite treatment of tick eggs can remove wax, affecting corium integrity but not embryo development. Evidence of AKT and GSK dsRNA delivery into de-waxed eggs via electroporation is provided. Primers designed to amplify part of the dsRNA delivered into the electroporated eggs confirmed its entry into the eggs. In addition, it was shown that electroporation is able to deliver the fluorescent stain, 4',6-diamidino-2-phenylindole (DAPI). To confirm gene silencing, a second set of primers was designed outside the dsRNA sequence of the target gene. In this assay, the suppression of AKT and GSK transcripts (approximately 50% reduction in both genes) was demonstrated in 7-day-old eggs. Interestingly, silencing of GSK in 7-day-old eggs caused a 25% reduction in hatching. Additionally, the effect of silencing AKT and GSK on embryo energy metabolism was evaluated. As expected, knockdown of AKT, which down regulates GSK, the suppressor of glycogen synthesis, decreased glycogen content in electroporated eggs. These data demonstrate that electroporation of de-waxed R. microplus eggs could be used for gene silencing in tick embryos, and improve the knowledge about arthropod embryogenesis. PMID:26091260
Busch-Nentwich, Elisabeth; Söllner, Christian; Roehl, Henry; Nicolson, Teresa
2004-02-01
Over 30 genes responsible for human hereditary hearing loss have been identified during the last 10 years. The proteins encoded by these genes play roles in a diverse set of cellular functions ranging from transcriptional regulation to K(+) recycling. In a few cases, the genes are novel and do not give much insight into the cellular or molecular cause for the hearing loss. Among these poorly understood deafness genes is DFNA5. How the truncation of the encoded protein DFNA5 leads to an autosomal dominant form of hearing loss is not clear. In order to understand the biological role of Dfna5, we took a reverse genetic approach in zebrafish. Here we show that morpholino antisense nucleotide knock-down of dfna5 function in zebrafish leads to disorganization of the developing semicircular canals and reduction of pharyngeal cartilage. This phenotype closely resembles previously isolated zebrafish craniofacial mutants including the mutant jekyll. jekyll encodes Ugdh [uridine 5'-diphosphate (UDP)-glucose dehydrogenase], an enzyme that is crucial for production of the extracellular matrix component hyaluronic acid (HA). In dfna5 morphants, expression of ugdh is absent in the developing ear and pharyngeal arches, and HA levels are strongly reduced in the outgrowing protrusions of the developing semicircular canals. Previous studies suggest that HA is essential for differentiating cartilage and directed outgrowth of the epithelial protrusions in the developing ear. We hypothesize that the reduction of HA production leads to uncoordinated outgrowth of the canal columns and impaired facial cartilage differentiation.
Feather development genes and associated regulatory innovation predate the origin of Dinosauria.
Lowe, Craig B; Clarke, Julia A; Baker, Allan J; Haussler, David; Edwards, Scott V
2015-01-01
The evolution of avian feathers has recently been illuminated by fossils and the identification of genes involved in feather patterning and morphogenesis. However, molecular studies have focused mainly on protein-coding genes. Using comparative genomics and more than 600,000 conserved regulatory elements, we show that patterns of genome evolution in the vicinity of feather genes are consistent with a major role for regulatory innovation in the evolution of feathers. Rates of innovation at feather regulatory elements exhibit an extended period of innovation with peaks in the ancestors of amniotes and archosaurs. We estimate that 86% of such regulatory elements and 100% of the nonkeratin feather gene set were present prior to the origin of Dinosauria. On the branch leading to modern birds, we detect a strong signal of regulatory innovation near insulin-like growth factor binding protein (IGFBP) 2 and IGFBP5, which have roles in body size reduction, and may represent a genomic signature for the miniaturization of dinosaurian body size preceding the origin of flight. © The Author 2014. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
Kringel, Dario; Lippmann, Catharina; Parnham, Michael J; Kalso, Eija; Ultsch, Alfred; Lötsch, Jörn
2018-06-19
Human genetic research has implicated functional variants of more than one hundred genes in the modulation of persisting pain. Artificial intelligence and machine learning techniques may combine this knowledge with results of genetic research gathered in any context, which permits the identification of the key biological processes involved in chronic sensitization to pain. Based on published evidence, a set of 110 genes carrying variants reported to be associated with modulation of the clinical phenotype of persisting pain in eight different clinical settings was submitted to unsupervised machine-learning aimed at functional clustering. Subsequently, a mathematically supported subset of genes, comprising those most consistently involved in persisting pain, was analyzed by means of computational functional genomics in the Gene Ontology knowledgebase. Clustering of genes with evidence for a modulation of persisting pain elucidated a functionally heterogeneous set. The situation cleared when the focus was narrowed to a genetic modulation consistently observed throughout several clinical settings. On this basis, two groups of biological processes, the immune system and nitric oxide signaling, emerged as major players in sensitization to persisting pain, which is biologically highly plausible and in agreement with other lines of pain research. The present computational functional genomics-based approach provided a computational systems-biology perspective on chronic sensitization to pain. Human genetic control of persisting pain points to the immune system as a source of potential future targets for drugs directed against persisting pain. Contemporary machine-learned methods provide innovative approaches to knowledge discovery from previous evidence. This article is protected by copyright. All rights reserved.
Chen, Meng-Yun; Liang, Dan; Zhang, Peng
2015-11-01
Incongruence between different phylogenomic analyses is the main challenge faced by phylogeneticists in the genomic era. To reduce incongruence, phylogenomic studies normally adopt some data filtering approaches, such as reducing missing data or using slowly evolving genes, to improve the signal quality of data. Here, we assembled a phylogenomic data set of 58 jawed vertebrate taxa and 4682 genes to investigate the backbone phylogeny of jawed vertebrates under both concatenation and coalescent-based frameworks. To evaluate the efficiency of extracting phylogenetic signals among different data filtering methods, we chose six highly intractable internodes within the backbone phylogeny of jawed vertebrates as our test questions. We found that our phylogenomic data set exhibits substantial conflicting signal among genes for these questions. Our analyses showed that non-specific data sets that are generated without bias toward specific questions are not sufficient to produce consistent results when there are several difficult nodes within a phylogeny. Moreover, phylogenetic accuracy based on non-specific data is considerably influenced by the size of data and the choice of tree inference methods. To address such incongruences, we selected genes that resolve a given internode but not the entire phylogeny. Notably, not only can this strategy yield correct relationships for the question, but it also reduces inconsistency associated with data sizes and inference methods. Our study highlights the importance of gene selection in phylogenomic analyses, suggesting that simply using a large amount of data cannot guarantee correct results. Constructing question-specific data sets may be more powerful for resolving problematic nodes. © The Author(s) 2015. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
Nucleotide variability of protamine genes influencing bull sperm motility variables.
H M, Yathish; Kumar, Subodh; Chaudhary, Rajni; Mishra, Chinmoy; A, Sivakumar; Kumar, Amit; Chauhan, Anuj; Ghosh, S K; Mitra, Abhijit
2018-06-01
Protamines (PRMs), important proteins of chromatin condensation in spermiogenesis, are promising candidate genes to explore markers of sperm motility. The coding and in-silico predicted promoter regions of these genes were investigated in 102 crossbred and 32 purebred cattle. Also, mRNA quantification was done to explore its possibility as diagnostic tool of infertility. The PCR-SSCP analysis indicated there were two band patterns only in fragment I of the PRM1 and fragment II of the PRM2 gene. The sequence analysis revealed A152G and G179A transitions in the PRM1 gene. Similarly, G35A, A49G and A64G transitions were identified in the PRM2 gene which resulted in altered amino acid sequences from arginine (R) to glutamine (Q), from arginine (R) to glycine (G) and from arginine (R) to glycine (G), respectively. This caused the reduction in molecular weight of PRM2 from 2157.66 to 1931.33 Da due to reduction in the number of basic amino acids. These altered properties of the PRM2 protein led to the reduction in Mass Motility (MM: P < 0.01), Initial Progressive Motility (IPM; P < 0.05) and Post Thaw Motility (PTM; P < 0.05) in crossbred bulls. The least squares analysis of variance indicated there was an effect of PRM2 haplotypes on MM (P = 0.0069), IPM (P = 0.0306) and PTM (P = 0.0500) in crossbred cattle and on PTM (P = 0.0408) in the overall cattle population. Based on the RT-qPCR analysis, however, there was not any significant variation of PRM1 and PRM2 gene expression among sperm of Vrindavani bulls with relatively lesser and greater sperm motility. Copyright © 2018 Elsevier B.V. All rights reserved.
Maeno, Shintaro; Tanizawa, Yasuhiro; Kanesaki, Yu; Kubota, Eri; Kumar, Himanshu; Dicks, Leon; Salminen, Seppo; Nakagawa, Junichi; Arita, Masanori; Endo, Akihito
2016-12-01
Lactobacillus kunkeei is classified as a sole obligate fructophilic lactic acid bacterium that is found in fructose-rich niches, including the guts of honeybees. The species is differentiated from other lactobacilli based on its poor growth with glucose, enhanced growth in the presence of oxygen and other electron acceptors, and production of high concentrations of acetate from the metabolism of glucose. These characteristics are similar to phylogenetically distant Fructobacillus spp. In the present study, the genomic structure of L. kunkeei was characterized by using 16 different strains, and it had significantly fewer genes and smaller genomes when compared with other lactobacilli. Functional gene classification revealed that L. kunkeei had lost genes specifically involved in carbohydrate transport and metabolism. The species also lacked most of the genes for respiration, although growth was enhanced in the presence of oxygen. The adhE gene of L. kunkeei, encoding a bifunctional alcohol dehydrogenase (ADH)/aldehyde dehydrogenase (ALDH) protein, lacked the part encoding the ADH domain, which is reported here for the first time in lactic acid bacteria. The deletion resulted in the lack of ADH activity, implying a requirement for electron acceptors in glucose assimilation. These results clearly indicated that L. kunkeei had undergone a specific reductive evolution in order to adapt to fructose-rich environments. The reduction characteristics were similar to those of Fructobacillus spp., but distinct from other lactobacilli with small genomes, such as Lactobacillus gasseri and Lactobacillus vaginalis. Fructose-richness thus induced an environment-specific gene reduction in phylogenetically distant microorganisms. Copyright © 2016 Elsevier GmbH. All rights reserved.
Comparative Single-Cell Genomics of Chloroflexi from the Okinawa Trough Deep-Subsurface Biosphere.
Fullerton, Heather; Moyer, Craig L
2016-05-15
Chloroflexi small-subunit (SSU) rRNA gene sequences are frequently recovered from subseafloor environments, but the metabolic potential of the phylum is poorly understood. The phylum Chloroflexi is represented by isolates with diverse metabolic strategies, including anoxic phototrophy, fermentation, and reductive dehalogenation; therefore, function cannot be attributed to these organisms based solely on phylogeny. Single-cell genomics can provide metabolic insights into uncultured organisms, like the deep-subsurface Chloroflexi. Nine SSU rRNA gene sequences were identified from single-cell sorts of whole-round core material collected from the Okinawa Trough at Iheya North hydrothermal field as part of Integrated Ocean Drilling Program (IODP) expedition 331 (Deep Hot Biosphere). Previous studies of subsurface Chloroflexi single amplified genomes (SAGs) suggested heterotrophic or lithotrophic metabolisms and provided no evidence for growth by reductive dehalogenation. Our nine Chloroflexi SAGs (seven of which are from the order Anaerolineales) indicate that, in addition to genes for the Wood-Ljungdahl pathway, exogenous carbon sources can be actively transported into cells. At least one subunit for pyruvate ferredoxin oxidoreductase was found in four of the Chloroflexi SAGs. This protein can provide a link between the Wood-Ljungdahl pathway and other carbon anabolic pathways. Finally, one of the seven Anaerolineales SAGs contains a distinct reductive dehalogenase homologous (rdhA) gene. Through the use of single amplified genomes (SAGs), we have extended the metabolic potential of an understudied group of subsurface microbes, the Chloroflexi. These microbes are frequently detected in the subsurface biosphere, though their metabolic capabilities have remained elusive.
In contrast to previously examined Chloroflexi SAGs, our genomes (several are from the order Anaerolineales) were recovered from a hydrothermally driven system and therefore provide a unique window into the metabolic potential of this type of habitat. In addition, a reductive dehalogenase gene (rdhA) has been directly linked to marine subsurface Chloroflexi, suggesting that reductive dehalogenation is not limited to the class Dehalococcoidia. This discovery expands the nutrient-cycling and metabolic potential present within the deep subsurface and provides functional gene information relating to this enigmatic group. Copyright © 2016 Fullerton and Moyer.
Evaluating the consistency of gene sets used in the analysis of bacterial gene expression data.
Tintle, Nathan L; Sitarik, Alexandra; Boerema, Benjamin; Young, Kylie; Best, Aaron A; Dejongh, Matthew
2012-08-08
Statistical analyses of whole genome expression data require functional information about genes in order to yield meaningful biological conclusions. The Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) are common sources of functionally grouped gene sets. For bacteria, the SEED and MicrobesOnline provide alternative, complementary sources of gene sets. To date, no comprehensive evaluation of the data obtained from these resources has been performed. We define a series of gene set consistency metrics directly related to the most common classes of statistical analyses for gene expression data, and then perform a comprehensive analysis of 3581 Affymetrix® gene expression arrays across 17 diverse bacteria. We find that gene sets obtained from GO and KEGG demonstrate lower consistency than those obtained from the SEED and MicrobesOnline, regardless of gene set size. Despite the widespread use of GO and KEGG gene sets in bacterial gene expression data analysis, the SEED and MicrobesOnline provide more consistent sets for a wide variety of statistical analyses. Increased use of the SEED and MicrobesOnline gene sets in the analysis of bacterial gene expression data may improve statistical power and utility of expression data.
Immune pathways and defence mechanisms in honey bees Apis mellifera
Evans, J D; Aronstein, K; Chen, Y P; Hetru, C; Imler, J-L; Jiang, H; Kanost, M; Thompson, G J; Zou, Z; Hultmark, D
2006-01-01
Social insects are able to mount both group-level and individual defences against pathogens. Here we focus on individual defences, by presenting a genome-wide analysis of immunity in a social insect, the honey bee Apis mellifera. We present honey bee models for each of four signalling pathways associated with immunity, identifying plausible orthologues for nearly all predicted pathway members. When compared to the sequenced Drosophila and Anopheles genomes, honey bees possess roughly one-third as many genes in 17 gene families implicated in insect immunity. We suggest that an implied reduction in immune flexibility in bees reflects either the strength of social barriers to disease, or a tendency for bees to be attacked by a limited set of highly coevolved pathogens. PMID:17069638
Seok, Junhee; Davis, Ronald W.; Xiao, Wenzhong
2015-01-01
Accumulated biological knowledge is often encoded as gene sets, collections of genes associated with similar biological functions or pathways. The use of gene sets in the analyses of high-throughput gene expression data has been intensively studied and applied in clinical research. However, the main interest remains in finding modules of biological knowledge, or corresponding gene sets, significantly associated with disease conditions. Risk prediction from censored survival times using gene sets hasn’t been well studied. In this work, we propose a hybrid method that uses both single gene and gene set information together to predict patient survival risks from gene expression profiles. In the proposed method, gene sets provide context-level information that is poorly reflected by single genes. Complementarily, single genes help to supplement incomplete information of gene sets due to our imperfect biomedical knowledge. Through the tests over multiple data sets of cancer and trauma injury, the proposed method showed robust and improved performance compared with the conventional approaches with only single genes or gene sets solely. Additionally, we examined the prediction result in the trauma injury data, and showed that the modules of biological knowledge used in the prediction by the proposed method were highly interpretable in biology. A wide range of survival prediction problems in clinical genomics is expected to benefit from the use of biological knowledge. PMID:25933378
Xie, Jianping; He, Zhili; Liu, Xinxing; Liu, Xueduan; Van Nostrand, Joy D.; Deng, Ye; Wu, Liyou; Zhou, Jizhong; Qiu, Guanzhou
2011-01-01
Acid mine drainage (AMD) is an extreme environment, usually with low pH and high concentrations of metals. Although the phylogenetic diversity of AMD microbial communities has been examined extensively, little is known about their functional gene diversity and metabolic potential. In this study, a comprehensive functional gene array (GeoChip 2.0) was used to analyze the functional diversity, composition, structure, and metabolic potential of AMD microbial communities from three copper mines in China. GeoChip data indicated that these microbial communities were functionally diverse as measured by the number of genes detected, gene overlapping, unique genes, and various diversity indices. Almost all key functional gene categories targeted by GeoChip 2.0 were detected in the AMD microbial communities, including carbon fixation, carbon degradation, methane generation, nitrogen fixation, nitrification, denitrification, ammonification, nitrogen reduction, sulfur metabolism, metal resistance, and organic contaminant degradation, which suggested that the functional gene diversity was higher than was previously thought. Mantel test results indicated that AMD microbial communities are shaped largely by surrounding environmental factors (e.g., S, Mg, and Cu). Functional genes (e.g., narG and norB) and several key functional processes (e.g., methane generation, ammonification, denitrification, sulfite reduction, and organic contaminant degradation) were significantly (P < 0.10) correlated with environmental variables. This study presents an overview of functional gene diversity and the structure of AMD microbial communities and also provides insights into our understanding of metabolic potential in AMD ecosystems. PMID:21097602
Jiang, Jiyang; Thalamuthu, Anbupalam; Ho, Jennifer E.; Mahajan, Anubha; Ek, Weronica E.; Brown, David A.; Breit, Samuel N.; Wang, Thomas J.; Gyllensten, Ulf; Chen, Ming-Huei; Enroth, Stefan; Januzzi, James L.; Lind, Lars; Armstrong, Nicola J.; Kwok, John B.; Schofield, Peter R.; Wen, Wei; Trollor, Julian N.; Johansson, Åsa; Morris, Andrew P.; Vasan, Ramachandran S.; Sachdev, Perminder S.; Mather, Karen A.
2018-01-01
Blood levels of growth differentiation factor-15 (GDF-15), also known as macrophage inhibitory cytokine-1 (MIC-1), have been associated with various pathological processes and diseases, including cardiovascular disease and cancer. Prior studies suggest genetic factors play a role in regulating blood MIC-1/GDF-15 concentration. In the current study, we conducted the largest genome-wide association study (GWAS) to date using a sample of ∼5,400 community-based Caucasian participants, to determine the genetic variants associated with MIC-1/GDF-15 blood concentration. Conditional and joint (COJO), gene-based association, and gene-set enrichment analyses were also carried out to identify novel loci, genes, and pathways. Consistent with prior results, a locus on chromosome 19, which includes nine single nucleotide polymorphisms (SNPs) (top SNP, rs888663, p = 1.690 × 10⁻³⁵), was significantly associated with blood MIC-1/GDF-15 concentration, and explained 21.47% of its variance. COJO analysis showed evidence for two independent signals within this locus. Gene-based analysis confirmed the chromosome 19 locus association and, in addition, a putative locus on chromosome 1. Gene-set enrichment analyses showed that the “COPI-mediated anterograde transport” gene-set was associated with MIC-1/GDF-15 blood concentration with marginal significance after FDR correction (p = 0.067). In conclusion, a locus on chromosome 19 was associated with MIC-1/GDF-15 blood concentration with genome-wide significance, with evidence for a new locus (chromosome 1). Future studies using independent cohorts are needed to confirm the observed associations, especially for the chromosome 1 locus, and to further investigate and identify the causal SNPs that contribute to MIC-1/GDF-15 levels. PMID:29628937
Gene set analysis of purine and pyrimidine antimetabolites cancer therapies.
Fridley, Brooke L; Batzler, Anthony; Li, Liang; Li, Fang; Matimba, Alice; Jenkins, Gregory D; Ji, Yuan; Wang, Liewei; Weinshilboum, Richard M
2011-11-01
Responses to therapies, either with regard to toxicities or efficacy, are expected to involve complex relationships of gene products within the same molecular pathway or functional gene set. Therefore, pathways or gene sets, as opposed to single genes, may better reflect the true underlying biology and may be more appropriate units for analysis of pharmacogenomic studies. Application of such methods to pharmacogenomic studies may enable the detection of more subtle effects of multiple genes in the same pathway that may be missed by assessing each gene individually. A gene set analysis of 3821 gene sets is presented assessing the association between basal messenger RNA expression and drug cytotoxicity using ethnically defined human lymphoblastoid cell lines for two classes of drugs: pyrimidines [gemcitabine (dFdC) and arabinoside] and purines [6-thioguanine and 6-mercaptopurine]. The gene set nucleoside-diphosphatase activity was found to be significantly associated with both dFdC and arabinoside, whereas gene set γ-aminobutyric acid catabolic process was associated with dFdC and 6-thioguanine. These gene sets were significantly associated with the phenotype even after adjusting for multiple testing. In addition, five associated gene sets were found in common between the pyrimidines and two gene sets for the purines (3',5'-cyclic-AMP phosphodiesterase activity and γ-aminobutyric acid catabolic process) with a P value of less than 0.0001. Functional validation was attempted with four genes each in gene sets for thiopurine and pyrimidine antimetabolites. All four genes selected from the pyrimidine gene sets (PSME3, CANT1, ENTPD6, ADRM1) were validated, but only one (PDE4D) was validated for the thiopurine gene sets. In summary, results from the gene set analysis of pyrimidine and purine therapies, used often in the treatment of various cancers, provide novel insight into the relationship between genomic variation and drug response.
Lamba, Jatinder K; Crews, Kristine R; Pounds, Stanley B; Cao, Xueyuan; Gandhi, Varsha; Plunkett, William; Razzouk, Bassem I; Lamba, Vishal; Baker, Sharyn D; Raimondi, Susana C; Campana, Dario; Pui, Ching-Hon; Downing, James R; Rubnitz, Jeffrey E; Ribeiro, Raul C
2011-01-01
Aim: To identify gene-expression signatures predicting cytarabine response by an integrative analysis of multiple clinical and pharmacological end points in acute myeloid leukemia (AML) patients. Materials & methods: We performed an integrated analysis to associate the gene expression of diagnostic bone marrow blasts from AML patients treated in the discovery set (AML97; n = 42) and in the independent validation set (AML02; n = 46) with multiple clinical and pharmacological end points. Based on prior biological knowledge, we defined a gene to show a therapeutically beneficial (detrimental) pattern of association if its expression was positively (negatively) correlated with favorable phenotypes such as intracellular cytarabine 5´-triphosphate levels, morphological response and event-free survival, and negatively (positively) correlated with unfavorable end points such as post-cytarabine DNA synthesis levels, minimal residual disease and cytarabine LC50. Results: We identified 240 probe sets predicting a therapeutically beneficial pattern and 97 predicting a detrimental pattern (p ≤ 0.005) in the discovery set. Of these, 60 were confirmed in the independent validation set. The validated probe sets correspond to genes involved in PIK3/PTEN/AKT/mTOR signaling, G-protein-coupled receptor signaling and leukemogenesis. This suggests that targeting these pathways as potential pharmacogenomic and therapeutic candidates could be useful for improving treatment outcomes in AML. Conclusion: This study illustrates the power of integrated data analysis of genomic data as well as multiple clinical and pharmacologic end points in the identification of genes and pathways of biological relevance. PMID:21449673
Takisawa, Rihito; Nakazaki, Tetsuya; Nunome, Tsukasa; Fukuoka, Hiroyuki; Kataoka, Keiko; Saito, Hiroki; Habu, Tsuyoshi; Kitajima, Akira
2018-04-27
Parthenocarpy is a desired trait in tomato because it can overcome problems with fruit setting under unfavorable environmental conditions. A parthenocarpic tomato cultivar, 'MPK-1', with a parthenocarpic gene, Pat-k, exhibits stable parthenocarpy that produces few seeds. Because 'MPK-1' produces few seeds, seedlings are propagated inefficiently via cuttings. It was reported that Pat-k is located on chromosome 1. However, the gene had not been isolated and the relationship between the parthenocarpy and low seed set in 'MPK-1' remained unclear. In this study, we isolated Pat-k to clarify the relationship between parthenocarpy and low seed set in 'MPK-1'. Using quantitative trait locus (QTL) analysis for parthenocarpy and seed production, we detected a major QTL for each trait on nearly the same region of the Pat-k locus on chromosome 1. To isolate Pat-k, we performed fine mapping using an F4 population following the cross between a non-parthenocarpic cultivar, 'Micro-Tom', and 'MPK-1'. The results showed that Pat-k was located in the 529 kb interval between two markers, where 60 genes exist. By using data from whole genome re-sequencing and genome sequence analysis of 'MPK-1', we identified that the SlAGAMOUS-LIKE 6 (SlAGL6) gene of 'MPK-1' was mutated by a retrotransposon insertion. The transcript level of SlAGL6 was significantly lower in ovaries of 'MPK-1' than in those of a non-parthenocarpic cultivar. From these results, we conclude that Pat-k is SlAGL6, and its down-regulation in 'MPK-1' causes parthenocarpy and low seed set. In addition, we observed abnormal micropyles only in plants homozygous for the 'MPK-1' allele at the Pat-k/SlAGL6 locus. This result suggests that Pat-k/SlAGL6 is also related to ovule formation and that the low seed set in 'MPK-1' is likely caused by abnormal ovule formation through down-regulation of Pat-k/SlAGL6. Pat-k is identical to SlAGL6, and its down-regulation causes parthenocarpy and low seed set in 'MPK-1'.
Moreover, down-regulation of Pat-k/SlAGL6 could cause abnormal ovule formation, leading to a reduction in the number of seeds.
Kray, Jutta
2006-08-11
Adult age differences in task switching and advance preparation were examined by comparing cue-based and memory-based switching conditions. Task switching was assessed by determining two types of costs that occur at the general (mixing costs) and specific (switching costs) level of switching. Advance preparation was investigated by varying the time interval until the next task (short, middle, very long). Results indicated that the implementation of task sets was different for cue-based switching with random task sequences and memory-based switching with predictable task sequences. Switching costs were strongly reduced under cue-based switching conditions, indicating that task-set cues facilitate the retrieval of the next task. Age differences were found for mixing costs and for switching costs only under cue-based conditions in which older adults showed smaller switching costs than younger adults. It is suggested that older adults adopt a less extreme bias between two tasks than younger adults in situations associated with uncertainty. For cue-based switching with random task sequences, older adults are less engaged in a complete reconfiguration of task sets because of the probability of a further task change. Furthermore, the reduction of switching costs was more pronounced for cue- than memory-based switching for short preparation intervals, whereas the reduction of switch costs was more pronounced for memory- than cue-based switching for longer preparation intervals at least for older adults. Together these findings suggest that the implementation of task sets is functionally different for the two types of task-switching conditions.
2011-01-01
Background: Populations of Atlantic killifish (Fundulus heteroclitus) have evolved resistance to the embryotoxic effects of polychlorinated biphenyls (PCBs) and other halogenated and nonhalogenated aromatic hydrocarbons that act through an aryl hydrocarbon receptor (AHR)-dependent signaling pathway. The resistance is accompanied by reduced sensitivity to induction of cytochrome P450 1A (CYP1A), a widely used biomarker of aromatic hydrocarbon exposure and effect, but whether the reduced sensitivity is specific to CYP1A or reflects a genome-wide reduction in responsiveness to all AHR-mediated changes in gene expression is unknown. We compared gene expression profiles and the response to 3,3',4,4',5-pentachlorobiphenyl (PCB-126) exposure in embryos (5 and 10 dpf) and larvae (15 dpf) from F. heteroclitus populations inhabiting the New Bedford Harbor, Massachusetts (NBH) Superfund site (PCB-resistant) and a reference site, Scorton Creek, Massachusetts (SC; PCB-sensitive). Results: Analysis using a 7,000-gene cDNA array revealed striking differences in responsiveness to PCB-126 between the populations; the differences occur at all three stages examined. There was a sizeable set of PCB-responsive genes in the sensitive SC population, a much smaller set of PCB-responsive genes in NBH fish, and few similarities in PCB-responsive genes between the two populations. Most of the array results were confirmed, and additional PCB-regulated genes identified, by RNA-Seq (deep pyrosequencing). Conclusions: The results suggest that NBH fish possess a gene regulatory defect that is not specific to one target gene such as CYP1A but rather lies in a regulatory pathway that controls the transcriptional response of multiple genes to PCB exposure. The results are consistent with genome-wide disruption of AHR-dependent signaling in NBH fish. PMID:21609454
Research on Attribute Reduction in Hoisting Motor State Recognition of Quayside Container Crane
NASA Astrophysics Data System (ADS)
Li, F.; Tang, G.; Hu, X.
2017-07-01
Hoisting motor state recognition for quayside container cranes involves too many attributes. An attribute reduction method based on the discernibility matrix is introduced to reduce the hoisting motor state information table, and a method of attribute reduction based on the combination of rough sets and a genetic algorithm is proposed to deal with the hoisting motor state decision table. Redundant attributes are deleted under the condition that the information system's decision-making ability is unchanged, which reduces the complexity and computation of the hoisting motor recognition process and makes fast state recognition possible.
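The idea of deleting redundant attributes while keeping the decision-making ability unchanged can be illustrated with a minimal sketch. It assumes a tiny, made-up decision table (the attribute names `current`, `vibration`, `temperature` and the decision `state` are hypothetical, not taken from the paper) and uses an exhaustive subset search in place of the discernibility-matrix and genetic-algorithm methods the abstract describes.

```python
from itertools import combinations


def is_consistent(rows, attrs, decision):
    """True if rows that agree on `attrs` always share the same decision value."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in attrs)
        if seen.setdefault(key, row[decision]) != row[decision]:
            return False
    return True


def minimal_reducts(rows, condition_attrs, decision):
    """Return all smallest attribute subsets that keep the table consistent.

    Brute force over subsets by increasing size; only practical for a
    handful of attributes, which is why heuristic searches are used on
    real decision tables.
    """
    for size in range(1, len(condition_attrs) + 1):
        found = [set(subset)
                 for subset in combinations(condition_attrs, size)
                 if is_consistent(rows, subset, decision)]
        if found:
            return found
    return []  # the table is inconsistent even with every attribute


# Hypothetical toy decision table (illustrative values only).
TABLE = [
    {"current": "high", "vibration": "low",  "temperature": "normal", "state": "ok"},
    {"current": "high", "vibration": "high", "temperature": "normal", "state": "fault"},
    {"current": "low",  "vibration": "low",  "temperature": "hot",    "state": "ok"},
    {"current": "low",  "vibration": "high", "temperature": "hot",    "state": "fault"},
]

REDUCTS = minimal_reducts(TABLE, ["current", "vibration", "temperature"], "state")
```

In this toy table a single attribute already separates the two states, so the other two are redundant for the decision; a real table would typically need larger subsets and a smarter search.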
Thyroid hormone induction of human cholesterol 7 alpha-hydroxylase (Cyp7a1) in vitro.
Lammel Lindemann, Jan A; Angajala, Anusha; Engler, David A; Webb, Paul; Ayers, Stephen D
2014-05-05
Thyroid hormone (TH) modulates serum cholesterol by acting on TH receptor β1 (TRβ1) in liver to regulate metabolic gene sets. In rodents, one important TH regulated step involves induction of Cyp7a1, an enzyme in the cytochrome P450 family, which enhances cholesterol to bile acid conversion and plays a crucial role in regulation of serum cholesterol levels. Current models suggest, however, that Cyp7a1 has lost the capacity to respond to THs in humans. We were prompted to re-examine TH effects on cholesterol metabolic genes in human liver cells by a recent study of a synthetic TH mimetic which showed that serum cholesterol reductions were accompanied by increases in a marker for bile acid synthesis in humans. Here, we show that TH effects upon cholesterol metabolic genes are almost identical in mouse liver, mouse and human liver primary cells and human hepatocyte cell lines. Moreover, Cyp7a1 is a direct TR target gene that responds to physiologic TR levels through a set of distinct response elements in its promoter. These findings suggest that THs regulate cholesterol to bile acid conversion in similar ways in humans and rodent experimental models and that manipulation of hormone signaling pathways could provide a strategy to enhance Cyp7a1 activity in human patients. Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.
A new computational strategy for predicting essential genes.
Cheng, Jian; Wu, Wenwu; Zhang, Yinwen; Li, Xiangchen; Jiang, Xiaoqian; Wei, Gehong; Tao, Shiheng
2013-12-21
Determination of the minimum gene set for cellular life is one of the central goals in biology. Genome-wide essential gene identification has progressed rapidly in certain bacterial species; however, it remains difficult to achieve in most eukaryotic species. Several computational models have recently been developed to integrate gene features and used as alternatives to transfer gene essentiality annotations between organisms. We first collected features that were widely used by previous predictive models and assessed the relationships between gene features and gene essentiality using a stepwise regression model. We found two issues that could significantly reduce model accuracy: (i) the effect of multicollinearity among gene features and (ii) the diverse and even contrasting correlations between gene features and gene essentiality existing within and among different species. To address these issues, we developed a novel model called feature-based weighted Naïve Bayes model (FWM), which is based on Naïve Bayes classifiers, logistic regression, and genetic algorithm. The proposed model assesses features and filters out the effects of multicollinearity and diversity. The performance of FWM was compared with other popular models, such as support vector machine, Naïve Bayes model, and logistic regression model, by applying FWM to reciprocally predict essential genes among and within 21 species. Our results showed that FWM significantly improves the accuracy and robustness of essential gene prediction. FWM can remarkably improve the accuracy of essential gene prediction and may be used as an alternative method for other classification work. This method can contribute substantially to the knowledge of the minimum gene sets required for living organisms and the discovery of new drug targets.
Functional clustering of time series gene expression data by Granger causality
2012-01-01
Background: A common approach for time series gene expression data analysis includes the clustering of genes with similar expression patterns throughout time. Clustered gene expression profiles point to the joint contribution of groups of genes to a particular cellular process. However, since genes belong to intricate networks, other features, besides comparable expression patterns, should provide additional information for the identification of functionally similar genes. Results: In this study we perform gene clustering through the identification of Granger causality between and within sets of time series gene expression data. Granger causality is based on the idea that the cause of an event cannot come after its consequence. Conclusions: This kind of analysis can be used as a complementary approach for functional clustering, wherein genes would be clustered not solely based on their expression similarity but on their topological proximity built according to the intensity of Granger causality among them. PMID:23107425
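As a rough illustration of the Granger idea (past values of one series improving the prediction of another), the sketch below computes a pairwise F statistic with ordinary least squares. It is an assumption-laden simplification: the paper works with Granger causality between and within sets of genes, whereas this covers only two series, and the function names and simulated data are hypothetical.

```python
import numpy as np


def lagged(series, lags):
    """Matrix whose columns are series[t-1], ..., series[t-lags] for t >= lags."""
    n = len(series)
    return np.column_stack([series[lags - k: n - k] for k in range(1, lags + 1)])


def rss(design, target):
    """Residual sum of squares of an ordinary least-squares fit."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    return float(resid @ resid)


def granger_f(x, y, lags=1):
    """F statistic for 'past x helps predict y beyond past y alone'."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    target = y[lags:]
    ones = np.ones((len(target), 1))
    restricted = np.hstack([ones, lagged(y, lags)])   # intercept + past y
    full = np.hstack([restricted, lagged(x, lags)])   # ... + past x
    rss_r, rss_f = rss(restricted, target), rss(full, target)
    dof = len(target) - full.shape[1]
    return ((rss_r - rss_f) / lags) / (rss_f / dof)


# Simulated example: y follows x with a one-step delay, not the reverse.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = np.zeros(200)
for t in range(1, 200):
    y[t] = 0.8 * x[t - 1] + 0.1 * rng.normal()

f_x_to_y = granger_f(x, y)  # expected to be large
f_y_to_x = granger_f(y, x)  # expected to be small
```

A clustering along the lines the abstract describes could then use such statistics as edge weights between genes, though how the authors actually build that topology is not stated here.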
Willenborg, Christian J; Brûlé-Babel, Anita L; Van Acker, Rene C
2010-06-01
Transgenic wheat (Triticum aestivum L.) with improved agronomic traits is currently being field-tested. Gene flow in space is well-documented, but isolation in time has not received comparable attention. Here, we report the results of a field experiment that investigated reductions in intraspecific gene flow associated with temporal isolation of flowering between T. aestivum conspecifics. Pollen-mediated gene flow (PMGF) between an imazamox-resistant (IR) volunteer wheat population and a non-IR spring wheat crop was assessed over a range of volunteer emergence timings and plant population densities that collectively promoted flowering asynchrony. Natural hybridization events between the two populations were detected by phenotypically scoring plants in F(1) populations followed by verification with Mendelian segregation ratios in the F(1:2) lines. Based on the examination of >545,000 seedlings, we identified a hybridization window in spring wheat approximately 125 growing degree-days (GDD) in length. We found a sizeable reduction (two- to four-fold) in gene flow frequencies when flowering occurred outside of this window. The hybridization window identified in this research also will serve to temporally isolate neighboring wheat crops. However, strict control of volunteer populations or spatial isolation of neighbouring crops emerging within a 125 GDD hybridization window will be necessary to maintain low frequencies of PMGF in spring wheat fields. The model developed herein also is likely to be applicable to other wind-pollinated species.
NASA Astrophysics Data System (ADS)
Mahrooghy, Majid; Ashraf, Ahmed B.; Daye, Dania; Mies, Carolyn; Rosen, Mark; Feldman, Michael; Kontos, Despina
2014-03-01
We evaluate the prognostic value of sparse representation-based features by applying the K-SVD algorithm on multiparametric kinetic, textural, and morphologic features in breast dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI). K-SVD is an iterative dimensionality reduction method that optimally reduces the initial feature space by updating the dictionary columns jointly with the sparse representation coefficients. Therefore, by using K-SVD, we not only provide sparse representation of the features and condense the information in a few coefficients but also reduce the dimensionality. The extracted K-SVD features are evaluated by a machine learning algorithm including a logistic regression classifier for the task of classifying high versus low breast cancer recurrence risk as determined by a validated gene expression assay. The features are evaluated using ROC curve analysis and leave-one-out cross validation for different sparse representation and dimensionality reduction numbers. Optimal sparse representation is obtained when the number of dictionary elements is 4 (K=4) and maximum non-zero coefficients is 2 (L=2). We compare K-SVD with ANOVA-based feature selection for the same prognostic features. The ROC results show that the AUC of the K-SVD-based (K=4, L=2), the ANOVA-based, and the original features (i.e., no dimensionality reduction) are 0.78, 0.71, and 0.68, respectively. From the results, it can be inferred that by using sparse representation of the originally extracted multi-parametric, high-dimensional data, we can condense the information on a few coefficients with the highest predictive value. In addition, the dimensionality reduction introduced by K-SVD can prevent models from over-fitting.
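The evaluation scheme described above (score each sample with a model fitted on all other samples, then summarize the held-out scores with the area under the ROC curve) can be sketched as follows. This is an illustrative sketch only: a nearest-class-mean scorer on one hypothetical feature stands in for the logistic regression classifier, and the feature values and function names are assumptions, not data from the study.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def loo_scores(features, labels):
    """Leave-one-out scores from a nearest-class-mean rule (stand-in classifier)."""
    scores = []
    for i in range(len(features)):
        train = [(f, l) for j, (f, l) in enumerate(zip(features, labels)) if j != i]
        pos = [f for f, l in train if l == 1]
        neg = [f for f, l in train if l == 0]
        mean_pos = sum(pos) / len(pos)
        mean_neg = sum(neg) / len(neg)
        midpoint = (mean_pos + mean_neg) / 2
        direction = 1.0 if mean_pos >= mean_neg else -1.0
        # Positive score means the held-out sample lies on the high-risk side.
        scores.append(direction * (features[i] - midpoint))
    return scores

# Hypothetical one-dimensional feature; 1 = high recurrence risk, 0 = low.
features = [2.0, 2.5, 3.0, 0.5, 1.0, 1.2]
labels = [1, 1, 1, 0, 0, 0]
print(auc(loo_scores(features, labels), labels))
```

Because every score comes from a model that never saw the scored sample, the resulting AUC is less optimistic than one computed on the training data, which is the point of the leave-one-out design.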
GeneChip™ screening assay for cystic fibrosis mutations
DOE Office of Scientific and Technical Information (OSTI.GOV)
Cronn, M.T.; Miyada, C.G.; Fucini, R.V.
1994-09-01
GeneChip™ assays are based on high density, carefully designed arrays of short oligonucleotide probes (13-16 bases) built directly on derivatized silica substrates. DNA target sequence analysis is achieved by hybridizing fluorescently labeled amplification products to these arrays. Fluorescent hybridization signals located within the probe array are translated into target sequence information using the known probe sequence at each array feature. The mutation screening assay for cystic fibrosis includes sets of oligonucleotide probes designed to detect numerous different mutations that have been described in 14 exons and one intron of the CFTR gene. Each mutation site is addressed by a sub-array of at least 40 probe sequences, half designed to detect the wild type gene sequence and half designed to detect the reported mutant sequence. Hybridization with homozygous mutant, homozygous wild type or heterozygous targets results in distinctive hybridization patterns within a sub-array, permitting specific discrimination of each mutation. The GeneChip probe arrays are very small (approximately 1 cm²). Their miniature size coupled with their high information content makes GeneChip probe arrays a useful and practical means for providing CF mutation analysis in a clinical setting.
Thorup, Casper; Schramm, Andreas
2017-01-01
ABSTRACT This study demonstrates that the deltaproteobacterium Desulfurivibrio alkaliphilus can grow chemolithotrophically by coupling sulfide oxidation to the dissimilatory reduction of nitrate and nitrite to ammonium. Key genes of known sulfide oxidation pathways are absent from the genome of D. alkaliphilus. Instead, the genome contains all of the genes necessary for sulfate reduction, including a gene for a reductive-type dissimilatory bisulfite reductase (DSR). Despite this, growth by sulfate reduction was not observed. Transcriptomic analysis revealed a very high expression level of sulfate-reduction genes during growth by sulfide oxidation, while inhibition experiments with molybdate pointed to elemental sulfur/polysulfides as intermediates. Consequently, we propose that D. alkaliphilus initially oxidizes sulfide to elemental sulfur, which is then either disproportionated, or oxidized by a reversal of the sulfate reduction pathway. This is the first study providing evidence that a reductive-type DSR is involved in a sulfide oxidation pathway. Transcriptome sequencing further suggests that nitrate reduction to ammonium is performed by a novel type of periplasmic nitrate reductase and an unusual membrane-anchored nitrite reductase. PMID:28720728
Pesesky, Mitchell W; Hussain, Tahir; Wallace, Meghan; Patel, Sanket; Andleeb, Saadia; Burnham, Carey-Ann D; Dantas, Gautam
2016-01-01
The time-to-result for culture-based microorganism recovery and phenotypic antimicrobial susceptibility testing necessitates initial use of empiric (frequently broad-spectrum) antimicrobial therapy. If the empiric therapy is not optimal, this can lead to adverse patient outcomes and contribute to increasing antibiotic resistance in pathogens. New, more rapid technologies are emerging to meet this need. Many of these are based on identifying resistance genes, rather than directly assaying resistance phenotypes, and thus require interpretation to translate the genotype into treatment recommendations. These interpretations, like other parts of clinical diagnostic workflows, are likely to be increasingly automated in the future. We set out to evaluate the two major approaches that could be amenable to automation pipelines: rules-based methods and machine learning methods. The rules-based algorithm makes predictions based upon current, curated knowledge of Enterobacteriaceae resistance genes. The machine-learning algorithm predicts resistance and susceptibility based on a model built from a training set of variably resistant isolates. As our test set, we used whole genome sequence data from 78 clinical Enterobacteriaceae isolates, previously identified to represent a variety of phenotypes, from fully-susceptible to pan-resistant strains for the antibiotics tested. We tested three antibiotic resistance determinant databases for their utility in identifying the complete resistome for each isolate. The predictions of the rules-based and machine learning algorithms for these isolates were compared to results of phenotype-based diagnostics. The rules based and machine-learning predictions achieved agreement with standard-of-care phenotypic diagnostics of 89.0 and 90.3%, respectively, across twelve antibiotic agents from six major antibiotic classes. Several sources of disagreement between the algorithms were identified. 
Novel variants of known resistance factors and incomplete genome assembly confounded the rules-based algorithm, resulting in predictions based on gene family, rather than on knowledge of the specific variant found. Low-frequency resistance caused errors in the machine-learning algorithm because those genes were not seen or seen infrequently in the test set. We also identified an example of variability in the phenotype-based results that led to disagreement with both genotype-based methods. Genotype-based antimicrobial susceptibility testing shows great promise as a diagnostic tool, and we outline specific research goals to further refine this methodology.
Sierra, Beatriz; Triska, Petr; Soares, Pedro; Garcia, Gissel; Perez, Ana B; Aguirre, Eglys; Oliveira, Marisa; Cavadas, Bruno; Regnault, Béatrice; Alvarez, Mayling; Ruiz, Didye; Samuels, David C; Sakuntabhai, Anavaj; Pereira, Luisa; Guzman, Maria G
2017-02-01
Ethnic groups can display differential genetic susceptibility to infectious diseases. The arthropod-borne viral dengue disease is one such disease, with empirical and limited genetic evidence showing that African ancestry may be protective against the haemorrhagic phenotype. Global ancestry analysis based on high-throughput genotyping in admixed populations can be used to test this hypothesis, while admixture mapping can map candidate protective genes. A Cuban dengue fever cohort was genotyped using a 2.5 million SNP chip. Global ancestry was ascertained through ADMIXTURE and used in a fine-matched corrected association study, while local ancestry was inferred by the RFMix algorithm. The expression of candidate genes was evaluated by RT-PCR in a Cuban dengue patient cohort and gene set enrichment analysis was performed in a Thai dengue transcriptome. OSBPL10 and RXRA candidate genes were identified, with most significant SNPs placed in inferred weak enhancers, promoters and lncRNAs. OSBPL10 had significantly lower expression in Africans than Europeans, while for RXRA several SNPs may differentially regulate its transcription between Africans and Europeans. Their expression was confirmed to change through dengue disease progression in Cuban patients and to vary with disease severity in a Thai transcriptome dataset. These genes interact in the LXR/RXR activation pathway that integrates lipid metabolism and immune functions, being a key player in dengue virus entrance into cells, its replication therein and in cytokine production. Knockdown of OSBPL10 expression in THP-1 cells by two shRNAs followed by DENV2 infection tests led to a significant reduction in DENV replication, being a direct functional proof that the lower OSBPL10 expression profile in Africans protects this ancestry against dengue disease. PMID:28241052
Lemieux, Madeleine E.; Cheng, Ziming; Zhou, Qing; White, Ruth; Cornell, John; Kung, Andrew L.; Rebel, Vivienne I.
2011-01-01
Global expression analysis of fetal liver hematopoietic stem cells (FL HSCs) revealed the presence of unspliced pre-mRNA for a number of genes in normal FL HSCs. In a subset of these genes, Crebbp+/− FL HSCs had less unprocessed pre-mRNA without a corresponding reduction in total mRNA levels. Among the genes thus identified were the key regulators of HSC function Itga4, Msi2 and Tcf4. A similar but much weaker effect was apparent in Ep300+/− FL HSCs, indicating that, in this context as in others, the two paralogs are not interchangeable. As a group, the down-regulated intronic probe sets could discriminate adult HSCs from more mature cell types, suggesting that the underlying mechanism is regulated with differentiation stage and is active in both fetal and adult hematopoiesis. Consistent with increased myelopoiesis in Crebbp hemizygous mice, targeted reduction of CREBBP abundance by shRNA in the multipotent EML cell line triggered spontaneous myeloid differentiation in the absence of the normally required inductive signals. In addition, differences in protein levels between phenotypically distinct EML subpopulations were better predicted by taking into account not only the total mRNA signal but also the amount of unspliced message present. CREBBP thus appears to selectively influence the timing and degree of pre-mRNA processing of genes essential for HSC regulation and thereby has the potential to alter subsequent cell fate decisions in HSCs. PMID:21901164
Mallik, Saurav; Maulik, Ujjwal
2015-10-01
Gene ranking is an important problem in bioinformatics. Here, we propose a new framework for ranking biomolecules (viz., miRNAs, transcription-factors/TFs and genes) in a multi-informative uterine leiomyoma dataset having both gene expression and methylation data using (statistical) eigenvector centrality based approach. At first, genes that are both differentially expressed and methylated, are identified using Limma statistical test. A network, comprising these genes, corresponding TFs from TRANSFAC and ITFP databases, and targeter miRNAs from miRWalk database, is then built. The biomolecules are then ranked based on eigenvector centrality. Our proposed method provides better average accuracy in hub gene and non-hub gene classifications than other methods. Furthermore, pre-ranked Gene set enrichment analysis is applied on the pathway database as well as GO-term databases of Molecular Signatures Database with providing a pre-ranked gene-list based on different centrality values for comparing among the ranking methods. Finally, top novel potential gene-markers for the uterine leiomyoma are provided. Copyright © 2015 Elsevier Inc. All rights reserved.
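Ranking the nodes of a gene/TF/miRNA network by eigenvector centrality, as described above, can be sketched with a plain power iteration. The network below is a hypothetical toy graph, the node names are placeholders, and the iteration uses the shifted update x + Ax (an assumption made here so that the iteration also converges on bipartite graphs); it is not the authors' implementation.

```python
def eigenvector_centrality(adj, iters=200, tol=1e-10):
    """Power iteration on (A + I); adj maps each node to its list of neighbours."""
    nodes = sorted(adj)
    x = {n: 1.0 for n in nodes}
    for _ in range(iters):
        # One step of x <- x + A x, followed by L2 normalisation.
        new = {n: x[n] + sum(x[m] for m in adj[n]) for n in nodes}
        norm = sum(v * v for v in new.values()) ** 0.5
        new = {n: v / norm for n, v in new.items()}
        if max(abs(new[n] - x[n]) for n in nodes) < tol:
            return new
        x = new
    return x

# Hypothetical undirected network: one hub with three neighbours, one of which
# has a further neighbour.
network = {
    "hub": ["a", "b", "c"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub", "d"],
    "d": ["c"],
}
centrality = eigenvector_centrality(network)
ranked = sorted(centrality, key=lambda n: -centrality[n])
print(ranked)
```

A node scores highly when its neighbours score highly, so "c" outranks "a" and "b" even though all three touch the hub; this is the property that distinguishes eigenvector centrality from a simple degree count.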
DOE Office of Scientific and Technical Information (OSTI.GOV)
Boylan, Joan M.; Salomon, Arthur R.; Department of Chemistry, Brown University, Providence, RI
Protein phosphatase 6 (PP6) is a ubiquitous Ser/Thr phosphatase involved in an array of cellular processes. To assess the potential of PP6 as a therapeutic target in liver disorders, we attenuated expression of the PP6 catalytic subunit in HepG2 cells using lentiviral-transduced shRNA. Two PP6 knock-down (PP6KD) cell lines (90% reduction of PP6-C protein content) were studied in depth. Both proliferated at a rate similar to control cells. However, flow cytometry indicated G2/M cell cycle arrest that was accounted for by a shift of the cells from a diploid to tetraploid state. PP6KD cells did not show an increase in apoptosis, nor did they exhibit reduced viability in the presence of bleomycin or taxol. Gene expression analysis by microarray showed attenuated anti-inflammatory signaling. Genes associated with DNA replication were downregulated. Mass spectrometry-based phosphoproteomic analysis yielded 80 phosphopeptides representing 56 proteins that were significantly affected by a stable reduction in PP6-C. Proteins involved in DNA replication, DNA damage repair and pre-mRNA splicing were overrepresented among these. PP6KD cells showed intact mTOR signaling. Our studies demonstrated involvement of PP6 in a diverse set of biological pathways and an adaptive response that may limit the effectiveness of targeting PP6 in liver disorders. Highlights: • Lentiviral-transduced shRNA was used to generate a stable knockdown of PP6 in HepG2 cells. • Cells adapted to reduced PP6; cell proliferation was unaffected, and cell survival was normal. • However, PP6 knockdown was associated with a transition to a tetraploid state. • Genomic profiling showed downregulated anti-inflammatory signaling and DNA replication. • Phosphoproteomic profiling showed changes in proteins associated with DNA replication and repair.
2011-01-01
Background Copy number aberrations (CNAs) are an important molecular signature in cancer initiation, development, and progression. However, these aberrations span a wide range of chromosomes, making it hard to distinguish cancer related genes from other genes that are not closely related to cancer but are located in broadly aberrant regions. With the current availability of high-resolution data sets such as single nucleotide polymorphism (SNP) microarrays, it has become an important issue to develop a computational method to detect driving genes related to cancer development located in the focal regions of CNAs. Results In this study, we introduce a novel method referred to as the wavelet-based identification of focal genomic aberrations (WIFA). The use of the wavelet analysis, because it is a multi-resolution approach, makes it possible to effectively identify focal genomic aberrations in broadly aberrant regions. The proposed method integrates multiple cancer samples so that it enables the detection of the consistent aberrations across multiple samples. We then apply this method to glioblastoma multiforme and lung cancer data sets from the SNP microarray platform. Through this process, we confirm the ability to detect previously known cancer related genes from both cancer types with high accuracy. Also, the application of this approach to a lung cancer data set identifies focal amplification regions that contain known oncogenes, though these regions are not reported using a recent CNAs detecting algorithm GISTIC: SMAD7 (chr18q21.1) and FGF10 (chr5p12). Conclusions Our results suggest that WIFA can be used to reveal cancer related genes in various cancer data sets. PMID:21569311
de Moraes, Marcos H; Desai, Prerak; Porwollik, Steffen; Canals, Rocio; Perez, Daniel R; Chu, Weiping; McClelland, Michael; Teplitski, Max
2017-03-01
Human enteric pathogens, such as Salmonella spp. and verotoxigenic Escherichia coli, are increasingly recognized as causes of gastroenteritis outbreaks associated with the consumption of fruits and vegetables. Persistence in plants represents an important part of the life cycle of these pathogens. The identification of the full complement of Salmonella genes involved in the colonization of the model plant (tomato) was carried out using transposon insertion sequencing analysis. With this approach, 230,000 transposon insertions were screened in tomato pericarps to identify loci with reduction in fitness, followed by validation of the screen results using competition assays of the isogenic mutants against the wild type. A comparison with studies in animals revealed a distinct plant-associated set of genes, which only partially overlaps with the genes required to elicit disease in animals. De novo biosynthesis of amino acids was critical to persistence within tomatoes, while amino acid scavenging was prevalent in animal infections. Fitness reduction of the Salmonella amino acid synthesis mutants was generally more severe in the tomato rin mutant, which hyperaccumulates certain amino acids, suggesting that these nutrients remain unavailable to Salmonella spp. within plants. Salmonella lipopolysaccharide (LPS) was required for persistence in both animals and plants, exemplifying some shared pathogenesis-related mechanisms in animal and plant hosts. Similarly to phytopathogens, Salmonella spp. required biosynthesis of amino acids, LPS, and nucleotides to colonize tomatoes. Overall, however, it appears that while Salmonella shares some strategies with phytopathogens and taps into its animal virulence-related functions, colonization of tomatoes represents a distinct strategy, highlighting this pathogen's flexible metabolism.
IMPORTANCE Outbreaks of gastroenteritis caused by human pathogens have been increasingly associated with foods of plant origin, with tomatoes being one of the common culprits. Recent studies also suggest that these human pathogens can use plants as alternate hosts as a part of their life cycle. While dual (animal/plant) lifestyles of other members of the Enterobacteriaceae family are well known, the strategies with which Salmonella colonizes plants are only partially understood. Therefore, we undertook a high-throughput characterization of the functions required for Salmonella persistence within tomatoes. The results of this study were compared with what is known about genes required for Salmonella virulence in animals and interactions of plant pathogens with their hosts to determine whether Salmonella repurposes its virulence repertoire inside plants or whether it behaves more as a phytopathogen during plant colonization. Even though Salmonella utilized some of its virulence-related genes in tomatoes, plant colonization required a distinct set of functions. Copyright © 2017 American Society for Microbiology. PMID:28039131
Gene set analysis using variance component tests.
Huang, Yen-Tsung; Lin, Xihong
2013-06-28
Gene set analyses have become increasingly important in genomic research, as many complex diseases are contributed jointly by alterations of numerous genes. Genes often coordinate together as a functional repertoire, e.g., a biological pathway/network and are highly correlated. However, most of the existing gene set analysis methods do not fully account for the correlation among the genes. Here we propose to tackle this important feature of a gene set to improve statistical power in gene set analyses. We propose to model the effects of an independent variable, e.g., exposure/biological status (yes/no), on multiple gene expression values in a gene set using a multivariate linear regression model, where the correlation among the genes is explicitly modeled using a working covariance matrix. We develop TEGS (Test for the Effect of a Gene Set), a variance component test for the gene set effects by assuming a common distribution for regression coefficients in multivariate linear regression models, and calculate the p-values using permutation and a scaled chi-square approximation. We show using simulations that type I error is protected under different choices of working covariance matrices and power is improved as the working covariance approaches the true covariance. The global test is a special case of TEGS when correlation among genes in a gene set is ignored. Using both simulation data and a published diabetes dataset, we show that our test outperforms the commonly used approaches, the global test and gene set enrichment analysis (GSEA). We develop a gene set analyses method (TEGS) under the multivariate regression framework, which directly models the interdependence of the expression values in a gene set using a working covariance. TEGS outperforms two widely used methods, GSEA and global test in both simulation and a diabetes microarray data.
Maillard, Julien; Schumacher, Wolfram; Vazquez, Francisco; Regeard, Christophe; Hagen, Wilfred R.; Holliger, Christof
2003-01-01
The membrane-bound tetrachloroethene reductive dehalogenase (PCE-RDase) (PceA; EC 1.97.1.8), the terminal component of the respiratory chain of Dehalobacter restrictus, was purified 25-fold to apparent electrophoretic homogeneity. Sodium dodecyl sulfate-polyacrylamide gel electrophoresis revealed a single band with an apparent molecular mass of 60 ± 1 kDa, whereas the native molecular mass was 71 ± 8 kDa according to size exclusion chromatography in the presence of the detergent octyl-β-d-glucopyranoside. The monomeric enzyme contained (per mol of the 60-kDa subunit) 1.0 ± 0.1 mol of cobalamin, 0.6 ± 0.02 mol of cobalt, 7.1 ± 0.6 mol of iron, and 5.8 ± 0.5 mol of acid-labile sulfur. Purified PceA catalyzed the reductive dechlorination of tetrachloroethene and trichloroethene to cis-1,2-dichloroethene with a specific activity of 250 ± 12 nkat/mg of protein. In addition, several chloroethanes and tetrachloromethane caused methyl viologen oxidation in the presence of PceA. The Km values for tetrachloroethene, trichloroethene, and methyl viologen were 20.4 ± 3.2, 23.7 ± 5.2, and 47 ± 10 μM, respectively. The PceA exhibited the highest activity at pH 8.1 and was oxygen sensitive, with a half-life of activity of 280 min upon exposure to air. Based on the almost identical N-terminal amino acid sequences of PceA of Dehalobacter restrictus, Desulfitobacterium hafniense strain TCE1 (formerly Desulfitobacterium frappieri strain TCE1), and Desulfitobacterium hafniense strain PCE-S (formerly Desulfitobacterium frappieri strain PCE-S), the pceA genes of the first two organisms were cloned and sequenced. Together with the pceA genes of Desulfitobacterium hafniense strains PCE-S and Y51, the pceA genes of Desulfitobacterium hafniense strain TCE1 and Dehalobacter restrictus form a coherent group of reductive dehalogenases with almost 100% sequence identity. 
Also, the pceB genes, which may code for a membrane anchor protein of PceA, and the intergenic regions of Dehalobacter restrictus and the three desulfitobacteria had identical sequences. Whereas the cprB (chlorophenol reductive dehalogenase) genes of chlorophenol-dehalorespiring bacteria are always located upstream of cprA, all pceB genes known so far are located downstream of pceA. The possible consequences of this feature for the annotation of putative reductive dehalogenase genes are discussed, as are the sequence around the iron-sulfur cluster binding motifs and the type of iron-sulfur clusters of the reductive dehalogenases of Dehalobacter restrictus and Desulfitobacterium dehalogenans identified by electron paramagnetic resonance spectroscopy. PMID:12902251
GARNET--gene set analysis with exploration of annotation relations.
Rho, Kyoohyoung; Kim, Bumjin; Jang, Youngjun; Lee, Sanghyun; Bae, Taejeong; Seo, Jihae; Seo, Chaehwa; Lee, Jihyun; Kang, Hyunjung; Yu, Ungsik; Kim, Sunghoon; Lee, Sanghyuk; Kim, Wan Kyu
2011-02-15
Gene set analysis is a powerful method of deducing biological meaning for an a priori defined set of genes. Numerous tools have been developed to test statistical enrichment or depletion in specific pathways or gene ontology (GO) terms. Major difficulties towards biological interpretation are integrating diverse types of annotation categories and exploring the relationships between annotation terms of similar information. GARNET (Gene Annotation Relationship NEtwork Tools) is an integrative platform for gene set analysis with many novel features. It includes tools for retrieval of genes from annotation database, statistical analysis & visualization of annotation relationships, and managing gene sets. In an effort to allow access to a full spectrum of amassed biological knowledge, we have integrated a variety of annotation data that include the GO, domain, disease, drug, chromosomal location, and custom-defined annotations. Diverse types of molecular networks (pathways, transcription and microRNA regulations, protein-protein interaction) are also included. The pair-wise relationship between annotation gene sets was calculated using kappa statistics. GARNET consists of three modules--gene set manager, gene set analysis and gene set retrieval, which are tightly integrated to provide virtually automatic analysis for gene sets. A dedicated viewer for annotation network has been developed to facilitate exploration of the related annotations. GARNET (gene annotation relationship network tools) is an integrative platform for diverse types of gene set analysis, where complex relationships among gene annotations can be easily explored with an intuitive network visualization tool (http://garnet.isysbio.org/ or http://ercsb.ewha.ac.kr/garnet/).
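The pair-wise kappa statistic mentioned above measures how much two annotation gene sets agree beyond what their sizes alone would produce by chance. A minimal sketch of Cohen's kappa for two gene sets over a common gene universe follows; the gene identifiers and the function name are hypothetical, and the abstract does not say whether GARNET uses exactly this form.

```python
def cohen_kappa(set_a, set_b, universe):
    """Cohen's kappa for membership of each universe gene in set_a and set_b."""
    n = len(universe)
    a_in = set_a & universe
    b_in = set_b & universe
    both = len(a_in & b_in)
    only_a = len(a_in - b_in)
    only_b = len(b_in - a_in)
    neither = n - both - only_a - only_b
    observed = (both + neither) / n
    # Chance agreement: both "in" by chance plus both "out" by chance.
    expected = ((both + only_a) * (both + only_b)
                + (neither + only_b) * (neither + only_a)) / (n * n)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (observed - expected) / (1.0 - expected)

# Hypothetical universe of ten genes and two overlapping annotation sets.
universe = {f"g{i}" for i in range(1, 11)}
term_1 = {"g1", "g2", "g3", "g4"}
term_2 = {"g1", "g2", "g3", "g5"}
print(cohen_kappa(term_1, term_2, universe))
```

Values near 1 indicate near-identical annotation terms, values near 0 indicate overlap no better than chance, and negative values indicate terms that avoid each other; thresholding such values is one way to draw edges in an annotation relationship network.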
Sulfate Assimilation Mediates Tellurite Reduction and Toxicity in Saccharomyces cerevisiae
Ottosson, Lars-Göran; Logg, Katarina; Ibstedt, Sebastian; Sunnerhagen, Per; Käll, Mikael; Blomberg, Anders; Warringer, Jonas
2010-01-01
Despite a century of research and increasing environmental and human health concerns, the mechanistic basis of the toxicity of derivatives of the metalloid tellurium, Te, in particular the oxyanion tellurite, Te(IV), remains unsolved. Here, we provide an unbiased view of the mechanisms of tellurium metabolism in the yeast Saccharomyces cerevisiae by measuring deviations in Te-related traits of a complete collection of gene knockout mutants. Reduction of Te(IV) and intracellular accumulation as metallic tellurium strongly correlated with loss of cellular fitness, suggesting that Te(IV) reduction and toxicity are causally linked. The sulfate assimilation pathway upstream of Met17, in particular, the sulfite reductase and its cofactor siroheme, was shown to be central to tellurite toxicity and its reduction to elemental tellurium. Gene knockout mutants with altered Te(IV) tolerance also showed a similar deviation in tolerance to both selenite and, interestingly, selenomethionine, suggesting that the toxicity of these agents stems from a common mechanism. We also show that Te(IV) reduction and toxicity in yeast is partially mediated via a mitochondrial respiratory mechanism that does not encompass the generation of substantial oxidative stress. The results reported here represent a robust base from which to attack the mechanistic details of Te(IV) toxicity and reduction in a eukaryotic organism. PMID:20675578
Gene Ranking of RNA-Seq Data via Discriminant Non-Negative Matrix Factorization.
Jia, Zhilong; Zhang, Xiang; Guan, Naiyang; Bo, Xiaochen; Barnes, Michael R; Luo, Zhigang
2015-01-01
RNA-sequencing is rapidly becoming the method of choice for studying the full complexity of transcriptomes; however, with increasing dimensionality, accurate gene ranking is becoming increasingly challenging. This paper proposes an accurate and sensitive gene ranking method that implements discriminant non-negative matrix factorization (DNMF) for RNA-seq data. To the best of our knowledge, this is the first work to explore the utility of DNMF for gene ranking. When incorporating Fisher's discriminant criteria and setting the reduced dimension as two, DNMF learns two factors to approximate the original gene expression data, abstracting the up-regulated or down-regulated metagene by using the sample label information. The first factor denotes all the genes' weights of two metagenes as the additive combination of all genes, while the second learned factor represents the expression values of two metagenes. In the gene ranking stage, all the genes are ranked as a descending sequence according to the differential values of the metagene weights. Leveraging the nature of NMF and Fisher's criterion, DNMF can robustly boost the gene ranking performance. The Area Under the Curve analysis of differential expression analysis on two benchmarking tests of four RNA-seq data sets with similar phenotypes showed that our proposed DNMF-based gene ranking method outperforms other widely used methods. Moreover, the Gene Set Enrichment Analysis also showed that DNMF outperforms the others. DNMF is also computationally efficient, substantially outperforming all other benchmarked methods. Consequently, we suggest DNMF is an effective method for the analysis of differential gene expression and gene ranking for RNA-seq data.
Efficient experimental design for uncertainty reduction in gene regulatory networks.
Dehghannasiri, Roozbeh; Yoon, Byung-Jun; Dougherty, Edward R
2015-01-01
An accurate understanding of interactions among genes plays a major role in developing therapeutic intervention methods. Gene regulatory networks often contain a significant amount of uncertainty. The process of prioritizing biological experiments to reduce the uncertainty of gene regulatory networks is called experimental design. Under such a strategy, the experiments with high priority are suggested to be conducted first. The authors have already proposed an optimal experimental design method based upon the objective for modeling gene regulatory networks, such as deriving therapeutic interventions. The experimental design method utilizes the concept of mean objective cost of uncertainty (MOCU). MOCU quantifies the expected increase of cost resulting from uncertainty. The optimal experiment to be conducted first is the one which leads to the minimum expected remaining MOCU subsequent to the experiment. In the process, one must find the optimal intervention for every gene regulatory network compatible with the prior knowledge, which can be prohibitively expensive when the size of the network is large. In this paper, we propose a computationally efficient experimental design method. This method incorporates a network reduction scheme by introducing a novel cost function that takes into account the disruption in the ranking of potential experiments. We then estimate the approximate expected remaining MOCU at a lower computational cost using the reduced networks. Simulation results based on synthetic and real gene regulatory networks show that the proposed approximate method has close performance to that of the optimal method but at lower computational cost. The proposed approximate method also outperforms the random selection policy significantly. A MATLAB software implementing the proposed experimental design method is available at http://gsp.tamu.edu/Publications/supplementary/roozbeh15a/.
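The MOCU-based selection rule described above can be sketched for a small discrete uncertainty class. This is a minimal illustration under stated assumptions, not the authors' MATLAB implementation: the candidate networks, the intervention costs and the two experiments are invented toy values, and each experiment is assumed to reveal exactly which group of candidate networks contains the true one.

```python
def mocu(prior, cost):
    """Mean objective cost of uncertainty for a finite uncertainty class.

    prior: dict mapping network id -> (possibly unnormalized) probability.
    cost:  dict mapping network id -> list of costs, one per intervention.
    """
    total = sum(prior.values())
    n_interventions = len(next(iter(cost.values())))
    # Expected cost of each intervention across the uncertainty class.
    expected = [
        sum(prior[t] * cost[t][k] for t in prior) / total
        for k in range(n_interventions)
    ]
    # Robust intervention: best on average over all candidate networks.
    robust = min(range(n_interventions), key=expected.__getitem__)
    # Expected extra cost of using the robust intervention instead of the
    # intervention that is optimal for the true network.
    return sum(prior[t] / total * (cost[t][robust] - min(cost[t])) for t in prior)


def expected_remaining_mocu(prior, cost, partition):
    """Expected MOCU after an experiment whose outcome identifies one group."""
    total = sum(prior.values())
    remaining = 0.0
    for group in partition:
        sub_prior = {t: prior[t] for t in group}
        p_outcome = sum(sub_prior.values()) / total
        if p_outcome > 0:
            remaining += p_outcome * mocu(sub_prior, cost)
    return remaining


# Toy uncertainty class: two unknown binary parameters (a, b), uniform prior.
prior = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
cost = {                      # hypothetical costs of two interventions
    (0, 0): [0.2, 0.6],
    (0, 1): [0.3, 0.5],
    (1, 0): [0.7, 0.1],
    (1, 1): [0.8, 0.2],
}
experiments = {               # each experiment determines one parameter
    "A": [[(0, 0), (0, 1)], [(1, 0), (1, 1)]],
    "B": [[(0, 0), (1, 0)], [(0, 1), (1, 1)]],
}
remaining = {
    name: expected_remaining_mocu(prior, cost, part)
    for name, part in experiments.items()
}
best_experiment = min(remaining, key=remaining.get)
print(mocu(prior, cost), remaining, best_experiment)
```

In this toy case the experiment that determines parameter a removes all objective-relevant uncertainty, so it is ranked first even though both experiments resolve the same number of unknowns.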
Efficient experimental design for uncertainty reduction in gene regulatory networks
2015-01-01
Background An accurate understanding of interactions among genes plays a major role in developing therapeutic intervention methods. Gene regulatory networks often contain a significant amount of uncertainty. The process of prioritizing biological experiments to reduce the uncertainty of gene regulatory networks is called experimental design. Under such a strategy, the experiments with high priority are suggested to be conducted first. Results The authors have already proposed an optimal experimental design method based upon the objective for modeling gene regulatory networks, such as deriving therapeutic interventions. The experimental design method utilizes the concept of mean objective cost of uncertainty (MOCU). MOCU quantifies the expected increase of cost resulting from uncertainty. The optimal experiment to be conducted first is the one which leads to the minimum expected remaining MOCU subsequent to the experiment. In the process, one must find the optimal intervention for every gene regulatory network compatible with the prior knowledge, which can be prohibitively expensive when the size of the network is large. In this paper, we propose a computationally efficient experimental design method. This method incorporates a network reduction scheme by introducing a novel cost function that takes into account the disruption in the ranking of potential experiments. We then estimate the approximate expected remaining MOCU at a lower computational cost using the reduced networks. Conclusions Simulation results based on synthetic and real gene regulatory networks show that the proposed approximate method has close performance to that of the optimal method but at lower computational cost. The proposed approximate method also outperforms the random selection policy significantly. A MATLAB software implementing the proposed experimental design method is available at http://gsp.tamu.edu/Publications/supplementary/roozbeh15a/. PMID:26423515
2012-01-01
Background Fever is one of the most common adverse events of vaccines. The detailed mechanisms of fever and vaccine-associated gene interaction networks are not fully understood. In the present study, we employed a genome-wide, Centrality and Ontology-based Network Discovery using Literature data (CONDL) approach to analyse the genes and gene interaction networks associated with fever or vaccine-related fever responses. Results Over 170,000 fever-related articles from PubMed abstracts and titles were retrieved and analysed at the sentence level using natural language processing techniques to identify genes and vaccines (including 186 Vaccine Ontology terms) as well as their interactions. This resulted in a generic fever network consisting of 403 genes and 577 gene interactions. A vaccine-specific fever sub-network consisting of 29 genes and 28 gene interactions was extracted from articles that are related to both fever and vaccines. In addition, gene-vaccine interactions were identified. Vaccines (including 4 specific vaccine names) were found to directly interact with 26 genes. Gene set enrichment analysis was performed using the genes in the generated interaction networks. Moreover, the genes in these networks were prioritized using network centrality metrics. Making scientific discoveries and generating new hypotheses were possible by using network centrality and gene set enrichment analyses. For example, our study found that the genes in the generic fever network were more enriched in cell death and responses to wounding, and the vaccine sub-network had more gene enrichment in leukocyte activation and phosphorylation regulation. The most central genes in the vaccine-specific fever network are predicted to be highly relevant to vaccine-induced fever, whereas genes that are central only in the generic fever network are likely to be highly relevant to generic fever responses. 
Interestingly, no Toll-like receptors (TLRs) were found in the gene-vaccine interaction network. Since multiple TLRs were found in the generic fever network, it is reasonable to hypothesize that vaccine-TLR interactions may play an important role in inducing fever response, which deserves a further investigation. Conclusions This study demonstrated that ontology-based literature mining is a powerful method for analyzing gene interaction networks and generating new scientific hypotheses. PMID:23256563
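Prioritizing genes by network centrality, as described in this abstract, can be sketched as follows. The abstract does not state which centrality metrics were used, so degree centrality is assumed here as the simplest choice; the gene names and interactions are placeholders, not data from the study.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Degree centrality (degree / (n - 1)) for an undirected interaction list."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:                      # ignore self-interactions
            neighbors[u].add(v)
            neighbors[v].add(u)
    n = len(neighbors)
    if n < 2:
        return {gene: 0.0 for gene in neighbors}
    return {gene: len(nb) / (n - 1) for gene, nb in neighbors.items()}


# Hypothetical literature-mined gene-gene interactions.
interactions = [
    ("GENE_A", "GENE_B"),
    ("GENE_A", "GENE_C"),
    ("GENE_A", "GENE_D"),
    ("GENE_B", "GENE_C"),
]
centrality = degree_centrality(interactions)
# Rank genes from most to least central; ties are broken alphabetically.
ranked = sorted(centrality.items(), key=lambda kv: (-kv[1], kv[0]))
for gene, score in ranked:
    print(gene, round(score, 3))
```

Other metrics such as betweenness or closeness centrality would slot into the same ranking step.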
Prior knowledge based mining functional modules from Yeast PPI networks with gene ontology
2010-01-01
Background In the literature, there are many fruitful algorithmic approaches for identifying functional modules in protein-protein interaction (PPI) networks. Because of the accumulation of large-scale interaction data on multiple organisms and the interactions not yet recorded in existing PPI databases, there is still an urgent need to design novel computational techniques that can correctly and scalably analyze interaction data sets. Indeed, there are a number of large-scale biological data sets providing indirect evidence for protein-protein interaction relationships. Results The main aim of this paper is to present a prior knowledge based mining strategy to identify functional modules from PPI networks with the aid of Gene Ontology. A higher similarity value in Gene Ontology means that two gene products are more functionally related to each other, so it is better to group such gene products into one functional module. We study (i) encoding the functional pairs into the existing PPI networks; and (ii) using these functional pairs as pairwise constraints to supervise the existing functional module identification algorithms. A topology-based modularity metric and complex annotation in MIPS are used to evaluate the functional modules identified by these two approaches. Conclusions The experimental results on Yeast PPI networks and GO have shown that the prior knowledge based learning methods perform better than the existing algorithms. PMID:21172053
Li, Qing-Lan; Lei, Pin-Ji; Zhao, Quan-Yi; Li, Lianyun; Wei, Gang; Wu, Min
2017-08-01
Epigenetic marks are critical regulators of chromatin and gene activity. Their roles in normal physiology and disease states, including cancer development, still remain elusive. Herein, the epigenomic change of H3K9me3, as well as its potential impacts on gene activity and genome stability, was investigated in an in vitro breast cancer transformation model. The global H3K9me3 level was studied with western blotting. The distribution of H3K9me3 on chromatin and gene expression was studied with ChIP-Seq and RNA-Seq, respectively. The global H3K9me3 level decreases during transformation and its distribution on chromatin is reprogrammed. By combining with TCGA data, we identified 67 candidate oncogenes, among which five genes are totally novel. Our analysis further links H3K9me3 with transposon activity, and suggests H3K9me3 reduction increases the cell's sensitivity to DNA damage reagents. H3K9me3 reduction is possibly related with breast cancer transformation by regulating gene expression and chromatin stability during transformation.
Duan, Qiaonan; Flynn, Corey; Niepel, Mario; Hafner, Marc; Muhlich, Jeremy L; Fernandez, Nicolas F; Rouillard, Andrew D; Tan, Christopher M; Chen, Edward Y; Golub, Todd R; Sorger, Peter K; Subramanian, Aravind; Ma'ayan, Avi
2014-07-01
For the Library of Integrated Network-based Cellular Signatures (LINCS) project many gene expression signatures using the L1000 technology have been produced. The L1000 technology is a cost-effective method to profile gene expression in large scale. LINCS Canvas Browser (LCB) is an interactive HTML5 web-based software application that facilitates querying, browsing and interrogating many of the currently available LINCS L1000 data. LCB implements two compacted layered canvases, one to visualize clustered L1000 expression data, and the other to display enrichment analysis results using 30 different gene set libraries. Clicking on an experimental condition highlights gene-sets enriched for the differentially expressed genes from the selected experiment. A search interface allows users to input gene lists and query them against over 100 000 conditions to find the top matching experiments. The tool integrates many resources for an unprecedented potential for new discoveries in systems biology and systems pharmacology. The LCB application is available at http://www.maayanlab.net/LINCS/LCB. Customized versions will be made part of the http://lincscloud.org and http://lincs.hms.harvard.edu websites. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
EST Express: PHP/MySQL based automated annotation of ESTs from expression libraries
Smith, Robin P; Buchser, William J; Lemmon, Marcus B; Pardinas, Jose R; Bixby, John L; Lemmon, Vance P
2008-01-01
Background Several biological techniques result in the acquisition of functional sets of cDNAs that must be sequenced and analyzed. The emergence of redundant databases such as UniGene and centralized annotation engines such as Entrez Gene has allowed the development of software that can analyze a great number of sequences in a matter of seconds. Results We have developed "EST Express", a suite of analytical tools that identify and annotate ESTs originating from specific mRNA populations. The software consists of a user-friendly GUI powered by PHP and MySQL that allows for online collaboration between researchers and continuity with UniGene, Entrez Gene and RefSeq. Two key features of the software include a novel, simplified Entrez Gene parser and tools to manage cDNA library sequencing projects. We have tested the software on a large data set (2,016 samples) produced by subtractive hybridization. Conclusion EST Express is an open-source, cross-platform web server application that imports sequences from cDNA libraries, such as those generated through subtractive hybridization or yeast two-hybrid screens. It then provides several layers of annotation based on Entrez Gene and RefSeq to allow the user to highlight useful genes and manage cDNA library projects. PMID:18402700
EST Express: PHP/MySQL based automated annotation of ESTs from expression libraries.
Smith, Robin P; Buchser, William J; Lemmon, Marcus B; Pardinas, Jose R; Bixby, John L; Lemmon, Vance P
2008-04-10
Several biological techniques result in the acquisition of functional sets of cDNAs that must be sequenced and analyzed. The emergence of redundant databases such as UniGene and centralized annotation engines such as Entrez Gene has allowed the development of software that can analyze a great number of sequences in a matter of seconds. We have developed "EST Express", a suite of analytical tools that identify and annotate ESTs originating from specific mRNA populations. The software consists of a user-friendly GUI powered by PHP and MySQL that allows for online collaboration between researchers and continuity with UniGene, Entrez Gene and RefSeq. Two key features of the software include a novel, simplified Entrez Gene parser and tools to manage cDNA library sequencing projects. We have tested the software on a large data set (2,016 samples) produced by subtractive hybridization. EST Express is an open-source, cross-platform web server application that imports sequences from cDNA libraries, such as those generated through subtractive hybridization or yeast two-hybrid screens. It then provides several layers of annotation based on Entrez Gene and RefSeq to allow the user to highlight useful genes and manage cDNA library projects.
NASA Astrophysics Data System (ADS)
Marble, Jay A.; Gorman, John D.
1999-08-01
A feature-based approach is taken to reduce the occurrence of false alarms in foliage-penetrating, ultra-wideband, synthetic aperture radar data. A set of 'generic' features is defined based on target size, shape, and pixel intensity. A second set of features is defined that contains generic features combined with features based on scattering phenomenology. Each set is combined using a quadratic polynomial discriminant (QPD), and performance is characterized by generating a receiver operating characteristic (ROC) curve. Results show that the feature set containing phenomenological features improves performance against both broadside and end-on targets. The improvement against end-on targets, however, is especially pronounced.
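Characterizing a discriminant by its ROC curve can be illustrated with a short sketch. The scores and labels below are invented toy values standing in for discriminant outputs and target/false-alarm ground truth; they are not radar data, and the function names are hypothetical.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points of the ROC curve.

    scores: discriminant outputs, higher means more target-like.
    labels: 1 for a true target, 0 for a false alarm (clutter).
    """
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Process all detections that share the same score together.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) ROC points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
auc = area_under_curve(curve)
print(curve)
print(round(auc, 4))
```

Comparing two feature sets then amounts to comparing their curves, or their areas, computed on the same detections.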
A systematic analysis of genomic changes in Tg2576 mice.
Tan, Lu; Wang, Xiong; Ni, Zhong-Fei; Zhu, Xiuming; Wu, Wei; Zhu, Ling-Qiang; Liu, Dan
2013-06-01
Alzheimer's disease (AD) is an age-related neurodegenerative disorder characterized by intelligence decline, behavioral disorders and cognitive disability. The purpose of this study was to investigate gene expression in AD, based on published microarray data on Tg2576 mice. Hierarchical Cluster Analysis and Gene Ontology were employed to group genes together on the basis of their product characteristics and annotation data. Genes with prominent alterations were clustered into apoptosis and axon guidance pathways. Based on our findings and those of previous studies, we propose that the mitochondria-mediated apoptotic pathway plays a crucial role in the neuronal loss and synaptic dysfunction associated with AD. Furthermore, based on the findings of Positional Gene Enrichment analysis and Gene Set Enrichment analysis, we propose that the regulation of transcription of AD genes may be an important pathogenic factor in this neurodegenerative disease. Our results highlight the importance of genes that could subsequently be examined for their potential as prognostic markers for AD.
Geng, Haijiang; Li, Zhihui; Li, Jiabing; Lu, Tao; Yan, Fangrong
2015-01-01
BACKGROUND Personalized cancer treatments depend on the determination of a patient's genetic status according to known genetic profiles for which targeted treatments exist. Such genetic profiles must be scientifically validated before they are applied to the general patient population. Reproducibility of findings that support such genetic profiles is a fundamental challenge in validation studies. The percentage of overlapping genes (POG) criterion and derivative methods produce unstable and misleading results. Furthermore, in a complex disease, comparisons between different tumor subtypes can produce high POG scores that do not capture the consistencies in the functions. RESULTS We focused on the quality rather than the quantity of the overlapping genes. We defined the rank value of each gene according to importance or quality by PageRank on the basis of a particular topological structure. Then, we used the p-value of the rank-sum of the overlapping genes (PRSOG) to evaluate the quality of reproducibility. Though the POG scores were low in different studies of the same disease, the PRSOG was statistically significant, which suggests that sets of differentially expressed genes might be highly reproducible. CONCLUSIONS Evaluations of eight datasets from breast cancer, lung cancer and four other disorders indicate that the quality-based PRSOG method performs better than a quantity-based method. Our analysis of the components of the sets of overlapping genes supports the utility of the PRSOG method. PMID:26556852
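The contrast between a quantity-based score (POG) and a quality-based score (a p-value for the rank-sum of the overlapping genes) can be sketched as follows. Several points are assumptions made only for illustration: gene ranks are taken as given (1 = most important) rather than computed by PageRank, POG is measured relative to the first list, and the null distribution is obtained by drawing random gene subsets of the same size, which may differ from the null model used in the paper.

```python
import random


def pog(list_a, list_b):
    """Percentage of overlapping genes, relative to the first list."""
    a, b = set(list_a), set(list_b)
    return len(a & b) / len(a)


def rank_sum_p_value(overlap, rank, n_perm=2000, seed=0):
    """Permutation p-value that a random gene subset has a rank-sum as small
    as the overlapping genes (small ranks = important genes)."""
    rng = random.Random(seed)
    genes = sorted(rank)
    observed = sum(rank[g] for g in overlap)
    k = len(overlap)
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(genes, k)
        if sum(rank[g] for g in sample) <= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Hypothetical ranks for 50 genes; g1 is the most important gene.
rank = {f"g{i}": i for i in range(1, 51)}
study_a = [f"g{i}" for i in range(1, 21)]
study_b = [f"g{i}" for i in range(1, 6)] + [f"g{i}" for i in range(31, 46)]

overlap = sorted(set(study_a) & set(study_b), key=rank.get)
pog_score = pog(study_a, study_b)
p_value = rank_sum_p_value(overlap, rank)
print(overlap, pog_score, p_value)
```

In this toy case only a quarter of the genes overlap, yet the overlapping genes are the five top-ranked ones, so the rank-sum p-value is small even though the POG score is low.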
Clique-based data mining for related genes in a biomedical database.
Matsunaga, Tsutomu; Yonemori, Chikara; Tomita, Etsuji; Muramatsu, Masaaki
2009-07-01
Progress in the life sciences cannot be made without integrating biomedical knowledge on numerous genes in order to help formulate hypotheses on the genetic mechanisms behind various biological phenomena, including diseases. There is thus a strong need for a way to automatically and comprehensively search from biomedical databases for related genes, such as genes in the same families and genes encoding components of the same pathways. Here we address the extraction of related genes by searching for densely-connected subgraphs, which are modeled as cliques, in a biomedical relational graph. We constructed a graph whose nodes were gene or disease pages, and edges were the hyperlink connections between those pages in the Online Mendelian Inheritance in Man (OMIM) database. We obtained over 20,000 sets of related genes (called 'gene modules') by enumerating cliques computationally. The modules included genes in the same family, genes for proteins that form a complex, and genes for components of the same signaling pathway. The results of experiments using 'metabolic syndrome'-related gene modules show that the gene modules can be used to get a coherent holistic picture helpful for interpreting relations among genes. We presented a data mining approach extracting related genes by enumerating cliques. The extracted gene sets provide a holistic picture useful for comprehending complex disease mechanisms.
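Enumerating cliques in a relational graph, as described above, can be sketched with the classic Bron-Kerbosch procedure with pivoting. This is a stand-in technique chosen for illustration; the abstract does not name the enumeration algorithm the authors used, and the node names below are placeholders rather than OMIM pages.

```python
def maximal_cliques(adjacency):
    """Enumerate all maximal cliques of an undirected graph.

    adjacency: dict mapping each node to the set of its neighbors.
    """
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(frozenset(current))   # nothing can extend it
            return
        # Pivot on the node with the most neighbors among the candidates
        # to cut down the number of recursive branches.
        pivot = max(candidates | excluded,
                    key=lambda v: len(adjacency[v] & candidates))
        for v in list(candidates - adjacency[pivot]):
            expand(current | {v},
                   candidates & adjacency[v],
                   excluded & adjacency[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adjacency), set())
    return cliques


def build_adjacency(edges):
    """Build a symmetric adjacency map from a list of undirected links."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


# Hypothetical hyperlink graph between gene or disease pages.
links = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]
modules = maximal_cliques(build_adjacency(links))
print(sorted(sorted(m) for m in modules))
```

Each maximal clique corresponds to one candidate module of mutually linked pages.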
Functional cohesion of gene sets determined by latent semantic indexing of PubMed abstracts.
Xu, Lijing; Furlotte, Nicholas; Lin, Yunyue; Heinrich, Kevin; Berry, Michael W; George, Ebenezer O; Homayouni, Ramin
2011-04-14
High-throughput genomic technologies enable researchers to identify genes that are co-regulated with respect to specific experimental conditions. Numerous statistical approaches have been developed to identify differentially expressed genes. Because each approach can produce distinct gene sets, it is difficult for biologists to determine which statistical approach yields biologically relevant gene sets and is appropriate for their study. To address this issue, we implemented Latent Semantic Indexing (LSI) to determine the functional coherence of gene sets. An LSI model was built using over 1 million Medline abstracts for over 20,000 mouse and human genes annotated in Entrez Gene. The gene-to-gene LSI-derived similarities were used to calculate a literature cohesion p-value (LPv) for a given gene set using a Fisher's exact test. We tested this method against genes in more than 6,000 functional pathways annotated in Gene Ontology (GO) and found that approximately 75% of gene sets in GO biological process category and 90% of the gene sets in GO molecular function and cellular component categories were functionally cohesive (LPv<0.05). These results indicate that the LPv methodology is both robust and accurate. Application of this method to previously published microarray datasets demonstrated that LPv can be helpful in selecting the appropriate feature extraction methods. To enable real-time calculation of LPv for mouse or human gene sets, we developed a web tool called Gene-set Cohesion Analysis Tool (GCAT). GCAT can complement other gene set enrichment approaches by determining the overall functional cohesion of data sets, taking into account both explicit and implicit gene interactions reported in the biomedical literature. GCAT is freely available at http://binf1.memphis.edu/gcat.
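A Fisher's exact test of the kind mentioned above can be sketched from the hypergeometric distribution. How the 2x2 table is built from LSI similarities is not specified in the abstract; the framing below (gene pairs inside the set versus background pairs, split by whether their similarity exceeds a threshold) and the counts are hypothetical.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns the probability of observing a count of at least `a` in the
    top-left cell when row and column totals are held fixed.
    """
    row1 = a + b
    col1 = a + c
    n = a + b + c + d
    upper = min(row1, col1)
    total = comb(n, row1)
    tail = sum(
        comb(col1, k) * comb(n - col1, row1 - k)
        for k in range(a, upper + 1)
    )
    return tail / total


# Hypothetical counts:
#                      similar pairs   dissimilar pairs
# pairs within set           3                 1
# background pairs           1                 3
p = fisher_exact_greater(3, 1, 1, 3)
print(round(p, 6))
```

A small p-value would indicate that gene pairs inside the set are similar in the literature more often than background pairs.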
Tufto, Jarle
2010-01-01
Domesticated species frequently spread their genes into populations of wild relatives through interbreeding. The domestication process often involves artificial selection for economically desirable traits. This can lead to an indirect response in unknown correlated traits and a reduction in fitness of domesticated individuals in the wild. Previous models for the effect of gene flow from domesticated species to wild relatives have assumed that evolution occurs in one dimension. Here, I develop a quantitative genetic model for the balance between migration and multivariate stabilizing selection. Different forms of correlational selection consistent with a given observed ratio between average fitness of domesticated and wild individuals offset the phenotypic means at migration-selection balance away from predictions based on simpler one-dimensional models. For almost all parameter values, correlational selection leads to a reduction in the migration load. For ridge selection, this reduction arises because the distance by which the immigrants deviate from the local optimum is in effect reduced. For realistic parameter values, however, the effect of correlational selection on the load is small, suggesting that simpler one-dimensional models may still be adequate in terms of predicting mean population fitness and viability.
Prediction of gene expression in embryonic structures of Drosophila melanogaster.
Samsonova, Anastasia A; Niranjan, Mahesan; Russell, Steven; Brazma, Alvis
2007-07-01
Understanding how sets of genes are coordinately regulated in space and time to generate the diversity of cell types that characterise complex metazoans is a major challenge in modern biology. The use of high-throughput approaches, such as large-scale in situ hybridisation and genome-wide expression profiling via DNA microarrays, is beginning to provide insights into the complexities of development. However, in many organisms the collection and annotation of comprehensive in situ localisation data is a difficult and time-consuming task. Here, we present a widely applicable computational approach, integrating developmental time-course microarray data with annotated in situ hybridisation studies, that facilitates the de novo prediction of tissue-specific expression for genes that have no in vivo gene expression localisation data available. Using a classification approach, trained with data from microarray and in situ hybridisation studies of gene expression during Drosophila embryonic development, we made a set of predictions on the tissue-specific expression of Drosophila genes that have not been systematically characterised by in situ hybridisation experiments. The reliability of our predictions is confirmed by literature-derived annotations in FlyBase, by overrepresentation of Gene Ontology biological process annotations, and, in a selected set, by detailed gene-specific studies from the literature. Our novel organism-independent method will be of considerable utility in enriching the annotation of gene function and expression in complex multicellular organisms.
Prediction of Gene Expression in Embryonic Structures of Drosophila melanogaster
Samsonova, Anastasia A; Niranjan, Mahesan; Russell, Steven; Brazma, Alvis
2007-01-01
Understanding how sets of genes are coordinately regulated in space and time to generate the diversity of cell types that characterise complex metazoans is a major challenge in modern biology. The use of high-throughput approaches, such as large-scale in situ hybridisation and genome-wide expression profiling via DNA microarrays, is beginning to provide insights into the complexities of development. However, in many organisms the collection and annotation of comprehensive in situ localisation data is a difficult and time-consuming task. Here, we present a widely applicable computational approach, integrating developmental time-course microarray data with annotated in situ hybridisation studies, that facilitates the de novo prediction of tissue-specific expression for genes that have no in vivo gene expression localisation data available. Using a classification approach, trained with data from microarray and in situ hybridisation studies of gene expression during Drosophila embryonic development, we made a set of predictions on the tissue-specific expression of Drosophila genes that have not been systematically characterised by in situ hybridisation experiments. The reliability of our predictions is confirmed by literature-derived annotations in FlyBase, by overrepresentation of Gene Ontology biological process annotations, and, in a selected set, by detailed gene-specific studies from the literature. Our novel organism-independent method will be of considerable utility in enriching the annotation of gene function and expression in complex multicellular organisms. PMID:17658945
Norton, Rhy; Austin, Cindy; Mitchell, Amber; Zank, Sara; Durham, Paul
2015-01-01
Increased utilization of inorganic silver as an adjunct to many medical devices has raised concerns of emergent silver resistance in clinical bacteria. Although the molecular basis for silver resistance has been previously characterized, to date, significant phenotypic expression of these genes in clinical settings is yet to be observed. Here, we identified the first strains of clinical bacteria expressing silver resistance at a level that could significantly impact wound care and the use of silver-based dressings. Screening of 859 clinical isolates confirmed that 31 harbored at least 1 silver resistance gene. Despite the presence of these genes, MIC testing revealed that most of the bacteria displayed little or no increase in resistance to ionic silver (200 to 300 μM Ag+). However, 2 isolates (Klebsiella pneumoniae and Enterobacter cloacae) were capable of robust growth at exceedingly high silver concentrations, with MIC values reaching 5,500 μM Ag+. DNA sequencing of these two strains revealed the presence of genes homologous to known genetic determinants of heavy metal resistance. Darkening of the bacteria's pigment was observed after exposure to high silver concentrations. Scanning electron microscopy images showed the presence of silver nanoparticles embedded in the extracellular polymeric substance of both isolates. This finding suggested that the isolates may neutralize ionic silver via reduction to elemental silver. Antimicrobial testing revealed both organisms to be completely resistant to many commercially available silver-impregnated burn and wound dressings. Taken together, these findings provide the first evidence of clinical bacteria capable of expressing silver resistance at levels that could significantly impact wound management. PMID:26014954
The transcriptional stress response of Candida albicans to weak organic acids.
Cottier, Fabien; Tan, Alrina Shin Min; Chen, Jinmiao; Lum, Josephine; Zolezzi, Francesca; Poidinger, Michael; Pavelka, Norman
2015-01-29
Candida albicans is the most important fungal pathogen of humans, causing severe infections, especially in nosocomial and immunocompromised settings. However, it is also the most prevalent fungus of the normal human microbiome, where it shares its habitat with hundreds of trillions of other microbial cells. Despite weak organic acids (WOAs) being among the most abundant metabolites produced by bacterial microbiota, little is known about their effect on C. albicans. Here we used a sequencing-based profiling strategy to systematically investigate the transcriptional stress response of C. albicans to lactic, acetic, propionic, and butyric acid at several time points after treatment. Our data reveal a complex transcriptional response, with individual WOAs triggering unique gene expression profiles and with important differences between acute and chronic exposure. Despite these dissimilarities, we found significant overlaps between the gene expression changes induced by each WOA, which led us to uncover a core transcriptional response that was largely unrelated to other previously published C. albicans transcriptional stress responses. Genes commonly up-regulated by WOAs were enriched in several iron transporters, which was associated with an overall decrease in intracellular iron concentrations. Moreover, chronic exposure to any WOA led to down-regulation of RNA synthesis and ribosome biogenesis genes, which resulted in significant reduction of total RNA levels and of ribosomal RNA in particular. In conclusion, this study suggests that gastrointestinal microbiota might directly influence C. albicans physiology via production of WOAs, with possible implications of how this fungus interacts with its host in both health and disease. Copyright © 2015 Cottier et al.
The Transcriptional Stress Response of Candida albicans to Weak Organic Acids
Cottier, Fabien; Tan, Alrina Shin Min; Chen, Jinmiao; Lum, Josephine; Zolezzi, Francesca; Poidinger, Michael; Pavelka, Norman
2015-01-01
Candida albicans is the most important fungal pathogen of humans, causing severe infections, especially in nosocomial and immunocompromised settings. However, it is also the most prevalent fungus of the normal human microbiome, where it shares its habitat with hundreds of trillions of other microbial cells. Despite weak organic acids (WOAs) being among the most abundant metabolites produced by bacterial microbiota, little is known about their effect on C. albicans. Here we used a sequencing-based profiling strategy to systematically investigate the transcriptional stress response of C. albicans to lactic, acetic, propionic, and butyric acid at several time points after treatment. Our data reveal a complex transcriptional response, with individual WOAs triggering unique gene expression profiles and with important differences between acute and chronic exposure. Despite these dissimilarities, we found significant overlaps between the gene expression changes induced by each WOA, which led us to uncover a core transcriptional response that was largely unrelated to other previously published C. albicans transcriptional stress responses. Genes commonly up-regulated by WOAs were enriched in several iron transporters, which was associated with an overall decrease in intracellular iron concentrations. Moreover, chronic exposure to any WOA led to down-regulation of RNA synthesis and ribosome biogenesis genes, which resulted in significant reduction of total RNA levels and of ribosomal RNA in particular. In conclusion, this study suggests that gastrointestinal microbiota might directly influence C. albicans physiology via production of WOAs, with possible implications of how this fungus interacts with its host in both health and disease. PMID:25636313
The Ad5 [E1-, E2b-]-based vector: a new and versatile gene delivery platform
NASA Astrophysics Data System (ADS)
Jones, Frank R.; Gabitzsch, Elizabeth S.; Balint, Joseph P.
2015-05-01
Based upon advances in gene sequencing and construction, it is now possible to identify specific genes or sequences thereof for gene delivery applications. Recombinant adenovirus serotype-5 (Ad5) viral vectors have been utilized in the settings of gene therapy, vaccination, and immunotherapy but have encountered clinical challenges because they are recognized as foreign entities to the host. This recognition leads to an immunologic clearance of the vector that contains the inserted gene of interest and prevents effective immunization(s). We have reported on a new Ad5-based viral vector technology that can be utilized as an immunization modality to induce immune responses even in the presence of Ad5 vector immunity. We have reported successful immunization and immunotherapy results to infectious diseases and cancers. This improved recombinant viral platform (Ad5 [E1-, E2b-]) can now be utilized in the development of multiple vaccines and immunotherapies.
Chang, Chia-Ming; Chuang, Chi-Mu; Wang, Mong-Lien; Yang, Yi-Ping; Chuang, Jen-Hua; Yang, Ming-Jie; Yen, Ming-Shyen; Chiou, Shih-Hwa; Chang, Cheng-Chang
2016-01-01
Clear cell (CCC), endometrioid (EC), mucinous (MC) and high-grade serous carcinoma (SC) are the four most common subtypes of epithelial ovarian carcinoma (EOC). The widely accepted dualistic model of ovarian carcinogenesis divided EOCs into type I and II categories based on the molecular features. However, this hypothesis has not been experimentally demonstrated. We carried out a gene set-based analysis by integrating the microarray gene expression profiles downloaded from the publicly available databases. These quantified biological functions of EOCs were defined by 1454 Gene Ontology (GO) term and 674 Reactome pathway gene sets. The pathogenesis of the four EOC subtypes was investigated by hierarchical clustering and exploratory factor analysis. The patterns of functional regulation among the four subtypes containing 1316 cases could be accurately classified by machine learning. The results revealed that the ERBB and PI3K-related pathways played important roles in the carcinogenesis of CCC, EC and MC; while deregulation of cell cycle was more predominant in SC. The study revealed that two different functional regulation patterns exist among the four EOC subtypes, which were compatible with the type I and II classifications proposed by the dualistic model of ovarian carcinogenesis. PMID:27527159
Interactions in the microbiome: communities of organisms and communities of genes
Boon, Eva; Meehan, Conor J; Whidden, Chris; Wong, Dennis H-J; Langille, Morgan GI; Beiko, Robert G
2014-01-01
A central challenge in microbial community ecology is the delineation of appropriate units of biodiversity, which can be taxonomic, phylogenetic, or functional in nature. The term ‘community’ is applied ambiguously; in some cases, the term refers simply to a set of observed entities, while in other cases, it requires that these entities interact with one another. Microorganisms can rapidly gain and lose genes, potentially decoupling community roles from taxonomic and phylogenetic groupings. Trait-based approaches offer a useful alternative, but many traits can be defined based on gene functions, metabolic modules, and genomic properties, and the optimal set of traits to choose is often not obvious. An analysis that considers taxon assignment and traits in concert may be ideal, with the strengths of each approach offsetting the weaknesses of the other. Individual genes also merit consideration as entities in an ecological analysis, with characteristics such as diversity, turnover, and interactions modeled using genes rather than organisms as entities. We identify some promising avenues of research that are likely to yield a deeper understanding of microbial communities that shift from observation-based questions of ‘Who is there?’ and ‘What are they doing?’ to the mechanistically driven question of ‘How will they respond?’ PMID:23909933
Prediction of epigenetically regulated genes in breast cancer cell lines.
Loss, Leandro A; Sadanandam, Anguraj; Durinck, Steffen; Nautiyal, Shivani; Flaucher, Diane; Carlton, Victoria E H; Moorhead, Martin; Lu, Yontao; Gray, Joe W; Faham, Malek; Spellman, Paul; Parvin, Bahram
2010-06-04
Methylation of CpG islands within the DNA promoter regions is one mechanism that leads to aberrant gene expression in cancer. In particular, the abnormal methylation of CpG islands may silence associated genes. Therefore, using high-throughput microarrays to measure CpG island methylation will lead to better understanding of tumor pathobiology and progression, while revealing potentially new biomarkers. We have examined a recently developed high-throughput technology for measuring genome-wide methylation patterns called mTACL. Here, we propose a computational pipeline for integrating gene expression and CpG island methylation profiles to identify epigenetically regulated genes for a panel of 45 breast cancer cell lines, which is widely used in the Integrative Cancer Biology Program (ICBP). The pipeline (i) reduces the dimensionality of the methylation data, (ii) associates the reduced methylation data with gene expression data, and (iii) ranks methylation-expression associations according to their epigenetic regulation. Dimensionality reduction is performed in two steps: (i) methylation sites are grouped across the genome to identify regions of interest, and (ii) methylation profiles are clustered within each region. Associations between the clustered methylation and the gene expression data sets generate candidate matches within a fixed neighborhood around each gene. Finally, the methylation-expression associations are ranked through a logistic regression, and their significance is quantified through permutation analysis. Our two-step dimensionality reduction compressed 90% of the original data, reducing 137,688 methylation sites to 14,505 clusters. Methylation-expression associations produced 18,312 correspondences, which were used to further analyze epigenetic regulation. 
Logistic regression was used to identify 58 genes from these correspondences that showed a statistically significant negative correlation between methylation profiles and gene expression in the panel of breast cancer cell lines. Subnetwork enrichment of these genes has identified 35 common regulators with 6 or more predicted markers. In addition to identifying epigenetically regulated genes, we show evidence of differentially expressed methylation patterns between the basal and luminal subtypes. Our results indicate that the proposed computational protocol is a viable platform for identifying epigenetically regulated genes. Our protocol has generated a list of predictors including COL1A2, TOP2A, TFF1, and VAV3, genes whose key roles in epigenetic regulation are documented in the literature. Subnetwork enrichment of these predicted markers further suggests that epigenetic regulation of individual genes occurs in a coordinated fashion and through common regulators.
2012-01-01
Background: Because of the large volume of data and the intrinsic variation of data intensity observed in microarray experiments, different statistical methods have been used to systematically extract biological information and to quantify the associated uncertainty. The simplest method to identify differentially expressed genes is to evaluate the ratio of average intensities in two different conditions and consider all genes that differ by more than an arbitrary cut-off value to be differentially expressed. This filtering approach is not a statistical test and there is no associated value that can indicate the level of confidence in the designation of genes as differentially expressed or not differentially expressed. At the same time, the fold change by itself provides valuable information, and it is important to find unambiguous ways of using this information in expression data treatment. Results: A new method of finding differentially expressed genes, called the distributional fold change (DFC) test, is introduced. The method is based on an analysis of the intensity distribution of all microarray probe sets mapped to a three dimensional feature space composed of average expression level, average difference of gene expression and total variance. The proposed method allows one to rank each feature based on the signal-to-noise ratio and to ascertain for each feature the confidence level and power for being differentially expressed. The performance of the new method was evaluated using the total and partial area under receiver operating curves and tested on 11 data sets from Gene Omnibus Database with independently verified differentially expressed genes and compared with the t-test and shrinkage t-test. Overall the DFC test performed the best – on average it had higher sensitivity and partial AUC and its elevation was most prominent in the low range of differentially expressed features, typical for formalin-fixed paraffin-embedded sample sets.
Conclusions: The distributional fold change test is an effective method for finding and ranking differentially expressed probesets on microarrays. The application of this test is advantageous to data sets using formalin-fixed paraffin-embedded samples or other systems where degradation effects diminish the applicability of correlation adjusted methods to the whole feature set. PMID:23122055
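The simple ratio-based filter described in the Background of the preceding abstract can be sketched as follows. This is only an illustration of the arbitrary cut-off approach that the abstract contrasts with the DFC test; it is not the DFC test itself. The function name, the two-fold default cut-off, and the example intensities are assumptions made for illustration.

```python
from statistics import mean

def fold_change_filter(cond_a, cond_b, cutoff=2.0):
    """Select genes whose ratio of average intensities passes a cut-off.

    cond_a, cond_b: dict mapping gene name -> list of positive intensities.
    cutoff: arbitrary fold-change threshold (hypothetical default of 2).
    Returns a dict of gene -> ratio for genes changed in either direction.
    """
    selected = {}
    for gene in cond_a:
        ratio = mean(cond_b[gene]) / mean(cond_a[gene])
        if ratio >= cutoff or ratio <= 1.0 / cutoff:
            selected[gene] = ratio
    return selected

# Made-up intensities, for illustration only.
control = {"g1": [100.0, 120.0, 110.0], "g2": [50.0, 55.0, 45.0], "g3": [200.0, 210.0, 190.0]}
treated = {"g1": [330.0, 350.0, 310.0], "g2": [52.0, 48.0, 50.0], "g3": [60.0, 70.0, 50.0]}
hits = fold_change_filter(control, treated)
```

As the abstract points out, a filter of this kind returns a list of genes but no confidence level for any of them, which is the gap the DFC test is meant to close.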
Arkas: Rapid reproducible RNAseq analysis
Colombo, Anthony R.; J. Triche Jr, Timothy; Ramsingh, Giridharan
2017-01-01
The recently introduced Kallisto pseudoaligner has radically simplified the quantification of transcripts in RNA-sequencing experiments. We offer the cloud-scale RNAseq pipelines Arkas-Quantification and Arkas-Analysis, available within Illumina’s BaseSpace cloud application platform, which expedite Kallisto preparatory routines, reliably calculate differential expression, and perform gene-set enrichment of REACTOME pathways. Due to inherent inefficiencies of scale, Illumina's BaseSpace computing platform offers a massively parallel distributive environment improving data management services and data importing. Arkas-Quantification deploys Kallisto for parallel cloud computations and is conveniently integrated downstream from the BaseSpace Sequence Read Archive (SRA) import/conversion application titled SRA Import. Arkas-Analysis annotates the Kallisto results by extracting structured information directly from source FASTA files with per-contig metadata, and performs differential expression and gene-set enrichment analysis on both coding genes and transcripts. The Arkas cloud pipeline supports ENSEMBL transcriptomes and can be used downstream from the SRA Import facilitating raw sequencing importing, SRA FASTQ conversion, RNA quantification and analysis steps. PMID:28868134
Nomoto, R; Kagawa, H; Yoshida, T
2008-01-01
The aim was to investigate the difference between Lancefield group C Streptococcus dysgalactiae (GCSD) strains isolated from diseased fish and animals by sequencing and phylogenetic analysis of the sodA gene. The sodA genes of Strep. dysgalactiae strains isolated from fish and animals were amplified and their nucleotide sequences were determined. Although 100% sequence identity was observed among fish GCSD strains, the sequences determined from animal isolates differed from the fish isolate sequences. Thus, all fish GCSD strains were clearly separated from the GCSD strains of other origin by phylogenetic tree analysis. In addition, an original primer set was designed, based on the determined sequences, to specifically amplify the sodA gene of fish GCSD strains. The primer set yielded amplification products from fish GCSD strains only. By sequencing analysis of the sodA gene, the genetic divergence between Strep. dysgalactiae strains isolated from fish and mammals was demonstrated. Moreover, an original oligonucleotide primer set, which could simply detect the genotype of fish GCSD strains, was designed. This study shows that Strep. dysgalactiae isolated from diseased fish could be distinguished from conventional GCSD strains by the difference in the sequence of the sodA gene.
Mandelker, Diana; Schmidt, Ryan J; Ankala, Arunkanth; McDonald Gibson, Kristin; Bowser, Mark; Sharma, Himanshu; Duffy, Elizabeth; Hegde, Madhuri; Santani, Avni; Lebo, Matthew; Funke, Birgit
2016-12-01
Next-generation sequencing (NGS) is now routinely used to interrogate large sets of genes in a diagnostic setting. Regions of high sequence homology continue to be a major challenge for short-read technologies and can lead to false-positive and false-negative diagnostic errors. At the scale of whole-exome sequencing (WES), laboratories may be limited in their knowledge of genes and regions that pose technical hurdles due to high homology. We have created an exome-wide resource that catalogs highly homologous regions and is tailored toward diagnostic applications. This resource was developed using a mappability-based approach tailored to current Sanger and NGS protocols. Gene-level and exon-level lists delineate regions that are difficult or impossible to analyze via standard NGS. These regions are ranked by degree of affectedness, annotated for medical relevance, and classified by the type of homology (within-gene, different functional gene, known pseudogene, uncharacterized noncoding region). Additionally, we provide a list of exons that cannot be analyzed by short-amplicon Sanger sequencing. This resource can help guide clinical test design, supplemental assay implementation, and results interpretation in the context of high homology. Genet Med 18 12, 1282-1289.
Logical analysis of diffuse large B-cell lymphomas.
Alexe, G; Alexe, S; Axelrod, D E; Hammer, P L; Weissmann, D
2005-07-01
The goal of this study is to re-examine the oligonucleotide microarray dataset of Shipp et al., which contains the intensity levels of 6817 genes of 58 patients with diffuse large B-cell lymphoma (DLBCL) and 19 with follicular lymphoma (FL), by means of the combinatorics, optimisation, and logic-based methodology of logical analysis of data (LAD). The motivations for this new analysis included the previously demonstrated capabilities of LAD and its expected potential (1) to identify different informative genes than those discovered by conventional statistical methods, (2) to identify combinations of gene expression levels capable of characterizing different types of lymphoma, and (3) to assemble collections of such combinations that if considered jointly are capable of accurately distinguishing different types of lymphoma. The central concept of LAD is a pattern or combinatorial biomarker, a concept that resembles a rule as used in decision tree methods. LAD is able to exhaustively generate the collection of all those patterns which satisfy certain quality constraints, through a systematic combinatorial process guided by clear optimization criteria. Then, based on a set covering approach, LAD aggregates the collection of patterns into classification models. In addition, LAD is able to use the information provided by large collections of patterns in order to extract subsets of variables, which collectively are able to distinguish between different types of disease. For the differential diagnosis of DLBCL versus FL, a model based on eight significant genes is constructed and shown to have a sensitivity of 94.7% and a specificity of 100% on the test set. For the prognosis of good versus poor outcome among the DLBCL patients, a model is constructed on another set consisting also of eight significant genes, and shown to have a sensitivity of 87.5% and a specificity of 90% on the test set. 
The genes selected by LAD also work well as a basis for other kinds of statistical analysis, indicating their robustness. These two models exhibit accuracies that compare favorably to those in the original study. In addition, the current study also provides a ranking by importance of the genes in the selected significant subsets as well as a library of dozens of combinatorial biomarkers (i.e. pairs or triplets of genes) that can serve as a source of mathematically generated, statistically significant research hypotheses in need of biological explanation.
Analysis and modelling of septic shock microarray data using Singular Value Decomposition.
Allanki, Srinivas; Dixit, Madhulika; Thangaraj, Paul; Sinha, Nandan Kumar
2017-06-01
Because microarrays are a high-throughput technique, enormous amounts of microarray data have been generated, and there is a need for more efficient techniques of analysis in terms of speed and accuracy. Finding the differentially expressed genes based on just fold change and p-value might not extract all the vital biological signals that occur at a lower gene expression level. Besides this, numerous mathematical models have been generated to predict the clinical outcome from microarray data, while very few, if any, aim at predicting the vital genes that are important in disease progression. Such models help a basic researcher narrow down and concentrate on a promising set of genes, which leads to the discovery of gene-based therapies. In this article, as a first objective, we have used the less widely known and used Singular Value Decomposition (SVD) technique to build a microarray data analysis tool that works with gene expression patterns and the intrinsic structure of the data in an unsupervised manner. We have re-analysed a microarray dataset covering the clinical course of septic shock from Cazalis et al. (2014) and have shown that our proposed analysis provides additional information compared to the conventional method. As a second objective, we developed a novel mathematical model that predicts a set of vital genes in the disease progression and works by generating samples in the continuum between health and disease, using a simple normal-distribution-based random number generator. We also verify that most of the predicted genes are indeed related to septic shock. Copyright © 2017 Elsevier Inc. All rights reserved.
Detecting Horizontal Gene Transfer between Closely Related Taxa
Adato, Orit; Ninyo, Noga; Gophna, Uri; Snir, Sagi
2015-01-01
Horizontal gene transfer (HGT), the transfer of genetic material between organisms, is crucial for genetic innovation and the evolution of genome architecture. Existing HGT detection algorithms rely on a strong phylogenetic signal distinguishing the transferred sequence from ancestral (vertically derived) genes in its recipient genome. Detecting HGT between closely related species or strains is challenging, as the phylogenetic signal is usually weak and the nucleotide composition is normally nearly identical. Nevertheless, detecting HGT between congeneric species or strains is of great importance, especially in clinical microbiology, where understanding the emergence of new virulent and drug-resistant strains is crucial, and often time-sensitive. We developed a novel, self-contained technique named Near HGT, based on the synteny index, to measure the divergence of a gene from its native genomic environment and used it to identify candidate HGT events between closely related strains. The method confirms candidate transferred genes based on the constant relative mutability (CRM). Using CRM, the algorithm assigns a confidence score based on “unusual” sequence divergence. A gene exhibiting exceptional deviations according to both synteny and mutability criteria is considered a validated HGT product. We first applied the technique to a set of three E. coli strains and detected several highly probable horizontally acquired genes. We then compared the method to existing HGT detection tools using a larger strain data set. When combined with additional approaches, our new algorithm provides a richer picture and brings us closer to the goal of detecting all newly acquired genes in a particular strain. PMID:26439115
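The idea of measuring how far a gene has moved from its native genomic environment can be illustrated with a small sketch. The abstract above does not define the synteny index, so the version here rests on an assumption: the index of a gene is taken to be the number of neighbouring genes, within k positions on either side, that it shares between two genomes. Genomes are treated as linear ordered gene lists, the CRM-based confirmation step is not shown, and all names and data are hypothetical.

```python
def neighborhood(genome, gene, k):
    """Genes within k positions on either side of `gene` in an ordered gene list."""
    i = genome.index(gene)
    return set(genome[max(0, i - k):i] + genome[i + 1:i + 1 + k])

def synteny_index(genome_a, genome_b, gene, k=2):
    """Count the neighbours of `gene` that are shared between the two genomes.

    A low value suggests the gene sits in a different genomic context in the
    two strains (assumed definition, for illustration only).
    """
    return len(neighborhood(genome_a, gene, k) & neighborhood(genome_b, gene, k))

# Toy gene orders: gene "d" has moved to a foreign context in strain_b.
strain_a = ["a", "b", "c", "d", "e", "f", "g"]
strain_b = ["a", "b", "c", "e", "f", "g", "x", "d", "y"]

si_c = synteny_index(strain_a, strain_b, "c")  # context largely conserved
si_d = synteny_index(strain_a, strain_b, "d")  # context not conserved
```

In this toy example the relocated gene shares no neighbours between the two strains, while a gene that stayed in place still shares most of them.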
Discretization provides a conceptually simple tool to build expression networks.
Vass, J Keith; Higham, Desmond J; Mudaliar, Manikhandan A V; Mao, Xuerong; Crowther, Daniel J
2011-04-18
Biomarker identification, using network methods, depends on finding regular co-expression patterns; the overall connectivity is of greater importance than any single relationship. A second requirement is a simple algorithm for ranking patients on how relevant a gene-set is. For both of these requirements discretized data helps to first identify gene cliques, and then to stratify patients. We explore a biologically intuitive discretization technique which codes genes as up- or down-regulated, with values close to the mean set as unchanged; this allows a richer description of relationships between genes than can be achieved by positive and negative correlation. We find a close agreement between our results and the template gene-interactions used to build synthetic microarray-like data by SynTReN, which synthesizes "microarray" data using known relationships which are successfully identified by our method. We are able to split positive co-regulation into up-together and down-together and negative co-regulation is considered as directed up-down relationships. In some cases these exist in only one direction, with real data, but not with the synthetic data. We illustrate our approach using two studies on white blood cells and derived immortalized cell lines and compare the approach with standard correlation-based computations. No attempt is made to distinguish possible causal links as the search for biomarkers would be crippled by losing highly significant co-expression relationships. This contrasts with approaches like ARACNE and IRIS. The method is illustrated with an analysis of gene-expression for energy metabolism pathways.
For each discovered relationship we are able to identify the samples on which this is based in the discretized sample-gene matrix, along with a simplified view of the patterns of gene expression; this helps to dissect the gene-sample relevant to a research topic--identifying sets of co-regulated and anti-regulated genes and the samples or patients in which this relationship occurs.
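The discretization described in the preceding abstract can be sketched roughly as follows. Each expression value is coded as up (+1), down (-1), or unchanged (0) when it lies close to the gene's mean, and pairs of genes are then tallied into up-together, down-together, and directed up-down relationships. The width of the "unchanged" band (half a standard deviation) and all names and values are assumptions, since the abstract does not give the thresholds used.

```python
from statistics import mean, stdev

def discretize(values, width=0.5):
    """Code each value as +1 (up), -1 (down), or 0 (unchanged, near the mean).

    width: half-width of the 'unchanged' band, in standard deviations
    (a hypothetical choice, not taken from the abstract).
    """
    m, s = mean(values), stdev(values)
    codes = []
    for v in values:
        if v > m + width * s:
            codes.append(1)
        elif v < m - width * s:
            codes.append(-1)
        else:
            codes.append(0)
    return codes

def co_regulation(codes_x, codes_y):
    """Count samples supporting each directed relationship between two genes."""
    labels = {(1, 1): "up-together", (-1, -1): "down-together",
              (1, -1): "up-down", (-1, 1): "down-up"}
    counts = {label: 0 for label in labels.values()}
    for pair in zip(codes_x, codes_y):
        if pair in labels:  # samples where either gene is unchanged are skipped
            counts[labels[pair]] += 1
    return counts

# Made-up expression values across six samples.
gene_x = [1.0, 1.2, 5.0, 5.2, 3.0, 3.1]
gene_y = [0.9, 1.1, 4.8, 5.1, 3.2, 2.9]
gene_z = [5.0, 4.9, 1.0, 1.1, 3.0, 3.0]

same_direction = co_regulation(discretize(gene_x), discretize(gene_y))
opposite_direction = co_regulation(discretize(gene_x), discretize(gene_z))
```

Because the tallies are kept per sample, the samples behind each relationship can be read back from the discretized matrix, which is the property the abstract emphasizes.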
A Comprehensive Analysis of Nuclear-Encoded Mitochondrial Genes in Schizophrenia.
Gonçalves, Vanessa F; Cappi, Carolina; Hagen, Christian M; Sequeira, Adolfo; Vawter, Marquis P; Derkach, Andriy; Zai, Clement C; Hedley, Paula L; Bybjerg-Grauholm, Jonas; Pouget, Jennie G; Cuperfain, Ari B; Sullivan, Patrick F; Christiansen, Michael; Kennedy, James L; Sun, Lei
2018-05-01
The genetic risk factors of schizophrenia (SCZ), a severe psychiatric disorder, are not yet fully understood. Multiple lines of evidence suggest that mitochondrial dysfunction may play a role in SCZ, but comprehensive association studies are lacking. We hypothesized that variants in nuclear-encoded mitochondrial genes influence susceptibility to SCZ. We conducted gene-based and gene-set analyses using summary association results from the Psychiatric Genomics Consortium Schizophrenia Phase 2 (PGC-SCZ2) genome-wide association study comprising 35,476 cases and 46,839 control subjects. We applied the MAGMA method to three sets of nuclear-encoded mitochondrial genes: oxidative phosphorylation genes, other nuclear-encoded mitochondrial genes, and genes involved in nucleus-mitochondria crosstalk. Furthermore, we conducted a replication study using the iPSYCH SCZ sample of 2290 cases and 21,621 control subjects. In the PGC-SCZ2 sample, 1186 mitochondrial genes were analyzed, among which 159 had p values < .05 and 19 remained significant after multiple testing correction. A meta-analysis of 818 genes combining the PGC-SCZ2 and iPSYCH samples resulted in 104 nominally significant and nine significant genes, suggesting a polygenic model for the nuclear-encoded mitochondrial genes. Gene-set analysis, however, did not show significant results. In an in silico protein-protein interaction network analysis, 14 mitochondrial genes interacted directly with 158 SCZ risk genes identified in PGC-SCZ2 (permutation p = .02), and aldosterone signaling in epithelial cells and mitochondrial dysfunction pathways appeared to be overrepresented in this network of mitochondrial and SCZ risk genes. This study provides evidence that specific aspects of mitochondrial function may play a role in SCZ, but we did not observe its broad involvement even using a large sample. Copyright © 2018 Society of Biological Psychiatry. Published by Elsevier Inc. All rights reserved.
nGASP--the nematode genome annotation assessment project.
Coghlan, Avril; Fiedler, Tristan J; McKay, Sheldon J; Flicek, Paul; Harris, Todd W; Blasiar, Darin; Stein, Lincoln D
2008-12-19
While the C. elegans genome is extensively annotated, relatively little information is available for other Caenorhabditis species. The nematode genome annotation assessment project (nGASP) was launched to objectively assess the accuracy of protein-coding gene prediction software in C. elegans, and to apply this knowledge to the annotation of the genomes of four additional Caenorhabditis species and other nematodes. Seventeen groups worldwide participated in nGASP, and submitted 47 prediction sets across 10 Mb of the C. elegans genome. Predictions were compared to reference gene sets consisting of confirmed or manually curated gene models from WormBase. The most accurate gene-finders were 'combiner' algorithms, which made use of transcript- and protein-alignments and multi-genome alignments, as well as gene predictions from other gene-finders. Gene-finders that used alignments of ESTs, mRNAs and proteins came in second. There was a tie for third place between gene-finders that used multi-genome alignments and ab initio gene-finders. The median gene level sensitivity of combiners was 78% and their specificity was 42%, which is nearly the same accuracy reported for combiners in the human genome. C. elegans genes with exons of unusual hexamer content, as well as those with unusually many exons, short exons, long introns, a weak translation start signal, weak splice sites, or poorly conserved orthologs posed the greatest difficulty for gene-finders. This experiment establishes a baseline of gene prediction accuracy in Caenorhabditis genomes, and has guided the choice of gene-finders for the annotation of newly sequenced genomes of Caenorhabditis and other nematode species. We have created new gene sets for C. briggsae, C. remanei, C. brenneri, C. japonica, and Brugia malayi using some of the best-performing gene-finders.
Dynamic association rules for gene expression data analysis.
Chen, Shu-Chuan; Tsai, Tsung-Hsien; Chung, Cheng-Han; Li, Wen-Hsiung
2015-10-14
The purpose of gene expression analysis is to look for the association between regulation of gene expression levels and phenotypic variations. This association based on gene expression profile has been used to determine whether the induction/repression of genes corresponds to phenotypic variations including cell regulations, clinical diagnoses and drug development. Statistical analyses on microarray data have been developed to resolve the gene selection issue. However, these methods do not inform us of causality between genes and phenotypes. In this paper, we propose the dynamic association rule algorithm (DAR algorithm), which helps one to efficiently select a subset of significant genes for subsequent analysis. The DAR algorithm is based on association rules from market basket analysis in marketing. We first propose a statistical way, based on constructing a one-sided confidence interval and hypothesis testing, to determine if an association rule is meaningful. Based on the proposed statistical method, we then developed the DAR algorithm for gene expression data analysis. The method was applied to analyze four microarray datasets and one Next Generation Sequencing (NGS) dataset: the Mice Apo A1 dataset, the whole genome expression dataset of mouse embryonic stem cells, expression profiling of the bone marrow of Leukemia patients, the Microarray Quality Control (MAQC) data set, and the RNA-seq dataset of a mouse genomic imprinting study. A comparison of the proposed method with the t-test on the expression profiling of the bone marrow of Leukemia patients was conducted. We developed a statistical way, based on the concept of confidence interval, to determine the minimum support and minimum confidence for mining association relationships among items. With the minimum support and minimum confidence, one can find significant rules in one single step. The DAR algorithm was then developed for gene expression data analysis.
Four gene expression datasets showed that the proposed DAR algorithm not only was able to identify a set of differentially expressed genes that largely agreed with that of other methods, but also provided an efficient and accurate way to find influential genes of a disease. In the paper, the well-established association rule mining technique from marketing has been successfully modified to determine the minimum support and minimum confidence based on the concept of confidence interval and hypothesis testing. It can be applied to gene expression data to mine significant association rules between gene regulation and phenotype. The proposed DAR algorithm provides an efficient way to find influential genes that underlie the phenotypic variance.
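The support and confidence measures that the preceding abstract borrows from market basket analysis can be illustrated with a short sketch. It uses the standard definitions (support as the fraction of samples containing all items of a rule, confidence as the support of the whole rule divided by the support of its antecedent). It does not reproduce the DAR algorithm or its confidence-interval-based choice of minimum support and confidence. The item names and samples are invented for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Each sample is a set of discretized observations (hypothetical labels).
samples = [
    {"geneA_up", "disease"},
    {"geneA_up", "disease"},
    {"geneA_up"},
    {"disease"},
    {"geneB_up"},
]

rule_support = support(samples, {"geneA_up", "disease"})
rule_confidence = confidence(samples, {"geneA_up"}, {"disease"})
```

Here the rule "geneA_up implies disease" holds in two of five samples and in two of the three samples where geneA is up; in the paper's approach, thresholds on these two quantities are derived statistically rather than chosen by hand.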
Let them fall where they may: congruence analysis in massive phylogenetically messy data sets.
Leigh, Jessica W; Schliep, Klaus; Lopez, Philippe; Bapteste, Eric
2011-10-01
Interest in congruence in phylogenetic data has largely focused on issues affecting multicellular organisms, and animals in particular, in which the level of incongruence is expected to be relatively low. In addition, assessment methods developed in the past have been designed for reasonably small numbers of loci and scale poorly for larger data sets. However, there are currently over a thousand complete genome sequences available and of interest to evolutionary biologists, and these sequences are predominantly from microbial organisms, whose molecular evolution is much less frequently tree-like than that of multicellular life forms. As such, the level of incongruence in these data is expected to be high. We present a congruence method that accommodates both very large numbers of genes and high degrees of incongruence. Our method uses clustering algorithms to identify subsets of genes based on similarity of phylogenetic signal. It involves only a single phylogenetic analysis per gene, and therefore, computation time scales nearly linearly with the number of genes in the data set. We show that our method performs very well with sets of sequence alignments simulated under a wide variety of conditions. In addition, we present an analysis of core genes of prokaryotes, often assumed to have been largely vertically inherited, in which we identify two highly incongruent classes of genes. This result is consistent with the complexity hypothesis.
Effect of the absolute statistic on gene-sampling gene-set analysis methods.
Nam, Dougu
2017-06-01
Gene-set enrichment analysis and its modified versions have commonly been used for identifying altered functions or pathways in disease from microarray data. In particular, the simple gene-sampling gene-set analysis methods have been heavily used for datasets with only a few sample replicates. The biggest problem with this approach is the highly inflated false-positive rate. In this paper, the effect of absolute gene statistic on gene-sampling gene-set analysis methods is systematically investigated. Thus far, the absolute gene statistic has merely been regarded as a supplementary method for capturing the bidirectional changes in each gene set. Here, it is shown that incorporating the absolute gene statistic in gene-sampling gene-set analysis substantially reduces the false-positive rate and improves the overall discriminatory ability. Its effect was investigated by power, false-positive rate, and receiver operating curve for a number of simulated and real datasets. The performances of gene-set analysis methods in one-tailed (genome-wide association study) and two-tailed (gene expression data) tests were also compared and discussed.
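A rough sketch of how an absolute gene statistic enters a gene-sampling test is given below. The set score is the mean of the gene statistics, either signed or absolute, and the p-value is the fraction of randomly sampled gene sets of the same size whose score is at least as large. This only illustrates the mechanics; it does not reproduce the paper's methods or its false-positive results. The scoring function, the one-sided comparison, the number of samples, and the data are all assumptions.

```python
import random

def set_score(stats, genes, absolute):
    """Mean gene statistic over a gene set, using absolute values if requested."""
    values = [abs(stats[g]) if absolute else stats[g] for g in genes]
    return sum(values) / len(values)

def gene_sampling_p(stats, gene_set, absolute, n_samples=1000, seed=0):
    """Upper-tail p-value from random gene sets of the same size (gene sampling)."""
    rng = random.Random(seed)
    observed = set_score(stats, gene_set, absolute)
    all_genes = sorted(stats)
    hits = sum(
        set_score(stats, rng.sample(all_genes, len(gene_set)), absolute) >= observed
        for _ in range(n_samples)
    )
    return (hits + 1) / (n_samples + 1)

# Invented statistics: 96 background genes with small values and a 4-gene set
# whose members change strongly but in opposite directions.
stats = {f"bg{i}": ((i % 7) - 3) * 0.1 for i in range(96)}
gene_set = ["s1", "s2", "s3", "s4"]
stats.update({"s1": 3.0, "s2": -3.0, "s3": 3.0, "s4": -3.0})

p_signed = gene_sampling_p(stats, gene_set, absolute=False)
p_abs = gene_sampling_p(stats, gene_set, absolute=True)
```

With the signed statistic the opposite changes cancel and the set looks unremarkable, whereas the absolute statistic yields a small p-value, which is the bidirectional behaviour the abstract refers to.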
Gene function prediction based on the Gene Ontology hierarchical structure.
Cheng, Liangxi; Lin, Hongfei; Hu, Yuncui; Wang, Jian; Yang, Zhihao
2014-01-01
Gene Ontology annotation is helpful in explaining life science phenomena and provides substantial support for biomedical research. The use of the Gene Ontology is gradually affecting the way people store and understand bioinformatic data. To facilitate the prediction of gene functions with the aid of text mining methods and existing resources, we transform it into a multi-label top-down classification problem and develop a method that uses the hierarchical relationships in the Gene Ontology structure to relieve the quantitative imbalance between positive and negative training samples. The method also enhances the discriminating ability of the classifiers by retaining and highlighting the key training samples. Additionally, the top-down classifier based on a tree structure takes the relationships among target classes into consideration and thus resolves the incompatibility between the classification results and the Gene Ontology structure. Our experiment on the Gene Ontology annotation corpus achieves an F-value of 50.7% (precision: 52.7%; recall: 48.9%). The experimental results demonstrate that when the training set is small, it can be expanded via topological propagation of associated documents between the parent and child nodes in the tree structure. The top-down classification model applies to sets of texts organized in an ontology structure or with a hierarchical relationship.
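As a quick consistency check, the reported F-value agrees with the harmonic mean of the reported precision and recall. The sketch below assumes the F-value is the balanced F1 measure; the abstract does not state which variant was used, and the function name is illustrative.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta = 1 gives the harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the abstract.
f1 = f_measure(0.527, 0.489)
print(round(100 * f1, 1))  # 50.7
```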
Rollins, Derrick K; Teh, Ailing
2010-12-17
Microarray data sets provide relative expression levels for thousands of genes for a small number, in comparison, of different experimental conditions called assays. Data mining techniques are used to extract specific information of genes as they relate to the assays. The multivariate statistical technique of principal component analysis (PCA) has proven useful in providing effective data mining methods. This article extends the PCA approach of Rollins et al. to the ranking of genes in microarray data sets that are expressed most differently between two biologically different groupings of assays. This method is evaluated on real and simulated data and compared to a current approach on the basis of false discovery rate (FDR) and statistical power (SP), which is the ability to correctly identify important genes. This work developed and evaluated two new test statistics based on PCA and compared them to a popular method that is not PCA based. Both test statistics were found to be effective as evaluated in three case studies: (i) exposing E. coli cells to two different ethanol levels; (ii) application of myostatin to two groups of mice; and (iii) a simulated data study derived from the properties of (ii). The proposed method (PM) effectively identified critical genes in these studies based on comparison with the current method (CM). The simulation study supports higher identification accuracy for PM over CM for both proposed test statistics when the gene variance is constant and for one of the test statistics when the gene variance is non-constant. PM compares quite favorably to CM in terms of lower FDR and much higher SP. Thus, PM can be quite effective in producing accurate signatures from large microarray data sets for differential expression between assay groups identified in a preliminary step of the PCA procedure and is, therefore, recommended for use in these applications.
Gu, Liqiang; Yu, Jun; Wang, Qing; Xu, Bin; Ji, Liechen; Yu, Lin; Zhang, Xipeng; Cai, Hui
2018-05-03
The present study aimed to investigate potential prognostic long noncoding RNAs (lncRNAs) associated with colorectal cancer (CRC). An mRNA‑seq dataset obtained from The Cancer Genome Atlas was employed to identify the differentially expressed lncRNAs (DELs) between CRC patients with good and poor prognoses. Subsequently, univariate and multivariate Cox regression analyses were conducted to analyze the prognosis‑associated lncRNAs among all DELs. In addition, a risk scoring system was developed according to the expression levels of the prognostic lncRNAs, which was then applied to a training set and an independent testing set. Furthermore, the co‑expressed genes of prognostic lncRNAs were screened using a Multi‑Experiment Matrix online tool for construction of lncRNA‑gene networks. Finally, Kyoto Encyclopedia of Genes and Genomes pathway and Gene Ontology (GO) function enrichment analyses were performed on genes in the lncRNA‑gene networks using KOBAS, GOATOOLS and ClusterProfiler. The present study identified 82 DELs, of which long intergenic nonprotein coding RNA 2159, RP11‑452L6.6, RP11‑894P9.1 and RP11‑69M1.6, and whey acidic protein four‑disulfide core domain 21 (WFDC21P) were reported to be independently associated with the prognosis of patients with CRC. A 5‑lncRNA signature‑based risk scoring system was developed, which may be used to classify patients into low‑ and high‑risk groups with significantly different recurrence‑free survival times in the training and testing sets (P<0.05). Co‑expressed genes of WFDC21P or RP11‑69M1.6 were utilized to construct the lncRNA‑gene networks. Genes in the networks were significantly enriched in 'tight junction', 'focal adhesion' and 'regulation of actin cytoskeleton' pathways, and numerous GO terms associated with 'reactive oxygen species metabolism' and 'nitric oxide metabolism'. 
The present study proposed a 5‑lncRNA signature‑based risk scoring system for predicting the prognosis of patients with CRC, and revealed the associated signaling pathways and biological processes. The results of the present study may help improve prognostic evaluation in clinical practice.
Species Tree Inference Using a Mixture Model.
Ullah, Ikram; Parviainen, Pekka; Lagergren, Jens
2015-09-01
Species tree reconstruction has been a subject of substantial research due to its central role across biology and medicine. A species tree is often reconstructed using a set of gene trees or by directly using sequence data. In either of these cases, one of the main confounding phenomena is the discordance between a species tree and a gene tree due to evolutionary events such as duplications and losses. Probabilistic methods can resolve the discordance by coestimating gene trees and the species tree but this approach poses a scalability problem for larger data sets. We present MixTreEM-DLRS: A two-phase approach for reconstructing a species tree in the presence of gene duplications and losses. In the first phase, MixTreEM, a novel structural expectation maximization algorithm based on a mixture model is used to reconstruct a set of candidate species trees, given sequence data for monocopy gene families from the genomes under study. In the second phase, PrIME-DLRS, a method based on the DLRS model (Åkerborg O, Sennblad B, Arvestad L, Lagergren J. 2009. Simultaneous Bayesian gene tree reconstruction and reconciliation analysis. Proc Natl Acad Sci U S A. 106(14):5714-5719), is used for selecting the best species tree. PrIME-DLRS can handle multicopy gene families since DLRS, apart from modeling sequence evolution, models gene duplication and loss using a gene evolution model (Arvestad L, Lagergren J, Sennblad B. 2009. The gene evolution model and computing its associated probabilities. J ACM. 56(2):1-44). We evaluate MixTreEM-DLRS using synthetic and biological data, and compare its performance with a recent genome-scale species tree reconstruction method PHYLDOG (Boussau B, Szöllősi GJ, Duret L, Gouy M, Tannier E, Daubin V. 2013. Genome-scale coestimation of species and gene trees. Genome Res. 23(2):323-330) as well as with a fast parsimony-based algorithm Duptree (Wehe A, Bansal MS, Burleigh JG, Eulenstein O. 2008. Duptree: a program for large-scale phylogenetic analyses using gene tree parsimony. Bioinformatics 24(13):1540-1541). Our method is competitive with PHYLDOG in terms of accuracy and runs significantly faster and our method outperforms Duptree in accuracy. The analysis constituted by MixTreEM without DLRS may also be used for selecting the target species tree, yielding a fast and yet accurate algorithm for larger data sets. MixTreEM is freely available at http://prime.scilifelab.se/mixtreem/. © The Author 2015. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
RNAi Mediated curcin precursor gene silencing in Jatropha (Jatropha curcas L.).
Patade, Vikas Yadav; Khatri, Deepti; Kumar, Kamal; Grover, Atul; Kumari, Maya; Gupta, Sanjay Mohan; Kumar, Devender; Nasim, Mohammed
2014-07-01
Curcin, a type I ribosome-inactivating protein (RIP) encoded by the curcin precursor gene, is a phytotoxin present in Jatropha (Jatropha curcas L.). Here, we report the design of an RNAi construct for the curcin precursor gene and its use in the genetic transformation of Jatropha to reduce expression of the transcript. The curcin precursor gene was first cloned from Jatropha strain DARL-2, and part of the gene sequence was cloned in sense and antisense orientations, separated by an intron sequence, in the plant expression binary vector pRI101 AN. The construction of the RNAi vector was confirmed by double digestion and nucleotide sequencing. The vector was then mobilized into Agrobacterium tumefaciens strain GV 3101 and used in a tissue-culture-independent in planta transformation protocol optimized for Jatropha. Germinating seeds were injured with a needle before infection with Agrobacterium and then transferred to sterilized sand medium. The seedlings were grown for 90 days, and genomic DNA was isolated from leaves for transgenic confirmation based on real-time PCR with an NPT II-specific dual-labeled probe. The transgenic confirmation analysis revealed the presence of the gene-silencing construct in ten of the 30 tested seedlings. Further, quantitative transcript expression analysis of the curcin precursor gene revealed a reduction in transcript abundance of more than 98%, to undetectable levels. The transgenic plants are being grown in containment for further studies on the reduction in curcin protein content in Jatropha seeds.
Nagarajan, Mahesh B; Coan, Paola; Huber, Markus B; Diemoz, Paul C; Wismüller, Axel
2015-01-01
Phase contrast X-ray computed tomography (PCI-CT) has been demonstrated as a novel imaging technique that can visualize human cartilage with high spatial resolution and soft tissue contrast. Different textural approaches have been previously investigated for characterizing chondrocyte organization on PCI-CT to enable classification of healthy and osteoarthritic cartilage. However, the large size of feature sets extracted in such studies motivates an investigation into algorithmic feature reduction for computing efficient feature representations without compromising their discriminatory power. For this purpose, geometrical feature sets derived from the scaling index method (SIM) were extracted from 1392 volumes of interest (VOI) annotated on PCI-CT images of ex vivo human patellar cartilage specimens. The extracted feature sets were subject to linear and non-linear dimension reduction techniques as well as feature selection based on evaluation of mutual information criteria. The reduced feature set was subsequently used in a machine learning task with support vector regression to classify VOIs as healthy or osteoarthritic; classification performance was evaluated using the area under the receiver-operating characteristic (ROC) curve (AUC). Our results show that the classification performance achieved by 9-D SIM-derived geometric feature sets (AUC: 0.96 ± 0.02) can be maintained with 2-D representations computed from both dimension reduction and feature selection (AUC values as high as 0.97 ± 0.02). Thus, such feature reduction techniques can offer a high degree of compaction to large feature sets extracted from PCI-CT images while maintaining their ability to characterize the underlying chondrocyte patterns.
oPOSSUM: integrated tools for analysis of regulatory motif over-representation
Ho Sui, Shannan J.; Fulton, Debra L.; Arenillas, David J.; Kwon, Andrew T.; Wasserman, Wyeth W.
2007-01-01
The identification of over-represented transcription factor binding sites from sets of co-expressed genes provides insights into the mechanisms of regulation for diverse biological contexts. oPOSSUM, an internet-based system for such studies of regulation, has been improved and expanded in this new release. New features include a worm-specific version for investigating binding sites conserved between Caenorhabditis elegans and C. briggsae, as well as a yeast-specific version for the analysis of co-expressed sets of Saccharomyces cerevisiae genes. The human and mouse applications feature improvements in ortholog mapping, sequence alignments and the delineation of multiple alternative promoters. oPOSSUM2, introduced for the analysis of over-represented combinations of motifs in human and mouse genes, has been integrated with the original oPOSSUM system. Analysis using user-defined background gene sets is now supported. The transcription factor binding site models have been updated to include new profiles from the JASPAR database. oPOSSUM is available at http://www.cisreg.ca/oPOSSUM/ PMID:17576675
nGASP - the nematode genome annotation assessment project
DOE Office of Scientific and Technical Information (OSTI.GOV)
Coghlan, A; Fiedler, T J; McKay, S J
2008-12-19
While the C. elegans genome is extensively annotated, relatively little information is available for other Caenorhabditis species. The nematode genome annotation assessment project (nGASP) was launched to objectively assess the accuracy of protein-coding gene prediction software in C. elegans, and to apply this knowledge to the annotation of the genomes of four additional Caenorhabditis species and other nematodes. Seventeen groups worldwide participated in nGASP, and submitted 47 prediction sets for 10 Mb of the C. elegans genome. Predictions were compared to reference gene sets consisting of confirmed or manually curated gene models from WormBase. The most accurate gene-finders were 'combiner' algorithms, which made use of transcript- and protein-alignments and multi-genome alignments, as well as gene predictions from other gene-finders. Gene-finders that used alignments of ESTs, mRNAs and proteins came in second place. There was a tie for third place between gene-finders that used multi-genome alignments and ab initio gene-finders. The median gene level sensitivity of combiners was 78% and their specificity was 42%, which is nearly the same accuracy as reported for combiners in the human genome. C. elegans genes with exons of unusual hexamer content, as well as those with many exons, short exons, long introns, a weak translation start signal, weak splice sites, or poorly conserved orthologs were the most challenging for gene-finders.
Detection of Pathways Affected by Positive Selection in Primate Lineages Ancestral to Humans
Moretti, S.; Davydov, I.I.; Excoffier, L.
2017-01-01
Gene set enrichment approaches have been increasingly successful in finding signals of recent polygenic selection in the human genome. In this study, we aim at detecting biological pathways affected by positive selection in more ancient human evolutionary history. Focusing on four branches of the primate tree that lead to modern humans, we tested all available protein coding gene trees of the Primates clade for signals of adaptation in these branches, using the likelihood-based branch site test of positive selection. The results of these locus-specific tests were then used as input for a gene set enrichment test, where whole pathways are globally scored for a signal of positive selection, instead of focusing only on outlier “significant” genes. We identified signals of positive selection in several pathways that are mainly involved in immune response, sensory perception, metabolism, and energy production. These pathway-level results are highly significant, even though there is no functional enrichment when only focusing on top scoring genes. Interestingly, several gene sets are found significant at multiple levels in the phylogeny, but different genes are responsible for the selection signal in the different branches. This suggests that the same function has been optimized in different ways at different times in primate evolution. PMID:28333345
WormQTLHD--a web database for linking human disease to natural variation data in C. elegans.
van der Velde, K Joeri; de Haan, Mark; Zych, Konrad; Arends, Danny; Snoek, L Basten; Kammenga, Jan E; Jansen, Ritsert C; Swertz, Morris A; Li, Yang
2014-01-01
Interactions between proteins are highly conserved across species. As a result, the molecular basis of multiple diseases affecting humans can be studied in model organisms that offer many alternative experimental opportunities. One such organism, Caenorhabditis elegans, has been used to produce much molecular quantitative genetics and systems biology data over the past decade. We present WormQTL(HD) (Human Disease), a database that quantitatively and systematically links expression Quantitative Trait Loci (eQTL) findings in C. elegans to gene-disease associations in humans. WormQTL(HD), available online at http://www.wormqtl-hd.org, is a user-friendly set of tools to reveal functionally coherent, evolutionarily conserved gene networks. These can be used to predict novel gene-to-gene associations and the functions of genes underlying the disease of interest. We created a new database that links C. elegans eQTL data sets to human diseases (34 337 gene-disease associations from OMIM, DGA, GWAS Central and NHGRI GWAS Catalogue) based on overlapping sets of orthologous genes associated with phenotypes in these two species. We utilized QTL results, high-throughput molecular phenotypes, classical phenotypes and genotype data covering different developmental stages and environments from the WormQTL database. All software is available as open source, built on MOLGENIS and xQTL workbench.
Mirza, Babur S.; Muruganandam, Subathra; Meng, Xianyu; Sorensen, Darwin L.; Dupont, R. Ryan
2014-01-01
Basin-fill aquifers of the Southwestern United States are associated with elevated concentrations of arsenic (As) in groundwater. Many private domestic wells in the Cache Valley Basin, UT, have As concentrations in excess of the U.S. EPA drinking water limit. Thirteen sediment cores were collected from the center of the valley at the depth of the shallow groundwater and were sectioned into layers based on redoxmorphic features. Three of the layers, two from redox transition zones and one from a depletion zone, were used to establish microcosms. Microcosms were treated with groundwater (GW) or groundwater plus glucose (GW+G) to investigate the extent of As reduction in relation to iron (Fe) transformation and characterize the microbial community structure and function by sequencing 16S rRNA and arsenate dissimilatory reductase (arrA) genes. Under the carbon-limited conditions of the GW treatment, As reduction was independent of Fe reduction, despite the abundance of sequences related to Geobacter and Shewanella, genera that include a variety of dissimilatory iron-reducing bacteria. The addition of glucose, an electron donor and carbon source, caused substantial shifts toward domination of the bacterial community by Clostridium-related organisms, and As reduction was correlated with Fe reduction for the sediments from the redox transition zone. The arrA gene sequencing from microcosms at day 54 of incubation showed the presence of 14 unique phylotypes, none of which were related to any previously described arrA gene sequence, suggesting a unique community of dissimilatory arsenate-respiring bacteria in the Cache Valley Basin. PMID:24632255
Brinkley-Rubinstein, Lauren; Cloud, David H; Davis, Chelsea; Zaller, Nickolas; Delany-Brumsey, Ayesha; Pope, Leah; Martino, Sarah; Bouvier, Benjamin; Rich, Josiah
2017-03-13
Purpose: The purpose of this paper is to discuss overdose among those with criminal justice experience and recommend harm reduction strategies to lessen overdose risk among this vulnerable population. Design/methodology/approach: Strategies are needed to reduce overdose deaths among those with recent incarceration. Jails and prisons are at the epicenter of the opioid epidemic but are a largely untapped setting for implementing overdose education, risk assessment, medication assisted treatment, and naloxone distribution programs. Federal, state, and local plans commonly lack corrections as an ingredient in combating overdose. Harm reduction strategies are vital for reducing the risk of overdose in the post-release community. Findings: Therefore, the authors recommend that the following be implemented in correctional settings: expansion of overdose education and naloxone programs; establishment of comprehensive medication assisted treatment programs as standard of care; development of corrections-specific overdose risk assessment tools; and increased collaboration between corrections entities and community-based organizations. Originality/value: In this policy brief the authors provide recommendations for implementing harm reduction approaches in criminal justice settings. Adoption of these strategies could reduce the number of overdoses among those with recent criminal justice involvement.
Watanabe, Toru; Bartrand, Timothy A; Omura, Tatsuo; Haas, Charles N
2012-03-01
Reported data sets on infection of volunteers challenged with wild-type influenza A virus at graded doses are few. As an alternative, we aimed to develop a dose-response assessment for this virus based on the data sets for its live attenuated reassortants. Eleven data sets for live attenuated reassortants were fit to beta-Poisson and exponential dose-response models. Dose-response relationships for those reassortants were characterized by pooling analysis of the data sets with respect to virus subtype (H1N1 or H3N2), attenuation method (cold-adapted or avian-human gene reassortment), and human age (adults or children). Furthermore, by comparing the above data sets to a limited number of reported data sets for wild-type virus, we quantified the degree of attenuation of wild-type virus with gene reassortment and estimated its infectivity. As a result, dose-response relationships of all reassortants were best described by a beta-Poisson model. Virus subtype and human age were significant factors determining the dose-response relationship, whereas attenuation method affected only the relationship of H1N1 virus infection to adults. The data sets for H3N2 wild-type virus could be pooled with those for its reassortants on the assumption that the gene reassortment attenuates wild-type virus by at least 63 times and most likely 1,070 times. Considering this most likely degree of attenuation, the 10% infectious dose of H3N2 wild-type virus for adults was estimated at 18 TCID50 (95% CI = 8.8-35 TCID50). The infectivity of wild-type H1N1 virus remains unknown as the data set pooling was unsuccessful. © 2011 Society for Risk Analysis.
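For orientation, the two model families named in the abstract are commonly written as P(d) = 1 - exp(-r d) for the exponential model and P(d) = 1 - (1 + d/beta)^(-alpha) for the approximate beta-Poisson model. The sketch below assumes these standard forms; the abstract does not say whether the approximate or exact beta-Poisson model was fitted, and the parameter values used here are placeholders, not the fitted values from the study.

```python
import math


def exponential_response(dose: float, r: float) -> float:
    """Exponential model: probability of infection at a given dose."""
    return 1.0 - math.exp(-r * dose)


def beta_poisson_response(dose: float, alpha: float, beta: float) -> float:
    """Approximate beta-Poisson model: probability of infection at a given dose."""
    return 1.0 - (1.0 + dose / beta) ** (-alpha)


def beta_poisson_dose_for_risk(p: float, alpha: float, beta: float) -> float:
    """Invert the approximate beta-Poisson model: dose that gives infection risk p."""
    return beta * ((1.0 - p) ** (-1.0 / alpha) - 1.0)


# Placeholder parameters chosen only to exercise the functions (assumed values).
alpha, beta = 0.5, 100.0
id10 = beta_poisson_dose_for_risk(0.10, alpha, beta)
print(id10, beta_poisson_response(id10, alpha, beta))
```

The inverse function is how a quantity such as a 10% infectious dose would be read off a fitted beta-Poisson curve once alpha and beta are known.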
RAMONA: a Web application for gene set analysis on multilevel omics data.
Sass, Steffen; Buettner, Florian; Mueller, Nikola S; Theis, Fabian J
2015-01-01
Decreasing costs of modern high-throughput experiments allow for the simultaneous analysis of altered gene activity on various molecular levels. However, these multi-omics approaches lead to a large amount of data, which is hard to interpret for a non-bioinformatician. Here, we present the remotely accessible multilevel ontology analysis (RAMONA). It offers an easy-to-use interface for the simultaneous gene set analysis of combined omics datasets and is an extension of the previously introduced MONA approach. RAMONA is based on a Bayesian enrichment method for the inference of overrepresented biological processes among given gene sets. Overrepresentation is quantified by interpretable term probabilities. It is able to handle data from various molecular levels, while in parallel coping with redundancies arising from gene set overlaps and related multiple testing problems. The comprehensive output of RAMONA is easy to interpret and thus allows for functional insight into the affected biological processes. With RAMONA, we provide an efficient implementation of the Bayesian inference problem such that ontologies consisting of thousands of terms can be processed in the order of seconds. RAMONA is implemented as an ASP.NET Web application and is publicly available at http://icb.helmholtz-muenchen.de/ramona. © The Author 2014. Published by Oxford University Press.
ERIC Educational Resources Information Center
Crone, Regina M.; Mehta, Smita Shukla
2016-01-01
Setting variables such as location of parent training, programming with common stimuli, generalization of discrete responses to non-trained settings, and subsequent reduction in child problem behavior may influence the effectiveness of interventions. The purpose of this study was to evaluate the effectiveness of home- versus clinic-based training…
Hardigan, Michael A.; Crisovan, Emily; Hamilton, John P.; Laimbeer, Parker; Leisner, Courtney P.; Manrique-Carpintero, Norma C.; Newton, Linsey; Pham, Gina M.; Vaillancourt, Brieanne; Zeng, Zixian; Jiang, Jiming
2016-01-01
Clonally reproducing plants have the potential to bear a significantly greater mutational load than sexually reproducing species. To investigate this possibility, we examined the breadth of genome-wide structural variation in a panel of monoploid/doubled monoploid clones generated from native populations of diploid potato (Solanum tuberosum), a highly heterozygous asexually propagated plant. As rare instances of purely homozygous clones, they provided an ideal set for determining the degree of structural variation tolerated by this species and deriving its minimal gene complement. Extensive copy number variation (CNV) was uncovered, impacting 219.8 Mb (30.2%) of the potato genome with nearly 30% of genes subject to at least partial duplication or deletion, revealing the highly heterogeneous nature of the potato genome. Dispensable genes (>7000) were associated with limited transcription and/or a recent evolutionary history, with lower deletion frequency observed in genes conserved across angiosperms. Association of CNV with plant adaptation was highlighted by enrichment in gene clusters encoding functions for environmental stress response, with gene duplication playing a part in species-specific expansions of stress-related gene families. This study revealed unique impacts of CNV in a species with asexual reproductive habits and how CNV may drive adaption through evolution of key stress pathways. PMID:26772996
Analysis of evolutionary patterns of genes in Campylobacter jejuni and C. coli
USDA-ARS's Scientific Manuscript database
Background: In order to investigate the population genetics structure of thermophilic Campylobacter spp., we extracted a set of 1029 core gene families (CGF) from 25 sequenced genomes of C. jejuni, C. coli and C. lari. Based on these CGFs we employed different approaches to reveal the evolutionary ...
Juraeva, Dilafruz; Haenisch, Britta; Zapatka, Marc; Frank, Josef; Witt, Stephanie H; Mühleisen, Thomas W; Treutlein, Jens; Strohmaier, Jana; Meier, Sandra; Degenhardt, Franziska; Giegling, Ina; Ripke, Stephan; Leber, Markus; Lange, Christoph; Schulze, Thomas G; Mössner, Rainald; Nenadic, Igor; Sauer, Heinrich; Rujescu, Dan; Maier, Wolfgang; Børglum, Anders; Ophoff, Roel; Cichon, Sven; Nöthen, Markus M; Rietschel, Marcella; Mattheisen, Manuel; Brors, Benedikt
2014-06-01
In the present study, an integrated hierarchical approach was applied to: (1) identify pathways associated with susceptibility to schizophrenia; (2) detect genes that may be potentially affected in these pathways since they contain an associated polymorphism; and (3) annotate the functional consequences of such single-nucleotide polymorphisms (SNPs) in the affected genes or their regulatory regions. The Global Test was applied to detect schizophrenia-associated pathways using discovery and replication datasets comprising 5,040 and 5,082 individuals of European ancestry, respectively. Information concerning functional gene-sets was retrieved from the Kyoto Encyclopedia of Genes and Genomes, Gene Ontology, and the Molecular Signatures Database. Fourteen of the gene-sets or pathways identified in the discovery dataset were confirmed in the replication dataset. These include functional processes involved in transcriptional regulation and gene expression, synapse organization, cell adhesion, and apoptosis. For two genes, i.e. CTCF and CACNB2, evidence for association with schizophrenia was available (at the gene-level) in both the discovery study and published data from the Psychiatric Genomics Consortium schizophrenia study. Furthermore, these genes mapped to four of the 14 presently identified pathways. Several of the SNPs assigned to CTCF and CACNB2 have potential functional consequences, and a gene in close proximity to CACNB2, i.e. ARL5B, was identified as a potential gene of interest. Application of the present hierarchical approach thus allowed: (1) identification of novel biological gene-sets or pathways with potential involvement in the etiology of schizophrenia, as well as replication of these findings in an independent cohort; (2) detection of genes of interest for future follow-up studies; and (3) the highlighting of novel genes in previously reported candidate regions for schizophrenia.
Vimaleswaran, Karani S; Tachmazidou, Ioanna; Zhao, Jing Hua; Hirschhorn, Joel N; Dudbridge, Frank; Loos, Ruth J F
2012-10-15
Before the advent of genome-wide association studies (GWASs), hundreds of candidate genes for obesity-susceptibility had been identified through a variety of approaches. We examined whether those obesity candidate genes are enriched for associations with body mass index (BMI) compared with non-candidate genes by using data from a large-scale GWAS. A thorough literature search identified 547 candidate genes for obesity-susceptibility based on evidence from animal studies, Mendelian syndromes, linkage studies, genetic association studies and expression studies. Genomic regions were defined to include the genes ±10 kb of flanking sequence around candidate and non-candidate genes. We used summary statistics publicly available from the discovery stage of the genome-wide meta-analysis for BMI performed by the genetic investigation of anthropometric traits consortium in 123 564 individuals. Hypergeometric, rank tail-strength and gene-set enrichment analysis tests were used to test for the enrichment of association in candidate compared with non-candidate genes. The hypergeometric test of enrichment was not significant at the 5% P-value quantile (P = 0.35), but was nominally significant at the 25% quantile (P = 0.015). The rank tail-strength and gene-set enrichment tests were nominally significant for the full set of genes and borderline significant for the subset without SNPs at P < 10(-7). Taken together, the observed evidence for enrichment suggests that the candidate gene approach retains some value. However, the degree of enrichment is small despite the extensive number of candidate genes and the large sample size. Studies that focus on candidate genes have only slightly increased chances of detecting associations, and are likely to miss many true effects in non-candidate genes, at least for obesity-related traits.
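The hypergeometric test of enrichment mentioned above asks how likely it is to see at least k candidate genes among the n genes in the top P-value quantile, given K candidates among N genes in total. The sketch below computes that upper-tail probability exactly. Apart from the 547 candidate genes, the counts in the example are hypothetical; the abstract does not report them.

```python
from math import comb


def hypergeom_upper_tail(k: int, N: int, K: int, n: int) -> float:
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which are candidates, and X counts the candidates drawn."""
    total = comb(N, n)
    upper = min(K, n)
    # comb() returns 0 when n - i exceeds N - K, so impossible terms vanish.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Hypothetical counts: 20,000 genes in total, 547 candidates, 1,000 genes in
# the top 5% quantile, 35 of which are candidates (about 27 expected by chance).
p_value = hypergeom_upper_tail(k=35, N=20000, K=547, n=1000)
print(p_value)
```

Integer arithmetic with `math.comb` avoids the rounding problems of multiplying many small probabilities, at the cost of working with very large integers.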
A three-gene expression signature model for risk stratification of patients with neuroblastoma.
Garcia, Idoia; Mayol, Gemma; Ríos, José; Domenech, Gema; Cheung, Nai-Kong V; Oberthuer, André; Fischer, Matthias; Maris, John M; Brodeur, Garrett M; Hero, Barbara; Rodríguez, Eva; Suñol, Mariona; Galvan, Patricia; de Torres, Carmen; Mora, Jaume; Lavarino, Cinzia
2012-04-01
Neuroblastoma is an embryonal tumor with contrasting clinical courses. Despite elaborate stratification strategies, precise clinical risk assessment still remains a challenge. The purpose of this study was to develop a PCR-based predictor model to improve clinical risk assessment of patients with neuroblastoma. The model was developed using real-time PCR gene expression data from 96 samples and tested on separate expression data sets obtained from real-time PCR and microarray studies comprising 362 patients. On the basis of our prior study of differentially expressed genes in favorable and unfavorable neuroblastoma subgroups, we identified three genes, CHD5, PAFAH1B1, and NME1, strongly associated with patient outcome. The expression pattern of these genes was used to develop a PCR-based single-score predictor model. The model discriminated patients into two groups with significantly different clinical outcome [set 1: 5-year overall survival (OS): 0.93 ± 0.03 vs. 0.53 ± 0.06, 5-year event-free survival (EFS): 0.85 ± 0.04 vs. 0.042 ± 0.06, both P < 0.001; set 2 OS: 0.97 ± 0.02 vs. 0.61 ± 0.1, P = 0.005, EFS: 0.91 ± 0.8 vs. 0.56 ± 0.1, P = 0.005; and set 3 OS: 0.99 ± 0.01 vs. 0.56 ± 0.06, EFS: 0.96 ± 0.02 vs. 0.43 ± 0.05, both P < 0.001]. Multivariate analysis showed that the model was an independent marker for survival (P < 0.001, for all). In comparison with accepted risk stratification systems, the model robustly classified patients in the total cohort and in different clinically relevant risk subgroups. We propose for the first time in neuroblastoma, a technically simple PCR-based predictor model that could help refine current risk stratification systems. ©2012 AACR.
A Three-Gene Expression Signature Model for Risk Stratification of Patients with Neuroblastoma
Garcia, Idoia; Mayol, Gemma; Ríos, José; Domenech, Gema; Cheung, Nai-Kong V.; Oberthuer, André; Fischer, Matthias; Maris, John M.; Brodeur, Garrett M.; Hero, Barbara; Rodríguez, Eva; Suñol, Mariona; Galvan, Patricia; de Torres, Carmen; Mora, Jaume; Lavarino, Cinzia
2014-01-01
Purpose: Neuroblastoma is an embryonal tumor with contrasting clinical courses. Despite elaborate stratification strategies, precise clinical risk assessment still remains a challenge. The purpose of this study was to develop a PCR-based predictor model to improve clinical risk assessment of patients with neuroblastoma. Experimental Design: The model was developed using real-time PCR gene expression data from 96 samples and tested on separate expression data sets obtained from real-time PCR and microarray studies comprising 362 patients. Results: On the basis of our prior study of differentially expressed genes in favorable and unfavorable neuroblastoma subgroups, we identified three genes, CHD5, PAFAH1B1, and NME1, strongly associated with patient outcome. The expression pattern of these genes was used to develop a PCR-based single-score predictor model. The model discriminated patients into two groups with significantly different clinical outcome [set 1: 5-year overall survival (OS): 0.93 ± 0.03 vs. 0.53 ± 0.06, 5-year event-free survival (EFS): 0.85 ± 0.04 vs. 0.042 ± 0.06, both P < 0.001; set 2 OS: 0.97 ± 0.02 vs. 0.61 ± 0.1, P = 0.005, EFS: 0.91 ± 0.8 vs. 0.56 ± 0.1, P = 0.005; and set 3 OS: 0.99 ± 0.01 vs. 0.56 ± 0.06, EFS: 0.96 ± 0.02 vs. 0.43 ± 0.05, both P < 0.001]. Multivariate analysis showed that the model was an independent marker for survival (P < 0.001, for all). In comparison with accepted risk stratification systems, the model robustly classified patients in the total cohort and in different clinically relevant risk subgroups. Conclusion: We propose for the first time in neuroblastoma, a technically simple PCR-based predictor model that could help refine current risk stratification systems. PMID:22328561
Zhu, Bo; Zhang, Wenli; Jiang, Jiming
2015-01-01
Enhancers are important regulators of gene expression in eukaryotes. Enhancers function independently of their distance and orientation to the promoters of target genes. Thus, enhancers have been difficult to identify. Only a few enhancers, especially distant intergenic enhancers, have been identified in plants. We developed an enhancer prediction system based exclusively on the DNase I hypersensitive sites (DHSs) in the Arabidopsis thaliana genome. A set of 10,044 DHSs located in intergenic regions, which are away from any gene promoters, were predicted to be putative enhancers. We examined the functions of 14 predicted enhancers using the β-glucuronidase gene reporter. Ten of the 14 (71%) candidates were validated by the reporter assay. We also designed 10 constructs using intergenic sequences that are not associated with DHSs, and none of these constructs showed enhancer activities in reporter assays. In addition, the tissue specificity of the putative enhancers can be precisely predicted based on DNase I hypersensitivity data sets developed from different plant tissues. These results suggest that the open chromatin signature-based enhancer prediction system developed in Arabidopsis may serve as a universal system for enhancer identification in plants. PMID:26373455
Akram, Pakeeza; Liao, Li
2017-12-06
Identification of common genes associated with comorbid diseases can be critical in understanding their pathobiological mechanism. This work presents a novel method to predict missing common genes associated with a disease pair. Searching for missing common genes is formulated as an optimization problem that minimizes the network-based module separation between two subgraphs produced by mapping the genes associated with the diseases onto the interactome. Using cross-validation on more than 600 disease pairs, our method achieves a significantly higher average receiver operating characteristic (ROC) score of 0.95, compared with a baseline ROC score of 0.60 using randomized data. Missing common gene prediction aims to complete the gene set associated with comorbid diseases for a better understanding of biological intervention. It will also be useful for gene-targeted therapeutics related to comorbid diseases. This method can be further considered for predicting missing edges to complete the subgraph associated with a disease pair.
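As a rough illustration of the ROC score reported in this abstract, the sketch below computes the area under the ROC curve from labels and prediction scores. It assumes "ROC score" means the area under the curve, which the abstract does not state explicitly. The function name and the example data are hypothetical, and the pairwise (Mann-Whitney) formulation is a generic technique, not the authors' evaluation code.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive (label 1)
    receives a higher score than a randomly chosen negative (label 0),
    counting ties as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Hypothetical example: 1 marks a held-out common gene, 0 any other gene.
    labels = [1, 0, 1, 0]
    scores = [0.9, 0.8, 0.2, 0.1]
    print(roc_auc(labels, scores))  # 0.75
```

The quadratic pairwise loop is adequate for small examples; a rank-based implementation would be preferable for large gene sets.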
Feldmesser, Ester; Rosenwasser, Shilo; Vardi, Assaf; Ben-Dor, Shifra
2014-02-22
The advent of Next Generation Sequencing technologies and corresponding bioinformatics tools allows the definition of transcriptomes in non-model organisms. Non-model organisms are of great ecological and biotechnological significance, and consequently the understanding of their unique metabolic pathways is essential. Several methods that integrate de novo assembly with genome-based assembly have been proposed. Yet, there are many open challenges in defining genes, particularly where genomes are not available or incomplete. Despite the large numbers of transcriptome assemblies that have been performed, quality control of the transcript building process, particularly on the protein level, is rarely performed if ever. To test and improve the quality of the automated transcriptome reconstruction, we used manually defined and curated genes, several of them experimentally validated. Several approaches to transcript construction were utilized, based on the available data: a draft genome, high quality RNAseq reads, and ESTs. In order to maximize the contribution of the various data, we integrated methods including de novo and genome based assembly, as well as EST clustering. After each step a set of manually curated genes was used for quality assessment of the transcripts. The interplay between the automated pipeline and the quality control indicated which additional processes were required to improve the transcriptome reconstruction. We discovered that E. huxleyi has a very high percentage of non-canonical splice junctions, and relatively high rates of intron retention, which caused unique issues with the currently available tools. While individual tools missed genes and artificially joined overlapping transcripts, combining the results of several tools improved the completeness and quality considerably. 
The final collection, created from the integration of several quality control and improvement rounds, was compared to the manually defined set both on the DNA and protein levels, and resulted in an improvement of 20% versus any of the read-based approaches alone. To the best of our knowledge, this is the first time that an automated transcript definition is subjected to quality control using manually defined and curated genes and thereafter the process is improved. We recommend using a set of manually curated genes to troubleshoot transcriptome reconstruction.
Stec, James; Wang, Jing; Coombes, Kevin; Ayers, Mark; Hoersch, Sebastian; Gold, David L.; Ross, Jeffrey S; Hess, Kenneth R.; Tirrell, Stephen; Linette, Gerald; Hortobagyi, Gabriel N.; Symmans, W. Fraser; Pusztai, Lajos
2005-01-01
We examined how well differentially expressed genes and multigene outcome classifiers retain their class-discriminating values when tested on data generated by different transcriptional profiling platforms. RNA from 33 stage I-III breast cancers was hybridized to both Affymetrix GeneChip and Millennium Pharmaceuticals cDNA arrays. Only 30% of all corresponding gene expression measurements on the two platforms had Pearson correlation coefficient r ≥ 0.7 when UniGene was used to match probes. There was substantial variation in correlation between different Affymetrix probe sets matched to the same cDNA probe. When cDNA and Affymetrix probes were matched by basic local alignment search tool (BLAST) sequence identity, the correlation increased substantially. We identified 182 genes in the Affymetrix and 45 in the cDNA data (including 17 common genes) that accurately separated 91% of cases in supervised hierarchical clustering in each data set. Cross-platform testing of these informative genes resulted in lower clustering accuracy of 45 and 79%, respectively. Several sets of accurate five-gene classifiers were developed on each platform using linear discriminant analysis. The best 100 classifiers showed average misclassification error rate of 2% on the original data that rose to 19.5% when tested on data from the other platform. Random five-gene classifiers showed misclassification error rate of 33%. We conclude that multigene predictors optimized for one platform lose accuracy when applied to data from another platform due to missing genes and sequence differences in probes that result in differing measurements for the same gene. PMID:16049308
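The cross-platform concordance figure in this abstract (the share of matched measurements with Pearson r ≥ 0.7) can be sketched as follows. This is a minimal illustration assuming each platform's data is a mapping from a matched gene identifier to expression values in the same sample order; the function names, data layout, and values are hypothetical and are not taken from the study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def fraction_concordant(platform_a, platform_b, threshold=0.7):
    """Fraction of genes present on both platforms with r >= threshold."""
    shared = platform_a.keys() & platform_b.keys()
    hits = sum(
        1 for gene in shared
        if pearson_r(platform_a[gene], platform_b[gene]) >= threshold
    )
    return hits / len(shared)


if __name__ == "__main__":
    # Hypothetical expression values for two matched genes in three samples.
    affymetrix = {"gene1": [1.0, 2.0, 3.0], "gene2": [1.0, 2.0, 3.0]}
    cdna = {"gene1": [2.0, 4.0, 6.0], "gene2": [3.0, 2.0, 1.0]}
    print(fraction_concordant(affymetrix, cdna))  # 0.5
```

How probes are matched between platforms (for example by UniGene cluster or by sequence identity) determines which keys are shared, and the abstract reports that this choice strongly affects the resulting correlations.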
A Genome-Wide siRNA Screen in Mammalian Cells for Regulators of S6 Phosphorylation
Papageorgiou, Angela; Rapley, Joseph; Mesirov, Jill P.; Tamayo, Pablo; Avruch, Joseph
2015-01-01
mTOR complex 1 (mTORC1), the major regulator of mRNA translation in all eukaryotic cells, is strongly activated in most cancers. We performed a genome-wide RNAi screen in a human cancer cell line, seeking genes that regulate S6 phosphorylation, a readout of mTORC1 activity. Applying a stringent selection, we retrieved nearly 600 genes wherein at least two RNAis gave significant reduction in S6-P. This cohort contains known regulators of mTOR complex 1 and is significantly enriched in genes whose depletion affects the proliferation/viability of the large set of cancer cell lines in the Achilles database in a manner paralleling that caused by mTOR depletion. We next examined the effect of RNAi pools directed at 534 of these gene products on S6-P in TSC1 null mouse embryo fibroblasts. 76 RNAis reduced S6 phosphorylation significantly in 2 or 3 replicates. Surprisingly, among this cohort of genes the only elements previously associated with the maintenance of mTORC1 activity are two subunits of the vacuolar ATPase and the CUL4 subunit DDB1. RNAi against a second set of 84 targets reduced S6-P in only one of three replicates. However, an indication that this group also bears attention is the presence of rpS6KB1 itself, Rac1 and MAP4K3, a protein kinase that supports amino acid signaling to rpS6KB1. The finding that S6 phosphorylation requires a previously unidentified, functionally diverse cohort of genes that participate in fundamental cellular processes such as mRNA translation, RNA processing, DNA repair and metabolism suggests the operation of feedback pathways in the regulation of mTORC1 operating through novel mechanisms. PMID:25790369
Phylogenetic Diversity and Metabolic Potential Revealed in a Glacier Ice Metagenome
Simon, Carola; Wiezer, Arnim; Strittmatter, Axel W.; Daniel, Rolf
2009-01-01
The largest part of the Earth's microbial biomass is stored in cold environments, which represent almost untapped reservoirs of novel species, processes, and genes. In this study, the first metagenomic survey of the metabolic potential and phylogenetic diversity of a microbial assemblage present in glacial ice is presented. DNA was isolated from glacial ice of the Northern Schneeferner, Germany. Pyrosequencing of this DNA yielded 1,076,539 reads (239.7 Mbp). The phylogenetic composition of the prokaryotic community was assessed by evaluation of a pyrosequencing-derived data set and sequencing of 16S rRNA genes. The Proteobacteria (mainly Betaproteobacteria), Bacteroidetes, and Actinobacteria were the predominant phylogenetic groups. In addition, isolation of psychrophilic microorganisms was performed, and 13 different bacterial isolates were recovered. Analysis of the 16S rRNA gene sequences of the isolates revealed that all were affiliated to the predominant groups. As expected for microorganisms residing in a low-nutrient environment, a high metabolic versatility with respect to degradation of organic substrates was detected by analysis of the pyrosequencing-derived data set. The presence of autotrophic microorganisms was indicated by identification of genes typical for different ways of carbon fixation. In accordance with the results of the phylogenetic studies, in which mainly aerobic and facultative aerobic bacteria were detected, genes typical for central metabolism of aerobes were found. Nevertheless, the capability of growth under anaerobic conditions was indicated by genes involved in dissimilatory nitrate/nitrite reduction. Numerous characteristics for metabolic adaptations associated with a psychrophilic lifestyle, such as formation of cryoprotectants and maintenance of membrane fluidity by the incorporation of unsaturated fatty acids, were detected. 
Thus, analysis of the glacial metagenome provided insights into the microbial life in frozen habitats on Earth, thereby possibly shedding light onto microbial life in analogous extraterrestrial environments. PMID:19801459