Dwivedi, Bhakti; Kowalski, Jeanne
2018-01-01
While many methods exist for integrating multi-omics data or defining gene sets, there is no one single tool that defines gene sets based on merging of multiple omics data sets. We present shinyGISPA, an open-source application with a user-friendly web-based interface to define genes according to their similarity in several molecular changes that are driving a disease phenotype. This tool was developed to help facilitate the usability of a previously published method, Gene Integrated Set Profile Analysis (GISPA), among researchers with limited computer-programming skills. The GISPA method allows the identification of multiple gene sets that may play a role in the characterization, clinical application, or functional relevance of a disease phenotype. The tool provides an automated workflow that is highly scalable and adaptable to applications that go beyond genomic data merging analysis. It is available at http://shinygispa.winship.emory.edu/shinyGISPA/. PMID:29415010
Multiconstrained gene clustering based on generalized projections
2010-01-01
Background: Gene clustering for annotating gene functions is one of the fundamental issues in bioinformatics. The best clustering solution is often regularized by multiple constraints such as gene expressions, Gene Ontology (GO) annotations and gene network structures. How to integrate multiple constraints into an optimal clustering solution remains an unsolved problem. Results: We propose a novel multiconstrained gene clustering (MGC) method within the generalized projection onto convex sets (POCS) framework used widely in image reconstruction. Each constraint is formulated as a corresponding set. The generalized projector iteratively projects the clustering solution onto these sets in order to find a consistent solution in the intersection set that satisfies all constraints. Compared with previous MGC methods, POCS can integrate multiple constraints of different natures without distorting the original constraints. To evaluate the clustering solution, we also propose a new performance measure referred to as Gene Log Likelihood (GLL) that accounts for genes having more than one function and hence belonging to more than one cluster. Comparative experimental results show that our POCS-based gene clustering method outperforms current state-of-the-art MGC methods. Conclusions: The POCS-based MGC method can successfully combine multiple constraints of different natures for gene clustering. Also, the proposed GLL is an effective performance measure for soft clustering solutions. PMID:20356386
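The iterative projection idea in the abstract above can be illustrated with a minimal sketch of classic alternating projections. The two constraint sets used here (memberships summing to one, and memberships lying between 0 and 1) are toy assumptions chosen for illustration; they are not the expression, GO, or network constraint sets of the MGC method, and the function names are hypothetical.

```python
# Illustrative only: two toy convex sets, not the constraint sets of the MGC paper.

def project_onto_sum_one(x):
    """Euclidean projection onto the hyperplane {x : sum(x) == 1}."""
    shift = (1.0 - sum(x)) / len(x)
    return [v + shift for v in x]


def project_onto_unit_box(x):
    """Euclidean projection onto the box {x : 0 <= x_i <= 1}."""
    return [min(max(v, 0.0), 1.0) for v in x]


def pocs(x, projectors, max_iter=500, tol=1e-12):
    """Cycle through the projectors until the iterate stops moving."""
    for _ in range(max_iter):
        previous = x
        for project in projectors:
            x = project(x)
        if max(abs(a - b) for a, b in zip(x, previous)) < tol:
            break
    return x


# A soft cluster-membership vector for one gene that violates both constraints.
membership = pocs([0.9, 0.6, -0.2], [project_onto_sum_one, project_onto_unit_box])
print(membership)
```

Each pass applies every projector once; because both sets are convex and intersect, the iterates approach a point that satisfies both constraints at the same time.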
Combining multiple tools outperforms individual methods in gene set enrichment analyses.
Alhamdoosh, Monther; Ng, Milica; Wilson, Nicholas J; Sheridan, Julie M; Huynh, Huy; Wilson, Michael J; Ritchie, Matthew E
2017-02-01
Gene set enrichment (GSE) analysis allows researchers to efficiently extract biological insight from long lists of differentially expressed genes by interrogating them at a systems level. In recent years, there has been a proliferation of GSE analysis methods and hence it has become increasingly difficult for researchers to select an optimal GSE tool based on their particular dataset. Moreover, the majority of GSE analysis methods do not allow researchers to simultaneously compare gene set level results between multiple experimental conditions. The ensemble of genes set enrichment analyses (EGSEA) is a method developed for RNA-sequencing data that combines results from twelve algorithms and calculates collective gene set scores to improve the biological relevance of the highest ranked gene sets. EGSEA's gene set database contains around 25 000 gene sets from sixteen collections. It has multiple visualization capabilities that allow researchers to view gene sets at various levels of granularity. EGSEA has been tested on simulated data and on a number of human and mouse datasets and, based on biologists' feedback, consistently outperforms the individual tools that have been combined. Our evaluation demonstrates the superiority of the ensemble approach for GSE analysis, and its utility to effectively and efficiently extrapolate biological functions and potential involvement in disease processes from lists of differentially regulated genes. EGSEA is available as an R package at http://www.bioconductor.org/packages/EGSEA/. The gene sets collections are available in the R package EGSEAdata from http://www.bioconductor.org/packages/EGSEAdata/. Contact: monther.alhamdoosh@csl.com.au or mritchie@wehi.edu.au. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
MAGMA: Generalized Gene-Set Analysis of GWAS Data
de Leeuw, Christiaan A.; Mooij, Joris M.; Heskes, Tom; Posthuma, Danielle
2015-01-01
By aggregating data for complex traits in a biologically meaningful way, gene and gene-set analysis constitute a valuable addition to single-marker analysis. However, although various methods for gene and gene-set analysis currently exist, they generally suffer from a number of issues. Statistical power for most methods is strongly affected by linkage disequilibrium between markers, multi-marker associations are often hard to detect, and the reliance on permutation to compute p-values tends to make the analysis computationally very expensive. To address these issues we have developed MAGMA, a novel tool for gene and gene-set analysis. The gene analysis is based on a multiple regression model, to provide better statistical performance. The gene-set analysis is built as a separate layer around the gene analysis for additional flexibility. This gene-set analysis also uses a regression structure to allow generalization to analysis of continuous properties of genes and simultaneous analysis of multiple gene sets and other gene properties. Simulations and an analysis of Crohn’s Disease data are used to evaluate the performance of MAGMA and to compare it to a number of other gene and gene-set analysis tools. The results show that MAGMA has significantly more power than other tools for both the gene and the gene-set analysis, identifying more genes and gene sets associated with Crohn’s Disease while maintaining a correct type 1 error rate. Moreover, the MAGMA analysis of the Crohn’s Disease data was found to be considerably faster as well. PMID:25885710
Discovery of cancer common and specific driver gene sets
2017-01-01
Cancer is known as a disease mainly caused by gene alterations. Discovery of mutated driver pathways or gene sets is becoming an important step to understand molecular mechanisms of carcinogenesis. However, systematically investigating the commonalities and specificities of driver gene sets among multiple cancer types remains a great challenge, even though such an investigation would help decipher cancers and support personalized therapy and precision medicine in cancer treatment. In this study, we propose two optimization models to de novo discover common driver gene sets among multiple cancer types (ComMDP) and driver gene sets specific to one or multiple cancer types relative to other cancers (SpeMDP), respectively. We first apply ComMDP and SpeMDP to simulated data to validate their efficiency. Then, we further apply these methods to 12 cancer types from The Cancer Genome Atlas (TCGA) and obtain several biologically meaningful driver pathways. As examples, we construct a common cancer pathway model for BRCA and OV, infer a complex driver pathway model for BRCA carcinogenesis based on common driver gene sets of BRCA with eight cancer types, and investigate specific driver pathways of the liquid cancer lymphoblastic acute myeloid leukemia (LAML) versus other solid cancer types. In these processes more candidate cancer genes are also found. PMID:28168295
Faruki, Hawazin; Mayhew, Gregory M; Fan, Cheng; Wilkerson, Matthew D; Parker, Scott; Kam-Morgan, Lauren; Eisenberg, Marcia; Horten, Bruce; Hayes, D Neil; Perou, Charles M; Lai-Goldman, Myla
2016-06-01
Context: A histologic classification of lung cancer subtypes is essential in guiding therapeutic management. Objective: To complement morphology-based classification of lung tumors, a previously developed lung subtyping panel (LSP) of 57 genes was tested using multiple public fresh-frozen gene-expression data sets and a prospectively collected set of formalin-fixed, paraffin-embedded lung tumor samples. Design: The LSP gene-expression signature was evaluated in multiple lung cancer gene-expression data sets totaling 2177 patients collected from 4 platforms: Illumina RNAseq (San Diego, California), Agilent (Santa Clara, California) and Affymetrix (Santa Clara) microarrays, and quantitative reverse transcription-polymerase chain reaction. Gene centroids were calculated for each of 3 genomic-defined subtypes: adenocarcinoma, squamous cell carcinoma, and neuroendocrine, the latter of which encompassed both small cell carcinoma and carcinoid. Classification by LSP into 3 subtypes was evaluated in both fresh-frozen and formalin-fixed, paraffin-embedded tumor samples, and agreement with the original morphology-based diagnosis was determined. Results: The LSP-based classifications demonstrated overall agreement with the original clinical diagnosis ranging from 78% (251 of 322) to 91% (492 of 538 and 869 of 951) in the fresh-frozen public data sets and 84% (65 of 77) in the formalin-fixed, paraffin-embedded data set. The LSP performance was independent of tissue-preservation method and gene-expression platform. Secondary, blinded pathology review of formalin-fixed, paraffin-embedded samples demonstrated concordance of 82% (63 of 77) with the original morphology diagnosis. Conclusions: The LSP gene-expression signature is a reproducible and objective method for classifying lung tumors and demonstrates good concordance with morphology-based classification across multiple data sets. The LSP panel can supplement morphologic assessment of lung cancers, particularly when classification by standard methods is challenging.
Ji, S C; Pan, Y T; Lu, Q Y; Sun, Z Y; Liu, Y Z
2014-03-17
The purpose of this study was to identify critical genes associated with septic multiple trauma by comparing peripheral whole blood samples from multiple trauma patients with and without sepsis. A microarray data set was downloaded from the Gene Expression Omnibus (GEO) database. This data set included 70 samples, 36 from multiple trauma patients with sepsis and 34 from multiple trauma patients without sepsis (as a control set). The data were preprocessed, and differentially expressed genes (DEGs) were then screened using R packages. Functional analysis of DEGs was performed with DAVID. Interaction networks were then established for the most up- and down-regulated genes using HitPredict. Pathway-enrichment analysis was conducted for genes in the networks using WebGestalt. Fifty-eight DEGs were identified. The expression levels of PLAU (down-regulated) and MMP8 (up-regulated) presented the largest fold-changes, and interaction networks were established for these genes. Further analysis revealed that PLAT (plasminogen activator, tissue) and SERPINF2 (serpin peptidase inhibitor, clade F, member 2), which interact with PLAU, play important roles in the complement and coagulation cascade pathway. We hypothesize that PLAU is a major regulator of the complement and coagulation cascade, and that down-regulation of PLAU results in dysfunction of the pathway, causing sepsis.
2014-01-01
Background: In complex large-scale experiments, in addition to simultaneously considering a large number of features, multiple hypotheses are often being tested for each feature. This leads to a problem of multi-dimensional multiple testing. For example, in gene expression studies over ordered categories (such as time-course or dose-response experiments), interest is often in testing differential expression across several categories for each gene. In this paper, we consider a framework for testing multiple sets of hypotheses, which can be applied to a wide range of problems. Results: We adopt the concept of the overall false discovery rate (OFDR) for controlling false discoveries on the hypothesis set level. Based on an existing procedure for identifying differentially expressed gene sets, we discuss a general two-step hierarchical hypothesis set testing procedure, which controls the overall false discovery rate under independence across hypothesis sets. In addition, we discuss the concept of the mixed-directional false discovery rate (mdFDR), and extend the general procedure to enable directional decisions for two-sided alternatives. We applied the framework to the case of microarray time-course/dose-response experiments, and proposed three procedures for testing differential expression and making multiple directional decisions for each gene. Simulation studies confirm the control of the OFDR and mdFDR by the proposed procedures under independence and positive correlations across genes. Simulation results also show that two of our new procedures achieve higher power than previous methods. Finally, the proposed methodology is applied to a microarray dose-response study, to identify 17β-estradiol-sensitive genes in breast cancer cells that are induced at low concentrations. Conclusions: The framework we discuss provides a platform for multiple testing procedures covering situations involving two (or potentially more) sources of multiplicity. The framework is easy to use and adaptable to various practical settings that frequently occur in large-scale experiments. Procedures generated from the framework are shown to maintain control of the OFDR and mdFDR, quantities that are especially relevant in the case of multiple hypothesis set testing. The procedures work well in both simulations and real datasets, and are shown to have better power than existing methods. PMID:24731138
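A generic two-step hierarchical scheme of the kind described above can be sketched as follows. This is an assumption-laden illustration, not the authors' exact procedures: each gene's hypothesis set is screened with a Simes-combined p-value, the Benjamini-Hochberg rule is applied across genes at level q, and the individual hypotheses within each selected gene are then tested with a Bonferroni threshold at the reduced level R·q/m (R selected genes out of m). The helper names and the toy p-values are hypothetical.

```python
def simes(pvals):
    """Simes combination of the p-values belonging to one hypothesis set."""
    ordered = sorted(pvals)
    n = len(ordered)
    return min(n * p / (rank + 1) for rank, p in enumerate(ordered))


def bh_reject(pvals, q):
    """Indices rejected by the Benjamini-Hochberg step-up rule at level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / m:
            cutoff = rank
    return set(order[:cutoff])


def two_step(sets, q=0.05):
    """Step 1: screen sets with BH. Step 2: test within each selected set."""
    names = list(sets)
    screening = [simes(sets[name]) for name in names]
    selected = bh_reject(screening, q)
    level = len(selected) * q / len(names)  # reduced level R * q / m
    decisions = {}
    for i in selected:
        pvals = sets[names[i]]
        # Bonferroni within the set (an assumed, conservative choice).
        decisions[names[i]] = [j for j, p in enumerate(pvals) if p <= level / len(pvals)]
    return decisions


# Toy data: three genes, each tested at three dose levels (hypothetical p-values).
toy = {
    "geneA": [0.0001, 0.2, 0.5],
    "geneB": [0.4, 0.6, 0.9],
    "geneC": [0.001, 0.002, 0.7],
}
result = two_step(toy, q=0.05)
print(result)
```

Testing within a gene only after the gene passes the screening step is what keeps the error rate defined at the hypothesis-set level rather than over all individual hypotheses.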
Naaijen, J; Bralten, J; Poelmans, G; Glennon, J C; Franke, B; Buitelaar, J K
2017-01-10
Attention-deficit/hyperactivity disorder (ADHD) and autism spectrum disorders (ASD) often co-occur. Both are highly heritable; however, it has been difficult to discover genetic risk variants. Glutamate and GABA are main excitatory and inhibitory neurotransmitters in the brain; their balance is essential for proper brain development and functioning. In this study we investigated the role of glutamate and GABA genetics in ADHD severity, autism symptom severity and inhibitory performance, based on gene set analysis, an approach to investigate multiple genetic variants simultaneously. Common variants within glutamatergic and GABAergic genes were investigated using the MAGMA software in an ADHD case-only sample (n=931), in which we assessed ASD symptoms and response inhibition on a Stop task. Gene set analyses for ADHD symptom severity (divided into inattention and hyperactivity/impulsivity symptoms), autism symptom severity and inhibition were performed using principal component regression analyses. Subsequently, gene-wide association analyses were performed. The glutamate gene set showed an association with severity of hyperactivity/impulsivity (P=0.009), which was robust to correcting for genome-wide association levels. The GABA gene set showed nominally significant association with inhibition (P=0.04), but this did not survive correction for multiple comparisons. None of the single-gene or single-variant associations was significant on its own. By analyzing multiple genetic variants within candidate gene sets together, we were able to find genetic associations supporting the involvement of excitatory and inhibitory neurotransmitter systems in ADHD and ASD symptom severity in ADHD.
Gene set analysis of purine and pyrimidine antimetabolites cancer therapies.
Fridley, Brooke L; Batzler, Anthony; Li, Liang; Li, Fang; Matimba, Alice; Jenkins, Gregory D; Ji, Yuan; Wang, Liewei; Weinshilboum, Richard M
2011-11-01
Responses to therapies, either with regard to toxicities or efficacy, are expected to involve complex relationships of gene products within the same molecular pathway or functional gene set. Therefore, pathways or gene sets, as opposed to single genes, may better reflect the true underlying biology and may be more appropriate units for analysis of pharmacogenomic studies. Application of such methods to pharmacogenomic studies may enable the detection of more subtle effects of multiple genes in the same pathway that may be missed by assessing each gene individually. A gene set analysis of 3821 gene sets is presented assessing the association between basal messenger RNA expression and drug cytotoxicity using ethnically defined human lymphoblastoid cell lines for two classes of drugs: pyrimidines [gemcitabine (dFdC) and arabinoside] and purines [6-thioguanine and 6-mercaptopurine]. The gene set nucleoside-diphosphatase activity was found to be significantly associated with both dFdC and arabinoside, whereas gene set γ-aminobutyric acid catabolic process was associated with dFdC and 6-thioguanine. These gene sets were significantly associated with the phenotype even after adjusting for multiple testing. In addition, five associated gene sets were found in common between the pyrimidines and two gene sets for the purines (3',5'-cyclic-AMP phosphodiesterase activity and γ-aminobutyric acid catabolic process) with a P value of less than 0.0001. Functional validation was attempted with four genes each in gene sets for thiopurine and pyrimidine antimetabolites. All four genes selected from the pyrimidine gene sets (PSME3, CANT1, ENTPD6, ADRM1) were validated, but only one (PDE4D) was validated for the thiopurine gene sets. In summary, results from the gene set analysis of pyrimidine and purine therapies, used often in the treatment of various cancers, provide novel insight into the relationship between genomic variation and drug response.
A new fast method for inferring multiple consensus trees using k-medoids.
Tahiri, Nadia; Willems, Matthieu; Makarenkov, Vladimir
2018-04-05
Gene trees carry important information about specific evolutionary patterns which characterize the evolution of the corresponding gene families. However, a reliable species consensus tree cannot be inferred from a multiple sequence alignment of a single gene family or from the concatenation of alignments corresponding to gene families having different evolutionary histories. These evolutionary histories can be quite different due to horizontal transfer events or to ancient gene duplications which cause the emergence of paralogs within a genome. Many methods have been proposed to infer a single consensus tree from a collection of gene trees. Still, the application of these tree merging methods can lead to the loss of specific evolutionary patterns which characterize some gene families or some groups of gene families. Thus, the problem of inferring multiple consensus trees from a given set of gene trees becomes relevant. We describe a new fast method for inferring multiple consensus trees from a given set of phylogenetic trees (i.e. additive trees or X-trees) defined on the same set of species (i.e. objects or taxa). The traditional consensus approach yields a single consensus tree. We use the popular k-medoids partitioning algorithm to divide a given set of trees into several clusters of trees. We propose novel versions of the well-known Silhouette and Caliński-Harabasz cluster validity indices that are adapted for tree clustering with k-medoids. The efficiency of the new method was assessed using both synthetic and real data, such as a well-known phylogenetic dataset consisting of 47 gene trees inferred for 14 archaeal organisms. The method described here allows inference of multiple consensus trees from a given set of gene trees. It can be used to identify groups of gene trees having similar intragroup and different intergroup evolutionary histories. 
The main advantage of our method is that it is much faster than the existing tree clustering approaches, while providing similar or better clustering results in most cases. This makes it particularly well suited for the analysis of large genomic and phylogenetic datasets.
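The partitioning step described above can be illustrated with a minimal k-medoids sketch that works on a precomputed matrix of pairwise tree distances. The distance values below are invented toy numbers, the farthest-first initialisation is an assumed simplification, and the adapted Silhouette and Caliński-Harabasz indices from the abstract are not implemented here.

```python
def k_medoids(dist, k, max_iter=100):
    """Alternating k-medoids on a precomputed symmetric distance matrix."""
    n = len(dist)
    # Farthest-first initialisation (an assumption made for determinism).
    medoids = [0]
    while len(medoids) < k:
        medoids.append(max(range(n), key=lambda i: min(dist[i][m] for m in medoids)))
    for _ in range(max_iter):
        # Assignment step: attach every item to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i in range(n):
            clusters[min(medoids, key=lambda m: dist[i][m])].append(i)
        # Update step: the new medoid minimises total distance within its cluster.
        updated = []
        for m, members in clusters.items():
            members = members or [m]  # guard against an empty cluster
            updated.append(min(members, key=lambda c: sum(dist[c][j] for j in members)))
        if sorted(updated) == sorted(medoids):
            break
        medoids = updated
    labels = [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]
    return medoids, labels


# Toy pairwise distances between five gene trees (hypothetical values):
# trees 0-2 share one history, trees 3-4 share another.
tree_dist = [
    [0, 2, 2, 8, 8],
    [2, 0, 2, 8, 8],
    [2, 2, 0, 8, 8],
    [8, 8, 8, 0, 2],
    [8, 8, 8, 2, 0],
]
medoids, labels = k_medoids(tree_dist, k=2)
print(medoids, labels)
```

A consensus tree would then be built separately for each cluster of trees; the number of clusters k would in practice be chosen with a cluster validity index rather than fixed in advance.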
Ishii, Jun; Kondo, Takashi; Makino, Harumi; Ogura, Akira; Matsuda, Fumio; Kondo, Akihiko
2014-05-01
Yeast has the potential to be used in bulk-scale fermentative production of fuels and chemicals due to its tolerance for low pH and robustness for autolysis. However, expression of multiple external genes in one host yeast strain is considerably labor-intensive due to the lack of polycistronic transcription. To promote the metabolic engineering of yeast, we generated systematic and convenient genetic engineering tools to express multiple genes in Saccharomyces cerevisiae. We constructed a series of multi-copy and integration vector sets for concurrently expressing two or three genes in S. cerevisiae by embedding three classical promoters. The comparative expression capabilities of the constructed vectors were monitored with green fluorescent protein, and the concurrent expression of genes was monitored with three different fluorescent proteins. Our multiple gene expression tool will be helpful to the advanced construction of genetically engineered yeast strains in a variety of research fields other than metabolic engineering. © 2014 Federation of European Microbiological Societies. Published by John Wiley & Sons Ltd. All rights reserved.
Lai, Yinglei; Zhang, Fanni; Nayak, Tapan K; Modarres, Reza; Lee, Norman H; McCaffrey, Timothy A
2014-01-01
Gene set enrichment analysis (GSEA) is an important approach to the analysis of coordinate expression changes at a pathway level. Although many statistical and computational methods have been proposed for GSEA, the issue of a concordant integrative GSEA of multiple expression data sets has not been well addressed. Among different related data sets collected for the same or similar study purposes, it is important to identify pathways or gene sets with concordant enrichment. We categorize the underlying true states of differential expression into three representative categories: no change, positive change and negative change. Due to data noise, what we observe from experiments may not indicate the underlying truth. Although these categories are not observed in practice, they can be considered in a mixture model framework. Then, we define the mathematical concept of concordant gene set enrichment and calculate its related probability based on a three-component multivariate normal mixture model. The related false discovery rate can be calculated and used to rank different gene sets. We used three published lung cancer microarray gene expression data sets to illustrate our proposed method. One analysis based on the first two data sets was conducted to compare our result with a previous published result based on a GSEA conducted separately for each individual data set. This comparison illustrates the advantage of our proposed concordant integrative gene set enrichment analysis. Then, with a relatively new and larger pathway collection, we used our method to conduct an integrative analysis of the first two data sets and also all three data sets. Both results showed that many gene sets could be identified with low false discovery rates. A consistency between both results was also observed. A further exploration based on the KEGG cancer pathway collection showed that a majority of these pathways could be identified by our proposed method. 
This study illustrates that we can improve detection power and discovery consistency through a concordant integrative analysis of multiple large-scale two-sample gene expression data sets.
Lamba, Jatinder K; Crews, Kristine R; Pounds, Stanley B; Cao, Xueyuan; Gandhi, Varsha; Plunkett, William; Razzouk, Bassem I; Lamba, Vishal; Baker, Sharyn D; Raimondi, Susana C; Campana, Dario; Pui, Ching-Hon; Downing, James R; Rubnitz, Jeffrey E; Ribeiro, Raul C
2011-01-01
Aim: To identify gene-expression signatures predicting cytarabine response by an integrative analysis of multiple clinical and pharmacological end points in acute myeloid leukemia (AML) patients. Materials & methods: We performed an integrated analysis to associate the gene expression of diagnostic bone marrow blasts from AML patients treated in the discovery set (AML97; n = 42) and in the independent validation set (AML02; n = 46) with multiple clinical and pharmacological end points. Based on prior biological knowledge, we defined a gene to show a therapeutically beneficial (detrimental) pattern of association if its expression was positively (negatively) correlated with favorable phenotypes such as intracellular cytarabine 5'-triphosphate levels, morphological response and event-free survival, and negatively (positively) correlated with unfavorable end points such as post-cytarabine DNA synthesis levels, minimal residual disease and cytarabine LC50. Results: We identified 240 probe sets predicting a therapeutically beneficial pattern and 97 predicting a detrimental pattern (p ≤ 0.005) in the discovery set. Of these, 60 were confirmed in the independent validation set. The validated probe sets correspond to genes involved in PIK3/PTEN/AKT/mTOR signaling, G-protein-coupled receptor signaling and leukemogenesis. This suggests that targeting these pathways as potential pharmacogenomic and therapeutic candidates could be useful for improving treatment outcomes in AML. Conclusion: This study illustrates the power of integrated data analysis of genomic data as well as multiple clinical and pharmacologic end points in the identification of genes and pathways of biological relevance. PMID:21449673
Naaijen, J; Bralten, J; Poelmans, G; Faraone, Stephen; Asherson, Philip; Banaschewski, Tobias; Buitelaar, Jan; Franke, Barbara; Ebstein, Richard P; Gill, Michael; Miranda, Ana; Oades, Robert D; Roeyers, Herbert; Rothenberger, Aribert; Sergeant, Joseph; Sonuga-Barke, Edmund; Anney, Richard; Mulas, Fernando; Steinhausen, Hans-Christoph; Glennon, J C; Franke, B; Buitelaar, J K
2017-01-01
Attention-deficit/hyperactivity disorder (ADHD) and autism spectrum disorders (ASD) often co-occur. Both are highly heritable; however, it has been difficult to discover genetic risk variants. Glutamate and GABA are main excitatory and inhibitory neurotransmitters in the brain; their balance is essential for proper brain development and functioning. In this study we investigated the role of glutamate and GABA genetics in ADHD severity, autism symptom severity and inhibitory performance, based on gene set analysis, an approach to investigate multiple genetic variants simultaneously. Common variants within glutamatergic and GABAergic genes were investigated using the MAGMA software in an ADHD case-only sample (n=931), in which we assessed ASD symptoms and response inhibition on a Stop task. Gene set analyses for ADHD symptom severity (divided into inattention and hyperactivity/impulsivity symptoms), autism symptom severity and inhibition were performed using principal component regression analyses. Subsequently, gene-wide association analyses were performed. The glutamate gene set showed an association with severity of hyperactivity/impulsivity (P=0.009), which was robust to correcting for genome-wide association levels. The GABA gene set showed nominally significant association with inhibition (P=0.04), but this did not survive correction for multiple comparisons. None of the single-gene or single-variant associations was significant on its own. By analyzing multiple genetic variants within candidate gene sets together, we were able to find genetic associations supporting the involvement of excitatory and inhibitory neurotransmitter systems in ADHD and ASD symptom severity in ADHD. PMID:28072412
Wang, W; Huang, S; Hou, W; Liu, Y; Fan, Q; He, A; Wen, Y; Hao, J; Guo, X; Zhang, F
2017-10-01
Several genome-wide association studies (GWAS) of bone mineral density (BMD) have successfully identified multiple susceptibility genes, yet isolated susceptibility genes are often difficult to interpret biologically. The aim of this study was to unravel the genetic background of BMD at pathway level, by integrating BMD GWAS data with genome-wide expression quantitative trait loci (eQTLs) and methylation quantitative trait loci (meQTLs) data. Methods: We employed the GWAS datasets of BMD from the Genetic Factors for Osteoporosis Consortium (GEFOS), analysing patients' BMD. The areas studied included 32 735 femoral necks, 28 498 lumbar spines, and 8143 forearms. Genome-wide eQTLs (containing 923 021 eQTLs) and meQTLs (containing 683 152 unique methylation sites with local meQTLs) data sets were collected from recently published studies. Gene scores were first calculated by summary data-based Mendelian randomisation (SMR) software and meQTL-aligned GWAS results. Gene set enrichment analysis (GSEA) was then applied to identify BMD-associated gene sets with a predefined significance level of 0.05. We identified multiple gene sets associated with BMD in one or more regions, including relevant known biological gene sets such as the Reactome Circadian Clock (GSEA p-value = 1.0 × 10⁻⁴ for lumbar spine and 2.7 × 10⁻² for femoral neck BMD in eQTLs-based GSEA) and insulin-like growth factor receptor binding (GSEA p-value = 5.0 × 10⁻⁴ for femoral neck and 2.6 × 10⁻² for lumbar spine BMD in meQTLs-based GSEA). Our results provided novel clues for subsequent functional analysis of bone metabolism, and illustrated the benefit of integrating eQTLs and meQTLs data into pathway association analysis for genetic studies of complex human diseases.
Cite this article: W. Wang, S. Huang, W. Hou, Y. Liu, Q. Fan, A. He, Y. Wen, J. Hao, X. Guo, F. Zhang. Integrative analysis of GWAS, eQTLs and meQTLs data suggests that multiple gene sets are associated with bone mineral density. Bone Joint Res 2017;6:572-576. © 2017 Wang et al.
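The enrichment step described in this abstract can be illustrated with a minimal sketch of the classic unweighted running-sum enrichment score and a label-permutation p-value. This is not the authors' pipeline: the function names, the unweighted statistic, the two-sided permutation scheme, and the toy gene identifiers are all assumptions made for illustration, and the SMR-derived gene scores are represented only by a pre-ranked gene list.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score (Kolmogorov-Smirnov-like).

    Walks down the ranked list, stepping up at genes in the set and down
    otherwise, and returns the running sum with the largest absolute value.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running, extreme = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(extreme):
            extreme = running
    return extreme


def permutation_p_value(ranked_genes, gene_set, n_perm=999, seed=0):
    """Two-sided p-value obtained by shuffling the gene ranking (an assumption)."""
    rng = random.Random(seed)
    observed = abs(enrichment_score(ranked_genes, gene_set))
    shuffled = list(ranked_genes)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(enrichment_score(shuffled, gene_set)) >= observed - 1e-12:
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_perm + 1)


# Hypothetical genes g1..g10, ranked from strongest to weakest gene score.
ranked = [f"g{i}" for i in range(1, 11)]
top_set = {"g1", "g2", "g3"}
print(enrichment_score(ranked, top_set), permutation_p_value(ranked, top_set))
```

A gene set would then be called BMD-associated when its p-value falls below the predefined significance level of 0.05 mentioned in the abstract.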
Inference of Evolutionary Forces Acting on Human Biological Pathways
Daub, Josephine T.; Dupanloup, Isabelle; Robinson-Rechavi, Marc; Excoffier, Laurent
2015-01-01
Because natural selection is likely to act on multiple genes underlying a given phenotypic trait, we study here the potential effect of ongoing and past selection on the genetic diversity of human biological pathways. We first show that genes included in gene sets are generally under stronger selective constraints than other genes and that their evolutionary response is correlated. We then introduce a new procedure to detect selection at the pathway level based on a decomposition of the classical McDonald–Kreitman test extended to multiple genes. This new test, called 2DNS, detects outlier gene sets and takes into account past demographic effects and evolutionary constraints specific to gene sets. Selective forces acting on gene sets can be easily identified by a mere visual inspection of the position of the gene sets relative to their two-dimensional null distribution. We thus find several outlier gene sets that show signals of positive, balancing, or purifying selection but also others showing an ancient relaxation of selective constraints. The principle of the 2DNS test can also be applied to other genomic contrasts. For instance, the comparison of patterns of polymorphisms private to African and non-African populations reveals that most pathways show a higher proportion of nonsynonymous mutations in non-Africans than in Africans, potentially due to different demographic histories and selective pressures. PMID:25971280
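For orientation, the classical McDonald-Kreitman test that this abstract builds on compares nonsynonymous and synonymous counts among fixed differences (Dn, Ds) and among polymorphisms (Pn, Ps). The sketch below computes the neutrality index and a two-sided Fisher exact p-value for one 2x2 table, and pools counts across the genes of a set. It is not the 2DNS procedure: the pooling-by-summation step, the function names, and all counts are illustrative assumptions.

```python
from math import comb


def neutrality_index(dn, ds, pn, ps):
    """NI = (Pn/Ps) / (Dn/Ds); values below 1 indicate excess nonsynonymous divergence."""
    if ps == 0 or dn == 0 or ds == 0:
        raise ValueError("Ps, Dn and Ds must be non-zero for the ratio to be defined")
    return (pn / ps) / (dn / ds)


def fisher_exact_two_sided(dn, ds, pn, ps):
    """Two-sided Fisher exact test on the table [[Dn, Ds], [Pn, Ps]]."""
    row1, row2, col1 = dn + ds, pn + ps, dn + pn
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x nonsynonymous fixed differences.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_obs = prob(dn)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one.
    return sum(prob(x) for x in range(low, high + 1) if prob(x) <= p_obs * (1 + 1e-9))


def pool_gene_set(per_gene_counts):
    """Sum (Dn, Ds, Pn, Ps) over the genes of a set (a simplifying assumption)."""
    return tuple(sum(column) for column in zip(*per_gene_counts.values()))


# Made-up counts for a hypothetical three-gene set: (Dn, Ds, Pn, Ps).
counts = {"geneA": (3, 6, 1, 15), "geneB": (2, 5, 0, 14), "geneC": (2, 6, 1, 13)}
dn, ds, pn, ps = pool_gene_set(counts)
print(neutrality_index(dn, ds, pn, ps), fisher_exact_two_sided(dn, ds, pn, ps))
```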
Blatti, Charles; Sinha, Saurabh
2016-07-15
Analysis of co-expressed gene sets typically involves testing for enrichment of different annotations or 'properties' such as biological processes, pathways, transcription factor binding sites, etc., one property at a time. This common approach ignores any known relationships among the properties or the genes themselves. It is believed that known biological relationships among genes and their many properties may be exploited to more accurately reveal commonalities of a gene set. Previous work has sought to achieve this by building biological networks that combine multiple types of gene-gene or gene-property relationships, and performing network analysis to identify other genes and properties most relevant to a given gene set. Most existing network-based approaches for recognizing genes or annotations relevant to a given gene set collapse information about different properties to simplify (homogenize) the networks. We present a network-based method for ranking genes or properties related to a given gene set. Such related genes or properties are identified from among the nodes of a large, heterogeneous network of biological information. Our method involves a random walk with restarts, performed on an initial network with multiple node and edge types that preserve more of the original, specific property information than current methods that operate on homogeneous networks. In this first stage of our algorithm, we find the properties that are the most relevant to the given gene set and extract a subnetwork of the original network, comprising only these relevant properties. We then re-rank genes by their similarity to the given gene set, based on a second random walk with restarts, performed on the above subnetwork. We demonstrate the effectiveness of this algorithm for ranking genes related to Drosophila embryonic development and aggressive responses in the brains of social animals. DRaWR was implemented as an R package available at veda.cs.illinois.edu/DRaWR. 
Contact: blatti@illinois.edu. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
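The core operation named in this abstract, a random walk with restarts, can be sketched on a small undirected, unweighted graph. This is a single-stage walk on a homogeneous toy network, not the two-stage DRaWR algorithm on a heterogeneous network; the function name, the restart probability of 0.3, and the node labels are assumptions chosen for illustration.

```python
def random_walk_with_restart(edges, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Return stationary visiting probabilities for a walk restarting at the seeds.

    Each iteration applies p_next = (1 - restart) * W p + restart * p0, where W
    spreads a node's probability evenly over its neighbours.
    """
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    missing = set(seeds) - set(neighbors)
    if not seeds or missing:
        raise ValueError(f"seeds must be non-empty and present in the graph: {missing}")

    nodes = sorted(neighbors)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            share = (1.0 - restart) * p[u] / len(neighbors[u])
            for v in neighbors[u]:
                nxt[v] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Hypothetical path graph a - b - c - d with the query gene set {a}.
scores = random_walk_with_restart([("a", "b"), ("b", "c"), ("c", "d")], {"a"})
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Nodes closer to the query set receive higher scores, which is the property a ranking of related genes or properties relies on.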
Ficklin, Stephen P.; Luo, Feng; Feltus, F. Alex
2010-01-01
Discovering gene sets underlying the expression of a given phenotype is of great importance, as many phenotypes are the result of complex gene-gene interactions. Gene coexpression networks, built using a set of microarray samples as input, can help elucidate tightly coexpressed gene sets (modules) that are mixed with genes of known and unknown function. Functional enrichment analysis of modules further subdivides the coexpressed gene set into cofunctional gene clusters that may coexist in the module with other functionally related gene clusters. In this study, 45 coexpressed gene modules and 76 cofunctional gene clusters were discovered for rice (Oryza sativa) using a global, knowledge-independent paradigm and the combination of two network construction methodologies. Some clusters were enriched for previously characterized mutant phenotypes, providing evidence for specific gene sets (and their annotated molecular functions) that underlie specific phenotypes. PMID:20668062
Ensemble positive unlabeled learning for disease gene identification.
Yang, Peng; Li, Xiaoli; Chua, Hon-Nian; Kwoh, Chee-Keong; Ng, See-Kiong
2014-01-01
An increasing number of genes have been experimentally confirmed in recent years as causative genes for various human diseases. The newly available knowledge can be exploited by machine learning methods to discover additional unknown genes that are likely to be associated with diseases. In particular, positive unlabeled learning (PU learning) methods, which require only a positive training set P (confirmed disease genes) and an unlabeled set U (the unknown candidate genes) instead of a negative training set N, have been shown to be effective in uncovering new disease genes in the current scenario. Using only a single source of data for prediction can be susceptible to bias due to incompleteness and noise in the genomic data, and a single machine learning predictor is prone to bias caused by inherent limitations of individual methods. In this paper, we propose an effective PU learning framework that integrates multiple biological data sources and an ensemble of powerful machine learning classifiers for disease gene identification. Our proposed method integrates data from multiple biological sources for training PU learning classifiers. A novel ensemble-based PU learning method, EPU, is then used to integrate multiple PU learning classifiers to achieve accurate and robust disease gene predictions. Our evaluation experiments across six disease groups showed that EPU achieved significantly better results compared with various state-of-the-art prediction methods as well as ensemble learning classifiers. Through integrating multiple biological data sources for training and the outputs of an ensemble of PU learning classifiers for prediction, we are able to minimize the potential bias and errors in individual data sources and machine learning algorithms to achieve more accurate and robust disease gene predictions. In the future, our EPU method can provide an effective framework for integrating additional biological and computational resources for better disease gene predictions.
McDonald, Jacqueline U.; Kaforou, Myrsini; Clare, Simon; Hale, Christine; Ivanova, Maria; Huntley, Derek; Dorner, Marcus; Wright, Victoria J.; Levin, Michael; Martinon-Torres, Federico; Herberg, Jethro A.
2016-01-01
ABSTRACT Greater understanding of the functions of host gene products in response to infection is required. While many of these genes enable pathogen clearance, some enhance pathogen growth or contribute to disease symptoms. Many studies have profiled transcriptomic and proteomic responses to infection, generating large data sets, but selecting targets for further study is challenging. Here we propose a novel data-mining approach combining multiple heterogeneous data sets to prioritize genes for further study by using respiratory syncytial virus (RSV) infection as a model pathogen with a significant health care impact. The assumption was that the more frequently a gene is detected across multiple studies, the more important its role is. A literature search was performed to find data sets of genes and proteins that change after RSV infection. The data sets were standardized, collated into a single database, and then panned to determine which genes occurred in multiple data sets, generating a candidate gene list. This candidate gene list was validated by using both a clinical cohort and in vitro screening. We identified several genes that were frequently expressed following RSV infection with no assigned function in RSV control, including IFI27, IFIT3, IFI44L, GBP1, OAS3, IFI44, and IRF7. Drilling down into the function of these genes, we demonstrate a role in disease for the gene for interferon regulatory factor 7, which was highly ranked on the list, but not for IRF1, which was not. Thus, we have developed and validated an approach for collating published data sets into a manageable list of candidates, identifying novel targets for future analysis. IMPORTANCE Making the most of “big data” is one of the core challenges of current biology. There is a large array of heterogeneous data sets of host gene responses to infection, but these data sets do not inform us about gene function and require specialized skill sets and training for their utilization. 
Here we describe an approach that combines and simplifies these data sets, distilling this information into a single list of genes commonly upregulated in response to infection with RSV as a model pathogen. Many of the genes on the list have unknown functions in RSV disease. We validated the gene list with new clinical, in vitro, and in vivo data. This approach allows the rapid selection of genes of interest for further, more-detailed studies, thus reducing time and costs. Furthermore, the approach is simple to use and widely applicable to a range of diseases. PMID:27822537
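The collation step described above (standardize the gene lists, pool them, and keep genes that recur across data sets) can be sketched as a simple recurrence count. The data set names and their memberships below are made up for illustration; only the gene symbols are taken from the abstract, and the function name and `min_count` threshold are assumptions.

```python
from collections import Counter


def rank_genes_by_recurrence(datasets, min_count=2):
    """Rank genes by the number of data sets in which they appear.

    Symbols are standardized (trimmed, upper-cased) and counted once per data set.
    """
    counts = Counter()
    for genes in datasets.values():
        counts.update({gene.strip().upper() for gene in genes})
    recurrent = [(gene, n) for gene, n in counts.items() if n >= min_count]
    # Most frequent first; ties broken alphabetically for a stable order.
    return sorted(recurrent, key=lambda item: (-item[1], item[0]))


# Hypothetical data sets; memberships are invented for the example.
datasets = {
    "study_1": ["IRF7", "IFI27", "OAS3", "irf7"],
    "study_2": ["IRF7", "IFI27", "GBP1"],
    "study_3": ["IRF7", " IFI44L", "IRF1"],
}
print(rank_genes_by_recurrence(datasets))
```

Under the abstract's working assumption, genes at the top of such a list would be the first candidates for validation.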
The Molecular Signatures Database (MSigDB) hallmark gene set collection.
Liberzon, Arthur; Birger, Chet; Thorvaldsdóttir, Helga; Ghandi, Mahmoud; Mesirov, Jill P; Tamayo, Pablo
2015-12-23
The Molecular Signatures Database (MSigDB) is one of the most widely used and comprehensive databases of gene sets for performing gene set enrichment analysis. Since its creation, MSigDB has grown beyond its roots in metabolic disease and cancer to include >10,000 gene sets. These better represent a wider range of biological processes and diseases, but the utility of the database is reduced by increased redundancy across, and heterogeneity within, gene sets. To address this challenge, here we use a combination of automated approaches and expert curation to develop a collection of "hallmark" gene sets as part of MSigDB. Each hallmark in this collection consists of a "refined" gene set, derived from multiple "founder" sets, that conveys a specific biological state or process and displays coherent expression. The hallmarks effectively summarize most of the relevant information of the original founder sets and, by reducing both variation and redundancy, provide more refined and concise inputs for gene set enrichment analysis.
Schaid, Daniel J; Sinnwell, Jason P; Jenkins, Gregory D; McDonnell, Shannon K; Ingle, James N; Kubo, Michiaki; Goss, Paul E; Costantino, Joseph P; Wickerham, D Lawrence; Weinshilboum, Richard M
2012-01-01
Gene-set analyses have been widely used in gene expression studies, and some of the developed methods have been extended to genome-wide association studies (GWAS). Yet, complications due to linkage disequilibrium (LD) among single nucleotide polymorphisms (SNPs), and variable numbers of SNPs per gene and genes per gene-set, have plagued current approaches, often leading to ad hoc "fixes." To overcome some of the current limitations, we developed a general approach to scan GWAS SNP data for both gene-level and gene-set analyses, building on score statistics for generalized linear models, and taking advantage of the directed acyclic graph structure of the gene ontology when creating gene-sets. However, other types of gene-set structures can be used, such as the popular Kyoto Encyclopedia of Genes and Genomes (KEGG). Our approach combines SNPs into genes, and genes into gene-sets, but assures that positive and negative effects of genes on a trait do not cancel. To control for multiple testing of many gene-sets, we use an efficient computational strategy that accounts for LD and provides accurate step-down adjusted P-values for each gene-set. Application of our methods to two different GWAS provides guidance on the potential strengths and weaknesses of our proposed gene-set analyses. © 2011 Wiley Periodicals, Inc.
Rydenfelt, Mattias; Cox, Robert Sidney; Garcia, Hernan; Phillips, Rob
2014-01-01
Transcription factors (TFs) with regulatory action at multiple promoter targets are the rule rather than the exception, with examples ranging from the cAMP receptor protein (CRP) in E. coli that regulates hundreds of different genes simultaneously to situations involving multiple copies of the same gene, such as plasmids, retrotransposons, or highly replicated viral DNA. When the number of TFs heavily exceeds the number of binding sites, TF binding to each promoter can be regarded as independent. However, when the number of TF molecules is comparable to the number of binding sites, TF titration will result in correlation (“promoter entanglement”) between transcription of different genes. We develop a statistical mechanical model which takes the TF titration effect into account and use it to predict both the level of gene expression for a general set of promoters and the resulting correlation in transcription rates of different genes. Our results show that the TF titration effect could be important for understanding gene expression in many regulatory settings. PMID:24580252
Cha, Kihoon; Hwang, Taeho; Oh, Kimin; Yi, Gwan-Su
2015-01-01
Background: It has been reported that several brain diseases can be treated in a transnosological manner, implying a possible common molecular basis underlying those diseases. However, molecular-level commonality among those brain diseases has been largely unexplored. Gene expression analyses of the human brain have been used to find genes associated with brain diseases, but most of those studies were restricted either to an individual disease or to a couple of diseases. In addition, attempts to identify significant genes in such brain diseases have mostly failed when they relied on typical methods based on differentially expressed genes. Results: In this study, we used a correlation-based biclustering approach to find coexpressed gene sets in five neurodegenerative diseases and three psychiatric disorders. By using biclustering analysis, we could efficiently and fairly identify various gene sets expressed specifically in both single and multiple brain diseases. We found 4,307 gene sets correlatively expressed in multiple brain diseases and 3,409 gene sets exclusively specified in individual brain diseases. The function enrichment analysis of those gene sets showed many new possible functional bases as well as neurological processes that are common or specific for those eight diseases. Conclusions: This study introduces possible common molecular bases for several brain diseases, which opens the opportunity to clarify the transnosological perspective assumed in brain diseases. It also shows the advantages of correlation-based biclustering analysis and accompanying function enrichment analysis for gene expression data in this type of investigation. PMID:26043779
Multiple-input multiple-output causal strategies for gene selection.
Bontempi, Gianluca; Haibe-Kains, Benjamin; Desmedt, Christine; Sotiriou, Christos; Quackenbush, John
2011-11-25
Traditional strategies for selecting variables in high dimensional classification problems aim to find sets of maximally relevant variables able to explain the target variations. Although these techniques may be effective in terms of generalization accuracy, they often do not reveal direct causes. This is essentially because high correlation (or relevance) does not imply causation. In this study, we show how to efficiently incorporate causal information into gene selection by moving from a single-input single-output to a multiple-input multiple-output setting. We show in a synthetic case study that a better prioritization of causal variables can be obtained by considering a relevance score which incorporates a causal term. In addition, we show, in a meta-analysis study of six publicly available breast cancer microarray datasets, that the improvement also occurs in terms of accuracy. The biological interpretation of the results confirms the potential of a causal approach to gene selection. Integrating causal information into gene selection algorithms is effective both in terms of prediction accuracy and biological interpretation.
Reliable pre-eclampsia pathways based on multiple independent microarray data sets.
Kawasaki, Kaoru; Kondoh, Eiji; Chigusa, Yoshitsugu; Ujita, Mari; Murakami, Ryusuke; Mogami, Haruta; Brown, J B; Okuno, Yasushi; Konishi, Ikuo
2015-02-01
Pre-eclampsia is a multifactorial disorder characterized by heterogeneous clinical manifestations. Gene expression profiling of preeclamptic placenta has provided different and even opposite results, partly due to data compromised by various experimental artefacts. Here we aimed to identify reliable pre-eclampsia-specific pathways using multiple independent microarray data sets. Gene expression data of control and preeclamptic placentas were obtained from Gene Expression Omnibus. Single-sample gene-set enrichment analysis was performed to generate gene-set activation scores of 9707 pathways obtained from the Molecular Signatures Database. Candidate pathways were identified by t-test-based screening using data sets GSE10588, GSE14722 and GSE25906. Additionally, recursive feature elimination was applied to arrive at a further reduced set of pathways. To assess the validity of the pre-eclampsia pathways, a statistically-validated protocol was executed using five data sets, including two other independent validation data sets, GSE30186 and GSE44711. Quantitative real-time PCR was performed for genes in a panel of potential pre-eclampsia pathways using placentas of 20 women with normal or severe preeclamptic singleton pregnancies (n = 10, respectively). A panel of ten pathways was found to discriminate women with pre-eclampsia from controls with high accuracy. Among these were pathways not previously associated with pre-eclampsia, such as the GABA receptor pathway, as well as pathways that have already been linked to pre-eclampsia, such as the glutathione and CDKN1C pathways. mRNA expression of GABRA3 (GABA receptor pathway), GCLC and GCLM (glutathione metabolic pathway), and CDKN1C was significantly reduced in the preeclamptic placentas. In conclusion, ten accurate and reliable pre-eclampsia pathways were identified based on multiple independent microarray data sets. A pathway-based classification may be a worthwhile approach to elucidate the pathogenesis of pre-eclampsia. © The Author 2014. Published by Oxford University Press on behalf of the European Society of Human Reproduction and Embryology.
Shimoni, Yishai
2018-02-01
One of the goals of cancer research is to identify a set of genes that cause or control disease progression. However, although multiple such gene sets were published, these are usually in very poor agreement with each other, and very few of the genes proved to be functional therapeutic targets. Furthermore, recent findings from a breast cancer gene-expression cohort showed that sets of genes selected randomly can be used to predict survival with a much higher probability than expected. These results imply that many of the genes identified in breast cancer gene expression analysis may not be causal of cancer progression, even though they can still be highly predictive of prognosis. We performed a similar analysis on all the cancer types available in the cancer genome atlas (TCGA), namely, estimating the predictive power of random gene sets for survival. Our work shows that most cancer types exhibit the property that random selections of genes are more predictive of survival than expected. In contrast to previous work, this property is not removed by using a proliferation signature, which implies that proliferation may not always be the confounder that drives this property. We suggest one possible solution in the form of data-driven sub-classification to reduce this property significantly. Our results suggest that the predictive power of random gene sets may be used to identify the existence of sub-classes in the data, and thus may allow better understanding of patient stratification. Furthermore, by reducing the observed bias this may allow more direct identification of biologically relevant, and potentially causal, genes. PMID:29470520
Nicoletti, Paola; Bansal, Mukesh; Lefebvre, Celine; Guarnieri, Paolo; Shen, Yufeng; Pe'er, Itsik; Califano, Andrea; Floratos, Aris
2015-01-01
Stevens-Johnson syndrome (SJS) and Toxic Epidermal Necrolysis (TEN) represent rare but serious adverse drug reactions (ADRs). Both are characterized by distinctive blistering lesions and significant mortality rates. While there is evidence for strong drug-specific genetic predisposition related to HLA alleles, recent genome wide association studies (GWAS) on European and Asian populations have failed to identify genetic susceptibility alleles that are common across multiple drugs. We hypothesize that this is a consequence of the low to moderate effect size of individual genetic risk factors. To test this hypothesis we developed Pointer, a new algorithm that assesses the aggregate effect of multiple low risk variants on a pathway using a gene set enrichment approach. A key advantage of our method is the capability to associate SNPs with genes by exploiting physical proximity as well as by using expression quantitative trait loci (eQTLs) that capture information about both cis- and trans-acting regulatory effects. We control for known bias-inducing aspects of enrichment based analyses, such as: 1) gene length, 2) gene set size, 3) presence of biologically related genes within the same linkage disequilibrium (LD) region, and, 4) genes shared among multiple gene sets. We applied this approach to publicly available SJS/TEN genome-wide genotype data and identified the ABC transporter and Proteasome pathways as potentially implicated in the genetic susceptibility of non-drug-specific SJS/TEN. We demonstrated that the innovative SNP-to-gene mapping phase of the method was essential in detecting the significant enrichment for those pathways. Analysis of an independent gene expression dataset provides supportive functional evidence for the involvement of Proteasome pathways in SJS/TEN cutaneous lesions. 
These results suggest that Pointer provides a useful framework for the integrative analysis of pharmacogenetic GWAS data, by increasing the power to detect aggregate effects of multiple low risk variants. The software is available for download at https://sourceforge.net/projects/pointergsa/.
Xie, Xin-Ping; Xie, Yu-Feng; Wang, Hong-Qiang
2017-08-23
Large-scale accumulation of omics data poses a pressing challenge of integrative analysis of multiple data sets in bioinformatics. An open question of such integrative analysis is how to pinpoint consistent but subtle gene activity patterns across studies. Study heterogeneity needs to be addressed carefully for this goal. This paper proposes a regulation probability model-based meta-analysis, jGRP, for identifying differentially expressed genes (DEGs). The method integrates multiple transcriptomics data sets in a gene regulatory space instead of in a gene expression space, which makes it easy to capture and manage data heterogeneity across studies from different laboratories or platforms. Specifically, we transform gene expression profiles into a united gene regulation profile across studies by mathematically defining two gene regulation events between two conditions and estimating their occurring probabilities in a sample. Finally, a novel differential expression statistic is established based on the gene regulation profiles, realizing accurate and flexible identification of DEGs in gene regulation space. We evaluated the proposed method on simulation data and real-world cancer datasets and showed the effectiveness and efficiency of jGRP in identifying DEGs in the context of meta-analysis. Data heterogeneity largely influences the performance of meta-analysis for DEG identification. Existing meta-analysis methods were found to exhibit very different degrees of sensitivity to study heterogeneity. The proposed method, jGRP, can be a standalone tool due to its united framework and controllable way to deal with study heterogeneity.
Seok, Junhee; Davis, Ronald W; Xiao, Wenzhong
2015-01-01
Accumulated biological knowledge is often encoded as gene sets, collections of genes associated with similar biological functions or pathways. The use of gene sets in the analyses of high-throughput gene expression data has been intensively studied and applied in clinical research. However, the main interest remains in finding modules of biological knowledge, or corresponding gene sets, significantly associated with disease conditions. Risk prediction from censored survival times using gene sets hasn’t been well studied. In this work, we propose a hybrid method that uses both single gene and gene set information together to predict patient survival risks from gene expression profiles. In the proposed method, gene sets provide context-level information that is poorly reflected by single genes. Complementarily, single genes help to supplement incomplete information of gene sets due to our imperfect biomedical knowledge. Through the tests over multiple data sets of cancer and trauma injury, the proposed method showed robust and improved performance compared with the conventional approaches with only single genes or gene sets solely. Additionally, we examined the prediction result in the trauma injury data, and showed that the modules of biological knowledge used in the prediction by the proposed method were highly interpretable in biology. A wide range of survival prediction problems in clinical genomics is expected to benefit from the use of biological knowledge. PMID:25933378
Chowdhury, Nilotpal; Sapru, Shantanu
2015-01-01
Microarray analysis has revolutionized the role of genomic prognostication in breast cancer. However, most studies are single series studies, and suffer from methodological problems. We sought to use a meta-analytic approach in combining multiple publicly available datasets, while correcting for batch effects, to reach a more robust oncogenomic analysis. The aim of the present study was to find gene sets associated with distant metastasis free survival (DMFS) in systemically untreated, node-negative breast cancer patients, from publicly available genomic microarray datasets. Four microarray series (having 742 patients) were selected after a systematic search and combined. Cox regression for each gene was done for the combined dataset (univariate, as well as multivariate - adjusted for expression of Cell cycle related genes) and for the 4 major molecular subtypes. The centre and microarray batch effects were adjusted by including them as random effects variables. The Cox regression coefficients for each analysis were then ranked and subjected to a Gene Set Enrichment Analysis (GSEA). Gene sets representing protein translation were independently negatively associated with metastasis in the Luminal A and Luminal B subtypes, but positively associated with metastasis in Basal tumors. Proteinaceous extracellular matrix (ECM) gene set expression was positively associated with metastasis, after adjustment for expression of cell cycle related genes on the combined dataset. Finally, the positive association of the proliferation-related genes with metastases was confirmed. To the best of our knowledge, the results depicting mixed prognostic significance of protein translation in breast cancer subtypes are being reported for the first time. We attribute this to our study combining multiple series and performing a more robust meta-analytic Cox regression modeling on the combined dataset, thus discovering 'hidden' associations. 
This methodology seems to yield new and interesting results and may be used as a tool to guide new research.
Chowdhury, Nilotpal; Sapru, Shantanu
2015-01-01
PMID:26080057
Evaluating Gene Set Enrichment Analysis Via a Hybrid Data Model
Hua, Jianping; Bittner, Michael L.; Dougherty, Edward R.
2014-01-01
Gene set enrichment analysis (GSA) methods have been widely adopted by biological labs to analyze data and generate hypotheses for validation. Most of the existing comparison studies focus on whether the existing GSA methods can produce accurate P-values; however, practitioners are often more concerned with the correct gene-set ranking generated by the methods. The ranking performance is closely related to two critical goals associated with GSA methods: the ability to reveal biological themes and ensuring reproducibility, especially for small-sample studies. We have conducted a comprehensive simulation study focusing on the ranking performance of seven representative GSA methods. We overcome the limitation on the availability of real data sets by creating hybrid data models from existing large data sets. To build the data model, we pick a master gene from the data set to form the ground truth and artificially generate the phenotype labels. Multiple hybrid data models can be constructed from one data set and multiple data sets of smaller sizes can be generated by resampling the original data set. This approach enables us to generate a large batch of data sets to check the ranking performance of GSA methods. Our simulation study reveals that for the proposed data model, the Q2 type GSA methods have in general better performance than other GSA methods and the global test has the most robust results. The properties of a data set play a critical role in the performance. For the data sets with highly connected genes, all GSA methods suffer significantly in performance. PMID:24558298
Spinelli, Lionel; Carpentier, Sabrina; Montañana Sanchis, Frédéric; Dalod, Marc; Vu Manh, Thien-Phong
2015-10-19
Recent advances in the analysis of high-throughput expression data have led to the development of tools that scaled up their focus from the single-gene to the gene-set level. For example, the popular Gene Set Enrichment Analysis (GSEA) algorithm can detect moderate but coordinated expression changes of groups of presumably related genes between pairs of experimental conditions. This considerably improves extraction of information from high-throughput gene expression data. However, although many gene sets covering a large panel of biological fields are available in public databases, the ability to generate home-made gene sets relevant to one's biological question is crucial but remains a substantial challenge to most biologists lacking statistical or bioinformatic expertise. This is all the more the case when attempting to define a gene set specific to one condition compared to many others. Thus, there is a crucial need for easy-to-use software for generating relevant home-made gene sets from complex datasets, using them in GSEA, and correcting the results when applied to multiple comparisons of many experimental conditions. We developed BubbleGUM (GSEA Unlimited Map), a tool that automatically extracts molecular signatures from transcriptomic data and performs exhaustive GSEA with multiple testing correction. One original feature of BubbleGUM resides in its capacity to integrate and compare numerous GSEA results in an easy-to-grasp graphical representation. We applied our method to generate transcriptomic fingerprints for murine cell types and to assess their enrichments in human cell types. This analysis allowed us to confirm homologies between mouse and human immunocytes. BubbleGUM is open-source software that automatically generates molecular signatures from complex expression datasets and directly assesses their enrichment by GSEA on independent datasets. 
Enrichments are displayed in a graphical output that helps in interpreting the results. This innovative methodology has recently been used to answer important questions in functional genomics, such as the degree of similarity between microarray datasets from different laboratories or with different experimental models or clinical cohorts. BubbleGUM is executable through an intuitive interface so that both bioinformaticians and biologists can use it. It is available at http://www.ciml.univ-mrs.fr/applications/BubbleGUM/index.html .
Turning publicly available gene expression data into discoveries using gene set context analysis.
Ji, Zhicheng; Vokes, Steven A; Dang, Chi V; Ji, Hongkai
2016-01-08
Gene Set Context Analysis (GSCA) is an open source software package to help researchers use massive amounts of publicly available gene expression data (PED) to make discoveries. Users can interactively visualize and explore gene and gene set activities in 25,000+ consistently normalized human and mouse gene expression samples representing diverse biological contexts (e.g. different cells, tissues and disease types, etc.). By providing one or multiple genes or gene sets as input and specifying a gene set activity pattern of interest, users can query the expression compendium to systematically identify biological contexts associated with the specified gene set activity pattern. In this way, researchers with new gene sets from their own experiments may discover previously unknown contexts of gene set functions and hence increase the value of their experiments. GSCA has a graphical user interface (GUI). The GUI makes the analysis convenient and customizable. Analysis results can be conveniently exported as publication quality figures and tables. GSCA is available at https://github.com/zji90/GSCA. This software significantly lowers the bar for biomedical investigators to use PED in their daily research for generating and screening hypotheses, which was previously difficult because of the complexity, heterogeneity and size of the data. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
Grewal, Nivit; Singh, Shailendra; Chand, Trilok
2017-01-01
Owing to the innate noise in biological data sources, a single source or a single measure does not suffice for effective disease gene prioritization. The integration of multiple data sources or the aggregation of multiple measures is therefore needed. Aggregation operators combine multiple related data values into a single value such that the combined value reflects all the individual values. In this paper, fuzzy aggregation is applied to network-based disease gene prioritization and its effect under noisy conditions is investigated. The study was conducted for a set of 15 blood disorders by fusing four different network measures, computed from the protein interaction network, using a selected set of aggregation operators and ranking the genes on the basis of the aggregated value. The aggregation operator-based rankings were compared with the "Random walk with restart" gene prioritization method. The impact of noise was also investigated by adding varying proportions of noise to the seed set. The results reveal that, for all the selected blood disorders, the Mean of Maximal operator outperformed the other aggregation operators for noisy as well as non-noisy data.
Larson, Nicholas B; McDonnell, Shannon; Cannon Albright, Lisa; Teerlink, Craig; Stanford, Janet; Ostrander, Elaine A; Isaacs, William B; Xu, Jianfeng; Cooney, Kathleen A; Lange, Ethan; Schleutker, Johanna; Carpten, John D; Powell, Isaac; Bailey-Wilson, Joan E; Cussenot, Olivier; Cancel-Tassin, Geraldine; Giles, Graham G; MacInnis, Robert J; Maier, Christiane; Whittemore, Alice S; Hsieh, Chih-Lin; Wiklund, Fredrik; Catalona, William J; Foulkes, William; Mandal, Diptasri; Eeles, Rosalind; Kote-Jarai, Zsofia; Ackerman, Michael J; Olson, Timothy M; Klein, Christopher J; Thibodeau, Stephen N; Schaid, Daniel J
2017-05-01
Next-generation sequencing technologies have afforded unprecedented characterization of low-frequency and rare genetic variation. Due to low power for single-variant testing, aggregative methods are commonly used to combine observed rare variation within a single gene. Causal variation may also aggregate across multiple genes within relevant biomolecular pathways. Kernel-machine regression and adaptive testing methods for aggregative rare-variant association testing have been demonstrated to be powerful approaches for pathway-level analysis, although these methods tend to be computationally intensive at high-variant dimensionality and require access to complete data. An additional analytical issue in scans of large pathway definition sets is multiple testing correction. Gene set definitions may exhibit substantial genic overlap, and the impact of the resultant correlation in test statistics on Type I error rate control for large agnostic gene set scans has not been fully explored. Herein, we first outline a statistical strategy for aggregative rare-variant analysis using component gene-level linear kernel score test summary statistics as well as derive simple estimators of the effective number of tests for family-wise error rate control. We then conduct extensive simulation studies to characterize the behavior of our approach relative to direct application of kernel and adaptive methods under a variety of conditions. We also apply our method to two case-control studies, respectively, evaluating rare variation in hereditary prostate cancer and schizophrenia. Finally, we provide open-source R code for public use to facilitate easy application of our methods to existing rare-variant analysis results. © 2017 WILEY PERIODICALS, INC.
Multiple Testing of Gene Sets from Gene Ontology: Possibilities and Pitfalls.
Meijer, Rosa J; Goeman, Jelle J
2016-09-01
The use of multiple testing procedures in the context of gene-set testing is an important but relatively underexposed topic. If a multiple testing method is used, this is usually a standard familywise error rate (FWER) or false discovery rate (FDR) controlling procedure in which the logical relationships that exist between the different (self-contained) hypotheses are not taken into account. Taking those relationships into account, however, can lead to more powerful variants of existing multiple testing procedures and can make summarizing and interpreting the final results easier. We will show that, from the perspective of interpretation as well as from the perspective of power improvement, FWER controlling methods are more suitable than FDR controlling methods. As an example of a possible power improvement, we suggest a modified version of the popular method by Holm, which we also implemented in the R package cherry. © The Author 2015. Published by Oxford University Press.
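As a point of reference for the abstract above, the standard (unmodified) Holm step-down adjustment can be sketched in Python as follows. This is not the modified procedure proposed by the authors and does not use the cherry package; the function name holm_adjust is hypothetical and chosen only for illustration.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values, in the original order.

    Standard Holm procedure only (an illustrative sketch, not the
    modified variant described in the abstract).
    """
    m = len(p_values)
    # Indices of the hypotheses, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is multiplied by (m - k + 1).
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: adjusted p-values never decrease along the ordering.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted
```

A gene set is then rejected at FWER level alpha when its adjusted p-value is at most alpha, for example holm_adjust([0.01, 0.04, 0.03]) compared against 0.05.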
SET1A/COMPASS and shadow enhancers in the regulation of homeotic gene expression
Cao, Kaixiang; Collings, Clayton K.; Marshall, Stacy A.; Morgan, Marc A.; Rendleman, Emily J.; Wang, Lu; Sze, Christie C.; Sun, Tianjiao; Bartom, Elizabeth T.; Shilatifard, Ali
2017-01-01
The homeotic (Hox) genes are highly conserved in metazoans, where they are required for various processes in development, and misregulation of their expression is associated with human cancer. In the developing embryo, Hox genes are activated sequentially in time and space according to their genomic position within Hox gene clusters. Accumulating evidence implicates both enhancer elements and noncoding RNAs in controlling this spatiotemporal expression of Hox genes, but disentangling their relative contributions is challenging. Here, we identify two cis-regulatory elements (E1 and E2) functioning as shadow enhancers to regulate the early expression of the HoxA genes. Simultaneous deletion of these shadow enhancers in embryonic stem cells leads to impaired activation of HoxA genes upon differentiation, while knockdown of a long noncoding RNA overlapping E1 has no detectable effect on their expression. Although MLL/COMPASS (complex of proteins associated with Set1) family of histone methyltransferases is known to activate transcription of Hox genes in other contexts, we found that individual inactivation of the MLL1-4/COMPASS family members has little effect on early Hox gene activation. Instead, we demonstrate that SET1A/COMPASS is required for full transcriptional activation of multiple Hox genes but functions independently of the E1 and E2 cis-regulatory elements. Our results reveal multiple regulatory layers for Hox genes to fine-tune transcriptional programs essential for development. PMID:28487406
Gene set analysis using variance component tests.
Huang, Yen-Tsung; Lin, Xihong
2013-06-28
Gene set analyses have become increasingly important in genomic research, as many complex diseases result jointly from alterations of numerous genes. Genes often coordinate together as a functional repertoire, e.g., a biological pathway/network, and are highly correlated. However, most of the existing gene set analysis methods do not fully account for the correlation among the genes. Here we propose to tackle this important feature of a gene set to improve statistical power in gene set analyses. We propose to model the effects of an independent variable, e.g., exposure/biological status (yes/no), on multiple gene expression values in a gene set using a multivariate linear regression model, where the correlation among the genes is explicitly modeled using a working covariance matrix. We develop TEGS (Test for the Effect of a Gene Set), a variance component test for the gene set effects by assuming a common distribution for regression coefficients in multivariate linear regression models, and calculate the p-values using permutation and a scaled chi-square approximation. We show using simulations that type I error is protected under different choices of working covariance matrices and power is improved as the working covariance approaches the true covariance. The global test is a special case of TEGS when correlation among genes in a gene set is ignored. Using both simulation data and a published diabetes dataset, we show that our test outperforms the commonly used approaches, the global test and gene set enrichment analysis (GSEA). We develop a gene set analysis method (TEGS) under the multivariate regression framework, which directly models the interdependence of the expression values in a gene set using a working covariance. TEGS outperforms two widely used methods, GSEA and the global test, in both simulations and a diabetes microarray dataset.
Simultaneous Identification of Multiple Driver Pathways in Cancer
Leiserson, Mark D. M.; Blokh, Dima
2013-01-01
Distinguishing the somatic mutations responsible for cancer (driver mutations) from random, passenger mutations is a key challenge in cancer genomics. Driver mutations generally target cellular signaling and regulatory pathways consisting of multiple genes. This heterogeneity complicates the identification of driver mutations by their recurrence across samples, as different combinations of mutations in driver pathways are observed in different samples. We introduce the Multi-Dendrix algorithm for the simultaneous identification of multiple driver pathways de novo in somatic mutation data from a cohort of cancer samples. The algorithm relies on two combinatorial properties of mutations in a driver pathway: high coverage and mutual exclusivity. We derive an integer linear program that finds sets of mutations exhibiting these properties. We apply Multi-Dendrix to somatic mutations from glioblastoma, breast cancer, and lung cancer samples. Multi-Dendrix identifies sets of mutations in genes that overlap with known pathways – including Rb, p53, PI(3)K, and cell cycle pathways – and also novel sets of mutually exclusive mutations, including mutations in several transcription factors or other genes involved in transcriptional regulation. These sets are discovered directly from mutation data with no prior knowledge of pathways or gene interactions. We show that Multi-Dendrix outperforms other algorithms for identifying combinations of mutations and is also orders of magnitude faster on genome-scale data. Software available at: http://compbio.cs.brown.edu/software. PMID:23717195
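The two properties named in the abstract above, coverage and mutual exclusivity, can be combined into a single score for a candidate gene set. The Python sketch below is an assumption for illustration: the abstract does not state a formula, so the score used here (samples covered minus the count of redundant, overlapping mutations) and the names coverage_exclusivity_weight and mutated_samples are hypothetical. It only scores one given set and does not implement the integer linear program.

```python
def coverage_exclusivity_weight(gene_set, mutated_samples):
    """Score a gene set by coverage minus coverage overlap.

    gene_set: iterable of gene names.
    mutated_samples: dict mapping gene name -> set of sample IDs in which
    the gene is mutated (hypothetical input layout).
    The scoring formula is an assumed illustration, not taken from the abstract.
    """
    covered = set()   # samples with a mutation in at least one gene of the set
    total = 0         # sum of per-gene mutated-sample counts
    for gene in gene_set:
        samples = mutated_samples.get(gene, set())
        covered |= samples
        total += len(samples)
    coverage = len(covered)
    # Overlap counts mutations beyond the first one in each covered sample;
    # it is zero exactly when the genes are mutually exclusive.
    overlap = total - coverage
    return coverage - overlap
```

With mutated_samples = {"A": {1, 2}, "B": {3}, "C": {2, 3}}, the mutually exclusive pair ("A", "B") scores 3, while ("A", "C") covers the same three samples but scores 2 because sample 2 is mutated in both genes.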
Allen Brain Atlas-Driven Visualizations: a web-based gene expression energy visualization tool.
Zaldivar, Andrew; Krichmar, Jeffrey L
2014-01-01
The Allen Brain Atlas-Driven Visualizations (ABADV) is a publicly accessible web-based tool created to retrieve and visualize expression energy data from the Allen Brain Atlas (ABA) across multiple genes and brain structures. Though the ABA offers its own search engine and software for researchers to view its growing collection of online public data sets, including extensive gene expression and neuroanatomical data from human and mouse brain, many of its tools limit the number of genes and brain structures researchers can view at once. To complement this work, ABADV generates multiple pie charts, bar charts and heat maps of expression energy values for any given set of genes and brain structures. Such a suite of free and easy-to-understand visualizations allows for easy comparison of gene expression across multiple brain areas. In addition, each visualization links back to the ABA so researchers may view a summary of the experimental detail. ABADV is currently supported on modern web browsers and is compatible with expression energy data from the Allen Mouse Brain Atlas in situ hybridization data. With this web application, researchers can immediately obtain and survey large amounts of expression energy data from the ABA, which they can then use to supplement their work or perform meta-analysis. In the future, we hope to enable ABADV across multiple data resources.
GO-based functional dissimilarity of gene sets.
Díaz-Díaz, Norberto; Aguilar-Ruiz, Jesús S
2011-09-01
The Gene Ontology (GO) provides a controlled vocabulary for describing the functions of genes and can be used to evaluate the functional coherence of gene sets. Many functional coherence measures consider each pair of gene functions in a set and produce an output based on all pairwise distances. A single gene can encode multiple proteins that may differ in function. For each functionality, other proteins that exhibit the same activity may also participate. Therefore, identifying the most common function for all of the genes involved in a biological process is important in evaluating the functional similarity of groups of genes, and a quantification of functional coherence can help to clarify the role of a group of genes working together. To implement this approach to functional assessment, we present GFD (GO-based Functional Dissimilarity), a novel dissimilarity measure for evaluating groups of genes based on the most relevant functions of the whole set. The measure assigns a numerical value to the gene set for each of the three GO sub-ontologies. Results show that GFD performs robustly when applied to gene sets of known functionality (extracted from KEGG). It performs particularly well on randomly generated gene sets. An ROC analysis reveals that the performance of GFD in evaluating the functional dissimilarity of gene sets is very satisfactory. A comparative analysis against other functional measures, such as GS2 and those presented by Resnik and Wang, also demonstrates the robustness of GFD.
Gene selection with multiple ordering criteria.
Chen, James J; Tsai, Chen-An; Tzeng, Shengli; Chen, Chun-Houh
2007-03-05
A microarray study may select different differentially expressed gene sets because of different selection criteria. For example, the fold-change and p-value are two commonly known criteria to select differentially expressed genes under two experimental conditions. These two selection criteria often result in incompatible selected gene sets. Also, in a two-factor, say, treatment by time experiment, the investigator may be interested in one gene list that responds to both treatment and time effects. We propose three layer ranking algorithms, point-admissible, line-admissible (convex), and Pareto, to provide a preference gene list from multiple gene lists generated by different ranking criteria. Using the public colon data as an example, the layer ranking algorithms are applied to the three univariate ranking criteria, fold-change, p-value, and frequency of selections by the SVM-RFE classifier. A simulation experiment shows that for experiments with small or moderate sample sizes (less than 20 per group) and detecting a 4-fold change or less, the two-dimensional (p-value and fold-change) convex layer ranking selects differentially expressed genes with generally lower FDR and higher power than the standard p-value ranking. Three applications are presented. The first application illustrates a use of the layer rankings to potentially improve predictive accuracy. The second application illustrates an application to a two-factor experiment involving two dose levels and two time points. The layer rankings are applied to selecting differentially expressed genes relating to the dose and time effects. In the third application, the layer rankings are applied to a benchmark data set consisting of three dilution concentrations to provide a ranking system from a long list of differentially expressed genes generated from the three dilution concentrations. 
The layer ranking algorithms are useful to help investigators in selecting the most promising genes from multiple gene lists generated by different filter, normalization, or analysis methods for various objectives.
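The Pareto variant of the layer ranking described above can be illustrated with a short Python sketch. It assumes each gene has a tuple of criterion scores where larger is better (for instance absolute log fold-change and negative log p-value); the function names dominates and pareto_layers and the input layout are hypothetical, and the point-admissible and convex variants are not covered.

```python
def dominates(a, b):
    """True if score tuple a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_layers(scores):
    """Peel off successive Pareto fronts.

    scores: dict mapping gene name -> tuple of criterion scores
    (larger = more interesting; an assumed convention).
    Returns a list of layers; layer 0 holds the genes not dominated by any
    other gene, layer 1 the genes not dominated once layer 0 is removed, etc.
    """
    remaining = dict(scores)
    layers = []
    while remaining:
        front = [
            gene
            for gene, s in remaining.items()
            if not any(dominates(t, s) for other, t in remaining.items() if other != gene)
        ]
        layers.append(sorted(front))
        for gene in front:
            del remaining[gene]
    return layers
```

Genes in earlier layers are preferred; ties within a layer are left unordered here, which is a simplification.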
Training set selection for the prediction of essential genes.
Cheng, Jian; Xu, Zhao; Wu, Wenwu; Zhao, Li; Li, Xiangchen; Liu, Yanlin; Tao, Shiheng
2014-01-01
Various computational models have been developed to transfer annotations of gene essentiality between organisms. However, despite the increasing number of microorganisms with well-characterized sets of essential genes, selection of appropriate training sets for predicting the essential genes of poorly-studied or newly sequenced organisms remains challenging. In this study, a machine learning approach was applied reciprocally to predict the essential genes in 21 microorganisms. Results showed that training set selection greatly influenced predictive accuracy. We determined four criteria for training set selection: (1) essential genes in the selected training set should be reliable; (2) the growth conditions in which essential genes are defined should be consistent in training and prediction sets; (3) species used as training set should be closely related to the target organism; and (4) organisms used as training and prediction sets should exhibit similar phenotypes or lifestyles. We then analyzed the performance of an incomplete training set and an integrated training set with multiple organisms. We found that the size of the training set should be at least 10% of the total genes to yield accurate predictions. Additionally, the integrated training sets exhibited remarkable increase in stability and accuracy compared with single sets. Finally, we compared the performance of the integrated training sets with the four criteria and with random selection. The results revealed that a rational selection of training sets based on our criteria yields better performance than random selection. Thus, our results provide empirical guidance on training set selection for the identification of essential genes on a genome-wide scale.
Integrative Functional Genomics for Systems Genetics in GeneWeaver.org.
Bubier, Jason A; Langston, Michael A; Baker, Erich J; Chesler, Elissa J
2017-01-01
The abundance of existing functional genomics studies permits an integrative approach to interpreting and resolving the results of diverse systems genetics studies. However, a major challenge lies in assembling and harmonizing heterogeneous data sets across species for facile comparison to the positional candidate genes and coexpression networks that come from systems genetic studies. GeneWeaver is an online database and suite of tools at www.geneweaver.org that allows for fast aggregation and analysis of gene set-centric data. GeneWeaver contains curated experimental data together with resource-level data such as GO annotations, MP annotations, and KEGG pathways, along with persistent stores of user-entered data sets. These can be entered directly into GeneWeaver or transferred from widely used resources such as GeneNetwork.org. Data are analyzed using statistical tools and advanced graph algorithms to discover new relations, prioritize candidate genes, and generate function hypotheses. Here we use GeneWeaver to find genes common to multiple gene sets, prioritize candidate genes from a quantitative trait locus, and characterize a set of differentially expressed genes. Coupling a large multispecies repository of curated and empirical functional genomics data to fast computational tools allows for the rapid integrative analysis of heterogeneous data for interpreting and extrapolating systems genetics results.
Multiple origins of interdependent endosymbiotic complexes in a genus of cicadas.
Łukasik, Piotr; Nazario, Katherine; Van Leuven, James T; Campbell, Matthew A; Meyer, Mariah; Michalik, Anna; Pessacq, Pablo; Simon, Chris; Veloso, Claudio; McCutcheon, John P
2018-01-09
Bacterial endosymbionts that provide nutrients to hosts often have genomes that are extremely stable in structure and gene content. In contrast, the genome of the endosymbiont Hodgkinia cicadicola has fractured into multiple distinct lineages in some species of the cicada genus Tettigades. To better understand the frequency, timing, and outcomes of Hodgkinia lineage splitting throughout this cicada genus, we sampled cicadas over three field seasons in Chile and performed genomics and microscopy on representative samples. We found that a single ancestral Hodgkinia lineage has split at least six independent times in Tettigades over the last 4 million years, resulting in complexes of between two and six distinct Hodgkinia lineages per host. Individual genomes in these symbiotic complexes differ dramatically in relative abundance, genome size, organization, and gene content. Each Hodgkinia lineage retains a small set of core genes involved in genetic information processing, but the high level of gene loss experienced by all genomes suggests that extensive sharing of gene products among symbiont cells must occur. In total, Hodgkinia complexes that consist of multiple lineages encode nearly complete sets of genes present on the ancestral single lineage and presumably perform the same functions as symbionts that have not undergone splitting. However, differences in the timing of the splits, along with dissimilar gene loss patterns on the resulting genomes, have led to very different outcomes of lineage splitting in extant cicadas.
A Comprehensive Analysis of Nuclear-Encoded Mitochondrial Genes in Schizophrenia.
Gonçalves, Vanessa F; Cappi, Carolina; Hagen, Christian M; Sequeira, Adolfo; Vawter, Marquis P; Derkach, Andriy; Zai, Clement C; Hedley, Paula L; Bybjerg-Grauholm, Jonas; Pouget, Jennie G; Cuperfain, Ari B; Sullivan, Patrick F; Christiansen, Michael; Kennedy, James L; Sun, Lei
2018-05-01
The genetic risk factors of schizophrenia (SCZ), a severe psychiatric disorder, are not yet fully understood. Multiple lines of evidence suggest that mitochondrial dysfunction may play a role in SCZ, but comprehensive association studies are lacking. We hypothesized that variants in nuclear-encoded mitochondrial genes influence susceptibility to SCZ. We conducted gene-based and gene-set analyses using summary association results from the Psychiatric Genomics Consortium Schizophrenia Phase 2 (PGC-SCZ2) genome-wide association study comprising 35,476 cases and 46,839 control subjects. We applied the MAGMA method to three sets of nuclear-encoded mitochondrial genes: oxidative phosphorylation genes, other nuclear-encoded mitochondrial genes, and genes involved in nucleus-mitochondria crosstalk. Furthermore, we conducted a replication study using the iPSYCH SCZ sample of 2290 cases and 21,621 control subjects. In the PGC-SCZ2 sample, 1186 mitochondrial genes were analyzed, among which 159 had p values < .05 and 19 remained significant after multiple testing correction. A meta-analysis of 818 genes combining the PGC-SCZ2 and iPSYCH samples resulted in 104 nominally significant and nine significant genes, suggesting a polygenic model for the nuclear-encoded mitochondrial genes. Gene-set analysis, however, did not show significant results. In an in silico protein-protein interaction network analysis, 14 mitochondrial genes interacted directly with 158 SCZ risk genes identified in PGC-SCZ2 (permutation p = .02), and aldosterone signaling in epithelial cells and mitochondrial dysfunction pathways appeared to be overrepresented in this network of mitochondrial and SCZ risk genes. This study provides evidence that specific aspects of mitochondrial function may play a role in SCZ, but we did not observe its broad involvement even using a large sample. Copyright © 2018 Society of Biological Psychiatry. Published by Elsevier Inc. All rights reserved.
Microbial genotype-phenotype mapping by class association rule mining.
Tamura, Makio; D'haeseleer, Patrik
2008-07-01
Microbial phenotypes are typically due to the concerted action of multiple gene functions, yet the presence of each gene may have only a weak correlation with the observed phenotype. Hence, it may be more appropriate to examine co-occurrence between sets of genes and a phenotype (multiple-to-one) instead of pairwise relations between a single gene and the phenotype. Here, we propose an efficient class association rule mining algorithm, netCAR, in order to extract sets of COGs (clusters of orthologous groups of proteins) associated with a phenotype from COG phylogenetic profiles and a phenotype profile. netCAR takes into account the phylogenetic co-occurrence graph between COGs to restrict the hypothesis space, and uses mutual information to evaluate the biconditional relation. We examined the mining capability of pairwise and multiple-to-one association by using netCAR to extract COGs relevant to six microbial phenotypes (aerobic, anaerobic, facultative, endospore, motility and Gram negative) from 11,969 unique COG profiles across 155 prokaryotic organisms. With the same level of false discovery rate, multiple-to-one association can extract about 10 times more relevant COGs than one-to-one association. We also reveal various topologies of association networks among COGs (modules) from extracted multiple-to-one correlation rules relevant to the six phenotypes, including a well-connected network for motility, a star-shaped network for aerobic and intermediate topologies for the other phenotypes. netCAR outperforms a standard CAR mining algorithm, CARapriori, while requiring several orders of magnitude less computational time for extracting 3-COG sets. Source code of the Java implementation is available as Supplementary Material at the Bioinformatics online website, or upon request to the author. Supplementary data are available at Bioinformatics online.
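The multiple-to-one idea in the abstract above can be illustrated with a small sketch. This is not netCAR itself: it only computes mutual information between a binary presence profile and a binary phenotype profile, and compares single COGs against the joint presence (logical AND) of two COGs. The profiles `cog_a`, `cog_b` and `phenotype`, and the function name `mutual_information`, are hypothetical and chosen for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (count / n) * log2(count * n / (px[x] * py[y]))
    return mi


# Hypothetical presence/absence profiles across 8 organisms (1 = present).
cog_a = [1, 1, 1, 1, 0, 0, 1, 0]
cog_b = [1, 1, 0, 1, 1, 0, 0, 1]
phenotype = [1, 1, 0, 1, 0, 0, 0, 0]

# Multiple-to-one: the phenotype is tested against the joint presence of both COGs.
both = [a & b for a, b in zip(cog_a, cog_b)]

mi_a = mutual_information(cog_a, phenotype)
mi_b = mutual_information(cog_b, phenotype)
mi_both = mutual_information(both, phenotype)
print(f"COG A alone: {mi_a:.3f} bits")
print(f"COG B alone: {mi_b:.3f} bits")
print(f"A AND B:     {mi_both:.3f} bits")
```

In this toy data each COG alone is only weakly informative, while the pair reproduces the phenotype profile exactly, which is the situation the abstract describes as motivating set-based rules.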
Time-Course Gene Set Analysis for Longitudinal Gene Expression Data
Hejblum, Boris P.; Skinner, Jason; Thiébaut, Rodolphe
2015-01-01
Gene set analysis methods, which consider predefined groups of genes in the analysis of genomic data, have been successfully applied for analyzing gene expression data in cross-sectional studies. The time-course gene set analysis (TcGSA) introduced here is an extension of gene set analysis to longitudinal data. The proposed method relies on random effects modeling with maximum likelihood estimates. It allows the use of all available repeated measurements while dealing with unbalanced data due to missing at random (MAR) measurements. TcGSA is a hypothesis-driven method that identifies a priori defined gene sets with significant expression variations over time, taking into account the potential heterogeneity of expression within gene sets. When biological conditions are compared, the method indicates if the time patterns of gene sets significantly differ according to these conditions. The interest of the method is illustrated by its application to two real life datasets: an HIV therapeutic vaccine trial (DALIA-1 trial), and data from a recent study on influenza and pneumococcal vaccines. In the DALIA-1 trial TcGSA revealed a significant change in gene expression over time within 69 gene sets during vaccination, while a standard univariate individual gene analysis corrected for multiple testing as well as a standard Gene Set Enrichment Analysis (GSEA) for time series both failed to detect any significant pattern change over time. When applied to the second illustrative data set, TcGSA allowed the identification of 4 gene sets finally found to be linked with the influenza vaccine too, although they were found to be associated with the pneumococcal vaccine only in previous analyses. In our simulation study TcGSA exhibits good statistical properties, and an increased power compared to other approaches for analyzing time-course expression patterns of gene sets. The method is made available for the community through an R package. PMID:26111374
Srivastava, Mousami; Khurana, Pankaj; Sugadev, Ragumani
2012-11-02
The tissue-specific Unigene Sets derived from more than one million expressed sequence tags (ESTs) in the NCBI GenBank database offer a platform for identifying significantly and differentially expressed tissue-specific genes by in-silico methods. Digital differential display (DDD) rapidly creates transcription profiles based on EST comparisons and numerically calculates, as a fraction of the pool of ESTs, the relative sequence abundance of known and novel genes. However, the process of identifying the most likely tissue for a specific disease in which to search for candidate genes from the pool of differentially expressed genes remains difficult. Therefore, we have used 'Gene Ontology semantic similarity score' to measure the GO similarity between gene products of lung tissue-specific candidate genes from control (normal) and disease (cancer) sets. This semantic similarity score matrix was hierarchically clustered and is represented in the form of a dendrogram. The dendrogram cluster stability was assessed by multiple bootstrapping. Multiple bootstrapping also computes a p-value for each cluster and corrects the bias of the bootstrap probability. Subsequent hierarchical clustering by the multiple bootstrapping method (α = 0.95) identified seven clusters. The comparative, as well as subtractive, approach revealed a set of 38 biomarkers comprising four distinct lung cancer signature biomarker clusters (panel 1-4). Further gene enrichment analysis of the four panels revealed that each panel represents a set of lung cancer linked metastasis diagnostic biomarkers (panel 1), chemotherapy/drug resistance biomarkers (panel 2), hypoxia regulated biomarkers (panel 3) and lung extracellular matrix biomarkers (panel 4). Expression analysis reveals that hypoxia induced lung cancer related biomarkers (panel 3), HIF and its modulating proteins (TGM2, CSNK1A1, CTNNA1, NAMPT/Visfatin, TNFRSF1A, ETS1, SRC-1, FN1, APLP2, DMBT1/SAG, AIB1 and AZIN1) are significantly down regulated. 
All down regulated genes in this panel were highly up regulated in most other types of cancers. These panels of proteins may represent signature biomarkers for lung cancer and will aid in lung cancer diagnosis and disease monitoring as well as in the prediction of responses to therapeutics.
Integrating Multiple Data Sources for Combinatorial Marker Discovery: A Study in Tumorigenesis.
Bandyopadhyay, Sanghamitra; Mallik, Saurav
2018-01-01
Identification of combinatorial markers from multiple data sources is a challenging task in bioinformatics. Here, we propose a novel computational framework for identifying significant combinatorial markers ( s) using both gene expression and methylation data. The gene expression and methylation data are integrated into a single continuous data as well as a (post-discretized) boolean data based on their intrinsic (i.e., inverse) relationship. A novel combined score of methylation and expression data (viz., ) is introduced which is computed on the integrated continuous data for identifying initial non-redundant set of genes. Thereafter, (maximal) frequent closed homogeneous genesets are identified using a well-known biclustering algorithm applied on the integrated boolean data of the determined non-redundant set of genes. A novel sample-based weighted support ( ) is then proposed that is consecutively calculated on the integrated boolean data of the determined non-redundant set of genes in order to identify the non-redundant significant genesets. The top few resulting genesets are identified as potential s. Since our proposed method generates a smaller number of significant non-redundant genesets than those by other popular methods, the method is much faster than the others. Application of the proposed technique on an expression and a methylation data for Uterine tumor or Prostate Carcinoma produces a set of significant combination of markers. We expect that such a combination of markers will produce lower false positives than individual markers.
Sadee, Wolfgang
2013-09-01
Pharmacogenetic biomarker tests include mostly specific single gene-drug pairs, capable of accounting for a portion of interindividual variability in drug response and toxicity. However, multiple genes are likely to contribute, either acting independently or epistatically, with the CYP2C9-VKORC1-warfarin test panel being an example of a clinically used gene-gene-drug interaction. I discuss here further instances of gene-gene-drug interactions, including a proposed dynamic effect on statin therapy by genetic variants in both a transporter (SLCO1B1) and a metabolizing enzyme (CYP3A4) in liver cells, the main target site where statins block cholesterol synthesis. These examples set a conceptual framework for developing diagnostic panels involving multiple gene-drug combinations. Copyright © 2013 Wiley Periodicals, Inc.
Logue, Mark W.; Smith, Alicia K.; Baldwin, Clinton; Wolf, Erika J.; Guffanti, Guia; Ratanatharathorn, Andrew; Stone, Annjanette; Schichman, Steven A.; Humphries, Donald; Binder, Elisabeth B.; Arloth, Janine; Menke, Andreas; Uddin, Monica; Wildman, Derek; Galea, Sandro; Aiello, Allison E.; Koenen, Karestan C.; Miller, Mark W.
2015-01-01
We examined the association between posttraumatic stress disorder (PTSD) and gene expression using whole blood samples from a cohort of trauma-exposed white non-Hispanic male veterans (115 cases and 28 controls). 10,264 probes of genes and gene transcripts were analyzed. We found 41 that were differentially expressed in PTSD cases versus controls (multiple-testing corrected p<0.05). The most significant was DSCAM, a neurological gene expressed widely in the developing brain and in the amygdala and hippocampus of the adult brain. We then examined the 41 differentially expressed genes in a meta-analysis using two replication cohorts and found significant associations with PTSD for 7 of the 41 (p<0.05), one of which (ATP6AP1L) survived multiple-testing correction. There was also broad evidence of overlap across the discovery and replication samples for the entire set of genes implicated in the discovery data based on the direction of effect and an enrichment of p<0.05 significant probes beyond what would be expected under the null. Finally, we found that the set of differentially expressed genes from the discovery sample was enriched for genes responsive to glucocorticoid signaling with most showing reduced expression in PTSD cases compared to controls. PMID:25867994
Array data extractor (ADE): a LabVIEW program to extract and merge gene array data.
Kurtenbach, Stefan; Kurtenbach, Sarah; Zoidl, Georg
2013-12-01
Large data sets from gene expression array studies are publicly available offering information highly valuable for research across many disciplines ranging from fundamental to clinical research. Highly advanced bioinformatics tools have been made available to researchers, but a demand for user-friendly software allowing researchers to quickly extract expression information for multiple genes from multiple studies persists. Here, we present a user-friendly LabVIEW program to automatically extract gene expression data for a list of genes from multiple normalized microarray datasets. Functionality was tested for 288 class A G protein-coupled receptors (GPCRs) and expression data from 12 studies comparing normal and diseased human hearts. Results confirmed known regulation of a beta 1 adrenergic receptor and further indicate novel research targets. Although existing software allows for complex data analyses, the LabVIEW based program presented here, "Array Data Extractor (ADE)", provides users with a tool to retrieve meaningful information from multiple normalized gene expression datasets in a fast and easy way. Further, the graphical programming language used in LabVIEW allows applying changes to the program without the need of advanced programming knowledge.
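The extract-and-merge task described in the abstract above can be sketched in a few lines. This is an illustration only and not the ADE LabVIEW program: the dataset names, gene symbols, expression values and the function `extract_genes` are all hypothetical, and each normalized dataset is assumed to be already loaded as a mapping from gene symbol to expression value.

```python
def extract_genes(datasets, genes):
    """Collect expression values for the requested genes from every dataset.

    datasets: {study_name: {gene_symbol: expression_value}}
    genes:    list of gene symbols to look up
    Returns {gene_symbol: {study_name: value or None}}; None marks a gene
    that is absent from a given study.
    """
    merged = {}
    for gene in genes:
        merged[gene] = {study: table.get(gene) for study, table in datasets.items()}
    return merged


# Hypothetical normalized expression tables from two studies.
datasets = {
    "study_a": {"ADRB1": 7.2, "ADRB2": 5.1, "GAPDH": 12.3},
    "study_b": {"ADRB1": 6.8, "GAPDH": 12.0},
}

merged = extract_genes(datasets, ["ADRB1", "ADRB2"])
for gene, per_study in merged.items():
    print(gene, per_study)
```

Keeping an explicit None for missing genes makes it visible which studies did not measure a gene, rather than silently dropping it from the merged table.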
Newton, Richard; Wernisch, Lorenz
2014-01-01
Inferring gene regulatory relationships from observational data is challenging. Manipulation and intervention is often required to unravel causal relationships unambiguously. However, gene copy number changes, as they frequently occur in cancer cells, might be considered natural manipulation experiments on gene expression. An increasing number of data sets on matched array comparative genomic hybridisation and transcriptomics experiments from a variety of cancer pathologies are becoming publicly available. Here we explore the potential of a meta-analysis of thirty such data sets. The aim of our analysis was to assess the potential of in silico inference of trans-acting gene regulatory relationships from this type of data. We found sufficient correlation signal in the data to infer gene regulatory relationships, with interesting similarities between data sets. A number of genes had highly correlated copy number and expression changes in many of the data sets and we present predicted potential trans-acted regulatory relationships for each of these genes. The study also investigates to what extent heterogeneity between cell types and between pathologies determines the number of statistically significant predictions available from a meta-analysis of experiments. PMID:25148247
Fan, Qianrui; Wang, Wenyu; Hao, Jingcan; He, Awen; Wen, Yan; Guo, Xiong; Wu, Cuiyan; Ning, Yujie; Wang, Xi; Wang, Sen; Zhang, Feng
2017-08-01
Neuroticism is a fundamental personality trait with a significant genetic determinant. To identify novel susceptibility genes for neuroticism, we conducted an integrative analysis of genomic and transcriptomic data from a genome-wide association study (GWAS) and an expression quantitative trait locus (eQTL) study. GWAS summary data were derived from published studies of neuroticism, involving a total of 170,906 subjects. An eQTL dataset containing 927,753 eQTLs was obtained from an eQTL meta-analysis of 5311 samples. Integrative analysis of GWAS and eQTL data was conducted with summary data-based Mendelian randomization (SMR) analysis software. To identify neuroticism-associated gene sets, the SMR analysis results were further subjected to gene set enrichment analysis (GSEA). The gene set annotation dataset (containing 13,311 annotated gene sets) of the GSEA Molecular Signatures Database was used. SMR single gene analysis identified 6 significant genes for neuroticism, including MSRA (p value=2.27×10^-10), MGC57346 (p value=6.92×10^-7), BLK (p value=1.01×10^-6), XKR6 (p value=1.11×10^-6), C17ORF69 (p value=1.12×10^-6) and KIAA1267 (p value=4.00×10^-6). Gene set enrichment analysis detected a significant association for the Chr8p23 gene set (false discovery rate=0.033). Our results provide novel clues for the genetic mechanism studies of neuroticism. Copyright © 2017. Published by Elsevier Inc.
Gene integrated set profile analysis: a context-based approach for inferring biological endpoints
Kowalski, Jeanne; Dwivedi, Bhakti; Newman, Scott; Switchenko, Jeffery M.; Pauly, Rini; Gutman, David A.; Arora, Jyoti; Gandhi, Khanjan; Ainslie, Kylie; Doho, Gregory; Qin, Zhaohui; Moreno, Carlos S.; Rossi, Michael R.; Vertino, Paula M.; Lonial, Sagar; Bernal-Mizrachi, Leon; Boise, Lawrence H.
2016-01-01
The identification of genes with specific patterns of change (e.g. down-regulated and methylated) as phenotype drivers or samples with similar profiles for a given gene set as drivers of clinical outcome, requires the integration of several genomic data types for which an ‘integrate by intersection’ (IBI) approach is often applied. In this approach, results from separate analyses of each data type are intersected, which has the limitation of a smaller intersection with more data types. We introduce a new method, GISPA (Gene Integrated Set Profile Analysis) for integrated genomic analysis and its variation, SISPA (Sample Integrated Set Profile Analysis) for defining respective genes and samples with the context of similar, a priori specified molecular profiles. With GISPA, the user defines a molecular profile that is compared among several classes and obtains ranked gene sets that satisfy the profile as drivers of each class. With SISPA, the user defines a gene set that satisfies a profile and obtains sample groups of profile activity. Our results from applying GISPA to human multiple myeloma (MM) cell lines contained genes of known profiles and importance, along with several novel targets, and their further SISPA application to MM coMMpass trial data showed clinical relevance. PMID:26826710
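The limitation of the 'integrate by intersection' approach noted in the abstract above, a smaller intersection with more data types, can be seen in a toy sketch. The gene identifiers and per-data-type hit lists below are hypothetical, and the function name `intersect_progressively` is an assumption made for illustration; this is not the GISPA method.

```python
def intersect_progressively(hits):
    """Intersect per-data-type gene lists one at a time.

    hits: {data_type: set of significant genes}, processed in insertion order.
    Returns the final intersection and the running intersection size
    after each data type is added.
    """
    common = None
    sizes = []
    for data_type, genes in hits.items():
        common = set(genes) if common is None else common & genes
        sizes.append((data_type, len(common)))
    return common, sizes


# Hypothetical results from separate analyses of each data type.
hits = {
    "expression": {"G1", "G2", "G3", "G4", "G5"},
    "methylation": {"G2", "G3", "G4", "G6"},
    "copy_number": {"G3", "G4", "G7"},
    "mutation": {"G4", "G8"},
}

common, sizes = intersect_progressively(hits)
for data_type, size in sizes:
    print(f"after adding {data_type}: {size} gene(s) remain")
print("final intersection:", sorted(common))
```

Each added data type can only keep or shrink the running intersection, so a gene with a strong signal in most but not all data types is lost.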
APPRIS 2017: principal isoforms for multiple gene sets
Rodriguez-Rivas, Juan; Di Domenico, Tomás; Vázquez, Jesús; Valencia, Alfonso
2018-01-01
The APPRIS database (http://appris-tools.org) uses protein structural and functional features and information from cross-species conservation to annotate splice isoforms in protein-coding genes. APPRIS selects a single protein isoform, the ‘principal’ isoform, as the reference for each gene based on these annotations. A single main splice isoform reflects the biological reality for most protein coding genes and APPRIS principal isoforms are the best predictors of these main protein isoforms. Here, we present the updates to the database, new developments that include the addition of three new species (chimpanzee, Drosophila melanogaster and Caenorhabditis elegans), the expansion of APPRIS to cover the RefSeq gene set and the UniProtKB proteome for six species and refinements in the core methods that make up the annotation pipeline. In addition APPRIS now provides a measure of reliability for individual principal isoforms and updates with each release of the GENCODE/Ensembl and RefSeq reference sets. The individual GENCODE/Ensembl, RefSeq and UniProtKB reference gene sets for six organisms have been merged to produce common sets of splice variants. PMID:29069475
Cooperation and coexpression: How coexpression networks shift in response to multiple mutualists.
Palakurty, Sathvik X; Stinchcombe, John R; Afkhami, Michelle E
2018-04-01
A mechanistic understanding of community ecology requires tackling the nonadditive effects of multispecies interactions, a challenge that necessitates integration of ecological and molecular complexity-namely moving beyond pairwise ecological interaction studies and the "gene at a time" approach to mechanism. Here, we investigate the consequences of multispecies mutualisms for the structure and function of genomewide differential coexpression networks for the first time, using the tractable and ecologically important interaction between legume Medicago truncatula, rhizobia and mycorrhizal fungi. First, we found that genes whose expression is affected nonadditively by multiple mutualists are more highly connected in gene networks than expected by chance and had 94% greater network centrality than genes showing additive effects, suggesting that nonadditive genes may be key players in the widespread transcriptomic responses to multispecies symbioses. Second, multispecies mutualisms substantially changed coexpression network structure of 18 modules of host plant genes and 22 modules of the fungal symbionts' genes, indicating that third-party mutualists can cause significant rewiring of plant and fungal molecular networks. Third, we found that 60% of the coexpressed gene sets that explained variation in plant performance had coexpression structures that were altered by interactive effects of rhizobia and fungi. Finally, an "across-symbiosis" approach identified sets of plant and mycorrhizal genes whose coexpression structure was unique to the multiple mutualist context and suggested coupled responses across the plant-mycorrhizal interaction to rhizobial mutualists. Taken together, these results show multispecies mutualisms have substantial effects on the molecular interactions in host plants, microbes and across symbiotic boundaries. © 2018 John Wiley & Sons Ltd.
Olson, Nathan D.; Lund, Steven P.; Zook, Justin M.; Rojas-Cornejo, Fabiola; Beck, Brian; Foy, Carole; Huggett, Jim; Whale, Alexandra S.; Sui, Zhiwei; Baoutina, Anna; Dobeson, Michael; Partis, Lina; Morrow, Jayne B.
2015-01-01
This study presents the results from an interlaboratory sequencing study for which we developed a novel high-resolution method for comparing data from different sequencing platforms for a multi-copy, paralogous gene. The combination of PCR amplification and 16S ribosomal RNA gene (16S rRNA) sequencing has revolutionized bacteriology by enabling rapid identification, frequently without the need for culture. To assess variability between laboratories in sequencing 16S rRNA, six laboratories sequenced the gene encoding the 16S rRNA from Escherichia coli O157:H7 strain EDL933 and Listeria monocytogenes serovar 4b strain NCTC11994. Participants performed sequencing methods and protocols available in their laboratories: Sanger sequencing, Roche 454 pyrosequencing®, or Ion Torrent PGM®. The sequencing data were evaluated on three levels: (1) identity of biologically conserved position, (2) ratio of 16S rRNA gene copies featuring identified variants, and (3) the collection of variant combinations in a set of 16S rRNA gene copies. The same set of biologically conserved positions was identified for each sequencing method. Analytical methods using Bayesian and maximum likelihood statistics were developed to estimate variant copy ratios, which describe the ratio of nucleotides at each identified biologically variable position, as well as the likely set of variant combinations present in 16S rRNA gene copies. Our results indicate that estimated variant copy ratios at biologically variable positions were only reproducible for high throughput sequencing methods. Furthermore, the likely variant combination set was only reproducible with increased sequencing depth and longer read lengths. We also demonstrate novel methods for evaluating variable positions when comparing multi-copy gene sequence data from multiple laboratories generated using multiple sequencing technologies. PMID:27077030
PLEXdb: Gene expression resources for plants and plant pathogens
USDA-ARS's Scientific Manuscript database
PLEXdb (Plant Expression Database), in partnership with community databases, supports comparisons of gene expression across multiple plant and pathogen species, promoting individuals and/or consortia to upload genome-scale data sets to contrast them to previously archived data. These analyses facili...
Gupta, Mayetri; Cheung, Ching-Lung; Hsu, Yi-Hsiang; Demissie, Serkalem; Cupples, L Adrienne; Kiel, Douglas P; Karasik, David
2011-06-01
Genome-wide association studies (GWAS) using high-density genotyping platforms offer an unbiased strategy to identify new candidate genes for osteoporosis. It is imperative to be able to clearly distinguish signal from noise by focusing on the best phenotype in a genetic study. We performed GWAS of multiple phenotypes associated with fractures [bone mineral density (BMD), bone quantitative ultrasound (QUS), bone geometry, and muscle mass] with approximately 433,000 single-nucleotide polymorphisms (SNPs) and created a database of resulting associations. We performed analysis of GWAS data from 23 phenotypes by a novel modification of a block clustering algorithm followed by gene-set enrichment analysis. A data matrix of standardized regression coefficients was partitioned along both axes--SNPs and phenotypes. Each partition represents a distinct cluster of SNPs that have similar effects over a particular set of phenotypes. Application of this method to our data shows several SNP-phenotype connections. We found a strong cluster of association coefficients of high magnitude for 10 traits (BMD at several skeletal sites, ultrasound measures, cross-sectional bone area, and section modulus of femoral neck and shaft). These clustered traits were highly genetically correlated. Gene-set enrichment analyses indicated the augmentation of genes that cluster with the 10 osteoporosis-related traits in pathways such as aldosterone signaling in epithelial cells, role of osteoblasts, osteoclasts, and chondrocytes in rheumatoid arthritis, and Parkinson signaling. In addition to several known candidate genes, we also identified PRKCH and SCNN1B as potential candidate genes for multiple bone traits. In conclusion, our mining of GWAS results revealed the similarity of association results between bone strength phenotypes that may be attributed to pleiotropic effects of genes. 
This knowledge may prove helpful in identifying novel genes and pathways that underlie several correlated phenotypes, as well as in deciphering genetic and phenotypic modularity underlying osteoporosis risk. Copyright © 2011 American Society for Bone and Mineral Research.
Chau, John H; Rahfeldt, Wolfgang A; Olmstead, Richard G
2018-03-01
Targeted sequence capture can be used to efficiently gather sequence data for large numbers of loci, such as single-copy nuclear loci. Most published studies in plants have used taxon-specific locus sets developed individually for a clade using multiple genomic and transcriptomic resources. General locus sets can also be developed from loci that have been identified as single-copy and have orthologs in large clades of plants. We identify and compare a taxon-specific locus set and three general locus sets (conserved ortholog set [COSII], shared single-copy nuclear [APVO SSC] genes, and pentatricopeptide repeat [PPR] genes) for targeted sequence capture in Buddleja (Scrophulariaceae) and outgroups. We evaluate their performance in terms of assembly success, sequence variability, and resolution and support of inferred phylogenetic trees. The taxon-specific locus set had the most target loci. Assembly success was high for all locus sets in Buddleja samples. For outgroups, general locus sets had greater assembly success. Taxon-specific and PPR loci had the highest average variability. The taxon-specific data set produced the best-supported tree, but all data sets showed improved resolution over previous non-sequence capture data sets. General locus sets can be a useful source of sequence capture targets, especially if multiple genomic resources are not available for a taxon.
Melroy-Greif, Whitney E; Simonson, Matthew A; Corley, Robin P; Lutz, Sharon M; Hokanson, John E; Ehringer, Marissa A
2017-04-01
Cigarette smoking is a physiologically harmful habit. Nicotinic acetylcholine receptors (nAChRs) are bound by nicotine and upregulated in response to chronic exposure to nicotine. It is known that upregulation of these receptors is not due to a change in mRNA of these genes, however, more precise details on the process are still uncertain, with several plausible hypotheses describing how nAChRs are upregulated. We have manually curated a set of genes believed to play a role in nicotine-induced nAChR upregulation. Here, we test the hypothesis that these genes are associated with and contribute risk for nicotine dependence (ND) and the number of cigarettes smoked per day (CPD). Studies with genotypic data on European and African Americans (EAs and AAs, respectively) were collected and a gene-based test was run to test for an association between each gene and ND and CPD. Although several novel genes were associated with CPD and ND at P < 0.05 in EAs and AAs, these associations did not survive correction for multiple testing. Previous associations between CHRNA3, CHRNA5, CHRNB4 and CPD in EAs were replicated. Our hypothesis-driven approach avoided many of the limitations inherent in pathway analyses and provided nominal evidence for association between cholinergic-related genes and nicotine behaviors. We evaluated the evidence for association between a manually curated set of genes and nicotine behaviors in European and African Americans. Although no genes were associated after multiple testing correction, this study has several strengths: by manually curating a set of genes we circumvented the limitations inherent in many pathway analyses and tested several genes that had not yet been examined in a human genetic study; gene-based tests are a useful way to test for association with a set of genes; and these genes were collected based on literature review and conversations with experts, highlighting the importance of scientific collaboration. © The Author 2016. 
Published by Oxford University Press on behalf of the Society for Research on Nicotine and Tobacco. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
A Guide to the PLAZA 3.0 Plant Comparative Genomic Database.
Vandepoele, Klaas
2017-01-01
PLAZA 3.0 is an online resource for comparative genomics and offers a versatile platform to study gene functions and gene families or to analyze genome organization and evolution in the green plant lineage. Starting from genome sequence information for over 35 plant species, precomputed comparative genomic data sets cover homologous gene families, multiple sequence alignments, phylogenetic trees, and genomic colinearity information within and between species. Complementary functional data sets, a Workbench, and interactive visualization tools are available through a user-friendly web interface, making PLAZA an excellent starting point to translate sequence or omics data sets into biological knowledge. PLAZA is available at http://bioinformatics.psb.ugent.be/plaza/ .
When is hub gene selection better than standard meta-analysis?
Langfelder, Peter; Mischel, Paul S; Horvath, Steve
2013-01-01
Since hub nodes have been found to play important roles in many networks, highly connected hub genes are expected to play an important role in biology as well. However, the empirical evidence remains ambiguous. An open question is whether (or when) hub gene selection leads to more meaningful gene lists than a standard statistical analysis based on significance testing when analyzing genomic data sets (e.g., gene expression or DNA methylation data). Here we address this question for the special case when multiple genomic data sets are available. This is of great practical importance since for many research questions multiple data sets are publicly available. In this case, the data analyst can decide between a standard statistical approach (e.g., based on meta-analysis) and a co-expression network analysis approach that selects intramodular hubs in consensus modules. We assess the performance of these two types of approaches according to two criteria. The first criterion evaluates the biological insights gained and is relevant in basic research. The second criterion evaluates the validation success (reproducibility) in independent data sets and often applies in clinical diagnostic or prognostic applications. We compare meta-analysis with consensus network analysis based on weighted correlation network analysis (WGCNA) in three comprehensive and unbiased empirical studies: (1) finding genes predictive of lung cancer survival, (2) finding methylation markers related to age, and (3) finding mouse genes related to total cholesterol. The results demonstrate that intramodular hub gene status with respect to consensus modules is more useful than a meta-analysis p-value when identifying biologically meaningful gene lists (reflecting criterion 1). However, standard meta-analysis methods perform as well as (if not better than) a consensus network approach in terms of validation success (criterion 2). 
The article also reports a comparison of meta-analysis techniques applied to gene expression data and presents novel R functions for carrying out consensus network analysis, network based screening, and meta analysis.
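As a point of reference for what a "meta-analysis p-value" across several data sets can look like, the sketch below uses Fisher's combined probability test. This is a swapped-in, textbook technique chosen for illustration; the abstract above does not state which meta-analysis methods the article compares, and the p-values and the function name `fisher_combined_p` are hypothetical.

```python
from math import exp, log


def fisher_combined_p(pvalues):
    """Combine independent p-values for one gene with Fisher's method.

    The statistic -2 * sum(ln p_i) follows a chi-square distribution with
    2k degrees of freedom under the null; for even degrees of freedom the
    upper-tail probability has the closed form
    exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    """
    k = len(pvalues)
    half = -sum(log(p) for p in pvalues)  # this is statistic / 2
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return exp(-half) * total


# Hypothetical p-values for one gene in three independent data sets.
per_dataset_p = [0.04, 0.03, 0.06]
combined = fisher_combined_p(per_dataset_p)
print(f"combined p-value: {combined:.4g}")
```

Genes would then be ranked by the combined p-value, which is the kind of significance-based list that the abstract contrasts with ranking by intramodular hub status.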
A Fast Multiple-Kernel Method With Applications to Detect Gene-Environment Interaction.
Marceau, Rachel; Lu, Wenbin; Holloway, Shannon; Sale, Michèle M; Worrall, Bradford B; Williams, Stephen R; Hsu, Fang-Chi; Tzeng, Jung-Ying
2015-09-01
Kernel machine (KM) models are a powerful tool for exploring associations between sets of genetic variants and complex traits. Although most KM methods use a single kernel function to assess the marginal effect of a variable set, KM analyses involving multiple kernels have become increasingly popular. Multikernel analysis allows researchers to study more complex problems, such as assessing gene-gene or gene-environment interactions, incorporating variance-component based methods for population substructure into rare-variant association testing, and assessing the conditional effects of a variable set adjusting for other variable sets. The KM framework is robust, powerful, and provides efficient dimension reduction for multifactor analyses, but requires the estimation of high dimensional nuisance parameters. Traditional estimation techniques, including regularization and the expectation-maximization (EM) algorithm, have a large computational cost and are not scalable to the large sample sizes needed for rare variant analysis. Therefore, in the context of gene-environment interaction, we propose a computationally efficient and statistically rigorous "fastKM" algorithm for multikernel analysis that is based on a low-rank approximation to the nuisance effect kernel matrices. Our algorithm is applicable to various trait types (e.g., continuous, binary, and survival traits) and can be implemented using any existing single-kernel analysis software. Through extensive simulation studies, we show that our algorithm has similar performance to an EM-based KM approach for quantitative traits while running much faster. We also apply our method to the Vitamin Intervention for Stroke Prevention (VISP) clinical trial, examining gene-by-vitamin effects on recurrent stroke risk and gene-by-age effects on change in homocysteine level.
Endeavour update: a web resource for gene prioritization in multiple species
Tranchevent, Léon-Charles; Barriot, Roland; Yu, Shi; Van Vooren, Steven; Van Loo, Peter; Coessens, Bert; De Moor, Bart; Aerts, Stein; Moreau, Yves
2008-01-01
Endeavour (http://www.esat.kuleuven.be/endeavourweb; this web site is free and open to all users and there is no login requirement) is a web resource for the prioritization of candidate genes. Using a training set of genes known to be involved in a biological process of interest, our approach consists of (i) inferring several models (based on various genomic data sources), (ii) applying each model to the candidate genes to rank those candidates against the profile of the known genes and (iii) merging the several rankings into a global ranking of the candidate genes. In the present article, we describe the latest developments of Endeavour. First, we provide a web-based user interface, besides our Java client, to make Endeavour more universally accessible. Second, we support multiple species: in addition to Homo sapiens, we now provide gene prioritization for three major model organisms: Mus musculus, Rattus norvegicus and Caenorhabditis elegans. Third, Endeavour makes use of additional data sources and now includes numerous databases: ontologies and annotations, protein–protein interactions, cis-regulatory information, gene expression data sets, sequence information and text-mining data. We tested the novel version of Endeavour on 32 recent disease gene associations from the literature. Additionally, we describe a number of recent independent studies that made use of Endeavour to prioritize candidate genes for obesity and Type II diabetes, cleft lip and cleft palate, and pulmonary fibrosis. PMID:18508807
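Step (iii) above, merging per-source rankings into one global ranking, can be sketched with a simple mean-rank rule. This is a stand-in chosen for brevity. The abstract does not describe Endeavour's actual fusion procedure, and the source names, gene names and scores below are hypothetical.

```python
def ranks_from_scores(scores):
    """Map each candidate to its rank (1 = best) given higher-is-better scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {gene: position for position, gene in enumerate(ordered, start=1)}

def merge_rankings(per_source_scores):
    """Average each candidate's rank over all data sources, then sort by it."""
    rankings = [ranks_from_scores(s) for s in per_source_scores.values()]
    candidates = list(rankings[0])
    mean_rank = {
        g: sum(r[g] for r in rankings) / len(rankings) for g in candidates
    }
    return sorted(candidates, key=mean_rank.get)

# Hypothetical similarity-to-training-set scores from three data sources.
sources = {
    "expression": {"g1": 0.9, "g2": 0.4, "g3": 0.7},
    "interactions": {"g1": 0.6, "g2": 0.2, "g3": 0.8},
    "text_mining": {"g1": 0.8, "g2": 0.5, "g3": 0.3},
}

global_ranking = merge_rankings(sources)
print(global_ranking)
```

Mean rank treats every source as equally informative. A real prioritization tool would also need to handle candidates that are missing from some sources, which this sketch ignores.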
Isolation with Migration Models for More Than Two Populations
Hey, Jody
2010-01-01
A method for studying the divergence of multiple closely related populations is described and assessed. The approach of Hey and Nielsen (2007, Integration within the Felsenstein equation for improved Markov chain Monte Carlo methods in population genetics. Proc Natl Acad Sci USA. 104:2785–2790) for fitting an isolation-with-migration model was extended to the case of multiple populations with a known phylogeny. Analysis of simulated data sets reveals the kinds of history that are accessible with a multipopulation analysis. Necessarily, processes associated with older time periods in a phylogeny are more difficult to estimate; and histories with high levels of gene flow are particularly difficult with more than two populations. However, for histories with modest levels of gene flow, or for very large data sets, it is possible to study large complex divergence problems that involve multiple closely related populations or species. PMID:19955477
Isolation with migration models for more than two populations.
Hey, Jody
2010-04-01
A method for studying the divergence of multiple closely related populations is described and assessed. The approach of Hey and Nielsen (2007, Integration within the Felsenstein equation for improved Markov chain Monte Carlo methods in population genetics. Proc Natl Acad Sci USA. 104:2785-2790) for fitting an isolation-with-migration model was extended to the case of multiple populations with a known phylogeny. Analysis of simulated data sets reveals the kinds of history that are accessible with a multipopulation analysis. Necessarily, processes associated with older time periods in a phylogeny are more difficult to estimate; and histories with high levels of gene flow are particularly difficult with more than two populations. However, for histories with modest levels of gene flow, or for very large data sets, it is possible to study large complex divergence problems that involve multiple closely related populations or species.
NIMEFI: gene regulatory network inference using multiple ensemble feature importance algorithms.
Ruyssinck, Joeri; Huynh-Thu, Vân Anh; Geurts, Pierre; Dhaene, Tom; Demeester, Piet; Saeys, Yvan
2014-01-01
One of the long-standing open challenges in computational systems biology is the topology inference of gene regulatory networks from high-throughput omics data. Recently, two community-wide efforts, DREAM4 and DREAM5, have been established to benchmark network inference techniques using gene expression measurements. In these challenges the overall top performer was the GENIE3 algorithm. This method decomposes the network inference task into separate regression problems for each gene in the network in which the expression values of a particular target gene are predicted using all other genes as possible predictors. Next, using tree-based ensemble methods, an importance measure for each predictor gene is calculated with respect to the target gene and a high feature importance is considered as putative evidence of a regulatory link existing between both genes. The contribution of this work is twofold. First, we generalize the regression decomposition strategy of GENIE3 to other feature importance methods. We compare the performance of support vector regression, the elastic net, random forest regression, symbolic regression and their ensemble variants in this setting to the original GENIE3 algorithm. To create the ensemble variants, we propose a subsampling approach which allows us to cast any feature selection algorithm that produces a feature ranking into an ensemble feature importance algorithm. We demonstrate that the ensemble setting is key to the network inference task, as only ensemble variants achieve top performance. As a second contribution, we explore the effect of using rankwise averaged predictions of multiple ensemble algorithms as opposed to only one. We name this approach NIMEFI (Network Inference using Multiple Ensemble Feature Importance algorithms) and show that this approach outperforms all individual methods in general, although on a specific network a single method can perform better. An implementation of NIMEFI has been made publicly available.
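The subsampling idea, turning any ranker into an ensemble importance measure, can be sketched for a single target gene as follows. The base ranker here is plain absolute Pearson correlation, used only as a placeholder. It is not one of the regressors compared in the paper. The function names, parameters and expression values are all hypothetical.

```python
import random

def abs_correlation(x, y):
    """Absolute Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return abs(cov) / ((vx * vy) ** 0.5) if vx > 0 and vy > 0 else 0.0

def rank_predictors(predictors, target, rows):
    """Placeholder base ranker: order predictors by |correlation| on given rows."""
    t = [target[i] for i in rows]
    score = {
        name: abs_correlation([values[i] for i in rows], t)
        for name, values in predictors.items()
    }
    return sorted(score, key=score.get, reverse=True)

def ensemble_importance(predictors, target, n_rounds=50, fraction=0.7, seed=0):
    """Mean rank of each predictor over random subsamples (lower = more important)."""
    rng = random.Random(seed)
    n = len(target)
    size = max(3, int(fraction * n))
    total_rank = {name: 0 for name in predictors}
    for _ in range(n_rounds):
        rows = rng.sample(range(n), size)  # subsample the expression samples
        ranking = rank_predictors(predictors, target, rows)
        for position, name in enumerate(ranking, start=1):
            total_rank[name] += position
    return {name: total / n_rounds for name, total in total_rank.items()}

# Hypothetical expression of one target gene and two candidate regulators.
target = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
predictors = {
    "regulator": [1.1, 1.9, 3.2, 3.8, 5.1, 6.2, 6.9, 8.1, 8.8, 10.2],
    "unrelated": [5.0, 1.0, 4.0, 2.0, 3.0, 5.0, 1.0, 4.0, 2.0, 3.0],
}
importance = ensemble_importance(predictors, target)
print(importance)
```

Repeating this for every gene as the target and collecting the mean ranks gives a candidate edge list. The rankwise averaging across several ensemble methods described in the abstract would average such mean ranks once more across methods.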
Array data extractor (ADE): a LabVIEW program to extract and merge gene array data
2013-01-01
Background Large data sets from gene expression array studies are publicly available offering information highly valuable for research across many disciplines ranging from fundamental to clinical research. Highly advanced bioinformatics tools have been made available to researchers, but a demand for user-friendly software allowing researchers to quickly extract expression information for multiple genes from multiple studies persists. Findings Here, we present a user-friendly LabVIEW program to automatically extract gene expression data for a list of genes from multiple normalized microarray datasets. Functionality was tested for 288 class A G protein-coupled receptors (GPCRs) and expression data from 12 studies comparing normal and diseased human hearts. Results confirmed known regulation of a beta 1 adrenergic receptor and further indicate novel research targets. Conclusions Although existing software allows for complex data analyses, the LabVIEW based program presented here, “Array Data Extractor (ADE)”, provides users with a tool to retrieve meaningful information from multiple normalized gene expression datasets in a fast and easy way. Further, the graphical programming language used in LabVIEW allows applying changes to the program without the need of advanced programming knowledge. PMID:24289243
Ujino-Ihara, Tokuko; Kanamori, Hiroyuki; Yamane, Hiroko; Taguchi, Yuriko; Namiki, Nobukazu; Mukai, Yuzuru; Yoshimura, Kensuke; Tsumura, Yoshihiko
2005-12-01
To identify and characterize lineage-specific genes of conifers, two sets of ESTs (with 12791 and 5902 ESTs, representing 5373 and 3018 gene transcripts, respectively) were generated from the Cupressaceae species Cryptomeria japonica and Chamaecyparis obtusa. These transcripts were compared with non-redundant sets of genes generated from Pinaceae species, other gymnosperms and angiosperms. About 6% of tentative unique genes (Unigenes) of C. japonica and C. obtusa had homologs in other conifers but not angiosperms, and about 70% had apparent homologs in angiosperms. The calculated GC contents of orthologous genes showed that GC contents of coniferous genes are likely to be lower than those of angiosperms. Comparisons of the numbers of homologous genes in each species suggest that copy numbers of genes may be correlated between diverse seed plants. This correlation suggests that the multiplicity of such genes may have arisen before the divergence of gymnosperms and angiosperms.
Network neighborhood analysis with the multi-node topological overlap measure.
Li, Ai; Horvath, Steve
2007-01-15
The goal of neighborhood analysis is to find a set of genes (the neighborhood) that is similar to an initial 'seed' set of genes. Neighborhood analysis methods for network data are important in systems biology. If individual network connections are susceptible to noise, it can be advantageous to define neighborhoods on the basis of a robust interconnectedness measure, e.g. the topological overlap measure. Since the use of multiple nodes in the seed set may lead to more informative neighborhoods, it can be advantageous to define multi-node similarity measures. The pairwise topological overlap measure is generalized to multiple network nodes and subsequently used in a recursive neighborhood construction method. A local permutation scheme is used to determine the neighborhood size. Using four network applications and a simulated example, we provide empirical evidence that the resulting neighborhoods are biologically meaningful, e.g. we use neighborhood analysis to identify brain cancer related genes. An executable Windows program and tutorial for multi-node topological overlap measure (MTOM) based analysis can be downloaded from the webpage (http://www.genetics.ucla.edu/labs/horvath/MTOM/).
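For orientation, the pairwise topological overlap that the abstract generalizes can be sketched for an unweighted network. The formula used is the commonly cited one, TOM(i, j) = (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij), where l_ij counts shared neighbors and k is node degree. The multi-node extension and the permutation scheme from the paper are not reproduced here, and the example network is hypothetical.

```python
def topological_overlap(adjacency):
    """Pairwise topological overlap for a symmetric 0/1 matrix with zero diagonal."""
    n = len(adjacency)
    degree = [sum(row) for row in adjacency]
    tom = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                tom[i][j] = 1.0
                continue
            # Number of neighbors shared by nodes i and j.
            shared = sum(
                adjacency[i][u] * adjacency[u][j]
                for u in range(n)
                if u != i and u != j
            )
            a_ij = adjacency[i][j]
            tom[i][j] = (shared + a_ij) / (min(degree[i], degree[j]) + 1 - a_ij)
    return tom

# Hypothetical 4-node network with edges 0-1, 0-2, 1-2 and 2-3.
adjacency = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
tom = topological_overlap(adjacency)

# Neighborhood of seed node 0: other nodes ordered by overlap with the seed.
neighborhood = sorted(range(1, 4), key=lambda j: tom[0][j], reverse=True)
print(tom[0], neighborhood)
```

Nodes 0 and 3 are not directly connected, yet they receive a nonzero overlap through their shared neighbor. This is why the measure is considered more robust to noisy individual connections than raw adjacency.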
Golden Gate Assembly of CRISPR gRNA expression array for simultaneously targeting multiple genes.
Vad-Nielsen, Johan; Lin, Lin; Bolund, Lars; Nielsen, Anders Lade; Luo, Yonglun
2016-11-01
The engineered CRISPR/Cas9 technology has developed into the most efficient and broadly used genome editing tool. However, simultaneously targeting multiple genes (or genomic loci) in the same individual cells using CRISPR/Cas9 remains a technical challenge. In this article, we have developed a Golden Gate Assembly method for the generation of CRISPR gRNA expression arrays, thus enabling simultaneous gene targeting. Using this method, the generation of a CRISPR gRNA expression array can be accomplished in 2 weeks, and the array contains up to 30 gRNA expression cassettes. We demonstrated in the study that simultaneous targeting of 10 genomic loci or simultaneous inhibition of multiple endogenous genes could be achieved using the multiplexed gRNA expression array vector in human cells. The complete set of plasmids is available through the non-profit plasmid repository Addgene.
Sand, Olivier; Thomas-Chollier, Morgane; Vervisch, Eric; van Helden, Jacques
2008-01-01
This protocol shows how to access the Regulatory Sequence Analysis Tools (RSAT) via a programmatic interface in order to automate the analysis of multiple data sets. We describe the steps for writing a Perl client that connects to the RSAT Web services and implements a workflow to discover putative cis-acting elements in promoters of gene clusters. In the presented example, we apply this workflow to lists of transcription factor target genes resulting from ChIP-chip experiments. For each factor, the protocol predicts the binding motifs by detecting significantly overrepresented hexanucleotides in the target promoters and generates a feature map that displays the positions of putative binding sites along the promoter sequences. This protocol is addressed to bioinformaticians and biologists with programming skills (notions of Perl). Running time is approximately 6 min on the example data set.
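The core idea of the motif-discovery step, finding hexanucleotides that occur more often in target promoters than expected, can be sketched locally without any web service. This sketch does not call RSAT and does not reproduce its statistics. It uses a crude frequency ratio against a background set with add-one smoothing, and the sequences, function names and threshold are hypothetical.

```python
from collections import Counter

def count_kmers(sequences, k=6):
    """Count every overlapping k-mer across a list of DNA sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for start in range(len(seq) - k + 1):
            counts[seq[start:start + k]] += 1
    return counts

def overrepresented(target_seqs, background_seqs, k=6, min_ratio=2.0):
    """Return k-mers whose target frequency exceeds background by min_ratio."""
    target = count_kmers(target_seqs, k)
    background = count_kmers(background_seqs, k)
    t_total = sum(target.values())
    b_total = sum(background.values())
    hits = {}
    for kmer, n in target.items():
        t_freq = n / t_total
        # Crude add-one smoothing so unseen background k-mers stay finite.
        b_freq = (background[kmer] + 1) / (b_total + 1)
        ratio = t_freq / b_freq
        if ratio >= min_ratio:
            hits[kmer] = ratio
    return sorted(hits, key=hits.get, reverse=True)

# Hypothetical promoter fragments sharing the hexanucleotide CACGTG.
targets = ["TTCACGTGAA", "GGCACGTGCC", "ATCACGTGTT"]
background = ["ACGTACGTAC", "TTTTGGGGCC"]

motifs = overrepresented(targets, background)
print(motifs)
```

A production workflow would replace the ratio with a proper significance test and would also count reverse complements. Both are omitted here to keep the sketch short.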
Gaponova, Anna V.; Deneka, Alexander Y.; Beck, Tim N.; Liu, Hanqing; Andrianov, Gregory; Nikonova, Anna S.; Nicolas, Emmanuelle; Einarson, Margret B.; Golemis, Erica A.; Serebriiskii, Ilya G.
2017-01-01
Ovarian, head and neck, and other cancers are commonly treated with cisplatin and other DNA damaging cytotoxic agents. Altered DNA damage response (DDR) contributes to resistance of these tumors to chemotherapies, some targeted therapies, and radiation. DDR involves multiple protein complexes and signaling pathways, some of which are evolutionarily ancient and involve protein orthologs conserved from yeast to humans. To identify new regulators of cisplatin-resistance in human tumors, we integrated high throughput and curated datasets describing yeast genes that regulate sensitivity to cisplatin and/or ionizing radiation. Next, we clustered highly validated genes based on chemogenomic profiling, and then mapped orthologs of these genes in expanded genomic networks for multiple metazoans, including humans. This approach identified an enriched candidate set of genes involved in the regulation of resistance to radiation and/or cisplatin in humans. Direct functional assessment of selected candidate genes using RNA interference confirmed their activity in influencing cisplatin resistance, degree of γH2AX focus formation and ATR phosphorylation, in ovarian and head and neck cancer cell lines, suggesting impaired DDR signaling as the driving mechanism. This work enlarges the set of genes that may contribute to chemotherapy resistance and provides a new contextual resource for interpreting next generation sequencing (NGS) genomic profiling of tumors. PMID:27863405
Taguchi, Y-H
2018-05-08
Even though coexistence of multiple phenotypes sharing the same genomic background is interesting, it remains incompletely understood. Epigenomic profiles may represent key factors, with unknown contributions to the development of multiple phenotypes, and social-insect castes are a good model for elucidation of the underlying mechanisms. Nonetheless, previous studies have failed to identify genes associated with aberrant gene expression and methylation profiles because of the lack of suitable methodology that can address this problem properly. A recently proposed principal component analysis (PCA)-based and tensor decomposition (TD)-based unsupervised feature extraction (FE) can solve this problem because these two approaches can deal with gene expression and methylation profiles even when a small number of samples is available. PCA-based and TD-based unsupervised FE methods were applied to the analysis of gene expression and methylation profiles in the brains of two social insects, Polistes canadensis and Dinoponera quadriceps. Genes associated with differential expression and methylation between castes were identified, and analysis of enrichment of Gene Ontology terms confirmed reliability of the obtained sets of genes from the biological standpoint. Biologically relevant genes, shown to be associated with significant differential gene expression and methylation between castes, were identified here for the first time. The identification of these genes may help understand the mechanisms underlying epigenetic control of development of multiple phenotypes under the same genomic conditions.
Zhu, Xinyu; Ma, Hong; Chen, Zhiduan
2011-03-09
Plants contain numerous Su(var)3-9 homologues (SUVH) and related (SUVR) genes, some of which await functional characterization. Although there have been studies on the evolution of plant Su(var)3-9 SET genes, a systematic evolutionary study including major land plant groups has not been reported. Large-scale phylogenetic and evolutionary analyses can help to elucidate the underlying molecular mechanisms and contribute to improving genome annotation. Putative orthologs of plant Su(var)3-9 SET protein sequences were retrieved from major representatives of land plants. A novel clustering that included most members analyzed, henceforth referred to as the core Su(var)3-9 homologues and related (cSUVHR) gene clade, was identified, as well as all orthologous groups previously identified. Our analysis showed that plant Su(var)3-9 SET proteins possess a variety of domain organizations and can be classified into five types and ten subtypes. Plant Su(var)3-9 SET genes also exhibit a wide range of gene structures among different paralogs within a family, even in the regions encoding conserved PreSET and SET domains. We also found that the majority of SUVH members were intronless and formed three subclades within the SUVH clade. A detailed phylogenetic analysis of the plant Su(var)3-9 SET genes was performed. A novel deep phylogenetic relationship including most plant Su(var)3-9 SET genes was identified. Additional domains such as SAR, ZnF_C2H2 and WIYLD were integrated early into the primordial PreSET/SET/PostSET domain organization. At least three classes of gene structures had been formed before the divergence of Physcomitrella patens (moss) from other land plants. One or multiple retroposition events might have occurred among SUVH genes with the donor genes leading to the V-2 orthologous group. The structural differences among evolutionary groups of plant Su(var)3-9 SET genes with different functions were described, contributing to the design of further experimental studies.
Freytag, Virginie; Probst, Sabine; Hadziselimovic, Nils; Boglari, Csaba; Hauser, Yannick; Peter, Fabian; Gabor Fenyves, Bank; Milnik, Annette; Demougin, Philippe; Vukojevic, Vanja; de Quervain, Dominique J-F; Papassotiropoulos, Andreas; Stetak, Attila
2017-07-12
The identification of genes related to encoding, storage, and retrieval of memories is a major interest in neuroscience. In the current study, we analyzed the temporal gene expression changes in a neuronal mRNA pool during an olfactory long-term associative memory (LTAM) in Caenorhabditis elegans hermaphrodites. Here, we identified a core set of 712 (538 upregulated and 174 downregulated) genes that follows three distinct temporal peaks demonstrating multiple gene regulation waves in LTAM. Compared with the previously published positive LTAM gene set (Lakhina et al., 2015), 50% of the upregulated genes identified here overlap with the previous dataset, possibly representing stimulus-independent memory-related genes. On the other hand, the remaining genes were not previously identified in positive associative memory and may specifically regulate aversive LTAM. Our results suggest a multistep gene activation process during the formation and retrieval of long-term memory and define general memory-implicated genes as well as conditioning-type-dependent gene sets. SIGNIFICANCE STATEMENT The identification of genes regulating different steps of memory is of major interest in neuroscience. Identification of common memory genes across different learning paradigms and the temporal activation of the genes are poorly studied. Here, we investigated the temporal aspects of Caenorhabditis elegans gene expression changes using aversive olfactory associative long-term memory (LTAM) and identified three major gene activation waves. As in previous studies, aversive LTAM is also CREB dependent, and CREB activity is necessary immediately after training. Finally, we define a list of memory paradigm-independent core gene sets as well as conditioning-dependent genes.
Rogic, Sanja; Wong, Albertina; Pavlidis, Paul
2017-01-01
Background Prenatal alcohol exposure (PAE) can result in an array of morphological, behavioural and neurobiological deficits that can range in their severity. Despite extensive research in the field and significant progress, especially in understanding the range of possible malformations and neurobehavioral abnormalities, the molecular mechanisms of alcohol responses in development are still not well understood. There have been multiple transcriptomic studies looking at the changes in gene expression after PAE in animal models; however, there is limited apparent consensus among the reported findings. In an effort to address this issue, we performed a comprehensive re-analysis and meta-analysis of all suitable, publicly available expression data sets. Methods We assembled ten microarray data sets of gene expression after PAE in mouse and rat models consisting of samples from a total of 63 ethanol-exposed and 80 control animals. We re-analyzed each data set for differential expression and then used the results to perform meta-analyses considering all data sets together or grouping them by time or duration of exposure (pre- and post-natal, acute and chronic, respectively). We performed network and Gene Ontology enrichment analysis to further characterize the identified signatures. Results For each sub-analysis we identified signatures of differentially expressed genes that show support from multiple studies. Overall, the changes in gene expression were more extensive after acute ethanol treatment during prenatal development than in other models. Considering the analysis of all the data together, we identified a robust core signature of 104 genes down-regulated after PAE, with no up-regulated genes. Functional analysis reveals over-representation of genes involved in protein synthesis, mRNA splicing and chromatin organization.
Conclusions Our meta-analysis shows that existing studies, despite superficial dissimilarity in findings, share features that allow us to identify a common core signature set of transcriptome changes in PAE. This is an important step to identifying the biological processes that underlie the etiology of FASD. PMID:26996386
A Risk Stratification Model for Lung Cancer Based on Gene Coexpression Network and Deep Learning
2018-01-01
Risk stratification models for lung cancer based on gene expression profiles are of great interest. Instead of previous models based on individual prognostic genes, we aimed to develop a novel system-level risk stratification model for lung adenocarcinoma based on gene coexpression networks. Using multiple microarray data sets, gene coexpression network analysis was performed to identify survival-related networks. A deep learning based risk stratification model was constructed with representative genes of these networks. The model was validated in two test sets. Survival analysis was performed using the output of the model to evaluate whether it could predict patients' survival independent of clinicopathological variables. Five networks were significantly associated with patients' survival. Considering prognostic significance and representativeness, genes of the two survival-related networks were selected for input of the model. The output of the model was significantly associated with patients' survival in the training set and in two test sets (p < 0.00001, p < 0.0001 and p = 0.02 for training and test sets 1 and 2, resp.). In multivariate analyses, the model was associated with patients' prognosis independent of other clinicopathological features. Our study presents a new perspective on incorporating gene coexpression networks into the gene expression signature and clinical application of deep learning in genomic data science for prognosis prediction. PMID:29581968
Kassambara, Alboukadel; Hose, Dirk; Moreaux, Jérôme; Walker, Brian A.; Protopopov, Alexei; Reme, Thierry; Pellestor, Franck; Pantesco, Véronique; Jauch, Anna; Morgan, Gareth; Goldschmidt, Hartmut; Klein, Bernard
2012-01-01
Background Genetic abnormalities are common in patients with multiple myeloma, and may deregulate gene products involved in tumor survival, proliferation, metabolism and drug resistance. In particular, translocations may result in a high expression of targeted genes (termed spike expression) in tumor cells. We identified spike genes in multiple myeloma cells of patients with newly-diagnosed myeloma and investigated their prognostic value. Design and Methods Genes with a spike expression in multiple myeloma cells were picked up using box plot probe set signal distribution and two selection filters. Results In a cohort of 206 newly diagnosed patients with multiple myeloma, 2587 genes/expressed sequence tags with a spike expression were identified. Some spike genes were associated with some transcription factors such as MAF or MMSET and with known recurrent translocations as expected. Spike genes were not associated with increased DNA copy number and for a majority of them, involved unknown mechanisms. Of spiked genes, 36.7% clustered significantly in 149 out of 862 documented chromosome (sub)bands, of which 53 had prognostic value (35 bad, 18 good). Their prognostic value was summarized with a spike band score that delineated 23.8% of patients with a poor median overall survival (27.4 months versus not reached, P<0.001) using the training cohort of 206 patients. The spike band score was independent of other gene expression profiling-based risk scores, t(4;14), or del17p in an independent validation cohort of 345 patients. Conclusions We present a new approach to identify spike genes and their relationship to patients’ survival. PMID:22102711
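A box-plot rule for flagging spike expression can be sketched as an upper-fence test on one probe set's signal across patients. The abstract does not give the fence multiplier or the two selection filters, so the `whisker` value, the function name and the signal values below are assumptions for illustration only.

```python
from statistics import quantiles

def spike_samples(signal_by_sample, whisker=3.0):
    """Return samples whose signal lies above Q3 + whisker * IQR."""
    values = list(signal_by_sample.values())
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the signal distribution
    upper_fence = q3 + whisker * (q3 - q1)
    return [s for s, v in signal_by_sample.items() if v > upper_fence]

# Hypothetical probe set signal in eight patients; one patient shows a spike.
signal = {
    "p1": 10, "p2": 12, "p3": 11, "p4": 13,
    "p5": 12, "p6": 11, "p7": 10, "p8": 250,
}
spiked = spike_samples(signal)
print(spiked)
```

Running the same test over every probe set would yield a candidate list of spike genes per patient, which could then be filtered further.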
Disentangling the multigenic and pleiotropic nature of molecular function
2015-01-01
Background Biological processes at the molecular level are usually represented by molecular interaction networks. Function is organised and modularity identified based on network topology; however, this approach often fails to account for the dynamic and multifunctional nature of molecular components. For example, a molecule engaging in spatially or temporally independent functions may be inappropriately clustered into a single functional module. To capture biologically meaningful sets of interacting molecules, we use experimentally defined pathways as spatial/temporal units of molecular activity. Results We defined functional profiles of Saccharomyces cerevisiae based on a minimal set of Gene Ontology terms sufficient to represent each pathway's genes. The Gene Ontology terms were used to annotate 271 pathways, accounting for pathway multi-functionality and gene pleiotropy. Pathways were then arranged into a network, linked by shared functionality. Of the genes in our data set, 44% appeared in multiple pathways performing a diverse set of functions. Linking pathways by overlapping functionality revealed a modular network with energy metabolism forming a sparse centre, surrounded by several denser clusters composed of regulatory and metabolic pathways. Signalling pathways formed a relatively discrete cluster connected to the centre of the network. Genetic interactions were enriched within the clusters of pathways by a factor of 5.5, confirming that the organisation of our pathway network is biologically significant. Conclusions Our representation of molecular function according to pathway relationships enables analysis of gene/protein activity in the context of specific functional roles, as an alternative to typical molecule-centric graph-based methods. The pathway network demonstrates the cooperation of multiple pathways to perform biological processes and organises pathways into functionally related clusters with interdependent outcomes. PMID:26678917
2011-01-01
Background Copy number aberrations (CNAs) are an important molecular signature in cancer initiation, development, and progression. However, these aberrations span a wide range of chromosomes, making it hard to distinguish cancer related genes from other genes that are not closely related to cancer but are located in broadly aberrant regions. With the current availability of high-resolution data sets such as single nucleotide polymorphism (SNP) microarrays, it has become an important issue to develop a computational method to detect driving genes related to cancer development located in the focal regions of CNAs. Results In this study, we introduce a novel method referred to as the wavelet-based identification of focal genomic aberrations (WIFA). The use of the wavelet analysis, because it is a multi-resolution approach, makes it possible to effectively identify focal genomic aberrations in broadly aberrant regions. The proposed method integrates multiple cancer samples so that it enables the detection of the consistent aberrations across multiple samples. We then apply this method to glioblastoma multiforme and lung cancer data sets from the SNP microarray platform. Through this process, we confirm the ability to detect previously known cancer related genes from both cancer types with high accuracy. Also, the application of this approach to a lung cancer data set identifies focal amplification regions that contain known oncogenes, though these regions are not reported using a recent CNAs detecting algorithm GISTIC: SMAD7 (chr18q21.1) and FGF10 (chr5p12). Conclusions Our results suggest that WIFA can be used to reveal cancer related genes in various cancer data sets. PMID:21569311
Jaiswal, Deepika; Jezek, Meagan; Quijote, Jeremiah; Lum, Joanna; Choi, Grace; Kulkarni, Rushmie; Park, DoHwan; Green, Erin M.
2017-01-01
The conserved yeast histone methyltransferase Set1 targets H3 lysine 4 (H3K4) for mono, di, and trimethylation and is linked to active transcription due to the euchromatic distribution of these methyl marks and the recruitment of Set1 during transcription. However, loss of Set1 results in increased expression of multiple classes of genes, including genes adjacent to telomeres and middle sporulation genes, which are repressed under normal growth conditions because they function in meiotic progression and spore formation. The mechanisms underlying Set1-mediated gene repression are varied, and still unclear in some cases, although repression has been linked to both direct and indirect action of Set1, associated with noncoding transcription, and is often dependent on the H3K4me2 mark. We show that Set1, and particularly the H3K4me2 mark, are implicated in repression of a subset of middle sporulation genes during vegetative growth. In the absence of Set1, there is loss of the DNA-binding transcriptional regulator Sum1 and the associated histone deacetylase Hst1 from chromatin in a locus-specific manner. This is linked to increased H4K5ac at these loci and aberrant middle gene expression. These data indicate that, in addition to DNA sequence, histone modification status also contributes to proper localization of Sum1. Our results also show that the role for Set1 in middle gene expression control diverges as cells receive signals to undergo meiosis. Overall, this work dissects an unexplored role for Set1 in gene-specific repression, and provides important insights into a new mechanism associated with the control of gene expression linked to meiotic differentiation. PMID:29066473
Kwon, Ji-Sun; Kim, Jihye; Nam, Dougu; Kim, Sangsoo
2012-06-01
Gene set analysis (GSA) is useful in interpreting a genome-wide association study (GWAS) result in terms of biological mechanism. We compared the performance of two different GSA implementations that accept GWAS p-values of single nucleotide polymorphisms (SNPs) or gene-by-gene summaries thereof, GSA-SNP and i-GSEA4GWAS, under the same settings of inputs and parameters. GSA runs were made with two sets of p-values from a Korean type 2 diabetes mellitus GWAS: 259,188 and 1,152,947 SNPs of the original and imputed genotype datasets, respectively. When Gene Ontology terms were used as gene sets, i-GSEA4GWAS produced 283 and 1,070 hits for the unimputed and imputed datasets, respectively. In contrast, GSA-SNP reported 94 and 38 hits for the unimputed and imputed datasets, respectively. Similar trends, though to a lesser degree, were observed with Kyoto Encyclopedia of Genes and Genomes (KEGG) gene sets. The huge number of hits by i-GSEA4GWAS for the imputed dataset was probably an artifact due to the scaling step in the algorithm. The decrease in hits by GSA-SNP for the imputed dataset may be due to the fact that it relies on a Z-statistic, which is sensitive to variations in the background level of associations. Judicious evaluation of the GSA outcomes, perhaps based on multiple programs, is recommended.
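The sensitivity of a Z-statistic to the background level of association can be illustrated with a small sketch. This is not the GSA-SNP implementation: the function name `gene_set_z`, the per-gene scores (imagined here as -log10 p-values), and the gene identifiers are all hypothetical, and the formula is the generic one-sample Z against the background of all genes.

```python
import math
import statistics

def gene_set_z(scores, gene_set):
    """Generic Z-statistic: gene-set mean score versus the background of all genes.

    scores   -- dict mapping gene -> score (assumed here to be -log10 p-values)
    gene_set -- iterable of gene names; names missing from `scores` are ignored
    """
    background_mean = statistics.fmean(scores.values())
    background_sd = statistics.pstdev(scores.values())
    members = [scores[g] for g in gene_set if g in scores]
    if not members or background_sd == 0:
        raise ValueError("need at least one scored member and a non-constant background")
    set_mean = statistics.fmean(members)
    # Standard error of the set mean under the background distribution.
    return (set_mean - background_mean) / (background_sd / math.sqrt(len(members)))

# Hypothetical scores: the set {g1, g2} is identical in both cases,
# but the remaining (background) genes score higher in the second case.
low_background = {"g1": 4.0, "g2": 3.0, "g3": 1.0, "g4": 0.0}
high_background = {"g1": 4.0, "g2": 3.0, "g3": 3.0, "g4": 2.0}

z_low = gene_set_z(low_background, ["g1", "g2"])
z_high = gene_set_z(high_background, ["g1", "g2"])
```

With the set's own scores unchanged, raising the background scores lowers the Z-statistic, which is one way a denser (for example, imputed) dataset could yield fewer significant gene sets.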
Computing all hybridization networks for multiple binary phylogenetic input trees.
Albrecht, Benjamin
2015-07-30
The computation of phylogenetic trees on the same set of species that are based on different orthologous genes can lead to incongruent trees. One possible explanation for this behavior is interspecific hybridization events recombining genes of different species. An important approach to analyze such events is the computation of hybridization networks. This work presents the first algorithm computing the hybridization number as well as a set of representative hybridization networks for multiple binary phylogenetic input trees on the same set of taxa. To improve its practical runtime, we show how this algorithm can be parallelized. Moreover, we demonstrate the efficiency of the software Hybroscale, containing an implementation of our algorithm, by comparing it to PIRNv2.0, which is so far the best available software computing the exact hybridization number for multiple binary phylogenetic trees on the same set of taxa. The algorithm is part of the software Hybroscale, which was developed specifically for the investigation of hybridization networks including their computation and visualization. Hybroscale is freely available and runs on all three major operating systems. Our simulation study indicates that our approach is on average 100 times faster than PIRNv2.0. Moreover, we show how Hybroscale improves the interpretation of the reported hybridization networks by adding certain features to its graphical representation.
Functional and evolutionary insights from the Ciona notochord transcriptome.
Reeves, Wendy M; Wu, Yuye; Harder, Matthew J; Veeman, Michael T
2017-09-15
The notochord of the ascidian Ciona consists of only 40 cells, and is a longstanding model for studying organogenesis in a small, simple embryo. Here, we perform RNAseq on flow-sorted notochord cells from multiple stages to define a comprehensive Ciona notochord transcriptome. We identify 1364 genes with enriched expression and extensively validate the results by in situ hybridization. These genes are highly enriched for Gene Ontology terms related to the extracellular matrix, cell adhesion and cytoskeleton. Orthologs of 112 of the Ciona notochord genes have known notochord expression in vertebrates, more than twice as many as predicted by chance alone. This set of putative effector genes with notochord expression conserved from tunicates to vertebrates will be invaluable for testing hypotheses about notochord evolution. The full set of Ciona notochord genes provides a foundation for systems-level studies of notochord gene regulation and morphogenesis. We find only modest overlap between this set of notochord-enriched transcripts and the genes upregulated by ectopic expression of the key notochord transcription factor Brachyury, indicating that Brachyury is not a notochord master regulator gene as strictly defined. © 2017. Published by The Company of Biologists Ltd.
Chen, Junhui; Meng, Yuhuan; Zhou, Jinghui; Zhuo, Min; Ling, Fei; Zhang, Yu; Du, Hongli; Wang, Xiaoning
2013-01-01
Type 2 Diabetes Mellitus (T2DM) and obesity have become increasingly prevalent in recent years. Recent studies have focused on identifying causal variations or candidate genes for obesity and T2DM via analysis of expression quantitative trait loci (eQTL) within a single tissue. However, T2DM and obesity are affected by comprehensive sets of genes in multiple tissues. In the current study, gene expression levels in multiple human tissues from GEO datasets were analyzed, and 21 candidate genes displaying high percentages of differential expression were selected. Specifically, DENND1B, LYN, MRPL30, POC1B, PRKCB, RP4-655J12.3, HIBADH, and TMBIM4 were identified from the T2DM-control study, and BCAT1, BMP2K, CSRNP2, MYNN, NCKAP5L, SAP30BP, SLC35B4, SP1, BAP1, GRB14, HSP90AB1, ITGA5, and TOMM5 were identified from the obesity-control study. The majority of these genes are known to be involved in T2DM and obesity. Therefore, analysis of gene expression in various tissues using GEO datasets may be an effective and feasible method to determine novel or causal genes associated with T2DM and obesity.
Combining Gene Signatures Improves Prediction of Breast Cancer Survival
Zhao, Xi; Naume, Bjørn; Langerød, Anita; Frigessi, Arnoldo; Kristensen, Vessela N.; Børresen-Dale, Anne-Lise; Lingjærde, Ole Christian
2011-01-01
Background Several gene sets for prediction of breast cancer survival have been derived from whole-genome mRNA expression profiles. Here, we develop a statistical framework to explore whether combination of the information from such sets may improve prediction of recurrence and breast cancer specific death in early-stage breast cancers. Microarray data from two clinically similar cohorts of breast cancer patients are used as training (n = 123) and test set (n = 81), respectively. Gene sets from eleven previously published gene signatures are included in the study. Principal Findings To investigate the relationship between breast cancer survival and gene expression on a particular gene set, a Cox proportional hazards model is applied using partial likelihood regression with an L2 penalty to avoid overfitting and using cross-validation to determine the penalty weight. The fitted models are applied to an independent test set to obtain a predicted risk for each individual and each gene set. Hierarchical clustering of the test individuals on the basis of the vector of predicted risks results in two clusters with distinct clinical characteristics in terms of the distribution of molecular subtypes, ER, PR status, TP53 mutation status and histological grade category, and associated with significantly different survival probabilities (recurrence: p = 0.005; breast cancer death: p = 0.014). Finally, principal components analysis of the gene signatures is used to derive combined predictors used to fit a new Cox model. This model classifies test individuals into two risk groups with distinct survival characteristics (recurrence: p = 0.003; breast cancer death: p = 0.001). The latter classifier outperforms all the individual gene signatures, as well as Cox models based on traditional clinical parameters and the Adjuvant! Online for survival prediction. Conclusion Combining the predictive strength of multiple gene signatures improves prediction of breast cancer survival. 
The presented methodology is broadly applicable to breast cancer risk assessment using any new identified gene set. PMID:21423775
Combining Evidence of Preferential Gene-Tissue Relationships from Multiple Sources
Guo, Jing; Hammar, Mårten; Öberg, Lisa; Padmanabhuni, Shanmukha S.; Bjäreland, Marcus; Dalevi, Daniel
2013-01-01
An important challenge in drug discovery and disease prognosis is to predict genes that are preferentially expressed in one or a few tissues, i.e. showing a considerably higher expression in one tissue(s) compared to the others. Although several data sources and methods have been published explicitly for this purpose, they often disagree and it is not evident how to retrieve these genes and how to distinguish true biological findings from those that are due to choice-of-method and/or experimental settings. In this work we have developed a computational approach that combines results from multiple methods and datasets with the aim to eliminate method/study-specific biases and to improve the predictability of preferentially expressed human genes. A rule-based score is used to merge and assign support to the results. Five sets of genes with known tissue specificity were used for parameter pruning and cross-validation. In total we identify 3434 tissue-specific genes. We compare the genes of highest scores with the public databases: PaGenBase (microarray), TiGER (EST) and HPA (protein expression data). The results have 85% overlap to PaGenBase, 71% to TiGER and only 28% to HPA. 99% of our predictions have support from at least one of these databases. Our approach also performs better than any of the databases on identifying drug targets and biomarkers with known tissue-specificity. PMID:23950964
Gu, Y R; Li, M Z; Zhang, K; Chen, L; Jiang, A A; Wang, J Y; Li, X W
2011-08-01
To normalize a set of quantitative real-time PCR (q-PCR) data, it is essential to determine an optimal number/set of housekeeping genes, as the abundance of housekeeping genes can vary across tissues or cells during different developmental stages, or even under certain environmental conditions. In this study, of the 20 commonly used endogenous control genes, 13, 18 and 17 genes exhibited credible stability in 56 different tissues, 10 types of adipose tissue and five types of muscle tissue, respectively. Our analysis clearly showed that three optimal housekeeping genes are adequate for an accurate normalization, which correlated well with the theoretical optimal number (r ≥ 0.94). In terms of economical and experimental feasibility, we recommend the use of the three most stable housekeeping genes for calculating the normalization factor. Based on our results, the three most stable housekeeping genes in all analysed samples (TOP2B, HSPCB and YWHAZ) are recommended for accurate normalization of q-PCR data. We also suggest that two different sets of housekeeping genes are appropriate for 10 types of adipose tissue (the HSPCB, ALDOA and GAPDH genes) and five types of muscle tissue (the TOP2B, HSPCB and YWHAZ genes), respectively. Our report will serve as a valuable reference for other studies aimed at measuring tissue-specific mRNA abundance in porcine samples. © 2011 Blackwell Verlag GmbH.
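As a minimal sketch of how a normalization factor might be calculated from three housekeeping genes: the abstract does not give the formula, so the geometric mean used here is an assumption (it is the convention in geNorm-style normalization). The relative quantities and the function names are hypothetical; only the gene names come from the abstract.

```python
import statistics

def normalization_factor(reference_quantities):
    """Geometric mean of the relative quantities of the reference genes (assumed formula)."""
    return statistics.geometric_mean(reference_quantities.values())

def normalize(target_quantity, factor):
    """Express a target gene's relative quantity per unit of reference-gene signal."""
    return target_quantity / factor

# Hypothetical relative quantities for one sample (not data from the study).
references = {"TOP2B": 1.0, "HSPCB": 2.0, "YWHAZ": 4.0}
factor = normalization_factor(references)      # cube root of 1 * 2 * 4
normalized_target = normalize(6.0, factor)
```

The geometric mean is less affected by a single outlying reference gene than the arithmetic mean, which is the usual argument for combining several housekeeping genes this way.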
PSAT: A web tool to compare genomic neighborhoods of multiple prokaryotic genomes
Fong, Christine; Rohmer, Laurence; Radey, Matthew; Wasnick, Michael; Brittnacher, Mitchell J
2008-01-01
Background The conservation of gene order among prokaryotic genomes can provide valuable insight into gene function, protein interactions, or events by which genomes have evolved. Although some tools are available for visualizing and comparing the order of genes between genomes of study, few support an efficient and organized analysis between large numbers of genomes. The Prokaryotic Sequence homology Analysis Tool (PSAT) is a web tool for comparing gene neighborhoods among multiple prokaryotic genomes. Results PSAT utilizes a database that is preloaded with gene annotation, BLAST hit results, and gene-clustering scores designed to help identify regions of conserved gene order. Researchers use the PSAT web interface to find a gene of interest in a reference genome and efficiently retrieve the sequence homologs found in other bacterial genomes. The tool generates a graphic of the genomic neighborhood surrounding the selected gene and the corresponding regions for its homologs in each comparison genome. Homologs in each region are color coded to assist users with analyzing gene order among various genomes. In contrast to common comparative analysis methods that filter sequence homolog data based on alignment score cutoffs, PSAT leverages gene context information for homologs, including those with weak alignment scores, enabling a more sensitive analysis. Features for constraining or ordering results are designed to help researchers browse results from large numbers of comparison genomes in an organized manner. PSAT has been demonstrated to be useful for helping to identify gene orthologs and potential functional gene clusters, and detecting genome modifications that may result in loss of function. Conclusion PSAT allows researchers to investigate the order of genes within local genomic neighborhoods of multiple genomes. 
A PSAT web server for public use is available for performing analyses on a growing set of reference genomes through any web browser with no client side software setup or installation required. Source code is freely available to researchers interested in setting up a local version of PSAT for analysis of genomes not available through the public server. Access to the public web server and instructions for obtaining source code can be found at . PMID:18366802
Fast and robust group-wise eQTL mapping using sparse graphical models.
Cheng, Wei; Shi, Yu; Zhang, Xiang; Wang, Wei
2015-01-16
Genome-wide expression quantitative trait loci (eQTL) studies have emerged as a powerful tool to understand the genetic basis of gene expression and complex traits. The traditional eQTL methods focus on testing the associations between individual single-nucleotide polymorphisms (SNPs) and gene expression traits. A major drawback of this approach is that it cannot model the joint effect of a set of SNPs on a set of genes, which may correspond to hidden biological pathways. We introduce a new approach to identify novel group-wise associations between sets of SNPs and sets of genes. Such associations are captured by hidden variables connecting SNPs and genes. Our model is a linear-Gaussian model and uses two types of hidden variables. One captures the set associations between SNPs and genes, and the other captures confounders. We develop an efficient optimization procedure which makes this approach suitable for large scale studies. Extensive experimental evaluations on both simulated and real datasets demonstrate that the proposed methods can effectively capture both individual and group-wise signals that cannot be identified by the state-of-the-art eQTL mapping methods. Considering group-wise associations significantly improves the accuracy of eQTL mapping, and the successful multi-layer regression model opens a new approach to understand how multiple SNPs interact with each other to jointly affect the expression level of a group of genes.
Cornish, Alex J; Filippis, Ioannis; David, Alessia; Sternberg, Michael J E
2015-09-01
Each cell type found within the human body performs a diverse and unique set of functions, the disruption of which can lead to disease. However, there currently exists no systematic mapping between cell types and the diseases they can cause. In this study, we integrate protein-protein interaction data with high-quality cell-type-specific gene expression data from the FANTOM5 project to build the largest collection of cell-type-specific interactomes created to date. We develop a novel method, called gene set compactness (GSC), that contrasts the relative positions of disease-associated genes across 73 cell-type-specific interactomes to map genes associated with 196 diseases to the cell types they affect. We conduct text-mining of the PubMed database to produce an independent resource of disease-associated cell types, which we use to validate our method. The GSC method successfully identifies known disease-cell-type associations, as well as highlighting associations that warrant further study. This includes mast cells and multiple sclerosis, a cell population currently being targeted in a multiple sclerosis phase 2 clinical trial. Furthermore, we build a cell-type-based diseasome using the cell types identified as manifesting each disease, offering insight into diseases linked through etiology. The data set produced in this study represents the first large-scale mapping of diseases to the cell types in which they are manifested and will therefore be useful in the study of disease systems. Overall, we demonstrate that our approach links disease-associated genes to the phenotypes they produce, a key goal within systems medicine.
Windhorst, Dafna A; Mileva-Seitz, Viara R; Rippe, Ralph C A; Tiemeier, Henning; Jaddoe, Vincent W V; Verhulst, Frank C; van IJzendoorn, Marinus H; Bakermans-Kranenburg, Marian J
2016-08-01
In a longitudinal cohort study, we investigated the interplay of harsh parenting and genetic variation across a set of functionally related dopamine genes, in association with children's externalizing behavior. This is one of the first studies to employ gene-based and gene-set approaches in tests of Gene by Environment (G × E) effects on complex behavior. This approach can offer an important alternative or complement to candidate gene and genome-wide environmental interaction (GWEI) studies in the search for genetic variation underlying individual differences in behavior. Genetic variants in 12 autosomal dopaminergic genes were available in an ethnically homogenous part of a population-based cohort. Harsh parenting was assessed with maternal (n = 1881) and paternal (n = 1710) reports at age 3. Externalizing behavior was assessed with the Child Behavior Checklist (CBCL) at age 5 (71 ± 3.7 months). We conducted gene-set analyses of the association between variation in dopaminergic genes and externalizing behavior, stratified for harsh parenting. The association was statistically significant or approached significance for children without harsh parenting experiences, but was absent in the group with harsh parenting. Similarly, significant associations between single genes and externalizing behavior were only found in the group without harsh parenting. Effect sizes in the groups with and without harsh parenting did not differ significantly. Gene-environment interaction tests were conducted for individual genetic variants, resulting in two significant interaction effects (rs1497023 and rs4922132) after correction for multiple testing. Our findings are suggestive of G × E interplay, with associations between dopamine genes and externalizing behavior present in children without harsh parenting, but not in children with harsh parenting experiences. Harsh parenting may overrule the role of genetic factors in externalizing behavior. 
Gene-based and gene-set analyses offer promising new alternatives to analyses focusing on single candidate polymorphisms when examining the interplay between genetic and environmental factors.
Microgravity and Immunity: Changes in Lymphocyte Gene Expression
NASA Technical Reports Server (NTRS)
Risin, D.; Pellis, N. R.; Ward, N. E.; Risin, S. A.
2006-01-01
Earlier studies had shown that modeled and true microgravity (MG) cause multiple direct effects on human lymphocytes. MG inhibits lymphocyte locomotion, suppresses polyclonal and antigen-specific activation, and affects signal transduction mechanisms as well as activation-induced apoptosis. In this study we assessed changes in gene expression associated with lymphocyte exposure to microgravity in an attempt to identify microgravity-sensitive genes (MGSG) in general and specifically those genes that might be responsible for the functional and structural changes observed earlier. Two sets of experiments targeting different goals were conducted. In the first set, T-lymphocytes from normal donors were activated with anti-CD3 and IL2 and then cultured in 1g (static) and modeled MG (MMG) conditions (Rotating Wall Vessel bioreactor) for 24 hours. This setting allowed searching for MGSG by comparison of gene expression patterns in modeled microgravity and 1g. In the second set, activated T-cells, after culturing for 24 hours in 1g and MMG, were exposed three hours before harvesting to a secondary activation stimulus (PHA), thus triggering the apoptotic pathway. Total RNA was extracted using the RNeasy isolation kit (Qiagen, Valencia, CA). Affymetrix Gene Chips (U133A), allowing testing for 18,400 human genes, were used for microarray analysis. In the first set of experiments, MMG exposure resulted in altered expression of 89 genes: 10 were up-regulated and 79 down-regulated. In the second set, changes in expression were revealed in 85 genes: 20 were up-regulated and 65 down-regulated. The analysis revealed that significant numbers of MGS genes are associated with signal transduction and apoptotic pathways. Interestingly, the majority of genes that responded by up- or down-regulation in the two sets of experiments were not the same, possibly reflecting different functional states of the examined T-lymphocyte populations. 
The responder genes (MGSG) might play an essential role in adaptation to MG and/or be responsible for pathologic changes encountered in space and thus represent potential targets for molecular-based countermeasures.
Expression Atlas: gene and protein expression across multiple studies and organisms
Tang, Y Amy; Bazant, Wojciech; Burke, Melissa; Fuentes, Alfonso Muñoz-Pomer; George, Nancy; Koskinen, Satu; Mohammed, Suhaib; Geniza, Matthew; Preece, Justin; Jarnuczak, Andrew F; Huber, Wolfgang; Stegle, Oliver; Brazma, Alvis; Petryszak, Robert
2018-01-01
Abstract Expression Atlas (http://www.ebi.ac.uk/gxa) is an added value database that provides information about gene and protein expression in different species and contexts, such as tissue, developmental stage, disease or cell type. The available public and controlled access data sets from different sources are curated and re-analysed using standardized, open source pipelines and made available for queries, download and visualization. As of August 2017, Expression Atlas holds data from 3,126 studies across 33 different species, including 731 from plants. Data from large-scale RNA sequencing studies including Blueprint, PCAWG, ENCODE, GTEx and HipSci can be visualized next to each other. In Expression Atlas, users can query genes or gene-sets of interest and explore their expression across or within species, tissues, developmental stages in a constitutive or differential context, representing the effects of diseases, conditions or experimental interventions. All processed data matrices are available for direct download in tab-delimited format or as R-data. In addition to the web interface, data sets can now be searched and downloaded through the Expression Atlas R package. Novel features and visualizations include the on-the-fly analysis of gene set overlaps and the option to view gene co-expression in experiments investigating constitutive gene expression across tissues or other conditions. PMID:29165655
Gene regulatory network inference using fused LASSO on multiple data sets
Omranian, Nooshin; Eloundou-Mbebi, Jeanne M. O.; Mueller-Roeber, Bernd; Nikoloski, Zoran
2016-01-01
Devising computational methods to accurately reconstruct gene regulatory networks given gene expression data is key to systems biology applications. Here we propose a method for reconstructing gene regulatory networks by simultaneous consideration of data sets from different perturbation experiments and corresponding controls. The method imposes three biologically meaningful constraints: (1) expression levels of each gene should be explained by the expression levels of a small number of transcription factor coding genes, (2) networks inferred from different data sets should be similar with respect to the type and number of regulatory interactions, and (3) relationships between genes which exhibit similar differential behavior over the considered perturbations should be favored. We demonstrate that these constraints can be transformed in a fused LASSO formulation for the proposed method. The comparative analysis on transcriptomics time-series data from prokaryotic species, Escherichia coli and Mycobacterium tuberculosis, as well as a eukaryotic species, mouse, demonstrated that the proposed method has the advantages of the most recent approaches for regulatory network inference, while obtaining better performance and assigning higher scores to the true regulatory links. The study indicates that the combination of sparse regression techniques with other biologically meaningful constraints is a promising framework for gene regulatory network reconstructions. PMID:26864687
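For orientation, a generic fused LASSO objective over several data sets can be written as follows. This is a textbook-style sketch, not the authors' exact formulation: the symbols (y, X, β, λ1, λ2, K) are assumptions introduced here, and only constraints (1) and (2) are represented; constraint (3) would require an additional term.

```latex
% For one target gene: y^{(k)} is its expression in data set k,
% X^{(k)} holds the expression of candidate transcription-factor genes,
% and \beta^{(k)} are the regulatory weights inferred from data set k.
\min_{\beta^{(1)},\dots,\beta^{(K)}}
  \sum_{k=1}^{K} \left\lVert y^{(k)} - X^{(k)} \beta^{(k)} \right\rVert_2^2
  \;+\; \lambda_1 \sum_{k=1}^{K} \left\lVert \beta^{(k)} \right\rVert_1
  \;+\; \lambda_2 \sum_{1 \le k < k' \le K} \left\lVert \beta^{(k)} - \beta^{(k')} \right\rVert_1
```

The λ1 term encourages each gene to be explained by few regulators (sparsity), and the λ2 fusion term penalizes differences between the networks inferred from different data sets.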
NIMEFI: Gene Regulatory Network Inference using Multiple Ensemble Feature Importance Algorithms
Ruyssinck, Joeri; Huynh-Thu, Vân Anh; Geurts, Pierre; Dhaene, Tom; Demeester, Piet; Saeys, Yvan
2014-01-01
One of the long-standing open challenges in computational systems biology is the topology inference of gene regulatory networks from high-throughput omics data. Recently, two community-wide efforts, DREAM4 and DREAM5, have been established to benchmark network inference techniques using gene expression measurements. In these challenges the overall top performer was the GENIE3 algorithm. This method decomposes the network inference task into separate regression problems for each gene in the network in which the expression values of a particular target gene are predicted using all other genes as possible predictors. Next, using tree-based ensemble methods, an importance measure for each predictor gene is calculated with respect to the target gene and a high feature importance is considered as putative evidence of a regulatory link existing between both genes. The contribution of this work is twofold. First, we generalize the regression decomposition strategy of GENIE3 to other feature importance methods. We compare the performance of support vector regression, the elastic net, random forest regression, symbolic regression and their ensemble variants in this setting to the original GENIE3 algorithm. To create the ensemble variants, we propose a subsampling approach which allows us to cast any feature selection algorithm that produces a feature ranking into an ensemble feature importance algorithm. We demonstrate that the ensemble setting is key to the network inference task, as only ensemble variants achieve top performance. As second contribution, we explore the effect of using rankwise averaged predictions of multiple ensemble algorithms as opposed to only one. We name this approach NIMEFI (Network Inference using Multiple Ensemble Feature Importance algorithms) and show that this approach outperforms all individual methods in general, although on a specific network a single method can perform better. An implementation of NIMEFI has been made publicly available. 
PMID:24667482
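The rank-wise averaging step described above can be sketched as follows. This is an illustration under stated assumptions, not the published NIMEFI code: the method labels, edge scores, and function names are hypothetical, and ties are broken by edge name purely to keep the output deterministic.

```python
def to_ranks(scores):
    """Map each edge to its rank, 1 = highest importance score (ties broken by edge name)."""
    ordered = sorted(scores, key=lambda edge: (-scores[edge], edge))
    return {edge: position + 1 for position, edge in enumerate(ordered)}

def rank_average(methods):
    """Average the per-method ranks of each edge; every method must score the same edges."""
    per_method_ranks = [to_ranks(scores) for scores in methods.values()]
    edges = set(per_method_ranks[0])
    if any(set(ranks) != edges for ranks in per_method_ranks):
        raise ValueError("all methods must score the same set of edges")
    return {
        edge: sum(ranks[edge] for ranks in per_method_ranks) / len(per_method_ranks)
        for edge in edges
    }

# Hypothetical importance scores for (regulator, target) edges from two ensemble methods.
methods = {
    "random_forest": {("A", "B"): 0.9, ("A", "C"): 0.2, ("B", "C"): 0.5},
    "elastic_net":   {("A", "B"): 0.4, ("A", "C"): 0.1, ("B", "C"): 0.8},
}
consensus = rank_average(methods)
# Lower average rank = stronger consensus support for the edge.
ranking = sorted(consensus, key=lambda edge: (consensus[edge], edge))
```

Averaging ranks rather than raw scores avoids having to put importance measures from different algorithms on a common scale.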
Demirci, F Yesim; Wang, Xingbin; Morris, David L; Feingold, Eleanor; Bernatsky, Sasha; Pineau, Christian; Clarke, Ann; Ramsey-Goldman, Rosalind; Manzi, Susan; Vyse, Timothy J; Kamboh, M I
2017-06-01
A major systemic lupus erythematosus (SLE) susceptibility locus lies within a common inversion polymorphism region (encompassing 3.8-4.5 Mb) located at 8p23. Initially implicated genes included FAM167A-BLK and XKR6, of which BLK received major attention due to its known role in B-cell biology. Recently, additional SLE risk carried in non-inverted background was also reported. In this case-control study, we further investigated the 'extended' 8p23 locus (~4 Mb) where we observed multiple SLE signals and assessed these signals for their relation to the inversion affecting this region. The study involved a North American discovery data set (~1200 subjects) and a replication data set (>10,000 subjects) comprising European-descent individuals. Meta-analysis of 8p23 SNPs, with p < 0.05 in both data sets, identified 51 genome-wide significant SNPs (p < 5.0 × 10^-8). While most of these SNPs were related to previously implicated signals (XKR6-FAM167A-BLK subregion), our results also revealed two 'new' SLE signals, including SGK223-CLDN23-MFHAS1 (6.06 × 10^-9 ≤ meta p ≤ 4.88 × 10^-8) and CTSB (meta p = 4.87 × 10^-8) subregions that are located >2 Mb upstream and ~0.3 Mb downstream from previously reported signals. Functional assessment of relevant SNPs indicated putative cis-effects on the expression of various genes at 8p23. Additional analyses in the discovery sample, where the inversion genotypes were inferred, replicated the association of non-inverted status with SLE risk and suggested that a number of SLE risk alleles are predominantly carried in non-inverted background. Our results implicate multiple (known + novel) SLE signals/genes at the extended 8p23 locus, beyond previously reported signals/genes, and suggest that this broad locus contributes to SLE risk through the effects of multiple genes/pathways. © Article author(s) (or their employer(s) unless otherwise stated in the text of the article) 2017. All rights reserved. 
No commercial use is permitted unless otherwise expressly granted.
Abu-Jamous, Basel; Fa, Rui; Roberts, David J; Nandi, Asoke K
2015-06-04
Collective analysis of the growing number of gene expression datasets is required. The recently proposed binarisation of consensus partition matrices (Bi-CoPaM) method can combine clustering results from multiple datasets to identify the subsets of genes which are consistently co-expressed in all of the provided datasets in a tuneable manner. However, results validation and parameter setting are issues that complicate the design of such methods. Moreover, although it is a common practice to test methods by application to synthetic datasets, the mathematical models used to synthesise such datasets are usually based on approximations which may not always be sufficiently representative of real datasets. Here, we propose an unsupervised method for the unification of clustering results from multiple datasets using external specifications (UNCLES). This method has the ability to identify the subsets of genes consistently co-expressed in a subset of datasets while being poorly co-expressed in another subset of datasets, and to identify the subsets of genes consistently co-expressed in all given datasets. We also propose the M-N scatter plots validation technique and adopt it to set the parameters of UNCLES, such as the number of clusters, automatically. Additionally, we propose an approach for the synthesis of gene expression datasets using real data profiles in a way which combines the ground-truth-knowledge of synthetic data and the realistic expression values of real data, and therefore overcomes the problem of faithfulness of synthetic expression data modelling. By application to those datasets, we validate UNCLES while comparing it with other conventional clustering methods, and of particular relevance, biclustering methods. We further validate UNCLES by application to a set of 14 real genome-wide yeast datasets as it produces focused clusters that conform well to known biological facts. 
Furthermore, in-silico-based hypotheses regarding the function of a few previously unknown genes in those focused clusters are drawn. The UNCLES method, the M-N scatter plots technique, and the expression data synthesis approach will have wide application for the comprehensive analysis of genomic and other sources of multiple complex biological datasets. Moreover, the derived in-silico-based biological hypotheses represent subjects for future functional studies.
A robust prognostic signature for hormone-positive node-negative breast cancer.
Griffith, Obi L; Pepin, François; Enache, Oana M; Heiser, Laura M; Collisson, Eric A; Spellman, Paul T; Gray, Joe W
2013-01-01
Systemic chemotherapy in the adjuvant setting can cure breast cancer in some patients that would otherwise recur with incurable, metastatic disease. However, since only a fraction of patients would have recurrence after surgery alone, the challenge is to stratify high-risk patients (who stand to benefit from systemic chemotherapy) from low-risk patients (who can safely be spared treatment related toxicities and costs). We focus here on risk stratification in node-negative, ER-positive, HER2-negative breast cancer. We use a large database of publicly available microarray datasets to build a random forests classifier and develop a robust multi-gene mRNA transcription-based predictor of relapse free survival at 10 years, which we call the Random Forests Relapse Score (RFRS). Performance was assessed by internal cross-validation, multiple independent data sets, and comparison to existing algorithms using receiver-operating characteristic and Kaplan-Meier survival analysis. Internal redundancy of features was determined using k-means clustering to define optimal signatures with smaller numbers of primary genes, each with multiple alternates. Internal OOB cross-validation for the initial (full-gene-set) model on training data reported an ROC AUC of 0.704, which was comparable to or better than those reported previously or obtained by applying existing methods to our dataset. Three risk groups with probability cutoffs for low, intermediate, and high-risk were defined. Survival analysis determined a highly significant difference in relapse rate between these risk groups. Validation of the models against independent test datasets showed highly similar results. Smaller 17-gene and 8-gene optimized models were also developed with minimal reduction in performance. Furthermore, the signature was shown to be almost equally effective on both hormone-treated and untreated patients. 
RFRS allows flexibility in both the number and identity of genes utilized from thousands to as few as 17 or eight genes, each with multiple alternatives. The RFRS reports a probability score strongly correlated with risk of relapse. This score could therefore be used to assign systemic chemotherapy specifically to those high-risk patients most likely to benefit from further treatment. PMID:24112773
Cross-Study Homogeneity of Psoriasis Gene Expression in Skin across a Large Expression Range
Kerkof, Keith; Timour, Martin; Russell, Christopher B.
2013-01-01
Background In psoriasis, only limited overlap between sets of genes identified as differentially expressed (psoriatic lesional vs. psoriatic non-lesional) was found using statistical and fold-change cut-offs. To provide a framework for utilizing prior psoriasis data sets we sought to understand the consistency of those sets. Methodology/Principal Findings Microarray expression profiling and qRT-PCR were used to characterize gene expression in PP and PN skin from psoriasis patients. cDNA (three new data sets) and cRNA hybridization (four existing data sets) data were compared using a common analysis pipeline. Agreement between data sets was assessed using varying qualitative and quantitative cut-offs to generate a DEG list in a source data set and then using other data sets to validate the list. Concordance increased from 67% across all probe sets to over 99% across more than 10,000 probe sets when statistical filters were employed. The fold-change behavior of individual genes tended to be consistent across the multiple data sets. We found that genes with <2-fold change values were quantitatively reproducible between pairs of data-sets. In a subset of transcripts with a role in inflammation changes detected by microarray were confirmed by qRT-PCR with high concordance. For transcripts with both PN and PP levels within the microarray dynamic range, microarray and qRT-PCR were quantitatively reproducible, including minimal fold-changes in IL13, TNFSF11, and TNFRSF11B and genes with >10-fold changes in either direction such as CHRM3, IL12B and IFNG. Conclusions/Significance Gene expression changes in psoriatic lesions were consistent across different studies, despite differences in patient selection, sample handling, and microarray platforms but between-study comparisons showed stronger agreement within than between platforms. We could use cut-offs as low as log10(ratio) = 0.1 (fold-change = 1.26), generating larger gene lists that validate on independent data sets. 
The reproducibility of PP signatures across data sets suggests that different sample sets can be productively compared. PMID:23308107
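The cut-off quoted in the abstract above, log10(ratio) = 0.1 corresponding to a fold-change of 1.26, can be checked with a few lines of arithmetic. The following is only a minimal sketch: the function names and the example expression values are hypothetical and are not taken from the study's analysis pipeline.

```python
import math


def log10_ratio_to_fold_change(log10_ratio: float) -> float:
    """Convert a log10 expression ratio into a fold-change magnitude (>= 1)."""
    return 10 ** abs(log10_ratio)


def passes_cutoff(pp: float, pn: float, log10_cutoff: float = 0.1) -> bool:
    """Return True if |log10(PP / PN)| meets the cut-off.

    pp and pn are hypothetical lesional and non-lesional expression
    levels on a linear scale; both must be positive.
    """
    return abs(math.log10(pp / pn)) >= log10_cutoff


# A log10 ratio of 0.1 is a fold-change of about 1.26.
print(round(log10_ratio_to_fold_change(0.1), 2))
```

Taking the absolute value treats up- and down-regulation symmetrically, so a gene that falls to 1/1.26 of its non-lesional level passes the same cut-off as one that rises 1.26-fold.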
GESearch: An Interactive GUI Tool for Identifying Gene Expression Signature.
Ye, Ning; Yin, Hengfu; Liu, Jingjing; Dai, Xiaogang; Yin, Tongming
2015-01-01
The huge amount of gene expression data generated by microarray and next-generation sequencing technologies presents challenges for extracting its biological meaning. When searching for coexpressed genes, the data mining process is largely affected by the selection of algorithms. Thus, it is highly desirable to provide multiple algorithm options in a user-friendly analytical toolkit for exploring gene expression signatures. For this purpose, we developed GESearch, an interactive graphical user interface (GUI) toolkit, which is written in MATLAB and supports a variety of gene expression data files. This analytical toolkit provides four models, namely the mean, the regression, the delegate, and the ensemble models, to identify coexpressed genes, and enables users to filter data and to select gene expression patterns by browsing the display window or by importing knowledge-based genes. The utility of this analytical toolkit is demonstrated by analyzing two real-life microarray datasets from cell-cycle experiments. Overall, we have developed an interactive GUI toolkit that offers a choice of multiple algorithms for analyzing gene expression signatures.
Heterogeneous data fusion for brain tumor classification.
Metsis, Vangelis; Huang, Heng; Andronesi, Ovidiu C; Makedon, Fillia; Tzika, Aria
2012-10-01
Current research in biomedical informatics involves analysis of multiple heterogeneous data sets. These include patient demographics, clinical and pathology data, treatment history, and patient outcomes, as well as gene expression, DNA sequences and other information sources such as gene ontology. Analysis of these data sets could lead to better disease diagnosis, prognosis, treatment and drug discovery. In this report, we present a novel machine learning framework for brain tumor classification based on heterogeneous data fusion of metabolic and molecular datasets, including state-of-the-art high-resolution magic angle spinning (HRMAS) proton (1H) magnetic resonance spectroscopy and gene transcriptome profiling, obtained from intact brain tumor biopsies. Our experimental results show that our novel framework outperforms any analysis using an individual dataset.
GOEAST: a web-based software toolkit for Gene Ontology enrichment analysis.
Zheng, Qi; Wang, Xiu-Jie
2008-07-01
Gene Ontology (GO) analysis has become a commonly used approach for functional studies of large-scale genomic or transcriptomic data. Although much software with GO-related analysis functions already exists, new tools are still needed to meet the requirements of data generated by newly developed technologies or of advanced analysis purposes. Here, we present the Gene Ontology Enrichment Analysis Software Toolkit (GOEAST), an easy-to-use web-based toolkit that identifies statistically overrepresented GO terms within given gene sets. Compared with available GO analysis tools, GOEAST has the following improved features: (i) GOEAST displays enriched GO terms in graphical format according to their relationships in the hierarchical tree of each GO category (biological process, molecular function and cellular component), and therefore provides a better understanding of the correlations among enriched GO terms; (ii) GOEAST supports analysis of data from various sources (probe or probe set IDs of Affymetrix, Illumina, Agilent or customized microarrays, as well as different gene identifiers) and multiple species (about 60 prokaryote and eukaryote species); (iii) one unique feature of GOEAST is that it allows cross comparison of the GO enrichment status of multiple experiments to identify functional correlations among them. GOEAST also provides rigorous statistical tests to enhance the reliability of analysis results. GOEAST is freely accessible at http://omicslab.genetics.ac.cn/GOEAST/
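To make "statistically overrepresented GO terms" concrete, the sketch below computes a one-sided hypergeometric tail probability for a single GO term. The abstract does not say which tests GOEAST applies, so the choice of the hypergeometric test is an assumption made for illustration, and the function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population: int, annotated: int,
                           sample: int, hits: int) -> float:
    """P(X >= hits) under sampling without replacement.

    population: number of genes in the background (e.g. all genes on the array)
    annotated:  background genes carrying the GO term
    sample:     size of the gene set being tested
    hits:       genes in the gene set that carry the GO term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the probability mass for every outcome at least as extreme as `hits`.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 40 of 1000 background genes carry the term,
# and 8 of a 50-gene set carry it.
p_value = hypergeom_enrichment_p(1000, 40, 50, 8)
print(p_value)
```

Because one such test is run per GO term, the resulting p-values would normally be corrected for multiple testing before any term is reported as enriched.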
Ulrich, Reiner; Puff, Christina; Wewetzer, Konstantin; Kalkuhl, Arno; Deschl, Ulrich; Baumgärtner, Wolfgang
2014-01-01
Canine distemper virus (CDV)-induced demyelinating leukoencephalitis in dogs (Canis familiaris) is suggested to represent a naturally occurring translational model for subacute sclerosing panencephalitis and multiple sclerosis in humans. The aim of this study was a hypothesis-free microarray analysis of the transcriptional changes within cerebellar specimens of five cases of acute, six cases of subacute demyelinating, and three cases of chronic demyelinating and inflammatory CDV leukoencephalitis as compared to twelve non-infected control dogs. Frozen cerebellar specimens were used for analysis of histopathological changes including demyelination, transcriptional changes employing microarrays, and presence of CDV nucleoprotein RNA and protein using microarrays, RT-qPCR and immunohistochemistry. Microarray analysis revealed 780 differentially expressed probe sets. The dominating change was an up-regulation of genes related to the innate and the humoral immune response, and less distinct the cytotoxic T-cell-mediated immune response in all subtypes of CDV leukoencephalitis as compared to controls. Multiple myelin genes including myelin basic protein and proteolipid protein displayed a selective down-regulation in subacute CDV leukoencephalitis, suggestive of an oligodendrocyte dystrophy. In contrast, a marked up-regulation of multiple immunoglobulin-like expressed sequence tags and the delta polypeptide of the CD3 antigen was observed in chronic CDV leukoencephalitis, in agreement with the hypothesis of an immune-mediated demyelination in the late inflammatory phase of the disease. Analysis of pathways intimately linked to demyelination as determined by morphometry employing correlation-based Gene Set Enrichment Analysis highlighted the pathomechanistic importance of up-regulated genes comprised by the gene ontology terms “viral replication” and “humoral immune response” as well as down-regulated genes functionally related to “metabolite and energy generation”. 
PMID:24755553
Mapping cis- and trans-regulatory effects across multiple tissues in twins
Grundberg, Elin; Small, Kerrin S.; Hedman, Åsa K.; Nica, Alexandra C.; Buil, Alfonso; Keildson, Sarah; Bell, Jordana T.; Yang, Tsun-Po; Meduri, Eshwar; Barrett, Amy; Nisbett, James; Sekowska, Magdalena; Wilk, Alicja; Shin, So-Youn; Glass, Daniel; Travers, Mary; Min, Josine L.; Ring, Sue; Ho, Karen; Thorleifsson, Gudmar; Kong, Augustine; Thorsteindottir, Unnur; Ainali, Chrysanthi; Dimas, Antigone S.; Hassanali, Neelam; Ingle, Catherine; Knowles, David; Krestyaninova, Maria; Lowe, Christopher E.; Di Meglio, Paola; Montgomery, Stephen B.; Parts, Leopold; Potter, Simon; Surdulescu, Gabriela; Tsaprouni, Loukia; Tsoka, Sophia; Bataille, Veronique; Durbin, Richard; Nestle, Frank O.; O’Rahilly, Stephen; Soranzo, Nicole; Lindgren, Cecilia M.; Zondervan, Krina T.; Ahmadi, Kourosh R.; Schadt, Eric E.; Stefansson, Kari; Smith, George Davey; McCarthy, Mark I.; Deloukas, Panos; Dermitzakis, Emmanouil T.; Spector, Tim D.
2013-01-01
Sequence-based variation in gene expression is a key driver of disease risk. Common variants regulating expression in cis have been mapped in many eQTL studies typically in single tissues from unrelated individuals. Here, we present a comprehensive analysis of gene expression across multiple tissues conducted in a large set of mono- and dizygotic twins that allows systematic dissection of genetic (cis and trans) and non-genetic effects on gene expression. Using identity-by-descent estimates, we show that at least 40% of the total heritable cis-effect on expression cannot be accounted for by common cis-variants, a finding which exposes the contribution of low frequency and rare regulatory variants with respect to both transcriptional regulation and complex trait susceptibility. We show that a substantial proportion of gene expression heritability is trans to the structural gene and identify several replicating trans-variants which act predominantly in a tissue-restricted manner and may regulate the transcription of many genes. PMID:22941192
Bioinformatics approaches to predict target genes from transcription factor binding data.
Essebier, Alexandra; Lamprecht, Marnie; Piper, Michael; Bodén, Mikael
2017-12-01
Transcription factors regulate gene expression and play an essential role in development by maintaining proliferative states, driving cellular differentiation and determining cell fate. Transcription factors are capable of regulating multiple genes over potentially long distances making target gene identification challenging. Currently available experimental approaches to detect distal interactions have multiple weaknesses that have motivated the development of computational approaches. Although an improvement over experimental approaches, existing computational approaches are still limited in their application, with different weaknesses depending on the approach. Here, we review computational approaches with a focus on data dependency, cell type specificity and usability. With the aim of identifying transcription factor target genes, we apply available approaches to typical transcription factor experimental datasets. We show that approaches are not always capable of annotating all transcription factor binding sites; binding sites should be treated disparately; and a combination of approaches can increase the biological relevance of the set of genes identified as targets. Copyright © 2017 Elsevier Inc. All rights reserved.
Mixture models for detecting differentially expressed genes in microarrays.
Jones, Liat Ben-Tovim; Bean, Richard; McLachlan, Geoffrey J; Zhu, Justin Xi
2006-10-01
An important and common problem in microarray experiments is the detection of genes that are differentially expressed in a given number of classes. As this problem concerns the selection of significant genes from a large pool of candidate genes, it needs to be carried out within the framework of multiple hypothesis testing. In this paper, we focus on the use of mixture models to handle the multiplicity issue. With this approach, a measure of the local FDR (false discovery rate) is provided for each gene. An attractive feature of the mixture model approach is that it provides a framework for the estimation of the prior probability that a gene is not differentially expressed, and this probability can subsequently be used in forming a decision rule. The rule can also be formed to take the false negative rate into account. We apply this approach to a well-known publicly available data set on breast cancer, and discuss our findings with reference to other approaches.
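The local FDR described above is the posterior probability that a gene belongs to the null (not differentially expressed) component of a two-component mixture, fdr(z) = π0·f0(z) / (π0·f0(z) + (1 − π0)·f1(z)). The sketch below evaluates that formula with fixed parameters. The normal component densities, their means and standard deviations, and the value of π0 are illustrative assumptions; in the paper's setting these quantities would be estimated from the data, and that fitting step is not shown here.

```python
import math


def normal_pdf(z: float, mean: float, sd: float) -> float:
    """Density of a normal distribution at z."""
    return math.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def local_fdr(z: float, pi0: float,
              null=(0.0, 1.0), alt=(2.5, 1.0)) -> float:
    """Posterior probability that a gene with score z is NOT differentially expressed.

    pi0  : prior probability that a gene is not differentially expressed
    null : (mean, sd) of the assumed null component
    alt  : (mean, sd) of the assumed non-null component
    """
    f0 = normal_pdf(z, *null)
    f1 = normal_pdf(z, *alt)
    return pi0 * f0 / (pi0 * f0 + (1 - pi0) * f1)


def is_differentially_expressed(z: float, pi0: float, threshold: float = 0.2) -> bool:
    """Simple decision rule: flag the gene when its local FDR is below the threshold."""
    return local_fdr(z, pi0) < threshold


for score in (0.0, 2.0, 4.0):
    print(score, local_fdr(score, pi0=0.9))
```

The threshold of 0.2 is arbitrary; lowering it makes the rule more conservative at the cost of a higher false negative rate, which is the trade-off the abstract alludes to.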
STRIDE: Species Tree Root Inference from Gene Duplication Events.
Emms, David M; Kelly, Steven
2017-12-01
The correct interpretation of any phylogenetic tree is dependent on that tree being correctly rooted. We present STRIDE, a fast, effective, and outgroup-free method for identification of gene duplication events and species tree root inference in large-scale molecular phylogenetic analyses. STRIDE identifies sets of well-supported in-group gene duplication events from a set of unrooted gene trees, and analyses these events to infer a probability distribution over an unrooted species tree for the location of its root. We show that STRIDE correctly identifies the root of the species tree in multiple large-scale molecular phylogenetic data sets spanning a wide range of timescales and taxonomic groups. We demonstrate that the novel probability model implemented in STRIDE can accurately represent the ambiguity in species tree root assignment for data sets where information is limited. Furthermore, application of STRIDE to outgroup-free inference of the origin of the eukaryotic tree resulted in a root probability distribution that provides additional support for leading hypotheses for the origin of the eukaryotes. © The Author 2017. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
Soneson, Charlotte; Fontes, Magnus
2012-01-01
Analysis of multivariate data sets from, for example, microarray studies frequently results in lists of genes which are associated with some response of interest. The biological interpretation is often complicated by the statistical instability of the obtained gene lists, which may partly be due to the functional redundancy among genes, implying that multiple genes can play exchangeable roles in the cell. In this paper, we use the concept of exchangeability of random variables to model this functional redundancy and thereby account for the instability. We present a flexible framework to incorporate the exchangeability into the representation of lists. The proposed framework supports straightforward comparison between any 2 lists. It can also be used to generate new more stable gene rankings incorporating more information from the experimental data. Using 2 microarray data sets, we show that the proposed method provides more robust gene rankings than existing methods with respect to sampling variations, without compromising the biological significance of the rankings.
Taminau, Jonatan; Meganck, Stijn; Lazar, Cosmin; Steenhoff, David; Coletta, Alain; Molter, Colin; Duque, Robin; de Schaetzen, Virginie; Weiss Solís, David Y; Bersini, Hugues; Nowé, Ann
2012-12-24
With an abundant amount of microarray gene expression data sets available through public repositories, new possibilities lie in combining multiple existing data sets. In this new context, analysis itself is no longer the problem, but retrieving and consistently integrating all this data before delivering it to the wide variety of existing analysis tools becomes the new bottleneck. We present the newly released inSilicoMerging R/Bioconductor package which, together with the earlier released inSilicoDb R/Bioconductor package, allows consistent retrieval, integration and analysis of publicly available microarray gene expression data sets. Inside the inSilicoMerging package a set of five visual and six quantitative validation measures are available as well. By providing (i) access to uniformly curated and preprocessed data, (ii) a collection of techniques to remove the batch effects between data sets from different sources, and (iii) several validation tools enabling the inspection of the integration process, these packages enable researchers to fully explore the potential of combining gene expression data for downstream analysis. The power of using both packages is demonstrated by programmatically retrieving and integrating gene expression studies from the InSilico DB repository [https://insilicodb.org/app/].
NASA Technical Reports Server (NTRS)
Sundaresan, A.; Pellis, N. R.
2005-01-01
Genetic response suites in human lymphocytes in response to microgravity are important to identify and further study in order to augment human physiological adaptation to novel environments. Emerging technologies, such as DNA microarray profiling, have the potential to identify novel genes that are involved in mediating adaptation to these environments. These genes may prove to be therapeutically valuable as new targets for countermeasures, or as predictive biomarkers of response to these new environments. Human lymphocytes cultured in 1g and microgravity analog culture were analyzed for their differential gene expression response. Different groups of genes related to the immune response, cardiovascular system and stress response were then analyzed. Analysis of cells from multiple donors reveals a small shared set that are likely to be essential to adaptation. These three groups focus on human adaptation to new environments. The shared set contains genes related to T cell activation, immune response and stress response to analog microgravity.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Friddle, Carl J; Koga, Teiichiro; Rubin, Edward M.
2000-03-15
While cardiac hypertrophy has been the subject of intensive investigation, regression of hypertrophy has been significantly less studied, precluding large-scale analysis of the relationship between these processes. In the present study, using pharmacological models of hypertrophy in mice, expression profiling was performed with fragments of more than 3,000 genes to characterize and contrast expression changes during induction and regression of hypertrophy. Administration of angiotensin II and isoproterenol by osmotic minipump produced increases in heart weight (15% and 40% respectively) that returned to pre-induction size following drug withdrawal. From multiple expression analyses of left ventricular RNA isolated at daily time-points during cardiac hypertrophy and regression, we identified sets of genes whose expression was altered at specific stages of this process. While confirming the participation of 25 genes or pathways previously known to be altered by hypertrophy, a larger set of 30 genes was identified whose expression had not previously been associated with cardiac hypertrophy or regression. Of the 55 genes that showed reproducible changes during the time course of induction and regression, 32 genes were altered only during induction and 8 were altered only during regression. This study identified both known and novel genes whose expression is affected at different stages of cardiac hypertrophy and regression and demonstrates that cardiac remodeling during regression utilizes a set of genes that are distinct from those used during induction of hypertrophy.
Dai, Wei; Siddiq, Afshan; Walley, Andrew J; Limpaiboon, Temduang; Brown, Robert
2013-01-01
Genetic abnormalities of cholangiocarcinoma have been widely studied; however, epigenomic changes related to cholangiocarcinogenesis have been less well characterised. We have profiled the DNA methylomes of 28 primary cholangiocarcinoma and six matched adjacent normal tissues using Infinium’s HumanMethylation27 BeadChips with the aim of identifying gene sets aberrantly epigenetically regulated in this tumour type. Using a linear model for microarray data we identified 1610 differentially methylated autosomal CpG sites with 809 CpG sites (representing 603 genes) being hypermethylated and 801 CpG sites (representing 712 genes) being hypomethylated in cholangiocarcinoma versus adjacent normal tissues (false discovery rate ≤ 0.05). Gene ontology and gene set enrichment analyses identified gene sets significantly associated with hypermethylation at linked CpG sites in cholangiocarcinoma including homeobox genes and target genes of PRC2, EED, SUZ12 and histone H3 trimethylation at lysine 27. We confirmed frequent hypermethylation at the homeobox genes HOXA9 and HOXD9 by bisulfite pyrosequencing in a larger cohort of cholangiocarcinoma (n = 102). Our findings indicate a key role for hypermethylation of multiple CpG sites at genes associated with a stem cell-like phenotype as a common molecular aberration in cholangiocarcinoma. These data have implications for cholangiocarcinogenesis, as well as possible novel treatment options using histone methyltransferase inhibitors. PMID:24089088
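The abstract above reports CpG sites at a false discovery rate ≤ 0.05 but does not state which FDR procedure was applied. As an illustration only, the sketch below implements the Benjamini-Hochberg adjustment, a widely used choice; its use here is an assumption, and the function name and example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-site p-values; sites with adjusted value <= 0.05 are kept.
raw = [0.01, 0.04, 0.03, 0.005, 0.6]
adjusted = benjamini_hochberg(raw)
significant = [i for i, q in enumerate(adjusted) if q <= 0.05]
print(adjusted, significant)
```

Filtering on the adjusted values rather than the raw p-values is what controls the expected proportion of false positives among the reported sites.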
Ficklin, Stephen P; Feltus, Frank Alex
2013-01-01
Many traits of biological and agronomic significance in plants are controlled in a complex manner where multiple genes and environmental signals affect the expression of the phenotype. In Oryza sativa (rice), thousands of quantitative genetic signals have been mapped to the rice genome. In parallel, thousands of gene expression profiles have been generated across many experimental conditions. Through the discovery of networks with real gene co-expression relationships, it is possible to identify co-localized genetic and gene expression signals that implicate complex genotype-phenotype relationships. In this work, we used a knowledge-independent, systems genetics approach to discover a high-quality set of co-expression networks, termed Gene Interaction Layers (GILs). Twenty-two GILs were constructed from 1,306 Affymetrix microarray rice expression profiles that were pre-clustered to allow for improved capture of gene co-expression relationships. Functional genomic and genetic data, including over 8,000 QTLs and 766 phenotype-tagged SNPs (p-value ≤ 0.001) from genome-wide association studies, both covering over 230 different rice traits, were integrated with the GILs. An online systems genetics data-mining resource, the GeneNet Engine, was constructed to enable dynamic discovery of gene sets (i.e. network modules) that overlap with genetic traits. GeneNet Engine does not provide the exact set of genes underlying a given complex trait, but through the evidence of gene-marker correspondence, co-expression, and functional enrichment, site visitors can identify genes with potential shared causality for a trait which could then be used for experimental validation. A set of 2 million SNPs was incorporated into the database and serves as a potential set of testable biomarkers for genes in modules that overlap with genetic traits. 
Herein, we describe two modules found using GeneNet Engine, one with significant overlap with the trait amylose content and another with significant overlap with blast disease resistance. PMID:23874666
ZHANG, YAFANG; CROFTON, ELIZABETH J.; FAN, XIUZHEN; LI, DINGGE; KONG, FANPING; SINHA, MALA; LUXON, BRUCE A.; SPRATT, HEIDI M.; LICHTI, CHERYL F.; GREEN, THOMAS A.
2016-01-01
Transcriptomic and proteomic approaches have separately proven effective at identifying novel mechanisms affecting addiction-related behavior; however, it is difficult to prioritize the many promising leads from each approach. A convergent secondary analysis of proteomic and transcriptomic results can glean additional information to help prioritize promising leads. The current study is a secondary analysis of the convergence of recently published separate transcriptomic and proteomic analyses of nucleus accumbens (NAc) tissue from rats subjected to environmental enrichment vs. isolation and cocaine self-administration vs. saline. Multiple bioinformatics approaches (e.g. Gene Ontology (GO) analysis, Ingenuity Pathway Analysis (IPA), and Gene Set Enrichment Analysis (GSEA)) were used to interrogate these rich data sets. Although there was little correspondence between mRNA vs. protein at the individual target level, good correspondence was found at the level of gene/protein sets, particularly for the environmental enrichment manipulation. These data identify gene sets where there is a positive relationship between changes in mRNA and protein (e.g. glycolysis, ATP synthesis, translation elongation factor activity, etc.) and gene sets where there is an inverse relationship (e.g. ribosomes, Rho GTPase signaling, protein ubiquitination, etc.). Overall environmental enrichment produced better correspondence than cocaine self-administration. The individual targets contributing to mRNA and protein effects were largely not overlapping. As a whole, these results confirm that robust transcriptomic and proteomic data sets can provide similar results at the gene/protein set level even when there is little correspondence at the individual target level and little overlap in the targets contributing to the effects. PMID:27717806
Uchiyama, Ikuo
2008-10-31
Identifying the set of intrinsically conserved genes, or the genomic core, among related genomes is crucial for understanding prokaryotic genomes where horizontal gene transfers are common. Although core genome identification appears to be obvious among very closely related genomes, it becomes more difficult when more distantly related genomes are compared. Here, we consider the core structure as a set of sufficiently long segments in which gene orders are conserved so that they are likely to have been inherited mainly through vertical transfer, and developed a method for identifying the core structure by finding the order of pre-identified orthologous groups (OGs) that maximally retains the conserved gene orders. The method was applied to genome comparisons of two well-characterized families, Bacillaceae and Enterobacteriaceae, and identified their core structures comprising 1438 and 2125 OGs, respectively. The core sets contained most of the essential genes and their related genes, which were primarily included in the intersection of the two core sets comprising around 700 OGs. The definition of the genomic core based on gene order conservation was demonstrated to be more robust than the simpler approach based only on gene conservation. We also investigated the core structures in terms of G+C content homogeneity and phylogenetic congruence, and found that the core genes primarily exhibited the expected characteristic, i.e., being indigenous and sharing the same history, more than the non-core genes. The results demonstrate that our strategy of genome alignment based on gene order conservation can provide an effective approach to identify the genomic core among moderately related microbial genomes.
oPOSSUM: integrated tools for analysis of regulatory motif over-representation
Ho Sui, Shannan J.; Fulton, Debra L.; Arenillas, David J.; Kwon, Andrew T.; Wasserman, Wyeth W.
2007-01-01
The identification of over-represented transcription factor binding sites from sets of co-expressed genes provides insights into the mechanisms of regulation for diverse biological contexts. oPOSSUM, an internet-based system for such studies of regulation, has been improved and expanded in this new release. New features include a worm-specific version for investigating binding sites conserved between Caenorhabditis elegans and C. briggsae, as well as a yeast-specific version for the analysis of co-expressed sets of Saccharomyces cerevisiae genes. The human and mouse applications feature improvements in ortholog mapping, sequence alignments and the delineation of multiple alternative promoters. oPOSSUM2, introduced for the analysis of over-represented combinations of motifs in human and mouse genes, has been integrated with the original oPOSSUM system. Analysis using user-defined background gene sets is now supported. The transcription factor binding site models have been updated to include new profiles from the JASPAR database. oPOSSUM is available at http://www.cisreg.ca/oPOSSUM/ PMID:17576675
Brito, Jose L.R.; Walker, Brian; Jenner, Matthew; Dickens, Nicholas J.; Brown, Nicola J.M.; Ross, Fiona M.; Avramidou, Athanasia; Irving, Julie A.E.; Gonzalez, David; Davies, Faith E.; Morgan, Gareth J.
2009-01-01
Background The recurrent immunoglobulin translocation, t(4;14)(p16;q32) occurs in 15% of multiple myeloma patients and is associated with poor prognosis, through an unknown mechanism. The t(4;14) up-regulates fibroblast growth factor receptor 3 (FGFR3) and multiple myeloma SET domain (MMSET) genes. The involvement of MMSET in the pathogenesis of t(4;14) multiple myeloma and the mechanism or genes deregulated by MMSET upregulation are still unclear. Design and Methods The expression of MMSET was analyzed using a novel antibody. The involvement of MMSET in t(4;14) myelomagenesis was assessed by small interfering RNA mediated knockdown combined with several biological assays. In addition, the differential gene expression of MMSET-induced knockdown was analyzed with expression microarrays. MMSET gene targets in primary patient material were analyzed by expression microarrays. Results We found that MMSET isoforms are expressed in multiple myeloma cell lines, being exclusively up-regulated in t(4;14)-positive cells. Suppression of MMSET expression affected cell proliferation by both decreasing cell viability and cell cycle progression of cells with the t(4;14) translocation. These findings were associated with reduced expression of genes involved in the regulation of cell cycle progression (e.g. CCND2, CCNG1, BRCA1, AURKA and CHEK1), apoptosis (CASP1, CASP4 and FOXO3A) and cell adhesion (ADAM9 and DSG2). Furthermore, we identified genes involved in the latter processes that were differentially expressed in t(4;14) multiple myeloma patient samples. Conclusions In conclusion, dysregulation of MMSET affects the expression of several genes involved in the regulation of cell cycle progression, cell adhesion and survival. PMID:19059936
Mitsui, Jun; Fukuda, Yoko; Azuma, Kyo; Tozaki, Hirokazu; Ishiura, Hiroyuki; Takahashi, Yuji; Goto, Jun; Tsuji, Shoji
2010-07-01
We have recently found that multiple rare variants of the glucocerebrosidase gene (GBA) confer a robust risk for Parkinson disease, supporting the 'common disease-multiple rare variants' hypothesis. To develop an efficient method of identifying rare variants in a large number of samples, we applied multiplexed resequencing using a next-generation sequencer to identification of rare variants of GBA. Sixteen sets of pooled DNAs from six pooled DNA samples were prepared. Each set of pooled DNAs was subjected to polymerase chain reaction to amplify the target gene (GBA) covering 6.5 kb, pooled into one tube with barcode indexing, and then subjected to extensive sequence analysis using the SOLiD System. Individual samples were also subjected to direct nucleotide sequence analysis. With the optimization of data processing, we were able to extract all the variants from 96 samples with acceptable rates of false-positive single-nucleotide variants.
Kent, Jack W
2016-02-03
New technologies for acquisition of genomic data, while offering unprecedented opportunities for genetic discovery, also impose severe burdens of interpretation and penalties for multiple testing. The Pathway-based Analyses Group of the Genetic Analysis Workshop 19 (GAW19) sought reduction of multiple-testing burden through various approaches to aggregation of high-dimensional data in pathways informed by prior biological knowledge. Experimental methods tested included the use of "synthetic pathways" (random sets of genes) to estimate power and false-positive error rate of methods applied to simulated data; data reduction via independent components analysis, single-nucleotide polymorphism (SNP)-SNP interaction, and use of gene sets to estimate genetic similarity; and general assessment of the efficacy of prior biological knowledge to reduce the dimensionality of complex genomic data. The work of this group explored several promising approaches to managing high-dimensional data, with the caveat that these methods are necessarily constrained by the quality of external bioinformatic annotation.
GeneSCF: a real-time based functional enrichment tool with support for multiple organisms.
Subhash, Santhilal; Kanduri, Chandrasekhar
2016-09-13
High-throughput technologies such as ChIP-sequencing, RNA-sequencing, DNA sequencing and quantitative metabolomics generate a huge volume of data. Researchers often rely on functional enrichment tools to interpret the biological significance of the affected genes from these high-throughput studies. However, currently available functional enrichment tools need to be updated frequently to adapt to new entries from the functional database repositories. Hence there is a need for a simplified tool that can perform functional enrichment analysis by using updated information directly from source databases such as KEGG, Reactome or Gene Ontology. In this study, we focused on designing a command-line tool called GeneSCF (Gene Set Clustering based on Functional annotations) that can predict the functionally relevant biological information for a set of genes in a real-time updated manner. It is designed to handle information from more than 4000 organisms from freely available prominent functional databases like KEGG, Reactome and Gene Ontology. We successfully employed our tool on two published datasets to predict the biologically relevant functional information. The core features of this tool were tested on Linux machines without the need for installation of more dependencies. GeneSCF is more reliable compared to other enrichment tools because of its ability to use reference functional databases in real-time to perform enrichment analysis. It is an easy-to-integrate tool with other pipelines available for downstream analysis of high-throughput data. More importantly, GeneSCF can run multiple gene lists simultaneously on different organisms, thereby saving time for the users. Since the tool is designed to be ready-to-use, there is no need for any complex compilation and installation procedures.
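As an illustration of what a functional enrichment step computes, the following minimal sketch scores over-representation of annotation terms in a query gene list with a one-sided hypergeometric test. This is an assumption made for illustration: the abstract does not state which statistic GeneSCF uses, and the function names, gene identifiers and term names below are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(universe_size, term_size, query_size, overlap):
    """P(X >= overlap) when query_size genes are drawn without replacement
    from universe_size genes, term_size of which carry the annotation."""
    max_overlap = min(term_size, query_size)
    total = comb(universe_size, query_size)
    tail = sum(
        comb(term_size, k) * comb(universe_size - term_size, query_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total

def enrich(query_genes, annotations, universe):
    """Return (term, overlap, p_value) tuples sorted by ascending p-value."""
    universe = set(universe)
    query = set(query_genes) & universe
    results = []
    for term, members in annotations.items():
        members = set(members) & universe
        overlap = len(query & members)
        p = hypergeom_enrichment_p(len(universe), len(members), len(query), overlap)
        results.append((term, overlap, p))
    return sorted(results, key=lambda r: r[2])

# Toy data (hypothetical identifiers): 10 genes, two annotation terms.
universe = [f"g{i}" for i in range(10)]
annotations = {"termA": ["g0", "g1", "g2"], "termB": ["g7", "g8", "g9"]}
results = enrich(["g0", "g1", "g2"], annotations, universe)
for term, overlap, p in results:
    print(term, overlap, round(p, 4))
```

In practice the p-values would also need correction for the number of terms tested; that step is omitted here to keep the sketch short.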
Kadam, Anagha; Janto, Benjamin; Eutsey, Rory; Earl, Joshua P; Powell, Evan; Dahlgren, Margaret E; Hu, Fen Z; Ehrlich, Garth D; Hiller, N Luisa
2015-02-02
There is extensive genomic diversity among Streptococcus pneumoniae isolates. Approximately half of the comprehensive set of genes in the species (the supragenome or pangenome) is present in all the isolates (core set), and the remaining is unevenly distributed among strains (distributed set). The Streptococcus pneumoniae Supragenome Hybridization (SpSGH) array provides coverage for an extensive set of genes and polymorphisms encountered within this species, capturing this genomic diversity. Further, the capture is quantitative. In this manner, the SpSGH array allows for both genomic and transcriptomic analyses of diverse S. pneumoniae isolates on a single platform. In this unit, we present the SpSGH array, and describe in detail its design and implementation for both genomic and transcriptomic analyses. The methodology can be applied to construction and modification of SpSGH array platforms, as well as to other bacterial species as long as multiple whole-genome sequences are available that collectively capture the vast majority of the species supragenome. Copyright © 2015 John Wiley & Sons, Inc.
The α‐synuclein gene in multiple system atrophy
Ozawa, T; Healy, D G; Abou‐Sleiman, P M; Ahmadi, K R; Quinn, N; Lees, A J; Shaw, K; Wullner, U; Berciano, J; Moller, J C; Kamm, C; Burk, K; Josephs, K A; Barone, P; Tolosa, E; Goldstein, D B; Wenning, G; Geser, F; Holton, J L; Gasser, T; Revesz, T; Wood, N W
2006-01-01
Background The formation of α‐synuclein aggregates may be a critical event in the pathogenesis of multiple system atrophy (MSA). However, the role of this gene in the aetiology of MSA is unknown and untested. Method The linkage disequilibrium (LD) structure of the α‐synuclein gene was established and LD patterns were used to identify a set of tagging single nucleotide polymorphisms (SNPs) that represent 95% of the haplotype diversity across the entire gene. The effect of polymorphisms on the pathological expression of MSA in pathologically confirmed cases was also evaluated. Results and conclusion In 253 Gilman probable or definite MSA patients, 457 possible, probable, and definite MSA cases and 1472 controls, a frequency difference for the individual tagging SNPs or tag‐defined haplotypes was not detected. No effect was observed of polymorphisms on the pathological expression of MSA in pathologically confirmed cases. PMID:16543523
Schmidt, S; Pericak-Vance, M A; Sawcer, S; Barcellos, L F; Hart, J; Sims, J; Prokop, A M; van der Walt, J; DeLoa, C; Lincoln, R R; Oksenberg, J R; Compston, A; Hauser, S L; Haines, J L; Gregory, S G
2006-07-01
Discrepant findings have been reported regarding an association of the apolipoprotein E (APOE) gene with the clinical course of multiple sclerosis (MS). To resolve these discrepancies, we examined common sequence variation in six candidate genes residing in a 380-kb genomic region surrounding and including the APOE locus for an association with MS severity. We genotyped at least three polymorphisms in each of six candidate genes in 1,540 Caucasian MS families (729 single-case and multiple-case families from the United States, 811 single-case families from the UK). By applying the quantitative transmission/disequilibrium test to a recently proposed MS severity score, the only statistically significant (P=0.003) association with MS severity was found for an intronic variant in the Herpes Virus Entry Mediator-B Gene PVRL2. Additional genotyping extended the association to a 16.6 kb block spanning intron 1 to intron 2 of the gene. Sequencing of PVRL2 failed to identify variants with an obvious functional role. In conclusion, the analysis of a very large data set suggests that genetic polymorphisms in PVRL2 may influence MS severity and supports the possibility that viral factors may contribute to the clinical course of MS, consistent with previous reports.
OrthoMCL: Identification of Ortholog Groups for Eukaryotic Genomes
Li, Li; Stoeckert, Christian J.; Roos, David S.
2003-01-01
The identification of orthologous groups is useful for genome annotation, studies on gene/protein evolution, comparative genomics, and the identification of taxonomically restricted sequences. Methods successfully exploited for prokaryotic genome analysis have proved difficult to apply to eukaryotes, however, as larger genomes may contain multiple paralogous genes, and sequence information is often incomplete. OrthoMCL provides a scalable method for constructing orthologous groups across multiple eukaryotic taxa, using a Markov Cluster algorithm to group (putative) orthologs and paralogs. This method performs similarly to the INPARANOID algorithm when applied to two genomes, but can be extended to cluster orthologs from multiple species. OrthoMCL clusters are coherent with groups identified by EGO, but improved recognition of “recent” paralogs permits overlapping EGO groups representing the same gene to be merged. Comparison with previously assigned EC annotations suggests a high degree of reliability, implying utility for automated eukaryotic genome annotation. OrthoMCL has been applied to the proteome data set from seven publicly available genomes (human, fly, worm, yeast, Arabidopsis, the malaria parasite Plasmodium falciparum, and Escherichia coli). A Web interface allows queries based on individual genes or user-defined phylogenetic patterns (http://www.cbil.upenn.edu/gene-family). Analysis of clusters incorporating P. falciparum genes identifies numerous enzymes that were incompletely annotated in first-pass annotation of the parasite genome. PMID:12952885
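The Markov Cluster step mentioned above can be sketched on a toy similarity matrix. The sketch below alternates the two standard MCL operations, expansion (matrix squaring) and inflation (element-wise power followed by column normalization), until the matrix stops changing. It is a minimal illustration only: the function names, the inflation value and the toy matrix are assumptions, and OrthoMCL's own similarity weighting and normalization are not reproduced.

```python
def normalize_columns(m):
    """Scale each column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Plain square-matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def markov_cluster(similarity, inflation=2.0, max_iterations=50, tol=1e-9):
    """Cluster nodes of a symmetric similarity matrix with a basic MCL loop."""
    n = len(similarity)
    # Add self-loops so every column has non-zero mass.
    m = [[similarity[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(max_iterations):
        expanded = matmul(m, m)                                   # expansion
        inflated = normalize_columns(
            [[v ** inflation for v in row] for row in expanded])  # inflation
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Each row that retains mass lists the members of one cluster.
    clusters = {frozenset(j for j in range(n) if m[i][j] > 1e-6) for i in range(n)}
    return sorted(sorted(c) for c in clusters if c)

# Toy graph: proteins 0-2 are mutually similar, as are proteins 3-4.
toy_similarity = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
clusters = markov_cluster(toy_similarity)
print(clusters)
```

Higher inflation values generally yield finer-grained clusters, so the parameter controls how aggressively related groups are split.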
Microgravity and immunity: Changes in lymphocyte gene expression.
NASA Astrophysics Data System (ADS)
Risin, D.; Ward, N. E.; Risin, S. A.; Pellis, N. R.
Earlier studies had shown that modeled and true microgravity (MG) cause multiple direct effects on human lymphocytes. MG inhibits lymphocyte locomotion, suppresses polyclonal and antigen-specific activation, and affects signal transduction mechanisms as well as activation-induced apoptosis. In this study we assessed changes in gene expression associated with lymphocyte exposure to microgravity in an attempt to identify microgravity-sensitive genes (MGSG) in general and specifically those genes that might be responsible for the functional and structural changes observed earlier. Two sets of experiments targeting different goals were conducted. In the first set, T-lymphocytes from normal donors were activated with anti-CD3 and IL2 and then cultured in 1g (static) and modeled MG (MMG) conditions (Rotating Wall Vessel bioreactor) for 24 hours. This setting allowed searching for MGSG by comparison of gene expression patterns in zero and 1 g gravity. In the second set, activated T-cells after culturing for 24 hours in 1g and MMG were exposed three hours before harvesting to a secondary activation stimulus (PHA), thus triggering the apoptotic pathway. Total RNA was extracted using the RNeasy isolation kit (Qiagen, Valencia, CA). Affymetrix Gene Chips (U133A), allowing testing for 18,400 human genes, were used for microarray analysis. The experiments were performed in triplicates with T-cells obtained from different blood donors to minimize the possible input of biological variation in gene expression and discriminate changes that are associated with the
Willsey, A. Jeremy; Sanders, Stephan J.; Li, Mingfeng; Dong, Shan; Tebbenkamp, Andrew T.; Muhle, Rebecca A.; Reilly, Steven K.; Lin, Leon; Fertuzinhos, Sofia; Miller, Jeremy A.; Murtha, Michael T.; Bichsel, Candace; Niu, Wei; Cotney, Justin; Ercan-Sencicek, A. Gulhan; Gockley, Jake; Gupta, Abha; Han, Wenqi; He, Xin; Hoffman, Ellen; Klei, Lambertus; Lei, Jing; Liu, Wenzhong; Liu, Li; Lu, Cong; Xu, Xuming; Zhu, Ying; Mane, Shrikant M.; Lein, Edward S.; Wei, Liping; Noonan, James P.; Roeder, Kathryn; Devlin, Bernie; Šestan, Nenad; State, Matthew W.
2013-01-01
SUMMARY Autism spectrum disorder (ASD) is a complex developmental syndrome of unknown etiology. Recent studies employing exome- and genome-wide sequencing have identified nine high-confidence ASD (hcASD) genes. Working from the hypothesis that ASD-associated mutations in these biologically pleiotropic genes will disrupt intersecting developmental processes to contribute to a common phenotype, we have attempted to identify time periods, brain regions, and cell types in which these genes converge. We have constructed coexpression networks based on the hcASD “seed” genes, leveraging a rich expression data set encompassing multiple human brain regions across human development and into adulthood. By assessing enrichment of an independent set of probable ASD (pASD) genes, derived from the same sequencing studies, we demonstrate a key point of convergence in midfetal layer 5/6 cortical projection neurons. This approach informs when, where, and in what cell types mutations in these specific genes may be productively studied to clarify ASD pathophysiology. PMID:24267886
The Cross-Entropy Based Multi-Filter Ensemble Method for Gene Selection.
Sun, Yingqiang; Lu, Chengbo; Li, Xiaobo
2018-05-17
The gene expression profile has the characteristics of a high dimension, low sample, and continuous type, and it is a great challenge to use gene expression profile data for the classification of tumor samples. This paper proposes a cross-entropy based multi-filter ensemble (CEMFE) method for microarray data classification. Firstly, multiple filters are used to select the microarray data in order to obtain a plurality of the pre-selected feature subsets with a different classification ability. The top N genes with the highest rank of each subset are integrated so as to form a new data set. Secondly, the cross-entropy algorithm is used to remove the redundant data in the data set. Finally, the wrapper method, which is based on forward feature selection, is used to select the best feature subset. The experimental results show that the proposed method is more efficient than other gene selection methods and that it can achieve a higher classification accuracy under fewer characteristic genes.
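The first stage described above (several filters each nominate their top N genes, and the nominations are merged into one candidate set) can be sketched as follows. This is a minimal illustration under stated assumptions: the two scoring filters (variance and between-class mean gap) are stand-ins chosen for brevity, not the filters used by the authors, and the cross-entropy redundancy removal and wrapper stages are not shown. All names and data are hypothetical.

```python
from statistics import mean, pvariance

def variance_score(values, labels):
    """Filter 1 (assumed): overall variance of a gene's expression values."""
    return pvariance(values)

def mean_gap_score(values, labels):
    """Filter 2 (assumed): absolute difference of class means (labels 0/1)."""
    class0 = [v for v, y in zip(values, labels) if y == 0]
    class1 = [v for v, y in zip(values, labels) if y == 1]
    return abs(mean(class0) - mean(class1))

def top_n(expression, labels, score_fn, n):
    """Return the n genes with the highest score under one filter."""
    ranked = sorted(expression,
                    key=lambda gene: score_fn(expression[gene], labels),
                    reverse=True)
    return ranked[:n]

def multi_filter_union(expression, labels, filters, n):
    """Merge the top-n genes of every filter, keeping first-seen order."""
    selected = []
    for score_fn in filters:
        for gene in top_n(expression, labels, score_fn, n):
            if gene not in selected:
                selected.append(gene)
    return selected

# Toy data: four genes measured in six samples (three per class).
labels = [0, 0, 0, 1, 1, 1]
expression = {
    "g1": [1, 1, 1, 5, 5, 5],
    "g2": [0, 10, 0, 10, 0, 10],
    "g3": [2, 2, 2, 2, 2, 2],
    "g4": [3, 3, 3, 4, 4, 4],
}
selected = multi_filter_union(expression, labels,
                              [variance_score, mean_gap_score], n=1)
print(selected)
```

Because the filters rank genes by different criteria, the merged set can be larger than N, which is why a later redundancy-removal step is useful.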
Malki, K; Pain, O; Tosto, M G; Du Rietz, E; Carboni, L; Schalkwyk, L C
2015-01-01
Despite moderate heritability estimates, progress in uncovering the molecular substrate underpinning major depressive disorder (MDD) has been slow. In this study, we used prefrontal cortex (PFC) gene expression from a genetic rat model of MDD to inform probe set prioritization in PFC in a human post-mortem study to uncover genes and gene pathways associated with MDD. Gene expression differences between Flinders sensitive (FSL) and Flinders resistant (FRL) rat lines were statistically evaluated using the RankProd, non-parametric algorithm. Top ranking probe sets in the rat study were subsequently used to prioritize orthologous selection in a human PFC in a case–control post-mortem study on MDD from the Stanley Brain Consortium. Candidate genes in the human post-mortem study were then tested against a matched control sample using the RankProd method. A total of 1767 probe sets were differentially expressed in the PFC between FSL and FRL rat lines at (q⩽0.001). A total of 898 orthologous probe sets was found on Affymetrix's HG-U95A chip used in the human study. Correcting for the number of multiple, non-independent tests, 20 probe sets were found to be significantly dysregulated between human cases and controls at q⩽0.05. These probe sets tagged the expression profile of 18 human genes (11 upregulated and seven downregulated). Using an integrative rat–human study, a number of convergent genes that may have a role in pathogenesis of MDD were uncovered. Eighty percent of these genes were functionally associated with a key stress response signalling cascade, involving NF-κB (nuclear factor kappa-light-chain-enhancer of activated B cells), AP-1 (activator protein 1) and ERK/MAPK, which has been systematically associated with MDD, neuroplasticity and neurogenesis. PMID:25734512
Identification of the Core Set of Carbon-Associated Genes in a Bioenergy Grassland Soil
Howe, Adina; Yang, Fan; Williams, Ryan J.; ...
2016-11-17
Despite the central role of soil microbial communities in global carbon (C) cycling, little is known about soil microbial community structure and even less about their metabolic pathways. Efforts to characterize soil communities often focus on identifying differences in gene content across environmental gradients, but an alternative question is what genes are similar in soils. These genes may indicate critical species or potential functions that are required in all soils. Here we identified the “core” set of C cycling sequences widely present in multiple soil metagenomes from a fertilized prairie (FP). Of 226,887 sequences associated with known enzymes involved in the synthesis, metabolism, and transport of carbohydrates, 843 were identified to be consistently prevalent across four replicate soil metagenomes. This core metagenome was functionally and taxonomically diverse, representing five enzyme classes and 99 enzyme families within the CAZy database. Though it only comprised 0.4% of all CAZy-associated genes identified in FP metagenomes, the core was found to be comprised of functions similar to those within cumulative soils. The FP CAZy-associated core sequences were present in multiple publicly available soil metagenomes and most similar to soils sharing geographic proximity. As a result, in soil ecosystems, where high diversity remains a key challenge for metagenomic investigations, these core genes represent a subset of critical functions necessary for carbohydrate metabolism, which can be targeted to evaluate important C fluxes in these and other similar soils.
Predicting Gene Structures from Multiple RT-PCR Tests
NASA Astrophysics Data System (ADS)
Kováč, Jakub; Vinař, Tomáš; Brejová, Broňa
It has been demonstrated that the use of additional information such as ESTs and protein homology can significantly improve accuracy of gene prediction. However, many sources of external information are still being omitted from consideration. Here, we investigate the use of product lengths from RT-PCR experiments in gene finding. We present hardness results and practical algorithms for several variants of the problem and apply our methods to a real RT-PCR data set in the Drosophila genome. We conclude that the use of RT-PCR data can improve the sensitivity of gene prediction and locate novel splicing variants.
Dimitrova, Irina K.; Richer, Jennifer K.; Rudolph, Michael C.; Spoelstra, Nicole S.; Reno, Elaine M.; Medina, Theresa M.; Bradford, Andrew P.
2009-01-01
Objective To identify differentially expressed genes between fibroid and adjacent normal myometrium in an identical hormonal and genetic background. Design Array analysis of 3 leiomyomata and matched adjacent normal myometrium in a single patient. Setting University of Colorado Hospital. Patient(s) A single female undergoing medically indicated hysterectomy for symptomatic fibroids. Intervention(s) mRNA isolation and microarray analysis, reverse-transcriptase polymerase chain reaction, western blotting and immunohistochemistry. Main Outcome Measure(s) Changes in mRNA and protein levels in leiomyomata and matched normal myometrium. Result(s) Expression of 197 genes was increased and 619 decreased, significantly by at least 2 fold, in leiomyomata relative to normal myometrium. Expression profiles between tumors were similar and normal myometrial samples showed minimal variation. Changes in, and variation of, expression of selected genes were confirmed in additional normal and leiomyoma samples from multiple patients. Conclusion(s) Analysis of multiple tumors from a single patient confirmed changes in expression of genes described in previous, apparently disparate, studies and identified novel targets. Gene expression profiles in leiomyomata are consistent with increased activation of mitogenic pathways and inhibition of apoptosis. Down-regulation of genes implicated in invasion and metastasis of cancers was observed in fibroids. This expression pattern may underlie the benign nature of uterine leiomyomata and may aid in the differential diagnosis of leiomyosarcoma. PMID:18672237
Detection of Pathways Affected by Positive Selection in Primate Lineages Ancestral to Humans
Moretti, S.; Davydov, I.I.; Excoffier, L.
2017-01-01
Gene set enrichment approaches have been increasingly successful in finding signals of recent polygenic selection in the human genome. In this study, we aim at detecting biological pathways affected by positive selection in more ancient human evolutionary history. Focusing on four branches of the primate tree that lead to modern humans, we tested all available protein coding gene trees of the Primates clade for signals of adaptation in these branches, using the likelihood-based branch site test of positive selection. The results of these locus-specific tests were then used as input for a gene set enrichment test, where whole pathways are globally scored for a signal of positive selection, instead of focusing only on outlier “significant” genes. We identified signals of positive selection in several pathways that are mainly involved in immune response, sensory perception, metabolism, and energy production. These pathway-level results are highly significant, even though there is no functional enrichment when only focusing on top scoring genes. Interestingly, several gene sets are found significant at multiple levels in the phylogeny, but different genes are responsible for the selection signal in the different branches. This suggests that the same function has been optimized in different ways at different times in primate evolution. PMID:28333345
WormQTLHD--a web database for linking human disease to natural variation data in C. elegans.
van der Velde, K Joeri; de Haan, Mark; Zych, Konrad; Arends, Danny; Snoek, L Basten; Kammenga, Jan E; Jansen, Ritsert C; Swertz, Morris A; Li, Yang
2014-01-01
Interactions between proteins are highly conserved across species. As a result, the molecular basis of multiple diseases affecting humans can be studied in model organisms that offer many alternative experimental opportunities. One such organism, Caenorhabditis elegans, has been used to produce much molecular quantitative genetics and systems biology data over the past decade. We present WormQTL(HD) (Human Disease), a database that quantitatively and systematically links expression Quantitative Trait Loci (eQTL) findings in C. elegans to gene-disease associations in man. WormQTL(HD), available online at http://www.wormqtl-hd.org, is a user-friendly set of tools to reveal functionally coherent, evolutionarily conserved gene networks. These can be used to predict novel gene-to-gene associations and the functions of genes underlying the disease of interest. We created a new database that links C. elegans eQTL data sets to human diseases (34 337 gene-disease associations from OMIM, DGA, GWAS Central and NHGRI GWAS Catalogue) based on overlapping sets of orthologous genes associated to phenotypes in these two species. We utilized QTL results, high-throughput molecular phenotypes, classical phenotypes and genotype data covering different developmental stages and environments from the WormQTL database. All software is available as open source, built on MOLGENIS and xQTL workbench.
RAMONA: a Web application for gene set analysis on multilevel omics data.
Sass, Steffen; Buettner, Florian; Mueller, Nikola S; Theis, Fabian J
2015-01-01
Decreasing costs of modern high-throughput experiments allow for the simultaneous analysis of altered gene activity on various molecular levels. However, these multi-omics approaches lead to a large amount of data, which is hard to interpret for a non-bioinformatician. Here, we present the remotely accessible multilevel ontology analysis (RAMONA). It offers an easy-to-use interface for the simultaneous gene set analysis of combined omics datasets and is an extension of the previously introduced MONA approach. RAMONA is based on a Bayesian enrichment method for the inference of overrepresented biological processes among given gene sets. Overrepresentation is quantified by interpretable term probabilities. It is able to handle data from various molecular levels, while in parallel coping with redundancies arising from gene set overlaps and related multiple testing problems. The comprehensive output of RAMONA is easy to interpret and thus allows for functional insight into the affected biological processes. With RAMONA, we provide an efficient implementation of the Bayesian inference problem such that ontologies consisting of thousands of terms can be processed in the order of seconds. RAMONA is implemented as ASP.NET Web application and publicly available at http://icb.helmholtz-muenchen.de/ramona. © The Author 2014. Published by Oxford University Press.
Shah, Binal; Jain, Kunal; Patel, Namrata; Pandit, Ramesh; Patel, Anand; Joshi, Chaitanya G.
2015-01-01
Paenibacillus sp. strain DMB20, in cometabolism with other Proteobacteria and Firmicutes, exhibits azoreduction of textile dyes. Here, we report the draft genome sequence of this bacterium, consisting of 6,647,181 bp with 7,668 coding sequences (CDSs). The data presented highlight multiple sets of functional genes associated with xenobiotic compound degradation. PMID:26067950
Pyviko: an automated Python tool to design gene knockouts in complex viruses with overlapping genes.
Taylor, Louis J; Strebel, Klaus
2017-01-07
Gene knockouts are a common tool used to study gene function in various organisms. However, designing gene knockouts is complicated in viruses, which frequently contain sequences that code for multiple overlapping genes. Designing mutants that can be traced by the creation of new or elimination of existing restriction sites further compounds the difficulty in experimental design of knockouts of overlapping genes. While software is available to rapidly identify restriction sites in a given nucleotide sequence, no existing software addresses experimental design of mutations involving multiple overlapping amino acid sequences in generating gene knockouts. Pyviko performed well on a test set of over 240,000 gene pairs collected from viral genomes deposited in the National Center for Biotechnology Information Nucleotide database, identifying a point mutation which added a premature stop codon within the first 20 codons of the target gene in 93.2% of all tested gene-overprinted gene pairs. This shows that Pyviko can be used successfully in a wide variety of contexts to facilitate the molecular cloning and study of viral overprinted genes. Pyviko is an extensible and intuitive Python tool for designing knockouts of overlapping genes. Freely available as both a Python package and a web-based interface (http://louiejtaylor.github.io/pyViKO/), Pyviko simplifies the experimental design of gene knockouts in complex viruses with overlapping genes.
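The knockout search described in this abstract can be illustrated with a minimal sketch. It is not Pyviko's API: the function names, the toy sequence and the frame offset below are assumptions made for illustration. The sketch looks for single-base substitutions that create a premature stop codon early in the target gene (frame 0) while leaving the translation of an overlapping reading frame unchanged.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(seq, offset=0):
    """Translate complete codons of seq starting at the given offset."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(offset, len(seq) - 2, 3))


def find_stop_mutations(seq, overlap_offset, max_codons=20):
    """Return (position, old_base, new_base) substitutions that add a stop
    codon in frame 0 without changing the protein read from overlap_offset.
    Hypothetical helper, assuming the target gene starts at position 0."""
    original_overlap = translate(seq, overlap_offset)
    limit = min(max_codons * 3, len(seq) - len(seq) % 3)
    hits = []
    for pos in range(3, limit):  # positions 0-2 are the start codon
        codon_start = pos - pos % 3
        if CODON_TABLE[seq[codon_start:codon_start + 3]] == "*":
            continue  # codon is already a stop
        for base in BASES:
            if base == seq[pos]:
                continue
            mutant = seq[:pos] + base + seq[pos + 1:]
            if CODON_TABLE[mutant[codon_start:codon_start + 3]] != "*":
                continue
            if translate(mutant, overlap_offset) == original_overlap:
                hits.append((pos, seq[pos], base))
    return hits


# Toy sequence (an assumption, not real viral data): frame 0 encodes M Q G L *,
# and the overlapping frame at offset 1 encodes C K V C.
toy = "ATGCAAGGTCTGTAA"
hits = find_stop_mutations(toy, overlap_offset=1)
```

In this toy case the only hit changes CAA to TAA in the target frame, while the overlapping codon changes from TGC to TGT and still encodes cysteine.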
Role of DISC1 interacting proteins in schizophrenia risk from genome-wide analysis of missense SNPs.
Costas, Javier; Suárez-Rama, Jose Javier; Carrera, Noa; Paz, Eduardo; Páramo, Mario; Agra, Santiago; Brenlla, Julio; Ramos-Ríos, Ramón; Arrojo, Manuel
2013-11-01
A balanced translocation affecting DISC1 cosegregates with several psychiatric disorders, including schizophrenia, in a Scottish family. DISC1 is a hub protein of a network of protein-protein interactions involved in multiple developmental pathways within the brain. Gene set-based analysis has been proposed as an alternative to individual analysis of single nucleotide polymorphisms (SNPs) to get information from genome-wide association studies. In this work, we tested for an overrepresentation of the DISC1 interacting proteins within the top results of our ranked list of genes based on our previous genome-wide association study of missense SNPs in schizophrenia. Our data set consisted of 5100 common missense SNPs genotyped in 476 schizophrenic patients and 447 control subjects from Galicia, NW Spain. We used a modification of the Gene Set Enrichment Analysis adapted for SNPs, as implemented in the GenGen software. The analysis detected an overrepresentation of the DISC1 interacting proteins (permuted P-value=0.0158), indicative of the role of this gene set in schizophrenia risk. We identified seven leading-edge genes, MACF1, UTRN, DST, DISC1, KIF3A, SYNE1, and AKAP9, responsible for the overrepresentation. These genes are involved in neuronal cytoskeleton organization and intracellular transport through the microtubule cytoskeleton, suggesting that these processes may be impaired in schizophrenia. © 2013 John Wiley & Sons Ltd/University College London.
Joshi, Anagha
2014-12-30
Transcriptional hotspots are defined as genomic regions bound by multiple factors. They have been identified recently as cell type specific enhancers regulating developmentally essential genes in many species such as worm, fly and humans. The in-depth analysis of hotspots across multiple cell types in the same species still remains to be explored and can bring new biological insights. We therefore collected 108 transcription-related factor (TF) ChIP sequencing data sets in ten murine cell types and classified the peaks in each cell type in three groups according to binding occupancy as singletons (low-occupancy), combinatorials (mid-occupancy) and hotspots (high-occupancy). The peaks in the three groups clustered largely according to the occupancy, suggesting priming of genomic loci for mid occupancy irrespective of cell type. We then characterized hotspots for diverse structural and functional properties. The genes neighbouring hotspots had a small overlap with hotspot genes in other cell types and were highly enriched for cell type specific function. Hotspots were enriched for sequence motifs of key TFs in that cell type and more than 90% of hotspots were occupied by pioneering factors. Though we did not find any sequence signature in the three groups, the H3K4me1 binding profile had bimodal peaks at hotspots, distinguishing hotspots from mono-modal H3K4me1 singletons. In ES cells, differentially expressed genes after perturbation of activators were enriched for hotspot genes suggesting hotspots primarily act as transcriptional activator hubs. Finally, we proposed that ES hotspots might be under control of SetDB1 and not DNMT for silencing. Transcriptional hotspots are enriched for tissue specific enhancers near cell type specific highly expressed genes. In ES cells, they are predicted to act as transcriptional activator hubs and might be under SetDB1 control for silencing.
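The three occupancy groups named in this abstract can be sketched as a simple classification by the number of distinct factors bound at each region. The abstract does not state the occupancy cut-offs, so the `hotspot_min` threshold, the region coordinates and the factor lists below are all illustrative assumptions.

```python
def classify_by_occupancy(region_factors, hotspot_min=5):
    """Group regions by how many distinct factors bind them.

    hotspot_min is a hypothetical cut-off, not a value from the study.
    """
    groups = {"singleton": [], "combinatorial": [], "hotspot": []}
    for region, factors in region_factors.items():
        n_factors = len(set(factors))
        if n_factors == 0:
            continue  # unbound regions are not peaks
        if n_factors == 1:
            groups["singleton"].append(region)
        elif n_factors >= hotspot_min:
            groups["hotspot"].append(region)
        else:
            groups["combinatorial"].append(region)
    return groups


# Made-up regions and factor lists, for illustration only.
example = {
    "chr1:1000-1500": ["Gata1"],
    "chr1:9000-9400": ["Gata1", "Tal1", "Runx1"],
    "chr2:500-1100": ["Gata1", "Tal1", "Runx1", "Fli1", "Erg", "Lmo2"],
}
groups = classify_by_occupancy(example)
```

A fixed count threshold is the simplest choice; a percentile of the occupancy distribution within each cell type would be an equally plausible alternative.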
Wang, Jinglu; Qu, Susu; Wang, Weixiao; Guo, Liyuan; Zhang, Kunlin; Chang, Suhua; Wang, Jing
2016-11-01
A number of gene expression profiling studies of bipolar disorder have been published. Besides differences in array chips and tissues, the variety of data processing procedures across cohorts has aggravated the inconsistency of the results of these genome-wide gene expression profiling studies. By searching the gene expression databases, we obtained six data sets for prefrontal cortex (PFC) of bipolar disorder with raw data and combinable platforms. We used standardized pre-processing and quality control procedures to analyze each data set separately and then combined them into a large gene expression matrix with 101 bipolar disorder subjects and 106 controls. A standard linear mixed-effects model was used to calculate the differentially expressed genes (DEGs). Multiple levels of sensitivity analyses and cross validation with genetic data were conducted. Functional and network analyses were carried out on the basis of the DEGs. As a result, we identified 198 unique differentially expressed genes in the PFC between bipolar disorder subjects and controls. Among them, 115 DEGs were robust to at least three leave-one-out tests or different pre-processing methods; 51 DEGs were validated with genetic association signals. Pathway enrichment analysis showed these DEGs were related to regulation of neurological system, cell death and apoptosis, and several basic binding processes. Protein-protein interaction network analysis further identified one key hub gene. We have contributed the most comprehensive integrated analysis of bipolar disorder expression profiling studies in PFC to date. The DEGs, especially those with multiple validations, may denote a common signature of bipolar disorder and contribute to the pathogenesis of disease. Copyright © 2016 Elsevier Ltd. All rights reserved.
Tao, Yebin; Sánchez, Brisa N; Mukherjee, Bhramar
2015-03-30
Many existing cohort studies designed to investigate health effects of environmental exposures also collect data on genetic markers. The Early Life Exposures in Mexico to Environmental Toxicants project, for instance, has been genotyping single nucleotide polymorphisms on candidate genes involved in mental and nutrient metabolism and also in potentially shared metabolic pathways with the environmental exposures. Given the longitudinal nature of these cohort studies, rich exposure and outcome data are available to address novel questions regarding gene-environment interaction (G × E). Latent variable (LV) models have been effectively used for dimension reduction, helping with multiple testing and multicollinearity issues in the presence of correlated multivariate exposures and outcomes. In this paper, we first propose a modeling strategy, based on LV models, to examine the association between repeated outcome measures (e.g., child weight) and a set of correlated exposure biomarkers (e.g., prenatal lead exposure). We then construct novel tests for G × E effects within the LV framework to examine effect modification of outcome-exposure association by genetic factors (e.g., the hemochromatosis gene). We consider two scenarios: one allowing dependence of the LV models on genes and the other assuming independence between the LV models and genes. We combine the two sets of estimates by shrinkage estimation to trade off bias and efficiency in a data-adaptive way. Using simulations, we evaluate the properties of the shrinkage estimates, and in particular, we demonstrate the need for this data-adaptive shrinkage given repeated outcome measures, exposure measures possibly repeated and time-varying gene-environment association. Copyright © 2014 John Wiley & Sons, Ltd.
2013-01-01
Background: Colorectal cancer is the third leading cause of cancer deaths in the United States. The initial assessment of colorectal cancer involves clinical staging that takes into account the extent of primary tumor invasion, determining the number of lymph nodes with metastatic cancer and the identification of metastatic sites in other organs. Advanced clinical stage indicates metastatic cancer, either in regional lymph nodes or in distant organs. While the genomic and genetic basis of colorectal cancer has been elucidated to some degree, less is known about the identity of specific cancer genes that are associated with advanced clinical stage and metastasis. Methods: We compiled multiple genomic data types (mutations, copy number alterations, gene expression and methylation status) as well as clinical meta-data from The Cancer Genome Atlas (TCGA). We used an elastic-net regularized regression method on the combined genomic data to identify genetic aberrations and their associated cancer genes that are indicators of clinical stage. We ranked candidate genes by their regression coefficient and level of support from multiple assay modalities. Results: A fit of the elastic-net regularized regression to 197 samples and integrated analysis of four genomic platforms identified the set of top gene predictors of advanced clinical stage, including: WRN, SYK, DDX5 and ADRA2C. These genetic features were identified robustly in bootstrap resampling analysis. Conclusions: We conducted an analysis integrating multiple genomic features including mutations, copy number alterations, gene expression and methylation. This integrated approach in which one considers all of these genomic features performs better than any individual genomic assay. We identified multiple genes that robustly delineate advanced clinical stage, suggesting their possible role in colorectal cancer metastatic progression. PMID:24308539
Yang, Ze-Hui; Zheng, Rui; Gao, Yuan; Zhang, Qiang
2016-09-01
With the widespread application of high-throughput technology, numerous meta-analysis methods have been proposed for differential expression profiling across multiple studies. We identified suitable differentially expressed (DE) genes that contributed to lung adenocarcinoma (ADC) clustering based on seven popular meta-analysis methods. Seven microarray expression profiles of ADC and normal controls were extracted from the ArrayExpress database. Bioconductor was used for preliminary data preprocessing. Then, DE genes across multiple studies were identified. Hierarchical clustering was applied to compare the classification performance for microarray data samples. The classification efficiency was compared based on accuracy, sensitivity and specificity. Across seven datasets, 573 ADC cases and 222 normal controls were collected. After filtering out unexpressed and noninformative genes, 3688 genes remained for further analysis. The classification efficiency analysis showed that DE genes identified by the sum of ranks method separated ADC from normal controls with the best accuracy, sensitivity and specificity of 0.953, 0.969 and 0.932, respectively. The gene set with the highest classification accuracy mainly participated in the regulation of response to external stimulus (P = 7.97E-04), cyclic nucleotide-mediated signaling (P = 0.01), regulation of cell morphogenesis (P = 0.01) and regulation of cell proliferation (P = 0.01). Evaluating the classification efficiency of DE genes identified by different meta-analysis methods provided a new perspective on choosing a suitable method for a given application. Different meta-analysis methods perform differently, so the choice of method for a particular study should weigh several criteria together. © 2015 John Wiley & Sons Ltd.
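A minimal sketch of the two computations named in this abstract follows: a sum-of-ranks combination of per-study results, and the accuracy, sensitivity and specificity used to compare classifications. The gene names, p-values and sample labels are made up, and the rank combination shown is the generic technique rather than the exact Bioconductor procedure the authors ran. Ties are ranked in input order, which is a simplification.

```python
def sum_of_ranks(pvalues_by_study):
    """Rank genes by p-value within each study and sum the ranks.

    Assumes every study reports the same genes. A lower total means the
    gene is more consistently differentially expressed.
    """
    totals = {}
    for study in pvalues_by_study:
        ordered = sorted(study, key=study.get)
        for rank, gene in enumerate(ordered, start=1):
            totals[gene] = totals.get(gene, 0) + rank
    return totals


def classification_metrics(actual, predicted, positive):
    """Return (accuracy, sensitivity, specificity) for a two-class result."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    return (tp + tn) / len(pairs), tp / (tp + fn), tn / (tn + fp)


# Hypothetical p-values for three genes in three studies.
studies = [
    {"geneA": 0.001, "geneB": 0.04, "geneC": 0.5},
    {"geneA": 0.01, "geneB": 0.002, "geneC": 0.3},
    {"geneA": 0.0005, "geneB": 0.2, "geneC": 0.03},
]
totals = sum_of_ranks(studies)

# Hypothetical cluster assignments compared with the true sample labels.
actual = ["ADC", "ADC", "ADC", "normal", "normal"]
predicted = ["ADC", "ADC", "normal", "normal", "normal"]
accuracy, sensitivity, specificity = classification_metrics(actual, predicted, "ADC")
```

Here geneA has the lowest rank total and would be selected first; the metric function assumes both classes are present in `actual`, otherwise a division by zero occurs.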
Benitez, Cecil M.; Qu, Kun; Sugiyama, Takuya; Pauerstein, Philip T.; Liu, Yinghua; Tsai, Jennifer; Gu, Xueying; Ghodasara, Amar; Arda, H. Efsun; Zhang, Jiajing; Dekker, Joseph D.; Tucker, Haley O.; Chang, Howard Y.; Kim, Seung K.
2014-01-01
The regulatory logic underlying global transcriptional programs controlling development of visceral organs like the pancreas remains undiscovered. Here, we profiled gene expression in 12 purified populations of fetal and adult pancreatic epithelial cells representing crucial progenitor cell subsets, and their endocrine or exocrine progeny. Using probabilistic models to decode the general programs organizing gene expression, we identified co-expressed gene sets in cell subsets that revealed patterns and processes governing progenitor cell development, lineage specification, and endocrine cell maturation. Purification of Neurog3 mutant cells and module network analysis linked established regulators such as Neurog3 to unrecognized gene targets and roles in pancreas development. Iterative module network analysis nominated and prioritized transcriptional regulators, including diabetes risk genes. Functional validation of a subset of candidate regulators with corresponding mutant mice revealed that the transcription factors Etv1, Prdm16, Runx1t1 and Bcl11a are essential for pancreas development. Our integrated approach provides a unique framework for identifying regulatory genes and functional gene sets underlying pancreas development and associated diseases such as diabetes mellitus. PMID:25330008
Berchtold, Nicole C.; Coleman, Paul D.; Cribbs, David H.; Rogers, Joseph; Gillen, Daniel L.; Cotman, Carl W.
2014-01-01
Synapses are essential for transmitting, processing, and storing information, all of which decline in aging and Alzheimer’s disease (AD). Because synapse loss only partially accounts for the cognitive declines seen in aging and AD, we hypothesized that existing synapses might undergo molecular changes that reduce their functional capacity. Microarrays were used to evaluate expression profiles of 340 synaptic genes in aging (20–99 years) and AD across 4 brain regions from 81 cases. The analysis revealed an unexpectedly large number of significant expression changes in synapse-related genes in aging, with many undergoing progressive downregulation across aging and AD. Functional classification of the genes showing altered expression revealed that multiple aspects of synaptic function are affected, notably synaptic vesicle trafficking and release, neurotransmitter receptors and receptor trafficking, postsynaptic density scaffolding, cell adhesion regulating synaptic stability, and neuromodulatory systems. The widespread declines in synaptic gene expression in normal aging suggest that the function of existing synapses might be impaired, and that a common set of synaptic genes is vulnerable to change in aging and AD. PMID:23273601
Wang, Yi-Ting; Sung, Pei-Yuan; Lin, Peng-Lin; Yu, Ya-Wen; Chung, Ren-Hua
2015-05-15
Genome-wide association studies (GWAS) have become a common approach to identifying single nucleotide polymorphisms (SNPs) associated with complex diseases. As complex diseases are caused by the joint effects of multiple genes, while the effect of an individual gene or SNP is modest, a method considering the joint effects of multiple SNPs can be more powerful than testing individual SNPs. The multi-SNP analysis aims to test association based on a SNP set, usually defined based on biological knowledge such as gene or pathway, which may contain only a portion of SNPs with effects on the disease. Therefore, a challenge for the multi-SNP analysis is how to effectively select a subset of SNPs with promising association signals from the SNP set. We developed the Optimal P-value Threshold Pedigree Disequilibrium Test (OPTPDT). The OPTPDT uses general nuclear families. A variable p-value threshold algorithm is used to determine an optimal p-value threshold for selecting a subset of SNPs. A permutation procedure is used to assess the significance of the test. We used simulations to verify that the OPTPDT has correct type I error rates. Our power studies showed that the OPTPDT can be more powerful than the set-based test in PLINK, the multi-SNP FBAT test, and the p-value based test GATES. We applied the OPTPDT to a family-based autism GWAS dataset for gene-based association analysis and identified MACROD2-AS1 with genome-wide significance (p-value=2.5×10^-6). Our simulation results suggested that the OPTPDT is a valid and powerful test. The OPTPDT will be helpful for gene-based or pathway association analysis. The method is ideal for the secondary analysis of existing GWAS datasets, which may identify a set of SNPs with joint effects on the disease.
Genome-wide analysis of YY2 versus YY1 target genes
Chen, Li; Shioda, Toshi; Coser, Kathryn R.; Lynch, Mary C.; Yang, Chuanwei; Schmidt, Emmett V.
2010-01-01
Yin Yang 1 (YY1) is a critical transcription factor controlling cell proliferation, development and DNA damage responses. Retrotranspositions have independently generated additional YY family members in multiple species. Although Drosophila YY1 [pleiohomeotic (Pho)] and its homolog [pleiohomeotic-like (Phol)] redundantly control homeotic gene expression, the regulatory contributions of YY1-homologs have not yet been examined in other species. Indeed, targets for the mammalian YY1 homolog YY2 are completely unknown. Using gene set enrichment analysis, we found that lentiviral constructs containing short hairpin loop inhibitory RNAs for human YY1 (shYY1) and its homolog YY2 (shYY2) caused significant changes in both shared and distinguishable gene sets in human cells. Ribosomal protein genes were the most significant gene set upregulated by both shYY1 and shYY2, although combined shYY1/2 knock downs were not additive. In contrast, shYY2 reversed the anti-proliferative effects of shYY1, and shYY2 particularly altered UV damage response, platelet-specific and mitochondrial function genes. We found that decreases in YY1 or YY2 caused inverse changes in UV sensitivity, and that their combined loss reversed their respective individual effects. Our studies show that human YY2 is not redundant to YY1, and YY2 is a significant regulator of genes previously identified as uniquely responding to YY1. PMID:20215434
Diversification of Root Hair Development Genes in Vascular Plants.
Huang, Ling; Shi, Xinhui; Wang, Wenjia; Ryu, Kook Hui; Schiefelbein, John
2017-07-01
The molecular genetic program for root hair development has been studied intensively in Arabidopsis (Arabidopsis thaliana). To understand the extent to which this program might operate in other plants, we conducted a large-scale comparative analysis of root hair development genes from diverse vascular plants, including eudicots, monocots, and a lycophyte. Combining phylogenetics and transcriptomics, we discovered conservation of a core set of root hair genes across all vascular plants, which may derive from an ancient program for unidirectional cell growth coopted for root hair development during vascular plant evolution. Interestingly, we also discovered preferential diversification in the structure and expression of root hair development genes, relative to other root hair- and root-expressed genes, among these species. These differences enabled the definition of sets of genes and gene functions that were acquired or lost in specific lineages during vascular plant evolution. In particular, we found substantial divergence in the structure and expression of genes used for root hair patterning, suggesting that the Arabidopsis transcriptional regulatory mechanism is not shared by other species. To our knowledge, this study provides the first comprehensive view of gene expression in a single plant cell type across multiple species. © 2017 American Society of Plant Biologists. All Rights Reserved.
Diversification of Root Hair Development Genes in Vascular Plants
Shi, Xinhui; Wang, Wenjia; Ryu, Kook Hui
2017-01-01
The molecular genetic program for root hair development has been studied intensively in Arabidopsis (Arabidopsis thaliana). To understand the extent to which this program might operate in other plants, we conducted a large-scale comparative analysis of root hair development genes from diverse vascular plants, including eudicots, monocots, and a lycophyte. Combining phylogenetics and transcriptomics, we discovered conservation of a core set of root hair genes across all vascular plants, which may derive from an ancient program for unidirectional cell growth coopted for root hair development during vascular plant evolution. Interestingly, we also discovered preferential diversification in the structure and expression of root hair development genes, relative to other root hair- and root-expressed genes, among these species. These differences enabled the definition of sets of genes and gene functions that were acquired or lost in specific lineages during vascular plant evolution. In particular, we found substantial divergence in the structure and expression of genes used for root hair patterning, suggesting that the Arabidopsis transcriptional regulatory mechanism is not shared by other species. To our knowledge, this study provides the first comprehensive view of gene expression in a single plant cell type across multiple species. PMID:28487476
Kavak, Erşen; Ünlü, Mustafa; Nistér, Monica; Koman, Ahmet
2010-01-01
Cancer is among the major causes of human death and its mechanism(s) are not fully understood. We applied a novel meta-analysis approach to multiple sets of merged serial analysis of gene expression and microarray cancer data in order to analyze transcriptome alterations in human cancer. Our methodology, which we denote ‘COgnate Gene Expression patterNing in tumours’ (COGENT), unmasked numerous genes that were differentially expressed in multiple cancers. COGENT detected well-known tumor-associated (TA) genes such as TP53, EGFR and VEGF, as well as many multi-cancer, but not-yet-tumor-associated genes. In addition, we identified 81 co-regulated regions on the human genome (RIDGEs) by using expression data from all cancers. Some RIDGEs (28%) consist of paralog genes while another subset (30%) are specifically dysregulated in tumors but not in normal tissues. Furthermore, a significant number of RIDGEs are associated with GC-rich regions on the genome. All assembled data is freely available online (www.oncoreveal.org) as a tool implementing COGENT analysis of multi-cancer genes and RIDGEs. These findings engender a deeper understanding of cancer biology by demonstrating the existence of a pool of under-studied multi-cancer genes and by highlighting the cancer-specificity of some TA-RIDGEs. PMID:20621981
Genome-Wide Detection and Analysis of Multifunctional Genes
Pritykin, Yuri; Ghersi, Dario; Singh, Mona
2015-01-01
Many genes can play a role in multiple biological processes or molecular functions. Identifying multifunctional genes at the genome-wide level and studying their properties can shed light upon the complexity of molecular events that underpin cellular functioning, thereby leading to a better understanding of the functional landscape of the cell. However, to date, genome-wide analysis of multifunctional genes (and the proteins they encode) has been limited. Here we introduce a computational approach that uses known functional annotations to extract genes playing a role in at least two distinct biological processes. We leverage functional genomics data sets for three organisms—H. sapiens, D. melanogaster, and S. cerevisiae—and show that, as compared to other annotated genes, genes involved in multiple biological processes possess distinct physicochemical properties, are more broadly expressed, tend to be more central in protein interaction networks, tend to be more evolutionarily conserved, and are more likely to be essential. We also find that multifunctional genes are significantly more likely to be involved in human disorders. These same features also hold when multifunctionality is defined with respect to molecular functions instead of biological processes. Our analysis uncovers key features about multifunctional genes, and is a step towards a better genome-wide understanding of gene multifunctionality. PMID:26436655
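The core idea in this abstract, extracting genes annotated to at least two distinct biological processes, can be sketched as follows. The abstract does not say how "distinct" is decided, so the `related_pairs` argument, which lists term pairs treated as the same process, is an assumption of this sketch, as are the gene and term names.

```python
from itertools import combinations


def multifunctional_genes(annotations, related_pairs=()):
    """Return genes annotated to at least two terms that are not related.

    annotations maps a gene to its annotated terms; related_pairs lists
    term pairs regarded as non-distinct (a hypothetical input standing in
    for an ontology-based similarity check).
    """
    related = {frozenset(pair) for pair in related_pairs}
    selected = set()
    for gene, terms in annotations.items():
        for a, b in combinations(sorted(set(terms)), 2):
            if frozenset((a, b)) not in related:
                selected.add(gene)
                break
    return selected


# Illustrative annotations only; not taken from any database release.
annotations = {
    "gene1": ["DNA repair", "cell adhesion"],
    "gene2": ["DNA repair", "DNA recombination"],
    "gene3": ["translation"],
}
related_pairs = [("DNA repair", "DNA recombination")]
result = multifunctional_genes(annotations, related_pairs)
```

With these inputs only gene1 qualifies, because the two terms of gene2 are declared related and gene3 has a single annotation.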
NASA Astrophysics Data System (ADS)
Sundaresan, A.; Pellis, N. R.
2005-08-01
It is important to identify and further study the genetic response suites of human lymphocytes to microgravity in order to augment human physiological adaptation to novel environments. Emerging technologies, such as DNA microarray profiling, have the potential to identify novel genes that are involved in mediating adaptation to these environments. These genes may prove to be therapeutically valuable as new targets for countermeasures, or as predictive biomarkers of response to these new environments. Human lymphocytes cultured in 1g and microgravity analog culture were analyzed for their differential gene expression response. Different groups of genes related to the immune response, cardiovascular system and stress response were then analyzed. Analysis of cells from multiple donors reveals a small shared set of genes that are likely to be essential to adaptation. These three groups focus on human adaptation to new environments. The shared set contains genes related to T cell activation, immune response and stress response to analog microgravity.
Markov State Models of gene regulatory networks.
Chu, Brian K; Tse, Margaret J; Sato, Royce R; Read, Elizabeth L
2017-02-06
Gene regulatory networks with dynamics characterized by multiple stable states underlie cell fate-decisions. Quantitative models that can link molecular-level knowledge of gene regulation to a global understanding of network dynamics have the potential to guide cell-reprogramming strategies. Networks are often modeled by the stochastic Chemical Master Equation, but methods for systematic identification of key properties of the global dynamics are currently lacking. The method identifies the number, phenotypes, and lifetimes of long-lived states for a set of common gene regulatory network models. Application of transition path theory to the constructed Markov State Model decomposes global dynamics into a set of dominant transition paths and associated relative probabilities for stochastic state-switching. In this proof-of-concept study, we found that the Markov State Model provides a general framework for analyzing and visualizing stochastic multistability and state-transitions in gene networks. Our results suggest that this framework, adopted from the field of atomistic Molecular Dynamics, can be a useful tool for quantitative Systems Biology at the network scale.
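To make the notion of state lifetimes in a Markov State Model concrete, here is a deliberately simplified sketch. It estimates a transition matrix by counting transitions in a discretized trajectory, which is the usual Molecular Dynamics route and an assumption here; the study itself builds its model from the Chemical Master Equation. The two-state trajectory is made up. For a state i with self-transition probability T_ii, the mean lifetime in steps is 1 / (1 - T_ii).

```python
def transition_matrix(trajectory, n_states):
    """Estimate a row-stochastic transition matrix from state counts."""
    counts = [[0] * n_states for _ in range(n_states)]
    for current, following in zip(trajectory, trajectory[1:]):
        counts[current][following] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Rows with no observed exits are left as zeros.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def mean_lifetimes(matrix):
    """Mean residence time of each state, in trajectory steps."""
    return [1.0 / (1.0 - row[i]) if row[i] < 1.0 else float("inf")
            for i, row in enumerate(matrix)]


# Toy trajectory over two coarse-grained states (0 and 1), e.g. two
# expression phenotypes; the values are illustrative only.
trajectory = [0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1]
T = transition_matrix(trajectory, n_states=2)
lifetimes = mean_lifetimes(T)
```

For this trajectory state 0 has a self-transition probability of 0.75 and therefore a mean lifetime of about 4 steps, while state 1 lasts about 3 steps.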
Paul, Topon Kumar; Iba, Hitoshi
2009-01-01
In order to get a better understanding of different types of cancers and to find possible biomarkers for diseases, many researchers have recently been analyzing gene expression data using various machine learning techniques. However, due to a very small number of training samples compared to the huge number of genes and class imbalance, most of these methods suffer from overfitting. In this paper, we present a majority voting genetic programming classifier (MVGPC) for the classification of microarray data. Instead of a single rule or a single set of rules, we evolve multiple rules with genetic programming (GP) and then apply those rules to test samples to determine their labels with a majority voting technique. By performing experiments on four different public cancer data sets, including multiclass data sets, we have found that the test accuracies of MVGPC are better than those of other methods, including AdaBoost with GP. Moreover, some of the more frequently occurring genes in the classification rules are known to be associated with the types of cancers being studied in this paper.
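The voting step described in this abstract can be sketched in a few lines. The rules below are hand-written stand-ins for rules that genetic programming would evolve; the gene names, thresholds and labels are hypothetical, and the evolution of the rules themselves is not shown.

```python
from collections import Counter


def majority_vote(rules, sample):
    """Apply every rule to the sample and return the most common label.

    Ties are resolved in favour of the label that was voted first.
    """
    votes = Counter(rule(sample) for rule in rules)
    return votes.most_common(1)[0][0]


# Hypothetical rules over expression values of made-up genes.
rules = [
    lambda s: "tumor" if s["geneA"] > 2.0 else "normal",
    lambda s: "tumor" if s["geneB"] - s["geneC"] > 0.5 else "normal",
    lambda s: "tumor" if s["geneC"] < 1.0 else "normal",
]

sample_1 = {"geneA": 3.1, "geneB": 1.2, "geneC": 1.4}
sample_2 = {"geneA": 3.1, "geneB": 2.5, "geneC": 0.4}
label_1 = majority_vote(rules, sample_1)
label_2 = majority_vote(rules, sample_2)
```

For the first sample one rule votes "tumor" and two vote "normal", so the ensemble outputs "normal" even though a single rule disagrees; using an odd number of rules avoids ties in the two-class case.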
Lv, Yufeng; Wei, Wenhao; Huang, Zhong; Chen, Zhichao; Fang, Yuan; Pan, Lili; Han, Xueqiong; Xu, Zihai
2018-06-20
The aim of this study was to develop a novel long non-coding RNA (lncRNA) expression signature to accurately predict early recurrence for patients with hepatocellular carcinoma (HCC) after curative resection. Using expression profiles downloaded from The Cancer Genome Atlas database, we identified multiple lncRNAs with differential expression between the early recurrence (ER) group and the non-early recurrence (non-ER) group of HCC. Least absolute shrinkage and selection operator (LASSO) logistic regression models were used to develop a lncRNA-based classifier for predicting ER in the training set. An independent test set was used to validate the predictive value of this classifier. Furthermore, a co-expression network based on these lncRNAs and their highly related genes was constructed, and Gene Ontology and Kyoto Encyclopedia of Genes and Genomes pathway enrichment analyses of genes in the network were performed. We identified 10 differentially expressed lncRNAs, including 3 that were upregulated and 7 that were downregulated in the ER group. The lncRNA-based classifier was constructed based on 7 lncRNAs (AL035661.1, PART1, AC011632.1, AC109588.1, AL365361.1, LINC00861 and LINC02084), and its accuracy was 0.83 in the training set, 0.87 in the test set and 0.84 in the total set. ROC curve analysis showed that the AUROC was 0.741 in the training set, 0.824 in the test set and 0.765 in the total set. A functional enrichment analysis suggested that the genes highly related to 4 of the lncRNAs were involved in the immune system. This 7-lncRNA expression profile can effectively predict early recurrence after surgical resection for HCC. This article is protected by copyright. All rights reserved.
Shah, Binal; Jain, Kunal; Patel, Namrata; Pandit, Ramesh; Patel, Anand; Joshi, Chaitanya G; Madamwar, Datta
2015-06-11
Paenibacillus sp. strain DMB20, in cometabolism with other Proteobacteria and Firmicutes, exhibits azoreduction of textile dyes. Here, we report the draft genome sequence of this bacterium, consisting of 6,647,181 bp with 7,668 coding sequences (CDSs). The data presented highlight multiple sets of functional genes associated with xenobiotic compound degradation. Copyright © 2015 Shah et al.
Deng, Yi-Mo; Spirason, Natalie; Iannello, Pina; Jelley, Lauren; Lau, Hilda; Barr, Ian G
2015-07-01
Full genome sequencing of influenza A viruses (IAV), including those that arise from annual influenza epidemics, is undertaken to determine if reassorting has occurred or if other pathogenic traits are present. Traditionally IAV sequencing has been biased toward the major surface glycoproteins haemagglutinin and neuraminidase, while the internal genes are often ignored. Despite the development of next generation sequencing (NGS), many laboratories are still reliant on conventional Sanger sequencing to sequence IAV. The objective was to develop a minimal and robust set of primers for Sanger sequencing of the full genome of IAV currently circulating in humans. A set of 13 primer pairs was designed that enabled amplification of the six internal genes of multiple human IAV subtypes including the recent avian influenza A(H7N9) virus from China. Specific primers were designed to amplify the HA and NA genes of each IAV subtype of interest. Each of the primers also incorporated a binding site at its 5'-end for either a forward or reverse M13 primer, such that only two M13 primers were required for all subsequent sequencing reactions. This minimal set of primers was suitable for sequencing the six internal genes of all currently circulating human seasonal influenza A subtypes as well as the avian A(H7N9) viruses that have infected humans in China. This streamlined Sanger sequencing protocol could be used to generate full genome sequence data more rapidly and easily than existing influenza genome sequencing protocols. Copyright © 2015 The Authors. Published by Elsevier B.V. All rights reserved.
Genomic approaches for the elucidation of genes and gene networks underlying cardiovascular traits.
Adriaens, M E; Bezzina, C R
2018-06-22
Genome-wide association studies have shed light on the association between natural genetic variation and cardiovascular traits. However, linking a cardiovascular trait associated locus to a candidate gene or set of candidate genes for prioritization for follow-up mechanistic studies is anything but straightforward. Genomic technologies based on next-generation sequencing now offer multiple opportunities to dissect gene regulatory networks underlying genetic cardiovascular trait associations, thereby aiding in the identification of candidate genes at unprecedented scale. RNA sequencing in particular becomes a powerful tool when combined with genotyping to identify loci that modulate transcript abundance, known as expression quantitative trait loci (eQTL), or loci modulating transcript splicing, known as splicing quantitative trait loci (sQTL). Additionally, the allele-specific resolution of RNA-sequencing technology enables estimation of allelic imbalance, a state where the two alleles of a gene are expressed at a ratio differing from the expected 1:1 ratio. When multiple high-throughput approaches are combined with deep phenotyping in a single study, a comprehensive elucidation of the relationship between genotype and phenotype comes into view, an approach known as systems genetics. In this review, we cover key applications of systems genetics in the broad cardiovascular field.
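The allelic imbalance described above can be illustrated with a small calculation. The following is a minimal sketch, assuming reference and alternative allele read counts are already available for one heterozygous site; the function names are hypothetical, and the exact binomial test is a generic choice for illustration, not a method named in the review. It ignores mapping bias and overdispersion, which a real analysis would need to consider.

```python
from math import comb


def allelic_ratio(ref_reads, alt_reads):
    """Fraction of reads carrying the reference allele at one site."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads cover this site")
    return ref_reads / total


def binomial_two_sided_p(ref_reads, alt_reads, expected=0.5):
    """Exact two-sided binomial p-value against the expected allelic ratio."""
    n = ref_reads + alt_reads
    pmf = [comb(n, k) * expected**k * (1 - expected) ** (n - k) for k in range(n + 1)]
    observed = pmf[ref_reads]
    # Sum the probability of every outcome at least as unlikely as the observed one.
    return min(1.0, sum(p for p in pmf if p <= observed * (1 + 1e-9)))


if __name__ == "__main__":
    # Illustrative counts only (not taken from any study).
    print(allelic_ratio(30, 10))
    print(binomial_two_sided_p(30, 10))
```

A ratio far from 0.5 together with a small p-value would flag the site as a candidate for allelic imbalance under these simplifying assumptions.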
Evaluating Reported Candidate Gene Associations with Polycystic Ovary Syndrome
Pau, Cindy; Saxena, Richa; Welt, Corrine Kolka
2013-01-01
Objective: To replicate variants in candidate genes associated with PCOS in a population of European PCOS and control subjects. Design: Case-control association analysis and meta-analysis. Setting: Major academic hospital. Patients: Women of European ancestry with PCOS (n=525) and controls (n=472), aged 18 to 45 years. Intervention: Variants previously associated with PCOS in candidate gene studies were genotyped (n=39). Metabolic, reproductive and anthropometric parameters were examined as a function of the candidate variants. All genetic association analyses were adjusted for age, BMI and ancestry and were reported after correction for multiple testing. Main Outcome Measure: Association of candidate gene variants with PCOS. Results: Three variants, rs3797179 (SRD5A1), rs12473543 (POMC), and rs1501299 (ADIPOQ), were nominally associated with PCOS. However, they did not remain significant after correction for multiple testing and none of the variants replicated in a sufficiently powered meta-analysis. Variants in the FBN3 gene (rs17202517 and rs73503752) were associated with smaller waist circumferences and variant rs727428 in the SHBG gene was associated with lower SHBG levels. Conclusion: Previously identified variants in candidate genes do not appear to be associated with PCOS risk. PMID:23375202
2013-01-01
Background As high-throughput genomic technologies become accurate and affordable, an increasing number of data sets have been accumulated in the public domain and genomic information integration and meta-analysis have become routine in biomedical research. In this paper, we focus on microarray meta-analysis, where multiple microarray studies with relevant biological hypotheses are combined in order to improve candidate marker detection. Many methods have been developed and applied in the literature, but their performance and properties have only been minimally investigated. There is currently no clear conclusion or guideline as to the proper choice of a meta-analysis method given an application; the decision essentially requires both statistical and biological considerations. Results We performed 12 microarray meta-analysis methods for combining multiple simulated expression profiles, and such methods can be categorized for different hypothesis setting purposes: (1) HS A : DE genes with non-zero effect sizes in all studies, (2) HS B : DE genes with non-zero effect sizes in one or more studies and (3) HS r : DE gene with non-zero effect in "majority" of studies. We then performed a comprehensive comparative analysis through six large-scale real applications using four quantitative statistical evaluation criteria: detection capability, biological association, stability and robustness. We elucidated hypothesis settings behind the methods and further apply multi-dimensional scaling (MDS) and an entropy measure to characterize the meta-analysis methods and data structure, respectively. Conclusions The aggregated results from the simulation study categorized the 12 methods into three hypothesis settings (HS A , HS B , and HS r ). Evaluation in real data and results from MDS and entropy analyses provided an insightful and practical guideline to the choice of the most suitable method in a given application. 
All source files for simulation and real data are available on the author’s publication website. PMID:24359104
Chang, Lun-Ching; Lin, Hui-Min; Sibille, Etienne; Tseng, George C
2013-12-21
As high-throughput genomic technologies become accurate and affordable, an increasing number of data sets have been accumulated in the public domain and genomic information integration and meta-analysis have become routine in biomedical research. In this paper, we focus on microarray meta-analysis, where multiple microarray studies with relevant biological hypotheses are combined in order to improve candidate marker detection. Many methods have been developed and applied in the literature, but their performance and properties have only been minimally investigated. There is currently no clear conclusion or guideline as to the proper choice of a meta-analysis method given an application; the decision essentially requires both statistical and biological considerations. We performed 12 microarray meta-analysis methods for combining multiple simulated expression profiles, and such methods can be categorized for different hypothesis setting purposes: (1) HS(A): DE genes with non-zero effect sizes in all studies, (2) HS(B): DE genes with non-zero effect sizes in one or more studies and (3) HS(r): DE gene with non-zero effect in "majority" of studies. We then performed a comprehensive comparative analysis through six large-scale real applications using four quantitative statistical evaluation criteria: detection capability, biological association, stability and robustness. We elucidated hypothesis settings behind the methods and further apply multi-dimensional scaling (MDS) and an entropy measure to characterize the meta-analysis methods and data structure, respectively. The aggregated results from the simulation study categorized the 12 methods into three hypothesis settings (HS(A), HS(B), and HS(r)). Evaluation in real data and results from MDS and entropy analyses provided an insightful and practical guideline to the choice of the most suitable method in a given application. All source files for simulation and real data are available on the author's publication website.
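The difference between the hypothesis settings above can be made concrete with two classical ways of combining per-study p-values for one gene. This is a minimal sketch, assuming independent studies and p-values strictly greater than zero. Fisher's method and the maximum p-value method are chosen here only as illustrations; the abstract does not list which 12 methods were compared, and the function names are hypothetical.

```python
from math import exp, log


def fisher_combined_p(p_values):
    """Fisher's method: X = -2 * sum(ln p) is chi-square with 2K df under the null."""
    k = len(p_values)
    half_stat = -sum(log(p) for p in p_values)  # equals X / 2
    # Closed-form chi-square survival function for an even number of degrees of freedom:
    # exp(-X/2) * sum_{i=0}^{K-1} (X/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return exp(-half_stat) * total


def max_p_combined_p(p_values):
    """Maximum p-value method: under the null, P(max p <= x) = x**K."""
    return max(p_values) ** len(p_values)


if __name__ == "__main__":
    # A gene that is strongly significant in one study only (illustrative values).
    one_study_signal = [0.01, 0.9]
    print(fisher_combined_p(one_study_signal))  # sensitive to a signal in any study
    print(max_p_combined_p(one_study_signal))   # requires a signal in every study
```

Fisher's statistic can be driven by a single small p-value, which matches the "one or more studies" setting, whereas the maximum p-value is small only when every study shows an effect, which matches the "all studies" setting.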
Integrative set enrichment testing for multiple omics platforms
2011-01-01
Background: Enrichment testing assesses the overall evidence of differential expression behavior of the elements within a defined set. When we have measured many molecular aspects, e.g. gene expression, metabolites, proteins, it is desirable to assess their differential tendencies jointly across platforms using an integrated set enrichment test. In this work we explore the properties of several methods for performing a combined enrichment test using gene expression and metabolomics as the motivating platforms. Results: Using two simulation models we explored the properties of several enrichment methods including two novel methods: the logistic regression 2-degree of freedom Wald test and the 2-dimensional permutation p-value for the sum-of-squared statistics test. In relation to their univariate counterparts we find that the joint tests can improve our ability to detect results that are marginal univariately. We also find that joint tests improve the ranking of associated pathways compared to their univariate counterparts. However, there is a risk of Type I error inflation with some methods and self-contained methods lose specificity when the sets are not representative of underlying association. Conclusions: In this work we show that consideration of data from multiple platforms, in conjunction with summarization via a priori pathway information, leads to increased power in detection of genomic associations with phenotypes. PMID:22118224
Development and applications of transgenesis in the yellow fever mosquito, Aedes aegypti.
Adelman, Zachary N; Jasinskiene, Nijole; James, Anthony A
2002-04-30
Transgenesis technology has been developed for the yellow fever mosquito, Aedes aegypti. Successful integration of exogenous DNA into the germline of this mosquito has been achieved with the class II transposable elements, Hermes, mariner and piggyBac. A number of marker genes, including the cinnabar(+) gene of Drosophila melanogaster, and fluorescent protein genes, can be used to monitor the insertion of these elements. The availability of multiple elements and marker genes provides a powerful set of tools to investigate basic biological properties of this vector insect, as well as the materials for developing novel, genetics-based, control strategies for the transmission of disease.
Jin, Yulan; Sharma, Ashok; Bai, Shan; Davis, Colleen; Liu, Haitao; Hopkins, Diane; Barriga, Kathy; Rewers, Marian; She, Jin-Xiong
2014-07-01
There is tremendous scientific and clinical value to further improving the predictive power of autoantibodies because autoantibody-positive (AbP) children have heterogeneous rates of progression to clinical diabetes. This study explored the potential of gene expression profiles as biomarkers for risk stratification among 104 AbP subjects from the Diabetes Autoimmunity Study in the Young (DAISY) using a discovery data set based on microarray and a validation data set based on real-time RT-PCR. The microarray data identified 454 candidate genes with expression levels associated with various type 1 diabetes (T1D) progression rates. RT-PCR analyses of the top-27 candidate genes confirmed 5 genes (BACH2, IGLL3, EIF3A, CDC20, and TXNDC5) associated with differential progression and implicated in lymphocyte activation and function. Multivariate analyses of these five genes in the discovery and validation data sets identified and confirmed four multigene models (BI, ICE, BICE, and BITE, with each letter representing a gene) that consistently stratify high- and low-risk subsets of AbP subjects with hazard ratios >6 (P < 0.01). The results suggest that these genes may be involved in T1D pathogenesis and potentially serve as excellent gene expression biomarkers to predict the risk of progression to clinical diabetes for AbP subjects. © 2014 by the American Diabetes Association.
Node-Based Learning of Multiple Gaussian Graphical Models
Mohan, Karthik; London, Palma; Fazel, Maryam; Witten, Daniela; Lee, Su-In
2014-01-01
We consider the problem of estimating high-dimensional Gaussian graphical models corresponding to a single set of variables under several distinct conditions. This problem is motivated by the task of recovering transcriptional regulatory networks on the basis of gene expression data containing heterogeneous samples, such as different disease states, multiple species, or different developmental stages. We assume that most aspects of the conditional dependence networks are shared, but that there are some structured differences between them. Rather than assuming that similarities and differences between networks are driven by individual edges, we take a node-based approach, which in many cases provides a more intuitive interpretation of the network differences. We consider estimation under two distinct assumptions: (1) differences between the K networks are due to individual nodes that are perturbed across conditions, or (2) similarities among the K networks are due to the presence of common hub nodes that are shared across all K networks. Using a row-column overlap norm penalty function, we formulate two convex optimization problems that correspond to these two assumptions. We solve these problems using an alternating direction method of multipliers algorithm, and we derive a set of necessary and sufficient conditions that allows us to decompose the problem into independent subproblems so that our algorithm can be scaled to high-dimensional settings. Our proposal is illustrated on synthetic data, a webpage data set, and a brain cancer gene expression data set. PMID:25309137
A Bayesian Supertree Model for Genome-Wide Species Tree Reconstruction
De Oliveira Martins, Leonardo; Mallo, Diego; Posada, David
2016-01-01
Current phylogenomic data sets highlight the need for species tree methods able to deal with several sources of gene tree/species tree incongruence. At the same time, we need to make most use of all available data. Most species tree methods deal with single processes of phylogenetic discordance, namely, gene duplication and loss, incomplete lineage sorting (ILS) or horizontal gene transfer. In this manuscript, we address the problem of species tree inference from multilocus, genome-wide data sets regardless of the presence of gene duplication and loss and ILS therefore without the need to identify orthologs or to use a single individual per species. We do this by extending the idea of Maximum Likelihood (ML) supertrees to a hierarchical Bayesian model where several sources of gene tree/species tree disagreement can be accounted for in a modular manner. We implemented this model in a computer program called guenomu whose inputs are posterior distributions of unrooted gene tree topologies for multiple gene families, and whose output is the posterior distribution of rooted species tree topologies. We conducted extensive simulations to evaluate the performance of our approach in comparison with other species tree approaches able to deal with more than one leaf from the same species. Our method ranked best under simulated data sets, in spite of ignoring branch lengths, and performed well on empirical data, as well as being fast enough to analyze relatively large data sets. Our Bayesian supertree method was also very successful in obtaining better estimates of gene trees, by reducing the uncertainty in their distributions. In addition, our results show that under complex simulation scenarios, gene tree parsimony is also a competitive approach once we consider its speed, in contrast to more sophisticated models. PMID:25281847
Molecular phylogeny and evolutionary timescale for the family of mammalian herpesviruses.
McGeoch, D J; Cook, S; Dolan, A; Jamieson, F E; Telford, E A
1995-03-31
A detailed phylogenetic analysis for mammalian members of the family Herpesviridae, based on molecular sequences is reported. Sets of encoded amino acid sequences were collected for eight well conserved genes that are common to mammalian herpesviruses. Phylogenetic trees were inferred from alignments of these sequence sets using both maximum parsimony and distance methods, and evaluated by bootstrap analysis. In all cases the three recognised subfamilies (Alpha-, Beta- and Gammaherpesvirinae), and major sublineages in each subfamily, were clearly distinguished, but within sublineages some finer details of branching were incompletely resolved. Multiple-gene sets were assembled to give a broadly based tree. The root position of the tree was estimated by assuming a constant molecular clock and also by analysis of one herpesviral gene set (that encoding uracil-DNA glycosylase) using cellular homologues as outgroups. Both procedures placed the root between the Alphaherpesvirinae and the other two subfamilies. Substitution rates were calculated for the combined gene sets based on a previous estimate for alphaherpesviral UL27 genes, where the time base had been obtained according to the hypothesis of cospeciation of virus and host lineages. Assuming a constant molecular clock, it was then estimated that the three subfamilies arose approximately 180 to 220 million years ago, that major sublineages within subfamilies were probably generated before the mammalian radiation of 80 to 60 million years ago, and that speciations within sublineages took place in the last 80 million years, probably with a major component of cospeciation with host lineages.
GENETICALLY MODIFIED FOODS: TECHNOLOGICAL BREAKTHROUGH OR ECOLOGICAL NIGHTMARE?
Fifty years ago, Watson and Crick described the structure of DNA, setting the stage for the past decade's biotechnology revolution. Scientists have now broken the code of the entire human genome, and delineated the function of multiple genes; similar strides are being taken with...
Liu, Gang; Mukherjee, Bhramar; Lee, Seunggeun; Lee, Alice W; Wu, Anna H; Bandera, Elisa V; Jensen, Allan; Rossing, Mary Anne; Moysich, Kirsten B; Chang-Claude, Jenny; Doherty, Jennifer A; Gentry-Maharaj, Aleksandra; Kiemeney, Lambertus; Gayther, Simon A; Modugno, Francesmary; Massuger, Leon; Goode, Ellen L; Fridley, Brooke L; Terry, Kathryn L; Cramer, Daniel W; Ramus, Susan J; Anton-Culver, Hoda; Ziogas, Argyrios; Tyrer, Jonathan P; Schildkraut, Joellen M; Kjaer, Susanne K; Webb, Penelope M; Ness, Roberta B; Menon, Usha; Berchuck, Andrew; Pharoah, Paul D; Risch, Harvey; Pearce, Celeste Leigh
2018-02-01
There have been recent proposals advocating the use of additive gene-environment interaction instead of the widely used multiplicative scale, as a more relevant public health measure. Using gene-environment independence enhances statistical power for testing multiplicative interaction in case-control studies. However, under departure from this assumption, substantial bias in the estimates and inflated type I error in the corresponding tests can occur. In this paper, we extend the empirical Bayes (EB) approach previously developed for multiplicative interaction, which trades off between bias and efficiency in a data-adaptive way, to the additive scale. An EB estimator of the relative excess risk due to interaction is derived, and the corresponding Wald test is proposed with a general regression setting under a retrospective likelihood framework. We study the impact of gene-environment association on the resultant test with case-control data. Our simulation studies suggest that the EB approach uses the gene-environment independence assumption in a data-adaptive way and provides a gain in power compared with the standard logistic regression analysis and better control of type I error when compared with the analysis assuming gene-environment independence. We illustrate the methods with data from the Ovarian Cancer Association Consortium. © The Author(s) 2017. Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of Public Health. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
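The additive and multiplicative scales contrasted above can be compared with a short calculation. This is a minimal sketch of the standard definition of the relative excess risk due to interaction (RERI) from three relative risks, each measured against the group with neither the genetic nor the environmental exposure. It is not the empirical Bayes estimator proposed in the paper; the function and argument names are hypothetical, and in case-control data odds ratios are commonly used as approximations to the relative risks.

```python
def reri(rr_gene_only, rr_env_only, rr_both):
    """Relative excess risk due to interaction: RR11 - RR10 - RR01 + 1."""
    return rr_both - rr_gene_only - rr_env_only + 1.0


def multiplicative_interaction(rr_gene_only, rr_env_only, rr_both):
    """Ratio RR11 / (RR10 * RR01); a value of 1 means no multiplicative interaction."""
    return rr_both / (rr_gene_only * rr_env_only)


if __name__ == "__main__":
    # Illustrative relative risks only (not taken from any study).
    print(reri(2.0, 3.0, 6.0))
    print(multiplicative_interaction(2.0, 3.0, 6.0))
```

With these illustrative values the joint effect is exactly the product of the separate effects, so there is no interaction on the multiplicative scale, yet RERI is positive, which is why the choice of scale matters.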
Ohno, Satoshi; Yoshikawa, Katsunori; Shimizu, Hiroshi; Tamura, Tomohiro
2014-01-01
We describe here the construction of a series of 71 vectors to silence central carbon metabolism genes in Escherichia coli. The vectors inducibly express antisense RNAs called paired-terminus antisense RNAs, which have a higher silencing efficacy than ordinary antisense RNAs. By measuring mRNA amounts, measuring activities of target proteins, or observing specific phenotypes, it was confirmed that all the vectors were able to silence the expression of target genes efficiently. Using this vector set, each of the central carbon metabolism genes was silenced individually, and the accumulation of metabolites was investigated. We were able to obtain accurate information on ways to increase the production of pyruvate, an industrially valuable compound, from the silencing results. Furthermore, the experimental results of pyruvate accumulation were compared to in silico predictions, and both sets of results were consistent. Compared to the gene disruption approach, the silencing approach has an advantage in that any E. coli strain can be used and multiple gene silencing is easily possible in any combination. PMID:24212579
Mitochondrial genome deletions and minicircles are common in lice (Insecta: Phthiraptera)
2011-01-01
Background The gene composition, gene order and structure of the mitochondrial genome are remarkably stable across bilaterian animals. Lice (Insecta: Phthiraptera) are a major exception to this genomic stability in that the canonical single chromosome with 37 genes found in almost all other bilaterians has been lost in multiple lineages in favour of multiple, minicircular chromosomes with less than 37 genes on each chromosome. Results Minicircular mt genomes are found in six of the ten louse species examined to date and three types of minicircles were identified: heteroplasmic minicircles which coexist with full sized mt genomes (type 1); multigene chromosomes with short, simple control regions, we infer that the genome consists of several such chromosomes (type 2); and multiple, single to three gene chromosomes with large, complex control regions (type 3). Mapping minicircle types onto a phylogenetic tree of lice fails to show a pattern of their occurrence consistent with an evolutionary series of minicircle types. Analysis of the nuclear-encoded, mitochondrially-targetted genes inferred from the body louse, Pediculus, suggests that the loss of mitochondrial single-stranded binding protein (mtSSB) may be responsible for the presence of minicircles in at least species with the most derived type 3 minicircles (Pediculus, Damalinia). Conclusions Minicircular mt genomes are common in lice and appear to have arisen multiple times within the group. Life history adaptive explanations which attribute minicircular mt genomes in lice to the adoption of blood-feeding in the Anoplura are not supported by this expanded data set as minicircles are found in multiple non-blood feeding louse groups but are not found in the blood-feeding genus Heterodoxus. 
In contrast, a mechanistic explanation based on the loss of mtSSB suggests that minicircles may be selectively favoured due to the incapacity of the mt replisome to synthesize long replicative products without mtSSB and thus the loss of this gene led to the formation of minicircles in lice. PMID:21813020
Mitochondrial genome deletions and minicircles are common in lice (Insecta: Phthiraptera).
Cameron, Stephen L; Yoshizawa, Kazunori; Mizukoshi, Atsushi; Whiting, Michael F; Johnson, Kevin P
2011-08-04
The gene composition, gene order and structure of the mitochondrial genome are remarkably stable across bilaterian animals. Lice (Insecta: Phthiraptera) are a major exception to this genomic stability in that the canonical single chromosome with 37 genes found in almost all other bilaterians has been lost in multiple lineages in favour of multiple, minicircular chromosomes with less than 37 genes on each chromosome. Minicircular mt genomes are found in six of the ten louse species examined to date and three types of minicircles were identified: heteroplasmic minicircles which coexist with full sized mt genomes (type 1); multigene chromosomes with short, simple control regions, we infer that the genome consists of several such chromosomes (type 2); and multiple, single to three gene chromosomes with large, complex control regions (type 3). Mapping minicircle types onto a phylogenetic tree of lice fails to show a pattern of their occurrence consistent with an evolutionary series of minicircle types. Analysis of the nuclear-encoded, mitochondrially-targetted genes inferred from the body louse, Pediculus, suggests that the loss of mitochondrial single-stranded binding protein (mtSSB) may be responsible for the presence of minicircles in at least species with the most derived type 3 minicircles (Pediculus, Damalinia). Minicircular mt genomes are common in lice and appear to have arisen multiple times within the group. Life history adaptive explanations which attribute minicircular mt genomes in lice to the adoption of blood-feeding in the Anoplura are not supported by this expanded data set as minicircles are found in multiple non-blood feeding louse groups but are not found in the blood-feeding genus Heterodoxus. 
In contrast, a mechanistic explanation based on the loss of mtSSB suggests that minicircles may be selectively favoured due to the incapacity of the mt replisome to synthesize long replicative products without mtSSB and thus the loss of this gene led to the formation of minicircles in lice.
Azuaje, Francisco; Zheng, Huiru; Camargo, Anyela; Wang, Haiying
2011-08-01
The discovery of novel disease biomarkers is a crucial challenge for translational bioinformatics. Demonstration of both their classification power and reproducibility across independent datasets are essential requirements to assess their potential clinical relevance. Small datasets and multiplicity of putative biomarker sets may explain lack of predictive reproducibility. Studies based on pathway-driven discovery approaches have suggested that, despite such discrepancies, the resulting putative biomarkers tend to be implicated in common biological processes. Investigations of this problem have been mainly focused on datasets derived from cancer research. We investigated the predictive and functional concordance of five methods for discovering putative biomarkers in four independently-generated datasets from the cardiovascular disease domain. A diversity of biosignatures was identified by the different methods. However, we found strong biological process concordance between them, especially in the case of methods based on gene set analysis. With a few exceptions, we observed lack of classification reproducibility using independent datasets. Partial overlaps between our putative sets of biomarkers and the primary studies exist. Despite the observed limitations, pathway-driven or gene set analysis can predict potentially novel biomarkers and can jointly point to biomedically-relevant underlying molecular mechanisms. Copyright © 2011 Elsevier Inc. All rights reserved.
Prediction of regulatory gene pairs using dynamic time warping and gene ontology.
Yang, Andy C; Hsu, Hui-Huang; Lu, Ming-Da; Tseng, Vincent S; Shih, Timothy K
2014-01-01
Selecting informative genes is the most important task for data analysis on microarray gene expression data. In this work, we aim at identifying regulatory gene pairs from microarray gene expression data. However, microarray data often contain multiple missing expression values. Missing value imputation is thus needed before further processing for regulatory gene pairs becomes possible. We develop a novel approach to first impute missing values in microarray time series data by combining k-Nearest Neighbour (KNN), Dynamic Time Warping (DTW) and Gene Ontology (GO). After missing values are imputed, we then perform gene regulation prediction based on our proposed DTW-GO distance measurement of gene pairs. Experimental results show that our approach is more accurate when compared with existing missing value imputation methods on real microarray data sets. Furthermore, our approach can also discover more regulatory gene pairs that are known in the literature than other methods.
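The Dynamic Time Warping component mentioned above can be sketched with the classic dynamic-programming recurrence. This is a minimal illustration, assuming two complete numeric expression time series and an absolute-difference local cost; it is plain DTW only, not the DTW-GO distance or the KNN-based imputation proposed in the paper, and the function name is hypothetical.

```python
def dtw_distance(series_a, series_b):
    """Classic DTW: minimum cumulative cost of aligning two time series."""
    n, m = len(series_a), len(series_b)
    inf = float("inf")
    # cost[i][j] holds the best alignment cost of the first i and first j points.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(series_a[i - 1] - series_b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


if __name__ == "__main__":
    # Two profiles with the same shape but a repeated time point (illustrative values).
    print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
    print(dtw_distance([0, 0], [1, 1]))
```

Because the alignment may stretch or compress the time axis, profiles with the same shape but a delayed or repeated time point receive a small distance, which is the property that makes DTW attractive for time series expression data.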
Challacombe, Jean Faust; Petersen, Jeannine M.; Gallegos-Graves, La Verne A.; ...
2016-11-23
Francisella tularensis is a highly virulent zoonotic pathogen that causes tularemia and, because of weaponization efforts in past world wars, is considered a tier 1 biothreat agent. Detection and surveillance of F. tularensis may be confounded by the presence of uncharacterized, closely related organisms. Through DNA-based diagnostics and environmental surveys, novel clinical and environmental Francisella isolates have been obtained in recent years. Here we present 7 new Francisella genomes and a comparison of their characteristics to each other and to 24 publicly available genomes as well as a comparative analysis of 16S rRNA and sdhA genes from over 90 Francisella strains. Delineation of new species in bacteria is challenging, especially when isolates having very close genomic characteristics exhibit different physiological features, for example when some are virulent pathogens in humans and animals while others are nonpathogenic or are opportunistic pathogens. Species resolution within Francisella varies with analyses of single genes, multiple gene or protein sets, or whole-genome comparisons of nucleic acid and amino acid sequences. Analyses focusing on single genes (16S rRNA, sdhA), multiple gene sets (virulence genes, lipopolysaccharide [LPS] biosynthesis genes, pathogenicity island), and whole-genome comparisons (nucleotide and protein) gave congruent results, but with different levels of discrimination confidence. We designate four new species within the genus; Francisella opportunistica sp. nov. (MA06-7296), Francisella salina sp. nov. (TX07-7308), Francisella uliginis sp. nov. (TX07-7310), and Francisella frigiditurris sp. nov. (CA97-1460). Lastly, this study provides a robust comparative framework to discern species and virulence features of newly detected Francisella bacteria.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Challacombe, Jean Faust; Petersen, Jeannine M.; Gallegos-Graves, La Verne A.
Francisella tularensis is a highly virulent zoonotic pathogen that causes tularemia and, because of weaponization efforts in past world wars, is considered a tier 1 biothreat agent. Detection and surveillance of F. tularensis may be confounded by the presence of uncharacterized, closely related organisms. Through DNA-based diagnostics and environmental surveys, novel clinical and environmental Francisella isolates have been obtained in recent years. Here we present 7 new Francisella genomes and a comparison of their characteristics to each other and to 24 publicly available genomes as well as a comparative analysis of 16S rRNA and sdhA genes from over 90 Francisella strains. Delineation of new species in bacteria is challenging, especially when isolates having very close genomic characteristics exhibit different physiological features, for example when some are virulent pathogens in humans and animals while others are nonpathogenic or are opportunistic pathogens. Species resolution within Francisella varies with analyses of single genes, multiple gene or protein sets, or whole-genome comparisons of nucleic acid and amino acid sequences. Analyses focusing on single genes (16S rRNA, sdhA), multiple gene sets (virulence genes, lipopolysaccharide [LPS] biosynthesis genes, pathogenicity island), and whole-genome comparisons (nucleotide and protein) gave congruent results, but with different levels of discrimination confidence. We designate four new species within the genus; Francisella opportunistica sp. nov. (MA06-7296), Francisella salina sp. nov. (TX07-7308), Francisella uliginis sp. nov. (TX07-7310), and Francisella frigiditurris sp. nov. (CA97-1460). Lastly, this study provides a robust comparative framework to discern species and virulence features of newly detected Francisella bacteria.
Carbonetto, Peter; Stephens, Matthew
2013-01-01
Pathway analyses of genome-wide association studies aggregate information over sets of related genes, such as genes in common pathways, to identify gene sets that are enriched for variants associated with disease. We develop a model-based approach to pathway analysis, and apply this approach to data from the Wellcome Trust Case Control Consortium (WTCCC) studies. Our method offers several benefits over existing approaches. First, our method not only interrogates pathways for enrichment of disease associations, but also estimates the level of enrichment, which yields a coherent way to promote variants in enriched pathways, enhancing discovery of genes underlying disease. Second, our approach allows for multiple enriched pathways, a feature that leads to novel findings in two diseases where the major histocompatibility complex (MHC) is a major determinant of disease susceptibility. Third, by modeling disease as the combined effect of multiple markers, our method automatically accounts for linkage disequilibrium among variants. Interrogation of pathways from eight pathway databases yields strong support for enriched pathways, indicating links between Crohn's disease (CD) and cytokine-driven networks that modulate immune responses; between rheumatoid arthritis (RA) and “Measles” pathway genes involved in immune responses triggered by measles infection; and between type 1 diabetes (T1D) and IL2-mediated signaling genes. Prioritizing variants in these enriched pathways yields many additional putative disease associations compared to analyses without enrichment. For CD and RA, 7 of 8 additional non-MHC associations are corroborated by other studies, providing validation for our approach. For T1D, prioritization of IL-2 signaling genes yields strong evidence for 7 additional non-MHC candidate disease loci, as well as suggestive evidence for several more. 
Of the 7 strongest associations, 4 are validated by other studies, and 3 (near IL-2 signaling genes RAF1, MAPK14, and FYN) constitute novel putative T1D loci for further study. PMID:24098138
Integrating alternative splicing detection into gene prediction.
Foissac, Sylvain; Schiex, Thomas
2005-02-10
Alternative splicing (AS) is now considered a major contributor to transcriptome/proteome diversity, and it cannot be neglected in the annotation process of a new genome. Despite considerable progress in the accuracy of computational gene prediction, the ability to reliably predict AS variants when there is local experimental evidence of them remains an open challenge for gene finders. We have used a new integrative approach that incorporates AS detection into ab initio gene prediction. This method relies on the analysis of genomically aligned transcript sequences (ESTs and/or cDNAs), and has been implemented in the dynamic programming algorithm of the graph-based gene finder EuGENE. Given a genomic sequence and a set of aligned transcripts, this new version identifies the set of transcripts carrying evidence of alternative splicing events, and provides, in addition to the classical optimal gene prediction, alternative optimal predictions (among those which are consistent with the AS events detected). This allows for multiple annotations of a single gene such that each predicted variant is supported by transcript evidence (but not necessarily with full-length coverage). This automatic combination of experimental data analysis and ab initio gene finding integrates alternatively spliced gene prediction into a single annotation pipeline.
Gregoretti, Francesco; Belcastro, Vincenzo; di Bernardo, Diego; Oliva, Gennaro
2010-04-21
The reverse engineering of gene regulatory networks using gene expression profile data has become crucial to gain novel biological knowledge. Large amounts of data that need to be analyzed are currently being produced due to advances in microarray technologies. Using current reverse engineering algorithms to analyze large data sets can be very computationally intensive. These emerging computational requirements can be met using parallel computing techniques. It has been shown that the Network Identification by multiple Regression (NIR) algorithm performs better than the other ready-to-use reverse engineering software. However, it cannot be used with large networks with thousands of nodes--as is the case in biological networks--due to its high time and space complexity. In this work we overcome this limitation by designing and developing a parallel version of the NIR algorithm. The new implementation of the algorithm reaches a very good accuracy even for large gene networks, improving our understanding of the gene regulatory networks that are crucial for a wide range of biomedical applications.
Genetic variations in the serotonergic system contribute to amygdala volume in humans.
Li, Jin; Chen, Chunhui; Wu, Karen; Zhang, Mingxia; Zhu, Bi; Chen, Chuansheng; Moyzis, Robert K; Dong, Qi
2015-01-01
The amygdala plays a critical role in emotion processing and psychiatric disorders associated with emotion dysfunction. Accumulating evidence suggests that amygdala structure is modulated by serotonin-related genes. However, there is a gap between the small contributions of single loci (less than 1%) and the reported 63-65% heritability of amygdala structure. To understand the "missing heritability," we systematically explored the contribution of serotonin genes to amygdala structure at the gene set level. The present study of 417 healthy Chinese volunteers examined 129 representative polymorphisms in genes from multiple biological mechanisms in the regulation of serotonin neurotransmission. A system-level approach using multiple regression analyses identified nine SNPs that collectively accounted for approximately 8% of the variance in amygdala volume. Permutation analyses showed that the probability of obtaining these findings by chance was low (p = 0.043, 1000 permutations). Findings showed that serotonin genes contribute moderately to individual differences in amygdala volume in a healthy Chinese sample. These results indicate that the system-level approach can help us to understand the genetic basis of a complex trait such as amygdala structure.
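The permutation analysis mentioned in the abstract above can be illustrated with a minimal sketch. It is not the authors' procedure: for brevity it uses a single predictor and the squared Pearson correlation as the test statistic, whereas the study used multiple regression over nine SNPs. The function names, the add-one p-value estimate, and the toy data are all assumptions made for illustration.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("both variables must vary")
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def permutation_p_value(x, y, n_perm=1000, seed=0):
    """Estimate how often shuffled outcomes explain as much variance as observed.

    The outcome vector is shuffled n_perm times, which breaks any real
    link with the predictor; the p-value is the add-one estimate
    (hits + 1) / (n_perm + 1).
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    observed = r_squared(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # small tolerance guards against floating-point ties
        if r_squared(x, shuffled) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical toy data: allele counts and a made-up volume measure.
allele_counts = [0, 0, 1, 1, 1, 2, 2, 2, 0, 1]
volumes = [1.51, 1.48, 1.60, 1.57, 1.62, 1.70, 1.66, 1.73, 1.50, 1.59]
p_value = permutation_p_value(allele_counts, volumes)
```

With 1000 permutations the smallest attainable estimate is 1/1001, so the resolution of the p-value is set by the number of permutations.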
Dynamic regulation of genetic pathways and targets during aging in Caenorhabditis elegans.
He, Kan; Zhou, Tao; Shao, Jiaofang; Ren, Xiaoliang; Zhao, Zhongying; Liu, Dahai
2014-03-01
Numerous genetic targets and some individual pathways associated with aging have been identified using the worm model. However, less is known about the genome-wide genetic mechanisms of aging, particularly at the level of multiple pathways and of the regulatory networks active during aging. Here, we employed gene expression datasets from three time points during aging in Caenorhabditis elegans (C. elegans) and performed gene set enrichment analysis (GSEA) on each dataset between adjacent stages. As a result, multiple genetic pathways and targets were identified as significantly down- or up-regulated. Among them, 5 truly aging-dependent signaling pathways, including the MAPK, mTOR, Wnt, TGF-beta and ErbB signaling pathways, as well as 12 significantly associated genes were identified with dynamic expression patterns during aging. On the other hand, the continued declines in the regulation of several metabolic pathways have been demonstrated to display age-related changes. Furthermore, the reconstructed regulatory networks based on three aging-related chromatin immunoprecipitation followed by sequencing (ChIP-seq) datasets and the expression matrices of 154 genes involved in the above signaling pathways provide new insights into aging at the multiple-pathway level. The combination of multiple genetic pathways and targets needs to be taken into consideration in future studies of aging, in which the dynamic regulation would be uncovered.
Informatic selection of a neural crest-melanocyte cDNA set for microarray analysis
Loftus, S. K.; Chen, Y.; Gooden, G.; Ryan, J. F.; Birznieks, G.; Hilliard, M.; Baxevanis, A. D.; Bittner, M.; Meltzer, P.; Trent, J.; Pavan, W.
1999-01-01
With cDNA microarrays, it is now possible to compare the expression of many genes simultaneously. To maximize the likelihood of finding genes whose expression is altered under the experimental conditions, it would be advantageous to be able to select clones for tissue-appropriate cDNA sets. We have taken advantage of the extensive sequence information in the dbEST expressed sequence tag (EST) database to identify a neural crest-derived melanocyte cDNA set for microarray analysis. Analysis of characterized genes with dbEST identified one library that contained ESTs representing 21 neural crest-expressed genes (library 198). The distribution of the ESTs corresponding to these genes was biased toward being derived from library 198. This is in contrast to the EST distribution profile for a set of control genes, characterized to be more ubiquitously expressed in multiple tissues (P < 1 × 10^-9). From library 198, a subset of 852 clustered ESTs were selected that have a library distribution profile similar to that of the 21 neural crest-expressed genes. Microarray analysis demonstrated the majority of the neural crest-selected 852 ESTs (Mel1 array) were differentially expressed in melanoma cell lines compared with a non-neural crest kidney epithelial cell line (P < 1 × 10^-8). This was not observed with an array of 1,238 ESTs that was selected without library origin bias (P = 0.204). This study presents an approach for selecting tissue-appropriate cDNAs that can be used to examine the expression profiles of developmental processes and diseases. PMID:10430933
Mor, Avishai; Koh, Eugene; Weiner, Lev; Rosenwasser, Shilo; Sibony-Benyamini, Hadas; Fluhr, Robert
2014-05-01
The production of singlet oxygen is typically associated with inefficient dissipation of photosynthetic energy or can arise from light reactions as a result of accumulation of chlorophyll precursors as observed in fluorescent (flu)-like mutants. Such photodynamic production of singlet oxygen is thought to be involved in stress signaling and programmed cell death. Here we show that transcriptomes of multiple stresses, whether from light or dark treatments, were correlated with the transcriptome of the flu mutant. A core gene set of 118 genes, common to singlet oxygen, biotic and abiotic stresses was defined and confirmed to be activated photodynamically by the photosensitizer Rose Bengal. In addition, induction of the core gene set by abiotic and biotic selected stresses was shown to occur in the dark and in nonphotosynthetic tissue. Furthermore, when subjected to various biotic and abiotic stresses in the dark, the singlet oxygen-specific probe Singlet Oxygen Sensor Green detected rapid production of singlet oxygen in the Arabidopsis (Arabidopsis thaliana) root. Subcellular localization of Singlet Oxygen Sensor Green fluorescence showed its accumulation in mitochondria, peroxisomes, and the nucleus, suggesting several compartments as the possible origins or targets for singlet oxygen. Collectively, the results show that singlet oxygen can be produced by multiple stress pathways and can emanate from compartments other than the chloroplast in a light-independent manner. The results imply that the role of singlet oxygen in plant stress regulation and response is more ubiquitous than previously thought.
Mor, Avishai; Koh, Eugene; Weiner, Lev; Rosenwasser, Shilo; Sibony-Benyamini, Hadas; Fluhr, Robert
2014-01-01
The production of singlet oxygen is typically associated with inefficient dissipation of photosynthetic energy or can arise from light reactions as a result of accumulation of chlorophyll precursors as observed in fluorescent (flu)-like mutants. Such photodynamic production of singlet oxygen is thought to be involved in stress signaling and programmed cell death. Here we show that transcriptomes of multiple stresses, whether from light or dark treatments, were correlated with the transcriptome of the flu mutant. A core gene set of 118 genes, common to singlet oxygen, biotic and abiotic stresses was defined and confirmed to be activated photodynamically by the photosensitizer Rose Bengal. In addition, induction of the core gene set by abiotic and biotic selected stresses was shown to occur in the dark and in nonphotosynthetic tissue. Furthermore, when subjected to various biotic and abiotic stresses in the dark, the singlet oxygen-specific probe Singlet Oxygen Sensor Green detected rapid production of singlet oxygen in the Arabidopsis (Arabidopsis thaliana) root. Subcellular localization of Singlet Oxygen Sensor Green fluorescence showed its accumulation in mitochondria, peroxisomes, and the nucleus, suggesting several compartments as the possible origins or targets for singlet oxygen. Collectively, the results show that singlet oxygen can be produced by multiple stress pathways and can emanate from compartments other than the chloroplast in a light-independent manner. The results imply that the role of singlet oxygen in plant stress regulation and response is more ubiquitous than previously thought. PMID:24599491
Severgnini, Marco; Bicciato, Silvio; Mangano, Eleonora; Scarlatti, Francesca; Mezzelani, Alessandra; Mattioli, Michela; Ghidoni, Riccardo; Peano, Clelia; Bonnal, Raoul; Viti, Federica; Milanesi, Luciano; De Bellis, Gianluca; Battaglia, Cristina
2006-06-01
Meta-analysis of microarray data is increasingly important, considering both the availability of multiple platforms using disparate technologies and the accumulation in public repositories of data sets from different laboratories. We addressed the issue of comparing gene expression profiles from two microarray platforms by devising a standardized investigative strategy. We tested this procedure by studying MDA-MB-231 cells, which undergo apoptosis on treatment with resveratrol. Gene expression profiles were obtained using high-density, short-oligonucleotide, single-color microarray platforms: GeneChip (Affymetrix) and CodeLink (Amersham). Interplatform analyses were carried out on 8414 common transcripts represented on both platforms, as identified by LocusLink ID, representing 70.8% and 88.6% of annotated GeneChip and CodeLink features, respectively. We identified 105 differentially expressed genes (DEGs) on CodeLink and 42 DEGs on GeneChip. Among them, only 9 DEGs were commonly identified by both platforms. Multiple analyses (BLAST alignment of probes with target sequences, gene ontology, literature mining, and quantitative real-time PCR) permitted us to investigate the factors contributing to the generation of platform-dependent results in single-color microarray experiments. An effective approach to cross-platform comparison involves microarrays of similar technologies, samples prepared by identical methods, and a standardized battery of bioinformatic and statistical analyses.
Matsuzaki, Jun; Kawahara, Yoshihiro; Izawa, Takeshi
2015-01-01
Plant circadian clocks that oscillate autonomously with a roughly 24-h period are entrained by fluctuating light and temperature and globally regulate downstream genes in the field. However, it remains unknown how punctual internal time produced by the circadian clock in the field is and how it is affected by environmental fluctuations due to weather or daylength. Using hundreds of samples of field-grown rice (Oryza sativa) leaves, we developed a statistical model for the expression of circadian clock-related genes integrating diurnally entrained circadian clock with phase setting by light, both responses to light and temperature gated by the circadian clock. We show that expression of individual genes was strongly affected by temperature. However, internal time estimated from expression of multiple genes, which may reflect transcriptional regulation of downstream genes, is punctual to 22 min and not affected by weather, daylength, or plant developmental age in the field. We also revealed perturbed progression of internal time under controlled environment or in a mutant of the circadian clock gene GIGANTEA. Thus, we demonstrated that the circadian clock is a regulatory network of multiple genes that retains accurate physical time of day by integrating the perturbations on individual genes under fluctuating environments in the field. PMID:25757473
Schiffels, Johannes; Pinkenburg, Olaf; Schelden, Maximilian; Aboulnaga, El-Hussiny A. A.; Baumann, Marcus E. M.; Selmer, Thorsten
2013-01-01
Expression of multiple heterologous genes in a dedicated host is a prerequisite for approaches in synthetic biology, spanning from the production of recombinant multiprotein complexes to the transfer of tailor-made metabolic pathways. Such attempts are often hampered, due in most cases to a lack of proper directional, robust and readily accessible genetic tools. Here, we introduce an innovative system for cloning and expression of multiple genes in Escherichia coli BL21 (DE3). Using the novel methodology, genes are equipped with individual promoters and terminators and subsequently assembled. The resulting multiple gene cassettes may either be placed in one vector or alternatively distributed among a set of compatible plasmids. We demonstrate the effectiveness of the developed tool by production and maturation of the NAD+-reducing soluble [NiFe]-hydrogenase (SH) from Cupriavidus necator H16 (formerly Ralstonia eutropha H16) in E. coli BL21Star™ (DE3). The SH (encoded in hoxFUYHI) was successfully matured by co-expression of a dedicated set of auxiliary genes, comprising seven hyp genes (hypC1D1E1A2B2F2X) along with hoxW, which encodes a specific endopeptidase. Deletion of genes involved in SH maturation reduced maturation efficiency substantially. Further addition of hoxN1, encoding a high-affinity nickel permease from C. necator, considerably increased maturation efficiency in E. coli. Carefully balanced growth conditions enabled hydrogenase production at high cell-densities, scoring mg·(Liter culture)^-1 yields of purified functional SH. Specific activities of up to 7.2 ± 1.15 U·mg^-1 were obtained in cell-free extracts, which is in the range of the highest activities ever determined in C. necator extracts. The recombinant enzyme was isolated in equal purity and stability as previously achieved with the native form, yielding ultrapure preparations with anaerobic specific activities of up to 230 U·mg^-1. 
Owing to the combinatorial power exhibited by the presented cloning platform, the system might represent an important step towards new routes in synthetic biology. PMID:23861944
Pounds, Stan; Cao, Xueyuan; Cheng, Cheng; Yang, Jun; Campana, Dario; Evans, William E.; Pui, Ching-Hon; Relling, Mary V.
2010-01-01
Powerful methods for integrated analysis of multiple biological data sets are needed to maximize interpretation capacity and acquire meaningful knowledge. We recently developed Projection Onto the Most Interesting Statistical Evidence (PROMISE). PROMISE is a statistical procedure that incorporates prior knowledge about the biological relationships among endpoint variables into an integrated analysis of microarray gene expression data with multiple biological and clinical endpoints. Here, PROMISE is adapted to the integrated analysis of pharmacologic, clinical, and genome-wide genotype data in a way that incorporates knowledge about the biological relationships among pharmacologic and clinical response data. An efficient permutation-testing algorithm is introduced so that statistical calculations are computationally feasible in this higher-dimension setting. The new method is applied to a pediatric leukemia data set. The results clearly indicate that PROMISE is a powerful statistical tool for identifying genomic features that exhibit a biologically meaningful pattern of association with multiple endpoint variables. PMID:21516175
Zheng, Chunfang; Santos Muñoz, Daniella; Albert, Victor A; Sankoff, David
2015-01-01
Following whole genome duplication (WGD), there is a compact distribution of gene similarities within the genome reflecting duplicate pairs of all the genes in the genome. With time, the distribution broadens and loses volume due to variable decay of duplicate gene similarity and to the process of duplicate gene loss. If there are two WGD, the older one becomes so reduced and broad that it merges with the tail of the distributions resulting from more recent events, and it becomes difficult to distinguish them. The goal of this paper is to advance statistical methods of identifying, or at least counting, the WGD events in the lineage of a given genome. For a set of 15 angiosperm genomes, we analyze all 15 × 14 = 210 ordered pairs of target genome versus reference genome, using SynMap to find syntenic blocks. We consider all sets of B ≥ 2 syntenic blocks in the target genome that overlap in the reference genome as evidence of WGD activity in the target, whether it be one event or several. We hypothesize that in fitting an exponential function to the tail of the empirical distribution f (B) of block multiplicities, the size of the exponent will reflect the amount of WGD in the history of the target genome. By amalgamating the results from all reference genomes, a range of values of SynMap parameters, and alternative cutoff points for the tail, we find a clear pattern whereby multiple-WGD core eudicots have the smallest (negative) exponents, followed by core eudicots with only the single "γ" triplication in their history, followed by a non-core eudicot with a single WGD, followed by the monocots, with a basal angiosperm, the WGD-free Amborella having the largest exponent. The hypothesis that the exponent of the fit to the tail of the multiplicity distribution is a signature of the amount of WGD is verified, but there is also a clear complicating factor in the monocot clade, where a history of multiple WGD is not reflected in a small exponent.
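The two quantitative steps in the abstract above, counting ordered target/reference pairs and fitting an exponential to the tail of the block-multiplicity distribution f(B), can be sketched as follows. The abstract does not say how the fit was performed, so the log-linear least-squares fit here is an assumption, as are the function names and the synthetic counts.

```python
import math
from itertools import permutations


def ordered_pairs(genomes):
    """All ordered (target, reference) pairs of distinct genomes."""
    return list(permutations(genomes, 2))


def fit_tail_exponent(counts, cutoff):
    """Fit log f(B) = a + k * B by least squares for B >= cutoff; return k.

    `counts` maps a block multiplicity B to its observed frequency f(B).
    A more negative k means a faster-decaying tail.
    """
    points = [(b, math.log(c)) for b, c in counts.items() if b >= cutoff and c > 0]
    n = len(points)
    if n < 2:
        raise ValueError("need at least two tail points to fit a slope")
    mean_b = sum(b for b, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((b - mean_b) ** 2 for b, _ in points)
    sxy = sum((b - mean_b) * (y - mean_y) for b, y in points)
    return sxy / sxx


# 15 genomes give 15 * 14 = 210 ordered pairs, as stated in the abstract.
pairs = ordered_pairs(range(15))

# Synthetic multiplicity counts that decay exactly as exp(-0.5 * B).
synthetic_counts = {b: 1000.0 * math.exp(-0.5 * b) for b in range(2, 10)}
exponent = fit_tail_exponent(synthetic_counts, cutoff=2)
```

Varying `cutoff` corresponds to the alternative cutoff points for the tail that the abstract describes amalgamating over.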
2014-01-01
Background: A thorough investigation of the neurobiology of HIV-induced neuronal dysfunction and its evolving phenotype in the setting of viral suppression has been limited by the lack of validated small animal models to probe the effects of concomitant low level expression of multiple HIV-1 products in disease-relevant cells in the CNS. Results: We report the results of gene expression profiling of the hippocampus of HIV-1 Tg rats, a rodent model of HIV infection in which multiple HIV-1 proteins are expressed under the control of the viral LTR promoter in disease-relevant cells including microglia and astrocytes. The Gene Set Enrichment Analysis (GSEA) algorithm was used for pathway analysis. Gene expression changes observed are consistent with astrogliosis and microgliosis and include evidence of inflammation and cell proliferation. Among the genes with increased expression in HIV-1 Tg rats were the interferon stimulated gene 15 (ISG-15), which was previously shown to be increased in the cerebrospinal fluid (CSF) of HIV patients and to correlate with neuropsychological impairment and neuropathology, and prostaglandin D2 (PGD2) synthase (Ptgds), which has been associated with immune activation and the induction of astrogliosis and microgliosis. GSEA-based pathway analysis highlighted a broad dysregulation of genes involved in neuronal trophism and neurodegenerative disorders. Among the latter are genesets associated with Huntington’s disease, Parkinson’s disease, mitochondrial and peroxisome function, and synaptic trophism and plasticity, such as IGF, ErbB and netrin signaling and the PI3K signal transduction pathway, a mediator of neural plasticity and of a vast array of trophic signals. Additionally, gene expression analyses show altered lipid metabolism and peroxisome dysfunction. 
Supporting the functional significance of these gene expression alterations, HIV-1 Tg rats showed working memory impairments in spontaneous alternation behavior in the T-Maze, a paradigm sensitive to prefrontal cortex and hippocampal function. Conclusions: Altogether, differentially regulated genes and pathway analysis identify specific pathways that can be targeted therapeutically to increase trophic support, e.g. IGF, ErbB and netrin signaling, and reduce neuroinflammation, e.g. PGD2 synthesis, which may be beneficial in the treatment of chronic forms of HIV-associated neurocognitive disorders in the setting of viral suppression. PMID:24980976
Zhang, Wensheng; Edwards, Andrea; Fan, Wei; Zhu, Dongxiao; Zhang, Kun
2010-06-22
Comparative analysis of gene expression profiling of multiple biological categories, such as different species of organisms or different kinds of tissue, promises to enhance the fundamental understanding of the universality as well as the specialization of mechanisms and related biological themes. Grouping genes with a similar expression pattern or exhibiting co-expression together is a starting point in understanding and analyzing gene expression data. In recent literature, gene module level analysis is advocated in order to understand biological network design and system behaviors in disease and life processes; however, practical difficulties often lie in the implementation of existing methods. Using the singular value decomposition (SVD) technique, we developed a new computational tool, named svdPPCS (SVD-based Pattern Pairing and Chart Splitting), to identify conserved and divergent co-expression modules of two sets of microarray experiments. In the proposed methods, gene modules are identified by splitting the two-way chart coordinated with a pair of left singular vectors factorized from the gene expression matrices of the two biological categories. Importantly, the cutoffs are determined by a data-driven algorithm using the well-defined statistic, SVD-p. The implementation was illustrated on two time series microarray data sets generated from the samples of accessory gland (ACG) and Malpighian tubule (MT) tissues of the W118 line of Drosophila melanogaster. Two conserved modules and six divergent modules, each of which has a unique characteristic profile across tissue kinds and aging processes, were identified. The number of genes contained in these modules ranged from five to a few hundred. Three to over a hundred GO terms were over-represented in individual modules with FDR < 0.1. One divergent module suggested the tissue-specific relationship between the expressions of mitochondrion-related genes and the aging process. 
This finding, together with others, may be of biological significance. The validity of the proposed SVD-based method was further verified by a simulation study, as well as the comparisons with regression analysis and cubic spline regression analysis plus PAM based clustering. svdPPCS is a novel computational tool for the comparative analysis of transcriptional profiling. It especially fits the comparison of time series data of related organisms or different tissues of the same organism under equivalent or similar experimental conditions. The general scheme can be directly extended to the comparisons of multiple data sets. It also can be applied to the integration of data sets from different platforms and of different sources.
2012-01-01
High-dimensional gene expression data provide a rich source of information because they capture the expression level of genes in dynamic states that reflect the biological functioning of a cell. For this reason, such data are suitable to reveal systems related properties inside a cell, e.g., in order to elucidate molecular mechanisms of complex diseases like breast or prostate cancer. However, this is not only strongly dependent on the sample size and the correlation structure of a data set, but also on the statistical hypotheses tested. Many different approaches have been developed over the years to analyze gene expression data to (I) identify changes in single genes, (II) identify changes in gene sets or pathways, and (III) identify changes in the correlation structure in pathways. In this paper, we review statistical methods for all three types of approaches, including subtypes, in the context of cancer data and provide links to software implementations and tools and address also the general problem of multiple hypotheses testing. Further, we provide recommendations for the selection of such analysis methods. Reviewers This article was reviewed by Arcady Mushegian, Byung-Soo Kim and Joel Bader. PMID:23227854
Teste, Marie-Ange; Duquenne, Manon; François, Jean M; Parrou, Jean-Luc
2009-01-01
Background: Real-time RT-PCR is the recommended method for quantitative gene expression analysis. A compulsory step is the selection of good reference genes for normalization. A few genes often referred to as HouseKeeping Genes (HSK), such as ACT1, RDN18 or PDA1 are among the most commonly used, as their expression is assumed to remain unchanged over a wide range of conditions. Since this assumption is very unlikely, a geometric averaging of multiple, carefully selected internal control genes is now strongly recommended for normalization to avoid this problem of expression variation of single reference genes. The aim of this work was to search for a set of reference genes for reliable gene expression analysis in Saccharomyces cerevisiae. Results: From public microarray datasets, we selected potential reference genes whose expression remained apparently invariable during long-term growth on glucose. Using the algorithm geNorm, ALG9, TAF10, TFC1 and UBC6 turned out to be genes whose expression remained stable, independent of the growth conditions and the strain backgrounds tested in this study. We then showed that the geometric averaging of any subset of three genes among the six most stable genes resulted in very similar normalized data, which contrasted with inconsistent results among various biological samples when the normalization was performed with ACT1. Normalization with multiple selected genes was therefore applied to transcriptional analysis of genes involved in glycogen metabolism. We determined an induction ratio of 100-fold for GPH1 and 20-fold for GSY2 between the exponential phase and the diauxic shift on glucose. There was no induction of these two genes at this transition phase on galactose, although in both cases, the kinetics of glycogen accumulation was similar. In contrast, SGA1 expression was independent of the carbon source and increased by 3-fold in stationary phase. 
Conclusion: In this work, we provided a set of genes that are suitable reference genes for quantitative gene expression analysis by real-time RT-PCR in yeast biological samples covering a large panel of physiological states. In contrast, we invalidated and discourage the use of ACT1 as well as other commonly used reference genes (PDA1, TDH3, RDN18, etc) as internal controls for quantitative gene expression analysis in yeast. PMID:19874630
Teste, Marie-Ange; Duquenne, Manon; François, Jean M; Parrou, Jean-Luc
2009-10-30
Real-time RT-PCR is the recommended method for quantitative gene expression analysis. A compulsory step is the selection of good reference genes for normalization. A few genes often referred to as HouseKeeping Genes (HSK), such as ACT1, RDN18 or PDA1 are among the most commonly used, as their expression is assumed to remain unchanged over a wide range of conditions. Since this assumption is very unlikely, a geometric averaging of multiple, carefully selected internal control genes is now strongly recommended for normalization to avoid this problem of expression variation of single reference genes. The aim of this work was to search for a set of reference genes for reliable gene expression analysis in Saccharomyces cerevisiae. From public microarray datasets, we selected potential reference genes whose expression remained apparently invariable during long-term growth on glucose. Using the algorithm geNorm, ALG9, TAF10, TFC1 and UBC6 turned out to be genes whose expression remained stable, independent of the growth conditions and the strain backgrounds tested in this study. We then showed that the geometric averaging of any subset of three genes among the six most stable genes resulted in very similar normalized data, which contrasted with inconsistent results among various biological samples when the normalization was performed with ACT1. Normalization with multiple selected genes was therefore applied to transcriptional analysis of genes involved in glycogen metabolism. We determined an induction ratio of 100-fold for GPH1 and 20-fold for GSY2 between the exponential phase and the diauxic shift on glucose. There was no induction of these two genes at this transition phase on galactose, although in both cases, the kinetics of glycogen accumulation was similar. In contrast, SGA1 expression was independent of the carbon source and increased by 3-fold in stationary phase. 
In this work, we provided a set of genes that are suitable reference genes for quantitative gene expression analysis by real-time RT-PCR in yeast biological samples covering a large panel of physiological states. In contrast, we invalidated and discourage the use of ACT1 as well as other commonly used reference genes (PDA1, TDH3, RDN18, etc) as internal controls for quantitative gene expression analysis in yeast.
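The normalization strategy recommended in the abstract above, dividing a target gene's quantity by the geometric mean of several reference genes, can be sketched as below. This is only an illustration of geometric averaging, not the geNorm algorithm or the authors' pipeline; the function names and all numeric quantities are hypothetical.

```python
import math


def normalization_factor(reference_quantities):
    """Geometric mean of the relative quantities of the reference genes."""
    if not reference_quantities:
        raise ValueError("at least one reference gene is required")
    if any(q <= 0 for q in reference_quantities):
        raise ValueError("relative quantities must be positive")
    log_sum = sum(math.log(q) for q in reference_quantities)
    return math.exp(log_sum / len(reference_quantities))


def normalize(target_quantity, reference_quantities):
    """Target quantity divided by the geometric-mean normalization factor."""
    return target_quantity / normalization_factor(reference_quantities)


# Hypothetical relative quantities for one sample; the gene names come from
# the abstract, the numbers do not.
references = {"ALG9": 1.0, "TAF10": 2.0, "TFC1": 4.0}
factor = normalization_factor(list(references.values()))
normalized_target = normalize(10.0, list(references.values()))
```

Averaging on the log scale means that a doubling in one reference gene and a halving in another cancel out, which an arithmetic mean would not do.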
Hohman, Timothy J; Bush, William S; Jiang, Lan; Brown-Gentry, Kristin D; Torstenson, Eric S; Dudek, Scott M; Mukherjee, Shubhabrata; Naj, Adam; Kunkle, Brian W; Ritchie, Marylyn D; Martin, Eden R; Schellenberg, Gerard D; Mayeux, Richard; Farrer, Lindsay A; Pericak-Vance, Margaret A; Haines, Jonathan L; Thornton-Wells, Tricia A
2016-02-01
Late-onset Alzheimer disease (AD) has a complex genetic etiology, involving locus heterogeneity, polygenic inheritance, and gene-gene interactions; however, the investigation of interactions in recent genome-wide association studies has been limited. We used a biological knowledge-driven approach to evaluate gene-gene interactions for consistency across 13 data sets from the Alzheimer Disease Genetics Consortium. Fifteen single nucleotide polymorphism (SNP)-SNP pairs within 3 gene-gene combinations were identified: SIRT1 × ABCB1, PSAP × PEBP4, and GRIN2B × ADRA1A. In addition, we extend a previously identified interaction from an endophenotype analysis between RYR3 × CACNA1C. Finally, post hoc gene expression analyses of the implicated SNPs further implicate SIRT1 and ABCB1, and implicate CDH23 which was most recently identified as an AD risk locus in an epigenetic analysis of AD. The observed interactions in this article highlight ways in which genotypic variation related to disease may depend on the genetic context in which it occurs. Further, our results highlight the utility of evaluating genetic interactions to explain additional variance in AD risk and identify novel molecular mechanisms of AD pathogenesis.
MOTIFSIM 2.1: An Enhanced Software Platform for Detecting Similarity in Multiple DNA Motif Data Sets
Huang, Chun-Hsi
2017-01-01
Finding binding site motifs plays an important role in bioinformatics because it reveals the transcription factors that control gene expression. The development of motif finders has flourished in recent years, and many tools have been introduced to the research community. Although these tools possess exceptional features for detecting motifs, they report different results for an identical data set. Hence, using multiple tools is recommended because motifs reported by several tools are likely to be biologically significant. However, the results from multiple tools need to be compared to obtain common significant motifs. The MOTIFSIM web tool and command-line tool were developed for this purpose. In this work, we present several technical improvements as well as additional features to further support motif analysis in our new release, MOTIFSIM 2.1. PMID:28632401
PGMapper: a web-based tool linking phenotype to genes.
Xiong, Qing; Qiu, Yuhui; Gu, Weikuan
2008-04-01
With the availability of whole genome sequence in many species, linkage analysis, positional cloning and microarray are gradually becoming powerful tools for investigating the links between phenotype and genotype or genes. However, in these methods, causative genes underlying a quantitative trait locus, or a disease, are usually located within a large genomic region or a large set of genes. Examining the function of every gene is very time consuming and needs to retrieve and integrate the information from multiple databases or genome resources. PGMapper is a software tool for automatically matching phenotype to genes from a defined genome region or a group of given genes by combining the mapping information from the Ensembl database and gene function information from the OMIM and PubMed databases. PGMapper is currently available for candidate gene search of human, mouse, rat, zebrafish and 12 other species. Available online at http://www.genediscovery.org/pgmapper/index.jsp.
Integrative Genomic Analyses Yields Cell Cycle Regulatory Programs with Prognostic Value
Cheng, Chao; Lou, Shaoke; Andrews, Erik H.; Ung, Matthew H.; Varn, Frederick S.
2016-01-01
Liposarcoma is the second most common form of sarcoma and has been categorized into four molecular subtypes that are associated with differential patient prognosis. However, the transcriptional regulatory programs associated with distinct histological and molecular subtypes of liposarcoma have not been investigated. This study uses integrative analyses to systematically define the transcriptional regulatory programs associated with liposarcoma. Likewise, computational methods are used to identify regulatory programs associated with different liposarcoma subtypes as well as programs that are predictive of prognosis. Further analysis of curated gene sets was used to identify prognostic gene signatures. The integration of data from a variety of sources including gene expression profiles, transcription factor (TF) binding data from ChIP-seq experiments, curated gene sets, and clinical information of patients indicated discrete regulatory programs (e.g., controlled by E2F1 and E2F4) with significantly different regulatory activity in one or multiple subtypes of liposarcoma with respect to normal adipose tissue. These programs were also shown to be prognostic, wherein higher E2F4 or E2F1 activity in liposarcoma patients was associated with unfavorable prognosis. A total of 259 gene sets were significantly associated with patient survival in liposarcoma, among which >50% are involved in cell cycle and proliferation. PMID:26856934
WormQTLHD—a web database for linking human disease to natural variation data in C. elegans
van der Velde, K. Joeri; de Haan, Mark; Zych, Konrad; Arends, Danny; Snoek, L. Basten; Kammenga, Jan E.; Jansen, Ritsert C.; Swertz, Morris A.; Li, Yang
2014-01-01
Interactions between proteins are highly conserved across species. As a result, the molecular basis of multiple diseases affecting humans can be studied in model organisms that offer many alternative experimental opportunities. One such organism—Caenorhabditis elegans—has been used to produce much molecular quantitative genetics and systems biology data over the past decade. We present WormQTLHD (Human Disease), a database that quantitatively and systematically links expression Quantitative Trait Loci (eQTL) findings in C. elegans to gene–disease associations in man. WormQTLHD, available online at http://www.wormqtl-hd.org, is a user-friendly set of tools to reveal functionally coherent, evolutionary conserved gene networks. These can be used to predict novel gene-to-gene associations and the functions of genes underlying the disease of interest. We created a new database that links C. elegans eQTL data sets to human diseases (34 337 gene–disease associations from OMIM, DGA, GWAS Central and NHGRI GWAS Catalogue) based on overlapping sets of orthologous genes associated to phenotypes in these two species. We utilized QTL results, high-throughput molecular phenotypes, classical phenotypes and genotype data covering different developmental stages and environments from WormQTL database. All software is available as open source, built on MOLGENIS and xQTL workbench. PMID:24217915
2009-01-01
Background One of the most common and efficient methods for detecting mutations in genes is PCR amplification followed by direct sequencing. Until recently, the process of designing PCR assays has been to focus on individual assay parameters rather than concentrating on matching conditions for a set of assays. Primers for each individual assay were selected based on location and sequence concerns. The two primer sequences were then iteratively adjusted to make the individual assays work properly. This generally resulted in groups of assays with different annealing temperatures that required the use of multiple thermal cyclers or multiple passes in a single thermal cycler making diagnostic testing time-consuming, laborious and expensive. These factors have severely hampered diagnostic testing services, leaving many families without an answer for the exact cause of a familial genetic disease. A search of GeneTests for sequencing analysis of the entire coding sequence for genes that are known to cause muscular dystrophies returns only a small list of laboratories that perform comprehensive gene panels. The hypothesis for the study was that a complete set of universal assays can be designed to amplify and sequence any gene or family of genes using computer aided design tools. If true, this would allow automation and optimization of the mutation detection process resulting in reduced cost and increased throughput. Results An automated process has been developed for the detection of deletions, duplications/insertions and point mutations in any gene or family of genes and has been applied to ten genes known to bear mutations that cause muscular dystrophy: DMD; CAV3; CAPN3; FKRP; TRIM32; LMNA; SGCA; SGCB; SGCG; SGCD. Using this process, mutations have been found in five DMD patients and four LGMD patients (one in the FKRP gene, one in the CAV3 gene, and two likely causative heterozygous pairs of variations in the CAPN3 gene of two other patients). 
Methods and assay sequences are reported in this paper. Conclusion This automated process allows laboratories to discover DNA variations in a short time and at low cost. PMID:19835634
Bennett, Richard R; Schneider, Hal E; Estrella, Elicia; Burgess, Stephanie; Cheng, Andrew S; Barrett, Caitlin; Lip, Va; Lai, Poh San; Shen, Yiping; Wu, Bai-Lin; Darras, Basil T; Beggs, Alan H; Kunkel, Louis M
2009-10-18
One of the most common and efficient methods for detecting mutations in genes is PCR amplification followed by direct sequencing. Until recently, the process of designing PCR assays has been to focus on individual assay parameters rather than concentrating on matching conditions for a set of assays. Primers for each individual assay were selected based on location and sequence concerns. The two primer sequences were then iteratively adjusted to make the individual assays work properly. This generally resulted in groups of assays with different annealing temperatures that required the use of multiple thermal cyclers or multiple passes in a single thermal cycler making diagnostic testing time-consuming, laborious and expensive. These factors have severely hampered diagnostic testing services, leaving many families without an answer for the exact cause of a familial genetic disease. A search of GeneTests for sequencing analysis of the entire coding sequence for genes that are known to cause muscular dystrophies returns only a small list of laboratories that perform comprehensive gene panels. The hypothesis for the study was that a complete set of universal assays can be designed to amplify and sequence any gene or family of genes using computer aided design tools. If true, this would allow automation and optimization of the mutation detection process resulting in reduced cost and increased throughput. An automated process has been developed for the detection of deletions, duplications/insertions and point mutations in any gene or family of genes and has been applied to ten genes known to bear mutations that cause muscular dystrophy: DMD; CAV3; CAPN3; FKRP; TRIM32; LMNA; SGCA; SGCB; SGCG; SGCD. Using this process, mutations have been found in five DMD patients and four LGMD patients (one in the FKRP gene, one in the CAV3 gene, and two likely causative heterozygous pairs of variations in the CAPN3 gene of two other patients).
Methods and assay sequences are reported in this paper. This automated process allows laboratories to discover DNA variations in a short time and at low cost.
Ozerov, Ivan V; Lezhnina, Ksenia V; Izumchenko, Evgeny; Artemov, Artem V; Medintsev, Sergey; Vanhaelen, Quentin; Aliper, Alexander; Vijg, Jan; Osipov, Andreyan N; Labat, Ivan; West, Michael D; Buzdin, Anton; Cantor, Charles R; Nikolsky, Yuri; Borisov, Nikolay; Irincheeva, Irina; Khokhlovich, Edward; Sidransky, David; Camargo, Miguel Luiz; Zhavoronkov, Alex
2016-11-16
Signalling pathway activation analysis is a powerful approach for extracting biologically relevant features from large-scale transcriptomic and proteomic data. However, modern pathway-based methods often fail to provide stable pathway signatures of a specific phenotype or reliable disease biomarkers. In the present study, we introduce the in silico Pathway Activation Network Decomposition Analysis (iPANDA) as a scalable robust method for biomarker identification using gene expression data. The iPANDA method combines precalculated gene coexpression data with gene importance factors based on the degree of differential gene expression and pathway topology decomposition for obtaining pathway activation scores. Using Microarray Analysis Quality Control (MAQC) data sets and pretreatment data on Taxol-based neoadjuvant breast cancer therapy from multiple sources, we demonstrate that iPANDA provides significant noise reduction in transcriptomic data and identifies highly robust sets of biologically relevant pathway signatures. We successfully apply iPANDA for stratifying breast cancer patients according to their sensitivity to neoadjuvant therapy.
Ozerov, Ivan V.; Lezhnina, Ksenia V.; Izumchenko, Evgeny; Artemov, Artem V.; Medintsev, Sergey; Vanhaelen, Quentin; Aliper, Alexander; Vijg, Jan; Osipov, Andreyan N.; Labat, Ivan; West, Michael D.; Buzdin, Anton; Cantor, Charles R.; Nikolsky, Yuri; Borisov, Nikolay; Irincheeva, Irina; Khokhlovich, Edward; Sidransky, David; Camargo, Miguel Luiz; Zhavoronkov, Alex
2016-01-01
Signalling pathway activation analysis is a powerful approach for extracting biologically relevant features from large-scale transcriptomic and proteomic data. However, modern pathway-based methods often fail to provide stable pathway signatures of a specific phenotype or reliable disease biomarkers. In the present study, we introduce the in silico Pathway Activation Network Decomposition Analysis (iPANDA) as a scalable robust method for biomarker identification using gene expression data. The iPANDA method combines precalculated gene coexpression data with gene importance factors based on the degree of differential gene expression and pathway topology decomposition for obtaining pathway activation scores. Using Microarray Analysis Quality Control (MAQC) data sets and pretreatment data on Taxol-based neoadjuvant breast cancer therapy from multiple sources, we demonstrate that iPANDA provides significant noise reduction in transcriptomic data and identifies highly robust sets of biologically relevant pathway signatures. We successfully apply iPANDA for stratifying breast cancer patients according to their sensitivity to neoadjuvant therapy. PMID:27848968
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yang, Xiaojun; Department of General Surgery, Gansu Provincial Hospital, Lanzhou, Gansu 710000; Zhong, Xiaomin
2013-02-15
Highlights: ► Gene set enrichment analysis indicated mir-30d might regulate the autophagy pathway. ► mir-30d represses the expression of BECN1, BNIP3L, ATG12, ATG5 and ATG2. ► BECN1, BNIP3L, ATG12, ATG5 and ATG2 are direct targets of mir-30d. ► mir-30d inhibits autophagosome formation and LC3B-I conversion to LC3B-II. ► mir-30d regulates the autophagy process. -- Abstract: In human epithelial cancers, the microRNA (miRNA) mir-30d is amplified with high frequency and serves as a critical oncomir by regulating metastasis, apoptosis, proliferation, and differentiation. Autophagy, a degradation pathway for long-lived protein and organelles, regulates the survival and death of many cell types. Increasing evidence suggests that autophagy plays an important function in epithelial tumor initiation and progression. Using a combined bioinformatics approach, gene set enrichment analysis, and miRNA target prediction, we found that mir-30d might regulate multiple genes in the autophagy pathway including BECN1, BNIP3L, ATG12, ATG5, and ATG2. Our further functional experiments demonstrated that the expression of these core proteins in the autophagy pathway was directly suppressed by mir-30d in cancer cells. Finally, we showed that mir-30d regulated the autophagy process by inhibiting autophagosome formation and LC3B-I conversion to LC3B-II. Taken together, our results provide evidence that the oncomir mir-30d impairs the autophagy process by targeting multiple genes in the autophagy pathway. This result will contribute to understanding the molecular mechanism of mir-30d in tumorigenesis and developing novel cancer therapy strategy.
Ohneda, Kinuko; Mirmira, Raghavendra G.; Wang, Juehu; Johnson, Jeffrey D.; German, Michael S.
2000-01-01
Activation of insulin gene transcription specifically in the pancreatic β cells depends on multiple nuclear proteins that interact with each other and with sequences on the insulin gene promoter to build a transcriptional activation complex. The homeodomain protein PDX-1 exemplifies such interactions by binding to the A3/4 region of the rat insulin I promoter and activating insulin gene transcription by cooperating with the basic-helix-loop-helix (bHLH) protein E47/Pan1, which binds to the adjacent E2 site. The present study provides evidence that the homeodomain of PDX-1 acts as a protein-protein interaction domain to recruit multiple proteins, including E47/Pan1, BETA2/NeuroD1, and high-mobility group protein I(Y), to an activation complex on the E2A3/4 minienhancer. The transcriptional activity of this complex results from the clustering of multiple activation domains capable of interacting with coactivators and the basal transcriptional machinery. These interactions are not common to all homeodomain proteins: the LIM homeodomain protein Lmx1.1 can also activate the E2A3/4 minienhancer in cooperation with E47/Pan1 but does so through different interactions. Cooperation between Lmx1.1 and E47/Pan1 results not only in the aggregation of multiple activation domains but also in the unmasking of a potent activation domain on E47/Pan1 that is normally silent in non-β cells. While more than one activation complex may be capable of activating insulin gene transcription through the E2A3/4 minienhancer, each is dependent on multiple specific interactions among a unique set of nuclear proteins. PMID:10629047
Özgür, Arzucan; Hur, Junguk; He, Yongqun
2016-01-01
The Interaction Network Ontology (INO) logically represents biological interactions, pathways, and networks. INO has been demonstrated to be valuable in providing a set of structured ontological terms and associated keywords to support literature mining of gene-gene interactions from biomedical literature. However, previous work using INO focused on single keyword matching, while many interactions are represented with two or more interaction keywords used in combination. This paper reports our extension of INO to include combinatory patterns of two or more literature mining keywords co-existing in one sentence to represent specific INO interaction classes. Such keyword combinations and related INO interaction type information could be automatically obtained via SPARQL queries, formatted in Excel format, and used in an INO-supported SciMiner, an in-house literature mining program. We studied the gene interaction sentences from the commonly used benchmark Learning Logic in Language (LLL) dataset and one internally generated vaccine-related dataset to identify and analyze interaction types containing multiple keywords. Patterns obtained from the dependency parse trees of the sentences were used to identify the interaction keywords that are related to each other and collectively represent an interaction type. The INO ontology currently has 575 terms including 202 terms under the interaction branch. The relations between the INO interaction types and associated keywords are represented using the INO annotation relations: 'has literature mining keywords' and 'has keyword dependency pattern'. The keyword dependency patterns were generated via running the Stanford Parser to obtain dependency relation types. Out of the 107 interactions in the LLL dataset represented with two-keyword interaction types, 86 were identified by using the direct dependency relations. The LLL dataset contained 34 gene regulation interaction types, each of which associated with multiple keywords. 
A hierarchical display of these 34 interaction types and their ancestor terms in INO resulted in the identification of specific gene-gene interaction patterns from the LLL dataset. The phenomenon of having multi-keyword interaction types was also frequently observed in the vaccine dataset. By modeling and representing multiple textual keywords for interaction types, the extended INO enabled the identification of complex biological gene-gene interactions represented with multiple keywords.
TEGS-CN: A Statistical Method for Pathway Analysis of Genome-wide Copy Number Profile.
Huang, Yen-Tsung; Hsu, Thomas; Christiani, David C
2014-01-01
The effects of copy number alterations make up a significant part of the tumor genome profile, but pathway analyses of these alterations are still not well established. We proposed a novel method to analyze multiple copy numbers of genes within a pathway, termed Test for the Effect of a Gene Set with Copy Number data (TEGS-CN). TEGS-CN was adapted from TEGS, a method that we previously developed for gene expression data using a variance component score test. With additional development, we extend the method to analyze DNA copy number data, accounting for different sizes and thus various numbers of copy number probes in genes. The test statistic follows a mixture of χ² distributions that can be obtained using permutation with scaled χ² approximation. We conducted simulation studies to evaluate the size and the power of TEGS-CN and to compare its performance with TEGS. We analyzed genome-wide copy number data from 264 patients with non-small-cell lung cancer. With the Molecular Signatures Database (MSigDB) pathway database, the genome-wide copy number data can be classified into 1814 biological pathways or gene sets. We investigated associations of the copy number profile of the 1814 gene sets with pack-years of cigarette smoking. Our analysis revealed five pathways with significant P values after Bonferroni adjustment (<2.8 × 10⁻⁵), including the PTEN pathway (7.8 × 10⁻⁷), the gene set up-regulated under heat shock (3.6 × 10⁻⁶), the gene sets involved in the immune profile for rejection of kidney transplantation (9.2 × 10⁻⁶) and for transcriptional control of leukocytes (2.2 × 10⁻⁵), and the ganglioside biosynthesis pathway (2.7 × 10⁻⁵). In conclusion, we present a new method for pathway analyses of copy number data, and causal mechanisms of the five pathways require further study.
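The Bonferroni cutoff quoted in the TEGS-CN abstract can be reproduced with a few lines of arithmetic. The sketch below assumes a family-wise significance level of 0.05, which the abstract does not state but which matches the reported cutoff when divided by 1814 gene sets; the dictionary keys are shorthand labels chosen here for illustration, and the P values are those listed in the abstract.

```python
# Assumption: family-wise alpha of 0.05 (not stated in the abstract).
ALPHA = 0.05
N_GENE_SETS = 1814  # number of MSigDB gene sets tested, from the abstract


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance cutoff under Bonferroni adjustment."""
    return alpha / n_tests


threshold = bonferroni_threshold(ALPHA, N_GENE_SETS)

# P values as reported in the abstract; labels are illustrative shorthand.
reported_p = {
    "PTEN pathway": 7.8e-7,
    "heat shock up-regulated": 3.6e-6,
    "kidney transplant rejection": 9.2e-6,
    "leukocyte transcriptional control": 2.2e-5,
    "ganglioside biosynthesis": 2.7e-5,
}

# Keep only gene sets whose P value falls below the adjusted cutoff.
significant = {name: p for name, p in reported_p.items() if p < threshold}

print(f"threshold = {threshold:.2e}")
print(sorted(significant, key=significant.get))
```

Under this assumption all five reported pathways fall below the cutoff, with the ganglioside biosynthesis pathway only narrowly so.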
Dominguez, Daniel; Tsai, Yi-Hsuan; Gomez, Nicholas; Jha, Deepak Kumar; Davis, Ian; Wang, Zefeng
2016-01-01
Progression through the cell cycle is largely dependent on waves of periodic gene expression, and the regulatory networks for these transcriptome dynamics have emerged as critical points of vulnerability in various aspects of tumor biology. Through RNA-sequencing of human cells during two continuous cell cycles (>2.3 billion paired reads), we identified over 1 000 mRNAs, non-coding RNAs and pseudogenes with periodic expression. Periodic transcripts are enriched in functions related to DNA metabolism, mitosis, and DNA damage response, indicating these genes likely represent putative cell cycle regulators. Using our set of periodic genes, we developed a new approach termed “mitotic trait” that can classify primary tumors and normal tissues by their transcriptome similarity to different cell cycle stages. By analyzing >4 000 tumor samples in The Cancer Genome Atlas (TCGA) and other expression data sets, we found that mitotic trait significantly correlates with genetic alterations, tumor subtype and, notably, patient survival. We further defined a core set of 67 genes with robust periodic expression in multiple cell types. Proteins encoded by these genes function as major hubs of protein-protein interaction and are mostly required for cell cycle progression. The core genes also have unique chromatin features including increased levels of CTCF/RAD21 binding and H3K36me3. Loss of these features in uterine and kidney cancers is associated with altered expression of the core 67 genes. Our study suggests new chromatin-associated mechanisms for periodic gene regulation and offers a predictor of cancer patient outcomes. PMID:27364684
Using parentage analysis to examine gene flow and spatial genetic structure.
Kane, Nolan C; King, Matthew G
2009-04-01
Numerous approaches have been developed to examine recent and historical gene flow between populations, but few studies have used empirical data sets to compare different approaches. Some methods are expected to perform better under particular scenarios, such as high or low gene flow, but this, too, has rarely been tested. In this issue of Molecular Ecology, Saenz-Agudelo et al. (2009) apply assignment tests and parentage analysis to microsatellite data from five geographically proximal (2-6 km) and one much more distant (1500 km) panda clownfish populations, showing that parentage analysis performed better in situations of high gene flow, while their assignment tests did better with low gene flow. This unusually complete data set is comprised of multiple exhaustively sampled populations, including nearly all adults and large numbers of juveniles, enabling the authors to ask questions that in many systems would be impossible to answer. Their results emphasize the importance of selecting the right analysis to use, based on the underlying model and how well its assumptions are met by the populations to be analysed.
Trainable Gene Regulation Networks with Applications to Drosophila Pattern Formation
NASA Technical Reports Server (NTRS)
Mjolsness, Eric
2000-01-01
This chapter will very briefly introduce and review some computational experiments in using trainable gene regulation network models to simulate and understand selected episodes in the development of the fruit fly, Drosophila melanogaster. For details the reader is referred to the papers introduced below. It will then introduce a new gene regulation network model which can describe promoter-level substructure in gene regulation. As described in chapter 2, gene regulation may be thought of as a combination of cis-acting regulation by the extended promoter of a gene (including all regulatory sequences) by way of the transcription complex, and of trans-acting regulation by the transcription factor products of other genes. If we simplify the cis-action by using a phenomenological model which can be tuned to data, such as a unit or other small portion of an artificial neural network, then the full trans-acting interaction between multiple genes during development can be modelled as a larger network which can again be tuned or trained to data. The larger network will in general need to have recurrent (feedback) connections since at least some real gene regulation networks do. This is the basic modeling approach taken, which describes how a set of recurrent neural networks can be used as a modeling language for multiple developmental processes including gene regulation within a single cell, cell-cell communication, and cell division. Such network models have been called "gene circuits", "gene regulation networks", or "genetic regulatory networks", sometimes without distinguishing the models from the actual modeled systems.
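The recurrent-network idea in the chapter summary above can be illustrated with a small simulation. The abstract gives no equations, so the sketch assumes one commonly used gene-circuit form, dv_i/dt = R_i · g(Σ_j T_ij v_j + h_i) − λ_i v_i, with a sigmoid g and simple Euler integration; the two-gene mutual-repression weights and all parameter values are hypothetical and chosen only to show feedback at work.

```python
import math


def sigmoid(x):
    """Saturating response standing in for the cis-regulatory unit."""
    return 1.0 / (1.0 + math.exp(-x))


def euler_step(v, T, h, R, lam, dt):
    """One Euler step of dv_i/dt = R_i * g(sum_j T_ij v_j + h_i) - lam_i * v_i."""
    n = len(v)
    new_v = []
    for i in range(n):
        u = sum(T[i][j] * v[j] for j in range(n)) + h[i]  # trans-acting input
        dv = R[i] * sigmoid(u) - lam[i] * v[i]            # synthesis minus decay
        new_v.append(v[i] + dt * dv)
    return new_v


def simulate(v0, T, h, R, lam, dt=0.01, n_steps=2000):
    """Integrate the circuit from initial concentrations v0."""
    v = list(v0)
    for _ in range(n_steps):
        v = euler_step(v, T, h, R, lam, dt)
    return v


# Hypothetical two-gene circuit: each gene represses the other (recurrent feedback).
T = [[0.0, -5.0],
     [-5.0, 0.0]]
h = [2.0, 2.0]      # basal input
R = [1.0, 1.0]      # maximum synthesis rates
lam = [1.0, 1.0]    # decay rates

final = simulate([0.6, 0.4], T, h, R, lam)
print(final)
```

In a trainable setting the entries of T, h, R and lam would be fitted to expression data rather than set by hand; that fitting step is not shown here.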
Investigation of exomic variants associated with overall survival in ovarian cancer
Ann Chen, Yian; Larson, Melissa C; Fogarty, Zachary C; Earp, Madalene A; Anton-Culver, Hoda; Bandera, Elisa V; Cramer, Daniel; Doherty, Jennifer A; Goodman, Marc T; Gronwald, Jacek; Karlan, Beth Y; Kjaer, Susanne K; Levine, Douglas A; Menon, Usha; Ness, Roberta B; Pearce, Celeste L; Pejovic, Tanja; Rossing, Mary Anne; Wentzensen, Nicolas; Bean, Yukie T; Bisogna, Maria; Brinton, Louise A; Carney, Michael E; Cunningham, Julie M; Cybulski, Cezary; deFazio, Anna; Dicks, Ed M; Edwards, Robert P; Gayther, Simon A; Gentry-Maharaj, Aleksandra; Gore, Martin; Iversen, Edwin S; Jensen, Allan; Johnatty, Sharon E; Lester, Jenny; Lin, Hui-Yi; Lissowska, Jolanta; Lubinski, Jan; Menkiszak, Janusz; Modugno, Francesmary; Moysich, Kirsten B; Orlow, Irene; Pike, Malcolm C; Ramus, Susan J; Song, Honglin; Terry, Kathryn L; Thompson, Pamela J; Tyrer, Jonathan P; van den Berg, David J; Vierkant, Robert A; Vitonis, Allison F; Walsh, Christine; Wilkens, Lynne R; Wu, Anna H; Yang, Hannah; Ziogas, Argyrios; Berchuck, Andrew; Chenevix-Trench, Georgia; Schildkraut, Joellen M; Permuth-Wey, Jennifer; Phelan, Catherine M; Pharoah, Paul D P; Fridley, Brooke L
2016-01-01
Background While numerous susceptibility loci for epithelial ovarian cancer (EOC) have been identified, few associations have been reported with overall survival. In the absence of common prognostic genetic markers, we hypothesize that rare coding variants may be associated with overall EOC survival and assessed their contribution in two exome-based genotyping projects of the Ovarian Cancer Association Consortium (OCAC). Methods The primary patient set (Set 1) included 14 independent EOC studies (4293 patients) and 227,892 variants, and a secondary patient set (Set 2) included six additional EOC studies (1744 patients) and 114,620 variants. Because power to detect rare variants individually is reduced, gene-level tests were conducted. Sets were analyzed separately at individual variants and by gene, and then combined with meta-analyses (73,203 variants and 13,163 genes overlapped). Results No individual variant reached genome-wide statistical significance. A SNP previously implicated to be associated with EOC risk and, to a lesser extent, survival, rs8170, showed the strongest evidence of association with survival and similar effect size estimates across sets (Pmeta=1.1E-6, HRSet1=1.17, HRSet2=1.14). Rare variants in ATG2B, an autophagy gene important for apoptosis, were significantly associated with survival after multiple testing correction (Pmeta=1.1E-6; Pcorrected=0.01). Conclusions Common variant rs8170 and rare variants in ATG2B may be associated with EOC overall survival, although further study is needed. Impact This study represents the first exome-wide association study of EOC survival to include rare variant analyses, and suggests that complementary single variant and gene-level analyses in large studies are needed to identify rare variants that warrant follow-up study. PMID:26747452
Deng, Peng; Wang, Xiaoqiang; Baird, Sonya M; Showmaker, Kurt C; Smith, Leif; Peterson, Daniel G; Lu, Shien
2016-06-01
Burkholderia contaminans MS14 shows significant antimicrobial activities against plant and animal pathogenic fungi and bacteria. The antifungal agent occidiofungin produced by MS14 has great potential for development of biopesticides and pharmaceutical drugs. However, the use of Burkholderia species as biocontrol agent in agriculture is restricted due to the difficulties in distinguishing between plant growth-promoting bacteria and the pathogenic bacteria. The complete MS14 genome was sequenced and analyzed to find what beneficial and virulence-related genes it harbors. The phylogenetic relatedness of B. contaminans MS14 and other 17 Burkholderia species was also analyzed. To research MS14's potential virulence, the gene regions related to the antibiotic production, antibiotic resistance, and virulence were compared between MS14 and other Burkholderia genomes. The genome of B. contaminans MS14 was sequenced and annotated. The genomic analyses reveal the presence of multiple gene sets for antimicrobial biosynthesis, which contribute to its antimicrobial activities. BLAST results indicate that the MS14 genome harbors a large number of unique regions. MS14 is closely related to another plant growth-promoting Burkholderia strain B. lata 383 according to the average nucleotide identity data. Moreover, according to the phylogenetic analysis, plant growth-promoting species isolated from soils and mammalian pathogenic species are clustered together, respectively. MS14 has multiple antimicrobial activity-related genes identified from the genome, but it lacks key virulence-related gene loci found in the pathogenic strains. Additionally, plant growth-promoting Burkholderia species have one or more antimicrobial biosynthesis genes in their genomes as compared with nonplant growth-promoting soil-isolated Burkholderia species. On the other hand, pathogenic species harbor multiple virulence-associated gene loci that are not present in nonpathogenic Burkholderia species. 
The MS14 genome and the genomes of other Burkholderia species show considerable diversity. Multiple antimicrobial agent biosynthesis genes were identified in the genomes of plant growth-promoting species of Burkholderia. In addition, compared with nonpathogenic Burkholderia species, pathogenic Burkholderia species have more characterized homologs of the gene loci known to contribute to pathogenicity and virulence to plants and animals. © 2016 The Authors. MicrobiologyOpen published by John Wiley & Sons Ltd.
Xu, Jin; Spitale, Robert C.; Guan, Linna; Flynn, Ryan A.; Torre, Eduardo A.; Li, Rui; Raber, Inbar; Qu, Kun; Kern, Dale; Knaggs, Helen E.; Chang, Howard Y.; Chang, Anne Lynn S.
2016-01-01
While much is known about genes that promote aging, little is known about genes that protect against or prevent aging, particularly in human skin. The main objective of this study was to perform an unbiased, whole transcriptome search for genes that associate with intrinsic skin youthfulness. To accomplish this, healthy women (n = 122) of European descent, ages 18–89 years with Fitzpatrick skin type I/II were examined for facial skin aging parameters and clinical covariates, including smoking and ultraviolet exposure. Skin youthfulness was defined as the top 10% of individuals whose assessed skin aging features were most discrepant with their chronological ages. Skin biopsies from sun-protected inner arm were subjected to 3′-end sequencing for expression quantification, with results verified by quantitative reverse transcriptase-polymerase chain reaction. Unbiased clustering revealed gene expression signatures characteristic of older women with skin youthfulness (n = 12) compared to older women without skin youthfulness (n = 33), after accounting for gene expression changes associated with chronological age alone. Gene set analysis was performed using Genomica open-access software. This study identified a novel set of candidate skin youthfulness genes demonstrating differences between the skin youthfulness (SY) and non-SY groups, including pleckstrin homology like domain family A member 1 (PHLDA1) (p = 2.4 × 10⁻⁵), a follicle stem cell marker, and hyaluronan synthase 2-anti-sense 1 (HAS2-AS1) (p = 0.00105), a non-coding RNA that is part of the hyaluronan synthesis pathway. We show that immunologic gene sets are the most significantly altered in skin youthfulness (with the most significant gene set p = 2.4 × 10⁻⁵), suggesting the immune system plays an important role in skin youthfulness, a finding that has not previously been recognized.
These results are a valuable resource from which multiple future studies may be undertaken to better understand the mechanisms that promote skin youthfulness in humans. PMID:27829007
DFP: a Bioconductor package for fuzzy profile identification and gene reduction of microarray data
Glez-Peña, Daniel; Álvarez, Rodrigo; Díaz, Fernando; Fdez-Riverola, Florentino
2009-01-01
Background: Expression profiling assays done by using DNA microarray technology generate enormous data sets that are not amenable to simple analysis. The greatest challenge in maximizing the use of this huge amount of data is to develop algorithms to interpret and interconnect results from different genes under different conditions. In this context, fuzzy logic can provide a systematic and unbiased way to both (i) find biologically significant insights relating to meaningful genes, thereby removing the need for expert knowledge in preliminary steps of microarray data analyses and (ii) reduce the cost and complexity of later applied machine learning techniques being able to achieve interpretable models. Results: DFP is a new Bioconductor R package that implements a method for discretizing and selecting differentially expressed genes based on the application of fuzzy logic. DFP takes advantage of fuzzy membership functions to assign linguistic labels to gene expression levels. The technique builds a reduced set of relevant genes (FP, Fuzzy Pattern) able to summarize and represent each underlying class (pathology). A last step constructs a biased set of genes (DFP, Discriminant Fuzzy Pattern) by intersecting existing fuzzy patterns in order to detect discriminative elements. In addition, the software provides new functions and visualisation tools that summarize achieved results and aid in the interpretation of differentially expressed genes from multiple microarray experiments. Conclusion: DFP integrates with other packages of the Bioconductor project, uses common data structures and is accompanied by ample documentation. It has the advantage that its parameters are highly configurable, facilitating the discovery of biologically relevant connections between sets of genes belonging to different pathologies. This information makes it possible to automatically filter irrelevant genes thereby reducing the large volume of data supplied by microarray experiments.
Based on these contributions GENECBR, a successful tool for cancer diagnosis using microarray datasets, has recently been released. PMID:19178723
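The fuzzy discretization idea described above can be illustrated with a minimal Python sketch. It is not the DFP package's API or its exact algorithm: the triangular membership shapes, the breakpoints (lo, mid, hi), the min_fraction threshold and all function names are assumptions made for illustration, and "discriminant" is read here as genes that appear in every class pattern but with differing labels.

```python
from collections import Counter

def memberships(x, lo, mid, hi):
    """Memberships of value x in the labels Low / Medium / High (assumed shapes)."""
    def clamp(v):
        return max(0.0, min(1.0, v))
    low = clamp((mid - x) / (mid - lo))
    high = clamp((x - mid) / (hi - mid))
    return {"Low": low, "Medium": 1.0 - max(low, high), "High": high}

def label(x, lo, mid, hi):
    """Linguistic label with the largest membership for value x."""
    m = memberships(x, lo, mid, hi)
    return max(m, key=m.get)

def fuzzy_pattern(expr, lo, mid, hi, min_fraction=0.8):
    """Genes whose most frequent label covers at least min_fraction of one class's samples."""
    pattern = {}
    for gene, values in expr.items():
        counts = Counter(label(v, lo, mid, hi) for v in values)
        top, n = counts.most_common(1)[0]
        if n / len(values) >= min_fraction:
            pattern[gene] = top
    return pattern

def discriminant_pattern(patterns):
    """Genes present in every class pattern whose labels are not all identical."""
    shared = set.intersection(*(set(p) for p in patterns.values()))
    return {g: {c: p[g] for c, p in patterns.items()}
            for g in sorted(shared)
            if len({p[g] for p in patterns.values()}) > 1}

# Hypothetical expression values for two classes (three samples each).
healthy = {"geneA": [1.0, 1.2, 0.9], "geneB": [5.0, 5.1, 4.9], "geneC": [1.0, 5.0, 9.0]}
tumor = {"geneA": [9.0, 8.8, 9.1], "geneB": [5.0, 4.8, 5.2], "geneC": [9.0, 1.0, 5.0]}
patterns = {"healthy": fuzzy_pattern(healthy, 0, 5, 10),
            "tumor": fuzzy_pattern(tumor, 0, 5, 10)}
discriminative = discriminant_pattern(patterns)
```

In this toy data only geneA is kept: geneB carries the same label in both classes and geneC has no consistent label within a class.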
DFP: a Bioconductor package for fuzzy profile identification and gene reduction of microarray data.
Glez-Peña, Daniel; Alvarez, Rodrigo; Díaz, Fernando; Fdez-Riverola, Florentino
2009-01-29
Expression profiling assays done by using DNA microarray technology generate enormous data sets that are not amenable to simple analysis. The greatest challenge in maximizing the use of this huge amount of data is to develop algorithms to interpret and interconnect results from different genes under different conditions. In this context, fuzzy logic can provide a systematic and unbiased way to both (i) find biologically significant insights relating to meaningful genes, thereby removing the need for expert knowledge in preliminary steps of microarray data analyses and (ii) reduce the cost and complexity of later applied machine learning techniques being able to achieve interpretable models. DFP is a new Bioconductor R package that implements a method for discretizing and selecting differentially expressed genes based on the application of fuzzy logic. DFP takes advantage of fuzzy membership functions to assign linguistic labels to gene expression levels. The technique builds a reduced set of relevant genes (FP, Fuzzy Pattern) able to summarize and represent each underlying class (pathology). A last step constructs a biased set of genes (DFP, Discriminant Fuzzy Pattern) by intersecting existing fuzzy patterns in order to detect discriminative elements. In addition, the software provides new functions and visualisation tools that summarize achieved results and aid in the interpretation of differentially expressed genes from multiple microarray experiments. DFP integrates with other packages of the Bioconductor project, uses common data structures and is accompanied by ample documentation. It has the advantage that its parameters are highly configurable, facilitating the discovery of biologically relevant connections between sets of genes belonging to different pathologies. This information makes it possible to automatically filter irrelevant genes thereby reducing the large volume of data supplied by microarray experiments. 
Based on these contributions GENECBR, a successful tool for cancer diagnosis using microarray datasets, has recently been released.
Badr, Eman; ElHefnawi, Mahmoud; Heath, Lenwood S
2016-01-01
Alternative splicing is a vital process for regulating gene expression and promoting proteomic diversity. It plays a key role in tissue-specific gene expression. This specificity is mainly regulated by splicing factors that bind to specific sequences called splicing regulatory elements (SREs). Here, we report a genome-wide analysis to study alternative splicing in multiple tissues, including brain, heart, liver, and muscle. We propose a pipeline to identify differential exons across tissues and hence tissue-specific SREs. In our pipeline, we utilize the DEXSeq package along with our previously reported algorithms. Utilizing the publicly available RNA-Seq data set from the Human BodyMap project, we identified 28,100 differentially used exons across the four tissues. We identified tissue-specific exonic splicing enhancers that overlap with various previously published experimental and computational databases. A complicated exonic enhancer regulatory network was revealed, where multiple exonic enhancers were found across multiple tissues while some were found only in specific tissues. Putative combinatorial exonic enhancers and silencers were discovered as well, which may be responsible for exon inclusion or exclusion across tissues. Some of the exonic enhancers are found to be co-occurring with multiple exonic silencers and vice versa, which demonstrates a complicated relationship between tissue-specific exonic enhancers and silencers.
Novel method to load multiple genes onto a mammalian artificial chromosome.
Tóth, Anna; Fodor, Katalin; Praznovszky, Tünde; Tubak, Vilmos; Udvardy, Andor; Hadlaczky, Gyula; Katona, Robert L
2014-01-01
Mammalian artificial chromosomes are natural chromosome-based vectors that may carry a vast amount of genetic material in terms of both size and number. They are reasonably stable and segregate well in both mitosis and meiosis. A platform artificial chromosome expression system (ACEs) was earlier described with multiple loading sites for a modified lambda-integrase enzyme. It has been shown that this ACEs is suitable for high-level industrial protein production and the treatment of a mouse model for a devastating human disorder, Krabbe's disease. ACEs-treated mutant mice carrying a therapeutic gene lived more than four times longer than untreated counterparts. This novel gene therapy method is called combined mammalian artificial chromosome-stem cell therapy. At present, this method suffers from the limitation that a new selection marker gene should be present for each therapeutic gene loaded onto the ACEs. Complex diseases require the cooperative action of several genes for treatment, but only a limited number of selection marker genes are available and there is also a risk of serious side-effects caused by the unwanted expression of these marker genes in mammalian cells, organs and organisms. We describe here a novel method to load multiple genes onto the ACEs by using only two selectable marker genes. These markers may be removed from the ACEs before therapeutic application. This novel technology could revolutionize gene therapeutic applications targeting the treatment of complex disorders and cancers. It could also speed up cell therapy by allowing researchers to engineer a chromosome with a predetermined set of genetic factors to differentiate adult stem cells, embryonic stem cells and induced pluripotent stem (iPS) cells into cell types of therapeutic value. It is also a suitable tool for the investigation of complex biochemical pathways in basic science by producing an ACEs with several genes from a signal transduction pathway of interest.
Leung, Yuk Yee; Chang, Chun Qi; Hung, Yeung Sam
2012-01-01
Using a hybrid approach for gene selection and classification is common, as the results obtained are generally better than those from performing the two tasks independently. Yet, for some microarray datasets, both the classification accuracy and the stability of the gene sets obtained still have room for improvement. This may be due to the presence of samples with wrong class labels (i.e. outliers). Outlier detection algorithms proposed so far are either not suitable for microarray data, or only solve the outlier detection problem on their own. We tackle the outlier detection problem based on a previously proposed Multiple-Filter-Multiple-Wrapper (MFMW) model, which was demonstrated to yield promising results when compared to other hybrid approaches (Leung and Hung, 2010). To incorporate outlier detection and overcome limitations of the existing MFMW model, three new features are introduced in our proposed MFMW-outlier approach: 1) an unbiased external Leave-One-Out Cross-Validation framework is developed to replace internal cross-validation in the previous MFMW model; 2) wrongly labeled samples are identified within the MFMW-outlier model; and 3) a stable set of genes is selected using an L1-norm SVM that removes any redundant genes present. Six binary-class microarray datasets were tested. Compared with outlier detection studies on the same datasets, MFMW-outlier could detect all the outliers found in the original paper (for which the data was provided for analysis), and the genes selected after outlier removal were proven to have biological relevance. We also compared MFMW-outlier with PRAPIV (Zhang et al., 2006) on the same synthetic datasets. MFMW-outlier gave better average precision and recall values on three different settings. Lastly, artificially flipped microarray datasets were created by removing our detected outliers and flipping some of the remaining samples' labels.
Almost all the 'wrong' (artificially flipped) samples were detected, suggesting that MFMW-outlier was sufficiently powerful to detect outliers in high-dimensional microarray datasets.
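The idea of using an external leave-one-out loop to flag possibly mislabeled samples can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a nearest-centroid classifier stands in for the multiple filters, wrappers and L1-norm SVM of MFMW-outlier, and the function names and toy data are hypothetical.

```python
def centroid(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def loocv_suspects(X, y):
    """Indices of samples whose held-out prediction disagrees with their label."""
    suspects = []
    for i in range(len(X)):
        # The held-out sample i never influences the centroids it is tested against.
        train = [(x, lab) for j, (x, lab) in enumerate(zip(X, y)) if j != i]
        cents = {lab: centroid([x for x, l in train if l == lab])
                 for lab in {l for _, l in train}}
        pred = min(cents, key=lambda lab: sq_dist(X[i], cents[lab]))
        if pred != y[i]:
            suspects.append(i)
    return suspects

# Hypothetical two-gene data; the last sample sits in class B but is labeled A.
X = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
     [5.0, 5.0], [5.1, 4.8], [4.9, 5.2], [5.0, 5.1]]
y = ["A", "A", "A", "B", "B", "B", "A"]
suspects = loocv_suspects(X, y)
```

Keeping the held-out sample out of training is what makes the loop "external": a mislabeled sample cannot pull its own class centroid toward itself.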
Ghadie, Mohamed A; Japkowicz, Nathalie; Perkins, Theodore J
2015-08-15
Stem cell differentiation is largely guided by master transcriptional regulators, but it also depends on the expression of other types of genes, such as cell cycle genes, signaling genes, metabolic genes, trafficking genes, etc. Traditional approaches to understanding gene expression patterns across multiple conditions, such as principal components analysis or K-means clustering, can group cell types based on gene expression, but they do so without knowledge of the differentiation hierarchy. Hierarchical clustering can organize cell types into a tree, but in general this tree is different from the differentiation hierarchy itself. Given the differentiation hierarchy and gene expression data at each node, we construct a weighted Euclidean distance metric such that the minimum spanning tree with respect to that metric is precisely the given differentiation hierarchy. We provide a set of linear constraints that are provably sufficient for the desired construction and a linear programming approach to identify sparse sets of weights, effectively identifying genes that are most relevant for discriminating different parts of the tree. We apply our method to microarray gene expression data describing 38 cell types in the hematopoiesis hierarchy, constructing a weighted Euclidean metric that uses just 175 genes. However, we find that there are many alternative sets of weights that satisfy the linear constraints. Thus, in the style of random-forest training, we also construct metrics based on random subsets of the genes and compare them to the metric of 175 genes. We then report on the selected genes and their biological functions. Our approach offers a new way to identify genes that may have important roles in stem cell differentiation.
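To make the central object concrete, the following Python sketch checks whether a given set of non-negative gene weights yields a minimum spanning tree equal to a target hierarchy. It does not reproduce the linear programming step that finds the weights; the cell-type names, expression values and weights are hypothetical, and Prim's algorithm is simply one standard way to build the tree.

```python
import math

def weighted_dist(a, b, w):
    """Weighted Euclidean distance with non-negative per-gene weights w."""
    return math.sqrt(sum(wi * (x - y) ** 2 for wi, x, y in zip(w, a, b)))

def mst_edges(expr, w):
    """Undirected minimum spanning tree edges (Prim's algorithm) under weights w."""
    names = sorted(expr)
    in_tree = {names[0]}
    edges = set()
    while len(in_tree) < len(names):
        u, v = min(((u, v) for u in in_tree for v in names if v not in in_tree),
                   key=lambda e: weighted_dist(expr[e[0]], expr[e[1]], w))
        in_tree.add(v)
        edges.add(frozenset((u, v)))
    return edges

# Hypothetical expression of three genes in four cell types; the third gene is noise.
expr = {"stem": [0, 0, 5], "progA": [1, 0, 0], "progB": [0, 1, 9], "matureA": [2, 0, 4]}
hierarchy = {frozenset(("stem", "progA")), frozenset(("stem", "progB")),
             frozenset(("progA", "matureA"))}

sparse_w = [1, 1, 0]    # ignores the noisy gene
uniform_w = [1, 1, 1]   # treats all genes equally
matches_sparse = mst_edges(expr, sparse_w) == hierarchy
matches_uniform = mst_edges(expr, uniform_w) == hierarchy
```

In this toy example the sparse weights recover the hierarchy while uniform weights do not, which is the kind of contrast the weight-selection step is meant to exploit.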
Evidence That Up-Regulation of MicroRNA-29 Contributes to Postnatal Body Growth Deceleration
Kamran, Fariha; Andrade, Anenisia C.; Nella, Aikaterini A.; Clokie, Samuel J.; Rezvani, Geoffrey; Nilsson, Ola; Baron, Jeffrey
2015-01-01
Body growth is rapid in infancy but subsequently slows and eventually ceases due to a progressive decline in cell proliferation that occurs simultaneously in multiple organs. We previously showed that this decline in proliferation is driven in part by postnatal down-regulation of a large set of growth-promoting genes in multiple organs. We hypothesized that this growth-limiting genetic program is orchestrated by microRNAs (miRNAs). Bioinformatic analysis identified target sequences of the miR-29 family of miRNAs to be overrepresented in age–down-regulated genes. Concomitantly, expression microarray analysis in mouse kidney and lung showed that all members of the miR-29 family, miR-29a, -b, and -c, were strongly up-regulated from 1 to 6 weeks of age. Real-time PCR confirmed that miR-29a, -b, and -c were up-regulated with age in liver, kidney, lung, and heart, and their expression levels were higher in hepatocytes isolated from 5-week-old mice than in hepatocytes from embryonic mouse liver at embryonic day 16.5. We next focused on 3 predicted miR-29 target genes (Igf1, Imp1, and Mest), all of which are growth-promoting. A 3′-untranslated region containing the predicted target sequences from each gene was placed individually in a luciferase reporter construct. Transfection of miR-29 mimics suppressed luciferase gene activity for all 3 genes, and this suppression was diminished by mutating the target sequences, suggesting that these genes are indeed regulated by miR-29. Taken together, the findings suggest that up-regulation of miR-29 during juvenile life drives the down-regulation of multiple growth-promoting genes, thus contributing to physiological slowing and eventual cessation of body growth. PMID:25866874
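The abstract does not state which statistic was used to call miR-29 target sequences "overrepresented"; one common choice for this kind of question is a one-sided hypergeometric test, sketched below in Python. The function name and the gene counts in the example are hypothetical and are not taken from the study.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k): chance of seeing at least k site-bearing genes in a set of n,
    when K of the N background genes carry the site."""
    upper = min(K, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)

# Hypothetical counts: 10 background genes, 4 with a predicted site,
# 3 age-down-regulated genes, 2 of which carry the site.
p_enrich = hypergeom_sf(2, 10, 4, 3)
```

A small p_enrich would indicate more site-bearing genes in the down-regulated set than expected by chance; with real data a multiple-testing correction across miRNA families would normally follow.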
Evidence That Up-Regulation of MicroRNA-29 Contributes to Postnatal Body Growth Deceleration.
Kamran, Fariha; Andrade, Anenisia C; Nella, Aikaterini A; Clokie, Samuel J; Rezvani, Geoffrey; Nilsson, Ola; Baron, Jeffrey; Lui, Julian C
2015-06-01
Body growth is rapid in infancy but subsequently slows and eventually ceases due to a progressive decline in cell proliferation that occurs simultaneously in multiple organs. We previously showed that this decline in proliferation is driven in part by postnatal down-regulation of a large set of growth-promoting genes in multiple organs. We hypothesized that this growth-limiting genetic program is orchestrated by microRNAs (miRNAs). Bioinformatic analysis identified target sequences of the miR-29 family of miRNAs to be overrepresented in age-down-regulated genes. Concomitantly, expression microarray analysis in mouse kidney and lung showed that all members of the miR-29 family, miR-29a, -b, and -c, were strongly up-regulated from 1 to 6 weeks of age. Real-time PCR confirmed that miR-29a, -b, and -c were up-regulated with age in liver, kidney, lung, and heart, and their expression levels were higher in hepatocytes isolated from 5-week-old mice than in hepatocytes from embryonic mouse liver at embryonic day 16.5. We next focused on 3 predicted miR-29 target genes (Igf1, Imp1, and Mest), all of which are growth-promoting. A 3'-untranslated region containing the predicted target sequences from each gene was placed individually in a luciferase reporter construct. Transfection of miR-29 mimics suppressed luciferase gene activity for all 3 genes, and this suppression was diminished by mutating the target sequences, suggesting that these genes are indeed regulated by miR-29. Taken together, the findings suggest that up-regulation of miR-29 during juvenile life drives the down-regulation of multiple growth-promoting genes, thus contributing to physiological slowing and eventual cessation of body growth.
Diverse Antibiotic Resistance Genes in Dairy Cow Manure
Wichmann, Fabienne; Udikovic-Kolic, Nikolina; Andrew, Sheila; Handelsman, Jo
2014-01-01
Application of manure from antibiotic-treated animals to crops facilitates the dissemination of antibiotic resistance determinants into the environment. However, our knowledge of the identity, diversity, and patterns of distribution of these antibiotic resistance determinants remains limited. We used a new combination of methods to examine the resistome of dairy cow manure, a common soil amendment. Metagenomic libraries constructed with DNA extracted from manure were screened for resistance to beta-lactams, phenicols, aminoglycosides, and tetracyclines. Functional screening of fosmid and small-insert libraries identified 80 different antibiotic resistance genes whose deduced protein sequences were on average 50 to 60% identical to sequences deposited in GenBank. The resistance genes were frequently found in clusters and originated from a taxonomically diverse set of species, suggesting that some microorganisms in manure harbor multiple resistance genes. Furthermore, amid the great genetic diversity in manure, we discovered a novel clade of chloramphenicol acetyltransferases. Our study combined functional metagenomics with third-generation PacBio sequencing to significantly extend the roster of functional antibiotic resistance genes found in animal gut bacteria, providing a particularly broad resource for understanding the origins and dispersal of antibiotic resistance genes in agriculture and clinical settings. PMID:24757214
Spliced synthetic genes as internal controls in RNA sequencing experiments.
Hardwick, Simon A; Chen, Wendy Y; Wong, Ted; Deveson, Ira W; Blackburn, James; Andersen, Stacey B; Nielsen, Lars K; Mattick, John S; Mercer, Tim R
2016-09-01
RNA sequencing (RNA-seq) can be used to assemble spliced isoforms, quantify expressed genes and provide a global profile of the transcriptome. However, the size and diversity of the transcriptome, the wide dynamic range in gene expression and inherent technical biases confound RNA-seq analysis. We have developed a set of spike-in RNA standards, termed 'sequins' (sequencing spike-ins), that represent full-length spliced mRNA isoforms. Sequins have an entirely artificial sequence with no homology to natural reference genomes, but they align to gene loci encoded on an artificial in silico chromosome. The combination of multiple sequins across a range of concentrations emulates alternative splicing and differential gene expression, and it provides scaling factors for normalization between samples. We demonstrate the use of sequins in RNA-seq experiments to measure sample-specific biases and determine the limits of reliable transcript assembly and quantification in accompanying human RNA samples. In addition, we have designed a complementary set of sequins that represent fusion genes arising from rearrangements of the in silico chromosome to aid in cancer diagnosis. RNA sequins provide a qualitative and quantitative reference with which to navigate the complexity of the human transcriptome.
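How spike-ins at known concentrations can yield a between-sample scaling factor is sketched below in Python. The abstract does not give the normalization formula, so this is one plausible reading: fit a log2-log2 line of measured abundance against known concentration in each sample and compare intercepts, assuming the slopes are similar. All names and numbers are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def calibrate(known_conc, measured):
    """Fit measured spike-in abundance against known concentration on a log2 scale."""
    return fit_line([math.log2(c) for c in known_conc],
                    [math.log2(m) for m in measured])

def scaling_factor(intercept_a, intercept_b):
    """Multiplier that brings sample B's spike-in abundances onto sample A's scale."""
    return 2 ** (intercept_a - intercept_b)

# Hypothetical spike-in ladder and measured abundances in two samples.
known = [1, 2, 4, 8, 16]
slope_a, icpt_a = calibrate(known, [10, 20, 40, 80, 160])
slope_b, icpt_b = calibrate(known, [5, 10, 20, 40, 80])
factor_b_to_a = scaling_factor(icpt_a, icpt_b)
```

A slope well below 1, or points that fall off the line at low concentration, would suggest where quantification stops being reliable in that sample.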
dbWFA: a web-based database for functional annotation of Triticum aestivum transcripts
Vincent, Jonathan; Dai, Zhanwu; Ravel, Catherine; Choulet, Frédéric; Mouzeyar, Said; Bouzidi, M. Fouad; Agier, Marie; Martre, Pierre
2013-01-01
The functional annotation of genes based on sequence homology with genes from model species genomes is time-consuming because it is necessary to mine several unrelated databases. The aim of the present work was to develop a functional annotation database for common wheat Triticum aestivum (L.). The database, named dbWFA, is based on the reference NCBI UniGene set, an expressed gene catalogue built by expressed sequence tag clustering, and on full-length coding sequences retrieved from the TriFLDB database. Information from good-quality heterogeneous sources, including annotations for model plant species Arabidopsis thaliana (L.) Heynh. and Oryza sativa L., was gathered and linked to T. aestivum sequences through BLAST-based homology searches. Even though the complexity of the transcriptome cannot yet be fully appreciated, we developed a tool to easily and promptly obtain information from multiple functional annotation systems (Gene Ontology, MapMan bin codes, MIPS Functional Categories, PlantCyc pathway reactions and TAIR gene families). The use of dbWFA is illustrated here with several query examples. We were able to assign a putative function to 45% of the UniGenes and 81% of the full-length coding sequences from TriFLDB. Moreover, comparison of the annotation of the whole T. aestivum UniGene set along with curated annotations of the two model species assessed the accuracy of the annotation provided by dbWFA. To further illustrate the use of dbWFA, genes specifically expressed during the early cell division or late storage polymer accumulation phases of T. aestivum grain development were identified using a clustering analysis and then annotated using dbWFA. The annotation of these two sets of genes was consistent with previous analyses of T. aestivum grain transcriptomes and proteomes. Database URL: urgi.versailles.inra.fr/dbWFA/ PMID:23660284
Liu, Ting-Wu; Niu, Li; Fu, Bin; Chen, Juan; Wu, Fei-Hua; Chen, Juan; Wang, Wen-Hua; Hu, Wen-Jun; He, Jun-Xian; Zheng, Hai-Lei
2013-01-01
Acid rain, as a worldwide environmental issue, can cause serious damage to plants. In this study, we provide the first case study on the systematic responses of arabidopsis (Arabidopsis thaliana (L.) Heynh.) to simulated acid rain (SiAR) using a transcriptome approach. Transcriptomic analysis revealed that the expression of a set of genes related to primary metabolism, including nitrogen, sulfur, amino acid, photosynthesis, and reactive oxygen species metabolism, was altered under SiAR. In addition, transport and signal transduction related pathways, especially calcium-related signaling pathways, were found to play important roles in the response of arabidopsis to SiAR stress. Further, we compared our data set with previously published data sets on the arabidopsis transcriptome subjected to various stresses, including wound, salt, light, heavy metal, karrikin, temperature, osmosis, etc. The results showed that many genes overlapped across several stresses, suggesting that plant response to SiAR is a complex process, which may require the participation of multiple defense-signaling pathways. The results of this study will help us gain further insights into the response mechanisms of plants to acid rain stress.
Allen Brain Atlas: an integrated spatio-temporal portal for exploring the central nervous system
Sunkin, Susan M.; Ng, Lydia; Lau, Chris; Dolbeare, Tim; Gilbert, Terri L.; Thompson, Carol L.; Hawrylycz, Michael; Dang, Chinh
2013-01-01
The Allen Brain Atlas (http://www.brain-map.org) provides a unique online public resource integrating extensive gene expression data, connectivity data and neuroanatomical information with powerful search and viewing tools for the adult and developing brain in mouse, human and non-human primate. Here, we review the resources available at the Allen Brain Atlas, describing each product and data type [such as in situ hybridization (ISH) and supporting histology, microarray, RNA sequencing, reference atlases, projection mapping and magnetic resonance imaging]. In addition, standardized and unique features in the web applications are described that enable users to search and mine the various data sets. Features include both simple and sophisticated methods for gene searches, colorimetric and fluorescent ISH image viewers, graphical displays of ISH, microarray and RNA sequencing data, Brain Explorer software for 3D navigation of anatomy and gene expression, and an interactive reference atlas viewer. In addition, cross data set searches enable users to query multiple Allen Brain Atlas data sets simultaneously. All of the Allen Brain Atlas resources can be accessed through the Allen Brain Atlas data portal. PMID:23193282
Wide distribution of O157-antigen biosynthesis gene clusters in Escherichia coli.
Iguchi, Atsushi; Shirai, Hiroki; Seto, Kazuko; Ooka, Tadasuke; Ogura, Yoshitoshi; Hayashi, Tetsuya; Osawa, Kayo; Osawa, Ro
2011-01-01
Most Escherichia coli O157-serogroup strains are classified as enterohemorrhagic E. coli (EHEC), which is known as an important food-borne pathogen for humans. They usually produce Shiga toxin (Stx) 1 and/or Stx2, and express H7-flagella antigen (or are nonmotile). However, O157 strains that do not produce Stxs and express H antigens different from H7 are sometimes isolated from clinical and other sources. Multilocus sequence analysis revealed that the 21 O157:non-H7 strains tested in this study belong to multiple evolutionary lineages different from that of EHEC O157:H7 strains, suggesting a wide distribution of the gene set encoding the O157-antigen biosynthesis in multiple lineages. To gain insight into the gene organization and the sequence similarity of the O157-antigen biosynthesis gene clusters, we conducted genomic comparisons of the chromosomal regions (about 59 kb in each strain) covering the O-antigen gene cluster and its flanking regions between six O157:H7/non-H7 strains. Gene organization of the O157-antigen gene cluster was identical among O157:H7/non-H7 strains, but was divided into two distinct types at the nucleotide sequence level. Interestingly, distribution of the two types did not clearly follow the evolutionary lineages of the strains, suggesting that horizontal gene transfer of both types of O157-antigen gene clusters has occurred independently among E. coli strains. Additionally, detailed sequence comparison revealed that some positions of the repetitive extragenic palindromic (REP) sequences in the regions flanking the O-antigen gene clusters were coincident with possible recombination points. From these results, we conclude that the horizontal transfer of the O157-antigen gene clusters induced the emergence of multiple O157 lineages within E. coli and speculate that REP sequences may be one of the driving forces for exchange and evolution of O-antigen loci.
Winnier, Deidre A.; Fourcaudot, Marcel; Norton, Luke; Abdul-Ghani, Muhammad A.; Hu, Shirley L.; Farook, Vidya S.; Coletta, Dawn K.; Kumar, Satish; Puppala, Sobha; Chittoor, Geetha; Dyer, Thomas D.; Arya, Rector; Carless, Melanie; Lehman, Donna M.; Curran, Joanne E.; Cromack, Douglas T.; Tripathy, Devjit; Blangero, John; Duggirala, Ravindranath; Göring, Harald H. H.; DeFronzo, Ralph A.; Jenkinson, Christopher P.
2015-01-01
Type 2 diabetes (T2D) is a complex metabolic disease that is more prevalent in ethnic groups such as Mexican Americans, and is strongly associated with the risk factors obesity and insulin resistance. The goal of this study was to perform whole genome gene expression profiling in adipose tissue to detect common patterns of gene regulation associated with obesity and insulin resistance. We used phenotypic and genotypic data from 308 Mexican American participants from the Veterans Administration Genetic Epidemiology Study (VAGES). Basal fasting RNA was extracted from adipose tissue biopsies from a subset of 75 unrelated individuals, and gene expression data generated on the Illumina BeadArray platform. The number of gene probes with significant expression above baseline was approximately 31,000. We performed multiple regression analysis of all probes with 15 metabolic traits. Adipose tissue had 3,012 genes significantly associated with the traits of interest (false discovery rate, FDR ≤ 0.05). The significance of gene expression changes was used to select 52 genes with significant (FDR ≤ 10^-4) gene expression changes across multiple traits. Gene sets/Pathways analysis identified one gene, alcohol dehydrogenase 1B (ADH1B) that was significantly enriched (P < 10^-60) as a prime candidate for involvement in multiple relevant metabolic pathways. Illumina BeadChip derived ADH1B expression data was consistent with quantitative real time PCR data. We observed significant inverse correlations with waist circumference (2.8 × 10^-9), BMI (5.4 × 10^-6), and fasting plasma insulin (P < 0.001). These findings are consistent with a central role for ADH1B in obesity and insulin resistance and provide evidence for a novel genetic regulatory mechanism for human metabolic diseases related to these traits. PMID:25830378
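The abstract reports FDR thresholds but does not say which FDR procedure was applied. As an illustration only, the Python sketch below implements the widely used Benjamini-Hochberg adjustment; the p-values in the example are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is monotone in rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical per-probe p-values from a regression against one trait.
pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
qvals = benjamini_hochberg(pvals)
significant = [i for i, q in enumerate(qvals) if q <= 0.05]
```

Note that the third and fourth probes have raw p-values below 0.05 but are not retained at FDR ≤ 0.05, which is the practical difference between a raw and an FDR-adjusted threshold.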
Detecting discordance enrichment among a series of two-sample genome-wide expression data sets.
Lai, Yinglei; Zhang, Fanni; Nayak, Tapan K; Modarres, Reza; Lee, Norman H; McCaffrey, Timothy A
2017-01-25
With the current microarray and RNA-seq technologies, two-sample genome-wide expression data have been widely collected in biological and medical studies. The related differential expression analysis and gene set enrichment analysis have been frequently conducted. Integrative analysis can be conducted when multiple data sets are available. In practice, discordant molecular behaviors among a series of data sets can be of biological and clinical interest. In this study, a statistical method is proposed for detecting discordance gene set enrichment. Our method is based on a two-level multivariate normal mixture model. It is statistically efficient with linearly increased parameter space when the number of data sets is increased. The model-based probability of discordance enrichment can be calculated for gene set detection. We apply our method to a microarray expression data set collected from forty-five matched tumor/non-tumor pairs of tissues for studying pancreatic cancer. We divided the data set into a series of non-overlapping subsets according to the tumor/non-tumor paired expression ratio of gene PNLIP (pancreatic lipase, recently shown to be associated with pancreatic cancer). The log-ratio ranges from a negative value (e.g. more expressed in non-tumor tissue) to a positive value (e.g. more expressed in tumor tissue). Our purpose is to understand whether any gene sets are enriched in discordant behaviors among these subsets (when the log-ratio is increased from negative to positive). We focus on KEGG pathways. The detected pathways will be useful for our further understanding of the role of gene PNLIP in pancreatic cancer research. Among the top list of detected pathways, the neuroactive ligand receptor interaction and olfactory transduction pathways are the two most significant. Then, we consider gene TP53, which is well known for its role as a tumor suppressor in cancer research. The log-ratio also ranges from a negative value (e.g. more expressed in non-tumor tissue) to a positive value (e.g. more expressed in tumor tissue). We divided the microarray data set again according to the expression ratio of gene TP53. After the discordance enrichment analysis, we observed overall similar results and the above two pathways are still the most significant detections. More interestingly, only these two pathways have been identified for their association with pancreatic cancer in a pathway analysis of genome-wide association study (GWAS) data. This study illustrates that some disease-related pathways can be enriched in discordant molecular behaviors when an important disease-related gene changes its expression. Our proposed statistical method is useful in the detection of these pathways. Furthermore, our method can also be applied to genome-wide expression data collected by the recent RNA-seq technology.
Genetic variations in the serotonergic system contribute to amygdala volume in humans
Li, Jin; Chen, Chunhui; Wu, Karen; Zhang, Mingxia; Zhu, Bi; Chen, Chuansheng; Moyzis, Robert K.; Dong, Qi
2015-01-01
The amygdala plays a critical role in emotion processing and psychiatric disorders associated with emotion dysfunction. Accumulating evidence suggests that amygdala structure is modulated by serotonin-related genes. However, there is a gap between the small contributions of single loci (less than 1%) and the reported 63–65% heritability of amygdala structure. To understand the “missing heritability,” we systematically explored the contribution of serotonin genes on amygdala structure at the gene set level. The present study of 417 healthy Chinese volunteers examined 129 representative polymorphisms in genes from multiple biological mechanisms in the regulation of serotonin neurotransmission. A system-level approach using multiple regression analyses identified that nine SNPs collectively accounted for approximately 8% of the variance in amygdala volume. Permutation analyses showed that the probability of obtaining these findings by chance was low (p = 0.043, permuted for 1000 times). Findings showed that serotonin genes contribute moderately to individual differences in amygdala volume in a healthy Chinese sample. These results indicate that the system-level approach can help us to understand the genetic basis of a complex trait such as amygdala structure. PMID:26500508
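The permutation step mentioned above can be illustrated with a minimal sketch. It is not the authors' pipeline: it uses a single predictor and the squared Pearson correlation as the test statistic (the study used multiple regression over nine SNPs), and the function names, the seed, and the (hits + 1) / (n_perm + 1) convention are assumptions made for illustration.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant variable explains no variance
    return (sxy * sxy) / (sxx * syy)


def permutation_p_value(x, y, n_perm=1000, seed=0):
    """Share of label shufflings whose R^2 is at least the observed R^2.

    Hypothetical helper: shuffling y breaks any real link to x, so the
    shuffled statistics approximate the null distribution.
    """
    rng = random.Random(seed)
    observed = r_squared(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if r_squared(x, shuffled) >= observed:
            hits += 1
    # Add-one convention so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

With 1000 permutations, as in the abstract, the smallest p-value this estimator can return is 1/1001, which bounds the resolution of the reported probability.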
MultiSite Gateway-Compatible Cell Type-Specific Gene-Inducible System for Plants
Siligato, Riccardo; Wang, Xin; Yadav, Shri Ram; Lehesranta, Satu; Ma, Guojie; Ursache, Robertas; Sevilem, Iris; Zhang, Jing; Gorte, Maartje; Prasad, Kalika; Heidstra, Renze
2016-01-01
A powerful method to study gene function is expression or overexpression in an inducible, cell type-specific system followed by observation of consequent phenotypic changes and visualization of linked reporters in the target tissue. Multiple inducible gene overexpression systems have been developed for plants, but very few of these combine plant selection markers, control of expression domains, access to multiple promoters and protein fusion reporters, chemical induction, and high-throughput cloning capabilities. Here, we introduce a MultiSite Gateway-compatible inducible system for Arabidopsis (Arabidopsis thaliana) plants that provides the capability to generate such constructs in a single cloning step. The system is based on the tightly controlled, estrogen-inducible XVE system. We demonstrate that the transformants generated with this system exhibit the expected cell type-specific expression, similar to what is observed with constitutively expressed native promoters. With this new system, cloning of inducible constructs is no longer limited to a few special cases but can be used as a standard approach when gene function is studied. In addition, we present a set of entry clones consisting of histochemical and fluorescent reporter variants designed for gene and promoter expression studies. PMID:26644504
CoPub: a literature-based keyword enrichment tool for microarray data analysis.
Frijters, Raoul; Heupers, Bart; van Beek, Pieter; Bouwhuis, Maurice; van Schaik, René; de Vlieg, Jacob; Polman, Jan; Alkema, Wynand
2008-07-01
Medline is a rich information source, from which links between genes and keywords describing biological processes, pathways, drugs, pathologies and diseases can be extracted. We developed a publicly available tool called CoPub that uses the information in the Medline database for the biological interpretation of microarray data. CoPub allows batch input of multiple human, mouse or rat genes and produces lists of keywords from several biomedical thesauri that are significantly correlated with the set of input genes. These lists link to Medline abstracts in which the co-occurring input genes and correlated keywords are highlighted. Furthermore, CoPub can graphically visualize differentially expressed genes and over-represented keywords in a network, providing detailed insight in the relationships between genes and keywords, and revealing the most influential genes as highly connected hubs. CoPub is freely accessible at http://services.nbic.nl/cgi-bin/copub/CoPub.pl.
Discovery of error-tolerant biclusters from noisy gene expression data.
Gupta, Rohit; Rao, Navneet; Kumar, Vipin
2011-11-24
An important analysis performed on microarray gene-expression data is to discover biclusters, which denote groups of genes that are coherently expressed for a subset of conditions. Various biclustering algorithms have been proposed to find different types of biclusters from these real-valued gene-expression data sets. However, these algorithms suffer from several limitations, such as the inability to explicitly handle errors/noise in the data; difficulty in discovering small biclusters due to their top-down approach; and the inability of some of the approaches to find overlapping biclusters, which is crucial as many genes participate in multiple biological processes. Association pattern mining also produces biclusters as its result and can naturally address some of these limitations. However, traditional association mining only finds exact biclusters, which limits its applicability in real-life data sets where the biclusters may be fragmented due to random noise/errors. Moreover, as it only works with binary or boolean attributes, its application to gene-expression data requires transforming real-valued attributes to binary attributes, which often results in loss of information. Many past approaches have tried to address the issues of noise and of handling real-valued attributes independently, but there is no systematic approach that addresses both of these issues together. In this paper, we first propose a novel error-tolerant biclustering model, 'ET-bicluster', and then propose a bottom-up heuristic-based mining algorithm to sequentially discover error-tolerant biclusters directly from real-valued gene-expression data. The efficacy of our proposed approach is illustrated by comparing it with a recent approach, RAP, in the context of two biological problems: discovery of functional modules and discovery of biomarkers. For the first problem, two real-valued S. cerevisiae microarray gene-expression data sets are used to demonstrate that the biclusters obtained from the ET-bicluster approach not only recover a larger set of genes than those obtained from the RAP approach but also have higher functional coherence as evaluated using GO-based functional enrichment analysis. The statistical significance of the discovered error-tolerant biclusters, as estimated by two randomization tests, reveals that they are indeed biologically meaningful and statistically significant. For the second problem of biomarker discovery, we used four real-valued breast cancer microarray gene-expression data sets and evaluated the biomarkers obtained using MSigDB gene sets. The results obtained for both problems, functional module discovery and biomarker discovery, clearly signify the usefulness of the proposed ET-bicluster approach and illustrate the importance of explicitly incorporating noise/errors in discovering coherent groups of genes from gene-expression data.
Zolfaghari Emameh, Reza; Barker, Harlan R; Hytönen, Vesa P; Parkkila, Seppo
2018-05-25
Genomic islands (GIs) are a type of mobile genetic element (MGE) that are present in bacterial chromosomes. They consist of a cluster of genes which produce proteins that contribute to a variety of functions, including, but not limited to, regulation of cell metabolism, anti-microbial resistance, pathogenicity, virulence, and resistance to heavy metals. The genes carried in MGEs can be used as a trait reservoir in times of adversity. Transfer of genes using MGEs, occurring outside of reproduction, is called horizontal gene transfer (HGT). Previous literature has shown that numerous HGT events have occurred through endosymbiosis between prokaryotes and eukaryotes. Beta carbonic anhydrase (β-CA) enzymes play a critical role in the biochemical pathways of many prokaryotes and eukaryotes. We have previously suggested horizontal transfer of β-CA genes from plasmids of some prokaryotic endosymbionts to their protozoan hosts. In this study, we set out to identify β-CA genes that might have transferred between prokaryotic and protist species through HGT in GIs. Therefore, we investigated prokaryotic chromosomes containing β-CA-encoding GIs and utilized multiple bioinformatics tools to reveal the distinct movements of β-CA genes among a wide variety of organisms. Our results identify the presence of β-CA genes in GIs of several medically and industrially relevant bacterial species, and phylogenetic analyses reveal multiple cases of likely horizontal transfer of β-CA genes from GIs of ancestral prokaryotes to protists. IMPORTANCE The evolutionary process is mediated by mobile genetic elements (MGEs), such as genomic islands (GIs). A gene or set of genes in the GIs is exchanged between and within various species through horizontal gene transfer (HGT). Based on the crucial role that GIs can play in bacterial survival and proliferation, they were introduced as environmental- and pathogen-associated factors.
Carbonic anhydrases (CAs) are involved in many critical biochemical pathways, such as regulation of pH homeostasis and electrolyte transfer. Among the six evolutionary families of CAs, β-CA gene sequences are present in many bacterial species, which can be horizontally transferred to protists during evolution. This study shows for the first time the involvement of bacterial β-CA gene sequences in the GIs, and suggests their horizontal transfer to protists during evolution. Copyright © 2018 American Society for Microbiology.
Palumbo, Maria Concetta; Zenoni, Sara; Fasoli, Marianna; Massonnet, Mélanie; Farina, Lorenzo; Castiglione, Filippo; Pezzotti, Mario; Paci, Paola
2014-12-01
We developed an approach that integrates different network-based methods to analyze the correlation network arising from large-scale gene expression data. By studying grapevine (Vitis vinifera) and tomato (Solanum lycopersicum) gene expression atlases and a grapevine berry transcriptomic data set during the transition from immature to mature growth, we identified a category named "fight-club hubs" characterized by a marked negative correlation with the expression profiles of neighboring genes in the network. A special subset named "switch genes" was identified, with the additional property of many significant negative correlations outside their own group in the network. Switch genes are involved in multiple processes and include transcription factors that may be considered master regulators of the previously reported transcriptome remodeling that marks the developmental shift from immature to mature growth. All switch genes, expressed at low levels in vegetative/green tissues, showed a significant increase in mature/woody organs, suggesting a potential regulatory role during the developmental transition. Finally, our analysis of tomato gene expression data sets showed that wild-type switch genes are downregulated in ripening-deficient mutants. The identification of known master regulators of tomato fruit maturation suggests our method is suitable for the detection of key regulators of organ development in different fleshy fruit crops. © 2014 American Society of Plant Biologists. All rights reserved.
Oh, Dong-Ha; Hong, Hyewon; Lee, Sang Yeol; Yun, Dae-Jin; Bohnert, Hans J.; Dassanayake, Maheshi
2014-01-01
Schrenkiella parvula (formerly Thellungiella parvula), a close relative of Arabidopsis (Arabidopsis thaliana) and Brassica crop species, thrives on the shores of Lake Tuz, Turkey, where soils accumulate high concentrations of multiple-ion salts. Despite the stark differences in adaptations to extreme salt stresses, the genomes of S. parvula and Arabidopsis show extensive synteny. S. parvula completes its life cycle in the presence of Na+, K+, Mg2+, Li+, and borate at soil concentrations lethal to Arabidopsis. Genome structural variations, including tandem duplications and translocations of genes, interrupt the colinearity observed throughout the S. parvula and Arabidopsis genomes. Structural variations distinguish homologous gene pairs characterized by divergent promoter sequences and basal-level expression strengths. Comparative RNA sequencing reveals the enrichment of ion-transport functions among genes with higher expression in S. parvula, while pathogen defense-related genes show higher expression in Arabidopsis. Key stress-related ion transporter genes in S. parvula showed increased copy number, higher transcript dosage, and evidence for subfunctionalization. This extremophyte offers a framework to identify the requisite adjustments of genomic architecture and expression control for a set of genes found in most plants in a way to support distinct niche adaptation and lifestyles. PMID:24563282
Hua, Hong-Li; Zhang, Fa-Zhan; Labena, Abraham Alemayehu; Dong, Chuan; Jin, Yan-Ting; Guo, Feng-Biao
Investigating essential genes is important for understanding the minimal gene set of a cell and for discovering potential drug targets. In this study, a novel approach based on multiple homology mapping and machine learning was introduced to predict essential genes. We focused on 25 bacteria with characterized essential genes. The predictions yielded a highest area under the receiver operating characteristic (ROC) curve (AUC) of 0.9716 in a tenfold cross-validation test. Suitable features were used to construct models for making predictions in distantly related bacteria. The accuracy of the predictions was evaluated by their consistency with the known essential genes of the target species. A highest AUC of 0.9552 and an average AUC of 0.8314 were achieved when making predictions across organisms. A recently released independent dataset from Synechococcus elongatus was obtained for further assessment of the performance of our model. The AUC of the predictions is 0.7855, which is higher than that of other methods. This research shows that features obtained by homology mapping alone can achieve results as good as, or even better than, those obtained with integrated features. The work also indicates that a machine learning-based method can assign more effective weight coefficients than an empirical formula based on biological knowledge.
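The AUC values reported above can be read as the probability that a randomly chosen essential gene receives a higher score than a randomly chosen non-essential gene. The sketch below computes AUC in exactly that pairwise way; it is an illustration only, the function name and the 0/1 label encoding are assumptions, and it says nothing about the features or classifier the authors used.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0/1 (1 = essential gene, an assumed encoding)
    scores: predicted scores, higher meaning "more likely essential"
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical scores for four genes, two of them essential.
print(roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]))  # 0.75
```

The quadratic pairwise loop is chosen for clarity; for genome-scale gene lists a rank-based formulation gives the same value in O(n log n).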
Multiple Levels of Redundant Processes Inhibit Caenorhabditis elegans Vulval Cell Fates
Andersen, Erik C.; Saffer, Adam M.; Horvitz, H. Robert
2008-01-01
Many mutations cause obvious abnormalities only when combined with other mutations. Such synthetic interactions can be the result of redundant gene functions. In Caenorhabditis elegans, the synthetic multivulva (synMuv) genes have been grouped into multiple classes that redundantly inhibit vulval cell fates. Animals with one or more mutations of the same class undergo wild-type vulval development, whereas animals with mutations of any two classes have a multivulva phenotype. By varying temperature and genetic background, we determined that mutations in most synMuv genes within a single synMuv class enhance each other. However, in a few cases no enhancement was observed. For example, mutations that affect an Mi2 homolog and a histone methyltransferase are of the same class and do not show enhancement. We suggest that such sets of genes function together in vivo and in at least some cases encode proteins that interact physically. The approach of genetic enhancement can be applied more broadly to identify potential protein complexes as well as redundant processes or pathways. Many synMuv genes are evolutionarily conserved, and the genetic relationships we have identified might define the functions not only of synMuv genes in C. elegans but also of their homologs in other organisms. PMID:18689876
Ho, Daniel W. H.; Yap, Maurice K. H.; Ng, Po Wah; Fung, Wai Yan; Yip, Shea Ping
2012-01-01
Background: Myopia is the most common ocular disorder worldwide and imposes a tremendous burden on society. It is a complex disease. The MYP6 locus at 22q12 is of particular interest because many studies have detected linkage signals at this interval. The MYP6 locus is likely to contain susceptibility gene(s) for myopia, but none has yet been identified. Methodology/Principal Findings: Two independent subject groups of southern Chinese in Hong Kong participated in the study: an initial study using a discovery sample set of 342 cases and 342 controls, and a follow-up study using a replication sample set of 316 cases and 313 controls. Cases with high myopia were defined by spherical equivalent ≤ -8 dioptres and emmetropic controls by spherical equivalent within ±1.00 dioptre for both eyes. Manual candidate gene selection from the MYP6 locus was supported by objective in silico prioritization. DNA samples of the discovery sample set were genotyped for 178 tagging single nucleotide polymorphisms (SNPs) from 26 genes. For replication, 25 SNPs (tagging or located at predicted transcription factor or microRNA binding sites) from 4 genes were subsequently examined using the replication sample set. A Fisher P value was calculated for all SNPs, and overall association results were summarized by meta-analysis. Based on the initial and replication studies, rs2009066, located in the crystallin beta A4 (CRYBA4) gene, was identified as the SNP most significantly associated with high myopia (initial study: P = 0.02; replication study: P = 1.88e-4; meta-analysis: P = 1.54e-5) among all the SNPs tested. The association result survived correction for multiple comparisons. Under the allelic genetic model for the combined sample set, the odds ratio of the minor allele G was 1.41 (95% confidence interval, 1.21-1.64). Conclusions/Significance: A novel susceptibility gene (CRYBA4) was discovered for high myopia.
Our study also signified the potential importance of appropriate gene prioritization in candidate selection. PMID:22792142
Zwaenepoel, Arthur; Diels, Tim; Amar, David; Van Parys, Thomas; Shamir, Ron; Van de Peer, Yves; Tzfadia, Oren
2018-01-01
Recent times have seen an enormous growth of "omics" data, of which high-throughput gene expression data are arguably the most important from a functional perspective. Despite huge improvements in computational techniques for the functional classification of gene sequences, common similarity-based methods often fall short of providing full and reliable functional information. Recently, the combination of comparative genomics with approaches in functional genomics has received considerable interest for gene function analysis, leveraging both gene expression based guilt-by-association methods and annotation efforts in closely related model organisms. Besides the identification of missing genes in pathways, these methods also typically enable the discovery of biological regulators (i.e., transcription factors or signaling genes). A previously built guilt-by-association method is MORPH, which was proven to be an efficient algorithm that performs particularly well in identifying and prioritizing missing genes in plant metabolic pathways. Here, we present MorphDB, a resource where MORPH-based candidate genes for large-scale functional annotations (Gene Ontology, MapMan bins) are integrated across multiple plant species. Besides a gene centric query utility, we present a comparative network approach that enables researchers to efficiently browse MORPH predictions across functional gene sets and species, facilitating efficient gene discovery and candidate gene prioritization. MorphDB is available at http://bioinformatics.psb.ugent.be/webtools/morphdb/morphDB/index/. We also provide a toolkit, named "MORPH bulk" (https://github.com/arzwa/morph-bulk), for running MORPH in bulk mode on novel data sets, enabling researchers to apply MORPH to their own species of interest.
Bosch, Linda J.W.; Coupé, Veerle M.H.; Mongera, Sandra; Haan, Josien C.; Richman, Susan D.; Koopman, Miriam; Tol, Jolien; de Meyer, Tim; Louwagie, Joost; Dehaspe, Luc; van Grieken, Nicole C.T.; Ylstra, Bauke; Verheul, Henk M.W.; van Engeland, Manon; Nagtegaal, Iris D.; Herman, James G.; Quirke, Philip; Seymour, Matthew T.; Punt, Cornelis J.A.; van Criekinge, Wim; Carvalho, Beatriz; Meijer, Gerrit A.
2017-01-01
Diversity in colorectal cancer biology is associated with variable responses to standard chemotherapy. We aimed to identify and validate DNA hypermethylated genes as predictive biomarkers for irinotecan treatment of metastatic CRC patients. Candidate genes were selected from 389 genes involved in DNA Damage Repair by correlation analyses between gene methylation status and drug response in 32 cell lines. A large series of samples (n=818) from two phase III clinical trials was used to evaluate these candidate genes by correlating methylation status to progression-free survival after treatment with first-line single-agent fluorouracil (Capecitabine or 5-fluorouracil) or combination chemotherapy (Capecitabine or 5-fluorouracil plus irinotecan (CAPIRI/FOLFIRI)). In the discovery (n=185) and initial validation set (n=166), patients with methylated Decoy Receptor 1 (DCR1) did not benefit from CAPIRI over Capecitabine treatment (discovery set: HR=1.2 (95%CI 0.7-1.9, p=0.6), validation set: HR=0.9 (95%CI 0.6-1.4, p=0.5)), whereas patients with unmethylated DCR1 did (discovery set: HR=0.4 (95%CI 0.3-0.6, p=0.00001), validation set: HR=0.5 (95%CI 0.3-0.7, p=0.0008)). These results could not be replicated in the external data set (n=467), where a similar effect size was found in patients with methylated and unmethylated DCR1 for FOLFIRI over 5FU treatment (methylated DCR1: HR=0.7 (95%CI 0.5-0.9, p=0.01), unmethylated DCR1: HR=0.8 (95%CI 0.6-1.2, p=0.4)). In conclusion, DCR1 promoter hypermethylation status is a potential predictive biomarker for response to treatment with irinotecan, when combined with capecitabine. This finding could not be replicated in an external validation set, in which irinotecan was combined with 5FU. These results underline the challenge and importance of extensive clinical evaluation of candidate biomarkers in multiple trials. PMID:28968978
Weighted Statistical Binning: Enabling Statistically Consistent Genome-Scale Phylogenetic Analyses
Bayzid, Md Shamsuzzoha; Mirarab, Siavash; Boussau, Bastien; Warnow, Tandy
2015-01-01
Because biological processes can result in different loci having different evolutionary histories, species tree estimation requires multiple loci from across multiple genomes. While many processes can result in discord between gene trees and species trees, incomplete lineage sorting (ILS), modeled by the multi-species coalescent, is considered to be a dominant cause for gene tree heterogeneity. Coalescent-based methods have been developed to estimate species trees, many of which operate by combining estimated gene trees, and so are called "summary methods". Because summary methods are generally fast (and much faster than more complicated coalescent-based methods that co-estimate gene trees and species trees), they have become very popular techniques for estimating species trees from multiple loci. However, recent studies have established that summary methods can have reduced accuracy in the presence of gene tree estimation error, and also that many biological datasets have substantial gene tree estimation error, so that summary methods may not be highly accurate in biologically realistic conditions. Mirarab et al. (Science 2014) presented the "statistical binning" technique to improve gene tree estimation in multi-locus analyses, and showed that it improved the accuracy of MP-EST, one of the most popular coalescent-based summary methods. Statistical binning, which uses a simple heuristic to evaluate "combinability" and then uses the larger sets of genes to re-calculate gene trees, has good empirical performance, but using statistical binning within a phylogenomic pipeline does not have the desirable property of being statistically consistent. We show that weighting the re-calculated gene trees by the bin sizes makes statistical binning statistically consistent under the multispecies coalescent, and maintains the good empirical performance. 
Thus, "weighted statistical binning" enables highly accurate genome-scale species tree estimation, and is also statistically consistent under the multi-species coalescent model. New data used in this study are available at DOI: http://dx.doi.org/10.6084/m9.figshare.1411146, and the software is available at https://github.com/smirarab/binning. PMID:26086579
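The weighting step described above can be sketched as follows, under the assumption that "weighting by bin size" is realized by repeating each bin's re-calculated tree once per gene in the bin before the list is passed to a summary method. The function name, the dictionary layout, and the toy Newick strings are hypothetical and are not taken from the authors' software.

```python
def weighted_tree_list(bins, supergene_trees):
    """Build the summary-method input for weighted binning.

    bins: dict mapping a bin id to the list of gene names in that bin
    supergene_trees: dict mapping a bin id to the tree (Newick string)
        re-calculated from that bin's combined genes

    Each bin's tree is repeated len(bin) times, so a bin of three genes
    carries three times the weight of a singleton bin.
    """
    weighted = []
    for bin_id, genes in bins.items():
        weighted.extend([supergene_trees[bin_id]] * len(genes))
    return weighted


# Toy example: two bins over three species A, B, C.
bins = {"bin1": ["g1", "g2", "g3"], "bin2": ["g4"]}
trees = {"bin1": "((A,B),C);", "bin2": "(A,(B,C));"}
for newick in weighted_tree_list(bins, trees):
    print(newick)
```

In the unweighted variant each bin would contribute a single tree regardless of its size, which is the difference the abstract identifies as breaking statistical consistency.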
Chen, Zhenyu; Li, Jianping; Wei, Liwei
2007-10-01
Recently, gene expression profiling using microarray techniques has been shown to be a promising tool for improving the diagnosis and treatment of cancer. Gene expression data contain a high level of noise and an overwhelming number of genes relative to the number of available samples. This poses a great challenge for machine learning and statistical techniques. The support vector machine (SVM) has been successfully used to classify gene expression data of cancer tissue. In the medical field, it is crucial to give the user a transparent decision process. How to explain the computed solutions and present the extracted knowledge has become a main obstacle for SVM. A multiple kernel support vector machine (MK-SVM) scheme, consisting of feature selection, rule extraction and prediction modeling, is proposed to improve the explanation capacity of SVM. In this scheme, we show that the feature selection problem can be translated into an ordinary multiple-parameter learning problem, and a shrinkage approach, 1-norm based linear programming, is proposed to obtain the sparse parameters and the corresponding selected features. We propose a novel rule extraction approach that uses the information provided by the separating hyperplane and support vectors to improve the generalization capacity and comprehensibility of rules and to reduce the computational complexity. Two public gene expression datasets, a leukemia dataset and a colon tumor dataset, are used to demonstrate the performance of this approach. Using a small number of selected genes, MK-SVM achieves encouraging classification accuracy: more than 90% for both datasets. Moreover, very simple rules with linguistic labels are extracted. The rule sets have high diagnostic power because of their good classification performance.
Dai, Weijun; Li, Wencheng; Hoque, Mainul; Li, Zhuyun; Tian, Bin; Makeyev, Eugene V
2015-07-06
Nervous system (NS) development relies on coherent upregulation of extensive sets of genes in a precise spatiotemporal manner. How such transcriptome-wide effects are orchestrated at the molecular level remains an open question. Here we show that 3'-untranslated regions (3' UTRs) of multiple neural transcripts contain AU-rich cis-elements (AREs) recognized by tristetraprolin (TTP/Zfp36), an RNA-binding protein previously implicated in regulation of mRNA stability. We further demonstrate that the efficiency of ARE-dependent mRNA degradation declines in the neural lineage because of a decrease in the TTP protein expression mediated by the NS-enriched microRNA miR-9. Importantly, TTP downregulation in this context is essential for proper neuronal differentiation. On the other hand, inactivation of TTP in non-neuronal cells leads to dramatic upregulation of multiple NS-specific genes. We conclude that the newly identified miR-9/TTP circuitry limits unscheduled accumulation of neuronal mRNAs in non-neuronal cells and ensures coordinated upregulation of these transcripts in neurons.
Ohta, Yuko; McKinney, E Churchill; Criscitiello, Michael F; Flajnik, Martin F
2002-01-15
Cartilaginous fish (e.g., sharks) are derived from the oldest vertebrate ancestor having an adaptive immune system, and thus are key models for examining MHC evolution. Previously, family studies in two shark species showed that classical class I (UAA) and class II genes are genetically linked. In this study, we show that proteasome genes LMP2 and LMP7, shark-specific LMP7-like, and the TAP1/2 genes are linked to class I/II. Functional LMP7 and LMP7-like genes, as well as multiple LMP2 genes or gene fragments, are found only in some sharks, suggesting that different sets of peptides might be generated depending upon inherited MHC haplotypes. Cosmid clones bearing the MHC-linked classical class I genes were isolated and shown to contain proteasome gene fragments. A non-MHC-linked LMP7 gene also was identified on another cosmid, but only two exons of this gene were detected, closely linked to a class I pseudogene (UAA-NC2); this region probably resulted from a recent duplication and translocation from the functional MHC. Tight linkage of proteasome and class I genes, in comparison with gene organizations of other vertebrates, suggests a primordial MHC organization. Another nonclassical class I gene (UAA-NC1) was detected that is linked neither to MHC nor to UAA-NC2; its high level of sequence similarity to UAA suggests that UAA-NC1 also was recently derived from UAA and translocated from MHC. These data further support the principle of a primordial class I region with few class I genes. Finally, multiple paternities in one family were demonstrated, with potential segregation distortions.
A methodology for the analysis of differential coexpression across the human lifespan.
Gillis, Jesse; Pavlidis, Paul
2009-09-22
Differential coexpression is a change in coexpression between genes that may reflect 'rewiring' of transcriptional networks. It has previously been hypothesized that such changes might be occurring over time in the lifespan of an organism. While both coexpression and differential expression of genes have been previously studied in life stage change or aging, differential coexpression has not. Generalizing differential coexpression analysis to many time points presents a methodological challenge. Here we introduce a method for analyzing changes in coexpression across multiple ordered groups (e.g., over time) and extensively test its validity and usefulness. Our method is based on the use of the Haar basis set to efficiently represent changes in coexpression at multiple time scales, and thus represents a principled and generalizable extension of the idea of differential coexpression to life stage data. We used published microarray studies categorized by age to test the methodology. We validated the methodology by testing our ability to reconstruct Gene Ontology (GO) categories using our measure of differential coexpression and compared this result to using coexpression alone. Our method allows significant improvement in characterizing these groups of genes. Further, we examine the statistical properties of our measure of differential coexpression and establish that the results are significant both statistically and by an improvement in semantic similarity. In addition, we found that our method finds more significant changes in gene relationships compared to several other methods of expressing temporal relationships between genes, such as coexpression over time. Differential coexpression over age generates significant and biologically relevant information about the genes producing it. Our Haar basis methodology for determining age-related differential coexpression performs better than other tested methods. 
The Haar basis set also lends itself to ready interpretation in terms of both evolutionary and physiological mechanisms of aging and can be seen as a natural generalization of two-category differential coexpression.
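The multi-scale idea in the abstract above can be illustrated with a minimal sketch. It assumes a single gene pair, a number of ordered age groups that is a power of two, and Pearson correlation as the per-group coexpression value; the function names and toy numbers are hypothetical and this is not the authors' implementation.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def coexpression_profile(groups):
    """One correlation per ordered group; groups is a list of (x, y) pairs."""
    return [pearson(x, y) for x, y in groups]


def haar_transform(values):
    """Orthonormal Haar transform; len(values) must be a power of two.

    Returns [overall level, coarsest change, ..., finest changes].
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    details = []
    current = list(values)
    while len(current) > 1:
        pairs = range(0, len(current), 2)
        avg = [(current[i] + current[i + 1]) / math.sqrt(2) for i in pairs]
        det = [(current[i] - current[i + 1]) / math.sqrt(2) for i in pairs]
        details = det + details  # coarser scales end up first
        current = avg
    return current + details


# Toy profile (assumed values): strong coexpression in the two youngest
# groups, weak coexpression in the two oldest groups.
coefficients = haar_transform([0.9, 0.9, 0.1, 0.1])
print(coefficients)
```

In this layout the second coefficient contrasts the first half of the ordered groups with the second half, while the remaining coefficients capture changes between adjacent groups, which is one way to read "changes in coexpression at multiple time scales".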
Hara, Yuichiro; Tatsumi, Kaori; Yoshida, Michio; Kajikawa, Eriko; Kiyonari, Hiroshi; Kuraku, Shigehiro
2015-11-18
RNA-seq enables gene expression profiling in selected spatiotemporal windows and yields massive sequence information with relatively low cost and time investment, even for non-model species. However, there remains considerable room for optimizing its workflow, in order to take full advantage of continuously developing sequencing capacity. Transcriptome sequencing for three embryonic stages of Madagascar ground gecko (Paroedura picta) was performed with the Illumina platform. The output reads were assembled de novo for reconstructing transcript sequences. In order to evaluate the completeness of transcriptome assemblies, we prepared a reference gene set consisting of vertebrate one-to-one orthologs. To take advantage of increased read length of >150 nt, we demonstrated shortened RNA fragmentation time, which resulted in a dramatic shift of insert size distribution. To evaluate products of multiple de novo assembly runs incorporating reads with different RNA sources, read lengths, and insert sizes, we introduce a new reference gene set, core vertebrate genes (CVG), consisting of 233 genes that are shared as one-to-one orthologs by all vertebrate genomes examined (29 species). The completeness assessment performed by the computational pipelines CEGMA and BUSCO referring to CVG demonstrated higher accuracy and resolution than with the gene set previously established for this purpose. As a result of the assessment with CVG, we have derived the most comprehensive transcript sequence set of the Madagascar ground gecko by means of assembling individual libraries followed by clustering the assembled sequences based on their overall similarities. Our results provide several insights into optimizing de novo RNA-seq workflow, including the coordination between library insert size and read length, which manifested in improved connectivity of assemblies. 
The approach and assembly assessment with CVG demonstrated here would be applicable to transcriptome analysis of other species as well as whole genome analyses.
Zimmermann, Michael T.; Kennedy, Richard B.; Grill, Diane E.; Oberg, Ann L.; Goergen, Krista M.; Ovsyannikova, Inna G.; Haralambieva, Iana H.; Poland, Gregory A.
2017-01-01
The development of a humoral immune response to influenza vaccines occurs on a multisystems level. Due to the orchestration required for robust immune responses when multiple genes and their regulatory components across multiple cell types are involved, we examined an influenza vaccination cohort using multiple high-throughput technologies. In this study, we sought a more thorough understanding of how immune cell composition and gene expression relate to each other and contribute to interindividual variation in response to influenza vaccination. We first hypothesized that many of the differentially expressed (DE) genes observed after influenza vaccination result from changes in the composition of participants’ peripheral blood mononuclear cells (PBMCs), which were assessed using flow cytometry. We demonstrated that DE genes in our study are correlated with changes in PBMC composition. We gathered DE genes from 128 other publicly available PBMC-based vaccine studies and identified that an average of 57% correlated with specific cell subset levels in our study (permutation used to control false discovery), suggesting that the associations we have identified are likely general features of PBMC-based transcriptomics. Second, we hypothesized that more robust models of vaccine response could be generated by accounting for the interplay between PBMC composition, gene expression, and gene regulation. We employed machine learning to generate predictive models of B-cell ELISPOT response outcomes and hemagglutination inhibition (HAI) antibody titers. The top HAI and B-cell ELISPOT models achieved an area under the receiver operating characteristic curve (AUC) of 0.64 and 0.79, respectively, with linear model coefficients of determination of 0.08 and 0.28. For the B-cell ELISPOT outcomes, CpG methylation had the greatest predictive ability, highlighting potentially novel regulatory features important for immune response. 
B-cell ELISPOT models using only PBMC composition had lower performance (AUC = 0.67), but highlighted well-known mechanisms. Our analysis demonstrated that each of the three data sets (cell composition, mRNA-Seq, and DNA methylation) may provide distinct information for the prediction of humoral immune response outcomes. We believe that these findings are important for the interpretation of current omics-based studies and set the stage for a more thorough understanding of interindividual immune responses to influenza vaccination. PMID:28484452
Hackett, Justin B; Lu, Yan
2017-05-04
In land plants, plastid and mitochondrial RNAs are subject to post-transcriptional C-to-U RNA editing. T-DNA insertions in the ORGANELLE RNA RECOGNITION MOTIF PROTEIN6 gene resulted in reduced photosystem II (PSII) activity and smaller plant and leaf sizes. Exon coverage analysis of the ORRM6 gene showed that orrm6-1 and orrm6-2 are loss-of-function mutants. Compared to other ORRM proteins, ORRM6 affects a relatively small number of RNA editing sites. Sanger sequencing of reverse transcription-PCR products of plastid transcripts revealed 2 plastid RNA editing sites that are substantially affected in the orrm6 mutants: psbF-C77 and accD-C794. The psbF gene encodes the β subunit of cytochrome b559, an essential component of PSII. The accD gene encodes the β subunit of acetyl-CoA carboxylase, a protein required in plastid fatty acid biosynthesis. Whole-transcriptome RNA-seq demonstrated that editing at psbF-C77 is nearly absent and the editing extent at accD-C794 was significantly reduced. Gene set enrichment pathway analysis showed that expression of multiple gene sets involved in photosynthesis, especially photosynthetic electron transport, is significantly upregulated in both orrm6 mutants. The upregulation could be a mechanism to compensate for the reduced PSII electron transport rate in the orrm6 mutants. These results further demonstrated that Organelle RNA Recognition Motif protein ORRM6 is required in editing of specific RNAs in the Arabidopsis (Arabidopsis thaliana) plastid.
MIDAS: A Modular DNA Assembly System for Synthetic Biology.
van Dolleweerd, Craig J; Kessans, Sarah A; Van de Bittner, Kyle C; Bustamante, Leyla Y; Bundela, Rudranuj; Scott, Barry; Nicholson, Matthew J; Parker, Emily J
2018-04-20
A modular and hierarchical DNA assembly platform for synthetic biology based on Golden Gate (Type IIS restriction enzyme) cloning is described. This enabling technology, termed MIDAS (for Modular Idempotent DNA Assembly System), can be used to precisely assemble multiple DNA fragments in a single reaction using a standardized assembly design. It can be used to build genes from libraries of sequence-verified, reusable parts and to assemble multiple genes in a single vector, with full user control over gene order and orientation, as well as control of the direction of growth (polarity) of the multigene assembly, a feature that allows genes to be nested between other genes or genetic elements. We describe the detailed design and use of MIDAS, exemplified by the reconstruction, in the filamentous fungus Penicillium paxilli, of the metabolic pathway for production of paspaline and paxilline, key intermediates in the biosynthesis of a range of indole diterpenes, a class of secondary metabolites produced by several species of filamentous fungi. MIDAS was used to efficiently assemble a 25.2 kb plasmid from 21 different modules (seven genes, each composed of three basic parts). By using a parts library-based system for construction of complex assemblies, and a unique set of vectors, MIDAS can provide a flexible route to assembling tailored combinations of genes and other genetic elements, thereby supporting synthetic biology applications in a wide range of expression hosts.
Gene genealogies for genetic association mapping, with application to Crohn's disease
Burkett, Kelly M.; Greenwood, Celia M. T.; McNeney, Brad; Graham, Jinko
2013-01-01
A gene genealogy describes relationships among haplotypes sampled from a population. Knowledge of the gene genealogy for a set of haplotypes is useful for estimation of population genetic parameters and it also has potential application in finding disease-predisposing genetic variants. As the true gene genealogy is unknown, Markov chain Monte Carlo (MCMC) approaches have been used to sample genealogies conditional on data at multiple genetic markers. We previously implemented an MCMC algorithm to sample from an approximation to the distribution of the gene genealogy conditional on haplotype data. Our approach samples ancestral trees, recombination and mutation rates at a genomic focal point. In this work, we describe how our sampler can be used to find disease-predisposing genetic variants in samples of cases and controls. We use a tree-based association statistic that quantifies the degree to which case haplotypes are more closely related to each other around the focal point than control haplotypes, without relying on a disease model. As the ancestral tree is a latent variable, so is the tree-based association statistic. We show how the sampler can be used to estimate the posterior distribution of the latent test statistic and corresponding latent p-values, which together comprise a fuzzy p-value. We illustrate the approach on a publicly-available dataset from a study of Crohn's disease that consists of genotypes at multiple SNP markers in a small genomic region. We estimate the posterior distribution of the tree-based association statistic and the recombination rate at multiple focal points in the region. Reassuringly, the posterior mean recombination rates estimated at the different focal points are consistent with previously published estimates. The tree-based association approach finds multiple sub-regions where the case haplotypes are more genetically related than the control haplotypes, suggesting that there may be one or multiple disease-predisposing loci. 
PMID:24348515
Sporulation genes associated with sporulation efficiency in natural isolates of yeast.
Tomar, Parul; Bhatia, Aatish; Ramdas, Shweta; Diao, Liyang; Bhanot, Gyan; Sinha, Himanshu
2013-01-01
Yeast sporulation efficiency is a quantitative trait and is known to vary among experimental populations and natural isolates. Some studies have uncovered the genetic basis of this variation and have identified the role of sporulation genes (IME1, RME1) and sporulation-associated genes (FKH2, PMS1, RAS2, RSF1, SWS2), as well as non-sporulation pathway genes (MKT1, TAO3) in maintaining this variation. However, these studies have been done mostly in experimental populations. Sporulation is a response to nutrient deprivation. Unlike laboratory strains, natural isolates have likely undergone multiple selections for quick adaptation to varying nutrient conditions. As a result, sporulation efficiency in natural isolates may have different genetic factors contributing to phenotypic variation. Using Saccharomyces cerevisiae strains in the genetically and environmentally diverse SGRP collection, we have identified genetic loci associated with sporulation efficiency variation in a set of sporulation and sporulation-associated genes. Using two independent methods for association mapping and correcting for population structure biases, our analysis identified two linked clusters containing 4 non-synonymous mutations in genes - HOS4, MCK1, SET3, and SPO74. Five regulatory polymorphisms in five genes such as MLS1 and CDC10 were also identified as putative candidates. Our results provide candidate genes contributing to phenotypic variation in the sporulation efficiency of natural isolates of yeast. PMID:23874994
A 16-Gene Signature Distinguishes Anaplastic Astrocytoma from Glioblastoma
Rao, Soumya Alige Mahabala; Srinivasan, Sujaya; Patric, Irene Rosita Pia; Hegde, Alangar Sathyaranjandas; Chandramouli, Bangalore Ashwathnarayanara; Arimappamagan, Arivazhagan; Santosh, Vani; Kondaiah, Paturu; Rao, Manchanahalli R. Sathyanarayana; Somasundaram, Kumaravel
2014-01-01
Anaplastic astrocytoma (AA; Grade III) and glioblastoma (GBM; Grade IV) are diffusely infiltrating tumors and are called malignant astrocytomas. The treatment regimen and prognosis are distinctly different between anaplastic astrocytoma and glioblastoma patients. Although the current histopathology-based grading system is well accepted and largely reproducible, intratumoral histologic variations often lead to difficulties in classification of malignant astrocytoma samples. In order to obtain a more robust molecular classifier, we analysed RT-qPCR expression data of 175 differentially regulated genes across astrocytoma using Prediction Analysis of Microarrays (PAM) and found the most discriminatory 16-gene expression signature for the classification of anaplastic astrocytoma and glioblastoma. The 16-gene signature obtained in the training set was validated in the test set with diagnostic accuracy of 89%. Additionally, validation of the 16-gene signature in multiple independent cohorts revealed that the signature predicted anaplastic astrocytoma and glioblastoma samples with accuracy rates of 99%, 88%, and 92% in TCGA, GSE1993 and GSE4422 datasets, respectively. The protein-protein interaction network and pathway analysis of the 16 genes of the signature identified the epithelial-mesenchymal transition (EMT) pathway as the most differentially regulated pathway in glioblastoma compared to anaplastic astrocytoma. In addition to identifying a 16-gene classification signature, we also demonstrated that genes involved in epithelial-mesenchymal transition may play an important role in distinguishing glioblastoma from anaplastic astrocytoma. PMID:24475040
Molecular Diagnosis of Infantile Mitochondrial Disease with Targeted Next-Generation Sequencing
Calvo, Sarah E.; Compton, Alison G.; Hershman, Steven G.; Lim, Sze Chern; Lieber, Daniel S.; Tucker, Elena J.; Laskowski, Adrienne; Garone, Caterina; Liu, Shangtao; Jaffe, David B.; Christodoulou, John; Fletcher, Janice M.; Bruno, Damien L; Goldblatt, Jack; DiMauro, Salvatore; Thorburn, David R.; Mootha, Vamsi K.
2012-01-01
Advances in next-generation sequencing (NGS) promise to facilitate diagnosis of inherited disorders. While in research settings NGS has pinpointed causal alleles using segregation in large families, the key challenge for clinical diagnosis is application to single individuals. To explore its diagnostic utility, we performed targeted NGS in 42 unrelated infants with clinical and biochemical evidence of mitochondrial oxidative phosphorylation disease, who were refractory to traditional molecular diagnosis. These devastating mitochondrial disorders are characterized by phenotypic and genetic heterogeneity, with over 100 causal genes identified to date. We performed “MitoExome” sequencing of the mitochondrial DNA (mtDNA) and exons of ~1000 nuclear genes encoding mitochondrial proteins and prioritized rare mutations predicted to disrupt function. Since patients and controls harbored a comparable number of such heterozygous alleles, we could not prioritize dominant acting genes. However, patients showed a five-fold enrichment of genes with two such mutations that could underlie recessive disease. In total, 23/42 (55%) patients harbored such recessive genes or pathogenic mtDNA variants. Firm diagnoses were enabled in 10 patients (24%) who had mutations in genes previously linked to disease. 13 patients (31%) had mutations in nuclear genes never linked to disease. The pathogenicity of two such genes, NDUFB3 and AGK, was supported by cDNA complementation and evidence from multiple patients, respectively. The results underscore the immediate potential and challenges of deploying NGS in clinical settings. PMID:22277967
Palumbo, Maria Concetta; Zenoni, Sara; Fasoli, Marianna; Massonnet, Mélanie; Farina, Lorenzo; Castiglione, Filippo; Pezzotti, Mario; Paci, Paola
2014-01-01
We developed an approach that integrates different network-based methods to analyze the correlation network arising from large-scale gene expression data. By studying grapevine (Vitis vinifera) and tomato (Solanum lycopersicum) gene expression atlases and a grapevine berry transcriptomic data set during the transition from immature to mature growth, we identified a category named “fight-club hubs” characterized by a marked negative correlation with the expression profiles of neighboring genes in the network. A special subset named “switch genes” was identified, with the additional property of many significant negative correlations outside their own group in the network. Switch genes are involved in multiple processes and include transcription factors that may be considered master regulators of the previously reported transcriptome remodeling that marks the developmental shift from immature to mature growth. All switch genes, expressed at low levels in vegetative/green tissues, showed a significant increase in mature/woody organs, suggesting a potential regulatory role during the developmental transition. Finally, our analysis of tomato gene expression data sets showed that wild-type switch genes are downregulated in ripening-deficient mutants. The identification of known master regulators of tomato fruit maturation suggests our method is suitable for the detection of key regulators of organ development in different fleshy fruit crops. PMID:25490918
A functional polymorphism of the TNF-α gene that is associated with type 2 DM
DOE Office of Scientific and Technical Information (OSTI.GOV)
Susa, Shinji; Daimon, Makoto; Sakabe, Jun-Ichi
2008-05-09
To examine the association of the tumor necrosis factor-α (TNF-α) gene region with type 2 diabetes (DM), 11 single-nucleotide polymorphisms (SNPs) of the region were analyzed. The initial study using a sample set (148 cases vs. 227 controls) showed a significant association of the SNP IVS1G + 123A of the TNF-α gene with DM (p = 0.0056). Multiple logistic regression analysis using an enlarged sample set (225 vs. 716) revealed the significant association of the SNP with DM independently of any clinical traits examined (OR: 1.49, p = 0.014). The functional relevance of the SNP was examined by electrophoretic mobility shift assays using nuclear extracts from the U937 and NIH3T3 cells and luciferase assays in these cells with Simian virus 40 promoter- and TNF-α promoter-reporter gene constructs. The functional analyses showed that YY1 transcription factor bound allele-specifically to the SNP region and that the IVS1 + 123A allele had an increase in luciferase expression compared with the G allele.
A multistage gene normalization system integrating multiple effective methods.
Li, Lishuang; Liu, Shanshan; Li, Lihua; Fan, Wenting; Huang, Degen; Zhou, Huiwei
2013-01-01
Gene/protein recognition and normalization is an important preliminary step for many biological text mining tasks. In this paper, we present a multistage gene normalization system which consists of four major subtasks: pre-processing, dictionary matching, ambiguity resolution and filtering. For the first subtask, we apply the gene mention tagger developed in our earlier work, which achieves an F-score of 88.42% on the BioCreative II GM testing set. In the stage of dictionary matching, the exact matching and approximate matching between gene names and the EntrezGene lexicon have been combined. For the ambiguity resolution subtask, we propose a semantic similarity disambiguation method based on Munkres' Assignment Algorithm. At the last step, a filter based on Wikipedia has been built to remove the false positives. Experimental results show that the presented system can achieve an F-score of 90.1%, outperforming most of the state-of-the-art systems.
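The dictionary-matching stage described above (exact matching combined with approximate matching against a lexicon) can be sketched as follows. This is an illustration only: the lexicon entries and identifiers are toy placeholders rather than EntrezGene records, the similarity cutoff is an assumed value, and the ambiguity-resolution and filtering stages of the published system are not reproduced.

```python
import difflib
import re


def normalize(name):
    """Lowercase a gene name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def build_lexicon(entries):
    """Map each normalized name to the set of identifiers that use it.

    entries: iterable of (identifier, [names]); identifiers are placeholders.
    """
    lexicon = {}
    for identifier, names in entries:
        for name in names:
            lexicon.setdefault(normalize(name), set()).add(identifier)
    return lexicon


def match_mention(mention, lexicon, cutoff=0.85):
    """Try an exact lookup first, then fall back to approximate matching.

    Returns (sorted identifiers, match type). The cutoff is a hypothetical
    setting for difflib's similarity ratio, not a value from the paper.
    """
    key = normalize(mention)
    if key in lexicon:
        return sorted(lexicon[key]), "exact"
    close = difflib.get_close_matches(key, list(lexicon), n=1, cutoff=cutoff)
    if close:
        return sorted(lexicon[close[0]]), "approximate"
    return [], "none"


# Toy lexicon with placeholder identifiers.
toy_lexicon = build_lexicon([
    ("GENE_A", ["TP53", "tumor protein p53"]),
    ("GENE_B", ["BRCA1", "breast cancer 1"]),
])
print(match_mention("tumour protein p53", toy_lexicon))
```

A mention that maps to several identifiers would come back as a multi-element list here; deciding among them is the job of a separate disambiguation step such as the assignment-based method the abstract mentions.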
Gene panel testing for inherited cancer risk.
Hall, Michael J; Forman, Andrea D; Pilarski, Robert; Wiesner, Georgia; Giri, Veda N
2014-09-01
Next-generation sequencing technologies have ushered in the capability to assess multiple genes in parallel for genetic alterations that may contribute to inherited risk for cancers in families. Thus, gene panel testing is now an option in the setting of genetic counseling and testing for cancer risk. This article describes the many gene panel testing options clinically available to assess inherited cancer susceptibility, the potential advantages and challenges associated with various types of panels, clinical scenarios in which gene panels may be particularly useful in cancer risk assessment, and testing and counseling considerations. Given the potential issues for patients and their families, gene panel testing for inherited cancer risk is recommended to be offered in conjunction or consultation with an experienced cancer genetic specialist, such as a certified genetic counselor or geneticist, as an integral part of the testing process. Copyright © 2014 by the National Comprehensive Cancer Network.
The Ad5 [E1-, E2b-]-based vector: a new and versatile gene delivery platform
NASA Astrophysics Data System (ADS)
Jones, Frank R.; Gabitzsch, Elizabeth S.; Balint, Joseph P.
2015-05-01
Based upon advances in gene sequencing and construction, it is now possible to identify specific genes or sequences thereof for gene delivery applications. Recombinant adenovirus serotype-5 (Ad5) viral vectors have been utilized in the settings of gene therapy, vaccination, and immunotherapy but have encountered clinical challenges because they are recognized as foreign entities to the host. This recognition leads to an immunologic clearance of the vector that contains the inserted gene of interest and prevents effective immunization(s). We have reported on a new Ad5-based viral vector technology that can be utilized as an immunization modality to induce immune responses even in the presence of Ad5 vector immunity. We have reported successful immunization and immunotherapy results to infectious diseases and cancers. This improved recombinant viral platform (Ad5 [E1-, E2b-]) can now be utilized in the development of multiple vaccines and immunotherapies.
A genome-scale map of expression for a mouse brain section obtained using voxelation
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chin, Mark H.; Geng, Alex B.; Khan, Arshad H.
Gene expression signatures in the mammalian brain hold the key to understanding neural development and neurological diseases. We have reconstructed 2-dimensional images of gene expression for 20,000 genes in a coronal slice of the mouse brain at the level of the striatum by using microarrays in combination with voxelation at a resolution of 1 mm³. Good reliability of the microarray results was confirmed using multiple replicates, subsequent quantitative RT-PCR voxelation, mass spectrometry voxelation and publicly available in situ hybridization data. Known and novel genes were identified with expression patterns localized to defined substructures within the brain. In addition, genes with unexpected patterns were identified and cluster analysis identified a set of genes with a gradient of dorsal/ventral expression not restricted to known anatomical boundaries. The genome-scale maps of gene expression obtained using voxelation will be a valuable tool for the neuroscience community.
A simulation-based evaluation of methods for inferring linear barriers to gene flow
Christopher Blair; Dana E. Weigel; Matthew Balazik; Annika T. H. Keeley; Faith M. Walker; Erin Landguth; Sam Cushman; Melanie Murphy; Lisette Waits; Niko Balkenhol
2012-01-01
Different analytical techniques used on the same data set may lead to different conclusions about the existence and strength of genetic structure. Therefore, reliable interpretation of the results from different methods depends on the efficacy and reliability of different statistical methods. In this paper, we evaluated the performance of multiple analytical methods to...
Evaluation of atpB nucleotide sequences for phylogenetic studies of ferns and other pteridophytes.
Wolf, P
1997-10-01
Inferring basal relationships among vascular plants poses a major challenge to plant systematists. The divergence events that describe these relationships occurred long ago and considerable homoplasy has since accrued for both molecular and morphological characters. A potential solution is to examine phylogenetic analyses from multiple data sets. Here I present a new source of phylogenetic data for ferns and other pteridophytes. I sequenced the chloroplast gene atpB from 23 pteridophyte taxa and used maximum parsimony to infer relationships. A 588-bp region of the gene appeared to contain a statistically significant amount of phylogenetic signal and the resulting trees were largely congruent with similar analyses of nucleotide sequences from rbcL. However, a combined analysis of atpB plus rbcL produced a better resolved tree than did either data set alone. In the shortest trees, leptosporangiate ferns formed a monophyletic group. Also, I detected a well-supported clade of Psilotaceae (Psilotum and Tmesipteris) plus Ophioglossaceae (Ophioglossum and Botrychium). The demonstrated utility of atpB suggests that sequences from this gene should play a role in phylogenetic analyses that incorporate data from chloroplast genes, nuclear genes, morphology, and fossil data.
2011-01-01
Background: Several tools have been developed to perform global gene expression profile data analysis, to search for specific chromosomal regions whose features meet defined criteria as well as to study neighbouring gene expression. However, most of these tools are tailored for a specific use in a particular context (e.g. they are species-specific, or limited to a particular data format) and they typically accept only gene lists as input. Results: TRAM (Transcriptome Mapper) is a new general tool that allows the simple generation and analysis of quantitative transcriptome maps, starting from any source listing gene expression values for a given gene set (e.g. expression microarrays), implemented as a relational database. It includes a parser able to assign univocal and updated gene symbols to gene identifiers from different data sources. Moreover, TRAM is able to perform intra-sample and inter-sample data normalization, including an original variant of quantile normalization (scaled quantile), useful to normalize data from platforms with highly different numbers of investigated genes. When in 'Map' mode, the software generates a quantitative representation of the transcriptome of a sample (or of a pool of samples) and identifies if segments of defined lengths are over/under-expressed compared to the desired threshold. When in 'Cluster' mode, the software searches for a set of over/under-expressed consecutive genes. Statistical significance for all results is calculated with respect to genes localized on the same chromosome or to all genome genes. Transcriptome maps, showing differential expression between two sample groups, relative to two different biological conditions, may be easily generated. We present the results of a biological model test, based on a meta-analysis comparison between a sample pool of human CD34+ hematopoietic progenitor cells and a sample pool of megakaryocytic cells. 
Biologically relevant chromosomal segments and gene clusters with differential expression during the differentiation toward megakaryocyte were identified. Conclusions: TRAM is designed to create, and statistically analyze, quantitative transcriptome maps, based on gene expression data from multiple sources. The release includes FileMaker Pro database management runtime application and it is freely available at http://apollo11.isto.unibo.it/software/, along with preconfigured implementations for mapping of human, mouse and zebrafish transcriptomes. PMID:21333005
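The 'Cluster' mode described above, a search for runs of consecutive over-expressed genes, can be pictured with a short sketch. It assumes genes are already ordered by chromosomal position and that an expression ratio at or above a threshold counts as over-expressed; the function name, the threshold, and the minimum run length are hypothetical choices, and the significance testing that TRAM performs is not included.

```python
def find_overexpressed_clusters(ordered_values, threshold, min_genes=3):
    """Return runs of consecutive genes whose value is at or above threshold.

    ordered_values: list of (gene, value) sorted by position on one chromosome.
    threshold, min_genes: assumed parameters, not TRAM's actual defaults.
    """
    clusters = []
    run = []
    for gene, value in ordered_values:
        if value >= threshold:
            run.append(gene)
            continue
        # The run is interrupted; keep it only if it is long enough.
        if len(run) >= min_genes:
            clusters.append(run)
        run = []
    if len(run) >= min_genes:
        clusters.append(run)
    return clusters


# Toy expression ratios along one chromosome (made-up gene labels).
toy_map = [("g1", 2.5), ("g2", 3.0), ("g3", 2.1),
           ("g4", 0.9), ("g5", 2.2), ("g6", 2.4)]
print(find_overexpressed_clusters(toy_map, threshold=2.0))
```

An under-expression search would follow the same pattern with the comparison reversed.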
A Versatile Panel of Reference Gene Assays for the Measurement of Chicken mRNA by Quantitative PCR
Maier, Helena J.; Van Borm, Steven; Young, John R.; Fife, Mark
2016-01-01
Quantitative real-time PCR assays are widely used for the quantification of mRNA within avian experimental samples. Multiple stably-expressed reference genes, selected for the lowest variation in representative samples, can be used to control random technical variation. Reference gene assays must be reliable, have high amplification specificity and efficiency, and not produce signals from contaminating DNA. Whilst recent research papers identify specific genes that are stable in particular tissues and experimental treatments, here we describe a panel of ten avian gene primer and probe sets that can be used to identify suitable reference genes in many experimental contexts. The panel was tested with TaqMan and SYBR Green systems in two experimental scenarios: a tissue collection and virus infection of cultured fibroblasts. GeNorm and NormFinder algorithms were able to select appropriate reference gene sets in each case. We show the effects of using the selected genes on the detection of statistically significant differences in expression. The results are compared with those obtained using 28S ribosomal RNA, currently the most widely accepted reference gene in chicken work, identifying circumstances where its use might provide misleading results. Methods for eliminating DNA contamination of RNA reduced, but did not completely remove, detectable DNA. We therefore attached special importance to testing each qPCR assay for absence of signal using DNA template. The assays and analyses developed here provide a useful resource for selecting reference genes for investigations of avian biology. PMID:27537060
OPATs: Omnibus P-value association tests.
Chen, Chia-Wei; Yang, Hsin-Chou
2017-07-10
Combining statistical significances (P-values) from a set of single-locus association tests in genome-wide association studies is a proof-of-principle method for identifying disease-associated genomic segments, functional genes and biological pathways. We review P-value combinations for genome-wide association studies and introduce an integrated analysis tool, Omnibus P-value Association Tests (OPATs), which provides popular analysis methods of P-value combinations. OPATs is programmed in R and features a user-friendly R graphical user interface. In addition to analysis modules for data quality control and single-locus association tests, OPATs provides three types of set-based association test: window-, gene- and biopathway-based association tests. P-value combinations with or without threshold and rank truncation are provided. The significance of a set-based association test is evaluated by using resampling procedures. Performance of the set-based association tests in OPATs has been evaluated by simulation studies and real data analyses. These set-based association tests help boost the statistical power, alleviate the multiple-testing problem, reduce the impact of genetic heterogeneity, increase the replication efficiency of association tests and facilitate the interpretation of association signals by streamlining the testing procedures and integrating the genetic effects of multiple variants in genomic regions of biological relevance. In summary, P-value combinations facilitate the identification of marker sets associated with disease susceptibility and uncover missing heritability in association studies, thereby establishing a foundation for the genetic dissection of complex diseases and traits. OPATs provides an easy-to-use and statistically powerful analysis tool for P-value combinations. OPATs, examples, and user guide can be downloaded from http://www.stat.sinica.edu.tw/hsinchou/genetics/association/OPATs.htm. © The Author 2017.
Published by Oxford University Press.
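As a rough illustration of what a P-value combination looks like, the sketch below implements Fisher's method, one classic combination rule. The OPATs abstract does not name the specific combinations it offers, so this is an assumed example and not a description of the OPATs code; the function name `fisher_combined_p` is hypothetical. The analytic result assumes independent tests, whereas the abstract states that OPATs evaluates significance by resampling.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent P-values with Fisher's method (illustrative sketch).

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution with
    2k degrees of freedom under the null hypothesis, where k = len(p_values).
    """
    if not p_values:
        raise ValueError("need at least one P-value")
    if any(p <= 0.0 or p > 1.0 for p in p_values):
        raise ValueError("P-values must lie in (0, 1]")

    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2

    # Chi-square survival function for even degrees of freedom (2k):
    # exp(-X/2) * sum_{j=0}^{k-1} (X/2)^j / j!
    term = 1.0
    total = 1.0
    for j in range(1, k):
        term *= half_stat / j
        total += term
    return math.exp(-half_stat) * total
```

For two hypothetical single-locus P-values of 0.05 each, `fisher_combined_p([0.05, 0.05])` gives approximately 0.0175. Neighbouring markers are usually correlated through linkage disequilibrium, which is one reason a resampling-based evaluation can be preferable to this closed form.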
DOE Office of Scientific and Technical Information (OSTI.GOV)
Tilton, Susan C.; Karin, Norman J.; Webb-Robertson, Bobbie-Jo M.
Smoking and obesity are each well-established risk factors for cardiovascular heart disease, which together impose earlier onset and greater severity of disease. To identify early signaling events in the response of the heart to cigarette smoke exposure within the setting of obesity, we exposed normal weight and high fat diet-induced obese (DIO) C57BL/6 mice to repeated inhaled doses of mainstream (MS) or sidestream (SS) cigarette smoke administered over a two week period, monitoring effects on both cardiac and pulmonary transcriptomes. MS smoke (250 μg wet total particulate matter (WTPM)/L, 5 h/day) exposures elicited robust cellular and molecular inflammatory responses in the lung with 1466 differentially expressed pulmonary genes (p < 0.01) in normal weight animals and a much-attenuated response (463 genes) in the hearts of the same animals. In contrast, exposures to SS smoke (85 μg WTPM/L) with a CO concentration equivalent to that of MS smoke (250 CO ppm) induced a weak pulmonary response (328 genes) but an extensive cardiac response (1590 genes). SS smoke and to a lesser extent MS smoke preferentially elicited hypoxia- and stress-responsive genes as well as genes predicting early changes of vascular smooth muscle and endothelium, precursors of cardiovascular disease. The most sensitive smoke-induced cardiac transcriptional changes of normal weight mice were largely absent in DIO mice after smoke exposure, while genes involved in fatty acid utilization were unaffected. At the same time, smoke exposure suppressed multiple proteome maintenance genes induced in the hearts of DIO mice. Together, these results underscore the sensitivity of the heart to SS smoke and reveal adaptive responses in healthy individuals that are absent in the setting of high fat diet and obesity.
Tilton, Susan C; Karin, Norman J; Webb-Robertson, Bobbie-Jo M; Waters, Katrina M; Mikheev, Vladimir; Lee, K Monica; Corley, Richard A; Pounds, Joel G; Bigelow, Diana J
2013-07-15
Smoking and obesity are each well-established risk factors for cardiovascular heart disease, which together impose earlier onset and greater severity of disease. To identify early signaling events in the response of the heart to cigarette smoke exposure within the setting of obesity, we exposed normal weight and high fat diet-induced obese (DIO) C57BL/6 mice to repeated inhaled doses of mainstream (MS) or sidestream (SS) cigarette smoke administered over a two week period, monitoring effects on both cardiac and pulmonary transcriptomes. MS smoke (250 μg wet total particulate matter (WTPM)/L, 5 h/day) exposures elicited robust cellular and molecular inflammatory responses in the lung with 1466 differentially expressed pulmonary genes (p < 0.01) in normal weight animals and a much-attenuated response (463 genes) in the hearts of the same animals. In contrast, exposures to SS smoke (85 μg WTPM/L) with a CO concentration equivalent to that of MS smoke (~250 CO ppm) induced a weak pulmonary response (328 genes) but an extensive cardiac response (1590 genes). SS smoke and to a lesser extent MS smoke preferentially elicited hypoxia- and stress-responsive genes as well as genes predicting early changes of vascular smooth muscle and endothelium, precursors of cardiovascular disease. The most sensitive smoke-induced cardiac transcriptional changes of normal weight mice were largely absent in DIO mice after smoke exposure, while genes involved in fatty acid utilization were unaffected. At the same time, smoke exposure suppressed multiple proteome maintenance genes induced in the hearts of DIO mice. Together, these results underscore the sensitivity of the heart to SS smoke and reveal adaptive responses in healthy individuals that are absent in the setting of high fat diet and obesity.
Kurian, S. M.; Williams, A. N.; Gelbart, T.; Campbell, D.; Mondala, T. S.; Head, S. R.; Horvath, S.; Gaber, L.; Thompson, R.; Whisenant, T.; Lin, W.; Langfelder, P.; Robison, E. H.; Schaffer, R. L.; Fisher, J. S.; Friedewald, J.; Flechner, S. M.; Chan, L. K.; Wiseman, A. C.; Shidban, H.; Mendez, R.; Heilman, R.; Abecassis, M. M.; Marsh, C. L.; Salomon, D. R.
2015-01-01
There are no minimally invasive diagnostic metrics for acute kidney transplant rejection (AR), especially in the setting of the common confounding diagnosis, acute dysfunction with no rejection (ADNR). Thus, though kidney transplant biopsies remain the gold standard, they are invasive, have substantial risks, sampling error issues and significant costs and are not suitable for serial monitoring. Global gene expression profiles of 148 peripheral blood samples from transplant patients with excellent function and normal histology (TX; n = 46), AR (n = 63) and ADNR (n = 39), from two independent cohorts were analyzed with DNA microarrays. We applied a new normalization tool, frozen robust multi-array analysis, that is particularly suitable for clinical diagnostics; used multiple prediction tools to discover, refine and validate robust molecular classifiers; and tested a novel one-by-one analysis strategy to model the real clinical application of this test. Multiple three-way classifier tools identified 200 highest value probesets with sensitivity, specificity, positive predictive value, negative predictive value and area under the curve for the validation cohort ranging from 82% to 100%, 76% to 95%, 76% to 95%, 79% to 100%, 84% to 100% and 0.817 to 0.968, respectively. We conclude that peripheral blood gene expression profiling can be used as a minimally invasive tool to accurately reveal TX, AR and ADNR in the setting of acute kidney transplant dysfunction. PMID:24725967
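The performance measures named in the abstract above (sensitivity, specificity, positive and negative predictive value) all derive from the four cells of a confusion matrix. The sketch below shows those definitions for a single class treated one-vs-rest; the function name `diagnostic_metrics` and the counts in the usage line are hypothetical and are not taken from the study.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Return standard diagnostic metrics from confusion-matrix counts.

    tp, fp, tn, fn: true positives, false positives, true negatives and
    false negatives for one class versus all others. A zero denominator
    raises ZeroDivisionError; callers should guard against empty groups.
    """
    return {
        "sensitivity": tp / (tp + fn),  # fraction of true cases detected
        "specificity": tn / (tn + fp),  # fraction of non-cases correctly cleared
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }


# Hypothetical counts, for illustration only.
example = diagnostic_metrics(tp=45, fp=10, tn=40, fn=5)
```

For a three-way classifier such as TX versus AR versus ADNR, these quantities would be computed once per class, which is one plausible reason the abstract reports ranges instead of single values.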
Suleiman, Suleiman H; Koko, Mahmoud E; Nasir, Wafaa H; Elfateh, Ommnyiah; Elgizouli, Ubai K; Abdallah, Mohammed O E; Alfarouk, Khalid O; Hussain, Ayman; Faisal, Shima; Ibrahim, Fathelrahamn M A; Romano, Maurizio; Sultan, Ali; Banks, Lawrence; Newport, Melanie; Baralle, Francesco; Elhassan, Ahmed M; Mohamed, Hiba S; Ibrahim, Muntaser E
2015-01-01
The molecular basis of cancer and of its multiple phenotypes is not yet fully understood. Next Generation Sequencing promises new insight into the role of genetic interactions in shaping the complexity of cancer. Aiming to outline the differences in mutation patterns between familial colorectal cancer cases and controls, we analyzed whole exomes of cancer tissues and control samples from an extended colorectal cancer pedigree, providing one of the first data sets of exome sequencing of cancer in an African population against a background of large effective size typically with excess of variants. Tumors showed hMSH2 loss of function SNV consistent with Lynch syndrome. Sets of genes harboring insertions-deletions in tumor tissues revealed, however, significant GO enrichment, a feature that was not seen in control samples, suggesting that ordered insertions-deletions are central to tumorigenesis in this type of cancer. Network analysis identified multiple hub genes of centrality. ELAVL1/HuR showed remarkable centrality, interacting particularly with genes harboring non-synonymous SNVs, thus reinforcing the proposition of targeted mutagenesis in cancer pathways. A likely explanation for this mutation pattern is DNA/RNA editing, suggested here by a nucleotide transition-to-transversion ratio that significantly departed from expected values (p-value 5e-6). NFKB1 also showed significant centrality along with ELAVL1, raising the suspicion of viral etiology given the known interaction between oncogenic viruses and these proteins.
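The transition-to-transversion ratio cited in the abstract above is a simple count-based statistic. The sketch below shows how it can be computed from a list of single-nucleotide variants; the function name `ti_tv_ratio` and the input format (pairs of reference and alternate bases) are assumptions for illustration, and the abstract does not describe how the authors computed or tested their ratio.

```python
# Transitions swap purine for purine (A<->G) or pyrimidine for pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
BASES = {"A", "C", "G", "T"}


def ti_tv_ratio(snvs):
    """Return the transition/transversion ratio for (ref, alt) base pairs."""
    transitions = 0
    transversions = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            raise ValueError(f"not a valid SNV: {ref}>{alt}")
        if (ref, alt) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1  # every other substitution is a transversion
    if transversions == 0:
        raise ValueError("ratio undefined: no transversions observed")
    return transitions / transversions
```

Testing whether an observed ratio departs from an expected value requires a separate statistical test, which this sketch does not attempt.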
Youssef, Noha H; Blainey, Paul C; Quake, Stephen R; Elshahed, Mostafa S
2011-11-01
Members of candidate division OP11 are widely distributed in terrestrial and marine ecosystems, yet little information regarding their metabolic capabilities and ecological role within such habitats is currently available. Here, we report on the microfluidic isolation, multiple-displacement-amplification, pyrosequencing, and genomic analysis of a single cell (ZG1) belonging to candidate division OP11. Genome analysis of the ∼270-kb partial genome assembly obtained showed that it had no particular similarity to a specific phylum. Four hundred twenty-three open reading frames were identified, 46% of which had no function prediction. In-depth analysis revealed a heterotrophic lifestyle, with genes encoding endoglucanase, amylopullulanase, and laccase enzymes, suggesting a capacity for utilization of cellulose, starch, and, potentially, lignin, respectively. Genes encoding several glycolysis enzymes as well as formate utilization were identified, but no evidence for an electron transport chain was found. The presence of genes encoding various components of lipopolysaccharide biosynthesis indicates a Gram-negative bacterial cell wall. The partial genome also provides evidence for antibiotic resistance (β-lactamase, aminoglycoside phosphotransferase), as well as antibiotic production (bacteriocin) and extracellular bactericidal peptidases. Multiple mechanisms for stress response were identified, as were elements of type I and type IV secretion systems. Finally, housekeeping genes identified within the partial genome were used to demonstrate the OP11 affiliation of multiple hitherto unclassified genomic fragments from multiple database-deposited metagenomic data sets. These results provide the first glimpse into the lifestyle of a member of a ubiquitous, yet poorly understood bacterial candidate division.
Jockusch, Elizabeth L; Martínez-Solano, Iñigo; Timpe, Elizabeth K
2015-01-01
Species tree methods are now widely used to infer the relationships among species from multilocus data sets. Many methods have been developed, which differ in whether gene and species trees are estimated simultaneously or sequentially, and in how gene trees are used to infer the species tree. While these methods perform well on simulated data, less is known about what impacts their performance on empirical data. We used a data set including five nuclear genes and one mitochondrial gene for 22 species of Batrachoseps to compare the effects of method of analysis, within-species sampling and gene sampling on species tree inferences. For this data set, the choice of inference method had the largest effect on the species tree topology. Exclusion of individual loci had large effects in *BEAST and STEM, but not in MP-EST. Different loci carried the greatest leverage in these different methods, showing that the causes of their disproportionate effects differ. Even though substantial information was present in the nuclear loci, the mitochondrial gene dominated the *BEAST species tree. This leverage is inherent to the mtDNA locus and results from its high variation and lower assumed ploidy. This mtDNA leverage may be problematic when mtDNA has undergone introgression, as is likely in this data set. By contrast, the leverage of RAG1 in STEM analyses does not reflect properties inherent to the locus, but rather results from a gene tree that is strongly discordant with all others, and is best explained by introgression between distantly related species. Within-species sampling was also important, especially in *BEAST analyses, as shown by differences in tree topology across 100 subsampled data sets. Despite the sensitivity of the species tree methods to multiple factors, five species groups, the relationships among these, and some relationships within them, are generally consistently resolved for Batrachoseps. © The Author(s) 2014. 
Published by Oxford University Press, on behalf of the Society of Systematic Biologists.
Eronen, Lauri; Toivonen, Hannu
2012-06-06
Biological databases contain large amounts of data concerning the functions and associations of genes and proteins. Integration of data from several such databases into a single repository can aid the discovery of previously unknown connections spanning multiple types of relationships and databases. Biomine is a system that integrates cross-references from several biological databases into a graph model with multiple types of edges, such as protein interactions, gene-disease associations and gene ontology annotations. Edges are weighted based on their type, reliability, and informativeness. We present Biomine and evaluate its performance in link prediction, where the goal is to predict pairs of nodes that will be connected in the future, based on current data. In particular, we formulate protein interaction prediction and disease gene prioritization tasks as instances of link prediction. The predictions are based on a proximity measure computed on the integrated graph. We consider and experiment with several such measures, and perform a parameter optimization procedure where different edge types are weighted to optimize link prediction accuracy. We also propose a novel method for disease-gene prioritization, defined as finding a subset of candidate genes that cluster together in the graph. We experimentally evaluate Biomine by predicting future annotations in the source databases and prioritizing lists of putative disease genes. The experimental results show that Biomine has strong potential for predicting links when a set of selected candidate links is available. The predictions obtained using the entire Biomine dataset are shown to clearly outperform ones obtained using any single source of data alone, when different types of links are suitably weighted. 
In the gene prioritization task, an established reference set of disease-associated genes is useful, but the results show that under favorable conditions, Biomine can also perform well when no such information is available. The Biomine system is a proof of concept. Its current version contains 1.1 million entities and 8.1 million relations between them, with focus on human genetics. Some of its functionalities are available in a public query interface at http://biomine.cs.helsinki.fi, allowing searching for and visualizing connections between given biological entities.
Polygenic overlap between schizophrenia risk and antipsychotic response: a genomic medicine approach
Ruderfer, Douglas M; Charney, Alexander W; Readhead, Ben; Kidd, Brian A; Kähler, Anna K; Kenny, Paul J; Keiser, Michael J; Moran, Jennifer L; Hultman, Christina M; Scott, Stuart A; Sullivan, Patrick F; Purcell, Shaun M; Dudley, Joel T; Sklar, Pamela
2016-01-01
Summary Background Therapeutic treatments for schizophrenia do not alleviate symptoms for all patients and efficacy is limited by common, often severe, side-effects. Genetic studies of disease can identify novel drug targets, and drugs for which the mechanism has direct genetic support have increased likelihood of clinical success. Large-scale genetic studies of schizophrenia have increased the number of genes and gene sets associated with risk. We aimed to examine the overlap between schizophrenia risk loci and gene targets of a comprehensive set of medications to potentially inform and improve treatment of schizophrenia. Methods We defined schizophrenia risk loci as genomic regions reaching genome-wide significance in the latest Psychiatric Genomics Consortium schizophrenia genome-wide association study (GWAS) of 36 989 cases and 113 075 controls and loss of function variants observed only once among 5079 individuals in an exome-sequencing study of 2536 schizophrenia cases and 2543 controls (Swedish Schizophrenia Study). Using two large and orthogonally created databases, we collated drug targets into 167 gene sets targeted by pharmacologically similar drugs and examined enrichment of schizophrenia risk loci in these sets. We further linked the exome-sequenced data with a national drug registry (the Swedish Prescribed Drug Register) to assess the contribution of rare variants to treatment response, using clozapine prescription as a proxy for treatment resistance. Findings We combined results from testing rare and common variation and, after correction for multiple testing, two gene sets were associated with schizophrenia risk: agents against amoebiasis and other protozoal diseases (106 genes, p=0·00046, pcorrected =0·024) and antipsychotics (347 genes, p=0·00078, pcorrected=0·046). Further analysis pointed to antipsychotics as having independent enrichment after removing genes that overlapped these two target sets. 
We noted significant enrichment both in known targets of antipsychotics (70 genes, p=0·0078) and novel predicted targets (277 genes, p=0·019). Patients with treatment-resistant schizophrenia had an excess of rare disruptive variants in gene targets of antipsychotics (347 genes, p=0·0067) and in genes with evidence for a role in antipsychotic efficacy (91 genes, p=0·0029). Interpretation Our results support genetic overlap between schizophrenia pathogenesis and antipsychotic mechanism of action. This finding is consistent with treatment efficacy being polygenic and suggests that single-target therapeutics might be insufficient. We provide evidence of a role for rare functional variants in antipsychotic treatment response, pointing to a subset of patients where their genetic information could inform treatment. Finally, we present a novel framework for identifying treatments from genetic data and improving our understanding of therapeutic mechanism. PMID:26915512
Variation in the oxytocin receptor gene (OXTR) is associated with differences in moral judgment
Chaponis, Jonathan; Siburian, Richie; Gallagher, Patience; Ransohoff, Katherine; Wikler, Daniel; Perlis, Roy H.; Greene, Joshua D.
2016-01-01
Moral judgments are produced through the coordinated interaction of multiple neural systems, each of which relies on a characteristic set of neurotransmitters. Genes that produce or regulate these neurotransmitters may have distinctive influences on moral judgment. Two studies examined potential genetic influences on moral judgment using dilemmas that reliably elicit competing automatic and controlled responses, generated by dissociable neural systems. Study 1 (N = 228) examined 49 common variants (SNPs) within 10 candidate genes and identified a nominal association between a polymorphism (rs237889) of the oxytocin receptor gene (OXTR) and variation in deontological vs utilitarian moral judgment (that is, judgments favoring individual rights vs the greater good). An association was likewise observed for rs1042615 of the arginine vasopressin receptor gene (AVPR1A). Study 2 (N = 322) aimed to replicate these findings using the aforementioned dilemmas as well as a new set of structurally similar medical dilemmas. Study 2 failed to replicate the association with AVPR1A, but replicated the OXTR finding using both the original and new dilemmas. Together, these findings suggest that moral judgment is influenced by variation in the oxytocin receptor gene and, more generally, that single genetic polymorphisms can have a detectable effect on complex decision processes. PMID:27497314
Variation in the oxytocin receptor gene (OXTR) is associated with differences in moral judgment.
Bernhard, Regan M; Chaponis, Jonathan; Siburian, Richie; Gallagher, Patience; Ransohoff, Katherine; Wikler, Daniel; Perlis, Roy H; Greene, Joshua D
2016-12-01
Moral judgments are produced through the coordinated interaction of multiple neural systems, each of which relies on a characteristic set of neurotransmitters. Genes that produce or regulate these neurotransmitters may have distinctive influences on moral judgment. Two studies examined potential genetic influences on moral judgment using dilemmas that reliably elicit competing automatic and controlled responses, generated by dissociable neural systems. Study 1 (N = 228) examined 49 common variants (SNPs) within 10 candidate genes and identified a nominal association between a polymorphism (rs237889) of the oxytocin receptor gene (OXTR) and variation in deontological vs utilitarian moral judgment (that is, judgments favoring individual rights vs the greater good). An association was likewise observed for rs1042615 of the arginine vasopressin receptor gene (AVPR1A). Study 2 (N = 322) aimed to replicate these findings using the aforementioned dilemmas as well as a new set of structurally similar medical dilemmas. Study 2 failed to replicate the association with AVPR1A, but replicated the OXTR finding using both the original and new dilemmas. Together, these findings suggest that moral judgment is influenced by variation in the oxytocin receptor gene and, more generally, that single genetic polymorphisms can have a detectable effect on complex decision processes. © The Author (2016). Published by Oxford University Press.
Broad-Enrich: functional interpretation of large sets of broad genomic regions.
Cavalcante, Raymond G; Lee, Chee; Welch, Ryan P; Patil, Snehal; Weymouth, Terry; Scott, Laura J; Sartor, Maureen A
2014-09-01
Functional enrichment testing facilitates the interpretation of Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) data in terms of pathways and other biological contexts. Previous methods developed and used to test for key gene sets affected in ChIP-seq experiments treat peaks as points, and are based on the number of peaks associated with a gene or a binary score for each gene. These approaches work well for transcription factors, but histone modifications often occur over broad domains, and across multiple genes. To incorporate the unique properties of broad domains into functional enrichment testing, we developed Broad-Enrich, a method that uses the proportion of each gene's locus covered by a peak. We show that our method has a well-calibrated false-positive rate, performing well with ChIP-seq data having broad domains compared with alternative approaches. We illustrate Broad-Enrich with 55 ENCODE ChIP-seq datasets using different methods to define gene loci. Broad-Enrich can also be applied to other datasets consisting of broad genomic domains such as copy number variations. http://broad-enrich.med.umich.edu for Web version and R package. Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press.
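The per-gene quantity that the Broad-Enrich abstract describes, the proportion of a gene's locus covered by a peak, can be sketched as an interval computation. This covers only that quantity and not the enrichment test built on it, which the abstract does not detail. The function name `covered_fraction` and the half-open `(start, end)` coordinate convention are assumptions for illustration.

```python
def covered_fraction(locus, peaks):
    """Fraction of a gene locus covered by at least one peak.

    locus: (start, end) half-open interval on one chromosome.
    peaks: iterable of (start, end) half-open intervals on the same chromosome.
    Overlapping peaks are merged so no base is counted twice.
    """
    start, end = locus
    if end <= start:
        raise ValueError("locus must have positive length")

    # Keep only peaks that overlap the locus, clipped to its boundaries.
    clipped = sorted(
        (max(s, start), min(e, end)) for s, e in peaks if s < end and e > start
    )

    covered = 0
    current_start, current_end = None, None
    for s, e in clipped:
        if current_end is None or s > current_end:
            # Start a new merged block; bank the length of the previous one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = s, e
        else:
            current_end = max(current_end, e)  # extend the current block
    if current_end is not None:
        covered += current_end - current_start

    return covered / (end - start)
```

A point-based approach would score this locus by peak count or presence alone, whereas the proportion distinguishes a gene that is lightly touched by a broad domain from one that is almost entirely covered.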
MAAMD: a workflow to standardize meta-analyses and comparison of Affymetrix microarray data
2014-01-01
Background Mandatory deposit of raw microarray data files for public access, prior to study publication, provides significant opportunities to conduct new bioinformatics analyses within and across multiple datasets. Analysis of raw microarray data files (e.g. Affymetrix CEL files) can be time consuming and complex, and requires fundamental computational and bioinformatics skills. The development of analytical workflows to automate these tasks simplifies the processing of, improves the efficiency of, and serves to standardize multiple and sequential analyses. Once installed, workflows facilitate the tedious steps required to run rapid intra- and inter-dataset comparisons. Results We developed a workflow to facilitate and standardize Meta-Analysis of Affymetrix Microarray Data (MAAMD) in Kepler. Two freely available stand-alone software tools, R and AltAnalyze, were embedded in MAAMD. The inputs of MAAMD are user-editable csv files, which contain sample information and parameters describing the locations of input files and required tools. MAAMD was tested by analyzing 4 different GEO datasets from mice and drosophila. MAAMD automates data downloading, data organization, data quality control assessment, differential gene expression analysis, clustering analysis, pathway visualization, gene-set enrichment analysis, and cross-species orthologous-gene comparisons. MAAMD was utilized to identify gene orthologues responding to hypoxia or hyperoxia in both mice and drosophila. The entire set of analyses for 4 datasets (34 total microarrays) finished in approximately one hour. Conclusions MAAMD saves time, minimizes the required computer skills, and offers a standardized procedure for users to analyze microarray datasets and make new intra- and inter-dataset comparisons. PMID:24621103
Tuller, Tamir; Atar, Shimshi; Ruppin, Eytan; Gurevich, Michael; Achiron, Anat
2011-09-15
Multiple sclerosis (MS) is a central nervous system autoimmune inflammatory T-cell-mediated disease with a relapsing-remitting course in the majority of patients. In this study, we performed a high-resolution systems biology analysis of gene expression and physical interactions in MS relapse and remission. To this end, we integrated 164 large-scale measurements of gene expression in peripheral blood mononuclear cells of MS patients in relapse or remission and healthy subjects, with large-scale information about the physical interactions between these genes obtained from public databases. These data were analyzed with a variety of computational methods. We find that there is a clear and significant global network-level signal that is related to the changes in gene expression of MS patients in comparison to healthy subjects. However, despite the clear differences in the clinical symptoms of MS patients in relapse versus remission, the network-level signal is weaker when comparing patients in these two stages of the disease. This result suggests that most of the genes have relatively similar expression levels in the two stages of the disease. In accordance with previous studies, we found that the pathways related to regulation of cell death, chemotaxis and inflammatory response are differentially expressed in the disease in comparison to healthy subjects, while pathways related to cell adhesion, cell migration and cell-cell signaling are activated in relapse in comparison to remission. However, the current study includes a detailed report of the exact set of genes involved in these pathways and the interactions between them. For example, we found that the genes TP53 and IL1 are 'network hubs' that interact with many of the differentially expressed genes in MS patients versus healthy subjects, and the epidermal growth factor receptor is a 'network hub' in the case of MS patients with relapse versus remission.
The statistical approaches employed in this study enabled us to report new sets of genes that according to their gene expression and physical interactions are predicted to be differentially expressed in MS versus healthy subjects, and in MS patients in relapse versus remission. Some of these genes may be useful biomarkers for diagnosing MS and predicting relapses in MS patients.
Nozaki, Hisayoshi; Yang, Yi; Maruyama, Shinichiro; Suzaki, Toshinobu
2012-01-01
Recent multigene phylogenetic analyses have contributed much to our understanding of eukaryotic phylogeny. However, the phylogenetic positions of various lineages within the eukaryotes have remained unresolved or in conflict between different phylogenetic studies. These phylogenetic ambiguities might have resulted from a combination of factors, including limited taxon sampling, missing data in the alignment, saturation of rapidly evolving genes, mixed analyses of short- and long-branched operational taxonomic units (OTUs), and intracellular endoparasite and ciliate OTUs with unusual substitutions. In order to evaluate the effects of co-analyzed intracellular endoparasite and ciliate OTUs on the eukaryotic phylogeny and to simplify the results, we here used two different sets of data matrices of multiple slowly evolving genes with small amounts of missing data and examined the phylogenetic position of the secondary photosynthetic chromalveolates Haptophyta, one of the most abundant groups of oceanic phytoplankton and significant primary producers. In both sets, a robust sister relationship between Haptophyta and SAR (stramenopiles, alveolates, rhizarians) or SA (stramenopiles and alveolates) was resolved when intracellular endoparasite/ciliate OTUs were excluded, but not in their presence. Based on comparisons of character optimizations on a fixed tree (with a clade composed of haptophytes and SAR or SA), disruption of the monophyly between haptophytes and SAR (or SA) in the presence of intracellular endoparasite/ciliate OTUs can be considered to be a result of multiple evolutionary reversals of character positions that supported the synapomorphy of the haptophyte and SAR (or SA) clade in the absence of intracellular endoparasite/ciliate OTUs.
Comparative analysis and visualization of multiple collinear genomes
2012-01-01
Background Genome browsers are a common tool used by biologists to visualize genomic features including genes, polymorphisms, and many others. However, existing genome browsers and visualization tools are not well-suited to perform meaningful comparative analysis among a large number of genomes. With the increasing quantity and availability of genomic data, there is an increased burden to provide useful visualization and analysis tools for comparison of multiple collinear genomes such as the large panels of model organisms which are the basis for much of the current genetic research. Results We have developed a novel web-based tool for visualizing and analyzing multiple collinear genomes. Our tool illustrates genome-sequence similarity through a mosaic of intervals representing local phylogeny, subspecific origin, and haplotype identity. Comparative analysis is facilitated through reordering and clustering of tracks, which can vary throughout the genome. In addition, we provide local phylogenetic trees as an alternate visualization to assess local variations. Conclusions Unlike previous genome browsers and viewers, ours allows for simultaneous and comparative analysis. Our browser provides intuitive selection and interactive navigation about features of interest. Dynamic visualizations adjust to scale and data content making analysis at variable resolutions and of multiple data sets more informative. We demonstrate our genome browser for an extensive set of genomic data sets composed of almost 200 distinct mouse laboratory strains. PMID:22536897
An integrated map of structural variation in 2,504 human genomes.
Sudmant, Peter H; Rausch, Tobias; Gardner, Eugene J; Handsaker, Robert E; Abyzov, Alexej; Huddleston, John; Zhang, Yan; Ye, Kai; Jun, Goo; Fritz, Markus Hsi-Yang; Konkel, Miriam K; Malhotra, Ankit; Stütz, Adrian M; Shi, Xinghua; Casale, Francesco Paolo; Chen, Jieming; Hormozdiari, Fereydoun; Dayama, Gargi; Chen, Ken; Malig, Maika; Chaisson, Mark J P; Walter, Klaudia; Meiers, Sascha; Kashin, Seva; Garrison, Erik; Auton, Adam; Lam, Hugo Y K; Mu, Xinmeng Jasmine; Alkan, Can; Antaki, Danny; Bae, Taejeong; Cerveira, Eliza; Chines, Peter; Chong, Zechen; Clarke, Laura; Dal, Elif; Ding, Li; Emery, Sarah; Fan, Xian; Gujral, Madhusudan; Kahveci, Fatma; Kidd, Jeffrey M; Kong, Yu; Lameijer, Eric-Wubbo; McCarthy, Shane; Flicek, Paul; Gibbs, Richard A; Marth, Gabor; Mason, Christopher E; Menelaou, Androniki; Muzny, Donna M; Nelson, Bradley J; Noor, Amina; Parrish, Nicholas F; Pendleton, Matthew; Quitadamo, Andrew; Raeder, Benjamin; Schadt, Eric E; Romanovitch, Mallory; Schlattl, Andreas; Sebra, Robert; Shabalin, Andrey A; Untergasser, Andreas; Walker, Jerilyn A; Wang, Min; Yu, Fuli; Zhang, Chengsheng; Zhang, Jing; Zheng-Bradley, Xiangqun; Zhou, Wanding; Zichner, Thomas; Sebat, Jonathan; Batzer, Mark A; McCarroll, Steven A; Mills, Ryan E; Gerstein, Mark B; Bashir, Ali; Stegle, Oliver; Devine, Scott E; Lee, Charles; Eichler, Evan E; Korbel, Jan O
2015-10-01
Structural variants are implicated in numerous diseases and make up the majority of varying nucleotides among human genomes. Here we describe an integrated set of eight structural variant classes comprising both balanced and unbalanced variants, which we constructed using short-read DNA sequencing data and statistically phased onto haplotype blocks in 26 human populations. Analysing this set, we identify numerous gene-intersecting structural variants exhibiting population stratification and describe naturally occurring homozygous gene knockouts that suggest the dispensability of a variety of human genes. We demonstrate that structural variants are enriched on haplotypes identified by genome-wide association studies and exhibit enrichment for expression quantitative trait loci. Additionally, we uncover appreciable levels of structural variant complexity at different scales, including genic loci subject to clusters of repeated rearrangement and complex structural variants with multiple breakpoints likely to have formed through individual mutational events. Our catalogue will enhance future studies into structural variant demography, functional impact and disease association.
Cloning and analysis of the positively acting regulatory gene amdR from Aspergillus nidulans.
Andrianopoulos, A; Hynes, M J
1988-01-01
The positively acting regulatory gene amdR of Aspergillus nidulans coordinately regulates the expression of four unlinked structural genes involved in acetamide (amdS), omega amino acid (gatA and gabA), and lactam (lamA) catabolism. By the use of DNA-mediated transformation of A. nidulans, the amdR regulatory gene was cloned from a genomic cosmid library. Southern blot analysis of DNA from various loss-of-function amdR mutants revealed the presence of four detectable DNA rearrangements, including a deletion, an insertion, and a translocation. No detectable DNA rearrangements were found in several constitutive amdRc mutants. Analysis of the fate of amdR-bearing plasmids in transformants showed that 10 to 20% of the transformation events were homologous integrations or gene conversions, and this phenomenon was exploited in developing a strategy by which amdRc and amdR- alleles can be readily cloned and analyzed. Examination of the transcription of amdR by Northern blot (RNA blot) analysis revealed the presence of two mRNAs (2.7 and 1.8 kilobases) which were constitutively synthesized at a very low level. In addition, amdR transcription did not appear to depend on the presence of a functional amdR product nor was it altered in amdRc mutants. The dosage effects of multiple copies of amdR in transformants were examined, and it was shown that such transformants exhibited stronger growth than did the wild type on acetamide and pyrrolidinone media, indicating increased expression of the amdS and lamA genes, respectively. These results were used to formulate a model for amdR-mediated regulation of gene expression in which the low constitutive level of amdR product sets the upper limits of basal and induced transcription of the structural genes. Multiple copies of 5' sequences from the amdS gene can result in reduced growth on substrates whose utilization is dependent on amdR-controlled genes. This has been attributed to titration of limiting amdR gene product. 
Strong support for this proposal was obtained by showing that multiple copies of the amdR gene can reverse this phenomenon (antititration). PMID:3062382
Detection of multiple perturbations in multi-omics biological networks.
Griffin, Paula J; Zhang, Yuqing; Johnson, William Evan; Kolaczyk, Eric D
2018-05-17
Cellular mechanism-of-action is of fundamental concern in many biological studies. It is of particular interest for identifying the cause of disease and learning the way in which treatments act against disease. However, pinpointing such mechanisms is difficult, due to the fact that small perturbations to the cell can have wide-ranging downstream effects. Given a snapshot of cellular activity, it can be challenging to tell where a disturbance originated. The presence of an ever-greater variety of high-throughput biological data offers an opportunity to examine cellular behavior from multiple angles, but also presents the statistical challenge of how to effectively analyze data from multiple sources. In this setting, we propose a method for mechanism-of-action inference by extending network filtering to multi-attribute data. We first estimate a joint Gaussian graphical model across multiple data types using penalized regression and filter for network effects. We then apply a set of likelihood ratio tests to identify the most likely site of the original perturbation. In addition, we propose a conditional testing procedure to allow for detection of multiple perturbations. We demonstrate this methodology on paired gene expression and methylation data from The Cancer Genome Atlas (TCGA). © 2018, The International Biometric Society.
Binder, Andreas; Lambert, Jayne; Morbitzer, Robert; Popp, Claudia; Ott, Thomas; Lahaye, Thomas; Parniske, Martin
2014-01-01
The Golden Gate (GG) modular assembly approach offers a standardized, inexpensive and reliable way to ligate multiple DNA fragments in a pre-defined order in a single-tube reaction. We developed a GG based toolkit for the flexible construction of binary plasmids for transgene expression in plants. Starting from a common set of modules, such as promoters, protein tags and transcribed regions of interest, synthetic genes are assembled, which can be further combined to multigene constructs. As an example, we created T-DNA constructs encoding multiple fluorescent proteins targeted to distinct cellular compartments (nucleus, cytosol, plastids) and demonstrated simultaneous expression of all genes in Nicotiana benthamiana, Lotus japonicus and Arabidopsis thaliana. We assembled an RNA interference (RNAi) module for the construction of intron-spliced hairpin RNA constructs and demonstrated silencing of GFP in N. benthamiana. By combination of the silencing construct together with a codon adapted rescue construct into one vector, our system facilitates genetic complementation and thus confirmation of the causative gene responsible for a given RNAi phenotype. As proof of principle, we silenced a destabilized GFP gene (dGFP) and restored GFP fluorescence by expression of a recoded version of dGFP, which was not targeted by the silencing construct. PMID:24551083
Discovering time-lagged rules from microarray data using gene profile classifiers
2011-01-01
Background Gene regulatory networks have an essential role in every process of life. In this regard, the amount of genome-wide time series data is becoming increasingly available, providing the opportunity to discover the time-delayed gene regulatory networks that govern the majority of these molecular processes. Results This paper aims at reconstructing gene regulatory networks from multiple genome-wide microarray time series datasets. In this sense, a new model-free algorithm called GRNCOP2 (Gene Regulatory Network inference by Combinatorial OPtimization 2), which is a significant evolution of the GRNCOP algorithm, was developed using combinatorial optimization of gene profile classifiers. The method is capable of inferring potential time-delay relationships with any span of time between genes from various time series datasets given as input. The proposed algorithm was applied to time series data composed of twenty yeast genes that are highly relevant for the cell-cycle study, and the results were compared against several related approaches. The outcomes have shown that GRNCOP2 outperforms the contrasted methods in terms of the proposed metrics, and that the results are consistent with previous biological knowledge. Additionally, a genome-wide study on multiple publicly available time series data was performed. In this case, the experimentation has exhibited the soundness and scalability of the new method which inferred highly-related statistically-significant gene associations. Conclusions A novel method for inferring time-delayed gene regulatory networks from genome-wide time series datasets is proposed in this paper. The method was carefully validated with several publicly available data sets. The results have demonstrated that the algorithm constitutes a usable model-free approach capable of predicting meaningful relationships between genes, revealing the time-trends of gene regulation. PMID:21524308
Kneidinger, Doris; Ibrišimović, Mirza; Lion, Thomas; Klein, Reinhard
2012-06-01
Human adenoviruses are a common threat to immunocompromised patients, e.g., HIV-positive individuals or solid-organ and, in particular, allogeneic stem cell transplant recipients. Antiviral drugs have a limited effect on adenoviruses, and existing treatment modalities often fail to prevent a fatal outcome. Silencing of viral genes by short interfering RNAs (siRNAs) holds great promise in the treatment of viral infections. The aim of the present study was to identify adenoviral candidate targets for RNA interference-mediated inhibition of adenoviral replication. We investigated the impact of silencing of a set of early, middle, and late viral genes on the replication of adenovirus 5 in vitro. Adenovirus replication was inhibited by siRNAs directed against the adenoviral E1A, DNA polymerase, preterminal protein (pTP), IVa2, hexon, and protease genes. Silencing of early and middle genes was more effective in inhibiting adenovirus multiplication than was silencing of late genes. An siRNA directed against the viral DNA polymerase mRNA decreased viral genome copy numbers and infectious virus progeny by several orders of magnitude. Since silencing of any of the early genes directly or indirectly affected viral DNA synthesis, our data suggest that reducing viral genome copy numbers is a more promising strategy for the treatment of adenoviral infections than is reducing the numbers of proteins necessary for capsid generation. Thus, adenoviral DNA replication was identified as a key target for RNAi-mediated inhibition of adenovirus multiplication. In addition, the E1A transcripts emerged as a second important target, because their knockdown markedly improved the viability of cells at late stages of infection. Copyright © 2012 Elsevier B.V. All rights reserved.
Kristjansdottir, G; Sandling, J K; Bonetti, A; Roos, I M; Milani, L; Wang, C; Gustafsdottir, S M; Sigurdsson, S; Lundmark, A; Tienari, P J; Koivisto, K; Elovaara, I; Pirttilä, T; Reunanen, M; Peltonen, L; Saarela, J; Hillert, J; Olsson, T; Landegren, U; Alcina, A; Fernández, O; Leyva, L; Guerrero, M; Lucas, M; Izquierdo, G; Matesanz, F; Syvänen, A-C
2008-01-01
Background: IRF5 is a transcription factor involved both in the type I interferon and the toll-like receptor signalling pathways. Previously, IRF5 has been found to be associated with systemic lupus erythematosus, rheumatoid arthritis and inflammatory bowel diseases. Here we investigated whether polymorphisms in the IRF5 gene would be associated with yet another disease with features of autoimmunity, multiple sclerosis (MS). Methods: We genotyped nine single nucleotide polymorphisms and one insertion-deletion polymorphism in the IRF5 gene in a collection of 2337 patients with MS and 2813 controls from three populations: two case–control cohorts from Spain and Sweden, and a set of MS trio families from Finland. Results: Two single nucleotide polymorphisms (SNPs; rs4728142 and rs3807306) and a 5 bp insertion-deletion polymorphism, located in the promoter and first intron of the IRF5 gene, showed association signals with values of p<0.001 when the data from all cohorts were combined. The predisposing alleles were present on the same common haplotype in all populations. Using electrophoretic mobility shift assays we observed allele-specific differences in protein binding for the SNP rs4728142 and the 5 bp indel, and by a proximity ligation assay we demonstrated increased binding of the transcription factor SP1 to the risk allele of the 5 bp indel. Conclusion: These findings add IRF5 to the short list of genes shown to be associated with MS in more than one population. Our study adds to the evidence that there might be genes or pathways that are common in multiple autoimmune diseases, and that the type I interferon system is likely to be involved in the development of these diseases. PMID:18285424
Bao, Yun-Juan; Liang, Zhong; Mayfield, Jeffrey A.; McShan, William M.; Lee, Shaun W.; Ploplis, Victoria A.; Castellino, Francis J.
2016-01-01
Symmetric genomic rearrangements around replication axes are commonly observed in prokaryotic genomes, including Group A Streptococcus (GAS). However, asymmetric rearrangements are rare. Our previous studies showed that the hypervirulent invasive GAS strain, M23ND, containing an inactivated transcriptional regulator system, covRS, exhibits unique extensive asymmetric rearrangements, which reconstructed a genomic structure distinct from other GAS genomes. In the current investigation, we identified the rearrangement events and examined the genetic consequences and evolutionary implications underlying the rearrangements. By comparison with a close phylogenetic relative, M18-MGAS8232, we propose a molecular model wherein a series of asymmetric rearrangements have occurred in M23ND, involving translocations, inversions and integrations mediated by multiple factors, viz., rRNA-comX (factor for late competence), transposons and phage-encoded gene segments. Assessments of the cumulative gene orientations and GC skews reveal that the asymmetric genomic rearrangements did not affect the general genomic integrity of the organism. However, functional distributions reveal re-clustering of a broad set of CovRS-regulated actively transcribed genes, including virulence factors and metabolic genes, to the same leading strand, with high confidence (p-value ~10^-10). The re-clustering of the genes suggests a potential selection advantage for the spatial proximity to the transcription complexes, which may contain the global transcriptional regulator, CovRS, and other RNA polymerases. Their proximities allow for efficient transcription of the genes required for growth, virulence and persistence. A new paradigm of survival strategies of GAS strains is provided through multiple genomic rearrangements, while, at the same time, maintaining genomic integrity. PMID:27329479
DynGO: a tool for visualizing and mining of Gene Ontology and its associations
Liu, Hongfang; Hu, Zhang-Zhi; Wu, Cathy H
2005-01-01
Background A large volume of data and information about genes and gene products has been stored in various molecular biology databases. A major challenge for knowledge discovery using these databases is to identify related genes and gene products in disparate databases. The development of Gene Ontology (GO) as a common vocabulary for annotation allows integrated queries across multiple databases and identification of semantically related genes and gene products (i.e., genes and gene products that have similar GO annotations). Meanwhile, dozens of tools have been developed for browsing, mining or editing GO terms, their hierarchical relationships, or their "associated" genes and gene products (i.e., genes and gene products annotated with GO terms). Tools that allow users to directly search and inspect relations among all GO terms and their associated genes and gene products from multiple databases are needed. Results We present a standalone package called DynGO, which provides several advanced functionalities in addition to the standard browsing capability of the official GO browsing tool (AmiGO). DynGO allows users to conduct batch retrieval of GO annotations for a list of genes and gene products, and semantic retrieval of genes and gene products sharing similar GO annotations. The results are shown in an association tree organized according to GO hierarchies and supported with many dynamic display options such as sorting tree nodes or changing the orientation of the tree. For GO curators and frequent GO users, DynGO provides fast and convenient access to GO annotation data. DynGO is generally applicable to any data set where the records are annotated with GO terms, as illustrated by two examples. Conclusion We have presented a standalone package, DynGO, that provides functionalities to search and browse GO and its association databases as well as several additional functions such as batch retrieval and semantic retrieval. 
The complete documentation and software are freely available for download from the website . PMID:16091147
Multiple DNA and protein sequence alignment on a workstation and a supercomputer.
Tajima, K
1988-11-01
This paper describes a multiple alignment method using a workstation and supercomputer. The method is based on the alignment of a set of aligned sequences with the new sequence, and uses a recursive procedure of such alignment. The alignment is executed in a reasonable computation time on diverse levels from a workstation to a supercomputer, from the viewpoint of alignment results and computational speed by parallel processing. The application of the algorithm is illustrated by several examples of multiple alignment of 12 amino acid and DNA sequences of HIV (human immunodeficiency virus) env genes. Colour graphic programs on a workstation and parallel processing on a supercomputer are discussed.
aGEM: an integrative system for analyzing spatial-temporal gene-expression information
Jiménez-Lozano, Natalia; Segura, Joan; Macías, José Ramón; Vega, Juanjo; Carazo, José María
2009-01-01
Motivation: The work presented here describes the ‘anatomical Gene-Expression Mapping (aGEM)’ Platform, a development conceived to integrate phenotypic information with the spatial and temporal distributions of genes expressed in the mouse. The aGEM Platform has been built by extending the Distributed Annotation System (DAS) protocol, which was originally designed to share genome annotations over the WWW. DAS is a client-server system in which a single client integrates information from multiple distributed servers. Results: The aGEM Platform provides information to answer three main questions. (i) Which genes are expressed in a given mouse anatomical component? (ii) In which mouse anatomical structures are a given gene or set of genes expressed? And (iii) is there any correlation among these findings? Currently, this Platform includes several well-known mouse resources (EMAGE, GXD and GENSAT), hosting gene-expression data mostly obtained from in situ techniques together with a broad set of image-derived annotations. Availability: The Platform is optimized for Firefox 3.0 and it is accessed through a friendly and intuitive display: http://agem.cnb.csic.es Contact: natalia@cnb.csic.es Supplementary information: Supplementary data are available at http://bioweb.cnb.csic.es/VisualOmics/aGEM/home.html and http://bioweb.cnb.csic.es/VisualOmics/index_VO.html and Bioinformatics online. PMID:19592395
paraGSEA: a scalable approach for large-scale gene expression profiling
Peng, Shaoliang; Yang, Shunyun
2017-01-01
More studies have been conducted using gene expression similarity to identify functional connections among genes, diseases and drugs. Gene Set Enrichment Analysis (GSEA) is a powerful analytical method for interpreting gene expression data. However, due to its enormous computational overhead in the significance-level estimation step and the multiple hypothesis testing step, its computational scalability and efficiency are poor on large-scale datasets. We proposed paraGSEA for efficient large-scale transcriptome data analysis. By optimization, the overall time complexity of paraGSEA is reduced from O(mn) to O(m+n), where m is the length of the gene sets and n is the length of the gene expression profiles, which contributes a more than 100-fold increase in performance compared with other popular GSEA implementations such as GSEA-P, SAM-GS and GSEA2. By further parallelization, a near-linear speed-up is gained on both workstations and clusters, with high scalability and performance on large-scale datasets. The analysis time of the whole LINCS phase I dataset (GSE92742) was reduced to nearly half an hour on a 1000-node cluster on Tianhe-2, or within 120 hours on a 96-core workstation. The source code of paraGSEA is licensed under the GPLv3 and available at http://github.com/ysycloud/paraGSEA. PMID:28973463
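As a rough illustration of the quantity GSEA computes, the sketch below implements the unweighted (Kolmogorov-Smirnov-like) running-sum enrichment score in a single pass over a ranked gene list. It is not paraGSEA's code and says nothing about how paraGSEA achieves its stated complexity; the function name and the toy gene identifiers are assumptions made for this example.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score (illustrative sketch only).

    ranked_genes: genes ordered from most up- to most down-regulated.
    gene_set: iterable of gene identifiers forming the set of interest.
    """
    members = set(gene_set)  # set lookup keeps each membership test O(1)
    n = len(ranked_genes)
    hits = sum(1 for g in ranked_genes if g in members)
    if hits == 0 or hits == n:
        return 0.0  # score is undefined without both hits and misses
    hit_step = 1.0 / hits          # increment when a set member is met
    miss_step = 1.0 / (n - hits)   # decrement for every non-member
    running = 0.0
    best = 0.0
    for g in ranked_genes:
        running += hit_step if g in members else -miss_step
        if abs(running) > abs(best):
            best = running  # keep the maximum deviation from zero
    return best


ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
top_score = enrichment_score(ranked, {"g1", "g2"})
bottom_score = enrichment_score(ranked, {"g5", "g6"})
```

A set concentrated at the top of the ranking gives a positive score and one concentrated at the bottom gives a negative score; in a full GSEA run the significance of the score would then be estimated by permutation, which is the expensive step the abstract refers to.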
cDREM: inferring dynamic combinatorial gene regulation.
Wise, Aaron; Bar-Joseph, Ziv
2015-04-01
Genes are often combinatorially regulated by multiple transcription factors (TFs). Such combinatorial regulation plays an important role in development and facilitates the ability of cells to respond to different stresses. While a number of approaches have utilized sequence and ChIP-based datasets to study combinational regulation, these have often ignored the combinational logic and the dynamics associated with such regulation. Here we present cDREM, a new method for reconstructing dynamic models of combinatorial regulation. cDREM integrates time series gene expression data with (static) protein interaction data. The method is based on a hidden Markov model and utilizes the sparse group Lasso to identify small subsets of combinatorially active TFs, their time of activation, and the logical function they implement. We tested cDREM on yeast and human data sets. Using yeast we show that the predicted combinatorial sets agree with other high throughput genomic datasets and improve upon prior methods developed to infer combinatorial regulation. Applying cDREM to study human response to flu, we were able to identify several combinatorial TF sets, some of which were known to regulate immune response while others represent novel combinations of important TFs.
DuBois, Debra C; Piel, William H; Jusko, William J
2008-01-01
High-throughput data collection using gene microarrays has great potential as a method for addressing the pharmacogenomics of complex biological systems. Similarly, mechanism-based pharmacokinetic/pharmacodynamic modeling provides a tool for formulating quantitative testable hypotheses concerning the responses of complex biological systems. As the response of such systems to drugs generally entails cascades of molecular events in time, a time series design provides the best approach to capturing the full scope of drug effects. A major problem in using microarrays for high-throughput data collection is sorting through the massive amount of data in order to identify probe sets and genes of interest. Due to its inherent redundancy, a rich time series containing many time points and multiple samples per time point allows for the use of less stringent criteria of expression, expression change and data quality for initial filtering of unwanted probe sets. The remaining probe sets can then become the focus of more intense scrutiny by other methods, including temporal clustering, functional clustering and pharmacokinetic/pharmacodynamic modeling, which provide additional ways of identifying the probes and genes of pharmacological interest. PMID:15212590
DLRS: gene tree evolution in light of a species tree.
Sjöstrand, Joel; Sennblad, Bengt; Arvestad, Lars; Lagergren, Jens
2012-11-15
PrIME-DLRS (or colloquially: 'Delirious') is a phylogenetic software tool to simultaneously infer and reconcile a gene tree given a species tree. It accounts for duplication and loss events and a relaxed molecular clock, and is intended for the study of homologous gene families, for example in a comparative genomics setting involving multiple species. PrIME-DLRS uses a Bayesian MCMC framework, where the input is a known species tree with divergence times and a multiple sequence alignment, and the output is a posterior distribution over gene trees and model parameters. PrIME-DLRS is available for Java SE 6+ under the New BSD License, and JAR files and source code can be downloaded from http://code.google.com/p/jprime/. There is also a slightly older C++ version available as a binary package for Ubuntu, with download instructions at http://prime.sbc.su.se. The C++ source code is available upon request. Contact: joel.sjostrand@scilifelab.se or jens.lagergren@scilifelab.se. PrIME-DLRS is based on a sound probabilistic model (Åkerborg et al., 2009) and has been thoroughly validated on synthetic and biological datasets (Supplementary Material online).
Identifying Loci Under Selection Against Gene Flow in Isolation-with-Migration Models
Sousa, Vitor C.; Carneiro, Miguel; Ferrand, Nuno; Hey, Jody
2013-01-01
When divergence occurs in the presence of gene flow, there can arise an interesting dynamic in which selection against gene flow, at sites associated with population-specific adaptations or genetic incompatibilities, can cause net gene flow to vary across the genome. Loci linked to sites under selection may experience reduced gene flow and may experience genetic bottlenecks by the action of nearby selective sweeps. Data from histories such as these may be poorly fitted by conventional neutral model approaches to demographic inference, which treat all loci as equally subject to forces of genetic drift and gene flow. To allow for demographic inference in the face of such histories, as well as the identification of loci affected by selection, we developed an isolation-with-migration model that explicitly provides for variation among genomic regions in migration rates and/or rates of genetic drift. The method allows for loci to fall into any of multiple groups, each characterized by a different set of parameters, thus relaxing the assumption that all loci share the same demography. By grouping loci, the method can be applied to data with multiple loci and still have tractable dimensionality and statistical power. We studied the performance of the method using simulated data, and we applied the method to study the divergence of two subspecies of European rabbits (Oryctolagus cuniculus). PMID:23457232
Genome-Wide Gene Set Analysis for Identification of Pathways Associated with Alcohol Dependence
Biernacka, Joanna M.; Geske, Jennifer; Jenkins, Gregory D.; Colby, Colin; Rider, David N.; Karpyak, Victor M.; Choi, Doo-Sup; Fridley, Brooke L.
2013-01-01
It is believed that multiple genetic variants with small individual effects contribute to the risk of alcohol dependence. Such polygenic effects are difficult to detect in genome-wide association studies that test for association of the phenotype with each single nucleotide polymorphism (SNP) individually. To overcome this challenge, gene set analysis (GSA) methods that jointly test for the effects of pre-defined groups of genes have been proposed. Rather than testing for association between the phenotype and individual SNPs, these analyses evaluate the global evidence of association with a set of related genes enabling the identification of cellular or molecular pathways or biological processes that play a role in development of the disease. It is hoped that by aggregating the evidence of association for all available SNPs in a group of related genes, these approaches will have enhanced power to detect genetic associations with complex traits. We performed GSA using data from a genome-wide study of 1165 alcohol dependent cases and 1379 controls from the Study of Addiction: Genetics and Environment (SAGE), for all 200 pathways listed in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database. Results demonstrated a potential role of the “Synthesis and Degradation of Ketone Bodies” pathway. Our results also support the potential involvement of the “Neuroactive Ligand Receptor Interaction” pathway, which has previously been implicated in addictive disorders. These findings demonstrate the utility of GSA in the study of complex disease, and suggest specific directions for further research into the genetic architecture of alcohol dependence. PMID:22717047
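One simple way to aggregate SNP-level evidence across a gene set is Fisher's combined probability test, sketched below. The abstract does not state which GSA statistic was used, so this is an illustrative stand-in rather than the authors' method; the function name and the example p-values are hypothetical. The test also assumes independent p-values, which linkage disequilibrium among SNPs violates in practice, so real analyses typically calibrate the statistic by permutation.

```python
import math


def fisher_combined_p(p_values):
    """Combine p-values (each in (0, 1]) with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null when the k tests are
    independent. For even degrees of freedom the survival function has
    the closed form exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p-value")
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2
    term = 1.0   # (X/2)^0 / 0!
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i  # builds (X/2)^i / i! incrementally
        total += term
    return math.exp(-half_stat) * total


# Hypothetical SNP-level p-values for one pathway.
pathway_p = fisher_combined_p([0.05, 0.05])
```

With a single p-value the function returns that p-value unchanged, which is a quick sanity check on the closed form.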
Chikkagoudar, Satish; Wang, Kai; Li, Mingyao
2011-05-26
Gene-gene interaction in genetic association studies is computationally intensive when a large number of SNPs are involved. Most of the latest Central Processing Units (CPUs) have multiple cores, whereas Graphics Processing Units (GPUs) also have hundreds of cores and have been recently used to implement faster scientific software. However, currently there are no genetic analysis software packages that allow users to fully utilize the computing power of these multi-core devices for genetic interaction analysis for binary traits. Here we present a novel software package GENIE, which utilizes the power of multiple GPU or CPU processor cores to parallelize the interaction analysis. GENIE reads an entire genetic association study dataset into memory and partitions the dataset into fragments with non-overlapping sets of SNPs. For each fragment, GENIE analyzes: 1) the interaction of SNPs within it in parallel, and 2) the interaction between the SNPs of the current fragment and other fragments in parallel. We tested GENIE on a large-scale candidate gene study on high-density lipoprotein cholesterol. Using an NVIDIA Tesla C1060 graphics card, the GPU mode of GENIE achieves a speedup of 27 times over its single-core CPU mode run. GENIE is open-source, economical, user-friendly, and scalable. Since the computing power and memory capacity of graphics cards are increasing rapidly while their cost is going down, we anticipate that GENIE will achieve greater speedups with faster GPU cards. Documentation, source code, and precompiled binaries can be downloaded from http://www.cceb.upenn.edu/~mli/software/GENIE/.
2011-01-01
Background: Gene-gene interaction in genetic association studies is computationally intensive when a large number of SNPs are involved. Most of the latest Central Processing Units (CPUs) have multiple cores, whereas Graphics Processing Units (GPUs) also have hundreds of cores and have been recently used to implement faster scientific software. However, currently there are no genetic analysis software packages that allow users to fully utilize the computing power of these multi-core devices for genetic interaction analysis for binary traits. Findings: Here we present a novel software package GENIE, which utilizes the power of multiple GPU or CPU processor cores to parallelize the interaction analysis. GENIE reads an entire genetic association study dataset into memory and partitions the dataset into fragments with non-overlapping sets of SNPs. For each fragment, GENIE analyzes: 1) the interaction of SNPs within it in parallel, and 2) the interaction between the SNPs of the current fragment and other fragments in parallel. We tested GENIE on a large-scale candidate gene study on high-density lipoprotein cholesterol. Using an NVIDIA Tesla C1060 graphics card, the GPU mode of GENIE achieves a speedup of 27 times over its single-core CPU mode run. Conclusions: GENIE is open-source, economical, user-friendly, and scalable. Since the computing power and memory capacity of graphics cards are increasing rapidly while their cost is going down, we anticipate that GENIE will achieve greater speedups with faster GPU cards. Documentation, source code, and precompiled binaries can be downloaded from http://www.cceb.upenn.edu/~mli/software/GENIE/. PMID:21615923
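The partitioning scheme described in the GENIE abstract can be illustrated with a minimal sketch. It is not GENIE's code: the function names, the fragment size, and the choice to pair each fragment only with later fragments (so that no SNP pair is counted twice) are assumptions made for illustration. Each yielded work unit is independent of the others, which is what would let it be handed to a separate CPU or GPU core; the sketch itself runs serially.

```python
from itertools import combinations, product

def partition(n_snps, fragment_size):
    """Split SNP indices 0..n_snps-1 into consecutive, non-overlapping fragments."""
    return [list(range(start, min(start + fragment_size, n_snps)))
            for start in range(0, n_snps, fragment_size)]

def interaction_tasks(fragments):
    """Yield (kind, pairs) work units: one per fragment and one per fragment pair."""
    for i, frag in enumerate(fragments):
        # 1) all SNP pairs inside the current fragment
        yield ("within", list(combinations(frag, 2)))
        # 2) pairs between the current fragment and every later fragment
        #    (assumption: later-only, so each pair appears exactly once)
        for other in fragments[i + 1:]:
            yield ("between", list(product(frag, other)))

# Hypothetical toy size: 10 SNPs split into fragments of at most 4.
fragments = partition(n_snps=10, fragment_size=4)
tasks = list(interaction_tasks(fragments))
all_pairs = [pair for _, pairs in tasks for pair in pairs]
```

With 10 SNPs the work units together cover all 45 unordered SNP pairs, each exactly once.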
Molecular evolution: breakthroughs and mysteries in Batesian mimicry.
Booker, Tom; Ness, Rob W; Charlesworth, Deborah
2015-06-15
Recent studies appear to overthrow the hypothesis that, in butterfly species exhibiting Batesian mimicry, a multi-gene complex or 'supergene' controls the multiple differences between mimetic and non-mimetic individuals, suggesting instead that near-perfect mimicry can be produced by a set of changes within a single locus, together with changes in the genetic background. Copyright © 2015 Elsevier Ltd. All rights reserved.
CAMUR: Knowledge extraction from RNA-seq cancer data through equivalent classification rules.
Cestarelli, Valerio; Fiscon, Giulia; Felici, Giovanni; Bertolazzi, Paola; Weitschek, Emanuel
2016-03-01
Knowledge extraction methods for Next Generation Sequencing data are in high demand. In this work, we focus on RNA-seq gene expression analysis and specifically on case-control studies with rule-based supervised classification algorithms that build a model able to discriminate cases from controls. State-of-the-art algorithms compute a single classification model that contains few features (genes). In contrast, our goal is to elicit more knowledge by computing many classification models, and thereby to identify most of the genes related to the predicted class. We propose CAMUR, a new method that extracts multiple and equivalent classification models. CAMUR iteratively computes a rule-based classification model, calculates the power set of the genes present in the rules, iteratively eliminates those combinations from the data set, and repeats the classification procedure until a stopping criterion is met. CAMUR includes an ad-hoc knowledge repository (database) and a querying tool. We analyze three different types of RNA-seq data sets (Breast, Head and Neck, and Stomach Cancer) from The Cancer Genome Atlas (TCGA) and we also validate CAMUR and its models on non-TCGA data. Our experimental results show the efficacy of CAMUR: we obtain several reliable equivalent classification models, from which the most frequent genes, their relationships, and the relation with a particular cancer are deduced. Availability: dmb.iasi.cnr.it/camur.php. Contact: emanuel@iasi.cnr.it. Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press.
Yoshikawa, Yoshie; Emi, Mitsuru; Hashimoto-Tamaoki, Tomoko; Ohmuraya, Masaki; Sato, Ayuko; Tsujimura, Tohru; Hasegawa, Seiki; Nakano, Takashi; Nasu, Masaki; Pastorino, Sandra; Szymiczek, Agata; Bononi, Angela; Tanji, Mika; Pagano, Ian; Gaudino, Giovanni; Napolitano, Andrea; Goparaju, Chandra; Pass, Harvey I; Yang, Haining; Carbone, Michele
2016-11-22
We used a custom-made comparative genomic hybridization array (aCGH; average probe interval 254 bp) to screen 33 malignant mesothelioma (MM) biopsies for somatic copy number loss throughout the 3p21 region (10.7 Mb) that harbors 251 genes, including BRCA1 (breast cancer 1)-associated protein 1 (BAP1), the most commonly mutated gene in MM. We identified frequent minute biallelic deletions (<3 kb) in 46 of 251 genes: four were cancer-associated genes: SETD2 (SET domain-containing protein 2) (7 of 33), BAP1 (8 of 33), PBRM1 (polybromo 1) (3 of 33), and SMARCC1 (switch/sucrose nonfermentable- SWI/SNF-related, matrix-associated, actin-dependent regulator of chromatin, subfamily c, member 1) (2 of 33). These four genes were further investigated by targeted next-generation sequencing (tNGS), which revealed sequence-level mutations causing biallelic inactivation. Combined high-density aCGH and tNGS revealed biallelic gene inactivation in SETD2 (9 of 33, 27%), BAP1 (16 of 33, 48%), PBRM1 (5 of 33, 15%), and SMARCC1 (2 of 33, 6%). The incidence of genetic alterations detected is much higher than reported in the literature because minute deletions are not detected by NGS or commercial aCGH. Many of these minute deletions were not contiguous, but rather alternated with segments showing oscillating copy number changes along the 3p21 region. In summary, we found that in MM: (i) multiple minute simultaneous biallelic deletions are frequent in chromosome 3p21, where they occur as distinct events involving multiple genes; (ii) in addition to BAP1, mutations of SETD2, PBRM1, and SMARCC1 are frequent in MM; and (iii) our results suggest that high-density aCGH combined with tNGS provides a more precise estimate of the frequency and types of genes inactivated in human cancer than approaches based exclusively on NGS strategy.
SynFind: Compiling Syntenic Regions across Any Set of Genomes on Demand.
Tang, Haibao; Bomhoff, Matthew D; Briones, Evan; Zhang, Liangsheng; Schnable, James C; Lyons, Eric
2015-11-11
The identification of conserved syntenic regions enables discovery of predicted locations for orthologous and homeologous genes, even when no such gene is present. This capability means that synteny-based methods are far more effective than sequence similarity-based methods in identifying true negatives, a necessity for studying gene loss and gene transposition. However, the identification of syntenic regions requires complex analyses which must be repeated for pairwise comparisons between any two species. Therefore, as the number of published genomes increases, there is a growing demand for scalable, simple-to-use applications to perform comparative genomic analyses that cater to both gene family studies and genome-scale studies. We implemented SynFind, a web-based tool that addresses this need. Given one query genome, SynFind is capable of identifying conserved syntenic regions in any set of target genomes. SynFind is capable of reporting per-gene information, useful for researchers studying specific gene families, as well as genome-wide data sets of syntenic gene and predicted gene locations, critical for researchers focused on large-scale genomic analyses. Inference of syntenic homologs provides the basis for correlation of functional changes around genes of interest between related organisms. Deployed on the CoGe online platform, SynFind is connected to the genomic data from over 15,000 organisms from all domains of life and supports multiple releases of the same organism. SynFind makes use of a powerful job execution framework that promises scalability and reproducibility. SynFind can be accessed at http://genomevolution.org/CoGe/SynFind.pl. A video tutorial of SynFind using Phytophthora as an example is available at http://www.youtube.com/watch?v=2Agczny9Nyc. © The Author(s) 2015. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
Manda, Prashanti; McCarthy, Fiona; Bridges, Susan M
2013-10-01
The Gene Ontology (GO), a set of three sub-ontologies, is one of the most popular bio-ontologies used for describing gene product characteristics. GO annotation data containing terms from multiple sub-ontologies and at different levels in the ontologies is an important source of implicit relationships between terms from the three sub-ontologies. Data mining techniques such as association rule mining that are tailored to mine from multiple ontologies at multiple levels of abstraction are required for effective knowledge discovery from GO annotation data. We present a data mining approach, Multi-ontology data mining at All Levels (MOAL) that uses the structure and relationships of the GO to mine multi-ontology multi-level association rules. We introduce two interestingness measures: Multi-ontology Support (MOSupport) and Multi-ontology Confidence (MOConfidence) customized to evaluate multi-ontology multi-level association rules. We also describe a variety of post-processing strategies for pruning uninteresting rules. We use publicly available GO annotation data to demonstrate our methods with respect to two applications (1) the discovery of co-annotation suggestions and (2) the discovery of new cross-ontology relationships. Copyright © 2013 The Authors. Published by Elsevier Inc. All rights reserved.
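To make the idea of mining at several levels of abstraction concrete, the sketch below expands each annotation set with its ancestor terms and then scores a cross-ontology rule. It uses the standard definitions of support and confidence, not the MOSupport and MOConfidence measures introduced in the paper, and the term names, parent links, and annotation sets are all hypothetical placeholders rather than real GO data.

```python
# Hypothetical parent links; real GO terms can have several parents.
PARENT = {
    "bp:child": "bp:parent",
    "mf:child": "mf:parent",
}

def with_ancestors(terms):
    """Return the terms plus every ancestor reachable through PARENT."""
    expanded = set()
    for term in terms:
        while term is not None:
            expanded.add(term)
            term = PARENT.get(term)
    return expanded

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent) divided by support(antecedent)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

# Made-up annotation sets, one per gene product.
annotations = [
    {"bp:child", "mf:child"},
    {"bp:child", "mf:parent"},
    {"bp:parent"},
    {"mf:child"},
]
transactions = [with_ancestors(a) for a in annotations]

# A rule between the two higher-level terms, visible only after expansion.
rule_support = support({"bp:parent", "mf:parent"}, transactions)
rule_confidence = confidence({"bp:parent"}, {"mf:parent"}, transactions)
```

In this toy data the higher-level rule reaches a support of 0.5, whereas the same rule between the two child terms is supported by only one of the four raw annotation sets, which is the motivation for considering more than one level.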
Paisitkriangkrai, Sakrapee; Quek, Kelly; Nievergall, Eva; Jabbour, Anissa; Zannettino, Andrew; Kok, Chung Hoow
2018-06-07
Recurrent oncogenic fusion genes play a critical role in the development of various cancers and diseases and provide, in some cases, excellent therapeutic targets. To date, analysis tools that can identify and compare recurrent fusion genes across multiple samples have not been available to researchers. To address this deficiency, we developed Co-occurrence Fusion (Co-fuse), a new and easy-to-use software tool that enables biologists to merge RNA-seq information, allowing them to identify recurrent fusion genes without the need for exhaustive data processing. Notably, Co-fuse is based on pattern mining and statistical analysis, which enables the identification of hidden patterns of recurrent fusion genes. In this report, we show that Co-fuse can be used to identify 2 distinct groups within a set of 49 leukemic cell lines based on their recurrent fusion genes: a multiple myeloma (MM) samples-enriched cluster and an acute myeloid leukemia (AML) samples-enriched cluster. Our experimental results further demonstrate that Co-fuse can identify known driver fusion genes (e.g., IGH-MYC, IGH-WHSC1) in MM, when compared to AML samples, indicating the potential of Co-fuse to aid the discovery of yet unknown driver fusion genes through cohort comparisons. Additionally, using an RNA-seq dataset of 272 primary glioma samples, Co-fuse was able to validate recurrent fusion genes, further demonstrating the power of this analysis tool. Taken together, Co-fuse is a powerful new analysis tool that can be readily applied to large RNA-seq datasets, and may lead to the discovery of new disease subgroups and potentially new driver genes, for which targeted therapies could be developed. The Co-fuse R source code is publicly available at https://github.com/sakrapee/co-fuse .
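The notion of a recurrent fusion gene across samples can be shown with a small counting sketch. This is plain tallying, not the pattern mining and statistical analysis that Co-fuse performs, and it is written in Python rather than the tool's R. The function name, the `min_samples` cutoff, the sample names, and the assignment of fusions to samples are all invented for illustration.

```python
from collections import Counter

def recurrent_fusions(samples, min_samples=2):
    """Return fusions seen in at least `min_samples` samples, with sample counts."""
    counts = Counter()
    for fusions in samples.values():
        counts.update(set(fusions))  # count each fusion once per sample
    return {fusion: n for fusion, n in counts.items() if n >= min_samples}

# Invented input: per-sample fusion calls, with one duplicate call in sample_A.
samples = {
    "sample_A": ["IGH-MYC", "IGH-WHSC1", "IGH-MYC"],
    "sample_B": ["IGH-WHSC1"],
    "sample_C": ["IGH-MYC", "GENE1-GENE2"],
}
recurrent = recurrent_fusions(samples)
```

Counting per sample rather than per call keeps a fusion reported several times within one sample from looking recurrent on its own.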
EdiPy: a resource to simulate the evolution of plant mitochondrial genes under the RNA editing.
Picardi, Ernesto; Quagliariello, Carla
2006-02-01
EdiPy is an online resource appropriately designed to simulate the evolution of plant mitochondrial genes in a biologically realistic fashion. EdiPy takes into account the presence of sites subjected to RNA editing and provides multiple artificial alignments corresponding to both genomic and cDNA sequences. Each artificial data set can successively be submitted to main and widespread evolutionary and phylogenetic software packages such as PAUP, Phyml, PAML and Phylip. As an online bioinformatic resource, EdiPy is available at the following web page: http://biologia.unical.it/py_script/index.html.
Grote, Steffi; Prüfer, Kay; Kelso, Janet; Dannemann, Michael
2016-10-15
We present ABAEnrichment, an R package that tests for expression enrichment in specific brain regions at different developmental stages using expression information gathered from multiple regions of the adult and developing human brain, together with ontologically organized structural information about the brain, both provided by the Allen Brain Atlas. We validate ABAEnrichment by successfully recovering the origin of gene sets identified in specific brain cell-types and developmental stages. ABAEnrichment was implemented as an R package and is available under GPL (≥ 2) from the Bioconductor website (http://bioconductor.org/packages/3.3/bioc/html/ABAEnrichment.html). Contact: steffi_grote@eva.mpg.de, kelso@eva.mpg.de or michael_dannemann@eva.mpg.de. Supplementary information: Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
BIG: a large-scale data integration tool for renal physiology.
Zhao, Yue; Yang, Chin-Rang; Raghuram, Viswanathan; Parulekar, Jaya; Knepper, Mark A
2016-10-01
Due to recent advances in high-throughput techniques, we and others have generated multiple proteomic and transcriptomic databases to describe and quantify gene expression, protein abundance, or cellular signaling on the scale of the whole genome/proteome in kidney cells. The existence of so much data from diverse sources raises the following question: "How can researchers find information efficiently for a given gene product over all of these data sets without searching each data set individually?" This is the type of problem that has motivated the "Big-Data" revolution in Data Science, which has driven progress in fields such as marketing. Here we present an online Big-Data tool called BIG (Biological Information Gatherer) that allows users to submit a single online query to obtain all relevant information from all indexed databases. BIG is accessible at http://big.nhlbi.nih.gov/.
Mallik, Saurav; Bhadra, Tapas; Mukherji, Ayan
2018-04-01
Association rule mining is an important technique for identifying interesting relationships between gene pairs in a biological data set. Earlier methods basically work for a single biological data set, and, in most cases, a single minimum support cutoff can be applied globally, i.e., across all genesets/itemsets. To overcome this limitation, in this paper, we propose a dynamic threshold-based FP-growth rule mining algorithm that integrates gene expression, methylation and protein-protein interaction profiles based on weighted shortest distance to find novel associations among different pairs of genes in multi-view data sets. For this purpose, we introduce three new thresholds, namely, Distance-based Variable/Dynamic Supports (DVS), Distance-based Variable Confidences (DVC), and Distance-based Variable Lifts (DVL) for each rule by integrating the co-expression, co-methylation, and protein-protein interactions present in the multi-omics data set. We develop the proposed algorithm utilizing these three novel multiple threshold measures. In the proposed algorithm, the values of DVS, DVC, and DVL are computed for each rule separately, and subsequently it is verified whether the support, confidence, and lift of each evolved rule are greater than or equal to the corresponding individual DVS, DVC, and DVL values, respectively. If all three conditions hold for a rule, the rule is treated as a resultant rule. One of the major advantages of the proposed method compared with other related state-of-the-art methods is that it considers both the quantitative and interactive significance among all pairwise genes belonging to each rule. Moreover, the proposed method generates fewer rules, takes less running time, and provides greater biological significance for the resultant top-ranking rules compared to previous methods.
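The acceptance test described in this abstract, in which a rule is kept only when its support, confidence, and lift each meet that rule's own threshold, can be sketched as follows. The sketch uses the standard definitions of the three measures. How the paper derives the DVS, DVC, and DVL values from weighted shortest distances is not reproduced here, so the thresholds are passed in as plain numbers, and the gene names, transactions, and threshold values are hypothetical.

```python
def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence, lift) of the rule antecedent -> consequent.

    Assumes the antecedent and the consequent each occur in at least one
    transaction, so that no division by zero arises.
    """
    n = len(transactions)
    n_a = sum(1 for t in transactions if antecedent <= t)
    n_c = sum(1 for t in transactions if consequent <= t)
    n_ac = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_ac / n
    confidence = n_ac / n_a
    lift = confidence / (n_c / n)
    return support, confidence, lift

def passes(metrics, thresholds):
    """Keep a rule only if every metric is >= its own per-rule threshold."""
    return all(m >= t for m, t in zip(metrics, thresholds))

# Made-up transactions over placeholder genes g1..g3.
transactions = [{"g1", "g2"}, {"g1", "g2"}, {"g1"}, {"g3"}]
metrics = rule_metrics({"g1"}, {"g2"}, transactions)

# Hypothetical per-rule thresholds standing in for (DVS, DVC, DVL).
kept = passes(metrics, thresholds=(0.4, 0.6, 1.2))
rejected = passes(metrics, thresholds=(0.4, 0.7, 1.2))
```

Because the thresholds are supplied per rule instead of once globally, two rules with identical metrics can be treated differently, which is the behaviour the abstract contrasts with a single global minimum support cutoff.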
Extensive complementarity between gene function prediction methods.
Vidulin, Vedrana; Šmuc, Tomislav; Supek, Fran
2016-12-01
The number of sequenced genomes rises steadily but we still lack knowledge about the biological roles of many genes. Automated function prediction (AFP) is thus a necessity. We hypothesized that AFP approaches that draw on distinct genome features may be useful for predicting different types of gene functions, motivating a systematic analysis of the benefits gained by obtaining and integrating such predictions. Our pipeline amalgamates 5 133 543 genes from 2071 genomes in a single massive analysis that evaluates five established genomic AFP methodologies. While 1227 Gene Ontology (GO) terms yielded reliable predictions, the majority of these functions were accessible to only one or two of the methods. Moreover, different methods tend to assign a GO term to non-overlapping sets of genes. Thus, inferences made by diverse genomic AFP methods display a striking complementarity, both gene-wise and function-wise. Because of this, a viable integration strategy is to rely on a single most-confident prediction per gene/function, rather than enforcing agreement across multiple AFP methods. Using an information-theoretic approach, we estimate that current databases contain 29.2 bits/gene of known Escherichia coli gene functions. This can be increased by up to 5.5 bits/gene using individual AFP methods or by 11 additional bits/gene upon integration, thereby providing a highly-ranking predictor on the Critical Assessment of Function Annotation 2 community benchmark. Availability of more sequenced genomes boosts the predictive accuracy of AFP approaches and also the benefit from integrating them. The individual and integrated GO predictions for the complete set of genes are available from http://gorbi.irb.hr/. Contact: fran.supek@irb.hr. Supplementary information: Supplementary materials are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Ander, Bradley P.; Zhang, Xiaoshuai; Xue, Fuzhong; Sharp, Frank R.; Yang, Xiaowei
2013-01-01
The discovery of genetic or genomic markers plays a central role in the development of personalized medicine. A notable challenge exists when dealing with the high dimensionality of the data sets, as thousands of genes or millions of genetic variants are collected on a relatively small number of subjects. Traditional gene-wise selection methods using univariate analyses have difficulty incorporating correlational, structural, or functional structures amongst the molecular measures. For microarray gene expression data, we first summarize solutions in dealing with ‘large p, small n’ problems, and then propose an integrative Bayesian variable selection (iBVS) framework for simultaneously identifying causal or marker genes and regulatory pathways. A novel partial least squares (PLS) g-prior for iBVS is developed to allow the incorporation of prior knowledge on gene-gene interactions or functional relationships. From the point of view of systems biology, iBVS enables users to directly target the joint effects of multiple genes and pathways in a hierarchical modeling diagram to predict disease status or phenotype. The estimated posterior selection probabilities offer probabilistic and biological interpretations. Both simulated data and a set of microarray data in predicting stroke status are used in validating the performance of iBVS in a Probit model with binary outcomes. iBVS offers a general framework for effective discovery of various molecular biomarkers by combining data-based statistics and knowledge-based priors. Guidelines on making posterior inferences, determining Bayesian significance levels, and improving computational efficiencies are also discussed. PMID:23844055
Peng, Bin; Zhu, Dianwen; Ander, Bradley P; Zhang, Xiaoshuai; Xue, Fuzhong; Sharp, Frank R; Yang, Xiaowei
2013-01-01
The discovery of genetic or genomic markers plays a central role in the development of personalized medicine. A notable challenge exists when dealing with the high dimensionality of the data sets, as thousands of genes or millions of genetic variants are collected on a relatively small number of subjects. Traditional gene-wise selection methods using univariate analyses have difficulty incorporating correlational, structural, or functional structures amongst the molecular measures. For microarray gene expression data, we first summarize solutions in dealing with 'large p, small n' problems, and then propose an integrative Bayesian variable selection (iBVS) framework for simultaneously identifying causal or marker genes and regulatory pathways. A novel partial least squares (PLS) g-prior for iBVS is developed to allow the incorporation of prior knowledge on gene-gene interactions or functional relationships. From the point of view of systems biology, iBVS enables users to directly target the joint effects of multiple genes and pathways in a hierarchical modeling diagram to predict disease status or phenotype. The estimated posterior selection probabilities offer probabilistic and biological interpretations. Both simulated data and a set of microarray data in predicting stroke status are used in validating the performance of iBVS in a Probit model with binary outcomes. iBVS offers a general framework for effective discovery of various molecular biomarkers by combining data-based statistics and knowledge-based priors. Guidelines on making posterior inferences, determining Bayesian significance levels, and improving computational efficiencies are also discussed.
Scuba: scalable kernel-based gene prioritization.
Zampieri, Guido; Tran, Dinh Van; Donini, Michele; Navarin, Nicolò; Aiolli, Fabio; Sperduti, Alessandro; Valle, Giorgio
2018-01-25
The uncovering of genes linked to human diseases is a pressing challenge in molecular biology and precision medicine. This task is often hindered by the large number of candidate genes and by the heterogeneity of the available information. Computational methods for the prioritization of candidate genes can help to cope with these problems. In particular, kernel-based methods are a powerful resource for the integration of heterogeneous biological knowledge; however, their practical implementation is often precluded by their limited scalability. We propose Scuba, a scalable kernel-based method for gene prioritization. It implements a novel multiple kernel learning approach, based on a semi-supervised perspective and on the optimization of the margin distribution. Scuba is optimized to cope with strongly unbalanced settings where known disease genes are few and large scale predictions are required. Importantly, it is able to deal efficiently both with a large number of candidate genes and with an arbitrary number of data sources. As a direct consequence of scalability, Scuba also integrates a new efficient strategy to select optimal kernel parameters for each data source. We performed cross-validation experiments and simulated a realistic usage setting, showing that Scuba outperforms a wide range of state-of-the-art methods. Scuba achieves state-of-the-art performance and has enhanced scalability compared to existing kernel-based approaches for genomic data. This method can be useful to prioritize candidate genes, particularly when their number is large or when input data is highly heterogeneous. The code is freely available at https://github.com/gzampieri/Scuba .
Krienen, Fenna M.; Yeo, B. T. Thomas; Ge, Tian; Buckner, Randy L.; Sherwood, Chet C.
2016-01-01
The human brain is patterned with disproportionately large, distributed cerebral networks that connect multiple association zones in the frontal, temporal, and parietal lobes. The expansion of the cortical surface, along with the emergence of long-range connectivity networks, may be reflected in changes to the underlying molecular architecture. Using the Allen Institute’s human brain transcriptional atlas, we demonstrate that genes particularly enriched in supragranular layers of the human cerebral cortex relative to mouse distinguish major cortical classes. The topography of transcriptional expression reflects large-scale brain network organization consistent with estimates from functional connectivity MRI and anatomical tracing in nonhuman primates. Microarray expression data for genes preferentially expressed in human upper layers (II/III), but enriched only in lower layers (V/VI) of mouse, were cross-correlated to identify molecular profiles across the cerebral cortex of postmortem human brains (n = 6). Unimodal sensory and motor zones have similar molecular profiles, despite being distributed across the cortical mantle. Sensory/motor profiles were anticorrelated with paralimbic and certain distributed association network profiles. Tests of alternative gene sets did not consistently distinguish sensory and motor regions from paralimbic and association regions: (i) genes enriched in supragranular layers in both humans and mice, (ii) genes cortically enriched in humans relative to nonhuman primates, (iii) genes related to connectivity in rodents, (iv) genes associated with human and mouse connectivity, and (v) 1,454 gene sets curated from known gene ontologies. Molecular innovations of upper cortical layers may be an important component in the evolution of long-range corticocortical projections. PMID:26739559
Krienen, Fenna M; Yeo, B T Thomas; Ge, Tian; Buckner, Randy L; Sherwood, Chet C
2016-01-26
The human brain is patterned with disproportionately large, distributed cerebral networks that connect multiple association zones in the frontal, temporal, and parietal lobes. The expansion of the cortical surface, along with the emergence of long-range connectivity networks, may be reflected in changes to the underlying molecular architecture. Using the Allen Institute's human brain transcriptional atlas, we demonstrate that genes particularly enriched in supragranular layers of the human cerebral cortex relative to mouse distinguish major cortical classes. The topography of transcriptional expression reflects large-scale brain network organization consistent with estimates from functional connectivity MRI and anatomical tracing in nonhuman primates. Microarray expression data for genes preferentially expressed in human upper layers (II/III), but enriched only in lower layers (V/VI) of mouse, were cross-correlated to identify molecular profiles across the cerebral cortex of postmortem human brains (n = 6). Unimodal sensory and motor zones have similar molecular profiles, despite being distributed across the cortical mantle. Sensory/motor profiles were anticorrelated with paralimbic and certain distributed association network profiles. Tests of alternative gene sets did not consistently distinguish sensory and motor regions from paralimbic and association regions: (i) genes enriched in supragranular layers in both humans and mice, (ii) genes cortically enriched in humans relative to nonhuman primates, (iii) genes related to connectivity in rodents, (iv) genes associated with human and mouse connectivity, and (v) 1,454 gene sets curated from known gene ontologies. Molecular innovations of upper cortical layers may be an important component in the evolution of long-range corticocortical projections.
Identification of a B cell signature associated with renal transplant tolerance in humans
Newell, Kenneth A.; Asare, Adam; Kirk, Allan D.; Gisler, Trang D.; Bourcier, Kasia; Suthanthiran, Manikkam; Burlingham, William J.; Marks, William H.; Sanz, Ignacio; Lechler, Robert I.; Hernandez-Fuentes, Maria P.; Turka, Laurence A.; Seyfert-Margolis, Vicki L.
2010-01-01
Establishing long-term allograft acceptance without the requirement for continuous immunosuppression, a condition known as allograft tolerance, is a highly desirable therapeutic goal in solid organ transplantation. Determining which recipients would benefit from withdrawal or minimization of immunosuppression would be greatly facilitated by biomarkers predictive of tolerance. In this study, we identified the largest reported cohort to our knowledge of tolerant renal transplant recipients, as defined by stable graft function and receiving no immunosuppression for more than 1 year, and compared their gene expression profiles and peripheral blood lymphocyte subsets with those of subjects with stable graft function who are receiving immunosuppressive drugs as well as healthy controls. In addition to being associated with clinical and phenotypic parameters, renal allograft tolerance was strongly associated with a B cell signature using several assays. Tolerant subjects showed increased expression of multiple B cell differentiation genes, and a set of just 3 of these genes distinguished tolerant from nontolerant recipients in a unique test set of samples. This B cell signature was associated with upregulation of CD20 mRNA in urine sediment cells and elevated numbers of peripheral blood naive and transitional B cells in tolerant participants compared with those receiving immunosuppression. These results point to a critical role for B cells in regulating alloimmunity and provide a candidate set of genes for wider-scale screening of renal transplant recipients. PMID:20501946
Understanding the Origin of Species with Genome-Scale Data: the Role of Gene Flow
Sousa, Vitor; Hey, Jody
2017-01-01
As it becomes easier to sequence multiple genomes from closely related species, evolutionary biologists working on speciation are struggling to get the most out of very large population-genomic data sets. Such data hold the potential to resolve evolutionary biology’s long-standing questions about the role of gene exchange in species formation. In principle the new population genomic data can be used to disentangle the conflicting roles of natural selection and gene flow during the divergence process. However there are great challenges in taking full advantage of such data, especially with regard to including recombination in genetic models of the divergence process. Current data, models, methods and the potential pitfalls in using them will be considered here. PMID:23657479
Pena, S D; Barreto, G; Vago, A R; De Marco, L; Reinach, F C; Dias Neto, E; Simpson, A J
1994-01-01
Low-stringency single specific primer PCR (LSSP-PCR) is an extremely simple PCR-based technique that detects single or multiple mutations in gene-sized DNA fragments. A purified DNA fragment is subjected to PCR using high concentrations of a single specific oligonucleotide primer, large amounts of Taq polymerase, and a very low annealing temperature. Under these conditions the primer hybridizes specifically to its complementary region and nonspecifically to multiple sites within the fragment, in a sequence-dependent manner, producing a heterogeneous set of reaction products resolvable by electrophoresis. The complex banding pattern obtained is significantly altered by even a single-base change and thus constitutes a unique "gene signature." Therefore LSSP-PCR will have almost unlimited application in all fields of genetics and molecular medicine where rapid and sensitive detection of mutations and sequence variations is important. The usefulness of LSSP-PCR is illustrated by applications in the study of mutants of smooth muscle myosin light chain, analysis of a family with X-linked nephrogenic diabetes insipidus, and identity testing using human mitochondrial DNA. PMID:8127912
Dai, Weijun; Li, Wencheng; Hoque, Mainul; Li, Zhuyun; Tian, Bin; Makeyev, Eugene V.
2015-01-01
Nervous system (NS) development relies on coherent upregulation of extensive sets of genes in a precise spatiotemporal manner. How such transcriptome-wide effects are orchestrated at the molecular level remains an open question. Here we show that 3′-untranslated regions (3′ UTRs) of multiple neural transcripts contain AU-rich cis-elements (AREs) recognized by tristetraprolin (TTP/Zfp36), an RNA-binding protein previously implicated in regulation of mRNA stability. We further demonstrate that the efficiency of ARE-dependent mRNA degradation declines in the neural lineage because of a decrease in the TTP protein expression mediated by the NS-enriched microRNA miR-9. Importantly, TTP downregulation in this context is essential for proper neuronal differentiation. On the other hand, inactivation of TTP in non-neuronal cells leads to dramatic upregulation of multiple NS-specific genes. We conclude that the newly identified miR-9/TTP circuitry limits unscheduled accumulation of neuronal mRNAs in non-neuronal cells and ensures coordinated upregulation of these transcripts in neurons. PMID:26144867
Recent progress in the genetics of spontaneously hypertensive rats.
Pravenec, M; Křen, V; Landa, V; Mlejnek, P; Musilová, A; Šilhavý, J; Šimáková, M; Zídek, V
2014-01-01
The spontaneously hypertensive rat (SHR) is the most widely used animal model of essential hypertension and accompanying metabolic disturbances. Recent advances in sequencing of genomes of BN-Lx and SHR progenitors of the BXH/HXB recombinant inbred (RI) strains as well as accumulation of multiple data sets of intermediary phenotypes in the RI strains, including mRNA and microRNA abundance, quantitative metabolomics, proteomics, methylomics or histone modifications, will make it possible to systematically search for genetic variants involved in regulation of gene expression and in the etiology of complex pathophysiological traits. New advances in manipulation of the rat genome, including efficient transgenesis and gene targeting, will enable in vivo functional analyses of selected candidate genes to identify QTL at the molecular level or to provide insight into mechanisms whereby targeted genes affect pathophysiological traits in the SHR.
Singh, Upinder; Brewer, Jeremy L; Boothroyd, John C
2002-05-01
Developmental switching in Toxoplasma gondii, from the virulent tachyzoite to the relatively quiescent bradyzoite stage, is responsible for disease propagation and reactivation. We have generated tachyzoite to bradyzoite differentiation (Tbd-) mutants in T. gondii and used these in combination with a cDNA microarray to identify developmental pathways in bradyzoite formation. Four independently generated Tbd- mutants were analysed and had defects in bradyzoite development in response to multiple bradyzoite-inducing conditions, a stable phenotype after in vivo passages and a markedly reduced brain cyst burden in a murine model of chronic infection. Transcriptional profiles of mutant and wild-type parasites, growing under bradyzoite conditions, revealed a hierarchy of developmentally regulated genes, including many bradyzoite-induced genes whose transcripts were reduced in all mutants. A set of non-developmentally regulated genes whose transcripts were less abundant in Tbd- mutants were also identified. These may represent genes that mediate downstream effects and/or whose expression is dependent on the same transcription factors as the bradyzoite-induced set. Using these data, we have generated a model of transcription regulation during bradyzoite development in T. gondii. Our approach shows the utility of this system as a model to study developmental biology in single-celled eukaryotes including protozoa and fungi.
Bukowski, Radek; Sadovsky, Yoel; Goodarzi, Hani; Zhang, Heping; Biggio, Joseph R; Varner, Michael; Parry, Samuel; Xiao, Feifei; Esplin, Sean M; Andrews, William; Saade, George R; Ilekis, John V; Reddy, Uma M; Baldwin, Donald A
2017-01-01
Preterm birth is a main determinant of neonatal mortality and morbidity and a major contributor to the overall mortality and burden of disease. However, research on preterm birth is hindered by the imprecise definition of the clinical phenotype and by the complexity of the molecular phenotype, which arises from the multiple pregnancy tissue types and molecular processes that may contribute to preterm birth. Here we comprehensively evaluate the mRNA transcriptome that characterizes preterm and term labor in tissues comprising the pregnancy using precisely phenotyped samples. The four complementary phenotypes together provide comprehensive insight into preterm and term parturition. Samples of maternal blood, chorion, amnion, placenta, decidua, fetal blood, and myometrium from the uterine fundus and lower segment (n = 183) were obtained during cesarean delivery from women with four complementary phenotypes: delivering preterm with (PL) and without labor (PNL), term with (TL) and without labor (TNL). Enrolled were 35 pregnant women with four precisely and prospectively defined phenotypes: PL (n = 8), PNL (n = 10), TL (n = 7) and TNL (n = 10). Gene expression data were analyzed using shrunken centroid analysis to identify a minimal set of genes that uniquely characterizes each of the four phenotypes. Expression profiles of 73 genes and non-coding RNA sequences uniquely identified each of the four phenotypes. Shrunken centroid analysis with 10 times repeated 10-fold cross-validation was also used to minimize false-positive findings and overfitting. Also identified were the pathways and molecular processes associated with, and the cis-regulatory elements in the 5' promoter or 3'-UTR regions of, the set of genes whose expression uniquely characterized the four phenotypes. The largest differences in gene expression among the four groups occurred at the maternal-fetal interface in decidua, chorion and amnion.
The gene expression profiles showed suppression of chemokine expression in TNL, withdrawal of this suppression in TL, activation of multiple pathways of inflammation in PL, and an immune rejection profile in PNL. The genes constituting expression signatures showed over-representation of three putative regulatory elements in their 5' and 3' UTR regions. The results suggest that pregnancy is maintained by downregulation of chemokines at the maternal-fetal interface. Withdrawal of this downregulation results in term birth, and its overriding by the activation of multiple pathways of the immune system results in preterm birth. Complications of pregnancy associated with impairment of placental function, which necessitated premature delivery of the fetus in the absence of labor, show gene expression patterns associated with immune rejection.
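The shrunken centroid idea mentioned in this abstract can be illustrated with a minimal Python sketch. It is a simplification and not the authors' pipeline: each class centroid is pulled toward the overall centroid by soft-thresholding, and genes whose offset shrinks to zero in every class drop out of the signature. The per-gene standardization used in the published nearest-shrunken-centroid method is omitted, and the function names, the toy data and the threshold `delta` are all assumptions made for illustration.

```python
def soft_threshold(value, delta):
    """Shrink value toward zero by delta; anything within delta of zero becomes 0."""
    if value > delta:
        return value - delta
    if value < -delta:
        return value + delta
    return 0.0


def shrunken_centroids(samples, labels, delta):
    """Return ({label: shrunken centroid}, overall centroid) for a samples-by-genes list."""
    n_genes = len(samples[0])
    overall = [sum(s[g] for s in samples) / len(samples) for g in range(n_genes)]
    centroids = {}
    for label in sorted(set(labels)):
        members = [s for s, l in zip(samples, labels) if l == label]
        class_mean = [sum(s[g] for s in members) / len(members) for g in range(n_genes)]
        # Only the class-vs-overall offset is shrunk (simplified: no standardization).
        centroids[label] = [
            overall[g] + soft_threshold(class_mean[g] - overall[g], delta)
            for g in range(n_genes)
        ]
    return centroids, overall


def informative_genes(centroids, overall):
    """Indices of genes whose shrunken centroid still differs from the overall mean."""
    return [
        g for g in range(len(overall))
        if any(abs(c[g] - overall[g]) > 1e-12 for c in centroids.values())
    ]


# Hypothetical toy data: gene 0 separates the classes, gene 1 does not.
toy_samples = [[5.0, 1.0], [5.2, 1.1], [1.0, 1.05], [0.8, 0.95]]
toy_labels = ["PL", "PL", "TNL", "TNL"]
toy_centroids, toy_overall = shrunken_centroids(toy_samples, toy_labels, delta=0.5)
print(informative_genes(toy_centroids, toy_overall))
```

In practice the threshold would be chosen by cross-validation, which is the role the repeated 10-fold cross-validation plays in the abstract.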
A whole blood gene expression-based signature for smoking status
2012-01-01
Background: Smoking is the leading cause of preventable death worldwide and has been shown to increase the risk of multiple diseases including coronary artery disease (CAD). We sought to identify genes whose levels of expression in whole blood correlate with self-reported smoking status. Methods: Microarrays were used to identify gene expression changes in whole blood which correlated with self-reported smoking status; a set of significant genes from the microarray analysis was validated by qRT-PCR in an independent set of subjects. Stepwise forward logistic regression was performed using the qRT-PCR data to create a predictive model whose performance was validated in an independent set of subjects and compared to cotinine, a nicotine metabolite. Results: Microarray analysis of whole blood RNA from 209 PREDICT subjects (41 current smokers, 4 quit ≤ 2 months, 64 quit > 2 months, 100 never smoked; NCT00500617) identified 4214 genes significantly correlated with self-reported smoking status. qRT-PCR was performed on 1,071 PREDICT subjects across 256 microarray genes significantly correlated with smoking or CAD. A five-gene (CLDND1, LRRN3, MUC1, GOPC, LEF1) predictive model, derived from the qRT-PCR data using stepwise forward logistic regression, had a cross-validated mean AUC of 0.93 (sensitivity=0.78; specificity=0.95), and was validated using 180 independent PREDICT subjects (AUC=0.82, CI 0.69-0.94; sensitivity=0.63; specificity=0.94). Plasma from the 180 validation subjects was used to assess levels of cotinine; a model using a threshold of 10 ng/ml cotinine resulted in an AUC of 0.89 (CI 0.81-0.97; sensitivity=0.81; specificity=0.97; kappa with expression model = 0.53). Conclusion: We have constructed and validated a whole blood gene expression score for the evaluation of smoking status, demonstrating that clinical and environmental factors contributing to cardiovascular disease risk can be assessed by gene expression. PMID:23210427
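The performance figures quoted in this abstract (AUC, sensitivity, specificity) can be computed from any list of model scores and true labels. The sketch below is generic and uses made-up scores; it does not reproduce the five-gene model or any PREDICT data, and the function names and the 0.5 threshold are assumptions for illustration.

```python
def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(scores, labels, threshold):
    """Call a subject positive when score >= threshold, then compare with the truth."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical scores: 1 = current smoker, 0 = non-smoker.
demo_scores = [0.9, 0.8, 0.4, 0.3, 0.2]
demo_labels = [1, 1, 0, 1, 0]
print(auc(demo_scores, demo_labels))
print(sensitivity_specificity(demo_scores, demo_labels, threshold=0.5))
```

The AUC does not depend on a threshold, whereas sensitivity and specificity do, which is why the abstract reports a cotinine cut-off of 10 ng/ml alongside those two figures.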
Jung, Inuk; Jo, Kyuri; Kang, Hyejin; Ahn, Hongryul; Yu, Youngjae; Kim, Sun
2017-12-01
Identifying biologically meaningful gene expression patterns from time series gene expression data is important to understand the underlying biological mechanisms. To identify significantly perturbed gene sets between different phenotypes, analysis of time series transcriptome data requires consideration of time and sample dimensions. Thus, the analysis of such time series data seeks gene sets that exhibit similar or different expression patterns between two or more sample conditions, constituting the three-dimensional data, i.e. gene-time-condition. Computational complexity for analyzing such data is very high, compared to the already difficult NP-hard two-dimensional biclustering algorithms. Because of this challenge, traditional time series clustering algorithms are designed to capture co-expressed genes with similar expression patterns in two sample conditions. We present a triclustering algorithm, TimesVector, specifically designed for clustering three-dimensional time series data to capture distinctively similar or different gene expression patterns between two or more sample conditions. TimesVector identifies clusters with distinctive expression patterns in three steps: (i) dimension reduction and clustering of time-condition concatenated vectors, (ii) post-processing clusters for detecting similar and distinct expression patterns and (iii) rescuing genes from unclassified clusters. Using four sets of time series gene expression data, generated by both microarray and high-throughput sequencing platforms, we demonstrated that TimesVector successfully detected biologically meaningful clusters of high quality. TimesVector improved the clustering quality compared to existing triclustering tools, and only TimesVector successfully detected clusters with differential expression patterns across conditions. The TimesVector software is available at http://biohealth.snu.ac.kr/software/TimesVector/.
Supplementary data are available at Bioinformatics online.
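Step (i) of the TimesVector description, clustering of time-condition concatenated vectors, can be sketched as follows. This is a toy illustration, not the TimesVector code: each gene's time series from all conditions is flattened into one vector, and the vectors are grouped by cosine similarity with a small spherical k-means loop. The abstract does not specify the clustering method, so spherical k-means, the seeding by hand-picked genes, and all names and data are assumptions.

```python
import math


def concatenate(profile):
    """Flatten {condition: [expression per time point]} into one vector, conditions sorted."""
    return [v for cond in sorted(profile) for v in profile[cond]]


def unit(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else list(vec)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))


def spherical_kmeans(vectors, seeds, n_iter=10):
    """Assign each vector to the most similar center, then re-estimate the centers."""
    centers = [unit(s) for s in seeds]
    assignment = []
    for _ in range(n_iter):
        assignment = [
            max(range(len(centers)), key=lambda k: cosine(v, centers[k]))
            for v in vectors
        ]
        for k in range(len(centers)):
            members = [unit(v) for v, a in zip(vectors, assignment) if a == k]
            if members:
                centers[k] = unit([sum(col) for col in zip(*members)])
    return assignment


# Hypothetical genes measured at three time points under two conditions.
genes = {
    "g1": {"ctrl": [1, 2, 3], "treat": [1, 2, 3]},    # same trend in both conditions
    "g2": {"ctrl": [2, 4, 6], "treat": [2, 4, 6.5]},  # same trend, larger amplitude
    "g3": {"ctrl": [3, 2, 1], "treat": [1, 2, 3]},    # trend flips between conditions
    "g4": {"ctrl": [6, 4, 2], "treat": [2, 4, 6]},    # flips, larger amplitude
}
vectors = [concatenate(genes[name]) for name in sorted(genes)]
print(spherical_kmeans(vectors, seeds=[vectors[0], vectors[2]]))
```

Because the similarity is computed on the concatenated vector, genes whose pattern differs between conditions fall into a different cluster from genes whose pattern is conserved, which is the distinction step (ii) of the abstract then examines.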
Complete Genomic Structure of the Cultivated Rice Endophyte Azospirillum sp. B510
Kaneko, Takakazu; Minamisawa, Kiwamu; Isawa, Tsuyoshi; Nakatsukasa, Hiroki; Mitsui, Hisayuki; Kawaharada, Yasuyuki; Nakamura, Yasukazu; Watanabe, Akiko; Kawashima, Kumiko; Ono, Akiko; Shimizu, Yoshimi; Takahashi, Chika; Minami, Chiharu; Fujishiro, Tsunakazu; Kohara, Mitsuyo; Katoh, Midori; Nakazaki, Naomi; Nakayama, Shinobu; Yamada, Manabu; Tabata, Satoshi; Sato, Shusei
2010-01-01
We determined the nucleotide sequence of the entire genome of a diazotrophic endophyte, Azospirillum sp. B510. Strain B510 is an endophytic bacterium isolated from stems of rice plants (Oryza sativa cv. Nipponbare). The genome of B510 consisted of a single chromosome (3 311 395 bp) and six plasmids, designated as pAB510a (1 455 109 bp), pAB510b (723 779 bp), pAB510c (681 723 bp), pAB510d (628 837 bp), pAB510e (537 299 bp), and pAB510f (261 596 bp). The chromosome bears 2893 potential protein-encoding genes, two sets of rRNA gene clusters (rrns), and 45 tRNA genes representing 37 tRNA species. The genomes of the six plasmids contained a total of 3416 protein-encoding genes, seven sets of rrns, and 34 tRNAs representing 19 tRNA species. Eight genes for plasmid-specific tRNA species are located on either pAB510a or pAB510d. Two out of eight genomic islands are inserted in the plasmids, pAB510b and pAB510e, and one of the islands is inserted into trnfM-CAU in the rrn located on pAB510e. Genes other than the nif gene cluster that are involved in N2 fixation and are homologues of Bradyrhizobium japonicum USDA110 include fixABCX, fixNOQP, fixHIS, fixG, and fixLJK. Three putative plant hormone-related genes encoding tryptophan 2-monooxygenase (iaaM) and indole-3-acetaldehyde hydrolase (iaaH), which are involved in IAA biosynthesis, and ACC deaminase (acdS), which reduces ethylene levels, were identified. Multiple gene clusters for tripartite ATP-independent periplasmic-transport systems and a diverse set of malic enzymes were identified, suggesting that B510 utilizes C4-dicarboxylate during its symbiotic relationship with the host plant. PMID:20047946
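The replicon sizes and gene counts listed in this abstract can be totalled with a few lines of Python. The numbers are copied from the abstract; the function name and dictionary layout are assumptions made for this sketch.

```python
def genome_totals():
    """Sum the replicon sizes and gene counts reported for Azospirillum sp. B510."""
    replicon_bp = {
        "chromosome": 3_311_395,
        "pAB510a": 1_455_109,
        "pAB510b": 723_779,
        "pAB510c": 681_723,
        "pAB510d": 628_837,
        "pAB510e": 537_299,
        "pAB510f": 261_596,
    }
    genome_bp = sum(replicon_bp.values())
    return {
        "genome_bp": genome_bp,
        "plasmid_bp": genome_bp - replicon_bp["chromosome"],
        "protein_genes": 2893 + 3416,  # chromosome + plasmids
        "rrn_sets": 2 + 7,
        "trna_genes": 45 + 34,
    }


print(genome_totals())
```

On these figures the six plasmids together carry 4,288,343 of the 7,599,738 bp, that is, more sequence than the chromosome itself.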
Uptake, Results, and Outcomes of Germline Multiple-Gene Sequencing After Diagnosis of Breast Cancer.
Kurian, Allison W; Ward, Kevin C; Hamilton, Ann S; Deapen, Dennis M; Abrahamse, Paul; Bondarenko, Irina; Li, Yun; Hawley, Sarah T; Morrow, Monica; Jagsi, Reshma; Katz, Steven J
2018-05-10
Low-cost sequencing of multiple genes is increasingly available for cancer risk assessment. Little is known about uptake or outcomes of multiple-gene sequencing after breast cancer diagnosis in community practice. To examine the effect of multiple-gene sequencing on the experience and treatment outcomes for patients with breast cancer. For this population-based retrospective cohort study, patients with breast cancer diagnosed from January 2013 to December 2015 and accrued from SEER registries across Georgia and in Los Angeles, California, were surveyed (n = 5080, response rate = 70%). Responses were merged with SEER data and results of clinical genetic tests, either BRCA1 and BRCA2 (BRCA1/2) sequencing only or including additional other genes (multiple-gene sequencing), provided by 4 laboratories. Type of testing (multiple-gene sequencing vs BRCA1/2-only sequencing), test results (negative, variant of unknown significance, or pathogenic variant), patient experiences with testing (timing of testing, who discussed results), and treatment (strength of patient consideration of, and surgeon recommendation for, prophylactic mastectomy), and prophylactic mastectomy receipt. We defined a patient subgroup with higher pretest risk of carrying a pathogenic variant according to practice guidelines. Among 5026 patients (mean [SD] age, 59.9 [10.7]), 1316 (26.2%) were linked to genetic results from any laboratory. Multiple-gene sequencing increasingly replaced BRCA1/2-only testing over time: in 2013, the rate of multiple-gene sequencing was 25.6% and BRCA1/2-only testing, 74.4%; in 2015, the rate of multiple-gene sequencing was 66.5% and BRCA1/2-only testing, 33.5%. Multiple-gene sequencing was more often ordered by genetic counselors (multiple-gene sequencing, 25.5% and BRCA1/2-only testing, 15.3%) and delayed until after surgery (multiple-gene sequencing, 32.5% and BRCA1/2-only testing, 19.9%).
Multiple-gene sequencing substantially increased rate of detection of any pathogenic variant (multiple-gene sequencing: higher-risk patients, 12%; average-risk patients, 4.2% and BRCA1/2-only testing: higher-risk patients, 7.8%; average-risk patients, 2.2%) and variants of uncertain significance, especially in minorities (multiple-gene sequencing: white patients, 23.7%; black patients, 44.5%; and Asian patients, 50.9% and BRCA1/2-only testing: white patients, 2.2%; black patients, 5.6%; and Asian patients, 0%). Multiple-gene sequencing was not associated with an increase in the rate of prophylactic mastectomy use, which was highest with pathogenic variants in BRCA1/2 (BRCA1/2, 79.0%; other pathogenic variant, 37.6%; variant of uncertain significance, 30.2%; negative, 35.3%). Multiple-gene sequencing rapidly replaced BRCA1/2-only testing for patients with breast cancer in the community and enabled 2-fold higher detection of clinically relevant pathogenic variants without an associated increase in prophylactic mastectomy. However, important targets for improvement in the clinical utility of multiple-gene sequencing include postsurgical delay and racial/ethnic disparity in variants of uncertain significance.
Chattaway, Marie Anne; Day, Michaela; Mtwale, Julia; White, Emma; Rogers, James; Day, Martin; Powell, David; Ahmad, Marwa; Harris, Ross; Talukder, Kaisar Ali; Wain, John; Jenkins, Claire; Cravioto, Alejandro
2017-10-01
This study investigates the virulence and antimicrobial resistance in association with common clonal complexes (CCs) of enteroaggregative Escherichia coli (EAEC) isolated from Bangladesh. The aim was to determine whether specific CCs were more likely to be associated with putative virulence genes and/or antimicrobial resistance. The presence of 15 virulence genes (by PCR) and susceptibility to 18 antibiotics were determined for 151 EAEC isolated from cases and controls during an intestinal infectious disease study carried out between 2007 and 2011 in the rural setting of Mirzapur, Bangladesh (Kotloff KL, Blackwelder WC, Nasrin D, Nataro JP, Farag TH et al. Clin Infect Dis 2012;55:S232-S245). These data were then analysed in the context of previously determined serotypes and clonal complexes defined by multi-locus sequence typing. Overall there was no association between the presence of virulence or antimicrobial resistance genes in isolates of EAEC from cases versus controls. However, when stratified by clonal complex (CC), one CC associated with cases harboured more virulence factors (CC40) and one CC harboured more resistance genes (CC38) than the average. There was no direct link between the virulence gene content and antibiotic resistance. Strains within a single CC had variable virulence and resistance gene content, indicating independent and multiple gene acquisitions over time. In Bangladesh, there are multiple clonal complexes of EAEC harbouring a variety of virulence and resistance genes. The emergence of two of the most successful clones appeared to be linked to either increased virulence (CC40) or antimicrobial resistance (CC38), but increased resistance and virulence were not found in the same clonal complexes.
Kavakiotis, Ioannis; Xochelli, Aliki; Agathangelidis, Andreas; Tsoumakas, Grigorios; Maglaveras, Nicos; Stamatopoulos, Kostas; Hadzidimitriou, Anastasia; Vlahavas, Ioannis; Chouvarda, Ioanna
2016-06-06
Somatic Hypermutation (SHM) refers to the introduction of mutations within rearranged V(D)J genes, a process that increases the diversity of Immunoglobulins (IGs). The analysis of SHM has offered critical insight into the physiology and pathology of B cells, leading to strong prognostication markers for clinical outcome in chronic lymphocytic leukaemia (CLL), the most frequent adult B-cell malignancy. In this paper we present a methodology for integrating multiple immunogenetic and clinicobiological data sources in order to extract features and create high-quality datasets for SHM analysis in IG receptors of CLL patients. This dataset is used as the basis for a higher-level integration procedure, inspired by social choice theory. This is applied in the Towards Analysis, our attempt to investigate the potential ontogenetic transformation of genes belonging to specific stereotyped CLL subsets towards other genes or gene families, through SHM. The data integration process, followed by feature extraction, resulted in the generation of a dataset containing information about mutations occurring through SHM. The Towards Analysis, performed on the integrated dataset by applying voting techniques, revealed the distinct behaviour of subset #201 compared to other subsets as regards SHM-related movements among gene clans, both in allele-conserved and non-conserved gene areas. With respect to movement between genes, a high percentage of movement towards pseudogenes was found in all CLL subsets. This data integration and feature extraction process can set the basis for exploratory analysis or a fully automated computational data mining approach on many as yet unanswered, clinically relevant biological questions.
Transcriptional activation of Mina by Sp1/3 factors.
Lian, Shangli; Potula, Hari Hara S K; Pillai, Meenu R; Van Stry, Melanie; Koyanagi, Madoka; Chung, Linda; Watanabe, Makiko; Bix, Mark
2013-01-01
Mina is an epigenetic gene regulatory protein known to function in multiple physiological and pathological contexts, including pulmonary inflammation, cell proliferation, cancer and immunity. We showed previously that the level of Mina gene expression is subject to natural genetic variation linked to 21 SNPs occurring in the Mina 5' region. In order to explore the mechanisms regulating Mina gene expression, we set out to molecularly characterize the Mina promoter in the region encompassing these SNPs. We used three kinds of assays--reporter, gel shift and chromatin immunoprecipitation--to analyze a 2 kb genomic fragment spanning the upstream and intron 1 regions flanking exon 1. Here we discovered a pair of Mina promoters (P1 and P2) and a P1-specific enhancer element (E1). Pharmacologic inhibition and siRNA knockdown experiments suggested that Sp1/3 transcription factors trigger Mina expression through additive activity targeted to a cluster of four Sp1/3 binding sites forming the P1 promoter. These results set the stage for comprehensive analysis of Mina gene regulation from the context of tissue specificity, the impact of inherited genetic variation and the nature of upstream signaling pathways.
Transcriptional Activation of Mina by Sp1/3 Factors
Lian, Shangli; Potula, Hari Hara S. K.; Pillai, Meenu R.; Van Stry, Melanie; Koyanagi, Madoka; Chung, Linda; Watanabe, Makiko; Bix, Mark
2013-01-01
Mina is an epigenetic gene regulatory protein known to function in multiple physiological and pathological contexts, including pulmonary inflammation, cell proliferation, cancer and immunity. We showed previously that the level of Mina gene expression is subject to natural genetic variation linked to 21 SNPs occurring in the Mina 5′ region [1]. In order to explore the mechanisms regulating Mina gene expression, we set out to molecularly characterize the Mina promoter in the region encompassing these SNPs. We used three kinds of assays – reporter, gel shift and chromatin immunoprecipitation – to analyze a 2 kb genomic fragment spanning the upstream and intron 1 regions flanking exon 1. Here we discovered a pair of Mina promoters (P1 and P2) and a P1-specific enhancer element (E1). Pharmacologic inhibition and siRNA knockdown experiments suggested that Sp1/3 transcription factors trigger Mina expression through additive activity targeted to a cluster of four Sp1/3 binding sites forming the P1 promoter. These results set the stage for comprehensive analysis of Mina gene regulation from the context of tissue specificity, the impact of inherited genetic variation and the nature of upstream signaling pathways. PMID:24324617
NCBI GEO: archive for functional genomics data sets--10 years on.
Barrett, Tanya; Troup, Dennis B; Wilhite, Stephen E; Ledoux, Pierre; Evangelista, Carlos; Kim, Irene F; Tomashevsky, Maxim; Marshall, Kimberly A; Phillippy, Katherine H; Sherman, Patti M; Muertter, Rolf N; Holko, Michelle; Ayanbule, Oluwabukunmi; Yefanov, Andrey; Soboleva, Alexandra
2011-01-01
A decade ago, the Gene Expression Omnibus (GEO) database was established at the National Center for Biotechnology Information (NCBI). The original objective of GEO was to serve as a public repository for high-throughput gene expression data generated mostly by microarray technology. However, the research community quickly applied microarrays to non-gene-expression studies, including examination of genome copy number variation and genome-wide profiling of DNA-binding proteins. Because the GEO database was designed with a flexible structure, it was possible to quickly adapt the repository to store these data types. More recently, as the microarray community switches to next-generation sequencing technologies, GEO has again adapted to host these data sets. Today, GEO stores over 20,000 microarray- and sequence-based functional genomics studies, and continues to handle the majority of direct high-throughput data submissions from the research community. Multiple mechanisms are provided to help users effectively search, browse, download and visualize the data at the level of individual genes or entire studies. This paper describes recent database enhancements, including new search and data representation tools, as well as a brief review of how the community uses GEO data. GEO is freely accessible at http://www.ncbi.nlm.nih.gov/geo/.
AUC-based biomarker ensemble with an application on gene scores predicting low bone mineral density.
Zhao, X G; Dai, W; Li, Y; Tian, L
2011-11-01
The area under the receiver operating characteristic (ROC) curve (AUC), long regarded as a 'golden' measure for the predictiveness of a continuous score, has propelled the need to develop AUC-based predictors. However, AUC-based ensemble methods are scarce, largely because the associated objective function is neither continuous nor concave. Indeed, there is no reliable numerical algorithm for identifying the optimal combination of a set of biomarkers to maximize the AUC, especially when the number of biomarkers is large. We propose a novel AUC-based statistical ensemble method for combining multiple biomarkers to differentiate a binary response of interest. Specifically, we propose to replace the non-continuous and non-convex AUC objective function by a convex surrogate loss function, whose minimizer can be efficiently identified. With the established framework, the lasso and other regularization techniques enable feature selection. Extensive simulations have demonstrated the superiority of the new methods to the existing methods. The proposal has been applied to a gene expression dataset to construct gene expression scores to differentiate elderly women with low bone mineral density (BMD) from those with normal BMD. The AUCs of the resulting scores in the independent test dataset were satisfactory. Aiming to maximize the AUC directly, the proposed AUC-based ensemble method provides an efficient means of generating a stable combination of multiple biomarkers, which is especially useful in high-dimensional settings. Supplementary data are available at Bioinformatics online.
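The idea of replacing the discontinuous AUC objective with a convex surrogate can be sketched in a few lines of Python. The abstract does not say which surrogate the authors use, so this example assumes a pairwise logistic loss over every (positive, negative) pair, fitted by plain gradient descent with a small ridge penalty in place of the lasso. All names, data and hyperparameters are hypothetical.

```python
import math


def pairwise_differences(X, y):
    """Feature differences x_pos - x_neg for every (positive, negative) pair."""
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]
    return [[p - n for p, n in zip(xp, xn)] for xp in pos for xn in neg]


def fit_auc_surrogate(X, y, lr=0.1, n_iter=500, l2=0.01):
    """Minimize mean log(1 + exp(-w.d)) over pair differences d, plus a ridge term."""
    diffs = pairwise_differences(X, y)
    w = [0.0] * len(X[0])
    for _ in range(n_iter):
        grad = [l2 * wj for wj in w]
        for d in diffs:
            margin = sum(wj * dj for wj, dj in zip(w, d))
            # Derivative of log(1 + exp(-margin)) with respect to margin.
            coef = -1.0 / (1.0 + math.exp(margin))
            for j, dj in enumerate(d):
                grad[j] += coef * dj / len(diffs)
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def empirical_auc(scores, y):
    """Fraction of (positive, negative) pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, label in zip(scores, y) if label == 1]
    neg = [s for s, label in zip(scores, y) if label == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical data: two biomarkers, the first one is informative.
X_demo = [[2.0, 0.1], [1.5, 0.9], [0.5, 0.2], [0.2, 0.8]]
y_demo = [1, 1, 0, 0]
w_demo = fit_auc_surrogate(X_demo, y_demo)
combined = [sum(wj * xj for wj, xj in zip(w_demo, x)) for x in X_demo]
print(w_demo, empirical_auc(combined, y_demo))
```

The empirical AUC counts correctly ordered pairs, which is a step function of the weights; the logistic loss on the same pair differences is smooth and convex, so a standard optimizer can be applied to it.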
Suleiman, Suleiman H.; Koko, Mahmoud E.; Nasir, Wafaa H.; Elfateh, Ommnyiah; Elgizouli, Ubai K.; Abdallah, Mohammed O. E.; Alfarouk, Khalid O.; Hussain, Ayman; Faisal, Shima; Ibrahim, Fathelrahamn M. A.; Romano, Maurizio; Sultan, Ali; Banks, Lawrence; Newport, Melanie; Baralle, Francesco; Elhassan, Ahmed M.; Mohamed, Hiba S.; Ibrahim, Muntaser E.
2015-01-01
The molecular basis of cancer and of its multiple phenotypes is not yet fully understood. Next Generation Sequencing promises new insight into the role of genetic interactions in shaping the complexity of cancer. Aiming to outline the differences in mutation patterns between familial colorectal cancer cases and controls, we analyzed whole exomes of cancer tissues and control samples from an extended colorectal cancer pedigree, providing one of the first data sets of exome sequencing of cancer in an African population against a background of large effective size, typically with an excess of variants. Tumors showed an hMSH2 loss-of-function SNV consistent with Lynch syndrome. Sets of genes harboring insertions–deletions in tumor tissues revealed, however, significant GO enrichment, a feature that was not seen in control samples, suggesting that ordered insertions–deletions are central to tumorigenesis in this type of cancer. Network analysis identified multiple hub genes of centrality. ELAVL1/HuR showed remarkable centrality, interacting especially with genes harboring non-synonymous SNVs, thus reinforcing the proposition of targeted mutagenesis in cancer pathways. A likely explanation for such a mutation pattern is DNA/RNA editing, suggested here by a nucleotide transition-to-transversion ratio that significantly departed from expected values (p-value 5e-6). NFKB1 also showed significant centrality along with ELAVL1, raising the suspicion of viral etiology given the known interaction between oncogenic viruses and these proteins. PMID:26442106
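The transition-to-transversion ratio referred to in this abstract is straightforward to compute from a list of single-nucleotide variants. The sketch below uses invented variants; the abstract does not give the expected ratio or the statistical test behind the reported p-value, so neither is reproduced here.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """Transitions swap purine for purine (A<->G) or pyrimidine for pyrimidine (C<->T)."""
    return ref != alt and ({ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES)


def ti_tv_ratio(snvs):
    """Ratio of transitions to transversions among (ref, alt) pairs."""
    ti = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    tv = sum(1 for ref, alt in snvs if ref != alt and not is_transition(ref, alt))
    return ti / tv if tv else float("inf")


# Hypothetical variant calls as (reference base, alternate base).
demo_snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("G", "T")]
print(ti_tv_ratio(demo_snvs))
```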
Yamada, Takashi; Onimatsu, Hideki; Van Etten, James L.
2007-01-01
Chlorella viruses or chloroviruses are large, icosahedral, plaque‐forming, double‐stranded‐DNA‐containing viruses that replicate in certain strains of the unicellular green alga Chlorella. DNA sequence analysis of the 330‐kbp genome of Paramecium bursaria chlorella virus 1 (PBCV‐1), the prototype of this virus family (Phycodnaviridae), predicts ∼366 protein‐encoding genes and 11 tRNA genes. The predicted gene products of ∼50% of these genes resemble proteins of known function, including many that are completely unexpected for a virus. In addition, the chlorella viruses have several features and encode many gene products that distinguish them from most viruses. These include: (1) multiple DNA methyltransferases and DNA site‐specific endonucleases, (2) the enzymes required to glycosylate their proteins and synthesize polysaccharides such as hyaluronan and chitin, (3) a virus‐encoded K+ channel (called Kcv) located in the internal membrane of the virions, (4) a SET domain containing protein (referred to as vSET) that dimethylates Lys27 in histone 3, and (5) three types of introns in PBCV‐1: a self‐splicing intron, a spliceosomal processed intron, and a small tRNA intron. Accumulating evidence indicates that the chlorella viruses have a very long evolutionary history. This review mainly deals with research on the virion structure, genome rearrangements, gene expression, cell wall degradation, polysaccharide synthesis, and evolution of PBCV‐1 as well as other related viruses. PMID:16877063
Web-TCGA: an online platform for integrated analysis of molecular cancer data sets.
Deng, Mario; Brägelmann, Johannes; Schultze, Joachim L; Perner, Sven
2016-02-06
The Cancer Genome Atlas (TCGA) is a pool of molecular data sets publicly accessible and freely available to cancer researchers anywhere around the world. However, widespread use is limited since an advanced knowledge of statistics and statistical software is required. In order to improve accessibility we created Web-TCGA, a web-based, freely accessible online tool, which can also be run in a private instance, for integrated analysis of molecular cancer data sets provided by TCGA. In contrast to already available tools, Web-TCGA utilizes different methods for analysis and visualization of TCGA data, allowing users to generate global molecular profiles across different cancer entities simultaneously. In addition to global molecular profiles, Web-TCGA offers highly detailed gene- and tumor-entity-centric analysis by providing interactive tables and views. As a supplement to other already available tools, such as cBioPortal (Sci Signal 6:pl1, 2013, Cancer Discov 2:401-4, 2012), Web-TCGA offers an analysis service, which does not require any installation or configuration, for molecular data sets available at the TCGA. Individual processing requests (queries) are generated by the user for mutation, methylation, expression and copy number variation (CNV) analyses. The user can focus analyses on results from single genes and cancer entities or perform a global analysis (multiple cancer entities and genes simultaneously).
Brock, Guy N; Shaffer, John R; Blakesley, Richard E; Lotz, Meredith J; Tseng, George C
2008-01-10
Gene expression data frequently contain missing values; however, most downstream analyses for microarray experiments require complete data. In the literature many methods have been proposed to estimate missing values using the correlation patterns within the gene expression matrix. Each method has its own advantages, but the specific conditions for which each method is preferred remain largely unclear. In this report we describe an extensive evaluation of eight current imputation methods on multiple types of microarray experiments, including time series, multiple exposures, and multiple exposures x time series data. We then introduce two complementary selection schemes for determining the most appropriate imputation method for any given data set. We found that the optimal imputation algorithms (LSA, LLS, and BPCA) are all highly competitive with each other, and that no method is uniformly superior in all the data sets we examined. The success of each method can also depend on the underlying "complexity" of the expression data, where we take complexity to indicate the difficulty in mapping the gene expression matrix to a lower-dimensional subspace. We developed an entropy measure to quantify the complexity of expression matrices and found that, by incorporating this information, the entropy-based selection (EBS) scheme is useful for selecting an appropriate imputation algorithm. We further propose a simulation-based self-training selection (STS) scheme. This technique has been used previously for microarray data imputation, but for different purposes. The scheme selects the optimal or near-optimal method with high accuracy but at an increased computational cost. Our findings provide insight into the problem of which imputation method is optimal for a given data set. Three top-performing methods (LSA, LLS and BPCA) are competitive with each other. 
Global-based imputation methods (PLS, SVD, BPCA) performed better on microarray data with lower complexity, while neighbour-based methods (KNN, OLS, LSA, LLS) performed better in data with higher complexity. We also found that the EBS and STS schemes serve as complementary and effective tools for selecting the optimal imputation algorithm.
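To make the neighbour-based family mentioned above concrete, the following is a minimal sketch of KNN-style imputation on a small expression matrix. It is not the implementation evaluated in the study: the function name `knn_impute`, the use of Euclidean distance over observed columns, the restriction to complete rows as neighbours, and the toy data are all assumptions made for illustration.

```python
import math

def knn_impute(matrix, k=2):
    """Fill None entries in each row from the k most similar complete rows.

    Assumption for this sketch: only rows without missing values are
    eligible neighbours, and similarity is Euclidean distance computed
    over the columns observed in the incomplete row.
    """
    complete = [row for row in matrix if None not in row]
    if not complete:
        raise ValueError("at least one complete row is required")
    imputed = []
    for row in matrix:
        if None not in row:
            imputed.append(list(row))
            continue
        observed = [j for j, value in enumerate(row) if value is not None]

        def distance(other):
            return math.sqrt(sum((row[j] - other[j]) ** 2 for j in observed))

        neighbours = sorted(complete, key=distance)[:k]
        # Replace each missing entry with the neighbours' column average.
        imputed.append([
            value if value is not None
            else sum(n[j] for n in neighbours) / len(neighbours)
            for j, value in enumerate(row)
        ])
    return imputed

# Hypothetical genes-by-samples matrix with one missing measurement.
expression = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 3.1],
    [5.0, 6.0, 7.0],
    [1.0, None, 3.0],
]
filled = knn_impute(expression, k=2)
```

The missing value in the last row is filled from the two rows with the closest profiles, which is why neighbour-based methods rely on local similarity structure rather than a global low-dimensional fit.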
Bayesian state space models for dynamic genetic network construction across multiple tissues.
Liang, Yulan; Kelemen, Arpad
2016-08-01
Construction of gene-gene interaction networks and potential pathways is a challenging and important problem in genomic research for complex diseases while estimating the dynamic changes of the temporal correlations and non-stationarity are the keys in this process. In this paper, we develop dynamic state space models with hierarchical Bayesian settings to tackle this challenge for inferring the dynamic profiles and genetic networks associated with disease treatments. We treat both the stochastic transition matrix and the observation matrix time-variant and include temporal correlation structures in the covariance matrix estimations in the multivariate Bayesian state space models. The unevenly spaced short time courses with unseen time points are treated as hidden state variables. Hierarchical Bayesian approaches with various prior and hyper-prior models with Monte Carlo Markov Chain and Gibbs sampling algorithms are used to estimate the model parameters and the hidden state variables. We apply the proposed Hierarchical Bayesian state space models to multiple tissues (liver, skeletal muscle, and kidney) Affymetrix time course data sets following corticosteroid (CS) drug administration. Both simulation and real data analysis results show that the genomic changes over time and gene-gene interaction in response to CS treatment can be well captured by the proposed models. The proposed dynamic Hierarchical Bayesian state space modeling approaches could be expanded and applied to other large scale genomic data, such as next generation sequence (NGS) combined with real time and time varying electronic health record (EHR) for more comprehensive and robust systematic and network based analysis in order to transform big biomedical data into predictions and diagnostics for precision medicine and personalized healthcare with better decision making and patient outcomes.
Walker, Joseph F; Yang, Ya; Feng, Tao; Timoneda, Alfonso; Mikenas, Jessica; Hutchison, Vera; Edwards, Caroline; Wang, Ning; Ahluwalia, Sonia; Olivieri, Julia; Walker-Hale, Nathanael; Majure, Lucas C; Puente, Raúl; Kadereit, Gudrun; Lauterbach, Maximilian; Eggli, Urs; Flores-Olvera, Hilda; Ochoterena, Helga; Brockington, Samuel F; Moore, Michael J; Smith, Stephen A
2018-03-01
The Caryophyllales contain ~12,500 species and are known for their cosmopolitan distribution, convergence of trait evolution, and extreme adaptations. Some relationships within the Caryophyllales, like those of many large plant clades, remain unclear, and phylogenetic studies often recover alternative hypotheses. We explore the utility of broad and dense transcriptome sampling across the order for resolving evolutionary relationships in Caryophyllales. We generated 84 transcriptomes and combined these with 224 publicly available transcriptomes to perform a phylogenomic analysis of Caryophyllales. To overcome the computational challenge of ortholog detection in such a large data set, we developed an approach for clustering gene families that allowed us to analyze >300 transcriptomes and genomes. We then inferred the species relationships using multiple methods and performed gene-tree conflict analyses. Our phylogenetic analyses resolved many clades with strong support, but also showed significant gene-tree discordance. This discordance is not only a common feature of phylogenomic studies, but also represents an opportunity to understand processes that have structured phylogenies. We also found taxon sampling influences species-tree inference, highlighting the importance of more focused studies with additional taxon sampling. Transcriptomes are useful both for species-tree inference and for uncovering evolutionary complexity within lineages. Through analyses of gene-tree conflict and multiple methods of species-tree inference, we demonstrate that phylogenomic data can provide unparalleled insight into the evolutionary history of Caryophyllales. We also discuss a method for overcoming computational challenges associated with homolog clustering in large data sets. © 2018 The Authors. American Journal of Botany is published by Wiley Periodicals, Inc. on behalf of the Botanical Society of America.
Jung, Hyungtaek; Yoon, Byung-Ha; Kim, Woo-Jin; Kim, Dong-Wook; Hurwood, David A.; Lyons, Russell E.; Salin, Krishna R.; Kim, Heui-Soo; Baek, Ilseon; Chand, Vincent; Mather, Peter B.
2016-01-01
The giant freshwater prawn, Macrobrachium rosenbergii, a sexually dimorphic decapod crustacean is currently the world’s most economically important cultured freshwater crustacean species. Despite its economic importance, there is currently a lack of genomic resources available for this species, and this has limited exploration of the molecular mechanisms that control the M. rosenbergii sex-differentiation system more widely in freshwater prawns. Here, we present the first hybrid transcriptome from M. rosenbergii applying RNA-Seq technologies directed at identifying genes that have potential functional roles in reproductive-related traits. A total of 13,733,210 combined raw reads (1720 Mbp) were obtained from Ion-Torrent PGM and 454 FLX. Bioinformatic analyses based on three state-of-the-art assemblers, the CLC Genomic Workbench, Trans-ABySS, and Trinity, that use single and multiple k-mer methods respectively, were used to analyse the data. The influence of multiple k-mers on assembly performance was assessed to gain insight into transcriptome assembly from short reads. After optimisation, de novo assembly resulted in 44,407 contigs with a mean length of 437 bp, and the assembled transcripts were further functionally annotated to detect single nucleotide polymorphisms and simple sequence repeat motifs. Gene expression analysis was also used to compare expression patterns from ovary and testis tissue libraries to identify genes with potential roles in reproduction and sex differentiation. The large transcript set assembled here represents the most comprehensive set of transcriptomic resources ever developed for reproduction traits in M. rosenbergii, and the large number of genetic markers predicted should constitute an invaluable resource for future genetic research studies on M. rosenbergii and can be applied more widely on other freshwater prawn species in the genus Macrobrachium. PMID:27164098
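Summary figures such as contig count and mean contig length are simple functions of the contig length list. The sketch below shows one way they could be computed; the function name `assembly_stats` and the toy lengths are hypothetical, and N50 is included only as a commonly reported companion statistic, not one quoted in the abstract.

```python
def assembly_stats(contig_lengths):
    """Return contig count, total length, mean length and N50 (all in bp)."""
    if not contig_lengths:
        raise ValueError("no contigs supplied")
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    n50 = lengths[-1]
    for length in lengths:
        running += length
        # N50: length of the contig at which the cumulative sum
        # first reaches half of the total assembly length.
        if running * 2 >= total:
            n50 = length
            break
    return {
        "contigs": len(lengths),
        "total_bp": total,
        "mean_bp": total / len(lengths),
        "n50_bp": n50,
    }

# Hypothetical contig lengths, for illustration only.
stats = assembly_stats([100, 200, 300, 400])
```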
NASA Astrophysics Data System (ADS)
Kaminski, Naftali; Allard, John D.; Pittet, Jean F.; Zuo, Fengrong; Griffiths, Mark J. D.; Morris, David; Huang, Xiaozhu; Sheppard, Dean; Heller, Renu A.
2000-02-01
The molecular mechanisms of pulmonary fibrosis are poorly understood. We have used oligonucleotide arrays to analyze the gene expression programs that underlie pulmonary fibrosis in response to bleomycin, a drug that causes lung inflammation and fibrosis, in two strains of susceptible mice (129 and C57BL/6). We then compared the gene expression patterns in these mice with 129 mice carrying a null mutation in the epithelial-restricted integrin β6 subunit (β6-/-), which develop inflammation but are protected from pulmonary fibrosis. Cluster analysis identified two distinct groups of genes involved in the inflammatory and fibrotic responses. Analysis of gene expression at multiple time points after bleomycin administration revealed sequential induction of subsets of genes that characterize each response. The availability of this comprehensive data set should accelerate the development of more effective strategies for intervention at the various stages in the development of fibrotic diseases of the lungs and other organs.
Vivar, Juan C; Pemu, Priscilla; McPherson, Ruth; Ghosh, Sujoy
2013-08-01
Unparalleled technological advances have fueled an explosive growth in the scope and scale of biological data and have propelled life sciences into the realm of "Big Data" that cannot be managed or analyzed by conventional approaches. Big Data in the life sciences are driven primarily via a diverse collection of 'omics'-based technologies, including genomics, proteomics, metabolomics, transcriptomics, metagenomics, and lipidomics. Gene-set enrichment analysis is a powerful approach for interrogating large 'omics' datasets, leading to the identification of biological mechanisms associated with observed outcomes. While several factors influence the results from such analysis, the impact from the contents of pathway databases is often under-appreciated. Pathway databases often contain variously named pathways that overlap with one another to varying degrees. Ignoring such redundancies during pathway analysis can lead to the designation of several pathways as being significant due to high content-similarity, rather than truly independent biological mechanisms. Statistically, such dependencies also result in correlated p values and overdispersion, leading to biased results. We investigated the level of redundancies in multiple pathway databases and observed large discrepancies in the nature and extent of pathway overlap. This prompted us to develop the application, ReCiPa (Redundancy Control in Pathway Databases), to control redundancies in pathway databases based on user-defined thresholds. Analysis of genomic and genetic datasets, using ReCiPa-generated overlap-controlled versions of KEGG and Reactome pathways, led to a reduction in redundancy among the top-scoring gene-sets and allowed for the inclusion of additional gene-sets representing possibly novel biological mechanisms. 
Using obesity as an example, bioinformatic analysis further demonstrated that gene-sets identified from overlap-controlled pathway databases show stronger evidence of prior association to obesity compared to pathways identified from the original databases.
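The idea of controlling redundancy with a user-defined threshold can be illustrated with a short sketch. This is not ReCiPa's actual algorithm: the overlap measure (shared genes divided by the size of the smaller set), the greedy pairwise merging, the function names, and the pathway and gene labels are all assumptions chosen to keep the example small.

```python
def overlap(a, b):
    """Fraction of the smaller gene set that is shared with the other set."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def merge_redundant(pathways, threshold=0.8):
    """Greedily merge pathway pairs whose overlap meets the threshold."""
    merged = {name: set(genes) for name, genes in pathways.items()}
    changed = True
    while changed:
        changed = False
        names = sorted(merged)
        for i, first in enumerate(names):
            for second in names[i + 1:]:
                if overlap(merged[first], merged[second]) >= threshold:
                    # Replace the two redundant pathways with their union.
                    union = merged.pop(first) | merged.pop(second)
                    merged[first + "+" + second] = union
                    changed = True
                    break
            if changed:
                break
    return merged

# Hypothetical pathways and gene identifiers.
pathways = {
    "A": {"g1", "g2", "g3", "g4"},
    "B": {"g1", "g2", "g3"},
    "C": {"g7", "g8"},
}
controlled = merge_redundant(pathways, threshold=0.8)
```

With these inputs, pathways A and B share all of B's genes and are merged, while C is left untouched, so downstream enrichment would test two gene sets instead of three.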
Tsai, Yu-Shuen; Aguan, Kripamoy; Pal, Nikhil R.; Chung, I-Fang
2011-01-01
Informative genes from microarray data can be used to construct prediction models and investigate biological mechanisms. Differentially expressed genes, the main targets of most gene selection methods, can be classified as single- and multiple-class specific signature genes. Here, we present a novel gene selection algorithm based on a Group Marker Index (GMI), which is intuitive, of low computational complexity, and efficient in identifying both types of genes. Most gene selection methods identify only single-class specific signature genes and cannot identify multiple-class specific signature genes easily. Our algorithm can detect de novo certain conditions of multiple-class specificity of a gene and makes use of a novel non-parametric indicator to assess the discrimination ability between classes. Our method is effective even when the sample size is small as well as when the class sizes are significantly different. To compare effectiveness and robustness, we formulate an intuitive template-based method and use four well-known datasets. We demonstrate that our algorithm outperforms the template-based method in difficult cases with unbalanced distribution. Moreover, the multiple-class specific genes are good biomarkers and play important roles in biological pathways. Our literature survey supports that the proposed method identifies unique multiple-class specific marker genes (not reported earlier to be related to cancer) in the Central Nervous System data. It also discovers unique biomarkers indicating the intrinsic difference between subtypes of lung cancer. We also associate the pathway information with the multiple-class specific signature genes and cross-reference to published studies. We find that the identified genes participate in the pathways directly involved in cancer development in leukemia data. 
Our method offers a promising way to find genes involved in pathways of multiple diseases, and hence opens up the possibility of using an existing drug on other diseases as well as designing a single drug for multiple diseases. PMID:21909426
2012-01-01
Background Metallothioneins (MT) are low molecular weight, cysteine rich metal binding proteins, found across genera and species, but their function(s) in abiotic stress tolerance are not well documented. Results We have characterized a rice MT gene, OsMT1e-P, isolated from a subtractive library generated from a stressed salinity tolerant rice genotype, Pokkali. Bioinformatics analysis of the rice genome sequence revealed that this gene belongs to a multigenic family, which consists of 13 genes with 15 protein products. OsMT1e-P is located on chromosome XI, away from the majority of other type I genes that are clustered on chromosome XII. Various members of this MT gene cluster showed a tight co-regulation pattern under several abiotic stresses. Sequence analysis revealed the presence of conserved cysteine residues in OsMT1e-P protein. Salinity stress was found to regulate the transcript abundance of OsMT1e-P in a developmental and organ specific manner. Using transgenic approach, we found a positive correlation between ectopic expression of OsMT1e-P and stress tolerance. Our experiments further suggest ROS scavenging to be the possible mechanism for multiple stress tolerance conferred by OsMT1e-P. Conclusion We present an overview of MTs, describing their gene structure, genome localization and expression patterns under salinity and development in rice. We have found that ectopic expression of OsMT1e-P enhances tolerance towards multiple abiotic stresses in transgenic tobacco and the resultant plants could survive and set viable seeds under saline conditions. Taken together, the experiments presented here have indicated that ectopic expression of OsMT1e-P protects against oxidative stress primarily through efficient scavenging of reactive oxygen species. PMID:22780875
Kumar, Gautam; Kushwaha, Hemant Ritturaj; Panjabi-Sabharwal, Vaishali; Kumari, Sumita; Joshi, Rohit; Karan, Ratna; Mittal, Shweta; Pareek, Sneh L Singla; Pareek, Ashwani
2012-07-10
Metallothioneins (MT) are low molecular weight, cysteine rich metal binding proteins, found across genera and species, but their function(s) in abiotic stress tolerance are not well documented. We have characterized a rice MT gene, OsMT1e-P, isolated from a subtractive library generated from a stressed salinity tolerant rice genotype, Pokkali. Bioinformatics analysis of the rice genome sequence revealed that this gene belongs to a multigenic family, which consists of 13 genes with 15 protein products. OsMT1e-P is located on chromosome XI, away from the majority of other type I genes that are clustered on chromosome XII. Various members of this MT gene cluster showed a tight co-regulation pattern under several abiotic stresses. Sequence analysis revealed the presence of conserved cysteine residues in OsMT1e-P protein. Salinity stress was found to regulate the transcript abundance of OsMT1e-P in a developmental and organ specific manner. Using transgenic approach, we found a positive correlation between ectopic expression of OsMT1e-P and stress tolerance. Our experiments further suggest ROS scavenging to be the possible mechanism for multiple stress tolerance conferred by OsMT1e-P. We present an overview of MTs, describing their gene structure, genome localization and expression patterns under salinity and development in rice. We have found that ectopic expression of OsMT1e-P enhances tolerance towards multiple abiotic stresses in transgenic tobacco and the resultant plants could survive and set viable seeds under saline conditions. Taken together, the experiments presented here have indicated that ectopic expression of OsMT1e-P protects against oxidative stress primarily through efficient scavenging of reactive oxygen species.
Properties of genes essential for mouse development
Kabir, Mitra; Barradas, Ana; Tzotzos, George T.; Hentges, Kathryn E.
2017-01-01
Essential genes are those that are critical for life. In the specific case of the mouse, they are the set of genes whose deletion means that a mouse is unable to survive after birth. As such, they are the key minimal set of genes needed for all the steps of development to produce an organism capable of life ex utero. We explored a wide range of sequence and functional features to characterise essential (lethal) and non-essential (viable) genes in mice. Experimental data curated manually identified 1301 essential genes and 3451 viable genes. Very many sequence features show highly significant differences between essential and viable mouse genes. Essential genes generally encode complex proteins, with multiple domains and many introns. These genes tend to be: long, highly expressed, old and evolutionarily conserved. These genes tend to encode ligases, transferases, phosphorylated proteins, intracellular proteins, nuclear proteins, and hubs in protein-protein interaction networks. They are involved with regulating protein-protein interactions, gene expression and metabolic processes, cell morphogenesis, cell division, cell proliferation, DNA replication, cell differentiation, DNA repair and transcription, cell differentiation and embryonic development. Viable genes tend to encode: membrane proteins or secreted proteins, and are associated with functions such as cellular communication, apoptosis, behaviour and immune response, as well as housekeeping and tissue specific functions. Viable genes are linked to transport, ion channels, signal transduction, calcium binding and lipid binding, consistent with their location in membranes and involvement with cell-cell communication. From the analysis of the composite features of essential and viable genes, we conclude that essential genes tend to be required for intracellular functions, and viable genes tend to be involved with extracellular functions and cell-cell communication. 
Knowledge of the features that are over-represented in essential genes allows for a deeper understanding of the functions and processes implemented during mammalian development. PMID:28562614
Ghosh, Sujoy; Vivar, Juan; Nelson, Christopher P; Willenborg, Christina; Segrè, Ayellet V; Mäkinen, Ville-Petteri; Nikpay, Majid; Erdmann, Jeannette; Blankenberg, Stefan; O'Donnell, Christopher; März, Winfried; Laaksonen, Reijo; Stewart, Alexandre FR; Epstein, Stephen E; Shah, Svati H; Granger, Christopher B; Hazen, Stanley L; Kathiresan, Sekar; Reilly, Muredach P; Yang, Xia; Quertermous, Thomas; Samani, Nilesh J; Schunkert, Heribert; Assimes, Themistocles L; McPherson, Ruth
2016-01-01
Objective Genome-wide association (GWA) studies have identified multiple genetic variants affecting the risk of coronary artery disease (CAD). However, individually these explain only a small fraction of the heritability of CAD and for most, the causal biological mechanisms remain unclear. We sought to obtain further insights into potential causal processes of CAD by integrating large-scale GWA data with expertly curated databases of core human pathways and functional networks. Approaches and Results Employing pathways (gene sets) from Reactome, we carried out a two-stage gene set enrichment analysis strategy. From a meta-analyzed discovery cohort of 7 CAD GWAS data sets (9,889 cases/11,089 controls), nominally significant gene-sets were tested for replication in a meta-analysis of 9 additional studies (15,502 cases/55,730 controls) from the CARDIoGRAM Consortium. A total of 32 of 639 Reactome pathways tested showed convincing association with CAD (replication p<0.05). These pathways resided in 9 of 21 core biological processes represented in Reactome, and included pathways relevant to extracellular matrix integrity, innate immunity, axon guidance, and signaling by PDGF, NOTCH, and the TGF-β/SMAD receptor complex. Many of these pathways had strengths of association comparable to those observed in lipid transport pathways. Network analysis of unique genes within the replicated pathways further revealed several interconnected functional and topologically interacting modules representing novel associations (e.g. semaphorin regulated axonal guidance pathway) besides confirming known processes (lipid metabolism). The connectivity in the observed networks was statistically significant compared to random networks (p<0.001). Network centrality analysis (‘degree’ and ‘betweenness’) further identified genes (e.g. NCAM1, FYN, FURIN) likely to play critical roles in the maintenance and functioning of several of the replicated pathways. 
Conclusions These findings provide novel insights into how genetic variation, interpreted in the context of biological processes and functional interactions among genes, may help define the genetic architecture of CAD. PMID:25977570
Random forests-based differential analysis of gene sets for gene expression data.
Hsueh, Huey-Miin; Zhou, Da-Wei; Tsai, Chen-An
2013-04-10
In DNA microarray studies, gene-set analysis (GSA) has become the focus of gene expression data analysis. GSA utilizes the gene expression profiles of functionally related gene sets in Gene Ontology (GO) categories or a priori defined biological classes to assess the significance of gene sets associated with clinical outcomes or phenotypes. Many statistical approaches have been proposed to determine whether such functionally related gene sets are differentially expressed (enrichment and/or deletion) in variations of phenotypes. However, little attention has been given to the discriminatory power of gene sets and classification of patients. In this study, we propose a method of gene set analysis, in which gene sets are used to develop classifications of patients based on the Random Forest (RF) algorithm. The corresponding empirical p-value of an observed out-of-bag (OOB) error rate of the classifier is introduced to identify differentially expressed gene sets using an adequate resampling method. In addition, we discuss the impacts and correlations of genes within each gene set based on the measures of variable importance in the RF algorithm. Significant classifications are reported and visualized together with the underlying gene sets and their contribution to the phenotypes of interest. Numerical studies using both synthesized data and a series of publicly available gene expression data sets are conducted to evaluate the performance of the proposed methods. Compared with other hypothesis testing approaches, our proposed methods are reliable and successful in identifying enriched gene sets and in discovering the contributions of genes within a gene set. The classification results of identified gene sets can provide a valuable alternative to gene set testing to reveal the unknown, biologically relevant classes of samples or patients. 
In summary, our proposed method allows one to simultaneously assess the discriminatory ability of gene sets and the importance of genes for interpretation of data in complex biological systems. The classifications of biologically defined gene sets can reveal the underlying interactions of gene sets associated with the phenotypes, and provide an insightful complement to conventional gene set analyses. Copyright © 2012 Elsevier B.V. All rights reserved.
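The empirical p-value of an observed error rate can be sketched as a permutation test: shuffle the phenotype labels, recompute the error each time, and ask how often the shuffled error is at least as low as the observed one. The code below is a generic sketch under stated assumptions, not the authors' procedure: `error_fn` is a hypothetical stand-in for refitting a Random Forest and reading its OOB error, and a fixed-threshold rule on made-up scores replaces the classifier so that the example stays self-contained.

```python
import random

def empirical_p_value(observed_error, null_errors):
    """Share of null errors at least as low as the observed error.

    The +1 terms count the observed value itself, so the estimate
    is never exactly zero.
    """
    hits = sum(1 for error in null_errors if error <= observed_error)
    return (hits + 1) / (len(null_errors) + 1)

def permutation_null(labels, error_fn, n_perm=200, seed=0):
    """Error rates obtained after shuffling the class labels."""
    rng = random.Random(seed)
    null_errors = []
    for _ in range(n_perm):
        shuffled = list(labels)
        rng.shuffle(shuffled)
        null_errors.append(error_fn(shuffled))
    return null_errors

# Toy stand-in for a classifier's error rate (assumption, not an RF).
scores = [0.1, 0.2, 0.3, 0.8, 0.9, 1.0]
labels = [0, 0, 0, 1, 1, 1]

def threshold_error(current_labels):
    wrong = sum(int(s > 0.5) != y for s, y in zip(scores, current_labels))
    return wrong / len(scores)

observed = threshold_error(labels)
p_value = empirical_p_value(observed, permutation_null(labels, threshold_error))
```

A gene set whose observed error is rarely matched under label shuffling receives a small p-value and would be flagged as discriminating between phenotypes.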
Gene finding in metatranscriptomic sequences.
Ismail, Wazim Mohammed; Ye, Yuzhen; Tang, Haixu
2014-01-01
Metatranscriptomic sequencing is a highly sensitive bioassay of functional activity in a microbial community, providing complementary information to the metagenomic sequencing of the community. The acquisition of the metatranscriptomic sequences will enable us to refine the annotations of the metagenomes, and to study the gene activities and their regulation in complex microbial communities and their dynamics. In this paper, we present TransGeneScan, a software tool for finding genes in assembled transcripts from metatranscriptomic sequences. By incorporating several features of metatranscriptomic sequencing, including strand-specificity, short intergenic regions, and putative antisense transcripts into a Hidden Markov Model, TransGeneScan can predict a sense transcript containing one or multiple genes (in an operon) or an antisense transcript. We tested TransGeneScan on a mock metatranscriptomic data set containing three known bacterial genomes. The results showed that TransGeneScan performs better than metagenomic gene finders (MetaGeneMark and FragGeneScan) on predicting protein coding genes in assembled transcripts, and achieves comparable or even higher accuracy than gene finders for microbial genomes (Glimmer and GeneMark). These results imply that, with the assistance of metatranscriptomic sequencing, we can obtain a broad and precise picture of the genes (and their functions) in a microbial community. TransGeneScan is available as open-source software on SourceForge at https://sourceforge.net/projects/transgenescan/.
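Hidden Markov Model gene finders typically decode the most likely sequence of hidden states with the Viterbi algorithm. The sketch below runs Viterbi on a toy two-state model; it does not reproduce TransGeneScan's model. The states ("G" for genic, "I" for intergenic), the observation symbols ("H"/"L" for high or low coding signal), and all probabilities are invented for illustration.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most probable hidden-state path, computed in log space."""
    first = observations[0]
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][first])
               for s in states}]
    back = [{}]
    for obs in observations[1:]:
        row, pointers = {}, {}
        for s in states:
            # Best predecessor state for arriving in state s.
            prev, best = max(
                ((p, scores[-1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            row[s] = best + math.log(emit_p[s][obs])
            pointers[s] = prev
        scores.append(row)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: scores[-1][s])
    path = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Toy parameters (assumptions for this sketch only).
states = ["G", "I"]
start_p = {"G": 0.5, "I": 0.5}
trans_p = {"G": {"G": 0.9, "I": 0.1}, "I": {"G": 0.1, "I": 0.9}}
emit_p = {"G": {"H": 0.9, "L": 0.1}, "I": {"H": 0.1, "L": 0.9}}

path = "".join(viterbi("LLHHHLL", states, start_p, trans_p, emit_p))
```

For this input the decoded path labels the run of high-signal positions as genic and the flanking positions as intergenic.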
Chavan, Shweta S; Bauer, Michael A; Peterson, Erich A; Heuck, Christoph J; Johann, Donald J
2013-01-01
Transcriptome analysis by microarrays has produced important advances in biomedicine. For instance in multiple myeloma (MM), microarray approaches led to the development of an effective disease subtyping via cluster assignment, and a 70 gene risk score. Both enabled an improved molecular understanding of MM, and have provided prognostic information for the purposes of clinical management. Many researchers are now transitioning to Next Generation Sequencing (NGS) approaches and RNA-seq in particular, due to its discovery-based nature, improved sensitivity, and dynamic range. Additionally, RNA-seq allows for the analysis of gene isoforms, splice variants, and novel gene fusions. Given the voluminous amounts of historical microarray data, there is now a need to associate and integrate microarray and RNA-seq data via advanced bioinformatic approaches. Custom software was developed following a model-view-controller (MVC) approach to integrate Affymetrix probe set-IDs, and gene annotation information from a variety of sources. The tool/approach employs an assortment of strategies to integrate, cross reference, and associate microarray and RNA-seq datasets. Output from a variety of transcriptome reconstruction and quantitation tools (e.g., Cufflinks) can be directly integrated, and/or associated with Affymetrix probe set data, as well as necessary gene identifiers and/or symbols from a diversity of sources. Strategies are employed to maximize the annotation and cross referencing process. Custom gene sets (e.g., MM 70 risk score (GEP-70)) can be specified, and the tool can be directly assimilated into an RNA-seq pipeline. A novel bioinformatic approach to aid in the facilitation of both annotation and association of historic microarray data, in conjunction with richer RNA-seq data, is now assisting with the study of MM cancer biology.
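The core of associating microarray and RNA-seq data is a join on a shared gene identifier. The following sketch shows one simple way to do that; it is not the custom MVC software described above. The function name `integrate`, the probe identifiers, gene symbols, and expression values are all hypothetical, and averaging multiple probes per gene is an assumption made for brevity.

```python
from collections import defaultdict

def integrate(probe_to_gene, microarray, rnaseq):
    """Join probe-level microarray values with gene-level RNA-seq values.

    probe_to_gene: probe set ID -> gene symbol (annotation table)
    microarray:    probe set ID -> expression value
    rnaseq:        gene symbol  -> expression value (e.g. FPKM)
    """
    by_gene = defaultdict(list)
    for probe, value in microarray.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:  # skip probes without annotation
            by_gene[gene].append(value)
    table = {}
    for gene, values in by_gene.items():
        if gene in rnaseq:
            table[gene] = {
                # Assumption: collapse multiple probes by their mean.
                "microarray": sum(values) / len(values),
                "rnaseq": rnaseq[gene],
            }
    return table

# Hypothetical annotation and measurements.
probe_to_gene = {"probe_001": "GENE_A", "probe_002": "GENE_A", "probe_003": "GENE_B"}
microarray = {"probe_001": 8.0, "probe_002": 10.0, "probe_003": 5.0, "probe_999": 1.0}
rnaseq = {"GENE_A": 42.0, "GENE_C": 7.5}

combined = integrate(probe_to_gene, microarray, rnaseq)
```

Only genes present in both platforms survive the join, so a custom gene set can then be looked up in `combined` to compare its microarray and RNA-seq measurements side by side.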
Transcriptome analysis of trigeminal ganglia following masseter muscle inflammation in rats
Park, Jennifer; Asgar, Jamila; Ro, Jin Y.
2016-01-01
Background Chronic pain in masticatory muscles is a major medical problem. Although mechanisms underlying persistent pain in masticatory muscles are not fully understood, sensitization of nociceptive primary afferents following muscle inflammation or injury contributes to muscle hyperalgesia. It is well known that craniofacial muscle injury or inflammation induces regulation of multiple genes in trigeminal ganglia, which is associated with muscle hyperalgesia. However, overall transcriptional profiles within trigeminal ganglia following masseter inflammation have not yet been determined. In the present study, we performed RNA sequencing assay in rat trigeminal ganglia to identify transcriptome profiles of genes relevant to hyperalgesia following inflammation of the rat masseter muscle. Results Masseter inflammation differentially regulated >3500 genes in trigeminal ganglia. Predominant biological pathways were predicted to be related with activation of resident non-neuronal cells within trigeminal ganglia or recruitment of immune cells. To focus our analysis on the genes more relevant to nociceptors, we selected genes implicated in pain mechanisms, genes enriched in small- to medium-sized sensory neurons, and genes enriched in TRPV1-lineage nociceptors. Among the 2320 candidate genes, 622 genes showed differential expression following masseter inflammation. When the analysis was limited to these candidate genes, pathways related with G protein-coupled signaling and synaptic plasticity were predicted to be enriched. Inspection of individual gene expression changes confirmed the transcriptional changes of multiple nociceptor genes associated with masseter hyperalgesia (e.g., Trpv1, Trpa1, P2rx3, Tac1, and Bdnf) and also suggested a number of novel probable contributors (e.g., Piezo2, Tmem100, and Hdac9). 
Conclusion These findings should further advance our understanding of peripheral mechanisms involved in persistent craniofacial muscle pain conditions and provide a rational basis for identifying novel genes or sets of genes that can be potentially targeted for treating such conditions. PMID:27702909
Bayesian correlated clustering to integrate multiple datasets
Kirk, Paul; Griffin, Jim E.; Savage, Richard S.; Ghahramani, Zoubin; Wild, David L.
2012-01-01
Motivation: The integration of multiple datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct—but often complementary—information. We present a Bayesian method for the unsupervised integrative modelling of multiple datasets, which we refer to as MDI (Multiple Dataset Integration). MDI can integrate information from a wide range of different datasets and data types simultaneously (including the ability to model time series data explicitly using Gaussian processes). Each dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture model, with dependencies between these models captured through parameters that describe the agreement among the datasets. Results: Using a set of six artificially constructed time series datasets, we show that MDI is able to integrate a significant number of datasets simultaneously, and that it successfully captures the underlying structural similarity between the datasets. We also analyse a variety of real Saccharomyces cerevisiae datasets. In the two-dataset case, we show that MDI’s performance is comparable with the present state-of-the-art. We then move beyond the capabilities of current approaches and integrate gene expression, chromatin immunoprecipitation–chip and protein–protein interaction data, to identify a set of protein complexes for which genes are co-regulated during the cell cycle. Comparisons to other unsupervised data integration techniques—as well as to non-integrative approaches—demonstrate that MDI is competitive, while also providing information that would be difficult or impossible to extract using other methods. Availability: A Matlab implementation of MDI is available from http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software/. Contact: D.L.Wild@warwick.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online. PMID:23047558
DOE Office of Scientific and Technical Information (OSTI.GOV)
Howe, Adina; Yang, Fan; Williams, Ryan J.
Despite the central role of soil microbial communities in global carbon (C) cycling, little is known about soil microbial community structure and even less about their metabolic pathways. Efforts to characterize soil communities often focus on identifying differences in gene content across environmental gradients, but an alternative question is what genes are similar in soils. These genes may indicate critical species or potential functions that are required in all soils. Here we identified the “core” set of C cycling sequences widely present in multiple soil metagenomes from a fertilized prairie (FP). Of 226,887 sequences associated with known enzymes involved in the synthesis, metabolism, and transport of carbohydrates, 843 were identified to be consistently prevalent across four replicate soil metagenomes. This core metagenome was functionally and taxonomically diverse, representing five enzyme classes and 99 enzyme families within the CAZy database. Though it only comprised 0.4% of all CAZy-associated genes identified in FP metagenomes, the core was found to be comprised of functions similar to those within cumulative soils. The FP CAZy-associated core sequences were present in multiple publicly available soil metagenomes and most similar to soils sharing geographic proximity. As a result, in soil ecosystems, where high diversity remains a key challenge for metagenomic investigations, these core genes represent a subset of critical functions necessary for carbohydrate metabolism, which can be targeted to evaluate important C fluxes in these and other similar soils.
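The "core" described above, sequences consistently present across all four replicate metagenomes, can be illustrated as a set intersection. The following is a minimal Python sketch only: the replicate names and sequence identifiers are hypothetical, and the abstract does not state the exact prevalence criterion the authors applied.

```python
from functools import reduce

def core_set(replicates):
    """Return the identifiers present in every replicate.

    `replicates` maps a replicate name to the set of sequence identifiers
    detected in it. All names and identifiers here are hypothetical.
    """
    if not replicates:
        return set()
    return reduce(set.intersection, (set(ids) for ids in replicates.values()))

# Toy data: four replicate metagenomes with made-up CAZy-like sequence IDs.
replicates = {
    "rep1": {"GH5_a", "GT2_b", "CE4_c", "PL1_d"},
    "rep2": {"GH5_a", "GT2_b", "CE4_c"},
    "rep3": {"GH5_a", "GT2_b", "AA3_e"},
    "rep4": {"GH5_a", "GT2_b", "CE4_c", "AA3_e"},
}

core = core_set(replicates)                      # shared by all four replicates
all_ids = set().union(*replicates.values())      # everything seen anywhere
core_fraction = len(core) / len(all_ids)         # share of the total that is core
```

For scale, the abstract's own figures give 843 / 226,887 ≈ 0.37%, in line with the reported 0.4%.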
Ethylene Response Factors Are Controlled by Multiple Harvesting Stresses in Hevea brasiliensis
Putranto, Riza-Arief; Duan, Cuifang; Kuswanhadi; Chaidamsari, Tetty; Rio, Maryannick; Piyatrakul, Piyanuch; Herlinawati, Eva; Pirrello, Julien; Dessailly, Florence; Leclercq, Julie; Bonnot, François; Tang, Chaorong; Hu, Songnian; Montoro, Pascal
2015-01-01
Tolerance of recurrent mechanical wounding and exogenous ethylene is a feature of the rubber tree. Latex harvesting involves tapping of the tree bark and ethephon is applied to increase latex flow. Ethylene is an essential element in controlling latex production. The ethylene signalling pathway leads to the activation of Ethylene Response Factor (ERF) transcription factors. This family has been identified in Hevea brasiliensis. This study set out to understand the regulation of ERF genes during latex harvesting in relation to abiotic stress and hormonal treatments. Analyses of the relative transcript abundance were carried out for 35 HbERF genes in latex, in bark from mature trees and in leaves from juvenile plants under multiple abiotic stresses. Twenty-one HbERF genes were regulated by harvesting stress in laticifers, revealing an overrepresentation of genes in group IX. Transcripts of three HbERF-IX genes from HbERF-IXc4, HbERF-IXc5 and HbERF-IXc6 were dramatically accumulated by combining wounding, methyl jasmonate and ethylene treatments. When an ethylene inhibitor was used, the transcript accumulation for these three genes was halted, showing ethylene-dependent induction. Subcellular localization and transactivation experiments confirmed that several members of HbERF-IX are activator-type transcription factors. This study suggested that latex harvesting induces mechanisms developed for the response to abiotic stress. These mechanisms probably depend on various hormonal signalling pathways. Several members of HbERF-IX could be essential integrators of complex hormonal signalling pathways in Hevea. PMID:25906196
pico-PLAZA, a genome database of microbial photosynthetic eukaryotes.
Vandepoele, Klaas; Van Bel, Michiel; Richard, Guilhem; Van Landeghem, Sofie; Verhelst, Bram; Moreau, Hervé; Van de Peer, Yves; Grimsley, Nigel; Piganeau, Gwenael
2013-08-01
With the advent of next generation genome sequencing, the number of sequenced algal genomes and transcriptomes is rapidly growing. Although a few genome portals exist to browse individual genome sequences, exploring complete genome information from multiple species for the analysis of user-defined sequences or gene lists remains a major challenge. pico-PLAZA is a web-based resource (http://bioinformatics.psb.ugent.be/pico-plaza/) for algal genomics that combines different data types with intuitive tools to explore genomic diversity, perform integrative evolutionary sequence analysis and study gene functions. Apart from homologous gene families, multiple sequence alignments, phylogenetic trees, Gene Ontology, InterPro and text-mining functional annotations, different interactive viewers are available to study genome organization using gene collinearity and synteny information. Different search functions, documentation pages, export functions and an extensive glossary are available to guide non-expert scientists. To illustrate the versatility of the platform, different case studies are presented demonstrating how pico-PLAZA can be used to functionally characterize large-scale EST/RNA-Seq data sets and to perform environmental genomics. Functional enrichment analysis of 16 Phaeodactylum tricornutum transcriptome libraries offers a molecular view on diatom adaptation to different environments of ecological relevance. Furthermore, we show how complementary genomic data sources can easily be combined to identify marker genes to study the diversity and distribution of algal species, for example in metagenomes, or to quantify intraspecific diversity from environmental strains. © 2013 John Wiley & Sons Ltd and Society for Applied Microbiology.
Generation of mammalian cells stably expressing multiple genes at predetermined levels.
Liu, X; Constantinescu, S N; Sun, Y; Bogan, J S; Hirsch, D; Weinberg, R A; Lodish, H F
2000-04-10
Expression of cloned genes at desired levels in cultured mammalian cells is essential for studying protein function. Controlled levels of expression have been difficult to achieve, especially for cell lines with low transfection efficiency or when expression of multiple genes is required. An internal ribosomal entry site (IRES) has been incorporated into many types of expression vectors to allow simultaneous expression of two genes. However, there has been no systematic quantitative analysis of expression levels in individual cells of genes linked by an IRES, and thus the broad use of these vectors in functional analysis has been limited. We constructed a set of retroviral expression vectors containing an IRES followed by a quantitative selectable marker such as green fluorescent protein (GFP) or truncated cell surface proteins CD2 or CD4. The gene of interest is placed in a multiple cloning site 5' of the IRES sequence under the control of the retroviral long terminal repeat (LTR) promoter. These vectors exploit the approximately 100-fold differences in levels of expression of a retrovirus vector depending on its site of insertion in the host chromosome. We show that the level of expression of the gene downstream of the IRES and the expression level and functional activity of the gene cloned upstream of the IRES are highly correlated in stably infected target cells. This feature makes our vectors extremely useful for the rapid generation of stably transfected cell populations or clonal cell lines expressing specific amounts of a desired protein simply by fluorescent activated cell sorting (FACS) based on the level of expression of the gene downstream of the IRES. We show how these vectors can be used to generate cells expressing high levels of the erythropoietin receptor (EpoR) or a dominant negative Smad3 protein and to generate cells expressing two different cloned proteins, Ski and Smad4. 
Correlation of a biologic effect with the level of expression of the protein downstream of the IRES provides strong evidence for the function of the protein placed upstream of the IRES.
Accurate, Rapid Taxonomic Classification of Fungal Large-Subunit rRNA Genes
Liu, Kuan-Liang; Porras-Alfaro, Andrea; Eichorst, Stephanie A.
2012-01-01
Taxonomic and phylogenetic fingerprinting based on sequence analysis of gene fragments from the large-subunit rRNA (LSU) gene or the internal transcribed spacer (ITS) region is becoming an integral part of fungal classification. The lack of an accurate and robust classification tool trained by a validated sequence database for taxonomic placement of fungal LSU genes is a severe limitation in taxonomic analysis of fungal isolates or large data sets obtained from environmental surveys. Using a hand-curated set of 8,506 fungal LSU gene fragments, we determined the performance characteristics of a naïve Bayesian classifier across multiple taxonomic levels and compared the classifier performance to that of a sequence similarity-based (BLASTN) approach. The naïve Bayesian classifier was computationally more rapid (>460-fold with our system) than the BLASTN approach, and it provided equal or superior classification accuracy. Classifier accuracies were compared using sequence fragments of 100 bp and 400 bp and two different PCR primer anchor points to mimic sequence read lengths commonly obtained using current high-throughput sequencing technologies. Accuracy was higher with 400-bp sequence reads than with 100-bp reads. It was also significantly affected by sequence location across the 1,400-bp test region. The highest accuracy was obtained across either the D1 or D2 variable region. The naïve Bayesian classifier provides an effective and rapid means to classify fungal LSU sequences from large environmental surveys. The training set and tool are publicly available through the Ribosomal Database Project (http://rdp.cme.msu.edu/classifier/classifier.jsp). PMID:22194300
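To make the idea of a naïve Bayesian sequence classifier concrete, here is a small Python sketch that scores a query by the short subsequences ("words") it contains. It is a simplified illustration, not the published tool: the word length, the smoothing constants, the taxon names, and the toy sequences are all assumptions made for this example.

```python
import math
from collections import defaultdict

def words(seq, k):
    """Return the set of distinct length-k words in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def train(labelled, k):
    """Count, per taxon, how many training sequences contain each word."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for taxon, seq in labelled:
        totals[taxon] += 1
        for w in words(seq, k):
            counts[taxon][w] += 1
    return counts, totals

def classify(seq, model, k):
    """Return the taxon with the highest summed log word probability."""
    counts, totals = model
    best_taxon, best_logp = None, -math.inf
    for taxon, n in totals.items():
        # Smoothed probability that a member of this taxon contains the word
        # (the +0.5 / +1 smoothing is an illustrative choice).
        logp = sum(
            math.log((counts[taxon].get(w, 0) + 0.5) / (n + 1))
            for w in words(seq, k)
        )
        if logp > best_logp:
            best_taxon, best_logp = taxon, logp
    return best_taxon

# Hypothetical training set: two taxa, two toy sequences each.
training = [
    ("taxonA", "ACGTACGTAC"),
    ("taxonA", "ACGTACGGAC"),
    ("taxonB", "TTGGCCAATT"),
    ("taxonB", "TTGGCCTATT"),
]
model = train(training, k=4)
prediction = classify("ACGTACGT", model, k=4)
```

Because each word contributes an independent log-probability term, classification needs only one pass over the query's words per taxon, which is consistent with the speed advantage over alignment-based search reported in the abstract.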
BIG: a large-scale data integration tool for renal physiology
Zhao, Yue; Yang, Chin-Rang; Raghuram, Viswanathan; Parulekar, Jaya
2016-01-01
Due to recent advances in high-throughput techniques, we and others have generated multiple proteomic and transcriptomic databases to describe and quantify gene expression, protein abundance, or cellular signaling on the scale of the whole genome/proteome in kidney cells. The existence of so much data from diverse sources raises the following question: “How can researchers find information efficiently for a given gene product over all of these data sets without searching each data set individually?” This is the type of problem that has motivated the “Big-Data” revolution in Data Science, which has driven progress in fields such as marketing. Here we present an online Big-Data tool called BIG (Biological Information Gatherer) that allows users to submit a single online query to obtain all relevant information from all indexed databases. BIG is accessible at http://big.nhlbi.nih.gov/. PMID:27279488
Dasgupta, Diptarka; Ghosh, Debashish; Bandhu, Sheetal; Adhikari, Dilip K
2017-07-01
Optimum utilization of fermentable sugars from lignocellulosic biomass to deliver multiple products under a biorefinery concept is reported in this work. Alcohol fermentation was carried out with multiple cell recycling of Kluyveromyces marxianus IIPE453. The yeast utilized the xylose-rich fraction from acid and steam treated biomass for cell generation and xylitol production with an average yield of 0.315 ± 0.01 g/g, while the entire glucose-rich saccharified fraction was fermented to ethanol with a high productivity of 0.9 ± 0.08 g/L/h. A detailed insight into its genome illustrated the strain's complete set of genes associated with sugar transport and metabolism for high-temperature fermentation. A set of flocculation proteins was identified that aided high cell recovery in successive fermentation cycles to achieve alcohols with high productivity. We have brought biomass-derived sugars, yeast cell biomass generation, and ethanol and xylitol fermentation together in one platform and validated the overall material balance. 2 kg of sugarcane bagasse yielded 193.4 g of yeast cells and, with repeated cell recycling, generated 125.56 g xylitol and 289.2 g ethanol (366 mL). Copyright © 2017 Elsevier GmbH. All rights reserved.
An Independent Filter for Gene Set Testing Based on Spectral Enrichment.
Frost, H Robert; Li, Zhigang; Asselbergs, Folkert W; Moore, Jason H
2015-01-01
Gene set testing has become an indispensable tool for the analysis of high-dimensional genomic data. An important motivation for testing gene sets, rather than individual genomic variables, is to improve statistical power by reducing the number of tested hypotheses. Given the dramatic growth in common gene set collections, however, testing is often performed with nearly as many gene sets as underlying genomic variables. To address the challenge to statistical power posed by large gene set collections, we have developed spectral gene set filtering (SGSF), a novel technique for independent filtering of gene set collections prior to gene set testing. The SGSF method uses as a filter statistic the p-value measuring the statistical significance of the association between each gene set and the sample principal components (PCs), taking into account the significance of the associated eigenvalues. Because this filter statistic is independent of standard gene set test statistics under the null hypothesis but dependent under the alternative, the proportion of enriched gene sets is increased without impacting the type I error rate. As shown using simulated and real gene expression data, the SGSF algorithm accurately filters gene sets unrelated to the experimental outcome resulting in significantly increased gene set testing power.
Tissue Gene Expression Analysis Using Arrayed Normalized cDNA Libraries
Eickhoff, Holger; Schuchhardt, Johannes; Ivanov, Igor; Meier-Ewert, Sebastian; O'Brien, John; Malik, Arif; Tandon, Neeraj; Wolski, Eryk-Witold; Rohlfs, Elke; Nyarsik, Lajos; Reinhardt, Richard; Nietfeld, Wilfried; Lehrach, Hans
2000-01-01
We have used oligonucleotide-fingerprinting data on 60,000 cDNA clones from two different mouse embryonic stages to establish a normalized cDNA clone set. The normalized set of 5,376 clones represents different clusters and therefore, in almost all cases, different genes. The inserts of the cDNA clones were amplified by PCR and spotted on glass slides. The resulting arrays were hybridized with mRNA probes prepared from six different adult mouse tissues. Expression profiles were analyzed by hierarchical clustering techniques. We have chosen radioactive detection because it combines robustness with sensitivity and allows the comparison of multiple normalized experiments. Sensitive detection combined with highly effective clustering algorithms allowed the identification of tissue-specific expression profiles and the detection of genes specifically expressed in the tissues investigated. The obtained results are publicly available (http://www.rzpd.de) and can be used by other researchers as a digital expression reference. [The sequence data described in this paper have been submitted to the EMBL data library under accession nos. AL360374–AL36537.] PMID:10958641
Qu, Mei; Zhang, Xin; Liu, Guirong; Huang, Ying; Jia, Lei; Liang, Weili; Li, Xitai; Wu, Xiaona; Li, Jie; Yan, Hanqiu; Kan, Biao; Wang, Quanyi
2014-07-14
This study was conducted to determine the prevalence of serotypes, virulence factors, and antimicrobial resistance patterns of Shigella spp. in Beijing, China, from 2004 to 2011. Real-time PCR assays were used to detect virulence genes, and the Kirby-Bauer disk diffusion method was used to evaluate antimicrobial resistance. Among the total of 1,652 Shigella isolates, S. sonnei (57.1%) was the predominant species, followed by S. flexneri (42.3%), S. dysenteriae (0.4%), and S. boydii (0.2%). Nineteen serotypes were discovered among S. flexneri strains. The virulence gene ipaH was the most frequent, followed by sen and set. The presence of set showed significant difference in two dominant serogroups, S. flexneri and S. sonnei. Over 90% of Shigella isolates showed resistance to at least three drugs with widened spectrum. High-level antimicrobial resistance to single and multiple antibiotics was more common among S. sonnei than S. flexneri. There was an obvious serotype change and a dramatic increase of antibiotic resistance in Shigella prevalence in Beijing.
Down-weighting overlapping genes improves gene set analysis
2012-01-01
Background The identification of gene sets that are significantly impacted in a given condition based on microarray data is a crucial step in current life science research. Most gene set analysis methods treat genes equally, regardless of how specific they are to a given gene set. Results In this work we propose a new gene set analysis method that computes a gene set score as the mean of absolute values of weighted moderated gene t-scores. The gene weights are designed to emphasize the genes appearing in few gene sets, versus genes that appear in many gene sets. We demonstrate the usefulness of the method when analyzing gene sets that correspond to the KEGG pathways, and hence we called our method Pathway Analysis with Down-weighting of Overlapping Genes (PADOG). Unlike most gene set analysis methods which are validated through the analysis of 2-3 data sets followed by a human interpretation of the results, the validation employed here uses 24 different data sets and a completely objective assessment scheme that makes minimal assumptions and eliminates the need for possibly biased human assessments of the analysis results. Conclusions PADOG significantly improves gene set ranking and boosts sensitivity of analysis using information already available in the gene expression profiles and the collection of gene sets to be analyzed. The advantages of PADOG over other existing approaches are shown to be stable to changes in the database of gene sets to be analyzed. PADOG was implemented as an R package available at: http://bioinformaticsprb.med.wayne.edu/PADOG/ or http://www.bioconductor.org. PMID:22713124
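The scoring rule described above, a mean of absolute weighted t-scores with larger weights for genes that occur in few gene sets, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the PADOG package itself: the t-scores are taken as precomputed inputs, the weight formula (1 plus a square-root term scaled to the range of gene frequencies) is one plausible choice, and any later standardization or permutation step is omitted. Gene and set names are hypothetical.

```python
import math
from collections import Counter

def gene_weights(gene_sets):
    """Weight genes from 1 (in the most sets) up to 2 (in the fewest sets)."""
    freq = Counter(g for genes in gene_sets.values() for g in set(genes))
    lo, hi = min(freq.values()), max(freq.values())
    if hi == lo:
        # Every gene is equally common, so no gene is down-weighted.
        return {g: 1.0 for g in freq}
    return {g: 1.0 + math.sqrt((hi - f) / (hi - lo)) for g, f in freq.items()}

def set_scores(gene_sets, t_scores):
    """Score each set as the mean of |t| * weight over its measured genes."""
    weights = gene_weights(gene_sets)
    scores = {}
    for name, genes in gene_sets.items():
        vals = [abs(t_scores[g]) * weights[g] for g in genes if g in t_scores]
        scores[name] = sum(vals) / len(vals) if vals else float("nan")
    return scores

# Toy input: g2 belongs to all three sets, so it receives the lowest weight.
gene_sets = {"A": ["g1", "g2"], "B": ["g2", "g3"], "C": ["g2", "g4"]}
t_scores = {"g1": 2.0, "g2": -3.0, "g3": 0.5, "g4": -1.0}
scores = set_scores(gene_sets, t_scores)
```

In this toy input the shared gene g2 has the largest t-score, yet set A ranks first mainly because its specific gene g1 is up-weighted, which is the behaviour the abstract motivates.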
DOE Office of Scientific and Technical Information (OSTI.GOV)
Liang, Ying; Gao, Yajun; Jones, Alan M.
2017-06-13
The three-member family of Arabidopsis extra-large G proteins (XLG1-3) defines the prototype of an atypical Gα subunit in the heterotrimeric G protein complex. Some recent evidence indicates that XLG subunits operate along with the Gβγ dimer in root morphology, stress responsiveness, and cytokinin-induced development; however, downstream targets of activated XLG proteins in the stress pathways are rarely known. In order to assemble a set of candidate XLG-targeted proteins, a yeast two-hybrid complementation-based screen was performed using XLG protein baits to query interactions between XLG and partner proteins found in glucose-treated seedlings, roots, and Arabidopsis cells in culture. Seventy-two interactors were identified and >60% of a test set displayed in vivo interaction with XLG proteins. Gene co-expression analysis shows that >70% of the interactors are positively correlated with the corresponding XLG partners. Gene Ontology enrichment for all the candidates indicates stress responses and posits a molecular mechanism involving a specific set of transcription factor partners to XLG. Genes encoding two of these transcription factors, SZF1 and 2, require XLG proteins for full NaCl-induced expression. Furthermore, the subcellular localization of the XLG proteins in the nucleus, endosome, and plasma membrane is dependent on the specific interacting partner.
Mahelka, Václav; Krak, Karol; Kopecký, David; Fehrer, Judith; Šafář, Jan; Bartoš, Jan; Hobza, Roman; Blavet, Nicolas; Blattner, Frank R
2017-02-14
The movement of nuclear DNA from one vascular plant species to another in the absence of fertilization is thought to be rare. Here, nonnative rRNA gene [ribosomal DNA (rDNA)] copies were identified in a set of 16 diploid barley (Hordeum) species; their origin was traceable via their internal transcribed spacer (ITS) sequence to five distinct Panicoideae genera, a lineage that split from the Pooideae about 60 Mya. Phylogenetic, cytogenetic, and genomic analyses implied that the nonnative sequences were acquired between 1 and 5 Mya after a series of multiple events, with the result that some current Hordeum sp. individuals harbor up to five different panicoid rDNA units in addition to the native Hordeum rDNA copies. There was no evidence that any of the nonnative rDNA units were transcribed; some showed indications of having been silenced via pseudogenization. A single copy of a Panicum sp. rDNA unit present in H. bogdanii had been interrupted by a native transposable element and was surrounded by about 70 kbp of mostly noncoding sequence of panicoid origin. The data suggest that horizontal gene transfer between vascular plants is not a rare event, that it is not necessarily restricted to one or a few genes only, and that it can be selectively neutral.
Dissecting DNA repair in adult high grade gliomas for patient stratification in the post-genomic era
Perry, Christina; Agarwal, Devika; Abdel-Fatah, Tarek M.A.; Lourdusamy, Anbarasu; Grundy, Richard; Auer, Dorothee T.; Walker, David; Lakhani, Ravi; Scott, Ian S.; Chan, Stephen; Ball, Graham; Madhusudan, Srinivasan
2014-01-01
Deregulation of multiple DNA repair pathways may contribute to aggressive biology and therapy resistance in gliomas. We evaluated transcript levels of 157 genes involved in DNA repair in an adult glioblastoma Test set (n=191) and validated in ‘The Cancer Genome Atlas’ (TCGA) cohort (n=508). A DNA repair prognostic index model was generated. Artificial neural network analysis (ANN) was conducted to investigate global gene interactions. Protein expression by immunohistochemistry was conducted in 61 tumours. A fourteen DNA repair gene expression panel was associated with poor survival in Test and TCGA cohorts. A Cox multivariate model revealed APE1, NBN, PMS2, MGMT and PTEN as independently associated with poor prognosis. A DNA repair prognostic index incorporating APE1, NBN, PMS2, MGMT and PTEN stratified patients into three prognostic sub-groups with worsening survival. APE1, NBN, PMS2, MGMT and PTEN also have predictive significance in patients who received chemotherapy and/or radiotherapy. ANN analysis of APE1, NBN, PMS2, MGMT and PTEN revealed interactions with genes involved in transcription, hypoxia and metabolic regulation. At the protein level, low APE1 and low PTEN remain associated with poor prognosis. In conclusion, multiple DNA repair pathways operate to influence biology and clinical outcomes in adult high grade gliomas. PMID:25026297
Novel gene sets improve set-level classification of prokaryotic gene expression data.
Holec, Matěj; Kuželka, Ondřej; Železný, Filip
2015-10-28
Set-level classification of gene expression data has received significant attention recently. In this setting, high-dimensional vectors of features corresponding to genes are converted into lower-dimensional vectors of features corresponding to biologically interpretable gene sets. The dimensionality reduction brings the promise of a decreased risk of overfitting, potentially resulting in improved accuracy of the learned classifiers. However, recent empirical research has not confirmed this expectation. Here we hypothesize that the reported unfavorable classification results in the set-level framework were due to the adoption of unsuitable gene sets defined typically on the basis of the Gene Ontology and the KEGG database of metabolic networks. We explore an alternative approach to defining gene sets, based on regulatory interactions, which we expect to collect genes with more correlated expression. We hypothesize that such more correlated gene sets will enable the learning of more accurate classifiers. We define two families of gene sets using information on regulatory interactions, and evaluate them on phenotype-classification tasks using public prokaryotic gene expression data sets. From each of the two gene-set families, we first select the best-performing subtype. The two selected subtypes are then evaluated on independent (testing) data sets against state-of-the-art gene sets and against the conventional gene-level approach. The novel gene sets are indeed more correlated than the conventional ones, and lead to significantly more accurate classifiers. Novel gene sets defined on the basis of regulatory interactions improve set-level classification of gene expression data. The experimental scripts and other material needed to reproduce the experiments are available at http://ida.felk.cvut.cz/novelgenesets.tar.gz.
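The conversion from gene-level to set-level features described above can be illustrated with a short Python sketch. The abstract does not say how member genes are aggregated, so averaging is an assumption here, and the regulon-style set names and gene identifiers are hypothetical.

```python
def to_set_level(sample, gene_sets):
    """Collapse one sample's gene-level expression into set-level features.

    Each set-level feature is the mean expression of the set's member genes
    that were measured in the sample; sets with no measured member are skipped.
    """
    features = {}
    for name, genes in gene_sets.items():
        values = [sample[g] for g in genes if g in sample]
        if values:
            features[name] = sum(values) / len(values)
    return features

# Hypothetical regulon-style gene sets: a regulator grouped with its targets.
gene_sets = {
    "regulon_R1": ["r1", "t1", "t2"],
    "regulon_R2": ["r2", "t3"],
}

# One sample with six measured genes; "x9" belongs to no set and is dropped.
sample = {"r1": 2.0, "t1": 4.0, "t2": 6.0, "r2": 1.0, "t3": 3.0, "x9": 7.0}
set_features = to_set_level(sample, gene_sets)
```

The six gene-level values become two set-level features, which is the dimensionality reduction the abstract refers to; a classifier would then be trained on such vectors instead of on the raw gene values.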
2012-01-01
Background Transcript profiling of differentiating secondary xylem has allowed us to draw a general picture of the genes involved in wood formation. However, our knowledge is still limited about the regulatory mechanisms that coordinate and modulate the different pathways providing substrates during xylogenesis. The development of compression wood in conifers constitutes an exceptional model for these studies. Although differential expression of a few genes in differentiating compression wood compared to normal or opposite wood has been reported, the broad range of features that distinguish this reaction wood suggest that the expression of a larger set of genes would be modified. Results By combining the construction of different cDNA libraries with microarray analyses we have identified a total of 496 genes in maritime pine (Pinus pinaster, Ait.) that change in expression during differentiation of compression wood (331 up-regulated and 165 down-regulated compared to opposite wood). Samples from different provenances collected in different years and geographic locations were integrated into the analyses to mitigate the effects of multiple sources of variability. This strategy allowed us to define a group of genes that are consistently associated with compression wood formation. Correlating with the deposition of a thicker secondary cell wall that characterizes compression wood development, the expression of a number of genes involved in synthesis of cellulose, hemicellulose, lignin and lignans was up-regulated. Further analysis of a set of these genes involved in S-adenosylmethionine metabolism, ammonium recycling, and lignin and lignans biosynthesis showed changes in expression levels in parallel to the levels of lignin accumulation in cells undergoing xylogenesis in vivo and in vitro. 
Conclusions The comparative transcriptomic analysis reported here has revealed a broad spectrum of coordinated transcriptional modulation of genes involved in biosynthesis of different cell wall polymers associated with within-tree variations in pine wood structure and composition. In particular, we demonstrate the coordinated modulation at transcriptional level of a gene set involved in S-adenosylmethionine synthesis and ammonium assimilation with increased demand for coniferyl alcohol for lignin and lignan synthesis, enabling a better understanding of the metabolic requirements in cells undergoing lignification. PMID:22747794
Statistical approach for selection of biologically informative genes.
Das, Samarendra; Rai, Anil; Mishra, D C; Rai, Shesh N
2018-05-20
Selection of informative genes from high-dimensional gene expression data has emerged as an important research area in genomics. Many gene selection techniques proposed so far are based on either a relevancy or a redundancy measure. Further, the performance of these techniques has been judged through post-selection classification accuracy computed by a classifier using the selected genes. This performance metric may be statistically sound but may not be biologically relevant. A statistical approach, Boot-MRMR, was proposed based on a composite measure of maximum relevance and minimum redundancy, which is both statistically sound and biologically relevant for informative gene selection. For comparative evaluation of the proposed approach, we developed two biological sufficient criteria, i.e. Gene Set Enrichment with QTL (GSEQ) and a biological similarity score based on Gene Ontology (GO). Further, a systematic and rigorous evaluation of the proposed technique against 12 existing gene selection techniques was carried out using five gene expression datasets. This evaluation was based on a broad spectrum of statistically sound (e.g. subject classification) and biologically relevant (based on QTL and GO) criteria under a multiple criteria decision-making framework. The performance analysis showed that the proposed technique selects informative genes that are more biologically relevant. The proposed technique was also found to be quite competitive with the existing techniques with respect to subject classification and computational time. Our results also showed that, under the multiple criteria decision-making setup, the proposed technique is the best of the available alternatives for informative gene selection. Based on the proposed approach, an R package, BootMRMR, has been developed and is available at https://cran.r-project.org/web/packages/BootMRMR. This study will provide a practical guide to choosing statistical techniques for selecting informative genes from high-dimensional expression data for breeding and systems biology studies. Published by Elsevier B.V.
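The idea of combining maximum relevance with minimum redundancy can be illustrated with a generic greedy selection. The sketch below is not the Boot-MRMR procedure or the BootMRMR package API; it assumes absolute Pearson correlation as both the relevance and the redundancy measure, and the function names `pearson` and `mrmr_select` are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def mrmr_select(expr, labels, k):
    """Greedily pick up to k genes by relevance minus mean redundancy.

    expr   -- dict mapping gene name to a list of expression values (one per sample)
    labels -- numeric class labels, in the same sample order
    Assumption: relevance = |corr(gene, labels)|,
                redundancy = mean |corr(gene, already selected gene)|.
    """
    relevance = {g: abs(pearson(v, labels)) for g, v in expr.items()}
    selected = []
    candidates = set(expr)

    def score(g):
        if not selected:
            return relevance[g]
        redundancy = sum(abs(pearson(expr[g], expr[s])) for s in selected) / len(selected)
        return relevance[g] - redundancy

    while candidates and len(selected) < k:
        # sorted() makes tie-breaking deterministic
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With the difference form used here, a gene that duplicates an already selected gene is penalized heavily, so a less relevant but independent gene can be preferred over it.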
Identifying a gene expression signature of cluster headache in blood
Eising, Else; Pelzer, Nadine; Vijfhuizen, Lisanne S.; Vries, Boukje de; Ferrari, Michel D.; ‘t Hoen, Peter A. C.; Terwindt, Gisela M.; van den Maagdenberg, Arn M. J. M.
2017-01-01
Cluster headache is a relatively rare headache disorder, typically characterized by multiple daily, short-lasting attacks of excruciating, unilateral (peri-)orbital or temporal pain associated with autonomic symptoms and restlessness. To better understand the pathophysiology of cluster headache, we used RNA sequencing to identify differentially expressed genes and pathways in whole blood of patients with episodic (n = 19) or chronic (n = 20) cluster headache in comparison with headache-free controls (n = 20). Gene expression data were analysed by gene and by module of co-expressed genes with particular attention to previously implicated disease pathways including hypocretin dysregulation. Only moderate gene expression differences were identified and no associations were found with previously reported pathogenic mechanisms. At the level of functional gene sets, associations were observed for genes involved in several brain-related mechanisms such as GABA receptor function and voltage-gated channels. In addition, genes and modules of co-expressed genes showed a role for intracellular signalling cascades, mitochondria and inflammation. Although larger study samples may be required to identify the full range of involved pathways, these results indicate a role for mitochondria, intracellular signalling and inflammation in cluster headache. PMID:28074859
Oleksiak, Marjorie F; Karchner, Sibel I; Jenny, Matthew J; Franks, Diana G; Welch, David B Mark; Hahn, Mark E
2011-05-24
Populations of Atlantic killifish (Fundulus heteroclitus) have evolved resistance to the embryotoxic effects of polychlorinated biphenyls (PCBs) and other halogenated and nonhalogenated aromatic hydrocarbons that act through an aryl hydrocarbon receptor (AHR)-dependent signaling pathway. The resistance is accompanied by reduced sensitivity to induction of cytochrome P450 1A (CYP1A), a widely used biomarker of aromatic hydrocarbon exposure and effect, but whether the reduced sensitivity is specific to CYP1A or reflects a genome-wide reduction in responsiveness to all AHR-mediated changes in gene expression is unknown. We compared gene expression profiles and the response to 3,3',4,4',5-pentachlorobiphenyl (PCB-126) exposure in embryos (5 and 10 dpf) and larvae (15 dpf) from F. heteroclitus populations inhabiting the New Bedford Harbor, Massachusetts (NBH) Superfund site (PCB-resistant) and a reference site, Scorton Creek, Massachusetts (SC; PCB-sensitive). Analysis using a 7,000-gene cDNA array revealed striking differences in responsiveness to PCB-126 between the populations; the differences occur at all three stages examined. There was a sizeable set of PCB-responsive genes in the sensitive SC population, a much smaller set of PCB-responsive genes in NBH fish, and few similarities in PCB-responsive genes between the two populations. Most of the array results were confirmed, and additional PCB-regulated genes identified, by RNA-Seq (deep pyrosequencing). The results suggest that NBH fish possess a gene regulatory defect that is not specific to one target gene such as CYP1A but rather lies in a regulatory pathway that controls the transcriptional response of multiple genes to PCB exposure. The results are consistent with genome-wide disruption of AHR-dependent signaling in NBH fish.
Huang, Lei; Zhao, Shuangping; Frasor, Jonna M.; Dai, Yang
2011-01-01
Approximately half of estrogen receptor (ER) positive breast tumors will fail to respond to endocrine therapy. Here we used an integrative bioinformatics approach to analyze three gene expression profiling data sets from breast tumors in an attempt to uncover underlying mechanisms contributing to the development of resistance and potential therapeutic strategies to counteract these mechanisms. Genes that are differentially expressed in tamoxifen resistant vs. sensitive breast tumors were identified from three different publicly available microarray datasets. These differentially expressed (DE) genes were analyzed using gene function and gene set enrichment and examined in intrinsic subtypes of breast tumors. The Connectivity Map analysis was utilized to link gene expression profiles of tamoxifen resistant tumors to small molecules, and validation studies were carried out in a tamoxifen resistant cell line. Despite little overlap in genes that are differentially expressed in tamoxifen resistant vs. sensitive tumors, a high degree of functional similarity was observed among the three datasets. Tamoxifen resistant tumors displayed enriched expression of genes related to cell cycle and proliferation, as well as elevated activity of E2F transcription factors, and were highly correlated with a Luminal intrinsic subtype. A number of small molecules, including phenothiazines, were found that induced a gene signature in breast cancer cell lines opposite to that found in tamoxifen resistant vs. sensitive tumors, and the ability of phenothiazines to down-regulate cyclin E2 and inhibit proliferation of tamoxifen resistant breast cancer cells was validated. Our findings demonstrate that an integrated bioinformatics approach to analyze gene expression profiles from multiple breast tumor datasets can identify important biological pathways and potentially novel therapeutic options for tamoxifen-resistant breast cancers. PMID:21789246
Sadovsky, Yoel; Goodarzi, Hani; Zhang, Heping; Biggio, Joseph R.; Varner, Michael; Parry, Samuel; Xiao, Feifei; Esplin, Sean M.; Andrews, William; Saade, George R.; Ilekis, John V.; Reddy, Uma M.; Baldwin, Donald A.
2017-01-01
Background: Preterm birth is a main determinant of neonatal mortality and morbidity and a major contributor to the overall mortality and burden of disease. However, research on preterm birth is hindered by the imprecise definition of the clinical phenotype and the complexity of the molecular phenotype, owing to the multiple pregnancy tissue types and molecular processes that may contribute to preterm birth. Here we comprehensively evaluate the mRNA transcriptome that characterizes preterm and term labor in tissues comprising the pregnancy using precisely phenotyped samples. The four complementary phenotypes together provide comprehensive insight into preterm and term parturition. Methods: Samples of maternal blood, chorion, amnion, placenta, decidua, fetal blood, and myometrium from the uterine fundus and lower segment (n = 183) were obtained during cesarean delivery from women with four complementary phenotypes: delivering preterm with (PL) and without labor (PNL), term with (TL) and without labor (TNL). Enrolled were 35 pregnant women with four precisely and prospectively defined phenotypes: PL (n = 8), PNL (n = 10), TL (n = 7) and TNL (n = 10). Gene expression data were analyzed using shrunken centroid analysis to identify a minimal set of genes that uniquely characterizes each of the four phenotypes. Expression profiles of 73 genes and non-coding RNA sequences uniquely identified each of the four phenotypes. The shrunken centroid analysis and 10 times 10-fold cross-validation were also used to minimize false positive findings and overfitting. We also identified the pathways and molecular processes associated with, and the cis-regulatory elements in the 5′ promoter or 3′-UTR regions of, the set of genes whose expression uniquely characterized the four phenotypes. Results: The largest differences in gene expression among the four groups occurred at the maternal-fetal interface in decidua, chorion and amnion. The gene expression profiles showed suppression of chemokine expression in TNL, withdrawal of this suppression in TL, activation of multiple pathways of inflammation in PL, and an immune rejection profile in PNL. The genes constituting expression signatures showed over-representation of three putative regulatory elements in their 5′ and 3′ UTR regions. Conclusions: The results suggest that pregnancy is maintained by downregulation of chemokines at the maternal-fetal interface. Withdrawal of this downregulation results in term birth, and its overriding by the activation of multiple pathways of the immune system results in preterm birth. Complications of pregnancy associated with impairment of placental function, which necessitated premature delivery of the fetus in the absence of labor, show gene expression patterns associated with immune rejection. PMID:28879060
Saunders, Edward J; Dadaev, Tokhir; Leongamornlert, Daniel A; Al Olama, Ali Amin; Benlloch, Sara; Giles, Graham G; Wiklund, Fredrik; Gronberg, Henrik; Haiman, Christopher A; Schleutker, Johanna; Nordestgaard, Borge G; Travis, Ruth C; Neal, David; Pasayan, Nora; Khaw, Kay-Tee; Stanford, Janet L; Blot, William J; Thibodeau, Stephen N; Maier, Christiane; Kibel, Adam S; Cybulski, Cezary; Cannon-Albright, Lisa; Brenner, Hermann; Park, Jong Y; Kaneva, Radka; Batra, Jyotsna; Teixeira, Manuel R; Pandha, Hardev; Govindasami, Koveela; Muir, Ken; Easton, Douglas F; Eeles, Rosalind A; Kote-Jarai, Zsofia
2016-04-12
Germline mutations within DNA-repair genes are implicated in susceptibility to multiple forms of cancer. For prostate cancer (PrCa), rare mutations in BRCA2 and BRCA1 give rise to moderately elevated risk, whereas two of ~100 common, low-penetrance PrCa susceptibility variants identified so far by genome-wide association studies implicate RAD51B and RAD23B. Genotype data from the iCOGS array were imputed to the 1000 Genomes phase 3 reference panel for 21 780 PrCa cases and 21 727 controls from the Prostate Cancer Association Group to Investigate Cancer Associated Alterations in the Genome (PRACTICAL) consortium. We subsequently performed single variant, gene and pathway-level analyses using 81 303 SNPs within 20 kb of a panel of 179 DNA-repair genes. Single SNP analyses identified only the previously reported association with RAD51B. Gene-level analyses using the SKAT-C test from the SNP-set (Sequence) Kernel Association Test (SKAT) identified a significant association with PrCa for MSH5. Pathway-level analyses suggested a possible role for the translesion synthesis pathway in PrCa risk and the homologous recombination/Fanconi anaemia pathway for PrCa aggressiveness, although these did not remain significant after adjustment for multiple testing. MSH5 is a novel candidate gene warranting additional follow-up as a prospective PrCa-risk locus. MSH5 has previously been reported as a pleiotropic susceptibility locus for lung, colorectal and serous ovarian cancers.
Epigenetic changes in leukocytes after 8 weeks of resistance exercise training.
Denham, Joshua; Marques, Francine Z; Bruns, Emma L; O'Brien, Brendan J; Charchar, Fadi J
2016-06-01
Regular engagement in resistance exercise training elicits many health benefits, including improvements in muscular strength, hypertrophy and insulin sensitivity, though the underpinning molecular mechanisms are poorly understood. The purpose of this study was to determine the influence that 8 weeks of resistance exercise training has on leukocyte genome-wide DNA methylation and gene expression in healthy young men. Eight young (21.1 ± 2.2 years) men completed one repetition maximum (1RM) testing before completing 8 weeks of supervised, thrice-weekly resistance exercise training comprising three sets of 8-12 repetitions with a load equivalent to 80% of 1RM. Blood samples were collected at rest before and after the 8-week training intervention. Genome-wide DNA methylation and gene expression were assessed on isolated leukocyte DNA and RNA using the 450K BeadChip and HumanHT-12 v4 Expression BeadChip (Illumina), respectively. Resistance exercise training significantly improved upper and lower body strength concurrently with diverse genome-wide DNA methylation and gene expression changes (p ≤ 0.01). DNA methylation changes occurred at multiple regions throughout the genome in context with genes and CpG islands, and in genes relating to axon guidance, diabetes and immune pathways. There were multiple genes with increased expression that were enriched for RNA processing and developmental proteins. The growth factor genes GHRH and FGF1 showed differential methylation and mRNA expression changes after resistance training. Our findings indicate that resistance exercise training improves muscular strength and is associated with reprogramming of the leukocyte DNA methylome and transcriptome.
Wang, Genhong; Chen, Yanfei; Zhang, Xiaoying; Bai, Bingchuan; Yan, Hao; Qin, Daoyuan; Xia, Qingyou
2018-06-01
The silkworm, Bombyx mori, is one of the world's most economically important insects. Surveying variations in gene expression among multiple tissue/organ samples will provide clues for gene function assignments and will be helpful for identifying genes related to economic traits or specific cellular processes. To ensure their accuracy, commonly used gene expression quantification methods require a set of stable reference genes for data normalization. In this study, 24 candidate reference genes were assessed in 10 tissue/organ samples of day 3 fifth-instar B. mori larvae using geNorm and NormFinder. The results revealed that the combined expression of BGIBMGA003186 and BGIBMGA008209 was the optimum choice for normalizing the expression data of the B. mori tissue/organ samples. The most stable gene, BGIBMGA003186, is recommended if just one reference gene is used. Moreover, the commonly used reference gene encoding cytoplasmic actin was the least appropriate reference gene for the samples investigated. The reliability of the selected reference genes was further confirmed by evaluating the expression profiles of two cathepsin genes. Our results may be useful for future studies involving the quantification of relative gene expression levels of different tissue/organ samples in B. mori. © 2018 Wiley Periodicals, Inc.
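How a stability ranking of candidate reference genes can be computed is sketched below, following the pairwise-variation idea behind geNorm as it is commonly described: for each gene, average the standard deviation of its log2 expression ratio against every other candidate across samples, with a lower value meaning a more stable gene. This is an illustrative reimplementation under that assumption, not the geNorm software or the settings used in the study; `genorm_m` is a hypothetical name.

```python
import math
from statistics import stdev

def genorm_m(expr):
    """Return a geNorm-style stability value M per gene (lower = more stable).

    expr -- dict mapping gene name to a list of positive relative expression
            quantities, one per sample, in the same sample order for all genes.
    Requires at least two genes and at least two samples.
    """
    genes = list(expr)
    m_values = {}
    for j in genes:
        pairwise_sd = []
        for k in genes:
            if k == j:
                continue
            # Log-ratio of gene j to gene k in each sample; a constant ratio
            # across samples gives a standard deviation of zero.
            log_ratios = [math.log2(a / b) for a, b in zip(expr[j], expr[k])]
            pairwise_sd.append(stdev(log_ratios))
        m_values[j] = sum(pairwise_sd) / len(pairwise_sd)
    return m_values
```

Sorting the returned dictionary by value gives a ranking from most to least stable candidate; the full geNorm procedure additionally removes the least stable gene and repeats, which is omitted here.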
Molecular defects identified by whole exome sequencing in a child with Fanconi anemia.
Zheng, Zhaojing; Geng, Juan; Yao, Ru-En; Li, Caihua; Ying, Daming; Shen, Yongnian; Ying, Lei; Yu, Yongguo; Fu, Qihua
2013-11-10
Fanconi anemia is a rare genetic disease characterized by bone marrow failure, multiple congenital malformations, and an increased susceptibility to malignancy. At least 15 genes have been identified that are involved in the pathogenesis of Fanconi anemia. However, it is still a challenge to assign the complementation group and to characterize the molecular defects in patients with Fanconi anemia. In the current study, whole exome sequencing was used to identify the affected gene(s) in a boy with Fanconi anemia. A recurring, non-synonymous mutation (c.3971C>T, p.P1324L) and a novel frameshift mutation (c.989_995del, p.H330LfsX2) were found in the FANCA gene. Our results indicate that whole exome sequencing may be useful in clinical settings for rapid identification of disease-causing mutations in rare genetic disorders such as Fanconi anemia. © 2013 Elsevier B.V. All rights reserved.
Shadows of complexity: what biological networks reveal about epistasis and pleiotropy
Tyler, Anna L.; Asselbergs, Folkert W.; Williams, Scott M.; Moore, Jason H.
2011-01-01
Pleiotropy, in which one mutation causes multiple phenotypes, has traditionally been seen as a deviation from the conventional observation in which one gene affects one phenotype. Epistasis, or gene-gene interaction, has also been treated as an exception to the Mendelian one gene-one phenotype paradigm. This simplified perspective belies the pervasive complexity of biology and hinders progress toward a deeper understanding of biological systems. We assert that epistasis and pleiotropy are not isolated occurrences, but ubiquitous and inherent properties of biomolecular networks. These phenomena should not be treated as exceptions, but rather as fundamental components of genetic analyses. A systems level understanding of epistasis and pleiotropy is, therefore, critical to furthering our understanding of human genetics and its contribution to common human disease. Finally, graph theory offers an intuitive and powerful set of tools with which to study the network bases of these important genetic phenomena. PMID:19204994
Mediator phosphorylation prevents stress response transcription during non-stress conditions.
Miller, Christian; Matic, Ivan; Maier, Kerstin C; Schwalb, Björn; Roether, Susanne; Strässer, Katja; Tresch, Achim; Mann, Matthias; Cramer, Patrick
2012-12-28
The multiprotein complex Mediator is a coactivator of RNA polymerase (Pol) II transcription that is required for the regulated expression of protein-coding genes. Mediator serves as an end point of signaling pathways and regulates Pol II transcription, but the mechanisms it uses are not well understood. Here, we used mass spectrometry and dynamic transcriptome analysis to investigate a functional role of Mediator phosphorylation in gene expression. Affinity purification and mass spectrometry revealed that Mediator from the yeast Saccharomyces cerevisiae is phosphorylated at multiple sites of 17 of its 25 subunits. Mediator phosphorylation levels change upon an external stimulus set by exposure of cells to high salt concentrations. Phosphorylated sites in the Mediator tail subunit Med15 are required for suppression of stress-induced changes in gene expression under non-stress conditions. Thus dynamic and differential Mediator phosphorylation contributes to gene regulation in eukaryotic cells.
Inferring gene and protein interactions using PubMed citations and consensus Bayesian networks.
Deeter, Anthony; Dalman, Mark; Haddad, Joseph; Duan, Zhong-Hui
2017-01-01
The PubMed database offers an extensive set of publication data that can be useful, yet is inherently complex to use without automated computational techniques. Data repositories such as the Genomic Data Commons (GDC) and the Gene Expression Omnibus (GEO) offer experimental data storage and retrieval as well as curated gene expression profiles. Genetic interaction databases, including Reactome and Ingenuity Pathway Analysis, offer pathway and experiment data analysis using data curated from these publications and data repositories. We have created a method to generate and analyze consensus networks, inferring potential gene interactions, using large numbers of Bayesian networks generated by data mining publications in the PubMed database. Through the concept of network resolution, these consensus networks can be tailored to represent possible genetic interactions. We designed a set of experiments to confirm that our method is stable across variation in both sample and topological input sizes. Using gene product interactions from the KEGG pathway database and data mining PubMed publication abstracts, we verify that regardless of the network resolution or the inferred consensus network, our method is capable of inferring meaningful gene interactions through consensus Bayesian network generation with multiple, randomized topological orderings. Our method can not only confirm the existence of currently accepted interactions, but also has the potential to hypothesize new ones. We show that our method confirms the existence of known gene interactions such as JAK-STAT-PI3K-AKT-mTOR, infers novel gene interactions such as RAS-Bcl-2 and RAS-AKT, and finds significant pathway-pathway interactions between the JAK-STAT signaling and Cardiac Muscle Contraction KEGG pathways.
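The consensus step can be pictured as edge voting across many learned networks. The abstract does not define "network resolution", so the sketch below makes an explicit assumption: resolution is treated as the minimum fraction of networks in which a directed edge must appear to be kept. The function name `consensus_edges` and this interpretation are illustrative only and are not taken from the authors' method.

```python
from collections import Counter

def consensus_edges(networks, resolution):
    """Keep edges that occur in at least `resolution` (a fraction in [0, 1]) of the networks.

    networks   -- iterable of collections of directed edges, each edge a (parent, child) tuple
    resolution -- assumed meaning: minimum edge frequency for inclusion
    Returns a dict mapping each retained edge to its frequency across networks.
    """
    networks = list(networks)
    n = len(networks)
    # set(net) ensures an edge is counted at most once per network
    counts = Counter(edge for net in networks for edge in set(net))
    return {edge: c / n for edge, c in counts.items() if c / n >= resolution}
```

Raising the cutoff yields a sparser, higher-confidence network; lowering it admits more speculative edges that would need independent confirmation.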
Cinti, Alessandro; De Giorgi, Marco; Chisci, Elisa; Arena, Claudia; Galimberti, Gloria; Farina, Laura; Bugarin, Cristina; Rivolta, Ilaria; Gaipa, Giuseppe; Smolenski, Ryszard Tom; Cerrito, Maria Grazia; Lavitrano, Marialuisa; Giovannoni, Roberto
2015-01-01
Several biomedical applications, such as xenotransplantation, require multiple genes to be expressed simultaneously in eukaryotic cells. Advances in genetic engineering technologies have led to the development of efficient polycistronic vectors based on the use of the 2A self-processing oligopeptide. The aim of this work was to evaluate the protective effects of the simultaneous expression of a novel combination of anti-inflammatory human genes, ENTPD1, E5NT and HO-1, in eukaryotic cells. We produced an F2A system-based multicistronic construct to express three human proteins in NIH3T3 cells exposed to an inflammatory stimulus represented by tumor necrosis factor alpha (TNF-α), a pro-inflammatory cytokine that plays an important role during inflammation, cell proliferation, differentiation and apoptosis and in the inflammatory response during ischemia/reperfusion injury in several organ transplantation settings. The protective effects against TNF-α-induced cytotoxicity and cell death, mediated by the HO-1, ENTPD1 and E5NT genes, were more pronounced in cells expressing the combination of genes than in cells expressing each single gene, and the effect was further improved by administering enzymatic substrates of the human genes to the cells. Moreover, gene expression analyses demonstrated that the expression of the three genes has a role in modulating key regulators of the TNF-α signalling pathway, namely Nemo and Tnfaip3, which promoted a pro-survival phenotype in TNF-α injured cells. These results could provide new insights in the research of protective mechanisms in transplantation settings. PMID:26513260
Aging as an Epigenetic Phenomenon
Ashapkin, Vasily V.; Kutueva, Lyudmila I.; Vanyushin, Boris F.
2017-01-01
Introduction: Hypermethylation of genes associated with promoter CpG islands, and hypomethylation of CpG poor genes, repeat sequences, transposable elements and intergenic genome sections occur during aging in mammals. Methylation levels of certain CpG sites display strict correlation to age and could be used as an “epigenetic clock” to predict biological age. Multi-substrate deacetylases SIRT1 and SIRT6 affect aging via locus-specific modulations of chromatin structure and activity of multiple regulatory proteins involved in aging. Random errors in DNA methylation and other epigenetic marks during aging increase the transcriptional noise, and thus lead to enhanced phenotypic variation between cells of the same tissue. Such variation could cause the progressive organ dysfunction observed in aged individuals. Multiple experimental data show that induction of NF-κB regulated gene sets occurs in various tissues of aged mammals. Upregulation of multiple miRNAs occurs at mid age, leading to downregulation of enzymes and regulatory proteins involved in basic cellular functions, such as DNA repair, oxidative phosphorylation, intermediate metabolism, and others. Conclusion: Strong evidence shows that all epigenetic systems contribute to lifespan control in various organisms. Similar to other cell systems, the epigenome is prone to gradual degradation due to genome damage, stressful agents, and other aging factors. But unlike mutations and other kinds of genome damage, age-related epigenetic changes could be fully or partially reversed to a “young” state. PMID:29081695
Speed control: cogs and gears that drive the circadian clock.
Zheng, Xiangzhong; Sehgal, Amita
2012-09-01
In most organisms, an intrinsic circadian (~24-h) timekeeping system drives rhythms of physiology and behavior. Within cells that contain a circadian clock, specific transcriptional activators and repressors reciprocally regulate each other to generate a basic molecular oscillator. A mismatch of the period generated by this oscillator with the external environment creates circadian disruption, which can have adverse effects on neural function. Although several clock genes have been extensively characterized, a fundamental question remains: how do these genes work together to generate a ~24-h period? Period-altering mutations in clock genes can affect any of multiple regulated steps in the molecular oscillator. In this review, we examine the regulatory mechanisms that contribute to setting the pace of the circadian oscillator. Copyright © 2012 Elsevier Ltd. All rights reserved.
Normanno, Davide; Vanzi, Francesco; Pavone, Francesco Saverio
2008-01-01
Gene expression regulation is a fundamental biological process which deploys specific sets of genomic information depending on physiological or environmental conditions. Several transcription factors (including lac repressor, LacI) are present in the cell at very low copy number and increase their local concentration by binding to multiple sites on DNA and looping the intervening sequence. In this work, we employ single-molecule manipulation to experimentally address the role of DNA supercoiling in the dynamics and stability of LacI-mediated DNA looping. We performed measurements over a range of degrees of supercoiling between −0.026 and +0.026, in the absence of axial stretching forces. A supercoiling-dependent modulation of the lifetimes of both the looped and unlooped states was observed. Our experiments also provide evidence for multiple structural conformations of the LacI–DNA complex, depending on torsional constraints. The supercoiling-dependent modulation demonstrated here adds an important element to the model of the lac operon. In fact, the complex network of proteins acting on the DNA in a living cell constantly modifies its topological and mechanical properties: our observations demonstrate the possibility of establishing a signaling pathway from factors affecting DNA supercoiling to transcription factors responsible for the regulation of specific sets of genes. PMID:18310101
Geographic setting influences Great Lakes beach microbiological water quality
Haack, Sheridan K.; Fogarty, Lisa R.; Stelzer, Erin A.; Fuller, Lori M.; Brennan, Angela K.; Isaacs, Natasha M.; Johnson, Heather E.
2013-01-01
Understanding of factors that influence Escherichia coli (EC) and enterococci (ENT) concentrations, pathogen occurrence, and microbial sources at Great Lakes beaches comes largely from individual beach studies. Using 12 representative beaches, we tested enrichment cultures from 273 beach water and 22 tributary samples for EC, ENT, and genes indicating the bacterial pathogens Shiga-toxin producing E. coli (STEC), Shigella spp., Salmonella spp., Campylobacter jejuni/coli, and methicillin-resistant Staphylococcus aureus, and 108–145 samples for Bacteroides human, ruminant, and gull source-marker genes. EC/ENT temporal patterns, general Bacteroides concentration, and pathogen types and occurrence were regionally consistent (up to 40 km), but beach catchment variables (drains/creeks, impervious surface, urban land cover) influenced exceedances of EC/ENT standards and detections of Salmonella and STEC. Pathogen detections were more numerous when the EC/ENT Beach Action Value (but not when the Geometric Mean and Statistical Threshold Value) was exceeded. EC, ENT, and pathogens were not necessarily influenced by the same variables. Multiple Bacteroides sources, varying by date, occurred at every beach. Study of multiple beaches in different geographic settings provided new insights on the contrasting influences of regional and local variables, and a broader-scale perspective on the significance of EC/ENT exceedances, bacterial sources, and pathogen occurrence.
An in vivo and in silico approach to study cis-antisense: a short cut to higher order response
NASA Astrophysics Data System (ADS)
Courtney, Colleen; Varanasi, Usha; Chatterjee, Anushree
2014-03-01
Antisense interactions are present in all domains of life. Typically, sense-antisense RNA pairs originate from overlapping genes with convergent, face-to-face promoters, and are speculated to be involved in gene regulation. Recent studies indicate a role for transcriptional interference (TI) in regulating expression of genes in convergent orientation. Modeling antisense and TI gene regulation mechanisms allows us to understand how organisms control gene expression. We present a modeling and experimental framework to understand convergent transcription that combines the effects of transcriptional interference and cis-antisense regulation. Our model shows that combining transcriptional interference and antisense RNA interaction adds multiple levels of regulation, which affords a highly tunable biological output ranging from a first-order response to a complex higher-order response. To study this system, we created a library of experimental constructs with engineered TI and antisense interaction by using face-to-face inducible promoters separated by carefully tailored overlapping DNA sequences to control expression of a set of fluorescent reporter proteins. Studying this gene expression mechanism allows for an understanding of higher-order behavior of gene expression networks.
Tang, Hongwei; Wei, Peng; Duell, Eric J; Risch, Harvey A; Olson, Sara H; Bueno-de-Mesquita, H Bas; Gallinger, Steven; Holly, Elizabeth A; Petersen, Gloria; Bracci, Paige M; McWilliams, Robert R; Jenab, Mazda; Riboli, Elio; Tjønneland, Anne; Boutron-Ruault, Marie Christine; Kaaks, Rudolph; Trichopoulos, Dimitrios; Panico, Salvatore; Sund, Malin; Peeters, Petra H M; Khaw, Kay-Tee; Amos, Christopher I; Li, Donghui
2014-05-01
Cigarette smoking is the best established modifiable risk factor for pancreatic cancer. Genetic factors that underlie smoking-related pancreatic cancer have not previously been examined at the genome-wide level. Taking advantage of the existing genome-wide association study (GWAS) genotype and risk factor data from the Pancreatic Cancer Case Control Consortium, we conducted a discovery study in 2028 cases and 2109 controls to examine gene-smoking interactions at the pathway/gene/single nucleotide polymorphism (SNP) level. Using the likelihood ratio test nested in logistic regression models and ingenuity pathway analysis (IPA), we examined 172 KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways, 3 manually curated gene sets, 3 nicotine dependency gene ontology pathways, 17 912 genes and 468 114 SNPs. None of the individual pathways, genes or SNPs showed significant interaction with smoking after adjusting for multiple comparisons. Six KEGG pathways showed nominal interactions (P < 0.05) with smoking, and the top two are the pancreatic secretion and salivary secretion pathways (major contributing genes: RAB8A, PLCB and CTRB1). Nine genes, i.e. ZBED2, EXO1, PSG2, SLC36A1, CLSTN1, MTHFSD, FAT2, IL10RB and ATXN2, had P interaction < 0.0005. Five intergenic region SNPs and two SNPs of the EVC and KCNIP4 genes had P interaction < 0.00003. In IPA analysis of genes with nominal interactions with smoking, axonal guidance signaling (P = 2.12 × 10⁻⁷) and α-adrenergic signaling (P = 2.52 × 10⁻⁵) genes were significantly overrepresented canonical pathways. Genes contributing to the axon guidance signaling pathway included the SLIT/ROBO signaling genes that were frequently altered in pancreatic cancer. These observations need to be confirmed in additional data sets. Once confirmed, they will open a new avenue to unveiling the etiology of smoking-associated pancreatic cancer.
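A nested likelihood ratio test of this kind compares the log-likelihood of a model with an interaction term against the same model without it. The sketch below assumes a single interaction parameter (1 degree of freedom) and takes the two fitted log-likelihoods as given inputs; it does not fit the logistic regressions, and the function names are hypothetical. It also shows the Bonferroni cutoff that a 0.05 family-wise level would imply for the 468 114 SNPs named in the abstract; the abstract does not say which multiple-comparison adjustment was actually used.

```python
import math

def lrt_pvalue_df1(loglik_reduced, loglik_full):
    """P value of a likelihood ratio test with 1 degree of freedom.

    The statistic 2 * (loglik_full - loglik_reduced) is referred to a chi-square
    distribution with 1 df, whose upper tail equals erfc(sqrt(stat / 2)).
    """
    stat = max(2.0 * (loglik_full - loglik_reduced), 0.0)
    return math.erfc(math.sqrt(stat / 2.0))

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance cutoff under a Bonferroni correction."""
    return alpha / n_tests

# Illustrative cutoff for the SNP-level scan (assumes alpha = 0.05 and Bonferroni).
snp_cutoff = bonferroni_threshold(0.05, 468114)
```

Under these assumptions the per-SNP cutoff is on the order of 1e-7, which is consistent with the abstract's statement that SNP-level P values below 0.00003 did not remain significant after adjustment.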
The Association of CD81 Polymorphisms with Alloimmunization in Sickle Cell Disease
Tatari-Calderone, Zohreh; Tamouza, Ryad; Le Bouder, Gama P.; Dewan, Ramita; Luban, Naomi L. C.; Lasserre, Jacqueline; Maury, Jacqueline; Lionnet, François; Krishnamoorthy, Rajagopal; Girot, Robert
2013-01-01
The goal of the present work was to identify candidate genetic markers predictive of alloimmunization in sickle cell disease (SCD). Red blood cell (RBC) transfusion is indicated for acute treatment, prevention, and abrogation of some complications of SCD. A well-known consequence of multiple RBC transfusions is alloimmunization. Given that a subset of SCD patients develop multiple RBC allo-/autoantibodies, while others do not in a similar multiple transfusional setting, we investigated a possible genetic basis for alloimmunization. Biomarkers that predict susceptibility to alloimmunization could identify patients at risk before the onset of a transfusion program and thus may have important implications for clinical management. In addition, such markers could shed light on the mechanism(s) underlying alloimmunization. We genotyped 27 single nucleotide polymorphisms (SNPs) in the CD81, CHRNA10, and ARHG genes in two groups of SCD patients. One group (35) of patients developed alloantibodies, and another (40) had no alloantibodies despite having received multiple transfusions. Two SNPs in the CD81 gene, which encodes a molecule involved in the signal modulation of B lymphocytes, show a strong association with alloimmunization. If confirmed in prospective studies with larger cohorts, the two SNPs identified in this retrospective study could serve as predictive biomarkers for alloimmunization. PMID:23762099
Anderson, Olin D; Coleman-Derr, Devin; Gu, Yong Q; Heath, Sekou
2010-06-16
Among the dietary essential amino acids, the most severely limiting in the cereals is lysine. Since cereals make up half of the human diet, lysine limitation has quality/nutritional consequences. The breakdown of lysine is controlled mainly by the catabolic bifunctional enzyme lysine ketoglutarate reductase - saccharopine dehydrogenase (LKR/SDH). The LKR/SDH gene has been reported to produce transcripts for the bifunctional enzyme and separate monofunctional transcripts. In addition to lysine metabolism, this gene has been implicated in a number of metabolic and developmental pathways, which, along with its production of multiple transcript types and complex exon/intron structure, suggests that it is an important node in plant metabolism. Understanding more about the LKR/SDH gene is thus of interest both from an applied standpoint and for basic plant metabolism. The current report describes a wheat genomic fragment containing an LKR/SDH gene and adjacent genes. The wheat LKR/SDH genomic segment was found to originate from the A-genome of wheat, and EST analysis indicates all three LKR/SDH genes in hexaploid wheat are transcriptionally active. A comparison of a set of plant LKR/SDH genes suggests regions of greater sequence conservation likely related to critical enzymatic functions and metabolic controls. Although most plants contain only a single LKR/SDH gene per genome, poplar contains at least two functional bifunctional genes in addition to a monofunctional LKR gene. Analysis of ESTs finds evidence for monofunctional LKR transcripts in switchgrass, and monofunctional SDH transcripts in wheat, Brachypodium, and poplar. The analysis of a wheat LKR/SDH gene and comparative structural and functional analyses among available plant genes provide new information on this important gene.
Both the structure of the LKR/SDH gene and the immediately adjacent genes show lineage-specific differences between monocots and dicots, and findings suggest variation in activity of LKR/SDH genes among plants. Although most plant genomes seem to contain a single conserved LKR/SDH gene per genome, poplar possesses multiple contiguous genes. A preponderance of SDH transcripts suggests the LKR region may be more rate-limiting. Only switchgrass has EST evidence for LKR monofunctional transcripts. Evidence for monofunctional SDH transcripts shows a novel intron in wheat, Brachypodium, and poplar.
Hettne, Kristina M; Boorsma, André; van Dartel, Dorien A M; Goeman, Jelle J; de Jong, Esther; Piersma, Aldert H; Stierum, Rob H; Kleinjans, Jos C; Kors, Jan A
2013-01-29
Availability of chemical response-specific lists of genes (gene sets) for pharmacological and/or toxic effect prediction for compounds is limited. We hypothesize that more gene sets can be created by next-generation text mining (next-gen TM), and that these can be used with gene set analysis (GSA) methods for chemical treatment identification, for pharmacological mechanism elucidation, and for comparing compound toxicity profiles. We created 30,211 chemical response-specific gene sets for human and mouse by next-gen TM, and derived 1,189 (human) and 588 (mouse) gene sets from the Comparative Toxicogenomics Database (CTD). We tested for significant differential expression (SDE) (false discovery rate -corrected p-values < 0.05) of the next-gen TM-derived gene sets and the CTD-derived gene sets in gene expression (GE) data sets of five chemicals (from experimental models). We tested for SDE of gene sets for six fibrates in a peroxisome proliferator-activated receptor alpha (PPARA) knock-out GE dataset and compared to results from the Connectivity Map. We tested for SDE of 319 next-gen TM-derived gene sets for environmental toxicants in three GE data sets of triazoles, and tested for SDE of 442 gene sets associated with embryonic structures. We compared the gene sets to triazole effects seen in the Whole Embryo Culture (WEC), and used principal component analysis (PCA) to discriminate triazoles from other chemicals. Next-gen TM-derived gene sets matching the chemical treatment were significantly altered in three GE data sets, and the corresponding CTD-derived gene sets were significantly altered in five GE data sets. Six next-gen TM-derived and four CTD-derived fibrate gene sets were significantly altered in the PPARA knock-out GE dataset. None of the fibrate signatures in cMap scored significant against the PPARA GE signature. 33 environmental toxicant gene sets were significantly altered in the triazole GE data sets. 
Twenty-one of these toxicants had a toxicity pattern similar to that of the triazoles. We confirmed embryotoxic effects, and discriminated triazoles from other chemicals. Gene set analysis with next-gen TM-derived chemical response-specific gene sets is a scalable method for identifying similarities in gene responses to other chemicals, from which one may infer potential mode of action and/or toxic effect.
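The abstract above reports false discovery rate-corrected p-values without naming the correction procedure. As an illustration only, the following minimal sketch assumes the widely used Benjamini-Hochberg step-up adjustment; the function name and the example p-values are hypothetical and are not taken from the study.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Assumption: the study's "FDR-corrected p-values" may have used a
    different procedure; this is one common choice, shown for illustration.
    """
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical gene-set p-values (not from the paper).
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
significant = [i for i, q in enumerate(adj) if q < 0.05]
print(adj, significant)
```

A gene set would then be called significantly differentially expressed when its adjusted value falls below the 0.05 threshold quoted in the abstract.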
2013-01-01
Background Availability of chemical response-specific lists of genes (gene sets) for pharmacological and/or toxic effect prediction for compounds is limited. We hypothesize that more gene sets can be created by next-generation text mining (next-gen TM), and that these can be used with gene set analysis (GSA) methods for chemical treatment identification, for pharmacological mechanism elucidation, and for comparing compound toxicity profiles. Methods We created 30,211 chemical response-specific gene sets for human and mouse by next-gen TM, and derived 1,189 (human) and 588 (mouse) gene sets from the Comparative Toxicogenomics Database (CTD). We tested for significant differential expression (SDE) (false discovery rate -corrected p-values < 0.05) of the next-gen TM-derived gene sets and the CTD-derived gene sets in gene expression (GE) data sets of five chemicals (from experimental models). We tested for SDE of gene sets for six fibrates in a peroxisome proliferator-activated receptor alpha (PPARA) knock-out GE dataset and compared to results from the Connectivity Map. We tested for SDE of 319 next-gen TM-derived gene sets for environmental toxicants in three GE data sets of triazoles, and tested for SDE of 442 gene sets associated with embryonic structures. We compared the gene sets to triazole effects seen in the Whole Embryo Culture (WEC), and used principal component analysis (PCA) to discriminate triazoles from other chemicals. Results Next-gen TM-derived gene sets matching the chemical treatment were significantly altered in three GE data sets, and the corresponding CTD-derived gene sets were significantly altered in five GE data sets. Six next-gen TM-derived and four CTD-derived fibrate gene sets were significantly altered in the PPARA knock-out GE dataset. None of the fibrate signatures in cMap scored significant against the PPARA GE signature. 33 environmental toxicant gene sets were significantly altered in the triazole GE data sets. 
Twenty-one of these toxicants had a toxicity pattern similar to that of the triazoles. We confirmed embryotoxic effects, and discriminated triazoles from other chemicals. Conclusions Gene set analysis with next-gen TM-derived chemical response-specific gene sets is a scalable method for identifying similarities in gene responses to other chemicals, from which one may infer potential mode of action and/or toxic effect. PMID:23356878
Reitz, Christiane; Jun, Gyungah; Naj, Adam; Rajbhandary, Ruchita; Vardarajan, Badri Narayan; Wang, Li-San; Valladares, Otto; Lin, Chiao-Feng; Larson, Eric B; Graff-Radford, Neill R; Evans, Denis; De Jager, Philip L; Crane, Paul K; Buxbaum, Joseph D; Murrell, Jill R; Raj, Towfique; Ertekin-Taner, Nilufer; Logue, Mark; Baldwin, Clinton T; Green, Robert C; Barnes, Lisa L; Cantwell, Laura B; Fallin, M Daniele; Go, Rodney C P; Griffith, Patrick; Obisesan, Thomas O; Manly, Jennifer J; Lunetta, Kathryn L; Kamboh, M Ilyas; Lopez, Oscar L; Bennett, David A; Hendrie, Hugh; Hall, Kathleen S; Goate, Alison M; Byrd, Goldie S; Kukull, Walter A; Foroud, Tatiana M; Haines, Jonathan L; Farrer, Lindsay A; Pericak-Vance, Margaret A; Schellenberg, Gerard D; Mayeux, Richard
2013-04-10
Genetic variants associated with susceptibility to late-onset Alzheimer disease are known for individuals of European ancestry, but whether the same or different variants account for the genetic risk of Alzheimer disease in African American individuals is unknown. Identification of disease-associated variants helps identify targets for genetic testing, prevention, and treatment. To identify genetic loci associated with late-onset Alzheimer disease in African Americans. The Alzheimer Disease Genetics Consortium (ADGC) assembled multiple data sets representing a total of 5896 African Americans (1968 case participants, 3928 control participants) 60 years or older that were collected between 1989 and 2011 at multiple sites. The association of Alzheimer disease with genotyped and imputed single-nucleotide polymorphisms (SNPs) was assessed in case-control and in family-based data sets. Results from individual data sets were combined to perform an inverse variance-weighted meta-analysis, first with genome-wide analyses and subsequently with gene-based tests for previously reported loci. Presence of Alzheimer disease according to standardized criteria. Genome-wide significance in fully adjusted models (sex, age, APOE genotype, population stratification) was observed for a SNP in ABCA7 (rs115550680, allele = G; frequency, 0.09 cases and 0.06 controls; odds ratio [OR], 1.79 [95% CI, 1.47-2.12]; P = 2.2 × 10^-9), which is in linkage disequilibrium with SNPs previously associated with Alzheimer disease in Europeans (0.8 < D' < 0.9). The effect size for the SNP in ABCA7 was comparable with that of the APOE ϵ4-determining SNP rs429358 (allele = C; frequency, 0.30 cases and 0.18 controls; OR, 2.31 [95% CI, 2.19-2.42]; P = 5.5 × 10^-47). 
Several loci previously associated with Alzheimer disease but not reaching significance in genome-wide analyses were replicated in gene-based analyses accounting for linkage disequilibrium between markers and correcting for number of tests performed per gene (CR1, BIN1, EPHA1, CD33; 0.0005 < empirical P < .001). In this meta-analysis of data from African American participants, Alzheimer disease was significantly associated with variants in ABCA7 and with other genes that have been associated with Alzheimer disease in individuals of European ancestry. Replication and functional validation of these findings are needed before this information is used in clinical settings.
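The inverse variance-weighted meta-analysis mentioned in this abstract can be illustrated with a minimal fixed-effect sketch. It assumes each data set contributes a log odds ratio and its standard error; the function name and all numbers below are hypothetical and do not reproduce the ADGC analysis.

```python
import math


def inverse_variance_meta(betas, ses):
    """Fixed-effect pooling of per-study effect estimates.

    betas: per-study log odds ratios (hypothetical inputs here).
    ses:   their standard errors.
    Returns (pooled beta, pooled SE, z score, two-sided p-value).
    """
    # Each study is weighted by the inverse of its variance.
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    # Two-sided p-value under a standard normal approximation.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, z, p


# Hypothetical per-data-set estimates for one SNP (not from the paper).
betas = [0.50, 0.70, 0.60]
ses = [0.20, 0.25, 0.10]
beta, se, z, p = inverse_variance_meta(betas, ses)
print(f"pooled OR = {math.exp(beta):.2f}, p = {p:.2e}")
```

More precise studies (smaller standard errors) dominate the pooled estimate, which is the point of weighting by inverse variance rather than averaging the per-study effects directly.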
Genome-wide transcriptomics of aging in the rotifer Brachionus manjavacas, an emerging model system.
Gribble, Kristin E; Mark Welch, David B
2017-03-01
Understanding gene expression changes over lifespan in diverse animal species will lead to insights to conserved processes in the biology of aging and allow development of interventions to improve health. Rotifers are small aquatic invertebrates that have been used in aging studies for nearly 100 years and are now re-emerging as a modern model system. To provide a baseline to evaluate genetic responses to interventions that change health throughout lifespan and a framework for new hypotheses about the molecular genetic mechanisms of aging, we examined the transcriptome of an asexual female lineage of the rotifer Brachionus manjavacas at five life stages: eggs, neonates, and early-, late-, and post-reproductive adults. There are widespread shifts in gene expression over the lifespan of B. manjavacas; the largest change occurs between neonates and early reproductive adults and is characterized by down-regulation of developmental genes and up-regulation of genes involved in reproduction. The expression profile of post-reproductive adults was distinct from that of other life stages. While few genes were significantly differentially expressed in the late- to post-reproductive transition, gene set enrichment analysis revealed multiple down-regulated pathways in metabolism, maintenance and repair, and proteostasis, united by genes involved in mitochondrial function and oxidative phosphorylation. This study provides the first examination of changes in gene expression over lifespan in rotifers. We detected differential expression of many genes with human orthologs that are absent in Drosophila and C. elegans, highlighting the potential of the rotifer model in aging studies. Our findings suggest that small but coordinated changes in expression of many genes in pathways that integrate diverse functions drive the aging process. 
The observation of simultaneous declines in expression of genes in multiple pathways may have consequences for health and longevity not detected by single- or multi-gene knockdown in otherwise healthy animals. Investigation of subtle but genome-wide change in these pathways during aging is an important area for future study.
Chattaway, Marie Anne; Day, Michaela; Mtwale, Julia; White, Emma; Rogers, James; Day, Martin; Powell, David; Ahmad, Marwa; Harris, Ross; Talukder, Kaisar Ali; Wain, John; Jenkins, Claire; Cravioto, Alejandro
2017-01-01
Purpose This study investigates the virulence and antimicrobial resistance in association with common clonal complexes (CCs) of enteroaggregative Escherichia coli (EAEC) isolated from Bangladesh. The aim was to determine whether specific CCs were more likely to be associated with putative virulence genes and/or antimicrobial resistance. Methodology The presence of 15 virulence genes (by PCR) and susceptibility to 18 antibiotics were determined for 151 EAEC isolated from cases and controls during an intestinal infectious disease study carried out between 2007–2011 in the rural setting of Mirzapur, Bangladesh (Kotloff KL, Blackwelder WC, Nasrin D, Nataro JP, Farag TH et al. Clin Infect Dis 2012;55:S232–S245). These data were then analysed in the context of previously determined serotypes and clonal complexes defined by multi-locus sequence typing. Results Overall there was no association between the presence of virulence or antimicrobial resistance genes in isolates of EAEC from cases versus controls. However, when stratified by clonal complex (CC) one CC associated with cases harboured more virulence factors (CC40) and one CC harboured more resistance genes (CC38) than the average. There was no direct link between the virulence gene content and antibiotic resistance. Strains within a single CC had variable virulence and resistance gene content indicating independent and multiple gene acquisitions over time. Conclusion In Bangladesh, there are multiple clonal complexes of EAEC harbouring a variety of virulence and resistance genes. The emergence of two of the most successful clones appeared to be linked to either increased virulence (CC40) or antimicrobial resistance (CC38), but increased resistance and virulence were not found in the same clonal complexes. PMID:28945190
Synergistic interactions of biotic and abiotic environmental stressors on gene expression.
Altshuler, Ianina; McLeod, Anne M; Colbourne, John K; Yan, Norman D; Cristescu, Melania E
2015-03-01
Understanding the response of organisms to multiple stressors is critical for predicting if populations can adapt to rapid environmental change. Natural and anthropogenic stressors often interact, complicating general predictions. In this study, we examined the interactive and cumulative effects of two common environmental stressors, lowered calcium concentration, an anthropogenic stressor, and predator presence, a natural stressor, on the water flea Daphnia pulex. We analyzed expression changes of five genes involved in calcium homeostasis - cuticle proteins (Cutie, Icp2), calbindin (Calb), and calcium pump and channel (Serca and Ip3R) - using real-time quantitative PCR (RT-qPCR) in a full factorial experiment. We observed strong synergistic interactions between low calcium concentration and predator presence. While the Ip3R gene was not affected by the stressors, the transcriptional levels of the other four genes were affected by the combination of the stressors. Transcriptional patterns of genes that code for cuticle proteins (Cutie and Icp2) and a sarcoplasmic calcium pump (Serca) only responded to the combination of stressors, changing their relative expression levels in a synergistic response, while a calcium-binding protein (Calb) responded to low calcium stress and the combination of both stressors. The expression patterns of these genes (Cutie, Icp2, and Serca) were nonlinear, yet they were dose dependent across the calcium gradient. Multiple stressors can have complex, often unexpected effects on ecosystems. This study demonstrates that the dominant interaction for the set of tested genes appears to be synergism. We argue that gene expression patterns can be used to understand and predict the type of interaction expected when organisms are exposed simultaneously to natural and anthropogenic stressors.
Identification of type 2 diabetes-associated combination of SNPs using support vector machine.
Ban, Hyo-Jeong; Heo, Jee Yeon; Oh, Kyung-Soo; Park, Keun-Joon
2010-04-23
Type 2 diabetes mellitus (T2D), a metabolic disorder characterized by insulin resistance and relative insulin deficiency, is a complex disease of major public health importance. Its incidence is rapidly increasing in developed countries. Complex diseases are caused by interactions between multiple genes and environmental factors. Most association studies aim to identify individual susceptibility markers using a simple disease model. Recent studies have tried to estimate the effects of multiple genes and multiple loci in genome-wide association. However, estimating these effects is very difficult. We aim to assess the rules for classifying diseased and normal subjects by evaluating potential gene-gene interactions in the same or distinct biological pathways. We analyzed the importance of gene-gene interactions in T2D susceptibility by investigating 408 single nucleotide polymorphisms (SNPs) in 87 genes involved in major T2D-related pathways in 462 T2D patients and 456 healthy controls from the Korean cohort studies. We evaluated the support vector machine (SVM) method to differentiate between cases and controls using SNP information in a 10-fold cross-validation test. We achieved a 65.3% prediction rate with a combination of 14 SNPs in 12 genes by using the radial basis function (RBF)-kernel SVM. Similarly, we investigated subpopulation data sets of men and women and identified different SNP combinations with prediction rates of 70.9% and 70.6%, respectively. As the high-throughput technology for genome-wide SNPs improves, it is likely that a much higher prediction rate with biologically more interesting combinations of SNPs can be achieved by using this method. The SVM-based feature selection method used in this research found novel associations between combinations of SNPs and T2D in a Korean population.
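To make the RBF kernel named in this abstract concrete, the sketch below computes the kernel value K(x, y) = exp(-gamma * ||x - y||^2) between SNP genotype vectors. The 0/1/2 minor-allele-count coding, the gamma value, and the genotypes are assumptions made for illustration; the abstract does not state how the SNPs were encoded or which kernel parameters were used, and this sketch does not train an SVM.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def kernel_matrix(samples, gamma):
    """Pairwise kernel values, the similarity input an RBF-kernel SVM uses."""
    return [[rbf_kernel(a, b, gamma) for b in samples] for a in samples]


# Hypothetical genotypes: rows are subjects, columns are SNPs coded as
# minor-allele counts (0, 1, or 2). This coding is an assumption.
genotypes = [
    [0, 1, 2, 0],
    [0, 1, 2, 1],
    [2, 0, 0, 2],
]
K = kernel_matrix(genotypes, gamma=0.25)
for row in K:
    print(["%.4f" % value for value in row])
```

Subjects with similar genotypes across the selected SNPs get kernel values close to 1, and dissimilar subjects get values close to 0, which is what lets the classifier separate cases from controls on combinations of SNPs rather than on single markers.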
GeneNetFinder2: Improved Inference of Dynamic Gene Regulatory Relations with Multiple Regulators.
Han, Kyungsook; Lee, Jeonghoon
2016-01-01
A gene involved in complex regulatory interactions may have multiple regulators since gene expression in such interactions is often controlled by more than one gene. Another thing that makes gene regulatory interactions complicated is that regulatory interactions are not static, but change over time during the cell cycle. Most research so far has focused on identifying gene regulatory relations between individual genes in a particular stage of the cell cycle. In this study we developed a method for identifying dynamic gene regulations of several types from the time-series gene expression data. The method can find gene regulations with multiple regulators that work in combination or individually as well as those with single regulators. The method has been implemented as the second version of GeneNetFinder (hereafter called GeneNetFinder2) and tested on several gene expression datasets. Experimental results with gene expression data revealed the existence of genes that are not regulated by individual genes but rather by a combination of several genes. Such gene regulatory relations cannot be found by conventional methods. Our method finds such regulatory relations as well as those with multiple, independent regulators or single regulators, and represents gene regulatory relations as a dynamic network in which different gene regulatory relations are shown in different stages of the cell cycle. GeneNetFinder2 is available at http://bclab.inha.ac.kr/GeneNetFinder and will be useful for modeling dynamic gene regulations with multiple regulators.
Polymorphisms in miRNA genes and their involvement in autoimmune diseases susceptibility.
Latini, Andrea; Ciccacci, Cinzia; Novelli, Giuseppe; Borgiani, Paola
2017-08-01
MicroRNAs (miRNAs) are small non-coding RNA molecules that negatively regulate the expression of multiple protein-encoding genes at the post-transcriptional level. MicroRNAs are involved in different pathways, such as cellular proliferation and differentiation, signal transduction and inflammation, and play crucial roles in the development of several diseases, such as cancer, diabetes, and cardiovascular diseases. They have recently been recognized to play a role in the pathogenesis of autoimmune diseases as well. Although the majority of studies focus on investigating miRNA expression profiles, a growing number of studies have been investigating the role of polymorphisms in miRNA genes in the development of autoimmune diseases. Indeed, polymorphisms affecting the miRNA genes can modify the set of targets they regulate or the maturation efficiency. This review aims to give an overview of the available studies that have investigated the association of miRNA gene polymorphisms with susceptibility to various autoimmune diseases and with their clinical phenotypes.
Limited common origins of multiple adult health-related behaviors: Evidence from U.S. twins.
Sudharsanan, Nikkil; Behrman, Jere R; Kohler, Hans-Peter
2016-12-01
Health-related behaviors are significant contributors to morbidity and mortality in the United States, yet evidence on the underlying causes of the vast within-population variation in behaviors is mixed. While many potential causes of health-related behaviors have been identified, such as schooling, genetics, and environments, little is known about how much of the variation across multiple behaviors is due to a common set of causes. We use three separate datasets on U.S. twins to investigate the degree to which multiple health-related behaviors correlate and can be explained by a common set of factors. We find that aside from smoking and drinking, most behaviors are not strongly correlated among individuals. Based on the results of both within-identical-twins regressions and multivariate behavioral genetics models, we find some evidence that schooling may be related to smoking but not to the covariation between multiple behaviors. Similarly, we find that a large fraction of the variance in each of the behaviors is consistent with genetic factors; however, we do not find strong evidence that a single common set of genes explains variation in multiple behaviors. We find, however, that a large portion of the correlation between smoking and heavy drinking is consistent with common, mostly childhood, environments. This suggests that the initiation and patterns of these two behaviors might arise from a common childhood origin. Research and policy to identify and modify this source may provide a strong way to reduce the population health burden of smoking and heavy drinking.
Reitz, Christiane; Jun, Gyungah; Naj, Adam; Rajbhandary, Ruchita; Vardarajan, Badri Narayan; Wang, Li-San; Valladares, Otto; Lin, Chiao-Feng; Larson, Eric B.; Graff-Radford, Neill R.; Evans, Denis; De Jager, Philip L.; Crane, Paul K.; Buxbaum, Joseph D.; Murrell, Jill R.; Raj, Towfique; Ertekin-Taner, Nilufer; Logue, Mark; Baldwin, Clinton T.; Green, Robert C.; Barnes, Lisa L.; Cantwell, Laura B.; Fallin, M. Daniele; Go, Rodney C. P.; Griffith, Patrick; Obisesan, Thomas O.; Manly, Jennifer J.; Lunetta, Kathryn L.; Kamboh, M. Ilyas; Lopez, Oscar L.; Bennett, David A.; Hendrie, Hugh; Hall, Kathleen S.; Goate, Alison M.; Byrd, Goldie S.; Kukull, Walter A.; Foroud, Tatiana M.; Haines, Jonathan L.; Farrer, Lindsay A.; Pericak-Vance, Margaret A.; Schellenberg, Gerard D.; Mayeux, Richard
2013-01-01
Importance Genetic variants associated with susceptibility to late-onset Alzheimer disease are known for individuals of European ancestry, but whether the same or different variants account for the genetic risk of Alzheimer disease in African American individuals is unknown. Identification of disease-associated variants helps identify targets for genetic testing, prevention, and treatment. Objective To identify genetic loci associated with late-onset Alzheimer disease in African Americans. Design, Setting, and Participants The Alzheimer Disease Genetics Consortium (ADGC) assembled multiple data sets representing a total of 5896 African Americans (1968 case participants, 3928 control participants) 60 years or older that were collected between 1989 and 2011 at multiple sites. The association of Alzheimer disease with genotyped and imputed single-nucleotide polymorphisms (SNPs) was assessed in case-control and in family-based data sets. Results from individual data sets were combined to perform an inverse variance–weighted meta-analysis, first with genome-wide analyses and subsequently with gene-based tests for previously reported loci. Main Outcomes and Measures Presence of Alzheimer disease according to standardized criteria. Results Genome-wide significance in fully adjusted models (sex, age, APOE genotype, population stratification) was observed for a SNP in ABCA7 (rs115550680, allele = G; frequency, 0.09 cases and 0.06 controls; odds ratio [OR], 1.79 [95% CI, 1.47-2.12]; P = 2.2 × 10^-9), which is in linkage disequilibrium with SNPs previously associated with Alzheimer disease in Europeans (0.8
Prior knowledge based mining functional modules from Yeast PPI networks with gene ontology
2010-01-01
Background Many algorithmic approaches have been proposed in the literature for identifying functional modules in protein-protein interaction (PPI) networks. Because large-scale interaction data are accumulating for multiple organisms and many interactions remain unrecorded in existing PPI databases, there is still an urgent need for novel computational techniques that can analyze interaction data sets correctly and scalably. Indeed, a number of large-scale biological data sets provide indirect evidence for protein-protein interaction relationships. Results The main aim of this paper is to present a prior knowledge-based mining strategy to identify functional modules from PPI networks with the aid of Gene Ontology. A higher similarity value in Gene Ontology means that two gene products are more functionally related to each other, so it is better to group such gene products into one functional module. We study (i) encoding the functional pairs into the existing PPI networks; and (ii) using these functional pairs as pairwise constraints to supervise the existing functional module identification algorithms. A topology-based modularity metric and complex annotations in MIPS are used to evaluate the functional modules identified by these two approaches. Conclusions The experimental results on Yeast PPI networks and GO show that the prior knowledge-based learning methods perform better than the existing algorithms. PMID:21172053
Tzeng, Jung-Ying; Zhang, Daowen; Pongpanich, Monnat; Smith, Chris; McCarthy, Mark I.; Sale, Michèle M.; Worrall, Bradford B.; Hsu, Fang-Chi; Thomas, Duncan C.; Sullivan, Patrick F.
2011-01-01
Genomic association analyses of complex traits demand statistical tools that are capable of detecting small effects of common and rare variants and modeling complex interaction effects and yet are computationally feasible. In this work, we introduce a similarity-based regression method for assessing the main genetic and interaction effects of a group of markers on quantitative traits. The method uses genetic similarity to aggregate information from multiple polymorphic sites and integrates adaptive weights that depend on allele frequencies to accommodate common and uncommon variants. Collapsing information at the similarity level instead of the genotype level avoids canceling signals that have opposite etiological effects and is applicable to any class of genetic variants without the need for dichotomizing the allele types. To assess gene-trait associations, we regress trait similarities for pairs of unrelated individuals on their genetic similarities and assess association by using a score test whose limiting distribution is derived in this work. The proposed regression framework allows for covariates, has the capacity to model both main and interaction effects, can be applied to a mixture of different polymorphism types, and is computationally efficient. These features make it an ideal tool for evaluating associations between phenotype and marker sets defined by linkage disequilibrium (LD) blocks, genes, or pathways in whole-genome analysis. PMID:21835306
Provenzano, Paolo P; Inman, David R; Eliceiri, Kevin W; Beggs, Hilary E; Keely, Patricia J
2008-11-01
Focal adhesion kinase (FAK) is a central regulator of the focal adhesion, influencing cell proliferation, survival, and migration. Despite evidence demonstrating FAK overexpression in human cancer, its role in tumor initiation and progression is not well understood. Using Cre/LoxP technology to specifically knockout FAK in the mammary epithelium, we showed that FAK is not required for tumor initiation but is required for tumor progression. The mechanistic underpinnings of these results suggested that FAK regulates clinically relevant gene signatures and multiple signaling complexes associated with tumor progression and metastasis, such as Src, ERK, and p130Cas. Furthermore, a systems-level analysis identified FAK as a major regulator of the tumor transcriptome, influencing genes associated with adhesion and growth factor signaling pathways, and their cross talk. Additionally, FAK was shown to down-regulate the expression of clinically relevant proliferation- and metastasis-associated gene signatures, as well as an enriched group of genes associated with the G(2) and G(2)/M phases of the cell cycle. Computational analysis of transcription factor-binding sites within ontology-enriched or clustered gene sets suggested that the differentially expressed proliferation- and metastasis-associated genes in FAK-null cells were regulated through a common set of transcription factors, including p53. Therefore, FAK acts as a primary node in the activated signaling network in transformed motile cells and is a prime candidate for novel therapeutic interventions to treat aggressive human breast cancers.
Structural Characterization of the Fla2 Flagellum of Rhodobacter sphaeroides
de la Mora, Javier; Uchida, Kaoru; del Campo, Ana Martínez; Camarena, Laura; Aizawa, Shin-Ichi
2015-01-01
ABSTRACT Rhodobacter sphaeroides is a free-living alphaproteobacterium that contains two clusters of functional flagellar genes in its genome: one acquired by horizontal gene transfer (fla1) and one that is endogenous (fla2). We have shown that the Fla2 system is normally quiescent and under certain conditions produces polar flagella, while the Fla1 system is always active and produces a single flagellum at a nonpolar position. In this work we purified the Fla2 flagellum, characterized its structure, and analyzed its composition. The number of polar filaments per cell is 4.6 on average. By comparison with the Fla1 flagellum, the prominent features of the ultrastructure of the Fla2 HBB are the absence of an H ring, thick and long hooks, and a smoother zone at the hook-filament junction. The Fla2 helical filaments have a pitch of 2.64 μm and a diameter of 1.4 μm, which are smaller than those of the Fla1 filaments. Fla2 filaments undergo polymorphic transitions in vitro and showed two polymorphs: curly (right-handed) and coiled. However, in vivo in free-swimming cells, we observed only a bundle of filaments, which should probably be left-handed. Together, our results indicate that Fla2 cells produce multiple right-handed polar flagella, which are not conventional but exceptional. IMPORTANCE R. sphaeroides possesses two functional sets of flagellar genes. The fla1 genes are normally expressed in the laboratory and were acquired by horizontal transfer. The fla2 genes are endogenous and are expressed in a Fla1− mutant grown phototrophically and in the absence of organic acids. The Fla1 system produces a single lateral or subpolar flagellum, and the Fla2 system produces multiple polar flagella. The two kinds of flagella are never expressed simultaneously, and both are used for swimming in liquid media. The two sets of genes are certainly ready for responding to specific environmental conditions. 
The characterization of the Fla2 system will help us to understand its role in the physiology of this microorganism. PMID:26124240
VisANT 3.0: new modules for pathway visualization, editing, prediction and construction.
Hu, Zhenjun; Ng, David M; Yamada, Takuji; Chen, Chunnuan; Kawashima, Shuichi; Mellor, Joe; Linghu, Bolan; Kanehisa, Minoru; Stuart, Joshua M; DeLisi, Charles
2007-07-01
With the integration of the KEGG and Predictome databases as well as two search engines for coexpressed genes/proteins using data sets obtained from the Stanford Microarray Database (SMD) and Gene Expression Omnibus (GEO) database, VisANT 3.0 supports exploratory pathway analysis, which includes multi-scale visualization of multiple pathways, editing and annotating pathways using a KEGG compatible visual notation and visualization of expression data in the context of pathways. Expression levels are represented either by color intensity or by nodes with an embedded expression profile. Multiple experiments can be navigated or animated. Known KEGG pathways can be enriched by querying either coexpressed components of known pathway members or proteins with known physical interactions. Predicted pathways for genes/proteins with unknown functions can be inferred from coexpression or physical interaction data. Pathways produced in VisANT can be saved as computer-readable XML format (VisML), graphic images or high-resolution Scalable Vector Graphics (SVG). Pathways in the format of VisML can be securely shared within an interested group or published online using a simple Web link. VisANT is freely available at http://visant.bu.edu.
2012-01-01
Background Fever is one of the most common adverse events of vaccines. The detailed mechanisms of fever and vaccine-associated gene interaction networks are not fully understood. In the present study, we employed a genome-wide, Centrality and Ontology-based Network Discovery using Literature data (CONDL) approach to analyse the genes and gene interaction networks associated with fever or vaccine-related fever responses. Results Over 170,000 fever-related articles from PubMed abstracts and titles were retrieved and analysed at the sentence level using natural language processing techniques to identify genes and vaccines (including 186 Vaccine Ontology terms) as well as their interactions. This resulted in a generic fever network consisting of 403 genes and 577 gene interactions. A vaccine-specific fever sub-network consisting of 29 genes and 28 gene interactions was extracted from articles that are related to both fever and vaccines. In addition, gene-vaccine interactions were identified. Vaccines (including 4 specific vaccine names) were found to directly interact with 26 genes. Gene set enrichment analysis was performed using the genes in the generated interaction networks. Moreover, the genes in these networks were prioritized using network centrality metrics. Making scientific discoveries and generating new hypotheses were possible by using network centrality and gene set enrichment analyses. For example, our study found that the genes in the generic fever network were more enriched in cell death and responses to wounding, and the vaccine sub-network had more gene enrichment in leukocyte activation and phosphorylation regulation. The most central genes in the vaccine-specific fever network are predicted to be highly relevant to vaccine-induced fever, whereas genes that are central only in the generic fever network are likely to be highly relevant to generic fever responses. 
Interestingly, no Toll-like receptors (TLRs) were found in the gene-vaccine interaction network. Since multiple TLRs were found in the generic fever network, it is reasonable to hypothesize that vaccine-TLR interactions may play an important role in inducing fever response, which deserves further investigation. Conclusions This study demonstrated that ontology-based literature mining is a powerful method for analyzing gene interaction networks and generating new scientific hypotheses. PMID:23256563
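The abstract above prioritizes genes by network centrality but does not say which metrics were used. As a minimal sketch of the idea, the following ranks genes by degree centrality, which is only one common choice and is an assumption here. The function names and the toy edge list are hypothetical and are not taken from the study's fever network.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Return {node: degree / (n - 1)} for an undirected edge list."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-loops
            neighbors[a].add(b)
            neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(nbrs) / (n - 1) for node, nbrs in neighbors.items()}


def rank_genes(edges):
    """Sort genes by descending centrality, breaking ties by name."""
    scores = degree_centrality(edges)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy interactions, for illustration only.
TOY_EDGES = [
    ("GENE_A", "GENE_B"),
    ("GENE_A", "GENE_C"),
    ("GENE_A", "GENE_D"),
    ("GENE_B", "GENE_C"),
]

if __name__ == "__main__":
    for gene, score in rank_genes(TOY_EDGES):
        print(f"{gene}\t{score:.2f}")
```

Degree centrality only counts direct interaction partners; metrics such as betweenness or closeness weigh a gene's position in the wider network and may rank genes differently.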
Witt, S H; Streit, F; Jungkunz, M; Frank, J; Awasthi, S; Reinbold, C S; Treutlein, J; Degenhardt, F; Forstner, A J; Heilmann-Heimbach, S; Dietl, L; Schwarze, C E; Schendel, D; Strohmaier, J; Abdellaoui, A; Adolfsson, R; Air, T M; Akil, H; Alda, M; Alliey-Rodriguez, N; Andreassen, O A; Babadjanova, G; Bass, N J; Bauer, M; Baune, B T; Bellivier, F; Bergen, S; Bethell, A; Biernacka, J M; Blackwood, D H R; Boks, M P; Boomsma, D I; Børglum, A D; Borrmann-Hassenbach, M; Brennan, P; Budde, M; Buttenschøn, H N; Byrne, E M; Cervantes, P; Clarke, T-K; Craddock, N; Cruceanu, C; Curtis, D; Czerski, P M; Dannlowski, U; Davis, T; de Geus, E J C; Di Florio, A; Djurovic, S; Domenici, E; Edenberg, H J; Etain, B; Fischer, S B; Forty, L; Fraser, C; Frye, M A; Fullerton, J M; Gade, K; Gershon, E S; Giegling, I; Gordon, S D; Gordon-Smith, K; Grabe, H J; Green, E K; Greenwood, T A; Grigoroiu-Serbanescu, M; Guzman-Parra, J; Hall, L S; Hamshere, M; Hauser, J; Hautzinger, M; Heilbronner, U; Herms, S; Hitturlingappa, S; Hoffmann, P; Holmans, P; Hottenga, J-J; Jamain, S; Jones, I; Jones, L A; Juréus, A; Kahn, R S; Kammerer-Ciernioch, J; Kirov, G; Kittel-Schneider, S; Kloiber, S; Knott, S V; Kogevinas, M; Landén, M; Leber, M; Leboyer, M; Li, Q S; Lissowska, J; Lucae, S; Martin, N G; Mayoral-Cleries, F; McElroy, S L; McIntosh, A M; McKay, J D; McQuillin, A; Medland, S E; Middeldorp, C M; Milaneschi, Y; Mitchell, P B; Montgomery, G W; Morken, G; Mors, O; Mühleisen, T W; Müller-Myhsok, B; Myers, R M; Nievergelt, C M; Nurnberger, J I; O'Donovan, M C; Loohuis, L M O; Ophoff, R; Oruc, L; Owen, M J; Paciga, S A; Penninx, B W J H; Perry, A; Pfennig, A; Potash, J B; Preisig, M; Reif, A; Rivas, F; Rouleau, G A; Schofield, P R; Schulze, T G; Schwarz, M; Scott, L; Sinnamon, G C B; Stahl, E A; Strauss, J; Turecki, G; Van der Auwera, S; Vedder, H; Vincent, J B; Willemsen, G; Witt, C C; Wray, N R; Xi, H S; Tadic, A; Dahmen, N; Schott, B H; Cichon, S; Nöthen, M M; Ripke, S; Mobascher, A; Rujescu, D; Lieb, K; 
Roepke, S; Schmahl, C; Bohus, M; Rietschel, M
2017-06-20
Borderline personality disorder (BOR) is determined by environmental and genetic factors, and characterized by affective instability and impulsivity, diagnostic symptoms also observed in manic phases of bipolar disorder (BIP). Up to 20% of BIP patients show comorbidity with BOR. This report describes the first case-control genome-wide association study (GWAS) of BOR, performed in one of the largest BOR patient samples worldwide. The focus of our analysis was (i) to detect genes and gene sets involved in BOR and (ii) to investigate the genetic overlap with BIP. As there is considerable genetic overlap between BIP, major depression (MDD) and schizophrenia (SCZ) and a high comorbidity of BOR and MDD, we also analyzed the genetic overlap of BOR with SCZ and MDD. GWAS, gene-based tests and gene-set analyses were performed in 998 BOR patients and 1545 controls. Linkage disequilibrium score regression was used to detect the genetic overlap between BOR and these disorders. Single marker analysis revealed no significant association after correction for multiple testing. Gene-based analysis yielded two significant genes: DPYD (P = 4.42 × 10^-7) and PKP4 (P = 8.67 × 10^-7); and gene-set analysis yielded a significant finding for exocytosis (GO:0006887, P_FDR = 0.019; FDR, false discovery rate). Prior studies have implicated DPYD, PKP4 and exocytosis in BIP and SCZ. The most notable finding of the present study was the genetic overlap of BOR with BIP (r_g = 0.28 [P = 2.99 × 10^-3]), SCZ (r_g = 0.34 [P = 4.37 × 10^-5]) and MDD (r_g = 0.57 [P = 1.04 × 10^-3]). We believe our study is the first to demonstrate that BOR overlaps with BIP, MDD and SCZ on the genetic level. Whether this is confined to transdiagnostic clinical symptoms should be examined in future studies.
Dixit, Shalabh; Kumar Biswal, Akshaya; Min, Aye; Henry, Amelia; Oane, Rowena H.; Raorane, Manish L.; Longkumer, Toshisangba; Pabuayon, Isaiah M.; Mutte, Sumanth K.; Vardarajan, Adithi R.; Miro, Berta; Govindan, Ganesan; Albano-Enriquez, Blesilda; Pueffeld, Mandy; Sreenivasulu, Nese; Slamet-Loedin, Inez; Sundarvelpandian, Kalaipandian; Tsai, Yuan-Ching; Raghuvanshi, Saurabh; Hsing, Yue-Ie C.; Kumar, Arvind; Kohli, Ajay
2015-01-01
Sub-QTLs and multiple intra-QTL genes are hypothesized to underpin large-effect QTLs. Known QTLs over gene families, biosynthetic pathways or certain traits represent functional gene-clusters of genes of the same gene ontology (GO). Gene-clusters containing genes of different GO have not been elaborated, except in silico as coexpressed genes within QTLs. Here we demonstrate the requirement of multiple intra-QTL genes for the full impact of QTL qDTY12.1 on rice yield under drought. Multiple lines of evidence are presented for the need of the transcription factor ‘no apical meristem’ (OsNAM12.1) and its co-localized target genes of separate GO categories for qDTY12.1 function, raising a regulon-like model of genetic architecture. The molecular underpinnings of qDTY12.1 support its effectiveness in further improving a drought tolerant genotype and its validity in multiple genotypes/ecosystems/environments. Resolving the combinatorial value of OsNAM12.1 with individual intra-QTL genes notwithstanding, identification and analyses of qDTY12.1 have fast-tracked rice improvement towards food security. PMID:26507552
Jena, Kshirod K; Hechanova, Sherry Lou; Verdeprado, Holden; Prahalada, G D; Kim, Sung-Ryul
2017-11-01
A first set of 25 NILs carrying ten BPH resistance genes and their pyramids was developed in the background of indica variety IR24 for insect resistance breeding in rice. Brown planthopper (Nilaparvata lugens Stål) is one of the most destructive insect pests in rice. Development of near-isogenic lines (NILs) is an important strategy for genetic analysis of brown planthopper (BPH) resistance (R) genes and their deployment against diverse BPH populations. A set of 25 NILs with 9 single R genes and 16 multiple R gene combinations consisting of 11 two-gene pyramids and 5 three-gene pyramids in the genetic background of the susceptible indica rice cultivar IR24 was developed through marker-assisted selection. The linked DNA markers for each of the R genes were used for foreground selection and confirming the introgressed regions of the BPH R genes. Modified seed box screening and feeding rate of BPH were used to evaluate the spectrum of resistance. BPH reaction of each of the NILs carrying different single genes was variable at the antibiosis level with the four BPH populations of the Philippines. The NILs with two- to three-pyramided genes showed a stronger level of antibiosis (49.3-99.0%) against BPH populations compared with NILs with a single R gene (42.0-83.5%) and IR24 (10.0%). Background genotyping by high-density SNP markers revealed that most of the chromosome regions of the NILs (BC3F5) had IR24 genome recovery of 82.0-94.2%. Data on six major agronomic traits showed that the NILs were phenotypically comparable to IR24 in agronomic performance. These newly developed NILs will be useful as new genetic resources for BPH resistance breeding and are valuable sources of genes for monitoring the emerging BPH biotypes in different rice-growing countries.
Mining functionally relevant gene sets for analyzing physiologically novel clinical expression data.
Turcan, Sevin; Vetter, Douglas E; Maron, Jill L; Wei, Xintao; Slonim, Donna K
2011-01-01
Gene set analyses have become a standard approach for increasing the sensitivity of transcriptomic studies. However, analytical methods incorporating gene sets require the availability of pre-defined gene sets relevant to the underlying physiology being studied. For novel physiological problems, relevant gene sets may be unavailable or existing gene set databases may bias the results towards only the best-studied of the relevant biological processes. We describe a successful attempt to mine novel functional gene sets for translational projects where the underlying physiology is not necessarily well characterized in existing annotation databases. We choose targeted training data from public expression data repositories and define new criteria for selecting biclusters to serve as candidate gene sets. Many of the discovered gene sets show little or no enrichment for informative Gene Ontology terms or other functional annotation. However, we observe that such gene sets show coherent differential expression in new clinical test data sets, even if derived from different species, tissues, and disease states. We demonstrate the efficacy of this method on a human metabolic data set, where we discover novel, uncharacterized gene sets that are diagnostic of diabetes, and on additional data sets related to neuronal processes and human development. Our results suggest that our approach may be an efficient way to generate a collection of gene sets relevant to the analysis of data for novel clinical applications where existing functional annotation is relatively incomplete.
Multiple homologous genes knockout (KO) by CRISPR/Cas9 system in rabbit.
Liu, Huan; Sui, Tingting; Liu, Di; Liu, Tingjun; Chen, Mao; Deng, Jichao; Xu, Yuanyuan; Li, Zhanjun
2018-03-20
The CRISPR/Cas9 system is a highly efficient and convenient genome editing tool, which has been widely used for single or multiple gene mutation in a variety of organisms. Disruption of multiple homologous genes, which have similar DNA sequences and gene function, is required for the study of the desired phenotype. In this study, to test whether the CRISPR/Cas9 system works on the mutation of multiple homologous genes, a single guide RNA (sgRNA) targeting three fucosyltransferase-encoding genes (FUT1, FUT2 and SEC1) was designed. As expected, triple gene mutation of FUT1, FUT2 and SEC1 could be achieved simultaneously via an sgRNA-mediated CRISPR/Cas9 system. In addition, significantly reduced serum fucosyltransferase enzyme activity was observed in the triple-gene-mutation rabbits. Thus, we provide the first evidence that knockout (KO) of multiple homologous genes can be achieved efficiently by an sgRNA-mediated CRISPR/Cas9 system in mammals, which could facilitate genotype-to-phenotype studies of homologous genes in the future. Copyright © 2018 Elsevier B.V. All rights reserved.
A Mutation in the Bacillus subtilis rsbU Gene That Limits RNA Synthesis during Sporulation.
Rothstein, David M; Lazinski, David; Osburne, Marcia S; Sonenshein, Abraham L
2017-07-15
Mutants of Bacillus subtilis that are temperature sensitive for RNA synthesis during sporulation were isolated after selection with a 32P suicide agent. Whole-genome sequencing revealed that two of the mutants carried an identical lesion in the rsbU gene, which encodes a phosphatase that indirectly activates SigB, the stress-responsive RNA polymerase sigma factor. The mutation appeared to cause RsbU to be hyperactive, because the mutants were more resistant than the parent strain to ethanol stress. In support of this hypothesis, pseudorevertants that regained wild-type levels of sporulation at high temperature had secondary mutations that prevented expression of the mutant rsbU gene. The properties of these RsbU mutants support the idea that activation of SigB diminishes the bacterium's ability to sporulate. IMPORTANCE Most bacterial species encode multiple RNA polymerase promoter recognition subunits (sigma factors). Each sigma factor directs RNA polymerase to different sets of genes; each gene set typically encodes proteins important for responses to specific environmental conditions, such as changes in temperature, salt concentration, and nutrient availability. A selection for mutants of Bacillus subtilis that are temperature sensitive for RNA synthesis during sporulation unexpectedly yielded strains with a point mutation in rsbU, a gene that encodes a protein that normally activates sigma factor B (SigB) under conditions of salt stress. The mutation appears to cause RsbU, and therefore SigB, to be active inappropriately, thereby inhibiting, directly or indirectly, the ability of the cells to transcribe sporulation genes. Copyright © 2017 American Society for Microbiology.
Schelkunov, Mikhail I.; Shtratnikova, Viktoria Yu; Nuraliev, Maxim S.; Selosse, Marc-Andre; Penin, Aleksey A.; Logacheva, Maria D.
2015-01-01
The patterns and limits of plastid genome reduction in nonphotosynthetic plants, and the reasons for the conservation of these genomes, are among the intriguing topics in plant genome evolution. Here, we report sequencing and analysis of the plastid genomes of the nonphotosynthetic orchids Epipogium aphyllum and Epipogium roseum, which, with sizes of 31 and 19 kbp, respectively, represent the smallest plastid genomes characterized to date. Besides drastic reduction, which is expected, we found several unusual features of these “minimal” plastomes: multiple rearrangements, highly biased nucleotide composition, and unprecedentedly high substitution rate. Only 27 and 29 genes remained intact in the plastomes of E. aphyllum and E. roseum, namely those encoding ribosomal components, transfer RNAs, and three additional housekeeping genes (infA, clpP, and accD). We found no signs of relaxed selection acting on these genes. We hypothesize that the main reason for retention of plastid genomes in Epipogium is the necessity to translate messenger RNAs (mRNAs) of accD and/or clpP proteins which are essential for cell metabolism. However, these genes are absent in plastomes of several plant species; their absence is compensated by the presence of a functional copy that arose by gene transfer from the plastid to the nuclear genome. This suggests that there is no single set of plastid-encoded essential genes, but rather different sets for different species and that the retention of a gene in the plastome depends on the interaction between the nucleus and plastids. PMID:25635040
2012-01-01
Background We explore the benefits of applying a new proportional hazard model to analyze survival of breast cancer patients. As a parametric model, the hypertabastic survival model offers a closer fit to experimental data than Cox regression, and furthermore provides explicit survival and hazard functions which can be used as additional tools in the survival analysis. In addition, one of our main concerns is utilization of multiple gene expression variables. Our analysis treats the important issue of interaction of different gene signatures in the survival analysis. Methods The hypertabastic proportional hazards model was applied in survival analysis of breast cancer patients. This model was compared, using statistical measures of goodness of fit, with models based on the semi-parametric Cox proportional hazards model and the parametric log-logistic and Weibull models. The explicit functions for hazard and survival were then used to analyze the dynamic behavior of hazard and survival functions. Results The hypertabastic model provided the best fit among all the models considered. Use of multiple gene expression variables also provided a considerable improvement in the goodness of fit of the model, as compared to use of only one. By utilizing the explicit survival and hazard functions provided by the model, we were able to determine the magnitude of the maximum rate of increase in hazard, and the maximum rate of decrease in survival, as well as the times when these occurred. We explore the influence of each gene expression variable on these extrema. Furthermore, in the cases of continuous gene expression variables, represented by a measure of correlation, we were able to investigate the dynamics with respect to changes in gene expression. Conclusions We observed that use of three different gene signatures in the model provided a greater combined effect and allowed us to assess the relative importance of each in determination of outcome in this data set. 
These results point to the potential to combine gene signatures to a greater effect in cases where each gene signature represents some distinct aspect of the cancer biology. Furthermore we conclude that the hypertabastic survival models can be an effective survival analysis tool for breast cancer patients. PMID:23241496
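The abstract above stresses that a parametric model gives explicit survival and hazard functions. The hypertabastic formulas are not given in the abstract, so the sketch below uses the Weibull model, one of the comparison models it names, to show what "explicit" means in practice. The function names and parameter values are illustrative assumptions and are not fitted to the breast cancer data.

```python
import math


def weibull_survival(t, shape, scale):
    """S(t) = exp(-(t / scale) ** shape), for t >= 0."""
    return math.exp(-((t / scale) ** shape))


def weibull_hazard(t, shape, scale):
    """h(t) = (shape / scale) * (t / scale) ** (shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def ph_hazard(t, shape, scale, beta, x):
    """Proportional hazards: baseline hazard times exp(sum(beta_i * x_i)).

    beta and x are hypothetical coefficient and covariate vectors, for
    example one entry per gene signature.
    """
    linear_predictor = sum(b * xi for b, xi in zip(beta, x))
    return weibull_hazard(t, shape, scale) * math.exp(linear_predictor)


if __name__ == "__main__":
    # Illustrative parameters only.
    for t in (1.0, 2.0, 5.0):
        s = weibull_survival(t, shape=1.5, scale=4.0)
        h = ph_hazard(t, 1.5, 4.0, beta=[0.3, -0.2], x=[1.0, 0.5])
        print(f"t={t:.1f}  S(t)={s:.3f}  h(t|x)={h:.3f}")
```

Because both functions are closed-form, quantities such as the time of the steepest decline in survival can be found by differentiating them directly, which a semi-parametric Cox fit does not provide without estimating the baseline hazard separately.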
Identification of prostate cancer modifier pathways using parental strain expression mapping
Xu, Qing; Majumder, Pradip K.; Ross, Kenneth; Shim, Yeonju; Golub, Todd R.; Loda, Massimo; Sellers, William R.
2007-01-01
Inherited genetic risk factors play an important role in cancer. However, other than the Mendelian cancer susceptibility genes found in familial cancer syndromes, little is known about risk modifiers that control individual susceptibility. Here we developed a strategy, parental strain expression mapping, that utilizes the homogeneity of inbred mice and genome-wide mRNA expression analyses to directly identify candidate germ-line modifier genes and pathways underlying phenotypic differences among murine strains exposed to transgenic activation of AKT1. We identified multiple candidate modifier pathways and, specifically, the glycolysis pathway as a candidate negative modulator of AKT1-induced proliferation. In keeping with the findings in the murine models, in multiple human prostate expression data sets, we found that enrichment of glycolysis pathways in normal tissues was associated with decreased rates of cancer recurrence after prostatectomy. Together, these data suggest that parental strain expression mapping can directly identify germ-line modifier pathways of relevance to human disease. PMID:17978178
Transcription Profiling of the mgrA Regulon in Staphylococcus aureus
Luong, Thanh T.; Dunman, Paul M.; Murphy, Ellen; Projan, Steven J.; Lee, Chia Y.
2006-01-01
MgrA has been shown to affect multiple Staphylococcus aureus genes involved in virulence and antibiotic resistance. To comprehensively identify the target genes regulated by mgrA, we employed a microarray method to analyze the transcription profiles of S. aureus Newman, its isogenic mgrA mutant, and an MgrA-overproducing derivative. We compared genes that were differentially expressed at exponential or early stationary growth phases. Our results showed that MgrA affected an impressive number of genes, 175 of which were positively regulated and 180 of which were negatively regulated in an mgrA-specific manner. The target genes included all functional categories. The microarray results were validated by real-time reverse transcription-PCR quantitation of a set of selected genes from different functional categories. Our data also indicate that mgrA regulates virulence factors in a fashion analogous to that of the accessory gene regulatory locus (agr). Accordingly, exoproteins are upregulated and surface proteins are downregulated by the regulator, suggesting that mgrA may function in concert with agr. The fact that a large number of genes are regulated by mgrA implies that MgrA is a major global regulator in S. aureus. PMID:16484201
VitisExpDB: a database resource for grape functional genomics.
Doddapaneni, Harshavardhan; Lin, Hong; Walker, M Andrew; Yao, Jiqiang; Civerolo, Edwin L
2008-02-28
The family Vitaceae consists of many different grape species that grow in a range of climatic conditions. In the past few years, several studies have generated functional genomic information on different Vitis species and cultivars, including the European grape vine, Vitis vinifera. Our goal is to develop a comprehensive web data source for Vitaceae. VitisExpDB is an online MySQL-PHP driven relational database that houses annotated EST and gene expression data for V. vinifera and non-vinifera grape species and varieties. Currently, the database stores approximately 320,000 EST sequences derived from 8 species/hybrids, their annotation (BLAST top match) details and Gene Ontology based structured vocabulary. Putative homologs for each EST in other species and varieties along with information on their percent nucleotide identities, phylogenetic relationship and common primers can be retrieved. The database also includes information on probe sequence and annotation features of the high-density 60-mer gene expression chip, which consists of a non-redundant set of approximately 20,000 ESTs. Finally, the database includes 14 processed global microarray expression profile sets. Data from 12 of these expression profile sets have been mapped onto metabolic pathways. A user-friendly web interface with multiple search indices and extensively hyperlinked result features that permit efficient data retrieval has been developed. Several online bioinformatics tools that interact with the database along with other sequence analysis tools have been added. In addition, users can submit their ESTs to the database. The developed database provides a genomic resource to the grape community for functional analysis of genes in the collection, grape genome annotation and gene function identification. The VitisExpDB database is available through our website http://cropdisease.ars.usda.gov/vitis_at/main-page.htm.
Bovine Genome Database: supporting community annotation and analysis of the Bos taurus genome
2010-01-01
Background A goal of the Bovine Genome Database (BGD; http://BovineGenome.org) has been to support the Bovine Genome Sequencing and Analysis Consortium (BGSAC) in the annotation and analysis of the bovine genome. We were faced with several challenges, including the need to maintain consistent quality despite diversity in annotation expertise in the research community, the need to maintain consistent data formats, and the need to minimize the potential duplication of annotation effort. With new sequencing technologies allowing many more eukaryotic genomes to be sequenced, the demand for collaborative annotation is likely to increase. Here we present our approach, challenges and solutions facilitating a large distributed annotation project. Results and Discussion BGD has provided annotation tools that supported 147 members of the BGSAC in contributing 3,871 gene models over a fifteen-week period, and these annotations have been integrated into the bovine Official Gene Set. Our approach has been to provide an annotation system, which includes a BLAST site, multiple genome browsers, an annotation portal, and the Apollo Annotation Editor configured to connect directly to our Chado database. In addition to implementing and integrating components of the annotation system, we have performed computational analyses to create gene evidence tracks and a consensus gene set, which can be viewed on individual gene pages at BGD. Conclusions We have provided annotation tools that alleviate challenges associated with distributed annotation. Our system provides a consistent set of data to all annotators and eliminates the need for annotators to format data. Involving the bovine research community in genome annotation has allowed us to leverage expertise in various areas of bovine biology to provide biological insight into the genome sequence. PMID:21092105
VitisExpDB: A database resource for grape functional genomics
Doddapaneni, Harshavardhan; Lin, Hong; Walker, M Andrew; Yao, Jiqiang; Civerolo, Edwin L
2008-01-01
Background The family Vitaceae consists of many different grape species that grow in a range of climatic conditions. In the past few years, several studies have generated functional genomic information on different Vitis species and cultivars, including the European grape vine, Vitis vinifera. Our goal is to develop a comprehensive web data source for Vitaceae. Description VitisExpDB is an online MySQL-PHP driven relational database that houses annotated EST and gene expression data for V. vinifera and non-vinifera grape species and varieties. Currently, the database stores ~320,000 EST sequences derived from 8 species/hybrids, their annotation (BLAST top match) details and Gene Ontology based structured vocabulary. Putative homologs for each EST in other species and varieties along with information on their percent nucleotide identities, phylogenetic relationship and common primers can be retrieved. The database also includes information on probe sequence and annotation features of the high-density 60-mer gene expression chip, which consists of a non-redundant set of ~20,000 ESTs. Finally, the database includes 14 processed global microarray expression profile sets. Data from 12 of these expression profile sets have been mapped onto metabolic pathways. A user-friendly web interface with multiple search indices and extensively hyperlinked result features that permit efficient data retrieval has been developed. Several online bioinformatics tools that interact with the database along with other sequence analysis tools have been added. In addition, users can submit their ESTs to the database. Conclusion The developed database provides a genomic resource to the grape community for functional analysis of genes in the collection, grape genome annotation and gene function identification. The VitisExpDB database is available through our website. PMID:18307813
Gene Ontology Consortium: going forward
2015-01-01
The Gene Ontology (GO; http://www.geneontology.org) is a community-based bioinformatics resource that supplies information about gene product function using ontologies to represent biological knowledge. Here we describe improvements and expansions to several branches of the ontology, as well as updates that have allowed us to more efficiently disseminate the GO and capture feedback from the research community. The Gene Ontology Consortium (GOC) has expanded areas of the ontology such as cilia-related terms, cell-cycle terms and multicellular organism processes. We have also implemented new tools for generating ontology terms based on a set of logical rules making use of templates, and we have made efforts to increase our use of logical definitions. The GOC has a new and improved web site summarizing new developments and documentation, serving as a portal to GO data. Users can perform GO enrichment analysis, and search the GO for terms, annotations to gene products, and associated metadata across multiple species using the all-new AmiGO 2 browser. We encourage and welcome the input of the research community in all biological areas in our continued effort to improve the Gene Ontology. PMID:25428369
STOP using just GO: a multi-ontology hypothesis generation tool for high throughput experimentation
2013-01-01
Background Gene Ontology (GO) enrichment analysis remains one of the most common methods for hypothesis generation from high throughput datasets. However, we believe that researchers strive to test other hypotheses that fall outside of GO. Here, we developed and evaluated a tool for hypothesis generation from gene or protein lists using ontological concepts present in manually curated text that describes those genes and proteins. Results We developed the method Statistical Tracking of Ontological Phrases (STOP), which expands the realm of testable hypotheses in gene set enrichment analyses by integrating automated annotations of genes to terms from over 200 biomedical ontologies. While not as precise as manually curated terms, we find that the additional enriched concepts have value when coupled with traditional enrichment analyses using curated terms. Conclusion Multiple ontologies have been developed for gene and protein annotation. By using a dataset of both manually curated GO terms and automatically recognized concepts from curated text, we can expand the realm of hypotheses that can be discovered. The web application STOP is available at http://mooneygroup.org/stop/. PMID:23409969
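Term enrichment of the kind described above is often scored with a one-sided hypergeometric test. The abstract does not state which statistic STOP uses, so the following is a generic sketch under that assumption. The function name and the example counts are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, overlap):
    """P(X >= overlap) for X ~ Hypergeometric(population, annotated, sample).

    population: number of genes in the background set
    annotated:  background genes annotated with the term
    sample:     size of the query gene list
    overlap:    query genes annotated with the term
    """
    if not (0 <= annotated <= population and 0 <= sample <= population):
        raise ValueError("annotated and sample must lie in [0, population]")
    upper = min(annotated, sample)
    if overlap > upper:
        return 0.0
    total = comb(population, sample)
    # Sum the upper tail; comb(n, k) is 0 when k > n, so impossible
    # overlaps contribute nothing.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(max(overlap, 0), upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical counts: 12 of 200 query genes carry a term that
    # annotates 300 of 20,000 background genes.
    print(enrichment_p_value(20000, 300, 200, 12))
```

When many terms are tested at once, as with more than 200 ontologies, the resulting p-values would normally need a multiple-testing correction such as a false discovery rate adjustment.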
Evaluating the consistency of gene sets used in the analysis of bacterial gene expression data.
Tintle, Nathan L; Sitarik, Alexandra; Boerema, Benjamin; Young, Kylie; Best, Aaron A; Dejongh, Matthew
2012-08-08
Statistical analyses of whole genome expression data require functional information about genes in order to yield meaningful biological conclusions. The Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) are common sources of functionally grouped gene sets. For bacteria, the SEED and MicrobesOnline provide alternative, complementary sources of gene sets. To date, no comprehensive evaluation of the data obtained from these resources has been performed. We define a series of gene set consistency metrics directly related to the most common classes of statistical analyses for gene expression data, and then perform a comprehensive analysis of 3581 Affymetrix® gene expression arrays across 17 diverse bacteria. We find that gene sets obtained from GO and KEGG demonstrate lower consistency than those obtained from the SEED and MicrobesOnline, regardless of gene set size. Despite the widespread use of GO and KEGG gene sets in bacterial gene expression data analysis, the SEED and MicrobesOnline provide more consistent sets for a wide variety of statistical analyses. Increased use of the SEED and MicrobesOnline gene sets in the analysis of bacterial gene expression data may improve statistical power and utility of expression data.
Genome-wide introgression among distantly related Heliconius butterfly species.
Zhang, Wei; Dasmahapatra, Kanchon K; Mallet, James; Moreira, Gilson R P; Kronforst, Marcus R
2016-02-27
Although hybridization is thought to be relatively rare in animals, the raw genetic material introduced via introgression may play an important role in fueling adaptation and adaptive radiation. The butterfly genus Heliconius is an excellent system to study hybridization and introgression but most studies have focused on closely related species such as H. cydno and H. melpomene. Here we characterize genome-wide patterns of introgression between H. besckei, the only species with a red and yellow banded 'postman' wing pattern in the tiger-striped silvaniform clade, and co-mimetic H. melpomene nanna. We find a pronounced signature of putative introgression from H. melpomene into H. besckei in the genomic region upstream of the gene optix, known to control red wing patterning, suggesting adaptive introgression of wing pattern mimicry between these two distantly related species. At least 39 additional genomic regions show signals of introgression as strong or stronger than this mimicry locus. Gene flow has been on-going, with evidence of gene exchange at multiple time points, and bidirectional, moving from the melpomene to the silvaniform clade and vice versa. The history of gene exchange has also been complex, with contributions from multiple silvaniform species in addition to H. besckei. We also detect a signature of ancient introgression of the entire Z chromosome between the silvaniform and melpomene/cydno clades. Our study provides a genome-wide portrait of introgression between distantly related butterfly species. We further propose a comprehensive and efficient workflow for gene flow identification in genomic data sets.
Christensen, Douglas; Jovic, Marko
2006-05-01
This report describes a molecular biotechnology-based laboratory curriculum developed to accompany an undergraduate genetics course. During the course of a semester, students researched the pathogen, developed a research question, designed experiments, and performed transcriptional analysis of a set of genes that confer virulence to the food-borne pathogen, Listeria monocytogenes. Gene fragments were amplified via PCR and utilized in "mini-arrays," a dot-blot-based format suitable for the simultaneous transcriptional analysis of multiple genes. The project provides exposure to a wide range of molecular techniques and can be easily modified for variations in class size. Data are generated at various steps of the process, allowing for student interpretation, troubleshooting, and assessment opportunities. Copyright © 2006 International Union of Biochemistry and Molecular Biology, Inc.
Three-layered polyplex as a microRNA targeted delivery system for breast cancer gene therapy
NASA Astrophysics Data System (ADS)
Li, Yan; Dai, Yu; Zhang, Xiaojin; Chen, Jihua
2017-07-01
MicroRNAs (miRNAs), small non-coding RNAs, play an important role in modulating cell proliferation, migration, and differentiation. Since miRNAs can regulate multiple cancer-related genes simultaneously, regulating miRNAs could target a set of related oncogenic genes or pathways. Owing to their reduced immune response and low toxicity, miRNAs with small size and low molecular weight have become increasingly promising therapeutic drugs in cancer therapy. However, one of the major challenges of miRNAs-based cancer therapy is to achieve specific, effective, and safe delivery of therapeutic miRNAs into cancer cells. Here we provide a strategy using three-layered polyplex with folic acid as a targeting group to systemically deliver miR-210 into breast cancer cells, which results in breast cancer growth being inhibited.
How exaptations facilitated photosensory evolution: Seeing the light by accident.
Gavelis, Gregory S; Keeling, Patrick J; Leander, Brian S
2017-07-01
Exaptations are adaptations that have undergone a major change in function. By recruiting genes from sources originally unrelated to vision, exaptation has allowed for sudden and critical photosensory innovations, such as lenses, photopigments, and photoreceptors. Here we review new or neglected findings, with an emphasis on unicellular eukaryotes (protists), to illustrate how exaptation has shaped photoreception across the tree of life. Protist phylogeny attests to multiple origins of photoreception, as well as the extreme creativity of evolution. By appropriating genes and even entire organelles from foreign organisms via lateral gene transfer and endosymbiosis, protists have cobbled photoreceptors and eyespots from a diverse set of ingredients. While refinement through natural selection is paramount, exaptation helps illustrate how novelties arise in the first place, and is now shedding light on the origins of photoreception itself. © 2017 WILEY Periodicals, Inc.
Galfalvy, Hanga C; Erraji-Benchekroun, Loubna; Smyrniotopoulos, Peggy; Pavlidis, Paul; Ellis, Steven P; Mann, J John; Sibille, Etienne; Arango, Victoria
2003-01-01
Background Genomic studies of complex tissues pose unique analytical challenges for assessment of data quality, performance of statistical methods used for data extraction, and detection of differentially expressed genes. Ideally, to assess the accuracy of gene expression analysis methods, one needs a set of genes which are known to be differentially expressed in the samples and which can be used as a "gold standard". We introduce the idea of using sex-chromosome genes as an alternative to spiked-in control genes or simulations for assessment of microarray data and analysis methods. Results Expression of sex-chromosome genes was used as a true internal biological control to compare alternate probe-level data extraction algorithms (Microarray Suite 5.0 [MAS5.0], Model Based Expression Index [MBEI] and Robust Multi-array Average [RMA]), to assess microarray data quality and to establish some statistical guidelines for analyzing large-scale gene expression. These approaches were implemented on a large new dataset of human brain samples. RMA-generated gene expression values were markedly less variable and more reliable than MAS5.0 and MBEI-derived values. A statistical technique controlling the false discovery rate was applied to adjust for multiple testing, as an alternative to the Bonferroni method, and showed no evidence of false negative results. Fourteen probesets, representing nine Y- and two X-chromosome linked genes, displayed significant sex differences in brain prefrontal cortex gene expression. Conclusion In this study, we have demonstrated the use of sex genes as true biological internal controls for genomic analysis of complex tissues, and suggested analytical guidelines for testing alternate oligonucleotide microarray data extraction protocols and for adjusting multiple statistical analysis of differentially expressed genes. Our results also provided evidence for sex differences in gene expression in the brain prefrontal cortex, supporting the notion of a putative direct role of sex-chromosome genes in differentiation and maintenance of sexual dimorphism of the central nervous system. Importantly, these analytical approaches are applicable to all microarray studies that include male and female human or animal subjects. PMID:12962547
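The contrast drawn above between false discovery rate control and the Bonferroni method can be illustrated with a short sketch. The abstract does not name the FDR technique that was used; the Benjamini-Hochberg step-up procedure shown here is one common choice and is an assumption, as are the function names and the example p-values.

```python
def bonferroni(pvalues, alpha=0.05):
    """Reject hypothesis i when p_i <= alpha / m (controls family-wise error)."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg(pvalues, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank with
    p_(k) <= (k / m) * alpha (controls the false discovery rate)."""
    m = len(pvalues)
    # Indices of the p-values in ascending order of p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            max_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values for ten probesets (illustrative only).
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
n_bonferroni = sum(bonferroni(example_p))
n_bh = sum(benjamini_hochberg(example_p))
```

On these illustrative values the step-up procedure rejects one more hypothesis than Bonferroni, which is the usual motivation for preferring FDR control when many genes are tested at once.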
Inferring gene and protein interactions using PubMed citations and consensus Bayesian networks
Dalman, Mark; Haddad, Joseph; Duan, Zhong-Hui
2017-01-01
The PubMed database offers an extensive set of publication data that can be useful, yet inherently complex to use without automated computational techniques. Data repositories such as the Genomic Data Commons (GDC) and the Gene Expression Omnibus (GEO) offer experimental data storage and retrieval as well as curated gene expression profiles. Genetic interaction databases, including Reactome and Ingenuity Pathway Analysis, offer pathway and experiment data analysis using data curated from these publications and data repositories. We have created a method to generate and analyze consensus networks, inferring potential gene interactions, using large numbers of Bayesian networks generated by data mining publications in the PubMed database. Through the concept of network resolution, these consensus networks can be tailored to represent possible genetic interactions. We designed a set of experiments to confirm that our method is stable across variation in both sample and topological input sizes. Using gene product interactions from the KEGG pathway database and data mining PubMed publication abstracts, we verify that regardless of the network resolution or the inferred consensus network, our method is capable of inferring meaningful gene interactions through consensus Bayesian network generation with multiple, randomized topological orderings. Our method can not only confirm the existence of currently accepted interactions, but has the potential to hypothesize new ones as well. We show that our method confirms the existence of known gene interactions such as JAK-STAT-PI3K-AKT-mTOR, infers novel gene interactions such as RAS-Bcl-2 and RAS-AKT, and finds significant pathway-pathway interactions between the JAK-STAT signaling and Cardiac Muscle Contraction KEGG pathways. PMID:29049295
Phylogenetic Analysis of the Incidence of lux Gene Horizontal Transfer in Vibrionaceae
Urbanczyk, Henryk; Ast, Jennifer C.; Kaeding, Allison J.; Oliver, James D.; Dunlap, Paul V.
2008-01-01
Horizontal gene transfer (HGT) is thought to occur frequently in bacteria in nature and to play an important role in bacterial evolution, contributing to the formation of new species. To gain insight into the frequency of HGT in Vibrionaceae and its possible impact on speciation, we assessed the incidence of interspecies transfer of the lux genes (luxCDABEG), which encode proteins involved in luminescence, a distinctive phenotype. Three hundred three luminous strains, most of which were recently isolated from nature and which represent 11 Aliivibrio, Photobacterium, and Vibrio species, were screened for incongruence of phylogenies based on a representative housekeeping gene (gyrB or pyrH) and a representative lux gene (luxA). Strains exhibiting incongruence were then subjected to detailed phylogenetic analysis of horizontal transfer by using multiple housekeeping genes (gyrB, recA, and pyrH) and multiple lux genes (luxCDABEG). In nearly all cases, housekeeping gene and lux gene phylogenies were congruent, and there was no instance in which the lux genes of one luminous species had replaced the lux genes of another luminous species. Therefore, the lux genes are predominantly vertically inherited in Vibrionaceae. The few exceptions to this pattern of congruence were as follows: (i) the lux genes of the only known luminous strain of Vibrio vulnificus, VVL1 (ATCC 43382), were evolutionarily closely related to the lux genes of Vibrio harveyi; (ii) the lux genes of two luminous strains of Vibrio chagasii, 21N-12 and SB-52, were closely related to those of V. harveyi and Vibrio splendidus, respectively; (iii) the lux genes of a luminous strain of Photobacterium damselae, BT-6, were closely related to the lux genes of the lux-rib2 operon of Photobacterium leiognathi; and (iv) a strain of the luminous bacterium Photobacterium mandapamensis was found to be merodiploid for the lux genes, and the second set of lux genes was closely related to the lux genes of the lux-rib2 operon of P. leiognathi. In none of these cases of apparent HGT, however, did acquisition of the lux genes correlate with phylogenetic divergence of the recipient strain from other members of its species. The results indicate that horizontal transfer of the lux genes in nature is rare and that horizontal acquisition of the lux genes apparently has not contributed to speciation in recipient taxa. PMID:18359809
Zhang, Bing; Schmoyer, Denise; Kirov, Stefan; Snoddy, Jay
2004-01-01
Background Microarray and other high-throughput technologies are producing large sets of interesting genes that are difficult to analyze directly. Bioinformatics tools are needed to interpret the functional information in the gene sets. Results We have created a web-based tool for data analysis and data visualization for sets of genes called GOTree Machine (GOTM). This tool was originally intended to analyze sets of co-regulated genes identified from microarray analysis but is adaptable for use with other gene sets from other high-throughput analyses. GOTree Machine generates a GOTree, a tree-like structure to navigate the Gene Ontology Directed Acyclic Graph for input gene sets. This system provides user friendly data navigation and visualization. Statistical analysis helps users to identify the most important Gene Ontology categories for the input gene sets and suggests biological areas that warrant further study. GOTree Machine is available online at . Conclusion GOTree Machine has a broad application in functional genomic, proteomic and other high-throughput methods that generate large sets of interesting genes; its primary purpose is to help users sort for interesting patterns in gene sets. PMID:14975175
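The abstract above does not state which statistical test GOTree Machine applies to rank Gene Ontology categories. A common choice for this kind of over-representation analysis is the one-sided hypergeometric test, sketched below under that assumption; the function name and the toy counts are illustrative and are not taken from the tool.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, drawn, overlap):
    """Probability of seeing at least `overlap` annotated genes when `drawn`
    genes are sampled without replacement from `population` genes, of which
    `annotated` carry the GO category of interest."""
    upper = min(annotated, drawn)
    total = comb(population, drawn)
    # math.comb returns 0 when the second argument exceeds the first,
    # so impossible configurations contribute nothing to the sum.
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, drawn - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / total


# Toy numbers: 10 genes in total, 4 in the category, 3 in the input set,
# 2 of those 3 in the category.
p_value = hypergeom_enrichment_p(10, 4, 3, 2)
```

With these toy counts the tail probability is (36 + 4) / 120, i.e. one third, so the overlap would not be considered enriched.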
RGmatch: matching genomic regions to proximal genes in omics data integration.
Furió-Tarí, Pedro; Conesa, Ana; Tarazona, Sonia
2016-11-22
The integrative analysis of multiple genomics data often requires that genome coordinates-based signals have to be associated with proximal genes. The relative location of a genomic region with respect to the gene (gene area) is important for functional data interpretation; hence algorithms that match regions to genes should be able to deliver insight into this information. In this work we review the tools that are publicly available for making region-to-gene associations. We also present a novel method, RGmatch, a flexible and easy-to-use Python tool that computes associations either at the gene, transcript, or exon level, applying a set of rules to annotate each region-gene association with the region location within the gene. RGmatch can be applied to any organism as long as genome annotation is available. Furthermore, we qualitatively and quantitatively compare RGmatch to other tools. RGmatch simplifies the association of a genomic region with its closest gene. At the same time, it is a powerful tool because the rules used to annotate these associations are very easy to modify according to the researcher's specific interests. Some important differences between RGmatch and other similar tools already in existence are RGmatch's flexibility, its wide range of user options, compatibility with any annotatable organism, and its comprehensive and user-friendly output.
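The basic operation described above, associating a genomic region with its closest gene, can be sketched in a few lines. This is a generic nearest-interval lookup and not RGmatch's own rule set or interface; the coordinate convention (inclusive start and end on one chromosome), the function names, and the gene names are all assumptions made for illustration.

```python
def interval_distance(region, gene):
    """Gap in base pairs between two (start, end) intervals; 0 if they overlap."""
    r_start, r_end = region
    g_start, g_end = gene
    if r_end < g_start:
        return g_start - r_end
    if g_end < r_start:
        return r_start - g_end
    return 0


def closest_gene(region, genes):
    """Return the name of the gene whose interval lies nearest to `region`.

    `genes` maps a gene name to its (start, end) coordinates.
    """
    return min(genes, key=lambda name: interval_distance(region, genes[name]))


# Hypothetical annotation for a single chromosome.
annotation = {"geneA": (100, 200), "geneB": (500, 600)}
nearest = closest_gene((220, 240), annotation)
```

A fuller implementation would also record where the region falls relative to the gene (for example upstream, within an exon, or downstream), which is the kind of annotation the abstract attributes to the tool.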
Kang, Hye-Min; Lee, Jin-Sol; Kim, Min-Sub; Lee, Young Hwan; Jung, Jee-Hyun; Hagiwara, Atsushi; Zhou, Bingsheng; Lee, Jae-Seong; Jeong, Chang-Bum
2018-05-30
Autophagy originated from the common ancestor of all life forms, and its function is highly conserved from yeast to humans. Autophagy plays a key role in various fundamental biological processes including defense, and has developed through serial interactions of multiple gene sets referred to as autophagy-related (Atg) genes. Despite their significance in metazoan life and evolution, few studies have been conducted to identify these genes in aquatic invertebrates. In this study, we identified whole Atg genes in four Brachionus rotifer spp., namely B. calyciflorus, B. koreanus, B. plicatilis, and B. rotundiformis, through searches of their entire genomes; and we annotated them according to the yeast nomenclature. Twenty-four genes orthologous to yeast genes were present in all of the Brachionus spp. while three additional gene duplicates were identified in the genome of B. koreanus, indicating that these genes had diversified during the speciation. Also, their transcriptional responses to cadmium exposure indicated regulation by cadmium-induced oxidative-stress-related signaling pathways. This study provides valuable information on 99 conserved Atg genes involved in autophagosome formation in Brachionus spp., with transcriptional modulation in response to cadmium, in the context of the role of autophagy in the damage response. Copyright © 2018 Elsevier B.V. All rights reserved.
The transcriptomic fingerprint of glucoamylase over-expression in Aspergillus niger
2012-01-01
Background Filamentous fungi such as Aspergillus niger are well known for their exceptionally high capacity for secretion of proteins, organic acids, and secondary metabolites and they are therefore used in biotechnology as versatile microbial production platforms. However, system-wide insights into their metabolic and secretory capacities are sparse and rational strain improvement approaches are therefore limited. In order to gain a genome-wide view on the transcriptional regulation of the protein secretory pathway of A. niger, we investigated the transcriptome of A. niger when it was forced to overexpress the glaA gene (encoding glucoamylase, GlaA) and secrete GlaA at a high level. Results An A. niger wild-type strain and a GlaA over-expressing strain, containing multiple copies of the glaA gene, were cultivated under maltose-limited chemostat conditions (specific growth rate 0.1 h⁻¹). Elevated glaA mRNA and extracellular GlaA levels in the over-expressing strain were accompanied by elevated transcript levels from 772 genes and lowered transcript levels from 815 genes when compared to the wild-type strain. Using GO term enrichment analysis, four higher-order categories were identified in the up-regulated gene set: i) endoplasmic reticulum (ER) membrane translocation, ii) protein glycosylation, iii) vesicle transport, and iv) ion homeostasis. Among these, about 130 genes had predicted functions for the passage of proteins through the ER and those genes included target genes of the HacA transcription factor that mediates the unfolded protein response (UPR), e.g. bipA, clxA, prpA, tigA and pdiA. In order to identify those genes that are important for high-level secretion of proteins by A. niger, we compared the transcriptome of the GlaA overexpression strain of A. niger with six other relevant transcriptomes of A. niger. Overall, 40 genes were found to have either elevated (from 36 genes) or lowered (from 4 genes) transcript levels under all conditions that were examined, thus defining the core set of genes important for ensuring high protein traffic through the secretory pathway. Conclusion We have defined the A. niger genes that respond to elevated secretion of GlaA and, furthermore, we have defined a core set of genes that appear to be involved more generally in the intensified traffic of proteins through the secretory pathway of A. niger. The consistent up-regulation of a gene encoding the acetyl-coenzyme A transporter suggests a possible role for transient acetylation to ensure correct folding of secreted proteins. PMID:23237452
Funwei, Roland I; Thomas, Bolaji N; Falade, Catherine O; Ojurongbe, Olusola
2018-01-02
Nigeria carries a high burden of malaria which makes continuous surveillance for current information on genetic diversity imperative. In this study, the merozoite surface proteins (msp-1, msp-2) and glutamate-rich protein (glurp) of Plasmodium falciparum collected from two communities representing rural and urban settings in Ibadan, southwestern Nigeria were analysed. A total of 511 febrile children, aged 3-59 months, whose parents/guardians provided informed consent, were recruited into the study. Capillary blood was obtained for malaria rapid diagnostic test, thick blood smears for parasite count and blood spots on filter paper for molecular analysis. Three-hundred and nine samples were successfully genotyped for msp-1, msp-2 and glurp genes. The allelic distribution of the three genes was not significantly different in the rural and urban communities. R033 and 3D7 were the most prevalent alleles in both rural and urban communities for msp-1 and msp-2, respectively. Eleven of glurp RII region genotypes, coded I-XII, with sizes ranging from 500 to 1100 base pairs were detected in the rural setting. Genotype XI (1000-1050 bp) had the highest prevalence of 41.5 and 38.5% in rural and urban settings, respectively. Overall, 82.1 and 70.0% of samples had multiclonal infection with msp-1 gene resulting in a mean multiplicity of infection (MOI) of 2.8 and 2.6 for rural and urban samples, respectively. Msp-1 and msp-2 genes displayed higher levels of diversity and higher MOI rates than the glurp gene. Significant genetic diversity was observed between rural and urban parasite populations in Ibadan, southwestern Nigeria. The results of this study show that malaria transmission intensity in these regions is still high. No significant difference was observed between rural and urban settings, except for a completely different msp-1 allele, compared to previous reports, thereby confirming the changing face of malaria transmission in these communities. 
This study provides important baseline information required for monitoring the impact of malaria elimination efforts in this region and data points useful in revising current protocols.
Lane, Courtney E; Benton, Michael G
2015-12-01
A colony PCR-based assay was developed to rapidly determine if a cyanobacterium of interest contains the requisite genetic material, the PHA synthase PhaC subunit, to produce polyhydroxyalkanoates (PHAs). The test is both high throughput and robust, owing to an extensive sequence analysis of cyanobacterial PHA synthases. The assay uses a single detection primer set and a single reaction condition across multiple cyanobacteria strains to produce an easily detectable positive result: amplification via PCR as evidenced by a band in electrophoresis. In order to demonstrate the potential of the presence of phaC as an indicator of a cyanobacterium's PHA accumulation capabilities, the ability to produce PHA was assessed for five cyanobacteria with a traditional in vivo PHA granule staining using an oxazine dye. The confirmed in vivo staining results were then compared to the PCR-based assay results and found to be in agreement. The colony PCR assay was capable of successfully detecting the phaC gene in all six of the diverse cyanobacteria tested which possessed the gene, while exhibiting no undesired product formation across the nine total cyanobacteria strains tested. The colony PCR quick prep provides sufficient usable DNA template such that this assay could be readily expanded to assess multiple genes of interest simultaneously. Copyright © 2015 Elsevier Ltd. All rights reserved.
Berenger, Byron M; Berry, Chrystal; Peterson, Trevor; Fach, Patrick; Delannoy, Sabine; Li, Vincent; Tschetter, Lorelee; Nadon, Celine; Honish, Lance; Louie, Marie; Chui, Linda
2015-01-01
A standardised method for determining Escherichia coli O157:H7 strain relatedness using whole genome sequencing or virulence gene profiling is not yet established. We sought to assess the capacity of either high-throughput polymerase chain reaction (PCR) of 49 virulence genes, core-genome single nucleotide variants (SNVs) or k-mer clustering to discriminate between outbreak-associated and sporadic E. coli O157:H7 isolates. Three outbreaks and multiple sporadic isolates from the province of Alberta, Canada were included in the study. Two of the outbreaks occurred concurrently in 2014 and one occurred in 2012. Pulsed-field gel electrophoresis (PFGE) and multilocus variable-number tandem repeat analysis (MLVA) were employed as comparator typing methods. The virulence gene profiles of isolates from the 2012 and 2014 Alberta outbreak events and contemporary sporadic isolates were mostly identical; therefore the set of virulence genes chosen in this study was not discriminatory enough to distinguish between outbreak clusters. Concordant with PFGE and MLVA results, core genome SNV and k-mer phylogenies clustered isolates from the 2012 and 2014 outbreaks as distinct events. k-mer phylogenies demonstrated increased discriminatory power compared with core SNV phylogenies. Prior to the widespread implementation of whole genome sequencing for routine public health use, issues surrounding cost, technical expertise, software standardisation, and data sharing/comparisons must be addressed.
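The k-mer comparison mentioned above can be illustrated with a minimal sketch. The abstract does not describe the software or distance measure used in the study; the Jaccard distance on k-mer sets shown here is one simple, widely used option and is an assumption, as are the function names, the value of k, and the toy sequences.

```python
def kmers(sequence, k):
    """Set of all substrings of length k in `sequence`."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard_distance(seq_a, seq_b, k=3):
    """1 - |A ∩ B| / |A ∪ B| for the k-mer sets A and B of two sequences."""
    set_a, set_b = kmers(seq_a, k), kmers(seq_b, k)
    union = set_a | set_b
    if not union:
        # Both sequences are shorter than k: treat them as indistinguishable.
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


# Toy sequences differing at one position (illustrative only).
distance = jaccard_distance("ACGTACGT", "ACGTTCGT", k=3)
```

A pairwise distance matrix built this way can be passed to any standard clustering or tree-building method; real genome comparisons use much larger k and typically canonicalise reverse complements, which this sketch omits.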
A vector space model approach to identify genetically related diseases.
Sarkar, Indra Neil
2012-01-01
The relationship between diseases and their causative genes can be complex, especially in the case of polygenic diseases. Further exacerbating the challenges in their study is that many genes may be causally related to multiple diseases. This study explored the relationship between diseases through the adaptation of an approach pioneered in the context of information retrieval: vector space models. A vector space model approach was developed that bridges gene-disease knowledge inferred across three knowledge bases: Online Mendelian Inheritance in Man, GenBank, and Medline. The approach was then used to identify potentially related diseases for two target diseases: Alzheimer disease and Prader-Willi Syndrome. For both Alzheimer disease and Prader-Willi Syndrome, a set of plausible diseases was identified that may warrant further exploration. This study furthers seminal work by Swanson et al. that demonstrated the potential for mining literature for putative correlations. Using a vector space modeling approach, information from both biomedical literature and genomic resources (like GenBank) can be combined towards identification of putative correlations of interest. To this end, the relevance of the predicted diseases of interest in this study was validated against supporting literature. The results of this study suggest that a vector space model approach may be a useful means to identify potential relationships between complex diseases, and thereby enable the coordination of gene-based findings across multiple complex diseases.
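In a vector space model, each item is represented as a vector of term weights and relatedness is scored by the angle between vectors. The abstract does not give the weighting scheme or similarity measure used in the study; the sketch below assumes diseases are represented as sparse gene-weight vectors compared by cosine similarity, and all disease names, gene names, and weights are placeholders.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors given as
    {gene: weight} dictionaries; 0.0 if either vector is all zeros."""
    dot = sum(weight * v.get(gene, 0.0) for gene, weight in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Placeholder disease vectors (not real gene-disease associations).
disease_x = {"GENE_A": 1.0, "GENE_B": 1.0}
disease_y = {"GENE_B": 1.0, "GENE_C": 1.0}
disease_z = {"GENE_D": 2.0}

score_xy = cosine_similarity(disease_x, disease_y)  # one shared gene
score_xz = cosine_similarity(disease_x, disease_z)  # no shared genes
```

Ranking all candidate diseases by their score against a target disease gives a list of potentially related diseases, which then has to be checked against the literature as the abstract describes.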
GARNET--gene set analysis with exploration of annotation relations.
Rho, Kyoohyoung; Kim, Bumjin; Jang, Youngjun; Lee, Sanghyun; Bae, Taejeong; Seo, Jihae; Seo, Chaehwa; Lee, Jihyun; Kang, Hyunjung; Yu, Ungsik; Kim, Sunghoon; Lee, Sanghyuk; Kim, Wan Kyu
2011-02-15
Gene set analysis is a powerful method of deducing biological meaning for an a priori defined set of genes. Numerous tools have been developed to test statistical enrichment or depletion in specific pathways or gene ontology (GO) terms. Major difficulties towards biological interpretation are integrating diverse types of annotation categories and exploring the relationships between annotation terms of similar information. GARNET (Gene Annotation Relationship NEtwork Tools) is an integrative platform for gene set analysis with many novel features. It includes tools for retrieval of genes from annotation database, statistical analysis & visualization of annotation relationships, and managing gene sets. In an effort to allow access to a full spectrum of amassed biological knowledge, we have integrated a variety of annotation data that include the GO, domain, disease, drug, chromosomal location, and custom-defined annotations. Diverse types of molecular networks (pathways, transcription and microRNA regulations, protein-protein interaction) are also included. The pair-wise relationship between annotation gene sets was calculated using kappa statistics. GARNET consists of three modules--gene set manager, gene set analysis and gene set retrieval, which are tightly integrated to provide virtually automatic analysis for gene sets. A dedicated viewer for annotation network has been developed to facilitate exploration of the related annotations. GARNET (gene annotation relationship network tools) is an integrative platform for diverse types of gene set analysis, where complex relationships among gene annotations can be easily explored with an intuitive network visualization tool (http://garnet.isysbio.org/ or http://ercsb.ewha.ac.kr/garnet/).
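The pair-wise kappa statistic mentioned above can be sketched as Cohen's kappa on gene membership: each gene in a background universe is either in or out of each annotation gene set, and kappa measures agreement beyond chance. How GARNET defines the background and handles edge cases is not stated in the abstract, so the universe argument, the function name, and the degenerate-case convention below are assumptions.

```python
def cohen_kappa(set_a, set_b, universe):
    """Cohen's kappa for the membership of every gene in `universe`
    in two annotation gene sets."""
    n = len(universe)
    a = set_a & universe
    b = set_b & universe
    both = len(a & b)
    only_a = len(a - b)
    only_b = len(b - a)
    neither = n - both - only_a - only_b
    observed = (both + neither) / n          # observed agreement
    frac_a = (both + only_a) / n             # fraction of genes in set A
    frac_b = (both + only_b) / n             # fraction of genes in set B
    expected = frac_a * frac_b + (1 - frac_a) * (1 - frac_b)
    if expected == 1:
        # Kappa is undefined here (both sets empty or both equal to the
        # universe); returning 0.0 is a convention chosen for this sketch.
        return 0.0
    return (observed - expected) / (1 - expected)


# Placeholder gene identifiers g0..g9.
background = {f"g{i}" for i in range(10)}
term_1 = {"g0", "g1", "g2", "g3"}
term_2 = {"g2", "g3", "g4", "g5"}
kappa = cohen_kappa(term_1, term_2, background)
```

Annotation terms whose kappa exceeds a chosen threshold can be linked as edges in an annotation network, which matches the kind of relationship exploration the abstract describes.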
cMapper: gene-centric connectivity mapper for EBI-RDF platform.
Shoaib, Muhammad; Ansari, Adnan Ahmad; Ahn, Sung-Min
2017-01-15
In this era of biological big data, data integration has become a common task and a challenge for biologists. The Resource Description Framework (RDF) was developed to enable interoperability of heterogeneous datasets. The EBI-RDF platform enables an efficient data integration of six independent biological databases using RDF technologies and shared ontologies. However, to take advantage of this platform, biologists need to be familiar with RDF technologies and SPARQL query language. To overcome this practical limitation of the EBI-RDF platform, we developed cMapper, a web-based tool that enables biologists to search the EBI-RDF databases in a gene-centric manner without a thorough knowledge of RDF and SPARQL. cMapper allows biologists to search data entities in the EBI-RDF platform that are connected to genes or small molecules of interest in multiple biological contexts. The input to cMapper consists of a set of genes or small molecules, and the output are data entities in six independent EBI-RDF databases connected with the given genes or small molecules in the user's query. cMapper provides output to users in the form of a graph in which nodes represent data entities and the edges represent connections between data entities and inputted set of genes or small molecules. Furthermore, users can apply filters based on database, taxonomy, organ and pathways in order to focus on a core connectivity graph of their interest. Data entities from multiple databases are differentiated based on background colors. cMapper also enables users to investigate shared connections between genes or small molecules of interest. Users can view the output graph on a web browser or download it in either GraphML or JSON formats. cMapper is available as a web application with an integrated MySQL database. The web application was developed using Java and deployed on Tomcat server. We developed the user interface using HTML5, JQuery and the Cytoscape Graph API. 
cMapper can be accessed at http://cmapper.ewostech.net. Readers can download the development manual from the website http://cmapper.ewostech.net/docs/cMapperDocumentation.pdf. Source code is available at https://github.com/muhammadshoaib/cmapper. Contact: smahn@gachon.ac.kr. Supplementary information: Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Persistence of Multiple Genetic Lineages within Intrahost Populations of Ross River Virus
Liu, Wen J.; Rourke, Michelle F.; Holmes, Edward C.; Aaskov, John G.
2011-01-01
We examined the structure and extent of genetic diversity in intrahost populations of Ross River virus (RRV) in samples from six human patients, focusing on the nonstructural (nsP3) and structural (E2) protein genes. Strikingly, although the samples were collected from contrasting ecological settings 3,000 kilometers apart in Australia, we observed multiple viral lineages in four of the six individuals, which is indicative of widespread mixed infections. In addition, a comparison with previously published RRV sequences revealed that these distinct lineages have been in circulation for at least 5 years, and we were able to document their long-term persistence over extensive geographical distances. PMID:21430052
CFH Variants Affect Structural and Functional Brain Changes and Genetic Risk of Alzheimer's Disease.
Zhang, Deng-Feng; Li, Jin; Wu, Huan; Cui, Yue; Bi, Rui; Zhou, He-Jiang; Wang, Hui-Zhen; Zhang, Chen; Wang, Dong; Kong, Qing-Peng; Li, Tao; Fang, Yiru; Jiang, Tianzi; Yao, Yong-Gang
2016-03-01
The immune response is highly active in Alzheimer's disease (AD). Identification of genetic risk contributed by immune genes to AD may provide essential insight for the prognosis, diagnosis, and treatment of this neurodegenerative disease. In this study, we screened, in a Chinese cohort, the top AD-related immune genes identified in Europeans, followed by a multiple-stage study focusing on the Complement Factor H (CFH) gene. Effects of the risk SNPs on AD-related neuroimaging endophenotypes were evaluated through magnetic resonance imaging scans, and the effects on AD cerebrospinal fluid (CSF) biomarkers and CFH expression changes were measured in aged and AD brain tissues and AD cellular models. Our results showed that the AD-associated top immune genes reported in Europeans (CR1, CD33, CLU, and TREML2) have weak effects in Chinese, whereas CFH showed strong effects. In particular, rs1061170 (P_meta = 5.0 × 10^-4) and rs800292 (P_meta = 1.3 × 10^-5) showed robust associations with AD, which were confirmed in multiple world-wide sample sets (4317 cases and 16 795 controls). Rs1061170 (P = 2.5 × 10^-3) and rs800292 (P = 4.7 × 10^-4) risk-allele carriers have an increased entorhinal thickness at a young age and a higher atrophy rate as the disease progresses. Rs800292 risk-allele carriers have higher CSF tau and Aβ levels and severe cognitive decline. CFH expression level, which was affected by the risk alleles, was increased in AD brains and cellular models. These comprehensive analyses suggested that CFH is an important immune factor in AD and affects multiple pathological changes in early life and during disease progression.
Selection of higher order regression models in the analysis of multi-factorial transcription data.
Prazeres da Costa, Olivia; Hoffman, Arthur; Rey, Johannes W; Mansmann, Ulrich; Buch, Thorsten; Tresch, Achim
2014-01-01
Many studies examine gene expression data that has been obtained under the influence of multiple factors, such as genetic background, environmental conditions, or exposure to diseases. The interplay of multiple factors may lead to effect modification and confounding. Higher order linear regression models can account for these effects. We present a new methodology for linear model selection and apply it to microarray data of bone marrow-derived macrophages. This experiment investigates the influence of three variable factors: the genetic background of the mice from which the macrophages were obtained, Yersinia enterocolitica infection (two strains, and a mock control), and treatment/non-treatment with interferon-γ. We set up four different linear regression models in a hierarchical order. We introduce the eruption plot as a new practical tool for model selection complementary to global testing. It visually compares the size and significance of effect estimates between two nested models. Using this methodology we were able to select the most appropriate model by keeping only relevant factors showing additional explanatory power. Application to experimental data allowed us to qualify the interaction of factors as either neutral (no interaction), alleviating (co-occurring effects are weaker than expected from the single effects), or aggravating (stronger than expected). We find a biologically meaningful gene cluster of putative C2TA target genes that appear to be co-regulated with MHC class II genes. We introduced the eruption plot as a tool for visual model comparison to identify relevant higher order interactions in the analysis of expression data obtained under the influence of multiple factors. We conclude that model selection in higher order linear regression models should generally be performed for the analysis of multi-factorial microarray data.
Estimation of gene induction enables a relevance-based ranking of gene sets.
Bartholomé, Kilian; Kreutz, Clemens; Timmer, Jens
2009-07-01
In order to handle and interpret the vast amounts of data produced by microarray experiments, the analysis of sets of genes with a common biological functionality has been shown to be advantageous compared to single gene analyses. Some statistical methods have been proposed to analyse the differential gene expression of gene sets in microarray experiments. However, most of these methods either require threshold values to be chosen for the analysis, or they need some reference set for the determination of significance. We present a method that estimates the number of differentially expressed genes in a gene set without requiring a threshold value for significance of genes. The method is self-contained (i.e., it does not require a reference set for comparison). In contrast to other methods which are focused on significance, our approach emphasizes the relevance of the regulation of gene sets. The presented method measures the degree of regulation of a gene set and is a useful tool to compare the induction of different gene sets and place the results of microarray experiments into the biological context. An R package is available.
Wang, Guo-Ming; Yin, Hao; Qiao, Xin; Tan, Xu; Gu, Chao; Wang, Bao-Hua; Cheng, Rui; Wang, Ying-Zhen; Zhang, Shao-Ling
2016-12-01
The F-box gene family, one of the largest gene families in plants, plays crucial roles in regulating plant development, reproduction, cellular protein degradation and responses to biotic and abiotic stresses. However, a comprehensive analysis of the F-box gene family in pear (Pyrus bretschneideri Rehd.) and other Rosaceae species has not yet been reported. Herein, we identified a total of 226 full-length F-box genes in pear for the first time. These genes were further divided into various subgroups based on specific domains and phylogenetic analysis. Intriguingly, we observed that whole-genome duplication and dispersed duplication have made a major contribution to F-box family expansion. Furthermore, the dynamic evolution of different modes of gene duplication was dissected. Interestingly, we found that dispersed and tandem duplicates have been evolving at a high rate. In addition, we found that F-box genes exhibited functional specificity based on GO analysis, and most of the F-box genes were significantly enriched in the protein binding (GO: 0005515) term, supporting the view that F-box genes might play a critical role in gene regulation in pear. Transcriptome and digital expression profiles revealed that F-box genes are involved in the development of multiple pear tissues. Overall, these results will set the stage for elaborating the biological role of F-box genes in pear and other plants. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
Ghosh, Sujoy; Vivar, Juan; Nelson, Christopher P; Willenborg, Christina; Segrè, Ayellet V; Mäkinen, Ville-Petteri; Nikpay, Majid; Erdmann, Jeannette; Blankenberg, Stefan; O'Donnell, Christopher; März, Winfried; Laaksonen, Reijo; Stewart, Alexandre F R; Epstein, Stephen E; Shah, Svati H; Granger, Christopher B; Hazen, Stanley L; Kathiresan, Sekar; Reilly, Muredach P; Yang, Xia; Quertermous, Thomas; Samani, Nilesh J; Schunkert, Heribert; Assimes, Themistocles L; McPherson, Ruth
2015-07-01
Genome-wide association studies have identified multiple genetic variants affecting the risk of coronary artery disease (CAD). However, individually these explain only a small fraction of the heritability of CAD and for most, the causal biological mechanisms remain unclear. We sought to obtain further insights into potential causal processes of CAD by integrating large-scale GWA data with expertly curated databases of core human pathways and functional networks. Using pathways (gene sets) from Reactome, we carried out a 2-stage gene set enrichment analysis strategy. From a meta-analyzed discovery cohort of 7 CAD genome-wide association study data sets (9889 cases/11 089 controls), nominally significant gene sets were tested for replication in a meta-analysis of 9 additional studies (15 502 cases/55 730 controls) from the Coronary ARtery DIsease Genome wide Replication and Meta-analysis (CARDIoGRAM) Consortium. A total of 32 of 639 Reactome pathways tested showed convincing association with CAD (replication P<0.05). These pathways resided in 9 of 21 core biological processes represented in Reactome, and included pathways relevant to extracellular matrix (ECM) integrity, innate immunity, axon guidance, and signaling by PDGF (platelet-derived growth factor), NOTCH, and the transforming growth factor-β/SMAD receptor complex. Many of these pathways had strengths of association comparable to those observed in lipid transport pathways. Network analysis of unique genes within the replicated pathways further revealed several interconnected functional and topologically interacting modules representing novel associations (eg, semaphorin-regulated axonal guidance pathway) besides confirming known processes (lipid metabolism). The connectivity in the observed networks was statistically significant compared with random networks (P<0.001).
Network centrality analysis (degree and betweenness) further identified genes (eg, NCAM1, FYN, FURIN, etc) likely to play critical roles in the maintenance and functioning of several of the replicated pathways. These findings provide novel insights into how genetic variation, interpreted in the context of biological processes and functional interactions among genes, may help define the genetic architecture of CAD. © 2015 American Heart Association, Inc.
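As an illustration of the degree centrality mentioned in the abstract above, the following is a minimal sketch, not the authors' pipeline. The function name `degree_centrality`, the placeholder node names, and the toy edge list are all assumptions introduced here; betweenness centrality is not covered.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Return normalized degree centrality for an undirected graph given as edge pairs."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        neighbors[u].add(v)
        neighbors[v].add(u)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    # Divide by the largest possible number of neighbors, n - 1.
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


# Hypothetical toy network with placeholder names (not data from the study).
toy_edges = [
    ("geneA", "geneB"),
    ("geneA", "geneC"),
    ("geneA", "geneD"),
    ("geneB", "geneC"),
]
centrality = degree_centrality(toy_edges)
# Nodes ordered from most to least connected.
ranked = sorted(centrality, key=centrality.get, reverse=True)
```

In a sketch like this, the nodes at the top of `ranked` would be the candidates for hub genes; a real analysis would likely use an established graph library and a curated interaction network.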
Functional cohesion of gene sets determined by latent semantic indexing of PubMed abstracts.
Xu, Lijing; Furlotte, Nicholas; Lin, Yunyue; Heinrich, Kevin; Berry, Michael W; George, Ebenezer O; Homayouni, Ramin
2011-04-14
High-throughput genomic technologies enable researchers to identify genes that are co-regulated with respect to specific experimental conditions. Numerous statistical approaches have been developed to identify differentially expressed genes. Because each approach can produce distinct gene sets, it is difficult for biologists to determine which statistical approach yields biologically relevant gene sets and is appropriate for their study. To address this issue, we implemented Latent Semantic Indexing (LSI) to determine the functional coherence of gene sets. An LSI model was built using over 1 million Medline abstracts for over 20,000 mouse and human genes annotated in Entrez Gene. The gene-to-gene LSI-derived similarities were used to calculate a literature cohesion p-value (LPv) for a given gene set using a Fisher's exact test. We tested this method against genes in more than 6,000 functional pathways annotated in Gene Ontology (GO) and found that approximately 75% of gene sets in GO biological process category and 90% of the gene sets in GO molecular function and cellular component categories were functionally cohesive (LPv<0.05). These results indicate that the LPv methodology is both robust and accurate. Application of this method to previously published microarray datasets demonstrated that LPv can be helpful in selecting the appropriate feature extraction methods. To enable real-time calculation of LPv for mouse or human gene sets, we developed a web tool called Gene-set Cohesion Analysis Tool (GCAT). GCAT can complement other gene set enrichment approaches by determining the overall functional cohesion of data sets, taking into account both explicit and implicit gene interactions reported in the biomedical literature. GCAT is freely available at http://binf1.memphis.edu/gcat.
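The abstract above states that the literature cohesion p-value is obtained with a Fisher's exact test but does not spell out the contingency table. The sketch below only shows how a one-sided Fisher's exact test works on a generic 2x2 table. The function name and the interpretation of the cells (for example, gene pairs above or below a similarity cutoff, inside or outside the gene set) are assumptions, not the GCAT implementation.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1 = a + b
    col1 = a + c
    total = a + b + c + d
    denom = comb(total, col1)
    upper = min(row1, col1)
    p = 0.0
    for k in range(a, upper + 1):
        # math.comb returns 0 when more items are chosen than are available,
        # so impossible tables contribute nothing to the sum.
        p += comb(row1, k) * comb(total - row1, col1 - k) / denom
    return p


# Hypothetical counts: all 3 "in-set" pairs are high-similarity and
# all 3 "background" pairs are not.
p_enriched = fisher_exact_greater(3, 0, 0, 3)
```

With these toy counts only one table is as extreme as the observed one, so the p-value is 1 divided by the number of ways to choose 3 of 6, which is 1/20.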
A sigma factor toolbox for orthogonal gene expression in Escherichia coli
Van Brempt, Maarten; Van Nerom, Katleen; Van Hove, Bob; Maertens, Jo; De Mey, Marjan; Charlier, Daniel
2018-01-01
Synthetic genetic sensors and circuits enable programmable control over timing and conditions of gene expression and, as a result, are increasingly incorporated into the control of complex and multi-gene pathways. Size and complexity of genetic circuits are growing, but stay limited by a shortage of regulatory parts that can be used without interference. Therefore, orthogonal expression and regulation systems are needed to minimize undesired crosstalk and allow for dynamic control of separate modules. This work presents a set of orthogonal expression systems for use in Escherichia coli based on heterologous sigma factors from Bacillus subtilis that recognize specific promoter sequences. Up to four of the analyzed sigma factors can be combined to function orthogonally between each other and toward the host. Additionally, the toolbox is expanded by creating promoter libraries for three sigma factors without loss of their orthogonal nature. As this set covers a wide range of transcription initiation frequencies, it enables tuning of multiple outputs of the circuit in response to different sensory signals in an orthogonal manner. This sigma factor toolbox constitutes an interesting expansion of the synthetic biology toolbox and may contribute to the assembly of more complex synthetic genetic systems in the future. PMID:29361130
Pathway Analysis in Attention Deficit Hyperactivity Disorder: An Ensemble Approach
Mooney, Michael A.; McWeeney, Shannon K.; Faraone, Stephen V.; Hinney, Anke; Hebebrand, Johannes; Nigg, Joel T.; Wilmot, Beth
2016-01-01
Despite a wealth of evidence for the role of genetics in attention deficit hyperactivity disorder (ADHD), specific and definitive genetic mechanisms have not been identified. Pathway analyses, a subset of gene-set analyses, extend the knowledge gained from genome-wide association studies (GWAS) by providing functional context for genetic associations. However, there are numerous methods for association testing of gene sets and no real consensus regarding the best approach. The present study applied six pathway analysis methods to identify pathways associated with ADHD in two GWAS datasets from the Psychiatric Genomics Consortium. Methods that utilize genotypes to model pathway-level effects identified more replicable pathway associations than methods using summary statistics. In addition, pathways implicated by more than one method were significantly more likely to replicate. A number of brain-relevant pathways, such as RhoA signaling, glycosaminoglycan biosynthesis, fibroblast growth factor receptor activity, and pathways containing potassium channel genes, were nominally significant by multiple methods in both datasets. These results support previous hypotheses about the role of regulation of neurotransmitter release, neurite outgrowth and axon guidance in contributing to the ADHD phenotype and suggest the value of cross-method convergence in evaluating pathway analysis results. PMID:27004716
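The idea of cross-method convergence described in the abstract above can be sketched as a simple tally of how many methods implicate each pathway. This is an illustrative sketch only: the function name `pathways_by_support`, the method labels, and the pathway labels are placeholders, and the study's actual significance criteria are not reproduced.

```python
from collections import Counter


def pathways_by_support(results, min_methods=2):
    """Return pathways reported by at least `min_methods` distinct methods.

    `results` maps a method name to the pathways that method flagged.
    """
    counts = Counter()
    for pathways in results.values():
        # Use a set so that a method listing a pathway twice counts once.
        counts.update(set(pathways))
    return sorted(p for p, c in counts.items() if c >= min_methods)


# Hypothetical output of three pathway-analysis methods (placeholder labels).
toy_results = {
    "method_1": ["pathway_A", "pathway_B"],
    "method_2": ["pathway_B", "pathway_C", "pathway_B"],
    "method_3": ["pathway_A", "pathway_B"],
}
convergent = pathways_by_support(toy_results)
```

Raising `min_methods` tightens the filter; with the toy input, requiring all three methods would leave only `pathway_B`.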
The Gene Set Builder: collation, curation, and distribution of sets of genes
Yusuf, Dimas; Lim, Jonathan S; Wasserman, Wyeth W
2005-01-01
Background: In bioinformatics and genomics, there are many applications designed to investigate the common properties for a set of genes. Often, these multi-gene analysis tools attempt to reveal sequential, functional, and expressional ties. However, while tremendous effort has been invested in developing tools that can analyze a set of genes, minimal effort has been invested in developing tools that can help researchers compile, store, and annotate gene sets in the first place. As a result, the process of making or accessing a set often involves tedious and time consuming steps such as finding identifiers for each individual gene. These steps are often repeated extensively to shift from one identifier type to another; or to recreate a published set. In this paper, we present a simple online tool which – with the help of the gene catalogs Ensembl and GeneLynx – can help researchers build and annotate sets of genes quickly and easily.
Description: The Gene Set Builder is a database-driven, web-based tool designed to help researchers compile, store, export, and share sets of genes. This application supports the 17 eukaryotic genomes found in version 32 of the Ensembl database, which includes species from yeast to human. User-created information such as sets and customized annotations are stored to facilitate easy access. Gene sets stored in the system can be "exported" in a variety of output formats – as lists of identifiers, in tables, or as sequences. In addition, gene sets can be "shared" with specific users to facilitate collaborations or fully released to provide access to published results. The application also features a Perl API (Application Programming Interface) for direct connectivity to custom analysis tools. A downloadable Quick Reference guide and an online tutorial are available to help new users learn its functionalities.
Conclusion: The Gene Set Builder is an Ensembl-facilitated online tool designed to help researchers compile and manage sets of genes in a user-friendly environment. The application can be accessed via . PMID:16371163
Feltus, F Alex
2014-06-01
Understanding the control of any trait optimally requires the detection of causal genes, gene interaction, and mechanism of action to discover and model the biochemical pathways underlying the expressed phenotype. Functional genomics techniques, including RNA expression profiling via microarray and high-throughput DNA sequencing, allow for the precise genome localization of biological information. Powerful genetic approaches, including quantitative trait locus (QTL) and genome-wide association study mapping, link phenotype with genome positions, yet genetics is less precise in localizing the relevant mechanistic information encoded in DNA. The coupling of salient functional genomic signals with genetically mapped positions is an appealing approach to discover meaningful gene-phenotype relationships. Techniques used to define this genetic-genomic convergence comprise the field of systems genetics. This short review will address an application of systems genetics where RNA profiles are associated with genetically mapped genome positions of individual genes (eQTL mapping) or as gene sets (co-expression network modules). Both approaches can be applied for knowledge independent selection of candidate genes (and possible control mechanisms) underlying complex traits where multiple, likely unlinked, genomic regions might control specific complex traits. Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.
Rodríguez-Esteban, Gustavo; González-Sastre, Alejandro; Rojo-Laguna, José Ignacio; Saló, Emili; Abril, Josep F
2015-05-08
The freshwater planarian Schmidtea mediterranea is recognised as a valuable model for research into adult stem cells and regeneration. With the advent of high-throughput sequencing technologies, it has become feasible to undertake detailed transcriptional analysis of its unique stem cell population, the neoblasts. Nonetheless, a reliable reference for this type of study is still lacking. Taking advantage of digital gene expression (DGE) sequencing technology, we compare all the available transcriptomes for S. mediterranea and improve their annotation. These results are accessible via the web for the community of researchers. Using the quantitative nature of DGE, we describe the transcriptional profile of neoblasts and present 42 new neoblast genes, including several cancer-related genes and transcription factors. Furthermore, we describe in detail the Smed-meis-like gene and the three Nuclear Factor Y subunits Smed-nf-YA, Smed-nf-YB-2 and Smed-nf-YC. DGE is a valuable tool for gene discovery, quantification and annotation. The application of DGE in S. mediterranea confirms the planarian stem cells or neoblasts as a complex population of pluripotent and multipotent cells regulated by a mixture of transcription factors and cancer-related genes.
Herbold, Craig W.; Pelikan, Claus; Kuzyk, Orest; Hausmann, Bela; Angel, Roey; Berry, David; Loy, Alexander
2015-01-01
High throughput sequencing of phylogenetic and functional gene amplicons provides tremendous insight into the structure and functional potential of complex microbial communities. Here, we introduce a highly adaptable and economical PCR approach to barcoding and pooling libraries of numerous target genes. In this approach, we replace gene- and sequencing platform-specific fusion primers with general, interchangeable barcoding primers, enabling nearly limitless customized barcode-primer combinations. Compared to barcoding with long fusion primers, our multiple-target gene approach is more economical because it requires fewer primers overall and is based on short primers with generally lower synthesis and purification costs. To highlight our approach, we pooled over 900 different small-subunit rRNA and functional gene amplicon libraries obtained from various environmental or host-associated microbial community samples into a single, paired-end Illumina MiSeq run. Although the amplicon regions ranged in size from approximately 290 to 720 bp, we found no significant systematic sequencing bias related to amplicon length or gene target. Our results indicate that this flexible multiplexing approach produces large, diverse, and high quality sets of amplicon sequence data for modern studies in microbial ecology. PMID:26236305
Chatterjee, Sumantra; Sivakamasundari, V; Yap, Sook Peng; Kraus, Petra; Kumar, Vibhor; Xing, Xing; Lim, Siew Lan; Sng, Joel; Prabhakar, Shyam; Lufkin, Thomas
2014-12-05
Vertebrate organogenesis is a highly complex process involving sequential cascades of transcription factor activation or repression. Interestingly, a single developmental control gene can occasionally be essential for the morphogenesis and differentiation of tissues and organs arising from vastly disparate embryological lineages. Here we elucidated the role of the mammalian homeobox gene Bapx1 during the embryogenesis of five distinct organs at E12.5 - vertebral column, spleen, gut, forelimb and hindlimb - using expression profiling of sorted wildtype and mutant cells combined with genome wide binding site analysis. Furthermore, we analyzed the development of the vertebral column at the molecular level by combining transcriptional profiling and genome wide binding data for Bapx1 with similarly generated data sets for Sox9 to assemble a detailed gene regulatory network, revealing genes previously not reported to be controlled by either of these two transcription factors. The gene regulatory network appears to control cell fate decisions and morphogenesis in the vertebral column, along with the prevention of premature chondrocyte differentiation, thus providing a detailed molecular view of vertebral column development.
Determination of performance characteristics of scientific applications on IBM Blue Gene/Q
DOE Office of Scientific and Technical Information (OSTI.GOV)
Evangelinos, C.; Walkup, R. E.; Sachdeva, V.
The IBM Blue Gene®/Q platform presents scientists and engineers with a rich set of hardware features such as 16 cores per chip sharing a Level 2 cache, a wide SIMD (single-instruction, multiple-data) unit, a five-dimensional torus network, and hardware support for collective operations. Especially important is the feature related to cores that have four “hardware threads,” which makes it possible to hide latencies and obtain a high fraction of the peak issue rate from each core. All of these hardware resources present unique performance-tuning opportunities on Blue Gene/Q. We provide an overview of several important applications and solvers and study them on Blue Gene/Q using performance counters and Message Passing Interface profiles. We also discuss how Blue Gene/Q tools help us understand the interaction of the application with the hardware and software layers and provide guidance for optimization. Furthermore, on the basis of our analysis, we discuss code improvement strategies targeting Blue Gene/Q. Information about how these algorithms map to the Blue Gene® architecture is expected to have an impact on future system design as we move to the exascale era.
Horizontal transfer of the msp130 gene supported the evolution of metazoan biomineralization.
Ettensohn, Charles A
2014-05-01
It is widely accepted that biomineralized structures appeared independently in many metazoan clades during the Cambrian. How this occurred, and whether it involved the parallel co-option of a common set of biochemical and developmental pathways (i.e., a shared biomineralization "toolkit"), are questions that remain unanswered. Here, I provide evidence that horizontal gene transfer supported the evolution of biomineralization in some metazoans. I show that Msp130 proteins, first described as proteins expressed selectively by the biomineral-forming primary mesenchyme cells of the sea urchin embryo, have a much wider taxonomic distribution than was previously appreciated. Msp130 proteins are present in several invertebrate deuterostomes and in one protostome clade (molluscs). Surprisingly, closely related proteins are also present in many bacteria and several algae, and I propose that msp130 genes were introduced into metazoan lineages via multiple, independent horizontal gene transfer events. Phylogenetic analysis shows that the introduction of an ancestral msp130 gene occurred in the sea urchin lineage more than 250 million years ago and that msp130 genes underwent independent, parallel duplications in each of the metazoan phyla in which these genes are found. © 2014 Wiley Periodicals, Inc.
A multiplex branched DNA assay for parallel quantitative gene expression profiling.
Flagella, Michael; Bui, Son; Zheng, Zhi; Nguyen, Cung Tuong; Zhang, Aiguo; Pastor, Larry; Ma, Yunqing; Yang, Wen; Crawford, Kimberly L; McMaster, Gary K; Witney, Frank; Luo, Yuling
2006-05-01
We describe a novel method to quantitatively measure messenger RNA (mRNA) expression of multiple genes directly from crude cell lysates and tissue homogenates without the need for RNA purification or target amplification. The multiplex branched DNA (bDNA) assay adapts the bDNA technology to the Luminex fluorescent bead-based platform through the use of cooperative hybridization, which ensures an exceptionally high degree of assay specificity. Using in vitro transcribed RNA as reference standards, we demonstrated that the assay is highly specific, with cross-reactivity less than 0.2%. We also determined that the assay detection sensitivity is 25,000 RNA transcripts with intra- and interplate coefficients of variation of less than 10% and less than 15%, respectively. Using three 10-gene panels designed to measure proinflammatory and apoptosis responses, we demonstrated sensitive and specific multiplex gene expression profiling directly from cell lysates. The gene expression change data demonstrate a high correlation coefficient (R^2 = 0.94) compared with measurements obtained using the single-plex bDNA assay. Thus, the multiplex bDNA assay provides a powerful means to quantify the gene expression profile of a defined set of target genes in large sample populations.
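The intra- and interplate precision figures quoted above are, under the usual definition, the sample standard deviation of replicate measurements divided by their mean, expressed as a percentage. The sketch below assumes that definition; the function name and the replicate values are hypothetical and are not data from the study.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Return the sample coefficient of variation as a percentage."""
    if len(values) < 2:
        raise ValueError("at least two replicate measurements are required")
    avg = mean(values)
    if avg == 0:
        raise ValueError("mean is zero; the coefficient of variation is undefined")
    # Sample standard deviation (n - 1 denominator) relative to the mean.
    return stdev(values) / avg * 100.0


# Hypothetical replicate signals for one target on one plate.
replicates = [90.0, 100.0, 110.0]
cv_percent = coefficient_of_variation(replicates)
```

Applied to replicates within a plate this gives an intraplate figure; applied to per-plate means across plates it gives an interplate figure.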
Anurag, Meenakshi; Punturi, Nindo; Hoog, Jeremy; Bainbridge, Matthew N; Ellis, Matthew J; Haricharan, Svasti
2018-05-23
This study was undertaken to conduct a comprehensive investigation of the role of DNA damage repair (DDR) defects in poor outcome ER+ disease. Expression and mutational status of DDR genes in ER+ breast tumors were correlated with proliferative response in neoadjuvant aromatase inhibitor therapy trials (discovery data set), with outcomes in METABRIC, TCGA and Loi data sets (validation data sets), and in patient derived xenografts. A causal relationship between candidate DDR genes and endocrine treatment response, and the underlying mechanism, was then tested in ER+ breast cancer cell lines. Loss of expression of three genes, CETN2 (p<0.001) and ERCC1 (p=0.01) from the nucleotide excision repair (NER) pathway and NEIL2 (p=0.04) from the base excision repair (BER) pathway, was associated with endocrine treatment resistance in the discovery data set and subsequently validated in independent patient cohorts. Complementary mutation analysis supported associations between mutations in NER and BER pathways and reduced endocrine treatment response. A causal role for CETN2, NEIL2 and ERCC1 loss in intrinsic endocrine resistance was experimentally validated in ER+ breast cancer cell lines, and in ER+ patient-derived xenograft models. Loss of CETN2, NEIL2 or ERCC1 induced endocrine treatment resistance by dysregulating G1/S transition, and therefore increased sensitivity to CDK4/6 inhibitors. A combined DDR signature score was developed that predicted poor outcome in multiple patient cohorts. This report identifies DDR defects as a new class of endocrine treatment resistance drivers and indicates new avenues for predicting efficacy of CDK4/6 inhibition in the adjuvant treatment setting. Copyright ©2018, American Association for Cancer Research.
This proposal develops scalable R / Bioconductor software infrastructure and data resources to integrate complex, heterogeneous, and large cancer genomic experiments. The falling cost of genomic assays facilitates collection of multiple data types (e.g., gene and transcript expression, structural variation, copy number, methylation, and microRNA data) from a set of clinical specimens. Furthermore, substantial resources are now available from large consortium activities like The Cancer Genome Atlas (TCGA).
Mason-Gamer, Roberta J
2013-01-01
The grass tribe Triticeae (=Hordeeae) comprises only about 300 species, but it is well known for the economically important crop plants wheat, barley, and rye. The group is also recognized as a fascinating example of evolutionary complexity, with a history shaped by numerous events of auto- and allopolyploidy and apparent introgression involving diploids and polyploids. The genus Elymus comprises a heterogeneous collection of allopolyploid genome combinations, all of which include at least one set of homoeologs, designated St, derived from Pseudoroegneria. The current analysis includes a geographically and genomically diverse collection of 21 tetraploid Elymus species, and a single hexaploid species. Diploid and polyploid relationships were estimated using four molecular data sets, including one that combines two regions of the chloroplast genome, and three from unlinked nuclear genes: phosphoenolpyruvate carboxylase, β-amylase, and granule-bound starch synthase I. Four gene trees were generated using maximum likelihood, and the phylogenetic placement of the polyploid sequences reveals extensive reticulation beyond allopolyploidy alone. The trees were interpreted with reference to numerous phenomena known to complicate allopolyploid phylogenies, and introgression was identified as a major factor in their history. The work illustrates the interpretation of complicated phylogenetic results through the sequential consideration of numerous possible explanations, and the results highlight the value of careful inspection of multiple independent molecular phylogenetic estimates, with particular focus on the differences among them.
Aliper, Alexander; Plis, Sergey; Artemov, Artem; Ulloa, Alvaro; Mamoshina, Polina; Zhavoronkov, Alex
2016-07-05
Deep learning is rapidly advancing many areas of science and technology with multiple success stories in image, text, voice and video recognition, robotics, and autonomous driving. In this paper we demonstrate how deep neural networks (DNN) trained on large transcriptional response data sets can classify various drugs to therapeutic categories solely based on their transcriptional profiles. We used the perturbation samples of 678 drugs across A549, MCF-7, and PC-3 cell lines from the LINCS Project and linked those to 12 therapeutic use categories derived from MeSH. To train the DNN, we utilized both gene level transcriptomic data and transcriptomic data processed using a pathway activation scoring algorithm, for a pooled data set of samples perturbed with different concentrations of the drug for 6 and 24 hours. In both pathway and gene level classification, DNN achieved high classification accuracy and convincingly outperformed the support vector machine (SVM) model on every multiclass classification problem; however, models based on pathway level data performed significantly better. For the first time we demonstrate a deep learning neural net trained on transcriptomic data to recognize pharmacological properties of multiple drugs across different biological systems and conditions. We also propose using deep neural net confusion matrices for drug repositioning. This work is a proof of principle for applying deep learning to drug discovery and development.
Spectral gene set enrichment (SGSE).
Frost, H Robert; Li, Zhigang; Moore, Jason H
2015-03-03
Gene set testing is typically performed in a supervised context to quantify the association between groups of genes and a clinical phenotype. In many cases, however, a gene set-based interpretation of genomic data is desired in the absence of a phenotype variable. Although methods exist for unsupervised gene set testing, they predominantly compute enrichment relative to clusters of the genomic variables with performance strongly dependent on the clustering algorithm and number of clusters. We propose a novel method, spectral gene set enrichment (SGSE), for unsupervised competitive testing of the association between gene sets and empirical data sources. SGSE first computes the statistical association between gene sets and principal components (PCs) using our principal component gene set enrichment (PCGSE) method. The overall statistical association between each gene set and the spectral structure of the data is then computed by combining the PC-level p-values using the weighted Z-method with weights set to the PC variance scaled by Tracy-Widom test p-values. Using simulated data, we show that the SGSE algorithm can accurately recover spectral features from noisy data. To illustrate the utility of our method on real data, we demonstrate the superior performance of the SGSE method relative to standard cluster-based techniques for testing the association between MSigDB gene sets and the variance structure of microarray gene expression data. Unsupervised gene set testing can provide important information about the biological signal held in high-dimensional genomic data sets. Because it uses the association between gene sets and sample PCs to generate a measure of unsupervised enrichment, the SGSE method is independent of cluster or network creation algorithms and, most importantly, is able to utilize the statistical significance of PC eigenvalues to ignore elements of the data most likely to represent noise.
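The combination step described in the SGSE abstract can be illustrated with a minimal sketch of the weighted Z-method for one-sided p-values. The function name `weighted_z` and its arguments are hypothetical, and the weights are simply taken as given here: how SGSE derives them from PC variances and Tracy-Widom p-values is not reproduced.

```python
from statistics import NormalDist


def weighted_z(p_values, weights):
    """Combine one-sided p-values with the weighted Z-method.

    Hypothetical helper for illustration only. Every p-value must lie
    strictly between 0 and 1, otherwise NormalDist.inv_cdf raises an error.
    """
    if not p_values or len(p_values) != len(weights):
        raise ValueError("p_values and weights must be non-empty and equal in length")
    nd = NormalDist()  # standard normal distribution
    # A small p-value maps to a large positive Z-score.
    z_scores = [nd.inv_cdf(1.0 - p) for p in p_values]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = sum(w * w for w in weights) ** 0.5
    if denominator == 0:
        raise ValueError("at least one weight must be non-zero")
    # Convert the combined Z-score back to a one-sided p-value.
    return 1.0 - nd.cdf(numerator / denominator)
```

In this sketch a weight of zero removes a component from the combined statistic entirely, which is one simple way to picture how down-weighting can suppress components that look like noise.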
The Protein Interaction Network of Bacteriophage Lambda with Its Host, Escherichia coli
Blasche, Sonja; Wuchty, Stefan; Rajagopala, Seesandra V.
2013-01-01
Although most of the 73 open reading frames (ORFs) in bacteriophage λ have been investigated intensively, the function of many genes in host-phage interactions remains poorly understood. Using yeast two-hybrid screens of all lambda ORFs for interactions with its host Escherichia coli, we determined a raw data set of 631 host-phage interactions resulting in a set of 62 high-confidence interactions after multiple rounds of retesting. These links suggest novel regulatory interactions between the E. coli transcriptional network and lambda proteins. Targeted host proteins and genes required for lambda infection are enriched among highly connected proteins, suggesting that bacteriophages resemble interaction patterns of human viruses. Lambda tail proteins interact with both bacterial fimbrial proteins and E. coli proteins homologous to other phage proteins. Lambda appears to dramatically differ from other phages, such as T7, because of its unusually large number of modified and processed proteins, which reduces the number of host-virus interactions detectable by yeast two-hybrid screens. PMID:24049175
Rietveld, Cornelius A.; Esko, Tõnu; Davies, Gail; Pers, Tune H.; Turley, Patrick; Benyamin, Beben; Chabris, Christopher F.; Emilsson, Valur; Johnson, Andrew D.; Lee, James J.; de Leeuw, Christiaan; Marioni, Riccardo E.; Medland, Sarah E.; Miller, Michael B.; Rostapshova, Olga; van der Lee, Sven J.; Vinkhuyzen, Anna A. E.; Amin, Najaf; Conley, Dalton; Derringer, Jaime; van Duijn, Cornelia M.; Fehrmann, Rudolf; Franke, Lude; Glaeser, Edward L.; Hansell, Narelle K.; Hayward, Caroline; Iacono, William G.; Ibrahim-Verbaas, Carla; Jaddoe, Vincent; Karjalainen, Juha; Laibson, David; Lichtenstein, Paul; Liewald, David C.; Magnusson, Patrik K. E.; Martin, Nicholas G.; McGue, Matt; McMahon, George; Pedersen, Nancy L.; Pinker, Steven; Porteous, David J.; Posthuma, Danielle; Rivadeneira, Fernando; Smith, Blair H.; Starr, John M.; Tiemeier, Henning; Timpson, Nicholas J.; Trzaskowski, Maciej; Uitterlinden, André G.; Verhulst, Frank C.; Ward, Mary E.; Wright, Margaret J.; Davey Smith, George; Deary, Ian J.; Johannesson, Magnus; Plomin, Robert; Visscher, Peter M.; Benjamin, Daniel J.; Koellinger, Philipp D.
2014-01-01
We identify common genetic variants associated with cognitive performance using a two-stage approach, which we call the proxy-phenotype method. First, we conduct a genome-wide association study of educational attainment in a large sample (n = 106,736), which produces a set of 69 education-associated SNPs. Second, using independent samples (n = 24,189), we measure the association of these education-associated SNPs with cognitive performance. Three SNPs (rs1487441, rs7923609, and rs2721173) are significantly associated with cognitive performance after correction for multiple hypothesis testing. In an independent sample of older Americans (n = 8,652), we also show that a polygenic score derived from the education-associated SNPs is associated with memory and absence of dementia. Convergent evidence from a set of bioinformatics analyses implicates four specific genes (KCNMA1, NRXN1, POU2F3, and SCRT). All of these genes are associated with a particular neurotransmitter pathway involved in synaptic plasticity, the main cellular mechanism for learning and memory. PMID:25201988
Takahashi, Kei-ichiro; Takigawa, Ichigaku; Mamitsuka, Hiroshi
2013-01-01
Detecting biclusters from expression data is useful because a bicluster is a set of genes that are coexpressed under only a subset of the given experimental conditions. We present SiBIC, software that, given an expression dataset, first exhaustively enumerates biclusters, then merges them into largely independent biclusters, and finally uses these to generate gene set networks, in which each node is a gene set of coexpressed genes. We evaluated each step of this procedure: 1) the biological and statistical significance of the generated biclusters, 2) the biological quality of the merged biclusters, and 3) the biological significance of the gene set networks. We emphasize that gene set networks, in which nodes are not genes but gene sets, can be more compact than usual gene networks, which makes them more comprehensible. SiBIC is available at http://utrecht.kuicr.kyoto-u.ac.jp:8080/miami/faces/index.jsp.
Otoupal, Peter B; Erickson, Keesha E; Escalas-Bordoy, Antoni; Chatterjee, Anushree
2017-01-20
The evolution of antibiotic resistance has engendered an impending global health crisis that necessitates a greater understanding of how resistance emerges. The impact of nongenetic factors and how they influence the evolution of resistance is a largely unexplored area of research. Here we present a novel application of CRISPR-Cas9 technology for investigating how gene expression governs the adaptive pathways available to bacteria during the evolution of resistance. We examine the impact of gene expression changes on bacterial adaptation by constructing a library of deactivated CRISPR-Cas9 synthetic devices to tune the expression of a set of stress-response genes in Escherichia coli. We show that artificially inducing perturbations in gene expression imparts significant synthetic control over fitness and growth during stress exposure. We present evidence that these impacts are reversible; strains with synthetically perturbed gene expression regained wild-type growth phenotypes upon stress removal, while maintaining divergent growth characteristics under stress. Furthermore, we demonstrate a prevailing trend toward negative epistatic interactions when multiple gene perturbations are combined simultaneously, thereby posing an intrinsic constraint on gene expression underlying adaptive trajectories. Together, these results emphasize how CRISPR-Cas9 can be employed to engineer gene expression changes that shape bacterial adaptation, and present a novel approach to synthetically control the evolution of antimicrobial resistance.
Fujimoto, Akihiro; Okada, Yukinori; Boroevich, Keith A; Tsunoda, Tatsuhiko; Taniguchi, Hiroaki; Nakagawa, Hidewaki
2016-05-26
Protein tertiary structure determines molecular function, interaction, and stability of the protein, therefore distribution of mutation in the tertiary structure can facilitate the identification of new driver genes in cancer. To analyze mutation distribution in protein tertiary structures, we applied a novel three dimensional permutation test to the mutation positions. We analyzed somatic mutation datasets of 21 types of cancers obtained from exome sequencing conducted by the TCGA project. Of the 3,622 genes that had ≥3 mutations in the regions with tertiary structure data, 106 genes showed significant skew in mutation distribution. Known tumor suppressors and oncogenes were significantly enriched in these identified cancer gene sets. Physical distances between mutations in known oncogenes were significantly smaller than those of tumor suppressors. Twenty-three genes were detected in multiple cancers. Candidate genes with significant skew of the 3D mutation distribution included kinases (MAPK1, EPHA5, ERBB3, and ERBB4), an apoptosis related gene (APP), an RNA splicing factor (SF1), a miRNA processing factor (DICER1), an E3 ubiquitin ligase (CUL1) and transcription factors (KLF5 and EEF1B2). Our study suggests that systematic analysis of mutation distribution in the tertiary protein structure can help identify cancer driver genes.
Fujimoto, Akihiro; Okada, Yukinori; Boroevich, Keith A.; Tsunoda, Tatsuhiko; Taniguchi, Hiroaki; Nakagawa, Hidewaki
2016-01-01
Protein tertiary structure determines molecular function, interaction, and stability of the protein, therefore distribution of mutation in the tertiary structure can facilitate the identification of new driver genes in cancer. To analyze mutation distribution in protein tertiary structures, we applied a novel three dimensional permutation test to the mutation positions. We analyzed somatic mutation datasets of 21 types of cancers obtained from exome sequencing conducted by the TCGA project. Of the 3,622 genes that had ≥3 mutations in the regions with tertiary structure data, 106 genes showed significant skew in mutation distribution. Known tumor suppressors and oncogenes were significantly enriched in these identified cancer gene sets. Physical distances between mutations in known oncogenes were significantly smaller than those of tumor suppressors. Twenty-three genes were detected in multiple cancers. Candidate genes with significant skew of the 3D mutation distribution included kinases (MAPK1, EPHA5, ERBB3, and ERBB4), an apoptosis related gene (APP), an RNA splicing factor (SF1), a miRNA processing factor (DICER1), an E3 ubiquitin ligase (CUL1) and transcription factors (KLF5 and EEF1B2). Our study suggests that systematic analysis of mutation distribution in the tertiary protein structure can help identify cancer driver genes. PMID:27225414
van Dam, Jesse C J; Schaap, Peter J; Martins dos Santos, Vitor A P; Suárez-Diez, María
2014-09-26
Different methods have been developed to infer regulatory networks from heterogeneous omics datasets and to construct co-expression networks. Each algorithm produces different networks and efforts have been devoted to automatically integrate them into consensus sets. However each separate set has an intrinsic value that is diluted and partly lost when building a consensus network. Here we present a methodology to generate co-expression networks and, instead of a consensus network, we propose an integration framework where the different networks are kept and analysed with additional tools to efficiently combine the information extracted from each network. We developed a workflow to efficiently analyse information generated by different inference and prediction methods. Our methodology relies on providing the user the means to simultaneously visualise and analyse the coexisting networks generated by different algorithms, heterogeneous datasets, and a suite of analysis tools. As a show case, we have analysed the gene co-expression networks of Mycobacterium tuberculosis generated using over 600 expression experiments. Regarding DNA damage repair, we identified SigC as a key control element, 12 new targets for LexA, an updated LexA binding motif, and a potential mismatch repair system. We expanded the DevR regulon with 27 genes while identifying 9 targets wrongly assigned to this regulon. We discovered 10 new genes linked to zinc uptake and a new regulatory mechanism for ZuR. The use of co-expression networks to perform system level analysis allows the development of custom made methodologies. As show cases we implemented a pipeline to integrate ChIP-seq data and another method to uncover multiple regulatory layers. 
Our workflow is based on representing the multiple types of information as network representations and presenting these networks in a synchronous framework that allows their simultaneous visualization while keeping specific associations from the different networks. By simultaneously exploring these networks and metadata, we gained insights into regulatory mechanisms in M. tuberculosis that could not be obtained through the separate analysis of each data type.
White-Al Habeeb, Nicole M A; Garcia, Julia; Fleshner, Neil; Bapat, Bharati
2016-12-01
This study explored the biological effects of metformin on prostate cancer (PCa) cells and determined molecular pathways and epigenetic regulators implicated in its mechanism of action. We performed mRNA expression profiling in 22Rv1 cells following 2.5 mM and 5 mM metformin treatment. Genes significantly modified by metformin treatment were ranked based on altered expression, involvement with cancer-related processes, and reported dysregulation in PCa. The effects of the top ranked gene, MMSET, on the proliferative and invasive capabilities of PCa cells were investigated via siRNA knockdown alone and also combined with metformin treatment. Metformin treatment decreased cell growth of PCa cell line 22Rv1 and stalled cells at the G1/S checkpoint in a time- and dose-dependent manner, resulting in increased cells in G1 (P < 0.05) and decreased cells in S (P < 0.05) phase. Metformin activated the AMPK/mTOR signaling pathway as shown by increased p-AMPK and decreased p-p70S6K. mRNA expression profiling following metformin treatment identified significant changes in 136 chromatin-modifying genes. The top ranked gene, multiple myeloma SET domain (MMSET), showed increased expression in PCa cell lines (22Rv1 and DU145) when compared to the benign prostate epithelium-derived cell line RWPE-1, and its expression was decreased upon metformin treatment. siRNA-mediated knockdown of MMSET showed decreased cellular migration and invasion in DU145 cells. MMSET knockdown in combination with metformin treatment resulted in further reduction in the capacity of PCa cells to migrate and invade. These data suggest MMSET may play a role in the inhibitory effect of metformin on PCa and could serve as a potential novel therapeutic target for PCa. Prostate 76:1507-1518, 2016. © 2016 Wiley Periodicals, Inc.
Effect of the absolute statistic on gene-sampling gene-set analysis methods.
Nam, Dougu
2017-06-01
Gene-set enrichment analysis and its modified versions have commonly been used for identifying altered functions or pathways in disease from microarray data. In particular, the simple gene-sampling gene-set analysis methods have been heavily used for datasets with only a few sample replicates. The biggest problem with this approach is the highly inflated false-positive rate. In this paper, the effect of absolute gene statistic on gene-sampling gene-set analysis methods is systematically investigated. Thus far, the absolute gene statistic has merely been regarded as a supplementary method for capturing the bidirectional changes in each gene set. Here, it is shown that incorporating the absolute gene statistic in gene-sampling gene-set analysis substantially reduces the false-positive rate and improves the overall discriminatory ability. Its effect was investigated by power, false-positive rate, and receiver operating curve for a number of simulated and real datasets. The performances of gene-set analysis methods in one-tailed (genome-wide association study) and two-tailed (gene expression data) tests were also compared and discussed.
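A minimal sketch of a gene-sampling test with an optional absolute gene statistic may help make the idea concrete. Everything here is an illustrative assumption rather than the paper's procedure: the function name `gene_sampling_pvalue`, the use of the mean gene statistic as the set score, and the default number of random gene sets.

```python
import random


def gene_sampling_pvalue(gene_stats, gene_set, n_perm=2000, use_abs=True, seed=0):
    """Empirical p-value for a gene set under random gene sampling.

    gene_stats: dict mapping gene name -> gene-level statistic (e.g. a t-score)
    gene_set:   iterable of gene names; names missing from gene_stats are ignored
    use_abs:    if True, score each gene by the absolute value of its statistic
    """
    rng = random.Random(seed)
    genes = sorted(gene_stats)  # fixed order keeps the sampling reproducible
    members = [g for g in gene_set if g in gene_stats]
    if not members:
        raise ValueError("gene_set shares no genes with gene_stats")

    def score(gene):
        value = gene_stats[gene]
        return abs(value) if use_abs else value

    k = len(members)
    observed = sum(score(g) for g in members) / k
    hits = 0
    for _ in range(n_perm):
        # Draw a random gene set of the same size from all measured genes.
        sample = rng.sample(genes, k)
        if sum(score(g) for g in sample) / k >= observed:
            hits += 1
    # Add-one correction so the empirical p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

With `use_abs=False` the set score is the signed mean, so up- and down-regulated genes in the same set cancel each other. With `use_abs=True` both directions contribute to the score.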
Reitzel, Adam M; Pang, Kevin; Martindale, Mark Q
2016-01-01
An essential developmental pathway in sexually reproducing animals is the specification of germ cells and the differentiation of mature gametes, sperm and oocytes. The "germline" genes vasa, nanos and piwi are commonly identified in primordial germ cells, suggesting a molecular signature for the germline throughout animals. However, these genes are also expressed in a diverse set of somatic stem cells throughout the animal kingdom, leaving open significant questions about whether they are required for germline specification. Similarly, members of the Dmrt gene family are essential components regulating sex determination and differentiation in bilaterian animals, but the functions of these transcription factors, including potential roles in sex determination, in early diverging animals remain unknown. The phylogenetic position of ctenophores and the genome sequence of the lobate Mnemiopsis leidyi motivated us to determine the complement of these gene families in this species and determine expression patterns during development. Our phylogenetic analyses of the vasa, piwi and nanos gene families show that Mnemiopsis has multiple genes in each family with multiple lineage-specific paralogs. Expression domains of Mnemiopsis nanos, vasa and piwi, during embryogenesis from fertilization to the cydippid stage, were diverse, with little overlapping expression and no or little expression in what we think are the germ cells or gametogenic regions. piwi paralogs in Mnemiopsis had distinct expression domains in the ectoderm during development. We observed overlapping expression domains in the apical organ and tentacle apparatus of the cydippid for a subset of "germline genes," which are areas of high cell proliferation, suggesting that these genes are involved with "stem cell" specification and maintenance. Similarly, the five Dmrt genes show diverse non-overlapping expression domains, with no clear evidence for expression in future gametogenic regions of the adult.
We also report on splice variants for two Mnemiopsis Dmrt genes that impact the presence and composition of the DM DNA binding domain for these transcription factors. Our results are consistent with a potential role for vasa, piwi and nanos genes in the specification or maintenance of somatic stem cell populations during development in Mnemiopsis. These results are similar to previous results in the tentaculate ctenophore Pleurobrachia, with the exception that these genes were also expressed in gonads and developing gametes of adult Pleurobrachia. These differences suggest that the Mnemiopsis germline is either specified later in development than hypothesized, the germline undergoes extensive migration, or the germline does not express these classic molecular markers. Our results highlight the utility of comparing expression of orthologous genes across multiple species. We provide the first description of Dmrt expression in a ctenophore, which indicates that Dmrt genes are expressed in distinct structures and regions during development but not in future gametogenic regions, the only sex-specific structure for this hermaphroditic species.
Gilli, Francesca; Navone, Nicole Désirée; Perga, Simona; Marnetto, Fabiana; Caldano, Marzia; Capobianco, Marco; Pulizzi, Annalisa; Malucchi, Simona; Bertolotto, Antonio
2011-07-01
In a recent genome-wide transcriptional analysis, we identified a gene signature for multiple sclerosis (MS), which reverted back to normal during pregnancy. Reversion was particularly evident for 7 genes: SOCS2, TNFAIP3, NR4A2, CXCR4, POLR2J, FAM49B, and STAG3L1, most of which encode negative regulators of inflammation. The objectives were to corroborate dysregulation of these genes, to evaluate their prognostic value, and to study their modulation during different treatments. This comparison study was conducted at an Italian referral center for MS. Quantitative polymerase chain reaction measurements were performed for 274 patients with MS and 60 healthy controls. Of the 274 patients with MS, 113 were treatment-naive patients in the initial stages of their disorder who were followed up in real-world clinical settings and categorized on the basis of disease course. The remaining 161 patients with MS received disease-modifying therapies (55 patients were treated with interferon beta, 52 with glatiramer acetate, and 54 with natalizumab) for a mean (SD) of 12 (2) months. The main outcome measures were gene expression levels, relapse rate, and change in Expanded Disability Status Scale. We found a dysregulated gene pathway (P ≤ .006), with a downregulation of genes encoding negative regulators. The SOCS2, NR4A2, and TNFAIP3 genes were inversely correlated with both relapse rate (P ≤ .002) and change in Expanded Disability Status Scale (P ≤ .005). SOCS2 was modulated by both interferon beta and glatiramer acetate, TNFAIP3 was modulated by glatiramer acetate, and NR4A2 was not altered at all. No changes were induced by natalizumab. We demonstrate that there is a new molecular pathogenic mechanism that underlies the initiation and progression of MS. Defects in negative-feedback loops of inflammation lead to an overactivation of the immune system so as to predispose the brain to inflammation-sensitive MS.
Bao, Le; Gu, Hong; Dunn, Katherine A; Bielawski, Joseph P
2007-02-08
Models of codon evolution have proven useful for investigating the strength and direction of natural selection. In some cases, a priori biological knowledge has been used successfully to model heterogeneous evolutionary dynamics among codon sites. These are called fixed-effect models, and they require that all codon sites are assigned to one of several partitions which are permitted to have independent parameters for selection pressure, evolutionary rate, transition to transversion ratio or codon frequencies. For single gene analysis, partitions might be defined according to protein tertiary structure, and for multiple gene analysis partitions might be defined according to a gene's functional category. Given a set of related fixed-effect models, the task of selecting the model that best fits the data is not trivial. In this study, we implement a set of fixed-effect codon models which allow for different levels of heterogeneity among partitions in the substitution process. We describe strategies for selecting among these models by a backward elimination procedure, Akaike information criterion (AIC) or a corrected Akaike information criterion (AICc). We evaluate the performance of these model selection methods via a simulation study, and make several recommendations for real data analysis. Our simulation study indicates that the backward elimination procedure can provide a reliable method for model selection in this setting. We also demonstrate the utility of these models by application to a single-gene dataset partitioned according to tertiary structure (abalone sperm lysin), and a multi-gene dataset partitioned according to the functional category of the gene (flagellar-related proteins of Listeria). Fixed-effect models have advantages and disadvantages. Fixed-effect models are desirable when data partitions are known to exhibit significant heterogeneity or when a statistical test of such heterogeneity is desired. 
They have the disadvantage of requiring a priori knowledge for partitioning sites. We recommend: (i) selection of models by using backward elimination rather than AIC or AICc, (ii) use a stringent cut-off, e.g., p = 0.0001, and (iii) conduct sensitivity analysis of results. With thoughtful application, fixed-effect codon models should provide a useful tool for large scale multi-gene analyses.
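The two information criteria named in the abstract above follow standard formulas, sketched here for reference. The function names and the example numbers in the comments are hypothetical, and treating the number of codon sites as the sample size n is an assumption, not something the abstract states.

```python
def aic(log_likelihood, k):
    """Akaike information criterion: AIC = 2k - 2 ln L."""
    return 2 * k - 2 * log_likelihood


def aicc(log_likelihood, k, n):
    """AIC with small-sample correction: AICc = AIC + 2k(k + 1) / (n - k - 1).

    k is the number of free parameters. n is the sample size, assumed here
    to be the number of codon sites in the alignment.
    """
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    return aic(log_likelihood, k) + (2 * k * (k + 1)) / (n - k - 1)


def best_model(candidates, n):
    """Return the name of the candidate with the lowest AICc.

    candidates: dict mapping model name -> (log_likelihood, k).
    """
    return min(candidates, key=lambda name: aicc(*candidates[name], n))
```

Lower values indicate a better trade-off between fit and parameter count. The correction term grows as k approaches n, so AICc penalizes heavily partitioned models more strongly on short alignments.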
Co, Aila L.; Hay, Ariel M.; MacDonald, James W.; Bammler, Theo K.; Farin, Federico M.; Costa, Lucio G.; Furlong, Clement E.
2014-01-01
Chlorpyrifos oxon (CPO), the toxic metabolite of the organophosphorus (OP) insecticide chlorpyrifos, causes developmental neurotoxicity in humans and rodents. CPO is hydrolyzed by paraoxonase-1 (PON1), with protection determined by PON1 levels and the human Q192R polymorphism. To examine how the Q192R polymorphism influences fetal toxicity associated with gestational CPO exposure, we measured enzyme inhibition and fetal-brain gene expression in wild-type (PON1+/+), PON1-knockout (PON1−/−), and tgHuPON1R192 and tgHuPON1Q192 transgenic mice. Pregnant mice exposed dermally to 0, 0.50, 0.75, or 0.85 mg/kg/d CPO from gestational day (GD) 6 through 17 were sacrificed on GD18. Biomarkers of CPO exposure inhibited in maternal tissues included brain acetylcholinesterase (AChE), red blood cell acylpeptide hydrolase (APH), and plasma butyrylcholinesterase (BChE) and carboxylesterase (CES). Fetal plasma BChE was inhibited in PON1−/− and tgHuPON1Q192, but not PON1+/+ or tgHuPON1R192 mice. Fetal brain AChE and plasma CES were inhibited in PON1−/− mice, but not in other genotypes. Weighted gene co-expression network analysis identified five gene modules based on clustering of the correlations among their fetal-brain expression values, allowing for correlation of module membership with the phenotypic data on enzyme inhibition. One module that correlated highly with maternal brain AChE activity had a large representation of homeobox genes. Gene set enrichment analysis revealed multiple gene sets affected by gestational CPO exposure in tgHuPON1Q192 but not tgHuPON1R192 mice, including gene sets involved in protein export, lipid metabolism, and neurotransmission. These data indicate that maternal PON1 status modulates the effects of repeated gestational CPO exposure on fetal-brain gene expression and on inhibition of both maternal and fetal biomarker enzymes. PMID:25070982
An investigation of obesity susceptibility genes in Northern Han Chinese by targeted resequencing.
Wu, Yili; Wang, Weijing; Jiang, Wenjie; Yao, Jie; Zhang, Dongfeng
2017-02-01
Our earlier genome-wide linkage study of body mass index (BMI) showed strong signals from 7q36.3 and 8q21.13. This case-control study set out to investigate these 2 genomic regions, which may harbor variants that contribute to the development of obesity. We employed targeted resequencing technology to detect single nucleotide polymorphisms (SNPs) in 7q36.3 and 8q21.13 from 16 individuals with obesity. These were compared with 504 East Asians in the 1000 Genomes Project as a reference panel. Linkage disequilibrium (LD) block analysis was performed for the significant SNPs located near the same gene. Genes involved in statistically significant loci were then subjected to gene set enrichment analysis (GSEA). The 16 individuals were aged between 30 and 60 years, with BMI = 33.25 ± 2.22 kg/m². A total of 12,131 genetic variants were found across all samples. After correcting for multiple testing, 65 SNPs from 25 nearest genes (INSIG1, FABP5, PTPRN2, VIPR2, WDR60, SHH, UBE3C, LMBR1, PAG1, IMPA1, CHMP4, SNX16, BLACE, EN2, CNPY1, LOC100506302, RBM33, LOC389602, LOC285889, LINC01006, NOM1, DNAJB6, LOC101927914, ESYT2, LINC00689) were associated with obesity at a significance level of q-value ≤ 0.05. LD block analysis showed there were 10 pairs of loci with D' ≥ 0.8 and r² ≥ 0.8. GSEA further identified 2 major related gene sets, involving lipid raft and lipid metabolic process, with FDR values <0.12 and <0.4, respectively. Our data are the first documentation of genetic variants in 7q36.3 and 8q21.13 associated with obesity using target capture sequencing and Northern Han Chinese samples. Additional replication and functional studies are merited to validate our findings.
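Pairwise LD measures such as D' and r², as used in the LD block analysis above, can be computed from haplotype and allele frequencies. The sketch below uses the standard textbook definitions for two biallelic loci. The function name `ld_stats` and its inputs are hypothetical, and it does not reproduce whatever software the study used.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a:  frequency of allele A at the first locus
    p_b:  frequency of allele B at the second locus
    """
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("both loci must be polymorphic")
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient
    # D' rescales |D| by the largest value it could take given the
    # allele frequencies. The bound depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max
    r2 = (d * d) / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2
```

For example, `ld_stats(0.45, 0.5, 0.5)` gives D' of about 0.8 but r² of about 0.64, which shows why requiring both measures to exceed a threshold is stricter than requiring D' alone.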
Ensembl comparative genomics resources.
Herrero, Javier; Muffato, Matthieu; Beal, Kathryn; Fitzgerald, Stephen; Gordon, Leo; Pignatelli, Miguel; Vilella, Albert J; Searle, Stephen M J; Amode, Ridwan; Brent, Simon; Spooner, William; Kulesha, Eugene; Yates, Andrew; Flicek, Paul
2016-01-01
Evolution provides the unifying framework with which to understand biology. The coherent investigation of genic and genomic data often requires comparative genomics analyses based on whole-genome alignments, sets of homologous genes and other relevant datasets in order to evaluate and answer evolutionary-related questions. However, the complexity and computational requirements of producing such data are substantial: this has led to only a small number of reference resources that are used for most comparative analyses. The Ensembl comparative genomics resources are one such reference set that facilitates comprehensive and reproducible analysis of chordate genome data. Ensembl computes pairwise and multiple whole-genome alignments from which large-scale synteny, per-base conservation scores and constrained elements are obtained. Gene alignments are used to define Ensembl Protein Families, GeneTrees and homologies for both protein-coding and non-coding RNA genes. These resources are updated frequently and have a consistent informatics infrastructure and data presentation across all supported species. Specialized web-based visualizations are also available including synteny displays, collapsible gene tree plots, a gene family locator and different alignment views. The Ensembl comparative genomics infrastructure is extensively reused for the analysis of non-vertebrate species by other projects including Ensembl Genomes and Gramene and much of the information here is relevant to these projects. The consistency of the annotation across species and the focus on vertebrates makes Ensembl an ideal system to perform and support vertebrate comparative genomic analyses. We use robust software and pipelines to produce reference comparative data and make it freely available. Database URL: http://www.ensembl.org. © The Author(s) 2016. Published by Oxford University Press.
Cai, Xiaojun; Jin, Rongrong; Wang, Jiali; Yue, Dong; Jiang, Qian; Wu, Yao; Gu, Zhongwei
2016-03-09
Polymeric vectors have shown great promise in the development of safe and efficient gene delivery systems; however, only a few have been developed in clinical settings due to poor transport across multiple physiological barriers. To address this issue and promote clinical translation of polymeric vectors, a new type of polymeric vector, bioreducible fluorinated peptide dendrimers (BFPDs), was designed and synthesized by reversible cross-linking of fluorinated low generation peptide dendrimers. By integrating the features of reversible cross-linking, fluorination, and polyhedral oligomeric silsesquioxane (POSS) core-based peptide dendrimers, this novel vector exhibited several distinctive features, including (i) inactive surface to resist protein interactions; (ii) virus-mimicking surface topography to augment cellular uptake; (iii) fluorination-mediated efficient cellular uptake, endosome escape, cytoplasm trafficking, and nuclear entry; and (iv) disulfide-cleavage-mediated polyplex disassembly and DNA release that allows efficient DNA transcription. Notably, all of these features are functionally important and can synergistically facilitate DNA transport from solution to the nucleus. As a consequence, BFPDs showed excellent gene transfection efficiency in several cell lines (∼95% in HEK293 cells) and superior biocompatibility compared with polyethylenimine (PEI). BFPDs also provided excellent serum resistance in gene delivery. More importantly, BFPDs offer considerable in vivo gene transfection efficiency (in muscular tissues and in HepG2 tumor xenografts), which was approximately 77-fold higher than that of PEI in luciferase activity. These results suggest bioreducible fluorinated peptide dendrimers are a new class of highly efficient and safe gene delivery vectors and should be used in clinical settings.
Ensembl comparative genomics resources
Muffato, Matthieu; Beal, Kathryn; Fitzgerald, Stephen; Gordon, Leo; Pignatelli, Miguel; Vilella, Albert J.; Searle, Stephen M. J.; Amode, Ridwan; Brent, Simon; Spooner, William; Kulesha, Eugene; Yates, Andrew; Flicek, Paul
2016-01-01
Evolution provides the unifying framework with which to understand biology. The coherent investigation of genic and genomic data often requires comparative genomics analyses based on whole-genome alignments, sets of homologous genes and other relevant datasets in order to evaluate and answer evolution-related questions. However, the complexity and computational requirements of producing such data are substantial: this has led to only a small number of reference resources that are used for most comparative analyses. The Ensembl comparative genomics resources are one such reference set that facilitates comprehensive and reproducible analysis of chordate genome data. Ensembl computes pairwise and multiple whole-genome alignments from which large-scale synteny, per-base conservation scores and constrained elements are obtained. Gene alignments are used to define Ensembl Protein Families, GeneTrees and homologies for both protein-coding and non-coding RNA genes. These resources are updated frequently and have a consistent informatics infrastructure and data presentation across all supported species. Specialized web-based visualizations are also available, including synteny displays, collapsible gene tree plots, a gene family locator and different alignment views. The Ensembl comparative genomics infrastructure is extensively reused for the analysis of non-vertebrate species by other projects, including Ensembl Genomes and Gramene, and much of the information here is relevant to these projects. The consistency of the annotation across species and the focus on vertebrates make Ensembl an ideal system to perform and support vertebrate comparative genomic analyses. We use robust software and pipelines to produce reference comparative data and make it freely available. Database URL: http://www.ensembl.org. PMID:26896847