A Review of Computational Methods for Finding Non-Coding RNA Genes
Abbas, Qaisar; Raza, Syed Mansoor; Biyabani, Azizuddin Ahmed; Jaffar, Muhammad Arfan
2016-01-01
Finding non-coding RNA (ncRNA) genes has emerged over the past few years as a cutting-edge trend in bioinformatics. The annotation and interpretation of ncRNAs pose numerous computational intelligence (CI) challenges because they require domain-specific expert knowledge of CI techniques. Moreover, many classes have been predicted but not yet experimentally verified. Recently, researchers have applied many CI methods to predict the classes of ncRNAs. However, the diverse CI approaches lack a definitive classification framework that would let new work take advantage of past studies. A few review papers have attempted to summarize CI approaches, but they focused on particular methodological viewpoints. Accordingly, in this article, we summarize the CI techniques for finding ncRNA genes in greater detail than previously available. We differentiate our work from the existing body of research and concisely discuss the technical merits of various techniques. Lastly, we review the limitations of ncRNA gene-finding CI methods with a view towards the development of new computational tools. PMID:27918472
Zagrijchuk, Elizaveta A.; Sabirov, Marat A.; Holloway, David M.; Spirov, Alexander V.
2014-01-01
Biological development depends on the coordinated expression of genes in time and space. Developmental genes have extensive cis-regulatory regions which control their expression. These regions are organized in a modular manner, with different modules controlling expression at different times and locations. Both how modularity evolved and what function it serves are open questions. We present a computational model for the cis-regulation of the hunchback (hb) gene in the fruit fly (Drosophila). We simulate evolution (using an evolutionary computation approach from computer science) to find the optimal cis-regulatory arrangements for fitting experimental hb expression patterns. We find that the cis-regulatory region tends to readily evolve modularity. These cis-regulatory modules (CRMs) do not tend to control single spatial domains, but show a multi-CRM/multi-domain correspondence. We find that the CRM-domain correspondence seen in Drosophila evolves with a high probability in our model, supporting the biological relevance of the approach. The partial redundancy resulting from multi-CRM control may confer some biological robustness against corruption of regulatory sequences. The technique developed on hb could readily be applied to other multi-CRM developmental genes. PMID:24712536
Inference of cancer-specific gene regulatory networks using soft computing rules.
Wang, Xiaosheng; Gotoh, Osamu
2010-03-24
Perturbations of gene regulatory networks are essentially responsible for oncogenesis. Therefore, inferring the gene regulatory networks is a key step to overcoming cancer. In this work, we propose a method for inferring directed gene regulatory networks based on soft computing rules, which can identify important cause-effect regulatory relations of gene expression. First, we identify important genes associated with a specific cancer (colon cancer) using a supervised learning approach. Next, we reconstruct the gene regulatory networks by inferring the regulatory relations among the identified genes, and their regulated relations by other genes within the genome. We obtain two meaningful findings. One is that upregulated genes are regulated by more genes than downregulated ones, while downregulated genes regulate more genes than upregulated ones. The other one is that tumor suppressors suppress tumor activators and activate other tumor suppressors strongly, while tumor activators activate other tumor activators and suppress tumor suppressors weakly, indicating the robustness of biological systems. These findings provide valuable insights into the pathogenesis of cancer.
Informatics approaches in the Biological Characterization of ...
Adverse Outcome Pathways (AOPs) are a conceptual framework to characterize toxicity pathways by a series of mechanistic steps from a molecular initiating event to population outcomes. This framework helps to direct risk assessment research, for example by aiding in computational prioritization of chemicals, genes, and tissues relevant to an adverse health outcome. We have designed and implemented a computational workflow to access a wealth of public data relating genes, chemicals, diseases, pathways, and species, to provide a biological context for putative AOPs. We selected three AOP case studies: ER/Aromatase Antagonism Leading to Reproductive Dysfunction, AHR1 Activation Leading to Cardiotoxicity, and AChE Inhibition Leading to Acute Mortality, and deduced a taxonomic range of applicability for each AOP. We developed computational tools to automatically access and analyze the pathway activity of AOP-relevant protein orthologs, finding broad similarity among vertebrate species for the ER/Aromatase and AHR1 AOPs, and similarity extending to invertebrate animal species for AChE inhibition. Additionally, we used public gene expression data to find groups of highly co-expressed genes, and compared those groups across organisms. To interpret these findings at a higher level of biological organization, we created the AOPdb, a relational database that mines results from sources including NCBI, KEGG, Reactome, CTD, and OMIM. This multi-source database connects genes,
Finding approximate gene clusters with Gecko 3.
Winter, Sascha; Jahn, Katharina; Wehner, Stefanie; Kuchenbecker, Leon; Marz, Manja; Stoye, Jens; Böcker, Sebastian
2016-11-16
Gene-order-based comparison of multiple genomes provides signals for functional analysis of genes and the evolutionary process of genome organization. Gene clusters are regions of co-localized genes on genomes of different species. The rapid increase in sequenced genomes necessitates bioinformatics tools for finding gene clusters in hundreds of genomes. Existing tools are often restricted to few (in many cases, only two) genomes, and often make restrictive assumptions such as short perfect conservation, conserved gene order or monophyletic gene clusters. We present Gecko 3, an open-source software for finding gene clusters in hundreds of bacterial genomes, that comes with an easy-to-use graphical user interface. The underlying gene cluster model is intuitive, can cope with low degrees of conservation as well as misannotations and is complemented by a sound statistical evaluation. To evaluate the biological benefit of Gecko 3 and to exemplify our method, we search for gene clusters in a dataset of 678 bacterial genomes using Synechocystis sp. PCC 6803 as a reference. We confirm detected gene clusters reviewing the literature and comparing them to a database of operons; we detect two novel clusters, which were confirmed by publicly available experimental RNA-Seq data. The computational analysis is carried out on a laptop computer in <40 min. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Huang, Ying; Chen, Shi-Yi; Deng, Feilong
2016-01-01
In silico analysis of DNA sequences is an important area of computational biology in the post-genomic era. Over the past two decades, computational approaches for ab initio prediction of gene structure from genome sequence alone have greatly advanced our understanding of a variety of biological questions. Although the computational prediction of protein-coding genes is already well established, robustly finding non-coding RNA genes, such as miRNA and lncRNA, remains a challenge. Ab initio gene prediction has two main aspects: the computed values that describe sequence features, and the algorithm used to train the discriminant function; various bioinformatic tools employ different combinations of the two. Herein, we briefly review the well-characterized sequence features in eukaryotic genomes and their applications to ab initio gene prediction. The main purpose of this article is to provide an overview for beginners who aim to develop related bioinformatic tools.
Genome-Wide Comparative Gene Family Classification
Frech, Christian; Chen, Nansheng
2010-01-01
Correct classification of genes into gene families is important for understanding gene function and evolution. Although gene families of many species have been resolved both computationally and experimentally with high accuracy, gene family classification in most newly sequenced genomes has not been done with the same high standard. This project has been designed to develop a strategy to effectively and accurately classify gene families across genomes. We first examine and compare the performance of computer programs developed for automated gene family classification. We demonstrate that some programs, including the hierarchical average-linkage clustering algorithm MC-UPGMA and the popular Markov clustering algorithm TRIBE-MCL, can reconstruct manual curation of gene families accurately. However, their performance is highly sensitive to parameter setting, i.e. different gene families require different program parameters for correct resolution. To circumvent the problem of parameterization, we have developed a comparative strategy for gene family classification. This strategy takes advantage of existing curated gene families of reference species to find suitable parameters for classifying genes in related genomes. To demonstrate the effectiveness of this novel strategy, we use TRIBE-MCL to classify chemosensory and ABC transporter gene families in C. elegans and its four sister species. We conclude that fully automated programs can establish biologically accurate gene families if parameterized accordingly. Comparative gene family classification finds optimal parameters automatically, thus allowing rapid insights into gene families of newly sequenced species. PMID:20976221
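The comparative strategy in the abstract above (tune clustering parameters on a curated reference species, then reuse them on related genomes) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the clusterer is a simple similarity-threshold stand-in rather than TRIBE-MCL, the agreement score (Jaccard index over co-clustered gene pairs) is one possible choice, and all function names and data are hypothetical.

```python
from itertools import combinations


def co_clustered_pairs(clusters):
    """Return the set of unordered gene pairs that share a cluster."""
    return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}


def pair_jaccard(predicted, curated):
    """Agreement between two clusterings as Jaccard index over co-clustered pairs."""
    a, b = co_clustered_pairs(predicted), co_clustered_pairs(curated)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def threshold_clusters(sim, genes, cutoff):
    """Stand-in clusterer (NOT TRIBE-MCL): connected components of the
    graph that keeps only edges with similarity >= cutoff."""
    parent = {g: g for g in genes}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path compression
            g = parent[g]
        return g

    for (a, b), score in sim.items():
        if score >= cutoff:
            parent[find(a)] = find(b)
    groups = {}
    for g in genes:
        groups.setdefault(find(g), set()).add(g)
    return list(groups.values())


def best_parameter(sim, genes, curated, candidates):
    """Pick the candidate parameter whose clustering best matches the curated families."""
    return max(
        candidates,
        key=lambda c: pair_jaccard(threshold_clusters(sim, genes, c), curated),
    )


# Hypothetical reference-species data: pairwise similarities and curated families.
genes = ["a", "b", "c", "d"]
sim = {("a", "b"): 0.9, ("c", "d"): 0.8, ("b", "c"): 0.3}
curated = [{"a", "b"}, {"c", "d"}]
chosen = best_parameter(sim, genes, curated, [0.2, 0.5, 0.95])
```

The parameter chosen on the reference species would then be applied to the related, uncurated genomes; with a real tool, the cutoff here would be replaced by that tool's own parameter (for example, an inflation value).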
ERIC Educational Resources Information Center
National Institute of General Medical Sciences (NIGMS), 2009
2009-01-01
Computer advances now let researchers quickly search through DNA sequences to find gene variations that could lead to disease, simulate how flu might spread through one's school, and design three-dimensional animations of molecules that rival any video game. By teaming computers and biology, scientists can answer new and old questions that could…
Discovering novel subsystems using comparative genomics
Ferrer, Luciana; Shearer, Alexander G.; Karp, Peter D.
2011-01-01
Motivation: Key problems for computational genomics include discovering novel pathways in genome data, and discovering functional interaction partners for genes to define new members of partially elucidated pathways. Results: We propose a novel method for the discovery of subsystems from annotated genomes. For each gene pair, a score measuring the likelihood that the two genes belong to a same subsystem is computed using genome context methods. Genes are then grouped based on these scores, and the resulting groups are filtered to keep only high-confidence groups. Since the method is based on genome context analysis, it relies solely on structural annotation of the genomes. The method can be used to discover new pathways, find missing genes from a known pathway, find new protein complexes or other kinds of functional groups and assign function to genes. We tested the accuracy of our method in Escherichia coli K-12. In one configuration of the system, we find that 31.6% of the candidate groups generated by our method match a known pathway or protein complex closely, and that we rediscover 31.2% of all known pathways and protein complexes of at least 4 genes. We believe that a significant proportion of the candidates that do not match any known group in E.coli K-12 corresponds to novel subsystems that may represent promising leads for future laboratory research. We discuss in-depth examples of these findings. Availability: Predicted subsystems are available at http://brg.ai.sri.com/pwy-discovery/journal.html. Contact: lferrer@ai.sri.com Supplementary information: Supplementary data are available at Bioinformatics online. PMID:21775308
Algebraic model checking for Boolean gene regulatory networks.
Tran, Quoc-Nam
2011-01-01
We present a computational method in which modular and Groebner bases (GB) computation in Boolean rings are used for solving problems in Boolean gene regulatory networks (BN). In contrast to other known algebraic approaches, the degree of intermediate polynomials never grows during the calculation of Groebner bases with our method, resulting in a significant improvement in running time and memory consumption. We also show how calculation in temporal logic for model checking can be done by means of our direct and efficient Groebner basis computation in Boolean rings. We present our experimental results in finding attractors and control strategies of Boolean networks to illustrate our theoretical arguments. The results are promising. Our algebraic approach is more efficient than the state-of-the-art model checker NuSMV on BNs. More importantly, our approach finds all solutions for the BN problems.
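To make the notion of an attractor concrete, the sketch below enumerates all attractors of a small synchronous Boolean network by exhaustive state-space search. This is deliberately not the Groebner basis method of the abstract above: brute force visits all 2^n states and is only feasible for small networks, which is exactly the cost that algebraic approaches aim to avoid. The toy network and the function name are hypothetical.

```python
from itertools import product


def find_attractors(update_fns):
    """Enumerate attractors of a synchronous Boolean network by brute force.

    update_fns[i] maps the current state (a tuple of 0/1) to the next
    value of node i. Returns a sorted list of attractors, each given as
    a tuple of states starting from its smallest state.
    """
    n = len(update_fns)

    def step(state):
        return tuple(int(bool(f(state))) for f in update_fns)

    attractors = set()
    for start in product((0, 1), repeat=n):
        seen = set()
        state = start
        while state not in seen:  # follow the trajectory until it repeats
            seen.add(state)
            state = step(state)
        # 'state' is the first repeated state, so it lies on the cycle.
        cycle = [state]
        nxt = step(state)
        while nxt != state:
            cycle.append(nxt)
            nxt = step(nxt)
        k = cycle.index(min(cycle))  # canonical rotation of the cycle
        attractors.add(tuple(cycle[k:] + cycle[:k]))
    return sorted(attractors)


# Hypothetical toy network: A' = B, B' = A, C' = A AND B.
toy = [lambda s: s[1], lambda s: s[0], lambda s: s[0] and s[1]]
toy_attractors = find_attractors(toy)
```

For this toy network the search finds two fixed points and one cycle of length two.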
Palumbo, Michael J; Newberg, Lee A
2010-07-01
The transcription of a gene from its DNA template into an mRNA molecule is the first, and most heavily regulated, step in gene expression. Especially in bacteria, regulation is typically achieved via the binding of a transcription factor (protein) or small RNA molecule to the chromosomal region upstream of a regulated gene. The protein or RNA molecule recognizes a short, approximately conserved sequence within a gene's promoter region and, by binding to it, either enhances or represses expression of the nearby gene. Since the sought-for motif (pattern) is short and accommodating to variation, computational approaches that scan for binding sites have trouble distinguishing functional sites from look-alikes. Many computational approaches are unable to find the majority of experimentally verified binding sites without also finding many false positives. Phyloscan overcomes this difficulty by exploiting two key features of functional binding sites: (i) these sites are typically more conserved evolutionarily than are non-functional DNA sequences; and (ii) these sites often occur two or more times in the promoter region of a regulated gene. The website is free and open to all users, and there is no login requirement. Address: (http://bayesweb.wadsworth.org/phyloscan/).
Identifying the impact of G-quadruplexes on Affymetrix 3' arrays using cloud computing.
Memon, Farhat N; Owen, Anne M; Sanchez-Graillet, Olivia; Upton, Graham J G; Harrison, Andrew P
2010-01-15
A tetramer quadruplex structure is formed by four parallel strands of DNA/RNA containing runs of guanine. These quadruplexes are able to form because guanine can Hoogsteen hydrogen bond to other guanines, and a tetrad of guanines can form a stable arrangement. Recently we have discovered that probes on Affymetrix GeneChips that contain runs of guanine do not measure gene expression reliably. We associate this finding with the likelihood that quadruplexes are forming on the surface of GeneChips. In order to cope with the rapidly expanding size of GeneChip array datasets in the public domain, we are exploring the use of cloud computing to replicate our experiments on 3' arrays to look at the effect of the location of G-spots (runs of guanines). Cloud computing is a recently introduced high-performance solution that takes advantage of the computational infrastructure of large organisations such as Amazon and Google. We expect that cloud computing will become widely adopted because it enables bioinformaticians to avoid capital expenditure on expensive computing resources and to pay a cloud computing provider only for what is used. Moreover, in addition to its financial efficiency, cloud computing is an ecologically friendly technology that enables efficient data-sharing, and we expect it to be faster for development purposes. Here we propose the advantageous use of cloud computing to perform a large data-mining analysis of public domain 3' arrays.
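Locating G-spots within probe sequences, as discussed in the abstract above, can be sketched with a short regular-expression scan. This is an illustrative sketch only: the minimum run length of four guanines is an assumption (the abstract speaks only of "runs of guanine"), and the function name and example probe are hypothetical.

```python
import re


def g_spots(probe, min_run=4):
    """Return (start, length) for every run of at least min_run guanines.

    min_run=4 is an assumed threshold, not a value taken from the abstract.
    Positions are 0-based offsets into the probe sequence.
    """
    pattern = re.compile("G{%d,}" % min_run)
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(probe.upper())]


# Hypothetical probe fragment with two guanine runs.
hits = g_spots("ACGGGGTTAGGGGGC")
```

Recording the start offset makes it possible to relate the position of a run within the probe to the measured signal, which is the kind of location effect the abstract sets out to examine.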
Performance and scalability evaluation of "Big Memory" on Blue Gene Linux.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yoshii, K.; Iskra, K.; Naik, H.
2011-05-01
We address memory performance issues observed in Blue Gene Linux and discuss the design and implementation of 'Big Memory' - an alternative, transparent memory space introduced to eliminate the memory performance issues. We evaluate the performance of Big Memory using custom memory benchmarks, NAS Parallel Benchmarks, and the Parallel Ocean Program, at a scale of up to 4,096 nodes. We find that Big Memory successfully resolves the performance issues normally encountered in Blue Gene Linux. For the ocean simulation program, we even find that Linux with Big Memory provides better scalability than does the lightweight compute node kernel designed solely for high-performance applications. Originally intended exclusively for compute node tasks, our new memory subsystem dramatically improves the performance of certain I/O node applications as well. We demonstrate this performance using the central processor of the LOw Frequency ARray radio telescope as an example.
Zhang, Weixiong; Ruan, Jianhua; Ho, Tuan-Hua David; You, Youngsook; Yu, Taotao; Quatrano, Ralph S
2005-07-15
A fundamental problem of computational genomics is identifying the genes that respond to certain endogenous cues and environmental stimuli. This problem can be referred to as targeted gene finding. Since gene regulation is mainly determined by the binding of transcription factors and cis-regulatory DNA sequences, most existing gene annotation methods, which exploit the conservation of open reading frames, are not effective in finding target genes. A viable approach to targeted gene finding is to exploit the cis-regulatory elements that are known to be responsible for the transcription of target genes. Given such cis-elements, putative target genes whose promoters contain the elements can be identified. As a case study, we apply the above approach to predict the genes in model plant Arabidopsis thaliana which are inducible by a phytohormone, abscisic acid (ABA), and abiotic stress, such as drought, cold and salinity. We first construct and analyze two ABA specific cis-elements, ABA-responsive element (ABRE) and its coupling element (CE), in A.thaliana, based on their conservation in rice and other cereal plants. We then use the ABRE-CE module to identify putative ABA-responsive genes in A.thaliana. Based on RT-PCR verification and the results from literature, this method has an accuracy rate of 67.5% for the top 40 predictions. The cis-element based targeted gene finding approach is expected to be widely applicable since a large number of cis-elements in many species are available.
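The core step of the approach above, selecting genes whose promoters contain a given pair of cis-elements, can be sketched as a simple promoter scan. This is a minimal sketch under loud assumptions: the motif patterns in the example are placeholders rather than the ABRE and CE models constructed in the paper, the maximum spacing between the two elements is a hypothetical parameter, and only the forward strand is scanned.

```python
import re


def find_module_genes(promoters, abre_pat, ce_pat, max_gap=50):
    """Return genes whose promoter has both elements within max_gap bases.

    promoters: dict mapping gene name -> promoter sequence (forward strand).
    abre_pat, ce_pat: regular expressions for the two elements (placeholders).
    max_gap: assumed maximum number of bases between the two sites.
    """
    abre_re, ce_re = re.compile(abre_pat), re.compile(ce_pat)

    def near(a, c):
        # Number of bases between the two sites; negative if they overlap.
        gap = max(a[0], c[0]) - min(a[1], c[1])
        return 0 <= gap <= max_gap

    hits = []
    for gene, seq in promoters.items():
        seq = seq.upper()
        abre_spans = [m.span() for m in abre_re.finditer(seq)]
        ce_spans = [m.span() for m in ce_re.finditer(seq)]
        if any(near(a, c) for a in abre_spans for c in ce_spans):
            hits.append(gene)
    return hits


# Hypothetical promoters and placeholder motif patterns.
promoters = {
    "g1": "TTACGTGGCAAAAACACGCGTT",  # both elements, 5 bases apart
    "g2": "TTACGTGGCTTTT",           # first element only
    "g3": "TTTTCACGCGTTTT",          # second element only
}
candidates = find_module_genes(promoters, "ACGTGGC", "CACGCG", max_gap=50)
```

A realistic scan would also check the reverse complement and would typically use position weight matrices instead of exact patterns; candidates found this way are predictions that still need experimental verification, as the abstract's RT-PCR check illustrates.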
Computing and Applying Atomic Regulons to Understand Gene Expression and Regulation
Faria, José P.; Davis, James J.; Edirisinghe, Janaka N.; Taylor, Ronald C.; Weisenhorn, Pamela; Olson, Robert D.; Stevens, Rick L.; Rocha, Miguel; Rocha, Isabel; Best, Aaron A.; DeJongh, Matthew; Tintle, Nathan L.; Parrello, Bruce; Overbeek, Ross; Henry, Christopher S.
2016-01-01
Understanding gene function and regulation is essential for the interpretation, prediction, and ultimate design of cell responses to changes in the environment. An important step toward meeting the challenge of understanding gene function and regulation is the identification of sets of genes that are always co-expressed. These gene sets, Atomic Regulons (ARs), represent fundamental units of function within a cell and could be used to associate genes of unknown function with cellular processes and to enable rational genetic engineering of cellular systems. Here, we describe an approach for inferring ARs that leverages large-scale expression data sets, gene context, and functional relationships among genes. We computed ARs for Escherichia coli based on 907 gene expression experiments and compared our results with gene clusters produced by two prevalent data-driven methods: Hierarchical clustering and k-means clustering. We compared ARs and purely data-driven gene clusters to the curated set of regulatory interactions for E. coli found in RegulonDB, showing that ARs are more consistent with gold standard regulons than are data-driven gene clusters. We further examined the consistency of ARs and data-driven gene clusters in the context of gene interactions predicted by Context Likelihood of Relatedness (CLR) analysis, finding that the ARs show better agreement with CLR predicted interactions. We determined the impact of increasing amounts of expression data on AR construction and find that while more data improve ARs, it is not necessary to use the full set of gene expression experiments available for E. coli to produce high quality ARs. In order to explore the conservation of co-regulated gene sets across different organisms, we computed ARs for Shewanella oneidensis, Pseudomonas aeruginosa, Thermus thermophilus, and Staphylococcus aureus, each of which represents increasing degrees of phylogenetic distance from E. coli. 
Comparison of the organism-specific ARs showed that the consistency of AR gene membership correlates with phylogenetic distance, but there is clear variability in the regulatory networks of closely related organisms. As large scale expression data sets become increasingly common for model and non-model organisms, comparative analyses of atomic regulons will provide valuable insights into fundamental regulatory modules used across the bacterial domain. PMID:27933038
Computational gene expression profiling under salt stress reveals patterns of co-expression
Sanchita; Sharma, Ashok
2016-01-01
Plants respond differently to environmental conditions. Among various abiotic stresses, salt stress is a condition where excess salt in soil causes inhibition of plant growth. To understand the response of plants to the stress conditions, identification of the responsible genes is required. Clustering is a data mining technique used to group the genes with similar expression. The genes of a cluster show similar expression and function. We applied clustering algorithms on gene expression data of Solanum tuberosum showing differential expression in Capsicum annuum under salt stress. The clusters that were common to multiple algorithms were taken further for analysis. Principal component analysis (PCA) further validated the findings of the other clustering algorithms by visualizing their clusters in three-dimensional space. Functional annotation results revealed that most of the genes were involved in stress related responses. Our findings suggest that these algorithms may be helpful in the prediction of the function of co-expressed genes. PMID:26981411
Detecting novel genes with sparse arrays
Haiminen, Niina; Smit, Bart; Rautio, Jari; Vitikainen, Marika; Wiebe, Marilyn; Martinez, Diego; Chee, Christine; Kunkel, Joe; Sanchez, Charles; Nelson, Mary Anne; Pakula, Tiina; Saloheimo, Markku; Penttilä, Merja; Kivioja, Teemu
2014-01-01
Species-specific genes play an important role in defining the phenotype of an organism. However, current gene prediction methods can only efficiently find genes that share features such as sequence similarity or general sequence characteristics with previously known genes. Novel sequencing methods and tiling arrays can be used to find genes without prior information and they have demonstrated that novel genes can still be found from extensively studied model organisms. Unfortunately, these methods are expensive and thus are not easily applicable, e.g., to finding genes that are expressed only in very specific conditions. We demonstrate a method for finding novel genes with sparse arrays, applying it on the 33.9 Mb genome of the filamentous fungus Trichoderma reesei. Our computational method does not require normalisations between arrays and it takes into account the multiple-testing problem typical for analysis of microarray data. In contrast to tiling arrays, that use overlapping probes, only one 25mer microarray oligonucleotide probe was used for every 100 b. Thus, only relatively little space on a microarray slide was required to cover the intergenic regions of a genome. The analysis was done as a by-product of a conventional microarray experiment with no additional costs. We found at least 23 good candidates for novel transcripts that could code for proteins and all of which were expressed at high levels. Candidate genes were found to neighbour ire1 and cre1 and many other regulatory genes. Our simple, low-cost method can easily be applied to finding novel species-specific genes without prior knowledge of their sequence properties. PMID:20691772
Hsiao, Tzu-Hung; Chiu, Yu-Chiao; Hsu, Pei-Yin; Lu, Tzu-Pin; Lai, Liang-Chuan; Tsai, Mong-Hsun; Huang, Tim H.-M.; Chuang, Eric Y.; Chen, Yidong
2016-01-01
Several mutual information (MI)-based algorithms have been developed to identify dynamic gene-gene and function-function interactions governed by key modulators (genes, proteins, etc.). Due to intensive computation, however, these methods rely heavily on prior knowledge and are limited in genome-wide analysis. We present the modulated gene/gene set interaction (MAGIC) analysis to systematically identify genome-wide modulation of interaction networks. Based on a novel statistical test employing conjugate Fisher transformations of correlation coefficients, MAGIC features fast computation and adaptation to variations of clinical cohorts. In simulated datasets MAGIC achieved greatly improved computational efficiency and overall superior performance compared with the MI-based method. We applied MAGIC to construct the estrogen receptor (ER) modulated gene and gene set (representing biological function) interaction networks in breast cancer. Several novel interaction hubs and functional interactions were discovered. ER+ dependent interaction between TGFβ and NFκB was further shown to be associated with patient survival. The findings were verified in independent datasets. Using MAGIC, we also assessed the essential roles of ER modulation in another hormonal cancer, ovarian cancer. Overall, MAGIC is a systematic framework for comprehensively identifying and constructing the modulated interaction networks in a whole-genome landscape. A MATLAB implementation of MAGIC is available for academic use at https://github.com/chiuyc/MAGIC. PMID:26972162
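The idea of testing whether a gene-gene correlation differs between two modulator states can be illustrated with the classical Fisher z-test for two independent correlation coefficients. This sketch is not MAGIC's statistic: the abstract describes a novel test based on conjugate Fisher transformations, whose details are not given here, so the textbook version is swapped in. The function name and the example numbers are hypothetical.

```python
import math


def fisher_z_diff(r1, n1, r2, n2):
    """Classical two-sample Fisher z-test for a difference in correlation.

    r1, r2: Pearson correlations of the same gene pair in two groups
            (for example, samples with high versus low modulator activity).
    n1, n2: group sizes (each must exceed 3).
    Returns (z, two_sided_p) under a normal approximation.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)           # Fisher transformation
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))   # standard error of z1 - z2
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))            # two-sided normal p-value
    return z, p


# Hypothetical example: strong correlation in one group, none in the other.
z_stat, p_value = fisher_z_diff(0.8, 53, 0.0, 53)
```

Because the statistic has a closed form, it can be evaluated for every gene pair without the density estimation that MI-based methods need, which is in the spirit of the speed advantage the abstract reports.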
2011-01-01
Background: Several computational candidate gene selection and prioritization methods have recently been developed. These in silico selection and prioritization techniques are usually based on two central approaches - the examination of similarities to known disease genes and/or the evaluation of functional annotation of genes. Each of these approaches has its own caveats. Here we employ a previously described method of candidate gene prioritization based mainly on gene annotation, together with a technique based on the evaluation of pertinent sequence motifs or signatures, in an attempt to refine the gene prioritization approach. We apply this approach to X-linked mental retardation (XLMR), a group of heterogeneous disorders for which some of the underlying genetics is known. Results: The gene annotation-based binary filtering method yielded a ranked list of putative XLMR candidate genes with good plausibility of being associated with the development of mental retardation. In parallel, a motif finding approach based on linear discriminatory analysis (LDA) was employed to identify short sequence patterns that may discriminate XLMR from non-XLMR genes. High rates (>80%) of correct classification were achieved, suggesting that the identification of these motifs effectively captures genomic signals associated with XLMR vs. non-XLMR genes. The computational tools developed for the motif-based LDA are integrated into the freely available genomic analysis portal Galaxy (http://main.g2.bx.psu.edu/). Nine genes (APLN, ZC4H2, MAGED4, MAGED4B, RAP2C, FAM156A, FAM156B, TBL1X, and UXT) were highlighted as highly ranked XLMR candidates. Conclusions: Combining gene annotation information with sequence motif-oriented computational candidate gene prediction methods provides an added benefit in generating a list of plausible candidate genes, as has been demonstrated for XLMR.
Reviewers: This article was reviewed by Dr Barbara Bardoni (nominated by Prof Juergen Brosius); Prof Neil Smalheiser and Dr Dustin Holloway (nominated by Prof Charles DeLisi). PMID:21668950
DNA context represents transcription regulation of the gene in mouse embryonic stem cells
NASA Astrophysics Data System (ADS)
Ha, Misook; Hong, Soondo
2016-04-01
Understanding gene regulatory information in DNA remains a significant challenge in biomedical research. This study presents a computational approach to infer gene regulatory programs from primary DNA sequences. Using DNA around transcription start sites as attributes, our model predicts gene regulation in the gene. We find that H3K27ac around TSS is an informative descriptor of the transcription program in mouse embryonic stem cells. We build a computational model inferring the cell-type-specific H3K27ac signatures in the DNA around TSS. A comparison of embryonic stem cell and liver cell-specific H3K27ac signatures in DNA shows that the H3K27ac signatures in DNA around TSS efficiently distinguish the cell-type specific H3K27ac peaks and the gene regulation. The arrangement of the H3K27ac signatures inferred from the DNA represents the transcription regulation of the gene in mESC. We show that the DNA around transcription start sites is associated with the gene regulatory program by specific interaction with H3K27ac.
DNA context represents transcription regulation of the gene in mouse embryonic stem cells.
Ha, Misook; Hong, Soondo
2016-04-14
Understanding gene regulatory information in DNA remains a significant challenge in biomedical research. This study presents a computational approach to infer gene regulatory programs from primary DNA sequences. Using DNA around transcription start sites as attributes, our model predicts gene regulation in the gene. We find that H3K27ac around TSS is an informative descriptor of the transcription program in mouse embryonic stem cells. We build a computational model inferring the cell-type-specific H3K27ac signatures in the DNA around TSS. A comparison of embryonic stem cell and liver cell-specific H3K27ac signatures in DNA shows that the H3K27ac signatures in DNA around TSS efficiently distinguish the cell-type specific H3K27ac peaks and the gene regulation. The arrangement of the H3K27ac signatures inferred from the DNA represents the transcription regulation of the gene in mESC. We show that the DNA around transcription start sites is associated with the gene regulatory program by specific interaction with H3K27ac.
Genome-Wide Analysis of Gene-Gene and Gene-Environment Interactions Using Closed-Form Wald Tests.
Yu, Zhaoxia; Demetriou, Michael; Gillen, Daniel L
2015-09-01
Despite the successful discovery of hundreds of variants for complex human traits using genome-wide association studies, the degree to which genes and environmental risk factors jointly affect disease risk is largely unknown. One obstacle toward this goal is that the computational effort required for testing gene-gene and gene-environment interactions is enormous. As a result, numerous computationally efficient tests were recently proposed. However, the validity of these methods often relies on unrealistic assumptions such as additive main effects, main effects at only one variable, no linkage disequilibrium between the two single-nucleotide polymorphisms (SNPs) in a pair or gene-environment independence. Here, we derive closed-form and consistent estimates for interaction parameters and propose to use Wald tests for testing interactions. The Wald tests are asymptotically equivalent to the likelihood ratio tests (LRTs), largely considered to be the gold standard tests but generally too computationally demanding for genome-wide interaction analysis. Simulation studies show that the proposed Wald tests perform very similarly to the LRTs but are much more computationally efficient. Applying the proposed tests to a genome-wide study of multiple sclerosis, we identify interactions within the major histocompatibility complex region. In this application, we find that (1) focusing on pairs where both SNPs are marginally significant leads to more significant interactions when compared to focusing on pairs where at least one SNP is marginally significant; and (2) parsimonious parameterization of interaction effects might decrease, rather than increase, statistical power. © 2015 WILEY PERIODICALS, INC.
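For orientation, the generic one-parameter Wald test that the abstract above builds on can be written as follows. This is the textbook form only; the symbol $\theta$ for the interaction parameter is an assumption introduced here, and the paper's specific closed-form estimator and its variance are not reproduced.

```latex
% Generic Wald test for a single interaction parameter \theta (notation assumed).
H_0:\ \theta = 0, \qquad
W = \frac{\hat{\theta}^{2}}{\widehat{\operatorname{Var}}(\hat{\theta})}
\;\xrightarrow{\;d\;}\; \chi^{2}_{1} \quad \text{under } H_0 .
```

When $\hat{\theta}$ and its variance estimate are available in closed form, $W$ requires no iterative model fitting, which is why such a test is far cheaper than a likelihood ratio test when applied to every SNP pair genome-wide.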
A Review of Computational Intelligence Methods for Eukaryotic Promoter Prediction.
Singh, Shailendra; Kaur, Sukhbir; Goel, Neelam
2015-01-01
In past decades, the prediction of genes in DNA sequences has attracted the attention of many researchers, but owing to the complex structure of DNA, correctly locating genes remains extremely difficult. DNA contains a large number of regulatory regions that help in the transcription of a gene. The promoter is one such region, and finding its location is a challenging problem. Various computational methods for promoter prediction have been developed over the past few years. This paper reviews these promoter prediction methods. Several difficulties and pitfalls encountered by these methods are also detailed, along with future research directions.
Gene-network inference by message passing
NASA Astrophysics Data System (ADS)
Braunstein, A.; Pagnani, A.; Weigt, M.; Zecchina, R.
2008-01-01
The inference of gene-regulatory processes from gene-expression data belongs to the major challenges of computational systems biology. Here we address the problem from a statistical-physics perspective and develop a message-passing algorithm which is able to infer sparse, directed and combinatorial regulatory mechanisms. Using the replica technique, the algorithmic performance can be characterized analytically for artificially generated data. The algorithm is applied to genome-wide expression data of baker's yeast under various environmental conditions. We find clear cases of combinatorial control, and enrichment in common functional annotations of regulated genes and their regulators.
Pan, Qian; Peng, Jin; Zhou, Xue; Yang, Hao; Zhang, Wei
2012-07-01
To screen out important genes from large gene microarray data sets obtained after nerve injury, we combined the gene ontology (GO) method with computer pattern recognition technology to find key genes responding to nerve injury, and then verified one of the screened-out genes. Data mining and gene ontology analysis of the gene chip data set GSE26350 were carried out in MATLAB. Cd44 was selected from the screened-out key gene molecular spectrum by comparing the genes' GO terms and their positions on the principal component score map. Functional interference was used to disrupt the normal binding of Cd44 to one of its ligands, chondroitin sulfate C (CSC), in order to observe neurite extension. Gene ontology analysis showed that the first genes on the score map (marked by red *) were mainly distributed in molecular function GO terms such as molecular transducer activity, receptor activity, and protein binding. Cd44 is one of six effector protein genes and drew our attention because of its functional diversity. After different reagents were added to the medium to interfere with the normal binding of CSC and Cd44, varying degrees of relief of CSC's inhibition of neurite extension were observed. CSC can inhibit neurite extension by binding Cd44 on the neuron membrane. This verifies that important genes in given physiological processes can be identified by gene ontology analysis of gene chip data.
Pavlidis, Paul; Qin, Jie; Arango, Victoria; Mann, John J; Sibille, Etienne
2004-06-01
One of the challenges in the analysis of gene expression data is placing the results in the context of other data available about genes and their relationships to each other. Here, we approach this problem in the study of gene expression changes associated with age in two areas of the human prefrontal cortex, comparing two computational methods. The first method, "overrepresentation analysis" (ORA), is based on statistically evaluating the fraction of genes in a particular gene ontology class found among the set of genes showing age-related changes in expression. The second method, "functional class scoring" (FCS), examines the statistical distribution of individual gene scores among all genes in the gene ontology class and does not involve an initial gene selection step. We find that FCS yields more consistent results than ORA, and the results of ORA depended strongly on the gene selection threshold. Our findings highlight the utility of functional class scoring for the analysis of complex expression data sets and emphasize the advantage of considering all available genomic information rather than sets of genes that pass a predetermined "threshold of significance."
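The overrepresentation analysis described above is commonly carried out with a hypergeometric (one-sided Fisher) test. The sketch below shows that calculation. The abstract does not state which statistic the authors used, so the choice of test, the function name, and the parameter names are assumptions made for illustration.

```python
from math import comb

def ora_p_value(n_genes, n_in_class, n_selected, n_overlap):
    """Upper-tail hypergeometric p-value for overrepresentation.

    n_genes    -- all genes measured
    n_in_class -- genes annotated to the gene ontology class
    n_selected -- genes passing the selection threshold
    n_overlap  -- selected genes that belong to the class
    """
    max_k = min(n_in_class, n_selected)
    total = comb(n_genes, n_selected)
    # Probability of seeing n_overlap or more class members by chance.
    tail = sum(comb(n_in_class, k) * comb(n_genes - n_in_class, n_selected - k)
               for k in range(n_overlap, max_k + 1))
    return tail / total
```

Note that `n_selected` and `n_overlap` both change when the selection threshold moves, which is one way to see why ORA results can depend strongly on that threshold, whereas FCS uses the scores of all genes in the class.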
DroSpeGe: rapid access database for new Drosophila species genomes.
Gilbert, Donald G
2007-01-01
The Drosophila species comparative genome database DroSpeGe (http://insects.eugenes.org/DroSpeGe/) has provided genome researchers with rapid, usable access to 12 new and old Drosophila genomes since its inception in 2004. Scientists can use, with minimal computing expertise, the wealth of new genome information for developing new insights into insect evolution. New genome assemblies provided by several sequencing centers have been annotated with known model organism gene homologies and gene predictions to provide basic comparative data. TeraGrid supplies the shared cyberinfrastructure for the primary computations. This genome database includes homologies to Drosophila melanogaster and eight other eukaryote model genomes, and gene predictions from several groups. BLAST searches of the newest assemblies are integrated with genome maps. GBrowse maps provide detailed views of cross-species aligned genomes. BioMart provides for data mining of annotations and sequences. Common chromosome maps identify major synteny among species. Potential gain and loss of genes is suggested by Gene Ontology groupings for genes of the new species. Summaries of essential genome statistics include sizes, genes found and predicted, homology among genomes, phylogenetic trees of species, and comparisons of several gene predictions for sensitivity and specificity in finding new and known genes.
Retrieving relevant time-course experiments: a study on Arabidopsis microarrays.
Şener, Duygu Dede; Oğul, Hasan
2016-06-01
Understanding time-course regulation of genes in response to a stimulus is a major concern in current systems biology. The problem is usually approached by computational methods to model the gene behaviour or its networked interactions with the others by a set of latent parameters. The model parameters can be estimated through a meta-analysis of available data obtained from other relevant experiments. The key question here is how to find the relevant experiments which are potentially useful in analysing current data. In this study, the authors address this problem in the context of time-course gene expression experiments from an information retrieval perspective. To this end, they introduce a computational framework that takes a time-course experiment as a query and reports a list of relevant experiments retrieved from a given repository. These retrieved experiments can then be used to associate the environmental factors of query experiment with the findings previously reported. The model is tested using a set of time-course Arabidopsis microarrays. The experimental results show that relevant experiments can be successfully retrieved based on content similarity.
Computing and Applying Atomic Regulons to Understand Gene Expression and Regulation
DOE Office of Scientific and Technical Information (OSTI.GOV)
Faria, José P.; Davis, James J.; Edirisinghe, Janaka N.; ...
2016-11-24
Understanding gene function and regulation is essential for the interpretation, prediction, and ultimate design of cell responses to changes in the environment. A multitude of technologies, abstractions, and interpretive frameworks have emerged to answer the challenges presented by genome function and regulatory network inference. Here, we propose a new approach for producing biologically meaningful clusters of coexpressed genes, called Atomic Regulons (ARs), based on expression data, gene context, and functional relationships. We demonstrate this new approach by computing ARs for Escherichia coli, which we compare with the coexpressed gene clusters predicted by two prevalent existing methods: hierarchical clustering and k-means clustering. We test the consistency of ARs predicted by all methods against expected interactions predicted by the Context Likelihood of Relatedness (CLR) mutual information based method, finding that the ARs produced by our approach show better agreement with CLR interactions. We then apply our method to compute ARs for four other genomes: Shewanella oneidensis, Pseudomonas aeruginosa, Thermus thermophilus, and Staphylococcus aureus. We compare the AR clusters from all genomes to study the similarity of coexpression among a phylogenetically diverse set of species, identifying subsystems that show remarkable similarity over wide phylogenetic distances. We also study the sensitivity of our method for computing ARs to the expression data used in the computation, showing that our new approach requires less data than competing approaches to converge to a near final configuration of ARs. We go on to use our sensitivity analysis to identify the specific experiments that lead most rapidly to the final set of ARs for E. coli. As a result, this analysis produces insights into improving the design of gene expression experiments.
Adetiba, Emmanuel; Olugbara, Oludayo O
2015-01-01
Lung cancer is one of the diseases responsible for a large number of cancer related death cases worldwide. The recommended standard for screening and early detection of lung cancer is the low dose computed tomography. However, many patients diagnosed die within one year, which makes it essential to find alternative approaches for screening and early detection of lung cancer. We present computational methods that can be implemented in a functional multi-genomic system for classification, screening and early detection of lung cancer victims. Samples of top ten biomarker genes previously reported to have the highest frequency of lung cancer mutations and sequences of normal biomarker genes were respectively collected from the COSMIC and NCBI databases to validate the computational methods. Experiments were performed based on the combinations of Z-curve and tetrahedron affine transforms, Histogram of Oriented Gradient (HOG), Multilayer perceptron and Gaussian Radial Basis Function (RBF) neural networks to obtain an appropriate combination of computational methods to achieve improved classification of lung cancer biomarker genes. Results show that a combination of affine transforms of Voss representation, HOG genomic features and Gaussian RBF neural network perceptibly improves classification accuracy, specificity and sensitivity of lung cancer biomarker genes as well as achieving low mean square error.
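The Voss representation and the Z-curve mentioned above are standard numerical encodings of a DNA sequence. The sketch below shows both in their textbook form. The function names are hypothetical, and the affine transforms, HOG features, and neural networks used in the study are not reproduced here.

```python
def voss_representation(sequence):
    """Four binary indicator sequences, one per nucleotide."""
    seq = sequence.upper()
    # Ambiguous symbols such as N are 0 in all four indicator sequences.
    return {base: [1 if s == base else 0 for s in seq] for base in "ACGT"}

def z_curve(sequence):
    """Cumulative Z-curve coordinates (x, y, z) after each nucleotide.

    x: purines (A, G) minus pyrimidines (C, T)
    y: amino bases (A, C) minus keto bases (G, T)
    z: weak hydrogen bonds (A, T) minus strong hydrogen bonds (G, C)
    """
    steps = {"A": (1, 1, 1), "C": (-1, 1, -1),
             "G": (1, -1, -1), "T": (-1, -1, 1)}
    x = y = z = 0
    points = []
    for s in sequence.upper():
        dx, dy, dz = steps.get(s, (0, 0, 0))  # unknown symbols add nothing
        x, y, z = x + dx, y + dy, z + dz
        points.append((x, y, z))
    return points
```

Either encoding turns a sequence into fixed numeric signals from which image-like or spectral features can then be extracted.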
Why is the correlation between gene importance and gene evolutionary rate so weak?
Wang, Zhi; Zhang, Jianzhi
2009-01-01
One of the few commonly believed principles of molecular evolution is that functionally more important genes (or DNA sequences) evolve more slowly than less important ones. This principle is widely used by molecular biologists in daily practice. However, recent genomic analysis of a diverse array of organisms found only weak, negative correlations between the evolutionary rate of a gene and its functional importance, typically measured under a single benign lab condition. A frequently suggested cause of the above finding is that gene importance determined in the lab differs from that in an organism's natural environment. Here, we test this hypothesis in yeast using gene importance values experimentally determined in 418 lab conditions or computationally predicted for 10,000 nutritional conditions. In no single condition or combination of conditions did we find a much stronger negative correlation, which is explainable by our subsequent finding that always-essential (enzyme) genes do not evolve significantly more slowly than sometimes-essential or always-nonessential ones. Furthermore, we verified that functional density, approximated by the fraction of amino acid sites within protein domains, is uncorrelated with gene importance. Thus, neither the lab-nature mismatch nor a potentially biased among-gene distribution of functional density explains the observed weakness of the correlation between gene importance and evolutionary rate. We conclude that the weakness is factual, rather than artifactual. In addition to being weakened by population genetic reasons, the correlation is likely to have been further weakened by the presence of multiple nontrivial rate determinants that are independent from gene importance. 
These findings notwithstanding, we show that the principle of slower evolution of more important genes does have some predictive power when genes with vastly different evolutionary rates are compared, explaining why the principle can be practically useful despite the weakness of the correlation.
Namkung, Junghyun; Nam, Jin-Wu; Park, Taesung
2007-01-01
Many genes with major effects on quantitative traits have been reported to interact with other genes. However, finding a group of interacting genes from thousands of SNPs is challenging. Hence, an efficient and robust algorithm is needed. The genetic algorithm (GA) is useful in searching for the optimal solution from a very large searchable space. In this study, we show that genome-wide interaction analysis using GA and a statistical interaction model can provide a practical method to detect biologically interacting loci. We focus our search on transcriptional regulators by analyzing gene × gene interactions for cancer-related genes. The expression values of three cancer-related genes were selected from the expression data of the Genetic Analysis Workshop 15 Problem 1 data set. We implemented a GA to identify the expression quantitative trait loci that are significantly associated with expression levels of the cancer-related genes. The time complexity of the GA was compared with that of an exhaustive search algorithm. As a result, our GA, which included heuristic methods, such as archive, elitism, and local search, has greatly reduced computational time in a genome-wide search for gene × gene interactions. In general, the GA took one-fifth the computation time of an exhaustive search for the most significant pair of single-nucleotide polymorphisms. PMID:18466570
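The heuristics named in the abstract (archive, elitism, and local search) can be sketched in a small genetic algorithm over SNP pairs. This is an illustrative toy, not the authors' implementation: the interpretation of each heuristic, the crossover and mutation rules, all parameter values, and the caller-supplied `score` function are assumptions.

```python
import random

def ga_pair_search(n_snps, score, pop_size=30, generations=40, seed=0):
    """Search for the SNP pair (i, j), i < j, with the highest score.

    `score(i, j)` is a hypothetical caller-supplied function that returns
    a larger value for a stronger interaction. Requires n_snps >= 2.
    """
    rng = random.Random(seed)
    archive = {}  # archive: every pair is scored at most once

    def fitness(pair):
        if pair not in archive:
            archive[pair] = score(*pair)
        return archive[pair]

    def make_pair(i, j):
        # Repair a degenerate pair (i == j), then store it in sorted order.
        while i == j:
            j = rng.randrange(n_snps)
        return (min(i, j), max(i, j))

    population = [make_pair(rng.randrange(n_snps), rng.randrange(n_snps))
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        best = population[0]
        # Local search: shift one locus of the current best by one position.
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            cand = make_pair((best[0] + di) % n_snps, (best[1] + dj) % n_snps)
            if fitness(cand) > fitness(best):
                best = cand
        parents = population[: max(2, pop_size // 2)]
        children = [best]  # elitism: the best pair always survives
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = make_pair(a[0], b[1])      # crossover: one locus from each
            if rng.random() < 0.2:             # mutation: resample one locus
                child = make_pair(child[0], rng.randrange(n_snps))
            children.append(child)
        population = children
    return max(population, key=fitness), len(archive)
```

The second return value is the number of distinct pairs actually scored, which can be compared with the n(n-1)/2 evaluations of an exhaustive search to gauge the savings on a given problem.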
“Guilt by Association” Is the Exception Rather Than the Rule in Gene Networks
Gillis, Jesse; Pavlidis, Paul
2012-01-01
Gene networks are commonly interpreted as encoding functional information in their connections. An extensively validated principle called guilt by association states that genes which are associated or interacting are more likely to share function. Guilt by association provides the central top-down principle for analyzing gene networks in functional terms or assessing their quality in encoding functional information. In this work, we show that functional information within gene networks is typically concentrated in only a very few interactions whose properties cannot be reliably related to the rest of the network. In effect, the apparent encoding of function within networks has been largely driven by outliers whose behaviour cannot even be generalized to individual genes, let alone to the network at large. While experimentalist-driven analysis of interactions may use prior expert knowledge to focus on the small fraction of critically important data, large-scale computational analyses have typically assumed that high-performance cross-validation in a network is due to a generalizable encoding of function. Because we find that gene function is not systemically encoded in networks, but dependent on specific and critical interactions, we conclude it is necessary to focus on the details of how networks encode function and what information computational analyses use to extract functional meaning. We explore a number of consequences of this and find that network structure itself provides clues as to which connections are critical and that systemic properties, such as scale-free-like behaviour, do not map onto the functional connectivity within networks. PMID:22479173
Replacing and Additive Horizontal Gene Transfer in Streptococcus
Choi, Sang Chul; Rasmussen, Matthew D.; Hubisz, Melissa J.; Gronau, Ilan; Stanhope, Michael J.; Siepel, Adam
2012-01-01
The prominent role of Horizontal Gene Transfer (HGT) in the evolution of bacteria is now well documented, but few studies have differentiated between evolutionary events that predominantly cause genes in one lineage to be replaced by homologs from another lineage (“replacing HGT”) and events that result in the addition of substantial new genomic material (“additive HGT”). Herein, we make use of the distinct phylogenetic signatures of replacing and additive HGTs in a genome-wide study of the important human pathogen Streptococcus pyogenes (SPY) and its close relatives S. dysgalactiae subspecies equisimilis (SDE) and S. dysgalactiae subspecies dysgalactiae (SDD). Using recently developed statistical models and computational methods, we find evidence for abundant gene flow of both kinds within each of the SPY and SDE clades and of reduced levels of exchange between SPY and SDD. In addition, our analysis strongly supports a pronounced asymmetry in SPY–SDE gene flow, favoring the SPY-to-SDE direction. This finding is of particular interest in light of the recent increase in virulence of pathogenic SDE. We find much stronger evidence for SPY–SDE gene flow among replacing than among additive transfers, suggesting a primary influence from homologous recombination between co-occurring SPY and SDE cells in human hosts. Putative virulence genes are correlated with transfer events, but this correlation is found to be driven by additive, not replacing, HGTs. The genes affected by additive HGTs are enriched for functions having to do with transposition, recombination, and DNA integration, consistent with previous findings, whereas replacing HGTs seem to influence a more diverse set of genes. Additive transfers are also found to be associated with evidence of positive selection. These findings shed new light on the manner in which HGT has shaped pathogenic bacterial genomes. PMID:22617954
Lamônica, Dionísia A C; Maximino, Luciana P; Feniman, Mariza Ribeiro; Silva, Greyce K; Zanchetta, Sthella; Abramides, Dagma V M; Passos-Bueno, Maria Rita; Rocha, Kátia; Richieri-Costa, Antonio
2010-09-01
To describe the clinical, speech, hearing, and imaging findings in three members of a Brazilian family with Saethre-Chotzen syndrome (SCS) who presented some unusual characteristics within the spectrum of the syndrome. Clinical evaluation was performed by a multidisciplinary team. Direct sequencing of the polymerase chain reaction-amplified coding region of the TWIST1 gene, routine and electrophysiological hearing evaluation, speech evaluation, and imaging studies through computed tomography (CT) scan and magnetic resonance imaging (MRI) were performed. TWIST1 gene analysis revealed a Pro136His mutation in all patients. Hearing evaluation showed peripheral and mixed hearing loss in two of the patients, one of them with severe unilateral microtia. Computed tomography scan showed structural middle ear anomalies, and MRI showed distortion of the skull contour as well as of some of the brain structures. We report a previously undescribed TWIST1 gene mutation in patients with SCS. There is evidence indicating that hearing loss (conductive and mixed) can be related to middle ear anomalies (microtia, high jugular bulb, and enlarged vestibules) as well as to brain stem anomalies. Here we discuss the relationship between the gene mutation and the clinical, imaging, speech, and hearing findings.
Gomez-Pulido, Juan A; Cerrada-Barrios, Jose L; Trinidad-Amado, Sebastian; Lanza-Gutierrez, Jose M; Fernandez-Diaz, Ramon A; Crawford, Broderick; Soto, Ricardo
2016-08-31
Metaheuristics are widely used to solve large combinatorial optimization problems in bioinformatics because of the huge set of possible solutions. Two representative problems are gene selection for cancer classification and biclustering of gene expression data. In most cases, these metaheuristics, as well as other non-linear techniques, apply a fitness function to each possible solution within a size-limited population. That step involves higher latencies than other parts of the algorithms, so the execution time of the applications depends mainly on the execution time of the fitness function. In addition, fitness functions are usually formulated in floating-point arithmetic. Consequently, a careful parallelization of these functions on reconfigurable hardware can accelerate the computation, especially if they are applied in parallel to several solutions of the population. A fine-grained parallelization of two floating-point fitness functions of different complexities and features, involved in biclustering of gene expression data and gene selection for cancer classification, yielded higher speedups and lower power consumption than conventional microprocessors. The results show better performance with reconfigurable hardware than with conventional microprocessors, in terms of both computing time and power consumption, not only because of the parallelization of the arithmetic operations but also thanks to the concurrent fitness evaluation of several individuals of the population in the metaheuristic. This is a good basis for building accelerated and low-energy solutions for intensive computing scenarios.
Computational Models of HIV-1 Resistance to Gene Therapy Elucidate Therapy Design Principles
Aviran, Sharon; Shah, Priya S.; Schaffer, David V.; Arkin, Adam P.
2010-01-01
Gene therapy is an emerging alternative to conventional anti-HIV-1 drugs, and can potentially control the virus while alleviating major limitations of current approaches. Yet, HIV-1's ability to rapidly acquire mutations and escape therapy presents a critical challenge to any novel treatment paradigm. Viral escape is thus a key consideration in the design of any gene-based technique. We develop a computational model of HIV's evolutionary dynamics in vivo in the presence of a genetic therapy to explore the impact of therapy parameters and strategies on the development of resistance. Our model is generic and captures the properties of a broad class of gene-based agents that inhibit early stages of the viral life cycle. We highlight the differences in viral resistance dynamics between gene and standard antiretroviral therapies, and identify key factors that impact long-term viral suppression. In particular, we underscore the importance of mutationally-induced viral fitness losses in cells that are not genetically modified, as these can severely constrain the replication of resistant virus. We also propose and investigate a novel treatment strategy that leverages upon gene therapy's unique capacity to deliver different genes to distinct cell populations, and we find that such a strategy can dramatically improve efficacy when used judiciously within a certain parametric regime. Finally, we revisit a previously-suggested idea of improving clinical outcomes by boosting the proliferation of the genetically-modified cells, but we find that such an approach has mixed effects on resistance dynamics. Our results provide insights into the short- and long-term effects of gene therapy and the role of its key properties in the evolution of resistance, which can serve as guidelines for the choice and optimization of effective therapeutic agents. PMID:20711350
Gene Expression Noise, Fitness Landscapes, and Evolution
NASA Astrophysics Data System (ADS)
Charlebois, Daniel
The stochastic (or noisy) process of gene expression can have fitness consequences for living organisms. For example, gene expression noise facilitates the development of drug resistance by increasing the time scale at which beneficial phenotypic states can be maintained. The present work investigates the relationship between gene expression noise and the fitness landscape. By incorporating the costs and benefits of gene expression, we track how the fluctuation magnitude and timescale of expression noise evolve in simulations of cell populations under stress. We find that properties of expression noise evolve to maximize fitness on the fitness landscape, and that low levels of expression noise emerge when the fitness benefits of gene expression exceed the fitness costs (and that high levels of noise emerge when the costs of expression exceed the benefits). The findings from our theoretical/computational work offer new hypotheses on the development of drug resistance, some of which are now being investigated in evolution experiments in our laboratory using well-characterized synthetic gene regulatory networks in budding yeast. Nserc Postdoctoral Fellowship (Grant No. PDF-453977-2014).
Variable neighborhood search for reverse engineering of gene regulatory networks.
Nicholson, Charles; Goodwin, Leslie; Clark, Corey
2017-01-01
A new search heuristic, Divided Neighborhood Exploration Search, designed to be used with inference algorithms such as Bayesian networks to improve on the reverse engineering of gene regulatory networks is presented. The approach systematically moves through the search space to find topologies representative of gene regulatory networks that are more likely to explain microarray data. In empirical testing it is demonstrated that the novel method is superior to the widely employed greedy search techniques in both the quality of the inferred networks and computational time. Copyright © 2016 Elsevier Inc. All rights reserved.
A computational search for box C/D snoRNA genes in the Drosophila melanogaster genome.
Accardo, M C; Giordano, E; Riccardo, S; Digilio, F A; Iazzetti, G; Calogero, R A; Furia, M
2004-12-12
In eukaryotes, the family of non-coding RNA genes includes a number of genes encoding small nucleolar RNAs (mainly C/D and H/ACA snoRNAs), which act as guides in the maturation or post-transcriptional modification of target RNA molecules. Since only a few examples of snoRNAs have been identified so far in Drosophila melanogaster (Dm) by screening cDNA libraries, integrating the molecular data with in silico identification of these types of genes could throw light on their organization in the Dm genome. We have performed a computational screening of the Dm genome for C/D snoRNA genes, followed by experimental validation of the putative candidates. Few of the 26 confirmed snoRNAs had been recognized by cDNA library analysis. The organization of the Dm genome was also found to be more variegated than previously suspected, with snoRNA genes nested in both the introns and exons of protein-coding genes. This finding suggests the presence of additional mechanisms of snoRNA biogenesis based on the alternative production of overlapping mRNA/snoRNA molecules. Additional information is available at http://www.bioinformatica.unito.it/bioinformatics/snoRNAs.
Computational Prediction and Validation of BAHD1 as a Novel Molecule for Ulcerative Colitis
NASA Astrophysics Data System (ADS)
Zhu, Huatuo; Wan, Xingyong; Li, Jing; Han, Lu; Bo, Xiaochen; Chen, Wenguo; Lu, Chao; Shen, Zhe; Xu, Chenfu; Chen, Lihua; Yu, Chaohui; Xu, Guoqiang
2015-07-01
Ulcerative colitis (UC) is a common inflammatory bowel disease (IBD) producing intestinal inflammation and tissue damage. The precise aetiology of UC remains unknown. In this study, we applied a rank-based expression profile comparative algorithm, gene set enrichment analysis (GSEA), to evaluate the expression profiles of UC patients and small interfering RNA (siRNA)-perturbed cells to predict proteins that might be essential in UC from publicly available expression profiles. We used quantitative PCR (qPCR) to characterize the expression levels of those genes predicted to be the most important for UC in dextran sodium sulphate (DSS)-induced colitic mice. We found that bromo-adjacent homology domain (BAHD1), a novel heterochromatinization factor in vertebrates, was the most downregulated gene. We further validated a potential role of BAHD1 as a regulatory factor for inflammation through the TNF signalling pathway in vitro. Our findings indicate that computational approaches leveraging public gene expression data can be used to infer potential genes or proteins for diseases, and BAHD1 might act as an indispensable factor in regulating the cellular inflammatory response in UC.
On Computing Breakpoint Distances for Genomes with Duplicate Genes.
Shao, Mingfu; Moret, Bernard M E
2017-06-01
A fundamental problem in comparative genomics is to compute the distance between two genomes in terms of its higher level organization (given by genes or syntenic blocks). For two genomes without duplicate genes, we can easily define (and almost always efficiently compute) a variety of distance measures, but the problem is NP-hard under most models when genomes contain duplicate genes. To tackle duplicate genes, three formulations (exemplar, maximum matching, and any matching) have been proposed, all of which aim to build a matching between homologous genes so as to minimize some distance measure. Of the many distance measures, the breakpoint distance (the number of nonconserved adjacencies) was the first one to be studied and remains of significant interest because of its simplicity and model-free property. The three breakpoint distance problems corresponding to the three formulations have been widely studied. Although we provided last year a solution for the exemplar problem that runs very fast on full genomes, computing optimal solutions for the other two problems has remained challenging. In this article, we describe very fast, exact algorithms for these two problems. Our algorithms rely on a compact integer-linear program that we further simplify by developing an algorithm to remove variables, based on new results on the structure of adjacencies and matchings. Through extensive experiments using both simulations and biological data sets, we show that our algorithms run very fast (in seconds) on mammalian genomes and scale well beyond. We also apply these algorithms (as well as the classic orthology tool MSOAR) to create orthology assignment, then compare their quality in terms of both accuracy and coverage. We find that our algorithm for the "any matching" formulation significantly outperforms other methods in terms of accuracy while achieving nearly maximum coverage.
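The breakpoint distance mentioned above (the number of nonconserved adjacencies) can be illustrated for the easy case of two genomes without duplicate genes. The following is a minimal sketch, not the authors' integer-linear program: it assumes linear genomes written as lists of signed integers, ignores telomeres, and the function names are hypothetical.

```python
def adjacencies(genome):
    """Return the set of adjacencies of a linear genome of signed genes."""
    adj = set()
    for a, b in zip(genome, genome[1:]):
        # (a, b) read forward is the same adjacency as (-b, -a) read on the
        # reverse strand, so store one canonical representative of the two.
        adj.add(min((a, b), (-b, -a)))
    return adj


def breakpoint_distance(g1, g2):
    """Count adjacencies of g1 that are not conserved in g2.

    Assumption: both genomes contain the same genes, each exactly once.
    """
    if sorted(map(abs, g1)) != sorted(map(abs, g2)):
        raise ValueError("genomes must contain the same genes, without duplicates")
    return len(adjacencies(g1) - adjacencies(g2))


# Inverting the segment (2, 3) breaks the adjacencies (1, 2) and (3, 4).
reference = [1, 2, 3, 4, 5]
inverted = [1, -3, -2, 4, 5]
distance = breakpoint_distance(reference, inverted)
```

With duplicate genes, a matching between homologous copies has to be chosen before adjacencies can be compared, which is where the three formulations in the abstract come in.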
Lippert, Christoph; Xiang, Jing; Horta, Danilo; Widmer, Christian; Kadie, Carl; Heckerman, David; Listgarten, Jennifer
2014-11-15
Set-based variance component tests have been identified as a way to increase power in association studies by aggregating weak individual effects. However, the choice of test statistic has been largely ignored even though it may play an important role in obtaining optimal power. We compared a standard statistical test, a score test, with a recently developed likelihood ratio (LR) test. Further, when correction for hidden structure is needed, or gene-gene interactions are sought, state-of-the-art algorithms for both the score and LR tests can be computationally impractical. Thus we develop new computationally efficient methods. After reviewing theoretical differences in performance between the score and LR tests, we find empirically on real data that the LR test generally has more power. In particular, on 15 of 17 real datasets, the LR test yielded at least as many associations as the score test (up to 23 more associations), whereas the score test yielded at most one more association than the LR test in the two remaining datasets. On synthetic data, we find that the LR test yielded up to 12% more associations, consistent with our results on real data, but also observe a regime of extremely small signal where the score test yielded up to 25% more associations than the LR test, consistent with theory. Finally, our computational speedups now enable (i) efficient LR testing when the background kernel is full rank, and (ii) efficient score testing when the background kernel changes with each test, as for gene-gene interaction tests. The latter yielded a factor of 2000 speedup on a cohort of size 13 500. Software available at http://research.microsoft.com/en-us/um/redmond/projects/MSCompBio/Fastlmm/. Contact: heckerma@microsoft.com. Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press.
Why Is the Correlation between Gene Importance and Gene Evolutionary Rate So Weak?
Wang, Zhi; Zhang, Jianzhi
2009-01-01
One of the few commonly believed principles of molecular evolution is that functionally more important genes (or DNA sequences) evolve more slowly than less important ones. This principle is widely used by molecular biologists in daily practice. However, recent genomic analysis of a diverse array of organisms found only weak, negative correlations between the evolutionary rate of a gene and its functional importance, typically measured under a single benign lab condition. A frequently suggested cause of the above finding is that gene importance determined in the lab differs from that in an organism's natural environment. Here, we test this hypothesis in yeast using gene importance values experimentally determined in 418 lab conditions or computationally predicted for 10,000 nutritional conditions. In no single condition or combination of conditions did we find a much stronger negative correlation, which is explainable by our subsequent finding that always-essential (enzyme) genes do not evolve significantly more slowly than sometimes-essential or always-nonessential ones. Furthermore, we verified that functional density, approximated by the fraction of amino acid sites within protein domains, is uncorrelated with gene importance. Thus, neither the lab-nature mismatch nor a potentially biased among-gene distribution of functional density explains the observed weakness of the correlation between gene importance and evolutionary rate. We conclude that the weakness is factual, rather than artifactual. In addition to being weakened by population genetic reasons, the correlation is likely to have been further weakened by the presence of multiple nontrivial rate determinants that are independent from gene importance. 
These findings notwithstanding, we show that the principle of slower evolution of more important genes does have some predictive power when genes with vastly different evolutionary rates are compared, explaining why the principle can be practically useful despite the weakness of the correlation. PMID:19132081
Pollard, Harvey B.; Shivakumar, Chittari; Starr, Joshua; Eidelman, Ofer; Jacobowitz, David M.; Dalgard, Clifton L.; Srivastava, Meera; Wilkerson, Matthew D.; Stein, Murray B.; Ursano, Robert J.
2016-01-01
“Soldier's Heart” is an American Civil War term linking post-traumatic stress disorder (PTSD) with increased propensity for cardiovascular disease (CVD). We have hypothesized that there might be a quantifiable genetic basis for this linkage. To test this hypothesis we identified a comprehensive set of candidate risk genes for PTSD, and tested whether any were also independent risk genes for CVD. A functional analysis algorithm was used to identify associated signaling networks. We identified 106 PTSD studies that report one or more polymorphic variants in 87 candidate genes in 83,463 subjects and controls. The top upstream drivers for these PTSD risk genes are predicted to be the glucocorticoid receptor (NR3C1) and Tumor Necrosis Factor alpha (TNFA). We find that 37 of the PTSD candidate risk genes are also candidate independent risk genes for CVD. The association between PTSD and CVD is significant by Fisher's Exact Test (P = 3 × 10^-54). We also find 15 PTSD risk genes that are independently associated with Type 2 Diabetes Mellitus (T2DM), also significant by Fisher's Exact Test (P = 1.8 × 10^-16). Our findings offer quantitative evidence for a genetic link between post-traumatic stress and cardiovascular disease. Computationally, the common mechanism for this linkage between PTSD and CVD is innate immunity and NFκB-mediated inflammation. PMID:27721742
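The gene-overlap significance reported above rests on Fisher's Exact Test. The sketch below shows the one-sided version of that test for the overlap of two gene sets, computed as a hypergeometric tail. The abstract does not state the size of the background gene universe or of the CVD gene set, so the numbers in the example are purely hypothetical and do not reproduce the published P values.

```python
from math import comb


def overlap_pvalue(universe, size_a, size_b, overlap):
    """One-sided Fisher's exact test for gene-set overlap.

    Returns P(X >= overlap), where X is the number of genes shared by a set
    of size_a and a randomly drawn set of size_b from `universe` genes.
    """
    total = comb(universe, size_b)
    upper = min(size_a, size_b)
    tail = sum(
        comb(size_a, k) * comb(universe - size_a, size_b - k)
        for k in range(overlap, upper + 1)
    )
    # Exact integer arithmetic until the final division.
    return tail / total


# Hypothetical toy numbers: 10 genes in total, two sets of 3, all 3 shared.
p_toy = overlap_pvalue(universe=10, size_a=3, size_b=3, overlap=3)
```

Because the sums are done on Python integers, very small tail probabilities are not lost to rounding before the final division.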
Zhang, Jian; Suo, Yan; Liu, Min; Xu, Xun
2018-06-01
Proliferative diabetic retinopathy (PDR) is one of the most common complications of diabetes and can lead to blindness. Proteomic studies have provided insight into the pathogenesis of PDR, and a series of PDR-related genes has been identified, but these genes are far from fully characterized because the experimental methods are expensive and time consuming. In our previous study, we successfully identified 35 candidate PDR-related genes through the shortest-path algorithm. In the current study, we developed a computational method using the random walk with restart (RWR) algorithm and the protein-protein interaction (PPI) network to identify potential PDR-related genes. After some possible genes were obtained by the RWR algorithm, a three-stage filtration strategy, which includes the permutation test, interaction test and enrichment test, was applied to exclude potential false positives caused by the structure of the PPI network, poor interaction strength, and limited similarity in gene ontology (GO) terms and biological pathways. As a result, the method discovered 36 candidate genes, which differ from the 35 genes reported in our previous study. A literature review showed that 21 of these 36 genes are supported by previous experiments. These findings suggest the robustness and complementary effects of both our efforts using different computational methods, thus providing an alternative method to study PDR pathogenesis. Copyright © 2017 Elsevier B.V. All rights reserved.
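The random walk with restart (RWR) step can be sketched on a toy network. This is a minimal illustration only: the graph, node names, restart probability, and function name are all hypothetical, the network is assumed to be undirected and unweighted with no isolated nodes, and the three-stage filtration described in the abstract is not included.

```python
def random_walk_with_restart(graph, seeds, restart=0.8, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - restart) * W p + restart * p0 until convergence.

    graph: dict mapping each node to a list of its neighbours (undirected,
           every node is assumed to have at least one neighbour).
    seeds: known disease-related nodes; the walker restarts uniformly on them.
    """
    nodes = sorted(graph)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            # A walker on u moves to each neighbour with equal probability.
            share = (1.0 - restart) * p[u] / len(graph[u])
            for v in graph[u]:
                nxt[v] += share
        change = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if change < tol:
            break
    return p


# Hypothetical path-shaped network A - B - C - D with a single seed gene A.
toy_ppi = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
scores = random_walk_with_restart(toy_ppi, seeds={"A"})
```

Nodes closer to the seed receive higher steady-state probability, which is the ranking signal a method of this kind would pass on to its filtering stages.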
Graph Curvature for Differentiating Cancer Networks
Sandhu, Romeil; Georgiou, Tryphon; Reznik, Ed; Zhu, Liangjia; Kolesov, Ivan; Senbabaoglu, Yasin; Tannenbaum, Allen
2015-01-01
Cellular interactions can be modeled as complex dynamical systems represented by weighted graphs. The functionality of such networks, including measures of robustness, reliability, performance, and efficiency, are intrinsically tied to the topology and geometry of the underlying graph. Utilizing recently proposed geometric notions of curvature on weighted graphs, we investigate the features of gene co-expression networks derived from large-scale genomic studies of cancer. We find that the curvature of these networks reliably distinguishes between cancer and normal samples, with cancer networks exhibiting higher curvature than their normal counterparts. We establish a quantitative relationship between our findings and prior investigations of network entropy. Furthermore, we demonstrate how our approach yields additional, non-trivial pair-wise (i.e. gene-gene) interactions which may be disrupted in cancer samples. The mathematical formulation of our approach yields an exact solution to calculating pair-wise changes in curvature which was computationally infeasible using prior methods. As such, our findings lay the foundation for an analytical approach to studying complex biological networks. PMID:26169480
Coalescent histories for caterpillar-like families.
Rosenberg, Noah A
2013-01-01
A coalescent history is an assignment of branches of a gene tree to branches of a species tree on which coalescences in the gene tree occur. The number of coalescent histories for a pair consisting of a labeled gene tree topology and a labeled species tree topology is important in gene tree probability computations, and more generally, in studying evolutionary possibilities for gene trees on species trees. Defining the Tr-caterpillar-like family as a sequence of n-taxon trees constructed by replacing the r-taxon subtree of n-taxon caterpillars by a specific r-taxon labeled topology Tr, we examine the number of coalescent histories for caterpillar-like families with matching gene tree and species tree labeled topologies. For each Tr with size r≤8, we compute the number of coalescent histories for n-taxon trees in the Tr-caterpillar-like family. Next, as n→∞, we find that the limiting ratio of the numbers of coalescent histories for the Tr family and caterpillars themselves is correlated with the number of labeled histories for Tr. The results support a view that large numbers of coalescent histories occur when a tree has both a relatively balanced subtree and a high tree depth, contributing to deeper understanding of the combinatorics of gene trees and species trees.
Drift diffusion model of reward and punishment learning in rare alpha-synuclein gene carriers.
Moustafa, Ahmed A; Kéri, Szabolcs; Polner, Bertalan; White, Corey
To understand the cognitive effects of alpha-synuclein polymorphism, we employed a drift diffusion model (DDM) to analyze reward- and punishment-guided probabilistic learning task data of participants with the rare alpha-synuclein gene duplication and age- and education-matched controls. Overall, the DDM analysis showed that, relative to controls, asymptomatic alpha-synuclein gene duplication carriers had significantly increased learning from negative feedback, while they tended to show impaired learning from positive feedback. No significant differences were found in response caution, response bias, or motor/encoding time. We here discuss the implications of these computational findings to the understanding of the neural mechanism of alpha-synuclein gene duplication.
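The drift diffusion model parameters named above (drift, response caution, response bias, and motor/encoding time) can be made concrete with a single-trial simulation. This is a generic textbook-style sketch under stated assumptions, not the fitting procedure used in the study: evidence starts at a biased point between two boundaries and accumulates with Gaussian noise until a boundary is hit. All parameter values and names are hypothetical.

```python
import random


def simulate_ddm_trial(drift, boundary, bias, non_decision, rng, dt=0.001, noise=1.0):
    """Simulate one two-choice trial with an Euler approximation.

    drift:        mean rate of evidence accumulation
    boundary:     separation between the two bounds (response caution)
    bias:         relative starting point in (0, 1) (response bias)
    non_decision: motor/encoding time added to the decision time
    Returns (choice, reaction_time), with choice 1 for the upper bound.
    """
    x = bias * boundary
    t = 0.0
    step_sd = noise * dt ** 0.5
    while 0.0 < x < boundary:
        x += drift * dt + rng.gauss(0.0, step_sd)
        t += dt
    choice = 1 if x >= boundary else 0
    return choice, t + non_decision


rng = random.Random(0)  # fixed seed so the sketch is repeatable
trials = [simulate_ddm_trial(5.0, 1.0, 0.5, 0.3, rng) for _ in range(200)]
upper_rate = sum(choice for choice, _ in trials) / len(trials)
```

A larger drift raises the share of upper-bound responses, while a wider boundary slows responses and makes them more accurate.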
Kato, Hiroki; Kanematsu, Masayuki; Yokoi, Shigeaki; Miwa, Kousei; Horie, Kengo; Deguchi, Takashi; Hirose, Yoshinobu
2011-01-01
The authors describe the computed tomography (CT) and magnetic resonance imaging (MRI) findings of an 18-year-old man with renal cell carcinoma (RCC) associated with the Xp11.2 translocation/transcription factor E3 (TFE3) gene fusion (Xp11 translocation carcinoma). The lesion was hyperdense on unenhanced CT, hypovascular on contrast-enhanced studies, hypointense on T2-weighted MR images, and hemosiderin deposition was suspected on phase-shift gradient-echo MR images. Histopathological specimens revealed pathological findings resembling papillary RCC predominantly and exhibited immunoreactivity for TFE3. Because there is often considerable morphological overlap between this carcinoma and papillary RCC, the imaging findings of Xp11 translocation carcinoma may be similar to those of the papillary subtype. Therefore, Xp11 translocation carcinoma should be considered, particularly in young patients when radiologic images demonstrate a renal tumor mimicking the papillary subtype. Copyright © 2010 Wiley-Liss, Inc.
Gene Selection and Cancer Classification: A Rough Sets Based Approach
NASA Astrophysics Data System (ADS)
Sun, Lijun; Miao, Duoqian; Zhang, Hongyun
Identification of informative gene subsets responsible for discerning between available samples of gene expression data is an important task in bioinformatics. Reducts, from rough set theory, which correspond to minimal sets of essential genes for discerning samples, are an efficient tool for gene selection. Due to the computational complexity of the existing reduct algorithms, feature ranking is usually used to narrow down the gene space as a first step, and the top-ranked genes are selected. In this paper, we define a novel criterion for scoring genes, based on the expression level difference between classes and the gene's contribution to classification, and present an algorithm for generating all possible reducts from informative genes. The algorithm takes the whole attribute set into account and finds short reducts with a significant reduction in computational complexity. An exploration of this approach on benchmark gene expression data sets demonstrates that it is successful in selecting highly discriminative genes and that the classification accuracy is impressive.
Arias, Carlos Roberto; Yeh, Hsiang-Yuan; Soo, Von-Wun
2012-01-01
Finding a genetic disease-related gene is not a trivial task. Therefore, computational methods are needed to present clues to the biomedical community to explore genes that are more likely to be related to a specific disease as biomarkers. We address the biomarker identification problem using a gene prioritization method called gene prioritization from microarray data based on shortest paths, extended with structural and biological properties and edge flux using a voting scheme (GP-MIDAS-VXEF). The method is based on finding relevant interactions on protein interaction networks, then scoring the genes using shortest paths and topological analysis, and integrating the results using a voting scheme and a biological boosting. We ran two experiments: one on prostate primary and normal samples, and the other on prostate primary tumor with and without lymph node metastasis. We used 137 true prostate cancer genes as a benchmark. In the first experiment, GP-MIDAS-VXEF outperforms all the other state-of-the-art methods in the benchmark by retrieving the most truly related genes from the candidate set in the top 50 scores found. We applied the same technique to infer the significant biomarkers in prostate cancer with lymph node metastasis, which is not well established. PMID:22654636
Li, Chen; Shen, Weixing; Shen, Sheng; Ai, Zhilong
2013-12-01
To explore the molecular mechanisms of cholangiocarcinoma (CC), microarray technology was used to find biomarkers for early detection and diagnosis. The gene expression profiles from 6 patients with CC and 5 normal controls were downloaded from Gene Expression Omnibus and compared. As a result, 204 differentially co-expressed genes (DCGs) in CC patients compared to normal controls were identified using a computational bioinformatics analysis. These genes were mainly involved in coenzyme metabolic process, peptidase activity and oxidation reduction. A regulatory network was constructed by mapping the DCGs to known regulation data. Four transcription factors, FOXC1, ZIC2, NKX2-2 and GCGR, were hub nodes in the network. In conclusion, this study provides a set of targets useful for future investigations into molecular biomarker studies. Copyright © 2013 Elsevier Ltd. All rights reserved.
The impact of network medicine in gastroenterology and hepatology.
Baffy, György
2013-10-01
In the footsteps of groundbreaking achievements made by biomedical research, another scientific revolution is unfolding. Systems biology draws from the chaos and complexity theory and applies computational models to predict emerging behavior of the interactions between genes, gene products, and environmental factors. Adaptation of systems biology to translational and clinical sciences has been termed network medicine, and is likely to change the way we think about preventing, predicting, diagnosing, and treating complex human diseases. Network medicine finds gene-disease associations by analyzing the unparalleled digital information discovered and created by high-throughput technologies (dubbed as "omics" science) and links genetic variance to clinical disease phenotypes through intermediate organizational levels of life such as the epigenome, transcriptome, proteome, and metabolome. Supported by large reference databases, unprecedented data storage capacity, and innovative computational analysis, network medicine is poised to find links between conditions that were thought to be distinct, uncover shared disease mechanisms and key drivers of the pathogenesis, predict individual disease outcomes and trajectories, identify novel therapeutic applications, and help avoid off-target and undesirable drug effects. Recent advances indicate that these perspectives are increasingly within our reach for understanding and managing complex diseases of the digestive system. Copyright © 2013 AGA Institute. Published by Elsevier Inc. All rights reserved.
Computational Identification and Functional Predictions of Long Noncoding RNA in Zea mays
Boerner, Susan; McGinnis, Karen M.
2012-01-01
Background Computational analysis of cDNA sequences from multiple organisms suggests that a large portion of transcribed DNA does not code for a functional protein. In mammals, noncoding transcription is abundant, and often results in functional RNA molecules that do not appear to encode proteins. Many long noncoding RNAs (lncRNAs) appear to have epigenetic regulatory function in humans, including HOTAIR and XIST. While epigenetic gene regulation is clearly an essential mechanism in plants, relatively little is known about the presence or function of lncRNAs in plants. Methodology/Principal Findings To explore the connection between lncRNA and epigenetic regulation of gene expression in plants, a computational pipeline using the programming language Python has been developed and applied to maize full length cDNA sequences to identify, classify, and localize potential lncRNAs. The pipeline was used in parallel with an SVM tool for identifying ncRNAs to identify the maximal number of ncRNAs in the dataset. Although the available library of sequences was small and potentially biased toward protein coding transcripts, 15% of the sequences were predicted to be noncoding. Approximately 60% of these sequences appear to act as precursors for small RNA molecules and may function to regulate gene expression via a small RNA dependent mechanism. ncRNAs were predicted to originate from both genic and intergenic loci. Of the lncRNAs that originated from genic loci, ∼20% were antisense to the host gene loci. Conclusions/Significance Consistent with similar studies in other organisms, noncoding transcription appears to be widespread in the maize genome. Computational predictions indicate that maize lncRNAs may function to regulate expression of other genes through multiple RNA mediated mechanisms. PMID:22916204
Efficient experimental design for uncertainty reduction in gene regulatory networks.
Dehghannasiri, Roozbeh; Yoon, Byung-Jun; Dougherty, Edward R
2015-01-01
An accurate understanding of interactions among genes plays a major role in developing therapeutic intervention methods. Gene regulatory networks often contain a significant amount of uncertainty. The process of prioritizing biological experiments to reduce the uncertainty of gene regulatory networks is called experimental design. Under such a strategy, the experiments with high priority are suggested to be conducted first. The authors have already proposed an optimal experimental design method based upon the objective for modeling gene regulatory networks, such as deriving therapeutic interventions. The experimental design method utilizes the concept of mean objective cost of uncertainty (MOCU). MOCU quantifies the expected increase of cost resulting from uncertainty. The optimal experiment to be conducted first is the one which leads to the minimum expected remaining MOCU subsequent to the experiment. In the process, one must find the optimal intervention for every gene regulatory network compatible with the prior knowledge, which can be prohibitively expensive when the size of the network is large. In this paper, we propose a computationally efficient experimental design method. This method incorporates a network reduction scheme by introducing a novel cost function that takes into account the disruption in the ranking of potential experiments. We then estimate the approximate expected remaining MOCU at a lower computational cost using the reduced networks. Simulation results based on synthetic and real gene regulatory networks show that the proposed approximate method has close performance to that of the optimal method but at lower computational cost. The proposed approximate method also outperforms the random selection policy significantly. A MATLAB software implementing the proposed experimental design method is available at http://gsp.tamu.edu/Publications/supplementary/roozbeh15a/.
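The idea that MOCU "quantifies the expected increase of cost resulting from uncertainty" can be illustrated on a finite uncertainty class. The sketch below assumes a small table of costs, one row per candidate network and one column per intervention, plus a prior over networks. It further assumes that the robust intervention is the one minimizing expected cost under the prior. The function name and all numbers are hypothetical, and the network reduction scheme of the paper is not shown.

```python
def mocu(prior, cost):
    """Mean objective cost of uncertainty for a finite uncertainty class.

    prior: probabilities of the candidate networks (sums to 1)
    cost:  cost[i][a] is the cost of intervention a if network i is the truth
    Returns (robust_intervention_index, mocu_value).
    """
    n_actions = len(cost[0])
    # Expected cost of each intervention under the prior.
    expected = [
        sum(p * row[a] for p, row in zip(prior, cost)) for a in range(n_actions)
    ]
    robust = min(range(n_actions), key=expected.__getitem__)
    # Expected extra cost of the robust choice over each network's own optimum.
    value = sum(p * (row[robust] - min(row)) for p, row in zip(prior, cost))
    return robust, value


# Hypothetical example: two candidate networks, two interventions.
robust_action, uncertainty_cost = mocu([0.5, 0.5], [[0.1, 0.5], [0.6, 0.3]])
```

An experiment that is expected to shrink this quantity the most would be ranked first, which is why the cost of evaluating it for every compatible network matters.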
Efficient experimental design for uncertainty reduction in gene regulatory networks
2015-01-01
Background An accurate understanding of interactions among genes plays a major role in developing therapeutic intervention methods. Gene regulatory networks often contain a significant amount of uncertainty. The process of prioritizing biological experiments to reduce the uncertainty of gene regulatory networks is called experimental design. Under such a strategy, the experiments with high priority are suggested to be conducted first. Results The authors have already proposed an optimal experimental design method based upon the objective for modeling gene regulatory networks, such as deriving therapeutic interventions. The experimental design method utilizes the concept of mean objective cost of uncertainty (MOCU). MOCU quantifies the expected increase of cost resulting from uncertainty. The optimal experiment to be conducted first is the one which leads to the minimum expected remaining MOCU subsequent to the experiment. In the process, one must find the optimal intervention for every gene regulatory network compatible with the prior knowledge, which can be prohibitively expensive when the size of the network is large. In this paper, we propose a computationally efficient experimental design method. This method incorporates a network reduction scheme by introducing a novel cost function that takes into account the disruption in the ranking of potential experiments. We then estimate the approximate expected remaining MOCU at a lower computational cost using the reduced networks. Conclusions Simulation results based on synthetic and real gene regulatory networks show that the proposed approximate method has close performance to that of the optimal method but at lower computational cost. The proposed approximate method also outperforms the random selection policy significantly. A MATLAB software implementing the proposed experimental design method is available at http://gsp.tamu.edu/Publications/supplementary/roozbeh15a/. PMID:26423515
Evolutionary Approach for Relative Gene Expression Algorithms
Czajkowski, Marcin
2014-01-01
Relative Expression Analysis (RXA) uses ordering relationships in a small collection of genes and has been successfully applied to classification using microarray data. As checking all possible subsets of genes is computationally infeasible, the RXA algorithms require feature selection and multiple restrictive assumptions. Our main contribution is a specialized evolutionary algorithm (EA) for top-scoring pairs, called EvoTSP, which allows finding more advanced gene relations. We managed to unify the major variants of relative expression algorithms through EA and introduce weights to the top-scoring pairs. Experimental validation of EvoTSP on publicly available microarray datasets showed that the proposed solution significantly outperforms other relative expression algorithms in terms of accuracy and allows exploring a much larger solution space. PMID:24790574
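The ordering relationship behind top-scoring pairs can be sketched with the classic exhaustive search, which is the baseline that an evolutionary search such as EvoTSP is meant to go beyond. The scoring rule used here, the absolute difference between classes in how often gene i is expressed below gene j, is the standard top-scoring-pair criterion and is an assumption with respect to this abstract. The data and function names are hypothetical.

```python
from itertools import combinations


def pair_score(samples, labels, i, j):
    """|P(x_i < x_j | class a) - P(x_i < x_j | class b)| for two classes."""
    freq = {}
    for c in set(labels):
        rows = [s for s, y in zip(samples, labels) if y == c]
        freq[c] = sum(r[i] < r[j] for r in rows) / len(rows)
    a, b = sorted(freq)  # assumes exactly two class labels
    return abs(freq[a] - freq[b])


def top_scoring_pair(samples, labels):
    """Exhaustively search all gene pairs for the highest score."""
    n_genes = len(samples[0])
    return max(
        combinations(range(n_genes), 2),
        key=lambda pair: pair_score(samples, labels, *pair),
    )


# Hypothetical expression matrix: rows are samples, columns are genes.
expression = [[1, 5, 3], [2, 6, 1], [7, 3, 4], [9, 2, 1]]
classes = [0, 0, 1, 1]
best_pair = top_scoring_pair(expression, classes)
```

The exhaustive search grows quadratically with the number of genes for pairs and much faster for larger gene groups, which motivates both feature selection and heuristic search.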
PACAP Interactions in the Mouse Brain: Implications for Behavioral and Other Disorders
DOE Office of Scientific and Technical Information (OSTI.GOV)
Acquaah-Mensah, George; Taylor, Ronald C.; Bhave, Sanjiv V.
2012-01-10
As an activator of adenylate cyclase, the neuropeptide Pituitary Adenylate Cyclase Activating Peptide (PACAP) impacts levels of cyclic AMP, a key second messenger available in brain cells. PACAP is involved in certain adult behaviors. To elucidate PACAP interactions, a compendium of microarrays representing mRNA expression in the adult mouse whole brain was pooled from the Phenogen database for analysis. A regulatory network was computed based on mutual information between gene pairs using gene expression data across the compendium. Clusters among genes directly linked to PACAP, and probable interactions between corresponding proteins were computed. Database 'experts' affirmed some of the inferred relationships. The findings suggest ADCY7 is probably the adenylate cyclase isoform most relevant to PACAP's action. They also support intervening roles for kinases including GSK3B, PI 3-kinase, SGK3 and AMPK. Other high-confidence interactions are hypothesized for future testing. This new information has implications for certain behavioral and other disorders.
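Mutual information between a pair of genes, the quantity used above to build the regulatory network, can be sketched for expression values that have already been discretized into levels. The discretization step, the function name, and the toy profiles are assumptions made for illustration. The abstract does not say which estimator was used.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Mutual information (in bits) between two discretized expression profiles."""
    n = len(x)
    count_x = Counter(x)
    count_y = Counter(y)
    count_xy = Counter(zip(x, y))
    mi = 0.0
    for (a, b), joint in count_xy.items():
        p_xy = joint / n
        # Compare the joint frequency with what independence would predict.
        mi += p_xy * log2(p_xy / ((count_x[a] / n) * (count_y[b] / n)))
    return mi


# Hypothetical profiles over four arrays, coded as low (0) or high (1).
gene_a = [0, 0, 1, 1]
gene_b = [0, 1, 0, 1]
mi_independent = mutual_information(gene_a, gene_b)
mi_identical = mutual_information(gene_a, gene_a)
```

Gene pairs whose mutual information exceeds a chosen threshold would be linked in the network, and unlike correlation the measure also picks up non-linear dependence.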
Assessment of gene order computing methods for Alzheimer's disease
2013-01-01
Background Computational genomics of Alzheimer's disease (AD), the most common form of senile dementia, is a nascent field in AD research. The field includes AD gene clustering by computing gene order, which generates higher quality gene clustering patterns than most other clustering methods. However, there are few available gene order computing methods, such as Genetic Algorithm (GA) and Ant Colony Optimization (ACO). Further, their performance in gene order computation using AD microarray data is not known. We thus set out to evaluate the performance of current gene order computing methods with different distance formulas, and to identify additional features associated with gene order computation. Methods Using different distance formulas (Pearson distance, Euclidean distance, and the squared Euclidean distance) and other conditions, gene orders were calculated by ACO and GA (including standard GA and improved GA) methods, respectively. The qualities of the gene orders were compared, and new features from the calculated gene orders were identified. Results Compared to the GA methods tested in this study, ACO fits the AD microarray data the best when calculating gene order. In addition, the following features were revealed: different distance formulas generated a different quality of gene order, and the commonly used Pearson distance was not the best distance formula when used with both GA and ACO methods for AD microarray data. Conclusion Compared with Pearson distance and Euclidean distance, the squared Euclidean distance generated the best quality gene order computed by GA and ACO methods. PMID:23369541
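The three distance formulas compared above can be written down for a pair of expression profiles. The abstract does not define them, so this sketch assumes the common conventions: Pearson distance as one minus the Pearson correlation coefficient, and the usual Euclidean and squared Euclidean distances. The function names are hypothetical.

```python
from math import sqrt


def squared_euclidean(x, y):
    """Sum of squared differences between two expression profiles."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def euclidean(x, y):
    """Square root of the squared Euclidean distance."""
    return sqrt(squared_euclidean(x, y))


def pearson_distance(x, y):
    """1 - r, where r is the Pearson correlation (assumed convention).

    Undefined for a constant profile, whose standard deviation is zero.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (sd_x * sd_y)


# Two profiles with the same shape but different scale.
profile_1 = [1.0, 2.0, 3.0]
profile_2 = [2.0, 4.0, 6.0]
d_pearson = pearson_distance(profile_1, profile_2)
d_squared = squared_euclidean(profile_1, profile_2)
```

The example shows why the choice matters for gene ordering: the two profiles are identical under Pearson distance but clearly separated under the Euclidean measures.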
Schmidt, Florian; Gasparoni, Nina; Gasparoni, Gilles; Gianmoena, Kathrin; Cadenas, Cristina; Polansky, Julia K.; Ebert, Peter; Nordström, Karl; Barann, Matthias; Sinha, Anupam; Fröhler, Sebastian; Xiong, Jieyi; Dehghani Amirabad, Azim; Behjati Ardakani, Fatemeh; Hutter, Barbara; Zipprich, Gideon; Felder, Bärbel; Eils, Jürgen; Brors, Benedikt; Chen, Wei; Hengstler, Jan G.; Hamann, Alf; Lengauer, Thomas; Rosenstiel, Philip; Walter, Jörn; Schulz, Marcel H.
2017-01-01
The binding and contribution of transcription factors (TF) to cell specific gene expression is often deduced from open-chromatin measurements to avoid costly TF ChIP-seq assays. Thus, it is important to develop computational methods for accurate TF binding prediction in open-chromatin regions (OCRs). Here, we report a novel segmentation-based method, TEPIC, to predict TF binding by combining sets of OCRs with position weight matrices. TEPIC can be applied to various open-chromatin data, e.g. DNaseI-seq and NOMe-seq. Additionally, Histone-Marks (HMs) can be used to identify candidate TF binding sites. TEPIC computes TF affinities and uses open-chromatin/HM signal intensity as quantitative measures of TF binding strength. Using machine learning, we find low affinity binding sites to improve our ability to explain gene expression variability compared to the standard presence/absence classification of binding sites. Further, we show that both footprints and peaks capture essential TF binding events and lead to a good prediction performance. In our application, gene-based scores computed by TEPIC with one open-chromatin assay nearly reach the quality of several TF ChIP-seq data sets. Finally, these scores correctly predict known transcriptional regulators as illustrated by the application to novel DNaseI-seq and NOMe-seq data for primary human hepatocytes and CD4+ T-cells, respectively. PMID:27899623
Identification of Surprisingly Diverse Type IV Pili, across a Broad Range of Gram-Positive Bacteria
Roos, David S.; Pohlschröder, Mechthild
2011-01-01
Background In Gram-negative bacteria, type IV pili (TFP) have long been known to play important roles in such diverse biological phenomena as surface adhesion, motility, and DNA transfer, with significant consequences for pathogenicity. More recently it became apparent that Gram-positive bacteria also express type IV pili; however, little is known about the diversity and abundance of these structures in Gram-positives. Computational tools for automated identification of type IV pilins are not currently available. Results To assess TFP diversity in Gram-positive bacteria and facilitate pilin identification, we compiled a comprehensive list of putative Gram-positive pilins encoded by operons containing highly conserved pilus biosynthetic genes (pilB, pilC). A surprisingly large number of species were found to contain multiple TFP operons (pil, com and/or tad). The N-terminal sequences of predicted pilins were exploited to develop PilFind, a rule-based algorithm for genome-wide identification of otherwise poorly conserved type IV pilins in any species, regardless of their association with TFP biosynthetic operons (http://signalfind.org). Using PilFind to scan 53 Gram-positive genomes (encoding >187,000 proteins), we identified 286 candidate pilins, including 214 in operons containing TFP biosynthetic genes (TBG+ operons). Although trained on Gram-positive pilins, PilFind identified 55 of 58 manually curated Gram-negative pilins in TBG+ operons, as well as 53 additional pilin candidates in operons lacking biosynthetic genes in ten species (>38,000 proteins), including 27 of 29 experimentally verified pilins. False positive rates appear to be low, as PilFind predicted only four pilin candidates in eleven bacterial species (>13,000 proteins) lacking TFP biosynthetic genes. Conclusions We have shown that Gram-positive bacteria contain a highly diverse set of type IV pili. 
PilFind can be an invaluable tool to study bacterial cellular processes known to involve type IV pilus-like structures. Its use in combination with other currently available computational tools should improve the accuracy of predicting the subcellular localization of bacterial proteins. PMID:22216142
Zhang, Xue; Acencio, Marcio Luis; Lemke, Ney
2016-01-01
Essential proteins/genes are indispensable to the survival or reproduction of an organism, and the deletion of such essential proteins will result in lethality or infertility. The identification of essential genes is very important not only for understanding the minimal requirements for survival of an organism, but also for finding human disease genes and new drug targets. Experimental methods for identifying essential genes are costly, time-consuming, and laborious. With the accumulation of sequenced genome data and high-throughput experimental data, many computational methods for identifying essential proteins have been proposed, which are useful complements to experimental methods. In this review, we show the state-of-the-art methods for identifying essential genes and proteins based on machine learning and network topological features, point out the progress and limitations of current methods, and discuss the challenges and directions for further research. PMID:27014079
A transcriptional dynamic network during Arabidopsis thaliana pollen development.
Wang, Jigang; Qiu, Xiaojie; Li, Yuhua; Deng, Youping; Shi, Tieliu
2011-01-01
To understand transcriptional regulatory networks (TRNs), especially the coordinated dynamic regulation between transcription factors (TFs) and their corresponding target genes during development, computational approaches would represent significant advances in genome-wide expression analysis. The major experimental challenges include monitoring time-specific TF activities and identifying the dynamic regulatory relationships between TFs and their target genes, neither of which is yet feasible at large scale. However, various methods have been proposed to computationally estimate those activities and regulations. During the past decade, significant progress has been made towards understanding pollen development at each developmental stage at the molecular level, yet the regulatory mechanisms that control the dynamic pollen development processes remain largely unknown. Here, we adopt Network Component Analysis (NCA) to identify TF activities over the time course, and infer their regulatory relationships based on the coexpression of TFs and their target genes during pollen development. We carried out a meta-analysis by integrating several sets of gene expression data related to Arabidopsis thaliana pollen development (stages range from UNM, BCP, TCP, HP to 0.5 hr pollen tube and 4 hr pollen tube). We constructed a regulatory network, including 19 TFs, 101 target genes and 319 regulatory interactions. The computationally estimated TF activities were well correlated with the expression of their coordinated genes during the development process. We clustered the expression of their target genes in the context of regulatory influences, and inferred new regulatory relationships between those TFs and their target genes; for example, the transcription factor WRKY34, which was found to be specifically expressed in pollen, regulated several new target genes. 
Our finding facilitates the interpretation of the expression patterns with more biological relevancy, since the clusters corresponding to the activity of specific TF or the combination of TFs suggest the coordinated regulation of TFs to their target genes. Through integrating different resources, we constructed a dynamic regulatory network of Arabidopsis thaliana during pollen development with gene coexpression and NCA. The network illustrated the relationships between the TFs' activities and their target genes' expression, as well as the interactions between TFs, which provide new insight into the molecular mechanisms that control the pollen development.
The importance of biochemical and genetic findings in the diagnosis of atypical Norrie disease.
Rodríguez-Muñoz, Ana; García-García, Gema; Menor, Francisco; Millán, José M; Tomás-Vila, Miguel; Jaijo, Teresa
2018-01-26
Norrie disease (ND) is a rare X-linked disorder characterized by bilateral congenital blindness. ND is caused by a mutation in the Norrie disease pseudoglioma (NDP) gene, which encodes a 133-amino acid protein called norrin. Intragenic deletions including NDP and adjacent genes have been identified in ND patients with a more severe neurologic phenotype. We report the biochemical, molecular, clinical and radiological features of two unrelated affected males with a deletion including NDP and MAO genes. Biochemical and genetic analyses were performed to understand the atypical phenotype and radiological findings. Biogenic amines in cerebrospinal fluid (CSF) were measured by high-performance liquid chromatography. The coding exons of the NDP gene were amplified by polymerase chain reaction. Multiplex ligation-dependent probe amplification and chromosomal microarray were carried out on both affected males. Computed tomography and magnetic resonance imaging were performed on the two patients. In one patient, the serotonin and catecholamine metabolite levels in CSF were virtually undetectable. In both patients, genetic studies revealed microdeletions in the Xp11.3 region, involving the NDP, MAOA and MAOB genes. Radiological examination demonstrated brain and cerebellar atrophy. We suggest that alterations caused by MAO deficit may remain during the first years of life. Clinical phenotype, biochemical findings and neuroimaging can guide the genetic study in patients with atypical ND and contribute to a better understanding of this disease.
Neuhaus, Klaus; Landstorfer, Richard; Fellner, Lea; Simon, Svenja; Schafferhans, Andrea; Goldberg, Tatyana; Marx, Harald; Ozoline, Olga N; Rost, Burkhard; Kuster, Bernhard; Keim, Daniel A; Scherer, Siegfried
2016-02-24
Genomes of E. coli, including that of the human pathogen Escherichia coli O157:H7 (EHEC) EDL933, still harbor undetected protein-coding genes which, apparently, have escaped annotation due to their small size and non-essential function. To find such genes, global gene expression of EHEC EDL933 was examined, using strand-specific RNAseq (transcriptome), ribosomal footprinting (translatome) and mass spectrometry (proteome). Using the above methods, 72 short, non-annotated protein-coding genes were detected. All of these showed signals in the ribosomal footprinting assay indicating mRNA translation. Seven were verified by mass spectrometry. Fifty-seven genes are annotated in other enterobacteriaceae, mainly as hypothetical genes; the remaining 15 genes constitute novel discoveries. In addition, protein structure and function were predicted computationally and compared between EHEC-encoded proteins and 100-times randomly shuffled proteins. Based on this comparison, 61 of the 72 novel proteins exhibit predicted structural and functional features similar to those of annotated proteins. Many of the novel genes show differential transcription when grown under eleven diverse growth conditions suggesting environmental regulation. Three genes were found to confer a phenotype in previous studies, e.g., decreased cattle colonization. These findings demonstrate that ribosomal footprinting can be used to detect novel protein coding genes, contributing to the growing body of evidence that hypothetical genes are not annotation artifacts and opening an additional way to study their functionality. All 72 genes are taxonomically restricted and, therefore, appear to have evolved relatively recently de novo.
Advantages and disadvantages in usage of bioinformatic programs in promoter region analysis
NASA Astrophysics Data System (ADS)
Pawełkowicz, Magdalena E.; Skarzyńska, Agnieszka; Posyniak, Kacper; Ziąbska, Karolina; Pląder, Wojciech; Przybecki, Zbigniew
2015-09-01
An important computational challenge is finding the regulatory elements across the promoter region. In this work we present the advantages and disadvantages of applying different bioinformatics programs to localize transcription factor binding sites in the upstream region of genes connected with sex determination in cucumber. We use PlantCARE, PlantPAN and SignalScan to find motifs in the promoter regions. The results have been compared and the possible function of chosen motifs has been described.
The computational core and fixed point organization in Boolean networks
NASA Astrophysics Data System (ADS)
Correale, L.; Leone, M.; Pagnani, A.; Weigt, M.; Zecchina, R.
2006-03-01
In this paper, we analyse large random Boolean networks in terms of a constraint satisfaction problem. We first develop an algorithmic scheme which allows us to prune simple logical cascades and underdetermined variables, returning thereby the computational core of the network. Second, we apply the cavity method to analyse the number and organization of fixed points. We find in particular a phase transition between an easy and a complex regulatory phase, the latter being characterized by the existence of an exponential number of macroscopically separated fixed point clusters. The different techniques developed are reinterpreted as algorithms for the analysis of single Boolean networks, and they are applied in the analysis of and in silico experiments on the gene regulatory networks of baker's yeast (Saccharomyces cerevisiae) and the segment-polarity genes of the fruitfly Drosophila melanogaster.
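To make the notion of a fixed point concrete, the sketch below enumerates the fixed points of a tiny Boolean network by brute force. This is not the pruning scheme or the cavity method described in the abstract, which target large random networks where enumeration is infeasible; the three-node network and its update rules are hypothetical.

```python
from itertools import product

def fixed_points(update_rules):
    """Return every state s such that rule_i(s) == s[i] for all nodes i.

    Brute force over all 2**n states, so only usable for small n.
    """
    n = len(update_rules)
    found = []
    for state in product((0, 1), repeat=n):
        if all(rule(state) == state[i] for i, rule in enumerate(update_rules)):
            found.append(state)
    return found

# Hypothetical 3-node toy network (states are tuples of 0/1).
rules = [
    lambda s: s[1] and s[2],  # x0 <- x1 AND x2
    lambda s: s[0] or s[2],   # x1 <- x0 OR x2
    lambda s: s[2],           # x2 keeps its own value
]

print(fixed_points(rules))  # [(0, 0, 0), (1, 1, 1)]
```

In this toy case the value of x2 determines the other two nodes, which is the kind of simple logical cascade that a pruning step would remove before analysing the remaining core.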
2011-01-01
Background Gene-gene interaction in genetic association studies is computationally intensive when a large number of SNPs are involved. Most of the latest Central Processing Units (CPUs) have multiple cores, whereas Graphics Processing Units (GPUs) also have hundreds of cores and have been recently used to implement faster scientific software. However, currently there are no genetic analysis software packages that allow users to fully utilize the computing power of these multi-core devices for genetic interaction analysis for binary traits. Findings Here we present a novel software package GENIE, which utilizes the power of multiple GPU or CPU processor cores to parallelize the interaction analysis. GENIE reads an entire genetic association study dataset into memory and partitions the dataset into fragments with non-overlapping sets of SNPs. For each fragment, GENIE analyzes: 1) the interaction of SNPs within it in parallel, and 2) the interaction between the SNPs of the current fragment and other fragments in parallel. We tested GENIE on a large-scale candidate gene study on high-density lipoprotein cholesterol. Using an NVIDIA Tesla C1060 graphics card, the GPU mode of GENIE achieves a speedup of 27 times over its single-core CPU mode run. Conclusions GENIE is open-source, economical, user-friendly, and scalable. Since the computing power and memory capacity of graphics cards are increasing rapidly while their cost is going down, we anticipate that GENIE will achieve greater speedups with faster GPU cards. Documentation, source code, and precompiled binaries can be downloaded from http://www.cceb.upenn.edu/~mli/software/GENIE/. PMID:21615923
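The partitioning scheme described above can be sketched as follows. This is a minimal illustration, not GENIE's code: it only shows how splitting SNPs into non-overlapping fragments, then pairing SNPs within each fragment and between fragment pairs, covers every unordered SNP pair exactly once. The function names, SNP labels, and fragment size are assumptions, and the tasks are listed serially rather than dispatched to GPU or CPU cores.

```python
from itertools import combinations

def partition(snps, fragment_size):
    """Split the SNP list into consecutive, non-overlapping fragments."""
    return [snps[i:i + fragment_size]
            for i in range(0, len(snps), fragment_size)]

def pair_tasks(fragments):
    """Yield (task_label, snp_pairs) for within- and between-fragment work."""
    for i, frag in enumerate(fragments):
        # 1) interactions among SNPs inside fragment i
        yield ("within", i, i), list(combinations(frag, 2))
        # 2) interactions between fragment i and each later fragment j
        for j in range(i + 1, len(fragments)):
            yield ("between", i, j), [(a, b) for a in frag for b in fragments[j]]

snps = ["rs1", "rs2", "rs3", "rs4", "rs5"]  # made-up identifiers
fragments = partition(snps, 2)
all_pairs = [pair for _, pairs in pair_tasks(fragments) for pair in pairs]
```

Each yielded task is independent of the others, which is what makes this decomposition suitable for parallel execution.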
An Integrative data mining approach to identifying Adverse ...
The Adverse Outcome Pathway (AOP) framework is a tool for making biological connections and summarizing key information across different levels of biological organization to connect biological perturbations at the molecular level to adverse outcomes for an individual or population. Computational approaches to explore and determine these connections can accelerate the assembly of AOPs. By leveraging the wealth of publicly available data covering chemical effects on biological systems, computationally-predicted AOPs (cpAOPs) were assembled via data mining of high-throughput screening (HTS) in vitro data, in vivo data and other disease phenotype information. Frequent Itemset Mining (FIM) was used to find associations between the gene targets of ToxCast HTS assays and disease data from Comparative Toxicogenomics Database (CTD) by using the chemicals as the common aggregators between datasets. The method was also used to map gene expression data to disease data from CTD. A cpAOP network was defined by considering genes and diseases as nodes and FIM associations as edges. This network contained 18,283 gene to disease associations for the ToxCast data and 110,253 for CTD gene expression. Two case studies show the value of the cpAOP network by extracting subnetworks focused either on fatty liver disease or the Aryl Hydrocarbon Receptor (AHR). The subnetwork surrounding fatty liver disease included many genes known to play a role in this disease. When querying the cpAOP
Radiation protective effects of baclofen predicted by a computational drug repurposing strategy.
Ren, Lei; Xie, Dafei; Li, Peng; Qu, Xinyan; Zhang, Xiujuan; Xing, Yaling; Zhou, Pingkun; Bo, Xiaochen; Zhou, Zhe; Wang, Shengqi
2016-11-01
Exposure to ionizing radiation causes damage to living tissues; however, only a small number of agents have been approved for use in radiation injuries. Radioprotectors are the primary countermeasure to radiation injury, yet no radioprotector has reached the drug development stage. Repurposing the long list of approved, non-radioprotective drugs is an attractive strategy to find new radioprotective agents. Here, we applied a computational approach to discover new radioprotectors in silico by comparing publicly available gene expression data of ionizing radiation-treated samples from the Gene Expression Omnibus (GEO) database with gene expression signatures of more than 1309 small-molecule compounds from the Connectivity Map (cmap) dataset. Among the best compounds predicted to be therapeutic for ionizing radiation damage by this approach were some previously reported radioprotectors and baclofen (P<0.01), a chemical that was not previously used as a radioprotector. Validation using a cell-based model and a rodent in vivo model demonstrated that treatment with baclofen reduced radiation-induced cytotoxicity in vitro (P<0.01), attenuated bone marrow damage and increased survival in vivo (P<0.05). These findings suggest that baclofen might serve as a radioprotector. The drug repurposing strategy of connecting the GEO data and cmap can be used to identify known drugs as potential radioprotective agents. Copyright © 2016 Elsevier Ltd. All rights reserved.
Abiri, Maryam; Karamzadeh, Razieh; Karimipoor, Morteza; Ghadami, Shirin; Alaei, Mohammad Reza; Bagheri, Samira Dabagh; Bagherian, Hamideh; Setoodeh, Aria; Noori-Daloii, Mohammad Reza; Sirous Zeinali
2016-04-01
Maple syrup urine disease (MSUD) is a rare inborn error of branched-chain amino acid metabolism. The disease prevalence is higher in populations with an elevated rate of consanguineous marriages, such as Iran. Different types of disease-causing mutations have been previously reported in the BCKDHA, BCKDHB, DBT and DLD genes known to be responsible for the MSUD phenotype. In this study, two sets of multiplex polymorphic STR (Short Tandem Repeat) markers linked to the above genes were used to aid in homozygosity mapping in order to find probable pathogenic change(s) in the studied families. The families who showed a homozygous haplotype for the BCKDHA gene were subsequently sequenced. Our findings showed that exons 2, 4 and 6 contain most of the mutations, which are novel. The changes include two single nucleotide deletions (i.e. c.143delT and c.702delT), one gross deletion covering the whole of exon four c.(375+1_376-1)_(8849+1_885-1), two splice site changes (c.1167+1G>T, c.288+1G>A), and one point mutation (c.731G>A). Computational approaches were used to analyze these two novel mutations in terms of their impact on protein structure. Computational structural modeling indicated that these mutations might affect structural stability and multimeric assembly of the branched-chain α-keto acid dehydrogenase complex (BCKDC). Copyright © 2016. Published by Elsevier B.V.
Kumar, Rajnish; Mishra, Bharat Kumar; Lahiri, Tapobrata; Kumar, Gautam; Kumar, Nilesh; Gupta, Rahul; Pal, Manoj Kumar
2017-06-01
Online retrieval of homologous nucleotide sequences from a given sequence database through existing alignment techniques is common practice. The salient point of these techniques is their dependence on local alignment and scoring matrices, the reliability of which is limited by computational complexity and accuracy. To address this, this work offers a novel numerical representation of genes that can help divide the data space into smaller partitions and so support the formation of a search tree. In this context, this paper introduces a 36-dimensional Periodicity Count Value (PCV), which is representative of a particular nucleotide sequence and is adapted from the stochastic model of Kolekar et al. (American Institute of Physics 1298:307-312, 2010. doi:10.1063/1.3516320). The PCV construct uses information on the physicochemical properties of nucleotides and their positional distribution pattern within a gene. It is observed that the PCV representation of a gene reduces the computational cost of calculating distances between a pair of genes while being consistent with existing methods. The validity of the PCV-based method was further tested through its use in molecular phylogeny construction, in comparison with existing sequence alignment methods.
Zhang, Yan-Qiong; Chen, Dong-Liang; Tian, Hai-Feng; Zhang, Bao-Hong; Wen, Jian-Fan
2009-10-01
Using a combined computational program, we identified 50 potential microRNAs (miRNAs) in Giardia lamblia, one of the most primitive unicellular eukaryotes. These miRNAs are unique to G. lamblia and no homologues have been found in other organisms; miRNAs, currently known in other species, were not found in G. lamblia. This suggests that miRNA biogenesis and miRNA-mediated gene regulation pathway may evolve independently, especially in evolutionarily distant lineages. A majority (43) of the predicted miRNAs are located at one single locus; however, some miRNAs have two or more copies in the genome. Among the 58 miRNA genes, 28 are located in the intergenic regions whereas 30 are present in the anti-sense strands of the protein-coding sequences. Five predicted miRNAs are expressed in G. lamblia trophozoite cells evidenced by expressed sequence tags or RT-PCR. Thirty-seven identified miRNAs may target 50 protein-coding genes, including seven variant-specific surface proteins (VSPs). Our findings provide a clue that miRNA-mediated gene regulation may exist in the early stage of eukaryotic evolution, suggesting that it is an important regulation system ubiquitous in eukaryotes.
HRGFish: A database of hypoxia responsive genes in fishes
NASA Astrophysics Data System (ADS)
Rashid, Iliyas; Nagpure, Naresh Sahebrao; Srivastava, Prachi; Kumar, Ravindra; Pathak, Ajey Kumar; Singh, Mahender; Kushwaha, Basdeo
2017-02-01
Several studies have highlighted the changes in gene expression due to the hypoxia response in fishes, but a systematic organization of the information and an analytical platform for such genes are lacking. In the present study, an attempt was made to develop a database of hypoxia responsive genes in fishes (HRGFish), integrated with analytical tools, using LAMPP technology. Genes reported in the hypoxia response for fishes were compiled through a literature survey, and the database presently covers 818 gene sequences and 35 gene types from 38 fishes. The upstream fragments (3,000 bp) covered in this database enable computation of CG dinucleotide frequencies, motif finding for the hypoxia response element, identification of CpG islands and mapping to the reference promoter of zebrafish. The database also includes functional annotation of genes and provides tools for analyzing sequences and designing primers for selected gene fragments. This may be the first database on the hypoxia response genes in fishes that provides a workbench to the scientific community involved in studying the evolution and ecological adaptation of fish species in relation to hypoxia.
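As a rough illustration of the CG dinucleotide computation mentioned above, the sketch below counts CG dinucleotides in an upstream fragment and derives the widely used observed/expected CpG ratio. The abstract does not say which formula or thresholds HRGFish applies, so the function names and the choice of ratio are assumptions.

```python
def cg_dinucleotide_frequency(seq):
    """Fraction of adjacent base pairs in the sequence that are 'CG'."""
    seq = seq.upper()
    if len(seq) < 2:
        return 0.0
    # 'CG' cannot overlap itself, so str.count gives the exact count.
    return seq.count("CG") / (len(seq) - 1)

def cpg_observed_expected(seq):
    """Observed/expected CpG ratio: (#CG * N) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)

fragment = "ACGCGT"  # toy sequence; a real upstream fragment would be ~3,000 bp
print(cg_dinucleotide_frequency(fragment))  # 0.4
print(cpg_observed_expected(fragment))      # 3.0
```

CpG island detection typically applies statistics like these in a sliding window along the sequence; window size and cut-offs vary between tools.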
Computer-Aided Resolution of an Experimental Paradox in Bacterial Chemotaxis
Abouhamad, Walid N.; Bray, Dennis; Schuster, Martin; Boesch, Kristin C.; Silversmith, Ruth E.; Bourret, Robert B.
1998-01-01
Escherichia coli responds to its environment by means of a network of intracellular reactions which process signals from membrane-bound receptors and relay them to the flagellar motors. Although characterization of the reactions in the chemotaxis signaling pathway is sufficiently complete to construct computer simulations that predict the phenotypes of mutant strains with a high degree of accuracy, two previous experimental investigations of the activity remaining upon genetic deletion of multiple signaling components yielded several contradictory results (M. P. Conley, A. J. Wolfe, D. F. Blair, and H. C. Berg, J. Bacteriol. 171:5190–5193, 1989; J. D. Liu and J. S. Parkinson, Proc. Natl. Acad. Sci. USA 86:8703–8707, 1989). For example, “building up” the pathway by adding back CheA and CheY to a gutted strain lacking chemotaxis genes resulted in counterclockwise flagellar rotation whereas “breaking down” the pathway by deleting chemotaxis genes except cheA and cheY resulted in alternating episodes of clockwise and counterclockwise flagellar rotation. Our computer simulation predicts that trace amounts of CheZ expressed in the gutted strain could account for this difference. We tested this explanation experimentally by constructing a mutant containing a new deletion of the che genes that cannot express CheZ and verified that the behavior of strains built up from the new deletion does in fact conform to both the phenotypes observed for breakdown strains and computer-generated predictions. Our findings consolidate the present view of the chemotaxis signaling pathway and highlight the utility of molecularly based computer models in the analysis of complex biochemical networks. PMID:9683468
Demographic history and gene flow during silkworm domestication
2014-01-01
Background Gene flow plays an important role in the domestication history of domesticated species. However, little is known about the demographic history of the domesticated silkworm involving gene flow with its wild relative. Results In this study, four model-based evolutionary scenarios to describe the demographic history of B. mori were hypothesized. Using the Approximate Bayesian Computation method and DNA sequence data from 29 nuclear loci, we found that the gene flow at bottleneck model is the most likely scenario for silkworm domestication. The starting time of silkworm domestication was estimated to be approximately 7,500 years ago; the time of domestication termination was 3,984 years ago. Using coalescent simulation analysis, we also found that bi-directional gene flow occurred during silkworm domestication. Conclusions Estimates of silkworm domestication time are nearly consistent with the archeological evidence and our previous results. Importantly, we found that bi-directional gene flow might have occurred during silkworm domestication. Our findings add a dimension to highlight the important role of gene flow in the domestication of crops and animals. PMID:25123546
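The general idea behind Approximate Bayesian Computation can be shown with a minimal rejection sampler. This is a generic sketch under strong simplifying assumptions: it estimates a single parameter of a toy Gaussian model, whereas the study above compared four demographic scenarios using coalescent simulations and sequence data. All names, the prior, and the tolerance are hypothetical.

```python
import random

def abc_rejection(observed_stat, simulate, prior_sample, tolerance, n_draws, rng):
    """Keep prior draws whose simulated summary statistic lies within
    `tolerance` of the observed one (basic rejection ABC)."""
    accepted = []
    for _ in range(n_draws):
        theta = prior_sample(rng)
        if abs(simulate(theta, rng) - observed_stat) <= tolerance:
            accepted.append(theta)
    return accepted

def prior_sample(rng):
    # Hypothetical flat prior on the parameter.
    return rng.uniform(0.0, 10.0)

def simulate(theta, rng):
    # Toy simulator: summary statistic is the mean of 20 noisy observations.
    return sum(rng.gauss(theta, 1.0) for _ in range(20)) / 20

rng = random.Random(0)  # fixed seed so the run is reproducible
posterior = abc_rejection(4.0, simulate, prior_sample, 0.5, 2000, rng)
```

The accepted values approximate the posterior of the parameter; for model choice, the same accept/reject step can be applied to draws from several candidate scenarios and the acceptance counts compared.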
Long, Hannah K; Sims, David; Heger, Andreas; Blackledge, Neil P; Kutter, Claudia; Wright, Megan L; Grützner, Frank; Odom, Duncan T; Patient, Roger; Ponting, Chris P; Klose, Robert J
2013-01-01
Two-thirds of gene promoters in mammals are associated with regions of non-methylated DNA, called CpG islands (CGIs), which counteract the repressive effects of DNA methylation on chromatin. In cold-blooded vertebrates, computational CGI predictions often reside away from gene promoters, suggesting a major divergence in gene promoter architecture across vertebrates. By experimentally identifying non-methylated DNA in the genomes of seven diverse vertebrates, we instead reveal that non-methylated islands (NMIs) of DNA are a central feature of vertebrate gene promoters. Furthermore, NMIs are present at orthologous genes across vast evolutionary distances, revealing a surprising level of conservation in this epigenetic feature. By profiling NMIs in different tissues and developmental stages we uncover a unifying set of features that are central to the function of NMIs in vertebrates. Together these findings demonstrate an ancient logic for NMI usage at gene promoters and reveal an unprecedented level of epigenetic conservation across vertebrate evolution. DOI: http://dx.doi.org/10.7554/eLife.00348.001 PMID:23467541
NETWORK ASSISTED ANALYSIS TO REVEAL THE GENETIC BASIS OF AUTISM
Liu, Li; Lei, Jing; Roeder, Kathryn
2016-01-01
While studies show that autism is highly heritable, the nature of the genetic basis of this disorder remains elusive. Based on the idea that highly correlated genes are functionally interrelated and more likely to affect risk, we develop a novel statistical tool to find additional potential autism risk genes by combining the genetic association scores with gene co-expression in specific brain regions and periods of development. The gene dependence network is estimated using a novel partial neighborhood selection (PNS) algorithm, where node-specific properties are incorporated into network estimation for improved statistical and computational efficiency. Then we adopt a hidden Markov random field (HMRF) model to combine the estimated network and the genetic association scores in a systematic manner. The proposed modeling framework can be naturally extended to incorporate additional structural information concerning the dependence between genes. Using currently available genetic association data from whole exome sequencing studies and brain gene expression levels, the proposed algorithm successfully identified 333 genes that plausibly affect autism risk. PMID:27134692
He, Jian; Gan, Weidong; Liu, Song; Zhou, Kefeng; Zhang, Gutian; Guo, Hongqian; Zhu, Bin
2015-01-01
To investigate the dynamic contrast-enhanced computed tomography (CT) characteristics of renal cell carcinoma associated with Xp11.2 translocation and TFE gene fusion (Xp11.2 RCC) by comparison with clear cell renal cell carcinoma (CCRCC). Dynamic contrast-enhanced CT images and clinical and pathological records of 20 adult patients with Xp11.2 RCC confirmed by TFE3 immunohistochemical and fluorescence in situ hybridization assay were retrospectively analyzed and compared with the findings of 21 contemporary CCRCCs. Renal cell carcinoma associated with Xp11.2 translocation and TFE gene fusions often occurred in young (30.6 ± 8.6 years) patients with hematuria (9/20). They presented as well-defined (17/20) cystic-solid (17/20) mass with hemorrhage (8/20) and circular/rim calcifications (6/20). Dynamic contrast-enhanced CT showed heterogeneous moderate prolonged enhancement. A tumor-to-cortex attenuation ratio in corticomedullary phase less than 0.62 gave a sensitivity of 90.0% and a specificity of 92.9% in differentiating Xp11.2 RCC from CCRCC (area under the receiver operating characteristic curve = 0.957, P < 0.001). Computed tomographic characteristics and dynamic contrast-enhanced patterns and index can differentiate Xp11.2 RCC from CCRCC.
Dissociable contribution of prefrontal and striatal dopaminergic genes to learning in economic games
Set, Eric; Saez, Ignacio; Zhu, Lusha; Houser, Daniel E.; Myung, Noah; Zhong, Songfa; Ebstein, Richard P.; Chew, Soo Hong; Hsu, Ming
2014-01-01
Game theory describes strategic interactions where success of players’ actions depends on those of coplayers. In humans, substantial progress has been made at the neural level in characterizing the dopaminergic and frontostriatal mechanisms mediating such behavior. Here we combined computational modeling of strategic learning with a pathway approach to characterize association of strategic behavior with variations in the dopamine pathway. Specifically, using gene-set analysis, we systematically examined contribution of different dopamine genes to variation in a multistrategy competitive game captured by (i) the degree players anticipate and respond to actions of others (belief learning) and (ii) the speed with which such adaptations take place (learning rate). We found that variation in genes that primarily regulate prefrontal dopamine clearance—catechol-O-methyl transferase (COMT) and two isoforms of monoamine oxidase—modulated degree of belief learning across individuals. In contrast, we did not find significant association for other genes in the dopamine pathway. Furthermore, variation in genes that primarily regulate striatal dopamine function—dopamine transporter and D2 receptors—was significantly associated with the learning rate. We found that this was also the case with COMT, but not for other dopaminergic genes. Together, these findings highlight dissociable roles of frontostriatal systems in strategic learning and support the notion that genetic variation, organized along specific pathways, forms an important source of variation in complex phenotypes such as strategic behavior. PMID:24979760
Mallik, Saurav; Zhao, Zhongming
2017-12-28
For transcriptomic analysis, there are numerous microarray-based genomic data, especially those generated for cancer research. The typical analysis measures the difference between a cancer sample-group and a matched control group for each transcript or gene. Association rule mining is used to discover interesting item sets through rule-based methodology. Thus, it has advantages for finding cause-and-effect relationships between the transcripts. In this work, we introduce two new rule-based similarity measures (weighted rank-based Jaccard and Cosine measures) and then propose a novel computational framework to detect condensed gene co-expression modules (ConGEMs) through the association rule-based learning system and the weighted similarity scores. In practice, the list of evolved condensed markers that consists of both singular and complex markers in nature depends on the corresponding condensed gene sets in either antecedent or consequent of the rules of the resultant modules. In our evaluation, these markers could be supported by literature evidence, KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway and Gene Ontology annotations. Specifically, we preliminarily identified differentially expressed genes using an empirical Bayes test. A recently developed algorithm, RANWAR, was then utilized to determine the association rules from these genes. Based on that, we computed the integrated similarity scores of these rule-based similarity measures between each rule-pair, and the resultant scores were used for clustering to identify the co-expressed rule-modules. We applied our method to a gene expression dataset for lung squamous cell carcinoma and a genome methylation dataset for uterine cervical carcinogenesis. Our proposed module discovery method produced better results than the traditional gene-module discovery measures. In summary, our proposed rule-based method is useful for exploring biomarker modules from transcriptomic data.
Schmidt, Florian; Gasparoni, Nina; Gasparoni, Gilles; Gianmoena, Kathrin; Cadenas, Cristina; Polansky, Julia K; Ebert, Peter; Nordström, Karl; Barann, Matthias; Sinha, Anupam; Fröhler, Sebastian; Xiong, Jieyi; Dehghani Amirabad, Azim; Behjati Ardakani, Fatemeh; Hutter, Barbara; Zipprich, Gideon; Felder, Bärbel; Eils, Jürgen; Brors, Benedikt; Chen, Wei; Hengstler, Jan G; Hamann, Alf; Lengauer, Thomas; Rosenstiel, Philip; Walter, Jörn; Schulz, Marcel H
2017-01-09
The binding and contribution of transcription factors (TF) to cell specific gene expression is often deduced from open-chromatin measurements to avoid costly TF ChIP-seq assays. Thus, it is important to develop computational methods for accurate TF binding prediction in open-chromatin regions (OCRs). Here, we report a novel segmentation-based method, TEPIC, to predict TF binding by combining sets of OCRs with position weight matrices. TEPIC can be applied to various open-chromatin data, e.g. DNaseI-seq and NOMe-seq. Additionally, Histone-Marks (HMs) can be used to identify candidate TF binding sites. TEPIC computes TF affinities and uses open-chromatin/HM signal intensity as quantitative measures of TF binding strength. Using machine learning, we find low affinity binding sites to improve our ability to explain gene expression variability compared to the standard presence/absence classification of binding sites. Further, we show that both footprints and peaks capture essential TF binding events and lead to a good prediction performance. In our application, gene-based scores computed by TEPIC with one open-chromatin assay nearly reach the quality of several TF ChIP-seq data sets. Finally, these scores correctly predict known transcriptional regulators as illustrated by the application to novel DNaseI-seq and NOMe-seq data for primary human hepatocytes and CD4+ T-cells, respectively.
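As a rough illustration of combining an open-chromatin region with a position weight matrix, the sketch below scans one region sequence for its best-scoring window. It is a generic log-odds scan, not TEPIC's affinity model; the function names `pwm_from_counts` and `best_site`, the pseudocount, and the uniform background are all assumptions made for this example.

```python
import math

def pwm_from_counts(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2-odds scores (illustrative)."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column.get(base, 0) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm

def best_site(sequence, pwm):
    """Return (score, offset) of the highest-scoring window, or None if too short.

    Assumes the sequence contains only upper-case A, C, G and T.
    """
    width = len(pwm)
    best = None
    for offset in range(len(sequence) - width + 1):
        window = sequence[offset:offset + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if best is None or score > best[0]:
            best = (score, offset)
    return best

# Hypothetical 3-bp motif "ACG" observed in 8 aligned sites.
motif_counts = [{"A": 8}, {"C": 8}, {"G": 8}]
pwm = pwm_from_counts(motif_counts)
hit = best_site("TTTACGTTT", pwm)
```

A presence/absence classifier would threshold `hit[0]`; keeping the continuous score instead is closer in spirit to the quantitative treatment of low-affinity sites that the abstract describes.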
Fu, Wei; Xie, Wen; Zhang, Zhuo; Wang, Shaoli; Wu, Qingjun; Liu, Yong; Zhou, Xiaomao; Zhou, Xuguo; Zhang, Youjun
2013-01-01
Quantitative real-time PCR (qRT-PCR), a primary tool in gene expression analysis, requires an appropriate normalization strategy to control for variation among samples. The best option is to compare the mRNA level of a target gene with that of reference gene(s) whose expression level is stable across various experimental conditions. In this study, expression profiles of eight candidate reference genes from the diamondback moth, Plutella xylostella, were evaluated under diverse experimental conditions. RefFinder, a web-based analysis tool, integrates four major computational programs including geNorm, Normfinder, BestKeeper, and the comparative ΔCt method to comprehensively rank the tested candidate genes. Elongation factor 1 (EF1) was the most suitable reference gene for the biotic factors (development stage, tissue, and strain). In contrast, although appropriate reference gene(s) do exist for several abiotic factors (temperature, photoperiod, insecticide, and mechanical injury), we were not able to identify a single universal reference gene. Nevertheless, a suite of candidate reference genes was specifically recommended for selected experimental conditions. Our finding is the first step toward establishing a standardized qRT-PCR analysis of this agriculturally important insect pest. PMID:23983612
Baresic, Mario; Salatino, Silvia; Kupr, Barbara
2014-01-01
Skeletal muscle tissue shows an extraordinary cellular plasticity, but the underlying molecular mechanisms are still poorly understood. Here, we use a combination of experimental and computational approaches to unravel the complex transcriptional network of muscle cell plasticity centered on the peroxisome proliferator-activated receptor γ coactivator 1α (PGC-1α), a regulatory nexus in endurance training adaptation. By integrating data on genome-wide binding of PGC-1α and gene expression upon PGC-1α overexpression with comprehensive computational prediction of transcription factor binding sites (TFBSs), we uncover a hitherto-underestimated number of transcription factor partners involved in mediating PGC-1α action. In particular, principal component analysis of TFBSs at PGC-1α binding regions predicts that, besides the well-known role of the estrogen-related receptor α (ERRα), the activator protein 1 complex (AP-1) plays a major role in regulating the PGC-1α-controlled gene program of the hypoxia response. Our findings thus reveal the complex transcriptional network of muscle cell plasticity controlled by PGC-1α. PMID:24912679
HomoTarget: a new algorithm for prediction of microRNA targets in Homo sapiens.
Ahmadi, Hamed; Ahmadi, Ali; Azimzadeh-Jamalkandi, Sadegh; Shoorehdeli, Mahdi Aliyari; Salehzadeh-Yazdi, Ali; Bidkhori, Gholamreza; Masoudi-Nejad, Ali
2013-02-01
MiRNAs play an essential role in the networks of gene regulation by inhibiting the translation of target mRNAs. Several computational approaches have been proposed for the prediction of miRNA target-genes. Reports reveal a large fraction of under-predicted or falsely predicted target genes. Thus, there is an imperative need to develop a computational method by which the target mRNAs of existing miRNAs can be correctly identified. In this study, a combined pattern recognition neural network (PRNN) and principal component analysis (PCA) architecture is proposed in order to model the complicated relationship between miRNAs and their target mRNAs in humans. The results of several types of intelligent classifiers and our proposed model were compared, showing that our algorithm outperformed them with higher sensitivity and specificity. Using the recent release of the miRBase database to find potential targets of miRNAs, this model incorporated twelve structural, thermodynamic and positional features of miRNA:mRNA binding sites to select target candidates.
2018-01-01
CTCF and cohesin are key drivers of 3D-nuclear organization, anchoring the megabase-scale Topologically Associating Domains (TADs) that segment the genome. Here, we present and validate a computational method to predict cohesin-and-CTCF binding sites that form intra-TAD DNA loops. The intra-TAD loop anchors identified are structurally indistinguishable from TAD anchors regarding binding partners, sequence conservation, and resistance to cohesin knockdown; further, the intra-TAD loops retain key functional features of TADs, including chromatin contact insulation, blockage of repressive histone mark spread, and ubiquity across tissues. We propose that intra-TAD loops form by the same loop extrusion mechanism as the larger TAD loops, and that their shorter length enables finer regulatory control in restricting enhancer-promoter interactions, which enables selective, high-level expression of gene targets of super-enhancers and genes located within repressive nuclear compartments. These findings elucidate the role of intra-TAD cohesin-and-CTCF binding in nuclear organization associated with widespread insulation of distal enhancer activity. PMID:29757144
Matthews, Bryan J; Waxman, David J
2018-05-14
CTCF and cohesin are key drivers of 3D-nuclear organization, anchoring the megabase-scale Topologically Associating Domains (TADs) that segment the genome. Here, we present and validate a computational method to predict cohesin-and-CTCF binding sites that form intra-TAD DNA loops. The intra-TAD loop anchors identified are structurally indistinguishable from TAD anchors regarding binding partners, sequence conservation, and resistance to cohesin knockdown; further, the intra-TAD loops retain key functional features of TADs, including chromatin contact insulation, blockage of repressive histone mark spread, and ubiquity across tissues. We propose that intra-TAD loops form by the same loop extrusion mechanism as the larger TAD loops, and that their shorter length enables finer regulatory control in restricting enhancer-promoter interactions, which enables selective, high-level expression of gene targets of super-enhancers and genes located within repressive nuclear compartments. These findings elucidate the role of intra-TAD cohesin-and-CTCF binding in nuclear organization associated with widespread insulation of distal enhancer activity.
Dictionary-driven prokaryotic gene finding.
Shibuya, Tetsuo; Rigoutsos, Isidore
2002-06-15
Gene identification, also known as gene finding or gene recognition, is among the important problems of molecular biology that have been receiving increasing attention with the advent of large scale sequencing projects. Previous strategies for solving this problem can be categorized into essentially two schools of thought: one school employs sequence composition statistics, whereas the other relies on database similarity searches. In this paper, we propose a new gene identification scheme that combines the best characteristics from each of these two schools. In particular, our method determines gene candidates among the ORFs that can be identified in a given DNA strand through the use of the Bio-Dictionary, a database of patterns that covers essentially all of the currently available sample of the natural protein sequence space. Our approach relies entirely on the use of redundant patterns as the agents on which the presence or absence of genes is predicated and does not employ any additional evidence, e.g. ribosome-binding site signals. The Bio-Dictionary Gene Finder (BDGF), the algorithm's implementation, is a single computational engine able to handle the gene identification task across distinct archaeal and bacterial genomes. The engine exhibits performance that is characterized by simultaneous very high values of sensitivity and specificity, and a high percentage of correctly predicted start sites. Using a collection of patterns derived from an old (June 2000) release of the Swiss-Prot/TrEMBL database that contained 451 602 proteins and fragments, we demonstrate our method's generality and capabilities through an extensive analysis of 17 complete archaeal and bacterial genomes. Examples of previously unreported genes are also shown and discussed in detail.
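The first step named in the abstract above, enumerating the ORFs of a given DNA strand, can be sketched as follows. This is only an illustration under simplifying assumptions (forward strand only, ATG as the sole start codon, the three standard stop codons, a hypothetical `min_codons` cutoff); the pattern-based scoring against the Bio-Dictionary that BDGF then applies to such candidates is not shown.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) tuples for ATG...stop ORFs on the forward strand.

    `end` is exclusive and includes the stop codon; `min_codons` counts the
    codons from the start codon up to, but not including, the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None
    return orfs

candidates = find_orfs("CCATGAAATAGCC")
```

Each tuple can be sliced directly out of the input, for example `"CCATGAAATAGCC"[2:11]` gives `"ATGAAATAG"`. A full scan would repeat the procedure on the reverse complement.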
NASA Astrophysics Data System (ADS)
Bekas, C.; Curioni, A.
2010-06-01
Enforcing the orthogonality of approximate wavefunctions becomes one of the dominant computational kernels in planewave based Density Functional Theory electronic structure calculations that involve thousands of atoms. In this context, algorithms that enjoy both excellent scalability and single processor performance properties are much needed. In this paper we present block versions of the Gram-Schmidt method and we show that they are excellent candidates for our purposes. We compare the new approach with the state of the art practice in planewave based calculations and find that it has much to offer, especially when applied on massively parallel supercomputers such as the IBM Blue Gene/P Supercomputer. The new method achieves excellent sustained performance that surpasses 73 TFLOPS (67% of peak) on 8 Blue Gene/P racks (32 768 compute cores), while it enables more than a two fold decrease in run time when compared with the best competing methodology.
Selection Shapes Transcriptional Logic and Regulatory Specialization in Genetic Networks
Fogelmark, Karl; Peterson, Carsten; Troein, Carl
2016-01-01
Background: Living organisms need to regulate their gene expression in response to environmental signals and internal cues. This is a computational task where genes act as logic gates that connect to form transcriptional networks, which are shaped at all scales by evolution. Large-scale mutations such as gene duplications and deletions add and remove network components, whereas smaller mutations alter the connections between them. Selection determines what mutations are accepted, but its importance for shaping the resulting networks has been debated. Methodology: To investigate the effects of selection in the shaping of transcriptional networks, we derive transcriptional logic from a combinatorially powerful yet tractable model of the binding between DNA and transcription factors. By evolving the resulting networks based on their ability to function as either a simple decision system or a circadian clock, we obtain information on the regulation and logic rules encoded in functional transcriptional networks. Comparisons are made between networks evolved for different functions, as well as with structurally equivalent but non-functional (neutrally evolved) networks, and predictions are validated against the transcriptional network of E. coli. Principal Findings: We find that the logic rules governing gene expression depend on the function performed by the network. Unlike the decision systems, the circadian clocks show strong cooperative binding and negative regulation, which achieves tight temporal control of gene expression. Furthermore, we find that transcription factors act preferentially as either activators or repressors, both when binding multiple sites for a single target gene and globally in the transcriptional networks. This separation into positive and negative regulators requires gene duplications, which highlights the interplay between mutation and selection in shaping the transcriptional networks. PMID:26927540
DigOut: viewing differential expression genes as outliers.
Yu, Hui; Tu, Kang; Xie, Lu; Li, Yuan-Yuan
2010-12-01
With regard to well-replicated two-conditional microarray datasets, the selection of differentially expressed (DE) genes is a well-studied computational topic, but for multi-conditional microarray datasets with limited or no replication, the same task is not properly addressed by previous studies. This paper adopts multivariate outlier analysis to analyze replication-lacking multi-conditional microarray datasets, finding that it performs significantly better than the widely used limit fold change (LFC) model in a simulated comparative experiment. Compared with the LFC model, the multivariate outlier analysis also demonstrates improved stability against sample variations in a series of manipulated real expression datasets. The reanalysis of a real non-replicated multi-conditional expression dataset series leads to satisfactory results. In conclusion, a multivariate outlier analysis algorithm, like DigOut, is particularly useful for selecting DE genes from non-replicated multi-conditional gene expression datasets.
A novel swarm intelligence algorithm for finding DNA motifs.
Lei, Chengwei; Ruan, Jianhua
2009-01-01
Discovering DNA motifs from co-expressed or co-regulated genes is an important step towards deciphering complex gene regulatory networks and understanding gene functions. Despite significant improvement in the last decade, it still remains one of the most challenging problems in computational molecular biology. In this work, we propose a novel motif finding algorithm that finds consensus patterns using a population-based stochastic optimisation technique called Particle Swarm Optimisation (PSO), which has been shown to be effective in optimising difficult multidimensional problems in continuous domains. We propose to use a word dissimilarity graph to remap the neighborhood structure of the solution space of DNA motifs, and propose a modification of the naive PSO algorithm to accommodate discrete variables. In order to improve efficiency, we also propose several strategies for escaping from local optima and for automatically determining the termination criteria. Experimental results on simulated challenge problems show that our method is both more efficient and more accurate than several existing algorithms. Applications to several sets of real promoter sequences also show that our approach is able to detect known transcription factor binding sites, and outperforms two of the most popular existing algorithms.
Thanki, Anil S; Soranzo, Nicola; Haerty, Wilfried; Davey, Robert P
2018-03-01
Gene duplication is a major factor contributing to evolutionary novelty, and the contraction or expansion of gene families has often been associated with morphological, physiological, and environmental adaptations. The study of homologous genes helps us to understand the evolution of gene families. It plays a vital role in finding ancestral gene duplication events as well as identifying genes that have diverged from a common ancestor under positive selection. There are various tools available, such as MSOAR, OrthoMCL, and HomoloGene, to identify gene families and visualize syntenic information between species, providing an overview of syntenic regions evolution at the family level. Unfortunately, none of them provide information about structural changes within genes, such as the conservation of ancestral exon boundaries among multiple genomes. The Ensembl GeneTrees computational pipeline generates gene trees based on coding sequences, provides details about exon conservation, and is used in the Ensembl Compara project to discover gene families. A certain amount of expertise is required to configure and run the Ensembl Compara GeneTrees pipeline via command line. Therefore, we converted this pipeline into a Galaxy workflow, called GeneSeqToFamily, and provided additional functionality. This workflow uses existing tools from the Galaxy ToolShed, as well as providing additional wrappers and tools that are required to run the workflow. GeneSeqToFamily represents the Ensembl GeneTrees pipeline as a set of interconnected Galaxy tools, so they can be run interactively within the Galaxy's user-friendly workflow environment while still providing the flexibility to tailor the analysis by changing configurations and tools if necessary. Additional tools allow users to subsequently visualize the gene families produced by the workflow, using the Aequatus.js interactive tool, which has been developed as part of the Aequatus software project.
Chai, Xiaoqiang; Han, Yanan; Yang, Jian; Zhao, Xianxian; Liu, Yewang; Hou, Xugang; Tang, Yiheng; Zhao, Shirong; Li, Xiao
2016-02-01
The molecular pathogenesis of hepatitis B virus infection in humans is extremely complex and heterogeneous. To date, the molecular details are not clearly defined despite intensive research efforts. Thus, studies aimed at transcription and regulation during virus infection, or combined studies of those already known to be beneficial, are needed. To identify the transcriptional regulators related to hepatitis B virus infection at the gene level, we analyzed gene expression profiles from normal individuals and hepatitis B patients. First, the differentially expressed genes were selected, and several of them were validated in an independent set by qRT-PCR. Then a differential co-expression analysis was conducted to identify differentially co-expressed links and differentially co-expressed genes. Next, the regulatory impact factors were analyzed by mapping the links and regulatory data. To gain further insight into these regulators, co-expression gene modules were identified using a threshold-based hierarchical clustering method, and the regulatory network was constructed using computer software. A total of 137,284 differentially co-expressed links and 780 differentially co-expressed genes were identified. These co-expressed genes were significantly enriched in the inflammatory response. The regulatory impact factor results revealed several crucial regulators related to hepatocellular carcinoma and other high-rank regulators. Meanwhile, more than one hundred co-expression gene modules were identified using the clustering method. In our study, some important transcriptional regulators were identified using a computational method, which may enhance the understanding of disease mechanisms and lead to improved treatment of hepatitis B. However, further experimental studies are required to confirm these findings.
PRGdb: a bioinformatics platform for plant resistance gene analysis
Sanseverino, Walter; Roma, Guglielmo; De Simone, Marco; Faino, Luigi; Melito, Sara; Stupka, Elia; Frusciante, Luigi; Ercolano, Maria Raffaella
2010-01-01
PRGdb is a web accessible open-source (http://www.prgdb.org) database that represents the first bioinformatic resource providing a comprehensive overview of resistance genes (R-genes) in plants. PRGdb holds more than 16 000 known and putative R-genes belonging to 192 plant species challenged by 115 different pathogens and linked with useful biological information. The complete database includes a set of 73 manually curated reference R-genes, 6308 putative R-genes collected from NCBI and 10463 computationally predicted putative R-genes. Thanks to a user-friendly interface, data can be examined using different query tools. A home-made prediction pipeline called Disease Resistance Analysis and Gene Orthology (DRAGO), based on reference R-gene sequence data, was developed to search for plant resistance genes in public datasets such as Unigene and Genbank. New putative R-gene classes containing unknown domain combinations were discovered and characterized. The development of the PRG platform represents an important starting point to conduct various experimental tasks. The inferred cross-link between genomic and phenotypic information allows access to a large body of information to find answers to several biological questions. The database structure also permits easy integration with other data types and opens up prospects for future implementations. PMID:19906694
Functional and Genomic Features of Human Genes Mutated in Neuropsychiatric Disorders.
Forero, Diego A; Prada, Carlos F; Perry, George
2016-01-01
In recent years, a large number of studies around the world have led to the identification of causal genes for hereditary types of common and rare neurological and psychiatric disorders. To explore the functional and genomic features of known human genes mutated in neuropsychiatric disorders. A systematic search was used to develop a comprehensive catalog of genes mutated in neuropsychiatric disorders (NPD). Functional enrichment and protein-protein interaction analyses were carried out. A false discovery rate approach was used for correction for multiple testing. We found several functional categories that are enriched among NPD genes, such as gene ontologies, protein domains, tissue expression, signaling pathways and regulation by brain-expressed miRNAs and transcription factors. Sixty six of those NPD genes are known to be druggable. Several topographic parameters of protein-protein interaction networks and the degree of conservation between orthologous genes were identified as significant among NPD genes. These results represent one of the first analyses of enrichment of functional categories of genes known to harbor mutations for NPD. These findings could be useful for a future creation of computational tools for prioritization of novel candidate genes for NPD.
Functional and Genomic Features of Human Genes Mutated in Neuropsychiatric Disorders
Forero, Diego A.; Prada, Carlos F.; Perry, George
2016-01-01
Background: In recent years, a large number of studies around the world have led to the identification of causal genes for hereditary types of common and rare neurological and psychiatric disorders. Objective: To explore the functional and genomic features of known human genes mutated in neuropsychiatric disorders. Methods: A systematic search was used to develop a comprehensive catalog of genes mutated in neuropsychiatric disorders (NPD). Functional enrichment and protein-protein interaction analyses were carried out. A false discovery rate approach was used for correction for multiple testing. Results: We found several functional categories that are enriched among NPD genes, such as gene ontologies, protein domains, tissue expression, signaling pathways and regulation by brain-expressed miRNAs and transcription factors. Sixty six of those NPD genes are known to be druggable. Several topographic parameters of protein-protein interaction networks and the degree of conservation between orthologous genes were identified as significant among NPD genes. Conclusion: These results represent one of the first analyses of enrichment of functional categories of genes known to harbor mutations for NPD. These findings could be useful for a future creation of computational tools for prioritization of novel candidate genes for NPD. PMID:27990183
DNA-Binding Kinetics Determines the Mechanism of Noise-Induced Switching in Gene Networks
Tse, Margaret J.; Chu, Brian K.; Roy, Mahua; Read, Elizabeth L.
2015-01-01
Gene regulatory networks are multistable dynamical systems in which attractor states represent cell phenotypes. Spontaneous, noise-induced transitions between these states are thought to underlie critical cellular processes, including cell developmental fate decisions, phenotypic plasticity in fluctuating environments, and carcinogenesis. As such, there is increasing interest in the development of theoretical and computational approaches that can shed light on the dynamics of these stochastic state transitions in multistable gene networks. We applied a numerical rare-event sampling algorithm to study transition paths of spontaneous noise-induced switching for a ubiquitous gene regulatory network motif, the bistable toggle switch, in which two mutually repressive genes compete for dominant expression. We find that the method can efficiently uncover detailed switching mechanisms that involve fluctuations both in occupancies of DNA regulatory sites and copy numbers of protein products. In addition, we show that the rate parameters governing binding and unbinding of regulatory proteins to DNA strongly influence the switching mechanism. In a regime of slow DNA-binding/unbinding kinetics, spontaneous switching occurs relatively frequently and is driven primarily by fluctuations in DNA-site occupancies. In contrast, in a regime of fast DNA-binding/unbinding kinetics, switching occurs rarely and is driven by fluctuations in levels of expressed protein. Our results demonstrate how spontaneous cell phenotype transitions involve collective behavior of both regulatory proteins and DNA. Computational approaches capable of simulating dynamics over many system variables are thus well suited to exploring dynamic mechanisms in gene networks. PMID:26488666
160-fold acceleration of the Smith-Waterman algorithm using a field programmable gate array (FPGA)
Li, Isaac TS; Shum, Warren; Truong, Kevin
2007-01-01
Background: To infer homology and subsequently gene function, the Smith-Waterman (SW) algorithm is used to find the optimal local alignment between two sequences. When searching sequence databases that may contain hundreds of millions of sequences, this algorithm becomes computationally expensive. Results: In this paper, we focused on accelerating the Smith-Waterman algorithm by using FPGA-based hardware that implemented a module for computing the score of a single cell of the SW matrix. Then using a grid of this module, the entire SW matrix was computed at the speed of field propagation through the FPGA circuit. These modifications dramatically accelerated the algorithm's computation time, by up to 160-fold compared to a pure software implementation running on the same FPGA with an Altera Nios II softprocessor. Conclusion: This design of FPGA-accelerated hardware offers a promising new direction for improving the computation of genomic database searching. PMID:17555593
160-fold acceleration of the Smith-Waterman algorithm using a field programmable gate array (FPGA).
Li, Isaac T S; Shum, Warren; Truong, Kevin
2007-06-07
To infer homology and subsequently gene function, the Smith-Waterman (SW) algorithm is used to find the optimal local alignment between two sequences. When searching sequence databases that may contain hundreds of millions of sequences, this algorithm becomes computationally expensive. In this paper, we focused on accelerating the Smith-Waterman algorithm by using FPGA-based hardware that implemented a module for computing the score of a single cell of the SW matrix. Then using a grid of this module, the entire SW matrix was computed at the speed of field propagation through the FPGA circuit. These modifications dramatically accelerated the algorithm's computation time, by up to 160-fold compared to a pure software implementation running on the same FPGA with an Altera Nios II softprocessor. This design of FPGA-accelerated hardware offers a promising new direction for improving the computation of genomic database searching.
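For readers unfamiliar with the per-cell computation that the hardware module above implements, a minimal software sketch of the Smith-Waterman score recurrence is given here. It assumes a linear gap penalty and illustrative scoring values (`match=2`, `mismatch=-1`, `gap=-1`), none of which are taken from the paper, and it returns only the best local score rather than the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between sequences a and b."""
    # Only the previous row of the scoring matrix is kept; row 0 is all zeros.
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Each cell depends on its diagonal, upper and left neighbours.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap
            left = curr[j - 1] + gap
            # Clamping at zero is what makes the alignment local.
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best

score = smith_waterman_score("TTACGTTT", "GGACGGG")
```

Because a cell needs only its three neighbours, all cells on the same anti-diagonal are independent of one another, which is the property that lets a grid of identical cell modules evaluate the matrix in parallel.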
Keilwagen, Jens; Grau, Jan; Paponov, Ivan A; Posch, Stefan; Strickert, Marc; Grosse, Ivo
2011-02-10
Transcription factors are a main component of gene regulation as they activate or repress gene expression by binding to specific binding sites in promoters. The de-novo discovery of transcription factor binding sites in target regions obtained by wet-lab experiments is a challenging problem in computational biology, which has not been fully solved yet. Here, we present a de-novo motif discovery tool called Dispom for finding differentially abundant transcription factor binding sites that models existing positional preferences of binding sites and adjusts the length of the motif in the learning process. Evaluating Dispom, we find that its prediction performance is superior to existing tools for de-novo motif discovery for 18 benchmark data sets with planted binding sites, and for a metazoan compendium based on experimental data from micro-array, ChIP-chip, ChIP-DSL, and DamID as well as Gene Ontology data. Finally, we apply Dispom to find binding sites differentially abundant in promoters of auxin-responsive genes extracted from Arabidopsis thaliana microarray data, and we find a motif that can be interpreted as a refined auxin responsive element predominately positioned in the 250-bp region upstream of the transcription start site. Using an independent data set of auxin-responsive genes, we find in genome-wide predictions that the refined motif is more specific for auxin-responsive genes than the canonical auxin-responsive element. In general, Dispom can be used to find differentially abundant motifs in sequences of any origin. However, the positional distribution learned by Dispom is especially beneficial if all sequences are aligned to some anchor point like the transcription start site in case of promoter sequences. We demonstrate that the combination of searching for differentially abundant motifs and inferring a position distribution from the data is beneficial for de-novo motif discovery. 
Hence, we make the tool freely available as a component of the open-source Java framework Jstacs and as a stand-alone application at http://www.jstacs.de/index.php/Dispom.
Predicting Gene Structure Changes Resulting from Genetic Variants via Exon Definition Features.
Majoros, William H; Holt, Carson; Campbell, Michael S; Ware, Doreen; Yandell, Mark; Reddy, Timothy E
2018-04-25
Genetic variation that disrupts gene function by altering gene splicing between individuals can substantially influence traits and disease. In those cases, accurately predicting the effects of genetic variation on splicing can be highly valuable for investigating the mechanisms underlying those traits and diseases. While methods have been developed to generate high quality computational predictions of gene structures in reference genomes, the same methods perform poorly when used to predict the potentially deleterious effects of genetic changes that alter gene splicing between individuals. Underlying that discrepancy in predictive ability are the common assumptions by reference gene finding algorithms that genes are conserved, well-formed, and produce functional proteins. We describe a probabilistic approach for predicting recent changes to gene structure that may or may not conserve function. The model is applicable to both coding and noncoding genes, and can be trained on existing gene annotations without requiring curated examples of aberrant splicing. We apply this model to the problem of predicting altered splicing patterns in the genomes of individual humans, and we demonstrate that performing gene-structure prediction without relying on conserved coding features is feasible. The model predicts an unexpected abundance of variants that create de novo splice sites, an observation supported by both simulations and empirical data from RNA-seq experiments. While these de novo splice variants are commonly misinterpreted by other tools as coding or noncoding variants of little or no effect, we find that in some cases they can have large effects on splicing activity and protein products, and we propose that they may commonly act as cryptic factors in disease. The software is available from geneprediction.org/SGRF. Contact: bmajoros@duke.edu. Supplementary information is available at Bioinformatics online.
Joint amalgamation of most parsimonious reconciled gene trees
Scornavacca, Celine; Jacox, Edwin; Szöllősi, Gergely J.
2015-01-01
Motivation: Traditionally, gene phylogenies have been reconstructed solely on the basis of molecular sequences; this, however, often does not provide enough information to distinguish between statistically equivalent relationships. To address this problem, several recent methods have incorporated information on the species phylogeny in gene tree reconstruction, leading to dramatic improvements in accuracy. Probabilistic methods are able to estimate all model parameters but are computationally expensive, whereas parsimony methods, which are generally computationally more efficient, require a prior estimate of parameters and of the statistical support. Results: Here, we present the Tree Estimation using Reconciliation (TERA) algorithm, a parsimony-based, species-tree-aware method for gene tree reconstruction based on a scoring scheme combining duplication, transfer and loss costs with an estimate of the sequence likelihood. TERA explores all reconciled gene trees that can be amalgamated from a sample of gene trees. Using a large-scale simulated dataset, we demonstrate that TERA achieves the same accuracy as the corresponding probabilistic method while being faster, and outperforms other parsimony-based methods in both accuracy and speed. Running TERA on a set of 1099 homologous gene families from complete cyanobacterial genomes, we find that incorporating knowledge of the species tree results in a two-thirds reduction in the number of apparent transfer events. Availability and implementation: The algorithm is implemented in our program TERA, which is freely available from http://mbb.univ-montp2.fr/MBB/download_sources/16__TERA. Contact: celine.scornavacca@univ-montp2.fr, ssolo@angel.elte.hu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25380957
Shan, Shengshuai; He, Xiaoxiao; He, Lin; Wang, Min; Liu, Chengyun
2017-08-19
The coexistence of congenital left ventricular aneurysm and abnormal cardiac trabeculation with gene mutation has not been reported previously. Here, we report a case of coexisting congenital left ventricular aneurysm and prominent left ventricular trabeculation in a patient with LIM domain binding 3 gene mutation. A 30-year-old Asian man showed paroxysmal sinus tachycardia and Q waves in an electrocardiogram health check. There were no specific findings in physical examinations and serological tests. A coronary-computed tomography angiography check showed normal coronary artery and no coronary stenosis. Both left ventricle contrast echocardiography and cardiac magnetic resonance showed rare patterns of a combination of an apical aneurysm-like out-pouching structure with a wide connection to the left ventricle and prominent left ventricular trabecular meshwork. High-throughput sequencing examinations showed a novel mutation in the LDB3 gene (c.C793>T; p.Arg265Cys). Our finding indicates that the phenotypic expression of two heart conditions, congenital left ventricular aneurysm and prominent left ventricular trabeculation, although rare, can occur simultaneously with LDB3 gene mutation. Congenital left ventricular aneurysm and prominent left ventricular trabeculation may share the same genetic background.
Mutual information and the fidelity of response of gene regulatory models
NASA Astrophysics Data System (ADS)
Tabbaa, Omar P.; Jayaprakash, C.
2014-08-01
We investigate cellular response to extracellular signals by using information theory techniques motivated by recent experiments. We present results for the steady state of the following gene regulatory models found in both prokaryotic and eukaryotic cells: a linear transcription-translation model and a positive or negative auto-regulatory model. We calculate both the information capacity and the mutual information exactly for simple models and approximately for the full model. We find that (1) small changes in mutual information can lead to potentially important changes in cellular response and (2) there are diminishing returns in the fidelity of response as the mutual information increases. We calculate the information capacity using Gillespie simulations of a model for the TNF-α-NF-κB network and find good agreement with the measured value for an experimental realization of this network. Our results provide a quantitative understanding of the differences in cellular response when comparing experimentally measured mutual information values of different gene regulatory models. Our calculations demonstrate that Gillespie simulations can be used to compute the mutual information of more complex gene regulatory models, providing a potentially useful tool in synthetic biology.
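As an illustration of the central quantity in this abstract, the following minimal sketch computes the mutual information between a discrete signal and a discrete response from a joint probability table. The function name and the two-state example tables are hypothetical and are not taken from the paper; the paper's own calculations rely on analytical results and Gillespie simulations, which are not reproduced here.

```python
import math


def mutual_information(joint):
    """Mutual information I(S;R) in bits from a joint probability table.

    joint[i][j] is P(signal = i, response = j); entries should sum to 1.
    """
    p_s = [sum(row) for row in joint]          # marginal over signals
    p_r = [sum(col) for col in zip(*joint)]    # marginal over responses
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0.0:                        # 0 * log(0) is taken as 0
                mi += p * math.log2(p / (p_s[i] * p_r[j]))
    return mi


# Hypothetical two-state examples (illustrative values, not data).
perfect = [[0.5, 0.0], [0.0, 0.5]]             # response determined by signal
independent = [[0.25, 0.25], [0.25, 0.25]]     # response ignores the signal
noisy = [[0.4, 0.1], [0.1, 0.4]]               # partially reliable response

for name, table in [("perfect", perfect),
                    ("independent", independent),
                    ("noisy", noisy)]:
    print(name, round(mutual_information(table), 3))
```

For a binary signal, one bit is the ceiling: a perfectly reliable response reaches it, an independent response gives zero, and a noisy response falls in between.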
Tsai, Yu-Shuen; Aguan, Kripamoy; Pal, Nikhil R.; Chung, I-Fang
2011-01-01
Informative genes from microarray data can be used to construct prediction models and investigate biological mechanisms. Differentially expressed genes, the main targets of most gene selection methods, can be classified as single- and multiple-class specific signature genes. Here, we present a novel gene selection algorithm based on a Group Marker Index (GMI), which is intuitive, of low computational complexity, and efficient in identifying both types of genes. Most gene selection methods identify only single-class specific signature genes and cannot identify multiple-class specific signature genes easily. Our algorithm can detect de novo certain conditions of multiple-class specificity of a gene and makes use of a novel non-parametric indicator to assess the discrimination ability between classes. Our method is effective even when the sample size is small as well as when the class sizes are significantly different. To compare effectiveness and robustness, we formulate an intuitive template-based method and use four well-known datasets. We demonstrate that our algorithm outperforms the template-based method in difficult cases with unbalanced distribution. Moreover, the multiple-class specific genes are good biomarkers and play important roles in biological pathways. Our literature survey supports the conclusion that the proposed method identifies unique multiple-class specific marker genes (not reported earlier to be related to cancer) in the Central Nervous System data. It also discovers unique biomarkers indicating the intrinsic difference between subtypes of lung cancer. We also associate the pathway information with the multiple-class specific signature genes and cross-reference to published studies. We find that the identified genes participate in the pathways directly involved in cancer development in leukemia data.
Our method offers a promising way to find genes that may be involved in pathways of multiple diseases, and hence opens up the possibility of using an existing drug on other diseases as well as designing a single drug for multiple diseases. PMID:21909426
Computational dissection of human episodic memory reveals mental process-specific genetic profiles
Luksys, Gediminas; Fastenrath, Matthias; Coynel, David; Freytag, Virginie; Gschwind, Leo; Heck, Angela; Jessen, Frank; Maier, Wolfgang; Milnik, Annette; Riedel-Heller, Steffi G.; Scherer, Martin; Spalek, Klara; Vogler, Christian; Wagner, Michael; Wolfsgruber, Steffen; Papassotiropoulos, Andreas; de Quervain, Dominique J.-F.
2015-01-01
Episodic memory performance is the result of distinct mental processes, such as learning, memory maintenance, and emotional modulation of memory strength. Such processes can be effectively dissociated using computational models. Here we performed gene set enrichment analyses of model parameters estimated from the episodic memory performance of 1,765 healthy young adults. We report robust and replicated associations of the amine compound SLC (solute-carrier) transporters gene set with the learning rate, of the collagen formation and transmembrane receptor protein tyrosine kinase activity gene sets with the modulation of memory strength by negative emotional arousal, and of the L1 cell adhesion molecule (L1CAM) interactions gene set with the repetition-based memory improvement. Furthermore, in a large functional MRI sample of 795 subjects we found that the association between L1CAM interactions and memory maintenance revealed large clusters of differences in brain activity in frontal cortical areas. Our findings provide converging evidence that distinct genetic profiles underlie specific mental processes of human episodic memory. They also provide empirical support to previous theoretical and neurobiological studies linking specific neuromodulators to the learning rate and linking neural cell adhesion molecules to memory maintenance. Furthermore, our study suggests additional memory-related genetic pathways, which may contribute to a better understanding of the neurobiology of human memory. PMID:26261317
Kozlov, Konstantin N.; Kulakovskiy, Ivan V.; Zubair, Asif; Marjoram, Paul; Lawrie, David S.; Nuzhdin, Sergey V.; Samsonova, Maria G.
2017-01-01
Annotating the genotype-phenotype relationship, and developing a proper quantitative description of the relationship, requires understanding the impact of natural genomic variation on gene expression. We apply a sequence-level model of gap gene expression in the early development of Drosophila to analyze single nucleotide polymorphisms (SNPs) in a panel of natural sequenced D. melanogaster lines. Using a thermodynamic modeling framework, we provide both analytical and computational descriptions of how single-nucleotide variants affect gene expression. The analysis reveals that the sequence variants increase (decrease) gene expression if located within binding sites of repressors (activators). We show that the sign of SNP influence (activation or repression) may change in time and space and elucidate the origin of this change in specific examples. The thermodynamic modeling approach predicts non-local and non-linear effects arising from SNPs, and combinations of SNPs, in individual fly genotypes. Simulation of individual fly genotypes using our model reveals that this non-linearity reduces to almost additive inputs from multiple SNPs. Further, we see signatures of the action of purifying selection in the gap gene regulatory regions. To infer the specific targets of purifying selection, we analyze the patterns of polymorphism in the data at two phenotypic levels: the strengths of binding and expression. We find that combinations of SNPs show evidence of being under selective pressure, while individual SNPs do not. The model predicts that SNPs appear to accumulate in the genotypes of the natural population in a way biased towards small increases in activating action on the expression pattern. Taken together, these results provide a systems-level view of how genetic variation translates to the level of gene regulatory networks via combinatorial SNP effects. PMID:28898266
Integrative Functional Genomics for Systems Genetics in GeneWeaver.org.
Bubier, Jason A; Langston, Michael A; Baker, Erich J; Chesler, Elissa J
2017-01-01
The abundance of existing functional genomics studies permits an integrative approach to interpreting and resolving the results of diverse systems genetics studies. However, a major challenge lies in assembling and harmonizing heterogeneous data sets across species for facile comparison to the positional candidate genes and coexpression networks that come from systems genetic studies. GeneWeaver is an online database and suite of tools at www.geneweaver.org that allows for fast aggregation and analysis of gene set-centric data. GeneWeaver contains curated experimental data together with resource-level data such as GO annotations, MP annotations, and KEGG pathways, along with persistent stores of user-entered data sets. These can be entered directly into GeneWeaver or transferred from widely used resources such as GeneNetwork.org. Data are analyzed using statistical tools and advanced graph algorithms to discover new relations, prioritize candidate genes, and generate function hypotheses. Here we use GeneWeaver to find genes common to multiple gene sets, prioritize candidate genes from a quantitative trait locus, and characterize a set of differentially expressed genes. Coupling a large multispecies repository of curated and empirical functional genomics data to fast computational tools allows for the rapid integrative analysis of heterogeneous data for interpreting and extrapolating systems genetics results.
A Poisson Log-Normal Model for Constructing Gene Covariation Network Using RNA-seq Data.
Choi, Yoonha; Coram, Marc; Peng, Jie; Tang, Hua
2017-07-01
Constructing expression networks using transcriptomic data is an effective approach for studying gene regulation. A popular approach for constructing such a network is based on the Gaussian graphical model (GGM), in which an edge between a pair of genes indicates that the expression levels of these two genes are conditionally dependent, given the expression levels of all other genes. However, GGMs are not appropriate for non-Gaussian data, such as those generated in RNA-seq experiments. We propose a novel statistical framework that maximizes a penalized likelihood, in which the observed count data follow a Poisson log-normal distribution. To overcome the computational challenges, we use Laplace's method to approximate the likelihood and its gradients, and apply the alternating directions method of multipliers to find the penalized maximum likelihood estimates. The proposed method is evaluated and compared with GGMs using both simulated and real RNA-seq data. The proposed method shows improved performance in detecting edges that represent covarying pairs of genes, particularly for edges connecting low-abundant genes and edges around regulatory hubs.
A comparative analysis of soft computing techniques for gene prediction.
Goel, Neelam; Singh, Shailendra; Aseri, Trilok Chand
2013-07-01
The rapid growth of genomic sequence data for both human and nonhuman species has made the analysis of these sequences, especially the prediction of genes in them, very important, and it is currently the focus of many research efforts. Besides its scientific interest to the molecular biology and genomics community, gene prediction is of considerable importance in human health and medicine. A variety of gene prediction techniques have been developed for eukaryotes over the past few years. This article reviews and analyzes the application of certain soft computing techniques in gene prediction. First, the problem of gene prediction and its challenges are described. These are followed by different soft computing techniques along with their application to gene prediction. In addition, a comparative analysis of different soft computing techniques for gene prediction is given. Finally, some limitations of current research activities and future research directions are provided.
Dictionary-driven prokaryotic gene finding
Shibuya, Tetsuo; Rigoutsos, Isidore
2002-01-01
Gene identification, also known as gene finding or gene recognition, is among the important problems of molecular biology that have been receiving increasing attention with the advent of large scale sequencing projects. Previous strategies for solving this problem can be categorized into essentially two schools of thought: one school employs sequence composition statistics, whereas the other relies on database similarity searches. In this paper, we propose a new gene identification scheme that combines the best characteristics from each of these two schools. In particular, our method determines gene candidates among the ORFs that can be identified in a given DNA strand through the use of the Bio-Dictionary, a database of patterns that covers essentially all of the currently available sample of the natural protein sequence space. Our approach relies entirely on the use of redundant patterns as the agents on which the presence or absence of genes is predicated and does not employ any additional evidence, e.g. ribosome-binding site signals. The Bio-Dictionary Gene Finder (BDGF), the algorithm’s implementation, is a single computational engine able to handle the gene identification task across distinct archaeal and bacterial genomes. The engine exhibits performance that is characterized by simultaneous very high values of sensitivity and specificity, and a high percentage of correctly predicted start sites. Using a collection of patterns derived from an old (June 2000) release of the Swiss-Prot/TrEMBL database that contained 451 602 proteins and fragments, we demonstrate our method’s generality and capabilities through an extensive analysis of 17 complete archaeal and bacterial genomes. Examples of previously unreported genes are also shown and discussed in detail. PMID:12060689
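The abstract describes choosing gene candidates among the ORFs of a DNA strand. The sketch below covers only that first step, enumerating ORFs on a single strand; it is not the BDGF implementation, and the pattern-based scoring against the Bio-Dictionary is not reproduced. The function name, the `min_codons` parameter, the restriction to ATG start codons, and the example sequence are all assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return sorted (start, end) index pairs of ORFs on the given strand.

    An ORF here runs from an ATG to the first in-frame stop codon.
    `end` is exclusive and includes the stop codon. `min_codons` counts
    the codons before the stop (hypothetical filter, not from the paper).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # close it and keep scanning
    return sorted(orfs)


# Hypothetical toy sequence: one ORF (ATG AAA TTT TAG) starting at index 2.
example = "CCATGAAATTTTAGGG"
print(find_orfs(example))
```

A full gene finder would also scan the reverse complement and consider alternative start codons; both are omitted to keep the sketch short.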
Alam, Tanvir; Medvedeva, Yulia A.; Jia, Hui; ...
2014-10-02
Transcriptional regulation of protein-coding genes is increasingly well-understood on a global scale, yet no comparable information exists for long non-coding RNA (lncRNA) genes, which were recently recognized to be as numerous as protein-coding genes in mammalian genomes. We performed a genome-wide comparative analysis of the promoters of human lncRNA and protein-coding genes, finding global differences in specific genetic and epigenetic features relevant to transcriptional regulation. These two groups of genes are hence subject to separate transcriptional regulatory programs, including distinct transcription factor (TF) proteins that significantly favor lncRNA, rather than coding-gene, promoters. We report a specific signature of promoter-proximal transcriptional regulation of lncRNA genes, including several distinct transcription factor binding sites (TFBS). Experimental DNase I hypersensitive site profiles are consistent with active configurations of these lncRNA TFBS sets in diverse human cell types. TFBS ChIP-seq datasets confirm the binding events that we predicted using computational approaches for a subset of factors. For several TFs known to be directly regulated by lncRNAs, we find that their putative TFBSs are enriched at lncRNA promoters, suggesting that the TFs and the lncRNAs may participate in a bidirectional feedback loop regulatory network. Accordingly, cells may be able to modulate lncRNA expression levels independently of mRNA levels via distinct regulatory pathways. Our results also raise the possibility that, given the historical reliance on protein-coding gene catalogs to define the chromatin states of active promoters, a revision of these chromatin signature profiles to incorporate expressed lncRNA genes is warranted in the future.
Phenotypic Robustness and the Assortativity Signature of Human Transcription Factor Networks
Pechenick, Dov A.; Payne, Joshua L.; Moore, Jason H.
2014-01-01
Many developmental, physiological, and behavioral processes depend on the precise expression of genes in space and time. Such spatiotemporal gene expression phenotypes arise from the binding of sequence-specific transcription factors (TFs) to DNA, and from the regulation of nearby genes that such binding causes. These nearby genes may themselves encode TFs, giving rise to a transcription factor network (TFN), wherein nodes represent TFs and directed edges denote regulatory interactions between TFs. Computational studies have linked several topological properties of TFNs — such as their degree distribution — with the robustness of a TFN's gene expression phenotype to genetic and environmental perturbation. Another important topological property is assortativity, which measures the tendency of nodes with similar numbers of edges to connect. In directed networks, assortativity comprises four distinct components that collectively form an assortativity signature. We know very little about how a TFN's assortativity signature affects the robustness of its gene expression phenotype to perturbation. While recent theoretical results suggest that increasing one specific component of a TFN's assortativity signature leads to increased phenotypic robustness, the biological context of this finding is currently limited because the assortativity signatures of real-world TFNs have not been characterized. It is therefore unclear whether these earlier theoretical findings are biologically relevant. Moreover, it is not known how the other three components of the assortativity signature contribute to the phenotypic robustness of TFNs. Here, we use publicly available DNaseI-seq data to measure the assortativity signatures of genome-wide TFNs in 41 distinct human cell and tissue types. We find that all TFNs share a common assortativity signature and that this signature confers phenotypic robustness to model TFNs. 
Lastly, we determine the extent to which each of the four components of the assortativity signature contributes to this robustness. PMID:25121490
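The four components of a directed network's assortativity signature can be sketched as follows. The abstract does not spell out the computation, so this example assumes the common definition: for each pair (a, b) with a, b in {out, in}, the Pearson correlation, taken over all edges, between the a-degree of the source node and the b-degree of the target node. The function names and the toy edge list are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns NaN if either variable is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


def assortativity_signature(edges):
    """Map (a, b) -> correlation of source a-degree with target b-degree."""
    out_deg, in_deg = {}, {}
    for u, v in edges:
        out_deg[u] = out_deg.get(u, 0) + 1
        in_deg[v] = in_deg.get(v, 0) + 1
    deg = {"out": out_deg, "in": in_deg}
    signature = {}
    for a in ("out", "in"):
        for b in ("out", "in"):
            xs = [deg[a].get(u, 0) for u, _ in edges]   # source a-degree
            ys = [deg[b].get(v, 0) for _, v in edges]   # target b-degree
            signature[(a, b)] = pearson(xs, ys)
    return signature


# Hypothetical four-edge regulatory network (TF "A" regulates "B", etc.).
toy_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]
toy_signature = assortativity_signature(toy_edges)
for key, value in toy_signature.items():
    print(key, round(value, 3))
```

Each component lies between -1 and 1; positive values mean that high-degree sources tend to point at high-degree targets for that degree pairing.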
Mining subspace clusters from DNA microarray data using large itemset techniques.
Chang, Ye-In; Chen, Jiun-Rung; Tsai, Yueh-Chi
2009-05-01
Mining subspace clusters from DNA microarrays could help researchers identify genes that commonly contribute to a disease, where a subspace cluster indicates a subset of genes whose expression levels are similar under a subset of conditions. Since the number of genes in a DNA microarray is far larger than the number of conditions, previously proposed algorithms that compute the maximum dimension sets (MDSs) for any two genes take a long time to mine subspace clusters. In this article, we propose the Large Itemset-Based Clustering (LISC) algorithm for mining subspace clusters. Instead of constructing MDSs for any two genes, we construct MDSs only for any two conditions. Then, we transform the task of finding the maximal possible gene sets into the problem of mining large itemsets from the condition-pair MDSs. Since we are only interested in subspace clusters with gene sets as large as possible, it is desirable to pay attention to gene sets that have reasonably large support values in the condition-pair MDSs. Our simulation results show that the proposed algorithm needs shorter processing time than previously proposed algorithms that need to construct gene-pair MDSs.
DOSE: an R/Bioconductor package for disease ontology semantic and enrichment analysis.
Yu, Guangchuang; Wang, Li-Gen; Yan, Guang-Rong; He, Qing-Yu
2015-02-15
Disease ontology (DO) annotates human genes in the context of disease. DO is an important annotation resource for translating molecular findings from high-throughput data into clinical relevance. DOSE is an R package providing semantic similarity computations among DO terms and genes, which allows biologists to explore the similarities of diseases and of gene functions from a disease perspective. Enrichment analyses, including a hypergeometric model and gene set enrichment analysis, are also implemented to support discovering disease associations of high-throughput biological data. This allows biologists to verify disease relevance in a biological experiment and identify unexpected disease associations. Comparison among gene clusters is also supported. DOSE is released under the Artistic-2.0 License. The source code and documents are freely available through Bioconductor (http://www.bioconductor.org/packages/release/bioc/html/DOSE.html). Supplementary data are available at Bioinformatics online. Contact: gcyu@connect.hku.hk or tqyhe@jnu.edu.cn.
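The hypergeometric model mentioned in this abstract can be illustrated with a short over-representation test. This is a generic sketch in Python, not DOSE's R interface; the function name, argument names, and the gene counts in the example are hypothetical.

```python
from math import comb


def hypergeom_enrichment_pvalue(universe, annotated, selected, overlap):
    """Upper-tail hypergeometric p-value P(X >= overlap).

    universe:  number of genes in the background set
    annotated: background genes annotated to the disease term
    selected:  size of the gene list under study
    overlap:   genes in the list that carry the annotation
    """
    total = comb(universe, selected)
    upper = min(annotated, selected)
    tail = sum(
        comb(annotated, k) * comb(universe - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 10 background genes, 3 annotated to a term,
# 3 genes selected, and all 3 selected genes carry the annotation.
p_small = hypergeom_enrichment_pvalue(10, 3, 3, 3)
print(p_small)
```

In practice one such p-value is computed per ontology term and the results are corrected for multiple testing; that step is omitted here.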
Integrating Candida albicans metabolism with biofilm heterogeneity by transcriptome mapping
NASA Astrophysics Data System (ADS)
Rajendran, Ranjith; May, Ali; Sherry, Leighann; Kean, Ryan; Williams, Craig; Jones, Brian L.; Burgess, Karl V.; Heringa, Jaap; Abeln, Sanne; Brandt, Bernd W.; Munro, Carol A.; Ramage, Gordon
2016-10-01
Candida albicans biofilm formation is an important virulence factor in the pathogenesis of disease, a characteristic which has been shown to be heterogeneous in clinical isolates. Using an unbiased computational approach we investigated the central metabolic pathways driving biofilm heterogeneity. Transcripts from high (HBF) and low (LBF) biofilm forming isolates were analysed by RNA sequencing, with 6312 genes identified as expressed in these two phenotypes. With a dedicated computational approach we identified and validated a significantly differentially expressed subnetwork of genes associated with these biofilm phenotypes. Our analysis revealed that amino acid metabolism pathways, such as arginine, proline, aspartate and glutamate metabolism, were predominantly upregulated in the HBF phenotype. In contrast, purine, starch and sucrose metabolism were generally upregulated in the LBF phenotype. The aspartate aminotransferase gene AAT1 was found to be a common member of these amino acid pathways and significantly upregulated in the HBF phenotype. Pharmacological inhibition of AAT1 enzyme activity significantly reduced biofilm formation in a dose-dependent manner. Collectively, these findings provide evidence that biofilm phenotype is associated with differential regulation of metabolic pathways. Understanding and targeting such pathways, such as amino acid metabolism, is potentially useful for developing diagnostics and new antifungals to treat biofilm-based infections.
Kimura, Shuhei; Sato, Masanao; Okada-Hatakeyama, Mariko
2013-01-01
The inference of a genetic network is a problem in which mutual interactions among genes are inferred from time-series of gene expression levels. While a number of models have been proposed to describe genetic networks, this study focuses on a mathematical model proposed by Vohradský. Because of its advantageous features, several researchers have proposed inference methods based on Vohradský's model. When trying to analyze large-scale networks consisting of dozens of genes, however, these methods must solve high-dimensional non-linear function optimization problems. In order to resolve the difficulty of estimating the parameters of Vohradský's model, this study proposes a new method that defines the problem as several two-dimensional function optimization problems. Through numerical experiments on artificial genetic network inference problems, we showed that, although the computation time of the proposed method is not the shortest, the method is able to estimate the parameters of Vohradský's models more effectively with sufficiently short computation times. This study then applied the proposed method to an actual inference problem of the bacterial SOS DNA repair system, and succeeded in finding several reasonable regulations. PMID:24386175
Aziz, Ramy K; Monk, Jonathan M; Andrews, Kathleen A; Nhan, Jenny; Khaw, Valerie L; Wong, Hesper; Palsson, Bernhard O; Charusanti, Pep
2017-01-01
Most Escherichia coli strains are naturally unable to grow on 1,2-propanediol (PDO) as a sole carbon source. Recently, however, a K-12 descendant E. coli strain was evolved to grow on 1,2-PDO, and it was hypothesized that this evolved ability was dependent on the aldehyde dehydrogenase, AldA, which is highly conserved among members of the family Enterobacteriaceae. To test this hypothesis, we first performed computational model simulation, which confirmed the essentiality of the aldA gene for 1,2-PDO utilization by the evolved PDO-degrading E. coli. Next, we deleted the aldA gene from the evolved strain, and this deletion was sufficient to abolish the evolved phenotype. On re-introducing the gene on a plasmid, the evolved phenotype was restored. These findings provide experimental evidence for the computationally predicted role of AldA in 1,2-PDO utilization, and represent a good example of E. coli robustness, demonstrated by the bacterial deployment of a generalist enzyme (here AldA) in multiple pathways to survive carbon starvation and to grow on a non-native substrate when no native carbon source is available. Copyright © 2016 Elsevier GmbH. All rights reserved.
Computational and Organotypic Modeling of Microcephaly ...
Microcephaly is associated with reduced cortical surface area and ventricular dilations. Many genetic and environmental factors precipitate this malformation, including prenatal alcohol exposure and maternal Zika infection. This complexity motivates the engineering of computational and experimental models to probe the underlying molecular targets, cellular consequences, and biological processes. We describe an Adverse Outcome Pathway (AOP) framework for microcephaly derived from literature on all gene-, chemical-, or viral- effects and brain development. Overlap with NTDs is likely, although the AOP connections identified here focused on microcephaly as the adverse outcome. A query of the Mammalian Phenotype Browser database for ‘microcephaly’ (MP:0000433) returned 85 gene associations; several function in microtubule assembly and centrosome cycle regulated by (microcephalin, MCPH1), a gene for primary microcephaly in humans. The developing ventricular zone is the likely target. In this zone, neuroprogenitor cells (NPCs) self-replicate during the 1st trimester setting brain size, followed by neural differentiation of the neocortex. Recent studies with human NPCs confirmed infectivity with Zika virions invoking critical cell loss (apoptosis) of precursor NPCs; similar findings have been shown with fetal alcohol or methylmercury exposure in rodent studies, leading to mathematical models of NPC dynamics in size determination of the ventricular zone. A key event
Arguello Casteleiro, Mercedes; Demetriou, George; Read, Warren; Fernandez Prieto, Maria Jesus; Maroto, Nava; Maseda Fernandez, Diego; Nenadic, Goran; Klein, Julie; Keane, John; Stevens, Robert
2018-04-12
Automatic identification of term variants or acceptable alternative free-text terms for gene and protein names from the millions of biomedical publications is a challenging task. Ontologies, such as the Cardiovascular Disease Ontology (CVDO), capture domain knowledge in a computational form and can provide context for gene/protein names as written in the literature. This study investigates: 1) if word embeddings from Deep Learning algorithms can provide a list of term variants for a given gene/protein of interest; and 2) if biological knowledge from the CVDO can improve such a list without modifying the word embeddings created. We have manually annotated 105 gene/protein names from 25 PubMed titles/abstracts and mapped them to 79 unique UniProtKB entries corresponding to gene and protein classes from the CVDO. Using more than 14 M PubMed articles (titles and available abstracts), word embeddings were generated with CBOW and Skip-gram. We set up two experiments for a synonym detection task, each with four raters, and 3672 pairs of terms (target term and candidate term) from the word embeddings created. For Experiment I, the target terms for 64 UniProtKB entries were those that appear in the titles/abstracts; Experiment II involves 63 UniProtKB entries and the target terms are a combination of terms from PubMed titles/abstracts with terms (i.e. increased context) from the CVDO protein class expressions and labels. In Experiment I, Skip-gram finds term variants (full and/or partial) for 89% of the 64 UniProtKB entries, while CBOW finds term variants for 67%. In Experiment II (with the aid of the CVDO), Skip-gram finds term variants for 95% of the 63 UniProtKB entries, while CBOW finds term variants for 78%. Combining the results of both experiments, Skip-gram finds term variants for 97% of the 79 UniProtKB entries, while CBOW finds term variants for 81%.
This study shows performance improvements for both CBOW and Skip-gram on a gene/protein synonym detection task by adding knowledge formalised in the CVDO and without modifying the word embeddings created. Hence, the CVDO supplies context that is effective in inducing term variability for both CBOW and Skip-gram while reducing ambiguity. Skip-gram outperforms CBOW and finds more pertinent term variants for gene/protein names annotated from the scientific literature.
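Candidate term variants are typically read off a trained embedding by ranking vocabulary terms by cosine similarity to the target term. The sketch below shows that ranking step only, on a hand-made three-dimensional "embedding"; the vectors, the term names, and the helper names cosine and top_candidates are hypothetical and do not come from the study, whose embeddings were trained on PubMed text with CBOW and Skip-gram.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_candidates(target, embeddings, n=3):
    """Return the n terms most similar to `target`, most similar first."""
    target_vec = embeddings[target]
    scored = [(cosine(target_vec, vec), term)
              for term, vec in embeddings.items() if term != target]
    scored.sort(reverse=True)
    return [term for _, term in scored[:n]]

# Toy vectors, invented for illustration only.
toy = {
    "il6": [0.9, 0.1, 0.0],
    "interleukin-6": [0.8, 0.2, 0.1],
    "tnf": [0.1, 0.9, 0.2],
    "heart": [0.0, 0.1, 0.9],
}
candidates = top_candidates("il6", toy, n=2)
print(candidates)
```

In the setting described above, human raters would then judge which of the ranked candidates are genuine term variants of the target gene/protein name.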
g:Profiler-a web server for functional interpretation of gene lists (2016 update).
Reimand, Jüri; Arak, Tambet; Adler, Priit; Kolberg, Liis; Reisberg, Sulev; Peterson, Hedi; Vilo, Jaak
2016-07-08
Functional enrichment analysis is a key step in interpreting gene lists discovered in diverse high-throughput experiments. g:Profiler studies flat and ranked gene lists and finds statistically significant Gene Ontology terms, pathways and other gene function related terms. Translation of hundreds of gene identifiers is another core feature of g:Profiler. Since its first publication in 2007, our web server has become a popular tool of choice among basic and translational researchers. Timeliness is a major advantage of g:Profiler as genome and pathway information is synchronized with the Ensembl database in quarterly updates. g:Profiler supports 213 species including mammals and other vertebrates, plants, insects and fungi. The 2016 update of g:Profiler introduces several novel features. We have added further functional datasets to interpret gene lists, including transcription factor binding site predictions, Mendelian disease annotations, information about protein expression and complexes and gene mappings of human genetic polymorphisms. Besides the interactive web interface, g:Profiler can be accessed in computational pipelines using our R package, Python interface and BioJS component. g:Profiler is freely available at http://biit.cs.ut.ee/gprofiler/. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
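The core calculation behind functional enrichment of a flat gene list is often an over-representation test. The sketch below is a generic one-sided hypergeometric test, not g:Profiler's own implementation: the abstract does not describe the statistics or the multiple-testing correction the server applies, and the function name enrichment_p and all counts are assumptions for illustration.

```python
from math import comb

def enrichment_p(N, K, n, k):
    """One-sided hypergeometric p-value P(X >= k), where
    N = genes in the background, K = background genes annotated to the term,
    n = genes in the query list, k = query genes annotated to the term."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

# Made-up example: 5 of 50 query genes carry a term that annotates
# 200 of 20000 background genes (0.5 such genes expected by chance).
p = enrichment_p(N=20000, K=200, n=50, k=5)
print(p)
```

A real analysis tests many terms at once, so the raw p-values would still need a multiple-testing correction before being called significant.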
Possible linkage of non-syndromic cleft lip and palate to the MSX1 homeobox gene on chromosome 4p
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wang, S.; Walczak, C.; Erickson, R.P.
1994-09-01
The MSX1 (HOX7) gene has been shown recently to cause cleft palate in a mouse model deficient for its product. Several features of this mouse model make the human homolog of this gene an excellent candidate for non-syndromic cleft palate. We tested this hypothesis by linkage studies in two large multiplex human families using a microsatellite marker in the human MSX1 gene. A LOD score of 1.7 was obtained maximizing at a recombination fraction of 0.09. Computer simulation power calculations using the program SIMLINK indicated that a LOD score this large is expected to occur only about 1/200 times by chance alone for a marker locus with comparable informativeness if unlinked to the disease gene. This suggestive finding is being followed up by attempts to recruit and study additional families and by DNA sequence analyses of the MSX1 gene in these families and other cleft lip and/or cleft palate subjects and these further results will also be reported.
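For readers unfamiliar with LOD scores, the sketch below computes the two-point LOD for the simplest textbook case of fully informative, phase-known meioses: LOD(theta) = log10[theta^R (1 - theta)^(N - R) / 0.5^N], with R recombinants among N meioses. The counts used are hypothetical and are not the data from the two families in the study; pedigree-based programs handle unknown phase and incomplete penetrance, which this sketch ignores.

```python
import math

def lod(theta, recombinants, meioses):
    """Two-point LOD score at recombination fraction theta (0 < theta <= 0.5)."""
    nonrecombinants = meioses - recombinants
    log_l_linked = (recombinants * math.log10(theta)
                    + nonrecombinants * math.log10(1.0 - theta))
    log_l_unlinked = meioses * math.log10(0.5)   # theta = 0.5 means no linkage
    return log_l_linked - log_l_unlinked

def max_lod(recombinants, meioses, grid=None):
    """Return (best LOD, theta at which it occurs) over a grid of theta values."""
    grid = grid or [i / 100 for i in range(1, 51)]
    return max((lod(t, recombinants, meioses), t) for t in grid)

# Hypothetical data: 1 recombinant in 10 informative meioses.
best_score, best_theta = max_lod(recombinants=1, meioses=10)
print(round(best_score, 3), best_theta)
```

By the conventional threshold a LOD of 3 is taken as significant evidence of linkage, which is why a score below that is described as suggestive.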
An efficient method to identify differentially expressed genes in microarray experiments
Qin, Huaizhen; Feng, Tao; Harding, Scott A.; Tsai, Chung-Jui; Zhang, Shuanglin
2013-01-01
Motivation: Microarray experiments typically analyze thousands to tens of thousands of genes from small numbers of biological replicates. The fact that genes are normally expressed in functionally relevant patterns suggests that gene-expression data can be stratified and clustered into relatively homogeneous groups. Cluster-wise dimensionality reduction should make it feasible to improve screening power while minimizing information loss. Results: We propose a powerful and computationally simple method for finding differentially expressed genes in small microarray experiments. The method incorporates a novel stratification-based tight clustering algorithm, principal component analysis and information pooling. Comprehensive simulations show that our method is substantially more powerful than the popular SAM and eBayes approaches. We applied the method to three real microarray datasets: one from a Populus nitrogen stress experiment with 3 biological replicates; and two from public microarray datasets of human cancers with 10 to 40 biological replicates. In all three analyses, our method proved more robust than the popular alternatives for identification of differentially expressed genes. Availability: The C++ code to implement the proposed method is available upon request for academic use. PMID:18453554
Wu, Zheyang; Zhao, Hongyu
2012-01-01
For more fruitful discoveries of genetic variants associated with diseases in genome-wide association studies, it is important to know whether joint analysis of multiple markers is more powerful than the commonly used single-marker analysis, especially in the presence of gene-gene interactions. This article provides a statistical framework to rigorously address this question through analytical power calculations for common model search strategies to detect binary trait loci: marginal search, exhaustive search, forward search, and two-stage screening search. Our approach incorporates linkage disequilibrium, random genotypes, and correlations among score test statistics of logistic regressions. We derive analytical results under two power definitions: the power of finding all the associated markers and the power of finding at least one associated marker. We also consider two types of error controls: the discovery number control and the Bonferroni type I error rate control. After demonstrating the accuracy of our analytical results by simulations, we apply them to consider a broad genetic model space to investigate the relative performances of different model search strategies. Our analytical study provides rapid computation as well as insights into the statistical mechanism of capturing genetic signals under different genetic models including gene-gene interactions. Even though we focus on genetic association analysis, our results on the power of model selection procedures are clearly very general and applicable to other studies.
From single-cell to cell-pool transcriptomes: stochasticity in gene expression and RNA splicing.
Marinov, Georgi K; Williams, Brian A; McCue, Ken; Schroth, Gary P; Gertz, Jason; Myers, Richard M; Wold, Barbara J
2014-03-01
Single-cell RNA-seq mammalian transcriptome studies are at an early stage in uncovering cell-to-cell variation in gene expression, transcript processing and editing, and regulatory module activity. Despite great progress recently, substantial challenges remain, including discriminating biological variation from technical noise. Here we apply the SMART-seq single-cell RNA-seq protocol to study the reference lymphoblastoid cell line GM12878. By using spike-in quantification standards, we estimate the absolute number of RNA molecules per cell for each gene and find significant variation in total mRNA content: between 50,000 and 300,000 transcripts per cell. We directly measure technical stochasticity by a pool/split design and find that there are significant differences in expression between individual cells, over and above technical variation. Specific gene coexpression modules were preferentially expressed in subsets of individual cells, including one enriched for mRNA processing and splicing factors. We assess cell-to-cell variation in alternative splicing and allelic bias and report evidence of significant differences in splice site usage that exceed splice variation in the pool/split comparison. Finally, we show that transcriptomes from small pools of 30-100 cells approach the information content and reproducibility of contemporary RNA-seq from large amounts of input material. Together, our results define an experimental and computational path forward for analyzing gene expression in rare cell types and cell states.
Wu, Zheyang; Zhao, Hongyu
2013-01-01
For more fruitful discoveries of genetic variants associated with diseases in genome-wide association studies, it is important to know whether joint analysis of multiple markers is more powerful than the commonly used single-marker analysis, especially in the presence of gene-gene interactions. This article provides a statistical framework to rigorously address this question through analytical power calculations for common model search strategies to detect binary trait loci: marginal search, exhaustive search, forward search, and two-stage screening search. Our approach incorporates linkage disequilibrium, random genotypes, and correlations among score test statistics of logistic regressions. We derive analytical results under two power definitions: the power of finding all the associated markers and the power of finding at least one associated marker. We also consider two types of error controls: the discovery number control and the Bonferroni type I error rate control. After demonstrating the accuracy of our analytical results by simulations, we apply them to consider a broad genetic model space to investigate the relative performances of different model search strategies. Our analytical study provides rapid computation as well as insights into the statistical mechanism of capturing genetic signals under different genetic models including gene-gene interactions. Even though we focus on genetic association analysis, our results on the power of model selection procedures are clearly very general and applicable to other studies. PMID:23956610
Modeling the Activity of Single Genes
NASA Technical Reports Server (NTRS)
Mjolsness, Eric; Gibson, Michael
1999-01-01
The central dogma of molecular biology states that information is stored in DNA, transcribed to messenger RNA (mRNA) and then translated into proteins. This picture is significantly augmented when we consider the action of certain proteins in regulating transcription. These transcription factors provide a feedback pathway by which genes can regulate one another's expression as mRNA and then as protein. To review: DNA, RNA and proteins have different functions. DNA is the molecular storehouse of genetic information. When cells divide, the DNA is replicated, so that each daughter cell maintains the same genetic information as the mother cell. RNA acts as a go-between from DNA to proteins. Only a single copy of DNA is present, but multiple copies of the same piece of RNA may be present, allowing cells to make huge amounts of protein. In eukaryotes (organisms with a nucleus), DNA is found in the nucleus only. RNA is copied in the nucleus then translocates (moves) outside the nucleus, where it is translated into proteins. Along the way, the RNA may be spliced, i.e., may have pieces cut out. RNA then attaches to ribosomes and is translated to proteins. Proteins are the machinery of the cell: other than DNA and RNA, all the complex molecules of the cell are proteins. Proteins are specialized machines, each of which fulfills its own task, which may be transporting oxygen, catalyzing reactions, or responding to extracellular signals, just to name a few. One of the more interesting functions a protein may have is binding directly or indirectly to DNA to perform transcriptional regulation, thus forming a closed feedback loop of gene regulation. The structure of DNA and the central dogma were understood in the 50s; in the early 80s it became possible to make arbitrary modifications to DNA and use cellular machinery to transcribe and translate the resulting genes; more recently, genomes (i.e., the complete DNA sequence) of many organisms have been sequenced.
This large-scale sequencing began with simple organisms, viruses and bacteria, progressed to eukaryotes such as yeast, and more recently (1998) progressed to a multi-cellular animal, the nematode Caenorhabditis elegans. Sequencers have now moved on to the fruit fly Drosophila melanogaster, whose sequence is slated for completion by the end of 1999. The human genome project is expected to determine the complete sequence of all 3 billion bases of human DNA within the next five years. In the wake of genome-scale sequencing, further instrumentation is being developed to assay gene expression and function on a comparably large scale. Much of the work in computational biology focuses on computational tools used in sequencing, finding genes that are related to a particular gene, finding which parts of the DNA code for proteins and which do not, understanding what proteins will be formed from a given length of DNA, predicting how the proteins will fold from a one-dimensional structure into a three-dimensional structure, and so on. Much less computational work has been done regarding the function of proteins. One reason for this is that different proteins function very differently, and so work on protein function is very specific to certain classes of proteins. There are, for example, proteins such as enzymes that catalyze various intracellular reactions, receptors that respond to extracellular signals and ion channels that regulate the flow of charged particles into and out of the cell. In this chapter, we will consider a particular class of proteins called transcription factors (TFs), which are responsible for regulating when a certain gene is expressed in a certain cell, which cells it is expressed in, and how much is expressed. Understanding these processes will involve developing a deeper understanding of transcription, translation, and the cellular processes that control those processes.
All of these elements fall under the aegis of gene regulation or more narrowly transcriptional regulation. Some of the key questions in gene regulation are: What genes are expressed in a certain cell at a certain time? How does gene expression differ from cell to cell in a multicellular organism? Which proteins act as transcription factors, i.e., are important in regulating gene expression? From questions like these, we hope to understand which genes are important for various macroscopic processes. Nearly all of the cells of a multicellular organism contain the same DNA. Yet this same genetic information yields a large number of different cell types. The fundamental difference between a neuron and a liver cell, for example, is which genes are expressed. Thus understanding gene regulation is an important step in understanding development. Furthermore, understanding the usual genes that are expressed in cells may give important clues about various diseases. Some diseases, such as sickle cell anemia and cystic fibrosis, are caused by defects in single, non-regulatory genes; others, such as certain cancers, are caused when the cellular control circuitry malfunctions - an understanding of these diseases will involve pathways of multiple interacting gene products. There are numerous challenges in the area of understanding and modeling gene regulation. First and foremost, biologists would like to develop a deeper understanding of the processes involved, including which genes and families of genes are important, how they interact, etc. From a computational point of view, there has been embarrassingly little work done. In this chapter there are many areas in which we can phrase meaningful, non-trivial computational questions that have not yet been addressed.
Some of these are purely computational (what is a good algorithm for dealing with a model of type X) and others are more mathematical (given a system with certain characteristics, what sort of model can one use? How does one find biochemical parameters from system-level behavior using as few experiments as possible?). In addition to biological and algorithmic problems, there is also the ever-present issue of theoretical biology - what general principles can be derived from these systems, what can one do with models other than just simulate time-courses, what can be deduced about a class of systems without knowing all the details? The fundamental challenge to computationalists and theorists is to add value to the biology - to use models, modeling techniques and algorithms to understand the biology in new ways.
Computational analysis of human and mouse CREB3L4 Protein
Velpula, Kiran Kumar; Rehman, Azeem Abdul; Chigurupati, Soumya; Sanam, Ramadevi; Inampudi, Krishna Kishore; Akila, Chandra Sekhar
2012-01-01
CREB3L4 is a member of the CREB/ATF transcription factor family, characterized by their regulation of gene expression through the cAMP-responsive element. Previous studies identified this protein in mice and humans. Whereas CREB3L4 in mice (referred to as Tisp40) is found in the testes and functions in spermatogenesis, human CREB3L4 is primarily detected in the prostate and has been implicated in cancer. We conducted computational analyses to compare the structural homology between murine Tisp40α and human CREB3L4. Our results reveal that the primary and secondary structures of the two proteins are highly similar. Additionally, the predicted helical transmembrane structure suggests that the proteins likely have similar structure and function. This study offers preliminary findings that support the translation of mouse Tisp40α findings into human models, based on structural homology. PMID:22829733
Heuristics for the inversion median problem
2010-01-01
Background: The study of genome rearrangements has become a mainstay of phylogenetics and comparative genomics. Fundamental in such a study is the median problem: given three genomes find a fourth that minimizes the sum of the evolutionary distances between itself and the given three. Many exact algorithms and heuristics have been developed for the inversion median problem, of which the best known is MGR. Results: We present a unifying framework for median heuristics, which enables us to clarify existing strategies and to place them in a partial ordering. Analysis of this framework leads to a new insight: the best strategies continue to refer to the input data rather than reducing the problem to smaller instances. Using this insight, we develop a new heuristic for inversion medians that uses input data to the end of its computation and leverages our previous work with DCJ medians. Finally, we present the results of extensive experimentation showing that our new heuristic outperforms all others in accuracy and, especially, in running time: the heuristic typically returns solutions within 1% of optimal and runs in seconds to minutes even on genomes with 25'000 genes--in contrast, MGR can take days on instances of 200 genes and cannot be used beyond 1'000 genes. Conclusion: Finding good rearrangement medians, in particular inversion medians, had long been regarded as the computational bottleneck in whole-genome studies. Our new heuristic for inversion medians, ASM, which dominates all others in our framework, puts that issue to rest by providing near-optimal solutions within seconds to minutes on even the largest genomes. PMID:20122203
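To make the median objective concrete, the sketch below solves the inversion median problem exactly by brute force on tiny signed permutations: a breadth-first search gives the exact inversion distance from each input genome to every other genome, and the median is the genome minimizing the sum of the three distances. This is neither ASM nor MGR; it is an exhaustive illustration whose cost grows as 2^n * n!, so it is only usable for a handful of genes. The function names are hypothetical.

```python
from collections import deque

def invert(genome, i, j):
    """Reverse the segment genome[i..j] and flip the sign of each gene in it."""
    return genome[:i] + tuple(-g for g in reversed(genome[i:j + 1])) + genome[j + 1:]

def distances_from(source):
    """Exact inversion distance from `source` to every signed permutation,
    found by breadth-first search over single inversions."""
    dist = {source: 0}
    queue = deque([source])
    n = len(source)
    while queue:
        g = queue.popleft()
        for i in range(n):
            for j in range(i, n):
                h = invert(g, i, j)
                if h not in dist:
                    dist[h] = dist[g] + 1
                    queue.append(h)
    return dist

def inversion_median(a, b, c):
    """Return (median genome, median score) for three signed permutations."""
    da, db, dc = distances_from(a), distances_from(b), distances_from(c)
    best = min(da, key=lambda g: da[g] + db[g] + dc[g])
    return best, da[best] + db[best] + dc[best]

median, score = inversion_median((1, 2, 3), (1, -2, 3), (1, 2, 3))
print(median, score)
```

The heuristics discussed in the abstract exist precisely because this exhaustive search is hopeless at genome scale.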
Gish, Stacey R.; Maier, Ezekiel J.; Haynes, Brian C.; Santiago-Tirado, Felipe H.; Srikanta, Deepa L.; Ma, Cynthia Z.; Li, Lucy X.; Williams, Matthew; Crouch, Erika C.; Khader, Shabaana A.
2016-01-01
Cryptococcus neoformans is a ubiquitous, opportunistic fungal pathogen that kills over 600,000 people annually. Here, we report integrated computational and experimental investigations of the role and mechanisms of transcriptional regulation in cryptococcal infection. Major cryptococcal virulence traits include melanin production and the development of a large polysaccharide capsule upon host entry; shed capsule polysaccharides also impair host defenses. We found that both transcription and translation are required for capsule growth and that Usv101 is a master regulator of pathogenesis, regulating melanin production, capsule growth, and capsule shedding. It does this by directly regulating genes encoding glycoactive enzymes and genes encoding three other transcription factors that are essential for capsule growth: GAT201, RIM101, and SP1. Murine infection with cryptococci lacking Usv101 significantly alters the kinetics and pathogenesis of disease, with extended survival and, unexpectedly, death by pneumonia rather than meningitis. Our approaches and findings will inform studies of other pathogenic microbes. PMID:27094327
Genome-Wide Detection and Analysis of Multifunctional Genes
Pritykin, Yuri; Ghersi, Dario; Singh, Mona
2015-01-01
Many genes can play a role in multiple biological processes or molecular functions. Identifying multifunctional genes at the genome-wide level and studying their properties can shed light upon the complexity of molecular events that underpin cellular functioning, thereby leading to a better understanding of the functional landscape of the cell. However, to date, genome-wide analysis of multifunctional genes (and the proteins they encode) has been limited. Here we introduce a computational approach that uses known functional annotations to extract genes playing a role in at least two distinct biological processes. We leverage functional genomics data sets for three organisms—H. sapiens, D. melanogaster, and S. cerevisiae—and show that, as compared to other annotated genes, genes involved in multiple biological processes possess distinct physicochemical properties, are more broadly expressed, tend to be more central in protein interaction networks, tend to be more evolutionarily conserved, and are more likely to be essential. We also find that multifunctional genes are significantly more likely to be involved in human disorders. These same features also hold when multifunctionality is defined with respect to molecular functions instead of biological processes. Our analysis uncovers key features about multifunctional genes, and is a step towards a better genome-wide understanding of gene multifunctionality. PMID:26436655
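The extraction step, selecting genes annotated to at least two distinct biological processes, can be sketched as follows. The abstract does not say how "distinct" is decided, so the rule used here (two terms are distinct when neither is an ancestor of the other in the term hierarchy) is an assumption, as are the toy hierarchy, the gene names, and the function names.

```python
def ancestors(term, parents):
    """All terms reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def are_distinct(t, u, parents):
    """Assumed rule: distinct if neither term is an ancestor of the other."""
    return t not in ancestors(u, parents) and u not in ancestors(t, parents)

def multifunctional_genes(annotations, parents):
    """Genes with at least one pair of distinct annotated terms."""
    result = set()
    for gene, terms in annotations.items():
        ordered = sorted(terms)
        if any(are_distinct(t, u, parents)
               for i, t in enumerate(ordered) for u in ordered[i + 1:]):
            result.add(gene)
    return result

# Toy hierarchy and annotations, invented for illustration.
parents = {
    "dna repair": {"dna metabolic process"},
    "dna metabolic process": set(),
    "cell adhesion": set(),
}
annotations = {
    "geneA": {"dna repair", "dna metabolic process"},  # nested terms only
    "geneB": {"dna repair", "cell adhesion"},          # two unrelated processes
    "geneC": {"cell adhesion"},
}
found = multifunctional_genes(annotations, parents)
print(found)
```

Filtering out nested terms matters because a gene annotated to both a process and its parent would otherwise be counted as multifunctional on the strength of a single function.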
Integrating alternative splicing detection into gene prediction.
Foissac, Sylvain; Schiex, Thomas
2005-02-10
Alternative splicing (AS) is now considered a major contributor to transcriptome/proteome diversity and it cannot be neglected in the annotation process of a new genome. Despite considerable progress in terms of accuracy in computational gene prediction, the ability to reliably predict AS variants when there is local experimental evidence of it remains an open challenge for gene finders. We have used a new integrative approach that allows AS detection to be incorporated into ab initio gene prediction. This method relies on the analysis of genomically aligned transcript sequences (ESTs and/or cDNAs), and has been implemented in the dynamic programming algorithm of the graph-based gene finder EuGENE. Given a genomic sequence and a set of aligned transcripts, this new version identifies the set of transcripts carrying evidence of alternative splicing events, and provides, in addition to the classical optimal gene prediction, alternative optimal predictions (among those which are consistent with the AS events detected). This allows for multiple annotations of a single gene in a way such that each predicted variant is supported by transcript evidence (but not necessarily with full-length coverage). This automatic combination of experimental data analysis and ab initio gene finding offers an ideal integration of alternatively spliced gene prediction inside a single annotation pipeline.
Selection Shapes Transcriptional Logic and Regulatory Specialization in Genetic Networks.
Fogelmark, Karl; Peterson, Carsten; Troein, Carl
2016-01-01
Living organisms need to regulate their gene expression in response to environmental signals and internal cues. This is a computational task where genes act as logic gates that connect to form transcriptional networks, which are shaped at all scales by evolution. Large-scale mutations such as gene duplications and deletions add and remove network components, whereas smaller mutations alter the connections between them. Selection determines what mutations are accepted, but its importance for shaping the resulting networks has been debated. To investigate the effects of selection in the shaping of transcriptional networks, we derive transcriptional logic from a combinatorially powerful yet tractable model of the binding between DNA and transcription factors. By evolving the resulting networks based on their ability to function as either a simple decision system or a circadian clock, we obtain information on the regulation and logic rules encoded in functional transcriptional networks. Comparisons are made between networks evolved for different functions, as well as with structurally equivalent but non-functional (neutrally evolved) networks, and predictions are validated against the transcriptional network of E. coli. We find that the logic rules governing gene expression depend on the function performed by the network. Unlike the decision systems, the circadian clocks show strong cooperative binding and negative regulation, which achieves tight temporal control of gene expression. Furthermore, we find that transcription factors act preferentially as either activators or repressors, both when binding multiple sites for a single target gene and globally in the transcriptional networks. This separation into positive and negative regulators requires gene duplications, which highlights the interplay between mutation and selection in shaping the transcriptional networks.
Musunuru, Kiran; Bernstein, Daniel; Cole, F Sessions; Khokha, Mustafa K; Lee, Frank S; Lin, Shin; McDonald, Thomas V; Moskowitz, Ivan P; Quertermous, Thomas; Sankaran, Vijay G; Schwartz, David A; Silverman, Edwin K; Zhou, Xiaobo; Hasan, Ahmed A K; Luo, Xiao-Zhong James
2018-04-01
The National Institutes of Health have made substantial investments in genomic studies and technologies to identify DNA sequence variants associated with human disease phenotypes. The National Heart, Lung, and Blood Institute has been at the forefront of these commitments to ascertain genetic variation associated with heart, lung, blood, and sleep diseases and related clinical traits. Genome-wide association studies, exome- and genome-sequencing studies, and exome-genotyping studies of the National Heart, Lung, and Blood Institute-funded epidemiological and clinical case-control studies are identifying large numbers of genetic variants associated with heart, lung, blood, and sleep phenotypes. However, investigators face challenges in identification of genomic variants that are functionally disruptive among the myriad of computationally implicated variants. Studies to define mechanisms of genetic disruption encoded by computationally identified genomic variants require reproducible, adaptable, and inexpensive methods to screen candidate variant and gene function. High-throughput strategies will permit a tiered variant discovery and genetic mechanism approach that begins with rapid functional screening of a large number of computationally implicated variants and genes for discovery of those that merit mechanistic investigation. As such, improved variant-to-gene and gene-to-function screens-and adequate support for such studies-are critical to accelerating the translation of genomic findings. In this White Paper, we outline the variety of novel technologies, assays, and model systems that are making such screens faster, cheaper, and more accurate, referencing published work and ongoing work supported by the National Heart, Lung, and Blood Institute's R21/R33 Functional Assays to Screen Genomic Hits program. We discuss priorities that can accelerate the impressive but incomplete progress represented by big data genomic research. © 2018 American Heart Association, Inc.
Bioinformatic investigation of the role of ubiquitins in cucumber flower morphogenesis
NASA Astrophysics Data System (ADS)
Pawełkowicz, Magdalena; Osipowski, Paweł; Wojcieszek, Michał; Kowalczuk, Cezary; Pląder, Wojciech; Przybecki, Zbigniew
2016-09-01
Three cDNA clones were used to screen the cucumber genome in order to find genes and proteins. Functional annotation reveals that they are correlated with ubiquitination pathways. Various bioinformatics tools were used to screen and check protein sequence features such as the presence of specific domains, transmembrane regions, cleavage sites and cellular localization. Computational analysis of the promoter regions shows many binding sites for transcription factors, which could regulate the expression of the genes. To check gene expression levels in developing flower buds of monoecious (B10) and gynoecious (2gg) cucumber lines, the real-time PCR technique was applied. Expression was checked both for whole buds and for only the 3rd and 4th whorls of the bud, where the generative organs form, which were obtained by the Laser Capture Microdissection (LCM) technique.
Finding gene regulatory network candidates using the gene expression knowledge base.
Venkatesan, Aravind; Tripathi, Sushil; Sanz de Galdeano, Alejandro; Blondé, Ward; Lægreid, Astrid; Mironov, Vladimir; Kuiper, Martin
2014-12-10
Network-based approaches for the analysis of large-scale genomics data have become well established. Biological networks provide a knowledge scaffold against which the patterns and dynamics of 'omics' data can be interpreted. The background information required for the construction of such networks is often dispersed across a multitude of knowledge bases in a variety of formats. The seamless integration of this information is one of the main challenges in bioinformatics. The Semantic Web offers powerful technologies for the assembly of integrated knowledge bases that are computationally comprehensible, thereby providing a potentially powerful resource for constructing biological networks and network-based analysis. We have developed the Gene eXpression Knowledge Base (GeXKB), a semantic web technology based resource that contains integrated knowledge about gene expression regulation. To affirm the utility of GeXKB we demonstrate how this resource can be exploited for the identification of candidate regulatory network proteins. We present four use cases that were designed from a biological perspective in order to find candidate members relevant for the gastrin hormone signaling network model. We show how a combination of specific query definitions and additional selection criteria derived from gene expression data and prior knowledge concerning candidate proteins can be used to retrieve a set of proteins that constitute valid candidates for regulatory network extensions. Semantic web technologies provide the means for processing and integrating various heterogeneous information sources. The GeXKB offers biologists such an integrated knowledge resource, allowing them to address complex biological questions pertaining to gene expression. This work illustrates how GeXKB can be used in combination with gene expression results and literature information to identify new potential candidates that may be considered for extending a gene regulatory network.
Gorlin-Goltz syndrome: incidental finding on routine CT scan following car accident.
Kalogeropoulou, Christina; Zampakis, Petros; Kazantzi, Santra; Kraniotis, Pantelis; Mastronikolis, Nicholas S
2009-11-25
Gorlin-Goltz syndrome is a rare hereditary disease. Pathogenesis of the syndrome is attributed to abnormalities in the long arm of chromosome 9 (q22.3-q31) and loss or mutations of the human patched gene (PTCH1 gene). Multiple basal cell carcinomas (BCCs), odontogenic keratocysts, skeletal abnormalities, hyperkeratosis of palms and soles, intracranial ectopic calcifications of the falx cerebri and facial dysmorphism are considered the main clinical features. Diagnosis is based upon established major and minor clinical and radiological criteria and is ideally confirmed by DNA analysis. Because several different systems are affected, a multidisciplinary team of experts is required for successful management. We report the case of a 19-year-old female who was involved in a car accident and was found to have imaging findings of Gorlin-Goltz syndrome during a routine whole body computed tomography (CT) scan performed to exclude traumatic injuries. Radiologic findings of the syndrome are easily identifiable on CT scans and may prompt early verification of the disease, which is very important for regular follow-up and better survival rates from the co-existent diseases.
Lee, Wei-Po; Hsiao, Yu-Ting; Hwang, Wei-Che
2014-01-16
To improve the tedious task of reconstructing gene networks through testing experimentally the possible interactions between genes, it becomes a trend to adopt the automated reverse engineering procedure instead. Some evolutionary algorithms have been suggested for deriving network parameters. However, to infer large networks by the evolutionary algorithm, it is necessary to address two important issues: premature convergence and high computational cost. To tackle the former problem and to enhance the performance of traditional evolutionary algorithms, it is advisable to use parallel model evolutionary algorithms. To overcome the latter and to speed up the computation, it is advocated to adopt the mechanism of cloud computing as a promising solution: most popular is the method of MapReduce programming model, a fault-tolerant framework to implement parallel algorithms for inferring large gene networks. This work presents a practical framework to infer large gene networks, by developing and parallelizing a hybrid GA-PSO optimization method. Our parallel method is extended to work with the Hadoop MapReduce programming model and is executed in different cloud computing environments. To evaluate the proposed approach, we use a well-known open-source software GeneNetWeaver to create several yeast S. cerevisiae sub-networks and use them to produce gene profiles. Experiments have been conducted and the results have been analyzed. They show that our parallel approach can be successfully used to infer networks with desired behaviors and the computation time can be largely reduced. Parallel population-based algorithms can effectively determine network parameters and they perform better than the widely-used sequential algorithms in gene network inference. These parallel algorithms can be distributed to the cloud computing environment to speed up the computation. 
By coupling the parallel model population-based optimization method and the parallel computational framework, high quality solutions can be obtained within relatively short time. This integrated approach is a promising way for inferring large networks. PMID:24428926
Xander: employing a novel method for efficient gene-targeted metagenomic assembly
Wang, Qiong; Fish, Jordan A.; Gilman, Mariah; ...
2015-08-05
Metagenomics can provide important insight into microbial communities. However, assembling metagenomic datasets has proven to be computationally challenging. Current methods often assemble only fragmented partial genes. We present a novel method for targeting assembly of specific protein-coding genes. This method combines a de Bruijn graph, as used in standard assembly approaches, and a protein profile hidden Markov model (HMM) for the gene of interest, as used in standard annotation approaches. These are used to create a novel combined weighted assembly graph. Xander performs both assembly and annotation concomitantly using information incorporated in this graph. We demonstrate the utility of this approach by assembling contigs for one phylogenetic marker gene and for two functional marker genes, first on Human Microbiome Project (HMP)-defined community Illumina data and then on 21 rhizosphere soil metagenomic datasets from three different crops totaling over 800 Gbp of unassembled data. We compared our method to a recently published bulk metagenome assembly method and a recently published gene-targeted assembler and found our method produced more, longer, and higher quality gene sequences. In conclusion, Xander combines gene assignment with the rapid assembly of full-length or near full-length functional genes from metagenomic data without requiring bulk assembly or post-processing to find genes of interest. HMMs used for assembly can be tailored to the targeted genes, allowing flexibility to improve annotation over generic annotation pipelines.
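As a rough illustration of the de Bruijn graph component mentioned in this abstract, the sketch below builds a plain de Bruijn graph from short reads. It is not Xander's combined weighted graph, which also incorporates the profile HMM; the function name `de_bruijn_graph` and the toy reads are assumptions made for illustration only.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers.

    Returns {prefix: {suffix: count}}, where count is how many times the
    k-mer joining prefix and suffix was seen across all reads.
    Hypothetical helper for illustration; not part of Xander.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for read in reads:
        # Slide a window of length k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += 1
    # Convert nested defaultdicts to plain dicts for easy inspection.
    return {node: dict(succ) for node, succ in graph.items()}


if __name__ == "__main__":
    toy_reads = ["ATGGC", "GGCAT"]  # made-up reads, not real data
    for node, successors in sorted(de_bruijn_graph(toy_reads, 3).items()):
        print(node, "->", successors)
```

Edge counts give a simple coverage weight; a gene-targeted assembler would additionally score paths against a model of the gene of interest, which this sketch leaves out.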
2013-01-01
Background MicroRNAs (miRNAs) are important post-transcriptional regulators that have been demonstrated to play an important role in human diseases. Elucidating the associations between miRNAs and diseases at the systematic level will deepen our understanding of the molecular mechanisms of diseases. However, miRNA-disease associations identified by previous computational methods are far from completeness and more effort is needed. Results We developed a computational framework to identify miRNA-disease associations by performing random walk analysis, and focused on the functional link between miRNA targets and disease genes in protein-protein interaction (PPI) networks. Furthermore, a bipartite miRNA-disease network was constructed, from which several miRNA-disease co-regulated modules were identified by hierarchical clustering analysis. Our approach achieved satisfactory performance in identifying known cancer-related miRNAs for nine human cancers with an area under the ROC curve (AUC) ranging from 71.3% to 91.3%. By systematically analyzing the global properties of the miRNA-disease network, we found that only a small number of miRNAs regulated genes involved in various diseases, genes associated with neurological diseases were preferentially regulated by miRNAs and some immunological diseases were associated with several specific miRNAs. We also observed that most diseases in the same co-regulated module tended to belong to the same disease category, indicating that these diseases might share similar miRNA regulatory mechanisms. Conclusions In this study, we present a computational framework to identify miRNA-disease associations, and further construct a bipartite miRNA-disease network for systematically analyzing the global properties of miRNA regulation of disease genes. 
Our findings provide a broad perspective on the relationships between miRNAs and diseases and could potentially aid future research efforts concerning miRNA involvement in disease pathogenesis. PMID:24103777
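The abstract above does not say which random walk variant was used. As one common form, the sketch below runs a random walk with restart on an undirected network, scoring every node by its proximity to a set of seed nodes. The function name, the restart probability of 0.5, and the toy path network are all assumptions for illustration, not details taken from the paper.

```python
def random_walk_with_restart(edges, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Score nodes of an undirected graph by proximity to the seed nodes.

    Iterates p <- (1 - restart) * W p + restart * p0, where W spreads each
    node's probability evenly over its neighbours and p0 is uniform on seeds.
    Illustrative sketch only; parameter values are assumptions.
    """
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    missing = set(seeds) - set(neighbors)
    if not seeds or missing:
        raise ValueError("seeds must be a non-empty subset of the graph nodes")

    nodes = sorted(neighbors)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            # Each node passes its non-restarting mass evenly to neighbours.
            share = (1.0 - restart) * p[u] / len(neighbors[u])
            for v in neighbors[u]:
                nxt[v] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p


if __name__ == "__main__":
    # Toy path network A - B - C - D with a single seed at A.
    scores = random_walk_with_restart([("A", "B"), ("B", "C"), ("C", "D")], {"A"})
    for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(node, round(score, 4))
```

In a setting like the one described, the seeds might be known disease genes and the ranked nodes candidate partners in a PPI network, but that mapping is a reading of the abstract rather than the authors' stated procedure.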
Discovering and understanding oncogenic gene fusions through data intensive computational approaches
Latysheva, Natasha S.; Babu, M. Madan
2016-01-01
Abstract Although gene fusions have been recognized as important drivers of cancer for decades, our understanding of the prevalence and function of gene fusions has been revolutionized by the rise of next-generation sequencing, advances in bioinformatics theory and an increasing capacity for large-scale computational biology. The computational work on gene fusions has been vastly diverse, and the present state of the literature is fragmented. It will be fruitful to merge three camps of gene fusion bioinformatics that appear to rarely cross over: (i) data-intensive computational work characterizing the molecular biology of gene fusions; (ii) development research on fusion detection tools, candidate fusion prioritization algorithms and dedicated fusion databases and (iii) clinical research that seeks to either therapeutically target fusion transcripts and proteins or leverages advances in detection tools to perform large-scale surveys of gene fusion landscapes in specific cancer types. In this review, we unify these different—yet highly complementary and symbiotic—approaches with the view that increased synergy will catalyze advancements in gene fusion identification, characterization and significance evaluation. PMID:27105842
Ludovini, Vienna; Bianconi, Fortunato; Siggillino, Annamaria; Piobbico, Danilo; Vannucci, Jacopo; Metro, Giulio; Chiari, Rita; Bellezza, Guido; Puma, Francesco; Della Fazia, Maria Agnese; Servillo, Giuseppe; Crinò, Lucio
2016-05-24
Risk assessment and treatment choice remain a challenge in early non-small-cell lung cancer (NSCLC). The aim of this study was to identify novel genes involved in the risk of early relapse (ER) compared to no relapse (NR) in resected lung adenocarcinoma (AD) patients using a combination of high throughput technology and computational analysis. We identified 18 patients (13 NR and 5 ER) with stage I AD. Frozen samples of patients in ER, NR and corresponding normal lung (NL) were subjected to microarray technology and quantitative PCR (Q-PCR). A gene network computational analysis was performed to select predictive genes. An independent set of 79 stage I AD samples was used to validate selected genes by Q-PCR. From microarray analysis we selected 50 genes, using the fold change ratio of ER versus NR. They were validated both in pool and individually in patient samples (ER and NR) by Q-PCR. Fourteen increased and 25 decreased genes showed concordance between the two methods. They were used to perform a computational gene network analysis that identified 4 increased (HOXA10, CLCA2, AKR1B10, FABP3) and 6 decreased (SCGB1A1, PGC, TFF1, PSCA, SPRR1B and PRSS1) genes. Moreover, in an independent dataset of AD samples, we showed that both high FABP3 expression and low SCGB1A1 expression were associated with worse disease-free survival (DFS). Our results indicate that it is possible to define, through gene expression and computational analysis, a characteristic gene profile of patients with an increased risk of relapse that may become a tool for patient selection for adjuvant therapy.
Richardson, Casey R.; Luo, Qing-Jun; Gontcharova, Viktoria; Jiang, Ying-Wen; Samanta, Manoj; Youn, Eunseog; Rock, Christopher D.
2010-01-01
Background MicroRNAs (miRNAs) and trans-acting small-interfering RNAs (tasi-RNAs) are small (20–22 nt long) RNAs (smRNAs) generated from hairpin secondary structures or antisense transcripts, respectively, that regulate gene expression by Watson-Crick pairing to a target mRNA and altering expression by mechanisms related to RNA interference. The high sequence homology of plant miRNAs to their targets has been the mainstay of miRNA prediction algorithms, which are limited in their predictive power for other kingdoms because miRNA complementarity is less conserved yet transitive processes (production of antisense smRNAs) are active in eukaryotes. We hypothesize that antisense transcription and associated smRNAs are biomarkers which can be computationally modeled for gene discovery. Principal Findings We explored rice (Oryza sativa) sense and antisense gene expression in publicly available whole genome tiling array transcriptome data and sequenced smRNA libraries (as well as C. elegans) and found evidence of transitivity of MIRNA genes similar to that found in Arabidopsis. Statistical analysis of antisense transcript abundances, presence of antisense ESTs, and association with smRNAs suggests several hundred Arabidopsis ‘orphan’ hypothetical genes are non-coding RNAs. Consistent with this hypothesis, we found novel Arabidopsis homologues of some MIRNA genes on the antisense strand of previously annotated protein-coding genes. A Support Vector Machine (SVM) was applied using thermodynamic energy of binding plus novel expression features of sense/antisense transcription topology and siRNA abundances to build a prediction model of miRNA targets. The SVM when trained on targets could predict the “ancient” (deeply conserved) class of validated Arabidopsis MIRNA genes with an accuracy of 84%, and 76% for “new” rapidly-evolving MIRNA genes. 
Conclusions Antisense and smRNA expression features and computational methods may identify novel MIRNA genes and other non-coding RNAs in plants and potentially other kingdoms, which can provide insight into antisense transcription, miRNA evolution, and post-transcriptional gene regulation. PMID:20520764
Shahdoust, Maryam; Hajizadeh, Ebrahim; Mozdarani, Hossein; Chehrei, Ali
2013-01-01
Cigarette smoking is the major risk factor for development of lung cancer. Identification of effects of tobacco on airway gene expression may provide insight into the causes. This research aimed to compare gene expression of large airway epithelium cells in normal smokers (n=13) and non-smokers (n=9) in order to find genes which discriminate the two groups and assess cigarette smoking effects on large airway epithelium cells. Genes discriminating smokers from non-smokers were identified by applying a neural network clustering method, growing self-organizing maps (GSOM), to microarray data according to class discrimination scores. An index was computed based on the difference between the mean expression of each gene in the two groups. This clustering approach made it possible to compare thousands of genes simultaneously. The applied approach compared the means of 7,129 genes in smokers and non-smokers simultaneously and classified the genes of large airway epithelium cells that were differentially expressed in smokers compared with non-smokers. Seven genes were identified which had the largest expression differences in smokers compared with the non-smokers group: NQO1, H19, ALDH3A1, AKR1C1, ABHD2, GPX2 and ADH7. Most (NQO1, ALDH3A1, AKR1C1, H19 and GPX2) are known to be clinically notable in lung cancer studies. Furthermore, statistical discriminant analysis showed that these genes could classify samples as smokers or non-smokers correctly with 100% accuracy. With the performed GSOM map, other nodes with high average discrimination scores included genes with alterations strongly related to lung cancer such as AKR1C3, CYP1B1, UCHL1 and AKR1B10. This clustering, by comparing expression of thousands of genes at the same time, revealed alterations in normal smokers. Most of the identified genes were strongly relevant to lung cancer in the existing literature. The genes may be utilized to identify smokers with increased risk for lung cancer. 
A large sample study is now recommended to determine relations between the genes ABHD2 and ADH7 and smoking.
Synthetic Analog and Digital Circuits for Cellular Computation and Memory
Purcell, Oliver; Lu, Timothy K.
2014-01-01
Biological computation is a major area of focus in synthetic biology because it has the potential to enable a wide range of applications. Synthetic biologists have applied engineering concepts to biological systems in order to construct progressively more complex gene circuits capable of processing information in living cells. Here, we review the current state of computational genetic circuits and describe artificial gene circuits that perform digital and analog computation. We then discuss recent progress in designing gene circuits that exhibit memory, and how memory and computation have been integrated to yield more complex systems that can both process and record information. Finally, we suggest new directions for engineering biological circuits capable of computation. PMID:24794536
An integrative, translational approach to understanding rare and orphan genetically based diseases
Hoehndorf, Robert; Schofield, Paul N.; Gkoutos, Georgios V.
2013-01-01
PhenomeNet is an approach for integrating phenotypes across species and identifying candidate genes for genetic diseases based on the similarity between a disease and animal model phenotypes. In contrast to ‘guilt-by-association’ approaches, PhenomeNet relies exclusively on the comparison of phenotypes to suggest candidate genes, and can, therefore, be applied to study the molecular basis of rare and orphan diseases for which the molecular basis is unknown. In addition to disease phenotypes from the Online Mendelian Inheritance in Man (OMIM) database, we have now integrated the clinical signs from Orphanet into PhenomeNet. We demonstrate that our approach can efficiently identify known candidate genes for genetic diseases in Orphanet and OMIM. Furthermore, we find evidence that mutations in the HIP1 gene might cause Bassoe syndrome, a rare disorder with unknown genetic aetiology. Our results demonstrate that integration and computational analysis of human disease and animal model phenotypes using PhenomeNet has the potential to reveal novel insights into the pathobiology underlying genetic diseases. PMID:23853703
Ramsden, Helen L.; Sürmeli, Gülşen; McDonagh, Steven G.; Nolan, Matthew F.
2015-01-01
Neural circuits in the medial entorhinal cortex (MEC) encode an animal’s position and orientation in space. Within the MEC spatial representations, including grid and directional firing fields, have a laminar and dorsoventral organization that corresponds to a similar topography of neuronal connectivity and cellular properties. Yet, in part due to the challenges of integrating anatomical data at the resolution of cortical layers and borders, we know little about the molecular components underlying this organization. To address this we develop a new computational pipeline for high-throughput analysis and comparison of in situ hybridization (ISH) images at laminar resolution. We apply this pipeline to ISH data for over 16,000 genes in the Allen Brain Atlas and validate our analysis with RNA sequencing of MEC tissue from adult mice. We find that differential gene expression delineates the borders of the MEC with neighboring brain structures and reveals its laminar and dorsoventral organization. We propose a new molecular basis for distinguishing the deep layers of the MEC and show that their similarity to corresponding layers of neocortex is greater than that of superficial layers. Our analysis identifies ion channel-, cell adhesion- and synapse-related genes as candidates for functional differentiation of MEC layers and for encoding of spatial information at different scales along the dorsoventral axis of the MEC. We also reveal laminar organization of genes related to disease pathology and suggest that a high metabolic demand predisposes layer II to neurodegenerative pathology. In principle, our computational pipeline can be applied to high-throughput analysis of many forms of neuroanatomical data. Our results support the hypothesis that differences in gene expression contribute to functional specialization of superficial layers of the MEC and dorsoventral organization of the scale of spatial representations. PMID:25615592
An Exact Algorithm to Compute the Double-Cut-and-Join Distance for Genomes with Duplicate Genes.
Shao, Mingfu; Lin, Yu; Moret, Bernard M E
2015-05-01
Computing the edit distance between two genomes is a basic problem in the study of genome evolution. The double-cut-and-join (DCJ) model has formed the basis for most algorithmic research on rearrangements over the last few years. The edit distance under the DCJ model can be computed in linear time for genomes without duplicate genes, while the problem becomes NP-hard in the presence of duplicate genes. In this article, we propose an integer linear programming (ILP) formulation to compute the DCJ distance between two genomes with duplicate genes. We also provide an efficient preprocessing approach to simplify the ILP formulation while preserving optimality. Comparison on simulated genomes demonstrates that our method outperforms MSOAR in computing the edit distance, especially when the genomes contain long duplicated segments. We also apply our method to assign orthologous gene pairs among human, mouse, and rat genomes, where once again our method outperforms MSOAR.
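For the duplicate-free case that this abstract describes as solvable in linear time, the DCJ distance follows from the standard formula d = N - (C + I/2), with N genes, C cycles and I odd-length paths in the adjacency graph. The sketch below covers only the simplest setting, assumed here for brevity: two genomes that are each a single circular chromosome over the same gene set, where there are no paths and d = N - C. It is not the authors' ILP for duplicated genes, and the function names are illustrative.

```python
def _extremities(gene):
    """Return (left, right) extremities of a signed gene; 't' tail, 'h' head."""
    g = abs(gene)
    return ((g, "t"), (g, "h")) if gene > 0 else ((g, "h"), (g, "t"))


def _adjacencies(genome):
    """Map each gene extremity to its neighbour on a circular chromosome."""
    partner = {}
    n = len(genome)
    for i in range(n):
        right = _extremities(genome[i])[1]
        left = _extremities(genome[(i + 1) % n])[0]
        partner[right] = left
        partner[left] = right
    return partner


def dcj_distance_circular(genome_a, genome_b):
    """DCJ distance between two single-circular-chromosome genomes.

    Genomes are lists of signed integers with no duplicate genes.
    Uses d = N - C, where C counts cycles alternating between the
    adjacencies of genome_a and genome_b.
    """
    if sorted(map(abs, genome_a)) != sorted(map(abs, genome_b)):
        raise ValueError("genomes must contain the same genes")
    if len(set(map(abs, genome_a))) != len(genome_a):
        raise ValueError("duplicate genes are not supported in this sketch")
    adj_a, adj_b = _adjacencies(genome_a), _adjacencies(genome_b)
    seen, cycles = set(), 0
    for start in adj_a:
        if start in seen:
            continue
        cycles += 1
        x = start
        while x not in seen:
            seen.add(x)
            y = adj_a[x]   # follow an adjacency of genome A
            seen.add(y)
            x = adj_b[y]   # then an adjacency of genome B
    return len(genome_a) - cycles


if __name__ == "__main__":
    print(dcj_distance_circular([1, 2, 3], [1, -2, 3]))  # one inversion
```

Each extremity is visited once, so the running time is linear in the number of genes; with duplicate genes the matching between copies is no longer fixed, which is where the hardness noted in the abstract comes from.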
Identifying novel genes and chemicals related to nasopharyngeal cancer in a heterogeneous network.
Li, Zhandong; An, Lifeng; Li, Hao; Wang, ShaoPeng; Zhou, You; Yuan, Fei; Li, Lin
2016-05-05
Nasopharyngeal cancer or nasopharyngeal carcinoma (NPC) is the most common cancer originating in the nasopharynx. The factors that induce nasopharyngeal cancer are still not clear. Additional information about the chemicals or genes related to nasopharyngeal cancer will promote a better understanding of the pathogenesis of this cancer and the factors that induce it. Thus, a computational method NPC-RGCP was proposed in this study to identify the possible relevant chemicals and genes based on the presently known chemicals and genes related to nasopharyngeal cancer. To extensively utilize the functional associations between proteins and chemicals, a heterogeneous network was constructed based on interactions of proteins and chemicals. The NPC-RGCP included two stages: the searching stage and the screening stage. The former stage is for finding new possible genes and chemicals in the heterogeneous network, while the latter stage is for screening and removing false discoveries and selecting the core genes and chemicals. As a result, five putative genes, CXCR3, IRF1, CDK1, GSTP1, and CDH2, and seven putative chemicals, iron, propionic acid, dimethyl sulfoxide, isopropanol, erythrose 4-phosphate, β-D-Fructose 6-phosphate, and flavin adenine dinucleotide, were identified by NPC-RGCP. Extensive analyses provided confirmation that the putative genes and chemicals have significant associations with nasopharyngeal cancer. PMID:27149165
Biswas, Surama; Dutta, Subarna; Acharyya, Sriyankar
2017-12-01
Identifying a small subset of disease critical genes out of a large size of microarray gene expression data is a challenge in computational life sciences. This paper has applied four meta-heuristic algorithms, namely, honey bee mating optimization (HBMO), harmony search (HS), differential evolution (DE) and genetic algorithm (basic version GA) to find disease critical genes of preeclampsia which affects women during gestation. Two hybrid algorithms, namely, HBMO-kNN and HS-kNN have been newly proposed here where kNN (k nearest neighbor classifier) is used for sample classification. Performances of these new approaches have been compared with other two hybrid algorithms, namely, DE-kNN and SGA-kNN. Three datasets of different sizes have been used. In a dataset, the set of genes found common in the output of each algorithm is considered here as disease critical genes. In different datasets, the percentage of classification or classification accuracy of meta-heuristic algorithms varied between 92.46 and 100%. HBMO-kNN has the best performance (99.64-100%) in almost all data sets. DE-kNN secures the second position (99.42-100%). Disease critical genes obtained here match with clinically revealed preeclampsia genes to a large extent.
Cheng, Chao; Ung, Matthew; Grant, Gavin D.; Whitfield, Michael L.
2013-01-01
Cell cycle is a complex and highly supervised process that must proceed with regulatory precision to achieve successful cellular division. Despite the wide application, microarray time course experiments have several limitations in identifying cell cycle genes. We thus propose a computational model to predict human cell cycle genes based on transcription factor (TF) binding and regulatory motif information in their promoters. We utilize ENCODE ChIP-seq data and motif information as predictors to discriminate cell cycle against non-cell cycle genes. Our results show that both the trans- TF features and the cis- motif features are predictive of cell cycle genes, and a combination of the two types of features can further improve prediction accuracy. We apply our model to a complete list of GENCODE promoters to predict novel cell cycle driving promoters for both protein-coding genes and non-coding RNAs such as lincRNAs. We find that a similar percentage of lincRNAs are cell cycle regulated as protein-coding genes, suggesting the importance of non-coding RNAs in cell cycle division. The model we propose here provides not only a practical tool for identifying novel cell cycle genes with high accuracy, but also new insights on cell cycle regulation by TFs and cis-regulatory elements. PMID:23874175
Discretization provides a conceptually simple tool to build expression networks.
Vass, J Keith; Higham, Desmond J; Mudaliar, Manikhandan A V; Mao, Xuerong; Crowther, Daniel J
2011-04-18
Biomarker identification, using network methods, depends on finding regular co-expression patterns; the overall connectivity is of greater importance than any single relationship. A second requirement is a simple algorithm for ranking patients on how relevant a gene-set is. For both of these requirements discretized data helps to first identify gene cliques, and then to stratify patients. We explore a biologically intuitive discretization technique which codes genes as up- or down-regulated, with values close to the mean set as unchanged; this allows a richer description of relationships between genes than can be achieved by positive and negative correlation. We find a close agreement between our results and the template gene-interactions used to build synthetic microarray-like data by SynTReN, which synthesizes "microarray" data using known relationships which are successfully identified by our method. We are able to split positive co-regulation into up-together and down-together and negative co-regulation is considered as directed up-down relationships. In some cases these exist in only one direction, with real data, but not with the synthetic data. We illustrate our approach using two studies on white blood cells and derived immortalized cell lines and compare the approach with standard correlation-based computations. No attempt is made to distinguish possible causal links as the search for biomarkers would be crippled by losing highly significant co-expression relationships. This contrasts with approaches like ARACNE and IRIS. The method is illustrated with an analysis of gene-expression for energy metabolism pathways.
For each discovered relationship we are able to identify the samples on which this is based in the discretized sample-gene matrix, along with a simplified view of the patterns of gene expression; this helps to dissect the gene-sample relevant to a research topic--identifying sets of co-regulated and anti-regulated genes and the samples or patients in which this relationship occurs.
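The three-level coding described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the `width` threshold (how many standard deviations from the mean count as "unchanged") are assumptions made for the example.

```python
from statistics import mean, pstdev


def discretize(values, width=0.5):
    """Code each expression value as +1 (up), -1 (down) or 0 (unchanged).

    Values within `width` standard deviations of the gene's mean are
    coded as unchanged. `width` is an assumed, illustrative threshold.
    """
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        # A gene that never varies is unchanged in every sample.
        return [0] * len(values)
    codes = []
    for v in values:
        if v > mu + width * sd:
            codes.append(1)
        elif v < mu - width * sd:
            codes.append(-1)
        else:
            codes.append(0)
    return codes


def relationship_counts(codes_a, codes_b):
    """Count samples in which two genes move together or in opposite directions."""
    labels = {(1, 1): "up-up", (-1, -1): "down-down",
              (1, -1): "up-down", (-1, 1): "down-up"}
    counts = {label: 0 for label in labels.values()}
    for pair in zip(codes_a, codes_b):
        if pair in labels:
            counts[labels[pair]] += 1
    return counts
```

Keeping the per-sample codes, rather than a single correlation coefficient, is what makes it possible to point back to the samples that support a given relationship.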
Predictive computation of genomic logic processing functions in embryonic development
Peter, Isabelle S.; Faure, Emmanuel; Davidson, Eric H.
2012-01-01
Gene regulatory networks (GRNs) control the dynamic spatial patterns of regulatory gene expression in development. Thus, in principle, GRN models may provide system-level, causal explanations of developmental process. To test this assertion, we have transformed a relatively well-established GRN model into a predictive, dynamic Boolean computational model. This Boolean model computes spatial and temporal gene expression according to the regulatory logic and gene interactions specified in a GRN model for embryonic development in the sea urchin. Additional information input into the model included the progressive embryonic geometry and gene expression kinetics. The resulting model predicted gene expression patterns for a large number of individual regulatory genes each hour up to gastrulation (30 h) in four different spatial domains of the embryo. Direct comparison with experimental observations showed that the model predictively computed these patterns with remarkable spatial and temporal accuracy. In addition, we used this model to carry out in silico perturbations of regulatory functions and of embryonic spatial organization. The model computationally reproduced the altered developmental functions observed experimentally. Two major conclusions are that the starting GRN model contains sufficiently complete regulatory information to permit explanation of a complex developmental process of gene expression solely in terms of genomic regulatory code, and that the Boolean model provides a tool with which to test in silico regulatory circuitry and developmental perturbations. PMID:22927416
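To make the idea of a dynamic Boolean model concrete, the sketch below runs a synchronous update of a toy three-gene circuit. The gene names, the rules and the single-domain setting are hypothetical and far simpler than the sea urchin GRN model described above; only the general technique (each gene's next state is a logic function of the current states, evaluated once per time step) is illustrated.

```python
def step(state, rules):
    """Synchronously compute every gene's next state from the current state."""
    return {gene: bool(rule(state)) for gene, rule in rules.items()}


def simulate(initial, rules, hours):
    """Return the list of states from time 0 to `hours`, one state per step."""
    trajectory = [dict(initial)]
    for _ in range(hours):
        trajectory.append(step(trajectory[-1], rules))
    return trajectory


# Hypothetical regulatory logic for illustration only.
RULES = {
    # Input held constant throughout the simulation.
    "activator": lambda s: s["activator"],
    # Switched on one step after the activator is present.
    "repressor": lambda s: s["activator"],
    # AND-NOT logic: needs the activator and the absence of the repressor.
    "target": lambda s: s["activator"] and not s["repressor"],
}
```

An in silico perturbation then amounts to swapping one rule, for example `dict(RULES, repressor=lambda s: False)` to mimic a knockout, and comparing the resulting trajectory with the unperturbed one.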
Application of machine learning on brain cancer multiclass classification
NASA Astrophysics Data System (ADS)
Panca, V.; Rustam, Z.
2017-07-01
Classification of brain cancer is a problem of multiclass classification. One approach to solve this problem is by first transforming it into several binary problems. The microarray gene expression dataset has the two main characteristics of medical data: a very large number of features (genes) and only a small number of samples. The application of machine learning on microarray gene expression dataset mainly consists of two steps: feature selection and classification. In this paper, the features are selected using a method based on support vector machine recursive feature elimination (SVM-RFE) principle which is improved to solve multiclass classification, called multiple multiclass SVM-RFE. Instead of using only the selected features on a single classifier, this method combines the result of multiple classifiers. The features are divided into subsets and SVM-RFE is used on each subset. Then, the selected features on each subset are put on separate classifiers. This method enhances the feature selection ability of each single SVM-RFE. Twin support vector machine (TWSVM) is used as the method of the classifier to reduce computational complexity. While ordinary SVM finds a single optimum hyperplane, the main objective of Twin SVM is to find two non-parallel optimum hyperplanes. The experiment on the brain cancer microarray gene expression dataset shows this method could classify 71.4% of the overall test data correctly, using 100 and 1000 genes selected from multiple multiclass SVM-RFE feature selection method. Furthermore, the per class results show that this method could classify data of normal and MD class with 100% accuracy.
Robustness, evolvability, and the logic of genetic regulation.
Payne, Joshua L; Moore, Jason H; Wagner, Andreas
2014-01-01
In gene regulatory circuits, the expression of individual genes is commonly modulated by a set of regulating gene products, which bind to a gene's cis-regulatory region. This region encodes an input-output function, referred to as signal-integration logic, that maps a specific combination of regulatory signals (inputs) to a particular expression state (output) of a gene. The space of all possible signal-integration functions is vast and the mapping from input to output is many-to-one: For the same set of inputs, many functions (genotypes) yield the same expression output (phenotype). Here, we exhaustively enumerate the set of signal-integration functions that yield identical gene expression patterns within a computational model of gene regulatory circuits. Our goal is to characterize the relationship between robustness and evolvability in the signal-integration space of regulatory circuits, and to understand how these properties vary between the genotypic and phenotypic scales. Among other results, we find that the distributions of genotypic robustness are skewed, so that the majority of signal-integration functions are robust to perturbation. We show that the connected set of genotypes that make up a given phenotype are constrained to specific regions of the space of all possible signal-integration functions, but that as the distance between genotypes increases, so does their capacity for unique innovations. In addition, we find that robust phenotypes are (i) evolvable, (ii) easily identified by random mutation, and (iii) mutationally biased toward other robust phenotypes. We explore the implications of these latter observations for mutation-based evolution by conducting random walks between randomly chosen source and target phenotypes. We demonstrate that the time required to identify the target phenotype is independent of the properties of the source phenotype.
Synthetic analog and digital circuits for cellular computation and memory.
Purcell, Oliver; Lu, Timothy K
2014-10-01
Biological computation is a major area of focus in synthetic biology because it has the potential to enable a wide range of applications. Synthetic biologists have applied engineering concepts to biological systems in order to construct progressively more complex gene circuits capable of processing information in living cells. Here, we review the current state of computational genetic circuits and describe artificial gene circuits that perform digital and analog computation. We then discuss recent progress in designing gene networks that exhibit memory, and how memory and computation have been integrated to yield more complex systems that can both process and record information. Finally, we suggest new directions for engineering biological circuits capable of computation. Copyright © 2014 The Authors. Published by Elsevier Ltd. All rights reserved.
Thomas, Gregory S; Wann, L Samuel; Allam, Adel H; Thompson, Randall C; Michalik, David E; Sutherland, M Linda; Sutherland, James D; Lombardi, Guido P; Watson, Lucia; Cox, Samantha L; Valladolid, Clide M; Abd El-Maksoud, Gomaa; Al-Tohamy Soliman, Muhammad; Badr, Ibrahem; el-Halim Nur el-Din, Abd; Clarke, Emily M; Thomas, Ian G; Miyamoto, Michael I; Kaplan, Hillard S; Frohlich, Bruno; Narula, Jagat; Stewart, Alexandre F R; Zink, Albert; Finch, Caleb E
2014-06-01
Computed tomographic findings of atherosclerosis in the ancient cultures of Egypt, Peru, the American Southwest and the Aleutian Islands challenge our understanding of the fundamental causes of atherosclerosis. Could these findings be true? If so, what traditional risk factors might be present in these cultures that could explain this apparent paradox? The recent computed tomographic findings are consistent with multiple autopsy studies dating as far back as 1852 that demonstrate calcific atherosclerosis in ancient Egyptians and Peruvians. A nontraditional cause of atherosclerosis that could explain this burden of atherosclerosis is the microbial and parasitic inflammatory burden likely to be present in ancient cultures inherently lacking modern hygiene and antimicrobials. Patients with chronic systemic inflammatory diseases of today, including systemic lupus erythematosus, rheumatoid arthritis, and human immunodeficiency virus infection, experience premature atherosclerosis and coronary events. Might the chronic inflammatory load of ancient times secondary to infection have resulted in atherosclerosis? Smoke inhalation from the use of open fires for daily cooking and illumination represents another potential cause. Undiscovered risk factors could also have been present, potential causes that technologically cannot currently be measured in our serum or other tissue. A synthesis of these findings suggests that a gene-environmental interplay is causal for atherosclerosis. That is, humans have an inherent genetic susceptibility to atherosclerosis, whereas the speed and severity of its development are secondary to known and potentially unknown environmental factors. Copyright © 2014 World Heart Federation (Geneva). Published by Elsevier B.V. All rights reserved.
Kazemian, Majid; Zhu, Qiyun; Halfon, Marc S.; Sinha, Saurabh
2011-01-01
Despite recent advances in experimental approaches for identifying transcriptional cis-regulatory modules (CRMs, ‘enhancers’), direct empirical discovery of CRMs for all genes in all cell types and environmental conditions is likely to remain an elusive goal. Effective methods for computational CRM discovery are thus a critically needed complement to empirical approaches. However, existing computational methods that search for clusters of putative binding sites are ineffective if the relevant TFs and/or their binding specificities are unknown. Here, we provide a significantly improved method for ‘motif-blind’ CRM discovery that does not depend on knowledge or accurate prediction of TF-binding motifs and is effective when limited knowledge of functional CRMs is available to ‘supervise’ the search. We propose a new statistical method, based on ‘Interpolated Markov Models’, for motif-blind, genome-wide CRM discovery. It captures the statistical profile of variable length words in known CRMs of a regulatory network and finds candidate CRMs that match this profile. The method also uses orthologs of the known CRMs from closely related genomes. We perform in silico evaluation of predicted CRMs by assessing whether their neighboring genes are enriched for the expected expression patterns. This assessment uses a novel statistical test that extends the widely used Hypergeometric test of gene set enrichment to account for variability in intergenic lengths. We find that the new CRM prediction method is superior to existing methods. Finally, we experimentally validate 12 new CRM predictions by examining their regulatory activity in vivo in Drosophila; 10 of the tested CRMs were found to be functional, while 6 of the top 7 predictions showed the expected activity patterns. We make our program available as downloadable source code, and as a plugin for a genome browser installed on our servers. PMID:21821659
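For orientation, the standard hypergeometric enrichment test that this abstract says it extends can be sketched as below. This is the textbook one-sided test only; it does not include the authors' correction for intergenic length, and the function and parameter names are assumptions chosen for the example.

```python
from math import comb


def hypergeom_pvalue(population, successes, draws, observed):
    """One-sided enrichment p-value P(X >= observed).

    X is the number of 'success' genes (e.g. genes with the expected
    expression pattern) among `draws` genes sampled without replacement
    from `population` genes, of which `successes` are successes.
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return tail / total
```

In a CRM setting one might take the genes neighboring predicted CRMs as the draws and the genes with the expected expression pattern as the successes; a small p-value then indicates enrichment under the assumption that every gene is equally likely to be drawn, which is the assumption the abstract's extended test relaxes.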
Protein-Protein Interaction Network and Gene Ontology
NASA Astrophysics Data System (ADS)
Choi, Yunkyu; Kim, Seok; Yi, Gwan-Su; Park, Jinah
Evolution of computer technologies makes it possible to access a large amount and various kinds of biological data via the internet, such as DNA sequences, proteomics data and information discovered about them. It is expected that the combination of various data could help researchers find further knowledge about them. Roles of a visualization system are to invoke human abilities to integrate information and to recognize certain patterns in the data. Thus, when the various kinds of data are examined and analyzed manually, an effective visualization system is an essential part. One instance of these integrated visualizations is the combination of protein-protein interaction (PPI) data and Gene Ontology (GO), which could help enhance the analysis of PPI networks. We introduce a simple but comprehensive visualization system that integrates GO and PPI data, in which GO and PPI graphs are visualized side by side with quick reference functions between them. Furthermore, the proposed system provides several interactive visualization methods for efficiently analyzing the PPI network and the GO directed acyclic graph, such as context-based browsing and common ancestor finding.
Large-Scale Bi-Level Strain Design Approaches and Mixed-Integer Programming Solution Techniques
Kim, Joonhoon; Reed, Jennifer L.; Maravelias, Christos T.
2011-01-01
The use of computational models in metabolic engineering has been increasing as more genome-scale metabolic models and computational approaches become available. Various computational approaches have been developed to predict how genetic perturbations affect metabolic behavior at a systems level, and have been successfully used to engineer microbial strains with improved primary or secondary metabolite production. However, identification of metabolic engineering strategies involving a large number of perturbations is currently limited by computational resources due to the size of genome-scale models and the combinatorial nature of the problem. In this study, we present (i) two new bi-level strain design approaches using mixed-integer programming (MIP), and (ii) general solution techniques that improve the performance of MIP-based bi-level approaches. The first approach (SimOptStrain) simultaneously considers gene deletion and non-native reaction addition, while the second approach (BiMOMA) uses minimization of metabolic adjustment to predict knockout behavior in a MIP-based bi-level problem for the first time. Our general MIP solution techniques significantly reduced the CPU times needed to find optimal strategies when applied to an existing strain design approach (OptORF) (e.g., from ∼10 days to ∼5 minutes for metabolic engineering strategies with 4 gene deletions), and identified strategies for producing compounds where previous studies could not (e.g., malate and serine). Additionally, we found novel strategies using SimOptStrain with higher predicted production levels (for succinate and glycerol) than could have been found using an existing approach that considers network additions and deletions in sequential steps rather than simultaneously. Finally, using BiMOMA we found novel strategies involving large numbers of modifications (for pyruvate and glutamate), which sequential search and genetic algorithms were unable to find. 
The approaches and solution techniques developed here will facilitate the strain design process and extend the scope of its application to metabolic engineering. PMID:21949695
Lippert, Christoph; Xiang, Jing; Horta, Danilo; Widmer, Christian; Kadie, Carl; Heckerman, David; Listgarten, Jennifer
2014-01-01
Motivation: Set-based variance component tests have been identified as a way to increase power in association studies by aggregating weak individual effects. However, the choice of test statistic has been largely ignored even though it may play an important role in obtaining optimal power. We compared a standard statistical test (a score test) with a recently developed likelihood ratio (LR) test. Further, when correction for hidden structure is needed, or gene–gene interactions are sought, state-of-the-art algorithms for both the score and LR tests can be computationally impractical. Thus we develop new computationally efficient methods. Results: After reviewing theoretical differences in performance between the score and LR tests, we find empirically on real data that the LR test generally has more power. In particular, on 15 of 17 real datasets, the LR test yielded at least as many associations as the score test (up to 23 more associations), whereas the score test yielded at most one more association than the LR test in the two remaining datasets. On synthetic data, we find that the LR test yielded up to 12% more associations, consistent with our results on real data, but also observe a regime of extremely small signal where the score test yielded up to 25% more associations than the LR test, consistent with theory. Finally, our computational speedups now enable (i) efficient LR testing when the background kernel is full rank, and (ii) efficient score testing when the background kernel changes with each test, as for gene–gene interaction tests. The latter yielded a factor of 2000 speedup on a cohort of size 13 500. Availability: Software available at http://research.microsoft.com/en-us/um/redmond/projects/MSCompBio/Fastlmm/. Contact: heckerma@microsoft.com Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25075117
San Lucas, F Anthony; Fowler, Jerry; Chang, Kyle; Kopetz, Scott; Vilar, Eduardo; Scheet, Paul
2014-12-01
Large-scale cancer datasets such as The Cancer Genome Atlas (TCGA) allow researchers to profile tumors based on a wide range of clinical and molecular characteristics. Subsequently, TCGA-derived gene expression profiles can be analyzed with the Connectivity Map (CMap) to find candidate drugs to target tumors with specific clinical phenotypes or molecular characteristics. This represents a powerful computational approach for candidate drug identification, but due to the complexity of TCGA and technology differences between CMap and TCGA experiments, such analyses are challenging to conduct and reproduce. We present Cancer in silico Drug Discovery (CiDD; scheet.org/software), a computational drug discovery platform that addresses these challenges. CiDD integrates data from TCGA, CMap, and Cancer Cell Line Encyclopedia (CCLE) to perform computational drug discovery experiments, generating hypotheses for the following three general problems: (i) determining whether specific clinical phenotypes or molecular characteristics are associated with unique gene expression signatures; (ii) finding candidate drugs to repress these expression signatures; and (iii) identifying cell lines that resemble the tumors being studied for subsequent in vitro experiments. The primary input to CiDD is a clinical or molecular characteristic. The output is a biologically annotated list of candidate drugs and a list of cell lines for in vitro experimentation. We applied CiDD to identify candidate drugs to treat colorectal cancers harboring mutations in BRAF. CiDD identified EGFR and proteasome inhibitors, while proposing five cell lines for in vitro testing. CiDD facilitates phenotype-driven, systematic drug discovery based on clinical and molecular data from TCGA. ©2014 American Association for Cancer Research.
On splice site prediction using weight array models: a comparison of smoothing techniques
NASA Astrophysics Data System (ADS)
Taher, Leila; Meinicke, Peter; Morgenstern, Burkhard
2007-11-01
In most eukaryotic genes, protein-coding exons are separated by non-coding introns which are removed from the primary transcript by a process called "splicing". The positions where introns are cut and exons are spliced together are called "splice sites". Thus, computational prediction of splice sites is crucial for gene finding in eukaryotes. Weight array models are a powerful probabilistic approach to splice site detection. Parameters for these models are usually derived from m-tuple frequencies in trusted training data and subsequently smoothed to avoid zero probabilities. In this study we compare three different ways of parameter estimation for m-tuple frequencies, namely (a) non-smoothed probability estimation, (b) standard pseudo counts and (c) a Gaussian smoothing procedure that we recently developed.
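Options (a) and (b) from this abstract, non-smoothed estimation and standard pseudocounts, can be illustrated with a first-order weight array model, where each position has its own table of probabilities conditioned on the preceding base. The sketch below is an assumption-laden illustration: the function name, the restriction to first order (m = 2) and the uniform pseudocount are choices made here, and the authors' Gaussian smoothing procedure is not reproduced.

```python
from collections import Counter

ALPHABET = "ACGT"


def wam_conditionals(sequences, pseudocount=1.0):
    """Estimate P(base at i | base at i-1) for each position i >= 1.

    `sequences` are aligned, equal-length training sites. With
    pseudocount=0 the estimate is non-smoothed, so unseen dinucleotides
    get probability 0; a positive pseudocount avoids zero probabilities.
    Returns one dict per position, keyed by (previous_base, base).
    """
    length = len(sequences[0])
    tables = []
    for i in range(1, length):
        pair_counts = Counter((s[i - 1], s[i]) for s in sequences)
        table = {}
        for prev in ALPHABET:
            total = sum(pair_counts[(prev, b)] for b in ALPHABET)
            total += pseudocount * len(ALPHABET)
            for b in ALPHABET:
                if total == 0:
                    # Context never observed and no smoothing requested.
                    table[(prev, b)] = 0.0
                else:
                    table[(prev, b)] = (pair_counts[(prev, b)] + pseudocount) / total
        tables.append(table)
    return tables
```

A candidate site would then be scored by multiplying (or summing the logs of) the position-specific conditional probabilities along its length, which is where zero probabilities from non-smoothed estimates become a problem.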
NASA Astrophysics Data System (ADS)
Pawełkowicz, Magdalena E.; Wojcieszek, Michał; Osipowski, Paweł; Krzywkowski, Tomasz; Pląder, Wojciech; Przybecki, Zbigniew
2016-09-01
Two Arabidopsis thaliana genes from the PP2C family of protein phosphatases (AtABI1 and AtABI2) were used to find orthologous genes in the Cucumis sativus L. cv. Borszczagowski (cucumber) genome. Cucumber has been used as a model plant for sex expression studies because although it has been defined as a monoecious species, numerous genotypes are known to produce only female, only male, or hermaphroditic flowers. We identified two new orthologous genes of AtABI1 and AtABI2 in the cucumber genome and named them CsABI1 and CsABI2. To determine the relationships between the regulation of CsABI1 and CsABI2 and flower morphogenesis in cucumber, we performed various computational analyses to define the structure of the genes, and to predict regulatory elements and protein motifs in their sequences. We also performed an expression analysis to identify differences in the expression levels of CsABI1 and CsABI2 in vegetative and generative tissues (leaf, shoot apex, and flower buds) of monoecious (B10) and gynoecious (2gg) cucumber lines. We found that the expressions of CsABI1 and CsABI2 differed in male and female floral buds, and correlated these findings with the abscisic acid signaling pathways in male and female flowers.
Prioritizing GWAS Results: A Review of Statistical Methods and Recommendations for Their Application
Cantor, Rita M.; Lange, Kenneth; Sinsheimer, Janet S.
2010-01-01
Genome-wide association studies (GWAS) have rapidly become a standard method for disease gene discovery. A substantial number of recent GWAS indicate that for most disorders, only a few common variants are implicated and the associated SNPs explain only a small fraction of the genetic risk. This review is written from the viewpoint that findings from the GWAS provide preliminary genetic information that is available for additional analysis by statistical procedures that accumulate evidence, and that these secondary analyses are very likely to provide valuable information that will help prioritize the strongest constellations of results. We review and discuss three analytic methods to combine preliminary GWAS statistics to identify genes, alleles, and pathways for deeper investigations. Meta-analysis seeks to pool information from multiple GWAS to increase the chances of finding true positives among the false positives and provides a way to combine associations across GWAS, even when the original data are unavailable. Testing for epistasis within a single GWAS study can identify the stronger results that are revealed when genes interact. Pathway analysis of GWAS results is used to prioritize genes and pathways within a biological context. Following a GWAS, association results can be assigned to pathways and tested in aggregate with computational tools and pathway databases. Reviews of published methods with recommendations for their application are provided within the framework for each approach. PMID:20074509
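One common way to pool associations from summary statistics alone is fixed-effect, inverse-variance-weighted meta-analysis. The sketch below shows that calculation for a single variant; it is offered as a generic illustration and is not taken from the review, and the function name and inputs (per-study effect estimates and standard errors) are assumptions.

```python
from math import sqrt


def fixed_effect_meta(betas, std_errors):
    """Inverse-variance-weighted fixed-effect meta-analysis for one variant.

    betas: per-study effect estimates (e.g. log odds ratios).
    std_errors: the matching standard errors.
    Returns (pooled_beta, pooled_se, z_score).
    """
    # Each study is weighted by the inverse of its variance.
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, betas)) / total
    pooled_se = sqrt(1.0 / total)
    return pooled, pooled_se, pooled / pooled_se
```

Because only effect estimates and standard errors are needed, this kind of pooling works even when the individual-level genotype data are unavailable.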
Xander: employing a novel method for efficient gene-targeted metagenomic assembly.
Wang, Qiong; Fish, Jordan A; Gilman, Mariah; Sun, Yanni; Brown, C Titus; Tiedje, James M; Cole, James R
2015-01-01
Metagenomics can provide important insight into microbial communities. However, assembling metagenomic datasets has proven to be computationally challenging. Current methods often assemble only fragmented partial genes. We present a novel method for targeting assembly of specific protein-coding genes. This method combines a de Bruijn graph, as used in standard assembly approaches, and a protein profile hidden Markov model (HMM) for the gene of interest, as used in standard annotation approaches. These are used to create a novel combined weighted assembly graph. Xander performs both assembly and annotation concomitantly using information incorporated in this graph. We demonstrate the utility of this approach by assembling contigs for one phylogenetic marker gene and for two functional marker genes, first on Human Microbiome Project (HMP)-defined community Illumina data and then on 21 rhizosphere soil metagenomic datasets from three different crops totaling over 800 Gbp of unassembled data. We compared our method to a recently published bulk metagenome assembly method and a recently published gene-targeted assembler and found our method produced more, longer, and higher quality gene sequences. Xander combines gene assignment with the rapid assembly of full-length or near full-length functional genes from metagenomic data without requiring bulk assembly or post-processing to find genes of interest. HMMs used for assembly can be tailored to the targeted genes, allowing flexibility to improve annotation over generic annotation pipelines. This method is implemented as open source software and is available at https://github.com/rdpstaff/Xander_assembler.
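The de Bruijn graph component mentioned above can be sketched in a few lines. This is a plain, unweighted graph with a naive path walk, written only to show the data structure; it does not include the profile HMM weighting that Xander combines with the graph, and the function names are hypothetical.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def walk(graph, start):
    """Follow unambiguous edges from `start` and return the spelled contig.

    Stops at a dead end, at a branch (more than one successor), or when
    the path would revisit a node.
    """
    contig, node, seen = start, start, {start}
    while node in graph and len(graph[node]) == 1:
        (nxt,) = graph[node]
        if nxt in seen:
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig
```

In a gene-targeted setting, the branch points where this naive walk stops are exactly where an additional signal, such as a profile HMM score for the gene of interest, could be used to choose among successors.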
Chang, Tzu-Hao; Wu, Shih-Lin; Wang, Wei-Jen; Horng, Jorng-Tzong; Chang, Cheng-Wei
2014-01-01
Microarrays are widely used to assess gene expressions. Most microarray studies focus primarily on identifying differential gene expressions between conditions (e.g., cancer versus normal cells), for discovering the major factors that cause diseases. Because previous studies have not identified the correlations of differential gene expression between conditions, crucial but abnormal regulations that cause diseases might have been disregarded. This paper proposes an approach for discovering the condition-specific correlations of gene expressions within biological pathways. Because analyzing gene expression correlations is time consuming, an Apache Hadoop cloud computing platform was implemented. Three microarray data sets of breast cancer were collected from the Gene Expression Omnibus, and pathway information from the Kyoto Encyclopedia of Genes and Genomes was applied for discovering meaningful biological correlations. The results showed that adopting the Hadoop platform considerably decreased the computation time. Several correlations of differential gene expressions were discovered between the relapse and nonrelapse breast cancer samples, and most of them were involved in cancer regulation and cancer-related pathways. The results showed that breast cancer recurrence might be highly associated with the abnormal regulations of these gene pairs, rather than with their individual expression levels. The proposed method was computationally efficient and reliable, and stable results were obtained when different data sets were used. The proposed method is effective in identifying meaningful biological regulation patterns between conditions.
Applications of statistical physics and information theory to the analysis of DNA sequences
NASA Astrophysics Data System (ADS)
Grosse, Ivo
2000-10-01
DNA carries the genetic information of most living organisms, and the goal of genome projects is to uncover that genetic information. One basic task in the analysis of DNA sequences is the recognition of protein coding genes. Powerful computer programs for gene recognition have been developed, but most of them are based on statistical patterns that vary from species to species. In this thesis I address the question of whether there exist universal statistical patterns that are different in coding and noncoding DNA of all living species, regardless of their phylogenetic origin. In search of such species-independent patterns I study the mutual information function of genomic DNA sequences, and find that it shows persistent period-three oscillations. To understand the biological origin of the observed period-three oscillations, I compare the mutual information function of genomic DNA sequences to the mutual information function of stochastic model sequences. I find that the pseudo-exon model is able to reproduce the mutual information function of genomic DNA sequences. Moreover, I find that a generalization of the pseudo-exon model can connect the existence and the functional form of long-range correlations to the presence and the length distributions of coding and noncoding regions. Based on these theoretical studies I am able to find an information-theoretical quantity, the average mutual information (AMI), whose probability distributions are significantly different in coding and noncoding DNA, while they are almost identical in all studied species. These findings show that there exist universal statistical patterns that are different in coding and noncoding DNA of all studied species, and they suggest that the AMI may be used to identify genes in different living species, irrespective of their taxonomic origin.
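The mutual information function mentioned above can be sketched in a few lines of Python. This is a minimal sketch under the standard definition of mutual information between symbols separated by a fixed distance k, estimated from observed pair frequencies; the function name `mutual_information` is an assumption, and the exact estimator and the AMI construction used in the thesis are not reproduced here.

```python
import math
from collections import Counter


def mutual_information(seq, k):
    """Mutual information (in bits) between symbols k positions apart.

    Probabilities are estimated from raw pair counts; no finite-size
    correction is applied (an illustrative simplification).
    """
    pairs = [(seq[i], seq[i + k]) for i in range(len(seq) - k)]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), written in terms of the raw counts
        mi += p_ab * math.log2(count * n / (left[a] * right[b]))
    return mi


# Evaluating the function over a range of distances gives I(k).
profile = [mutual_information("ACG" * 50, k) for k in range(1, 10)]
```

Plotting such a profile against k for a genomic sequence is one way to look for the kind of periodic structure the abstract describes.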
SiGN: large-scale gene network estimation environment for high performance computing.
Tamada, Yoshinori; Shimamura, Teppei; Yamaguchi, Rui; Imoto, Seiya; Nagasaki, Masao; Miyano, Satoru
2011-01-01
Our research group is currently developing software for estimating large-scale gene networks from gene expression data. The software, called SiGN, is specifically designed for the Japanese flagship supercomputer "K computer" which is planned to achieve 10 petaflops in 2012, and other high performance computing environments including Human Genome Center (HGC) supercomputer system. SiGN is a collection of gene network estimation software with three different sub-programs: SiGN-BN, SiGN-SSM and SiGN-L1. In these three programs, five different models are available: static and dynamic nonparametric Bayesian networks, state space models, graphical Gaussian models, and vector autoregressive models. All these models require a huge amount of computational resources for estimating large-scale gene networks and therefore are designed to be able to exploit the speed of 10 petaflops. The software will be available freely for "K computer" and HGC supercomputer system users. The estimated networks can be viewed and analyzed by Cell Illustrator Online and SBiP (Systems Biology integrative Pipeline). The software project web site is available at http://sign.hgc.jp/ .
Reveal genes functionally associated with ACADS by a network study.
Chen, Yulong; Su, Zhiguang
2015-09-15
Establishing a systematic network is aimed at finding essential human gene-gene/gene-disease pathways by means of network inter-connecting patterns and functional annotation analysis. In the present study, we have analyzed functional gene interactions of short-chain acyl-coenzyme A dehydrogenase gene (ACADS). ACADS plays a vital role in free fatty acid β-oxidation and regulates energy homeostasis. Modules of highly inter-connected genes in disease-specific ACADS network are derived by integrating gene function and protein interaction data. Among the 8 genes in ACADS web retrieved from both STRING and GeneMANIA, ACADS is effectively conjoined with 4 genes including HADHA, HADHB, ECHS1 and ACAT1. The functional analysis is done via ontological briefing and candidate disease identification. We observed that the highly efficient-interlinked genes connected with ACADS are HADHA, HADHB, ECHS1 and ACAT1. Interestingly, the ontological aspect of genes in the ACADS network reveals that ACADS, HADHA and HADHB play equally vital roles in fatty acid metabolism. The gene ACAT1 together with ACADS participates in ketone metabolism. Our computational gene web analysis also predicts potential candidate disease recognition, thus indicating the involvement of ACADS, HADHA, HADHB, ECHS1 and ACAT1 not only with lipid metabolism but also with infant death syndrome, skeletal myopathy, acute hepatic encephalopathy, Reye-like syndrome, episodic ketosis, and metabolic acidosis. The current study presents a comprehensible layout of ACADS network, its functional strategies and candidate disease approach associated with ACADS network. Copyright © 2015 Elsevier B.V. All rights reserved.
Lee, Soohyun; Seo, Chae Hwa; Alver, Burak Han; Lee, Sanghyuk; Park, Peter J
2015-09-03
RNA-seq has been widely used for genome-wide expression profiling. RNA-seq data typically consists of tens of millions of short sequenced reads from different transcripts. However, due to sequence similarity among genes and among isoforms, the source of a given read is often ambiguous. Existing approaches for estimating expression levels from RNA-seq reads tend to compromise between accuracy and computational cost. We introduce a new approach for quantifying transcript abundance from RNA-seq data. EMSAR (Estimation by Mappability-based Segmentation And Reclustering) groups reads according to the set of transcripts to which they are mapped and finds maximum likelihood estimates using a joint Poisson model for each optimal set of segments of transcripts. The method uses nearly all mapped reads, including those mapped to multiple genes. With an efficient transcriptome indexing based on modified suffix arrays, EMSAR minimizes the use of CPU time and memory while achieving accuracy comparable to the best existing methods. EMSAR is a method for quantifying transcripts from RNA-seq data with high accuracy and low computational cost. EMSAR is available at https://github.com/parklab/emsar.
Multiclass classification of microarray data samples with a reduced number of genes
2011-01-01
Background Multiclass classification of microarray data samples with a reduced number of genes is a rich and challenging problem in Bioinformatics research. The problem gets harder as the number of classes is increased. In addition, the performance of most classifiers is tightly linked to the effectiveness of mandatory gene selection methods. Critical to gene selection is the availability of estimates about the maximum number of genes that can be handled by any classification algorithm. Lack of such estimates may lead to either computationally demanding explorations of a search space with thousands of dimensions or classification models based on gene sets of unrestricted size. In the former case, unbiased but possibly overfitted classification models may arise. In the latter case, biased classification models unable to support statistically significant findings may be obtained. Results A novel bound on the maximum number of genes that can be handled by binary classifiers in binary mediated multiclass classification algorithms of microarray data samples is presented. The bound suggests that high-dimensional binary output domains might favor the existence of accurate and sparse binary mediated multiclass classifiers for microarray data samples. Conclusions A comprehensive experimental work shows that the bound is indeed useful to induce accurate and sparse multiclass classifiers for microarray data samples. PMID:21342522
Functional Abstraction as a Method to Discover Knowledge in Gene Ontologies
Ultsch, Alfred; Lötsch, Jörn
2014-01-01
Computational analyses of functions of gene sets obtained in microarray analyses or by topical database searches are increasingly important in biology. To understand their functions, the sets are usually mapped to Gene Ontology knowledge bases by means of over-representation analysis (ORA). Its result represents the specific knowledge of the functionality of the gene set. However, the specific ontology typically consists of many terms and relationships, hindering the understanding of the ‘main story’. We developed a methodology to identify a comprehensibly small number of GO terms as “headlines” of the specific ontology, allowing one to understand all central aspects of the roles of the involved genes. The Functional Abstraction method finds a set of headlines that is specific enough to cover all details of a specific ontology and is abstract enough for human comprehension. This method exceeds the classical approaches at ORA abstraction and, by focusing on information rather than decorrelation of GO terms, it directly targets human comprehension. Functional abstraction provides, with a maximum of certainty, information value, coverage and conciseness, a representation of the biological functions in which a gene set plays a role. This is the necessary means to interpret complex Gene Ontology results, thus strengthening the role of functional genomics in biomarker and drug discovery. PMID:24587272
A community computational challenge to predict the activity of pairs of compounds.
Bansal, Mukesh; Yang, Jichen; Karan, Charles; Menden, Michael P; Costello, James C; Tang, Hao; Xiao, Guanghua; Li, Yajuan; Allen, Jeffrey; Zhong, Rui; Chen, Beibei; Kim, Minsoo; Wang, Tao; Heiser, Laura M; Realubit, Ronald; Mattioli, Michela; Alvarez, Mariano J; Shen, Yao; Gallahan, Daniel; Singer, Dinah; Saez-Rodriguez, Julio; Xie, Yang; Stolovitzky, Gustavo; Califano, Andrea
2014-12-01
Recent therapeutic successes have renewed interest in drug combinations, but experimental screening approaches are costly and often identify only small numbers of synergistic combinations. The DREAM consortium launched an open challenge to foster the development of in silico methods to computationally rank 91 compound pairs, from the most synergistic to the most antagonistic, based on gene-expression profiles of human B cells treated with individual compounds at multiple time points and concentrations. Using scoring metrics based on experimental dose-response curves, we assessed 32 methods (31 community-generated approaches and SynGen), four of which performed significantly better than random guessing. We highlight similarities between the methods. Although the accuracy of predictions was not optimal, we find that computational prediction of compound-pair activity is possible, and that community challenges can be useful to advance the field of in silico compound-synergy prediction.
An integrative data mining approach to identifying adverse outcome pathway signatures.
Oki, Noffisat O; Edwards, Stephen W
2016-03-28
The Adverse Outcome Pathway (AOP) framework is a tool for making biological connections and summarizing key information across different levels of biological organization to connect biological perturbations at the molecular level to adverse outcomes for an individual or population. Computational approaches to explore and determine these connections can accelerate the assembly of AOPs. By leveraging the wealth of publicly available data covering chemical effects on biological systems, computationally-predicted AOPs (cpAOPs) were assembled via data mining of high-throughput screening (HTS) in vitro data, in vivo data and other disease phenotype information. Frequent Itemset Mining (FIM) was used to find associations between the gene targets of ToxCast HTS assays and disease data from Comparative Toxicogenomics Database (CTD) by using the chemicals as the common aggregators between datasets. The method was also used to map gene expression data to disease data from CTD. A cpAOP network was defined by considering genes and diseases as nodes and FIM associations as edges. This network contained 18,283 gene to disease associations for the ToxCast data and 110,253 for CTD gene expression. Two case studies show the value of the cpAOP network by extracting subnetworks focused either on fatty liver disease or the Aryl Hydrocarbon Receptor (AHR). The subnetwork surrounding fatty liver disease included many genes known to play a role in this disease. When querying the cpAOP network with the AHR gene, an interesting subnetwork including glaucoma was identified. While substantial literature exists to support the potential for AHR ligands to elicit glaucoma, it was not explicitly captured in the public annotation information in CTD. The subnetwork from this analysis suggests a cpAOP that includes changes in CYP1B1 expression, which has been previously established in the literature as a primary cause of glaucoma. 
These case studies highlight the value in integrating multiple data sources when defining cpAOPs for HTS data. Copyright © 2016. Published by Elsevier Ireland Ltd.
A universal genomic coordinate translator for comparative genomics
2014-01-01
Background Genomic duplications constitute major events in the evolution of species, allowing paralogous copies of genes to take on fine-tuned biological roles. Unambiguously identifying the orthology relationship between copies across multiple genomes can be resolved by synteny, i.e. the conserved order of genomic sequences. However, a comprehensive analysis of duplication events and their contributions to evolution would require all-to-all genome alignments, which increases at N^2 with the number of available genomes, N. Results Here, we introduce Kraken, software that omits the all-to-all requirement by recursively traversing a graph of pairwise alignments and dynamically re-computing orthology. Kraken scales linearly with the number of targeted genomes, N, which allows for including large numbers of genomes in analyses. We first evaluated the method on the set of 12 Drosophila genomes, finding that orthologous correspondence computed indirectly through a graph of multiple synteny maps comes at minimal cost in terms of sensitivity, but reduces overall computational runtime by an order of magnitude. We then used the method on three well-annotated mammalian genomes, human, mouse, and rat, and show that up to 93% of protein coding transcripts have unambiguous pairwise orthologous relationships across the genomes. On a nucleotide level, 70 to 83% of exons match exactly at both splice junctions, and up to 97% on at least one junction. We last applied Kraken to an RNA-sequencing dataset from multiple vertebrates and diverse tissues, where we confirmed that brain-specific gene family members, i.e. one-to-many or many-to-many homologs, are more highly correlated across species than single-copy (i.e. one-to-one homologous) genes. 
Not limited to protein coding genes, Kraken also identifies thousands of newly identified transcribed loci, likely non-coding RNAs that are consistently transcribed in human, chimpanzee and gorilla, and maintain significant correlation of expression levels across species. Conclusions Kraken is a computational genome coordinate translator that facilitates cross-species comparisons, distinguishes orthologs from paralogs, and does not require costly all-to-all whole genome mappings. Kraken is freely available under LGPL from http://github.com/nedaz/kraken. PMID:24976580
A universal genomic coordinate translator for comparative genomics.
Zamani, Neda; Sundström, Görel; Meadows, Jennifer R S; Höppner, Marc P; Dainat, Jacques; Lantz, Henrik; Haas, Brian J; Grabherr, Manfred G
2014-06-30
Genomic duplications constitute major events in the evolution of species, allowing paralogous copies of genes to take on fine-tuned biological roles. Unambiguously identifying the orthology relationship between copies across multiple genomes can be resolved by synteny, i.e. the conserved order of genomic sequences. However, a comprehensive analysis of duplication events and their contributions to evolution would require all-to-all genome alignments, which increases at N^2 with the number of available genomes, N. Here, we introduce Kraken, software that omits the all-to-all requirement by recursively traversing a graph of pairwise alignments and dynamically re-computing orthology. Kraken scales linearly with the number of targeted genomes, N, which allows for including large numbers of genomes in analyses. We first evaluated the method on the set of 12 Drosophila genomes, finding that orthologous correspondence computed indirectly through a graph of multiple synteny maps comes at minimal cost in terms of sensitivity, but reduces overall computational runtime by an order of magnitude. We then used the method on three well-annotated mammalian genomes, human, mouse, and rat, and show that up to 93% of protein coding transcripts have unambiguous pairwise orthologous relationships across the genomes. On a nucleotide level, 70 to 83% of exons match exactly at both splice junctions, and up to 97% on at least one junction. We last applied Kraken to an RNA-sequencing dataset from multiple vertebrates and diverse tissues, where we confirmed that brain-specific gene family members, i.e. one-to-many or many-to-many homologs, are more highly correlated across species than single-copy (i.e. one-to-one homologous) genes. 
Not limited to protein coding genes, Kraken also identifies thousands of newly identified transcribed loci, likely non-coding RNAs that are consistently transcribed in human, chimpanzee and gorilla, and maintain significant correlation of expression levels across species. Kraken is a computational genome coordinate translator that facilitates cross-species comparisons, distinguishes orthologs from paralogs, and does not require costly all-to-all whole genome mappings. Kraken is freely available under LGPL from http://github.com/nedaz/kraken.
Antonell, Anna; Lladó, Albert; Sánchez-Valle, Raquel; Sanfeliu, Coral; Casserras, Teresa; Rami, Lorena; Muñoz-García, Cristina; Dangla-Valls, Adrià; Balasa, Mircea; Boya, Patricia; Kalko, Susana G; Molinuevo, José Luis
2016-11-01
Alzheimer's disease (AD) is the most common of the neurodegenerative diseases. Recent diagnostic criteria have defined a preclinical disease phase during which neuropathological substrates are thought to be present in the brain. There is an urgent need to find measurable alterations in this phase as well as a good peripheral biomarker in the blood. We selected a cohort of 100 subjects (controls = 47; preclinical AD = 11; patients with AD = 42) and analyzed whole blood expression of 20 genes by quantitative polymerase chain reaction. The selected genes belonged to calcium signaling, senescence and autophagy, and mitochondria/oxidative stress pathways. Additionally, two genes associated with an increased risk of developing AD (clusterin (CLU) and bridging integrator 1 (BIN1)) were also analyzed. We detected significantly different gene expressions of BECN1 and PRKCB between the control and the AD groups and of CDKN2A between the control and the preclinical AD groups. Notably, these three genes are also considered tumor suppressor (CDKN2A and BECN1) or tumor promoter (PRKCB) genes. Gene-gene expression Pearson correlations were computed separately for controls and patients with AD. The significant correlations (p < 0.001) were represented in a network analysis with the Cytoscape tool, which suggested an uncoupling of mitochondria-related genes in the AD group. Whole blood is emerging as a valuable tissue in the study of the physiopathology of AD.
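The step of computing gene-gene Pearson correlations separately for each group can be sketched as follows. This is a minimal Python sketch, not the authors' pipeline: the function names, the gene labels `G1` and `G2`, and the toy values are all made up for illustration, and the significance filtering (p < 0.001) described in the abstract is omitted.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # convention chosen here for constant vectors
    return cov / (sx * sy)


def groupwise_correlations(expression, groups):
    """Correlate every gene pair within each group of samples.

    expression: {gene: [value per sample]}; groups: [label per sample].
    Returns {(group, gene_a, gene_b): r}.
    """
    result = {}
    for label in sorted(set(groups)):
        idx = [i for i, g in enumerate(groups) if g == label]
        for a, b in combinations(sorted(expression), 2):
            xa = [expression[a][i] for i in idx]
            xb = [expression[b][i] for i in idx]
            result[(label, a, b)] = pearson(xa, xb)
    return result


# Toy data (made up): the pair is coupled in one group and anti-coupled in the other.
toy_expr = {"G1": [1, 2, 3, 1, 2, 3], "G2": [2, 4, 6, 3, 2, 1]}
toy_groups = ["control"] * 3 + ["AD"] * 3
toy_result = groupwise_correlations(toy_expr, toy_groups)
```

Gene pairs whose correlation differs strongly between groups would be candidates for the kind of uncoupling the abstract reports; edges passing a significance threshold could then be exported to a network viewer.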
Hao, Weilong; Palmer, Jeffrey D
2009-09-29
The mitochondrial genomes of flowering plants possess a promiscuous proclivity for taking up sequences from the chloroplast genome. All characterized chloroplast integrants exist apart from native mitochondrial genes, and only a few, involving chloroplast tRNA genes that have functionally supplanted their mitochondrial counterparts, appear to be of functional consequence. We developed a novel computational approach to search for homologous recombination (gene conversion) in a large number of sequences and applied it to 22 mitochondrial and chloroplast gene pairs, which last shared common ancestry some 2 billion years ago. We found evidence of recurrent conversion of short patches of mitochondrial genes by chloroplast homologs during angiosperm evolution, but no evidence of gene conversion in the opposite direction. All 9 putative conversion events involve the atp1/atpA gene encoding the alpha subunit of ATP synthase, which is unusually well conserved between the 2 organelles and the only shared gene that is widely sequenced across plant mitochondria. Moreover, all conversions were limited to the 2 regions of greatest nucleotide and amino acid conservation of atp1/atpA. These observations probably reflect constraints operating on both the occurrence and fixation of recombination between ancient homologs. These findings indicate that recombination between anciently related sequences is more frequent than previously appreciated and creates functional mitochondrial genes of chimeric origin. These results also have implications for the widespread use of mitochondrial atp1 in phylogeny reconstruction.
Gorlin-Goltz syndrome: incidental finding on routine CT scan following car accident
2009-01-01
Introduction Gorlin-Goltz syndrome is a rare hereditary disease. Pathogenesis of the syndrome is attributed to abnormalities in the long arm of chromosome 9 (q22.3-q31) and loss or mutations of human patched gene (PTCH1 gene). Multiple basal cell carcinomas (BCCs), odontogenic keratocysts, skeletal abnormalities, hyperkeratosis of palms and soles, intracranial ectopic calcifications of the falx cerebri and facial dysmorphism are considered the main clinical features. Diagnosis is based upon established major and minor clinical and radiological criteria and ideally confirmed by DNA analysis. Because of the different systems affected, a multidisciplinary team of various experts is required for successful management. Case presentation We report the case of a 19-year-old female who was involved in a car accident and was found to have imaging findings of Gorlin-Goltz syndrome on a routine whole-body computed tomography (CT) scan performed to exclude traumatic injuries. Conclusion Radiologic findings of the syndrome are easily identifiable on CT scans and may prompt early verification of the disease, which is very important for regular follow-up and better survival rates from the co-existent diseases. PMID:20062724
Airoldi, Edoardo M.; Miller, Darach; Athanasiadou, Rodoniki; Brandt, Nathan; Abdul-Rahman, Farah; Neymotin, Benjamin; Hashimoto, Tatsu; Bahmani, Tayebeh; Gresham, David
2016-01-01
Cell growth rate is regulated in response to the abundance and molecular form of essential nutrients. In Saccharomyces cerevisiae (budding yeast), the molecular form of environmental nitrogen is a major determinant of cell growth rate, supporting growth rates that vary at least threefold. Transcriptional control of nitrogen use is mediated in large part by nitrogen catabolite repression (NCR), which results in the repression of specific transcripts in the presence of a preferred nitrogen source that supports a fast growth rate, such as glutamine, that are otherwise expressed in the presence of a nonpreferred nitrogen source, such as proline, which supports a slower growth rate. Differential expression of the NCR regulon and additional nitrogen-responsive genes results in >500 transcripts that are differentially expressed in cells growing in the presence of different nitrogen sources in batch cultures. Here we find that in growth rate–controlled cultures using nitrogen-limited chemostats, gene expression programs are strikingly similar regardless of nitrogen source. NCR expression is derepressed in all nitrogen-limiting chemostat conditions regardless of nitrogen source, and in these conditions, only 34 transcripts exhibit nitrogen source–specific differential gene expression. Addition of either the preferred nitrogen source, glutamine, or the nonpreferred nitrogen source, proline, to cells growing in nitrogen-limited chemostats results in rapid, dose-dependent repression of the NCR regulon. Using a novel means of computational normalization to compare global gene expression programs in steady-state and dynamic conditions, we find evidence that the addition of nitrogen to nitrogen-limited cells results in the transient overproduction of transcripts required for protein translation. 
Simultaneously, we find that accelerated mRNA degradation underlies the rapid clearing of a subset of transcripts, which is most pronounced for the highly expressed NCR-regulated permease genes GAP1, MEP2, DAL5, PUT4, and DIP5. Our results reveal novel aspects of nitrogen-regulated gene expression and highlight the need for a quantitative approach to study how the cell coordinates protein translation and nitrogen assimilation to optimize cell growth in different environments. PMID:26941329
Pathway-based variant enrichment analysis on the example of dilated cardiomyopathy.
Backes, Christina; Meder, Benjamin; Lai, Alan; Stoll, Monika; Rühle, Frank; Katus, Hugo A; Keller, Andreas
2016-01-01
Genome-wide association (GWA) studies have significantly contributed to the understanding of human genetic variation and its impact on clinical traits. Frequently only a limited number of highly significant associations were considered as biologically relevant. Increasingly, network analysis of affected genes is used to explore the potential role of the genetic background on disease mechanisms. Instead of first determining affected genes or calculating scores for genes and performing pathway analysis on the gene level, we integrated both steps and directly calculated enrichment on the genetic variant level. The respective approach has been tested on dilated cardiomyopathy (DCM) GWA data as showcase. To compute significance values, 5000 permutation tests were carried out and p values were adjusted for multiple testing. For 282 KEGG pathways, we computed variant enrichment scores and significance values. Of these, 65 were significant. Surprisingly, we discovered the "nucleotide excision repair" and "tuberculosis" pathways to be most significantly associated with DCM (p = 10^-9). The latter pathway is driven by genes of the HLA-D antigen group, a finding that closely resembles previous discoveries made by expression quantitative trait locus analysis in the context of DCM-GWA. Next, we implemented a sub-network-based analysis, which searches for affected parts of KEGG independently of the pre-defined pathways. Here, proteins of the contractile apparatus of cardiac cells as well as the FAS sub-network were found to be affected by common polymorphisms in DCM. In this work, we performed enrichment analysis directly on variants, leveraging the potential to discover biological information in thousands of published GWA studies. The applied approach is cutoff free and considers a ranked list of genetic variants as input.
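A permutation-based significance estimate of the general kind described above can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the authors' method: the enrichment score used here (mean rank of the pathway's variants in the ranked list) is a stand-in chosen for simplicity, the function name and variant identifiers are hypothetical, and the multiple-testing adjustment mentioned in the abstract is not included.

```python
import random


def permutation_p_value(ranked_variants, pathway_variants, n_perm=5000, seed=0):
    """Empirical p-value that a pathway's variants rank unusually high.

    ranked_variants: variant IDs ordered from most to least significant.
    pathway_variants: variant IDs assigned to the pathway.
    The score is the mean rank position (lower means more enriched).
    """
    rank = {v: i for i, v in enumerate(ranked_variants)}
    members = [v for v in pathway_variants if v in rank]
    if not members:
        return 1.0
    observed = sum(rank[v] for v in members) / len(members)
    rng = random.Random(seed)
    positions = range(len(ranked_variants))
    hits = 0
    for _ in range(n_perm):
        # Draw a random variant set of the same size as the pathway.
        sample = rng.sample(positions, len(members))
        if sum(sample) / len(sample) <= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Toy ranked list (made up) with a pathway concentrated at the top.
toy_ranked = ["v%d" % i for i in range(100)]
toy_p = permutation_p_value(toy_ranked, toy_ranked[:5])
```

Because the input is a full ranked list rather than a thresholded set, a score of this kind needs no significance cutoff, which matches the cutoff-free property the abstract describes. With 5000 permutations the smallest attainable estimate is 1/5001, so smaller reported p values would require a different estimator.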
Pechenick, Dov A.; Payne, Joshua L.; Moore, Jason H.
2011-01-01
Gene regulatory networks (GRNs) drive the cellular processes that sustain life. To do so reliably, GRNs must be robust to perturbations, such as gene deletion and the addition or removal of regulatory interactions. GRNs must also be robust to genetic changes in regulatory regions that define the logic of signal-integration, as these changes can affect how specific combinations of regulatory signals are mapped to particular gene expression states. Previous theoretical analyses have demonstrated that the robustness of a GRN is influenced by its underlying topological properties, such as degree distribution and modularity. Another important topological property is assortativity, which measures the propensity with which nodes of similar connectivity are connected to one another. How assortativity influences the robustness of the signal-integration logic of GRNs remains an open question. Here, we use computational models of GRNs to investigate this relationship. We separately consider each of the three dynamical regimes of this model for a variety of degree distributions. We find that in the chaotic regime, robustness exhibits a pronounced increase as assortativity becomes more positive, while in the critical and ordered regimes, robustness is generally less sensitive to changes in assortativity. We attribute the increased robustness to a decrease in the duration of the gene expression pattern, which is caused by a reduction in the average size of a GRN’s in-components. This study provides the first direct evidence that assortativity influences the robustness of the signal-integration logic of computational models of GRNs, illuminates a mechanistic explanation for this influence, and furthers our understanding of the relationship between topology and robustness in complex biological systems. PMID:22155134
Andersson, Claes R; Hvidsten, Torgeir R; Isaksson, Anders; Gustafsson, Mats G; Komorowski, Jan
2007-01-01
Background We address the issue of explaining the presence or absence of phase-specific transcription in budding yeast cultures under different conditions. To this end we use a model-based detector of gene expression periodicity to divide genes into classes depending on their behavior in experiments using different synchronization methods. While computational inference of gene regulatory circuits typically relies on expression similarity (clustering) in order to find classes of potentially co-regulated genes, this method instead takes advantage of known time profile signatures related to the studied process. Results We explain the regulatory mechanisms of the inferred periodic classes with cis-regulatory descriptors that combine upstream sequence motifs with experimentally determined binding of transcription factors. By systematic statistical analysis we show that periodic classes are best explained by combinations of descriptors rather than single descriptors, and that different combinations correspond to periodic expression in different classes. We also find evidence for additive regulation in that the combinations of cis-regulatory descriptors associated with genes periodically expressed in fewer conditions are frequently subsets of combinations associated with genes periodically expressed in more conditions. Finally, we demonstrate that our approach retrieves combinations that are more specific towards known cell-cycle related regulators than the frequently used clustering approach. Conclusion The results illustrate how a model-based approach to expression analysis may be particularly well suited to detect biologically relevant mechanisms. Our new approach makes it possible to provide more refined hypotheses about regulatory mechanisms of the cell cycle and it can easily be adjusted to reveal regulation of other, non-periodic, cellular processes. PMID:17939860
Robustness, Evolvability, and the Logic of Genetic Regulation
Moore, Jason H.; Wagner, Andreas
2014-01-01
In gene regulatory circuits, the expression of individual genes is commonly modulated by a set of regulating gene products, which bind to a gene’s cis-regulatory region. This region encodes an input-output function, referred to as signal-integration logic, that maps a specific combination of regulatory signals (inputs) to a particular expression state (output) of a gene. The space of all possible signal-integration functions is vast and the mapping from input to output is many-to-one: for the same set of inputs, many functions (genotypes) yield the same expression output (phenotype). Here, we exhaustively enumerate the set of signal-integration functions that yield identical gene expression patterns within a computational model of gene regulatory circuits. Our goal is to characterize the relationship between robustness and evolvability in the signal-integration space of regulatory circuits, and to understand how these properties vary between the genotypic and phenotypic scales. Among other results, we find that the distributions of genotypic robustness are skewed, such that the majority of signal-integration functions are robust to perturbation. We show that the connected set of genotypes that make up a given phenotype are constrained to specific regions of the space of all possible signal-integration functions, but that as the distance between genotypes increases, so does their capacity for unique innovations. In addition, we find that robust phenotypes are (i) evolvable, (ii) easily identified by random mutation, and (iii) mutationally biased toward other robust phenotypes. We explore the implications of these latter observations for mutation-based evolution by conducting random walks between randomly chosen source and target phenotypes. We demonstrate that the time required to identify the target phenotype is independent of the properties of the source phenotype. PMID:23373974
Computational challenges in modeling gene regulatory events.
Pataskar, Abhijeet; Tiwari, Vijay K
2016-10-19
Cellular transcriptional programs driven by genetic and epigenetic mechanisms could be better understood by integrating "omics" data and subsequently modeling the gene-regulatory events. Toward this end, computational biology should keep pace with evolving experimental procedures and data availability. This article gives an exemplified account of the current computational challenges in molecular biology.
Species tree inference by minimizing deep coalescences.
Than, Cuong; Nakhleh, Luay
2009-09-01
In a 1997 seminal paper, W. Maddison proposed minimizing deep coalescences, or MDC, as an optimization criterion for inferring the species tree from a set of incongruent gene trees, assuming the incongruence is exclusively due to lineage sorting. In a subsequent paper, Maddison and Knowles provided and implemented a search heuristic for optimizing the MDC criterion, given a set of gene trees. However, the heuristic is not guaranteed to compute optimal solutions, and its hill-climbing search makes it slow in practice. In this paper, we provide two exact solutions to the problem of inferring the species tree from a set of gene trees under the MDC criterion. In other words, our solutions are guaranteed to find the tree that minimizes the total number of deep coalescences from a set of gene trees. One solution is based on a novel integer linear programming (ILP) formulation, and another is based on a simple dynamic programming (DP) approach. Powerful ILP solvers, such as CPLEX, make the first solution appealing, particularly for very large-scale instances of the problem, whereas the DP-based solution eliminates dependence on proprietary tools, and its simplicity makes it easy to integrate with other genomic events that may cause gene tree incongruence. Using the exact solutions, we analyze a data set of 106 loci from eight yeast species, a data set of 268 loci from eight Apicomplexan species, and several simulated data sets. We show that the MDC criterion provides very accurate estimates of the species tree topologies, and that our solutions are very fast, thus allowing for the accurate analysis of genome-scale data sets. Further, the efficiency of the solutions allows for quick exploration of sub-optimal solutions, which is important for a parsimony-based criterion such as MDC, as we show. 
We show that searching for the species tree in the compatibility graph of the clusters induced by the gene trees may be sufficient in practice, a finding that helps ameliorate the computational requirements of optimization solutions. Further, we study the statistical consistency and convergence rate of the MDC criterion, as well as its optimality in inferring the species tree. Finally, we show how our solutions can be used to identify potential horizontal gene transfer events that may have caused some of the incongruence in the data, thus augmenting Maddison's original framework. We have implemented our solutions in the PhyloNet software package, which is freely available at: http://bioinfo.cs.rice.edu/phylonet.
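As a rough illustration of the MDC criterion described above, the sketch below scores a candidate species tree by counting "extra lineages": for every cluster (branch) of the species tree, it counts the maximal gene-tree clades that fit entirely inside that cluster and subtracts one. It assumes trees are nested Python tuples of leaf names and that every species contributes exactly one allele to every gene tree. It is not the ILP or DP algorithm of the paper, and the function names are hypothetical.

```python
def leaves(tree):
    """Return the set of leaf labels under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def clusters(tree, is_root=True):
    """Yield the leaf set below every node of the tree except the root."""
    if not is_root:
        yield leaves(tree)
    if not isinstance(tree, str):
        for child in tree:
            yield from clusters(child, False)

def lineages_in(gene_tree, cluster):
    """Count maximal gene-tree clades whose leaves all lie inside `cluster`."""
    if leaves(gene_tree) <= cluster:
        return 1
    if isinstance(gene_tree, str):
        return 0
    return sum(lineages_in(child, cluster) for child in gene_tree)

def extra_lineages(species_tree, gene_tree):
    """Extra lineages of one gene tree, summed over all species-tree branches."""
    return sum(lineages_in(gene_tree, c) - 1 for c in clusters(species_tree))

def mdc_cost(species_tree, gene_trees):
    """Total cost of a species tree; the MDC criterion prefers the minimum."""
    return sum(extra_lineages(species_tree, g) for g in gene_trees)

species = (("A", "B"), ("C", "D"))
congruent = (("A", "B"), ("C", "D"))
incongruent = (("A", "C"), ("B", "D"))
print(mdc_cost(species, [congruent, incongruent]))
```

An exhaustive search would call `mdc_cost` on every candidate topology, which is only feasible for a handful of taxa; that cost is what motivates exact ILP and DP formulations.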
Genes2Networks: connecting lists of gene symbols using mammalian protein interactions databases.
Berger, Seth I; Posner, Jeremy M; Ma'ayan, Avi
2007-10-04
In recent years, mammalian protein-protein interaction network databases have been developed. The interactions in these databases are either extracted manually from low-throughput experimental biomedical research literature, extracted automatically from literature using techniques such as natural language processing (NLP), generated experimentally using high-throughput methods such as yeast-2-hybrid screens, or interactions are predicted using an assortment of computational approaches. Genes or proteins identified as significantly changing in proteomic experiments, or identified as susceptibility disease genes in genomic studies, can be placed in the context of protein interaction networks in order to assign these genes and proteins to pathways and protein complexes. Genes2Networks is a software system that integrates the content of ten mammalian interaction network datasets. Filtering techniques to prune low-confidence interactions were implemented. Genes2Networks is delivered as a web-based service using AJAX. The system can be used to extract relevant subnetworks created from "seed" lists of human Entrez gene symbols. The output includes a dynamic linkable three color web-based network map, with a statistical analysis report that identifies significant intermediate nodes used to connect the seed list. Genes2Networks is powerful web-based software that can help experimental biologists to interpret lists of genes and proteins such as those commonly produced through genomic and proteomic experiments, as well as lists of genes and proteins associated with disease processes. This system can be used to find relationships between genes and proteins from seed lists, and predict additional genes or proteins that may play key roles in common pathways or protein complexes.
Wu, Shuang; Liu, Zhi-Ping; Qiu, Xing; Wu, Hulin
2014-01-01
The immune response to viral infection is regulated by an intricate network of many genes and their products. The reverse engineering of gene regulatory networks (GRNs) using mathematical models from time course gene expression data collected after influenza infection is key to our understanding of the mechanisms involved in controlling influenza infection within a host. A five-step pipeline: detection of temporally differentially expressed genes, clustering genes into co-expressed modules, identification of network structure, parameter estimate refinement, and functional enrichment analysis, is developed for reconstructing high-dimensional dynamic GRNs from genome-wide time course gene expression data. Applying the pipeline to the time course gene expression data from influenza-infected mouse lungs, we have identified 20 distinct temporal expression patterns in the differentially expressed genes and constructed a module-based dynamic network using a linear ODE model. Both intra-module and inter-module annotations and regulatory relationships of our inferred network show some interesting findings and are highly consistent with existing knowledge about the immune response in mice after influenza infection. The proposed method is a computationally efficient, data-driven pipeline bridging experimental data, mathematical modeling, and statistical analysis. The application to the influenza infection data elucidates the potentials of our pipeline in providing valuable insights into systematic modeling of complicated biological processes.
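The linear ODE model mentioned above can be sketched as dx/dt = A x, where x holds module expression levels and A holds regulatory weights. The snippet below is a minimal forward-Euler integration under that assumption. The matrix values are invented for illustration, and the structure-identification and parameter-refinement steps of the pipeline are not shown.

```python
def simulate_linear_ode(A, x0, t_end, dt=0.001):
    """Integrate dx/dt = A x with the forward Euler method."""
    x = list(x0)
    for _ in range(round(t_end / dt)):
        dx = [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]
        x = [x_i + dt * d_i for x_i, d_i in zip(x, dx)]
    return x

# Hypothetical two-module network: module 0 decays, module 1 is
# activated by module 0 and decays on its own.
A = [[-1.0, 0.0],
     [1.0, -1.0]]
print(simulate_linear_ode(A, x0=[1.0, 0.0], t_end=1.0))
```

In a real analysis a stiff-aware solver and a fitted A matrix would replace the fixed step size and hand-picked weights used here.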
McKinney, Brett A.; White, Bill C.; Grill, Diane E.; Li, Peter W.; Kennedy, Richard B.; Poland, Gregory A.; Oberg, Ann L.
2013-01-01
Relief-F is a nonparametric, nearest-neighbor machine learning method that has been successfully used to identify relevant variables that may interact in complex multivariate models to explain phenotypic variation. While several tools have been developed for assessing differential expression in sequence-based transcriptomics, the detection of statistical interactions between transcripts has received less attention in the area of RNA-seq analysis. We describe a new extension and assessment of Relief-F for feature selection in RNA-seq data. The ReliefSeq implementation adapts the number of nearest neighbors (k) for each gene to optimize the Relief-F test statistics (importance scores) for finding both main effects and interactions. We compare this gene-wise adaptive-k (gwak) Relief-F method with standard RNA-seq feature selection tools, such as DESeq and edgeR, and with the popular machine learning method Random Forests. We demonstrate performance on a panel of simulated data that have a range of distributional properties reflected in real mRNA-seq data including multiple transcripts with varying sizes of main effects and interaction effects. For simulated main effects, gwak-Relief-F feature selection performs comparably to standard tools DESeq and edgeR for ranking relevant transcripts. For gene-gene interactions, gwak-Relief-F outperforms all comparison methods at ranking relevant genes in all but the highest fold change/highest signal situations where it performs similarly. The gwak-Relief-F algorithm outperforms Random Forests for detecting relevant genes in all simulation experiments. In addition, Relief-F is comparable to the other methods based on computational time. We also apply ReliefSeq to an RNA-Seq study of smallpox vaccine to identify gene expression changes between vaccinia virus-stimulated and unstimulated samples. 
ReliefSeq is an attractive tool for inclusion in the suite of tools used for analysis of mRNA-Seq data; it has power to detect both main effects and interaction effects. Software Availability: http://insilico.utulsa.edu/ReliefSeq.php. PMID:24339943
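The nearest-neighbor scoring behind Relief-type methods can be sketched as follows. This is a plain Relief-style scorer with a single fixed k for all features, two classes, and range-scaled absolute differences. It does not implement the gene-wise adaptive-k procedure of ReliefSeq, and the function name and toy data are assumptions.

```python
def relief_scores(X, y, k=1):
    """Score features: reward separation from nearest misses, penalize distance to nearest hits."""
    n, p = len(X), len(X[0])
    # Scale each feature by its range so per-feature differences lie in [0, 1].
    ranges = [(max(r[j] for r in X) - min(r[j] for r in X)) or 1.0 for j in range(p)]

    def diff(j, a, b):
        return abs(a[j] - b[j]) / ranges[j]

    def dist(a, b):
        return sum(diff(j, a, b) for j in range(p))

    weights = [0.0] * p
    for i in range(n):
        others = sorted((m for m in range(n) if m != i),
                        key=lambda m: dist(X[i], X[m]))
        hits = [m for m in others if y[m] == y[i]][:k]      # same-class neighbors
        misses = [m for m in others if y[m] != y[i]][:k]    # other-class neighbors
        for j in range(p):
            hit_term = sum(diff(j, X[i], X[m]) for m in hits) / max(len(hits), 1)
            miss_term = sum(diff(j, X[i], X[m]) for m in misses) / max(len(misses), 1)
            weights[j] += (miss_term - hit_term) / n
    return weights

# Toy data: feature 0 separates the classes, feature 1 does not.
X = [[0.0, 0.3], [0.1, 0.9], [0.9, 0.2], [1.0, 0.8]]
y = [0, 0, 1, 1]
print(relief_scores(X, y, k=1))
```

Because neighbors are found in the full feature space, a feature can gain weight through its joint effect with others, which is why this family of methods is used to look for interactions as well as main effects.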
Identifying cooperative transcriptional regulations using protein–protein interactions
Nagamine, Nobuyoshi; Kawada, Yuji; Sakakibara, Yasubumi
2005-01-01
Cooperative transcriptional activations among multiple transcription factors (TFs) are important to understand the mechanisms of complex transcriptional regulations in eukaryotes. Previous studies have attempted to find cooperative TFs based on gene expression data with gene expression profiles as a measure of similarity of gene regulations. In this paper, we use protein–protein interaction data to infer synergistic binding of cooperative TFs. Our fundamental idea is based on the assumption that genes contributing to a similar biological process are regulated under the same control mechanism. First, the protein–protein interaction networks are used to calculate the similarity of biological processes among genes. Second, we integrate this similarity and the chromatin immuno-precipitation data to identify cooperative TFs. Our computational experiments in yeast show that predictions made by our method have successfully identified eight pairs of cooperative TFs that have literature evidences but could not be identified by the previous method. Further, 12 new possible pairs have been inferred and we have examined the biological relevances for them. However, since a typical problem using protein–protein interaction data is that many false-positive data are contained, we propose a method combining various biological data to increase the prediction accuracy. PMID:16126847
Improving microbial fitness in the mammalian gut by in vivo temporal functional metagenomics
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yaung, Stephanie J.; Deng, Luxue; Li, Ning
Elucidating functions of commensal microbial genes in the mammalian gut is challenging because many commensals are recalcitrant to laboratory cultivation and genetic manipulation. We present Temporal FUnctional Metagenomics sequencing (TFUMseq), a platform to functionally mine bacterial genomes for genes that contribute to fitness of commensal bacteria in vivo. Our approach uses metagenomic DNA to construct large-scale heterologous expression libraries that are tracked over time in vivo by deep sequencing and computational methods. To demonstrate our approach, we built a TFUMseq plasmid library using the gut commensal Bacteroides thetaiotaomicron (Bt) and introduced Escherichia coli carrying this library into germfree mice. Population dynamics of library clones revealed Bt genes conferring significant fitness advantages in E. coli over time, including carbohydrate utilization genes, with a Bt galactokinase central to early colonization, and subsequent dominance by a Bt glycoside hydrolase enabling sucrose metabolism coupled with co-evolution of the plasmid library and E. coli genome driving increased galactose utilization. Here, our findings highlight the utility of functional metagenomics for engineering commensal bacteria with improved properties, including expanded colonization capabilities in vivo.
Improving microbial fitness in the mammalian gut by in vivo temporal functional metagenomics
Yaung, Stephanie J.; Deng, Luxue; Li, Ning; ...
2015-03-11
Elucidating functions of commensal microbial genes in the mammalian gut is challenging because many commensals are recalcitrant to laboratory cultivation and genetic manipulation. We present Temporal FUnctional Metagenomics sequencing (TFUMseq), a platform to functionally mine bacterial genomes for genes that contribute to fitness of commensal bacteria in vivo. Our approach uses metagenomic DNA to construct large-scale heterologous expression libraries that are tracked over time in vivo by deep sequencing and computational methods. To demonstrate our approach, we built a TFUMseq plasmid library using the gut commensal Bacteroides thetaiotaomicron (Bt) and introduced Escherichia coli carrying this library into germfree mice. Population dynamics of library clones revealed Bt genes conferring significant fitness advantages in E. coli over time, including carbohydrate utilization genes, with a Bt galactokinase central to early colonization, and subsequent dominance by a Bt glycoside hydrolase enabling sucrose metabolism coupled with co-evolution of the plasmid library and E. coli genome driving increased galactose utilization. Here, our findings highlight the utility of functional metagenomics for engineering commensal bacteria with improved properties, including expanded colonization capabilities in vivo.
Distinguishing between Selective Sweeps from Standing Variation and from a De Novo Mutation
Peter, Benjamin M.; Huerta-Sanchez, Emilia; Nielsen, Rasmus
2012-01-01
An outstanding question in human genetics has been the degree to which adaptation occurs from standing genetic variation or from de novo mutations. Here, we combine several common statistics used to detect selection in an Approximate Bayesian Computation (ABC) framework, with the goal of discriminating between models of selection and providing estimates of the age of selected alleles and the selection coefficients acting on them. We use simulations to assess the power and accuracy of our method and apply it to seven of the strongest sweeps currently known in humans. We identify two genes, ASPM and PSCA, that are most likely affected by selection on standing variation; and we find three genes, ADH1B, LCT, and EDAR, in which the adaptive alleles seem to have swept from a new mutation. We also confirm evidence of selection for one further gene, TRPV6. In one gene, G6PD, neither neutral models nor models of selective sweeps fit the data, presumably because this locus has been subject to balancing selection. PMID:23071458
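The Approximate Bayesian Computation framework named above can be illustrated with its simplest variant, rejection sampling: draw a parameter from the prior, simulate data, and keep the draw if a summary statistic is close to the observed one. The toy below estimates a binomial success probability. The model, prior, tolerance, and function name are all assumptions chosen for brevity and have nothing to do with the selection statistics used in the study.

```python
import random

def abc_rejection(observed, n_trials, n_draws=5000, tol=2, seed=0):
    """Return prior draws whose simulated success count is within `tol` of `observed`."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        theta = rng.random()                                   # Uniform(0, 1) prior
        simulated = sum(rng.random() < theta for _ in range(n_trials))
        if abs(simulated - observed) <= tol:                   # compare summary statistics
            accepted.append(theta)
    return accepted

posterior = abc_rejection(observed=30, n_trials=100)
print(len(posterior), sum(posterior) / len(posterior))
```

Model choice in ABC follows the same pattern: simulate under each competing model and compare how often each one produces summaries close to the data.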
Coleman, J. Robert; Papamichail, Dimitris; Yano, Masahide; García-Suárez, María del Mar
2011-01-01
In this study, we used a previously described method of controlling gene expression with computer-based gene design and de novo DNA synthesis to attenuate the virulence of Streptococcus pneumoniae. We produced 2 S. pneumoniae serotype 3 (SP3) strains in which the pneumolysin gene (ply) was recoded with underrepresented codon pairs while retaining its amino acid sequence and determined their ply expression and pneumolysin production in vitro and their virulence in a mouse pulmonary infection model. Expression of ply and production of pneumolysin of the recoded SP3 strains were decreased, and the recoded SP3 strains were less virulent in mice than the wild-type SP3 strain or a Δply SP3 strain. Further studies showed that the least virulent recoded strain induced a markedly reduced inflammatory response in the lungs compared with the wild-type or Δply strain. These findings suggest that reducing pneumococcal virulence gene expression by altering codon-pair bias could hold promise for rational design of live-attenuated pneumococcal vaccines. PMID:21343143
2013-01-01
Background: Gene expression data could likely be a momentous help in the progress of proficient cancer diagnoses and classification platforms. Lately, many researchers analyze gene expression data using diverse computational intelligence methods, for selecting a small subset of informative genes from the data for cancer classification. Many computational methods face difficulties in selecting small subsets due to the small number of samples compared to the huge number of genes (high-dimension), irrelevant genes, and noisy genes. Methods: We propose an enhanced binary particle swarm optimization to perform the selection of small subsets of informative genes which is significant for cancer classification. Particle speed, rule, and modified sigmoid function are introduced in this proposed method to increase the probability of the bits in a particle’s position to be zero. The method was empirically applied to a suite of ten well-known benchmark gene expression data sets. Results: The performance of the proposed method proved to be superior to other previous related works, including the conventional version of binary particle swarm optimization (BPSO) in terms of classification accuracy and the number of selected genes. The proposed method also requires lower computational time compared to BPSO. PMID:23617960
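For orientation, one update step of conventional binary particle swarm optimization (the baseline the abstract compares against) can be sketched as below. Each bit marks whether a gene is selected, and the sigmoid of the velocity gives the probability that the bit is set to one. The enhanced speed rule and modified sigmoid of the proposed method are not described in enough detail here to reproduce, so they are not shown. The parameter values and function names are assumptions.

```python
import math
import random

def sigmoid(v):
    """Map a velocity to the probability that the corresponding bit is 1."""
    return 1.0 / (1.0 + math.exp(-v))

def bpso_step(x, v, pbest, gbest, rng, w=0.7, c1=2.0, c2=2.0, v_max=4.0):
    """One conventional BPSO update of a particle's bit vector and velocity."""
    new_x, new_v = [], []
    for xi, vi, pi, gi in zip(x, v, pbest, gbest):
        # Pull the velocity toward the particle's best and the swarm's best positions.
        vi = w * vi + c1 * rng.random() * (pi - xi) + c2 * rng.random() * (gi - xi)
        vi = max(-v_max, min(v_max, vi))          # clamp to keep probabilities away from 0 and 1
        new_v.append(vi)
        new_x.append(1 if rng.random() < sigmoid(vi) else 0)
    return new_x, new_v

rng = random.Random(1)
x, v = bpso_step([0, 1, 0, 1], [0.0] * 4, pbest=[1, 1, 0, 0], gbest=[1, 0, 0, 0], rng=rng)
print(x, v)
```

With the standard sigmoid a zero velocity selects a gene with probability 0.5, which tends to keep subsets large; biasing bits toward zero, as the abstract describes, is aimed at that behavior.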
Computational challenges in modeling gene regulatory events
Pataskar, Abhijeet; Tiwari, Vijay K.
2016-01-01
Cellular transcriptional programs driven by genetic and epigenetic mechanisms could be better understood by integrating “omics” data and subsequently modeling the gene-regulatory events. Toward this end, computational biology should keep pace with evolving experimental procedures and data availability. This article gives an exemplified account of the current computational challenges in molecular biology. PMID:27390891
Little, A C; Burt, D M; Penton-Voak, I S; Perrett, D I
2001-01-01
Exaggerated sexual dimorphism and symmetry in human faces have both been linked to potential 'good-gene' benefits and have also been found to influence the attractiveness of male faces. The current study explores how female self-rated attractiveness influences male face preference in females using faces manipulated with computer graphics. The study demonstrates that there is a relatively increased preference for masculinity and an increased preference for symmetry for women who regard themselves as attractive. This finding may reflect a condition-dependent mating strategy analogous to behaviours found in other species. The absence of a preference for proposed markers of good genes may be adaptive in women of low mate value to avoid the costs of decreased parental investment from the owners of such characteristics. PMID:12123296
Bellaire, Anke; Ischebeck, Till; Staedler, Yannick; Weinhaeuser, Isabell; Mair, Andrea; Parameswaran, Sriram; Ito, Toshiro; Schönenberger, Jürg; Weckwerth, Wolfram
2014-01-01
The interrelationship of morphogenesis and metabolism is a poorly studied phenomenon. The main paradigm is that development is controlled by gene expression. The aim of the present study was to correlate metabolism to early and late stages of flower and fruit development in order to provide the basis for the identification of metabolic adjustment and limitations. A highly detailed picture of morphogenesis is achieved using nondestructive micro computed tomography. This technique was used to quantify morphometric parameters of early and late flower development in an Arabidopsis thaliana mutant with synchronized flower initiation. The synchronized flower phenotype made it possible to sample enough early floral tissue otherwise not accessible for metabolomic analysis. The integration of metabolomic and morphometric data enabled the correlation of metabolic signatures with the process of flower morphogenesis. These signatures changed significantly during development, indicating a pronounced metabolic reprogramming in the tissue. Distinct sets of metabolites involved in these processes were identified and were linked to the findings of previous gene expression studies of flower development. High correlations with basic leucine zipper (bZIP) transcription factors and nitrogen metabolism genes involved in the control of metabolic carbon : nitrogen partitioning were revealed. Based on these observations a model for metabolic adjustment during flower development is proposed. PMID:24350948
Cheng, Feixiong; Zhao, Junfei; Zhao, Zhongming
2016-07-01
Cancer is often driven by the accumulation of genetic alterations, including single nucleotide variants, small insertions or deletions, gene fusions, copy-number variations, and large chromosomal rearrangements. Recent advances in next-generation sequencing technologies have helped investigators generate massive amounts of cancer genomic data and catalog somatic mutations in both common and rare cancer types. So far, the somatic mutation landscapes and signatures of >10 major cancer types have been reported; however, pinpointing driver mutations and cancer genes from millions of available cancer somatic mutations remains a monumental challenge. To tackle this important task, many methods and computational tools have been developed during the past several years and, thus, a review of its advances is urgently needed. Here, we first summarize the main features of these methods and tools for whole-exome, whole-genome and whole-transcriptome sequencing data. Then, we discuss major challenges like tumor intra-heterogeneity, tumor sample saturation and functionality of synonymous mutations in cancer, all of which may result in false-positive discoveries. Finally, we highlight new directions in studying regulatory roles of noncoding somatic mutations and quantitatively measuring circulating tumor DNA in cancer. This review may help investigators find an appropriate tool for detecting potential driver or actionable mutations in rapidly emerging precision cancer medicine. © The Author 2015. Published by Oxford University Press. For Permissions, please email: journals.permissions@oup.com.
GBOOST: a GPU-based tool for detecting gene-gene interactions in genome-wide case control studies.
Yung, Ling Sing; Yang, Can; Wan, Xiang; Yu, Weichuan
2011-05-01
Collecting millions of genetic variations is feasible with the advanced genotyping technology. With a huge amount of genetic variations data in hand, developing efficient algorithms to carry out the gene-gene interaction analysis in a timely manner has become one of the key problems in genome-wide association studies (GWAS). Boolean operation-based screening and testing (BOOST), a recent work in GWAS, completes gene-gene interaction analysis in 2.5 days on a desktop computer. Compared with central processing units (CPUs), graphic processing units (GPUs) are highly parallel hardware and provide massive computing resources. We are, therefore, motivated to use GPUs to further speed up the analysis of gene-gene interactions. We implement the BOOST method based on a GPU framework and name it GBOOST. GBOOST achieves a 40-fold speedup compared with BOOST. It completes the analysis of Wellcome Trust Case Control Consortium Type 2 Diabetes (WTCCC T2D) genome data within 1.34 h on a desktop computer equipped with Nvidia GeForce GTX 285 display card. GBOOST code is available at http://bioinformatics.ust.hk/BOOST.html#GBOOST.
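The Boolean representation that makes this kind of screening fast can be sketched as follows: each SNP is stored as three bitsets (one per genotype value), and each cell of the genotype-by-genotype contingency table is obtained with a bitwise AND followed by a population count. The sketch uses Python integers as bitsets and is only an illustration of the idea. It is not the BOOST or GBOOST code, it omits the statistical test, and the function names are hypothetical.

```python
def pack(flags):
    """Pack a list of booleans into one Python int, one bit per sample."""
    bits = 0
    for i, flag in enumerate(flags):
        if flag:
            bits |= 1 << i
    return bits

def genotype_bitsets(genotypes):
    """One bitset per genotype value 0, 1, 2 for a single SNP."""
    return [pack([g == value for g in genotypes]) for value in (0, 1, 2)]

def contingency_table(snp_a, snp_b, is_case):
    """Return table[class][a][b] counts, with class 0 = control and 1 = case."""
    n = len(is_case)
    case_mask = pack(is_case)
    ctrl_mask = ~case_mask & ((1 << n) - 1)
    a_sets, b_sets = genotype_bitsets(snp_a), genotype_bitsets(snp_b)
    table = []
    for mask in (ctrl_mask, case_mask):
        # AND selects samples with both genotypes in this class; count the set bits.
        table.append([[bin(a & b & mask).count("1") for b in b_sets] for a in a_sets])
    return table

snp_a = [0, 1, 2, 0, 1, 2]
snp_b = [0, 0, 1, 1, 2, 2]
is_case = [True, False, True, False, True, False]
print(contingency_table(snp_a, snp_b, is_case))
```

Since every SNP pair needs only a fixed number of AND and popcount operations on machine words, the work is independent across pairs, which is the property a GPU implementation can exploit.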
Explicit Building Block Multiobjective Evolutionary Computation: Methods and Applications
2005-06-16
which is introduced in 1990 by Richard Dawkins in his book "The Selfish Gene." [34] E.5.7 Pareto Envelop-based Selection Algorithm I and II...IGC: Intelligent Gene Collector; OED: Orthogonal Experimental Design; MED: Main Effect...complete one experiment. The string length hold within the computer (can be longer than number of genes
Mintz-Hittner, H A; Ferrell, R E; Sims, K B; Fernandez, K M; Gemmell, B S; Satriano, D R; Caster, J; Kretzer, F L
1996-12-01
The Norrie disease (ND) gene (Xp11.3) (McKusick 310600) consists of one untranslated exon and two exons partially translated as the Norrie disease protein (Norrin). Norrin has sequence homology and computer-predicted tertiary structure of a growth factor containing a cystine knot motif, which affects endothelial cell migration and proliferation. Norrie disease (congenital retinal detachment), X-linked primary retinal dysplasia (congenital retinal fold), and X-linked exudative vitreoretinopathy (congenital macular ectopia) are allelic disorders. Blood was drawn for genetic studies from members of two families to test for ND gene mutations. Sixteen unaffected family members were examined ophthalmologically. If any retinal abnormality was identified, fundus photography and fluorescein angiography were performed. Family A had ND (R109stp), and family B had X-linked exudative vitreoretinopathy (R121L). The retinas of 11 offspring of carrier females were examined: three of seven carrier females, three of three otherwise healthy females, and one of one otherwise healthy male had peripheral inner retinal vascular abnormalities. The retinas of five offspring of affected males were examined: none of three carrier females and none of two otherwise healthy males had this peripheral retinal finding. Peripheral inner retinal vascular abnormalities similar to regressed retinopathy of prematurity were identified in seven offspring of carriers of ND gene mutations in two families. These ophthalmologic findings, especially in four genetically healthy offspring, strongly support the hypothesis that abnormal Norrin may have an adverse transplacental (environmental) effect on normal inner retinal vasculogenesis.
Reyes-Guzmán, Edwin Alfredo; Poutou-Piñales, Raúl A.; Reyes-Montaño, Edgar Antonio; Pedroza-Rodríguez, Aura Marina; Rodríguez-Vázquez, Refugio; Cardozo-Bernal, Ángela M.
2015-01-01
Laccases are multicopper oxidases that can catalyze aromatic and non-aromatic compounds concomitantly with reduction of molecular oxygen to water. Fungal laccases have generated a growing interest due to their biotechnological potential applications, such as lignocellulosic material delignification, biopulping and biobleaching, wastewater treatment, and transformation of toxic organic pollutants. In this work we selected fungal genes encoding for laccase enzymes GlLCC1 in Ganoderma lucidum and POXA 1B in Pleurotus ostreatus. These genes were optimized for codon use, GC content, and regions generating secondary structures. Computational models of the laccases were proposed, and their interaction with ABTS [2, 2′-azino-bis(3-ethylbenzothiazoline-6-sulphonic acid)] substrate was evaluated by molecular docking. Synthetic genes were cloned under the control of Pichia pastoris glyceraldehyde-3-phosphate dehydrogenase (GAP) constitutive promoter. P. pastoris X-33 was transformed with pGAPZαA-LaccGluc-Stop and pGAPZαA-LaccPost-Stop constructs. Optimization reduced GC content by 47 and 49% for LaccGluc-Stop and LaccPost-Stop genes, respectively. A codon adaptation index of 0.84 was obtained for both genes. 3D structure analysis using SuperPose revealed LaccGluc-Stop is similar to the laccase crystallographic structure 1GYC of Trametes versicolor. Interaction analysis of the 3D models validated through ABTS, demonstrated higher substrate affinity for LaccPost-Stop, in agreement with our experimental results with enzymatic activities of 451.08 ± 6.46 UL-1 compared to activities of 0.13 ± 0.028 UL-1 for LaccGluc-Stop. This study demonstrated that G. lucidum GlLCC1 and P. ostreatus POXA 1B gene optimization resulted in constitutive gene expression under GAP promoter and α-factor leader in P. pastoris. 
These are important findings in light of recombinant enzyme expression system utility for environmentally friendly designed expression systems, because of the wide range of substrates that laccases can transform. This contributes to a great gamut of products in diverse settings: industry, clinical and chemical use, and environmental applications. PMID:25611746
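Two of the sequence metrics mentioned in the abstract above, GC content and the codon adaptation index (CAI), are simple to compute once per-codon relative adaptiveness weights for the host are available. The sketch below takes those weights as an input. The example weights are made up for illustration and are not Pichia pastoris values, and the function names are hypothetical.

```python
import math

def gc_content(seq):
    """Fraction of bases in the sequence that are G or C."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)

def codon_adaptation_index(seq, weights):
    """Geometric mean of the relative adaptiveness of each codon in the sequence.

    `weights` maps codons to values in (0, 1]; codons missing from the
    mapping are skipped in this sketch.
    """
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    return math.exp(sum(logs) / len(logs))

# Hypothetical weights for two alanine codons.
example_weights = {"GCT": 1.0, "GCC": 0.25}
print(gc_content("GCTGCC"), codon_adaptation_index("GCTGCC", example_weights))
```

A codon optimizer would typically search over synonymous codons to raise the CAI while keeping GC content and predicted secondary structure within chosen bounds.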
Rivera-Hoyos, Claudia M; Morales-Álvarez, Edwin David; Poveda-Cuevas, Sergio Alejandro; Reyes-Guzmán, Edwin Alfredo; Poutou-Piñales, Raúl A; Reyes-Montaño, Edgar Antonio; Pedroza-Rodríguez, Aura Marina; Rodríguez-Vázquez, Refugio; Cardozo-Bernal, Ángela M
2015-01-01
Laccases are multicopper oxidases that can catalyze aromatic and non-aromatic compounds concomitantly with reduction of molecular oxygen to water. Fungal laccases have generated a growing interest due to their biotechnological potential applications, such as lignocellulosic material delignification, biopulping and biobleaching, wastewater treatment, and transformation of toxic organic pollutants. In this work we selected fungal genes encoding for laccase enzymes GlLCC1 in Ganoderma lucidum and POXA 1B in Pleurotus ostreatus. These genes were optimized for codon use, GC content, and regions generating secondary structures. Computational models of the laccases were proposed, and their interaction with ABTS [2, 2'-azino-bis(3-ethylbenzothiazoline-6-sulphonic acid)] substrate was evaluated by molecular docking. Synthetic genes were cloned under the control of Pichia pastoris glyceraldehyde-3-phosphate dehydrogenase (GAP) constitutive promoter. P. pastoris X-33 was transformed with pGAPZαA-LaccGluc-Stop and pGAPZαA-LaccPost-Stop constructs. Optimization reduced GC content by 47 and 49% for LaccGluc-Stop and LaccPost-Stop genes, respectively. A codon adaptation index of 0.84 was obtained for both genes. 3D structure analysis using SuperPose revealed LaccGluc-Stop is similar to the laccase crystallographic structure 1GYC of Trametes versicolor. Interaction analysis of the 3D models validated through ABTS, demonstrated higher substrate affinity for LaccPost-Stop, in agreement with our experimental results with enzymatic activities of 451.08 ± 6.46 UL-1 compared to activities of 0.13 ± 0.028 UL-1 for LaccGluc-Stop. This study demonstrated that G. lucidum GlLCC1 and P. ostreatus POXA 1B gene optimization resulted in constitutive gene expression under GAP promoter and α-factor leader in P. pastoris. 
These are important findings in light of recombinant enzyme expression system utility for environmentally friendly designed expression systems, because of the wide range of substrates that laccases can transform. This contributes to a great gamut of products in diverse settings: industry, clinical and chemical use, and environmental applications.
Bacteria as computers making computers
Danchin, Antoine
2009-01-01
Various efforts to integrate biological knowledge into networks of interactions have produced a lively microbial systems biology. Putting molecular biology and computer sciences in perspective, we review another trend in systems biology, in which recursivity and information replace the usual concepts of differential equations, feedback and feedforward loops and the like. Noting that the processes of gene expression separate the genome from the cell machinery, we analyse the role of the separation between machine and program in computers. However, computers do not make computers. For cells to make cells requires a specific organization of the genetic program, which we investigate using available knowledge. Microbial genomes are organized into a paleome (the name emphasizes the role of the corresponding functions from the time of the origin of life), comprising a constructor and a replicator, and a cenome (emphasizing community-relevant genes), made up of genes that permit life in a particular context. The cell duplication process supposes rejuvenation of the machine and replication of the program. The paleome also possesses genes that enable information to accumulate in a ratchet-like process down the generations. The systems biology must include the dynamics of information creation in its future developments. PMID:19016882
Bacteria as computers making computers.
Danchin, Antoine
2009-01-01
Various efforts to integrate biological knowledge into networks of interactions have produced a lively microbial systems biology. Putting molecular biology and computer sciences in perspective, we review another trend in systems biology, in which recursivity and information replace the usual concepts of differential equations, feedback and feedforward loops and the like. Noting that the processes of gene expression separate the genome from the cell machinery, we analyse the role of the separation between machine and program in computers. However, computers do not make computers. For cells to make cells requires a specific organization of the genetic program, which we investigate using available knowledge. Microbial genomes are organized into a paleome (the name emphasizes the role of the corresponding functions from the time of the origin of life), comprising a constructor and a replicator, and a cenome (emphasizing community-relevant genes), made up of genes that permit life in a particular context. The cell duplication process supposes rejuvenation of the machine and replication of the program. The paleome also possesses genes that enable information to accumulate in a ratchet-like process down the generations. The systems biology must include the dynamics of information creation in its future developments.
Pathway connectivity and signaling coordination in the yeast stress-activated signaling network
Chasman, Deborah; Ho, Yi-Hsuan; Berry, David B; Nemec, Corey M; MacGilvray, Matthew E; Hose, James; Merrill, Anna E; Lee, M Violet; Will, Jessica L; Coon, Joshua J; Ansari, Aseem Z; Craven, Mark; Gasch, Audrey P
2014-01-01
Stressed cells coordinate a multi-faceted response spanning many levels of physiology. Yet knowledge of the complete stress-activated regulatory network as well as design principles for signal integration remains incomplete. We developed an experimental and computational approach to integrate available protein interaction data with gene fitness contributions, mutant transcriptome profiles, and phospho-proteome changes in cells responding to salt stress, to infer the salt-responsive signaling network in yeast. The inferred subnetwork presented many novel predictions by implicating new regulators, uncovering unrecognized crosstalk between known pathways, and pointing to previously unknown ‘hubs’ of signal integration. We exploited these predictions to show that Cdc14 phosphatase is a central hub in the network and that modification of RNA polymerase II coordinates induction of stress-defense genes with reduction of growth-related transcripts. We find that the orthologous human network is enriched for cancer-causing genes, underscoring the importance of the subnetwork's predictions in understanding stress biology. PMID:25411400
Neural model of gene regulatory network: a survey on supportive meta-heuristics.
Biswas, Surama; Acharyya, Sriyankar
2016-06-01
A gene regulatory network (GRN) arises from regulatory interactions between genes, mediated by their encoded proteins, in a cellular context. Because GRNs are of great importance in disease detection and drug discovery, they have been modelled through various mathematical and computational schemes, which have been reported in survey articles. Neural and neuro-fuzzy models have attracted particular attention in bioinformatics, and meta-heuristic algorithms have proved effective for training them. Accordingly, this paper surveys neural modelling schemes of GRNs and the efficacy of meta-heuristic algorithms for parameter learning (i.e. weighting connections) within the model. It presents two structure-related approaches to inferring a GRN, the global structure approach and the substructure approach, and describes two neural modelling schemes: artificial neural network/recurrent neural network based modelling and neuro-fuzzy modelling. The meta-heuristic algorithms applied so far to learn the structure and parameters of neurally modelled GRNs are reviewed.
Combining Evidence of Preferential Gene-Tissue Relationships from Multiple Sources
Guo, Jing; Hammar, Mårten; Öberg, Lisa; Padmanabhuni, Shanmukha S.; Bjäreland, Marcus; Dalevi, Daniel
2013-01-01
An important challenge in drug discovery and disease prognosis is to predict genes that are preferentially expressed in one or a few tissues, i.e. showing considerably higher expression in one or a few tissues than in the others. Although several data sources and methods have been published explicitly for this purpose, they often disagree, and it is not evident how to retrieve these genes or how to distinguish true biological findings from those that are due to choice of method and/or experimental settings. In this work we have developed a computational approach that combines results from multiple methods and datasets with the aim of eliminating method/study-specific biases and improving the predictability of preferentially expressed human genes. A rule-based score is used to merge and assign support to the results. Five sets of genes with known tissue specificity were used for parameter pruning and cross-validation. In total we identify 3434 tissue-specific genes. We compare the highest-scoring genes with the public databases PaGenBase (microarray), TiGER (EST) and HPA (protein expression data). The results have 85% overlap with PaGenBase, 71% with TiGER and only 28% with HPA. 99% of our predictions have support from at least one of these databases. Our approach also performs better than any of the databases at identifying drug targets and biomarkers with known tissue specificity. PMID:23950964
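The abstract above does not spell out its scoring rules, so the following is only a minimal sketch of the general idea of merging gene-tissue calls from several sources and measuring overlap with a reference set. The source names, the equal weights, and the functions `combine_support` and `overlap_fraction` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical per-source weights; the actual rules are not given in the abstract.
SOURCE_WEIGHTS = {"microarray": 1.0, "est": 1.0, "protein": 1.0}


def combine_support(calls_by_source, weights=SOURCE_WEIGHTS):
    """Return {(gene, tissue): score} by summing the weights of agreeing sources."""
    scores = defaultdict(float)
    for source, calls in calls_by_source.items():
        # set() ensures each source supports a given pair at most once
        for gene, tissue in set(calls):
            scores[(gene, tissue)] += weights.get(source, 0.0)
    return dict(scores)


def overlap_fraction(predicted, reference):
    """Fraction of predicted (gene, tissue) pairs that also occur in reference."""
    predicted = set(predicted)
    if not predicted:
        return 0.0
    return len(predicted & set(reference)) / len(predicted)


# Placeholder gene names, not real data
calls = {
    "microarray": [("GENE_A", "liver"), ("GENE_B", "brain")],
    "est": [("GENE_A", "liver")],
    "protein": [("GENE_A", "liver"), ("GENE_B", "heart")],
}
scores = combine_support(calls)
```

Pairs supported by more sources receive a higher score, which is one simple way to down-weight findings that depend on a single method or experiment.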
Jühling, Frank; Pütz, Joern; Bernt, Matthias; Donath, Alexander; Middendorf, Martin; Florentz, Catherine; Stadler, Peter F.
2012-01-01
Transfer RNAs (tRNAs) are present in all types of cells as well as in organelles. tRNAs of animal mitochondria show a low level of primary sequence conservation and exhibit ‘bizarre’ secondary structures, lacking complete domains of the common cloverleaf. Such sequences are hard to detect and hence frequently missed in computational analyses and mitochondrial genome annotation. Here, we introduce an automatic annotation procedure for mitochondrial tRNA genes in Metazoa based on sequence and structural information in manually curated covariance models. The method, applied to re-annotate 1876 available metazoan mitochondrial RefSeq genomes, makes it possible to distinguish between remaining functional genes and degrading ‘pseudogenes’, even at early stages of divergence. The subsequent analysis of a comprehensive set of mitochondrial tRNA genes gives new insights into the evolution of structures of mitochondrial tRNA sequences as well as into the mechanisms of genome rearrangements. We find frequent losses of tRNA genes concentrated in basal Metazoa, frequent independent losses of individual parts of tRNA genes, particularly in Arthropoda, and widespread conserved overlaps of tRNAs in opposite reading direction. Direct evidence for several recent Tandem Duplication-Random Loss events is gained, demonstrating that this mechanism has an impact on the appearance of new mitochondrial gene orders. PMID:22139921
Beretta, Lorenzo; Santaniello, Alessandro; van Riel, Piet L C M; Coenen, Marieke J H; Scorza, Raffaella
2010-08-06
Epistasis is recognized as a fundamental part of the genetic architecture of individuals. Several computational approaches have been developed to model gene-gene interactions in case-control studies; however, none of them is suitable for time-dependent analysis. Herein we introduce the Survival Dimensionality Reduction (SDR) algorithm, a non-parametric method specifically designed to detect epistasis in lifetime datasets. The algorithm requires no specification of either the underlying survival distribution or the underlying interaction model, and proved satisfactorily powerful in detecting a set of causative genes in synthetic epistatic lifetime datasets with a limited number of samples and a high degree of right-censorship (up to 70%). The SDR method was then applied to a series of 386 Dutch patients with active rheumatoid arthritis who were treated with anti-TNF biological agents. Among a set of 39 candidate genes, none of which showed a detectable marginal effect on anti-TNF responses, the SDR algorithm found that the rs1801274 SNP in the Fc gamma RIIa gene and the rs10954213 SNP in the IRF5 gene non-linearly interact to predict clinical remission after anti-TNF biologicals. Simulation studies and application in a real-world setting support the capability of the SDR algorithm to model epistatic interactions in candidate-gene studies in the presence of right-censored data. http://sourceforge.net/projects/sdrproject/.
OSG-GEM: Gene Expression Matrix Construction Using the Open Science Grid
Poehlman, William L.; Rynge, Mats; Branton, Chris; Balamurugan, D.; Feltus, Frank A.
2016-01-01
High-throughput DNA sequencing technology has revolutionized the study of gene expression while introducing significant computational challenges for biologists. These computational challenges include access to sufficient computer hardware and functional data processing workflows. Both these challenges are addressed with our scalable, open-source Pegasus workflow for processing high-throughput DNA sequence datasets into a gene expression matrix (GEM) using computational resources available to U.S.-based researchers on the Open Science Grid (OSG). We describe the usage of the workflow (OSG-GEM), discuss workflow design, inspect performance data, and assess accuracy in mapping paired-end sequencing reads to a reference genome. A target OSG-GEM user is proficient with the Linux command line and possesses basic bioinformatics experience. The user may run this workflow directly on the OSG or adapt it to novel computing environments. PMID:27499617
Microarray-based cancer prediction using soft computing approach.
Wang, Xiaosheng; Gotoh, Osamu
2009-05-26
One of the difficulties in using gene expression profiles to predict cancer is how to effectively select a few informative genes, from thousands or tens of thousands of genes, to construct accurate prediction models. We screen highly discriminative genes and gene pairs to create simple prediction models involving single genes or gene pairs on the basis of a soft computing approach and rough set theory. Accurate cancer prediction is obtained when we apply the simple prediction models to four cancer gene expression datasets: CNS tumor, colon tumor, lung cancer and DLBCL. Some genes closely correlated with the pathogenesis of specific or general cancers are identified. In contrast with other models, our models are simple, effective and robust. Moreover, our models are interpretable because they are based on decision rules. Our results demonstrate that very simple models may perform well on molecular cancer prediction and that important gene markers of cancer can be detected if the gene selection approach is chosen reasonably.
Curtis, Ross E; Kim, Seyoung; Woolford, John L; Xu, Wenjie; Xing, Eric P
2013-03-21
Association analysis using genome-wide expression quantitative trait locus (eQTL) data investigates the effect that genetic variation has on cellular pathways and leads to the discovery of candidate regulators. Traditional analysis of eQTL data via pairwise statistical significance tests or linear regression does not leverage the availability of the structural information of the transcriptome, such as presence of gene networks that reveal correlation and potentially regulatory relationships among the study genes. We employ a new eQTL mapping algorithm, GFlasso, which we have previously developed for sparse structured regression, to reanalyze a genome-wide yeast dataset. GFlasso fully takes into account the dependencies among expression traits to suppress false positives and to enhance the signal/noise ratio. Thus, GFlasso leverages the gene-interaction network to discover the pleiotropic effects of genetic loci that perturb the expression level of multiple (rather than individual) genes, which enables us to gain more power in detecting previously neglected signals that are marginally weak but pleiotropically significant. While eQTL hotspots in yeast have been reported previously as genomic regions controlling multiple genes, our analysis reveals additional novel eQTL hotspots and, more interestingly, uncovers groups of multiple contributing eQTL hotspots that affect the expression level of functional gene modules. To our knowledge, our study is the first to report this type of gene regulation stemming from multiple eQTL hotspots. Additionally, we report the results from in-depth bioinformatics analysis for three groups of these eQTL hotspots: ribosome biogenesis, telomere silencing, and retrotransposon biology. We suggest candidate regulators for the functional gene modules that map to each group of hotspots. 
Not only do we find that many of these candidate regulators contain mutations in the promoter and coding regions of the genes, but in the case of the Ribi group we also provide experimental evidence suggesting that the identified candidates do regulate the target genes predicted by GFlasso. Thus, this structured association analysis of a yeast eQTL dataset via GFlasso, coupled with extensive bioinformatics analysis, discovers a novel regulation pattern between multiple eQTL hotspots and functional gene modules. Furthermore, this analysis demonstrates the potential of GFlasso as a powerful computational tool for eQTL studies that exploit the rich structural information among expression traits due to correlation, regulation, or other forms of biological dependencies.
Sequential Logic Model Deciphers Dynamic Transcriptional Control of Gene Expressions
Yeo, Zhen Xuan; Wong, Sum Thai; Arjunan, Satya Nanda Vel; Piras, Vincent; Tomita, Masaru; Selvarajoo, Kumar; Giuliani, Alessandro; Tsuchiya, Masa
2007-01-01
Background: Cellular signaling involves a sequence of events from ligand binding to membrane receptors through transcription factor activation and the induction of mRNA expression. The transcriptional-regulatory system plays a pivotal role in the control of gene expression. A novel computational approach to the study of gene regulation circuits is presented here. Methodology: Based on the concept of the finite state machine, which provides a discrete view of gene regulation, a novel sequential logic model (SLM) is developed to decipher control mechanisms of dynamic transcriptional regulation of gene expression. The SLM technique is also used to systematically analyze the dynamic function of transcriptional inputs and the dependency and cooperativity, such as synergy effects, among the binding sites with respect to when, how much and how fast the gene of interest is expressed. Principal Findings: SLM is verified with a set of well-studied expression data on endo16 of Strongylocentrotus purpuratus (sea urchin) during embryonic midgut development. A dynamic regulatory mechanism for endo16 expression controlled by three binding sites, UI, R and Otx, is identified and demonstrated to be consistent with experimental findings. Furthermore, we show that during the transition from specification to differentiation in the wild-type endo16 expression profile, SLM reveals that three binary activities are not sufficient to explain the transcriptional regulation of endo16 expression and that additional activities of binding sites are required. Further analyses suggest a detailed mechanism of R switch activity in which an indirect dependency occurs between UI activity and the R switch during the specification to differentiation stage. Conclusions/Significance: The sequential logic formalism allows for a simplification of regulation network dynamics, going from a continuous to a discrete representation of gene activation in time. In effect our SLM is non-parametric and model-independent, yet provides rich biological insight. The demonstration of the efficacy of this approach in endo16 is a promising step for further application of the proposed method. PMID:17712424
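The finite state machine view of gene regulation mentioned in this abstract can be illustrated with a toy example. This is a sketch only: the three expression states, the two binary binding-site inputs, and the transition table are invented for illustration and are not the endo16 model or the authors' SLM.

```python
# Toy finite state machine: expression state updated from two binary
# binding-site inputs at each discrete time step.
# States, inputs and transitions are illustrative assumptions.
TRANSITIONS = {
    ("off", (0, 0)): "off",
    ("off", (1, 0)): "low",
    ("off", (0, 1)): "off",
    ("off", (1, 1)): "high",
    ("low", (0, 0)): "off",
    ("low", (1, 0)): "low",
    ("low", (0, 1)): "off",
    ("low", (1, 1)): "high",
    ("high", (0, 0)): "low",
    ("high", (1, 0)): "low",
    ("high", (0, 1)): "low",
    ("high", (1, 1)): "high",
}


def run_machine(inputs, state="off"):
    """Return the list of states visited, starting from the initial state."""
    trajectory = [state]
    for signal in inputs:
        state = TRANSITIONS[(state, signal)]
        trajectory.append(state)
    return trajectory


trajectory = run_machine([(1, 0), (1, 1), (0, 0), (0, 0)])
```

Because the next state depends on both the current state and the current inputs, the same input pattern can produce different outputs at different times, which is what distinguishes sequential logic from a purely combinational (Boolean) description.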
Bioinformatics approaches to predict target genes from transcription factor binding data.
Essebier, Alexandra; Lamprecht, Marnie; Piper, Michael; Bodén, Mikael
2017-12-01
Transcription factors regulate gene expression and play an essential role in development by maintaining proliferative states, driving cellular differentiation and determining cell fate. Transcription factors are capable of regulating multiple genes over potentially long distances making target gene identification challenging. Currently available experimental approaches to detect distal interactions have multiple weaknesses that have motivated the development of computational approaches. Although an improvement over experimental approaches, existing computational approaches are still limited in their application, with different weaknesses depending on the approach. Here, we review computational approaches with a focus on data dependency, cell type specificity and usability. With the aim of identifying transcription factor target genes, we apply available approaches to typical transcription factor experimental datasets. We show that approaches are not always capable of annotating all transcription factor binding sites; binding sites should be treated disparately; and a combination of approaches can increase the biological relevance of the set of genes identified as targets.
Distributed and grid computing projects with research focus in human health.
Diomidous, Marianna; Zikos, Dimitrios
2012-01-01
Distributed systems and grid computing systems connect several computers to obtain a higher level of performance in order to solve a problem. During the last decade, projects have used the World Wide Web to aggregate individuals' CPU power for research purposes. This paper presents the existing active large-scale distributed and grid computing projects with a research focus in human health. We found and present 11 active projects with more than 2000 Processing Units (PUs) each. The research focus for most of them is molecular biology, specifically understanding or predicting protein structure through simulation, comparing proteins, genomic analysis for disease-provoking genes, and drug design. Though not in all cases explicitly stated, common target diseases include HIV, dengue, Duchenne dystrophy, Parkinson's disease, various types of cancer and influenza. Other diseases include malaria, anthrax and Alzheimer's disease. The need for national initiatives and European collaboration on larger-scale projects is stressed, to raise citizens' awareness and encourage participation, in order to create a culture of internet volunteering altruism.
Functionally Enigmatic Genes: A Case Study of the Brain Ignorome
Pandey, Ashutosh K.; Lu, Lu; Wang, Xusheng; Homayouni, Ramin; Williams, Robert W.
2014-01-01
What proportion of genes with intense and selective expression in specific tissues, cells, or systems are still almost completely uncharacterized with respect to biological function? In what ways do these functionally enigmatic genes differ from well-studied genes? To address these two questions, we devised a computational approach that defines so-called ignoromes. As proof of principle, we extracted and analyzed a large subset of genes with intense and selective expression in brain. We find that publications associated with this set are highly skewed—the top 5% of genes absorb 70% of the relevant literature. In contrast, approximately 20% of genes have essentially no neuroscience literature. Analysis of the ignorome over the past decade demonstrates that it is stubbornly persistent, and the rapid expansion of the neuroscience literature has not had the expected effect on numbers of these genes. Surprisingly, ignorome genes do not differ from well-studied genes in terms of connectivity in coexpression networks. Nor do they differ with respect to numbers of orthologs, paralogs, or protein domains. The major distinguishing characteristic between these sets of genes is date of discovery, early discovery being associated with greater research momentum—a genomic bandwagon effect. Finally we ask to what extent massive genomic, imaging, and phenotype data sets can be used to provide high-throughput functional annotation for an entire ignorome. In a majority of cases we have been able to extract and add significant information for these neglected genes. In several cases—ELMOD1, TMEM88B, and DZANK1—we have exploited sequence polymorphisms, large phenome data sets, and reverse genetic methods to evaluate the function of ignorome genes. PMID:24523945
GraphTeams: a method for discovering spatial gene clusters in Hi-C sequencing data.
Schulz, Tizian; Stoye, Jens; Doerr, Daniel
2018-05-08
Hi-C sequencing offers novel, cost-effective means to study the spatial conformation of chromosomes. We use data obtained from Hi-C experiments to provide new evidence for the existence of spatial gene clusters. These are sets of genes with associated functionality that exhibit close proximity to each other in the spatial conformation of chromosomes across several related species. We present the first gene cluster model capable of handling spatial data. Our model generalizes a popular computational model for gene cluster prediction, called δ-teams, from sequences to graphs. Following previous lines of research, we subsequently extend our model to allow for several vertices being associated with the same label. The model, called δ-teams with families, is particularly suitable for our application as it enables handling of gene duplicates. We develop algorithmic solutions for both models. We implemented the algorithm for discovering δ-teams with families and integrated it into a fully automated workflow for discovering gene clusters in Hi-C data, called GraphTeams. We applied it to human and mouse data to find intra- and interchromosomal gene cluster candidates. The results include intrachromosomal clusters that seem to exhibit a closer proximity in space than on their chromosomal DNA sequence. We further discovered interchromosomal gene clusters that contain genes from different chromosomes within the human genome, but are located on a single chromosome in mouse. By identifying δ-teams with families, we provide a flexible model to discover gene cluster candidates in Hi-C data. Our analysis of Hi-C data from human and mouse reveals several known gene clusters (thus validating our approach), but also a few sparsely studied or possibly unknown gene cluster candidates that could be the source of further experimental investigations.
Adult onset Niemann-Pick type C disease: A clinical, neuroimaging and molecular genetic study.
Battisti, Carla; Tarugi, Patrizla; Dotti, Maria Teresa; De Stefano, Nicola; Vattimo, Angelo; Chierichetti, Francesea; Calandra, Sebastiano; Federico, Antonio
2003-11-01
We report on a patient with adult-onset Niemann-Pick type C (NPC) disease, carrying the mutations P1007 and I1061T in the NPC1 gene, presenting with marked psychiatric changes followed by dystonia and cognitive impairment. Filipin staining, single photon emission computed tomography perfusional, positron emission tomography metabolic, conventional magnetic resonance imaging, and magnetic resonance spectroscopy findings suggested a pathophysiological correlation with phenotype expression. This case expands the clinical and genetic spectrum of the rare adult-onset NPC disease phenotype.
Gupta, Nishant; Sunwoo, Bernie Y; Kotloff, Robert M
2016-09-01
Birt-Hogg-Dubé syndrome (BHD) is a rare autosomal dominant disorder caused by mutations in the Folliculin gene and is characterized by the formation of fibrofolliculomas, early onset renal cancers, pulmonary cysts, and spontaneous pneumothoraces. The exact pathogenesis of tumor and lung cyst formation in BHD remains unclear. There is great phenotypic variability in the clinical features of BHD, and patients can present with any combination of skin, pulmonary, or renal findings. More than 80% of adult patients with BHD have pulmonary cysts on high-resolution computed tomography scan of the chest.
An autonomous molecular computer for logical control of gene expression.
Benenson, Yaakov; Gil, Binyamin; Ben-Dor, Uri; Adar, Rivka; Shapiro, Ehud
2004-05-27
Early biomolecular computer research focused on laboratory-scale, human-operated computers for complex computational problems. Recently, simple molecular-scale autonomous programmable computers were demonstrated allowing both input and output information to be in molecular form. Such computers, using biological molecules as input data and biologically active molecules as outputs, could produce a system for 'logical' control of biological processes. Here we describe an autonomous biomolecular computer that, at least in vitro, logically analyses the levels of messenger RNA species, and in response produces a molecule capable of affecting levels of gene expression. The computer operates at a concentration of close to a trillion computers per microlitre and consists of three programmable modules: a computation module, that is, a stochastic molecular automaton; an input module, by which specific mRNA levels or point mutations regulate software molecule concentrations, and hence automaton transition probabilities; and an output module, capable of controlled release of a short single-stranded DNA molecule. This approach might be applied in vivo to biochemical sensing, genetic engineering and even medical diagnosis and treatment. As a proof of principle we programmed the computer to identify and analyse mRNA of disease-related genes associated with models of small-cell lung cancer and prostate cancer, and to produce a single-stranded DNA molecule modelled after an anticancer drug.
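The notion of a stochastic automaton whose transition probabilities are set by input levels can be sketched in a few lines. The sketch assumes a simplified two-state automaton ("yes"/"no") that reads one indicator at a time and stays in "yes" with a probability taken to reflect that indicator's level. The function names and the probabilities are hypothetical and do not describe the molecular implementation.

```python
import random


def probability_of_yes(indicator_probs):
    """Probability of ending in 'yes' when every indicator must keep the state."""
    p_yes = 1.0
    for p in indicator_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
        p_yes *= p
    return p_yes


def simulate(indicator_probs, rng):
    """Run one stochastic pass of the toy automaton; 'no' is absorbing."""
    state = "yes"
    for p in indicator_probs:
        # stay in 'yes' with probability p, otherwise fall into 'no'
        if state == "yes" and rng.random() >= p:
            state = "no"
    return state


p_final = probability_of_yes([0.5, 0.5])
```

In a large population of identical automata, the fraction ending in "yes" would approach `probability_of_yes`, so the output concentration acts as a graded readout of how well all indicators are satisfied.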
Le Meur, Nolwenn; Gentleman, Robert
2008-01-01
Background: Synthetic lethality defines a genetic interaction where the combination of mutations in two or more genes leads to cell death. The implications of synthetic lethal screens have been discussed in the context of drug development as synthetic lethal pairs could be used to selectively kill cancer cells, but leave normal cells relatively unharmed. A challenge is to assess genome-wide experimental data and integrate the results to better understand the underlying biological processes. We propose statistical and computational tools that can be used to find relationships between synthetic lethality and cellular organizational units. Results: In Saccharomyces cerevisiae, we identified multi-protein complexes and pairs of multi-protein complexes that share an unusually high number of synthetic genetic interactions. As previously predicted, we found that synthetic lethality can arise from subunits of an essential multi-protein complex or between pairs of multi-protein complexes. Finally, using multi-protein complexes allowed us to take into account the pleiotropic nature of the gene products. Conclusions: Modeling synthetic lethality using current estimates of the yeast interactome is an efficient approach to disentangle some of the complex molecular interactions that drive a cell. Our model in conjunction with applied statistical methods and computational methods provides new tools to better characterize synthetic genetic interactions. PMID:18789146
WGE: a CRISPR database for genome engineering.
Hodgkins, Alex; Farne, Anna; Perera, Sajith; Grego, Tiago; Parry-Smith, David J; Skarnes, William C; Iyer, Vivek
2015-09-15
The rapid development of CRISPR-Cas9 mediated genome editing techniques has given rise to a number of online and stand-alone tools to find and score CRISPR sites for whole genomes. Here we describe the Wellcome Trust Sanger Institute Genome Editing database (WGE), which uses novel methods to compute, visualize and select optimal CRISPR sites in a genome browser environment. The WGE database currently stores single and paired CRISPR sites and pre-calculated off-target information for CRISPRs located in the mouse and human exomes. Scoring and display of off-target sites is simple and intuitive, and filters can be applied to identify high-quality CRISPR sites rapidly. WGE also provides a tool for the design and display of gene targeting vectors in the same genome browser, along with gene models, protein translation and variation tracks. WGE is open, extensible and can be set up to compute and present CRISPR sites for any genome. The WGE database is freely available at www.sanger.ac.uk/htgt/wge. Contact: vvi@sanger.ac.uk or skarnes@sanger.ac.uk.
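As a rough illustration of what finding CRISPR sites in a sequence can involve, the sketch below scans both strands for a 20-nt protospacer followed by an NGG PAM, the commonly used SpCas9 convention. This is an assumption for illustration and not a description of how WGE computes or scores its sites; `find_sites` and `reverse_complement` are hypothetical helper names.

```python
import re


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_sites(seq, guide_len=20):
    """Return sorted (start, strand, guide) tuples for candidate sites.

    start is the 0-based position of the full site (guide plus PAM)
    on the forward strand.
    """
    seq = seq.upper()
    sites = []
    # Forward strand: guide followed by NGG; lookahead allows overlapping hits
    forward = re.compile(r"(?=([ACGT]{%d})[ACGT]GG)" % guide_len)
    for match in forward.finditer(seq):
        sites.append((match.start(), "+", match.group(1)))
    # Reverse strand: appears on the forward strand as CCN followed by the guide
    reverse = re.compile(r"(?=CC[ACGT]([ACGT]{%d}))" % guide_len)
    for match in reverse.finditer(seq):
        sites.append((match.start(), "-", reverse_complement(match.group(1))))
    return sorted(sites)


forward_hit = find_sites("A" * 20 + "TGG")
reverse_hit = find_sites("CCT" + "A" * 20)
```

A real tool would go on to score each candidate against the whole genome for off-target matches, which is the expensive step that a pre-computed database avoids repeating.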
Computational characterization of chromatin domain boundary-associated genomic elements
Hong, Seungpyo
2017-01-01
Topologically associated domains (TADs) are 3D genomic structures with high internal interactions that play important roles in genome compaction and gene regulation. Their genomic locations and their association with CCCTC-binding factor (CTCF)-binding sites and transcription start sites (TSSs) were recently reported. However, the relationship between TADs and other genomic elements has not been systematically evaluated. This was addressed in the present study, with a focus on the enrichment of these genomic elements and their ability to predict the TAD boundary region. We found that consensus CTCF-binding sites were strongly associated with TAD boundaries as well as with the transcription factors (TFs) Zinc finger protein (ZNF)143 and Yin Yang (YY)1. TAD boundary-associated genomic elements include DNase I-hypersensitive sites, H3K36 trimethylation, TSSs, RNA polymerase II, and TFs such as Specificity protein 1, ZNF274 and SIX homeobox 5. Computational modeling with these genomic elements suggests that they have distinct roles in TAD boundary formation. We propose a structural model of TAD boundaries based on these findings that provides a basis for studying the mechanism of chromatin structure formation and gene regulation. PMID:28977568
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ecale Zhou, Carol L.
2016-07-05
Compare Gene Calls (CGC) is a Python code used for combining and comparing gene calls from any number of gene callers. A gene caller is a computer program that predicts the extents of open reading frames within genomes of biological organisms.
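As a rough illustration of what combining and comparing gene calls can involve, the sketch below records which callers support each predicted open reading frame. The tuple layout `(start, end, strand)`, the caller names and the coordinates are hypothetical; this is not CGC's code or data model.

```python
from collections import defaultdict


def compare_gene_calls(calls_by_caller):
    """Map each (start, end, strand) call to the sorted callers that made it."""
    support = defaultdict(set)
    for caller, calls in calls_by_caller.items():
        for call in calls:
            support[call].add(caller)
    return {call: sorted(callers) for call, callers in support.items()}


# Hypothetical output of two gene callers on the same genome.
calls = {
    "caller_a": [(100, 400, "+"), (900, 1500, "-")],
    "caller_b": [(100, 400, "+"), (950, 1500, "-")],
}
support = compare_gene_calls(calls)

# Calls that every caller agrees on exactly.
unanimous = [call for call, who in support.items() if len(who) == len(calls)]
```

Exact coordinate matching is the strictest possible comparison. A practical tool would probably also flag calls that share a stop codon but differ in start position, as the second pair above does.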
An ensemble rank learning approach for gene prioritization.
Lee, Po-Feng; Soo, Von-Wun
2013-01-01
Several different computational approaches have been developed to solve the gene prioritization problem. We intend to use the ensemble boosting learning techniques to combine variant computational approaches for gene prioritization in order to improve the overall performance. In particular we add a heuristic weighting function to the Rankboost algorithm according to: 1) the absolute ranks generated by the adopted methods for a certain gene, and 2) the ranking relationship between all gene-pairs from each prioritization result. We select 13 known prostate cancer genes in OMIM database as training set and protein coding gene data in HGNC database as test set. We adopt the leave-one-out strategy for the ensemble rank boosting learning. The experimental results show that our ensemble learning approach outperforms the four gene-prioritization methods in ToppGene suite in the ranking results of the 13 known genes in terms of mean average precision, ROC and AUC measures.
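The mean average precision measure mentioned above can be sketched as follows for a single ranked list. The gene identifiers are placeholders, and the abstract does not say how ties or missing genes were handled, so this is only the textbook definition.

```python
def average_precision(ranked_genes, known_genes):
    """Average of precision@k taken at the rank of each known gene."""
    known = set(known_genes)
    hits = 0
    precision_sum = 0.0
    for rank, gene in enumerate(ranked_genes, start=1):
        if gene in known:
            hits += 1
            precision_sum += hits / rank
    # Known genes that never appear in the ranking contribute zero.
    return precision_sum / len(known) if known else 0.0


# Placeholder ranking with two known disease genes at ranks 1 and 3.
ap = average_precision(["G1", "G2", "G3", "G4"], {"G1", "G3"})
```

Mean average precision is then the mean of this value over all evaluation runs, for example over the folds of a leave-one-out experiment.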
Text mining-based in silico drug discovery in oral mucositis caused by high-dose cancer therapy.
Kirk, Jon; Shah, Nirav; Noll, Braxton; Stevens, Craig B; Lawler, Marshall; Mougeot, Farah B; Mougeot, Jean-Luc C
2018-08-01
Oral mucositis (OM) is a major dose-limiting side effect of chemotherapy and radiation used in cancer treatment. Due to the complex nature of OM, currently available drug-based treatments are of limited efficacy. Our objectives were (i) to determine genes and molecular pathways associated with OM and wound healing using computational tools and publicly available data and (ii) to identify drugs formulated for topical use targeting the relevant OM molecular pathways. OM and wound healing-associated genes were determined by text mining, and the intersection of the two gene sets was selected for gene ontology analysis using the GeneCodis program. Protein interaction network analysis was performed using STRING-db. Enriched gene sets belonging to the identified pathways were queried against the Drug-Gene Interaction database to find drug candidates for topical use in OM. Our analysis identified 447 genes common to both the "OM" and "wound healing" text mining concepts. Gene enrichment analysis yielded 20 genes representing six pathways and targetable by a total of 32 drugs which could possibly be formulated for topical application. A manual search on ClinicalTrials.gov confirmed no relevant pathway/drug candidate had been overlooked. Twenty-five of the 32 drugs can directly affect the PTGS2 (COX-2) pathway, the pathway that has been targeted in previous clinical trials with limited success. Drug discovery using in silico text mining and pathway analysis tools can facilitate the identification of existing drugs that have the potential of topical administration to improve OM treatment.
Yang, Laurence; Tan, Justin; O'Brien, Edward J; Monk, Jonathan M; Kim, Donghyuk; Li, Howard J; Charusanti, Pep; Ebrahim, Ali; Lloyd, Colton J; Yurkovich, James T; Du, Bin; Dräger, Andreas; Thomas, Alex; Sun, Yuekai; Saunders, Michael A; Palsson, Bernhard O
2015-08-25
Finding the minimal set of gene functions needed to sustain life is of both fundamental and practical importance. Minimal gene lists have been proposed by using comparative genomics-based core proteome definitions. A definition of a core proteome that is supported by empirical data, is understood at the systems-level, and provides a basis for computing essential cell functions is lacking. Here, we use a systems biology-based genome-scale model of metabolism and expression to define a functional core proteome consisting of 356 gene products, accounting for 44% of the Escherichia coli proteome by mass based on proteomics data. This systems biology core proteome includes 212 genes not found in previous comparative genomics-based core proteome definitions, accounts for 65% of known essential genes in E. coli, and has 78% gene function overlap with minimal genomes (Buchnera aphidicola and Mycoplasma genitalium). Based on transcriptomics data across environmental and genetic backgrounds, the systems biology core proteome is significantly enriched in nondifferentially expressed genes and depleted in differentially expressed genes. Compared with the noncore, core gene expression levels are also similar across genetic backgrounds (two times higher Spearman rank correlation) and exhibit significantly more complex transcriptional and posttranscriptional regulatory features (40% more transcription start sites per gene, 22% longer 5'UTR). Thus, genome-scale systems biology approaches rigorously identify a functional core proteome needed to support growth. This framework, validated by using high-throughput datasets, facilitates a mechanistic understanding of systems-level core proteome function through in silico models; it de facto defines a paleome.
Prioritization of Disease Susceptibility Genes Using LSM/SVD.
Gong, Lejun; Yang, Ronggen; Yan, Qin; Sun, Xiao
2013-12-01
Understanding the role of genetics in diseases is one of the most important tasks in the postgenome era. It is generally too expensive and time consuming to perform experimental validation for all candidate genes related to disease. Computational methods play important roles in prioritizing these candidates. Herein, we propose an approach to prioritize disease genes using latent semantic mapping based on singular value decomposition. Our hypothesis is that functionally similar genes are likely to cause similar diseases. We therefore measure the functional similarity between known disease susceptibility genes and unknown genes in order to predict new disease susceptibility genes. Taking autism as an example, analysis of the top ten prioritized genes suggests that they might be autism susceptibility genes, which also indicates that our approach could discover new disease susceptibility genes. The novel approach of disease gene prioritization could discover new disease susceptibility genes and latent disease-gene relations. The prioritized results could also support the interpretive diversity and experimental views as computational evidence for disease researchers.
Guirao-Rico, Sara; Sánchez-Gracia, Alejandro; Charlesworth, Deborah
2017-03-01
DNA sequence diversity in genes in the partially sex-linked pseudoautosomal region (PAR) of the sex chromosomes of the plant Silene latifolia is higher than expected from within-species diversity of other genes. This could be the footprint of sexually antagonistic (SA) alleles that are maintained by balancing selection in a PAR gene (or genes) and affect polymorphism in linked genome regions. SA selection is predicted to occur during sex chromosome evolution, but it is important to test whether the unexpectedly high sequence polymorphism could be explained without it, purely by the combined effects of partial linkage with the sex-determining region and the population's demographic history, including possible introgression from Silene dioica. To test this, we applied approximate Bayesian computation-based model choice to autosomal sequence diversity data, to find the most plausible scenario for the recent history of S. latifolia and then to estimate the posterior density of the most relevant parameters. We then used these densities to simulate variation to be expected at PAR genes. We conclude that an excess of variants at high frequencies at PAR genes should arise in S. latifolia populations only for genes with strong associations with fully sex-linked genes, which requires closer linkage with the fully sex-linked region than that estimated for the PAR genes where apparent deviations from neutrality were observed. These results support the need to invoke selection to explain the S. latifolia PAR gene diversity, and encourage further work to test the possibility of balancing selection due to sexual antagonism. © 2016 John Wiley & Sons Ltd.
Kazemian, Majid; Zhu, Qiyun; Halfon, Marc S; Sinha, Saurabh
2011-12-01
Despite recent advances in experimental approaches for identifying transcriptional cis-regulatory modules (CRMs, 'enhancers'), direct empirical discovery of CRMs for all genes in all cell types and environmental conditions is likely to remain an elusive goal. Effective methods for computational CRM discovery are thus a critically needed complement to empirical approaches. However, existing computational methods that search for clusters of putative binding sites are ineffective if the relevant TFs and/or their binding specificities are unknown. Here, we provide a significantly improved method for 'motif-blind' CRM discovery that does not depend on knowledge or accurate prediction of TF-binding motifs and is effective when limited knowledge of functional CRMs is available to 'supervise' the search. We propose a new statistical method, based on 'Interpolated Markov Models', for motif-blind, genome-wide CRM discovery. It captures the statistical profile of variable length words in known CRMs of a regulatory network and finds candidate CRMs that match this profile. The method also uses orthologs of the known CRMs from closely related genomes. We perform in silico evaluation of predicted CRMs by assessing whether their neighboring genes are enriched for the expected expression patterns. This assessment uses a novel statistical test that extends the widely used Hypergeometric test of gene set enrichment to account for variability in intergenic lengths. We find that the new CRM prediction method is superior to existing methods. Finally, we experimentally validate 12 new CRM predictions by examining their regulatory activity in vivo in Drosophila; 10 of the tested CRMs were found to be functional, while 6 of the top 7 predictions showed the expected activity patterns. We make our program available as downloadable source code, and as a plugin for a genome browser installed on our servers. © The Author(s) 2011. Published by Oxford University Press.
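For orientation, the standard hypergeometric test of gene set enrichment that the abstract builds on can be written in a few lines. This sketch does not include the authors' extension for variable intergenic lengths; the parameter names and the numbers in the example are assumptions chosen for illustration.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, drawn, overlap):
    """P(X >= overlap) when `drawn` genes are sampled without replacement.

    population: total number of genes considered
    annotated:  genes in the population with the expression pattern
    drawn:      genes neighboring the predicted CRMs
    overlap:    drawn genes that also carry the annotation
    """
    upper = min(annotated, drawn)
    total = comb(population, drawn)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, drawn - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Illustrative numbers only: 20 genes, 5 annotated, 4 drawn, 2 overlapping.
p_value = hypergeom_enrichment_p(20, 5, 4, 2)
```

The plain test treats every gene as equally likely to be drawn, which is the assumption the abstract's extension relaxes for genes with long intergenic regions.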
Piccoli, Stefano; Andreolli, Marco; Giorgetti, Alejandro; Zordan, Fabio; Lampis, Silvia; Vallini, Giovanni
2014-05-01
Burkholderia fungorum DBT1, first isolated from settling particulate matter of an oil refinery wastewater, is a bacterial strain which has been shown capable of utilizing several polycyclic aromatic hydrocarbons (PAHs) including dibenzothiophene (DBT). In particular, this microbe is able to efficiently degrade DBT through the Kodama pathway. Previous investigations have led to the identification of six of the eight genes required for DBT degradation. In the present study, a combined experimental/computational approach was adopted to identify and in silico characterize the two missing genes, namely a ferredoxin reductase and a hydratase-aldolase. Thus, the finding of all enzymatic components of the Kodama pathway in B. fungorum DBT1 makes this bacterial strain amenable for possible exploitation in soil bioremediation protocols. © 2014 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
NASA Astrophysics Data System (ADS)
Endy, Drew; You, Lingchong; Yin, John; Molineux, Ian J.
2000-05-01
We created a simulation based on experimental data from bacteriophage T7 that computes the developmental cycle of the wild-type phage and also of mutants that have an altered genome order. We used the simulation to compute the fitness of more than 10^5 mutants. We tested these computations by constructing and experimentally characterizing T7 mutants in which we repositioned gene 1, coding for T7 RNA polymerase. Computed protein synthesis rates for ectopic gene 1 strains were in moderate agreement with observed rates. Computed phage-doubling rates were close to observations for two of four strains, but significantly overestimated those of the other two. Computations indicate that the genome organization of wild-type T7 is nearly optimal for growth: only 2.8% of random genome permutations were computed to grow faster, the highest 31% faster, than wild type. Specific discrepancies between computations and observations suggest that a better understanding of the translation efficiency of individual mRNAs and the functions of qualitatively "nonessential" genes will be needed to improve the T7 simulation. In silico representations of biological systems can serve to assess and advance our understanding of the underlying biology. Iteration between computation, prediction, and observation should increase the rate at which biological hypotheses are formulated and tested.
Genecentric: a package to uncover graph-theoretic structure in high-throughput epistasis data.
Gallant, Andrew; Leiserson, Mark D M; Kachalov, Maxim; Cowen, Lenore J; Hescott, Benjamin J
2013-01-18
New technology has resulted in high-throughput screens for pairwise genetic interactions in yeast and other model organisms. For each pair in a collection of non-essential genes, an epistasis score is obtained, representing how much sicker (or healthier) the double-knockout organism will be compared to what would be expected from the sickness of the component single knockouts. Recent algorithmic work has identified graph-theoretic patterns in this data that can indicate functional modules, and even sets of genes that may occur in compensatory pathways, such as a BPM-type schema first introduced by Kelley and Ideker. However, to date, any algorithms for finding such patterns in the data were implemented internally, with no software being made publicly available. Genecentric is a new package that implements a parallelized version of the Leiserson et al. algorithm (J Comput Biol 18:1399-1409, 2011) for generating generalized BPMs from high-throughput genetic interaction data. Given a matrix of weighted epistasis values for a set of double knock-outs, Genecentric returns a list of generalized BPMs that may represent compensatory pathways. Genecentric also has an extension, GenecentricGO, to query FuncAssociate (Bioinformatics 25:3043-3044, 2009) to retrieve GO enrichment statistics on generated BPMs. Python is the only dependency, and our web site provides working examples and documentation. We find that Genecentric can be used to find coherent functional and perhaps compensatory gene sets from high throughput genetic interaction data. Genecentric is made freely available for download under the GPLv2 from http://bcb.cs.tufts.edu/genecentric.
Genecentric: a package to uncover graph-theoretic structure in high-throughput epistasis data
2013-01-01
Background: New technology has resulted in high-throughput screens for pairwise genetic interactions in yeast and other model organisms. For each pair in a collection of non-essential genes, an epistasis score is obtained, representing how much sicker (or healthier) the double-knockout organism will be compared to what would be expected from the sickness of the component single knockouts. Recent algorithmic work has identified graph-theoretic patterns in this data that can indicate functional modules, and even sets of genes that may occur in compensatory pathways, such as a BPM-type schema first introduced by Kelley and Ideker. However, to date, any algorithms for finding such patterns in the data were implemented internally, with no software being made publicly available. Results: Genecentric is a new package that implements a parallelized version of the Leiserson et al. algorithm (J Comput Biol 18:1399-1409, 2011) for generating generalized BPMs from high-throughput genetic interaction data. Given a matrix of weighted epistasis values for a set of double knock-outs, Genecentric returns a list of generalized BPMs that may represent compensatory pathways. Genecentric also has an extension, GenecentricGO, to query FuncAssociate (Bioinformatics 25:3043-3044, 2009) to retrieve GO enrichment statistics on generated BPMs. Python is the only dependency, and our web site provides working examples and documentation. Conclusion: We find that Genecentric can be used to find coherent functional and perhaps compensatory gene sets from high throughput genetic interaction data. Genecentric is made freely available for download under the GPLv2 from http://bcb.cs.tufts.edu/genecentric. PMID:23331614
What is bioinformatics? A proposed definition and overview of the field.
Luscombe, N M; Greenbaum, D; Gerstein, M
2001-01-01
The recent flood of data from genome sequences and functional genomics has given rise to a new field, bioinformatics, which combines elements of biology and computer science. Here we propose a definition for this new field and review some of the research that is being pursued, particularly in relation to transcriptional regulatory systems. Our definition is as follows: Bioinformatics is conceptualizing biology in terms of macromolecules (in the sense of physical-chemistry) and then applying "informatics" techniques (derived from disciplines such as applied maths, computer science, and statistics) to understand and organize the information associated with these molecules, on a large-scale. Analyses in bioinformatics predominantly focus on three types of large datasets available in molecular biology: macromolecular structures, genome sequences, and the results of functional genomics experiments (e.g. expression data). Additional information includes the text of scientific papers and "relationship data" from metabolic pathways, taxonomy trees, and protein-protein interaction networks. Bioinformatics employs a wide range of computational techniques including sequence and structural alignment, database design and data mining, macromolecular geometry, phylogenetic tree construction, prediction of protein structure and function, gene finding, and expression data clustering. The emphasis is on approaches integrating a variety of computational methods and heterogeneous data sources. Finally, bioinformatics is a practical discipline. We survey some representative applications, such as finding homologues, designing drugs, and performing large-scale censuses. Additional information pertinent to the review is available over the web at http://bioinfo.mbb.yale.edu/what-is-it.
Van Loo, Peter; Aerts, Stein; Thienpont, Bernard; De Moor, Bart; Moreau, Yves; Marynen, Peter
2008-01-01
We present ModuleMiner, a novel algorithm for computationally detecting cis-regulatory modules (CRMs) in a set of co-expressed genes. ModuleMiner outperforms other methods for CRM detection on benchmark data, and successfully detects CRMs in tissue-specific microarray clusters and in embryonic development gene sets. Interestingly, CRM predictions for differentiated tissues exhibit strong enrichment close to the transcription start site, whereas CRM predictions for embryonic development gene sets are depleted in this region. PMID:18394174
How many bootstrap replicates are necessary?
Pattengale, Nicholas D; Alipour, Masoud; Bininda-Emonds, Olaf R P; Moret, Bernard M E; Stamatakis, Alexandros
2010-03-01
Phylogenetic bootstrapping (BS) is a standard technique for inferring confidence values on phylogenetic trees that is based on reconstructing many trees from minor variations of the input data, trees called replicates. BS is used with all phylogenetic reconstruction approaches, but we focus here on one of the most popular, maximum likelihood (ML). Because ML inference is so computationally demanding, it has proved too expensive to date to assess the impact of the number of replicates used in BS on the relative accuracy of the support values. For the same reason, a rather small number (typically 100) of BS replicates are computed in real-world studies. Stamatakis et al. recently introduced a BS algorithm that is 1 to 2 orders of magnitude faster than previous techniques, while yielding qualitatively comparable support values, making an experimental study possible. In this article, we propose stopping criteria--that is, thresholds computed at runtime to determine when enough replicates have been generated--and we report on the first large-scale experimental study to assess the effect of the number of replicates on the quality of support values, including the performance of our proposed criteria. We run our tests on 17 diverse real-world DNA--single-gene as well as multi-gene--datasets, which include 125-2,554 taxa. We find that our stopping criteria typically stop computations after 100-500 replicates (although the most conservative criterion may continue for several thousand replicates) while producing support values that correlate at better than 99.5% with the reference values on the best ML trees. Significantly, we also find that the stopping criteria can recommend very different numbers of replicates for different datasets of comparable sizes. 
Our results are thus twofold: (i) they give the first experimental assessment of the effect of the number of BS replicates on the quality of support values returned through BS, and (ii) they validate our proposals for stopping criteria. Practitioners will no longer have to enter a guess nor worry about the quality of support values; moreover, with most counts of replicates in the 100-500 range, robust BS under ML inference becomes computationally practical for most datasets. The complete test suite is available at http://lcbb.epfl.ch/BS.tar.bz2, and BS with our stopping criteria is included in the latest release of RAxML v7.2.5, available at http://wwwkramer.in.tum.de/exelixis/software.html.
UniGene Tabulator: a full parser for the UniGene format.
Lenzi, Luca; Frabetti, Flavia; Facchin, Federica; Casadei, Raffaella; Vitale, Lorenza; Canaider, Silvia; Carinci, Paolo; Zannotti, Maria; Strippoli, Pierluigi
2006-10-15
UniGene Tabulator 1.0 provides a solution for full parsing of UniGene flat file format; it implements a structured graphical representation of each data field present in UniGene following import into a common database managing system usable in a personal computer. This database includes related tables for sequence, protein similarity, sequence-tagged site (STS) and transcript map interval (TXMAP) data, plus a summary table where each record represents a UniGene cluster. UniGene Tabulator enables full local management of UniGene data, allowing parsing, querying, indexing, retrieving, exporting and analysis of UniGene data in a relational database form, usable on Macintosh (OS X 10.3.9 or later) and Windows (2000, with service pack 4, XP, with service pack 2 or later) operating systems-based computers. The current release, including both the FileMaker runtime applications, is freely available at http://apollo11.isto.unibo.it/software/
Yu, Ron X.; Liu, Jie; True, Nick; Wang, Wei
2008-01-01
A major challenge in the post-genome era is to reconstruct regulatory networks from the biological knowledge accumulated up to date. The development of tools for identifying direct target genes of transcription factors (TFs) is critical to this endeavor. Given a set of microarray experiments, a probabilistic model called TRANSMODIS has been developed which can infer the direct targets of a TF by integrating sequence motif, gene expression and ChIP-chip data. The performance of TRANSMODIS was first validated on a set of transcription factor perturbation experiments (TFPEs) involving Pho4p, a well studied TF in Saccharomyces cerevisiae. TRANSMODIS removed elements of arbitrariness in manual target gene selection process and produced results that concur with one's intuition. TRANSMODIS was further validated on a genome-wide scale by comparing it with two other methods in Saccharomyces cerevisiae. The usefulness of TRANSMODIS was then demonstrated by applying it to the identification of direct targets of DAF-16, a critical TF regulating ageing in Caenorhabditis elegans. We found that 189 genes were tightly regulated by DAF-16. In addition, DAF-16 has differential preference for motifs when acting as an activator or repressor, which awaits experimental verification. TRANSMODIS is computationally efficient and robust, making it a useful probabilistic framework for finding immediate targets. PMID:18350157
A systematic comparison of error correction enzymes by next-generation sequencing
Lubock, Nathan B.; Zhang, Di; Sidore, Angus M.; ...
2017-08-01
Gene synthesis, the process of assembling gene-length fragments from shorter groups of oligonucleotides (oligos), is becoming an increasingly important tool in molecular and synthetic biology. The length, quality and cost of gene synthesis are limited by errors produced during oligo synthesis and subsequent assembly. Enzymatic error correction methods are cost-effective means to ameliorate errors in gene synthesis. Previous analyses of these methods relied on cloning and Sanger sequencing to evaluate their efficiencies, limiting quantitative assessment. Here, we develop a method to quantify errors in synthetic DNA by next-generation sequencing. We analyzed errors in model gene assemblies and systematically compared six different error correction enzymes across 11 conditions. We find that ErrASE and T7 Endonuclease I are the most effective at decreasing average error rates (up to 5.8-fold relative to the input), whereas MutS is the best for increasing the number of perfect assemblies (up to 25.2-fold). We are able to quantify differential specificities; for example, ErrASE preferentially corrects C/G transversions whereas T7 Endonuclease I preferentially corrects A/T transversions. More generally, this experimental and computational pipeline is a fast, scalable and extensible way to analyze errors in gene assemblies, to profile error correction methods, and to benchmark DNA synthesis methods.
Basili, Danilo; Zhang, Ji-Liang; Herbert, John; Kroll, Kevin; Denslow, Nancy D; Martyniuk, Christopher J; Falciani, Francesco; Antczak, Philipp
2018-06-15
In recent years, decreases in fish populations have been attributed, in part, to the effect of environmental chemicals on ovarian development. To understand the underlying molecular events we developed a dynamic model of ovary development linking gene transcription to key physiological end points, such as gonadosomatic index (GSI), plasma levels of estradiol (E2) and vitellogenin (VTG), in largemouth bass (Micropterus salmoides). We were able to identify specific clusters of genes, which are affected at different stages of ovarian development. A subnetwork was identified that closely linked gene expression and physiological end points and by interrogating the Comparative Toxicogenomic Database (CTD), quercetin and tretinoin (ATRA) were identified as two potential candidates that may perturb this system. Predictions were validated by investigation of reproductive associated transcripts using qPCR in ovary and in the liver of both male and female largemouth bass treated after a single injection of quercetin and tretinoin (10 and 100 μg/kg). Both compounds were found to significantly alter the expression of some of these genes. Our findings support the use of omics and online repositories for identification of novel, yet untested, compounds. This is the first study of a dynamic model that links gene expression patterns across stages of ovarian development.
Pepke, Shirley; Ver Steeg, Greg
2017-03-15
De novo inference of clinically relevant gene function relationships from tumor RNA-seq remains a challenging task. Current methods typically either partition patient samples into a few subtypes or rely upon analysis of pairwise gene correlations that will miss some groups in noisy data. Leveraging higher dimensional information can be expected to increase the power to discern targetable pathways, but this is commonly thought to be an intractable computational problem. In this work we adapt a recently developed machine learning algorithm for sensitive detection of complex gene relationships. The algorithm, CorEx, efficiently optimizes over multivariate mutual information and can be iteratively applied to generate a hierarchy of relatively independent latent factors. The learned latent factors are used to stratify patients for survival analysis with respect to both single factors and combinations. These analyses are performed and interpreted in the context of biological function annotations and protein network interactions that might be utilized to match patients to multiple therapies. Analysis of ovarian tumor RNA-seq samples demonstrates the algorithm's power to infer well over one hundred biologically interpretable gene cohorts, several times more than standard methods such as hierarchical clustering and k-means. The CorEx factor hierarchy is also informative, with related but distinct gene clusters grouped by upper nodes. Some latent factors correlate with patient survival, including one for a pathway connected with the epithelial-mesenchymal transition in breast cancer that is regulated by a microRNA that modulates epigenetics. Further, combinations of factors lead to a synergistic survival advantage in some cases. In contrast to studies that attempt to partition patients into a small number of subtypes (typically 4 or fewer) for treatment purposes, our approach utilizes subgroup information for combinatoric transcriptional phenotyping. 
Considering only the 66 gene expression groups that are found to both have significant Gene Ontology enrichment and are small enough to indicate specific drug targets implies a computational phenotype for ovarian cancer that allows for 3^66 possible patient profiles, enabling truly personalized treatment. The findings here demonstrate a new technique that sheds light on the complexity of gene expression dependencies in tumors and could eventually enable the use of patient RNA-seq profiles for selection of personalized and effective cancer treatments.
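The profile count quoted above, three raised to the power 66, follows from simple counting if each of the 66 gene groups takes one of three levels per patient. The three-level reading is an assumption inferred from the base of the exponent; the abstract does not spell it out. A quick check of the size of that number:

```python
GROUPS = 66  # gene expression groups named in the abstract
LEVELS = 3   # assumed number of states per group (hypothetical reading)

# Independent choices multiply, so the count is LEVELS ** GROUPS.
profiles = LEVELS ** GROUPS
print(f"{profiles:.2e}")
```

The result is on the order of 10^31, far more than the number of patients who could ever be profiled, which is the sense in which the profiles are effectively individual.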
Zhang, Chi; Dower, Ken; Zhang, Baohong; Martinez, Robert V; Lin, Lih-Ling; Zhao, Shanrong
2018-05-16
Obese ZSF1 rats exhibit spontaneous time-dependent diabetic nephropathy and are considered to be a highly relevant animal model of progressive human diabetic kidney disease. We previously identified gene expression changes between disease and control animals across six time points from 12 to 41 weeks. In this study, the same data were analysed at the isoform and exon levels to reveal additional disease mechanisms that may be governed by alternative splicing. Our analyses identified alternative splicing patterns in genes that may be implicated in disease pathogenesis (such as Shc1, Serpinc1, Epb4.1l5, and Il-33), which would have been overlooked in standard gene-level analysis. The alternatively spliced genes were enriched in pathways related to cell adhesion, cell-cell interactions/junctions, and cytoskeleton signalling, whereas the differentially expressed genes were enriched in pathways related to immune response, G protein-coupled receptor, and cAMP signalling. Our findings indicate that additional mechanistic insights can be gained from exon- and isoform-level data analyses over standard gene-level analysis. Considering alternative splicing is poorly conserved between rodents and humans, it is noted that this work is not translational, but the point holds true that additional insights can be gained from alternative splicing analysis of RNA-seq data.
Schield, Drew R; Adams, Richard H; Card, Daren C; Corbin, Andrew B; Jezkova, Tereza; Hales, Nicole R; Meik, Jesse M; Perry, Blair W; Spencer, Carol L; Smith, Lydia L; García, Gustavo Campillo; Bouzid, Nassima M; Strickland, Jason L; Parkinson, Christopher L; Borja, Miguel; Castañeda-Gaytán, Gamaliel; Bryson, Robert W; Flores-Villela, Oscar A; Mackessy, Stephen P; Castoe, Todd A
2018-06-15
The Mojave rattlesnake (Crotalus scutulatus) inhabits deserts and arid grasslands of the western United States and Mexico. Despite considerable interest in its highly toxic venom and the recognition of two subspecies, no molecular studies have characterized range-wide genetic diversity and population structure or tested species limits within C. scutulatus. We used mitochondrial DNA and thousands of nuclear loci from double-digest restriction site associated DNA sequencing to infer population genetic structure throughout the range of C. scutulatus, and to evaluate divergence times and gene flow between populations. We find strong support for several divergent mitochondrial and nuclear clades of C. scutulatus, including splits coincident with two major phylogeographic barriers: the Continental Divide and the elevational increase associated with the Central Mexican Plateau. We apply Bayesian clustering, phylogenetic inference, and coalescent-based species delimitation to our nuclear genetic data to test hypotheses of population structure. We also performed demographic analyses to test hypotheses relating to population divergence and gene flow. Collectively, our results support the existence of four distinct lineages within C. scutulatus, and genetically defined populations do not correspond with currently recognized subspecies ranges. Finally, we use approximate Bayesian computation to test hypotheses of divergence among multiple rattlesnake species groups distributed across the Continental Divide, and find evidence for co-divergence at this boundary during the mid-Pleistocene. Copyright © 2018 Elsevier Inc. All rights reserved.
Maity, Arnab; Carroll, Raymond J; Mammen, Enno; Chatterjee, Nilanjan
2009-01-01
Motivated by the problem of testing for genetic effects on complex traits in the presence of gene-environment interaction, we develop score tests in general semiparametric regression problems that involve a Tukey-style 1 degree-of-freedom form of interaction between parametrically and non-parametrically modelled covariates. We find that the score test in this type of model, as recently developed by Chatterjee and co-workers in the fully parametric setting, is biased and requires undersmoothing to be valid in the presence of non-parametric components. Moreover, in the presence of repeated outcomes, the asymptotic distribution of the score test depends on the estimation of functions which are defined as solutions of integral equations, making implementation difficult and computationally taxing. We develop profiled score statistics which are unbiased and asymptotically efficient and can be performed by using standard bandwidth selection methods. In addition, to overcome the difficulty of solving functional equations, we give easy interpretations of the target functions, which in turn allow us to develop estimation procedures that can be easily implemented by using standard computational methods. We present simulation studies to evaluate type I error and power of the method proposed compared with a naive test that does not consider interaction. Finally, we illustrate our methodology by analysing data from a case-control study of colorectal adenoma that was designed to investigate the association between colorectal adenoma and the candidate gene NAT2 in relation to smoking history.
An automated method for detecting alternatively spliced protein domains.
Coelho, Vitor; Sammeth, Michael
2018-06-01
Alternative splicing (AS) has been demonstrated to play a role in shaping eukaryotic gene diversity at the transcriptional level. However, the impact of AS on the proteome is still controversial. Studies that seek to explore the effect of AS at the proteomic level are hampered by technical difficulties in the cumbersome process of casting forth and back between genome, transcriptome and proteome space coordinates, and the naïve prediction of protein domains in the presence of AS suffers many redundant sequence scans that emerge from constitutively spliced regions that are shared between alternative products of a gene. We developed the AstaFunk pipeline that computes for every generic transcriptome all domains that are altered by AS events in a systematic and efficient manner. In a nutshell, our method employs Viterbi dynamic programming, which guarantees to find all score-optimal hits of the domains under consideration, while complementary optimisations at different levels avoid redundant and other irrelevant computations. We evaluate AstaFunk qualitatively and quantitatively using RNAseq in well-studied genes with AS, and on large-scale employing entire transcriptomes. Our study confirms complementary reports that the effect of most AS events on the proteome seems to be rather limited, but our results also pinpoint several cases where AS could have a major impact on the function of a protein domain. The JAVA implementation of AstaFunk is available as an open source project on http://astafunk.sammeth.net. Contact: micha@sammeth.net. Supplementary data are available at Bioinformatics online.
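The Viterbi dynamic programming mentioned above can be illustrated with a minimal sketch on a toy two-state hidden Markov model. This is not the AstaFunk implementation, which works on protein domain profiles; the state names ("bg", "dom"), the two-letter alphabet (H = hydrophobic, P = polar) and all probabilities below are hypothetical values chosen only to show the recurrence.

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return (best_log_prob, best_path) for an observation sequence."""
    # score[s] = best log-probability of any state path ending in s
    score = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for symbol in obs[1:]:
        new_score, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: score[p] + log_trans[p][s])
            new_score[s] = score[prev] + log_trans[prev][s] + log_emit[s][symbol]
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path

def logs(probs):
    """Convert a dict of probabilities to log-probabilities."""
    return {k: math.log(v) for k, v in probs.items()}

# Hypothetical toy model: "dom" favours H, "bg" (background) favours P.
STATES = ("bg", "dom")
LOG_START = logs({"bg": 0.5, "dom": 0.5})
LOG_TRANS = {"bg": logs({"bg": 0.9, "dom": 0.1}),
             "dom": logs({"bg": 0.1, "dom": 0.9})}
LOG_EMIT = {"bg": logs({"H": 0.2, "P": 0.8}),
            "dom": logs({"H": 0.8, "P": 0.2})}

best_score, best_path = viterbi("PPHHHHPP", STATES, LOG_START, LOG_TRANS, LOG_EMIT)
print(best_path)
```

Working in log space avoids numerical underflow on long sequences. The sketch keeps one back-pointer table per position, so memory grows linearly with sequence length.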
Robinson, Gene E.; Jakobsson, Eric
2016-01-01
The emerging field of sociogenomics explores the relations between social behavior and genome structure and function. An important question is the extent to which associations between social behavior and gene expression are conserved among the Metazoa. Prior experimental work in an invertebrate model of social behavior, the honey bee, revealed distinct brain gene expression patterns in African and European honey bees, and within European honey bees with different behavioral phenotypes. The present work is a computational study of these previous findings in which we analyze, by orthology determination, the extent to which genes that are socially regulated in honey bees are conserved across the Metazoa. We found that the differentially expressed gene sets associated with alarm pheromone response, the difference between old and young bees, and the colony influence on soldier bees, are enriched in widely conserved genes, indicating that these differences have genomic bases shared with many other metazoans. By contrast, the sets of differentially expressed genes associated with the differences between African and European forager and guard bees are depleted in widely conserved genes, indicating that the genomic basis for this social behavior is relatively specific to honey bees. For the alarm pheromone response gene set, we found a particularly high degree of conservation with mammals, even though the alarm pheromone itself is bee-specific. Gene Ontology identification of human orthologs to the strongly conserved honey bee genes associated with the alarm pheromone response shows overrepresentation of protein metabolism, regulation of protein complex formation, and protein folding, perhaps associated with remodeling of critical neural circuits in response to alarm pheromone. We hypothesize that such remodeling may be an adaptation of social animals to process and respond appropriately to the complex patterns of conspecific communication essential for social organization. 
PMID:27359102
Liu, Hui; Robinson, Gene E; Jakobsson, Eric
2016-06-01
The emerging field of sociogenomics explores the relations between social behavior and genome structure and function. An important question is the extent to which associations between social behavior and gene expression are conserved among the Metazoa. Prior experimental work in an invertebrate model of social behavior, the honey bee, revealed distinct brain gene expression patterns in African and European honey bees, and within European honey bees with different behavioral phenotypes. The present work is a computational study of these previous findings in which we analyze, by orthology determination, the extent to which genes that are socially regulated in honey bees are conserved across the Metazoa. We found that the differentially expressed gene sets associated with alarm pheromone response, the difference between old and young bees, and the colony influence on soldier bees, are enriched in widely conserved genes, indicating that these differences have genomic bases shared with many other metazoans. By contrast, the sets of differentially expressed genes associated with the differences between African and European forager and guard bees are depleted in widely conserved genes, indicating that the genomic basis for this social behavior is relatively specific to honey bees. For the alarm pheromone response gene set, we found a particularly high degree of conservation with mammals, even though the alarm pheromone itself is bee-specific. Gene Ontology identification of human orthologs to the strongly conserved honey bee genes associated with the alarm pheromone response shows overrepresentation of protein metabolism, regulation of protein complex formation, and protein folding, perhaps associated with remodeling of critical neural circuits in response to alarm pheromone. We hypothesize that such remodeling may be an adaptation of social animals to process and respond appropriately to the complex patterns of conspecific communication essential for social organization.
Smith, Adam Alexander Thil; Belda, Eugeni; Viari, Alain; Medigue, Claudine; Vallenet, David
2012-05-01
Of all biochemically characterized metabolic reactions formalized by the IUBMB, over one out of four have yet to be associated with a nucleic or protein sequence, i.e. are sequence-orphan enzymatic activities. Few bioinformatics annotation tools are able to propose candidate genes for such activities by exploiting context-dependent rather than sequence-dependent data, and none are readily accessible and propose result integration across multiple genomes. Here, we present CanOE (Candidate genes for Orphan Enzymes), a four-step bioinformatics strategy that proposes ranked candidate genes for sequence-orphan enzymatic activities (or orphan enzymes for short). The first step locates "genomic metabolons", i.e. groups of co-localized genes coding proteins catalyzing reactions linked by shared metabolites, in one genome at a time. These metabolons can be particularly helpful for aiding bioanalysts to visualize relevant metabolic data. In the second step, they are used to generate candidate associations between un-annotated genes and gene-less reactions. The third step integrates these gene-reaction associations over several genomes using gene families, and summarizes the strength of family-reaction associations by several scores. In the final step, these scores are used to rank members of gene families which are proposed for metabolic reactions. These associations are of particular interest when the metabolic reaction is a sequence-orphan enzymatic activity. Our strategy found over 60,000 genomic metabolons in more than 1,000 prokaryote organisms from the MicroScope platform, generating candidate genes for many metabolic reactions, including more than 70 distinct orphan reactions. A computational validation of the approach is discussed. Finally, we present a case study on the anaerobic allantoin degradation pathway in Escherichia coli K-12.
An autonomous molecular computer for logical control of gene expression
Benenson, Yaakov; Gil, Binyamin; Ben-Dor, Uri; Adar, Rivka; Shapiro, Ehud
2013-01-01
Early biomolecular computer research focused on laboratory-scale, human-operated computers for complex computational problems [1–7]. Recently, simple molecular-scale autonomous programmable computers were demonstrated [8–15] allowing both input and output information to be in molecular form. Such computers, using biological molecules as input data and biologically active molecules as outputs, could produce a system for ‘logical’ control of biological processes. Here we describe an autonomous biomolecular computer that, at least in vitro, logically analyses the levels of messenger RNA species, and in response produces a molecule capable of affecting levels of gene expression. The computer operates at a concentration of close to a trillion computers per microlitre and consists of three programmable modules: a computation module, that is, a stochastic molecular automaton [12–17]; an input module, by which specific mRNA levels or point mutations regulate software molecule concentrations, and hence automaton transition probabilities; and an output module, capable of controlled release of a short single-stranded DNA molecule. This approach might be applied in vivo to biochemical sensing, genetic engineering and even medical diagnosis and treatment. As a proof of principle we programmed the computer to identify and analyse mRNA of disease-related genes [18–22] associated with models of small-cell lung cancer and prostate cancer, and to produce a single-stranded DNA molecule modelled after an anticancer drug. PMID:15116117
Stamatakis, Alexandros; Ott, Michael
2008-12-27
The continuous accumulation of sequence data, for example, due to novel wet-laboratory techniques such as pyrosequencing, coupled with the increasing popularity of multi-gene phylogenies and emerging multi-core processor architectures that face problems of cache congestion, poses new challenges with respect to the efficient computation of the phylogenetic maximum-likelihood (ML) function. Here, we propose two approaches that can significantly speed up likelihood computations that typically represent over 95 per cent of the computational effort conducted by current ML or Bayesian inference programs. Initially, we present a method and an appropriate data structure to efficiently compute the likelihood score on 'gappy' multi-gene alignments. By 'gappy' we denote sampling-induced gaps owing to missing sequences in individual genes (partitions), i.e. not real alignment gaps. A first proof-of-concept implementation in RAxML indicates that this approach can accelerate inferences on large and gappy alignments by approximately one order of magnitude. Moreover, we present insights and initial performance results on multi-core architectures obtained during the transition from an OpenMP-based to a Pthreads-based fine-grained parallelization of the ML function.
Optimal Information Processing in Biochemical Networks
NASA Astrophysics Data System (ADS)
Wiggins, Chris
2012-02-01
A variety of experimental results over the past decades provide examples of near-optimal information processing in biological networks, including in biochemical and transcriptional regulatory networks. Computing information-theoretic quantities requires first choosing or computing the joint probability distribution describing multiple nodes in such a network, for example one representing the probability distribution of finding an integer copy number of each of two interacting reactants or gene products while respecting the 'intrinsic' small copy number noise constraining information transmission at the scale of the cell. I'll give an overview of some recent analytic and numerical work facilitating calculation of such joint distributions and the associated information, which in turn makes possible numerical optimization of information flow in models of noisy regulatory and biochemical networks. Illustrative cases include quantification of form-function relations, ideal design of regulatory cascades, and response to oscillatory driving.
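Once a joint distribution over two nodes is available, the associated information can be computed directly. The sketch below is a generic textbook calculation of mutual information from a discrete joint probability table, not the analytic or numerical methods of the talk; the function name and the table layout (rows index one node, columns the other) are assumptions for illustration.

```python
import math

def mutual_information(joint):
    """Mutual information I(X;Y) in bits from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]        # marginal of X (row sums)
    py = [sum(col) for col in zip(*joint)]  # marginal of Y (column sums)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:  # terms with p = 0 contribute nothing
                mi += p * math.log2(p / (px[i] * py[j]))
    return mi

# Two binary nodes: independent vs. perfectly coupled.
independent = [[0.25, 0.25], [0.25, 0.25]]
coupled = [[0.5, 0.0], [0.0, 0.5]]
print(mutual_information(independent), mutual_information(coupled))
```

For copy-number distributions the table would have one row and one column per integer copy number, truncated at some maximum count.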
Spiliopoulou, Athina; Colombo, Marco; Orchard, Peter; Agakov, Felix; McKeigue, Paul
2017-01-01
We address the task of genotype imputation to a dense reference panel given genotype likelihoods computed from ultralow coverage sequencing as inputs. In this setting, the data have a high level of missingness or uncertainty, and are thus more amenable to a probabilistic representation. Most existing imputation algorithms are not well suited for this situation, as they rely on prephasing for computational efficiency, and, without definite genotype calls, the prephasing task becomes computationally expensive. We describe GeneImp, a program for genotype imputation that does not require prephasing and is computationally tractable for whole-genome imputation. GeneImp does not explicitly model recombination; instead it capitalizes on the existence of large reference panels (comprising thousands of reference haplotypes) and assumes that the reference haplotypes can adequately represent the target haplotypes over short regions unaltered. We validate GeneImp based on data from ultralow coverage sequencing (0.5×), and compare its performance to the most recent version of BEAGLE that can perform this task. We show that GeneImp achieves imputation quality very close to that of BEAGLE, using one to two orders of magnitude less time, without an increase in memory complexity. Therefore, GeneImp is the first practical choice for whole-genome imputation to a dense reference panel when prephasing cannot be applied, for instance, in datasets produced via ultralow coverage sequencing. A related future application for GeneImp is whole-genome imputation based on the off-target reads from deep whole-exome sequencing. PMID:28348060
Spirov, Alexander; Holloway, David
2013-07-15
This paper surveys modeling approaches for studying the evolution of gene regulatory networks (GRNs). Modeling of the design or 'wiring' of GRNs has become increasingly common in developmental and medical biology, as a means of quantifying gene-gene interactions, the response to perturbations, and the overall dynamic motifs of networks. Drawing from developments in GRN 'design' modeling, a number of groups are now using simulations to study how GRNs evolve, both for comparative genomics and to uncover general principles of evolutionary processes. Such work can generally be termed evolution in silico. Complementary to these biologically-focused approaches, a now well-established field of computer science is Evolutionary Computations (ECs), in which highly efficient optimization techniques are inspired from evolutionary principles. In surveying biological simulation approaches, we discuss the considerations that must be taken with respect to: (a) the precision and completeness of the data (e.g. are the simulations for very close matches to anatomical data, or are they for more general exploration of evolutionary principles); (b) the level of detail to model (we proceed from 'coarse-grained' evolution of simple gene-gene interactions to 'fine-grained' evolution at the DNA sequence level); (c) to what degree is it important to include the genome's cellular context; and (d) the efficiency of computation. With respect to the latter, we argue that developments in computer science EC offer the means to perform more complete simulation searches, and will lead to more comprehensive biological predictions. Copyright © 2013 Elsevier Inc. All rights reserved.
USDA-ARS's Scientific Manuscript database
Next Generation Sequencing is transforming the way scientists collect and measure an organism’s genetic background and gene dynamics, while bioinformatics and super-computing are merging to facilitate parallel sample computation and interpretation at unprecedented speeds. Analyzing the complete gene...
Hierarchical Parallelization of Gene Differential Association Analysis
2011-01-01
Background: Microarray gene differential expression analysis is a widely used technique that deals with high dimensional data and is computationally intensive for permutation-based procedures. Microarray gene differential association analysis is even more computationally demanding and must take advantage of multicore computing technology, which is the driving force behind increasing compute power in recent years. In this paper, we present a two-layer hierarchical parallel implementation of gene differential association analysis. It takes advantage of both fine- and coarse-grain (with granularity defined by the frequency of communication) parallelism in order to effectively leverage the non-uniform nature of parallel processing available in the cutting-edge systems of today. Results: Our results show that this hierarchical strategy matches data sharing behavior to the properties of the underlying hardware, thereby reducing the memory and bandwidth needs of the application. The resulting improved efficiency reduces computation time and allows the gene differential association analysis code to scale its execution with the number of processors. The code and biological data used in this study are downloadable from http://www.urmc.rochester.edu/biostat/people/faculty/hu.cfm. Conclusions: The performance sweet spot occurs when using a number of threads per MPI process that allows the working sets of the corresponding MPI processes running on the multicore to fit within the machine cache. Hence, we suggest that practitioners follow this principle in selecting the appropriate number of MPI processes and threads within each MPI process for their cluster configurations. We believe that the principles of this hierarchical approach to parallelization can be utilized in the parallelization of other computationally demanding kernels. PMID:21936916
Hierarchical parallelization of gene differential association analysis.
Needham, Mark; Hu, Rui; Dwarkadas, Sandhya; Qiu, Xing
2011-09-21
Microarray gene differential expression analysis is a widely used technique that deals with high dimensional data and is computationally intensive for permutation-based procedures. Microarray gene differential association analysis is even more computationally demanding and must take advantage of multicore computing technology, which is the driving force behind increasing compute power in recent years. In this paper, we present a two-layer hierarchical parallel implementation of gene differential association analysis. It takes advantage of both fine- and coarse-grain (with granularity defined by the frequency of communication) parallelism in order to effectively leverage the non-uniform nature of parallel processing available in the cutting-edge systems of today. Our results show that this hierarchical strategy matches data sharing behavior to the properties of the underlying hardware, thereby reducing the memory and bandwidth needs of the application. The resulting improved efficiency reduces computation time and allows the gene differential association analysis code to scale its execution with the number of processors. The code and biological data used in this study are downloadable from http://www.urmc.rochester.edu/biostat/people/faculty/hu.cfm. The performance sweet spot occurs when using a number of threads per MPI process that allows the working sets of the corresponding MPI processes running on the multicore to fit within the machine cache. Hence, we suggest that practitioners follow this principle in selecting the appropriate number of MPI processes and threads within each MPI process for their cluster configurations. We believe that the principles of this hierarchical approach to parallelization can be utilized in the parallelization of other computationally demanding kernels.
2010-01-01
Background: Epistasis is recognized as a fundamental part of the genetic architecture of individuals. Several computational approaches have been developed to model gene-gene interactions in case-control studies; however, none of them is suitable for time-dependent analysis. Herein we introduce the Survival Dimensionality Reduction (SDR) algorithm, a non-parametric method specifically designed to detect epistasis in lifetime datasets. Results: The algorithm requires neither specification about the underlying survival distribution nor about the underlying interaction model and proved satisfactorily powerful to detect a set of causative genes in synthetic epistatic lifetime datasets with a limited number of samples and high degree of right-censorship (up to 70%). The SDR method was then applied to a series of 386 Dutch patients with active rheumatoid arthritis that were treated with anti-TNF biological agents. Among a set of 39 candidate genes, none of which showed a detectable marginal effect on anti-TNF responses, the SDR algorithm did find that the rs1801274 SNP in the FcγRIIa gene and the rs10954213 SNP in the IRF5 gene non-linearly interact to predict clinical remission after anti-TNF biologicals. Conclusions: Simulation studies and application in a real-world setting support the capability of the SDR algorithm to model epistatic interactions in candidate-genes studies in the presence of right-censored data. Availability: http://sourceforge.net/projects/sdrproject/ PMID:20691091
A Hybrid Computational Method for the Discovery of Novel Reproduction-Related Genes
Chen, Lei; Chu, Chen; Kong, Xiangyin; Huang, Guohua; Huang, Tao; Cai, Yu-Dong
2015-01-01
Uncovering the molecular mechanisms underlying reproduction is of great importance to infertility treatment and to the generation of healthy offspring. In this study, we discovered novel reproduction-related genes with a hybrid computational method, integrating three different types of method, which offered new clues for further reproduction research. This method was first executed on a weighted graph, constructed based on known protein-protein interactions, to search the shortest paths connecting any two known reproduction-related genes. Genes occurring in these paths were deemed to have a special relationship with reproduction. These newly discovered genes were filtered with a randomization test. Then, the remaining genes were further selected according to their associations with known reproduction-related genes measured by protein-protein interaction score and alignment score obtained by BLAST. The in-depth analysis of the high confidence novel reproduction genes revealed hidden mechanisms of reproduction and provided guidelines for further experimental validations. PMID:25768094
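The first step described above, searching shortest paths between known genes in a weighted interaction graph and collecting the genes that lie on them, can be sketched as follows. This is a minimal illustration using Dijkstra's algorithm, not the authors' pipeline: the gene names, the edge weights, and the convention that a lower weight means a stronger interaction are all assumptions, and the later randomization and BLAST filtering steps are omitted.

```python
import heapq
from itertools import combinations

def shortest_path(graph, source, target):
    """Dijkstra's algorithm on graph[node] = {neighbor: weight}; returns a node list or None."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        if node == target:
            break
        for nbr, w in graph.get(node, {}).items():
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(heap, (nd, nbr))
    if target not in dist:
        return None  # target unreachable from source
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]

def candidate_genes(graph, seeds):
    """Genes that are interior nodes of a shortest path between any two seed genes."""
    found = set()
    for a, b in combinations(sorted(seeds), 2):
        path = shortest_path(graph, a, b)
        if path:
            found.update(path[1:-1])
    return found - set(seeds)

# Hypothetical undirected toy network; lower weight = stronger interaction.
toy = {
    "A": {"X": 1.0, "B": 5.0, "Y": 3.0},
    "B": {"X": 1.0, "A": 5.0, "Y": 3.0},
    "X": {"A": 1.0, "B": 1.0},
    "Y": {"A": 3.0, "B": 3.0},
}
print(candidate_genes(toy, {"A", "B"}))
```

Only one shortest path per pair is kept here; when several paths tie, which one is returned depends on heap order, so a fuller treatment would enumerate all of them.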
A hybrid computational method for the discovery of novel reproduction-related genes.
Chen, Lei; Chu, Chen; Kong, Xiangyin; Huang, Guohua; Huang, Tao; Cai, Yu-Dong
2015-01-01
Uncovering the molecular mechanisms underlying reproduction is of great importance to infertility treatment and to the generation of healthy offspring. In this study, we discovered novel reproduction-related genes with a hybrid computational method, integrating three different types of method, which offered new clues for further reproduction research. This method was first executed on a weighted graph, constructed based on known protein-protein interactions, to search the shortest paths connecting any two known reproduction-related genes. Genes occurring in these paths were deemed to have a special relationship with reproduction. These newly discovered genes were filtered with a randomization test. Then, the remaining genes were further selected according to their associations with known reproduction-related genes measured by protein-protein interaction score and alignment score obtained by BLAST. The in-depth analysis of the high confidence novel reproduction genes revealed hidden mechanisms of reproduction and provided guidelines for further experimental validations.
Alignment-free detection of horizontal gene transfer between closely related bacterial genomes.
Domazet-Lošo, Mirjana; Haubold, Bernhard
2011-09-01
Bacterial epidemics are often caused by strains that have acquired their increased virulence through horizontal gene transfer. Due to this association with disease, the detection of horizontal gene transfer continues to receive attention from microbiologists and bioinformaticians alike. Most software for detecting transfer events is based on alignments of sets of genes or of entire genomes. But despite great advances in the design of algorithms and computer programs, genome alignment remains computationally challenging. We have therefore developed an alignment-free algorithm for rapidly detecting horizontal gene transfer between closely related bacterial genomes. Our implementation of this algorithm is called alfy for "ALignment Free local homologY" and is freely available from http://guanine.evolbio.mpg.de/alfy/. In this comment we demonstrate the application of alfy to the genomes of Staphylococcus aureus. We also argue that, contrary to popular belief and in spite of increasing computer speed, algorithmic optimization is becoming more, not less, important if genome data continues to accumulate at the present rate.
Gene context analysis in the Integrated Microbial Genomes (IMG) data management system.
Mavromatis, Konstantinos; Chu, Ken; Ivanova, Natalia; Hooper, Sean D; Markowitz, Victor M; Kyrpides, Nikos C
2009-11-24
Computational methods for determining the function of genes in newly sequenced genomes have been traditionally based on sequence similarity to genes whose function has been identified experimentally. Function prediction methods can be extended using gene context analysis approaches such as examining the conservation of chromosomal gene clusters, gene fusion events and co-occurrence profiles across genomes. Context analysis is based on the observation that functionally related genes often have similar gene context and relies on the identification of such events across a phylogenetically diverse collection of genomes. We have used the data management system of the Integrated Microbial Genomes (IMG) as the framework to implement and explore the power of gene context analysis methods because it provides one of the largest available genome integrations. Visualization and search tools to facilitate gene context analysis have been developed and applied across all publicly available archaeal and bacterial genomes in IMG. These computations are now maintained as part of IMG's regular genome content update cycle. IMG is available at: http://img.jgi.doe.gov.
2012-09-30
computational tools provide the ability to display, browse, select, filter and summarize spatio-temporal relationships of these individual-based...her research assistant at Esri, Shaun Walbridge, and members of the Marine Mammal Institute (MMI), including Tomas Follet and Debbie Steel. This...Genomics Laboratory, MMI, OSU. As part of the geneGIS initiative, these SPLASH photo-identification records and the geneSPLASH DNA profiles
Rough set soft computing cancer classification and network: one stone, two birds.
Zhang, Yue
2010-07-15
Gene expression profiling provides tremendous information to help unravel the complexity of cancer. The selection of the most informative genes from huge noise for cancer classification has taken centre stage, along with predicting the function of such identified genes and the construction of direct gene regulatory networks at different system levels with a tuneable parameter. A new study by Wang and Gotoh described a novel Variable Precision Rough Sets-rooted robust soft computing method to successfully address these problems and has yielded some new insights. The significance of this progress and its perspectives will be discussed in this article.
Heterogeneous High Throughput Scientific Computing with APM X-Gene and Intel Xeon Phi
NASA Astrophysics Data System (ADS)
Abdurachmanov, David; Bockelman, Brian; Elmer, Peter; Eulisse, Giulio; Knight, Robert; Muzaffar, Shahzad
2015-05-01
Electrical power requirements will be a constraint on the future growth of Distributed High Throughput Computing (DHTC) as used by High Energy Physics. Performance-per-watt is a critical metric for the evaluation of computer architectures for cost-efficient computing. Additionally, future performance growth will come from heterogeneous, many-core, and high computing density platforms with specialized processors. In this paper, we examine the Intel Xeon Phi Many Integrated Cores (MIC) co-processor and Applied Micro X-Gene ARMv8 64-bit low-power server system-on-a-chip (SoC) solutions for scientific computing applications. We report our experience on software porting, performance and energy efficiency and evaluate the potential for use of such technologies in the context of distributed computing systems such as the Worldwide LHC Computing Grid (WLCG).
Toward integration of in vivo molecular computing devices: successes and challenges
Hayat, Sikander; Hinze, Thomas
2008-01-01
The computing power unleashed by biomolecule based massively parallel computational units has been the focus of many interdisciplinary studies that couple state of the art ideas from mathematical logic, theoretical computer science, bioengineering, and nanotechnology to fulfill some computational task. The output can influence, for instance, release of a drug at a specific target, gene expression, cell population, or be a purely mathematical entity. Analysis of the results of several studies has led to the emergence of a general set of rules concerning the implementation and optimization of in vivo computational units. Taking two recent studies on in vivo computing as examples, we discuss the impact of mathematical modeling and simulation in the field of synthetic biology and on in vivo computing. The impact of the emergence of gene regulatory networks and the potential of proteins acting as “circuit wires” on the problem of interconnecting molecular computing device subunits is also highlighted. PMID:19404433
Spectral gene set enrichment (SGSE).
Frost, H Robert; Li, Zhigang; Moore, Jason H
2015-03-03
Gene set testing is typically performed in a supervised context to quantify the association between groups of genes and a clinical phenotype. In many cases, however, a gene set-based interpretation of genomic data is desired in the absence of a phenotype variable. Although methods exist for unsupervised gene set testing, they predominantly compute enrichment relative to clusters of the genomic variables with performance strongly dependent on the clustering algorithm and number of clusters. We propose a novel method, spectral gene set enrichment (SGSE), for unsupervised competitive testing of the association between gene sets and empirical data sources. SGSE first computes the statistical association between gene sets and principal components (PCs) using our principal component gene set enrichment (PCGSE) method. The overall statistical association between each gene set and the spectral structure of the data is then computed by combining the PC-level p-values using the weighted Z-method with weights set to the PC variance scaled by Tracy-Widom test p-values. Using simulated data, we show that the SGSE algorithm can accurately recover spectral features from noisy data. To illustrate the utility of our method on real data, we demonstrate the superior performance of the SGSE method relative to standard cluster-based techniques for testing the association between MSigDB gene sets and the variance structure of microarray gene expression data. Unsupervised gene set testing can provide important information about the biological signal held in high-dimensional genomic data sets. Because it uses the association between gene sets and samples PCs to generate a measure of unsupervised enrichment, the SGSE method is independent of cluster or network creation algorithms and, most importantly, is able to utilize the statistical significance of PC eigenvalues to ignore elements of the data most likely to represent noise.
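The combination step described above, merging PC-level p-values with the weighted Z-method, can be sketched as follows. This is a minimal illustration that assumes one-sided p-values strictly between 0 and 1; the function name `weighted_z` and the example weights are hypothetical placeholders, not the SGSE weights (which the abstract describes as PC variance scaled by Tracy-Widom test p-values).

```python
from statistics import NormalDist


def weighted_z(p_values, weights):
    """Combine one-sided p-values with the weighted Z-method (weighted Stouffer).

    Each p-value is mapped to a standard normal quantile, the quantiles are
    averaged with the given weights, and the result is mapped back to a p-value.
    """
    if not p_values or len(p_values) != len(weights):
        raise ValueError("p_values and weights must be non-empty and equal in length")
    normal = NormalDist()
    # Small p-values become large positive z-scores.
    z_scores = [normal.inv_cdf(1.0 - p) for p in p_values]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = sum(w * w for w in weights) ** 0.5
    return 1.0 - normal.cdf(numerator / denominator)


# Illustrative weights only: a zero weight removes a component entirely,
# which mirrors the idea of ignoring PCs that most likely represent noise.
combined = weighted_z([0.01, 0.9], [1.0, 0.0])
```

In this sketch a component with weight zero contributes nothing, so the combined p-value reduces to that of the remaining component.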
Evaluating Computational Gene Ontology Annotations.
Škunca, Nives; Roberts, Richard J; Steffen, Martin
2017-01-01
Two avenues to understanding gene function are complementary and often overlapping: experimental work and computational prediction. While experimental annotation generally produces high-quality annotations, it is low throughput. Conversely, computational annotations have broad coverage, but the quality of annotations may be variable, and therefore evaluating the quality of computational annotations is a critical concern. In this chapter, we provide an overview of strategies to evaluate the quality of computational annotations. First, we discuss why evaluating quality in this setting is not trivial. We highlight the various issues that threaten to bias the evaluation of computational annotations, most of which stem from the incompleteness of biological databases. Second, we discuss solutions that address these issues, for example, targeted selection of new experimental annotations and leveraging the existing experimental annotations.
Intrinsic and extrinsic approaches for detecting genes in a bacterial genome.
Borodovsky, M; Rudd, K E; Koonin, E V
1994-01-01
The unannotated regions of the Escherichia coli genome DNA sequence from the EcoSeq6 database, totaling 1,278 'intergenic' sequences of the combined length of 359,279 basepairs, were analyzed using computer-assisted methods with the aim of identifying putative unknown genes. The proposed strategy for finding new genes includes two key elements: i) prediction of expressed open reading frames (ORFs) using the GeneMark method based on Markov chain models for coding and non-coding regions of Escherichia coli DNA, and ii) search for protein sequence similarities using programs based on the BLAST algorithm and programs for motif identification. A total of 354 putative expressed ORFs were predicted by GeneMark. Using the BLASTX and TBLASTN programs, it was shown that 208 ORFs located in the unannotated regions of the E. coli chromosome are significantly similar to other protein sequences. Identification of 182 ORFs as probable genes was supported by GeneMark and BLAST, comprising 51.4% of the GeneMark 'hits' and 87.5% of the BLAST 'hits'. 73 putative new genes, comprising 20.6% of the GeneMark predictions, belong to ancient conserved protein families that include both eubacterial and eukaryotic members. This value is close to the overall proportion of highly conserved sequences among eubacterial proteins, indicating that the majority of the putative expressed ORFs that are predicted by GeneMark, but have no significant BLAST hits, nevertheless are likely to be real genes. The majority of the putative genes identified by BLAST search have been described since the release of the EcoSeq6 database, but about 70 genes have not been detected so far. Among these new identifications are genes encoding proteins with a variety of predicted functions including dehydrogenases, kinases, several other metabolic enzymes, ATPases, rRNA methyltransferases, membrane proteins, and different types of regulatory proteins. PMID:7984428
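The overlap percentages quoted in this abstract follow directly from the reported counts. The short check below reproduces them; the variable names and the helper `pct` are chosen here for illustration only.

```python
# Counts as reported in the abstract.
genemark_orfs = 354      # putative expressed ORFs predicted by GeneMark
blast_orfs = 208         # ORFs with significant BLASTX/TBLASTN similarity
supported_by_both = 182  # ORFs supported by both GeneMark and BLAST
conserved_families = 73  # putative new genes in ancient conserved families


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


share_of_genemark = pct(supported_by_both, genemark_orfs)   # 51.4
share_of_blast = pct(supported_by_both, blast_orfs)         # 87.5
conserved_share = pct(conserved_families, genemark_orfs)    # 20.6

# ORFs predicted by GeneMark without BLAST support, derived from the counts above.
genemark_only = genemark_orfs - supported_by_both           # 172
```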
A Feature Selection Algorithm to Compute Gene Centric Methylation from Probe Level Methylation Data.
Baur, Brittany; Bozdag, Serdar
2016-01-01
DNA methylation is an important epigenetic event that affects gene expression during development and various diseases such as cancer. Understanding the mechanism of action of DNA methylation is important for downstream analysis. In the Illumina Infinium HumanMethylation 450K array, there are tens of probes associated with each gene. Given methylation intensities of all these probes, it is necessary to compute which of these probes are most representative of the gene centric methylation level. In this study, we developed a feature selection algorithm based on sequential forward selection that utilized different classification methods to compute gene centric DNA methylation using probe level DNA methylation data. We compared our algorithm to other feature selection algorithms such as support vector machines with recursive feature elimination, genetic algorithms and ReliefF. We evaluated all methods based on the predictive power of selected probes on their mRNA expression levels and found that a K-Nearest Neighbors classification using the sequential forward selection algorithm performed better than other algorithms based on all metrics. We also observed that transcriptional activities of certain genes were more sensitive to DNA methylation changes than transcriptional activities of other genes. Our algorithm was able to predict the expression of those genes with high accuracy using only DNA methylation data. Our results also showed that those DNA methylation-sensitive genes were enriched in Gene Ontology terms related to the regulation of various biological processes.
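Sequential forward selection with a K-Nearest Neighbors scorer, the combination this abstract reports as performing best, can be sketched generically as below. This is not the authors' implementation: the leave-one-out scoring, the stopping rule, the discretized "low"/"high" expression labels, and the toy probe values are all assumptions made for illustration.

```python
def loo_knn_accuracy(rows, labels, features, k=1):
    """Leave-one-out accuracy of k-NN using only the chosen feature columns."""
    correct = 0
    for i, row in enumerate(rows):
        neighbours = []
        for j, other in enumerate(rows):
            if i == j:
                continue
            dist = sum((row[f] - other[f]) ** 2 for f in features)
            neighbours.append((dist, labels[j]))
        neighbours.sort(key=lambda pair: pair[0])
        votes = [label for _, label in neighbours[:k]]
        # sorted() makes tie-breaking between labels deterministic.
        predicted = max(sorted(set(votes)), key=votes.count)
        correct += predicted == labels[i]
    return correct / len(rows)


def sequential_forward_selection(rows, labels, n_features, max_selected):
    """Greedily add the feature that most improves the k-NN score."""
    selected, best_score = [], 0.0
    remaining = list(range(n_features))
    while remaining and len(selected) < max_selected:
        scored = [(loo_knn_accuracy(rows, labels, selected + [f]), f) for f in remaining]
        score, feature = max(scored, key=lambda pair: (pair[0], -pair[1]))
        if selected and score <= best_score:
            break  # no candidate improves on the current subset
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score


# Hypothetical data: probe 0 tracks the expression class, probe 1 is noise.
probes = [[0.10, 0.50], [0.20, 0.10], [0.15, 0.90],
          [0.80, 0.40], [0.90, 0.95], [0.85, 0.20]]
expression = ["low", "low", "low", "high", "high", "high"]
chosen, accuracy = sequential_forward_selection(probes, expression, 2, 2)
```

On this toy input the procedure keeps only the informative probe, since adding the noisy one cannot raise the score further.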
2013-01-01
Background The development of new therapies for orphan genetic diseases represents an extremely important medical and social challenge. Drug repositioning, i.e. finding new indications for approved drugs, could be one of the most cost- and time-effective strategies to cope with this problem, at least in a subset of cases. Therefore, many computational approaches based on the analysis of high throughput gene expression data have so far been proposed to reposition available drugs. However, most of these methods require gene expression profiles directly relevant to the pathologic conditions under study, such as those obtained from patient cells and/or from suitable experimental models. In this work we have developed a new approach for drug repositioning, based on identifying known drug targets showing conserved anti-correlated expression profiles with human disease genes, which is completely independent from the availability of ‘ad hoc’ gene expression data-sets. Results By analyzing available data, we provide evidence that the genes displaying conserved anti-correlation with drug targets are antagonistically modulated in their expression by treatment with the relevant drugs. We then identified clusters of genes associated to similar phenotypes and showing conserved anticorrelation with drug targets. On this basis, we generated a list of potential candidate drug-disease associations. Importantly, we show that some of the proposed associations are already supported by independent experimental evidence. Conclusions Our results support the hypothesis that the identification of gene clusters showing conserved anticorrelation with drug targets can be an effective method for drug repositioning and provide a wide list of new potential drug-disease associations for experimental validation. PMID:24088245
Regulation of neural macroRNAs by the transcriptional repressor REST
Johnson, Rory; Teh, Christina Hui-Leng; Jia, Hui; Vanisri, Ravi Raj; Pandey, Tridansh; Lu, Zhong-Hao; Buckley, Noel J.; Stanton, Lawrence W.; Lipovich, Leonard
2009-01-01
The essential transcriptional repressor REST (repressor element 1-silencing transcription factor) plays central roles in development and human disease by regulating a large cohort of neural genes. These have conventionally fallen into the class of known, protein-coding genes; recently, however, several noncoding microRNA genes were identified as REST targets. Given the widespread transcription of messenger RNA-like, noncoding RNAs (“macroRNAs”), some of which are functional and implicated in disease in mammalian genomes, we sought to determine whether this class of noncoding RNAs can also be regulated by REST. By applying a new, unbiased target gene annotation pipeline to computationally discovered REST binding sites, we find that 23% of mammalian REST genomic binding sites are within 10 kb of a macroRNA gene. These putative target genes were overlooked by previous studies. Focusing on a set of 18 candidate macroRNA targets from mouse, we experimentally demonstrate that two are regulated by REST in neural stem cells. Flanking protein-coding genes are, at most, weakly repressed, suggesting specific targeting of the macroRNAs by REST. Similar to the majority of known REST target genes, both of these macroRNAs are induced during nervous system development and have neurally restricted expression profiles in adult mouse. We observe a similar phenomenon in human: the DiGeorge syndrome-associated noncoding RNA, DGCR5, is repressed by REST through a proximal upstream binding site. Therefore neural macroRNAs represent an additional component of the REST regulatory network. These macroRNAs are new candidates for understanding the role of REST in neuronal development, neurodegeneration, and cancer. PMID:19050060
Surles-Zeigler, Monique C; Li, Yonggang; Distel, Timothy J; Omotayo, Hakeem; Ge, Shaokui; Ford, Byron D
2018-01-01
Ischemic stroke is a major cause of mortality in the United States. We previously showed that neuregulin-1 (NRG1) was neuroprotective in rat models of ischemic stroke. We used gene expression profiling to understand the early cellular and molecular mechanisms of NRG1's effects after the induction of ischemia. Ischemic stroke was induced by middle cerebral artery occlusion (MCAO). Rats were allocated to 3 groups: (1) control, (2) MCAO and (3) MCAO + NRG1. Cortical brain tissues were collected three hours following MCAO and NRG1 treatment and subjected to microarray analysis. Data and statistical analyses were performed using R/Bioconductor platform alongside Genesis, Ingenuity Pathway Analysis and Enrichr software packages. There were 2693 genes differentially regulated following ischemia and NRG1 treatment. These genes were organized by expression patterns into clusters using a K-means clustering algorithm. We further analyzed genes in clusters where ischemia altered gene expression, which was reversed by NRG1 (clusters 4 and 10). NRG1, IRS1, OPA3, and POU6F1 were central linking (node) genes in cluster 4. Conserved Transcription Factor Binding Site Finder (CONFAC) identified ETS-1 as a potential transcriptional regulator of NRG1 suppressed genes following ischemia. A transcription factor activity array showed that ETS-1 activity was increased 2-fold, 3 hours following ischemia and this activity was attenuated by NRG1. These findings reveal key early transcriptional mechanisms associated with neuroprotection by NRG1 in the ischemic penumbra.
A cluster merging method for time series microarray with production values.
Chira, Camelia; Sedano, Javier; Camara, Monica; Prieto, Carlos; Villar, Jose R; Corchado, Emilio
2014-09-01
A challenging task in time-course microarray data analysis is to cluster genes meaningfully combining the information provided by multiple replicates covering the same key time points. This paper proposes a novel cluster merging method to accomplish this goal obtaining groups with highly correlated genes. The main idea behind the proposed method is to generate a clustering starting from groups created based on individual temporal series (representing different biological replicates measured in the same time points) and merging them by taking into account the frequency by which two genes are assembled together in each clustering. The gene groups at the level of individual time series are generated using several shape-based clustering methods. This study is focused on a real-world time series microarray task with the aim to find co-expressed genes related to the production and growth of a certain bacteria. The shape-based clustering methods used at the level of individual time series rely on identifying similar gene expression patterns over time which, in some models, are further matched to the pattern of production/growth. The proposed cluster merging method is able to produce meaningful gene groups which can be naturally ranked by the level of agreement on the clustering among individual time series. The list of clusters and genes is further sorted based on the information correlation coefficient and new problem-specific relevant measures. Computational experiments and results of the cluster merging method are analyzed from a biological perspective and further compared with the clustering generated based on the mean value of time series and the same shape-based algorithm.
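The merging idea described above, combining per-replicate clusterings according to how often two genes are grouped together, can be illustrated with a generic co-association sketch. It is an assumption-laden stand-in rather than the paper's algorithm: the function names, the `min_agreement` threshold, and the union-find merging rule are all introduced here.

```python
from itertools import combinations


def coassociation(clusterings):
    """Fraction of clusterings in which each gene pair shares a cluster.

    clusterings: list of dicts mapping gene -> cluster label, one dict per
    replicate time series; all dicts are assumed to cover the same genes.
    """
    genes = sorted(clusterings[0])
    n = len(clusterings)
    return {
        (a, b): sum(c[a] == c[b] for c in clusterings) / n
        for a, b in combinations(genes, 2)
    }


def merge_clusters(clusterings, min_agreement):
    """Group genes linked by pairs whose co-association meets the threshold."""
    genes = sorted(clusterings[0])
    parent = {g: g for g in genes}

    def find(g):
        # Union-find lookup with path halving.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for (a, b), freq in coassociation(clusterings).items():
        if freq >= min_agreement:
            parent[find(a)] = find(b)

    groups = {}
    for g in genes:
        groups.setdefault(find(g), []).append(g)
    return sorted(groups.values())


# Hypothetical clusterings from three replicates; labels need not match across replicates.
replicates = [
    {"g1": 0, "g2": 0, "g3": 1, "g4": 1},
    {"g1": "a", "g2": "a", "g3": "a", "g4": "b"},
    {"g1": 1, "g2": 1, "g3": 2, "g4": 2},
]
strict = merge_clusters(replicates, 1.0)
relaxed = merge_clusters(replicates, 0.6)
```

Ranking the resulting groups by their minimum pairwise agreement would give an ordering in the spirit of the "level of agreement" ranking mentioned in the abstract.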
Building block synthesis using the polymerase chain assembly method.
Marchand, Julie A; Peccoud, Jean
2012-01-01
De novo gene synthesis allows the creation of custom DNA molecules without the typical constraints of traditional cloning assembly: scars, restriction site incompatibility, and the quest to find all the desired parts to name a few. Moreover, with the help of computer-assisted design, the perfect DNA molecule can be created along with its matching sequence ready to download. The challenge is to build the physical DNA molecules that have been designed with the software. Although there are several DNA assembly methods, this section presents and describes a method using the polymerase chain assembly (PCA).
Ueno, Hiroki; Kobatake, Keitaro; Matsumoto, Masayasu; Morino, Hiroyuki; Maruyama, Hirofumi; Kawakami, Hideshi
2011-12-12
Previous studies have shown widespread multisystem degeneration in patients with sporadic amyotrophic lateral sclerosis who develop a total locked-in state and survive under mechanical ventilation for a prolonged period of time. However, the disease progressions reported in these studies were several years after disease onset. There have been no reports of long-term follow-up with brain imaging of patients with familial amyotrophic lateral sclerosis at an advanced stage of the disease. We report the cases of siblings with amyotrophic lateral sclerosis with homozygous deletions of the exon 5 mutation of the gene encoding optineurin, in whom brain computed tomography scans were followed up for more than 20 years. The patients were a Japanese brother and sister. The elder sister was 33 years of age at the onset of disease, which began with muscle weakness of her left lower limb. Two years later she required mechanical ventilation. She became bedridden at the age of 34, and died at the age of 57. A computed tomography scan of her brain at the age of 36 revealed no abnormality. Atrophy of her brain gradually progressed. Ten years after the onset of mechanical ventilation, atrophy of her whole brain, including the cerebral cortex, brain stem and cerebellum, markedly progressed. Her younger brother was 36 years of age at the onset of disease, which presented as muscle weakness of his left upper limb. One year later, he showed dysphagia and dysarthria, and tracheostomy ventilation was performed. He became bedridden at the age of 37 and died at the age of 55. There were no abnormal intracranial findings on brain computed tomography scans obtained at the age of 37 years. At the age of 48 years, computed tomography scans showed marked brain atrophy with ventricular dilatation. Subsequently, atrophy of the whole brain rapidly progressed as in his elder sister. 
We conclude that a homozygous deletion-type mutation in the optineurin gene may be associated with widespread multisystem degeneration in amyotrophic lateral sclerosis.
A comparison of algorithms for inference and learning in probabilistic graphical models.
Frey, Brendan J; Jojic, Nebojsa
2005-09-01
Research into methods for reasoning under uncertainty is currently one of the most exciting areas of artificial intelligence, largely because it has recently become possible to record, store, and process large amounts of data. While impressive achievements have been made in pattern classification problems such as handwritten character recognition, face detection, speaker identification, and prediction of gene function, it is even more exciting that researchers are on the verge of introducing systems that can perform large-scale combinatorial analyses of data, decomposing the data into interacting components. For example, computational methods for automatic scene analysis are now emerging in the computer vision community. These methods decompose an input image into its constituent objects, lighting conditions, motion patterns, etc. Two of the main challenges are finding effective representations and models in specific applications and finding efficient algorithms for inference and learning in these models. In this paper, we advocate the use of graph-based probability models and their associated inference and learning algorithms. We review exact techniques and various approximate, computationally efficient techniques, including iterated conditional modes, the expectation maximization (EM) algorithm, Gibbs sampling, the mean field method, variational techniques, structured variational techniques and the sum-product algorithm ("loopy" belief propagation). We describe how each technique can be applied in a vision model of multiple, occluding objects and contrast the behaviors and performances of the techniques using a unifying cost function, free energy.
Joshi, Priya Shirish; Deshmukh, Vijay; Golgire, Someshwar
2012-01-01
Gorlin-Goltz syndrome is an uncommon autosomal dominant inherited disorder characterized by multiple odontogenic keratocysts and basal cell carcinomas; skeletal, dental, ophthalmic, and neurological abnormalities; intracranial ectopic calcifications of the falx cerebri; and facial dysmorphism. Pathogenesis of the syndrome is attributed to abnormalities in the long arm of chromosome 9 (q22.3-q31) and loss or mutations of the human patched gene (PTCH1). Diagnosis is based upon established major and minor clinical and radiological criteria and is ideally confirmed by deoxyribonucleic acid analysis. We report a case of a 9-year-old girl presenting with three major features and one minor feature of Gorlin-Goltz syndrome. Radiologic findings of the syndrome are easily identifiable on orthopantomogram, chest X-ray, and computed tomography scans. These investigations prompt early verification of the disease, which is very important for preventing recurrence and improving survival rates from the coexistent diseases. PMID:22363371
Epigenetic Regulation: A New Frontier for Biomedical Engineers.
Chen, Zhen; Li, Shuai; Subramaniam, Shankar; Shyy, John Y-J; Chien, Shu
2017-06-21
Gene expression in mammalian cells depends on the epigenetic status of the chromatin, including DNA methylation, histone modifications, promoter-enhancer interactions, and noncoding RNA-mediated regulation. The coordinated actions of these multifaceted regulations determine cell development, cell cycle regulation, cell state and fate, and the ultimate responses in health and disease. Therefore, studies of epigenetic modulations are critical for our understanding of gene regulation mechanisms at the molecular, cellular, tissue, and organ levels. The aim of this review is to provide biomedical engineers with an overview of the principles of epigenetics, methods of study, recent findings in epigenetic regulation in health and disease, and computational and sequencing tools for epigenetics analysis, with an emphasis on the cardiovascular system. This review concludes with the perspectives of the application of bioengineering to advance epigenetics and the utilization of epigenetics to translate bioengineering research into clinical medicine.
Rzhepetskyy, Yuriy; Lazniewska, Joanna; Blesneac, Iulia; Pamphlett, Roger; Weiss, Norbert
2016-11-01
Amyotrophic lateral sclerosis (ALS) is a progressive neurodegenerative disease that affects nerve cells in the brain and the spinal cord. In a recent study by Steinberg and colleagues, 2 recessive missense mutations were identified in the Cav3.2 T-type calcium channel gene (CACNA1H), in a family with an affected proband (early onset, long duration ALS) and 2 unaffected parents. We have introduced and functionally characterized these mutations using transiently expressed human Cav3.2 channels in tsA-201 cells. Both of these mutations produced mild but significant changes on T-type channel activity that are consistent with a loss of channel function. Computer modeling in thalamic reticular neurons suggested that these mutations result in decreased neuronal excitability of thalamic structures. Taken together, these findings implicate CACNA1H as a susceptibility gene in amyotrophic lateral sclerosis.
Human tRNA genes function as chromatin insulators
Raab, Jesse R; Chiu, Jonathan; Zhu, Jingchun; Katzman, Sol; Kurukuti, Sreenivasulu; Wade, Paul A; Haussler, David; Kamakaka, Rohinton T
2012-01-01
Insulators help separate active chromatin domains from silenced ones. In yeast, gene promoters act as insulators to block the spread of Sir and HP1 mediated silencing while in metazoans most insulators are multipartite autonomous entities. tDNAs are repetitive sequences dispersed throughout the human genome and we now show that some of these tDNAs can function as insulators in human cells. Using computational methods, we identified putative human tDNA insulators. Using silencer blocking, transgene protection and repressor blocking assays we show that some of these tDNA-containing fragments can function as barrier insulators in human cells. We find that these elements also have the ability to block enhancers from activating RNA pol II transcribed promoters. Characterization of a putative tDNA insulator in human cells reveals that the site possesses chromatin signatures similar to those observed at other better-characterized eukaryotic insulators. Enhanced 4C analysis demonstrates that the tDNA insulator makes long-range chromatin contacts with other tDNAs and ETC sites but not with intervening or flanking RNA pol II transcribed genes. PMID:22085927
Ambrosi, Christina M.; Boyle, Patrick M.; Chen, Kay; Trayanova, Natalia A.; Entcheva, Emilia
2015-01-01
Multiple cardiac pathologies are accompanied by loss of tissue excitability, which leads to a range of heart rhythm disorders (arrhythmias). In addition to electronic device therapy (i.e. implantable pacemakers and cardioverter/defibrillators), biological approaches have recently been explored to restore pacemaking ability and to correct conduction slowing in the heart by delivering excitatory ion channels or ion channel agonists. Using optogenetics as a tool to selectively interrogate only cells transduced to produce an exogenous excitatory ion current, we experimentally and computationally quantify the efficiency of such biological approaches in rescuing cardiac excitability as a function of the mode of application (viral gene delivery or cell delivery) and the geometry of the transduced region (focal or spatially-distributed). We demonstrate that for each configuration (delivery mode and spatial pattern), the optical energy needed to excite can be used to predict therapeutic efficiency of excitability restoration. Taken directly, these results can help guide optogenetic interventions for light-based control of cardiac excitation. More generally, our findings can help optimize gene therapy for restoration of cardiac excitability. PMID:26621212
Morphogenesis of the vulva and the vulval-uterine connection.
Gupta, Bhagwati P; Hanna-Rose, Wendy; Sternberg, Paul W
2012-01-01
The C. elegans hermaphrodite vulva is an established model system to study mechanisms of cell fate specification and tissue morphogenesis. The adult vulva is a tubular shaped organ composed of seven concentric toroids that arise from selective fusion between differentiated vulval progeny. The dorsal end of the vulval tubule is connected to the uterus via a multinucleate syncytium utse (uterine-seam) cell. The vulval tubule and utse are formed as a result of changes in morphogenetic processes such as cell polarity, adhesion, and invagination. A number of genes controlling these processes are conserved all the way up to human and function in similar developmental contexts. This makes it possible to extend the findings to other metazoan systems. Gene expression studies in the vulval and uterine cells have revealed the presence of regulatory networks specifying distinct cell fates. Thus, these two cell types serve as a good system to understand how gene networks confer unique cell identities both experimentally and computationally. This chapter focuses on morphogenetic processes during the formation of the vulva and its connection to uterus. PMID:23208727
A Gene Ontology Tutorial in Python.
Vesztrocy, Alex Warwick; Dessimoz, Christophe
2017-01-01
This chapter is a tutorial on using Gene Ontology resources in the Python programming language. This entails querying the Gene Ontology graph, retrieving Gene Ontology annotations, performing gene enrichment analyses, and computing basic semantic similarity between GO terms. An interactive version of the tutorial, including solutions, is available at http://gohandbook.org .
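Of the tasks listed above, gene enrichment analysis is commonly framed as a one-sided hypergeometric test. The standalone sketch below shows that calculation using only the standard library; it is not taken from the tutorial, and the function name and the example counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, study, study_annotated):
    """Upper-tail hypergeometric p-value for over-representation of a GO term.

    population:      number of genes in the background set
    annotated:       background genes annotated with the term
    study:           number of genes in the study set
    study_annotated: study genes annotated with the term
    Returns P(X >= study_annotated) when `study` genes are drawn at random.
    """
    total = comb(population, study)
    upper = min(annotated, study)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, study - k)
        for k in range(study_annotated, upper + 1)
    )
    return tail / total


# Hypothetical counts: 3 of 10 background genes carry the term,
# and all 3 genes in the study set carry it.
p_value = hypergeom_enrichment_p(10, 3, 3, 3)
```

In practice a multiple-testing correction across all tested GO terms would follow; that step is omitted here.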
Solving a Hamiltonian Path Problem with a bacterial computer
Baumgardner, Jordan; Acker, Karen; Adefuye, Oyinade; Crowley, Samuel Thomas; DeLoache, Will; Dickson, James O; Heard, Lane; Martens, Andrew T; Morton, Nickolaus; Ritter, Michelle; Shoecraft, Amber; Treece, Jessica; Unzicker, Matthew; Valencia, Amanda; Waters, Mike; Campbell, A Malcolm; Heyer, Laurie J; Poet, Jeffrey L; Eckdahl, Todd T
2009-01-01
Background The Hamiltonian Path Problem asks whether there is a route in a directed graph from a beginning node to an ending node, visiting each node exactly once. The Hamiltonian Path Problem is NP-complete, achieving surprising computational complexity with modest increases in size. This challenge has inspired researchers to broaden the definition of a computer. DNA computers have been developed that solve NP-complete problems. Bacterial computers can be programmed by constructing genetic circuits to execute an algorithm that is responsive to the environment and whose result can be observed. Each bacterium can examine a solution to a mathematical problem and billions of them can explore billions of possible solutions. Bacterial computers can be automated, made responsive to selection, and reproduce themselves so that more processing capacity is applied to problems over time. Results We programmed bacteria with a genetic circuit that enables them to evaluate all possible paths in a directed graph in order to find a Hamiltonian path. We encoded a three-node directed graph as DNA segments that were autonomously shuffled randomly inside bacteria by a Hin/hixC recombination system we previously adapted from Salmonella typhimurium for use in Escherichia coli. We represented nodes in the graph as linked halves of two different genes encoding red or green fluorescent proteins. Bacterial populations displayed phenotypes that reflected random ordering of edges in the graph. Individual bacterial clones that found a Hamiltonian path reported their success by fluorescing both red and green, resulting in yellow colonies. We used DNA sequencing to verify that the yellow phenotype resulted from genotypes that represented Hamiltonian path solutions, demonstrating that our bacterial computer functioned as expected. Conclusion We successfully designed, constructed, and tested a bacterial computer capable of finding a Hamiltonian path in a three-node directed graph.
This proof-of-concept experiment demonstrates that bacterial computing is a new way to address NP-complete problems using the inherent advantages of genetic systems. The results of our experiments also validate synthetic biology as a valuable approach to biological engineering. We designed and constructed basic parts, devices, and systems using synthetic biology principles of standardization and abstraction. PMID:19630940
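For readers unfamiliar with the problem, the exhaustive search that the bacterial population performs in parallel can be sketched in software. This is only an illustrative brute-force check, not the authors' genetic circuit; the node names, the edge list, and the function name hamiltonian_paths are assumptions made for the example and are not taken from the paper.

```python
from itertools import permutations


def hamiltonian_paths(nodes, edges, start, end):
    """Return every node ordering that is a Hamiltonian path from start to end.

    Brute force: try each permutation of the nodes and keep those whose
    consecutive pairs are all directed edges of the graph.
    """
    edge_set = set(edges)
    found = []
    for order in permutations(nodes):
        if order[0] != start or order[-1] != end:
            continue
        if all((a, b) in edge_set for a, b in zip(order, order[1:])):
            found.append(order)
    return found


# Hypothetical three-node directed graph (not the graph used in the paper).
NODES = ["A", "B", "C"]
EDGES = [("A", "B"), ("B", "C"), ("A", "C")]

paths = hamiltonian_paths(NODES, EDGES, "A", "C")
print(paths)
```

The number of permutations grows factorially with the number of nodes, which is why exhaustive search becomes impractical quickly and why massively parallel substrates are of interest.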
Tachiyama, Keisuke; Shiga, Yuji; Shimoe, Yutaka; Mizuta, Ikuko; Mizuno, Toshiki; Kuriyama, Masaru
2018-04-25
A 55-year-old man with no history of stroke or migraine presented to the clinic with cognitive impairment and depression that he had been experiencing for two years. Neurological examination showed bilateral pyramidal signs, and impairments in cognition and attention. Brain MRI revealed multiple lacunar lesions and microbleeds in the deep cerebral white matter, subcortical regions, and brainstem, as well as diffuse white matter hyperintensities without anterior temporal pole involvement. Cerebral single-photon emission computed tomography (SPECT) revealed bilateral hypoperfusion in the basal ganglia. Gene analysis revealed an arginine-to-proline missense mutation in the NOTCH3 gene at codon 75. The patient was administered lomerizine (10 mg/day), but the patient's cognitive impairment and cerebral atrophy continued to worsen. Follow-up testing with MRI three years after his initial diagnosis revealed similar lacunar infarctions, cerebral microbleeds, and diffuse white matter hyperintensities to those observed three years earlier. However, MRI scans revealed signs of increased cerebral blood flow. Together, these findings suggest that the patient's cognitive impairments may have been caused by pathogenesis in the cerebral cortex.
An informatics approach to analyzing the incidentalome.
Berg, Jonathan S; Adams, Michael; Nassar, Nassib; Bizon, Chris; Lee, Kristy; Schmitt, Charles P; Wilhelmsen, Kirk C; Evans, James P
2013-01-01
Next-generation sequencing has transformed genetic research and is poised to revolutionize clinical diagnosis. However, the vast amount of data and inevitable discovery of incidental findings require novel analytic approaches. We therefore implemented for the first time a strategy that utilizes an a priori structured framework and a conservative threshold for selecting clinically relevant incidental findings. We categorized 2,016 genes linked with Mendelian diseases into "bins" based on clinical utility and validity, and used a computational algorithm to analyze 80 whole-genome sequences in order to explore the use of such an approach in a simulated real-world setting. The algorithm effectively reduced the number of variants requiring human review and identified incidental variants with likely clinical relevance. Incorporation of the Human Gene Mutation Database improved the yield for missense mutations but also revealed that a substantial proportion of purported disease-causing mutations were misleading. This approach is adaptable to any clinically relevant bin structure, scalable to the demands of a clinical laboratory workflow, and flexible with respect to advances in genomics. We anticipate that application of this strategy will facilitate pretest informed consent, laboratory analysis, and posttest return of results in a clinical context.
Dittmar, W James; McIver, Lauren; Michalak, Pawel; Garner, Harold R; Valdez, Gregorio
2014-07-01
The wealth of publicly available gene expression and genomic data provides unique opportunities for computational inference to discover groups of genes that function to control specific cellular processes. Such genes are likely to have co-evolved and be expressed in the same tissues and cells. Unfortunately, the expertise and computational resources required to compare tens of genomes and gene expression data sets make this type of analysis difficult for the average end-user. Here, we describe the implementation of a web server that predicts genes involved in affecting specific cellular processes together with a gene of interest. We termed the server 'EvoCor', to denote that it detects functional relationships among genes through evolutionary analysis and gene expression correlation. This web server integrates profiles of sequence divergence derived by a Hidden Markov Model (HMM) and tissue-wide gene expression patterns to determine putative functional linkages between pairs of genes. This server is easy to use and freely available at http://pilot-hmm.vbi.vt.edu/. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
SGFSC: speeding the gene functional similarity calculation based on hash tables.
Tian, Zhen; Wang, Chunyu; Guo, Maozu; Liu, Xiaoyan; Teng, Zhixia
2016-11-04
In recent years, many measures of gene functional similarity have been proposed and widely used in all kinds of essential research. These methods are mainly divided into two categories: pairwise approaches and group-wise approaches. However, a common problem with these methods is their time consumption, especially when measuring the gene functional similarities of a large number of gene pairs. The problem of computational efficiency for pairwise approaches is even more prominent because they are dependent on the combination of semantic similarity. Therefore, the efficient measurement of gene functional similarity remains a challenging problem. To speed current gene functional similarity calculation methods, a novel two-step computing strategy is proposed: (1) establish a hash table for each method to store essential information obtained from the Gene Ontology (GO) graph and (2) measure gene functional similarity based on the corresponding hash table. There is no need to traverse the GO graph repeatedly for each method with the help of the hash table. The analysis of time complexity shows that the computational efficiency of these methods is significantly improved. We also implement a novel Speeding Gene Functional Similarity Calculation tool, namely SGFSC, which is bundled with seven typical measures using our proposed strategy. Further experiments show the great advantage of SGFSC in measuring gene functional similarity on the whole genomic scale. The proposed strategy is successful in speeding current gene functional similarity calculation methods. SGFSC is an efficient tool that is freely available at http://nclab.hit.edu.cn/SGFSC . The source code of SGFSC can be downloaded from http://pan.baidu.com/s/1dFFmvpZ .
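The two-step strategy described above (precompute a hash table from the GO graph, then answer similarity queries from the table alone) can be illustrated with a small sketch. It is not the SGFSC implementation: the toy term names, the PARENTS dictionary, and the Jaccard-style term_similarity measure are all assumptions chosen to keep the example short.

```python
# Toy is_a graph: each term maps to its direct parents.
# Term names are hypothetical and do not correspond to real GO identifiers.
PARENTS = {
    "root": [],
    "t1": ["root"],
    "t2": ["root"],
    "t3": ["t1", "t2"],
    "t4": ["t1"],
}


def build_ancestor_table(parents):
    """Step 1: traverse the graph once and cache each term's ancestor set."""
    table = {}

    def visit(term):
        if term not in table:
            ancestors = {term}
            for parent in parents[term]:
                ancestors |= visit(parent)
            table[term] = frozenset(ancestors)
        return table[term]

    for term in parents:
        visit(term)
    return table


ANCESTORS = build_ancestor_table(PARENTS)


def term_similarity(a, b, table=ANCESTORS):
    """Step 2: answer queries from the hash table only, with no graph traversal.

    Uses the Jaccard overlap of ancestor sets as a simple stand-in measure.
    """
    sa, sb = table[a], table[b]
    return len(sa & sb) / len(sa | sb)


print(term_similarity("t3", "t4"))
```

Each query costs only two dictionary lookups and a set intersection, so the cost of walking the graph is paid once rather than once per gene pair.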
Fan, Ming; Kuwahara, Hiroyuki; Wang, Xiaolei; Wang, Suojin; Gao, Xin
2015-11-01
Parameter estimation is a challenging computational problem in the reverse engineering of biological systems. Because advances in biotechnology have facilitated wide availability of time-series gene expression data, systematic parameter estimation of gene circuit models from such time-series mRNA data has become an important method for quantitatively dissecting the regulation of gene expression. By focusing on the modeling of gene circuits, we examine here the performance of three types of state-of-the-art parameter estimation methods: population-based methods, online methods and model-decomposition-based methods. Our results show that certain population-based methods are able to generate high-quality parameter solutions. The performance of these methods, however, is heavily dependent on the size of the parameter search space, and their computational requirements substantially increase as the size of the search space increases. In comparison, online methods and model-decomposition-based methods are computationally faster alternatives and are less dependent on the size of the search space. Among other things, our results show that a hybrid approach that augments computationally fast methods with local search as a subsequent refinement procedure can substantially increase the quality of their parameter estimates to a level on par with the best solution obtained from the population-based methods while maintaining high computational speed. These results suggest that such hybrid methods can be a promising alternative to the more commonly used population-based methods for parameter estimation of gene circuit models when limited prior knowledge about the underlying regulatory mechanisms makes the parameter search space very large. © The Author 2015. Published by Oxford University Press. For Permissions, please email: journals.permissions@oup.com.
Computational Approaches to Drug Repurposing and Pharmacology
Hodos, Rachel A; Kidd, Brian A; Khader, Shameer; Readhead, Ben P; Dudley, Joel T
2016-01-01
Data in the biological, chemical, and clinical domains are accumulating at ever-increasing rates and have the potential to accelerate and inform drug development in new ways. Challenges and opportunities now lie in developing analytic tools to transform these often complex and heterogeneous data into testable hypotheses and actionable insights. This is the aim of computational pharmacology, which uses in silico techniques to better understand and predict how drugs affect biological systems, which can in turn improve clinical use, avoid unwanted side effects, and guide selection and development of better treatments. One exciting application of computational pharmacology is drug repurposing: finding new uses for existing drugs. Already yielding many promising candidates, this strategy has the potential to improve the efficiency of the drug development process and reach patient populations with previously unmet needs such as those with rare diseases. While current techniques in computational pharmacology and drug repurposing often focus on just a single data modality such as gene expression or drug-target interactions, we rationalize that methods such as matrix factorization that can integrate data within and across diverse data types have the potential to improve predictive performance and provide a fuller picture of a drug's pharmacological action. PMID:27080087
Fusing literature and full network data improves disease similarity computation.
Li, Ping; Nie, Yaling; Yu, Jingkai
2016-08-30
Identifying relatedness among diseases could help deepen understanding of the underlying pathogenic mechanisms of diseases, and facilitate drug repositioning projects. A number of methods for computing disease similarity have been developed; however, none of them were designed to utilize information of the entire protein interaction network, using instead only those interactions involving disease-causing genes. Most previously published methods required gene-disease association data; unfortunately, many diseases still have very few or no associated genes, which has impeded broad adoption of those methods. In this study, we propose a new method (MedNetSim) for computing disease similarity by integrating medical literature and protein interaction network. MedNetSim consists of a network-based method (NetSim), which employs the entire protein interaction network, and a MEDLINE-based method (MedSim), which computes disease similarity by mining the biomedical literature. Among function-based methods, NetSim achieved the best performance. Its average AUC (area under the receiver operating characteristic curve) reached 95.2 %. MedSim, whose performance was even comparable to some function-based methods, acquired the highest average AUC in all semantic-based methods. Integration of MedSim and NetSim (MedNetSim) further improved the average AUC to 96.4 %. We further studied the effectiveness of different data sources. It was found that quality of protein interaction data was more important than its volume. On the contrary, higher volume of gene-disease association data was more beneficial, even with a lower reliability. Utilizing higher volume of disease-related gene data further improved the average AUC of MedNetSim and NetSim to 97.5 % and 96.7 %, respectively. Integrating biomedical literature and protein interaction network can be an effective way to compute disease similarity.
Lacking sufficient disease-related gene data, literature-based methods such as MedSim can be a great addition to function-based algorithms. It may be beneficial to steer more resources toward studying gene-disease associations and improving the quality of protein interaction data. Disease similarities can be computed using the proposed methods at http://www.digintelli.com:8000/ .
Heterogeneous high throughput scientific computing with APM X-Gene and Intel Xeon Phi
Abdurachmanov, David; Bockelman, Brian; Elmer, Peter; ...
2015-05-22
Electrical power requirements will be a constraint on the future growth of Distributed High Throughput Computing (DHTC) as used by High Energy Physics. Performance-per-watt is a critical metric for the evaluation of computer architectures for cost-efficient computing. Additionally, future performance growth will come from heterogeneous, many-core, and high computing density platforms with specialized processors. In this paper, we examine the Intel Xeon Phi Many Integrated Cores (MIC) co-processor and Applied Micro X-Gene ARMv8 64-bit low-power server system-on-a-chip (SoC) solutions for scientific computing applications. As a result, we report our experience on software porting, performance and energy efficiency and evaluate the potential for use of such technologies in the context of distributed computing systems such as the Worldwide LHC Computing Grid (WLCG).
Strakova, Eva; Zikova, Alice; Vohradsky, Jiri
2014-01-01
A computational model of gene expression was applied to a novel test set of microarray time series measurements to reveal regulatory interactions between transcriptional regulators represented by 45 sigma factors and the genes expressed during germination of a prokaryote Streptomyces coelicolor. Using microarrays, the first 5.5 h of the process was recorded in 13 time points, which provided a database of gene expression time series on genome-wide scale. The computational modeling of the kinetic relations between the sigma factors, individual genes and genes clustered according to the similarity of their expression kinetics identified kinetically plausible sigma factor-controlled networks. Using genome sequence annotations, functional groups of genes that were predominantly controlled by specific sigma factors were identified. Using external binding data complementing the modeling approach, specific genes involved in the control of the studied process were identified and their function suggested.
Sorting by Cuts, Joins, and Whole Chromosome Duplications.
Zeira, Ron; Shamir, Ron
2017-02-01
Genome rearrangement problems have been extensively studied due to their importance in biology. Most studied models assumed a single copy per gene. However, in reality, duplicated genes are common, most notably in cancer. In this study, we make a step toward handling duplicated genes by considering a model that allows the atomic operations of cut, join, and whole chromosome duplication. Given two linear genomes, [Formula: see text] with one copy per gene and [Formula: see text] with two copies per gene, we give a linear time algorithm for computing a shortest sequence of operations transforming [Formula: see text] into [Formula: see text] such that all intermediate genomes are linear. We also show that computing an optimal sequence with fewest duplications is NP-hard.
Dynamic Energy Landscapes of Riboswitches Help Interpret Conformational Rearrangements and Function
Quarta, Giulio; Sin, Ken; Schlick, Tamar
2012-01-01
Riboswitches are RNAs that modulate gene expression by ligand-induced conformational changes. However, the way in which sequence dictates alternative folding pathways of gene regulation remains unclear. In this study, we compute energy landscapes, which describe the accessible secondary structures for a range of sequence lengths, to analyze the transcriptional process as a given sequence elongates to full length. In line with experimental evidence, we find that most riboswitch landscapes can be characterized by three broad classes as a function of sequence length in terms of the distribution and barrier type of the conformational clusters: low-barrier landscape with an ensemble of different conformations in equilibrium before encountering a substrate; barrier-free landscape in which a direct, dominant “downhill” pathway to the minimum free energy structure is apparent; and a barrier-dominated landscape with two isolated conformational states, each associated with a different biological function. Sharing concepts with the “new view” of protein folding energy landscapes, we term the three sequence ranges above as the sensing, downhill folding, and functional windows, respectively. We find that these energy landscape patterns are conserved in various riboswitch classes, though the order of the windows may vary. In fact, the order of the three windows suggests either kinetic or thermodynamic control of ligand binding. These findings help understand riboswitch structure/function relationships and open new avenues to riboswitch design. PMID:22359488
Schizophrenia interactome with 504 novel protein–protein interactions
Ganapathiraju, Madhavi K; Thahir, Mohamed; Handen, Adam; Sarkar, Saumendra N; Sweet, Robert A; Nimgaonkar, Vishwajit L; Loscher, Christine E; Bauer, Eileen M; Chaparala, Srilakshmi
2016-01-01
Genome-wide association studies of schizophrenia (GWAS) have revealed the role of rare and common genetic variants, but the functional effects of the risk variants remain to be understood. Protein interactome-based studies can facilitate the study of molecular mechanisms by which the risk genes relate to schizophrenia (SZ) genesis, but protein–protein interactions (PPIs) are unknown for many of the liability genes. We developed a computational model to discover PPIs, which is found to be highly accurate according to computational evaluations and experimental validations of selected PPIs. We present here, 365 novel PPIs of liability genes identified by the SZ Working Group of the Psychiatric Genomics Consortium (PGC). Seventeen genes that had no previously known interactions have 57 novel interactions by our method. Among the new interactors are 19 drug targets that are targeted by 130 drugs. In addition, we computed 147 novel PPIs of 25 candidate genes investigated in the pre-GWAS era. While there is little overlap between the GWAS genes and the pre-GWAS genes, the interactomes reveal that they largely belong to the same pathways, thus reconciling the apparent disparities between the GWAS and prior gene association studies. The interactome including 504 novel PPIs overall, could motivate other systems biology studies and trials with repurposed drugs. The PPIs are made available on a webserver, called Schizo-Pi at http://severus.dbmi.pitt.edu/schizo-pi with advanced search capabilities. PMID:27336055
Rough Set Soft Computing Cancer Classification and Network: One Stone, Two Birds
Zhang, Yue
2010-01-01
Gene expression profiling provides tremendous information to help unravel the complexity of cancer. The selection of the most informative genes from huge noise for cancer classification has taken centre stage, along with predicting the function of such identified genes and the construction of direct gene regulatory networks at different system levels with a tuneable parameter. A new study by Wang and Gotoh described a novel Variable Precision Rough Sets-rooted robust soft computing method to successfully address these problems and has yielded some new insights. The significance of this progress and its perspectives will be discussed in this article. PMID:20706619
Azad, Ariful; Ouzounis, Christos A; Kyrpides, Nikos C; Buluç, Aydin
2018-01-01
Biological networks capture structural or functional properties of relevant entities such as molecules, proteins or genes. Characteristic examples are gene expression networks or protein–protein interaction networks, which hold information about functional affinities or structural similarities. Such networks have been expanding in size due to increasing scale and abundance of biological data. While various clustering algorithms have been proposed to find highly connected regions, Markov Clustering (MCL) has been one of the most successful approaches to cluster sequence similarity or expression networks. Despite its popularity, MCL’s scalability to cluster large datasets still remains a bottleneck due to high running times and memory demands. Here, we present High-performance MCL (HipMCL), a parallel implementation of the original MCL algorithm that can run on distributed-memory computers. We show that HipMCL can efficiently utilize 2000 compute nodes and cluster a network of ∼70 million nodes with ∼68 billion edges in ∼2.4 h. By exploiting distributed-memory environments, HipMCL clusters large-scale networks several orders of magnitude faster than MCL and enables clustering of even bigger networks. HipMCL is based on MPI and OpenMP and is freely available under a modified BSD license. PMID:29315405
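As background for the abstract above, the core MCL iteration alternates an expansion step (squaring a column-stochastic matrix) with an inflation step (element-wise powering followed by column renormalization). The following is a minimal serial sketch on dense Python lists, intended only to show the idea; it is not HipMCL, omits the pruning and distributed sparse matrix products that make large networks tractable, and the function names, the inflation value of 2.0, and the fixed iteration count are assumptions.

```python
def normalize_columns(m):
    """Scale each column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Dense square matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(adjacency, inflation=2.0, iterations=50):
    """Cluster a graph given as a dense adjacency matrix; returns a set of frozensets."""
    n = len(adjacency)
    # Add self-loops, then make the matrix column-stochastic.
    m = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = matmul(m, m)  # expansion: flow spreads along longer walks
        m = normalize_columns([[x ** inflation for x in row] for row in m])  # inflation
    # Rows that retain flow are attractors; their non-zero columns form a cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return clusters


# Hypothetical input: two disconnected triangles (nodes 0-2 and nodes 3-5).
TRIANGLE = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
GRAPH = [row + [0, 0, 0] for row in TRIANGLE] + [[0, 0, 0] + row for row in TRIANGLE]

clusters = mcl(GRAPH)
print(sorted(sorted(c) for c in clusters))
```

The dense product in matmul is cubic in the number of nodes, which illustrates why the expansion step dominates running time and memory on large networks.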
Gene calling and bacterial genome annotation with BG7.
Tobes, Raquel; Pareja-Tobes, Pablo; Manrique, Marina; Pareja-Tobes, Eduardo; Kovach, Evdokim; Alekhin, Alexey; Pareja, Eduardo
2015-01-01
New massive sequencing technologies are providing many bacterial genome sequences from diverse taxa but a refined annotation of these genomes is crucial for obtaining scientific findings and new knowledge. Thus, bacterial genome annotation has emerged as a key point to investigate in bacteria. Any efficient tool designed specifically to annotate bacterial genomes sequenced with massively parallel technologies has to consider the specific features of bacterial genomes (absence of introns and scarcity of nonprotein-coding sequence) and of next-generation sequencing (NGS) technologies (presence of errors and not perfectly assembled genomes). These features make it convenient to focus on coding regions and, hence, on protein sequences that are the elements directly related with biological functions. In this chapter we describe how to annotate bacterial genomes with BG7, an open-source tool based on a protein-centered gene calling/annotation paradigm. BG7 is specifically designed for the annotation of bacterial genomes sequenced with NGS. This tool is tolerant of sequence errors and maintains its capabilities for the annotation of highly fragmented genomes or for annotating mixed sequences coming from several genomes (as those obtained through metagenomics samples). BG7 has been designed with scalability as a requirement, with a computing infrastructure completely based on cloud computing (Amazon Web Services).
Berardini, Tanya Z.; Reiser, Leonore; Li, Donghui; Mezheritsky, Yarik; Muller, Robert; Strait, Emily; Huala, Eva
2015-01-01
The Arabidopsis Information Resource (TAIR) is a continuously updated, online database of genetic and molecular biology data for the model plant Arabidopsis thaliana that provides a global research community with centralized access to data for over 30,000 Arabidopsis genes. TAIR’s biocurators systematically extract, organize, and interconnect experimental data from the literature along with computational predictions, community submissions, and high throughput datasets to present a high quality and comprehensive picture of Arabidopsis gene function. TAIR provides tools for data visualization and analysis, and enables ordering of seed and DNA stocks, protein chips and other experimental resources. TAIR actively engages with its users who contribute expertise and data that augments the work of the curatorial staff. TAIR’s focus in an extensive and evolving ecosystem of online resources for plant biology is on the critically important role of extracting experimentally-based research findings from the literature and making that information computationally accessible. In response to the loss of government grant funding, the TAIR team founded a nonprofit entity, Phoenix Bioinformatics, with the aim of developing sustainable funding models for biological databases, using TAIR as a test case. Phoenix has successfully transitioned TAIR to subscription-based funding while still keeping its data relatively open and accessible. PMID:26201819
Azad, Ariful; Pavlopoulos, Georgios A.; Ouzounis, Christos A.; ...
2018-01-05
Biological networks capture structural or functional properties of relevant entities such as molecules, proteins or genes. Characteristic examples are gene expression networks or protein–protein interaction networks, which hold information about functional affinities or structural similarities. Such networks have been expanding in size due to increasing scale and abundance of biological data. While various clustering algorithms have been proposed to find highly connected regions, Markov Clustering (MCL) has been one of the most successful approaches to cluster sequence similarity or expression networks. Despite its popularity, MCL’s scalability to cluster large datasets still remains a bottleneck due to high running times and memory demands. In this paper, we present High-performance MCL (HipMCL), a parallel implementation of the original MCL algorithm that can run on distributed-memory computers. We show that HipMCL can efficiently utilize 2000 compute nodes and cluster a network of ~70 million nodes with ~68 billion edges in ~2.4 h. By exploiting distributed-memory environments, HipMCL clusters large-scale networks several orders of magnitude faster than MCL and enables clustering of even bigger networks. Finally, HipMCL is based on MPI and OpenMP and is freely available under a modified BSD license.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Azad, Ariful; Pavlopoulos, Georgios A.; Ouzounis, Christos A.
Biological networks capture structural or functional properties of relevant entities such as molecules, proteins or genes. Characteristic examples are gene expression networks or protein–protein interaction networks, which hold information about functional affinities or structural similarities. Such networks have been expanding in size due to increasing scale and abundance of biological data. While various clustering algorithms have been proposed to find highly connected regions, Markov Clustering (MCL) has been one of the most successful approaches to cluster sequence similarity or expression networks. Despite its popularity, MCL’s scalability to cluster large datasets still remains a bottleneck due to high running times and memory demands. In this paper, we present High-performance MCL (HipMCL), a parallel implementation of the original MCL algorithm that can run on distributed-memory computers. We show that HipMCL can efficiently utilize 2000 compute nodes and cluster a network of ~70 million nodes with ~68 billion edges in ~2.4 h. By exploiting distributed-memory environments, HipMCL clusters large-scale networks several orders of magnitude faster than MCL and enables clustering of even bigger networks. Finally, HipMCL is based on MPI and OpenMP and is freely available under a modified BSD license.
Li, S; Zhang, Z Q; Wu, L J; Zhang, X G; Li, Y D; Wang, Y Y
2007-01-01
Traditional Chinese medicine uses ZHENG as the key pathological principle to understand the human homeostasis and guide the applications of Chinese herbs. Here, a systems biology approach with the combination of computational analysis and animal experiment is used to investigate this complex issue, ZHENG, in the context of the neuro-endocrine-immune (NEI) system. By using the methods of literature mining, network analysis and topological comparison, it is found that hormones are predominant in the Cold ZHENG network, immune factors are predominant in the Hot ZHENG network, and these two networks are connected by neuro-transmitters. In addition, genes related to Hot ZHENG-related diseases are mainly present in the cytokine-cytokine receptor interaction pathway, whereas genes related to both the Cold-related and Hot-related diseases are linked to the neuroactive ligand-receptor interaction pathway. These computational findings were subsequently verified by experiments on a rat model of collagen-induced arthritis, which indicate that the Cold ZHENG-oriented herbs tend to affect the hub nodes in the Cold ZHENG network, and the Hot ZHENG-oriented herbs tend to affect the hub nodes in the Hot ZHENG network. These investigations demonstrate that the thousand-year-old concept of ZHENG may have a molecular basis with NEI as background.
2012-01-01
Background Previous studies on tumor classification based on gene expression profiles suggest that gene selection plays a key role in improving the classification performance. Moreover, finding important tumor-related genes with the highest accuracy is a very important task because these genes might serve as tumor biomarkers, which is of great benefit to not only tumor molecular diagnosis but also drug development. Results This paper proposes a novel gene selection method with rich biomedical meaning based on Heuristic Breadth-first Search Algorithm (HBSA) to find as many optimal gene subsets as possible. Due to the curse of dimensionality, this type of method could suffer from over-fitting and selection bias problems. To address these potential problems, a HBSA-based ensemble classifier is constructed using majority voting strategy from individual classifiers constructed by the selected gene subsets, and a novel HBSA-based gene ranking method is designed to find important tumor-related genes by measuring the significance of genes using their occurrence frequencies in the selected gene subsets. The experimental results on nine tumor datasets including three pairs of cross-platform datasets indicate that the proposed method can not only obtain better generalization performance but also find many important tumor-related genes. Conclusions It is found that the frequencies of the selected genes follow a power-law distribution, indicating that only a few top-ranked genes can be used as potential diagnosis biomarkers. Moreover, the top-ranked genes leading to very high prediction accuracy are closely related to specific tumor subtype and even hub genes. Compared with other related methods, the proposed method can achieve higher prediction accuracy with fewer genes. Moreover, they are further justified by analyzing the top-ranked genes in the context of individual gene function, biological pathway, and protein-protein interaction network. PMID:22830977
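Two ingredients named in this abstract, ranking genes by how often they occur across the selected gene subsets and combining individual classifiers by majority voting, are simple enough to sketch. The example below does not reproduce the HBSA search itself; the gene identifiers, the example subsets, and the function names rank_genes and majority_vote are hypothetical.

```python
from collections import Counter


def rank_genes(subsets):
    """Rank genes by the number of selected subsets in which they occur.

    Ties are broken alphabetically so that the ordering is deterministic.
    """
    counts = Counter(gene for subset in subsets for gene in set(subset))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def majority_vote(predictions):
    """Return the label predicted by the largest number of individual classifiers."""
    label, _ = Counter(predictions).most_common(1)[0]
    return label


# Hypothetical gene subsets, as if returned by a subset search.
SUBSETS = [["g1", "g2"], ["g1", "g3"], ["g1", "g2", "g4"]]

ranking = rank_genes(SUBSETS)
vote = majority_vote(["tumor", "normal", "tumor"])
print(ranking, vote)
```

Using an odd number of voters, as in the usage lines, avoids ties in a two-class setting; with an even number the result of majority_vote depends on the order in which labels were first seen.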
Using SPEEDES to simulate the blue gene interconnect network
NASA Technical Reports Server (NTRS)
Springer, P.; Upchurch, E.
2003-01-01
JPL and the Center for Advanced Computer Architecture (CACR) are conducting application and simulation analyses of BG/L in order to establish a range of effectiveness for the Blue Gene/L MPP architecture in performing important classes of computations and to determine the design sensitivity of the global interconnect network in support of real world ASCI application execution.
Li, Richard Y.; Di Felice, Rosa; Rohs, Remo; Lidar, Daniel A.
2018-01-01
Transcription factors regulate gene expression, but how these proteins recognize and specifically bind to their DNA targets is still debated. Machine learning models are effective means to reveal interaction mechanisms. Here we studied the ability of a quantum machine learning approach to predict binding specificity. Using simplified datasets of a small number of DNA sequences derived from actual binding affinity experiments, we trained a commercially available quantum annealer to classify and rank transcription factor binding. The results were compared to state-of-the-art classical approaches for the same simplified datasets, including simulated annealing, simulated quantum annealing, multiple linear regression, LASSO, and extreme gradient boosting. Despite technological limitations, we find a slight advantage in classification performance and nearly equal ranking performance using the quantum annealer for these fairly small training data sets. Thus, we propose that quantum annealing might be an effective method to implement machine learning for certain computational biology problems. PMID:29652405
Kleftogiannis, Dimitrios; Korfiati, Aigli; Theofilatos, Konstantinos; Likothanassis, Spiros; Tsakalidis, Athanasios; Mavroudi, Seferina
2013-06-01
Traditional biology was forced to restate some of its principles when microRNA (miRNA) genes and their regulatory role were first discovered. Typically, miRNAs are small non-coding RNA molecules which have the ability to bind to the 3' untranslated region (UTR) of their mRNA target genes for cleavage or translational repression. Existing experimental techniques for their identification and the prediction of the target genes share some important limitations such as low coverage, time-consuming experiments and high-cost reagents. Hence, many computational methods have been proposed for these tasks to overcome these limitations. Recently, many researchers have emphasized the development of computational approaches to predict the participation of miRNA genes in regulatory networks and to analyze their transcription mechanisms. All these approaches have certain advantages and disadvantages, which are described in the present survey. Our work is differentiated from existing review papers by updating the list of methodologies and emphasizing the computational issues that arise from miRNA data analysis. Furthermore, in the present survey, the various miRNA data analysis steps are treated as an integrated procedure whose aim is to uncover the regulatory role and mechanisms of the miRNA genes. This integrated view of the miRNA data analysis steps may be extremely useful for all researchers even if they work on just a single step. Copyright © 2013 Elsevier Inc. All rights reserved.
Computer Analogies: Teaching Molecular Biology and Ecology.
ERIC Educational Resources Information Center
Rice, Stanley; McArthur, John
2002-01-01
Suggests that computer science analogies can aid the understanding of gene expression, including the storage of genetic information on chromosomes. Presents a matrix of biology and computer science concepts. (DDR)
Li, Gao-Peng; Jiang, Liang; Ni, Jia-Zuan; Liu, Qiong; Zhang, Yan
2014-10-17
Selenium (Se) and sulfur (S) are closely related elements that exhibit similar chemical properties. Some genes related to S metabolism are also involved in Se utilization in many organisms. However, the evolutionary relationship between the two utilization traits is unclear. In this study, we conducted a comparative analysis of the selenophosphate synthetase (SelD) family, a key protein for all known Se utilization traits, in all sequenced archaea. Our search showed a very limited distribution of SelD and Se utilization in this kingdom. Interestingly, a SelD-like protein was detected in two orders of Crenarchaeota: Sulfolobales and Thermoproteales. Sequence and phylogenetic analyses revealed that the SelD-like protein contains the same domain and conserved functional residues as those of SelD, and might be involved in S metabolism in these S-reducing organisms. Further genome-wide analysis of patterns of gene occurrence in different Thermoproteales suggested that several genes, including SirA-like, Prx-like and adenylylsulfate reductase, were strongly related to the SelD-like gene. Based on these findings, we proposed a simple model wherein SelD-like may play an important role in the biosynthesis of certain thiophosphate compounds. Our data suggest novel genes involved in S metabolism in hyperthermophilic S-reducing archaea, and may provide a new window for understanding the complex relationship between Se and S metabolism in archaea.
Improving information retrieval in functional analysis.
Rodriguez, Juan C; González, Germán A; Fresno, Cristóbal; Llera, Andrea S; Fernández, Elmer A
2016-12-01
Transcriptome analysis is essential to understand the mechanisms regulating key biological processes and functions. The first step usually consists of identifying candidate genes; to find out which pathways are affected by those genes, however, functional analysis (FA) is mandatory. The most frequently used strategies for this purpose are Gene Set and Singular Enrichment Analysis (GSEA and SEA) over Gene Ontology. Several statistical methods have been developed and compared in terms of computational efficiency and/or statistical appropriateness. However, whether their results are similar or complementary, the sensitivity to parameter settings, or possible bias in the analyzed terms has not been addressed so far. Here, two GSEA and four SEA methods and their parameter combinations were evaluated in six datasets by comparing two breast cancer subtypes with well-known differences in genetic background and patient outcomes. We show that GSEA and SEA lead to different results depending on the chosen statistic, model and/or parameters. Both approaches provide complementary results from a biological perspective. Hence, an Integrative Functional Analysis (IFA) tool is proposed to improve information retrieval in FA. It provides a common gene expression analytic framework that grants a comprehensive and coherent analysis. Only a minimal user parameter setting is required, since the best SEA/GSEA alternatives are integrated. IFA utility was demonstrated by evaluating four prostate cancer and the TCGA breast cancer microarray datasets, which showed its biological generalization capabilities. Copyright © 2016 Elsevier Ltd. All rights reserved.
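As a worked illustration of the kind of statistic that underlies Singular Enrichment Analysis, the sketch below computes a one-sided hypergeometric enrichment p-value for a single Gene Ontology term. The abstract does not say which statistics the evaluated SEA methods use, so the choice of the hypergeometric test, the function name, and the example counts are all assumptions made for illustration.

```python
from math import comb


def hypergeom_enrichment_pvalue(population, annotated, drawn, hits):
    """Probability of seeing at least `hits` annotated genes in the candidate list.

    population: number of genes in the background set
    annotated:  background genes annotated with the term
    drawn:      size of the candidate gene list
    hits:       candidate genes annotated with the term
    """
    upper = min(annotated, drawn)
    total = comb(population, drawn)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, drawn - i)
        for i in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical counts: 10 background genes, 5 annotated with the term,
# 5 candidate genes, all 5 of which carry the annotation.
p_value = hypergeom_enrichment_pvalue(10, 5, 5, 5)
print(p_value)
```

The result depends on the choice of background set and on how the candidate list is cut, which is one concrete way in which parameter settings can change an SEA result.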
Predictive minimum description length principle approach to inferring gene regulatory networks.
Chaitankar, Vijender; Zhang, Chaoyang; Ghosh, Preetam; Gong, Ping; Perkins, Edward J; Deng, Youping
2011-01-01
Reverse engineering of gene regulatory networks using information theory models has received much attention due to its simplicity, low computational cost, and capability of inferring large networks. One of the major problems with information theory models is to determine the threshold that defines the regulatory relationships between genes. The minimum description length (MDL) principle has been implemented to overcome this problem. The description length of the MDL principle is the sum of model length and data encoding length. A user-specified fine tuning parameter is used as a control mechanism between model and data encoding, but it is difficult to find the optimal parameter. In this work, we propose a new inference algorithm that incorporates mutual information (MI), conditional mutual information (CMI), and the predictive minimum description length (PMDL) principle to infer gene regulatory networks from DNA microarray data. In this algorithm, the information theoretic quantities MI and CMI determine the regulatory relationships between genes and the PMDL principle method attempts to determine the best MI threshold without the need for a user-specified fine tuning parameter. The performance of the proposed algorithm is evaluated using both synthetic time series data sets and a biological time series data set (Saccharomyces cerevisiae). The results show that the proposed algorithm produced fewer false edges and significantly improved the precision when compared to the existing MDL algorithm.
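The role of the MI threshold can be made concrete with a minimal sketch that computes mutual information between discretized expression profiles and keeps the gene pairs above a cutoff. This is not the PMDL algorithm: the threshold here is a fixed, hypothetical value rather than one selected by description length, CMI is omitted, and the gene names and profiles are invented.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(x)
    count_x = Counter(x)
    count_y = Counter(y)
    count_xy = Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in count_xy.items():
        # p(a, b) * log2( p(a, b) / (p(a) * p(b)) )
        mi += (c / n) * log2(c * n / (count_x[a] * count_y[b]))
    return mi


def infer_edges(profiles, threshold):
    """Return gene pairs whose mutual information reaches the threshold."""
    genes = sorted(profiles)
    edges = []
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            if mutual_information(profiles[g], profiles[h]) >= threshold:
                edges.append((g, h))
    return edges


# Hypothetical binarized expression profiles over four time points.
profiles = {
    "g1": [0, 0, 1, 1],
    "g2": [0, 0, 1, 1],
    "g3": [0, 1, 0, 1],
}
edges = infer_edges(profiles, threshold=0.5)  # 0.5 is an arbitrary cutoff
print(edges)
```

Raising or lowering the cutoff adds or removes edges, which is why choosing it without a hand-tuned parameter is the problem the abstract addresses.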
Schmitz, Ulf; Naderi-Meshkin, Hojjat; Gupta, Shailendra K; Wolkenhauer, Olaf; Vera, Julio
2016-05-01
Evidence that RNAs are a functionally rich class of molecules predates the arrival of next-generation sequencing technology. Non-coding RNAs (ncRNA) could be the key to accelerated diagnosis and enhanced prediction of disease and therapy outcomes as well as the design of advanced therapeutic strategies to overcome yet unsatisfactory approaches. In this review, we discuss the state of the art in RNA systems biology with focus on the application in the systems biomedicine field. We propose guidelines for analysing the role of microRNAs and long non-coding RNAs in human pathologies. We introduce RNA expression profiling and network approaches for the identification of stable and effective RNomics-based biomarkers, providing insights into the role of ncRNAs in disease regulation. Towards this, we discuss ways to model the dynamics of gene regulatory networks and signalling pathways that involve ncRNAs. We also describe data resources and computational methods for finding putative mechanisms of action of ncRNAs. Finally, we discuss avenues for the computer-aided design of novel RNA-based therapeutics. © The Author 2015. Published by Oxford University Press. For Permissions, please email: journals.permissions@oup.com.
Gregoretti, Francesco; Belcastro, Vincenzo; di Bernardo, Diego; Oliva, Gennaro
2010-04-21
The reverse engineering of gene regulatory networks using gene expression profile data has become crucial to gain novel biological knowledge. Large amounts of data that need to be analyzed are currently being produced due to advances in microarray technologies. Using current reverse engineering algorithms to analyze large data sets can be very computationally intensive. These emerging computational requirements can be met using parallel computing techniques. It has been shown that the Network Identification by multiple Regression (NIR) algorithm performs better than the other ready-to-use reverse engineering software. However, it cannot be used with large networks with thousands of nodes--as is the case in biological networks--due to the high time and space complexity. In this work we overcome this limitation by designing and developing a parallel version of the NIR algorithm. The new implementation of the algorithm reaches a very good accuracy even for large gene networks, improving our understanding of the gene regulatory networks that is crucial for a wide range of biomedical applications.
Huang, Lin; Lange, Miles D.; Zhang, Zhixin
2014-01-01
VH replacement occurs through RAG-mediated secondary recombination between a rearranged VH gene and an upstream unrearranged VH gene. Due to the location of the cryptic recombination signal sequence (cRSS, TACTGTG) at the 3′ end of the VH gene coding region, a short stretch of nucleotides from the previous rearranged VH gene can be retained in the newly formed VH–DH junction as a “footprint” of VH replacement. Such footprints can be used as markers to identify Ig heavy chain (IgH) genes potentially generated through VH replacement. To explore the contribution of VH replacement products to the antibody repertoire, we developed a Java-based computer program, VH replacement footprint analyzer-I (VHRFA-I), to analyze published or newly obtained IgH genes from human or mouse. The VHRFA-I program has multiple functional modules: it first uses the service provided by the IMGT/V-QUEST program to assign potential VH, DH, and JH germline genes; then, it searches for VH replacement footprint motifs within the VH–DH junction (N1) regions of IgH gene sequences to identify potential VH replacement products; it can also analyze the frequencies of VH replacement products in correlation with publications, keywords, or VH, DH, and JH gene usages, and mutation status; it can further analyze the amino acid usages encoded by the identified VH replacement footprints. In summary, this program provides a useful computational tool for exploring the biological significance of VH replacement products in human and mouse. PMID:24575092
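The footprint search step can be sketched as plain substring matching. This is a simplified illustration and not the VHRFA-I implementation (which is Java-based and relies on IMGT/V-QUEST): the germline fragment, the N1 region, the minimum footprint length, and both function names are hypothetical. Only the cRSS motif TACTGTG comes from the abstract.

```python
CRSS = "TACTGTG"  # cryptic recombination signal sequence given in the abstract


def footprint_from_germline(vh_sequence, min_len=3):
    """Return the nucleotides downstream of the last cRSS, or None if absent."""
    pos = vh_sequence.rfind(CRSS)
    if pos == -1:
        return None
    tail = vh_sequence[pos + len(CRSS):]
    # min_len is an assumed cutoff to avoid trivially short motifs.
    return tail if len(tail) >= min_len else None


def find_footprints(n1_region, footprints):
    """Return (footprint, position) pairs for motifs found in the N1 region."""
    return [(fp, n1_region.find(fp)) for fp in footprints if fp in n1_region]


# Hypothetical 3' end of a germline VH gene and a hypothetical N1 region.
germline_vh = "GCCGTGTATTACTGTGCGAGAGA"
footprint = footprint_from_germline(germline_vh)
matches = find_footprints("GGCGAGAGATCC", [footprint])
print(footprint, matches)
```

A real analysis would also need to handle partial footprints, somatic mutation, and chance matches, none of which this sketch attempts.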
Statistical Analysis of Hurst Exponents of Essential/Nonessential Genes in 33 Bacterial Genomes
Liu, Xiao; Wang, Baojin; Xu, Luo
2015-01-01
Methods for identifying essential genes currently depend predominantly on biochemical experiments. However, there is demand for improved computational methods for determining gene essentiality. In this study, we used the Hurst exponent, a characteristic parameter to describe long-range correlation in DNA, and analyzed its distribution in 33 bacterial genomes. In most genomes (31 out of 33) the significance levels of the Hurst exponents of the essential genes were significantly higher than for the corresponding full-gene-set, whereas the significance levels of the Hurst exponents of the nonessential genes remained unchanged or increased only slightly. All of the Hurst exponents of essential genes followed a normal distribution, with one exception. We therefore propose that the distribution feature of Hurst exponents of essential genes can be used as a classification index for essential gene prediction in bacteria. For computer-aided design in the field of synthetic biology, this feature can build a restraint for pre- or post-design checking of bacterial essential genes. Moreover, considering the relationship between gene essentiality and evolution, the Hurst exponents could be used as a descriptive parameter related to evolutionary level, or be added to the annotation of each gene. PMID:26067107
Almathen, Faisal; Charruau, Pauline; Mohandesan, Elmira; Mwacharo, Joram M.; Orozco-terWengel, Pablo; Pitt, Daniel; Abdussamad, Abdussamad M.; Uerpmann, Margarethe; Uerpmann, Hans-Peter; De Cupere, Bea; Magee, Peter; Alnaqeeb, Majed A.; Salim, Bashir; Raziq, Abdul; Dessie, Tadelle; Abdelhadi, Omer M.; Banabazi, Mohammad H.; Al-Eknah, Marzook; Walzer, Chris; Faye, Bernard; Hofreiter, Michael; Peters, Joris; Hanotte, Olivier
2016-01-01
Dromedaries have been fundamental to the development of human societies in arid landscapes and for long-distance trade across hostile hot terrains for 3,000 y. Today they continue to be an important livestock resource in marginal agro-ecological zones. However, the history of dromedary domestication and the influence of ancient trading networks on their genetic structure have remained elusive. We combined ancient DNA sequences of wild and early-domesticated dromedary samples from arid regions with nuclear microsatellite and mitochondrial genotype information from 1,083 extant animals collected across the species’ range. We observe little phylogeographic signal in the modern population, indicative of extensive gene flow affecting virtually all regions except East Africa, where dromedary populations have remained relatively isolated. In agreement with archaeological findings, we identify wild dromedaries from the southeast Arabian Peninsula among the founders of the domestic dromedary gene pool. Approximate Bayesian computations further support the “restocking from the wild” hypothesis, with an initial domestication followed by introgression from individuals from wild, now-extinct populations. Compared with other livestock, which show a long history of gene flow with their wild ancestors, we find a high initial diversity relative to the native distribution of the wild ancestor on the Arabian Peninsula and to the brief coexistence of early-domesticated and wild individuals. This study also demonstrates the potential to retrieve ancient DNA sequences from osseous remains excavated in hot and dry desert environments. PMID:27162355
GPU-powered Shotgun Stochastic Search for Dirichlet process mixtures of Gaussian Graphical Models
Mukherjee, Chiranjit; Rodriguez, Abel
2016-01-01
Gaussian graphical models are popular for modeling high-dimensional multivariate data with sparse conditional dependencies. A mixture of Gaussian graphical models extends this model to the more realistic scenario where observations come from a heterogenous population composed of a small number of homogeneous sub-groups. In this paper we present a novel stochastic search algorithm for finding the posterior mode of high-dimensional Dirichlet process mixtures of decomposable Gaussian graphical models. Further, we investigate how to harness the massive thread-parallelization capabilities of graphical processing units to accelerate computation. The computational advantages of our algorithms are demonstrated with various simulated data examples in which we compare our stochastic search with a Markov chain Monte Carlo algorithm in moderate dimensional data examples. These experiments show that our stochastic search largely outperforms the Markov chain Monte Carlo algorithm in terms of computing-times and in terms of the quality of the posterior mode discovered. Finally, we analyze a gene expression dataset in which Markov chain Monte Carlo algorithms are too slow to be practically useful. PMID:28626348
Modeling the effect of nano-sized polymer particles on the properties of lipid membranes
NASA Astrophysics Data System (ADS)
Rossi, Giulia; Monticelli, Luca
2014-12-01
The interaction between polymers and biological membranes has recently gained significant interest in several research areas. On the biomedical side, dendrimers, linear polyelectrolytes, and neutral copolymers find application as drug and gene delivery agents, as biocidal agents, and as platforms for biological sensors. On the environmental side, plastic debris is often disposed of in the oceans and gets degraded into small particles; therefore concern is rising about the interaction of small plastic particles with living organisms. From both perspectives, it is crucial to understand the processes driving the interaction between polymers and cell membranes. In recent times progress in computer technology and simulation methods has allowed computational predictions on the molecular mechanism of interaction between polymeric materials and lipid membranes. Here we review the computational studies on the interaction between lipid membranes and different classes of polymers: dendrimers, linear charged polymers, polyethylene glycol (PEG) and its derivatives, polystyrene, and some generic models of polymer chains. We conclude by discussing some of the technical challenges in this area and future developments.
Lu, Songjian; Jin, Bo; Cowart, L Ashley; Lu, Xinghua
2013-01-01
Genetic and pharmacological perturbation experiments, such as deleting a gene and monitoring gene expression responses, are powerful tools for studying cellular signal transduction pathways. However, it remains a challenge to automatically derive knowledge of a cellular signaling system at a conceptual level from systematic perturbation-response data. In this study, we explored a framework that unifies knowledge mining and data mining towards the goal. The framework consists of the following automated processes: 1) applying an ontology-driven knowledge mining approach to identify functional modules among the genes responding to a perturbation in order to reveal potential signals affected by the perturbation; 2) applying a graph-based data mining approach to search for perturbations that affect a common signal; and 3) revealing the architecture of a signaling system by organizing signaling units into a hierarchy based on their relationships. Applying this framework to a compendium of yeast perturbation-response data, we have successfully recovered many well-known signal transduction pathways; in addition, our analysis has led to many new hypotheses regarding the yeast signal transduction system; finally, our analysis automatically organized perturbed genes as a graph reflecting the architecture of the yeast signaling system. Importantly, this framework transformed molecular findings from a gene level to a conceptual level, which can be readily translated into computable knowledge in the form of rules regarding the yeast signaling system, such as "if genes involved in the MAPK signaling are perturbed, genes involved in pheromone responses will be differentially expressed."
Applebaum, Mark A; Jha, Aashish R; Kao, Clara; Hernandez, Kyle M; DeWane, Gillian; Salwen, Helen R; Chlenski, Alexandre; Dobratic, Marija; Mariani, Christopher J; Godley, Lucy A; Prabhakar, Nanduri; White, Kevin; Stranger, Barbara E; Cohn, Susan L
2016-11-22
Neuroblastoma is notable for its broad spectrum of clinical behavior ranging from spontaneous regression to rapidly progressive disease. Hypoxia is well known to confer a more aggressive phenotype in neuroblastoma. We analyzed transcriptome data from diagnostic neuroblastoma tumors and hypoxic neuroblastoma cell lines to identify genes whose expression levels correlate with poor patient outcome and are involved in the hypoxia response. By integrating a diverse set of transcriptome datasets, including those from neuroblastoma patients and neuroblastoma derived cell lines, we identified nine genes (SLCO4A1, ENO1, HK2, PGK1, MTFP1, HILPDA, VKORC1, TPI1, and HIST1H1C) that are up-regulated in hypoxia and whose expression levels are correlated with poor patient outcome in three independent neuroblastoma cohorts. Analysis of 5-hydroxymethylcytosine and ENCODE data indicates that at least five of these nine genes have an increase in 5-hydroxymethylcytosine and a more open chromatin structure in hypoxia versus normoxia and are putative targets of hypoxia inducible factor (HIF) as they contain HIF binding sites in their regulatory regions. Four of these genes are key components of the glycolytic pathway and another three are directly involved in cellular metabolism. We experimentally validated our computational findings, demonstrating that seven of the nine genes are significantly up-regulated in response to hypoxia in the four neuroblastoma cell lines tested. This compact and robustly validated group of genes is associated with the hypoxia response in aggressive neuroblastoma and may represent a novel target for biomarker and therapeutic development.
Ben Abdallah, Emna; Folschette, Maxime; Roux, Olivier; Magnin, Morgan
2017-01-01
This paper addresses the problem of finding attractors in biological regulatory networks. We focus here on non-deterministic synchronous and asynchronous multi-valued networks, modeled using automata networks (AN). AN is a general and well-suited formalism to study complex interactions between different components (genes, proteins, ...). An attractor is a minimal trap domain, that is, a part of the state-transition graph that cannot be escaped. Such structures are terminal components of the dynamics and take the form of steady states (singleton) or complex compositions of cycles (non-singleton). Studying the effect of a disease or a mutation on an organism requires finding the attractors in the model to understand the long-term behaviors. We present a computational logical method based on answer set programming (ASP) to identify all attractors. Performed without any network reduction, the method can be applied on any dynamical semantics. In this paper, we present the two most widespread non-deterministic semantics: the asynchronous and the synchronous updating modes. The logical approach goes through a complete enumeration of the states of the network in order to find the attractors without the necessity to construct the whole state-transition graph. We carry out extensive computational experiments which show good performance and fit the expected theoretical results in the literature. The originality of our approach lies in the exhaustive enumeration of all possible (sets of) states verifying the properties of an attractor thanks to the use of ASP. Our method is applied to non-deterministic semantics in two different schemes (asynchronous and synchronous). The merits of our methods are illustrated by applying them to biological examples of various sizes and comparing the results with some existing approaches. It turns out that, on a desktop computer, our approach succeeds in exhaustively enumerating all existing attractors up to a given size (20 states) in a large model (100 components). This size is only limited by memory and computation time.
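The definition of an attractor as a minimal trap domain can be illustrated on a tiny network with a brute-force sketch. This is deliberately not the ASP method of the paper: it explores the state-transition graph explicitly (which the paper avoids), it handles only Boolean components rather than multi-valued automata, and it assumes the asynchronous updating mode. The toggle-switch rules and all function names are hypothetical.

```python
from itertools import product


def successors(state, rules):
    """Asynchronous semantics: each transition changes exactly one component."""
    nxt = set()
    for i, rule in enumerate(rules):
        new_value = rule(state)
        if new_value != state[i]:
            nxt.add(state[:i] + (new_value,) + state[i + 1:])
    return nxt


def reachable(start, rules):
    """All states reachable from `start`, including `start` itself."""
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for nxt in successors(current, rules):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def attractors(n, rules):
    """Enumerate minimal trap domains of an n-component Boolean network."""
    found = set()
    for state in product((0, 1), repeat=n):
        reach = reachable(state, rules)
        # A state lies in an attractor iff every state it can reach can return to it.
        if all(state in reachable(other, rules) for other in reach):
            found.add(frozenset(reach))
    return found


# Hypothetical toggle switch: each component is repressed by the other.
toggle = [lambda s: 1 - s[1], lambda s: 1 - s[0]]
print(attractors(2, toggle))
```

The explicit enumeration visits all 2^n states, so it only works for very small n; that cost is the motivation for constraint-based approaches such as the one in the abstract.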
A Computational Network Biology Approach to Uncover Novel Genes Related to Alzheimer's Disease.
Zanzoni, Andreas
2016-01-01
Recent advances in the fields of genetics and genomics have enabled the identification of numerous Alzheimer's disease (AD) candidate genes, although for many of them the role in AD pathophysiology has not been uncovered yet. Concomitantly, network biology studies have shown a strong link between protein network connectivity and disease. In this chapter I describe a computational approach that, by combining local and global network analysis strategies, allows the formulation of novel hypotheses on the molecular mechanisms involved in AD and prioritizes candidate genes for further functional studies.
Cell-specific prediction and application of drug-induced gene expression profiles.
Hodos, Rachel; Zhang, Ping; Lee, Hao-Chih; Duan, Qiaonan; Wang, Zichen; Clark, Neil R; Ma'ayan, Avi; Wang, Fei; Kidd, Brian; Hu, Jianying; Sontag, David; Dudley, Joel
2018-01-01
Gene expression profiling of in vitro drug perturbations is useful for many biomedical discovery applications including drug repurposing and elucidation of drug mechanisms. However, limited data availability across cell types has hindered our capacity to leverage or explore the cell-specificity of these perturbations. While recent efforts have generated a large number of drug perturbation profiles across a variety of human cell types, many gaps remain in this combinatorial drug-cell space. Hence, we asked whether it is possible to fill these gaps by predicting cell-specific drug perturbation profiles using available expression data from related conditions--i.e. from other drugs and cell types. We developed a computational framework that first arranges existing profiles into a three-dimensional array (or tensor) indexed by drugs, genes, and cell types, and then uses either local (nearest-neighbors) or global (tensor completion) information to predict unmeasured profiles. We evaluate prediction accuracy using a variety of metrics, and find that the two methods have complementary performance, each superior in different regions in the drug-cell space. Predictions achieve correlations of 0.68 with true values, and maintain accurate differentially expressed genes (AUC 0.81). Finally, we demonstrate that the predicted profiles add value for making downstream associations with drug targets and therapeutic classes.
Cell-specific prediction and application of drug-induced gene expression profiles
Hodos, Rachel; Zhang, Ping; Lee, Hao-Chih; Duan, Qiaonan; Wang, Zichen; Clark, Neil R.; Ma'ayan, Avi; Wang, Fei; Kidd, Brian; Hu, Jianying; Sontag, David
2017-01-01
Gene expression profiling of in vitro drug perturbations is useful for many biomedical discovery applications including drug repurposing and elucidation of drug mechanisms. However, limited data availability across cell types has hindered our capacity to leverage or explore the cell-specificity of these perturbations. While recent efforts have generated a large number of drug perturbation profiles across a variety of human cell types, many gaps remain in this combinatorial drug-cell space. Hence, we asked whether it is possible to fill these gaps by predicting cell-specific drug perturbation profiles using available expression data from related conditions--i.e. from other drugs and cell types. We developed a computational framework that first arranges existing profiles into a three-dimensional array (or tensor) indexed by drugs, genes, and cell types, and then uses either local (nearest-neighbors) or global (tensor completion) information to predict unmeasured profiles. We evaluate prediction accuracy using a variety of metrics, and find that the two methods have complementary performance, each superior in different regions in the drug-cell space. Predictions achieve correlations of 0.68 with true values, and maintain accurate differentially expressed genes (AUC 0.81). Finally, we demonstrate that the predicted profiles add value for making downstream associations with drug targets and therapeutic classes. PMID:29218867
A novel harmony search-K means hybrid algorithm for clustering gene expression data
Nazeer, KA Abdul; Sebastian, MP; Kumar, SD Madhu
2013-01-01
Recent progress in bioinformatics research has led to the accumulation of huge quantities of biological data at various data sources. The DNA microarray technology makes it possible to simultaneously analyze a large number of genes across different samples. Clustering of microarray data can reveal the hidden gene expression patterns from large quantities of expression data that in turn offers tremendous possibilities in functional genomics, comparative genomics, disease diagnosis and drug development. The k-means clustering algorithm is widely used for many practical applications. But the original k-means algorithm has several drawbacks. It is computationally expensive and generates locally optimal solutions based on the random choice of the initial centroids. Several methods have been proposed in the literature for improving the performance of the k-means algorithm. A meta-heuristic optimization algorithm named harmony search helps find out near-global optimal solutions by searching the entire solution space. Low clustering accuracy of the existing algorithms limits their use in many crucial applications of life sciences. In this paper we propose a novel Harmony Search-K means Hybrid (HSKH) algorithm for clustering the gene expression data. Experimental results show that the proposed algorithm produces clusters with better accuracy in comparison with the existing algorithms. PMID:23390351
del Val, Coral; Rivas, Elena; Torres-Quesada, Omar; Toro, Nicolás; Jiménez-Zurdo, José I
2007-01-01
Bacterial small non-coding RNAs (sRNAs) are being recognized as novel widespread regulators of gene expression in response to environmental signals. Here, we present the first search for sRNA-encoding genes in the nitrogen-fixing endosymbiont Sinorhizobium meliloti, performed by a genome-wide computational analysis of its intergenic regions. Comparative sequence data from eight related α-proteobacteria were obtained, and the interspecies pairwise alignments were scored with the programs eQRNA and RNAz as complementary predictive tools to identify conserved and stable secondary structures corresponding to putative non-coding RNAs. Northern experiments confirmed that eight of the predicted loci, selected among the original 32 candidates as most probable sRNA genes, expressed small transcripts. This result supports the combined use of eQRNA and RNAz as a robust strategy to identify novel sRNAs in bacteria. Furthermore, seven of the transcripts accumulated differentially in free-living and symbiotic conditions. Experimental mapping of the 5′-ends of the detected transcripts revealed that their encoding genes are organized in autonomous transcription units with recognizable promoter and, in most cases, termination signatures. These findings suggest novel regulatory functions for sRNAs related to the interactions of α-proteobacteria with their eukaryotic hosts. PMID:17971083
Yu, Yun; Degnan, James H.; Nakhleh, Luay
2012-01-01
Gene tree topologies have proven a powerful data source for various tasks, including species tree inference and species delimitation. Consequently, methods for computing probabilities of gene trees within species trees have been developed and widely used in probabilistic inference frameworks. All these methods assume an underlying multispecies coalescent model. However, when reticulate evolutionary events such as hybridization occur, these methods are inadequate, as they do not account for such events. Methods that account for both hybridization and deep coalescence in computing the probability of a gene tree topology currently exist for very limited cases. However, no such methods exist for general cases, owing primarily to the fact that it is currently unknown how to compute the probability of a gene tree topology within the branches of a phylogenetic network. Here we present a novel method for computing the probability of gene tree topologies on phylogenetic networks and demonstrate its application to the inference of hybridization in the presence of incomplete lineage sorting. We reanalyze a Saccharomyces species data set for which multiple analyses had converged on a species tree candidate. Using our method, though, we show that an evolutionary hypothesis involving hybridization in this group has better support than one of strict divergence. A similar reanalysis on a group of three Drosophila species shows that the data is consistent with hybridization. Further, using extensive simulation studies, we demonstrate the power of gene tree topologies at obtaining accurate estimates of branch lengths and hybridization probabilities of a given phylogenetic network. Finally, we discuss identifiability issues with detecting hybridization, particularly in cases that involve extinction or incomplete sampling of taxa. PMID:22536161
Zhang, Wensheng; Edwards, Andrea; Fan, Wei; Zhu, Dongxiao; Zhang, Kun
2010-06-22
Comparative analysis of gene expression profiling of multiple biological categories, such as different species of organisms or different kinds of tissue, promises to enhance the fundamental understanding of the universality as well as the specialization of mechanisms and related biological themes. Grouping genes with a similar expression pattern or exhibiting co-expression together is a starting point in understanding and analyzing gene expression data. In recent literature, gene module level analysis is advocated in order to understand biological network design and system behaviors in disease and life processes; however, practical difficulties often lie in the implementation of existing methods. Using the singular value decomposition (SVD) technique, we developed a new computational tool, named svdPPCS (SVD-based Pattern Pairing and Chart Splitting), to identify conserved and divergent co-expression modules of two sets of microarray experiments. In the proposed methods, gene modules are identified by splitting the two-way chart coordinated with a pair of left singular vectors factorized from the gene expression matrices of the two biological categories. Importantly, the cutoffs are determined by a data-driven algorithm using the well-defined statistic, SVD-p. The implementation was illustrated on two time series microarray data sets generated from the samples of accessory gland (ACG) and Malpighian tubule (MT) tissues of the W118 line of Drosophila melanogaster. Two conserved modules and six divergent modules, each of which has a unique characteristic profile across tissue kinds and aging processes, were identified. The number of genes contained in these modules ranged from five to a few hundred. Three to over a hundred GO terms were over-represented in individual modules with FDR < 0.1. One divergent module suggested the tissue-specific relationship between the expressions of mitochondrion-related genes and the aging process. 
This finding, together with others, may be of biological significance. The validity of the proposed SVD-based method was further verified by a simulation study, as well as the comparisons with regression analysis and cubic spline regression analysis plus PAM based clustering. svdPPCS is a novel computational tool for the comparative analysis of transcriptional profiling. It especially fits the comparison of time series data of related organisms or different tissues of the same organism under equivalent or similar experimental conditions. The general scheme can be directly extended to the comparisons of multiple data sets. It also can be applied to the integration of data sets from different platforms and of different sources.
Computational analyses of mammalian lactate dehydrogenases: human, mouse, opossum and platypus LDHs.
Holmes, Roger S; Goldberg, Erwin
2009-10-01
Computational methods were used to predict the amino acid sequences and gene locations for mammalian lactate dehydrogenase (LDH) genes and proteins using genome sequence databanks. Human LDHA, LDHC and LDH6A genes were located in tandem on chromosome 11, while LDH6B and LDH6C genes were on chromosomes 15 and 12, respectively. Opossum LDHC and LDH6B genes were located in tandem with the opossum LDHA gene on chromosome 5 and contained 7 (LDHA and LDHC) or 8 (LDH6B) exons. An amino acid sequence prediction for the opossum LDH6B subunit gave an extended N-terminal sequence, similar to the human and mouse LDH6B sequences, which may support the export of this enzyme into mitochondria. The platypus genome contained at least 3 LDH genes encoding LDHA, LDHB and LDH6B subunits. Phylogenetic studies and sequence analyses indicated that LDHA, LDHB and LDH6B genes are present in all mammalian genomes examined, including a monotreme species (platypus), whereas the LDHC gene may have arisen more recently in marsupial mammals.
Computational analyses of mammalian lactate dehydrogenases: human, mouse, opossum and platypus LDHs
Holmes, Roger S; Goldberg, Erwin
2009-01-01
Computational methods were used to predict the amino acid sequences and gene locations for mammalian lactate dehydrogenase (LDH) genes and proteins using genome sequence databanks. Human LDHA, LDHC and LDH6A genes were located in tandem on chromosome 11, while LDH6B and LDH6C genes were on chromosomes 15 and 12, respectively. Opossum LDHC and LDH6B genes were located in tandem with the opossum LDHA gene on chromosome 5 and contained 7 (LDHA and LDHC) or 8 (LDH6B) exons. An amino acid sequence prediction for the opossum LDH6B subunit gave an extended N-terminal sequence, similar to the human and mouse LDH6B sequences, which may support the export of this enzyme into mitochondria. The platypus genome contained at least 3 LDH genes encoding LDHA, LDHB and LDH6B subunits. Phylogenetic studies and sequence analyses indicated that LDHA, LDHB and LDH6B genes are present in all mammalian genomes examined, including a monotreme species (platypus), whereas the LDHC gene may have arisen more recently in marsupial mammals. PMID:19679512
Probabilistic representation of gene regulatory networks.
Mao, Linyong; Resat, Haluk
2004-09-22
Recent experiments have established unambiguously that biological systems can have significant cell-to-cell variations in gene expression levels even in isogenic populations. Computational approaches to studying gene expression in cellular systems should capture such biological variations for a more realistic representation. In this paper, we present a new fully probabilistic approach to the modeling of gene regulatory networks that allows for fluctuations in the gene expression levels. The new algorithm uses a very simple representation for the genes, and accounts simultaneously for the repression or induction of the genes and for the biological variations among isogenic populations. Because of its simplicity, the introduced algorithm is a very promising approach for modeling large-scale gene regulatory networks. We have tested the new algorithm on a recently bioengineered library of synthetic gene networks. The good agreement between the computed and the experimental results for this library of networks, and additional tests, demonstrate that the new algorithm is robust and very successful in explaining the experimental data. The simulation software is available upon request. Supplementary material will be made available on the OUP server.
Comparative transcriptomics reveals similarities and differences between astrocytoma grades.
Seifert, Michael; Garbe, Martin; Friedrich, Betty; Mittelbronn, Michel; Klink, Barbara
2015-12-16
Astrocytomas are the most common primary brain tumors distinguished into four histological grades. Molecular analyses of individual astrocytoma grades have revealed detailed insights into genetic, transcriptomic and epigenetic alterations. This provides an excellent basis to identify similarities and differences between astrocytoma grades. We utilized public omics data of all four astrocytoma grades focusing on pilocytic astrocytomas (PA I), diffuse astrocytomas (AS II), anaplastic astrocytomas (AS III) and glioblastomas (GBM IV) to identify similarities and differences using well-established bioinformatics and systems biology approaches. We further validated the expression and localization of Ang2 involved in angiogenesis using immunohistochemistry. Our analyses show similarities and differences between astrocytoma grades at the level of individual genes, signaling pathways and regulatory networks. We identified many differentially expressed genes that were either exclusively observed in a specific astrocytoma grade or commonly affected in specific subsets of astrocytoma grades in comparison to normal brain. Further, the number of differentially expressed genes generally increased with the astrocytoma grade with one major exception. The cytokine receptor pathway showed nearly the same number of differentially expressed genes in PA I and GBM IV and was further characterized by a significant overlap of commonly altered genes and an exclusive enrichment of overexpressed cancer genes in GBM IV. Additional analyses revealed a strong exclusive overexpression of CX3CL1 (fractalkine) and its receptor CX3CR1 in PA I possibly contributing to the absence of invasive growth. We further found that PA I was significantly associated with the mesenchymal subtype typically observed for very aggressive GBM IV. 
Expression of endothelial and mesenchymal markers (ANGPT2, CHI3L1) indicated a stronger contribution of the micro-environment to the manifestation of the mesenchymal subtype than the tumor biology itself. We further inferred a transcriptional regulatory network associated with specific expression differences distinguishing PA I from AS II, AS III and GBM IV. Major central transcriptional regulators were involved in brain development, cell cycle control, proliferation, apoptosis, chromatin remodeling or DNA methylation. Many of these regulators showed directly underlying DNA methylation changes in PA I or gene copy number mutations in AS II, AS III and GBM IV. This computational study characterizes similarities and differences between all four astrocytoma grades confirming known and revealing novel insights into astrocytoma biology. Our findings represent a valuable resource for future computational and experimental studies.
Liscovitch, Noa; Bazak, Lily; Levanon, Erez Y; Chechik, Gal
2014-01-01
A-to-I RNA editing by adenosine deaminases acting on RNA is a post-transcriptional modification that is crucial for normal life and development in vertebrates. RNA editing has been shown to be very abundant in the human transcriptome, specifically at the primate-specific Alu elements. The functional role of this wide-spread effect is still not clear; it is believed that editing of transcripts is a mechanism for their down-regulation via processes such as nuclear retention or RNA degradation. Here we combine 2 neural gene expression datasets with genome-level editing information to examine the relation between the expression of ADAR genes with the expression of their target genes. Specifically, we computed the spatial correlation across structures of post-mortem human brains between ADAR and a large set of targets that were found to be edited in their Alu repeats. Surprisingly, we found that a large fraction of the edited genes are positively correlated with ADAR, opposing the assumption that editing would reduce expression. When considering the correlations between ADAR and its targets over development, 2 gene subsets emerge, positively correlated and negatively correlated with ADAR expression. Specifically, in embryonic time points, ADAR is positively correlated with many genes related to RNA processing and regulation of gene expression. These findings imply that the suggested mechanism of regulation of expression by editing is probably not a global one; ADAR expression does not have a genome wide effect reducing the expression of editing targets. It is possible, however, that RNA editing by ADAR in non-coding regions of the gene might be a part of a more complex expression regulation mechanism. PMID:25692240
On the Interplay of Telomeres, Nevi and the Risk of Melanoma
Bodelon, Clara; Pfeiffer, Ruth M.; Bollati, Valentina; Debbache, Julien; Calista, Donato; Ghiorzo, Paola; Fargnoli, Maria Concetta; Bianchi-Scarra, Giovanna; Peris, Ketty; Hoxha, Mirjam; Hutchinson, Amy; Burdette, Laurie; Burke, Laura; Fang, Shenying; Tucker, Margaret A.; Goldstein, Alisa M.; Lee, Jeffrey E.; Wei, Qingyi; Savage, Sharon A.; Yang, Xiaohong R.; Amos, Christopher; Landi, Maria Teresa
2012-01-01
The relationship between telomeres, nevi and melanoma is complex. Shorter telomeres have been found to be associated with many cancers and with number of nevi, a known risk factor for melanoma. However, shorter telomeres have also been found to decrease melanoma risk. We performed a systematic analysis of telomere-related genes and tagSNPs within these genes, in relation to the risk of melanoma, dysplastic nevi, and nevus count, combining data from four studies conducted in Italy. In addition, we examined whether telomere length measured in peripheral blood leukocytes is related to the risk of melanoma, dysplastic nevi, number of nevi, or telomere-related SNPs. A total of 796 cases and 770 controls were genotyped for 517 SNPs in 39 telomere-related genes using a custom-made array. Replication of the top SNPs was conducted in two American populations consisting of 488 subjects from 53 melanoma-prone families and 1,086 cases and 1,024 controls from a case-control study. We estimated odds ratios for associations with SNPs and combined SNP P-values to compute gene region-specific, functional group-specific, and overall P-values using an adaptive rank-truncated product algorithm. In the Mediterranean population, we found suggestive evidence that RECQL4, a gene involved in genome stability, RTEL1, a gene regulating telomere elongation, and TERF2, a gene implicated in the protection of telomeres, were associated with melanoma, the presence of dysplastic nevi and number of nevi, respectively. However, these associations were not found in the American samples, suggesting variable melanoma susceptibility for these genes across populations or chance findings in our discovery sample. Larger studies across different populations are necessary to clarify these associations. PMID:23300679
Zheng, Guangyong; Xu, Yaochen; Zhang, Xiujun; Liu, Zhi-Ping; Wang, Zhuo; Chen, Luonan; Zhu, Xin-Guang
2016-12-23
A gene regulatory network (GRN) represents interactions of genes inside a cell or tissue, in which vertices and edges stand for genes and their regulatory interactions, respectively. Reconstruction of gene regulatory networks, in particular genome-scale networks, is essential for comparative exploration of different species and mechanistic investigation of biological processes. Currently, most network inference methods are computationally intensive; they are usually effective for small-scale tasks (e.g., networks with a few hundred genes) but struggle to construct GRNs at genome scale. Here, we present a software package for gene regulatory network reconstruction at a genomic level, in which gene interaction is measured by conditional mutual information using a parallel computing framework (so the package is named CMIP). The package is a greatly improved implementation of our previous PCA-CMI algorithm. In CMIP, we provide not only an automatic threshold determination method but also an effective parallel computing framework for network inference. Performance tests on benchmark datasets show that the accuracy of CMIP is comparable to that of most current network inference methods. Moreover, running tests on synthetic datasets demonstrate that CMIP can handle large datasets, especially genome-wide datasets, within an acceptable time period. In addition, successful application to a real genomic dataset confirms the practical applicability of the package. This new software package provides a powerful tool for genomic network reconstruction to the biological community. The software can be accessed at http://www.picb.ac.cn/CMIP/ .
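The idea of scoring a gene pair by conditional mutual information can be sketched as follows. This is not CMIP code: it assumes jointly Gaussian expression values, for which first-order conditional mutual information reduces to a function of the partial correlation, and the function names and simulated data are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def gaussian_mi(x, y):
    """Mutual information I(X;Y) in nats, assuming a bivariate Gaussian."""
    r = pearson(x, y)
    return -0.5 * math.log(1.0 - r * r)


def gaussian_cmi(x, y, z):
    """Conditional mutual information I(X;Y|Z) in nats, assuming joint
    Gaussianity, computed from the partial correlation of X and Y given Z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    partial = (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))
    return -0.5 * math.log(1.0 - partial * partial)


# Simulated example: gene Z drives both X and Y; X and Y never interact directly.
rng = random.Random(0)
z = [rng.gauss(0, 1) for _ in range(2000)]
x = [zi + 0.5 * rng.gauss(0, 1) for zi in z]
y = [zi + 0.5 * rng.gauss(0, 1) for zi in z]

mi = gaussian_mi(x, y)        # large: X and Y look associated
cmi = gaussian_cmi(x, y, z)   # near zero: the association is explained by Z
print(round(mi, 3), round(cmi, 4))
```

In a network inference setting, an edge whose conditional mutual information falls below a chosen threshold would be dropped as indirect. How that threshold is set automatically in CMIP is not described in the abstract and is not modeled here.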
Wang, Guohua; Wang, Fang; Huang, Qian; Li, Yu; Liu, Yunlong; Wang, Yadong
2015-01-01
Transcription factors are proteins that bind to DNA sequences to regulate gene transcription. Transcription factor binding sites are short DNA sequences (5-20 bp long) specifically bound by one or more transcription factors. The identification of transcription factor binding sites and the prediction of their function continue to be challenging problems in computational biology. In this study, the transcription factor binding sites in gene regulatory regions are identified by integrating DNase I hypersensitive sites with known position weight matrices in the TRANSFAC database. Based on the global gene expression patterns in cervical cancer HeLaS3 cells and HelaS3-ifnα4h cells (interferon treatment of HeLaS3 cells for 4 hours), we present a model-based computational approach to predict a set of transcription factors that potentially cause such differential gene expression. Significantly, 6 out of 10 predicted functional factors (IRF, IRF-2, IRF-9, IRF-1, IRF-3 and ICSBP) belong to the interferon regulatory factor family and upregulate gene expression levels in response to the interferon treatment. Another factor, ISGF-3, is also a transcriptional activator induced by interferon alpha. The prediction results of our model remain consistent under different selection criteria for transcription factor binding sites. Our model demonstrates the potential to computationally identify the functional transcription factors in gene regulation.
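Scanning a sequence with a position weight matrix, the building block mentioned in the abstract above, can be sketched in a few lines. The count matrix, sequence, pseudocount, and threshold below are hypothetical toy values rather than TRANSFAC data, and the sketch omits the restriction to DNase I hypersensitive sites that the study applies.

```python
import math

BASES = "ACGT"


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2 odds scores against a
    uniform background."""
    matrix = []
    for column in counts:
        total = sum(column[b] for b in BASES) + pseudocount * len(BASES)
        matrix.append({b: math.log2((column[b] + pseudocount) / total / background)
                       for b in BASES})
    return matrix


def scan(sequence, matrix, threshold):
    """Return (start, score) for every window scoring at or above threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits


# Hypothetical counts from ten aligned sites of a 4-bp motif resembling TGAC.
counts = [
    {"A": 1, "C": 0, "G": 1, "T": 8},
    {"A": 0, "C": 1, "G": 9, "T": 0},
    {"A": 9, "C": 0, "G": 0, "T": 1},
    {"A": 0, "C": 9, "G": 1, "T": 0},
]
pwm = log_odds_matrix(counts)
hits = scan("AATGACGGTTGACA", pwm, threshold=5.0)
print([start for start, _ in hits])
```

The threshold trades sensitivity for specificity. In this toy case only the two exact occurrences of the consensus pass it.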
SynFind: Compiling Syntenic Regions across Any Set of Genomes on Demand.
Tang, Haibao; Bomhoff, Matthew D; Briones, Evan; Zhang, Liangsheng; Schnable, James C; Lyons, Eric
2015-11-11
The identification of conserved syntenic regions enables discovery of predicted locations for orthologous and homeologous genes, even when no such gene is present. This capability means that synteny-based methods are far more effective than sequence similarity-based methods in identifying true-negatives, a necessity for studying gene loss and gene transposition. However, the identification of syntenic regions requires complex analyses which must be repeated for pairwise comparisons between any two species. Therefore, as the number of published genomes increases, there is a growing demand for scalable, simple-to-use applications to perform comparative genomic analyses that cater to both gene family studies and genome-scale studies. We implemented SynFind, a web-based tool that addresses this need. Given one query genome, SynFind is capable of identifying conserved syntenic regions in any set of target genomes. SynFind is capable of reporting per-gene information, useful for researchers studying specific gene families, as well as genome-wide data sets of syntenic gene and predicted gene locations, critical for researchers focused on large-scale genomic analyses. Inference of syntenic homologs provides the basis for correlation of functional changes around genes of interest between related organisms. Deployed on the CoGe online platform, SynFind is connected to the genomic data from over 15,000 organisms from all domains of life and supports multiple releases of the same organism. SynFind makes use of a powerful job execution framework that promises scalability and reproducibility. SynFind can be accessed at http://genomevolution.org/CoGe/SynFind.pl. A video tutorial of SynFind using Phytophthora as an example is available at http://www.youtube.com/watch?v=2Agczny9Nyc. © The Author(s) 2015. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
Gui, Jiang; Andrew, Angeline S.; Andrews, Peter; Nelson, Heather M.; Kelsey, Karl T.; Karagas, Margaret R.; Moore, Jason H.
2010-01-01
Epistasis or gene-gene interaction is a fundamental component of the genetic architecture of complex traits such as disease susceptibility. Multifactor dimensionality reduction (MDR) was developed as a nonparametric and model-free method to detect epistasis when there are no significant marginal genetic effects. However, in many studies of complex disease, other covariates like age of onset and smoking status could have a strong main effect and may potentially interfere with MDR's ability to achieve its goal. In this paper, we present a simple and computationally efficient sampling method to adjust for covariate effects in MDR. We use simulation to show that after adjustment, MDR has sufficient power to detect true gene-gene interactions. We also compare our method with the state-of-the-art technique in covariate adjustment. The results suggest that our proposed method performs similarly, but is more computationally efficient. We then apply this new method to an analysis of a population-based bladder cancer study in New Hampshire. PMID:20924193
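The core pooling step of MDR can be sketched on a toy example. This shows only the basic idea (labeling each multilocus genotype cell high-risk or low-risk by its case:control ratio, then scoring the resulting one-dimensional classifier), without cross-validation and without the covariate-adjustment sampling proposed in the paper. The data, function names, and tie rule are assumptions.

```python
from collections import Counter


def mdr_high_risk_cells(genotypes, status, threshold=1.0):
    """Label a genotype combination high-risk when its cases number at least
    `threshold` times its controls (ties count as high-risk by assumption).

    genotypes: list of tuples of genotype codes; status: 1 = case, 0 = control.
    """
    cases, controls = Counter(), Counter()
    for cell, s in zip(genotypes, status):
        (cases if s == 1 else controls)[cell] += 1
    return {cell for cell in set(cases) | set(controls)
            if cases[cell] >= threshold * controls[cell]}


def balanced_accuracy(genotypes, status, high):
    """Mean of sensitivity and specificity of the high-risk/low-risk rule."""
    tp = sum(1 for cell, s in zip(genotypes, status) if s == 1 and cell in high)
    tn = sum(1 for cell, s in zip(genotypes, status) if s == 0 and cell not in high)
    n_cases = sum(status)
    n_controls = len(status) - n_cases
    return 0.5 * (tp / n_cases + tn / n_controls)


# Hypothetical XOR-like interaction between two biallelic markers (coded 0/1):
# neither marker alone separates cases from controls, but the pair does.
case_cells = [(0, 1)] * 4 + [(1, 0)] * 4 + [(0, 0)] + [(1, 1)]
control_cells = [(0, 0)] * 4 + [(1, 1)] * 4 + [(0, 1)] + [(1, 0)]
genotypes = case_cells + control_cells
status = [1] * len(case_cells) + [0] * len(control_cells)

high = mdr_high_risk_cells(genotypes, status)
acc = balanced_accuracy(genotypes, status, high)
print(sorted(high), acc)
```

Each marker taken alone shows five cases and five controls per genotype, so a single-marker test would see no effect, while the pooled two-marker rule classifies most subjects correctly.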
MAGMA: Generalized Gene-Set Analysis of GWAS Data
de Leeuw, Christiaan A.; Mooij, Joris M.; Heskes, Tom; Posthuma, Danielle
2015-01-01
By aggregating data for complex traits in a biologically meaningful way, gene and gene-set analysis constitute a valuable addition to single-marker analysis. However, although various methods for gene and gene-set analysis currently exist, they generally suffer from a number of issues. Statistical power for most methods is strongly affected by linkage disequilibrium between markers, multi-marker associations are often hard to detect, and the reliance on permutation to compute p-values tends to make the analysis computationally very expensive. To address these issues we have developed MAGMA, a novel tool for gene and gene-set analysis. The gene analysis is based on a multiple regression model, to provide better statistical performance. The gene-set analysis is built as a separate layer around the gene analysis for additional flexibility. This gene-set analysis also uses a regression structure to allow generalization to analysis of continuous properties of genes and simultaneous analysis of multiple gene sets and other gene properties. Simulations and an analysis of Crohn’s Disease data are used to evaluate the performance of MAGMA and to compare it to a number of other gene and gene-set analysis tools. The results show that MAGMA has significantly more power than other tools for both the gene and the gene-set analysis, identifying more genes and gene sets associated with Crohn’s Disease while maintaining a correct type 1 error rate. Moreover, the MAGMA analysis of the Crohn’s Disease data was found to be considerably faster as well. PMID:25885710
MAGMA: generalized gene-set analysis of GWAS data.
de Leeuw, Christiaan A; Mooij, Joris M; Heskes, Tom; Posthuma, Danielle
2015-04-01
By aggregating data for complex traits in a biologically meaningful way, gene and gene-set analysis constitute a valuable addition to single-marker analysis. However, although various methods for gene and gene-set analysis currently exist, they generally suffer from a number of issues. Statistical power for most methods is strongly affected by linkage disequilibrium between markers, multi-marker associations are often hard to detect, and the reliance on permutation to compute p-values tends to make the analysis computationally very expensive. To address these issues we have developed MAGMA, a novel tool for gene and gene-set analysis. The gene analysis is based on a multiple regression model, to provide better statistical performance. The gene-set analysis is built as a separate layer around the gene analysis for additional flexibility. This gene-set analysis also uses a regression structure to allow generalization to analysis of continuous properties of genes and simultaneous analysis of multiple gene sets and other gene properties. Simulations and an analysis of Crohn's Disease data are used to evaluate the performance of MAGMA and to compare it to a number of other gene and gene-set analysis tools. The results show that MAGMA has significantly more power than other tools for both the gene and the gene-set analysis, identifying more genes and gene sets associated with Crohn's Disease while maintaining a correct type 1 error rate. Moreover, the MAGMA analysis of the Crohn's Disease data was found to be considerably faster as well.
Reverse Genetics and High Throughput Sequencing Methodologies for Plant Functional Genomics
Ben-Amar, Anis; Daldoul, Samia; Reustle, Götz M.; Krczal, Gabriele; Mliki, Ahmed
2016-01-01
In the post-genomic era, increasingly sophisticated genetic tools are being developed with the long-term goal of understanding how the coordinated activity of genes gives rise to a complex organism. With the advent of next generation sequencing combined with effective computational approaches, a wide variety of plant species have been fully sequenced, yielding a wealth of sequence data on the structure and organization of plant genomes. Since thousands of gene sequences are already known, recently developed functional genomics approaches provide powerful tools to analyze plant gene functions through various gene manipulation technologies. Integration of different omics platforms along with gene annotation and computational analysis may provide a complete view at the systems biology level. Extensive investigations of reverse genetics methodologies have been deployed for assigning biological function to a specific gene or gene product. We provide here an updated overview of these high-throughput strategies, highlighting recent advances in the knowledge of functional genomics in plants. PMID:28217003
FlpStop, a tool for conditional gene control in Drosophila
Fisher, Yvette E; Yang, Helen H; Isaacman-Beck, Jesse; Xie, Marjorie; Gohl, Daryl M; Clandinin, Thomas R
2017-01-01
Manipulating gene function cell type-specifically is a common experimental goal in Drosophila research and has been central to studies of neural development, circuit computation, and behavior. However, current cell type-specific gene disruption techniques in flies often reduce gene activity incompletely or rely on cell division. Here we describe FlpStop, a generalizable tool for conditional gene disruption and rescue in post-mitotic cells. In proof-of-principle experiments, we manipulated apterous, a regulator of wing development. Next, we produced conditional null alleles of Glutamic acid decarboxylase 1 (Gad1) and Resistant to dieldrin (Rdl), genes vital for GABAergic neurotransmission, as well as cacophony (cac) and paralytic (para), voltage-gated ion channels central to neuronal excitability. To demonstrate the utility of this approach, we manipulated cac in a specific visual interneuron type and discovered differential regulation of calcium signals across subcellular compartments. Thus, FlpStop will facilitate investigations into the interactions between genes, circuits, and computation. DOI: http://dx.doi.org/10.7554/eLife.22279.001 PMID:28211790
Systems Biology-Based Identification of Mycobacterium tuberculosis Persistence Genes in Mouse Lungs
Dutta, Noton K.; Bandyopadhyay, Nirmalya; Veeramani, Balaji; Lamichhane, Gyanu; Karakousis, Petros C.; Bader, Joel S.
2014-01-01
Identifying Mycobacterium tuberculosis persistence genes is important for developing novel drugs to shorten the duration of tuberculosis (TB) treatment. We developed computational algorithms that predict M. tuberculosis genes required for long-term survival in mouse lungs. As the input, we used high-throughput M. tuberculosis mutant library screen data, mycobacterial global transcriptional profiles in mice and macrophages, and functional interaction networks. We selected 57 unique, genetically defined mutants (18 previously tested and 39 untested) to assess the predictive power of this approach in the murine model of TB infection. We observed a 6-fold enrichment in the predicted set of M. tuberculosis genes required for persistence in mouse lungs relative to randomly selected mutant pools. Our results also allowed us to reclassify several genes as required for M. tuberculosis persistence in vivo. Finally, the new results implicated additional high-priority candidate genes for testing. Experimental validation of computational predictions demonstrates the power of this systems biology approach for elucidating M. tuberculosis persistence genes. PMID:24549847
García-Calvo, Raúl; Guisado, JL; Diaz-del-Rio, Fernando; Córdoba, Antonio; Jiménez-Morales, Francisco
2018-01-01
Understanding the regulation of gene expression is one of the key problems in current biology. A promising method for that purpose is the determination of the temporal dynamics between known initial and ending network states by using simple acting rules. The huge number of rule combinations and the inherently nonlinear nature of the problem make genetic algorithms an excellent candidate for finding optimal solutions. As this is a computationally intensive problem that needs long runtimes on conventional architectures for realistic network sizes, it is fundamental to accelerate this task. In this article, we study how to develop efficient parallel implementations of this method for the fine-grained parallel architecture of graphics processing units (GPUs) using the compute unified device architecture (CUDA) platform. An exhaustive and methodical study of various parallel genetic algorithm schemes (master-slave, island, cellular, and hybrid models) and various individual selection methods (roulette, elitist) is carried out for this problem. Several procedures that optimize the use of the GPU's resources are presented. We conclude that the implementation that produces the best results (from both the performance and the genetic algorithm fitness perspectives) simulates a few thousand individuals grouped in a few islands using elitist selection. This model combines two powerful factors for discovering the best solutions: finding good individuals in a small number of generations, and introducing genetic diversity via relatively frequent and numerous migrations. As a result, we have even found the optimal solution for the analyzed gene regulatory network (GRN). In addition, a comparative study of the performance obtained by the different parallel implementations on GPU versus a sequential application on CPU is carried out. In our tests, a multifold speedup was obtained for our optimized parallel implementation of the method on a medium-class GPU over an equivalent sequential single-core implementation running on a recent Intel i7 CPU. This work can provide useful guidance to researchers in biology, medicine, or bioinformatics on how to take advantage of parallelization on massively parallel devices and GPUs to apply novel nature-inspired metaheuristic algorithms to real-world applications (like the method to solve the temporal dynamics of GRNs). PMID:29662297
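The island scheme with elitist selection and periodic migration described in the abstract above can be illustrated with a small CPU-only sketch. This is not the authors' CUDA implementation: the function name `evolve_islands`, the bit-string genome, the ring migration topology, and all parameter defaults are assumptions chosen for illustration, and the fitness function is a placeholder supplied by the caller.

```python
import random

def evolve_islands(fitness, genome_len, n_islands=4, island_size=20,
                   generations=30, migrate_every=5, n_migrants=2, seed=0):
    """Toy island-model genetic algorithm on bit strings (hypothetical sketch).

    Each island keeps its best half (elitist selection), refills itself with
    mutated crossover children, and every `migrate_every` generations sends
    copies of its best individuals to the next island in a ring.
    """
    rng = random.Random(seed)
    islands = [[[rng.randint(0, 1) for _ in range(genome_len)]
                for _ in range(island_size)] for _ in range(n_islands)]
    for gen in range(1, generations + 1):
        for k, pop in enumerate(islands):
            pop.sort(key=fitness, reverse=True)
            elite = pop[: island_size // 2]            # elitist selection
            children = []
            while len(elite) + len(children) < island_size:
                a, b = rng.sample(elite, 2)
                cut = rng.randrange(1, genome_len)     # one-point crossover
                child = a[:cut] + b[cut:]
                child[rng.randrange(genome_len)] ^= 1  # point mutation
                children.append(child)
            islands[k] = elite + children
        if gen % migrate_every == 0:
            # Collect emigrants first so every island exports its own best.
            bests = [sorted(pop, key=fitness, reverse=True)[:n_migrants]
                     for pop in islands]
            for k, pop in enumerate(islands):
                pop.sort(key=fitness, reverse=True)
                incoming = bests[(k - 1) % n_islands]
                pop[-n_migrants:] = [ind[:] for ind in incoming]  # replace worst
    return max((ind for pop in islands for ind in pop), key=fitness)
```

For example, `evolve_islands(sum, 16)` maximizes the number of ones in a 16-bit string. On a GPU, each island would typically map to a block of threads, but that mapping is outside the scope of this sketch.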
Modeling Bi-modality Improves Characterization of Cell Cycle on Gene Expression in Single Cells
Danaher, Patrick; Finak, Greg; Krouse, Michael; Wang, Alice; Webster, Philippa; Beechem, Joseph; Gottardo, Raphael
2014-01-01
Advances in high-throughput, single cell gene expression are allowing interrogation of cell heterogeneity. However, there is concern that the cell cycle phase of a cell might bias characterizations of gene expression at the single-cell level. We assess the effect of cell cycle phase on gene expression in single cells by measuring 333 genes in 930 cells across three phases and three cell lines. We determine each cell's phase non-invasively without chemical arrest and use it as a covariate in tests of differential expression. We observe bi-modal gene expression, a previously-described phenomenon, wherein the expression of otherwise abundant genes is either strongly positive, or undetectable within individual cells. This bi-modality is likely both biologically and technically driven. Irrespective of its source, we show that it should be modeled to draw accurate inferences from single cell expression experiments. To this end, we propose a semi-continuous modeling framework based on the generalized linear model, and use it to characterize genes with consistent cell cycle effects across three cell lines. Our new computational framework improves the detection of previously characterized cell-cycle genes compared to approaches that do not account for the bi-modality of single-cell data. We use our semi-continuous modelling framework to estimate single cell gene co-expression networks. These networks suggest that in addition to having phase-dependent shifts in expression (when averaged over many cells), some, but not all, canonical cell cycle genes tend to be co-expressed in groups in single cells. We estimate the amount of single cell expression variability attributable to the cell cycle. We find that the cell cycle explains only 5%–17% of expression variability, suggesting that the cell cycle will not tend to be a large nuisance factor in analysis of the single cell transcriptome. PMID:25032992
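The idea of a semi-continuous model for bi-modal expression can be sketched as a two-part likelihood: one part for whether a gene is detected in a cell, and one part for its expression level when detected. The code below is a minimal intercept-only sketch under stated assumptions, not the authors' generalized linear model framework: it has no covariates such as cell cycle phase, it treats a value of 0 as "undetected", and it assumes detected values are roughly Gaussian on the scale supplied. The function names are hypothetical.

```python
import math

def fit_hurdle(values):
    """Fit an intercept-only two-part model (hypothetical sketch).

    Returns (p_detect, mu, sigma): the detection rate, and the mean and
    standard deviation of the detected (positive) values. Requires at least
    one detected and one undetected value for the likelihood to be finite.
    """
    positives = [v for v in values if v > 0]
    p_detect = len(positives) / len(values)
    mu = sum(positives) / len(positives)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in positives) / len(positives))
    return p_detect, mu, sigma

def hurdle_loglik(values, p_detect, mu, sigma):
    """Log-likelihood: Bernoulli detection part plus Gaussian level part."""
    total = 0.0
    for v in values:
        if v > 0:
            z = (v - mu) / sigma
            total += (math.log(p_detect) - 0.5 * z * z
                      - math.log(sigma * math.sqrt(2.0 * math.pi)))
        else:
            total += math.log(1.0 - p_detect)
    return total
```

Comparing such log-likelihoods between groups of cells is one way to test for differential expression while accounting for both the detection rate and the expression level.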
Unique cerebrovascular anomalies in Noonan syndrome with RAF1 mutation.
Zarate, Yuri A; Lichty, Angie W; Champion, Kristen J; Clarkson, L Kate; Holden, Kenton R; Matheus, M Gisele
2014-08-01
Noonan syndrome is a common autosomal dominant neurodevelopmental disorder caused by gain-of-function germline mutations affecting components of the Ras-MAPK pathway. The authors present the case of a 6-year-old male with Noonan syndrome, Chiari malformation type I, shunted benign external hydrocephalus in infancy, and unique cerebrovascular changes. A de novo heterozygous change in the RAF1 gene was identified. The patient underwent brain magnetic resonance imaging, computed tomography angiography, and magnetic resonance angiography to further clarify the nature of his abnormal brain vasculature. The authors compared his findings to the few cases of Noonan syndrome reported with cerebrovascular pathology. © The Author(s) 2013.
Yang, Xinan Holly; Li, Meiyi; Wang, Bin; Zhu, Wanqi; Desgardin, Aurelie; Onel, Kenan; de Jong, Jill; Chen, Jianjun; Chen, Luonan; Cunningham, John M
2015-03-24
Genes that regulate stem cell function are suspected to exert adverse effects on prognosis in malignancy. However, diverse cancer stem cell signatures are difficult for physicians to interpret and apply clinically. To connect the transcriptome and stem cell biology, with potential clinical applications, we propose a novel computational "gene-to-function, snapshot-to-dynamics, and biology-to-clinic" framework to uncover core functional gene-set signatures. This framework incorporates three function-centric gene-set analysis strategies: a meta-analysis of both microarray and RNA-seq data, novel dynamic network mechanism (DNM) identification, and a personalized prognostic indicator analysis. This work uses the complex disease acute myeloid leukemia (AML) as a research platform. We introduced an adjustable "soft threshold" to a functional gene-set algorithm and found that two different analysis methods identified distinct gene-set signatures from the same samples. We identified a 30-gene cluster that characterizes leukemic stem cell (LSC)-depleted cells and, in parallel, a 25-gene cluster that characterizes LSC-enriched cells; both mark favorable prognosis in AML. Genes within each signature significantly share common biological processes and/or molecular functions (empirical p = 6e-5 and 0.03, respectively). The 25-gene signature reflects the abnormal development of stem cells in AML, such as AURKA over-expression. We subsequently determined that the clinical relevance of both signatures is independent of known clinical risk classifications in 214 patients with cytogenetically normal AML. We successfully validated the prognostic value of both signatures in two independent cohorts of 91 and 242 patients, respectively (log-rank p < 0.0015 and 0.05; empirical p < 0.015 and 0.08). The proposed algorithms and computational framework will benefit systems biology research because they efficiently translate gene-sets (rather than single genes) into biological discoveries about AML and other complex diseases.
Predicting gene regulatory networks of soybean nodulation from RNA-Seq transcriptome data.
Zhu, Mingzhu; Dahmen, Jeremy L; Stacey, Gary; Cheng, Jianlin
2013-09-22
High-throughput RNA sequencing (RNA-Seq) is a revolutionary technique to study the transcriptome of a cell under various conditions at a systems level. Despite the wide application of RNA-Seq techniques to generate experimental data in the last few years, few computational methods are available to analyze this huge amount of transcription data. The computational methods for constructing gene regulatory networks from RNA-Seq expression data of hundreds or even thousands of genes are particularly lacking and urgently needed. We developed an automated bioinformatics method to predict gene regulatory networks from the quantitative expression values of differentially expressed genes based on RNA-Seq transcriptome data of a cell in different stages and conditions, integrating transcriptional, genomic and gene function data. We applied the method to the RNA-Seq transcriptome data generated for soybean root hair cells in three different development stages of nodulation after rhizobium infection. The method predicted a soybean nodulation-related gene regulatory network consisting of 10 regulatory modules common for all three stages, and 24, 49 and 70 modules separately for the first, second and third stage, each containing both a group of co-expressed genes and several transcription factors collaboratively controlling their expression under different conditions. 8 of 10 common regulatory modules were validated by at least two kinds of validations, such as independent DNA binding motif analysis, gene function enrichment test, and previous experimental data in the literature. We developed a computational method to reliably reconstruct gene regulatory networks from RNA-Seq transcriptome data. The method can generate valuable hypotheses for interpreting biological data and designing biological experiments such as ChIP-Seq, RNA interference, and yeast two hybrid experiments.
A cis-regulatory logic simulator.
Zeigler, Robert D; Gertz, Jason; Cohen, Barak A
2007-07-27
A major goal of computational studies of gene regulation is to accurately predict the expression of genes based on the cis-regulatory content of their promoters. The development of computational methods to decode the interactions among cis-regulatory elements has been slow, in part, because it is difficult to know, without extensive experimental validation, whether a particular method identifies the correct cis-regulatory interactions that underlie a given set of expression data. There is an urgent need for test expression data in which the interactions among cis-regulatory sites that produce the data are known. The ability to rapidly generate such data sets would facilitate the development and comparison of computational methods that predict gene expression patterns from promoter sequence. We developed a gene expression simulator which generates expression data using user-defined interactions between cis-regulatory sites. The simulator can incorporate additive, cooperative, competitive, and synergistic interactions between regulatory elements. Constraints on the spacing, distance, and orientation of regulatory elements and their interactions may also be defined and Gaussian noise can be added to the expression values. The simulator allows for a data transformation that simulates the sigmoid shape of expression levels from real promoters. We found good agreement between sets of simulated promoters and predicted regulatory modules from real expression data. We present several data sets that may be useful for testing new methodologies for predicting gene expression from promoter sequence. We developed a flexible gene expression simulator that rapidly generates large numbers of simulated promoters and their corresponding transcriptional output based on specified interactions between cis-regulatory sites. When appropriate rule sets are used, the data generated by our simulator faithfully reproduces experimentally derived data sets. 
We anticipate that using simulated gene expression data sets will facilitate the direct comparison of computational strategies to predict gene expression from promoter sequence. The source code is available online and as additional material. The test sets are available as additional material.
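The kind of rule-based simulation described in the abstract above can be sketched in a few lines. The code below is a hedged illustration, not the authors' simulator: the function name, the representation of interactions as pairwise bonuses, the order in which noise and the sigmoid transform are applied, and all numeric values are assumptions. Spacing and orientation constraints are omitted.

```python
import math
import random

def simulate_expression(sites, weights, pair_bonus, noise_sd=0.0, rng=None):
    """Simulate the expression output of one promoter (hypothetical sketch).

    sites      -- set of cis-regulatory site names present in the promoter
    weights    -- dict: site name -> additive contribution
    pair_bonus -- dict: frozenset({a, b}) -> extra contribution applied when
                  both sites are present (positive for cooperative or
                  synergistic pairs, negative for competitive pairs)
    noise_sd   -- standard deviation of Gaussian noise added to the score
    """
    rng = rng or random.Random()
    score = sum(weights.get(site, 0.0) for site in sites)
    for pair, bonus in pair_bonus.items():
        if pair <= sites:                  # both sites of the pair are present
            score += bonus
    if noise_sd > 0:
        score += rng.gauss(0.0, noise_sd)
    return 1.0 / (1.0 + math.exp(-score))  # sigmoid-shaped expression level
```

Because the interaction rules are explicit inputs, data generated this way have a known ground truth against which a prediction method can be scored.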
Wang, Shuaiqun; Aorigele; Kong, Wei; Zeng, Weiming; Hong, Xiaomin
2016-01-01
Gene expression data composed of thousands of genes play an important role in classification platforms and disease diagnosis. Hence, it is vital to select a small subset of salient features from a large number of gene expression data. Lately, many researchers have devoted themselves to feature selection using diverse computational intelligence methods. However, in the process of selecting informative genes, many computational methods face difficulties in selecting small subsets for cancer classification due to the huge number of genes (high dimension) compared to the small number of samples, noisy genes, and irrelevant genes. In this paper, we propose a new hybrid algorithm, HICATS, which combines the imperialist competition algorithm (ICA), which performs global search, with tabu search (TS), which conducts fine-tuned search. In order to verify the performance of the proposed algorithm HICATS, we have tested it on 10 well-known benchmark gene expression classification datasets with dimensions varying from 2308 to 12600. The performance of our proposed method proved to be superior to other related works, including the conventional version of the binary optimization algorithm, in terms of classification accuracy and the number of selected genes. PMID:27579323
Fuzzy measures on the Gene Ontology for gene product similarity.
Popescu, Mihail; Keller, James M; Mitchell, Joyce A
2006-01-01
One of the most important objects in bioinformatics is a gene product (protein or RNA). For many gene products, functional information is summarized in a set of Gene Ontology (GO) annotations. For these genes, it is reasonable to include similarity measures based on the terms found in the GO or other taxonomy. In this paper, we introduce several novel measures for computing the similarity of two gene products annotated with GO terms. The fuzzy measure similarity (FMS) has the advantage that it takes into consideration the context of both complete sets of annotation terms when computing the similarity between two gene products. When the two gene products are not annotated by common taxonomy terms, we propose a method that avoids a zero similarity result. To account for the variations in the annotation reliability, we propose a similarity measure based on the Choquet integral. These similarity measures provide extra tools for the biologist in search of functional information for gene products. The initial testing on a group of 194 sequences representing three protein families shows a higher correlation of the FMS and Choquet similarities to the BLAST sequence similarities than the traditional similarity measures such as pairwise average or pairwise maximum.
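For readers unfamiliar with the Choquet integral mentioned in the abstract above, the following is a minimal sketch of the standard discrete form. It is a generic implementation, not the authors' similarity measure: how GO term similarities are mapped to `values` and how annotation reliability defines `measure` are left as assumptions for the caller.

```python
def choquet_integral(values, measure):
    """Discrete Choquet integral (generic sketch).

    values  -- dict: source name -> score to aggregate
    measure -- callable taking a frozenset of source names and returning its
               fuzzy measure in [0, 1]; it should be monotone and return 1
               for the full set of sources
    """
    ordered = sorted(values, key=values.get, reverse=True)  # largest first
    total = 0.0
    for i, source in enumerate(ordered):
        next_value = values[ordered[i + 1]] if i + 1 < len(ordered) else 0.0
        coalition = frozenset(ordered[: i + 1])  # the i + 1 top-scoring sources
        total += (values[source] - next_value) * measure(coalition)
    return total
```

The choice of measure controls the behaviour. An additive measure gives a weighted average, a measure equal to 1 on every non-empty set gives the maximum, and a measure that is 1 only on the full set gives the minimum, so the pairwise average and pairwise maximum named in the abstract appear as special cases of this kind of aggregation.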
Discovery of new candidate genes related to brain development using protein interaction information.
Chen, Lei; Chu, Chen; Kong, Xiangyin; Huang, Tao; Cai, Yu-Dong
2015-01-01
Human brain development is a dramatic process composed of a series of complex and fine-tuned spatiotemporal gene expressions. A good comprehension of this process can assist us in developing the potential of our brain. However, we have only limited knowledge about the genes and gene functions that are involved in this biological process. Therefore, a substantial demand remains to discover new brain development-related genes and identify their biological functions. In this study, we aimed to discover new brain-development related genes by building a computational method. We referred to a series of computational methods used to discover new disease-related genes and developed a similar method. In this method, the shortest path algorithm was executed on a weighted graph that was constructed using protein-protein interactions. New candidate genes fell on at least one of the shortest paths connecting two known genes that are related to brain development. A randomization test was then adopted to filter positive discoveries. Of the final identified genes, several have been reported to be associated with brain development, indicating the effectiveness of the method, whereas several of the others may have potential roles in brain development.
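The shortest-path step described in the abstract above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: the graph format, the function names, and the edge weights are hypothetical, only one shortest path per gene pair is traced, and the randomization test used to filter candidates is not included.

```python
import heapq
import itertools

def shortest_path(graph, source, target):
    """Dijkstra's algorithm; graph maps node -> {neighbor: weight}.

    Returns one shortest path as a list of nodes, or None if unreachable.
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        if node == target:
            break
        for neighbor, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]

def candidate_genes(graph, known_genes):
    """Collect interior nodes of shortest paths between pairs of known genes."""
    known = set(known_genes)
    candidates = set()
    for a, b in itertools.combinations(sorted(known), 2):
        path = shortest_path(graph, a, b)
        if path:
            candidates.update(n for n in path[1:-1] if n not in known)
    return candidates
```

In a protein-protein interaction network, a lower edge weight would typically encode a more confident interaction, so that shortest paths prefer well-supported links. That weighting convention is an assumption of this sketch.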
Computation and application of tissue-specific gene set weights.
Frost, H Robert
2018-04-06
Gene set testing, or pathway analysis, has become a critical tool for the analysis of high-dimensional genomic data. Although the function and activity of many genes and higher-level processes is tissue-specific, gene set testing is typically performed in a tissue agnostic fashion, which impacts statistical power and the interpretation and replication of results. To address this challenge, we have developed a bioinformatics approach to compute tissue-specific weights for individual gene sets using information on tissue-specific gene activity from the Human Protein Atlas (HPA). We used this approach to create a public repository of tissue-specific gene set weights for 37 different human tissue types from the HPA and all collections in the Molecular Signatures Database (MSigDB). To demonstrate the validity and utility of these weights, we explored three different applications: the functional characterization of human tissues, multi-tissue analysis for systemic diseases and tissue-specific gene set testing. All data used in the reported analyses is publicly available. An R implementation of the method and tissue-specific weights for MSigDB gene set collections can be downloaded at http://www.dartmouth.edu/~hrfrost/TissueSpecificGeneSets. Contact: rob.frost@dartmouth.edu.
BioVLAB-MMIA: a cloud environment for microRNA and mRNA integrated analysis (MMIA) on Amazon EC2.
Lee, Hyungro; Yang, Youngik; Chae, Heejoon; Nam, Seungyoon; Choi, Donghoon; Tangchaisin, Patanachai; Herath, Chathura; Marru, Suresh; Nephew, Kenneth P; Kim, Sun
2012-09-01
MicroRNAs, by regulating the expression of hundreds of target genes, play critical roles in developmental biology and the etiology of numerous diseases, including cancer. As a vast amount of microRNA expression profile data are now publicly available, the integration of microRNA expression data sets with gene expression profiles is a key research problem in life science research. However, conducting genome-wide microRNA-mRNA (gene) integration currently requires sophisticated, high-end informatics tools and significant expertise in bioinformatics and computer science to carry out the complex integration analysis. In addition, increased computing infrastructure capabilities are essential in order to accommodate large data sets. In this study, we have extended the BioVLAB cloud workbench to develop an environment for the integrated analysis of microRNA and mRNA expression data, named BioVLAB-MMIA. The workbench facilitates computations on the Amazon EC2 and S3 resources orchestrated by the XBaya Workflow Suite. The advantages of BioVLAB-MMIA over the web-based MMIA system include: 1) it is readily expanded as new computational tools become available; 2) it is easily modified by re-configuring graphic icons in the workflow; 3) on-demand cloud computing resources can be used on an "as needed" basis; and 4) distributed orchestration supports complex and long-running workflows asynchronously. We believe that BioVLAB-MMIA will be an easy-to-use computing environment for researchers who plan to perform genome-wide microRNA-mRNA (gene) integrated analysis tasks.
Synthetic mixed-signal computation in living cells
Rubens, Jacob R.; Selvaggio, Gianluca; Lu, Timothy K.
2016-01-01
Living cells implement complex computations on the continuous environmental signals that they encounter. These computations involve both analogue- and digital-like processing of signals to give rise to complex developmental programs, context-dependent behaviours and homeostatic activities. In contrast to natural biological systems, synthetic biological systems have largely focused on either digital or analogue computation separately. Here we integrate analogue and digital computation to implement complex hybrid synthetic genetic programs in living cells. We present a framework for building comparator gene circuits to digitize analogue inputs based on different thresholds. We then demonstrate that comparators can be predictably composed together to build band-pass filters, ternary logic systems and multi-level analogue-to-digital converters. In addition, we interface these analogue-to-digital circuits with other digital gene circuits to enable concentration-dependent logic. We expect that this hybrid computational paradigm will enable new industrial, diagnostic and therapeutic applications with engineered cells. PMID:27255669
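The logic of composing comparators into a band-pass filter and a multi-level analogue-to-digital converter, as described in the abstract above, can be shown with a purely abstract sketch. It models only the input-output behaviour under idealized sharp thresholds. The function names and threshold values are assumptions, and nothing here represents the genetic parts or the circuit dynamics used by the authors.

```python
def comparator(level, threshold):
    """Digitize an analogue input: True when the level reaches the threshold."""
    return level >= threshold

def band_pass(level, low, high):
    """ON only for inputs between two thresholds (low comparator AND NOT high)."""
    return comparator(level, low) and not comparator(level, high)

def analog_to_digital(level, thresholds):
    """Multi-level conversion: count how many thresholds the input reaches."""
    return sum(comparator(level, t) for t in sorted(thresholds))
```

With two thresholds, `analog_to_digital` returns 0, 1, or 2, which is one way to read a three-valued (ternary) output from a single analogue input.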
Computational approaches were developed to identify factors that regulate Nrf2 in a large gene expression compendium of microarray profiles including >2000 comparisons which queried the effects of chemicals, genes, diets, and infectious agents on gene expression in the mouse l...
Tiffin, Nicki; Meintjes, Ayton; Ramesar, Rajkumar; Bajic, Vladimir B.; Rayner, Brian
2010-01-01
Multiple factors underlie susceptibility to essential hypertension, including a significant genetic and ethnic component, and environmental effects. Blood pressure response of hypertensive individuals to salt is heterogeneous, but salt sensitivity appears more prevalent in people of indigenous African origin. The underlying genetics of salt-sensitive hypertension, however, are poorly understood. In this study, computational methods including text- and data-mining have been used to select and prioritize candidate aetiological genes for salt-sensitive hypertension. Additionally, we have compared allele frequencies and copy number variation for single nucleotide polymorphisms in candidate genes between indigenous Southern African and Caucasian populations, with the aim of identifying candidate genes with significant variability between the population groups: identifying genetic variability between population groups can exploit ethnic differences in disease prevalence to aid with prioritisation of good candidate genes. Our top-ranking candidate genes include parathyroid hormone precursor (PTH) and type-1 angiotensin II receptor (AGTR1). We propose that the candidate genes identified in this study warrant further investigation as potential aetiological genes for salt-sensitive hypertension. PMID:20886000
Burger, Brian T.; Imam, Saheed; Scarborough, Matthew J.; ...
2017-06-06
Rhodobacter sphaeroides is one of the best-studied alphaproteobacteria from biochemical, genetic, and genomic perspectives. To gain a better systems-level understanding of this organism, we generated a large transposon mutant library and used transposon sequencing (Tn-seq) to identify genes that are essential under several growth conditions. Using newly developed Tn-seq analysis software (TSAS), we identified 493 genes as essential for aerobic growth on a rich medium. We then used the mutant library to identify conditionally essential genes under two laboratory growth conditions, identifying 85 additional genes required for aerobic growth in a minimal medium and 31 additional genes required for photosynthetic growth. In all instances, our analyses confirmed essentiality for many known genes and identified genes not previously considered to be essential. We used the resulting Tn-seq data to refine and improve a genome-scale metabolic network model (GEM) for R. sphaeroides. Together, we demonstrate how genetic, genomic, and computational approaches can be combined to obtain a systems-level understanding of the genetic framework underlying metabolic diversity in bacterial species.
Computational methods for identifying miRNA sponge interactions.
Le, Thuc Duy; Zhang, Junpeng; Liu, Lin; Li, Jiuyong
2017-07-01
Recent findings show that coding genes are not the only targets that miRNAs interact with. In fact, there is a pool of different RNAs competing with each other to attract miRNAs for interactions, thus acting as competing endogenous RNAs (ceRNAs). The ceRNAs indirectly regulate each other via the titration mechanism, i.e. the increasing concentration of a ceRNA will decrease the number of miRNAs that are available for interacting with other targets. The cross-talks between ceRNAs, i.e. their interactions mediated by miRNAs, have been identified as the drivers in many disease conditions, including cancers. In recent years, some computational methods have emerged for identifying ceRNA-ceRNA interactions. However, there remain great challenges and opportunities for developing computational methods to provide new insights into ceRNA regulatory mechanisms. In this paper, we review the publicly available databases of ceRNA-ceRNA interactions and the computational methods for identifying ceRNA-ceRNA interactions (also known as miRNA sponge interactions). We also conduct a comparison study of the methods with a breast cancer dataset. Our aim is to provide a current snapshot of the advances of the computational methods in identifying miRNA sponge interactions and to discuss the remaining challenges. © The Author 2016. Published by Oxford University Press.
Dissecting Embryonic Stem Cell Self-Renewal and Differentiation Commitment from Quantitative Models.
Hu, Rong; Dai, Xianhua; Dai, Zhiming; Xiang, Qian; Cai, Yanning
2016-10-01
To quantitatively model embryonic stem cell (ESC) self-renewal and differentiation by computational approaches, we developed a unified mathematical model for gene expression involved in cell fate choices. Our quantitative model comprised ESC master regulators and lineage-specific pivotal genes. It took the factors of multiple pathways as input and computed expression as a function of intrinsic transcription factors, extrinsic cues, epigenetic modifications, and antagonism between ESC master regulators and lineage-specific pivotal genes. In the model, the differential equations for the expression of genes involved in cell fate choices were established from the regulatory relationships according to the transcription and degradation rates. We applied this model to murine ESC self-renewal and differentiation commitment and found that it modeled the expression patterns with good accuracy. Our model analysis revealed that the murine ESC state was an attractor state in culture and that differentiation was predominantly caused by antagonism between ESC master regulators and lineage-specific pivotal genes. Moreover, antagonism among lineages played a critical role in lineage reprogramming. Our results also uncovered that the ordered expression alteration of ESC master regulators over time had a central role in ESC differentiation fates. Our computational framework was generally applicable to most cell-type maintenance and lineage reprogramming.
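The role of antagonism and attractor states in the abstract above can be illustrated with a two-gene toy model in which each gene is produced at a rate repressed by the other and is degraded at a constant rate. This is a generic mutual-repression sketch, not the authors' model: the Hill-type repression term, the function name, and every parameter value are assumptions, and extrinsic cues and epigenetic terms are omitted.

```python
def simulate_antagonism(x0, y0, alpha=1.0, k=0.5, n=4, decay=1.0,
                        dt=0.01, steps=2000):
    """Forward-Euler integration of two mutually repressing genes.

    x -- expression of a hypothetical ESC master regulator
    y -- expression of a hypothetical lineage-specific gene

    dx/dt = alpha / (1 + (y / k)**n) - decay * x
    dy/dt = alpha / (1 + (x / k)**n) - decay * y
    """
    x, y = x0, y0
    for _ in range(steps):
        dx = alpha / (1.0 + (y / k) ** n) - decay * x
        dy = alpha / (1.0 + (x / k) ** n) - decay * y
        x, y = x + dt * dx, y + dt * dy
    return x, y
```

Starting near the high-x state, the system settles there, and starting near the high-y state it settles at the mirror-image state. Each stable state behaves as an attractor, which is the sense in which a self-renewing state can persist until the balance between the antagonists is shifted.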
Industrial applications of high-performance computing for phylogeny reconstruction
NASA Astrophysics Data System (ADS)
Bader, David A.; Moret, Bernard M.; Vawter, Lisa
2001-07-01
Phylogenies (that is, tree-of-life relationships) derived from gene order data may prove crucial in answering some fundamental open questions in biomolecular evolution. Real-world interest is strong in determining these relationships. For example, pharmaceutical companies may use phylogeny reconstruction in drug discovery for discovering synthetic pathways unique to organisms that they wish to target. Health organizations study the phylogenies of organisms such as HIV in order to understand their epidemiologies and to aid in predicting the behaviors of future outbreaks. And governments are interested in aiding the production of such foodstuffs as rice, wheat and potatoes via genetics through understanding of the phylogenetic distribution of genetic variation in wild populations. Yet few techniques are available for difficult phylogenetic reconstruction problems. Appropriate tools for analysis of such data may aid in resolving some of the phylogenetic problems that have been analyzed without much resolution for decades. With the rapid accumulation of whole genome sequences for a wide diversity of taxa, especially microbial taxa, phylogenetic reconstruction based on changes in gene order and gene content is showing promise, particularly for resolving deep (i.e., ancient) branch splits. However, reconstruction from gene-order data is even more computationally expensive than reconstruction from sequence data, particularly in groups with large numbers of genes and highly-rearranged genomes. We have developed a software suite, GRAPPA, that extends the breakpoint analysis (BPAnalysis) method of Sankoff and Blanchette while running much faster: in a recent analysis of chloroplast genome data for species of Campanulaceae on a 512-processor Linux supercluster with Myrinet, we achieved a one-million-fold speedup over BPAnalysis. 
GRAPPA can use either breakpoint or inversion distance (computed exactly) for its computation and runs on single-processor machines as well as parallel and high-performance computers.
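To make the breakpoint distance mentioned above concrete, here is a small sketch that counts breakpoints between two signed gene orders. It is not GRAPPA's implementation: it assumes linear (not circular) genomes over the same gene set, frames each genome with artificial end markers, and the function name is hypothetical.

```python
def breakpoint_distance(g1, g2):
    """Count adjacencies of g1 that are not preserved in g2.

    Genomes are lists of signed integers (sign = gene orientation).
    An adjacency (a, b) is preserved if g2 contains a followed by b,
    or the reverse-complement pair -b followed by -a.
    """
    def framed(g):
        # Artificial end markers 0 and n + 1 so genome ends count as adjacencies.
        return [0] + list(g) + [len(g) + 1]

    preserved = set()
    f2 = framed(g2)
    for a, b in zip(f2, f2[1:]):
        preserved.add((a, b))
        preserved.add((-b, -a))

    f1 = framed(g1)
    return sum((a, b) not in preserved for a, b in zip(f1, f1[1:]))
```

For example, inverting the middle segment of `[1, 2, 3, 4]` to obtain `[1, -3, -2, 4]` breaks exactly the two adjacencies at the ends of the inverted segment.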
Genetics and Early Detection in Idiopathic Pulmonary Fibrosis
Putman, Rachel K.; Rosas, Ivan O.
2014-01-01
Genetic studies hold promise in helping to identify patients with early idiopathic pulmonary fibrosis (IPF). Recent studies using chest computed tomograms (CTs) in smokers and in the general population have demonstrated that imaging abnormalities suggestive of an early stage of pulmonary fibrosis are not uncommon and are associated with respiratory symptoms, physical examination abnormalities, and physiologic decrements expected, but less severe than those noted in patients with IPF. Similarly, recent genetic studies have demonstrated strong and replicable associations between a common promoter polymorphism in the mucin 5B gene (MUC5B) and both IPF and the presence of abnormal imaging findings in the general population. Despite these findings, it is important to note that the definition of early-stage IPF remains unclear, limited data exist to definitively connect abnormal imaging findings to IPF, and genetic studies assessing early-stage pulmonary fibrosis remain in their infancy. In this perspective we provide updated information on interstitial lung abnormalities and their connection to IPF. We summarize information on the genetics of pulmonary fibrosis by focusing on the recent genetic findings of MUC5B. Finally, we discuss the implications of these findings and suggest a roadmap for the use of genetics in the detection of early IPF. PMID:24547893
Arrhenius-kinetics evidence for quantum tunneling in microbial "social" decision rates.
Clark, Kevin B
2010-11-01
Social-like bacteria, fungi and protozoa communicate chemical and behavioral signals to coordinate their specializations into an ordered group of individuals capable of fitter ecological performance. Examples of microbial "social" behaviors include sporulation and dispersion, kin recognition and nonclonal or paired reproduction. Paired reproduction by ciliates is believed to involve intra- and intermate selection through pheromone-stimulated "courting" rituals. Such social maneuvering minimizes survival-reproduction tradeoffs while sorting superior mates from inferior ones, lowering the vertical spread of deleterious genes in geographically constricted populations and possibly promoting advantageous genetic innovations. In a previous article, I reported findings that the heterotrich Spirostomum ambiguum can out-compete mating rivals in simulated social trials by learning behavioral heuristics which it then employs to store and select sets of altruistic and deceptive signaling strategies. Frequencies of strategy use typically follow Maxwell-Boltzmann (MB), Fermi-Dirac (FD) or Bose-Einstein (BE) statistical distributions. For ciliates most adept at social decision making, a brief classical MB computational phase drives signaling behavior into a later quantum BE computational phase that condenses or favors the selection of a single fittest strategy. Appearance of the network analogue of BE condensation coincides with Hebbian-like trial-and-error learning and is consistent with the idea that cells behave as heat engines, where loss of energy associated with specific cellular machinery critical for mating decisions effectively reduces the temperature of intracellular enzymes cohering into weak Fröhlich superposition. I extend these findings by showing the rates at which ciliates switch serial behavioral strategies agree with principles of chemical reactions exhibiting linear and nonlinear Arrhenius kinetics during respective classical and quantum computations.
Nonlinear Arrhenius kinetics in ciliate decision making suggest transitions from one signaling strategy to another result from a computational analogue of quantum tunneling in social information processing.
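For readers unfamiliar with the terminology, "linear" Arrhenius kinetics means that ln k plotted against 1/T is a straight line, following the standard Arrhenius equation k = A exp(-Ea / (R T)). The sketch below shows only that textbook relationship; the pre-exponential factor, activation energy and temperatures are hypothetical and are not taken from the article.

```python
import math

R = 8.314462618  # molar gas constant, J / (mol K)

def arrhenius_rate(prefactor, activation_energy, temperature):
    """Rate constant from the standard Arrhenius equation."""
    return prefactor * math.exp(-activation_energy / (R * temperature))

def activation_energy_from_rates(k1, t1, k2, t2):
    """Recover Ea from two (rate, temperature) points, assuming linear kinetics.

    From ln k2 - ln k1 = (Ea / R) * (1/t1 - 1/t2).
    """
    return R * math.log(k2 / k1) / (1.0 / t1 - 1.0 / t2)
```

If the activation energy estimated this way changes systematically with the temperature pair chosen, the Arrhenius plot is curved, which is the "nonlinear" case the abstract refers to.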
Gene therapy improves dental manifestations in hypophosphatasia model mice.
Okawa, R; Iijima, O; Kishino, M; Okawa, H; Toyosawa, S; Sugano-Tajima, H; Shimada, T; Okada, T; Ozono, K; Ooshima, T; Nakano, K
2017-06-01
Hypophosphatasia is a rare inherited skeletal disorder characterized by defective bone mineralization and deficiency of tissue non-specific alkaline phosphatase (TNSALP) activity. The disease is caused by mutations in the liver/bone/kidney alkaline phosphatase gene (ALPL) encoding TNSALP. Early exfoliation of primary teeth owing to disturbed cementum formation, periodontal ligament weakness and alveolar bone resorption are major complications encountered in oral findings, and discovery of early loss of primary teeth in a dental examination often leads to early diagnosis of hypophosphatasia. Although there are no known fundamental treatments or effective dental approaches to prevent early exfoliation of primary teeth in affected patients, several possible treatments have recently been described, including gene therapy. Gene therapy has also been applied to TNSALP knockout mice (Alpl -/- ), which phenocopy the infantile form of hypophosphatasia, and improved their systemic condition. In the present study, we investigated whether gene therapy improved the dental condition of Alpl -/- mice. Following sublethal irradiation (4 Gy) at the age of 2 d, Alpl -/- mice underwent gene therapy using bone marrow cells transduced with a lentiviral vector expressing a bone-targeted form of TNSALP injected into the jugular vein (n = 3). Wild-type (Alpl +/+ ), heterozygous mice (Alpl +/- ) and Alpl -/- mice were analyzed at 9 d of age (n = 3 of each), while Alpl +/+ mice and treated or untreated Alpl -/- mice were analyzed at 1 mo of age (n = 3 of each), and Alpl +/- mice and Alpl -/- mice with gene therapy were analyzed at 3 mo of age (n = 3 of each). A single mandibular hemi-section obtained at 1 mo of age was analyzed using a small animal computed tomography machine to assess alveolar bone formation. 
Other mandibular hemi-sections obtained at 9 d, 1 mo and 3 mo of age were subjected to hematoxylin and eosin staining and immunohistochemical analysis of osteopontin, a marker of cementum. Immunohistochemical analysis of osteopontin, a marker of acellular cementum, revealed that Alpl -/- mice displayed impaired formation of cementum and alveolar bone, similar to the human dental phenotype. Cementum formation was clearly present in Alpl -/- mice that underwent gene therapy, but did not recover to the same level as that in wild-type (Alpl +/+ ) mice. Micro-computed tomography examination showed that gene therapy improved alveolar bone mineral density in Alpl -/- mice to a similar level to that in Alpl +/+ mice. Our results suggest that gene therapy can improve the general condition of Alpl -/- mice, and induce significant alveolar bone formation and moderate improvement of cementum formation, which may contribute to inhibition of early spontaneous tooth exfoliation. © 2016 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.
Pilkington, Sarah M; Crowhurst, Ross; Hilario, Elena; Nardozza, Simona; Fraser, Lena; Peng, Yongyan; Gunaseelan, Kularajathevan; Simpson, Robert; Tahir, Jibran; Deroles, Simon C; Templeton, Kerry; Luo, Zhiwei; Davy, Marcus; Cheng, Canhong; McNeilage, Mark; Scaglione, Davide; Liu, Yifei; Zhang, Qiong; Datson, Paul; De Silva, Nihal; Gardiner, Susan E; Bassett, Heather; Chagné, David; McCallum, John; Dzierzon, Helge; Deng, Cecilia; Wang, Yen-Yi; Barron, Lorna; Manako, Kelvina; Bowen, Judith; Foster, Toshi M; Erridge, Zoe A; Tiffin, Heather; Waite, Chethi N; Davies, Kevin M; Grierson, Ella P; Laing, William A; Kirk, Rebecca; Chen, Xiuyin; Wood, Marion; Montefiori, Mirco; Brummell, David A; Schwinn, Kathy E; Catanach, Andrew; Fullerton, Christina; Li, Dawei; Meiyalaghan, Sathiyamoorthy; Nieuwenhuizen, Niels; Read, Nicola; Prakash, Roneel; Hunter, Don; Zhang, Huaibi; McKenzie, Marian; Knäbel, Mareike; Harris, Alastair; Allan, Andrew C; Gleave, Andrew; Chen, Angela; Janssen, Bart J; Plunkett, Blue; Ampomah-Dwamena, Charles; Voogd, Charlotte; Leif, Davin; Lafferty, Declan; Souleyre, Edwige J F; Varkonyi-Gasic, Erika; Gambi, Francesco; Hanley, Jenny; Yao, Jia-Long; Cheung, Joey; David, Karine M; Warren, Ben; Marsh, Ken; Snowden, Kimberley C; Lin-Wang, Kui; Brian, Lara; Martinez-Sanchez, Marcela; Wang, Mindy; Ileperuma, Nadeesha; Macnee, Nikolai; Campin, Robert; McAtee, Peter; Drummond, Revel S M; Espley, Richard V; Ireland, Hilary S; Wu, Rongmei; Atkinson, Ross G; Karunairetnam, Sakuntala; Bulley, Sean; Chunkath, Shayhan; Hanley, Zac; Storey, Roy; Thrimawithana, Amali H; Thomson, Susan; David, Charles; Testolin, Raffaele; Huang, Hongwen; Hellens, Roger P; Schaffer, Robert J
2018-04-16
Most published genome sequences are drafts, and most are dominated by computational gene prediction. Draft genomes typically incorporate considerable sequence data that are not assigned to chromosomes, and predicted genes without quality confidence measures. The current Actinidia chinensis (kiwifruit) 'Hongyang' draft genome has 164 Mb of sequences unassigned to pseudo-chromosomes, and omissions have been identified in the gene models. A second genome of an A. chinensis (genotype Red5) was fully sequenced. This new sequence resulted in a 554.0 Mb assembly with all but 6 Mb assigned to pseudo-chromosomes. Pseudo-chromosomal comparisons showed a considerable number of translocation events have occurred following a whole genome duplication (WGD) event, some consistent with centromeric Robertsonian-like translocations. RNA sequencing data from 12 tissues and ab initio analysis informed a genome-wide manual annotation, using the WebApollo tool. In total, 33,044 gene loci represented by 33,123 isoforms were identified, named and tagged for quality of evidential support. Of these, 3114 (9.4%) were identical to a protein within 'Hongyang' The Kiwifruit Information Resource (KIR v2). Some proportion of the differences will be varietal polymorphisms. However, as most computationally predicted Red5 models required manual re-annotation, this proportion is expected to be small. The quality of the new gene models was tested by fully sequencing 550 cloned 'Hort16A' cDNAs and comparing with the predicted protein models for Red5 and both the original 'Hongyang' assembly and the revised annotation from KIR v2. Only 48.9% and 63.5% of the cDNAs had a match with 90% identity or better to the original and revised 'Hongyang' annotation, respectively, compared with 90.9% to the Red5 models. Our study highlights the need to take a cautious approach to draft genomes and computationally predicted genes.
Our use of the manual annotation tool WebApollo facilitated manual checking and correction of gene models enabling improvement of computational prediction. This utility was especially relevant for certain types of gene families such as the EXPANSIN like genes. Finally, this high quality gene set will supply the kiwifruit and general plant community with a new tool for genomics and other comparative analysis.
Davis, Allan Peter; Wiegers, Thomas C.; King, Benjamin L.; Wiegers, Jolene; Grondin, Cynthia J.; Sciaky, Daniela; Johnson, Robin J.; Mattingly, Carolyn J.
2016-01-01
Strategies for discovering common molecular events among disparate diseases hold promise for improving understanding of disease etiology and expanding treatment options. One technique is to leverage curated datasets found in the public domain. The Comparative Toxicogenomics Database (CTD; http://ctdbase.org/) manually curates chemical-gene, chemical-disease, and gene-disease interactions from the scientific literature. The use of official gene symbols in CTD interactions enables this information to be combined with the Gene Ontology (GO) file from NCBI Gene. By integrating these GO-gene annotations with CTD’s gene-disease dataset, we produce 753,000 inferences between 15,700 GO terms and 4,200 diseases, providing opportunities to explore presumptive molecular underpinnings of diseases and identify biological similarities. Through a variety of applications, we demonstrate the utility of this novel resource. As a proof-of-concept, we first analyze known repositioned drugs (e.g., raloxifene and sildenafil) and see that their target diseases have a greater degree of similarity when comparing GO terms vs. genes. Next, a computational analysis predicts seemingly non-intuitive diseases (e.g., stomach ulcers and atherosclerosis) as being similar to bipolar disorder, and these are validated in the literature as reported co-diseases. Additionally, we leverage other CTD content to develop testable hypotheses about thalidomide-gene networks to treat seemingly disparate diseases. Finally, we illustrate how CTD tools can rank a series of drugs as potential candidates for repositioning against B-cell chronic lymphocytic leukemia and predict cisplatin and the small molecule inhibitor JQ1 as lead compounds. The CTD dataset is freely available for users to navigate pathologies within the context of extensive biological processes, molecular functions, and cellular components conferred by GO. 
This inference set should aid researchers, bioinformaticists, and pharmaceutical drug makers in finding commonalities in disease mechanisms, which in turn could help identify new therapeutics, new indications for existing pharmaceuticals, potential disease comorbidities, and alerts for side effects. PMID:27171405
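The integration step described above, linking GO terms to diseases through the genes they share, amounts to a join on gene symbol. The sketch below illustrates that join on toy data; the gene symbols, GO identifiers, disease names and function name are all hypothetical, and this is not CTD's actual pipeline or file format.

```python
from collections import defaultdict

def infer_go_disease(gene_to_go, gene_to_disease):
    """Return {(go_term, disease): set of genes supporting the inference}."""
    inferences = defaultdict(set)
    for gene, go_terms in gene_to_go.items():
        # Only genes annotated in both datasets can support an inference.
        for disease in gene_to_disease.get(gene, ()):
            for go_term in go_terms:
                inferences[(go_term, disease)].add(gene)
    return dict(inferences)

# Toy, made-up annotations for illustration only.
toy_go = {"GENE_A": {"GO:1"}, "GENE_B": {"GO:1", "GO:2"}, "GENE_C": {"GO:3"}}
toy_disease = {"GENE_A": {"disease X"}, "GENE_B": {"disease X", "disease Y"}}
toy_inferences = infer_go_disease(toy_go, toy_disease)
```

Keeping the supporting gene set for each pair, rather than a bare count, makes it possible to inspect why a GO term and a disease were linked.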
Pulay, Attila J; Réthelyi, János M
2016-09-01
Despite moderate heritability estimates, the genetics of suicidal behavior remains unclear: genome-wide association and candidate gene studies focusing on single nucleotide associations have reported inconsistent findings. Our study explored biologically informed, multimarker candidate gene associations with suicidal behavior in mood disorders. We analyzed the GAIN Whole Genome Association Study of Bipolar Disorder version 3 (n = 999, suicidal n = 358) and the GAIN Major Depression: Stage 1 Genomewide Association in Population-Based Samples (n = 1,753, suicidal n = 245) datasets. Suicidal behavior was defined as severe suicidal ideation or attempt. Candidate genes were selected based on literature search (Geneset1, n = 35), gene expression data of microRNA genes (Geneset2, n = 68) and their target genes (Geneset3, n = 11,259). Quality control and dosage analyses were carried out with PLINK. Gene-based associations of Geneset1 were analyzed with KGG. Polygenic profile scores of suicidal behavior were computed in the major depression dataset both with PRSice and LDpred and validated in the bipolar disorder data. Several nominally significant gene-based associations were detected, but only DICER1 associated with suicidal behavior in both samples, while only the associations of NTRK2 in the depression sample reached family-wise and experiment-wise significance. Polygenic profile scores negatively predicted suicidal behavior in the bipolar sample for only Geneset2, with the strongest prediction by PRSice at Pt < 0.03 (Nagelkerke R² = 0.01, P < 0.007). Gene-based association results confirmed the potential involvement of the BDNF-NTRK2-CREB pathway in the pathogenesis of suicide and the cross-disorder association of DICER1. Polygenic risk prediction of the selected miRNA genes indicates that the miRNA system may play a mediating role, but with considerable pleiotropy. © 2016 Wiley Periodicals, Inc.
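A polygenic profile score at a p-value threshold (the "Pt" above) is, at its simplest, a weighted sum of allele dosages over the variants whose discovery p-value falls below the threshold. The sketch below shows only that core arithmetic; it is not how PRSice or LDpred are implemented (both do considerably more, such as handling linkage disequilibrium), and the variant identifiers and numbers are invented.

```python
def polygenic_score(dosages, effect_sizes, pvalues, p_threshold):
    """Sum of effect size x allele dosage over variants with p < p_threshold.

    dosages:      {variant_id: allele dosage in one individual (0..2)}
    effect_sizes: {variant_id: per-allele effect from the discovery sample}
    pvalues:      {variant_id: discovery-sample association p-value}
    """
    return sum(
        effect_sizes[variant] * dosage
        for variant, dosage in dosages.items()
        if variant in effect_sizes and pvalues.get(variant, 1.0) < p_threshold
    )
```

Scores computed this way in a target sample are then typically tested as a predictor of the phenotype, which is the validation step the abstract reports.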
Variability of Creatine Metabolism Genes in Children with Autism Spectrum Disorder.
Cameron, Jessie M; Levandovskiy, Valeriy; Roberts, Wendy; Anagnostou, Evdokia; Scherer, Stephen; Loh, Alvin; Schulze, Andreas
2017-07-31
Creatine deficiency syndrome (CDS) comprises three separate enzyme deficiencies with overlapping clinical presentations: arginine:glycine amidinotransferase (GATM gene, glycine amidinotransferase), guanidinoacetate methyltransferase (GAMT gene), and creatine transporter deficiency (SLC6A8 gene, solute carrier family 6 member 8). CDS presents with developmental delays/regression, intellectual disability, speech and language impairment, autistic behaviour, epileptic seizures, treatment-refractory epilepsy, and extrapyramidal movement disorders; symptoms that are also evident in children with autism. The objective of the study was to test the hypothesis that genetic variability in creatine metabolism genes is associated with autism. We sequenced GATM, GAMT and SLC6A8 genes in 166 patients with autism (coding sequence, introns and adjacent untranslated regions). A total of 29, 16 and 25 variants were identified in each gene, respectively. Four variants were novel in GATM, and 5 in SLC6A8 (not present in the 1000 Genomes, Exome Sequencing Project (ESP) or Exome Aggregation Consortium (ExAC) databases). A single variant in each gene was identified as non-synonymous, and computationally predicted to be potentially damaging. Nine variants in GATM were shown to have a lower minor allele frequency (MAF) in the autism population than in the 1000 Genomes database, specifically in the East Asian population (Fisher's exact test). Two variants also had lower MAFs in the European population. In summary, there were no apparent associations of variants in GAMT and SLC6A8 genes with autism. The data implying there could be a lower association of some specific GATM gene variants with autism is an observation that would need to be corroborated in a larger group of autism patients, and with sub-populations of Asian ethnicities.
Overall, our findings suggest that the genetic variability of creatine synthesis/transport is unlikely to play a part in the pathogenesis of autism spectrum disorder (ASD) in children.
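The allele-frequency comparison above relies on Fisher's exact test applied to a 2x2 table of minor versus major allele counts in two groups. As an illustration of how that test works, here is a small two-sided implementation using only the standard library. The table layout is an assumption about how such counts would be arranged, and no counts from the study are used.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    For example: rows = cohort vs reference, columns = minor vs major allele counts.
    Sums the hypergeometric probabilities of all tables (with the same margins)
    that are no more likely than the observed one.
    """
    row1, row2, col1, total = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )
```

With many variants tested, as in the study, the resulting p-values would normally also need a multiple-testing correction.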
Ranking metrics in gene set enrichment analysis: do they matter?
Zyla, Joanna; Marczyk, Michal; Weiner, January; Polanska, Joanna
2017-05-12
There exist many methods for describing the complex relation between changes of gene expression in molecular pathways or gene ontologies under different experimental conditions. Among them, Gene Set Enrichment Analysis seems to be one of the most commonly used (over 10,000 citations). An important parameter, which could affect the final result, is the choice of a metric for the ranking of genes. Applying a default ranking metric may lead to poor results. In this work 28 benchmark data sets were used to evaluate the sensitivity and false positive rate of gene set analysis for 16 different ranking metrics including new proposals. Furthermore, the robustness of the chosen methods to sample size was tested. Using k-means clustering algorithm a group of four metrics with the highest performance in terms of overall sensitivity, overall false positive rate and computational load was established i.e. absolute value of Moderated Welch Test statistic, Minimum Significant Difference, absolute value of Signal-To-Noise ratio and Baumgartner-Weiss-Schindler test statistic. In case of false positive rate estimation, all selected ranking metrics were robust with respect to sample size. In case of sensitivity, the absolute value of Moderated Welch Test statistic and absolute value of Signal-To-Noise ratio gave stable results, while Baumgartner-Weiss-Schindler and Minimum Significant Difference showed better results for larger sample size. Finally, the Gene Set Enrichment Analysis method with all tested ranking metrics was parallelised and implemented in MATLAB, and is available at https://github.com/ZAEDPolSl/MrGSEA . Choosing a ranking metric in Gene Set Enrichment Analysis has critical impact on results of pathway enrichment analysis. The absolute value of Moderated Welch Test has the best overall sensitivity and Minimum Significant Difference has the best overall specificity of gene set analysis. 
When the number of non-normally distributed genes is high, using Baumgartner-Weiss-Schindler test statistic gives better outcomes. Also, it finds more enriched pathways than other tested metrics, which may induce new biological discoveries.
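One of the ranking metrics named above, the absolute signal-to-noise ratio, is commonly defined as the difference of group means divided by the sum of group standard deviations. The sketch below ranks genes by that quantity. It assumes this common definition and uses the sample standard deviation; it is unrelated to the MrGSEA MATLAB code, and the gene names and values in the usage are invented.

```python
from statistics import mean, stdev

def signal_to_noise(group_a, group_b):
    """(mean_a - mean_b) / (sd_a + sd_b), using sample standard deviations.

    Raises ZeroDivisionError if both groups have zero variance.
    """
    return (mean(group_a) - mean(group_b)) / (stdev(group_a) + stdev(group_b))

def rank_genes(expression):
    """Rank gene names by absolute signal-to-noise ratio, largest first.

    expression: {gene: (values_in_condition_a, values_in_condition_b)}
    """
    scores = {
        gene: abs(signal_to_noise(a, b)) for gene, (a, b) in expression.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```

Taking the absolute value puts strongly up- and down-regulated genes together at the top of the list, which changes which gene sets appear enriched compared with a signed ranking.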
Synthetic analog computation in living cells.
Daniel, Ramiz; Rubens, Jacob R; Sarpeshkar, Rahul; Lu, Timothy K
2013-05-30
A central goal of synthetic biology is to achieve multi-signal integration and processing in living cells for diagnostic, therapeutic and biotechnology applications. Digital logic has been used to build small-scale circuits, but other frameworks may be needed for efficient computation in the resource-limited environments of cells. Here we demonstrate that synthetic analog gene circuits can be engineered to execute sophisticated computational functions in living cells using just three transcription factors. Such synthetic analog gene circuits exploit feedback to implement logarithmically linear sensing, addition, ratiometric and power-law computations. The circuits exhibit Weber's law behaviour as in natural biological systems, operate over a wide dynamic range of up to four orders of magnitude and can be designed to have tunable transfer functions. Our circuits can be composed to implement higher-order functions that are well described by both intricate biochemical models and simple mathematical functions. By exploiting analog building-block functions that are already naturally present in cells, this approach efficiently implements arithmetic operations and complex functions in the logarithmic domain. Such circuits may lead to new applications for synthetic biology and biotechnology that require complex computations with limited parts, need wide-dynamic-range biosensing or would benefit from the fine control of gene expression.
bc-GenExMiner 3.0: new mining module computes breast cancer gene expression correlation analyses.
Jézéquel, Pascal; Frénel, Jean-Sébastien; Campion, Loïc; Guérin-Charbonnel, Catherine; Gouraud, Wilfried; Ricolleau, Gabriel; Campone, Mario
2013-01-01
We recently developed a user-friendly web-based application called bc-GenExMiner (http://bcgenex.centregauducheau.fr), which offered the possibility to evaluate prognostic informativity of genes in breast cancer by means of a 'prognostic module'. In this study, we develop a new module called 'correlation module', which includes three kinds of gene expression correlation analyses. The first one computes correlation coefficient between 2 or more (up to 10) chosen genes. The second one produces two lists of genes that are most correlated (positively and negatively) to a 'tested' gene. A gene ontology (GO) mining function is also proposed to explore GO 'biological process', 'molecular function' and 'cellular component' terms enrichment for the output lists of most correlated genes. The third one explores gene expression correlation between the 15 telomeric and 15 centromeric genes surrounding a 'tested' gene. These correlation analyses can be performed in different groups of patients: all patients (without any subtyping), in molecular subtypes (basal-like, HER2+, luminal A and luminal B) and according to oestrogen receptor status. Validation tests based on published data showed that these automatized analyses lead to results consistent with studies' conclusions. In brief, this new module has been developed to help basic researchers explore molecular mechanisms of breast cancer. DATABASE URL: http://bcgenex.centregauducheau.fr
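The first analysis in the correlation module computes a correlation coefficient between chosen genes. The abstract does not state which coefficient is used, so the sketch below assumes Pearson's r, written out explicitly for two expression vectors measured across the same patients. The function name is hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Centered cross-product and sums of squares.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)
```

Applying the same function within a patient subgroup (for example one molecular subtype) only requires filtering both vectors to that subgroup's samples first.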
Computational Identification of Novel Genes: Current and Future Perspectives.
Klasberg, Steffen; Bitard-Feildel, Tristan; Mallet, Ludovic
2016-01-01
While it has long been thought that all genomic novelties are derived from the existing material, many genes lacking homology to known genes were found in recent genome projects. Some of these novel genes were proposed to have evolved de novo, i.e., out of noncoding sequences, whereas some have been shown to follow a duplication and divergence process. Their discovery called for an extension of the historical hypotheses about gene origination. Besides the theoretical breakthrough, increasing evidence accumulated that novel genes play important roles in evolutionary processes, including adaptation and speciation events. Different techniques are available to identify genes and classify them as novel. Their classification as novel is usually based on their similarity to known genes, or lack thereof, detected by comparative genomics or against databases. Computational approaches are further prime methods that can be based on existing models or can leverage biological evidence from experiments. Identification of novel genes, however, remains a challenging task. With constant software and technology updates, no gold standard, and no available benchmark, evaluation and characterization of genomic novelty is a vibrant field. In this review, the classical and state-of-the-art tools for gene prediction are introduced. The current methods for novel gene detection are presented; the methodological strategies and their limits are discussed along with perspective approaches for further studies.
Comparing Phylogenetic Trees by Matching Nodes Using the Transfer Distance Between Partitions
Giaro, Krzysztof
2017-01-01
Ability to quantify dissimilarity of different phylogenetic trees describing the relationship between the same group of taxa is required in various types of phylogenetic studies. For example, such metrics are used to assess the quality of phylogeny construction methods, to define optimization criteria in supertree building algorithms, or to find horizontal gene transfer (HGT) events. Among the set of metrics described so far in the literature, the most commonly used seems to be the Robinson–Foulds distance. In this article, we define a new metric for rooted trees: the Matching Pair (MP) distance. The MP metric uses the concept of the minimum-weight perfect matching in a complete bipartite graph constructed from partitions of all pairs of leaves of the compared phylogenetic trees. We analyze the properties of the MP metric and present computational experiments showing its potential applicability in tasks related to finding the HGT events. PMID:28177699
Automated Design Framework for Synthetic Biology Exploiting Pareto Optimality.
Otero-Muras, Irene; Banga, Julio R
2017-07-21
In this work we consider Pareto optimality for automated design in synthetic biology. We present a generalized framework based on a mixed-integer dynamic optimization formulation that, given design specifications, allows the computation of Pareto optimal sets of designs, that is, the set of best trade-offs for the metrics of interest. We show how this framework can be used for (i) forward design, that is, finding the Pareto optimal set of synthetic designs for implementation, and (ii) reverse design, that is, analyzing and inferring motifs and/or design principles of gene regulatory networks from the Pareto set of optimal circuits. Finally, we illustrate the capabilities and performance of this framework considering four case studies. In the first problem we consider the forward design of an oscillator. In the remaining problems, we illustrate how to apply the reverse design approach to find motifs for stripe formation, rapid adaption, and fold-change detection, respectively.
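The "Pareto optimal set" referred to above is the set of designs that no other design beats on every metric at once. The sketch below filters a list of candidate designs down to that set, assuming every objective is to be minimized. It shows only the dominance test, not the mixed-integer dynamic optimization the authors use to generate candidates, and the example objective values are invented.

```python
def dominates(a, b):
    """True if design a is at least as good as b everywhere and better somewhere.

    Both are tuples of objective values; lower is better for every objective.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Return the non-dominated points, preserving input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (cost, error) pairs for five candidate circuit designs.
designs = [(1, 5), (2, 3), (3, 4), (4, 1), (5, 5)]
front = pareto_front(designs)
```

This brute-force filter is quadratic in the number of candidates, which is fine for inspecting a result set but is not a substitute for an optimizer.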
Comparing Phylogenetic Trees by Matching Nodes Using the Transfer Distance Between Partitions.
Bogdanowicz, Damian; Giaro, Krzysztof
2017-05-01
Ability to quantify dissimilarity of different phylogenetic trees describing the relationship between the same group of taxa is required in various types of phylogenetic studies. For example, such metrics are used to assess the quality of phylogeny construction methods, to define optimization criteria in supertree building algorithms, or to find horizontal gene transfer (HGT) events. Among the set of metrics described so far in the literature, the most commonly used seems to be the Robinson-Foulds distance. In this article, we define a new metric for rooted trees: the Matching Pair (MP) distance. The MP metric uses the concept of the minimum-weight perfect matching in a complete bipartite graph constructed from partitions of all pairs of leaves of the compared phylogenetic trees. We analyze the properties of the MP metric and present computational experiments showing its potential applicability in tasks related to finding the HGT events.
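As a point of reference for the baseline metric named above, the following sketch computes a Robinson-Foulds distance between two rooted trees as the number of clusters (leaf sets below an internal node) found in one tree but not the other. It does not implement the MP distance, which requires a minimum-weight perfect matching. The nested-tuple tree encoding is an assumption, and some authors divide this count by two.

```python
def clusters(tree):
    """Set of non-trivial clusters of a rooted tree given as nested tuples.

    Leaves are strings; an internal node is a tuple of subtrees.
    The cluster containing all leaves (the root) is excluded.
    """
    found = set()

    def leaves_below(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves_below(child) for child in node))
        found.add(members)
        return members

    all_leaves = leaves_below(tree)
    found.discard(all_leaves)
    return found

def robinson_foulds(tree1, tree2):
    """Size of the symmetric difference of the two cluster sets."""
    return len(clusters(tree1) ^ clusters(tree2))
```

Because a cluster either matches exactly or not at all, moving a single leaf can change many clusters at once; this coarseness is what matching-based metrics aim to soften.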
2015-01-01
Cancer is a disease characterized largely by the accumulation of out-of-control somatic mutations during the lifetime of a patient. Distinguishing driver mutations from passenger mutations has posed a challenge in modern cancer research. With the advanced development of microarray experiments and clinical studies, a large number of candidate cancer genes have been extracted, and distinguishing the informative genes among them is essential. We propose a framework for finding informative cancer genes using mutation data from ovarian cancers. In our model we use the patient gene mutation profile, gene expression data and a gene-gene interaction network to construct a graphical representation of genes and patients. Markov processes for mutations and patients are triggered separately. After this process, cancer genes are prioritized automatically by examining their scores in the stationary distributions given by the eigenvector. Extensive experiments demonstrate that the integration of heterogeneous sources of information is essential in finding important cancer genes. PMID:26328548
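The ranking step, scoring nodes by the stationary distribution of a Markov process, can be sketched with plain power iteration. The transition matrix, gene labels and function name below are hypothetical; the abstract does not describe how the actual chain is built from the mutation, expression and network data.

```python
def stationary_distribution(P, tol=1e-12, max_iter=10_000):
    """Power iteration for a row-stochastic matrix P (list of lists)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


# Hypothetical three-gene chain; each row sums to 1.
genes = ["G1", "G2", "G3"]
P = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]
scores = stationary_distribution(P)
ranking = sorted(zip(genes, scores), key=lambda item: -item[1])
print(ranking)
```

The stationary vector is the left eigenvector of P with eigenvalue 1, which is why the abstract speaks of scores "in the eigenvector".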
Kim, Jihye; Yoo, Minjae; Shin, Jimin; Kim, Hyunmin; Kang, Jaewoo; Tan, Aik Choon
2018-01-01
Traditional Chinese medicine (TCM), which originated in ancient China, has been practiced for thousands of years to treat various symptoms and diseases. However, the molecular mechanisms of TCM in treating these diseases remain unknown. In this study, we employ a systems pharmacology-based approach for connecting GWAS diseases with TCM for potential drug repurposing and repositioning. We studied 102 TCM components and their target genes by analyzing microarray gene expression experiments. We constructed disease-gene networks from 2558 GWAS studies. We applied a systems pharmacology approach to prioritize disease-target genes. Using this bioinformatics approach, we analyzed 14,713 GWAS disease-TCM-target gene pairs and identified 115 disease-gene pairs with q value < 0.2. We validated several of these GWAS disease-TCM-target gene pairs with literature evidence, demonstrating that this computational approach could reveal novel indications for TCM. We also developed the TCM-Disease web application to facilitate TCM drug repurposing efforts. Systems pharmacology is a promising approach for connecting GWAS diseases with TCM for potential drug repurposing and repositioning. The computational approaches described in this study could easily be extended to other disease-gene network analyses.
Kim, Jihye; Yoo, Minjae; Shin, Jimin; Kim, Hyunmin; Kang, Jaewoo
2018-01-01
Traditional Chinese medicine (TCM), which originated in ancient China, has been practiced for thousands of years to treat various symptoms and diseases. However, the molecular mechanisms of TCM in treating these diseases remain unknown. In this study, we employ a systems pharmacology-based approach for connecting GWAS diseases with TCM for potential drug repurposing and repositioning. We studied 102 TCM components and their target genes by analyzing microarray gene expression experiments. We constructed disease-gene networks from 2558 GWAS studies. We applied a systems pharmacology approach to prioritize disease-target genes. Using this bioinformatics approach, we analyzed 14,713 GWAS disease-TCM-target gene pairs and identified 115 disease-gene pairs with q value < 0.2. We validated several of these GWAS disease-TCM-target gene pairs with literature evidence, demonstrating that this computational approach could reveal novel indications for TCM. We also developed the TCM-Disease web application to facilitate TCM drug repurposing efforts. Systems pharmacology is a promising approach for connecting GWAS diseases with TCM for potential drug repurposing and repositioning. The computational approaches described in this study could easily be extended to other disease-gene network analyses. PMID:29765977
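The "q value < 0.2" filter can be illustrated with a Benjamini-Hochberg adjustment. The abstract does not say which q-value procedure was used, so Benjamini-Hochberg is an assumption here, and the p-values and function name are hypothetical.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q


# Hypothetical p-values for four disease-gene pairs.
pvals = [0.01, 0.04, 0.03, 0.5]
qvals = bh_qvalues(pvals)
selected = [i for i, q in enumerate(qvals) if q < 0.2]
print(qvals, selected)
```

With these toy inputs the first three pairs pass the 0.2 threshold and the fourth does not.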
Ye, Weixing; Zhu, Lei; Liu, Yingying; Crickmore, Neil; Peng, Donghai; Ruan, Lifang; Sun, Ming
2012-07-01
We have designed a high-throughput system for the identification of novel crystal protein genes (cry) from Bacillus thuringiensis strains. The system was developed with two goals: (i) to acquire the mixed plasmid-enriched genomic sequence of B. thuringiensis using next-generation sequencing biotechnology, and (ii) to identify cry genes with a computational pipeline (using BtToxin_scanner). In our pipeline method, we employed three different kinds of well-developed prediction methods, BLAST, hidden Markov model (HMM), and support vector machine (SVM), to predict the presence of Cry toxin genes. The pipeline proved to be fast (average speed, 1.02 Mb/min for proteins and open reading frames [ORFs] and 1.80 Mb/min for nucleotide sequences), sensitive (it detected 40% more protein toxin genes than a keyword extraction method using genomic sequences downloaded from GenBank), and highly specific. Twenty-one strains from our laboratory's collection were selected based on their plasmid pattern and/or crystal morphology. The plasmid-enriched genomic DNA was extracted from these strains and mixed for Illumina sequencing. The sequencing data were de novo assembled, and a total of 113 candidate cry sequences were identified using the computational pipeline. Twenty-seven candidate sequences were selected on the basis of their low level of sequence identity to known cry genes, and eight full-length genes were obtained with PCR. Finally, three new cry-type genes (primary ranks) and five cry holotypes, which were designated cry8Ac1, cry7Ha1, cry21Ca1, cry32Fa1, and cry21Da1 by the B. thuringiensis Toxin Nomenclature Committee, were identified. The system described here is both efficient and cost-effective and can greatly accelerate the discovery of novel cry genes.
Sharma, Amitabh; Menche, Jörg; Huang, C. Chris; Ort, Tatiana; Zhou, Xiaobo; Kitsak, Maksim; Sahni, Nidhi; Thibault, Derek; Voung, Linh; Guo, Feng; Ghiassian, Susan Dina; Gulbahce, Natali; Baribaud, Frédéric; Tocker, Joel; Dobrin, Radu; Barnathan, Elliot; Liu, Hao; Panettieri, Reynold A.; Tantisira, Kelan G.; Qiu, Weiliang; Raby, Benjamin A.; Silverman, Edwin K.; Vidal, Marc; Weiss, Scott T.; Barabási, Albert-László
2015-01-01
Recent advances in genetics have spurred rapid progress towards the systematic identification of genes involved in complex diseases. Still, the detailed understanding of the molecular and physiological mechanisms through which these genes affect disease phenotypes remains a major challenge. Here, we identify the asthma disease module, i.e. the local neighborhood of the interactome whose perturbation is associated with asthma, and validate it for functional and pathophysiological relevance, using both computational and experimental approaches. We find that the asthma disease module is enriched with modest GWAS P-values against the background of random variation, and with differentially expressed genes from normal and asthmatic fibroblast cells treated with an asthma-specific drug. The asthma module also contains immune response mechanisms that are shared with other immune-related disease modules. Further, using diverse omics (genomics, gene-expression, drug response) data, we identify the GAB1 signaling pathway as an important novel modulator in asthma. The wiring diagram of the uncovered asthma module suggests a relatively close link between GAB1 and glucocorticoids (GCs), which we experimentally validate, observing an increase in the level of GAB1 after GC treatment in BEAS-2B bronchial epithelial cells. The siRNA knockdown of GAB1 in the BEAS-2B cell line resulted in a decrease in the NFkB level, suggesting a novel regulatory path of the pro-inflammatory factor NFkB by GAB1 in asthma. PMID:25586491
A new computational method for the detection of horizontal gene transfer events.
Tsirigos, Aristotelis; Rigoutsos, Isidore
2005-01-01
In recent years, the increase in the amount of available genomic data has made it easier to appreciate the extent to which organisms increase their genetic diversity through horizontally transferred genetic material. Such transfers have the potential to give rise to extremely dynamic genomes where a significant proportion of their coding DNA has been contributed by external sources. Because of the impact of these horizontal transfers on the ecological and pathogenic character of the recipient organisms, methods are continuously sought that are able to computationally determine which of the genes of a given genome are products of transfer events. In this paper, we introduce and discuss a novel computational method for identifying horizontal transfers that relies on a gene's nucleotide composition and obviates the need for knowledge of codon boundaries. In addition to being applicable to individual genes, the method can be easily extended to the case of clusters of horizontally transferred genes. With the help of an extensive and carefully designed set of experiments on 123 archaeal and bacterial genomes, we demonstrate that the new method exhibits significant improvement in sensitivity when compared to previously published approaches. In fact, it achieves an average relative improvement across genomes of between 11 and 41% compared to the Codon Adaptation Index method in distinguishing native from foreign genes. Our method's horizontal gene transfer predictions for 123 microbial genomes are available online at http://cbcsrv.watson.ibm.com/HGT/.
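A composition-based screen of this general kind can be sketched by comparing each gene's k-mer frequencies with the genome-wide profile, which needs no codon boundaries. The scoring function, the choice of k and the toy sequences below are assumptions for illustration only and are not the published method.

```python
from collections import Counter


def kmer_freqs(seq, k):
    """Relative frequencies of overlapping k-mers in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}


def composition_distance(gene, genome_profile, k):
    """L1 distance between a gene's k-mer profile and the genome profile."""
    gene_profile = kmer_freqs(gene, k)
    kmers = set(gene_profile) | set(genome_profile)
    return sum(
        abs(gene_profile.get(x, 0.0) - genome_profile.get(x, 0.0)) for x in kmers
    )


# Toy data: an AT-rich "genome", a native-like gene and a GC-rich gene.
genome = "AT" * 20
profile = kmer_freqs(genome, 2)
native_score = composition_distance("ATATATAT", profile, 2)
foreign_score = composition_distance("GCGCGCGC", profile, 2)
print(native_score, foreign_score)
```

Genes whose distance is unusually large relative to the rest of the genome would be flagged as candidate transfers; choosing that cutoff is outside the scope of this sketch.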
Genome-wide analysis of the GH3 family in apple (Malus × domestica).
Yuan, Huazhao; Zhao, Kai; Lei, Hengjiu; Shen, Xinjie; Liu, Yun; Liao, Xiong; Li, Tianhong
2013-05-02
Auxin plays important roles in hormone crosstalk and the plant's stress response. The auxin-responsive Gretchen Hagen3 (GH3) gene family maintains hormonal homeostasis by conjugating excess indole-3-acetic acid (IAA), salicylic acid (SA), and jasmonic acids (JAs) to amino acids during hormone- and stress-related signaling pathways. With the sequencing of the apple (Malus × domestica) genome completed, it is possible to carry out genomic studies on GH3 genes to identify candidates with roles in abiotic/biotic stress responses. Malus sieversii Roem., an apple rootstock with strong drought tolerance and the ancestral species of cultivated apple species, was used as the experimental material. Following genome-wide computational and experimental identification of MdGH3 genes, we showed that MdGH3s were differentially expressed in the leaves and roots of M. sieversii and that some of these genes were significantly induced after various phytohormone and abiotic stress treatments. Given the role of GH3 in the negative feedback regulation of free IAA concentration, we examined whether phytohormones and abiotic stresses could alter the endogenous auxin level. By analyzing the GUS activity of DR5::GUS-transformed Arabidopsis seedlings, we showed that ABA, SA, salt, and cold treatments suppressed the auxin response. These findings suggest that other phytohormones and abiotic stress factors might alter endogenous auxin levels. Previous studies showed that GH3 genes regulate hormonal homeostasis. Our study indicated that some GH3 genes were significantly induced in M. sieversii after various phytohormone and abiotic stress treatments, and that ABA, SA, salt, and cold treatments reduce the endogenous level of auxin. Taken together, this study provides evidence that GH3 genes play important roles in the crosstalk between auxin, other phytohormones, and the abiotic stress response by maintaining auxin homeostasis.
Genes2WordCloud: a quick way to identify biological themes from gene lists and free text.
Baroukh, Caroline; Jenkins, Sherry L; Dannenfelser, Ruth; Ma'ayan, Avi
2011-10-13
Word-clouds recently emerged on the web as a solution for quickly summarizing text by maximizing the display of most relevant terms about a specific topic in the minimum amount of space. As biologists are faced with the daunting amount of new research data commonly presented in textual formats, word-clouds can be used to summarize and represent biological and/or biomedical content for various applications. Genes2WordCloud is a web application that enables users to quickly identify biological themes from gene lists and research relevant text by constructing and displaying word-clouds. It provides users with several different options and ideas for the sources that can be used to generate a word-cloud. Different options for rendering and coloring the word-clouds give users the flexibility to quickly generate customized word-clouds of their choice. Genes2WordCloud is a word-cloud generator and a word-cloud viewer that is based on WordCram implemented using Java, Processing, AJAX, mySQL, and PHP. Text is fetched from several sources and then processed to extract the most relevant terms with their computed weights based on word frequencies. Genes2WordCloud is freely available for use online; it is open source software and is available for installation on any web-site along with supporting documentation at http://www.maayanlab.net/G2W. Genes2WordCloud provides a useful way to summarize and visualize large amounts of textual biological data or to find biological themes from several different sources. The open source availability of the software enables users to implement customized word-clouds on their own web-sites and desktop applications.
Genes2WordCloud: a quick way to identify biological themes from gene lists and free text
2011-01-01
Background Word-clouds recently emerged on the web as a solution for quickly summarizing text by maximizing the display of most relevant terms about a specific topic in the minimum amount of space. As biologists are faced with the daunting amount of new research data commonly presented in textual formats, word-clouds can be used to summarize and represent biological and/or biomedical content for various applications. Results Genes2WordCloud is a web application that enables users to quickly identify biological themes from gene lists and research relevant text by constructing and displaying word-clouds. It provides users with several different options and ideas for the sources that can be used to generate a word-cloud. Different options for rendering and coloring the word-clouds give users the flexibility to quickly generate customized word-clouds of their choice. Methods Genes2WordCloud is a word-cloud generator and a word-cloud viewer that is based on WordCram implemented using Java, Processing, AJAX, mySQL, and PHP. Text is fetched from several sources and then processed to extract the most relevant terms with their computed weights based on word frequencies. Genes2WordCloud is freely available for use online; it is open source software and is available for installation on any web-site along with supporting documentation at http://www.maayanlab.net/G2W. Conclusions Genes2WordCloud provides a useful way to summarize and visualize large amounts of textual biological data or to find biological themes from several different sources. The open source availability of the software enables users to implement customized word-clouds on their own web-sites and desktop applications. PMID:21995939
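The weighting step described above (terms weighted by word frequency) can be sketched as follows. The stop-word list, tokenizer and normalization are illustrative assumptions, not those used by Genes2WordCloud.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption, not the tool's own list).
STOP_WORDS = {"the", "of", "and", "in", "a", "to", "is"}


def term_weights(text, top_n=5):
    """Return the top_n terms with weights scaled so the top term is 1.0."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    if not counts:
        return []
    max_count = max(counts.values())
    return [(w, c / max_count) for w, c in counts.most_common(top_n)]


sample = (
    "Auxin regulates growth. Auxin and the GH3 genes regulate "
    "auxin homeostasis in the root."
)
weights = term_weights(sample)
print(weights)
```

A word-cloud renderer would then map each weight to a font size; that part is omitted here.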
Schroeder, Mark D.; Greer, Christina; Gaul, Ulrike
2011-01-01
The generation of metameric body plans is a key process in development. In Drosophila segmentation, periodicity is established rapidly through the complex transcriptional regulation of the pair-rule genes. The ‘primary’ pair-rule genes generate their 7-stripe expression through stripe-specific cis-regulatory elements controlled by the preceding non-periodic maternal and gap gene patterns, whereas ‘secondary’ pair-rule genes are thought to rely on 7-stripe elements that read off the already periodic primary pair-rule patterns. Using a combination of computational and experimental approaches, we have conducted a comprehensive systems-level examination of the regulatory architecture underlying pair-rule stripe formation. We find that runt (run), fushi tarazu (ftz) and odd skipped (odd) establish most of their pattern through stripe-specific elements, arguing for a reclassification of ftz and odd as primary pair-rule genes. In the case of run, we observe long-range cis-regulation across multiple intervening genes. The 7-stripe elements of run, ftz and odd are active concurrently with the stripe-specific elements, indicating that maternal/gap-mediated control and pair-rule gene cross-regulation are closely integrated. Stripe-specific elements fall into three distinct classes based on their principal repressive gap factor input; stripe positions along the gap gradients correlate with the strength of predicted input. The prevalence of cis-elements that generate two stripes and their genomic organization suggest that single-stripe elements arose by splitting and subfunctionalization of ancestral dual-stripe elements. Overall, our study provides a greatly improved understanding of how periodic patterns are established in the Drosophila embryo. PMID:21693522
StereoGene: rapid estimation of genome-wide correlation of continuous or interval feature data.
Stavrovskaya, Elena D; Niranjan, Tejasvi; Fertig, Elana J; Wheelan, Sarah J; Favorov, Alexander V; Mironov, Andrey A
2017-10-15
Genomic features with similar genome-wide distributions are generally hypothesized to be functionally related; for example, colocalization of histones and transcription start sites indicates chromatin regulation of transcription factor activity. Therefore, statistical algorithms to perform spatial, genome-wide correlation among genomic features are required. Here, we propose a method, StereoGene, that rapidly estimates genome-wide correlation among pairs of genomic features. These features may represent high-throughput data mapped to a reference genome or sets of genomic annotations in that reference genome. StereoGene enables correlation of continuous data directly, avoiding data binarization and the subsequent data loss. Correlations are computed among neighboring genomic positions using kernel correlation. Representing the correlation as a function of the genome position, StereoGene outputs the local correlation track as part of the analysis. StereoGene also accounts for confounders such as input DNA by partial correlation. We apply our method to numerous comparisons of ChIP-Seq datasets from the Human Epigenome Atlas and FANTOM CAGE to demonstrate its wide applicability. We observe changes in the correlation between epigenomic features across developmental trajectories of several tissue types consistent with known biology and find a novel spatial correlation of CAGE clusters with donor splice sites and with poly(A) sites. These analyses provide examples for the broad applicability of StereoGene for regulatory genomics. The StereoGene C++ source code, program documentation, Galaxy integration scripts and examples are available from the project homepage http://stereogene.bioinf.fbb.msu.ru/. Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press.
Diaz-Montana, Juan J.; Diaz-Diaz, Norberto
2014-01-01
Gene networks are one of the main computational models used to study the interaction between different elements during biological processes, and are widely used to represent gene–gene or protein–protein interaction complexes. We present GFD-Net, a Cytoscape app for visualizing and analyzing the functional dissimilarity of gene networks. PMID:25400907
Genetics of PCOS: A systematic bioinformatics approach to unveil the proteins responsible for PCOS.
Panda, Pritam Kumar; Rane, Riya; Ravichandran, Rahul; Singh, Shrinkhla; Panchal, Hetalkumar
2016-06-01
Polycystic ovary syndrome (PCOS) is a hormonal imbalance in women that causes problems during the menstrual cycle and in pregnancy and sometimes results in fatality. Though the genetics of PCOS is not fully understood, early diagnosis and treatment can prevent long-term effects. In this study, we have used computational tools to study the proteins involved in PCOS and their structural aspects. The proteins involved are modeled using Modeller 9v14 and Ab-initio programs. All 43 proteins responsible for PCOS were subjected to phylogenetic analysis to identify the relatedness of the proteins. Further, PCOS microarray datasets downloaded from GEO were analyzed to find the significant protein-coding genes responsible for PCOS, in addition to the reported protein-coding genes. Various statistical analyses were done using R programming to get an insight into the structural aspects of PCOS that can be used as drug targets to treat PCOS and other related reproductive diseases.
GenePRIMP: Improving Microbial Gene Prediction Quality
Pati, Amrita
2018-01-24
Amrita Pati of the DOE Joint Genome Institute's Genome Biology group talks about a computational pipeline that evaluates the accuracy of gene models in genomes and metagenomes at different stages of finishing at the "Sequencing, Finishing, Analysis in the Future" meeting in Santa Fe, NM.
Rapanelli, Maximiliano; Lew, Sergio Eduardo; Frick, Luciana Romina; Zanutto, Bonifacio Silvano
2010-01-01
The plasticity in the medial prefrontal cortex (mPFC) of rodents, or the lateral prefrontal cortex (lPFC) of non-human primates, plays a key role in neural circuits involved in learning and memory. Several genes, such as brain-derived neurotrophic factor (BDNF), cAMP response element binding (CREB), Synapsin I, Calcium/calmodulin-dependent protein kinase II (CamKII), activity-regulated cytoskeleton-associated protein (Arc), c-jun and c-fos, have been related to plasticity processes. We analysed differential expression of plasticity-related genes and immediate early genes in the mPFC of rats while they learned an operant conditioning task. Incompletely and completely trained animals were studied because of the distinct events predicted by our computational model at different learning stages. We measured changes in mRNA levels by Real-Time RT-PCR during learning of the task; expression of these plasticity-associated markers increased during learning, and the increases began to decline once the task was learned. The plasticity changes in the lPFC during learning predicted by the model matched those of the representative gene BDNF. Herein, we show for the first time, using an integrative approach of a computational model and gene expression, that plasticity in the mPFC of rats is higher while an operant conditioning task is being learned than once it has been learned. PMID:20111591
Carroll, E L; Alderman, R; Bannister, J L; Bérubé, M; Best, P B; Boren, L; Baker, C S; Constantine, R; Findlay, K; Harcourt, R; Lemaire, L; Palsbøll, P J; Patenaude, N J; Rowntree, V J; Seger, J; Steel, D; Valenzuela, L O; Watson, M; Gaggiotti, O E
2018-05-03
Understanding how dispersal and gene flow link geographically separated populations over evolutionary history is challenging, particularly in migratory marine species. In southern right whales (SRWs, Eubalaena australis), patterns of genetic diversity are likely influenced by the glacial climate cycle and recent history of whaling. Here we use a dataset of mitochondrial DNA (mtDNA) sequences (n = 1327) and nuclear markers (17 microsatellite loci, n = 222) from major wintering grounds to investigate circumpolar population structure, historical demography and effective population size. Analyses of nuclear genetic variation identify two population clusters that correspond to the South Atlantic and Indo-Pacific ocean basins that have similar effective breeder estimates. In contrast, all wintering grounds show significant differentiation for mtDNA, but no sex-biased dispersal was detected using the microsatellite genotypes. An approximate Bayesian computation (ABC) approach with microsatellite markers compared scenarios with gene flow through time, or isolation and secondary contact between ocean basins, while modelling declines in abundance linked to whaling. Secondary-contact scenarios yield the highest posterior probabilities, implying that populations in different ocean basins were largely isolated and came into secondary contact within the last 25,000 years, but the role of whaling in changes in genetic diversity and gene flow over recent generations could not be resolved. We hypothesise that these findings are driven by factors that promote isolation, such as female philopatry, and factors that could promote dispersal, such as oceanographic changes. These findings highlight the application of ABC approaches to infer connectivity in mobile species with complex population histories and, currently, low levels of differentiation.
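The core idea of approximate Bayesian computation, keeping parameter draws whose simulated data land close to the observation, can be shown with a rejection sampler on a toy binomial model. Everything here (the model, the uniform prior, the tolerance, the function names) is an assumption for illustration and bears no relation to the demographic scenarios compared in the study.

```python
import random


def simulate(p, n, rng):
    """Simulate n Bernoulli(p) trials and return the number of successes."""
    return sum(rng.random() < p for _ in range(n))


def abc_rejection(observed, n, draws, tolerance, seed=0):
    """Keep prior draws whose simulated count is within tolerance of observed."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(draws):
        p = rng.random()  # draw from a uniform(0, 1) prior
        if abs(simulate(p, n, rng) - observed) <= tolerance:
            accepted.append(p)
    return accepted


# Toy observation: 30 successes out of 100 trials.
accepted = abc_rejection(observed=30, n=100, draws=2000, tolerance=3)
posterior_mean = sum(accepted) / len(accepted)
print(len(accepted), posterior_mean)
```

Scenario comparison, as in the abstract, extends this by also drawing a scenario label for each simulation and comparing how often each label is accepted.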
Luo, Huaichao; Chen, Yuhong; Ye, Zimeng; Sun, Xinghuai; Shi, Yi; Luo, Qian; Gong, Bo; Shuai, Ping; Yang, Jiyun; Zhou, Yu; Liu, Xiaoqi; Zhang, Kaijiong; Tan, Chang; Li, Yuanfeng; Lin, Ying; Yang, Zhenglin
2015-10-01
Recently, three large genome-wide association studies have identified multiple variants associated with primary open angle glaucoma (POAG) near the ABCA1 gene. Considering that POAG and primary angle closure glaucoma (PACG) share many similar clinical manifestations, the present study was conducted to investigate whether these genetic variants were also associated with PACG in a Han Chinese population. A case-control association study of 1122 cases (PACG/PAC) and 1311 normal, matched controls was undertaken. Seven single-nucleotide polymorphisms (SNPs) near the ABCA1 gene, including rs2422493, rs2487042, rs2472496, rs2472493, rs2487032, rs2472459, and rs2472519, were genotyped. Genotype and allele frequencies were assessed using χ² tests. Linkage disequilibrium (LD) structure was analyzed by computer software. No association was observed between any of the genotyped SNPs and PACG. However, we discovered that two haplotypes, CATTTAC (corrected P = 0.048) and CGCCCGC (corrected P = 0.048), remained significantly associated with PACG/PAC after Bonferroni correction. Subjects with the CATTTAC haplotype have a 1.71-fold increased possibility of having PACG/PAC, whereas subjects with the CGCCCGC haplotype have a 0.47-fold decreased possibility of developing PACG. Our findings suggest that the genetic backgrounds of PACG and POAG might be different. However, this study does not establish whether ABCA1 plays a role in the development of PACG. Thus, further research is needed to determine the role of ABCA1 in the progression of PACG.
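An allele-frequency χ² test of the kind mentioned above can be sketched for a 2×2 table of allele counts in cases and controls. The counts below are hypothetical and are not taken from the study; the p-value uses the identity that, for one degree of freedom, the χ² survival function equals erfc(sqrt(x/2)).

```python
from math import erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    expected = [row1 * col1 / n, row1 * col2 / n, row2 * col1 / n, row2 * col2 / n]
    observed = [a, b, c, d]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def p_value_1df(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return erfc(sqrt(stat / 2.0))


# Hypothetical allele counts: rows = cases, controls; columns = allele 1, allele 2.
stat = chi_square_2x2(30, 70, 50, 50)
p = p_value_1df(stat)
print(stat, p)
```

When several SNPs or haplotypes are tested, each p-value would additionally be multiplied by the number of tests for a Bonferroni correction, as the abstract describes for the haplotype results.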
Identification of constrained cancer driver genes based on mutation timing.
Sakoparnig, Thomas; Fried, Patrick; Beerenwinkel, Niko
2015-01-01
Cancer drivers are genomic alterations that provide cells containing them with a selective advantage over their local competitors, whereas neutral passengers do not change the somatic fitness of cells. Cancer-driving mutations are usually discriminated from passenger mutations by their higher degree of recurrence in tumor samples. However, there is increasing evidence that many additional driver mutations may exist that occur at very low frequencies among tumors. This observation has prompted alternative methods for driver detection, including finding groups of mutually exclusive mutations and incorporating prior biological knowledge about gene function or network structure. Dependencies among drivers due to epistatic interactions can also result in low mutation frequencies, but this effect has been ignored in driver detection so far. Here, we present a new computational approach for identifying genomic alterations that occur at low frequencies because they depend on other events. Unlike passengers, these constrained mutations display punctuated patterns of occurrence in time. We test this driver-passenger discrimination approach based on mutation timing in extensive simulation studies, and we apply it to cross-sectional copy number alteration (CNA) data from ovarian cancer, CNA and single-nucleotide variant (SNV) data from breast tumors and SNV data from colorectal cancer. Among the top ranked predicted drivers, we find low-frequency genes that have already been shown to be involved in carcinogenesis, as well as many new candidate drivers. The mutation timing approach is orthogonal and complementary to existing driver prediction methods. It will help identify, from cancer genome data, the alterations that drive tumor progression.
Identification of Constrained Cancer Driver Genes Based on Mutation Timing
Sakoparnig, Thomas; Fried, Patrick; Beerenwinkel, Niko
2015-01-01
Cancer drivers are genomic alterations that provide cells containing them with a selective advantage over their local competitors, whereas neutral passengers do not change the somatic fitness of cells. Cancer-driving mutations are usually discriminated from passenger mutations by their higher degree of recurrence in tumor samples. However, there is increasing evidence that many additional driver mutations may exist that occur at very low frequencies among tumors. This observation has prompted alternative methods for driver detection, including finding groups of mutually exclusive mutations and incorporating prior biological knowledge about gene function or network structure. Dependencies among drivers due to epistatic interactions can also result in low mutation frequencies, but this effect has been ignored in driver detection so far. Here, we present a new computational approach for identifying genomic alterations that occur at low frequencies because they depend on other events. Unlike passengers, these constrained mutations display punctuated patterns of occurrence in time. We test this driver–passenger discrimination approach based on mutation timing in extensive simulation studies, and we apply it to cross-sectional copy number alteration (CNA) data from ovarian cancer, CNA and single-nucleotide variant (SNV) data from breast tumors and SNV data from colorectal cancer. Among the top ranked predicted drivers, we find low-frequency genes that have already been shown to be involved in carcinogenesis, as well as many new candidate drivers. The mutation timing approach is orthogonal and complementary to existing driver prediction methods. It will help identify, from cancer genome data, the alterations that drive tumor progression. PMID:25569148
Commentary: Gene-Environment Interplay in the Context of Genetics, Epigenetics, and Gene Expression.
ERIC Educational Resources Information Center
Kramer, Douglas A.
2005-01-01
Objective: To comment on the article in this issue of the Journal by Professor Michael Rutter, "Environmentally Mediated Risks for Psychopathology: Research Strategies and Findings," in the context of current research findings on gene-environment interaction, epigenetics, and gene expression. Method: Animal and human studies are reviewed that…
Integrated computational biology analysis to evaluate target genes for chronic myelogenous leukemia.
Zheng, Yu; Wang, Yu-Ping; Cao, Hongbao; Chen, Qiusheng; Zhang, Xi
2018-06-05
Although hundreds of genes have been linked to chronic myelogenous leukemia (CML), many of the results lack reproducibility. In the present study, data across multiple modalities were integrated to evaluate 579 CML candidate genes, including literature‑based CML‑gene relation data, Gene Expression Omnibus RNA expression data and pathway‑based gene‑gene interaction data. The expression data included samples from 76 patients with CML and 73 healthy controls. For each target gene, four metrics were proposed and tested with case/control classification. The effectiveness of the four metrics presented was demonstrated by the high classification accuracy (94.63%; P<2x10‑4). Cross metric analysis suggested nine top candidate genes for CML: Epidermal growth factor receptor, tumor protein p53, catenin β 1, janus kinase 2, tumor necrosis factor, abelson murine leukemia viral oncogene homolog 1, vascular endothelial growth factor A, B‑cell lymphoma 2 and proto‑oncogene tyrosine‑protein kinase. In addition, 145 CML candidate pathways enriched with 485 out of 579 genes were identified (P<8.2x10‑11; q=0.005). In conclusion, weighted genetic networks generated using computational biology may be complementary to biological experiments for the evaluation of known or novel CML target genes.
Genome-wide association between DNA methylation and alternative splicing in an invertebrate
2012-01-01
Background Gene bodies are the most evolutionarily conserved targets of DNA methylation in eukaryotes. However, the regulatory functions of gene body DNA methylation remain largely unknown. DNA methylation in insects appears to be primarily confined to exons. Two recent studies in Apis mellifera (honeybee) and Nasonia vitripennis (jewel wasp) analyzed transcription and DNA methylation data for one gene in each species to demonstrate that exon-specific DNA methylation may be associated with alternative splicing events. In this study we investigated the relationship between DNA methylation, alternative splicing, and cross-species gene conservation on a genome-wide scale using genome-wide transcription and DNA methylation data. Results We generated RNA deep sequencing data (RNA-seq) to measure genome-wide mRNA expression at the exon and gene level. We produced a de novo transcriptome from these RNA-seq data and computationally predicted splice variants for the honeybee genome. We found that exons that are included in transcription are more highly methylated than exons that are skipped during transcription. We detected enrichment for alternative splicing among methylated genes compared to unmethylated genes using Fisher’s exact test. We performed a statistical analysis to reveal that the presence of DNA methylation and the presence of alternative splicing are both factors associated with a longer gene length and a greater number of exons in genes. In concordance with this observation, a conservation analysis using BLAST revealed that each of these factors is also associated with higher cross-species gene conservation. Conclusions This study constitutes the first genome-wide analysis exhibiting a positive relationship between exon-level DNA methylation and mRNA expression in the honeybee. 
Our finding that methylated genes are enriched for alternative splicing suggests that, in invertebrates, exon-level DNA methylation may play a role in the construction of splice variants by positively influencing exon inclusion during transcription. The results from our cross-species homology analysis suggest that DNA methylation and alternative splicing are genetic mechanisms whose utilization could contribute to a longer gene length and a slower rate of gene evolution. PMID:22978521
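The enrichment test named in the abstract above can be illustrated with a minimal sketch of a one-sided Fisher's exact test on a 2x2 table (methylated vs. unmethylated genes, alternatively spliced vs. not). The function name and the counts are hypothetical and chosen only for illustration; they are not taken from the study, and the study's actual software is not specified in the abstract.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null, where X is the
    top-left cell given fixed row and column totals.
    """
    row1 = a + b          # e.g. methylated genes
    col1 = a + c          # e.g. alternatively spliced genes
    n = a + b + c + d     # all genes
    denom = comb(n, col1)
    p = 0.0
    for k in range(a, min(row1, col1) + 1):
        # Probability that exactly k of the row-1 genes fall in column 1.
        p += comb(row1, k) * comb(n - row1, col1 - k) / denom
    return p

# Hypothetical counts: 120 of 500 methylated genes are alternatively
# spliced, versus 60 of 500 unmethylated genes.
p_value = fisher_exact_greater(120, 380, 60, 440)
```

A small p-value here would indicate that alternative splicing is over-represented among the methylated genes relative to the unmethylated ones.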
Large scale analysis of signal reachability.
Todor, Andrei; Gabr, Haitham; Dobra, Alin; Kahveci, Tamer
2014-06-15
Major disorders, such as leukemia, have been shown to alter the transcription of genes. Understanding how gene regulation is affected by such aberrations is of utmost importance. One promising strategy toward this objective is to compute whether signals can reach the transcription factors through the transcription regulatory network (TRN). Due to the uncertainty of the regulatory interactions, this is a #P-complete problem, and thus solving it for very large TRNs remains a challenge. We develop a novel and scalable method to compute the probability that a signal originating at any given set of source genes can arrive at any given set of target genes (i.e., transcription factors) when the topology of the underlying signaling network is uncertain. Our method tackles this problem for large networks while providing a provably accurate result. Our method follows a divide-and-conquer strategy. We break down the given network into a sequence of non-overlapping subnetworks such that reachability can be computed autonomously and sequentially on each subnetwork. We represent each interaction using a small polynomial. The product of these polynomials expresses the different scenarios in which a signal can or cannot reach the target genes from the source genes. We introduce polynomial collapsing operators for each subnetwork. These operators dramatically reduce the size of the resulting polynomial and thus the computational complexity. We show that our method scales to entire human regulatory networks in only seconds, while the existing methods fail beyond a few tens of genes and interactions. We demonstrate that our method can successfully characterize key reachability characteristics of the entire transcription regulatory networks of patients affected by eight different subtypes of leukemia, as well as those from healthy control samples. All the datasets and code used in this article are available at bioinformatics.cise.ufl.edu/PReach/scalable.htm. © The Author 2014. Published by Oxford University Press.
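To make the reachability problem in the abstract above concrete, the following sketch computes the probability by brute force: it enumerates every "possible world" of an uncertain network, weights it by its probability, and checks reachability in each. This is not the authors' polynomial-collapsing method; it is the naive exponential-time baseline, which illustrates why a scalable method is needed. The gene names, edge probabilities, the assumption that edges are independent, and the reading of "reach" as "at least one target is reached" are all assumptions made for illustration.

```python
from itertools import product

def reach_probability(edges, sources, targets):
    """Probability that at least one target is reachable from some source.

    edges maps a directed interaction (u, v) to the probability that it
    exists; interactions are assumed to be independent.
    Cost is O(2^|edges|), so this only works for tiny networks.
    """
    edge_list = list(edges.items())
    target_set = set(targets)
    total = 0.0
    for present in product([False, True], repeat=len(edge_list)):
        weight = 1.0
        adjacency = {}
        for ((u, v), p), keep in zip(edge_list, present):
            weight *= p if keep else 1.0 - p
            if keep:
                adjacency.setdefault(u, []).append(v)
        # Depth-first search from all sources in this possible world.
        seen = set(sources)
        stack = list(sources)
        while stack:
            node = stack.pop()
            for nxt in adjacency.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        if seen & target_set:
            total += weight
    return total

# Hypothetical three-gene network with two routes from "src" to "tf".
toy_network = {("src", "tf"): 0.5, ("src", "mid"): 1.0, ("mid", "tf"): 0.5}
probability = reach_probability(toy_network, ["src"], ["tf"])
```

With only three uncertain interactions there are eight possible worlds; a network with a few hundred interactions is already far out of reach for this enumeration.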
Comparative analysis of gene regulatory networks: from network reconstruction to evolution.
Thompson, Dawn; Regev, Aviv; Roy, Sushmita
2015-01-01
Regulation of gene expression is central to many biological processes. Although reconstruction of regulatory circuits from genomic data alone is therefore desirable, this remains a major computational challenge. Comparative approaches that examine the conservation and divergence of circuits and their components across strains and species can help reconstruct circuits as well as provide insights into the evolution of gene regulatory processes and their adaptive contribution. In recent years, advances in genomic and computational tools have led to a wealth of methods for such analysis at the sequence, expression, pathway, module, and entire network level. Here, we review computational methods developed to study transcriptional regulatory networks using comparative genomics, from sequence to functional data. We highlight how these methods use evolutionary conservation and divergence to reliably detect regulatory components as well as estimate the extent and rate of divergence. Finally, we discuss the promise and open challenges in linking regulatory divergence to phenotypic divergence and adaptation.
Kelemen, Arpad; Vasilakos, Athanasios V; Liang, Yulan
2009-09-01
Comprehensive evaluation of common genetic variations through association of single-nucleotide polymorphism (SNP) structure with common complex disease on a genome-wide scale is currently a hot area in human genome research due to the recent development of the Human Genome Project and HapMap Project. Computational science, which includes computational intelligence (CI), has recently become the third method of scientific enquiry besides theory and experimentation. There has been fast-growing interest in developing and applying CI in disease mapping using SNP and haplotype data. Some recent studies have demonstrated the promise and importance of CI for common complex diseases in genomic association studies using SNP/haplotype data, especially for tackling challenges such as gene-gene and gene-environment interactions and the notorious "curse of dimensionality" problem. This review covers recent developments of CI approaches for complex diseases in genetic association studies with SNP/haplotype data.
Sample-space-based feature extraction and class preserving projection for gene expression data.
Wang, Wenjun
2013-01-01
To overcome the problems of high computational complexity and serious matrix singularity in feature extraction using Principal Component Analysis (PCA) and Fisher's Linear Discriminant Analysis (LDA) on high-dimensional data, sample-space-based feature extraction is presented, which transforms the computation of feature extraction from gene space to sample space by representing the optimal transformation vector as a weighted sum of samples. The technique is used in the implementation of PCA, LDA, and Class Preserving Projection (CPP), a newly proposed method for discriminant feature extraction, and the experimental results on gene expression data demonstrate the effectiveness of the method.
Cloud-scale genomic signals processing classification analysis for gene expression microarray data.
Harvey, Benjamin; Soo-Yeon Ji
2014-01-01
As the microarray data available to scientists continue to increase in size and complexity, it has become increasingly important to find multiple ways to draw inference through analysis of DNA/mRNA sequence data that is useful to scientists. Though there have been many attempts to draw biological inference by means of wavelet preprocessing and classification, there has not been a research effort that focuses on cloud-scale classification analysis of microarray data using wavelet thresholding to identify significantly expressed features. This paper proposes a novel methodology that uses wavelet-based denoising to initialize a threshold for determining significantly expressed genes for classification. Additionally, this research was implemented within a cloud-based distributed processing environment. Cloud computing and wavelet thresholding were used for the classification of 14 tumor classes from the Global Cancer Map (GCM). The results proved to be more accurate than using a predefined p-value for differential expression classification. This novel methodology analyzed wavelet-based threshold features of gene expression in a cloud environment and classified the expression of samples by analyzing gene patterns, which inform us of biological processes. It thereby enables researchers to face the present and forthcoming challenges that may arise in the analysis of large microarray datasets in functional genomics.
Chaturvedi, Anurag; Raeymaekers, Joost A M; Volckaert, Filip A M
2014-07-01
An intriguing question in biology is how the evolution of gene regulation is shaped by natural selection in natural populations. Among the many known regulatory mechanisms, regulation of gene expression by microRNAs (miRNAs) is of critical importance. However, our understanding of their evolution in natural populations is limited. Studying the role of miRNAs in three-spined stickleback, an important natural model for speciation research, may provide new insights into adaptive polymorphisms. However, lack of annotation of miRNA genes in its genome is a bottleneck. To fill this research gap, we used the genome of three-spined stickleback to predict miRNAs and their targets. We predicted 1486 mature miRNAs using the homology-based miRNA prediction approach. We then performed functional annotation and enrichment analysis of these targets, which identified over-represented motifs. Further, a database resource (GAmiRdb) has been developed for dynamically searching miRNAs and their targets exclusively in three-spined stickleback. Finally, the database was used in two case studies focusing on freshwater adaptation in natural populations. In the first study, we found 44 genomic regions overlapping with predicted miRNA targets. In the second study, we identified two SNPs altering the MRE seed site of sperm-specific glyceraldehyde-3-phosphate gene. These findings highlight the importance of the GAmiRdb knowledge base in understanding adaptive evolution. © 2014 John Wiley & Sons Ltd.
Gene regulatory networks: a coarse-grained, equation-free approach to multiscale computation.
Erban, Radek; Kevrekidis, Ioannis G; Adalsteinsson, David; Elston, Timothy C
2006-02-28
We present computer-assisted methods for analyzing stochastic models of gene regulatory networks. The main idea that underlies this equation-free analysis is the design and execution of appropriately initialized short bursts of stochastic simulations; the results of these are processed to estimate coarse-grained quantities of interest, such as mesoscopic transport coefficients. In particular, using a simple model of a genetic toggle switch, we illustrate the computation of an effective free energy Phi and of a state-dependent effective diffusion coefficient D that characterize an unavailable effective Fokker-Planck equation. Additionally we illustrate the linking of equation-free techniques with continuation methods for performing a form of stochastic "bifurcation analysis"; estimation of mean switching times in the case of a bistable switch is also implemented in this equation-free context. The accuracy of our methods is tested by direct comparison with long-time stochastic simulations. This type of equation-free analysis appears to be a promising approach to computing features of the long-time, coarse-grained behavior of certain classes of complex stochastic models of gene regulatory networks, circumventing the need for long Monte Carlo simulations.
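The equation-free analysis described above relies on short bursts of a stochastic simulator as its inner building block. As a minimal sketch of such a simulator, the code below runs a Gillespie stochastic simulation of a two-gene toggle switch with mutual Hill-type repression. The model form and every parameter value (alpha, n, k, delta) are assumptions chosen for illustration, not the model or parameters of the paper, and the coarse-grained estimation steps (effective free energy, diffusion coefficient) are not implemented here.

```python
import random

def toggle_switch_ssa(x, y, t_end, alpha=50.0, n=2.0, k=20.0, delta=1.0, seed=0):
    """Gillespie simulation of a toggle switch; returns (x, y) at time t_end.

    x and y are protein copy numbers; each protein represses production
    of the other through a Hill function (hypothetical parameters).
    """
    rng = random.Random(seed)
    t = 0.0
    while True:
        rates = [
            alpha / (1.0 + (y / k) ** n),  # production of X, repressed by Y
            delta * x,                     # degradation of X
            alpha / (1.0 + (x / k) ** n),  # production of Y, repressed by X
            delta * y,                     # degradation of Y
        ]
        total = sum(rates)
        t += rng.expovariate(total)        # waiting time to the next reaction
        if t > t_end:
            return x, y
        r = rng.random() * total           # choose a reaction by its propensity
        if r < rates[0]:
            x += 1
        elif r < rates[0] + rates[1]:
            x -= 1
        elif r < rates[0] + rates[1] + rates[2]:
            y += 1
        else:
            y -= 1

# One short burst started in the "X high, Y low" state.
final_state = toggle_switch_ssa(40, 0, t_end=5.0, seed=1)
```

In an equation-free setting, many such short bursts started from chosen initial states would be averaged to estimate coarse quantities, instead of running one very long trajectory.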
Rapidly Evolving Toll-3/4 Genes Encode Male-Specific Toll-Like Receptors in Drosophila.
Levin, Tera C; Malik, Harmit S
2017-09-01
Animal Toll-like receptors (TLRs) have evolved through a pattern of duplication and divergence. Whereas mammalian TLRs directly recognize microbial ligands, Drosophila Tolls bind endogenous ligands downstream of both developmental and immune signaling cascades. Here, we find that most Toll genes in Drosophila evolve slowly with little gene turnover (gains/losses), consistent with their important roles in development and indirect roles in microbial recognition. In contrast, we find that the Toll-3/4 genes have experienced an unusually rapid rate of gene gains and losses, resulting in lineage-specific Toll-3/4s and vastly different gene repertoires among Drosophila species, from zero copies (e.g., D. mojavensis) to nineteen copies (e.g., D. willistoni). In D. willistoni, we find strong evidence for positive selection in Toll-3/4 genes, localized specifically to an extracellular region predicted to overlap with the binding site of Spätzle, the only known ligand of insect Tolls. However, because Spätzle genes are not experiencing similar selective pressures, we hypothesize that Toll-3/4s may be rapidly evolving because they bind to a different ligand, akin to TLRs outside of insects. We further find that most Drosophila Toll-3/4 genes are either weakly expressed or expressed exclusively in males, specifically in the germline. Unlike other Toll genes in D. melanogaster, Toll-3, and Toll-4 have apparently escaped from essential developmental roles, as knockdowns have no substantial effects on viability or male fertility. Based on these findings, we propose that the Toll-3/4 genes represent an exceptionally rapidly evolving lineage of Drosophila Toll genes, which play an unusual, as-yet-undiscovered role in the male germline. © The Author 2017. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
Rapidly Evolving Toll-3/4 Genes Encode Male-Specific Toll-Like Receptors in Drosophila
Levin, Tera C.; Malik, Harmit S.
2017-01-01
Abstract Animal Toll-like receptors (TLRs) have evolved through a pattern of duplication and divergence. Whereas mammalian TLRs directly recognize microbial ligands, Drosophila Tolls bind endogenous ligands downstream of both developmental and immune signaling cascades. Here, we find that most Toll genes in Drosophila evolve slowly with little gene turnover (gains/losses), consistent with their important roles in development and indirect roles in microbial recognition. In contrast, we find that the Toll-3/4 genes have experienced an unusually rapid rate of gene gains and losses, resulting in lineage-specific Toll-3/4s and vastly different gene repertoires among Drosophila species, from zero copies (e.g., D. mojavensis) to nineteen copies (e.g., D. willistoni). In D. willistoni, we find strong evidence for positive selection in Toll-3/4 genes, localized specifically to an extracellular region predicted to overlap with the binding site of Spätzle, the only known ligand of insect Tolls. However, because Spätzle genes are not experiencing similar selective pressures, we hypothesize that Toll-3/4s may be rapidly evolving because they bind to a different ligand, akin to TLRs outside of insects. We further find that most Drosophila Toll-3/4 genes are either weakly expressed or expressed exclusively in males, specifically in the germline. Unlike other Toll genes in D. melanogaster, Toll-3, and Toll-4 have apparently escaped from essential developmental roles, as knockdowns have no substantial effects on viability or male fertility. Based on these findings, we propose that the Toll-3/4 genes represent an exceptionally rapidly evolving lineage of Drosophila Toll genes, which play an unusual, as-yet-undiscovered role in the male germline. PMID:28541576
Incorporating gene-environment interaction in testing for association with rare genetic variants.
Chen, Han; Meigs, James B; Dupuis, Josée
2014-01-01
The incorporation of gene-environment interactions could improve the ability to detect genetic associations with complex traits. For common genetic variants, single-marker interaction tests and joint tests of genetic main effects and gene-environment interaction have been well-established and used to identify novel association loci for complex diseases and continuous traits. For rare genetic variants, however, single-marker tests are severely underpowered due to the low minor allele frequency, and only a few gene-environment interaction tests have been developed. We aimed at developing powerful and computationally efficient tests for gene-environment interaction with rare variants. In this paper, we propose interaction and joint tests for testing gene-environment interaction of rare genetic variants. Our approach is a generalization of existing gene-environment interaction tests for multiple genetic variants under certain conditions. We show in our simulation studies that our interaction and joint tests have correct type I errors, and that the joint test is a powerful approach for testing genetic association, allowing for gene-environment interaction. We also illustrate our approach in a real data example from the Framingham Heart Study. Our approach can be applied to both binary and continuous traits, and it is powerful and computationally efficient.
NASA Astrophysics Data System (ADS)
Chen, Xianshun; Feng, Liang; Ong, Yew Soon
2012-07-01
In this article, we proposed a self-adaptive memeplex robust search (SAMRS) for finding robust and reliable solutions that are less sensitive to stochastic behaviours of customer demands and have low probability of route failures, respectively, in vehicle routing problem with stochastic demands (VRPSD). In particular, the contribution of this article is three-fold. First, the proposed SAMRS employs the robust solution search scheme (RS 3) as an approximation of the computationally intensive Monte Carlo simulation, thus reducing the computation cost of fitness evaluation in VRPSD, while directing the search towards robust and reliable solutions. Furthermore, a self-adaptive individual learning based on the conceptual modelling of memeplex is introduced in the SAMRS. Finally, SAMRS incorporates a gene-meme co-evolution model with genetic and memetic representation to effectively manage the search for solutions in VRPSD. Extensive experimental results are then presented for benchmark problems to demonstrate that the proposed SAMRS serves as an effective means of generating high-quality robust and reliable solutions in VRPSD.
NASA Astrophysics Data System (ADS)
Li, Richard Y.; Di Felice, Rosa; Rohs, Remo; Lidar, Daniel A.
2018-03-01
Transcription factors regulate gene expression, but how these proteins recognize and specifically bind to their DNA targets is still debated. Machine learning models are effective means to reveal interaction mechanisms. Here we studied the ability of a quantum machine learning approach to classify and rank binding affinities. Using simplified data sets of a small number of DNA sequences derived from actual binding affinity experiments, we trained a commercially available quantum annealer to classify and rank transcription factor binding. The results were compared to state-of-the-art classical approaches for the same simplified data sets, including simulated annealing, simulated quantum annealing, multiple linear regression, LASSO, and extreme gradient boosting. Despite technological limitations, we find a slight advantage in classification performance and nearly equal ranking performance using the quantum annealer for these fairly small training data sets. Thus, we propose that quantum annealing might be an effective method to implement machine learning for certain computational biology problems.
gene2drug: a computational tool for pathway-based rational drug repositioning.
Napolitano, Francesco; Carrella, Diego; Mandriani, Barbara; Pisonero-Vaquero, Sandra; Sirci, Francesco; Medina, Diego L; Brunetti-Pierri, Nicola; di Bernardo, Diego
2018-05-01
Drug repositioning has been proposed as an effective shortcut to drug discovery. The availability of large collections of transcriptional responses to drugs enables computational approaches to drug repositioning directly based on measured molecular effects. We introduce a novel computational methodology for rational drug repositioning, which exploits the transcriptional responses following treatment with small molecules. Specifically, given a therapeutic target gene, a prioritization of potentially effective drugs is obtained by assessing their impact on the transcription of genes in the pathway(s) including the target. We performed in silico validation and comparison with a state-of-the-art technique based on similar principles. We next performed experimental validation in two different real-case drug repositioning scenarios: (i) upregulation of the glutamate-pyruvate transaminase (GPT), which has been shown to induce reduction of oxalate levels in a mouse model of primary hyperoxaluria, and (ii) activation of the transcription factor TFEB, a master regulator of lysosomal biogenesis and autophagy, whose modulation may be beneficial in neurodegenerative disorders. A web tool for Gene2drug is freely available at http://gene2drug.tigem.it. An R package is under development and can be obtained from https://github.com/franapoli/gep2pep. Supplementary data are available at Bioinformatics online.
USDA-ARS's Scientific Manuscript database
The non-culturable bacterium ‘Candidatus Liberibacter solanacearum’ (Lso) is the causative agent of zebra chip disease in potato. Computational analysis of the Lso genome revealed a serralysin-like gene based on conserved domains characteristic of genes encoding metalloprotease enzymes similar to se...
We propose the use of gene expression profiling to complement the chemical characterization currently based on HTS assay data and present a case study relevant to the Endocrine Disruptor Screening Program. We have developed computational methods to identify estrogen receptor &alp...
Inferring species trees from incongruent multi-copy gene trees using the Robinson-Foulds distance
2013-01-01
Background Constructing species trees from multi-copy gene trees remains a challenging problem in phylogenetics. One difficulty is that the underlying genes can be incongruent due to evolutionary processes such as gene duplication and loss, deep coalescence, or lateral gene transfer. Gene tree estimation errors may further exacerbate the difficulties of species tree estimation. Results We present a new approach for inferring species trees from incongruent multi-copy gene trees that is based on a generalization of the Robinson-Foulds (RF) distance measure to multi-labeled trees (mul-trees). We prove that it is NP-hard to compute the RF distance between two mul-trees; however, it is easy to calculate this distance between a mul-tree and a singly-labeled species tree. Motivated by this, we formulate the RF problem for mul-trees (MulRF) as follows: Given a collection of multi-copy gene trees, find a singly-labeled species tree that minimizes the total RF distance from the input mul-trees. We develop and implement a fast SPR-based heuristic algorithm for the NP-hard MulRF problem. We compare the performance of the MulRF method (available at http://genome.cs.iastate.edu/CBL/MulRF/) with several gene tree parsimony approaches using gene tree simulations that incorporate gene tree error, gene duplications and losses, and/or lateral transfer. The MulRF method produces more accurate species trees than gene tree parsimony approaches. We also demonstrate that the MulRF method infers in minutes a credible plant species tree from a collection of nearly 2,000 gene trees. Conclusions Our new phylogenetic inference method, based on a generalized RF distance, makes it possible to quickly estimate species trees from large genomic data sets. 
Since the MulRF method, unlike gene tree parsimony, is based on a generic tree distance measure, it is appealing for analyses of genomic data sets, in which many processes such as deep coalescence, recombination, gene duplication and losses as well as phylogenetic error may contribute to gene tree discord. In experiments, the MulRF method estimated species trees accurately and quickly, demonstrating MulRF as an efficient alternative approach for phylogenetic inference from large-scale genomic data sets. PMID:24180377
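As background for the distance used above, the following sketch computes the classic Robinson-Foulds distance between two rooted, singly-labeled trees as the number of clades found in one tree but not the other. It is not the MulRF generalization to multi-labeled trees, and it is not the authors' SPR-based heuristic. The nested-tuple tree encoding, the function names, and the example trees are assumptions made for illustration.

```python
def clusters(tree):
    """Return the non-trivial clades of a rooted tree as frozensets of leaves.

    A tree is a leaf label (str) or a tuple of subtrees.
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    all_leaves = leaves(tree)
    found.discard(all_leaves)  # the root clade is shared by all trees on these leaves
    return found

def rf_distance(tree_a, tree_b):
    """Rooted Robinson-Foulds distance: size of the symmetric difference of clades."""
    return len(clusters(tree_a) ^ clusters(tree_b))

# Hypothetical four-taxon example.
balanced = (("A", "B"), ("C", "D"))
caterpillar = ((("A", "B"), "C"), "D")
distance = rf_distance(balanced, caterpillar)
```

In a species-tree search of the kind the abstract describes, a distance like this would be summed over all input gene trees and the candidate species tree with the smallest total kept.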
Muthu Krishnan, S
2018-05-14
The receptor-associated protein (RAP) is an inhibitor of endocytic receptors that belong to the lipoprotein receptor gene family. In this study, a computational approach was used to find evolutionarily related folds of the RAP proteins. Through structural and sequence-based analysis, we found various protein folds that are very close to the RAP folds. Remote homolog datasets were used to develop different support vector machine (SVM) methods to recognize the homologous RAP fold. This study helps in understanding the relationship of RAP homolog folds based on structure, function and evolutionary history. Copyright © 2018 Elsevier Ltd. All rights reserved.
Advances in color science: from retina to behavior
Chatterjee, Soumya; Field, Greg D.; Horwitz, Gregory D.; Johnson, Elizabeth N.; Koida, Kowa; Mancuso, Katherine
2010-01-01
Color has become a premier model system for understanding how information is processed by neural circuits, and for investigating the relationships among genes, neural circuits and perception. Both the physical stimulus for color and the perceptual output experienced as color are quite well characterized, but the neural mechanisms that underlie the transformation from stimulus to perception are incompletely understood. The past several years have seen important scientific and technical advances that are changing our understanding of these mechanisms. Here, and in the accompanying minisymposium, we review the latest findings and hypotheses regarding color computations in the retina, primary visual cortex and higher-order visual areas, focusing on non-human primates, a model of human color vision. PMID:21068298
Functional Evolution of a cis-Regulatory Module
Palsson, Arnar; Alekseeva, Elena; Bergman, Casey M; Nathan, Janaki; Kreitman, Martin
2005-01-01
Lack of knowledge about how regulatory regions evolve in relation to their structure–function may limit the utility of comparative sequence analysis in deciphering cis-regulatory sequences. To address this we applied reverse genetics to carry out a functional genetic complementation analysis of a eukaryotic cis-regulatory module—the even-skipped stripe 2 enhancer—from four Drosophila species. The evolution of this enhancer is non-clock-like, with important functional differences between closely related species and functional convergence between distantly related species. Functional divergence is attributable to differences in activation levels rather than spatiotemporal control of gene expression. Our findings have implications for understanding enhancer structure–function, mechanisms of speciation and computational identification of regulatory modules. PMID:15757364
Brahma, Rahul; Gurumayum, Sanathoi; Naorem, Leimarembi Devi; Muthaiyan, Mathavan; Gopal, Jeyakodi; Venkatesan, Amouda
2018-05-01
Zika virus (ZIKV), a single-strand RNA flavivirus, is transmitted primarily through Aedes aegypti. The recent outbreaks in America and the unexpected association between ZIKV infection and birth defects have attracted global attention. This calls for an understanding of the molecular mechanisms of ZIKV infection in order to develop effective drug therapy. A systems-level understanding of the biological processes affected by ZIKV infection in fetal brain samples led us to identify candidate genes for pharmaceutical intervention and potential biomarkers for diagnosis. To identify the key genes, transcriptomics data (RNA-Seq; GSE93385) from ZIKV (strain MR766)-infected human fetal neural stem cells were analyzed. In total, 1,084 differentially expressed genes (DEGs) were identified: 471 upregulated and 613 downregulated. Gene ontology term analysis suggested that the downregulated genes are mostly enriched in defense response to virus, receptor binding, laminin binding, extracellular matrix, and endoplasmic reticulum, whereas the upregulated DEGs are enriched in translation initiation, RNA binding, cytosol, and nucleosome. Pathway analysis found systemic lupus erythematosus (SLE) to be the most enriched pathway. A protein-protein interaction (PPI) network was constructed using the STRING database to find the hub genes. Seven key genes with the highest degree, namely cyclin-dependent kinase 1 (CDK1), cyclin B1 (CCNB1), histone cluster 1 H2B family member K (HIST1H2BK), histone cluster 1 H2B family member O (HIST1H2BO), histone cluster 1 H2B family member B (HIST1H2BB), polo-like kinase 1 (PLK1), and cell division cycle 20 (CDC20), were found to be hub genes using Centiscape, a Cytoscape plugin. The modules of the PPI network identified with the Molecular Complex Detection plugin were significant in structural constituent of ribosome, defense response to virus, nucleosome, SLE, extracellular region, and regulation of gene silencing. Thus, the identified key hub genes and pathways shed light on molecular mechanisms that may contribute to the discovery of novel therapeutic targets and the development of new strategies for intervention in ZIKV disease.
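Selecting hub genes by degree, as in the abstract above, amounts to ranking the nodes of the interaction network by how many distinct partners they have. The sketch below shows that ranking on a toy undirected network. The gene labels and edges are invented placeholders, not STRING data, and this is not the Centiscape implementation, which offers further centrality measures.

```python
def degree_ranking(edges):
    """Rank nodes of an undirected network by degree, highest first.

    edges is an iterable of (u, v) pairs; duplicate pairs and self-loops
    are ignored. Ties are broken alphabetically for reproducibility.
    """
    neighbors = {}
    for u, v in edges:
        if u == v:
            continue
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    return sorted(neighbors, key=lambda gene: (-len(neighbors[gene]), gene))

# Hypothetical interaction list (placeholder gene labels).
toy_edges = [
    ("geneA", "geneB"),
    ("geneA", "geneC"),
    ("geneA", "geneD"),
    ("geneB", "geneC"),
    ("geneB", "geneA"),  # duplicate of the first pair, counted once
]
ranking = degree_ranking(toy_edges)
hubs = ranking[:2]  # keep the two best-connected genes
```

Degree is only one notion of centrality; a gene with few partners can still be important if it bridges otherwise separate modules.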
Di, Yanming; Schafer, Daniel W.; Wilhelm, Larry J.; Fox, Samuel E.; Sullivan, Christopher M.; Curzon, Aron D.; Carrington, James C.; Mockler, Todd C.; Chang, Jeff H.
2011-01-01
GENE-counter is a complete Perl-based computational pipeline for analyzing RNA-Sequencing (RNA-Seq) data for differential gene expression. In addition to its use in studying transcriptomes of eukaryotic model organisms, GENE-counter is applicable for prokaryotes and non-model organisms without an available genome reference sequence. For alignments, GENE-counter is configured for CASHX, Bowtie, and BWA, but an end user can use any Sequence Alignment/Map (SAM)-compliant program of preference. To analyze data for differential gene expression, GENE-counter can be run with any one of three statistics packages that are based on variations of the negative binomial distribution. The default method is a new and simple statistical test we developed based on an over-parameterized version of the negative binomial distribution. GENE-counter also includes three different methods for assessing differentially expressed features for enriched gene ontology (GO) terms. Results are transparent and data are systematically stored in a MySQL relational database to facilitate additional analyses as well as quality assessment. We used next generation sequencing to generate a small-scale RNA-Seq dataset derived from the heavily studied defense response of Arabidopsis thaliana and used GENE-counter to process the data. Collectively, the support from analysis of microarrays as well as the observed and substantial overlap in results from each of the three statistics packages demonstrates that GENE-counter is well suited for handling the unique characteristics of small sample sizes and high variability in gene counts. PMID:21998647
Li, Edward B; Truong, Dawn; Hallett, Shawn A; Mukherjee, Kusumika; Schutte, Brian C; Liao, Eric C
2017-09-01
Large-scale sequencing efforts have captured a rapidly growing catalogue of genetic variations. However, the accurate establishment of gene variant pathogenicity remains a central challenge in translating personal genomics information to clinical decisions. Interferon Regulatory Factor 6 (IRF6) gene variants are significant genetic contributors to orofacial clefts. Although approximately three hundred IRF6 gene variants have been documented, their effects on protein functions remain difficult to interpret. Here, we demonstrate the protein functions of human IRF6 missense gene variants could be rapidly assessed in detail by their abilities to rescue the irf6 -/- phenotype in zebrafish through variant mRNA microinjections at the one-cell stage. The results revealed many missense variants previously predicted by traditional statistical and computational tools to be loss-of-function and pathogenic retained partial or full protein function and rescued the zebrafish irf6 -/- periderm rupture phenotype. Through mRNA dosage titration and analysis of the Exome Aggregation Consortium (ExAC) database, IRF6 missense variants were grouped by their abilities to rescue at various dosages into three functional categories: wild type function, reduced function, and complete loss-of-function. This sensitive and specific biological assay was able to address the nuanced functional significances of IRF6 missense gene variants and overcome many limitations faced by current statistical and computational tools in assigning variant protein function and pathogenicity. Furthermore, it unlocked the possibility for characterizing yet undiscovered human IRF6 missense gene variants from orofacial cleft patients, and illustrated a generalizable functional genomics paradigm in personalized medicine.
Abduallah, Yasser; Turki, Turki; Byron, Kevin; Du, Zongxuan; Cervantes-Cervantes, Miguel; Wang, Jason T L
2017-01-01
Gene regulation is a series of processes that control gene expression and its extent. The connections among genes and their regulatory molecules, usually transcription factors, and a descriptive model of such connections are known as gene regulatory networks (GRNs). Elucidating GRNs is crucial to understand the inner workings of the cell and the complexity of gene interactions. To date, numerous algorithms have been developed to infer gene regulatory networks. However, as the number of identified genes increases and the complexity of their interactions is uncovered, networks and their regulatory mechanisms become cumbersome to test. Furthermore, prodding through experimental results requires an enormous amount of computation, resulting in slow data processing. Therefore, new approaches are needed to expeditiously analyze copious amounts of experimental data resulting from cellular GRNs. To meet this need, cloud computing is promising as reported in the literature. Here, we propose new MapReduce algorithms for inferring gene regulatory networks on a Hadoop cluster in a cloud environment. These algorithms employ an information-theoretic approach to infer GRNs using time-series microarray data. Experimental results show that our MapReduce program is much faster than an existing tool while achieving slightly better prediction accuracy than the existing tool.
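The abstract above does not name its information-theoretic score. A common choice for this kind of inference is the mutual information between discretized expression profiles, which the following minimal sketch illustrates on a single machine (no MapReduce). The equal-width binning and the function names `discretize` and `mutual_information` are assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import Counter


def discretize(values, bins=3):
    """Map expression values to equal-width bin indices (illustrative choice)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    # Clamp the maximum value into the last bin.
    return [min(int((v - lo) / width), bins - 1) for v in values]


def mutual_information(x, y):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in pxy.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[a] * py[b]))
    return mi


# Hypothetical time series: the second gene tracks the first one.
regulator = discretize([0.1, 0.9, 0.2, 0.8, 0.1, 0.9, 0.2, 0.8], bins=2)
target = discretize([1.0, 5.0, 1.2, 4.8, 1.1, 5.1, 1.3, 4.9], bins=2)
print(mutual_information(regulator, target))
```

In a network-inference setting, a score like this would be computed for every gene pair and an edge kept when the score exceeds a chosen threshold; the pairwise structure is what makes the problem easy to distribute.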
InteGO2: A web tool for measuring and visualizing gene semantic similarities using Gene Ontology
Peng, Jiajie; Li, Hongxiang; Liu, Yongzhuang; ...
2016-08-31
The Gene Ontology (GO) has been used in high-throughput omics research as a major bioinformatics resource. The hierarchical structure of GO provides users a convenient platform for biological information abstraction and hypothesis testing. Computational methods have been developed to identify functionally similar genes. However, none of the existing measurements take into account all the rich information in GO. Similarly, using these existing methods, web-based applications have been constructed to compute gene functional similarities, and to provide pure text-based outputs. Without a graphical visualization interface, the results are difficult to interpret. As a result, we present InteGO2, a web tool that allows researchers to calculate the GO-based gene semantic similarities using seven widely used GO-based similarity measurements. Also, we provide an integrative measurement that synergistically integrates all the individual measurements to improve the overall performance. Using HTML5 and cytoscape.js, we provide a graphical interface in InteGO2 to visualize the resulting gene functional association networks. In conclusion, InteGO2 is an easy-to-use HTML5 based web tool. With it, researchers can measure gene or gene product functional similarity conveniently, and visualize the network of functional interactions in a graphical interface.
InteGO2: a web tool for measuring and visualizing gene semantic similarities using Gene Ontology.
Peng, Jiajie; Li, Hongxiang; Liu, Yongzhuang; Juan, Liran; Jiang, Qinghua; Wang, Yadong; Chen, Jin
2016-08-31
The Gene Ontology (GO) has been used in high-throughput omics research as a major bioinformatics resource. The hierarchical structure of GO provides users a convenient platform for biological information abstraction and hypothesis testing. Computational methods have been developed to identify functionally similar genes. However, none of the existing measurements take into account all the rich information in GO. Similarly, using these existing methods, web-based applications have been constructed to compute gene functional similarities, and to provide pure text-based outputs. Without a graphical visualization interface, the results are difficult to interpret. We present InteGO2, a web tool that allows researchers to calculate the GO-based gene semantic similarities using seven widely used GO-based similarity measurements. Also, we provide an integrative measurement that synergistically integrates all the individual measurements to improve the overall performance. Using HTML5 and cytoscape.js, we provide a graphical interface in InteGO2 to visualize the resulting gene functional association networks. InteGO2 is an easy-to-use HTML5 based web tool. With it, researchers can measure gene or gene product functional similarity conveniently, and visualize the network of functional interactions in a graphical interface. InteGO2 can be accessed via http://mlg.hit.edu.cn:8089/ .
InteGO2: A web tool for measuring and visualizing gene semantic similarities using Gene Ontology
DOE Office of Scientific and Technical Information (OSTI.GOV)
Peng, Jiajie; Li, Hongxiang; Liu, Yongzhuang
The Gene Ontology (GO) has been used in high-throughput omics research as a major bioinformatics resource. The hierarchical structure of GO provides users a convenient platform for biological information abstraction and hypothesis testing. Computational methods have been developed to identify functionally similar genes. However, none of the existing measurements take into account all the rich information in GO. Similarly, using these existing methods, web-based applications have been constructed to compute gene functional similarities, and to provide pure text-based outputs. Without a graphical visualization interface, the results are difficult to interpret. As a result, we present InteGO2, a web tool that allows researchers to calculate the GO-based gene semantic similarities using seven widely used GO-based similarity measurements. Also, we provide an integrative measurement that synergistically integrates all the individual measurements to improve the overall performance. Using HTML5 and cytoscape.js, we provide a graphical interface in InteGO2 to visualize the resulting gene functional association networks. In conclusion, InteGO2 is an easy-to-use HTML5 based web tool. With it, researchers can measure gene or gene product functional similarity conveniently, and visualize the network of functional interactions in a graphical interface.
Gene coexpression measures in large heterogeneous samples using count statistics.
Wang, Y X Rachel; Waterman, Michael S; Huang, Haiyan
2014-11-18
With the advent of high-throughput technologies making large-scale gene expression data readily available, developing appropriate computational tools to process these data and distill insights into systems biology has been an important part of the "big data" challenge. Gene coexpression is one of the earliest techniques developed that is still widely in use for functional annotation, pathway analysis, and, most importantly, the reconstruction of gene regulatory networks, based on gene expression data. However, most coexpression measures do not specifically account for local features in expression profiles. For example, it is very likely that the patterns of gene association may change or only exist in a subset of the samples, especially when the samples are pooled from a range of experiments. We propose two new gene coexpression statistics based on counting local patterns of gene expression ranks to take into account the potentially diverse nature of gene interactions. In particular, one of our statistics is designed for time-course data with local dependence structures, such as time series coupled over a subregion of the time domain. We provide asymptotic analysis of their distributions and power, and evaluate their performance against a wide range of existing coexpression measures on simulated and real data. Our new statistics are fast to compute, robust against outliers, and show comparable and often better general performance.
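The idea of counting local patterns of expression ranks can be sketched as follows. This is not the authors' exact statistic: it is a simplified sliding-window variant, assuming a window length `k` and counting windows in which two genes show the same rank pattern or exactly reversed patterns. The names `rank_pattern` and `local_pattern_count` are hypothetical.

```python
def rank_pattern(window):
    """Return the rank order of a window as a tuple, e.g. (0, 2, 1)."""
    order = sorted(range(len(window)), key=lambda i: window[i])
    ranks = [0] * len(window)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return tuple(ranks)


def local_pattern_count(x, y, k=3):
    """Count length-k sliding windows where x and y share a rank pattern,
    or where the pattern of y is the exact reverse of that of x."""
    count = 0
    for start in range(len(x) - k + 1):
        px = rank_pattern(x[start:start + k])
        py = rank_pattern(y[start:start + k])
        reversed_py = tuple(k - 1 - r for r in py)
        if px == py or px == reversed_py:
            count += 1
    return count


# Hypothetical profiles: the genes agree in the first window and are
# inversely related in a later one, a purely local association.
gene_x = [1, 3, 2, 5, 4, 6]
gene_y = [10, 30, 20, 7, 9, 8]
print(local_pattern_count(gene_x, gene_y, k=3))
```

Because only ranks within short windows are compared, a single extreme value can change at most `k` windows, which is one intuition for the robustness to outliers mentioned in the abstract.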
Computational screening of disease-associated mutations in OCA2 gene.
Kamaraj, Balu; Purohit, Rituraj
2014-01-01
Oculocutaneous albinism type 2 (OCA2), caused by mutations of OCA2 gene, is an autosomal recessive disorder characterized by reduced biosynthesis of melanin pigment in the skin, hair, and eyes. The OCA2 gene encodes instructions for making a protein called the P protein. This protein plays a crucial role in melanosome biogenesis, and controls the eumelanin content in melanocytes in part via the processing and trafficking of tyrosinase which is the rate-limiting enzyme in melanin synthesis. In this study we analyzed the pathogenic effect of 95 non-synonymous single nucleotide polymorphisms reported in OCA2 gene using computational methods. We found R305W mutation as most deleterious and disease associated using SIFT, PolyPhen, PANTHER, PhD-SNP, Pmut, and MutPred tools. To understand the atomic arrangement in 3D space, the native and mutant (R305W) structures were modeled. Molecular dynamics simulation was conducted to observe the structural significance of computationally prioritized disease-associated mutation (R305W). Root-mean-square deviation, root-mean-square fluctuation, radius of gyration, solvent accessibility surface area, hydrogen bond (NH bond), trace of covariance matrix, eigenvector projection analysis, and density analysis results showed prominent loss of stability and rise in mutant flexibility values in 3D space. This study presents a well designed computational methodology to examine the albinism-associated SNPs.
Efficiently Identifying Significant Associations in Genome-wide Association Studies
Eskin, Eleazar
2013-01-01
Over the past several years, genome-wide association studies (GWAS) have implicated hundreds of genes in common disease. More recently, the GWAS approach has been utilized to identify regions of the genome that harbor variation affecting gene expression, known as expression quantitative trait loci (eQTLs). Unlike GWAS applied to clinical traits, where only a handful of phenotypes are analyzed per study, in eQTL studies, tens of thousands of gene expression levels are measured, and the GWAS approach is applied to each gene expression level. This leads to computing billions of statistical tests and requires substantial computational resources, particularly when applying novel statistical methods such as mixed models. We introduce a novel two-stage testing procedure that identifies all of the significant associations more efficiently than testing all the single nucleotide polymorphisms (SNPs). In the first stage, a small number of informative SNPs, or proxies, across the genome are tested. Based on their observed associations, our approach locates the regions that may contain significant SNPs and only tests additional SNPs from those regions. We show through simulations and analysis of real GWAS datasets that the proposed two-stage procedure increases the computational speed by a factor of 10. Additionally, efficient implementation of our software increases the computational speed relative to the state-of-the-art testing approaches by a factor of 75. PMID:24033261
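The two-stage idea can be illustrated with a toy scan. The sketch below is a simplification under stated assumptions: each region has one designated proxy (its first SNP), a fixed lenient cutoff decides whether the rest of the region is tested, and the statistic is n·r². The published procedure chooses which regions to follow up so that no significant SNP is missed; this sketch makes no such guarantee, and all names and cutoffs are hypothetical.

```python
def assoc_stat(genotypes, phenotype):
    """n * r^2, an approximate 1-df chi-square association statistic."""
    n = len(phenotype)
    mg, mp = sum(genotypes) / n, sum(phenotype) / n
    cov = sum((g - mg) * (p - mp) for g, p in zip(genotypes, phenotype))
    vg = sum((g - mg) ** 2 for g in genotypes)
    vp = sum((p - mp) ** 2 for p in phenotype)
    return n * cov * cov / (vg * vp) if vg > 0 and vp > 0 else 0.0


def two_stage_scan(regions, phenotype, proxy_cutoff, final_cutoff):
    """regions maps a region name to a list of (snp_id, genotypes);
    the first SNP of each region serves as its proxy (an assumption)."""
    hits, tests = [], 0
    for snps in regions.values():
        proxy_id, proxy_geno = snps[0]
        tests += 1
        proxy_stat = assoc_stat(proxy_geno, phenotype)
        if proxy_stat >= final_cutoff:
            hits.append(proxy_id)
        if proxy_stat < proxy_cutoff:
            continue  # stage 2 is skipped for this region
        for snp_id, geno in snps[1:]:
            tests += 1
            if assoc_stat(geno, phenotype) >= final_cutoff:
                hits.append(snp_id)
    return hits, tests


# Hypothetical data: region A carries the associated SNP, region B does not.
phenotype = [0, 1, 2, 0, 1, 2, 0, 2]
regions = {
    "A": [("rsA1", [0, 1, 2, 0, 1, 1, 0, 2]),
          ("rsA2", [0, 1, 2, 0, 1, 2, 0, 2]),
          ("rsA3", [1, 0, 0, 1, 2, 0, 1, 0])],
    "B": [("rsB1", [1, 2, 1, 0, 0, 0, 0, 0]),
          ("rsB2", [0, 0, 1, 1, 0, 0, 1, 1]),
          ("rsB3", [1, 1, 0, 0, 1, 1, 0, 0])],
}
hits, tests = two_stage_scan(regions, phenotype, proxy_cutoff=3.0, final_cutoff=7.5)
print(hits, tests, "of", sum(len(s) for s in regions.values()))
```

The saving comes from region B: its proxy shows no association, so its remaining SNPs are never tested.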
Gene Regulation Networks for Modeling Drosophila Development
NASA Technical Reports Server (NTRS)
Mjolsness, E.
1999-01-01
This chapter briefly introduces and reviews some computational experiments in using trainable gene regulation network models to simulate and understand selected episodes in the development of the fruit fly, Drosophila melanogaster.
The Immunological Genome Project: networks of gene expression in immune cells.
Heng, Tracy S P; Painter, Michio W
2008-10-01
The Immunological Genome Project combines immunology and computational biology laboratories in an effort to establish a complete 'road map' of gene-expression and regulatory networks in all immune cells.
Messmer, Bradley T; Raphael, Benjamin J; Aerni, Sarah J; Widhopf, George F; Rassenti, Laura Z; Gribben, John G; Kay, Neil E; Kipps, Thomas J
2009-01-01
The leukemia cells of unrelated patients with chronic lymphocytic leukemia (CLL) display a restricted repertoire of immunoglobulin (Ig) gene rearrangements with preferential usage of certain Ig gene segments. We developed a computational method to rigorously quantify biases in Ig sequence similarity in large patient databases and to identify groups of patients with unusual levels of sequence similarity. We applied our method to sequences from 1577 CLL patients through the CLL Research Consortium (CRC), and identified 67 similarity groups into which roughly 20% of all patients could be assigned. Immunoglobulin light chain class was highly correlated within all groups and light chain gene usage was similar within sets. Surprisingly, over 40% of the identified groups were composed of somatically mutated genes. This study significantly expands the evidence that antigen selection shapes the Ig repertoire in CLL. PMID:18640719
Messmer, Bradley T; Raphael, Benjamin J; Aerni, Sarah J; Widhopf, George F; Rassenti, Laura Z; Gribben, John G; Kay, Neil E; Kipps, Thomas J
2009-03-01
The leukemia cells of unrelated patients with chronic lymphocytic leukemia (CLL) display a restricted repertoire of immunoglobulin (Ig) gene rearrangements with preferential usage of certain Ig gene segments. We developed a computational method to rigorously quantify biases in Ig sequence similarity in large patient databases and to identify groups of patients with unusual levels of sequence similarity. We applied our method to sequences from 1577 CLL patients through the CLL Research Consortium (CRC), and identified 67 similarity groups into which roughly 20% of all patients could be assigned. Immunoglobulin light chain class was highly correlated within all groups and light chain gene usage was similar within sets. Surprisingly, over 40% of the identified groups were composed of somatically mutated genes. This study significantly expands the evidence that antigen selection shapes the Ig repertoire in CLL.
Toiviainen-Salo, Sanna; Raade, Merja; Durie, Peter R; Ip, Wan; Marttinen, Eino; Savilahti, Erkki; Mäkitie, Outi
2008-03-01
Pancreatic MRI was evaluated in 14 patients with a clinical diagnosis of Shwachman-Diamond syndrome, and the findings were correlated with Shwachman-Bodian-Diamond gene (SBDS) genotype. The findings suggest that patients with mutations in the SBDS gene have a characteristic magnetic resonance imaging pattern of fat-replaced pancreas and that SBDS mutations are unlikely in patients without this pattern.
Jambusaria, Ankit; Klomp, Jeff; Hong, Zhigang; Rafii, Shahin; Dai, Yang; Malik, Asrar B; Rehman, Jalees
2018-06-07
The heterogeneity of cells across tissue types represents a major challenge for studying biological mechanisms as well as for therapeutic targeting of distinct tissues. Computational prediction of tissue-specific gene regulatory networks may provide important insights into the mechanisms underlying the cellular heterogeneity of cells in distinct organs and tissues. Using three pathway analysis techniques, gene set enrichment analysis (GSEA), parametric analysis of gene set enrichment (PGSEA), alongside our novel model (HeteroPath), which assesses heterogeneously upregulated and downregulated genes within the context of pathways, we generated distinct tissue-specific gene regulatory networks. We analyzed gene expression data derived from freshly isolated heart, brain, and lung endothelial cells and populations of neurons in the hippocampus, cingulate cortex, and amygdala. In both datasets, we found that HeteroPath segregated the distinct cellular populations by identifying regulatory pathways that were not identified by GSEA or PGSEA. Using simulated datasets, HeteroPath demonstrated robustness that was comparable to what was seen using existing gene set enrichment methods. Furthermore, we generated tissue-specific gene regulatory networks involved in vascular heterogeneity and neuronal heterogeneity by performing motif enrichment of the heterogeneous genes identified by HeteroPath and linking the enriched motifs to regulatory transcription factors in the ENCODE database. HeteroPath assesses contextual bidirectional gene expression within pathways and thus allows for transcriptomic assessment of cellular heterogeneity. Unraveling tissue-specific heterogeneity of gene expression can lead to a better understanding of the molecular underpinnings of tissue-specific phenotypes.
2012-01-01
Background: Francisella is a genus of gram-negative bacteria that are highly virulent in fish and humans; F. tularensis causes the serious disease tularaemia in humans. Recently, Francisella species have been reported to cause mortality in aquaculture species such as Atlantic cod and tilapia. We have completed the sequencing and draft assembly of the Francisella noatunensis subsp. orientalis Toba04 strain isolated from farmed tilapia. Compared to other available Francisella genomes, it is most similar to the genome of Francisella philomiragia subsp. philomiragia, a free-living bacterium not virulent to humans. Results: The genome is rearranged compared to the available Francisella genomes even though we found no IS-elements in the genome. Nearly 16% of the predicted ORFs are pseudogenes. Computational pathway analysis indicates that a number of the metabolic pathways are disrupted due to pseudogenes. Comparing the novel genome with other available Francisella genomes, we found around 2.5% of unique genes present in Francisella noatunensis subsp. orientalis Toba04 and a list of genes uniquely present in the human-pathogenic Francisella subspecies. Most of these genes might have been transferred from bacterial species through horizontal gene transfer. Comparative analysis between human and fish pathogens also provides insights into genes responsible for pathogenicity. Our analysis of pseudogenes indicates that the evolution of pseudogenes in the Francisella subspecies from tilapia is old, with a large number of pseudogenes having more than one inactivating mutation. Conclusions: The fish pathogen lost non-essential genes some time ago. Evolutionary analysis of the Francisella genomes strongly suggests that human and fish pathogenic Francisella species have evolved independently from free-living, metabolically competent Francisella species. These findings will contribute to understanding the evolution of Francisella species and pathogenesis. PMID:23131096
Survey of the Heritability and Sparse Architecture of Gene Expression Traits across Human Tissues.
Wheeler, Heather E; Shah, Kaanan P; Brenner, Jonathon; Garcia, Tzintzuni; Aquino-Michaels, Keston; Cox, Nancy J; Nicolae, Dan L; Im, Hae Kyung
2016-11-01
Understanding the genetic architecture of gene expression traits is key to elucidating the underlying mechanisms of complex traits. Here, for the first time, we perform a systematic survey of the heritability and the distribution of effect sizes across all representative tissues in the human body. We find that local h² can be relatively well characterized with 59% of expressed genes showing significant h² (FDR < 0.1) in the DGN whole blood cohort. However, current sample sizes (n ≤ 922) do not allow us to compute distal h². Bayesian Sparse Linear Mixed Model (BSLMM) analysis provides strong evidence that the genetic contribution to local expression traits is dominated by a handful of genetic variants rather than by the collective contribution of a large number of variants each of modest size. In other words, the local architecture of gene expression traits is sparse rather than polygenic across all 40 tissues (from DGN and GTEx) examined. This result is confirmed by the sparsity of optimal performing gene expression predictors via elastic net modeling. To further explore the tissue context specificity, we decompose the expression traits into cross-tissue and tissue-specific components using a novel Orthogonal Tissue Decomposition (OTD) approach. Through a series of simulations we show that the cross-tissue and tissue-specific components are identifiable via OTD. Heritability and sparsity estimates of these derived expression phenotypes show similar characteristics to the original traits. Consistent properties relative to prior GTEx multi-tissue analysis results suggest that these traits reflect the expected biology. Finally, we apply this knowledge to develop prediction models of gene expression traits for all tissues. The prediction models, heritability, and prediction performance R² for original and decomposed expression phenotypes are made publicly available (https://github.com/hakyimlab/PrediXcan).
A hybrid correlation analysis with application to imaging genetics
NASA Astrophysics Data System (ADS)
Hu, Wenxing; Fang, Jian; Calhoun, Vince D.; Wang, Yu-Ping
2018-03-01
Investigating the association between brain regions and genes continues to be a challenging topic in imaging genetics. Current brain region of interest (ROI)-gene association studies normally reduce data dimension by averaging the value of voxels in each ROI. This averaging may lead to a loss of information due to the existence of functional sub-regions. Pearson correlation is widely used for association analysis. However, it only detects linear correlation whereas nonlinear correlation may exist among ROIs. In this work, we introduced distance correlation to ROI-gene association analysis, which can detect both linear and nonlinear correlations and overcome the limitation of averaging operations by taking advantage of the information at each voxel. Nevertheless, distance correlation usually has a much lower value than Pearson correlation. To address this problem, we proposed a hybrid correlation analysis approach, by applying canonical correlation analysis (CCA) to the distance covariance matrix instead of directly computing distance correlation. Incorporating CCA into distance correlation approach may be more suitable for complex disease study because it can detect highly associated pairs of ROI and gene groups, and may improve the distance correlation level and statistical power. In addition, we developed a novel nonlinear CCA, called distance kernel CCA, which seeks the optimal combination of features with the most significant dependence. This approach was applied to imaging genetic data from the Philadelphia Neurodevelopmental Cohort (PNC). Experiments showed that our hybrid approach produced more consistent results than conventional CCA across resampling and both the correlation and statistical significance were increased compared to distance correlation analysis. Further gene enrichment analysis and region of interest (ROI) analysis confirmed the associations of the identified genes with brain ROIs. 
Therefore, our approach provides a powerful tool for finding the correlation between brain imaging and genomic data.
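The contrast drawn above between Pearson correlation and distance correlation can be made concrete with a small sketch. It computes the standard sample distance correlation for two one-dimensional variables; it does not implement the hybrid CCA step or the distance kernel CCA described in the abstract, and the function names are illustrative.

```python
import math


def _centered_distances(values):
    """Pairwise |a - b| distance matrix, double-centered."""
    n = len(values)
    d = [[abs(a - b) for b in values] for a in values]
    row_means = [sum(row) / n for row in d]
    grand_mean = sum(row_means) / n
    # The matrix is symmetric, so column means equal row means.
    return [[d[i][j] - row_means[i] - row_means[j] + grand_mean
             for j in range(n)] for i in range(n)]


def distance_correlation(x, y):
    """Sample distance correlation of two equal-length 1-D sequences."""
    n = len(x)
    a, b = _centered_distances(x), _centered_distances(y)
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n ** 2
    dvar_x = sum(v * v for row in a for v in row) / n ** 2
    dvar_y = sum(v * v for row in b for v in row) / n ** 2
    denom = math.sqrt(dvar_x * dvar_y)
    return math.sqrt(max(dcov2, 0.0) / denom) if denom > 0 else 0.0


def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# A purely nonlinear (quadratic) dependence: Pearson misses it,
# distance correlation does not.
x = [-2, -1, 0, 1, 2]
y = [v * v for v in x]
print(pearson(x, y), distance_correlation(x, y))
```

The quadratic example is chosen so that the linear correlation vanishes by symmetry while the dependence is deterministic. The same pairwise-distance construction extends to vector-valued observations by replacing the absolute difference with a Euclidean distance.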
Moteghaed, Niloofar Yousefi; Maghooli, Keivan; Garshasbi, Masoud
2018-01-01
Background: Gene expression data are characteristically high dimensional with a small sample size in contrast to the feature size and variability inherent in biological processes that contribute to difficulties in analysis. Selection of highly discriminative features decreases the computational cost and complexity of the classifier and improves its reliability for prediction of a new class of samples. Methods: The present study used hybrid particle swarm optimization and genetic algorithms for gene selection and a fuzzy support vector machine (SVM) as the classifier. Fuzzy logic is used to infer the importance of each sample in the training phase and decrease the outlier sensitivity of the system to increase the ability to generalize the classifier. A decision-tree algorithm was applied to the most frequent genes to develop a set of rules for each type of cancer. This improved the abilities of the algorithm by finding the best parameters for the classifier during the training phase without the need for trial-and-error by the user. The proposed approach was tested on four benchmark gene expression profiles. Results: Good results have been demonstrated for the proposed algorithm. The classification accuracy for leukemia data is 100%, for colon cancer is 96.67% and for breast cancer is 98%. The results show that the best kernel used in training the SVM classifier is the radial basis function. Conclusions: The experimental results show that the proposed algorithm can decrease the dimensionality of the dataset, determine the most informative gene subset, and improve classification accuracy using the optimal parameters of the classifier with no user interface. PMID:29535919
Genome-wide analysis of epistasis in body mass index using multiple human populations.
Wei, Wen-Hua; Hemani, Gib; Gyenesei, Attila; Vitart, Veronique; Navarro, Pau; Hayward, Caroline; Cabrera, Claudia P; Huffman, Jennifer E; Knott, Sara A; Hicks, Andrew A; Rudan, Igor; Pramstaller, Peter P; Wild, Sarah H; Wilson, James F; Campbell, Harry; Hastie, Nicholas D; Wright, Alan F; Haley, Chris S
2012-08-01
We surveyed gene-gene interactions (epistasis) in human body mass index (BMI) in four European populations (n<1200) via exhaustive pair-wise genome scans, where interactions were computed as F ratios by testing a linear regression model fitting two single-nucleotide polymorphisms (SNPs) with interactions against the one without. Before the association tests, BMI was corrected for sex and age, normalised and adjusted for relatedness. Neither single SNPs nor SNP interactions were genome-wide significant in any cohort based on the consensus threshold (P=5.0E-08) and a Bonferroni corrected threshold (P=1.1E-12), respectively. Next we compared sub genome-wide significant SNP interactions (P<5.0E-08) across cohorts to identify common epistatic signals, where SNPs were annotated to genes to test for gene ontology (GO) enrichment. Among the epistatic genes contributing to the commonly enriched GO terms, 19 were shared across study cohorts, of which 15 are previously published genome-wide association loci, including CDH13 (cadherin 13) associated with height and SORCS2 (sortilin-related VPS10 domain containing receptor 2) associated with circulating insulin-like growth factor 1 and binding protein 3. Interactions between the 19 shared epistatic genes and those involving BMI candidate loci (P<5.0E-08) were tested across cohorts; eight replicated at the SNP level (P<0.05) in at least one cohort, and these were further tested and showed limited replication in a separate European population (n>5000). We conclude that genome-wide analysis of epistasis in multiple populations is an effective approach to provide new insights into the genetic regulation of BMI but requires additional efforts to confirm the findings.
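The F ratio for one SNP pair can be sketched as a comparison of nested least-squares fits. The sketch assumes additive 0/1/2 genotype coding and a single product term for the interaction, so the numerator has one degree of freedom; the study may have used a richer parameterization. The helper names and the toy data are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def rss(design, y):
    """Residual sum of squares of the least-squares fit of y on design."""
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    beta = solve(xtx, xty)
    return sum((yi - sum(bc * v for bc, v in zip(beta, row))) ** 2
               for row, yi in zip(design, y))


def interaction_f_ratio(snp1, snp2, y):
    """F ratio of y ~ snp1 + snp2 + snp1*snp2 against y ~ snp1 + snp2."""
    reduced = [[1.0, a, b] for a, b in zip(snp1, snp2)]
    full = [[1.0, a, b, a * b] for a, b in zip(snp1, snp2)]
    rss_r, rss_f = rss(reduced, y), rss(full, y)
    df_num = 1              # one extra parameter in the full model
    df_den = len(y) - 4     # samples minus parameters in the full model
    return ((rss_r - rss_f) / df_num) / (rss_f / df_den)


# Hypothetical genotypes and a small fixed perturbation standing in for noise.
snp1 = [0, 0, 0, 1, 1, 1, 2, 2, 2, 0, 1, 2]
snp2 = [0, 1, 2, 0, 1, 2, 0, 1, 2, 1, 2, 0]
noise = [0.1, -0.2, 0.05, 0.15, -0.1, 0.2, -0.15, 0.1, -0.05, 0.2, -0.1, -0.2]
y_add = [a + b + e for a, b, e in zip(snp1, snp2, noise)]
y_epi = [a + b + 2 * a * b + e for a, b, e in zip(snp1, snp2, noise)]
f_add = interaction_f_ratio(snp1, snp2, y_add)
f_epi = interaction_f_ratio(snp1, snp2, y_epi)
print(f_add, f_epi)
```

In an exhaustive scan this comparison is repeated for every SNP pair, which is why the Bonferroni-corrected threshold quoted in the abstract is so much stricter than the single-SNP consensus threshold.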
NASA Astrophysics Data System (ADS)
Tripathi, Shubham; Deem, Michael W.
2015-02-01
Cancer progresses with a change in the structure of the gene network in normal cells. We define a measure of organizational hierarchy in gene networks of affected cells in adult acute myeloid leukemia (AML) patients. With a retrospective cohort analysis based on the gene expression profiles of 116 AML patients, we find that the likelihood of future cancer relapse and the level of clinical risk are directly correlated with the level of organization in the cancer related gene network. We also explore the variation of the level of organization in the gene network with cancer progression. We find that this variation is non-monotonic, which implies the fitness landscape in the evolution of AML cancer cells is non-trivial. We further find that the hierarchy in gene expression at the time of diagnosis may be a useful biomarker in AML prognosis.
van den Broek, Evert; van Lieshout, Stef; Rausch, Christian; Ylstra, Bauke; van de Wiel, Mark A; Meijer, Gerrit A; Fijneman, Remond J A; Abeln, Sanne
2016-01-01
Development of cancer is driven by somatic alterations, including numerical and structural chromosomal aberrations. Currently, several computational methods are available and are widely applied to detect numerical copy number aberrations (CNAs) of chromosomal segments in tumor genomes. However, there is a lack of computational methods that systematically detect structural chromosomal aberrations by virtue of the genomic location of CNA-associated chromosomal breaks and identify genes that appear non-randomly affected by chromosomal breakpoints across (large) series of tumor samples. 'GeneBreak' is developed to systematically identify genes recurrently affected by the genomic location of chromosomal CNA-associated breaks by a genome-wide approach, which can be applied to DNA copy number data obtained by array-Comparative Genomic Hybridization (CGH) or by (low-pass) whole genome sequencing (WGS). First, 'GeneBreak' collects the genomic locations of chromosomal CNA-associated breaks that were previously pinpointed by the segmentation algorithm that was applied to obtain CNA profiles. Next, a tailored annotation approach for breakpoint-to-gene mapping is implemented. Finally, dedicated cohort-based statistics are incorporated, with correction for covariates that influence the probability of being a breakpoint gene. In addition, multiple testing correction is integrated to reveal recurrent breakpoint events. This easy-to-use algorithm, 'GeneBreak', is implemented in R ( www.cran.r-project.org ) and is available from Bioconductor ( www.bioconductor.org/packages/release/bioc/html/GeneBreak.html ).
A new computational strategy for predicting essential genes.
Cheng, Jian; Wu, Wenwu; Zhang, Yinwen; Li, Xiangchen; Jiang, Xiaoqian; Wei, Gehong; Tao, Shiheng
2013-12-21
Determination of the minimum gene set for cellular life is one of the central goals in biology. Genome-wide essential gene identification has progressed rapidly in certain bacterial species; however, it remains difficult to achieve in most eukaryotic species. Several computational models have recently been developed to integrate gene features and used as alternatives to transfer gene essentiality annotations between organisms. We first collected features that were widely used by previous predictive models and assessed the relationships between gene features and gene essentiality using a stepwise regression model. We found two issues that could significantly reduce model accuracy: (i) the effect of multicollinearity among gene features and (ii) the diverse and even contrasting correlations between gene features and gene essentiality existing within and among different species. To address these issues, we developed a novel model called feature-based weighted Naïve Bayes model (FWM), which is based on Naïve Bayes classifiers, logistic regression, and genetic algorithm. The proposed model assesses features and filters out the effects of multicollinearity and diversity. The performance of FWM was compared with other popular models, such as support vector machine, Naïve Bayes model, and logistic regression model, by applying FWM to reciprocally predict essential genes among and within 21 species. Our results showed that FWM significantly improves the accuracy and robustness of essential gene prediction. FWM can remarkably improve the accuracy of essential gene prediction and may be used as an alternative method for other classification work. This method can contribute substantially to the knowledge of the minimum gene sets required for living organisms and the discovery of new drug targets.
Wilson, Paul; Larminie, Christopher; Smith, Rona
2016-01-01
Objectives: To use literature mining to catalogue Behçet's associated genes, and advanced computational methods to improve the understanding of the pathways and signalling mechanisms that lead to the typical clinical characteristics of Behçet's patients; and to extend this technique to identify potential treatment targets for further experimental validation. Methods: Text mining methods combined with gene enrichment tools, pathway analysis and causal analysis algorithms. Results: This approach identified 247 human genes associated with Behçet's disease and the resulting disease map, comprising 644 nodes and 19220 edges, captured important details of the relationships between these genes and their associated pathways, as described in diverse data repositories. Pathway analysis has identified how Behçet's associated genes are likely to participate in innate and adaptive immune responses. Causal analysis algorithms have identified a number of potential therapeutic strategies for further investigation. Conclusions: Computational methods have captured pertinent features of the prominent disease characteristics presented in Behçet's disease and have highlighted NOD2, ICOS and IL18 signalling as potential therapeutic strategies.
A Simple Test of Class-Level Genetic Association Can Reveal Novel Cardiometabolic Trait Loci.
Qian, Jing; Nunez, Sara; Reed, Eric; Reilly, Muredach P; Foulkes, Andrea S
2016-01-01
Characterizing the genetic determinants of complex diseases can be further augmented by incorporating knowledge of underlying structure or classifications of the genome, such as newly developed mappings of protein-coding genes, epigenetic marks, enhancer elements and non-coding RNAs. We apply a simple class-level testing framework, termed Genetic Class Association Testing (GenCAT), to identify protein-coding gene association with 14 cardiometabolic (CMD) related traits across 6 publicly available genome wide association (GWA) meta-analysis data resources. GenCAT uses SNP-level meta-analysis test statistics across all SNPs within a class of elements, as well as the size of the class and its unique correlation structure, to determine if the class is statistically meaningful. The novelty of findings is evaluated through investigation of regional signals. A subset of findings are validated using recently updated, larger meta-analysis resources. A simulation study is presented to characterize overall performance with respect to power, control of family-wise error and computational efficiency. All analysis is performed using the GenCAT package, R version 3.2.1. We demonstrate that class-level testing complements the common first stage minP approach that involves individual SNP-level testing followed by post-hoc ascribing of statistically significant SNPs to genes and loci. GenCAT suggests 54 protein-coding genes at 41 distinct loci for the 13 CMD traits investigated in the discovery analysis, that are beyond the discoveries of minP alone. An additional application to biological pathways demonstrates flexibility in defining genetic classes. We conclude that it would be prudent to include class-level testing as standard practice in GWA analysis. 
GenCAT, for example, can be used as a simple, complementary and efficient strategy for class-level testing that leverages existing data resources, requires only summary level data in the form of test statistics, and adds significant value with respect to its potential for identifying multiple novel and clinically relevant trait associations.
Shirdel, Elize A.; Xie, Wing; Mak, Tak W.; Jurisica, Igor
2011-01-01
Background MicroRNAs are a class of small RNAs known to regulate gene expression at the transcript level, the protein level, or both. Since microRNA binding is sequence-based but possibly structure-specific, work in this area has resulted in multiple databases storing predicted microRNA:target relationships computed using diverse algorithms. We integrate prediction databases, compare predictions to in vitro data, and use cross-database predictions to model the microRNA:transcript interactome – referred to as the micronome – to study microRNA involvement in well-known signalling pathways as well as associations with disease. We make this data freely available with a flexible user interface as our microRNA Data Integration Portal — mirDIP (http://ophid.utoronto.ca/mirDIP). Results mirDIP integrates prediction databases to elucidate accurate microRNA:target relationships. Using NAViGaTOR to produce interaction networks implicating microRNAs in literature-based, KEGG-based and Reactome-based pathways, we find these signalling pathway networks have significantly more microRNA involvement compared to chance (p<0.05), suggesting microRNAs co-target many genes in a given pathway. Further examination of the micronome shows two distinct classes of microRNAs; universe microRNAs, which are involved in many signalling pathways; and intra-pathway microRNAs, which target multiple genes within one signalling pathway. We find universe microRNAs to have more targets (p<0.0001), to be more studied (p<0.0002), and to have higher degree in the KEGG cancer pathway (p<0.0001), compared to intra-pathway microRNAs. Conclusions Our pathway-based analysis of mirDIP data suggests microRNAs are involved in intra-pathway signalling. We identify two distinct classes of microRNAs, suggesting a hierarchical organization of microRNAs co-targeting genes both within and between pathways, and implying differential involvement of universe and intra-pathway microRNAs at the disease level. PMID:21364759
Benedict, Matthew N.; Mundy, Michael B.; Henry, Christopher S.; Chia, Nicholas; Price, Nathan D.
2014-01-01
Genome-scale metabolic models provide a powerful means to harness information from genomes to deepen biological insights. With exponentially increasing sequencing capacity, there is an enormous need for automated reconstruction techniques that can provide more accurate models in a short time frame. Current methods for automated metabolic network reconstruction rely on gene and reaction annotations to build draft metabolic networks and algorithms to fill gaps in these networks. However, automated reconstruction is hampered by database inconsistencies, incorrect annotations, and gap filling largely without considering genomic information. Here we develop an approach for applying genomic information to predict alternative functions for genes and estimate their likelihoods from sequence homology. We show that computed likelihood values were significantly higher for annotations found in manually curated metabolic networks than those that were not. We then apply these alternative functional predictions to estimate reaction likelihoods, which are used in a new gap filling approach called likelihood-based gap filling to predict more genomically consistent solutions. To validate the likelihood-based gap filling approach, we applied it to models where essential pathways were removed, finding that likelihood-based gap filling identified more biologically relevant solutions than parsimony-based gap filling approaches. We also demonstrate that models gap filled using likelihood-based gap filling provide greater coverage and genomic consistency with metabolic gene functions compared to parsimony-based approaches. Interestingly, despite these findings, we found that likelihoods did not significantly affect consistency of gap filled models with Biolog and knockout lethality data. 
This indicates that the phenotype data alone cannot necessarily be used to discriminate between alternative solutions for gap filling and therefore, that the use of other information is necessary to obtain a more accurate network. All described workflows are implemented as part of the DOE Systems Biology Knowledgebase (KBase) and are publicly available via API or command-line web interface. PMID:25329157
Short and long-term genome stability analysis of prokaryotic genomes.
Brilli, Matteo; Liò, Pietro; Lacroix, Vincent; Sagot, Marie-France
2013-05-08
Gene organization dynamics is actively studied because it provides useful evolutionary information, makes functional annotation easier and often makes it possible to characterize pathogens. There is therefore a strong interest in understanding the variability of this trait and the possible correlations with life-style. Two kinds of events affect genome organization: on one hand translocations and recombinations change the relative position of genes shared by two genomes (i.e. the backbone gene order); on the other, insertions and deletions leave the backbone gene order unchanged but they alter the gene neighborhoods by breaking the syntenic regions. A complete picture of genome organization evolution therefore requires accounting for both kinds of events. We developed an approach where we model chromosomes as graphs on which we compute different stability estimators; we consider genome rearrangements as well as the effect of gene insertions and deletions. In the first part of the paper, we fit a measure of backbone gene order conservation (hereinafter called backbone stability) against phylogenetic distance for over 3000 genome comparisons, improving existing models for the divergence in time of backbone stability. Intra- and inter-specific comparisons were treated separately to focus on different time-scales. The use of multiple genomes of the same species allowed us to identify genomes whose gene order diverges from that of their conspecifics. The inter-species analysis indicates that pathogens are more often unstable than non-pathogens. In the second part of the text, we show that in pathogens, gene content dynamics (insertions and deletions) have a much more dramatic effect on genome organization stability than backbone rearrangements. In this work, we studied genome organization divergence taking into account the contribution of both genome order rearrangements and genome content dynamics. 
By studying species with multiple sequenced genomes available, we were able to explore genome organization stability at different time-scales and to find significant differences between pathogen and non-pathogen species. The output of our framework also makes it possible to identify conserved gene clusters and/or partial occurrences thereof, and thus to explore how gene clusters assembled during evolution.
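The distinction drawn above between backbone rearrangements and insertions/deletions can be illustrated with a simple adjacency-based proxy. This is a sketch under stated assumptions, not the authors' graph-based stability estimators: chromosomes are treated as linear, unsigned gene lists in which each gene occurs once, and the function names are hypothetical.

```python
from typing import FrozenSet, List, Set

def adjacencies(order: List[str]) -> Set[FrozenSet[str]]:
    """Unordered pairs of neighbouring genes in a linear gene order."""
    return {frozenset(pair) for pair in zip(order, order[1:])}

def backbone_adjacency_conservation(genome_a: List[str], genome_b: List[str]) -> float:
    """Fraction of backbone adjacencies of genome_a that also occur in genome_b.

    The backbone is the order of genes shared by both genomes, so genes
    present in only one genome (insertions/deletions) are ignored here.
    """
    shared = set(genome_a) & set(genome_b)
    backbone_a = [g for g in genome_a if g in shared]
    backbone_b = [g for g in genome_b if g in shared]
    adj_a, adj_b = adjacencies(backbone_a), adjacencies(backbone_b)
    if not adj_a:
        return 0.0
    return len(adj_a & adj_b) / len(adj_a)
```

For example, comparing `["g1", "g2", "g3", "g4", "g5"]` with `["g1", "g2", "x1", "g3", "g5", "g4"]` ignores the inserted gene `x1` and scores only the swap of `g4` and `g5`, which breaks one of the four backbone adjacencies.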
Computed tomographic findings of trichuriasis
Tokmak, Naime; Koc, Zafer; Ulusan, Serife; Koltas, Ismail Soner; Bal, Nebil
2006-01-01
In this report, we present computed tomographic findings of colonic trichuriasis. The patient was a 75-year-old man who complained of abdominal pain and weight loss. Diagnosis was achieved by colonoscopic biopsy. Abdominal computed tomography showed irregular and nodular thickening of the wall of the cecum and ascending colon. Although these findings are nonspecific, they may be one of the findings of trichuriasis. Pathologic analysis of the biopsied tissue and the Kato-Katz parasitological stool flotation technique confirmed these findings and revealed adult Trichuris. To our knowledge, this is the first report of colonic trichuriasis indicated by computed tomography. PMID:16830393
Computational toxicology combines data from high-throughput test methods, chemical structure analyses and other biological domains (e.g., genes, proteins, cells, tissues) with the goals of predicting and understanding the underlying mechanistic causes of chemical toxicity and for...
Geometry of the Gene Expression Space of Individual Cells
Korem, Yael; Szekely, Pablo; Hart, Yuval; Sheftel, Hila; Hausser, Jean; Mayo, Avi; Rothenberg, Michael E.; Kalisky, Tomer; Alon, Uri
2015-01-01
There is a revolution in the ability to analyze gene expression of single cells in a tissue. To understand this data we must comprehend how cells are distributed in a high-dimensional gene expression space. One open question is whether cell types form discrete clusters or whether gene expression forms a continuum of states. If such a continuum exists, what is its geometry? Recent theory on evolutionary trade-offs suggests that cells that need to perform multiple tasks are arranged in a polygon or polyhedron (line, triangle, tetrahedron and so on, generally called polytopes) in gene expression space, whose vertices are the expression profiles optimal for each task. Here, we analyze single-cell data from human and mouse tissues profiled using a variety of single-cell technologies. We fit the data to shapes with different numbers of vertices, compute their statistical significance, and infer their tasks. We find cases in which single cells fill out a continuum of expression states within a polyhedron. This occurs in intestinal progenitor cells, which fill out a tetrahedron in gene expression space. The four vertices of this tetrahedron are each enriched with genes for a specific task related to stemness and early differentiation. A polyhedral continuum of states is also found in spleen dendritic cells, known to perform multiple immune tasks: cells fill out a tetrahedron whose vertices correspond to key tasks related to maturation, pathogen sensing and communication with lymphocytes. A mixture of continuum-like distributions and discrete clusters is found in other cell types, including bone marrow and differentiated intestinal crypt cells. This approach can be used to understand the geometry and biological tasks of a wide range of single-cell datasets. The present results suggest that the concept of cell type may be expanded. 
In addition to discrete clusters in gene-expression space, we suggest a new possibility: a continuum of states within a polyhedron, in which the vertices represent specialists at key tasks. PMID:26161936
Gene Ontology annotations at SGD: new data sources and annotation methods
Hong, Eurie L.; Balakrishnan, Rama; Dong, Qing; Christie, Karen R.; Park, Julie; Binkley, Gail; Costanzo, Maria C.; Dwight, Selina S.; Engel, Stacia R.; Fisk, Dianna G.; Hirschman, Jodi E.; Hitz, Benjamin C.; Krieger, Cynthia J.; Livstone, Michael S.; Miyasato, Stuart R.; Nash, Robert S.; Oughtred, Rose; Skrzypek, Marek S.; Weng, Shuai; Wong, Edith D.; Zhu, Kathy K.; Dolinski, Kara; Botstein, David; Cherry, J. Michael
2008-01-01
The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) collects and organizes biological information about the chromosomal features and gene products of the budding yeast Saccharomyces cerevisiae. Although published data from traditional experimental methods are the primary sources of evidence supporting Gene Ontology (GO) annotations for a gene product, high-throughput experiments and computational predictions can also provide valuable insights in the absence of an extensive body of literature. Therefore, GO annotations available at SGD now include high-throughput data as well as computational predictions provided by the GO Annotation Project (GOA UniProt; http://www.ebi.ac.uk/GOA/). Because the annotation method used to assign GO annotations varies by data source, GO resources at SGD have been modified to distinguish data sources and annotation methods. In addition to providing information for genes that have not been experimentally characterized, GO annotations from independent sources can be compared to those made by SGD to help keep the literature-based GO annotations current. PMID:17982175
Zhang, Fan; Liu, Runsheng; Zheng, Jie
2016-12-23
Linking computational models of signaling pathways to predicted cellular responses such as gene expression regulation is a major challenge in computational systems biology. In this work, we present Sig2GRN, a Cytoscape plugin that is able to simulate time-course gene expression data given the user-defined external stimuli to the signaling pathways. A generalized logical model is used in modeling the upstream signaling pathways. Then a Boolean model and a thermodynamics-based model are employed to predict the downstream changes in gene expression based on the simulated dynamics of transcription factors in signaling pathways. Our empirical case studies show that the simulation of Sig2GRN can predict changes in gene expression patterns induced by DNA damage signals and drug treatments. As a software tool for modeling cellular dynamics, Sig2GRN can facilitate studies in systems biology by hypotheses generation and wet-lab experimental design. http://histone.scse.ntu.edu.sg/Sig2GRN/.
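As a rough illustration of the Boolean modeling step mentioned above, the sketch below runs synchronous updates on a toy network. The node names (`stimulus`, `tf`, `repressor`, `target`) and the update rules are hypothetical and are not taken from Sig2GRN.

```python
from typing import Callable, Dict, List

State = Dict[str, bool]

# Hypothetical update rules for a toy network; not taken from Sig2GRN.
RULES: Dict[str, Callable[[State], bool]] = {
    "stimulus": lambda s: s["stimulus"],   # external input, held fixed
    "tf": lambda s: s["stimulus"],         # transcription factor follows the stimulus
    "repressor": lambda s: s["target"],    # the target gene induces its own repressor
    "target": lambda s: s["tf"] and not s["repressor"],
}

def step(state: State) -> State:
    """Synchronous update: every node reads the previous state."""
    return {node: rule(state) for node, rule in RULES.items()}

def simulate(initial: State, n_steps: int) -> List[State]:
    """Return the trajectory of states, starting with the initial state."""
    trajectory = [dict(initial)]
    for _ in range(n_steps):
        trajectory.append(step(trajectory[-1]))
    return trajectory

if __name__ == "__main__":
    start = {"stimulus": True, "tf": False, "repressor": False, "target": False}
    for t, s in enumerate(simulate(start, 5)):
        print(t, s)
```

With these rules and the stimulus held on, `target` switches on at step 2 and off again at step 4, and the state at step 5 repeats the state at step 1, giving a simple time course of expression.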
The physical size of transcription factors is key to transcriptional regulation in chromatin domains
NASA Astrophysics Data System (ADS)
Maeshima, Kazuhiro; Kaizu, Kazunari; Tamura, Sachiko; Nozaki, Tadasu; Kokubo, Tetsuro; Takahashi, Koichi
2015-02-01
Genetic information, which is stored in the long strand of genomic DNA as chromatin, must be scanned and read out by various transcription factors. First, gene-specific transcription factors, which are relatively small (~50 kDa), scan the genome and bind regulatory elements. Such factors then recruit general transcription factors, Mediators, RNA polymerases, nucleosome remodellers, and histone modifiers, most of which are large protein complexes of 1-3 MDa in size. Here, we propose a new model for the functional significance of the size of transcription factors (or complexes) for gene regulation of chromatin domains. Recent findings suggest that chromatin consists of irregularly folded nucleosome fibres (10 nm fibres) and forms numerous condensed domains (e.g., topologically associating domains). Although the flexibility and dynamics of chromatin allow repositioning of genes within the condensed domains, the size exclusion effect of the domain may limit accessibility of DNA sequences by transcription factors. We used Monte Carlo computer simulations to determine the physical size limit of transcription factors that can enter condensed chromatin domains. Small gene-specific transcription factors can penetrate into the chromatin domains and search their target sequences, whereas large transcription complexes cannot enter the domain. Due to this property, once a large complex binds its target site via gene-specific factors it can act as a ‘buoy’ to keep the target region on the surface of the condensed domain and maintain transcriptional competency. This size-dependent specialization of target-scanning and surface-tethering functions could provide novel insight into the mechanisms of various DNA transactions, such as DNA replication and repair/recombination.
Bez, Maxim; Sheyn, Dmitriy; Tawackoli, Wafa; Avalos, Pablo; Shapiro, Galina; Giaconi, Joseph C; Da, Xiaoyu; David, Shiran Ben; Gavrity, Jayne; Awad, Hani A; Bae, Hyun W; Ley, Eric J; Kremen, Thomas J; Gazit, Zulma; Ferrara, Katherine W; Pelled, Gadi; Gazit, Dan
2017-05-17
More than 2 million bone-grafting procedures are performed each year using autografts or allografts. However, both options carry disadvantages, and there remains a clear medical need for the development of new therapies for massive bone loss and fracture nonunions. We hypothesized that localized ultrasound-mediated, microbubble-enhanced therapeutic gene delivery to endogenous stem cells would induce efficient bone regeneration and fracture repair. To test this hypothesis, we surgically created a critical-sized bone fracture in the tibiae of Yucatán mini-pigs, a clinically relevant large animal model. A collagen scaffold was implanted in the fracture to facilitate recruitment of endogenous mesenchymal stem/progenitor cells (MSCs) into the fracture site. Two weeks later, transcutaneous ultrasound-mediated reporter gene delivery successfully transfected 40% of cells at the fracture site, and flow cytometry showed that 80% of the transfected cells expressed MSC markers. Human bone morphogenetic protein-6 (BMP-6) plasmid DNA was delivered using ultrasound in the same animal model, leading to transient expression and secretion of BMP-6 localized to the fracture area. Micro-computed tomography and biomechanical analyses showed that ultrasound-mediated BMP-6 gene delivery led to complete radiographic and functional fracture healing in all animals 6 weeks after treatment, whereas nonunion was evident in control animals. Collectively, these findings demonstrate that ultrasound-mediated gene delivery to endogenous mesenchymal progenitor cells can effectively treat nonhealing bone fractures in large animals, thereby addressing a major orthopedic unmet need and offering new possibilities for clinical translation. Copyright © 2017, American Association for the Advancement of Science.
Systems Biomedicine of Rabies Delineates the Affected Signaling Pathways.
Azimzadeh Jamalkandi, Sadegh; Mozhgani, Sayed-Hamidreza; Gholami Pourbadie, Hamid; Mirzaie, Mehdi; Noorbakhsh, Farshid; Vaziri, Behrouz; Gholami, Alireza; Ansari-Pour, Naser; Jafari, Mohieddin
2016-01-01
The prototypical neurotropic virus, rabies, is a member of the Rhabdoviridae family that causes lethal encephalomyelitis. Although a plethora of studies have investigated the etiological mechanism of the rabies virus and many precautionary methods have been implemented to avert disease outbreaks over the last century, the disease surprisingly has no definite remedy at its late stages. The psychological symptoms and the underlying etiology, as well as the rare survival from rabies encephalitis, remain a mystery. We, therefore, undertook a systems biomedicine approach to identify the network of gene products implicated in rabies. This was done by meta-analyzing whole-transcriptome microarray datasets of the CNS infected by strain CVS-11, and integrating them with interactome data using computational and statistical methods. We first determined the differentially expressed genes (DEGs) in each study and horizontally integrated the results at the mRNA and microRNA levels separately. A total of 61 seed genes involved in the signal propagation system were obtained by unifying the integrated DEGs detected at the mRNA and microRNA levels. We then reconstructed a refined protein-protein interaction network (PPIN) of infected cells to elucidate the rabies-implicated signal transduction network (RISN). To validate our findings, we confirmed differential expression of randomly selected genes in the network using Real-time PCR. In conclusion, the identification of seed genes and their network neighborhood within the refined PPIN can be useful for demonstrating signaling pathways including interferon circumvent, toward proliferation and survival, and neuropathological clue, explaining the intricate underlying molecular neuropathology of rabies infection and thus rendering a molecular framework for predicting potential drug targets.
Bez, Maxim; Sheyn, Dmitriy; Tawackoli, Wafa; Avalos, Pablo; Shapiro, Galina; Giaconi, Joseph C.; Da, Xiaoyu; Ben David, Shiran; Gavrity, Jayne; Awad, Hani A.; Bae, Hyun W.; Ley, Eric J.; Kremen, Thomas J.; Gazit, Zulma; Ferrara, Katherine W.; Pelled, Gadi; Gazit, Dan
2017-01-01
More than 2 million bone-grafting procedures are performed each year using autografts or allografts. However, both options carry disadvantages, and there remains a clear medical need for the development of new therapies for massive bone loss and fracture nonunions. We hypothesized that localized ultrasound-mediated, microbubble-enhanced therapeutic gene delivery to endogenous stem cells would induce efficient bone regeneration and fracture repair. To test this hypothesis, we surgically created a critical-sized bone fracture in the tibiae of Yucatán mini-pigs, a clinically relevant large animal model. A collagen scaffold was implanted in the fracture to facilitate recruitment of endogenous mesenchymal stem/progenitor cells (MSCs) into the fracture site. Two weeks later, transcutaneous ultrasound-mediated reporter gene delivery successfully transfected 40% of cells at the fracture site, and flow cytometry showed that 80% of the transfected cells expressed MSC markers. Human bone morphogenetic protein-6 (BMP-6) plasmid DNA was delivered using ultrasound in the same animal model, leading to transient expression and secretion of BMP-6 localized to the fracture area. Micro–computed tomography and biomechanical analyses showed that ultrasound-mediated BMP-6 gene delivery led to complete radiographic and functional fracture healing in all animals 6 weeks after treatment, whereas nonunion was evident in control animals. Collectively, these findings demonstrate that ultrasound-mediated gene delivery to endogenous mesenchymal progenitor cells can effectively treat nonhealing bone fractures in large animals, thereby addressing a major orthopedic unmet need and offering new possibilities for clinical translation. PMID:28515335
Giacopuzzi, Edoardo; Laffranchi, Mattia; Berardelli, Romina; Ravasio, Viola; Ferrarotti, Ilaria; Gooptu, Bibek; Borsani, Giuseppe; Fra, Annamaria
2018-06-07
The growth of publicly available data informing upon genetic variations, mechanisms of disease and disease sub-phenotypes offers great potential for personalised medicine. Computational approaches are likely required to assess large numbers of novel genetic variants. However, the integration of genetic, structural and pathophysiological data still represents a challenge for computational predictions and their clinical use. We addressed these issues for alpha-1-antitrypsin deficiency, a disease mediated by mutations in the SERPINA1 gene encoding alpha-1-antitrypsin. We compiled a comprehensive database of SERPINA1 coding mutations and assigned them apparent pathological relevance based upon available data. 'Benign' and 'Pathogenic' mutations were used to assess performance of 31 pathogenicity predictors. Well-performing algorithms clustered the subset of variants known to be severely pathogenic with high scores. Eight new mutations identified in the ExAC database and achieving high scores were selected for characterisation in cell models and showed secretory deficiency and polymer formation, supporting the predictive power of our computational approach. The behaviour of the pathogenic new variants and consistent outliers were rationalised by considering the protein structural context and residue conservation. These findings highlight the potential of computational methods to provide meaningful predictions of the pathogenic significance of novel mutations and identify areas for further investigation. This article is protected by copyright. All rights reserved.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Harwood, Caroline S
The goal of this project is to identify gene networks that are critical for efficient biohydrogen production by leveraging variation in gene content and gene expression in independently isolated Rhodopseudomonas palustris strains. Coexpression methods were applied to large data sets that we have collected to define probabilistic causal gene networks. To our knowledge, this is a first systems-level approach that takes advantage of strain-to-strain variability to computationally define networks critical for a particular bacterial phenotypic trait.
Sen Sarma, Moushumi; Arcoleo, David; Khetani, Radhika S; Chee, Brant; Ling, Xu; He, Xin; Jiang, Jing; Mei, Qiaozhu; Zhai, ChengXiang; Schatz, Bruce
2011-07-01
With the rapid decrease in cost of genome sequencing, the classification of gene function is becoming a primary problem. Such classification has been performed by human curators who read biological literature to extract evidence. BeeSpace Navigator is a prototype software for exploratory analysis of gene function using biological literature. The software supports an automatic analogue of the curator process to extract functions, with a simple interface intended for all biologists. Since extraction is done on selected collections that are semantically indexed into conceptual spaces, the curation can be task specific. Biological literature containing references to gene lists from expression experiments can be analyzed to extract concepts that are computational equivalents of a classification such as Gene Ontology, yielding discriminating concepts that differentiate gene mentions from other mentions. The functions of individual genes can be summarized from sentences in biological literature, to produce results resembling a model organism database entry that is automatically computed. Statistical frequency analysis based on literature phrase extraction generates offline semantic indexes to support these gene function services. The website with BeeSpace Navigator is free and open to all; there is no login requirement at www.beespace.illinois.edu for version 4. Materials from the 2010 BeeSpace Software Training Workshop are available at www.beespace.illinois.edu/bstwmaterials.php.
Reverse engineering and analysis of large genome-scale gene networks
Aluru, Maneesha; Zola, Jaroslaw; Nettleton, Dan; Aluru, Srinivas
2013-01-01
Reverse engineering the whole-genome networks of complex multicellular organisms remains a challenge. While simpler models easily scale to large numbers of genes and gene expression datasets, more accurate models are compute-intensive, limiting their scale of applicability. To enable fast and accurate reconstruction of large networks, we developed Tool for Inferring Network of Genes (TINGe), a parallel mutual information (MI)-based program. The novel features of our approach include: (i) B-spline-based formulation for linear-time computation of MI, (ii) a novel algorithm for direct permutation testing and (iii) development of parallel algorithms to reduce run-time and facilitate construction of large networks. We assess the quality of our method by comparison with ARACNe (Algorithm for the Reconstruction of Accurate Cellular Networks) and GeneNet and demonstrate its unique capability by reverse engineering the whole-genome network of Arabidopsis thaliana from 3137 Affymetrix ATH1 GeneChips in just 9 min on a 1024-core cluster. We further report on the development of a new software tool, Gene Network Analyzer (GeNA), for extracting context-specific subnetworks from a given set of seed genes. Using TINGe and GeNA, we performed analysis of 241 Arabidopsis AraCyc 8.0 pathways, and the results are made available through the web. PMID:23042249
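The idea of scoring a gene pair by mutual information and assessing it by permutation can be sketched as follows. This is not TINGe's method: it swaps in a plain equal-width histogram estimator instead of the B-spline formulation, uses a naive serial permutation test, and all function names and defaults are assumptions made for illustration.

```python
import math
import random
from collections import Counter
from typing import List, Sequence

def discretize(values: Sequence[float], n_bins: int) -> List[int]:
    """Map values to equal-width bin indices in [0, n_bins - 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def mutual_information(x: Sequence[float], y: Sequence[float], n_bins: int = 4) -> float:
    """Plug-in MI estimate (in nats) from a 2-D histogram of the two profiles."""
    bx, by = discretize(x, n_bins), discretize(y, n_bins)
    n = len(bx)
    px, py, pxy = Counter(bx), Counter(by), Counter(zip(bx, by))
    return sum(
        (c / n) * math.log(c * n / (px[i] * py[j]))
        for (i, j), c in pxy.items()
    )

def permutation_p_value(x: Sequence[float], y: Sequence[float],
                        n_perm: int = 200, seed: int = 0) -> float:
    """Share of shuffled-y MI values at least as large as the observed MI."""
    rng = random.Random(seed)
    observed = mutual_information(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if mutual_information(x, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

An edge between two genes would then be kept only when its permutation p-value falls below a chosen threshold; repeating this for every gene pair is the quadratic cost that motivates parallel implementations.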
2016-01-01
Motivation: A gene tree represents the evolutionary history of gene lineages that originate from multiple related populations. Under the multispecies coalescent model, lineages may coalesce outside the species (population) boundary. Given a species tree (with branch lengths), the gene tree probability is the probability of observing a specific gene tree topology under the multispecies coalescent model. There are two existing algorithms for computing the exact gene tree probability. The first algorithm is due to Degnan and Salter, where they enumerate all the so-called coalescent histories for the given species tree and the gene tree topology. Their algorithm runs in exponential time in the number of gene lineages in general. The second algorithm is the STELLS algorithm (2012), which is usually faster but also runs in exponential time in almost all the cases. Results: In this article, we present a new algorithm, called CompactCH, for computing the exact gene tree probability. This new algorithm is based on the notion of compact coalescent histories: multiple coalescent histories are represented by a single compact coalescent history. The key advantage of our new algorithm is that it runs in polynomial time in the number of gene lineages if the number of populations is fixed to be a constant. The new algorithm is more efficient than the STELLS algorithm both in theory and in practice when the number of populations is small and there are multiple gene lineages from each population. As an application, we show that CompactCH can be applied in the inference of population tree (i.e. the population divergence history) from population haplotypes. Simulation results show that the CompactCH algorithm enables efficient and accurate inference of population trees with many more haplotypes than a previous approach. Availability: The CompactCH algorithm is implemented in the STELLS software package, which is available for download at http://www.engr.uconn.edu/ywu/STELLS.html. 
Contact: ywu@engr.uconn.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:27307621
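To make the notion of a gene tree probability concrete, the sketch below covers only the simplest textbook case: a three-species tree ((A,B),C) with one lineage sampled per species, for which the probabilities have a closed form. It is not the CompactCH algorithm, and the function name is hypothetical.

```python
import math
from typing import Dict

def three_taxon_gene_tree_probs(internal_branch: float) -> Dict[str, float]:
    """Gene tree topology probabilities for the species tree ((A,B),C).

    One lineage is sampled per species; internal_branch is the length of the
    branch ancestral to A and B, measured in coalescent units.
    """
    if internal_branch < 0:
        raise ValueError("branch length must be non-negative")
    # Probability that the A and B lineages fail to coalesce on the internal branch.
    no_coalescence = math.exp(-internal_branch)
    # If they fail, all three lineages enter the root population, where each
    # of the three possible first coalescences is equally likely.
    discordant = no_coalescence / 3.0
    return {
        "((A,B),C)": 1.0 - 2.0 * discordant,
        "((A,C),B)": discordant,
        "((B,C),A)": discordant,
    }
```

With an internal branch of length zero all three topologies have probability 1/3, and the matching topology approaches probability 1 as the branch grows; with more species and more lineages per population the number of coalescent histories grows rapidly, which is the cost the algorithms above address.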
Maron, Jill L.; Hwang, Jooyeon S.; Pathak, Subash; Ruthazer, Robin; Russell, Ruby L.; Alterovitz, Gil
2014-01-01
Objective To combine mathematical modeling of salivary gene expression microarray data and systems biology annotation with RT-qPCR amplification to identify (phase I) and validate (phase II) salivary biomarker analysis for the prediction of oral feeding readiness in preterm infants. Study design Comparative whole transcriptome microarray analysis from 12 preterm newborns pre- and post-oral feeding success was used for computational modeling and systems biology analysis to identify potential salivary transcripts associated with oral feeding success (phase I). Selected gene expression biomarkers (15 from computational modeling; 6 evidence-based; and 3 reference) were evaluated by RT-qPCR amplification on 400 salivary samples from successful (n=200) and unsuccessful (n=200) oral feeders (phase II). Genes, alone and in combination, were evaluated by a multivariate analysis controlling for sex and post-conceptional age (PCA) to determine the probability that newborns achieved successful oral feeding. Results Advancing post-conceptional age (p < 0.001) and female sex (p = 0.05) positively predicted an infant’s ability to feed orally. A combination of five genes, NPY2R (hunger signaling), AMPK (energy homeostasis), PLXNA1 (olfactory neurogenesis), NPHP4 (visual behavior) and WNT3 (facial development), in addition to PCA and sex, demonstrated good accuracy for determining feeding success (AUROC = 0.78). Conclusions We have identified objective and biologically relevant salivary biomarkers that noninvasively assess a newborn’s developing brain, sensory and facial development as they relate to oral feeding success. Understanding the mechanisms that underlie the development of oral feeding readiness through translational and computational methods may improve clinical decision making while decreasing morbidities and health care costs. PMID:25620512
Lobo, Daniel; Morokuma, Junji; Levin, Michael
2016-09-01
Automated computational methods can infer dynamic regulatory network models directly from temporal and spatial experimental data, such as genetic perturbations and their resultant morphologies. Recently, a computational method was able to reverse-engineer the first mechanistic model of planarian regeneration that can recapitulate the main anterior-posterior patterning experiments published in the literature. Validating this comprehensive regulatory model via novel experiments that had not yet been performed would add to our understanding of the remarkable regeneration capacity of planarian worms and demonstrate the power of this automated methodology. Using the Michigan Molecular Interactions and STRING databases and the MoCha software tool, we characterized as hnf4 an unknown regulatory gene predicted to exist by the reverse-engineered dynamic model of planarian regeneration. Then, we used the dynamic model to predict the morphological outcomes under different single and multiple knock-downs (RNA interference) of hnf4 and its predicted gene pathway interactors β-catenin and hh. Interestingly, the model predicted that RNAi of hnf4 would rescue the abnormal regenerated phenotype (tailless) of RNAi of hh in amputated trunk fragments. Finally, we validated these predictions in vivo by performing the same surgical and genetic experiments with planarian worms, obtaining the same phenotypic outcomes predicted by the reverse-engineered model. These results suggest that hnf4 is a regulatory gene in planarian regeneration, validate the computational predictions of the reverse-engineered dynamic model, and demonstrate the automated methodology for the discovery of novel genes, pathways and experimental phenotypes. Contact: michael.levin@tufts.edu. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
U2AF1 mutations alter splice site recognition in hematological malignancies.
Ilagan, Janine O; Ramakrishnan, Aravind; Hayes, Brian; Murphy, Michele E; Zebari, Ahmad S; Bradley, Philip; Bradley, Robert K
2015-01-01
Whole-exome sequencing studies have identified common mutations affecting genes encoding components of the RNA splicing machinery in hematological malignancies. Here, we sought to determine how mutations affecting the 3' splice site recognition factor U2AF1 alter its normal role in RNA splicing. We find that U2AF1 mutations influence the similarity of splicing programs in leukemias, but do not give rise to widespread splicing failure. U2AF1 mutations cause differential splicing of hundreds of genes, affecting biological pathways such as DNA methylation (DNMT3B), X chromosome inactivation (H2AFY), the DNA damage response (ATR, FANCA), and apoptosis (CASP8). We show that U2AF1 mutations alter the preferred 3' splice site motif in patients, in cell culture, and in vitro. Mutations affecting the first and second zinc fingers give rise to different alterations in splice site preference and largely distinct downstream splicing programs. These allele-specific effects are consistent with a computationally predicted model of U2AF1 in complex with RNA. Our findings suggest that U2AF1 mutations contribute to pathogenesis by causing quantitative changes in splicing that affect diverse cellular pathways, and give insight into the normal function of U2AF1's zinc finger domains. © 2015 Ilagan et al.; Published by Cold Spring Harbor Laboratory Press.
panelcn.MOPS: Copy-number detection in targeted NGS panel data for clinical diagnostics.
Povysil, Gundula; Tzika, Antigoni; Vogt, Julia; Haunschmid, Verena; Messiaen, Ludwine; Zschocke, Johannes; Klambauer, Günter; Hochreiter, Sepp; Wimmer, Katharina
2017-07-01
Targeted next-generation-sequencing (NGS) panels have largely replaced Sanger sequencing in clinical diagnostics. They allow for the detection of copy-number variations (CNVs) in addition to single-nucleotide variants and small insertions/deletions. However, existing computational CNV detection methods have shortcomings regarding accuracy, quality control (QC), incidental findings, and user-friendliness. We developed panelcn.MOPS, a novel pipeline for detecting CNVs in targeted NGS panel data. Using data from 180 samples, we compared panelcn.MOPS with five state-of-the-art methods. With panelcn.MOPS leading the field, most methods achieved comparably high accuracy. panelcn.MOPS reliably detected CNVs ranging in size from part of a region of interest (ROI), to whole genes, which may comprise all ROIs investigated in a given sample. The latter is enabled by analyzing reads from all ROIs of the panel, but presenting results exclusively for user-selected genes, thus avoiding incidental findings. Additionally, panelcn.MOPS offers QC criteria not only for samples, but also for individual ROIs within a sample, which increases the confidence in called CNVs. panelcn.MOPS is freely available both as R package and standalone software with graphical user interface that is easy to use for clinical geneticists without any programming experience. panelcn.MOPS combines high sensitivity and specificity with user-friendliness rendering it highly suitable for routine clinical diagnostics. © 2017 The Authors. Human Mutation published by Wiley Periodicals, Inc.
panelcn.MOPS: Copy‐number detection in targeted NGS panel data for clinical diagnostics
Povysil, Gundula; Tzika, Antigoni; Vogt, Julia; Haunschmid, Verena; Messiaen, Ludwine; Zschocke, Johannes; Klambauer, Günter; Wimmer, Katharina
2017-01-01
Abstract Targeted next‐generation‐sequencing (NGS) panels have largely replaced Sanger sequencing in clinical diagnostics. They allow for the detection of copy‐number variations (CNVs) in addition to single‐nucleotide variants and small insertions/deletions. However, existing computational CNV detection methods have shortcomings regarding accuracy, quality control (QC), incidental findings, and user‐friendliness. We developed panelcn.MOPS, a novel pipeline for detecting CNVs in targeted NGS panel data. Using data from 180 samples, we compared panelcn.MOPS with five state‐of‐the‐art methods. With panelcn.MOPS leading the field, most methods achieved comparably high accuracy. panelcn.MOPS reliably detected CNVs ranging in size from part of a region of interest (ROI), to whole genes, which may comprise all ROIs investigated in a given sample. The latter is enabled by analyzing reads from all ROIs of the panel, but presenting results exclusively for user‐selected genes, thus avoiding incidental findings. Additionally, panelcn.MOPS offers QC criteria not only for samples, but also for individual ROIs within a sample, which increases the confidence in called CNVs. panelcn.MOPS is freely available both as R package and standalone software with graphical user interface that is easy to use for clinical geneticists without any programming experience. panelcn.MOPS combines high sensitivity and specificity with user‐friendliness rendering it highly suitable for routine clinical diagnostics. PMID:28449315
Translation initiation events on structured eukaryotic mRNAs generate gene expression noise
Dacheux, Estelle; Malys, Naglis; Meng, Xiang; Ramachandran, Vinoy; Mendes, Pedro
2017-01-01
Abstract Gene expression stochasticity plays a major role in biology, creating non-genetic cellular individuality and influencing multiple processes, including differentiation and stress responses. We have addressed the lack of knowledge about posttranscriptional contributions to noise by determining cell-to-cell variations in the abundance of mRNA and reporter protein in yeast. Two types of structural element, a stem–loop and a poly(G) motif, not only inhibit translation initiation when inserted into an mRNA 5΄ untranslated region, but also generate noise. The noise-enhancing effect of the stem–loop structure also remains operational when combined with an upstream open reading frame. This has broad significance, since these elements are known to modulate the expression of a diversity of eukaryotic genes. Our findings suggest a mechanism for posttranscriptional noise generation that will contribute to understanding of the generally poor correlation between protein-level stochasticity and transcriptional bursting. We propose that posttranscriptional stochasticity can be linked to cycles of folding/unfolding of a stem–loop structure, or to interconversion between higher-order structural conformations of a G-rich motif, and have created a correspondingly configured computational model that generates fits to the experimental data. Stochastic events occurring during the ribosomal scanning process can therefore feature alongside transcriptional bursting as a source of noise. PMID:28521011
Exploring lateral genetic transfer among microbial genomes using TF-IDF.
Cong, Yingnan; Chan, Yao-Ban; Ragan, Mark A
2016-07-25
Many microbes can acquire genetic material from their environment and incorporate it into their genome, a process known as lateral genetic transfer (LGT). Computational approaches have been developed to detect genomic regions of lateral origin, but typically lack sensitivity, ability to distinguish donor from recipient, and scalability to very large datasets. To address these issues we have introduced an alignment-free method based on ideas from document analysis, term frequency-inverse document frequency (TF-IDF). Here we examine the performance of TF-IDF on three empirical datasets: 27 genomes of Escherichia coli and Shigella, 110 genomes of enteric bacteria, and 143 genomes across 12 bacterial and three archaeal phyla. We investigate the effect of k-mer size, gap size and delineation of groups on the inference of genomic regions of lateral origin, finding an interplay among these parameters and sequence divergence. Because TF-IDF identifies donor groups and delineates regions of lateral origin within recipient genomes, aggregating these regions by gene enables us to explore, for the first time, the mosaic nature of lateral genes including the multiplicity of biological sources, ancestry of transfer and over-writing by subsequent transfers. We carry out Gene Ontology enrichment tests to investigate which biological processes are potentially affected by LGT.
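The TF-IDF weighting that this abstract borrows from document analysis can be illustrated with a minimal sketch. It is not the authors' method: it only computes generic TF-IDF weights for k-mers, treating each genome as one "document", and does not attempt donor/recipient inference, gap handling or group delineation. The function names (`kmers`, `tfidf`) and the toy sequences are assumptions made for illustration.

```python
import math
from collections import Counter


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def tfidf(genomes, k):
    """Map genome name -> {k-mer: TF-IDF weight}.

    Each genome is treated as one document (an illustrative assumption).
    TF is the relative k-mer frequency within a genome; IDF is
    log(number of genomes / number of genomes containing the k-mer).
    """
    counts = {name: Counter(kmers(seq, k)) for name, seq in genomes.items()}
    n_docs = len(counts)

    # Document frequency: in how many genomes does each k-mer occur?
    df = Counter()
    for c in counts.values():
        df.update(c.keys())

    weights = {}
    for name, c in counts.items():
        total = sum(c.values())
        weights[name] = {
            kmer: (n / total) * math.log(n_docs / df[kmer])
            for kmer, n in c.items()
        }
    return weights


# Toy sequences, invented for illustration only.
genomes = {
    "A": "ACGTACGT",
    "B": "ACGTTTTT",
    "C": "GGGGACGT",
}
w = tfidf(genomes, k=4)
```

In this toy input the k-mer ACGT occurs in every genome, so its weight is zero everywhere, while TTTT occurs only in genome B and receives a positive weight there. K-mers that are frequent in one group but rare elsewhere are the kind of signal a TF-IDF approach can exploit.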
Morais, Daniel Kumazawa; Cuadros-Orellana, Sara; Pais, Fabiano Sviatopolk-Mirsky; Medeiros, Julliane Dutra; Geraldo, Juliana Assis; Gilbert, Jack; Volpini, Angela Cristina; Fernandes, Gabriel Rocha
2016-01-01
Background In early 2015, a ZIKA Virus (ZIKV) infection outbreak was recognized in northeast Brazil, where concerns over its possible links with infant microcephaly have been discussed. Providing a causal link between ZIKV infection and birth defects is still a challenge. MicroRNAs (miRNAs) are small noncoding RNAs (sncRNAs) that regulate post-transcriptional gene expression by translational repression, and play important roles in viral pathogenesis and brain development. The potential for flavivirus-mediated miRNA signalling dysfunction in brain-tissue development provides a compelling hypothesis to test the perceived link between ZIKV and microcephaly. Methodology/Principal Findings Here, we applied in silico analyses to provide novel insights to understand how Congenital ZIKA Syndrome symptoms may be related to an imbalance in miRNAs function. Moreover, following World Health Organization (WHO) recommendations, we have assembled a database to help target investigations of the possible relationship between ZIKV symptoms and miRNA-mediated human gene expression. Conclusions/Significance We have computationally predicted both miRNAs encoded by ZIKV able to target genes in the human genome and cellular (human) miRNAs capable of interacting with ZIKV genomes. Our results represent a step forward in the ZIKV studies, providing new insights to support research in this field and identify potential targets for therapy. PMID:27332714
Novel Insights into the Transcriptome of Dirofilaria immitis
Zhang, Zhihe; Hou, Rong; Wu, Xuhang; Yang, Deying; Zhang, Runhui; Zheng, Wanpeng; Nie, Huaming; Xie, Yue; Yan, Ning; Yang, Zhi; Wang, Chengdong; Luo, Li; Liu, Li; Gu, Xiaobin; Wang, Shuxian; Peng, Xuerong; Yang, Guangyou
2012-01-01
Background The heartworm Dirofilaria immitis is the causal agent of cardiopulmonary dirofilariosis in dogs and cats, and also infects a wide range of wild mammals as well as humans. One bottleneck for the design of fundamentally new intervention and management strategies against D. immitis may be the currently limited knowledge of fundamental molecular aspects of D. immitis. Methodology/Principal Findings A next-generation sequencing platform combining computational approaches was employed to assess a global view of the heartworm transcriptome. A total of 20,810 unigenes (mean length = 1,270 bp) were assembled from 22.3 million clean reads. From these, 15,698 coding sequences (CDS) were inferred, and about 85% of the unigenes had orthologs/homologs in public databases. Comparative transcriptomic study uncovered 4,157 filarial-specific genes as well as 3,795 genes potentially involved in filarial-Wolbachia symbiosis. In addition, the potential intestine transcriptome of D. immitis (1,101 genes) was mined for the first time, which might help to discover ‘hidden antigens’. Conclusions/Significance This study provides novel insights into the transcriptome of D. immitis and sheds light on its molecular processes and survival mechanisms. Furthermore, it provides a platform to discover new vaccine candidates and potential targets for new drugs against dirofilariosis. PMID:22911833
Zemojtel, Tomasz; Köhler, Sebastian; Mackenroth, Luisa; Jäger, Marten; Hecht, Jochen; Krawitz, Peter; Graul-Neumann, Luitgard; Doelken, Sandra; Ehmke, Nadja; Spielmann, Malte; Øien, Nancy Christine; Schweiger, Michal R.; Krüger, Ulrike; Frommer, Götz; Fischer, Björn; Kornak, Uwe; Flöttmann, Ricarda; Ardeshirdavani, Amin; Moreau, Yves; Lewis, Suzanna E.; Haendel, Melissa; Smedley, Damian; Horn, Denise; Mundlos, Stefan; Robinson, Peter N.
2015-01-01
Less than half of patients with suspected genetic disease receive a molecular diagnosis. We have therefore integrated next-generation sequencing (NGS), bioinformatics, and clinical data into an effective diagnostic workflow. We used variants in the 2741 established Mendelian disease genes [the disease-associated genome (DAG)] to develop a targeted enrichment DAG panel (7.1 Mb), which achieves a coverage of 20-fold or better for 98% of bases. Furthermore, we established a computational method [Phenotypic Interpretation of eXomes (PhenIX)] that evaluated and ranked variants based on pathogenicity and semantic similarity of patients’ phenotype described by Human Phenotype Ontology (HPO) terms to those of 3991 Mendelian diseases. In computer simulations, ranking genes based on the variant score put the true gene in first place less than 5% of the time; PhenIX placed the correct gene in first place more than 86% of the time. In a retrospective test of PhenIX on 52 patients with previously identified mutations and known diagnoses, the correct gene achieved a mean rank of 2.1. In a prospective study on 40 individuals without a diagnosis, PhenIX analysis enabled a diagnosis in 11 cases (28%, at a mean rank of 2.4). Thus, the NGS of the DAG followed by phenotype-driven bioinformatic analysis allows quick and effective differential diagnostics in medical genetics. PMID:25186178
Zemojtel, Tomasz; Köhler, Sebastian; Mackenroth, Luisa; Jäger, Marten; Hecht, Jochen; Krawitz, Peter; Graul-Neumann, Luitgard; Doelken, Sandra; Ehmke, Nadja; Spielmann, Malte; Oien, Nancy Christine; Schweiger, Michal R; Krüger, Ulrike; Frommer, Götz; Fischer, Björn; Kornak, Uwe; Flöttmann, Ricarda; Ardeshirdavani, Amin; Moreau, Yves; Lewis, Suzanna E; Haendel, Melissa; Smedley, Damian; Horn, Denise; Mundlos, Stefan; Robinson, Peter N
2014-09-03
Less than half of patients with suspected genetic disease receive a molecular diagnosis. We have therefore integrated next-generation sequencing (NGS), bioinformatics, and clinical data into an effective diagnostic workflow. We used variants in the 2741 established Mendelian disease genes [the disease-associated genome (DAG)] to develop a targeted enrichment DAG panel (7.1 Mb), which achieves a coverage of 20-fold or better for 98% of bases. Furthermore, we established a computational method [Phenotypic Interpretation of eXomes (PhenIX)] that evaluated and ranked variants based on pathogenicity and semantic similarity of patients' phenotype described by Human Phenotype Ontology (HPO) terms to those of 3991 Mendelian diseases. In computer simulations, ranking genes based on the variant score put the true gene in first place less than 5% of the time; PhenIX placed the correct gene in first place more than 86% of the time. In a retrospective test of PhenIX on 52 patients with previously identified mutations and known diagnoses, the correct gene achieved a mean rank of 2.1. In a prospective study on 40 individuals without a diagnosis, PhenIX analysis enabled a diagnosis in 11 cases (28%, at a mean rank of 2.4). Thus, the NGS of the DAG followed by phenotype-driven bioinformatic analysis allows quick and effective differential diagnostics in medical genetics. Copyright © 2014, American Association for the Advancement of Science.
Lyubetsky, Vassily; Gershgorin, Roman; Gorbunov, Konstantin
2017-12-06
Chromosome structure is a very limited model of the genome including the information about its chromosomes such as their linear or circular organization, the order of genes on them, and the DNA strand encoding a gene. Gene lengths, nucleotide composition, and intergenic regions are ignored. Although highly incomplete, such structure can be used in many cases, e.g., to reconstruct phylogeny and evolutionary events, to identify gene synteny, regulatory elements and promoters (considering highly conserved elements), etc. Three problems are considered; all assume unequal gene content and the presence of gene paralogs. The distance problem is to determine the minimum number of operations required to transform one chromosome structure into another and the corresponding transformation itself including the identification of paralogs in two structures. We use the DCJ model which is one of the most studied combinatorial rearrangement models. Double-, sesqui-, and single-operations as well as deletion and insertion of a chromosome region are considered in the model; the single ones comprise cut and join. In the reconstruction problem, a phylogenetic tree with chromosome structures in the leaves is given. It is necessary to assign the structures to inner nodes of the tree to minimize the sum of distances between terminal structures of each edge and to identify the mutual paralogs in a fairly large set of structures. A linear algorithm is known for the distance problem without paralogs, while the presence of paralogs makes it NP-hard. If paralogs are allowed but the insertion and deletion operations are missing (and special constraints are imposed), the reduction of the distance problem to integer linear programming is known. Apparently, the reconstruction problem is NP-hard even in the absence of paralogs. The problem of contigs is to find the optimal arrangements for each given set of contigs, which also includes the mutual identification of paralogs. 
We proved that these problems can be reduced to integer linear programming formulations, which allows an algorithm to redefine the problems to implement a very special case of the integer linear programming tool. The results were tested on synthetic and biological samples. Three well-known problems were reduced to a very special case of integer linear programming, which is a new method of their solutions. Integer linear programming is clearly among the main computational methods and, as generally accepted, is fast on average; in particular, computation systems specifically targeted at it are available. The challenges are to reduce the size of the corresponding integer linear programming formulations and to incorporate a more detailed biological concept in our model of the reconstruction.
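For orientation, the simplest version of the distance problem can be sketched in a few lines. The sketch assumes equal gene content, no paralogs, and circular chromosomes only, which is far narrower than the setting of the abstract and is not the authors' integer linear programming formulation. Under those assumptions the DCJ distance equals N minus C, where N is the number of genes and C is the number of cycles in the adjacency graph of the two structures. The representation (signed integers for genes, lists of lists for chromosomes) and the function names are choices made here for illustration.

```python
def extremities(gene):
    """Return the (left, right) extremities of a signed gene.

    A gene g has a tail (g, 't') and a head (g, 'h'); a negative sign
    means the gene is read in the reverse direction.
    """
    g = abs(gene)
    return ((g, "t"), (g, "h")) if gene > 0 else ((g, "h"), (g, "t"))


def adjacency_map(chromosomes):
    """Map each extremity to its neighbor, assuming circular chromosomes."""
    partner = {}
    for chrom in chromosomes:
        for i, gene in enumerate(chrom):
            nxt = chrom[(i + 1) % len(chrom)]  # wrap around: circular
            right = extremities(gene)[1]
            left = extremities(nxt)[0]
            partner[right] = left
            partner[left] = right
    return partner


def dcj_distance(a, b):
    """DCJ distance N - C for circular genomes with identical gene content."""
    pa, pb = adjacency_map(a), adjacency_map(b)
    if pa.keys() != pb.keys():
        raise ValueError("genomes must have the same gene content")

    seen, cycles = set(), 0
    for start in pa:
        if start in seen:
            continue
        cycles += 1
        x = start
        # Alternate between adjacencies of a and adjacencies of b
        # until the walk returns to the starting extremity.
        while x not in seen:
            seen.add(x)
            y = pa[x]
            seen.add(y)
            x = pb[y]
    n_genes = len(pa) // 2  # two extremities per gene
    return n_genes - cycles


print(dcj_distance([[1, 2, 3]], [[1, 2, 3]]))   # 0: identical structures
print(dcj_distance([[1, 2, 3]], [[1, -2, 3]]))  # 1: one inversion
print(dcj_distance([[1, 2]], [[1], [2]]))       # 1: one fission
```

Counting cycles takes time linear in the number of genes, which matches the abstract's remark that the paralog-free distance problem has a linear algorithm. Linear chromosomes, unequal gene content and paralogs all require extensions that this sketch deliberately leaves out.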
Huang, Yi-Wen; Roa, Juan C.; Goodfellow, Paul J.; Kizer, E. Lynette; Huang, Tim H. M.; Chen, Yidong
2013-01-01
Background DNA methylation of promoter CpG islands is associated with gene suppression, and its unique genome-wide profiles have been linked to tumor progression. With high-throughput sequencing technologies, genome-wide methylation profiles in cancer cells can now be determined efficiently. Also, experimental and computational technologies make it possible to find the functional relationship between cancer-specific methylation patterns and their clinicopathological parameters. Methodology/Principal Findings Cancer methylome system (CMS) is a web-based database application designed for the visualization, comparison and statistical analysis of human cancer-specific DNA methylation. Methylation intensities were obtained from MBDCap-sequencing, pre-processed and stored in the database. 191 patient samples (169 tumor and 22 normal specimens) and 41 breast cancer cell-lines are deposited in the database, comprising about 6.6 billion uniquely mapped sequence reads. This provides comprehensive, genome-wide epigenetic portraits of human breast cancer and endometrial cancer. Two views are proposed for users to better understand methylation structure at the genomic level or systemic methylation alteration at the gene level. In addition, a variety of annotation tracks are provided to cover genomic information. CMS includes important analytic functions for interpretation of methylation data, such as the detection of differentially methylated regions, statistical calculation of global methylation intensities, multiple gene sets of biologically significant categories, and interactivity with UCSC via custom-track data. We also present examples of discoveries utilizing the framework. Conclusions/Significance CMS provides visualization and analytic functions for cancer methylome datasets. 
A comprehensive collection of datasets, a variety of embedded analytic functions and extensive applications with biological and translational significance make this system powerful and unique in cancer methylation research. CMS is freely accessible at: http://cbbiweb.uthscsa.edu/KMethylomes/. PMID:23630576