Correlation approach to identify coding regions in DNA sequences
NASA Technical Reports Server (NTRS)
Ossadnik, S. M.; Buldyrev, S. V.; Goldberger, A. L.; Havlin, S.; Mantegna, R. N.; Peng, C. K.; Simons, M.; Stanley, H. E.
1994-01-01
Recently, it was observed that noncoding regions of DNA sequences possess long-range power-law correlations, whereas coding regions typically display only short-range correlations. We develop an algorithm based on this finding that enables investigators to perform a statistical analysis on long DNA sequences to locate possible coding regions. The algorithm is particularly successful in predicting the location of lengthy coding regions. For example, for the complete genome of yeast chromosome III (315,344 nucleotides), at least 82% of the predictions correspond to putative coding regions; the algorithm correctly identified all coding regions larger than 3000 nucleotides, 92% of coding regions between 2000 and 3000 nucleotides long, and 79% of coding regions between 1000 and 2000 nucleotides. The predictive ability of this new algorithm supports the claim that there is a fundamental difference in the correlation property between coding and noncoding sequences. This algorithm, which is not species-dependent, can be implemented with other techniques for rapidly and accurately locating relatively long coding regions in genomic sequences.
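The distinction between long-range and short-range correlations described above can be illustrated with a fluctuation analysis of a "DNA walk". The following is a minimal sketch, not the authors' algorithm: the purine/pyrimidine step rule, the function names, and the choice of window lengths are assumptions made for illustration. An exponent near 0.5 indicates no long-range correlation; values clearly above 0.5 would suggest power-law correlations.

```python
import math

def dna_walk(seq):
    """Cumulative walk: +1 for a pyrimidine (C/T), -1 for anything else (assumed rule)."""
    walk, total = [0], 0
    for base in seq.upper():
        total += 1 if base in "CT" else -1
        walk.append(total)
    return walk

def fluctuation(walk, l):
    """Root-mean-square fluctuation of walk displacements over distance l."""
    deltas = [walk[i + l] - walk[i] for i in range(len(walk) - l)]
    mean = sum(deltas) / len(deltas)
    return math.sqrt(sum((d - mean) ** 2 for d in deltas) / len(deltas))

def scaling_exponent(seq, lengths=(4, 8, 16, 32)):
    """Least-squares slope of log F(l) versus log l (sequence must exceed max length)."""
    walk = dna_walk(seq)
    xs, ys = [], []
    for l in lengths:
        f = fluctuation(walk, l)
        if f > 0:  # log is undefined for zero fluctuation
            xs.append(math.log(l))
            ys.append(math.log(f))
    if len(xs) < 2:
        raise ValueError("need at least two non-zero fluctuation values")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

In a sliding-window scan, a window whose exponent stays close to 0.5 would be a candidate coding region under this picture; the window size and threshold would have to be tuned and are not specified in the abstract.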
CRITICA: coding region identification tool invoking comparative analysis
NASA Technical Reports Server (NTRS)
Badger, J. H.; Olsen, G. J.; Woese, C. R. (Principal Investigator)
1999-01-01
Gene recognition is essential to understanding existing and future DNA sequence data. CRITICA (Coding Region Identification Tool Invoking Comparative Analysis) is a suite of programs for identifying likely protein-coding sequences in DNA by combining comparative analysis of DNA sequences with more common noncomparative methods. In the comparative component of the analysis, regions of DNA are aligned with related sequences from the DNA databases; if the translation of the aligned sequences has greater amino acid identity than expected for the observed percentage nucleotide identity, this is interpreted as evidence for coding. CRITICA also incorporates noncomparative information derived from the relative frequencies of hexanucleotides in coding frames versus other contexts (i.e., dicodon bias). The dicodon usage information is derived by iterative analysis of the data, such that CRITICA is not dependent on the existence or accuracy of coding sequence annotations in the databases. This independence makes the method particularly well suited for the analysis of novel genomes. CRITICA was tested by analyzing the available Salmonella typhimurium DNA sequences. Its predictions were compared with the DNA sequence annotations and with the predictions of GenMark. CRITICA proved to be more accurate than GenMark, and moreover, many of its predictions that would seem to be errors instead reflect problems in the sequence databases. The source code of CRITICA is freely available by anonymous FTP (rdp.life.uiuc.edu in /pub/critica) and on the World Wide Web (http://rdpwww.life.uiuc.edu).
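The dicodon-bias idea mentioned above can be sketched as a log-odds score over in-frame hexanucleotides. This is an illustrative sketch only and is not CRITICA's code: the function names, the pseudocount, and the use of a fixed training set (rather than CRITICA's iterative derivation) are all assumptions.

```python
import math
from collections import Counter

def hexamer_counts(seqs, step):
    """Count hexanucleotides, advancing `step` bases at a time (3 = in-frame dicodons)."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(0, len(s) - 5, step):
            counts[s[i:i + 6]] += 1
    return counts

def dicodon_score(candidate, coding, background, pseudo=1.0):
    """Sum of log(P_coding / P_background) over in-frame hexamers of `candidate`.

    `coding` and `background` are lists of training sequences (hypothetical inputs);
    `pseudo` is a pseudocount so unseen hexamers do not produce log(0).
    """
    cod = hexamer_counts(coding, 3)      # codon-aligned hexamers
    bg = hexamer_counts(background, 1)   # hexamers at every offset
    cod_total = sum(cod.values()) + pseudo * 4 ** 6
    bg_total = sum(bg.values()) + pseudo * 4 ** 6
    candidate = candidate.upper()
    score = 0.0
    for i in range(0, len(candidate) - 5, 3):
        h = candidate[i:i + 6]
        p = (cod[h] + pseudo) / cod_total
        q = (bg[h] + pseudo) / bg_total
        score += math.log(p / q)
    return score
```

A positive score means the candidate's in-frame hexamers look more like the coding training set than the background; where to set a decision threshold is left open here.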
Gene and translation initiation site prediction in metagenomic sequences
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hyatt, Philip Douglas; LoCascio, Philip F; Hauser, Loren John
2012-01-01
Gene prediction in metagenomic sequences remains a difficult problem. Current sequencing technologies do not achieve sufficient coverage to assemble the individual genomes in a typical sample; consequently, sequencing runs produce a large number of short sequences whose exact origin is unknown. Since these sequences are usually smaller than the average length of a gene, algorithms must make predictions based on very little data. We present MetaProdigal, a metagenomic version of the gene prediction program Prodigal, that can identify genes in short, anonymous coding sequences with a high degree of accuracy. The novel value of the method consists of enhanced translation initiation site identification, ability to identify sequences that use alternate genetic codes and confidence values for each gene call. We compare the results of MetaProdigal with other methods and conclude with a discussion of future improvements.
Tramontano, A; Macchiato, M F
1986-01-01
An algorithm to determine the probability that a reading frame codes for a protein is presented. It is based on the results of our previous studies on the thermodynamic characteristics of a translated reading frame. We also develop a prediction procedure to distinguish between coding and non-coding reading frames. The procedure is based on the characteristics of the putative product of the DNA sequence and not on periodicity characteristics of the sequence, so the prediction is not biased by the presence of overlapping translated reading frames or by the presence of translated reading frames on the complementary DNA strand. PMID:3753761
Yin, Changchuan
2015-04-01
To apply digital signal processing (DSP) methods to analyze DNA sequences, the sequences first must be specially mapped into numerical sequences. Thus, effective numerical mappings of DNA sequences play key roles in the effectiveness of DSP-based methods such as exon prediction. Despite numerous mappings of symbolic DNA sequences to numerical series, the existing mapping methods do not include the genetic coding features of DNA sequences. We present a novel numerical representation of DNA sequences using genetic codon context (GCC) in which the numerical values are optimized by simulated annealing to maximize the 3-periodicity signal to noise ratio (SNR). The optimized GCC representation is then applied in exon and intron prediction using a Short-Time Fourier Transform (STFT) approach. The results show the GCC method enhances the SNR values of exon sequences and thus increases the accuracy of predicting protein coding regions in genomes compared with the commonly used 4D binary representation. In addition, this study offers a novel way to reveal specific features of DNA sequences by optimizing numerical mappings of symbolic DNA sequences.
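The 3-periodicity signal can be sketched for the 4D binary (indicator) representation that the abstract uses as its baseline. This is a hedged illustration, not the GCC method: the SNR definition used here (power at frequency N/3 divided by the mean power over all non-zero frequencies) and the function names are assumptions. The direct DFT is quadratic in sequence length, so it is only suitable for short windows.

```python
import cmath

def power_at(indicator, k):
    """Squared magnitude of the DFT of a numeric sequence at frequency index k."""
    n = len(indicator)
    s = sum(x * cmath.exp(-2j * cmath.pi * k * i / n) for i, x in enumerate(indicator))
    return abs(s) ** 2

def periodicity_snr(seq):
    """Assumed SNR: spectral power at N/3 over mean power at k = 1..N-1.

    Uses one 0/1 indicator sequence per base (A, C, G, T). The sequence is
    trimmed to a multiple of 3 so that N/3 is an integer; it needs length >= 3.
    """
    seq = seq.upper()
    n = len(seq) - len(seq) % 3
    seq = seq[:n]
    indicators = [[1 if b == base else 0 for b in seq] for base in "ACGT"]

    def total_power(k):
        return sum(power_at(ind, k) for ind in indicators)

    peak = total_power(n // 3)
    mean = sum(total_power(k) for k in range(1, n)) / (n - 1)
    if mean <= 1e-12:  # e.g. a homopolymer: all power sits at k = 0
        return 0.0
    return peak / mean
```

For example, `periodicity_snr("ATG" * 20)` is large because every base recurs with period 3, whereas a period-2 sequence such as `"AT" * 30` gives a value near zero.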
Domier, L L; Latorre, I J; Steinlage, T A; McCoppin, N; Hartman, G L
2003-10-01
The variability of North American and Asian strains and isolates of Soybean mosaic virus was investigated. First, polymerase chain reaction (PCR) products representing the coat protein (CP)-coding regions of 38 SMVs were analyzed for restriction fragment length polymorphisms (RFLP). Second, the nucleotide and predicted amino acid sequence variability of the P1-coding region of 18 SMVs and the helper component/protease (HC/Pro) and CP-coding regions of 25 SMVs were assessed. The CP nucleotide and predicted amino acid sequences were the most similar and predicted phylogenetic relationships similar to those obtained from RFLP analysis. Neither RFLP nor sequence analyses of the CP-coding regions grouped the SMVs by geographical origin. The P1 and HC/Pro sequences were more variable and separated the North American and Asian SMV isolates into two groups similar to previously reported differences in pathogenic diversity of the two sets of SMV isolates. The P1 region was the most informative of the three regions analyzed. To assess the biological relevance of the sequence differences in the HC/Pro and CP coding regions, the transmissibility of 14 SMV isolates by Aphis glycines was tested. All field isolates of SMV were transmitted efficiently by A. glycines, but the laboratory isolates analyzed were transmitted poorly. The amino acid sequences from most, but not all, of the poorly transmitted isolates contained mutations in the aphid transmission-associated DAG and/or KLSC amino acid sequence motifs of CP and HC/Pro, respectively.
SGP-1: Prediction and Validation of Homologous Genes Based on Sequence Alignments
Wiehe, Thomas; Gebauer-Jung, Steffi; Mitchell-Olds, Thomas; Guigó, Roderic
2001-01-01
Conventional methods of gene prediction rely on the recognition of DNA-sequence signals, the coding potential or the comparison of a genomic sequence with a cDNA, EST, or protein database. Reasons for limited accuracy in many circumstances are species-specific training and the incompleteness of reference databases. Lately, comparative genome analysis has attracted increasing attention. Several analysis tools that are based on human/mouse comparisons are already available. Here, we present a program for the prediction of protein-coding genes, termed SGP-1 (Syntenic Gene Prediction), which is based on the similarity of homologous genomic sequences. In contrast to most existing tools, the accuracy of SGP-1 depends little on species-specific properties such as codon usage or the nucleotide distribution. SGP-1 may therefore be applied to nonstandard model organisms in vertebrates as well as in plants, without the need for extensive parameter training. In addition to predicting genes in large-scale genomic sequences, the program may be useful to validate gene structure annotations from databases. To this end, SGP-1 output also contains comparisons between predicted and annotated gene structures in HTML format. The program can be accessed via a Web server at http://soft.ice.mpg.de/sgp-1. The source code, written in ANSI C, is available on request from the authors. PMID:11544202
Huang, Ying; Chen, Shi-Yi; Deng, Feilong
2016-01-01
In silico analysis of DNA sequences is an important area of computational biology in the post-genomic era. Over the past two decades, computational approaches for ab initio prediction of gene structure from genome sequence alone have greatly facilitated our understanding of a variety of biological questions. Although the computational prediction of protein-coding genes is already well established, robustly finding non-coding RNA genes, such as miRNA and lncRNA, remains a challenge. The two main aspects of ab initio gene prediction are the computed values that describe sequence features and the algorithm used to train the discriminant function; various bioinformatic tools employ different combinations of the two. Herein, we briefly review these well-characterized sequence features in eukaryote genomes and their applications to ab initio gene prediction. The main purpose of this article is to provide an overview for beginners who aim to develop related bioinformatic tools.
Hall, L; Laird, J E; Craig, R K
1984-01-01
Nucleotide sequence analysis of cloned guinea-pig casein B cDNA sequences has identified two casein B variants related to the bovine and rat alpha s1 caseins. Amino acid homology was largely confined to the known bovine or predicted rat phosphorylation sites and within the 'signal' precursor sequence. Comparison of the deduced nucleotide sequence of the guinea-pig and rat alpha s1 casein mRNA species showed greater sequence conservation in the non-coding than in the coding regions, suggesting a functional and possibly regulatory role for the non-coding regions of casein mRNA. The results provide insight into the evolution of the casein genes, and raise questions as to the role of conserved nucleotide sequences within the non-coding regions of mRNA species. PMID:6548375
EUGÈNE'HOM: a generic similarity-based gene finder using multiple homologous sequences
Foissac, Sylvain; Bardou, Philippe; Moisan, Annick; Cros, Marie-Josée; Schiex, Thomas
2003-01-01
EUGÈNE'HOM is a gene prediction software for eukaryotic organisms based on comparative analysis. EUGÈNE'HOM is able to take into account multiple homologous sequences from more or less closely related organisms. It integrates the results of TBLASTX analysis, splice site and start codon prediction and a robust coding/non-coding probabilistic model which allows EUGÈNE'HOM to handle sequences from a variety of organisms. The current target of EUGÈNE'HOM is plant sequences. The EUGÈNE'HOM web site is available at http://genopole.toulouse.inra.fr/bioinfo/eugene/EuGeneHom/cgi-bin/EuGeneHom.pl. PMID:12824408
3D RNA and functional interactions from evolutionary couplings
Weinreb, Caleb; Riesselman, Adam; Ingraham, John B.; Gross, Torsten; Sander, Chris; Marks, Debora S.
2016-01-01
Non-coding RNAs are ubiquitous, but the discovery of new RNA gene sequences far outpaces research on their structure and functional interactions. We mine the evolutionary sequence record to derive precise information about function and structure of RNAs and RNA-protein complexes. As in protein structure prediction, we use maximum entropy global probability models of sequence co-variation to infer evolutionarily constrained nucleotide-nucleotide interactions within RNA molecules, and nucleotide-amino acid interactions in RNA-protein complexes. The predicted contacts allow all-atom blinded 3D structure prediction at good accuracy for several known RNA structures and RNA-protein complexes. For unknown structures, we predict contacts in 160 non-coding RNA families. Beyond 3D structure prediction, evolutionary couplings help identify important functional interactions, e.g., at switch points in riboswitches and at a complex nucleation site in HIV. Aided by accelerating sequence accumulation, evolutionary coupling analysis can accelerate the discovery of functional interactions and 3D structures involving RNA. PMID:27087444
Kanda, Kojun; Pflug, James M; Sproul, John S; Dasenko, Mark A; Maddison, David R
2015-01-01
In this paper we explore high-throughput Illumina sequencing of nuclear protein-coding, ribosomal, and mitochondrial genes in small, dried insects stored in natural history collections. We sequenced one tenebrionid beetle and 12 carabid beetles ranging in size from 3.7 to 9.7 mm in length that have been stored in various museums for 4 to 84 years. Although we chose a number of old, small specimens for which we expected low sequence recovery, we successfully recovered at least some low-copy nuclear protein-coding genes from all specimens. For example, in one 56-year-old beetle, 4.4 mm in length, our de novo assembly recovered about 63% of approximately 41,900 nucleotides in a target suite of 67 nuclear protein-coding gene fragments, and 70% using a reference-based assembly. Even in the least successfully sequenced carabid specimen, reference-based assembly yielded fragments that were at least 50% of the target length for 34 of 67 nuclear protein-coding gene fragments. Exploration of alternative references for reference-based assembly revealed few signs of bias created by the reference. For all specimens we recovered almost complete copies of ribosomal and mitochondrial genes. We verified the general accuracy of the sequences through comparisons with sequences obtained from PCR and Sanger sequencing, including of conspecific, fresh specimens, and through phylogenetic analysis that tested the placement of sequences in predicted regions. A few possible inaccuracies in the sequences were detected, but these rarely affected the phylogenetic placement of the samples. 
Although our sample sizes are low, an exploratory regression study suggests that the dominant factor in predicting success at recovering nuclear protein-coding genes is a high number of Illumina reads, with success at PCR of COI and killing by immersion in ethanol being secondary factors; in analyses of only high-read samples, the primary significant explanatory variable was body length, with small beetles being more successfully sequenced. PMID:26716693
Ghio, Silvina; Martinez Cáceres, Alfredo I; Talia, Paola; Grasso, Daniel H; Campos, Eleonora
2015-10-22
Paenibacillus sp. A59 was isolated from decaying forest soil in Argentina and characterized as a xylanolytic strain. We report the draft genome sequence of this isolate, with an estimated genome size of 7 Mb, which harbors 6,424 coding sequences. Genes coding for hydrolytic enzymes involved in lignocellulose deconstruction were predicted. Copyright © 2015 Ghio et al. PMID:26494679
SinEx DB: a database for single exon coding sequences in mammalian genomes.
Jorquera, Roddy; Ortiz, Rodrigo; Ossandon, F; Cárdenas, Juan Pablo; Sepúlveda, Rene; González, Carolina; Holmes, David S
2016-01-01
Eukaryotic genes are typically interrupted by intragenic, noncoding sequences termed introns. However, some genes lack introns in their coding sequence (CDS) and are generally known as 'single exon genes' (SEGs). In this work, a SEG is defined as a nuclear, protein-coding gene that lacks introns in its CDS. Whereas many public databases of eukaryotic multi-exon genes are available, there are only two specialized databases for SEGs. The present work addresses the need for a more extensive and diverse database by creating SinEx DB, a publicly available, searchable database of predicted SEGs from 10 completely sequenced mammalian genomes including human. SinEx DB houses the DNA and protein sequence information of these SEGs and includes their functional predictions (KOG) and the relative distribution of these functions within species. The information is stored in a relational database built with MySQL Server 5.1.33, and the complete dataset of SEG sequences and their functional predictions is available for downloading. SinEx DB can be interrogated by: (i) a browsable phylogenetic schema, (ii) BLAST searches against the in-house SinEx DB of SEGs and (iii) an advanced search mode in which the database can be searched by key words and any combination of searches by species and predicted functions. SinEx DB provides a rich source of information for advancing our understanding of the evolution and function of SEGs. Database URL: www.sinex.cl. © The Author(s) 2016. Published by Oxford University Press.
Plant MicroRNA Prediction by Supervised Machine Learning Using C5.0 Decision Trees.
Williams, Philip H; Eyles, Rod; Weiller, Georg
2012-01-01
MicroRNAs (miRNAs) are nonprotein coding RNAs between 20 and 22 nucleotides long that attenuate protein production. Different types of sequence data are being investigated for novel miRNAs, including genomic and transcriptomic sequences. A variety of machine learning methods have successfully predicted miRNA precursors, mature miRNAs, and other nonprotein coding sequences. MirTools, mirDeep2, and miRanalyzer require "read count" to be included with the input sequences, which restricts their use to deep-sequencing data. Our aim was to train a predictor using a cross-section of different species to accurately predict miRNAs outside the training set. We wanted a system that did not require read-count for prediction and could therefore be applied to short sequences extracted from genomic, EST, or RNA-seq sources. A miRNA-predictive decision-tree model has been developed by supervised machine learning. It only requires that the corresponding genome or transcriptome is available within a sequence window that includes the precursor candidate so that the required sequence features can be collected. Some of the most critical features for training the predictor are the miRNA:miRNA* duplex energy and the number of mismatches in the duplex. We present a cross-species plant miRNA predictor with 84.08% sensitivity and 98.53% specificity based on rigorous testing by leave-one-out validation.
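Leave-one-out validation and the sensitivity and specificity figures quoted above can be sketched as follows. This is a toy illustration under stated assumptions: the threshold classifier on a single duplex-energy feature stands in for the C5.0 decision tree (which is not reproduced here), and the function names and energy values are hypothetical.

```python
def leave_one_out(samples, labels, train):
    """Hold out each sample in turn, train on the rest, and tally the predictions.

    `train(samples, labels)` must return a callable model(sample) -> bool.
    Returns (sensitivity, specificity) = (TP/(TP+FN), TN/(TN+FP)).
    """
    tp = fn = tn = fp = 0
    for i in range(len(samples)):
        model = train(samples[:i] + samples[i + 1:], labels[:i] + labels[i + 1:])
        predicted = model(samples[i])
        if labels[i]:
            tp += predicted
            fn += not predicted
        else:
            fp += predicted
            tn += not predicted
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity

def train_energy_threshold(energies, labels):
    """Toy stand-in classifier: call a candidate a miRNA if its duplex energy
    is below the midpoint of the two class means (hypothetical rule)."""
    pos = [e for e, is_mirna in zip(energies, labels) if is_mirna]
    neg = [e for e, is_mirna in zip(energies, labels) if not is_mirna]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda energy: energy < threshold

# Hypothetical duplex energies (kcal/mol); the first three are labelled as true miRNAs.
energies = [-40.0, -38.0, -42.0, -20.0, -18.0, -22.0]
labels = [True, True, True, False, False, False]
result = leave_one_out(energies, labels, train_energy_threshold)
```

Because each sample is scored by a model that never saw it, the resulting rates estimate performance on unseen candidates, which is the property the abstract relies on.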
Khan, Abdul Latif; Asaf, Sajjad; Khan, Abdur Rahim; Al-Harrasi, Ahmed; Al-Rawahi, Ahmed; Lee, In-Jung
2016-05-10
Preussia sp. BSL10, family Sporormiaceae, actively produced a phytohormone (indole-3-acetic acid) and extracellular enzymes (phosphatases and glucosidases). The fungus also promoted the growth of the arid-land tree Boswellia sacra. Given these prospects, we sequenced its draft genome for the first time. The Illumina-based sequence analysis reveals an approximate genome size of 31.4 Mbp for Preussia sp. BSL10. Based on ab initio gene prediction, a total of 32,312 coding sequences were annotated, consisting of 11,967 coding genes, pseudogenes, and 221 tRNA genes. Furthermore, 321 carbohydrate-active enzymes were predicted and classified into many functional families. Copyright © 2016 Elsevier B.V. All rights reserved.
Zhao, A; Guo, A; Liu, Z; Pape, L
1997-01-01
The coding sequences for a Schizosaccharomyces pombe sequence-specific DNA binding protein, Reb1p, have been cloned. The predicted S. pombe Reb1p is 24-29% identical to mouse TTF-1 (transcription termination factor-1) and Saccharomyces cerevisiae REB1 protein, both of which direct termination of RNA polymerase I catalyzed transcripts. The S. pombe Reb1 cDNA encodes a predicted polypeptide of 504 amino acids with a predicted molecular weight of 58.4 kDa. The S. pombe Reb1p is unusual in that the bipartite DNA binding motif identified originally in S. cerevisiae and Kluyveromyces lactis REB1 proteins is uninterrupted and thus S. pombe Reb1p may contain the smallest natural REB1 homologous DNA binding domain. Its genomic coding sequences were shown to be interrupted by two introns. A recombinant histidine-tagged Reb1 protein bearing the rDNA binding domain has two homologous, sequence-specific binding sites in the S. pombe rDNA intergenic spacer, located between 289 and 480 nt downstream of the end of the approximately 25S rRNA coding sequences. Each binding site is 13-14 bp downstream of two of the three proposed in vivo termination sites. The core of this 17 bp site, AGGTAAGGGTAATGCAC, is specifically protected by Reb1p in footprinting analysis. PMID:9016645
Schiex, Thomas; Gouzy, Jérôme; Moisan, Annick; de Oliveira, Yannick
2003-07-01
We describe FrameD, a program that predicts coding regions in prokaryotic and matured eukaryotic sequences. Initially targeted at gene prediction in bacterial GC rich genomes, the gene model used in FrameD also allows genes to be predicted in the presence of frameshifts and partially undetermined sequences, which makes it also very suitable for gene prediction and frameshift correction in unfinished sequences such as EST and EST cluster sequences. Like recent eukaryotic gene prediction programs, FrameD also includes the ability to take into account protein similarity information both in its prediction and its graphical output. Its performance is evaluated on different bacterial genomes. The web site (http://genopole.toulouse.inra.fr/bioinfo/FrameD/FD) allows direct prediction, sequence correction and translation and the ability to learn new models for new organisms.
BiRen: predicting enhancers with a deep-learning-based model using the DNA sequence alone.
Yang, Bite; Liu, Feng; Ren, Chao; Ouyang, Zhangyi; Xie, Ziwei; Bo, Xiaochen; Shu, Wenjie
2017-07-01
Enhancer elements are noncoding stretches of DNA that play key roles in controlling gene expression programmes. Despite major efforts to develop accurate enhancer prediction methods, identifying enhancer sequences continues to be a challenge in the annotation of mammalian genomes. One of the major issues is the lack of large, sufficiently comprehensive and experimentally validated enhancers for humans or other species. Thus, the development of computational methods based on limited experimentally validated enhancers and deciphering the transcriptional regulatory code encoded in the enhancer sequences is urgent. We present a deep-learning-based hybrid architecture, BiRen, which predicts enhancers using the DNA sequence alone. Our results demonstrate that BiRen can learn common enhancer patterns directly from the DNA sequence and exhibits superior accuracy, robustness and generalizability in enhancer prediction relative to other state-of-the-art enhancer predictors based on sequence characteristics. Our BiRen will enable researchers to acquire a deeper understanding of the regulatory code of enhancer sequences. Our BiRen method can be freely accessed at https://github.com/wenjiegroup/BiRen. Contact: shuwj@bmi.ac.cn or boxc@bmi.ac.cn. Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press.
Draft Genome Sequence of the Deinococcus-Thermus Bacterium Meiothermus ruber Strain A
Thiel, Vera; Tomsho, Lynn P.; Burhans, Richard; ...
2015-03-26
The draft genome sequence of the Deinococcus-Thermus group bacterium Meiothermus ruber strain A, isolated from a cyanobacterial enrichment culture obtained from Octopus Spring (Yellowstone National Park, WY), comprises 2,968,099 bp in 170 contigs. It is predicted to contain 2,895 protein-coding genes, 44 tRNA-coding genes, and 2 rRNA operons.
Sequence and analysis of chromosome 4 of the plant Arabidopsis thaliana.
Mayer, K; Schüller, C; Wambutt, R; Murphy, G; Volckaert, G; Pohl, T; Düsterhöft, A; Stiekema, W; Entian, K D; Terryn, N; Harris, B; Ansorge, W; Brandt, P; Grivell, L; Rieger, M; Weichselgartner, M; de Simone, V; Obermaier, B; Mache, R; Müller, M; Kreis, M; Delseny, M; Puigdomenech, P; Watson, M; Schmidtheini, T; Reichert, B; Portatelle, D; Perez-Alonso, M; Boutry, M; Bancroft, I; Vos, P; Hoheisel, J; Zimmermann, W; Wedler, H; Ridley, P; Langham, S A; McCullagh, B; Bilham, L; Robben, J; Van der Schueren, J; Grymonprez, B; Chuang, Y J; Vandenbussche, F; Braeken, M; Weltjens, I; Voet, M; Bastiaens, I; Aert, R; Defoor, E; Weitzenegger, T; Bothe, G; Ramsperger, U; Hilbert, H; Braun, M; Holzer, E; Brandt, A; Peters, S; van Staveren, M; Dirske, W; Mooijman, P; Klein Lankhorst, R; Rose, M; Hauf, J; Kötter, P; Berneiser, S; Hempel, S; Feldpausch, M; Lamberth, S; Van den Daele, H; De Keyser, A; Buysshaert, C; Gielen, J; Villarroel, R; De Clercq, R; Van Montagu, M; Rogers, J; Cronin, A; Quail, M; Bray-Allen, S; Clark, L; Doggett, J; Hall, S; Kay, M; Lennard, N; McLay, K; Mayes, R; Pettett, A; Rajandream, M A; Lyne, M; Benes, V; Rechmann, S; Borkova, D; Blöcker, H; Scharfe, M; Grimm, M; Löhnert, T H; Dose, S; de Haan, M; Maarse, A; Schäfer, M; Müller-Auer, S; Gabel, C; Fuchs, M; Fartmann, B; Granderath, K; Dauner, D; Herzl, A; Neumann, S; Argiriou, A; Vitale, D; Liguori, R; Piravandi, E; Massenet, O; Quigley, F; Clabauld, G; Mündlein, A; Felber, R; Schnabl, S; Hiller, R; Schmidt, W; Lecharny, A; Aubourg, S; Chefdor, F; Cooke, R; Berger, C; Montfort, A; Casacuberta, E; Gibbons, T; Weber, N; Vandenbol, M; Bargues, M; Terol, J; Torres, A; Perez-Perez, A; Purnelle, B; Bent, E; Johnson, S; Tacon, D; Jesse, T; Heijnen, L; Schwarz, S; Scholler, P; Heber, S; Francs, P; Bielke, C; Frishman, D; Haase, D; Lemcke, K; Mewes, H W; Stocker, S; Zaccaria, P; Bevan, M; Wilson, R K; de la Bastide, M; Habermann, K; Parnell, L; Dedhia, N; Gnoj, L; Schutz, K; Huang, E; Spiegel, L; Sehkon, M; 
Murray, J; Sheet, P; Cordes, M; Abu-Threideh, J; Stoneking, T; Kalicki, J; Graves, T; Harmon, G; Edwards, J; Latreille, P; Courtney, L; Cloud, J; Abbott, A; Scott, K; Johnson, D; Minx, P; Bentley, D; Fulton, B; Miller, N; Greco, T; Kemp, K; Kramer, J; Fulton, L; Mardis, E; Dante, M; Pepin, K; Hillier, L; Nelson, J; Spieth, J; Ryan, E; Andrews, S; Geisel, C; Layman, D; Du, H; Ali, J; Berghoff, A; Jones, K; Drone, K; Cotton, M; Joshu, C; Antonoiu, B; Zidanic, M; Strong, C; Sun, H; Lamar, B; Yordan, C; Ma, P; Zhong, J; Preston, R; Vil, D; Shekher, M; Matero, A; Shah, R; Swaby, I K; O'Shaughnessy, A; Rodriguez, M; Hoffmann, J; Till, S; Granat, S; Shohdy, N; Hasegawa, A; Hameed, A; Lodhi, M; Johnson, A; Chen, E; Marra, M; Martienssen, R; McCombie, W R
1999-12-16
The higher plant Arabidopsis thaliana (Arabidopsis) is an important model for identifying plant genes and determining their function. To assist biological investigations and to define chromosome structure, a coordinated effort to sequence the Arabidopsis genome was initiated in late 1996. Here we report one of the first milestones of this project, the sequence of chromosome 4. Analysis of 17.38 megabases of unique sequence, representing about 17% of the genome, reveals 3,744 protein coding genes, 81 transfer RNAs and numerous repeat elements. Heterochromatic regions surrounding the putative centromere, which has not yet been completely sequenced, are characterized by an increased frequency of a variety of repeats, new repeats, reduced recombination, lowered gene density and lowered gene expression. Roughly 60% of the predicted protein-coding genes have been functionally characterized on the basis of their homology to known genes. Many genes encode predicted proteins that are homologous to human and Caenorhabditis elegans proteins.
Lee, Chi-Ching; Chen, Yi-Ping Phoebe; Yao, Tzu-Jung; Ma, Cheng-Yu; Lo, Wei-Cheng; Lyu, Ping-Chiang; Tang, Chuan Yi
2013-04-10
Sequencing of microbial genomes is important because microbes carry antibiotic and pathogenetic activities. However, even with the help of new assembling software, finishing a whole genome is a time-consuming task. In most bacteria, pathogenetic or antibiotic genes are carried in genomic islands. Therefore, a quick genomic island (GI) prediction method is useful for genomes whose sequencing is still ongoing. In this work, we built a Web server called GI-POP (http://gipop.life.nthu.edu.tw) which integrates a sequence assembling tool, a functional annotation pipeline, and a high-performance GI predicting module, in a support vector machine (SVM)-based method called genomic island genomic profile scanning (GI-GPS). The draft genomes of the ongoing genome projects in contigs or scaffolds can be submitted to our Web server, and it provides the functional annotation and highly probable GI-predicting results. GI-POP is a comprehensive annotation Web server designed for ongoing genome project analysis. Researchers can perform annotation and obtain pre-analytic information, including possible GIs, coding/non-coding sequences and functional analysis, from their draft genomes. This pre-analytic system can provide useful information for finishing a genome sequencing project. Copyright © 2012 Elsevier B.V. All rights reserved.
Recent plant studies using Victoria 2.0
DOE Office of Scientific and Technical Information (OSTI.GOV)
BIXLER,NATHAN E.; GASSER,RONALD D.
2000-03-08
VICTORIA 2.0 is a mechanistic computer code designed to analyze fission product behavior within the reactor coolant system (RCS) during a severe nuclear reactor accident. It provides detailed predictions of the release of radioactive and nonradioactive materials from the reactor core and transport and deposition of these materials within the RCS and secondary circuits. These predictions account for the chemical and aerosol processes that affect radionuclide behavior. VICTORIA 2.0 was released in early 1999; a new version, VICTORIA 2.1, is now under development. The largest improvements in VICTORIA 2.1 are connected with the thermochemical database, which is being revised and expanded following the recommendations of a peer review. Three risk-significant severe accident sequences have recently been investigated using the VICTORIA 2.0 code. The focus here is on how various chemistry options affect the predictions. Additionally, the VICTORIA predictions are compared with ones made using the MELCOR code. The three sequences are a station blackout in a GE BWR and steam generator tube rupture (SGTR) and pump-seal LOCA sequences in a 3-loop Westinghouse PWR. These sequences cover a range of system pressures, from fully depressurized to full system pressure. The chief results of this study are the fission product fractions that are retained in the core, RCS, secondary, and containment and the fractions that are released into the environment.
Antalis, T M; Clark, M A; Barnes, T; Lehrbach, P R; Devine, P L; Schevzov, G; Goss, N H; Stephens, R W; Tolstoshev, P
1988-02-01
Human monocyte-derived plasminogen activator inhibitor (mPAI-2) was purified to homogeneity from the U937 cell line and partially sequenced. Oligonucleotide probes derived from this sequence were used to screen a cDNA library prepared from U937 cells. One positive clone was sequenced and contained most of the coding sequence as well as a long incomplete 3' untranslated region (1112 base pairs). This cDNA sequence was shown to encode mPAI-2 by hybrid-select translation. A cDNA clone encoding the remainder of the mPAI-2 mRNA was obtained by primer extension of U937 poly(A)+ RNA using a probe complementary to the mPAI-2 coding region. The coding sequence for mPAI-2 was placed under the control of the lambda PL promoter, and the protein expressed in Escherichia coli formed a complex with urokinase that could be detected immunologically. By nucleotide sequence analysis, mPAI-2 cDNA encodes a protein containing 415 amino acids with a predicted unglycosylated Mr of 46,543. The predicted amino acid sequence of mPAI-2 is very similar to placental PAI-2 (3 amino acid differences) and shows extensive homology with members of the serine protease inhibitor (serpin) superfamily. mPAI-2 was found to be more homologous to ovalbumin (37%) than the endothelial plasminogen activator inhibitor, PAI-1 (26%). Like ovalbumin, mPAI-2 appears to have no typical amino-terminal signal sequence. The 3' untranslated region of the mPAI-2 cDNA contains a putative regulatory sequence that has been associated with the inflammatory mediators.
Antalis, T M; Clark, M A; Barnes, T; Lehrbach, P R; Devine, P L; Schevzov, G; Goss, N H; Stephens, R W; Tolstoshev, P
1988-01-01
Human monocyte-derived plasminogen activator inhibitor (mPAI-2) was purified to homogeneity from the U937 cell line and partially sequenced. Oligonucleotide probes derived from this sequence were used to screen a cDNA library prepared from U937 cells. One positive clone was sequenced and contained most of the coding sequence as well as a long incomplete 3' untranslated region (1112 base pairs). This cDNA sequence was shown to encode mPAI-2 by hybrid-select translation. A cDNA clone encoding the remainder of the mPAI-2 mRNA was obtained by primer extension of U937 poly(A)+ RNA using a probe complementary to the mPAI-2 coding region. The coding sequence for mPAI-2 was placed under the control of the lambda PL promoter, and the protein expressed in Escherichia coli formed a complex with urokinase that could be detected immunologically. By nucleotide sequence analysis, mPAI-2 cDNA encodes a protein containing 415 amino acids with a predicted unglycosylated Mr of 46,543. The predicted amino acid sequence of mPAI-2 is very similar to placental PAI-2 (3 amino acid differences) and shows extensive homology with members of the serine protease inhibitor (serpin) superfamily. mPAI-2 was found to be more homologous to ovalbumin (37%) than the endothelial plasminogen activator inhibitor, PAI-1 (26%). Like ovalbumin, mPAI-2 appears to have no typical amino-terminal signal sequence. The 3' untranslated region of the mPAI-2 cDNA contains a putative regulatory sequence that has been associated with the inflammatory mediators. PMID:3257578
Sugita, Chieko; Ogata, Koretsugu; Shikata, Masamitsu; Jikuya, Hiroyuki; Takano, Jun; Furumichi, Miho; Kanehisa, Minoru; Omata, Tatsuo; Sugiura, Masahiro; Sugita, Mamoru
2007-01-01
The entire genome of the unicellular cyanobacterium Synechococcus elongatus PCC 6301 (formerly Anacystis nidulans Berkeley strain 6301) was sequenced. The genome consisted of a circular chromosome 2,696,255 bp long. A total of 2,525 potential protein-coding genes, two sets of rRNA genes, 45 tRNA genes representing 42 tRNA species, and several genes for small stable RNAs were assigned to the chromosome by similarity searches and computer predictions. The translated products of 56% of the potential protein-coding genes showed sequence similarities to experimentally identified and predicted proteins of known function, and the products of 35% of the genes showed sequence similarities to the translated products of hypothetical genes. The remaining 9% of genes lacked significant similarities to genes for predicted proteins in the public DNA databases. Some 139 genes coding for photosynthesis-related components were identified. Thirty-seven genes for two-component signal transduction systems were also identified. This is the smallest number of such genes identified in cyanobacteria, except for marine cyanobacteria, suggesting that only simple signal transduction systems are found in this strain. The gene arrangement and nucleotide sequence of Synechococcus elongatus PCC 6301 were nearly identical to those of a closely related strain Synechococcus elongatus PCC 7942, except for the presence of a 188.6 kb inversion. The sequences as well as the gene information shown in this paper are available in the Web database, CYORF (http://www.cyano.genome.jp/).
Aires-de-Sousa, João; Aires-de-Sousa, Luisa
2003-01-01
We propose representing individual positions in DNA sequences by virtual potentials generated by other bases of the same sequence. This is a compact representation of the neighbourhood of a base. The distribution of the virtual potentials over the whole sequence can be used as a representation of the entire sequence (SEQREP code). It is a flexible code, with a length independent of the sequence size, does not require previous alignment, and is convenient for processing by neural networks or statistical techniques. To evaluate its biological significance, the SEQREP code was used for training Kohonen self-organizing maps (SOMs) in two applications: (a) detection of Alu sequences, and (b) classification of sequences encoding for HIV-1 envelope glycoprotein (env) into subtypes A-G. It was demonstrated that SOMs clustered sequences belonging to different classes into distinct regions. For independent test sets, very high rates of correct predictions were obtained (97% in the first application, 91% in the second). Possible areas of application of SEQREP codes include functional genomics, phylogenetic analysis, detection of repetitions, database retrieval, and automatic alignment. Software for representing sequences by SEQREP code, and for training Kohonen SOMs is made freely available from http://www.dq.fct.unl.pt/qoa/jas/seqrep. Supplementary material is available at http://www.dq.fct.unl.pt/qoa/jas/seqrep/bioinf2002
Complete Genome Sequence of Thiostrepton-Producing Streptomyces laurentii ATCC 31255
Fujino, Yasuhiro; Nagayoshi, Yuko; Ohshima, Toshihisa; Ogata, Seiya
2016-01-01
Streptomyces laurentii ATCC 31255 produces thiostrepton, a thiopeptide class antibiotic. Here, we report the complete genome sequence for this strain, which contains a total of 8,032,664 bp, 7,452 predicted coding sequences, and a G+C content of 72.3%. PMID:27257211
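The G+C content quoted in genome announcements such as this one is the share of G and C bases among all bases in the assembly. A minimal Python sketch of that calculation follows; the function name `gc_content` and the toy sequence are illustrative assumptions, not part of the reported work.

```python
def gc_content(seq: str) -> float:
    """Return the percentage of G and C among the A, C, G and T bases of seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at  # ambiguous symbols such as N are ignored
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return 100.0 * gc / total


# Hypothetical toy sequence, not taken from any real genome.
toy = "GCGCGGATCC"
print(f"G+C content: {gc_content(toy):.1f}%")
```

For a real assembly the same function would be applied to the concatenated contig or chromosome sequences read from a FASTA file.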
DOE Office of Scientific and Technical Information (OSTI.GOV)
Solovyev, V.V.; Salamov, A.A.; Lawrence, C.B.
1994-12-31
Discriminant analysis is applied to the problem of recognizing 5'-, internal and 3'-exons in human DNA sequences. Specific recognition functions were developed for revealing exons of particular types. The method is based on a splice site prediction algorithm that uses the linear Fisher discriminant to combine information about significant triplet frequencies of various functional parts of splice site regions and preferences of oligonucleotides in protein-coding and intron regions. The accuracy of our splice site recognition function is about 97%. A discriminant function for 5'-exon prediction includes hexanucleotide composition of the upstream region, triplet composition around the ATG codon, ORF coding potential, donor splice site potential and composition of the downstream intron region. For internal exon prediction, we combine in a discriminant function the characteristics describing the 5'-intron region, donor splice site, coding region, acceptor splice site and 3'-intron region for each open reading frame flanked by GT and AG base pairs. The accuracy of precise internal exon recognition on a test set of 451 exon and 246693 pseudoexon sequences is 77% with a specificity of 79% and a level of pseudoexon ORF prediction of 99.96%. The recognition quality computed at the level of individual nucleotides is 89% for exon sequences and 98% for intron sequences. A discriminant function for 3'-exon prediction includes octanucleotide composition of the upstream intron region, triplet composition around the stop codon, ORF coding potential, acceptor splice site potential and hexanucleotide composition of the downstream region. We unite these three discriminant functions in the exon-predicting program FEX (find exons). FEX exactly predicts 70% of 1016 exons from the test set of 181 complete genes with specificity 73%, and 89% of exons are exactly or partially predicted. On average, 85% of nucleotides were predicted accurately with specificity 91%.
Vlahovicek, K; Munteanu, M G; Pongor, S
1999-01-01
Bending is a local conformational micropolymorphism of DNA in which the original B-DNA structure is only distorted but not extensively modified. Bending can be predicted by simple static geometry models as well as by a recently developed elastic model that incorporates sequence-dependent anisotropic bendability (SDAB). The SDAB model qualitatively explains phenomena including affinity of protein binding, kinking, as well as sequence-dependent vibrational properties of DNA. The vibrational properties of DNA segments can be studied by finite element analysis of a model subjected to an initial bending moment. The frequency spectrum is obtained by applying Fourier analysis to the displacement values in the time domain. This analysis shows that the spectrum of the bending vibrations depends quite sensitively on the sequence; for example, the spectrum of a curved sequence is characteristically different from the spectrum of straight sequence motifs of identical basepair composition. Curvature distributions are genome-specific, and pronounced differences are found between protein-coding and regulatory regions; that is, sites of extreme curvature and/or bendability are less frequent in protein-coding regions. A WWW server is set up for the prediction of curvature and generation of 3D models from DNA sequences (http://www.icgeb.trieste.it/dna).
Prediction suppression and surprise enhancement in monkey inferotemporal cortex.
Ramachandran, Suchitra; Meyer, Travis; Olson, Carl R
2017-07-01
Exposing monkeys, over the course of days and weeks, to pairs of images presented in fixed sequence, so that each leading image becomes a predictor for the corresponding trailing image, affects neuronal visual responsiveness in area TE. At the end of the training period, neurons respond relatively weakly to a trailing image when it appears in a trained sequence and, thus, confirms prediction, whereas they respond relatively strongly to the same image when it appears in an untrained sequence and, thus, violates prediction. This effect could arise from prediction suppression (reduced firing in response to the occurrence of a probable event) or surprise enhancement (elevated firing in response to the omission of a probable event). To identify its cause, we compared firing under the prediction-confirming and prediction-violating conditions to firing under a prediction-neutral condition. The results provide strong evidence for prediction suppression and limited evidence for surprise enhancement. NEW & NOTEWORTHY In predictive coding models of the visual system, neurons carry signed prediction error signals. We show here that monkey inferotemporal neurons exhibit prediction-modulated firing, as posited by these models, but that the signal is unsigned. The response to a prediction-confirming image is suppressed, and the response to a prediction-violating image may be enhanced. These results are better explained by a model in which the visual system emphasizes unpredicted events than by a predictive coding model. Copyright © 2017 the American Physiological Society.
GeneBuilder: interactive in silico prediction of gene structure.
Milanesi, L; D'Angelo, D; Rogozin, I B
1999-01-01
Prediction of gene structure in newly sequenced DNA becomes very important in large genome sequencing projects. This problem is complicated due to the exon-intron structure of eukaryotic genes and because gene expression is regulated by many different short nucleotide domains. In order to be able to analyse the full gene structure in different organisms, it is necessary to combine information about potential functional signals (promoter region, splice sites, start and stop codons, 3' untranslated region) together with the statistical properties of coding sequences (coding potential), information about homologous proteins, ESTs and repeated elements. We have developed the GeneBuilder system which is based on prediction of functional signals and coding regions by different approaches in combination with similarity searches in protein and EST databases. The potential gene structure models are obtained by using a dynamic programming method. The program permits the use of several parameters for gene structure prediction and refinement. During gene model construction, selecting different exon homology levels with a protein sequence selected from a list of homologous proteins can improve the accuracy of the gene structure prediction. In the case of low homology, GeneBuilder is still able to predict the gene structure. The GeneBuilder system has been tested by using the standard set (Burset and Guigo, Genomics, 34, 353-367, 1996) and the performances are: 0.89 sensitivity and 0.91 specificity at the nucleotide level. The total correlation coefficient is 0.88. The GeneBuilder system is implemented as a part of WebGene at the URL http://www.itba.mi.cnr.it/webgene and of the TRADAT (TRAnscription Database and Analysis Tools) launcher at the URL http://www.itba.mi.cnr.it/tradat.
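Nucleotide-level sensitivity, specificity and correlation coefficient of the kind reported above can be computed from per-base coding/noncoding labels. The Python sketch below assumes the conventions usual in gene-finder benchmarks, where "specificity" means TP/(TP+FP) rather than TN/(TN+FP). The function name `nucleotide_accuracy` and the toy masks are hypothetical.

```python
from math import sqrt


def nucleotide_accuracy(true_mask, pred_mask):
    """Compare per-base coding labels (truthy = coding) of annotation and prediction.

    Returns (sensitivity, specificity, correlation coefficient), where
    specificity is TP / (TP + FP), an assumed gene-prediction convention.
    """
    if len(true_mask) != len(pred_mask):
        raise ValueError("masks must have the same length")
    tp = sum(1 for t, p in zip(true_mask, pred_mask) if t and p)
    tn = sum(1 for t, p in zip(true_mask, pred_mask) if not t and not p)
    fp = sum(1 for t, p in zip(true_mask, pred_mask) if not t and p)
    fn = sum(1 for t, p in zip(true_mask, pred_mask) if t and not p)
    denom = sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    if tp + fn == 0 or tp + fp == 0 or denom == 0:
        raise ValueError("measures are undefined for these masks")
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    cc = (tp * tn - fp * fn) / denom
    return sn, sp, cc


# Hypothetical toy annotation and prediction over eight bases.
truth = [1, 1, 1, 1, 0, 0, 0, 0]
pred = [1, 1, 1, 0, 1, 0, 0, 0]
print(nucleotide_accuracy(truth, pred))
```

The correlation coefficient here has the same form as the Matthews correlation coefficient, so a single value summarizes both over- and under-prediction.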
Genome Sequences of Three Cluster AU Arthrobacter Phages, Caterpillar, Nightmare, and Teacup
Adair, Tamarah L.; Stowe, Emily; Pizzorno, Marie C.; Krukonis, Gregory; Harrison, Melinda; Garlena, Rebecca A.; Russell, Daniel A.; Jacobs-Sera, Deborah
2017-01-01
Caterpillar, Nightmare, and Teacup are cluster AU siphoviral phages isolated from enriched soil on Arthrobacter sp. strain ATCC 21022. These genomes are 58 kbp long with an average G+C content of 50%. Sequence analysis predicts 86 to 92 protein-coding genes, including a large number of small proteins with predicted transmembrane domains. PMID:29122860
NASA Astrophysics Data System (ADS)
Yu, Jia-Feng; Sui, Tian-Xiang; Wang, Hong-Mei; Wang, Chun-Ling; Jing, Li; Wang, Ji-Hua
2015-12-01
Agrobacterium tumefaciens strain C58 is a type of pathogen that can cause tumors in some dicotyledonous plants. Ever since the genome of A. tumefaciens strain C58 was sequenced, the quality of annotation of its protein-coding genes has been queried continually, because the annotation varies greatly among different databases. In this paper, the questionable hypothetical genes were re-predicted by integrating the TN curve and Z curve methods. As a result, 30 genes originally annotated as “hypothetical” were discriminated as being non-coding sequences. By testing the re-prediction program 10 times on data sets composed of the function-known genes, the mean accuracy of 99.99% and mean Matthews correlation coefficient value of 0.9999 were obtained. Further sequence analysis and COG analysis showed that the re-annotation results were very reliable. This work can provide an efficient tool and data resources for future studies of A. tumefaciens strain C58. Project supported by the National Natural Science Foundation of China (Grant Nos. 61302186 and 61271378) and the Funding from the State Key Laboratory of Bioelectronics of Southeast University.
Firth, Andrew E; Atkins, John F
2009-01-01
Japanese encephalitis, West Nile, Usutu and Murray Valley encephalitis viruses form a tight subgroup within the larger Flavivirus genus. These viruses utilize a single-polyprotein expression strategy, resulting in ~10 mature proteins. Plotting the conservation at synonymous sites along the polyprotein coding sequence reveals strong conservation peaks at the very 5' end of the coding sequence, and also at the 5' end of the sequence encoding the NS2A protein. Such peaks are generally indicative of functionally important non-coding sequence elements. The second peak corresponds to a predicted stable pseudoknot structure whose biological importance is supported by compensatory mutations that preserve the structure. The pseudoknot is preceded by a conserved slippery heptanucleotide (Y CCU UUU), thus forming a classical stimulatory motif for -1 ribosomal frameshifting. We hypothesize, therefore, that the functional importance of the pseudoknot is to stimulate a portion of ribosomes to shift -1 nt into a short (45 codon), conserved, overlapping open reading frame, termed foo. Since cleavage at the NS1-NS2A boundary is known to require synthesis of NS2A in cis, the resulting transframe fusion protein is predicted to be NS1-NS2AN-term-FOO. We hypothesize that this may explain the origin of the previously identified NS1 'extension' protein in JEV-group flaviviruses, known as NS1'. PMID:19196463
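Candidate sites matching the slippery heptanucleotide Y CCU UUU (Y being a pyrimidine, C or U) can be located with a simple pattern search. The Python sketch below only illustrates that motif scan; it ignores reading frame and the downstream pseudoknot, both of which matter for real frameshift signals. The function name `find_slippery_sites` and the test sequences are hypothetical.

```python
import re

# Y CCU UUU with Y = C or U; the lookahead lets overlapping matches be reported.
SLIPPERY = re.compile(r"(?=[CU]CCUUUU)")


def find_slippery_sites(seq: str) -> list[int]:
    """Return 0-based start positions of the motif in an RNA or DNA sequence."""
    rna = seq.upper().replace("T", "U")  # accept DNA-alphabet input as well
    return [m.start() for m in SLIPPERY.finditer(rna)]


# Hypothetical toy sequences, not taken from any viral genome.
print(find_slippery_sites("AAACCCUUUUGG"))
print(find_slippery_sites("GGTCCTTTTA"))
```

A fuller screen would keep only hits whose codon phase matches the polyprotein frame and that are followed by a predicted stable structure.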
Context influences on TALE–DNA binding revealed by quantitative profiling
Rogers, Julia M.; Barrera, Luis A.; Reyon, Deepak; Sander, Jeffry D.; Kellis, Manolis; Joung, J Keith; Bulyk, Martha L.
2015-01-01
Transcription activator-like effector (TALE) proteins recognize DNA using a seemingly simple DNA-binding code, which makes them attractive for use in genome engineering technologies that require precise targeting. Although this code is used successfully to design TALEs to target specific sequences, off-target binding has been observed and is difficult to predict. Here we explore TALE–DNA interactions comprehensively by quantitatively assaying the DNA-binding specificities of 21 representative TALEs to ∼5,000–20,000 unique DNA sequences per protein using custom-designed protein-binding microarrays (PBMs). We find that protein context features exert significant influences on binding. Thus, the canonical recognition code does not fully capture the complexity of TALE–DNA binding. We used the PBM data to develop a computational model, Specificity Inference For TAL-Effector Design (SIFTED), to predict the DNA-binding specificity of any TALE. We provide SIFTED as a publicly available web tool that predicts potential genomic off-target sites for improved TALE design. PMID:26067805
Context influences on TALE-DNA binding revealed by quantitative profiling.
Rogers, Julia M; Barrera, Luis A; Reyon, Deepak; Sander, Jeffry D; Kellis, Manolis; Joung, J Keith; Bulyk, Martha L
2015-06-11
Transcription activator-like effector (TALE) proteins recognize DNA using a seemingly simple DNA-binding code, which makes them attractive for use in genome engineering technologies that require precise targeting. Although this code is used successfully to design TALEs to target specific sequences, off-target binding has been observed and is difficult to predict. Here we explore TALE-DNA interactions comprehensively by quantitatively assaying the DNA-binding specificities of 21 representative TALEs to ∼5,000-20,000 unique DNA sequences per protein using custom-designed protein-binding microarrays (PBMs). We find that protein context features exert significant influences on binding. Thus, the canonical recognition code does not fully capture the complexity of TALE-DNA binding. We used the PBM data to develop a computational model, Specificity Inference For TAL-Effector Design (SIFTED), to predict the DNA-binding specificity of any TALE. We provide SIFTED as a publicly available web tool that predicts potential genomic off-target sites for improved TALE design.
Sequence similarity is more relevant than species specificity in probabilistic backtranslation.
Ferro, Alfredo; Giugno, Rosalba; Pigola, Giuseppe; Pulvirenti, Alfredo; Di Pietro, Cinzia; Purrello, Michele; Ragusa, Marco
2007-02-21
Backtranslation is the process of decoding a sequence of amino acids into the corresponding codons. All synthetic gene design systems include a backtranslation module. The degeneracy of the genetic code makes backtranslation potentially ambiguous since most amino acids are encoded by multiple codons. The common approach to overcome this difficulty is based on imitation of codon usage within the target species. This paper describes EasyBack, a new parameter-free, fully-automated software for backtranslation using Hidden Markov Models. EasyBack is not based on imitation of codon usage within the target species, but instead uses a sequence-similarity criterion. The model is trained with a set of proteins with known cDNA coding sequences, constructed from the input protein by querying the NCBI databases with BLAST. Unlike existing software, the proposed method allows the quality of prediction to be estimated. When tested on a group of proteins that show different degrees of sequence conservation, EasyBack outperforms other published methods in terms of precision. The prediction quality of a protein backtranslation method is markedly increased by replacing the criterion of most used codon in the same species with a Hidden Markov Model trained with a set of most similar sequences from all species. Moreover, the proposed method allows the quality of prediction to be estimated probabilistically.
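The "most used codon" criterion that the abstract describes as the common approach can be sketched in a few lines of Python. This is the baseline, not EasyBack's Hidden Markov Model. The codon counts below are invented for illustration and cover only a few residues; `CODON_COUNTS` and `backtranslate` are hypothetical names.

```python
# Hypothetical codon-usage counts for a handful of residues (not real data).
CODON_COUNTS = {
    "M": {"ATG": 100},
    "K": {"AAA": 70, "AAG": 30},
    "L": {"CTG": 50, "CTC": 15, "TTA": 10},
    "*": {"TAA": 60, "TGA": 30, "TAG": 10},
}


def backtranslate(protein: str, codon_counts=CODON_COUNTS) -> str:
    """Backtranslate by choosing, for each residue, its most frequently used codon."""
    codons = []
    for aa in protein.upper():
        if aa not in codon_counts:
            raise KeyError(f"no codon usage available for residue {aa!r}")
        usage = codon_counts[aa]
        codons.append(max(usage, key=usage.get))
    return "".join(codons)


print(backtranslate("MKL*"))
```

Because every occurrence of a residue receives the same codon, this baseline cannot reflect sequence context, which is the limitation a similarity-trained model is meant to address.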
Complete genome sequence of ‘Candidatus Liberibacter africanus’
USDA-ARS's Scientific Manuscript database
The complete genome sequence of ‘Candidatus Liberibacter africanus’ (Laf), strain ptsapsy, was obtained by an Illumina HiSeq 2000. The Laf genome comprises 1,192,232 nucleotides, 34.5% GC content, 1,141 predicted coding sequences, 44 tRNAs, 3 complete copies of ribosomal RNA genes (16S, 23S and 5S) ...
Genome Sequence of a Chromium-Reducing Strain, Bacillus cereus S612
Wang, Dongping; Boukhalfa, Hakim; Ware, Doug S.; ...
2015-12-10
We report here the genome sequence of an effective chromium-reducing bacterium, Bacillus cereus strain S612. We found that the size of the draft genome sequence is approximately 5.4 Mb, with a G+C content of 35%, and it is predicted to contain 5,450 protein-coding genes.
Coutouné, Natalia; Mulato, Aline Tieppo Nogueira
2017-01-01
Here, we present the draft genome sequence of Saccharomyces cerevisiae BG-1, a Brazilian industrial strain widely used for bioethanol production from sugarcane. The 11.7-Mb genome sequence consists of 216 scaffolds and harbors 5,607 predicted protein-coding genes. PMID:28360170
Lee, Imchang; Chalita, Mauricio; Ha, Sung-Min; Na, Seong-In; Yoon, Seok-Hwan; Chun, Jongsik
2017-06-01
Thanks to the recent advancement of DNA sequencing technology, the cost and time of prokaryotic genome sequencing have been dramatically decreased. It has repeatedly been reported that genome sequencing using high-throughput next-generation sequencing is prone to contaminations due to its high depth of sequencing coverage. Although a few bioinformatics tools are available to detect potential contaminations, these have inherited limitations as they only use protein-coding genes. Here we introduce a new algorithm, called ContEst16S, to detect potential contaminations using 16S rRNA genes from genome assemblies. We screened 69 745 prokaryotic genomes from the NCBI Assembly Database using ContEst16S and found that 594 were contaminated by bacteria, human and plants. Of the predicted contaminated genomes, 8 % were not predicted by the existing protein-coding gene-based tool, implying that both methods can be complementary in the detection of contaminations. A web-based service of the algorithm is available at www.ezbiocloud.net/tools/contest16s.
RNAcode: Robust discrimination of coding and noncoding regions in comparative sequence data
Washietl, Stefan; Findeiß, Sven; Müller, Stephan A.; Kalkhof, Stefan; von Bergen, Martin; Hofacker, Ivo L.; Stadler, Peter F.; Goldman, Nick
2011-01-01
With the availability of genome-wide transcription data and massive comparative sequencing, the discrimination of coding from noncoding RNAs and the assessment of coding potential in evolutionarily conserved regions arose as a core analysis task. Here we present RNAcode, a program to detect coding regions in multiple sequence alignments that is optimized for emerging applications not covered by current protein gene-finding software. Our algorithm combines information from nucleotide substitution and gap patterns in a unified framework and also deals with real-life issues such as alignment and sequencing errors. It uses an explicit statistical model with no machine learning component and can therefore be applied “out of the box,” without any training, to data from all domains of life. We describe the RNAcode method and apply it in combination with mass spectrometry experiments to predict and confirm seven novel short peptides in Escherichia coli and to analyze the coding potential of RNAs previously annotated as “noncoding.” RNAcode is open source software and available for all major platforms at http://wash.github.com/rnacode. PMID:21357752
RNAcode: robust discrimination of coding and noncoding regions in comparative sequence data.
Washietl, Stefan; Findeiss, Sven; Müller, Stephan A; Kalkhof, Stefan; von Bergen, Martin; Hofacker, Ivo L; Stadler, Peter F; Goldman, Nick
2011-04-01
With the availability of genome-wide transcription data and massive comparative sequencing, the discrimination of coding from noncoding RNAs and the assessment of coding potential in evolutionarily conserved regions arose as a core analysis task. Here we present RNAcode, a program to detect coding regions in multiple sequence alignments that is optimized for emerging applications not covered by current protein gene-finding software. Our algorithm combines information from nucleotide substitution and gap patterns in a unified framework and also deals with real-life issues such as alignment and sequencing errors. It uses an explicit statistical model with no machine learning component and can therefore be applied "out of the box," without any training, to data from all domains of life. We describe the RNAcode method and apply it in combination with mass spectrometry experiments to predict and confirm seven novel short peptides in Escherichia coli and to analyze the coding potential of RNAs previously annotated as "noncoding." RNAcode is open source software and available for all major platforms at http://wash.github.com/rnacode.
Chiusano, M L; D'Onofrio, G; Alvarez-Valin, F; Jabbari, K; Colonna, G; Bernardi, G
1999-09-30
We investigated the relationships between the nucleotide substitution rates and the predicted secondary structures in the three-state representation (alpha-helix, beta-sheet, and coil). The analysis was carried out on 34 alignments, each of which comprised sequences belonging to at least four different mammalian orders. The rates of synonymous substitution were found to be significantly different in regions predicted to be alpha-helix, beta-sheet, or coil. Likewise, the nonsynonymous rates also differ, although, as expected, to a lesser extent, in the three types of secondary structure, suggesting that different selective constraints associated with the different structures affect the synonymous and nonsynonymous rates in a similar way. Moreover, the base composition of the third codon positions is different in coding sequence regions corresponding to different secondary structures of proteins.
Yerrapragada, Shaila; Shukla, Animesh; Hallsworth-Pepin, Kymberlie; Choi, Kwangmin; Wollam, Aye; Clifton, Sandra; Qin, Xiang; Muzny, Donna; Raghuraman, Sriram; Ashki, Haleh; Uzman, Akif; Highlander, Sarah K.; Fryszczyn, Bartlomiej G.; Fox, George E.; Tirumalai, Madhan R.; Liu, Yamei; Kim, Sun
2015-01-01
Tolypothrix sp. PCC 7601 is a freshwater filamentous cyanobacterium with complex responses to environmental conditions. Here, we present its 9.96-Mbp draft genome sequence, containing 10,065 putative protein-coding sequences, including 305 predicted two-component system proteins and 27 putative phytochrome-class photoreceptors, the most such proteins in any sequenced genome. PMID:25953173
Complete genome sequence of chinese strain of ‘Candidatus Liberibacter asiaticus’
USDA-ARS's Scientific Manuscript database
The complete genome sequence of ‘Candidatus Liberibacter asiaticus’ strain (Las) Guangxi-1(GX-1) was obtained by an Illumina HiSeq 2000. The GX-1 genome comprises 1,268,237 nucleotides, 36.5 % GC content, 1,141 predicted coding sequences, 44 tRNAs, 3 complete copies of ribosomal RNA genes (16S, 23S ...
Draft Genome Sequence of Thiostrepton-Producing Streptomyces azureus ATCC 14921
Sakihara, Kengo; Maeda, Jumpei; Tashiro, Kosuke; Fujino, Yasuhiro; Kuhara, Satoru; Ohshima, Toshihisa; Ogata, Seiya
2015-01-01
Streptomyces azureus ATCC 14921 belongs to the Streptomyces cyaneus cluster and is known to be a thiostrepton producer. Here, we report a draft genome sequence for this strain, consisting of 350 contigs containing a total of 8,790,525 bp, 8,164 predicted coding sequences, and a G+C content of 70.9%. PMID:26494661
Mode-dependent templates and scan order for H.264/AVC-based intra lossless coding.
Gu, Zhouye; Lin, Weisi; Lee, Bu-Sung; Lau, Chiew Tong; Sun, Ming-Ting
2012-09-01
In H.264/advanced video coding (AVC), lossless coding and lossy coding share the same entropy coding module. However, the entropy coders in the H.264/AVC standard were originally designed for lossy video coding and do not yield adequate performance for lossless video coding. In this paper, we analyze the problem with the current lossless coding scheme and propose a mode-dependent template (MD-template) based method for intra lossless coding. By exploiting the statistical redundancy of the prediction residual in the H.264/AVC intra prediction modes, more zero coefficients are generated. By designing a new scan order for each MD-template, the scanned coefficient sequence fits the H.264/AVC entropy coders better. A fast implementation algorithm is also designed. With little increase in computation, experimental results confirm that the proposed fast algorithm achieves about 7.2% bit saving compared with the current H.264/AVC fidelity range extensions high profile.
Aggregating and Predicting Sequence Labels from Crowd Annotations
Nguyen, An T.; Wallace, Byron C.; Li, Junyi Jessy; Nenkova, Ani; Lease, Matthew
2017-01-01
Despite sequences being core to NLP, scant work has considered how to handle noisy sequence labels from multiple annotators for the same text. Given such annotations, we consider two complementary tasks: (1) aggregating sequential crowd labels to infer a best single set of consensus annotations; and (2) using crowd annotations as training data for a model that can predict sequences in unannotated text. For aggregation, we propose a novel Hidden Markov Model variant. To predict sequences in unannotated text, we propose a neural approach using Long Short Term Memory. We evaluate a suite of methods across two different applications and text genres: Named-Entity Recognition in news articles and Information Extraction from biomedical abstracts. Results show improvement over strong baselines. Our source code and data are available online. PMID:29093611
Evolution and Diversity of the Human Hepatitis D Virus Genome
Huang, Chi-Ruei; Lo, Szecheng J.
2010-01-01
Human hepatitis delta virus (HDV) is the smallest RNA virus in terms of genome size. The HDV genome is divided into a viroid-like sequence and a protein-coding sequence, which could have originated from different sources; the HDV genome was eventually constituted through RNA recombination. The genome subsequently diversified through accumulation of mutations selected by interactions of the mutated RNA and proteins with host factors to successfully form infectious virions. Therefore, we propose that the conservation of the HDV nucleotide sequence is closely related to its functionality. Genome analysis of known HDV isolates shows that the C-terminal coding sequences of large delta antigen (LDAg) are more diverse than other regions of the protein-coding sequences, yet variants that retain the biological function of interacting with the heavy chain of clathrin can be selected and maintained. Since viruses interact with many host factors, including in escaping the host immune response, designing a program to predict RNA genome evolution is very challenging. PMID:20204073
Cenik, Can; Chua, Hon Nian; Singh, Guramrit; Akef, Abdalla; Snyder, Michael P; Palazzo, Alexander F; Moore, Melissa J; Roth, Frederick P
2017-03-01
Introns are found in 5' untranslated regions (5'UTRs) for 35% of all human transcripts. These 5'UTR introns are not randomly distributed: Genes that encode secreted, membrane-bound and mitochondrial proteins are less likely to have them. Curiously, transcripts lacking 5'UTR introns tend to harbor specific RNA sequence elements in their early coding regions. To model and understand the connection between coding-region sequence and 5'UTR intron status, we developed a classifier that can predict 5'UTR intron status with >80% accuracy using only sequence features in the early coding region. Thus, the classifier identifies transcripts with 5' proximal-intron-minus-like-coding regions ("5IM" transcripts). Unexpectedly, we found that the early coding sequence features defining 5IM transcripts are widespread, appearing in 21% of all human RefSeq transcripts. The 5IM class of transcripts is enriched for non-AUG start codons, more extensive secondary structure both preceding the start codon and near the 5' cap, greater dependence on eIF4E for translation, and association with ER-proximal ribosomes. 5IM transcripts are bound by the exon junction complex (EJC) at noncanonical 5' proximal positions. Finally, N1-methyladenosines are specifically enriched in the early coding regions of 5IM transcripts. Taken together, our analyses point to the existence of a distinct 5IM class comprising ∼20% of human transcripts. This class is defined by depletion of 5' proximal introns, presence of specific RNA sequence features associated with low translation efficiency, N1-methyladenosines in the early coding region, and enrichment for noncanonical binding by the EJC. © 2017 Cenik et al.; Published by Cold Spring Harbor Laboratory Press for the RNA Society.
An FPGA Implementation to Detect Selective Cationic Antibacterial Peptides
Polanco González, Carlos; Nuño Maganda, Marco Aurelio; Arias-Estrada, Miguel; del Rio, Gabriel
2011-01-01
Exhaustive prediction of physicochemical properties of peptide sequences is used in different areas of biological research. One example is the identification of selective cationic antibacterial peptides (SCAPs), which may be used in the treatment of different diseases. Due to the discrete nature of peptide sequences, the physicochemical properties calculation is considered a high-performance computing problem. A competitive solution for this class of problems is to embed algorithms into dedicated hardware. In the present work we present the adaptation, design and implementation of an algorithm for SCAPs prediction into a Field Programmable Gate Array (FPGA) platform. Four physicochemical properties codes useful in the identification of peptide sequences with potential selective antibacterial activity were implemented into an FPGA board. The speed-up gained in a single-copy implementation was up to 108 times compared with a single Intel processor cycle for cycle. The inherent scalability of our design allows for replication of this code into multiple FPGA cards and consequently improvements in speed are possible. Our results show the first embedded SCAPs prediction solution described and constitutes the grounds to efficiently perform the exhaustive analysis of the sequence-physicochemical properties relationship of peptides. PMID:21738652
Yerrapragada, Shaila; Shukla, Animesh; Hallsworth-Pepin, Kymberlie; Choi, Kwangmin; Wollam, Aye; Clifton, Sandra; Qin, Xiang; Muzny, Donna; Raghuraman, Sriram; Ashki, Haleh; Uzman, Akif; Highlander, Sarah K; Fryszczyn, Bartlomiej G; Fox, George E; Tirumalai, Madhan R; Liu, Yamei; Kim, Sun; Kehoe, David M; Weinstock, George M
2015-05-07
Tolypothrix sp. PCC 7601 is a freshwater filamentous cyanobacterium with complex responses to environmental conditions. Here, we present its 9.96-Mbp draft genome sequence, containing 10,065 putative protein-coding sequences, including 305 predicted two-component system proteins and 27 putative phytochrome-class photoreceptors, the most such proteins in any sequenced genome. Copyright © 2015 Yerrapragada et al.
The Representation of Prediction Error in Auditory Cortex
Rubin, Jonathan; Ulanovsky, Nachum; Tishby, Naftali
2016-01-01
To survive, organisms must extract information from the past that is relevant for their future. How this process is expressed at the neural level remains unclear. We address this problem by developing a novel approach from first principles. We show here how to generate low-complexity representations of the past that produce optimal predictions of future events. We then illustrate this framework by studying the coding of ‘oddball’ sequences in auditory cortex. We find that for many neurons in primary auditory cortex, trial-by-trial fluctuations of neuronal responses correlate with the theoretical prediction error calculated from the short-term past of the stimulation sequence, under constraints on the complexity of the representation of this past sequence. In some neurons, the effect of prediction error accounted for more than 50% of response variability. Reliable predictions often depended on a representation of the sequence of the last ten or more stimuli, although the representation kept only few details of that sequence. PMID:27490251
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ghodhbane-Gtari, Faten; Beauchemin, Nicholas; Louati, Moussa
Here, we report the first genome sequence of a Nocardia plant endophyte, N. casuarinae strain BMG51109, isolated from Casuarina glauca root nodules. The improved high-quality draft genome sequence contains 8,787,999 bp with a 68.90% GC content and 7,307 predicted protein-coding genes.
Draft Genome Sequence of Pediococcus lolii NGRI 0510QT Isolated from Ryegrass Silage
Mori, Kazuki; Tashiro, Kosuke; Fujino, Yasuhiro; Nagayoshi, Yuko; Hayashi, Yoshiharu; Kuhara, Satoru; Ohshima, Toshihisa
2013-01-01
Pediococcus lolii NGRI 0510QT was isolated from ryegrass silage produced on Ishigaki Island, Okinawa Prefecture, Japan. Here we present a draft genome sequence for this strain, consisting of 103 contigs for a total of 2,047,078 bp, 2,154 predicted coding sequences, and a G+C content of 42.1%. PMID:23405350
Malhotra, Jaya; Dua, Ankita; Saxena, Anjali; Sangwan, Naseer; Mukherjee, Udita; Pandey, Neeti; Rajagopal, Raman; Khurana, Paramjit; Khurana, Jitendra P.
2012-01-01
In this study, Acinetobacter sp. strain HA was isolated from the midgut of a fifth-instar larva of Helicoverpa armigera. Here, we report the draft genome sequence (3,125,085 bp) of this strain that consists of 102 contigs, 2,911 predicted coding sequences, and a G+C content of 41%. PMID:22933775
Ghodhbane-Gtari, Faten; Beauchemin, Nicholas; Louati, Moussa; ...
2016-08-04
Here, we report the first genome sequence of a Nocardia plant endophyte, N. casuarinae strain BMG51109, isolated from Casuarina glauca root nodules. The improved high-quality draft genome sequence contains 8,787,999 bp with a 68.90% GC content and 7,307 predicted protein-coding genes.
lncRScan-SVM: A Tool for Predicting Long Non-Coding RNAs Using Support Vector Machine.
Sun, Lei; Liu, Hui; Zhang, Lin; Meng, Jia
2015-01-01
Functional long non-coding RNAs (lncRNAs) have been bringing novel insight into biological study; however, it is still not trivial to accurately distinguish the lncRNA transcripts (LNCTs) from the protein-coding ones (PCTs). As various information and data about lncRNAs are preserved by previous studies, it is appealing to develop novel methods to identify the lncRNAs more accurately. Our method lncRScan-SVM aims at classifying PCTs and LNCTs using a support vector machine (SVM). The gold-standard datasets for lncRScan-SVM model training, lncRNA prediction and method comparison were constructed according to the GENCODE gene annotations of human and mouse, respectively. By integrating features derived from gene structure, transcript sequence, potential codon sequence and conservation, lncRScan-SVM outperforms other approaches, as evaluated by several criteria such as sensitivity, specificity, accuracy, Matthews correlation coefficient (MCC) and area under curve (AUC). In addition, several known human lncRNA datasets were assessed using lncRScan-SVM. LncRScan-SVM is an efficient tool for predicting lncRNAs, and it is quite useful for current lncRNA study.
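The evaluation criteria named in this abstract (sensitivity, specificity, accuracy, MCC) are all functions of a binary confusion matrix. A minimal sketch using the standard textbook definitions follows; the function name `binary_metrics` and the example counts are hypothetical and are not results from lncRScan-SVM. AUC is omitted because it needs ranked scores rather than counts.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix metrics for a binary classifier.

    Assumes at least one positive and one negative example, so the
    sensitivity and specificity denominators are non-zero.
    """
    sensitivity = tp / (tp + fn)            # true positive rate
    specificity = tn / (tn + fp)            # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Invented counts, for illustration only.
print(binary_metrics(tp=90, fp=20, tn=80, fn=10))
```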
Vorstman, Jacob A S; Olde Loohuis, Loes M; Kahn, René S; Ophoff, Roel A
2018-05-14
The co-occurrence of a Copy Number Variant (CNV) and a functional variant on the other allele may be a relevant genetic mechanism in schizophrenia. We hypothesized that the cumulative burden of such double hits - in particular those composed of a deletion and a coding single nucleotide variation (SNV) - is increased in patients with schizophrenia. We combined CNV data with coding variants data in 795 patients with schizophrenia and 474 controls. To limit false CNV detection, only CNVs called by two algorithms were included. CNV-affected genes were subsequently examined for coding SNVs, which we termed "CNV-SNVs". Correcting for total queried sequence, we assessed the CNV-SNV burden and the combined predicted deleterious effect. We estimated p-values by permutation of the phenotype. We detected 105 CNV-SNVs; 67 in duplicated and 38 in deleted genic sequence. While the difference in CNV-SNV rates was not significant, the combined deleteriousness inferred by CNV-SNVs in deleted sequence was almost fourfold higher in cases compared to controls (nominal p = 0.009). This effect may be driven by a higher number of CNV-SNVs and/or by a higher degree of predicted deleteriousness of CNV-SNVs. No such effect was observed for duplications. We provide early evidence that deletions co-occurring with a functional variant may be relevant, albeit of modest impact, for the genetic etiology of schizophrenia. Large-scale consortium studies are required to validate our findings. Sequence-based analyses would provide the best resolution for detection of CNVs as well as coding variants genome-wide.
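Estimating a p-value by permuting the phenotype, as this abstract describes, can be illustrated with a generic label-shuffling test. The sketch below is an assumption-laden illustration: the test statistic (difference in mean per-subject burden between cases and controls), the function name `permutation_p_value`, and the toy data are all hypothetical and are not the statistic or data used in the study.

```python
import random


def permutation_p_value(values, labels, n_perm=2000, seed=0):
    """One-sided permutation p-value for mean(cases) - mean(controls).

    `values` holds one burden score per subject; `labels` is 1 for cases
    and 0 for controls. Labels are shuffled to build the null distribution.
    """
    def mean_diff(lbls):
        cases = [v for v, l in zip(values, lbls) if l == 1]
        ctrls = [v for v, l in zip(values, lbls) if l == 0]
        return sum(cases) / len(cases) - sum(ctrls) / len(ctrls)

    observed = mean_diff(labels)
    rng = random.Random(seed)        # fixed seed for reproducibility
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if mean_diff(shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Invented scores: four cases with high burden, four controls with low.
p = permutation_p_value([10, 11, 12, 13, 0, 1, 2, 3], [1, 1, 1, 1, 0, 0, 0, 0])
print(p)
```

Shuffling labels rather than values keeps the numbers of cases and controls fixed in every permutation.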
Lim, Chun Shen; Brown, Chris M
2017-01-01
Structured RNA elements may control virus replication, transcription and translation, and their distinct features are being exploited by novel antiviral strategies. Viral RNA elements continue to be discovered using combinations of experimental and computational analyses. However, the wealth of sequence data, notably from deep viral RNA sequencing, viromes, and metagenomes, necessitates computational approaches being used as an essential discovery tool. In this review, we describe practical approaches being used to discover functional RNA elements in viral genomes. In addition to success stories in new and emerging viruses, these approaches have revealed some surprising new features of well-studied viruses e.g., human immunodeficiency virus, hepatitis C virus, influenza, and dengue viruses. Some notable discoveries were facilitated by new comparative analyses of diverse viral genome alignments. Importantly, comparative approaches for finding RNA elements embedded in coding and non-coding regions differ. With the exponential growth of computer power we have progressed from stem-loop prediction on single sequences to cutting edge 3D prediction, and from command line to user friendly web interfaces. Despite these advances, many powerful, user friendly prediction tools and resources are underutilized by the virology community.
Lim, Chun Shen; Brown, Chris M.
2018-01-01
Structured RNA elements may control virus replication, transcription and translation, and their distinct features are being exploited by novel antiviral strategies. Viral RNA elements continue to be discovered using combinations of experimental and computational analyses. However, the wealth of sequence data, notably from deep viral RNA sequencing, viromes, and metagenomes, necessitates computational approaches being used as an essential discovery tool. In this review, we describe practical approaches being used to discover functional RNA elements in viral genomes. In addition to success stories in new and emerging viruses, these approaches have revealed some surprising new features of well-studied viruses e.g., human immunodeficiency virus, hepatitis C virus, influenza, and dengue viruses. Some notable discoveries were facilitated by new comparative analyses of diverse viral genome alignments. Importantly, comparative approaches for finding RNA elements embedded in coding and non-coding regions differ. With the exponential growth of computer power we have progressed from stem-loop prediction on single sequences to cutting edge 3D prediction, and from command line to user friendly web interfaces. Despite these advances, many powerful, user friendly prediction tools and resources are underutilized by the virology community. PMID:29354101
Draft Genome Sequence of Lactobacillus panis DSM 6035T, First Isolated from Sourdough
Zhu, Yixin; Fang, Daiqiong; Shi, Ding; Li, Ang; Lv, Longxian; Yan, Ren; Yao, Jian; Hua, Dasong; Hu, Xinjun; Guo, Feifei; Wu, Wenrui; Guo, Jing; Chen, Yanfei; Jiang, Xiawei; Chen, Xiaoxiao
2015-01-01
We report a draft genome sequence of Lactobacillus panis DSM 6035T, isolated from sourdough. The genome of this strain is 2,082,789 bp long, with 47.9% G+C content. A total of 2,047 protein-coding genes were predicted. PMID:26205855
Khan, Waqasuddin; Saripella, Ganapathi Varma-; Ludwig, Thomas; Cuppens, Tania; Thibord, Florian; Génin, Emmanuelle; Deleuze, Jean-Francois; Trégouët, David-Alexandre
2018-05-03
Predicted deleteriousness of coding variants is a frequently used criterion to filter out variants detected in next-generation sequencing projects and to select candidates impacting on the risk of human diseases. Most available dedicated tools implement a base-to-base annotation approach that could be biased in presence of several variants in the same genetic codon. We here proposed the MACARON program that, from a standard VCF file, identifies, re-annotates and predicts the amino acid change resulting from multiple single nucleotide variants (SNVs) within the same genetic codon. Applied to the whole exome dataset of 573 individuals, MACARON identifies 114 situations where multiple SNVs within a genetic codon induce an amino acid change that is different from those predicted by standard single SNV annotation tool. Such events are not uncommon and deserve to be studied in sequencing projects with inconclusive findings. MACARON is written in Python with code available on the GENMED website (www.genmed.fr). Contact: david-alexandre.tregouet@inserm.fr. Supplementary data are available at Bioinformatics online.
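The bias described here can be shown on a single codon: annotating each SNV separately gives one amino acid per variant, while applying all SNVs together can give a different one. The following is a minimal sketch and not MACARON's implementation; the function name `annotate_codon` is hypothetical, the variants are assumed to lie on the same haplotype, and the table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def annotate_codon(codon, snvs):
    """Compare per-SNV and combined amino acid predictions for one codon.

    `snvs` is a list of (position within codon 0..2, alternate base).
    Returns (amino acids from each SNV applied alone, amino acid from
    all SNVs applied together).
    """
    single = []
    for pos, alt in snvs:
        mutated = codon[:pos] + alt + codon[pos + 1:]
        single.append(CODON_TABLE[mutated])
    combined = list(codon)
    for pos, alt in snvs:
        combined[pos] = alt
    return single, CODON_TABLE["".join(combined)]


# Reference AAA (Lys): A>G at position 0 alone gives GAA (Glu),
# A>C at position 1 alone gives ACA (Thr), both together give GCA (Ala).
print(annotate_codon("AAA", [(0, "G"), (1, "C")]))
```

In this toy case neither single-SNV annotation matches the combined result, which is the kind of situation the abstract reports.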
Multistability in auditory stream segregation: a predictive coding view
Winkler, István; Denham, Susan; Mill, Robert; Bőhm, Tamás M.; Bendixen, Alexandra
2012-01-01
Auditory stream segregation involves linking temporally separate acoustic events into one or more coherent sequences. For any non-trivial sequence of sounds, many alternative descriptions can be formed, only one or very few of which emerge in awareness at any time. Evidence from studies showing bi-/multistability in auditory streaming suggest that some, perhaps many of the alternative descriptions are represented in the brain in parallel and that they continuously vie for conscious perception. Here, based on a predictive coding view, we consider the nature of these sound representations and how they compete with each other. Predictive processing helps to maintain perceptual stability by signalling the continuation of previously established patterns as well as the emergence of new sound sources. It also provides a measure of how well each of the competing representations describes the current acoustic scene. This account of auditory stream segregation has been tested on perceptual data obtained in the auditory streaming paradigm. PMID:22371621
Testa, Alison C; Hane, James K; Ellwood, Simon R; Oliver, Richard P
2015-03-11
The impact of gene annotation quality on functional and comparative genomics makes gene prediction an important process, particularly in non-model species, including many fungi. Sets of homologous protein sequences are rarely complete with respect to the fungal species of interest and are often small or unreliable, especially when closely related species have not been sequenced or annotated in detail. In these cases, protein homology-based evidence fails to correctly annotate many genes, or significantly improve ab initio predictions. Generalised hidden Markov models (GHMM) have proven to be invaluable tools in gene annotation and, recently, RNA-seq has emerged as a cost-effective means to significantly improve the quality of automated gene annotation. As these methods do not require sets of homologous proteins, improving gene prediction from these resources is of benefit to fungal researchers. While many pipelines now incorporate RNA-seq data in training GHMMs, there has been relatively little investigation into additionally combining RNA-seq data at the point of prediction, and room for improvement in this area motivates this study. CodingQuarry is a highly accurate, self-training GHMM fungal gene predictor designed to work with assembled, aligned RNA-seq transcripts. RNA-seq data informs annotations both during gene-model training and in prediction. Our approach capitalises on the high quality of fungal transcript assemblies by incorporating predictions made directly from transcript sequences. Correct predictions are made despite transcript assembly problems, including those caused by overlap between the transcripts of adjacent gene loci. Stringent benchmarking against high-confidence annotation subsets showed CodingQuarry predicted 91.3% of Schizosaccharomyces pombe genes and 90.4% of Saccharomyces cerevisiae genes perfectly. These results are 4-5% better than those of AUGUSTUS, the next best performing RNA-seq driven gene predictor tested. 
Comparisons against whole genome Sc. pombe and S. cerevisiae annotations further substantiate a 4-5% improvement in the number of correctly predicted genes. We demonstrate the success of a novel method of incorporating RNA-seq data into GHMM fungal gene prediction. This shows that a high quality annotation can be achieved without relying on protein homology or a training set of genes. CodingQuarry is freely available ( https://sourceforge.net/projects/codingquarry/ ), and suitable for incorporation into genome annotation pipelines.
2014-01-01
The linear algebraic concept of a subspace plays a significant role in recent techniques of spectrum estimation. In this article, the authors have utilized the noise subspace concept for finding hidden periodicities in DNA sequences. With the vast growth of genomic sequences, the demand to accurately identify the protein-coding regions in DNA is rising. Several techniques of DNA feature extraction involving various cross-disciplinary fields have come up in the recent past, among which the application of digital signal processing tools is of prime importance. It is known that coding segments have a 3-base periodicity, while non-coding regions do not have this unique feature. One of the most important spectrum analysis techniques based on the concept of subspace is the least-norm method. The least-norm estimator developed in this paper shows sharp period-3 peaks in coding regions, completely eliminating background noise. The proposed method is compared with the existing sliding discrete Fourier transform (SDFT) method, popularly known as the modified periodogram method, on several genes from various organisms, and the results show that the proposed method is a better and more effective approach to gene prediction. Resolution, quality factor, sensitivity, specificity, miss rate, and wrong rate are used to establish the superiority of the least-norm gene prediction method over the existing method. PMID:24386895
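The 3-base periodicity mentioned in this abstract is usually measured as spectral power at frequency 1/3 over base-indicator sequences. The sketch below shows that plain DFT baseline in a sliding window; it is not the least-norm estimator proposed in the paper, and the function names, window size, step, and toy sequences are assumptions chosen for illustration.

```python
import cmath


def period3_power(seq):
    """Length-normalised spectral power at frequency 1/3.

    For each base, an indicator sequence (1 where the base occurs) is
    projected onto exp(-2*pi*i*n/3); the squared magnitudes are summed.
    """
    total = 0.0
    for base in "ACGT":
        s = sum(cmath.exp(-2j * cmath.pi * i / 3)
                for i, b in enumerate(seq) if b == base)
        total += abs(s) ** 2
    return total / len(seq)


def sliding_period3(seq, window=351, step=3):
    """Return (start, power) pairs for windows slid along `seq`."""
    return [(start, period3_power(seq[start:start + window]))
            for start in range(0, len(seq) - window + 1, step)]


# Toy sequences: a perfectly codon-periodic repeat versus a homopolymer.
print(period3_power("ATG" * 30))   # strong period-3 signal
print(period3_power("A" * 90))     # essentially no period-3 signal
```

Peaks in the `sliding_period3` output would mark candidate coding windows under this baseline; real sequences give much noisier profiles than the toy inputs.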
Xie, G.; Chain, P.S.G.; Lo, C.; Liu, K-L.; Gans, J.; Merritt, J.; Qi, F.
2010-01-01
Human dental plaque is a complex microbial community containing an estimated 700 to 19,000 species/phylotypes. Despite numerous studies analysing species richness in healthy and diseased human subjects, the true genomic composition of the human dental plaque microbiota remains unknown. Here we report a metagenomic analysis of a healthy human plaque sample using a combination of second-generation sequencing platforms. A total of 860 million base pairs of non-human sequences were generated. Various analysis tools revealed the presence of 12 well-characterized phyla, members of the TM-7 and BRC1 clade, and sequences that could not be classified. Both pathogens and opportunistic pathogens were identified, supporting the ecological plaque hypothesis for oral diseases. Mapping the metagenomic reads to sequenced reference genomes demonstrated that 4% of the reads could be assigned to the sequenced species. Preliminary annotation identified genes belonging to all known functional categories. Interestingly, although 73% of the total assembled contig sequences were predicted to code for proteins, only 51% of them could be assigned a functional role. Furthermore, ~2.8% of the total predicted genes coded for proteins involved in resistance to antibiotics and toxic compounds, suggesting that the oral cavity is an important reservoir for antimicrobial resistance. PMID:21040513
Xie, G; Chain, P S G; Lo, C-C; Liu, K-L; Gans, J; Merritt, J; Qi, F
2010-12-01
Human dental plaque is a complex microbial community containing an estimated 700 to 19,000 species/phylotypes. Despite numerous studies analysing species richness in healthy and diseased human subjects, the true genomic composition of the human dental plaque microbiota remains unknown. Here we report a metagenomic analysis of a healthy human plaque sample using a combination of second-generation sequencing platforms. A total of 860 million base pairs of non-human sequences were generated. Various analysis tools revealed the presence of 12 well-characterized phyla, members of the TM-7 and BRC1 clade, and sequences that could not be classified. Both pathogens and opportunistic pathogens were identified, supporting the ecological plaque hypothesis for oral diseases. Mapping the metagenomic reads to sequenced reference genomes demonstrated that 4% of the reads could be assigned to the sequenced species. Preliminary annotation identified genes belonging to all known functional categories. Interestingly, although 73% of the total assembled contig sequences were predicted to code for proteins, only 51% of them could be assigned a functional role. Furthermore, ~2.8% of the total predicted genes coded for proteins involved in resistance to antibiotics and toxic compounds, suggesting that the oral cavity is an important reservoir for antimicrobial resistance. © 2010 John Wiley & Sons A/S.
Zheng, Yang; Cai, Jing; Li, JianWen; Li, Bo; Lin, Runmao; Tian, Feng; Wang, XiaoLing; Wang, Jun
2010-01-01
A 10-fold BAC library for giant panda was constructed and nine BACs were selected to generate finished sequences. These BACs could be used as a validation resource for the de novo assembly accuracy of the whole genome shotgun sequencing reads of giant panda newly generated by the Illumina GA sequencing technology. Complete Sanger sequencing, assembly, annotation and comparative analysis were carried out on the selected BACs, which have a joint length of 878 kb. Homologue search and de novo prediction methods were used to annotate genes and repeats. Twelve protein-coding genes were predicted, seven of which could be functionally annotated. The seven genes have an average gene size of about 41 kb, an average coding size of about 1.2 kb and an average exon number of 6 per gene. In addition, seven tRNA genes were found. About 27 percent of the BAC sequence is composed of repeats. A phylogenetic tree was constructed using the neighbor-joining algorithm across five species, including giant panda, human, dog, cat and mouse, which reconfirms dog as the species most closely related to giant panda. Our results provide detailed sequence and structure information for new genes and repeats of giant panda, which will be helpful for further studies on the giant panda.
O'Hair, Joshua A.; Li, Hui; Thapa, Santosh; Scholz, Matthew B.
2017-01-01
Novel cellulolytic microorganisms can potentially influence second-generation biofuel production. This paper reports the draft genome sequence of Bacillus licheniformis strain YNP1-TSU, isolated from hydrothermal-vegetative microbiomes inside Yellowstone National Park. The assembled sequence contigs predicted 4,230 coding genes, 66 tRNAs, and 10 rRNAs through automated annotation. PMID:28254968
Tanjung, Zulfikar Achmad; Aditama, Redi; Buana, Rika Fithri Nurani; Pratomo, Antonius Dony Madu; Tryono, Reno; Liwang, Tony
2018-01-01
Ganoderma boninense is the dominant fungal pathogen of basal stem rot (BSR) disease on Elaeis guineensis. We sequenced the nuclear genome of mycelia using both Illumina and Pacific Biosciences platforms for assembly of scaffolds. The draft genome comprised 79.24 Mb, 495 scaffolds, and 26,226 predicted coding sequences. PMID:29700132
Seddiki, Khawla; Godart, François; Aiese Cigliano, Riccardo; Sanseverino, Walter; Barakat, Mohamed; Ortet, Philippe; Rébeillé, Fabrice; Maréchal, Eric
2018-01-01
Thraustochytrids are ecologically and biotechnologically relevant marine species. We report here the de novo assembly and annotation of the whole-genome sequence of a new thraustochytrid strain, CCAP_4062/3. The genome size was estimated at 38.7 Mb with 11,853 predicted coding sequences, and the GC content was scored at 57%. PMID:29545303
2018-01-01
The complete genome sequence of Bacillus cereus strain TG1-6, which is a highly salt-tolerant rhizobacterium that enhances plant tolerance to drought stress, is reported here. The sequencing process was performed based on a combination of pyrosequencing and single-molecule sequencing. The complete genome is estimated to be approximately 5.42 Mb, containing a total of 5,610 predicted protein-coding DNA sequences (CDSs). PMID:29748401
Vílchez, Juan Ignacio; Tang, Qiming; Kaushal, Richa; Wang, Wei; Lv, Suhui; He, Danxia; Chu, Zhaoqing; Zhang, Heng; Liu, Renyi; Zhang, Huiming
2018-06-21
Here, we report the complete genome sequence for Bacillus megaterium strain YC4-R4, a highly salt-tolerant rhizobacterium that promotes growth in plants. The sequencing process was performed by combining pyrosequencing and single-molecule sequencing techniques. The complete genome is estimated to be approximately 5.44 Mb, containing a total of 5,673 predicted protein-coding DNA sequences (CDSs). Copyright © 2018 Vílchez et al.
Vfold: a web server for RNA structure and folding thermodynamics prediction.
Xu, Xiaojun; Zhao, Peinan; Chen, Shi-Jie
2014-01-01
The ever increasing discovery of non-coding RNAs leads to unprecedented demand for the accurate modeling of RNA folding, including the predictions of two-dimensional (base pair) and three-dimensional all-atom structures and folding stabilities. Accurate modeling of RNA structure and stability has far-reaching impact on our understanding of RNA functions in human health and our ability to design RNA-based therapeutic strategies. The Vfold server offers a web interface to predict (a) RNA two-dimensional structure from the nucleotide sequence, (b) three-dimensional structure from the two-dimensional structure and the sequence, and (c) folding thermodynamics (heat capacity melting curve) from the sequence. To predict the two-dimensional structure (base pairs), the server generates an ensemble of structures, including loop structures with the different intra-loop mismatches, and evaluates the free energies using the experimental parameters for the base stacks and the loop entropy parameters given by a coarse-grained RNA folding model (the Vfold model) for the loops. To predict the three-dimensional structure, the server assembles the motif scaffolds using structure templates extracted from the known PDB structures and refines the structure using all-atom energy minimization. The Vfold-based web server provides a user friendly tool for the prediction of RNA structure and stability. The web server and the source codes are freely accessible for public use at "http://rna.physics.missouri.edu".
High-Level Prediction Signals in a Low-Level Area of the Macaque Face-Processing Hierarchy.
Schwiedrzik, Caspar M; Freiwald, Winrich A
2017-09-27
Theories like predictive coding propose that lower-order brain areas compare their inputs to predictions derived from higher-order representations and signal their deviation as a prediction error. Here, we investigate whether the macaque face-processing system, a three-level hierarchy in the ventral stream, employs such a coding strategy. We show that after statistical learning of specific face sequences, the lower-level face area ML computes the deviation of actual from predicted stimuli. But these signals do not reflect the tuning characteristic of ML. Rather, they exhibit identity specificity and view invariance, the tuning properties of higher-level face areas AL and AM. Thus, learning appears to endow lower-level areas with the capability to test predictions at a higher level of abstraction than what is afforded by the feedforward sweep. These results provide evidence for computational architectures like predictive coding and suggest a new quality of functional organization of information-processing hierarchies beyond pure feedforward schemes. Copyright © 2017 Elsevier Inc. All rights reserved.
2013-01-01
Background: Significant efforts have been made to address the problem of identifying short genes in prokaryotic genomes. However, most known methods are not effective in detecting short genes. Because of the limited information contained in short DNA sequences, it is very difficult to accurately distinguish between protein coding and non-coding sequences in prokaryotic genomes. We have developed a new Iteratively Adaptive Sparse Partial Least Squares (IASPLS) algorithm as the classifier to improve the accuracy of the identification process. Results: For testing, we chose the short coding and non-coding sequences from seven prokaryotic organisms. We used seven feature sets (including GC content, Z-curve, etc.) of short genes. In comparison with GeneMarkS, Metagene, Orphelia, and Heuristic Approach methods, our model achieved the best prediction performance in identification of short prokaryotic genes. Even when we focused on the very short length group ([60–100 nt)), our model provided sensitivity as high as 83.44% and specificity as high as 92.8%. These values are two or three times higher than three of the other methods while Metagene fails to recognize genes in this length range. The experiments also proved that the IASPLS can improve the identification accuracy in comparison with other widely used classifiers, i.e. Logistic, Random Forest (RF) and K nearest neighbors (KNN). The accuracy in using IASPLS was improved 5.90% or more in comparison with the other methods. In addition to the improvements in accuracy, IASPLS required ten times less computer time than using KNN or RF. Conclusions: It is conclusive that our method is preferable for application as an automated method of short gene classification. Its linearity and easily optimized parameters make it practicable for predicting short genes of newly-sequenced or under-studied species.
Reviewers: This article was reviewed by Alexey Kondrashov, Rajeev Azad (nominated by Dr J.Peter Gogarten) and Yuriy Fofanov (nominated by Dr Janet Siefert). PMID:24067167
Efficient analysis of mouse genome sequences reveal many nonsense variants
Steeland, Sophie; Timmermans, Steven; Van Ryckeghem, Sara; Hulpiau, Paco; Saeys, Yvan; Van Montagu, Marc; Vandenbroucke, Roosmarijn E.; Libert, Claude
2016-01-01
Genetic polymorphisms in coding genes play an important role when using mouse inbred strains as research models. They have been shown to influence research results, explain phenotypical differences between inbred strains, and increase the amount of interesting gene variants present in the many available inbred lines. SPRET/Ei is an inbred strain derived from Mus spretus that has ∼1% sequence difference with the C57BL/6J reference genome. We obtained a listing of all SNPs and insertions/deletions (indels) present in SPRET/Ei from the Mouse Genomes Project (Wellcome Trust Sanger Institute) and processed these data to obtain an overview of all transcripts having nonsynonymous coding sequence variants. We identified 8,883 unique variants affecting 10,096 different transcripts from 6,328 protein-coding genes, which is about 28% of all coding genes. Because only a subset of these variants results in drastic changes in proteins, we focused on variations that are nonsense mutations that ultimately resulted in a gain of a stop codon. These genes were identified by in silico changing the C57BL/6J coding sequences to the SPRET/Ei sequences, converting them to amino acid (AA) sequences, and comparing the AA sequences. All variants and transcripts affected were also stored in a database, which can be browsed using a SPRET/Ei M. spretus variants web tool (www.spretus.org), including a manual. We validated the tool by demonstrating the loss of function of three proteins predicted to be severely truncated, namely Fas, IRAK2, and IFNγR1. PMID:27147605
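The in silico step described in this abstract (substitute the strain's alleles into the reference coding sequence, translate both, and compare the amino acid sequences) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the 0-based `(position, alt_base)` SNP tuples, and the toy sequence are hypothetical, only substitutions are handled (no indels), and the standard genetic code is assumed.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds):
    """Translate a coding sequence up to and including the first stop codon."""
    protein = []
    for pos in range(0, len(cds) - len(cds) % 3, 3):
        aa = CODON_TABLE[cds[pos:pos + 3]]
        protein.append(aa)
        if aa == "*":
            break
    return "".join(protein)


def apply_snps(cds, snps):
    """Return cds with each (0-based position, alt base) substitution applied."""
    seq = list(cds)
    for pos, alt in snps:
        seq[pos] = alt
    return "".join(seq)


def gains_stop(ref_cds, snps):
    """True if the variant protein terminates earlier than the reference protein."""
    ref_protein = translate(ref_cds)
    alt_protein = translate(apply_snps(ref_cds, snps))
    return alt_protein.endswith("*") and len(alt_protein) < len(ref_protein)


# Toy example (hypothetical sequence): TGG (Trp) -> TGA (stop) at position 5.
reference = "ATGTGGAAATAA"
print(translate(reference))               # MWK*
print(gains_stop(reference, [(5, "A")]))  # True
print(gains_stop(reference, [(8, "G")]))  # False (AAA -> AAG is synonymous)
```

Comparing protein lengths rather than scanning the DNA for stop triplets keeps the check in the reading frame of the annotated transcript, which matches the translate-then-compare procedure the abstract describes.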
Genomic Sequence around Butterfly Wing Development Genes: Annotation and Comparative Analysis
Conceição, Inês C.; Long, Anthony D.; Gruber, Jonathan D.; Beldade, Patrícia
2011-01-01
Background Analysis of genomic sequence allows characterization of genome content and organization, and access beyond gene-coding regions for identification of functional elements. BAC libraries, where relatively large genomic regions are made readily available, are especially useful for species without a fully sequenced genome and can increase genomic coverage of phylogenetic and biological diversity. For example, no butterfly genome is yet available despite the unique genetic and biological properties of this group, such as diversified wing color patterns. The evolution and development of these patterns is being studied in a few target species, including Bicyclus anynana, where a whole-genome BAC library allows targeted access to large genomic regions. Methodology/Principal Findings We characterize ∼1.3 Mb of genomic sequence around 11 selected genes expressed in B. anynana developing wings. Extensive manual curation of in silico predictions, also making use of a large dataset of expressed genes for this species, identified repetitive elements and protein coding sequence, and highlighted an expansion of Alcohol dehydrogenase genes. Comparative analysis with orthologous regions of the lepidopteran reference genome allowed assessment of conservation of fine-scale synteny (with detection of new inversions and translocations) and of DNA sequence (with detection of high levels of conservation of non-coding regions around some, but not all, developmental genes). Conclusions The general properties and organization of the available B. anynana genomic sequence are similar to the lepidopteran reference, despite the more than 140 MY divergence. Our results lay the groundwork for further studies of new interesting findings in relation to both coding and non-coding sequence: 1) the Alcohol dehydrogenase expansion with higher similarity between the five tandemly-repeated B. anynana paralogs than with the corresponding B. mori orthologs, and 2) the high conservation of non-coding sequence around the genes wingless and Ecdysone receptor, both involved in multiple developmental processes including wing pattern formation. PMID:21909358
Many human accelerated regions are developmental enhancers
Capra, John A.; Erwin, Genevieve D.; McKinsey, Gabriel; Rubenstein, John L. R.; Pollard, Katherine S.
2013-01-01
The genetic changes underlying the dramatic differences in form and function between humans and other primates are largely unknown, although it is clear that gene regulatory changes play an important role. To identify regulatory sequences with potentially human-specific functions, we and others used comparative genomics to find non-coding regions conserved across mammals that have acquired many sequence changes in humans since divergence from chimpanzees. These regions are good candidates for performing human-specific regulatory functions. Here, we analysed the DNA sequence, evolutionary history, histone modifications, chromatin state and transcription factor (TF) binding sites of a combined set of 2649 non-coding human accelerated regions (ncHARs) and predicted that at least 30% of them function as developmental enhancers. We prioritized the predicted ncHAR enhancers using analysis of TF binding site gain and loss, along with the functional annotations and expression patterns of nearby genes. We then tested both the human and chimpanzee sequence for 29 ncHARs in transgenic mice, and found 24 novel developmental enhancers active in both species, 17 of which had very consistent patterns of activity in specific embryonic tissues. Of these ncHAR enhancers, five drove expression patterns suggestive of different activity for the human and chimpanzee sequence at embryonic day 11.5. The changes to human non-coding DNA in these ncHAR enhancers may modify the complex patterns of gene expression necessary for proper development in a human-specific manner and are thus promising candidates for understanding the genetic basis of human-specific biology. PMID:24218637
Regulatory sequence analysis tools.
van Helden, Jacques
2003-07-01
The web resource Regulatory Sequence Analysis Tools (RSAT) (http://rsat.ulb.ac.be/rsat) offers a collection of software tools dedicated to the prediction of regulatory sites in non-coding DNA sequences. These tools include sequence retrieval, pattern discovery, pattern matching, genome-scale pattern matching, feature-map drawing, random sequence generation and other utilities. Alternative formats are supported for the representation of regulatory motifs (strings or position-specific scoring matrices) and several algorithms are proposed for pattern discovery. RSAT currently holds >100 fully sequenced genomes and these data are regularly updated from GenBank.
Progressive Dictionary Learning with Hierarchical Predictive Structure for Scalable Video Coding.
Dai, Wenrui; Shen, Yangmei; Xiong, Hongkai; Jiang, Xiaoqian; Zou, Junni; Taubman, David
2017-04-12
Dictionary learning has emerged as a promising alternative to the conventional hybrid coding framework. However, the rigid structure of sequential training and prediction degrades its performance in scalable video coding. This paper proposes a progressive dictionary learning framework with hierarchical predictive structure for scalable video coding, especially in the low-bitrate region. For pyramidal layers, sparse representation based on a spatio-temporal dictionary is adopted to improve the coding efficiency of enhancement layers (ELs) with a guarantee of reconstruction performance. The overcomplete dictionary is trained to adaptively capture local structures along motion trajectories as well as exploit the correlations between neighboring layers of resolutions. Furthermore, progressive dictionary learning is developed to enable scalability in the temporal domain and restrict error propagation in a closed-loop predictor. Under the hierarchical predictive structure, online learning is leveraged to guarantee the training and prediction performance with an improved convergence rate. To accommodate the state-of-the-art scalable extension of H.264/AVC and the latest HEVC, standardized codec cores are utilized to encode the base and enhancement layers. Experimental results show that the proposed method outperforms the latest SHVC and HEVC simulcast over extensive test sequences with various resolutions.
Ferlaino, Michael; Rogers, Mark F.; Shihab, Hashem A.; Mort, Matthew; Cooper, David N.; Gaunt, Tom R.; Campbell, Colin
2018-01-01
Background Small insertions and deletions (indels) have a significant influence in human disease and, in terms of frequency, they are second only to single nucleotide variants as pathogenic mutations. As the majority of mutations associated with complex traits are located outside the exome, it is crucial to investigate the potential pathogenic impact of indels in non-coding regions of the human genome. Results We present FATHMM-indel, an integrative approach to predict the functional effect, pathogenic or neutral, of indels in non-coding regions of the human genome. Our method exploits various genomic annotations in addition to sequence data. When validated on benchmark data, FATHMM-indel significantly outperforms CADD and GAVIN, state of the art models in assessing the pathogenic impact of non-coding variants. FATHMM-indel is available via a web server at indels.biocompute.org.uk. Conclusions FATHMM-indel can accurately predict the functional impact and prioritise small indels throughout the whole non-coding genome. PMID:28985712
Ferlaino, Michael; Rogers, Mark F; Shihab, Hashem A; Mort, Matthew; Cooper, David N; Gaunt, Tom R; Campbell, Colin
2017-10-06
Small insertions and deletions (indels) have a significant influence in human disease and, in terms of frequency, they are second only to single nucleotide variants as pathogenic mutations. As the majority of mutations associated with complex traits are located outside the exome, it is crucial to investigate the potential pathogenic impact of indels in non-coding regions of the human genome. We present FATHMM-indel, an integrative approach to predict the functional effect, pathogenic or neutral, of indels in non-coding regions of the human genome. Our method exploits various genomic annotations in addition to sequence data. When validated on benchmark data, FATHMM-indel significantly outperforms CADD and GAVIN, state of the art models in assessing the pathogenic impact of non-coding variants. FATHMM-indel is available via a web server at indels.biocompute.org.uk. FATHMM-indel can accurately predict the functional impact and prioritise small indels throughout the whole non-coding genome.
Neural evidence for predictive coding in auditory cortex during speech production.
Okada, Kayoko; Matchin, William; Hickok, Gregory
2018-02-01
Recent models of speech production suggest that motor commands generate forward predictions of the auditory consequences of those commands, that these forward predictions can be used to monitor and correct speech output, and that this system is hierarchically organized (Hickok, Houde, & Rong, Neuron, 69(3), 407–422, 2011; Pickering & Garrod, Behavioral and Brain Sciences, 36(4), 329–347, 2013). Recent psycholinguistic research has shown that internally generated speech (i.e., imagined speech) produces different types of errors than does overt speech (Oppenheim & Dell, Cognition, 106(1), 528–537, 2008; Oppenheim & Dell, Memory & Cognition, 38(8), 1147–1160, 2010). These studies suggest that articulated speech might involve predictive coding at more levels than imagined speech. The current fMRI experiment investigates neural evidence of predictive coding in speech production. Twenty-four participants from UC Irvine were recruited for the study. Participants were scanned while they were visually presented with a sequence of words that they reproduced in sync with a visual metronome. On each trial, they were cued to either silently articulate the sequence or to imagine the sequence without overt articulation. As expected, silent articulation and imagined speech both engaged a left hemisphere network previously implicated in speech production. A contrast of silent articulation with imagined speech revealed greater activation for articulated speech in inferior frontal cortex, premotor cortex and the insula in the left hemisphere, consistent with greater articulatory load. Although both conditions were silent, this contrast also produced significantly greater activation in auditory cortex in dorsal superior temporal gyrus in both hemispheres. We suggest that these activations reflect forward predictions arising from additional levels of the perceptual/motor hierarchy that are involved in monitoring the intended speech output.
Lathe, R
1985-05-05
Synthetic probes deduced from amino acid sequence data are widely used to detect cognate coding sequences in libraries of cloned DNA segments. The redundancy of the genetic code dictates that a choice must be made between (1) a mixture of probes reflecting all codon combinations, and (2) a single longer "optimal" probe. The second strategy is examined in detail. The frequency of sequences matching a given probe by chance alone can be determined and also the frequency of sequences closely resembling the probe and contributing to the hybridization background. Gene banks cannot be treated as random associations of the four nucleotides, and probe sequences deduced from amino acid sequence data occur more often than predicted by chance alone. Probe lengths must be increased to confer the necessary specificity. Examination of hybrids formed between unique homologous probes and their cognate targets reveals that short stretches of perfect homology occurring by chance make a significant contribution to the hybridization background. Statistical methods for improving homology are examined, taking human coding sequences as an example, and considerations of codon utilization and dinucleotide frequencies yield an overall homology of greater than 82%. Recommendations for probe design and hybridization are presented, and the choice between using multiple probes reflecting all codon possibilities and a unique optimal probe is discussed.
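The trade-off this abstract describes between a mixed probe and a single longer probe comes down to two quantities that are easy to compute. The sketch below is illustrative only: the function names are hypothetical, the standard genetic code is assumed, and the chance-match estimate uses the uniform random-sequence model that the abstract itself says understates the real background in gene banks.

```python
from collections import Counter
from math import prod

# Amino acid column of the standard genetic code in T, C, A, G codon order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Number of synonymous codons per amino acid (e.g. Leu = 6, Met = 1).
DEGENERACY = Counter(AMINO)


def mixture_size(peptide):
    """Number of distinct oligonucleotides needed to cover every codon combination."""
    return prod(DEGENERACY[aa] for aa in peptide)


def expected_chance_matches(probe_length, library_bp):
    """Expected perfect matches on one strand if bases were uniform and independent.

    This is only the naive baseline; real sequences are not random.
    """
    return (library_bp - probe_length + 1) / 4 ** probe_length


print(mixture_size("MWK"))   # 2   (Met 1 x Trp 1 x Lys 2)
print(mixture_size("LSR"))   # 216 (6 x 6 x 6)
print(expected_chance_matches(17, 3_000_000_000))
```

A peptide rich in Leu, Ser, or Arg makes the mixed-probe strategy expensive quickly, which is one reason a single longer "optimal" probe can be attractive.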
Gene Unprediction with Spurio: A tool to identify spurious protein sequences.
Höps, Wolfram; Jeffryes, Matt; Bateman, Alex
2018-01-01
We now have access to the sequences of tens of millions of proteins. These protein sequences are essential for modern molecular biology and computational biology. The vast majority of protein sequences are derived from gene prediction tools and have no experimental supporting evidence for their translation. Despite the increasing accuracy of gene prediction tools there likely exist a large number of spurious protein predictions in the sequence databases. We have developed the Spurio tool to help identify spurious protein predictions in prokaryotes. Spurio searches the query protein sequence against a prokaryotic nucleotide database using tblastn and identifies homologous sequences. The tblastn matches are used to score the query sequence's likelihood of being a spurious protein prediction using a Gaussian process model. The most informative feature is the appearance of stop codons within the presumed translation of homologous DNA sequences. Benchmarking shows that the Spurio tool is able to distinguish spurious from true proteins. However, transposon proteins are prone to be predicted as spurious because of the frequency of degraded homologs found in the DNA sequence databases. Our initial experiments suggest that less than 1% of the proteins in the UniProtKB sequence database are likely to be spurious and that Spurio is able to identify over 60 times more spurious proteins than the AntiFam resource. The Spurio software and source code are available under an MIT license at the following URL: https://bitbucket.org/bateman-group/spurio.
Lossless Video Sequence Compression Using Adaptive Prediction
NASA Technical Reports Server (NTRS)
Li, Ying; Sayood, Khalid
2007-01-01
We present an adaptive lossless video compression algorithm based on predictive coding. The proposed algorithm exploits temporal, spatial, and spectral redundancies in a backward adaptive fashion with extremely low side information. The computational complexity is further reduced by using a caching strategy. We also study the relationship between the operational domain for the coder (wavelet or spatial) and the amount of temporal and spatial redundancy in the sequence being encoded. Experimental results show that the proposed scheme provides significant improvements in compression efficiencies.
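The idea of backward-adaptive predictive coding with no side information can be shown with a toy one-dimensional example. This is not the algorithm of the paper: the two candidate predictors and the switching rule are assumptions chosen for brevity. The point is that the predictor is selected using only samples the decoder already has, so encoder and decoder make the same choice and nothing extra is transmitted.

```python
def _predict(history):
    """Predict the next sample from already-decoded samples only."""
    if len(history) < 3:
        return history[-1] if history else 0
    prev, prev2, prev3 = history[-1], history[-2], history[-3]
    # Score each candidate by how well it would have predicted the latest sample.
    err_hold = abs(prev - prev2)                  # "repeat previous sample"
    err_linear = abs(prev - (2 * prev2 - prev3))  # "linear extrapolation"
    return prev if err_hold <= err_linear else 2 * prev - prev2


def encode(samples):
    """Replace each integer sample with its prediction residual."""
    residuals, history = [], []
    for s in samples:
        residuals.append(s - _predict(history))
        history.append(s)
    return residuals


def decode(residuals):
    """Rebuild the samples exactly by rerunning the same predictor."""
    history = []
    for r in residuals:
        history.append(r + _predict(history))
    return history


samples = [10, 12, 14, 16, 18, 18, 18]
residuals = encode(samples)
print(residuals)                     # [10, 2, 2, 0, 0, -2, 0]
print(decode(residuals) == samples)  # True
```

The residuals cluster near zero and would then go to an entropy coder; that stage is omitted here.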
Burzynski, Grzegorz M.; Reed, Xylena; Taher, Leila; Stine, Zachary E.; Matsui, Takeshi; Ovcharenko, Ivan; McCallion, Andrew S.
2012-01-01
Illuminating the primary sequence encryption of enhancers is central to understanding the regulatory architecture of genomes. We have developed a machine learning approach to decipher motif patterns of hindbrain enhancers and identify 40,000 sequences in the human genome that we predict display regulatory control that includes the hindbrain. Consistent with their roles in hindbrain patterning, MEIS1, NKX6-1, as well as HOX and POU family binding motifs contributed strongly to this enhancer model. Predicted hindbrain enhancers are overrepresented at genes expressed in hindbrain and associated with nervous system development, and primarily reside in the areas of open chromatin. In addition, 77 (0.2%) of these predictions are identified as hindbrain enhancers on the VISTA Enhancer Browser, and 26,000 (60%) overlap enhancer marks (H3K4me1 or H3K27ac). To validate these putative hindbrain enhancers, we selected 55 elements distributed throughout our predictions and six low scoring controls for evaluation in a zebrafish transgenic assay. When assayed in mosaic transgenic embryos, 51/55 elements directed expression in the central nervous system. Furthermore, 30/34 (88%) predicted enhancers analyzed in stable zebrafish transgenic lines directed expression in the larval zebrafish hindbrain. Subsequent analysis of sequence fragments selected based upon motif clustering further confirmed the critical role of the motifs contributing to the classifier. Our results demonstrate the existence of a primary sequence code characteristic to hindbrain enhancers. This code can be accurately extracted using machine-learning approaches and applied successfully for de novo identification of hindbrain enhancers. This study represents a critical step toward the dissection of regulatory control in specific neuronal subtypes. PMID:22759862
Performance evaluation of the intra compression in the video coding standards
NASA Astrophysics Data System (ADS)
Abramowski, Andrzej
2015-09-01
The article presents a comparison of the Intra prediction algorithms in the current state-of-the-art video coding standards, including MJPEG 2000, VP8, VP9, H.264/AVC and H.265/HEVC. The effectiveness of techniques employed by each standard is evaluated in terms of compression efficiency and average encoding time. The compression efficiency is measured using BD-PSNR and BD-RATE metrics with H.265/HEVC results as an anchor. Tests are performed on a set of video sequences, composed of sequences gathered by the Joint Collaborative Team on Video Coding during the development of the H.265/HEVC standard and 4K sequences provided by Ultra Video Group. According to the results, H.265/HEVC provides significant bit-rate savings at the expense of computational complexity, while VP9 may be regarded as a compromise between efficiency and required encoding time.
Almeida, Eduardo L.; Margassery, Lekha M.; O’Leary, Niall
2018-01-01
Pseudomonas putida strain CA-3 is an industrial bioreactor isolate capable of synthesizing biodegradable polyhydroxyalkanoate polymers via the metabolism of styrene and other unrelated carbon sources. The pathways involved are subject to regulation by global cellular processes. The draft genome sequence is 6,177,154 bp long and contains 5,608 predicted coding sequences. PMID:29371359
Riveros-Mckay, Fernando; Campos, Itzia; Giles-Gómez, Martha; Bolívar, Francisco
2014-01-01
Leuconostoc mesenteroides P45 was isolated from the traditional Mexican pulque beverage. We report its draft genome sequence, assembled in 6 contigs consisting of 1,874,188 bp and no plasmids. Genome annotation predicted a total of 1,800 genes, 1,687 coding sequences, 52 pseudogenes, 9 rRNAs, 51 tRNAs, 1 noncoding RNA, and 44 frameshifted genes. PMID:25377708
Moreno-Avitia, Fabian; Lozano, Luis; Utrilla, Jose; Bolívar, Francisco; Escalante, Adelfo
2017-06-08
Pseudomonas chlororaphis strain ATCC 9446 is a biocontrol-related organism. We report here its draft genome sequence assembled into 35 contigs consisting of 6,783,030 bp. Genome annotation predicted a total of 6,200 genes, 6,128 coding sequences, 81 pseudogenes, 58 tRNAs, 4 noncoding RNAs (ncRNAs), and 41 frameshifted genes. Copyright © 2017 Moreno-Avitia et al.
Utomo, Condro; Tanjung, Zulfikar Achmad; Aditama, Redi; Buana, Rika Fithri Nurani; Pratomo, Antonius Dony Madu; Tryono, Reno; Liwang, Tony
2018-04-26
Ganoderma boninense is the dominant fungal pathogen of basal stem rot (BSR) disease on Elaeis guineensis. We sequenced the nuclear genome of mycelia using both Illumina and Pacific Biosciences platforms for assembly of scaffolds. The draft genome comprised 79.24 Mb, 495 scaffolds, and 26,226 predicted coding sequences. Copyright © 2018 Utomo et al.
Mu, Dashuai; Zhao, Jinxin; Wang, Zongjie; Chen, Guanjun
2016-01-01
Algoriphagus sp. NH1 is a multidrug-resistant bacterium isolated from coastal sediments of the northern Yellow Sea in China. Here, we report the draft genome sequence of NH1, with a size of 6,131,579 bp, average G+C content of 42.68%, and 5,746 predicted protein-coding sequences. PMID:26769940
Draft Genome Sequence of Lactobacillus sp. Strain TCF032-E4, Isolated from Fermented Radish.
Mao, Yuejian; Chen, Meng; Horvath, Philippe
2015-07-30
Here, we report the draft genome sequence of Lactobacillus sp. strain TCF032-E4 (= CCTCC AB2015090 = DSM 100358), isolated from a Chinese fermented radish. The total length of the 57 contigs is about 2.9 Mb, with a G+C content of 43.5 mol% and 2,797 predicted coding sequences (CDSs). Copyright © 2015 Mao et al.
Draft Genome Sequence of Geobacillus kaustophilus GBlys, a Lysogenic Strain with Bacteriophage ϕOH2
Mori, Kazuki; Martono, Hindra; Nagayoshi, Yuko; Fujino, Yasuhiro; Tashiro, Kosuke; Kuhara, Satoru; Ohshima, Toshihisa
2013-01-01
Geobacillus kaustophilus strain GBlys was isolated along with the bacteriophage ϕOH2, which infects G. kaustophilus NBRC 102445T. Here we present a draft sequence of this strain’s genome, which consists of 216 contigs for a total of 3,541,481 bp, 3,679 predicted coding sequences, and a G+C content of 52.1%. PMID:23950135
Piccinni, Florencia; Murua, Yanina; Ghio, Silvina; Talia, Paola; Rivarola, Máximo
2016-01-01
Cellulomonas sp. strain B6 was isolated from a subtropical forest soil sample and presented (hemi)cellulose-degrading activity. We report here its draft genome sequence, with an estimated genome size of 4 Mb, a G+C content of 75.1%, and 3,443 predicted protein-coding sequences, 92 of which are glycosyl hydrolases involved in polysaccharide degradation. PMID:27563050
Draft Genome Sequence of Photorhabdus luminescens Strain DSPV002N Isolated from Santa Fe, Argentina
Del Valle, Eleodoro E.; Frizzo, Laureano; Berry, Colin; Caballero, Primitivo
2016-01-01
Here, we report the draft genome sequence of Photorhabdus luminescens strain DSPV002N, which consists of 177 contig sequences accounting for 5,518,143 bp, with a G+C content of 42.3% and 4,701 predicted protein-coding genes (CDSs). From these, 27 CDSs exhibited significant similarity with insecticidal toxin proteins from Photorhabdus luminescens subsp. laumondii TT01. PMID:27469965
Human somatostatin I: sequence of the cDNA.
Shen, L P; Pictet, R L; Rutter, W J
1982-01-01
RNA has been isolated from a human pancreatic somatostatinoma and used to prepare a cDNA library. After prescreening, clones containing somatostatin I sequences were identified by hybridization with an anglerfish somatostatin I-cloned cDNA probe. From the nucleotide sequence of two of these clones, we have deduced an essentially full-length mRNA sequence, including the preprosomatostatin coding region, 105 nucleotides from the 5' untranslated region and the complete 150-nucleotide 3' untranslated region. The coding region predicts a 116-amino acid precursor protein (Mr, 12.727) that contains somatostatin-14 and -28 at its COOH terminus. The predicted amino acid sequence of human somatostatin-28 is identical to that of somatostatin-28 isolated from the porcine and ovine species. A comparison of the amino acid sequences of human and anglerfish preprosomatostatin I indicated that the COOH-terminal region encoding somatostatin-14 and the adjacent 6 amino acids are highly conserved, whereas the remainder of the molecule, including the signal peptide region, is more divergent. However, many of the amino acid differences found in the pro region of the human and anglerfish proteins are conservative changes. This suggests that the propeptides have a similar secondary structure, which in turn may imply a biological function for this region of the molecule. PMID:6126875
2010-01-01
Background Intragenic tandem repeats occur throughout all domains of life and impart functional and structural variability to diverse translation products. Repeat proteins confer distinctive surface phenotypes to many unicellular organisms, including those with minimal genomes such as the wall-less bacterial monoderms, Mollicutes. One such repeat pattern in this clade is distributed in a manner suggesting its exchange by horizontal gene transfer (HGT). Expanding genome sequence databases reveal the pattern in a widening range of bacteria, and recently among eucaryotic microbes. We examined the genomic flux and consequences of the motif by determining its distribution, predicted structural features and association with membrane-targeted proteins. Results Using a refined hidden Markov model, we document a 25-residue protein sequence motif tandemly arrayed in variable-number repeats in ORFs lacking assigned functions. It appears sporadically in unicellular microbes from disparate bacterial and eucaryotic clades, representing diverse lifestyles and ecological niches that include host parasitic, marine and extreme environments. Tracts of the repeats predict a malleable configuration of recurring domains, with conserved hydrophobic residues forming an amphipathic secondary structure in which hydrophilic residues endow extensive sequence variation. Many ORFs with these domains also have membrane-targeting sequences that predict assorted topologies; others may comprise reservoirs of sequence variants. We demonstrate expressed variants among surface lipoproteins that distinguish closely related animal pathogens belonging to a subgroup of the Mollicutes. DNA sequences encoding the tandem domains display dyad symmetry. Moreover, in some taxa the domains occur in ORFs selectively associated with mobile elements. 
These features, a punctate phylogenetic distribution, and different patterns of dispersal in genomes of related taxa, suggest that the repeat may be disseminated by HGT and intra-genomic shuffling. Conclusions We describe novel features of PARCELs (Palindromic Amphipathic Repeat Coding ELements), a set of widely distributed repeat protein domains and coding sequences that were likely acquired through HGT by diverse unicellular microbes, further mobilized and diversified within genomes, and co-opted for expression in the membrane proteome of some taxa. Disseminated by multiple gene-centric vehicles, ORFs harboring these elements enhance accessory gene pools as part of the "mobilome" connecting genomes of various clades, in taxa sharing common niches. PMID:20626840
The complete nucleotide sequence of RNA beta from the type strain of barley stripe mosaic virus.
Gustafson, G; Armour, S L
1986-01-01
The complete nucleotide sequence of RNA beta from the type strain of barley stripe mosaic virus (BSMV) has been determined. The sequence is 3289 nucleotides in length and contains four open reading frames (ORFs) which code for proteins of Mr 22,147 (ORF1), Mr 58,098 (ORF2), Mr 17,378 (ORF3), and Mr 14,119 (ORF4). The predicted N-terminal amino acid sequence of the polypeptide encoded by the ORF nearest the 5'-end of the RNA (ORF1) is identical (after the initiator methionine) to the published N-terminal amino acid sequence of BSMV coat protein for 29 of the first 30 amino acids. ORF2 occupies the central portion of the coding region of RNA beta and ORF3 is located at the 3'-end. The ORF4 sequence overlaps the 3'-region of ORF2 and the 5'-region of ORF3 and differs in codon usage from the other three RNA beta ORFs. The coding region of RNA beta is followed by a poly(A) tract and a 238 nucleotide tRNA-like structure which are common to all three BSMV genomic RNAs. PMID:3754962
Takai, Kazuyuki
2017-01-21
Codon adaptation index (CAI) has been widely used for prediction of expression of recombinant genes in Escherichia coli and other organisms. However, CAI has no mechanistic basis that rationalizes its application to estimation of translational efficiency. Here, I propose a model based on which we could consider how codon usage is related to the level of expression during exponential growth of bacteria. In this model, translation of a gene is considered as an analog of electric current, and an analog of electric resistance corresponding to each gene is considered. "Translational resistance" is dependent on the steady-state concentration and the sequence of the mRNA species, and "translational resistivity" is dependent only on the mRNA sequence. The latter is the sum of two parts: one is the resistivity for the elongation reaction (coding sequence resistivity), and the other comes from all of the other steps of the decoding reaction. This electric circuit model clearly shows that some conditions should be met for codon composition of a coding sequence to correlate well with its expression level. On the other hand, I calculated relative frequency of each of the 61 sense codon triplets translated during exponential growth of E. coli from a proteomic dataset covering over 2600 proteins. A tentative method for estimating relative coding sequence resistivity based on the data is presented. Copyright © 2016. Published by Elsevier Ltd.
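For context on the index this abstract critiques, CAI as it is commonly defined is the geometric mean of each codon's relative adaptiveness, that is, its count in a reference gene set divided by the count of the most frequent synonymous codon. The sketch below computes that quantity only; it does not implement the translational resistance model proposed in the abstract. The function names and toy sequences are hypothetical, and skipping codons absent from the reference set is a simplification made here.

```python
import math
from collections import Counter, defaultdict

# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def codons(cds):
    """Yield successive in-frame codons, ignoring any trailing partial codon."""
    for pos in range(0, len(cds) - len(cds) % 3, 3):
        yield cds[pos:pos + 3]


def relative_adaptiveness(reference_cds_list):
    """Weight of each codon: its count / count of the best synonymous codon."""
    counts = Counter(c for cds in reference_cds_list for c in codons(cds))
    best = defaultdict(int)
    for codon, n in counts.items():
        aa = CODON_TABLE[codon]
        best[aa] = max(best[aa], n)
    return {codon: n / best[CODON_TABLE[codon]] for codon, n in counts.items()}


def cai(cds, weights):
    """Geometric mean of codon weights over the informative codons of cds."""
    logs = []
    for codon in codons(cds):
        # Met, Trp and stop codons have no synonymous alternatives to compare.
        if CODON_TABLE[codon] in "MW*":
            continue
        # Simplification: codons never seen in the reference set are skipped.
        if codon not in weights:
            continue
        logs.append(math.log(weights[codon]))
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")


# Toy reference set (hypothetical): AAA is used twice as often as AAG for Lys.
weights = relative_adaptiveness(["AAAAAAAAG"])
print(weights)                 # {'AAA': 1.0, 'AAG': 0.5}
print(cai("AAAAAA", weights))  # 1.0
print(cai("AAAAAG", weights))
```

Because the weights depend only on codon frequencies in the reference set, the score carries no information about mRNA concentration or initiation, which is consistent with the abstract's point that CAI lacks a mechanistic basis.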
COOLAIR Antisense RNAs Form Evolutionarily Conserved Elaborate Secondary Structures
Hawkes, Emily J.; Hennelly, Scott P.; Novikova, Irina V.; ...
2016-09-20
There is considerable debate about the functionality of long non-coding RNAs (lncRNAs). Lack of sequence conservation has been used to argue against functional relevance. Here, we investigated antisense lncRNAs, called COOLAIR, at the A. thaliana FLC locus and experimentally determined their secondary structure. The major COOLAIR variants are highly structured, organized by exon. The distally polyadenylated transcript has a complex multi-domain structure, altered by a single non-coding SNP defining a functionally distinct A. thaliana FLC haplotype. The A. thaliana COOLAIR secondary structure was used to predict COOLAIR exons in evolutionarily divergent Brassicaceae species. These predictions were validated through chemical probing and cloning. Despite the relatively low nucleotide sequence identity, the structures, including multi-helix junctions, show remarkable evolutionary conservation. In a number of places, the structure is conserved through covariation of a non-contiguous DNA sequence. This structural conservation supports a functional role for COOLAIR transcripts rather than, or in addition to, antisense transcription.
Predictive coding in autism spectrum disorder and attention deficit hyperactivity disorder.
Gonzalez-Gadea, Maria Luz; Chennu, Srivas; Bekinschtein, Tristan A; Rattazzi, Alexia; Beraudi, Ana; Tripicchio, Paula; Moyano, Beatriz; Soffita, Yamila; Steinberg, Laura; Adolfi, Federico; Sigman, Mariano; Marino, Julian; Manes, Facundo; Ibanez, Agustin
2015-11-01
Predictive coding has been proposed as a framework to understand neural processes in neuropsychiatric disorders. We used this approach to describe mechanisms responsible for attentional abnormalities in autism spectrum disorder (ASD) and attention deficit hyperactivity disorder (ADHD). We monitored brain dynamics of 59 children (8-15 yr old) who had ASD or ADHD or who were control participants via high-density electroencephalography. We performed analysis at the scalp and source-space levels while participants listened to standard and deviant tone sequences. Through task instructions, we manipulated top-down expectation by presenting expected and unexpected deviant sequences. Children with ASD showed reduced superior frontal cortex (FC) responses to unexpected events but increased dorsolateral prefrontal cortex (PFC) activation to expected events. In contrast, children with ADHD exhibited reduced cortical responses in superior FC to expected events but strong PFC activation to unexpected events. Moreover, neural abnormalities were associated with specific control mechanisms, namely, inhibitory control in ASD and set-shifting in ADHD. Based on the predictive coding account, top-down expectation abnormalities could be attributed to a disproportionate reliance (precision) allocated to prior beliefs in ASD and to sensory input in ADHD. Copyright © 2015 the American Physiological Society.
Predictive coding in autism spectrum disorder and attention deficit hyperactivity disorder
Gonzalez-Gadea, Maria Luz; Chennu, Srivas; Bekinschtein, Tristan A.; Rattazzi, Alexia; Beraudi, Ana; Tripicchio, Paula; Moyano, Beatriz; Soffita, Yamila; Steinberg, Laura; Adolfi, Federico; Sigman, Mariano; Marino, Julian; Manes, Facundo
2015-01-01
Predictive coding has been proposed as a framework to understand neural processes in neuropsychiatric disorders. We used this approach to describe mechanisms responsible for attentional abnormalities in autism spectrum disorder (ASD) and attention deficit hyperactivity disorder (ADHD). We monitored brain dynamics of 59 children (8–15 yr old) who had ASD or ADHD or who were control participants via high-density electroencephalography. We performed analysis at the scalp and source-space levels while participants listened to standard and deviant tone sequences. Through task instructions, we manipulated top-down expectation by presenting expected and unexpected deviant sequences. Children with ASD showed reduced superior frontal cortex (FC) responses to unexpected events but increased dorsolateral prefrontal cortex (PFC) activation to expected events. In contrast, children with ADHD exhibited reduced cortical responses in superior FC to expected events but strong PFC activation to unexpected events. Moreover, neural abnormalities were associated with specific control mechanisms, namely, inhibitory control in ASD and set-shifting in ADHD. Based on the predictive coding account, top-down expectation abnormalities could be attributed to a disproportionate reliance (precision) allocated to prior beliefs in ASD and to sensory input in ADHD. PMID:26311184
Position specific variation in the rate of evolution in transcription factor binding sites
Moses, Alan M; Chiang, Derek Y; Kellis, Manolis; Lander, Eric S; Eisen, Michael B
2003-01-01
Background: The binding sites of sequence specific transcription factors are an important and relatively well-understood class of functional non-coding DNAs. Although a wide variety of experimental and computational methods have been developed to characterize transcription factor binding sites, they remain difficult to identify. Comparison of non-coding DNA from related species has shown considerable promise in identifying these functional non-coding sequences, even though relatively little is known about their evolution. Results: Here we analyse the genome sequences of the budding yeasts Saccharomyces cerevisiae, S. bayanus, S. paradoxus and S. mikatae to study the evolution of transcription factor binding sites. As expected, we find that both experimentally characterized and computationally predicted binding sites evolve slower than surrounding sequence, consistent with the hypothesis that they are under purifying selection. We also observe position-specific variation in the rate of evolution within binding sites. We find that the position-specific rate of evolution is positively correlated with degeneracy among binding sites within S. cerevisiae. We test theoretical predictions for the rate of evolution at positions where the base frequencies deviate from background due to purifying selection and find reasonable agreement with the observed rates of evolution. Finally, we show how the evolutionary characteristics of real binding motifs can be used to distinguish them from artefacts of computational motif finding algorithms. Conclusion: As has been observed for protein sequences, the rate of evolution in transcription factor binding sites varies with position, suggesting that some regions are under stronger functional constraint than others. This variation likely reflects the varying importance of different positions in the formation of the protein-DNA complex. The characterization of the pattern of evolution in known binding sites will likely contribute to the effective use of comparative sequence data in the identification of transcription factor binding sites and is an important step toward understanding the evolution of functional non-coding DNA. PMID:12946282
Chromatin accessibility prediction via a hybrid deep convolutional neural network.
Liu, Qiao; Xia, Fei; Yin, Qijin; Jiang, Rui
2018-03-01
A majority of known genetic variants associated with human-inherited diseases lie in non-coding regions that lack adequate interpretation, making it indispensable to systematically discover functional sites at the whole genome level and precisely decipher their implications in a comprehensive manner. Although computational approaches have been complementing high-throughput biological experiments towards the annotation of the human genome, it still remains a big challenge to accurately annotate regulatory elements in the context of a specific cell type via automatic learning of the DNA sequence code from large-scale sequencing data. Indeed, the development of an accurate and interpretable model to learn the DNA sequence signature and further enable the identification of causative genetic variants has become essential in both genomic and genetic studies. We proposed Deopen, a hybrid framework mainly based on a deep convolutional neural network, to automatically learn the regulatory code of DNA sequences and predict chromatin accessibility. In a series of comparisons with existing methods, we show the superior performance of our model in not only the classification of accessible regions against background sequences sampled at random, but also the regression of DNase-seq signals. Besides, we further visualize the convolutional kernels and show the match of identified sequence signatures and known motifs. We finally demonstrate the sensitivity of our model in finding causative noncoding variants in the analysis of a breast cancer dataset. We expect to see wide applications of Deopen with either public or in-house chromatin accessibility data in the annotation of the human genome and the identification of non-coding variants associated with diseases. Deopen is freely available at https://github.com/kimmo1019/Deopen. Contact: ruijiang@tsinghua.edu.cn. Supplementary data are available at Bioinformatics online. © The Author (2017). Published by Oxford University Press.
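To make the idea of a convolutional kernel matching a sequence motif concrete, here is a minimal sketch of one-hot encoding and a single-kernel scan. It does not reproduce Deopen or its API; the function names and the TATA example are hypothetical, and the kernel here is hand-set rather than learned.

```python
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows (A, C, G, T).

    Any other character (for example N) becomes an all-zero row.
    """
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def conv_scan(encoded, kernel):
    """Slide a (width x 4) kernel along the encoded sequence (valid positions only)."""
    width = len(kernel)
    return [
        sum(encoded[i + j][k] * kernel[j][k] for j in range(width) for k in range(4))
        for i in range(len(encoded) - width + 1)
    ]

# A hand-set kernel that responds to the motif TATA (illustrative only).
kernel = one_hot("TATA")
scores = conv_scan(one_hot("GGTATAGG"), kernel)
best_position = scores.index(max(scores))
```

In a trained network the kernel weights are real-valued and learned from data; comparing them with known motifs is the kind of visualization the abstract describes.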
A study on multiresolution lossless video coding using inter/intra frame adaptive prediction
NASA Astrophysics Data System (ADS)
Nakachi, Takayuki; Sawabe, Tomoko; Fujii, Tetsuro
2003-06-01
Lossless video coding is required in the fields of archiving and editing digital cinema or digital broadcasting contents. This paper combines a discrete wavelet transform and adaptive inter/intra-frame prediction in the wavelet transform domain to create multiresolution lossless video coding. The multiresolution structure offered by the wavelet transform facilitates interchange among several video source formats such as Super High Definition (SHD) images, HDTV, SDTV, and mobile applications. Adaptive inter/intra-frame prediction is an extension of JPEG-LS, a state-of-the-art lossless still image compression standard. Based on the image statistics of the wavelet transform domains in successive frames, inter/intra frame adaptive prediction is applied to the appropriate wavelet transform domain. This adaptation offers superior compression performance. This is achieved with low computational cost and no increase in additional information. Experiments on digital cinema test sequences confirm the effectiveness of the proposed algorithm.
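For orientation, the fixed predictor that JPEG-LS applies to still images is the median edge detector (MED). The sketch below shows only that standard intra-frame predictor on plain sample arrays; the paper's adaptive inter/intra extension and its wavelet-domain application are not reproduced, and treating out-of-frame neighbours as 0 is a simplifying assumption.

```python
def med_predict(left, above, upper_left):
    """Median edge detector (MED) prediction used by JPEG-LS / LOCO-I."""
    if upper_left >= max(left, above):
        return min(left, above)
    if upper_left <= min(left, above):
        return max(left, above)
    return left + above - upper_left

def intra_residuals(image):
    """Prediction residuals for a 2-D list of integer samples.

    Out-of-frame neighbours are treated as 0, a simplification of the
    standard's edge handling.
    """
    rows, cols = len(image), len(image[0])
    residuals = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            left = image[r][c - 1] if c > 0 else 0
            above = image[r - 1][c] if r > 0 else 0
            upper_left = image[r - 1][c - 1] if r > 0 and c > 0 else 0
            residuals[r][c] = image[r][c] - med_predict(left, above, upper_left)
    return residuals
```

Because the decoder can form the same prediction from already decoded samples, transmitting the integer residuals is enough to reconstruct the frame exactly, which is what makes the scheme lossless.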
Mahan, Kristina M.; Klingeman, Dawn Marie; Hettich, Robert L.; ...
2016-01-21
Streptomyces vitaminophilus produces pyrrolomycins, which are halogenated polyketide antibiotics. Some of the pyrrolomycins contain a rare nitro group located on the pyrrole ring. In addition, the 6.5-Mbp genome encodes 5,941 predicted protein-coding sequences in 39 contigs with a 71.9% G+C content.
Klingeman, Dawn M.; Hettich, Robert L.; Parry, Ronald J.
2016-01-01
Streptomyces vitaminophilus produces pyrrolomycins, which are halogenated polyketide antibiotics. Some of the pyrrolomycins contain a rare nitro group located on the pyrrole ring. The 6.5-Mbp genome encodes 5,941 predicted protein-coding sequences in 39 contigs with a 71.9% G+C content. PMID:26798098
Cross-species transferability and mapping of genomic and cDNA SSRs in pines
D. Chagne; P. Chaumeil; A. Ramboer; C. Collada; A. Guevara; M. T. Cervera; G. G. Vendramin; V. Garcia; J-M. Frigerio; Craig Echt; T. Richardson; Christophe Plomion
2004-01-01
Two unigene datasets of Pinus taeda and Pinus pinaster were screened to detect di-, tri and tetranucleotide repeated motifs using the SSRIT script. A total of 419 simple sequence repeats (SSRs) were identified, from which only 12.8% overlapped between the two sets. The position of the SSRs within the coding sequence were predicted...
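A screen for di-, tri- and tetranucleotide repeats of the kind described above can be sketched with back-referencing regular expressions. This is not the SSRIT script; the function name and the minimum repeat thresholds are assumptions chosen for illustration.

```python
import re

def find_ssrs(seq, min_repeats=None):
    """Return (start, motif, repeat_count) for perfect SSRs in seq.

    min_repeats maps motif length to the minimum number of tandem copies;
    the defaults below are hypothetical thresholds, not values from the study.
    """
    min_repeats = min_repeats or {2: 6, 3: 5, 4: 4}
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in sorted(min_repeats.items()):
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are just a shorter unit repeated (e.g. "AA", "ATAT").
            if any(motif == motif[:k] * (unit_len // k)
                   for k in range(1, unit_len) if unit_len % k == 0):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // unit_len))
    return hits

# Invented test sequence: (AT)7 followed by (GAA)5.
example = "GG" + "AT" * 7 + "CC" + "GAA" * 5 + "T"
ssrs = find_ssrs(example)
```

Comparing the hit lists from two unigene sets, for example by motif and flanking sequence, would be one way to estimate an overlap such as the 12.8% reported, although the abstract does not say how overlap was determined.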
The standard operating procedure of the DOE-JGI Microbial Genome Annotation Pipeline (MGAP v.4).
Huntemann, Marcel; Ivanova, Natalia N; Mavromatis, Konstantinos; Tripp, H James; Paez-Espino, David; Palaniappan, Krishnaveni; Szeto, Ernest; Pillay, Manoj; Chen, I-Min A; Pati, Amrita; Nielsen, Torben; Markowitz, Victor M; Kyrpides, Nikos C
2015-01-01
The DOE-JGI Microbial Genome Annotation Pipeline performs structural and functional annotation of microbial genomes that are further included into the Integrated Microbial Genome comparative analysis system. MGAP is applied to assembled nucleotide sequence datasets that are provided via the IMG submission site. Dataset submission for annotation first requires project and associated metadata description in GOLD. The MGAP sequence data processing consists of feature prediction including identification of protein-coding genes, non-coding RNAs and regulatory RNA features, as well as CRISPR elements. Structural annotation is followed by assignment of protein product names and functions.
A neuronal model of predictive coding accounting for the mismatch negativity.
Wacongne, Catherine; Changeux, Jean-Pierre; Dehaene, Stanislas
2012-03-14
The mismatch negativity (MMN) is thought to index the activation of specialized neural networks for active prediction and deviance detection. However, a detailed neuronal model of the neurobiological mechanisms underlying the MMN is still lacking, and its computational foundations remain debated. We propose here a detailed neuronal model of auditory cortex, based on predictive coding, that accounts for the critical features of MMN. The model is entirely composed of spiking excitatory and inhibitory neurons interconnected in a layered cortical architecture with distinct input, predictive, and prediction error units. A spike-timing dependent learning rule, relying upon NMDA receptor synaptic transmission, allows the network to adjust its internal predictions and use a memory of the recent past inputs to anticipate on future stimuli based on transition statistics. We demonstrate that this simple architecture can account for the major empirical properties of the MMN. These include a frequency-dependent response to rare deviants, a response to unexpected repeats in alternating sequences (ABABAA…), a lack of consideration of the global sequence context, a response to sound omission, and a sensitivity of the MMN to NMDA receptor antagonists. Novel predictions are presented, and a new magnetoencephalography experiment in healthy human subjects is presented that validates our key hypothesis: the MMN results from active cortical prediction rather than passive synaptic habituation.
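The notion of predicting the next tone from transition statistics can be shown with a deliberately abstract toy. It is not the spiking network of the paper: there are no neurons or NMDA-dependent plasticity here, and the function name, learning rate, and update rule are assumptions.

```python
from collections import defaultdict

def prediction_errors(sequence, learning_rate=0.2):
    """Return one prediction error per tone after the first.

    The error is 1 - P(current | previous) under transition estimates that
    are updated online; an unexpected tone gives an error near 1.
    """
    tones = sorted(set(sequence))
    # Start from uniform expectations over the tones that occur.
    probs = defaultdict(lambda: {t: 1.0 / len(tones) for t in tones})
    errors = []
    for prev, cur in zip(sequence, sequence[1:]):
        p = probs[prev]
        errors.append(1.0 - p[cur])
        # Move the prediction for `prev` towards what was actually heard.
        for t in tones:
            target = 1.0 if t == cur else 0.0
            p[t] += learning_rate * (target - p[t])
    return errors

# Alternating tones ending in an unexpected repeat (ABAB...AA).
errors = prediction_errors("ABABABABABABAA")
```

After several alternations the learner expects B after A, so the final repeated A produces a large error even though A itself is a frequent tone. A passive habituation account would not predict this, which is the contrast the abstract draws.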
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wheeler, E.F.; Roussel, M.F.; Hampe, A.
1986-08-01
The nucleotide sequence of a 5' segment of the human genomic c-fms proto-oncogene suggested that recombination between feline leukemia virus and feline c-fms sequences might have occurred in a region encoding the 5' untranslated portion of c-fms mRNA. The polyprotein precursor gP180^gag-fms encoded by the McDonough strain of feline sarcoma virus was therefore predicted to contain 34 v-fms-coded amino acids derived from sequences of the c-fms gene that are not ordinarily translated from the proto-oncogene mRNA. The gP180^gag-fms polyprotein was cotranslationally cleaved near the gag-fms junction to remove its gag gene-coded portion. Determination of the amino-terminal sequence of the resulting v-fms-coded glycoprotein, gp120^v-fms, showed that the site of proteolysis corresponded to a predicted signal peptidase cleavage site within the c-fms gene product. Together, these analyses suggested that the linked gag sequences may not be necessary for expression of a biologically active v-fms gene product. The gag-fms sequences of feline sarcoma virus strain McDonough and the v-fms sequences alone were inserted into a murine retroviral vector containing a neomycin resistance gene. The authors conclude that a cryptic hydrophobic signal peptide sequence in v-fms was unmasked by gag deletion, thereby allowing the correct orientation and transport of the v-fms gene product within membranous organelles. It seems likely that the proteolytic cleavage of gP180^gag-fms is mediated by signal peptidase and that the amino termini of gp140^v-fms and the c-fms gene product are identical.
A draft sequence of the rice genome (Oryza sativa L. ssp. indica).
Yu, Jun; Hu, Songnian; Wang, Jun; Wong, Gane Ka-Shu; Li, Songgang; Liu, Bin; Deng, Yajun; Dai, Li; Zhou, Yan; Zhang, Xiuqing; Cao, Mengliang; Liu, Jing; Sun, Jiandong; Tang, Jiabin; Chen, Yanjiong; Huang, Xiaobing; Lin, Wei; Ye, Chen; Tong, Wei; Cong, Lijuan; Geng, Jianing; Han, Yujun; Li, Lin; Li, Wei; Hu, Guangqiang; Huang, Xiangang; Li, Wenjie; Li, Jian; Liu, Zhanwei; Li, Long; Liu, Jianping; Qi, Qiuhui; Liu, Jinsong; Li, Li; Li, Tao; Wang, Xuegang; Lu, Hong; Wu, Tingting; Zhu, Miao; Ni, Peixiang; Han, Hua; Dong, Wei; Ren, Xiaoyu; Feng, Xiaoli; Cui, Peng; Li, Xianran; Wang, Hao; Xu, Xin; Zhai, Wenxue; Xu, Zhao; Zhang, Jinsong; He, Sijie; Zhang, Jianguo; Xu, Jichen; Zhang, Kunlin; Zheng, Xianwu; Dong, Jianhai; Zeng, Wanyong; Tao, Lin; Ye, Jia; Tan, Jun; Ren, Xide; Chen, Xuewei; He, Jun; Liu, Daofeng; Tian, Wei; Tian, Chaoguang; Xia, Hongai; Bao, Qiyu; Li, Gang; Gao, Hui; Cao, Ting; Wang, Juan; Zhao, Wenming; Li, Ping; Chen, Wei; Wang, Xudong; Zhang, Yong; Hu, Jianfei; Wang, Jing; Liu, Song; Yang, Jian; Zhang, Guangyu; Xiong, Yuqing; Li, Zhijie; Mao, Long; Zhou, Chengshu; Zhu, Zhen; Chen, Runsheng; Hao, Bailin; Zheng, Weimou; Chen, Shouyi; Guo, Wei; Li, Guojie; Liu, Siqi; Tao, Ming; Wang, Jian; Zhu, Lihuang; Yuan, Longping; Yang, Huanming
2002-04-05
We have produced a draft sequence of the rice genome for the most widely cultivated subspecies in China, Oryza sativa L. ssp. indica, by whole-genome shotgun sequencing. The genome was 466 megabases in size, with an estimated 46,022 to 55,615 genes. Functional coverage in the assembled sequences was 92.0%. About 42.2% of the genome was in exact 20-nucleotide oligomer repeats, and most of the transposons were in the intergenic regions between genes. Although 80.6% of predicted Arabidopsis thaliana genes had a homolog in rice, only 49.4% of predicted rice genes had a homolog in A. thaliana. The large proportion of rice genes with no recognizable homologs is due to a gradient in the GC content of rice coding sequences.
Fast H.264/AVC FRExt intra coding using belief propagation.
Milani, Simone
2011-01-01
In the H.264/AVC FRExt coder, the coding performance of Intra coding significantly exceeds that of previous still image coding standards, such as JPEG2000, thanks to a massive use of spatial prediction. Unfortunately, the adoption of an extensive set of predictors significantly increases the computational complexity required by the rate-distortion optimization routine. The paper presents a complexity reduction strategy that aims at reducing the computational load of Intra coding with a small loss in compression performance. The proposed algorithm relies on selecting a reduced set of prediction modes according to their probabilities, which are estimated using a belief-propagation procedure. Experimental results show that the proposed method saves up to 60% of the coding time required by an exhaustive rate-distortion optimization method with a negligible loss in performance. Moreover, it permits accurate control of the computational complexity, unlike other methods where the computational complexity depends upon the coded sequence.
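The step of restricting rate-distortion optimization to the most probable modes can be sketched as follows. The mode probabilities are taken as given (the belief-propagation estimate is not implemented), and the function names, the coverage threshold, and the toy costs are assumptions.

```python
def select_candidate_modes(mode_probs, coverage=0.9):
    """Keep the most probable modes until their total probability reaches coverage."""
    ranked = sorted(mode_probs.items(), key=lambda kv: kv[1], reverse=True)
    selected, cumulative = [], 0.0
    for mode, prob in ranked:
        selected.append(mode)
        cumulative += prob
        if cumulative >= coverage:
            break
    return selected

def best_mode(mode_probs, rd_cost, coverage=0.9):
    """Evaluate the rate-distortion cost only for the retained candidates."""
    candidates = select_candidate_modes(mode_probs, coverage)
    return min(candidates, key=rd_cost)

# Hypothetical probabilities and rate-distortion costs for four intra modes.
probs = {"vertical": 0.5, "horizontal": 0.3, "dc": 0.15, "diagonal": 0.05}
costs = {"vertical": 12, "horizontal": 9, "dc": 10, "diagonal": 3}
chosen = best_mode(probs, costs.get)
```

In this toy case the cheapest mode overall ("diagonal") is pruned because it was judged improbable, so the encoder settles for "horizontal". That is the trade the abstract describes: fewer cost evaluations in exchange for an occasional small loss in compression.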
Prediction of plant lncRNA by ensemble machine learning classifiers.
Simopoulos, Caitlin M A; Weretilnyk, Elizabeth A; Golding, G Brian
2018-05-02
In plants, long non-protein coding RNAs are believed to have essential roles in development and stress responses. However, relative to advances on discerning biological roles for long non-protein coding RNAs in animal systems, this RNA class in plants is largely understudied. With comparatively few validated plant long non-coding RNAs, research on this potentially critical class of RNA is hindered by a lack of appropriate prediction tools and databases. Supervised learning models trained on data sets of mostly non-validated, non-coding transcripts have been previously used to identify this enigmatic RNA class with applications largely focused on animal systems. Our approach uses a training set comprised only of empirically validated long non-protein coding RNAs from plant, animal, and viral sources to predict and rank candidate long non-protein coding gene products for future functional validation. Individual stochastic gradient boosting and random forest classifiers trained on only empirically validated long non-protein coding RNAs were constructed. In order to use the strengths of multiple classifiers, we combined multiple models into a single stacking meta-learner. This ensemble approach benefits from the diversity of several learners to effectively identify putative plant long non-coding RNAs from transcript sequence features. When the predicted genes identified by the ensemble classifier were compared to those listed in GreeNC, an established plant long non-coding RNA database, overlap for predicted genes from Arabidopsis thaliana, Oryza sativa and Eutrema salsugineum ranged from 51 to 83% with the highest agreement in Eutrema salsugineum. Most of the highest ranking predictions from Arabidopsis thaliana were annotated as potential natural antisense genes, pseudogenes, transposable elements, or simply computationally predicted hypothetical protein. 
Due to the nature of this tool, the model can be updated as new long non-protein coding transcripts are identified and functionally verified. This ensemble classifier is an accurate tool that can be used to rank long non-protein coding RNA predictions for use in conjunction with gene expression studies. Selection of plant transcripts with a high potential for regulatory roles as long non-protein coding RNAs will advance research in the elucidation of long non-protein coding RNA function.
Darris, Maxwell
2017-01-01
ABSTRACT Most of the 24 known Chitinophaga species were originally isolated from soils. We report the draft genome sequence of a putatively novel Chitinophaga sp. from a biofilm in an air conditioner condensate pipe. The genome comprises 7,661,303 bp in one scaffold, 5,694 predicted protein-coding sequences, and a G+C content of 47.6%. PMID:29051259
Almeida, Eduardo L; Margassery, Lekha M; O'Leary, Niall; Dobson, Alan D W
2018-01-25
Pseudomonas putida strain CA-3 is an industrial bioreactor isolate capable of synthesizing biodegradable polyhydroxyalkanoate polymers via the metabolism of styrene and other unrelated carbon sources. The pathways involved are subject to regulation by global cellular processes. The draft genome sequence is 6,177,154 bp long and contains 5,608 predicted coding sequences. Copyright © 2018 Almeida et al.
Govindarajan, Subramaniam S.; Qi, Feng; Li, Jian-Liang; Sahoo, Malaya K.
2017-01-01
ABSTRACT Paenibacillus sp. strain KS1 was isolated from an epiphyte, Tillandsia usneoides (Spanish moss), in central Florida, USA. Here, we report a draft genome sequence of this strain, which consists of a total of 398 contigs spanning 6,508,195 bp, with a G+C content of 46.5% and comprising 5,401 predicted coding sequences. PMID:28153888
Mori, Kazuki; Mutaguchi, Yuta; Tashiro, Kosuke; Fujino, Yasuhiro; Ohmori, Taketo; Kuhara, Satoru; Ohshima, Toshihisa
2013-01-01
Lactobacillus otakiensis strain JCM 15040T was isolated from an unsalted pickling solution used in the production of sunki, a traditional Japanese pickle. Here, we prepared a draft genome sequence for this strain consisting of 40 contigs containing a total of 2,347,132 bp, 2,310 predicted coding sequences, and a G+C content of 42.4%. PMID:23929467
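Summary figures such as contig count, total length, and G+C content in these genome announcements follow directly from the assembled sequences. A minimal sketch, assuming the contigs are already loaded as plain strings (the function name and toy contigs are hypothetical):

```python
def assembly_stats(contigs):
    """Contig count, total length in bp, and G+C percentage of an assembly.

    Ambiguous bases such as N count towards length but not towards G+C.
    """
    total = sum(len(c) for c in contigs)
    gc = sum(c.upper().count("G") + c.upper().count("C") for c in contigs)
    return {
        "contigs": len(contigs),
        "total_bp": total,
        "gc_percent": 100.0 * gc / total,
    }

# Two invented contigs, not real sequence data.
stats = assembly_stats(["ATGC", "GGCCAT"])
```

The G+C percentage is computed over the pooled bases, so long contigs weigh more than short ones, which differs from averaging per-contig percentages.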
Piccinni, Florencia; Murua, Yanina; Ghio, Silvina; Talia, Paola; Rivarola, Máximo; Campos, Eleonora
2016-08-25
Cellulomonas sp. strain B6 was isolated from a subtropical forest soil sample and presented (hemi)cellulose-degrading activity. We report here its draft genome sequence, with an estimated genome size of 4 Mb, a G+C content of 75.1%, and 3,443 predicted protein-coding sequences, 92 of which are glycosyl hydrolases involved in polysaccharide degradation. Copyright © 2016 Piccinni et al.
García-Ramón, Diana C.; Palma, Leopoldo; Berry, Colin; Osuna, Antonio
2015-01-01
We present the draft whole-genome sequence of the entomopathogenic Bacillus pumilus 15.1 strain that consists of 3,795,691 bp and 3,776 predicted protein-coding genes. This genome sequence provides the basis for understanding the potential mechanism behind the toxicity and virulence of B. pumilus 15.1 against the Mediterranean fruit fly. PMID:26404596
Navas, Esteban; Bohle, Harry; Henríquez, Patricio; Grothusen, Horst; Bustamante, Fernando; Bustos, Patricio
2014-01-01
We sequenced the genome of a motile O1b Yersinia ruckeri field isolate from Chile, which is causing enteric redmouth disease (ERM) in vaccinated Atlantic salmon (Salmo salar). The draft genome has 3,775,486 bp, a G+C content of 47.1%, and is predicted to contain 3,406 coding sequences. PMID:25169862
Information-Theoretic Properties of Auditory Sequences Dynamically Influence Expectation and Memory
ERIC Educational Resources Information Center
Agres, Kat; Abdallah, Samer; Pearce, Marcus
2018-01-01
A basic function of cognition is to detect regularities in sensory input to facilitate the prediction and recognition of future events. It has been proposed that these implicit expectations arise from an internal predictive coding model, based on knowledge acquired through processes such as statistical learning, but it is unclear how different…
Gardner, Elliot M.; Johnson, Matthew G.; Ragone, Diane; Wickett, Norman J.; Zerega, Nyree J. C.
2016-01-01
Premise of the study: We used moderately low-coverage (17×) whole-genome sequencing of Artocarpus camansi (Moraceae) to develop genomic resources for Artocarpus and Moraceae. Methods and Results: A de novo assembly of Illumina short reads (251,378,536 pairs, 2 × 100 bp) accounted for 93% of the predicted genome size. Predicted coding regions were used in a three-way orthology search with published genomes of Morus notabilis and Cannabis sativa. Phylogenetic markers for Moraceae were developed from 333 inferred single-copy exons. Ninety-eight putative MADS-box genes were identified. Analysis of all predicted coding regions resulted in preliminary annotation of 49,089 genes. An analysis of synonymous substitutions for pairs of orthologs (Ks analysis) in M. notabilis and A. camansi strongly suggested a lineage-specific whole-genome duplication in Artocarpus. Conclusions: This study substantially increases the genomic resources available for Artocarpus and Moraceae and demonstrates the value of low-coverage de novo assemblies for nonmodel organisms with moderately large genomes. PMID:27437173
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wang, O.; Masters, C.; Lewis, M.B.
1994-09-01
In an 8-year-old girl and her father, both of whom have severe type III OI, we have previously used RNA/RNA hybrid analysis to demonstrate a mismatch in the region of α1(I) mRNA coding for aa 558-861. We used SSCP to further localize the abnormality to a subregion coding for aa 579-679. This region was subcloned and sequenced. Each patient's cDNA has a deletion of the sequences coding for the last residue of exon 34, and all of exons 35 and 36 (aa 604-639), followed by an insertion of 156 nt from the 3′-end of intron 36. PCR amplification of leukocyte DNA from the patients and the clinically normal paternal grandmother yielded two fragments: a 1007 bp fragment predicted from normal genomic sequences and a 445 bp fragment. Subcloning and sequencing of the shorter genomic PCR product confirmed the presence of a 565 bp genomic deletion from the end of exon 34 to the middle of intron 36. The abnormal protein is apparently synthesized and incorporated into helix. The inserted nucleotides are in frame with the collagenous sequence and contain no stop codons. They encode a 52 aa non-collagenous region. The fibroblast procollagen of the patients has both normal and electrophoretically delayed proα(I) bands. The electrophoretically delayed procollagen is very sensitive to pepsin or trypsin digestion, as predicted by its non-collagenous sequence, and cannot be visualized as collagen. This unique OI collagen mutation is an excellent candidate for molecular targeting to "turn off" a dominant mutant allele.
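The reading-frame arithmetic in the abstract above can be checked with a few lines. The numbers come from the abstract; the helper names are hypothetical, and reading the insert in frame from its first base is an assumption, since the inserted sequence itself is not given.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def is_in_frame(deleted_nt, inserted_nt):
    """True if the net length change keeps the downstream reading frame."""
    return (inserted_nt - deleted_nt) % 3 == 0

def has_stop_codon(insert_seq):
    """Scan an insert, read in frame from its first base, for stop codons."""
    seq = insert_seq.upper()
    return any(seq[i:i + 3] in STOP_CODONS for i in range(0, len(seq) - 2, 3))

deleted_residues = 639 - 604 + 1   # aa 604-639 inclusive, as stated in the abstract
deleted_nt = deleted_residues * 3
inserted_nt = 156                  # intron 36 nucleotides retained in the cDNA
encoded_residues = inserted_nt // 3
```

The deletion removes 36 residues (108 nt) and the insertion adds 156 nt, which is 52 codons, so the net change of 48 nt is a multiple of three. This is consistent with the reported 52 aa non-collagenous region and with the downstream collagenous sequence staying in frame.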
A deep learning method for lincRNA detection using auto-encoder algorithm.
Yu, Ning; Yu, Zeng; Pan, Yi
2017-12-06
The RNA sequencing technique (RNA-seq) enables scientists to develop novel data-driven methods for discovering more unidentified lincRNAs. Meanwhile, knowledge-based technologies are experiencing a potential revolution ignited by the new deep learning methods. By scanning the newly found data set from RNA-seq, scientists have found that: (1) the expression of lincRNAs appears to be regulated, that is, the relevance exists along the DNA sequences; (2) lincRNAs contain some conserved patterns/motifs tethered together by non-conserved regions. These two observations motivate adopting knowledge-based deep learning methods in lincRNA detection. Similar to coding region transcription, non-coding regions are split at transcriptional sites. However, regulatory RNAs rather than messenger RNAs are generated. That is, the transcribed RNAs participate in biological processes as regulatory units instead of generating proteins. Identifying these transcriptional regions from non-coding regions is the first step towards lincRNA recognition. The auto-encoder method achieves 100% and 92.4% prediction accuracy on transcription sites over the putative data sets. The experimental results also show the excellent performance of the predictive deep neural network on the lincRNA data sets compared with a support vector machine and a traditional neural network. In addition, the method is validated on the newly discovered lincRNA data set, and one unreported transcription site is found by feeding the whole annotated sequences through the deep learning machine, which indicates that the deep learning method has substantial capacity for lincRNA prediction. The transcriptional sequences of lincRNAs are collected from the annotated human DNA genome data. Subsequently, a two-layer deep neural network is developed for lincRNA detection, which adopts the auto-encoder algorithm and utilizes different encoding schemes to obtain the best performance over intergenic DNA sequence data. 
Driven by those newly annotated lincRNA data, deep learning methods based on the auto-encoder algorithm can exert their capability in knowledge learning in order to capture the useful features and the information correlation along DNA genome sequences for lincRNA detection. To our knowledge, this is the first application of deep learning techniques to identifying lincRNA transcription sequences.
Auer, Paul L; Nalls, Mike; Meschia, James F; Worrall, Bradford B; Longstreth, W T; Seshadri, Sudha; Kooperberg, Charles; Burger, Kathleen M; Carlson, Christopher S; Carty, Cara L; Chen, Wei-Min; Cupples, L Adrienne; DeStefano, Anita L; Fornage, Myriam; Hardy, John; Hsu, Li; Jackson, Rebecca D; Jarvik, Gail P; Kim, Daniel S; Lakshminarayan, Kamakshi; Lange, Leslie A; Manichaikul, Ani; Quinlan, Aaron R; Singleton, Andrew B; Thornton, Timothy A; Nickerson, Deborah A; Peters, Ulrike; Rich, Stephen S
2015-07-01
Stroke is the second leading cause of death and the third leading cause of years of life lost. Genetic factors contribute to stroke prevalence, and candidate gene and genome-wide association studies (GWAS) have identified variants associated with ischemic stroke risk. These variants often have small effects without obvious biological significance. Exome sequencing may discover predicted protein-altering variants with a potentially large effect on ischemic stroke risk. This study aimed to investigate the contribution of rare and common genetic variants to ischemic stroke risk by targeting the protein-coding regions of the human genome. The National Heart, Lung, and Blood Institute (NHLBI) Exome Sequencing Project (ESP) analyzed approximately 6000 participants from numerous cohorts of European and African ancestry. For discovery, 365 cases of ischemic stroke (small-vessel and large-vessel subtypes) and 809 European ancestry controls were sequenced; for replication, 47 affected sibpairs concordant for stroke subtype and an African American case-control series were sequenced, with 1672 cases and 4509 European ancestry controls genotyped. The ESP's exome sequencing and genotyping started on January 1, 2010, and continued through June 30, 2012. Analyses were conducted on the full data set between July 12, 2012, and July 13, 2013. The main outcomes were discovery of new variants or genes contributing to ischemic stroke risk and subtype (primary analysis) and determination of support for protein-coding variants contributing to risk in previously published candidate genes (secondary analysis).
We identified 2 novel genes associated with an increased risk of ischemic stroke: a protein-coding variant in PDE4DIP (rs1778155; odds ratio, 2.15; P = 2.63 × 10^-8) with an intracellular signal transduction mechanism and in ACOT4 (rs35724886; odds ratio, 2.04; P = 1.24 × 10^-7) with a fatty acid metabolism mechanism; confirmation of PDE4DIP was observed in affected sibpair families with large-vessel stroke subtype and in African Americans. Replication of protein-coding variants in candidate genes was observed for 2 previously reported GWAS associations: ZFHX3 (cardioembolic stroke) and ABCA1 (large-vessel stroke). Exome sequencing discovered 2 novel genes and mechanisms, PDE4DIP and ACOT4, associated with increased risk for ischemic stroke. In addition, ZFHX3 and ABCA1 were discovered to have protein-coding variants associated with ischemic stroke. These results suggest that genetic variation in novel pathways contributes to ischemic stroke risk and serves as a target for prediction, prevention, and therapy.
Cloning and sequence analysis of Hemonchus contortus HC58cDNA.
Muleke, Charles I; Ruofeng, Yan; Lixin, Xu; Xinwen, Bo; Xiangrui, Li
2007-06-01
The complete coding sequence of Hemonchus contortus HC58cDNA was generated by rapid amplification of cDNA ends and polymerase chain reaction using primers based on the 5' and 3' ends of the parasite mRNA, accession no. AF305964. The HC58cDNA gene was 851 bp long, with an open reading frame of 717 bp encoding a 239-amino-acid precursor of an approximately 27 kDa protein. Analysis of the amino acid sequence revealed conserved residues of cysteine, histidine, and asparagine, the occluding loop pattern, the hemoglobinase motif, and the glutamine of the oxyanion hole characteristic of cathepsin B-like proteases (CBL). Comparison of the predicted amino acid sequences showed that the protein shared 33.5-58.7% identity with cathepsin B homologues in the papain clan CA family (family C1). Phylogenetic analysis revealed close evolutionary proximity of the protein sequence to counterpart sequences in the CBL, suggesting that HC58cDNA was a member of the papain family.
A TDM link with channel coding and digital voice.
NASA Technical Reports Server (NTRS)
Jones, M. W.; Tu, K.; Harton, P. L.
1972-01-01
The features of a TDM (time-division multiplexed) link model are described. A PCM telemetry sequence was coded for error correction and multiplexed with a digitized voice channel. An all-digital implementation of a variable-slope delta modulation algorithm was used to digitize the voice channel. The results of extensive testing are reported. The measured coding gain and the system performance over a Gaussian channel are compared with theoretical predictions and computer simulations. Word intelligibility scores are reported as a measure of voice channel performance.
Aggarwal, Gautam; Worthey, E A; McDonagh, Paul D; Myler, Peter J
2003-06-07
Seattle Biomedical Research Institute (SBRI), as part of the Leishmania Genome Network (LGN), is sequencing chromosomes of the trypanosomatid protozoan species Leishmania major. At SBRI, chromosomal sequence is annotated using a combination of trained and untrained non-consensus gene-prediction algorithms with ARTEMIS, an annotation platform with rich and user-friendly interfaces. Here we describe a methodology used to import results from three different protein-coding gene-prediction algorithms (GLIMMER, TESTCODE and GENESCAN) into the ARTEMIS sequence viewer and annotation tool. Comparison of these methods, along with the CODONUSAGE algorithm built into ARTEMIS, shows the importance of combining methods to annotate the L. major genomic sequence more accurately. An improvised and powerful tool for gene prediction has been developed by importing data from widely used algorithms into an existing annotation platform. This approach is especially fruitful in the Leishmania genome project, where there is a large proportion of novel genes requiring manual annotation.
Yamada, Takuji; Waller, Alison S; Raes, Jeroen; Zelezniak, Aleksej; Perchat, Nadia; Perret, Alain; Salanoubat, Marcel; Patil, Kiran R; Weissenbach, Jean; Bork, Peer
2012-01-01
Despite the current wealth of sequencing data, one-third of all biochemically characterized metabolic enzymes lack a corresponding gene or protein sequence, and as such can be considered orphan enzymes. They represent a major gap between our molecular and biochemical knowledge, and consequently are not amenable to modern systemic analyses. As 555 of these orphan enzymes have metabolic pathway neighbours, we developed a global framework that utilizes the pathway and (meta)genomic neighbour information to assign candidate sequences to orphan enzymes. For 131 orphan enzymes (37% of those for which (meta)genomic neighbours are available), we associate sequences to them using scoring parameters with an estimated accuracy of 70%, implying functional annotation of 16 345 gene sequences in numerous (meta)genomes. As a case in point, two of these candidate sequences were experimentally validated to encode the predicted activity. In addition, we augmented the currently available genome-scale metabolic models with these new sequence–function associations and were able to expand the models by on average 8%, with a considerable change in the flux connectivity patterns and improved essentiality prediction. PMID:22569339
Chao, Tianle; Wang, Guizhi; Wang, Jianmin; Liu, Zhaohua; Ji, Zhibin; Hou, Lei; Zhang, Chunlan
2016-01-01
High-throughput mRNA sequencing enables the discovery of new transcripts and additional parts of incompletely annotated transcripts. Compared with the human and cow genomes, the reference annotation level of the sheep genome is still low. An investigation of new transcripts in sheep skeletal muscle will improve our understanding of muscle development. Therefore, applying high-throughput sequencing, two cDNA libraries from the biceps brachii of small-tailed Han sheep and Dorper sheep were constructed, and whole-transcriptome analysis was performed to determine the unknown transcript catalogue of this tissue. In this study, 40,129 transcripts were finally mapped to the sheep genome. Among them, 3,467 transcripts were determined to be unannotated in the current reference sheep genome and were defined as new transcripts. Based on protein-coding capacity prediction and comparative analysis of sequence similarity, 246 transcripts were classified as portions of unannotated genes or incompletely annotated genes. Another 1,520 transcripts were predicted with high confidence to be long non-coding RNAs. Our analysis also revealed 334 new transcripts that displayed specific expression in ruminants and uncovered a number of new transcripts without intergenus homology but with specific expression in sheep skeletal muscle. The results confirmed a complex transcript pattern of coding and non-coding RNA in sheep skeletal muscle. This study provided important information concerning the sheep genome and transcriptome annotation, which could provide a basis for further study.
SIBIS: a Bayesian model for inconsistent protein sequence estimation.
Khenoussi, Walyd; Vanhoutrève, Renaud; Poch, Olivier; Thompson, Julie D
2014-09-01
The prediction of protein coding genes is a major challenge that depends on the quality of genome sequencing, the accuracy of the model used to elucidate the exonic structure of the genes and the complexity of the gene splicing process leading to different protein variants. As a consequence, today's protein databases contain a huge amount of inconsistency, due to both natural variants and sequence prediction errors. We have developed a new method, called SIBIS, to detect such inconsistencies based on the evolutionary information in multiple sequence alignments. A Bayesian framework, combined with Dirichlet mixture models, is used to estimate the probability of observing specific amino acids and to detect inconsistent or erroneous sequence segments. We evaluated the performance of SIBIS on a reference set of protein sequences with experimentally validated errors and showed that the sensitivity is significantly higher than previous methods, with only a small loss of specificity. We also assessed a large set of human sequences from the UniProt database and found evidence of inconsistency in 48% of the previously uncharacterized sequences. We conclude that the integration of quality control methods like SIBIS in automatic analysis pipelines will be critical for the robust inference of structural, functional and phylogenetic information from these sequences. Source code, implemented in C on a Linux system, and the datasets of protein sequences are freely available for download at http://www.lbgi.fr/~julie/SIBIS.
Functional census of mutation sequence spaces: The example of p53 cancer rescue mutants
Danziger, Samuel A.; Swamidass, S. Joshua; Zeng, Jue; Dearth, Lawrence R.; Lu, Qiang; Chen, Jonathan H.; Cheng, Jainlin; Hoang, Vinh P.; Saigo, Hiroto; Luo, Ray; Baldi, Pierre; Brachmann, Rainer K.; Lathrop, Richard H.
2009-01-01
Many biomedical problems relate to mutant functional properties across a sequence space of interest, e.g., flu, cancer, and HIV. Detailed knowledge of mutant properties and function improves medical treatment and prevention. A functional census of p53 cancer rescue mutants would aid the search for cancer treatments from p53 rescue. We devised a general methodology for conducting a functional census of a mutation sequence space, and conducted a double-blind predictive test on the functional rescue property of 71 novel putative p53 cancer rescue mutants iteratively predicted in sets of 3. Double-blind predictive accuracy (15-point moving window) rose from 47% to 86% over the trial (r = 0.74). Code and data are available upon request. PMID:17048398
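The "15-point moving window" accuracy quoted above can be read as the fraction of correct predictions within each run of 15 consecutive trials. The sketch below assumes that reading; the function name and the 0/1 outcome encoding are illustrative choices, and the authors' exact windowing is not described in the abstract.

```python
from typing import List, Sequence


def moving_window_accuracy(outcomes: Sequence[int], window: int = 15) -> List[float]:
    """Return the accuracy over each run of `window` consecutive predictions.

    `outcomes` holds 1 for a correct prediction and 0 for an incorrect one,
    in the order the predictions were made.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(outcomes[i:i + window]) / window
        for i in range(len(outcomes) - window + 1)
    ]


if __name__ == "__main__":
    # Toy outcomes, not data from the study.
    print(moving_window_accuracy([1, 0, 1, 1], window=2))
```

A sequence shorter than the window yields an empty list, which avoids reporting accuracy from a partial window.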
Mahan, Kristina M; Klingeman, Dawn M; Hettich, Robert L; Parry, Ronald J; Graham, David E
2016-01-21
Streptomyces vitaminophilus produces pyrrolomycins, which are halogenated polyketide antibiotics. Some of the pyrrolomycins contain a rare nitro group located on the pyrrole ring. The 6.5-Mbp genome encodes 5,941 predicted protein-coding sequences in 39 contigs with a 71.9% G+C content.
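G+C content, as reported in genome announcements like the one above, is the share of G and C bases among the called bases of an assembly. The sketch below is a minimal illustration; the function name and the choice to ignore ambiguous bases such as N are assumptions, not the procedure of any particular annotation pipeline.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return (seq.count("G") + seq.count("C")) / called


if __name__ == "__main__":
    # Toy sequence; a real assembly would be read contig by contig from FASTA.
    print(f"{100 * gc_content('ATGCGGCCNN'):.1f}% G+C")
```

For a multi-contig draft genome, the counts would be summed over all contigs before dividing, so that long contigs carry proportionally more weight.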
Ashburner, M; Misra, S; Roote, J; Lewis, S E; Blazej, R; Davis, T; Doyle, C; Galle, R; George, R; Harris, N; Hartzell, G; Harvey, D; Hong, L; Houston, K; Hoskins, R; Johnson, G; Martin, C; Moshrefi, A; Palazzolo, M; Reese, M G; Spradling, A; Tsang, G; Wan, K; Whitelaw, K; Celniker, S
1999-01-01
A contiguous sequence of nearly 3 Mb from the genome of Drosophila melanogaster has been sequenced from a series of overlapping P1 and BAC clones. This region covers 69 chromosome polytene bands on chromosome arm 2L, including the genetically well-characterized "Adh region." A computational analysis of the sequence predicts 218 protein-coding genes, 11 tRNAs, and 17 transposable element sequences. At least 38 of the protein-coding genes are arranged in clusters of from 2 to 6 closely related genes, suggesting extensive tandem duplication. The gene density is one protein-coding gene every 13 kb; the transposable element density is one element every 171 kb. Of 73 genes in this region identified by genetic analysis, 49 have been located on the sequence; P-element insertions have been mapped to 43 genes. Ninety-five (44%) of the known and predicted genes match a Drosophila EST, and 144 (66%) have clear similarities to proteins in other organisms. Genes known to have mutant phenotypes are more likely to be represented in cDNA libraries, and far more likely to have products similar to proteins of other organisms, than are genes with no known mutant phenotype. Over 650 chromosome aberration breakpoints map to this chromosome region, and their nonrandom distribution on the genetic map reflects variation in gene spacing on the DNA. This is the first large-scale analysis of the genome of D. melanogaster at the sequence level. In addition to the direct results obtained, this analysis has allowed us to develop and test methods that will be needed to interpret the complete sequence of the genome of this species. "Before beginning a Hunt, it is wise to ask someone what you are looking for before you begin looking for it." (Milne 1926) PMID:10471707
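The densities and percentages quoted in this abstract can be cross-checked with simple arithmetic. The sketch below assumes that the 218 predicted protein-coding genes are the denominator for the EST and homology percentages, and it back-calculates the region length from the transposable element spacing; both are assumptions made here, not statements from the paper.

```python
# Figures quoted in the abstract.
GENES = 218
TRANSPOSABLE_ELEMENTS = 17
TE_SPACING_KB = 171
EST_MATCHES = 95
HOMOLOG_MATCHES = 144

# Region length implied by "one element every 171 kb" (an inferred value).
implied_length_kb = TRANSPOSABLE_ELEMENTS * TE_SPACING_KB

# Average spacing between protein-coding genes over that length.
gene_spacing_kb = implied_length_kb / GENES

# Percentages, assuming the 218 predicted genes are the denominator.
est_pct = round(100 * EST_MATCHES / GENES)
homolog_pct = round(100 * HOMOLOG_MATCHES / GENES)

if __name__ == "__main__":
    print(f"implied length: {implied_length_kb} kb")
    print(f"gene spacing: {gene_spacing_kb:.1f} kb")
    print(f"EST matches: {est_pct}%, homolog matches: {homolog_pct}%")
```

Under these assumptions the numbers agree with one another: the implied length is consistent with "nearly 3 Mb", and the gene spacing rounds to the stated 13 kb.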
Bacillus anthracis genome organization in light of whole transcriptome sequencing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Martin, Jeffrey; Zhu, Wenhan; Passalacqua, Karla D.
2010-03-22
Emerging knowledge of whole prokaryotic transcriptomes could validate a number of theoretical concepts introduced in the early days of genomics. What are the rules connecting gene expression levels with sequence determinants such as quantitative scores of promoters and terminators? Are translation efficiency measures, e.g. codon adaptation index and RBS score, related to gene expression? We used the whole transcriptome shotgun sequencing of a bacterial pathogen Bacillus anthracis to assess correlation of gene expression level with promoter, terminator and RBS scores, codon adaptation index, as well as with a new measure of gene translational efficiency, average translation speed. We compared computational predictions of operon topologies with the transcript borders inferred from RNA-Seq reads. Transcriptome mapping may also improve existing gene annotation. Upon assessment of accuracy of current annotation of protein-coding genes in the B. anthracis genome we have shown that the transcriptome data indicate existence of more than a hundred genes missing in the annotation though predicted by an ab initio gene finder. Interestingly, we observed that many pseudogenes possess not only a sequence with detectable coding potential but also promoters that maintain transcriptional activity.
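The codon adaptation index mentioned above is conventionally defined as the geometric mean of per-codon relative adaptiveness weights across a coding sequence. The sketch below illustrates that definition only; the weight table is a made-up toy, the function name is hypothetical, and the abstract does not say how the authors derived their weights.

```python
import math
from typing import Dict


def codon_adaptation_index(cds: str, weights: Dict[str, float]) -> float:
    """Geometric mean of the relative adaptiveness weights of a CDS's codons.

    Codons missing from `weights` are skipped; in practice this is often
    done for codons of single-codon amino acids and for stop codons.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no codon of the sequence has a weight")
    return math.exp(sum(logs) / len(logs))


if __name__ == "__main__":
    # Toy weights for the two lysine codons (hypothetical values).
    toy_weights = {"AAA": 1.0, "AAG": 0.25}
    print(codon_adaptation_index("AAAAAG", toy_weights))
```

In a real analysis the weights would be estimated from a reference set of highly expressed genes of the organism under study.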
SIFTER search: a web server for accurate phylogeny-based protein function prediction
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sahraeian, Sayed M.; Luo, Kevin R.; Brenner, Steven E.
2015-05-15
We are awash in proteins discovered through high-throughput sequencing projects. As only a minuscule fraction of these have been experimentally characterized, computational methods are widely used for automated annotation. Here, we introduce a user-friendly web interface for accurate protein function prediction using the SIFTER algorithm. SIFTER is a state-of-the-art sequence-based gene molecular function prediction algorithm that uses a statistical model of function evolution to incorporate annotations throughout the phylogenetic tree. Due to the resources needed by the SIFTER algorithm, running SIFTER locally is not trivial for most users, especially for large-scale problems. The SIFTER web server thus provides access to precomputed predictions on 16 863 537 proteins from 232 403 species. Users can explore SIFTER predictions with queries for proteins, species, functions, and homologs of sequences not in the precomputed prediction set. Lastly, the SIFTER web server is accessible at http://sifter.berkeley.edu/ and the source code can be downloaded.
Signal Prediction With Input Identification
NASA Technical Reports Server (NTRS)
Juang, Jer-Nan; Chen, Ya-Chin
1999-01-01
A novel coding technique is presented for signal prediction with applications including speech coding, system identification, and estimation of input excitation. The approach is based on the blind equalization method for speech signal processing in conjunction with the geometric subspace projection theory to formulate the basic prediction equation. The speech-coding problem is often divided into two parts, a linear prediction model and excitation input. The parameter coefficients of the linear predictor and the input excitation are solved simultaneously and recursively by a conventional recursive least-squares algorithm. The excitation input is computed by coding all possible outcomes into a binary codebook. The coefficients of the linear predictor and excitation, and the index of the codebook can then be used to represent the signal. In addition, a variable-frame concept is proposed to block the same excitation signal in sequence in order to reduce the storage size and increase the transmission rate. The results of this work can be easily extended to the problem of disturbance identification. The basic principles are outlined in this report and differences from other existing methods are discussed. Simulations are included to demonstrate the proposed method.
DNA sequence-dependent mechanics and protein-assisted bending in repressor-mediated loop formation
Boedicker, James Q.; Garcia, Hernan G.; Johnson, Stephanie; Phillips, Rob
2014-01-01
As the chief informational molecule of life, DNA is subject to extensive physical manipulations. The energy required to deform double-helical DNA depends on sequence, and this mechanical code of DNA influences gene regulation, such as through nucleosome positioning. Here we examine the sequence-dependent flexibility of DNA in bacterial transcription factor-mediated looping, a context for which the role of sequence remains poorly understood. Using a suite of synthetic constructs repressed by the Lac repressor and two well-known sequences that show large flexibility differences in vitro, we make precise statistical mechanical predictions as to how DNA sequence influences loop formation and test these predictions using in vivo transcription and in vitro single-molecule assays. Surprisingly, sequence-dependent flexibility does not affect in vivo gene regulation. By theoretically and experimentally quantifying the relative contributions of sequence and the DNA-bending protein HU to DNA mechanical properties, we reveal that bending by HU dominates DNA mechanics and masks intrinsic sequence-dependent flexibility. Such a quantitative understanding of how mechanical regulatory information is encoded in the genome will be a key step towards a predictive understanding of gene regulation at single-base pair resolution. PMID:24231252
Computational Identification and Functional Predictions of Long Noncoding RNA in Zea mays
Boerner, Susan; McGinnis, Karen M.
2012-01-01
Background: Computational analysis of cDNA sequences from multiple organisms suggests that a large portion of transcribed DNA does not code for a functional protein. In mammals, noncoding transcription is abundant, and often results in functional RNA molecules that do not appear to encode proteins. Many long noncoding RNAs (lncRNAs) appear to have epigenetic regulatory function in humans, including HOTAIR and XIST. While epigenetic gene regulation is clearly an essential mechanism in plants, relatively little is known about the presence or function of lncRNAs in plants. Methodology/Principal Findings: To explore the connection between lncRNA and epigenetic regulation of gene expression in plants, a computational pipeline using the programming language Python has been developed and applied to maize full length cDNA sequences to identify, classify, and localize potential lncRNAs. The pipeline was used in parallel with an SVM tool for identifying ncRNAs to identify the maximal number of ncRNAs in the dataset. Although the available library of sequences was small and potentially biased toward protein coding transcripts, 15% of the sequences were predicted to be noncoding. Approximately 60% of these sequences appear to act as precursors for small RNA molecules and may function to regulate gene expression via a small RNA dependent mechanism. ncRNAs were predicted to originate from both genic and intergenic loci. Of the lncRNAs that originated from genic loci, ∼20% were antisense to the host gene loci. Conclusions/Significance: Consistent with similar studies in other organisms, noncoding transcription appears to be widespread in the maize genome. Computational predictions indicate that maize lncRNAs may function to regulate expression of other genes through multiple RNA mediated mechanisms. PMID:22916204
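The abstract states that the pipeline was written in Python but gives no details of its steps. One filter commonly used when screening cDNAs for noncoding candidates is the length of the longest open reading frame. The sketch below illustrates that idea only; the function names, the forward-strand-only scan, and the 300 nt cutoff are assumptions for this example and are not taken from the paper.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_length(seq: str) -> int:
    """Length in nt of the longest ATG-to-stop ORF in the three forward frames.

    The stop codon is counted in the length. ORFs without a stop are ignored.
    """
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None  # position of the first ATG of the currently open ORF
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def looks_noncoding(seq: str, max_orf_nt: int = 300) -> bool:
    """Flag a transcript whose longest ORF is below a hypothetical cutoff."""
    return longest_orf_length(seq) < max_orf_nt


if __name__ == "__main__":
    print(longest_orf_length("CCATGAAATAGCC"), looks_noncoding("CCATGAAATAGCC"))
```

A fuller screen would also scan the reverse complement and combine the ORF test with homology searches, which is broadly the role the abstract assigns to the SVM tool run in parallel.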
Wan, Xuehua; Darris, Maxwell; Hou, Shaobin; Donachie, Stuart P
2017-10-19
Most of the 24 known Chitinophaga species were originally isolated from soils. We report the draft genome sequence of a putatively novel Chitinophaga sp. from a biofilm in an air conditioner condensate pipe. The genome comprises 7,661,303 bp in one scaffold, 5,694 predicted protein-coding sequences, and a G+C content of 47.6%.
Riveros-Mckay, Fernando; Campos, Itzia; Giles-Gómez, Martha; Bolívar, Francisco; Escalante, Adelfo
2014-11-06
Leuconostoc mesenteroides P45 was isolated from the traditional Mexican pulque beverage. We report its draft genome sequence, assembled in 6 contigs consisting of 1,874,188 bp and no plasmids. Genome annotation predicted a total of 1,800 genes, 1,687 coding sequences, 52 pseudogenes, 9 rRNAs, 51 tRNAs, 1 noncoding RNA, and 44 frameshifted genes.
Lata, Pushpa; Govindarajan, Subramaniam S; Qi, Feng; Li, Jian-Liang; Sahoo, Malaya K
2017-02-02
Paenibacillus sp. strain KS1 was isolated from an epiphyte, Tillandsia usneoides (Spanish moss), in central Florida, USA. Here, we report a draft genome sequence of this strain, which consists of a total of 398 contigs spanning 6,508,195 bp, with a G+C content of 46.5% and comprising 5,401 predicted coding sequences.
Navas, Esteban; Bohle, Harry; Henríquez, Patricio; Grothusen, Horst; Bustamante, Fernando; Bustos, Patricio; Mancilla, Marcos
2014-08-28
We sequenced the genome of a motile O1b Yersinia ruckeri field isolate from Chile, which is causing enteric redmouth disease (ERM) in vaccinated Atlantic salmon (Salmo salar). The draft genome has 3,775,486 bp, a G+C content of 47.1%, and is predicted to contain 3,406 coding sequences.
Identification of two allelic IgG1 C(H) coding regions (Cgamma1) of cat.
Kanai, T H; Ueda, S; Nakamura, T
2000-01-31
Two types of cDNA encoding IgG1 heavy chain (gamma1) were isolated from a single domestic short-hair cat. Sequence analysis indicated a higher level of similarity of these Cgamma1 sequences to the human Cgamma1 sequence (76.9 and 77.0%) than to the mouse sequence (70.0 and 69.7%) at the nucleotide level. Predicted primary structures of both the feline Cgamma1 genes, designated as Cgamma1a and Cgamma1b, were similar to that of the human Cgamma1 gene, for instance, as to the size of constant domains, the presence of six conserved cysteine residues involved in formation of the domain structure, and the location of a conserved N-linked glycosylation site. Sequence comparison between the two alleles showed that 7 out of 10 nucleotide differences were within the C(H)3 domain coding region, all leading to nonsynonymous changes in amino acid residues. Partial sequence analysis of genomic clones showed three nucleotide substitutions between the two Cgamma1 alleles in the intron between the C(H)2 and C(H)3 domain coding regions. In 12 domestic short-hair cats used in this study, the frequency of the Cgamma1a allele (62.5%) was higher than that of the Cgamma1b allele (37.5%).
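The allele frequencies reported for 12 cats correspond to 24 allele copies if each cat is assumed to carry two copies of the gene, so 62.5% and 37.5% are 15 and 9 copies respectively. The sketch below shows the counting; the genotype breakdown used in the example is hypothetical, since the abstract reports only the resulting frequencies.

```python
from typing import Dict, Iterable, Tuple


def allele_frequencies(genotypes: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Count allele copies across diploid genotypes and return their frequencies."""
    counts: Dict[str, int] = {}
    for genotype in genotypes:
        for allele in genotype:
            counts[allele] = counts.get(allele, 0) + 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no genotypes given")
    return {allele: n / total for allele, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical split of 12 cats that yields 15 'a' and 9 'b' copies.
    cats = [("a", "a")] * 5 + [("a", "b")] * 5 + [("b", "b")] * 2
    print(allele_frequencies(cats))
```

Other genotype splits give the same frequencies, which is why the frequencies alone do not reveal how many cats were heterozygous.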
Dessimoz, Christophe; Zoller, Stefan; Manousaki, Tereza; Qiu, Huan; Meyer, Axel; Kuraku, Shigehiro
2011-09-01
Recent development of deep sequencing technologies has facilitated de novo genome sequencing projects, now conducted even by individual laboratories. However, this will yield more and more genome sequences that are not well assembled, and will hinder thorough annotation when no closely related reference genome is available. One of the challenging issues is the identification of protein-coding sequences split into multiple unassembled genomic segments, which can confound orthology assignment and various laboratory experiments requiring the identification of individual genes. In this study, using the genome of a cartilaginous fish, Callorhinchus milii, as test case, we performed gene prediction using a model specifically trained for this genome. We implemented an algorithm, designated ESPRIT, to identify possible linkages between multiple protein-coding portions derived from a single genomic locus split into multiple unassembled genomic segments. We developed a validation framework based on an artificially fragmented human genome, improvements between early and recent mouse genome assemblies, comparison with experimentally validated sequences from GenBank, and phylogenetic analyses. Our strategy provided insights into practical solutions for efficient annotation of only partially sequenced (low-coverage) genomes. To our knowledge, our study is the first formulation of a method to link unassembled genomic segments based on proteomes of relatively distantly related species as references. PMID:21712341
Poly(A) code analyses reveal key determinants for tissue-specific mRNA alternative polyadenylation
Weng, Lingjie; Li, Yi; Xie, Xiaohui; Shi, Yongsheng
2016-01-01
mRNA alternative polyadenylation (APA) is a critical mechanism for post-transcriptional gene regulation and is often regulated in a tissue- and/or developmental stage-specific manner. An ultimate goal for the APA field has been to be able to computationally predict APA profiles under different physiological or pathological conditions. As a first step toward this goal, we have assembled a poly(A) code for predicting tissue-specific poly(A) sites (PASs). Based on a compendium of over 600 features that have known or potential roles in PAS selection, we have generated and refined a machine-learning algorithm using multiple high-throughput sequencing-based data sets of tissue-specific and constitutive PASs. This code can predict tissue-specific PASs with >85% accuracy. Importantly, by analyzing the prediction performance based on different RNA features, we found that PAS context, including the distance between alternative PASs and the relative position of a PAS within the gene, is a key feature for determining the susceptibility of a PAS to tissue-specific regulation. Our poly(A) code provides a useful tool for not only predicting tissue-specific APA regulation, but also for studying its underlying molecular mechanisms. PMID:27095026
Bain, Peter A; Papanicolaou, Alexie; Kumar, Anupama
2015-01-01
Murray-Darling rainbowfish (Melanotaenia fluviatilis [Castelnau, 1878]; Atheriniformes: Melanotaeniidae) is a small-bodied teleost currently under development in Australasia as a test species for aquatic toxicological studies. To date, efforts towards the development of molecular biomarkers of contaminant exposure have been hindered by the lack of available sequence data. To address this, we sequenced messenger RNA from brain, liver and gonads of mature male and female fish and generated a high-quality draft transcriptome using a de novo assembly approach. 149,742 clusters of putative transcripts were obtained, encompassing 43,841 non-redundant protein-coding regions. Deduced amino acid sequences were annotated by functional inference based on similarity with sequences from manually curated protein sequence databases. The draft assembly contained protein-coding regions homologous to 95.7% of the complete cohort of predicted proteins from the taxonomically related species, Oryzias latipes (Japanese medaka). The mean length of rainbowfish protein-coding sequences relative to their medaka homologues was 92.1%, indicating that despite the limited number of tissues sampled a large proportion of the total expected number of protein-coding genes was captured in the study. Because of our interest in the effects of environmental contaminants on endocrine pathways, we manually curated subsets of coding regions for putative nuclear receptors and steroidogenic enzymes in the rainbowfish transcriptome, revealing 61 candidate nuclear receptors encompassing all known subfamilies, and 41 putative steroidogenic enzymes representing all major steroidogenic enzymes occurring in teleosts. The transcriptome presented here will be a valuable resource for researchers interested in biomarker development, protein structure and function, and contaminant-response genomics in Murray-Darling rainbowfish.
The standard operating procedure of the DOE-JGI Microbial Genome Annotation Pipeline (MGAP v.4)
DOE Office of Scientific and Technical Information (OSTI.GOV)
Huntemann, Marcel; Ivanova, Natalia N.; Mavromatis, Konstantinos; ...
2015-10-26
The DOE-JGI Microbial Genome Annotation Pipeline performs structural and functional annotation of microbial genomes that are further included into the Integrated Microbial Genome comparative analysis system. MGAP is applied to assembled nucleotide sequence datasets that are provided via the IMG submission site. Dataset submission for annotation first requires project and associated metadata description in GOLD. The MGAP sequence data processing consists of feature prediction including identification of protein-coding genes, non-coding RNAs and regulatory RNA features, as well as CRISPR elements. In conclusion, structural annotation is followed by assignment of protein product names and functions.
Analysis of protein-coding genetic variation in 60,706 humans.
Lek, Monkol; Karczewski, Konrad J; Minikel, Eric V; Samocha, Kaitlin E; Banks, Eric; Fennell, Timothy; O'Donnell-Luria, Anne H; Ware, James S; Hill, Andrew J; Cummings, Beryl B; Tukiainen, Taru; Birnbaum, Daniel P; Kosmicki, Jack A; Duncan, Laramie E; Estrada, Karol; Zhao, Fengmei; Zou, James; Pierce-Hoffman, Emma; Berghout, Joanne; Cooper, David N; Deflaux, Nicole; DePristo, Mark; Do, Ron; Flannick, Jason; Fromer, Menachem; Gauthier, Laura; Goldstein, Jackie; Gupta, Namrata; Howrigan, Daniel; Kiezun, Adam; Kurki, Mitja I; Moonshine, Ami Levy; Natarajan, Pradeep; Orozco, Lorena; Peloso, Gina M; Poplin, Ryan; Rivas, Manuel A; Ruano-Rubio, Valentin; Rose, Samuel A; Ruderfer, Douglas M; Shakir, Khalid; Stenson, Peter D; Stevens, Christine; Thomas, Brett P; Tiao, Grace; Tusie-Luna, Maria T; Weisburd, Ben; Won, Hong-Hee; Yu, Dongmei; Altshuler, David M; Ardissino, Diego; Boehnke, Michael; Danesh, John; Donnelly, Stacey; Elosua, Roberto; Florez, Jose C; Gabriel, Stacey B; Getz, Gad; Glatt, Stephen J; Hultman, Christina M; Kathiresan, Sekar; Laakso, Markku; McCarroll, Steven; McCarthy, Mark I; McGovern, Dermot; McPherson, Ruth; Neale, Benjamin M; Palotie, Aarno; Purcell, Shaun M; Saleheen, Danish; Scharf, Jeremiah M; Sklar, Pamela; Sullivan, Patrick F; Tuomilehto, Jaakko; Tsuang, Ming T; Watkins, Hugh C; Wilson, James G; Daly, Mark J; MacArthur, Daniel G
2016-08-18
Large-scale reference data sets of human genetic variation are critical for the medical and functional interpretation of DNA sequence changes. Here we describe the aggregation and analysis of high-quality exome (protein-coding region) DNA sequence data for 60,706 individuals of diverse ancestries generated as part of the Exome Aggregation Consortium (ExAC). This catalogue of human genetic diversity contains an average of one variant every eight bases of the exome, and provides direct evidence for the presence of widespread mutational recurrence. We have used this catalogue to calculate objective metrics of pathogenicity for sequence variants, and to identify genes subject to strong selection against various classes of mutation; identifying 3,230 genes with near-complete depletion of predicted protein-truncating variants, with 72% of these genes having no currently established human disease phenotype. Finally, we demonstrate that these data can be used for the efficient filtering of candidate disease-causing variants, and for the discovery of human 'knockout' variants in protein-coding genes.
Operon-mapper: A Web Server for Precise Operon Identification in Bacterial and Archaeal Genomes.
Taboada, Blanca; Estrada, Karel; Ciria, Ricardo; Merino, Enrique
2018-06-19
Operon-mapper is a web server that accurately, easily, and directly predicts the operons of any bacterial or archaeal genome sequence. The operon predictions are based on the intergenic distance of neighboring genes as well as the functional relationships of their protein-coding products. To this end, Operon-mapper finds all the ORFs within a given nucleotide sequence, along with their genomic coordinates, orthology groups, and functional relationships. We believe that Operon-mapper, due to its accuracy, simplicity and speed, as well as the relevant information that it generates, will be a useful tool for annotating and characterizing genomic sequences. http://biocomputo.ibt.unam.mx/operon_mapper/.
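The intergenic-distance criterion mentioned in the abstract can be illustrated with a minimal sketch. This is not Operon-mapper's actual procedure, which also weighs functional relationships between gene products; the `Gene` record, the function name, and the 50 bp `max_gap` threshold are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    name: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def group_by_intergenic_distance(genes, max_gap=50):
    """Group neighboring same-strand genes separated by at most max_gap bp.

    max_gap is an illustrative threshold, not a value taken from the paper.
    """
    ordered = sorted(genes, key=lambda g: g.start)
    groups = []
    for gene in ordered:
        if groups:
            prev = groups[-1][-1]
            gap = gene.start - prev.end - 1  # negative when genes overlap
            if gene.strand == prev.strand and gap <= max_gap:
                groups[-1].append(gene)
                continue
        groups.append([gene])
    return groups


genes = [
    Gene("a", 100, 400, "+"),
    Gene("b", 420, 900, "+"),
    Gene("c", 1500, 2000, "+"),
    Gene("d", 2010, 2500, "-"),
]
operons = [[g.name for g in grp] for grp in group_by_intergenic_distance(genes)]
```

In this toy input, genes a and b fall into one candidate operon, c stands alone because of the large gap, and d is separated from c by the strand change.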
The standard operating procedure of the DOE-JGI Metagenome Annotation Pipeline (MAP v.4)
DOE Office of Scientific and Technical Information (OSTI.GOV)
Huntemann, Marcel; Ivanova, Natalia N.; Mavromatis, Konstantinos; ...
2016-02-24
The DOE-JGI Metagenome Annotation Pipeline (MAP v.4) performs structural and functional annotation for metagenomic sequences that are submitted to the Integrated Microbial Genomes with Microbiomes (IMG/M) system for comparative analysis. The pipeline runs on nucleotide sequences provided via the IMG submission site. Users must first define their analysis projects in GOLD and then submit the associated sequence datasets consisting of scaffolds/contigs with optional coverage information and/or unassembled reads in fasta and fastq file formats. The MAP processing consists of feature prediction including identification of protein-coding genes, non-coding RNAs and regulatory RNAs, as well as CRISPR elements. Structural annotation is followed by functional annotation including assignment of protein product names and connection to various protein family databases.
PACCMIT/PACCMIT-CDS: identifying microRNA targets in 3′ UTRs and coding sequences
Šulc, Miroslav; Marín, Ray M.; Robins, Harlan S.; Vaníček, Jiří
2015-01-01
The purpose of the proposed web server, publicly available at http://paccmit.epfl.ch, is to provide a user-friendly interface to two algorithms for predicting messenger RNA (mRNA) molecules regulated by microRNAs: (i) PACCMIT (Prediction of ACcessible and/or Conserved MIcroRNA Targets), which identifies primarily mRNA transcripts targeted in their 3′ untranslated regions (3′ UTRs), and (ii) PACCMIT-CDS, designed to find mRNAs targeted within their coding sequences (CDSs). While PACCMIT belongs among the accurate algorithms for predicting conserved microRNA targets in the 3′ UTRs, the main contribution of the web server is 2-fold: PACCMIT provides an accurate tool for predicting targets also of weakly conserved or non-conserved microRNAs, whereas PACCMIT-CDS addresses the lack of similar portals adapted specifically for targets in CDS. The web server asks the user for microRNAs and mRNAs to be analyzed, accesses the precomputed P-values for all microRNA–mRNA pairs from a database for all mRNAs and microRNAs in a given species, ranks the predicted microRNA–mRNA pairs, evaluates their significance according to the false discovery rate and finally displays the predictions in a tabular form. The results are also available for download in several standard formats. PMID:25948580
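The ranking and false-discovery-rate step described above can be sketched as follows. The abstract does not say which FDR procedure the server applies, so the Benjamini-Hochberg adjustment used here is an assumption, and the function name and sample P-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Assumed procedure: the abstract only states that significance is
    evaluated according to the false discovery rate.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical P-values for four microRNA-mRNA pairs.
p = [0.01, 0.04, 0.03, 0.20]
q = benjamini_hochberg(p)
ranked = sorted(range(len(p)), key=lambda i: q[i])  # most significant first
```

Pairs would then be reported in `ranked` order, keeping only those whose adjusted value falls below a chosen FDR cutoff.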
DOE Office of Scientific and Technical Information (OSTI.GOV)
Leung, Elo; Huang, Amy; Cadag, Eithon; ...
2016-01-20
In this study, we introduce the Protein Sequence Annotation Tool (PSAT), a web-based, sequence annotation meta-server for performing integrated, high-throughput, genome-wide sequence analyses. Our goals in building PSAT were to (1) create an extensible platform for integration of multiple sequence-based bioinformatics tools, (2) enable functional annotations and enzyme predictions over large input protein fasta data sets, and (3) provide a web interface for convenient execution of the tools. In this paper, we demonstrate the utility of PSAT by annotating the predicted peptide gene products of Herbaspirillum sp. strain RV1423, importing the results of PSAT into EC2KEGG, and using the resulting functional comparisons to identify a putative catabolic pathway, thereby distinguishing RV1423 from a well annotated Herbaspirillum species. This analysis demonstrates that high-throughput enzyme predictions, provided by PSAT processing, can be used to identify metabolic potential in an otherwise poorly annotated genome. Lastly, PSAT is a meta server that combines the results from several sequence-based annotation and function prediction codes, and is available at http://psat.llnl.gov/psat/. PSAT stands apart from other sequence-based genome annotation systems in providing a high-throughput platform for rapid de novo enzyme predictions and sequence annotations over large input protein sequence data sets in FASTA. PSAT is most appropriately applied in annotation of large protein FASTA sets that may or may not be associated with a single genome.
Ray, Jayashree; Waters, R. Jordan; Skerker, Jeffrey M.; ...
2015-05-14
Cupriavidus basilensis 4G11 was isolated from groundwater at the Oak Ridge Field Research Center (FRC) site. Here, we report the complete genome sequence and annotation of Cupriavidus basilensis 4G11. The genome contains 8,421,483 bp, 7,661 predicted protein-coding genes, and a total GC content of 64.4%.
APADB: a database for alternative polyadenylation and microRNA regulation events
Müller, Sören; Rycak, Lukas; Afonso-Grunz, Fabian; Winter, Peter; Zawada, Adam M.; Damrath, Ewa; Scheider, Jessica; Schmäh, Juliane; Koch, Ina; Kahl, Günter; Rotter, Björn
2014-01-01
Alternative polyadenylation (APA) is a widespread mechanism that contributes to the sophisticated dynamics of gene regulation. Approximately 50% of all protein-coding human genes harbor multiple polyadenylation (PA) sites; their selective and combinatorial use gives rise to transcript variants with differing length of their 3′ untranslated region (3′UTR). Shortened variants escape UTR-mediated regulation by microRNAs (miRNAs), especially in cancer, where global 3′UTR shortening accelerates disease progression, dedifferentiation and proliferation. Here we present APADB, a database of vertebrate PA sites determined by 3′ end sequencing, using massive analysis of complementary DNA ends. APADB provides (A)PA sites for coding and non-coding transcripts of human, mouse and chicken genes. For human and mouse, several tissue types, including different cancer specimens, are available. APADB records the loss of predicted miRNA binding sites and visualizes next-generation sequencing reads that support each PA site in a genome browser. The database tables can either be browsed according to organism and tissue or alternatively searched for a gene of interest. APADB is the largest database of APA in human, chicken and mouse. The stored information provides experimental evidence for thousands of PA sites and APA events. APADB combines 3′ end sequencing data with prediction algorithms of miRNA binding sites, allowing to further improve prediction algorithms. Current databases lack correct information about 3′UTR lengths, especially for chicken, and APADB provides necessary information to close this gap. Database URL: http://tools.genxpro.net/apadb/ PMID:25052703
Becker, Michael P I; Nitsch, Alexander M; Hewig, Johannes; Miltner, Wolfgang H R; Straube, Thomas
2016-12-01
Several regions of the frontal cortex interact with striatal and amygdala regions to mediate the evaluation of reward-related information and subsequent adjustment of response choices. Recent theories discuss the particular relevance of dorsal anterior cingulate cortex (dACC) for switching behavior; consecutively, ventromedial prefrontal cortex (VMPFC) is involved in mediating exploitative behaviors by tracking reward values unfolding after the behavioral switch. Amygdala, on the other hand, has been implied in coding the valence of stimulus-outcome associations and the ventral striatum (VS) has consistently been shown to code a reward prediction error (RPE). Here, we used fMRI data acquired in humans during a reversal task to parametrically model different sequences of positive feedback in order to unravel differential contributions of these brain regions to the tracking and exploitation of rewards. Parameters from an Optimal Bayesian Learner accurately predicted the divergent involvement of dACC and VMPFC during feedback processing: dACC signaled the first, but not later, presentations of positive feedback, while VMPFC coded trial-by-trial accumulations in reward value. Our results confirm that dACC carries a prominent confirmatory signal during processing of first positive feedback. Amygdala coded positive feedbacks more uniformly, while striatal regions were associated with RPE. Copyright © 2016 Elsevier Inc. All rights reserved.
Intrinsic and extrinsic approaches for detecting genes in a bacterial genome.
Borodovsky, M; Rudd, K E; Koonin, E V
1994-01-01
The unannotated regions of the Escherichia coli genome DNA sequence from the EcoSeq6 database, totaling 1,278 'intergenic' sequences of the combined length of 359,279 basepairs, were analyzed using computer-assisted methods with the aim of identifying putative unknown genes. The proposed strategy for finding new genes includes two key elements: i) prediction of expressed open reading frames (ORFs) using the GeneMark method based on Markov chain models for coding and non-coding regions of Escherichia coli DNA, and ii) search for protein sequence similarities using programs based on the BLAST algorithm and programs for motif identification. A total of 354 putative expressed ORFs were predicted by GeneMark. Using the BLASTX and TBLASTN programs, it was shown that 208 ORFs located in the unannotated regions of the E. coli chromosome are significantly similar to other protein sequences. Identification of 182 ORFs as probable genes was supported by GeneMark and BLAST, comprising 51.4% of the GeneMark 'hits' and 87.5% of the BLAST 'hits'. 73 putative new genes, comprising 20.6% of the GeneMark predictions, belong to ancient conserved protein families that include both eubacterial and eukaryotic members. This value is close to the overall proportion of highly conserved sequences among eubacterial proteins, indicating that the majority of the putative expressed ORFs that are predicted by GeneMark, but have no significant BLAST hits, nevertheless are likely to be real genes. The majority of the putative genes identified by BLAST search have been described since the release of the EcoSeq6 database, but about 70 genes have not been detected so far. Among these new identifications are genes encoding proteins with a variety of predicted functions including dehydrogenases, kinases, several other metabolic enzymes, ATPases, rRNA methyltransferases, membrane proteins, and different types of regulatory proteins. PMID:7984428
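As a rough illustration of the raw material such a strategy starts from, the sketch below enumerates open reading frames on both strands of a sequence. It covers only ORF enumeration; GeneMark's Markov chain scoring of coding potential is not reproduced here, and the function names, the ATG-only start rule, and the `min_codons` cutoff are assumptions made for the example.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Return (strand, frame, start, end) for every ATG...stop ORF.

    Coordinates are 0-based and end-exclusive on the scanned strand, and
    the stop codon is included in the reported span.
    """
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        results.append((strand, frame, start, i + 3))
                    start = None
    return results


orfs = find_orfs("CCATGAAATTTTAGGG")
```

A real pipeline would then score each candidate with coding and non-coding sequence models and check it against protein databases, as the abstract describes.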
Some practical universal noiseless coding techniques, part 3, module PSl14,K+
NASA Technical Reports Server (NTRS)
Rice, Robert F.
1991-01-01
The algorithmic definitions, performance characterizations, and application notes for a high-performance adaptive noiseless coding module are provided. Subsets of these algorithms are currently under development in custom very large scale integration (VLSI) at three NASA centers. The generality of coding algorithms recently reported is extended. The module incorporates a powerful adaptive noiseless coder for Standard Data Sources (i.e., sources whose symbols can be represented by uncorrelated non-negative integers, where smaller integers are more likely than the larger ones). Coders can be specified to provide performance close to the data entropy over any desired dynamic range (of entropy) above 0.75 bit/sample. This is accomplished by adaptively choosing the best of many efficient variable-length coding options to use on each short block of data (e.g., 16 samples). All code options used for entropies above 1.5 bits/sample are 'Huffman Equivalent', but they require no table lookups to implement. The coding can be performed directly on data that have been preprocessed to exhibit the characteristics of a standard source. Alternatively, a built-in predictive preprocessor can be used where applicable. This built-in preprocessor includes the familiar 1-D predictor followed by a function that maps the prediction error sequences into the desired standard form. Additionally, an external prediction can be substituted if desired. A broad range of issues dealing with the interface between the coding module and the data systems it might serve are further addressed. These issues include: multidimensional prediction, archival access, sensor noise, rate control, code rate improvements outside the module, and the optimality of certain internal code options.
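The pipeline outlined in the abstract (a 1-D predictor, a mapping of signed prediction errors onto non-negative integers, and a per-block choice among variable-length code options) can be sketched with Golomb-Rice codes. This is a simplified illustration, not the module's specification: the previous-sample predictor, the interleaving map, the bit convention for the unary part, and the option range 0 to 7 are all assumptions.

```python
def predict_1d_errors(samples):
    """Previous-sample predictor; the first sample is predicted as 0."""
    prev, errors = 0, []
    for x in samples:
        errors.append(x - prev)
        prev = x
    return errors


def map_to_nonnegative(e):
    """Interleave signed errors: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return 2 * e if e >= 0 else -2 * e - 1


def rice_encode(n, k):
    """Unary quotient (ones ended by a zero) followed by k low-order bits."""
    low = format(n & ((1 << k) - 1), f"0{k}b") if k else ""
    return "1" * (n >> k) + "0" + low


def best_k(block, k_options=range(8)):
    """Pick the option that gives the shortest encoding for this block."""
    def cost(k):
        return sum((n >> k) + 1 + k for n in block)
    return min(k_options, key=cost)


def encode_block(block):
    k = best_k(block)
    return k, "".join(rice_encode(n, k) for n in block)


samples = [10, 12, 11, 11, 14, 13, 13, 12]
mapped = [map_to_nonnegative(e) for e in predict_1d_errors(samples)]
k, bits = encode_block(mapped)
```

A decoder would need the chosen `k` for each block, so a practical format would prepend a short option identifier to every block.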
Bayesian decision support for coding occupational injury data.
Nanda, Gaurav; Grattan, Kathleen M; Chu, MyDzung T; Davis, Letitia K; Lehto, Mark R
2016-06-01
Studies on autocoding injury data have found that machine learning algorithms perform well for categories that occur frequently but often struggle with rare categories. Therefore, manual coding, although resource-intensive, cannot be eliminated. We propose a Bayesian decision support system to autocode a large portion of the data, filter cases for manual review, and assist human coders by presenting them top k prediction choices and a confusion matrix of predictions from Bayesian models. We studied the prediction performance of Single-Word (SW) and Two-Word-Sequence (TW) Naïve Bayes models on a sample of data from the 2011 Survey of Occupational Injury and Illness (SOII). We used the agreement in prediction results of SW and TW models, and various prediction strength thresholds for autocoding and filtering cases for manual review. We also studied the sensitivity of the top k predictions of the SW model, TW model, and SW-TW combination, and then compared the accuracy of the manually assigned codes to SOII data with that of the proposed system. The accuracy of the proposed system, assuming well-trained coders reviewing a subset of only 26% of cases flagged for review, was estimated to be comparable (86.5%) to the accuracy of the original coding of the data set (range: 73%-86.8%). Overall, the TW model had higher sensitivity than the SW model, and the accuracy of the prediction results increased when the two models agreed, and for higher prediction strength thresholds. The sensitivity of the top five predictions was 93%. The proposed system seems promising for coding injury data as it offers comparable accuracy and less manual coding. Accurate and timely coded occupational injury data is useful for surveillance as well as prevention activities that aim to make workplaces safer. Copyright © 2016 Elsevier Ltd and National Safety Council. All rights reserved.
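The agreement-based filtering idea can be sketched with two small Naive Bayes models, one over single words and one over two-word sequences. This is a minimal illustration under stated assumptions, not the authors' system: the class and function names, the Laplace smoothing, the 0.75 threshold, and the toy narratives and codes are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def unigrams(text):
    return text.lower().split()


def bigrams(text):
    words = text.lower().split()
    return [f"{a} {b}" for a, b in zip(words, words[1:])]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self, featurize):
        self.featurize = featurize
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, narratives, codes):
        for text, code in zip(narratives, codes):
            self.class_counts[code] += 1
            for f in self.featurize(text):
                self.feature_counts[code][f] += 1
                self.vocab.add(f)
        return self

    def posteriors(self, text):
        total = sum(self.class_counts.values())
        scores = {}
        for code, n in self.class_counts.items():
            denom = sum(self.feature_counts[code].values()) + len(self.vocab)
            score = math.log(n / total)
            for f in self.featurize(text):
                score += math.log((self.feature_counts[code][f] + 1) / denom)
            scores[code] = score
        top = max(scores.values())
        weights = {c: math.exp(s - top) for c, s in scores.items()}
        z = sum(weights.values())
        return {c: w / z for c, w in weights.items()}


def route(text, sw, tw, threshold=0.75):
    """Autocode when both models agree confidently, else flag for review."""
    p_sw, p_tw = sw.posteriors(text), tw.posteriors(text)
    best_sw, best_tw = max(p_sw, key=p_sw.get), max(p_tw, key=p_tw.get)
    if best_sw == best_tw and min(p_sw[best_sw], p_tw[best_tw]) >= threshold:
        return ("auto", best_sw)
    # Give the human coder the single-word model's choices, best first.
    return ("review", sorted(p_sw, key=p_sw.get, reverse=True))


narratives = ["fell from ladder", "fell from roof",
              "cut hand with knife", "cut finger with knife"]
codes = ["fall", "fall", "cut", "cut"]
sw = NaiveBayes(unigrams).fit(narratives, codes)
tw = NaiveBayes(bigrams).fit(narratives, codes)
confident = route("fell from scaffold", sw, tw)
uncertain = route("hand with ladder", sw, tw)
```

Raising `threshold` sends more cases to manual review, which mirrors the trade-off between accuracy and coding effort discussed in the abstract.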
EGASP: the human ENCODE Genome Annotation Assessment Project
Guigó, Roderic; Flicek, Paul; Abril, Josep F; Reymond, Alexandre; Lagarde, Julien; Denoeud, France; Antonarakis, Stylianos; Ashburner, Michael; Bajic, Vladimir B; Birney, Ewan; Castelo, Robert; Eyras, Eduardo; Ucla, Catherine; Gingeras, Thomas R; Harrow, Jennifer; Hubbard, Tim; Lewis, Suzanna E; Reese, Martin G
2006-01-01
Background: We present the results of EGASP, a community experiment to assess the state-of-the-art in genome annotation within the ENCODE regions, which span 1% of the human genome sequence. The experiment had two major goals: the assessment of the accuracy of computational methods to predict protein coding genes; and the overall assessment of the completeness of the current human genome annotations as represented in the ENCODE regions. For the computational prediction assessment, eighteen groups contributed gene predictions. We evaluated these submissions against each other based on a 'reference set' of annotations generated as part of the GENCODE project. These annotations were not available to the prediction groups prior to the submission deadline, so that their predictions were blind and an external advisory committee could perform a fair assessment. Results: The best methods had at least one gene transcript correctly predicted for close to 70% of the annotated genes. Nevertheless, the multiple transcript accuracy, taking into account alternative splicing, reached only approximately 40% to 50% accuracy. At the coding nucleotide level, the best programs reached an accuracy of 90% in both sensitivity and specificity. Programs relying on mRNA and protein sequences were the most accurate in reproducing the manually curated annotations. Experimental validation shows that only a very small percentage (3.2%) of the selected 221 computationally predicted exons outside of the existing annotation could be verified. Conclusion: This is the first such experiment in human DNA, and we have followed the standards established in a similar experiment, GASP1, in Drosophila melanogaster. We believe the results presented here contribute to the value of ongoing large-scale annotation projects and should guide further experimental methods when being scaled up to the entire human genome sequence. PMID:16925836
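Nucleotide-level sensitivity and specificity of the kind reported above can be computed from coding intervals as in the following sketch. It assumes the convention common in gene-prediction evaluations, where specificity means the fraction of predicted coding nucleotides that are annotated as coding; the abstract itself does not define the term, and the function name and interval format are hypothetical.

```python
def nucleotide_accuracy(annotated, predicted):
    """Return (sensitivity, specificity) at the coding-nucleotide level.

    Both arguments are iterables of (start, end) half-open intervals.
    Specificity is taken here as TP / (TP + FP), an assumed convention.
    """
    def positions(intervals):
        covered = set()
        for start, end in intervals:
            covered.update(range(start, end))
        return covered

    truth, calls = positions(annotated), positions(predicted)
    true_positives = len(truth & calls)
    sensitivity = true_positives / len(truth) if truth else 0.0
    specificity = true_positives / len(calls) if calls else 0.0
    return sensitivity, specificity


sn, sp = nucleotide_accuracy(
    annotated=[(100, 200), (300, 400)],
    predicted=[(110, 200), (300, 420)],
)
```

Set-based bookkeeping is adequate for short regions; an interval-sweep implementation would be preferable at genome scale.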
Lozano, Gloria; Trenado, Helena P.; Fiallo-Olivé, Elvira; Chirinos, Dorys; Geraud-Pouey, Francis; Briddon, Rob W.; Navas-Castillo, Jesús
2016-01-01
Begomoviruses (family Geminiviridae) are whitefly-transmitted, plant-infecting single-stranded DNA viruses that cause crop losses throughout the warmer parts of the World. Sweepoviruses are a phylogenetically distinct group of begomoviruses that infect plants of the family Convolvulaceae, including sweet potato (Ipomoea batatas). Two classes of subviral molecules are often associated with begomoviruses, particularly in the Old World; the betasatellites and the alphasatellites. An analysis of sweet potato and Ipomoea indica samples from Spain and Merremia dissecta samples from Venezuela identified small non-coding subviral molecules in association with several distinct sweepoviruses. The sequences of 18 clones were obtained and found to be structurally similar to tomato leaf curl virus-satellite (ToLCV-sat, the first DNA satellite identified in association with a begomovirus), with a region with significant sequence identity to the conserved region of betasatellites, an A-rich sequence, a predicted stem–loop structure containing the nonanucleotide TAATATTAC, and a second predicted stem–loop. These sweepovirus-associated satellites join an increasing number of ToLCV-sat-like non-coding satellites identified recently. Although sharing some features with betasatellites, evidence is provided to suggest that the ToLCV-sat-like satellites are distinct from betasatellites and should be considered a separate class of satellites, for which the collective name deltasatellites is proposed. PMID:26925037
Schaeffer, E; Sninsky, J J
1984-01-01
Proteins that are related evolutionarily may have diverged at the level of primary amino acid sequence while maintaining similar secondary structures. Computer analysis has been used to compare the open reading frames of the hepatitis B virus to those of the woodchuck hepatitis virus at the level of amino acid sequence, and to predict the relative hydrophilic character and the secondary structure of putative polypeptides. Similarity is seen at the levels of relative hydrophilicity and secondary structure, in the absence of sequence homology. These data reinforce the proposal that these open reading frames encode viral proteins. Computer analysis of this type can be more generally used to establish structural similarities between proteins that do not share obvious sequence homology as well as to assess whether an open reading frame is fortuitous or codes for a protein. PMID:6585835
kmer-SVM: a web server for identifying predictive regulatory sequence features in genomic data sets
Fletez-Brant, Christopher; Lee, Dongwon; McCallion, Andrew S.; Beer, Michael A.
2013-01-01
Massively parallel sequencing technologies have made the generation of genomic data sets a routine component of many biological investigations. For example, Chromatin immunoprecipitation followed by sequence assays detect genomic regions bound (directly or indirectly) by specific factors, and DNase-seq identifies regions of open chromatin. A major bottleneck in the interpretation of these data is the identification of the underlying DNA sequence code that defines, and ultimately facilitates prediction of, these transcription factor (TF) bound or open chromatin regions. We have recently developed a novel computational methodology, which uses a support vector machine (SVM) with kmer sequence features (kmer-SVM) to identify predictive combinations of short transcription factor-binding sites, which determine the tissue specificity of these genomic assays (Lee, Karchin and Beer, Discriminative prediction of mammalian enhancers from DNA sequence. Genome Res. 2011; 21:2167–80). This regulatory information can (i) give confidence in genomic experiments by recovering previously known binding sites, and (ii) reveal novel sequence features for subsequent experimental testing of cooperative mechanisms. Here, we describe the development and implementation of a web server to allow the broader research community to independently apply our kmer-SVM to analyze and interpret their genomic datasets. We analyze five recently published data sets and demonstrate how this tool identifies accessory factors and repressive sequence elements. kmer-SVM is available at http://kmersvm.beerlab.org. PMID:23771147
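The k-mer feature representation that feeds such a classifier can be sketched as below. Only the feature extraction is shown; the SVM training step is omitted, and the folding of reverse complements, the frequency normalization, and the function names are assumptions rather than details confirmed by the abstract.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Represent a k-mer and its reverse complement by one shared key."""
    rc = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, rc)


def kmer_features(seq, k=6):
    """Return (vocabulary, normalized k-mer frequency vector) for seq."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip windows containing N or gaps
            counts[canonical(kmer)] += 1
    vocab = sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})
    total = sum(counts.values()) or 1
    return vocab, [counts[m] / total for m in vocab]


vocab, x = kmer_features("ACGTAC", k=2)
```

Vectors built this way for bound and unbound regions could then be passed to any SVM implementation, with feature weights inspected afterwards to find predictive k-mers.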
Saeed, A M; Magnuson, N S; Sriranganathan, N; Burger, D; Cosand, W
1984-01-01
Heat-stable enterotoxins (STs) from four strains of bovine enterotoxigenic Escherichia coli representing four serogroups were purified to homogeneity by utilizing previously published purification schemata. Biochemical characterization of the purified STs showed that they met the basic criteria for the heat-stable enterotoxins of E. coli. Amino acid analysis of the purified STs revealed that they were peptides of identical amino acid composition. This composition consisted of 18 residues of 10 different amino acids, 6 of which were cysteine. The amino acid composition of the four ST peptides was identical to that reported for the STs of human and porcine E. coli. In addition, complete sequence analysis of two of the ST peptides and partial sequencing of several others revealed strong homology to the sequences of STs from human and porcine E. coli and to the sequence predicted from the last 18 codons of the transposon Tn1681. There was also substantial homology to the sequence predicted from the ST-coding genetic element of human E. coli, which may indicate the existence of identical bioactive configuration among ST peptides of E. coli strains of various host origins. These data support the hypothesis that STs produced by human, bovine, and porcine E. coli are coded by a closely related genetic element which may have originated from a single, widely disseminated transposon. PMID:6376355
DNA-based watermarks using the DNA-Crypt algorithm.
Heider, Dominik; Barnekow, Angelika
2007-05-29
The aim of this paper is to demonstrate the application of watermarks based on DNA sequences to identify the unauthorized use of genetically modified organisms (GMOs) protected by patents. Predicted mutations in the genome can be corrected by the DNA-Crypt program, leaving the encrypted information intact. Existing DNA cryptographic and steganographic algorithms use synthetic DNA sequences to store binary information; however, although these sequences can be used for authentication, they may change the target DNA sequence when introduced into living organisms. The DNA-Crypt algorithm and image steganography are based on the same watermark-hiding principle, namely using the least significant base in the case of DNA-Crypt and the least significant bit in the case of image steganography. It can be combined with binary encryption algorithms like AES, RSA or Blowfish. DNA-Crypt is able to correct mutations in the target DNA with several mutation correction codes such as the Hamming-code or the WDH-code. Mutations, which occur infrequently, may destroy the encrypted information; however, an integrated fuzzy controller decides on a set of heuristics based on three input dimensions, and recommends whether or not to use a correction code. These three input dimensions are the length of the sequence, the individual mutation rate and the stability over time, which is represented by the number of generations. In silico experiments using Ypt7 in Saccharomyces cerevisiae show that the DNA watermarks produced by DNA-Crypt do not alter the translation of mRNA into protein. The program is able to store watermarks in living organisms and can maintain the original information by correcting mutations itself. Pairwise or multiple sequence alignments show that DNA-Crypt produces few mismatches between the sequences, similar to all steganographic algorithms.
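The "least significant base" idea described above can be sketched as follows. This is not the DNA-Crypt implementation: it is a minimal illustration that assumes the watermark is written only into the third position of codons whose first two bases form a fourfold-degenerate family in the standard genetic code, so that each such codon carries two bits without changing the encoded amino acid. The names `embed`, `extract`, `FOURFOLD` and the base-to-bits mapping are hypothetical, and no encryption or mutation correction code is included.

```python
# Codon prefixes whose third base is free in the standard genetic code
# (Gly, Ala, Val, Thr, Pro, Ser, Leu, Arg fourfold-degenerate families).
FOURFOLD = {"GG", "GC", "GT", "AC", "CC", "TC", "CT", "CG"}
BASES = "ACGT"  # assumed mapping: A=00, C=01, G=10, T=11


def embed(cds, bits):
    """Write a bit string (even length) into third codon positions."""
    if len(bits) % 2:
        raise ValueError("bit string must have even length")
    usable = len(cds) - len(cds) % 3
    out, pos = [], 0
    for i in range(0, usable, 3):
        codon = cds[i:i + 3]
        if codon[:2] in FOURFOLD and pos < len(bits):
            # Replace the third base; the amino acid stays the same.
            codon = codon[:2] + BASES[int(bits[pos:pos + 2], 2)]
            pos += 2
        out.append(codon)
    if pos < len(bits):
        raise ValueError("sequence has too few usable codons")
    return "".join(out) + cds[usable:]


def extract(cds, n_bits):
    """Read n_bits back from the third positions of usable codons."""
    usable = len(cds) - len(cds) % 3
    bits = []
    for i in range(0, usable, 3):
        codon = cds[i:i + 3]
        if codon[:2] in FOURFOLD and 2 * len(bits) < n_bits:
            bits.append(format(BASES.index(codon[2]), "02b"))
    return "".join(bits)
```

For example, embedding the bits `1101` into `ATGGGAGCTTAA` changes only the third bases of the glycine and alanine codons, so the translated peptide is unchanged.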
DNA-based watermarks using the DNA-Crypt algorithm
Heider, Dominik; Barnekow, Angelika
2007-01-01
Background: The aim of this paper is to demonstrate the application of watermarks based on DNA sequences to identify the unauthorized use of genetically modified organisms (GMOs) protected by patents. Predicted mutations in the genome can be corrected by the DNA-Crypt program, leaving the encrypted information intact. Existing DNA cryptographic and steganographic algorithms use synthetic DNA sequences to store binary information; however, although these sequences can be used for authentication, they may change the target DNA sequence when introduced into living organisms. Results: The DNA-Crypt algorithm and image steganography are based on the same watermark-hiding principle, namely using the least significant base in the case of DNA-Crypt and the least significant bit in the case of image steganography. It can be combined with binary encryption algorithms like AES, RSA or Blowfish. DNA-Crypt is able to correct mutations in the target DNA with several mutation correction codes such as the Hamming-code or the WDH-code. Mutations, which occur infrequently, may destroy the encrypted information; however, an integrated fuzzy controller decides on a set of heuristics based on three input dimensions, and recommends whether or not to use a correction code. These three input dimensions are the length of the sequence, the individual mutation rate and the stability over time, which is represented by the number of generations. In silico experiments using Ypt7 in Saccharomyces cerevisiae show that the DNA watermarks produced by DNA-Crypt do not alter the translation of mRNA into protein. Conclusion: The program is able to store watermarks in living organisms and can maintain the original information by correcting mutations itself. Pairwise or multiple sequence alignments show that DNA-Crypt produces few mismatches between the sequences, similar to all steganographic algorithms. PMID:17535434
Using cellular automata to generate image representation for biological sequences.
Xiao, X; Shao, S; Ding, Y; Huang, Z; Chen, X; Chou, K-C
2005-02-01
A novel approach to visualize biological sequences is developed based on cellular automata (Wolfram, S. Nature 1984, 311, 419-424), a set of discrete dynamical systems in which space and time are discrete. By transforming the symbolic sequence codes into digital codes, and using some optimal space-time evolution rules of cellular automata, a biological sequence can be represented by a unique image, the so-called cellular automata image. Many important features, which are originally hidden in a long and complicated biological sequence, can be clearly revealed through its cellular automata image. With the number of biological sequences entering databanks rapidly increasing in the post-genomic era, it is anticipated that the cellular automata image will become a very useful vehicle for investigation into their key features, identification of their function, as well as revelation of their "fingerprint". It is anticipated that by using the concept of the pseudo amino acid composition (Chou, K.C. Proteins: Structure, Function, and Genetics, 2001, 43, 246-255), the cellular automata image approach can also be used to improve the quality of predicting protein attributes, such as structural class and subcellular location.
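The two steps described above (symbolic codes to digital codes, then space-time evolution) can be sketched with an elementary one-dimensional cellular automaton. Everything specific here is an assumption for illustration: the 2-bit nucleotide code, the periodic boundary, the default rule number and the function names are not taken from the abstract, which does not state which digital coding or evolution rule the authors selected.

```python
# Hypothetical 2-bit digital code for nucleotides.
CODE = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}


def encode(seq):
    """Turn a nucleotide string into a flat list of bits."""
    return [bit for base in seq.upper() for bit in CODE[base]]


def step(cells, rule):
    """Apply one update of an elementary (Wolfram-numbered) CA rule.

    The row is treated as circular, so the first and last cells are neighbours.
    """
    n = len(cells)
    nxt = []
    for i in range(n):
        left, centre, right = cells[i - 1], cells[i], cells[(i + 1) % n]
        idx = (left << 2) | (centre << 1) | right  # neighbourhood as 0..7
        nxt.append((rule >> idx) & 1)              # look up the rule bit
    return nxt


def ca_image(seq, rule=84, steps=8):
    """Return rows of bits: row 0 is the encoded sequence, later rows its evolution."""
    rows = [encode(seq)]
    for _ in range(steps):
        rows.append(step(rows[-1], rule))
    return rows
```

Each row of the returned list can be rendered as one line of black and white pixels, so the stack of rows forms the image; different rule numbers give visibly different textures for the same sequence.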
Analysis and recognition of 5′ UTR intron splice sites in human pre-mRNA
Eden, E.; Brunak, S.
2004-01-01
Prediction of splice sites in non-coding regions of genes is one of the most challenging aspects of gene structure recognition. We perform a rigorous analysis of such splice sites embedded in human 5′ untranslated regions (UTRs), and investigate correlations between this class of splice sites and other features found in the adjacent exons and introns. By restricting the training of neural network algorithms to ‘pure’ UTRs (not extending partially into protein coding regions), we for the first time investigate the predictive power of the splicing signal proper, in contrast to conventional splice site prediction, which typically relies on the change in sequence at the transition from protein coding to non-coding. By doing so, the algorithms were able to pick up subtler splicing signals that were otherwise masked by ‘coding’ noise, thus enhancing significantly the prediction of 5′ UTR splice sites. For example, the non-coding splice site predicting networks pick up compositional and positional bias in the 3′ ends of non-coding exons and 5′ non-coding intron ends, where cytosine and guanine are over-represented. This compositional bias at the true UTR donor sites is also visible in the synaptic weights of the neural networks trained to identify UTR donor sites. Conventional splice site prediction methods perform poorly in UTRs because the reading frame pattern is absent. The NetUTR method presented here performs 2–3-fold better compared with NetGene2 and GenScan in 5′ UTRs. We also tested the 5′ UTR trained method on protein coding regions, and discovered, surprisingly, that it works quite well (although it cannot compete with NetGene2). This indicates that the local splicing pattern in UTRs and coding regions is largely the same. The NetUTR method is made publicly available at www.cbs.dtu.dk/services/NetUTR. PMID:14960723
Statistical Analysis of CFD Solutions From the Fifth AIAA Drag Prediction Workshop
NASA Technical Reports Server (NTRS)
Morrison, Joseph H.
2013-01-01
A graphical framework is used for statistical analysis of the results from an extensive N-version test of a collection of Reynolds-averaged Navier-Stokes computational fluid dynamics codes. The solutions were obtained by code developers and users from North America, Europe, Asia, and South America using a common grid sequence and multiple turbulence models for the June 2012 fifth Drag Prediction Workshop sponsored by the AIAA Applied Aerodynamics Technical Committee. The aerodynamic configuration for this workshop was the Common Research Model subsonic transport wing-body previously used for the 4th Drag Prediction Workshop. This work continues the statistical analysis begun in the earlier workshops and compares the results from the grid convergence study of the most recent workshop with previous workshops.
Sen, Diya; Chandrababunaidu, Mathu Malar; Singh, Deeksha; Sanghi, Neha; Ghorai, Arpita; Mishra, Gyan Prakash; Madduluri, Madhavi
2015-01-01
We report here the draft genome sequence of Scytonema millei VB511283, a cyanobacterium isolated from biofilms on the exterior of stone monuments in Santiniketan, eastern India. The draft genome is 11,627,246 bp long (11.63 Mb), with 118 scaffolds. About 9,011 protein-coding genes, 117 tRNAs, and 12 rRNAs are predicted from this assembly. PMID:25744984
Genome Sequence of Legionella massiliensis, Isolated from a Cooling Tower Water Sample.
Pagnier, Isabelle; Croce, Olivier; Robert, Catherine; Raoult, Didier; La Scola, Bernard
2014-10-16
We present the draft genome sequence of Legionella massiliensis strain LegA(T), recovered from a cooling tower water sample, using an amoebal coculture procedure. The strain described here is composed of 4,387,007 bp, with a G+C content of 41.19%, and its genome has 3,767 protein-coding genes and 60 predicted RNA genes. Copyright © 2014 Pagnier et al.
Kim, Ji Hyung
2016-01-01
Here, we report the draft genome of cyanobacterium Leptolyngbya sp. KIOST-1 isolated from a microalgal culture pond in South Korea. The genome consists of 13 contigs containing 6,320,172 bp, and a total of 5,327 coding sequences were predicted. This genomic information will allow further exploitation of its biotechnological potential for alimentary purposes. PMID:27635005
Barley whole exome capture: a tool for genomic research in the genus Hordeum and beyond
Mascher, Martin; Richmond, Todd A; Gerhardt, Daniel J; Himmelbach, Axel; Clissold, Leah; Sampath, Dharanya; Ayling, Sarah; Steuernagel, Burkhard; Pfeifer, Matthias; D'Ascenzo, Mark; Akhunov, Eduard D; Hedley, Pete E; Gonzales, Ana M; Morrell, Peter L; Kilian, Benjamin; Blattner, Frank R; Scholz, Uwe; Mayer, Klaus FX; Flavell, Andrew J; Muehlbauer, Gary J; Waugh, Robbie; Jeddeloh, Jeffrey A; Stein, Nils
2013-01-01
Advanced resources for genome-assisted research in barley (Hordeum vulgare) including a whole-genome shotgun assembly and an integrated physical map have recently become available. These have made possible studies that aim to assess genetic diversity or to isolate single genes by whole-genome resequencing and in silico variant detection. However, such an approach remains expensive given the 5 Gb size of the barley genome. Targeted sequencing of the mRNA-coding exome reduces barley genomic complexity more than 50-fold, thus dramatically reducing this heavy sequencing and analysis load. We have developed and employed an in-solution hybridization-based sequence capture platform to selectively enrich for a 61.6 megabase coding sequence target that includes predicted genes from the genome assembly of the cultivar Morex as well as publicly available full-length cDNAs and de novo assembled RNA-Seq consensus sequence contigs. The platform provides a highly specific capture with substantial and reproducible enrichment of targeted exons, both for cultivated barley and related species. We show that this exome capture platform provides a clear path towards a broader and deeper understanding of the natural variation residing in the mRNA-coding part of the barley genome and will thus constitute a valuable resource for applications such as mapping-by-sequencing and genetic diversity analyses. PMID:23889683
Human Splice-Site Prediction with Deep Neural Networks.
Naito, Tatsuhiko
2018-04-18
Accurate splice-site prediction is essential to delineate gene structures from sequence data. Several computational techniques have been applied to create systems that predict canonical splice sites. For classification tasks, deep neural networks (DNNs) have achieved record-breaking results and often outperformed other supervised learning techniques. In this study, a new method of splice-site prediction using DNNs was proposed. The proposed system receives an input sequence and returns an answer as to whether it is a splice site. The length of the input is 140 nucleotides, with the consensus sequence (i.e., "GT" and "AG" for the donor and acceptor sites, respectively) in the middle. Each input sequence is passed to the pretrained DNN model, which determines the probability that the input is a splice site. The model consists of convolutional layers and bidirectional long short-term memory network layers. The pretraining and validation were conducted using the data set tested in previously reported methods. The performance evaluation results showed that the proposed method can outperform the previous methods. In addition, the pattern learned by the DNNs was visualized as position frequency matrices (PFMs). Some of the PFMs were very similar to the consensus sequence. The trained DNN model and the brief source code for the prediction system are uploaded. Further improvement will be achieved following the further development of DNNs.
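The input preparation described above (a fixed-length window with the consensus dinucleotide in the middle) might look like the sketch below. It covers only candidate extraction and one-hot encoding, not the network itself. The exact placement of "GT" (here at 0-based window positions 69 and 70 for a 140-nucleotide window), the treatment of ambiguous bases and the function names are assumptions for illustration and are not specified in the abstract.

```python
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def one_hot(window):
    """Encode a nucleotide string as a list of 4-element 0/1 rows.

    Bases other than A, C, G, T (e.g. N) become an all-zero row.
    """
    rows = []
    for base in window.upper():
        row = [0, 0, 0, 0]
        if base in BASE_INDEX:
            row[BASE_INDEX[base]] = 1
        rows.append(row)
    return rows


def donor_windows(seq, length=140):
    """Yield (position, window) for every 'GT' that can sit mid-window.

    Assumed convention: the 'G' of the dinucleotide occupies window index
    length // 2 - 1, so the dinucleotide straddles the window centre.
    """
    seq = seq.upper()
    half = length // 2
    found = []
    for pos in range(half - 1, len(seq) - half):
        if seq[pos:pos + 2] == "GT":
            start = pos - (half - 1)
            found.append((pos, seq[start:start + length]))
    return found
```

Acceptor candidates could be collected the same way by searching for "AG" instead of "GT"; each encoded window would then be scored by whatever classifier is being evaluated.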
Tett, Adrian; Spiers, Andrew J; Crossman, Lisa C; Ager, Duane; Ciric, Lena; Dow, J Maxwell; Fry, John C; Harris, David; Lilley, Andrew; Oliver, Anna; Parkhill, Julian; Quail, Michael A; Rainey, Paul B; Saunders, Nigel J; Seeger, Kathy; Snyder, Lori AS; Squares, Rob; Thomas, Christopher M; Turner, Sarah L; Zhang, Xue-Xian; Field, Dawn; Bailey, Mark J
2009-01-01
The plasmid pQBR103 was found within Pseudomonas populations colonizing the leaf and root surfaces of sugar beet plants growing at Wytham, Oxfordshire, UK. At 425 kb it is the largest self-transmissible plasmid yet sequenced from the phytosphere. It is known to enhance the competitive fitness of its host, and parts of the plasmid are known to be actively transcribed in the plant environment. Analysis of the complete sequence of this plasmid predicts a coding sequence (CDS)-rich genome containing 478 CDSs and an exceptional degree of genetic novelty; 80% of predicted coding sequences cannot be ascribed a function and 60% are orphans. Of those to which function could be assigned, 40% bore greatest similarity to sequences from Pseudomonas spp., and the majority of the remainder showed similarity to other γ-proteobacterial genera and plasmids. pQBR103 has identifiable regions presumed responsible for replication and partitioning, but despite being tra+ lacks the full complement of any previously described conjugal transfer functions. The DNA sequence provided few insights into the functional significance of plant-induced transcriptional regions, but suggests that 14% of CDSs may be expressed (11 CDSs with functional annotation and 54 without), further highlighting the ecological importance of these novel CDSs. Comparative analysis indicates that pQBR103 shares significant regions of sequence with other plasmids isolated from sugar beet plants grown at the same geographic location. These plasmid sequences indicate there is more novelty in the mobile DNA pool accessible to phytosphere Pseudomonas than is currently appreciated or understood. PMID:18043644
Liu, Bin; Ertesvåg, Helga; Aasen, Inga Marie; Vadstein, Olav; Brautaset, Trygve; Heggeset, Tonje Marita Bjerkan
2016-06-01
Thraustochytrids are unicellular, marine protists, and there is a growing industrial interest in these organisms, particularly because some species, including strains belonging to the genus Aurantiochytrium, accumulate high levels of docosahexaenoic acid (DHA). Here, we report the draft genome sequence of Aurantiochytrium sp. T66 (ATCC PRA-276), with a size of 43 Mbp, and 11,683 predicted protein-coding sequences. The data has been deposited at DDBJ/EMBL/Genbank under the accession LNGJ00000000. The genome sequence will contribute new insight into DHA biosynthesis and regulation, providing a basis for metabolic engineering of thraustochytrids.
Gritz, L; Davies, J
1983-11-01
The plasmid-borne gene hph coding for hygromycin B phosphotransferase (HPH) in Escherichia coli has been identified and its nucleotide sequence determined. The hph gene is 1026 nucleotides long, coding for a protein with a predicted Mr of 39 000. The hph gene was placed in a shuttle plasmid vector, downstream from the promoter region of the cyc 1 gene of Saccharomyces cerevisiae, and an hph construction containing a single AUG in the 5' noncoding region allowed direct selection following transformation in yeast and in E. coli. Thus the hph gene can be used in cloning vectors for both pro- and eukaryotes.
InterProScan 5: genome-scale protein function classification
Jones, Philip; Binns, David; Chang, Hsin-Yu; Fraser, Matthew; Li, Weizhong; McAnulla, Craig; McWilliam, Hamish; Maslen, John; Mitchell, Alex; Nuka, Gift; Pesseat, Sebastien; Quinn, Antony F.; Sangrador-Vegas, Amaia; Scheremetjew, Maxim; Yong, Siew-Yit; Lopez, Rodrigo; Hunter, Sarah
2014-01-01
Motivation: Robust large-scale sequence analysis is a major challenge in modern genomic science, where biologists are frequently trying to characterize many millions of sequences. Here, we describe a new Java-based architecture for the widely used protein function prediction software package InterProScan. Developments include improvements and additions to the outputs of the software and the complete reimplementation of the software framework, resulting in a flexible and stable system that is able to use multiprocessor machines and/or conventional clusters to achieve scalable distributed data analysis. InterProScan is freely available for download from the EMBL-EBI FTP site and the open source code is hosted at Google Code. Availability and implementation: InterProScan is distributed via FTP at ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/ and the source code is available from http://code.google.com/p/interproscan/. Contact: http://www.ebi.ac.uk/support or interhelp@ebi.ac.uk or mitchell@ebi.ac.uk PMID:24451626
Natural Antisense Transcripts: Molecular Mechanisms and Implications in Breast Cancers
Latgé, Guillaume; Poulet, Christophe; Bours, Vincent; Jerusalem, Guy
2018-01-01
Natural antisense transcripts are RNA sequences that can be transcribed from both DNA strands at the same locus but in the opposite direction from the gene transcript. Because strand-specific high-throughput sequencing of the antisense transcriptome has only been available for less than a decade, many natural antisense transcripts were first described as long non-coding RNAs. Although the precise biological roles of natural antisense transcripts are not known yet, an increasing number of studies report their implication in gene expression regulation. Their expression levels are altered in many physiological and pathological conditions, including breast cancers. Among the potential clinical utilities of the natural antisense transcripts, the non-coding|coding transcript pairs are of high interest for treatment. Indeed, these pairs can be targeted by antisense oligonucleotides to specifically tune the expression of the coding-gene. Here, we describe the current knowledge about natural antisense transcripts, their varying molecular mechanisms as gene expression regulators, and their potential as prognostic or predictive biomarkers in breast cancers. PMID:29301303
Natural Antisense Transcripts: Molecular Mechanisms and Implications in Breast Cancers.
Latgé, Guillaume; Poulet, Christophe; Bours, Vincent; Josse, Claire; Jerusalem, Guy
2018-01-02
Natural antisense transcripts are RNA sequences that can be transcribed from both DNA strands at the same locus but in the opposite direction from the gene transcript. Because strand-specific high-throughput sequencing of the antisense transcriptome has only been available for less than a decade, many natural antisense transcripts were first described as long non-coding RNAs. Although the precise biological roles of natural antisense transcripts are not known yet, an increasing number of studies report their implication in gene expression regulation. Their expression levels are altered in many physiological and pathological conditions, including breast cancers. Among the potential clinical utilities of the natural antisense transcripts, the non-coding|coding transcript pairs are of high interest for treatment. Indeed, these pairs can be targeted by antisense oligonucleotides to specifically tune the expression of the coding-gene. Here, we describe the current knowledge about natural antisense transcripts, their varying molecular mechanisms as gene expression regulators, and their potential as prognostic or predictive biomarkers in breast cancers.
The Intolerance of Regulatory Sequence to Genetic Variation Predicts Gene Dosage Sensitivity
Wang, Quanli; Halvorsen, Matt; Han, Yujun; Weir, William H.; Allen, Andrew S.; Goldstein, David B.
2015-01-01
Noncoding sequence contains pathogenic mutations. Yet, compared with mutations in protein-coding sequence, pathogenic regulatory mutations are notoriously difficult to recognize. Most fundamentally, we are not yet adept at recognizing the sequence stretches in the human genome that are most important in regulating the expression of genes. For this reason, it is difficult to apply to the regulatory regions the same kinds of analytical paradigms that are being successfully applied to identify mutations among protein-coding regions that influence risk. To determine whether dosage sensitive genes have distinct patterns among their noncoding sequence, we present two primary approaches that focus solely on a gene’s proximal noncoding regulatory sequence. The first approach is a regulatory sequence analogue of the recently introduced residual variation intolerance score (RVIS), termed noncoding RVIS, or ncRVIS. The ncRVIS compares observed and predicted levels of standing variation in the regulatory sequence of human genes. The second approach, termed ncGERP, reflects the phylogenetic conservation of a gene’s regulatory sequence using GERP++. We assess how well these two approaches correlate with four gene lists that use different ways to identify genes known or likely to cause disease through changes in expression: 1) genes that are known to cause disease through haploinsufficiency, 2) genes curated as dosage sensitive in ClinGen’s Genome Dosage Map, 3) genes judged likely to be under purifying selection for mutations that change expression levels because they are statistically depleted of loss-of-function variants in the general population, and 4) genes judged unlikely to cause disease based on the presence of copy number variants in the general population. We find that both noncoding scores are highly predictive of dosage sensitivity using any of these criteria. 
In a similar way to ncGERP, we assess two ensemble-based predictors of regional noncoding importance, ncCADD and ncGWAVA, and find both scores are significantly predictive of human dosage sensitive genes and appear to carry information beyond conservation, as assessed by ncGERP. These results highlight that the intolerance of noncoding sequence stretches in the human genome can provide a critical complementary tool to other genome annotation approaches to help identify the parts of the human genome increasingly likely to harbor mutations that influence risk of disease. PMID:26332131
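The comparison of observed and predicted variation behind a residual-style intolerance score can be sketched as a simple regression. The sketch assumes, as an illustration only, that a gene's count of common variants is regressed on its total variant count across all genes and that the residual is scaled by the standard deviation of all residuals; this is a simplification and should not be read as the exact ncRVIS definition. Function and variable names are hypothetical.

```python
from statistics import mean, pstdev


def residual_scores(total_variants, common_variants):
    """Return one scaled residual per gene.

    Negative scores mean fewer common variants than the linear fit
    predicts (more 'intolerant'); positive scores mean more.
    """
    mx, my = mean(total_variants), mean(common_variants)
    sxx = sum((x - mx) ** 2 for x in total_variants)
    sxy = sum((x - mx) * (y - my)
              for x, y in zip(total_variants, common_variants))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Observed minus predicted common-variant count for every gene.
    residuals = [y - (intercept + slope * x)
                 for x, y in zip(total_variants, common_variants)]
    sd = pstdev(residuals)
    if sd == 0:
        return [0.0 for _ in residuals]
    return [r / sd for r in residuals]
```

With toy counts such as `residual_scores([10, 20, 30, 40], [1, 2, 3, 8])`, the third gene receives the most negative score because it carries fewer common variants than the fitted line predicts for its total variant count.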
PCR Amplicon Prediction from Multiplex Degenerate Primer and Probe Sets
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gardner, S. N.
2013-08-08
Assessing primer specificity and predicting both desired and off-target amplification products is an essential step for robust PCR assay design. Code is described to predict potential polymerase chain reaction (PCR) amplicons in a large sequence database such as NCBI nt from either singleplex or a large multiplexed set of primers, allowing degenerate primer and probe bases and target mismatches. The code annotates amplicons with gene information automatically downloaded from NCBI, and optionally it can predict whether there are also TaqMan/Luminex probe matches within predicted amplicons.
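A minimal sketch of amplicon prediction with degenerate primers is given below. It is not the code described in the record: it handles a single primer pair on one strand of one template, expands IUPAC degenerate bases into regular-expression character classes, and allows no target mismatches. The function names and the `max_len` parameter are assumptions for illustration.

```python
import re

# IUPAC nucleotide codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def revcomp(seq):
    """Reverse complement, preserving IUPAC degeneracy."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def to_regex(primer):
    """Convert a (possibly degenerate) primer into a regex pattern."""
    return "".join(IUPAC[base] for base in primer.upper())


def predict_amplicons(template, forward, reverse, max_len=1000):
    """Return (start, end, sequence) for each exact-match amplicon."""
    template = template.upper()
    # Lookahead patterns so that overlapping binding sites are all found.
    fwd = re.compile("(?=(%s))" % to_regex(forward))
    rev = re.compile("(?=(%s))" % to_regex(revcomp(reverse)))
    rev_ends = [m.start() + len(reverse) for m in rev.finditer(template)]
    amplicons = []
    for m in fwd.finditer(template):
        start = m.start()
        for end in rev_ends:
            no_overlap = start + len(forward) <= end - len(reverse)
            if no_overlap and end - start <= max_len:
                amplicons.append((start, end, template[start:end]))
    return amplicons
```

The reverse primer is given 5' to 3' as ordered, so its binding site on the listed strand is found by searching for its reverse complement. Scanning both strands, tolerating mismatches and searching a large database would all need additional logic.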
NASA Astrophysics Data System (ADS)
Manjanaik, N.; Parameshachari, B. D.; Hanumanthappa, S. N.; Banu, Reshma
2017-08-01
The intra prediction process of the H.264 video coding standard is used to code the first frame of a video, i.e. the intra frame, and obtains good coding efficiency compared with earlier video coding standards. A further benefit of intra frame coding is that it reduces spatial pixel redundancy within the current frame, reduces computational complexity and provides better rate-distortion performance. Intra frames are coded with the existing Rate Distortion Optimization (RDO) method. This method increases computational complexity, increases bit rate and reduces picture quality, which makes it difficult to implement in real-time applications, so many researchers have developed fast mode decision algorithms for intra frame coding. Previous work on intra frame coding in the H.264 standard using fast mode decision intra prediction algorithms based on different techniques resulted in increased bit rate and degraded picture quality (PSNR) for different quantization parameters. Many previous fast mode decision algorithms for intra frame coding achieved only a reduction in computational complexity or encoding time, at the cost of increased bit rate and loss of picture quality. To avoid this increase in bit rate and loss of picture quality, a better approach was developed. This paper develops such an approach, a Gaussian pulse for intra frame coding using the diagonal down-left intra prediction mode, to achieve higher coding efficiency in terms of PSNR and bit rate. In the proposed method, a Gaussian pulse is multiplied with the 4x4 frequency-domain coefficients of each 4x4 sub-macroblock of each macroblock of the current frame before the quantization process. Multiplying the 4x4 integer-transformed coefficients by a Gaussian pulse at the macroblock level scales the information in the coefficients in a reversible manner. The resulting signal becomes abstract. The frequency samples are abstracted in a known and controllable manner without intermixing of coefficients, which prevents the picture from being badly degraded at higher values of the quantization parameter. The proposed work was implemented using MATLAB and the JM 18.6 reference software. It measures the performance parameters PSNR, bit rate and compression of the intra frames of YUV video sequences in QCIF resolution under different values of the quantization parameter, with the Gaussian value applied for the diagonal down-left intra prediction mode. The simulation results of the proposed algorithm are tabulated and compared with a previous algorithm, the method of Tian et al. The proposed algorithm reduced the bit rate by 30.98% on average and maintained consistent picture quality for QCIF sequences compared with the method of Tian et al.
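The reversible, element-wise scaling of 4x4 transform coefficients described above can be illustrated with a short sketch. The abstract does not give the shape or parameters of the Gaussian pulse, so the two-dimensional form exp(-(i² + j²) / (2σ²)), the value of `sigma` and all function names here are assumptions; the sketch also leaves out the integer transform, quantization and the JM reference software entirely.

```python
import math


def gaussian_weights(size=4, sigma=2.0):
    """Build a size x size grid of Gaussian weights (assumed form).

    The weight is 1.0 at the DC position (0, 0) and decays towards
    higher-frequency positions.
    """
    return [[math.exp(-(i * i + j * j) / (2.0 * sigma * sigma))
             for j in range(size)]
            for i in range(size)]


def scale_block(block, weights):
    """Multiply every coefficient by its own weight (no mixing of coefficients)."""
    return [[c * w for c, w in zip(row_c, row_w)]
            for row_c, row_w in zip(block, weights)]


def unscale_block(scaled, weights):
    """Invert scale_block by dividing by the same non-zero weights."""
    return [[c / w for c, w in zip(row_c, row_w)]
            for row_c, row_w in zip(scaled, weights)]
```

Because each weight is strictly positive and applied to exactly one coefficient, dividing by the same weights at the decoder recovers the original values up to floating-point rounding, which is the sense in which such a scaling is reversible.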
Peng, Hui; Lan, Chaowang; Liu, Yuansheng; Liu, Tao; Blumenstein, Michael; Li, Jinyan
2017-10-03
Disease-related protein-coding genes have been widely studied, but disease-related non-coding genes remain largely unknown. This work introduces a new vector to represent diseases, and applies the newly vectorized data to a positive-unlabeled learning algorithm to predict and rank disease-related long non-coding RNA (lncRNA) genes. This novel vector representation for diseases consists of two sub-vectors. The first is composed of 45 elements, characterizing the information entropies of the disease gene distribution over 45 chromosome substructures. This idea is supported by our observation that some substructures (e.g., the chromosome 6 p-arm) are highly preferred by disease-related protein coding genes, while some (e.g., the 21 p-arm) are not favored at all. The second sub-vector is 30-dimensional, characterizing the distribution of disease gene enriched KEGG pathways in comparison with our manually created pathway groups. The second sub-vector complements the first one to differentiate between various diseases. Our prediction method outperforms the state-of-the-art methods on benchmark datasets for prioritizing disease related lncRNA genes. The method also works well when only the sequence information of an lncRNA gene is known, or even when a given disease has no currently recognized long non-coding genes.
Peng, Hui; Lan, Chaowang; Liu, Yuansheng; Liu, Tao; Blumenstein, Michael; Li, Jinyan
2017-01-01
Disease-related protein-coding genes have been widely studied, but disease-related non-coding genes remain largely unknown. This work introduces a new vector to represent diseases, and applies the newly vectorized data to a positive-unlabeled learning algorithm to predict and rank disease-related long non-coding RNA (lncRNA) genes. This novel vector representation for diseases consists of two sub-vectors. The first is composed of 45 elements, characterizing the information entropies of the disease gene distribution over 45 chromosome substructures. This idea is supported by our observation that some substructures (e.g., the chromosome 6 p-arm) are highly preferred by disease-related protein coding genes, while some (e.g., the 21 p-arm) are not favored at all. The second sub-vector is 30-dimensional, characterizing the distribution of disease gene enriched KEGG pathways in comparison with our manually created pathway groups. The second sub-vector complements the first one to differentiate between various diseases. Our prediction method outperforms the state-of-the-art methods on benchmark datasets for prioritizing disease related lncRNA genes. The method also works well when only the sequence information of an lncRNA gene is known, or even when a given disease has no currently recognized long non-coding genes. PMID:29108274
1996-01-01
Mutations in the Caenorhabditis elegans gene unc-89 result in nematodes having disorganized muscle structure in which thick filaments are not organized into A-bands, and there are no M-lines. Beginning with a partial cDNA from the C. elegans sequencing project, we have cloned and sequenced the unc-89 gene. An unc-89 allele, st515, was found to contain an 84-bp deletion and a 10-bp duplication, resulting in an in-frame stop codon within predicted unc-89 coding sequence. Analysis of the complete coding sequence for unc-89 predicts a novel 6,632 amino acid polypeptide consisting of sequence motifs which have been implicated in protein-protein interactions. UNC-89 begins with 67 residues of unique sequences, SH3, dbl/CDC24, and PH domains, 7 immunoglobulin (Ig) domains, a putative KSP-containing multiphosphorylation domain, and ends with 46 Ig domains. A polyclonal antiserum raised to a portion of unc-89 encoded sequence reacts to a twitchin-sized polypeptide from wild type, but truncated polypeptides from st515 and from the amber allele e2338. By immunofluorescent microscopy, this antiserum localizes to the middle of A-bands, consistent with UNC-89 being a structural component of the M-line. Previous studies indicate that myofilament lattice assembly begins with positional cues laid down in the basement membrane and muscle cell membrane. We propose that the intracellular protein UNC-89 responds to these signals, localizes, and then participates in assembling an M-line. PMID:8603916
NASA Astrophysics Data System (ADS)
Darazi, R.; Gouze, A.; Macq, B.
2009-01-01
Reproducing natural, real scenes as we see them in the real world every day is becoming more and more popular. Stereoscopic and multi-view techniques are used to this end. However, because more information is displayed, supporting technologies such as digital compression are required to ensure the storage and transmission of the sequences. In this paper, a new scheme for stereo image coding is proposed. The original left and right images are jointly coded. The main idea is to optimally exploit the existing correlation between the two images. This is done by designing an efficient transform that reduces the existing redundancy in the stereo image pair. This approach was inspired by the Lifting Scheme (LS). The novelty of our work is that the prediction step is replaced by a hybrid step that consists of disparity compensation followed by luminance correction and an optimized prediction step. The proposed scheme can be used for lossless and for lossy coding. Experimental results show improvement in terms of performance and complexity compared to recently proposed methods.
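The idea of predicting one view from the other and coding only the residual can be illustrated with a deliberately reduced sketch. It works on a single image row, uses one integer disparity for the whole row, and omits the luminance correction, the optimized prediction step and any update step, so it shows only why a lifting-style predict step with disparity compensation is exactly invertible. All names and the border-clamping convention are assumptions, not details from the abstract.

```python
def predict_from_left(left_row, disparity):
    """Disparity-compensated prediction of the right row from the left row.

    Pixel x of the prediction is taken from left pixel x + disparity,
    clamped to the row borders.
    """
    n = len(left_row)
    return [left_row[min(max(x + disparity, 0), n - 1)] for x in range(n)]


def forward(left_row, right_row, disparity):
    """Keep the left row as reference and replace the right row by a residual."""
    pred = predict_from_left(left_row, disparity)
    detail = [r - p for r, p in zip(right_row, pred)]
    return left_row, detail


def inverse(left_row, detail, disparity):
    """Rebuild the right row exactly from the reference and the residual."""
    pred = predict_from_left(left_row, disparity)
    return [d + p for d, p in zip(detail, pred)]
```

Since the decoder can recompute the same prediction from the left row, adding the residual back restores the right row without loss; when the disparity is well chosen the residual is mostly small values, which is what makes it cheaper to code.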
Wang, Jiajia; Li, Hu; Dai, Renhuai
2017-12-01
Here, we describe the first complete mitochondrial genome (mitogenome) sequence of the leafhopper Taharana fasciana (Coelidiinae). The mitogenome sequence contains 15,161 bp with an A + T content of 77.9%. It includes 13 protein-coding genes, two ribosomal RNA genes, 22 transfer RNA genes, and one non-coding (A + T-rich) region; in addition, a repeat region is also present (GenBank accession no. KY886913). These genes/regions are in the same order as in the inferred insect ancestral mitogenome. All protein-coding genes have ATN as the start codon, and TAA or single T as the stop codons, except the gene ND3, which ends with TAG. Furthermore, we predicted the secondary structures of the rRNAs in T. fasciana. Six domains (domain III is absent in arthropods) and 41 helices were predicted for 16S rRNA, and 12S rRNA comprised three structural domains and 24 helices. Phylogenetic tree analysis confirmed that T. fasciana and other members of the Cicadellidae are clustered into a clade, and it identified the relationships among the subfamilies Deltocephalinae, Coelidiinae, Idiocerinae, Cicadellinae, and Typhlocybinae.
PACCMIT/PACCMIT-CDS: identifying microRNA targets in 3' UTRs and coding sequences.
Šulc, Miroslav; Marín, Ray M; Robins, Harlan S; Vaníček, Jiří
2015-07-01
The purpose of the proposed web server, publicly available at http://paccmit.epfl.ch, is to provide a user-friendly interface to two algorithms for predicting messenger RNA (mRNA) molecules regulated by microRNAs: (i) PACCMIT (Prediction of ACcessible and/or Conserved MIcroRNA Targets), which identifies primarily mRNA transcripts targeted in their 3' untranslated regions (3' UTRs), and (ii) PACCMIT-CDS, designed to find mRNAs targeted within their coding sequences (CDSs). While PACCMIT belongs among the accurate algorithms for predicting conserved microRNA targets in the 3' UTRs, the main contribution of the web server is 2-fold: PACCMIT provides an accurate tool for predicting targets also of weakly conserved or non-conserved microRNAs, whereas PACCMIT-CDS addresses the lack of similar portals adapted specifically for targets in CDS. The web server asks the user for microRNAs and mRNAs to be analyzed, accesses the precomputed P-values for all microRNA-mRNA pairs from a database for all mRNAs and microRNAs in a given species, ranks the predicted microRNA-mRNA pairs, evaluates their significance according to the false discovery rate and finally displays the predictions in a tabular form. The results are also available for download in several standard formats. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
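The ranking-by-significance step described above can be sketched as follows. The abstract does not say which false discovery rate procedure the server uses; this sketch assumes the Benjamini-Hochberg adjustment, and the microRNA and mRNA names and P-values are made up for illustration.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted P-values, keeping input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest P first
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Hypothetical microRNA-mRNA pairs with precomputed P-values.
pairs = {("miR-a", "GENE1"): 0.01, ("miR-a", "GENE2"): 0.04,
         ("miR-b", "GENE3"): 0.03, ("miR-b", "GENE4"): 0.50}
names = list(pairs)
qvals = bh_adjust([pairs[n] for n in names])
ranked = sorted(zip(names, qvals), key=lambda item: item[1])
for (mirna, mrna), q in ranked:
    print(mirna, mrna, round(q, 3))
```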
Ancient DNA sequence revealed by error-correcting codes.
Brandão, Marcelo M; Spoladore, Larissa; Faria, Luzinete C B; Rocha, Andréa S L; Silva-Filho, Marcio C; Palazzo, Reginaldo
2015-07-10
A previously described DNA sequence generator algorithm (DNA-SGA) using error-correcting codes has been employed as a computational tool to address the evolutionary pathway of the genetic code. The code-generated sequence alignment demonstrated that a residue mutation revealed by the code can be found in the same position in sequences of distantly related taxa. Furthermore, the code-generated sequences do not promote amino acid changes in the deviant genomes through codon reassignment. A Bayesian evolutionary analysis of both code-generated and homologous sequences of the Arabidopsis thaliana malate dehydrogenase gene indicates an approximately 1 MYA divergence time from the MDH code-generated sequence node to its paralogous sequences. The DNA-SGA helps to determine the plesiomorphic state of DNA sequences because a single nucleotide alteration often occurs in distantly related taxa and can be found in the alternative codon patterns of noncanonical genetic codes. As a consequence, the algorithm may reveal an earlier stage of the evolution of the standard code.
Ancient DNA sequence revealed by error-correcting codes
Brandão, Marcelo M.; Spoladore, Larissa; Faria, Luzinete C. B.; Rocha, Andréa S. L.; Silva-Filho, Marcio C.; Palazzo, Reginaldo
2015-01-01
A previously described DNA sequence generator algorithm (DNA-SGA) using error-correcting codes has been employed as a computational tool to address the evolutionary pathway of the genetic code. The code-generated sequence alignment demonstrated that a residue mutation revealed by the code can be found in the same position in sequences of distantly related taxa. Furthermore, the code-generated sequences do not promote amino acid changes in the deviant genomes through codon reassignment. A Bayesian evolutionary analysis of both code-generated and homologous sequences of the Arabidopsis thaliana malate dehydrogenase gene indicates an approximately 1 MYA divergence time from the MDH code-generated sequence node to its paralogous sequences. The DNA-SGA helps to determine the plesiomorphic state of DNA sequences because a single nucleotide alteration often occurs in distantly related taxa and can be found in the alternative codon patterns of noncanonical genetic codes. As a consequence, the algorithm may reveal an earlier stage of the evolution of the standard code. PMID:26159228
Permanent draft genome sequence of Comamonas testosteroni KF-1
Weiss, Michael; Kesberg, Anna I.; LaButti, Kurt M.; Pitluck, Sam; Bruce, David; Hauser, Loren; Copeland, Alex; Woyke, Tanja; Lowry, Stephen; Lucas, Susan; Land, Miriam; Goodwin, Lynne; Kjelleberg, Staffan; Cook, Alasdair M.; Buhmann, Matthias; Thomas, Torsten; Schleheck, David
2013-01-01
Comamonas testosteroni KF-1 is a model organism for the elucidation of the novel biochemical degradation pathways for xenobiotic 4-sulfophenylcarboxylates (SPC) formed during biodegradation of synthetic 4-sulfophenylalkane surfactants (linear alkylbenzenesulfonates, LAS) by bacterial communities. Here we describe the features of this organism, together with the complete genome sequence and annotation. The 6,026,527 bp long chromosome (one sequencing gap) exhibits an average G+C content of 61.79% and is predicted to encode 5,492 protein-coding genes and 114 RNA genes. PMID:23991256
Sebaihia, Mohammed; Preston, Andrew; Maskell, Duncan J.; Kuzmiak, Holly; Connell, Terry D.; King, Natalie D.; Orndorff, Paul E.; Miyamoto, David M.; Thomson, Nicholas R.; Harris, David; Goble, Arlette; Lord, Angela; Murphy, Lee; Quail, Michael A.; Rutter, Simon; Squares, Robert; Squares, Steven; Woodward, John; Parkhill, Julian; Temple, Louise M.
2006-01-01
Bordetella avium is a pathogen of poultry and is phylogenetically distinct from Bordetella bronchiseptica, Bordetella pertussis, and Bordetella parapertussis, which are other species in the Bordetella genus that infect mammals. In order to understand the evolutionary relatedness of Bordetella species and further the understanding of pathogenesis, we obtained the complete genome sequence of B. avium strain 197N, a pathogenic strain that has been extensively studied. With 3,732,255 base pairs of DNA and 3,417 predicted coding sequences, it has the smallest genome and gene complement of the sequenced bordetellae. In this study, the presence or absence of previously reported virulence factors from B. avium was confirmed, and the genetic bases for growth characteristics were elucidated. Over 1,100 genes present in B. avium but not in B. bronchiseptica were identified, and most were predicted to encode surface or secreted proteins that are likely to define an organism adapted to the avian rather than the mammalian respiratory tracts. These include genes coding for the synthesis of a polysaccharide capsule, hemagglutinins, a type I secretion system adjacent to two very large genes for secreted proteins, and unique genes for both lipopolysaccharide and fimbrial biogenesis. Three apparently complete prophages are also present. The BvgAS virulence regulatory system appears to have polymorphisms at a poly(C) tract that is involved in phase variation in other bordetellae. A number of putative iron-regulated outer membrane proteins were predicted from the sequence, and this regulation was confirmed experimentally for five of these. PMID:16885469
Kullback Leibler divergence in complete bacterial and phage genomes
Akhter, Sajia; Kashef, Mona T.; Ibrahim, Eslam S.; Bailey, Barbara
2017-01-01
The amino acid content of the proteins encoded by a genome may predict the coding potential of that genome and may reflect lifestyle restrictions of the organism. Here, we calculated the Kullback–Leibler divergence from the mean amino acid content as a metric to compare the amino acid composition for a large set of bacterial and phage genome sequences. Using these data, we demonstrate that (i) there is a significant difference between amino acid utilization in different phylogenetic groups of bacteria and phages; (ii) many of the bacteria with the most skewed amino acid utilization profiles, or the bacteria that host phages with the most skewed profiles, are endosymbionts or parasites; (iii) the skews in the distribution are not restricted to certain metabolic processes but are common across all bacterial genomic subsystems; (iv) amino acid utilization profiles strongly correlate with GC content in bacterial genomes but very weakly correlate with the G+C percent in phage genomes. These findings might be exploited to distinguish coding from non-coding sequences in large data sets, such as metagenomic sequence libraries, to help in prioritizing subsequent analyses. PMID:29204318
Kullback Leibler divergence in complete bacterial and phage genomes.
Akhter, Sajia; Aziz, Ramy K; Kashef, Mona T; Ibrahim, Eslam S; Bailey, Barbara; Edwards, Robert A
2017-01-01
The amino acid content of the proteins encoded by a genome may predict the coding potential of that genome and may reflect lifestyle restrictions of the organism. Here, we calculated the Kullback-Leibler divergence from the mean amino acid content as a metric to compare the amino acid composition for a large set of bacterial and phage genome sequences. Using these data, we demonstrate that (i) there is a significant difference between amino acid utilization in different phylogenetic groups of bacteria and phages; (ii) many of the bacteria with the most skewed amino acid utilization profiles, or the bacteria that host phages with the most skewed profiles, are endosymbionts or parasites; (iii) the skews in the distribution are not restricted to certain metabolic processes but are common across all bacterial genomic subsystems; (iv) amino acid utilization profiles strongly correlate with GC content in bacterial genomes but very weakly correlate with the G+C percent in phage genomes. These findings might be exploited to distinguish coding from non-coding sequences in large data sets, such as metagenomic sequence libraries, to help in prioritizing subsequent analyses.
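A minimal sketch of the metric named in this abstract, the Kullback-Leibler divergence of one genome's amino acid composition from the mean composition, is given below. The toy proteomes, the pseudocount, and the function names are assumptions of the sketch and are not taken from the paper.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_frequencies(proteins, pseudocount=1.0):
    """Amino acid frequencies over a list of protein strings.

    The pseudocount (an assumption of this sketch) keeps every frequency
    positive so the logarithm below is always defined.
    """
    counts = Counter(aa for seq in proteins for aa in seq if aa in AMINO_ACIDS)
    total = sum(counts[aa] + pseudocount for aa in AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def kl_divergence(p, q):
    """D_KL(p || q) in bits for two distributions over the same alphabet."""
    return sum(p[aa] * math.log2(p[aa] / q[aa]) for aa in AMINO_ACIDS)


# Toy proteomes (made up): two with even composition, one heavily skewed.
genomes = {
    "typical_a": ["ACDEFGHIKLMNPQRSTVWY" * 5],
    "typical_b": ["ACDEFGHIKLMNPQRSTVWY" * 4],
    "skewed": ["KKKKNNNNIIIIKKKKNNNN" * 5],
}
freqs = {name: aa_frequencies(seqs) for name, seqs in genomes.items()}
# Mean amino acid content across genomes, used as the reference distribution.
mean = {aa: sum(f[aa] for f in freqs.values()) / len(freqs) for aa in AMINO_ACIDS}
divergence = {name: kl_divergence(f, mean) for name, f in freqs.items()}
assert divergence["skewed"] > divergence["typical_a"] > 0
```

In this toy setting the skewed proteome sits furthest from the mean, mirroring the abstract's observation that the most skewed profiles stand out against the bulk of genomes.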
Yang, Yang; Stanković, Vladimir; Xiong, Zixiang; Zhao, Wei
2009-03-01
Following recent works on the rate region of the quadratic Gaussian two-terminal source coding problem and limit-approaching code designs, this paper examines multiterminal source coding of two correlated, i.e., stereo, video sequences to save the sum rate over independent coding of both sequences. Two multiterminal video coding schemes are proposed. In the first scheme, the left sequence of the stereo pair is coded by H.264/AVC and used at the joint decoder to facilitate Wyner-Ziv coding of the right video sequence. The first I-frame of the right sequence is successively coded by H.264/AVC Intracoding and Wyner-Ziv coding. An efficient stereo matching algorithm based on loopy belief propagation is then adopted at the decoder to produce pixel-level disparity maps between the corresponding frames of the two decoded video sequences on the fly. Based on the disparity maps, side information for both motion vectors and motion-compensated residual frames of the right sequence are generated at the decoder before Wyner-Ziv encoding. In the second scheme, source splitting is employed on top of classic and Wyner-Ziv coding for compression of both I-frames to allow flexible rate allocation between the two sequences. Experiments with both schemes on stereo video sequences using H.264/AVC, LDPC codes for Slepian-Wolf coding of the motion vectors, and scalar quantization in conjunction with LDPC codes for Wyner-Ziv coding of the residual coefficients give a slightly lower sum rate than separate H.264/AVC coding of both sequences at the same video quality.
DCT based interpolation filter for motion compensation in HEVC
NASA Astrophysics Data System (ADS)
Alshin, Alexander; Alshina, Elena; Park, Jeong Hoon; Han, Woo-Jin
2012-10-01
The High Efficiency Video Coding (HEVC) draft standard has the challenging goal of doubling coding efficiency compared to H.264/AVC. Many aspects of the traditional hybrid coding framework were improved during development of the new standard. Motion compensated prediction, in particular the interpolation filter, is one area that was improved significantly over H.264/AVC. This paper presents the details of the interpolation filter design of the draft HEVC standard. The coding efficiency improvements over the H.264/AVC interpolation filter are studied and experimental results are presented, which show a 4.0% average bitrate reduction for the Luma component and an 11.3% average bitrate reduction for the Chroma component. The coding efficiency gains are significant for some video sequences and can reach up to 21.7%.
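To make the interpolation step concrete, the sketch below filters one row of luma samples to half-sample positions. The abstract itself lists no coefficients; the 8-tap values used here are the half-sample luma taps of the published HEVC standard, and the single-stage rounding and edge-repeat padding are simplifying assumptions rather than the codec's exact two-dimensional procedure.

```python
HALF_PEL_TAPS = [-1, 4, -11, 40, 40, -11, 4, -1]  # taps sum to 64


def half_pel_row(samples, bit_depth=8):
    """Interpolate the half-sample positions between neighbouring integers.

    Output element i lies halfway between samples[i] and samples[i + 1].
    Borders are padded by repeating the edge sample (a simplification).
    """
    max_val = (1 << bit_depth) - 1
    padded = [samples[0]] * 3 + list(samples) + [samples[-1]] * 4
    out = []
    for i in range(len(samples) - 1):
        window = padded[i:i + 8]  # samples[i-3] .. samples[i+4]
        acc = sum(t * s for t, s in zip(HALF_PEL_TAPS, window))
        value = (acc + 32) >> 6   # divide by 64 with rounding
        out.append(min(max(value, 0), max_val))  # clip to the sample range
    return out


ramp = [0, 10, 20, 30, 40, 50, 60, 70]
# On a linear ramp the sample halfway between 30 and 40 comes out as 35.
assert half_pel_row(ramp)[3] == 35
```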
Sen, Diya; Chandrababunaidu, Mathu Malar; Singh, Deeksha; Sanghi, Neha; Ghorai, Arpita; Mishra, Gyan Prakash; Madduluri, Madhavi; Adhikary, Siba Prasad; Tripathy, Sucheta
2015-03-05
We report here the draft genome sequence of Scytonema millei VB511283, a cyanobacterium isolated from biofilms on the exterior of stone monuments in Santiniketan, eastern India. The draft genome is 11,627,246 bp long (11.63 Mb), with 118 scaffolds. About 9,011 protein-coding genes, 117 tRNAs, and 12 rRNAs are predicted from this assembly. Copyright © 2015 Sen et al.
Liu, Chunfeng; Niu, Chengtuo; Zheng, Feiyun; Li, Yongxian; Zhao, Yun; Yin, Xiangsheng
2017-01-01
Lager-brewing yeasts are mainly used for the production of lager beers. Illumina and PacBio-based sequence analyses revealed an approximate genome size of 22.8 Mb, with a GC content of 38.98%, for the Chinese lager-brewing yeast Saccharomyces sp. strain M14. Based on ab initio prediction, 9,970 coding genes were annotated. PMID:29074666
Kim, Ji Hyung; Kang, Do-Hyung
2016-09-15
Here, we report the draft genome of cyanobacterium Leptolyngbya sp. KIOST-1 isolated from a microalgal culture pond in South Korea. The genome consists of 13 contigs containing 6,320,172 bp, and a total of 5,327 coding sequences were predicted. This genomic information will allow further exploitation of its biotechnological potential for alimentary purposes. Copyright © 2016 Kim and Kang.
Merlaen, Britt; De Keyser, Ellen; Van Labeke, Marie-Christine
2018-01-01
The newly identified aquaporin coding sequences presented here pave the way for further insights into the plant-water relations in the commercial strawberry (Fragaria x ananassa). Aquaporins are water channel proteins that allow water to cross (intra)cellular membranes. In Fragaria x ananassa, few of them have been identified hitherto, hampering the exploration of the water transport regulation at cellular level. Here, we present new aquaporin coding sequences belonging to different subclasses: plasma membrane intrinsic proteins subtype 1 and subtype 2 (PIP1 and PIP2) and tonoplast intrinsic proteins (TIP). The classification is based on phylogenetic analysis and is confirmed by the presence of conserved residues. Substrate-specific signature sequences (SSSSs) and specificity-determining positions (SDPs) predict the substrate specificity of each new aquaporin. Expression profiling in leaves, petioles and developing fruits reveals distinct patterns, even within the same (sub)class. Expression profiles range from leaf-specific expression over constitutive expression to fruit-specific expression. Both upregulation and downregulation during fruit ripening occur. Substrate specificity and expression profiles suggest that functional specialization exists among aquaporins belonging to a different but also to the same (sub)class.
Khrustalev, Vladislav Victorovich
2009-01-01
We showed that the GC-content of nucleotide sequences coding for linear B-cell epitopes of herpes simplex virus type 1 (HSV1) glycoprotein B (gB) is higher than the GC-content of sequences coding for epitope-free regions of this glycoprotein (G + C = 73 and 64%, respectively). Linear B-cell epitopes have been predicted in HSV1 gB by the BepiPred algorithm (www.cbs.dtu.dk/services/BepiPred). Proline is an acrophilic amino acid residue (it is usually situated on the surface of protein globules, and so included in linear B-cell epitopes). Indeed, the level of proline is much higher in predicted epitopes of gB than in epitope-free regions (17.8% versus 1.8%). This amino acid is coded by GC-rich codons (CCX) that can be produced due to nucleotide substitutions caused by mutational GC-pressure. GC-pressure will also lead to the disappearance of acrophobic phenylalanine, isoleucine, methionine and tyrosine, which are coded by GC-poor codons. Results of our "in-silico directed mutagenesis" showed that single nonsynonymous substitutions in the AT to GC direction in two long epitope-free regions of gB will cause formation of new linear epitopes or elongation of previously existing epitopes flanking these regions in 25% of 539 possible cases. The calculations of GC-content and amino acid content have been performed by the CodonChanges algorithm (www.barkovsky.hotmail.ru).
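The two calculations this abstract relies on, GC-content and the enumeration of single nonsynonymous AT to GC substitutions, can be sketched as below. This is not the CodonChanges or BepiPred code; the function names and the six-base example sequence are hypothetical, and the standard genetic code is assumed.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def gc_content(seq):
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


def at_to_gc_nonsynonymous(cds):
    """List (position, old_base, new_base, old_aa, new_aa) for every single
    A/T -> G/C substitution that changes the encoded amino acid."""
    cds = cds.upper()
    hits = []
    for pos, base in enumerate(cds):
        if base not in "AT":
            continue
        offset = pos % 3
        codon = cds[pos - offset:pos - offset + 3]
        if len(codon) < 3:
            continue  # ignore a trailing partial codon
        for new_base in "GC":
            mutant = codon[:offset] + new_base + codon[offset + 1:]
            old_aa, new_aa = CODON_TABLE[codon], CODON_TABLE[mutant]
            if old_aa != new_aa:
                hits.append((pos, base, new_base, old_aa, new_aa))
    return hits


example = "TTTCCA"  # Phe (GC-poor codon) followed by Pro (CCX codon)
print(round(gc_content(example), 3))
for hit in at_to_gc_nonsynonymous(example):
    print(hit)
```

In the example, every AT to GC change in the proline codon is synonymous, while most changes in the phenylalanine codon alter the residue, which is consistent with the abstract's point that GC-pressure erodes residues coded by GC-poor codons.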
Wise, C A; Chiang, L C; Paznekas, W A; Sharma, M; Musy, M M; Ashley, J A; Lovett, M; Jabs, E W
1997-04-01
Treacher Collins Syndrome (TCS) is the most common of the human mandibulofacial dysostosis disorders. Recently, a partial TCOF1 cDNA was identified and shown to contain mutations in TCS families. Here we present the entire exon/intron genomic structure and the complete coding sequence of TCOF1. TCOF1 encodes a low complexity protein of 1,411 amino acids, whose predicted protein structure reveals repeated motifs that mirror the organization of its exons. These motifs are shared with nucleolar trafficking proteins in other species and are predicted to be highly phosphorylated by casein kinase. Consistent with this, the full-length TCOF1 protein sequence also contains putative nuclear and nucleolar localization signals. Throughout the open reading frame, we detected an additional eight mutations in TCS families and several polymorphisms. We postulate that TCS results from defects in a nucleolar trafficking protein that is critically required during human craniofacial development.
MOCAT: A Metagenomics Assembly and Gene Prediction Toolkit
Li, Junhua; Chen, Weineng; Chen, Hua; Mende, Daniel R.; Arumugam, Manimozhiyan; Pan, Qi; Liu, Binghang; Qin, Junjie; Wang, Jun; Bork, Peer
2012-01-01
MOCAT is a highly configurable, modular pipeline for fast, standardized processing of single or paired-end sequencing data generated by the Illumina platform. The pipeline uses state-of-the-art programs to quality control, map, and assemble reads from metagenomic samples sequenced at a depth of several billion base pairs, and predict protein-coding genes on assembled metagenomes. Mapping against reference databases allows for read extraction or removal, as well as abundance calculations. Relevant statistics for each processing step can be summarized into multi-sheet Excel documents and queryable SQL databases. MOCAT runs on UNIX machines and integrates seamlessly with the SGE and PBS queuing systems, commonly used to process large datasets. The open source code and modular architecture allow users to modify or exchange the programs that are utilized in the various processing steps. Individual processing steps and parameters were benchmarked and tested on artificial, real, and simulated metagenomes resulting in an improvement of selected quality metrics. MOCAT can be freely downloaded at http://www.bork.embl.de/mocat/. PMID:23082188
MOCAT: a metagenomics assembly and gene prediction toolkit.
Kultima, Jens Roat; Sunagawa, Shinichi; Li, Junhua; Chen, Weineng; Chen, Hua; Mende, Daniel R; Arumugam, Manimozhiyan; Pan, Qi; Liu, Binghang; Qin, Junjie; Wang, Jun; Bork, Peer
2012-01-01
MOCAT is a highly configurable, modular pipeline for fast, standardized processing of single or paired-end sequencing data generated by the Illumina platform. The pipeline uses state-of-the-art programs to quality control, map, and assemble reads from metagenomic samples sequenced at a depth of several billion base pairs, and predict protein-coding genes on assembled metagenomes. Mapping against reference databases allows for read extraction or removal, as well as abundance calculations. Relevant statistics for each processing step can be summarized into multi-sheet Excel documents and queryable SQL databases. MOCAT runs on UNIX machines and integrates seamlessly with the SGE and PBS queuing systems, commonly used to process large datasets. The open source code and modular architecture allow users to modify or exchange the programs that are utilized in the various processing steps. Individual processing steps and parameters were benchmarked and tested on artificial, real, and simulated metagenomes resulting in an improvement of selected quality metrics. MOCAT can be freely downloaded at http://www.bork.embl.de/mocat/.
AMS 4.0: consensus prediction of post-translational modifications in protein sequences.
Plewczynski, Dariusz; Basu, Subhadip; Saha, Indrajit
2012-08-01
We present here the 2011 update of the AutoMotif Service (AMS 4.0) that predicts a wide selection of 88 different types of single amino acid post-translational modifications (PTM) in protein sequences. The selection of experimentally confirmed modifications is acquired from the latest UniProt and Phospho.ELM databases for training. The sequence vicinity of each modified residue is represented using amino acid physico-chemical features encoded using high quality indices (HQI) obtained by automatic clustering of known indices extracted from the AAindex database. For each type of numerical representation, the method builds an ensemble of Multi-Layer Perceptron (MLP) pattern classifiers, each optimising a different objective during training (for example the recall, precision or area under the ROC curve (AUC)). The consensus is built using brainstorming technology, which combines multi-objective instances of a machine learning algorithm and the data fusion of different training object representations, in order to boost the overall prediction accuracy of conserved short sequence motifs. The performance of AMS 4.0 is compared with the accuracy of previous versions, which were constructed using single machine learning methods (artificial neural networks, support vector machine). Our software improves the average AUC score of the earlier version by close to 7% as calculated on the test datasets of all 88 PTM types. Moreover, for the selected most difficult sequence motif types it is able to improve the prediction performance by almost 32%, when compared with previously used single machine learning methods. Summarising, the brainstorming consensus meta-learning methodology on average boosts the AUC score up to around 89%, averaged over all 88 PTM types. Detailed results for single machine learning methods and the consensus methodology are also provided, together with the comparison to previously published methods and state-of-the-art software tools.
The source code and precompiled binaries of brainstorming tool are available at http://code.google.com/p/automotifserver/ under Apache 2.0 licensing.
Burden, S; Lin, Y-X; Zhang, R
2005-03-01
Although a great deal of research has been undertaken in the area of promoter prediction, prediction techniques are still not fully developed. Many algorithms tend to exhibit poor specificity, generating many false positives, or poor sensitivity. The neural network prediction program NNPP2.2 is one such example. To improve the NNPP2.2 prediction technique, the distance between the transcription start site (TSS) associated with the promoter and the translation start site (TLS) of the subsequent gene coding region has been studied for Escherichia coli K12 bacteria. An empirical probability distribution that is consistent for all E. coli promoters has been established. This information is combined with the results from NNPP2.2 to create a new technique called TLS-NNPP, which improves the specificity of promoter prediction. The technique is shown to be effective using E. coli DNA sequences; however, it is applicable to any organism for which a set of promoters has been experimentally defined. The data used in this project and the prediction results for the tested sequences can be obtained from http://www.uow.edu.au/~yanxia/E_Coli_paper/SBurden_Results.xls. Contact: alh98@uow.edu.au.
SPRINT: ultrafast protein-protein interaction prediction of the entire human interactome.
Li, Yiwei; Ilie, Lucian
2017-11-15
Proteins perform their functions usually by interacting with other proteins. Predicting which proteins interact is a fundamental problem. Experimental methods are slow, expensive, and have a high rate of error. Many computational methods have been proposed among which sequence-based ones are very promising. However, so far no such method is able to predict effectively the entire human interactome: they require too much time or memory. We present SPRINT (Scoring PRotein INTeractions), a new sequence-based algorithm and tool for predicting protein-protein interactions. We comprehensively compare SPRINT with state-of-the-art programs on seven most reliable human PPI datasets and show that it is more accurate while running orders of magnitude faster and using very little memory. SPRINT is the only sequence-based program that can effectively predict the entire human interactome: it requires between 15 and 100 min, depending on the dataset. Our goal is to transform the very challenging problem of predicting the entire human interactome into a routine task. The source code of SPRINT is freely available from https://github.com/lucian-ilie/SPRINT/ and the datasets and predicted PPIs from www.csd.uwo.ca/faculty/ilie/SPRINT/ .
Contribution to the Prediction of the Fold Code: Application to Immunoglobulin and Flavodoxin Cases
Banach, Mateusz; Prudhomme, Nicolas; Carpentier, Mathilde; Duprat, Elodie; Papandreou, Nikolaos; Kalinowska, Barbara; Chomilier, Jacques; Roterman, Irena
2015-01-01
Background: Formation of the folding nucleus of globular proteins starts with the mutual interaction of a group of hydrophobic amino acids whose close contacts allow subsequent formation and stability of the 3D structure. These early steps can be predicted by simulation of the folding process through a Monte Carlo (MC) coarse grain model in a discrete space. We previously defined MIRs (Most Interacting Residues) as the set of residues presenting a large number of non-covalent neighbour interactions during such simulation. MIRs are good candidates to define the minimal number of residues giving rise to a given fold instead of another one, although their proportion is rather high, typically 15-20% of the sequences. Having in mind experiments with two sequences of very high levels of sequence identity (up to 90%) but different folds, we combined the MIR method, which takes sequence as single input, with the “fuzzy oil drop” (FOD) model that requires a 3D structure, in order to estimate the residues coding for the fold. FOD assumes that a globular protein follows an idealised 3D Gaussian distribution of hydrophobicity density, with the maximum in the centre and minima at the surface of the “drop”. If the actual local density of hydrophobicity around a given amino acid is as high as the ideal one, then this amino acid is assigned to the core of the globular protein, and it is assumed to follow the FOD model. Therefore one obtains a distribution of the amino acids of a protein according to their agreement or rejection with the FOD model. Results: We compared and combined MIR and FOD methods to define the minimal nucleus, or keystone, of two populated folds: immunoglobulin-like (Ig) and flavodoxins (Flav). The combination of these two approaches defines some positions both predicted as a MIR and assigned as accordant with the FOD model. It is shown here that for these two folds, the intersection of the predicted sets of residues significantly differs from random selection.
It reduces the number of selected residues by each individual method and allows a reasonable agreement with experimentally determined key residues coding for the particular fold. In addition, the intersection of the two methods significantly increases the specificity of the prediction, providing a robust set of residues that constitute the folding nucleus. PMID:25915049
gCUP: rapid GPU-based HIV-1 co-receptor usage prediction for next-generation sequencing.
Olejnik, Michael; Steuwer, Michel; Gorlatch, Sergei; Heider, Dominik
2014-11-15
Next-generation sequencing (NGS) has a large potential in HIV diagnostics, and genotypic prediction models have been developed and successfully tested in the recent years. However, albeit being highly accurate, these computational models lack computational efficiency to reach their full potential. In this study, we demonstrate the use of graphics processing units (GPUs) in combination with a computational prediction model for HIV tropism. Our new model named gCUP, parallelized and optimized for GPU, is highly accurate and can classify >175 000 sequences per second on an NVIDIA GeForce GTX 460. The computational efficiency of our new model is the next step to enable NGS technologies to reach clinical significance in HIV diagnostics. Moreover, our approach is not limited to HIV tropism prediction, but can also be easily adapted to other settings, e.g. drug resistance prediction. The source code can be downloaded at http://www.heiderlab.de. Contact: d.heider@wz-straubing.de. © The Author 2014. Published by Oxford University Press.
EST-PAC a web package for EST annotation and protein sequence prediction
Strahm, Yvan; Powell, David; Lefèvre, Christophe
2006-01-01
With the decreasing cost of DNA sequencing technology and the vast diversity of biological resources, researchers increasingly face the basic challenge of annotating a larger number of expressed sequence tags (ESTs) from a variety of species. This typically consists of a series of repetitive tasks, which should be automated and easy to use. The results of these annotation tasks need to be stored and organized in a consistent way. All these operations should be self-installing, platform independent, easy to customize and amenable to using distributed bioinformatics resources available on the Internet. In order to address these issues, we present EST-PAC, a web-oriented multi-platform software package for expressed sequence tag (EST) annotation. EST-PAC provides a solution for the administration of EST and protein sequence annotations accessible through a web interface. Three aspects of EST annotation are automated: 1) searching local or remote biological databases for sequence similarities using Blast services, 2) predicting protein coding sequence from EST data, and 3) annotating predicted protein sequences with functional domain predictions. In practice, EST-PAC integrates the BLASTALL suite, EST-Scan2 and HMMER in a relational database system accessible through a simple web interface. EST-PAC also takes advantage of the relational database to allow consistent storage, powerful queries of results, and management of the annotation process. The system allows users to customize annotation strategies and provides an open-source data-management environment for research and education in bioinformatics. PMID:17147782
Garcia Lopez, Sebastian; Kim, Philip M.
2014-01-01
Advances in sequencing have led to a rapid accumulation of mutations, some of which are associated with diseases. However, to draw mechanistic conclusions, a biochemical understanding of these mutations is necessary. For coding mutations, accurate prediction of significant changes in either the stability of proteins or their affinity to their binding partners is required. Traditional methods have used semi-empirical force fields, while newer methods employ machine learning of sequence and structural features. Here, we show how combining both of these approaches leads to a marked boost in accuracy. We introduce ELASPIC, a novel ensemble machine learning approach that is able to predict stability effects upon mutation in both domain cores and domain-domain interfaces. We combine semi-empirical energy terms, sequence conservation, and a wide variety of molecular details with a Stochastic Gradient Boosting of Decision Trees (SGB-DT) algorithm. The accuracy of our predictions surpasses existing methods by a considerable margin, achieving correlation coefficients of 0.77 for stability and 0.75 for affinity predictions. Notably, we integrated homology modeling to enable proteome-wide prediction and show that accurate prediction on modeled structures is possible. Lastly, ELASPIC showed significant differences between various types of disease-associated mutations, as well as between disease and common neutral mutations. Unlike pure sequence-based prediction methods that try to predict phenotypic effects of mutations, our predictions unravel the molecular details governing the protein instability, and help us better understand the molecular causes of diseases. PMID:25243403
Beyond the Triplet Code: Context Cues Transform Translation.
Brar, Gloria A
2016-12-15
The elucidation of the genetic code remains among the most influential discoveries in biology. While innumerable studies have validated the general universality of the code and its value in predicting and analyzing protein coding sequences, established and emerging work has also suggested that full genome decryption may benefit from a greater consideration of a codon's neighborhood within an mRNA than has been broadly applied. This Review examines the evidence for context cues in translation, with a focus on several recent studies that reveal broad roles for mRNA context in programming translation start sites, the rate of translation elongation, and stop codon identity. Copyright © 2016 Elsevier Inc. All rights reserved.
Meitinger, T; Meindl, A; Bork, P; Rost, B; Sander, C; Haasemann, M; Murken, J
1993-12-01
The X-linked gene for Norrie disease, which is characterized by blindness, deafness and mental retardation, has been cloned recently. This gene has been thought to code for a putative extracellular factor; its predicted amino acid sequence is homologous to the C-terminal domain of diverse extracellular proteins. Sequence pattern searches and three-dimensional modelling now suggest that the Norrie disease protein (NDP) has a tertiary structure similar to that of transforming growth factor beta (TGF beta). Our model identifies NDP as a member of an emerging family of growth factors containing a cystine knot motif, with direct implications for the physiological role of NDP. The model also sheds light on sequence-related domains such as the C-terminal domain of mucins and of von Willebrand factor.
CaMELS: In silico prediction of calmodulin binding proteins and their binding sites.
Abbasi, Wajid Arshad; Asif, Amina; Andleeb, Saiqa; Minhas, Fayyaz Ul Amir Afsar
2017-09-01
Due to Ca2+-dependent binding and the sequence diversity of Calmodulin (CaM) binding proteins, identifying CaM interactions and binding sites in the wet-lab is tedious and costly. Therefore, computational methods for this purpose are crucial to the design of such wet-lab experiments. We present an algorithm suite called CaMELS (CalModulin intEraction Learning System) for predicting proteins that interact with CaM as well as their binding sites using sequence information alone. CaMELS offers state of the art accuracy for both CaM interaction and binding site prediction and can aid biologists in studying CaM binding proteins. For CaM interaction prediction, CaMELS uses protein sequence features coupled with a large-margin classifier. CaMELS models the binding site prediction problem using multiple instance machine learning with a custom optimization algorithm which allows more effective learning over imprecisely annotated CaM-binding sites during training. CaMELS has been extensively benchmarked using a variety of data sets, mutagenic studies, proteome-wide Gene Ontology enrichment analyses and protein structures. Our experiments indicate that CaMELS outperforms simple motif-based search and other existing methods for interaction and binding site prediction. We have also found that the whole sequence of a protein, rather than just its binding site, is important for predicting its interaction with CaM. Using the machine learning model in CaMELS, we have identified important features of protein sequences for CaM interaction prediction as well as characteristic amino acid sub-sequences and their relative position for identifying CaM binding sites. Python code for training and evaluating CaMELS together with a webserver implementation is available at the URL: http://faculty.pieas.edu.pk/fayyaz/software.html#camels. © 2017 Wiley Periodicals, Inc.
Nucleotide sequence of the gag gene and gag-pol junction of feline leukemia virus.
Laprevotte, I; Hampe, A; Sherr, C J; Galibert, F
1984-01-01
The nucleotide sequence of the gag gene of feline leukemia virus and its flanking sequences were determined and compared with the corresponding sequences of two strains of feline sarcoma virus and with that of the Moloney strain of murine leukemia virus. A high degree of nucleotide sequence homology between the feline leukemia virus and murine leukemia virus gag genes was observed, suggesting that retroviruses of domestic cats and laboratory mice have a common, proximal evolutionary progenitor. The predicted structure of the complete feline leukemia virus gag gene precursor suggests that the translation of nonglycosylated and glycosylated gag gene polypeptides is initiated at two different AUG codons. These initiator codons fall in the same reading frame and are separated by a 222-base-pair segment which encodes an amino terminal signal peptide. The nucleotide sequence predicts the order of amino acids in each of the individual gag-coded proteins (p15, p12, p30, p10), all of which derive from the gag gene precursor. Stable stem-and-loop secondary structures are proposed for two regions of viral RNA. The first falls within sequences at the 5' end of the viral genome, together with adjacent palindromic sequences which may play a role in dimer linkage of RNA subunits. The second includes coding sequences at the gag-pol junction and is proposed to be involved in translation of the pol gene product. Sequence analysis of the latter region shows that the gag and pol genes are translated in different reading frames. Classical consensus splice donor and acceptor sequences could not be localized to regions which would permit synthesis of the expected gag-pol precursor protein. Alternatively, we suggest that the pol gene product (RNA-dependent DNA polymerase) could be translated by a frameshift suppressing mechanism which could involve cleavage modification of stems and loops in a manner similar to that observed in tRNA processing. PMID:6328019
Budak, Hikmet; Kantar, Melda
2015-07-01
MicroRNAs (miRNAs) are small, endogenous, non-coding RNA molecules that regulate gene expression at the post-transcriptional level. As high-throughput next generation sequencing (NGS) and Big Data rapidly accumulate for various species, efforts for in silico identification of miRNAs intensify. Surprisingly, the effect of the input genomics sequence on the robustness of miRNA prediction was not evaluated in detail to date. In the present study, we performed a homology-based miRNA and isomiRNA prediction of the 5D chromosome of bread wheat progenitor, Aegilops tauschii, using two distinct sequence data sets as input: (1) raw sequence reads obtained from 454-GS FLX Titanium sequencing platform and (2) an assembly constructed from these reads. We also compared this method with a number of available plant sequence datasets. We report here the identification of 62 and 22 miRNAs from raw reads and the assembly, respectively, of which 16 were predicted with high confidence from both datasets. While raw reads promoted sensitivity with the high number of miRNAs predicted, 55% (12 out of 22) of the assembly-based predictions were supported by previous observations, bringing specificity forward compared to the read-based predictions, of which only 37% were supported. Importantly, raw reads could identify several repeat-related miRNAs that could not be detected with the assembly. However, raw reads could not capture 6 miRNAs, for which the stem-loops could only be covered by the relatively longer sequences from the assembly. In summary, the comparison of miRNA datasets obtained by these two strategies revealed that utilization of raw reads, as well as assemblies for in silico prediction, have distinct advantages and disadvantages. Consideration of these important nuances can benefit future miRNA identification efforts in the current age of NGS and Big Data driven life sciences innovation.
Disruption of hierarchical predictive coding during sleep
Strauss, Melanie; Sitt, Jacobo D.; King, Jean-Remi; Elbaz, Maxime; Azizi, Leila; Buiatti, Marco; Naccache, Lionel; van Wassenhove, Virginie; Dehaene, Stanislas
2015-01-01
When presented with an auditory sequence, the brain acts as a predictive-coding device that extracts regularities in the transition probabilities between sounds and detects unexpected deviations from these regularities. Does such prediction require conscious vigilance, or does it continue to unfold automatically in the sleeping brain? The mismatch negativity and P300 components of the auditory event-related potential, reflecting two steps of auditory novelty detection, have been inconsistently observed in the various sleep stages. To clarify whether these steps remain during sleep, we recorded simultaneous electroencephalographic and magnetoencephalographic signals during wakefulness and during sleep in normal subjects listening to a hierarchical auditory paradigm including short-term (local) and long-term (global) regularities. The global response, reflected in the P300, vanished during sleep, in line with the hypothesis that it is a correlate of high-level conscious error detection. The local mismatch response remained across all sleep stages (N1, N2, and REM sleep), but with an incomplete structure; compared with wakefulness, a specific peak reflecting prediction error vanished during sleep. Those results indicate that sleep leaves initial auditory processing and passive sensory response adaptation intact, but specifically disrupts both short-term and long-term auditory predictive coding. PMID:25737555
Kulmanov, Maxat; Khan, Mohammed Asif; Hoehndorf, Robert; Wren, Jonathan
2018-02-15
A large number of protein sequences are becoming available through the application of novel high-throughput sequencing technologies. Experimental functional characterization of these proteins is time-consuming and expensive, and is often only done rigorously for few selected model organisms. Computational function prediction approaches have been suggested to fill this gap. The functions of proteins are classified using the Gene Ontology (GO), which contains over 40 000 classes. Additionally, proteins have multiple functions, making function prediction a large-scale, multi-class, multi-label problem. We have developed a novel method to predict protein function from sequence. We use deep learning to learn features from protein sequences as well as a cross-species protein-protein interaction network. Our approach specifically outputs information in the structure of the GO and utilizes the dependencies between GO classes as background information to construct a deep learning model. We evaluate our method using the standards established by the Computational Assessment of Function Annotation (CAFA) and demonstrate a significant improvement over baseline methods such as BLAST, in particular for predicting cellular locations. Web server: http://deepgo.bio2vec.net, Source code: https://github.com/bio-ontology-research-group/deepgo. robert.hoehndorf@kaust.edu.sa. Supplementary data are available at Bioinformatics online. © The Author(s) 2017. Published by Oxford University Press.
Current Research on Non-Coding Ribonucleic Acid (RNA).
Wang, Jing; Samuels, David C; Zhao, Shilin; Xiang, Yu; Zhao, Ying-Yong; Guo, Yan
2017-12-05
Non-coding ribonucleic acid (RNA) has without a doubt captured the interest of biomedical researchers. The ability to screen the entire human genome with high-throughput sequencing technology has greatly enhanced the identification, annotation and prediction of the functionality of non-coding RNAs. In this review, we discuss the current landscape of non-coding RNA research and quantitative analysis. Non-coding RNA will be categorized into two major groups by size: long non-coding RNAs and small RNAs. In long non-coding RNA, we discuss regular long non-coding RNA, pseudogenes and circular RNA. In small RNA, we discuss miRNA, transfer RNA, piwi-interacting RNA, small nucleolar RNA, small nuclear RNA, Y RNA, signal recognition particle RNA, and 7SK RNA. We elaborate on the origin, detection method, potential association with disease, putative functional mechanisms, and public resources for these non-coding RNAs. We aim to provide readers with a complete overview of non-coding RNAs and incite additional interest in non-coding RNA research.
Predictive Coding Strategies for Developmental Neurorobotics
Park, Jun-Cheol; Lim, Jae Hyun; Choi, Hansol; Kim, Dae-Shik
2012-01-01
In recent years, predictive coding strategies have been proposed as a possible means by which the brain might make sense of the truly overwhelming amount of sensory data available to it at any given moment. Instead of the raw data, the brain is hypothesized to guide its actions by assigning causal beliefs to the observed error between what it expects to happen and what actually happens. In this paper, we present a variety of developmental neurorobotics experiments in which minimalist prediction error-based encoding strategies are utilized to elucidate the emergence of infant-like behavior in humanoid robotic platforms. Our approaches will first be naively Piagetian, then move on to more Vygotskian ideas. More specifically, we will investigate how simple forms of infant learning, such as motor sequence generation, object permanence, and imitation learning, may arise if minimizing prediction errors is used as the objective function. PMID:22586416
NASA Astrophysics Data System (ADS)
Tu, Shiqi; Yuan, Guo-Cheng; Shao, Zhen
2017-01-01
Recently, long non-coding RNAs (lncRNAs) have emerged as an important class of molecules involved in many cellular processes. One of their primary functions is to shape the epigenetic landscape through interactions with chromatin-modifying proteins. However, mechanisms contributing to the specificity of such interactions remain poorly understood. Here we took the human and mouse lncRNAs that were experimentally determined to have physical interactions with Polycomb repressive complex 2 (PRC2), and systematically investigated the sequence features of these lncRNAs by developing a new computational pipeline for sequence composition analysis, in which each sequence is considered as a series of transitions between adjacent nucleotides. Through that, PRC2-binding lncRNAs were found to be associated with a set of distinctive and evolutionarily conserved sequence features, which can be utilized to distinguish them from the others with considerable accuracy. We further identified fragments of PRC2-binding lncRNAs that are enriched with these sequence features, and found they show strong PRC2-binding signals and are more highly conserved across species than the other parts, implying their functional importance.
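The idea of treating a sequence as a series of transitions between adjacent nucleotides can be illustrated with a minimal sketch. This is not the authors' pipeline; the function name transition_frequencies and the choice of plain relative frequencies as features are assumptions made here for illustration only.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGU"  # RNA alphabet; DNA input is converted by mapping T to U


def transition_frequencies(seq):
    """Return the relative frequency of each of the 16 adjacent-nucleotide
    transitions (e.g. "AC" means A followed by C) in seq.

    Hypothetical helper for illustration; pairs containing characters
    outside ACGU (such as N) are ignored.
    """
    seq = seq.upper().replace("T", "U")
    pairs = Counter(
        a + b for a, b in zip(seq, seq[1:]) if a in ALPHABET and b in ALPHABET
    )
    total = sum(pairs.values())
    return {
        a + b: (pairs[a + b] / total if total else 0.0)
        for a, b in product(ALPHABET, repeat=2)
    }


if __name__ == "__main__":
    features = transition_frequencies("AUGGCUAAUG")
    for pair, freq in sorted(features.items()):
        print(pair, round(freq, 3))
```

A 16-dimensional vector of this kind could then be passed to any standard classifier; which features and classifier the study actually used is not stated in the abstract.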
Bulashevska, Alla; Eils, Roland
2006-06-14
The subcellular location of a protein is closely related to its function. It would be worthwhile to develop a method to predict the subcellular location for a given protein when only the amino acid sequence of the protein is known. Although many efforts have been made to predict subcellular location from sequence information only, there is the need for further research to improve the accuracy of prediction. A novel method called HensBC is introduced to predict protein subcellular location. HensBC is a recursive algorithm which constructs a hierarchical ensemble of classifiers. The classifiers used are Bayesian classifiers based on Markov chain models. We tested our method on six various datasets; among them are Gram-negative bacteria dataset, data for discriminating outer membrane proteins and apoptosis proteins dataset. We observed that our method can predict the subcellular location with high accuracy. Another advantage of the proposed method is that it can improve the accuracy of the prediction of some classes with few sequences in training and is therefore useful for datasets with imbalanced distribution of classes. This study introduces an algorithm which uses only the primary sequence of a protein to predict its subcellular location. The proposed recursive scheme represents an interesting methodology for learning and combining classifiers. The method is computationally efficient and competitive with the previously reported approaches in terms of prediction accuracies as empirical results indicate. The code for the software is available upon request.
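A single Bayesian classifier based on Markov chain models, the building block mentioned in the abstract above, can be sketched as follows. This is a simplified illustration and not HensBC itself: it omits the recursive hierarchical ensemble, and the function names, the first-order chain, and the add-one smoothing are all assumptions introduced here.

```python
import math

AMINO = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def train_chain(seqs):
    """Fit a first-order Markov chain (log transition probabilities)
    to a list of protein sequences, with add-one smoothing (an assumption)."""
    counts = {a: {b: 1.0 for b in AMINO} for a in AMINO}
    for s in seqs:
        for a, b in zip(s, s[1:]):
            if a in counts and b in counts[a]:
                counts[a][b] += 1.0
    chain = {}
    for a in AMINO:
        total = sum(counts[a].values())
        chain[a] = {b: math.log(counts[a][b] / total) for b in AMINO}
    return chain


def log_likelihood(chain, s):
    """Sum of log transition probabilities along s (non-standard residues skipped)."""
    return sum(chain[a][b] for a, b in zip(s, s[1:]) if a in chain and b in chain[a])


def classify(chains, priors, s):
    """Return the class with the highest log posterior: log prior + log likelihood."""
    return max(chains, key=lambda c: math.log(priors[c]) + log_likelihood(chains[c], s))


if __name__ == "__main__":
    # Toy training data, invented purely for illustration.
    chains = {
        "classA": train_chain(["AAAAAA", "AAAAAA"]),
        "classB": train_chain(["LLLLLL"]),
    }
    priors = {"classA": 0.5, "classB": 0.5}
    print(classify(chains, priors, "AAAA"))
```

Each class gets its own chain, and a query sequence is assigned to the class under which it is most probable after weighting by the class prior.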
Swarnkar, Mohit Kumar; Vyas, Pratibha; Rahi, Praveen; Thakur, Rishu; Thakur, Namika; Singh, Anil Kumar
2015-01-01
The complete genome sequence of 6.45 Mb is reported here for Pseudomonas trivialis strain IHBB745 (MTCC 5336), which is an efficient, stress-tolerant, and broad-spectrum plant growth-promoting rhizobacterium. The gene-coding clusters predicted the genes for phosphate solubilization, siderophore production, 1-aminocyclopropane-1-carboxylate (ACC) deaminase activity, indole-3-acetic acid (IAA) production, and stress response. PMID:26337878
Liu, Chunfeng; Li, Qi; Niu, Chengtuo; Zheng, Feiyun; Li, Yongxian; Zhao, Yun; Yin, Xiangsheng
2017-10-26
Lager-brewing yeasts are mainly used for the production of lager beers. Illumina and PacBio-based sequence analyses revealed an approximate genome size of 22.8 Mb, with a GC content of 38.98%, for the Chinese lager-brewing yeast Saccharomyces sp. strain M14. Based on ab initio prediction, 9,970 coding genes were annotated. Copyright © 2017 Liu et al.
Cocho, Germinal; Miramontes, Pedro; Mansilla, Ricardo; Li, Wentian
2014-12-01
We examine the relationship between exponential correlation functions and Markov models in a bacterial genome in detail. Despite the well known fact that Markov models generate sequences with correlation function that decays exponentially, simply constructed Markov models based on nearest-neighbor dimer (first-order), trimer (second-order), up to hexamer (fifth-order), and treating the DNA sequence as being homogeneous all fail to predict the value of exponential decay rate. Even reading-frame-specific Markov models (both first- and fifth-order) could not explain the fact that the exponential decay is very slow. Starting with the in-phase coding-DNA-sequence (CDS), we investigated correlation within a fixed-codon-position subsequence, and in artificially constructed sequences by packing CDSs with out-of-phase spacers, as well as altering CDS length distribution by imposing an upper limit. From these targeted analyses, we conclude that the correlation in the bacterial genomic sequence is mainly due to a mixing of heterogeneous statistics at different codon positions, and the decay of correlation is due to the possible out-of-phase between neighboring CDSs. There are also small contributions to the correlation from bases at the same codon position, as well as by non-coding sequences. These show that the seemingly simple exponential correlation functions in bacterial genome hide a complexity in correlation structure which is not suitable for a modeling by Markov chain in a homogeneous sequence. Other results include: use of the (absolute value) second largest eigenvalue to represent the 16 correlation functions and the prediction of a 10-11 base periodicity from the hexamer frequencies. Copyright © 2014 Elsevier Ltd. All rights reserved.
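The link between a Markov model and exponential correlation decay can be made concrete with a deliberately simplified sketch. It assumes a two-letter purine/pyrimidine (R/Y) recoding and a homogeneous first-order chain, which is not the four-letter analysis of the paper; under that assumption the transition matrix is 2x2, its eigenvalues are 1 and (trace - 1), and correlations at lag k decay as the second eigenvalue raised to the power k. All function names are hypothetical.

```python
import math

RY_CODE = {"A": 0, "G": 0, "C": 1, "T": 1}  # purine = 0, pyrimidine = 1


def ry_transition_matrix(seq):
    """Estimate a 2x2 first-order transition matrix over R/Y states.

    Assumes both states occur at least once as a predecessor;
    characters outside ACGT are dropped before counting.
    """
    states = [RY_CODE[b] for b in seq.upper() if b in RY_CODE]
    counts = [[0, 0], [0, 0]]
    for a, b in zip(states, states[1:]):
        counts[a][b] += 1
    return [[c / sum(row) for c in row] for row in counts]


def second_eigenvalue(P):
    """For a 2x2 row-stochastic matrix the eigenvalues are 1 and trace - 1."""
    return P[0][0] + P[1][1] - 1.0


def decay_length(P):
    """Correlation length xi such that |lambda_2|**k == exp(-k / xi)."""
    lam = abs(second_eigenvalue(P))
    if lam == 0.0:
        return 0.0
    if lam >= 1.0:
        return float("inf")
    return -1.0 / math.log(lam)


if __name__ == "__main__":
    P = ry_transition_matrix("AACCAACCAGGTTCCA")
    print(P, second_eigenvalue(P), decay_length(P))
```

Comparing the decay length predicted this way with the one measured directly from the sequence is the kind of check the abstract describes; the paper reports that homogeneous Markov models of this sort fail to reproduce the slow decay seen in real bacterial genomes.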
Genomic analysis of organismal complexity in the multicellular green alga Volvox carteri
DOE Office of Scientific and Technical Information (OSTI.GOV)
Prochnik, Simon E.; Umen, James; Nedelcu, Aurora
2010-07-01
Analysis of the Volvox carteri genome reveals that this green alga's increased organismal complexity and multicellularity are associated with modifications in protein families shared with its unicellular ancestor, and not with large-scale innovations in protein coding capacity. The multicellular green alga Volvox carteri and its morphologically diverse close relatives (the volvocine algae) are uniquely suited for investigating the evolution of multicellularity and development. We sequenced the 138 Mb genome of V. carteri and compared its ~14,500 predicted proteins to those of its unicellular relative, Chlamydomonas reinhardtii. Despite fundamental differences in organismal complexity and life history, the two species have similar protein-coding potentials, and few species-specific protein-coding gene predictions. Interestingly, volvocine algal-specific proteins are enriched in Volvox, including those associated with an expanded and highly compartmentalized extracellular matrix. Our analysis shows that increases in organismal complexity can be associated with modifications of lineage-specific proteins rather than large-scale invention of protein-coding capacity.
DOE R&D Accomplishments Database
Liang, X.
1998-06-10
The genome of Methanococcus jannaschii has been sequenced completely and has been found to contain approximately 1,770 predicted protein-coding regions. When these coding regions are expressed and how their expression is regulated, however, remain open questions. In this work, mass spectrometry was combined with two-dimensional gel electrophoresis to identify which proteins the genes produce under different growth conditions, and thus investigate the regulation of genes responsible for functions characteristic of this thermophilic representative of the methanogenic Archaea.
The dependence of frequency distributions on multiple meanings of words, codes and signs
NASA Astrophysics Data System (ADS)
Yan, Xiaoyong; Minnhagen, Petter
2018-01-01
The dependence of the frequency distributions due to multiple meanings of words in a text is investigated by deleting letters. By coding the words with fewer letters the number of meanings per coded word increases. This increase is measured and used as an input in a predictive theory. For a text written in English, the word-frequency distribution is broad and fat-tailed, whereas if the words are only represented by their first letter the distribution becomes exponential. Both distributions are well predicted by the theory, as is the whole sequence obtained by consecutively representing the words by the first L = 6, 5, 4, 3, 2, 1 letters. Comparisons of texts written by Chinese characters and the same texts written by letter-codes are made and the similarity of the corresponding frequency distributions is interpreted as a consequence of the multiple meanings of Chinese characters. This further implies that the difference of the shape for word-frequencies for an English text written by letters and a Chinese text written by Chinese characters is due to the coding and not to the language per se.
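The letter-deletion coding described above can be sketched in a few lines. This is only an illustration of the coding step, not the predictive theory; the function names and the simple tokenization (lowercase runs of a to z) are assumptions made here.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase alphabetic words (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def coded_word_frequencies(text, L):
    """Count occurrences of each word after truncation to its first L letters."""
    return Counter(word[:L] for word in tokenize(text))


def meanings_per_code(text, L):
    """Average number of distinct original words sharing one L-letter code."""
    words = set(tokenize(text))
    codes = {word[:L] for word in words}
    return len(words) / len(codes) if codes else 0.0


if __name__ == "__main__":
    sample = "the then there cat"
    for L in (6, 5, 4, 3, 2, 1):
        print(L, dict(coded_word_frequencies(sample, L)), meanings_per_code(sample, L))
```

As L shrinks, more distinct words collapse onto the same code, so the number of meanings per coded word grows; this is the quantity the abstract says is measured and fed into the theory.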
Manku, H K; Dhanoa, J K; Kaur, S; Arora, J S; Mukhopadhyay, C S
2017-10-01
MicroRNAs (miRNAs) are small (19-25 bases long), non-coding RNAs that regulate post-transcriptional gene expression by cleaving targeted mRNAs in several eukaryotes. The miRNAs play vital roles in multiple biological and metabolic processes, including developmental timing, signal transduction, cell maintenance and differentiation, diseases and cancers. Experimental identification of microRNAs is expensive and labor-intensive. Alternatively, computational approaches for predicting putative miRNAs from genomic or exomic sequences rely on features of miRNAs, viz. secondary structures, sequence conservation, minimum free energy index (MFEI), etc. To date, not a single miRNA has been identified in the bubaline (Bubalus bubalis), which is an economically important livestock species. The present study aims at predicting the putative miRNAs of buffalo using a comparative computational approach from buffalo whole genome shotgun sequencing data (INSDC: AWWX00000000.1). The sequences were blasted against the known mammalian miRNAs. The obtained miRNAs were then passed through a series of filtration criteria to obtain the set of predicted (putative and novel) bubaline miRNAs. Eight miRNAs were selected based on lowest E-value and validated by real-time PCR (SYBR green chemistry) using RNU6 as endogenous control. The results from different trials of real-time PCR show that out of the selected 8 miRNAs, only 2 (hsa-miR-1277-5p; bta-miR-2285b) are not expressed in bubaline PBMCs. The potential target genes based on their sequence complementarities were then predicted using miRanda. This work is the first report on prediction of bubaline miRNA from whole genome sequencing data followed by experimental validation. The finding could pave the way to future studies in economically important traits in buffalo. Copyright © 2017 Elsevier Ltd. All rights reserved.
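The minimum free energy index (MFEI) named among the filtering features can be computed from a precursor sequence and its folding energy. The sketch below uses a definition commonly found in the plant and animal miRNA prediction literature, MFEI = AMFE / (G+C)%, with AMFE = |MFE| / length x 100; whether this study used exactly this form is an assumption, and the MFE value must come from an external RNA folding tool that is not shown here.

```python
def gc_percent(seq):
    """Percentage of G and C bases in seq."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def mfei(mfe_kcal_per_mol, seq):
    """Minimum free energy index under the commonly used definition:

        AMFE = |MFE| / length * 100
        MFEI = AMFE / (G+C)%

    The absolute value of MFE is taken so that the index is positive
    (an assumption about sign convention).
    """
    gc = gc_percent(seq)
    if gc == 0.0:
        raise ValueError("sequence has no G or C; MFEI is undefined")
    amfe = abs(mfe_kcal_per_mol) / len(seq) * 100.0
    return amfe / gc


if __name__ == "__main__":
    # Hypothetical precursor of length 100 with 50% GC and MFE of -40 kcal/mol.
    precursor = "GC" * 25 + "AU" * 25
    print(mfei(-40.0, precursor))
```

Candidate hairpins are typically kept only if the index exceeds some threshold; the threshold applied in this study is not given in the abstract.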
ChIP-seq Accurately Predicts Tissue-Specific Activity of Enhancers
DOE Office of Scientific and Technical Information (OSTI.GOV)
Visel, Axel; Blow, Matthew J.; Li, Zirong
2009-02-01
A major yet unresolved quest in decoding the human genome is the identification of the regulatory sequences that control the spatial and temporal expression of genes. Distant-acting transcriptional enhancers are particularly challenging to uncover since they are scattered amongst the vast non-coding portion of the genome. Evolutionary sequence constraint can facilitate the discovery of enhancers, but fails to predict when and where they are active in vivo. Here, we performed chromatin immunoprecipitation with the enhancer-associated protein p300, followed by massively-parallel sequencing, to map several thousand in vivo binding sites of p300 in mouse embryonic forebrain, midbrain, and limb tissue. We tested 86 of these sequences in a transgenic mouse assay, which in nearly all cases revealed reproducible enhancer activity in those tissues predicted by p300 binding. Our results indicate that in vivo mapping of p300 binding is a highly accurate means for identifying enhancers and their associated activities and suggest that such datasets will be useful to study the role of tissue-specific enhancers in human biology and disease on a genome-wide scale.
Cryptic tRNAs in chaetognath mitochondrial genomes.
Barthélémy, Roxane-Marie; Seligmann, Hervé
2016-06-01
The chaetognaths constitute a small and enigmatic phylum of little marine invertebrates. Both nuclear and mitochondrial genomes have numerous originalities, some phylum-specific. Until recently, their mitogenomes seemed to contain only one tRNA gene (trnMet), but a recent study found two and four tRNA genes in two chaetognath mitogenomes. Moreover, two conspecific mitogenomes apparently have different tRNA gene numbers (one and two). Reanalyses of the five available complete chaetognath mitogenomes with the tRNAscan-SE and ARWEN software suggest numerous additional tRNA genes of different types. Their total number never reaches the 22 found in most other invertebrates using that genetic code. Predicted error compensation between codon-anticodon mismatch and tRNA misacylation suggests translational activity by tRNAs predicted solely according to secondary structure for tRNAs predicted by tRNAscan-SE, not ARWEN. Numbers of predicted stop-suppressor (antitermination) tRNAs coevolve with predicted overlapping, frameshifted protein coding genes including stop codons. Sequence alignments in secondary structure prediction with non-chaetognath tRNAs suggest that the most likely functional tRNAs are in intergenic regions, as regular mt-tRNAs. Because intergenic regions are usually short, tRNA sequences generally overlap partially with flanking genes. Some tRNA pairs seem templated by sense-antisense strands. Moreover, 16S rRNA genes, but not 12S rRNAs, appear as tRNA nurseries, as previously suggested for multifunctional ribosomal-like protogenomes. Copyright © 2016 Elsevier Ltd. All rights reserved.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gardner, S N; Clague, D S; Vandersall, J A
2006-02-23
The polymerase chain reaction (PCR) stands among the keystone technologies for analysis of biological sequence data. PCR is used to amplify DNA, to generate many copies from as little as a single template. This is essential, for example, in processing forensic DNA samples, pathogen detection in clinical or biothreat surveillance applications, and medical genotyping for diagnosis and treatment of disease. It is used in virtually every laboratory doing molecular, cellular, genetic, ecologic, forensic, or medical research. Despite its ubiquity, we lack the precise predictive capability that would enable detailed optimization of PCR reaction dynamics. In this LDRD, we proposed to develop Virtual PCR (VPCR) software, a computational method to model the kinetic, thermodynamic, and biological processes of PCR reactions. Given a successful completion, these tools will allow us to predict both the sequences and concentrations of all species that are amplified during PCR. The ability to answer the following questions will allow us both to optimize the PCR process and interpret the PCR results: What products are amplified when sequence mixtures are present, containing multiple, closely related targets and multiplexed primers, which may hybridize with sequence mismatches? What are the effects of time, temperature, and DNA concentrations on the concentrations of products? A better understanding of these issues will improve the design and interpretation of PCR reactions. The status of the VPCR project after 1.5 years of funding is consistent with the goals of the overall project which was scoped for 3 years of funding. At half way through the projected timeline of the project we have an early beta version of the VPCR code. We have begun investigating means to improve the robustness of the code, performed preliminary experiments to test the code and begun drafting manuscripts for publication.
Although an experimental protocol for testing the code was developed, the preliminary experiments were tainted by contaminated products received from the manufacturer. Much knowledge has been gained in the development of the code thus far, but without final debugging, increasing its robustness and verifying it against experimental results, the papers which we have drafted to share our findings still require the final data necessary for publication. The following sections summarize our final progress on VPCR as it stands after 1.5 years of effort on an ambitious project scoped for a 3 year period. We have more details of the methods than are provided here, but would like to have legal protection in place before releasing them. The result of this project, a suite of programs that predict PCR products as a function of reaction conditions and sequences, will be used to address outstanding questions in pathogen detection and forensics at LLNL. VPCR should enable scientists to optimize PCR protocols in terms of time, temperature, ion concentration, and primer sequences and concentrations, and to estimate products and error rates in advance of performing experiments. Our proposed capabilities are well ahead of all currently available technologies, which do not model non-equilibrium kinetics, polymerase extension, or predict multiple or undesired PCR products. We are currently seeking DHS funding to complete the project, at which time licensing opportunities will be explored, an updated patent application will be prepared, and a publication will be submitted. A provisional and a full patent application have already been filed (1).
Region-Based Prediction for Image Compression in the Cloud.
Begaint, Jean; Thoreau, Dominique; Guillotel, Philippe; Guillemot, Christine
2018-04-01
Thanks to the increasing number of images stored in the cloud, external image similarities can be leveraged to efficiently compress images by exploiting inter-images correlations. In this paper, we propose a novel image prediction scheme for cloud storage. Unlike current state-of-the-art methods, we use a semi-local approach to exploit inter-image correlation. The reference image is first segmented into multiple planar regions determined from matched local features and super-pixels. The geometric and photometric disparities between the matched regions of the reference image and the current image are then compensated. Finally, multiple references are generated from the estimated compensation models and organized in a pseudo-sequence to differentially encode the input image using classical video coding tools. Experimental results demonstrate that the proposed approach yields significant rate-distortion performance improvements compared with the current image inter-coding solutions such as high efficiency video coding.
Odor Coding by a Mammalian Receptor Repertoire
Saito, Harumi; Chi, Qiuyi; Zhuang, Hanyi; Matsunami, Hiro; Mainland, Joel D.
2009-01-01
Deciphering olfactory encoding requires a thorough description of the ligands that activate each odorant receptor (OR). In mammalian systems, however, ligands are known for fewer than 50 of over 1400 human and mouse ORs, greatly limiting our understanding of olfactory coding. We performed high-throughput screening of 93 odorants against 464 ORs expressed in heterologous cells and identified agonists for 52 mouse and 10 human ORs. We used the resulting interaction profiles to develop a predictive model relating physicochemical odorant properties, OR sequences, and their interactions. Our results provide a basis for translating odorants into receptor neuron responses and unraveling mammalian odor coding. PMID:19261596
Identifying functionally informative evolutionary sequence profiles.
Gil, Nelson; Fiser, Andras
2018-04-15
Multiple sequence alignments (MSAs) can provide essential input to many bioinformatics applications, including protein structure prediction and functional annotation. However, the optimal selection of sequences to obtain biologically informative MSAs for such purposes is poorly explored, and has traditionally been performed manually. We present Selection of Alignment by Maximal Mutual Information (SAMMI), an automated, sequence-based approach to objectively select an optimal MSA from a large set of alternatives sampled from a general sequence database search. The hypothesis of this approach is that the mutual information among MSA columns will be maximal for those MSAs that contain the most diverse set possible of the most structurally and functionally homogeneous protein sequences. SAMMI was tested to select MSAs for functional site residue prediction by analysis of conservation patterns on a set of 435 proteins obtained from protein-ligand (peptides, nucleic acids and small substrates) and protein-protein interaction databases. Availability and implementation: A freely accessible program, including source code, implementing SAMMI is available at https://github.com/nelsongil92/SAMMI.git. andras.fiser@einstein.yu.edu. Supplementary data are available at Bioinformatics online.
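The quantity at the heart of the SAMMI hypothesis, mutual information between alignment columns, can be illustrated with a short sketch. The abstract does not say how SAMMI aggregates or normalizes this quantity, so the code below only computes the textbook mutual information between two columns; the function name `column_mutual_information` and the toy alignment are illustrative assumptions, not part of the SAMMI code base.

```python
from collections import Counter
from math import log2


def column_mutual_information(col_a, col_b):
    """Mutual information (in bits) between two equal-length alignment columns."""
    if len(col_a) != len(col_b):
        raise ValueError("columns must have the same number of sequences")
    n = len(col_a)
    count_a = Counter(col_a)             # marginal counts for column A
    count_b = Counter(col_b)             # marginal counts for column B
    count_ab = Counter(zip(col_a, col_b))  # joint counts of residue pairs
    mi = 0.0
    for (a, b), c in count_ab.items():
        p_ab = c / n
        # p_ab / (p_a * p_b), written with raw counts
        mi += p_ab * log2(p_ab * n * n / (count_a[a] * count_b[b]))
    return mi


# Hypothetical toy alignment: one row per sequence.
msa = ["ACDE", "ACDF", "GHDE", "GHDF"]
columns = list(zip(*msa))  # transpose rows into columns

print(column_mutual_information(columns[0], columns[1]))  # perfectly covarying columns
print(column_mutual_information(columns[0], columns[2]))  # fully conserved second column
```

Columns that always change together carry the maximum information about each other, while a fully conserved column carries none, which is why a selection criterion based on mutual information favors alignments that are diverse yet functionally homogeneous.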
Mühlhausen, Stefanie; Findeisen, Peggy; Plessmann, Uwe; Urlaub, Henning; Kollmar, Martin
2016-01-01
The genetic code is the cellular translation table for the conversion of nucleotide sequences into amino acid sequences. Changes to the meaning of sense codons would introduce errors into almost every translated message and are expected to be highly detrimental. However, reassignment of single or multiple codons in mitochondria and nuclear genomes, although extremely rare, demonstrates that the code can evolve. Several models for the mechanism of alteration of nuclear genetic codes have been proposed (including “codon capture,” “genome streamlining,” and “ambiguous intermediate” theories), but with little resolution. Here, we report a novel sense codon reassignment in Pachysolen tannophilus, a yeast related to the Pichiaceae. By generating proteomics data and using tRNA sequence comparisons, we show that Pachysolen translates CUG codons as alanine and not as the more usual leucine. The Pachysolen tRNA(CAG) is an anticodon-mutated tRNA(Ala) containing all major alanine tRNA recognition sites. The polyphyly of the CUG-decoding tRNAs in yeasts is best explained by a tRNA loss-driven codon reassignment mechanism. Loss of the CUG-tRNA in the ancient yeast is followed by gradual decrease of respective codons and subsequent codon capture by tRNAs whose anticodon is not part of the aminoacyl-tRNA synthetase recognition region. Our hypothesis applies to all nuclear genetic code alterations and provides several testable predictions. We anticipate more codon reassignments to be uncovered in existing and upcoming genome projects. PMID:27197221
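The practical effect of a sense codon reassignment can be shown with a small translation sketch. It builds the standard genetic code table and a variant in which CUG (CTG in DNA) is read as alanine, as reported for Pachysolen. The names `STANDARD`, `PACHYSOLEN_LIKE`, and `translate`, and the example sequence, are assumptions made for illustration only.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order (DNA alphabet).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}

# Variant table: identical except that CUG (CTG) is decoded as alanine.
PACHYSOLEN_LIKE = {**STANDARD, "CTG": "A"}


def translate(seq, table):
    """Translate a DNA or RNA coding sequence codon by codon; '*' marks a stop."""
    dna = seq.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(table[dna[i:i + 3]] for i in range(0, usable, 3))


# Hypothetical open reading frame containing one CUG codon.
orf = "ATGCTGAAATAA"
print(translate(orf, STANDARD))         # MLK*
print(translate(orf, PACHYSOLEN_LIKE))  # MAK*
```

Using the wrong table silently changes every CUG-encoded residue in a predicted proteome, which is why gene annotation for such species needs the reassigned table.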
Seligmann, Hervé
2013-03-01
Usual DNA→RNA transcription exchanges T→U. Assuming different systematic symmetric nucleotide exchanges during translation, some GenBank RNAs match exactly human mitochondrial sequences (exchange rules listed in decreasing transcript frequencies): C↔U, A↔U, A↔U+C↔G (two nucleotide pairs exchanged), G↔U, A↔G, C↔G, none for A↔C, A↔G+C↔U, and A↔C+G↔U. Most unusual transcripts involve exchanging uracil. Independent measures of rates of rare replicational enzymatic DNA nucleotide misinsertions predict frequencies of RNA transcripts systematically exchanging the corresponding misinserted nucleotides. Exchange transcripts self-hybridize less than other gene regions, self-hybridization increases with length, suggesting endoribonuclease-limited elongation. Blast detects stop codon depleted putative protein coding overlapping genes within exchange-transcribed mitochondrial genes. These align with existing GenBank proteins (mainly metazoan origins, prokaryotic and viral origins underrepresented). These GenBank proteins frequently interact with RNA/DNA, are membrane transporters, or are typical of mitochondrial metabolism. Nucleotide exchange transcript frequencies increase with overlapping gene densities and stop densities, indicating finely tuned counterbalancing regulation of expression of systematic symmetric nucleotide exchange-encrypted proteins. Such expression necessitates combined activities of suppressor tRNAs matching stops, and nucleotide exchange transcription. Two independent properties confirm predicted exchanged overlap coding genes: discrepancy of third codon nucleotide contents from replicational deamination gradients, and codon usage according to circular code predictions. Predictions from both properties converge, especially for frequent nucleotide exchange types. Nucleotide exchanging transcription apparently increases coding densities of protein coding genes without lengthening genomes, revealing unsuspected functional DNA coding potential. 
Chandrasekaran, Srinivas Niranj; Yardimci, Galip Gürkan; Erdogan, Ozgün; Roach, Jeffrey; Carter, Charles W.
2013-01-01
We tested the idea that ancestral class I and II aminoacyl-tRNA synthetases arose on opposite strands of the same gene. We assembled excerpted 94-residue Urgenes for class I tryptophanyl-tRNA synthetase (TrpRS) and class II Histidyl-tRNA synthetase (HisRS) from a diverse group of species, by identifying and catenating three blocks coding for secondary structures that position the most highly conserved, active-site residues. The codon middle-base pairing frequency was 0.35 ± 0.0002 in all-by-all sense/antisense alignments for 211 TrpRS and 207 HisRS sequences, compared with frequencies between 0.22 ± 0.0009 and 0.27 ± 0.0005 for eight different representations of the null hypothesis. Clustering algorithms demonstrate further that profiles of middle-base pairing in the synthetase antisense alignments are correlated along the sequences from one species-pair to another, whereas this is not the case for similar operations on sets representing the null hypothesis. Most probable reconstructed sequences for ancestral nodes of maximum likelihood trees show that middle-base pairing frequency increases to approximately 0.42 ± 0.002 as bacterial trees approach their roots; ancestral nodes from trees including archaeal sequences show a less pronounced increase. Thus, contemporary and reconstructed sequences all validate important bioinformatic predictions based on descent from opposite strands of the same ancestral gene. They further provide novel evidence for the hypothesis that bacteria lie closer than archaea to the origin of translation. Moreover, the inverse polarity of genetic coding, together with a priori α-helix propensities suggest that in-frame coding on opposite strands leads to similar secondary structures with opposite polarity, as observed in TrpRS and HisRS crystal structures. PMID:23576570
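The middle-base pairing statistic can be sketched as follows. The abstract does not spell out the alignment procedure, so this sketch assumes two in-frame coding sequences of equal length, aligned antiparallel so that the first codon of one faces the last codon of the other, and it counts only Watson-Crick pairs. The function name and the example sequences are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def middle_base_pairing_frequency(sense, antisense):
    """Fraction of facing codons whose middle bases are Watson-Crick complementary.

    Both arguments are in-frame DNA coding sequences written 5'->3'.
    Assumption: they are aligned antiparallel over their full length.
    """
    if len(sense) != len(antisense) or len(sense) % 3:
        raise ValueError("need two equal-length sequences of whole codons")
    n_codons = len(sense) // 3
    sense_mid = [sense[3 * i + 1] for i in range(n_codons)]
    anti_mid = [antisense[3 * i + 1] for i in range(n_codons)]
    # Codon i of the sense strand faces codon (n - 1 - i) of the antisense strand.
    paired = sum(COMPLEMENT[s] == a for s, a in zip(sense_mid, reversed(anti_mid)))
    return paired / n_codons


gene = "ATGGCATTT"
reverse_complement = "AAATGCCAT"  # exact opposite strand of `gene`
print(middle_base_pairing_frequency(gene, reverse_complement))  # 1.0
```

For unrelated sequences with uniform base composition the expected value is 0.25, which is the scale against which the reported null values of 0.22 to 0.27 and the observed 0.35 can be read.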
Gulati, Arvind; Swarnkar, Mohit Kumar; Vyas, Pratibha; Rahi, Praveen; Thakur, Rishu; Thakur, Namika; Singh, Anil Kumar
2015-09-03
The complete genome sequence of 6.45 Mb is reported here for Pseudomonas trivialis strain IHBB745 (MTCC 5336), which is an efficient, stress-tolerant, and broad-spectrum plant growth-promoting rhizobacterium. The gene-coding clusters predicted the genes for phosphate solubilization, siderophore production, 1-aminocyclopropane-1-carboxylate (ACC) deaminase activity, indole-3-acetic acid (IAA) production, and stress response. Copyright © 2015 Gulati et al.
Rogan, P K; Schneider, T D
1995-01-01
Predicting the effects of nucleotide substitutions in human splice sites has been based on analysis of consensus sequences. We used a graphic representation of sequence conservation and base frequency, the sequence logo, to demonstrate that a change in a splice acceptor of hMSH2 (a gene associated with familial nonpolyposis colon cancer) probably does not reduce splicing efficiency. This confirms a population genetic study that suggested that this substitution is a genetic polymorphism. The information theory-based sequence logo is quantitative and more sensitive than the corresponding splice acceptor consensus sequence for detection of true mutations. Information analysis may potentially be used to distinguish polymorphisms from mutations in other types of transcriptional, translational, or protein-coding motifs.
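The information measure behind a sequence logo can be illustrated with a minimal sketch: for DNA, the information at a position is 2 bits minus the Shannon entropy of the observed base frequencies. This version omits the small-sample correction used in practice, and the function name `position_information` and the four example sites are invented for illustration, not taken from the study.

```python
from collections import Counter
from math import log2


def position_information(sites):
    """Information content (bits) at each position of equal-length aligned DNA sites.

    Simplification: no small-sample correction is applied.
    """
    n = len(sites)
    info = []
    for column in zip(*sites):
        counts = Counter(column)
        entropy = -sum((c / n) * log2(c / n) for c in counts.values())
        info.append(2.0 - entropy)  # 2 bits is the maximum for a 4-letter alphabet
    return info


# Hypothetical aligned sites; positions 2 and 3 are invariant.
sites = ["CAGG", "CAGA", "TAGG", "TAGA"]
print(position_information(sites))
```

A substitution at a position with little information, such as the first or last position in this toy set, changes the total information of a site only slightly, which is the kind of quantitative argument the abstract uses to separate polymorphisms from true mutations.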
Villanueva, Pía; Nudel, Ron; Hoischen, Alexander; Fernández, María Angélica; Simpson, Nuala H; Gilissen, Christian; Reader, Rose H; Jara, Lillian; Echeverry, María Magdalena; Francks, Clyde; Baird, Gillian; Conti-Ramsden, Gina; O'Hare, Anne; Bolton, Patrick F; Hennessy, Elizabeth R; Palomino, Hernán; Carvajal-Carmona, Luis; Veltman, Joris A; Cazier, Jean-Baptiste; De Barbieri, Zulema; Fisher, Simon E; Newbury, Dianne F
2015-03-01
Children affected by Specific Language Impairment (SLI) fail to acquire age appropriate language skills despite adequate intelligence and opportunity. SLI is highly heritable, but the understanding of underlying genetic mechanisms has proved challenging. In this study, we use molecular genetic techniques to investigate an admixed isolated founder population from the Robinson Crusoe Island (Chile), who are affected by a high incidence of SLI, increasing the power to discover contributory genetic factors. We utilize exome sequencing in selected individuals from this population to identify eight coding variants that are of putative significance. We then apply association analyses across the wider population to highlight a single rare coding variant (rs144169475, Minor Allele Frequency of 4.1% in admixed South American populations) in the NFXL1 gene that confers a nonsynonymous change (N150K) and is significantly associated with language impairment in the Robinson Crusoe population (p = 2.04 × 10⁻⁴, 8 variants tested). Subsequent sequencing of NFXL1 in 117 UK SLI cases identified four individuals with heterozygous variants predicted to be of functional consequence. We conclude that coding variants within NFXL1 confer an increased risk of SLI within a complex genetic model.
Complete genome sequence of Parvibaculum lavamentivorans type strain (DS-1(T)).
Schleheck, David; Weiss, Michael; Pitluck, Sam; Bruce, David; Land, Miriam L; Han, Shunsheng; Saunders, Elizabeth; Tapia, Roxanne; Detter, Chris; Brettin, Thomas; Han, James; Woyke, Tanja; Goodwin, Lynne; Pennacchio, Len; Nolan, Matt; Cook, Alasdair M; Kjelleberg, Staffan; Thomas, Torsten
2011-12-31
Parvibaculum lavamentivorans DS-1(T) is the type species of the novel genus Parvibaculum in the novel family Rhodobiaceae (formerly Phyllobacteriaceae) of the order Rhizobiales of Alphaproteobacteria. Strain DS-1(T) is a non-pigmented, aerobic, heterotrophic bacterium and represents the first tier member of environmentally important bacterial communities that catalyze the complete degradation of synthetic laundry surfactants. Here we describe the features of this organism, together with the complete genome sequence and annotation. The 3,914,745 bp long genome with its predicted 3,654 protein coding genes is the first completed genome sequence of the genus Parvibaculum, and the first genome sequence of a representative of the family Rhodobiaceae.
Nissimov, Jozef I; Worthy, Charlotte A; Rooks, Paul; Napier, Johnathan A; Kimmance, Susan A; Henn, Matthew R; Ogata, Hiroyuki; Allen, Michael J
2012-03-01
The Coccolithoviridae are a group of viruses which infect the marine coccolithophorid microalga Emiliania huxleyi. The Emiliania huxleyi viruses (known as EhVs) described herein have 160- to 180-nm diameter icosahedral structures, have genomes of approximately 400 kbp, and consist of more than 450 predicted coding sequences (CDSs). Here, we describe the genomic features of four newly sequenced coccolithoviruses (EhV-88, EhV-201, EhV-207, and EhV-208) together with their draft genome sequences and their annotations, highlighting the homology and heterogeneity of these genomes to the EhV-86 model reference genome.
TriAnnot: A Versatile and High Performance Pipeline for the Automated Annotation of Plant Genomes
Leroy, Philippe; Guilhot, Nicolas; Sakai, Hiroaki; Bernard, Aurélien; Choulet, Frédéric; Theil, Sébastien; Reboux, Sébastien; Amano, Naoki; Flutre, Timothée; Pelegrin, Céline; Ohyanagi, Hajime; Seidel, Michael; Giacomoni, Franck; Reichstadt, Mathieu; Alaux, Michael; Gicquello, Emmanuelle; Legeai, Fabrice; Cerutti, Lorenzo; Numa, Hisataka; Tanaka, Tsuyoshi; Mayer, Klaus; Itoh, Takeshi; Quesneville, Hadi; Feuillet, Catherine
2012-01-01
In support of the international effort to obtain a reference sequence of the bread wheat genome and to provide plant communities dealing with large and complex genomes with a versatile, easy-to-use online automated tool for annotation, we have developed the TriAnnot pipeline. Its modular architecture allows for the annotation and masking of transposable elements, the structural, and functional annotation of protein-coding genes with an evidence-based quality indexing, and the identification of conserved non-coding sequences and molecular markers. The TriAnnot pipeline is parallelized on a 712 CPU computing cluster that can run a 1-Gb sequence annotation in less than 5 days. It is accessible through a web interface for small scale analyses or through a server for large scale annotations. The performance of TriAnnot was evaluated in terms of sensitivity, specificity, and general fitness using curated reference sequence sets from rice and wheat. In less than 8 h, TriAnnot was able to predict more than 83% of the 3,748 CDS from rice chromosome 1 with a fitness of 67.4%. On a set of 12 reference Mb-sized contigs from wheat chromosome 3B, TriAnnot predicted and annotated 93.3% of the genes among which 54% were perfectly identified in accordance with the reference annotation. It also allowed the curation of 12 genes based on new biological evidences, increasing the percentage of perfect gene prediction to 63%. TriAnnot systematically showed a higher fitness than other annotation pipelines that are not improved for wheat. As it is easily adaptable to the annotation of other plant genomes, TriAnnot should become a useful resource for the annotation of large and complex genomes in the future. PMID:22645565
Principles of protein folding--a perspective from simple exact models.
Dill, K. A.; Bromberg, S.; Yue, K.; Fiebig, K. M.; Yee, D. P.; Thomas, P. D.; Chan, H. S.
1995-01-01
General principles of protein structure, stability, and folding kinetics have recently been explored in computer simulations of simple exact lattice models. These models represent protein chains at a rudimentary level, but they involve few parameters, approximations, or implicit biases, and they allow complete explorations of conformational and sequence spaces. Such simulations have resulted in testable predictions that are sometimes unanticipated: The folding code is mainly binary and delocalized throughout the amino acid sequence. The secondary and tertiary structures of a protein are specified mainly by the sequence of polar and nonpolar monomers. More specific interactions may refine the structure, rather than dominate the folding code. Simple exact models can account for the properties that characterize protein folding: two-state cooperativity, secondary and tertiary structures, and multistage folding kinetics--fast hydrophobic collapse followed by slower annealing. These studies suggest the possibility of creating "foldable" chain molecules other than proteins. The encoding of a unique compact chain conformation may not require amino acids; it may require only the ability to synthesize specific monomer sequences in which at least one monomer type is solvent-averse. PMID:7613459
Lourenco-Jaramillo, Diana Lelidett; Sifuentes-Rincón, Ana María; Parra-Bracamonte, Gaspar Manuel; de la Rosa-Reyna, Xochitl Fabiola; Segura-Cabrera, Aldo; Arellano-Vera, Williams
2012-01-01
DNA from four cattle breeds was used to re-sequence all of the exons and 56% of the introns of the bovine tyrosine hydroxylase (TH) gene and 97% and 13% of the bovine dopamine β-hydroxylase (DBH) coding and non-coding sequences, respectively. Two novel single nucleotide polymorphisms (SNPs) and a microsatellite motif were found in the TH sequences. The DBH sequences contained 62 nucleotide changes, including eight non-synonymous SNPs (nsSNPs) that are of particular interest because they may alter protein function and therefore affect the phenotype. These DBH nsSNPs resulted in amino acid substitutions that were predicted to destabilize the protein structure. Six SNPs (one from TH and five from DBH non-synonymous SNPs) were genotyped in 140 animals; all of them were polymorphic and had a minor allele frequency of > 9%. There were significant differences in the intra- and inter-population haplotype distributions. The haplotype differences between Brahman cattle and the three B. t. taurus breeds (Charolais, Holstein and Lidia) were interesting from a behavioural point of view because of the differences in temperament between these breeds. PMID:22888292
Jankowitsch, Frank; Schwarz, Julia; Rückert, Christian; Gust, Bertolt; Szczepanowski, Rafael; Blom, Jochen; Pelzer, Stefan; Kalinowski, Jörn
2012-01-01
Streptomyces davawensis JCM 4913 synthesizes the antibiotic roseoflavin, a structural riboflavin (vitamin B2) analog. Here, we report the 9,466,619-bp linear chromosome of S. davawensis JCM 4913 and a 89,331-bp linear plasmid. The sequence has an average G+C content of 70.58% and contains six rRNA operons (16S-23S-5S) and 69 tRNA genes. The 8,616 predicted protein-coding sequences include 32 clusters coding for secondary metabolites, several of which are unique to S. davawensis. The chromosome contains long terminal inverted repeats of 33,255 bp each and atypical telomeres. Sequence analysis with regard to riboflavin biosynthesis revealed three different patterns of gene organization in Streptomyces species. Heterologous expression of a set of genes present on a subgenomic fragment of S. davawensis resulted in the production of roseoflavin by the host Streptomyces coelicolor M1152. Phylogenetic analysis revealed that S. davawensis is a close relative of Streptomyces cinnabarinus, and much to our surprise, we found that the latter bacterium is a roseoflavin producer as well. PMID:23043000
Bringing the fathead minnow into the genomic era
The fathead minnow is a well-established ecotoxicological model organism that has been widely used for regulatory ecotoxicity testing and research for over a half century. While a large amount of molecular information has been gathered on the fathead minnow over the years, the lack of genomic sequence data has limited the utility of the fathead minnow for certain applications. To address this limitation, high-throughput Illumina sequencing technology was employed to sequence the fathead minnow genome. Approximately 100X coverage was achieved by sequencing several libraries of paired-end reads with differing genome insert sizes. Two draft genome assemblies were generated using the SOAPdenovo and String Graph Assembler (SGA) methods, respectively. When these were compared, the SOAPdenovo assembly had a higher scaffold N50 value of 60.4 kbp versus 15.4 kbp, and it also performed better in a Core Eukaryotic Genes Mapping Analysis (CEGMA), mapping 91% versus 67% of genes. As such, this assembly was selected for further development and annotation. The foundation for genome annotation was generated using AUGUSTUS, an ab initio method for gene prediction. A total of 43,345 potential coding sequences were predicted on the genome assembly. These predicted sequences were translated to peptides and queried in a BLAST search against all vertebrates, with 28,290 of these sequences corresponding to zebrafish peptides and 5,242 producing no significant alignments. Additional ty
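The scaffold N50 used above to compare the two assemblies is a simple statistic: the length of the scaffold at which the running total, taken from longest to shortest, first reaches half of the assembly size. A minimal sketch follows, with a hypothetical function name and toy lengths rather than the minnow data.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("at least one length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # reached half of the total assembly length
            return length


# Toy example: total length 20, so half is reached within the second-longest scaffold.
print(n50([2, 3, 4, 5, 6]))  # 5
```

A higher N50 indicates that more of the assembly sits in long scaffolds, but it says nothing about correctness, which is why the abstract pairs it with a gene-completeness check (CEGMA).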
Zhu, Xun; Xie, Shangbo; Armengaud, Jean; Xie, Wen; Guo, Zhaojiang; Kang, Shi; Wu, Qingjun; Wang, Shaoli; Xia, Jixing; He, Rongjun; Zhang, Youjun
2016-01-01
The diamondback moth, Plutella xylostella (L.), is the major cosmopolitan pest of brassica and other cruciferous crops. Its larval midgut is a dynamic tissue that interfaces with a wide variety of toxicological and physiological processes. The draft sequence of the P. xylostella genome was recently released, but its annotation remains challenging because of the low sequence coverage of this branch of life and the poor description of exon/intron splicing rules for these insects. Peptide sequencing by computational assignment of tandem mass spectra to genome sequence information provides an experimentally independent approach for confirming or refuting protein predictions, a concept that has been termed proteogenomics. In this study, we carried out an in-depth proteogenomic analysis to complement genome annotation of P. xylostella larval midgut based on shotgun HPLC-ESI-MS/MS data by means of a multialgorithm pipeline. A total of 876,341 tandem mass spectra were searched against the predicted P. xylostella protein sequences and a whole-genome six-frame translation database. Based on a data set comprising 2694 novel genome search specific peptides, we discovered 439 novel protein-coding genes and corrected 128 existing gene models. To get the most accurate data to seed further insect genome annotation, more than half of the novel protein-coding genes, i.e. 235 of 439, were further validated after RT-PCR amplification and sequencing of the corresponding transcripts. Furthermore, we validated 53 novel alternative splicing events. Finally, a total of 6764 proteins were identified, resulting in one of the most comprehensive proteogenomic studies of a nonmodel animal. As the first tissue-specific proteogenomics analysis of P. xylostella, this study provides the fundamental basis for high-throughput proteomics and functional genomics approaches aimed at deciphering the molecular mechanisms of resistance and controlling this pest. PMID:26902207
Whitaker, Weston R; Lee, Hanson; Arkin, Adam P; Dueber, John E
2015-03-20
Genetic sequences ported into non-native hosts for synthetic biology applications can gain unexpected properties. In this study, we explored sequences functioning as ribosome binding sites (RBSs) within protein coding DNA sequences (CDSs) that cause internal translation, resulting in truncated proteins. Genome-wide prediction of bacterial RBSs, based on biophysical calculations employed by the RBS calculator, suggests a selection against internal RBSs within CDSs in Escherichia coli, but not those in Saccharomyces cerevisiae. Based on these calculations, silent mutations aimed at removing internal RBSs can effectively reduce truncation products from internal translation. However, a solution for complete elimination of internal translation initiation is not always feasible due to constraints of available coding sequences. Fluorescence assays and Western blot analysis showed that in genes with internal RBSs, increasing the strength of the intended upstream RBS had little influence on the internal translation strength. Another strategy to minimize truncated products from an internal RBS is to increase the relative strength of the upstream RBS with a concomitant reduction in promoter strength to achieve the same protein expression level. Unfortunately, lower transcription levels result in increased noise at the single cell level due to stochasticity in gene expression. At the low expression regimes desired for many synthetic biology applications, this problem becomes particularly pronounced. We found that balancing promoter strengths and upstream RBS strengths to intermediate levels can achieve the target protein concentration while avoiding both excessive noise and truncated protein.
Imran, Md; Pant, Poonam; Shanbhag, Yogini P; Sawant, Samir V; Ghadi, Sanjeev C
2017-02-01
Microbulbifer mangrovi strain DD-13 T is a novel-type species isolated from the mangroves of Goa, India. The draft genome sequence of strain DD-13 comprised 4,528,106 bp with G+C content of 57.15%. Out of 3479 open reading frames, functions for 3488 protein coding sequences were predicted on the basis of similarity with the cluster of orthologous groups. In addition to protein coding sequences, 34 tRNA genes and 3 rRNA genes were detected. Analysis of the nucleotide sequences of the predicted genes using a Carbohydrate-Active Enzymes (CAZymes) Analysis Toolkit indicates that strain DD-13 encodes a large set of CAZymes including 255 glycoside hydrolases, 76 carbohydrate esterases, 17 polysaccharide lyases, and 113 carbohydrate-binding modules (CBMs). Many genes from strain DD-13 were annotated as carbohydrases specific for degradation of agar, alginate, carrageenan, chitin, xylan, pullulan, cellulose, starch, β-glucan, pectin, etc. Some of the polysaccharide-degrading genes were highly modular and were appended with at least one CBM, indicating the versatility of strain DD-13 to degrade complex polysaccharides. The cell growth of strain DD-13 was validated using pure polysaccharides such as agarose or alginate as carbon source as well as by using red and brown seaweed powder as substrate. The homologous carbohydrase produced by strain DD-13 during growth degraded the polysaccharide, ensuring the production of metabolizable reducing sugars. Additionally, several other polysaccharides such as carrageenan, xylan, pullulan, pectin, starch, and carboxymethyl cellulose were also corroborated as growth substrate for strain DD-13 and were associated with concomitant production of homologous carbohydrase.
Posterior Predictive Bayesian Phylogenetic Model Selection
Lewis, Paul O.; Xie, Wangang; Chen, Ming-Hui; Fan, Yu; Kuo, Lynn
2014-01-01
We present two distinctly different posterior predictive approaches to Bayesian phylogenetic model selection and illustrate these methods using examples from green algal protein-coding cpDNA sequences and flowering plant rDNA sequences. The Gelfand–Ghosh (GG) approach allows dissection of an overall measure of model fit into components due to posterior predictive variance (GGp) and goodness-of-fit (GGg), which distinguishes this method from the posterior predictive P-value approach. The conditional predictive ordinate (CPO) method provides a site-specific measure of model fit useful for exploratory analyses and can be combined over sites yielding the log pseudomarginal likelihood (LPML) which is useful as an overall measure of model fit. CPO provides a useful cross-validation approach that is computationally efficient, requiring only a sample from the posterior distribution (no additional simulation is required). Both GG and CPO add new perspectives to Bayesian phylogenetic model selection based on the predictive abilities of models and complement the perspective provided by the marginal likelihood (including Bayes Factor comparisons) based solely on the fit of competing models to observed data. [Bayesian; conditional predictive ordinate; CPO; L-measure; LPML; model selection; phylogenetics; posterior predictive.] PMID:24193892
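The claim that CPO needs only a posterior sample can be made concrete with a sketch. It assumes the common Monte Carlo estimator, in which the CPO for a site is the harmonic mean of that site's likelihood across posterior samples and LPML is the sum of the log CPO values; the abstract itself gives no formula, so treat this as one plausible reading. The function names and toy numbers are hypothetical.

```python
from math import exp, log


def log_cpo(site_log_likelihoods):
    """Log CPO for one site from its log-likelihood under each posterior sample.

    Harmonic-mean estimator, evaluated in log space for numerical stability:
    log CPO = log T - logsumexp(-log L_t), where T is the number of samples.
    """
    negated = [-value for value in site_log_likelihoods]
    largest = max(negated)
    log_sum = largest + log(sum(exp(v - largest) for v in negated))
    return log(len(negated)) - log_sum


def lpml(log_likelihood_matrix):
    """Log pseudomarginal likelihood: sum of log CPO over sites (rows)."""
    return sum(log_cpo(row) for row in log_likelihood_matrix)


# Toy input: 2 sites x 2 posterior samples of site log-likelihoods.
matrix = [
    [log(0.5), log(0.25)],  # harmonic mean of 0.5 and 0.25 is 1/3
    [log(0.5), log(0.5)],   # constant likelihood gives CPO = 0.5
]
print(lpml(matrix))
```

Sites with unusually low log CPO are the ones the model predicts poorly, which is what makes the per-site values useful for exploratory analysis before they are summed into LPML.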
Upper Stage Tank Thermodynamic Modeling Using SINDA/FLUINT
NASA Technical Reports Server (NTRS)
Schallhorn, Paul; Campbell, D. Michael; Chase, Sukhdeep; Piquero, Jorge; Fortenberry, Cindy; Li, Xiaoyi; Grob, Lisa
2006-01-01
Modeling to predict the condition of cryogenic propellants in an upper stage of a launch vehicle is necessary for mission planning and successful execution. Traditionally, this effort was performed using custom, in-house proprietary codes, limiting accessibility and application. Phenomena responsible for influencing the thermodynamic state of the propellant have been characterized as distinct events whose sequence defines a mission. These events include thermal stratification, passive thermal control roll (rotation), slosh, and engine firing. This paper demonstrates the use of an off the shelf, commercially available, thermal/fluid-network code to predict the thermodynamic state of propellant during the coast phase between engine firings, i.e. the first three of the above identified events. Results of this effort will also be presented.
NASA Technical Reports Server (NTRS)
Kalb, Michael; Robertson, Franklin; Jedlovec, Gary; Perkey, Donald
1987-01-01
Techniques by which mesoscale numerical weather prediction model output and radiative transfer codes are combined to simulate the radiance fields that a given passive temperature/moisture satellite sensor would see if viewing the evolving model atmosphere are introduced. The goals are to diagnose the dynamical atmospheric processes responsible for recurring patterns in observed satellite radiance fields, and to develop techniques to anticipate the ability of satellite sensor systems to depict atmospheric structures and provide information useful for numerical weather prediction (NWP). The concept of linking radiative transfer and dynamical NWP codes is demonstrated with time sequences of simulated radiance imagery in the 24 TIROS vertical sounder channels derived from model integrations for March 6, 1982.
Multi-level machine learning prediction of protein-protein interactions in Saccharomyces cerevisiae.
Zubek, Julian; Tatjewski, Marcin; Boniecki, Adam; Mnich, Maciej; Basu, Subhadip; Plewczynski, Dariusz
2015-01-01
Accurate identification of protein-protein interactions (PPI) is the key step in understanding proteins' biological functions, which are typically context-dependent. Many existing PPI predictors rely on aggregated features from protein sequences; however, only a few methods exploit local information about specific residue contacts. In this work we present a two-stage machine learning approach for prediction of protein-protein interactions. We start with the carefully filtered data on protein complexes available for Saccharomyces cerevisiae in the Protein Data Bank (PDB) database. First, we build linear descriptions of interacting and non-interacting sequence segment pairs based on their inter-residue distances. Secondly, we train machine learning classifiers to predict binary segment interactions for any two short sequence fragments. The final prediction of the protein-protein interaction is done using the 2D matrix representation of all-against-all possible interacting sequence segments of both analysed proteins. The level-I predictor achieves 0.88 AUC for micro-scale, i.e., residue-level prediction. The level-II predictor improves the results further by a more complex learning paradigm. We perform a 30-fold macro-scale (i.e., protein-level) cross-validation experiment. The level-II predictor using PSIPRED-predicted secondary structure reaches 0.70 precision, 0.68 recall, and 0.70 AUC, whereas other popular methods provide results below the 0.6 threshold (recall, precision, AUC). Our results demonstrate that a multi-scale sequence feature aggregation procedure is able to improve the machine learning results by more than 10% as compared to other sequence representations. Prepared datasets and source code for our experimental pipeline are freely available for download from: http://zubekj.github.io/mlppi/ (open source Python implementation, OS independent).
Broadband transmission-type coding metamaterial for wavefront manipulation for airborne sound
NASA Astrophysics Data System (ADS)
Li, Kun; Liang, Bin; Yang, Jing; Yang, Jun; Cheng, Jian-chun
2018-07-01
The recent advent of coding metamaterials, as a new class of acoustic metamaterials, substantially reduces the complexity in the design and fabrication of acoustic functional devices capable of manipulating sound waves in exotic manners by arranging coding elements with discrete phase states in specific sequences. It is therefore intriguing, both physically and practically, to pursue a mechanism for realizing broadband acoustic coding metamaterials that control transmitted waves with a fine resolution of the phase profile. Here, we propose the design of a transmission-type acoustic coding device and demonstrate its metamaterial-based implementation. The mechanism is that, instead of relying on resonant coding elements that are necessarily narrow-band, we build weak-resonant coding elements with a helical-like metamaterial with a continuously varying pitch that effectively expands the working bandwidth while maintaining the sub-wavelength resolution of the phase profile that is vital for the production of complicated wave fields. The effectiveness of our proposed scheme is numerically verified via the demonstration of three distinctive examples of acoustic focusing, anomalous refraction, and vortex beam generation in the prescribed frequency band on the basis of 1- and 2-bit coding sequences. Simulation results agree well with theoretical predictions, showing that the designed coding devices with discrete phase profiles are efficient in engineering the wavefront of outcoming waves to form the desired spatial pattern. We anticipate the realization of coding metamaterials with broadband functionality and design flexibility to open up possibilities for novel acoustic functional devices for the special manipulation of transmitted waves and underpin diverse applications ranging from medical ultrasound imaging to acoustic detections.
Unique and conserved microRNAs in wheat chromosome 5D revealed by next-generation sequencing.
Kurtoglu, Kuaybe Yucebilgili; Kantar, Melda; Lucas, Stuart J; Budak, Hikmet
2013-01-01
MicroRNAs are a class of short, non-coding, single-stranded RNAs that act as post-transcriptional regulators in gene expression. miRNA analysis of Triticum aestivum chromosome 5D was performed on 454 GS FLX Titanium sequences of flow-sorted chromosome 5D with a total of 3,208,630 good quality reads representing 1.34x and 1.61x coverage of the short (5DS) and long (5DL) arms of the chromosome respectively. In silico and structural analyses revealed a total of 55 miRNAs; 48 and 42 miRNAs were found to be present on 5DL and 5DS respectively, of which 35 were common to both chromosome arms, while 13 miRNAs were specific to 5DL and 7 miRNAs were specific to 5DS. In total, 14 of the predicted miRNAs were identified in wheat for the first time. Representation (the copy number of each miRNA) was also found to be higher in 5DL (1,949) compared to 5DS (1,191). Targets were predicted for each miRNA, while expression analysis gave evidence of expression for 6 out of 55 miRNAs. Occurrences of the same miRNAs were also found in Brachypodium distachyon and Oryza sativa genome sequences to identify syntenic miRNA coding sequences. Based on this analysis, two other miRNAs, miR1133 and miR167, were detected in the B. distachyon region syntenic to wheat 5DS. Five of the predicted miRNA coding regions (miR6220, miR5070, miR169, miR5085, miR2118) were experimentally verified to be located on the 5D chromosome, and three of them (miR2118, miR169 and miR5085) were shown to be 5D specific. Furthermore, miR2118 was shown to be expressed in Chinese Spring adult leaves. The miRNA genes identified in this study will expand our understanding of gene regulation in bread wheat.
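The arm-level counts quoted in this abstract can be cross-checked with a short inclusion-exclusion calculation. The sketch below uses only the figures stated above; the variable names are illustrative and not taken from the paper.

```python
# Counts quoted in the abstract for wheat chromosome 5D miRNAs.
on_5dl = 48   # miRNAs found on the long arm (5DL)
on_5ds = 42   # miRNAs found on the short arm (5DS)
shared = 35   # miRNAs common to both arms

# Arm-specific counts follow by subtraction.
only_5dl = on_5dl - shared
only_5ds = on_5ds - shared

# Inclusion-exclusion gives the number of distinct miRNAs overall.
total = on_5dl + on_5ds - shared

print(only_5dl, only_5ds, total)  # 13 7 55
```

The result agrees with the 13 5DL-specific, 7 5DS-specific and 55 total miRNAs reported.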
The influence of visual training on predicting complex action sequences.
Cross, Emily S; Stadler, Waltraud; Parkinson, Jim; Schütz-Bosbach, Simone; Prinz, Wolfgang
2013-02-01
Linking observed and executable actions appears to be achieved by an action observation network (AON), comprising parietal, premotor, and occipitotemporal cortical regions of the human brain. AON engagement during action observation is thought to aid in effortless, efficient prediction of ongoing movements to support action understanding. Here, we investigate how the AON responds when observing and predicting actions we cannot readily reproduce before and after visual training. During pre- and posttraining neuroimaging sessions, participants watched gymnasts and wind-up toys moving behind an occluder and pressed a button when they expected each agent to reappear. Between scanning sessions, participants visually trained to predict when a subset of stimuli would reappear. Posttraining scanning revealed activation of inferior parietal, superior temporal, and cerebellar cortices when predicting occluded actions compared to perceiving them. Greater activity emerged when predicting untrained compared to trained sequences in occipitotemporal cortices and to a lesser degree, premotor cortices. The occipitotemporal responses when predicting untrained agents showed further specialization, with greater responses within body-processing regions when predicting gymnasts' movements and in object-selective cortex when predicting toys' movements. The results suggest that (1) select portions of the AON are recruited to predict the complex movements not easily mapped onto the observer's body and (2) greater recruitment of these AON regions supports prediction of less familiar sequences. We suggest that the findings inform both the premotor model of action prediction and the predictive coding account of AON function. Copyright © 2011 Wiley Periodicals, Inc.
Characterization and prediction of residues determining protein functional specificity.
Capra, John A; Singh, Mona
2008-07-01
Within a homologous protein family, proteins may be grouped into subtypes that share specific functions that are not common to the entire family. Often, the amino acids present in a small number of sequence positions determine each protein's particular functional specificity. Knowledge of these specificity determining positions (SDPs) aids in protein function prediction, drug design and experimental analysis. A number of sequence-based computational methods have been introduced for identifying SDPs; however, their further development and evaluation have been hindered by the limited number of known experimentally determined SDPs. We combine several bioinformatics resources to automate a process, typically undertaken manually, to build a dataset of SDPs. The resulting large dataset, which consists of SDPs in enzymes, enables us to characterize SDPs in terms of their physicochemical and evolutionary properties. It also facilitates the large-scale evaluation of sequence-based SDP prediction methods. We present a simple sequence-based SDP prediction method, GroupSim, and show that, surprisingly, it is competitive with a representative set of current methods. We also describe ConsWin, a heuristic that considers sequence conservation of neighboring amino acids, and demonstrate that it improves the performance of all methods tested on our large dataset of enzyme SDPs. Datasets and GroupSim code are available online at http://compbio.cs.princeton.edu/specificity/. Supplementary data are available at Bioinformatics online.
Lineage-Specific Biology Revealed by a Finished Genome Assembly of the Mouse
Hillier, LaDeana W.; Zody, Michael C.; Goldstein, Steve; She, Xinwe; Bult, Carol J.; Agarwala, Richa; Cherry, Joshua L.; DiCuccio, Michael; Hlavina, Wratko; Kapustin, Yuri; Meric, Peter; Maglott, Donna; Birtle, Zoë; Marques, Ana C.; Graves, Tina; Zhou, Shiguo; Teague, Brian; Potamousis, Konstantinos; Churas, Christopher; Place, Michael; Herschleb, Jill; Runnheim, Ron; Forrest, Daniel; Amos-Landgraf, James; Schwartz, David C.; Cheng, Ze; Lindblad-Toh, Kerstin; Eichler, Evan E.; Ponting, Chris P.
2009-01-01
The mouse (Mus musculus) is the premier animal model for understanding human disease and development. Here we show that a comprehensive understanding of mouse biology is only possible with the availability of a finished, high-quality genome assembly. The finished clone-based assembly of the mouse strain C57BL/6J reported here has over 175,000 fewer gaps and over 139 Mb more of novel sequence, compared with the earlier MGSCv3 draft genome assembly. In a comprehensive analysis of this revised genome sequence, we are now able to define 20,210 protein-coding genes, over a thousand more than predicted in the human genome (19,042 genes). In addition, we identified 439 long, non–protein-coding RNAs with evidence for transcribed orthologs in human. We analyzed the complex and repetitive landscape of 267 Mb of sequence that was missing or misassembled in the previously published assembly, and we provide insights into the reasons for its resistance to sequencing and assembly by whole-genome shotgun approaches. Duplicated regions within newly assembled sequence tend to be of more recent ancestry than duplicates in the published draft, correcting our initial understanding of recent evolution on the mouse lineage. These duplicates appear to be largely composed of sequence regions containing transposable elements and duplicated protein-coding genes; of these, some may be fixed in the mouse population, but at least 40% of segmentally duplicated sequences are copy number variable even among laboratory mouse strains. Mouse lineage-specific regions contain 3,767 genes drawn mainly from rapidly-changing gene families associated with reproductive functions. The finished mouse genome assembly, therefore, greatly improves our understanding of rodent-specific biology and allows the delineation of ancestral biological functions that are shared with human from derived functions that are not. PMID:19468303
Draft Genome Sequence of Mycobacterium bohemicum Strain DSM 44277T.
Asmar, Shady; Phelippeau, Michael; Robert, Catherine; Croce, Olivier; Drancourt, Michel
2015-08-06
The Mycobacterium bohemicum strain is a nontuberculosis species mainly responsible for pediatric cervical lymphadenitis. The draft genome of M. bohemicum DSM 44277(T) comprises 5,097,190 bp exhibiting a 68.64% G+C content, 4,840 protein-coding genes, and 75 predicted RNA genes. Copyright © 2015 Asmar et al.
2015-09-02
human behavior. In this project, we hypothesized that visual memory of past motion trajectories may be used for selecting future behavior. In other..."Decoding sequence of actions using fMRI", Society for Neuroscience Annual Meeting, San Diego, CA, USA, Nov 9-13 2013 (only abstract) 3. Hansol Choi, Dae...Shik Kim, "Planning as inference in a Hierarchical Predictive Memory", Proceedings of International Conference on Neural Information Processing
Singh, Kh Dhanachandra; Karthikeyan, Muthusamy
2014-12-01
The renin-angiotensin-aldosterone system (RAAS) plays a key role in the regulation of blood pressure (BP). Mutations in the genes that encode components of the RAAS have played a significant role in genetic susceptibility to hypertension and have been intensively scrutinized. Identifying such probably causal mutations not only provides insight into the RAAS but may also yield antihypertensive therapeutic targets and diagnostic markers. Because the SNP dataset is large and contains both functional and neutral SNPs, testing every SNP experimentally to determine its biological significance is challenging. To explore the functional significance of these SNPs, we adopted a combined sequence-based and sequence-structure-based SNP analysis approach. Of the 3864 SNPs reported in dbSNP, we found 108 missense SNPs in the coding region and the remainder in the non-coding region. In this study, we report a coding-region SNP as deleterious only when three or more tools predict it to be deleterious and the mutant shows a high RMSD from the native structure. Based on these analyses, we identified as deleterious two SNPs in the REN gene, eight in the AGT gene, three in the ACE gene, two in the AT1R gene, three in the CYP11B2 gene and three in the CMA1 gene, all in the coding region. Studies of this type should help reduce the cost and time needed to identify candidate SNPs and to select, from the SNP pool, those worth experimental study.
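The selection rule described in this abstract (at least three tools agree, and the structure deviates strongly from the native one) can be sketched as a simple consensus filter. This is an illustration only: the SNP identifiers, tool labels and the RMSD cutoff below are hypothetical, since the abstract names neither the tools nor a numeric threshold.

```python
from typing import Dict, List

def consensus_deleterious(
    calls: Dict[str, Dict[str, bool]],
    rmsd: Dict[str, float],
    min_tools: int = 3,
    rmsd_cutoff: float = 2.0,  # hypothetical cutoff in angstroms, not from the abstract
) -> List[str]:
    """Return SNP ids flagged by >= min_tools tools and with RMSD >= rmsd_cutoff."""
    selected = []
    for snp, per_tool in calls.items():
        votes = sum(per_tool.values())  # each True counts as one vote
        if votes >= min_tools and rmsd.get(snp, 0.0) >= rmsd_cutoff:
            selected.append(snp)
    return sorted(selected)

# Made-up example data with placeholder SNP ids and generic tool labels.
calls = {
    "snp_A": {"tool1": True, "tool2": True, "tool3": True, "tool4": False},
    "snp_B": {"tool1": True, "tool2": False, "tool3": False, "tool4": False},
    "snp_C": {"tool1": True, "tool2": True, "tool3": True, "tool4": True},
}
rmsd = {"snp_A": 2.6, "snp_B": 3.1, "snp_C": 0.4}

result = consensus_deleterious(calls, rmsd)
print(result)  # ['snp_A']
```

Here snp_B fails the vote count and snp_C fails the structural criterion, so only snp_A passes both filters.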
Improving performance of DS-CDMA systems using chaotic complex Bernoulli spreading codes
NASA Astrophysics Data System (ADS)
Farzan Sabahi, Mohammad; Dehghanfard, Ali
2014-12-01
The most important goal of a spread spectrum communication system is to protect communication signals against interference and against exploitation of information by unintended listeners. Low probability of detection and low probability of intercept are therefore two important parameters for the performance of the system. In Direct Sequence Code Division Multiple Access (DS-CDMA) systems, these properties are achieved by multiplying the data by spreading sequences. Chaotic sequences, with their particular properties, have numerous applications in constructing spreading codes. The use of a one-dimensional Bernoulli chaotic sequence as a spreading code has been proposed previously in the literature. The main feature of this sequence is its negative auto-correlation at a lag of 1, which, with proper design, increases the efficiency of communication systems based on these codes. The use of complex chaotic sequences as spreading sequences has also been discussed in several papers. In this paper, two-dimensional Bernoulli chaotic sequences are proposed as spreading codes. The performance of multi-user synchronous and asynchronous DS-CDMA systems using these sequences is evaluated under Additive White Gaussian Noise (AWGN) and fading channels. Simulation results indicate improved performance in comparison with conventional spreading codes such as Gold codes, as well as with similar complex chaotic spreading sequences. Like one-dimensional Bernoulli chaotic sequences, the proposed sequences have negative auto-correlation. In addition, the proposed method makes it possible to construct complex sequences with lower average cross-correlation.
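The basic direct-sequence mechanism mentioned in this abstract, multiplying each data symbol by a spreading sequence and recovering it by correlation, can be illustrated with a minimal noise-free sketch. The two codes below are short orthogonal placeholder sequences chosen for clarity; they are an assumption of this example and are not the chaotic Bernoulli codes proposed in the paper.

```python
def spread(bits, code):
    """Map each data bit (+1/-1) to len(code) chips by multiplying with the code."""
    return [b * c for b in bits for c in code]

def despread(chips, code):
    """Correlate each block of chips with the code and decide by the sign."""
    n = len(code)
    decided = []
    for i in range(0, len(chips), n):
        corr = sum(x * c for x, c in zip(chips[i:i + n], code))
        decided.append(1 if corr >= 0 else -1)
    return decided

# Placeholder orthogonal codes (their cross-correlation at zero shift is 0).
code_a = [1, -1, 1, -1, 1, -1, 1, -1]
code_b = [1, 1, -1, -1, 1, 1, -1, -1]

bits_a = [1, -1, -1, 1]
bits_b = [-1, -1, 1, 1]

# Synchronous two-user channel without noise: the chip streams simply add.
channel = [a + b for a, b in zip(spread(bits_a, code_a), spread(bits_b, code_b))]

print(despread(channel, code_a) == bits_a)  # True
print(despread(channel, code_b) == bits_b)  # True
```

Because the placeholder codes are orthogonal, each user's correlator sees no contribution from the other user. With real spreading codes the residual cross-correlation acts as multi-user interference, which is why lower average cross-correlation matters.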
Mutations that Cause Human Disease: A Computational/Experimental Approach
DOE Office of Scientific and Technical Information (OSTI.GOV)
Beernink, P; Barsky, D; Pesavento, B
International genome sequencing projects have produced billions of nucleotides (letters) of DNA sequence data, including the complete genome sequences of 74 organisms. These genome sequences have created many new scientific opportunities, including the ability to identify sequence variations among individuals within a species. These genetic differences, which are known as single nucleotide polymorphisms (SNPs), are particularly important in understanding the genetic basis for disease susceptibility. Since the report of the complete human genome sequence, over two million human SNPs have been identified, including a large-scale comparison of an entire chromosome from twenty individuals. Of the protein coding SNPs (cSNPs), approximately half lead to a single amino acid change in the encoded protein (non-synonymous coding SNPs). Most of these changes are functionally silent, while the remainder negatively impact the protein and sometimes cause human disease. To date, over 550 SNPs have been found to cause single locus (monogenic) diseases and many others have been associated with polygenic diseases. SNPs have been linked to specific human diseases, including late-onset Parkinson disease, autism, rheumatoid arthritis and cancer. The ability to predict accurately the effects of these SNPs on protein function would represent a major advance toward understanding these diseases. To date several attempts have been made toward predicting the effects of such mutations. The most successful of these is a computational approach called "Sorting Intolerant From Tolerant" (SIFT). This method uses sequence conservation among many similar proteins to predict which residues in a protein are functionally important. However, this method suffers from several limitations. First, a query sequence must have a sufficient number of relatives to infer sequence conservation. Second, this method does not make use of or provide any information on protein structure, which can be used to understand how an amino acid change affects the protein. The experimental methods that provide the most detailed structural information on proteins are X-ray crystallography and NMR spectroscopy. However, these methods are labor intensive and currently cannot be carried out on a genomic scale. Nonetheless, Structural Genomics projects are being pursued by more than a dozen groups and consortia worldwide and as a result the number of experimentally determined structures is rising exponentially. Based on the expectation that protein structures will continue to be determined at an ever-increasing rate, reliable structure prediction schemes will become increasingly valuable, leading to information on protein function and disease for many different proteins. Given known genetic variability and experimentally determined protein structures, can we accurately predict the effects of single amino acid substitutions? An objective assessment of this question would involve comparing predicted and experimentally determined structures, which thus far has not been rigorously performed. The completed research leveraged existing expertise at LLNL in computational and structural biology, as well as significant computing resources, to address this question.
The G-Box Transcriptional Regulatory Code in Arabidopsis
Shepherd, Samuel J.K.; Brestovitsky, Anna; Dickinson, Patrick; Biswas, Surojit
2017-01-01
Plants have significantly more transcription factor (TF) families than animals and fungi, and plant TF families tend to contain more genes; these expansions are linked to adaptation to environmental stressors. Many TF family members bind to similar or identical sequence motifs, such as G-boxes (CACGTG), so it is difficult to predict regulatory relationships. We determined that the flanking sequences near G-boxes help determine in vitro specificity but that this is insufficient to predict the transcription pattern of genes near G-boxes. Therefore, we constructed a gene regulatory network that identifies the set of bZIPs and bHLHs that are most predictive of the expression of genes downstream of perfect G-boxes. This network accurately predicts transcriptional patterns and reconstructs known regulatory subnetworks. Finally, we present Ara-BOX-cis (araboxcis.org), a Web site that provides interactive visualizations of the G-box regulatory network, a useful resource for generating predictions for gene regulatory relations. PMID:28864470
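Since the abstract stresses that the sequences flanking a G-box contribute to binding specificity, a small sketch of locating perfect G-boxes and collecting their flanks may help. The function name, the flank width of 3 and the example sequence are assumptions made for illustration; they do not come from the paper or from Ara-BOX-cis.

```python
import re

G_BOX = "CACGTG"  # palindromic: its reverse complement is also CACGTG

def find_g_boxes(seq, flank=3):
    """Return (start, left_flank, right_flank) for every perfect G-box in seq."""
    seq = seq.upper()
    hits = []
    # A zero-width lookahead reports every start position, including overlaps.
    for match in re.finditer(f"(?={G_BOX})", seq):
        start = match.start()
        end = start + len(G_BOX)
        hits.append((start, seq[max(0, start - flank):start], seq[end:end + flank]))
    return hits

promoter = "ttgaCACGTGgcatacgtCACGTGtc"  # made-up sequence for illustration
hits = find_g_boxes(promoter)
print(hits)  # [(4, 'TGA', 'GCA'), (18, 'CGT', 'TC')]
```

Because the motif is its own reverse complement, scanning one strand is sufficient to find every occurrence; the flanks, however, differ between strands and would need to be reverse-complemented for a strand-aware analysis.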
Organization and transient expression of the gene for human U11 snRNA
Suter-Crazzolara, Clemens; Keller, Walter
1991-01-01
The nucleotide sequence of U11 small nuclear RNA, a minor U RNA from HeLa cells, was determined. Computer analysis of the sequence (135 residues) predicts two strong hairpin loops which are separated by seventeen nucleotides containing an Sm binding site (AAUUUUUUGG). A synthetic gene was constructed in which the coding region of U11 RNA is under the control of a T7 promoter. This vector can be used to produce U11 RNA in vitro. Southern hybridization and PCR analysis of HeLa genomic DNA suggest that U11 RNA is encoded by a single copy gene, and that at least three genomic regions could be U11 RNA pseudogenes. A HeLa genomic copy of a U11 gene was isolated by inverted PCR. This gene contains the U11 RNA coding sequence and several sequence elements unique for the U RNA genes. These include a Distal Sequence Element (DSE, ATTTGCATA) present between positions −215 and −223 relative to the start of transcription; a Proximal Sequence Element (PSE, TTCACCTTTACCAAAAATG) located between positions −43 and −63 ; and a 3′box (GTTAGGCGAAATATTA) between positions +150 and +166. Transfection of HeLa cells with this gene revealed that it is functioning in vivo and can produce U11 RNA. PMID:1820214
Crowley, T E; Bond, M W; Meyerowitz, E M
1983-01-01
The polytene chromosome puff at 68C on the Drosophila melanogaster third chromosome is thought from genetic experiments to contain the structural gene for one of the secreted salivary gland glue polypeptides, sgs-3. Previous work has demonstrated that the DNA included in this puff contains sequences that are transcribed to give three different polyadenylated RNAs that are abundant in third-larval-instar salivary glands. These have been called the group II, group III, and group IV RNAs. In the experiments reported here, we used the nucleotide sequence of the DNA coding for these RNAs to predict some of the physical and chemical properties expected of their protein products, including molecular weight, amino acid composition, and amino acid sequence. Salivary gland polypeptides with molecular weights similar to those expected for the 68C RNA translation products, and with the expected degree of incorporation of different radioactive amino acids, were purified. These proteins were shown by amino acid sequencing to correspond to the protein products of the 68C RNAs. It was further shown that each of these proteins is a part of the secreted salivary gland glue: the group IV RNA codes for the previously described sgs-3, whereas the group II and III RNAs code for the newly identified glue polypeptides sgs-8 and sgs-7. PMID:6406838
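Predicting an amino acid sequence and composition from a coding DNA sequence, as this abstract describes, amounts to translating codons with the standard genetic code and counting residues. The sketch below is a generic illustration under that assumption; the reading frame shown is made up and is not one of the 68C sequences.

```python
from collections import Counter
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Translate coding DNA from its first base, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

def composition(protein):
    """Count how often each amino acid occurs in the protein."""
    return Counter(protein)

orf = "ATGAAATGCTGCCCAACTTAA"  # made-up open reading frame for illustration
protein = translate(orf)
print(protein)                    # MKCCPT
print(composition(protein)["C"])  # 2
```

A molecular weight estimate would follow from the same composition by summing per-residue masses, which is omitted here to keep the sketch short.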
Ntougias, Spyridon; Lapidus, Alla; Han, James; Mavromatis, Konstantinos; Pati, Amrita; Chen, Amy; Klenk, Hans-Peter; Woyke, Tanja; Fasseas, Constantinos; Kyrpides, Nikos C.; Zervakis, Georgios I.
2014-01-01
Olivibacter sitiensis Ntougias et al. 2007 is a member of the family Sphingobacteriaceae, phylum Bacteroidetes. Members of the genus Olivibacter are phylogenetically diverse and of significant interest. They occur in diverse habitats, such as rhizosphere and contaminated soils, viscous wastes, composts, biofilter clean-up facilities on contaminated sites and cave environments, and they are involved in the degradation of complex and toxic compounds. Here we describe the features of O. sitiensis AW-6T, together with the permanent-draft genome sequence and annotation. The organism was sequenced under the Genomic Encyclopedia for Bacteria and Archaea (GEBA) project at the DOE Joint Genome Institute and is the first genome sequence of a species within the genus Olivibacter. The genome is 5,053,571 bp long and is comprised of 110 scaffolds with an average GC content of 44.61%. Of the 4,565 genes predicted, 4,501 were protein-coding genes and 64 were RNA genes. Most protein-coding genes (68.52%) were assigned to a putative function. The identification of 2-keto-4-pentenoate hydratase/2-oxohepta-3-ene-1,7-dioic acid hydratase-coding genes indicates involvement of this organism in the catechol catabolic pathway. In addition, genes encoding for β-1,4-xylanases and β-1,4-xylosidases reveal the xylanolytic action of O. sitiensis. PMID:25197463
Fanning, T; Singer, M
1987-01-01
Recent work suggests that one or more members of the highly repeated LINE-1 (L1) DNA family found in all mammals may encode one or more proteins. Here we report the sequence of a portion of an L1 cloned from the domestic cat (Felis catus). These data permit comparison of the L1 sequences in four mammalian orders (Carnivore, Lagomorph, Rodent and Primate) and the comparison supports the suggested coding potential. In two separate, noncontiguous regions in the carboxy terminal half of the proteins predicted from the DNA sequences, there are several strongly conserved segments. In one region, these share homology with known or suspected reverse transcriptases, as described by others in rodents and primates. In the second region, closer to the carboxy terminus, the strongly conserved segments are over 90% homologous among the four orders. One of the latter segments is cysteine rich and resembles the putative metal binding domains of nucleic acid binding proteins, including those of TFIIIA and retroviruses. PMID:3562227
Deep intronic GPR143 mutation in a Japanese family with ocular albinism
Naruto, Takuya; Okamoto, Nobuhiko; Masuda, Kiyoshi; Endo, Takao; Hatsukawa, Yoshikazu; Kohmoto, Tomohiro; Imoto, Issei
2015-01-01
Deep intronic mutations are often ignored as possible causes of human disease. Using whole-exome sequencing, we analysed genomic DNAs of a Japanese family with two male siblings affected by ocular albinism and congenital nystagmus. Although mutations or copy number alterations of coding regions were not identified in candidate genes, the novel intronic mutation c.659-131 T > G within GPR143 intron 5 was identified as hemizygous in affected siblings and as heterozygous in the unaffected mother. This mutation was predicted to create a cryptic splice donor site within intron 5 and activate a cryptic acceptor site at 41nt upstream, causing the insertion into the coding sequence of an out-of-frame 41-bp pseudoexon with a premature stop codon in the aberrant transcript, which was confirmed by minigene experiments. This result expands the mutational spectrum of GPR143 and suggests the utility of next-generation sequencing integrated with in silico and experimental analyses for improving the molecular diagnosis of this disease. PMID:26061757
Deep intronic GPR143 mutation in a Japanese family with ocular albinism.
Naruto, Takuya; Okamoto, Nobuhiko; Masuda, Kiyoshi; Endo, Takao; Hatsukawa, Yoshikazu; Kohmoto, Tomohiro; Imoto, Issei
2015-06-10
Deep intronic mutations are often ignored as possible causes of human disease. Using whole-exome sequencing, we analysed genomic DNAs of a Japanese family with two male siblings affected by ocular albinism and congenital nystagmus. Although mutations or copy number alterations of coding regions were not identified in candidate genes, the novel intronic mutation c.659-131 T > G within GPR143 intron 5 was identified as hemizygous in affected siblings and as heterozygous in the unaffected mother. This mutation was predicted to create a cryptic splice donor site within intron 5 and activate a cryptic acceptor site at 41nt upstream, causing the insertion into the coding sequence of an out-of-frame 41-bp pseudoexon with a premature stop codon in the aberrant transcript, which was confirmed by minigene experiments. This result expands the mutational spectrum of GPR143 and suggests the utility of next-generation sequencing integrated with in silico and experimental analyses for improving the molecular diagnosis of this disease.
The structure of the human interferon alpha/beta receptor gene.
Lutfalla, G; Gardiner, K; Proudhon, D; Vielh, E; Uzé, G
1992-02-05
Using the cDNA coding for the human interferon alpha/beta receptor (IFNAR), the IFNAR gene has been physically mapped relative to the other loci of the chromosome 21q22.1 region. 32,906 base pairs covering the IFNAR gene have been cloned and sequenced. Primer extension and solution hybridization-ribonuclease protection have been used to determine that the transcription of the gene is initiated in a broad region of 20 base pairs. Some aspects of the polymorphism of the gene, including noncoding sequences, have been analyzed; some are allelic differences in the coding sequence that induce amino acid variations in the resulting protein. The exon structures of the IFNAR gene and of the available genes for receptors of the cytokine/growth hormone/prolactin/interferon receptor family have been compared with the predictions for the secondary structure of those receptors. From this analysis, we postulate a common origin and propose a hypothesis for the divergence from the immunoglobulin superfamily.
A SNP panel and online tool for checking genotype concordance through comparing QR codes.
Du, Yonghong; Martin, Joshua S; McGee, John; Yang, Yuchen; Liu, Eric Yi; Sun, Yingrui; Geihs, Matthias; Kong, Xuejun; Zhou, Eric Lingfeng; Li, Yun; Huang, Jie
2017-01-01
In the current precision medicine era, more and more samples get genotyped and sequenced. Both researchers and commercial companies expend significant time and resources to reduce the error rate. However, it has been reported that there is a sample mix-up rate of between 0.1% and 1%, not to mention the possibly higher mix-up rate during the down-stream genetic reporting processes. Even on the low end of this estimate, this translates to a significant number of mislabeled samples, especially over the projected one billion people that will be sequenced within the next decade. Here, we first describe a method to identify a small set of single nucleotide polymorphisms (SNPs) that can uniquely identify a personal genome, which utilizes allele frequencies of five major continental populations reported in the 1000 genomes project and the ExAC Consortium. To make this panel more informative, we added four SNPs that are commonly used to predict ABO blood type, and another two SNPs that are capable of predicting sex. We then implement a web interface (http://qrcme.tech), nicknamed QRC (for QR code based Concordance check), which is capable of extracting the relevant ID SNPs from raw genetic data, coding their genotypes as a quick response (QR) code, and comparing QR codes to report the concordance of underlying genetic datasets. The resulting 80 fingerprinting SNPs represent a significant decrease in complexity and the number of markers used for genetic data labelling and tracking. Our method and web tool are easily accessible to both researchers and the general public who consider the accuracy of complex genetic data as a prerequisite towards precision medicine.
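Two points in this abstract lend themselves to a small worked example: the scale implied by the quoted mix-up rate, and the idea of comparing genotypes at a fixed SNP panel. The sketch below is an assumption-laden illustration; the function, SNP names and genotypes are hypothetical, and it does not reproduce the QRC tool or its QR encoding.

```python
def concordance(sample_a, sample_b):
    """Fraction of shared panel SNPs whose unordered genotypes agree."""
    shared = [snp for snp in sample_a if snp in sample_b]
    if not shared:
        raise ValueError("no SNPs in common")
    # Sorting the alleles makes 'AG' and 'GA' compare as the same genotype.
    matches = sum(sorted(sample_a[s]) == sorted(sample_b[s]) for s in shared)
    return matches / len(shared)

# Low end of the quoted mix-up range (0.1%) applied to one billion genomes.
mislabeled_low_end = 1_000_000_000 // 1000
print(mislabeled_low_end)  # 1000000

# Hypothetical four-SNP fingerprints for two data files.
file_1 = {"snp1": "AG", "snp2": "CC", "snp3": "TT", "snp4": "AC"}
file_2 = {"snp1": "GA", "snp2": "CC", "snp3": "CT", "snp4": "AC"}
print(concordance(file_1, file_2))  # 0.75
```

With the 80-SNP panel described above, two files from the same individual would be expected to agree at nearly every site, so a markedly lower score would point to a possible sample swap.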
A SNP panel and online tool for checking genotype concordance through comparing QR codes
Du, Yonghong; Martin, Joshua S.; McGee, John; Yang, Yuchen; Liu, Eric Yi; Sun, Yingrui; Geihs, Matthias; Kong, Xuejun; Zhou, Eric Lingfeng; Li, Yun
2017-01-01
In the current precision medicine era, more and more samples get genotyped and sequenced. Both researchers and commercial companies expend significant time and resources to reduce the error rate. However, it has been reported that there is a sample mix-up rate of between 0.1% and 1%, not to mention the possibly higher mix-up rate during the down-stream genetic reporting processes. Even on the low end of this estimate, this translates to a significant number of mislabeled samples, especially over the projected one billion people that will be sequenced within the next decade. Here, we first describe a method to identify a small set of single nucleotide polymorphisms (SNPs) that can uniquely identify a personal genome, which utilizes allele frequencies of five major continental populations reported in the 1000 genomes project and the ExAC Consortium. To make this panel more informative, we added four SNPs that are commonly used to predict ABO blood type, and another two SNPs that are capable of predicting sex. We then implement a web interface (http://qrcme.tech), nicknamed QRC (for QR code based Concordance check), which is capable of extracting the relevant ID SNPs from raw genetic data, coding their genotypes as a quick response (QR) code, and comparing QR codes to report the concordance of underlying genetic datasets. The resulting 80 fingerprinting SNPs represent a significant decrease in complexity and the number of markers used for genetic data labelling and tracking. Our method and web tool are easily accessible to both researchers and the general public who consider the accuracy of complex genetic data as a prerequisite towards precision medicine. PMID:28926565
Pattern-based integer sample motion search strategies in the context of HEVC
NASA Astrophysics Data System (ADS)
Maier, Georg; Bross, Benjamin; Grois, Dan; Marpe, Detlev; Schwarz, Heiko; Veltkamp, Remco C.; Wiegand, Thomas
2015-09-01
The H.265/MPEG-H High Efficiency Video Coding (HEVC) standard provides a significant increase in coding efficiency compared to its predecessor, the H.264/MPEG-4 Advanced Video Coding (AVC) standard, which however comes at the cost of a high computational burden for a compliant encoder. Motion estimation (ME), which is a part of the inter-picture prediction process, typically consumes a high amount of computational resources, while significantly increasing the coding efficiency. Although both the H.265/MPEG-H HEVC and H.264/MPEG-4 AVC standards allow processing motion information on a fractional sample level, motion search algorithms based on the integer sample level remain an integral part of ME. In this paper, a flexible integer sample ME framework is proposed, which allows a significant reduction of ME computation time to be traded off against a coding efficiency penalty in terms of bit rate overhead. As a result, through extensive experimentation, an integer sample ME algorithm that provides a good trade-off is derived, incorporating a combination and optimization of known predictive, pattern-based and early termination techniques. The proposed ME framework is implemented on the basis of the HEVC Test Model (HM) reference software and compared to the state-of-the-art fast search algorithm, which is a native part of HM. It is observed that for high resolution sequences, the integer sample ME process can be sped up by factors varying from 3.2 to 7.6, resulting in a bit-rate overhead of 1.5% and 0.6% for Random Access (RA) and Low Delay P (LDP) configurations, respectively. In addition, a similar speed-up is observed for sequences with mainly Computer-Generated Imagery (CGI) content, at the cost of a bit rate overhead of up to 5.2%.
A Graph-Centric Approach for Metagenome-Guided Peptide and Protein Identification in Metaproteomics
Tang, Haixu; Li, Sujun; Ye, Yuzhen
2016-01-01
Metaproteomic studies adopt the common bottom-up proteomics approach to investigate the protein composition and the dynamics of protein expression in microbial communities. When matched metagenomic and/or metatranscriptomic data of the microbial communities are available, metaproteomic data analyses often employ a metagenome-guided approach, in which complete or fragmental protein-coding genes are first directly predicted from metagenomic (and/or metatranscriptomic) sequences or from their assemblies, and the resulting protein sequences are then used as the reference database for peptide/protein identification from MS/MS spectra. This approach is often limited because protein coding genes predicted from metagenomes are incomplete and fragmental. In this paper, we present a graph-centric approach to improving metagenome-guided peptide and protein identification in metaproteomics. Our method exploits the de Bruijn graph structure reported by metagenome assembly algorithms to generate a comprehensive database of protein sequences encoded in the community. We tested our method using several public metaproteomic datasets with matched metagenomic and metatranscriptomic sequencing data acquired from complex microbial communities in a biological wastewater treatment plant. The results showed that many more peptides and proteins can be identified when assembly graphs were utilized, improving the characterization of the proteins expressed in the microbial communities. The additional proteins we identified contribute to the characterization of important pathways such as those involved in degradation of chemical hazards. Our tools are released as open-source software on github at https://github.com/COL-IU/Graph2Pro. PMID:27918579
Genome-wide prediction of cis-regulatory regions using supervised deep learning methods.
Li, Yifeng; Shi, Wenqiang; Wasserman, Wyeth W
2018-05-31
In the human genome, 98% of DNA sequences are non-protein-coding regions that were previously disregarded as junk DNA. In fact, non-coding regions host a variety of cis-regulatory regions which precisely control the expression of genes. Thus, identifying active cis-regulatory regions in the human genome is critical for understanding gene regulation and assessing the impact of genetic variation on phenotype. The development of high-throughput sequencing and machine learning technologies makes it possible to predict cis-regulatory regions genome wide. Based on rich data resources such as the Encyclopedia of DNA Elements (ENCODE) and the Functional Annotation of the Mammalian Genome (FANTOM) projects, we introduce DECRES, based on supervised deep learning approaches, for the identification of enhancer and promoter regions in the human genome. Due to their ability to discover patterns in large and complex data, the introduction of deep learning methods enables a significant advance in our knowledge of the genomic locations of cis-regulatory regions. Using models for well-characterized cell lines, we identify key experimental features that contribute to the predictive performance. Applying DECRES, we delineate locations of 300,000 candidate enhancers genome wide (6.8% of the genome, of which 40,000 are supported by bidirectional transcription data), and 26,000 candidate promoters (0.6% of the genome). The predicted annotations of cis-regulatory regions will provide broad utility for genome interpretation from functional genomics to clinical applications. The DECRES model demonstrates the potential of deep learning technologies when combined with high-throughput sequencing data, and inspires the development of other advanced neural network models for further improvement of genome annotations.
2014-01-01
Background Locating the protein-coding genes in novel genomes is essential to understanding and exploiting the genomic information but it is still difficult to accurately predict all the genes. The recent availability of detailed information about transcript structure from high-throughput sequencing of messenger RNA (RNA-Seq) delineates many expressed genes and promises increased accuracy in gene prediction. Computational gene predictors have been intensively developed for and tested in well-studied animal genomes. Hundreds of fungal genomes are now or will soon be sequenced. The differences of fungal genomes from animal genomes and the phylogenetic sparsity of well-studied fungi call for gene-prediction tools tailored to them. Results SnowyOwl is a new gene prediction pipeline that uses RNA-Seq data to train and provide hints for the generation of Hidden Markov Model (HMM)-based gene predictions and to evaluate the resulting models. The pipeline has been developed and streamlined by comparing its predictions to manually curated gene models in three fungal genomes and validated against the high-quality gene annotation of Neurospora crassa; SnowyOwl predicted N. crassa genes with 83% sensitivity and 65% specificity. SnowyOwl gains sensitivity by repeatedly running the HMM gene predictor Augustus with varied input parameters and selectivity by choosing the models with best homology to known proteins and best agreement with the RNA-Seq data. Conclusions SnowyOwl efficiently uses RNA-Seq data to produce accurate gene models in both well-studied and novel fungal genomes. The source code for the SnowyOwl pipeline (in Python) and a web interface (in PHP) is freely available from http://sourceforge.net/projects/snowyowl/. PMID:24980894
Protein Solvent-Accessibility Prediction by a Stacked Deep Bidirectional Recurrent Neural Network.
Zhang, Buzhong; Li, Linqing; Lü, Qiang
2018-05-25
Residue solvent accessibility is closely related to the spatial arrangement and packing of residues. Predicting the solvent accessibility of a protein is an important step to understand its structure and function. In this work, we present a deep learning method to predict residue solvent accessibility, which is based on a stacked deep bidirectional recurrent neural network applied to sequence profiles. To capture more long-range sequence information, a merging operator was proposed when bidirectional information from hidden nodes was merged for outputs. Three types of merging operators were used in our improved model, with a long short-term memory network performing as a hidden computing node. The trained database was constructed from 7361 proteins extracted from the PISCES server using a cut-off of 25% sequence identity. Sequence-derived features including position-specific scoring matrix, physical properties, physicochemical characteristics, conservation score and protein coding were used to represent a residue. Using this method, predictive values of continuous relative solvent-accessible area were obtained, and then, these values were transformed into binary states with predefined thresholds. Our experimental results showed that our deep learning method improved prediction quality relative to current methods, with mean absolute error and Pearson's correlation coefficient values of 8.8% and 74.8%, respectively, on the CB502 dataset and 8.2% and 78%, respectively, on the Manesh215 dataset.
Computational RNomics of Drosophilids
Rose, Dominic; Hackermüller, Jörg; Washietl, Stefan; Reiche, Kristin; Hertel, Jana; Findeiß, Sven; Stadler, Peter F; Prohaska, Sonja J
2007-01-01
Background Recent experimental and computational studies have provided overwhelming evidence for a plethora of diverse transcripts that are unrelated to protein-coding genes. One subclass consists of those RNAs that require distinctive secondary structure motifs to exert their biological function and hence exhibit distinctive patterns of sequence conservation characteristic for positive selection on RNA secondary structure. The deep-sequencing of 12 drosophilid species coordinated by the NHGRI provides an ideal data set of comparative computational approaches to determine those genomic loci that code for evolutionarily conserved RNA motifs. This class of loci includes the majority of the known small ncRNAs as well as structured RNA motifs in mRNAs. We report here on a genome-wide survey using RNAz. Results We obtain 16 000 high quality predictions among which we recover the majority of the known ncRNAs. Taking a pessimistically estimated false discovery rate of 40% into account, this implies that at least some ten thousand loci in the Drosophila genome show the hallmarks of stabilizing selection action of RNA structure, and hence are most likely functional at the RNA level. A subset of RNAz predictions overlapping with TRF1 and BRF binding sites [Isogai et al., EMBO J. 26: 79–89 (2007)], which are plausible candidates of Pol III transcripts, have been studied in more detail. Among these sequences we identify several "clusters" of ncRNA candidates with striking structural similarities. Conclusion The statistical evaluation of the RNAz predictions in comparison with a similar analysis of vertebrate genomes [Washietl et al., Nat. Biotech. 23: 1383–1390 (2005)] shows that qualitatively similar fractions of structured RNAs are found in introns, UTRs, and intergenic regions. The intergenic RNA structures, however, are concentrated much more closely around known protein-coding loci, suggesting that flies have significantly smaller complement of independent structured ncRNAs compared to mammals. PMID:17996037
Draft genome sequence of Enterococcus faecium strain LMG 8148.
Michiels, Joran E; Van den Bergh, Bram; Fauvart, Maarten; Michiels, Jan
2016-01-01
Enterococcus faecium, traditionally considered a harmless gut commensal, is emerging as an important nosocomial pathogen showing increasing rates of multidrug resistance. We report the draft genome sequence of E. faecium strain LMG 8148, isolated in 1968 from a human in Gothenburg, Sweden. The draft genome has a total length of 2,697,490 bp, a GC-content of 38.3 %, and 2,402 predicted protein-coding sequences. The isolation of this strain predates the emergence of E. faecium as a nosocomial pathogen. Consequently, its genome can be useful in comparative genomic studies investigating the evolution of E. faecium as a pathogen.
DeepLoc: prediction of protein subcellular localization using deep learning.
Almagro Armenteros, José Juan; Sønderby, Casper Kaae; Sønderby, Søren Kaae; Nielsen, Henrik; Winther, Ole
2017-11-01
The prediction of eukaryotic protein subcellular localization is a well-studied topic in bioinformatics due to its relevance in proteomics research. Many machine learning methods have been successfully applied in this task, but in most of them, predictions rely on annotation of homologues from knowledge databases. For novel proteins where no annotated homologues exist, and for predicting the effects of sequence variants, it is desirable to have methods for predicting protein properties from sequence information only. Here, we present a prediction algorithm using deep neural networks to predict protein subcellular localization relying only on sequence information. At its core, the prediction model uses a recurrent neural network that processes the entire protein sequence and an attention mechanism identifying protein regions important for the subcellular localization. The model was trained and tested on a protein dataset extracted from one of the latest UniProt releases, in which experimentally annotated proteins follow more stringent criteria than previously. We demonstrate that our model achieves a good accuracy (78% for 10 categories; 92% for membrane-bound or soluble), outperforming current state-of-the-art algorithms, including those relying on homology information. The method is available as a web server at http://www.cbs.dtu.dk/services/DeepLoc. Example code is available at https://github.com/JJAlmagro/subcellular_localization. The dataset is available at http://www.cbs.dtu.dk/services/DeepLoc/data.php. Contact: jjalma@dtu.dk. © The Author 2017. Published by Oxford University Press.
Xiong, Dapeng; Zeng, Jianyang; Gong, Haipeng
2017-09-01
Residue-residue contacts are of great value for protein structure prediction, since contact information, especially from long-range residue pairs, can significantly reduce the complexity of conformational sampling for protein structure prediction in practice. Despite progress in the past decade on protein targets with abundant homologous sequences, accurate contact prediction for proteins with limited sequence information is still far from satisfactory. Methodologies for these hard targets still need further improvement. We present a computational program, DeepConPred, which includes a pipeline of two novel deep-learning-based methods (DeepCCon and DeepRCon) as well as a contact refinement step, to improve the prediction of long-range residue contacts from primary sequences. When compared with previous prediction approaches, our framework employed an effective scheme to identify optimal and important features for contact prediction, and was only trained with coevolutionary information derived from a limited number of homologous sequences to ensure robustness and usefulness for hard targets. Independent tests showed that 59.33%/49.97%, 64.39%/54.01% and 70.00%/59.81% of the top L/5, top L/10 and top 5 predictions were correct for CASP10/CASP11 proteins, respectively. In general, our algorithm ranked as one of the best methods for CASP targets. All source data and codes are available at http://166.111.152.91/Downloads.html. Contact: hgong@tsinghua.edu.cn or zengjy321@tsinghua.edu.cn. Supplementary data are available at Bioinformatics online. © The Author (2017). Published by Oxford University Press.
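The "top L/5, top L/10 and top 5" figures quoted in this abstract are precision values over the highest-scoring predicted contacts, where L is the protein length. The sketch below shows how such a metric is commonly computed. It is not the DeepConPred evaluation script, and the long-range cut-off of 24 residues of sequence separation is an assumed convention, not a value taken from the abstract.

```python
def top_k_precision(scored_pairs, true_contacts, k, min_separation=24):
    """Precision of the k highest-scoring long-range contact predictions.

    scored_pairs   -- iterable of ((i, j), score) residue-pair predictions
    true_contacts  -- iterable of (i, j) pairs that are contacts in the structure
    k              -- number of top predictions to evaluate (e.g. L // 5)
    min_separation -- assumed sequence-separation cut-off for "long-range"
    """
    long_range = [
        (pair, score) for pair, score in scored_pairs
        if abs(pair[0] - pair[1]) >= min_separation
    ]
    ranked = sorted(long_range, key=lambda item: item[1], reverse=True)[:k]
    if not ranked:
        return 0.0
    truth = {tuple(sorted(pair)) for pair in true_contacts}
    hits = sum(1 for pair, _ in ranked if tuple(sorted(pair)) in truth)
    return hits / len(ranked)


# Toy data: the pair (3, 6) is too close in sequence to count as long-range.
scored = [((1, 30), 0.9), ((2, 40), 0.8), ((5, 45), 0.7), ((3, 6), 0.95)]
truth = {(1, 30), (5, 45)}
print(top_k_precision(scored, truth, k=2))
```

For a protein of length L, the three reported settings would correspond to calling the function with k = L // 5, k = L // 10 and k = 5.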
Liakhovetskiĭ, V A; Bobrova, E V; Skopin, G N
2012-01-01
Transposition errors during the reproduction of a hand movement sequence make it possible to obtain important information on the internal representation of this sequence in motor working memory. Analysis of such errors showed that learning to reproduce sequences of left-hand movements improves the system of positional coding (coding of positions), while learning of right-hand movements improves the system of vector coding (coding of movements). Learning of right-hand movements after left-hand performance involved the system of positional coding "imposed" by the left hand. Learning of left-hand movements after right-hand performance activated the system of vector coding. Transposition errors during learning to reproduce movement sequences can be explained by a neural network using either vector coding or both vector and positional coding.
osFP: a web server for predicting the oligomeric states of fluorescent proteins.
Simeon, Saw; Shoombuatong, Watshara; Anuwongcharoen, Nuttapat; Preeyanon, Likit; Prachayasittikul, Virapong; Wikberg, Jarl E S; Nantasenamat, Chanin
2016-01-01
Currently, monomeric fluorescent proteins (FP) are ideal markers for protein tagging. The prediction of oligomeric states is helpful for enhancing live biomedical imaging. Computational prediction of FP oligomeric states can accelerate protein engineering efforts to create monomeric FPs. To the best of our knowledge, this study represents the first computational model for predicting and analyzing FP oligomerization directly from the amino acid sequence. After data curation, an exhaustive data set consisting of 397 non-redundant FP oligomeric states was compiled from the literature. Results from benchmarking of the protein descriptors revealed that the model built with amino acid composition descriptors was the top performing model, with accuracy, sensitivity and specificity in excess of 80% and MCC greater than 0.6 for all three data subsets (i.e., training, tenfold cross-validation and external sets). The model provided insights on the important residues governing the oligomerization of FP. To maximize the benefit of the generated predictive model, it was implemented as a web server under the R programming environment. osFP affords a user-friendly interface that can be used to predict the oligomeric state of FP using the protein sequence. The advantage of osFP is that it is platform-independent, meaning that it can be accessed via a web browser on any operating system and device. osFP is freely accessible at http://codes.bio/osfp/ while the source code and data set are provided on GitHub at https://github.com/chaninn/osFP/.
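The amino acid composition descriptor mentioned in this abstract is a standard 20-dimensional feature vector. A minimal sketch follows, written in Python for illustration even though the osFP server itself is reported to run under R. The function name and the handling of non-standard residues are assumptions, not details taken from osFP.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so that every sequence
# maps to a feature vector with the same layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in a protein sequence.

    Characters outside the 20 standard residues (gaps, X, and so on) are
    ignored; this is an assumption made for the sketch.
    """
    counts = Counter(
        residue for residue in sequence.upper() if residue in AMINO_ACIDS
    )
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


features = amino_acid_composition("MSKGEELFTG")
print(dict(zip(AMINO_ACIDS, features)))
```

A vector of this form could then be passed to any classifier; the abstract does not state which learning algorithm was used, so none is shown here.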
Informational structure of genetic sequences and nature of gene splicing
NASA Astrophysics Data System (ADS)
Trifonov, E. N.
1991-10-01
Only about 1/20 of the DNA of higher organisms codes for proteins, by means of the classical triplet code. The rest of the DNA sequences is largely silent, with unclear functions, if any. The triplet code is not the only code (message) carried by the sequences. There are three levels of molecular communication, where the same sequence "talks" to various biomolecules, while having, respectively, three different appearances: DNA, RNA and protein. Since the molecular structures and, hence, sequence-specific preferences of these are substantially different, the original DNA sequence has to carry simultaneously three types of sequence patterns (codes, messages), thus being a composite structure in which one and the same letter (nucleotide) is frequently involved in several overlapping codes of different nature. This multiplicity and overlapping of the codes is a unique feature of the Gnomic, language of genetic sequences. The coexisting codes have to be degenerate in various degrees to allow an optimal and concerted performance of all the encoded functions. There is an obvious conflict between the best possible performance of a given function and the necessity to compromise the quality of a given sequence pattern in favor of other patterns. It appears that the major role of various changes in the sequences on their "ontogenetic" way from DNA to RNA to protein, like RNA editing and splicing, or protein post-translational modifications, is to resolve such conflicts. New data are presented strongly indicating that gene splicing is such a device to resolve the conflict between the code of DNA folding in chromatin and the triplet code for protein synthesis.
Identification of human chromosome 22 transcribed sequences with ORF expressed sequence tags
de Souza, Sandro J.; Camargo, Anamaria A.; Briones, Marcelo R. S.; Costa, Fernando F.; Nagai, Maria Aparecida; Verjovski-Almeida, Sergio; Zago, Marco A.; Andrade, Luis Eduardo C.; Carrer, Helaine; El-Dorry, Hamza F. A.; Espreafico, Enilza M.; Habr-Gama, Angelita; Giannella-Neto, Daniel; Goldman, Gustavo H.; Gruber, Arthur; Hackel, Christine; Kimura, Edna T.; Maciel, Rui M. B.; Marie, Suely K. N.; Martins, Elizabeth A. L.; Nóbrega, Marina P.; Paçó-Larson, Maria Luisa; Pardini, Maria Inês M. C.; Pereira, Gonçalo G.; Pesquero, João Bosco; Rodrigues, Vanderlei; Rogatto, Silvia R.; da Silva, Ismael D. C. G.; Sogayar, Mari C.; de Fátima Sonati, Maria; Tajara, Eloiza H.; Valentini, Sandro R.; Acencio, Marcio; Alberto, Fernando L.; Amaral, Maria Elisabete J.; Aneas, Ivy; Bengtson, Mário Henrique; Carraro, Dirce M.; Carvalho, Alex F.; Carvalho, Lúcia Helena; Cerutti, Janete M.; Corrêa, Maria Lucia C.; Costa, Maria Cristina R.; Curcio, Cyntia; Gushiken, Tsieko; Ho, Paulo L.; Kimura, Elza; Leite, Luciana C. C.; Maia, Gustavo; Majumder, Paromita; Marins, Mozart; Matsukuma, Adriana; Melo, Analy S. A.; Mestriner, Carlos Alberto; Miracca, Elisabete C.; Miranda, Daniela C.; Nascimento, Ana Lucia T. O.; Nóbrega, Francisco G.; Ojopi, Élida P. B.; Pandolfi, José Rodrigo C.; Pessoa, Luciana Gilbert; Rahal, Paula; Rainho, Claudia A.; da Ro's, Nancy; de Sá, Renata G.; Sales, Magaly M.; da Silva, Neusa P.; Silva, Tereza C.; da Silva, Wilson; Simão, Daniel F.; Sousa, Josane F.; Stecconi, Daniella; Tsukumo, Fernando; Valente, Valéria; Zalcberg, Heloisa; Brentani, Ricardo R.; Reis, Luis F. L.; Dias-Neto, Emmanuel; Simpson, Andrew J. G.
2000-01-01
Transcribed sequences in the human genome can be identified with confidence only by alignment with sequences derived from cDNAs synthesized from naturally occurring mRNAs. We constructed a set of 250,000 cDNAs that represent partial expressed gene sequences and that are biased toward the central coding regions of the resulting transcripts. They are termed ORF expressed sequence tags (ORESTES). The 250,000 ORESTES were assembled into 81,429 contigs. Of these, 1,181 (1.45%) were found to match sequences in chromosome 22 with at least one ORESTES contig for 162 (65.6%) of the 247 known genes, for 67 (44.6%) of the 150 related genes, and for 45 of the 148 (30.4%) EST-predicted genes on this chromosome. Using a set of stringent criteria to validate our sequences, we identified a further 219 previously unannotated transcribed sequences on chromosome 22. Of these, 171 were in fact also defined by EST or full length cDNA sequences available in GenBank but not utilized in the initial annotation of the first human chromosome sequence. Thus despite representing less than 15% of all expressed human sequences in the public databases at the time of the present analysis, ORESTES sequences defined 48 transcribed sequences on chromosome 22 not defined by other sequences. All of the transcribed sequences defined by ORESTES coincided with DNA regions predicted as encoding exons by GENSCAN (http://genes.mit.edu/GENSCAN.html). PMID:11070084
Long, C M; Virolle, M J; Chang, S Y; Chang, S; Bibb, M J
1987-01-01
The nucleotide sequence of the coding and regulatory regions of the alpha-amylase gene (aml) of Streptomyces limosus was determined. High-resolution S1 mapping was used to locate the 5' end of the transcript and demonstrated that the gene is transcribed from a unique promoter. The predicted amino acid sequence has considerable identity to mammalian and invertebrate alpha-amylases, but not to those of plant, fungal, or eubacterial origin. Consistent with this is the susceptibility of the enzyme to an inhibitor of mammalian alpha-amylases. The amino-terminal sequence of the extracellular enzyme was determined, revealing the presence of a typical signal peptide preceding the mature form of the alpha-amylase. PMID:3500166
Moors, Pieter
2015-01-01
In a recent functional magnetic resonance imaging study, Kok and de Lange (2014) observed that BOLD activity for a Kanizsa illusory shape stimulus, in which pacmen-like inducers elicit an illusory shape percept, was either enhanced or suppressed relative to a nonillusory control configuration depending on whether the spatial profile of BOLD activity in early visual cortex was related to the illusory shape or the inducers, respectively. The authors argued that these findings fit well with the predictive coding framework, because top-down predictions related to the illusory shape are not met with bottom-up sensory input and hence the feedforward error signal is enhanced. Conversely, for the inducing elements, there is a match between top-down predictions and input, leading to a decrease in error. Rather than invoking predictive coding as the explanatory framework, the suppressive effect related to the inducers might be caused by neural adaptation to perceptually stable input due to the trial sequence used in the experiment.
NASA Astrophysics Data System (ADS)
Belloni, Diogo; Schreiber, Matthias R.; Zorotovic, Mónica; Iłkiewicz, Krystian; Hurley, Jarrod R.; Giersz, Mirek; Lagos, Felipe
2018-06-01
The predicted and observed space densities of cataclysmic variables (CVs) have long been discrepant by at least an order of magnitude. The standard model of CV evolution predicts that the vast majority of CVs should be period bouncers, whose space density has been recently measured to be ρ ≲ 2 × 10^-5 pc^-3. We performed population synthesis of CVs using an updated version of the Binary Stellar Evolution (BSE) code for single and binary star evolution. We find that the recently suggested empirical prescription of consequential angular momentum loss (CAML) brings into agreement predicted and observed space densities of CVs and period bouncers. To progress with our understanding of CV evolution it is crucial to understand the physical mechanism behind empirical CAML. Our changes to the BSE code are also provided in detail, which will allow the community to accurately model mass transfer in interacting binaries in which degenerate objects accrete from low-mass main-sequence donor stars.
SIM_ADJUST -- A computer code that adjusts simulated equivalents for observations or predictions
Poeter, Eileen P.; Hill, Mary C.
2008-01-01
This report documents the SIM_ADJUST computer code. SIM_ADJUST surmounts an obstacle that is sometimes encountered when using universal model analysis computer codes such as UCODE_2005 (Poeter and others, 2005), PEST (Doherty, 2004), and OSTRICH (Matott, 2005; Fredrick and others, 2007). These codes often read simulated equivalents from a list in a file produced by a process model such as MODFLOW that represents a system of interest. At times values needed by the universal code are missing or assigned default values because the process model could not produce a useful solution. SIM_ADJUST can be used to (1) read a file that lists expected observation or prediction names and possible alternatives for the simulated values; (2) read a file produced by a process model that contains space or tab delimited columns, including a column of simulated values and a column of related observation or prediction names; (3) identify observations or predictions that have been omitted or assigned a default value by the process model; and (4) produce an adjusted file that contains a column of simulated values and a column of associated observation or prediction names. The user may provide alternatives that are constant values or that are alternative simulated values. The user may also provide a sequence of alternatives. For example, the heads from a series of cells may be specified to ensure that a meaningful value is available to compare with an observation located in a cell that may become dry. SIM_ADJUST is constructed using modules from the JUPITER API, and is intended for use on any computer operating system. SIM_ADJUST consists of algorithms programmed in Fortran90, which efficiently performs numerical calculations.
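The four steps listed in this abstract amount to a fallback lookup, which the following sketch illustrates. It is written in Python rather than Fortran90 and does not reproduce the SIM_ADJUST file formats or the JUPITER API: the in-memory dictionaries, the function name `adjust`, the observation names, and the default flag of -999.0 are all assumptions made for illustration.

```python
def adjust(simulated, alternatives, default_value=-999.0):
    """Replace missing or default simulated values with user alternatives.

    simulated     -- dict: name -> value written by the process model
    alternatives  -- dict: expected name -> ordered list of alternatives;
                     a str names another simulated value to try, a number
                     is used as a constant
    default_value -- assumed flag the process model writes when it has no
                     useful solution (hypothetical value)
    """
    def usable(value):
        return value is not None and value != default_value

    adjusted = {}
    for name, options in alternatives.items():
        value = simulated.get(name)
        if not usable(value):
            value = None
            # Walk the sequence of alternatives until one is usable.
            for option in options:
                if isinstance(option, str):
                    candidate = simulated.get(option)
                    if usable(candidate):
                        value = candidate
                        break
                else:
                    value = float(option)
                    break
        adjusted[name] = value
    return adjusted


# Hypothetical heads: head_2 sits in a dry cell, head_3 was omitted.
simulated = {"head_1": 12.5, "head_2": -999.0, "cell_below": 9.75}
alternatives = {
    "head_1": [0.0],
    "head_2": ["cell_below", 0.0],
    "head_3": ["missing_cell", 5.0],
}
print(adjust(simulated, alternatives))
```

Here head_1 keeps its simulated value, head_2 falls back to the head in another cell, and head_3 falls through to the constant, mirroring the dry-cell example in the abstract.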
GAMSOR: Gamma Source Preparation and DIF3D Flux Solution
DOE Office of Scientific and Technical Information (OSTI.GOV)
Smith, M. A.; Lee, C. H.; Hill, R. N.
2016-12-15
Nuclear reactors that rely upon the fission reaction have two modes of thermal energy deposition in the reactor system: neutron absorption and gamma absorption. The gamma rays are typically generated by neutron absorption reactions or during the fission process which means the primary driver of energy production is of course the neutron interaction. In conventional reactor physics methods, the gamma heating component is ignored such that the gamma absorption is forced to occur at the gamma emission site. For experimental reactor systems like EBR-II and FFTF, the placement of structural pins and assemblies internal to the core leads to problems with power heating predictions because there is no fission power source internal to the assembly to dictate a spatial distribution of the power. As part of the EBR-II support work in the 1980s, the GAMSOR code was developed to assist analysts in calculating the gamma heating. The GAMSOR code is a modified version of DIF3D and actually functions within a sequence of DIF3D calculations. The gamma flux in a conventional fission reactor system does not perturb the neutron flux and thus the gamma flux calculation can be cast as a fixed source problem given a solution to the steady state neutron flux equation. This leads to a sequence of DIF3D calculations, called the GAMSOR sequence, which involves solving the neutron flux, then the gamma flux, then combining the results to do a summary edit. In this manuscript, we go over the GAMSOR code and detail how it is put together and functions. We also discuss how to setup the GAMSOR sequence and input for each DIF3D calculation in the GAMSOR sequence. With the GAMSOR capability, users can take any valid steady state DIF3D calculation and compute the power distribution due to neutron and gamma heating. The MC2-3 code is the preferable companion code to use for generating neutron and gamma cross section data, but the GAMSOR code can accept cross section data from other sources. To further this aspect, an additional utility code was created which demonstrates how to merge the neutron and gamma cross section data together to carry out a simultaneous solve of the two systems.
NASA Astrophysics Data System (ADS)
Kraljić, K.; Strüngmann, L.; Fimmel, E.; Gumbel, M.
2018-01-01
The genetic code is degenerated and it is assumed that redundancy provides error detection and correction mechanisms in the translation process. However, the biological meaning of the code's structure is still under current research. This paper presents a Genetic Code Analysis Toolkit (GCAT) which provides workflows and algorithms for the analysis of the structure of nucleotide sequences. In particular, sets or sequences of codons can be transformed and tested for circularity, comma-freeness, dichotomic partitions and others. GCAT comes with a fertile editor custom-built to work with the genetic code and a batch mode for multi-sequence processing. With the ability to read FASTA files or load sequences from GenBank, the tool can be used for the mathematical and statistical analysis of existing sequence data. GCAT is Java-based and provides a plug-in concept for extensibility. Availability: Open source. Homepage: http://www.gcat.bio/
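Of the properties named in this abstract, comma-freeness is simple enough to sketch directly. A set of codons is comma-free when no codon of the set can be read in either shifted frame across the junction of any two codons of the set. The function below is a Python illustration of that textbook definition and is not taken from GCAT, which is Java-based; the function name is hypothetical.

```python
from itertools import product


def is_comma_free(codons):
    """Check whether a set of trinucleotides forms a comma-free code.

    For every ordered pair of codons (including a codon paired with
    itself), the two out-of-frame triplets spanning the junction must
    not belong to the set.
    """
    code = {codon.upper() for codon in codons}
    if any(len(codon) != 3 for codon in code):
        raise ValueError("every codon must have exactly three letters")
    for first, second in product(code, repeat=2):
        joined = first + second
        # joined[1:4] is the +1 frame, joined[2:5] the +2 frame.
        if joined[1:4] in code or joined[2:5] in code:
            return False
    return True


print(is_comma_free({"ACG", "TCG"}))   # no shifted reading lands in the set
print(is_comma_free({"ACG", "CGA"}))   # ACG+ACG contains CGA in the +1 frame
```

A homopolymer codon such as AAA can never belong to a comma-free code, since AAA followed by AAA reads as AAA in every frame.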
Mühlhausen, Stefanie; Findeisen, Peggy; Plessmann, Uwe; Urlaub, Henning; Kollmar, Martin
2016-07-01
The genetic code is the cellular translation table for the conversion of nucleotide sequences into amino acid sequences. Changes to the meaning of sense codons would introduce errors into almost every translated message and are expected to be highly detrimental. However, reassignment of single or multiple codons in mitochondria and nuclear genomes, although extremely rare, demonstrates that the code can evolve. Several models for the mechanism of alteration of nuclear genetic codes have been proposed (including "codon capture," "genome streamlining," and "ambiguous intermediate" theories), but with little resolution. Here, we report a novel sense codon reassignment in Pachysolen tannophilus, a yeast related to the Pichiaceae. By generating proteomics data and using tRNA sequence comparisons, we show that Pachysolen translates CUG codons as alanine and not as the more usual leucine. The Pachysolen tRNACAG is an anticodon-mutated tRNA(Ala) containing all major alanine tRNA recognition sites. The polyphyly of the CUG-decoding tRNAs in yeasts is best explained by a tRNA loss driven codon reassignment mechanism. Loss of the CUG-tRNA in the ancient yeast is followed by gradual decrease of respective codons and subsequent codon capture by tRNAs whose anticodon is not part of the aminoacyl-tRNA synthetase recognition region. Our hypothesis applies to all nuclear genetic code alterations and provides several testable predictions. We anticipate more codon reassignments to be uncovered in existing and upcoming genome projects. © 2016 Mühlhausen et al.; Published by Cold Spring Harbor Laboratory Press.
2010-01-01
Background The identification of non-coding transcripts in human, mouse, and Escherichia coli has revealed their widespread occurrence and functional importance in both eukaryotic and prokaryotic life. In prokaryotes, studies have shown that non-coding transcripts participate in a broad range of cellular functions like gene regulation, stress and virulence. However, very little is known about non-coding transcripts in Streptococcus pneumoniae (pneumococcus), an obligate human respiratory pathogen responsible for significant worldwide morbidity and mortality. Tiling microarrays enable genome wide mRNA profiling as well as identification of novel transcripts at a high-resolution. Results Here, we describe a high-resolution transcription map of the S. pneumoniae clinical isolate TIGR4 using genomic tiling arrays. Our results indicate that approximately 66% of the genome is expressed under our experimental conditions. We identified a total of 50 non-coding small RNAs (sRNAs) from the intergenic regions, of which 36 had no predicted function. Half of the identified sRNA sequences were found to be unique to S. pneumoniae genome. We identified eight overrepresented sequence motifs among sRNA sequences that correspond to sRNAs in different functional categories. Tiling arrays also identified approximately 202 operon structures in the genome. Conclusions In summary, the pneumococcal operon structures and novel sRNAs identified in this study enhance our understanding of the complexity and extent of the pneumococcal 'expressed' genome. Furthermore, the results of this study open up new avenues of research for understanding the complex RNA regulatory network governing S. pneumoniae physiology and virulence. PMID:20525227
SEQassembly: A Practical Tools Program for Coding Sequences Splicing
NASA Astrophysics Data System (ADS)
Lee, Hongbin; Yang, Hang; Fu, Lei; Qin, Long; Li, Huili; He, Feng; Wang, Bo; Wu, Xiaoming
Coding sequences (CDS) are the portions of mRNA sequences composed of a number of exon segments. Constructing the CDS is important for in-depth genetic analyses such as genotyping. We present a MATLAB program that processes batches of sample sequences into coding segments under the guidance of reference exon models, and splices the coding segments from the same sample into a CDS according to the exon order given in a queue file. The program is useful for transcriptional polymorphism detection and gene function studies.
SVM-Fold: a tool for discriminative multi-class protein fold and superfamily recognition
Melvin, Iain; Ie, Eugene; Kuang, Rui; Weston, Jason; Stafford, William Noble; Leslie, Christina
2007-01-01
Background Predicting a protein's structural class from its amino acid sequence is a fundamental problem in computational biology. Much recent work has focused on developing new representations for protein sequences, called string kernels, for use with support vector machine (SVM) classifiers. However, while some of these approaches exhibit state-of-the-art performance at the binary protein classification problem, i.e. discriminating between a particular protein class and all other classes, few of these studies have addressed the real problem of multi-class superfamily or fold recognition. Moreover, there are only limited software tools and systems for SVM-based protein classification available to the bioinformatics community. Results We present a new multi-class SVM-based protein fold and superfamily recognition system and web server called SVM-Fold, which can be found at . Our system uses an efficient implementation of a state-of-the-art string kernel for sequence profiles, called the profile kernel, where the underlying feature representation is a histogram of inexact matching k-mer frequencies. We also employ a novel machine learning approach to solve the difficult multi-class problem of classifying a sequence of amino acids into one of many known protein structural classes. Binary one-vs-the-rest SVM classifiers that are trained to recognize individual structural classes yield prediction scores that are not comparable, so that standard "one-vs-all" classification fails to perform well. Moreover, SVMs for classes at different levels of the protein structural hierarchy may make useful predictions, but one-vs-all does not try to combine these multiple predictions. To deal with these problems, our method learns relative weights between one-vs-the-rest classifiers and encodes information about the protein structural hierarchy for multi-class prediction. 
In large-scale benchmark results based on the SCOP database, our code weighting approach significantly improves on the standard one-vs-all method for both the superfamily and fold prediction in the remote homology setting and on the fold recognition problem. Moreover, our code weight learning algorithm strongly outperforms nearest-neighbor methods based on PSI-BLAST in terms of prediction accuracy on every structure classification problem we consider. Conclusion By combining state-of-the-art SVM kernel methods with a novel multi-class algorithm, the SVM-Fold system delivers efficient and accurate protein fold and superfamily recognition. PMID:17570145
Paul, Sujay; Zhang, Angel; Ludeña, Yvette; Villena, Gretty K; Yu, Fengan; Sherman, David H; Gutiérrez-Correa, Marcel
2017-06-10
Here, we report the complete genome sequence of a high alkaline cellulase-producing Aspergillus fumigatus strain, LMB-35Aa, isolated from soil of the Peruvian Amazon rainforest. The genome is ~27.5 Mb in size, comprises 228 scaffolds with an average GC content of 50%, and is predicted to contain a total of 8660 protein-coding genes, of which 6156 have a known function; it codes for 607 putative CAZyme families potentially involved in carbohydrate metabolism. Several important cellulose-degrading genes, such as endoglucanase A, endoglucanase B, endoglucanase D and beta-glucosidase, are also identified. The genome of A. fumigatus strain LMB-35Aa represents the first whole sequenced genome of a non-clinical, high cellulase-producing A. fumigatus strain isolated from forest soil. Copyright © 2017 Elsevier B.V. All rights reserved.
Tetrahymena thermophila acidic ribosomal protein L37 contains an archaebacterial type of C-terminus.
Hansen, T S; Andreasen, P H; Dreisig, H; Højrup, P; Nielsen, H; Engberg, J; Kristiansen, K
1991-09-15
We have cloned and characterized a Tetrahymena thermophila macronuclear gene (L37) encoding the acidic ribosomal protein (A-protein) L37. The gene contains a single intron located in the 3'-part of the coding region. Two major and three minor transcription start points (tsp) were mapped 39 to 63 nucleotides upstream from the translational start codon. The uppermost tsp mapped to the first T in a putative T. thermophila RNA polymerase II initiator element, TATAA. The coding region of L37 predicts a protein of 109 amino acid (aa) residues. A substantial part of the deduced aa sequence was verified by protein sequencing. The T. thermophila L37 clearly belongs to the P1-type family of eukaryotic A-proteins, but the C-terminal region has the hallmarks of archaebacterial A-proteins.
Statistical properties of DNA sequences
NASA Technical Reports Server (NTRS)
Peng, C. K.; Buldyrev, S. V.; Goldberger, A. L.; Havlin, S.; Mantegna, R. N.; Simons, M.; Stanley, H. E.
1995-01-01
We review evidence supporting the idea that the DNA sequence in genes containing non-coding regions is correlated, and that the correlation is remarkably long range--indeed, nucleotides thousands of base pairs distant are correlated. We do not find such a long-range correlation in the coding regions of the gene. We resolve the problem of the "non-stationarity" feature of the sequence of base pairs by applying a new algorithm called detrended fluctuation analysis (DFA). We address the claim of Voss that there is no difference in the statistical properties of coding and non-coding regions of DNA by systematically applying the DFA algorithm, as well as standard FFT analysis, to every DNA sequence (33301 coding and 29453 non-coding) in the entire GenBank database. Finally, we describe briefly some recent work showing that the non-coding sequences have certain statistical features in common with natural and artificial languages. Specifically, we adapt to DNA the Zipf approach to analyzing linguistic texts. These statistical properties of non-coding sequences support the possibility that non-coding regions of DNA may carry biological information.
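The detrended fluctuation analysis mentioned in the abstract above can be sketched in a few lines. The following is a minimal pure-Python illustration, not the authors' implementation. It assumes first-order (linear) detrending in non-overlapping boxes, and it assumes a purine/pyrimidine mapping of the sequence to a ±1 walk, which the abstract itself does not specify. All function names are hypothetical.

```python
import math
import random


def dna_to_walk(seq):
    """Map a DNA string to +1/-1 steps (assumed rule: purine +1, pyrimidine -1)."""
    return [1 if base in "AG" else -1 for base in seq.upper() if base in "ACGT"]


def dfa_fluctuations(steps, box_sizes):
    """Return F(n) for each box size n (each n must be >= 2), linear detrending."""
    mean = sum(steps) / len(steps)
    profile, total = [], 0.0
    for s in steps:                      # integrated, mean-subtracted walk
        total += s - mean
        profile.append(total)
    result = []
    for n in box_sizes:
        n_boxes = len(profile) // n
        t_mean = (n - 1) / 2
        t_var = sum((t - t_mean) ** 2 for t in range(n))
        sq_sum = 0.0
        for b in range(n_boxes):
            box = profile[b * n:(b + 1) * n]
            y_mean = sum(box) / n
            # least-squares slope of the local trend inside this box
            slope = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(box)) / t_var
            sq_sum += sum((y - y_mean - slope * (t - t_mean)) ** 2
                          for t, y in enumerate(box))
        result.append(math.sqrt(sq_sum / (n_boxes * n)))
    return result


def scaling_exponent(box_sizes, fluctuations):
    """Slope of log F(n) versus log n (the DFA exponent, often written alpha)."""
    xs = [math.log(n) for n in box_sizes]
    ys = [math.log(f) for f in fluctuations]
    x_mean, y_mean = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
            / sum((x - x_mean) ** 2 for x in xs))


if __name__ == "__main__":
    random.seed(0)
    steps = [random.choice((-1, 1)) for _ in range(8000)]  # uncorrelated control
    sizes = [8, 16, 32, 64, 128]
    print(scaling_exponent(sizes, dfa_fluctuations(steps, sizes)))
```

For an uncorrelated sequence the exponent is expected to be close to 0.5; values clearly above 0.5 over a wide range of box sizes are what is usually read as long-range correlation.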
Cellulases and coding sequences
Li, Xin-Liang; Ljungdahl, Lars G.; Chen, Huizhong
2001-02-20
The present invention provides three fungal cellulases, their coding sequences, recombinant DNA molecules comprising the cellulase coding sequences, recombinant host cells and methods for producing same. The present cellulases are from Orpinomyces PC-2.
Cellulases and coding sequences
Li, Xin-Liang; Ljungdahl, Lars G.; Chen, Huizhong
2001-01-01
The present invention provides three fungal cellulases, their coding sequences, recombinant DNA molecules comprising the cellulase coding sequences, recombinant host cells and methods for producing same. The present cellulases are from Orpinomyces PC-2.
Cloud prediction of protein structure and function with PredictProtein for Debian.
Kaján, László; Yachdav, Guy; Vicedo, Esmeralda; Steinegger, Martin; Mirdita, Milot; Angermüller, Christof; Böhm, Ariane; Domke, Simon; Ertl, Julia; Mertes, Christian; Reisinger, Eva; Staniewski, Cedric; Rost, Burkhard
2013-01-01
We report the release of PredictProtein for the Debian operating system and derivatives, such as Ubuntu, Bio-Linux, and Cloud BioLinux. The PredictProtein suite is available as a standard set of open source Debian packages. The release covers the most popular prediction methods from the Rost Lab, including methods for the prediction of secondary structure and solvent accessibility (profphd), nuclear localization signals (predictnls), and intrinsically disordered regions (norsnet). We also present two case studies that successfully utilize PredictProtein packages for high performance computing in the cloud: the first analyzes protein disorder for whole organisms, and the second analyzes the effect of all possible single sequence variants in protein coding regions of the human genome.
Cloud Prediction of Protein Structure and Function with PredictProtein for Debian
Kaján, László; Yachdav, Guy; Vicedo, Esmeralda; Steinegger, Martin; Mirdita, Milot; Angermüller, Christof; Böhm, Ariane; Domke, Simon; Ertl, Julia; Mertes, Christian; Reisinger, Eva; Rost, Burkhard
2013-01-01
We report the release of PredictProtein for the Debian operating system and derivatives, such as Ubuntu, Bio-Linux, and Cloud BioLinux. The PredictProtein suite is available as a standard set of open source Debian packages. The release covers the most popular prediction methods from the Rost Lab, including methods for the prediction of secondary structure and solvent accessibility (profphd), nuclear localization signals (predictnls), and intrinsically disordered regions (norsnet). We also present two case studies that successfully utilize PredictProtein packages for high performance computing in the cloud: the first analyzes protein disorder for whole organisms, and the second analyzes the effect of all possible single sequence variants in protein coding regions of the human genome. PMID:23971032
Liu, Huitao; Cui, Peng; Zhan, Kehui; Lin, Qiang; Zhuo, Guoyin; Guo, Xiaoli; Ding, Feng; Yang, Wenlong; Liu, Dongcheng; Hu, Songnian; Yu, Jun; Zhang, Aimin
2011-03-29
Plant mitochondria, semiautonomous organelles that function as manufacturers of cellular ATP, have their own genome that has a slow rate of evolution and rapid rearrangement. Cytoplasmic male sterility (CMS), a common phenotype in higher plants, is closely associated with rearrangements in mitochondrial DNA (mtDNA), and is widely used to produce F1 hybrid seeds in a variety of valuable crop species. Novel chimeric genes deduced from mtDNA rearrangements causing CMS have been identified in several plants, such as rice, sunflower, pepper, and rapeseed, but there are very few reports about mtDNA rearrangements in wheat. In the present work, we describe the mitochondrial genome of a wheat K-type CMS line and compare it with its maintainer line. The complete mtDNA sequence of a wheat K-type (with cytoplasm of Aegilops kotschyi) CMS line, Ks3, was assembled into a master circle (MC) molecule of 647,559 bp and found to harbor 34 known protein-coding genes, three rRNAs (18 S, 26 S, and 5 S rRNAs), and 16 different tRNAs. Compared to our previously published sequence of a K-type maintainer line, Km3, we detected Ks3-specific mtDNA (> 100 bp, 11.38%) and repeats (> 100 bp, 29 units) as well as genes that are unique to each line: rpl5 was missing in Ks3 and trnH was absent from Km3. We also defined 32 single nucleotide polymorphisms (SNPs) in 13 protein-coding, albeit functionally irrelevant, genes, and predicted 22 unique ORFs in Ks3, representing potential candidates for K-type CMS. All these sequence variations are candidates for involvement in CMS. A comparative analysis of the mtDNA of several angiosperms, including those from Ks3, Km3, rice, maize, Arabidopsis thaliana, and rapeseed, showed that non-coding sequences of higher plants had mostly divergent multiple reorganizations during the mtDNA evolution of higher plants. 
The complete mitochondrial genome of the wheat K-type CMS line Ks3 is very different from that of its maintainer line Km3, especially in non-coding sequences. Sequence rearrangement has produced novel chimeric ORFs, which may be candidate genes for CMS. Comparative analysis of several angiosperm mtDNAs indicated that non-coding sequences are the most frequently reorganized during mtDNA evolution in higher plants.
Decoding the complex genetic causes of heart diseases using systems biology.
Djordjevic, Djordje; Deshpande, Vinita; Szczesnik, Tomasz; Yang, Andrian; Humphreys, David T; Giannoulatou, Eleni; Ho, Joshua W K
2015-03-01
The pace of disease gene discovery is still much slower than expected, even with the use of cost-effective DNA sequencing and genotyping technologies. It is increasingly clear that many inherited heart diseases have a more complex polygenic aetiology than previously thought. Understanding the role of gene-gene interactions, epigenetics, and non-coding regulatory regions is becoming increasingly critical in predicting the functional consequences of genetic mutations identified by genome-wide association studies and whole-genome or exome sequencing. A systems biology approach is now being widely employed to systematically discover genes that are involved in heart diseases in humans or relevant animal models through bioinformatics. The overarching premise is that the integration of high-quality causal gene regulatory networks (GRNs), genomics, epigenomics, transcriptomics and other genome-wide data will greatly accelerate the discovery of the complex genetic causes of congenital and complex heart diseases. This review summarises state-of-the-art genomic and bioinformatics techniques that are used in accelerating the pace of disease gene discovery in heart diseases. Accompanying this review, we provide an interactive web-resource for systems biology analysis of mammalian heart development and diseases, CardiacCode ( http://CardiacCode.victorchang.edu.au/ ). CardiacCode features a dataset of over 700 pieces of manually curated genetic or molecular perturbation data, which enables the inference of a cardiac-specific GRN of 280 regulatory relationships between 33 regulator genes and 129 target genes. We believe this growing resource will fill an urgent unmet need to fully realise the true potential of predictive and personalised genomic medicine in tackling human heart disease.
Gönner, Lorenz; Vitay, Julien; Hamker, Fred H.
2017-01-01
Hippocampal place-cell sequences observed during awake immobility often represent previous experience, suggesting a role in memory processes. However, recent reports of goals being overrepresented in sequential activity suggest a role in short-term planning, although a detailed understanding of the origins of hippocampal sequential activity and of its functional role is still lacking. In particular, it is unknown which mechanism could support efficient planning by generating place-cell sequences biased toward known goal locations, in an adaptive and constructive fashion. To address these questions, we propose a model of spatial learning and sequence generation as interdependent processes, integrating cortical contextual coding, synaptic plasticity and neuromodulatory mechanisms into a map-based approach. Following goal learning, sequential activity emerges from continuous attractor network dynamics biased by goal memory inputs. We apply Bayesian decoding on the resulting spike trains, allowing a direct comparison with experimental data. Simulations show that this model (1) explains the generation of never-experienced sequence trajectories in familiar environments, without requiring virtual self-motion signals, (2) accounts for the bias in place-cell sequences toward goal locations, (3) highlights their utility in flexible route planning, and (4) provides specific testable predictions. PMID:29075187
Roberts, C H; Turino, C; Madrigal, J A; Marsh, S G E
2007-06-01
DNA enrichment by allele-specific hybridization (DEASH) was used as a means to isolate individual alleles of the killer cell immunoglobulin-like receptor (KIR2DL4) gene from heterozygous genomic DNA. Using long-template polymerase chain reaction (LT-PCR), the complete KIR2DL4 gene was amplified from a cell line that had previously been characterized for its KIR gene content by PCR using sequence-specific primers (PCR-SSP). The whole gene amplicons were sequenced and we identified two heterozygous positions in accordance with the predictions of the PCR-SSP. The amplicons were then hybridized to allele-specific, biotinylated oligonucleotide probes and through binding to streptavidin-coated beads, the targeted alleles were enriched. A second PCR amplified only the exonic regions of the enriched allele, and these were then sequenced in full. We show DEASH to be capable of enriching single alleles from a heterozygous PCR product, and through sequencing the enriched DNA, we are able to produce complete coding sequences of the KIR2DL4 alleles in accordance with the typing predicted by PCR-SSP.
David S. Bischoff; James M. Slavicek
1995-01-01
The Lymantria dispar multinucleocapsid nuclear polyhedrosis virus (LdMNPV) gene encoding G22 was cloned and sequenced. The G22 gene codes for a 191 amino acid protein with a predicted Mr of 22000. Expression of G22 in a rabbit reticulocyte system generated a protein with an M...
RNA catalysis and the origins of life
NASA Technical Reports Server (NTRS)
Orgel, Leslie E.
1986-01-01
The role of RNA catalysis in the origins of life is considered in connection with the discovery of ribozymes, which are RNA molecules that catalyze sequence-specific hydrolysis and transesterification reactions of RNA substrates. Due to this discovery, theories positing protein-free replication as preceding the appearance of the genetic code are more plausible. The scope of RNA catalysis in biology and chemistry is discussed, and it is noted that the development of methods to select (or predict) RNA sequences with preassigned catalytic functions would be a major contribution to the study of life's origins.
KONAGAbase: a genomic and transcriptomic database for the diamondback moth, Plutella xylostella.
Jouraku, Akiya; Yamamoto, Kimiko; Kuwazaki, Seigo; Urio, Masahiro; Suetsugu, Yoshitaka; Narukawa, Junko; Miyamoto, Kazuhisa; Kurita, Kanako; Kanamori, Hiroyuki; Katayose, Yuichi; Matsumoto, Takashi; Noda, Hiroaki
2013-07-09
The diamondback moth (DBM), Plutella xylostella, is one of the most harmful insect pests for crucifer crops worldwide. DBM has rapidly evolved high resistance to most conventional insecticides such as pyrethroids, organophosphates, fipronil, spinosad, Bacillus thuringiensis, and diamides. Therefore, it is important to develop genomic and transcriptomic DBM resources for analysis of genes related to insecticide resistance, both to clarify the mechanism of resistance of DBM and to facilitate the development of insecticides with a novel mode of action for more effective and environmentally less harmful insecticide rotation. To contribute to this goal, we developed KONAGAbase, a genomic and transcriptomic database for DBM (KONAGA is the Japanese word for DBM). KONAGAbase provides (1) transcriptomic sequences of 37,340 ESTs/mRNAs and 147,370 RNA-seq contigs which were clustered and assembled into 84,570 unigenes (30,695 contigs, 50,548 pseudo singletons, and 3,327 singletons); and (2) genomic sequences of 88,530 WGS contigs with 246,244 degenerate contigs and 106,455 singletons from which 6,310 de novo identified repeat sequences and 34,890 predicted gene-coding sequences were extracted. The unigenes and predicted gene-coding sequences were clustered and 32,800 representative sequences were extracted as a comprehensive putative gene set. These sequences were annotated with BLAST descriptions, Gene Ontology (GO) terms, and Pfam descriptions, respectively. KONAGAbase contains rich graphical user interface (GUI)-based web interfaces for easy and efficient searching, browsing, and downloading sequences and annotation data. Five useful search interfaces consisting of BLAST search, keyword search, BLAST result-based search, GO tree-based search, and genome browser are provided. KONAGAbase is publicly available from our website (http://dbm.dna.affrc.go.jp/px/) through standard web browsers. 
KONAGAbase provides DBM comprehensive transcriptomic and draft genomic sequences with useful annotation information with easy-to-use web interfaces, which helps researchers to efficiently search for target sequences such as insect resistance-related genes. KONAGAbase will be continuously updated and additional genomic/transcriptomic resources and analysis tools will be provided for further efficient analysis of the mechanism of insecticide resistance and the development of effective insecticides with a novel mode of action for DBM.
Omasits, Ulrich; Varadarajan, Adithi R; Schmid, Michael; Goetze, Sandra; Melidis, Damianos; Bourqui, Marc; Nikolayeva, Olga; Québatte, Maxime; Patrignani, Andrea; Dehio, Christoph; Frey, Juerg E; Robinson, Mark D; Wollscheid, Bernd; Ahrens, Christian H
2017-12-01
Accurate annotation of all protein-coding sequences (CDSs) is an essential prerequisite to fully exploit the rapidly growing repertoire of completely sequenced prokaryotic genomes. However, large discrepancies among the number of CDSs annotated by different resources, missed functional short open reading frames (sORFs), and overprediction of spurious ORFs represent serious limitations. Our strategy toward accurate and complete genome annotation consolidates CDSs from multiple reference annotation resources, ab initio gene prediction algorithms and in silico ORFs (a modified six-frame translation considering alternative start codons) in an integrated proteogenomics database (iPtgxDB) that covers the entire protein-coding potential of a prokaryotic genome. By extending the PeptideClassifier concept of unambiguous peptides for prokaryotes, close to 95% of the identifiable peptides imply one distinct protein, largely simplifying downstream analysis. Searching a comprehensive Bartonella henselae proteomics data set against such an iPtgxDB allowed us to unambiguously identify novel ORFs uniquely predicted by each resource, including lipoproteins, differentially expressed and membrane-localized proteins, novel start sites and wrongly annotated pseudogenes. Most novelties were confirmed by targeted, parallel reaction monitoring mass spectrometry, including unique ORFs and single amino acid variations (SAAVs) identified in a re-sequenced laboratory strain that are not present in its reference genome. We demonstrate the general applicability of our strategy for genomes with varying GC content and distinct taxonomic origin. We release iPtgxDBs for B. henselae , Bradyrhizobium diazoefficiens and Escherichia coli and the software to generate both proteogenomics search databases and integrated annotation files that can be viewed in a genome browser for any prokaryote. © 2017 Omasits et al.; Published by Cold Spring Harbor Laboratory Press.
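The abstract above mentions in silico ORFs obtained from a six-frame scan that considers alternative start codons. The following is a minimal Python sketch of that general idea only; it is not the iPtgxDB software. The start codon set (ATG, GTG, TTG), the minimum length, and all names are assumptions made for illustration.

```python
# Assumed start/stop sets; real tools may use other alternative starts.
STARTS = {"ATG", "GTG", "TTG"}
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=2):
    """Scan all six frames and return ORFs as (strand, frame, start, end, dna).

    start/end are 0-based, end-exclusive offsets on the scanned strand
    (for "-" they refer to the reverse complement). Each ORF begins at
    the most upstream start codon after the previous stop and includes
    its stop codon; min_codons counts codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in STARTS:
                    start = i
                elif start is not None and codon in STOPS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATAGCC"):
        print(orf)
```

Keeping the most upstream start yields the longest candidate per stop codon; reporting every in-frame start instead would be a small change to the inner loop.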
Fernandez-Valverde, Selene L; Calcino, Andrew D; Degnan, Bernard M
2015-05-15
The demosponge Amphimedon queenslandica is amongst the few early-branching metazoans with an assembled and annotated draft genome, making it an important species in the study of the origin and early evolution of animals. Current gene models in this species are largely based on in silico predictions and low coverage expressed sequence tag (EST) evidence. Amphimedon queenslandica protein-coding gene models are improved using deep RNA-Seq data from four developmental stages and CEL-Seq data from 82 developmental samples. Over 86% of previously predicted genes are retained in the new gene models, although 24% have additional exons; there is also a marked increase in the total number of annotated 3' and 5' untranslated regions (UTRs). Importantly, these new developmental transcriptome data reveal numerous previously unannotated protein-coding genes in the Amphimedon genome, increasing the total gene number by 25%, from 30,060 to 40,122. In general, Amphimedon genes have introns that are markedly smaller than those in other animals and most of the alternatively spliced genes in Amphimedon undergo intron-retention; exon-skipping is the least common mode of alternative splicing. Finally, in addition to canonical polyadenylation signal sequences, Amphimedon genes are enriched in a number of unique AT-rich motifs in their 3' UTRs. The inclusion of developmental transcriptome data has substantially improved the structure and composition of protein-coding gene models in Amphimedon queenslandica, providing a more accurate and comprehensive set of genes for functional and comparative studies. These improvements reveal that the Amphimedon genome comprises a remarkably high number of tightly packed genes. These genes have small introns and there is pervasive intron retention amongst alternatively spliced transcripts. These aspects of the sponge genome are more similar to unicellular opisthokont genomes than to other animal genomes.
Zhang, Yulei; Zhao, Lijuan; Chen, Wenjie; Huang, Yunmao; Yang, Ling; Sarathbabu, V; Wu, Zaohe; Li, Jun; Nie, Pin; Lin, Li
2017-10-01
We analyzed the complete genome sequence of a highly virulent Flavobacterium columnare strain, Pf1, isolated in our laboratory. The complete genome consists of a 3,171,081 bp circular DNA with 2784 predicted protein-coding genes. Among these, 286 genes were predicted to be antibiotic resistance genes, including 32 RND-type efflux pump-related genes associated with the export of aminoglycosides, indicating inducible aminoglycoside resistance in F. columnare. In addition, 328 genes were predicted to be pathogenicity-related genes, which could be classified as virulence factors, gliding motility proteins, adhesins, and many putative secreted proteases. These genes are probably involved in the colonization, invasion and destruction of fish tissues during F. columnare infection. The complete genome sequence provides a basis for explaining the interactions between F. columnare and infected fish. The predicted antibiotic resistance and pathogenicity-related genes will shed new light on the development of more efficient preventive strategies against infection by F. columnare, a major worldwide fish pathogen. Copyright © 2017 Elsevier Ltd. All rights reserved.
Joy, Nisha; Asha, Srinivasan; Mallika, Vijayan; Soniya, Eppurathu Vasudevan
2013-01-01
Next-generation sequencing offers an advantage for the transformational development of species with limited available sequence data, as it helps to decode the genome and transcriptome. We carried out de novo sequencing using the Illumina HiSeq™ 2000 to generate the first leaf transcriptome of black pepper (Piper nigrum L.), an important spice variety native to South India and also grown in other tropical regions. Despite the economic and biochemical importance of pepper, a scientifically rigorous study at the molecular level is far from complete due to the lack of sufficient sequence information and the cytological complexity of its genome. The 55 million raw reads obtained, when assembled using the Trinity program, generated 2,23,386 contigs and 1,28,157 unigenes. Reports suggest that the repeat-rich genomic regions give rise to small non-coding functional RNAs. MicroRNAs (miRNAs) are the most abundant type of non-coding regulatory RNAs. In spite of the widespread research on miRNAs, little is known about the hair-pin precursors of miRNAs bearing Simple Sequence Repeats (SSRs). We used the array of transcripts generated for the in silico prediction and detection of ‘43 pre-miRNA candidates bearing different types of SSR motifs’. The analysis identified 3913 different types of SSR motifs with an average of one SSR per 3.04 MB of the transcriptome. About 0.033% of the transcriptome constituted ‘pre-miRNA candidates bearing SSRs’. The abundance, type and distribution of SSR motifs studied across the hair-pin miRNA precursors showed a significant bias in the position of SSRs towards the downstream of predicted ‘pre-miRNA candidates’. The catalogue of transcripts identified, together with the demonstration of reliable existence of SSRs in the miRNA precursors, permits future opportunities for understanding the genetic mechanism of black pepper and likely functions of ‘tandem repeats’ in miRNAs. PMID:23469176
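To make the notion of an SSR motif in the abstract above concrete, here is a minimal Python sketch of a simple sequence repeat scan based on a regular expression. It is not the pipeline used in the study. The motif length range (1 to 6 nt), the repeat threshold, and the name `find_ssrs` are assumptions chosen for illustration.

```python
import re


def find_ssrs(seq, min_repeats=4):
    """Return (start, motif, repeat_count) for perfect tandem repeats.

    Motifs of 1-6 nt are considered (an assumed range); a hit needs at
    least min_repeats (>= 2) consecutive copies. The lazy quantifier
    makes the regex prefer the shortest repeating unit at each position.
    """
    pattern = re.compile(r"([ACGT]{1,6}?)\1{" + str(min_repeats - 1) + r",}")
    hits = []
    for match in pattern.finditer(seq.upper()):
        motif = match.group(1)
        hits.append((match.start(), motif, len(match.group(0)) // len(motif)))
    return hits


if __name__ == "__main__":
    print(find_ssrs("GGATATATATCC"))  # one AT repeat starting at offset 2
```

Real SSR finders usually apply different minimum repeat counts per motif length and may tolerate imperfect repeats; this sketch reports perfect repeats only.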
Villanueva, Pía; Nudel, Ron; Hoischen, Alexander; Fernández, María Angélica; Simpson, Nuala H.; Gilissen, Christian; Reader, Rose H.; Jara, Lillian; Echeverry, Maria Magdalena; Francks, Clyde; Baird, Gillian; Conti-Ramsden, Gina; O’Hare, Anne; Bolton, Patrick F.; Hennessy, Elizabeth R.; Palomino, Hernán; Carvajal-Carmona, Luis; Veltman, Joris A.; Cazier, Jean-Baptiste; De Barbieri, Zulema
2015-01-01
Children affected by Specific Language Impairment (SLI) fail to acquire age appropriate language skills despite adequate intelligence and opportunity. SLI is highly heritable, but the understanding of underlying genetic mechanisms has proved challenging. In this study, we use molecular genetic techniques to investigate an admixed isolated founder population from the Robinson Crusoe Island (Chile), who are affected by a high incidence of SLI, increasing the power to discover contributory genetic factors. We utilize exome sequencing in selected individuals from this population to identify eight coding variants that are of putative significance. We then apply association analyses across the wider population to highlight a single rare coding variant (rs144169475, Minor Allele Frequency of 4.1% in admixed South American populations) in the NFXL1 gene that confers a nonsynonymous change (N150K) and is significantly associated with language impairment in the Robinson Crusoe population (p = 2.04 × 10–4, 8 variants tested). Subsequent sequencing of NFXL1 in 117 UK SLI cases identified four individuals with heterozygous variants predicted to be of functional consequence. We conclude that coding variants within NFXL1 confer an increased risk of SLI within a complex genetic model. PMID:25781923
Shenoy, Archana; Blelloch, Robert
2009-09-11
The Microprocessor, containing the RNA binding protein Dgcr8 and RNase III enzyme Drosha, is responsible for processing primary microRNAs to precursor microRNAs. The Microprocessor regulates its own levels by cleaving hairpins in the 5'UTR and coding region of the Dgcr8 mRNA, thereby destabilizing the mature transcript. To determine whether the Microprocessor has a broader role in directly regulating other coding mRNA levels, we integrated results from expression profiling and ultra high-throughput deep sequencing of small RNAs. Expression analysis of mRNAs in wild-type, Dgcr8 knockout, and Dicer knockout mouse embryonic stem (ES) cells uncovered mRNAs that were specifically upregulated in the Dgcr8 null background. A number of these transcripts had evolutionarily conserved predicted hairpin targets for the Microprocessor. However, analysis of deep sequencing data of 18 to 200nt small RNAs in mouse ES, HeLa, and HepG2 indicates that exonic sequence reads that map in a pattern consistent with Microprocessor activity are unique to Dgcr8. We conclude that the Microprocessor's role in directly destabilizing coding mRNAs is likely specifically targeted to Dgcr8 itself, suggesting a specialized cellular mechanism for gene auto-regulation.
Lim, Byung Chan; Lee, Seungbok; Shin, Jong-Yeon; Kim, Jong-Il; Hwang, Hee; Kim, Ki Joong; Hwang, Yong Seung; Seo, Jeong-Sun; Chae, Jong Hee
2011-11-01
Duchenne muscular dystrophy or Becker muscular dystrophy might be a suitable candidate disease for application of next-generation sequencing in the genetic diagnosis because the complex mutational spectrum and the large size of the dystrophin gene require two or more analytical methods and have a high cost. The authors tested whether large deletions/duplications or small mutations, such as point mutations or short insertions/deletions of the dystrophin gene, could be predicted accurately in a single platform using next-generation sequencing technology. A custom solution-based target enrichment kit was designed to capture whole genomic regions of the dystrophin gene and other muscular-dystrophy-related genes. A multiplexing strategy, wherein four differently bar-coded samples were captured and sequenced together in a single lane of the Illumina Genome Analyser, was applied. The study subjects were 25 patients: 16 with deficient dystrophin expression without a large deletion/duplication and 9 with a known large deletion/duplication. Nearly 100% of the exonic region of the dystrophin gene was covered by at least eight reads with a mean read depth of 107. Pathogenic small mutations were identified in 15 of the 16 patients without a large deletion/duplication. Using these 16 patients as the standard, the authors' method accurately predicted the deleted or duplicated exons in the 9 patients with known mutations. Inclusion of non-coding regions and paired-end sequence analysis enabled accurate identification by increasing the read depth and providing information about the breakpoint junction. The current method has an advantage for the genetic diagnosis of Duchenne muscular dystrophy and Becker muscular dystrophy wherein a comprehensive mutational search may be feasible using a single platform.
Status of VICTORIA: NRC peer review and recent code applications
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bixler, N.E.; Schaperow, J.H.
1997-12-01
VICTORIA is a mechanistic computer code designed to analyze fission product behavior within a nuclear reactor coolant system (RCS) during a severe accident. It provides detailed predictions of the release of radioactive and nonradioactive materials from the reactor core and transport and deposition of these materials within the RCS. A summary of the results and recommendations of an independent peer review of VICTORIA by the US Nuclear Regulatory Commission (NRC) is presented, along with recent applications of the code. The latter include analyses of a temperature-induced steam generator tube rupture sequence and post-test analyses of the Phebus FPT-1 test. The next planned Phebus test, FPT-4, will focus on fission product releases from a rubble bed, especially those of the less-volatile elements, and on the speciation of the released elements. Pretest analyses using VICTORIA to estimate the magnitude and timing of releases are presented. The predicted release of uranium is a matter of particular importance because of concern about filter plugging during the test.
SCMPSP: Prediction and characterization of photosynthetic proteins based on a scoring card method.
Vasylenko, Tamara; Liou, Yi-Fan; Chen, Hong-An; Charoenkwan, Phasit; Huang, Hui-Ling; Ho, Shinn-Ying
2015-01-01
Photosynthetic proteins (PSPs) greatly differ in their structure and function as they are involved in numerous subprocesses that take place inside an organelle called a chloroplast. Few studies predict PSPs from sequences due to their high variety of sequences and structures. This work aims to predict and characterize PSPs by establishing the datasets of PSP and non-PSP sequences and developing prediction methods. A novel bioinformatics method of predicting and characterizing PSPs based on scoring card method (SCMPSP) was used. First, a dataset consisting of 649 PSPs was established by using a Gene Ontology term GO:0015979 and 649 non-PSPs from the SwissProt database with sequence identity <= 25%. Several prediction methods are presented based on support vector machine (SVM), decision tree J48, Bayes, BLAST, and SCM. The SVM method using dipeptide features performed well and yielded a test accuracy of 72.31%. The SCMPSP method uses the estimated propensity scores of 400 dipeptides as PSPs and has a test accuracy of 71.54%, which is comparable to that of the SVM method. The derived propensity scores of 20 amino acids were further used to identify informative physicochemical properties for characterizing PSPs. The analytical results reveal the following four characteristics of PSPs: 1) PSPs favour hydrophobic side chain amino acids; 2) PSPs are composed of the amino acids prone to form helices in membrane environments; 3) PSPs have low interaction with water; and 4) PSPs prefer to be composed of the amino acids of electron-reactive side chains. The SCMPSP method not only estimates the propensity of a sequence to be PSPs, it also discovers characteristics that further improve understanding of PSPs. The SCMPSP source code and the datasets used in this study are available at http://iclab.life.nctu.edu.tw/SCMPSP/.
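To make the idea of scoring a sequence from dipeptide propensities concrete, here is a small Python sketch. It assumes the sequence score is the composition-weighted sum of per-dipeptide propensity scores, which is one common way a scoring card is applied; the abstract does not spell out the formula, and the propensity values and function names below are hypothetical, not those estimated by SCMPSP.

```python
from collections import Counter

def dipeptide_composition(seq):
    """Fraction of each overlapping dipeptide in a protein sequence."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    counts = Counter(pairs)
    return {dp: n / len(pairs) for dp, n in counts.items()}

def scm_score(seq, propensity, default=0.0):
    """Composition-weighted sum of dipeptide propensity scores (assumed form)."""
    comp = dipeptide_composition(seq)
    return sum(frac * propensity.get(dp, default) for dp, frac in comp.items())

# Hypothetical propensity table; a real card would cover all 400 dipeptides.
toy_propensity = {"LL": 800.0, "LA": 600.0, "AL": 400.0}
score = scm_score("LLAL", toy_propensity)
print(round(score, 6))
```

In use, a sequence would be called a PSP when its score exceeds a threshold tuned on training data; that threshold is not given in the abstract.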
Farooq, Muhammad; Mansoor, Shahid; Guo, Hui; Amin, Imran; Chee, Peng W.; Azim, M. Kamran; Paterson, Andrew H.
2017-01-01
MicroRNAs (miRNAs) are small 20–24nt molecules that have been well studied over the past decade due to their important regulatory roles in different cellular processes. The mature sequences are more conserved across vast phylogenetic scales than their precursors and some are conserved within entire kingdoms, hence, their loci and function can be predicted by homology searches. Different studies have been performed to elucidate miRNAs using de novo prediction methods but due to complex regulatory mechanisms or false positive in silico predictions, not all of them express in reality and sometimes computationally predicted mature transcripts differ from the actual expressed ones. With the availability of a complete genome sequence of Gossypium arboreum, it is important to annotate the genome for both coding and non-coding regions using high confidence transcript evidence, for this cotton species that is highly resistant to various biotic and abiotic stresses. Here we have analyzed the small RNA transcriptome of G. arboreum leaves and provided genome annotation of miRNAs with evidence from miRNA/miRNA∗ transcripts. A total of 446 miRNAs clustered into 224 miRNA families were found, among which 48 families are conserved in other plants and 176 are novel. Four short RNA libraries were used to shortlist best predictions based on high reads per million. The size, origin, copy numbers and transcript depth of all miRNAs along with their isoforms and targets has been reported. The highest gene copy number was observed for gar-miR7504 followed by gar-miR166, gar-miR8771, gar-miR156, and gar-miR7484. Altogether, 1274 target genes were found in G. arboreum that are enriched for 216 KEGG pathways. The resultant genomic annotations are provided in UCSC, BED format. PMID:28663752
Insilico profiling of microRNAs in Korean ginseng (Panax ginseng Meyer)
Mathiyalagan, Ramya; Subramaniyam, Sathiyamoorthy; Natarajan, Sathishkumar; Kim, Yeon Ju; Sun, Myung Suk; Kim, Se Young; Kim, Yu-Jin; Yang, Deok Chun
2013-01-01
MicroRNAs (miRNAs) are a class of recently discovered non-coding small RNA molecules, on average approximately 21 nucleotides in length, which underlie numerous important biological roles in gene regulation in various organisms. The miRNA database (release 18) has 18,226 miRNAs, which have been deposited from different species. Although miRNAs have been identified and validated in many plant species, no studies have been reported on discovering miRNAs in Panax ginseng Meyer, which is a traditionally known medicinal plant in oriental medicine, also known as Korean ginseng. It has triterpene ginseng saponins called ginsenosides, which are responsible for its various pharmacological activities. Predicting conserved miRNAs by homology-based analysis with available expressed sequence tag (EST) sequences can be powerful, if the species lacks whole genome sequence information. In this study by using the EST based computational approach, 69 conserved miRNAs belonging to 44 miRNA families were identified in Korean ginseng. The digital gene expression patterns of predicted conserved miRNAs were analyzed by deep sequencing using small RNA sequences of flower buds, leaves, and lateral roots. We have found that many of the identified miRNAs showed tissue specific expressions. Using the insilico method, 346 potential targets were identified for the predicted 69 conserved miRNAs by searching the ginseng EST database, and the predicted targets were mainly involved in secondary metabolic processes, responses to biotic and abiotic stress, and transcription regulator activities, as well as a variety of other metabolic processes. PMID:23717176
Koo, Hyunmin; Hakim, Joseph A; Fisher, Phillip R E; Grueneberg, Alexander; Andersen, Dale T; Bej, Asim K
2016-01-01
In this study, we report the distribution and abundance of cold-adaptation proteins in microbial mat communities in the perennially ice-covered Lake Joyce, located in the McMurdo Dry Valleys, Antarctica. We have used MG-RAST and R code bioinformatics tools on Illumina HiSeq2000 shotgun metagenomic data and compared the filtering efficacy of these two methods on cold-adaptation proteins. Overall, the abundance of cold-shock DEAD-box protein A (CSDA), antifreeze proteins (AFPs), fatty acid desaturase (FAD), trehalose synthase (TS), and cold-shock family of proteins (CSPs) were present in all mat samples at high, moderate, or low levels, whereas the ice nucleation protein (INP) was present only in the ice and bulbous mat samples at insignificant levels. Considering the near homogeneous temperature profile of Lake Joyce (0.08-0.29 °C), the distribution and abundance of these proteins across various mat samples predictively correlated with known functional attributes necessary for microbial communities to thrive in this ecosystem. The comparison of the MG-RAST and the R code methods showed dissimilar occurrences of the cold-adaptation protein sequences, though with insignificant ANOSIM (R = 0.357; p-value = 0.012), ADONIS (R(2) = 0.274; p-value = 0.03) and STAMP (p-values = 0.521-0.984) statistical analyses. Furthermore, filtering targeted sequences using the R code accounted for taxonomic groups by avoiding sequence redundancies, whereas the MG-RAST provided total counts resulting in a higher sequence output. The results from this study revealed for the first time the distribution of cold-adaptation proteins in six different types of microbial mats in Lake Joyce, while suggesting a simpler and more manageable user-defined method of R code, as compared to a web-based MG-RAST pipeline.
Hu, Long; Xu, Zhiyu; Hu, Boqin; Lu, Zhi John
2017-01-09
Recent genomic studies suggest that novel long non-coding RNAs (lncRNAs) are specifically expressed and far outnumber annotated lncRNA sequences. To identify and characterize novel lncRNAs in RNA sequencing data from new samples, we have developed COME, a coding potential calculation tool based on multiple features. It integrates multiple sequence-derived and experiment-based features using a decompose-compose method, which makes it more accurate and robust than other well-known tools. We also showed that COME was able to substantially improve the consistency of prediction results from other coding potential calculators. Moreover, COME annotates and characterizes each predicted lncRNA transcript with multiple lines of supporting evidence, which are not provided by other tools. Remarkably, we found that one subgroup of lncRNAs classified by such supporting features (i.e. conserved local RNA secondary structure) was highly enriched in a well-validated database (lncRNAdb). We further found that the conserved structural domains on lncRNAs had a better chance than other RNA regions to interact with RNA binding proteins, based on the recent eCLIP-seq data in human, indicating their potential regulatory roles. Overall, we present COME as an accurate, robust and multiple-feature supported method for the identification and characterization of novel lncRNAs. The software implementation is available at https://github.com/lulab/COME. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Error and Error Mitigation in Low-Coverage Genome Assemblies
Hubisz, Melissa J.; Lin, Michael F.; Kellis, Manolis; Siepel, Adam
2011-01-01
The recent release of twenty-two new genome sequences has dramatically increased the data available for mammalian comparative genomics, but twenty of these new sequences are currently limited to ∼2× coverage. Here we examine the extent of sequencing error in these 2× assemblies, and its potential impact in downstream analyses. By comparing 2× assemblies with high-quality sequences from the ENCODE regions, we estimate the rate of sequencing error to be 1–4 errors per kilobase. While this error rate is fairly modest, sequencing error can still have surprising effects. For example, an apparent lineage-specific insertion in a coding region is more likely to reflect sequencing error than a true biological event, and the length distribution of coding indels is strongly distorted by error. We find that most errors are contributed by a small fraction of bases with low quality scores, in particular, by the ends of reads in regions of single-read coverage in the assembly. We explore several approaches for automatic sequencing error mitigation (SEM), making use of the localized nature of sequencing error, the fact that it is well predicted by quality scores, and information about errors that comes from comparisons across species. Our automatic methods for error mitigation cannot replace the need for additional sequencing, but they do allow substantial fractions of errors to be masked or eliminated at the cost of modest amounts of over-correction, and they can reduce the impact of error in downstream phylogenomic analyses. Our error-mitigated alignments are available for download. PMID:21340033
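Because the abstract notes that sequencing error is well predicted by quality scores, the simplest form of masking can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the quality values are taken to be Phred-scaled, the threshold of 20 is arbitrary, and the function names are hypothetical; the authors' SEM methods also draw on cross-species comparisons, which are not modelled here.

```python
def phred_to_error_prob(q):
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-q / 10)

def mask_low_quality(seq, quals, min_q=20):
    """Replace bases whose quality is below min_q with 'N'.

    min_q=20 (about a 1% error probability) is an illustrative threshold.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lists must have equal length")
    return "".join(base if q >= min_q else "N" for base, q in zip(seq, quals))

# Hypothetical read with two low-quality positions.
print(mask_low_quality("ACGTAC", [40, 35, 10, 30, 5, 22]))
```

Raising `min_q` masks more true errors at the cost of discarding more correct bases, which is the over-correction trade-off the abstract describes.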
Meta4: a web application for sharing and annotating metagenomic gene predictions using web services.
Richardson, Emily J; Escalettes, Franck; Fotheringham, Ian; Wallace, Robert J; Watson, Mick
2013-01-01
Whole-genome shotgun metagenomics experiments produce DNA sequence data from entire ecosystems, and provide a huge amount of novel information. Gene discovery projects require up-to-date information about sequence homology and domain structure for millions of predicted proteins to be presented in a simple, easy-to-use system. There is a lack of simple, open, flexible tools that allow the rapid sharing of metagenomics datasets with collaborators in a format they can easily interrogate. We present Meta4, a flexible and extensible web application that can be used to share and annotate metagenomic gene predictions. Proteins and predicted domains are stored in a simple relational database, with a dynamic front-end which displays the results in an internet browser. Web services are used to provide up-to-date information about the proteins from homology searches against public databases. Information about Meta4 can be found on the project website, code is available on Github, a cloud image is available, and an example implementation can be seen at.
A High-Throughput Arabidopsis Reverse Genetics System
Sessions, Allen; Burke, Ellen; Presting, Gernot; Aux, George; McElver, John; Patton, David; Dietrich, Bob; Ho, Patrick; Bacwaden, Johana; Ko, Cynthia; Clarke, Joseph D.; Cotton, David; Bullis, David; Snell, Jennifer; Miguel, Trini; Hutchison, Don; Kimmerly, Bill; Mitzel, Theresa; Katagiri, Fumiaki; Glazebrook, Jane; Law, Marc; Goff, Stephen A.
2002-01-01
A collection of Arabidopsis lines with T-DNA insertions in known sites was generated to increase the efficiency of functional genomics. A high-throughput modified thermal asymmetric interlaced (TAIL)-PCR protocol was developed and used to amplify DNA fragments flanking the T-DNA left borders from ∼100,000 transformed lines. A total of 85,108 TAIL-PCR products from 52,964 T-DNA lines were sequenced and compared with the Arabidopsis genome to determine the positions of T-DNAs in each line. Predicted T-DNA insertion sites, when mapped, showed a bias against predicted coding sequences. Predicted insertion mutations in genes of interest can be identified using Arabidopsis Gene Index name searches or by BLAST (Basic Local Alignment Search Tool) search. Insertions can be confirmed by simple PCR assays on individual lines. Predicted insertions were confirmed in 257 of 340 lines tested (76%). This resource has been named SAIL (Syngenta Arabidopsis Insertion Library) and is available to the scientific community at www.tmri.org. PMID:12468722
PANNZER2: a rapid functional annotation web server.
Törönen, Petri; Medlar, Alan; Holm, Liisa
2018-05-08
The unprecedented growth of high-throughput sequencing has led to an ever-widening annotation gap in protein databases. While computational prediction methods are available to make up the shortfall, a majority of public web servers are hindered by practical limitations and poor performance. Here, we introduce PANNZER2 (Protein ANNotation with Z-scoRE), a fast functional annotation web server that provides both Gene Ontology (GO) annotations and free text description predictions. PANNZER2 uses SANSparallel to perform high-performance homology searches, making bulk annotation based on sequence similarity practical. PANNZER2 can output GO annotations from multiple scoring functions, enabling users to see which predictions are robust across predictors. Finally, PANNZER2 predictions scored within the top 10 methods for molecular function and biological process in the CAFA2 NK-full benchmark. The PANNZER2 web server is updated on a monthly schedule and is accessible at http://ekhidna2.biocenter.helsinki.fi/sanspanz/. The source code is available under the GNU Public Licence v3.
Santos, Leonardo N; Silva, Eduardo S; Santos, André S; De Sá, Pablo H; Ramos, Rommel T; Silva, Artur; Cooper, Philip J; Barreto, Maurício L; Loureiro, Sebastião; Pinheiro, Carina S; Alcantara-Neves, Neuza M; Pacheco, Luis G C
2016-07-01
Infection with helminthic parasites, including the soil-transmitted helminth Trichuris trichiura (human whipworm), has been shown to modulate host immune responses and, consequently, to have an impact on the development and manifestation of chronic human inflammatory diseases. De novo derivation of helminth proteomes from sequencing of transcriptomes will provide valuable data to aid identification of parasite proteins that could be evaluated as potential immunotherapeutic molecules in the near future. Herein, we characterized the transcriptome of the adult stage of the human whipworm T. trichiura, using next-generation sequencing technology and a de novo assembly strategy. Nearly 17.6 million high-quality clean reads were assembled into 6414 contiguous sequences, with an N50 of 1606 bp. In total, 5673 protein-encoding sequences were confidently identified in the T. trichiura adult worm transcriptome; of these, 1013 sequences represent potential newly discovered proteins for the species, most of which have orthologs already annotated in the related species T. suis. A number of transcripts representing probable novel non-coding transcripts for the species T. trichiura were also identified. Among the most abundant transcripts, we found sequences that code for proteins involved in lipid transport, such as vitellogenins, and several chitin-binding proteins. Through a cross-species expression analysis of gene orthologs shared by T. trichiura and the closely related parasites T. suis and T. muris, it was possible to find twenty-six protein-encoding genes that are consistently highly expressed in the adult stages of the three helminth species. Additionally, twenty transcripts could be identified that code for proteins previously detected by mass spectrometry analysis of protein fractions of the whipworm somatic extract that present immunomodulatory activities. Five of these transcripts were amongst the most highly expressed protein-encoding sequences in the T. trichiura adult worm. In addition, orthologs of proteins demonstrated to have potent immunomodulatory properties in related parasitic helminths were also predicted from the T. trichiura de novo assembled transcriptome. Copyright © 2016. Published by Elsevier B.V.
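The N50 statistic quoted for the assembly can be computed from the list of contig lengths alone. The short Python sketch below uses the standard definition (the length of the contig at which the cumulative sum, taken from longest to shortest, first reaches half of the total assembled length); the contig lengths in the example are made up and are not from this study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig that takes the running sum to half the total.
        if 2 * running >= total:
            return length

# Hypothetical contig lengths (bp): total 54, half 27, reached at 10 + 9 + 8.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```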
Raymond, Frédéric; Boisvert, Sébastien; Roy, Gaétan; Ritt, Jean-François; Légaré, Danielle; Isnard, Amandine; Stanke, Mario; Olivier, Martin; Tremblay, Michel J.; Papadopoulou, Barbara; Ouellette, Marc; Corbeil, Jacques
2012-01-01
The Leishmania tarentolae Parrot-TarII strain genome sequence was resolved to an average 16-fold mean coverage by next-generation DNA sequencing technologies. This is the first non-pathogenic to humans kinetoplastid protozoan genome to be described thus providing an opportunity for comparison with the completed genomes of pathogenic Leishmania species. A high synteny was observed between all sequenced Leishmania species. A limited number of chromosomal regions diverged between L. tarentolae and L. infantum, while remaining syntenic to L. major. Globally, >90% of the L. tarentolae gene content was shared with the other Leishmania species. We identified 95 predicted coding sequences unique to L. tarentolae and 250 genes that were absent from L. tarentolae. Interestingly, many of the latter genes were expressed in the intracellular amastigote stage of pathogenic species. In addition, genes coding for products involved in antioxidant defence or participating in vesicular-mediated protein transport were underrepresented in L. tarentolae. In contrast to other Leishmania genomes, two gene families were expanded in L. tarentolae, namely the zinc metallo-peptidase surface glycoprotein GP63 and the promastigote surface antigen PSA31C. Overall, L. tarentolae's gene content appears better adapted to the promastigote insect stage rather than the amastigote mammalian stage. PMID:21998295
Coarse-grained sequences for protein folding and design.
Brown, Scott; Fawzi, Nicolas J; Head-Gordon, Teresa
2003-09-16
We present the results of sequence design on our off-lattice minimalist model in which no specification of native-state tertiary contacts is needed. We start with a sequence that adopts a target topology and build on it through sequence mutation to produce new sequences that comprise distinct members within a target fold class. In this work, we use the alpha/beta ubiquitin fold class and design two new sequences that, when characterized through folding simulations, reproduce the differences in folding mechanism seen experimentally for proteins L and G. The primary implication of this work is that patterning of hydrophobic and hydrophilic residues is the physical origin for the success of relative contact-order descriptions of folding, and that these physics-based potentials provide a predictive connection between free energy landscapes and amino acid sequence (the original protein folding problem). We present results of the sequence mapping from a 20- to the three-letter code for determining a sequence that folds into the WW domain topology to illustrate future extensions to protein design.
Coarse-grained sequences for protein folding and design
Brown, Scott; Fawzi, Nicolas J.; Head-Gordon, Teresa
2003-01-01
We present the results of sequence design on our off-lattice minimalist model in which no specification of native-state tertiary contacts is needed. We start with a sequence that adopts a target topology and build on it through sequence mutation to produce new sequences that comprise distinct members within a target fold class. In this work, we use the α/β ubiquitin fold class and design two new sequences that, when characterized through folding simulations, reproduce the differences in folding mechanism seen experimentally for proteins L and G. The primary implication of this work is that patterning of hydrophobic and hydrophilic residues is the physical origin for the success of relative contact-order descriptions of folding, and that these physics-based potentials provide a predictive connection between free energy landscapes and amino acid sequence (the original protein folding problem). We present results of the sequence mapping from a 20- to the three-letter code for determining a sequence that folds into the WW domain topology to illustrate future extensions to protein design. PMID:12963815
Mutation detection in the human HSP70B′ gene by denaturing high-performance liquid chromatography
Hecker, Karl H.; Asea, Alexzander; Kobayashi, Kaoru; Green, Stacy; Tang, Dan; Calderwood, Stuart K.
2000-01-01
Variances, particularly single nucleotide polymorphisms (SNP), in the genomic sequence of individuals are the primary key to understanding gene function as it relates to differences in the susceptibility to disease, environmental influences, and therapy. In this report, the HSP70B′ gene is the target sequence for mutation detection in biopsy samples from human prostate cancer patients undergoing combined hyperthermia and radiation therapy at the Dana-Farber Cancer Institute, using temperature-modulated heteroduplex analysis (TMHA). The underlying principles of TMHA for mutation detection using DHPLC technology are discussed. The procedures involved in amplicon design for mutation analysis by DHPLC are detailed. The melting behavior of the complete coding sequence of the target gene is characterized using WAVEMAKER™ software. Four overlapping amplicons, which span the complete coding region of the HSP70B′ gene, amenable to mutation detection by DHPLC were identified based on the software-predicted melting profile of the target sequence. TMHA was performed on PCR products of individual amplicons of the HSP70B′ gene on the WAVE® Nucleic Acid Fragment Analysis System. The criteria for mutation calling by comparing wild-type and mutant chromatographic patterns are discussed. PMID:11189446
Mutation detection in the human HSP70B' gene by denaturing high-performance liquid chromatography.
Hecker, K H; Asea, A; Kobayashi, K; Green, S; Tang, D; Calderwood, S K
2000-11-01
Variances, particularly single nucleotide polymorphisms (SNP), in the genomic sequence of individuals are the primary key to understanding gene function as it relates to differences in the susceptibility to disease, environmental influences, and therapy. In this report, the HSP70B' gene is the target sequence for mutation detection in biopsy samples from human prostate cancer patients undergoing combined hyperthermia and radiation therapy at the Dana-Farber Cancer Institute, using temperature-modulated heteroduplex analysis (TMHA). The underlying principles of TMHA for mutation detection using DHPLC technology are discussed. The procedures involved in amplicon design for mutation analysis by DHPLC are detailed. The melting behavior of the complete coding sequence of the target gene is characterized using WAVEMAKER software. Four overlapping amplicons, which span the complete coding region of the HSP70B' gene, amenable to mutation detection by DHPLC were identified based on the software-predicted melting profile of the target sequence. TMHA was performed on PCR products of individual amplicons of the HSP70B' gene on the WAVE Nucleic Acid Fragment Analysis System. The criteria for mutation calling by comparing wild-type and mutant chromatographic patterns are discussed.
Evolution of coding and non-coding genes in HOX clusters of a marsupial.
Yu, Hongshi; Lindsay, James; Feng, Zhi-Ping; Frankenberg, Stephen; Hu, Yanqiu; Carone, Dawn; Shaw, Geoff; Pask, Andrew J; O'Neill, Rachel; Papenfuss, Anthony T; Renfree, Marilyn B
2012-06-18
The HOX gene clusters are thought to be highly conserved amongst mammals and other vertebrates, but the long non-coding RNAs have only been studied in detail in human and mouse. The sequencing of the kangaroo genome provides an opportunity to use comparative analyses to compare the HOX clusters of a mammal with a distinct body plan to those of other mammals. Here we report a comparative analysis of HOX gene clusters between an Australian marsupial of the kangaroo family and the eutherians. There was a strikingly high level of conservation of HOX gene sequence and structure and non-protein coding genes including the microRNAs miR-196a, miR-196b, miR-10a and miR-10b and the long non-coding RNAs HOTAIR, HOTAIRM1 and HOXA11AS that play critical roles in regulating gene expression and controlling development. By microRNA deep sequencing and comparative genomic analyses, two conserved microRNAs (miR-10a and miR-10b) were identified and one new candidate microRNA with typical hairpin precursor structure that is expressed in both fibroblasts and testes was found. The prediction of microRNA target analysis showed that several known microRNA targets, such as miR-10, miR-414 and miR-464, were found in the tammar HOX clusters. In addition, several novel and putative miRNAs were identified that originated from elsewhere in the tammar genome and that target the tammar HOXB and HOXD clusters. This study confirms that the emergence of known long non-coding RNAs in the HOX clusters clearly predates the marsupial-eutherian divergence 160 Ma ago. It also identified a new potentially functional microRNA as well as conserved miRNAs. These non-coding RNAs may participate in the regulation of HOX genes to influence the body plan of this marsupial.
Evolution of coding and non-coding genes in HOX clusters of a marsupial
2012-01-01
Background The HOX gene clusters are thought to be highly conserved amongst mammals and other vertebrates, but the long non-coding RNAs have only been studied in detail in human and mouse. The sequencing of the kangaroo genome provides an opportunity to use comparative analyses to compare the HOX clusters of a mammal with a distinct body plan to those of other mammals. Results Here we report a comparative analysis of HOX gene clusters between an Australian marsupial of the kangaroo family and the eutherians. There was a strikingly high level of conservation of HOX gene sequence and structure and non-protein coding genes including the microRNAs miR-196a, miR-196b, miR-10a and miR-10b and the long non-coding RNAs HOTAIR, HOTAIRM1 and HOXA11AS that play critical roles in regulating gene expression and controlling development. By microRNA deep sequencing and comparative genomic analyses, two conserved microRNAs (miR-10a and miR-10b) were identified and one new candidate microRNA with typical hairpin precursor structure that is expressed in both fibroblasts and testes was found. The prediction of microRNA target analysis showed that several known microRNA targets, such as miR-10, miR-414 and miR-464, were found in the tammar HOX clusters. In addition, several novel and putative miRNAs were identified that originated from elsewhere in the tammar genome and that target the tammar HOXB and HOXD clusters. Conclusions This study confirms that the emergence of known long non-coding RNAs in the HOX clusters clearly predate the marsupial-eutherian divergence 160 Ma ago. It also identified a new potentially functional microRNA as well as conserved miRNAs. These non-coding RNAs may participate in the regulation of HOX genes to influence the body plan of this marsupial. PMID:22708672
Gene finding in metatranscriptomic sequences.
Ismail, Wazim Mohammed; Ye, Yuzhen; Tang, Haixu
2014-01-01
Metatranscriptomic sequencing is a highly sensitive bioassay of functional activity in a microbial community, providing complementary information to the metagenomic sequencing of the community. The acquisition of the metatranscriptomic sequences will enable us to refine the annotations of the metagenomes, and to study the gene activities and their regulation in complex microbial communities and their dynamics. In this paper, we present TransGeneScan, a software tool for finding genes in assembled transcripts from metatranscriptomic sequences. By incorporating several features of metatranscriptomic sequencing, including strand-specificity, short intergenic regions, and putative antisense transcripts, into a Hidden Markov Model, TransGeneScan can predict a sense transcript containing one or multiple genes (in an operon) or an antisense transcript. We tested TransGeneScan on a mock metatranscriptomic data set containing three known bacterial genomes. The results showed that TransGeneScan performs better than metagenomic gene finders (MetaGeneMark and FragGeneScan) on predicting protein coding genes in assembled transcripts, and achieves comparable or even higher accuracy than gene finders for microbial genomes (Glimmer and GeneMark). These results imply that, with the assistance of metatranscriptomic sequencing, we can obtain a broad and precise picture of the genes (and their functions) in a microbial community. TransGeneScan is available as open-source software on SourceForge at https://sourceforge.net/projects/transgenescan/.
Sequencing proteins with transverse ionic transport in nanochannels.
Boynton, Paul; Di Ventra, Massimiliano
2016-05-03
De novo protein sequencing is essential for understanding cellular processes that govern the function of living organisms and all sequence modifications that occur after a protein has been constructed from its corresponding DNA code. By obtaining the order of the amino acids that compose a given protein one can then determine both its secondary and tertiary structures through structure prediction, which is used to create models for protein aggregation diseases such as Alzheimer's Disease. Here, we propose a new technique for de novo protein sequencing that involves translocating a polypeptide through a synthetic nanochannel and measuring the ionic current of each amino acid through an intersecting perpendicular nanochannel. We find that the distribution of ionic currents for each of the 20 proteinogenic amino acids encoded by eukaryotic genes is statistically distinct, showing this technique's potential for de novo protein sequencing.
orthAgogue: an agile tool for the rapid prediction of orthology relations.
Ekseth, Ole Kristian; Kuiper, Martin; Mironov, Vladimir
2014-03-01
The comparison of genes and gene products across species depends on high-quality tools to determine the relationships between gene or protein sequences from various species. Although some excellent applications are available and widely used, their performance leaves room for improvement. We developed orthAgogue: a multithreaded C application for high-speed estimation of homology relations in massive datasets, operated via a flexible and easy command-line interface. The orthAgogue software is distributed under the GNU license. The source code and binaries compiled for Linux are available at https://code.google.com/p/orthagogue/.
PARTS: Probabilistic Alignment for RNA joinT Secondary structure prediction
Harmanci, Arif Ozgun; Sharma, Gaurav; Mathews, David H.
2008-01-01
A novel method is presented for joint prediction of alignment and common secondary structures of two RNA sequences. The joint consideration of common secondary structures and alignment is accomplished by structural alignment over a search space defined by the newly introduced motif called matched helical regions. The matched helical region formulation generalizes previously employed constraints for structural alignment and thereby better accommodates the structural variability within RNA families. A probabilistic model based on pseudo free energies obtained from precomputed base pairing and alignment probabilities is utilized for scoring structural alignments. Maximum a posteriori (MAP) common secondary structures, sequence alignment and joint posterior probabilities of base pairing are obtained from the model via a dynamic programming algorithm called PARTS. The advantage of the more general structural alignment of PARTS is seen in secondary structure predictions for the RNase P family. For this family, the PARTS MAP predictions of secondary structures and alignment perform significantly better than prior methods that utilize a more restrictive structural alignment model. For the tRNA and 5S rRNA families, the richer structural alignment model of PARTS does not offer a benefit and the method therefore performs comparably with existing alternatives. For all RNA families studied, the posterior probability estimates obtained from PARTS offer an improvement over posterior probability estimates from a single sequence prediction. When considering the base pairings predicted over a threshold value of confidence, the combination of sensitivity and positive predictive value is better for PARTS than for the single sequence prediction. PARTS source code is available for download under the GNU public license at http://rna.urmc.rochester.edu. PMID:18304945
Thuan, Nguyen Huy; Dhakal, Dipesh; Pokhrel, Anaya Raj; Chu, Luan Luong; Van Pham, Thi Thuy; Shrestha, Anil; Sohng, Jae Kyung
2018-05-01
Streptomyces peucetius ATCC 27952 produces two major anthracyclines, doxorubicin (DXR) and daunorubicin (DNR), which are potent chemotherapeutic agents for the treatment of several cancers. In order to gain detailed insight into the genetics and biochemistry of the strain, the complete genome was determined and analyzed. The results showed that the complete sequence contains 7187 protein-coding genes in a total of 8,023,114 bp, with 87% of the genome contributing to protein-coding regions. The genomic sequence included 18 rRNAs, 66 tRNAs, and 3 non-coding RNAs. In silico studies predicted ~68 biosynthetic gene clusters (BGCs) encoding diverse classes of secondary metabolites, including non-ribosomal peptide synthetase (NRPS), polyketide synthase (PKS I, II, and III), terpenes, and others. Detailed analysis of the genome sequence revealed versatile biocatalytic enzymes such as cytochrome P450 (CYP), electron transfer systems (ETS) genes, methyltransferase (MT), and glycosyltransferase (GT). In addition, numerous functional genes (transporter gene, SOD, etc.) and regulatory genes (afsR-sp, metK-sp, etc.) involved in the regulation of secondary metabolites were found. This minireview summarizes the genome-based genome mining (GM) of diverse BGCs and genome exploration (GE) of versatile biocatalytic enzymes, and other enzymes involved in maintenance and regulation of metabolism of S. peucetius. The detailed analysis of genome sequence provides critically important knowledge useful in the bioengineering of the strain or harboring catalytically efficient enzymes for biotechnological applications.
Aoki, Koh; Yano, Kentaro; Suzuki, Ayako; Kawamura, Shingo; Sakurai, Nozomu; Suda, Kunihiro; Kurabayashi, Atsushi; Suzuki, Tatsuya; Tsugane, Taneaki; Watanabe, Manabu; Ooga, Kazuhide; Torii, Maiko; Narita, Takanori; Shin-I, Tadasu; Kohara, Yuji; Yamamoto, Naoki; Takahashi, Hideki; Watanabe, Yuichiro; Egusa, Mayumi; Kodama, Motoichiro; Ichinose, Yuki; Kikuchi, Mari; Fukushima, Sumire; Okabe, Akiko; Arie, Tsutomu; Sato, Yuko; Yazawa, Katsumi; Satoh, Shinobu; Omura, Toshikazu; Ezura, Hiroshi; Shibata, Daisuke
2010-03-30
The Solanaceae family includes several economically important vegetable crops. The tomato (Solanum lycopersicum) is regarded as a model plant of the Solanaceae family. Recently, a number of tomato resources have been developed in parallel with the ongoing tomato genome sequencing project. In particular, a miniature cultivar, Micro-Tom, is regarded as a model system in tomato genomics, and a number of genomics resources in the Micro-Tom background, such as ESTs and mutagenized lines, have been established by an international alliance. To accelerate the progress in tomato genomics, we developed a collection of 13,227 fully sequenced Micro-Tom full-length cDNAs. By checking redundant sequences, coding sequences, and chimeric sequences, a set of 11,502 non-redundant full-length cDNAs (nrFLcDNAs) was generated. Analysis of untranslated regions demonstrated that tomato has longer 5'- and 3'-untranslated regions than most other plants except rice. Classification of functions of proteins predicted from the coding sequences demonstrated that nrFLcDNAs covered a broad range of functions. A comparison of nrFLcDNAs with genes of sixteen plants facilitated the identification of tomato genes that are not found in other plants, most of which did not have known protein domains. Mapping of the nrFLcDNAs onto currently available tomato genome sequences facilitated prediction of exon-intron structure. Introns of tomato genes were longer than those of Arabidopsis and rice. According to a comparison of exon sequences between the nrFLcDNAs and the tomato genome sequences, the frequency of nucleotide mismatch in exons between Micro-Tom and the genome-sequencing cultivar (Heinz 1706) was estimated to be 0.061%. The collection of Micro-Tom nrFLcDNAs generated in this study will serve as a valuable genomic tool for plant biologists to bridge the gap between basic and applied studies.
The nrFLcDNA sequences will help annotation of the tomato whole-genome sequence and aid in tomato functional genomics and molecular breeding. Full-length cDNA sequences and their annotations are provided in the database KaFTom http://www.pgb.kazusa.or.jp/kaftom/ via the website of the National Bioresource Project Tomato http://tomato.nbrp.jp.
Richardson, Casey R.; Luo, Qing-Jun; Gontcharova, Viktoria; Jiang, Ying-Wen; Samanta, Manoj; Youn, Eunseog; Rock, Christopher D.
2010-01-01
Background MicroRNAs (miRNAs) and trans-acting small-interfering RNAs (tasi-RNAs) are small (20–22 nt long) RNAs (smRNAs) generated from hairpin secondary structures or antisense transcripts, respectively, that regulate gene expression by Watson-Crick pairing to a target mRNA and altering expression by mechanisms related to RNA interference. The high sequence homology of plant miRNAs to their targets has been the mainstay of miRNA prediction algorithms, which are limited in their predictive power for other kingdoms because miRNA complementarity is less conserved yet transitive processes (production of antisense smRNAs) are active in eukaryotes. We hypothesize that antisense transcription and associated smRNAs are biomarkers which can be computationally modeled for gene discovery. Principal Findings We explored rice (Oryza sativa) sense and antisense gene expression in publicly available whole genome tiling array transcriptome data and sequenced smRNA libraries (as well as C. elegans) and found evidence of transitivity of MIRNA genes similar to that found in Arabidopsis. Statistical analysis of antisense transcript abundances, presence of antisense ESTs, and association with smRNAs suggests several hundred Arabidopsis ‘orphan’ hypothetical genes are non-coding RNAs. Consistent with this hypothesis, we found novel Arabidopsis homologues of some MIRNA genes on the antisense strand of previously annotated protein-coding genes. A Support Vector Machine (SVM) was applied using thermodynamic energy of binding plus novel expression features of sense/antisense transcription topology and siRNA abundances to build a prediction model of miRNA targets. The SVM when trained on targets could predict the “ancient” (deeply conserved) class of validated Arabidopsis MIRNA genes with an accuracy of 84%, and 76% for “new” rapidly-evolving MIRNA genes. 
Conclusions Antisense and smRNA expression features and computational methods may identify novel MIRNA genes and other non-coding RNAs in plants and potentially other kingdoms, which can provide insight into antisense transcription, miRNA evolution, and post-transcriptional gene regulation. PMID:20520764
PopCORN: Hunting down the differences between binary population synthesis codes
NASA Astrophysics Data System (ADS)
Toonen, S.; Claeys, J. S. W.; Mennekens, N.; Ruiter, A. J.
2014-02-01
Context. Binary population synthesis (BPS) modelling is a very effective tool to study the evolution and properties of various types of close binary systems. The uncertainty in the parameters of the model and their effect on a population can be tested in a statistical way, which then leads to a deeper understanding of the underlying (sometimes poorly understood) physical processes involved. Several BPS codes exist that have been developed with different philosophies and aims. Although BPS has been very successful for studies of many populations of binary stars, in the particular case of the study of the progenitors of supernovae Type Ia, the predicted rates and ZAMS progenitors vary substantially between different BPS codes. Aims: To understand the predictive power of BPS codes, we study the similarities and differences in the predictions of four different BPS codes for low- and intermediate-mass binaries. We investigate the differences in the characteristics of the predicted populations, and whether they are caused by different assumptions made in the BPS codes or by numerical effects, e.g. a lack of accuracy in BPS codes. Methods: We compare a large number of evolutionary sequences for binary stars, starting with the same initial conditions following the evolution until the first (and when applicable, the second) white dwarf (WD) is formed. To simplify the complex problem of comparing BPS codes that are based on many (often different) assumptions, we equalise the assumptions as much as possible to examine the inherent differences of the four BPS codes. Results: We find that the simulated populations are similar between the codes. Regarding the population of binaries with one WD, there is very good agreement between the physical characteristics, the evolutionary channels that lead to the birth of these systems, and their birthrates. 
Regarding the double WD population, there is a good agreement on which evolutionary channels exist to create double WDs and a rough agreement on the characteristics of the double WD population. Regarding which progenitor systems lead to a single and double WD system and which systems do not, the four codes agree well. Most importantly, we find that for these two populations, the differences in the predictions from the four codes are not due to numerical differences, but because of different inherent assumptions. We identify critical assumptions for BPS studies that need to be studied in more detail. Appendices are available in electronic form at http://www.aanda.org
Stone, David M; Kerr, Rose C; Hughes, Margaret; Radford, Alan D; Darby, Alistair C
2013-11-01
The complete coding sequences were determined for four putative vesiculoviruses isolated from fish. Sequence alignment and phylogenetic analysis based on the predicted amino acid sequences of the five main proteins assigned tench rhabdovirus and grass carp rhabdovirus together with spring viraemia of carp and pike fry rhabdovirus to a lineage that was distinct from the mammalian vesiculoviruses. Perch rhabdovirus, eel virus European X, lake trout rhabdovirus 903/87 and sea trout virus were placed in a second lineage that was also distinct from the recognised genera in the family Rhabdoviridae. Establishment of two new rhabdovirus genera, "Perhabdovirus" and "Sprivivirus", is discussed.
Computational Prediction of miRNA Genes from Small RNA Sequencing Data
Kang, Wenjing; Friedländer, Marc R.
2015-01-01
Next-generation sequencing now for the first time allows researchers to gauge the depth and variation of entire transcriptomes. However, now that rare transcripts present in cells at single copies can be detected, more advanced computational tools are needed to accurately annotate and profile them. microRNAs (miRNAs) are 22-nucleotide small RNAs (sRNAs) that post-transcriptionally reduce the output of protein coding genes. They have established roles in numerous biological processes, including cancers and other diseases. During miRNA biogenesis, the sRNAs are sequentially cleaved from precursor molecules that have a characteristic hairpin RNA structure. The vast majority of new miRNA genes that are discovered are mined from small RNA sequencing (sRNA-seq), which can detect more than a billion RNAs in a single run. However, given that many of the detected RNAs are degradation products from all types of transcripts, the accurate identification of miRNAs remains a non-trivial computational problem. Here, we review the tools available to predict animal miRNAs from sRNA sequencing data. We present tools for generalist and specialist use cases, including prediction from massively pooled data or in species without a reference genome. We also present wet-lab methods used to validate predicted miRNAs, and approaches to computationally benchmark prediction accuracy. For each tool, we reference validation experiments and benchmarking efforts. Last, we discuss the future of the field. PMID:25674563
Kiryu, Hisanori; Kin, Taishin; Asai, Kiyoshi
2007-02-15
Recent transcriptomic studies have revealed the existence of a considerable number of non-protein-coding RNA transcripts in higher eukaryotic cells. To investigate the functional roles of these transcripts, it is of great interest to find conserved secondary structures from multiple alignments on a genomic scale. Since multiple alignments are often created using alignment programs that neglect the special conservation patterns of RNA secondary structures for computational efficiency, alignment failures can cause potential risks of overlooking conserved stem structures. We investigated the dependence of the accuracy of secondary structure prediction on the quality of alignments. We compared three algorithms that maximize the expected accuracy of secondary structures as well as other frequently used algorithms. We found that one of our algorithms, called McCaskill-MEA, was more robust against alignment failures than others. The McCaskill-MEA method first computes the base pairing probability matrices for all the sequences in the alignment and then obtains the base pairing probability matrix of the alignment by averaging over these matrices. The consensus secondary structure is predicted from this matrix such that the expected accuracy of the prediction is maximized. We show that the McCaskill-MEA method performs better than other methods, particularly when the alignment quality is low and when the alignment consists of many sequences. Our model has a parameter that controls the sensitivity and specificity of predictions. We discussed the uses of that parameter for multi-step screening procedures to search for conserved secondary structures and for assigning confidence values to the predicted base pairs. The C++ source code that implements the McCaskill-MEA algorithm and the test dataset used in this paper are available at http://www.ncrna.org/papers/McCaskillMEA/. Supplementary data are available at Bioinformatics online.
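The averaging step described in the abstract above can be sketched as follows. This is a minimal illustration only, assuming that each sequence's base pairing probabilities have already been computed by some other tool and are supplied as a dictionary keyed by ungapped positions; the function names and the data layout are hypothetical and are not taken from the McCaskill-MEA source code. The subsequent step that picks the consensus structure maximizing expected accuracy is not shown.

```python
def column_index_map(aligned_seq, gap="-"):
    """Return, for each ungapped position, the alignment column it occupies."""
    return [col for col, ch in enumerate(aligned_seq) if ch != gap]


def average_pair_probabilities(aligned_seqs, per_seq_probs):
    """Average per-sequence base pairing probabilities over an alignment.

    aligned_seqs  : list of equal-length gapped strings (one per sequence)
    per_seq_probs : list of dicts {(i, j): p} with i < j given in ungapped
                    coordinates of the corresponding sequence (assumed input)
    Returns an n_cols x n_cols matrix (list of lists) in alignment coordinates.
    """
    n_cols = len(aligned_seqs[0])
    if any(len(s) != n_cols for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    n_seqs = len(aligned_seqs)
    avg = [[0.0] * n_cols for _ in range(n_cols)]
    for aligned, probs in zip(aligned_seqs, per_seq_probs):
        cols = column_index_map(aligned)
        for (i, j), p in probs.items():
            # Project the pair onto alignment columns; a sequence with a gap
            # in either column simply contributes zero to that entry.
            avg[cols[i]][cols[j]] += p / n_seqs
    return avg


# Toy usage: two aligned sequences whose first and last bases pair.
alignment = ["GA-C", "G-AC"]
probabilities = [{(0, 2): 0.8}, {(0, 2): 0.6}]
matrix = average_pair_probabilities(alignment, probabilities)
```

In this toy case both pairs project onto alignment columns 0 and 3, so that entry of the averaged matrix is the mean of the two input probabilities.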
Selection of the simplest RNA that binds isoleucine
LOZUPONE, CATHERINE; CHANGAYIL, SHANKAR; MAJERFELD, IRENE; YARUS, MICHAEL
2003-01-01
We have identified the simplest RNA binding site for isoleucine using selection-amplification (SELEX), by shrinking the size of the randomized region until affinity selection is extinguished. Such a protocol can be useful because selection does not necessarily make the simplest active motif most prominent, as is often assumed. We find an isoleucine binding site that behaves exactly as predicted for the site that requires fewest nucleotides. This UAUU motif (16 highly conserved positions; 27 total), is also the most abundant site in successful selections on short random tracts. The UAUU site, now isolated independently at least 63 times, is a small asymmetric internal loop. Conserved loop sequences include isoleucine codon and anticodon triplets, whose nucleotides are required for amino acid binding. This reproducible association between isoleucine and its coding sequences supports the idea that the genetic code is, at least in part, a stereochemical residue of the most easily isolated RNA–amino acid binding structures. PMID:14561881
Characterization of the Lymantria dispar nucleopolyhedrovirus 25K FP gene
David S. Bischoff; James M. Slavicek
1996-01-01
The Lymantria dispar nucleopolyhedrovirus (LdMNPV) gene encoding the 25K FP protein has been cloned and sequenced. The 25KFP gene codes for a 217 amino acid protein with a predicted molecular mass of 24870 Da. Expression of the 25K FP protein in a rabbit reticulocyte system generated a 27 kDa protein, in close agreement with the...
Draft Genome Sequence of Mycobacterium boenickei CIP 107829.
Bouam, Amar; Robert, Catherine; Croce, Olivier; Levasseur, Anthony; Drancourt, Michel
2017-05-04
Mycobacterium boenickei is a rapidly growing mycobacterium isolated for the first time from a leg wound in the United States. Its 6,506,908-bp draft genome exhibits a 66.77% G+C content, 6,279 protein-coding genes, and 59 predicted RNA genes. In silico DNA-DNA hybridization confirms its assignment to the Mycobacterium fortuitum complex. Copyright © 2017 Bouam et al.
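For readers unfamiliar with the G+C statistic quoted above, a minimal sketch of how such a percentage is typically computed from a nucleotide sequence follows. The function name is hypothetical, and the choice to ignore ambiguous bases such as N is an assumption; the abstract does not say how the reported value was calculated.

```python
def gc_content(sequence):
    """Return the percentage of G and C among unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


# Toy usage: three of the five unambiguous bases are G or C.
example = gc_content("ggcNNat")
```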
ERIC Educational Resources Information Center
Hitch, Graham J.; Flude, Brenda; Burgess, Neil
2009-01-01
Three experiments tested predictions of a neural network model of phonological short-term memory that assumes separate representations for order and item information, order being coded via a context-timing signal [Burgess, N., & Hitch, G. J. (1999). Memory for serial order: A network model of the phonological loop and its timing. "Psychological…
Zhang, Yan-Qiong; Chen, Dong-Liang; Tian, Hai-Feng; Zhang, Bao-Hong; Wen, Jian-Fan
2009-10-01
Using a combined computational program, we identified 50 potential microRNAs (miRNAs) in Giardia lamblia, one of the most primitive unicellular eukaryotes. These miRNAs are unique to G. lamblia and no homologues have been found in other organisms; miRNAs, currently known in other species, were not found in G. lamblia. This suggests that miRNA biogenesis and miRNA-mediated gene regulation pathway may evolve independently, especially in evolutionarily distant lineages. A majority (43) of the predicted miRNAs are located at one single locus; however, some miRNAs have two or more copies in the genome. Among the 58 miRNA genes, 28 are located in the intergenic regions whereas 30 are present in the anti-sense strands of the protein-coding sequences. Five predicted miRNAs are expressed in G. lamblia trophozoite cells evidenced by expressed sequence tags or RT-PCR. Thirty-seven identified miRNAs may target 50 protein-coding genes, including seven variant-specific surface proteins (VSPs). Our findings provide a clue that miRNA-mediated gene regulation may exist in the early stage of eukaryotic evolution, suggesting that it is an important regulation system ubiquitous in eukaryotes.
DNA barcode goes two-dimensions: DNA QR code web server.
Liu, Chang; Shi, Linchun; Xu, Xiaolan; Li, Huan; Xing, Hang; Liang, Dong; Jiang, Kun; Pang, Xiaohui; Song, Jingyuan; Chen, Shilin
2012-01-01
The DNA barcoding technology uses a standard region of DNA sequence for species identification and discovery. At present, "DNA barcode" actually refers to DNA sequences, which are not amenable to information storage, recognition, and retrieval. Our aim is to identify the best symbology that can represent DNA barcode sequences in practical applications. A comprehensive set of sequences for five DNA barcode markers, ITS2, rbcL, matK, psbA-trnH, and CO1, was used as the test data. Fifty-three different types of one-dimensional and ten two-dimensional barcode symbologies were compared based on different criteria, such as coding capacity, compression efficiency, and error detection ability. The quick response (QR) code was found to have the largest coding capacity and a relatively high compression ratio. To facilitate the further usage of QR code-based DNA barcodes, a web server was developed and is accessible at http://qrfordna.dnsalias.org. The web server allows users to retrieve the QR code for a species of interest, convert a DNA sequence to and from a QR code, and perform species identification based on local and global sequence similarities. In summary, the first comprehensive evaluation of various barcode symbologies has been carried out. The QR code has been found to be the most appropriate symbology for DNA barcode sequences. A web server has also been constructed to allow biologists to utilize QR codes in practical DNA barcoding applications.
Lichenase and coding sequences
Li, Xin-Liang; Ljungdahl, Lars G.; Chen, Huizhong
2000-08-15
The present invention provides a fungal lichenase, i.e., an endo-1,3-1,4-β-D-glucanohydrolase, its coding sequence, recombinant DNA molecules comprising the lichenase coding sequences, recombinant host cells and methods for producing same. The present lichenase is from Orpinomyces PC-2.
Bringing the fathead minnow (Pimephales promelas) into the ...
The fathead minnow (Pimephales promelas) is a well-established ecotoxicological model organism that has been widely used for regulatory ecotoxicity testing and research for over a half century. Throughout this time, a lot of knowledge has been gained about the fathead minnow’s biological responses to various xenobiotics. However, despite its importance as a model organism, the fathead minnow still has few publicly available gene sequences. Recently, Burns et al. (2015; Environ. Toxicol. Chem. 35:212) described the sequencing and de-novo assembly of the fathead minnow genome. Two draft genome assemblies are now publicly available on the GenBank database. However, on their own the draft assemblies remain of limited use to researchers who are primarily interested in the functional units of the genome, i.e. the genes. In the present study, an annotation pipeline, consisting of gene prediction, evidence alignment, and data synthesis, was applied to the fathead minnow SOAPdenovo assembly. Ab initio gene prediction was performed using AUGUSTUS, which provided a starting point of 43,345 gene predictions. Fathead minnow Expressed Sequence Tags (ESTs) and zebrafish protein-coding sequences (CDSs) were then aligned to the assembly using the corresponding spliced alignment methods of the program Exonerate. Of the over 240,000 EST alignments, 73% were successfully aligned with 90% or greater sequence identity and query coverage. Similarly, 39% of nearly 45,000 zebrafish co
Hyndman, Timothy H; Marschang, Rachel E; Wellehan, James F X; Nicholls, Philip K
2012-10-01
This paper describes the isolation and molecular identification of a novel paramyxovirus found during an investigation of an outbreak of neurorespiratory disease in a collection of Australian pythons. Using Illumina® high-throughput sequencing, a 17,187 nucleotide sequence was assembled from RNA extracts from infected viper heart cells (VH2) displaying widespread cytopathic effects in the form of multinucleate giant cells. The sequence appears to contain all the coding regions of the genome, including the following predicted paramyxoviral open reading frames (ORFs): 3'--Nucleocapsid (N)--putative Phosphoprotein (P)--Matrix (M)--Fusion (F)--putative attachment protein--Polymerase (L)--5'. There is also a 540 nucleotide ORF between the N and putative P genes that may be an additional coding region. Phylogenetic analyses of the complete N, M, F and L genes support the clustering of this virus within the family Paramyxoviridae but outside both of the current subfamilies: Paramyxovirinae and Pneumovirinae. We propose to name this new virus, Sunshine virus, after the geographic origin of the first isolate--the Sunshine Coast of Queensland, Australia. Copyright © 2012 Elsevier B.V. All rights reserved.
MouSensor: A Versatile Genetic Platform to Create Super Sniffer Mice for Studying Human Odor Coding.
D'Hulst, Charlotte; Mina, Raena B; Gershon, Zachary; Jamet, Sophie; Cerullo, Antonio; Tomoiaga, Delia; Bai, Li; Belluscio, Leonardo; Rogers, Matthew E; Sirotin, Yevgeniy; Feinstein, Paul
2016-07-26
Typically, ∼0.1% of the total number of olfactory sensory neurons (OSNs) in the main olfactory epithelium express the same odorant receptor (OR) in a singular fashion and their axons coalesce into homotypic glomeruli in the olfactory bulb. Here, we have dramatically increased the total number of OSNs expressing specific cloned OR coding sequences by multimerizing a 21-bp sequence encompassing the predicted homeodomain binding site sequence, TAATGA, known to be essential in OR gene choice. Singular gene choice is maintained in these "MouSensors." In vivo synaptopHluorin imaging of odor-induced responses by known M71 ligands shows functional glomerular activation in an M71 MouSensor. Moreover, a behavioral avoidance task demonstrates that specific odor detection thresholds are significantly decreased in multiple transgenic lines, expressing mouse or human ORs. We have developed a versatile platform to study gene choice and axon identity, to create biosensors with great translational potential, and to finally decode human olfaction. Copyright © 2016 The Authors. Published by Elsevier Inc. All rights reserved.
Otsuki, Tetsuji; Ota, Toshio; Nishikawa, Tetsuo; Hayashi, Koji; Suzuki, Yutaka; Yamamoto, Jun-ichi; Wakamatsu, Ai; Kimura, Kouichi; Sakamoto, Katsuhiko; Hatano, Naoto; Kawai, Yuri; Ishii, Shizuko; Saito, Kaoru; Kojima, Shin-ichi; Sugiyama, Tomoyasu; Ono, Tetsuyoshi; Okano, Kazunori; Yoshikawa, Yoko; Aotsuka, Satoshi; Sasaki, Naokazu; Hattori, Atsushi; Okumura, Koji; Nagai, Keiichi; Sugano, Sumio; Isogai, Takao
2005-01-01
We have developed an in silico method of selection of human full-length cDNAs encoding secretion or membrane proteins from oligo-capped cDNA libraries. Fullness rates were increased to about 80% by combination of the oligo-capping method and ATGpr, software for prediction of translation start point and the coding potential. Then, using 5'-end single-pass sequences, cDNAs having the signal sequence were selected by PSORT ('signal sequence trap'). We also applied 'secretion or membrane protein-related keyword trap' based on the result of BLAST search against the SWISS-PROT database for the cDNAs which could not be selected by PSORT. Using the above procedures, 789 cDNAs were primarily selected and subjected to full-length sequencing, and 334 of these cDNAs were finally selected as novel. Most of the cDNAs (295 cDNAs: 88.3%) were predicted to encode secretion or membrane proteins. In particular, 165 (80.5%) of the 205 cDNAs selected by PSORT were predicted to have signal sequences, while 70 (54.2%) of the 129 cDNAs selected by 'keyword trap' preserved the secretion or membrane protein-related keywords. Many important cDNAs were obtained, including transporters, receptors, and ligands, involved in significant cellular functions. Thus, an efficient method of selecting secretion or membrane protein-encoding cDNAs was developed by combining the above four procedures.
FRAGS: estimation of coding sequence substitution rates from fragmentary data
Swart, Estienne C; Hide, Winston A; Seoighe, Cathal
2004-01-01
Background Rates of substitution in protein-coding sequences can provide important insights into evolutionary processes that are of biomedical and theoretical interest. Increased availability of coding sequence data has enabled researchers to estimate more accurately the coding sequence divergence of pairs of organisms. However the use of different data sources, alignment protocols and methods to estimate substitution rates leads to widely varying estimates of key parameters that define the coding sequence divergence of orthologous genes. Although complete genome sequence data are not available for all organisms, fragmentary sequence data can provide accurate estimates of substitution rates provided that an appropriate and consistent methodology is used and that differences in the estimates obtainable from different data sources are taken into account. Results We have developed FRAGS, an application framework that uses existing, freely available software components to construct in-frame alignments and estimate coding substitution rates from fragmentary sequence data. Coding sequence substitution estimates for human and chimpanzee sequences, generated by FRAGS, reveal that methodological differences can give rise to significantly different estimates of important substitution parameters. The estimated substitution rates were also used to infer upper-bounds on the amount of sequencing error in the datasets that we have analysed. Conclusion We have developed a system that performs robust estimation of substitution rates for orthologous sequences from a pair of organisms. Our system can be used when fragmentary genomic or transcript data is available from one of the organisms and the other is a completely sequenced genome within the Ensembl database. As well as estimating substitution statistics our system enables the user to manage and query alignment and substitution data. PMID:15005802
Genome assembly and transcriptome resource for river buffalo, Bubalus bubalis (2n = 50)
Iamartino, Daniela; Pruitt, Kim D; Sonstegard, Tad; Smith, Timothy P L; Low, Wai Yee; Biagini, Tommaso; Bomba, Lorenzo; Capomaccio, Stefano; Castiglioni, Bianca; Coletta, Angelo; Corrado, Federica; Ferré, Fabrizio; Iannuzzi, Leopoldo; Lawley, Cynthia; Macciotta, Nicolò; McClure, Matthew; Mancini, Giordano; Matassino, Donato; Mazza, Raffaele; Milanesi, Marco; Moioli, Bianca; Morandi, Nicola; Ramunno, Luigi; Peretti, Vincenzo; Pilla, Fabio; Ramelli, Paola; Schroeder, Steven; Strozzi, Francesco; Thibaud-Nissen, Francoise; Zicarelli, Luigi; Ajmone-Marsan, Paolo; Valentini, Alessio; Chillemi, Giovanni; Zimin, Aleksey
2017-01-01
Abstract Water buffalo is a globally important species for agriculture and local economies. A de novo assembled, well-annotated reference sequence for the water buffalo is an important prerequisite for studying the biology of this species, and is necessary to manage genetic diversity and to use modern breeding and genomic selection techniques. However, no such genome assembly has been previously reported. There are 2 species of domestic water buffalo, the river (2n = 50) and the swamp (2n = 48) buffalo. Here we describe a draft quality reference sequence for the river buffalo created from Illumina GA and Roche 454 short read sequences using the MaSuRCA assembler. The assembled sequence is 2.83 Gb, consisting of 366 983 scaffolds with a scaffold N50 of 1.41 Mb and contig N50 of 21 398 bp. Annotation of the genome was supported by transcriptome data from 30 tissues and identified 21 711 predicted protein coding genes. Searches for complete mammalian BUSCO gene groups found 98.6% of curated single copy orthologs present among predicted genes, which suggests a high level of completeness of the genome. The annotated sequence is available from NCBI at accession GCA_000471725.1. PMID:29048578
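The scaffold and contig N50 values quoted above summarize assembly contiguity. As a minimal sketch (not the authors' pipeline), N50 can be computed as the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembled bases. The function name `n50` and the toy lengths are assumptions made for illustration only.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total number of assembled bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy scaffold lengths (hypothetical, not taken from the buffalo assembly).
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_scaffolds))
```

With the toy input, the two longest scaffolds (80 + 70) already reach half of the 300 total bases, so the N50 is the second of them.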
Visual pattern image sequence coding
NASA Technical Reports Server (NTRS)
Silsbee, Peter; Bovik, Alan C.; Chen, Dapang
1990-01-01
The visual pattern image coding (VPIC) configurable digital image-coding process is capable of coding with visual fidelity comparable to the best available techniques, at compressions which (at 30-40:1) exceed all other technologies. These capabilities are associated with unprecedented coding efficiencies; coding and decoding operations are entirely linear with respect to image size and entail a complexity that is 1-2 orders of magnitude faster than any previous high-compression technique. The visual pattern image sequence coding to which attention is presently given exploits all the advantages of the static VPIC in the reduction of information from an additional, temporal dimension, to achieve unprecedented image sequence coding performance.
Analysis of the possibility of using G.729 codec for steganographic transmission
NASA Astrophysics Data System (ADS)
Piotrowski, Zbigniew; Ciołek, Michał; Dołowski, Jerzy; Wojtuń, Jarosław
2017-04-01
Network steganography is intended in particular for communication services in which no bridges or nodes carry out unintentional attacks on the steganographic sequence. To set up a hidden communication channel, a method of data encoding and decoding was implemented using the codebooks of the G.729 codec. The G.729 codec is built around the CS-ACELP (Conjugate Structure Algebraic Code Excited Linear Prediction) linear prediction vocoder, and modifying the binary content of the codebook makes it easy to change the binary output stream. The article describes the results of research on selecting those codebook bits of the G.729 codec whose negation has the least influence on the loss of quality and fidelity of the output signal. The study was performed using subjective and objective listening tests.
GABI-Kat SimpleSearch: new features of the Arabidopsis thaliana T-DNA mutant database.
Kleinboelting, Nils; Huep, Gunnar; Kloetgen, Andreas; Viehoever, Prisca; Weisshaar, Bernd
2012-01-01
T-DNA insertion mutants are very valuable for reverse genetics in Arabidopsis thaliana. Several projects have generated large sequence-indexed collections of T-DNA insertion lines, of which GABI-Kat is the second largest resource worldwide. User access to the collection and its Flanking Sequence Tags (FSTs) is provided by the front end SimpleSearch (http://www.GABI-Kat.de). Several significant improvements have been implemented recently. The database now relies on the TAIRv10 genome sequence and annotation dataset. All FSTs have been newly mapped using an optimized procedure that leads to improved accuracy of insertion site predictions. A fraction of the collection with weak FST yield was re-analysed by generating new FSTs. Along with newly found predictions for older sequences, about 20,000 new FSTs were included in the database. Information about groups of FSTs pointing to the same insertion site that is found in several lines but is real only in a single line is included, and many problematic FST-to-line links have been corrected using new wet-lab data. SimpleSearch currently contains data from ~71,000 lines with predicted insertions covering 62.5% of the 27,206 nuclear protein coding genes, and offers insertion allele-specific data from 9545 confirmed lines that are available from the Nottingham Arabidopsis Stock Centre.
Pettigrew, Christopher; Wayte, Nicola; Lovelock, Paul K; Tavtigian, Sean V; Chenevix-Trench, Georgia; Spurdle, Amanda B; Brown, Melissa A
2005-01-01
Introduction Aberrant pre-mRNA splicing can be more detrimental to the function of a gene than changes in the length or nature of the encoded amino acid sequence. Although predicting the effects of changes in consensus 5' and 3' splice sites near intron:exon boundaries is relatively straightforward, predicting the possible effects of changes in exonic splicing enhancers (ESEs) remains a challenge. Methods As an initial step toward determining which ESEs predicted by the web-based tool ESEfinder in the breast cancer susceptibility gene BRCA1 are likely to be functional, we have determined their evolutionary conservation and compared their location with known BRCA1 sequence variants. Results Using the default settings of ESEfinder, we initially detected 669 potential ESEs in the coding region of the BRCA1 gene. Increasing the threshold score reduced the total number to 464, while taking into consideration the proximity to splice donor and acceptor sites reduced the number to 211. Approximately 11% of these ESEs (23/211) either are identical at the nucleotide level in human, primates, mouse, cow, dog and opossum Brca1 (conserved) or are detectable by ESEfinder in the same position in the Brca1 sequence (shared). The frequency of conserved and shared predicted ESEs between human and mouse is higher in BRCA1 exons (2.8 per 100 nucleotides) than in introns (0.6 per 100 nucleotides). Of conserved or shared putative ESEs, 61% (14/23) were predicted to be affected by sequence variants reported in the Breast Cancer Information Core database. Applying the filters described above increased the colocalization of predicted ESEs with missense changes, in-frame deletions and unclassified variants predicted to be deleterious to protein function, whereas they decreased the colocalization with known polymorphisms or unclassified variants predicted to be neutral. 
Conclusion In this report we show that evolutionary conservation analysis may be used to improve the specificity of an ESE prediction tool. This is the first report on the prediction of the frequency and distribution of ESEs in the BRCA1 gene, and it is the first reported attempt to predict which ESEs are most likely to be functional and therefore which sequence variants in ESEs are most likely to be pathogenic. PMID:16280041
Scene-aware joint global and local homographic video coding
NASA Astrophysics Data System (ADS)
Peng, Xiulian; Xu, Jizheng; Sullivan, Gary J.
2016-09-01
Perspective motion is commonly represented in video content that is captured and compressed for various applications including cloud gaming, vehicle and aerial monitoring, etc. Existing approaches based on an eight-parameter homography motion model cannot deal with this efficiently, either due to low prediction accuracy or excessive bit rate overhead. In this paper, we consider the camera motion model and scene structure in such video content and propose a joint global and local homography motion coding approach for video with perspective motion. The camera motion is estimated by a computer vision approach, and camera intrinsic and extrinsic parameters are globally coded at the frame level. The scene is modeled as piece-wise planes, and three plane parameters are coded at the block level. Fast gradient-based approaches are employed to search for the plane parameters for each block region. In this way, improved prediction accuracy and low bit costs are achieved. Experimental results based on the HEVC test model show that up to 9.1% bit rate savings can be achieved (with equal PSNR quality) on test video content with perspective motion. Test sequences for the example applications showed a bit rate savings ranging from 3.7 to 9.1%.
Zhu, Xun; Xie, Shangbo; Armengaud, Jean; Xie, Wen; Guo, Zhaojiang; Kang, Shi; Wu, Qingjun; Wang, Shaoli; Xia, Jixing; He, Rongjun; Zhang, Youjun
2016-06-01
The diamondback moth, Plutella xylostella (L.), is the major cosmopolitan pest of brassica and other cruciferous crops. Its larval midgut is a dynamic tissue that interfaces with a wide variety of toxicological and physiological processes. The draft sequence of the P. xylostella genome was recently released, but its annotation remains challenging because of the low sequence coverage of this branch of life and the poor description of exon/intron splicing rules for these insects. Peptide sequencing by computational assignment of tandem mass spectra to genome sequence information provides an experimentally independent approach for confirming or refuting protein predictions, a concept that has been termed proteogenomics. In this study, we carried out an in-depth proteogenomic analysis to complement genome annotation of P. xylostella larval midgut based on shotgun HPLC-ESI-MS/MS data by means of a multialgorithm pipeline. A total of 876,341 tandem mass spectra were searched against the predicted P. xylostella protein sequences and a whole-genome six-frame translation database. Based on a data set comprising 2694 novel genome search specific peptides, we discovered 439 novel protein-coding genes and corrected 128 existing gene models. To get the most accurate data to seed further insect genome annotation, more than half of the novel protein-coding genes, i.e. 235 of 439, were further validated after RT-PCR amplification and sequencing of the corresponding transcripts. Furthermore, we validated 53 novel alternative splicing events. Finally, a total of 6764 proteins were identified, resulting in one of the most comprehensive proteogenomic studies of a nonmodel animal. As the first tissue-specific proteogenomics analysis of P. xylostella, this study provides the fundamental basis for high-throughput proteomics and functional genomics approaches aimed at deciphering the molecular mechanisms of resistance and controlling this pest.
© 2016 by The American Society for Biochemistry and Molecular Biology, Inc.
From sequence to enzyme mechanism using multi-label machine learning.
De Ferrari, Luna; Mitchell, John B O
2014-05-19
In this work we predict enzyme function at the level of chemical mechanism, providing a finer granularity of annotation than traditional Enzyme Commission (EC) classes. Hence we can predict not only whether a putative enzyme in a newly sequenced organism has the potential to perform a certain reaction, but how the reaction is performed, using which cofactors and with susceptibility to which drugs or inhibitors, details with important consequences for drug and enzyme design. Work that predicts enzyme catalytic activity based on 3D protein structure features limits the prediction of mechanism to proteins already having either a solved structure or a close relative suitable for homology modelling. In this study, we evaluate whether sequence identity, InterPro or Catalytic Site Atlas sequence signatures provide enough information for bulk prediction of enzyme mechanism. By splitting MACiE (Mechanism, Annotation and Classification in Enzymes database) mechanism labels to a finer granularity, which includes the role of the protein chain in the overall enzyme complex, the method can predict at 96% accuracy (and 96% micro-averaged precision, 99.9% macro-averaged recall) the MACiE mechanism definitions of 248 proteins available in the MACiE, EzCatDb (Database of Enzyme Catalytic Mechanisms) and SFLD (Structure Function Linkage Database) databases using an off-the-shelf K-Nearest Neighbours multi-label algorithm. We find that InterPro signatures are critical for accurate prediction of enzyme mechanism. We also find that incorporating Catalytic Site Atlas attributes does not seem to provide additional accuracy. The software code (ml2db), data and results are available online at http://sourceforge.net/projects/ml2db/ and as supplementary files.
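The abstract above reports both micro-averaged and macro-averaged scores for a multi-label classifier. A minimal sketch of how the two kinds of averaging differ for precision follows; it is not the ml2db code, and the function name, the toy label sets, and the convention of skipping labels that are never predicted in the macro average are all assumptions made for illustration.

```python
def micro_macro_precision(true_sets, pred_sets):
    """Return (micro, macro) precision for multi-label predictions.

    true_sets and pred_sets are equal-length lists of label sets.
    Micro-averaging pools all label decisions; macro-averaging takes
    the unweighted mean of per-label precision (labels that are never
    predicted are skipped, which is one of several conventions).
    """
    labels = set().union(*true_sets, *pred_sets)
    tp = {label: 0 for label in labels}
    fp = {label: 0 for label in labels}
    for truth, predicted in zip(true_sets, pred_sets):
        for label in predicted:
            if label in truth:
                tp[label] += 1
            else:
                fp[label] += 1
    total_predicted = sum(tp.values()) + sum(fp.values())
    if total_predicted == 0:
        raise ValueError("no labels were predicted")
    micro = sum(tp.values()) / total_predicted
    per_label = [tp[l] / (tp[l] + fp[l]) for l in labels if tp[l] + fp[l] > 0]
    macro = sum(per_label) / len(per_label)
    return micro, macro


# Hypothetical mechanism labels "A" and "B" for four proteins.
truth = [{"A"}, {"A", "B"}, {"B"}, {"A"}]
predicted = [{"A"}, {"A"}, {"A", "B"}, {"A"}]
print(micro_macro_precision(truth, predicted))
```

In the toy data, label A is predicted four times (three correct) and label B once (correct), so the pooled precision is 4/5 while the per-label mean is (0.75 + 1.0) / 2.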
Identification and characterization of a novel zebrafish (Danio rerio) pentraxin-carbonic anhydrase.
Patrikainen, Maarit S; Tolvanen, Martti E E; Aspatwar, Ashok; Barker, Harlan R; Ortutay, Csaba; Jänis, Janne; Laitaoja, Mikko; Hytönen, Vesa P; Azizi, Latifeh; Manandhar, Prajwol; Jáger, Edit; Vullo, Daniela; Kukkurainen, Sampo; Hilvo, Mika; Supuran, Claudiu T; Parkkila, Seppo
2017-01-01
Carbonic anhydrases (CAs) are ubiquitous, essential enzymes which catalyze the conversion of carbon dioxide and water to bicarbonate and H+ ions. Vertebrate genomes generally contain gene loci for 15-21 different CA isoforms, three of which are enzymatically inactive. CA VI is the only secretory protein of the enzymatically active isoforms. We discovered that non-mammalian CA VI contains a C-terminal pentraxin (PTX) domain, a novel combination for both CAs and PTXs. We isolated and sequenced zebrafish (Danio rerio) CA VI cDNA, complete with the sequence coding for the PTX domain, and produced the recombinant CA VI-PTX protein. Enzymatic activity and kinetic parameters were measured with a stopped-flow instrument. Mass spectrometry, analytical gel filtration and dynamic light scattering were used for biophysical characterization. Sequence analyses and Bayesian phylogenetics were used in generating hypotheses of protein structure and CA VI gene evolution. A CA VI-PTX antiserum was produced, and the expression of CA VI protein was studied by immunohistochemistry. A knock-down zebrafish model was constructed, and larvae were observed up to five days post-fertilization (dpf). The expression of ca6 mRNA was quantitated by qRT-PCR at different developmental times in morphant and wild-type larvae and in different adult fish tissues. Finally, the swimming behavior of the morphant fish was compared to that of wild-type fish. The recombinant enzyme has a very high carbonate dehydratase activity. Sequencing confirms a 530-residue protein identical to one of the predicted proteins in the Ensembl database (ensembl.org). The protein is pentameric in solution, as studied by gel filtration and light scattering, presumably joined by the PTX domains. Mass spectrometry confirms the predicted signal peptide cleavage and disulfides, and N-glycosylation in two of the four observed glycosylation motifs.
Molecular modeling of the pentamer is consistent with the modifications observed in mass spectrometry. Phylogenetics and sequence analyses provide a consistent hypothesis of the evolutionary history of domains associated with CA VI in mammals and non-mammals. Briefly, the evidence suggests that ancestral CA VI was a transmembrane protein, the exon coding for the cytoplasmic domain was replaced by one coding for PTX domain, and finally, in the therian lineage, the PTX-coding exon was lost. We knocked down CA VI expression in zebrafish embryos with antisense morpholino oligonucleotides, resulting in phenotype features of decreased buoyancy and swim bladder deflation in 4 dpf larvae. These findings provide novel insights into the evolution, structure, and function of this unique CA form.
GeneMachine: gene prediction and sequence annotation.
Makalowska, I; Ryan, J F; Baxevanis, A D
2001-09-01
A number of free-standing programs have been developed in order to help researchers find potential coding regions and deduce gene structure for long stretches of what is essentially 'anonymous DNA'. As these programs apply inherently different criteria to the question of what is and is not a coding region, multiple algorithms should be used in the course of positional cloning and positional candidate projects to assure that all potential coding regions within a previously-identified critical region are identified. We have developed a gene identification tool called GeneMachine which allows users to query multiple exon and gene prediction programs in an automated fashion. BLAST searches are also performed in order to see whether a previously-characterized coding region corresponds to a region in the query sequence. A suite of Perl programs and modules are used to run MZEF, GENSCAN, GRAIL 2, FGENES, RepeatMasker, Sputnik, and BLAST. The results of these runs are then parsed and written into ASN.1 format. Output files can be opened using NCBI Sequin, in essence using Sequin as both a workbench and as a graphical viewer. The main feature of GeneMachine is that the process is fully automated; the user is only required to launch GeneMachine and then open the resulting file with Sequin. Annotations can then be made to these results prior to submission to GenBank, thereby increasing the intrinsic value of these data. GeneMachine is freely-available for download at http://genome.nhgri.nih.gov/genemachine. A public Web interface to the GeneMachine server for academic and not-for-profit users is available at http://genemachine.nhgri.nih.gov. The Web supplement to this paper may be found at http://genome.nhgri.nih.gov/genemachine/supplement/.
de Borba, Luana; Villordo, Sergio M; Iglesias, Nestor G; Filomatori, Claudia V; Gebhard, Leopoldo G; Gamarnik, Andrea V
2015-03-01
The dengue virus genome is a dynamic molecule that adopts different conformations in the infected cell. Here, using RNA folding predictions, chemical probing analysis, RNA binding assays, and functional studies, we identified new cis-acting elements present in the capsid coding sequence that facilitate cyclization of the viral RNA by hybridization with a sequence involved in a local dumbbell structure at the viral 3' untranslated region (UTR). The identified interaction differentially enhances viral replication in mosquito and mammalian cells. Copyright © 2015, American Society for Microbiology. All Rights Reserved.
Hunt, C; Morimoto, R I
1985-01-01
We have determined the nucleotide sequence of the human hsp70 gene and 5' flanking region. The hsp70 gene is transcribed as an uninterrupted primary transcript of 2440 nucleotides composed of a 5' noncoding leader sequence of 212 nucleotides, a 3' noncoding region of 242 nucleotides, and a continuous open reading frame of 1986 nucleotides that encodes a protein with predicted molecular mass of 69,800 daltons. Upstream of the 5' terminus are the canonical TATAAA box, the sequence ATTGG that corresponds in the inverted orientation to the CCAAT motif, and the dyad sequence CTGGAAT/ATTCCCG that shares homology in 12 of 14 positions with the consensus transcription regulatory sequence common to Drosophila heat shock genes. Comparison of the predicted amino acid sequences of human hsp70 with the published sequences of Drosophila hsp70 and Escherichia coli dnaK reveals that human hsp70 is 73% identical to Drosophila hsp70 and 47% identical to E. coli dnaK. Surprisingly, the nucleotide sequences of the human and Drosophila genes are 72% identical and human and E. coli genes are 50% identical, which is more highly conserved than necessary given the degeneracy of the genetic code. The lack of accumulated silent nucleotide substitutions leads us to propose that there may be additional information in the nucleotide sequence of the hsp70 gene or the corresponding mRNA that precludes the maximum divergence allowed in the silent codon positions. PMID:3931075
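Two pieces of arithmetic in the abstract above can be checked with a short sketch: the three transcript segments should sum to the stated transcript length, and percent identity is a simple ratio over aligned positions. The function name `percent_identity`, the choice to skip gapped columns, and the toy sequences are assumptions for illustration; the abstract does not state how the authors computed identity.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over the ungapped columns of a pairwise alignment.

    Both inputs must be the same length; '-' marks a gap. Columns with a
    gap in either sequence are skipped (one common convention among several).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != "-" and y != "-"]
    if not columns:
        raise ValueError("no ungapped columns to compare")
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


# Segment lengths as given in the abstract (nucleotides).
leader_5p, orf, trailer_3p = 212, 1986, 242
print(leader_5p + orf + trailer_3p)  # should match the stated 2440-nt transcript

# Toy aligned fragments (hypothetical, not hsp70 sequence).
print(round(percent_identity("ATGGCCAAG", "ATGGCTAAA"), 1))
```

The toy fragments differ at two of nine positions, both third codon positions, which is the kind of silent-site divergence the abstract reports as unexpectedly low.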
REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants.
Ioannidis, Nilah M; Rothstein, Joseph H; Pejaver, Vikas; Middha, Sumit; McDonnell, Shannon K; Baheti, Saurabh; Musolf, Anthony; Li, Qing; Holzinger, Emily; Karyadi, Danielle; Cannon-Albright, Lisa A; Teerlink, Craig C; Stanford, Janet L; Isaacs, William B; Xu, Jianfeng; Cooney, Kathleen A; Lange, Ethan M; Schleutker, Johanna; Carpten, John D; Powell, Isaac J; Cussenot, Olivier; Cancel-Tassin, Geraldine; Giles, Graham G; MacInnis, Robert J; Maier, Christiane; Hsieh, Chih-Lin; Wiklund, Fredrik; Catalona, William J; Foulkes, William D; Mandal, Diptasri; Eeles, Rosalind A; Kote-Jarai, Zsofia; Bustamante, Carlos D; Schaid, Daniel J; Hastie, Trevor; Ostrander, Elaine A; Bailey-Wilson, Joan E; Radivojac, Predrag; Thibodeau, Stephen N; Whittemore, Alice S; Sieh, Weiva
2016-10-06
The vast majority of coding variants are rare, and assessment of the contribution of rare variants to complex traits is hampered by low statistical power and limited functional data. Improved methods for predicting the pathogenicity of rare coding variants are needed to facilitate the discovery of disease variants from exome sequencing studies. We developed REVEL (rare exome variant ensemble learner), an ensemble method for predicting the pathogenicity of missense variants on the basis of individual tools: MutPred, FATHMM, VEST, PolyPhen, SIFT, PROVEAN, MutationAssessor, MutationTaster, LRT, GERP, SiPhy, phyloP, and phastCons. REVEL was trained with recently discovered pathogenic and rare neutral missense variants, excluding those previously used to train its constituent tools. When applied to two independent test sets, REVEL had the best overall performance (p < 10^-12) as compared to any individual tool and seven ensemble methods: MetaSVM, MetaLR, KGGSeq, Condel, CADD, DANN, and Eigen. Importantly, REVEL also had the best performance for distinguishing pathogenic from rare neutral variants with allele frequencies <0.5%. The area under the receiver operating characteristic curve (AUC) for REVEL was 0.046-0.182 higher in an independent test set of 935 recent SwissVar disease variants and 123,935 putatively neutral exome sequencing variants and 0.027-0.143 higher in an independent test set of 1,953 pathogenic and 2,406 benign variants recently reported in ClinVar than the AUCs for other ensemble methods. We provide pre-computed REVEL scores for all possible human missense variants to facilitate the identification of pathogenic variants in the sea of rare variants discovered as sequencing studies expand in scale. Copyright © 2016 American Society of Human Genetics. All rights reserved.
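The comparison above is stated in terms of AUC. As a minimal sketch of what that number measures (not the REVEL evaluation code), the AUC equals the probability that a randomly chosen pathogenic variant receives a higher score than a randomly chosen neutral one, with ties counted as one half. The function name `roc_auc` and the toy scores are assumptions for illustration.

```python
def roc_auc(scores_pos, scores_neg):
    """Area under the ROC curve via pairwise comparison.

    Counts, over all (positive, negative) pairs, how often the positive
    scores higher; ties contribute 0.5. This is quadratic in the input
    size, which is fine for a toy example.
    """
    if not scores_pos or not scores_neg:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical pathogenicity scores for three pathogenic and three neutral variants.
pathogenic = [0.9, 0.8, 0.4]
neutral = [0.7, 0.3, 0.2]
print(roc_auc(pathogenic, neutral))
```

In the toy data, eight of the nine pathogenic-neutral pairs are ranked correctly, so the AUC is 8/9.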
Bobrova, E V; Liakhovetskiĭ, V A; Borshchevskaia, E R
2011-01-01
The dependence of errors during reproduction of a sequence of hand movements without visual feedback on the previous right- and left-hand performance ("prehistory") and on positions in space of sequence elements (random or ordered by the explicit rule) was analyzed. It was shown that the preceding information about the ordered positions of the sequence elements was used during right-hand movements, whereas left-hand movements were performed with involvement of the information about the random sequence. The data testify to a central mechanism of the analysis of spatial structure of sequence elements. This mechanism activates movement coding specific for the left hemisphere (vector coding) in case of an ordered sequence structure and positional coding specific for the right hemisphere in case of a random sequence structure.
Burgos, Mariana; Arenas, Alvaro; Cabrera, Rodrigo
2016-08-01
Inherited long QT syndrome (LQTS) is a cardiac channelopathy characterized by a prolongation of QT interval and the risk of syncope, cardiac arrest, and sudden cardiac death. Genetic diagnosis of LQTS is critical in medical practice as results can guide adequate management of patients and distinguish phenocopies such as catecholaminergic polymorphic ventricular tachycardia (CPVT). However, extensive screening of large genomic regions is required in order to reliably identify genetic causes. Semiconductor whole exome sequencing (WES) is a promising approach for the identification of variants in the coding regions of most human genes. DNA samples from 21 Colombian patients clinically diagnosed with LQTS were enriched for coding regions using multiplex polymerase chain reaction (PCR) and subjected to WES using a semiconductor sequencer. Semiconductor WES showed mean coverage of 93.6 % for all coding regions relevant to LQTS at >10× depth with high intra- and inter-assay depth heterogeneity. Fifteen variants were detected in 12 patients in genes associated with LQTS. Three variants were identified in three patients in genes associated with CPVT. Co-segregation analysis was performed when possible. All variants were analyzed with two pathogenicity prediction algorithms. The overall prevalence of LQTS and CPVT variants in our cohort was 71.4 %. All LQTS variants previously identified through commercial genetic testing were identified. Standardized WES assays can be easily implemented, often at a lower cost than sequencing panels. Our results show that WES can identify LQTS-causing mutations and permits differential diagnosis of related conditions in a real-world clinical setting. However, high heterogeneity in sequencing depth and low coverage in the most relevant genes is expected to be associated with reduced analytical sensitivity.
GAMSOR: Gamma Source Preparation and DIF3D Flux Solution
DOE Office of Scientific and Technical Information (OSTI.GOV)
Smith, M. A.; Lee, C. H.; Hill, R. N.
2017-06-28
Nuclear reactors that rely upon the fission reaction have two modes of thermal energy deposition in the reactor system: neutron absorption and gamma absorption. The gamma rays are typically generated by neutron capture reactions or during the fission process, which means the primary driver of energy production is of course the neutron interaction. In conventional reactor physics methods, the gamma heating component is ignored such that the gamma absorption is forced to occur at the gamma emission site. For experimental reactor systems like EBR-II and FFTF, the placement of structural pins and assemblies internal to the core leads to problems with power heating predictions because there is no fission power source internal to the assembly to dictate a spatial distribution of the power. As part of the EBR-II support work in the 1980s, the GAMSOR code was developed to assist analysts in calculating the gamma heating. The GAMSOR code is a modified version of DIF3D and actually functions within a sequence of DIF3D calculations. The gamma flux in a conventional fission reactor system does not perturb the neutron flux and thus the gamma flux calculation can be cast as a fixed source problem given a solution to the steady state neutron flux equation. This leads to a sequence of DIF3D calculations, called the GAMSOR sequence, which involves solving the neutron flux, then the gamma flux, and then combining the results to do a summary edit. In this manuscript, we go over the GAMSOR code and detail how it is put together and functions. We also discuss how to set up the GAMSOR sequence and input for each DIF3D calculation in the GAMSOR sequence.
The genome of woodland strawberry (Fragaria vesca)
Shulaev, Vladimir; Sargent, Daniel J; Crowhurst, Ross N; Mockler, Todd C; Folkerts, Otto; Delcher, Arthur L; Jaiswal, Pankaj; Mockaitis, Keithanne; Liston, Aaron; Mane, Shrinivasrao P; Burns, Paul; Davis, Thomas M; Slovin, Janet P; Bassil, Nahla; Hellens, Roger P; Evans, Clive; Harkins, Tim; Kodira, Chinnappa; Desany, Brian; Crasta, Oswald R; Jensen, Roderick V; Allan, Andrew C; Michael, Todd P; Setubal, Joao Carlos; Celton, Jean-Marc; Rees, D Jasper G; Williams, Kelly P; Holt, Sarah H; Ruiz Rojas, Juan Jairo; Chatterjee, Mithu; Liu, Bo; Silva, Herman; Meisel, Lee; Adato, Avital; Filichkin, Sergei A; Troggio, Michela; Viola, Roberto; Ashman, Tia-Lynn; Wang, Hao; Dharmawardhana, Palitha; Elser, Justin; Raja, Rajani; Priest, Henry D; Bryant, Douglas W; Fox, Samuel E; Givan, Scott A; Wilhelm, Larry J; Naithani, Sushma; Christoffels, Alan; Salama, David Y; Carter, Jade; Girona, Elena Lopez; Zdepski, Anna; Wang, Wenqin; Kerstetter, Randall A; Schwab, Wilfried; Korban, Schuyler S; Davik, Jahn; Monfort, Amparo; Denoyes-Rothan, Beatrice; Arus, Pere; Mittler, Ron; Flinn, Barry; Aharoni, Asaph; Bennetzen, Jeffrey L; Salzberg, Steven L; Dickerman, Allan W; Velasco, Riccardo; Borodovsky, Mark; Veilleux, Richard E; Folta, Kevin M
2012-01-01
The woodland strawberry, Fragaria vesca (2n = 2x = 14), is a versatile experimental plant system. This diminutive herbaceous perennial has a small genome (240 Mb), is amenable to genetic transformation and shares substantial sequence identity with the cultivated strawberry (Fragaria × ananassa) and other economically important rosaceous plants. Here we report the draft F. vesca genome, which was sequenced to ×39 coverage using second-generation technology, assembled de novo and then anchored to the genetic linkage map into seven pseudochromosomes. This diploid strawberry sequence lacks the large genome duplications seen in other rosids. Gene prediction modeling identified 34,809 genes, with most being supported by transcriptome mapping. Genes critical to valuable horticultural traits including flavor, nutritional value and flowering time were identified. Macrosyntenic relationships between Fragaria and Prunus predict a hypothetical ancestral Rosaceae genome that had nine chromosomes. New phylogenetic analysis of 154 protein-coding genes suggests that assignment of Populus to Malvidae, rather than Fabidae, is warranted. PMID:21186353
Stop Codon Reassignment in the Wild
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ivanova, Natalia; Schwientek, Patrick; Tripp, H. James
Since the discovery of the genetic code and protein translation mechanisms (1), a limited number of variations of the standard assignment between unique base triplets (codons) and their encoded amino acids and translational stop signals have been found in bacteria and phages (2-3). Given the apparent ubiquity of the canonical genetic code, the design of genomically recoded organisms with non-canonical codes has been suggested as a means to prevent horizontal gene transfer between laboratory and environmental organisms (4). It is also predicted that genomically recoded organisms are immune to infection by viruses, under the assumption that phages and their hosts must share a common genetic code (5). This paradigm is supported by the observation of increased resistance of genomically recoded bacteria to phages with a canonical code (4). Despite these assumptions and accompanying lines of evidence, it remains unclear whether differential and non-canonical codon usage represents an absolute barrier to phage infection and genetic exchange between organisms. Our knowledge of the diversity of genetic codes and their use by viruses and their hosts is primarily derived from the analysis of cultivated organisms. Advances in single-cell sequencing and metagenome assembly technologies have enabled the reconstruction of genomes of uncultivated bacterial and archaeal lineages (6). These initial findings suggest that large scale systematic studies of uncultivated microorganisms and viruses may reveal the extent and modes of divergence from the canonical genetic code operating in nature. To explore alternative genetic codes, we carried out a systematic analysis of stop codon reassignments from the canonical TAG amber, TGA opal, and TAA ochre codons in assembled metagenomes from environmental and host-associated samples, single-cell genomes of uncultivated bacteria and archaea, and a collection of phage sequences.
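The effect of a stop codon reassignment on translation can be illustrated with a minimal Python sketch. It is not the pipeline used in the study above: the helper names (reassign, translate) and the example sequence are hypothetical, and the choice of recoding TGA to glycine is only an illustrative assumption. The standard table is built from the usual TCAG codon ordering.

```python
# Illustrative sketch only; function names and the example DNA are assumptions.
BASES = "TCAG"
# Amino acids of the standard genetic code, codons enumerated in TCAG order.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")

STANDARD = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def reassign(table, codon, amino_acid):
    """Return a copy of a codon table with one codon given a new meaning."""
    new_table = dict(table)
    new_table[codon] = amino_acid
    return new_table

def translate(dna, table):
    """Translate frame 0 of dna, stopping at the first stop codon ('*')."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = table[dna[i:i + 3].upper()]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

# Hypothetical recoded table: TGA (opal) read as glycine instead of stop.
OPAL_AS_GLY = reassign(STANDARD, "TGA", "G")

example = "ATGTGAGGCTAA"
print(translate(example, STANDARD))     # terminates at the in-frame TGA
print(translate(example, OPAL_AS_GLY))  # reads through TGA, stops at TAA
```

Under the canonical table an in-frame TGA truncates the product, whereas the recoded table reads through it; a large shift in predicted open reading frame lengths between tables is the kind of signal such a comparison exposes.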
UltraPse: A Universal and Extensible Software Platform for Representing Biological Sequences.
Du, Pu-Feng; Zhao, Wei; Miao, Yang-Yang; Wei, Le-Yi; Wang, Likun
2017-11-14
With the avalanche of biological sequences in public databases, one of the most challenging problems in computational biology is to predict their biological functions and cellular attributes. Most of the existing prediction algorithms can only handle fixed-length numerical vectors. Therefore, it is important to be able to represent biological sequences with various lengths using fixed-length numerical vectors. Although several algorithms, as well as software implementations, have been developed to address this problem, these existing programs can only provide a fixed number of representation modes. Every time a new sequence representation mode is developed, a new program will be needed. In this paper, we propose the UltraPse as a universal software platform for this problem. The function of the UltraPse is not only to generate various existing sequence representation modes, but also to simplify all future programming works in developing novel representation modes. The extensibility of UltraPse is particularly enhanced. It allows the users to define their own representation mode, their own physicochemical properties, or even their own types of biological sequences. Moreover, UltraPse is also the fastest software of its kind. The source code package, as well as the executables for both Linux and Windows platforms, can be downloaded from the GitHub repository.
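The core problem described above, mapping sequences of arbitrary length to fixed-length numerical vectors, can be sketched with a plain k-mer composition encoder in Python. This is one of the simplest possible representation modes and is not UltraPse's implementation or interface; the function name and parameters are assumptions made for illustration.

```python
from itertools import product

def kmer_composition(seq, k=2, alphabet="ACGT"):
    """Return normalized k-mer frequencies of seq in a fixed k-mer order.

    The output always has len(alphabet) ** k entries, regardless of the
    length of seq. Windows containing symbols outside alphabet are skipped.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:
            counts[word] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[word] / total for word in kmers]

# Two sequences of different lengths yield vectors of the same length.
short_vec = kmer_composition("ACGTACGT")
long_vec = kmer_composition("ACG" * 50)
print(len(short_vec), len(long_vec))
```

Because the vector length depends only on the alphabet size and k, the output can be passed directly to prediction algorithms that require fixed-length input.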
Mining of Microbial Genomes for the Novel Sources of Nitrilases.
Sharma, Nikhil; Thakur, Neerja; Raj, Tilak; Savitri; Bhalla, Tek Chand
2017-01-01
Next-generation DNA sequencing (NGS) has made it feasible to sequence a large number of microbial genomes, and advancements in computational biology have opened enormous opportunities to mine genome sequence data for novel genes and enzymes or their sources. In the present communication, in silico mining of microbial genomes has been carried out to find novel sources of nitrilases. The sequences selected were analyzed for homology and considered for designing motifs. The manually designed motifs based on amino acid sequences of nitrilases were used to screen 2000 microbial genomes (translated to proteomes). This resulted in identification of one hundred thirty-eight putative/hypothetical sequences which could potentially code for nitrilase activity. In vitro validation of nine predicted sources of nitrilases was done for nitrile/cyanide hydrolyzing activity. Out of nine predicted nitrilases, Gluconacetobacter diazotrophicus, Sphingopyxis alaskensis, Saccharomonospora viridis, and Shimwellia blattae were specific for aliphatic nitriles, whereas nitrilases from Geodermatophilus obscurus, Nocardiopsis dassonvillei, Runella slithyformis, and Streptomyces albus possessed activity for aromatic nitriles. Flavobacterium indicum was specific towards potassium cyanide (KCN), which revealed the presence of a nitrilase homolog, that is, cyanide dihydratase with no activity for either aliphatic, aromatic, or aryl nitriles. The present study reports the novel sources of nitrilases and cyanide dihydratase which were not reported hitherto by in silico or in vitro studies.
Discrete Cosine Transform Image Coding With Sliding Block Codes
NASA Astrophysics Data System (ADS)
Divakaran, Ajay; Pearlman, William A.
1989-11-01
A transform trellis coding scheme for images is presented. A two-dimensional discrete cosine transform is applied to the image, followed by a search on a trellis structured code. This code is a sliding block code that utilizes a constrained size reproduction alphabet. The image is divided into blocks by the transform coding. The non-stationarity of the image is counteracted by grouping these blocks in clusters through a clustering algorithm, and then encoding the clusters separately. Mandela ordered sequences are formed from each cluster, i.e., identically indexed coefficients from each block are grouped together to form one-dimensional sequences. A separate search ensues on each of these Mandela ordered sequences. Padding sequences are used to improve the trellis search fidelity. The padding sequences absorb the error caused by the building up of the trellis to full size. The simulations were carried out on a 256x256 image ('LENA'). The results are comparable to any existing scheme. The visual quality of the image is enhanced considerably by the padding and clustering.
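The Mandela ordering step mentioned above (grouping identically indexed coefficients from each block into one-dimensional sequences) can be sketched in Python as follows. The function name and the toy 2x2 blocks are hypothetical; the sketch assumes the transform coefficients of one cluster are already available as equally sized blocks and does not cover the DCT, the clustering, or the trellis search.

```python
def mandela_order(blocks):
    """Group identically indexed coefficients across blocks.

    blocks: list of equally sized 2-D coefficient blocks (lists of rows).
    Returns a dict mapping each coefficient index (u, v) to the
    one-dimensional sequence of that coefficient taken from every block,
    in block order.
    """
    rows = len(blocks[0])
    cols = len(blocks[0][0])
    return {
        (u, v): [block[u][v] for block in blocks]
        for u in range(rows)
        for v in range(cols)
    }

# Toy cluster of two 2x2 coefficient blocks (values are placeholders).
cluster = [
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
]
sequences = mandela_order(cluster)
print(sequences[(0, 0)])  # the (0, 0) coefficient of every block
print(sequences[(1, 1)])  # the (1, 1) coefficient of every block
```

Each resulting sequence collects coefficients of one index only, so a separate search can then be run on each sequence as the abstract describes.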
DNA Barcode Goes Two-Dimensions: DNA QR Code Web Server
Li, Huan; Xing, Hang; Liang, Dong; Jiang, Kun; Pang, Xiaohui; Song, Jingyuan; Chen, Shilin
2012-01-01
The DNA barcoding technology uses a standard region of DNA sequence for species identification and discovery. At present, “DNA barcode” actually refers to DNA sequences, which are not amenable to information storage, recognition, and retrieval. Our aim is to identify the best symbology that can represent DNA barcode sequences in practical applications. A comprehensive set of sequences for five DNA barcode markers ITS2, rbcL, matK, psbA-trnH, and CO1 was used as the test data. Fifty-three different types of one-dimensional and ten two-dimensional barcode symbologies were compared based on different criteria, such as coding capacity, compression efficiency, and error detection ability. The quick response (QR) code was found to have the largest coding capacity and relatively high compression ratio. To facilitate the further usage of QR code-based DNA barcodes, a web server was developed and is accessible at http://qrfordna.dnsalias.org. The web server allows users to retrieve the QR code for a species of interests, convert a DNA sequence to and from a QR code, and perform species identification based on local and global sequence similarities. In summary, the first comprehensive evaluation of various barcode symbologies has been carried out. The QR code has been found to be the most appropriate symbology for DNA barcode sequences. A web server has also been constructed to allow biologists to utilize QR codes in practical DNA barcoding applications. PMID:22574113
Student Use of Physics to Make Sense of Incomplete but Functional VPython Programs in a Lab Setting
NASA Astrophysics Data System (ADS)
Weatherford, Shawn A.
2011-12-01
Computational activities in Matter & Interactions, an introductory calculus-based physics course, have the instructional goal of providing students with the experience of applying the same set of a small number of fundamental principles to model a wide range of physical systems. However, there are significant instructional challenges for students to build computer programs under limited time constraints, especially for students who are unfamiliar with programming languages and concepts. Prior attempts at designing effective computational activities were successful at having students ultimately build working VPython programs under the tutelage of experienced teaching assistants in a studio lab setting. A pilot study revealed that students who completed these computational activities had significant difficulty repeating the exact same tasks and, further, had difficulty predicting the animation that would be produced by the example program after interpreting the program code. This study explores the interpretation and prediction tasks as part of an instructional sequence where students are asked to read and comprehend a functional, but incomplete program. Rather than asking students to begin their computational tasks with modifying program code, we explicitly ask students to interpret an existing program that is missing key lines of code. The missing lines of code correspond to the algebraic form of fundamental physics principles or the calculation of forces which would exist between analogous physical objects in the natural world. Students are then asked to draw a prediction of what they would see in the simulation produced by the VPython program and ultimately run the program to evaluate the students' prediction. This study specifically looks at how the participants use physics while interpreting the program code and creating a whiteboard prediction.
This study also examines how students evaluate their understanding of the program and modification goals at the beginning of the modification task. While working in groups over the course of a semester, study participants were recorded while they completed three activities using these incomplete programs. Analysis of the video data showed that study participants had little difficulty interpreting physics quantities, generating a prediction, or determining how to modify the incomplete program. Participants did not base their prediction solely on the information from the incomplete program. When participants tried to predict the motion of the objects in the simulation, many turned to their knowledge of how the system would evolve if it represented an analogous real-world physical system. For example, participants attributed the real-world behavior of springs to helix objects even though the program did not include calculations for the spring to exert a force when stretched. Participants rarely interpreted lines of code in the computational loop during the first computational activity, but this changed during later computational activities, with most participants using their physics knowledge to interpret the computational loop. Computational activities in the Matter & Interactions curriculum were revised in light of these findings to include an instructional sequence of tasks to build a comprehension of the example program. The modified activities also ask students to create an additional whiteboard prediction for the time-evolution of the real-world phenomenon which the example program will eventually model.
This thesis shows how comprehension tasks identified by Palincsar and Brown (1984) as effective in improving reading comprehension are also effective in helping students apply their physics knowledge to interpret a computer program which attempts to model a real-world phenomenon and identify errors in their understanding of the use, or omission, of fundamental physics principles in a computational model.
Genome-wide comparative analysis of four Indian Drosophila species.
Mohanty, Sujata; Khanna, Radhika
2017-12-01
Comparative analysis of multiple genomes of closely or distantly related Drosophila species undoubtedly creates excitement among evolutionary biologists in exploring the genomic changes with an ecology and evolutionary perspective. We present herewith the de novo assembled whole genome sequences of four Drosophila species, D. bipectinata, D. takahashii, D. biarmipes and D. nasuta of Indian origin using Next Generation Sequencing technology on an Illumina platform along with their detailed assembly statistics. The comparative genomics analysis, e.g. gene predictions and annotations, functional and orthogroup analysis of coding sequences and genome wide SNP distribution were performed. The whole genome of Zaprionus indianus of Indian origin published earlier by us and the genome sequences of previously sequenced 12 Drosophila species available in the NCBI database were included in the analysis. The present work is a part of our ongoing genomics project of Indian Drosophila species.
Caballero, Javier; Peralta, Cecilia; Molla, Antonella; Del Valle, Eleodoro E; Caballero, Primitivo; Berry, Colin; Felipe, Verónica; Yaryura, Pablo; Palma, Leopoldo
2018-01-01
Bacillus cereus is a gram-positive, spore-forming bacterium possessing an important and historical record as a human-pathogenic bacterium. However, several strains of this species exhibit interesting potential to be used as plant growth-promoting rhizobacteria. Here, we report the draft genome sequence of B. cereus strain CITVM-11.1, which consists of 37 contig sequences, accounting for 5,746,486 bp (with a GC content of 34.8%) and 5,752 predicted protein-coding sequences. Several of them could potentially be involved in plant-bacterium interactions and may contribute to the strong antagonistic activity shown by this strain against the charcoal root rot fungus, Macrophomina phaseolina. This genomic sequence also showed a number of genes that may confer this strain resistance against several polluting heavy metals and for the bioconversion of mycotoxins. © 2018 S. Karger AG, Basel.
Thiel, Kati; Mulaku, Edita; Dandapani, Hariharan; Nagy, Csaba; Aro, Eva-Mari; Kallio, Pauli
2018-03-02
Photosynthetic cyanobacteria have been studied as potential host organisms for direct solar-driven production of different carbon-based chemicals from CO2 and water, as part of the development of sustainable future biotechnological applications. The engineering approaches, however, are still limited by the lack of comprehensive information on optimal expression strategies and validated species-specific genetic elements which are essential for increasing the intricacy, predictability and efficiency of the systems. This study focused on the systematic evaluation of the key translational control elements, ribosome binding sites (RBS), in the cyanobacterial host Synechocystis sp. PCC 6803, with the objective of expanding the palette of tools for more rigorous engineering approaches. An expression system was established for the comparison of 13 selected RBS sequences in Synechocystis, using several alternative reporter proteins (sYFP2, codon-optimized GFPmut3 and ethylene forming enzyme) as quantitative indicators of the relative translation efficiencies. The set-up was shown to yield highly reproducible expression patterns in independent analytical series with low variation between biological replicates, thus allowing statistical comparison of the activities of the different RBSs in vivo. While the RBSs covered a relatively broad overall expression level range, the downstream gene sequence was demonstrated in a rigorous manner to have a clear impact on the resulting translational profiles. This was expected to reflect interfering sequence-specific mRNA-level interaction between the RBS and the coding region, yet correlation between potential secondary structure formation and observed translation levels could not be resolved with existing in silico prediction tools. The study expands our current understanding on the potential and limitations associated with the regulation of protein expression at translational level in engineered cyanobacteria.
The acquired information can be used for selecting appropriate RBSs for optimizing over-expression constructs or multicistronic pathways in Synechocystis, while underlining the complications in predicting the activity due to gene-specific interactions which may reduce the translational efficiency for a given RBS-gene combination. Ultimately, the findings emphasize the need for additional characterized insulator sequence elements to decouple the interaction between the RBS and the coding region for future engineering approaches.
Prediction of thermal cycling induced cracking in polymer matrix composites
NASA Technical Reports Server (NTRS)
Mcmanus, Hugh L.
1994-01-01
The work done in the period August 1993 through February 1994 on the 'Prediction of Thermal Cycling Induced Cracking In Polymer Matrix Composites' program is summarized. Most of the work performed in this period, as well as the previous one, is described in detail in the attached Master's thesis, 'Analysis of Thermally Induced Damage in Composite Space Structures,' by Cecelia Hyun Seon Park. Work on a small thermal cycling and aging chamber was concluded in this period. The chamber was extensively tested and calibrated. Temperatures can be controlled very precisely, and are very uniform in the test chamber. Based on results obtained in the previous period of this program, further experimental progressive cracking studies were carried out. The laminates tested were selected to clarify the differences between the behaviors of thick and thin ply layers, and to explore other variables such as stacking sequence and scaling effects. Most specimens tested were made available from existing stock at Langley Research Center. One laminate type had to be constructed from available prepreg material at Langley Research Center. Specimens from this laminate were cut and prepared at MIT. Thermal conditioning was carried out at Langley Research Center, and at the newly constructed MIT facility. Specimens were examined by edge inspection and by crack configuration studies, in which specimens were sanded down in order to examine the distribution of cracks within the specimens. A method for predicting matrix cracking due to decreasing temperatures and/or thermal cycling in all plies of an arbitrary laminate was implemented as a computer code. The code also predicts changes in properties due to the cracking. Extensive correlations between test results and code predictions were carried out. The computer code was documented and is ready for distribution.
Roth, Melissa S; Cokus, Shawn J; Gallaher, Sean D; Walter, Andreas; Lopez, David; Erickson, Erika; Endelman, Benjamin; Westcott, Daniel; Larabell, Carolyn A; Merchant, Sabeeha S; Pellegrini, Matteo; Niyogi, Krishna K
2017-05-23
Microalgae have potential to help meet energy and food demands without exacerbating environmental problems. There is interest in the unicellular green alga Chromochloris zofingiensis, because it produces lipids for biofuels and a highly valuable carotenoid nutraceutical, astaxanthin. To advance understanding of its biology and facilitate commercial development, we present a C. zofingiensis chromosome-level nuclear genome, organelle genomes, and transcriptome from diverse growth conditions. The assembly, derived from a combination of short- and long-read sequencing in conjunction with optical mapping, revealed a compact genome of ∼58 Mbp distributed over 19 chromosomes containing 15,274 predicted protein-coding genes. The genome has uniform gene density over chromosomes, low repetitive sequence content (∼6%), and a high fraction of protein-coding sequence (∼39%) with relatively long coding exons and few coding introns. Functional annotation of gene models identified orthologous families for the majority (∼73%) of genes. Synteny analysis uncovered localized but scrambled blocks of genes in putative orthologous relationships with other green algae. Two genes encoding beta-ketolase (BKT), the key enzyme synthesizing astaxanthin, were found in the genome, and both were up-regulated by high light. Isolation and molecular analysis of astaxanthin-deficient mutants showed that BKT1 is required for the production of astaxanthin. Moreover, the transcriptome under high light exposure revealed candidate genes that could be involved in critical yet missing steps of astaxanthin biosynthesis, including ABC transporters, cytochrome P450 enzymes, and an acyltransferase. The high-quality genome and transcriptome provide insight into the green algal lineage and carotenoid production.
Roth, Melissa S.; Cokus, Shawn J.; Gallaher, Sean D.; ...
2017-05-08
Microalgae have potential to help meet energy and food demands without exacerbating environmental problems. There is interest in the unicellular green alga Chromochloris zofingiensis, because it produces lipids for biofuels and a highly valuable carotenoid nutraceutical, astaxanthin. Here, to advance understanding of its biology and facilitate commercial development, we present a C. zofingiensis chromosome-level nuclear genome, organelle genomes, and transcriptome from diverse growth conditions. The assembly, derived from a combination of short- and long-read sequencing in conjunction with optical mapping, revealed a compact genome of ~58 Mbp distributed over 19 chromosomes containing 15,274 predicted protein-coding genes. The genome has uniform gene density over chromosomes, low repetitive sequence content (~6%), and a high fraction of protein-coding sequence (~39%) with relatively long coding exons and few coding introns. Functional annotation of gene models identified orthologous families for the majority (~73%) of genes. Synteny analysis uncovered localized but scrambled blocks of genes in putative orthologous relationships with other green algae. Two genes encoding beta-ketolase (BKT), the key enzyme synthesizing astaxanthin, were found in the genome, and both were up-regulated by high light. Isolation and molecular analysis of astaxanthin-deficient mutants showed that BKT1 is required for the production of astaxanthin. Moreover, the transcriptome under high light exposure revealed candidate genes that could be involved in critical yet missing steps of astaxanthin biosynthesis, including ABC transporters, cytochrome P450 enzymes, and an acyltransferase. Finally, the high-quality genome and transcriptome provide insight into the green algal lineage and carotenoid production.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Roth, Melissa S.; Cokus, Shawn J.; Gallaher, Sean D.
Microalgae have potential to help meet energy and food demands without exacerbating environmental problems. There is interest in the unicellular green alga Chromochloris zofingiensis, because it produces lipids for biofuels and a highly valuable carotenoid nutraceutical, astaxanthin. Here, to advance understanding of its biology and facilitate commercial development, we present a C. zofingiensis chromosome-level nuclear genome, organelle genomes, and transcriptome from diverse growth conditions. The assembly, derived from a combination of short- and long-read sequencing in conjunction with optical mapping, revealed a compact genome of ~58 Mbp distributed over 19 chromosomes containing 15,274 predicted protein-coding genes. The genome has uniform gene density over chromosomes, low repetitive sequence content (~6%), and a high fraction of protein-coding sequence (~39%) with relatively long coding exons and few coding introns. Functional annotation of gene models identified orthologous families for the majority (~73%) of genes. Synteny analysis uncovered localized but scrambled blocks of genes in putative orthologous relationships with other green algae. Two genes encoding beta-ketolase (BKT), the key enzyme synthesizing astaxanthin, were found in the genome, and both were up-regulated by high light. Isolation and molecular analysis of astaxanthin-deficient mutants showed that BKT1 is required for the production of astaxanthin. Moreover, the transcriptome under high light exposure revealed candidate genes that could be involved in critical yet missing steps of astaxanthin biosynthesis, including ABC transporters, cytochrome P450 enzymes, and an acyltransferase. Finally, the high-quality genome and transcriptome provide insight into the green algal lineage and carotenoid production.
Roth, Melissa S.; Cokus, Shawn J.; Gallaher, Sean D.; Walter, Andreas; Lopez, David; Erickson, Erika; Endelman, Benjamin; Westcott, Daniel; Larabell, Carolyn A.; Merchant, Sabeeha S.; Pellegrini, Matteo
2017-01-01
Microalgae have potential to help meet energy and food demands without exacerbating environmental problems. There is interest in the unicellular green alga Chromochloris zofingiensis, because it produces lipids for biofuels and a highly valuable carotenoid nutraceutical, astaxanthin. To advance understanding of its biology and facilitate commercial development, we present a C. zofingiensis chromosome-level nuclear genome, organelle genomes, and transcriptome from diverse growth conditions. The assembly, derived from a combination of short- and long-read sequencing in conjunction with optical mapping, revealed a compact genome of ∼58 Mbp distributed over 19 chromosomes containing 15,274 predicted protein-coding genes. The genome has uniform gene density over chromosomes, low repetitive sequence content (∼6%), and a high fraction of protein-coding sequence (∼39%) with relatively long coding exons and few coding introns. Functional annotation of gene models identified orthologous families for the majority (∼73%) of genes. Synteny analysis uncovered localized but scrambled blocks of genes in putative orthologous relationships with other green algae. Two genes encoding beta-ketolase (BKT), the key enzyme synthesizing astaxanthin, were found in the genome, and both were up-regulated by high light. Isolation and molecular analysis of astaxanthin-deficient mutants showed that BKT1 is required for the production of astaxanthin. Moreover, the transcriptome under high light exposure revealed candidate genes that could be involved in critical yet missing steps of astaxanthin biosynthesis, including ABC transporters, cytochrome P450 enzymes, and an acyltransferase. The high-quality genome and transcriptome provide insight into the green algal lineage and carotenoid production. PMID:28484037
Jiu-Sheng, Li; Ze-Jiang, Zhao; Jian-Quan, Yao
2017-11-27
In order to extend to 3-bit encoding, we propose notched-wheel structures as polarization insensitive coding metasurfaces to control terahertz wave reflection and suppress backward scattering. By using a coding sequence of "00110011…" along x-axis direction and 16 × 16 random coding sequence, we investigate the polarization insensitive properties of the coding metasurfaces. By designing the coding sequences of the basic coding elements, the terahertz wave reflection can be flexibly manipulated. Additionally, radar cross section (RCS) reduction in the backward direction is less than -10dB in a wide band. The present approach can offer application for novel terahertz manipulation devices.
Noncoding sequence classification based on wavelet transform analysis: part I
NASA Astrophysics Data System (ADS)
Paredes, O.; Strojnik, M.; Romo-Vázquez, R.; Vélez Pérez, H.; Ranta, R.; Garcia-Torales, G.; Scholl, M. K.; Morales, J. A.
2017-09-01
DNA sequences in the human genome can be divided into coding and noncoding ones. Coding sequences are those that are read during transcription. The identification of coding sequences has been widely reported in the literature due to their much-studied periodicity. Noncoding sequences represent the majority of the human genome. They play an important role in gene regulation and differentiation among the cells. However, noncoding sequences do not exhibit periodicities that correlate to their functions. The ENCODE (Encyclopedia of DNA Elements) and Epigenomic Roadmap projects have cataloged the human noncoding sequences into specific functions. We study characteristics of noncoding sequences with wavelet analysis of genomic signals.
Smith, Colin A; Kortemme, Tanja
2011-01-01
Predicting the set of sequences that are tolerated by a protein or protein interface, while maintaining a desired function, is useful for characterizing protein interaction specificity and for computationally designing sequence libraries to engineer proteins with new functions. Here we provide a general method, a detailed set of protocols, and several benchmarks and analyses for estimating tolerated sequences using flexible backbone protein design implemented in the Rosetta molecular modeling software suite. The input to the method is at least one experimentally determined three-dimensional protein structure or high-quality model. The starting structure(s) are expanded or refined into a conformational ensemble using Monte Carlo simulations consisting of backrub backbone and side chain moves in Rosetta. The method then uses a combination of simulated annealing and genetic algorithm optimization methods to enrich for low-energy sequences for the individual members of the ensemble. To emphasize certain functional requirements (e.g. forming a binding interface), interactions between and within parts of the structure (e.g. domains) can be reweighted in the scoring function. Results from each backbone structure are merged together to create a single estimate for the tolerated sequence space. We provide an extensive description of the protocol and its parameters, all source code, example analysis scripts and three tests applying this method to finding sequences predicted to stabilize proteins or protein interfaces. The generality of this method makes many other applications possible, for example stabilizing interactions with small molecules, DNA, or RNA. Through the use of within-domain reweighting and/or multistate design, it may also be possible to use this method to find sequences that stabilize particular protein conformations or binding interactions over others.
Fine tangled pili expressed by Haemophilus ducreyi are a novel class of pili.
Brentjens, R J; Ketterer, M; Apicella, M A; Spinola, S M
1996-01-01
Haemophilus ducreyi synthesizes fine, tangled pili composed predominantly of a protein whose apparent molecular weight is 24,000 (24K). A hybridoma, 2D8, produced a monoclonal antibody (MAb) that bound to a 24K protein in H. ducreyi strains isolated from diverse geographic locations. A lambda gt11 H. ducreyi library was screened with MAb 2D8. A 3.5-kb chromosomal insert from one reactive plaque was amplified and ligated into the pCRII vector. The recombinant plasmid, designated pHD24, expressed a 24K protein in Escherichia coli INV alpha F that bound MAb 2D8. The coding sequence of the 24K gene was localized by exonuclease III digestion. The insert contained a 570-bp open reading frame, designated ftpA (fine, tangled pili). Translation of ftpA predicted a polypeptide with a molecular weight of 21.1K. The predicted N-terminal amino acid sequence of the polypeptide encoded by ftpA was identical to the N-terminal amino acid sequence of purified pilin and lacked a cleavable signal sequence. Primer extension analysis of ftpA confirmed the lack of a leader peptide. The predicted amino acid sequence lacked homology to known pilin sequences but shared homology with the sequences of E. coli Dps and Treponema pallidum antigen TpF1 or 4D, proteins which associate to form ordered rings. An isogenic pilin mutant, H. ducreyi 35000ftpA::mTn3(Cm), was constructed by shuttle mutagenesis and did not contain pili when examined by electron microscopy. We conclude that H. ducreyi synthesizes fine, tangled pili that are composed of a unique major subunit, which may be exported by a signal sequence independent mechanism. PMID:8550517
The genome of flax (Linum usitatissimum) assembled de novo from short shotgun sequence reads.
Wang, Zhiwen; Hobson, Neil; Galindo, Leonardo; Zhu, Shilin; Shi, Daihu; McDill, Joshua; Yang, Linfeng; Hawkins, Simon; Neutelings, Godfrey; Datla, Raju; Lambert, Georgina; Galbraith, David W; Grassa, Christopher J; Geraldes, Armando; Cronk, Quentin C; Cullis, Christopher; Dash, Prasanta K; Kumar, Polumetla A; Cloutier, Sylvie; Sharpe, Andrew G; Wong, Gane K-S; Wang, Jun; Deyholos, Michael K
2012-11-01
Flax (Linum usitatissimum) is an ancient crop that is widely cultivated as a source of fiber, oil and medicinally relevant compounds. To accelerate crop improvement, we performed whole-genome shotgun sequencing of the nuclear genome of flax. Seven paired-end libraries ranging in size from 300 bp to 10 kb were sequenced using an Illumina genome analyzer. A de novo assembly, composed exclusively of deep-coverage (approximately 94× raw, approximately 69× filtered) short-sequence reads (44-100 bp), produced a set of scaffolds with N(50) = 694 kb, including contigs with N(50) = 20.1 kb. The contig assembly contained 302 Mb of non-redundant sequence representing an estimated 81% genome coverage. Up to 96% of published flax ESTs aligned to the whole-genome shotgun scaffolds. However, comparisons with independently sequenced BACs and fosmids showed some mis-assembly of regions at the genome scale. A total of 43384 protein-coding genes were predicted in the whole-genome shotgun assembly, and up to 93% of published flax ESTs, and 86% of A. thaliana genes aligned to these predicted genes, indicating excellent coverage and accuracy at the gene level. Analysis of the synonymous substitution rates (K(s)) observed within duplicate gene pairs was consistent with a recent (5-9 MYA) whole-genome duplication in flax. Within the predicted proteome, we observed enrichment of many conserved domains (Pfam-A) that may contribute to the unique properties of this crop, including agglutinin proteins. Together these results show that de novo assembly, based solely on whole-genome shotgun short-sequence reads, is an efficient means of obtaining nearly complete genome sequence information for some plant species. © 2012 The Authors. The Plant Journal © 2012 Blackwell Publishing Ltd.
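The scaffold and contig N(50) values quoted above follow the usual assembly-statistics definition: the length L such that sequences of length L or longer account for at least half of the assembled bases. A minimal Python sketch of that calculation is given here for orientation; the function name and the toy lengths are illustrative assumptions, not data or code from the flax study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together cover at least half of the total assembled length.
    """
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy contig lengths (made up for illustration, not flax data).
toy_contigs = [100, 80, 60, 40, 20]
toy_n50 = n50(toy_contigs)
```

Because the statistic is weighted by length, a handful of long scaffolds can dominate it, which is why scaffold and contig N(50) are normally reported separately.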
Converting Panax ginseng DNA and chemical fingerprints into two-dimensional barcode.
Cai, Yong; Li, Peng; Li, Xi-Wen; Zhao, Jing; Chen, Hai; Yang, Qing; Hu, Hao
2017-07-01
In this study, we investigated how to convert the Panax ginseng DNA sequence code and chemical fingerprints into a two-dimensional code. In order to improve the compression efficiency, GATC2Bytes and digital merger compression algorithms are proposed. HPLC chemical fingerprint data of 10 groups of P. ginseng from Northeast China and the internal transcribed spacer 2 (ITS2) sequence code as the DNA sequence code were ready for conversion. In order to convert such data into a two-dimensional code, the following six steps were performed: First, the chemical fingerprint characteristic data sets were obtained through the inflection filtering algorithm. Second, precompression processing of these data sets was undertaken. Third, precompression processing was undertaken with the P. ginseng DNA (ITS2) sequence codes. Fourth, the precompressed chemical fingerprint data and the DNA (ITS2) sequence code were combined in accordance with the set data format. Fifth, the combined data were compressed by Zlib, an open source data compression algorithm. Finally, the compressed data generated a two-dimensional code called a quick response code (QR code). Through this conversion process, the number of bytes needed for storing P. ginseng chemical fingerprints and its DNA (ITS2) sequence code can be greatly reduced. After GATC2Bytes algorithm processing, the ITS2 compression rate reaches 75%, and the chemical fingerprint compression rate exceeds 99.65% via filtration and digital merger compression algorithm processing. The overall compression ratio therefore exceeds 99.36%. The capacity of the formed QR code is around 0.5k, which can easily and successfully be read and identified by any smartphone. P. ginseng chemical fingerprints and its DNA (ITS2) sequence code can form a QR code after data processing, and therefore the QR code can be a perfect carrier of the authenticity and quality of P. ginseng information.
This study provides a theoretical basis for the development of a quality traceability system of traditional Chinese medicine based on a two-dimensional code.
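The 75% compression rate reported for the ITS2 sequence is what one would expect if each nucleotide is stored in 2 bits instead of one 8-bit character. The abstract does not describe the DNA packing algorithm itself, so the Python sketch below is only an assumed illustration of that idea, followed by a zlib pass as in the Zlib step. The base-to-bits mapping, the function names and the toy sequence are all hypothetical.

```python
import zlib

# Assumed 2-bit code per nucleotide; the paper's actual mapping is not given.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_dna(seq):
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a short final chunk so unpacking is uniform.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack_dna(data, length):
    """Inverse of pack_dna; `length` is the original number of bases."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


seq = "GATCGATCGGTA"                  # toy sequence, not real ITS2 data
packed = pack_dna(seq)
saving = 1 - len(packed) / len(seq)   # fraction of bytes saved by packing
payload = zlib.compress(packed)       # general-purpose compression pass
```

The original length has to be stored next to the packed bytes, since a final partial byte is padded. Ambiguity codes such as N do not fit a 2-bit alphabet and would need separate handling.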
Baurens, Franc-Christophe; Bocs, Stéphanie; Rouard, Mathieu; Matsumoto, Takashi; Miller, Robert N G; Rodier-Goud, Marguerite; MBéguié-A-MBéguié, Didier; Yahiaoui, Nabila
2010-07-16
Comparative sequence analysis of complex loci such as resistance gene analog clusters allows estimating the degree of sequence conservation and mechanisms of divergence at the intraspecies level. In banana (Musa sp.), two diploid wild species Musa acuminata (A genome) and Musa balbisiana (B genome) contribute to the polyploid genome of many cultivars. The M. balbisiana species is associated with vigour and tolerance to pests and disease and little is known on the genome structure and haplotype diversity within this species. Here, we compare two genomic sequences of 253 and 223 kb corresponding to two haplotypes of the RGA08 resistance gene analog locus in M. balbisiana "Pisang Klutuk Wulung" (PKW). Sequence comparison revealed two regions of contrasting features. The first is a highly colinear gene-rich region where the two haplotypes diverge only by single nucleotide polymorphisms and two repetitive element insertions. The second corresponds to a large cluster of RGA08 genes, with 13 and 18 predicted RGA genes and pseudogenes spread over 131 and 152 kb respectively on each haplotype. The RGA08 cluster is enriched in repetitive element insertions, in duplicated non-coding intergenic sequences including low complexity regions and shows structural variations between haplotypes. Although some allelic relationships are retained, a large diversity of RGA08 genes occurs in this single M. balbisiana genotype, with several RGA08 paralogs specific to each haplotype. The RGA08 gene family has evolved by mechanisms of unequal recombination, intragenic sequence exchange and diversifying selection. An unequal recombination event taking place between duplicated non-coding intergenic sequences resulted in a different RGA08 gene content between haplotypes pointing out the role of such duplicated regions in the evolution of RGA clusters. Based on the synonymous substitution rate in coding sequences, we estimated a 1 million year divergence time for these M. balbisiana haplotypes. 
A large RGA08 gene cluster identified in wild banana corresponds to a highly variable genomic region between haplotypes surrounded by conserved flanking regions. High level of sequence identity (70 to 99%) of the genic and intergenic regions suggests a recent and rapid evolution of this cluster in M. balbisiana.
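Divergence times derived from synonymous substitution rates are commonly obtained from the relation T = Ks / (2r), where Ks is the number of synonymous substitutions per synonymous site between the two sequences and r is the substitution rate per site per year. The factor 2 accounts for both lineages accumulating changes. The abstract does not state which rate or Ks values were used, so the numbers in this Python sketch are purely illustrative assumptions.

```python
def divergence_time_years(ks, rate_per_site_per_year):
    """Estimate divergence time as T = Ks / (2 * r).

    Both lineages accumulate substitutions independently,
    hence the factor of 2 in the denominator.
    """
    if rate_per_site_per_year <= 0:
        raise ValueError("rate must be positive")
    return ks / (2.0 * rate_per_site_per_year)


# Illustrative values only (assumptions, not taken from the study).
example_ks = 0.009       # synonymous substitutions per synonymous site
example_rate = 4.5e-9    # substitutions per site per year
example_time = divergence_time_years(example_ks, example_rate)
```

With these assumed inputs the estimate is on the order of one million years. The result scales linearly with Ks and inversely with the assumed rate, so the choice of rate dominates the uncertainty.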
A Third Approach to Gene Prediction Suggests Thousands of Additional Human Transcribed Regions
Glusman, Gustavo; Qin, Shizhen; El-Gewely, M. Raafat; Siegel, Andrew F; Roach, Jared C; Hood, Leroy; Smit, Arian F. A
2006-01-01
The identification and characterization of the complete ensemble of genes is a main goal of deciphering the digital information stored in the human genome. Many algorithms for computational gene prediction have been described, ultimately derived from two basic concepts: (1) modeling gene structure and (2) recognizing sequence similarity. Successful hybrid methods combining these two concepts have also been developed. We present a third orthogonal approach to gene prediction, based on detecting the genomic signatures of transcription, accumulated over evolutionary time. We discuss four algorithms based on this third concept: Greens and CHOWDER, which quantify mutational strand biases caused by transcription-coupled DNA repair, and ROAST and PASTA, which are based on strand-specific selection against polyadenylation signals. We combined these algorithms into an integrated method called FEAST, which we used to predict the location and orientation of thousands of putative transcription units not overlapping known genes. Many of the newly predicted transcriptional units do not appear to code for proteins. The new algorithms are particularly apt at detecting genes with long introns and lacking sequence conservation. They therefore complement existing gene prediction methods and will help identify functional transcripts within many apparent “genomic deserts.” PMID:16543943
Fernández-Cancio, Mónica; Nistal, Manuel; Gracia, Ricardo; Molina, M Antonia; Tovar, Juan Antonio; Esteban, Cristina; Carrascosa, Antonio; Audí, Laura
2004-01-01
The goal of this study was to perform 5-alpha-reductase type 2 gene (SRD5A2) analysis in a male pseudohermaphrodite (MPH) patient with normal testosterone (T) production and normal androgen receptor (AR) gene coding sequences. A patient of Chinese origin with ambiguous genitalia at 14 months, a 46,XY karyotype, and normal T secretion under human chorionic gonadotropin (hCG) stimulation underwent a gonadectomy at 20 months. Exons 1-8 of the AR gene and exons 1-5 of the SRD5A2 gene were sequenced from peripheral blood DNA. AR gene coding sequences were normal. SRD5A2 gene analysis revealed 2 consecutive mutations in exon 4, each located in a different allele: 1) a T nucleotide deletion, which predicts a frameshift mutation from codon 219, and 2) a missense mutation at codon 227, where the substitution of guanine (CGA) by adenine (CAA) predicts a glutamine replacement of arginine (R227Q). Testes located in the inguinal canal showed a normal morphology for age. The patient was a compound heterozygote for SRD5A2 mutations, carrying 2 mutations in exon 4. The patient showed an R227Q mutation that has been described in an Asian population and MPH patients, along with a novel frameshift mutation, Tdel219. Testis morphology showed that, during early infancy, the 5-alpha-reductase enzyme deficiency may not have affected interstitial or tubular development.
Tang, Jia-Min; Li, Fen; Cheng, Tian-Yin; Duan, De-Yong; Liu, Guo-Hua
2018-05-22
The sheep ked Melophagus ovinus is mainly found in Europe, Northwestern Africa, and Asia. Although M. ovinus is an important ectoparasite of sheep in many countries, the population genetics, molecular biology, and systematics of this ectoparasite remain poorly understood. Herein, we determined the mitochondrial (mt) genome of M. ovinus from Gansu Province, China (MOG) and compared it with that of M. ovinus from Xinjiang Uygur Autonomous Region, China (MOX). The mt genome sequence (15,044 bp) of M. ovinus MOG was significantly shorter (by 529 bp) than that of M. ovinus MOX. The nucleotide sequence difference across the whole mt genome, excluding the non-coding region, was 0.37% between M. ovinus MOG and MOX. For the 13 protein-coding genes, comparison revealed sequence divergences at both the nucleotide (0-1.1%) and amino acid (0-0.59%) levels between M. ovinus MOG and MOX. Interestingly, the cox1 gene of M. ovinus MOX is predicted to employ an unusual mt start codon, AAA, which has not been predicted previously for any parasite genome. Phylogenetic analyses showed that M. ovinus (Hippoboscoidea) is related to the superfamilies Oestroidea + Muscoidea. Our results also indicated the paraphyly of four families (Anthomyiidae, Calliphoridae, Muscidae, and Oestridae) and two superfamilies (Oestroidea and Muscoidea). This mt genome of M. ovinus provides useful molecular markers for studies of the population genetics, molecular biology, and systematics of this ectoparasite.
Non-coding RNAs in lung cancer
Ricciuti, Biagio; Mecca, Carmen; Crinò, Lucio; Baglivo, Sara; Cenci, Matteo; Metro, Giulio
2014-01-01
The discovery that protein-coding genes represent less than 2% of the human genome, and the evidence that more than 90% of it is actively transcribed, changed the classical view of the central dogma of molecular biology, which had been based on the assumption that RNA functions mainly as an intermediate bridge between DNA sequences and the protein synthesis machinery. Accumulating data indicate that non-coding RNAs are involved in different physiological processes, providing for the maintenance of cellular homeostasis. They are important regulators of gene expression, cellular differentiation, proliferation, migration, apoptosis, and stem cell maintenance. Alterations and disruptions of their expression or activity have increasingly been associated with pathological changes of cancer cells. This evidence, together with the prospect of using these molecules as diagnostic markers and therapeutic targets, currently makes non-coding RNAs among the most relevant molecules in cancer research. In this paper we provide an overview of non-coding RNA function and disruption in lung cancer biology, also focusing on their potential as diagnostic, prognostic and predictive biomarkers. PMID:25593996
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ansong, Charles; Tolic, Nikola; Purvine, Samuel O.
Complete and accurate genome annotation is crucial for comprehensive and systematic studies of biological systems. For example, systems biology-oriented genome-scale modeling efforts greatly benefit from accurate annotation of protein-coding genes to develop proper functioning models. However, determining protein-coding genes for most new genomes is almost completely performed by inference, using computational predictions with significant documented error rates (> 15%). Furthermore, gene prediction programs provide no information on biologically important post-translational processing events critical for protein function. With the ability to directly measure peptides arising from expressed proteins, mass spectrometry-based proteomics approaches can be used to augment and verify coding regions of a genomic sequence and, importantly, detect post-translational processing events. In this study we utilized “shotgun” proteomics to guide accurate primary genome annotation of the bacterial pathogen Salmonella Typhimurium 14028 to facilitate a systems-level understanding of Salmonella biology. The data provide protein-level experimental confirmation for 44% of predicted protein-coding genes, suggest revisions to 48 genes assigned incorrect translational start sites, and uncover 13 non-annotated genes missed by gene prediction programs. We also present a comprehensive analysis of post-translational processing events in Salmonella, revealing a wide range of complex chemical modifications (70 distinct modifications) and confirming more than 130 signal peptide and N-terminal methionine cleavage events in Salmonella. This study highlights several ways in which proteomics data applied during the primary stages of annotation can improve the quality of genome annotations, especially with regard to the annotation of mature protein products.
Piccolo, Lidia Del; Finset, Arnstein; Mellblom, Anneli V; Figueiredo-Braga, Margarida; Korsvold, Live; Zhou, Yuefang; Zimmermann, Christa; Humphris, Gerald
2017-12-01
To discuss the theoretical and empirical framework of VR-CoDES and potential future direction in research based on the coding system. The paper is based on selective review of papers relevant to the construction and application of VR-CoDES. VR-CoDES system is rooted in patient-centered and biopsychosocial model of healthcare consultations and on a functional approach to emotion theory. According to the VR-CoDES, emotional interaction is studied in terms of sequences consisting of an eliciting event, an emotional expression by the patient and the immediate response by the clinician. The rationale for the emphasis on sequences, on detailed classification of cues and concerns, and on the choices of explicit vs. non-explicit responses and providing vs. reducing room for further disclosure, as basic categories of the clinician responses, is described. Results from research on VR-CoDES may help raise awareness of emotional sequences. Future directions in applying VR-CoDES in research may include studies on predicting patient and clinician behavior within the consultation, qualitative analyses of longer sequences including several VR-CoDES triads, and studies of effects of emotional communication on health outcomes. VR-CoDES may be applied to develop interventions to promote good handling of patients' emotions in healthcare encounters. Copyright © 2017 Elsevier B.V. All rights reserved.
Remus-Emsermann, Mitja N P; Schmid, Michael; Gekenidis, Maria-Theresia; Pelludat, Cosima; Frey, Jürg E; Ahrens, Christian H; Drissner, David
2016-01-01
Pseudomonas citronellolis is a Gram-negative, motile gammaproteobacterium belonging to the order Pseudomonadales and the family Pseudomonadaceae. We isolated strain P3B5 from the phyllosphere of basil plants (Ocimum basilicum L.). Here we describe the physiology of this microorganism, its full genome sequence, and detailed annotation. The 6.95 Mbp genome contains 6071 predicted protein coding sequences and 96 RNA coding sequences. P. citronellolis has been the subject of many studies including the investigation of long-chain aliphatic compounds and terpene degradation. Plant leaves are covered by long-chain aliphates making up a waxy layer that is associated with the leaf cuticle. In addition, basil leaves are known to contain high amounts of terpenoid substances, hinting at a potential nutrient niche that might be exploited by P. citronellolis. Furthermore, the isolated strain exhibited resistance to several antibiotics. To evaluate the potential of this strain as a source of transferable antibiotic resistance genes on raw consumed herbs, we therefore investigated whether those resistances are encoded on mobile genetic elements. The availability of the genome will be helpful for comparative genomics of the phylogenetically broad pseudomonads, in particular with the sequence of the P. citronellolis type strain PRJDB205 not yet publicly available. The genome is discussed with respect to a phyllosphere-related lifestyle, aliphate and terpenoid degradation, and antibiotic resistance.
GuiTope: an application for mapping random-sequence peptides to protein sequences.
Halperin, Rebecca F; Stafford, Phillip; Emery, Jack S; Navalkar, Krupa Arun; Johnston, Stephen Albert
2012-01-03
Random-sequence peptide libraries are a commonly used tool to identify novel ligands for binding antibodies, other proteins, and small molecules. It is often of interest to compare the selected peptide sequences to the natural protein binding partners to infer the exact binding site or the importance of particular residues. The ability to search a set of sequences for similarity to a set of peptides may sometimes enable the prediction of an antibody epitope or a novel binding partner. We have developed a software application designed specifically for this task. GuiTope provides a graphical user interface for aligning peptide sequences to protein sequences. All alignment parameters are accessible to the user including the ability to specify the amino acid frequency in the peptide library; these frequencies often differ significantly from those assumed by popular alignment programs. It also includes a novel feature to align di-peptide inversions, which we have found improves the accuracy of antibody epitope prediction from peptide microarray data and shows utility in analyzing phage display datasets. Finally, GuiTope can randomly select peptides from a given library to estimate a null distribution of scores and calculate statistical significance. GuiTope provides a convenient method for comparing selected peptide sequences to protein sequences, including flexible alignment parameters, novel alignment features, ability to search a database, and statistical significance of results. The software is available as an executable (for PC) at http://www.immunosignature.com/software and ongoing updates and source code will be available at sourceforge.net.
2012-01-01
Background: MicroRNAs (miRNAs) are one of the functional non-coding small RNAs involved in the epigenetic control of the plant genome. Although plants contain both evolutionary conserved miRNAs and species-specific miRNAs within their genomes, computational methods often only identify evolutionary conserved miRNAs. The recent sequencing of the Brassica rapa genome enables us to identify miRNAs and their putative target genes. In this study, we sought to provide a more comprehensive prediction of B. rapa miRNAs based on high throughput small RNA deep sequencing. Results: We sequenced small RNAs from five types of tissue: seedlings, roots, petioles, leaves, and flowers. By analyzing 2.75 million unique reads that mapped to the B. rapa genome, we identified 216 novel and 196 conserved miRNAs that were predicted to target approximately 20% of the genome’s protein coding genes. Quantitative analysis of miRNAs from the five types of tissue revealed that novel miRNAs were expressed in diverse tissues but their expression levels were lower than those of the conserved miRNAs. Comparative analysis of the miRNAs between the B. rapa and Arabidopsis thaliana genomes demonstrated that redundant copies of conserved miRNAs in the B. rapa genome may have been deleted after whole genome triplication. Novel miRNA members seemed to have spontaneously arisen from the B. rapa and A. thaliana genomes, suggesting the species-specific expansion of miRNAs. We have made this data publicly available in a miRNA database of B. rapa called BraMRs. The database allows the user to retrieve miRNA sequences, their expression profiles, and a description of their target genes from the five tissue types investigated here. Conclusions: This is the first report to identify novel miRNAs from Brassica crops using genome-wide high throughput techniques. The combination of computational methods and small RNA deep sequencing provides robust predictions of miRNAs in the genome.
The finding of numerous novel miRNAs, many with few target genes and low expression levels, suggests the rapid evolution of miRNA genes. The development of a miRNA database, BraMRs, enables us to integrate miRNA identification, target prediction, and functional annotation of target genes. BraMRs will represent a valuable public resource with which to study the epigenetic control of B. rapa and other closely related Brassica species. The database is available at the following link: http://bramrs.rna.kr [1]. PMID:23163954
Approaches to Fungal Genome Annotation
Haas, Brian J.; Zeng, Qiandong; Pearson, Matthew D.; Cuomo, Christina A.; Wortman, Jennifer R.
2011-01-01
Fungal genome annotation is the starting point for analysis of genome content. This generally involves the application of diverse methods to identify features on a genome assembly such as protein-coding and non-coding genes, repeats and transposable elements, and pseudogenes. Here we describe tools and methods leveraged for eukaryotic genome annotation with a focus on the annotation of fungal nuclear and mitochondrial genomes. We highlight the application of the latest technologies and tools to improve the quality of predicted gene sets. The Broad Institute eukaryotic genome annotation pipeline is described as one example of how such methods and tools are integrated into a sequencing center’s production genome annotation environment. PMID:22059117
Lua, Rhonald C; Wilson, Stephen J; Konecki, Daniel M; Wilkins, Angela D; Venner, Eric; Morgan, Daniel H; Lichtarge, Olivier
2016-01-04
The structure and function of proteins underlie most aspects of biology and their mutational perturbations often cause disease. To identify the molecular determinants of function as well as targets for drugs, it is central to characterize the important residues and how they cluster to form functional sites. The Evolutionary Trace (ET) achieves this by ranking the functional and structural importance of the protein sequence positions. ET uses evolutionary distances to estimate functional distances and correlates genotype variations with those in the fitness phenotype. Thus, ET ranks are worse for sequence positions that vary among evolutionarily closer homologs but better for positions that vary mostly among distant homologs. This approach identifies functional determinants, predicts function, guides the mutational redesign of functional and allosteric specificity, and interprets the action of coding sequence variations in proteins, people and populations. Now, the UET database offers pre-computed ET analyses for the protein structure databank, and on-the-fly analysis of any protein sequence. A web interface retrieves ET rankings of sequence positions and maps results to a structure to identify functionally important regions. This UET database integrates several ways of viewing the results on the protein sequence or structure and can be found at http://mammoth.bcm.tmc.edu/uet/. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
Gritsun, T S; Venugopal, K; Zanotto, P M; Mikhailov, M V; Sall, A A; Holmes, E C; Polkinghorne, I; Frolova, T V; Pogodina, V V; Lashkevich, V A; Gould, E A
1997-05-01
The complete nucleotide sequences of two tick-transmitted flaviviruses, Vasilchenko (Vs) from Siberia and louping ill (LI) from the UK, have been determined. The genomes were 10928 and 10871 nucleotides (nt) in length, respectively. The coding strategy and functional protein sequence motifs of tick-borne flaviviruses are presented in both Vs and LI viruses. The phylogenies based on maximum likelihood, maximum parsimony and distance analysis of the polyproteins identified Vs virus as a member of the tick-borne encephalitis virus subgroup within the tick-borne serocomplex, genus Flavivirus, family Flaviviridae. Comparative alignment of the 3'-untranslated regions revealed deletions of different lengths at essentially the same position downstream of the stop codon for all tick-borne viruses. Two direct 27 nucleotide repeats at the 3'-end were found only for Vs and LI virus. Immediately following the deletions, a region of 332-334 nt with relatively conserved primary structure (67-94% identity) was observed at the 3'-non-coding end of the virus genome. Pairwise comparisons of the nucleotide sequence data revealed similar levels of variation between the coding region and the 5' and 3'-termini of the genome, implying an equivalently strong selective control for translated and untranslated regions. Indeed, the predicted folding of the 5' and 3'-untranslated regions revealed patterns of stem and loop structures conserved for all tick-borne flaviviruses, suggesting a purifying selection for preservation of essential RNA secondary structures which could be involved in translational control and replication. The possible implications of these findings are discussed.
Deciphering mRNA Sequence Determinants of Protein Production Rate
NASA Astrophysics Data System (ADS)
Szavits-Nossan, Juraj; Ciandrini, Luca; Romano, M. Carmen
2018-03-01
One of the greatest challenges in biophysical models of translation is to identify coding sequence features that affect the rate of translation and therefore the overall protein production in the cell. We propose an analytic method to solve a translation model based on the inhomogeneous totally asymmetric simple exclusion process, which allows us to unveil simple design principles of nucleotide sequences determining protein production rates. Our solution shows an excellent agreement when compared to numerical genome-wide simulations of S. cerevisiae transcript sequences and predicts that the first 10 codons, which is the ribosome footprint length on the mRNA, together with the value of the initiation rate, are the main determinants of protein production rate under physiological conditions. Finally, we interpret the obtained analytic results based on the evolutionary role of the codons' choice for regulating translation rates and ribosome densities.
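For readers unfamiliar with the model class, the Python sketch below runs a small stochastic (Gillespie) simulation of an inhomogeneous totally asymmetric simple exclusion process with extended particles. It is not the authors' analytic method. The function name, the rate values, the 50-codon toy transcript and the exclusion rule (ribosome positions must differ by at least one footprint) are all assumptions made for illustration.

```python
import random


def simulate_tasep(hop_rates, init_rate, footprint=10, t_max=2000.0, seed=0):
    """Gillespie simulation of an inhomogeneous TASEP with extended particles.

    hop_rates[i] is the rate of moving from codon i to codon i + 1; the
    last entry is the termination rate. Returns proteins per unit time.
    """
    rng = random.Random(seed)
    n = len(hop_rates)
    ribosomes = []   # positions of ribosomes, index 0 is furthest along
    time, produced = 0.0, 0
    while time < t_max:
        # Enabled events as (rate, ribosome index); index -1 is initiation.
        events = []
        if not ribosomes or ribosomes[-1] >= footprint:
            events.append((init_rate, -1))
        for k, x in enumerate(ribosomes):
            if x == n - 1 or k == 0 or ribosomes[k - 1] - (x + 1) >= footprint:
                events.append((hop_rates[x], k))
        total = sum(rate for rate, _ in events)
        time += rng.expovariate(total)
        # Choose one event with probability proportional to its rate.
        pick = rng.uniform(0.0, total)
        for rate, k in events:
            pick -= rate
            if pick <= 0.0:
                break
        if k == -1:
            ribosomes.append(0)          # a new ribosome initiates
        elif ribosomes[k] == n - 1:
            ribosomes.pop(k)             # termination releases a protein
            produced += 1
        else:
            ribosomes[k] += 1            # elongation by one codon
    return produced / t_max


# Toy transcripts (assumed rates): uniform codons vs. a slow first 10 codons.
fast_start = simulate_tasep([10.0] * 50, init_rate=0.1)
slow_start = simulate_tasep([1.0] * 10 + [10.0] * 40, init_rate=0.1)
```

In this toy setting, slowing only the first ten codons lowers the production rate, because a ribosome that lingers near the start codon blocks the next initiation. That is qualitatively in line with the role the abstract assigns to the first codons and the initiation rate.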
Langner, Robert; Sternkopf, Melanie A; Kellermann, Tanja S; Grefkes, Christian; Kurth, Florian; Schneider, Frank; Zilles, Karl; Eickhoff, Simon B
2014-07-01
The neurobiological organization of action-oriented working memory is not well understood. To elucidate the neural correlates of translating visuo-spatial stimulus sequences into delayed (memory-guided) sequential actions, we measured brain activity using functional magnetic resonance imaging while participants encoded sequences of four to seven dots appearing on fingers of a left or right schematic hand. After variable delays, sequences were to be reproduced with the corresponding fingers. Recall became less accurate with longer sequences and was initiated faster after long delays. Across both hands, encoding and recall activated bilateral prefrontal, premotor, superior and inferior parietal regions as well as the basal ganglia, whereas hand-specific activity was found (albeit to a lesser degree during encoding) in contralateral premotor, sensorimotor, and superior parietal cortex. Activation differences after long versus short delays were restricted to motor-related regions, indicating that rehearsal during long delays might have facilitated the conversion of the memorandum into concrete motor programs at recall. Furthermore, basal ganglia activity during encoding selectively predicted correct recall. Taken together, the results suggest that to-be-reproduced visuo-spatial sequences are encoded as prospective action representations (motor intentions), possibly in addition to retrospective sensory codes. Overall, our study supports and extends multi-component models of working memory, highlighting the notion that sensory input can be coded in multiple ways depending on what the memorandum is to be used for. Copyright © 2013 Wiley Periodicals, Inc.
Perceptually tuned low-bit-rate video codec for ATM networks
NASA Astrophysics Data System (ADS)
Chou, Chun-Hsien
1996-02-01
In order to maintain high visual quality in transmitting low bit-rate video signals over asynchronous transfer mode (ATM) networks, a layered coding scheme that incorporates the human visual system (HVS), motion compensation (MC), and conditional replenishment (CR) is presented in this paper. An empirical perceptual model is proposed to estimate the spatio-temporal just-noticeable distortion (STJND) profile for each frame, by which perceptually important (PI) prediction-error signals can be located. Because of the limited channel capacity of the base layer, only coded data of motion vectors, the PI signals within a small strip of the prediction-error image and, if there are remaining bits, the PI signals outside the strip are transmitted by the cells of the base-layer channel. The rest of the coded data are transmitted by the second-layer cells which may be lost due to channel error or network congestion. Simulation results show that visual quality of the reconstructed CIF sequence is acceptable when the capacity of the base-layer channel is allocated with 2 × 64 kbps and the cells of the second layer are all lost.
Is a Genome a Codeword of an Error-Correcting Code?
Kleinschmidt, João H.; Silva-Filho, Márcio C.; Bim, Edson; Herai, Roberto H.; Yamagishi, Michel E. B.; Palazzo, Reginaldo
2012-01-01
Since a genome is a discrete sequence, the elements of which belong to a set of four letters, the question as to whether or not there is an error-correcting code underlying DNA sequences is unavoidable. The most common approach to answering this question is to propose a methodology to verify the existence of such a code. However, none of the methodologies proposed so far, although quite clever, has achieved that goal. In a recent work, we showed that DNA sequences can be identified as codewords in a class of cyclic error-correcting codes known as Hamming codes. In this paper, we show that a complete intron-exon gene, and even a plasmid genome, can be identified as a Hamming code codeword as well. Although this does not constitute a definitive proof that there is an error-correcting code underlying DNA sequences, it is the first evidence in this direction. PMID:22649495
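As an illustration of what "being a codeword" means in the abstract above, the following minimal sketch tests membership in the binary Hamming(7,4) code by computing a syndrome. The binary alphabet, the particular parity-check matrix, and the function names are assumptions made for illustration only; the abstract does not state the code parameters or the nucleotide-to-symbol mapping the authors used for DNA.

```python
# Parity-check matrix of the binary Hamming(7,4) code (illustrative choice):
# column j (1-based) is the binary expansion of j, so a single-bit error at
# position j yields a syndrome that spells out j in binary.
H = [
    [0, 0, 0, 1, 1, 1, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [1, 0, 1, 0, 1, 0, 1],
]


def syndrome(word):
    """Return H * word (mod 2) as a 3-tuple."""
    return tuple(sum(h * w for h, w in zip(row, word)) % 2 for row in H)


def is_codeword(word):
    """A 7-bit word is a codeword exactly when its syndrome is zero."""
    return syndrome(word) == (0, 0, 0)


def correct_single_error(word):
    """Flip the bit named by the syndrome, if the syndrome is nonzero."""
    s = syndrome(word)
    position = s[0] * 4 + s[1] * 2 + s[2]  # 0 means "no error detected"
    fixed = list(word)
    if position:
        fixed[position - 1] ^= 1
    return fixed


clean = [1, 1, 1, 0, 0, 0, 0]
noisy = [1, 1, 1, 0, 1, 0, 0]  # bit 5 flipped
print(is_codeword(clean), is_codeword(noisy), correct_single_error(noisy))
```

A sequence that differs from a codeword in one position is mapped back to that codeword, which is the sense in which such a code is "error-correcting".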
MINS2: revisiting the molecular code for transmembrane-helix recognition by the Sec61 translocon.
Park, Yungki; Helms, Volkhard
2008-08-15
To be fully functional, membrane proteins should not only fold, but also get inserted into the membrane, which is mediated by the Sec61 translocon. Recent experimental studies have attempted to elucidate how the Sec61 translocon accomplishes this delicate task by measuring the translocon-mediated membrane insertion free energies of 357 systematically designed peptides. On the basis of this data set, we have developed MINS2, a novel sequence-based computational method for predicting the membrane insertion free energies of protein sequences. A benchmark analysis of MINS2 shows that MINS2 significantly outperforms previously proposed methods. Importantly, the application of MINS2 to known membrane protein structures shows that a better prediction of membrane insertion free energies does not lead to a better prediction of transmembrane segments of polytopic membrane proteins. A web server for MINS2 is publicly available at http://service.bioinformatik.uni-saarland.de/mins. Supplementary data are available at Bioinformatics online.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zhou, En -Min; Murugapiran, Senthil K.; Mefferd, Chrisabelle C.; ...
2016-02-27
Thermus amyloliquefaciens type strain YIM 77409 T is a thermophilic, Gram-negative, non-motile and rod-shaped bacterium isolated from Niujie Hot Spring in Eryuan County, Yunnan Province, southwest China. In the present study we describe the features of strain YIM 77409 T together with its genome sequence and annotation. The genome is 2,160,855 bp long and consists of 6 scaffolds with 67.4% average GC content. A total of 2,313 genes were predicted, comprising 2,257 protein-coding and 56 RNA genes. The genome is predicted to encode a complete glycolysis, pentose phosphate pathway, and tricarboxylic acid cycle. Additionally, a large number of transporters and enzymes for heterotrophy highlight the broad heterotrophic lifestyle of this organism. Furthermore, a denitrification gene cluster included genes predicted to encode enzymes for the sequential reduction of nitrate to nitrous oxide, consistent with the incomplete denitrification phenotype of this strain.
NASA Astrophysics Data System (ADS)
Weigt, Martin
Over recent years, biological research has been revolutionized by experimental high-throughput techniques, in particular by next-generation sequencing technology. Unprecedented amounts of data are accumulating, and there is a growing demand for computational methods unveiling the information hidden in raw data, thereby increasing our understanding of complex biological systems. Statistical-physics models based on the maximum-entropy principle have, in the last few years, played an important role in this context. To give a specific example, proteins and many non-coding RNA show a remarkable degree of structural and functional conservation in the course of evolution, despite a large variability in amino acid sequences. We have developed a statistical-mechanics inspired inference approach, called Direct-Coupling Analysis, to link this sequence variability (easy to observe in sequence alignments, which are available in public sequence databases) to bio-molecular structure and function. In my presentation I will show how this methodology can be used (i) to infer contacts between residues and thus to guide tertiary and quaternary protein structure prediction and RNA structure prediction, (ii) to discriminate interacting from non-interacting protein families, and thus to infer conserved protein-protein interaction networks, and (iii) to reconstruct mutational landscapes and thus to predict the phenotypic effect of mutations. References [1] M. Figliuzzi, H. Jacquier, A. Schug, O. Tenaillon and M. Weigt ''Coevolutionary landscape inference and the context-dependence of mutations in beta-lactamase TEM-1'', Mol. Biol. Evol. (2015), doi: 10.1093/molbev/msv211 [2] E. De Leonardis, B. Lutz, S. Ratz, S. Cocco, R. Monasson, A. Schug, M. Weigt ''Direct-Coupling Analysis of nucleotide coevolution facilitates RNA secondary and tertiary structure prediction'', Nucleic Acids Research (2015), doi: 10.1093/nar/gkv932 [3] F. Morcos, A. Pagnani, B. Lunt, A. Bertolino, D. Marks, C. Sander, R. Zecchina, J.N. Onuchic, T. Hwa, M. Weigt, ''Direct-coupling analysis of residue co-evolution captures native contacts across many protein families'', Proc. Natl. Acad. Sci. 108, E1293-E1301 (2011).
Kjaersgård, I V; Jespersen, H M; Rasmussen, S K; Welinder, K G
1997-03-01
cDNA clones encoding two new Arabidopsis thaliana peroxidases, ATP 1a and ATP 2a, have been identified by searching the Arabidopsis database of expressed sequence tags (dbEST). They represent a novel branch of hitherto uncharacterized plant peroxidases which is only 35% identical in amino acid sequence to the well characterized group of basic plant peroxidases represented by the horseradish (Armoracia rusticana) isoperoxidases HRP C, HRP E5 and the similar Arabidopsis isoperoxidases ATP Ca, ATP Cb, and ATP Ea. However, ATP 1a is 87% identical in amino acid sequence to a peroxidase encoded by an mRNA isolated from cotton (Gossypium hirsutum). As cotton and Arabidopsis belong to rather diverse families (Malvaceae and Cruciferae, respectively), in contrast with Arabidopsis and horseradish (both Cruciferae), the high degree of sequence identity indicates that this novel type of peroxidase, albeit of unknown function, is likely to be widespread in plant species. The atp 1 and atp 2 types of cDNA sequences were the most redundant among the 28 different isoperoxidases identified among about 200 peroxidase encoding ESTs. Interestingly, 8 of the 38 EST sequences coding for ATP 1 showed three identical nucleotide substitutions. This variant form is designated ATP 1b. Similarly, 6 of the 16 EST sequences coding for ATP 2 showed a number of deletions and nucleotide changes. This variant form is designated ATP 2b. The selected EST clones are full-length and contain coding regions of 993 nucleotides for atp 1a, and 984 nucleotides for atp 2a. These regions show 61% DNA sequence identity. The predicted mature proteins ATP 1a and ATP 2a are 57% identical in sequence and contain the structurally and functionally important residues characteristic of the plant peroxidase superfamily. 
However, they do show two differences of importance to peroxidase catalysis: (1) the asparagine residue linked with the active site distal histidine via hydrogen bonding is absent; (2) an N-glycosylation site is located right at the entrance to the heme channel. The reverse transcriptase polymerase chain reaction (RT-PCR) was used to identify mRNAs coding for ATP 1a/b and ATP 2a/b in germinating seeds, seedlings, roots, leaves, stems, flowers and cell suspension culture using elongation factor 1alpha (EF-1alpha) for the first time as a positive control. Both mRNAs were transcribed at levels comparable to EF-1alpha in all plant tissues investigated which were more than two days old, and in cell suspension culture. In addition, the mRNA coding for ATP 1a/b was found in two day old germinating seeds. The abundant transcription of ATP 1a/b and ATP 2a/b is in line with their many entries in dbEST, and indicates essential roles for these novel peroxidases.
Efficient pairwise RNA structure prediction using probabilistic alignment constraints in Dynalign
2007-01-01
Background Joint alignment and secondary structure prediction of two RNA sequences can significantly improve the accuracy of the structural predictions. Methods addressing this problem, however, are forced to employ constraints that reduce computation by restricting the alignments and/or structures (i.e. folds) that are permissible. In this paper, a new methodology is presented for the purpose of establishing alignment constraints based on nucleotide alignment and insertion posterior probabilities. Using a hidden Markov model, posterior probabilities of alignment and insertion are computed for all possible pairings of nucleotide positions from the two sequences. These alignment and insertion posterior probabilities are additively combined to obtain probabilities of co-incidence for nucleotide position pairs. A suitable alignment constraint is obtained by thresholding the co-incidence probabilities. The constraint is integrated with Dynalign, a free energy minimization algorithm for joint alignment and secondary structure prediction. The resulting method is benchmarked against the previous version of Dynalign and against other programs for pairwise RNA structure prediction. Results The proposed technique eliminates manual parameter selection in Dynalign and provides significant computational time savings in comparison to prior constraints in Dynalign while simultaneously providing a small improvement in the structural prediction accuracy. Savings are also realized in memory. In experiments over a 5S RNA dataset with average sequence length of approximately 120 nucleotides, the method reduces computation by a factor of 2. The method performs favorably in comparison to other programs for pairwise RNA structure prediction, yielding better accuracy, on average, and requiring significantly fewer computational resources. 
Conclusion Probabilistic analysis can be utilized in order to automate the determination of alignment constraints for pairwise RNA structure prediction methods in a principled fashion. These constraints can reduce the computational and memory requirements of these methods while maintaining or improving their accuracy of structural prediction. This extends the practical reach of these methods to longer length sequences. The revised Dynalign code is freely available for download. PMID:17445273
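The constraint construction described in this abstract (add the alignment and insertion posteriors for each position pair, then threshold) can be sketched as follows. This is a minimal illustration, not the Dynalign code: the three input matrices, the function name, and the threshold value of 0.5 are all hypothetical, and in the real method the posteriors would come from the hidden Markov model rather than being typed in by hand.

```python
def coincidence_constraint(p_align, p_ins1, p_ins2, threshold):
    """Return a boolean matrix: True where positions (i, j) may co-occur.

    p_align[i][j]: posterior that position i of sequence 1 aligns to j of sequence 2.
    p_ins1[i][j], p_ins2[i][j]: posteriors of the two insertion states at (i, j).
    All three inputs and the threshold are assumptions for illustration.
    """
    allowed = []
    for row_a, row_1, row_2 in zip(p_align, p_ins1, p_ins2):
        allowed.append(
            [(a + x + y) >= threshold for a, x, y in zip(row_a, row_1, row_2)]
        )
    return allowed


# Toy 2 x 2 posteriors (made-up numbers).
p_align = [[0.90, 0.01], [0.02, 0.85]]
p_ins1 = [[0.05, 0.00], [0.01, 0.10]]
p_ins2 = [[0.03, 0.00], [0.00, 0.02]]
mask = coincidence_constraint(p_align, p_ins1, p_ins2, threshold=0.5)
print(mask)
```

A dynamic-programming folding routine would then skip every position pair whose mask entry is False, which is where the time and memory savings come from.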
Chen, Yan ping; Pettis, Jeffery S; Zhao, Yan; Liu, Xinyue; Tallon, Luke J; Sadzewicz, Lisa D; Li, Renhua; Zheng, Huoqing; Huang, Shaokang; Zhang, Xuan; Hamilton, Michele C; Pernal, Stephen F; Melathopoulos, Andony P; Yan, Xianghe; Evans, Jay D
2013-07-05
The microsporidia parasite Nosema contributes to the steep global decline of honey bees that are critical pollinators of food crops. There are two species of Nosema that have been found to infect honey bees, Nosema apis and N. ceranae. Genome sequencing of N. apis and comparative genome analysis with N. ceranae, a fully sequenced microsporidia species, reveal novel insights into host-parasite interactions underlying the parasite infections. We applied the whole-genome shotgun sequencing approach to sequence and assemble the genome of N. apis which has an estimated size of 8.5 Mbp. We predicted 2,771 protein-coding genes and predicted the function of each putative protein using the Gene Ontology. The comparative genomic analysis led to identification of 1,356 orthologs that are conserved between the two Nosema species and genes that are unique characteristics of the individual species, thereby providing a list of virulence factors and new genetic tools for studying host-parasite interactions. We also identified a highly abundant motif in the upstream promoter regions of N. apis genes. This motif is also conserved in N. ceranae and other microsporidia species and likely plays a role in gene regulation across the microsporidia. The availability of the N. apis genome sequence is a significant addition to the rapidly expanding body of microsporidian genomic data which has been improving our understanding of eukaryotic genome diversity and evolution in a broad sense. The predicted virulence genes and transcriptional regulatory elements are potential targets for innovative therapeutics to break down the life cycle of the parasite.
Gupta, Sonal; Nawaz, Kashif; Parween, Sabiha; Roy, Riti; Sahu, Kamlesh; Kumar Pole, Anil; Khandal, Hitaishi; Srivastava, Rishi; Kumar Parida, Swarup; Chattopadhyay, Debasis
2017-02-01
Cicer reticulatum L. is the wild progenitor of the fourth most important legume crop chickpea (C. arietinum L.). We assembled short-read sequences into 416 Mb draft genome of C. reticulatum and anchored 78% (327 Mb) of this assembly to eight linkage groups. Genome annotation predicted 25,680 protein-coding genes covering more than 90% of predicted gene space. The genome assembly shared a substantial synteny and conservation of gene orders with the genome of the model legume Medicago truncatula. Resistance gene homologs of wild and domesticated chickpeas showed high sequence homology and conserved synteny. Comparison of gene sequences and nucleotide diversity using 66 wild and domesticated chickpea accessions suggested that the desi type chickpea was genetically closer to the wild species than the kabuli type. Comparative analyses predicted gene flow between the wild and the cultivated species during domestication. Molecular diversity and population genetic structure determination using 15,096 genome-wide single nucleotide polymorphisms revealed an admixed domestication pattern among cultivated (desi and kabuli) and wild chickpea accessions belonging to three population groups reflecting significant influence of parentage or geographical origin for their cultivar-specific population classification. The assembly and the polymorphic sequence resources presented here would facilitate the study of chickpea domestication and targeted use of wild Cicer germplasms for agronomic trait improvement in chickpea. © The Author 2016. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.
Garcia-Reyero, Natàlia; Griffitt, Robert J.; Liu, Li; Kroll, Kevin J.; Farmerie, William G.; Barber, David S.; Denslow, Nancy D.
2009-01-01
A novel custom microarray for largemouth bass (Micropterus salmoides) was designed with sequences obtained from a normalized cDNA library using the 454 Life Sciences GS-20 pyrosequencer. This approach yielded in excess of 58 million bases of high-quality sequence. The sequence information was combined with 2,616 reads obtained by traditional suppressive subtractive hybridizations to derive a total of 31,391 unique sequences. Annotation and coding sequences were predicted for these transcripts where possible. 16,350 annotated transcripts were selected as target sequences for the design of the custom largemouth bass oligonucleotide microarray. The microarray was validated by examining the transcriptomic response in male largemouth bass exposed to 17β-œstradiol. Transcriptomic responses were assessed in liver and gonad, and indicated gene expression profiles typical of exposure to œstradiol. The results demonstrate the potential to rapidly create the tools necessary to assess large scale transcriptional responses in non-model species, paving the way for expanded impact of toxicogenomics in ecotoxicology. PMID:19936325
Li, Yongping; Wei, Wei; Feng, Jia; Luo, Huifeng; Pi, Mengting; Liu, Zhongchi; Kang, Chunying
2018-01-01
The genome of the wild diploid strawberry species Fragaria vesca, an ideal model system of cultivated strawberry (Fragaria × ananassa, octoploid) and other Rosaceae family crops, was first published in 2011 and followed by a new assembly (Fvb). However, the annotation for Fvb mainly relied on ab initio predictions and included only predicted coding sequences; therefore, an improved annotation is highly desirable. Here, a new annotation version named v2.0.a2 was created for the Fvb genome by a pipeline utilizing one PacBio library, 90 Illumina RNA-seq libraries, and 9 small RNA-seq libraries. Altogether, 18,641 genes (55.6% out of 33,538 genes) were augmented with information on the 5′ and/or 3′ UTRs, 13,168 (39.3%) protein-coding genes were modified or newly identified, and 7,370 genes were found to possess alternative isoforms. In addition, 1,938 long non-coding RNAs, 171 miRNAs, and 51,714 small RNA clusters were integrated into the annotation. This new annotation of F. vesca is substantially improved in both accuracy and integrity of gene predictions, benefiting gene functional studies in strawberry and comparative genomic analysis of other horticultural crops in the Rosaceae family. PMID:29036429
NASA Astrophysics Data System (ADS)
Bhooplapur, Sharad; Akbulut, Mehmetkan; Quinlan, Franklyn; Delfyett, Peter J.
2010-04-01
A novel scheme for recognition of electronic bit-sequences is demonstrated. Two electronic bit-sequences that are to be compared are each mapped to a unique code from a set of Walsh-Hadamard codes. The codes are then encoded in parallel on the spectral phase of the frequency comb lines from a frequency-stabilized mode-locked semiconductor laser. Phase encoding is achieved by using two independent spatial light modulators based on liquid crystal arrays. Encoded pulses are compared using interferometric pulse detection and differential balanced photodetection. Orthogonal codes eight bits long are compared, and matched codes are successfully distinguished from mismatched codes with very low error rates, of around 10^-18. This technique has potential for high-speed, high-accuracy recognition of bit-sequences, with applications in keyword searches and internet protocol packet routing.
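The reason Walsh-Hadamard codes suit this comparison task is that distinct codes are mutually orthogonal, so a matched pair gives a full-strength correlation and a mismatched pair gives zero. The sketch below builds the eight codes of length eight with the Sylvester construction and checks this property numerically. It models only the code arithmetic, not the optics; reading +1 and -1 as spectral phases of 0 and π is an assumption made here for illustration, and the function names are hypothetical.

```python
def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    h = [[1]]
    while len(h) < n:
        # [[H, H], [H, -H]] doubles the order at each step.
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def correlate(a, b):
    """Inner product of two +/-1 codes."""
    return sum(x * y for x, y in zip(a, b))


codes = hadamard(8)
for i, a in enumerate(codes):
    for j, b in enumerate(codes):
        expected = 8 if i == j else 0
        assert correlate(a, b) == expected
print("all 8 codes are mutually orthogonal")
```

In an interferometric comparison, the matched case corresponds to all comb lines adding in phase, while any mismatched pair cancels exactly in this idealized model.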
Two Perspectives on the Origin of the Standard Genetic Code
NASA Astrophysics Data System (ADS)
Sengupta, Supratim; Aggarwal, Neha; Bandhu, Ashutosh Vishwa
2014-12-01
The origin of a genetic code made it possible to create ordered sequences of amino acids. In this article we provide two perspectives on code origin by carrying out simulations of code-sequence coevolution in finite populations with the aim of examining how the standard genetic code may have evolved from more primitive code(s) encoding a small number of amino acids. We determine the efficacy of the physico-chemical hypothesis of code origin in the absence and presence of horizontal gene transfer (HGT) by allowing a diverse collection of code-sequence sets to compete with each other. We find that in the absence of horizontal gene transfer, natural selection between competing codes distinguished by differences in the degree of physico-chemical optimization is unable to explain the structure of the standard genetic code. However, for certain probabilities of the horizontal transfer events, a universal code emerges having a structure that is consistent with the standard genetic code.
An algebraic hypothesis about the primeval genetic code architecture.
Sánchez, Robersy; Grau, Ricardo
2009-09-01
A plausible architecture of an ancient genetic code is derived from an extended base triplet vector space over the Galois field of the extended base alphabet {D,A,C,G,U}, where symbol D represents one or more hypothetical bases with unspecific pairings. We hypothesized that the high degeneration of a primeval genetic code with five bases and the gradual origin and improvement of a primeval DNA repair system could make possible the transition from ancient to modern genetic codes. Our results suggest that the Watson-Crick base pairings G≡C and A=U and the non-specific base pairing of the hypothetical ancestral base D used to define the sum and product operations are enough features to determine the coding constraints of the primeval and the modern genetic code, as well as the transition from the former to the latter. Geometrical and algebraic properties of this vector space reveal that the present codon assignment of the standard genetic code could be induced from a primeval codon assignment. In addition, the Fourier spectrum of the extended DNA genome sequences derived from the multiple sequence alignment suggests that the so-called period-3 property of the present coding DNA sequences could also exist in the ancient coding DNA sequences. The phylogenetic analyses achieved with metrics defined in the N-dimensional vector space (B^3)^N of DNA sequences and with the new evolutionary model presented here also suggest that an ancient DNA coding sequence with five or more bases does not contradict the expected evolutionary history.
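The period-3 property mentioned in this abstract refers to a peak in the Fourier power spectrum of coding DNA at frequency 1/3, reflecting codon structure. A minimal sketch of how that peak can be measured is given below. It uses the ordinary four-letter alphabet and per-base indicator sequences, which is a common textbook formulation and an assumption here; the abstract's extended five-letter alphabet and the authors' exact spectrum definition are not reproduced.

```python
import cmath


def period3_power(seq):
    """Total Fourier power at frequency 1/3, summed over the four bases.

    For each base, the sequence is turned into a 0/1 indicator signal and
    its discrete Fourier sum is evaluated at frequency 1/3.
    """
    total = 0.0
    for base in "ACGT":
        s = sum(
            cmath.exp(-2j * cmath.pi * i / 3)
            for i, ch in enumerate(seq)
            if ch == base
        )
        total += abs(s) ** 2
    return total


codon_like = "ATG" * 30   # strict period 3: strong peak expected
period_four = "ACGT" * 24  # period 4: no power at frequency 1/3
print(period3_power(codon_like), period3_power(period_four))
```

Real coding sequences are far less regular than the toy repeat, so in practice the peak is judged relative to the average spectral power of the same sequence.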
NASA Astrophysics Data System (ADS)
Lestari, D.; Bustamam, A.; Novianti, T.; Ardaneswari, G.
2017-07-01
A DNA sequence can be defined as a succession of letters representing the order of nucleotides within DNA, drawn from the four base codes adenine (A), guanine (G), cytosine (C), and thymine (T). The precise sequence is determined using DNA sequencing methods, which have been developed since the 1970s and have matured into advanced high-throughput technologies. DNA sequencing has greatly accelerated biological and medical research and discovery. However, in some cases sequencing produces ambiguous results, making it difficult to determine whether a given code is A, T, G, or C. To address this problem, in this study we introduce another representation of DNA codes, namely the Quaternion Q = (PA, PT, PG, PC), where PA, PT, PG, PC are the probabilities that the bases A, T, G, C appear in Q, and PA + PT + PG + PC = 1. Using Quaternion representations, we construct an improved scoring matrix for global sequence alignment by applying a dot product method. This scoring matrix produces better-quality match and mismatch scores between two DNA base codes. In implementation, we applied the Needleman-Wunsch global sequence alignment algorithm, using Octave, to analyze our target sequence, which contains some ambiguous sequence data. The subject sequences are DNA sequences of Streptococcus pneumoniae families obtained from GenBank, while the target DNA sequence was received from our collaborator's database. We found that the Quaternion representation improves the quality of the sequence alignment score, and we conclude that the target DNA sequence has maximum similarity with Streptococcus pneumoniae.
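The idea of scoring aligned positions by the dot product of their probability vectors can be sketched as follows. This is an illustrative reconstruction in Python rather than the authors' Octave code: the gap penalty of -1, the use of the raw dot product as the substitution score, and all function names are assumptions, since the abstract does not give those details.

```python
ORDER = "ATGC"  # component order of Q = (PA, PT, PG, PC)


def one_hot(base):
    """Probability vector for an unambiguous base call."""
    return tuple(1.0 if b == base else 0.0 for b in ORDER)


def dot(p, q):
    """Dot-product score between two probability vectors."""
    return sum(a * b for a, b in zip(p, q))


def needleman_wunsch_score(xs, ys, gap=-1.0):
    """Global alignment score of two lists of probability vectors."""
    prev = [j * gap for j in range(len(ys) + 1)]
    for i in range(1, len(xs) + 1):
        cur = [i * gap] + [0.0] * len(ys)
        for j in range(1, len(ys) + 1):
            cur[j] = max(
                prev[j - 1] + dot(xs[i - 1], ys[j - 1]),  # align i with j
                prev[j] + gap,                            # gap in ys
                cur[j - 1] + gap,                         # gap in xs
            )
        prev = cur
    return prev[-1]


# Target read "A?G" whose middle base is ambiguous: 50% A, 50% T.
target = [one_hot("A"), (0.5, 0.5, 0.0, 0.0), one_hot("G")]
subject = [one_hot(b) for b in "ATG"]
score = needleman_wunsch_score(target, subject)
print(score)
```

An ambiguous position contributes a partial score in proportion to the probability mass it shares with the subject base, instead of being forced to count as a full match or a full mismatch.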
Detecting long tandem duplications in genomic sequences.
Audemard, Eric; Schiex, Thomas; Faraut, Thomas
2012-05-08
Detecting duplication segments within completely sequenced genomes provides valuable information to address genome evolution and in particular the important question of the emergence of novel functions. The usual approach to gene duplication detection, based on all-pairs protein gene comparisons, provides only a restricted view of duplication. In this paper, we introduce ReD Tandem, a software tool using a flow-based chaining algorithm targeted at detecting tandem duplication arrays of moderate to longer length regions, with possibly locally weak similarities, directly at the DNA level. On the A. thaliana genome, using a reference set of tandem duplicated genes built using TAIR, we show that ReD Tandem is able to predict a large fraction of recently duplicated genes (dS < 1) and that it is also able to predict tandem duplications involving non-coding elements such as pseudogenes or RNA genes. ReD Tandem allows the identification of large tandem duplications without any annotation, leading to agnostic identification of tandem duplications. This approach nicely complements the usual protein-gene-based approach, which ignores duplications involving non-coding regions. It is, however, inherently restricted to relatively recent duplications. By recovering otherwise ignored events, ReD Tandem gives a more comprehensive view of existing evolutionary processes and may also help improve existing annotations.
Reranking candidate gene models with cross-species comparison for improved gene prediction
Liu, Qian; Crammer, Koby; Pereira, Fernando CN; Roos, David S
2008-01-01
Background Most gene finders score candidate gene models with state-based methods, typically HMMs, by combining local properties (coding potential, splice donor and acceptor patterns, etc). Competing models with similar state-based scores may be distinguishable with additional information. In particular, functional and comparative genomics datasets may help to select among competing models of comparable probability by exploiting features likely to be associated with the correct gene models, such as conserved exon/intron structure or protein sequence features. Results We have investigated the utility of a simple post-processing step for selecting among a set of alternative gene models, using global scoring rules to rerank competing models for more accurate prediction. For each gene locus, we first generate the K best candidate gene models using the gene finder Evigan, and then rerank these models using comparisons with putative orthologous genes from closely-related species. Candidate gene models with lower scores in the original gene finder may be selected if they exhibit strong similarity to probable orthologs in coding sequence, splice site location, or signal peptide occurrence. Experiments on Drosophila melanogaster demonstrate that reranking based on cross-species comparison outperforms the best gene models identified by Evigan alone, and also outperforms the comparative gene finders GeneWise and Augustus+. Conclusion Reranking gene models with cross-species comparison improves gene prediction accuracy. This straightforward method can be readily adapted to incorporate additional lines of evidence, as it requires only a ranked source of candidate gene models. PMID:18854050
Pérez-Quintero, Alvaro L.; Rodriguez-R, Luis M.; Dereeper, Alexis; López, Camilo; Koebnik, Ralf; Szurek, Boris; Cunnac, Sebastien
2013-01-01
Transcription Activators-Like Effectors (TALEs) belong to a family of virulence proteins from the Xanthomonas genus of bacterial plant pathogens that are translocated into the plant cell. In the nucleus, TALEs act as transcription factors inducing the expression of susceptibility genes. A code for TALE-DNA binding specificity and high-resolution three-dimensional structures of TALE-DNA complexes were recently reported. Accurate prediction of TAL Effector Binding Elements (EBEs) is essential to elucidate the biological functions of the many sequenced TALEs as well as for robust design of artificial TALE DNA-binding domains in biotechnological applications. In this work a program with improved EBE prediction performances was developed using an updated specificity matrix and a position weight correction function to account for the matching pattern observed in a validation set of TALE-DNA interactions. To gain a systems perspective on the large TALE repertoires from X. oryzae strains, this program was used to predict rice gene targets for 99 sequenced family members. Integrating predictions and available expression data in a TALE-gene network revealed multiple candidate transcriptional targets for many TALEs as well as several possible instances of functional convergence among TALEs. PMID:23869221
Ab initio gene identification in metagenomic sequences
Zhu, Wenhan; Lomsadze, Alexandre; Borodovsky, Mark
2010-01-01
We describe an algorithm for gene identification in DNA sequences derived from shotgun sequencing of microbial communities. Accurate ab initio gene prediction in a short nucleotide sequence of anonymous origin is hampered by uncertainty in model parameters. While several machine learning approaches could be proposed to bypass this difficulty, one effective method is to estimate parameters from dependencies, formed in evolution, between frequencies of oligonucleotides in protein-coding regions and genome nucleotide composition. Original version of the method was proposed in 1999 and has been used since for (i) reconstructing codon frequency vector needed for gene finding in viral genomes and (ii) initializing parameters of self-training gene finding algorithms. With advent of new prokaryotic genomes en masse it became possible to enhance the original approach by using direct polynomial and logistic approximations of oligonucleotide frequencies, as well as by separating models for bacteria and archaea. These advances have increased the accuracy of model reconstruction and, subsequently, gene prediction. We describe the refined method and assess its accuracy on known prokaryotic genomes split into short sequences. Also, we show that as a result of application of the new method, several thousands of new genes could be added to existing annotations of several human and mouse gut metagenomes. PMID:20403810
Aversano, Riccardo; Contaldi, Felice; Ercolano, Maria Raffaella; Grosso, Valentina; Iorizzo, Massimo; Tatino, Filippo; Xumerle, Luciano; Dal Molin, Alessandra; Avanzato, Carla; Ferrarini, Alberto; Delledonne, Massimo; Sanseverino, Walter; Cigliano, Riccardo Aiese; Capella-Gutierrez, Salvador; Gabaldón, Toni; Frusciante, Luigi; Bradeen, James M.; Carputo, Domenico
2015-01-01
Here, we report the draft genome sequence of Solanum commersonii, which consists of ∼830 megabases with an N50 of 44,303 bp anchored to 12 chromosomes, using the potato (Solanum tuberosum) genome sequence as a reference. Compared with potato, S. commersonii shows a striking reduction in heterozygosity (1.5% versus 53 to 59%), and differences in genome sizes were mainly due to variations in intergenic sequence length. Gene annotation by ab initio prediction supported by RNA-seq data produced a catalog of 1703 predicted microRNAs, 18,882 long noncoding RNAs of which 20% are shown to target cold-responsive genes, and 39,290 protein-coding genes with a significant repertoire of nonredundant nucleotide binding site-encoding genes and 126 cold-related genes that are lacking in S. tuberosum. Phylogenetic analyses indicate that domesticated potato and S. commersonii lineages diverged ∼2.3 million years ago. Three duplication periods corresponding to genome enrichment for particular gene families related to response to salt stress, water transport, growth, and defense response were discovered. The draft genome sequence of S. commersonii substantially increases our understanding of the domesticated germplasm, facilitating translation of acquired knowledge into advances in crop stability in light of global climate and environmental changes. PMID:25873387
Genome assembly and transcriptome resource for river buffalo, Bubalus bubalis (2n = 50).
Williams, John L; Iamartino, Daniela; Pruitt, Kim D; Sonstegard, Tad; Smith, Timothy P L; Low, Wai Yee; Biagini, Tommaso; Bomba, Lorenzo; Capomaccio, Stefano; Castiglioni, Bianca; Coletta, Angelo; Corrado, Federica; Ferré, Fabrizio; Iannuzzi, Leopoldo; Lawley, Cynthia; Macciotta, Nicolò; McClure, Matthew; Mancini, Giordano; Matassino, Donato; Mazza, Raffaele; Milanesi, Marco; Moioli, Bianca; Morandi, Nicola; Ramunno, Luigi; Peretti, Vincenzo; Pilla, Fabio; Ramelli, Paola; Schroeder, Steven; Strozzi, Francesco; Thibaud-Nissen, Francoise; Zicarelli, Luigi; Ajmone-Marsan, Paolo; Valentini, Alessio; Chillemi, Giovanni; Zimin, Aleksey
2017-10-01
Water buffalo is a globally important species for agriculture and local economies. A de novo assembled, well-annotated reference sequence for the water buffalo is an important prerequisite for studying the biology of this species, and is necessary to manage genetic diversity and to use modern breeding and genomic selection techniques. However, no such genome assembly has been previously reported. There are 2 species of domestic water buffalo, the river (2 n = 50) and the swamp (2 n = 48) buffalo. Here we describe a draft quality reference sequence for the river buffalo created from Illumina GA and Roche 454 short read sequences using the MaSuRCA assembler. The assembled sequence is 2.83 Gb, consisting of 366 983 scaffolds with a scaffold N50 of 1.41 Mb and contig N50 of 21 398 bp. Annotation of the genome was supported by transcriptome data from 30 tissues and identified 21 711 predicted protein coding genes. Searches for complete mammalian BUSCO gene groups found 98.6% of curated single copy orthologs present among predicted genes, which suggests a high level of completeness of the genome. The annotated sequence is available from NCBI at accession GCA_000471725.1. © The Author 2017. Published by Oxford University Press.
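The scaffold and contig N50 values quoted above follow a standard definition: the length L such that sequences of length L or longer together hold at least half of the assembled bases. A minimal sketch of that calculation is given below; the function name `n50` and the toy lengths are illustrative assumptions and are not taken from the buffalo assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total number of assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


# Made-up scaffold lengths (total 300), purely for illustration.
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_scaffolds))
```

Because N50 depends only on the length distribution, it says nothing about assembly correctness, which is why the abstract also reports BUSCO completeness.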
Gubser, Caroline; Smith, Geoffrey L
2002-04-01
Camelpox virus (CMPV) and variola virus (VAR) are orthopoxviruses (OPVs) that share several biological features and cause high mortality and morbidity in their single host species. The sequence of a virulent CMPV strain was determined; it is 202182 bp long, with inverted terminal repeats (ITRs) of 6045 bp and has 206 predicted open reading frames (ORFs). As for other poxviruses, the genes are tightly packed with little non-coding sequence. Most genes within 25 kb of each terminus are transcribed outwards towards the terminus, whereas genes within the centre of the genome are transcribed from either DNA strand. The central region of the genome contains genes that are highly conserved in other OPVs and 87 of these are conserved in all sequenced chordopoxviruses. In contrast, genes towards either terminus are more variable and encode proteins involved in host range, virulence or immunomodulation. In some cases, these are broken versions of genes found in other OPVs. The relationship of CMPV to other OPVs was analysed by comparisons of DNA and predicted protein sequences, repeats within the ITRs and arrangement of ORFs within the terminal regions. Each comparison gave the same conclusion: CMPV is the closest known virus to variola virus, the cause of smallpox.
Rapid evolution of cis-regulatory sequences via local point mutations
NASA Technical Reports Server (NTRS)
Stone, J. R.; Wray, G. A.
2001-01-01
Although the evolution of protein-coding sequences within genomes is well understood, the same cannot be said of the cis-regulatory regions that control transcription. Yet, changes in gene expression are likely to constitute an important component of phenotypic evolution. We simulated the evolution of new transcription factor binding sites via local point mutations. The results indicate that new binding sites appear and become fixed within populations on microevolutionary timescales under an assumption of neutral evolution. Even combinations of two new binding sites evolve very quickly. We predict that local point mutations continually generate considerable genetic variation that is capable of altering gene expression.
Margam, Venu M.; Coates, Brad S.; Bayles, Darrell O.; Hellmich, Richard L.; Agunbiade, Tolulope; Seufferheld, Manfredo J.; Sun, Weilin; Kroemer, Jeremy A.; Ba, Malick N.; Binso-Dabire, Clementine L.; Baoua, Ibrahim; Ishiyaku, Mohammad F.; Covas, Fernando G.; Srinivasan, Ramasamy; Armstrong, Joel; Murdock, Larry L.; Pittendrigh, Barry R.
2011-01-01
The legume pod borer, Maruca vitrata (Lepidoptera: Crambidae), is an insect pest species of crops grown by subsistence farmers in tropical regions of Africa. We present the de novo assembly of 3729 contigs from 454- and Sanger-derived sequencing reads for midgut, salivary, and whole adult tissues of this non-model species. Functional annotation predicted that 1320 M. vitrata protein coding genes are present, of which 631 have orthologs within the Bombyx mori gene model. A homology-based analysis assigned M. vitrata genes into a group of paralogs, but these were subsequently partitioned into putative orthologs following phylogenetic analyses. Following sequence quality filtering, a total of 1542 putative single nucleotide polymorphisms (SNPs) were predicted within M. vitrata contig assemblies. Seventy one of 1078 designed molecular genetic markers were used to screen M. vitrata samples from five collection sites in West Africa. Population substructure may be present with significant implications in the insect resistance management recommendations pertaining to the release of biological control agents or transgenic cowpea that express Bacillus thuringiensis crystal toxins. Mutation data derived from transcriptome sequencing is an expeditious and economical source for genetic markers that allow evaluation of ecological differentiation. PMID:21754987
Wind data mining by Kohonen Neural Networks.
Fayos, José; Fayos, Carolina
2007-02-14
Time series of Circulation Weather Type (CWT), including daily averaged wind direction and vorticity, are self-classified by similarity using Kohonen Neural Networks (KNN). It is shown that KNN is able to map by similarity all 7300 five-day CWT sequences during the period of 1975-94, in London, United Kingdom. It gives, as a first result, the most probable wind sequences preceding each one of the 27 CWT Lamb classes in that period. Inversely, as a second result, the observed diffuse correlation between the five-day CWT sequences and the CWT of the 6th day, in the long 20-year period, can be generalized to predict the latter from the preceding CWT sequence in a different test period, like 1995, as both time series are similar. Although the average prediction error is comparable to that obtained by standard forecasting methods, the KNN approach gives complementary results, as they depend only on an objective classification of observed CWT data, without any model assumption. The 27 CWT of the Lamb Catalogue were coded with binary three-dimensional vectors, pointing to faces, edges and vertices of a "wind-cube," so that similar CWT vectors were close.
Circular codes revisited: a statistical approach.
Gonzalez, D L; Giannerini, S; Rosa, R
2011-04-21
In 1996 Arquès and Michel [1996. A complementary circular code in the protein coding genes. J. Theor. Biol. 182, 45-58] discovered the existence of a common circular code in eukaryote and prokaryote genomes. Since then, circular code theory has provoked great interest and underwent a rapid development. In this paper we discuss some theoretical issues related to the synchronization properties of coding sequences and circular codes with particular emphasis on the problem of retrieval and maintenance of the reading frame. Motivated by the theoretical discussion, we adopt a rigorous statistical approach in order to try to answer different questions. First, we investigate the covering capability of the whole class of 216 self-complementary, C(3) maximal codes with respect to a large set of coding sequences. The results indicate that, on average, the code proposed by Arquès and Michel has the best covering capability but, still, there exists a great variability among sequences. Second, we focus on such code and explore the role played by the proportion of the bases by means of a hierarchy of permutation tests. The results show the existence of a sort of optimization mechanism such that coding sequences are tailored as to maximize or minimize the coverage of circular codes on specific reading frames. Such optimization clearly relates the function of circular codes with reading frame synchronization. Copyright © 2011 Elsevier Ltd. All rights reserved.
Modeling genome coverage in single-cell sequencing
Daley, Timothy; Smith, Andrew D.
2014-01-01
Motivation: Single-cell DNA sequencing is necessary for examining genetic variation at the cellular level, which remains hidden in bulk sequencing experiments. But because these experiments begin with such small amounts of starting material, the amount of information obtained from a single-cell sequencing experiment is highly sensitive to the choice of protocol employed and to variability in library preparation. In particular, the fraction of the genome represented in single-cell sequencing libraries exhibits extreme variability due to quantitative biases in amplification and loss of genetic material. Results: We propose a method to predict the genome coverage of a deep sequencing experiment using information from an initial shallow sequencing experiment mapped to a reference genome. The observed coverage statistics are used in a non-parametric empirical Bayes Poisson model to estimate the gain in coverage from deeper sequencing. This approach allows researchers to know statistical features of deep sequencing experiments without actually sequencing deeply, providing a basis for optimizing and comparing single-cell sequencing protocols or screening libraries. Availability and implementation: The method is available as part of the preseq software package. Source code is available at http://smithlabresearch.org/preseq. Contact: andrewds@usc.edu Supplementary information: Supplementary material is available at Bioinformatics online. PMID:25107873
Turco, Gina; Schnable, James C.; Pedersen, Brent; Freeling, Michael
2013-01-01
Conserved non-coding sequences (CNS) are islands of non-coding sequence that, like protein coding exons, show less divergence in sequence between related species than functionless DNA. Several CNSs have been demonstrated experimentally to function as cis-regulatory regions. However, the specific functions of most CNSs remain unknown. Previous searches for CNS in plants have either anchored on exons and only identified nearby sequences or required years of painstaking manual annotation. Here we present an open source tool that can accurately identify CNSs between any two related species with sequenced genomes, including both those immediately adjacent to exons and distal sequences separated by >12 kb of non-coding sequence. We have used this tool to characterize new motifs, associate CNSs with additional functions, and identify previously undetected genes encoding RNA and protein in the genomes of five grass species. We provide a list of 15,363 orthologous CNSs conserved across all grasses tested. We were also able to identify regulatory sequences present in the common ancestor of grasses that have been lost in one or more extant grass lineages. Lists of orthologous gene pairs and associated CNSs are provided for reference inbred lines of arabidopsis, Japonica rice, foxtail millet, sorghum, brachypodium, and maize. PMID:23874343
Jones, David T; Singh, Tanya; Kosciolek, Tomasz; Tetchner, Stuart
2015-04-01
Recent developments of statistical techniques to infer direct evolutionary couplings between residue pairs have rendered covariation-based contact prediction a viable means for accurate 3D modelling of proteins, with no information other than the sequence required. To extend the usefulness of contact prediction, we have designed a new meta-predictor (MetaPSICOV) which combines three distinct approaches for inferring covariation signals from multiple sequence alignments, considers a broad range of other sequence-derived features and, uniquely, a range of metrics which describe both the local and global quality of the input multiple sequence alignment. Finally, we use a two-stage predictor, where the second stage filters the output of the first stage. This two-stage predictor is additionally evaluated on its ability to accurately predict the long range network of hydrogen bonds, including correctly assigning the donor and acceptor residues. Using the original PSICOV benchmark set of 150 protein families, MetaPSICOV achieves a mean precision of 0.54 for top-L predicted long range contacts-around 60% higher than PSICOV, and around 40% better than CCMpred. In de novo protein structure prediction using FRAGFOLD, MetaPSICOV is able to improve the TM-scores of models by a median of 0.05 compared with PSICOV. Lastly, for predicting long range hydrogen bonding, MetaPSICOV-HB achieves a precision of 0.69 for the top-L/10 hydrogen bonds compared with just 0.26 for the baseline MetaPSICOV. MetaPSICOV is available as a freely available web server at http://bioinf.cs.ucl.ac.uk/MetaPSICOV. Raw data (predicted contact lists and 3D models) and source code can be downloaded from http://bioinf.cs.ucl.ac.uk/downloads/MetaPSICOV. Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press.
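The "top-L" precision figures quoted above are usually computed by ranking predicted residue pairs by score, keeping the L highest-scoring long-range pairs (L being the sequence length), and counting how many are true contacts. The sketch below illustrates that metric under stated assumptions: the function name, the `min_sep=24` cutoff for "long range", and the toy data are hypothetical and are not taken from the MetaPSICOV code or benchmark.

```python
def top_l_precision(scored_pairs, true_contacts, seq_len, min_sep=24, top_divisor=1):
    """Precision of the top L/top_divisor long-range contact predictions.

    scored_pairs:  iterable of (i, j, score) residue-pair predictions
    true_contacts: iterable of (i, j) pairs observed in the native structure
    min_sep:       minimum |i - j| for a pair to count as long range
                   (24 is a common convention, assumed here)
    """
    long_range = [(i, j, s) for i, j, s in scored_pairs if abs(i - j) >= min_sep]
    long_range.sort(key=lambda pair: pair[2], reverse=True)
    n_top = max(1, seq_len // top_divisor)
    top = long_range[:n_top]
    if not top:
        return 0.0
    # Treat contacts as unordered pairs so (i, j) and (j, i) match.
    truth = {frozenset(pair) for pair in true_contacts}
    hits = sum(1 for i, j, _ in top if frozenset((i, j)) in truth)
    return hits / len(top)


# Made-up predictions for a hypothetical 100-residue protein.
predictions = [
    (1, 30, 0.90), (2, 40, 0.80), (5, 50, 0.70),
    (3, 60, 0.60), (10, 12, 0.95), (7, 70, 0.10),
]
native = [(1, 30), (5, 50), (3, 60), (10, 12)]
print(top_l_precision(predictions, native, seq_len=100, top_divisor=25))
```

Setting `top_divisor=10` would give the top-L/10 variant mentioned for hydrogen-bond prediction.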
Li, Jun-Yi; Xie, Hui; Xu, Chun-Ling; Li, Yu
2016-01-01
The chrysanthemum foliar nematode (CFN), Aphelenchoides ritzemabosi, is a plant parasitic nematode that attacks many plants. In this study, the transcriptome of a mixed-stage population of CFN was sequenced on the Illumina HiSeq 2000 platform. A total of 68.10 million high-quality Illumina paired-end reads were obtained, which generated 26,817 transcripts with a mean length of 1,032 bp and an N50 of 1,672 bp, of which 16,467 transcripts were annotated against six databases. In total, 20,311 coding region sequences (CDS), 495 simple sequence repeats (SSRs) and 8,353 single-nucleotide polymorphisms (SNPs) were predicted. The nematode sharing the most sequences with CFN was B. xylophilus, with 16,846 (62.82%) common transcripts, and 10,543 (39.31%) CFN transcripts matched sequences of all four plant parasitic nematodes compared. A total of 111 CFN transcripts were predicted as homologues of 7 types of carbohydrate-active enzymes (CAZymes) with plant/fungal cell wall-degrading activities; fewer transcripts were predicted as homologues of plant cell wall-degrading enzymes than of fungal cell wall-degrading enzymes. The phylogenetic analysis of GH5, GH16, GH43 and GH45 proteins between CFN and other organisms showed that CFN and other nematodes have a closer phylogenetic relationship. In the CFN transcriptome, sixteen types of genes orthologous to seven classes of protein families involved in the RNAi pathway in C. elegans were predicted. This research provides comprehensive gene expression information at the transcriptional level, which will facilitate the elucidation of the molecular mechanisms of CFN and the distribution of gene functions at the macro level, potentially revealing improved methods for controlling CFN. PMID:27875578
GenoMycDB: a database for comparative analysis of mycobacterial genes and genomes.
Catanho, Marcos; Mascarenhas, Daniel; Degrave, Wim; Miranda, Antonio Basílio de
2006-03-31
Several databases and computational tools have been created with the aim of organizing, integrating and analyzing the wealth of information generated by large-scale sequencing projects of mycobacterial genomes and those of other organisms. However, with very few exceptions, these databases and tools do not allow for massive and/or dynamic comparison of these data. GenoMycDB (http://www.dbbm.fiocruz.br/GenoMycDB) is a relational database built for large-scale comparative analyses of completely sequenced mycobacterial genomes, based on their predicted protein content. Its central structure is composed of the results obtained after pair-wise sequence alignments among all the predicted proteins coded by the genomes of six mycobacteria: Mycobacterium tuberculosis (strains H37Rv and CDC1551), M. bovis AF2122/97, M. avium subsp. paratuberculosis K10, M. leprae TN, and M. smegmatis MC2 155. The database stores the computed similarity parameters of every aligned pair, providing for each protein sequence the predicted subcellular localization, the assigned cluster of orthologous groups, the features of the corresponding gene, and links to several important databases. Tables containing pairs or groups of potential homologs between selected species/strains can be produced dynamically by user-defined criteria, based on one or multiple sequence similarity parameters. In addition, searches can be restricted according to the predicted subcellular localization of the protein, the DNA strand of the corresponding gene and/or the description of the protein. Massive data search and/or retrieval are available, and different ways of exporting the result are offered. GenoMycDB provides an on-line resource for the functional classification of mycobacterial proteins as well as for the analysis of genome structure, organization, and evolution.
Identification and correction of abnormal, incomplete and mispredicted proteins in public databases.
Nagy, Alinda; Hegyi, Hédi; Farkas, Krisztina; Tordai, Hedvig; Kozma, Evelin; Bányai, László; Patthy, László
2008-08-27
Despite significant improvements in computational annotation of genomes, sequences of abnormal, incomplete or incorrectly predicted genes and proteins remain abundant in public databases. Since the majority of incomplete, abnormal or mispredicted entries are not annotated as such, these errors seriously affect the reliability of these databases. Here we describe the MisPred approach that may provide an efficient means for the quality control of databases. The current version of the MisPred approach uses five distinct routines for identifying abnormal, incomplete or mispredicted entries based on the principle that a sequence is likely to be incorrect if some of its features conflict with our current knowledge about protein-coding genes and proteins: (i) conflict between the predicted subcellular localization of proteins and the absence of the corresponding sequence signals; (ii) presence of extracellular and cytoplasmic domains and the absence of transmembrane segments; (iii) co-occurrence of extracellular and nuclear domains; (iv) violation of domain integrity; (v) chimeras encoded by two or more genes located on different chromosomes. Analyses of predicted EnsEMBL protein sequences of nine deuterostome (Homo sapiens, Mus musculus, Rattus norvegicus, Monodelphis domestica, Gallus gallus, Xenopus tropicalis, Fugu rubripes, Danio rerio and Ciona intestinalis) and two protostome species (Caenorhabditis elegans and Drosophila melanogaster) have revealed that the absence of expected signal peptides and violation of domain integrity account for the majority of mispredictions. Analyses of sequences predicted by NCBI's GNOMON annotation pipeline show that the rates of mispredictions are comparable to those of EnsEMBL. Interestingly, even the manually curated UniProtKB/Swiss-Prot dataset is contaminated with mispredicted or abnormal proteins, although to a much lesser extent than UniProtKB/TrEMBL or the EnsEMBL or GNOMON-predicted entries. 
MisPred works efficiently in identifying errors in predictions generated by the most reliable gene prediction tools such as the EnsEMBL and NCBI's GNOMON pipelines and also guides the correction of errors. We suggest that application of the MisPred approach will significantly improve the quality of gene predictions and the associated databases.
The Yak genome database: an integrative database for studying yak biology and high-altitude adaption
2012-01-01
Background: The yak (Bos grunniens) is a long-haired bovine that lives at high altitudes and is an important source of milk, meat, fiber and fuel. The recent sequencing, assembly and annotation of its genome are expected to further our understanding of the means by which it has adapted to life at high altitudes and its ecologically important traits. Description: The Yak Genome Database (YGD) is an internet-based resource that provides access to genomic sequence data and predicted functional information concerning the genes and proteins of Bos grunniens. The curated data stored in the YGD includes genome sequences, predicted genes and associated annotations, non-coding RNA sequences, transposable elements, single nucleotide variants, and three-way whole-genome alignments between human, cattle and yak. YGD offers useful searching and data mining tools, including the ability to search for genes by name or using function keywords as well as GBrowse genome browsers and/or BLAST servers, which can be used to visualize genome regions and identify similar sequences. Sequence data from the YGD can also be downloaded to perform local searches. Conclusions: A new yak genome database (YGD) has been developed to facilitate studies on high-altitude adaption and bovine genomics. The database will be continuously updated to incorporate new information such as transcriptome data and population resequencing data. The YGD can be accessed at http://me.lzu.edu.cn/yak. PMID:23134687
Is sequencing better than phenotypic tests for the detection of pyrazinamide resistance?
Bouzouita, I; Cabibbe, A M; Trovato, A; Draoui, H; Ghariani, A; Midouni, B; Essalah, L; Mehiri, E; Cirillo, D M; Slim-Saidi, L
2018-06-01
Phenotypic tests used to detect pyrazinamide (PZA) resistance are slow and have a high rate of false resistance. This study evaluated the accuracy of pncA sequencing for the detection of PZA resistance in Mycobacterium tuberculosis strains isolated in Tunisia. A total of 82 isolates, 41 resistant and 41 susceptible to PZA on BACTEC™ MGIT™ 960, were sequenced for pncA. Strains that were phenotypically resistant but had wild-type pncA underwent whole genome sequencing and MGIT retesting with a modified protocol. Twenty-three strains resistant to PZA with negative pyrazinamidase (PZase) activity harboured a mutation in the promoter or coding region of pncA. However, 18 strains resistant to PZA did not present any mutation. Repeat MGIT 960 showed that 16 of 18 M. tuberculosis isolates were falsely resistant to PZA. Compared with MGIT, the PZase activity assay and pncA sequencing both presented a sensitivity of 92.0% (95%CI 73.9-99.0), with specificities of 96.5% (positive predictive value [PPV] 92.0%, negative predictive value [NPV] 96.5%) and 100.0% (PPV 100.0%, NPV 96.6%), respectively. The standard MGIT assay showed a high rate of false resistance to PZA, and the PZase activity assay is slow. pncA sequencing could therefore represent a rapid, accurate, alternative test to detect PZA resistance.
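The sensitivity, specificity, PPV and NPV quoted above all derive from a 2x2 table of test result against reference result. The sketch below shows the arithmetic. The counts are an assumption: the abstract does not state the 2x2 tables, and these values were reconstructed so that they reproduce the reported percentages (25 resistant and 57 susceptible isolates once the 16 falsely resistant isolates are reclassified). The function name is likewise illustrative.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard accuracy measures from the four cells of a 2x2 table."""
    return {
        "sensitivity": tp / (tp + fn),  # resistant isolates correctly detected
        "specificity": tn / (tn + fp),  # susceptible isolates correctly cleared
        "ppv": tp / (tp + fp),          # positive calls that are truly resistant
        "npv": tn / (tn + fn),          # negative calls that are truly susceptible
    }


# Assumed counts, inferred from the reported percentages (not from the paper).
pnca_sequencing = diagnostic_metrics(tp=23, fp=0, fn=2, tn=57)
pzase_assay = diagnostic_metrics(tp=23, fp=2, fn=2, tn=55)

for name, metrics in [("pncA sequencing", pnca_sequencing), ("PZase assay", pzase_assay)]:
    summary = ", ".join(f"{key} {value * 100:.1f}%" for key, value in metrics.items())
    print(f"{name}: {summary}")
```

Under these assumed counts both tests miss the same number of resistant isolates, so the difference between them lies entirely in false positives.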
Long Non-coding RNAs and Their Biological Roles in Plants
Liu, Xue; Hao, Lili; Li, Dayong; Zhu, Lihuang; Hu, Songnian
2015-01-01
With the development of genomics and bioinformatics, especially the extensive applications of high-throughput sequencing technology, more transcriptional units with little or no protein-coding potential have been discovered. Such RNA molecules are called non-protein-coding RNAs (npcRNAs or ncRNAs). Among them, long npcRNAs or ncRNAs (lnpcRNAs or lncRNAs) represent diverse classes of transcripts longer than 200 nucleotides. In recent years, the lncRNAs have been considered as important regulators in many essential biological processes. In plants, although a large number of lncRNA transcripts have been predicted and identified in few species, our current knowledge of their biological functions is still limited. Here, we have summarized recent studies on their identification, characteristics, classification, bioinformatics, resources, and current exploration of their biological functions in plants. PMID:25936895
Xiong, H; Campelo, D; Pollack, R J; Raoult, D; Shao, R; Alem, M; Ali, J; Bilcha, K; Barker, S C
2014-08-01
The Illumina Hiseq platform was used to sequence the entire mitochondrial coding-regions of 20 body lice, Pediculus humanus Linnaeus, and head lice, P. capitis De Geer (Phthiraptera: Pediculidae), from eight towns and cities in five countries: Ethiopia, France, China, Australia and the U.S.A. These data (∼310 kb) were used to see how much more informative entire mitochondrial coding-region sequences were than partial mitochondrial coding-region sequences, and thus to guide the design of future studies of the phylogeny, origin, evolution and taxonomy of body lice and head lice. Phylogenies were compared from entire coding-region sequences (∼15.4 kb), entire cox1 (∼1.5 kb), partial cox1 (∼700 bp) and partial cytb (∼600 bp) sequences. On the one hand, phylogenies from entire mitochondrial coding-region sequences (∼15.4 kb) were much more informative than phylogenies from entire cox1 sequences (∼1.5 kb) and partial gene sequences (∼600 to ∼700 bp). For example, 19 branches had > 95% bootstrap support in our maximum likelihood tree from the entire mitochondrial coding-regions (∼15.4 kb) whereas the tree from 700 bp cox1 had only two branches with bootstrap support > 95%. Yet, by contrast, partial cytb (∼600 bp) and partial cox1 (∼486 bp) sequences were sufficient to genotype lice to Clade A, B or C. The sequences of the mitochondrial genomes of the P. humanus, P. capitis and P. schaeffi Fahrenholz studied are in NCBI GenBank under the accession numbers KC660761-800, KC685631-6330, KC241882-97, EU219988-95, HM241895-8 and JX080388-407. © 2014 The Royal Entomological Society.
Effective Identification of Similar Patients Through Sequential Matching over ICD Code Embedding.
Nguyen, Dang; Luo, Wei; Venkatesh, Svetha; Phung, Dinh
2018-04-11
Evidence-based medicine often involves the identification of patients with similar conditions, which are often captured in ICD (International Classification of Diseases (World Health Organization 2013)) code sequences. With no satisfying prior solutions for matching ICD-10 code sequences, this paper presents a method which effectively captures the clinical similarity among routine patients who have multiple comorbidities and complex care needs. Our method leverages the recent progress in representation learning of individual ICD-10 codes, and it explicitly uses the sequential order of codes for matching. Empirical evaluation on a state-wide cancer data collection shows that our proposed method achieves significantly higher matching performance compared with state-of-the-art methods ignoring the sequential order. Our method better identifies similar patients in a number of clinical outcomes including readmission and mortality outlook. Although this paper focuses on ICD-10 diagnosis code sequences, our method can be adapted to work with other codified sequence data.
Evans, K L; Lawson, D; Meitinger, T; Blackwood, D H; Porteous, D J
2000-04-03
Bipolar affective disorder (BPAD) is a complex disease with a significant genetic component. Heterozygous carriers of Wolfram syndrome (WFS) are at increased risk of psychiatric illness. A gene for WFS (WFS1) has recently been cloned and mapped to chromosome 4p, in the general region we previously reported as showing linkage to BPAD. Here we present sequence analysis of the WFS1 coding sequence in five affected individuals from two chromosome 4p-linked families. This resulted in the identification of six polymorphisms, two of which are predicted to change the amino acid sequence of the WFS1 protein, however none of the changes segregated with disease status. Am. J. Med. Genet. (Neuropsychiatr. Genet.) 96:158-160, 2000. Copyright 2000 Wiley-Liss, Inc.
Zhou, Carol L Ecale
2015-01-01
In order to better define regions of similarity among related protein structures, it is useful to identify the residue-residue correspondences among proteins. Few codes exist for constructing a one-to-many multiple sequence alignment derived from a set of structure or sequence alignments, and a need was evident for creating such a tool for combining pairwise structure alignments that would allow for insertion of gaps in the reference structure. This report describes a new Python code, CombAlign, which takes as input a set of pairwise sequence alignments (which may be structure based) and generates a one-to-many, gapped, multiple structure- or sequence-based sequence alignment (MSSA). The use and utility of CombAlign was demonstrated by generating gapped MSSAs using sets of pairwise structure-based sequence alignments between structure models of the matrix protein (VP40) and pre-small/secreted glycoprotein (sGP) of Reston Ebolavirus and the corresponding proteins of several other filoviruses. The gapped MSSAs revealed structure-based residue-residue correspondences, which enabled identification of structurally similar versus differing regions in the Reston proteins compared to each of the other corresponding proteins. CombAlign is a new Python code that generates a one-to-many, gapped, multiple structure- or sequence-based sequence alignment (MSSA) given a set of pairwise sequence alignments (which may be structure based). CombAlign has utility in assisting the user in distinguishing structurally conserved versus divergent regions on a reference protein structure relative to other closely related proteins. CombAlign was developed in Python 2.6, and the source code is available for download from the GitHub code repository.
SBLOCA outside containment at Browns Ferry Unit One: accident sequence analysis. [Small break
DOE Office of Scientific and Technical Information (OSTI.GOV)
Condon, W.A.; Harrington, R.M.; Greene, S.R.
1982-11-01
This study describes the predicted response of Unit 1 at the Browns Ferry Nuclear Plant to a postulated small-break loss-of-coolant accident outside of the primary containment. The break has been assumed to occur in the scram discharge volume piping immediately following a reactor scram that cannot be reset. The events before core uncovering are discussed for both the worst-case accident sequence without operator action and for the more likely sequences with operator action. Without operator action, the events after core uncovering would include core meltdown and subsequent containment failure, and this event sequence has been determined through use of the MARCH code. An estimate of the magnitude and timing of the concomitant release of the noble gas, cesium, and iodine-based fission products to the environment is provided in Volume 2 of this report.
Houzet, Laurent; Klase, Zachary; Yeung, Man Lung; Wu, Annie; Le, Shu-Yun; Quiñones, Mariam; Jeang, Kuan-Teh
2012-01-01
MicroRNAs (miRNAs) are 22-nt non-coding RNAs involved in the regulation of cellular gene expression and potential cellular defense against viral infection. Using in silico analyses, we predicted target sites for 22 human miRNAs in the HIV genome. Transfection experiments using synthetic miRNAs showed that five of these miRNAs capably decreased HIV replication. Using one of these five miRNAs, human miR-326 as an example, we demonstrated that the degree of complementarity between the predicted viral sequence and cellular miR-326 correlates, in a Dicer-dependent manner, with the potency of miRNA-mediated restriction of viral replication. Antagomirs to miR-326 that knocked down this cell endogenous miRNA increased HIV-1 replication in cells, suggesting that miR-326 is physiologically functional in moderating HIV-1 replication in human cells. PMID:23042677
Complete Coding Genome Sequence for Mogiana Tick Virus, a Jingmenvirus Isolated from Ticks in Brazil
2017-05-04
and capable of infecting a wide range of animal hosts (1–5). Here, we report the complete coding genome sequence (i.e., only missing portions of...segmented nature of the genome was not understood. Therefore, only the two genome segments with detectable sequence homologies to flaviviruses were...originally reported (2). We revisited the data set of Maruyama et al. (2) and assembled the complete coding sequences for all four genome segments. We
Quantized phase coding and connected region labeling for absolute phase retrieval.
Chen, Xiangcheng; Wang, Yuwei; Wang, Yajun; Ma, Mengchao; Zeng, Chunnian
2016-12-12
This paper proposes an absolute phase retrieval method for complex object measurement based on quantized phase-coding and connected region labeling. A specific code sequence is embedded into quantized phase of three coded fringes. Connected regions of different codes are labeled and assigned with 3-digit-codes combining the current period and its neighbors. Wrapped phase, more than 36 periods, can be restored with reference to the code sequence. Experimental results verify the capability of the proposed method to measure multiple isolated objects.
Functional interrogation of non-coding DNA through CRISPR genome editing
Canver, Matthew C.; Bauer, Daniel E.; Orkin, Stuart H.
2017-01-01
Methodologies to interrogate non-coding regions have lagged behind coding regions despite comprising the vast majority of the genome. However, the rapid evolution of clustered regularly interspaced short palindromic repeats (CRISPR)-based genome editing has provided a multitude of novel techniques for laboratory investigation including significant contributions to the toolbox for studying non-coding DNA. CRISPR-mediated loss-of-function strategies rely on direct disruption of the underlying sequence or repression of transcription without modifying the targeted DNA sequence. CRISPR-mediated gain-of-function approaches similarly benefit from methods to alter the targeted sequence through integration of customized sequence into the genome as well as methods to activate transcription. Here we review CRISPR-based loss- and gain-of-function techniques for the interrogation of non-coding DNA. PMID:28288828
Bobrova, E V; Bogacheva, I N; Lyakhovetskii, V A; Fabinskaja, A A; Fomina, E V
2017-01-01
In order to test the hypothesis of hemisphere specialization for different types of information coding (the right hemisphere, for positional coding; the left one, for vector coding), we analyzed the errors of right and left-handers during a task involving the memorization of sequences of movements by the left or the right hand, which activates vector coding by changing the order of movements in memorized sequences. The task was first performed by the right or the left hand, then by the opposite hand. It was found that both right- and left-handers use the information about the previous movements of the dominant hand, but not of the non-dominant one. After changing the hand, right-handers use the information about previous movements of the second hand, while left-handers do not. We compared our results with the data of previous experiments, in which positional coding was activated, and concluded that both right- and left-handers use vector coding for memorizing the sequences of their dominant hands and positional coding for memorizing the sequences of non-dominant hand. No similar patterns of errors were found between right- and left-handers after changing the hand, which suggests that in right- and left-handers the skills are transferred in different ways depending on the type of coding.
Castrignanò, Tiziana; Canali, Alessandro; Grillo, Giorgio; Liuni, Sabino; Mignone, Flavio; Pesole, Graziano
2004-01-01
The identification and characterization of genome tracts that are highly conserved across species during evolution may contribute significantly to the functional annotation of whole-genome sequences. Indeed, such sequences are likely to correspond to known or unknown coding exons or regulatory motifs. Here, we present a web server implementing a previously developed algorithm that, by comparing user-submitted genome sequences, is able to identify statistically significant conserved blocks and assess their coding or noncoding nature through the measure of a coding potential score. The web tool, available at http://www.caspur.it/CSTminer/, is dynamically interconnected with the Ensembl genome resources and produces a graphical output showing a map of detected conserved sequences and annotated gene features. PMID:15215464
Implication of common and disease specific variants in CLU, CR1, and PICALM.
Ferrari, Raffaele; Moreno, Jorge H; Minhajuddin, Abu T; O'Bryant, Sid E; Reisch, Joan S; Barber, Robert C; Momeni, Parastoo
2012-08-01
Two recent genome-wide association studies (GWAS) for late onset Alzheimer's disease (LOAD) revealed 3 new genes: clusterin (CLU), phosphatidylinositol binding clathrin assembly protein (PICALM), and complement receptor 1 (CR1). In order to evaluate association with these genome-wide association study-identified genes and to isolate the variants contributing to the pathogenesis of LOAD, we genotyped the top single nucleotide polymorphisms (SNPs), rs11136000 (CLU), rs3818361 (CR1), and rs3851179 (PICALM), and sequenced the entire coding regions of these genes in our cohort of 342 LOAD patients and 277 control subjects. We confirmed the association of rs3851179 (PICALM) (p = 7.4 × 10(-3)) with the disease status. Through sequencing we identified 18 variants in CLU, 3 of which were found exclusively in patients; 8 variants (out of 65) in CR1 gene were only found in patients and the 16 variants identified in PICALM gene were present in both patients and controls. In silico analysis of the variants in PICALM did not predict any damaging effect on the protein. The haplotype analysis of the variants in each gene predicted a common haplotype when the 3 single nucleotide polymorphisms rs11136000 (CLU), rs3818361 (CR1), and rs3851179 (PICALM), respectively, were included. For each gene the haplotype structure and size differed between patients and controls. In conclusion, we confirmed association of CLU, CR1, and PICALM genes with the disease status in our cohort through identification of a number of disease-specific variants among patients through the sequencing of the coding region of these genes. Published by Elsevier Inc.
Molecular cloning and functional analysis of MRLC2 in Tianfu, Boer, and Chengdu Ma goats.
Xu, H G; Xu, G Y; Wan, L; Ma, J
2013-03-15
To determine the molecular basis of heterosis in goats, fluorescence quantitative polymerase chain reaction (PCR) was performed to investigate myosin-regulatory light chain 2 (MRLC2) gene expression in the longissimus dorsi muscle tissues of the Tianfu goat and its parents, the Boer and Chengdu Ma goats. The goat MRLC2 gene was differentially expressed in the crossbreed, and the purebred mRNA were isolated and identified using fluorescence quantitative reverse transcription-PCR (RT-PCR). The complete coding sequence of MRLC2 was obtained using the cDNA method, and the full-length coding sequence consisted of 513 bp encoding 172 amino acids. The EF-hand superfamily domain of the MRLC2 protein is well conserved in caprine and other animals. The deduced amino acid sequence of MRLC2 shared significant identity with MRLC2 from other mammals. Phylogenetic tree analysis revealed that the MRLC2 protein was closely related to MRLC2 in other mammals. Several predicted miRNA target sites were found in the coding sequence of caprine MRLC2 mRNA. Analysis by RT-PCR showed that MRLC2 mRNA was present in the heart, stomach, liver, spleen, lung, small intestine, kidney, leg muscle, abdominal muscle, and longissimus dorsi muscles. In particular, the high expression of MRLC2 mRNA was detected in the longissimus dorsi, leg muscle, abdominal muscle, stomach, and heart, but low levels of expression were also observed in the liver, spleen, lung, small intestine, and kidney. The expression of the MRLC2 gene was upregulated in the longissimus dorsi muscle of Boer and Tianfu goats, and it was moderately upregulated in Chengdu Ma goats.
Primer development to obtain complete coding sequence of HA and NA genes of influenza A/H3N2 virus.
Agustiningsih, Agustiningsih; Trimarsanto, Hidayat; Setiawaty, Vivi; Artika, I Made; Muljono, David Handojo
2016-08-30
Influenza is an acute respiratory illness and has become a serious public health problem worldwide. The need to study the HA and NA genes in influenza A virus is essential since these genes frequently undergo mutations. This study describes the development of primer sets for RT-PCR to obtain complete coding sequence of Hemagglutinin (HA) and Neuraminidase (NA) genes of influenza A/H3N2 virus from Indonesia. The primers were developed based on influenza A/H3N2 sequence worldwide from Global Initiative on Sharing All Influenza Data (GISAID) and further tested using Indonesian influenza A/H3N2 archived samples of influenza-like illness (ILI) surveillance from 2008 to 2009. An optimum RT-PCR condition was acquired for all HA and NA fragments designed to cover complete coding sequence of HA and NA genes. A total of 71 samples were successfully sequenced for complete coding sequence both of HA and NA genes out of 145 samples of influenza A/H3N2 tested. The developed primer sets were suitable for obtaining complete coding sequences of HA and NA genes of Indonesian samples from 2008 to 2009.
GATA: A graphic alignment tool for comparative sequence analysis
DOE Office of Scientific and Technical Information (OSTI.GOV)
Nix, David A.; Eisen, Michael B.
2005-01-01
Several problems exist with current methods used to align DNA sequences for comparative sequence analysis. Most dynamic programming algorithms assume that conserved sequence elements are collinear. This assumption appears valid when comparing orthologous protein coding sequences. Functional constraints on proteins provide strong selective pressure against sequence inversions, and minimize sequence duplications and feature shuffling. For non-coding sequences this collinearity assumption is often invalid. For example, enhancers contain clusters of transcription factor binding sites that change in number, orientation, and spacing during evolution yet the enhancer retains its activity. Dotplot analysis is often used to estimate non-coding sequence relatedness. Yet dot plots do not actually align sequences and thus cannot account well for base insertions or deletions. Moreover, they lack an adequate statistical framework for comparing sequence relatedness and are limited to pairwise comparisons. Lastly, dot plots and dynamic programming text outputs fail to provide an intuitive means for visualizing DNA alignments.
Phylogenetic Network for European mtDNA
Finnilä, Saara; Lehtonen, Mervi S.; Majamaa, Kari
2001-01-01
The sequence in the first hypervariable segment (HVS-I) of the control region has been used as a source of evolutionary information in most phylogenetic analyses of mtDNA. Population genetic inference would benefit from a better understanding of the variation in the mtDNA coding region, but, thus far, complete mtDNA sequences have been rare. We determined the nucleotide sequence in the coding region of mtDNA from 121 Finns, by conformation-sensitive gel electrophoresis and subsequent sequencing and by direct sequencing of the D loop. Furthermore, 71 sequences from our previous reports were included, so that the samples represented all the mtDNA haplogroups present in the Finnish population. We found a total of 297 variable sites in the coding region, which allowed the compilation of unambiguous phylogenetic networks. The D loop harbored 104 variable sites, and, in most cases, these could be localized within the coding-region networks, without discrepancies. Interestingly, many homoplasies were detected in the coding region. Nucleotide variation in the rRNA and tRNA genes was 6%, and that in the third nucleotide positions of structural genes amounted to 22% of that in the HVS-I. The complete networks enabled the relationships between the mtDNA haplogroups to be analyzed. Phylogenetic networks based on the entire coding-region sequence in mtDNA provide a rich source for further population genetic studies, and complete sequences make it easier to differentiate between disease-causing mutations and rare polymorphisms. PMID:11349229
Huang, Lixing; Hu, Jiao; Su, Yongquan; Qin, Yingxue; Kong, Wendi; Ma, Ying; Xu, Xiaojin; Lin, Mao; Yan, Qingpi
2015-01-01
The capability of Vibrio alginolyticus to adhere to fish mucus is a key virulence factor of the bacteria. Our previous research showed that stress conditions, such as Cu(2+), Pb(2+), Hg(2+), and low pH, can reduce this adhesion ability. Non-coding (nc) RNAs play a crucial role in regulating bacterial gene expression, affecting the bacteria's pathogenicity. To investigate the mechanism(s) underlying the decline in adhesion ability caused by stressors, we combined high-throughput sequencing with computational techniques to detect stressed ncRNA dynamics. These approaches yielded three commonly altered ncRNAs that are predicted to regulate the bacterial chemotaxis pathway, which plays a key role in the adhesion process of bacteria. We hypothesized they play a key role in the adhesion process of V. alginolyticus. In this study, we validated the effects of these three ncRNAs on their predicted target genes and their role in the V. alginolyticus adhesion process with RNA interference (i), quantitative real-time polymerase chain reaction (qPCR), northern blot, capillary assay, and in vitro adhesion assays. The expression of these ncRNAs and their predicted target genes were confirmed by qPCR and northern blot, which reinforced the reliability of the sequencing data and the target prediction. Overexpression of these ncRNAs was capable of reducing the chemotactic and adhesion ability of V. alginolyticus, and the expression levels of their target genes were also significantly reduced. Our results indicated that these three ncRNAs: (1) are able to regulate the bacterial chemotaxis pathway, and (2) play a key role in the adhesion process of V. alginolyticus.
The Unsupervised Acquisition of a Lexicon from Continuous Speech.
1995-11-01
Communication, 2(1):57–89, 1982. [42] J. Ziv and A. Lempel. Compression of individual sequences by variable rate coding. IEEE Transactions on...parameters of the compression algorithm, in a never-ending attempt to identify and eliminate the predictable. They lead us to a class of grammars in...the first 10 sentences of the test set, previously unseen by the algorithm. Vertical bars indicate word boundaries. 7.1 Text Compression and Language
Ishikawa, Sohta A; Inagaki, Yuji; Hashimoto, Tetsuo
2012-01-01
In phylogenetic analyses of nucleotide sequences, 'homogeneous' substitution models, which assume the stationarity of base composition across a tree, are widely used, albeit individual sequences may bear distinctive base frequencies. In the worst-case scenario, a homogeneous model-based analysis can yield an artifactual union of two distantly related sequences that achieved similar base frequencies in parallel. Such potential difficulty can be countered by two approaches, 'RY-coding' and 'non-homogeneous' models. The former approach converts four bases into purine and pyrimidine to normalize base frequencies across a tree, while the heterogeneity in base frequency is explicitly incorporated in the latter approach. The two approaches have been applied to real-world sequence data; however, their basic properties have not been fully examined by pioneering simulation studies. Here, we assessed the performances of the maximum-likelihood analyses incorporating RY-coding and a non-homogeneous model (RY-coding and non-homogeneous analyses) on simulated data with parallel convergence to similar base composition. Both RY-coding and non-homogeneous analyses showed superior performances compared with homogeneous model-based analyses. Curiously, the performance of RY-coding analysis appeared to be significantly affected by a setting of the substitution process for sequence simulation relative to that of non-homogeneous analysis. The performance of a non-homogeneous analysis was also validated by analyzing a real-world sequence data set with significant base heterogeneity.
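The RY-coding step described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names `ry_code` and `frequencies` and the two toy sequences are assumptions made here for illustration. The sketch only shows how collapsing four bases into purine (R) and pyrimidine (Y) can equalize composition between sequences whose base frequencies differ strongly.

```python
from collections import Counter

# Purines (A, G) map to R; pyrimidines (C, T, U) map to Y.
RY_MAP = {"A": "R", "G": "R", "C": "Y", "T": "Y", "U": "Y"}


def ry_code(seq: str) -> str:
    """Recode a nucleotide sequence into R/Y symbols.

    Symbols outside the map (gaps, N, other ambiguity codes) pass
    through unchanged; this is a choice made for the sketch.
    """
    return "".join(RY_MAP.get(base, base) for base in seq.upper())


def frequencies(seq: str) -> dict:
    """Return the relative frequency of each symbol in the sequence."""
    counts = Counter(seq)
    total = sum(counts.values())
    return {symbol: counts[symbol] / total for symbol in sorted(counts)}


# Hypothetical toy sequences: one AT-rich, one GC-rich.
at_rich = "ATATTAAATG"
gc_rich = "GCGCCGGGCG"

print(frequencies(at_rich), frequencies(gc_rich))                    # very different
print(frequencies(ry_code(at_rich)), frequencies(ry_code(gc_rich)))  # identical
```

The recoding discards the transition signal (A/G and C/T changes become invisible), which is the usual trade-off of this approach against a non-homogeneous model that keeps all four states.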
Palindromic repetitive DNA elements with coding potential in Methanocaldococcus jannaschii.
Suyama, Mikita; Lathe, Warren C; Bork, Peer
2005-10-10
We have identified 141 novel palindromic repetitive elements in the genome of euryarchaeon Methanocaldococcus jannaschii. The total length of these elements is 14.3kb, which corresponds to 0.9% of the total genomic sequence and 6.3% of all extragenic regions. The elements can be divided into three groups (MJRE1-3) based on the sequence similarity. The low sequence identity within each of the groups suggests rather old origin of these elements in M. jannaschii. Three MJRE2 elements were located within the protein coding regions without disrupting the coding potential of the host genes, indicating that insertion of repeats might be a widespread mechanism to enhance sequence diversity in coding regions.
Brain cDNA clone for human cholinesterase
DOE Office of Scientific and Technical Information (OSTI.GOV)
McTiernan, C.; Adkins, S.; Chatonnet, A.
1987-10-01
A cDNA library from human basal ganglia was screened with oligonucleotide probes corresponding to portions of the amino acid sequence of human serum cholinesterase. Five overlapping clones, representing 2.4 kilobases, were isolated. The sequenced cDNA contained 207 base pairs of coding sequence 5' to the amino terminus of the mature protein in which there were four ATG translation start sites in the same reading frame as the protein. Only the ATG coding for Met-(-28) lay within a favorable consensus sequence for functional initiators. There were 1722 base pairs of coding sequence corresponding to the protein found circulating in human serum. The amino acid sequence deduced from the cDNA exactly matched the 574 amino acid sequence of human serum cholinesterase, as previously determined by Edman degradation. Therefore, our clones represented cholinesterase rather than acetylcholinesterase. It was concluded that the amino acid sequences of cholinesterase from two different tissues, human brain and human serum, were identical. Hybridization of genomic DNA blots suggested that a single gene, or very few genes coded for cholinesterase.
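The lengths quoted in this abstract are consistent with the three-bases-per-codon rule, which a short Python check makes explicit. The helper name `codons_in` is hypothetical and introduced only for this sketch; the input numbers are the ones stated in the abstract.

```python
def codons_in(cds_length_bp: int) -> int:
    """Return the number of codons in an in-frame coding stretch.

    Raises ValueError when the length is not a multiple of three,
    since such a stretch cannot be read as whole codons.
    """
    if cds_length_bp % 3 != 0:
        raise ValueError("coding length is not a multiple of 3")
    return cds_length_bp // 3


# 1722 bp of coding sequence for the mature serum protein.
mature_residues = codons_in(1722)

# 207 bp of in-frame coding sequence upstream of the mature amino terminus.
upstream_codons = codons_in(207)

print(mature_residues, upstream_codons)
```

The first value equals the 574 residues determined by Edman degradation, and the upstream stretch spans 69 codons, so an initiator at position -28 falls inside it.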
Functional interrogation of non-coding DNA through CRISPR genome editing.
Canver, Matthew C; Bauer, Daniel E; Orkin, Stuart H
2017-05-15
Methodologies to interrogate non-coding regions have lagged behind coding regions despite comprising the vast majority of the genome. However, the rapid evolution of clustered regularly interspaced short palindromic repeats (CRISPR)-based genome editing has provided a multitude of novel techniques for laboratory investigation including significant contributions to the toolbox for studying non-coding DNA. CRISPR-mediated loss-of-function strategies rely on direct disruption of the underlying sequence or repression of transcription without modifying the targeted DNA sequence. CRISPR-mediated gain-of-function approaches similarly benefit from methods to alter the targeted sequence through integration of customized sequence into the genome as well as methods to activate transcription. Here we review CRISPR-based loss- and gain-of-function techniques for the interrogation of non-coding DNA. Copyright © 2017 Elsevier Inc. All rights reserved.
Intlekofer, Andrew M; Joffe, Erel; Batlevi, Connie L; Hilden, Patrick; He, Jie; Seshan, Venkatraman E; Zelenetz, Andrew D; Palomba, M Lia; Moskowitz, Craig H; Portlock, Carol; Straus, David J; Noy, Ariela; Horwitz, Steven M; Gerecitano, John F; Moskowitz, Alison; Hamlin, Paul; Matasar, Matthew J; Kumar, Anita; van den Brink, Marcel R; Knapp, Kristina M; Pichardo, Janine D; Nahas, Michelle K; Trabucco, Sally E; Mughal, Tariq; Copeland, Amanda R; Papaemmanuil, Elli; Moarii, Mathai; Levine, Ross L; Dogan, Ahmet; Miller, Vincent A; Younes, Anas
2018-06-12
We sought to define the genomic landscape of diffuse large B-cell lymphoma (DLBCL) by using formalin-fixed paraffin-embedded (FFPE) biopsy specimens. We used targeted sequencing of genes altered in hematologic malignancies, including DNA coding sequence for 405 genes, noncoding sequence for 31 genes, and RNA coding sequence for 265 genes (FoundationOne-Heme). Short variants, rearrangements, and copy number alterations were determined. We studied 198 samples (114 de novo, 58 previously treated, and 26 large-cell transformation from follicular lymphoma). Median number of genomic alterations (GAs) per case was 6, with 97% of patients harboring at least one alteration. Recurrent GAs were detected in genes with established roles in DLBCL pathogenesis (e.g. MYD88, CREBBP, CD79B, EZH2), as well as notable differences compared to prior studies such as inactivating mutations in TET2 (5%). Less common GAs identified potential targets for approved or investigational therapies, including BRAF, CD274 (PD-L1), IDH2, and JAK1/2. TP53 mutations were more frequently observed in relapsed/refractory DLBCL, and predicted for lack of response to first-line chemotherapy, identifying a subset of patients that could be prioritized for novel therapies. Overall, 90% (n = 169) of the patients harbored a GA which could be explored for therapeutic intervention, with 54% (n = 107) harboring more than one putative target.
Roux-Rouquie, M; Marilley, M
2000-09-15
We have modeled local DNA sequence parameters to search for DNA architectural motifs involved in transcription regulation and promotion within the Xenopus laevis ribosomal gene promoter and the intergenic spacer (IGS) sequences. The IGS was found to be shaped into distinct topological domains. First, intrinsic bends split the IGS into domains of common but different helical features. Local parameters at inter-domain junctions exhibit a high variability with respect to intrinsic curvature, bendability and thermal stability. Secondly, the repeated sequence blocks of the IGS exhibit right-handed supercoiled structures which could be related to their enhancer properties. Thirdly, the gene promoter presents both inherent curvature and minor groove narrowing which may be viewed as motifs of a structural code for protein recognition and binding. Such pre-existing deformations could simply be remodeled during the binding of the transcription complex. Alternatively, these deformations could pre-shape the promoter in such a way that further remodeling is facilitated. Mutations shown to abolish promoter curvature as well as intrinsic minor groove narrowing, in a variant which maintained full transcriptional activity, bring circumstantial evidence for structurally-preorganized motifs in relation to transcription regulation and promotion. Using well documented X. laevis rDNA regulatory sequences we showed that computer modeling may be of invaluable assistance in assessing encrypted architectural motifs. The evidence of these DNA topological motifs with respect to the concept of structural code is discussed.
Chan, Wen-Ling; Yang, Wen-Kuang; Huang, Hsien-Da; Chang, Jan-Gowth
2013-01-01
RNA interference (RNAi) is a gene silencing process within living cells, which is controlled by the RNA-induced silencing complex in a sequence-specific manner. In flies and mice, the pseudogene transcripts can be processed into short interfering RNAs (siRNAs) that regulate protein-coding genes through the RNAi pathway. Following these findings, we construct an innovative and comprehensive database to elucidate the siRNA-mediated mechanism in human transcribed pseudogenes (TPGs). To investigate TPGs producing siRNAs that regulate protein-coding genes, we mapped the TPGs to small RNAs (sRNAs) that were supported by publicly available deep sequencing data from various sRNA libraries and constructed the TPG-derived siRNA-target interactions. In addition, we also showed that TPGs can act as a target for miRNAs that actually regulate the parental gene. To enable the systematic compilation and updating of these results and additional information, we have developed a database, pseudoMap, capturing various types of information, including sequence data, TPG and cognate annotation, deep sequencing data, RNA-folding structure, gene expression profiles, miRNA annotation and target prediction. To our knowledge, pseudoMap is the first database to demonstrate two mechanisms of human TPGs: encoding siRNAs and decoying miRNAs that target the parental gene. pseudoMap is freely accessible at http://pseudomap.mbc.nctu.edu.tw/. Database URL: http://pseudomap.mbc.nctu.edu.tw/
Ntougias, Spyridon; Lapidus, Alla; Copeland, Alex; ...
2015-08-13
Members of the genus Halotalea (family Halomonadaceae) are of high significance since they can tolerate the greatest glucose and maltose concentrations ever reported for known bacteria and are involved in the degradation of industrial effluents. Here, the characteristics and the permanent-draft genome sequence and annotation of Halotalea alkalilenta AW-7T are described. The microorganism was sequenced as a part of the Genomic Encyclopedia of Type Strains, Phase I: the one thousand microbial genomes (KMG) project at the DOE Joint Genome Institute, and it is the only strain within the genus Halotalea having its genome sequenced. The genome is 4,467,826 bp long and consists of 40 scaffolds with 64.62 % average GC content. A total of 4,104 genes were predicted, comprising 4,028 protein-coding and 76 RNA genes. Most protein-coding genes (87.79 %) were assigned to a putative function. Halotalea alkalilenta AW-7T encodes the catechol and protocatechuate degradation to β-ketoadipate via the β-ketoadipate and protocatechuate ortho-cleavage degradation pathway, and it possesses the genetic ability to detoxify fluoroacetate, cyanate and acrylonitrile. Lastly, an emended description of the genus Halotalea Ntougias et al. 2007 is also provided in order to describe the delayed fermentation ability of the type strain.
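An average GC content over a multi-scaffold assembly, such as the figure quoted in this abstract, is commonly computed as a length-weighted fraction. The Python sketch below shows one way to do that. It is an assumption about the general calculation, not the annotation pipeline used for this genome, and the name `gc_fraction` and the toy scaffolds are hypothetical.

```python
def gc_fraction(scaffolds) -> float:
    """Return the pooled G+C fraction over a list of scaffold sequences.

    Every symbol counts toward the total length, so ambiguous bases
    such as N lower the reported fraction; a real pipeline may choose
    to exclude them instead.
    """
    gc = sum(s.upper().count("G") + s.upper().count("C") for s in scaffolds)
    total = sum(len(s) for s in scaffolds)
    if total == 0:
        raise ValueError("no sequence data")
    return gc / total


# Hypothetical toy scaffolds of unequal length.
toy_scaffolds = ["GC", "ATATAT"]
print(gc_fraction(toy_scaffolds))

# The gene counts stated in the abstract add up as reported.
print(4028 + 76 == 4104)
```

Pooling the counts before dividing weights each scaffold by its length, which differs from averaging the per-scaffold percentages when scaffold sizes vary.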
Evidence of birth-and-death evolution of 5S rRNA gene in Channa species (Teleostei, Perciformes).
Barman, Anindya Sundar; Singh, Mamta; Singh, Rajeev Kumar; Lal, Kuldeep Kumar
2016-12-01
In higher eukaryotes, the minor rDNA family codes for 5S rRNA, is arranged in tandem arrays, and comprises a highly conserved 120 bp long coding sequence with a variable non-transcribed spacer (NTS). Initially, the 5S rDNA repeats were considered to have evolved by the process of concerted evolution. But some recent reports, including ones on teleost fishes, suggested that evolution of the 5S rDNA repeat does not fit into the concerted evolution model and that evolution of the 5S rDNA family may be explained by a birth-and-death evolution model. In order to study the mode of evolution of 5S rDNA repeats in Perciformes fish species, nucleotide sequence and molecular organization of five species of genus Channa were analyzed in the present study. Molecular analyses revealed several variants of 5S rDNA repeats (four types of NTS), and networks created by a neighbor net algorithm for each type of sequences (I, II, III and IV) did not show a clear clustering in a species-specific manner. The stable secondary structure was predicted, and upstream and downstream conserved regulatory elements were characterized. Sequence analyses also showed the presence of two putative pseudogenes in Channa marulius. The present study supported that 5S rDNA repeats in genus Channa evolved under the process of birth-and-death.
Paridaens, Tom; Van Wallendael, Glenn; De Neve, Wesley; Lambert, Peter
2017-05-15
The past decade has seen the introduction of new technologies that lowered the cost of genomic sequencing increasingly. We can even observe that the cost of sequencing is dropping significantly faster than the cost of storage and transmission. The latter motivates a need for continuous improvements in the area of genomic data compression, not only at the level of effectiveness (compression rate), but also at the level of functionality (e.g. random access), configurability (effectiveness versus complexity, coding tool set …) and versatility (support for both sequenced reads and assembled sequences). In that regard, we can point out that current approaches mostly do not support random access, requiring full files to be transmitted, and that current approaches are restricted to either read or sequence compression. We propose AFRESh, an adaptive framework for no-reference compression of genomic data with random access functionality, targeting the effective representation of the raw genomic symbol streams of both reads and assembled sequences. AFRESh makes use of a configurable set of prediction and encoding tools, extended by a Context-Adaptive Binary Arithmetic Coding scheme (CABAC), to compress raw genetic codes. To the best of our knowledge, our paper is the first to describe an effective implementation of CABAC outside of its original application. By applying CABAC, the compression effectiveness improves by up to 19% for assembled sequences and up to 62% for reads. By applying AFRESh to the genomic symbols of the MPEG genomic compression test set for reads, a compression gain is achieved of up to 51% compared to SCALCE, 42% compared to LFQC and 44% compared to ORCOM. When comparing to generic compression approaches, a compression gain is achieved of up to 41% compared to GNU Gzip and 22% compared to 7-Zip at the Ultra setting. Additionally, when compressing assembled sequences of the Human Genome, a compression gain is achieved up to 34% compared to GNU Gzip and 16% compared to 7-Zip at the Ultra setting. A Windows executable version can be downloaded at https://github.com/tparidae/AFresh. Contact: tom.paridaens@ugent.be.
2010-01-01
Background: Comparative sequence analysis of complex loci such as resistance gene analog clusters allows estimating the degree of sequence conservation and mechanisms of divergence at the intraspecies level. In banana (Musa sp.), two diploid wild species Musa acuminata (A genome) and Musa balbisiana (B genome) contribute to the polyploid genome of many cultivars. The M. balbisiana species is associated with vigour and tolerance to pests and disease and little is known on the genome structure and haplotype diversity within this species. Here, we compare two genomic sequences of 253 and 223 kb corresponding to two haplotypes of the RGA08 resistance gene analog locus in M. balbisiana "Pisang Klutuk Wulung" (PKW). Results: Sequence comparison revealed two regions of contrasting features. The first is a highly colinear gene-rich region where the two haplotypes diverge only by single nucleotide polymorphisms and two repetitive element insertions. The second corresponds to a large cluster of RGA08 genes, with 13 and 18 predicted RGA genes and pseudogenes spread over 131 and 152 kb respectively on each haplotype. The RGA08 cluster is enriched in repetitive element insertions, in duplicated non-coding intergenic sequences including low complexity regions and shows structural variations between haplotypes. Although some allelic relationships are retained, a large diversity of RGA08 genes occurs in this single M. balbisiana genotype, with several RGA08 paralogs specific to each haplotype. The RGA08 gene family has evolved by mechanisms of unequal recombination, intragenic sequence exchange and diversifying selection. An unequal recombination event taking place between duplicated non-coding intergenic sequences resulted in a different RGA08 gene content between haplotypes pointing out the role of such duplicated regions in the evolution of RGA clusters. Based on the synonymous substitution rate in coding sequences, we estimated a 1 million year divergence time for these M. balbisiana haplotypes. Conclusions: A large RGA08 gene cluster identified in wild banana corresponds to a highly variable genomic region between haplotypes surrounded by conserved flanking regions. High level of sequence identity (70 to 99%) of the genic and intergenic regions suggests a recent and rapid evolution of this cluster in M. balbisiana. PMID:20637079
Avsec, Žiga; Cheng, Jun; Gagneur, Julien
2018-01-01
Motivation: Regulatory sequences are defined not solely by their nucleic acid sequence but also by their relative distances to genomic landmarks such as transcription start sites, exon boundaries or polyadenylation sites. Deep learning has become the approach of choice for modeling regulatory sequences because of its ability to learn complex sequence features. However, modeling relative distances to genomic landmarks in deep neural networks has not been addressed. Results: Here we developed spline transformation, a neural network module based on splines to flexibly and robustly model distances. Modeling distances to various genomic landmarks with spline transformations significantly increased state-of-the-art prediction accuracy of in vivo RNA-binding protein binding sites for 120 out of 123 proteins. We also developed a deep neural network for predicting human splice branchpoints based on spline transformations that outperformed the current best, already distance-based, machine learning model. Compared to piecewise linear transformation, as obtained by composition of rectified linear units, spline transformation yields higher prediction accuracy as well as faster and more robust training. As spline transformation can be applied to quantities beyond distances, such as methylation or conservation, we foresee it as a versatile component in the genomics deep learning toolbox. Availability and implementation: Spline transformation is implemented as a Keras layer in the CONCISE python package: https://github.com/gagneurlab/concise. Analysis code is available at https://github.com/gagneurlab/Manuscript_Avsec_Bioinformatics_2017. Contact: avsec@in.tum.de or gagneur@in.tum.de. Supplementary information: Supplementary data are available at Bioinformatics online. PMID:29155928
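The idea of passing a scalar distance through a spline before it enters a network can be sketched in a few lines. This is a minimal NumPy illustration, not the CONCISE Keras layer: the function names `make_knots`, `bspline_basis` and `spline_transform`, the clamped uniform knot layout, and the cubic degree are all assumptions made for this example. The distance is expanded into B-spline basis functions, and the transformed value is a weighted sum of those basis activations; in a network, the weights would be trained.

```python
import numpy as np


def make_knots(lo, hi, n_bases, degree):
    """Clamped uniform knot vector giving exactly n_bases basis functions."""
    inner = np.linspace(lo, hi, n_bases - degree + 1)
    return np.concatenate([np.full(degree, lo), inner, np.full(degree, hi)])


def bspline_basis(x, knots, degree):
    """Evaluate every B-spline basis function at x (Cox-de Boor recursion)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    n_intervals = len(knots) - 1
    # Degree 0: indicator of each half-open knot interval [t_i, t_{i+1}).
    B = np.zeros((x.size, n_intervals))
    for i in range(n_intervals):
        B[:, i] = (knots[i] <= x) & (x < knots[i + 1])
    # Assign the right boundary to the last interval of positive length.
    last = np.max(np.nonzero(np.diff(knots) > 0)[0])
    B[x == knots[-1], last] = 1.0
    # Raise the degree one step at a time.
    for d in range(1, degree + 1):
        B_new = np.zeros((x.size, n_intervals - d))
        for i in range(n_intervals - d):
            left = knots[i + d] - knots[i]
            right = knots[i + d + 1] - knots[i + 1]
            if left > 0:
                B_new[:, i] += (x - knots[i]) / left * B[:, i]
            if right > 0:
                B_new[:, i] += (knots[i + d + 1] - x) / right * B[:, i + 1]
        B = B_new
    return B  # shape: (len(x), n_bases)


def spline_transform(dist, weights, lo, hi, degree=3):
    """Smooth scalar function of distance: basis expansion times weights."""
    weights = np.asarray(weights, dtype=float)
    knots = make_knots(lo, hi, len(weights), degree)
    x = np.clip(dist, lo, hi)  # distances outside the range are clamped
    return bspline_basis(x, knots, degree) @ weights


# Usage: ten weights (random here, trained in practice) over distances in [0, 1].
rng = np.random.default_rng(0)
w = rng.normal(size=10)
y = spline_transform(np.linspace(0.0, 1.0, 5), w, lo=0.0, hi=1.0)
```

Because the basis functions are non-negative and sum to one at every point in the range, the output is a smooth local average of the weights, which is one plausible reason such a layer trains more stably than a stack of rectified linear units.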
Performance comparison of leading image codecs: H.264/AVC Intra, JPEG2000, and Microsoft HD Photo
NASA Astrophysics Data System (ADS)
Tran, Trac D.; Liu, Lijie; Topiwala, Pankaj
2007-09-01
This paper provides a detailed rate-distortion performance comparison between JPEG2000, Microsoft HD Photo, and H.264/AVC High Profile 4:4:4 I-frame coding for high-resolution still images and high-definition (HD) 1080p video sequences. This work extends our comparative studies published at previous SPIE conferences [1, 2]. Here we further optimize all three codecs for compression performance. Coding simulations are performed on a set of large-format color images captured from mainstream digital cameras and on 1080p HD video sequences commonly used for H.264/AVC standardization work. Overall, our experimental results show that all three codecs offer very similar coding performance at the high-quality, high-resolution setting. Differences tend to be data-dependent: JPEG2000, with its wavelet technology, tends to be the best performer on smooth spatial data; H.264/AVC High Profile, with its advanced spatial prediction modes, tends to cope best with more complex visual content; Microsoft HD Photo tends to be the most consistent across the board. For the still-image data sets, JPEG2000 offers the best R-D performance gains (around 0.2 to 1 dB in peak signal-to-noise ratio) over H.264/AVC High Profile intra coding and Microsoft HD Photo. For the 1080p video data set, all three codecs offer very similar coding performance. As in [1, 2], we consider neither scalability nor complexity in this study (JPEG2000 is operating in non-scalable, but optimal performance mode).
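The gains above are quoted in peak signal-to-noise ratio (PSNR). As a reminder of what that metric measures, here is a minimal NumPy sketch of the standard definition, PSNR = 10 log10(peak^2 / MSE). The function name `psnr` and the 8-bit peak value of 255 are assumptions for this example; the abstract does not state the bit depth or whether PSNR was computed per channel or on luma only.

```python
import numpy as np


def psnr(reference, decoded, peak=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    ref = np.asarray(reference, dtype=np.float64)
    dec = np.asarray(decoded, dtype=np.float64)
    if ref.shape != dec.shape:
        raise ValueError("images must have the same shape")
    mse = np.mean((ref - dec) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)


# Usage: a synthetic 8-bit image and a copy with every pixel off by one level.
original = np.full((4, 4), 128, dtype=np.uint8)
distorted = original + 1
value_db = psnr(original, distorted)
```

On this scale a 1 dB gain at a fixed bit rate corresponds to a mean squared error that is about 21% lower, so the 0.2 to 1 dB range reported above is a modest difference.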
A Code Division Multiple Access Communication System for the Low Frequency Band.
1983-04-01
Keywords: frequency channels, spread-spectrum communication, complex sequences, orthogonal codes, impulsive noise. Abstract: ...their transmissions with signature sequences. Our LF/CDMA scheme is different in that each user's signature sequence set consists of M orthogonal sequences and thus log2 M ...
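The record above is truncated, so the details of the LF/CDMA scheme are not recoverable from it. One common reading, which is an assumption here, is M-ary orthogonal signaling: selecting one of M orthogonal sequences per symbol conveys log2 M bits. The sketch below illustrates that general technique with real-valued Walsh-Hadamard sequences as a stand-in for the report's complex sequences; the names `hadamard`, `encode` and `decode` are hypothetical.

```python
import numpy as np


def hadamard(m):
    """Sylvester-construction Hadamard matrix; rows are orthogonal +/-1 sequences."""
    if m < 1 or m & (m - 1):
        raise ValueError("m must be a power of two")
    h = np.array([[1]])
    while h.shape[0] < m:
        h = np.block([[h, h], [h, -h]])
    return h


def encode(bits, codebook):
    """Map log2(M) bits to one of the M sequences in the codebook."""
    k = codebook.shape[0].bit_length() - 1
    if len(bits) != k:
        raise ValueError(f"expected {k} bits per symbol")
    index = int("".join(str(b) for b in bits), 2)
    return codebook[index]


def decode(received, codebook):
    """Correlate against every sequence and return the bits of the best match."""
    k = codebook.shape[0].bit_length() - 1
    index = int(np.argmax(codebook @ received))
    return [(index >> (k - 1 - i)) & 1 for i in range(k)]


# Usage: M = 8 sequences carry 3 bits per transmitted symbol.
codebook = hadamard(8)
sent = encode([1, 0, 1], codebook)
recovered = decode(sent, codebook)
```

The correlation receiver works because the rows are mutually orthogonal: the matching sequence correlates to M while every other sequence correlates to zero in the noise-free case.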