Sample records for predicted protein coding

  1. Prediction of plant lncRNA by ensemble machine learning classifiers.

    PubMed

    Simopoulos, Caitlin M A; Weretilnyk, Elizabeth A; Golding, G Brian

    2018-05-02

    In plants, long non-protein coding RNAs are believed to have essential roles in development and stress responses. However, relative to advances on discerning biological roles for long non-protein coding RNAs in animal systems, this RNA class in plants is largely understudied. With comparatively few validated plant long non-coding RNAs, research on this potentially critical class of RNA is hindered by a lack of appropriate prediction tools and databases. Supervised learning models trained on data sets of mostly non-validated, non-coding transcripts have been previously used to identify this enigmatic RNA class with applications largely focused on animal systems. Our approach uses a training set comprised only of empirically validated long non-protein coding RNAs from plant, animal, and viral sources to predict and rank candidate long non-protein coding gene products for future functional validation. Individual stochastic gradient boosting and random forest classifiers trained on only empirically validated long non-protein coding RNAs were constructed. In order to use the strengths of multiple classifiers, we combined multiple models into a single stacking meta-learner. This ensemble approach benefits from the diversity of several learners to effectively identify putative plant long non-coding RNAs from transcript sequence features. When the predicted genes identified by the ensemble classifier were compared to those listed in GreeNC, an established plant long non-coding RNA database, overlap for predicted genes from Arabidopsis thaliana, Oryza sativa and Eutrema salsugineum ranged from 51 to 83% with the highest agreement in Eutrema salsugineum. Most of the highest ranking predictions from Arabidopsis thaliana were annotated as potential natural antisense genes, pseudogenes, transposable elements, or simply computationally predicted hypothetical protein. Due to the nature of this tool, the model can be updated as new long non-protein coding transcripts are identified and functionally verified. This ensemble classifier is an accurate tool that can be used to rank long non-protein coding RNA predictions for use in conjunction with gene expression studies. Selection of plant transcripts with a high potential for regulatory roles as long non-protein coding RNAs will advance research in the elucidation of long non-protein coding RNA function.

  2. Genomic analysis of organismal complexity in the multicellular green alga Volvox carteri

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Prochnik, Simon E.; Umen, James; Nedelcu, Aurora

    2010-07-01

    Analysis of the Volvox carteri genome reveals that this green alga's increased organismal complexity and multicellularity are associated with modifications in protein families shared with its unicellular ancestor, and not with large-scale innovations in protein coding capacity. The multicellular green alga Volvox carteri and its morphologically diverse close relatives (the volvocine algae) are uniquely suited for investigating the evolution of multicellularity and development. We sequenced the 138 Mb genome of V. carteri and compared its {approx}14,500 predicted proteins to those of its unicellular relative, Chlamydomonas reinhardtii. Despite fundamental differences in organismal complexity and life history, the two species have similarmore » protein-coding potentials, and few species-specific protein-coding gene predictions. Interestingly, volvocine algal-specific proteins are enriched in Volvox, including those associated with an expanded and highly compartmentalized extracellular matrix. Our analysis shows that increases in organismal complexity can be associated with modifications of lineage-specific proteins rather than large-scale invention of protein-coding capacity.« less

  3. RBind: computational network method to predict RNA binding sites.

    PubMed

    Wang, Kaili; Jian, Yiren; Wang, Huiwen; Zeng, Chen; Zhao, Yunjie

    2018-04-26

    Non-coding RNA molecules play essential roles by interacting with other molecules to perform various biological functions. However, it is difficult to determine RNA structures due to their flexibility. At present, the number of experimentally solved RNA-ligand and RNA-protein structures is still insufficient. Therefore, binding sites prediction of non-coding RNA is required to understand their functions. Current RNA binding site prediction algorithms produce many false positive nucleotides that are distance away from the binding sites. Here, we present a network approach, RBind, to predict the RNA binding sites. We benchmarked RBind in RNA-ligand and RNA-protein datasets. The average accuracy of 0.82 in RNA-ligand and 0.63 in RNA-protein testing showed that this network strategy has a reliable accuracy for binding sites prediction. The codes and datasets are available at https://zhaolab.com.cn/RBind. yjzhaowh@mail.ccnu.edu.cn. Supplementary data are available at Bioinformatics online.

  4. Polymerization of non-complementary RNA: systematic symmetric nucleotide exchanges mainly involving uracil produce mitochondrial RNA transcripts coding for cryptic overlapping genes.

    PubMed

    Seligmann, Hervé

    2013-03-01

    Usual DNA→RNA transcription exchanges T→U. Assuming different systematic symmetric nucleotide exchanges during translation, some GenBank RNAs match exactly human mitochondrial sequences (exchange rules listed in decreasing transcript frequencies): C↔U, A↔U, A↔U+C↔G (two nucleotide pairs exchanged), G↔U, A↔G, C↔G, none for A↔C, A↔G+C↔U, and A↔C+G↔U. Most unusual transcripts involve exchanging uracil. Independent measures of rates of rare replicational enzymatic DNA nucleotide misinsertions predict frequencies of RNA transcripts systematically exchanging the corresponding misinserted nucleotides. Exchange transcripts self-hybridize less than other gene regions, self-hybridization increases with length, suggesting endoribonuclease-limited elongation. Blast detects stop codon depleted putative protein coding overlapping genes within exchange-transcribed mitochondrial genes. These align with existing GenBank proteins (mainly metazoan origins, prokaryotic and viral origins underrepresented). These GenBank proteins frequently interact with RNA/DNA, are membrane transporters, or are typical of mitochondrial metabolism. Nucleotide exchange transcript frequencies increase with overlapping gene densities and stop densities, indicating finely tuned counterbalancing regulation of expression of systematic symmetric nucleotide exchange-encrypted proteins. Such expression necessitates combined activities of suppressor tRNAs matching stops, and nucleotide exchange transcription. Two independent properties confirm predicted exchanged overlap coding genes: discrepancy of third codon nucleotide contents from replicational deamination gradients, and codon usage according to circular code predictions. Predictions from both properties converge, especially for frequent nucleotide exchange types. Nucleotide exchanging transcription apparently increases coding densities of protein coding genes without lengthening genomes, revealing unsuspected functional DNA coding potential. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.

  5. Transcription Factor Binding Profiles Reveal Cyclic Expression of Human Protein-coding Genes and Non-coding RNAs

    PubMed Central

    Cheng, Chao; Ung, Matthew; Grant, Gavin D.; Whitfield, Michael L.

    2013-01-01

    Cell cycle is a complex and highly supervised process that must proceed with regulatory precision to achieve successful cellular division. Despite the wide application, microarray time course experiments have several limitations in identifying cell cycle genes. We thus propose a computational model to predict human cell cycle genes based on transcription factor (TF) binding and regulatory motif information in their promoters. We utilize ENCODE ChIP-seq data and motif information as predictors to discriminate cell cycle against non-cell cycle genes. Our results show that both the trans- TF features and the cis- motif features are predictive of cell cycle genes, and a combination of the two types of features can further improve prediction accuracy. We apply our model to a complete list of GENCODE promoters to predict novel cell cycle driving promoters for both protein-coding genes and non-coding RNAs such as lincRNAs. We find that a similar percentage of lincRNAs are cell cycle regulated as protein-coding genes, suggesting the importance of non-coding RNAs in cell cycle division. The model we propose here provides not only a practical tool for identifying novel cell cycle genes with high accuracy, but also new insights on cell cycle regulation by TFs and cis-regulatory elements. PMID:23874175

  6. Analysis and recognition of 5′ UTR intron splice sites in human pre-mRNA

    PubMed Central

    Eden, E.; Brunak, S.

    2004-01-01

    Prediction of splice sites in non-coding regions of genes is one of the most challenging aspects of gene structure recognition. We perform a rigorous analysis of such splice sites embedded in human 5′ untranslated regions (UTRs), and investigate correlations between this class of splice sites and other features found in the adjacent exons and introns. By restricting the training of neural network algorithms to ‘pure’ UTRs (not extending partially into protein coding regions), we for the first time investigate the predictive power of the splicing signal proper, in contrast to conventional splice site prediction, which typically relies on the change in sequence at the transition from protein coding to non-coding. By doing so, the algorithms were able to pick up subtler splicing signals that were otherwise masked by ‘coding’ noise, thus enhancing significantly the prediction of 5′ UTR splice sites. For example, the non-coding splice site predicting networks pick up compositional and positional bias in the 3′ ends of non-coding exons and 5′ non-coding intron ends, where cytosine and guanine are over-represented. This compositional bias at the true UTR donor sites is also visible in the synaptic weights of the neural networks trained to identify UTR donor sites. Conventional splice site prediction methods perform poorly in UTRs because the reading frame pattern is absent. The NetUTR method presented here performs 2–3-fold better compared with NetGene2 and GenScan in 5′ UTRs. We also tested the 5′ UTR trained method on protein coding regions, and discovered, surprisingly, that it works quite well (although it cannot compete with NetGene2). This indicates that the local splicing pattern in UTRs and coding regions is largely the same. The NetUTR method is made publicly available at www.cbs.dtu.dk/services/NetUTR. PMID:14960723

  7. Experimental annotation of post-translational features and translated coding regions in the pathogen Salmonella Typhimurium

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ansong, Charles; Tolic, Nikola; Purvine, Samuel O.

    Complete and accurate genome annotation is crucial for comprehensive and systematic studies of biological systems. For example systems biology-oriented genome scale modeling efforts greatly benefit from accurate annotation of protein-coding genes to develop proper functioning models. However, determining protein-coding genes for most new genomes is almost completely performed by inference, using computational predictions with significant documented error rates (> 15%). Furthermore, gene prediction programs provide no information on biologically important post-translational processing events critical for protein function. With the ability to directly measure peptides arising from expressed proteins, mass spectrometry-based proteomics approaches can be used to augment and verify codingmore » regions of a genomic sequence and importantly detect post-translational processing events. In this study we utilized “shotgun” proteomics to guide accurate primary genome annotation of the bacterial pathogen Salmonella Typhimurium 14028 to facilitate a systems-level understanding of Salmonella biology. The data provides protein-level experimental confirmation for 44% of predicted protein-coding genes, suggests revisions to 48 genes assigned incorrect translational start sites, and uncovers 13 non-annotated genes missed by gene prediction programs. We also present a comprehensive analysis of post-translational processing events in Salmonella, revealing a wide range of complex chemical modifications (70 distinct modifications) and confirming more than 130 signal peptide and N-terminal methionine cleavage events in Salmonella. This study highlights several ways in which proteomics data applied during the primary stages of annotation can improve the quality of genome annotations, especially with regards to the annotation of mature protein products.« less

  8. Prediction of protein-protein interactions based on PseAA composition and hybrid feature selection.

    PubMed

    Liu, Liang; Cai, Yudong; Lu, Wencong; Feng, Kaiyan; Peng, Chunrong; Niu, Bing

    2009-03-06

    Based on pseudo amino acid (PseAA) composition and a novel hybrid feature selection frame, this paper presents a computational system to predict the PPIs (protein-protein interactions) using 8796 protein pairs. These pairs are coded by PseAA composition, resulting in 114 features. A hybrid feature selection system, mRMR-KNNs-wrapper, is applied to obtain an optimized feature set by excluding poor-performed and/or redundant features, resulting in 103 remaining features. Using the optimized 103-feature subset, a prediction model is trained and tested in the k-nearest neighbors (KNNs) learning system. This prediction model achieves an overall accurate prediction rate of 76.18%, evaluated by 10-fold cross-validation test, which is 1.46% higher than using the initial 114 features and is 6.51% higher than the 20 features, coded by amino acid compositions. The PPIs predictor, developed for this research, is available for public use at http://chemdata.shu.edu.cn/ppi.

  9. Context influences on TALE–DNA binding revealed by quantitative profiling

    PubMed Central

    Rogers, Julia M.; Barrera, Luis A.; Reyon, Deepak; Sander, Jeffry D.; Kellis, Manolis; Joung, J Keith; Bulyk, Martha L.

    2015-01-01

    Transcription activator-like effector (TALE) proteins recognize DNA using a seemingly simple DNA-binding code, which makes them attractive for use in genome engineering technologies that require precise targeting. Although this code is used successfully to design TALEs to target specific sequences, off-target binding has been observed and is difficult to predict. Here we explore TALE–DNA interactions comprehensively by quantitatively assaying the DNA-binding specificities of 21 representative TALEs to ∼5,000–20,000 unique DNA sequences per protein using custom-designed protein-binding microarrays (PBMs). We find that protein context features exert significant influences on binding. Thus, the canonical recognition code does not fully capture the complexity of TALE–DNA binding. We used the PBM data to develop a computational model, Specificity Inference For TAL-Effector Design (SIFTED), to predict the DNA-binding specificity of any TALE. We provide SIFTED as a publicly available web tool that predicts potential genomic off-target sites for improved TALE design. PMID:26067805

  10. Context influences on TALE-DNA binding revealed by quantitative profiling.

    PubMed

    Rogers, Julia M; Barrera, Luis A; Reyon, Deepak; Sander, Jeffry D; Kellis, Manolis; Joung, J Keith; Bulyk, Martha L

    2015-06-11

    Transcription activator-like effector (TALE) proteins recognize DNA using a seemingly simple DNA-binding code, which makes them attractive for use in genome engineering technologies that require precise targeting. Although this code is used successfully to design TALEs to target specific sequences, off-target binding has been observed and is difficult to predict. Here we explore TALE-DNA interactions comprehensively by quantitatively assaying the DNA-binding specificities of 21 representative TALEs to ∼5,000-20,000 unique DNA sequences per protein using custom-designed protein-binding microarrays (PBMs). We find that protein context features exert significant influences on binding. Thus, the canonical recognition code does not fully capture the complexity of TALE-DNA binding. We used the PBM data to develop a computational model, Specificity Inference For TAL-Effector Design (SIFTED), to predict the DNA-binding specificity of any TALE. We provide SIFTED as a publicly available web tool that predicts potential genomic off-target sites for improved TALE design.

  11. Cloud prediction of protein structure and function with PredictProtein for Debian.

    PubMed

    Kaján, László; Yachdav, Guy; Vicedo, Esmeralda; Steinegger, Martin; Mirdita, Milot; Angermüller, Christof; Böhm, Ariane; Domke, Simon; Ertl, Julia; Mertes, Christian; Reisinger, Eva; Staniewski, Cedric; Rost, Burkhard

    2013-01-01

    We report the release of PredictProtein for the Debian operating system and derivatives, such as Ubuntu, Bio-Linux, and Cloud BioLinux. The PredictProtein suite is available as a standard set of open source Debian packages. The release covers the most popular prediction methods from the Rost Lab, including methods for the prediction of secondary structure and solvent accessibility (profphd), nuclear localization signals (predictnls), and intrinsically disordered regions (norsnet). We also present two case studies that successfully utilize PredictProtein packages for high performance computing in the cloud: the first analyzes protein disorder for whole organisms, and the second analyzes the effect of all possible single sequence variants in protein coding regions of the human genome.

  12. Cloud Prediction of Protein Structure and Function with PredictProtein for Debian

    PubMed Central

    Kaján, László; Yachdav, Guy; Vicedo, Esmeralda; Steinegger, Martin; Mirdita, Milot; Angermüller, Christof; Böhm, Ariane; Domke, Simon; Ertl, Julia; Mertes, Christian; Reisinger, Eva; Rost, Burkhard

    2013-01-01

    We report the release of PredictProtein for the Debian operating system and derivatives, such as Ubuntu, Bio-Linux, and Cloud BioLinux. The PredictProtein suite is available as a standard set of open source Debian packages. The release covers the most popular prediction methods from the Rost Lab, including methods for the prediction of secondary structure and solvent accessibility (profphd), nuclear localization signals (predictnls), and intrinsically disordered regions (norsnet). We also present two case studies that successfully utilize PredictProtein packages for high performance computing in the cloud: the first analyzes protein disorder for whole organisms, and the second analyzes the effect of all possible single sequence variants in protein coding regions of the human genome. PMID:23971032

  13. 3D RNA and functional interactions from evolutionary couplings

    PubMed Central

    Weinreb, Caleb; Riesselman, Adam; Ingraham, John B.; Gross, Torsten; Sander, Chris; Marks, Debora S.

    2016-01-01

    Summary Non-coding RNAs are ubiquitous, but the discovery of new RNA gene sequences far outpaces research on their structure and functional interactions. We mine the evolutionary sequence record to derive precise information about function and structure of RNAs and RNA-protein complexes. As in protein structure prediction, we use maximum entropy global probability models of sequence co-variation to infer evolutionarily constrained nucleotide-nucleotide interactions within RNA molecules, and nucleotide-amino acid interactions in RNA-protein complexes. The predicted contacts allow all-atom blinded 3D structure prediction at good accuracy for several known RNA structures and RNA-protein complexes. For unknown structures, we predict contacts in 160 non-coding RNA families. Beyond 3D structure prediction, evolutionary couplings help identify important functional interactions, e.g., at switch points in riboswitches and at a complex nucleation site in HIV. Aided by accelerating sequence accumulation, evolutionary coupling analysis can accelerate the discovery of functional interactions and 3D structures involving RNA. PMID:27087444

  14. N-terminal Proteomics Assisted Profiling of the Unexplored Translation Initiation Landscape in Arabidopsis thaliana *

    PubMed Central

    Ndah, Elvis; Jonckheere, Veronique

    2017-01-01

    Proteogenomics is an emerging research field yet lacking a uniform method of analysis. Proteogenomic studies in which N-terminal proteomics and ribosome profiling are combined, suggest that a high number of protein start sites are currently missing in genome annotations. We constructed a proteogenomic pipeline specific for the analysis of N-terminal proteomics data, with the aim of discovering novel translational start sites outside annotated protein coding regions. In summary, unidentified MS/MS spectra were matched to a specific N-terminal peptide library encompassing protein N termini encoded in the Arabidopsis thaliana genome. After a stringent false discovery rate filtering, 117 protein N termini compliant with N-terminal methionine excision specificity and indicative of translation initiation were found. These include N-terminal protein extensions and translation from transposable elements and pseudogenes. Gene prediction provided supporting protein-coding models for approximately half of the protein N termini. Besides the prediction of functional domains (partially) contained within the newly predicted ORFs, further supporting evidence of translation was found in the recently released Araport11 genome re-annotation of Arabidopsis and computational translations of sequences stored in public repositories. Most interestingly, complementary evidence by ribosome profiling was found for 23 protein N termini. Finally, by analyzing protein N-terminal peptides, an in silico analysis demonstrates the applicability of our N-terminal proteogenomics strategy in revealing protein-coding potential in species with well- and poorly-annotated genomes. PMID:28432195

  15. N-terminal Proteomics Assisted Profiling of the Unexplored Translation Initiation Landscape in Arabidopsis thaliana.

    PubMed

    Willems, Patrick; Ndah, Elvis; Jonckheere, Veronique; Stael, Simon; Sticker, Adriaan; Martens, Lennart; Van Breusegem, Frank; Gevaert, Kris; Van Damme, Petra

    2017-06-01

    Proteogenomics is an emerging research field yet lacking a uniform method of analysis. Proteogenomic studies in which N-terminal proteomics and ribosome profiling are combined, suggest that a high number of protein start sites are currently missing in genome annotations. We constructed a proteogenomic pipeline specific for the analysis of N-terminal proteomics data, with the aim of discovering novel translational start sites outside annotated protein coding regions. In summary, unidentified MS/MS spectra were matched to a specific N-terminal peptide library encompassing protein N termini encoded in the Arabidopsis thaliana genome. After a stringent false discovery rate filtering, 117 protein N termini compliant with N-terminal methionine excision specificity and indicative of translation initiation were found. These include N-terminal protein extensions and translation from transposable elements and pseudogenes. Gene prediction provided supporting protein-coding models for approximately half of the protein N termini. Besides the prediction of functional domains (partially) contained within the newly predicted ORFs, further supporting evidence of translation was found in the recently released Araport11 genome re-annotation of Arabidopsis and computational translations of sequences stored in public repositories. Most interestingly, complementary evidence by ribosome profiling was found for 23 protein N termini. Finally, by analyzing protein N-terminal peptides, an in silico analysis demonstrates the applicability of our N-terminal proteogenomics strategy in revealing protein-coding potential in species with well- and poorly-annotated genomes. © 2017 by The American Society for Biochemistry and Molecular Biology, Inc.

  16. Prediction of G-protein-coupled receptor classes in low homology using Chou's pseudo amino acid composition with approximate entropy and hydrophobicity patterns.

    PubMed

    Gu, Q; Ding, Y S; Zhang, T L

    2010-05-01

    We use approximate entropy and hydrophobicity patterns to predict G-protein-coupled receptors. Adaboost classifier is adopted as the prediction engine. A low homology dataset is used to validate the proposed method. Compared with the results reported, the successful rate is encouraging. The source code is written by Matlab.

  17. Ribonucleoprotein complexes in neurologic diseases.

    PubMed

    Ule, Jernej

    2008-10-01

    Ribonucleoprotein (RNP) complexes regulate the tissue-specific RNA processing and transport that increases the coding capacity of our genome and the ability to respond quickly and precisely to the diverse set of signals. This review focuses on three proteins that are part of RNP complexes in most cells of our body: TAR DNA-binding protein (TDP-43), the survival motor neuron protein (SMN), and fragile-X mental retardation protein (FMRP). In particular, the review asks the question why these ubiquitous proteins are primarily associated with defects in specific regions of the central nervous system? To understand this question, it is important to understand the role of genetic and cellular environment in causing the defect in the protein, as well as how the defective protein leads to misregulation of specific target RNAs. Two approaches for comprehensive analysis of defective RNA-protein interactions are presented. The first approach defines the RNA code or the collection of proteins that bind to a certain cis-acting RNA site in order to lead to a predictable outcome. The second approach defines the RNA map or the summary of positions on target RNAs where binding of a particular RNA-binding protein leads to a predictable outcome. As we learn more about the RNA codes and maps that guide the action of the dynamic RNP world in our brain, possibilities for new treatments of neurologic diseases are bound to emerge.

  18. Improved method for predicting protein fold patterns with ensemble classifiers.

    PubMed

    Chen, W; Liu, X; Huang, Y; Jiang, Y; Zou, Q; Lin, C

    2012-01-27

    Protein folding is recognized as a critical problem in the field of biophysics in the 21st century. Predicting protein-folding patterns is challenging due to the complex structure of proteins. In an attempt to solve this problem, we employed ensemble classifiers to improve prediction accuracy. In our experiments, 188-dimensional features were extracted based on the composition and physical-chemical property of proteins and 20-dimensional features were selected using a coupled position-specific scoring matrix. Compared with traditional prediction methods, these methods were superior in terms of prediction accuracy. The 188-dimensional feature-based method achieved 71.2% accuracy in five cross-validations. The accuracy rose to 77% when we used a 20-dimensional feature vector. These methods were used on recent data, with 54.2% accuracy. Source codes and dataset, together with web server and software tools for prediction, are available at: http://datamining.xmu.edu.cn/main/~cwc/ProteinPredict.html.

  19. Chromosome preference of disease genes and vectorization for the prediction of non-coding disease genes.

    PubMed

    Peng, Hui; Lan, Chaowang; Liu, Yuansheng; Liu, Tao; Blumenstein, Michael; Li, Jinyan

    2017-10-03

    Disease-related protein-coding genes have been widely studied, but disease-related non-coding genes remain largely unknown. This work introduces a new vector to represent diseases, and applies the newly vectorized data for a positive-unlabeled learning algorithm to predict and rank disease-related long non-coding RNA (lncRNA) genes. This novel vector representation for diseases consists of two sub-vectors, one is composed of 45 elements, characterizing the information entropies of the disease genes distribution over 45 chromosome substructures. This idea is supported by our observation that some substructures (e.g., the chromosome 6 p-arm) are highly preferred by disease-related protein coding genes, while some (e.g., the 21 p-arm) are not favored at all. The second sub-vector is 30-dimensional, characterizing the distribution of disease gene enriched KEGG pathways in comparison with our manually created pathway groups. The second sub-vector complements with the first one to differentiate between various diseases. Our prediction method outperforms the state-of-the-art methods on benchmark datasets for prioritizing disease related lncRNA genes. The method also works well when only the sequence information of an lncRNA gene is known, or even when a given disease has no currently recognized long non-coding genes.

  20. Chromosome preference of disease genes and vectorization for the prediction of non-coding disease genes

    PubMed Central

    Peng, Hui; Lan, Chaowang; Liu, Yuansheng; Liu, Tao; Blumenstein, Michael; Li, Jinyan

    2017-01-01

    Disease-related protein-coding genes have been widely studied, but disease-related non-coding genes remain largely unknown. This work introduces a new vector to represent diseases, and applies the newly vectorized data for a positive-unlabeled learning algorithm to predict and rank disease-related long non-coding RNA (lncRNA) genes. This novel vector representation for diseases consists of two sub-vectors, one is composed of 45 elements, characterizing the information entropies of the disease genes distribution over 45 chromosome substructures. This idea is supported by our observation that some substructures (e.g., the chromosome 6 p-arm) are highly preferred by disease-related protein coding genes, while some (e.g., the 21 p-arm) are not favored at all. The second sub-vector is 30-dimensional, characterizing the distribution of disease gene enriched KEGG pathways in comparison with our manually created pathway groups. The second sub-vector complements with the first one to differentiate between various diseases. Our prediction method outperforms the state-of-the-art methods on benchmark datasets for prioritizing disease related lncRNA genes. The method also works well when only the sequence information of an lncRNA gene is known, or even when a given disease has no currently recognized long non-coding genes. PMID:29108274

  1. Identification of Methanococcus Jannaschii Proteins in 2-D Gel Electrophoresis Patterns by Mass Spectrometry

    DOE R&D Accomplishments Database

    Liang, X.

    1998-06-10

    The genome of Methanococcus jannaschii has been sequenced completely and has been found to contain approximately 1,770 predicted protein-coding regions. When these coding regions are expressed and how their expression is regulated, however, remain open questions. In this work, mass spectrometry was combined with two-dimensional gel electrophoresis to identify which proteins the genes produce under different growth conditions, and thus investigate the regulation of genes responsible for functions characteristic of this thermophilic representative of the methanogenic Archaea.

  2. Deep transcriptome annotation enables the discovery and functional characterization of cryptic small proteins

    PubMed Central

    Delcourt, Vivian; Lucier, Jean-François; Gagnon, Jules; Beaudoin, Maxime C; Vanderperre, Benoît; Breton, Marc-André; Motard, Julie; Jacques, Jean-François; Brunelle, Mylène; Gagnon-Arsenault, Isabelle; Fournier, Isabelle; Ouangraoua, Aida; Hunting, Darel J; Cohen, Alan A; Landry, Christian R; Scott, Michelle S

    2017-01-01

    Recent functional, proteomic and ribosome profiling studies in eukaryotes have concurrently demonstrated the translation of alternative open-reading frames (altORFs) in addition to annotated protein coding sequences (CDSs). We show that a large number of small proteins could in fact be coded by these altORFs. The putative alternative proteins translated from altORFs have orthologs in many species and contain functional domains. Evolutionary analyses indicate that altORFs often show more extreme conservation patterns than their CDSs. Thousands of alternative proteins are detected in proteomic datasets by reanalysis using a database containing predicted alternative proteins. This is illustrated with specific examples, including altMiD51, a 70 amino acid mitochondrial fission-promoting protein encoded in MiD51/Mief1/SMCR7L, a gene encoding an annotated protein promoting mitochondrial fission. Our results suggest that many genes are multicoding genes and code for a large protein and one or several small proteins. PMID:29083303

  3. CodingQuarry: highly accurate hidden Markov model gene prediction in fungal genomes using RNA-seq transcripts.

    PubMed

    Testa, Alison C; Hane, James K; Ellwood, Simon R; Oliver, Richard P

    2015-03-11

    The impact of gene annotation quality on functional and comparative genomics makes gene prediction an important process, particularly in non-model species, including many fungi. Sets of homologous protein sequences are rarely complete with respect to the fungal species of interest and are often small or unreliable, especially when closely related species have not been sequenced or annotated in detail. In these cases, protein homology-based evidence fails to correctly annotate many genes, or significantly improve ab initio predictions. Generalised hidden Markov models (GHMM) have proven to be invaluable tools in gene annotation and, recently, RNA-seq has emerged as a cost-effective means to significantly improve the quality of automated gene annotation. As these methods do not require sets of homologous proteins, improving gene prediction from these resources is of benefit to fungal researchers. While many pipelines now incorporate RNA-seq data in training GHMMs, there has been relatively little investigation into additionally combining RNA-seq data at the point of prediction, and room for improvement in this area motivates this study. CodingQuarry is a highly accurate, self-training GHMM fungal gene predictor designed to work with assembled, aligned RNA-seq transcripts. RNA-seq data informs annotations both during gene-model training and in prediction. Our approach capitalises on the high quality of fungal transcript assemblies by incorporating predictions made directly from transcript sequences. Correct predictions are made despite transcript assembly problems, including those caused by overlap between the transcripts of adjacent gene loci. Stringent benchmarking against high-confidence annotation subsets showed CodingQuarry predicted 91.3% of Schizosaccharomyces pombe genes and 90.4% of Saccharomyces cerevisiae genes perfectly. These results are 4-5% better than those of AUGUSTUS, the next best performing RNA-seq driven gene predictor tested. Comparisons against whole genome Sc. pombe and S. cerevisiae annotations further substantiate a 4-5% improvement in the number of correctly predicted genes. We demonstrate the success of a novel method of incorporating RNA-seq data into GHMM fungal gene prediction. This shows that a high quality annotation can be achieved without relying on protein homology or a training set of genes. CodingQuarry is freely available ( https://sourceforge.net/projects/codingquarry/ ), and suitable for incorporation into genome annotation pipelines.

  4. GeneBuilder: interactive in silico prediction of gene structure.

    PubMed

    Milanesi, L; D'Angelo, D; Rogozin, I B

    1999-01-01

    Prediction of gene structure in newly sequenced DNA becomes very important in large genome sequencing projects. This problem is complicated due to the exon-intron structure of eukaryotic genes and because gene expression is regulated by many different short nucleotide domains. In order to be able to analyse the full gene structure in different organisms, it is necessary to combine information about potential functional signals (promoter region, splice sites, start and stop codons, 3' untranslated region) together with the statistical properties of coding sequences (coding potential), information about homologous proteins, ESTs and repeated elements. We have developed the GeneBuilder system which is based on prediction of functional signals and coding regions by different approaches in combination with similarity searches in proteins and EST databases. The potential gene structure models are obtained by using a dynamic programming method. The program permits the use of several parameters for gene structure prediction and refinement. During gene model construction, selecting different exon homology levels with a protein sequence selected from a list of homologous proteins can improve the accuracy of the gene structure prediction. In the case of low homology, GeneBuilder is still able to predict the gene structure. The GeneBuilder system has been tested by using the standard set (Burset and Guigo, Genomics, 34, 353-367, 1996) and the performances are: 0.89 sensitivity and 0.91 specificity at the nucleotide level. The total correlation coefficient is 0.88. The GeneBuilder system is implemented as a part of the WebGene a the URL: http://www.itba.mi. cnr.it/webgene and TRADAT (TRAncription Database and Analysis Tools) launcher URL: http://www.itba.mi.cnr.it/tradat.

  5. Sequence and analysis of chromosome 4 of the plant Arabidopsis thaliana.

    PubMed

    Mayer, K; Schüller, C; Wambutt, R; Murphy, G; Volckaert, G; Pohl, T; Düsterhöft, A; Stiekema, W; Entian, K D; Terryn, N; Harris, B; Ansorge, W; Brandt, P; Grivell, L; Rieger, M; Weichselgartner, M; de Simone, V; Obermaier, B; Mache, R; Müller, M; Kreis, M; Delseny, M; Puigdomenech, P; Watson, M; Schmidtheini, T; Reichert, B; Portatelle, D; Perez-Alonso, M; Boutry, M; Bancroft, I; Vos, P; Hoheisel, J; Zimmermann, W; Wedler, H; Ridley, P; Langham, S A; McCullagh, B; Bilham, L; Robben, J; Van der Schueren, J; Grymonprez, B; Chuang, Y J; Vandenbussche, F; Braeken, M; Weltjens, I; Voet, M; Bastiaens, I; Aert, R; Defoor, E; Weitzenegger, T; Bothe, G; Ramsperger, U; Hilbert, H; Braun, M; Holzer, E; Brandt, A; Peters, S; van Staveren, M; Dirske, W; Mooijman, P; Klein Lankhorst, R; Rose, M; Hauf, J; Kötter, P; Berneiser, S; Hempel, S; Feldpausch, M; Lamberth, S; Van den Daele, H; De Keyser, A; Buysshaert, C; Gielen, J; Villarroel, R; De Clercq, R; Van Montagu, M; Rogers, J; Cronin, A; Quail, M; Bray-Allen, S; Clark, L; Doggett, J; Hall, S; Kay, M; Lennard, N; McLay, K; Mayes, R; Pettett, A; Rajandream, M A; Lyne, M; Benes, V; Rechmann, S; Borkova, D; Blöcker, H; Scharfe, M; Grimm, M; Löhnert, T H; Dose, S; de Haan, M; Maarse, A; Schäfer, M; Müller-Auer, S; Gabel, C; Fuchs, M; Fartmann, B; Granderath, K; Dauner, D; Herzl, A; Neumann, S; Argiriou, A; Vitale, D; Liguori, R; Piravandi, E; Massenet, O; Quigley, F; Clabauld, G; Mündlein, A; Felber, R; Schnabl, S; Hiller, R; Schmidt, W; Lecharny, A; Aubourg, S; Chefdor, F; Cooke, R; Berger, C; Montfort, A; Casacuberta, E; Gibbons, T; Weber, N; Vandenbol, M; Bargues, M; Terol, J; Torres, A; Perez-Perez, A; Purnelle, B; Bent, E; Johnson, S; Tacon, D; Jesse, T; Heijnen, L; Schwarz, S; Scholler, P; Heber, S; Francs, P; Bielke, C; Frishman, D; Haase, D; Lemcke, K; Mewes, H W; Stocker, S; Zaccaria, P; Bevan, M; Wilson, R K; de la Bastide, M; Habermann, K; Parnell, L; Dedhia, N; Gnoj, L; Schutz, K; Huang, E; Spiegel, L; Sehkon, M; Murray, J; Sheet, P; Cordes, M; Abu-Threideh, J; Stoneking, T; Kalicki, J; Graves, T; Harmon, G; Edwards, J; Latreille, P; Courtney, L; Cloud, J; Abbott, A; Scott, K; Johnson, D; Minx, P; Bentley, D; Fulton, B; Miller, N; Greco, T; Kemp, K; Kramer, J; Fulton, L; Mardis, E; Dante, M; Pepin, K; Hillier, L; Nelson, J; Spieth, J; Ryan, E; Andrews, S; Geisel, C; Layman, D; Du, H; Ali, J; Berghoff, A; Jones, K; Drone, K; Cotton, M; Joshu, C; Antonoiu, B; Zidanic, M; Strong, C; Sun, H; Lamar, B; Yordan, C; Ma, P; Zhong, J; Preston, R; Vil, D; Shekher, M; Matero, A; Shah, R; Swaby, I K; O'Shaughnessy, A; Rodriguez, M; Hoffmann, J; Till, S; Granat, S; Shohdy, N; Hasegawa, A; Hameed, A; Lodhi, M; Johnson, A; Chen, E; Marra, M; Martienssen, R; McCombie, W R

    1999-12-16

    The higher plant Arabidopsis thaliana (Arabidopsis) is an important model for identifying plant genes and determining their function. To assist biological investigations and to define chromosome structure, a coordinated effort to sequence the Arabidopsis genome was initiated in late 1996. Here we report one of the first milestones of this project, the sequence of chromosome 4. Analysis of 17.38 megabases of unique sequence, representing about 17% of the genome, reveals 3,744 protein coding genes, 81 transfer RNAs and numerous repeat elements. Heterochromatic regions surrounding the putative centromere, which has not yet been completely sequenced, are characterized by an increased frequency of a variety of repeats, new repeats, reduced recombination, lowered gene density and lowered gene expression. Roughly 60% of the predicted protein-coding genes have been functionally characterized on the basis of their homology to known genes. Many genes encode predicted proteins that are homologous to human and Caenorhabditis elegans proteins.

  6. Genome Sequences of Three Cluster AU Arthrobacter Phages, Caterpillar, Nightmare, and Teacup

    PubMed Central

    Adair, Tamarah L.; Stowe, Emily; Pizzorno, Marie C.; Krukonis, Gregory; Harrison, Melinda; Garlena, Rebecca A.; Russell, Daniel A.; Jacobs-Sera, Deborah

    2017-01-01

    ABSTRACT Caterpillar, Nightmare, and Teacup are cluster AU siphoviral phages isolated from enriched soil on Arthrobacter sp. strain ATCC 21022. These genomes are 58 kbp long with an average G+C content of 50%. Sequence analysis predicts 86 to 92 protein-coding genes, including a large number of small proteins with predicted transmembrane domains. PMID:29122860

  7. ChIPBase v2.0: decoding transcriptional regulatory networks of non-coding RNAs and protein-coding genes from ChIP-seq data.

    PubMed

    Zhou, Ke-Ren; Liu, Shun; Sun, Wen-Ju; Zheng, Ling-Ling; Zhou, Hui; Yang, Jian-Hua; Qu, Liang-Hu

    2017-01-04

    The abnormal transcriptional regulation of non-coding RNAs (ncRNAs) and protein-coding genes (PCGs) is contributed to various biological processes and linked with human diseases, but the underlying mechanisms remain elusive. In this study, we developed ChIPBase v2.0 (http://rna.sysu.edu.cn/chipbase/) to explore the transcriptional regulatory networks of ncRNAs and PCGs. ChIPBase v2.0 has been expanded with ∼10 200 curated ChIP-seq datasets, which represent about 20 times expansion when comparing to the previous released version. We identified thousands of binding motif matrices and their binding sites from ChIP-seq data of DNA-binding proteins and predicted millions of transcriptional regulatory relationships between transcription factors (TFs) and genes. We constructed 'Regulator' module to predict hundreds of TFs and histone modifications that were involved in or affected transcription of ncRNAs and PCGs. Moreover, we built a web-based tool, Co-Expression, to explore the co-expression patterns between DNA-binding proteins and various types of genes by integrating the gene expression profiles of ∼10 000 tumor samples and ∼9100 normal tissues and cell lines. ChIPBase also provides a ChIP-Function tool and a genome browser to predict functions of diverse genes and visualize various ChIP-seq data. This study will greatly expand our understanding of the transcriptional regulations of ncRNAs and PCGs. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.

  8. osFP: a web server for predicting the oligomeric states of fluorescent proteins.

    PubMed

    Simeon, Saw; Shoombuatong, Watshara; Anuwongcharoen, Nuttapat; Preeyanon, Likit; Prachayasittikul, Virapong; Wikberg, Jarl E S; Nantasenamat, Chanin

    2016-01-01

    Currently, monomeric fluorescent proteins (FP) are ideal markers for protein tagging. The prediction of oligomeric states is helpful for enhancing live biomedical imaging. Computational prediction of FP oligomeric states can accelerate the effort of protein engineering efforts of creating monomeric FPs. To the best of our knowledge, this study represents the first computational model for predicting and analyzing FP oligomerization directly from the amino acid sequence. After data curation, an exhaustive data set consisting of 397 non-redundant FP oligomeric states was compiled from the literature. Results from benchmarking of the protein descriptors revealed that the model built with amino acid composition descriptors was the top performing model with accuracy, sensitivity and specificity in excess of 80% and MCC greater than 0.6 for all three data subsets (e.g. training, tenfold cross-validation and external sets). The model provided insights on the important residues governing the oligomerization of FP. To maximize the benefit of the generated predictive model, it was implemented as a web server under the R programming environment. osFP affords a user-friendly interface that can be used to predict the oligomeric state of FP using the protein sequence. The advantage of osFP is that it is platform-independent meaning that it can be accessed via a web browser on any operating system and device. osFP is freely accessible at http://codes.bio/osfp/ while the source code and data set is provided on GitHub at https://github.com/chaninn/osFP/.Graphical Abstract.

  9. Successful Recovery of Nuclear Protein-Coding Genes from Small Insects in Museums Using Illumina Sequencing.

    PubMed

    Kanda, Kojun; Pflug, James M; Sproul, John S; Dasenko, Mark A; Maddison, David R

    2015-01-01

    In this paper we explore high-throughput Illumina sequencing of nuclear protein-coding, ribosomal, and mitochondrial genes in small, dried insects stored in natural history collections. We sequenced one tenebrionid beetle and 12 carabid beetles ranging in size from 3.7 to 9.7 mm in length that have been stored in various museums for 4 to 84 years. Although we chose a number of old, small specimens for which we expected low sequence recovery, we successfully recovered at least some low-copy nuclear protein-coding genes from all specimens. For example, in one 56-year-old beetle, 4.4 mm in length, our de novo assembly recovered about 63% of approximately 41,900 nucleotides in a target suite of 67 nuclear protein-coding gene fragments, and 70% using a reference-based assembly. Even in the least successfully sequenced carabid specimen, reference-based assembly yielded fragments that were at least 50% of the target length for 34 of 67 nuclear protein-coding gene fragments. Exploration of alternative references for reference-based assembly revealed few signs of bias created by the reference. For all specimens we recovered almost complete copies of ribosomal and mitochondrial genes. We verified the general accuracy of the sequences through comparisons with sequences obtained from PCR and Sanger sequencing, including of conspecific, fresh specimens, and through phylogenetic analysis that tested the placement of sequences in predicted regions. A few possible inaccuracies in the sequences were detected, but these rarely affected the phylogenetic placement of the samples. Although our sample sizes are low, an exploratory regression study suggests that the dominant factor in predicting success at recovering nuclear protein-coding genes is a high number of Illumina reads, with success at PCR of COI and killing by immersion in ethanol being secondary factors; in analyses of only high-read samples, the primary significant explanatory variable was body length, with small beetles being more successfully sequenced.

  10. Successful Recovery of Nuclear Protein-Coding Genes from Small Insects in Museums Using Illumina Sequencing

    PubMed Central

    Dasenko, Mark A.

    2015-01-01

    In this paper we explore high-throughput Illumina sequencing of nuclear protein-coding, ribosomal, and mitochondrial genes in small, dried insects stored in natural history collections. We sequenced one tenebrionid beetle and 12 carabid beetles ranging in size from 3.7 to 9.7 mm in length that have been stored in various museums for 4 to 84 years. Although we chose a number of old, small specimens for which we expected low sequence recovery, we successfully recovered at least some low-copy nuclear protein-coding genes from all specimens. For example, in one 56-year-old beetle, 4.4 mm in length, our de novo assembly recovered about 63% of approximately 41,900 nucleotides in a target suite of 67 nuclear protein-coding gene fragments, and 70% using a reference-based assembly. Even in the least successfully sequenced carabid specimen, reference-based assembly yielded fragments that were at least 50% of the target length for 34 of 67 nuclear protein-coding gene fragments. Exploration of alternative references for reference-based assembly revealed few signs of bias created by the reference. For all specimens we recovered almost complete copies of ribosomal and mitochondrial genes. We verified the general accuracy of the sequences through comparisons with sequences obtained from PCR and Sanger sequencing, including of conspecific, fresh specimens, and through phylogenetic analysis that tested the placement of sequences in predicted regions. A few possible inaccuracies in the sequences were detected, but these rarely affected the phylogenetic placement of the samples. Although our sample sizes are low, an exploratory regression study suggests that the dominant factor in predicting success at recovering nuclear protein-coding genes is a high number of Illumina reads, with success at PCR of COI and killing by immersion in ethanol being secondary factors; in analyses of only high-read samples, the primary significant explanatory variable was body length, with small beetles being more successfully sequenced. PMID:26716693

  11. SGP-1: Prediction and Validation of Homologous Genes Based on Sequence Alignments

    PubMed Central

    Wiehe, Thomas; Gebauer-Jung, Steffi; Mitchell-Olds, Thomas; Guigó, Roderic

    2001-01-01

    Conventional methods of gene prediction rely on the recognition of DNA-sequence signals, the coding potential or the comparison of a genomic sequence with a cDNA, EST, or protein database. Reasons for limited accuracy in many circumstances are species-specific training and the incompleteness of reference databases. Lately, comparative genome analysis has attracted increasing attention. Several analysis tools that are based on human/mouse comparisons are already available. Here, we present a program for the prediction of protein-coding genes, termed SGP-1 (Syntenic Gene Prediction), which is based on the similarity of homologous genomic sequences. In contrast to most existing tools, the accuracy of SGP-1 depends little on species-specific properties such as codon usage or the nucleotide distribution. SGP-1 may therefore be applied to nonstandard model organisms in vertebrates as well as in plants, without the need for extensive parameter training. In addition to predicting genes in large-scale genomic sequences, the program may be useful to validate gene structure annotations from databases. To this end, SGP-1 output also contains comparisons between predicted and annotated gene structures in HTML format. The program can be accessed via a Web server at http://soft.ice.mpg.de/sgp-1. The source code, written in ANSI C, is available on request from the authors. PMID:11544202

  12. Probability of coding of a DNA sequence: an algorithm to predict translated reading frames from their thermodynamic characteristics.

    PubMed Central

    Tramontano, A; Macchiato, M F

    1986-01-01

    An algorithm to determine the probability that a reading frame codifies for a protein is presented. It is based on the results of our previous studies on the thermodynamic characteristics of a translated reading frame. We also develop a prediction procedure to distinguish between coding and non-coding reading frames. The procedure is based on the characteristics of the putative product of the DNA sequence and not on periodicity characteristics of the sequence, so the prediction is not biased by the presence of overlapping translated reading frames or by the presence of translated reading frames on the complementary DNA strand. PMID:3753761

  13. Molecular cloning and analysis of Schizosaccharomyces pombe Reb1p: sequence-specific recognition of two sites in the far upstream rDNA intergenic spacer.

    PubMed Central

    Zhao, A; Guo, A; Liu, Z; Pape, L

    1997-01-01

    The coding sequences for a Schizosaccharomyces pombe sequence-specific DNA binding protein, Reb1p, have been cloned. The predicted S. pombe Reb1p is 24-29% identical to mouse TTF-1 (transcription termination factor-1) and Saccharomyces cerevisiae REB1 protein, both of which direct termination of RNA polymerase I catalyzed transcripts. The S.pombe Reb1 cDNA encodes a predicted polypeptide of 504 amino acids with a predicted molecular weight of 58.4 kDa. The S. pombe Reb1p is unusual in that the bipartite DNA binding motif identified originally in S.cerevisiae and Klyveromyces lactis REB1 proteins is uninterrupted and thus S.pombe Reb1p may contain the smallest natural REB1 homologous DNA binding domain. Its genomic coding sequences were shown to be interrupted by two introns. A recombinant histidine-tagged Reb1 protein bearing the rDNA binding domain has two homologous, sequence-specific binding sites in the S. pomber DNA intergenic spacer, located between 289 and 480 nt downstream of the end of the approximately 25S rRNA coding sequences. Each binding site is 13-14 bp downstream of two of the three proposed in vivo termination sites. The core of this 17 bp site, AGGTAAGGGTAATGCAC, is specifically protected by Reb1p in footprinting analysis. PMID:9016645

  14. TCOF1 gene encodes a putative nucleolar phosphoprotein that exhibits mutations in Treacher Collins Syndrome throughout its coding region.

    PubMed

    Wise, C A; Chiang, L C; Paznekas, W A; Sharma, M; Musy, M M; Ashley, J A; Lovett, M; Jabs, E W

    1997-04-01

    Treacher Collins Syndrome (TCS) is the most common of the human mandibulofacial dysostosis disorders. Recently, a partial TCOF1 cDNA was identified and shown to contain mutations in TCS families. Here we present the entire exon/intron genomic structure and the complete coding sequence of TCOF1. TCOF1 encodes a low complexity protein of 1,411 amino acids, whose predicted protein structure reveals repeated motifs that mirror the organization of its exons. These motifs are shared with nucleolar trafficking proteins in other species and are predicted to be highly phosphorylated by casein kinase. Consistent with this, the full-length TCOF1 protein sequence also contains putative nuclear and nucleolar localization signals. Throughout the open reading frame, we detected an additional eight mutations in TCS families and several polymorphisms. We postulate that TCS results from defects in a nucleolar trafficking protein that is critically required during human craniofacial development.

  15. A novel feature extraction scheme with ensemble coding for protein-protein interaction prediction.

    PubMed

    Du, Xiuquan; Cheng, Jiaxing; Zheng, Tingting; Duan, Zheng; Qian, Fulan

    2014-07-18

    Protein-protein interactions (PPIs) play key roles in most cellular processes, such as cell metabolism, immune response, endocrine function, DNA replication, and transcription regulation. PPI prediction is one of the most challenging problems in functional genomics. Although PPI data have been increasing because of the development of high-throughput technologies and computational methods, many problems are still far from being solved. In this study, a novel predictor was designed by using the Random Forest (RF) algorithm with the ensemble coding (EC) method. To reduce computational time, a feature selection method (DX) was adopted to rank the features and search the optimal feature combination. The DXEC method integrates many features and physicochemical/biochemical properties to predict PPIs. On the Gold Yeast dataset, the DXEC method achieves 67.2% overall precision, 80.74% recall, and 70.67% accuracy. On the Silver Yeast dataset, the DXEC method achieves 76.93% precision, 77.98% recall, and 77.27% accuracy. On the human dataset, the prediction accuracy reaches 80% for the DXEC-RF method. We extended the experiment to a bigger and more realistic dataset that maintains 50% recall on the Yeast All dataset and 80% recall on the Human All dataset. These results show that the DXEC method is suitable for performing PPI prediction. The prediction service of the DXEC-RF classifier is available at http://ailab.ahu.edu.cn:8087/ DXECPPI/index.jsp.

  16. Parametric Bayesian priors and better choice of negative examples improve protein function prediction.

    PubMed

    Youngs, Noah; Penfold-Brown, Duncan; Drew, Kevin; Shasha, Dennis; Bonneau, Richard

    2013-05-01

    Computational biologists have demonstrated the utility of using machine learning methods to predict protein function from an integration of multiple genome-wide data types. Yet, even the best performing function prediction algorithms rely on heuristics for important components of the algorithm, such as choosing negative examples (proteins without a given function) or determining key parameters. The improper choice of negative examples, in particular, can hamper the accuracy of protein function prediction. We present a novel approach for choosing negative examples, using a parameterizable Bayesian prior computed from all observed annotation data, which also generates priors used during function prediction. We incorporate this new method into the GeneMANIA function prediction algorithm and demonstrate improved accuracy of our algorithm over current top-performing function prediction methods on the yeast and mouse proteomes across all metrics tested. Code and Data are available at: http://bonneaulab.bio.nyu.edu/funcprop.html

  17. Characterization of the Lymantria dispar nucleopolyhedrovirus 25K FP gene

    Treesearch

    David S. Bischoff; James M. Slavicek

    1996-01-01

    The Lymantria dispar nucleopolyhedrovirus (LdMNPV) gene encoding the 25K FP protein has been cloned and sequenced. The 25KFP gene codes for a 217 amino acid protein with a predicted molecular mass of 24870 Da. Expression of the 25K FP protein in a rabbit reticulocyte system generated a 27 kDa protein, in close agreement with the...

  18. Cohort-specific imputation of gene expression improves prediction of warfarin dose for African Americans.

    PubMed

    Gottlieb, Assaf; Daneshjou, Roxana; DeGorter, Marianne; Bourgeois, Stephane; Svensson, Peter J; Wadelius, Mia; Deloukas, Panos; Montgomery, Stephen B; Altman, Russ B

    2017-11-24

    Genome-wide association studies are useful for discovering genotype-phenotype associations but are limited because they require large cohorts to identify a signal, which can be population-specific. Mapping genetic variation to genes improves power and allows the effects of both protein-coding variation as well as variation in expression to be combined into "gene level" effects. Previous work has shown that warfarin dose can be predicted using information from genetic variation that affects protein-coding regions. Here, we introduce a method that improves dose prediction by integrating tissue-specific gene expression. In particular, we use drug pathways and expression quantitative trait loci knowledge to impute gene expression-on the assumption that differential expression of key pathway genes may impact dose requirement. We focus on 116 genes from the pharmacokinetic and pharmacodynamic pathways of warfarin within training and validation sets comprising both European and African-descent individuals. We build gene-tissue signatures associated with warfarin dose in a cohort-specific manner and identify a signature of 11 gene-tissue pairs that significantly augments the International Warfarin Pharmacogenetics Consortium dosage-prediction algorithm in both populations. Our results demonstrate that imputed expression can improve dose prediction and bridge population-specific compositions. MATLAB code is available at https://github.com/assafgo/warfarin-cohort.

  19. CONFOLD2: improved contact-driven ab initio protein structure modeling.

    PubMed

    Adhikari, Badri; Cheng, Jianlin

    2018-01-25

    Contact-guided protein structure prediction methods are becoming more and more successful because of the latest advances in residue-residue contact prediction. To support contact-driven structure prediction, effective tools that can quickly build tertiary structural models of good quality from predicted contacts need to be developed. We develop an improved contact-driven protein modelling method, CONFOLD2, and study how it may be effectively used for ab initio protein structure prediction with predicted contacts as input. It builds models using various subsets of input contacts to explore the fold space under the guidance of a soft square energy function, and then clusters the models to obtain the top five models. CONFOLD2 obtains an average reconstruction accuracy of 0.57 TM-score for the 150 proteins in the PSICOV contact prediction dataset. When benchmarked on the CASP11 contacts predicted using CONSIP2 and CASP12 contacts predicted using Raptor-X, CONFOLD2 achieves a mean TM-score of 0.41 on both datasets. CONFOLD2 allows to quickly generate top five structural models for a protein sequence when its secondary structures and contacts predictions at hand. The source code of CONFOLD2 is publicly available at https://github.com/multicom-toolbox/CONFOLD2/ .

  20. SPRINT: ultrafast protein-protein interaction prediction of the entire human interactome.

    PubMed

    Li, Yiwei; Ilie, Lucian

    2017-11-15

    Proteins perform their functions usually by interacting with other proteins. Predicting which proteins interact is a fundamental problem. Experimental methods are slow, expensive, and have a high rate of error. Many computational methods have been proposed among which sequence-based ones are very promising. However, so far no such method is able to predict effectively the entire human interactome: they require too much time or memory. We present SPRINT (Scoring PRotein INTeractions), a new sequence-based algorithm and tool for predicting protein-protein interactions. We comprehensively compare SPRINT with state-of-the-art programs on seven most reliable human PPI datasets and show that it is more accurate while running orders of magnitude faster and using very little memory. SPRINT is the only sequence-based program that can effectively predict the entire human interactome: it requires between 15 and 100 min, depending on the dataset. Our goal is to transform the very challenging problem of predicting the entire human interactome into a routine task. The source code of SPRINT is freely available from https://github.com/lucian-ilie/SPRINT/ and the datasets and predicted PPIs from www.csd.uwo.ca/faculty/ilie/SPRINT/ .

  1. SIFTER search: a web server for accurate phylogeny-based protein function prediction

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sahraeian, Sayed M.; Luo, Kevin R.; Brenner, Steven E.

    We are awash in proteins discovered through high-throughput sequencing projects. As only a minuscule fraction of these have been experimentally characterized, computational methods are widely used for automated annotation. Here, we introduce a user-friendly web interface for accurate protein function prediction using the SIFTER algorithm. SIFTER is a state-of-the-art sequence-based gene molecular function prediction algorithm that uses a statistical model of function evolution to incorporate annotations throughout the phylogenetic tree. Due to the resources needed by the SIFTER algorithm, running SIFTER locally is not trivial for most users, especially for large-scale problems. The SIFTER web server thus provides access tomore » precomputed predictions on 16 863 537 proteins from 232 403 species. Users can explore SIFTER predictions with queries for proteins, species, functions, and homologs of sequences not in the precomputed prediction set. Lastly, the SIFTER web server is accessible at http://sifter.berkeley.edu/ and the source code can be downloaded.« less

  2. SIFTER search: a web server for accurate phylogeny-based protein function prediction

    DOE PAGES

    Sahraeian, Sayed M.; Luo, Kevin R.; Brenner, Steven E.

    2015-05-15

    We are awash in proteins discovered through high-throughput sequencing projects. As only a minuscule fraction of these have been experimentally characterized, computational methods are widely used for automated annotation. Here, we introduce a user-friendly web interface for accurate protein function prediction using the SIFTER algorithm. SIFTER is a state-of-the-art sequence-based gene molecular function prediction algorithm that uses a statistical model of function evolution to incorporate annotations throughout the phylogenetic tree. Due to the resources needed by the SIFTER algorithm, running SIFTER locally is not trivial for most users, especially for large-scale problems. The SIFTER web server thus provides access tomore » precomputed predictions on 16 863 537 proteins from 232 403 species. Users can explore SIFTER predictions with queries for proteins, species, functions, and homologs of sequences not in the precomputed prediction set. Lastly, the SIFTER web server is accessible at http://sifter.berkeley.edu/ and the source code can be downloaded.« less

  3. Comparison of stochastic optimization methods for all-atom folding of the Trp-Cage protein.

    PubMed

    Schug, Alexander; Herges, Thomas; Verma, Abhinav; Lee, Kyu Hwan; Wenzel, Wolfgang

    2005-12-09

    The performances of three different stochastic optimization methods for all-atom protein structure prediction are investigated and compared. We use the recently developed all-atom free-energy force field (PFF01), which was demonstrated to correctly predict the native conformation of several proteins as the global optimum of the free energy surface. The trp-cage protein (PDB-code 1L2Y) is folded with the stochastic tunneling method, a modified parallel tempering method, and the basin-hopping technique. All the methods correctly identify the native conformation, and their relative efficiency is discussed.

  4. Complete nucleotide sequence of the freshwater unicellular cyanobacterium Synechococcus elongatus PCC 6301 chromosome: gene content and organization.

    PubMed

    Sugita, Chieko; Ogata, Koretsugu; Shikata, Masamitsu; Jikuya, Hiroyuki; Takano, Jun; Furumichi, Miho; Kanehisa, Minoru; Omata, Tatsuo; Sugiura, Masahiro; Sugita, Mamoru

    2007-01-01

    The entire genome of the unicellular cyanobacterium Synechococcus elongatus PCC 6301 (formerly Anacystis nidulans Berkeley strain 6301) was sequenced. The genome consisted of a circular chromosome 2,696,255 bp long. A total of 2,525 potential protein-coding genes, two sets of rRNA genes, 45 tRNA genes representing 42 tRNA species, and several genes for small stable RNAs were assigned to the chromosome by similarity searches and computer predictions. The translated products of 56% of the potential protein-coding genes showed sequence similarities to experimentally identified and predicted proteins of known function, and the products of 35% of the genes showed sequence similarities to the translated products of hypothetical genes. The remaining 9% of genes lacked significant similarities to genes for predicted proteins in the public DNA databases. Some 139 genes coding for photosynthesis-related components were identified. Thirty-seven genes for two-component signal transduction systems were also identified. This is the smallest number of such genes identified in cyanobacteria, except for marine cyanobacteria, suggesting that only simple signal transduction systems are found in this strain. The gene arrangement and nucleotide sequence of Synechococcus elongatus PCC 6301 were nearly identical to those of a closely related strain Synechococcus elongatus PCC 7942, except for the presence of a 188.6 kb inversion. The sequences as well as the gene information shown in this paper are available in the Web database, CYORF (http://www.cyano.genome.jp/).

  5. A conserved predicted pseudoknot in the NS2A-encoding sequence of West Nile and Japanese encephalitis flaviviruses suggests NS1' may derive from ribosomal frameshifting

    PubMed Central

    Firth, Andrew E; Atkins, John F

    2009-01-01

    Japanese encephalitis, West Nile, Usutu and Murray Valley encephalitis viruses form a tight subgroup within the larger Flavivirus genus. These viruses utilize a single-polyprotein expression strategy, resulting in ~10 mature proteins. Plotting the conservation at synonymous sites along the polyprotein coding sequence reveals strong conservation peaks at the very 5' end of the coding sequence, and also at the 5' end of the sequence encoding the NS2A protein. Such peaks are generally indicative of functionally important non-coding sequence elements. The second peak corresponds to a predicted stable pseudoknot structure whose biological importance is supported by compensatory mutations that preserve the structure. The pseudoknot is preceded by a conserved slippery heptanucleotide (Y CCU UUU), thus forming a classical stimulatory motif for -1 ribosomal frameshifting. We hypothesize, therefore, that the functional importance of the pseudoknot is to stimulate a portion of ribosomes to shift -1 nt into a short (45 codon), conserved, overlapping open reading frame, termed foo. Since cleavage at the NS1-NS2A boundary is known to require synthesis of NS2A in cis, the resulting transframe fusion protein is predicted to be NS1-NS2AN-term-FOO. We hypothesize that this may explain the origin of the previously identified NS1 'extension' protein in JEV-group flaviviruses, known as NS1'. PMID:19196463

  6. Systematic asymmetric nucleotide exchanges produce human mitochondrial RNAs cryptically encoding for overlapping protein coding genes.

    PubMed

    Seligmann, Hervé

    2013-05-07

    GenBank's EST database includes RNAs matching exactly human mitochondrial sequences assuming systematic asymmetric nucleotide exchange-transcription along exchange rules: A→G→C→U/T→A (12 ESTs), A→U/T→C→G→A (4 ESTs), C→G→U/T→C (3 ESTs), and A→C→G→U/T→A (1 EST), no RNAs correspond to other potential asymmetric exchange rules. Hypothetical polypeptides translated from nucleotide-exchanged human mitochondrial protein coding genes align with numerous GenBank proteins, predicted secondary structures resemble their putative GenBank homologue's. Two independent methods designed to detect overlapping genes (one based on nucleotide contents analyses in relation to replicative deamination gradients at third codon positions, and circular code analyses of codon contents based on frame redundancy), confirm nucleotide-exchange-encrypted overlapping genes. Methods converge on which genes are most probably active, and which not, and this for the various exchange rules. Mean EST lengths produced by different nucleotide exchanges are proportional to (a) extents that various bioinformatics analyses confirm the protein coding status of putative overlapping genes; (b) known kinetic chemistry parameters of the corresponding nucleotide substitutions by the human mitochondrial DNA polymerase gamma (nucleotide DNA misinsertion rates); (c) stop codon densities in predicted overlapping genes (stop codon readthrough and exchanging polymerization regulate gene expression by counterbalancing each other). Numerous rarely expressed proteins seem encoded within regular mitochondrial genes through asymmetric nucleotide exchange, avoiding lengthening genomes. Intersecting evidence between several independent approaches confirms the working hypothesis status of gene encryption by systematic nucleotide exchanges. Copyright © 2013 Elsevier Ltd. All rights reserved.

  7. Promoter analysis reveals globally differential regulation of human long non-coding RNA and protein-coding genes

    DOE PAGES

    Alam, Tanvir; Medvedeva, Yulia A.; Jia, Hui; ...

    2014-10-02

    Transcriptional regulation of protein-coding genes is increasingly well-understood on a global scale, yet no comparable information exists for long non-coding RNA (lncRNA) genes, which were recently recognized to be as numerous as protein-coding genes in mammalian genomes. We performed a genome-wide comparative analysis of the promoters of human lncRNA and protein-coding genes, finding global differences in specific genetic and epigenetic features relevant to transcriptional regulation. These two groups of genes are hence subject to separate transcriptional regulatory programs, including distinct transcription factor (TF) proteins that significantly favor lncRNA, rather than coding-gene, promoters. We report a specific signature of promoter-proximal transcriptionalmore » regulation of lncRNA genes, including several distinct transcription factor binding sites (TFBS). Experimental DNase I hypersensitive site profiles are consistent with active configurations of these lncRNA TFBS sets in diverse human cell types. TFBS ChIP-seq datasets confirm the binding events that we predicted using computational approaches for a subset of factors. For several TFs known to be directly regulated by lncRNAs, we find that their putative TFBSs are enriched at lncRNA promoters, suggesting that the TFs and the lncRNAs may participate in a bidirectional feedback loop regulatory network. Accordingly, cells may be able to modulate lncRNA expression levels independently of mRNA levels via distinct regulatory pathways. Our results also raise the possibility that, given the historical reliance on protein-coding gene catalogs to define the chromatin states of active promoters, a revision of these chromatin signature profiles to incorporate expressed lncRNA genes is warranted in the future.« less

  8. Promoter analysis reveals globally differential regulation of human long non-coding RNA and protein-coding genes

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Alam, Tanvir; Medvedeva, Yulia A.; Jia, Hui

    Transcriptional regulation of protein-coding genes is increasingly well-understood on a global scale, yet no comparable information exists for long non-coding RNA (lncRNA) genes, which were recently recognized to be as numerous as protein-coding genes in mammalian genomes. We performed a genome-wide comparative analysis of the promoters of human lncRNA and protein-coding genes, finding global differences in specific genetic and epigenetic features relevant to transcriptional regulation. These two groups of genes are hence subject to separate transcriptional regulatory programs, including distinct transcription factor (TF) proteins that significantly favor lncRNA, rather than coding-gene, promoters. We report a specific signature of promoter-proximal transcriptionalmore » regulation of lncRNA genes, including several distinct transcription factor binding sites (TFBS). Experimental DNase I hypersensitive site profiles are consistent with active configurations of these lncRNA TFBS sets in diverse human cell types. TFBS ChIP-seq datasets confirm the binding events that we predicted using computational approaches for a subset of factors. For several TFs known to be directly regulated by lncRNAs, we find that their putative TFBSs are enriched at lncRNA promoters, suggesting that the TFs and the lncRNAs may participate in a bidirectional feedback loop regulatory network. Accordingly, cells may be able to modulate lncRNA expression levels independently of mRNA levels via distinct regulatory pathways. Our results also raise the possibility that, given the historical reliance on protein-coding gene catalogs to define the chromatin states of active promoters, a revision of these chromatin signature profiles to incorporate expressed lncRNA genes is warranted in the future.« less

  9. Translatomics combined with transcriptomics and proteomics reveals novel functional, recently evolved orphan genes in Escherichia coli O157:H7 (EHEC).

    PubMed

    Neuhaus, Klaus; Landstorfer, Richard; Fellner, Lea; Simon, Svenja; Schafferhans, Andrea; Goldberg, Tatyana; Marx, Harald; Ozoline, Olga N; Rost, Burkhard; Kuster, Bernhard; Keim, Daniel A; Scherer, Siegfried

    2016-02-24

    Genomes of E. coli, including that of the human pathogen Escherichia coli O157:H7 (EHEC) EDL933, still harbor undetected protein-coding genes which, apparently, have escaped annotation due to their small size and non-essential function. To find such genes, global gene expression of EHEC EDL933 was examined, using strand-specific RNAseq (transcriptome), ribosomal footprinting (translatome) and mass spectrometry (proteome). Using the above methods, 72 short, non-annotated protein-coding genes were detected. All of these showed signals in the ribosomal footprinting assay indicating mRNA translation. Seven were verified by mass spectrometry. Fifty-seven genes are annotated in other enterobacteriaceae, mainly as hypothetical genes; the remaining 15 genes constitute novel discoveries. In addition, protein structure and function were predicted computationally and compared between EHEC-encoded proteins and 100-times randomly shuffled proteins. Based on this comparison, 61 of the 72 novel proteins exhibit predicted structural and functional features similar to those of annotated proteins. Many of the novel genes show differential transcription when grown under eleven diverse growth conditions suggesting environmental regulation. Three genes were found to confer a phenotype in previous studies, e.g., decreased cattle colonization. These findings demonstrate that ribosomal footprinting can be used to detect novel protein coding genes, contributing to the growing body of evidence that hypothetical genes are not annotation artifacts and opening an additional way to study their functionality. All 72 genes are taxonomically restricted and, therefore, appear to have evolved relatively recently de novo.

  10. Tetrahymena thermophila acidic ribosomal protein L37 contains an archaebacterial type of C-terminus.

    PubMed

    Hansen, T S; Andreasen, P H; Dreisig, H; Højrup, P; Nielsen, H; Engberg, J; Kristiansen, K

    1991-09-15

    We have cloned and characterized a Tetrahymena thermophila macronuclear gene (L37) encoding the acidic ribosomal protein (A-protein) L37. The gene contains a single intron located in the 3'-part of the coding region. Two major and three minor transcription start points (tsp) were mapped 39 to 63 nucleotides upstream from the translational start codon. The uppermost tsp mapped to the first T in a putative T. thermophila RNA polymerase II initiator element, TATAA. The coding region of L37 predicts a protein of 109 amino acid (aa) residues. A substantial part of the deduced aa sequence was verified by protein sequencing. The T. thermophila L37 clearly belongs to the P1-type family of eukaryotic A-proteins, but the C-terminal region has the hallmarks of archaebacterial A-proteins.

  11. Two fundamental questions about protein evolution.

    PubMed

    Penny, David; Zhong, Bojian

    2015-12-01

    Two basic questions are considered that approach protein evolution from different directions; the problems arising from using Markov models for the deeper divergences, and then the origin of proteins themselves. The real problem for the first question (going backwards in time) is that at deeper phylogenies the Markov models of sequence evolution must lose information exponentially at deeper divergences, and several testable methods are suggested that should help resolve these deeper divergences. For the second question (coming forwards in time) a problem is that most models for the origin of protein synthesis do not give a role for the very earliest stages of the process. From our knowledge of the importance of replication accuracy in limiting the length of a coding molecule, a testable hypothesis is proposed. The length of the code, the code itself, and tRNAs would all have prior roles in increasing the accuracy of RNA replication; thus proteins would have been formed only after the tRNAs and the length of the triplet code are already formed. Both questions lead to testable predictions. Copyright © 2014 Elsevier B.V. and Société Française de Biochimie et Biologie Moléculaire (SFBBM). All rights reserved.

  12. Accurate prediction of cellular co-translational folding indicates proteins can switch from post- to co-translational folding

    PubMed Central

    Nissley, Daniel A.; Sharma, Ajeet K.; Ahmed, Nabeel; Friedrich, Ulrike A.; Kramer, Günter; Bukau, Bernd; O'Brien, Edward P.

    2016-01-01

    The rates at which domains fold and codons are translated are important factors in determining whether a nascent protein will co-translationally fold and function or misfold and malfunction. Here we develop a chemical kinetic model that calculates a protein domain's co-translational folding curve during synthesis using only the domain's bulk folding and unfolding rates and codon translation rates. We show that this model accurately predicts the course of co-translational folding measured in vivo for four different protein molecules. We then make predictions for a number of different proteins in yeast and find that synonymous codon substitutions, which change translation-elongation rates, can switch some protein domains from folding post-translationally to folding co-translationally—a result consistent with previous experimental studies. Our approach explains essential features of co-translational folding curves and predicts how varying the translation rate at different codon positions along a transcript's coding sequence affects this self-assembly process. PMID:26887592

  13. SVM-Fold: a tool for discriminative multi-class protein fold and superfamily recognition

    PubMed Central

    Melvin, Iain; Ie, Eugene; Kuang, Rui; Weston, Jason; Stafford, William Noble; Leslie, Christina

    2007-01-01

    Background Predicting a protein's structural class from its amino acid sequence is a fundamental problem in computational biology. Much recent work has focused on developing new representations for protein sequences, called string kernels, for use with support vector machine (SVM) classifiers. However, while some of these approaches exhibit state-of-the-art performance at the binary protein classification problem, i.e. discriminating between a particular protein class and all other classes, few of these studies have addressed the real problem of multi-class superfamily or fold recognition. Moreover, there are only limited software tools and systems for SVM-based protein classification available to the bioinformatics community. Results We present a new multi-class SVM-based protein fold and superfamily recognition system and web server called SVM-Fold, which can be found at . Our system uses an efficient implementation of a state-of-the-art string kernel for sequence profiles, called the profile kernel, where the underlying feature representation is a histogram of inexact matching k-mer frequencies. We also employ a novel machine learning approach to solve the difficult multi-class problem of classifying a sequence of amino acids into one of many known protein structural classes. Binary one-vs-the-rest SVM classifiers that are trained to recognize individual structural classes yield prediction scores that are not comparable, so that standard "one-vs-all" classification fails to perform well. Moreover, SVMs for classes at different levels of the protein structural hierarchy may make useful predictions, but one-vs-all does not try to combine these multiple predictions. To deal with these problems, our method learns relative weights between one-vs-the-rest classifiers and encodes information about the protein structural hierarchy for multi-class prediction. In large-scale benchmark results based on the SCOP database, our code weighting approach significantly improves on the standard one-vs-all method for both the superfamily and fold prediction in the remote homology setting and on the fold recognition problem. Moreover, our code weight learning algorithm strongly outperforms nearest-neighbor methods based on PSI-BLAST in terms of prediction accuracy on every structure classification problem we consider. Conclusion By combining state-of-the-art SVM kernel methods with a novel multi-class algorithm, the SVM-Fold system delivers efficient and accurate protein fold and superfamily recognition. PMID:17570145

  14. Retrieval of Enterobacteriaceae drug targets using singular value decomposition.

    PubMed

    Silvério-Machado, Rita; Couto, Bráulio R G M; Dos Santos, Marcos A

    2015-04-15

    The identification of potential drug target proteins in bacteria is important in pharmaceutical research for the development of new antibiotics to combat bacterial agents that cause diseases. A new model that combines the singular value decomposition (SVD) technique with biological filters composed of a set of protein properties associated with bacterial drug targets and similarity to protein-coding essential genes of Escherichia coli (strain K12) has been created to predict potential antibiotic drug targets in the Enterobacteriaceae family. This model identified 99 potential drug target proteins in the studied family, which exhibit eight different functions and are protein-coding essential genes or similar to protein-coding essential genes of E.coli (strain K12), indicating that the disruption of the activities of these proteins is critical for cells. Proteins from bacteria with described drug resistance were found among the retrieved candidates. These candidates have no similarity to the human proteome, therefore exhibiting the advantage of causing no adverse effects or at least no known adverse effects on humans. rita_silverio@hotmail.com. Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  15. Computational prediction of over-annotated protein-coding genes in the genome of Agrobacterium tumefaciens strain C58

    NASA Astrophysics Data System (ADS)

    Yu, Jia-Feng; Sui, Tian-Xiang; Wang, Hong-Mei; Wang, Chun-Ling; Jing, Li; Wang, Ji-Hua

    2015-12-01

    Agrobacterium tumefaciens strain C58 is a type of pathogen that can cause tumors in some dicotyledonous plants. Ever since the genome of A. tumefaciens strain C58 was sequenced, the quality of annotation of its protein-coding genes has been queried continually, because the annotation varies greatly among different databases. In this paper, the questionable hypothetical genes were re-predicted by integrating the TN curve and Z curve methods. As a result, 30 genes originally annotated as “hypothetical” were discriminated as being non-coding sequences. By testing the re-prediction program 10 times on data sets composed of the function-known genes, the mean accuracy of 99.99% and mean Matthews correlation coefficient value of 0.9999 were obtained. Further sequence analysis and COG analysis showed that the re-annotation results were very reliable. This work can provide an efficient tool and data resources for future studies of A. tumefaciens strain C58. Project supported by the National Natural Science Foundation of China (Grant Nos. 61302186 and 61271378) and the Funding from the State Key Laboratory of Bioelectronics of Southeast University.

  16. Extreme Sensory Complexity Encoded in the 10-Megabase Draft Genome Sequence of the Chromatically Acclimating Cyanobacterium Tolypothrix sp. PCC 7601

    PubMed Central

    Yerrapragada, Shaila; Shukla, Animesh; Hallsworth-Pepin, Kymberlie; Choi, Kwangmin; Wollam, Aye; Clifton, Sandra; Qin, Xiang; Muzny, Donna; Raghuraman, Sriram; Ashki, Haleh; Uzman, Akif; Highlander, Sarah K.; Fryszczyn, Bartlomiej G.; Fox, George E.; Tirumalai, Madhan R.; Liu, Yamei; Kim, Sun

    2015-01-01

    Tolypothrix sp. PCC 7601 is a freshwater filamentous cyanobacterium with complex responses to environmental conditions. Here, we present its 9.96-Mbp draft genome sequence, containing 10,065 putative protein-coding sequences, including 305 predicted two-component system proteins and 27 putative phytochrome-class photoreceptors, the most such proteins in any sequenced genome. PMID:25953173

  17. Predicting Gene Structure Changes Resulting from Genetic Variants via Exon Definition Features.

    PubMed

    Majoros, William H; Holt, Carson; Campbell, Michael S; Ware, Doreen; Yandell, Mark; Reddy, Timothy E

    2018-04-25

    Genetic variation that disrupts gene function by altering gene splicing between individuals can substantially influence traits and disease. In those cases, accurately predicting the effects of genetic variation on splicing can be highly valuable for investigating the mechanisms underlying those traits and diseases. While methods have been developed to generate high quality computational predictions of gene structures in reference genomes, the same methods perform poorly when used to predict the potentially deleterious effects of genetic changes that alter gene splicing between individuals. Underlying that discrepancy in predictive ability are the common assumptions by reference gene finding algorithms that genes are conserved, well-formed, and produce functional proteins. We describe a probabilistic approach for predicting recent changes to gene structure that may or may not conserve function. The model is applicable to both coding and noncoding genes, and can be trained on existing gene annotations without requiring curated examples of aberrant splicing. We apply this model to the problem of predicting altered splicing patterns in the genomes of individual humans, and we demonstrate that performing gene-structure prediction without relying on conserved coding features is feasible. The model predicts an unexpected abundance of variants that create de novo splice sites, an observation supported by both simulations and empirical data from RNA-seq experiments. While these de novo splice variants are commonly misinterpreted by other tools as coding or noncoding variants of little or no effect, we find that in some cases they can have large effects on splicing activity and protein products, and we propose that they may commonly act as cryptic factors in disease. The software is available from geneprediction.org/SGRF. bmajoros@duke.edu. Supplementary information is available at Bioinformatics online.

  18. Analysis of protein-coding genetic variation in 60,706 humans.

    PubMed

    Lek, Monkol; Karczewski, Konrad J; Minikel, Eric V; Samocha, Kaitlin E; Banks, Eric; Fennell, Timothy; O'Donnell-Luria, Anne H; Ware, James S; Hill, Andrew J; Cummings, Beryl B; Tukiainen, Taru; Birnbaum, Daniel P; Kosmicki, Jack A; Duncan, Laramie E; Estrada, Karol; Zhao, Fengmei; Zou, James; Pierce-Hoffman, Emma; Berghout, Joanne; Cooper, David N; Deflaux, Nicole; DePristo, Mark; Do, Ron; Flannick, Jason; Fromer, Menachem; Gauthier, Laura; Goldstein, Jackie; Gupta, Namrata; Howrigan, Daniel; Kiezun, Adam; Kurki, Mitja I; Moonshine, Ami Levy; Natarajan, Pradeep; Orozco, Lorena; Peloso, Gina M; Poplin, Ryan; Rivas, Manuel A; Ruano-Rubio, Valentin; Rose, Samuel A; Ruderfer, Douglas M; Shakir, Khalid; Stenson, Peter D; Stevens, Christine; Thomas, Brett P; Tiao, Grace; Tusie-Luna, Maria T; Weisburd, Ben; Won, Hong-Hee; Yu, Dongmei; Altshuler, David M; Ardissino, Diego; Boehnke, Michael; Danesh, John; Donnelly, Stacey; Elosua, Roberto; Florez, Jose C; Gabriel, Stacey B; Getz, Gad; Glatt, Stephen J; Hultman, Christina M; Kathiresan, Sekar; Laakso, Markku; McCarroll, Steven; McCarthy, Mark I; McGovern, Dermot; McPherson, Ruth; Neale, Benjamin M; Palotie, Aarno; Purcell, Shaun M; Saleheen, Danish; Scharf, Jeremiah M; Sklar, Pamela; Sullivan, Patrick F; Tuomilehto, Jaakko; Tsuang, Ming T; Watkins, Hugh C; Wilson, James G; Daly, Mark J; MacArthur, Daniel G

    2016-08-18

    Large-scale reference data sets of human genetic variation are critical for the medical and functional interpretation of DNA sequence changes. Here we describe the aggregation and analysis of high-quality exome (protein-coding region) DNA sequence data for 60,706 individuals of diverse ancestries generated as part of the Exome Aggregation Consortium (ExAC). This catalogue of human genetic diversity contains an average of one variant every eight bases of the exome, and provides direct evidence for the presence of widespread mutational recurrence. We have used this catalogue to calculate objective metrics of pathogenicity for sequence variants, and to identify genes subject to strong selection against various classes of mutation; identifying 3,230 genes with near-complete depletion of predicted protein-truncating variants, with 72% of these genes having no currently established human disease phenotype. Finally, we demonstrate that these data can be used for the efficient filtering of candidate disease-causing variants, and for the discovery of human 'knockout' variants in protein-coding genes.

  19. Sequence similarity is more relevant than species specificity in probabilistic backtranslation.

    PubMed

    Ferro, Alfredo; Giugno, Rosalba; Pigola, Giuseppe; Pulvirenti, Alfredo; Di Pietro, Cinzia; Purrello, Michele; Ragusa, Marco

    2007-02-21

    Backtranslation is the process of decoding a sequence of amino acids into the corresponding codons. All synthetic gene design systems include a backtranslation module. The degeneracy of the genetic code makes backtranslation potentially ambiguous since most amino acids are encoded by multiple codons. The common approach to overcome this difficulty is based on imitation of codon usage within the target species. This paper describes EasyBack, a new parameter-free, fully-automated software for backtranslation using Hidden Markov Models. EasyBack is not based on imitation of codon usage within the target species, but instead uses a sequence-similarity criterion. The model is trained with a set of proteins with known cDNA coding sequences, constructed from the input protein by querying the NCBI databases with BLAST. Unlike existing software, the proposed method allows the quality of prediction to be estimated. When tested on a group of proteins that show different degrees of sequence conservation, EasyBack outperforms other published methods in terms of precision. The prediction quality of a protein backtranslation methis markedly increased by replacing the criterion of most used codon in the same species with a Hidden Markov Model trained with a set of most similar sequences from all species. Moreover, the proposed method allows the quality of prediction to be estimated probabilistically.

  20. Draft Genome Sequence of the Deinococcus-Thermus Bacterium Meiothermus ruber Strain A

    DOE PAGES

    Thiel, Vera; Tomsho, Lynn P.; Burhans, Richard; ...

    2015-03-26

    The draft genome sequence of the Deinococcus-Thermus group bacterium Meiothermus ruber strain A, isolated from a cyanobacterial enrichment culture obtained from Octopus Spring (Yellowstone National Park, WY), comprises 2,968,099 bp in 170 contigs. It is predicted to contain 2,895 protein-coding genes, 44 tRNA-coding genes, and 2 rRNA operons.

  1. A Graph-Centric Approach for Metagenome-Guided Peptide and Protein Identification in Metaproteomics

    PubMed Central

    Tang, Haixu; Li, Sujun; Ye, Yuzhen

    2016-01-01

    Metaproteomic studies adopt the common bottom-up proteomics approach to investigate the protein composition and the dynamics of protein expression in microbial communities. When matched metagenomic and/or metatranscriptomic data of the microbial communities are available, metaproteomic data analyses often employ a metagenome-guided approach, in which complete or fragmental protein-coding genes are first directly predicted from metagenomic (and/or metatranscriptomic) sequences or from their assemblies, and the resulting protein sequences are then used as the reference database for peptide/protein identification from MS/MS spectra. This approach is often limited because protein coding genes predicted from metagenomes are incomplete and fragmental. In this paper, we present a graph-centric approach to improving metagenome-guided peptide and protein identification in metaproteomics. Our method exploits the de Bruijn graph structure reported by metagenome assembly algorithms to generate a comprehensive database of protein sequences encoded in the community. We tested our method using several public metaproteomic datasets with matched metagenomic and metatranscriptomic sequencing data acquired from complex microbial communities in a biological wastewater treatment plant. The results showed that many more peptides and proteins can be identified when assembly graphs were utilized, improving the characterization of the proteins expressed in the microbial communities. The additional proteins we identified contribute to the characterization of important pathways such as those involved in degradation of chemical hazards. Our tools are released as open-source software on github at https://github.com/COL-IU/Graph2Pro. PMID:27918579

  2. Intrinsic and extrinsic approaches for detecting genes in a bacterial genome.

    PubMed Central

    Borodovsky, M; Rudd, K E; Koonin, E V

    1994-01-01

    The unannotated regions of the Escherichia coli genome DNA sequence from the EcoSeq6 database, totaling 1,278 'intergenic' sequences of the combined length of 359,279 basepairs, were analyzed using computer-assisted methods with the aim of identifying putative unknown genes. The proposed strategy for finding new genes includes two key elements: i) prediction of expressed open reading frames (ORFs) using the GeneMark method based on Markov chain models for coding and non-coding regions of Escherichia coli DNA, and ii) search for protein sequence similarities using programs based on the BLAST algorithm and programs for motif identification. A total of 354 putative expressed ORFs were predicted by GeneMark. Using the BLASTX and TBLASTN programs, it was shown that 208 ORFs located in the unannotated regions of the E. coli chromosome are significantly similar to other protein sequences. Identification of 182 ORFs as probable genes was supported by GeneMark and BLAST, comprising 51.4% of the GeneMark 'hits' and 87.5% of the BLAST 'hits'. 73 putative new genes, comprising 20.6% of the GeneMark predictions, belong to ancient conserved protein families that include both eubacterial and eukaryotic members. This value is close to the overall proportion of highly conserved sequences among eubacterial proteins, indicating that the majority of the putative expressed ORFs that are predicted by GeneMark, but have no significant BLAST hits, nevertheless are likely to be real genes. The majority of the putative genes identified by BLAST search have been described since the release of the EcoSeq6 database, but about 70 genes have not been detected so far. Among these new identifications are genes encoding proteins with a variety of predicted functions including dehydrogenases, kinases, several other metabolic enzymes, ATPases, rRNA methyltransferases, membrane proteins, and different types of regulatory proteins. Images PMID:7984428

  3. AlloPred: prediction of allosteric pockets on proteins using normal mode perturbation analysis.

    PubMed

    Greener, Joe G; Sternberg, Michael J E

    2015-10-23

    Despite being hugely important in biological processes, allostery is poorly understood and no universal mechanism has been discovered. Allosteric drugs are a largely unexplored prospect with many potential advantages over orthosteric drugs. Computational methods to predict allosteric sites on proteins are needed to aid the discovery of allosteric drugs, as well as to advance our fundamental understanding of allostery. AlloPred, a novel method to predict allosteric pockets on proteins, was developed. AlloPred uses perturbation of normal modes alongside pocket descriptors in a machine learning approach that ranks the pockets on a protein. AlloPred ranked an allosteric pocket top for 23 out of 40 known allosteric proteins, showing comparable and complementary performance to two existing methods. In 28 of 40 cases an allosteric pocket was ranked first or second. The AlloPred web server, freely available at http://www.sbg.bio.ic.ac.uk/allopred/home, allows visualisation and analysis of predictions. The source code and dataset information are also available from this site. Perturbation of normal modes can enhance our ability to predict allosteric sites on proteins. Computational methods such as AlloPred assist drug discovery efforts by suggesting sites on proteins for further experimental study.

  4. Predicting protein subcellular locations using hierarchical ensemble of Bayesian classifiers based on Markov chains.

    PubMed

    Bulashevska, Alla; Eils, Roland

    2006-06-14

    The subcellular location of a protein is closely related to its function. It would be worthwhile to develop a method to predict the subcellular location for a given protein when only the amino acid sequence of the protein is known. Although many efforts have been made to predict subcellular location from sequence information only, there is the need for further research to improve the accuracy of prediction. A novel method called HensBC is introduced to predict protein subcellular location. HensBC is a recursive algorithm which constructs a hierarchical ensemble of classifiers. The classifiers used are Bayesian classifiers based on Markov chain models. We tested our method on six various datasets; among them are Gram-negative bacteria dataset, data for discriminating outer membrane proteins and apoptosis proteins dataset. We observed that our method can predict the subcellular location with high accuracy. Another advantage of the proposed method is that it can improve the accuracy of the prediction of some classes with few sequences in training and is therefore useful for datasets with imbalanced distribution of classes. This study introduces an algorithm which uses only the primary sequence of a protein to predict its subcellular location. The proposed recursive scheme represents an interesting methodology for learning and combining classifiers. The method is computationally efficient and competitive with the previously reported approaches in terms of prediction accuracies as empirical results indicate. The code for the software is available upon request.

  5. RAIN: RNA–protein Association and Interaction Networks

    PubMed Central

    Junge, Alexander; Refsgaard, Jan C.; Garde, Christian; Pan, Xiaoyong; Santos, Alberto; Alkan, Ferhat; Anthon, Christian; von Mering, Christian; Workman, Christopher T.; Jensen, Lars Juhl; Gorodkin, Jan

    2017-01-01

    Protein association networks can be inferred from a range of resources including experimental data, literature mining and computational predictions. These types of evidence are emerging for non-coding RNAs (ncRNAs) as well. However, integration of ncRNAs into protein association networks is challenging due to data heterogeneity. Here, we present a database of ncRNA–RNA and ncRNA–protein interactions and its integration with the STRING database of protein–protein interactions. These ncRNA associations cover four organisms and have been established from curated examples, experimental data, interaction predictions and automatic literature mining. RAIN uses an integrative scoring scheme to assign a confidence score to each interaction. We demonstrate that RAIN outperforms the underlying microRNA-target predictions in inferring ncRNA interactions. RAIN can be operated through an easily accessible web interface and all interaction data can be downloaded. Database URL: http://rth.dk/resources/rain PMID:28077569

  6. Correlations of nucleotide substitution rates and base composition of mammalian coding sequences with protein structure.

    PubMed

    Chiusano, M L; D'Onofrio, G; Alvarez-Valin, F; Jabbari, K; Colonna, G; Bernardi, G

    1999-09-30

    We investigated the relationships between the nucleotide substitution rates and the predicted secondary structures in the three states representation (alpha-helix, beta-sheet, and coil). The analysis was carried out on 34 alignments, each of which comprised sequences belonging to at least four different mammalian orders. The rates of synonymous substitution were found to be significantly different in regions predicted to be alpha-helix, beta-sheet, or coil. Likewise, the nonsynonymous rates also differ, although expectedly at a lower extent, in the three types of secondary structure, suggesting that different selective constraints associated with the different structures are affecting in a similar way the synonymous and nonsynonymous rates. Moreover, the base composition of the third codon positions is different in coding sequence regions corresponding to different secondary structures of proteins.

  7. The standard operating procedure of the DOE-JGI Microbial Genome Annotation Pipeline (MGAP v.4).

    PubMed

    Huntemann, Marcel; Ivanova, Natalia N; Mavromatis, Konstantinos; Tripp, H James; Paez-Espino, David; Palaniappan, Krishnaveni; Szeto, Ernest; Pillay, Manoj; Chen, I-Min A; Pati, Amrita; Nielsen, Torben; Markowitz, Victor M; Kyrpides, Nikos C

    2015-01-01

    The DOE-JGI Microbial Genome Annotation Pipeline performs structural and functional annotation of microbial genomes that are further included into the Integrated Microbial Genome comparative analysis system. MGAP is applied to assembled nucleotide sequence datasets that are provided via the IMG submission site. Dataset submission for annotation first requires project and associated metadata description in GOLD. The MGAP sequence data processing consists of feature prediction including identification of protein-coding genes, non-coding RNAs and regulatory RNA features, as well as CRISPR elements. Structural annotation is followed by assignment of protein product names and functions.

  8. Gene Unprediction with Spurio: A tool to identify spurious protein sequences.

    PubMed

    Höps, Wolfram; Jeffryes, Matt; Bateman, Alex

    2018-01-01

    We now have access to the sequences of tens of millions of proteins. These protein sequences are essential for modern molecular biology and computational biology. The vast majority of protein sequences are derived from gene prediction tools and have no experimental supporting evidence for their translation.  Despite the increasing accuracy of gene prediction tools there likely exists a large number of spurious protein predictions in the sequence databases.  We have developed the Spurio tool to help identify spurious protein predictions in prokaryotes.  Spurio searches the query protein sequence against a prokaryotic nucleotide database using tblastn and identifies homologous sequences. The tblastn matches are used to score the query sequence's likelihood of being a spurious protein prediction using a Gaussian process model. The most informative feature is the appearance of stop codons within the presumed translation of homologous DNA sequences. Benchmarking shows that the Spurio tool is able to distinguish spurious from true proteins. However, transposon proteins are prone to be predicted as spurious because of the frequency of degraded homologs found in the DNA sequence databases. Our initial experiments suggest that less than 1% of the proteins in the UniProtKB sequence database are likely to be spurious and that Spurio is able to identify over 60 times more spurious proteins than the AntiFam resource. The Spurio software and source code is available under an MIT license at the following URL: https://bitbucket.org/bateman-group/spurio.

  9. Comparison of the protein-coding gene content of Chlamydia trachomatis and Protochlamydia amoebophila using a Raspberry Pi computer.

    PubMed

    Robson, James F; Barker, Daniel

    2015-10-13

    To demonstrate the bioinformatics capabilities of a low-cost computer, the Raspberry Pi, we present a comparison of the protein-coding gene content of two species in phylum Chlamydiae: Chlamydia trachomatis, a common sexually transmitted infection of humans, and Candidatus Protochlamydia amoebophila, a recently discovered amoebal endosymbiont. Identifying species-specific proteins and differences in protein families could provide insights into the unique phenotypes of the two species. Using a Raspberry Pi computer, sequence similarity-based protein families were predicted across the two species, C. trachomatis and P. amoebophila, and their members counted. Examples include nine multi-protein families unique to C. trachomatis, 132 multi-protein families unique to P. amoebophila and one family with multiple copies in both. Most families unique to C. trachomatis were polymorphic outer-membrane proteins. Additionally, multiple protein families lacking functional annotation were found. Predicted functional interactions suggest one of these families is involved with the exodeoxyribonuclease V complex. The Raspberry Pi computer is adequate for a comparative genomics project of this scope. The protein families unique to P. amoebophila may provide a basis for investigating the host-endosymbiont interaction. However, additional species should be included; and further laboratory research is required to identify the functions of unknown or putative proteins. Multiple outer membrane proteins were found in C. trachomatis, suggesting importance for host evasion. The tyrosine transport protein family is shared between both species, with four proteins in C. trachomatis and two in P. amoebophila. Shared protein families could provide a starting point for discovery of wide-spectrum drugs against Chlamydiae.

  10. Extreme Sensory Complexity Encoded in the 10-Megabase Draft Genome Sequence of the Chromatically Acclimating Cyanobacterium Tolypothrix sp. PCC 7601.

    PubMed

    Yerrapragada, Shaila; Shukla, Animesh; Hallsworth-Pepin, Kymberlie; Choi, Kwangmin; Wollam, Aye; Clifton, Sandra; Qin, Xiang; Muzny, Donna; Raghuraman, Sriram; Ashki, Haleh; Uzman, Akif; Highlander, Sarah K; Fryszczyn, Bartlomiej G; Fox, George E; Tirumalai, Madhan R; Liu, Yamei; Kim, Sun; Kehoe, David M; Weinstock, George M

    2015-05-07

    Tolypothrix sp. PCC 7601 is a freshwater filamentous cyanobacterium with complex responses to environmental conditions. Here, we present its 9.96-Mbp draft genome sequence, containing 10,065 putative protein-coding sequences, including 305 predicted two-component system proteins and 27 putative phytochrome-class photoreceptors, the most such proteins in any sequenced genome. Copyright © 2015 Yerrapragada et al.

  11. Identification and characterization of an early gene in the Lymantria dispar multinucleocapsid nuclear polyhedrosis virus

    Treesearch

    David S. Bischoff; James M. Slavicek

    1995-01-01

    The Lymantria dispar multinucleocapsid nuclear polyhedrosis virus (LdMNPV) gene encoding G22 was cloned and sequenced. The G22 gene codes for a 191 amino acid protein with a predicted Mr of 22000. Expression of G22 in a rabbit reticulocyte system generated a protein with an M...

  12. Non-coding RNAs in lung cancer

    PubMed Central

    Ricciuti, Biagio; Mecca, Carmen; Crinò, Lucio; Baglivo, Sara; Cenci, Matteo; Metro, Giulio

    2014-01-01

    The discovery that protein-coding genes represent less than 2% of all human genome, and the evidence that more than 90% of it is actively transcribed, changed the classical point of view of the central dogma of molecular biology, which was always based on the assumption that RNA functions mainly as an intermediate bridge between DNA sequences and protein synthesis machinery. Accumulating data indicates that non-coding RNAs are involved in different physiological processes, providing for the maintenance of cellular homeostasis. They are important regulators of gene expression, cellular differentiation, proliferation, migration, apoptosis, and stem cell maintenance. Alterations and disruptions of their expression or activity have increasingly been associated with pathological changes of cancer cells, this evidence and the prospect of using these molecules as diagnostic markers and therapeutic targets, make currently non-coding RNAs among the most relevant molecules in cancer research. In this paper we will provide an overview of non-coding RNA function and disruption in lung cancer biology, also focusing on their potential as diagnostic, prognostic and predictive biomarkers. PMID:25593996

  13. Long Non-Coding RNAs (lncRNAs) of Sea Cucumber: Large-Scale Prediction, Expression Profiling, Non-Coding Network Construction, and lncRNA-microRNA-Gene Interaction Analysis of lncRNAs in Apostichopus japonicus and Holothuria glaberrima During LPS Challenge and Radial Organ Complex Regeneration.

    PubMed

    Mu, Chuang; Wang, Ruijia; Li, Tianqi; Li, Yuqiang; Tian, Meilin; Jiao, Wenqian; Huang, Xiaoting; Zhang, Lingling; Hu, Xiaoli; Wang, Shi; Bao, Zhenmin

    2016-08-01

    Long non-coding RNA (lncRNA) structurally resembles mRNA but cannot be translated into protein. Although the systematic identification and characterization of lncRNAs have been increasingly reported in model species, information concerning non-model species is still lacking. Here, we report the first systematic identification and characterization of lncRNAs in two sea cucumber species: (1) Apostichopus japonicus during lipopolysaccharide (LPS) challenge and in heathy tissues and (2) Holothuria glaberrima during radial organ complex regeneration, using RNA-seq datasets and bioinformatics analysis. We identified A. japonicus and H. glaberrima lncRNAs that were differentially expressed during LPS challenge and radial organ complex regeneration, respectively. Notably, the predicted lncRNA-microRNA-gene trinities revealed that, in addition to targeting protein-coding transcripts, miRNAs might also target lncRNAs, thereby participating in a potential novel layer of regulatory interactions among non-coding RNA classes in echinoderms. Furthermore, the constructed coding-non-coding network implied the potential involvement of lncRNA-gene interactions during the regulation of several important genes (e.g., Toll-like receptor 1 [TLR1] and transglutaminase-1 [TGM1]) in response to LPS challenge and radial organ complex regeneration in sea cucumbers. Overall, this pioneer systematic identification, annotation, and characterization of lncRNAs in echinoderm pave the way for similar studies and future genetic, genomic, and evolutionary research in non-model species.

  14. Identification of Putative Nuclear Receptors and Steroidogenic Enzymes in Murray-Darling Rainbowfish (Melanotaenia fluviatilis) Using RNA-Seq and De Novo Transcriptome Assembly.

    PubMed

    Bain, Peter A; Papanicolaou, Alexie; Kumar, Anupama

    2015-01-01

    Murray-Darling rainbowfish (Melanotaenia fluviatilis [Castelnau, 1878]; Atheriniformes: Melanotaeniidae) is a small-bodied teleost currently under development in Australasia as a test species for aquatic toxicological studies. To date, efforts towards the development of molecular biomarkers of contaminant exposure have been hindered by the lack of available sequence data. To address this, we sequenced messenger RNA from brain, liver and gonads of mature male and female fish and generated a high-quality draft transcriptome using a de novo assembly approach. 149,742 clusters of putative transcripts were obtained, encompassing 43,841 non-redundant protein-coding regions. Deduced amino acid sequences were annotated by functional inference based on similarity with sequences from manually curated protein sequence databases. The draft assembly contained protein-coding regions homologous to 95.7% of the complete cohort of predicted proteins from the taxonomically related species, Oryzias latipes (Japanese medaka). The mean length of rainbowfish protein-coding sequences relative to their medaka homologues was 92.1%, indicating that despite the limited number of tissues sampled a large proportion of the total expected number of protein-coding genes was captured in the study. Because of our interest in the effects of environmental contaminants on endocrine pathways, we manually curated subsets of coding regions for putative nuclear receptors and steroidogenic enzymes in the rainbowfish transcriptome, revealing 61 candidate nuclear receptors encompassing all known subfamilies, and 41 putative steroidogenic enzymes representing all major steroidogenic enzymes occurring in teleosts. The transcriptome presented here will be a valuable resource for researchers interested in biomarker development, protein structure and function, and contaminant-response genomics in Murray-Darling rainbowfish.

  15. PAR-CLIP data indicate that Nrd1-Nab3-dependent transcription termination regulates expression of hundreds of protein coding genes in yeast

    PubMed Central

    2014-01-01

    Background Nrd1 and Nab3 are essential sequence-specific yeast RNA binding proteins that function as a heterodimer in the processing and degradation of diverse classes of RNAs. These proteins also regulate several mRNA coding genes; however, it remains unclear exactly what percentage of the mRNA component of the transcriptome these proteins control. To address this question, we used the pyCRAC software package developed in our laboratory to analyze CRAC and PAR-CLIP data for Nrd1-Nab3-RNA interactions. Results We generated high-resolution maps of Nrd1-Nab3-RNA interactions, from which we have uncovered hundreds of new Nrd1-Nab3 mRNA targets, representing between 20 and 30% of protein-coding transcripts. Although Nrd1 and Nab3 showed a preference for binding near 5′ ends of relatively short transcripts, they bound transcripts throughout coding sequences and 3′ UTRs. Moreover, our data for Nrd1-Nab3 binding to 3′ UTRs was consistent with a role for these proteins in the termination of transcription. Our data also support a tight integration of Nrd1-Nab3 with the nutrient response pathway. Finally, we provide experimental evidence for some of our predictions, using northern blot and RT-PCR assays. Conclusions Collectively, our data support the notion that Nrd1 and Nab3 function is tightly integrated with the nutrient response and indicate a role for these proteins in the regulation of many mRNA coding genes. Further, we provide evidence to support the hypothesis that Nrd1-Nab3 represents a failsafe termination mechanism in instances of readthrough transcription. PMID:24393166

  16. Computational prediction of secretion systems and secretomes of Brucella: identification of novel type IV effectors and their interaction with the host.

    PubMed

    Sankarasubramanian, Jagadesan; Vishnu, Udayakumar S; Dinakaran, Vasudevan; Sridhar, Jayavel; Gunasekaran, Paramasamy; Rajendhran, Jeyaprakash

    2016-01-01

    Brucella spp. are facultative intracellular pathogens that cause brucellosis in various mammals including humans. Brucella survive inside the host cells by forming vacuoles and subverting host defence systems. This study was aimed to predict the secretion systems and the secretomes of Brucella spp. from 39 complete genome sequences available in the databases. Furthermore, an attempt was made to identify the type IV secretion effectors and their interactions with host proteins. We predicted the secretion systems of Brucella by the KEGG pathway and SecReT4. Brucella secretomes and type IV effectors (T4SEs) were predicted through genome-wide screening using JVirGel and S4TE, respectively. Protein-protein interactions of Brucella T4SEs with their hosts were analyzed by HPIDB 2.0. Genes coding for Sec and Tat pathways of secretion and type I (T1SS), type IV (T4SS) and type V (T5SS) secretion systems were identified and they are conserved in all the species of Brucella. In addition to the well-known VirB operon coding for the type IV secretion system (T4SS), we have identified the presence of additional genes showing homology with T4SS of other organisms. On the whole, 10.26 to 14.94% of total proteomes were found to be either secreted (secretome) or membrane associated (membrane proteome). Approximately, 1.7 to 3.0% of total proteomes were identified as type IV secretion effectors (T4SEs). Prediction of protein-protein interactions showed 29 and 36 host-pathogen specific interactions between Bos taurus (cattle)-B. abortus and Ovis aries (sheep)-B. melitensis, respectively. Functional characterization of the predicted T4SEs and their interactions with their respective hosts may reveal the secrets of host specificity of Brucella.

  17. LncRNApred: Classification of Long Non-Coding RNAs and Protein-Coding Transcripts by the Ensemble Algorithm with a New Hybrid Feature.

    PubMed

    Pian, Cong; Zhang, Guangle; Chen, Zhi; Chen, Yuanyuan; Zhang, Jin; Yang, Tao; Zhang, Liangyun

    2016-01-01

    As a novel class of noncoding RNAs, long noncoding RNAs (lncRNAs) have been verified to be associated with various diseases. As large scale transcripts are generated every year, it is significant to accurately and quickly identify lncRNAs from thousands of assembled transcripts. To accurately discover new lncRNAs, we develop a classification tool of random forest (RF) named LncRNApred based on a new hybrid feature. This hybrid feature set includes three new proposed features, which are MaxORF, RMaxORF and SNR. LncRNApred is effective for classifying lncRNAs and protein coding transcripts accurately and quickly. Moreover,our RF model only requests the training using data on human coding and non-coding transcripts. Other species can also be predicted by using LncRNApred. The result shows that our method is more effective compared with the Coding Potential Calculate (CPC). The web server of LncRNApred is available for free at http://mm20132014.wicp.net:57203/LncRNApred/home.jsp.

  18. DeepLoc: prediction of protein subcellular localization using deep learning.

    PubMed

    Almagro Armenteros, José Juan; Sønderby, Casper Kaae; Sønderby, Søren Kaae; Nielsen, Henrik; Winther, Ole

    2017-11-01

    The prediction of eukaryotic protein subcellular localization is a well-studied topic in bioinformatics due to its relevance in proteomics research. Many machine learning methods have been successfully applied in this task, but in most of them, predictions rely on annotation of homologues from knowledge databases. For novel proteins where no annotated homologues exist, and for predicting the effects of sequence variants, it is desirable to have methods for predicting protein properties from sequence information only. Here, we present a prediction algorithm using deep neural networks to predict protein subcellular localization relying only on sequence information. At its core, the prediction model uses a recurrent neural network that processes the entire protein sequence and an attention mechanism identifying protein regions important for the subcellular localization. The model was trained and tested on a protein dataset extracted from one of the latest UniProt releases, in which experimentally annotated proteins follow more stringent criteria than previously. We demonstrate that our model achieves a good accuracy (78% for 10 categories; 92% for membrane-bound or soluble), outperforming current state-of-the-art algorithms, including those relying on homology information. The method is available as a web server at http://www.cbs.dtu.dk/services/DeepLoc. Example code is available at https://github.com/JJAlmagro/subcellular_localization. The dataset is available at http://www.cbs.dtu.dk/services/DeepLoc/data.php. jjalma@dtu.dk. © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com

  19. A novel Multi-Agent Ada-Boost algorithm for predicting protein structural class with the information of protein secondary structure.

    PubMed

    Fan, Ming; Zheng, Bin; Li, Lihua

    2015-10-01

    Knowledge of the structural class of a given protein is important for understanding its folding patterns. Although a lot of efforts have been made, it still remains a challenging problem for prediction of protein structural class solely from protein sequences. The feature extraction and classification of proteins are the main problems in prediction. In this research, we extended our earlier work regarding these two aspects. In protein feature extraction, we proposed a scheme by calculating the word frequency and word position from sequences of amino acid, reduced amino acid, and secondary structure. For an accurate classification of the structural class of protein, we developed a novel Multi-Agent Ada-Boost (MA-Ada) method by integrating the features of Multi-Agent system into Ada-Boost algorithm. Extensive experiments were taken to test and compare the proposed method using four benchmark datasets in low homology. The results showed classification accuracies of 88.5%, 96.0%, 88.4%, and 85.5%, respectively, which are much better compared with the existing methods. The source code and dataset are available on request.

  20. PconsFold: improved contact predictions improve protein models.

    PubMed

    Michel, Mirco; Hayat, Sikander; Skwark, Marcin J; Sander, Chris; Marks, Debora S; Elofsson, Arne

    2014-09-01

    Recently it has been shown that the quality of protein contact prediction from evolutionary information can be improved significantly if direct and indirect information is separated. Given sufficiently large protein families, the contact predictions contain sufficient information to predict the structure of many protein families. However, since the first studies contact prediction methods have improved. Here, we ask how much the final models are improved if improved contact predictions are used. In a small benchmark of 15 proteins, we show that the TM-scores of top-ranked models are improved by on average 33% using PconsFold compared with the original version of EVfold. In a larger benchmark, we find that the quality is improved with 15-30% when using PconsC in comparison with earlier contact prediction methods. Further, using Rosetta instead of CNS does not significantly improve global model accuracy, but the chemistry of models generated with Rosetta is improved. PconsFold is a fully automated pipeline for ab initio protein structure prediction based on evolutionary information. PconsFold is based on PconsC contact prediction and uses the Rosetta folding protocol. Due to its modularity, the contact prediction tool can be easily exchanged. The source code of PconsFold is available on GitHub at https://www.github.com/ElofssonLab/pcons-fold under the MIT license. PconsC is available from http://c.pcons.net/. Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press.

  1. MSPocket: an orientation-independent algorithm for the detection of ligand binding pockets.

    PubMed

    Zhu, Hongbo; Pisabarro, M Teresa

    2011-02-01

    Identification of ligand binding pockets on proteins is crucial for the characterization of protein functions. It provides valuable information for protein-ligand docking and rational engineering of small molecules that regulate protein functions. A major number of current prediction algorithms of ligand binding pockets are based on cubic grid representation of proteins and, thus, the results are often protein orientation dependent. We present the MSPocket program for detecting pockets on the solvent excluded surface of proteins. The core algorithm of the MSPocket approach does not use any cubic grid system to represent proteins and is therefore independent of protein orientations. We demonstrate that MSPocket is able to achieve an accuracy of 75% in predicting ligand binding pockets on a test dataset used for evaluating several existing methods. The accuracy is 92% if the top three predictions are considered. Comparison to one of the recently published best performing methods shows that MSPocket reaches similar performance with the additional feature of being protein orientation independent. Interestingly, some of the predictions are different, meaning that the two methods can be considered complementary and combined to achieve better prediction accuracy. MSPocket also provides a graphical user interface for interactive investigation of the predicted ligand binding pockets. In addition, we show that overlap criterion is a better strategy for the evaluation of predicted ligand binding pockets than the single point distance criterion. The MSPocket source code can be downloaded from http://appserver.biotec.tu-dresden.de/MSPocket/. MSPocket is also available as a PyMOL plugin with a graphical user interface.

  2. Distribution of cold adaptation proteins in microbial mats in Lake Joyce, Antarctica: Analysis of metagenomic data by using two bioinformatics tools.

    PubMed

    Koo, Hyunmin; Hakim, Joseph A; Fisher, Phillip R E; Grueneberg, Alexander; Andersen, Dale T; Bej, Asim K

    2016-01-01

    In this study, we report the distribution and abundance of cold-adaptation proteins in microbial mat communities in the perennially ice-covered Lake Joyce, located in the McMurdo Dry Valleys, Antarctica. We have used MG-RAST and R code bioinformatics tools on Illumina HiSeq2000 shotgun metagenomic data and compared the filtering efficacy of these two methods on cold-adaptation proteins. Overall, the abundance of cold-shock DEAD-box protein A (CSDA), antifreeze proteins (AFPs), fatty acid desaturase (FAD), trehalose synthase (TS), and cold-shock family of proteins (CSPs) were present in all mat samples at high, moderate, or low levels, whereas the ice nucleation protein (INP) was present only in the ice and bulbous mat samples at insignificant levels. Considering the near homogeneous temperature profile of Lake Joyce (0.08-0.29 °C), the distribution and abundance of these proteins across various mat samples predictively correlated with known functional attributes necessary for microbial communities to thrive in this ecosystem. The comparison of the MG-RAST and the R code methods showed dissimilar occurrences of the cold-adaptation protein sequences, though with insignificant ANOSIM (R = 0.357; p-value = 0.012), ADONIS (R(2) = 0.274; p-value = 0.03) and STAMP (p-values = 0.521-0.984) statistical analyses. Furthermore, filtering targeted sequences using the R code accounted for taxonomic groups by avoiding sequence redundancies, whereas the MG-RAST provided total counts resulting in a higher sequence output. The results from this study revealed for the first time the distribution of cold-adaptation proteins in six different types of microbial mats in Lake Joyce, while suggesting a simpler and more manageable user-defined method of R code, as compared to a web-based MG-RAST pipeline.

  3. DeepGO: predicting protein functions from sequence and interactions using a deep ontology-aware classifier.

    PubMed

    Kulmanov, Maxat; Khan, Mohammed Asif; Hoehndorf, Robert; Wren, Jonathan

    2018-02-15

    A large number of protein sequences are becoming available through the application of novel high-throughput sequencing technologies. Experimental functional characterization of these proteins is time-consuming and expensive, and is often only done rigorously for few selected model organisms. Computational function prediction approaches have been suggested to fill this gap. The functions of proteins are classified using the Gene Ontology (GO), which contains over 40 000 classes. Additionally, proteins have multiple functions, making function prediction a large-scale, multi-class, multi-label problem. We have developed a novel method to predict protein function from sequence. We use deep learning to learn features from protein sequences as well as a cross-species protein-protein interaction network. Our approach specifically outputs information in the structure of the GO and utilizes the dependencies between GO classes as background information to construct a deep learning model. We evaluate our method using the standards established by the Computational Assessment of Function Annotation (CAFA) and demonstrate a significant improvement over baseline methods such as BLAST, in particular for predicting cellular locations. Web server: http://deepgo.bio2vec.net, Source code: https://github.com/bio-ontology-research-group/deepgo. robert.hoehndorf@kaust.edu.sa. Supplementary data are available at Bioinformatics online. © The Author(s) 2017. Published by Oxford University Press.

  4. The standard operating procedure of the DOE-JGI Microbial Genome Annotation Pipeline (MGAP v.4)

    DOE PAGES

    Huntemann, Marcel; Ivanova, Natalia N.; Mavromatis, Konstantinos; ...

    2015-10-26

    The DOE-JGI Microbial Genome Annotation Pipeline performs structural and functional annotation of microbial genomes that are further included into the Integrated Microbial Genome comparative analysis system. MGAP is applied to assembled nucleotide sequence datasets that are provided via the IMG submission site. Dataset submission for annotation first requires project and associated metadata description in GOLD. The MGAP sequence data processing consists of feature prediction including identification of protein-coding genes, non-coding RNAs and regulatory RNA features, as well as CRISPR elements. In conclusion, structural annotation is followed by assignment of protein product names and functions.

  5. The standard operating procedure of the DOE-JGI Microbial Genome Annotation Pipeline (MGAP v.4)

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Huntemann, Marcel; Ivanova, Natalia N.; Mavromatis, Konstantinos

    The DOE-JGI Microbial Genome Annotation Pipeline performs structural and functional annotation of microbial genomes that are further included into the Integrated Microbial Genome comparative analysis system. MGAP is applied to assembled nucleotide sequence datasets that are provided via the IMG submission site. Dataset submission for annotation first requires project and associated metadata description in GOLD. The MGAP sequence data processing consists of feature prediction including identification of protein-coding genes, non-coding RNAs and regulatory RNA features, as well as CRISPR elements. In conclusion, structural annotation is followed by assignment of protein product names and functions.

  6. Evolution and Diversity of the Human Hepatitis D Virus Genome

    PubMed Central

    Huang, Chi-Ruei; Lo, Szecheng J.

    2010-01-01

    Human hepatitis delta virus (HDV) is the smallest RNA virus in genome. HDV genome is divided into a viroid-like sequence and a protein-coding sequence which could have originated from different resources and the HDV genome was eventually constituted through RNA recombination. The genome subsequently diversified through accumulation of mutations selected by interactions between the mutated RNA and proteins with host factors to successfully form the infectious virions. Therefore, we propose that the conservation of HDV nucleotide sequence is highly related with its functionality. Genome analysis of known HDV isolates shows that the C-terminal coding sequences of large delta antigen (LDAg) are the highest diversity than other regions of protein-coding sequences but they still retain biological functionality to interact with the heavy chain of clathrin can be selected and maintained. Since viruses interact with many host factors, including escaping the host immune response, how to design a program to predict RNA genome evolution is a great challenging work. PMID:20204073

  7. SinEx DB: a database for single exon coding sequences in mammalian genomes.

    PubMed

    Jorquera, Roddy; Ortiz, Rodrigo; Ossandon, F; Cárdenas, Juan Pablo; Sepúlveda, Rene; González, Carolina; Holmes, David S

    2016-01-01

    Eukaryotic genes are typically interrupted by intragenic, noncoding sequences termed introns. However, some genes lack introns in their coding sequence (CDS) and are generally known as 'single exon genes' (SEGs). In this work, a SEG is defined as a nuclear, protein-coding gene that lacks introns in its CDS. Whereas, many public databases of Eukaryotic multi-exon genes are available, there are only two specialized databases for SEGs. The present work addresses the need for a more extensive and diverse database by creating SinEx DB, a publicly available, searchable database of predicted SEGs from 10 completely sequenced mammalian genomes including human. SinEx DB houses the DNA and protein sequence information of these SEGs and includes their functional predictions (KOG) and the relative distribution of these functions within species. The information is stored in a relational database built with My SQL Server 5.1.33 and the complete dataset of SEG sequences and their functional predictions are available for downloading. SinEx DB can be interrogated by: (i) a browsable phylogenetic schema, (ii) carrying out BLAST searches to the in-house SinEx DB of SEGs and (iii) via an advanced search mode in which the database can be searched by key words and any combination of searches by species and predicted functions. SinEx DB provides a rich source of information for advancing our understanding of the evolution and function of SEGs.Database URL: www.sinex.cl. © The Author(s) 2016. Published by Oxford University Press.

  8. Multi-level machine learning prediction of protein-protein interactions in Saccharomyces cerevisiae.

    PubMed

    Zubek, Julian; Tatjewski, Marcin; Boniecki, Adam; Mnich, Maciej; Basu, Subhadip; Plewczynski, Dariusz

    2015-01-01

    Accurate identification of protein-protein interactions (PPI) is the key step in understanding proteins' biological functions, which are typically context-dependent. Many existing PPI predictors rely on aggregated features from protein sequences, however only a few methods exploit local information about specific residue contacts. In this work we present a two-stage machine learning approach for prediction of protein-protein interactions. We start with the carefully filtered data on protein complexes available for Saccharomyces cerevisiae in the Protein Data Bank (PDB) database. First, we build linear descriptions of interacting and non-interacting sequence segment pairs based on their inter-residue distances. Secondly, we train machine learning classifiers to predict binary segment interactions for any two short sequence fragments. The final prediction of the protein-protein interaction is done using the 2D matrix representation of all-against-all possible interacting sequence segments of both analysed proteins. The level-I predictor achieves 0.88 AUC for micro-scale, i.e., residue-level prediction. The level-II predictor improves the results further by a more complex learning paradigm. We perform 30-fold macro-scale, i.e., protein-level cross-validation experiment. The level-II predictor using PSIPRED-predicted secondary structure reaches 0.70 precision, 0.68 recall, and 0.70 AUC, whereas other popular methods provide results below 0.6 threshold (recall, precision, AUC). Our results demonstrate that multi-scale sequence features aggregation procedure is able to improve the machine learning results by more than 10% as compared to other sequence representations. Prepared datasets and source code for our experimental pipeline are freely available for download from: http://zubekj.github.io/mlppi/ (open source Python implementation, OS independent).

  9. Sequence-dependent modelling of local DNA bending phenomena: curvature prediction and vibrational analysis.

    PubMed

    Vlahovicek, K; Munteanu, M G; Pongor, S

    1999-01-01

    Bending is a local conformational micropolymorphism of DNA in which the original B-DNA structure is only distorted but not extensively modified. Bending can be predicted by simple static geometry models as well as by a recently developed elastic model that incorporate sequence dependent anisotropic bendability (SDAB). The SDAB model qualitatively explains phenomena including affinity of protein binding, kinking, as well as sequence-dependent vibrational properties of DNA. The vibrational properties of DNA segments can be studied by finite element analysis of a model subjected to an initial bending moment. The frequency spectrum is obtained by applying Fourier analysis to the displacement values in the time domain. This analysis shows that the spectrum of the bending vibrations quite sensitively depends on the sequence, for example the spectrum of a curved sequence is characteristically different from the spectrum of straight sequence motifs of identical basepair composition. Curvature distributions are genome-specific, and pronounced differences are found between protein-coding and regulatory regions, respectively, that is, sites of extreme curvature and/or bendability are less frequent in protein-coding regions. A WWW server is set up for the prediction of curvature and generation of 3D models from DNA sequences (http:@www.icgeb.trieste.it/dna).

  10. The standard operating procedure of the DOE-JGI Metagenome Annotation Pipeline (MAP v.4)

    DOE PAGES

    Huntemann, Marcel; Ivanova, Natalia N.; Mavromatis, Konstantinos; ...

    2016-02-24

    The DOE-JGI Metagenome Annotation Pipeline (MAP v.4) performs structural and functional annotation for metagenomic sequences that are submitted to the Integrated Microbial Genomes with Microbiomes (IMG/M) system for comparative analysis. The pipeline runs on nucleotide sequences provide d via the IMG submission site. Users must first define their analysis projects in GOLD and then submit the associated sequence datasets consisting of scaffolds/contigs with optional coverage information and/or unassembled reads in fasta and fastq file formats. The MAP processing consists of feature prediction including identification of protein-coding genes, non-coding RNAs and regulatory RNAs, as well as CRISPR elements. Structural annotation ismore » followed by functional annotation including assignment of protein product names and connection to various protein family databases.« less

  11. The standard operating procedure of the DOE-JGI Metagenome Annotation Pipeline (MAP v.4)

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Huntemann, Marcel; Ivanova, Natalia N.; Mavromatis, Konstantinos

    The DOE-JGI Metagenome Annotation Pipeline (MAP v.4) performs structural and functional annotation for metagenomic sequences that are submitted to the Integrated Microbial Genomes with Microbiomes (IMG/M) system for comparative analysis. The pipeline runs on nucleotide sequences provide d via the IMG submission site. Users must first define their analysis projects in GOLD and then submit the associated sequence datasets consisting of scaffolds/contigs with optional coverage information and/or unassembled reads in fasta and fastq file formats. The MAP processing consists of feature prediction including identification of protein-coding genes, non-coding RNAs and regulatory RNAs, as well as CRISPR elements. Structural annotation ismore » followed by functional annotation including assignment of protein product names and connection to various protein family databases.« less

  12. Long Non-coding RNAs and Their Biological Roles in Plants

    PubMed Central

    Liu, Xue; Hao, Lili; Li, Dayong; Zhu, Lihuang; Hu, Songnian

    2015-01-01

    With the development of genomics and bioinformatics, especially the extensive applications of high-throughput sequencing technology, more transcriptional units with little or no protein-coding potential have been discovered. Such RNA molecules are called non-protein-coding RNAs (npcRNAs or ncRNAs). Among them, long npcRNAs or ncRNAs (lnpcRNAs or lncRNAs) represent diverse classes of transcripts longer than 200 nucleotides. In recent years, the lncRNAs have been considered as important regulators in many essential biological processes. In plants, although a large number of lncRNA transcripts have been predicted and identified in few species, our current knowledge of their biological functions is still limited. Here, we have summarized recent studies on their identification, characteristics, classification, bioinformatics, resources, and current exploration of their biological functions in plants. PMID:25936895

  13. Mitochondrial genetic codes evolve to match amino acid requirements of proteins.

    PubMed

    Swire, Jonathan; Judson, Olivia P; Burt, Austin

    2005-01-01

    Mitochondria often use genetic codes different from the standard genetic code. Now that many mitochondrial genomes have been sequenced, these variant codes provide the first opportunity to examine empirically the processes that produce new genetic codes. The key question is: Are codon reassignments the sole result of mutation and genetic drift? Or are they the result of natural selection? Here we present an analysis of 24 phylogenetically independent codon reassignments in mitochondria. Although the mutation-drift hypothesis can explain reassignments from stop to an amino acid, we found that it cannot explain reassignments from one amino acid to another. In particular--and contrary to the predictions of the mutation-drift hypothesis--the codon involved in such a reassignment was not rare in the ancestral genome. Instead, such reassignments appear to take place while the codon is in use at an appreciable frequency. Moreover, the comparison of inferred amino acid usage in the ancestral genome with the neutral expectation shows that the amino acid gaining the codon was selectively favored over the amino acid losing the codon. These results are consistent with a simple model of weak selection on the amino acid composition of proteins in which codon reassignments are selected because they compensate for multiple slightly deleterious mutations throughout the mitochondrial genome. We propose that the selection pressure is for reduced protein synthesis cost: most reassignments give amino acids that are less expensive to synthesize. Taken together, our results strongly suggest that mitochondrial genetic codes evolve to match the amino acid requirements of proteins.

  14. Consistent prediction of GO protein localization.

    PubMed

    Spetale, Flavio E; Arce, Debora; Krsticevic, Flavia; Bulacio, Pilar; Tapia, Elizabeth

    2018-05-17

    The GO-Cellular Component (GO-CC) ontology provides a controlled vocabulary for the consistent description of the subcellular compartments or macromolecular complexes where proteins may act. Current machine learning-based methods used for the automated GO-CC annotation of proteins suffer from the inconsistency of individual GO-CC term predictions. Here, we present FGGA-CC + , a class of hierarchical graph-based classifiers for the consistent GO-CC annotation of protein coding genes at the subcellular compartment or macromolecular complex levels. Aiming to boost the accuracy of GO-CC predictions, we make use of the protein localization knowledge in the GO-Biological Process (GO-BP) annotations to boost the accuracy of GO-CC prediction. As a result, FGGA-CC + classifiers are built from annotation data in both the GO-CC and GO-BP ontologies. Due to their graph-based design, FGGA-CC + classifiers are fully interpretable and their predictions amenable to expert analysis. Promising results on protein annotation data from five model organisms were obtained. Additionally, successful validation results in the annotation of a challenging subset of tandem duplicated genes in the tomato non-model organism were accomplished. Overall, these results suggest that FGGA-CC + classifiers can indeed be useful for satisfying the huge demand of GO-CC annotation arising from ubiquitous high throughout sequencing and proteomic projects.

  15. Rare and Coding Region Genetic Variants Associated With Risk of Ischemic Stroke: The NHLBI Exome Sequence Project.

    PubMed

    Auer, Paul L; Nalls, Mike; Meschia, James F; Worrall, Bradford B; Longstreth, W T; Seshadri, Sudha; Kooperberg, Charles; Burger, Kathleen M; Carlson, Christopher S; Carty, Cara L; Chen, Wei-Min; Cupples, L Adrienne; DeStefano, Anita L; Fornage, Myriam; Hardy, John; Hsu, Li; Jackson, Rebecca D; Jarvik, Gail P; Kim, Daniel S; Lakshminarayan, Kamakshi; Lange, Leslie A; Manichaikul, Ani; Quinlan, Aaron R; Singleton, Andrew B; Thornton, Timothy A; Nickerson, Deborah A; Peters, Ulrike; Rich, Stephen S

    2015-07-01

    Stroke is the second leading cause of death and the third leading cause of years of life lost. Genetic factors contribute to stroke prevalence, and candidate gene and genome-wide association studies (GWAS) have identified variants associated with ischemic stroke risk. These variants often have small effects without obvious biological significance. Exome sequencing may discover predicted protein-altering variants with a potentially large effect on ischemic stroke risk. To investigate the contribution of rare and common genetic variants to ischemic stroke risk by targeting the protein-coding regions of the human genome. The National Heart, Lung, and Blood Institute (NHLBI) Exome Sequencing Project (ESP) analyzed approximately 6000 participants from numerous cohorts of European and African ancestry. For discovery, 365 cases of ischemic stroke (small-vessel and large-vessel subtypes) and 809 European ancestry controls were sequenced; for replication, 47 affected sibpairs concordant for stroke subtype and an African American case-control series were sequenced, with 1672 cases and 4509 European ancestry controls genotyped. The ESP's exome sequencing and genotyping started on January 1, 2010, and continued through June 30, 2012. Analyses were conducted on the full data set between July 12, 2012, and July 13, 2013. Discovery of new variants or genes contributing to ischemic stroke risk and subtype (primary analysis) and determination of support for protein-coding variants contributing to risk in previously published candidate genes (secondary analysis). We identified 2 novel genes associated with an increased risk of ischemic stroke: a protein-coding variant in PDE4DIP (rs1778155; odds ratio, 2.15; P = 2.63 × 10(-8)) with an intracellular signal transduction mechanism and in ACOT4 (rs35724886; odds ratio, 2.04; P = 1.24 × 10(-7)) with a fatty acid metabolism; confirmation of PDE4DIP was observed in affected sibpair families with large-vessel stroke subtype and in African Americans. Replication of protein-coding variants in candidate genes was observed for 2 previously reported GWAS associations: ZFHX3 (cardioembolic stroke) and ABCA1 (large-vessel stroke). Exome sequencing discovered 2 novel genes and mechanisms, PDE4DIP and ACOT4, associated with increased risk for ischemic stroke. In addition, ZFHX3 and ABCA1 were discovered to have protein-coding variants associated with ischemic stroke. These results suggest that genetic variation in novel pathways contributes to ischemic stroke risk and serves as a target for prediction, prevention, and therapy.

  16. An integrative strategy to identify the entire protein coding potential of prokaryotic genomes by proteogenomics.

    PubMed

    Omasits, Ulrich; Varadarajan, Adithi R; Schmid, Michael; Goetze, Sandra; Melidis, Damianos; Bourqui, Marc; Nikolayeva, Olga; Québatte, Maxime; Patrignani, Andrea; Dehio, Christoph; Frey, Juerg E; Robinson, Mark D; Wollscheid, Bernd; Ahrens, Christian H

    2017-12-01

    Accurate annotation of all protein-coding sequences (CDSs) is an essential prerequisite to fully exploit the rapidly growing repertoire of completely sequenced prokaryotic genomes. However, large discrepancies among the number of CDSs annotated by different resources, missed functional short open reading frames (sORFs), and overprediction of spurious ORFs represent serious limitations. Our strategy toward accurate and complete genome annotation consolidates CDSs from multiple reference annotation resources, ab initio gene prediction algorithms and in silico ORFs (a modified six-frame translation considering alternative start codons) in an integrated proteogenomics database (iPtgxDB) that covers the entire protein-coding potential of a prokaryotic genome. By extending the PeptideClassifier concept of unambiguous peptides for prokaryotes, close to 95% of the identifiable peptides imply one distinct protein, largely simplifying downstream analysis. Searching a comprehensive Bartonella henselae proteomics data set against such an iPtgxDB allowed us to unambiguously identify novel ORFs uniquely predicted by each resource, including lipoproteins, differentially expressed and membrane-localized proteins, novel start sites and wrongly annotated pseudogenes. Most novelties were confirmed by targeted, parallel reaction monitoring mass spectrometry, including unique ORFs and single amino acid variations (SAAVs) identified in a re-sequenced laboratory strain that are not present in its reference genome. We demonstrate the general applicability of our strategy for genomes with varying GC content and distinct taxonomic origin. We release iPtgxDBs for B. henselae , Bradyrhizobium diazoefficiens and Escherichia coli and the software to generate both proteogenomics search databases and integrated annotation files that can be viewed in a genome browser for any prokaryote. © 2017 Omasits et al.; Published by Cold Spring Harbor Laboratory Press.

  17. SMOQ: a tool for predicting the absolute residue-specific quality of a single protein model with support vector machines

    PubMed Central

    2014-01-01

    Background It is important to predict the quality of a protein structural model before its native structure is known. The method that can predict the absolute local quality of individual residues in a single protein model is rare, yet particularly needed for using, ranking and refining protein models. Results We developed a machine learning tool (SMOQ) that can predict the distance deviation of each residue in a single protein model. SMOQ uses support vector machines (SVM) with protein sequence and structural features (i.e. basic feature set), including amino acid sequence, secondary structures, solvent accessibilities, and residue-residue contacts to make predictions. We also trained a SVM model with two new additional features (profiles and SOV scores) on 20 CASP8 targets and found that including them can only improve the performance when real deviations between native and model are higher than 5Å. The SMOQ tool finally released uses the basic feature set trained on 85 CASP8 targets. Moreover, SMOQ implemented a way to convert predicted local quality scores into a global quality score. SMOQ was tested on the 84 CASP9 single-domain targets. The average difference between the residue-specific distance deviation predicted by our method and the actual distance deviation on the test data is 2.637Å. The global quality prediction accuracy of the tool is comparable to other good tools on the same benchmark. Conclusion SMOQ is a useful tool for protein single model quality assessment. Its source code and executable are available at: http://sysbio.rnet.missouri.edu/multicom_toolbox/. PMID:24776231

  18. SMOQ: a tool for predicting the absolute residue-specific quality of a single protein model with support vector machines.

    PubMed

    Cao, Renzhi; Wang, Zheng; Wang, Yiheng; Cheng, Jianlin

    2014-04-28

    It is important to predict the quality of a protein structural model before its native structure is known. The method that can predict the absolute local quality of individual residues in a single protein model is rare, yet particularly needed for using, ranking and refining protein models. We developed a machine learning tool (SMOQ) that can predict the distance deviation of each residue in a single protein model. SMOQ uses support vector machines (SVM) with protein sequence and structural features (i.e. basic feature set), including amino acid sequence, secondary structures, solvent accessibilities, and residue-residue contacts to make predictions. We also trained a SVM model with two new additional features (profiles and SOV scores) on 20 CASP8 targets and found that including them can only improve the performance when real deviations between native and model are higher than 5Å. The SMOQ tool finally released uses the basic feature set trained on 85 CASP8 targets. Moreover, SMOQ implemented a way to convert predicted local quality scores into a global quality score. SMOQ was tested on the 84 CASP9 single-domain targets. The average difference between the residue-specific distance deviation predicted by our method and the actual distance deviation on the test data is 2.637Å. The global quality prediction accuracy of the tool is comparable to other good tools on the same benchmark. SMOQ is a useful tool for protein single model quality assessment. Its source code and executable are available at: http://sysbio.rnet.missouri.edu/multicom_toolbox/.

  19. InterProScan 5: genome-scale protein function classification

    PubMed Central

    Jones, Philip; Binns, David; Chang, Hsin-Yu; Fraser, Matthew; Li, Weizhong; McAnulla, Craig; McWilliam, Hamish; Maslen, John; Mitchell, Alex; Nuka, Gift; Pesseat, Sebastien; Quinn, Antony F.; Sangrador-Vegas, Amaia; Scheremetjew, Maxim; Yong, Siew-Yit; Lopez, Rodrigo; Hunter, Sarah

    2014-01-01

    Motivation: Robust large-scale sequence analysis is a major challenge in modern genomic science, where biologists are frequently trying to characterize many millions of sequences. Here, we describe a new Java-based architecture for the widely used protein function prediction software package InterProScan. Developments include improvements and additions to the outputs of the software and the complete reimplementation of the software framework, resulting in a flexible and stable system that is able to use both multiprocessor machines and/or conventional clusters to achieve scalable distributed data analysis. InterProScan is freely available for download from the EMBl-EBI FTP site and the open source code is hosted at Google Code. Availability and implementation: InterProScan is distributed via FTP at ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/ and the source code is available from http://code.google.com/p/interproscan/. Contact: http://www.ebi.ac.uk/support or interhelp@ebi.ac.uk or mitchell@ebi.ac.uk PMID:24451626

  20. Complete mitochondrial genome of Taharana fasciana (Insecta, Hemiptera: Cicadellidae) and comparison with other Cicadellidae insects.

    PubMed

    Wang, Jiajia; Li, Hu; Dai, Renhuai

    2017-12-01

    Here, we describe the first complete mitochondrial genome (mitogenome) sequence of the leafhopper Taharana fasciana (Coelidiinae). The mitogenome sequence contains 15,161 bp with an A + T content of 77.9%. It includes 13 protein-coding genes, two ribosomal RNA genes, 22 transfer RNA genes, and one non-coding (A + T-rich) region; in addition, a repeat region is also present (GenBank accession no. KY886913). These genes/regions are in the same order as in the inferred insect ancestral mitogenome. All protein-coding genes have ATN as the start codon, and TAA or single T as the stop codons, except the gene ND3, which ends with TAG. Furthermore, we predicted the secondary structures of the rRNAs in T. fasciana. Six domains (domain III is absent in arthropods) and 41 helices were predicted for 16S rRNA, and 12S rRNA comprised three structural domains and 24 helices. Phylogenetic tree analysis confirmed that T. fasciana and other members of the Cicadellidae are clustered into a clade, and it identified the relationships among the subfamilies Deltocephalinae, Coelidiinae, Idiocerinae, Cicadellinae, and Typhlocybinae.

  1. AptRank: an adaptive PageRank model for protein function prediction on   bi-relational graphs.

    PubMed

    Jiang, Biaobin; Kloster, Kyle; Gleich, David F; Gribskov, Michael

    2017-06-15

    Diffusion-based network models are widely used for protein function prediction using protein network data and have been shown to outperform neighborhood-based and module-based methods. Recent studies have shown that integrating the hierarchical structure of the Gene Ontology (GO) data dramatically improves prediction accuracy. However, previous methods usually either used the GO hierarchy to refine the prediction results of multiple classifiers, or flattened the hierarchy into a function-function similarity kernel. No study has taken the GO hierarchy into account together with the protein network as a two-layer network model. We first construct a Bi-relational graph (Birg) model comprised of both protein-protein association and function-function hierarchical networks. We then propose two diffusion-based methods, BirgRank and AptRank, both of which use PageRank to diffuse information on this two-layer graph model. BirgRank is a direct application of traditional PageRank with fixed decay parameters. In contrast, AptRank utilizes an adaptive diffusion mechanism to improve the performance of BirgRank. We evaluate the ability of both methods to predict protein function on yeast, fly and human protein datasets, and compare with four previous methods: GeneMANIA, TMC, ProteinRank and clusDCA. We design four different validation strategies: missing function prediction, de novo function prediction, guided function prediction and newly discovered function prediction to comprehensively evaluate predictability of all six methods. We find that both BirgRank and AptRank outperform the previous methods, especially in missing function prediction when using only 10% of the data for training. The MATLAB code is available at https://github.rcac.purdue.edu/mgribsko/aptrank . gribskov@purdue.edu. Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com

  2. Transmembrane protein topology prediction using support vector machines.

    PubMed

    Nugent, Timothy; Jones, David T

    2009-05-26

    Alpha-helical transmembrane (TM) proteins are involved in a wide range of important biological processes such as cell signaling, transport of membrane-impermeable molecules, cell-cell communication, cell recognition and cell adhesion. Many are also prime drug targets, and it has been estimated that more than half of all drugs currently on the market target membrane proteins. However, due to the experimental difficulties involved in obtaining high quality crystals, this class of protein is severely under-represented in structural databases. In the absence of structural data, sequence-based prediction methods allow TM protein topology to be investigated. We present a support vector machine-based (SVM) TM protein topology predictor that integrates both signal peptide and re-entrant helix prediction, benchmarked with full cross-validation on a novel data set of 131 sequences with known crystal structures. The method achieves topology prediction accuracy of 89%, while signal peptides and re-entrant helices are predicted with 93% and 44% accuracy respectively. An additional SVM trained to discriminate between globular and TM proteins detected zero false positives, with a low false negative rate of 0.4%. We present the results of applying these tools to a number of complete genomes. Source code, data sets and a web server are freely available from http://bioinf.cs.ucl.ac.uk/psipred/. The high accuracy of TM topology prediction which includes detection of both signal peptides and re-entrant helices, combined with the ability to effectively discriminate between TM and globular proteins, make this method ideally suited to whole genome annotation of alpha-helical transmembrane proteins.

  3. Protein-ligand interfaces are polarized: discovery of a strong trend for intermolecular hydrogen bonds to favor donors on the protein side with implications for predicting and designing ligand complexes.

    PubMed

    Raschka, Sebastian; Wolf, Alex J; Bemister-Buffington, Joseph; Kuhn, Leslie A

    2018-04-01

    Understanding how proteins encode ligand specificity is fascinating and similar in importance to deciphering the genetic code. For protein-ligand recognition, the combination of an almost infinite variety of interfacial shapes and patterns of chemical groups makes the problem especially challenging. Here we analyze data across non-homologous proteins in complex with small biological ligands to address observations made in our inhibitor discovery projects: that proteins favor donating H-bonds to ligands and avoid using groups with both H-bond donor and acceptor capacity. The resulting clear and significant chemical group matching preferences elucidate the code for protein-native ligand binding, similar to the dominant patterns found in nucleic acid base-pairing. On average, 90% of the keto and carboxylate oxygens occurring in the biological ligands formed direct H-bonds to the protein. A two-fold preference was found for protein atoms to act as H-bond donors and ligand atoms to act as acceptors, and 76% of all intermolecular H-bonds involved an amine donor. Together, the tight chemical and geometric constraints associated with satisfying donor groups generate a hydrogen-bonding lock that can be matched only by ligands bearing the right acceptor-rich key. Measuring an index of H-bond preference based on the observed chemical trends proved sufficient to predict other protein-ligand complexes and can be used to guide molecular design. The resulting Hbind and Protein Recognition Index software packages are being made available for rigorously defining intermolecular H-bonds and measuring the extent to which H-bonding patterns in a given complex match the preference key.

  4. Protein-ligand interfaces are polarized: discovery of a strong trend for intermolecular hydrogen bonds to favor donors on the protein side with implications for predicting and designing ligand complexes

    NASA Astrophysics Data System (ADS)

    Raschka, Sebastian; Wolf, Alex J.; Bemister-Buffington, Joseph; Kuhn, Leslie A.

    2018-02-01

    Understanding how proteins encode ligand specificity is fascinating and similar in importance to deciphering the genetic code. For protein-ligand recognition, the combination of an almost infinite variety of interfacial shapes and patterns of chemical groups makes the problem especially challenging. Here we analyze data across non-homologous proteins in complex with small biological ligands to address observations made in our inhibitor discovery projects: that proteins favor donating H-bonds to ligands and avoid using groups with both H-bond donor and acceptor capacity. The resulting clear and significant chemical group matching preferences elucidate the code for protein-native ligand binding, similar to the dominant patterns found in nucleic acid base-pairing. On average, 90% of the keto and carboxylate oxygens occurring in the biological ligands formed direct H-bonds to the protein. A two-fold preference was found for protein atoms to act as H-bond donors and ligand atoms to act as acceptors, and 76% of all intermolecular H-bonds involved an amine donor. Together, the tight chemical and geometric constraints associated with satisfying donor groups generate a hydrogen-bonding lock that can be matched only by ligands bearing the right acceptor-rich key. Measuring an index of H-bond preference based on the observed chemical trends proved sufficient to predict other protein-ligand complexes and can be used to guide molecular design. The resulting Hbind and Protein Recognition Index software packages are being made available for rigorously defining intermolecular H-bonds and measuring the extent to which H-bonding patterns in a given complex match the preference key.

  5. CaMELS: In silico prediction of calmodulin binding proteins and their binding sites.

    PubMed

    Abbasi, Wajid Arshad; Asif, Amina; Andleeb, Saiqa; Minhas, Fayyaz Ul Amir Afsar

    2017-09-01

    Due to Ca 2+ -dependent binding and the sequence diversity of Calmodulin (CaM) binding proteins, identifying CaM interactions and binding sites in the wet-lab is tedious and costly. Therefore, computational methods for this purpose are crucial to the design of such wet-lab experiments. We present an algorithm suite called CaMELS (CalModulin intEraction Learning System) for predicting proteins that interact with CaM as well as their binding sites using sequence information alone. CaMELS offers state of the art accuracy for both CaM interaction and binding site prediction and can aid biologists in studying CaM binding proteins. For CaM interaction prediction, CaMELS uses protein sequence features coupled with a large-margin classifier. CaMELS models the binding site prediction problem using multiple instance machine learning with a custom optimization algorithm which allows more effective learning over imprecisely annotated CaM-binding sites during training. CaMELS has been extensively benchmarked using a variety of data sets, mutagenic studies, proteome-wide Gene Ontology enrichment analyses and protein structures. Our experiments indicate that CaMELS outperforms simple motif-based search and other existing methods for interaction and binding site prediction. We have also found that the whole sequence of a protein, rather than just its binding site, is important for predicting its interaction with CaM. Using the machine learning model in CaMELS, we have identified important features of protein sequences for CaM interaction prediction as well as characteristic amino acid sub-sequences and their relative position for identifying CaM binding sites. Python code for training and evaluating CaMELS together with a webserver implementation is available at the URL: http://faculty.pieas.edu.pk/fayyaz/software.html#camels. © 2017 Wiley Periodicals, Inc.

  6. CRITICA: coding region identification tool invoking comparative analysis

    NASA Technical Reports Server (NTRS)

    Badger, J. H.; Olsen, G. J.; Woese, C. R. (Principal Investigator)

    1999-01-01

    Gene recognition is essential to understanding existing and future DNA sequence data. CRITICA (Coding Region Identification Tool Invoking Comparative Analysis) is a suite of programs for identifying likely protein-coding sequences in DNA by combining comparative analysis of DNA sequences with more common noncomparative methods. In the comparative component of the analysis, regions of DNA are aligned with related sequences from the DNA databases; if the translation of the aligned sequences has greater amino acid identity than expected for the observed percentage nucleotide identity, this is interpreted as evidence for coding. CRITICA also incorporates noncomparative information derived from the relative frequencies of hexanucleotides in coding frames versus other contexts (i.e., dicodon bias). The dicodon usage information is derived by iterative analysis of the data, such that CRITICA is not dependent on the existence or accuracy of coding sequence annotations in the databases. This independence makes the method particularly well suited for the analysis of novel genomes. CRITICA was tested by analyzing the available Salmonella typhimurium DNA sequences. Its predictions were compared with the DNA sequence annotations and with the predictions of GenMark. CRITICA proved to be more accurate than GenMark, and moreover, many of its predictions that would seem to be errors instead reflect problems in the sequence databases. The source code of CRITICA is freely available by anonymous FTP (rdp.life.uiuc.edu in/pub/critica) and on the World Wide Web (http:/(/)rdpwww.life.uiuc.edu).

  7. Inter-species prediction of protein phosphorylation in the sbv IMPROVER species translation challenge

    PubMed Central

    Biehl, Michael; Sadowski, Peter; Bhanot, Gyan; Bilal, Erhan; Dayarian, Adel; Meyer, Pablo; Norel, Raquel; Rhrissorrakrai, Kahn; Zeller, Michael D.; Hormoz, Sahand

    2015-01-01

    Motivation: Animal models are widely used in biomedical research for reasons ranging from practical to ethical. An important issue is whether rodent models are predictive of human biology. This has been addressed recently in the framework of a series of challenges designed by the systems biology verification for Industrial Methodology for Process Verification in Research (sbv IMPROVER) initiative. In particular, one of the sub-challenges was devoted to the prediction of protein phosphorylation responses in human bronchial epithelial cells, exposed to a number of different chemical stimuli, given the responses in rat bronchial epithelial cells. Participating teams were asked to make inter-species predictions on the basis of available training examples, comprising transcriptomics and phosphoproteomics data. Results: Here, the two best performing teams present their data-driven approaches and computational methods. In addition, post hoc analyses of the datasets and challenge results were performed by the participants and challenge organizers. The challenge outcome indicates that successful prediction of protein phosphorylation status in human based on rat phosphorylation levels is feasible. However, within the limitations of the computational tools used, the inclusion of gene expression data does not improve the prediction quality. The post hoc analysis of time-specific measurements sheds light on the signaling pathways in both species. Availability and implementation: A detailed description of the dataset, challenge design and outcome is available at www.sbvimprover.com. The code used by team IGB is provided under http://github.com/uci-igb/improver2013. Implementations of the algorithms applied by team AMG are available at http://bhanot.biomaps.rutgers.edu/wiki/AMG-sc2-code.zip. Contact: meikelbiehl@gmail.com PMID:24994890

  8. Beyond the Triplet Code: Context Cues Transform Translation.

    PubMed

    Brar, Gloria A

    2016-12-15

    The elucidation of the genetic code remains among the most influential discoveries in biology. While innumerable studies have validated the general universality of the code and its value in predicting and analyzing protein coding sequences, established and emerging work has also suggested that full genome decryption may benefit from a greater consideration of a codon's neighborhood within an mRNA than has been broadly applied. This Review examines the evidence for context cues in translation, with a focus on several recent studies that reveal broad roles for mRNA context in programming translation start sites, the rate of translation elongation, and stop codon identity. Copyright © 2016 Elsevier Inc. All rights reserved.

  9. Draft Genome Sequence of Photorhabdus luminescens Strain DSPV002N Isolated from Santa Fe, Argentina

    PubMed Central

    Del Valle, Eleodoro E.; Frizzo, Laureano; Berry, Colin; Caballero, Primitivo

    2016-01-01

    Here, we report the draft genome sequence of Photorhabdus luminescens strain DSPV002N, which consists of 177 contig sequences accounting for 5,518,143 bp, with a G+C content of 42.3% and 4,701 predicted protein-coding genes (CDSs). From these, 27 CDSs exhibited significant similarity with insecticidal toxin proteins from Photorhabdus luminescens subsp. laumondii TT01. PMID:27469965

  10. MHC class I-associated peptides derive from selective regions of the human genome.

    PubMed

    Pearson, Hillary; Daouda, Tariq; Granados, Diana Paola; Durette, Chantal; Bonneil, Eric; Courcelles, Mathieu; Rodenbrock, Anja; Laverdure, Jean-Philippe; Côté, Caroline; Mader, Sylvie; Lemieux, Sébastien; Thibault, Pierre; Perreault, Claude

    2016-12-01

    MHC class I-associated peptides (MAPs) define the immune self for CD8+ T lymphocytes and are key targets of cancer immunosurveillance. Here, the goals of our work were to determine whether the entire set of protein-coding genes could generate MAPs and whether specific features influence the ability of discrete genes to generate MAPs. Using proteogenomics, we have identified 25,270 MAPs isolated from the B lymphocytes of 18 individuals who collectively expressed 27 high-frequency HLA-A,B allotypes. The entire MAP repertoire presented by these 27 allotypes covered only 10% of the exomic sequences expressed in B lymphocytes. Indeed, 41% of expressed protein-coding genes generated no MAPs, while 59% of genes generated up to 64 MAPs, often derived from adjacent regions and presented by different allotypes. We next identified several features of transcripts and proteins associated with efficient MAP production. From these data, we built a logistic regression model that predicts with good accuracy whether a gene generates MAPs. Our results show preferential selection of MAPs from a limited repertoire of proteins with distinctive features. The notion that the MHC class I immunopeptidome presents only a small fraction of the protein-coding genome for monitoring by the immune system has profound implications in autoimmunity and cancer immunology.

  11. MHC class I–associated peptides derive from selective regions of the human genome

    PubMed Central

    Pearson, Hillary; Granados, Diana Paola; Durette, Chantal; Bonneil, Eric; Courcelles, Mathieu; Rodenbrock, Anja; Laverdure, Jean-Philippe; Côté, Caroline; Thibault, Pierre

    2016-01-01

    MHC class I–associated peptides (MAPs) define the immune self for CD8+ T lymphocytes and are key targets of cancer immunosurveillance. Here, the goals of our work were to determine whether the entire set of protein-coding genes could generate MAPs and whether specific features influence the ability of discrete genes to generate MAPs. Using proteogenomics, we have identified 25,270 MAPs isolated from the B lymphocytes of 18 individuals who collectively expressed 27 high-frequency HLA-A,B allotypes. The entire MAP repertoire presented by these 27 allotypes covered only 10% of the exomic sequences expressed in B lymphocytes. Indeed, 41% of expressed protein-coding genes generated no MAPs, while 59% of genes generated up to 64 MAPs, often derived from adjacent regions and presented by different allotypes. We next identified several features of transcripts and proteins associated with efficient MAP production. From these data, we built a logistic regression model that predicts with good accuracy whether a gene generates MAPs. Our results show preferential selection of MAPs from a limited repertoire of proteins with distinctive features. The notion that the MHC class I immunopeptidome presents only a small fraction of the protein-coding genome for monitoring by the immune system has profound implications in autoimmunity and cancer immunology. PMID:27841757

  12. Probing the functions of long non-coding RNAs by exploiting the topology of global association and interaction network.

    PubMed

    Deng, Lei; Wu, Hongjie; Liu, Chuyao; Zhan, Weihua; Zhang, Jingpu

    2018-06-01

    Long non-coding RNAs (lncRNAs) are involved in many biological processes, such as immune response, development, differentiation and gene imprinting and are associated with diseases and cancers. But the functions of the vast majority of lncRNAs are still unknown. Predicting the biological functions of lncRNAs is one of the key challenges in the post-genomic era. In our work, We first build a global network including a lncRNA similarity network, a lncRNA-protein association network and a protein-protein interaction network according to the expressions and interactions, then extract the topological feature vectors of the global network. Using these features, we present an SVM-based machine learning approach, PLNRGO, to annotate human lncRNAs. In PLNRGO, we construct a training data set according to the proteins with GO annotations and train a binary classifier for each GO term. We assess the performance of PLNRGO on our manually annotated lncRNA benchmark and a protein-coding gene benchmark with known functional annotations. As a result, the performance of our method is significantly better than that of other state-of-the-art methods in terms of maximum F-measure and coverage. Copyright © 2018 Elsevier Ltd. All rights reserved.

  13. Long non-coding RNAs and their biological roles in plants.

    PubMed

    Liu, Xue; Hao, Lili; Li, Dayong; Zhu, Lihuang; Hu, Songnian

    2015-06-01

    With the development of genomics and bioinformatics, especially the extensive applications of high-throughput sequencing technology, more transcriptional units with little or no protein-coding potential have been discovered. Such RNA molecules are called non-protein-coding RNAs (npcRNAs or ncRNAs). Among them, long npcRNAs or ncRNAs (lnpcRNAs or lncRNAs) represent diverse classes of transcripts longer than 200 nucleotides. In recent years, the lncRNAs have been considered as important regulators in many essential biological processes. In plants, although a large number of lncRNA transcripts have been predicted and identified in few species, our current knowledge of their biological functions is still limited. Here, we have summarized recent studies on their identification, characteristics, classification, bioinformatics, resources, and current exploration of their biological functions in plants. Copyright © 2015 The Authors. Production and hosting by Elsevier Ltd.. All rights reserved.

  14. A Deep Learning Framework for Robust and Accurate Prediction of ncRNA-Protein Interactions Using Evolutionary Information.

    PubMed

    Yi, Hai-Cheng; You, Zhu-Hong; Huang, De-Shuang; Li, Xiao; Jiang, Tong-Hai; Li, Li-Ping

    2018-06-01

    The interactions between non-coding RNAs (ncRNAs) and proteins play an important role in many biological processes, and their biological functions are primarily achieved by binding with a variety of proteins. High-throughput biological techniques are used to identify protein molecules bound with specific ncRNA, but they are usually expensive and time consuming. Deep learning provides a powerful solution to computationally predict RNA-protein interactions. In this work, we propose the RPI-SAN model by using the deep-learning stacked auto-encoder network to mine the hidden high-level features from RNA and protein sequences and feed them into a random forest (RF) model to predict ncRNA binding proteins. Stacked assembling is further used to improve the accuracy of the proposed method. Four benchmark datasets, including RPI2241, RPI488, RPI1807, and NPInter v2.0, were employed for the unbiased evaluation of five established prediction tools: RPI-Pred, IPMiner, RPISeq-RF, lncPro, and RPI-SAN. The experimental results show that our RPI-SAN model achieves much better performance than other methods, with accuracies of 90.77%, 89.7%, 96.1%, and 99.33%, respectively. It is anticipated that RPI-SAN can be used as an effective computational tool for future biomedical researches and can accurately predict the potential ncRNA-protein interacted pairs, which provides reliable guidance for biological research. Copyright © 2018 The Author(s). Published by Elsevier Inc. All rights reserved.

  15. AUTO-MUTE 2.0: A Portable Framework with Enhanced Capabilities for Predicting Protein Functional Consequences upon Mutation.

    PubMed

    Masso, Majid; Vaisman, Iosif I

    2014-01-01

    The AUTO-MUTE 2.0 stand-alone software package includes a collection of programs for predicting functional changes to proteins upon single residue substitutions, developed by combining structure-based features with trained statistical learning models. Three of the predictors evaluate changes to protein stability upon mutation, each complementing a distinct experimental approach. Two additional classifiers are available, one for predicting activity changes due to residue replacements and the other for determining the disease potential of mutations associated with nonsynonymous single nucleotide polymorphisms (nsSNPs) in human proteins. These five command-line driven tools, as well as all the supporting programs, complement those that run our AUTO-MUTE web-based server. Nevertheless, all the codes have been rewritten and substantially altered for the new portable software, and they incorporate several new features based on user feedback. Included among these upgrades is the ability to perform three highly requested tasks: to run "big data" batch jobs; to generate predictions using modified protein data bank (PDB) structures, and unpublished personal models prepared using standard PDB file formatting; and to utilize NMR structure files that contain multiple models.

  16. MINS2: revisiting the molecular code for transmembrane-helix recognition by the Sec61 translocon.

    PubMed

    Park, Yungki; Helms, Volkhard

    2008-08-15

    To be fully functional, membrane proteins should not only fold, but also get inserted into the membrane, which is mediated by the Sec61 translocon. Recent experimental studies have attempted to elucidate how the Sec61 translocon accomplishes this delicate task by measuring the translocon-mediated membrane insertion free energies of 357 systematically designed peptides. On the basis of this data set, we have developed MINS2, a novel sequence-based computational method for predicting the membrane insertion free energies of protein sequences. A benchmark analysis of MINS2 shows that MINS2 signi.cantly outperforms previously proposed methods. Importantly, the application of MINS2 to known membrane protein structures shows that a better prediction of membrane insertion free energies does not lead to a better prediction of transmembrane segments of polytopic membrane proteins. A web server for MINS2 is publicly available at http://service.bioinformatik.uni-saarland.de/mins. Supplementary data are available at Bioinformatics online.

  17. Ligand Binding Site Detection by Local Structure Alignment and Its Performance Complementarity

    PubMed Central

    Lee, Hui Sun; Im, Wonpil

    2013-01-01

    Accurate determination of potential ligand binding sites (BS) is a key step for protein function characterization and structure-based drug design. Despite promising results of template-based BS prediction methods using global structure alignment (GSA), there is a room to improve the performance by properly incorporating local structure alignment (LSA) because BS are local structures and often similar for proteins with dissimilar global folds. We present a template-based ligand BS prediction method using G-LoSA, our LSA tool. A large benchmark set validation shows that G-LoSA predicts drug-like ligands’ positions in single-chain protein targets more precisely than TM-align, a GSA-based method, while the overall success rate of TM-align is better. G-LoSA is particularly efficient for accurate detection of local structures conserved across proteins with diverse global topologies. Recognizing the performance complementarity of G-LoSA to TM-align and a non-template geometry-based method, fpocket, a robust consensus scoring method, CMCS-BSP (Complementary Methods and Consensus Scoring for ligand Binding Site Prediction), is developed and shows improvement on prediction accuracy. The G-LoSA source code is freely available at http://im.bioinformatics.ku.edu/GLoSA. PMID:23957286

  18. Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures

    PubMed Central

    Stark, Alexander; Lin, Michael F.; Kheradpour, Pouya; Pedersen, Jakob S.; Parts, Leopold; Carlson, Joseph W.; Crosby, Madeline A.; Rasmussen, Matthew D.; Roy, Sushmita; Deoras, Ameya N.; Ruby, J. Graham; Brennecke, Julius; Hodges, Emily; Hinrichs, Angie S.; Caspi, Anat; Paten, Benedict; Park, Seung-Won; Han, Mira V.; Maeder, Morgan L.; Polansky, Benjamin J.; Robson, Bryanne E.; Aerts, Stein; van Helden, Jacques; Hassan, Bassem; Gilbert, Donald G.; Eastman, Deborah A.; Rice, Michael; Weir, Michael; Hahn, Matthew W.; Park, Yongkyu; Dewey, Colin N.; Pachter, Lior; Kent, W. James; Haussler, David; Lai, Eric C.; Bartel, David P.; Hannon, Gregory J.; Kaufman, Thomas C.; Eisen, Michael B.; Clark, Andrew G.; Smith, Douglas; Celniker, Susan E.; Gelbart, William M.; Kellis, Manolis

    2008-01-01

    Sequencing of multiple related species followed by comparative genomics analysis constitutes a powerful approach for the systematic understanding of any genome. Here, we use the genomes of 12 Drosophila species for the de novo discovery of functional elements in the fly. Each type of functional element shows characteristic patterns of change, or ‘evolutionary signatures’, dictated by its precise selective constraints. Such signatures enable recognition of new protein-coding genes and exons, spurious and incorrect gene annotations, and numerous unusual gene structures, including abundant stop-codon readthrough. Similarly, we predict non-protein-coding RNA genes and structures, and new microRNA (miRNA) genes. We provide evidence of miRNA processing and functionality from both hairpin arms and both DNA strands. We identify several classes of pre- and post-transcriptional regulatory motifs, and predict individual motif instances with high confidence. We also study how discovery power scales with the divergence and number of species compared, and we provide general guidelines for comparative studies. PMID:17994088

  19. Curated genome annotation of Oryza sativa ssp. japonica and comparative genome analysis with Arabidopsis thaliana

    PubMed Central

    Itoh, Takeshi; Tanaka, Tsuyoshi; Barrero, Roberto A.; Yamasaki, Chisato; Fujii, Yasuyuki; Hilton, Phillip B.; Antonio, Baltazar A.; Aono, Hideo; Apweiler, Rolf; Bruskiewich, Richard; Bureau, Thomas; Burr, Frances; Costa de Oliveira, Antonio; Fuks, Galina; Habara, Takuya; Haberer, Georg; Han, Bin; Harada, Erimi; Hiraki, Aiko T.; Hirochika, Hirohiko; Hoen, Douglas; Hokari, Hiroki; Hosokawa, Satomi; Hsing, Yue; Ikawa, Hiroshi; Ikeo, Kazuho; Imanishi, Tadashi; Ito, Yukiyo; Jaiswal, Pankaj; Kanno, Masako; Kawahara, Yoshihiro; Kawamura, Toshiyuki; Kawashima, Hiroaki; Khurana, Jitendra P.; Kikuchi, Shoshi; Komatsu, Setsuko; Koyanagi, Kanako O.; Kubooka, Hiromi; Lieberherr, Damien; Lin, Yao-Cheng; Lonsdale, David; Matsumoto, Takashi; Matsuya, Akihiro; McCombie, W. Richard; Messing, Joachim; Miyao, Akio; Mulder, Nicola; Nagamura, Yoshiaki; Nam, Jongmin; Namiki, Nobukazu; Numa, Hisataka; Nurimoto, Shin; O’Donovan, Claire; Ohyanagi, Hajime; Okido, Toshihisa; OOta, Satoshi; Osato, Naoki; Palmer, Lance E.; Quetier, Francis; Raghuvanshi, Saurabh; Saichi, Naomi; Sakai, Hiroaki; Sakai, Yasumichi; Sakata, Katsumi; Sakurai, Tetsuya; Sato, Fumihiko; Sato, Yoshiharu; Schoof, Heiko; Seki, Motoaki; Shibata, Michie; Shimizu, Yuji; Shinozaki, Kazuo; Shinso, Yuji; Singh, Nagendra K.; Smith-White, Brian; Takeda, Jun-ichi; Tanino, Motohiko; Tatusova, Tatiana; Thongjuea, Supat; Todokoro, Fusano; Tsugane, Mika; Tyagi, Akhilesh K.; Vanavichit, Apichart; Wang, Aihui; Wing, Rod A.; Yamaguchi, Kaori; Yamamoto, Mayu; Yamamoto, Naoyuki; Yu, Yeisoo; Zhang, Hao; Zhao, Qiang; Higo, Kenichi; Burr, Benjamin; Gojobori, Takashi; Sasaki, Takuji

    2007-01-01

    We present here the annotation of the complete genome of rice Oryza sativa L. ssp. japonica cultivar Nipponbare. All functional annotations for proteins and non-protein-coding RNA (npRNA) candidates were manually curated. Functions were identified or inferred in 19,969 (70%) of the proteins, and 131 possible npRNAs (including 58 antisense transcripts) were found. Almost 5000 annotated protein-coding genes were found to be disrupted in insertional mutant lines, which will accelerate future experimental validation of the annotations. The rice loci were determined by using cDNA sequences obtained from rice and other representative cereals. Our conservative estimate based on these loci and an extrapolation suggested that the gene number of rice is ∼32,000, which is smaller than previous estimates. We conducted comparative analyses between rice and Arabidopsis thaliana and found that both genomes possessed several lineage-specific genes, which might account for the observed differences between these species, while they had similar sets of predicted functional domains among the protein sequences. A system to control translational efficiency seems to be conserved across large evolutionary distances. Moreover, the evolutionary process of protein-coding genes was examined. Our results suggest that natural selection may have played a role for duplicated genes in both species, so that duplication was suppressed or favored in a manner that depended on the function of a gene. PMID:17210932

  20. Principles of protein folding--a perspective from simple exact models.

    PubMed Central

    Dill, K. A.; Bromberg, S.; Yue, K.; Fiebig, K. M.; Yee, D. P.; Thomas, P. D.; Chan, H. S.

    1995-01-01

    General principles of protein structure, stability, and folding kinetics have recently been explored in computer simulations of simple exact lattice models. These models represent protein chains at a rudimentary level, but they involve few parameters, approximations, or implicit biases, and they allow complete explorations of conformational and sequence spaces. Such simulations have resulted in testable predictions that are sometimes unanticipated: The folding code is mainly binary and delocalized throughout the amino acid sequence. The secondary and tertiary structures of a protein are specified mainly by the sequence of polar and nonpolar monomers. More specific interactions may refine the structure, rather than dominate the folding code. Simple exact models can account for the properties that characterize protein folding: two-state cooperativity, secondary and tertiary structures, and multistage folding kinetics--fast hydrophobic collapse followed by slower annealing. These studies suggest the possibility of creating "foldable" chain molecules other than proteins. The encoding of a unique compact chain conformation may not require amino acids; it may require only the ability to synthesize specific monomer sequences in which at least one monomer type is solvent-averse. PMID:7613459

  1. Meta4: a web application for sharing and annotating metagenomic gene predictions using web services.

    PubMed

    Richardson, Emily J; Escalettes, Franck; Fotheringham, Ian; Wallace, Robert J; Watson, Mick

    2013-01-01

    Whole-genome shotgun metagenomics experiments produce DNA sequence data from entire ecosystems, and provide a huge amount of novel information. Gene discovery projects require up-to-date information about sequence homology and domain structure for millions of predicted proteins to be presented in a simple, easy-to-use system. There is a lack of simple, open, flexible tools that allow the rapid sharing of metagenomics datasets with collaborators in a format they can easily interrogate. We present Meta4, a flexible and extensible web application that can be used to share and annotate metagenomic gene predictions. Proteins and predicted domains are stored in a simple relational database, with a dynamic front-end which displays the results in an internet browser. Web services are used to provide up-to-date information about the proteins from homology searches against public databases. Information about Meta4 can be found on the project website, code is available on Github, a cloud image is available, and an example implementation can be seen at.

  2. Protein functional features are reflected in the patterns of mRNA translation speed.

    PubMed

    López, Daniel; Pazos, Florencio

    2015-07-09

    The degeneracy of the genetic code makes it possible for the same amino acid string to be coded by different messenger RNA (mRNA) sequences. These "synonymous mRNAs" may differ largely in a number of aspects related to their overall translational efficiency, such as secondary structure content and availability of the encoded transfer RNAs (tRNAs). Consequently, they may render different yields of the translated polypeptides. These mRNA features related to translation efficiency are also playing a role locally, resulting in a non-uniform translation speed along the mRNA, which has been previously related to some protein structural features and also used to explain some dramatic effects of "silent" single-nucleotide-polymorphisms (SNPs). In this work we perform the first large scale analysis of the relationship between three experimental proxies of mRNA local translation efficiency and the local features of the corresponding encoded proteins. We found that a number of protein functional and structural features are reflected in the patterns of ribosome occupancy, secondary structure and tRNA availability along the mRNA. One or more of these proxies of translation speed have distinctive patterns around the mRNA regions coding for certain protein local features. In some cases the three patterns follow a similar trend. We also show specific examples where these patterns of translation speed point to the protein's important structural and functional features. This support the idea that the genome not only codes the protein functional features as sequences of amino acids, but also as subtle patterns of mRNA properties which, probably through local effects on the translation speed, have some consequence on the final polypeptide. These results open the possibility of predicting a protein's functional regions based on a single genomic sequence, and have implications for heterologous protein expression and fine-tuning protein function.

  3. Contribution to the Prediction of the Fold Code: Application to Immunoglobulin and Flavodoxin Cases

    PubMed Central

    Banach, Mateusz; Prudhomme, Nicolas; Carpentier, Mathilde; Duprat, Elodie; Papandreou, Nikolaos; Kalinowska, Barbara; Chomilier, Jacques; Roterman, Irena

    2015-01-01

    Background Folding nucleus of globular proteins formation starts by the mutual interaction of a group of hydrophobic amino acids whose close contacts allow subsequent formation and stability of the 3D structure. These early steps can be predicted by simulation of the folding process through a Monte Carlo (MC) coarse grain model in a discrete space. We previously defined MIRs (Most Interacting Residues), as the set of residues presenting a large number of non-covalent neighbour interactions during such simulation. MIRs are good candidates to define the minimal number of residues giving rise to a given fold instead of another one, although their proportion is rather high, typically [15-20]% of the sequences. Having in mind experiments with two sequences of very high levels of sequence identity (up to 90%) but different folds, we combined the MIR method, which takes sequence as single input, with the “fuzzy oil drop” (FOD) model that requires a 3D structure, in order to estimate the residues coding for the fold. FOD assumes that a globular protein follows an idealised 3D Gaussian distribution of hydrophobicity density, with the maximum in the centre and minima at the surface of the “drop”. If the actual local density of hydrophobicity around a given amino acid is as high as the ideal one, then this amino acid is assigned to the core of the globular protein, and it is assumed to follow the FOD model. Therefore one obtains a distribution of the amino acids of a protein according to their agreement or rejection with the FOD model. Results We compared and combined MIR and FOD methods to define the minimal nucleus, or keystone, of two populated folds: immunoglobulin-like (Ig) and flavodoxins (Flav). The combination of these two approaches defines some positions both predicted as a MIR and assigned as accordant with the FOD model. It is shown here that for these two folds, the intersection of the predicted sets of residues significantly differs from random selection. It reduces the number of selected residues by each individual method and allows a reasonable agreement with experimentally determined key residues coding for the particular fold. In addition, the intersection of the two methods significantly increases the specificity of the prediction, providing a robust set of residues that constitute the folding nucleus. PMID:25915049

  4. Analysis of periplasmic sensor domains from Anaeromyxobacter dehalogenans 2CP-C: Structure of one sensor domain from a histidine kinase and another from a chemotaxis protein

    PubMed Central

    Pokkuluri, P Raj; Dwulit-Smith, Jeff; Duke, Norma E; Wilton, Rosemarie; Mack, Jamey C; Bearden, Jessica; Rakowski, Ella; Babnigg, Gyorgy; Szurmant, Hendrik; Joachimiak, Andrzej; Schiffer, Marianne

    2013-01-01

    Anaeromyxobacter dehalogenans is a δ-proteobacterium found in diverse soils and sediments. It is of interest in bioremediation efforts due to its dechlorination and metal-reducing capabilities. To gain an understanding on A. dehalogenans' abilities to adapt to diverse environments we analyzed its signal transduction proteins. The A. dehalogenans genome codes for a large number of sensor histidine kinases (HK) and methyl-accepting chemotaxis proteins (MCP); among these 23 HK and 11 MCP proteins have a sensor domain in the periplasm. These proteins most likely contribute to adaptation to the organism's surroundings. We predicted their three-dimensional folds and determined the structures of two of the periplasmic sensor domains by X-ray diffraction. Most of the domains are predicted to have either PAS-like or helical bundle structures, with two predicted to have solute-binding protein fold, and another predicted to have a 6-phosphogluconolactonase like fold. Atomic structures of two sensor domains confirmed the respective fold predictions. The Adeh_2942 sensor (HK) was found to have a helical bundle structure, and the Adeh_3718 sensor (MCP) has a PAS-like structure. Interestingly, the Adeh_3718 sensor has an acetate moiety bound in a binding site typical for PAS-like domains. Future work is needed to determine whether Adeh_3718 is involved in acetate sensing by A. dehalogenans. PMID:23897711

  5. Identification and substrate prediction of new Fragaria x ananassa aquaporins and expression in different tissues and during strawberry fruit development.

    PubMed

    Merlaen, Britt; De Keyser, Ellen; Van Labeke, Marie-Christine

    2018-01-01

    The newly identified aquaporin coding sequences presented here pave the way for further insights into the plant-water relations in the commercial strawberry ( Fragaria x ananassa ). Aquaporins are water channel proteins that allow water to cross (intra)cellular membranes. In Fragaria x ananassa , few of them have been identified hitherto, hampering the exploration of the water transport regulation at cellular level. Here, we present new aquaporin coding sequences belonging to different subclasses: plasma membrane intrinsic proteins subtype 1 and subtype 2 (PIP1 and PIP2) and tonoplast intrinsic proteins (TIP). The classification is based on phylogenetic analysis and is confirmed by the presence of conserved residues. Substrate-specific signature sequences (SSSSs) and specificity-determining positions (SDPs) predict the substrate specificity of each new aquaporin. Expression profiling in leaves, petioles and developing fruits reveals distinct patterns, even within the same (sub)class. Expression profiles range from leaf-specific expression over constitutive expression to fruit-specific expression. Both upregulation and downregulation during fruit ripening occur. Substrate specificity and expression profiles suggest that functional specialization exists among aquaporins belonging to a different but also to the same (sub)class.

  6. Genome-wide computational identification of microRNAs and their targets in the deep-branching eukaryote Giardia lamblia.

    PubMed

    Zhang, Yan-Qiong; Chen, Dong-Liang; Tian, Hai-Feng; Zhang, Bao-Hong; Wen, Jian-Fan

    2009-10-01

    Using a combined computational program, we identified 50 potential microRNAs (miRNAs) in Giardia lamblia, one of the most primitive unicellular eukaryotes. These miRNAs are unique to G. lamblia and no homologues have been found in other organisms; miRNAs, currently known in other species, were not found in G. lamblia. This suggests that miRNA biogenesis and miRNA-mediated gene regulation pathway may evolve independently, especially in evolutionarily distant lineages. A majority (43) of the predicted miRNAs are located at one single locus; however, some miRNAs have two or more copies in the genome. Among the 58 miRNA genes, 28 are located in the intergenic regions whereas 30 are present in the anti-sense strands of the protein-coding sequences. Five predicted miRNAs are expressed in G. lamblia trophozoite cells evidenced by expressed sequence tags or RT-PCR. Thirty-seven identified miRNAs may target 50 protein-coding genes, including seven variant-specific surface proteins (VSPs). Our findings provide a clue that miRNA-mediated gene regulation may exist in the early stage of eukaryotic evolution, suggesting that it is an important regulation system ubiquitous in eukaryotes.

  7. NegGOA: negative GO annotations selection using ontology structure.

    PubMed

    Fu, Guangyuan; Wang, Jun; Yang, Bo; Yu, Guoxian

    2016-10-01

    Predicting the biological functions of proteins is one of the key challenges in the post-genomic era. Computational models have demonstrated the utility of applying machine learning methods to predict protein function. Most prediction methods explicitly require a set of negative examples-proteins that are known not carrying out a particular function. However, Gene Ontology (GO) almost always only provides the knowledge that proteins carry out a particular function, and functional annotations of proteins are incomplete. GO structurally organizes more than tens of thousands GO terms and a protein is annotated with several (or dozens) of these terms. For these reasons, the negative examples of a protein can greatly help distinguishing true positive examples of the protein from such a large candidate GO space. In this paper, we present a novel approach (called NegGOA) to select negative examples. Specifically, NegGOA takes advantage of the ontology structure, available annotations and potentiality of additional annotations of a protein to choose negative examples of the protein. We compare NegGOA with other negative examples selection algorithms and find that NegGOA produces much fewer false negatives than them. We incorporate the selected negative examples into an efficient function prediction model to predict the functions of proteins in Yeast, Human, Mouse and Fly. NegGOA also demonstrates improved accuracy than these comparing algorithms across various evaluation metrics. In addition, NegGOA is less suffered from incomplete annotations of proteins than these comparing methods. The Matlab and R codes are available at https://sites.google.com/site/guoxian85/neggoa gxyu@swu.edu.cn Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  8. Integrative Annotation of 21,037 Human Genes Validated by Full-Length cDNA Clones

    PubMed Central

    Imanishi, Tadashi; Itoh, Takeshi; Suzuki, Yutaka; O'Donovan, Claire; Fukuchi, Satoshi; Koyanagi, Kanako O; Barrero, Roberto A; Tamura, Takuro; Yamaguchi-Kabata, Yumi; Tanino, Motohiko; Yura, Kei; Miyazaki, Satoru; Ikeo, Kazuho; Homma, Keiichi; Kasprzyk, Arek; Nishikawa, Tetsuo; Hirakawa, Mika; Thierry-Mieg, Jean; Thierry-Mieg, Danielle; Ashurst, Jennifer; Jia, Libin; Nakao, Mitsuteru; Thomas, Michael A; Mulder, Nicola; Karavidopoulou, Youla; Jin, Lihua; Kim, Sangsoo; Yasuda, Tomohiro; Lenhard, Boris; Eveno, Eric; Suzuki, Yoshiyuki; Yamasaki, Chisato; Takeda, Jun-ichi; Gough, Craig; Hilton, Phillip; Fujii, Yasuyuki; Sakai, Hiroaki; Tanaka, Susumu; Amid, Clara; Bellgard, Matthew; Bonaldo, Maria de Fatima; Bono, Hidemasa; Bromberg, Susan K; Brookes, Anthony J; Bruford, Elspeth; Carninci, Piero; Chelala, Claude; Couillault, Christine; de Souza, Sandro J.; Debily, Marie-Anne; Devignes, Marie-Dominique; Dubchak, Inna; Endo, Toshinori; Estreicher, Anne; Eyras, Eduardo; Fukami-Kobayashi, Kaoru; R. Gopinath, Gopal; Graudens, Esther; Hahn, Yoonsoo; Han, Michael; Han, Ze-Guang; Hanada, Kousuke; Hanaoka, Hideki; Harada, Erimi; Hashimoto, Katsuyuki; Hinz, Ursula; Hirai, Momoki; Hishiki, Teruyoshi; Hopkinson, Ian; Imbeaud, Sandrine; Inoko, Hidetoshi; Kanapin, Alexander; Kaneko, Yayoi; Kasukawa, Takeya; Kelso, Janet; Kersey, Paul; Kikuno, Reiko; Kimura, Kouichi; Korn, Bernhard; Kuryshev, Vladimir; Makalowska, Izabela; Makino, Takashi; Mano, Shuhei; Mariage-Samson, Regine; Mashima, Jun; Matsuda, Hideo; Mewes, Hans-Werner; Minoshima, Shinsei; Nagai, Keiichi; Nagasaki, Hideki; Nagata, Naoki; Nigam, Rajni; Ogasawara, Osamu; Ohara, Osamu; Ohtsubo, Masafumi; Okada, Norihiro; Okido, Toshihisa; Oota, Satoshi; Ota, Motonori; Ota, Toshio; Otsuki, Tetsuji; Piatier-Tonneau, Dominique; Poustka, Annemarie; Ren, Shuang-Xi; Saitou, Naruya; Sakai, Katsunaga; Sakamoto, Shigetaka; Sakate, Ryuichi; Schupp, Ingo; Servant, Florence; Sherry, Stephen; Shiba, Rie; Shimizu, Nobuyoshi; Shimoyama, Mary; Simpson, Andrew J; Soares, Bento; Steward, Charles; Suwa, Makiko; Suzuki, Mami; Takahashi, Aiko; Tamiya, Gen; Tanaka, Hiroshi; Taylor, Todd; Terwilliger, Joseph D; Unneberg, Per; Veeramachaneni, Vamsi; Watanabe, Shinya; Wilming, Laurens; Yasuda, Norikazu; Yoo, Hyang-Sook; Stodolsky, Marvin; Makalowski, Wojciech; Go, Mitiko; Nakai, Kenta; Takagi, Toshihisa; Kanehisa, Minoru; Sakaki, Yoshiyuki; Quackenbush, John; Okazaki, Yasushi; Hayashizaki, Yoshihide; Hide, Winston; Chakraborty, Ranajit; Nishikawa, Ken; Sugawara, Hideaki; Tateno, Yoshio; Chen, Zhu; Oishi, Michio; Tonellato, Peter; Apweiler, Rolf; Okubo, Kousaku; Wagner, Lukas; Wiemann, Stefan; Strausberg, Robert L; Isogai, Takao; Auffray, Charles; Nomura, Nobuo; Sugano, Sumio

    2004-01-01

    The human genome sequence defines our inherent biological potential; the realization of the biology encoded therein requires knowledge of the function of each gene. Currently, our knowledge in this area is still limited. Several lines of investigation have been used to elucidate the structure and function of the genes in the human genome. Even so, gene prediction remains a difficult task, as the varieties of transcripts of a gene may vary to a great extent. We thus performed an exhaustive integrative characterization of 41,118 full-length cDNAs that capture the gene transcripts as complete functional cassettes, providing an unequivocal report of structural and functional diversity at the gene level. Our international collaboration has validated 21,037 human gene candidates by analysis of high-quality full-length cDNA clones through curation using unified criteria. This led to the identification of 5,155 new gene candidates. It also manifested the most reliable way to control the quality of the cDNA clones. We have developed a human gene database, called the H-Invitational Database (H-InvDB; http://www.h-invitational.jp/). It provides the following: integrative annotation of human genes, description of gene structures, details of novel alternative splicing isoforms, non-protein-coding RNAs, functional domains, subcellular localizations, metabolic pathways, predictions of protein three-dimensional structure, mapping of known single nucleotide polymorphisms (SNPs), identification of polymorphic microsatellite repeats within human genes, and comparative results with mouse full-length cDNAs. The H-InvDB analysis has shown that up to 4% of the human genome sequence (National Center for Biotechnology Information build 34 assembly) may contain misassembled or missing regions. We found that 6.5% of the human gene candidates (1,377 loci) did not have a good protein-coding open reading frame, of which 296 loci are strong candidates for non-protein-coding RNA genes. In addition, among 72,027 uniquely mapped SNPs and insertions/deletions localized within human genes, 13,215 nonsynonymous SNPs, 315 nonsense SNPs, and 452 indels occurred in coding regions. Together with 25 polymorphic microsatellite repeats present in coding regions, they may alter protein structure, causing phenotypic effects or resulting in disease. The H-InvDB platform represents a substantial contribution to resources needed for the exploration of human biology and pathology. PMID:15103394

  9. Well-characterized sequence features of eukaryote genomes and implications for ab initio gene prediction.

    PubMed

    Huang, Ying; Chen, Shi-Yi; Deng, Feilong

    2016-01-01

    In silico analysis of DNA sequences is an important area of computational biology in the post-genomic era. Over the past two decades, computational approaches for ab initio prediction of gene structure from genome sequence alone have largely facilitated our understanding on a variety of biological questions. Although the computational prediction of protein-coding genes has already been well-established, we are also facing challenges to robustly find the non-coding RNA genes, such as miRNA and lncRNA. Two main aspects of ab initio gene prediction include the computed values for describing sequence features and used algorithm for training the discriminant function, and by which different combinations are employed into various bioinformatic tools. Herein, we briefly review these well-characterized sequence features in eukaryote genomes and applications to ab initio gene prediction. The main purpose of this article is to provide an overview to beginners who aim to develop the related bioinformatic tools.

  10. Fast and Accurate Multivariate Gaussian Modeling of Protein Families: Predicting Residue Contacts and Protein-Interaction Partners

    PubMed Central

    Feinauer, Christoph; Procaccini, Andrea; Zecchina, Riccardo; Weigt, Martin; Pagnani, Andrea

    2014-01-01

    In the course of evolution, proteins show a remarkable conservation of their three-dimensional structure and their biological function, leading to strong evolutionary constraints on the sequence variability between homologous proteins. Our method aims at extracting such constraints from rapidly accumulating sequence data, and thereby at inferring protein structure and function from sequence information alone. Recently, global statistical inference methods (e.g. direct-coupling analysis, sparse inverse covariance estimation) have achieved a breakthrough towards this aim, and their predictions have been successfully implemented into tertiary and quaternary protein structure prediction methods. However, due to the discrete nature of the underlying variable (amino-acids), exact inference requires exponential time in the protein length, and efficient approximations are needed for practical applicability. Here we propose a very efficient multivariate Gaussian modeling approach as a variant of direct-coupling analysis: the discrete amino-acid variables are replaced by continuous Gaussian random variables. The resulting statistical inference problem is efficiently and exactly solvable. We show that the quality of inference is comparable or superior to the one achieved by mean-field approximations to inference with discrete variables, as done by direct-coupling analysis. This is true for (i) the prediction of residue-residue contacts in proteins, and (ii) the identification of protein-protein interaction partner in bacterial signal transduction. An implementation of our multivariate Gaussian approach is available at the website http://areeweb.polito.it/ricerca/cmp/code. PMID:24663061

  11. Accelerating the Pace of Protein Functional Annotation With Intel Xeon Phi Coprocessors.

    PubMed

    Feinstein, Wei P; Moreno, Juana; Jarrell, Mark; Brylinski, Michal

    2015-06-01

    Intel Xeon Phi is a new addition to the family of powerful parallel accelerators. The range of its potential applications in computationally driven research is broad; however, at present, the repository of scientific codes is still relatively limited. In this study, we describe the development and benchmarking of a parallel version of eFindSite, a structural bioinformatics algorithm for the prediction of ligand-binding sites in proteins. Implemented for the Intel Xeon Phi platform, the parallelization of the structure alignment portion of eFindSite using pragma-based OpenMP brings about the desired performance improvements, which scale well with the number of computing cores. Compared to a serial version, the parallel code runs 11.8 and 10.1 times faster on the CPU and the coprocessor, respectively; when both resources are utilized simultaneously, the speedup is 17.6. For example, ligand-binding predictions for 501 benchmarking proteins are completed in 2.1 hours on a single Stampede node equipped with the Intel Xeon Phi card compared to 3.1 hours without the accelerator and 36.8 hours required by a serial version. In addition to the satisfactory parallel performance, porting existing scientific codes to the Intel Xeon Phi architecture is relatively straightforward with a short development time due to the support of common parallel programming models by the coprocessor. The parallel version of eFindSite is freely available to the academic community at www.brylinski.org/efindsite.

  12. Multi-Omics Driven Assembly and Annotation of the Sandalwood (Santalum album) Genome.

    PubMed

    Mahesh, Hirehally Basavarajegowda; Subba, Pratigya; Advani, Jayshree; Shirke, Meghana Deepak; Loganathan, Ramya Malarini; Chandana, Shankara Lingu; Shilpa, Siddappa; Chatterjee, Oishi; Pinto, Sneha Maria; Prasad, Thottethodi Subrahmanya Keshava; Gowda, Malali

    2018-04-01

    Indian sandalwood ( Santalum album ) is an important tropical evergreen tree known for its fragrant heartwood-derived essential oil and its valuable carving wood. Here, we applied an integrated genomic, transcriptomic, and proteomic approach to assemble and annotate the Indian sandalwood genome. Our genome sequencing resulted in the establishment of a draft map of the smallest genome for any woody tree species to date (221 Mb). The genome annotation predicted 38,119 protein-coding genes and 27.42% repetitive DNA elements. In-depth proteome analysis revealed the identities of 72,325 unique peptides, which confirmed 10,076 of the predicted genes. The addition of transcriptomic and proteogenomic approaches resulted in the identification of 53 novel proteins and 34 gene-correction events that were missed by genomic approaches. Proteogenomic analysis also helped in reassigning 1,348 potential noncoding RNAs as bona fide protein-coding messenger RNAs. Gene expression patterns at the RNA and protein levels indicated that peptide sequencing was useful in capturing proteins encoded by nuclear and organellar genomes alike. Mass spectrometry-based proteomic evidence provided an unbiased approach toward the identification of proteins encoded by organellar genomes. Such proteins are often missed in transcriptome data sets due to the enrichment of only messenger RNAs that contain poly(A) tails. Overall, the use of integrated omic approaches enhanced the quality of the assembly and annotation of this nonmodel plant genome. The availability of genomic, transcriptomic, and proteomic data will enhance genomics-assisted breeding, germplasm characterization, and conservation of sandalwood trees. © 2018 American Society of Plant Biologists. All Rights Reserved.

  13. GLADIATOR: a global approach for elucidating disease modules.

    PubMed

    Silberberg, Yael; Kupiec, Martin; Sharan, Roded

    2017-05-26

    Understanding the genetic basis of disease is an important challenge in biology and medicine. The observation that disease-related proteins often interact with one another has motivated numerous network-based approaches for deciphering disease mechanisms. In particular, protein-protein interaction networks were successfully used to illuminate disease modules, i.e., interacting proteins working in concert to drive a disease. The identification of these modules can further our understanding of disease mechanisms. We devised a global method for the prediction of multiple disease modules simultaneously named GLADIATOR (GLobal Approach for DIsease AssociaTed mOdule Reconstruction). GLADIATOR relies on a gold-standard disease phenotypic similarity to obtain a pan-disease view of the underlying modules. To traverse the search space of potential disease modules, we applied a simulated annealing algorithm aimed at maximizing the correlation between module similarity and the gold-standard phenotypic similarity. Importantly, this optimization is employed over hundreds of diseases simultaneously. GLADIATOR's predicted modules highly agree with current knowledge about disease-related proteins. Furthermore, the modules exhibit high coherence with respect to functional annotations and are highly enriched with known curated pathways, outperforming previous methods. Examination of the predicted proteins shared by similar diseases demonstrates the diverse role of these proteins in mediating related processes across similar diseases. Last, we provide a detailed analysis of the suggested molecular mechanism predicted by GLADIATOR for hyperinsulinism, suggesting novel proteins involved in its pathology. GLADIATOR predicts disease modules by integrating knowledge of disease-related proteins and phenotypes across multiple diseases. The predicted modules are functionally coherent and are more in line with current biological knowledge compared to modules obtained using previous disease-centric methods. The source code for GLADIATOR can be downloaded from http://www.cs.tau.ac.il/~roded/GLADIATOR.zip .

  14. Predicting highly-connected hubs in protein interaction networks by QSAR and biological data descriptors

    PubMed Central

    Hsing, Michael; Byler, Kendall; Cherkasov, Artem

    2009-01-01

    Hub proteins (those engaged in most physical interactions in a protein interaction network (PIN) have recently gained much research interest due to their essential role in mediating cellular processes and their potential therapeutic value. It is straightforward to identify hubs if the underlying PIN is experimentally determined; however, theoretical hub prediction remains a very challenging task, as physicochemical properties that differentiate hubs from less connected proteins remain mostly uncharacterized. To adequately distinguish hubs from non-hub proteins we have utilized over 1300 protein descriptors, some of which represent QSAR (quantitative structure-activity relationship) parameters, and some reflect sequence-derived characteristics of proteins including domain composition and functional annotations. Those protein descriptors, together with available protein interaction data have been processed by a machine learning method (boosting trees) and resulted in the development of hub classifiers that are capable of predicting highly interacting proteins for four model organisms: Escherichia coli, Saccharomyces cerevisiae, Drosophila melanogaster and Homo sapiens. More importantly, through the analyses of the most relevant protein descriptors, we are able to demonstrate that hub proteins not only share certain common physicochemical and structural characteristics that make them different from non-hub counterparts, but they also exhibit species-specific characteristics that should be taken into account when analyzing different PINs. The developed prediction models can be used for determining highly interacting proteins in the four studied species to assist future proteomics experiments and PIN analyses. Availability The source code and executable program of the hub classifier are available for download at: http://www.cnbi2.ca/hub-analysis/ PMID:20198194

  15. Avoidance of truncated proteins from unintended ribosome binding sites within heterologous protein coding sequences.

    PubMed

    Whitaker, Weston R; Lee, Hanson; Arkin, Adam P; Dueber, John E

    2015-03-20

    Genetic sequences ported into non-native hosts for synthetic biology applications can gain unexpected properties. In this study, we explored sequences functioning as ribosome binding sites (RBSs) within protein coding DNA sequences (CDSs) that cause internal translation, resulting in truncated proteins. Genome-wide prediction of bacterial RBSs, based on biophysical calculations employed by the RBS calculator, suggests a selection against internal RBSs within CDSs in Escherichia coli, but not those in Saccharomyces cerevisiae. Based on these calculations, silent mutations aimed at removing internal RBSs can effectively reduce truncation products from internal translation. However, a solution for complete elimination of internal translation initiation is not always feasible due to constraints of available coding sequences. Fluorescence assays and Western blot analysis showed that in genes with internal RBSs, increasing the strength of the intended upstream RBS had little influence on the internal translation strength. Another strategy to minimize truncated products from an internal RBS is to increase the relative strength of the upstream RBS with a concomitant reduction in promoter strength to achieve the same protein expression level. Unfortunately, lower transcription levels result in increased noise at the single cell level due to stochasticity in gene expression. At the low expression regimes desired for many synthetic biology applications, this problem becomes particularly pronounced. We found that balancing promoter strengths and upstream RBS strengths to intermediate levels can achieve the target protein concentration while avoiding both excessive noise and truncated protein.

  16. Structure and Expression of Genes for Flavivirus Immunogens.

    DTIC Science & Technology

    1985-09-01

    the same order in YFV i.e., C-M-E-NSI--- NS3---NS5 and an open reading frame extends at least through the C-M-E-NS1 coding region, consistent with...been determined (Castle et al., 1985). Comparison of these results shows that 1) the six major JEV genes mapped thus far occur in the same order in YFV ...pre-M proteins and 3) the predicted structures of the E, NSI and ns2a proteins of JEV and YFV exhibit a high degree of relatedness. The E proteins

  17. Nmf9 Encodes a Highly Conserved Protein Important to Neurological Function in Mice and Flies.

    PubMed

    Zhang, Shuxiao; Ross, Kevin D; Seidner, Glen A; Gorman, Michael R; Poon, Tiffany H; Wang, Xiaobo; Keithley, Elizabeth M; Lee, Patricia N; Martindale, Mark Q; Joiner, William J; Hamilton, Bruce A

    2015-07-01

    Many protein-coding genes identified by genome sequencing remain without functional annotation or biological context. Here we define a novel protein-coding gene, Nmf9, based on a forward genetic screen for neurological function. ENU-induced and genome-edited null mutations in mice produce deficits in vestibular function, fear learning and circadian behavior, which correlated with Nmf9 expression in inner ear, amygdala, and suprachiasmatic nuclei. Homologous genes from unicellular organisms and invertebrate animals predict interactions with small GTPases, but the corresponding domains are absent in mammalian Nmf9. Intriguingly, homozygotes for null mutations in the Drosophila homolog, CG45058, show profound locomotor defects and premature death, while heterozygotes show striking effects on sleep and activity phenotypes. These results link a novel gene orthology group to discrete neurological functions, and show conserved requirement across wide phylogenetic distance and domain level structural changes.

  18. Mutations that Cause Human Disease: A Computational/Experimental Approach

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Beernink, P; Barsky, D; Pesavento, B

    International genome sequencing projects have produced billions of nucleotides (letters) of DNA sequence data, including the complete genome sequences of 74 organisms. These genome sequences have created many new scientific opportunities, including the ability to identify sequence variations among individuals within a species. These genetic differences, which are known as single nucleotide polymorphisms (SNPs), are particularly important in understanding the genetic basis for disease susceptibility. Since the report of the complete human genome sequence, over two million human SNPs have been identified, including a large-scale comparison of an entire chromosome from twenty individuals. Of the protein coding SNPs (cSNPs), approximatelymore » half leads to a single amino acid change in the encoded protein (non-synonymous coding SNPs). Most of these changes are functionally silent, while the remainder negatively impact the protein and sometimes cause human disease. To date, over 550 SNPs have been found to cause single locus (monogenic) diseases and many others have been associated with polygenic diseases. SNPs have been linked to specific human diseases, including late-onset Parkinson disease, autism, rheumatoid arthritis and cancer. The ability to predict accurately the effects of these SNPs on protein function would represent a major advance toward understanding these diseases. To date several attempts have been made toward predicting the effects of such mutations. The most successful of these is a computational approach called ''Sorting Intolerant From Tolerant'' (SIFT). This method uses sequence conservation among many similar proteins to predict which residues in a protein are functionally important. However, this method suffers from several limitations. First, a query sequence must have a sufficient number of relatives to infer sequence conservation. Second, this method does not make use of or provide any information on protein structure, which can be used to understand how an amino acid change affects the protein. The experimental methods that provide the most detailed structural information on proteins are X-ray crystallography and NMR spectroscopy. However, these methods are labor intensive and currently cannot be carried out on a genomic scale. Nonetheless, Structural Genomics projects are being pursued by more than a dozen groups and consortia worldwide and as a result the number of experimentally determined structures is rising exponentially. Based on the expectation that protein structures will continue to be determined at an ever-increasing rate, reliable structure prediction schemes will become increasingly valuable, leading to information on protein function and disease for many different proteins. Given known genetic variability and experimentally determined protein structures, can we accurately predict the effects of single amino acid substitutions? An objective assessment of this question would involve comparing predicted and experimentally determined structures, which thus far has not been rigorously performed. The completed research leveraged existing expertise at LLNL in computational and structural biology, as well as significant computing resources, to address this question.« less

  19. Performance of protein-structure predictions with the physics-based UNRES force field in CASP11.

    PubMed

    Krupa, Paweł; Mozolewska, Magdalena A; Wiśniewska, Marta; Yin, Yanping; He, Yi; Sieradzan, Adam K; Ganzynkowicz, Robert; Lipska, Agnieszka G; Karczyńska, Agnieszka; Ślusarz, Magdalena; Ślusarz, Rafał; Giełdoń, Artur; Czaplewski, Cezary; Jagieła, Dawid; Zaborowski, Bartłomiej; Scheraga, Harold A; Liwo, Adam

    2016-11-01

    Participating as the Cornell-Gdansk group, we have used our physics-based coarse-grained UNited RESidue (UNRES) force field to predict protein structure in the 11th Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP11). Our methodology involved extensive multiplexed replica exchange simulations of the target proteins with a recently improved UNRES force field to provide better reproductions of the local structures of polypeptide chains. All simulations were started from fully extended polypeptide chains, and no external information was included in the simulation process except for weak restraints on secondary structure to enable us to finish each prediction within the allowed 3-week time window. Because of simplified UNRES representation of polypeptide chains, use of enhanced sampling methods, code optimization and parallelization and sufficient computational resources, we were able to treat, for the first time, all 55 human prediction targets with sizes from 44 to 595 amino acid residues, the average size being 251 residues. Complete structures of six single-domain proteins were predicted accurately, with the highest accuracy being attained for the T0769, for which the CαRMSD was 3.8 Å for 97 residues of the experimental structure. Correct structures were also predicted for 13 domains of multi-domain proteins with accuracy comparable to that of the best template-based modeling methods. With further improvements of the UNRES force field that are now underway, our physics-based coarse-grained approach to protein-structure prediction will eventually reach global prediction capacity and, consequently, reliability in simulating protein structure and dynamics that are important in biochemical processes. Freely available on the web at http://www.unres.pl/ CONTACT: has5@cornell.edu. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  20. PDB-UF: database of predicted enzymatic functions for unannotated protein structures from structural genomics.

    PubMed

    von Grotthuss, Marcin; Plewczynski, Dariusz; Ginalski, Krzysztof; Rychlewski, Leszek; Shakhnovich, Eugene I

    2006-02-06

    The number of protein structures from structural genomics centers dramatically increases in the Protein Data Bank (PDB). Many of these structures are functionally unannotated because they have no sequence similarity to proteins of known function. However, it is possible to successfully infer function using only structural similarity. Here we present the PDB-UF database, a web-accessible collection of predictions of enzymatic properties using structure-function relationship. The assignments were conducted for three-dimensional protein structures of unknown function that come from structural genomics initiatives. We show that 4 hypothetical proteins (with PDB accession codes: 1VH0, 1NS5, 1O6D, and 1TO0), for which standard BLAST tools such as PSI-BLAST or RPS-BLAST failed to assign any function, are probably methyltransferase enzymes. We suggest that the structure-based prediction of an EC number should be conducted having the different similarity score cutoff for different protein folds. Moreover, performing the annotation using two different algorithms can reduce the rate of false positive assignments. We believe, that the presented web-based repository will help to decrease the number of protein structures that have functions marked as "unknown" in the PDB file. http://paradox.harvard.edu/PDB-UF and http://bioinfo.pl/PDB-UF.

  1. Coevolution Theory of the Genetic Code at Age Forty: Pathway to Translation and Synthetic Life

    PubMed Central

    Wong, J. Tze-Fei; Ng, Siu-Kin; Mat, Wai-Kin; Hu, Taobo; Xue, Hong

    2016-01-01

    The origins of the components of genetic coding are examined in the present study. Genetic information arose from replicator induction by metabolite in accordance with the metabolic expansion law. Messenger RNA and transfer RNA stemmed from a template for binding the aminoacyl-RNA synthetase ribozymes employed to synthesize peptide prosthetic groups on RNAs in the Peptidated RNA World. Coevolution of the genetic code with amino acid biosynthesis generated tRNA paralogs that identify a last universal common ancestor (LUCA) of extant life close to Methanopyrus, which in turn points to archaeal tRNA introns as the most primitive introns and the anticodon usage of Methanopyrus as an ancient mode of wobble. The prediction of the coevolution theory of the genetic code that the code should be a mutable code has led to the isolation of optional and mandatory synthetic life forms with altered protein alphabets. PMID:26999216

  2. lncRScan-SVM: A Tool for Predicting Long Non-Coding RNAs Using Support Vector Machine.

    PubMed

    Sun, Lei; Liu, Hui; Zhang, Lin; Meng, Jia

    2015-01-01

    Functional long non-coding RNAs (lncRNAs) have been bringing novel insight into biological study, however it is still not trivial to accurately distinguish the lncRNA transcripts (LNCTs) from the protein coding ones (PCTs). As various information and data about lncRNAs are preserved by previous studies, it is appealing to develop novel methods to identify the lncRNAs more accurately. Our method lncRScan-SVM aims at classifying PCTs and LNCTs using support vector machine (SVM). The gold-standard datasets for lncRScan-SVM model training, lncRNA prediction and method comparison were constructed according to the GENCODE gene annotations of human and mouse respectively. By integrating features derived from gene structure, transcript sequence, potential codon sequence and conservation, lncRScan-SVM outperforms other approaches, which is evaluated by several criteria such as sensitivity, specificity, accuracy, Matthews correlation coefficient (MCC) and area under curve (AUC). In addition, several known human lncRNA datasets were assessed using lncRScan-SVM. LncRScan-SVM is an efficient tool for predicting the lncRNAs, and it is quite useful for current lncRNA study.

  3. Molecular modelling of the Norrie disease protein predicts a cystine knot growth factor tertiary structure.

    PubMed

    Meitinger, T; Meindl, A; Bork, P; Rost, B; Sander, C; Haasemann, M; Murken, J

    1993-12-01

    The X-lined gene for Norrie disease, which is characterized by blindness, deafness and mental retardation has been cloned recently. This gene has been thought to code for a putative extracellular factor; its predicted amino acid sequence is homologous to the C-terminal domain of diverse extracellular proteins. Sequence pattern searches and three-dimensional modelling now suggest that the Norrie disease protein (NDP) has a tertiary structure similar to that of transforming growth factor beta (TGF beta). Our model identifies NDP as a member of an emerging family of growth factors containing a cystine knot motif, with direct implications for the physiological role of NDP. The model also sheds light on sequence related domains such as the C-terminal domain of mucins and of von Willebrand factor.

  4. Natural variation of rice blast resistance gene Pi-d2

    USDA-ARS?s Scientific Manuscript database

    Studying natural variation of rice resistance (R) genes in cultivated and wild rice relatives can predict resistance stability to rice blast fungus. In the present study, the protein coding regions of rice R gene Pi-d2 in 35 rice accessions of subgroups, aus (AUS), indica (IND), temperate japonica (...

  5. Draft Genome Sequence of Lactobacillus panis DSM 6035T, First Isolated from Sourdough

    PubMed Central

    Zhu, Yixin; Fang, Daiqiong; Shi, Ding; Li, Ang; Lv, Longxian; Yan, Ren; Yao, Jian; Hua, Dasong; Hu, Xinjun; Guo, Feifei; Wu, Wenrui; Guo, Jing; Chen, Yanfei; Jiang, Xiawei; Chen, Xiaoxiao

    2015-01-01

    We report a draft genome sequence of Lactobacillus panis DSM 6035T, isolated from sourdough. The genome of this strain is 2,082,789 bp long, with 47.9% G+C content. A total of 2,047 protein-coding genes were predicted. PMID:26205855

  6. Effectively identifying compound-protein interactions by learning from positive and unlabeled examples.

    PubMed

    Cheng, Zhanzhan; Zhou, Shuigeng; Wang, Yang; Liu, Hui; Guan, Jihong; Chen, Yi-Ping Phoebe

    2016-05-18

    Prediction of compound-protein interactions (CPIs) is to find new compound-protein pairs where a protein is targeted by at least a compound, which is a crucial step in new drug design. Currently, a number of machine learning based methods have been developed to predict new CPIs in the literature. However, as there is not yet any publicly available set of validated negative CPIs, most existing machine learning based approaches use the unknown interactions (not validated CPIs) selected randomly as the negative examples to train classifiers for predicting new CPIs. Obviously, this is not quite reasonable and unavoidably impacts the CPI prediction performance. In this paper, we simply take the unknown CPIs as unlabeled examples, and propose a new method called PUCPI (the abbreviation of PU learning for Compound-Protein Interaction identification) that employs biased-SVM (Support Vector Machine) to predict CPIs using only positive and unlabeled examples. PU learning is a class of learning methods that leans from positive and unlabeled (PU) samples. To the best of our knowledge, this is the first work that identifies CPIs using only positive and unlabeled examples. We first collect known CPIs as positive examples and then randomly select compound-protein pairs not in the positive set as unlabeled examples. For each CPI/compound-protein pair, we extract protein domains as protein features and compound substructures as chemical features, then take the tensor product of the corresponding compound features and protein features as the feature vector of the CPI/compound-protein pair. After that, biased-SVM is employed to train classifiers on different datasets of CPIs and compound-protein pairs. Experiments over various datasets show that our method outperforms six typical classifiers, including random forest, L1- and L2-regularized logistic regression, naive Bayes, SVM and k-nearest neighbor (kNN), and three types of existing CPI prediction models. Source code, datasets and related documents of PUCPI are available at: http://admis.fudan.edu.cn/projects/pucpi.html.

  7. Effect of reference genome selection on the performance of computational methods for genome-wide protein-protein interaction prediction.

    PubMed

    Muley, Vijaykumar Yogesh; Ranjan, Akash

    2012-01-01

    Recent progress in computational methods for predicting physical and functional protein-protein interactions has provided new insights into the complexity of biological processes. Most of these methods assume that functionally interacting proteins are likely to have a shared evolutionary history. This history can be traced out for the protein pairs of a query genome by correlating different evolutionary aspects of their homologs in multiple genomes known as the reference genomes. These methods include phylogenetic profiling, gene neighborhood and co-occurrence of the orthologous protein coding genes in the same cluster or operon. These are collectively known as genomic context methods. On the other hand a method called mirrortree is based on the similarity of phylogenetic trees between two interacting proteins. Comprehensive performance analyses of these methods have been frequently reported in literature. However, very few studies provide insight into the effect of reference genome selection on detection of meaningful protein interactions. We analyzed the performance of four methods and their variants to understand the effect of reference genome selection on prediction efficacy. We used six sets of reference genomes, sampled in accordance with phylogenetic diversity and relationship between organisms from 565 bacteria. We used Escherichia coli as a model organism and the gold standard datasets of interacting proteins reported in DIP, EcoCyc and KEGG databases to compare the performance of the prediction methods. Higher performance for predicting protein-protein interactions was achievable even with 100-150 bacterial genomes out of 565 genomes. Inclusion of archaeal genomes in the reference genome set improves performance. We find that in order to obtain a good performance, it is better to sample few genomes of related genera of prokaryotes from the large number of available genomes. Moreover, such a sampling allows for selecting 50-100 genomes for comparable accuracy of predictions when computational resources are limited.

  8. Similarity-based prediction for Anatomical Therapeutic Chemical classification of drugs by integrating multiple data sources.

    PubMed

    Liu, Zhongyang; Guo, Feifei; Gu, Jiangyong; Wang, Yong; Li, Yang; Wang, Dan; Lu, Liang; Li, Dong; He, Fuchu

    2015-06-01

    Anatomical Therapeutic Chemical (ATC) classification system, widely applied in almost all drug utilization studies, is currently the most widely recognized classification system for drugs. Currently, new drug entries are added into the system only on users' requests, which leads to seriously incomplete drug coverage of the system, and bioinformatics prediction is helpful during this process. Here we propose a novel prediction model of drug-ATC code associations, using logistic regression to integrate multiple heterogeneous data sources including chemical structures, target proteins, gene expression, side-effects and chemical-chemical associations. The model obtains good performance for the prediction not only on ATC codes of unclassified drugs but also on new ATC codes of classified drugs assessed by cross-validation and independent test sets, and its efficacy exceeds previous methods. Further to facilitate the use, the model is developed into a user-friendly web service SPACE ( S: imilarity-based P: redictor of A: TC C: od E: ), which for each submitted compound, will give candidate ATC codes (ranked according to the decreasing probability_score predicted by the model) together with corresponding supporting evidence. This work not only contributes to knowing drugs' therapeutic, pharmacological and chemical properties, but also provides clues for drug repositioning and side-effect discovery. In addition, the construction of the prediction model also provides a general framework for similarity-based data integration which is suitable for other drug-related studies such as target, side-effect prediction etc. The web service SPACE is available at http://www.bprc.ac.cn/space. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.

  9. On splice site prediction using weight array models: a comparison of smoothing techniques

    NASA Astrophysics Data System (ADS)

    Taher, Leila; Meinicke, Peter; Morgenstern, Burkhard

    2007-11-01

    In most eukaryotic genes, protein-coding exons are separated by non-coding introns which are removed from the primary transcript by a process called "splicing". The positions where introns are cut and exons are spliced together are called "splice sites". Thus, computational prediction of splice sites is crucial for gene finding in eukaryotes. Weight array models are a powerful probabilistic approach to splice site detection. Parameters for these models are usually derived from m-tuple frequencies in trusted training data and subsequently smoothed to avoid zero probabilities. In this study we compare three different ways of parameter estimation for m-tuple frequencies, namely (a) non-smoothed probability estimation, (b) standard pseudo counts and (c) a Gaussian smoothing procedure that we recently developed.

  10. Predicted secondary structure similarity in the absence of primary amino acid sequence homology: hepatitis B virus open reading frames.

    PubMed Central

    Schaeffer, E; Sninsky, J J

    1984-01-01

    Proteins that are related evolutionarily may have diverged at the level of primary amino acid sequence while maintaining similar secondary structures. Computer analysis has been used to compare the open reading frames of the hepatitis B virus to those of the woodchuck hepatitis virus at the level of amino acid sequence, and to predict the relative hydrophilic character and the secondary structure of putative polypeptides. Similarity is seen at the levels of relative hydrophilicity and secondary structure, in the absence of sequence homology. These data reinforce the proposal that these open reading frames encode viral proteins. Computer analysis of this type can be more generally used to establish structural similarities between proteins that do not share obvious sequence homology as well as to assess whether an open reading frame is fortuitous or codes for a protein. PMID:6585835

  11. Identification and characterization of a novel zebrafish (Danio rerio) pentraxin-carbonic anhydrase.

    PubMed

    Patrikainen, Maarit S; Tolvanen, Martti E E; Aspatwar, Ashok; Barker, Harlan R; Ortutay, Csaba; Jänis, Janne; Laitaoja, Mikko; Hytönen, Vesa P; Azizi, Latifeh; Manandhar, Prajwol; Jáger, Edit; Vullo, Daniela; Kukkurainen, Sampo; Hilvo, Mika; Supuran, Claudiu T; Parkkila, Seppo

    2017-01-01

    Carbonic anhydrases (CAs) are ubiquitous, essential enzymes which catalyze the conversion of carbon dioxide and water to bicarbonate and H + ions. Vertebrate genomes generally contain gene loci for 15-21 different CA isoforms, three of which are enzymatically inactive. CA VI is the only secretory protein of the enzymatically active isoforms. We discovered that non-mammalian CA VI contains a C-terminal pentraxin (PTX) domain, a novel combination for both CAs and PTXs. We isolated and sequenced zebrafish ( Danio rerio ) CA VI cDNA, complete with the sequence coding for the PTX domain, and produced the recombinant CA VI-PTX protein. Enzymatic activity and kinetic parameters were measured with a stopped-flow instrument. Mass spectrometry, analytical gel filtration and dynamic light scattering were used for biophysical characterization. Sequence analyses and Bayesian phylogenetics were used in generating hypotheses of protein structure and CA VI gene evolution. A CA VI-PTX antiserum was produced, and the expression of CA VI protein was studied by immunohistochemistry. A knock-down zebrafish model was constructed, and larvae were observed up to five days post-fertilization (dpf). The expression of ca6 mRNA was quantitated by qRT-PCR in different developmental times in morphant and wild-type larvae and in different adult fish tissues. Finally, the swimming behavior of the morphant fish was compared to that of wild-type fish. The recombinant enzyme has a very high carbonate dehydratase activity. Sequencing confirms a 530-residue protein identical to one of the predicted proteins in the Ensembl database (ensembl.org). The protein is pentameric in solution, as studied by gel filtration and light scattering, presumably joined by the PTX domains. Mass spectrometry confirms the predicted signal peptide cleavage and disulfides, and N-glycosylation in two of the four observed glycosylation motifs. Molecular modeling of the pentamer is consistent with the modifications observed in mass spectrometry. Phylogenetics and sequence analyses provide a consistent hypothesis of the evolutionary history of domains associated with CA VI in mammals and non-mammals. Briefly, the evidence suggests that ancestral CA VI was a transmembrane protein, the exon coding for the cytoplasmic domain was replaced by one coding for PTX domain, and finally, in the therian lineage, the PTX-coding exon was lost. We knocked down CA VI expression in zebrafish embryos with antisense morpholino oligonucleotides, resulting in phenotype features of decreased buoyancy and swim bladder deflation in 4 dpf larvae. These findings provide novel insights into the evolution, structure, and function of this unique CA form.

  12. Whole genome annotation and comparative genomic analyses of bio-control fungus Purpureocillium lilacinum.

    PubMed

    Prasad, Pushplata; Varshney, Deepti; Adholeya, Alok

    2015-11-25

    The fungus Purpureocillium lilacinum is widely known as a biological control agent against plant parasitic nematodes. This research article consists of genomic annotation of the first draft of whole genome sequence of P. lilacinum. The study aims to decipher the putative genetic components of the fungus involved in nematode pathogenesis by performing comparative genomic analysis with nine closely related fungal species in Hypocreales. de novo genomic assembly was done and a total of 301 scaffolds were constructed for P. lilacinum genomic DNA. By employing structural genome prediction models, 13, 266 genes coding for proteins were predicted in the genome. Approximately 73% of the predicted genes were functionally annotated using Blastp, InterProScan and Gene Ontology. A 14.7% fraction of the predicted genes shared significant homology with genes in the Pathogen Host Interactions (PHI) database. The phylogenomic analysis carried out using maximum likelihood RAxML algorithm provided insight into the evolutionary relationship of P. lilacinum. In congruence with other closely related species in the Hypocreales namely, Metarhizium spp., Pochonia chlamydosporia, Cordyceps militaris, Trichoderma reesei and Fusarium spp., P. lilacinum has large gene sets coding for G-protein coupled receptors (GPCRs), proteases, glycoside hydrolases and carbohydrate esterases that are required for degradation of nematode-egg shell components. Screening of the genome by Antibiotics & Secondary Metabolite Analysis Shell (AntiSMASH) pipeline indicated that the genome potentially codes for a variety of secondary metabolites, possibly required for adaptation to heterogeneous lifestyles reported for P. lilacinum. Significant up-regulation of subtilisin-like serine protease genes in presence of nematode eggs in quantitative real-time analyses suggested potential role of serine proteases in nematode pathogenesis. The data offer a better understanding of Purpureocillium lilacinum genome and will enhance our understanding on the molecular mechanism involved in nematophagy.

  13. Protein-coding genes combined with long noncoding RNA as a novel transcriptome molecular staging model to predict the survival of patients with esophageal squamous cell carcinoma.

    PubMed

    Guo, Jin-Cheng; Wu, Yang; Chen, Yang; Pan, Feng; Wu, Zhi-Yong; Zhang, Jia-Sheng; Wu, Jian-Yi; Xu, Xiu-E; Zhao, Jian-Mei; Li, En-Min; Zhao, Yi; Xu, Li-Yan

    2018-04-09

    Esophageal squamous cell carcinoma (ESCC) is the predominant subtype of esophageal carcinoma in China. This study was to develop a staging model to predict outcomes of patients with ESCC. Using Cox regression analysis, principal component analysis (PCA), partitioning clustering, Kaplan-Meier analysis, receiver operating characteristic (ROC) curve analysis, and classification and regression tree (CART) analysis, we mined the Gene Expression Omnibus database to determine the expression profiles of genes in 179 patients with ESCC from GSE63624 and GSE63622 dataset. Univariate cox regression analysis of the GSE63624 dataset revealed that 2404 protein-coding genes (PCGs) and 635 long non-coding RNAs (lncRNAs) were associated with the survival of patients with ESCC. PCA categorized these PCGs and lncRNAs into three principal components (PCs), which were used to cluster the patients into three groups. ROC analysis demonstrated that the predictive ability of PCG-lncRNA PCs when applied to new patients was better than that of the tumor-node-metastasis staging (area under ROC curve [AUC]: 0.69 vs. 0.65, P < 0.05). Accordingly, we constructed a molecular disaggregated model comprising one lncRNA and two PCGs, which we designated as the LSB staging model using CART analysis in the GSE63624 dataset. This LSB staging model classified the GSE63622 dataset of patients into three different groups, and its effectiveness was validated by analysis of another cohort of 105 patients. The LSB staging model has clinical significance for the prognosis prediction of patients with ESCC and may serve as a three-gene staging microarray.

  14. The structural genes for three Drosophila glue proteins reside at a single polytene chromosome puff locus.

    PubMed Central

    Crowley, T E; Bond, M W; Meyerowitz, E M

    1983-01-01

    The polytene chromosome puff at 68C on the Drosophila melanogaster third chromosome is thought from genetic experiments to contain the structural gene for one of the secreted salivary gland glue polypeptides, sgs-3. Previous work has demonstrated that the DNA included in this puff contains sequences that are transcribed to give three different polyadenylated RNAs that are abundant in third-larval-instar salivary glands. These have been called the group II, group III, and group IV RNAs. In the experiments reported here, we used the nucleotide sequence of the DNA coding for these RNAs to predict some of the physical and chemical properties expected of their protein products, including molecular weight, amino acid composition, and amino acid sequence. Salivary gland polypeptides with molecular weights similar to those expected for the 68C RNA translation products, and with the expected degree of incorporation of different radioactive amino acids, were purified. These proteins were shown by amino acid sequencing to correspond to the protein products of the 68C RNAs. It was further shown that each of these proteins is a part of the secreted salivary gland glue: the group IV RNA codes for the previously described sgs-3, whereas the group II and III RNAs code for the newly identified glue polypeptides sgs-8 and sgs-7. Images PMID:6406838

  15. A deep learning framework for improving long-range residue-residue contact prediction using a hierarchical strategy.

    PubMed

    Xiong, Dapeng; Zeng, Jianyang; Gong, Haipeng

    2017-09-01

    Residue-residue contacts are of great value for protein structure prediction, since contact information, especially from those long-range residue pairs, can significantly reduce the complexity of conformational sampling for protein structure prediction in practice. Despite progresses in the past decade on protein targets with abundant homologous sequences, accurate contact prediction for proteins with limited sequence information is still far from satisfaction. Methodologies for these hard targets still need further improvement. We presented a computational program DeepConPred, which includes a pipeline of two novel deep-learning-based methods (DeepCCon and DeepRCon) as well as a contact refinement step, to improve the prediction of long-range residue contacts from primary sequences. When compared with previous prediction approaches, our framework employed an effective scheme to identify optimal and important features for contact prediction, and was only trained with coevolutionary information derived from a limited number of homologous sequences to ensure robustness and usefulness for hard targets. Independent tests showed that 59.33%/49.97%, 64.39%/54.01% and 70.00%/59.81% of the top L/5, top L/10 and top 5 predictions were correct for CASP10/CASP11 proteins, respectively. In general, our algorithm ranked as one of the best methods for CASP targets. All source data and codes are available at http://166.111.152.91/Downloads.html . hgong@tsinghua.edu.cn or zengjy321@tsinghua.edu.cn. Supplementary data are available at Bioinformatics online. © The Author (2017). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com

  16. SIBIS: a Bayesian model for inconsistent protein sequence estimation.

    PubMed

    Khenoussi, Walyd; Vanhoutrève, Renaud; Poch, Olivier; Thompson, Julie D

    2014-09-01

    The prediction of protein coding genes is a major challenge that depends on the quality of genome sequencing, the accuracy of the model used to elucidate the exonic structure of the genes and the complexity of the gene splicing process leading to different protein variants. As a consequence, today's protein databases contain a huge amount of inconsistency, due to both natural variants and sequence prediction errors. We have developed a new method, called SIBIS, to detect such inconsistencies based on the evolutionary information in multiple sequence alignments. A Bayesian framework, combined with Dirichlet mixture models, is used to estimate the probability of observing specific amino acids and to detect inconsistent or erroneous sequence segments. We evaluated the performance of SIBIS on a reference set of protein sequences with experimentally validated errors and showed that the sensitivity is significantly higher than previous methods, with only a small loss of specificity. We also assessed a large set of human sequences from the UniProt database and found evidence of inconsistency in 48% of the previously uncharacterized sequences. We conclude that the integration of quality control methods like SIBIS in automatic analysis pipelines will be critical for the robust inference of structural, functional and phylogenetic information from these sequences. Source code, implemented in C on a linux system, and the datasets of protein sequences are freely available for download at http://www.lbgi.fr/∼julie/SIBIS. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  17. Predicting Protein-Protein Interaction Sites with a Novel Membership Based Fuzzy SVM Classifier.

    PubMed

    Sriwastava, Brijesh K; Basu, Subhadip; Maulik, Ujjwal

    2015-01-01

    Predicting residues that participate in protein-protein interactions (PPI) helps to identify, which amino acids are located at the interface. In this paper, we show that the performance of the classical support vector machine (SVM) algorithm can further be improved with the use of a custom-designed fuzzy membership function, for the partner-specific PPI interface prediction problem. We evaluated the performances of both classical SVM and fuzzy SVM (F-SVM) on the PPI databases of three different model proteomes of Homo sapiens, Escherichia coli and Saccharomyces Cerevisiae and calculated the statistical significance of the developed F-SVM over classical SVM algorithm. We also compared our performance with the available state-of-the-art fuzzy methods in this domain and observed significant performance improvements. To predict interaction sites in protein complexes, local composition of amino acids together with their physico-chemical characteristics are used, where the F-SVM based prediction method exploits the membership function for each pair of sequence fragments. The average F-SVM performance (area under ROC curve) on the test samples in 10-fold cross validation experiment are measured as 77.07, 78.39, and 74.91 percent for the aforementioned organisms respectively. Performances on independent test sets are obtained as 72.09, 73.24 and 82.74 percent respectively. The software is available for free download from http://code.google.com/p/cmater-bioinfo.

  18. Effective gene prediction by high resolution frequency estimator based on least-norm solution technique

    PubMed Central

    2014-01-01

    Linear algebraic concept of subspace plays a significant role in the recent techniques of spectrum estimation. In this article, the authors have utilized the noise subspace concept for finding hidden periodicities in DNA sequence. With the vast growth of genomic sequences, the demand to identify accurately the protein-coding regions in DNA is increasingly rising. Several techniques of DNA feature extraction which involves various cross fields have come up in the recent past, among which application of digital signal processing tools is of prime importance. It is known that coding segments have a 3-base periodicity, while non-coding regions do not have this unique feature. One of the most important spectrum analysis techniques based on the concept of subspace is the least-norm method. The least-norm estimator developed in this paper shows sharp period-3 peaks in coding regions completely eliminating background noise. Comparison of proposed method with existing sliding discrete Fourier transform (SDFT) method popularly known as modified periodogram method has been drawn on several genes from various organisms and the results show that the proposed method has better as well as an effective approach towards gene prediction. Resolution, quality factor, sensitivity, specificity, miss rate, and wrong rate are used to establish superiority of least-norm gene prediction method over existing method. PMID:24386895

  19. Predicting Transmembrane Helix Packing Arrangements using Residue Contacts and a Force-Directed Algorithm

    PubMed Central

    Nugent, Timothy; Jones, David T.

    2010-01-01

    Alpha-helical transmembrane proteins constitute roughly 30% of a typical genome and are involved in a wide variety of important biological processes including cell signalling, transport of membrane-impermeable molecules and cell recognition. Despite significant efforts to predict transmembrane protein topology, comparatively little attention has been directed toward developing a method to pack the helices together. Here, we present a novel approach to predict lipid exposure, residue contacts, helix-helix interactions and finally the optimal helical packing arrangement of transmembrane proteins. Using molecular dynamics data, we have trained and cross-validated a support vector machine (SVM) classifier to predict per residue lipid exposure with 69% accuracy. This information is combined with additional features to train a second SVM to predict residue contacts which are then used to determine helix-helix interaction with up to 65% accuracy under stringent cross-validation on a non-redundant test set. Our method is also able to discriminate native from decoy helical packing arrangements with up to 70% accuracy. Finally, we employ a force-directed algorithm to construct the optimal helical packing arrangement which demonstrates success for proteins containing up to 13 transmembrane helices. This software is freely available as source code from http://bioinf.cs.ucl.ac.uk/memsat/mempack/. PMID:20333233

  20. aPPRove: An HMM-Based Method for Accurate Prediction of RNA-Pentatricopeptide Repeat Protein Binding Events

    PubMed Central

    Harrison, Thomas; Ruiz, Jaime; Sloan, Daniel B.; Ben-Hur, Asa; Boucher, Christina

    2016-01-01

    Pentatricopeptide repeat containing proteins (PPRs) bind to RNA transcripts originating from mitochondria and plastids. There are two classes of PPR proteins. The P class contains tandem P-type motif sequences, and the PLS class contains alternating P, L and S type sequences. In this paper, we describe a novel tool that predicts PPR-RNA interaction; specifically, our method, which we call aPPRove, determines where and how a PLS-class PPR protein will bind to RNA when given a PPR and one or more RNA transcripts by using a combinatorial binding code for site specificity proposed by Barkan et al. Our results demonstrate that aPPRove successfully locates how and where a PPR protein belonging to the PLS class can bind to RNA. For each binding event it outputs the binding site, the amino-acid-nucleotide interaction, and its statistical significance. Furthermore, we show that our method can be used to predict binding events for PLS-class proteins using a known edit site and the statistical significance of aligning the PPR protein to that site. In particular, we use our method to make a conjecture regarding an interaction between CLB19 and the second intronic region of ycf3. The aPPRove web server can be found at www.cs.colostate.edu/~approve. PMID:27560805

  1. An exploration of the sequence of a 2.9-Mb region of the genome of Drosophila melanogaster: the Adh region.

    PubMed Central

    Ashburner, M; Misra, S; Roote, J; Lewis, S E; Blazej, R; Davis, T; Doyle, C; Galle, R; George, R; Harris, N; Hartzell, G; Harvey, D; Hong, L; Houston, K; Hoskins, R; Johnson, G; Martin, C; Moshrefi, A; Palazzolo, M; Reese, M G; Spradling, A; Tsang, G; Wan, K; Whitelaw, K; Celniker, S

    1999-01-01

    A contiguous sequence of nearly 3 Mb from the genome of Drosophila melanogaster has been sequenced from a series of overlapping P1 and BAC clones. This region covers 69 chromosome polytene bands on chromosome arm 2L, including the genetically well-characterized "Adh region." A computational analysis of the sequence predicts 218 protein-coding genes, 11 tRNAs, and 17 transposable element sequences. At least 38 of the protein-coding genes are arranged in clusters of from 2 to 6 closely related genes, suggesting extensive tandem duplication. The gene density is one protein-coding gene every 13 kb; the transposable element density is one element every 171 kb. Of 73 genes in this region identified by genetic analysis, 49 have been located on the sequence; P-element insertions have been mapped to 43 genes. Ninety-five (44%) of the known and predicted genes match a Drosophila EST, and 144 (66%) have clear similarities to proteins in other organisms. Genes known to have mutant phenotypes are more likely to be represented in cDNA libraries, and far more likely to have products similar to proteins of other organisms, than are genes with no known mutant phenotype. Over 650 chromosome aberration breakpoints map to this chromosome region, and their nonrandom distribution on the genetic map reflects variation in gene spacing on the DNA. This is the first large-scale analysis of the genome of D. melanogaster at the sequence level. In addition to the direct results obtained, this analysis has allowed us to develop and test methods that will be needed to interpret the complete sequence of the genome of this species.Before beginning a Hunt, it is wise to ask someone what you are looking for before you begin looking for it. Milne 1926 PMID:10471707

  2. TCOF1 gene encodes a putative nucleolar phosphoprotein that exhibits mutations in Treacher Collins Syndrome throughout its coding region

    PubMed Central

    Wise, Carol A.; Chiang, Lydia C.; Paznekas, William A.; Sharma, Mridula; Musy, Maurice M.; Ashley, Jennifer A.; Lovett, Michael; Jabs, Ethylin W.

    1997-01-01

    Treacher Collins Syndrome (TCS) is the most common of the human mandibulofacial dysostosis disorders. Recently, a partial TCOF1 cDNA was identified and shown to contain mutations in TCS families. Here we present the entire exon/intron genomic structure and the complete coding sequence of TCOF1. TCOF1 encodes a low complexity protein of 1,411 amino acids, whose predicted protein structure reveals repeated motifs that mirror the organization of its exons. These motifs are shared with nucleolar trafficking proteins in other species and are predicted to be highly phosphorylated by casein kinase. Consistent with this, the full-length TCOF1 protein sequence also contains putative nuclear and nucleolar localization signals. Throughout the open reading frame, we detected an additional eight mutations in TCS families and several polymorphisms. We postulate that TCS results from defects in a nucleolar trafficking protein that is critically required during human craniofacial development. PMID:9096354

  3. The LINE-1 DNA sequences in four mammalian orders predict proteins that conserve homologies to retrovirus proteins.

    PubMed Central

    Fanning, T; Singer, M

    1987-01-01

    Recent work suggests that one or more members of the highly repeated LINE-1 (L1) DNA family found in all mammals may encode one or more proteins. Here we report the sequence of a portion of an L1 cloned from the domestic cat (Felis catus). These data permit comparison of the L1 sequences in four mammalian orders (Carnivore, Lagomorph, Rodent and Primate) and the comparison supports the suggested coding potential. In two separate, noncontiguous regions in the carboxy terminal half of the proteins predicted from the DNA sequences, there are several strongly conserved segments. In one region, these share homology with known or suspected reverse transcriptases, as described by others in rodents and primates. In the second region, closer to the carboxy terminus, the strongly conserved segments are over 90% homologous among the four orders. One of the latter segments is cysteine rich and resembles the putative metal binding domains of nucleic acid binding proteins, including those of TFIIIA and retroviruses. PMID:3562227

  4. OSPREY: protein design with ensembles, flexibility, and provable algorithms.

    PubMed

    Gainza, Pablo; Roberts, Kyle E; Georgiev, Ivelin; Lilien, Ryan H; Keedy, Daniel A; Chen, Cheng-Yu; Reza, Faisal; Anderson, Amy C; Richardson, David C; Richardson, Jane S; Donald, Bruce R

    2013-01-01

    We have developed a suite of protein redesign algorithms that improves realistic in silico modeling of proteins. These algorithms are based on three characteristics that make them unique: (1) improved flexibility of the protein backbone, protein side-chains, and ligand to accurately capture the conformational changes that are induced by mutations to the protein sequence; (2) modeling of proteins and ligands as ensembles of low-energy structures to better approximate binding affinity; and (3) a globally optimal protein design search, guaranteeing that the computational predictions are optimal with respect to the input model. Here, we illustrate the importance of these three characteristics. We then describe OSPREY, a protein redesign suite that implements our protein design algorithms. OSPREY has been used prospectively, with experimental validation, in several biomedically relevant settings. We show in detail how OSPREY has been used to predict resistance mutations and explain why improved flexibility, ensembles, and provability are essential for this application. OSPREY is free and open source under a Lesser GPL license. The latest version is OSPREY 2.0. The program, user manual, and source code are available at www.cs.duke.edu/donaldlab/software.php. osprey@cs.duke.edu. Copyright © 2013 Elsevier Inc. All rights reserved.

  5. SnowyOwl: accurate prediction of fungal genes by using RNA-Seq and homology information to select among ab initio models

    PubMed Central

    2014-01-01

    Background Locating the protein-coding genes in novel genomes is essential to understanding and exploiting the genomic information but it is still difficult to accurately predict all the genes. The recent availability of detailed information about transcript structure from high-throughput sequencing of messenger RNA (RNA-Seq) delineates many expressed genes and promises increased accuracy in gene prediction. Computational gene predictors have been intensively developed for and tested in well-studied animal genomes. Hundreds of fungal genomes are now or will soon be sequenced. The differences of fungal genomes from animal genomes and the phylogenetic sparsity of well-studied fungi call for gene-prediction tools tailored to them. Results SnowyOwl is a new gene prediction pipeline that uses RNA-Seq data to train and provide hints for the generation of Hidden Markov Model (HMM)-based gene predictions and to evaluate the resulting models. The pipeline has been developed and streamlined by comparing its predictions to manually curated gene models in three fungal genomes and validated against the high-quality gene annotation of Neurospora crassa; SnowyOwl predicted N. crassa genes with 83% sensitivity and 65% specificity. SnowyOwl gains sensitivity by repeatedly running the HMM gene predictor Augustus with varied input parameters and selectivity by choosing the models with best homology to known proteins and best agreement with the RNA-Seq data. Conclusions SnowyOwl efficiently uses RNA-Seq data to produce accurate gene models in both well-studied and novel fungal genomes. The source code for the SnowyOwl pipeline (in Python) and a web interface (in PHP) is freely available from http://sourceforge.net/projects/snowyowl/. PMID:24980894

  6. Human cell structure-driven model construction for predicting protein subcellular location from biological images.

    PubMed

    Shao, Wei; Liu, Mingxia; Zhang, Daoqiang

    2016-01-01

    The systematic study of subcellular location pattern is very important for fully characterizing the human proteome. Nowadays, with the great advances in automated microscopic imaging, accurate bioimage-based classification methods to predict protein subcellular locations are highly desired. All existing models were constructed on the independent parallel hypothesis, where the cellular component classes are positioned independently in a multi-class classification engine. The important structural information of cellular compartments is missed. To deal with this problem for developing more accurate models, we proposed a novel cell structure-driven classifier construction approach (SC-PSorter) by employing the prior biological structural information in the learning model. Specifically, the structural relationship among the cellular components is reflected by a new codeword matrix under the error correcting output coding framework. Then, we construct multiple SC-PSorter-based classifiers corresponding to the columns of the error correcting output coding codeword matrix using a multi-kernel support vector machine classification approach. Finally, we perform the classifier ensemble by combining those multiple SC-PSorter-based classifiers via majority voting. We evaluate our method on a collection of 1636 immunohistochemistry images from the Human Protein Atlas database. The experimental results show that our method achieves an overall accuracy of 89.0%, which is 6.4% higher than the state-of-the-art method. The dataset and code can be downloaded from https://github.com/shaoweinuaa/. dqzhang@nuaa.edu.cn Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  7. Draft Genome Sequence of Mycobacterium bohemicum Strain DSM 44277T.

    PubMed

    Asmar, Shady; Phelippeau, Michael; Robert, Catherine; Croce, Olivier; Drancourt, Michel

    2015-08-06

    The Mycobacterium bohemicum strain is a nontuberculosis species mainly responsible for pediatric cervical lymphadenitis. The draft genome of M. bohemicum DSM 44277(T) comprises 5,097,190 bp exhibiting a 68.64% G+C content, 4,840 protein-coding genes, and 75 predicted RNA genes. Copyright © 2015 Asmar et al.

  8. Genome Sequence of a Chromium-Reducing Strain, Bacillus cereus S612

    DOE PAGES

    Wang, Dongping; Boukhalfa, Hakim; Ware, Doug S.; ...

    2015-12-10

    We report here the genome sequence of an effective chromium-reducing bacterium,Bacillus cereusstrain S612. We found that the size of the draft genome sequence is approximately 5.4 Mb, with a G+C content of 35%, and it is predicted to contain 5,450 protein-coding genes.

  9. Draft Genome Sequence of Saccharomyces cerevisiae Barra Grande (BG-1), a Brazilian Industrial Bioethanol-Producing Strain

    PubMed Central

    Coutouné, Natalia; Mulato, Aline Tieppo Nogueira

    2017-01-01

    ABSTRACT Here, we present the draft genome sequence of Saccharomyces cerevisiae BG-1, a Brazilian industrial strain widely used for bioethanol production from sugarcane. The 11.7-Mb genome sequence consists of 216 scaffolds and harbors 5,607 predicted protein-coding genes. PMID:28360170

  10. ContEst16S: an algorithm that identifies contaminated prokaryotic genomes using 16S RNA gene sequences.

    PubMed

    Lee, Imchang; Chalita, Mauricio; Ha, Sung-Min; Na, Seong-In; Yoon, Seok-Hwan; Chun, Jongsik

    2017-06-01

    Thanks to the recent advancement of DNA sequencing technology, the cost and time of prokaryotic genome sequencing have been dramatically decreased. It has repeatedly been reported that genome sequencing using high-throughput next-generation sequencing is prone to contaminations due to its high depth of sequencing coverage. Although a few bioinformatics tools are available to detect potential contaminations, these have inherited limitations as they only use protein-coding genes. Here we introduce a new algorithm, called ContEst16S, to detect potential contaminations using 16S rRNA genes from genome assemblies. We screened 69 745 prokaryotic genomes from the NCBI Assembly Database using ContEst16S and found that 594 were contaminated by bacteria, human and plants. Of the predicted contaminated genomes, 8 % were not predicted by the existing protein-coding gene-based tool, implying that both methods can be complementary in the detection of contaminations. A web-based service of the algorithm is available at www.ezbiocloud.net/tools/contest16s.

  11. Transmembrane Topology and Signal Peptide Prediction Using Dynamic Bayesian Networks

    PubMed Central

    Reynolds, Sheila M.; Käll, Lukas; Riffle, Michael E.; Bilmes, Jeff A.; Noble, William Stafford

    2008-01-01

    Hidden Markov models (HMMs) have been successfully applied to the tasks of transmembrane protein topology prediction and signal peptide prediction. In this paper we expand upon this work by making use of the more powerful class of dynamic Bayesian networks (DBNs). Our model, Philius, is inspired by a previously published HMM, Phobius, and combines a signal peptide submodel with a transmembrane submodel. We introduce a two-stage DBN decoder that combines the power of posterior decoding with the grammar constraints of Viterbi-style decoding. Philius also provides protein type, segment, and topology confidence metrics to aid in the interpretation of the predictions. We report a relative improvement of 13% over Phobius in full-topology prediction accuracy on transmembrane proteins, and a sensitivity and specificity of 0.96 in detecting signal peptides. We also show that our confidence metrics correlate well with the observed precision. In addition, we have made predictions on all 6.3 million proteins in the Yeast Resource Center (YRC) database. This large-scale study provides an overall picture of the relative numbers of proteins that include a signal-peptide and/or one or more transmembrane segments as well as a valuable resource for the scientific community. All DBNs are implemented using the Graphical Models Toolkit. Source code for the models described here is available at http://noble.gs.washington.edu/proj/philius. A Philius Web server is available at http://www.yeastrc.org/philius, and the predictions on the YRC database are available at http://www.yeastrc.org/pdr. PMID:18989393

  12. A discriminatory function for prediction of protein-DNA interactions based on alpha shape modeling.

    PubMed

    Zhou, Weiqiang; Yan, Hong

    2010-10-15

    Protein-DNA interaction has significant importance in many biological processes. However, the underlying principle of the molecular recognition process is still largely unknown. As more high-resolution 3D structures of protein-DNA complex are becoming available, the surface characteristics of the complex become an important research topic. In our work, we apply an alpha shape model to represent the surface structure of the protein-DNA complex and developed an interface-atom curvature-dependent conditional probability discriminatory function for the prediction of protein-DNA interaction. The interface-atom curvature-dependent formalism captures atomic interaction details better than the atomic distance-based method. The proposed method provides good performance in discriminating the native structures from the docking decoy sets, and outperforms the distance-dependent formalism in terms of the z-score. Computer experiment results show that the curvature-dependent formalism with the optimal parameters can achieve a native z-score of -8.17 in discriminating the native structure from the highest surface-complementarity scored decoy set and a native z-score of -7.38 in discriminating the native structure from the lowest RMSD decoy set. The interface-atom curvature-dependent formalism can also be used to predict apo version of DNA-binding proteins. These results suggest that the interface-atom curvature-dependent formalism has a good prediction capability for protein-DNA interactions. The code and data sets are available for download on http://www.hy8.com/bioinformatics.htm kenandzhou@hotmail.com.

  13. A versatile palindromic amphipathic repeat coding sequence horizontally distributed among diverse bacterial and eucaryotic microbes

    PubMed Central

    2010-01-01

    Background Intragenic tandem repeats occur throughout all domains of life and impart functional and structural variability to diverse translation products. Repeat proteins confer distinctive surface phenotypes to many unicellular organisms, including those with minimal genomes such as the wall-less bacterial monoderms, Mollicutes. One such repeat pattern in this clade is distributed in a manner suggesting its exchange by horizontal gene transfer (HGT). Expanding genome sequence databases reveal the pattern in a widening range of bacteria, and recently among eucaryotic microbes. We examined the genomic flux and consequences of the motif by determining its distribution, predicted structural features and association with membrane-targeted proteins. Results Using a refined hidden Markov model, we document a 25-residue protein sequence motif tandemly arrayed in variable-number repeats in ORFs lacking assigned functions. It appears sporadically in unicellular microbes from disparate bacterial and eucaryotic clades, representing diverse lifestyles and ecological niches that include host parasitic, marine and extreme environments. Tracts of the repeats predict a malleable configuration of recurring domains, with conserved hydrophobic residues forming an amphipathic secondary structure in which hydrophilic residues endow extensive sequence variation. Many ORFs with these domains also have membrane-targeting sequences that predict assorted topologies; others may comprise reservoirs of sequence variants. We demonstrate expressed variants among surface lipoproteins that distinguish closely related animal pathogens belonging to a subgroup of the Mollicutes. DNA sequences encoding the tandem domains display dyad symmetry. Moreover, in some taxa the domains occur in ORFs selectively associated with mobile elements. These features, a punctate phylogenetic distribution, and different patterns of dispersal in genomes of related taxa, suggest that the repeat may be disseminated by HGT and intra-genomic shuffling. Conclusions We describe novel features of PARCELs (Palindromic Amphipathic Repeat Coding ELements), a set of widely distributed repeat protein domains and coding sequences that were likely acquired through HGT by diverse unicellular microbes, further mobilized and diversified within genomes, and co-opted for expression in the membrane proteome of some taxa. Disseminated by multiple gene-centric vehicles, ORFs harboring these elements enhance accessory gene pools as part of the "mobilome" connecting genomes of various clades, in taxa sharing common niches. PMID:20626840

  14. Variability and transmission by Aphis glycines of North American and Asian Soybean mosaic virus isolates.

    PubMed

    Domier, L L; Latorre, I J; Steinlage, T A; McCoppin, N; Hartman, G L

    2003-10-01

    The variability of North American and Asian strains and isolates of Soybean mosaic virus was investigated. First, polymerase chain reaction (PCR) products representing the coat protein (CP)-coding regions of 38 SMVs were analyzed for restriction fragment length polymorphisms (RFLP). Second, the nucleotide and predicted amino acid sequence variability of the P1-coding region of 18 SMVs and the helper component/protease (HC/Pro) and CP-coding regions of 25 SMVs were assessed. The CP nucleotide and predicted amino acid sequences were the most similar and predicted phylogenetic relationships similar to those obtained from RFLP analysis. Neither RFLP nor sequence analyses of the CP-coding regions grouped the SMVs by geographical origin. The P1 and HC/Pro sequences were more variable and separated the North American and Asian SMV isolates into two groups similar to previously reported differences in pathogenic diversity of the two sets of SMV isolates. The P1 region was the most informative of the three regions analyzed. To assess the biological relevance of the sequence differences in the HC/Pro and CP coding regions, the transmissibility of 14 SMV isolates by Aphis glycines was tested. All field isolates of SMV were transmitted efficiently by A. glycines, but the laboratory isolates analyzed were transmitted poorly. The amino acid sequences from most, but not all, of the poorly transmitted isolates contained mutations in the aphid transmission-associated DAG and/or KLSC amino acid sequence motifs of CP and HC/Pro, respectively.

  15. Nanoparticles-cell association predicted by protein corona fingerprints

    NASA Astrophysics Data System (ADS)

    Palchetti, S.; Digiacomo, L.; Pozzi, D.; Peruzzi, G.; Micarelli, E.; Mahmoudi, M.; Caracciolo, G.

    2016-06-01

    In a physiological environment (e.g., blood and interstitial fluids) nanoparticles (NPs) will bind proteins shaping a ``protein corona'' layer. The long-lived protein layer tightly bound to the NP surface is referred to as the hard corona (HC) and encodes information that controls NP bioactivity (e.g. cellular association, cellular signaling pathways, biodistribution, and toxicity). Decrypting this complex code has become a priority to predict the NP biological outcomes. Here, we use a library of 16 lipid NPs of varying size (Ø ~ 100-250 nm) and surface chemistry (unmodified and PEGylated) to investigate the relationships between NP physicochemical properties (nanoparticle size, aggregation state and surface charge), protein corona fingerprints (PCFs), and NP-cell association. We found out that none of the NPs' physicochemical properties alone was exclusively able to account for association with human cervical cancer cell line (HeLa). For the entire library of NPs, a total of 436 distinct serum proteins were detected. We developed a predictive-validation modeling that provides a means of assessing the relative significance of the identified corona proteins. Interestingly, a minor fraction of the HC, which consists of only 8 PCFs were identified as main promoters of NP association with HeLa cells. Remarkably, identified PCFs have several receptors with high level of expression on the plasma membrane of HeLa cells.In a physiological environment (e.g., blood and interstitial fluids) nanoparticles (NPs) will bind proteins shaping a ``protein corona'' layer. The long-lived protein layer tightly bound to the NP surface is referred to as the hard corona (HC) and encodes information that controls NP bioactivity (e.g. cellular association, cellular signaling pathways, biodistribution, and toxicity). Decrypting this complex code has become a priority to predict the NP biological outcomes. Here, we use a library of 16 lipid NPs of varying size (Ø ~ 100-250 nm) and surface chemistry (unmodified and PEGylated) to investigate the relationships between NP physicochemical properties (nanoparticle size, aggregation state and surface charge), protein corona fingerprints (PCFs), and NP-cell association. We found out that none of the NPs' physicochemical properties alone was exclusively able to account for association with human cervical cancer cell line (HeLa). For the entire library of NPs, a total of 436 distinct serum proteins were detected. We developed a predictive-validation modeling that provides a means of assessing the relative significance of the identified corona proteins. Interestingly, a minor fraction of the HC, which consists of only 8 PCFs were identified as main promoters of NP association with HeLa cells. Remarkably, identified PCFs have several receptors with high level of expression on the plasma membrane of HeLa cells. Electronic supplementary information (ESI) available: Table S1. Cell viability (%) and cell association of the different nanoparticles used. Table S2. Total number of identified proteins on the different nanoparticles used. Tables S3-S18. Top 25 most abundant corona proteins identified in the protein corona of nanoparticles NP2-NP16 following 1 hour incubation with HP. Table S19. List of descriptors used. Table S20. Potential targets of protein corona fingerprints with its own interaction score (mentha) and the expression median value in Hela cells. Fig. S1 and S2. Effect of exposure to human plasma on size and zeta potential of NPs. Fig. S3. Predictive modeling of nanoparticle-cell association. See DOI: 10.1039/c6nr03898k

  16. Computational Identification and Functional Predictions of Long Noncoding RNA in Zea mays

    PubMed Central

    Boerner, Susan; McGinnis, Karen M.

    2012-01-01

    Background Computational analysis of cDNA sequences from multiple organisms suggests that a large portion of transcribed DNA does not code for a functional protein. In mammals, noncoding transcription is abundant, and often results in functional RNA molecules that do not appear to encode proteins. Many long noncoding RNAs (lncRNAs) appear to have epigenetic regulatory function in humans, including HOTAIR and XIST. While epigenetic gene regulation is clearly an essential mechanism in plants, relatively little is known about the presence or function of lncRNAs in plants. Methodology/Principal Findings To explore the connection between lncRNA and epigenetic regulation of gene expression in plants, a computational pipeline using the programming language Python has been developed and applied to maize full length cDNA sequences to identify, classify, and localize potential lncRNAs. The pipeline was used in parallel with an SVM tool for identifying ncRNAs to identify the maximal number of ncRNAs in the dataset. Although the available library of sequences was small and potentially biased toward protein coding transcripts, 15% of the sequences were predicted to be noncoding. Approximately 60% of these sequences appear to act as precursors for small RNA molecules and may function to regulate gene expression via a small RNA dependent mechanism. ncRNAs were predicted to originate from both genic and intergenic loci. Of the lncRNAs that originated from genic loci, ∼20% were antisense to the host gene loci. Conclusions/Significance Consistent with similar studies in other organisms, noncoding transcription appears to be widespread in the maize genome. Computational predictions indicate that maize lncRNAs may function to regulate expression of other genes through multiple RNA mediated mechanisms. PMID:22916204

  17. A comprehensive study of small non-frameshift insertions/deletions in proteins and prediction of their phenotypic effects by a machine learning method (KD4i)

    PubMed Central

    2014-01-01

    Background Small insertion and deletion polymorphisms (Indels) are the second most common mutations in the human genome, after Single Nucleotide Polymorphisms (SNPs). Recent studies have shown that they have significant influence on genetic variation by altering human traits and can cause multiple human diseases. In particular, many Indels that occur in protein coding regions are known to impact the structure or function of the protein. A major challenge is to predict the effects of these Indels and to distinguish between deleterious and neutral variants. When an Indel occurs within a coding region, it can be either frameshifting (FS) or non-frameshifting (NFS). FS-Indels either modify the complete C-terminal region of the protein or result in premature termination of translation. NFS-Indels insert/delete multiples of three nucleotides leading to the insertion/deletion of one or more amino acids. Results In order to study the relationships between NFS-Indels and Mendelian diseases, we characterized NFS-Indels according to numerous structural, functional and evolutionary parameters. We then used these parameters to identify specific characteristics of disease-causing and neutral NFS-Indels. Finally, we developed a new machine learning approach, KD4i, that can be used to predict the phenotypic effects of NFS-Indels. Conclusions We demonstrate in a large-scale evaluation that the accuracy of KD4i is comparable to existing state-of-the-art methods. However, a major advantage of our approach is that we also provide the reasons for the predictions, in the form of a set of rules. The rules are interpretable by non-expert humans and they thus represent new knowledge about the relationships between the genotype and phenotypes of NFS-Indels and the causative molecular perturbations that result in the disease. PMID:24742296

  18. DiffSLC: A graph centrality method to detect essential proteins of a protein-protein interaction network.

    PubMed

    Mistry, Divya; Wise, Roger P; Dickerson, Julie A

    2017-01-01

    Identification of central genes and proteins in biomolecular networks provides credible candidates for pathway analysis, functional analysis, and essentiality prediction. The DiffSLC centrality measure predicts central and essential genes and proteins using a protein-protein interaction network. Network centrality measures prioritize nodes and edges based on their importance to the network topology. These measures helped identify critical genes and proteins in biomolecular networks. The proposed centrality measure, DiffSLC, combines the number of interactions of a protein and the gene coexpression values of genes from which those proteins were translated, as a weighting factor to bias the identification of essential proteins in a protein interaction network. Potentially essential proteins with low node degree are promoted through eigenvector centrality. Thus, the gene coexpression values are used in conjunction with the eigenvector of the network's adjacency matrix and edge clustering coefficient to improve essentiality prediction. The outcome of this prediction is shown using three variations: (1) inclusion or exclusion of gene co-expression data, (2) impact of different coexpression measures, and (3) impact of different gene expression data sets. For a total of seven networks, DiffSLC is compared to other centrality measures using Saccharomyces cerevisiae protein interaction networks and gene expression data. Comparisons are also performed for the top ranked proteins against the known essential genes from the Saccharomyces Gene Deletion Project, which show that DiffSLC detects more essential proteins and has a higher area under the ROC curve than other compared methods. This makes DiffSLC a stronger alternative to other centrality methods for detecting essential genes using a protein-protein interaction network that obeys centrality-lethality principle. DiffSLC is implemented using the igraph package in R, and networkx package in Python. The python package can be obtained from git.io/diffslcpy. The R implementation and code to reproduce the analysis is available via git.io/diffslc.

  19. EGASP: the human ENCODE Genome Annotation Assessment Project

    PubMed Central

    Guigó, Roderic; Flicek, Paul; Abril, Josep F; Reymond, Alexandre; Lagarde, Julien; Denoeud, France; Antonarakis, Stylianos; Ashburner, Michael; Bajic, Vladimir B; Birney, Ewan; Castelo, Robert; Eyras, Eduardo; Ucla, Catherine; Gingeras, Thomas R; Harrow, Jennifer; Hubbard, Tim; Lewis, Suzanna E; Reese, Martin G

    2006-01-01

    Background We present the results of EGASP, a community experiment to assess the state-of-the-art in genome annotation within the ENCODE regions, which span 1% of the human genome sequence. The experiment had two major goals: the assessment of the accuracy of computational methods to predict protein coding genes; and the overall assessment of the completeness of the current human genome annotations as represented in the ENCODE regions. For the computational prediction assessment, eighteen groups contributed gene predictions. We evaluated these submissions against each other based on a 'reference set' of annotations generated as part of the GENCODE project. These annotations were not available to the prediction groups prior to the submission deadline, so that their predictions were blind and an external advisory committee could perform a fair assessment. Results The best methods had at least one gene transcript correctly predicted for close to 70% of the annotated genes. Nevertheless, the multiple transcript accuracy, taking into account alternative splicing, reached only approximately 40% to 50% accuracy. At the coding nucleotide level, the best programs reached an accuracy of 90% in both sensitivity and specificity. Programs relying on mRNA and protein sequences were the most accurate in reproducing the manually curated annotations. Experimental validation shows that only a very small percentage (3.2%) of the selected 221 computationally predicted exons outside of the existing annotation could be verified. Conclusion This is the first such experiment in human DNA, and we have followed the standards established in a similar experiment, GASP1, in Drosophila melanogaster. We believe the results presented here contribute to the value of ongoing large-scale annotation projects and should guide further experimental methods when being scaled up to the entire human genome sequence. PMID:16925836

  20. Identifying functionally informative evolutionary sequence profiles.

    PubMed

    Gil, Nelson; Fiser, Andras

    2018-04-15

    Multiple sequence alignments (MSAs) can provide essential input to many bioinformatics applications, including protein structure prediction and functional annotation. However, the optimal selection of sequences to obtain biologically informative MSAs for such purposes is poorly explored, and has traditionally been performed manually. We present Selection of Alignment by Maximal Mutual Information (SAMMI), an automated, sequence-based approach to objectively select an optimal MSA from a large set of alternatives sampled from a general sequence database search. The hypothesis of this approach is that the mutual information among MSA columns will be maximal for those MSAs that contain the most diverse set possible of the most structurally and functionally homogeneous protein sequences. SAMMI was tested to select MSAs for functional site residue prediction by analysis of conservation patterns on a set of 435 proteins obtained from protein-ligand (peptides, nucleic acids and small substrates) and protein-protein interaction databases. Availability and implementation: A freely accessible program, including source code, implementing SAMMI is available at https://github.com/nelsongil92/SAMMI.git. andras.fiser@einstein.yu.edu. Supplementary data are available at Bioinformatics online.

  1. Robust enzyme design: bioinformatic tools for improved protein stability.

    PubMed

    Suplatov, Dmitry; Voevodin, Vladimir; Švedas, Vytas

    2015-03-01

    The ability of proteins and enzymes to maintain a functionally active conformation under adverse environmental conditions is an important feature of biocatalysts, vaccines, and biopharmaceutical proteins. From an evolutionary perspective, robust stability of proteins improves their biological fitness and allows for further optimization. Viewed from an industrial perspective, enzyme stability is crucial for the practical application of enzymes under the required reaction conditions. In this review, we analyze bioinformatic-driven strategies that are used to predict structural changes that can be applied to wild type proteins in order to produce more stable variants. The most commonly employed techniques can be classified into stochastic approaches, empirical or systematic rational design strategies, and design of chimeric proteins. We conclude that bioinformatic analysis can be efficiently used to study large protein superfamilies systematically as well as to predict particular structural changes which increase enzyme stability. Evolution has created a diversity of protein properties that are encoded in genomic sequences and structural data. Bioinformatics has the power to uncover this evolutionary code and provide a reproducible selection of hotspots - key residues to be mutated in order to produce more stable and functionally diverse proteins and enzymes. Further development of systematic bioinformatic procedures is needed to organize and analyze sequences and structures of proteins within large superfamilies and to link them to function, as well as to provide knowledge-based predictions for experimental evaluation. Copyright © 2014 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  2. Predictive hypotheses are ineffectual in resolving complex biochemical systems.

    PubMed

    Fry, Michael

    2018-03-20

    Scientific hypotheses may either predict particular unknown facts or accommodate previously-known data. Although affirmed predictions are intuitively more rewarding than accommodations of established facts, opinions divide whether predictive hypotheses are also epistemically superior to accommodation hypotheses. This paper examines the contribution of predictive hypotheses to discoveries of several bio-molecular systems. Having all the necessary elements of the system known beforehand, an abstract predictive hypothesis of semiconservative mode of DNA replication was successfully affirmed. However, in defining the genetic code whose biochemical basis was unclear, hypotheses were only partially effective and supplementary experimentation was required for its conclusive definition. Markedly, hypotheses were entirely inept in predicting workings of complex systems that included unknown elements. Thus, hypotheses did not predict the existence and function of mRNA, the multiple unidentified components of the protein biosynthesis machinery, or the manifold unknown constituents of the ubiquitin-proteasome system of protein breakdown. Consequently, because of their inability to envision unknown entities, predictive hypotheses did not contribute to the elucidation of cation theories remained the sole instrument to explain complex bio-molecular systems, the philosophical question of alleged advantage of predictive over accommodative hypotheses became inconsequential.

  3. Cis-regulatory somatic mutations and gene-expression alteration in B-cell lymphomas.

    PubMed

    Mathelier, Anthony; Lefebvre, Calvin; Zhang, Allen W; Arenillas, David J; Ding, Jiarui; Wasserman, Wyeth W; Shah, Sohrab P

    2015-04-23

    With the rapid increase of whole-genome sequencing of human cancers, an important opportunity to analyze and characterize somatic mutations lying within cis-regulatory regions has emerged. A focus on protein-coding regions to identify nonsense or missense mutations disruptive to protein structure and/or function has led to important insights; however, the impact on gene expression of mutations lying within cis-regulatory regions remains under-explored. We analyzed somatic mutations from 84 matched tumor-normal whole genomes from B-cell lymphomas with accompanying gene expression measurements to elucidate the extent to which these cancers are disrupted by cis-regulatory mutations. We characterize mutations overlapping a high quality set of well-annotated transcription factor binding sites (TFBSs), covering a similar portion of the genome as protein-coding exons. Our results indicate that cis-regulatory mutations overlapping predicted TFBSs are enriched in promoter regions of genes involved in apoptosis or growth/proliferation. By integrating gene expression data with mutation data, our computational approach culminates with identification of cis-regulatory mutations most likely to participate in dysregulation of the gene expression program. The impact can be measured along with protein-coding mutations to highlight key mutations disrupting gene expression and pathways in cancer. Our study yields specific genes with disrupted expression triggered by genomic mutations in either the coding or the regulatory space. It implies that mutated regulatory components of the genome contribute substantially to cancer pathways. Our analyses demonstrate that identifying genomically altered cis-regulatory elements coupled with analysis of gene expression data will augment biological interpretation of mutational landscapes of cancers.

  4. The virophage as a unique parasite of the giant mimivirus.

    PubMed

    La Scola, Bernard; Desnues, Christelle; Pagnier, Isabelle; Robert, Catherine; Barrassi, Lina; Fournous, Ghislain; Merchat, Michèle; Suzan-Monti, Marie; Forterre, Patrick; Koonin, Eugene; Raoult, Didier

    2008-09-04

    Viruses are obligate parasites of Eukarya, Archaea and Bacteria. Acanthamoeba polyphaga mimivirus (APMV) is the largest known virus; it grows only in amoeba and is visible under the optical microscope. Mimivirus possesses a 1,185-kilobase double-stranded linear chromosome whose coding capacity is greater than that of numerous bacteria and archaea1, 2, 3. Here we describe an icosahedral small virus, Sputnik, 50 nm in size, found associated with a new strain of APMV. Sputnik cannot multiply in Acanthamoeba castellanii but grows rapidly, after an eclipse phase, in the giant virus factory found in amoebae co-infected with APMV4. Sputnik growth is deleterious to APMV and results in the production of abortive forms and abnormal capsid assembly of the host virus. The Sputnik genome is an 18.343-kilobase circular double-stranded DNA and contains genes that are linked to viruses infecting each of the three domains of life Eukarya, Archaea and Bacteria. Of the 21 predicted protein-coding genes, eight encode proteins with detectable homologues, including three proteins apparently derived from APMV, a homologue of an archaeal virus integrase, a predicted primase-helicase, a packaging ATPase with homologues in bacteriophages and eukaryotic viruses, a distant homologue of bacterial insertion sequence transposase DNA-binding subunit, and a Zn-ribbon protein. The closest homologues of the last four of these proteins were detected in the Global Ocean Survey environmental data set5, suggesting that Sputnik represents a currently unknown family of viruses. Considering its functional analogy with bacteriophages, we classify this virus as a virophage. The virophage could be a vehicle mediating lateral gene transfer between giant viruses.

  5. Protein Sequence Annotation Tool (PSAT): A centralized web-based meta-server for high-throughput sequence annotations

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Leung, Elo; Huang, Amy; Cadag, Eithon

    In this study, we introduce the Protein Sequence Annotation Tool (PSAT), a web-based, sequence annotation meta-server for performing integrated, high-throughput, genome-wide sequence analyses. Our goals in building PSAT were to (1) create an extensible platform for integration of multiple sequence-based bioinformatics tools, (2) enable functional annotations and enzyme predictions over large input protein fasta data sets, and (3) provide a web interface for convenient execution of the tools. In this paper, we demonstrate the utility of PSAT by annotating the predicted peptide gene products of Herbaspirillum sp. strain RV1423, importing the results of PSAT into EC2KEGG, and using the resultingmore » functional comparisons to identify a putative catabolic pathway, thereby distinguishing RV1423 from a well annotated Herbaspirillum species. This analysis demonstrates that high-throughput enzyme predictions, provided by PSAT processing, can be used to identify metabolic potential in an otherwise poorly annotated genome. Lastly, PSAT is a meta server that combines the results from several sequence-based annotation and function prediction codes, and is available at http://psat.llnl.gov/psat/. PSAT stands apart from other sequencebased genome annotation systems in providing a high-throughput platform for rapid de novo enzyme predictions and sequence annotations over large input protein sequence data sets in FASTA. PSAT is most appropriately applied in annotation of large protein FASTA sets that may or may not be associated with a single genome.« less

  6. Protein Sequence Annotation Tool (PSAT): A centralized web-based meta-server for high-throughput sequence annotations

    DOE PAGES

    Leung, Elo; Huang, Amy; Cadag, Eithon; ...

    2016-01-20

    In this study, we introduce the Protein Sequence Annotation Tool (PSAT), a web-based, sequence annotation meta-server for performing integrated, high-throughput, genome-wide sequence analyses. Our goals in building PSAT were to (1) create an extensible platform for integration of multiple sequence-based bioinformatics tools, (2) enable functional annotations and enzyme predictions over large input protein fasta data sets, and (3) provide a web interface for convenient execution of the tools. In this paper, we demonstrate the utility of PSAT by annotating the predicted peptide gene products of Herbaspirillum sp. strain RV1423, importing the results of PSAT into EC2KEGG, and using the resultingmore » functional comparisons to identify a putative catabolic pathway, thereby distinguishing RV1423 from a well annotated Herbaspirillum species. This analysis demonstrates that high-throughput enzyme predictions, provided by PSAT processing, can be used to identify metabolic potential in an otherwise poorly annotated genome. Lastly, PSAT is a meta server that combines the results from several sequence-based annotation and function prediction codes, and is available at http://psat.llnl.gov/psat/. PSAT stands apart from other sequencebased genome annotation systems in providing a high-throughput platform for rapid de novo enzyme predictions and sequence annotations over large input protein sequence data sets in FASTA. PSAT is most appropriately applied in annotation of large protein FASTA sets that may or may not be associated with a single genome.« less

  7. Cloning, expression analysis, and chromosomal localization of HIP1R, an isolog of huntingtin interacting protein (HIP1).

    PubMed

    Seki, N; Muramatsu, M; Sugano, S; Suzuki, Y; Nakagawara, A; Ohhira, M; Hayashi, A; Hori, T; Saito, T

    1998-01-01

    Huntington disease (HD) is an inherited neurodegenerative disorder which is associated with CAG expansion in the coding region of the gene for huntingtin protein. Recently, a huntingtin interacting protein, HIP1, was isolated by the yeast two-hybrid system. Here we report the isolation of a cDNA clone for HIP1R (huntingtin interacting protein-1 related), which encodes a predicted protein product sharing a striking homology with HIP1. RT-PCR analysis showed that the messenger RNA was ubiquitously expressed in various human tissues. Based on PCR-assisted analysis of a radiation hybrid panel and fluorescence in situ hybridization, HIP1R was localized to the q24 region of chromosome 12.

  8. Predicting multicellular function through multi-layer tissue networks

    PubMed Central

    Zitnik, Marinka; Leskovec, Jure

    2017-01-01

    Abstract Motivation: Understanding functions of proteins in specific human tissues is essential for insights into disease diagnostics and therapeutics, yet prediction of tissue-specific cellular function remains a critical challenge for biomedicine. Results: Here, we present OhmNet, a hierarchy-aware unsupervised node feature learning approach for multi-layer networks. We build a multi-layer network, where each layer represents molecular interactions in a different human tissue. OhmNet then automatically learns a mapping of proteins, represented as nodes, to a neural embedding-based low-dimensional space of features. OhmNet encourages sharing of similar features among proteins with similar network neighborhoods and among proteins activated in similar tissues. The algorithm generalizes prior work, which generally ignores relationships between tissues, by modeling tissue organization with a rich multiscale tissue hierarchy. We use OhmNet to study multicellular function in a multi-layer protein interaction network of 107 human tissues. In 48 tissues with known tissue-specific cellular functions, OhmNet provides more accurate predictions of cellular function than alternative approaches, and also generates more accurate hypotheses about tissue-specific protein actions. We show that taking into account the tissue hierarchy leads to improved predictive power. Remarkably, we also demonstrate that it is possible to leverage the tissue hierarchy in order to effectively transfer cellular functions to a functionally uncharacterized tissue. Overall, OhmNet moves from flat networks to multiscale models able to predict a range of phenotypes spanning cellular subsystems. Availability and implementation: Source code and datasets are available at http://snap.stanford.edu/ohmnet. Contact: jure@cs.stanford.edu PMID:28881986

  9. Cloning and sequence analysis of Hemonchus contortus HC58cDNA.

    PubMed

    Muleke, Charles I; Ruofeng, Yan; Lixin, Xu; Xinwen, Bo; Xiangrui, Li

    2007-06-01

    The complete coding sequence of Hemonchus contortus HC58cDNA was generated by rapid amplification of cDNA ends and polymerase chain reaction using primers based on the 5' and 3' ends of the parasite mRNA, accession no. AF305964. The HC58cDNA gene was 851 bp long, with open reading frame of 717 bp, precursors to 239 amino acids coding for approximately 27 kDa protein. Analysis of amino acid sequence revealed conserved residues of cysteine, histidine, asparagine, occluding loop pattern, hemoglobinase motif and glutamine of the oxyanion hole characteristic of cathepsin B like proteases (CBL). Comparison of the predicted amino acid sequences showed the protein shared 33.5-58.7% identity to cathepsin B homologues in the papain clan CA family (family C1). Phylogenetic analysis revealed close evolutionary proximity of the protein sequence to counterpart sequences in the CBL, suggesting that HC58cDNA was a member of the papain family.

  10. Bioinformatic Analysis Reveals Archaeal tRNATyr and tRNATrp Identities in Bacteria

    PubMed Central

    Mukai, Takahito; Reynolds, Noah M.; Crnković, Ana; Söll, Dieter

    2017-01-01

    The tRNA identity elements for some amino acids are distinct between the bacterial and archaeal domains. Searching in recent genomic and metagenomic sequence data, we found some candidate phyla radiation (CPR) bacteria with archaeal tRNA identity for Tyr-tRNA and Trp-tRNA synthesis. These bacteria possess genes for tyrosyl-tRNA synthetase (TyrRS) and tryptophanyl-tRNA synthetase (TrpRS) predicted to be derived from DPANN superphylum archaea, while the cognate tRNATyr and tRNATrp genes reveal bacterial or archaeal origins. We identified a trace of domain fusion and swapping in the archaeal-type TyrRS gene of a bacterial lineage, suggesting that CPR bacteria may have used this mechanism to create diverse proteins. Archaeal-type TrpRS of bacteria and a few TrpRS species of DPANN archaea represent a new phylogenetic clade (named TrpRS-A). The TrpRS-A open reading frames (ORFs) are always associated with another ORF (named ORF1) encoding an unknown protein without global sequence identity to any known protein. However, our protein structure prediction identified a putative HIGH-motif and KMSKS-motif as well as many α-helices that are characteristic of class I aminoacyl-tRNA synthetase (aaRS) homologs. These results provide another example of the diversity of molecular components that implement the genetic code and provide a clue to the early evolution of life and the genetic code. PMID:28230768

  11. Efficient analysis of mouse genome sequences reveal many nonsense variants

    PubMed Central

    Steeland, Sophie; Timmermans, Steven; Van Ryckeghem, Sara; Hulpiau, Paco; Saeys, Yvan; Van Montagu, Marc; Vandenbroucke, Roosmarijn E.; Libert, Claude

    2016-01-01

    Genetic polymorphisms in coding genes play an important role when using mouse inbred strains as research models. They have been shown to influence research results, explain phenotypical differences between inbred strains, and increase the amount of interesting gene variants present in the many available inbred lines. SPRET/Ei is an inbred strain derived from Mus spretus that has ∼1% sequence difference with the C57BL/6J reference genome. We obtained a listing of all SNPs and insertions/deletions (indels) present in SPRET/Ei from the Mouse Genomes Project (Wellcome Trust Sanger Institute) and processed these data to obtain an overview of all transcripts having nonsynonymous coding sequence variants. We identified 8,883 unique variants affecting 10,096 different transcripts from 6,328 protein-coding genes, which is about 28% of all coding genes. Because only a subset of these variants results in drastic changes in proteins, we focused on variations that are nonsense mutations that ultimately resulted in a gain of a stop codon. These genes were identified by in silico changing the C57BL/6J coding sequences to the SPRET/Ei sequences, converting them to amino acid (AA) sequences, and comparing the AA sequences. All variants and transcripts affected were also stored in a database, which can be browsed using a SPRET/Ei M. spretus variants web tool (www.spretus.org), including a manual. We validated the tool by demonstrating the loss of function of three proteins predicted to be severely truncated, namely Fas, IRAK2, and IFNγR1. PMID:27147605

  12. Characterization and prediction of residues determining protein functional specificity.

    PubMed

    Capra, John A; Singh, Mona

    2008-07-01

    Within a homologous protein family, proteins may be grouped into subtypes that share specific functions that are not common to the entire family. Often, the amino acids present in a small number of sequence positions determine each protein's particular functional specificity. Knowledge of these specificity determining positions (SDPs) aids in protein function prediction, drug design and experimental analysis. A number of sequence-based computational methods have been introduced for identifying SDPs; however, their further development and evaluation have been hindered by the limited number of known experimentally determined SDPs. We combine several bioinformatics resources to automate a process, typically undertaken manually, to build a dataset of SDPs. The resulting large dataset, which consists of SDPs in enzymes, enables us to characterize SDPs in terms of their physicochemical and evolutionary properties. It also facilitates the large-scale evaluation of sequence-based SDP prediction methods. We present a simple sequence-based SDP prediction method, GroupSim, and show that, surprisingly, it is competitive with a representative set of current methods. We also describe ConsWin, a heuristic that considers sequence conservation of neighboring amino acids, and demonstrate that it improves the performance of all methods tested on our large dataset of enzyme SDPs. Datasets and GroupSim code are available online at http://compbio.cs.princeton.edu/specificity/. Supplementary data are available at Bioinformatics online.

  13. Post-translational processing targets functionally diverse proteins in Mycoplasma hyopneumoniae

    PubMed Central

    Tacchi, Jessica L.; Raymond, Benjamin B. A.; Haynes, Paul A.; Berry, Iain J.; Widjaja, Michael; Bogema, Daniel R.; Woolley, Lauren K.; Jenkins, Cheryl; Minion, F. Chris; Padula, Matthew P.; Djordjevic, Steven P.

    2016-01-01

    Mycoplasma hyopneumoniae is a genome-reduced, cell wall-less, bacterial pathogen with a predicted coding capacity of less than 700 proteins and is one of the smallest self-replicating pathogens. The cell surface of M. hyopneumoniae is extensively modified by processing events that target the P97 and P102 adhesin families. Here, we present analyses of the proteome of M. hyopneumoniae-type strain J using protein-centric approaches (one- and two-dimensional GeLC–MS/MS) that enabled us to focus on global processing events in this species. While these approaches only identified 52% of the predicted proteome (347 proteins), our analyses identified 35 surface-associated proteins with widely divergent functions that were targets of unusual endoproteolytic processing events, including cell adhesins, lipoproteins and proteins with canonical functions in the cytosol that moonlight on the cell surface. Affinity chromatography assays that separately used heparin, fibronectin, actin and host epithelial cell surface proteins as bait recovered cleavage products derived from these processed proteins, suggesting these fragments interact directly with the bait proteins and display previously unrecognized adhesive functions. We hypothesize that protein processing is underestimated as a post-translational modification in genome-reduced bacteria and prokaryotes more broadly, and represents an important mechanism for creating cell surface protein diversity. PMID:26865024

  14. Draft genome sequence of Streptomyces vitaminophilus ATCC 31673, a producer of pyrrolomycin antibiotics, some of which contain a nitro group

    DOE PAGES

    Mahan, Kristina M.; Klingeman, Dawn Marie; Robert L. Hettich; ...

    2016-01-21

    Streptomyces vitaminophilus produces pyrrolomycins, which are halogenated polyketide antibiotics. Some of the pyrrolomycins contain a rare nitro group located on the pyrrole ring. In addition, the 6.5-Mbp genome encodes 5,941 predicted protein-coding sequences in 39 contigs with a 71.9% G+C content.

  15. Draft Genome Sequence of Streptomyces vitaminophilus ATCC 31673, a Producer of Pyrrolomycin Antibiotics, Some of Which Contain a Nitro Group

    PubMed Central

    Klingeman, Dawn M.; Hettich, Robert L.; Parry, Ronald J.

    2016-01-01

    Streptomyces vitaminophilus produces pyrrolomycins, which are halogenated polyketide antibiotics. Some of the pyrrolomycins contain a rare nitro group located on the pyrrole ring. The 6.5-Mbp genome encodes 5,941 predicted protein-coding sequences in 39 contigs with a 71.9% G+C content. PMID:26798098

  16. Draft genome sequence of Streptomyces vitaminophilus ATCC 31673, a producer of pyrrolomycin antibiotics, some of which contain a nitro group

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Mahan, Kristina M.; Klingeman, Dawn Marie; Robert L. Hettich

    Streptomyces vitaminophilus produces pyrrolomycins, which are halogenated polyketide antibiotics. Some of the pyrrolomycins contain a rare nitro group located on the pyrrole ring. In addition, the 6.5-Mbp genome encodes 5,941 predicted protein-coding sequences in 39 contigs with a 71.9% G+C content.

  17. Analysis of Antisense Expression by Whole Genome Tiling Microarrays and siRNAs Suggests Mis-Annotation of Arabidopsis Orphan Protein-Coding Genes

    PubMed Central

    Richardson, Casey R.; Luo, Qing-Jun; Gontcharova, Viktoria; Jiang, Ying-Wen; Samanta, Manoj; Youn, Eunseog; Rock, Christopher D.

    2010-01-01

    Background MicroRNAs (miRNAs) and trans-acting small-interfering RNAs (tasi-RNAs) are small (20–22 nt long) RNAs (smRNAs) generated from hairpin secondary structures or antisense transcripts, respectively, that regulate gene expression by Watson-Crick pairing to a target mRNA and altering expression by mechanisms related to RNA interference. The high sequence homology of plant miRNAs to their targets has been the mainstay of miRNA prediction algorithms, which are limited in their predictive power for other kingdoms because miRNA complementarity is less conserved yet transitive processes (production of antisense smRNAs) are active in eukaryotes. We hypothesize that antisense transcription and associated smRNAs are biomarkers which can be computationally modeled for gene discovery. Principal Findings We explored rice (Oryza sativa) sense and antisense gene expression in publicly available whole genome tiling array transcriptome data and sequenced smRNA libraries (as well as C. elegans) and found evidence of transitivity of MIRNA genes similar to that found in Arabidopsis. Statistical analysis of antisense transcript abundances, presence of antisense ESTs, and association with smRNAs suggests several hundred Arabidopsis ‘orphan’ hypothetical genes are non-coding RNAs. Consistent with this hypothesis, we found novel Arabidopsis homologues of some MIRNA genes on the antisense strand of previously annotated protein-coding genes. A Support Vector Machine (SVM) was applied using thermodynamic energy of binding plus novel expression features of sense/antisense transcription topology and siRNA abundances to build a prediction model of miRNA targets. The SVM when trained on targets could predict the “ancient” (deeply conserved) class of validated Arabidopsis MIRNA genes with an accuracy of 84%, and 76% for “new” rapidly-evolving MIRNA genes. Conclusions Antisense and smRNA expression features and computational methods may identify novel MIRNA genes and other non-coding RNAs in plants and potentially other kingdoms, which can provide insight into antisense transcription, miRNA evolution, and post-transcriptional gene regulation. PMID:20520764

  18. A subset of conserved mammalian long non-coding RNAs are fossils of ancestral protein-coding genes.

    PubMed

    Hezroni, Hadas; Ben-Tov Perry, Rotem; Meir, Zohar; Housman, Gali; Lubelsky, Yoav; Ulitsky, Igor

    2017-08-30

    Only a small portion of human long non-coding RNAs (lncRNAs) appear to be conserved outside of mammals, but the events underlying the birth of new lncRNAs in mammals remain largely unknown. One potential source is remnants of protein-coding genes that transitioned into lncRNAs. We systematically compare lncRNA and protein-coding loci across vertebrates, and estimate that up to 5% of conserved mammalian lncRNAs are derived from lost protein-coding genes. These lncRNAs have specific characteristics, such as broader expression domains, that set them apart from other lncRNAs. Fourteen lncRNAs have sequence similarity with the loci of the contemporary homologs of the lost protein-coding genes. We propose that selection acting on enhancer sequences is mostly responsible for retention of these regions. As an example of an RNA element from a protein-coding ancestor that was retained in the lncRNA, we describe in detail a short translated ORF in the JPX lncRNA that was derived from an upstream ORF in a protein-coding gene and retains some of its functionality. We estimate that ~ 55 annotated conserved human lncRNAs are derived from parts of ancestral protein-coding genes, and loss of coding potential is thus a non-negligible source of new lncRNAs. Some lncRNAs inherited regulatory elements influencing transcription and translation from their protein-coding ancestors and those elements can influence the expression breadth and functionality of these lncRNAs.

  19. Protein Solvent-Accessibility Prediction by a Stacked Deep Bidirectional Recurrent Neural Network.

    PubMed

    Zhang, Buzhong; Li, Linqing; Lü, Qiang

    2018-05-25

    Residue solvent accessibility is closely related to the spatial arrangement and packing of residues. Predicting the solvent accessibility of a protein is an important step to understand its structure and function. In this work, we present a deep learning method to predict residue solvent accessibility, which is based on a stacked deep bidirectional recurrent neural network applied to sequence profiles. To capture more long-range sequence information, a merging operator was proposed when bidirectional information from hidden nodes was merged for outputs. Three types of merging operators were used in our improved model, with a long short-term memory network performing as a hidden computing node. The trained database was constructed from 7361 proteins extracted from the PISCES server using a cut-off of 25% sequence identity. Sequence-derived features including position-specific scoring matrix, physical properties, physicochemical characteristics, conservation score and protein coding were used to represent a residue. Using this method, predictive values of continuous relative solvent-accessible area were obtained, and then, these values were transformed into binary states with predefined thresholds. Our experimental results showed that our deep learning method improved prediction quality relative to current methods, with mean absolute error and Pearson's correlation coefficient values of 8.8% and 74.8%, respectively, on the CB502 dataset and 8.2% and 78%, respectively, on the Manesh215 dataset.

  20. Expression profiles of long non-coding RNAs located in autoimmune disease-associated regions reveal immune cell-type specificity.

    PubMed

    Hrdlickova, Barbara; Kumar, Vinod; Kanduri, Kartiek; Zhernakova, Daria V; Tripathi, Subhash; Karjalainen, Juha; Lund, Riikka J; Li, Yang; Ullah, Ubaid; Modderman, Rutger; Abdulahad, Wayel; Lähdesmäki, Harri; Franke, Lude; Lahesmaa, Riitta; Wijmenga, Cisca; Withoff, Sebo

    2014-01-01

    Although genome-wide association studies (GWAS) have identified hundreds of variants associated with a risk for autoimmune and immune-related disorders (AID), our understanding of the disease mechanisms is still limited. In particular, more than 90% of the risk variants lie in non-coding regions, and almost 10% of these map to long non-coding RNA transcripts (lncRNAs). lncRNAs are known to show more cell-type specificity than protein-coding genes. We aimed to characterize lncRNAs and protein-coding genes located in loci associated with nine AIDs which have been well-defined by Immunochip analysis and by transcriptome analysis across seven populations of peripheral blood leukocytes (granulocytes, monocytes, natural killer (NK) cells, B cells, memory T cells, naive CD4(+) and naive CD8(+) T cells) and four populations of cord blood-derived T-helper cells (precursor, primary, and polarized (Th1, Th2) T-helper cells). We show that lncRNAs mapping to loci shared between AID are significantly enriched in immune cell types compared to lncRNAs from the whole genome (α <0.005). We were not able to prioritize single cell types relevant for specific diseases, but we observed five different cell types enriched (α <0.005) in five AID (NK cells for inflammatory bowel disease, juvenile idiopathic arthritis, primary biliary cirrhosis, and psoriasis; memory T and CD8(+) T cells in juvenile idiopathic arthritis, primary biliary cirrhosis, psoriasis, and rheumatoid arthritis; Th0 and Th2 cells for inflammatory bowel disease, juvenile idiopathic arthritis, primary biliary cirrhosis, psoriasis, and rheumatoid arthritis). Furthermore, we show that co-expression analyses of lncRNAs and protein-coding genes can predict the signaling pathways in which these AID-associated lncRNAs are involved. The observed enrichment of lncRNA transcripts in AID loci implies lncRNAs play an important role in AID etiology and suggests that lncRNA genes should be studied in more detail to interpret GWAS findings correctly. The co-expression results strongly support a model in which the lncRNA and protein-coding genes function together in the same pathways.

  1. Differential protein-coding gene and long noncoding RNA expression in smoking-related lung squamous cell carcinoma.

    PubMed

    Li, Shicheng; Sun, Xiao; Miao, Shuncheng; Liu, Jia; Jiao, Wenjie

    2017-11-01

    Cigarette smoking is one of the greatest preventable risk factors for developing cancer, and most cases of lung squamous cell carcinoma (lung SCC) are associated with smoking. The pathogenesis mechanism of tumor progress is unclear. This study aimed to identify biomarkers in smoking-related lung cancer, including protein-coding gene, long noncoding RNA, and transcription factors. We selected and obtained messenger RNA microarray datasets and clinical data from the Gene Expression Omnibus database to identify gene expression altered by cigarette smoking. Integrated bioinformatic analysis was used to clarify biological functions of the identified genes, including Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway, the construction of a protein-protein interaction network, transcription factor, and statistical analyses. Subsequent quantitative real-time PCR was utilized to verify these bioinformatic analyses. Five hundred and ninety-eight differentially expressed genes and 21 long noncoding RNA were identified in smoking-related lung SCC. GO and KEGG pathway analysis showed that identified genes were enriched in the cancer-related functions and pathways. The protein-protein interaction network revealed seven hub genes identified in lung SCC. Several transcription factors and their binding sites were predicted. The results of real-time quantitative PCR revealed that AURKA and BIRC5 were significantly upregulated and LINC00094 was downregulated in the tumor tissues of smoking patients. Further statistical analysis indicated that dysregulation of AURKA, BIRC5, and LINC00094 indicated poor prognosis in lung SCC. Protein-coding genes AURKA, BIRC5, and LINC00094 could be biomarkers or therapeutic targets for smoking-related lung SCC. © 2017 The Authors. Thoracic Cancer published by China Lung Oncology Group and John Wiley & Sons Australia, Ltd.

  2. DNCON2: improved protein contact prediction using two-level deep convolutional neural networks.

    PubMed

    Adhikari, Badri; Hou, Jie; Cheng, Jianlin

    2018-05-01

    Significant improvements in the prediction of protein residue-residue contacts are observed in the recent years. These contacts, predicted using a variety of coevolution-based and machine learning methods, are the key contributors to the recent progress in ab initio protein structure prediction, as demonstrated in the recent CASP experiments. Continuing the development of new methods to reliably predict contact maps is essential to further improve ab initio structure prediction. In this paper we discuss DNCON2, an improved protein contact map predictor based on two-level deep convolutional neural networks. It consists of six convolutional neural networks-the first five predict contacts at 6, 7.5, 8, 8.5 and 10 Å distance thresholds, and the last one uses these five predictions as additional features to predict final contact maps. On the free-modeling datasets in CASP10, 11 and 12 experiments, DNCON2 achieves mean precisions of 35, 50 and 53.4%, respectively, higher than 30.6% by MetaPSICOV on CASP10 dataset, 34% by MetaPSICOV on CASP11 dataset and 46.3% by Raptor-X on CASP12 dataset, when top L/5 long-range contacts are evaluated. We attribute the improved performance of DNCON2 to the inclusion of short- and medium-range contacts into training, two-level approach to prediction, use of the state-of-the-art optimization and activation functions, and a novel deep learning architecture that allows each filter in a convolutional layer to access all the input features of a protein of arbitrary length. The web server of DNCON2 is at http://sysbio.rnet.missouri.edu/dncon2/ where training and testing datasets as well as the predictions for CASP10, 11 and 12 free-modeling datasets can also be downloaded. Its source code is available at https://github.com/multicom-toolbox/DNCON2/. chengji@missouri.edu. Supplementary data are available at Bioinformatics online.

  3. Combining Structural Modeling with Ensemble Machine Learning to Accurately Predict Protein Fold Stability and Binding Affinity Effects upon Mutation

    PubMed Central

    Garcia Lopez, Sebastian; Kim, Philip M.

    2014-01-01

    Advances in sequencing have led to a rapid accumulation of mutations, some of which are associated with diseases. However, to draw mechanistic conclusions, a biochemical understanding of these mutations is necessary. For coding mutations, accurate prediction of significant changes in either the stability of proteins or their affinity to their binding partners is required. Traditional methods have used semi-empirical force fields, while newer methods employ machine learning of sequence and structural features. Here, we show how combining both of these approaches leads to a marked boost in accuracy. We introduce ELASPIC, a novel ensemble machine learning approach that is able to predict stability effects upon mutation in both, domain cores and domain-domain interfaces. We combine semi-empirical energy terms, sequence conservation, and a wide variety of molecular details with a Stochastic Gradient Boosting of Decision Trees (SGB-DT) algorithm. The accuracy of our predictions surpasses existing methods by a considerable margin, achieving correlation coefficients of 0.77 for stability, and 0.75 for affinity predictions. Notably, we integrated homology modeling to enable proteome-wide prediction and show that accurate prediction on modeled structures is possible. Lastly, ELASPIC showed significant differences between various types of disease-associated mutations, as well as between disease and common neutral mutations. Unlike pure sequence-based prediction methods that try to predict phenotypic effects of mutations, our predictions unravel the molecular details governing the protein instability, and help us better understand the molecular causes of diseases. PMID:25243403

  4. Integration of multi-omics data of a genome-reduced bacterium: Prevalence of post-transcriptional regulation and its correlation with protein abundances

    PubMed Central

    Chen, Wei-Hua; van Noort, Vera; Lluch-Senar, Maria; Hennrich, Marco L.; H. Wodke, Judith A.; Yus, Eva; Alibés, Andreu; Roma, Guglielmo; Mende, Daniel R.; Pesavento, Christina; Typas, Athanasios; Gavin, Anne-Claude; Serrano, Luis; Bork, Peer

    2016-01-01

    We developed a comprehensive resource for the genome-reduced bacterium Mycoplasma pneumoniae comprising 1748 consistently generated ‘-omics’ data sets, and used it to quantify the power of antisense non-coding RNAs (ncRNAs), lysine acetylation, and protein phosphorylation in predicting protein abundance (11%, 24% and 8%, respectively). These factors taken together are four times more predictive of the proteome abundance than of mRNA abundance. In bacteria, post-translational modifications (PTMs) and ncRNA transcription were both found to increase with decreasing genomic GC-content and genome size. Thus, the evolutionary forces constraining genome size and GC-content modify the relative contributions of the different regulatory layers to proteome homeostasis, and impact more genomic and genetic features than previously appreciated. Indeed, these scaling principles will enable us to develop more informed approaches when engineering minimal synthetic genomes. PMID:26773059

  5. cncRNAs: Bi-functional RNAs with protein coding and non-coding functions

    PubMed Central

    Kumari, Pooja; Sampath, Karuna

    2015-01-01

    For many decades, the major function of mRNA was thought to be to provide protein-coding information embedded in the genome. The advent of high-throughput sequencing has led to the discovery of pervasive transcription of eukaryotic genomes and opened the world of RNA-mediated gene regulation. Many regulatory RNAs have been found to be incapable of protein coding and are hence termed as non-coding RNAs (ncRNAs). However, studies in recent years have shown that several previously annotated non-coding RNAs have the potential to encode proteins, and conversely, some coding RNAs have regulatory functions independent of the protein they encode. Such bi-functional RNAs, with both protein coding and non-coding functions, which we term as ‘cncRNAs’, have emerged as new players in cellular systems. Here, we describe the functions of some cncRNAs identified from bacteria to humans. Because the functions of many RNAs across genomes remains unclear, we propose that RNAs be classified as coding, non-coding or both only after careful analysis of their functions. PMID:26498036

  6. Functional insights from proteome-wide structural modeling of Treponema pallidum subspecies pallidum, the causative agent of syphilis.

    PubMed

    Houston, Simon; Lithgow, Karen Vivien; Osbak, Kara Krista; Kenyon, Chris Richard; Cameron, Caroline E

    2018-05-16

    Syphilis continues to be a major global health threat with 11 million new infections each year, and a global burden of 36 million cases. The causative agent of syphilis, Treponema pallidum subspecies pallidum, is a highly virulent bacterium, however the molecular mechanisms underlying T. pallidum pathogenesis remain to be definitively identified. This is due to the fact that T. pallidum is currently uncultivatable, inherently fragile and thus difficult to work with, and phylogenetically distinct with no conventional virulence factor homologs found in other pathogens. In fact, approximately 30% of its predicted protein-coding genes have no known orthologs or assigned functions. Here we employed a structural bioinformatics approach using Phyre2-based tertiary structure modeling to improve our understanding of T. pallidum protein function on a proteome-wide scale. Phyre2-based tertiary structure modeling generated high-confidence predictions for 80% of the T. pallidum proteome (780/978 predicted proteins). Tertiary structure modeling also inferred the same function as primary structure-based annotations from genome sequencing pipelines for 525/605 proteins (87%), which represents 54% (525/978) of all T. pallidum proteins. Of the 175 T. pallidum proteins modeled with high confidence that were not assigned functions in the previously annotated published proteome, 167 (95%) were able to be assigned predicted functions. Twenty-one of the 175 hypothetical proteins modeled with high confidence were also predicted to exhibit significant structural similarity with proteins experimentally confirmed to be required for virulence in other pathogens. Phyre2-based structural modeling is a powerful bioinformatics tool that has provided insight into the potential structure and function of the majority of T. pallidum proteins and helped validate the primary structure-based annotation of more than 50% of all T. pallidum proteins with high confidence. This work represents the first T. pallidum proteome-wide structural modeling study and is one of few studies to apply this approach for the functional annotation of a whole proteome.

  7. orthAgogue: an agile tool for the rapid prediction of orthology relations.

    PubMed

    Ekseth, Ole Kristian; Kuiper, Martin; Mironov, Vladimir

    2014-03-01

    The comparison of genes and gene products across species depends on high-quality tools to determine the relationships between gene or protein sequences from various species. Although some excellent applications are available and widely used, their performance leaves room for improvement. We developed orthAgogue: a multithreaded C application for high-speed estimation of homology relations in massive datasets, operated via a flexible and easy command-line interface. The orthAgogue software is distributed under the GNU license. The source code and binaries compiled for Linux are available at https://code.google.com/p/orthagogue/.

  8. An Amino Acid Packing Code for α-helical Structure and Protein Design

    PubMed Central

    Joo, Hyun; Chavan, Archana G.; Phan, Jamie; Day, Ryan; Tsai, Jerry

    2012-01-01

    This work demonstrates that all packing in α-helices can be simplified to repetitive patterns of a single motif: the knob-socket. Using the precision of Voronoi Polyhedra/Deluaney Tessellations to identify contacts, the knob-socket is a 4 residue tetrahedral motif: a knob residue on one α-helix packs into the 3 residue socket on another α-helix. The principle of the knob-socket model relates the packing between levels of protein structure: the intra-helical packing arrangements within secondary structure that permit inter-helix tertiary packing interactions. Within an α-helix, the 3 residue sockets arrange residues into a uniform packing lattice. Inter-helix packing results from a definable pattern of interdigitated knob-socket motifs between 2 α-helices. Furthermore, the knob-socket model classifies 3 types of sockets: 1) free: favoring only intra-helical packing, 2) filled: favoring inter-helical interactions and 3) non: disfavoring α-helical structure. The amino acid propensities in these 3 socket classes essentially represent an amino acid code for structure in α-helical packing. Using this code, a novel yet straightforward approach for the design of α-helical structure was used to validate the knob-socket model. Unique sequences for 3 peptides were created to produce a predicted amount of α-helical structure: mostly helical, some helical, and no-helix. These 3 peptides were synthesized and helical content assessed using CD spectroscopy. The measured α-helicity of each peptide was consistent with the expected predictions. These results and analysis demonstrate that the knob-socket motif functions as the basic unit of packing and presents an intuitive tool to decipher the rules governing packing in protein structure. PMID:22426125

  9. Deep developmental transcriptome sequencing uncovers numerous new genes and enhances gene annotation in the sponge Amphimedon queenslandica.

    PubMed

    Fernandez-Valverde, Selene L; Calcino, Andrew D; Degnan, Bernard M

    2015-05-15

    The demosponge Amphimedon queenslandica is amongst the few early-branching metazoans with an assembled and annotated draft genome, making it an important species in the study of the origin and early evolution of animals. Current gene models in this species are largely based on in silico predictions and low coverage expressed sequence tag (EST) evidence. Amphimedon queenslandica protein-coding gene models are improved using deep RNA-Seq data from four developmental stages and CEL-Seq data from 82 developmental samples. Over 86% of previously predicted genes are retained in the new gene models, although 24% have additional exons; there is also a marked increase in the total number of annotated 3' and 5' untranslated regions (UTRs). Importantly, these new developmental transcriptome data reveal numerous previously unannotated protein-coding genes in the Amphimedon genome, increasing the total gene number by 25%, from 30,060 to 40,122. In general, Amphimedon genes have introns that are markedly smaller than those in other animals and most of the alternatively spliced genes in Amphimedon undergo intron-retention; exon-skipping is the least common mode of alternative splicing. Finally, in addition to canonical polyadenylation signal sequences, Amphimedon genes are enriched in a number of unique AT-rich motifs in their 3' UTRs. The inclusion of developmental transcriptome data has substantially improved the structure and composition of protein-coding gene models in Amphimedon queenslandica, providing a more accurate and comprehensive set of genes for functional and comparative studies. These improvements reveal the Amphimedon genome is comprised of a remarkably high number of tightly packed genes. These genes have small introns and there is pervasive intron retention amongst alternatively spliced transcripts. These aspects of the sponge genome are more similar unicellular opisthokont genomes than to other animal genomes.

  10. In Silico Pattern-Based Analysis of the Human Cytomegalovirus Genome

    PubMed Central

    Rigoutsos, Isidore; Novotny, Jiri; Huynh, Tien; Chin-Bow, Stephen T.; Parida, Laxmi; Platt, Daniel; Coleman, David; Shenk, Thomas

    2003-01-01

    More than 200 open reading frames (ORFs) from the human cytomegalovirus genome have been reported as potentially coding for proteins. We have used two pattern-based in silico approaches to analyze this set of putative viral genes. With the help of an objective annotation method that is based on the Bio-Dictionary, a comprehensive collection of amino acid patterns that describes the currently known natural sequence space of proteins, we have reannotated all of the previously reported putative genes of the human cytomegalovirus. Also, with the help of MUSCA, a pattern-based multiple sequence alignment algorithm, we have reexamined the original human cytomegalovirus gene family definitions. Our analysis of the genome shows that many of the coded proteins comprise amino acid combinations that are unique to either the human cytomegalovirus or the larger group of herpesviruses. We have confirmed that a surprisingly large portion of the analyzed ORFs encode membrane proteins, and we have discovered a significant number of previously uncharacterized proteins that are predicted to be G-protein-coupled receptor homologues. The analysis also indicates that many of the encoded proteins undergo posttranslational modifications such as hydroxylation, phosphorylation, and glycosylation. ORFs encoding proteins with similar functional behavior appear in neighboring regions of the human cytomegalovirus genome. All of the results of the present study can be found and interactively explored online (http://cbcsrv.watson.ibm.com/virus/). PMID:12634390

  11. In silico pattern-based analysis of the human cytomegalovirus genome.

    PubMed

    Rigoutsos, Isidore; Novotny, Jiri; Huynh, Tien; Chin-Bow, Stephen T; Parida, Laxmi; Platt, Daniel; Coleman, David; Shenk, Thomas

    2003-04-01

    More than 200 open reading frames (ORFs) from the human cytomegalovirus genome have been reported as potentially coding for proteins. We have used two pattern-based in silico approaches to analyze this set of putative viral genes. With the help of an objective annotation method that is based on the Bio-Dictionary, a comprehensive collection of amino acid patterns that describes the currently known natural sequence space of proteins, we have reannotated all of the previously reported putative genes of the human cytomegalovirus. Also, with the help of MUSCA, a pattern-based multiple sequence alignment algorithm, we have reexamined the original human cytomegalovirus gene family definitions. Our analysis of the genome shows that many of the coded proteins comprise amino acid combinations that are unique to either the human cytomegalovirus or the larger group of herpesviruses. We have confirmed that a surprisingly large portion of the analyzed ORFs encode membrane proteins, and we have discovered a significant number of previously uncharacterized proteins that are predicted to be G-protein-coupled receptor homologues. The analysis also indicates that many of the encoded proteins undergo posttranslational modifications such as hydroxylation, phosphorylation, and glycosylation. ORFs encoding proteins with similar functional behavior appear in neighboring regions of the human cytomegalovirus genome. All of the results of the present study can be found and interactively explored online (http://cbcsrv.watson.ibm.com/virus/).

  12. Sequencing proteins with transverse ionic transport in nanochannels.

    PubMed

    Boynton, Paul; Di Ventra, Massimiliano

    2016-05-03

    De novo protein sequencing is essential for understanding cellular processes that govern the function of living organisms and all sequence modifications that occur after a protein has been constructed from its corresponding DNA code. By obtaining the order of the amino acids that compose a given protein one can then determine both its secondary and tertiary structures through structure prediction, which is used to create models for protein aggregation diseases such as Alzheimer's Disease. Here, we propose a new technique for de novo protein sequencing that involves translocating a polypeptide through a synthetic nanochannel and measuring the ionic current of each amino acid through an intersecting perpendicular nanochannel. We find that the distribution of ionic currents for each of the 20 proteinogenic amino acids encoded by eukaryotic genes is statistically distinct, showing this technique's potential for de novo protein sequencing.

  13. Identification of Bombyx mori bidensovirus VD1-ORF4 reveals a novel protein associated with viral structural component.

    PubMed

    Li, Guohui; Hu, Zhaoyang; Guo, Xuli; Li, Guangtian; Tang, Qi; Wang, Peng; Chen, Keping; Yao, Qin

    2013-06-01

    Bombyx mori bidensovirus (BmBDV) VD1-ORF4 (open reading frame 4, ORF4) consists of 3,318 nucleotides, which codes for a predicted 1,105-amino acid protein containing a conserved DNA polymerase motif. However, its functions in viral propagation remain unknown. In the current study, the transcription of VD1-ORF4 was examined from 6 to 96 h postinfection (p.i.) by RT-PCR, 5'-RACE revealed the transcription initiation site of BmBDV ORF4 to be -16 nucleotides upstream from the start codon, and 3'-RACE revealed the transcription termination site of VD1-ORF4 to be +7 nucleotides downstream from termination codon. Three different proteins were examined in the extracts of BmBDV-infected silkworms midguts by Western blot using raised antibodies against VD1-ORF4 deduced amino acid, and a specific protein band about 53 kDa was further detected in purified virions using the same antibodies. Taken together, BmBDV VD1-ORF4 codes for three or more proteins during the viral life cycle, one of which is a 53 kDa protein and confirmed to be a component of BmBDV virion.

  14. AMPA: an automated web server for prediction of protein antimicrobial regions.

    PubMed

    Torrent, Marc; Di Tommaso, Paolo; Pulido, David; Nogués, M Victòria; Notredame, Cedric; Boix, Ester; Andreu, David

    2012-01-01

    AMPA is a web application for assessing the antimicrobial domains of proteins, with a focus on the design on new antimicrobial drugs. The application provides fast discovery of antimicrobial patterns in proteins that can be used to develop new peptide-based drugs against pathogens. Results are shown in a user-friendly graphical interface and can be downloaded as raw data for later examination. AMPA is freely available on the web at http://tcoffee.crg.cat/apps/ampa. The source code is also available in the web. marc.torrent@upf.edu; david.andreu@upf.edu Supplementary data are available at Bioinformatics online.

  15. InterPred: A pipeline to identify and model protein-protein interactions.

    PubMed

    Mirabello, Claudio; Wallner, Björn

    2017-06-01

    Protein-protein interactions (PPI) are crucial for protein function. There exist many techniques to identify PPIs experimentally, but to determine the interactions in molecular detail is still difficult and very time-consuming. The fact that the number of PPIs is vastly larger than the number of individual proteins makes it practically impossible to characterize all interactions experimentally. Computational approaches that can bridge this gap and predict PPIs and model the interactions in molecular detail are greatly needed. Here we present InterPred, a fully automated pipeline that predicts and model PPIs from sequence using structural modeling combined with massive structural comparisons and molecular docking. A key component of the method is the use of a novel random forest classifier that integrate several structural features to distinguish correct from incorrect protein-protein interaction models. We show that InterPred represents a major improvement in protein-protein interaction detection with a performance comparable or better than experimental high-throughput techniques. We also show that our full-atom protein-protein complex modeling pipeline performs better than state of the art protein docking methods on a standard benchmark set. In addition, InterPred was also one of the top predictors in the latest CAPRI37 experiment. InterPred source code can be downloaded from http://wallnerlab.org/InterPred Proteins 2017; 85:1159-1170. © 2017 Wiley Periodicals, Inc. © 2017 Wiley Periodicals, Inc.

  16. De novo assembly and characterization of the Trichuris trichiura adult worm transcriptome using Ion Torrent sequencing.

    PubMed

    Santos, Leonardo N; Silva, Eduardo S; Santos, André S; De Sá, Pablo H; Ramos, Rommel T; Silva, Artur; Cooper, Philip J; Barreto, Maurício L; Loureiro, Sebastião; Pinheiro, Carina S; Alcantara-Neves, Neuza M; Pacheco, Luis G C

    2016-07-01

    Infection with helminthic parasites, including the soil-transmitted helminth Trichuris trichiura (human whipworm), has been shown to modulate host immune responses and, consequently, to have an impact on the development and manifestation of chronic human inflammatory diseases. De novo derivation of helminth proteomes from sequencing of transcriptomes will provide valuable data to aid identification of parasite proteins that could be evaluated as potential immunotherapeutic molecules in near future. Herein, we characterized the transcriptome of the adult stage of the human whipworm T. trichiura, using next-generation sequencing technology and a de novo assembly strategy. Nearly 17.6 million high-quality clean reads were assembled into 6414 contiguous sequences, with an N50 of 1606bp. In total, 5673 protein-encoding sequences were confidentially identified in the T. trichiura adult worm transcriptome; of these, 1013 sequences represent potential newly discovered proteins for the species, most of which presenting orthologs already annotated in the related species T. suis. A number of transcripts representing probable novel non-coding transcripts for the species T. trichiura were also identified. Among the most abundant transcripts, we found sequences that code for proteins involved in lipid transport, such as vitellogenins, and several chitin-binding proteins. Through a cross-species expression analysis of gene orthologs shared by T. trichiura and the closely related parasites T. suis and T. muris it was possible to find twenty-six protein-encoding genes that are consistently highly expressed in the adult stages of the three helminth species. Additionally, twenty transcripts could be identified that code for proteins previously detected by mass spectrometry analysis of protein fractions of the whipworm somatic extract that present immunomodulatory activities. Five of these transcripts were amongst the most highly expressed protein-encoding sequences in the T. trichiura adult worm. Besides, orthologs of proteins demonstrated to have potent immunomodulatory properties in related parasitic helminths were also predicted from the T. trichiura de novo assembled transcriptome. Copyright © 2016. Published by Elsevier B.V.

  17. Representation of DNA sequences in genetic codon context with applications in exon and intron prediction.

    PubMed

    Yin, Changchuan

    2015-04-01

    To apply digital signal processing (DSP) methods to analyze DNA sequences, the sequences first must be specially mapped into numerical sequences. Thus, effective numerical mappings of DNA sequences play key roles in the effectiveness of DSP-based methods such as exon prediction. Despite numerous mappings of symbolic DNA sequences to numerical series, the existing mapping methods do not include the genetic coding features of DNA sequences. We present a novel numerical representation of DNA sequences using genetic codon context (GCC) in which the numerical values are optimized by simulation annealing to maximize the 3-periodicity signal to noise ratio (SNR). The optimized GCC representation is then applied in exon and intron prediction by Short-Time Fourier Transform (STFT) approach. The results show the GCC method enhances the SNR values of exon sequences and thus increases the accuracy of predicting protein coding regions in genomes compared with the commonly used 4D binary representation. In addition, this study offers a novel way to reveal specific features of DNA sequences by optimizing numerical mappings of symbolic DNA sequences.

  18. A multi-objective optimization approach accurately resolves protein domain architectures

    PubMed Central

    Bernardes, J.S.; Vieira, F.R.J.; Zaverucha, G.; Carbone, A.

    2016-01-01

    Motivation: Given a protein sequence and a number of potential domains matching it, what are the domain content and the most likely domain architecture for the sequence? This problem is of fundamental importance in protein annotation, constituting one of the main steps of all predictive annotation strategies. On the other hand, when potential domains are several and in conflict because of overlapping domain boundaries, finding a solution for the problem might become difficult. An accurate prediction of the domain architecture of a multi-domain protein provides important information for function prediction, comparative genomics and molecular evolution. Results: We developed DAMA (Domain Annotation by a Multi-objective Approach), a novel approach that identifies architectures through a multi-objective optimization algorithm combining scores of domain matches, previously observed multi-domain co-occurrence and domain overlapping. DAMA has been validated on a known benchmark dataset based on CATH structural domain assignments and on the set of Plasmodium falciparum proteins. When compared with existing tools on both datasets, it outperforms all of them. Availability and implementation: DAMA software is implemented in C++ and the source code can be found at http://www.lcqb.upmc.fr/DAMA. Contact: juliana.silva_bernardes@upmc.fr or alessandra.carbone@lip6.fr Supplementary information: Supplementary data are available at Bioinformatics online. PMID:26458889

  19. Transcripts with in silico predicted RNA structure are enriched everywhere in the mouse brain

    PubMed Central

    2012-01-01

    Background Post-transcriptional control of gene expression is mostly conducted by specific elements in untranslated regions (UTRs) of mRNAs, in collaboration with specific binding proteins and RNAs. In several well characterized cases, these RNA elements are known to form stable secondary structures. RNA secondary structures also may have major functional implications for long noncoding RNAs (lncRNAs). Recent transcriptional data has indicated the importance of lncRNAs in brain development and function. However, no methodical efforts to investigate this have been undertaken. Here, we aim to systematically analyze the potential for RNA structure in brain-expressed transcripts. Results By comprehensive spatial expression analysis of the adult mouse in situ hybridization data of the Allen Mouse Brain Atlas, we show that transcripts (coding as well as non-coding) associated with in silico predicted structured probes are highly and significantly enriched in almost all analyzed brain regions. Functional implications of these RNA structures and their role in the brain are discussed in detail along with specific examples. We observe that mRNAs with a structure prediction in their UTRs are enriched for binding, transport and localization gene ontology categories. In addition, after manual examination we observe agreement between RNA binding protein interaction sites near the 3’ UTR structures and correlated expression patterns. Conclusions Our results show a potential use for RNA structures in expressed coding as well as noncoding transcripts in the adult mouse brain, and describe the role of structured RNAs in the context of intracellular signaling pathways and regulatory networks. Based on this data we hypothesize that RNA structure is widely involved in transcriptional and translational regulatory mechanisms in the brain and ultimately plays a role in brain function. PMID:22651826

  20. PANNZER2: a rapid functional annotation web server.

    PubMed

    Törönen, Petri; Medlar, Alan; Holm, Liisa

    2018-05-08

    The unprecedented growth of high-throughput sequencing has led to an ever-widening annotation gap in protein databases. While computational prediction methods are available to make up the shortfall, a majority of public web servers are hindered by practical limitations and poor performance. Here, we introduce PANNZER2 (Protein ANNotation with Z-scoRE), a fast functional annotation web server that provides both Gene Ontology (GO) annotations and free text description predictions. PANNZER2 uses SANSparallel to perform high-performance homology searches, making bulk annotation based on sequence similarity practical. PANNZER2 can output GO annotations from multiple scoring functions, enabling users to see which predictions are robust across predictors. Finally, PANNZER2 predictions scored within the top 10 methods for molecular function and biological process in the CAFA2 NK-full benchmark. The PANNZER2 web server is updated on a monthly schedule and is accessible at http://ekhidna2.biocenter.helsinki.fi/sanspanz/. The source code is available under the GNU Public Licence v3.

  1. Mean of the typical decoding rates: a new translation efficiency index based on the analysis of ribosome profiling data.

    PubMed

    Dana, Alexandra; Tuller, Tamir

    2014-12-01

    Gene translation modeling and prediction is a fundamental problem that has numerous biomedical implementations. In this work we present a novel, user-friendly tool/index for calculating the mean of the typical decoding rates that enables predicting translation elongation efficiency of protein coding genes for different tissue types, developmental stages, and experimental conditions. The suggested translation efficiency index is based on the analysis of the organism's ribosome profiling data. This index could be used for example to predict changes in translation elongation efficiency of lowly expressed genes that usually have relatively low and/or biased ribosomal densities and protein levels measurements, or can be used for example for predicting translation efficiency of new genetically engineered genes. We demonstrate the usability of this index via the analysis of six organisms in different tissues and developmental stages. Distributable cross platform application and guideline are available for download at: http://www.cs.tau.ac.il/~tamirtul/MTDR/MTDR_Install.html. Copyright © 2015 Dana and Tuller.

  2. Culture and Healthy Eating: The Role of Independence and Interdependence in the U.S. and Japan

    PubMed Central

    Levine, Cynthia S.; Miyamoto, Yuri; Markus, Hazel Rose; Rigotti, Attilio; Boylan, Jennifer Morozink; Park, Jiyoung; Kitayama, Shinobu; Karasawa, Mayumi; Kawakami, Norito; Coe, Christopher L.; Love, Gayle D.; Ryff, Carol D.

    2016-01-01

    Healthy eating is important for physical health. Using large probability samples of middle-aged adults in the U.S. and Japan, we show that fitting with the culturally normative way of being predicts healthy eating. In the U.S, a culture that prioritizes and emphasizes independence, being independent predicts eating a healthy diet (an index of fish, protein, fruit, vegetables, reverse-coded sugared beverages, and reverse-coded high fat meat consumption; Study 1) and not using food as a way to cope with stress (Study 2a). In Japan, a culture that prioritizes and emphasizes interdependence, being interdependent predicts eating a healthy diet (Studies 1 and 2b). Further, reflecting the types of agency that are prevalent in each context, these relationships are mediated by autonomy in the U.S. and positive relations with others in Japan. These findings highlight the importance of understanding cultural differences in shaping healthy behavior and have implications for designing health-promoting interventions. PMID:27516421

  3. Theory of prokaryotic genome evolution.

    PubMed

    Sela, Itamar; Wolf, Yuri I; Koonin, Eugene V

    2016-10-11

    Bacteria and archaea typically possess small genomes that are tightly packed with protein-coding genes. The compactness of prokaryotic genomes is commonly perceived as evidence of adaptive genome streamlining caused by strong purifying selection in large microbial populations. In such populations, even the small cost incurred by nonfunctional DNA because of extra energy and time expenditure is thought to be sufficient for this extra genetic material to be eliminated by selection. However, contrary to the predictions of this model, there exists a consistent, positive correlation between the strength of selection at the protein sequence level, measured as the ratio of nonsynonymous to synonymous substitution rates, and microbial genome size. Here, by fitting the genome size distributions in multiple groups of prokaryotes to predictions of mathematical models of population evolution, we show that only models in which acquisition of additional genes is, on average, slightly beneficial yield a good fit to genomic data. These results suggest that the number of genes in prokaryotic genomes reflects the equilibrium between the benefit of additional genes that diminishes as the genome grows and deletion bias (i.e., the rate of deletion of genetic material being slightly greater than the rate of acquisition). Thus, new genes acquired by microbial genomes, on average, appear to be adaptive. The tight spacing of protein-coding genes likely results from a combination of the deletion bias and purifying selection that efficiently eliminates nonfunctional, noncoding sequences.

  4. Identification of residue pairing in interacting β-strands from a predicted residue contact map.

    PubMed

    Mao, Wenzhi; Wang, Tong; Zhang, Wenxuan; Gong, Haipeng

    2018-04-19

    Despite the rapid progress of protein residue contact prediction, predicted residue contact maps frequently contain many errors. However, information of residue pairing in β strands could be extracted from a noisy contact map, due to the presence of characteristic contact patterns in β-β interactions. This information may benefit the tertiary structure prediction of mainly β proteins. In this work, we propose a novel ridge-detection-based β-β contact predictor to identify residue pairing in β strands from any predicted residue contact map. Our algorithm RDb 2 C adopts ridge detection, a well-developed technique in computer image processing, to capture consecutive residue contacts, and then utilizes a novel multi-stage random forest framework to integrate the ridge information and additional features for prediction. Starting from the predicted contact map of CCMpred, RDb 2 C remarkably outperforms all state-of-the-art methods on two conventional test sets of β proteins (BetaSheet916 and BetaSheet1452), and achieves F1-scores of ~ 62% and ~ 76% at the residue level and strand level, respectively. Taking the prediction of the more advanced RaptorX-Contact as input, RDb 2 C achieves impressively higher performance, with F1-scores reaching ~ 76% and ~ 86% at the residue level and strand level, respectively. In a test of structural modeling using the top 1 L predicted contacts as constraints, for 61 mainly β proteins, the average TM-score achieves 0.442 when using the raw RaptorX-Contact prediction, but increases to 0.506 when using the improved prediction by RDb 2 C. Our method can significantly improve the prediction of β-β contacts from any predicted residue contact maps. Prediction results of our algorithm could be directly applied to effectively facilitate the practical structure prediction of mainly β proteins. All source data and codes are available at http://166.111.152.91/Downloads.html or the GitHub address of https://github.com/wzmao/RDb2C .

  5. Draft Genome Sequence of Streptomyces vitaminophilus ATCC 31673, a Producer of Pyrrolomycin Antibiotics, Some of Which Contain a Nitro Group.

    PubMed

    Mahan, Kristina M; Klingeman, Dawn M; Hettich, Robert L; Parry, Ronald J; Graham, David E

    2016-01-21

    Streptomyces vitaminophilus produces pyrrolomycins, which are halogenated polyketide antibiotics. Some of the pyrrolomycins contain a rare nitro group located on the pyrrole ring. The 6.5-Mbp genome encodes 5,941 predicted protein-coding sequences in 39 contigs with a 71.9% G+C content. Copyright © 2016 Mahan et al.

  6. Permanent Improved High-Quality Draft Genome Sequence of Nocardia casuarinae Strain BMG51109, an Endophyte of Actinorhizal Root Nodules of Casuarina glauca

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ghodhbane-Gtari, Faten; Beauchemin, Nicholas; Louati, Moussa

    Here, we report the first genome sequence of a Nocardia plant endophyte, N. casuarinae strain BMG51109, isolated from Casuarina glauca root nodules. The improved high-quality draft genome sequence contains 8,787,999 bp with a 68.90% GC content and 7,307 predicted protein-coding genes.

  7. Draft Genome Sequence of Mycobacterium boenickei CIP 107829.

    PubMed

    Bouam, Amar; Robert, Catherine; Croce, Olivier; Levasseur, Anthony; Drancourt, Michel

    2017-05-04

    Mycobacterium boenickei is a rapidly growing mycobacterium isolated for the first time from a leg wound in the United States. Its 6,506,908-bp draft genome exhibits a 66.77% G+C content, 6,279 protein-coding genes, and 59 predicted RNA genes. In silico DNA-DNA hybridization confirms its assignment to the Mycobacterium fortuitum complex. Copyright © 2017 Bouam et al.

  8. Permanent Improved High-Quality Draft Genome Sequence of Nocardia casuarinae Strain BMG51109, an Endophyte of Actinorhizal Root Nodules of Casuarina glauca

    DOE PAGES

    Ghodhbane-Gtari, Faten; Beauchemin, Nicholas; Louati, Moussa; ...

    2016-08-04

    Here, we report the first genome sequence of a Nocardia plant endophyte, N. casuarinae strain BMG51109, isolated from Casuarina glauca root nodules. The improved high-quality draft genome sequence contains 8,787,999 bp with a 68.90% GC content and 7,307 predicted protein-coding genes.

  9. Capturing the Biofuel Wellhead and Powerhouse: The Chloroplast and Mitochondrial Genomes of the Leguminous Feedstock Tree Pongamia pinnata

    PubMed Central

    Kazakoff, Stephen H.; Imelfort, Michael; Edwards, David; Koehorst, Jasper; Biswas, Bandana; Batley, Jacqueline; Scott, Paul T.; Gresshoff, Peter M.

    2012-01-01

    Pongamia pinnata (syn. Millettia pinnata) is a novel, fast-growing arboreal legume that bears prolific quantities of oil-rich seeds suitable for the production of biodiesel and aviation biofuel. Here, we have used Illumina® ‘Second Generation DNA Sequencing (2GS)’ and a new short-read de novo assembler, SaSSY, to assemble and annotate the Pongamia chloroplast (152,968 bp; cpDNA) and mitochondrial (425,718 bp; mtDNA) genomes. We also show that SaSSY can be used to accurately assemble 2GS data, by re-assembling the Lotus japonicus cpDNA and in the process assemble its mtDNA (380,861 bp). The Pongamia cpDNA contains 77 unique protein-coding genes and is almost 60% gene-dense. It contains a 50 kb inversion common to other legumes, as well as a novel 6.5 kb inversion that is responsible for the non-disruptive, re-orientation of five protein-coding genes. Additionally, two copies of an inverted repeat firmly place the species outside the subclade of the Fabaceae lacking the inverted repeat. The Pongamia and L. japonicus mtDNA contain just 33 and 31 unique protein-coding genes, respectively, and like other angiosperm mtDNA, have expanded intergenic and multiple repeat regions. Through comparative analysis with Vigna radiata we measured the average synonymous and non-synonymous divergence of all three legume mitochondrial (1.59% and 2.40%, respectively) and chloroplast (8.37% and 8.99%, respectively) protein-coding genes. Finally, we explored the relatedness of Pongamia within the Fabaceae and showed the utility of the organellar genome sequences by mapping transcriptomic data to identify up- and down-regulated stress-responsive gene candidates and confirm in silico predicted RNA editing sites. PMID:23272141

  10. Capturing the biofuel wellhead and powerhouse: the chloroplast and mitochondrial genomes of the leguminous feedstock tree Pongamia pinnata.

    PubMed

    Kazakoff, Stephen H; Imelfort, Michael; Edwards, David; Koehorst, Jasper; Biswas, Bandana; Batley, Jacqueline; Scott, Paul T; Gresshoff, Peter M

    2012-01-01

    Pongamia pinnata (syn. Millettia pinnata) is a novel, fast-growing arboreal legume that bears prolific quantities of oil-rich seeds suitable for the production of biodiesel and aviation biofuel. Here, we have used Illumina® 'Second Generation DNA Sequencing (2GS)' and a new short-read de novo assembler, SaSSY, to assemble and annotate the Pongamia chloroplast (152,968 bp; cpDNA) and mitochondrial (425,718 bp; mtDNA) genomes. We also show that SaSSY can be used to accurately assemble 2GS data, by re-assembling the Lotus japonicus cpDNA and in the process assemble its mtDNA (380,861 bp). The Pongamia cpDNA contains 77 unique protein-coding genes and is almost 60% gene-dense. It contains a 50 kb inversion common to other legumes, as well as a novel 6.5 kb inversion that is responsible for the non-disruptive, re-orientation of five protein-coding genes. Additionally, two copies of an inverted repeat firmly place the species outside the subclade of the Fabaceae lacking the inverted repeat. The Pongamia and L. japonicus mtDNA contain just 33 and 31 unique protein-coding genes, respectively, and like other angiosperm mtDNA, have expanded intergenic and multiple repeat regions. Through comparative analysis with Vigna radiata we measured the average synonymous and non-synonymous divergence of all three legume mitochondrial (1.59% and 2.40%, respectively) and chloroplast (8.37% and 8.99%, respectively) protein-coding genes. Finally, we explored the relatedness of Pongamia within the Fabaceae and showed the utility of the organellar genome sequences by mapping transcriptomic data to identify up- and down-regulated stress-responsive gene candidates and confirm in silico predicted RNA editing sites.

  11. Prediction and analysis of protein solubility using a novel scoring card method with dipeptide composition

    PubMed Central

    2012-01-01

    Background Existing methods for predicting protein solubility on overexpression in Escherichia coli advance performance by using ensemble classifiers such as two-stage support vector machine (SVM) based classifiers and a number of feature types such as physicochemical properties, amino acid and dipeptide composition, accompanied with feature selection. It is desirable to develop a simple and easily interpretable method for predicting protein solubility, compared to existing complex SVM-based methods. Results This study proposes a novel scoring card method (SCM) by using dipeptide composition only to estimate solubility scores of sequences for predicting protein solubility. SCM calculates the propensities of 400 individual dipeptides to be soluble using statistic discrimination between soluble and insoluble proteins of a training data set. Consequently, the propensity scores of all dipeptides are further optimized using an intelligent genetic algorithm. The solubility score of a sequence is determined by the weighted sum of all propensity scores and dipeptide composition. To evaluate SCM by performance comparisons, four data sets with different sizes and variation degrees of experimental conditions were used. The results show that the simple method SCM with interpretable propensities of dipeptides has promising performance, compared with existing SVM-based ensemble methods with a number of feature types. Furthermore, the propensities of dipeptides and solubility scores of sequences can provide insights to protein solubility. For example, the analysis of dipeptide scores shows high propensity of α-helix structure and thermophilic proteins to be soluble. Conclusions The propensities of individual dipeptides to be soluble are varied for proteins under altered experimental conditions. For accurately predicting protein solubility using SCM, it is better to customize the score card of dipeptide propensities by using a training data set under the same specified experimental conditions. The proposed method SCM with solubility scores and dipeptide propensities can be easily applied to the protein function prediction problems that dipeptide composition features play an important role. Availability The used datasets, source codes of SCM, and supplementary files are available at http://iclab.life.nctu.edu.tw/SCM/. PMID:23282103

  12. A systems level predictive model for global gene regulation of methanogenesis in a hydrogenotrophic methanogen

    PubMed Central

    Yoon, Sung Ho; Turkarslan, Serdar; Reiss, David J.; Pan, Min; Burn, June A.; Costa, Kyle C.; Lie, Thomas J.; Slagel, Joseph; Moritz, Robert L.; Hackett, Murray; Leigh, John A.; Baliga, Nitin S.

    2013-01-01

    Methanogens catalyze the critical methane-producing step (called methanogenesis) in the anaerobic decomposition of organic matter. Here, we present the first predictive model of global gene regulation of methanogenesis in a hydrogenotrophic methanogen, Methanococcus maripaludis. We generated a comprehensive list of genes (protein-coding and noncoding) for M. maripaludis through integrated analysis of the transcriptome structure and a newly constructed Peptide Atlas. The environment and gene-regulatory influence network (EGRIN) model of the strain was constructed from a compendium of transcriptome data that was collected over 58 different steady-state and time-course experiments that were performed in chemostats or batch cultures under a spectrum of environmental perturbations that modulated methanogenesis. Analyses of the EGRIN model have revealed novel components of methanogenesis that included at least three additional protein-coding genes of previously unknown function as well as one noncoding RNA. We discovered that at least five regulatory mechanisms act in a combinatorial scheme to intercoordinate key steps of methanogenesis with different processes such as motility, ATP biosynthesis, and carbon assimilation. Through a combination of genetic and environmental perturbation experiments we have validated the EGRIN-predicted role of two novel transcription factors in the regulation of phosphate-dependent repression of formate dehydrogenase—a key enzyme in the methanogenesis pathway. The EGRIN model demonstrates regulatory affiliations within methanogenesis as well as between methanogenesis and other cellular functions. PMID:24089473

  13. Cloning and expression of a cDNA coding for a human monocyte-derived plasminogen activator inhibitor.

    PubMed

    Antalis, T M; Clark, M A; Barnes, T; Lehrbach, P R; Devine, P L; Schevzov, G; Goss, N H; Stephens, R W; Tolstoshev, P

    1988-02-01

    Human monocyte-derived plasminogen activator inhibitor (mPAI-2) was purified to homogeneity from the U937 cell line and partially sequenced. Oligonucleotide probes derived from this sequence were used to screen a cDNA library prepared from U937 cells. One positive clone was sequenced and contained most of the coding sequence as well as a long incomplete 3' untranslated region (1112 base pairs). This cDNA sequence was shown to encode mPAI-2 by hybrid-select translation. A cDNA clone encoding the remainder of the mPAI-2 mRNA was obtained by primer extension of U937 poly(A)+ RNA using a probe complementary to the mPAI-2 coding region. The coding sequence for mPAI-2 was placed under the control of the lambda PL promoter, and the protein expressed in Escherichia coli formed a complex with urokinase that could be detected immunologically. By nucleotide sequence analysis, mPAI-2 cDNA encodes a protein containing 415 amino acids with a predicted unglycosylated Mr of 46,543. The predicted amino acid sequence of mPAI-2 is very similar to placental PAI-2 (3 amino acid differences) and shows extensive homology with members of the serine protease inhibitor (serpin) superfamily. mPAI-2 was found to be more homologous to ovalbumin (37%) than the endothelial plasminogen activator inhibitor, PAI-1 (26%). Like ovalbumin, mPAI-2 appears to have no typical amino-terminal signal sequence. The 3' untranslated region of the mPAI-2 cDNA contains a putative regulatory sequence that has been associated with the inflammatory mediators.

  14. Cloning and expression of a cDNA coding for a human monocyte-derived plasminogen activator inhibitor.

    PubMed Central

    Antalis, T M; Clark, M A; Barnes, T; Lehrbach, P R; Devine, P L; Schevzov, G; Goss, N H; Stephens, R W; Tolstoshev, P

    1988-01-01

    Human monocyte-derived plasminogen activator inhibitor (mPAI-2) was purified to homogeneity from the U937 cell line and partially sequenced. Oligonucleotide probes derived from this sequence were used to screen a cDNA library prepared from U937 cells. One positive clone was sequenced and contained most of the coding sequence as well as a long incomplete 3' untranslated region (1112 base pairs). This cDNA sequence was shown to encode mPAI-2 by hybrid-select translation. A cDNA clone encoding the remainder of the mPAI-2 mRNA was obtained by primer extension of U937 poly(A)+ RNA using a probe complementary to the mPAI-2 coding region. The coding sequence for mPAI-2 was placed under the control of the lambda PL promoter, and the protein expressed in Escherichia coli formed a complex with urokinase that could be detected immunologically. By nucleotide sequence analysis, mPAI-2 cDNA encodes a protein containing 415 amino acids with a predicted unglycosylated Mr of 46,543. The predicted amino acid sequence of mPAI-2 is very similar to placental PAI-2 (3 amino acid differences) and shows extensive homology with members of the serine protease inhibitor (serpin) superfamily. mPAI-2 was found to be more homologous to ovalbumin (37%) than the endothelial plasminogen activator inhibitor, PAI-1 (26%). Like ovalbumin, mPAI-2 appears to have no typical amino-terminal signal sequence. The 3' untranslated region of the mPAI-2 cDNA contains a putative regulatory sequence that has been associated with the inflammatory mediators. Images PMID:3257578

  15. Tissue-specific Proteogenomic Analysis of Plutella xylostella Larval Midgut Using a Multialgorithm Pipeline*

    PubMed Central

    Zhu, Xun; Xie, Shangbo; Armengaud, Jean; Xie, Wen; Guo, Zhaojiang; Kang, Shi; Wu, Qingjun; Wang, Shaoli; Xia, Jixing; He, Rongjun; Zhang, Youjun

    2016-01-01

    The diamondback moth, Plutella xylostella (L.), is the major cosmopolitan pest of brassica and other cruciferous crops. Its larval midgut is a dynamic tissue that interfaces with a wide variety of toxicological and physiological processes. The draft sequence of the P. xylostella genome was recently released, but its annotation remains challenging because of the low sequence coverage of this branch of life and the poor description of exon/intron splicing rules for these insects. Peptide sequencing by computational assignment of tandem mass spectra to genome sequence information provides an experimental independent approach for confirming or refuting protein predictions, a concept that has been termed proteogenomics. In this study, we carried out an in-depth proteogenomic analysis to complement genome annotation of P. xylostella larval midgut based on shotgun HPLC-ESI-MS/MS data by means of a multialgorithm pipeline. A total of 876,341 tandem mass spectra were searched against the predicted P. xylostella protein sequences and a whole-genome six-frame translation database. Based on a data set comprising 2694 novel genome search specific peptides, we discovered 439 novel protein-coding genes and corrected 128 existing gene models. To get the most accurate data to seed further insect genome annotation, more than half of the novel protein-coding genes, i.e. 235 over 439, were further validated after RT-PCR amplification and sequencing of the corresponding transcripts. Furthermore, we validated 53 novel alternative splicings. Finally, a total of 6764 proteins were identified, resulting in one of the most comprehensive proteogenomic study of a nonmodel animal. As the first tissue-specific proteogenomics analysis of P. xylostella, this study provides the fundamental basis for high-throughput proteomics and functional genomics approaches aimed at deciphering the molecular mechanisms of resistance and controlling this pest. PMID:26902207

  16. Genomic evidence for genes encoding leucine-rich repeat receptors linked to resistance against the eukaryotic extra- and intracellular Brassica napus pathogens Leptosphaeria maculans and Plasmodiophora brassicae.

    PubMed

    Stotz, Henrik U; Harvey, Pascoe J; Haddadi, Parham; Mashanova, Alla; Kukol, Andreas; Larkan, Nicholas J; Borhan, M Hossein; Fitt, Bruce D L

    2018-01-01

    Genes coding for nucleotide-binding leucine-rich repeat (LRR) receptors (NLRs) control resistance against intracellular (cell-penetrating) pathogens. However, evidence for a role of genes coding for proteins with LRR domains in resistance against extracellular (apoplastic) fungal pathogens is limited. Here, the distribution of genes coding for proteins with eLRR domains but lacking kinase domains was determined for the Brassica napus genome. Predictions of signal peptide and transmembrane regions divided these genes into 184 coding for receptor-like proteins (RLPs) and 121 coding for secreted proteins (SPs). Together with previously annotated NLRs, a total of 720 LRR genes were found. Leptosphaeria maculans-induced expression during a compatible interaction with cultivar Topas differed between RLP, SP and NLR gene families; NLR genes were induced relatively late, during the necrotrophic phase of pathogen colonization. Seven RLP, one SP and two NLR genes were found in Rlm1 and Rlm3/Rlm4/Rlm7/Rlm9 loci for resistance against L. maculans on chromosome A07 of B. napus. One NLR gene at the Rlm9 locus was positively selected, as was the RLP gene on chromosome A10 with LepR3 and Rlm2 alleles conferring resistance against L. maculans races with corresponding effectors AvrLm1 and AvrLm2, respectively. Known loci for resistance against L. maculans (extracellular hemi-biotrophic fungus), Sclerotinia sclerotiorum (necrotrophic fungus) and Plasmodiophora brassicae (intracellular, obligate biotrophic protist) were examined for presence of RLPs, SPs and NLRs in these regions. Whereas loci for resistance against P. brassicae were enriched for NLRs, no such signature was observed for the other pathogens. These findings demonstrate involvement of (i) NLR genes in resistance against the intracellular pathogen P. brassicae and a putative NLR gene in Rlm9-mediated resistance against the extracellular pathogen L. maculans.

  17. nGASP - the nematode genome annotation assessment project

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Coghlan, A; Fiedler, T J; McKay, S J

    2008-12-19

    While the C. elegans genome is extensively annotated, relatively little information is available for other Caenorhabditis species. The nematode genome annotation assessment project (nGASP) was launched to objectively assess the accuracy of protein-coding gene prediction software in C. elegans, and to apply this knowledge to the annotation of the genomes of four additional Caenorhabditis species and other nematodes. Seventeen groups worldwide participated in nGASP, and submitted 47 prediction sets for 10 Mb of the C. elegans genome. Predictions were compared to reference gene sets consisting of confirmed or manually curated gene models from WormBase. The most accurate gene-finders were 'combiner'more » algorithms, which made use of transcript- and protein-alignments and multi-genome alignments, as well as gene predictions from other gene-finders. Gene-finders that used alignments of ESTs, mRNAs and proteins came in second place. There was a tie for third place between gene-finders that used multi-genome alignments and ab initio gene-finders. The median gene level sensitivity of combiners was 78% and their specificity was 42%, which is nearly the same accuracy as reported for combiners in the human genome. C. elegans genes with exons of unusual hexamer content, as well as those with many exons, short exons, long introns, a weak translation start signal, weak splice sites, or poorly conserved orthologs were the most challenging for gene-finders. While the C. elegans genome is extensively annotated, relatively little information is available for other Caenorhabditis species. The nematode genome annotation assessment project (nGASP) was launched to objectively assess the accuracy of protein-coding gene prediction software in C. elegans, and to apply this knowledge to the annotation of the genomes of four additional Caenorhabditis species and other nematodes. Seventeen groups worldwide participated in nGASP, and submitted 47 prediction sets for 10 Mb of the C. elegans genome. Predictions were compared to reference gene sets consisting of confirmed or manually curated gene models from WormBase. The most accurate gene-finders were 'combiner' algorithms, which made use of transcript- and protein-alignments and multi-genome alignments, as well as gene predictions from other gene-finders. Gene-finders that used alignments of ESTs, mRNAs and proteins came in second place. There was a tie for third place between gene-finders that used multi-genome alignments and ab initio gene-finders. The median gene level sensitivity of combiners was 78% and their specificity was 42%, which is nearly the same accuracy as reported for combiners in the human genome. C. elegans genes with exons of unusual hexamer content, as well as those with many exons, short exons, long introns, a weak translation start signal, weak splice sites, or poorly conserved orthologs were the most challenging for gene-finders.« less

  18. Predicting domain-domain interaction based on domain profiles with feature selection and support vector machines

    PubMed Central

    2010-01-01

    Background Protein-protein interaction (PPI) plays essential roles in cellular functions. The cost, time and other limitations associated with the current experimental methods have motivated the development of computational methods for predicting PPIs. As protein interactions generally occur via domains instead of the whole molecules, predicting domain-domain interaction (DDI) is an important step toward PPI prediction. Computational methods developed so far have utilized information from various sources at different levels, from primary sequences, to molecular structures, to evolutionary profiles. Results In this paper, we propose a computational method to predict DDI using support vector machines (SVMs), based on domains represented as interaction profile hidden Markov models (ipHMM) where interacting residues in domains are explicitly modeled according to the three dimensional structural information available at the Protein Data Bank (PDB). Features about the domains are extracted first as the Fisher scores derived from the ipHMM and then selected using singular value decomposition (SVD). Domain pairs are represented by concatenating their selected feature vectors, and classified by a support vector machine trained on these feature vectors. The method is tested by leave-one-out cross validation experiments with a set of interacting protein pairs adopted from the 3DID database. The prediction accuracy has shown significant improvement as compared to InterPreTS (Interaction Prediction through Tertiary Structure), an existing method for PPI prediction that also uses the sequences and complexes of known 3D structure. Conclusions We show that domain-domain interaction prediction can be significantly enhanced by exploiting information inherent in the domain profiles via feature selection based on Fisher scores, singular value decomposition and supervised learning based on support vector machines. Datasets and source code are freely available on the web at http://liao.cis.udel.edu/pub/svdsvm. Implemented in Matlab and supported on Linux and MS Windows. PMID:21034480

  19. Identification and correction of abnormal, incomplete and mispredicted proteins in public databases.

    PubMed

    Nagy, Alinda; Hegyi, Hédi; Farkas, Krisztina; Tordai, Hedvig; Kozma, Evelin; Bányai, László; Patthy, László

    2008-08-27

    Despite significant improvements in computational annotation of genomes, sequences of abnormal, incomplete or incorrectly predicted genes and proteins remain abundant in public databases. Since the majority of incomplete, abnormal or mispredicted entries are not annotated as such, these errors seriously affect the reliability of these databases. Here we describe the MisPred approach that may provide an efficient means for the quality control of databases. The current version of the MisPred approach uses five distinct routines for identifying abnormal, incomplete or mispredicted entries based on the principle that a sequence is likely to be incorrect if some of its features conflict with our current knowledge about protein-coding genes and proteins: (i) conflict between the predicted subcellular localization of proteins and the absence of the corresponding sequence signals; (ii) presence of extracellular and cytoplasmic domains and the absence of transmembrane segments; (iii) co-occurrence of extracellular and nuclear domains; (iv) violation of domain integrity; (v) chimeras encoded by two or more genes located on different chromosomes. Analyses of predicted EnsEMBL protein sequences of nine deuterostome (Homo sapiens, Mus musculus, Rattus norvegicus, Monodelphis domestica, Gallus gallus, Xenopus tropicalis, Fugu rubripes, Danio rerio and Ciona intestinalis) and two protostome species (Caenorhabditis elegans and Drosophila melanogaster) have revealed that the absence of expected signal peptides and violation of domain integrity account for the majority of mispredictions. Analyses of sequences predicted by NCBI's GNOMON annotation pipeline show that the rates of mispredictions are comparable to those of EnsEMBL. Interestingly, even the manually curated UniProtKB/Swiss-Prot dataset is contaminated with mispredicted or abnormal proteins, although to a much lesser extent than UniProtKB/TrEMBL or the EnsEMBL or GNOMON-predicted entries. MisPred works efficiently in identifying errors in predictions generated by the most reliable gene prediction tools such as the EnsEMBL and NCBI's GNOMON pipelines and also guides the correction of errors. We suggest that application of the MisPred approach will significantly improve the quality of gene predictions and the associated databases.

  20. DeepQA: improving the estimation of single protein model quality with deep belief networks.

    PubMed

    Cao, Renzhi; Bhattacharya, Debswapna; Hou, Jie; Cheng, Jianlin

    2016-12-05

    Protein quality assessment (QA) useful for ranking and selecting protein models has long been viewed as one of the major challenges for protein tertiary structure prediction. Especially, estimating the quality of a single protein model, which is important for selecting a few good models out of a large model pool consisting of mostly low-quality models, is still a largely unsolved problem. We introduce a novel single-model quality assessment method DeepQA based on deep belief network that utilizes a number of selected features describing the quality of a model from different perspectives, such as energy, physio-chemical characteristics, and structural information. The deep belief network is trained on several large datasets consisting of models from the Critical Assessment of Protein Structure Prediction (CASP) experiments, several publicly available datasets, and models generated by our in-house ab initio method. Our experiments demonstrate that deep belief network has better performance compared to Support Vector Machines and Neural Networks on the protein model quality assessment problem, and our method DeepQA achieves the state-of-the-art performance on CASP11 dataset. It also outperformed two well-established methods in selecting good outlier models from a large set of models of mostly low quality generated by ab initio modeling methods. DeepQA is a useful deep learning tool for protein single model quality assessment and protein structure prediction. The source code, executable, document and training/test datasets of DeepQA for Linux is freely available to non-commercial users at http://cactus.rnet.missouri.edu/DeepQA/ .

  1. The Caenorhabditis elegans gene unc-89, required fpr muscle M-line assembly, encodes a giant modular protein composed of Ig and signal transduction domains

    PubMed Central

    1996-01-01

    Mutations in the Caenorhabditis elegans gene unc-89 result in nematodes having disorganized muscle structure in which thick filaments are not organized into A-bands, and there are no M-lines. Beginning with a partial cDNA from the C. elegans sequencing project, we have cloned and sequenced the unc-89 gene. An unc-89 allele, st515, was found to contain an 84-bp deletion and a 10-bp duplication, resulting in an in- frame stop codon within predicted unc-89 coding sequence. Analysis of the complete coding sequence for unc-89 predicts a novel 6,632 amino acid polypeptide consisting of sequence motifs which have been implicated in protein-protein interactions. UNC-89 begins with 67 residues of unique sequences, SH3, dbl/CDC24, and PH domains, 7 immunoglobulins (Ig) domains, a putative KSP-containing multiphosphorylation domain, and ends with 46 Ig domains. A polyclonal antiserum raised to a portion of unc-89 encoded sequence reacts to a twitchin-sized polypeptide from wild type, but truncated polypeptides from st515 and from the amber allele e2338. By immunofluorescent microscopy, this antiserum localizes to the middle of A-bands, consistent with UNC-89 being a structural component of the M-line. Previous studies indicate that myofilament lattice assembly begins with positional cues laid down in the basement membrane and muscle cell membrane. We propose that the intracellular protein UNC-89 responds to these signals, localizes, and then participates in assembling an M-line. PMID:8603916

  2. Molecular dynamics and Monte Carlo simulations resolve apparent diffusion rate differences for proteins confined in nanochannels

    DOE PAGES

    Tringe, J. W.; Ileri, N.; Levie, H. W.; ...

    2015-08-01

    We use Molecular Dynamics and Monte Carlo simulations to examine molecular transport phenomena in nanochannels, explaining four orders of magnitude difference in wheat germ agglutinin (WGA) protein diffusion rates observed by fluorescence correlation spectroscopy (FCS) and by direct imaging of fluorescently-labeled proteins. We first use the ESPResSo Molecular Dynamics code to estimate the surface transport distance for neutral and charged proteins. We then employ a Monte Carlo model to calculate the paths of protein molecules on surfaces and in the bulk liquid transport medium. Our results show that the transport characteristics depend strongly on the degree of molecular surface coverage.more » Atomic force microscope characterization of surfaces exposed to WGA proteins for 1000 s show large protein aggregates consistent with the predicted coverage. These calculations and experiments provide useful insight into the details of molecular motion in confined geometries.« less

  3. Evidence for Natural Selection in Nucleotide Content Relationships Based on Complete Mitochondrial Genomes: Strong Effect of Guanine Content on Separation between Terrestrial and Aquatic Vertebrates.

    PubMed

    Sorimachi, Kenji; Okayasu, Teiji

    2015-01-01

    The complete vertebrate mitochondrial genome consists of 13 coding genes. We used this genome to investigate the existence of natural selection in vertebrate evolution. From the complete mitochondrial genomes, we predicted nucleotide contents and then separated these values into coding and non-coding regions. When nucleotide contents of a coding or non-coding region were plotted against the nucleotide content of the complete mitochondrial genomes, we obtained linear regression lines only between homonucleotides and their analogs. On every plot using G or A content purine, G content in aquatic vertebrates was higher than that in terrestrial vertebrates, while A content in aquatic vertebrates was lower than that in terrestrial vertebrates. Based on these relationships, vertebrates were separated into two groups, terrestrial and aquatic. However, using C or T content pyrimidine, clear separation between these two groups was not obtained. The hagfish (Eptatretus burgeri) was further separated from both terrestrial and aquatic vertebrates. Based on these results, nucleotide content relationships predicted from the complete vertebrate mitochondrial genomes reveal the existence of natural selection based on evolutionary separation between terrestrial and aquatic vertebrate groups. In addition, we propose that separation of the two groups might be linked to ammonia detoxification based on high G and low A contents, which encode Glu rich and Lys poor proteins.

  4. Genome re-annotation of the wild strawberry Fragaria vesca using extensive Illumina- and SMRT-based RNA-seq datasets

    PubMed Central

    Li, Yongping; Wei, Wei; Feng, Jia; Luo, Huifeng; Pi, Mengting; Liu, Zhongchi; Kang, Chunying

    2018-01-01

    Abstract The genome of the wild diploid strawberry species Fragaria vesca, an ideal model system of cultivated strawberry (Fragaria × ananassa, octoploid) and other Rosaceae family crops, was first published in 2011 and followed by a new assembly (Fvb). However, the annotation for Fvb mainly relied on ab initio predictions and included only predicted coding sequences, therefore an improved annotation is highly desirable. Here, a new annotation version named v2.0.a2 was created for the Fvb genome by a pipeline utilizing one PacBio library, 90 Illumina RNA-seq libraries, and 9 small RNA-seq libraries. Altogether, 18,641 genes (55.6% out of 33,538 genes) were augmented with information on the 5′ and/or 3′ UTRs, 13,168 (39.3%) protein-coding genes were modified or newly identified, and 7,370 genes were found to possess alternative isoforms. In addition, 1,938 long non-coding RNAs, 171 miRNAs, and 51,714 small RNA clusters were integrated into the annotation. This new annotation of F. vesca is substantially improved in both accuracy and integrity of gene predictions, beneficial to the gene functional studies in strawberry and to the comparative genomic analysis of other horticultural crops in Rosaceae family. PMID:29036429

  5. Vector Adaptive/Predictive Encoding Of Speech

    NASA Technical Reports Server (NTRS)

    Chen, Juin-Hwey; Gersho, Allen

    1989-01-01

    Vector adaptive/predictive technique for digital encoding of speech signals yields decoded speech of very good quality after transmission at coding rate of 9.6 kb/s and of reasonably good quality at 4.8 kb/s. Requires 3 to 4 million multiplications and additions per second. Combines advantages of adaptive/predictive coding, and code-excited linear prediction, yielding speech of high quality but requires 600 million multiplications and additions per second at encoding rate of 4.8 kb/s. Vector adaptive/predictive coding technique bridges gaps in performance and complexity between adaptive/predictive coding and code-excited linear prediction.

  6. Global response of Acidithiobacillus ferrooxidans ATCC 53993 to high concentrations of copper: A quantitative proteomics approach.

    PubMed

    Martínez-Bussenius, Cristóbal; Navarro, Claudio A; Orellana, Luis; Paradela, Alberto; Jerez, Carlos A

    2016-08-11

    Acidithiobacillus ferrooxidans is used in industrial bioleaching of minerals to extract valuable metals. A. ferrooxidans strain ATCC 53993 is much more resistant to copper than other strains of this microorganism and it has been proposed that genes present in an exclusive genomic island (GI) of this strain would contribute to its extreme copper tolerance. ICPL (isotope-coded protein labeling) quantitative proteomics was used to study in detail the response of this bacterium to copper. A high overexpression of RND efflux systems and CusF copper chaperones, both present in the genome and the GI of strain ATCC 53993 was found. Also, changes in the levels of the respiratory system proteins such as AcoP and Rus copper binding proteins and several proteins with other predicted functions suggest that numerous metabolic changes are apparently involved in controlling the effects of the toxic metal on this acidophile. Using quantitative proteomics we overview the adaptation mechanisms that biomining acidophiles use to stand their harsh environment. The overexpression of several genes present in an exclusive genomic island strongly suggests the importance of the proteins coded in this DNA region in the high tolerance of A. ferrooxidans ATCC 53993 to metals. Copyright © 2016 Elsevier B.V. All rights reserved.

  7. GPU.proton.DOCK: Genuine Protein Ultrafast proton equilibria consistent DOCKing.

    PubMed

    Kantardjiev, Alexander A

    2011-07-01

    GPU.proton.DOCK (Genuine Protein Ultrafast proton equilibria consistent DOCKing) is a state of the art service for in silico prediction of protein-protein interactions via rigorous and ultrafast docking code. It is unique in providing stringent account of electrostatic interactions self-consistency and proton equilibria mutual effects of docking partners. GPU.proton.DOCK is the first server offering such a crucial supplement to protein docking algorithms--a step toward more reliable and high accuracy docking results. The code (especially the Fast Fourier Transform bottleneck and electrostatic fields computation) is parallelized to run on a GPU supercomputer. The high performance will be of use for large-scale structural bioinformatics and systems biology projects, thus bridging physics of the interactions with analysis of molecular networks. We propose workflows for exploring in silico charge mutagenesis effects. Special emphasis is given to the interface-intuitive and user-friendly. The input is comprised of the atomic coordinate files in PDB format. The advanced user is provided with a special input section for addition of non-polypeptide charges, extra ionogenic groups with intrinsic pK(a) values or fixed ions. The output is comprised of docked complexes in PDB format as well as interactive visualization in a molecular viewer. GPU.proton.DOCK server can be accessed at http://gpudock.orgchm.bas.bg/.

  8. Td4IN2: A drought-responsive durum wheat (Triticum durum Desf.) gene coding for a resistance like protein with serine/threonine protein kinase, nucleotide binding site and leucine rich domains.

    PubMed

    Rampino, Patrizia; De Pascali, Mariarosaria; De Caroli, Monica; Luvisi, Andrea; De Bellis, Luigi; Piro, Gabriella; Perrotta, Carla

    2017-11-01

    Wheat, the main food source for a third of world population, appears strongly under threat because of predicted increasing temperatures coupled to drought. Plant complex molecular response to drought stress relies on the gene network controlling cell reactions to abiotic stress. In the natural environment, plants are subjected to the combination of abiotic and biotic stresses. Also the response of plants to biotic stress, to cope with pathogens, involves the activation of a molecular network. Investigations on combination of abiotic and biotic stresses indicate the existence of cross-talk between the two networks and a kind of overlapping can be hypothesized. In this work we describe the isolation and characterization of a drought-related durum wheat (Triticum durum Desf.) gene, identified in a previous study, coding for a protein combining features of NBS-LRR type resistance protein with a S/TPK domain, involved in drought stress response. This is one of the few examples reported where all three domains are present in a single protein and, to our knowledge, it is the first report on a gene specifically induced by drought stress and drought-related conditions, with this particular structure. Copyright © 2017 Elsevier Masson SAS. All rights reserved.

  9. Predicting the tolerated sequences for proteins and protein interfaces using RosettaBackrub flexible backbone design.

    PubMed

    Smith, Colin A; Kortemme, Tanja

    2011-01-01

    Predicting the set of sequences that are tolerated by a protein or protein interface, while maintaining a desired function, is useful for characterizing protein interaction specificity and for computationally designing sequence libraries to engineer proteins with new functions. Here we provide a general method, a detailed set of protocols, and several benchmarks and analyses for estimating tolerated sequences using flexible backbone protein design implemented in the Rosetta molecular modeling software suite. The input to the method is at least one experimentally determined three-dimensional protein structure or high-quality model. The starting structure(s) are expanded or refined into a conformational ensemble using Monte Carlo simulations consisting of backrub backbone and side chain moves in Rosetta. The method then uses a combination of simulated annealing and genetic algorithm optimization methods to enrich for low-energy sequences for the individual members of the ensemble. To emphasize certain functional requirements (e.g. forming a binding interface), interactions between and within parts of the structure (e.g. domains) can be reweighted in the scoring function. Results from each backbone structure are merged together to create a single estimate for the tolerated sequence space. We provide an extensive description of the protocol and its parameters, all source code, example analysis scripts and three tests applying this method to finding sequences predicted to stabilize proteins or protein interfaces. The generality of this method makes many other applications possible, for example stabilizing interactions with small molecules, DNA, or RNA. Through the use of within-domain reweighting and/or multistate design, it may also be possible to use this method to find sequences that stabilize particular protein conformations or binding interactions over others.

  10. Approaches to Fungal Genome Annotation

    PubMed Central

    Haas, Brian J.; Zeng, Qiandong; Pearson, Matthew D.; Cuomo, Christina A.; Wortman, Jennifer R.

    2011-01-01

    Fungal genome annotation is the starting point for analysis of genome content. This generally involves the application of diverse methods to identify features on a genome assembly such as protein-coding and non-coding genes, repeats and transposable elements, and pseudogenes. Here we describe tools and methods leveraged for eukaryotic genome annotation with a focus on the annotation of fungal nuclear and mitochondrial genomes. We highlight the application of the latest technologies and tools to improve the quality of predicted gene sets. The Broad Institute eukaryotic genome annotation pipeline is described as one example of how such methods and tools are integrated into a sequencing center’s production genome annotation environment. PMID:22059117

  11. Prediction of hot spots in protein interfaces using a random forest model with hybrid features.

    PubMed

    Wang, Lin; Liu, Zhi-Ping; Zhang, Xiang-Sun; Chen, Luonan

    2012-03-01

    Prediction of hot spots in protein interfaces provides crucial information for the research on protein-protein interaction and drug design. Existing machine learning methods generally judge whether a given residue is likely to be a hot spot by extracting features only from the target residue. However, hot spots usually form a small cluster of residues which are tightly packed together at the center of protein interface. With this in mind, we present a novel method to extract hybrid features which incorporate a wide range of information of the target residue and its spatially neighboring residues, i.e. the nearest contact residue in the other face (mirror-contact residue) and the nearest contact residue in the same face (intra-contact residue). We provide a novel random forest (RF) model to effectively integrate these hybrid features for predicting hot spots in protein interfaces. Our method can achieve accuracy (ACC) of 82.4% and Matthew's correlation coefficient (MCC) of 0.482 in Alanine Scanning Energetics Database, and ACC of 77.6% and MCC of 0.429 in Binding Interface Database. In a comparison study, performance of our RF model exceeds other existing methods, such as Robetta, FOLDEF, KFC, KFC2, MINERVA and HotPoint. Of our hybrid features, three physicochemical features of target residues (mass, polarizability and isoelectric point), the relative side-chain accessible surface area and the average depth index of mirror-contact residues are found to be the main discriminative features in hot spots prediction. We also confirm that hot spots tend to form large contact surface areas between two interacting proteins. Source data and code are available at: http://www.aporc.org/doc/wiki/HotSpot.

  12. Comparison of the Genome Sequence of the Poultry Pathogen Bordetella avium with Those of B. bronchiseptica, B. pertussis, and B. parapertussis Reveals Extensive Diversity in Surface Structures Associated with Host Interaction

    PubMed Central

    Sebaihia, Mohammed; Preston, Andrew; Maskell, Duncan J.; Kuzmiak, Holly; Connell, Terry D.; King, Natalie D.; Orndorff, Paul E.; Miyamoto, David M.; Thomson, Nicholas R.; Harris, David; Goble, Arlette; Lord, Angela; Murphy, Lee; Quail, Michael A.; Rutter, Simon; Squares, Robert; Squares, Steven; Woodward, John; Parkhill, Julian; Temple, Louise M.

    2006-01-01

    Bordetella avium is a pathogen of poultry and is phylogenetically distinct from Bordetella bronchiseptica, Bordetella pertussis, and Bordetella parapertussis, which are other species in the Bordetella genus that infect mammals. In order to understand the evolutionary relatedness of Bordetella species and further the understanding of pathogenesis, we obtained the complete genome sequence of B. avium strain 197N, a pathogenic strain that has been extensively studied. With 3,732,255 base pairs of DNA and 3,417 predicted coding sequences, it has the smallest genome and gene complement of the sequenced bordetellae. In this study, the presence or absence of previously reported virulence factors from B. avium was confirmed, and the genetic bases for growth characteristics were elucidated. Over 1,100 genes present in B. avium but not in B. bronchiseptica were identified, and most were predicted to encode surface or secreted proteins that are likely to define an organism adapted to the avian rather than the mammalian respiratory tracts. These include genes coding for the synthesis of a polysaccharide capsule, hemagglutinins, a type I secretion system adjacent to two very large genes for secreted proteins, and unique genes for both lipopolysaccharide and fimbrial biogenesis. Three apparently complete prophages are also present. The BvgAS virulence regulatory system appears to have polymorphisms at a poly(C) tract that is involved in phase variation in other bordetellae. A number of putative iron-regulated outer membrane proteins were predicted from the sequence, and this regulation was confirmed experimentally for five of these. PMID:16885469

  13. Online interactive analysis of protein structure ensembles with Bio3D-web.

    PubMed

    Skjærven, Lars; Jariwala, Shashank; Yao, Xin-Qiu; Grant, Barry J

    2016-11-15

    Bio3D-web is an online application for analyzing the sequence, structure and conformational heterogeneity of protein families. Major functionality is provided for identifying protein structure sets for analysis, their alignment and refined structure superposition, sequence and structure conservation analysis, mapping and clustering of conformations and the quantitative comparison of their predicted structural dynamics. Bio3D-web is based on the Bio3D and Shiny R packages. All major browsers are supported and full source code is available under a GPL2 license from http://thegrantlab.org/bio3d-web CONTACT: bjgrant@umich.edu or lars.skjarven@uib.no. © The Author 2016. Published by Oxford University Press.

  14. Complete genome sequence of Cupriavidus basilensis 4G11, isolated from the Oak Ridge Field Research Center site

    DOE PAGES

    Ray, Jayashree; Waters, R. Jordan; Skerker, Jeffrey M.; ...

    2015-05-14

    Cupriavidus basilensis 4G11 was isolated from groundwater at the Oak Ridge Field Research Center (FRC) site. Here, we report the complete genome sequence and annotation of Cupriavidus basilensis 4G11. The genome contains 8,421,483 bp, 7,661 predicted protein-coding genes, and a total GC content of 64.4%.

  15. Comparative genome analysis reveals genetic adaptation to versatile environmental conditions and importance of biofilm lifestyle in Comamonas testosteroni.

    PubMed

    Wu, Yichao; Arumugam, Krithika; Tay, Martin Qi Xiang; Seshan, Hari; Mohanty, Anee; Cao, Bin

    2015-04-01

    Comamonas testosteroni is an important environmental bacterium capable of degrading a variety of toxic aromatic pollutants and has been demonstrated to be a promising biocatalyst for environmental decontamination. This organism is often found to be among the primary surface colonizers in various natural and engineered ecosystems, suggesting an extraordinary capability of this organism in environmental adaptation and biofilm formation. The goal of this study was to gain genetic insights into the adaption of C. testosteroni to versatile environments and the importance of a biofilm lifestyle. Specifically, a draft genome of C. testosteroni I2 was obtained. The draft genome is 5,778,710 bp in length and comprises 110 contigs. The average G+C content was 61.88 %. A total of 5365 genes with 5263 protein-coding genes were predicted, whereas 4324 (80.60 % of total genes) protein-encoding genes were associated with predicted functions. The catabolic genes responsible for biodegradation of steroid and other aromatic compounds on draft genome were identified. Plasmid pI2 was found to encode a complete pathway for aniline degradation and a partial catabolic pathway for chloroaniline. This organism was found to be equipped with a sophisticated signaling system which helps it find ideal niches and switch between planktonic and biofilm lifestyles. A large number of putative multi-drug-resistant genes coding for abundant outer membrane transporters, chaperones, and heat shock proteins for the protection of cellular function were identified in the genome of strain I2. In addition, the genome of strain I2 was predicted to encode several proteins involved in producing, secreting, and uptaking siderophores under iron-limiting conditions. The genome of strain I2 contains a number of genes responsible for the synthesis and secretion of exopolysaccharides, an extracellular component essential for biofilm formation. Overall, our results reveal the genomic features underlying the adaption of C. testosteroni to versatile environments and highlighting the importance of its biofilm lifestyle.

  16. A Network Based Method for Analysis of lncRNA-Disease Associations and Prediction of lncRNAs Implicated in Diseases

    PubMed Central

    Yang, Xiaofei; Gao, Lin; Guo, Xingli; Shi, Xinghua; Wu, Hao; Song, Fei; Wang, Bingbo

    2014-01-01

    Increasing evidence has indicated that long non-coding RNAs (lncRNAs) are implicated in and associated with many complex human diseases. Despite of the accumulation of lncRNA-disease associations, only a few studies had studied the roles of these associations in pathogenesis. In this paper, we investigated lncRNA-disease associations from a network view to understand the contribution of these lncRNAs to complex diseases. Specifically, we studied both the properties of the diseases in which the lncRNAs were implicated, and that of the lncRNAs associated with complex diseases. Regarding the fact that protein coding genes and lncRNAs are involved in human diseases, we constructed a coding-non-coding gene-disease bipartite network based on known associations between diseases and disease-causing genes. We then applied a propagation algorithm to uncover the hidden lncRNA-disease associations in this network. The algorithm was evaluated by leave-one-out cross validation on 103 diseases in which at least two genes were known to be involved, and achieved an AUC of 0.7881. Our algorithm successfully predicted 768 potential lncRNA-disease associations between 66 lncRNAs and 193 diseases. Furthermore, our results for Alzheimer's disease, pancreatic cancer, and gastric cancer were verified by other independent studies. PMID:24498199

  17. The prediction of human exons by oligonucleotide composition and discriminant analysis of spliceable open reading frames

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Solovyev, V.V.; Salamov, A.A.; Lawrence, C.B.

    1994-12-31

    Discriminant analysis is applied to the problem of recognition 5`-, internal and 3`-exons in human DNA sequences. Specific recognition functions were developed for revealing exons of particular types. The method based on a splice site prediction algorithm that uses the linear Fisher discriminant to combine the information about significant triplet frequencies of various functional parts of splice site regions and preferences of oligonucleotide in protein coding and nation regions. The accuracy of our splice site recognition function is about 97%. A discriminant function for 5`-exon prediction includes hexanucleotide composition of upstream region, triplet composition around the ATG codon, ORF codingmore » potential, donor splice site potential and composition of downstream introit region. For internal exon prediction, we combine in a discriminant function the characteristics describing the 5`- intron region, donor splice site, coding region, acceptor splice site and Y-intron region for each open reading frame flanked by GT and AG base pairs. The accuracy of precise internal exon recognition on a test set of 451 exon and 246693 pseudoexon sequences is 77% with a specificity of 79% and a level of pseudoexon ORF prediction of 99.96%. The recognition quality computed at the level of individual nucleotides is 89%, for exon sequences and 98% for intron sequences. A discriminant function for 3`-exon prediction includes octanucleolide composition of upstream nation region, triplet composition around the stop codon, ORF coding potential, acceptor splice site potential and hexanucleotide composition of downstream region. We unite these three discriminant functions in exon predicting program FEX (find exons). FEX exactly predicts 70% of 1016 exons from the test of 181 complete genes with specificity 73%, and 89% exons are exactly or partially predicted. On the average, 85% of nucleotides were predicted accurately with specificity 91%.« less

  18. SPOCS: software for predicting and visualizing orthology/paralogy relationships among genomes.

    PubMed

    Curtis, Darren S; Phillips, Aaron R; Callister, Stephen J; Conlan, Sean; McCue, Lee Ann

    2013-10-15

    At the rate that prokaryotic genomes can now be generated, comparative genomics studies require a flexible method for quickly and accurately predicting orthologs among the rapidly changing set of genomes available. SPOCS implements a graph-based ortholog prediction method to generate a simple tab-delimited table of orthologs and in addition, html files that provide a visualization of the predicted ortholog/paralog relationships to which gene/protein expression metadata may be overlaid. A SPOCS web application is freely available at http://cbb.pnnl.gov/portal/tools/spocs.html. Source code for Linux systems is also freely available under an open source license at http://cbb.pnnl.gov/portal/software/spocs.html; the Boost C++ libraries and BLAST are required.

  19. FrameD: A flexible program for quality check and gene prediction in prokaryotic genomes and noisy matured eukaryotic sequences.

    PubMed

    Schiex, Thomas; Gouzy, Jérôme; Moisan, Annick; de Oliveira, Yannick

    2003-07-01

    We describe FrameD, a program that predicts coding regions in prokaryotic and matured eukaryotic sequences. Initially targeted at gene prediction in bacterial GC rich genomes, the gene model used in FrameD also allows to predict genes in the presence of frameshifts and partially undetermined sequences which makes it also very suitable for gene prediction and frameshift correction in unfinished sequences such as EST and EST cluster sequences. Like recent eukaryotic gene prediction programs, FrameD also includes the ability to take into account protein similarity information both in its prediction and its graphical output. Its performances are evaluated on different bacterial genomes. The web site (http://genopole.toulouse.inra.fr/bioinfo/FrameD/FD) allows direct prediction, sequence correction and translation and the ability to learn new models for new organisms.

  20. From sequence to enzyme mechanism using multi-label machine learning.

    PubMed

    De Ferrari, Luna; Mitchell, John B O

    2014-05-19

    In this work we predict enzyme function at the level of chemical mechanism, providing a finer granularity of annotation than traditional Enzyme Commission (EC) classes. Hence we can predict not only whether a putative enzyme in a newly sequenced organism has the potential to perform a certain reaction, but how the reaction is performed, using which cofactors and with susceptibility to which drugs or inhibitors, details with important consequences for drug and enzyme design. Work that predicts enzyme catalytic activity based on 3D protein structure features limits the prediction of mechanism to proteins already having either a solved structure or a close relative suitable for homology modelling. In this study, we evaluate whether sequence identity, InterPro or Catalytic Site Atlas sequence signatures provide enough information for bulk prediction of enzyme mechanism. By splitting MACiE (Mechanism, Annotation and Classification in Enzymes database) mechanism labels to a finer granularity, which includes the role of the protein chain in the overall enzyme complex, the method can predict at 96% accuracy (and 96% micro-averaged precision, 99.9% macro-averaged recall) the MACiE mechanism definitions of 248 proteins available in the MACiE, EzCatDb (Database of Enzyme Catalytic Mechanisms) and SFLD (Structure Function Linkage Database) databases using an off-the-shelf K-Nearest Neighbours multi-label algorithm. We find that InterPro signatures are critical for accurate prediction of enzyme mechanism. We also find that incorporating Catalytic Site Atlas attributes does not seem to provide additional accuracy. The software code (ml2db), data and results are available online at http://sourceforge.net/projects/ml2db/ and as supplementary files.

  1. A global analysis of protein expression profiles in Sinorhizobium meliloti: discovery of new genes for nodule occupancy and stress adaptation.

    PubMed

    Djordjevic, Michael A; Chen, Han Cai; Natera, Siria; Van Noorden, Giel; Menzel, Christian; Taylor, Scott; Renard, Clotilde; Geiger, Otto; Weiller, Georg F

    2003-06-01

    A proteomic examination of Sinorhizobium meliloti strain 1021 was undertaken using a combination of 2-D gel electrophoresis, peptide mass fingerprinting, and bioinformatics. Our goal was to identify (i) putative symbiosis- or nutrient-stress-specific proteins, (ii) the biochemical pathways active under different conditions, (iii) potential new genes, and (iv) the extent of posttranslational modifications of S. meliloti proteins. In total, we identified the protein products of 810 genes (13.1% of the genome's coding capacity). The 810 genes generated 1,180 gene products, with chromosomal genes accounting for 78% of the gene products identified (18.8% of the chromosome's coding capacity). The activity of 53 metabolic pathways was inferred from bioinformatic analysis of proteins with assigned Enzyme Commission numbers. Of the remaining proteins that did not encode enzymes, ABC-type transporters composed 12.7% and regulatory proteins 3.4% of the total. Proteins with up to seven transmembrane domains were identified in membrane preparations. A total of 27 putative nodule-specific proteins and 35 nutrient-stress-specific proteins were identified and used as a basis to define genes and describe processes occurring in S. meliloti cells in nodules and under stress. Several nodule proteins from the plant host were present in the nodule bacteria preparations. We also identified seven potentially novel proteins not predicted from the DNA sequence. Post-translational modifications such as N-terminal processing could be inferred from the data. The posttranslational addition of UMP to the key regulator of nitrogen metabolism, PII, was demonstrated. This work demonstrates the utility of combining mass spectrometry with protein arraying or separation techniques to identify candidate genes involved in important biological processes and niche occupations that may be intransigent to other methods of gene expression profiling.

  2. Molecular cloning and functional characterization of an antifungal PR-5 protein from Ocimum basilicum.

    PubMed

    Rather, Irshad Ahmad; Awasthi, Praveen; Mahajan, Vidushi; Bedi, Yashbir S; Vishwakarma, Ram A; Gandhi, Sumit G

    2015-03-01

    Pathogenesis-related (PR) proteins are involved in biotic and abiotic stress responses of plants and are grouped into 17 families (PR-1 to PR-17). PR-5 family includes proteins related to thaumatin and osmotin, with several members possessing antimicrobial properties. In this study, a PR-5 gene showing a high degree of homology with osmotin-like protein was isolated from sweet basil (Ocimum basilicum L.). A complete open reading frame consisting of 675 nucleotides, coding for a precursor protein, was obtained by PCR amplification. Based on sequence comparisons with tobacco osmotin and other osmotin-like proteins (OLPs), this protein was named ObOLP. The predicted mature protein is 225 amino acids in length and contains 16 cysteine residues that may potentially form eight disulfide bonds, a signature common to most PR-5 proteins. Among the various abiotic stress treatments tested, including high salt, mechanical wounding and exogenous phytohormone/elicitor treatments; methyl jasmonate (MeJA) and mechanical wounding significantly induced the expression of ObOLP gene. The coding sequence of ObOLP was cloned and expressed in a bacterial host resulting in a 25kDa recombinant-HIS tagged protein, displaying antifungal activity. The ObOLP protein sequence appears to contain an N-terminal signal peptide with signatures of secretory pathway. Further, our experimental data shows that ObOLP expression is regulated transcriptionally and in silico analysis suggests that it may be post-transcriptionally and post-translationally regulated through microRNAs and post-translational protein modifications, respectively. This study appears to be the first report of isolation and characterization of osmotin-like protein gene from O. basilicum. Copyright © 2014 Elsevier B.V. All rights reserved.

  3. EnzyNet: enzyme classification using 3D convolutional neural networks on spatial representation

    PubMed Central

    Amidi, Afshine; Megalooikonomou, Vasileios; Paragios, Nikos

    2018-01-01

    During the past decade, with the significant progress of computational power as well as ever-rising data availability, deep learning techniques became increasingly popular due to their excellent performance on computer vision problems. The size of the Protein Data Bank (PDB) has increased more than 15-fold since 1999, which enabled the expansion of models that aim at predicting enzymatic function via their amino acid composition. Amino acid sequence, however, is less conserved in nature than protein structure and therefore considered a less reliable predictor of protein function. This paper presents EnzyNet, a novel 3D convolutional neural networks classifier that predicts the Enzyme Commission number of enzymes based only on their voxel-based spatial structure. The spatial distribution of biochemical properties was also examined as complementary information. The two-layer architecture was investigated on a large dataset of 63,558 enzymes from the PDB and achieved an accuracy of 78.4% by exploiting only the binary representation of the protein shape. Code and datasets are available at https://github.com/shervinea/enzynet. PMID:29740518

  4. EnzyNet: enzyme classification using 3D convolutional neural networks on spatial representation.

    PubMed

    Amidi, Afshine; Amidi, Shervine; Vlachakis, Dimitrios; Megalooikonomou, Vasileios; Paragios, Nikos; Zacharaki, Evangelia I

    2018-01-01

    During the past decade, with the significant progress of computational power as well as ever-rising data availability, deep learning techniques became increasingly popular due to their excellent performance on computer vision problems. The size of the Protein Data Bank (PDB) has increased more than 15-fold since 1999, which enabled the expansion of models that aim at predicting enzymatic function via their amino acid composition. Amino acid sequence, however, is less conserved in nature than protein structure and therefore considered a less reliable predictor of protein function. This paper presents EnzyNet, a novel 3D convolutional neural networks classifier that predicts the Enzyme Commission number of enzymes based only on their voxel-based spatial structure. The spatial distribution of biochemical properties was also examined as complementary information. The two-layer architecture was investigated on a large dataset of 63,558 enzymes from the PDB and achieved an accuracy of 78.4% by exploiting only the binary representation of the protein shape. Code and datasets are available at https://github.com/shervinea/enzynet.

  5. Human myosin VIIA responsible for the Usher 1B syndrome: a predicted membrane-associated motor protein expressed in developing sensory epithelia.

    PubMed

    Weil, D; Levy, G; Sahly, I; Levi-Acobas, F; Blanchard, S; El-Amraoui, A; Crozet, F; Philippe, H; Abitbol, M; Petit, C

    1996-04-16

    The gene encoding human myosin VIIA is responsible for Usher syndrome type III (USH1B), a disease which associates profound congenital sensorineural deafness, vestibular dysfunction, and retinitis pigmentosa. The reconstituted cDNA sequence presented here predicts a 2215 amino acid protein with a typical unconventional myosin structure. This protein is expected to dimerize into a two-headed molecule. The C terminus of its tail shares homology with the membrane-binding domain of the band 4.1 protein superfamily. The gene consists of 48 coding exons. It encodes several alternatively spliced forms. In situ hybridization analysis in human embryos demonstrates that the myosin VIIA gene is expressed in the pigment epithelium and the photoreceptor cells of the retina, thus indicating that both cell types may be involved in the USH1B retinal degenerative process. In addition, the gene is expressed in the human embryonic cochlear and vestibular neuroepithelia. We suggest that deafness and vestibular dysfunction in USH1B patients result from a defect in the morphogenesis of the inner ear sensory cell stereocilia.

  6. A rapid and accurate approach for prediction of interactomes from co-elution data (PrInCE).

    PubMed

    Stacey, R Greg; Skinnider, Michael A; Scott, Nichollas E; Foster, Leonard J

    2017-10-23

    An organism's protein interactome, or complete network of protein-protein interactions, defines the protein complexes that drive cellular processes. Techniques for studying protein complexes have traditionally applied targeted strategies such as yeast two-hybrid or affinity purification-mass spectrometry to assess protein interactions. However, given the vast number of protein complexes, more scalable methods are necessary to accelerate interaction discovery and to construct whole interactomes. We recently developed a complementary technique based on the use of protein correlation profiling (PCP) and stable isotope labeling in amino acids in cell culture (SILAC) to assess chromatographic co-elution as evidence of interacting proteins. Importantly, PCP-SILAC is also capable of measuring protein interactions simultaneously under multiple biological conditions, allowing the detection of treatment-specific changes to an interactome. Given the uniqueness and high dimensionality of co-elution data, new tools are needed to compare protein elution profiles, control false discovery rates, and construct an accurate interactome. Here we describe a freely available bioinformatics pipeline, PrInCE, for the analysis of co-elution data. PrInCE is a modular, open-source library that is computationally inexpensive, able to use label and label-free data, and capable of detecting tens of thousands of protein-protein interactions. Using a machine learning approach, PrInCE offers greatly reduced run time, more predicted interactions at the same stringency, prediction of protein complexes, and greater ease of use over previous bioinformatics tools for co-elution data. PrInCE is implemented in Matlab (version R2017a). Source code and standalone executable programs for Windows and Mac OSX are available at https://github.com/fosterlab/PrInCE , where usage instructions can be found. An example dataset and output are also provided for testing purposes. PrInCE is the first fast and easy-to-use data analysis pipeline that predicts interactomes and protein complexes from co-elution data. PrInCE allows researchers without bioinformatics expertise to analyze high-throughput co-elution datasets.

  7. The Effect of α-Mating Factor Secretion Signal Mutations on Recombinant Protein Expression in Pichia pastoris

    PubMed Central

    Lin-Cereghino, Geoff P.; Stark, Carolyn M.; Kim, Daniel; Chang, Jennifer; Shaheen, Nadia; Poerwanto, Hansel; Agari, Kimiko; Moua, Pachai; Low, Lauren K.; Tran, Namphuong; Huang, Amy D.; Nattestad, Maria; Oshiro, Kristin T.; Chang, John William; Chavan, Archana; Tsai, Jerry W.; Lin-Cereghino, Joan

    2013-01-01

    The methylotrophic yeast, Pichia pastoris, has been genetically engineered to produce many heterologous proteins for industrial and research purposes. In order to secrete proteins for easier purification from the extracellular medium, the coding sequence of recombinant proteins are initially fused to the Saccharomyces cerevisiae α-mating factor secretion signal leader. Extensive site-directed mutagenesis of the prepro region of the α-mating factor secretion signal sequence was performed in order to determine the effects of various deletions and substitutions on expression. Though some mutations clearly dampened protein expression, deletion of amino acids 57-70, corresponding to the predicted 3rd alpha helix of α-mating factor secretion signal, increased secretion of reporter proteins horseradish peroxidase and lipase at least 50% in small-scale cultures. These findings raise the possibility that the secretory efficiency of the leader can be further enhanced in the future. PMID:23454485

  8. Genome-wide Expression Profiling, In Vivo DNA Binding Analysis, and Probabilistic Motif Prediction Reveal Novel Abf1 Target Genes during Fermentation, Respiration, and Sporulation in Yeast

    PubMed Central

    Schlecht, Ulrich; Erb, Ionas; Demougin, Philippe; Robine, Nicolas; Borde, Valérie; van Nimwegen, Erik; Nicolas, Alain

    2008-01-01

    The autonomously replicating sequence binding factor 1 (Abf1) was initially identified as an essential DNA replication factor and later shown to be a component of the regulatory network controlling mitotic and meiotic cell cycle progression in budding yeast. The protein is thought to exert its functions via specific interaction with its target site as part of distinct protein complexes, but its roles during mitotic growth and meiotic development are only partially understood. Here, we report a comprehensive approach aiming at the identification of direct Abf1-target genes expressed during fermentation, respiration, and sporulation. Computational prediction of the protein's target sites was integrated with a genome-wide DNA binding assay in growing and sporulating cells. The resulting data were combined with the output of expression profiling studies using wild-type versus temperature-sensitive alleles. This work identified 434 protein-coding loci as being transcriptionally dependent on Abf1. More than 60% of their putative promoter regions contained a computationally predicted Abf1 binding site and/or were bound by Abf1 in vivo, identifying them as direct targets. The present study revealed numerous loci previously unknown to be under Abf1 control, and it yielded evidence for the protein's variable DNA binding pattern during mitotic growth and meiotic development. PMID:18305101

  9. The Evolution and Expression Pattern of Human Overlapping lncRNA and Protein-coding Gene Pairs.

    PubMed

    Ning, Qianqian; Li, Yixue; Wang, Zhen; Zhou, Songwen; Sun, Hong; Yu, Guangjun

    2017-03-27

    Long non-coding RNA overlapping with protein-coding gene (lncRNA-coding pair) is a special type of overlapping genes. Protein-coding overlapping genes have been well studied and increasing attention has been paid to lncRNAs. By studying lncRNA-coding pairs in human genome, we showed that lncRNA-coding pairs were more likely to be generated by overprinting and retaining genes in lncRNA-coding pairs were given higher priority than non-overlapping genes. Besides, the preference of overlapping configurations preserved during evolution was based on the origin of lncRNA-coding pairs. Further investigations showed that lncRNAs promoting the splicing of their embedded protein-coding partners was a unilateral interaction, but the existence of overlapping partners improving the gene expression was bidirectional and the effect was decreased with the increased evolutionary age of genes. Additionally, the expression of lncRNA-coding pairs showed an overall positive correlation and the expression correlation was associated with their overlapping configurations, local genomic environment and evolutionary age of genes. Comparison of the expression correlation of lncRNA-coding pairs between normal and cancer samples found that the lineage-specific pairs including old protein-coding genes may play an important role in tumorigenesis. This work presents a systematically comprehensive understanding of the evolution and the expression pattern of human lncRNA-coding pairs.

  10. Coevolutionary modeling of protein sequences: Predicting structure, function, and mutational landscapes

    NASA Astrophysics Data System (ADS)

    Weigt, Martin

    Over the last years, biological research has been revolutionized by experimental high-throughput techniques, in particular by next-generation sequencing technology. Unprecedented amounts of data are accumulating, and there is a growing request for computational methods unveiling the information hidden in raw data, thereby increasing our understanding of complex biological systems. Statistical-physics models based on the maximum-entropy principle have, in the last few years, played an important role in this context. To give a specific example, proteins and many non-coding RNA show a remarkable degree of structural and functional conservation in the course of evolution, despite a large variability in amino acid sequences. We have developed a statistical-mechanics inspired inference approach - called Direct-Coupling Analysis - to link this sequence variability (easy to observe in sequence alignments, which are available in public sequence databases) to bio-molecular structure and function. In my presentation I will show, how this methodology can be used (i) to infer contacts between residues and thus to guide tertiary and quaternary protein structure prediction and RNA structure prediction, (ii) to discriminate interacting from non-interacting protein families, and thus to infer conserved protein-protein interaction networks, and (iii) to reconstruct mutational landscapes and thus to predict the phenotypic effect of mutations. References [1] M. Figliuzzi, H. Jacquier, A. Schug, O. Tenaillon and M. Weigt ''Coevolutionary landscape inference and the context-dependence of mutations in beta-lactamase TEM-1'', Mol. Biol. Evol. (2015), doi: 10.1093/molbev/msv211 [2] E. De Leonardis, B. Lutz, S. Ratz, S. Cocco, R. Monasson, A. Schug, M. Weigt ''Direct-Coupling Analysis of nucleotide coevolution facilitates RNA secondary and tertiary structure prediction'', Nucleic Acids Research (2015), doi: 10.1093/nar/gkv932 [3] F. Morcos, A. Pagnani, B. Lunt, A. Bertolino, D. Marks, C. Sander, R. Zecchina, J.N. Onuchic, T. Hwa, M. Weigt, ''Direct-coupling analysis of residue co-evolution captures native contacts across many protein families'', Proc. Natl. Acad. Sci. 108, E1293-E1301 (2011).

  11. Signal sequence and keyword trap in silico for selection of full-length human cDNAs encoding secretion or membrane proteins from oligo-capped cDNA libraries.

    PubMed

    Otsuki, Tetsuji; Ota, Toshio; Nishikawa, Tetsuo; Hayashi, Koji; Suzuki, Yutaka; Yamamoto, Jun-ichi; Wakamatsu, Ai; Kimura, Kouichi; Sakamoto, Katsuhiko; Hatano, Naoto; Kawai, Yuri; Ishii, Shizuko; Saito, Kaoru; Kojima, Shin-ichi; Sugiyama, Tomoyasu; Ono, Tetsuyoshi; Okano, Kazunori; Yoshikawa, Yoko; Aotsuka, Satoshi; Sasaki, Naokazu; Hattori, Atsushi; Okumura, Koji; Nagai, Keiichi; Sugano, Sumio; Isogai, Takao

    2005-01-01

    We have developed an in silico method of selection of human full-length cDNAs encoding secretion or membrane proteins from oligo-capped cDNA libraries. Fullness rates were increased to about 80% by combination of the oligo-capping method and ATGpr, software for prediction of translation start point and the coding potential. Then, using 5'-end single-pass sequences, cDNAs having the signal sequence were selected by PSORT ('signal sequence trap'). We also applied 'secretion or membrane protein-related keyword trap' based on the result of BLAST search against the SWISS-PROT database for the cDNAs which could not be selected by PSORT. Using the above procedures, 789 cDNAs were primarily selected and subjected to full-length sequencing, and 334 of these cDNAs were finally selected as novel. Most of the cDNAs (295 cDNAs: 88.3%) were predicted to encode secretion or membrane proteins. In particular, 165(80.5%) of the 205 cDNAs selected by PSORT were predicted to have signal sequences, while 70 (54.2%) of the 129 cDNAs selected by 'keyword trap' preserved the secretion or membrane protein-related keywords. Many important cDNAs were obtained, including transporters, receptors, and ligands, involved in significant cellular functions. Thus, an efficient method of selecting secretion or membrane protein-encoding cDNAs was developed by combining the above four procedures.

  12. Identification of new stress-induced microRNA and their targets in wheat using computational approach.

    PubMed

    Pandey, Bharati; Gupta, Om Prakash; Pandey, Dev Mani; Sharma, Indu; Sharma, Pradeep

    2013-05-01

    MicroRNAs (miRNAs) are a class of short endogenous non-coding small RNA molecules of about 18-22 nucleotides in length. Their main function is to downregulate gene expression in different manners like translational repression, mRNA cleavage and epigenetic modification. Computational predictions have raised the number of miRNAs in wheat significantly using an EST based approach. Hence, a combinatorial approach which is amalgamation of bioinformatics software and perl script was used to identify new miRNA to add to the growing database of wheat miRNA. Identification of miRNAs was initiated by mining the EST (Expressed Sequence Tags) database available at National Center for Biotechnology Information. In this investigation, 4677 mature microRNA sequences belonging to 50 miRNA families from different plant species were used to predict miRNA in wheat. A total of five abiotic stress-responsive new miRNAs were predicted and named Ta-miR5653, Ta-miR855, Ta-miR819k, Ta-miR3708 and Ta-miR5156. In addition, four previously identified miRNA, i.e., Ta-miR1122, miR1117, Ta-miR1134 and Ta-miR1133 were predicted in newly identified EST sequence and 14 potential target genes were subsequently predicted, most of which seems to encode ubiquitin carrier protein, serine/threonine protein kinase, 40S ribosomal protein, F-box/kelch-repeat protein, BTB/POZ domain-containing protein, transcription factors which are involved in growth, development, metabolism and stress response. Our result has increased the number of miRNAs in wheat, which should be useful for further investigation into the biological functions and evolution of miRNAs in wheat and other plant species.

  13. Identification of a New Human Adenovirus Protein Encoded by a Novel Late l-Strand Transcription Unit▿

    PubMed Central

    Tollefson, Ann E.; Ying, Baoling; Doronin, Konstantin; Sidor, Peter D.; Wold, William S. M.

    2007-01-01

    A short open reading frame named the “U exon,” located on the adenovirus (Ad) l-strand (for leftward transcription) between the early E3 region and the fiber gene, is conserved in mastadenoviruses. We have observed that Ad5 mutants with large deletions in E3 that infringe on the U exon display a mild growth defect, as well as an aberrant Ad E2 DNA-binding protein (DBP) intranuclear localization pattern and an apparent failure to organize replication centers during late infection. Mutants in which the U exon DNA is reconstructed have a reversed phenotype. Chow et al. (L. T. Chow et al., J. Mol. Biol. 134:265-303, 1979) described mRNAs initiating in the region of the U exon and spliced to downstream sequences in the late DBP mRNA leader and the DBP-coding region. We have cloned this mRNA (as cDNA) from Ad5 late mRNA; the predicted protein is 217 amino acids, initiating in the U exon and continuing in frame in the DBP leader and in the DBP-coding region but in a different reading frame from DBP. Polyclonal and monoclonal antibodies generated against the predicted U exon protein (UXP) showed that UXP is ∼24K in size by immunoblot and is a late protein. At 18 to 24 h postinfection, UXP is strongly associated with nucleoli and is found throughout the nucleus; later, UXP is associated with the periphery of replication centers, suggesting a function relevant to Ad DNA replication or RNA transcription. UXP is expressed by all four species C Ads. When expressed in transient transfections, UXP complements the aberrant DBP localization pattern of UXP-negative Ad5 mutants. Our data indicate that UXP is a previously unrecognized protein derived from a novel late l-strand transcription unit. PMID:17881437

  14. Cost Function Network-based Design of Protein-Protein Interactions: predicting changes in binding affinity.

    PubMed

    Viricel, Clément; de Givry, Simon; Schiex, Thomas; Barbe, Sophie

    2018-02-20

    Accurate and economic methods to predict change in protein binding free energy upon mutation are imperative to accelerate the design of proteins for a wide range of applications. Free energy is defined by enthalpic and entropic contributions. Following the recent progresses of Artificial Intelligence-based algorithms for guaranteed NP-hard energy optimization and partition function computation, it becomes possible to quickly compute minimum energy conformations and to reliably estimate the entropic contribution of side-chains in the change of free energy of large protein interfaces. Using guaranteed Cost Function Network algorithms, Rosetta energy functions and Dunbrack's rotamer library, we developed and assessed EasyE and JayZ, two methods for binding affinity estimation that ignore or include conformational entropic contributions on a large benchmark of binding affinity experimental measures. If both approaches outperform most established tools, we observe that side-chain conformational entropy brings little or no improvement on most systems but becomes crucial in some rare cases. as open-source Python/C ++ code at sourcesup.renater.fr/projects/easy-jayz. thomas.schiex@inra.fr and sophie.barbe@insa-toulouse.fr. Supplementary data are available at Bioinformatics online.

  15. Analysis of predicted loss-of-function variants in UK Biobank identifies variants protective for disease.

    PubMed

    Emdin, Connor A; Khera, Amit V; Chaffin, Mark; Klarin, Derek; Natarajan, Pradeep; Aragam, Krishna; Haas, Mary; Bick, Alexander; Zekavat, Seyedeh M; Nomura, Akihiro; Ardissino, Diego; Wilson, James G; Schunkert, Heribert; McPherson, Ruth; Watkins, Hugh; Elosua, Roberto; Bown, Matthew J; Samani, Nilesh J; Baber, Usman; Erdmann, Jeanette; Gupta, Namrata; Danesh, John; Chasman, Daniel; Ridker, Paul; Denny, Joshua; Bastarache, Lisa; Lichtman, Judith H; D'Onofrio, Gail; Mattera, Jennifer; Spertus, John A; Sheu, Wayne H-H; Taylor, Kent D; Psaty, Bruce M; Rich, Stephen S; Post, Wendy; Rotter, Jerome I; Chen, Yii-Der Ida; Krumholz, Harlan; Saleheen, Danish; Gabriel, Stacey; Kathiresan, Sekar

    2018-04-24

    Less than 3% of protein-coding genetic variants are predicted to result in loss of protein function through the introduction of a stop codon, frameshift, or the disruption of an essential splice site; however, such predicted loss-of-function (pLOF) variants provide insight into effector transcript and direction of biological effect. In >400,000 UK Biobank participants, we conduct association analyses of 3759 pLOF variants with six metabolic traits, six cardiometabolic diseases, and twelve additional diseases. We identified 18 new low-frequency or rare (allele frequency < 5%) pLOF variant-phenotype associations. pLOF variants in the gene GPR151 protect against obesity and type 2 diabetes, in the gene IL33 against asthma and allergic disease, and in the gene IFIH1 against hypothyroidism. In the gene PDE3B, pLOF variants associate with elevated height, improved body fat distribution and protection from coronary artery disease. Our findings prioritize genes for which pharmacologic mimics of pLOF variants may lower risk for disease.

  16. Culture and Healthy Eating: The Role of Independence and Interdependence in the United States and Japan.

    PubMed

    Levine, Cynthia S; Miyamoto, Yuri; Markus, Hazel Rose; Rigotti, Attilio; Boylan, Jennifer Morozink; Park, Jiyoung; Kitayama, Shinobu; Karasawa, Mayumi; Kawakami, Norito; Coe, Christopher L; Love, Gayle D; Ryff, Carol D

    2016-10-01

    Healthy eating is important for physical health. Using large probability samples of middle-aged adults in the United States and Japan, we show that fitting with the culturally normative way of being predicts healthy eating. In the United States, a culture that prioritizes and emphasizes independence, being independent predicts eating a healthy diet (an index of fish, protein, fruit, vegetables, reverse-coded sugared beverages, and reverse-coded high fat meat consumption; Study 1) and not using nonmeat food as a way to cope with stress (Study 2a). In Japan, a culture that prioritizes and emphasizes interdependence, being interdependent predicts eating a healthy diet (Studies 1 and 2b). Furthermore, reflecting the types of agency that are prevalent in each context, these relationships are mediated by autonomy in the United States and positive relations with others in Japan. These findings highlight the importance of understanding cultural differences in shaping healthy behavior and have implications for designing health-promoting interventions. © 2016 by the Society for Personality and Social Psychology, Inc.

  17. The complete nucleotide sequence of RNA beta from the type strain of barley stripe mosaic virus.

    PubMed Central

    Gustafson, G; Armour, S L

    1986-01-01

    The complete nucleotide sequence of RNA beta from the type strain of barley stripe mosaic virus (BSMV) has been determined. The sequence is 3289 nucleotides in length and contains four open reading frames (ORFs) which code for proteins of Mr 22,147 (ORF1), Mr 58,098 (ORF2), Mr 17,378 (ORF3), and Mr 14,119 (ORF4). The predicted N-terminal amino acid sequence of the polypeptide encoded by the ORF nearest the 5'-end of the RNA (ORF1) is identical (after the initiator methionine) to the published N-terminal amino acid sequence of BSMV coat protein for 29 of the first 30 amino acids. ORF2 occupies the central portion of the coding region of RNA beta and ORF3 is located at the 3'-end. The ORF4 sequence overlaps the 3'-region of ORF2 and the 5'-region of ORF3 and differs in codon usage from the other three RNA beta ORFs. The coding region of RNA beta is followed by a poly(A) tract and a 238 nucleotide tRNA-like structure which are common to all three BSMV genomic RNAs. Images PMID:3754962

  18. Long Non-Coding RNAs Differentially Expressed between Normal versus Primary Breast Tumor Tissues Disclose Converse Changes to Breast Cancer-Related Protein-Coding Genes

    PubMed Central

    Reiche, Kristin; Kasack, Katharina; Schreiber, Stephan; Lüders, Torben; Due, Eldri U.; Naume, Bjørn; Riis, Margit; Kristensen, Vessela N.; Horn, Friedemann; Børresen-Dale, Anne-Lise; Hackermüller, Jörg; Baumbusch, Lars O.

    2014-01-01

    Breast cancer, the second leading cause of cancer death in women, is a highly heterogeneous disease, characterized by distinct genomic and transcriptomic profiles. Transcriptome analyses prevalently assessed protein-coding genes; however, the majority of the mammalian genome is expressed in numerous non-coding transcripts. Emerging evidence supports that many of these non-coding RNAs are specifically expressed during development, tumorigenesis, and metastasis. The focus of this study was to investigate the expression features and molecular characteristics of long non-coding RNAs (lncRNAs) in breast cancer. We investigated 26 breast tumor and 5 normal tissue samples utilizing a custom expression microarray enclosing probes for mRNAs as well as novel and previously identified lncRNAs. We identified more than 19,000 unique regions significantly differentially expressed between normal versus breast tumor tissue, half of these regions were non-coding without any evidence for functional open reading frames or sequence similarity to known proteins. The identified non-coding regions were primarily located in introns (53%) or in the intergenic space (33%), frequently orientated in antisense-direction of protein-coding genes (14%), and commonly distributed at promoter-, transcription factor binding-, or enhancer-sites. Analyzing the most diverse mRNA breast cancer subtypes Basal-like versus Luminal A and B resulted in 3,025 significantly differentially expressed unique loci, including 682 (23%) for non-coding transcripts. A notable number of differentially expressed protein-coding genes displayed non-synonymous expression changes compared to their nearest differentially expressed lncRNA, including an antisense lncRNA strongly anticorrelated to the mRNA coding for histone deacetylase 3 (HDAC3), which was investigated in more detail. Previously identified chromatin-associated lncRNAs (CARs) were predominantly downregulated in breast tumor samples, including CARs located in the protein-coding genes for CALD1, FTX, and HNRNPH1. In conclusion, a number of differentially expressed lncRNAs have been identified with relation to cancer-related protein-coding genes. PMID:25264628

  19. Long non-coding RNAs differentially expressed between normal versus primary breast tumor tissues disclose converse changes to breast cancer-related protein-coding genes.

    PubMed

    Reiche, Kristin; Kasack, Katharina; Schreiber, Stephan; Lüders, Torben; Due, Eldri U; Naume, Bjørn; Riis, Margit; Kristensen, Vessela N; Horn, Friedemann; Børresen-Dale, Anne-Lise; Hackermüller, Jörg; Baumbusch, Lars O

    2014-01-01

    Breast cancer, the second leading cause of cancer death in women, is a highly heterogeneous disease, characterized by distinct genomic and transcriptomic profiles. Transcriptome analyses prevalently assessed protein-coding genes; however, the majority of the mammalian genome is expressed in numerous non-coding transcripts. Emerging evidence supports that many of these non-coding RNAs are specifically expressed during development, tumorigenesis, and metastasis. The focus of this study was to investigate the expression features and molecular characteristics of long non-coding RNAs (lncRNAs) in breast cancer. We investigated 26 breast tumor and 5 normal tissue samples utilizing a custom expression microarray enclosing probes for mRNAs as well as novel and previously identified lncRNAs. We identified more than 19,000 unique regions significantly differentially expressed between normal versus breast tumor tissue, half of these regions were non-coding without any evidence for functional open reading frames or sequence similarity to known proteins. The identified non-coding regions were primarily located in introns (53%) or in the intergenic space (33%), frequently orientated in antisense-direction of protein-coding genes (14%), and commonly distributed at promoter-, transcription factor binding-, or enhancer-sites. Analyzing the most diverse mRNA breast cancer subtypes Basal-like versus Luminal A and B resulted in 3,025 significantly differentially expressed unique loci, including 682 (23%) for non-coding transcripts. A notable number of differentially expressed protein-coding genes displayed non-synonymous expression changes compared to their nearest differentially expressed lncRNA, including an antisense lncRNA strongly anticorrelated to the mRNA coding for histone deacetylase 3 (HDAC3), which was investigated in more detail. Previously identified chromatin-associated lncRNAs (CARs) were predominantly downregulated in breast tumor samples, including CARs located in the protein-coding genes for CALD1, FTX, and HNRNPH1. In conclusion, a number of differentially expressed lncRNAs have been identified with relation to cancer-related protein-coding genes.

  20. FRAGSION: ultra-fast protein fragment library generation by IOHMM sampling.

    PubMed

    Bhattacharya, Debswapna; Adhikari, Badri; Li, Jilong; Cheng, Jianlin

    2016-07-01

    Speed, accuracy and robustness of building protein fragment library have important implications in de novo protein structure prediction since fragment-based methods are one of the most successful approaches in template-free modeling (FM). Majority of the existing fragment detection methods rely on database-driven search strategies to identify candidate fragments, which are inherently time-consuming and often hinder the possibility to locate longer fragments due to the limited sizes of databases. Also, it is difficult to alleviate the effect of noisy sequence-based predicted features such as secondary structures on the quality of fragment. Here, we present FRAGSION, a database-free method to efficiently generate protein fragment library by sampling from an Input-Output Hidden Markov Model. FRAGSION offers some unique features compared to existing approaches in that it (i) is lightning-fast, consuming only few seconds of CPU time to generate fragment library for a protein of typical length (300 residues); (ii) can generate dynamic-size fragments of any length (even for the whole protein sequence) and (iii) offers ways to handle noise in predicted secondary structure during fragment sampling. On a FM dataset from the most recent Critical Assessment of Structure Prediction, we demonstrate that FGRAGSION provides advantages over the state-of-the-art fragment picking protocol of ROSETTA suite by speeding up computation by several orders of magnitude while achieving comparable performance in fragment quality. Source code and executable versions of FRAGSION for Linux and MacOS is freely available to non-commercial users at http://sysbio.rnet.missouri.edu/FRAGSION/ It is bundled with a manual and example data. chengji@missouri.edu Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  1. GenoMycDB: a database for comparative analysis of mycobacterial genes and genomes.

    PubMed

    Catanho, Marcos; Mascarenhas, Daniel; Degrave, Wim; Miranda, Antonio Basílio de

    2006-03-31

    Several databases and computational tools have been created with the aim of organizing, integrating and analyzing the wealth of information generated by large-scale sequencing projects of mycobacterial genomes and those of other organisms. However, with very few exceptions, these databases and tools do not allow for massive and/or dynamic comparison of these data. GenoMycDB (http://www.dbbm.fiocruz.br/GenoMycDB) is a relational database built for large-scale comparative analyses of completely sequenced mycobacterial genomes, based on their predicted protein content. Its central structure is composed of the results obtained after pair-wise sequence alignments among all the predicted proteins coded by the genomes of six mycobacteria: Mycobacterium tuberculosis (strains H37Rv and CDC1551), M. bovis AF2122/97, M. avium subsp. paratuberculosis K10, M. leprae TN, and M. smegmatis MC2 155. The database stores the computed similarity parameters of every aligned pair, providing for each protein sequence the predicted subcellular localization, the assigned cluster of orthologous groups, the features of the corresponding gene, and links to several important databases. Tables containing pairs or groups of potential homologs between selected species/strains can be produced dynamically by user-defined criteria, based on one or multiple sequence similarity parameters. In addition, searches can be restricted according to the predicted subcellular localization of the protein, the DNA strand of the corresponding gene and/or the description of the protein. Massive data search and/or retrieval are available, and different ways of exporting the result are offered. GenoMycDB provides an on-line resource for the functional classification of mycobacterial proteins as well as for the analysis of genome structure, organization, and evolution.

  2. Sialotranscriptomics of Rhipicephalus zambeziensis reveals intricate expression profiles of secretory proteins and suggests tight temporal transcriptional regulation during blood-feeding.

    PubMed

    de Castro, Minique Hilda; de Klerk, Daniel; Pienaar, Ronel; Rees, D Jasper G; Mans, Ben J

    2017-08-10

    Ticks secrete a diverse mixture of secretory proteins into the host to evade its immune response and facilitate blood-feeding, making secretory proteins attractive targets for the production of recombinant anti-tick vaccines. The largely neglected tick species, Rhipicephalus zambeziensis, is an efficient vector of Theileria parva in southern Africa but its available sequence information is limited. Next generation sequencing has advanced sequence availability for ticks in recent years and has assisted the characterisation of secretory proteins. This study focused on the de novo assembly and annotation of the salivary gland transcriptome of R. zambeziensis and the temporal expression of secretory protein transcripts in female and male ticks, before the onset of feeding and during early and late feeding. The sialotranscriptome of R. zambeziensis yielded 23,631 transcripts from which 13,584 non-redundant proteins were predicted. Eighty-six percent of these contained a predicted start and stop codon and were estimated to be putatively full-length proteins. A fifth (2569) of the predicted proteins were annotated as putative secretory proteins and explained 52% of the expression in the transcriptome. Expression analyses revealed that 2832 transcripts were differentially expressed among feeding time points and 1209 between the tick sexes. The expression analyses further indicated that 57% of the annotated secretory protein transcripts were differentially expressed. Dynamic expression profiles of secretory protein transcripts were observed during feeding of female ticks. Whereby a number of transcripts were upregulated during early feeding, presumably for feeding site establishment and then during late feeding, 52% of these were downregulated, indicating that transcripts were required at specific feeding stages. This suggested that secretory proteins are under stringent transcriptional regulation that fine-tunes their expression in salivary glands during feeding. No open reading frames were predicted for 7947 transcripts. This class represented 17% of the differentially expressed transcripts, suggesting a potential transcriptional regulatory function of long non-coding RNA in tick blood-feeding. The assembled sialotranscriptome greatly expands the sequence availability of R. zambeziensis, assists in our understanding of the transcription of secretory proteins during blood-feeding and will be a valuable resource for future vaccine candidate selection.

  3. Draft Genome Sequence of the Terrestrial Cyanobacterium Scytonema millei VB511283, Isolated from Eastern India

    PubMed Central

    Sen, Diya; Chandrababunaidu, Mathu Malar; Singh, Deeksha; Sanghi, Neha; Ghorai, Arpita; Mishra, Gyan Prakash; Madduluri, Madhavi

    2015-01-01

    We report here the draft genome sequence of Scytonema millei VB511283, a cyanobacterium isolated from biofilms on the exterior of stone monuments in Santiniketan, eastern India. The draft genome is 11,627,246 bp long (11.63 Mb), with 118 scaffolds. About 9,011 protein-coding genes, 117 tRNAs, and 12 rRNAs are predicted from this assembly. PMID:25744984

  4. Genome Sequence of Legionella massiliensis, Isolated from a Cooling Tower Water Sample.

    PubMed

    Pagnier, Isabelle; Croce, Olivier; Robert, Catherine; Raoult, Didier; La Scola, Bernard

    2014-10-16

    We present the draft genome sequence of Legionella massiliensis strain LegA(T), recovered from a cooling tower water sample, using an amoebal coculture procedure. The strain described here is composed of 4,387,007 bp, with a G+C content of 41.19%, and its genome has 3,767 protein-coding genes and 60 predicted RNA genes. Copyright © 2014 Pagnier et al.

  5. Draft Genome Sequence of Algoriphagus sp. Strain NH1, a Multidrug-Resistant Bacterium Isolated from Coastal Sediments of the Northern Yellow Sea in China

    PubMed Central

    Mu, Dashuai; Zhao, Jinxin; Wang, Zongjie; Chen, Guanjun

    2016-01-01

    Algoriphagus sp. NH1 is a multidrug-resistant bacterium isolated from coastal sediments of the northern Yellow Sea in China. Here, we report the draft genome sequence of NH1, with a size of 6,131,579 bp, average G+C content of 42.68%, and 5,746 predicted protein-coding sequences. PMID:26769940

  6. Draft Genome Sequence of Cellulolytic and Xylanolytic Cellulomonas sp. Strain B6 Isolated from Subtropical Forest Soil

    PubMed Central

    Piccinni, Florencia; Murua, Yanina; Ghio, Silvina; Talia, Paola; Rivarola, Máximo

    2016-01-01

    Cellulomonas sp. strain B6 was isolated from a subtropical forest soil sample and presented (hemi)cellulose-degrading activity. We report here its draft genome sequence, with an estimated genome size of 4 Mb, a G+C content of 75.1%, and 3,443 predicted protein-coding sequences, 92 of which are glycosyl hydrolases involved in polysaccharide degradation. PMID:27563050

  7. Community and gene composition of a human dental plaque microbiota obtained by metagenomic sequencing

    PubMed Central

    Xie, G.; Chain, P.S.G.; Lo, C.; Liu, K-L.; Gans, J.; Merritt, J.; Qi, F.

    2010-01-01

    SUMMARY Human dental plaque is a complex microbial community containing an estimated 700 to 19,000 species/phylotypes. Despite numerous studies analysing species richness in healthy and diseased human subjects, the true genomic composition of the human dental plaque microbiota remains unknown. Here we report a metagenomic analysis of a healthy human plaque sample using a combination of second-generation sequencing platforms. A total of 860 million base pairs of non-human sequences were generated. Various analysis tools revealed the presence of 12 well-characterized phyla, members of the TM-7 and BRC1 clade, and sequences that could not be classified. Both pathogens and opportunistic pathogens were identified, supporting the ecological plaque hypothesis for oral diseases. Mapping the metagenomic reads to sequenced reference genomes demonstrated that 4% of the reads could be assigned to the sequenced species. Preliminary annotation identified genes belonging to all known functional categories. Interestingly, although 73% of the total assembled contig sequences were predicted to code for proteins, only 51% of them could be assigned a functional role. Furthermore, ~ 2.8% of the total predicted genes coded for proteins involved in resistance to antibiotics and toxic compounds, suggesting that the oral cavity is an important reservoir for antimicrobial resistance. PMID:21040513

  8. Community and gene composition of a human dental plaque microbiota obtained by metagenomic sequencing.

    PubMed

    Xie, G; Chain, P S G; Lo, C-C; Liu, K-L; Gans, J; Merritt, J; Qi, F

    2010-12-01

    Human dental plaque is a complex microbial community containing an estimated 700 to 19,000 species/phylotypes. Despite numerous studies analysing species richness in healthy and diseased human subjects, the true genomic composition of the human dental plaque microbiota remains unknown. Here we report a metagenomic analysis of a healthy human plaque sample using a combination of second-generation sequencing platforms. A total of 860 million base pairs of non-human sequences were generated. Various analysis tools revealed the presence of 12 well-characterized phyla, members of the TM-7 and BRC1 clade, and sequences that could not be classified. Both pathogens and opportunistic pathogens were identified, supporting the ecological plaque hypothesis for oral diseases. Mapping the metagenomic reads to sequenced reference genomes demonstrated that 4% of the reads could be assigned to the sequenced species. Preliminary annotation identified genes belonging to all known functional categories. Interestingly, although 73% of the total assembled contig sequences were predicted to code for proteins, only 51% of them could be assigned a functional role. Furthermore, ~2.8% of the total predicted genes coded for proteins involved in resistance to antibiotics and toxic compounds, suggesting that the oral cavity is an important reservoir for antimicrobial resistance. © 2010 John Wiley & Sons A/S.

  9. Human somatostatin I: sequence of the cDNA.

    PubMed Central

    Shen, L P; Pictet, R L; Rutter, W J

    1982-01-01

    RNA has been isolated from a human pancreatic somatostatinoma and used to prepare a cDNA library. After prescreening, clones containing somatostatin I sequences were identified by hybridization with an anglerfish somatostatin I-cloned cDNA probe. From the nucleotide sequence of two of these clones, we have deduced an essentially full-length mRNA sequence, including the preprosomatostatin coding region, 105 nucleotides from the 5' untranslated region and the complete 150-nucleotide 3' untranslated region. The coding region predicts a 116-amino acid precursor protein (Mr, 12.727) that contains somatostatin-14 and -28 at its COOH terminus. The predicted amino acid sequence of human somatostatin-28 is identical to that of somatostatin-28 isolated from the porcine and ovine species. A comparison of the amino acid sequences of human and anglerfish preprosomatostatin I indicated that the COOH-terminal region encoding somatostatin-14 and the adjacent 6 amino acids are highly conserved, whereas the remainder of the molecule, including the signal peptide region, is more divergent. However, many of the amino acid differences found in the pro region of the human and anglerfish proteins are conservative changes. This suggests that the propeptides have a similar secondary structure, which in turn may imply a biological function for this region of the molecule. Images PMID:6126875

  10. Detecting long tandem duplications in genomic sequences.

    PubMed

    Audemard, Eric; Schiex, Thomas; Faraut, Thomas

    2012-05-08

    Detecting duplication segments within completely sequenced genomes provides valuable information to address genome evolution and in particular the important question of the emergence of novel functions. The usual approach to gene duplication detection, based on all-pairs protein gene comparisons, provides only a restricted view of duplication. In this paper, we introduce ReD Tandem, a software using a flow based chaining algorithm targeted at detecting tandem duplication arrays of moderate to longer length regions, with possibly locally weak similarities, directly at the DNA level. On the A. thaliana genome, using a reference set of tandem duplicated genes built using TAIR,(a) we show that ReD Tandem is able to predict a large fraction of recently duplicated genes (dS  <  1) and that it is also able to predict tandem duplications involving non coding elements such as pseudo-genes or RNA genes. ReD Tandem allows to identify large tandem duplications without any annotation, leading to agnostic identification of tandem duplications. This approach nicely complements the usual protein gene based which ignores duplications involving non coding regions. It is however inherently restricted to relatively recent duplications. By recovering otherwise ignored events, ReD Tandem gives a more comprehensive view of existing evolutionary processes and may also allow to improve existing annotations.

  11. Diverse RNA-binding proteins interact with functionally related sets of RNAs, suggesting an extensive regulatory system.

    PubMed

    Hogan, Daniel J; Riordan, Daniel P; Gerber, André P; Herschlag, Daniel; Brown, Patrick O

    2008-10-28

    RNA-binding proteins (RBPs) have roles in the regulation of many post-transcriptional steps in gene expression, but relatively few RBPs have been systematically studied. We searched for the RNA targets of 40 proteins in the yeast Saccharomyces cerevisiae: a selective sample of the approximately 600 annotated and predicted RBPs, as well as several proteins not annotated as RBPs. At least 33 of these 40 proteins, including three of the four proteins that were not previously known or predicted to be RBPs, were reproducibly associated with specific sets of a few to several hundred RNAs. Remarkably, many of the RBPs we studied bound mRNAs whose protein products share identifiable functional or cytotopic features. We identified specific sequences or predicted structures significantly enriched in target mRNAs of 16 RBPs. These potential RNA-recognition elements were diverse in sequence, structure, and location: some were found predominantly in 3'-untranslated regions, others in 5'-untranslated regions, some in coding sequences, and many in two or more of these features. Although this study only examined a small fraction of the universe of yeast RBPs, 70% of the mRNA transcriptome had significant associations with at least one of these RBPs, and on average, each distinct yeast mRNA interacted with three of the RBPs, suggesting the potential for a rich, multidimensional network of regulation. These results strongly suggest that combinatorial binding of RBPs to specific recognition elements in mRNAs is a pervasive mechanism for multi-dimensional regulation of their post-transcriptional fate.

  12. Comparative genome analysis of entomopathogenic fungi reveals a complex set of secreted proteins.

    PubMed

    Staats, Charley Christian; Junges, Angela; Guedes, Rafael Lucas Muniz; Thompson, Claudia Elizabeth; de Morais, Guilherme Loss; Boldo, Juliano Tomazzoni; de Almeida, Luiz Gonzaga Paula; Andreis, Fábio Carrer; Gerber, Alexandra Lehmkuhl; Sbaraini, Nicolau; da Paixão, Rana Louise de Andrade; Broetto, Leonardo; Landell, Melissa; Santi, Lucélia; Beys-da-Silva, Walter Orlando; Silveira, Carolina Pereira; Serrano, Thaiane Rispoli; de Oliveira, Eder Silva; Kmetzsch, Lívia; Vainstein, Marilene Henning; de Vasconcelos, Ana Tereza Ribeiro; Schrank, Augusto

    2014-09-29

    Metarhizium anisopliae is an entomopathogenic fungus used in the biological control of some agricultural insect pests, and efforts are underway to use this fungus in the control of insect-borne human diseases. A large repertoire of proteins must be secreted by M. anisopliae to cope with the various available nutrients as this fungus switches through different lifestyles, i.e., from a saprophytic, to an infectious, to a plant endophytic stage. To further evaluate the predicted secretome of M. anisopliae, we employed genomic and transcriptomic analyses, coupled with phylogenomic analysis, focusing on the identification and characterization of secreted proteins. We determined the M. anisopliae E6 genome sequence and compared this sequence to other entomopathogenic fungi genomes. A robust pipeline was generated to evaluate the predicted secretomes of M. anisopliae and 15 other filamentous fungi, leading to the identification of a core of secreted proteins. Transcriptomic analysis using the tick Rhipicephalus microplus cuticle as an infection model during two periods of infection (48 and 144 h) allowed the identification of several differentially expressed genes. This analysis concluded that a large proportion of the predicted secretome coding genes contained altered transcript levels in the conditions analyzed in this study. In addition, some specific secreted proteins from Metarhizium have an evolutionary history similar to orthologs found in Beauveria/Cordyceps. This similarity suggests that a set of secreted proteins has evolved to participate in entomopathogenicity. The data presented represents an important step to the characterization of the role of secreted proteins in the virulence and pathogenicity of M. anisopliae.

  13. UET: a database of evolutionarily-predicted functional determinants of protein sequences that cluster as functional sites in protein structures.

    PubMed

    Lua, Rhonald C; Wilson, Stephen J; Konecki, Daniel M; Wilkins, Angela D; Venner, Eric; Morgan, Daniel H; Lichtarge, Olivier

    2016-01-04

    The structure and function of proteins underlie most aspects of biology and their mutational perturbations often cause disease. To identify the molecular determinants of function as well as targets for drugs, it is central to characterize the important residues and how they cluster to form functional sites. The Evolutionary Trace (ET) achieves this by ranking the functional and structural importance of the protein sequence positions. ET uses evolutionary distances to estimate functional distances and correlates genotype variations with those in the fitness phenotype. Thus, ET ranks are worse for sequence positions that vary among evolutionarily closer homologs but better for positions that vary mostly among distant homologs. This approach identifies functional determinants, predicts function, guides the mutational redesign of functional and allosteric specificity, and interprets the action of coding sequence variations in proteins, people and populations. Now, the UET database offers pre-computed ET analyses for the protein structure databank, and on-the-fly analysis of any protein sequence. A web interface retrieves ET rankings of sequence positions and maps results to a structure to identify functionally important regions. This UET database integrates several ways of viewing the results on the protein sequence or structure and can be found at http://mammoth.bcm.tmc.edu/uet/. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.

  14. Analysis of aromatic catabolic pathways in Pseudomonas putida KT 2440 using a combined proteomic approach: 2-DE/MS and cleavable isotope-coded affinity tag analysis.

    PubMed

    Kim, Young Hwan; Cho, Kun; Yun, Sung-Ho; Kim, Jin Young; Kwon, Kyung-Hoon; Yoo, Jong Shin; Kim, Seung Il

    2006-02-01

    Proteomic analysis of Pseudomonas putida KT2440 cultured in monocyclic aromatic compounds was performed using 2-DE/MS and cleavable isotope-coded affinity tag (ICAT) to determine whether proteins involved in aromatic compound degradation pathways were altered as predicted by genomic analysis (Jiménez et al., Environ Microbiol. 2002, 4, 824-841). Eighty unique proteins were identified by 2-DE/MS or MS/MS analysis from P. putida KT2440 cultured in the presence of six different organic compounds. Benzoate dioxygenase (BenA, BenD) and catechol 1,2-dioxygenase (CatA) were induced by benzoate. Protocatechuate 3,4-dixoygenase (PcaGH) was induced by p-hydroxybenzoate and vanilline. beta-Ketoadipyl CoA thiolase (PcaF) and 3-oxoadipate enol-lactone hydrolase (PcaD) were induced by benzoate, p-hydroxybenzoate and vanilline, suggesting that benzoate, p-hydroxybenzoate and vanilline were degraded by different dioxygenases and then converged in the same beta-ketoadipate degradation pathway. An additional 110 proteins, including 19 proteins from 2-DE analysis, were identified by cleavable ICAT analysis for benzoate-induced proteomes, which complemented the 2-DE results. Phenylethylamine exposure induced beta-ketoacyl CoA thiolase (PhaD) and ring-opening enzyme (PhaL), both enzymes of the phenylacetate (pha) biodegradation pathway. Phenylalanine induced 4-hydroxyphenyl-pyruvate dioxygenase (Hpd) and homogentisate 1,2-dioxygenase (HmgA), key enzymes in the homogentisate degradation pathway. Alkyl hydroperoxide reductase (AphC) was induced under all aromatic compounds conditions. These results suggest that proteome analysis complements and supports predictive information obtained by genomic sequence analysis.

  15. With or without you: predictive coding and Bayesian inference in the brain

    PubMed Central

    Aitchison, Laurence; Lengyel, Máté

    2018-01-01

    Two theoretical ideas have emerged recently with the ambition to provide a unifying functional explanation of neural population coding and dynamics: predictive coding and Bayesian inference. Here, we describe the two theories and their combination into a single framework: Bayesian predictive coding. We clarify how the two theories can be distinguished, despite sharing core computational concepts and addressing an overlapping set of empirical phenomena. We argue that predictive coding is an algorithmic / representational motif that can serve several different computational goals of which Bayesian inference is but one. Conversely, while Bayesian inference can utilize predictive coding, it can also be realized by a variety of other representations. We critically evaluate the experimental evidence supporting Bayesian predictive coding and discuss how to test it more directly. PMID:28942084

  16. RNAcode: Robust discrimination of coding and noncoding regions in comparative sequence data

    PubMed Central

    Washietl, Stefan; Findeiß, Sven; Müller, Stephan A.; Kalkhof, Stefan; von Bergen, Martin; Hofacker, Ivo L.; Stadler, Peter F.; Goldman, Nick

    2011-01-01

    With the availability of genome-wide transcription data and massive comparative sequencing, the discrimination of coding from noncoding RNAs and the assessment of coding potential in evolutionarily conserved regions arose as a core analysis task. Here we present RNAcode, a program to detect coding regions in multiple sequence alignments that is optimized for emerging applications not covered by current protein gene-finding software. Our algorithm combines information from nucleotide substitution and gap patterns in a unified framework and also deals with real-life issues such as alignment and sequencing errors. It uses an explicit statistical model with no machine learning component and can therefore be applied “out of the box,” without any training, to data from all domains of life. We describe the RNAcode method and apply it in combination with mass spectrometry experiments to predict and confirm seven novel short peptides in Escherichia coli and to analyze the coding potential of RNAs previously annotated as “noncoding.” RNAcode is open source software and available for all major platforms at http://wash.github.com/rnacode. PMID:21357752

  17. RNAcode: robust discrimination of coding and noncoding regions in comparative sequence data.

    PubMed

    Washietl, Stefan; Findeiss, Sven; Müller, Stephan A; Kalkhof, Stefan; von Bergen, Martin; Hofacker, Ivo L; Stadler, Peter F; Goldman, Nick

    2011-04-01

    With the availability of genome-wide transcription data and massive comparative sequencing, the discrimination of coding from noncoding RNAs and the assessment of coding potential in evolutionarily conserved regions arose as a core analysis task. Here we present RNAcode, a program to detect coding regions in multiple sequence alignments that is optimized for emerging applications not covered by current protein gene-finding software. Our algorithm combines information from nucleotide substitution and gap patterns in a unified framework and also deals with real-life issues such as alignment and sequencing errors. It uses an explicit statistical model with no machine learning component and can therefore be applied "out of the box," without any training, to data from all domains of life. We describe the RNAcode method and apply it in combination with mass spectrometry experiments to predict and confirm seven novel short peptides in Escherichia coli and to analyze the coding potential of RNAs previously annotated as "noncoding." RNAcode is open source software and available for all major platforms at http://wash.github.com/rnacode.

  18. A review of predictive coding algorithms.

    PubMed

    Spratling, M W

    2017-03-01

    Predictive coding is a leading theory of how the brain performs probabilistic inference. However, there are a number of distinct algorithms which are described by the term "predictive coding". This article provides a concise review of these different predictive coding algorithms, highlighting their similarities and differences. Five algorithms are covered: linear predictive coding which has a long and influential history in the signal processing literature; the first neuroscience-related application of predictive coding to explaining the function of the retina; and three versions of predictive coding that have been proposed to model cortical function. While all these algorithms aim to fit a generative model to sensory data, they differ in the type of generative model they employ, in the process used to optimise the fit between the model and sensory data, and in the way that they are related to neurobiology. Copyright © 2016 Elsevier Inc. All rights reserved.

  19. Plasmid-encoded hygromycin B resistance: the sequence of hygromycin B phosphotransferase gene and its expression in Escherichia coli and Saccharomyces cerevisiae.

    PubMed

    Gritz, L; Davies, J

    1983-11-01

    The plasmid-borne gene hph coding for hygromycin B phosphotransferase (HPH) in Escherichia coli has been identified and its nucleotide sequence determined. The hph gene is 1026 nucleotides long, coding for a protein with a predicted Mr of 39 000. The hph gene was placed in a shuttle plasmid vector, downstream from the promoter region of the cyc 1 gene of Saccharomyces cerevisiae, and an hph construction containing a single AUG in the 5' noncoding region allowed direct selection following transformation in yeast and in E. coli. Thus the hph gene can be used in cloning vectors for both pro- and eukaryotes.

  20. EST-PAC a web package for EST annotation and protein sequence prediction

    PubMed Central

    Strahm, Yvan; Powell, David; Lefèvre, Christophe

    2006-01-01

    With the decreasing cost of DNA sequencing technology and the vast diversity of biological resources, researchers increasingly face the basic challenge of annotating a larger number of expressed sequences tags (EST) from a variety of species. This typically consists of a series of repetitive tasks, which should be automated and easy to use. The results of these annotation tasks need to be stored and organized in a consistent way. All these operations should be self-installing, platform independent, easy to customize and amenable to using distributed bioinformatics resources available on the Internet. In order to address these issues, we present EST-PAC a web oriented multi-platform software package for expressed sequences tag (EST) annotation. EST-PAC provides a solution for the administration of EST and protein sequence annotations accessible through a web interface. Three aspects of EST annotation are automated: 1) searching local or remote biological databases for sequence similarities using Blast services, 2) predicting protein coding sequence from EST data and, 3) annotating predicted protein sequences with functional domain predictions. In practice, EST-PAC integrates the BLASTALL suite, EST-Scan2 and HMMER in a relational database system accessible through a simple web interface. EST-PAC also takes advantage of the relational database to allow consistent storage, powerful queries of results and, management of the annotation process. The system allows users to customize annotation strategies and provides an open-source data-management environment for research and education in bioinformatics. PMID:17147782

  1. Protein collapse is encoded in the folded state architecture.

    PubMed

    Samanta, Himadri S; Zhuravlev, Pavel I; Hinczewski, Michael; Hori, Naoto; Chakrabarti, Shaon; Thirumalai, D

    2017-05-21

    Folded states of single domain globular proteins are compact with high packing density. The radius of gyration, R g , of both the folded and unfolded states increase as N ν where N is the number of amino acids in the protein. The values of the Flory exponent ν are, respectively, ≈⅓ and ≈0.6 in the folded and unfolded states, coinciding with those for homopolymers. However, the extent of compaction of the unfolded state of a protein under low denaturant concentration (collapsibility), conditions favoring the formation of the folded state, is unknown. We develop a theory that uses the contact map of proteins as input to quantitatively assess collapsibility of proteins. Although collapsibility is universal, the propensity to be compact depends on the protein architecture. Application of the theory to over two thousand proteins shows that collapsibility depends not only on N but also on the contact map reflecting the native structure. A major prediction of the theory is that β-sheet proteins are far more collapsible than structures dominated by α-helices. The theory and the accompanying simulations, validating the theoretical predictions, provide insights into the differing conclusions reached using different experimental probes assessing the extent of compaction of proteins. By calculating the criterion for collapsibility as a function of protein length we provide quantitative insights into the reasons why single domain proteins are small and the physical reasons for the origin of multi-domain proteins. Collapsibility of non-coding RNA molecules is similar β-sheet proteins structures adding support to "Compactness Selection Hypothesis".

  2. Massively Convergent Evolution for Ribosomal Protein Gene Content in Plastid and Mitochondrial Genomes

    PubMed Central

    Maier, Uwe-G; Zauner, Stefan; Woehle, Christian; Bolte, Kathrin; Hempel, Franziska; Allen, John F.; Martin, William F.

    2013-01-01

    Plastid and mitochondrial genomes have undergone parallel evolution to encode the same functional set of genes. These encode conserved protein components of the electron transport chain in their respective bioenergetic membranes and genes for the ribosomes that express them. This highly convergent aspect of organelle genome evolution is partly explained by the redox regulation hypothesis, which predicts a separate plastid or mitochondrial location for genes encoding bioenergetic membrane proteins of either photosynthesis or respiration. Here we show that convergence in organelle genome evolution is far stronger than previously recognized, because the same set of genes for ribosomal proteins is independently retained by both plastid and mitochondrial genomes. A hitherto unrecognized selective pressure retains genes for the same ribosomal proteins in both organelles. On the Escherichia coli ribosome assembly map, the retained proteins are implicated in 30S and 50S ribosomal subunit assembly and initial rRNA binding. We suggest that ribosomal assembly imposes functional constraints that govern the retention of ribosomal protein coding genes in organelles. These constraints are subordinate to redox regulation for electron transport chain components, which anchor the ribosome to the organelle genome in the first place. As organelle genomes undergo reduction, the rRNAs also become smaller. Below size thresholds of approximately 1,300 nucleotides (16S rRNA) and 2,100 nucleotides (26S rRNA), all ribosomal protein coding genes are lost from organelles, while electron transport chain components remain organelle encoded as long as the organelles use redox chemistry to generate a proton motive force. PMID:24259312

  3. PHENOstruct: Prediction of human phenotype ontology terms using heterogeneous data sources.

    PubMed

    Kahanda, Indika; Funk, Christopher; Verspoor, Karin; Ben-Hur, Asa

    2015-01-01

    The human phenotype ontology (HPO) was recently developed as a standardized vocabulary for describing the phenotype abnormalities associated with human diseases. At present, only a small fraction of human protein coding genes have HPO annotations. But, researchers believe that a large portion of currently unannotated genes are related to disease phenotypes. Therefore, it is important to predict gene-HPO term associations using accurate computational methods. In this work we demonstrate the performance advantage of the structured SVM approach which was shown to be highly effective for Gene Ontology term prediction in comparison to several baseline methods. Furthermore, we highlight a collection of informative data sources suitable for the problem of predicting gene-HPO associations, including large scale literature mining data.

  4. Protein-RNA interface residue prediction using machine learning: an assessment of the state of the art.

    PubMed

    Walia, Rasna R; Caragea, Cornelia; Lewis, Benjamin A; Towfic, Fadi; Terribilini, Michael; El-Manzalawy, Yasser; Dobbs, Drena; Honavar, Vasant

    2012-05-10

    RNA molecules play diverse functional and structural roles in cells. They function as messengers for transferring genetic information from DNA to proteins, as the primary genetic material in many viruses, as catalysts (ribozymes) important for protein synthesis and RNA processing, and as essential and ubiquitous regulators of gene expression in living organisms. Many of these functions depend on precisely orchestrated interactions between RNA molecules and specific proteins in cells. Understanding the molecular mechanisms by which proteins recognize and bind RNA is essential for comprehending the functional implications of these interactions, but the recognition 'code' that mediates interactions between proteins and RNA is not yet understood. Success in deciphering this code would dramatically impact the development of new therapeutic strategies for intervening in devastating diseases such as AIDS and cancer. Because of the high cost of experimental determination of protein-RNA interfaces, there is an increasing reliance on statistical machine learning methods for training predictors of RNA-binding residues in proteins. However, because of differences in the choice of datasets, performance measures, and data representations used, it has been difficult to obtain an accurate assessment of the current state of the art in protein-RNA interface prediction. We provide a review of published approaches for predicting RNA-binding residues in proteins and a systematic comparison and critical assessment of protein-RNA interface residue predictors trained using these approaches on three carefully curated non-redundant datasets. We directly compare two widely used machine learning algorithms (Naïve Bayes (NB) and Support Vector Machine (SVM)) using three different data representations in which features are encoded using either sequence- or structure-based windows. Our results show that (i) Sequence-based classifiers that use a position-specific scoring matrix (PSSM)-based representation (PSSMSeq) outperform those that use an amino acid identity based representation (IDSeq) or a smoothed PSSM (SmoPSSMSeq); (ii) Structure-based classifiers that use smoothed PSSM representation (SmoPSSMStr) outperform those that use PSSM (PSSMStr) as well as sequence identity based representation (IDStr). PSSMSeq classifiers, when tested on an independent test set of 44 proteins, achieve performance that is comparable to that of three state-of-the-art structure-based predictors (including those that exploit geometric features) in terms of Matthews Correlation Coefficient (MCC), although the structure-based methods achieve substantially higher Specificity (albeit at the expense of Sensitivity) compared to sequence-based methods. We also find that the expected performance of the classifiers on a residue level can be markedly different from that on a protein level. Our experiments show that the classifiers trained on three different non-redundant protein-RNA interface datasets achieve comparable cross-validation performance. However, we find that the results are significantly affected by differences in the distance threshold used to define interface residues. Our results demonstrate that protein-RNA interface residue predictors that use a PSSM-based encoding of sequence windows outperform classifiers that use other encodings of sequence windows. While structure-based methods that exploit geometric features can yield significant increases in the Specificity of protein-RNA interface residue predictions, such increases are offset by decreases in Sensitivity. These results underscore the importance of comparing alternative methods using rigorous statistical procedures, multiple performance measures, and datasets that are constructed based on several alternative definitions of interface residues and redundancy cutoffs as well as including evaluations on independent test sets into the comparisons.

  5. Proteus: a random forest classifier to predict disorder-to-order transitioning binding regions in intrinsically disordered proteins

    NASA Astrophysics Data System (ADS)

    Basu, Sankar; Söderquist, Fredrik; Wallner, Björn

    2017-05-01

    The focus of the computational structural biology community has taken a dramatic shift over the past one-and-a-half decades from the classical protein structure prediction problem to the possible understanding of intrinsically disordered proteins (IDP) or proteins containing regions of disorder (IDPR). The current interest lies in the unraveling of a disorder-to-order transitioning code embedded in the amino acid sequences of IDPs/IDPRs. Disordered proteins are characterized by an enormous amount of structural plasticity which makes them promiscuous in binding to different partners, multi-functional in cellular activity and atypical in folding energy landscapes resembling partially folded molten globules. Also, their involvement in several deadly human diseases (e.g. cancer, cardiovascular and neurodegenerative diseases) makes them attractive drug targets, and important for a biochemical understanding of the disease(s). The study of the structural ensemble of IDPs is rather difficult, in particular for transient interactions. When bound to a structured partner, an IDPR adapts an ordered conformation in the complex. The residues that undergo this disorder-to-order transition are called protean residues, generally found in short contiguous stretches and the first step in understanding the modus operandi of an IDP/IDPR would be to predict these residues. There are a few available methods which predict these protean segments from their amino acid sequences; however, their performance reported in the literature leaves clear room for improvement. With this background, the current study presents `Proteus', a random forest classifier that predicts the likelihood of a residue undergoing a disorder-to-order transition upon binding to a potential partner protein. The prediction is based on features that can be calculated using the amino acid sequence alone. Proteus compares favorably with existing methods predicting twice as many true positives as the second best method (55 vs. 27%) with a much higher precision on an independent data set. The current study also sheds some light on a possible `disorder-to-order' transitioning consensus, untangled, yet embedded in the amino acid sequence of IDPs. Some guidelines have also been suggested for proceeding with a real-life structural modeling involving an IDPR using Proteus.

  6. Stochastic many-body problems in ecology, evolution, neuroscience, and systems biology

    NASA Astrophysics Data System (ADS)

    Butler, Thomas C.

    Using the tools of many-body theory, I analyze problems in four different areas of biology dominated by strong fluctuations: The evolutionary history of the genetic code, spatiotemporal pattern formation in ecology, spatiotemporal pattern formation in neuroscience and the robustness of a model circadian rhythm circuit in systems biology. In the first two research chapters, I demonstrate that the genetic code is extremely optimal (in the sense that it manages the effects of point mutations or mistranslations efficiently), more than an order of magnitude beyond what was previously thought. I further show that the structure of the genetic code implies that early proteins were probably only loosely defined. Both the nature of early proteins and the extreme optimality of the genetic code are interpreted in light of recent theory [1] as evidence that the evolution of the genetic code was driven by evolutionary dynamics that were dominated by horizontal gene transfer. I then explore the optimality of a proposed precursor to the genetic code. The results show that the precursor code has only limited optimality, which is interpreted as evidence that the precursor emerged prior to translation, or else never existed. In the next part of the dissertation, I introduce a many-body formalism for reaction-diffusion systems described at the mesoscopic scale with master equations. I first apply this formalism to spatially-extended predator-prey ecosystems, resulting in the prediction that many-body correlations and fluctuations drive population cycles in time, called quasicycles. Most of these results were previously known, but were derived using the system size expansion [2, 3]. I next apply the analytical techniques developed in the study of quasi-cycles to a simple model of Turing patterns in a predator-prey ecosystem. This analysis shows that fluctuations drive the formation of a new kind of spatiotemporal pattern formation that I name "quasi-patterns." These quasi-patterns exist over a much larger range of physically accessible parameters than the patterns predicted in mean field theory and therefore account for the apparent observations in ecology of patterns in regimes where Turing patterns do not occur. I further show that quasi-patterns have statistical properties that allow them to be distinguished empirically from mean field Turing patterns. I next analyze a model of visual cortex in the brain that has striking similarities to the activator-inhibitor model of ecosystem quasi-pattern formation. Through analysis of the resulting phase diagram, I show that the architecture of the neural network in the visual cortex is configured to make the visual cortex robust to unwanted internally generated spatial structure that interferes with normal visual function. I also predict that some geometric visual hallucinations are quasi-patterns and that the visual cortex supports a new phase of spatially scale invariant behavior present far from criticality. In the final chapter, I explore the effects of fluctuations on cycles in systems biology, specifically the pervasive phenomenon of circadian rhythms. By exploring the behavior of a generic stochastic model of circadian rhythms, I show that the circadian rhythm circuit exploits leaky mRNA production to safeguard the cycle from failure. I also show that this safeguard mechanism is highly robust to changes in the rate of leaky mRNA production. Finally, I explore the failure of the deterministic model in two different contexts, one where the deterministic model predicts cycles where they do not exist, and another context in which cycles are not predicted by the deterministic model.

  7. Computational RNomics of Drosophilids

    PubMed Central

    Rose, Dominic; Hackermüller, Jörg; Washietl, Stefan; Reiche, Kristin; Hertel, Jana; Findeiß, Sven; Stadler, Peter F; Prohaska, Sonja J

    2007-01-01

    Background Recent experimental and computational studies have provided overwhelming evidence for a plethora of diverse transcripts that are unrelated to protein-coding genes. One subclass consists of those RNAs that require distinctive secondary structure motifs to exert their biological function and hence exhibit distinctive patterns of sequence conservation characteristic for positive selection on RNA secondary structure. The deep-sequencing of 12 drosophilid species coordinated by the NHGRI provides an ideal data set of comparative computational approaches to determine those genomic loci that code for evolutionarily conserved RNA motifs. This class of loci includes the majority of the known small ncRNAs as well as structured RNA motifs in mRNAs. We report here on a genome-wide survey using RNAz. Results We obtain 16 000 high quality predictions among which we recover the majority of the known ncRNAs. Taking a pessimistically estimated false discovery rate of 40% into account, this implies that at least some ten thousand loci in the Drosophila genome show the hallmarks of stabilizing selection action of RNA structure, and hence are most likely functional at the RNA level. A subset of RNAz predictions overlapping with TRF1 and BRF binding sites [Isogai et al., EMBO J. 26: 79–89 (2007)], which are plausible candidates of Pol III transcripts, have been studied in more detail. Among these sequences we identify several "clusters" of ncRNA candidates with striking structural similarities. Conclusion The statistical evaluation of the RNAz predictions in comparison with a similar analysis of vertebrate genomes [Washietl et al., Nat. Biotech. 23: 1383–1390 (2005)] shows that qualitatively similar fractions of structured RNAs are found in introns, UTRs, and intergenic regions. The intergenic RNA structures, however, are concentrated much more closely around known protein-coding loci, suggesting that flies have significantly smaller complement of independent structured ncRNAs compared to mammals. PMID:17996037

  8. Deciphering mRNA Sequence Determinants of Protein Production Rate

    NASA Astrophysics Data System (ADS)

    Szavits-Nossan, Juraj; Ciandrini, Luca; Romano, M. Carmen

    2018-03-01

    One of the greatest challenges in biophysical models of translation is to identify coding sequence features that affect the rate of translation and therefore the overall protein production in the cell. We propose an analytic method to solve a translation model based on the inhomogeneous totally asymmetric simple exclusion process, which allows us to unveil simple design principles of nucleotide sequences determining protein production rates. Our solution shows an excellent agreement when compared to numerical genome-wide simulations of S. cerevisiae transcript sequences and predicts that the first 10 codons, which is the ribosome footprint length on the mRNA, together with the value of the initiation rate, are the main determinants of protein production rate under physiological conditions. Finally, we interpret the obtained analytic results based on the evolutionary role of the codons' choice for regulating translation rates and ribosome densities.

  9. Can mutational GC-pressure create new linear B-cell epitopes in herpes simplex virus type 1 glycoprotein B?

    PubMed

    Khrustalev, Vladislav Victorovich

    2009-01-01

    We showed that GC-content of nucleotide sequences coding for linear B-cell epitopes of herpes simplex virus type 1 (HSV1) glycoprotein B (gB) is higher than GC-content of sequences coding for epitope-free regions of this glycoprotein (G + C = 73 and 64%, respectively). Linear B-cell epitopes have been predicted in HSV1 gB by BepiPred algorithm ( www.cbs.dtu.dk/services/BepiPred ). Proline is an acrophilic amino acid residue (it is usually situated on the surface of protein globules, and so included in linear B-cell epitopes). Indeed, the level of proline is much higher in predicted epitopes of gB than in epitope-free regions (17.8% versus 1.8%). This amino acid is coded by GC-rich codons (CCX) that can be produced due to nucleotide substitutions caused by mutational GC-pressure. GC-pressure will also lead to disappearance of acrophobic phenylalanine, isoleucine, methionine and tyrosine coded by GC-poor codons. Results of our "in-silico directed mutagenesis" showed that single nonsynonymous substitutions in AT to GC direction in two long epitope-free regions of gB will cause formation of new linear epitopes or elongation of previously existing epitopes flanking these regions in 25% of 539 possible cases. The calculations of GC-content and amino acid content have been performed by CodonChanges algorithm ( www.barkovsky.hotmail.ru ).

  10. MetaPSICOV: combining coevolution methods for accurate prediction of contacts and long range hydrogen bonding in proteins.

    PubMed

    Jones, David T; Singh, Tanya; Kosciolek, Tomasz; Tetchner, Stuart

    2015-04-01

    Recent developments of statistical techniques to infer direct evolutionary couplings between residue pairs have rendered covariation-based contact prediction a viable means for accurate 3D modelling of proteins, with no information other than the sequence required. To extend the usefulness of contact prediction, we have designed a new meta-predictor (MetaPSICOV) which combines three distinct approaches for inferring covariation signals from multiple sequence alignments, considers a broad range of other sequence-derived features and, uniquely, a range of metrics which describe both the local and global quality of the input multiple sequence alignment. Finally, we use a two-stage predictor, where the second stage filters the output of the first stage. This two-stage predictor is additionally evaluated on its ability to accurately predict the long range network of hydrogen bonds, including correctly assigning the donor and acceptor residues. Using the original PSICOV benchmark set of 150 protein families, MetaPSICOV achieves a mean precision of 0.54 for top-L predicted long range contacts-around 60% higher than PSICOV, and around 40% better than CCMpred. In de novo protein structure prediction using FRAGFOLD, MetaPSICOV is able to improve the TM-scores of models by a median of 0.05 compared with PSICOV. Lastly, for predicting long range hydrogen bonding, MetaPSICOV-HB achieves a precision of 0.69 for the top-L/10 hydrogen bonds compared with just 0.26 for the baseline MetaPSICOV. MetaPSICOV is available as a freely available web server at http://bioinf.cs.ucl.ac.uk/MetaPSICOV. Raw data (predicted contact lists and 3D models) and source code can be downloaded from http://bioinf.cs.ucl.ac.uk/downloads/MetaPSICOV. Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press.

  11. Protein-driven inference of miRNA–disease associations

    PubMed Central

    Mørk, Søren; Pletscher-Frankild, Sune; Palleja Caro, Albert; Gorodkin, Jan; Jensen, Lars Juhl

    2014-01-01

    Motivation: MicroRNAs (miRNAs) are a highly abundant class of non-coding RNA genes involved in cellular regulation and thus also diseases. Despite miRNAs being important disease factors, miRNA–disease associations remain low in number and of variable reliability. Furthermore, existing databases and prediction methods do not explicitly facilitate forming hypotheses about the possible molecular causes of the association, thereby making the path to experimental follow-up longer. Results: Here we present miRPD in which miRNA–Protein–Disease associations are explicitly inferred. Besides linking miRNAs to diseases, it directly suggests the underlying proteins involved, which can be used to form hypotheses that can be experimentally tested. The inference of miRNAs and diseases is made by coupling known and predicted miRNA–protein associations with protein–disease associations text mined from the literature. We present scoring schemes that allow us to rank miRNA–disease associations inferred from both curated and predicted miRNA targets by reliability and thereby to create high- and medium-confidence sets of associations. Analyzing these, we find statistically significant enrichment for proteins involved in pathways related to cancer and type I diabetes mellitus, suggesting either a literature bias or a genuine biological trend. We show by example how the associations can be used to extract proteins for disease hypothesis. Availability and implementation: All datasets, software and a searchable Web site are available at http://mirpd.jensenlab.org. Contact: lars.juhl.jensen@cpr.ku.dk or gorodkin@rth.dk PMID:24273243

  12. Draft Genome Sequence of a Novel Chitinophaga sp. Strain, MD30, Isolated from a Biofilm in an Air Conditioner Condensate Pipe

    PubMed Central

    Darris, Maxwell

    2017-01-01

    ABSTRACT Most of the 24 known Chitinophaga species were originally isolated from soils. We report the draft genome sequence of a putatively novel Chitinophaga sp. from a biofilm in an air conditioner condensate pipe. The genome comprises 7,661,303 bp in one scaffold, 5,694 predicted protein-coding sequences, and a G+C content of 47.6%. PMID:29051259

  13. Draft Genome Sequence of the Terrestrial Cyanobacterium Scytonema millei VB511283, Isolated from Eastern India.

    PubMed

    Sen, Diya; Chandrababunaidu, Mathu Malar; Singh, Deeksha; Sanghi, Neha; Ghorai, Arpita; Mishra, Gyan Prakash; Madduluri, Madhavi; Adhikary, Siba Prasad; Tripathy, Sucheta

    2015-03-05

    We report here the draft genome sequence of Scytonema millei VB511283, a cyanobacterium isolated from biofilms on the exterior of stone monuments in Santiniketan, eastern India. The draft genome is 11,627,246 bp long (11.63 Mb), with 118 scaffolds. About 9,011 protein-coding genes, 117 tRNAs, and 12 rRNAs are predicted from this assembly. Copyright © 2015 Sen et al.

  14. Draft Genome Sequence of Cellulolytic and Xylanolytic Cellulomonas sp. Strain B6 Isolated from Subtropical Forest Soil.

    PubMed

    Piccinni, Florencia; Murua, Yanina; Ghio, Silvina; Talia, Paola; Rivarola, Máximo; Campos, Eleonora

    2016-08-25

    Cellulomonas sp. strain B6 was isolated from a subtropical forest soil sample and presented (hemi)cellulose-degrading activity. We report here its draft genome sequence, with an estimated genome size of 4 Mb, a G+C content of 75.1%, and 3,443 predicted protein-coding sequences, 92 of which are glycosyl hydrolases involved in polysaccharide degradation. Copyright © 2016 Piccinni et al.

  15. Draft Genome Sequence of the Entomopathogenic Bacterium Bacillus pumilus 15.1, a Strain Highly Toxic to the Mediterranean Fruit Fly Ceratitis capitata

    PubMed Central

    García-Ramón, Diana C.; Palma, Leopoldo; Berry, Colin; Osuna, Antonio

    2015-01-01

    We present the draft whole-genome sequence of the entomopathogenic Bacillus pumilus 15.1 strain that consists of 3,795,691 bp and 3,776 predicted protein-coding genes. This genome sequence provides the basis for understanding the potential mechanism behind the toxicity and virulence of B. pumilus 15.1 against the Mediterranean fruit fly. PMID:26404596

  16. Improving protein-protein interactions prediction accuracy using protein evolutionary information and relevance vector machine model.

    PubMed

    An, Ji-Yong; Meng, Fan-Rong; You, Zhu-Hong; Chen, Xing; Yan, Gui-Ying; Hu, Ji-Pu

    2016-10-01

    Predicting protein-protein interactions (PPIs) is a challenging task and essential to construct the protein interaction networks, which is important for facilitating our understanding of the mechanisms of biological systems. Although a number of high-throughput technologies have been proposed to predict PPIs, there are unavoidable shortcomings, including high cost, time intensity, and inherently high false positive rates. For these reasons, many computational methods have been proposed for predicting PPIs. However, the problem is still far from being solved. In this article, we propose a novel computational method called RVM-BiGP that combines the relevance vector machine (RVM) model and Bi-gram Probabilities (BiGP) for PPIs detection from protein sequences. The major improvement includes (1) Protein sequences are represented using the Bi-gram probabilities (BiGP) feature representation on a Position Specific Scoring Matrix (PSSM), in which the protein evolutionary information is contained; (2) For reducing the influence of noise, the Principal Component Analysis (PCA) method is used to reduce the dimension of BiGP vector; (3) The powerful and robust Relevance Vector Machine (RVM) algorithm is used for classification. Five-fold cross-validation experiments executed on yeast and Helicobacter pylori datasets, which achieved very high accuracies of 94.57 and 90.57%, respectively. Experimental results are significantly better than previous methods. To further evaluate the proposed method, we compare it with the state-of-the-art support vector machine (SVM) classifier on the yeast dataset. The experimental results demonstrate that our RVM-BiGP method is significantly better than the SVM-based method. In addition, we achieved 97.15% accuracy on imbalance yeast dataset, which is higher than that of balance yeast dataset. The promising experimental results show the efficiency and robust of the proposed method, which can be an automatic decision support tool for future proteomics research. For facilitating extensive studies for future proteomics research, we developed a freely available web server called RVM-BiGP-PPIs in Hypertext Preprocessor (PHP) for predicting PPIs. The web server including source code and the datasets are available at http://219.219.62.123:8888/BiGP/. © 2016 The Authors Protein Science published by Wiley Periodicals, Inc. on behalf of The Protein Society.

  17. Position specific variation in the rate of evolution in transcription factor binding sites

    PubMed Central

    Moses, Alan M; Chiang, Derek Y; Kellis, Manolis; Lander, Eric S; Eisen, Michael B

    2003-01-01

    Background The binding sites of sequence specific transcription factors are an important and relatively well-understood class of functional non-coding DNAs. Although a wide variety of experimental and computational methods have been developed to characterize transcription factor binding sites, they remain difficult to identify. Comparison of non-coding DNA from related species has shown considerable promise in identifying these functional non-coding sequences, even though relatively little is known about their evolution. Results Here we analyse the genome sequences of the budding yeasts Saccharomyces cerevisiae, S. bayanus, S. paradoxus and S. mikatae to study the evolution of transcription factor binding sites. As expected, we find that both experimentally characterized and computationally predicted binding sites evolve slower than surrounding sequence, consistent with the hypothesis that they are under purifying selection. We also observe position-specific variation in the rate of evolution within binding sites. We find that the position-specific rate of evolution is positively correlated with degeneracy among binding sites within S. cerevisiae. We test theoretical predictions for the rate of evolution at positions where the base frequencies deviate from background due to purifying selection and find reasonable agreement with the observed rates of evolution. Finally, we show how the evolutionary characteristics of real binding motifs can be used to distinguish them from artefacts of computational motif finding algorithms. Conclusion As has been observed for protein sequences, the rate of evolution in transcription factor binding sites varies with position, suggesting that some regions are under stronger functional constraint than others. This variation likely reflects the varying importance of different positions in the formation of the protein-DNA complex. The characterization of the pattern of evolution in known binding sites will likely contribute to the effective use of comparative sequence data in the identification of transcription factor binding sites and is an important step toward understanding the evolution of functional non-coding DNA. PMID:12946282

  18. A genome-wide identification and analysis of the DYW-deaminase genes in the pentatricopeptide repeat gene family in cotton (Gossypium spp.)

    PubMed Central

    Liu, Guoyuan; Li, Xue; Guo, Liping; Zhang, Xuexian; Qi, Tingxiang; Wang, Hailin; Tang, Huini; Qiao, Xiuqin; Zhang, Jinfa; Xing, Chaozhu; Wu, Jianyong

    2017-01-01

    The RNA editing occurring in plant organellar genomes mainly involves the change of cytidine to uridine. This process involves a deamination reaction, with cytidine deaminase as the catalyst. Pentatricopeptide repeat (PPR) proteins with a C-terminal DYW domain are reportedly associated with cytidine deamination, similar to members of the deaminase superfamily. PPR genes are involved in many cellular functions and biological processes including fertility restoration to cytoplasmic male sterility (CMS) in plants. In this study, we identified 227 and 211 DYW deaminase-coding PPR genes for the cultivated tetraploid cotton species G. hirsutum and G. barbadense (2n = 4x = 52), respectively, as well as 126 and 97 DYW deaminase-coding PPR genes in the ancestral diploid species G. raimondii and G. arboreum (2n = 26), respectively. The 227 G. hirsutum PPR genes were predicted to encode 52–2016 amino acids, 203 of which were mapped onto 26 chromosomes. Most DYW deaminase genes lacked introns, and their proteins were predicted to target the mitochondria or chloroplasts. Additionally, the DYW domain differed from the complete DYW deaminase domain, which contained part of the E domain and the entire E+ domain. The types and number of DYW tripeptides may have been influenced by evolutionary processes, with some tripeptides being lost. Furthermore, a gene ontology analysis revealed that DYW deaminase functions were mainly related to binding as well as hydrolase and transferase activities. The G. hirsutum DYW deaminase expression profiles varied among different cotton tissues and developmental stages, and no differentially expressed DYW deaminase-coding PPRs were directly associated with the male sterility and restoration in the CMS-D2 system. Our current study provides an important piece of information regarding the structural and evolutionary characteristics of Gossypium DYW-containing PPR genes coding for deaminases and will be useful for characterizing the DYW deaminase gene family in cotton biology and breeding. PMID:28339482

  19. Tissue-specific Proteogenomic Analysis of Plutella xylostella Larval Midgut Using a Multialgorithm Pipeline.

    PubMed

    Zhu, Xun; Xie, Shangbo; Armengaud, Jean; Xie, Wen; Guo, Zhaojiang; Kang, Shi; Wu, Qingjun; Wang, Shaoli; Xia, Jixing; He, Rongjun; Zhang, Youjun

    2016-06-01

    The diamondback moth, Plutella xylostella (L.), is the major cosmopolitan pest of brassica and other cruciferous crops. Its larval midgut is a dynamic tissue that interfaces with a wide variety of toxicological and physiological processes. The draft sequence of the P. xylostella genome was recently released, but its annotation remains challenging because of the low sequence coverage of this branch of life and the poor description of exon/intron splicing rules for these insects. Peptide sequencing by computational assignment of tandem mass spectra to genome sequence information provides an experimental independent approach for confirming or refuting protein predictions, a concept that has been termed proteogenomics. In this study, we carried out an in-depth proteogenomic analysis to complement genome annotation of P. xylostella larval midgut based on shotgun HPLC-ESI-MS/MS data by means of a multialgorithm pipeline. A total of 876,341 tandem mass spectra were searched against the predicted P. xylostella protein sequences and a whole-genome six-frame translation database. Based on a data set comprising 2694 novel genome search specific peptides, we discovered 439 novel protein-coding genes and corrected 128 existing gene models. To get the most accurate data to seed further insect genome annotation, more than half of the novel protein-coding genes, i.e. 235 over 439, were further validated after RT-PCR amplification and sequencing of the corresponding transcripts. Furthermore, we validated 53 novel alternative splicings. Finally, a total of 6764 proteins were identified, resulting in one of the most comprehensive proteogenomic study of a nonmodel animal. As the first tissue-specific proteogenomics analysis of P. xylostella, this study provides the fundamental basis for high-throughput proteomics and functional genomics approaches aimed at deciphering the molecular mechanisms of resistance and controlling this pest. © 2016 by The American Society for Biochemistry and Molecular Biology, Inc.

  20. Annotation-based inference of transporter function.

    PubMed

    Lee, Thomas J; Paulsen, Ian; Karp, Peter

    2008-07-01

    We present a method for inferring and constructing transport reactions for transporter proteins based primarily on the analysis of the names of individual proteins in the genome annotation of an organism. Transport reactions are declarative descriptions of transporter activities, and thus can be manipulated computationally, unlike free-text protein names. Once transporter activities are encoded as transport reactions, a number of computational analyses are possible including database queries by transporter activity; inclusion of transporters into an automatically generated metabolic-map diagram that can be painted with omics data to aid in their interpretation; detection of anomalies in the metabolic and transport networks, such as substrates that are transported into the cell but are not inputs to any metabolic reaction or pathway; and comparative analyses of the transport capabilities of different organisms. On randomly selected organisms, the method achieves precision and recall rates of 0.93 and 0.90, respectively in identifying transporter proteins by name within the complete genome. The method obtains 67.5% accuracy in predicting complete transport reactions; if allowance is made for predictions that are overly general yet not incorrect, reaction prediction accuracy is 82.5%. The method is implemented as part of PathoLogic, the inference component of the Pathway Tools software. Pathway Tools is freely available to researchers at non-commercial institutions, including source code; a fee applies to commercial institutions. Supplementary data are available at Bioinformatics online.

  1. Reverse vaccinology as an approach for developing Histophilus somni vaccine candidates.

    PubMed

    Madampage, Claudia Avis; Rawlyk, Neil; Crockford, Gordon; Wang, Yejun; White, Aaron P; Brownlie, Robert; Van Donkersgoed, Joyce; Dorin, Craig; Potter, Andrew

    2015-11-01

    Histophilosis of cattle is caused by the Gram negative bacterial pathogen Histophilus somni (H. somni) which is also associated with the bovine respiratory disease (BRD) complex. Existing vaccines for H. somni include either killed cells or bacteria-free outer membrane proteins from the organism which have proven to be moderately successful. In this study, reverse vaccinology was used to predict potential H. somni vaccine candidates from genome sequences. In turn, these may protect animals against new strains circulating in the field. Whole genome sequencing of six recent clinical H. somni isolates was performed using an Illumina MiSeq and compared to six genomes from the 1980's. De novo assembly of crude whole genomes was completed using Geneious 6.1.7. Protein coding regions was predicted using Glimmer3. Scores from multiple web-based programs were utilized to evaluate the antigenicity of these predicted proteins which were finally ranked based on their surface exposure scores. A single new strain was selected for future vaccine development based on conservation of the protein candidates among all 12 isolates. A positive signal with convalescent serum for these antigens in western blots indicates in vivo recognition. In order to test the protective capacity of these antigens bovine animal trials are ongoing. Copyright © 2015 The International Alliance for Biological Standardization. Published by Elsevier Ltd. All rights reserved.

  2. Putative Nonribosomal Peptide Synthetase and Cytochrome P450 Genes Responsible for Tentoxin Biosynthesis in Alternaria alternata ZJ33.

    PubMed

    Li, You-Hai; Han, Wen-Jin; Gui, Xi-Wu; Wei, Tao; Tang, Shuang-Yan; Jin, Jian-Ming

    2016-08-02

    Tentoxin, a cyclic tetrapeptide produced by several Alternaria species, inhibits the F₁-ATPase activity of chloroplasts, resulting in chlorosis in sensitive plants. In this study, we report two clustered genes, encoding a putative non-ribosome peptide synthetase (NRPS) TES and a cytochrome P450 protein TES1, that are required for tentoxin biosynthesis in Alternaria alternata strain ZJ33, which was isolated from blighted leaves of Eupatorium adenophorum. Using a pair of primers designed according to the consensus sequences of the adenylation domain of NRPSs, two fragments containing putative adenylation domains were amplified from A. alternata ZJ33, and subsequent PCR analyses demonstrated that these fragments belonged to the same NRPS coding sequence. With no introns, TES consists of a single 15,486 base pair open reading frame encoding a predicted 5161 amino acid protein. Meanwhile, the TES1 gene is predicted to contain five introns and encode a 506 amino acid protein. The TES protein is predicted to be comprised of four peptide synthase modules with two additional N-methylation domains, and the number and arrangement of the modules in TES were consistent with the number and arrangement of the amino acid residues of tentoxin, respectively. Notably, both TES and TES1 null mutants generated via homologous recombination failed to produce tentoxin. This study provides the first evidence concerning the biosynthesis of tentoxin in A. alternata.

  3. Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants.

    PubMed

    Fu, Wenqing; O'Connor, Timothy D; Jun, Goo; Kang, Hyun Min; Abecasis, Goncalo; Leal, Suzanne M; Gabriel, Stacey; Rieder, Mark J; Altshuler, David; Shendure, Jay; Nickerson, Deborah A; Bamshad, Michael J; Akey, Joshua M

    2013-01-10

    Establishing the age of each mutation segregating in contemporary human populations is important to fully understand our evolutionary history and will help to facilitate the development of new approaches for disease-gene discovery. Large-scale surveys of human genetic variation have reported signatures of recent explosive population growth, notable for an excess of rare genetic variants, suggesting that many mutations arose recently. To more quantitatively assess the distribution of mutation ages, we resequenced 15,336 genes in 6,515 individuals of European American and African American ancestry and inferred the age of 1,146,401 autosomal single nucleotide variants (SNVs). We estimate that approximately 73% of all protein-coding SNVs and approximately 86% of SNVs predicted to be deleterious arose in the past 5,000-10,000 years. The average age of deleterious SNVs varied significantly across molecular pathways, and disease genes contained a significantly higher proportion of recently arisen deleterious SNVs than other genes. Furthermore, European Americans had an excess of deleterious variants in essential and Mendelian disease genes compared to African Americans, consistent with weaker purifying selection due to the Out-of-Africa dispersal. Our results better delimit the historical details of human protein-coding variation, show the profound effect of recent human history on the burden of deleterious SNVs segregating in contemporary populations, and provide important practical information that can be used to prioritize variants in disease-gene discovery.

  4. MOCAT: A Metagenomics Assembly and Gene Prediction Toolkit

    PubMed Central

    Li, Junhua; Chen, Weineng; Chen, Hua; Mende, Daniel R.; Arumugam, Manimozhiyan; Pan, Qi; Liu, Binghang; Qin, Junjie; Wang, Jun; Bork, Peer

    2012-01-01

    MOCAT is a highly configurable, modular pipeline for fast, standardized processing of single or paired-end sequencing data generated by the Illumina platform. The pipeline uses state-of-the-art programs to quality control, map, and assemble reads from metagenomic samples sequenced at a depth of several billion base pairs, and predict protein-coding genes on assembled metagenomes. Mapping against reference databases allows for read extraction or removal, as well as abundance calculations. Relevant statistics for each processing step can be summarized into multi-sheet Excel documents and queryable SQL databases. MOCAT runs on UNIX machines and integrates seamlessly with the SGE and PBS queuing systems, commonly used to process large datasets. The open source code and modular architecture allow users to modify or exchange the programs that are utilized in the various processing steps. Individual processing steps and parameters were benchmarked and tested on artificial, real, and simulated metagenomes resulting in an improvement of selected quality metrics. MOCAT can be freely downloaded at http://www.bork.embl.de/mocat/. PMID:23082188

  5. MOCAT: a metagenomics assembly and gene prediction toolkit.

    PubMed

    Kultima, Jens Roat; Sunagawa, Shinichi; Li, Junhua; Chen, Weineng; Chen, Hua; Mende, Daniel R; Arumugam, Manimozhiyan; Pan, Qi; Liu, Binghang; Qin, Junjie; Wang, Jun; Bork, Peer

    2012-01-01

    MOCAT is a highly configurable, modular pipeline for fast, standardized processing of single or paired-end sequencing data generated by the Illumina platform. The pipeline uses state-of-the-art programs to quality control, map, and assemble reads from metagenomic samples sequenced at a depth of several billion base pairs, and predict protein-coding genes on assembled metagenomes. Mapping against reference databases allows for read extraction or removal, as well as abundance calculations. Relevant statistics for each processing step can be summarized into multi-sheet Excel documents and queryable SQL databases. MOCAT runs on UNIX machines and integrates seamlessly with the SGE and PBS queuing systems, commonly used to process large datasets. The open source code and modular architecture allow users to modify or exchange the programs that are utilized in the various processing steps. Individual processing steps and parameters were benchmarked and tested on artificial, real, and simulated metagenomes resulting in an improvement of selected quality metrics. MOCAT can be freely downloaded at http://www.bork.embl.de/mocat/.

  6. Experimentally assessing molecular dynamics sampling of the protein native state conformational distribution

    PubMed Central

    Hernández, Griselda; Anderson, Janet S.; LeMaster, David M.

    2012-01-01

    The acute sensitivity to conformation exhibited by amide hydrogen exchange reactivity provides a valuable test for the physical accuracy of model ensembles developed to represent the Boltzmann distribution of the protein native state. A number of molecular dynamics studies of ubiquitin have predicted a well-populated transition in the tight turn immediately preceding the primary site of proteasome-directed polyubiquitylation Lys 48. Amide exchange reactivity analysis demonstrates that this transition is 103-fold rarer than these predictions. More strikingly, for the most populated novel conformational basin predicted from a recent 1 ms MD simulation of bovine pancreatic trypsin inhibitor (at 13% of total), experimental hydrogen exchange data indicates a population below 10−6. The most sophisticated efforts to directly incorporate experimental constraints into the derivation of model protein ensembles have been applied to ubiquitin, as illustrated by three recently deposited studies (PDB codes 2NR2, 2K39 and 2KOX). Utilizing the extensive set of experimental NOE constraints, each of these three ensembles yields a modestly more accurate prediction of the exchange rates for the highly exposed amides than does a standard unconstrained molecular simulation. However, for the less frequently exposed amide hydrogens, the 2NR2 ensemble offers no improvement in rate predictions as compared to the unconstrained MD ensemble. The other two NMR-constrained ensembles performed markedly worse, either underestimating (2KOX) or overestimating (2K39) the extent of conformational diversity. PMID:22425325

  7. Cell cycle, oncogenic and tumor suppressor pathways regulate numerous long and macro non-protein-coding RNAs

    PubMed Central

    2014-01-01

    Background The genome is pervasively transcribed but most transcripts do not code for proteins, constituting non-protein-coding RNAs. Despite increasing numbers of functional reports of individual long non-coding RNAs (lncRNAs), assessing the extent of functionality among the non-coding transcriptional output of mammalian cells remains intricate. In the protein-coding world, transcripts differentially expressed in the context of processes essential for the survival of multicellular organisms have been instrumental in the discovery of functionally relevant proteins and their deregulation is frequently associated with diseases. We therefore systematically identified lncRNAs expressed differentially in response to oncologically relevant processes and cell-cycle, p53 and STAT3 pathways, using tiling arrays. Results We found that up to 80% of the pathway-triggered transcriptional responses are non-coding. Among these we identified very large macroRNAs with pathway-specific expression patterns and demonstrated that these are likely continuous transcripts. MacroRNAs contain elements conserved in mammals and sauropsids, which in part exhibit conserved RNA secondary structure. Comparing evolutionary rates of a macroRNA to adjacent protein-coding genes suggests a local action of the transcript. Finally, in different grades of astrocytoma, a tumor disease unrelated to the initially used cell lines, macroRNAs are differentially expressed. Conclusions It has been shown previously that the majority of expressed non-ribosomal transcripts are non-coding. We now conclude that differential expression triggered by signaling pathways gives rise to a similar abundance of non-coding content. It is thus unlikely that the prevalence of non-coding transcripts in the cell is a trivial consequence of leaky or random transcription events. PMID:24594072

  8. Shewregdb: Database and visualization environment for experimental and predicted regulatory information in Shewanella oneidensis mr-1

    PubMed Central

    Syed, Mustafa H; Karpinets, Tatiana V; Leuze, Michael R; Kora, Guruprasad H; Romine, Margaret R; Uberbacher, Edward C

    2009-01-01

    Shewanella oneidensis MR-1 is an important model organism for environmental research as it has an exceptional metabolic and respiratory versatility regulated by a complex regulatory network. We have developed a database to collect experimental and computational data relating to regulation of gene and protein expression, and, a visualization environment that enables integration of these data types. The regulatory information in the database includes predictions of DNA regulator binding sites, sigma factor binding sites, transcription units, operons, promoters, and RNA regulators including non-coding RNAs, riboswitches, and different types of terminators. Availability http://shewanella-knowledgebase.org:8080/Shewanella/gbrowserLanding.jsp PMID:20198195

  9. A method for partitioning the information contained in a protein sequence between its structure and function.

    PubMed

    Possenti, Andrea; Vendruscolo, Michele; Camilloni, Carlo; Tiana, Guido

    2018-05-23

    Proteins employ the information stored in the genetic code and translated into their sequences to carry out well-defined functions in the cellular environment. The possibility to encode for such functions is controlled by the balance between the amount of information supplied by the sequence and that left after that the protein has folded into its structure. We study the amount of information necessary to specify the protein structure, providing an estimate that keeps into account the thermodynamic properties of protein folding. We thus show that the information remaining in the protein sequence after encoding for its structure (the 'information gap') is very close to what needed to encode for its function and interactions. Then, by predicting the information gap directly from the protein sequence, we show that it may be possible to use these insights from information theory to discriminate between ordered and disordered proteins, to identify unknown functions, and to optimize artificially-designed protein sequences. This article is protected by copyright. All rights reserved. © 2018 Wiley Periodicals, Inc.

  10. The Coding of Biological Information: From Nucleotide Sequence to Protein Recognition

    NASA Astrophysics Data System (ADS)

    Štambuk, Nikola

    The paper reviews the classic results of Swanson, Dayhoff, Grantham, Blalock and Root-Bernstein, which link genetic code nucleotide patterns to the protein structure, evolution and molecular recognition. Symbolic representation of the binary addresses defining particular nucleotide and amino acid properties is discussed, with consideration of: structure and metric of the code, direct correspondence between amino acid and nucleotide information, and molecular recognition of the interacting protein motifs coded by the complementary DNA and RNA strands.

  11. PredictSNP: Robust and Accurate Consensus Classifier for Prediction of Disease-Related Mutations

    PubMed Central

    Bendl, Jaroslav; Stourac, Jan; Salanda, Ondrej; Pavelka, Antonin; Wieben, Eric D.; Zendulka, Jaroslav; Brezovsky, Jan; Damborsky, Jiri

    2014-01-01

    Single nucleotide variants represent a prevalent form of genetic variation. Mutations in the coding regions are frequently associated with the development of various genetic diseases. Computational tools for the prediction of the effects of mutations on protein function are very important for analysis of single nucleotide variants and their prioritization for experimental characterization. Many computational tools are already widely employed for this purpose. Unfortunately, their comparison and further improvement is hindered by large overlaps between the training datasets and benchmark datasets, which lead to biased and overly optimistic reported performances. In this study, we have constructed three independent datasets by removing all duplicities, inconsistencies and mutations previously used in the training of evaluated tools. The benchmark dataset containing over 43,000 mutations was employed for the unbiased evaluation of eight established prediction tools: MAPP, nsSNPAnalyzer, PANTHER, PhD-SNP, PolyPhen-1, PolyPhen-2, SIFT and SNAP. The six best performing tools were combined into a consensus classifier PredictSNP, resulting into significantly improved prediction performance, and at the same time returned results for all mutations, confirming that consensus prediction represents an accurate and robust alternative to the predictions delivered by individual tools. A user-friendly web interface enables easy access to all eight prediction tools, the consensus classifier PredictSNP and annotations from the Protein Mutant Database and the UniProt database. The web server and the datasets are freely available to the academic community at http://loschmidt.chemi.muni.cz/predictsnp. PMID:24453961

  12. The Purine Bias of Coding Sequences is Determined by Physicochemical Constraints on Proteins.

    PubMed

    Ponce de Leon, Miguel; de Miranda, Antonio Basilio; Alvarez-Valin, Fernando; Carels, Nicolas

    2014-01-01

    For this report, we analyzed protein secondary structures in relation to the statistics of three nucleotide codon positions. The purpose of this investigation was to find which properties of the ribosome, tRNA or protein level, could explain the purine bias (Rrr) as it is observed in coding DNA. We found that the Rrr pattern is the consequence of a regularity (the codon structure) resulting from physicochemical constraints on proteins and thermodynamic constraints on ribosomal machinery. The physicochemical constraints on proteins mainly come from the hydropathy and molecular weight (MW) of secondary structures as well as the energy cost of amino acid synthesis. These constraints appear through a network of statistical correlations, such as (i) the cost of amino acid synthesis, which is in favor of a higher level of guanine in the first codon position, (ii) the constructive contribution of hydropathy alternation in proteins, (iii) the spatial organization of secondary structure in proteins according to solvent accessibility, (iv) the spatial organization of secondary structure according to amino acid hydropathy, (v) the statistical correlation of MW with protein secondary structures and their overall hydropathy, (vi) the statistical correlation of thymine in the second codon position with hydropathy and the energy cost of amino acid synthesis, and (vii) the statistical correlation of adenine in the second codon position with amino acid complexity and the MW of secondary protein structures. Amino acid physicochemical properties and functional constraints on proteins constitute a code that is translated into a purine bias within the coding DNA via tRNAs. In that sense, the Rrr pattern within coding DNA is the effect of information transfer on nucleotide composition from protein to DNA by selection according to the codon positions. Thus, coding DNA structure and ribosomal machinery co-evolved to minimize the energy cost of protein coding given the functional constraints on proteins.

  13. Assessment of Current Jet Noise Prediction Capabilities

    NASA Technical Reports Server (NTRS)

    Hunter, Craid A.; Bridges, James E.; Khavaran, Abbas

    2008-01-01

    An assessment was made of the capability of jet noise prediction codes over a broad range of jet flows, with the objective of quantifying current capabilities and identifying areas requiring future research investment. Three separate codes in NASA s possession, representative of two classes of jet noise prediction codes, were evaluated, one empirical and two statistical. The empirical code is the Stone Jet Noise Module (ST2JET) contained within the ANOPP aircraft noise prediction code. It is well documented, and represents the state of the art in semi-empirical acoustic prediction codes where virtual sources are attributed to various aspects of noise generation in each jet. These sources, in combination, predict the spectral directivity of a jet plume. A total of 258 jet noise cases were examined on the ST2JET code, each run requiring only fractions of a second to complete. Two statistical jet noise prediction codes were also evaluated, JeNo v1, and Jet3D. Fewer cases were run for the statistical prediction methods because they require substantially more resources, typically a Reynolds-Averaged Navier-Stokes solution of the jet, volume integration of the source statistical models over the entire plume, and a numerical solution of the governing propagation equation within the jet. In the evaluation process, substantial justification of experimental datasets used in the evaluations was made. In the end, none of the current codes can predict jet noise within experimental uncertainty. The empirical code came within 2dB on a 1/3 octave spectral basis for a wide range of flows. The statistical code Jet3D was within experimental uncertainty at broadside angles for hot supersonic jets, but errors in peak frequency and amplitude put it out of experimental uncertainty at cooler, lower speed conditions. Jet3D did not predict changes in directivity in the downstream angles. The statistical code JeNo,v1 was within experimental uncertainty predicting noise from cold subsonic jets at all angles, but did not predict changes with heating of the jet and did not account for directivity changes at supersonic conditions. Shortcomings addressed here give direction for future work relevant to the statistical-based prediction methods. A full report will be released as a chapter in a NASA publication assessing the state of the art in aircraft noise prediction.

  14. Evaluation of a genome-scale in silico metabolic model for Geobacter metallireducens by using proteomic data from a field biostimulation experiment.

    PubMed

    Fang, Yilin; Wilkins, Michael J; Yabusaki, Steven B; Lipton, Mary S; Long, Philip E

    2012-12-01

    Accurately predicting the interactions between microbial metabolism and the physical subsurface environment is necessary to enhance subsurface energy development, soil and groundwater cleanup, and carbon management. This study was an initial attempt to confirm the metabolic functional roles within an in silico model using environmental proteomic data collected during field experiments. Shotgun global proteomics data collected during a subsurface biostimulation experiment were used to validate a genome-scale metabolic model of Geobacter metallireducens-specifically, the ability of the metabolic model to predict metal reduction, biomass yield, and growth rate under dynamic field conditions. The constraint-based in silico model of G. metallireducens relates an annotated genome sequence to the physiological functions with 697 reactions controlled by 747 enzyme-coding genes. Proteomic analysis showed that 180 of the 637 G. metallireducens proteins detected during the 2008 experiment were associated with specific metabolic reactions in the in silico model. When the field-calibrated Fe(III) terminal electron acceptor process reaction in a reactive transport model for the field experiments was replaced with the genome-scale model, the model predicted that the largest metabolic fluxes through the in silico model reactions generally correspond to the highest abundances of proteins that catalyze those reactions. Central metabolism predicted by the model agrees well with protein abundance profiles inferred from proteomic analysis. Model discrepancies with the proteomic data, such as the relatively low abundances of proteins associated with amino acid transport and metabolism, revealed pathways or flux constraints in the in silico model that could be updated to more accurately predict metabolic processes that occur in the subsurface environment.

  15. Plant MicroRNA Prediction by Supervised Machine Learning Using C5.0 Decision Trees.

    PubMed

    Williams, Philip H; Eyles, Rod; Weiller, Georg

    2012-01-01

    MicroRNAs (miRNAs) are nonprotein coding RNAs between 20 and 22 nucleotides long that attenuate protein production. Different types of sequence data are being investigated for novel miRNAs, including genomic and transcriptomic sequences. A variety of machine learning methods have successfully predicted miRNA precursors, mature miRNAs, and other nonprotein coding sequences. MirTools, mirDeep2, and miRanalyzer require "read count" to be included with the input sequences, which restricts their use to deep-sequencing data. Our aim was to train a predictor using a cross-section of different species to accurately predict miRNAs outside the training set. We wanted a system that did not require read-count for prediction and could therefore be applied to short sequences extracted from genomic, EST, or RNA-seq sources. A miRNA-predictive decision-tree model has been developed by supervised machine learning. It only requires that the corresponding genome or transcriptome is available within a sequence window that includes the precursor candidate so that the required sequence features can be collected. Some of the most critical features for training the predictor are the miRNA:miRNA(∗) duplex energy and the number of mismatches in the duplex. We present a cross-species plant miRNA predictor with 84.08% sensitivity and 98.53% specificity based on rigorous testing by leave-one-out validation.

  16. RVMAB: Using the Relevance Vector Machine Model Combined with Average Blocks to Predict the Interactions of Proteins from Protein Sequences.

    PubMed

    An, Ji-Yong; You, Zhu-Hong; Meng, Fan-Rong; Xu, Shu-Juan; Wang, Yin

    2016-05-18

    Protein-Protein Interactions (PPIs) play essential roles in most cellular processes. Knowledge of PPIs is becoming increasingly more important, which has prompted the development of technologies that are capable of discovering large-scale PPIs. Although many high-throughput biological technologies have been proposed to detect PPIs, there are unavoidable shortcomings, including cost, time intensity, and inherently high false positive and false negative rates. For the sake of these reasons, in silico methods are attracting much attention due to their good performances in predicting PPIs. In this paper, we propose a novel computational method known as RVM-AB that combines the Relevance Vector Machine (RVM) model and Average Blocks (AB) to predict PPIs from protein sequences. The main improvements are the results of representing protein sequences using the AB feature representation on a Position Specific Scoring Matrix (PSSM), reducing the influence of noise using a Principal Component Analysis (PCA), and using a Relevance Vector Machine (RVM) based classifier. We performed five-fold cross-validation experiments on yeast and Helicobacter pylori datasets, and achieved very high accuracies of 92.98% and 95.58% respectively, which is significantly better than previous works. In addition, we also obtained good prediction accuracies of 88.31%, 89.46%, 91.08%, 91.55%, and 94.81% on other five independent datasets C. elegans, M. musculus, H. sapiens, H. pylori, and E. coli for cross-species prediction. To further evaluate the proposed method, we compare it with the state-of-the-art support vector machine (SVM) classifier on the yeast dataset. The experimental results demonstrate that our RVM-AB method is obviously better than the SVM-based method. The promising experimental results show the efficiency and simplicity of the proposed method, which can be an automatic decision support tool. To facilitate extensive studies for future proteomics research, we developed a freely available web server called RVMAB-PPI in Hypertext Preprocessor (PHP) for predicting PPIs. The web server including source code and the datasets are available at http://219.219.62.123:8888/ppi_ab/.

  17. Biophysics of protein evolution and evolutionary protein biophysics

    PubMed Central

    Sikosek, Tobias; Chan, Hue Sun

    2014-01-01

    The study of molecular evolution at the level of protein-coding genes often entails comparing large datasets of sequences to infer their evolutionary relationships. Despite the importance of a protein's structure and conformational dynamics to its function and thus its fitness, common phylogenetic methods embody minimal biophysical knowledge of proteins. To underscore the biophysical constraints on natural selection, we survey effects of protein mutations, highlighting the physical basis for marginal stability of natural globular proteins and how requirement for kinetic stability and avoidance of misfolding and misinteractions might have affected protein evolution. The biophysical underpinnings of these effects have been addressed by models with an explicit coarse-grained spatial representation of the polypeptide chain. Sequence–structure mappings based on such models are powerful conceptual tools that rationalize mutational robustness, evolvability, epistasis, promiscuous function performed by ‘hidden’ conformational states, resolution of adaptive conflicts and conformational switches in the evolution from one protein fold to another. Recently, protein biophysics has been applied to derive more accurate evolutionary accounts of sequence data. Methods have also been developed to exploit sequence-based evolutionary information to predict biophysical behaviours of proteins. The success of these approaches demonstrates a deep synergy between the fields of protein biophysics and protein evolution. PMID:25165599

  18. Comparison of GLIMPS and HFAST Stirling engine code predictions with experimental data

    NASA Technical Reports Server (NTRS)

    Geng, Steven M.; Tew, Roy C.

    1992-01-01

    Predictions from GLIMPS and HFAST design codes are compared with experimental data for the RE-1000 and SPRE free piston Stirling engines. Engine performance and available power loss predictions are compared. Differences exist between GLIMPS and HFAST loss predictions. Both codes require engine specific calibration to bring predictions and experimental data into agreement.

  19. Complete genome sequence analysis of the fish pathogen Flavobacterium columnare provides insights into antibiotic resistance and pathogenicity related genes.

    PubMed

    Zhang, Yulei; Zhao, Lijuan; Chen, Wenjie; Huang, Yunmao; Yang, Ling; Sarathbabu, V; Wu, Zaohe; Li, Jun; Nie, Pin; Lin, Li

    2017-10-01

    We analyzed here the complete genome sequences of a highly virulent Flavobacterium columnare Pf1 strain isolated in our laboratory. The complete genome consists of a 3,171,081 bp circular DNA with 2784 predicted protein-coding genes. Among these, 286 genes were predicted as antibiotic resistance genes, including 32 RND-type efflux pump related genes which were associated with the export of aminoglycosides, indicating inducible aminoglycosides resistances in F. columnare. On the other hand, 328 genes were predicted as pathogenicity related genes which could be classified as virulence factors, gliding motility proteins, adhesins, and many putative secreted proteases. These genes were probably involved in the colonization, invasion and destruction of fish tissues during the infection of F. columnare. Apparently, our obtained complete genome sequences provide the basis for the explanation of the interactions between the F. columnare and the infected fish. The predicted antibiotic resistance and pathogenicity related genes will shed a new light on the development of more efficient preventional strategies against the infection of F. columnare, which is a major worldwide fish pathogen. Copyright © 2017 Elsevier Ltd. All rights reserved.

  20. Comparative genomics approach to detecting split-coding regions in a low-coverage genome: lessons from the chimaera Callorhinchus milii (Holocephali, Chondrichthyes).

    PubMed

    Dessimoz, Christophe; Zoller, Stefan; Manousaki, Tereza; Qiu, Huan; Meyer, Axel; Kuraku, Shigehiro

    2011-09-01

    Recent development of deep sequencing technologies has facilitated de novo genome sequencing projects, now conducted even by individual laboratories. However, this will yield more and more genome sequences that are not well assembled, and will hinder thorough annotation when no closely related reference genome is available. One of the challenging issues is the identification of protein-coding sequences split into multiple unassembled genomic segments, which can confound orthology assignment and various laboratory experiments requiring the identification of individual genes. In this study, using the genome of a cartilaginous fish, Callorhinchus milii, as test case, we performed gene prediction using a model specifically trained for this genome. We implemented an algorithm, designated ESPRIT, to identify possible linkages between multiple protein-coding portions derived from a single genomic locus split into multiple unassembled genomic segments. We developed a validation framework based on an artificially fragmented human genome, improvements between early and recent mouse genome assemblies, comparison with experimentally validated sequences from GenBank, and phylogenetic analyses. Our strategy provided insights into practical solutions for efficient annotation of only partially sequenced (low-coverage) genomes. To our knowledge, our study is the first formulation of a method to link unassembled genomic segments based on proteomes of relatively distantly related species as references.

  1. Comparative genomics approach to detecting split-coding regions in a low-coverage genome: lessons from the chimaera Callorhinchus milii (Holocephali, Chondrichthyes)

    PubMed Central

    Zoller, Stefan; Manousaki, Tereza; Qiu, Huan; Meyer, Axel; Kuraku, Shigehiro

    2011-01-01

    Recent development of deep sequencing technologies has facilitated de novo genome sequencing projects, now conducted even by individual laboratories. However, this will yield more and more genome sequences that are not well assembled, and will hinder thorough annotation when no closely related reference genome is available. One of the challenging issues is the identification of protein-coding sequences split into multiple unassembled genomic segments, which can confound orthology assignment and various laboratory experiments requiring the identification of individual genes. In this study, using the genome of a cartilaginous fish, Callorhinchus milii, as test case, we performed gene prediction using a model specifically trained for this genome. We implemented an algorithm, designated ESPRIT, to identify possible linkages between multiple protein-coding portions derived from a single genomic locus split into multiple unassembled genomic segments. We developed a validation framework based on an artificially fragmented human genome, improvements between early and recent mouse genome assemblies, comparison with experimentally validated sequences from GenBank, and phylogenetic analyses. Our strategy provided insights into practical solutions for efficient annotation of only partially sequenced (low-coverage) genomes. To our knowledge, our study is the first formulation of a method to link unassembled genomic segments based on proteomes of relatively distantly related species as references. PMID:21712341

  2. Non-coding RNAs—Novel targets in neurotoxicity

    PubMed Central

    Tal, Tamara L.; Tanguay, Robert L.

    2012-01-01

    Over the past ten years non-coding RNAs (ncRNAs) have emerged as pivotal players in fundamental physiological and cellular processes and have been increasingly implicated in cancer, immune disorders, and cardiovascular, neurodegenerative, and metabolic diseases. MicroRNAs (miRNAs) represent a class of ncRNA molecules that function as negative regulators of post-transcriptional gene expression. miRNAs are predicted to regulate 60% of all human protein-coding genes and as such, play key roles in cellular and developmental processes, human health, and disease. Relative to counterparts that lack bindings sites for miRNAs, genes encoding proteins that are post-transcriptionally regulated by miRNAs are twice as likely to be sensitive to environmental chemical exposure. Not surprisingly, miRNAs have been recognized as targets or effectors of nervous system, developmental, hepatic, and carcinogenic toxicants, and have been identified as putative regulators of phase I xenobiotic-metabolizing enzymes. In this review, we give an overview of the types of ncRNAs and highlight their roles in neurodevelopment, neurological disease, activity-dependent signaling, and drug metabolism. We then delve into specific examples that illustrate their importance as mediators, effectors, or adaptive agents of neurotoxicants or neuroactive pharmaceutical compounds. Finally, we identify a number of outstanding questions regarding ncRNAs and neurotoxicity. PMID:22394481

  3. Evaluation of a Genome-Scale In Silico Metabolic Model for Geobacter metallireducens Using Proteomic Data from a Field Biostimulation Experiment

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Fang, Yilin; Wilkins, Michael J.; Yabusaki, Steven B.

    2012-12-12

    Biomass and shotgun global proteomics data that reflected relative protein abundances from samples collected during the 2008 experiment at the U.S. Department of Energy Integrated Field-Scale Subsurface Research Challenge site in Rifle, Colorado, provided an unprecedented opportunity to validate a genome-scale metabolic model of Geobacter metallireducens and assess its performance with respect to prediction of metal reduction, biomass yield, and growth rate under dynamic field conditions. Reconstructed from annotated genomic sequence, biochemical, and physiological data, the constraint-based in silico model of G. metallireducens relates an annotated genome sequence to the physiological functions with 697 reactions controlled by 747 enzyme-coding genes.more » Proteomic analysis showed that 180 of the 637 G. metallireducens proteins detected during the 2008 experiment were associated with specific metabolic reactions in the in silico model. When the field-calibrated Fe(III) terminal electron acceptor process reaction in a reactive transport model for the field experiments was replaced with the genome-scale model, the model predicted that the largest metabolic fluxes through the in silico model reactions generally correspond to the highest abundances of proteins that catalyze those reactions. Central metabolism predicted by the model agrees well with protein abundance profiles inferred from proteomic analysis. Model discrepancies with the proteomic data, such as the relatively low fluxes through amino acid transport and metabolism, revealed pathways or flux constraints in the in silico model that could be updated to more accurately predict metabolic processes that occur in the subsurface environment.« less

  4. Rhodopseudomonas palustris CGA010 Proteome Implicates Extracytoplasmic Function Sigma Factor in Stress Response

    DOE PAGES

    Allen, Michael S.; Hurst, Gregory B.; Lu, Tse-Yuan S.; ...

    2015-04-08

    Rhodopseudomonas palustris encodes 16 extracytoplasmic function (ECF) σ factors. In this paper, to begin to investigate the regulatory network of one of these ECF σ factors, the whole proteome of R. palustris CGA010 was quantitatively analyzed by tandem mass spectrometry from cultures episomally expressing the ECF σ RPA4225 (ecfT) versus a WT control. Among the proteins with the greatest increase in abundance were catalase KatE, trehalose synthase, a DPS-like protein, and several regulatory proteins. Alignment of the cognate promoter regions driving expression of several upregulated proteins suggested a conserved binding motif in the -35 and -10 regions with the consensusmore » sequence GGAAC-18N-TT. Additionally, the putative anti-σ factor RPA4224, whose gene is contained in the same predicted operon as RPA4225, was identified as interacting directly with the predicted response regulator RPA4223 by mass spectrometry of affinity-isolated protein complexes. Furthermore, another gene (RPA4226) coding for a protein that contains a cytoplasmic histidine kinase domain is located immediately upstream of RPA4225. The genomic organization of orthologs for these four genes is conserved in several other strains of R. palustris as well as in closely related α-Proteobacteria. Finally, taken together, these data suggest that ECF σ RPA4225 and the three additional genes make up a sigma factor mimicry system in R. palustris.« less

  5. Rhodopseudomonas palustris CGA010 Proteome Implicates Extracytoplasmic Function Sigma Factor in Stress Response

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Allen, Michael S.; Hurst, Gregory B.; Lu, Tse-Yuan S.

    Rhodopseudomonas palustris encodes 16 extracytoplasmic function (ECF) σ factors. In this paper, to begin to investigate the regulatory network of one of these ECF σ factors, the whole proteome of R. palustris CGA010 was quantitatively analyzed by tandem mass spectrometry from cultures episomally expressing the ECF σ RPA4225 (ecfT) versus a WT control. Among the proteins with the greatest increase in abundance were catalase KatE, trehalose synthase, a DPS-like protein, and several regulatory proteins. Alignment of the cognate promoter regions driving expression of several upregulated proteins suggested a conserved binding motif in the -35 and -10 regions with the consensusmore » sequence GGAAC-18N-TT. Additionally, the putative anti-σ factor RPA4224, whose gene is contained in the same predicted operon as RPA4225, was identified as interacting directly with the predicted response regulator RPA4223 by mass spectrometry of affinity-isolated protein complexes. Furthermore, another gene (RPA4226) coding for a protein that contains a cytoplasmic histidine kinase domain is located immediately upstream of RPA4225. The genomic organization of orthologs for these four genes is conserved in several other strains of R. palustris as well as in closely related α-Proteobacteria. Finally, taken together, these data suggest that ECF σ RPA4225 and the three additional genes make up a sigma factor mimicry system in R. palustris.« less

  6. Improving protein–protein interactions prediction accuracy using protein evolutionary information and relevance vector machine model

    PubMed Central

    An, Ji‐Yong; Meng, Fan‐Rong; Chen, Xing; Yan, Gui‐Ying; Hu, Ji‐Pu

    2016-01-01

    Abstract Predicting protein–protein interactions (PPIs) is a challenging task and essential to construct the protein interaction networks, which is important for facilitating our understanding of the mechanisms of biological systems. Although a number of high‐throughput technologies have been proposed to predict PPIs, there are unavoidable shortcomings, including high cost, time intensity, and inherently high false positive rates. For these reasons, many computational methods have been proposed for predicting PPIs. However, the problem is still far from being solved. In this article, we propose a novel computational method called RVM‐BiGP that combines the relevance vector machine (RVM) model and Bi‐gram Probabilities (BiGP) for PPIs detection from protein sequences. The major improvement includes (1) Protein sequences are represented using the Bi‐gram probabilities (BiGP) feature representation on a Position Specific Scoring Matrix (PSSM), in which the protein evolutionary information is contained; (2) For reducing the influence of noise, the Principal Component Analysis (PCA) method is used to reduce the dimension of BiGP vector; (3) The powerful and robust Relevance Vector Machine (RVM) algorithm is used for classification. Five‐fold cross‐validation experiments executed on yeast and Helicobacter pylori datasets, which achieved very high accuracies of 94.57 and 90.57%, respectively. Experimental results are significantly better than previous methods. To further evaluate the proposed method, we compare it with the state‐of‐the‐art support vector machine (SVM) classifier on the yeast dataset. The experimental results demonstrate that our RVM‐BiGP method is significantly better than the SVM‐based method. In addition, we achieved 97.15% accuracy on imbalance yeast dataset, which is higher than that of balance yeast dataset. The promising experimental results show the efficiency and robust of the proposed method, which can be an automatic decision support tool for future proteomics research. For facilitating extensive studies for future proteomics research, we developed a freely available web server called RVM‐BiGP‐PPIs in Hypertext Preprocessor (PHP) for predicting PPIs. The web server including source code and the datasets are available at http://219.219.62.123:8888/BiGP/. PMID:27452983

  7. Comprehensive analysis of microRNA-Seq and target mRNAs of rice sheath blight pathogen provides new insights into pathogenic regulatory mechanisms.

    PubMed

    Lin, Runmao; He, Liye; He, Jiayu; Qin, Peigang; Wang, Yanran; Deng, Qiming; Yang, Xiaoting; Li, Shuangcheng; Wang, Shiquan; Wang, Wenming; Liu, Huainian; Li, Ping; Zheng, Aiping

    2016-07-03

    MicroRNAs (miRNAs) are ∼22 nucleotide non-coding RNAs that regulate gene expression by targeting mRNAs for degradation or inhibiting protein translation. To investigate whether miRNAs regulate the pathogenesis in necrotrophic fungus Rhizoctonia solani AG1 IA, which causes significant yield loss in main economically important crops, and to determine the regulatory mechanism occurring during pathogenesis, we constructed hyphal small RNA libraries from six different infection periods of the rice leaf. Through sequencing and analysis, 177 miRNA-like small RNAs (milRNAs) were identified, including 15 candidate pathogenic novel milRNAs predicted by functional annotations of their target mRNAs and expression patterns of milRNAs and mRNAs during infection. Reverse transcription-quantitative polymerase chain reaction results for randomly selected milRNAs demonstrated that our novel comprehensive predictions had a high level of accuracy. In our predicted pathogenic protein-protein interaction network of R. solani, we added the related regulatory milRNAs of these core coding genes into the network, and could understand the relationships among these regulatory factors more clearly at the systems level. Furthermore, the putative pathogenic Rhi-milR-16, which negatively regulates target gene expression, was experimentally validated to have regulatory functions by a dual-luciferase reporter assay. Additionally, 23 candidate rice miRNAs that may involve in plant immunity against R. solani were discovered. This first study on novel pathogenic milRNAs of R. solani AG1 IA and the recognition of target genes involved in pathogenicity, as well as rice miRNAs, participated in defence against R. solani could provide new insights into revealing the pathogenic mechanisms of the severe rice sheath blight disease. © The Author 2016. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.

  8. Draft Genome Sequence of a Novel Chitinophaga sp. Strain, MD30, Isolated from a Biofilm in an Air Conditioner Condensate Pipe.

    PubMed

    Wan, Xuehua; Darris, Maxwell; Hou, Shaobin; Donachie, Stuart P

    2017-10-19

    Most of the 24 known Chitinophaga species were originally isolated from soils. We report the draft genome sequence of a putatively novel Chitinophaga sp. from a biofilm in an air conditioner condensate pipe. The genome comprises 7,661,303 bp in one scaffold, 5,694 predicted protein-coding sequences, and a G+C content of 47.6%. Copyright © 2017 Wan et al.

  9. Toward Fast and Accurate Binding Affinity Prediction with pmemdGTI: An Efficient Implementation of GPU-Accelerated Thermodynamic Integration.

    PubMed

    Lee, Tai-Sung; Hu, Yuan; Sherborne, Brad; Guo, Zhuyan; York, Darrin M

    2017-07-11

    We report the implementation of the thermodynamic integration method on the pmemd module of the AMBER 16 package on GPUs (pmemdGTI). The pmemdGTI code typically delivers over 2 orders of magnitude of speed-up relative to a single CPU core for the calculation of ligand-protein binding affinities with no statistically significant numerical differences and thus provides a powerful new tool for drug discovery applications.

  10. Cloning and characterization of a basic phospholipase A2 homologue from Micrurus corallinus (coral snake) venom gland.

    PubMed

    de Oliveira, Ursula Castro; Assui, Alessandra; da Silva, Alvaro Rossan de Brandão Prieto; de Oliveira, Jane Silveira; Ho, Paulo Lee

    2003-09-01

    During the cloning of abundant cDNAs expressed in the Micrurus corallinus coral snake venom gland, several putative toxins, including a phospholipase A2 homologue cDNA (clone V2), were identified. The V2 cDNA clone codes for a potential coral snake toxin with a signal peptide of 27 amino acid residues plus a predicted mature protein with 119 amino acid residues. The deduced protein is highly similar to known phospholipases A2, with seven deduced S-S bridges at the same conserved positions. This protein was expressed in Escherichia coli as a His-tagged protein that allowed the rapid purification of the recombinant protein. This protein was used to generate antibodies, which recognized the recombinant protein in Western blot. This antiserum was used to screen a large number of venoms, showing a ubiquitous distribution of immunorelated proteins in all elapidic venoms but not in the viperidic Bothrops jararaca venom. This is the first description of a complete primary structure of a phospholipase A2 homologue deduced by cDNA cloning from a coral snake.

  11. Prediction and Validation of Disease Genes Using HeteSim Scores.

    PubMed

    Zeng, Xiangxiang; Liao, Yuanlu; Liu, Yuansheng; Zou, Quan

    2017-01-01

    Deciphering the gene disease association is an important goal in biomedical research. In this paper, we use a novel relevance measure, called HeteSim, to prioritize candidate disease genes. Two methods based on heterogeneous networks constructed using protein-protein interaction, gene-phenotype associations, and phenotype-phenotype similarity, are presented. In HeteSim_MultiPath (HSMP), HeteSim scores of different paths are combined with a constant that dampens the contributions of longer paths. In HeteSim_SVM (HSSVM), HeteSim scores are combined with a machine learning method. The 3-fold experiments show that our non-machine learning method HSMP performs better than the existing non-machine learning methods, our machine learning method HSSVM obtains similar accuracy with the best existing machine learning method CATAPULT. From the analysis of the top 10 predicted genes for different diseases, we found that HSSVM avoid the disadvantage of the existing machine learning based methods, which always predict similar genes for different diseases. The data sets and Matlab code for the two methods are freely available for download at http://lab.malab.cn/data/HeteSim/index.jsp.

  12. Cryptic tRNAs in chaetognath mitochondrial genomes.

    PubMed

    Barthélémy, Roxane-Marie; Seligmann, Hervé

    2016-06-01

    The chaetognaths constitute a small and enigmatic phylum of little marine invertebrates. Both nuclear and mitochondrial genomes have numerous originalities, some phylum-specific. Until recently, their mitogenomes seemed containing only one tRNA gene (trnMet), but a recent study found in two chaetognath mitogenomes two and four tRNA genes. Moreover, apparently two conspecific mitogenomes have different tRNA gene numbers (one and two). Reanalyses by tRNAscan-SE and ARWEN softwares of the five available complete chaetognath mitogenomes suggest numerous additional tRNA genes from different types. Their total number never reaches the 22 found in most other invertebrates using that genetic code. Predicted error compensation between codon-anticodon mismatch and tRNA misacylation suggests translational activity by tRNAs predicted solely according to secondary structure for tRNAs predicted by tRNAscan-SE, not ARWEN. Numbers of predicted stop-suppressor (antitermination) tRNAs coevolve with predicted overlapping, frameshifted protein coding genes including stop codons. Sequence alignments in secondary structure prediction with non-chaetognath tRNAs suggest that the most likely functional tRNAs are in intergenic regions, as regular mt-tRNAs. Due to usually short intergenic regions, generally tRNA sequences partially overlap with flanking genes. Some tRNA pairs seem templated by sense-antisense strands. Moreover, 16S rRNA genes, but not 12S rRNAs, appear as tRNA nurseries, as previously suggested for multifunctional ribosomal-like protogenomes. Copyright © 2016 Elsevier Ltd. All rights reserved.

  13. Genome Sequence of the Mesophilic Thermotogales Bacterium Mesotoga prima MesG1.Ag.4.2 Reveals the Largest Thermotogales Genome To Date

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Zhaxybayeva, Olga; Swithers, Kristen S; Foght, Julia

    2012-01-01

    Here we describe the genome of Mesotoga prima MesG1.Ag4.2, the first genome of a mesophilic Thermotogales bacterium. Mesotoga prima was isolated from a polychlorinated biphenyl (PCB)-dechlorinating enrichment culture from Baltimore Harbor sediments. Its 2.97 Mb genome is considerably larger than any previously sequenced Thermotogales genomes, which range between 1.86 and 2.30 Mb. This larger size is due to both higher numbers of protein-coding genes and larger intergenic regions. In particular, the M. prima genome contains more genes for proteins involved in regulatory functions, for instance those involved in regulation of transcription. Together with its closest relative, Kosmotoga olearia, it alsomore » encodes different types of proteins involved in environmental and cell-cell interactions as compared with other Thermotogales bacteria. Amino acid composition analysis of M. prima proteins implies that this lineage has inhabited low-temperature environments for a long time. A large fraction of the M. prima genome has been acquired by lateral gene transfer (LGT): a DarkHorse analysis suggests that 766 (32%) of predicted protein-coding genes have been involved in LGT after Mesotoga diverged from the other Thermotogales lineages. A notable example of a lineage-specific LGT event is a reductive dehalogenase gene - a key enzyme in dehalorespiration, indicating M. prima may have a more active role in PCB dechlorination than was previously assumed.« less

  14. Low-delay predictive audio coding for the HIVITS HDTV codec

    NASA Astrophysics Data System (ADS)

    McParland, A. K.; Gilchrist, N. H. C.

    1995-01-01

    The status of work relating to predictive audio coding, as part of the European project on High Quality Video Telephone and HD(TV) Systems (HIVITS), is reported. The predictive coding algorithm is developed, along with six-channel audio coding and decoding hardware. Demonstrations of the audio codec operating in conjunction with the video codec, are given.

  15. Overview of Recent Radiation Transport Code Comparisons for Space Applications

    NASA Astrophysics Data System (ADS)

    Townsend, Lawrence

    Recent advances in radiation transport code development for space applications have resulted in various comparisons of code predictions for a variety of scenarios and codes. Comparisons among both Monte Carlo and deterministic codes have been made and published by vari-ous groups and collaborations, including comparisons involving, but not limited to HZETRN, HETC-HEDS, FLUKA, GEANT, PHITS, and MCNPX. In this work, an overview of recent code prediction inter-comparisons, including comparisons to available experimental data, is presented and discussed, with emphases on those areas of agreement and disagreement among the various code predictions and published data.

  16. A fresh look at the male-specific region of the human Y chromosome.

    PubMed

    Jangravi, Zohreh; Alikhani, Mehdi; Arefnezhad, Babak; Sharifi Tabar, Mehdi; Taleahmad, Sara; Karamzadeh, Razieh; Jadaliha, Mahdieh; Mousavi, Seyed Ahmad; Ahmadi Rastegar, Diba; Parsamatin, Pouria; Vakilian, Haghighat; Mirshahvaladi, Shahab; Sabbaghian, Marjan; Mohseni Meybodi, Anahita; Mirzaei, Mehdi; Shahhoseini, Maryam; Ebrahimi, Marzieh; Piryaei, Abbas; Moosavi-Movahedi, Ali Akbar; Haynes, Paul A; Goodchild, Ann K; Nasr-Esfahani, Mohammad Hossein; Jabbari, Esmaiel; Baharvand, Hossein; Sedighi Gilani, Mohammad Ali; Gourabi, Hamid; Salekdeh, Ghasem Hosseini

    2013-01-04

    The Chromosome-centric Human Proteome Project (C-HPP) aims to systematically map the entire human proteome with the intent to enhance our understanding of human biology at the cellular level. This project attempts simultaneously to establish a sound basis for the development of diagnostic, prognostic, therapeutic, and preventive medical applications. In Iran, current efforts focus on mapping the proteome of the human Y chromosome. The male-specific region of the Y chromosome (MSY) is unique in many aspects and comprises 95% of the chromosome's length. The MSY continually retains its haploid state and is full of repeated sequences. It is responsible for important biological roles such as sex determination and male fertility. Here, we present the most recent update of MSY protein-encoding genes and their association with various traits and diseases including sex determination and reversal, spermatogenesis and male infertility, cancers such as prostate cancers, sex-specific effects on the brain and behavior, and graft-versus-host disease. We also present information available from RNA sequencing, protein-protein interaction, post-translational modification of MSY protein-coding genes and their implications in biological systems. An overview of Human Y chromosome Proteome Project is presented and a systematic approach is suggested to ensure that at least one of each predicted protein-coding gene's major representative proteins will be characterized in the context of its major anatomical sites of expression, its abundance, and its functional relevance in a biological and/or medical context. There are many technical and biological issues that will need to be overcome in order to accomplish the full scale mapping.

  17. A comprehensive catalog of human KRAB-associated zinc finger genes: Insights into the evolutionary history of a large family of transcriptional repressors

    PubMed Central

    Huntley, Stuart; Baggott, Daniel M.; Hamilton, Aaron T.; Tran-Gyamfi, Mary; Yang, Shan; Kim, Joomyeong; Gordon, Laurie; Branscomb, Elbert; Stubbs, Lisa

    2006-01-01

    Krüppel-type zinc finger (ZNF) motifs are prevalent components of transcription factor proteins in all eukaryotes. KRAB-ZNF proteins, in which a potent repressor domain is attached to a tandem array of DNA-binding zinc-finger motifs, are specific to tetrapod vertebrates and represent the largest class of ZNF proteins in mammals. To define the full repertoire of human KRAB-ZNF proteins, we searched the genome sequence for key motifs and then constructed and manually curated gene models incorporating those sequences. The resulting gene catalog contains 423 KRAB-ZNF protein-coding loci, yielding alternative transcripts that altogether predict at least 742 structurally distinct proteins. Active rounds of segmental duplication, involving single genes or larger regions and including both tandem and distributed duplication events, have driven the expansion of this mammalian gene family. Comparisons between the human genes and ZNF loci mined from the draft mouse, dog, and chimpanzee genomes not only identified 103 KRAB-ZNF genes that are conserved in mammals but also highlighted a substantial level of lineage-specific change; at least 136 KRAB-ZNF coding genes are primate specific, including many recent duplicates. KRAB-ZNF genes are widely expressed and clustered genes are typically not coregulated, indicating that paralogs have evolved to fill roles in many different biological processes. To facilitate further study, we have developed a Web-based public resource with access to gene models, sequences, and other data, including visualization tools to provide genomic context and interaction with other public data sets. PMID:16606702

  18. Validating a Coarse-Grained Potential Energy Function through Protein Loop Modelling

    PubMed Central

    MacDonald, James T.; Kelley, Lawrence A.; Freemont, Paul S.

    2013-01-01

    Coarse-grained (CG) methods for sampling protein conformational space have the potential to increase computational efficiency by reducing the degrees of freedom. The gain in computational efficiency of CG methods often comes at the expense of non-protein like local conformational features. This could cause problems when transitioning to full atom models in a hierarchical framework. Here, a CG potential energy function was validated by applying it to the problem of loop prediction. A novel method to sample the conformational space of backbone atoms was benchmarked using a standard test set consisting of 351 distinct loops. This method used a sequence-independent CG potential energy function representing the protein using -carbon positions only and sampling conformations with a Monte Carlo simulated annealing based protocol. Backbone atoms were added using a method previously described and then gradient minimised in the Rosetta force field. Despite the CG potential energy function being sequence-independent, the method performed similarly to methods that explicitly use either fragments of known protein backbones with similar sequences or residue-specific /-maps to restrict the search space. The method was also able to predict with sub-Angstrom accuracy two out of seven loops from recently solved crystal structures of proteins with low sequence and structure similarity to previously deposited structures in the PDB. The ability to sample realistic loop conformations directly from a potential energy function enables the incorporation of additional geometric restraints and the use of more advanced sampling methods in a way that is not possible to do easily with fragment replacement methods and also enable multi-scale simulations for protein design and protein structure prediction. These restraints could be derived from experimental data or could be design restraints in the case of computational protein design. C++ source code is available for download from http://www.sbg.bio.ic.ac.uk/phyre2/PD2/. PMID:23824634

  19. Death of a dogma: eukaryotic mRNAs can code for more than one protein

    PubMed Central

    Mouilleron, Hélène; Delcourt, Vivian; Roucou, Xavier

    2016-01-01

    mRNAs carry the genetic information that is translated by ribosomes. The traditional view of a mature eukaryotic mRNA is a molecule with three main regions, the 5′ UTR, the protein coding open reading frame (ORF) or coding sequence (CDS), and the 3′ UTR. This concept assumes that ribosomes translate one ORF only, generally the longest one, and produce one protein. As a result, in the early days of genomics and bioinformatics, one CDS was associated with each protein-coding gene. This fundamental concept of a single CDS is being challenged by increasing experimental evidence indicating that annotated proteins are not the only proteins translated from mRNAs. In particular, mass spectrometry (MS)-based proteomics and ribosome profiling have detected productive translation of alternative open reading frames. In several cases, the alternative and annotated proteins interact. Thus, the expression of two or more proteins translated from the same mRNA may offer a mechanism to ensure the co-expression of proteins which have functional interactions. Translational mechanisms already described in eukaryotic cells indicate that the cellular machinery is able to translate different CDSs from a single viral or cellular mRNA. In addition to summarizing data showing that the protein coding potential of eukaryotic mRNAs has been underestimated, this review aims to challenge the single translated CDS dogma. PMID:26578573

  20. Improved detection of DNA-binding proteins via compression technology on PSSM information.

    PubMed

    Wang, Yubo; Ding, Yijie; Guo, Fei; Wei, Leyi; Tang, Jijun

    2017-01-01

    Since the importance of DNA-binding proteins in multiple biomolecular functions has been recognized, an increasing number of researchers are attempting to identify DNA-binding proteins. In recent years, the machine learning methods have become more and more compelling in the case of protein sequence data soaring, because of their favorable speed and accuracy. In this paper, we extract three features from the protein sequence, namely NMBAC (Normalized Moreau-Broto Autocorrelation), PSSM-DWT (Position-specific scoring matrix-Discrete Wavelet Transform), and PSSM-DCT (Position-specific scoring matrix-Discrete Cosine Transform). We also employ feature selection algorithm on these feature vectors. Then, these features are fed into the training SVM (support vector machine) model as classifier to predict DNA-binding proteins. Our method applys three datasets, namely PDB1075, PDB594 and PDB186, to evaluate the performance of our approach. The PDB1075 and PDB594 datasets are employed for Jackknife test and the PDB186 dataset is used for the independent test. Our method achieves the best accuracy in the Jacknife test, from 79.20% to 86.23% and 80.5% to 86.20% on PDB1075 and PDB594 datasets, respectively. In the independent test, the accuracy of our method comes to 76.3%. The performance of independent test also shows that our method has a certain ability to be effectively used for DNA-binding protein prediction. The data and source code are at https://doi.org/10.6084/m9.figshare.5104084.

  1. Investigation of protein folding by coarse-grained molecular dynamics with the UNRES force field.

    PubMed

    Maisuradze, Gia G; Senet, Patrick; Czaplewski, Cezary; Liwo, Adam; Scheraga, Harold A

    2010-04-08

    Coarse-grained molecular dynamics simulations offer a dramatic extension of the time-scale of simulations compared to all-atom approaches. In this article, we describe the use of the physics-based united-residue (UNRES) force field, developed in our laboratory, in protein-structure simulations. We demonstrate that this force field offers about a 4000-times extension of the simulation time scale; this feature arises both from averaging out the fast-moving degrees of freedom and reduction of the cost of energy and force calculations compared to all-atom approaches with explicit solvent. With massively parallel computers, microsecond folding simulation times of proteins containing about 1000 residues can be obtained in days. A straightforward application of canonical UNRES/MD simulations, demonstrated with the example of the N-terminal part of the B-domain of staphylococcal protein A (PDB code: 1BDD, a three-alpha-helix bundle), discerns the folding mechanism and determines kinetic parameters by parallel simulations of several hundred or more trajectories. Use of generalized-ensemble techniques, of which the multiplexed replica exchange method proved to be the most effective, enables us to compute thermodynamics of folding and carry out fully physics-based prediction of protein structure, in which the predicted structure is determined as a mean over the most populated ensemble below the folding-transition temperature. By using principal component analysis of the UNRES folding trajectories of the formin-binding protein WW domain (PDB code: 1E0L; a three-stranded antiparallel beta-sheet) and 1BDD, we identified representative structures along the folding pathways and demonstrated that only a few (low-indexed) principal components can capture the main structural features of a protein-folding trajectory; the potentials of mean force calculated along these essential modes exhibit multiple minima, as opposed to those along the remaining modes that are unimodal. In addition, a comparison between the structures that are representative of the minima in the free-energy profile along the essential collective coordinates of protein folding (computed by principal component analysis) and the free-energy profile projected along the virtual-bond dihedral angles gamma of the backbone revealed the key residues involved in the transitions between the different basins of the folding free-energy profile, in agreement with existing experimental data for 1E0L .

  2. Identification and Classification of New Transcripts in Dorper and Small-Tailed Han Sheep Skeletal Muscle Transcriptomes.

    PubMed

    Chao, Tianle; Wang, Guizhi; Wang, Jianmin; Liu, Zhaohua; Ji, Zhibin; Hou, Lei; Zhang, Chunlan

    2016-01-01

    High-throughput mRNA sequencing enables the discovery of new transcripts and additional parts of incompletely annotated transcripts. Compared with the human and cow genomes, the reference annotation level of the sheep genome is still low. An investigation of new transcripts in sheep skeletal muscle will improve our understanding of muscle development. Therefore, applying high-throughput sequencing, two cDNA libraries from the biceps brachii of small-tailed Han sheep and Dorper sheep were constructed, and whole-transcriptome analysis was performed to determine the unknown transcript catalogue of this tissue. In this study, 40,129 transcripts were finally mapped to the sheep genome. Among them, 3,467 transcripts were determined to be unannotated in the current reference sheep genome and were defined as new transcripts. Based on protein-coding capacity prediction and comparative analysis of sequence similarity, 246 transcripts were classified as portions of unannotated genes or incompletely annotated genes. Another 1,520 transcripts were predicted with high confidence to be long non-coding RNAs. Our analysis also revealed 334 new transcripts that displayed specific expression in ruminants and uncovered a number of new transcripts without intergenus homology but with specific expression in sheep skeletal muscle. The results confirmed a complex transcript pattern of coding and non-coding RNA in sheep skeletal muscle. This study provided important information concerning the sheep genome and transcriptome annotation, which could provide a basis for further study.

  3. Non-coding landscapes of colorectal cancer

    PubMed Central

    Ragusa, Marco; Barbagallo, Cristina; Statello, Luisa; Condorelli, Angelo Giuseppe; Battaglia, Rosalia; Tamburello, Lucia; Barbagallo, Davide; Di Pietro, Cinzia; Purrello, Michele

    2015-01-01

    For two decades Vogelstein’s model has been the paradigm for describing the sequence of molecular changes within protein-coding genes that would lead to overt colorectal cancer (CRC). This model is now too simplistic in the light of recent studies, which have shown that our genome is pervasively transcribed in RNAs other than mRNAs, denominated non-coding RNAs (ncRNAs). The discovery that mutations in genes encoding these RNAs [i.e., microRNAs (miRNAs), long non-coding RNAs, and circular RNAs] are causally involved in cancer phenotypes has profoundly modified our vision of tumour molecular genetics and pathobiology. By exploiting a wide range of different mechanisms, ncRNAs control fundamental cellular processes, such as proliferation, differentiation, migration, angiogenesis and apoptosis: these data have also confirmed their role as oncogenes or tumor suppressors in cancer development and progression. The existence of a sophisticated RNA-based regulatory system, which dictates the correct functioning of protein-coding networks, has relevant biological and biomedical consequences. Different miRNAs involved in neoplastic and degenerative diseases exhibit potential predictive and prognostic properties. Furthermore, the key roles of ncRNAs make them very attractive targets for innovative therapeutic approaches. Several recent reports have shown that ncRNAs can be secreted by cells into the extracellular environment (i.e., blood and other body fluids): this suggests the existence of extracellular signalling mechanisms, which may be exploited by cells in physiology and pathology. In this review, we will summarize the most relevant issues on the involvement of cellular and extracellular ncRNAs in disease. We will then specifically describe their involvement in CRC pathobiology and their translational applications to CRC diagnosis, prognosis and therapy. PMID:26556998

  4. RNA-Seq Based Transcriptional Map of Bovine Respiratory Disease Pathogen “Histophilus somni 2336”

    PubMed Central

    Kumar, Ranjit; Lawrence, Mark L.; Watt, James; Cooksey, Amanda M.; Burgess, Shane C.; Nanduri, Bindu

    2012-01-01

    Genome structural annotation, i.e., identification and demarcation of the boundaries for all the functional elements in a genome (e.g., genes, non-coding RNAs, proteins and regulatory elements), is a prerequisite for systems level analysis. Current genome annotation programs do not identify all of the functional elements of the genome, especially small non-coding RNAs (sRNAs). Whole genome transcriptome analysis is a complementary method to identify “novel” genes, small RNAs, regulatory regions, and operon structures, thus improving the structural annotation in bacteria. In particular, the identification of non-coding RNAs has revealed their widespread occurrence and functional importance in gene regulation, stress and virulence. However, very little is known about non-coding transcripts in Histophilus somni, one of the causative agents of Bovine Respiratory Disease (BRD) as well as bovine infertility, abortion, septicemia, arthritis, myocarditis, and thrombotic meningoencephalitis. In this study, we report a single nucleotide resolution transcriptome map of H. somni strain 2336 using RNA-Seq method. The RNA-Seq based transcriptome map identified 94 sRNAs in the H. somni genome of which 82 sRNAs were never predicted or reported in earlier studies. We also identified 38 novel potential protein coding open reading frames that were absent in the current genome annotation. The transcriptome map allowed the identification of 278 operon (total 730 genes) structures in the genome. When compared with the genome sequence of a non-virulent strain 129Pt, a disproportionate number of sRNAs (∼30%) were located in genomic region unique to strain 2336 (∼18% of the total genome). This observation suggests that a number of the newly identified sRNAs in strain 2336 may be involved in strain-specific adaptations. PMID:22276113

  5. RNA-seq based transcriptional map of bovine respiratory disease pathogen "Histophilus somni 2336".

    PubMed

    Kumar, Ranjit; Lawrence, Mark L; Watt, James; Cooksey, Amanda M; Burgess, Shane C; Nanduri, Bindu

    2012-01-01

    Genome structural annotation, i.e., identification and demarcation of the boundaries for all the functional elements in a genome (e.g., genes, non-coding RNAs, proteins and regulatory elements), is a prerequisite for systems level analysis. Current genome annotation programs do not identify all of the functional elements of the genome, especially small non-coding RNAs (sRNAs). Whole genome transcriptome analysis is a complementary method to identify "novel" genes, small RNAs, regulatory regions, and operon structures, thus improving the structural annotation in bacteria. In particular, the identification of non-coding RNAs has revealed their widespread occurrence and functional importance in gene regulation, stress and virulence. However, very little is known about non-coding transcripts in Histophilus somni, one of the causative agents of Bovine Respiratory Disease (BRD) as well as bovine infertility, abortion, septicemia, arthritis, myocarditis, and thrombotic meningoencephalitis. In this study, we report a single nucleotide resolution transcriptome map of H. somni strain 2336 using RNA-Seq method.The RNA-Seq based transcriptome map identified 94 sRNAs in the H. somni genome of which 82 sRNAs were never predicted or reported in earlier studies. We also identified 38 novel potential protein coding open reading frames that were absent in the current genome annotation. The transcriptome map allowed the identification of 278 operon (total 730 genes) structures in the genome. When compared with the genome sequence of a non-virulent strain 129Pt, a disproportionate number of sRNAs (∼30%) were located in genomic region unique to strain 2336 (∼18% of the total genome). This observation suggests that a number of the newly identified sRNAs in strain 2336 may be involved in strain-specific adaptations.

  6. Protein-protein interaction specificity is captured by contact preferences and interface composition.

    PubMed

    Nadalin, Francesca; Carbone, Alessandra

    2018-02-01

    Large-scale computational docking will be increasingly used in future years to discriminate protein-protein interactions at the residue resolution. Complete cross-docking experiments make in silico reconstruction of protein-protein interaction networks a feasible goal. They ask for efficient and accurate screening of the millions structural conformations issued by the calculations. We propose CIPS (Combined Interface Propensity for decoy Scoring), a new pair potential combining interface composition with residue-residue contact preference. CIPS outperforms several other methods on screening docking solutions obtained either with all-atom or with coarse-grain rigid docking. Further testing on 28 CAPRI targets corroborates CIPS predictive power over existing methods. By combining CIPS with atomic potentials, discrimination of correct conformations in all-atom structures reaches optimal accuracy. The drastic reduction of candidate solutions produced by thousands of proteins docked against each other makes large-scale docking accessible to analysis. CIPS source code is freely available at http://www.lcqb.upmc.fr/CIPS. alessandra.carbone@lip6.fr. Supplementary data are available at Bioinformatics online. © The Author(s) 2017. Published by Oxford University Press.

  7. @TOME-2: a new pipeline for comparative modeling of protein-ligand complexes.

    PubMed

    Pons, Jean-Luc; Labesse, Gilles

    2009-07-01

    @TOME 2.0 is new web pipeline dedicated to protein structure modeling and small ligand docking based on comparative analyses. @TOME 2.0 allows fold recognition, template selection, structural alignment editing, structure comparisons, 3D-model building and evaluation. These tasks are routinely used in sequence analyses for structure prediction. In our pipeline the necessary software is efficiently interconnected in an original manner to accelerate all the processes. Furthermore, we have also connected comparative docking of small ligands that is performed using protein-protein superposition. The input is a simple protein sequence in one-letter code with no comment. The resulting 3D model, protein-ligand complexes and structural alignments can be visualized through dedicated Web interfaces or can be downloaded for further studies. These original features will aid in the functional annotation of proteins and the selection of templates for molecular modeling and virtual screening. Several examples are described to highlight some of the new functionalities provided by this pipeline. The server and its documentation are freely available at http://abcis.cbs.cnrs.fr/AT2/

  8. Conserved thioredoxin fold is present in Pisum sativum L. sieve element occlusion-1 protein

    PubMed Central

    Umate, Pavan; Tuteja, Renu

    2010-01-01

    Homology-based three-dimensional model for Pisum sativum sieve element occlusion 1 (Ps.SEO1) (forisomes) protein was constructed. A stretch of amino acids (residues 320 to 456) which is well conserved in all known members of forisomes proteins was used to model the 3D structure of Ps.SEO1. The structural prediction was done using Protein Homology/analogY Recognition Engine (PHYRE) web server. Based on studies of local sequence alignment, the thioredoxin-fold containing protein [Structural Classification of Proteins (SCOP) code d1o73a_], a member of the glutathione peroxidase family was selected as a template for modeling the spatial structure of Ps.SEO1. Selection was based on comparison of primary sequence, higher match quality and alignment accuracy. Motif 1 (EVF) is conserved in Ps.SEO1, Vicia faba (Vf.For1) and Medicago truncatula (MT.SEO3); motif 2 (KKED) is well conserved across all forisomes proteins and motif 3 (IGYIGNP) is conserved in Ps.SEO1 and Vf.For1. PMID:20404566

  9. Bit selection using field drilling data and mathematical investigation

    NASA Astrophysics Data System (ADS)

    Momeni, M. S.; Ridha, S.; Hosseini, S. J.; Meyghani, B.; Emamian, S. S.

    2018-03-01

    A drilling process will not be complete without the usage of a drill bit. Therefore, bit selection is considered to be an important task in drilling optimization process. To select a bit is considered as an important issue in planning and designing a well. This is simply because the cost of drilling bit in total cost is quite high. Thus, to perform this task, aback propagation ANN Model is developed. This is done by training the model using several wells and it is done by the usage of drilling bit records from offset wells. In this project, two models are developed by the usage of the ANN. One is to find predicted IADC bit code and one is to find Predicted ROP. Stage 1 was to find the IADC bit code by using all the given filed data. The output is the Targeted IADC bit code. Stage 2 was to find the Predicted ROP values using the gained IADC bit code in Stage 1. Next is Stage 3 where the Predicted ROP value is used back again in the data set to gain Predicted IADC bit code value. The output is the Predicted IADC bit code. Thus, at the end, there are two models that give the Predicted ROP values and Predicted IADC bit code values.

  10. Informational structure of genetic sequences and nature of gene splicing

    NASA Astrophysics Data System (ADS)

    Trifonov, E. N.

    1991-10-01

    Only about 1/20 of DNA of higher organisms codes for proteins, by means of classical triplet code. The rest of DNA sequences is largely silent, with unclear functions, if any. The triplet code is not the only code (message) carried by the sequences. There are three levels of molecular communication, where the same sequence ``talks'' to various bimolecules, while having, respectively, three different appearances: DNA, RNA and protein. Since the molecular structures and, hence, sequence specific preferences of these are substantially different, the original DNA sequence has to carry simultaneously three types of sequence patterns (codes, messages), thus, being a composite structure in which one had the same letter (nucleotide) is frequently involved in several overlapping codes of different nature. This multiplicity and overlapping of the codes is a unique feature of the Gnomic, language of genetic sequences. The coexisting codes have to be degenerate in various degrees to allow an optimal and concerted performance of all the encoded functions. There is an obvious conflict between the best possible performance of a given function and necessity to compromise the quality of a given sequence pattern in favor of other patterns. It appears that the major role of various changes in the sequences on their ``ontogenetic'' way from DNA to RNA to protein, like RNA editing and splicing, or protein post-translational modifications is to resolve such conflicts. New data are presented strongly indicating that the gene splicing is such a device to resolve the conflict between the code of DNA folding in chromatin and the triplet code for protein synthesis.

  11. Putative Nonribosomal Peptide Synthetase and Cytochrome P450 Genes Responsible for Tentoxin Biosynthesis in Alternaria alternata ZJ33

    PubMed Central

    Li, You-Hai; Han, Wen-Jin; Gui, Xi-Wu; Wei, Tao; Tang, Shuang-Yan; Jin, Jian-Ming

    2016-01-01

    Tentoxin, a cyclic tetrapeptide produced by several Alternaria species, inhibits the F1-ATPase activity of chloroplasts, resulting in chlorosis in sensitive plants. In this study, we report two clustered genes, encoding a putative non-ribosome peptide synthetase (NRPS) TES and a cytochrome P450 protein TES1, that are required for tentoxin biosynthesis in Alternaria alternata strain ZJ33, which was isolated from blighted leaves of Eupatorium adenophorum. Using a pair of primers designed according to the consensus sequences of the adenylation domain of NRPSs, two fragments containing putative adenylation domains were amplified from A. alternata ZJ33, and subsequent PCR analyses demonstrated that these fragments belonged to the same NRPS coding sequence. With no introns, TES consists of a single 15,486 base pair open reading frame encoding a predicted 5161 amino acid protein. Meanwhile, the TES1 gene is predicted to contain five introns and encode a 506 amino acid protein. The TES protein is predicted to be comprised of four peptide synthase modules with two additional N-methylation domains, and the number and arrangement of the modules in TES were consistent with the number and arrangement of the amino acid residues of tentoxin, respectively. Notably, both TES and TES1 null mutants generated via homologous recombination failed to produce tentoxin. This study provides the first evidence concerning the biosynthesis of tentoxin in A. alternata. PMID:27490569

  12. Kinetic models of gene expression including non-coding RNAs

    NASA Astrophysics Data System (ADS)

    Zhdanov, Vladimir P.

    2011-03-01

    In cells, genes are transcribed into mRNAs, and the latter are translated into proteins. Due to the feedbacks between these processes, the kinetics of gene expression may be complex even in the simplest genetic networks. The corresponding models have already been reviewed in the literature. A new avenue in this field is related to the recognition that the conventional scenario of gene expression is fully applicable only to prokaryotes whose genomes consist of tightly packed protein-coding sequences. In eukaryotic cells, in contrast, such sequences are relatively rare, and the rest of the genome includes numerous transcript units representing non-coding RNAs (ncRNAs). During the past decade, it has become clear that such RNAs play a crucial role in gene expression and accordingly influence a multitude of cellular processes both in the normal state and during diseases. The numerous biological functions of ncRNAs are based primarily on their abilities to silence genes via pairing with a target mRNA and subsequently preventing its translation or facilitating degradation of the mRNA-ncRNA complex. Many other abilities of ncRNAs have been discovered as well. Our review is focused on the available kinetic models describing the mRNA, ncRNA and protein interplay. In particular, we systematically present the simplest models without kinetic feedbacks, models containing feedbacks and predicting bistability and oscillations in simple genetic networks, and models describing the effect of ncRNAs on complex genetic networks. Mathematically, the presentation is based primarily on temporal mean-field kinetic equations. The stochastic and spatio-temporal effects are also briefly discussed.

  13. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments

    PubMed Central

    Haas, Brian J; Salzberg, Steven L; Zhu, Wei; Pertea, Mihaela; Allen, Jonathan E; Orvis, Joshua; White, Owen; Buell, C Robin; Wortman, Jennifer R

    2008-01-01

    EVidenceModeler (EVM) is presented as an automated eukaryotic gene structure annotation tool that reports eukaryotic gene structures as a weighted consensus of all available evidence. EVM, when combined with the Program to Assemble Spliced Alignments (PASA), yields a comprehensive, configurable annotation system that predicts protein-coding genes and alternatively spliced isoforms. Our experiments on both rice and human genome sequences demonstrate that EVM produces automated gene structure annotation approaching the quality of manual curation. PMID:18190707

  14. Genome Sequence of Bacillus cereus Strain TG1-6, a Plant-Beneficial Rhizobacterium That Is Highly Salt Tolerant

    PubMed Central

    2018-01-01

    ABSTRACT The complete genome sequence of Bacillus cereus strain TG1-6, which is a highly salt-tolerant rhizobacterium that enhances plant tolerance to drought stress, is reported here. The sequencing process was performed based on a combination of pyrosequencing and single-molecule sequencing. The complete genome is estimated to be approximately 5.42 Mb, containing a total of 5,610 predicted protein-coding DNA sequences (CDSs). PMID:29748401

  15. Genome Sequence of Bacillus megaterium Strain YC4-R4, a Plant Growth-Promoting Rhizobacterium Isolated from a High-Salinity Environment.

    PubMed

    Vílchez, Juan Ignacio; Tang, Qiming; Kaushal, Richa; Wang, Wei; Lv, Suhui; He, Danxia; Chu, Zhaoqing; Zhang, Heng; Liu, Renyi; Zhang, Huiming

    2018-06-21

    Here, we report the complete genome sequence for Bacillus megaterium strain YC4-R4, a highly salt-tolerant rhizobacterium that promotes growth in plants. The sequencing process was performed by combining pyrosequencing and single-molecule sequencing techniques. The complete genome is estimated to be approximately 5.44 Mb, containing a total of 5,673 predicted protein-coding DNA sequences (CDSs). Copyright © 2018 Vílchez et al.

  16. Genome-wide targeted prediction of ABA responsive genes in rice based on over-represented cis-motif in co-expressed genes.

    PubMed

    Lenka, Sangram K; Lohia, Bikash; Kumar, Abhay; Chinnusamy, Viswanathan; Bansal, Kailash C

    2009-02-01

    Abscisic acid (ABA), the popular plant stress hormone, plays a key role in regulation of sub-set of stress responsive genes. These genes respond to ABA through specific transcription factors which bind to cis-regulatory elements present in their promoters. We discovered the ABA Responsive Element (ABRE) core (ACGT) containing CGMCACGTGB motif as over-represented motif among the promoters of ABA responsive co-expressed genes in rice. Targeted gene prediction strategy using this motif led to the identification of 402 protein coding genes potentially regulated by ABA-dependent molecular genetic network. RT-PCR analysis of arbitrarily chosen 45 genes from the predicted 402 genes confirmed 80% accuracy of our prediction. Plant Gene Ontology (GO) analysis of ABA responsive genes showed enrichment of signal transduction and stress related genes among diverse functional categories.

  17. MicroRNA-like viral small RNA from porcine reproductive and respiratory syndrome virus negatively regulates viral replication by targeting the viral nonstructural protein 2.

    PubMed

    Li, Na; Yan, Yunhuan; Zhang, Angke; Gao, Jiming; Zhang, Chong; Wang, Xue; Hou, Gaopeng; Zhang, Gaiping; Jia, Jinbu; Zhou, En-Min; Xiao, Shuqi

    2016-12-13

    Many viruses encode microRNAs (miRNAs) that are small non-coding single-stranded RNAs which play critical roles in virus-host interactions. Porcine reproductive and respiratory syndrome virus (PRRSV) is one of the most economically impactful viruses in the swine industry. The present study sought to determine whether PRRSV encodes miRNAs that could regulate PRRSV replication. Four viral small RNAs (vsRNAs) were mapped to the stem-loop structures in the ORF1a, ORF1b and GP2a regions of the PRRSV genome by bioinformatics prediction and experimental verification. Of these, the structures with the lowest minimum free energy (MFE) values predicted for PRRSV-vsRNA1 corresponded to typical stem-loop, hairpin structures. Inhibition of PRRSV-vsRNA1 function led to significant increases in viral replication. Transfection with PRRSV-vsRNA1 mimics significantly inhibited PRRSV replication in primary porcine alveolar macrophages (PAMs). The time-dependent increase in the abundance of PRRSV-vsRNA1 mirrored the gradual upregulation of PRRSV RNA expression. Knockdown of proteins associated with cellular miRNA biogenesis demonstrated that Drosha and Argonaute (Ago2) are involved in PRRSV-vsRNA1 biogenesis. Moreover, PRRSV-vsRNA1 bound specifically to the nonstructural protein 2 (NSP2)-coding sequence of PRRSV genome RNA. Collectively, the results reveal that PRRSV encodes a functional PRRSV-vsRNA1 which auto-regulates PRRSV replication by directly targeting and suppressing viral NSP2 gene expression. These findings not only provide new insights into the mechanism of the pathogenesis of PRRSV, but also explore a potential avenue for controlling PRRSV infection using viral small RNAs.

  18. Genome-wide identification of miRNAs and lncRNAs in Cajanus cajan.

    PubMed

    Nithin, Chandran; Thomas, Amal; Basak, Jolly; Bahadur, Ranjit Prasad

    2017-11-15

    Non-coding RNAs (ncRNAs) are important players in the post transcriptional regulation of gene expression (PTGR). On one hand, microRNAs (miRNAs) are an abundant class of small ncRNAs (~22nt long) that negatively regulate gene expression at the levels of messenger RNAs stability and translation inhibition, on the other hand, long ncRNAs (lncRNAs) are a large and diverse class of transcribed non-protein coding RNA molecules (> 200nt) that play both up-regulatory as well as down-regulatory roles at the transcriptional level. Cajanus cajan, a leguminosae pulse crop grown in tropical and subtropical areas of the world, is a source of high value protein to vegetarians or very poor populations globally. Hence, genome-wide identification of miRNAs and lncRNAs in C. cajan is extremely important to understand their role in PTGR with a possible implication to generate improve variety of crops. We have identified 616 mature miRNAs in C. cajan belonging to 118 families, of which 578 are novel and not reported in MirBase21. A total of 1373 target sequences were identified for 180 miRNAs. Of these, 298 targets were characterized at the protein level. Besides, we have also predicted 3919 lncRNAs. Additionally, we have identified 87 of the predicted lncRNAs to be targeted by 66 miRNAs. miRNA and lncRNAs in plants are known to control a variety of traits including yield, quality and stress tolerance. Owing to its agricultural importance and medicinal value, the identified miRNA, lncRNA and their targets in C. cajan may be useful for genome editing to improve better quality crop. A thorough understanding of ncRNA-based cellular regulatory networks will aid in the improvement of C. cajan agricultural traits.

  19. Operon-mapper: A Web Server for Precise Operon Identification in Bacterial and Archaeal Genomes.

    PubMed

    Taboada, Blanca; Estrada, Karel; Ciria, Ricardo; Merino, Enrique

    2018-06-19

    Operon-mapper is a web server that accurately, easily, and directly predicts the operons of any bacterial or archaeal genome sequence. The operon predictions are based on the intergenic distance of neighboring genes as well as the functional relationships of their protein-coding products. To this end, Operon-mapper finds all the ORFs within a given nucleotide sequence, along with their genomic coordinates, orthology groups, and functional relationships. We believe that Operon-mapper, due to its accuracy, simplicity and speed, as well as the relevant information that it generates, will be a useful tool for annotating and characterizing genomic sequences. http://biocomputo.ibt.unam.mx/operon_mapper/.

  20. Long non-coding RNAs and mRNAs profiling during spleen development in pig.

    PubMed

    Che, Tiandong; Li, Diyan; Jin, Long; Fu, Yuhua; Liu, Yingkai; Liu, Pengliang; Wang, Yixin; Tang, Qianzi; Ma, Jideng; Wang, Xun; Jiang, Anan; Li, Xuewei; Li, Mingzhou

    2018-01-01

    Genome-wide transcriptomic studies in humans and mice have become extensive and mature. However, a comprehensive and systematic understanding of protein-coding genes and long non-coding RNAs (lncRNAs) expressed during pig spleen development has not been achieved. LncRNAs are known to participate in regulatory networks for an array of biological processes. Here, we constructed 18 RNA libraries from developing fetal pig spleen (55 days before birth), postnatal pig spleens (0, 30, 180 days and 2 years after birth), and the samples from the 2-year-old Wild Boar. A total of 15,040 lncRNA transcripts were identified among these samples. We found that the temporal expression pattern of lncRNAs was more restricted than observed for protein-coding genes. Time-series analysis showed two large modules for protein-coding genes and lncRNAs. The up-regulated module was enriched for genes related to immune and inflammatory function, while the down-regulated module was enriched for cell proliferation processes such as cell division and DNA replication. Co-expression networks indicated the functional relatedness between protein-coding genes and lncRNAs, which were enriched for similar functions over the series of time points examined. We identified numerous differentially expressed protein-coding genes and lncRNAs in all five developmental stages. Notably, ceruloplasmin precursor (CP), a protein-coding gene participating in antioxidant and iron transport processes, was differentially expressed in all stages. This study provides the first catalog of the developing pig spleen, and contributes to a fuller understanding of the molecular mechanisms underpinning mammalian spleen development.

  1. An open reading frame in intron seven of the sea urchin DNA-methyltransferase gene codes for a functional AP1 endonuclease.

    PubMed

    Cioffi, Anna Valentina; Ferrara, Diana; Cubellis, Maria Vittoria; Aniello, Francesco; Corrado, Marcella; Liguori, Francesca; Amoroso, Alessandro; Fucci, Laura; Branno, Margherita

    2002-08-01

    Analysis of the genome structure of the Paracentrotus lividus (sea urchin) DNA methyltransferase (DNA MTase) gene showed the presence of an open reading frame, named METEX, in intron 7 of the gene. METEX expression is developmentally regulated, showing no correlation with DNA MTase expression. In fact, DNA MTase transcripts are present at high concentrations in the early developmental stages, while METEX is expressed at late stages of development. Two METEX cDNA clones (Met1 and Met2) that are different in the 3' end have been isolated in a cDNA library screening. The putative translated protein from Met2 cDNA clone showed similarity with Escherichia coli endonuclease III on the basis of sequence and predictive three-dimensional structure. The protein, overexpressed in E. coli and purified, had functional properties similar to the endonuclease specific for apurinic/apyrimidinic (AP) sites on the basis of the lyase activity. Therefore the open reading frame, present in intron 7 of the P. lividus DNA MTase gene, codes for a functional AP endonuclease designated SuAP1.

  2. Genetic diversity of tyrosine hydroxylase (TH) and dopamine β-hydroxylase (DBH) genes in cattle breeds

    PubMed Central

    Lourenco-Jaramillo, Diana Lelidett; Sifuentes-Rincón, Ana María; Parra-Bracamonte, Gaspar Manuel; de la Rosa-Reyna, Xochitl Fabiola; Segura-Cabrera, Aldo; Arellano-Vera, Williams

    2012-01-01

    DNA from four cattle breeds was used to re-sequence all of the exons and 56% of the introns of the bovine tyrosine hydroxylase (TH) gene and 97% and 13% of the bovine dopamine β-hydroxylase (DBH) coding and non-coding sequences, respectively. Two novel single nucleotide polymorphisms (SNPs) and a microsatellite motif were found in the TH sequences. The DBH sequences contained 62 nucleotide changes, including eight non-synonymous SNPs (nsSNPs) that are of particular interest because they may alter protein function and therefore affect the phenotype. These DBH nsSNPs resulted in amino acid substitutions that were predicted to destabilize the protein structure. Six SNPs (one from TH and five from DBH non-synonymous SNPs) were genotyped in 140 animals; all of them were polymorphic and had a minor allele frequency of > 9%. There were significant differences in the intra- and inter-population haplotype distributions. The haplotype differences between Brahman cattle and the three B. t. taurus breeds (Charolais, Holstein and Lidia) were interesting from a behavioural point of view because of the differences in temperament between these breeds. PMID:22888292

  3. Using cellular automata to generate image representation for biological sequences.

    PubMed

    Xiao, X; Shao, S; Ding, Y; Huang, Z; Chen, X; Chou, K-C

    2005-02-01

    A novel approach to visualize biological sequences is developed based on cellular automata (Wolfram, S. Nature 1984, 311, 419-424), a set of discrete dynamical systems in which space and time are discrete. By transforming the symbolic sequence codes into the digital codes, and using some optimal space-time evolvement rules of cellular automata, a biological sequence can be represented by a unique image, the so-called cellular automata image. Many important features, which are originally hidden in a long and complicated biological sequence, can be clearly revealed thru its cellular automata image. With biological sequences entering into databanks rapidly increasing in the post-genomic era, it is anticipated that the cellular automata image will become a very useful vehicle for investigation into their key features, identification of their function, as well as revelation of their "fingerprint". It is anticipated that by using the concept of the pseudo amino acid composition (Chou, K.C. Proteins: Structure, Function, and Genetics, 2001, 43, 246-255), the cellular automata image approach can also be used to improve the quality of predicting protein attributes, such as structural class and subcellular location.

  4. Comparison of Code Predictions to Test Measurements for Two Orifice Compensated Hydrostatic Bearings at High Reynolds Numbers

    NASA Technical Reports Server (NTRS)

    Keba, John E.

    1996-01-01

    Rotordynamic coefficients obtained from testing two different hydrostatic bearings are compared to values predicted by two different computer programs. The first set of test data is from a relatively long (L/D=1) orifice compensated hydrostatic bearing tested in water by Texas A&M University (TAMU Bearing No.9). The second bearing is a shorter (L/D=.37) bearing and was tested in a lower viscosity fluid by Rocketdyne Division of Rockwell (Rocketdyne 'Generic' Bearing) at similar rotating speeds and pressures. Computed predictions of bearing rotordynamic coefficients were obtained from the cylindrical seal code 'ICYL', one of the industrial seal codes developed for NASA-LeRC by Mechanical Technology Inc., and from the hydrodynamic bearing code 'HYDROPAD'. The comparison highlights the difference the bearing has on the accuracy of the predictions. The TAMU Bearing No. 9 test data is closely matched by the predictions obtained for the HYDROPAD code (except for added mass terms) whereas significant differences exist between the data from the Rocketdyne 'Generic' bearing the code predictions. The results suggest that some aspects of the fluid behavior in the shorter, higher Reynolds Number 'Generic' bearing may not be modeled accurately in the codes. The ICYL code predictions for flowrate and direct stiffness approximately equal those of HYDROPAD. Significant differences in cross-coupled stiffness and the damping terms were obtained relative to HYDROPAD and both sets of test data. Several observations are included concerning application of the ICYL code.

  5. A deep learning framework for modeling structural features of RNA-binding protein targets

    PubMed Central

    Zhang, Sai; Zhou, Jingtian; Hu, Hailin; Gong, Haipeng; Chen, Ligong; Cheng, Chao; Zeng, Jianyang

    2016-01-01

    RNA-binding proteins (RBPs) play important roles in the post-transcriptional control of RNAs. Identifying RBP binding sites and characterizing RBP binding preferences are key steps toward understanding the basic mechanisms of the post-transcriptional gene regulation. Though numerous computational methods have been developed for modeling RBP binding preferences, discovering a complete structural representation of the RBP targets by integrating their available structural features in all three dimensions is still a challenging task. In this paper, we develop a general and flexible deep learning framework for modeling structural binding preferences and predicting binding sites of RBPs, which takes (predicted) RNA tertiary structural information into account for the first time. Our framework constructs a unified representation that characterizes the structural specificities of RBP targets in all three dimensions, which can be further used to predict novel candidate binding sites and discover potential binding motifs. Through testing on the real CLIP-seq datasets, we have demonstrated that our deep learning framework can automatically extract effective hidden structural features from the encoded raw sequence and structural profiles, and predict accurate RBP binding sites. In addition, we have conducted the first study to show that integrating the additional RNA tertiary structural features can improve the model performance in predicting RBP binding sites, especially for the polypyrimidine tract-binding protein (PTB), which also provides a new evidence to support the view that RBPs may own specific tertiary structural binding preferences. In particular, the tests on the internal ribosome entry site (IRES) segments yield satisfiable results with experimental support from the literature and further demonstrate the necessity of incorporating RNA tertiary structural information into the prediction model. The source code of our approach can be found in https://github.com/thucombio/deepnet-rbp. PMID:26467480

  6. Evaluating High-Throughput Ab Initio Gene Finders to Discover Proteins Encoded in Eukaryotic Pathogen Genomes Missed by Laboratory Techniques

    PubMed Central

    Goodswen, Stephen J.; Kennedy, Paul J.; Ellis, John T.

    2012-01-01

    Next generation sequencing technology is advancing genome sequencing at an unprecedented level. By unravelling the code within a pathogen’s genome, every possible protein (prior to post-translational modifications) can theoretically be discovered, irrespective of life cycle stages and environmental stimuli. Now more than ever there is a great need for high-throughput ab initio gene finding. Ab initio gene finders use statistical models to predict genes and their exon-intron structures from the genome sequence alone. This paper evaluates whether existing ab initio gene finders can effectively predict genes to deduce proteins that have presently missed capture by laboratory techniques. An aim here is to identify possible patterns of prediction inaccuracies for gene finders as a whole irrespective of the target pathogen. All currently available ab initio gene finders are considered in the evaluation but only four fulfil high-throughput capability: AUGUSTUS, GeneMark_hmm, GlimmerHMM, and SNAP. These gene finders require training data specific to a target pathogen and consequently the evaluation results are inextricably linked to the availability and quality of the data. The pathogen, Toxoplasma gondii, is used to illustrate the evaluation methods. The results support current opinion that predicted exons by ab initio gene finders are inaccurate in the absence of experimental evidence. However, the results reveal some patterns of inaccuracy that are common to all gene finders and these inaccuracies may provide a focus area for future gene finder developers. PMID:23226328

  7. RNA G-quadruplexes: emerging mechanisms in disease

    PubMed Central

    Cammas, Anne

    2017-01-01

    Abstract RNA G-quadruplexes (G4s) are formed by G-rich RNA sequences in protein-coding (mRNA) and non-coding (ncRNA) transcripts that fold into a four-stranded conformation. Experimental studies and bioinformatic predictions support the view that these structures are involved in different cellular functions associated to both DNA processes (telomere elongation, recombination and transcription) and RNA post-transcriptional mechanisms (including pre-mRNA processing, mRNA turnover, targeting and translation). An increasing number of different diseases have been associated with the inappropriate regulation of RNA G4s exemplifying the potential importance of these structures on human health. Here, we review the different molecular mechanisms underlying the link between RNA G4s and human diseases by proposing several overlapping models of deregulation emerging from recent research, including (i) sequestration of RNA-binding proteins, (ii) aberrant expression or localization of RNA G4-binding proteins, (iii) repeat associated non-AUG (RAN) translation, (iv) mRNA translational blockade and (v) disabling of protein–RNA G4 complexes. This review also provides a comprehensive survey of the functional RNA G4 and their mechanisms of action. Finally, we highlight future directions for research aimed at improving our understanding on RNA G4-mediated regulatory mechanisms linked to diseases. PMID:28013268

  8. An in silico argument for mitochondrial microRNA as a determinant of primary non function in liver transplantation.

    PubMed

    Khorsandi, Shirin Elizabeth; Salehi, Siamak; Cortes, Miriam; Vilca-Melendez, Hector; Menon, Krishna; Srinivasan, Parthi; Prachalias, Andreas; Jassem, Wayel; Heaton, Nigel

    2018-02-15

    Mitochondria have their own genomic, transcriptomic and proteomic machinery but are unable to be autonomous, needing both nuclear and mitochondrial genomes. The aim of this work was to use computational biology to explore the involvement of Mitochondrial microRNAs (MitomiRs) and their interactions with the mitochondrial proteome in a clinical model of primary non function (PNF) of the donor after cardiac death (DCD) liver. Archival array data on the differential expression of miRNA in DCD PNF was re-analyzed using a number of publically available computational algorithms. 10 MitomiRs were identified of importance in DCD PNF, 7 with predicted interaction of their seed sequence with the mitochondrial transcriptome that included both coding, and non coding areas of the hypervariability region 1 (HVR1) and control region. Considering miRNA regulation of the nuclear encoded mitochondrial proteome, 7 hypothetical small proteins were identified with homolog function that ranged from co-factor for formation of ATP Synthase, REDOX balance and an importin/exportin protein. In silico, unconventional seed interactions, both non canonical and alternative seed sites, appear to be of greater importance in MitomiR regulation of the mitochondrial genome. Additionally, a number of novel small proteins of relevance in transplantation have been identified which need further characterization.

  9. Theory of Mind: A Neural Prediction Problem

    PubMed Central

    Koster-Hale, Jorie; Saxe, Rebecca

    2014-01-01

    Predictive coding posits that neural systems make forward-looking predictions about incoming information. Neural signals contain information not about the currently perceived stimulus, but about the difference between the observed and the predicted stimulus. We propose to extend the predictive coding framework from high-level sensory processing to the more abstract domain of theory of mind; that is, to inferences about others’ goals, thoughts, and personalities. We review evidence that, across brain regions, neural responses to depictions of human behavior, from biological motion to trait descriptions, exhibit a key signature of predictive coding: reduced activity to predictable stimuli. We discuss how future experiments could distinguish predictive coding from alternative explanations of this response profile. This framework may provide an important new window on the neural computations underlying theory of mind. PMID:24012000

  10. Bitis gabonica (Gaboon viper) snake venom gland: toward a catalog for the full- length transcripts (cDNA) and proteins

    PubMed Central

    Francischetti, Ivo M. B.; My-Pham, Van; Harrison, Jim; Garfield, Mark K.; Ribeiro, José M. C.

    2010-01-01

    The venom gland of the snake Bitis gabonica (Gaboon viper) was used for the first time to construct a unidirectional cDNA phage library followed by high-throughput sequencing and bioinformatic analysis. Hundreds of cDNAs were obtained and clustered into contigs. We found mostly novel full-length cDNA coding for metalloproteases (P-II and P-III classes), Lys49-phospholipase A2, serine proteases with essential mutations in the active site, Kunitz protease inhibitors, several C-type lectins, bradykinin-potentiating peptide, vascular endothelial growth factor, nucleotidases and nucleases, nerve growth factor, and L-amino acid oxidases. Two new members of the recently described short coding region family of disintegrin, displaying RGD and MLD motifs are reported. In addition, we have identified for the first time a cytokine-like molecule and a multi-Kunitz protease inhibitor in snake venoms. The CLUSTAL alignment and the unrooted cladograms for selected families of B. gabonica venom proteins are also presented. A significant number of sequences were devoid of database matches, suggesting that their biologic function remains to be identified. This paper also reports the N-terminus of the 15 most abundant venom proteins and the sequences matching their corresponding transcripts. The electronic version of this manuscript, available on request, contains spreadsheets with hyperlinks to FASTA-formatted files for each contig and the best match to the GenBank and Conserved Domain Databases, in addition to CLUSTAL alignments of each contig. We have thus generated a comprehensive catalog of the B. gabonica venom gland, containing for each secreted protein: i) the predicted molecular weight, ii) the predicted isoelectric point, iii) the accession number, and iv) the putative function. The role of these molecules is discussed in the context of the envenomation caused by the Gaboon viper. PMID:15276202

  11. Comparative analysis of secretomes from Ectomycorrhizal fungi with an emphasis on small-secreted proteins

    DOE PAGES

    Pellegrin, Clement; Morin, Emmanuelle; Martin, Francis M.; ...

    2015-11-18

    Fungi are major players in the carbon cycle in forest ecosystems due to the wide range of interactions they have with plants either through soil degradation processes by litter decayers or biotrophic interactions with pathogenic and ectomycorrhizal symbionts. Secretion of fungal proteins mediates these interactions by allowing the fungus to interact with its environment and/or host. Ectomycorrhizal (ECM) symbiosis independently appeared several times throughout evolution and involves approximately 80% of trees. Despite extensive physiological studies on ECM symbionts, little is known about the composition and specificities of their secretomes. In this study, we used a bioinformatics pipeline to predict andmore » analyze the secretomes of 49 fungal species, including 11 ECM fungi, wood and soil decayers and pathogenic fungi to tackle the following questions: (1) Are there differences between the secretomes of saprophytic and ECM fungi? (2) Are small-secreted proteins (SSPs) more abundant in biotrophic fungi than in saprophytic fungi? and (3) Are there SSPs shared between ECM, saprotrophic and pathogenic fungi? We showed that the number of predicted secreted proteins is similar in the surveyed species, independently of their lifestyle. The secretome from ECM fungi is characterized by a restricted number of secreted CAZymes, but their repertoires of secreted proteases and lipases are similar to those of saprotrophic fungi. Focusing on SSPs, we showed that the secretome of ECM fungi is enriched in SSPs compared with other species. Most of the SSPs are coded by orphan genes with no known PFAM domain or similarities to known sequences in databases. Finally, based on the clustering analysis, we identified shared- and lifestyle-specific SSPs between saprotrophic and ECM fungi. The presence of SSPs is not limited to fungi interacting with living plants as the genome of saprotrophic fungi also code for numerous SSPs. As a result, ECM fungi shared lifestyle-specific SSPs likely involved in symbiosis that are good candidates for further functional analyses.« less

  12. Comparative analysis of secretomes from Ectomycorrhizal fungi with an emphasis on small-secreted proteins

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Pellegrin, Clement; Morin, Emmanuelle; Martin, Francis M.

    Fungi are major players in the carbon cycle in forest ecosystems due to the wide range of interactions they have with plants either through soil degradation processes by litter decayers or biotrophic interactions with pathogenic and ectomycorrhizal symbionts. Secretion of fungal proteins mediates these interactions by allowing the fungus to interact with its environment and/or host. Ectomycorrhizal (ECM) symbiosis independently appeared several times throughout evolution and involves approximately 80% of trees. Despite extensive physiological studies on ECM symbionts, little is known about the composition and specificities of their secretomes. In this study, we used a bioinformatics pipeline to predict andmore » analyze the secretomes of 49 fungal species, including 11 ECM fungi, wood and soil decayers and pathogenic fungi to tackle the following questions: (1) Are there differences between the secretomes of saprophytic and ECM fungi? (2) Are small-secreted proteins (SSPs) more abundant in biotrophic fungi than in saprophytic fungi? and (3) Are there SSPs shared between ECM, saprotrophic and pathogenic fungi? We showed that the number of predicted secreted proteins is similar in the surveyed species, independently of their lifestyle. The secretome from ECM fungi is characterized by a restricted number of secreted CAZymes, but their repertoires of secreted proteases and lipases are similar to those of saprotrophic fungi. Focusing on SSPs, we showed that the secretome of ECM fungi is enriched in SSPs compared with other species. Most of the SSPs are coded by orphan genes with no known PFAM domain or similarities to known sequences in databases. Finally, based on the clustering analysis, we identified shared- and lifestyle-specific SSPs between saprotrophic and ECM fungi. The presence of SSPs is not limited to fungi interacting with living plants as the genome of saprotrophic fungi also code for numerous SSPs. As a result, ECM fungi shared lifestyle-specific SSPs likely involved in symbiosis that are good candidates for further functional analyses.« less

  13. Comparative Analysis of Secretomes from Ectomycorrhizal Fungi with an Emphasis on Small-Secreted Proteins

    PubMed Central

    Pellegrin, Clement; Morin, Emmanuelle; Martin, Francis M.; Veneault-Fourrey, Claire

    2015-01-01

    Fungi are major players in the carbon cycle in forest ecosystems due to the wide range of interactions they have with plants either through soil degradation processes by litter decayers or biotrophic interactions with pathogenic and ectomycorrhizal symbionts. Secretion of fungal proteins mediates these interactions by allowing the fungus to interact with its environment and/or host. Ectomycorrhizal (ECM) symbiosis independently appeared several times throughout evolution and involves approximately 80% of trees. Despite extensive physiological studies on ECM symbionts, little is known about the composition and specificities of their secretomes. In this study, we used a bioinformatics pipeline to predict and analyze the secretomes of 49 fungal species, including 11 ECM fungi, wood and soil decayers and pathogenic fungi to tackle the following questions: (1) Are there differences between the secretomes of saprophytic and ECM fungi? (2) Are small-secreted proteins (SSPs) more abundant in biotrophic fungi than in saprophytic fungi? and (3) Are there SSPs shared between ECM, saprotrophic and pathogenic fungi? We showed that the number of predicted secreted proteins is similar in the surveyed species, independently of their lifestyle. The secretome from ECM fungi is characterized by a restricted number of secreted CAZymes, but their repertoires of secreted proteases and lipases are similar to those of saprotrophic fungi. Focusing on SSPs, we showed that the secretome of ECM fungi is enriched in SSPs compared with other species. Most of the SSPs are coded by orphan genes with no known PFAM domain or similarities to known sequences in databases. Finally, based on the clustering analysis, we identified shared- and lifestyle-specific SSPs between saprotrophic and ECM fungi. The presence of SSPs is not limited to fungi interacting with living plants as the genome of saprotrophic fungi also code for numerous SSPs. ECM fungi shared lifestyle-specific SSPs likely involved in symbiosis that are good candidates for further functional analyses. PMID:26635749

  14. Deciphering the Fluorine Code-The Many Hats Fluorine Wears in a Protein Environment.

    PubMed

    Berger, Allison Ann; Völler, Jan-Stefan; Budisa, Nediljko; Koksch, Beate

    2017-09-19

    Deciphering the fluorine code is how we describe not only the focus of this Account, but also the systematic approach to studying the impact of fluorine's incorporation on the properties of peptides and proteins used by our groups and others. The introduction of fluorine has been shown to impart favorable, but seldom predictable, properties to peptides and proteins, but up until about two decades ago the outcomes of fluorine modification of peptides and proteins were largely left to chance. Driven by the motivation to extend the application of the unique properties of the element fluorine from medicinal and agro chemistry to peptide and protein engineering we have established extensive research programs that enable the systematic investigation of effects that accompany the introduction of fluorine into this class of biopolymers. The introduction of fluorine into amino acids offers a universe of options for modifications with regard to number and position of fluorine substituents in the amino acid side chain. Moreover, it is important to emphasize that the consequences of incorporating the C-F bond into a biopolymer can be attributed to two distinct yet related phenomena: (i) the fluorine substituent can directly engage in intermolecular interactions with its environment and/or (ii) the other functional groups present in the molecule can be influenced by the electron withdrawing nature of this element (intramolecular) and in turn interact differently with their immediate environment (intermolecular). Based on our studies, we have shown that a change in number and/or position of as subtle as one single fluorine substituent has the power to considerably modify key properties of amino acids such as hydrophobicity, polarity, and secondary structure propensity. These properties are crucial factors in peptide and protein engineering, and thus, fluorinated amino acids can be applied to fine-tune properties such as protein folding, proteolytic stability, and protein-protein interactions provided we understand and become able to predict the outcome of a fluorine substitution in this context. With this Account, we attempt to analyze information we gained from our recent projects on how the nature of the fluorine atom and C-F bond influence four key properties of peptides and proteins: peptide folding, protein-protein interactions, ribosomal translation, and protease stability. These results impressively show why the introduction of fluorine creates a new class of amino acids with a repertoire of functionalities that is unique to the world of proteins and in some cases orthogonal to the set of canonical and natural amino acids. Our concluding statements aim to offer a few conserved design principles that have emerged from systematic studies over the last two decades; in this way, we hope to advance the field of peptide and protein engineering based on the judicious introduction of fluorinated building blocks.

  15. Efficient temporal and interlayer parameter prediction for weighted prediction in scalable high efficiency video coding

    NASA Astrophysics Data System (ADS)

    Tsang, Sik-Ho; Chan, Yui-Lam; Siu, Wan-Chi

    2017-01-01

    Weighted prediction (WP) is an efficient video coding tool that was introduced since the establishment of the H.264/AVC video coding standard, for compensating the temporal illumination change in motion estimation and compensation. WP parameters, including a multiplicative weight and an additive offset for each reference frame, are required to be estimated and transmitted to the decoder by slice header. These parameters cause extra bits in the coded video bitstream. High efficiency video coding (HEVC) provides WP parameter prediction to reduce the overhead. Therefore, WP parameter prediction is crucial to research works or applications, which are related to WP. Prior art has been suggested to further improve the WP parameter prediction by implicit prediction of image characteristics and derivation of parameters. By exploiting both temporal and interlayer redundancies, we propose three WP parameter prediction algorithms, enhanced implicit WP parameter, enhanced direct WP parameter derivation, and interlayer WP parameter, to further improve the coding efficiency of HEVC. Results show that our proposed algorithms can achieve up to 5.83% and 5.23% bitrate reduction compared to the conventional scalable HEVC in the base layer for SNR scalability and 2× spatial scalability, respectively.

  16. PyPLIF: Python-based Protein-Ligand Interaction Fingerprinting.

    PubMed

    Radifar, Muhammad; Yuniarti, Nunung; Istyastono, Enade Perdana

    2013-01-01

    Structure-based virtual screening (SBVS) methods often rely on docking score. The docking score is an over-simplification of the actual ligand-target binding. Its capability to model and predict the actual binding reality is limited. Recently, interaction fingerprinting (IFP) has come and offered us an alternative way to model reality. IFP provides us an alternate way to examine protein-ligand interactions. The docking score indicates the approximate affinity and IFP shows the interaction specificity. IFP is a method to convert three dimensional (3D) protein-ligand interactions into one dimensional (1D) bitstrings. The bitstrings are subsequently employed to compare the protein-ligand interaction predicted by the docking tool against the reference ligand. These comparisons produce scores that can be used to enhance the quality of SBVS campaigns. However, some IFP tools are either proprietary or using a proprietary library, which limits the access to the tools and the development of customized IFP algorithm. Therefore, we have developed PyPLIF, a Python-based open source tool to analyze IFP. In this article, we describe PyPLIF and its application to enhance the quality of SBVS in order to identify antagonists for estrogen α receptor (ERα). PyPLIF is freely available at http://code.google.com/p/pyplif.

  17. Genetic analyses of bone morphogenetic protein 2, 4 and 7 in congenital combined pituitary hormone deficiency.

    PubMed

    Breitfeld, Jana; Martens, Susanne; Klammt, Jürgen; Schlicke, Marina; Pfäffle, Roland; Krause, Kerstin; Weidle, Kerstin; Schleinitz, Dorit; Stumvoll, Michael; Führer, Dagmar; Kovacs, Peter; Tönjes, Anke

    2013-12-01

    The complex process of development of the pituitary gland is regulated by a number of signalling molecules and transcription factors. Mutations in these factors have been identified in rare cases of congenital hypopituitarism but for most subjects with combined pituitary hormone deficiency (CPHD) genetic causes are unknown. Bone morphogenetic proteins (BMPs) affect induction and growth of the pituitary primordium and thus represent plausible candidates for mutational screening of patients with CPHD. We sequenced BMP2, 4 and 7 in 19 subjects with CPHD. For validation purposes, novel genetic variants were genotyped in 1046 healthy subjects. Additionally, potential functional relevance for most promising variants has been assessed by phylogenetic analyses and prediction of effects on protein structure. Sequencing revealed two novel variants and confirmed 30 previously known polymorphisms and mutations in BMP2, 4 and 7. Although phylogenetic analyses indicated that these variants map within strongly conserved gene regions, there was no direct support for their impact on protein structure when applying predictive bioinformatics tools. A mutation in the BMP4 coding region resulting in an amino acid exchange (p.Arg300Pro) appeared most interesting among the identified variants. Further functional analyses are required to ultimately map the relevance of these novel variants in CPHD.

  18. Genetic analyses of bone morphogenetic protein 2, 4 and 7 in congenital combined pituitary hormone deficiency

    PubMed Central

    2013-01-01

    Background The complex process of development of the pituitary gland is regulated by a number of signalling molecules and transcription factors. Mutations in these factors have been identified in rare cases of congenital hypopituitarism but for most subjects with combined pituitary hormone deficiency (CPHD) genetic causes are unknown. Bone morphogenetic proteins (BMPs) affect induction and growth of the pituitary primordium and thus represent plausible candidates for mutational screening of patients with CPHD. Methods We sequenced BMP2, 4 and 7 in 19 subjects with CPHD. For validation purposes, novel genetic variants were genotyped in 1046 healthy subjects. Additionally, potential functional relevance for most promising variants has been assessed by phylogenetic analyses and prediction of effects on protein structure. Results Sequencing revealed two novel variants and confirmed 30 previously known polymorphisms and mutations in BMP2, 4 and 7. Although phylogenetic analyses indicated that these variants map within strongly conserved gene regions, there was no direct support for their impact on protein structure when applying predictive bioinformatics tools. Conclusions A mutation in the BMP4 coding region resulting in an amino acid exchange (p.Arg300Pro) appeared most interesting among the identified variants. Further functional analyses are required to ultimately map the relevance of these novel variants in CPHD. PMID:24289245

  19. Complete Genome Sequence of the Soil Actinomycete Kocuria rhizophila▿

    PubMed Central

    Takarada, Hiromi; Sekine, Mitsuo; Kosugi, Hiroki; Matsuo, Yasunori; Fujisawa, Takatomo; Omata, Seiha; Kishi, Emi; Shimizu, Ai; Tsukatani, Naofumi; Tanikawa, Satoshi; Fujita, Nobuyuki; Harayama, Shigeaki

    2008-01-01

    The soil actinomycete Kocuria rhizophila belongs to the suborder Micrococcineae, a divergent bacterial group for which only a limited amount of genomic information is currently available. K. rhizophila is also important in industrial applications; e.g., it is commonly used as a standard quality control strain for antimicrobial susceptibility testing. Sequencing and annotation of the genome of K. rhizophila DC2201 (NBRC 103217) revealed a single circular chromosome (2,697,540 bp; G+C content of 71.16%) containing 2,357 predicted protein-coding genes. Most of the predicted proteins (87.7%) were orthologous to actinobacterial proteins, and the genome showed fairly good conservation of synteny with taxonomically related actinobacterial genomes. On the other hand, the genome seems to encode much smaller numbers of proteins necessary for secondary metabolism (one each of nonribosomal peptide synthetase and type III polyketide synthase), transcriptional regulation, and lateral gene transfer, reflecting the small genome size. The presence of probable metabolic pathways for the transformation of phenolic compounds generated from the decomposition of plant materials, and the presence of a large number of genes associated with membrane transport, particularly amino acid transporters and drug efflux pumps, may contribute to the organism's utilization of root exudates, as well as the tolerance to various organic compounds. PMID:18408034

  20. The Genome of the Generalist Plant Pathogen Fusarium avenaceum Is Enriched with Genes Involved in Redox, Signaling and Secondary Metabolism

    PubMed Central

    Lysøe, Erik; Harris, Linda J.; Walkowiak, Sean; Subramaniam, Rajagopal; Divon, Hege H.; Riiser, Even S.; Llorens, Carlos; Gabaldón, Toni; Kistler, H. Corby; Jonkers, Wilfried; Kolseth, Anna-Karin; Nielsen, Kristian F.; Thrane, Ulf; Frandsen, Rasmus J. N.

    2014-01-01

    Fusarium avenaceum is a fungus commonly isolated from soil and associated with a wide range of host plants. We present here three genome sequences of F. avenaceum, one isolated from barley in Finland and two from spring and winter wheat in Canada. The sizes of the three genomes range from 41.6–43.1 MB, with 13217–13445 predicted protein-coding genes. Whole-genome analysis showed that the three genomes are highly syntenic, and share>95% gene orthologs. Comparative analysis to other sequenced Fusaria shows that F. avenaceum has a very large potential for producing secondary metabolites, with between 75 and 80 key enzymes belonging to the polyketide, non-ribosomal peptide, terpene, alkaloid and indole-diterpene synthase classes. In addition to known metabolites from F. avenaceum, fuscofusarin and JM-47 were detected for the first time in this species. Many protein families are expanded in F. avenaceum, such as transcription factors, and proteins involved in redox reactions and signal transduction, suggesting evolutionary adaptation to a diverse and cosmopolitan ecology. We found that 20% of all predicted proteins were considered to be secreted, supporting a life in the extracellular space during interaction with plant hosts. PMID:25409087

  1. DRA/NASA/ONERA Collaboration on Icing Research. Part 2; Prediction of Airfoil Ice Accretion

    NASA Technical Reports Server (NTRS)

    Wright, William B.; Gent, R. W.; Guffond, Didier

    1997-01-01

    This report presents results from a joint study by DRA, NASA, and ONERA for the purpose of comparing, improving, and validating the aircraft icing computer codes developed by each agency. These codes are of three kinds: (1) water droplet trajectory prediction, (2) ice accretion modeling, and (3) transient electrothermal deicer analysis. In this joint study, the agencies compared their code predictions with each other and with experimental results. These comparison exercises were published in three technical reports, each with joint authorship. DRA published and had first authorship of Part 1 - Droplet Trajectory Calculations, NASA of Part 2 - Ice Accretion Prediction, and ONERA of Part 3 - Electrothermal Deicer Analysis. The results cover work done during the period from August 1986 to late 1991. As a result, all of the information in this report is dated. Where necessary, current information is provided to show the direction of current research. In this present report on ice accretion, each agency predicted ice shapes on two dimensional airfoils under icing conditions for which experimental ice shapes were available. In general, all three codes did a reasonable job of predicting the measured ice shapes. For any given experimental condition, one of the three codes predicted the general ice features (i.e., shape, impingement limits, mass of ice) somewhat better than did the other two. However, no single code consistently did better than the other two over the full range of conditions examined, which included rime, mixed, and glaze ice conditions. In several of the cases, DRA showed that the user's knowledge of icing can significantly improve the accuracy of the code prediction. Rime ice predictions were reasonably accurate and consistent among the codes, because droplets freeze on impact and the freezing model is simple. Glaze ice predictions were less accurate and less consistent among the codes, because the freezing model is more complex and is critically dependent upon unsubstantiated heat transfer and surface roughness models. Thus, heat transfer prediction methods used in the codes became the subject for a separate study in this report to compare predicted heat transfer coefficients with a limited experimental database of heat transfer coefficients for cylinders with simulated glaze and rime ice shapes. The codes did a good job of predicting heat transfer coefficients near the stagnation region of the ice shapes. But in the region of the ice horns, all three codes predicted heat transfer coefficients considerably higher than the measured values. An important conclusion of this study is that further research is needed to understand the finer detail of of the glaze ice accretion process and to develop improved glaze ice accretion models.

  2. Death of a dogma: eukaryotic mRNAs can code for more than one protein.

    PubMed

    Mouilleron, Hélène; Delcourt, Vivian; Roucou, Xavier

    2016-01-08

    mRNAs carry the genetic information that is translated by ribosomes. The traditional view of a mature eukaryotic mRNA is a molecule with three main regions, the 5' UTR, the protein coding open reading frame (ORF) or coding sequence (CDS), and the 3' UTR. This concept assumes that ribosomes translate one ORF only, generally the longest one, and produce one protein. As a result, in the early days of genomics and bioinformatics, one CDS was associated with each protein-coding gene. This fundamental concept of a single CDS is being challenged by increasing experimental evidence indicating that annotated proteins are not the only proteins translated from mRNAs. In particular, mass spectrometry (MS)-based proteomics and ribosome profiling have detected productive translation of alternative open reading frames. In several cases, the alternative and annotated proteins interact. Thus, the expression of two or more proteins translated from the same mRNA may offer a mechanism to ensure the co-expression of proteins which have functional interactions. Translational mechanisms already described in eukaryotic cells indicate that the cellular machinery is able to translate different CDSs from a single viral or cellular mRNA. In addition to summarizing data showing that the protein coding potential of eukaryotic mRNAs has been underestimated, this review aims to challenge the single translated CDS dogma. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.

  3. Complete mitochondrial genome of Germain's Peacock-Pheasant Polyplectron germaini (Aves, Galliformes, Phasianidae).

    PubMed

    Omeire, Destiny; Abdin, Shaunte; Brooks, Daniel M; Miranda, Hector C

    2015-04-01

    The Germain's Peacock-Pheasant Polyplectron germaini (Aves, Galliformes, Phasianidae) is classified as Near Threatened on the IUCN Red List. The complete mitochondrial genome of P. germaini is 16,699 bp, consisting of 13 protein-coding genes, 2 rRNA, 22 tRNA genes and 1 control region. All of the 13 protein-coding genes have ATG as start codon. Eight of the 13 protein-coding genes have TAA as stop codon.

  4. Posterior Predictive Bayesian Phylogenetic Model Selection

    PubMed Central

    Lewis, Paul O.; Xie, Wangang; Chen, Ming-Hui; Fan, Yu; Kuo, Lynn

    2014-01-01

    We present two distinctly different posterior predictive approaches to Bayesian phylogenetic model selection and illustrate these methods using examples from green algal protein-coding cpDNA sequences and flowering plant rDNA sequences. The Gelfand–Ghosh (GG) approach allows dissection of an overall measure of model fit into components due to posterior predictive variance (GGp) and goodness-of-fit (GGg), which distinguishes this method from the posterior predictive P-value approach. The conditional predictive ordinate (CPO) method provides a site-specific measure of model fit useful for exploratory analyses and can be combined over sites yielding the log pseudomarginal likelihood (LPML) which is useful as an overall measure of model fit. CPO provides a useful cross-validation approach that is computationally efficient, requiring only a sample from the posterior distribution (no additional simulation is required). Both GG and CPO add new perspectives to Bayesian phylogenetic model selection based on the predictive abilities of models and complement the perspective provided by the marginal likelihood (including Bayes Factor comparisons) based solely on the fit of competing models to observed data. [Bayesian; conditional predictive ordinate; CPO; L-measure; LPML; model selection; phylogenetics; posterior predictive.] PMID:24193892

  5. The Prognostic Accuracy of Suggested Predictors of Failure of Medical Management in Patients With Nontuberculous Spinal Epidural Abscess.

    PubMed

    Stratton, Alexandra; Faris, Peter; Thomas, Kenneth

    2018-05-01

    Retrospective cohort study. To test the external validity of the 2 published prediction criteria for failure of medical management in patients with spinal epidural abscess (SEA). Patients with SEA over a 10-year period at a tertiary care center were identified using ICD-10 (International Classification of Diseases, 10th Revision) diagnostic codes; electronic and paper charts were reviewed. The incidence of SEA and the proportion of patients with SEA that were treated medically were calculated. The rate of failure of medical management was determined. The published prediction models were applied to our data to determine how predictive they were of failure in our cohort. A total of 550 patients were identified using ICD-10 codes, 160 of whom had a magnetic resonance imaging-confirmed diagnosis of SEA. The incidence of SEA was 16 patients per year. Seventy-five patients were found to be intentionally managed medically and were included in the analysis. Thirteen of these 75 patients failed medical management (17%). Based on the published prediction criteria, 26% (Kim et al) and 45% (Patel et al) of our patients were expected to fail. Published prediction models for failure of medical management of SEA were not valid in our cohort. However, once calibrated to our cohort, Patel's model consisting of positive blood culture, presence of diabetes, white blood cells >12.5, and C-reactive protein >115 was the better model for our data.

  6. Prediction of functionally significant single nucleotide polymorphisms in PTEN tumor suppressor gene: An in silico approach.

    PubMed

    Khan, Imran; Ansari, Irfan A; Singh, Pratichi; Dass J, Febin Prabhu

    2017-09-01

    The phosphatase and tensin homolog (PTEN) gene plays a crucial role in signal transduction by negatively regulating the PI3K signaling pathway. It is the most frequent mutated gene in many human-related cancers. Considering its critical role, a functional analysis of missense mutations of PTEN gene was undertaken in this study. Thirty five nonsynonymous single nucleotide polymorphisms (nsSNPs) within the coding region of the PTEN gene were selected for our in silico investigation, and five nsSNPs (G129E, C124R, D252G, H61D, and R130G) were found to be deleterious based on combinatorial predictions of different computational tools. Moreover, molecular dynamics (MD) simulation was performed to investigate the conformational variation between native and all the five mutant PTEN proteins having predicted deleterious nsSNPs. The results of MD simulation of all mutant models illustrated variation in structural attributes such as root-mean-square deviation, root-mean-square fluctuation, radius of gyration, and total energy; which depicts the structural stability of PTEN protein. Furthermore, mutant PTEN protein structures also showed a significant variation in the solvent accessible surface area and hydrogen bond frequencies from the native PTEN structure. In conclusion, results of this study have established the deleterious effect of the all the five predicted nsSNPs on the PTEN protein structure. Thus, results of the current study can pave a new platform to sort out nsSNPs that can be undertaken for the confirmation of their phenotype and their correlation with diseased status in case of control studies. © 2016 International Union of Biochemistry and Molecular Biology, Inc.

  7. Flowfield Comparisons from Three Navier-Stokes Solvers for an Axisymmetric Separate Flow Jet

    NASA Technical Reports Server (NTRS)

    Koch, L. Danielle; Bridges, James; Khavaran, Abbas

    2002-01-01

    To meet new noise reduction goals, many concepts to enhance mixing in the exhaust jets of turbofan engines are being studied. Accurate steady state flowfield predictions from state-of-the-art computational fluid dynamics (CFD) solvers are needed as input to the latest noise prediction codes. The main intent of this paper was to ascertain that similar Navier-Stokes solvers run at different sites would yield comparable results for an axisymmetric two-stream nozzle case. Predictions from the WIND and the NPARC codes are compared to previously reported experimental data and results from the CRAFT Navier-Stokes solver. Similar k-epsilon turbulence models were employed in each solver, and identical computational grids were used. Agreement between experimental data and predictions from each code was generally good for mean values. All three codes underpredict the maximum value of turbulent kinetic energy. The predicted locations of the maximum turbulent kinetic energy were farther downstream than seen in the data. A grid study was conducted using the WIND code, and comments about convergence criteria and grid requirements for CFD solutions to be used as input for noise prediction computations are given. Additionally, noise predictions from the MGBK code, using the CFD results from the CRAFT code, NPARC, and WIND as input are compared to data.

  8. Coding tools investigation for next generation video coding based on HEVC

    NASA Astrophysics Data System (ADS)

    Chen, Jianle; Chen, Ying; Karczewicz, Marta; Li, Xiang; Liu, Hongbin; Zhang, Li; Zhao, Xin

    2015-09-01

    The new state-of-the-art video coding standard, H.265/HEVC, has been finalized in 2013 and it achieves roughly 50% bit rate saving compared to its predecessor, H.264/MPEG-4 AVC. This paper provides the evidence that there is still potential for further coding efficiency improvements. A brief overview of HEVC is firstly given in the paper. Then, our improvements on each main module of HEVC are presented. For instance, the recursive quadtree block structure is extended to support larger coding unit and transform unit. The motion information prediction scheme is improved by advanced temporal motion vector prediction, which inherits the motion information of each small block within a large block from a temporal reference picture. Cross component prediction with linear prediction model improves intra prediction and overlapped block motion compensation improves the efficiency of inter prediction. Furthermore, coding of both intra and inter prediction residual is improved by adaptive multiple transform technique. Finally, in addition to deblocking filter and SAO, adaptive loop filter is applied to further enhance the reconstructed picture quality. This paper describes above-mentioned techniques in detail and evaluates their coding performance benefits based on the common test condition during HEVC development. The simulation results show that significant performance improvement over HEVC standard can be achieved, especially for the high resolution video materials.

  9. Development and verification of NRC`s single-rod fuel performance codes FRAPCON-3 AND FRAPTRAN

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Beyer, C.E.; Cunningham, M.E.; Lanning, D.D.

    1998-03-01

    The FRAPCON and FRAP-T code series, developed in the 1970s and early 1980s, are used by the US Nuclear Regulatory Commission (NRC) to predict fuel performance during steady-state and transient power conditions, respectively. Both code series are now being updated by Pacific Northwest National Laboratory to improve their predictive capabilities at high burnup levels. The newest versions of the codes are called FRAPCON-3 and FRAPTRAN. The updates to fuel property and behavior models are focusing on providing best estimate predictions under steady-state and fast transient power conditions up to extended fuel burnups (> 55 GWd/MTU). Both codes will be assessedmore » against a data base independent of the data base used for code benchmarking and an estimate of code predictive uncertainties will be made based on comparisons to the benchmark and independent data bases.« less

  10. Transcriptome-Wide Analysis of UTRs in Non-Small Cell Lung Cancer Reveals Cancer-Related Genes with SNV-Induced Changes on RNA Secondary Structure and miRNA Target Sites

    PubMed Central

    Novotny, Peter; Tang, Xiaojia; Kalari, Krishna R.; Gorodkin, Jan

    2014-01-01

    Traditional mutation assessment methods generally focus on predicting disruptive changes in protein-coding regions rather than non-coding regulatory regions like untranslated regions (UTRs) of mRNAs. The UTRs, however, are known to have many sequence and structural motifs that can regulate translational and transcriptional efficiency and stability of mRNAs through interaction with RNA-binding proteins and other non-coding RNAs like microRNAs (miRNAs). In a recent study, transcriptomes of tumor cells harboring mutant and wild-type KRAS (V-Ki-ras2 Kirsten rat sarcoma viral oncogene homolog) genes in patients with non-small cell lung cancer (NSCLC) have been sequenced to identify single nucleotide variations (SNVs). About 40% of the total SNVs (73,717) identified were mapped to UTRs, but omitted in the previous analysis. To meet this obvious demand for analysis of the UTRs, we designed a comprehensive pipeline to predict the effect of SNVs on two major regulatory elements, secondary structure and miRNA target sites. Out of 29,290 SNVs in 6462 genes, we predict 472 SNVs (in 408 genes) affecting local RNA secondary structure, 490 SNVs (in 447 genes) affecting miRNA target sites and 48 that do both. Together these disruptive SNVs were present in 803 different genes, out of which 188 (23.4%) were previously known to be cancer-associated. Notably, this ratio is significantly higher (one-sided Fisher's exact test p-value = 0.032) than the ratio (20.8%) of known cancer-associated genes (n = 1347) in our initial data set (n = 6462). Network analysis shows that the genes harboring disruptive SNVs were involved in molecular mechanisms of cancer, and the signaling pathways of LPS-stimulated MAPK, IL-6, iNOS, EIF2 and mTOR. In conclusion, we have found hundreds of SNVs which are highly disruptive with respect to changes in the secondary structure and miRNA target sites within UTRs. These changes hold the potential to alter the expression of known cancer genes or genes linked to cancer-associated pathways. PMID:24416147

  11. Transcriptome-wide analysis of UTRs in non-small cell lung cancer reveals cancer-related genes with SNV-induced changes on RNA secondary structure and miRNA target sites.

    PubMed

    Sabarinathan, Radhakrishnan; Wenzel, Anne; Novotny, Peter; Tang, Xiaojia; Kalari, Krishna R; Gorodkin, Jan

    2014-01-01

    Traditional mutation assessment methods generally focus on predicting disruptive changes in protein-coding regions rather than non-coding regulatory regions like untranslated regions (UTRs) of mRNAs. The UTRs, however, are known to have many sequence and structural motifs that can regulate translational and transcriptional efficiency and stability of mRNAs through interaction with RNA-binding proteins and other non-coding RNAs like microRNAs (miRNAs). In a recent study, transcriptomes of tumor cells harboring mutant and wild-type KRAS (V-Ki-ras2 Kirsten rat sarcoma viral oncogene homolog) genes in patients with non-small cell lung cancer (NSCLC) have been sequenced to identify single nucleotide variations (SNVs). About 40% of the total SNVs (73,717) identified were mapped to UTRs, but omitted in the previous analysis. To meet this obvious demand for analysis of the UTRs, we designed a comprehensive pipeline to predict the effect of SNVs on two major regulatory elements, secondary structure and miRNA target sites. Out of 29,290 SNVs in 6462 genes, we predict 472 SNVs (in 408 genes) affecting local RNA secondary structure, 490 SNVs (in 447 genes) affecting miRNA target sites and 48 that do both. Together these disruptive SNVs were present in 803 different genes, out of which 188 (23.4%) were previously known to be cancer-associated. Notably, this ratio is significantly higher (one-sided Fisher's exact test p-value = 0.032) than the ratio (20.8%) of known cancer-associated genes (n = 1347) in our initial data set (n = 6462). Network analysis shows that the genes harboring disruptive SNVs were involved in molecular mechanisms of cancer, and the signaling pathways of LPS-stimulated MAPK, IL-6, iNOS, EIF2 and mTOR. In conclusion, we have found hundreds of SNVs which are highly disruptive with respect to changes in the secondary structure and miRNA target sites within UTRs. These changes hold the potential to alter the expression of known cancer genes or genes linked to cancer-associated pathways.

  12. Prophage-mediated defense against viral attack and viral counter-defense

    PubMed Central

    Dedrick, Rebekah M.; Jacobs-Sera, Deborah; Guerrero Bustamante, Carlos A.; Garlena, Rebecca A.; Mavrich, Travis N.; Pope, Welkin H.; Reyes, Juan C Cervantes; Russell, Daniel A.; Adair, Tamarah; Alvey, Richard; Bonilla, J. Alfred; Bricker, Jerald S.; Brown, Bryony R.; Byrnes, Deanna; Cresawn, Steven G.; Davis, William B.; Dickson, Leon A.; Edgington, Nicholas P.; Findley, Ann M.; Golebiewska, Urszula; Grose, Julianne H.; Hayes, Cory F.; Hughes, Lee E.; Hutchison, Keith W.; Isern, Sharon; Johnson, Allison A.; Kenna, Margaret A.; Klyczek, Karen K.; Mageeney, Catherine M.; Michael, Scott F.; Molloy, Sally D.; Montgomery, Matthew T.; Neitzel, James; Page, Shallee T.; Pizzorno, Marie C.; Poxleitner, Marianne K.; Rinehart, Claire A.; Robinson, Courtney J.; Rubin, Michael R.; Teyim, Joseph N.; Vazquez, Edwin; Ware, Vassie C.; Washington, Jacqueline; Hatfull, Graham F.

    2017-01-01

    Temperate phages are common and prophages are abundant residents of sequenced bacterial genomes. Mycobacteriophages are viruses infecting mycobacterial hosts including Mycobacterium tuberculosis and Mycobacterium smegmatis, encompass substantial genetic diversity, and are commonly temperate. Characterization of ten Cluster N temperate mycobacteriophages reveals at least five distinct prophage-expressed viral defense systems that interfere with infection of lytic and temperate phages that are either closely-related (homotypic defense) or unrelated (heterotypic defense). Target specificity is unpredictable, ranging from a single target phage to one-third of those tested. The defense systems include a single-subunit restriction system, a heterotypic exclusion system, and a predicted (p)ppGpp synthetase, which blocks lytic phage growth, promotes bacterial survival, and enables efficient lysogeny. The predicted (p)ppGpp synthetase coded by the Phrann prophage defends against phage Tweety infection, but Tweety codes for a tetrapeptide repeat protein, gp54, that acts as a highly effective counter-defense system. Prophage-mediated viral defense offers an efficient mechanism for bacterial success in host-virus dynamics, and counter-defense promotes phage co-evolution. PMID:28067906

  13. Evaluating bacterial gene-finding HMM structures as probabilistic logic programs.

    PubMed

    Mørk, Søren; Holmes, Ian

    2012-03-01

    Probabilistic logic programming offers a powerful way to describe and evaluate structured statistical models. To investigate the practicality of probabilistic logic programming for structure learning in bioinformatics, we undertook a simplified bacterial gene-finding benchmark in PRISM, a probabilistic dialect of Prolog. We evaluate Hidden Markov Model structures for bacterial protein-coding gene potential, including a simple null model structure, three structures based on existing bacterial gene finders and two novel model structures. We test standard versions as well as ADPH length modeling and three-state versions of the five model structures. The models are all represented as probabilistic logic programs and evaluated using the PRISM machine learning system in terms of statistical information criteria and gene-finding prediction accuracy, in two bacterial genomes. Neither of our implementations of the two currently most used model structures are best performing in terms of statistical information criteria or prediction performances, suggesting that better-fitting models might be achievable. The source code of all PRISM models, data and additional scripts are freely available for download at: http://github.com/somork/codonhmm. Supplementary data are available at Bioinformatics online.

  14. Influence of flowfield and vehicle parameters on engineering aerothermal methods

    NASA Technical Reports Server (NTRS)

    Wurster, Kathryn E.; Zoby, E. Vincent; Thompson, Richard A.

    1989-01-01

    The reliability and flexibility of three engineering codes used in the aerosphace industry (AEROHEAT, INCHES, and MINIVER) were investigated by comparing the results of these codes with Reentry F flight data and ground-test heat-transfer data for a range of cone angles, and with the predictions obtained using the detailed VSL3D code; the engineering solutions were also compared. In particular, the impact of several vehicle and flow-field parameters on the heat transfer and the capability of the engineering codes to predict these results were determined. It was found that entropy, pressure gradient, nose bluntness, gas chemistry, and angle of attack all affect heating levels. A comparison of the results of the three engineering codes with Reentry F flight data and with the predictions obtained of the VSL3D code showed a very good agreement in the regions of the applicability of the codes. It is emphasized that the parameters used in this study can significantly influence the actual heating levels and the prediction capability of a code.

  15. Systematic analysis of human kinase genes: a large number of genes and alternative splicing events result in functional and structural diversity

    PubMed Central

    Milanesi, Luciano; Petrillo, Mauro; Sepe, Leandra; Boccia, Angelo; D'Agostino, Nunzio; Passamano, Myriam; Di Nardo, Salvatore; Tasco, Gianluca; Casadio, Rita; Paolella, Giovanni

    2005-01-01

    Background Protein kinases are a well defined family of proteins, characterized by the presence of a common kinase catalytic domain and playing a significant role in many important cellular processes, such as proliferation, maintenance of cell shape, apoptosys. In many members of the family, additional non-kinase domains contribute further specialization, resulting in subcellular localization, protein binding and regulation of activity, among others. About 500 genes encode members of the kinase family in the human genome, and although many of them represent well known genes, a larger number of genes code for proteins of more recent identification, or for unknown proteins identified as kinase only after computational studies. Results A systematic in silico study performed on the human genome, led to the identification of 5 genes, on chromosome 1, 11, 13, 15 and 16 respectively, and 1 pseudogene on chromosome X; some of these genes are reported as kinases from NCBI but are absent in other databases, such as KinBase. Comparative analysis of 483 gene regions and subsequent computational analysis, aimed at identifying unannotated exons, indicates that a large number of kinase may code for alternately spliced forms or be incorrectly annotated. An InterProScan automated analysis was perfomed to study domain distribution and combination in the various families. At the same time, other structural features were also added to the annotation process, including the putative presence of transmembrane alpha helices, and the cystein propensity to participate into a disulfide bridge. Conclusion The predicted human kinome was extended by identifiying both additional genes and potential splice variants, resulting in a varied panorama where functionality may be searched at the gene and protein level. Structural analysis of kinase proteins domains as defined in multiple sources together with transmembrane alpha helices and signal peptide prediction provides hints to function assignment. The results of the human kinome analysis are collected in the KinWeb database, available for browsing and searching over the internet, where all results from the comparative analysis and the gene structure annotation are made available, alongside the domain information. Kinases may be searched by domain combinations and the relative genes may be viewed in a graphic browser at various level of magnification up to gene organization on the full chromosome set. PMID:16351747

  16. The Cortical Organization of Speech Processing: Feedback Control and Predictive Coding the Context of a Dual-Stream Model

    ERIC Educational Resources Information Center

    Hickok, Gregory

    2012-01-01

    Speech recognition is an active process that involves some form of predictive coding. This statement is relatively uncontroversial. What is less clear is the source of the prediction. The dual-stream model of speech processing suggests that there are two possible sources of predictive coding in speech perception: the motor speech system and the…

  17. The genome of obligately intracellular Ehrlichia canis revealsthemes of complex membrane structure and immune evasion strategies

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Mavromatis, K.; Kuyler Doyle, C.; Lykidis, A.

    2005-09-01

    Ehrlichia canis, a small obligately intracellular, tick-transmitted, gram-negative, a-proteobacterium is the primary etiologic agent of globally distributed canine monocytic ehrlichiosis. Complete genome sequencing revealed that the E. canis genome consists of a single circular chromosome of 1,315,030 bp predicted to encode 925 proteins, 40 stable RNA species, and 17 putative pseudogenes, and a substantial proportion of non-coding sequence (27 percent). Interesting genome features include a large set of proteins with transmembrane helices and/or signal sequences, and a unique serine-threonine bias associated with the potential for O-glycosylation that was prominent in proteins associated with pathogen-host interactions. Furthermore, two paralogous protein familiesmore » associated with immune evasion were identified, one of which contains poly G:C tracts, suggesting that they may play a role in phase variation and facilitation of persistent infections. Proteins associated with pathogen-host interactions were identified including a small group of proteins (12) with tandem repeats and another with eukaryotic-like ankyrin domains (7).« less

  18. A high temperature fatigue life prediction computer code based on the total strain version of StrainRange Partitioning (SRP)

    NASA Technical Reports Server (NTRS)

    Mcgaw, Michael A.; Saltsman, James F.

    1993-01-01

    A recently developed high-temperature fatigue life prediction computer code is presented and an example of its usage given. The code discussed is based on the Total Strain version of Strainrange Partitioning (TS-SRP). Included in this code are procedures for characterizing the creep-fatigue durability behavior of an alloy according to TS-SRP guidelines and predicting cyclic life for complex cycle types for both isothermal and thermomechanical conditions. A reasonably extensive materials properties database is included with the code.

  19. Chromosome-level genome assembly and transcriptome of the green alga Chromochloris zofingiensis illuminates astaxanthin production.

    PubMed

    Roth, Melissa S; Cokus, Shawn J; Gallaher, Sean D; Walter, Andreas; Lopez, David; Erickson, Erika; Endelman, Benjamin; Westcott, Daniel; Larabell, Carolyn A; Merchant, Sabeeha S; Pellegrini, Matteo; Niyogi, Krishna K

    2017-05-23

    Microalgae have potential to help meet energy and food demands without exacerbating environmental problems. There is interest in the unicellular green alga Chromochloris zofingiensis , because it produces lipids for biofuels and a highly valuable carotenoid nutraceutical, astaxanthin. To advance understanding of its biology and facilitate commercial development, we present a C. zofingiensis chromosome-level nuclear genome, organelle genomes, and transcriptome from diverse growth conditions. The assembly, derived from a combination of short- and long-read sequencing in conjunction with optical mapping, revealed a compact genome of ∼58 Mbp distributed over 19 chromosomes containing 15,274 predicted protein-coding genes. The genome has uniform gene density over chromosomes, low repetitive sequence content (∼6%), and a high fraction of protein-coding sequence (∼39%) with relatively long coding exons and few coding introns. Functional annotation of gene models identified orthologous families for the majority (∼73%) of genes. Synteny analysis uncovered localized but scrambled blocks of genes in putative orthologous relationships with other green algae. Two genes encoding beta-ketolase ( BKT ), the key enzyme synthesizing astaxanthin, were found in the genome, and both were up-regulated by high light. Isolation and molecular analysis of astaxanthin-deficient mutants showed that BKT1 is required for the production of astaxanthin. Moreover, the transcriptome under high light exposure revealed candidate genes that could be involved in critical yet missing steps of astaxanthin biosynthesis, including ABC transporters, cytochrome P450 enzymes, and an acyltransferase. The high-quality genome and transcriptome provide insight into the green algal lineage and carotenoid production.

  20. Chromosome-level genome assembly and transcriptome of the green alga Chromochloris zofingiensis illuminates astaxanthin production

    DOE PAGES

    Roth, Melissa S.; Cokus, Shawn J.; Gallaher, Sean D.; ...

    2017-05-08

    Microalgae have potential to help meet energy and food demands without exacerbating environmental problems. There is interest in the unicellular green alga Chromochloris zofingiensis, because it produces lipids for biofuels and a highly valuable carotenoid nutraceutical, astaxanthin. Here, to advance understanding of its biology and facilitate commercial development, we present a C. zofingiensis chromosome-level nuclear genome, organelle genomes, and transcriptome from diverse growth conditions. The assembly, derived from a combination of short- and long-read sequencing in conjunction with optical mapping, revealed a compact genome of ~58 Mbp distributed over 19 chromosomes containing 15,274 predicted protein-coding genes. The genome has uniformmore » gene density over chromosomes, low repetitive sequence content (~6%), and a high fraction of protein-coding sequence (~39%) with relatively long coding exons and few coding introns. Functional annotation of gene models identified orthologous families for the majority (~73%) of genes. Synteny analysis uncovered localized but scrambled blocks of genes in putative orthologous relationships with other green algae. Two genes encoding beta-ketolase (BKT), the key enzyme synthesizing astaxanthin, were found in the genome, and both were up-regulated by high light. Isolation and molecular analysis of astaxanthin-deficient mutants showed that BKT1 is required for the production of astaxanthin. Moreover, the transcriptome under high light exposure revealed candidate genes that could be involved in critical yet missing steps of astaxanthin biosynthesis, including ABC transporters, cytochrome P450 enzymes, and an acyltransferase. Finally, the high-quality genome and transcriptome provide insight into the green algal lineage and carotenoid production.« less

  1. Chromosome-level genome assembly and transcriptome of the green alga Chromochloris zofingiensis illuminates astaxanthin production

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Roth, Melissa S.; Cokus, Shawn J.; Gallaher, Sean D.

    Microalgae have potential to help meet energy and food demands without exacerbating environmental problems. There is interest in the unicellular green alga Chromochloris zofingiensis, because it produces lipids for biofuels and a highly valuable carotenoid nutraceutical, astaxanthin. Here, to advance understanding of its biology and facilitate commercial development, we present a C. zofingiensis chromosome-level nuclear genome, organelle genomes, and transcriptome from diverse growth conditions. The assembly, derived from a combination of short- and long-read sequencing in conjunction with optical mapping, revealed a compact genome of ~58 Mbp distributed over 19 chromosomes containing 15,274 predicted protein-coding genes. The genome has uniformmore » gene density over chromosomes, low repetitive sequence content (~6%), and a high fraction of protein-coding sequence (~39%) with relatively long coding exons and few coding introns. Functional annotation of gene models identified orthologous families for the majority (~73%) of genes. Synteny analysis uncovered localized but scrambled blocks of genes in putative orthologous relationships with other green algae. Two genes encoding beta-ketolase (BKT), the key enzyme synthesizing astaxanthin, were found in the genome, and both were up-regulated by high light. Isolation and molecular analysis of astaxanthin-deficient mutants showed that BKT1 is required for the production of astaxanthin. Moreover, the transcriptome under high light exposure revealed candidate genes that could be involved in critical yet missing steps of astaxanthin biosynthesis, including ABC transporters, cytochrome P450 enzymes, and an acyltransferase. Finally, the high-quality genome and transcriptome provide insight into the green algal lineage and carotenoid production.« less

  2. Chromosome-level genome assembly and transcriptome of the green alga Chromochloris zofingiensis illuminates astaxanthin production

    PubMed Central

    Roth, Melissa S.; Cokus, Shawn J.; Gallaher, Sean D.; Walter, Andreas; Lopez, David; Erickson, Erika; Endelman, Benjamin; Westcott, Daniel; Larabell, Carolyn A.; Merchant, Sabeeha S.; Pellegrini, Matteo

    2017-01-01

    Microalgae have potential to help meet energy and food demands without exacerbating environmental problems. There is interest in the unicellular green alga Chromochloris zofingiensis, because it produces lipids for biofuels and a highly valuable carotenoid nutraceutical, astaxanthin. To advance understanding of its biology and facilitate commercial development, we present a C. zofingiensis chromosome-level nuclear genome, organelle genomes, and transcriptome from diverse growth conditions. The assembly, derived from a combination of short- and long-read sequencing in conjunction with optical mapping, revealed a compact genome of ∼58 Mbp distributed over 19 chromosomes containing 15,274 predicted protein-coding genes. The genome has uniform gene density over chromosomes, low repetitive sequence content (∼6%), and a high fraction of protein-coding sequence (∼39%) with relatively long coding exons and few coding introns. Functional annotation of gene models identified orthologous families for the majority (∼73%) of genes. Synteny analysis uncovered localized but scrambled blocks of genes in putative orthologous relationships with other green algae. Two genes encoding beta-ketolase (BKT), the key enzyme synthesizing astaxanthin, were found in the genome, and both were up-regulated by high light. Isolation and molecular analysis of astaxanthin-deficient mutants showed that BKT1 is required for the production of astaxanthin. Moreover, the transcriptome under high light exposure revealed candidate genes that could be involved in critical yet missing steps of astaxanthin biosynthesis, including ABC transporters, cytochrome P450 enzymes, and an acyltransferase. The high-quality genome and transcriptome provide insight into the green algal lineage and carotenoid production. PMID:28484037

  3. Different evolutionary patterns of SNPs between domains and unassigned regions in human protein-coding sequences.

    PubMed

    Pang, Erli; Wu, Xiaomei; Lin, Kui

    2016-06-01

    Protein evolution plays an important role in the evolution of each genome. Because of their functional nature, in general, most of their parts or sites are differently constrained selectively, particularly by purifying selection. Most previous studies on protein evolution considered individual proteins in their entirety or compared protein-coding sequences with non-coding sequences. Less attention has been paid to the evolution of different parts within each protein of a given genome. To this end, based on PfamA annotation of all human proteins, each protein sequence can be split into two parts: domains or unassigned regions. Using this rationale, single nucleotide polymorphisms (SNPs) in protein-coding sequences from the 1000 Genomes Project were mapped according to two classifications: SNPs occurring within protein domains and those within unassigned regions. With these classifications, we found: the density of synonymous SNPs within domains is significantly greater than that of synonymous SNPs within unassigned regions; however, the density of non-synonymous SNPs shows the opposite pattern. We also found there are signatures of purifying selection on both the domain and unassigned regions. Furthermore, the selective strength on domains is significantly greater than that on unassigned regions. In addition, among all of the human protein sequences, there are 117 PfamA domains in which no SNPs are found. Our results highlight an important aspect of protein domains and may contribute to our understanding of protein evolution.

  4. AMS 4.0: consensus prediction of post-translational modifications in protein sequences.

    PubMed

    Plewczynski, Dariusz; Basu, Subhadip; Saha, Indrajit

    2012-08-01

    We present here the 2011 update of the AutoMotif Service (AMS 4.0) that predicts the wide selection of 88 different types of the single amino acid post-translational modifications (PTM) in protein sequences. The selection of experimentally confirmed modifications is acquired from the latest UniProt and Phospho.ELM databases for training. The sequence vicinity of each modified residue is represented using amino acids physico-chemical features encoded using high quality indices (HQI) obtaining by automatic clustering of known indices extracted from AAindex database. For each type of the numerical representation, the method builds the ensemble of Multi-Layer Perceptron (MLP) pattern classifiers, each optimising different objectives during the training (for example the recall, precision or area under the ROC curve (AUC)). The consensus is built using brainstorming technology, which combines multi-objective instances of machine learning algorithm, and the data fusion of different training objects representations, in order to boost the overall prediction accuracy of conserved short sequence motifs. The performance of AMS 4.0 is compared with the accuracy of previous versions, which were constructed using single machine learning methods (artificial neural networks, support vector machine). Our software improves the average AUC score of the earlier version by close to 7 % as calculated on the test datasets of all 88 PTM types. Moreover, for the selected most-difficult sequence motifs types it is able to improve the prediction performance by almost 32 %, when compared with previously used single machine learning methods. Summarising, the brainstorming consensus meta-learning methodology on the average boosts the AUC score up to around 89 %, averaged over all 88 PTM types. Detailed results for single machine learning methods and the consensus methodology are also provided, together with the comparison to previously published methods and state-of-the-art software tools. The source code and precompiled binaries of brainstorming tool are available at http://code.google.com/p/automotifserver/ under Apache 2.0 licensing.

  5. Statistical Analysis of CFD Solutions from the Third AIAA Drag Prediction Workshop

    NASA Technical Reports Server (NTRS)

    Morrison, Joseph H.; Hemsch, Michael J.

    2007-01-01

    The first AIAA Drag Prediction Workshop, held in June 2001, evaluated the results from an extensive N-version test of a collection of Reynolds-Averaged Navier-Stokes CFD codes. The code-to-code scatter was more than an order of magnitude larger than desired for design and experimental validation of cruise conditions for a subsonic transport configuration. The second AIAA Drag Prediction Workshop, held in June 2003, emphasized the determination of installed pylon-nacelle drag increments and grid refinement studies. The code-to-code scatter was significantly reduced compared to the first DPW, but still larger than desired. However, grid refinement studies showed no significant improvement in code-to-code scatter with increasing grid refinement. The third Drag Prediction Workshop focused on the determination of installed side-of-body fairing drag increments and grid refinement studies for clean attached flow on wing alone configurations and for separated flow on the DLR-F6 subsonic transport model. This work evaluated the effect of grid refinement on the code-to-code scatter for the clean attached flow test cases and the separated flow test cases.

  6. Splicing factor SFRS1 recognizes a functionally diverse landscape of RNA transcripts.

    PubMed

    Sanford, Jeremy R; Wang, Xin; Mort, Matthew; Vanduyn, Natalia; Cooper, David N; Mooney, Sean D; Edenberg, Howard J; Liu, Yunlong

    2009-03-01

    Metazoan genes are encrypted with at least two superimposed codes: the genetic code to specify the primary structure of proteins and the splicing code to expand their proteomic output via alternative splicing. Here, we define the specificity of a central regulator of pre-mRNA splicing, the conserved, essential splicing factor SFRS1. Cross-linking immunoprecipitation and high-throughput sequencing (CLIP-seq) identified 23,632 binding sites for SFRS1 in the transcriptome of cultured human embryonic kidney cells. SFRS1 was found to engage many different classes of functionally distinct transcripts including mRNA, miRNA, snoRNAs, ncRNAs, and conserved intergenic transcripts of unknown function. The majority of these diverse transcripts share a purine-rich consensus motif corresponding to the canonical SFRS1 binding site. The consensus site was not only enriched in exons cross-linked to SFRS1 in vivo, but was also enriched in close proximity to splice sites. mRNAs encoding RNA processing factors were significantly overrepresented, suggesting that SFRS1 may broadly influence the post-transcriptional control of gene expression in vivo. Finally, a search for the SFRS1 consensus motif within the Human Gene Mutation Database identified 181 mutations in 82 different genes that disrupt predicted SFRS1 binding sites. This comprehensive analysis substantially expands the known roles of human SR proteins in the regulation of a diverse array of RNA transcripts.

  7. High quality draft genome sequence of Olivibacter sitiensis type strain (AW-6T), a diphenol degrader with genes involved in the catechol pathway

    PubMed Central

    Ntougias, Spyridon; Lapidus, Alla; Han, James; Mavromatis, Konstantinos; Pati, Amrita; Chen, Amy; Klenk, Hans-Peter; Woyke, Tanja; Fasseas, Constantinos; Kyrpides, Nikos C.; Zervakis, Georgios I.

    2014-01-01

    Olivibacter sitiensis Ntougias et al. 2007 is a member of the family Sphingobacteriaceae, phylum Bacteroidetes. Members of the genus Olivibacter are phylogenetically diverse and of significant interest. They occur in diverse habitats, such as rhizosphere and contaminated soils, viscous wastes, composts, biofilter clean-up facilities on contaminated sites and cave environments, and they are involved in the degradation of complex and toxic compounds. Here we describe the features of O. sitiensis AW-6T, together with the permanent-draft genome sequence and annotation. The organism was sequenced under the Genomic Encyclopedia for Bacteria and Archaea (GEBA) project at the DOE Joint Genome Institute and is the first genome sequence of a species within the genus Olivibacter. The genome is 5,053,571 bp long and is comprised of 110 scaffolds with an average GC content of 44.61%. Of the 4,565 genes predicted, 4,501 were protein-coding genes and 64 were RNA genes. Most protein-coding genes (68.52%) were assigned to a putative function. The identification of 2-keto-4-pentenoate hydratase/2-oxohepta-3-ene-1,7-dioic acid hydratase-coding genes indicates involvement of this organism in the catechol catabolic pathway. In addition, genes encoding for β-1,4-xylanases and β-1,4-xylosidases reveal the xylanolytic action of O. sitiensis. PMID:25197463

  8. Experimental Assessment of Splicing Variants Using Expression Minigenes and Comparison with In Silico Predictions

    PubMed Central

    Sharma, Neeraj; Sosnay, Patrick R.; Ramalho, Anabela S.; Douville, Christopher; Franca, Arianna; Gottschalk, Laura B.; Park, Jeenah; Lee, Melissa; Vecchio-Pagan, Briana; Raraigh, Karen S.; Amaral, Margarida D.; Karchin, Rachel; Cutting, Garry R.

    2015-01-01

    Assessment of the functional consequences of variants near splice sites is a major challenge in the diagnostic laboratory. To address this issue, we created expression minigenes (EMGs) to determine the RNA and protein products generated by splice site variants (n = 10) implicated in cystic fibrosis (CF). Experimental results were compared with the splicing predictions of eight in silico tools. EMGs containing the full-length Cystic Fibrosis Transmembrane Conductance Regulator (CFTR) coding sequence and flanking intron sequences generated wild-type transcript and fully processed protein in Human Embryonic Kidney (HEK293) and CF bronchial epithelial (CFBE41o-) cells. Quantification of variant induced aberrant mRNA isoforms was concordant using fragment analysis and pyrosequencing. The splicing patterns of c.1585−1G>A and c.2657+5G>A were comparable to those reported in primary cells from individuals bearing these variants. Bioinformatics predictions were consistent with experimental results for 9/10 variants (MES), 8/10 variants (NNSplice), and 7/10 variants (SSAT and Sroogle). Programs that estimate the consequences of mis-splicing predicted 11/16 (HSF and ASSEDA) and 10/16 (Fsplice and SplicePort) experimentally observed mRNA isoforms. EMGs provide a robust experimental approach for clinical interpretation of splice site variants and refinement of in silico tools. PMID:25066652

  9. APADB: a database for alternative polyadenylation and microRNA regulation events

    PubMed Central

    Müller, Sören; Rycak, Lukas; Afonso-Grunz, Fabian; Winter, Peter; Zawada, Adam M.; Damrath, Ewa; Scheider, Jessica; Schmäh, Juliane; Koch, Ina; Kahl, Günter; Rotter, Björn

    2014-01-01

    Alternative polyadenylation (APA) is a widespread mechanism that contributes to the sophisticated dynamics of gene regulation. Approximately 50% of all protein-coding human genes harbor multiple polyadenylation (PA) sites; their selective and combinatorial use gives rise to transcript variants with differing length of their 3′ untranslated region (3′UTR). Shortened variants escape UTR-mediated regulation by microRNAs (miRNAs), especially in cancer, where global 3′UTR shortening accelerates disease progression, dedifferentiation and proliferation. Here we present APADB, a database of vertebrate PA sites determined by 3′ end sequencing, using massive analysis of complementary DNA ends. APADB provides (A)PA sites for coding and non-coding transcripts of human, mouse and chicken genes. For human and mouse, several tissue types, including different cancer specimens, are available. APADB records the loss of predicted miRNA binding sites and visualizes next-generation sequencing reads that support each PA site in a genome browser. The database tables can either be browsed according to organism and tissue or alternatively searched for a gene of interest. APADB is the largest database of APA in human, chicken and mouse. The stored information provides experimental evidence for thousands of PA sites and APA events. APADB combines 3′ end sequencing data with prediction algorithms of miRNA binding sites, allowing to further improve prediction algorithms. Current databases lack correct information about 3′UTR lengths, especially for chicken, and APADB provides necessary information to close this gap. Database URL: http://tools.genxpro.net/apadb/ PMID:25052703

  10. The Number, Organization, and Size of Polymorphic Membrane Protein Coding Sequences as well as the Most Conserved Pmp Protein Differ within and across Chlamydia Species.

    PubMed

    Van Lent, Sarah; Creasy, Heather Huot; Myers, Garry S A; Vanrompay, Daisy

    2016-01-01

    Variation is a central trait of the polymorphic membrane protein (Pmp) family. The number of pmp coding sequences differs between Chlamydia species, but it is unknown whether the number of pmp coding sequences is constant within a Chlamydia species. The level of conservation of the Pmp proteins has previously only been determined for Chlamydia trachomatis. As different Pmp proteins might be indispensible for the pathogenesis of different Chlamydia species, this study investigated the conservation of Pmp proteins both within and across C. trachomatis,C. pneumoniae,C. abortus, and C. psittaci. The pmp coding sequences were annotated in 16 C. trachomatis, 6 C. pneumoniae, 2 C. abortus, and 16 C. psittaci genomes. The number and organization of polymorphic membrane coding sequences differed within and across the analyzed Chlamydia species. The length of coding sequences of pmpA,pmpB, and pmpH was conserved among all analyzed genomes, while the length of pmpE/F and pmpG, and remarkably also of the subtype pmpD, differed among the analyzed genomes. PmpD, PmpA, PmpH, and PmpA were the most conserved Pmp in C. trachomatis,C. pneumoniae,C. abortus, and C. psittaci, respectively. PmpB was the most conserved Pmp across the 4 analyzed Chlamydia species. © 2016 S. Karger AG, Basel.

  11. Memory T-Cell-Mediated Immune Responses Specific to an Alternative Core Protein in Hepatitis C Virus Infection

    PubMed Central

    Bain, Christine; Parroche, Peggy; Lavergne, Jean Pierre; Duverger, Blandine; Vieux, Claude; Dubois, Valérie; Komurian-Pradel, Florence; Trépo, Christian; Gebuhrer, Lucette; Paranhos-Baccala, Glaucia; Penin, François; Inchauspé, Geneviève

    2004-01-01

    In vitro studies have described the synthesis of an alternative reading frame form of the hepatitis C virus (HCV) core protein that was named F protein or ARFP (alternative reading frame protein) and includes a domain coded by the +1 open reading frame of the RNA core coding region. The expression of this protein in HCV-infected patients remains controversial. We have analyzed peripheral blood from 47 chronically or previously HCV-infected patients for the presence of T lymphocytes and antibodies specific to the ARFP. Anti-ARFP antibodies were detected in 41.6% of the patients infected with various HCV genotypes. Using a specific ARFP 99-amino-acid polypeptide as well as four ARFP predicted class I-restricted 9-mer peptides, we show that 20% of the patients display specific lymphocytes capable of producing gamma interferon, interleukin-10, or both cytokines. Patients harboring three different viral genotypes (1a, 1b, and 3) carried T lymphocytes reactive to genotype 1b-derived peptides. In longitudinal analysis of patients receiving therapy, both core and ARFP-specific T-cell- and B-cell-mediated responses were documented. The magnitude and kinetics of the HCV antigen-specific responses differed and were not linked with viremia or therapy outcome. These observations provide strong and new arguments in favor of the synthesis, during natural HCV infection, of an ARFP derived from the core sequence. Moreover, the present data provide the first demonstration of the presence of T-cell-mediated immune responses directed to this novel HCV antigen. PMID:15367612

  12. Identification of human microRNA targets from isolated argonaute protein complexes.

    PubMed

    Beitzinger, Michaela; Peters, Lasse; Zhu, Jia Yun; Kremmer, Elisabeth; Meister, Gunter

    2007-06-01

    MicroRNAs (miRNAs) constitute a class of small non-coding RNAs that regulate gene expression on the level of translation and/or mRNA stability. Mammalian miRNAs associate with members of the Argonaute (Ago) protein family and bind to partially complementary sequences in the 3' untranslated region (UTR) of specific target mRNAs. Computer algorithms based on factors such as free binding energy or sequence conservation have been used to predict miRNA target mRNAs. Based on such predictions, up to one third of all mammalian mRNAs seem to be under miRNA regulation. However, due to the low degree of complementarity between the miRNA and its target, such computer programs are often imprecise and therefore not very reliable. Here we report the first biochemical identification approach of miRNA targets from human cells. Using highly specific monoclonal antibodies against members of the Ago protein family, we co-immunoprecipitate Ago-bound mRNAs and identify them by cloning. Interestingly, most of the identified targets are also predicted by different computer programs. Moreover, we randomly analyzed six different target candidates and were able to experimentally validate five as miRNA targets. Our data clearly indicate that miRNA targets can be experimentally identified from Ago complexes and therefore provide a new tool to directly analyze miRNA function.

  13. Statistical Analysis of the AIAA Drag Prediction Workshop CFD Solutions

    NASA Technical Reports Server (NTRS)

    Morrison, Joseph H.; Hemsch, Michael J.

    2007-01-01

    The first AIAA Drag Prediction Workshop (DPW), held in June 2001, evaluated the results from an extensive N-version test of a collection of Reynolds-Averaged Navier-Stokes CFD codes. The code-to-code scatter was more than an order of magnitude larger than desired for design and experimental validation of cruise conditions for a subsonic transport configuration. The second AIAA Drag Prediction Workshop, held in June 2003, emphasized the determination of installed pylon-nacelle drag increments and grid refinement studies. The code-to-code scatter was significantly reduced compared to the first DPW, but still larger than desired. However, grid refinement studies showed no significant improvement in code-to-code scatter with increasing grid refinement. The third AIAA Drag Prediction Workshop, held in June 2006, focused on the determination of installed side-of-body fairing drag increments and grid refinement studies for clean attached flow on wing alone configurations and for separated flow on the DLR-F6 subsonic transport model. This report compares the transonic cruise prediction results of the second and third workshops using statistical analysis.

  14. ICAM-1-related long non-coding RNA: promoter analysis and expression in human retinal endothelial cells.

    PubMed

    Lumsden, Amanda L; Ma, Yuefang; Ashander, Liam M; Stempel, Andrew J; Keating, Damien J; Smith, Justine R; Appukuttan, Binoy

    2018-05-09

    Regulation of intercellular adhesion molecule (ICAM)-1 in retinal endothelial cells is a promising druggable target for retinal vascular diseases. The ICAM-1-related (ICR) long non-coding RNA stabilizes ICAM-1 transcript, increasing protein expression. However, studies of ICR involvement in disease have been limited as the promoter is uncharacterized. To address this issue, we undertook a comprehensive in silico analysis of the human ICR gene promoter region. We used genomic evolutionary rate profiling to identify a 115 base pair (bp) sequence within 500 bp upstream of the transcription start site of the annotated human ICR gene that was conserved across 25 eutherian genomes. A second constrained sequence upstream of the orthologous mouse gene (68 bp; conserved across 27 Eutherian genomes including human) was also discovered. Searching these elements identified 33 matrices predictive of binding sites for transcription factors known to be responsive to a broad range of pathological stimuli, including hypoxia, and metabolic and inflammatory proteins. Five phenotype-associated single nucleotide polymorphisms (SNPs) in the immediate vicinity of these elements included four SNPs (i.e. rs2569693, rs281439, rs281440 and rs11575074) predicted to impact binding motifs of transcription factors, and thus the expression of ICR and ICAM-1 genes, with potential to influence disease susceptibility. We verified that human retinal endothelial cells expressed ICR, and observed induction of expression by tumor necrosis factor-α.

  15. Effects of the enlargement of poly-glutamine segments on the structure and folding of ataxin-2 and ataxin-3 proteins

    PubMed Central

    Wen, Jingran; Scoles, Daniel R.; Facelli, Julio C.

    2017-01-01

    Spinocerebellar ataxia type 2 (SCA2) and type 3 (SCA3) are two common autosomal-dominant inherited ataxia syndromes, both of which are related to the unstable expansion of tri-nucleotide CAG repeats in the coding region of the related ATXN2 and ATXN3 genes, respectively. The poly-glutamine (poly-Q) tract encoded by the CAG repeats has long been recognized as an important factor in disease pathogenesis and progress. In this study, using the I-TASSER method for 3D structure prediction, we investigated the effect of poly-Q tract enlargement on the structure and folding of ataxin-2 and ataxin-3 proteins. Our results show good agreement with the known experimental structures of the Josephin and UIM domains providing credence to the simulation results presented here, which show that the enlargement of the poly-Q region not only affects the local structure of these regions but also affects the structures of functional domains as well as the whole protein. The changes observed in the predicted models of the UIM domains in ataxin-3 when the poly-Q track is enlarged provide new insights on possible pathogenic mechanisms. PMID:26861241

  16. Dscam1 web server: online prediction of Dscam1 self- and hetero-affinity.

    PubMed

    Marini, Simone; Nazzicari, Nelson; Biscarini, Filippo; Wang, Guang-Zhong

    2017-06-15

    Formation of homodimers by identical Dscam1 protein isomers on cell surface is the key factor for the self-avoidance of growing neurites. Dscam1 immense diversity has a critical role in the formation of arthropod neuronal circuit, showing unique evolutionary properties when compared to other cell surface proteins. Experimental measures are available for 89 self-binding and 1722 hetero-binding protein samples, out of more than 19 thousands (self-binding) and 350 millions (hetero-binding) possible isomer combinations. We developed Dscam1 Web Server to quickly predict Dscam1 self- and hetero- binding affinity for batches of Dscam1 isomers. The server can help the study of Dscam1 affinity and help researchers navigate through the tens of millions of possible isomer combinations to isolate the strong-binding ones. Dscam1 Web Server is freely available at: http://bioinformatics.tecnoparco.org/Dscam1-webserver . Web server code is available at https://gitlab.com/ne1s0n/Dscam1-binding . simone.marini@unipv.it or guangzhong.wang@picb.ac.cn. Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com

  17. Experimental studies related to the origin of the genetic code and the process of protein synthesis - A review

    NASA Technical Reports Server (NTRS)

    Lacey, J. C., Jr.; Mullins, D. W., Jr.

    1983-01-01

    A survey is presented of the literature on the experimental evidence for the genetic code assignments and the chemical reactions involved in the process of protein synthesis. In view of the enormous number of theoretical models that have been advanced to explain the origin of the genetic code, attention is confined to experimental studies. Since genetic coding has significance only within the context of protein synthesis, it is believed that the problem of the origin of the code must be dealt with in terms of the origin of the process of protein synthesis. It is contended that the answers must lie in the nature of the molecules, amino acids and nucleotides, the affinities they might have for one another, and the effect that those affinities must have on the chemical reactions that are related to primitive protein synthesis. The survey establishes that for the bulk of amino acids, there is a direct and significant correlation between the hydrophobicity rank of the amino acids and the hydrophobicity rank of their anticodonic dinucleotides.

  18. Inter-view prediction of intra mode decision for high-efficiency video coding-based multiview video coding

    NASA Astrophysics Data System (ADS)

    da Silva, Thaísa Leal; Agostini, Luciano Volcan; da Silva Cruz, Luis A.

    2014-05-01

    Intra prediction is a very important tool in current video coding standards. High-efficiency video coding (HEVC) intra prediction presents relevant gains in encoding efficiency when compared to previous standards, but with a very important increase in the computational complexity since 33 directional angular modes must be evaluated. Motivated by this high complexity, this article presents a complexity reduction algorithm developed to reduce the HEVC intra mode decision complexity targeting multiview videos. The proposed algorithm presents an efficient fast intra prediction compliant with singleview and multiview video encoding. This fast solution defines a reduced subset of intra directions according to the video texture and it exploits the relationship between prediction units (PUs) of neighbor depth levels of the coding tree. This fast intra coding procedure is used to develop an inter-view prediction method, which exploits the relationship between the intra mode directions of adjacent views to further accelerate the intra prediction process in multiview video encoding applications. When compared to HEVC simulcast, our method achieves a complexity reduction of up to 47.77%, at the cost of an average BD-PSNR loss of 0.08 dB.

  19. The Acheta domesticus Densovirus, Isolated from the European House Cricket, Has Evolved an Expression Strategy Unique among Parvoviruses▿†

    PubMed Central

    Liu, Kaiyu; Li, Yi; Jousset, Françoise-Xavière; Zadori, Zoltan; Szelei, Jozsef; Yu, Qian; Pham, Hanh Thi; Lépine, François; Bergoin, Max; Tijssen, Peter

    2011-01-01

    The Acheta domesticus densovirus (AdDNV), isolated from crickets, has been endemic in Europe for at least 35 years. Severe epizootics have also been observed in American commercial rearings since 2009 and 2010. The AdDNV genome was cloned and sequenced for this study. The transcription map showed that splicing occurred in both the nonstructural (NS) and capsid protein (VP) multicistronic RNAs. The splicing pattern of NS mRNA predicted 3 nonstructural proteins (NS1 [576 codons], NS2 [286 codons], and NS3 [213 codons]). The VP gene cassette contained two VP open reading frames (ORFs), of 597 (ORF-A) and 268 (ORF-B) codons. The VP2 sequence was shown by N-terminal Edman degradation and mass spectrometry to correspond with ORF-A. Mass spectrometry, sequencing, and Western blotting of baculovirus-expressed VPs versus native structural proteins demonstrated that the VP1 structural protein was generated by joining ORF-A and -B via splicing (splice II), eliminating the N terminus of VP2. This splice resulted in a nested set of VP1 (816 codons), VP3 (467 codons), and VP4 (429 codons) structural proteins. In contrast, the two splices within ORF-B (Ia and Ib) removed the donor site of intron II and resulted in VP2, VP3, and VP4 expression. ORF-B may also code for several nonstructural proteins, of 268, 233, and 158 codons. The small ORF-B contains the coding sequence for a phospholipase A2 motif found in VP1, which was shown previously to be critical for cellular uptake of the virus. These splicing features are unique among parvoviruses and define a new genus of ambisense densoviruses. PMID:21775445

  20. Synonymous Mutations in the Core Gene Are Linked to Unusual Serological Profile in Hepatitis C Virus Infection

    PubMed Central

    Budkowska, Agata; Kakkanas, Athanassios; Nerrienet, Eric; Kalinina, Olga; Maillard, Patrick; Horm, Srey Viseth; Dalagiorgou, Geena; Vassilaki, Niki; Georgopoulou, Urania; Martinot, Michelle; Sall, Amadou Alpha; Mavromara, Penelope

    2011-01-01

    The biological role of the protein encoded by the alternative open reading frame (core+1/ARF) of the Hepatitis C virus (HCV) genome remains elusive, as does the significance of the production of corresponding antibodies in HCV infection. We investigated the prevalence of anti-core and anti-core+1/ARFP antibodies in HCV-positive blood donors from Cambodia, using peptide and recombinant protein-based ELISAs. We detected unusual serological profiles in 3 out of 58 HCV positive plasma of genotype 1a. These patients were negative for anti-core antibodies by commercial and peptide-based assays using C-terminal fragments of core but reacted by Western Blot with full-length core protein. All three patients had high levels of anti-core+1/ARFP antibodies. Cloning of the cDNA that corresponds to the core-coding region from these sera resulted in the expression of both core and core+1/ARFP in mammalian cells. The core protein exhibited high amino-acid homology with a consensus HCV1a sequence. However, 10 identical synonymous mutations were found, and 7 were located in the aa(99–124) region of core. All mutations concerned the third base of a codon, and 5/10 represented a T>C mutation. Prediction analyses of the RNA secondary structure revealed conformational changes within the stem-loop region that contains the core+1/ARFP internal AUG initiator at position 85/87. Using the luciferase tagging approach, we showed that core+1/ARFP expression is more efficient from such a sequence than from the prototype HCV1a RNA. We provide additional evidence of the existence of core+1/ARFP in vivo and new data concerning expression of HCV core protein. We show that HCV patients who do not produce normal anti-core antibodies have unusually high levels of antit-core+1/ARFP and harbour several identical synonymous mutations in the core and core+1/ARFP coding region that result in major changes in predicted RNA structure. Such HCV variants may favour core+1/ARFP production during HCV infection. PMID:21283512

  1. Transonic Drag Prediction on a DLR-F6 Transport Configuration Using Unstructured Grid Solvers

    NASA Technical Reports Server (NTRS)

    Lee-Rausch, E. M.; Frink, N. T.; Mavriplis, D. J.; Rausch, R. D.; Milholen, W. E.

    2004-01-01

    A second international AIAA Drag Prediction Workshop (DPW-II) was organized and held in Orlando Florida on June 21-22, 2003. The primary purpose was to inves- tigate the code-to-code uncertainty. address the sensitivity of the drag prediction to grid size and quantify the uncertainty in predicting nacelle/pylon drag increments at a transonic cruise condition. This paper presents an in-depth analysis of the DPW-II computational results from three state-of-the-art unstructured grid Navier-Stokes flow solvers exercised on similar families of tetrahedral grids. The flow solvers are USM3D - a tetrahedral cell-centered upwind solver. FUN3D - a tetrahedral node-centered upwind solver, and NSU3D - a general element node-centered central-differenced solver. For the wingbody, the total drag predicted for a constant-lift transonic cruise condition showed a decrease in code-to-code variation with grid refinement as expected. For the same flight condition, the wing/body/nacelle/pylon total drag and the nacelle/pylon drag increment predicted showed an increase in code-to-code variation with grid refinement. Although the range in total drag for the wingbody fine grids was only 5 counts, a code-to-code comparison of surface pressures and surface restricted streamlines indicated that the three solvers were not all converging to the same flow solutions- different shock locations and separation patterns were evident. Similarly, the wing/body/nacelle/pylon solutions did not appear to be converging to the same flow solutions. Overall, grid refinement did not consistently improve the correlation with experimental data for either the wingbody or the wing/body/nacelle pylon configuration. Although the absolute values of total drag predicted by two of the solvers for the medium and fine grids did not compare well with the experiment, the incremental drag predictions were within plus or minus 3 counts of the experimental data. The correlation with experimental incremental drag was not significantly changed by specifying transition. Although the sources of code-to-code variation in force and moment predictions for the three unstructured grid codes have not yet been identified, the current study reinforces the necessity of applying multiple codes to the same application to assess uncertainty.

  2. In Silico/In Vivo Insights into the Functional and Evolutionary Pathway of Pseudomonas aeruginosa Oleate-Diol Synthase. Discovery of a New Bacterial Di-Heme Cytochrome C Peroxidase Subfamily

    PubMed Central

    Estupiñán, Mónica; Álvarez-García, Daniel; Barril, Xavier; Diaz, Pilar; Manresa, Angeles

    2015-01-01

    As previously reported, P. aeruginosa genes PA2077 and PA2078 code for 10S-DOX (10S-Dioxygenase) and 7,10-DS (7,10-Diol Synthase) enzymes involved in long-chain fatty acid oxygenation through the recently described oleate-diol synthase pathway. Analysis of the amino acid sequence of both enzymes revealed the presence of two heme-binding motifs (CXXCH) on each protein. Phylogenetic analysis showed the relation of both proteins to bacterial di-heme cytochrome c peroxidases (Ccps), similar to Xanthomonas sp. 35Y rubber oxidase RoxA. Structural homology modelling of PA2077 and PA2078 was achieved using RoxA (pdb 4b2n) as a template. From the 3D model obtained, presence of significant amino acid variations in the predicted heme-environment was found. Moreover, the presence of palindromic repeats located in enzyme-coding regions, acting as protein evolution elements, is reported here for the first time in P. aeruginosa genome. These observations and the constructed phylogenetic tree of the two proteins, allow the proposal of an evolutionary pathway for P. aeruginosa oleate-diol synthase operon. Taking together the in silico and in vivo results obtained we conclude that enzymes PA2077 and PA2078 are the first described members of a new subfamily of bacterial peroxidases, designated as Fatty acid-di-heme Cytochrome c peroxidases (FadCcp). PMID:26154497

  3. Effective comparative analysis of protein-protein interaction networks by measuring the steady-state network flow using a Markov model.

    PubMed

    Jeong, Hyundoo; Qian, Xiaoning; Yoon, Byung-Jun

    2016-10-06

    Comparative analysis of protein-protein interaction (PPI) networks provides an effective means of detecting conserved functional network modules across different species. Such modules typically consist of orthologous proteins with conserved interactions, which can be exploited to computationally predict the modules through network comparison. In this work, we propose a novel probabilistic framework for comparing PPI networks and effectively predicting the correspondence between proteins, represented as network nodes, that belong to conserved functional modules across the given PPI networks. The basic idea is to estimate the steady-state network flow between nodes that belong to different PPI networks based on a Markov random walk model. The random walker is designed to make random moves to adjacent nodes within a PPI network as well as cross-network moves between potential orthologous nodes with high sequence similarity. Based on this Markov random walk model, we estimate the steady-state network flow - or the long-term relative frequency of the transitions that the random walker makes - between nodes in different PPI networks, which can be used as a probabilistic score measuring their potential correspondence. Subsequently, the estimated scores can be used for detecting orthologous proteins in conserved functional modules through network alignment. Through evaluations based on multiple real PPI networks, we demonstrate that the proposed scheme leads to improved alignment results that are biologically more meaningful at reduced computational cost, outperforming the current state-of-the-art algorithms. The source code and datasets can be downloaded from http://www.ece.tamu.edu/~bjyoon/CUFID .

  4. DNA sequence-dependent mechanics and protein-assisted bending in repressor-mediated loop formation

    PubMed Central

    Boedicker, James Q.; Garcia, Hernan G.; Johnson, Stephanie; Phillips, Rob

    2014-01-01

    As the chief informational molecule of life, DNA is subject to extensive physical manipulations. The energy required to deform double-helical DNA depends on sequence, and this mechanical code of DNA influences gene regulation, such as through nucleosome positioning. Here we examine the sequence-dependent flexibility of DNA in bacterial transcription factor-mediated looping, a context for which the role of sequence remains poorly understood. Using a suite of synthetic constructs repressed by the Lac repressor and two well-known sequences that show large flexibility differences in vitro, we make precise statistical mechanical predictions as to how DNA sequence influences loop formation and test these predictions using in vivo transcription and in vitro single-molecule assays. Surprisingly, sequence-dependent flexibility does not affect in vivo gene regulation. By theoretically and experimentally quantifying the relative contributions of sequence and the DNA-bending protein HU to DNA mechanical properties, we reveal that bending by HU dominates DNA mechanics and masks intrinsic sequence-dependent flexibility. Such a quantitative understanding of how mechanical regulatory information is encoded in the genome will be a key step towards a predictive understanding of gene regulation at single-base pair resolution. PMID:24231252

  5. Structural Phylogenomics Retrodicts the Origin of the Genetic Code and Uncovers the Evolutionary Impact of Protein Flexibility

    PubMed Central

    Caetano-Anollés, Gustavo; Wang, Minglei; Caetano-Anollés, Derek

    2013-01-01

    The genetic code shapes the genetic repository. Its origin has puzzled molecular scientists for over half a century and remains a long-standing mystery. Here we show that the origin of the genetic code is tightly coupled to the history of aminoacyl-tRNA synthetase enzymes and their interactions with tRNA. A timeline of evolutionary appearance of protein domain families derived from a structural census in hundreds of genomes reveals the early emergence of the ‘operational’ RNA code and the late implementation of the standard genetic code. The emergence of codon specificities and amino acid charging involved tight coevolution of aminoacyl-tRNA synthetases and tRNA structures as well as episodes of structural recruitment. Remarkably, amino acid and dipeptide compositions of single-domain proteins appearing before the standard code suggest archaic synthetases with structures homologous to catalytic domains of tyrosyl-tRNA and seryl-tRNA synthetases were capable of peptide bond formation and aminoacylation. Results reveal that genetics arose through coevolutionary interactions between polypeptides and nucleic acid cofactors as an exacting mechanism that favored flexibility and folding of the emergent proteins. These enhancements of phenotypic robustness were likely internalized into the emerging genetic system with the early rise of modern protein structure. PMID:23991065

  6. Application of the maximum entropy principle to determine ensembles of intrinsically disordered proteins from residual dipolar couplings.

    PubMed

    Sanchez-Martinez, M; Crehuet, R

    2014-12-21

    We present a method based on the maximum entropy principle that can re-weight an ensemble of protein structures based on data from residual dipolar couplings (RDCs). The RDCs of intrinsically disordered proteins (IDPs) provide information on the secondary structure elements present in an ensemble; however even two sets of RDCs are not enough to fully determine the distribution of conformations, and the force field used to generate the structures has a pervasive influence on the refined ensemble. Two physics-based coarse-grained force fields, Profasi and Campari, are able to predict the secondary structure elements present in an IDP, but even after including the RDC data, the re-weighted ensembles differ between both force fields. Thus the spread of IDP ensembles highlights the need for better force fields. We distribute our algorithm in an open-source Python code.

  7. FIST: a sensory domain for diverse signal transduction pathways in prokaryotes and ubiquitin signaling in eukaryotes

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Borziak, Kirill; Jouline, Igor B

    2007-01-01

    Motivation: Sensory domains that are conserved among Bacteria, Archaea and Eucarya are important detectors of common signals detected by living cells. Due to their high sequence divergence, sensory domains are difficult to identify. We systematically look for novel sensory domains using sensitive profile-based searches initi-ated with regions of signal transduction proteins where no known domains can be identified by current domain models. Results: Using profile searches followed by multiple sequence alignment, structure prediction, and domain architecture analysis, we have identified a novel sensory domain termed FIST, which is present in signal transduction proteins from Bacteria, Archaea and Eucarya. Remote similaritymore » to a known ligand-binding fold and chromosomal proximity of FIST-encoding genes to those coding for proteins involved in amino acid metabolism and transport suggest that FIST domains bind small ligands, such as amino acids.« less

  8. Coarse-grained sequences for protein folding and design.

    PubMed

    Brown, Scott; Fawzi, Nicolas J; Head-Gordon, Teresa

    2003-09-16

    We present the results of sequence design on our off-lattice minimalist model in which no specification of native-state tertiary contacts is needed. We start with a sequence that adopts a target topology and build on it through sequence mutation to produce new sequences that comprise distinct members within a target fold class. In this work, we use the alpha/beta ubiquitin fold class and design two new sequences that, when characterized through folding simulations, reproduce the differences in folding mechanism seen experimentally for proteins L and G. The primary implication of this work is that patterning of hydrophobic and hydrophilic residues is the physical origin for the success of relative contact-order descriptions of folding, and that these physics-based potentials provide a predictive connection between free energy landscapes and amino acid sequence (the original protein folding problem). We present results of the sequence mapping from a 20- to the three-letter code for determining a sequence that folds into the WW domain topology to illustrate future extensions to protein design.

  9. Coarse-grained sequences for protein folding and design

    PubMed Central

    Brown, Scott; Fawzi, Nicolas J.; Head-Gordon, Teresa

    2003-01-01

    We present the results of sequence design on our off-lattice minimalist model in which no specification of native-state tertiary contacts is needed. We start with a sequence that adopts a target topology and build on it through sequence mutation to produce new sequences that comprise distinct members within a target fold class. In this work, we use the α/β ubiquitin fold class and design two new sequences that, when characterized through folding simulations, reproduce the differences in folding mechanism seen experimentally for proteins L and G. The primary implication of this work is that patterning of hydrophobic and hydrophilic residues is the physical origin for the success of relative contact-order descriptions of folding, and that these physics-based potentials provide a predictive connection between free energy landscapes and amino acid sequence (the original protein folding problem). We present results of the sequence mapping from a 20- to the three-letter code for determining a sequence that folds into the WW domain topology to illustrate future extensions to protein design. PMID:12963815

  10. CombAlign: a code for generating a one-to-many sequence alignment from a set of pairwise structure-based sequence alignments.

    PubMed

    Zhou, Carol L Ecale

    2015-01-01

    In order to better define regions of similarity among related protein structures, it is useful to identify the residue-residue correspondences among proteins. Few codes exist for constructing a one-to-many multiple sequence alignment derived from a set of structure or sequence alignments, and a need was evident for creating such a tool for combining pairwise structure alignments that would allow for insertion of gaps in the reference structure. This report describes a new Python code, CombAlign, which takes as input a set of pairwise sequence alignments (which may be structure based) and generates a one-to-many, gapped, multiple structure- or sequence-based sequence alignment (MSSA). The use and utility of CombAlign was demonstrated by generating gapped MSSAs using sets of pairwise structure-based sequence alignments between structure models of the matrix protein (VP40) and pre-small/secreted glycoprotein (sGP) of Reston Ebolavirus and the corresponding proteins of several other filoviruses. The gapped MSSAs revealed structure-based residue-residue correspondences, which enabled identification of structurally similar versus differing regions in the Reston proteins compared to each of the other corresponding proteins. CombAlign is a new Python code that generates a one-to-many, gapped, multiple structure- or sequence-based sequence alignment (MSSA) given a set of pairwise sequence alignments (which may be structure based). CombAlign has utility in assisting the user in distinguishing structurally conserved versus divergent regions on a reference protein structure relative to other closely related proteins. CombAlign was developed in Python 2.6, and the source code is available for download from the GitHub code repository.

  11. Advanced turboprop noise prediction: Development of a code at NASA Langley based on recent theoretical results

    NASA Technical Reports Server (NTRS)

    Farassat, F.; Dunn, M. H.; Padula, S. L.

    1986-01-01

    The development of a high speed propeller noise prediction code at Langley Research Center is described. The code utilizes two recent acoustic formulations in the time domain for subsonic and supersonic sources. The structure and capabilities of the code are discussed. Grid size study for accuracy and speed of execution on a computer is also presented. The code is tested against an earlier Langley code. Considerable increase in accuracy and speed of execution are observed. Some examples of noise prediction of a high speed propeller for which acoustic test data are available are given. A brisk derivation of formulations used is given in an appendix.

  12. nGASP--the nematode genome annotation assessment project.

    PubMed

    Coghlan, Avril; Fiedler, Tristan J; McKay, Sheldon J; Flicek, Paul; Harris, Todd W; Blasiar, Darin; Stein, Lincoln D

    2008-12-19

    While the C. elegans genome is extensively annotated, relatively little information is available for other Caenorhabditis species. The nematode genome annotation assessment project (nGASP) was launched to objectively assess the accuracy of protein-coding gene prediction software in C. elegans, and to apply this knowledge to the annotation of the genomes of four additional Caenorhabditis species and other nematodes. Seventeen groups worldwide participated in nGASP, and submitted 47 prediction sets across 10 Mb of the C. elegans genome. Predictions were compared to reference gene sets consisting of confirmed or manually curated gene models from WormBase. The most accurate gene-finders were 'combiner' algorithms, which made use of transcript- and protein-alignments and multi-genome alignments, as well as gene predictions from other gene-finders. Gene-finders that used alignments of ESTs, mRNAs and proteins came in second. There was a tie for third place between gene-finders that used multi-genome alignments and ab initio gene-finders. The median gene level sensitivity of combiners was 78% and their specificity was 42%, which is nearly the same accuracy reported for combiners in the human genome. C. elegans genes with exons of unusual hexamer content, as well as those with unusually many exons, short exons, long introns, a weak translation start signal, weak splice sites, or poorly conserved orthologs posed the greatest difficulty for gene-finders. This experiment establishes a baseline of gene prediction accuracy in Caenorhabditis genomes, and has guided the choice of gene-finders for the annotation of newly sequenced genomes of Caenorhabditis and other nematode species. We have created new gene sets for C. briggsae, C. remanei, C. brenneri, C. japonica, and Brugia malayi using some of the best-performing gene-finders.

  13. RNA structural constraints in the evolution of the influenza A virus genome NP segment

    PubMed Central

    Gultyaev, Alexander P; Tsyganov-Bodounov, Anton; Spronken, Monique IJ; van der Kooij, Sander; Fouchier, Ron AM; Olsthoorn, René CL

    2014-01-01

    Conserved RNA secondary structures were predicted in the nucleoprotein (NP) segment of the influenza A virus genome using comparative sequence and structure analysis. A number of structural elements exhibiting nucleotide covariations were identified over the whole segment length, including protein-coding regions. Calculations of mutual information values at the paired nucleotide positions demonstrate that these structures impose considerable constraints on the virus genome evolution. Functional importance of a pseudoknot structure, predicted in the NP packaging signal region, was confirmed by plaque assays of the mutant viruses with disrupted structure and those with restored folding using compensatory substitutions. Possible functions of the conserved RNA folding patterns in the influenza A virus genome are discussed. PMID:25180940

  14. CELFE/NASTRAN Code for the Analysis of Structures Subjected to High Velocity Impact

    NASA Technical Reports Server (NTRS)

    Chamis, C. C.

    1978-01-01

    CELFE (Coupled Eulerian Lagrangian Finite Element)/NASTRAN Code three-dimensional finite element code has the capability for analyzing of structures subjected to high velocity impact. The local response is predicted by CELFE and, for large problems, the far-field impact response is predicted by NASTRAN. The coupling of the CELFE code with NASTRAN (CELFE/NASTRAN code) and the application of the code to selected three-dimensional high velocity impact problems are described.

  15. Indexing sensory plasticity: Evidence for distinct Predictive Coding and Hebbian learning mechanisms in the cerebral cortex.

    PubMed

    Spriggs, M J; Sumner, R L; McMillan, R L; Moran, R J; Kirk, I J; Muthukumaraswamy, S D

    2018-04-30

    The Roving Mismatch Negativity (MMN), and Visual LTP paradigms are widely used as independent measures of sensory plasticity. However, the paradigms are built upon fundamentally different (and seemingly opposing) models of perceptual learning; namely, Predictive Coding (MMN) and Hebbian plasticity (LTP). The aim of the current study was to compare the generative mechanisms of the MMN and visual LTP, therefore assessing whether Predictive Coding and Hebbian mechanisms co-occur in the brain. Forty participants were presented with both paradigms during EEG recording. Consistent with Predictive Coding and Hebbian predictions, Dynamic Causal Modelling revealed that the generation of the MMN modulates forward and backward connections in the underlying network, while visual LTP only modulates forward connections. These results suggest that both Predictive Coding and Hebbian mechanisms are utilized by the brain under different task demands. This therefore indicates that both tasks provide unique insight into plasticity mechanisms, which has important implications for future studies of aberrant plasticity in clinical populations. Copyright © 2018 Elsevier Inc. All rights reserved.

  16. Lineage-Specific Biology Revealed by a Finished Genome Assembly of the Mouse

    PubMed Central

    Hillier, LaDeana W.; Zody, Michael C.; Goldstein, Steve; She, Xinwe; Bult, Carol J.; Agarwala, Richa; Cherry, Joshua L.; DiCuccio, Michael; Hlavina, Wratko; Kapustin, Yuri; Meric, Peter; Maglott, Donna; Birtle, Zoë; Marques, Ana C.; Graves, Tina; Zhou, Shiguo; Teague, Brian; Potamousis, Konstantinos; Churas, Christopher; Place, Michael; Herschleb, Jill; Runnheim, Ron; Forrest, Daniel; Amos-Landgraf, James; Schwartz, David C.; Cheng, Ze; Lindblad-Toh, Kerstin; Eichler, Evan E.; Ponting, Chris P.

    2009-01-01

    The mouse (Mus musculus) is the premier animal model for understanding human disease and development. Here we show that a comprehensive understanding of mouse biology is only possible with the availability of a finished, high-quality genome assembly. The finished clone-based assembly of the mouse strain C57BL/6J reported here has over 175,000 fewer gaps and over 139 Mb more of novel sequence, compared with the earlier MGSCv3 draft genome assembly. In a comprehensive analysis of this revised genome sequence, we are now able to define 20,210 protein-coding genes, over a thousand more than predicted in the human genome (19,042 genes). In addition, we identified 439 long, non–protein-coding RNAs with evidence for transcribed orthologs in human. We analyzed the complex and repetitive landscape of 267 Mb of sequence that was missing or misassembled in the previously published assembly, and we provide insights into the reasons for its resistance to sequencing and assembly by whole-genome shotgun approaches. Duplicated regions within newly assembled sequence tend to be of more recent ancestry than duplicates in the published draft, correcting our initial understanding of recent evolution on the mouse lineage. These duplicates appear to be largely composed of sequence regions containing transposable elements and duplicated protein-coding genes; of these, some may be fixed in the mouse population, but at least 40% of segmentally duplicated sequences are copy number variable even among laboratory mouse strains. Mouse lineage-specific regions contain 3,767 genes drawn mainly from rapidly-changing gene families associated with reproductive functions. The finished mouse genome assembly, therefore, greatly improves our understanding of rodent-specific biology and allows the delineation of ancestral biological functions that are shared with human from derived functions that are not. PMID:19468303

  17. In silico methods for co-transcriptional RNA secondary structure prediction and for investigating alternative RNA structure expression.

    PubMed

    Meyer, Irmtraud M

    2017-05-01

    RNA transcripts are the primary products of active genes in any living organism, including many viruses. Their cellular destiny not only depends on primary sequence signals, but can also be determined by RNA structure. Recent experimental evidence shows that many transcripts can be assigned more than a single functional RNA structure throughout their cellular life and that structure formation happens co-transcriptionally, i.e. as the transcript is synthesised in the cell. Moreover, functional RNA structures are not limited to non-coding transcripts, but can also feature in coding transcripts. The picture that now emerges is that RNA structures constitute an additional layer of information that can be encoded in any RNA transcript (and on top of other layers of information such as protein-context) in order to exert a wide range of functional roles. Moreover, different encoded RNA structures can be expressed at different stages of a transcript's life in order to alter the transcript's behaviour depending on its actual cellular context. Similar to the concept of alternative splicing for protein-coding genes, where a single transcript can yield different proteins depending on cellular context, it is thus appropriate to propose the notion of alternative RNA structure expression for any given transcript. This review introduces several computational strategies that my group developed to detect different aspects of RNA structure expression in vivo. Two aspects are of particular interest to us: (1) RNA secondary structure features that emerge during co-transcriptional folding and (2) functional RNA structure features that are expressed at different times of a transcript's life and potentially mutually exclusive. Copyright © 2017. Published by Elsevier Inc.

  18. Complex Interplay among DNA Modification, Noncoding RNA Expression and Protein-Coding RNA Expression in Salvia miltiorrhiza Chloroplast Genome

    PubMed Central

    Chen, Haimei; Zhang, Jianhui; Yuan, George; Liu, Chang

    2014-01-01

    Salvia miltiorrhiza is one of the most widely used medicinal plants. As a first step to develop a chloroplast-based genetic engineering method for the over-production of active components from S. miltiorrhiza, we have analyzed the genome, transcriptome, and base modifications of the S. miltiorrhiza chloroplast. Total genomic DNA and RNA were extracted from fresh leaves and then subjected to strand-specific RNA-Seq and Single-Molecule Real-Time (SMRT) sequencing analyses. Mapping the RNA-Seq reads to the genome assembly allowed us to determine the relative expression levels of 80 protein-coding genes. In addition, we identified 19 polycistronic transcription units and 136 putative antisense and intergenic noncoding RNA (ncRNA) genes. Comparison of the abundance of protein-coding transcripts (cRNA) with and without overlapping antisense ncRNAs (asRNA) suggest that the presence of asRNA is associated with increased cRNA abundance (p<0.05). Using the SMRT Portal software (v1.3.2), 2687 potential DNA modification sites and two potential DNA modification motifs were predicted. The two motifs include a TATA box–like motif (CPGDMM1, “TATANNNATNA”), and an unknown motif (CPGDMM2 “WNYANTGAW”). Specifically, 35 of the 97 CPGDMM1 motifs (36.1%) and 91 of the 369 CPGDMM2 motifs (24.7%) were found to be significantly modified (p<0.01). Analysis of genes downstream of the CPGDMM1 motif revealed the significantly increased abundance of ncRNA genes that are less than 400 bp away from the significantly modified CPGDMM1motif (p<0.01). Taking together, the present study revealed a complex interplay among DNA modifications, ncRNA and cRNA expression in chloroplast genome. PMID:24914614

  19. Complex interplay among DNA modification, noncoding RNA expression and protein-coding RNA expression in Salvia miltiorrhiza chloroplast genome.

    PubMed

    Chen, Haimei; Zhang, Jianhui; Yuan, George; Liu, Chang

    2014-01-01

    Salvia miltiorrhiza is one of the most widely used medicinal plants. As a first step to develop a chloroplast-based genetic engineering method for the over-production of active components from S. miltiorrhiza, we have analyzed the genome, transcriptome, and base modifications of the S. miltiorrhiza chloroplast. Total genomic DNA and RNA were extracted from fresh leaves and then subjected to strand-specific RNA-Seq and Single-Molecule Real-Time (SMRT) sequencing analyses. Mapping the RNA-Seq reads to the genome assembly allowed us to determine the relative expression levels of 80 protein-coding genes. In addition, we identified 19 polycistronic transcription units and 136 putative antisense and intergenic noncoding RNA (ncRNA) genes. Comparison of the abundance of protein-coding transcripts (cRNA) with and without overlapping antisense ncRNAs (asRNA) suggest that the presence of asRNA is associated with increased cRNA abundance (p<0.05). Using the SMRT Portal software (v1.3.2), 2687 potential DNA modification sites and two potential DNA modification motifs were predicted. The two motifs include a TATA box-like motif (CPGDMM1, "TATANNNATNA"), and an unknown motif (CPGDMM2 "WNYANTGAW"). Specifically, 35 of the 97 CPGDMM1 motifs (36.1%) and 91 of the 369 CPGDMM2 motifs (24.7%) were found to be significantly modified (p<0.01). Analysis of genes downstream of the CPGDMM1 motif revealed the significantly increased abundance of ncRNA genes that are less than 400 bp away from the significantly modified CPGDMM1motif (p<0.01). Taking together, the present study revealed a complex interplay among DNA modifications, ncRNA and cRNA expression in chloroplast genome.

  20. Correlation approach to identify coding regions in DNA sequences

    NASA Technical Reports Server (NTRS)

    Ossadnik, S. M.; Buldyrev, S. V.; Goldberger, A. L.; Havlin, S.; Mantegna, R. N.; Peng, C. K.; Simons, M.; Stanley, H. E.

    1994-01-01

    Recently, it was observed that noncoding regions of DNA sequences possess long-range power-law correlations, whereas coding regions typically display only short-range correlations. We develop an algorithm based on this finding that enables investigators to perform a statistical analysis on long DNA sequences to locate possible coding regions. The algorithm is particularly successful in predicting the location of lengthy coding regions. For example, for the complete genome of yeast chromosome III (315,344 nucleotides), at least 82% of the predictions correspond to putative coding regions; the algorithm correctly identified all coding regions larger than 3000 nucleotides, 92% of coding regions between 2000 and 3000 nucleotides long, and 79% of coding regions between 1000 and 2000 nucleotides. The predictive ability of this new algorithm supports the claim that there is a fundamental difference in the correlation property between coding and noncoding sequences. This algorithm, which is not species-dependent, can be implemented with other techniques for rapidly and accurately locating relatively long coding regions in genomic sequences.

  1. Calibration and comparison of the NASA Lewis free-piston Stirling engine model predictions with RE-1000 test data

    NASA Technical Reports Server (NTRS)

    Geng, Steven M.

    1987-01-01

    A free-piston Stirling engine performance code is being upgraded and validated at the NASA Lewis Research Center under an interagency agreement between the Department of Energy's Oak Ridge National Laboratory and NASA Lewis. Many modifications were made to the free-piston code in an attempt to decrease the calibration effort. A procedure was developed that made the code calibration process more systematic. Engine-specific calibration parameters are often used to bring predictions and experimental data into better agreement. The code was calibrated to a matrix of six experimental data points. Predictions of the calibrated free-piston code are compared with RE-1000 free-piston Stirling engine sensitivity test data taken at NASA Lewis. Reasonable agreement was obtained between the code prediction and the experimental data over a wide range of engine operating conditions.

  2. Calibration and comparison of the NASA Lewis free-piston Stirling engine model predictions with RE-1000 test data

    NASA Technical Reports Server (NTRS)

    Geng, Steven M.

    1987-01-01

    A free-piston Stirling engine performance code is being upgraded and validated at the NASA Lewis Research Center under an interagency agreement between the Department of Energy's Oak Ridge National Laboratory and NASA Lewis. Many modifications were made to the free-piston code in an attempt to decrease the calibration effort. A procedure was developed that made the code calibration process more systematic. Engine-specific calibration parameters are often used to bring predictions and experimental data into better agreement. The code was calibrated to a matrix of six experimental data points. Predictions of the calibrated free-piston code are compared with RE-1000 free-piston Stirling engine sensitivity test data taken at NASA Lewis. Resonable agreement was obtained between the code predictions and the experimental data over a wide range of engine operating conditions.

  3. Physicochemical code for quinary protein interactions in Escherichia coli

    PubMed Central

    Mu, Xin; Choi, Seongil; Lang, Lisa; Mowray, David; Danielsson, Jens; Oliveberg, Mikael

    2017-01-01

    How proteins sense and navigate the cellular interior to find their functional partners remains poorly understood. An intriguing aspect of this search is that it relies on diffusive encounters with the crowded cellular background, made up of protein surfaces that are largely nonconserved. The question is then if/how this protein search is amenable to selection and biological control. To shed light on this issue, we examined the motions of three evolutionary divergent proteins in the Escherichia coli cytoplasm by in-cell NMR. The results show that the diffusive in-cell motions, after all, follow simplistic physical−chemical rules: The proteins reveal a common dependence on (i) net charge density, (ii) surface hydrophobicity, and (iii) the electric dipole moment. The bacterial protein is here biased to move relatively freely in the bacterial interior, whereas the human counterparts more easily stick. Even so, the in-cell motions respond predictably to surface mutation, allowing us to tune and intermix the protein’s behavior at will. The findings show how evolution can swiftly optimize the diffuse background of protein encounter complexes by just single-point mutations, and provide a rational framework for adjusting the cytoplasmic motions of individual proteins, e.g., for rescuing poor in-cell NMR signals and for optimizing protein therapeutics. PMID:28536196

  4. Insights from the genome of a high alkaline cellulase producing Aspergillus fumigatus strain obtained from Peruvian Amazon rainforest.

    PubMed

    Paul, Sujay; Zhang, Angel; Ludeña, Yvette; Villena, Gretty K; Yu, Fengan; Sherman, David H; Gutiérrez-Correa, Marcel

    2017-06-10

    Here, we report the complete genome sequence of a high alkaline cellulase producing Aspergillus fumigatus strain LMB-35Aa isolated from soil of Peruvian Amazon rainforest. The genome is ∼27.5mb in size, comprises of 228 scaffolds with an average GC content of 50%, and is predicted to contain a total of 8660 protein-coding genes. Of which, 6156 are with known function; it codes for 607 putative CAZymes families potentially involved in carbohydrate metabolism. Several important cellulose degrading genes, such as endoglucanase A, endoglucanase B, endoglucanase D and beta-glucosidase, are also identified. The genome of A. fumigatus strain LMB-35Aa represents the first whole sequenced genome of non-clinical, high cellulase producing A. fumigatus strain isolated from forest soil. Copyright © 2017 Elsevier B.V. All rights reserved.

  5. The majority of total nuclear-encoded non-ribosomal RNA in a human cell is 'dark matter' un-annotated RNA.

    PubMed

    Kapranov, Philipp; St Laurent, Georges; Raz, Tal; Ozsolak, Fatih; Reynolds, C Patrick; Sorensen, Poul H B; Reaman, Gregory; Milos, Patrice; Arceci, Robert J; Thompson, John F; Triche, Timothy J

    2010-12-21

    Discovery that the transcriptional output of the human genome is far more complex than predicted by the current set of protein-coding annotations and that most RNAs produced do not appear to encode proteins has transformed our understanding of genome complexity and suggests new paradigms of genome regulation. However, the fraction of all cellular RNA whose function we do not understand and the fraction of the genome that is utilized to produce that RNA remain controversial. This is not simply a bookkeeping issue because the degree to which this un-annotated transcription is present has important implications with respect to its biologic function and to the general architecture of genome regulation. For example, efforts to elucidate how non-coding RNAs (ncRNAs) regulate genome function will be compromised if that class of RNAs is dismissed as simply 'transcriptional noise'. We show that the relative mass of RNA whose function and/or structure we do not understand (the so called 'dark matter' RNAs), as a proportion of all non-ribosomal, non-mitochondrial human RNA (mt-RNA), can be greater than that of protein-encoding transcripts. This observation is obscured in studies that focus only on polyA-selected RNA, a method that enriches for protein coding RNAs and at the same time discards the vast majority of RNA prior to analysis. We further show the presence of a large number of very long, abundantly-transcribed regions (100's of kb) in intergenic space and further show that expression of these regions is associated with neoplastic transformation. These overlap some regions found previously in normal human embryonic tissues and raises an interesting hypothesis as to the function of these ncRNAs in both early development and neoplastic transformation. We conclude that 'dark matter' RNA can constitute the majority of non-ribosomal, non-mitochondrial-RNA and a significant fraction arises from numerous very long, intergenic transcribed regions that could be involved in neoplastic transformation.

  6. Immunoreactivity of polyclonal antibodies generated against the carboxy terminus of the predicted amino acid sequence of the Huntington disease gene

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Alkatib, G.; Graham, R.; Pelmear-Telenius, A.

    1994-09-01

    A cDNA fragment spanning the 3{prime}-end of the Huntington disease gene (from 8052 to 9252) was cloned into a prokaryotic expression vector containing the E. Coli lac promoter and a portion of the coding sequence for {beta}-galactosidase. The truncated {beta}-galactosidase gene was cleaved with BamHl and fused in frame to the BamHl fragment of the Huntington disease gene 3{prime}-end. Expression analysis of proteins made in E. Coli revealed that 20-30% of the total cellular proteins was represented by the {beta}-galactosidase-huntingtin fusion protein. The identity of the Huntington disease protein amino acid sequences was confirmed by protein sequence analysis. Affinity chromatographymore » was used to purify large quantities of the fusion protein from bacterial cell lysates. Affinity-purified proteins were used to immunize New Zealand white rabbits for antibody production. The generated polyclonal antibodies were used to immunoprecipitate the Huntington disease gene product expressed in a neuroblastoma cell line. In this cell line the antibodies precipitated two protein bands of apparent gel migrations of 200 and 150 kd which together, correspond to the calculated molecular weight of the Huntington disease gene product (350 kd). Immunoblotting experiments revealed the presence of a large precursor protein in the range of 350-750 kd which is in agreement with the predicted molecular weight of the protein without post-translational modifications. These results indicate that the huntingtin protein is cleaved into two subunits in this neuroblastoma cell line and implicate that cleavage of a large precursor protein may contribute to its biological activity. Experiments are ongoing to determine the precursor-product relationship and to examine the synthesis of the huntingtin protein in freshly isolated rat brains, and to determine cellular and subcellular distribution of the gene product.« less

  7. The complete mitochondrial genome of Hydra vulgaris (Hydroida: Hydridae).

    PubMed

    Pan, Hong-Chun; Fang, Hong-Yan; Li, Shi-Wei; Liu, Jun-Hong; Wang, Ying; Wang, An-Tai

    2014-12-01

    The complete mitochondrial genome of Hydra vulgaris (Hydroida: Hydridae) is composed of two linear DNA molecules. The mitochondrial DNA (mtDNA) molecule 1 is 8010 bp long and contains six protein-coding genes, large subunit rRNA, methionine and tryptophan tRNAs, two pseudogenes consisting respectively of a partial copy of COI, and terminal sequences at two ends of the linear mtDNA, while the mtDNA molecule 2 is 7576 bp long and contains seven protein-coding genes, small subunit rRNA, methionine tRNA, a pseudogene consisting of a partial copy of COI and terminal sequences at two ends of the linear mtDNA. COI gene begins with GTG as start codon, whereas other 12 protein-coding genes start with a typical ATG initiation codon. In addition, all protein-coding genes are terminated with TAA as stop codon.

  8. Genetic Code Expansion of Mammalian Cells with Unnatural Amino Acids.

    PubMed

    Brown, Kalyn A; Deiters, Alexander

    2015-09-01

    The expansion of the genetic code of mammalian cells enables the incorporation of unnatural amino acids into proteins. This is achieved by adding components to the protein biosynthetic machinery, specifically an engineered aminoacyl-tRNA synthetase/tRNA pair. The unnatural amino acids are chemically synthesized and supplemented to the growth medium. Using this methodology, fundamental new chemistries can be added to the functional repertoire of the genetic code of mammalian cells. This protocol outlines the steps necessary to incorporate a photocaged lysine into proteins and showcases its application in the optical triggering of protein translocation to the nucleus. Copyright © 2015 John Wiley & Sons, Inc.

  9. Predictive codes of familiarity and context during the perceptual learning of facial identities

    NASA Astrophysics Data System (ADS)

    Apps, Matthew A. J.; Tsakiris, Manos

    2013-11-01

    Face recognition is a key component of successful social behaviour. However, the computational processes that underpin perceptual learning and recognition as faces transition from unfamiliar to familiar are poorly understood. In predictive coding, learning occurs through prediction errors that update stimulus familiarity, but recognition is a function of both stimulus and contextual familiarity. Here we show that behavioural responses on a two-option face recognition task can be predicted by the level of contextual and facial familiarity in a computational model derived from predictive-coding principles. Using fMRI, we show that activity in the superior temporal sulcus varies with the contextual familiarity in the model, whereas activity in the fusiform face area covaries with the prediction error parameter that updated facial familiarity. Our results characterize the key computations underpinning the perceptual learning of faces, highlighting that the functional properties of face-processing areas conform to the principles of predictive coding.

  10. Arbitrariness is not enough: towards a functional approach to the genetic code.

    PubMed

    Lacková, Ľudmila; Matlach, Vladimír; Faltýnek, Dan

    2017-12-01

    Arbitrariness in the genetic code is one of the main reasons for a linguistic approach to molecular biology: the genetic code is usually understood as an arbitrary relation between amino acids and nucleobases. However, from a semiotic point of view, arbitrariness should not be the only condition for definition of a code, consequently it is not completely correct to talk about "code" in this case. Yet we suppose that there exist a code in the process of protein synthesis, but on a higher level than the nucleic bases chains. Semiotically, a code should be always associated with a function and we propose to define the genetic code not only relationally (in basis of relation between nucleobases and amino acids) but also in terms of function (function of a protein as meaning of the code). Even if the functional definition of meaning in the genetic code has been discussed in the field of biosemiotics, its further implications have not been considered. In fact, if the function of a protein represents the meaning of the genetic code (the sign's object), then it is crucial to reconsider the notion of its expression (the sign) as well. In our contribution, we will show that the actual model of the genetic code is not the only possible and we will propose a more appropriate model from a semiotic point of view.

  11. The aminoacyl-tRNA synthetases had only a marginal role in the origin of the organization of the genetic code: Evidence in favor of the coevolution theory.

    PubMed

    Di Giulio, Massimo

    2017-11-07

    The coevolution theory of the origin of the genetic code suggests that the organization of the genetic code coevolved with the biosynthetic relationships between amino acids. The mechanism that allowed this coevolution was based on tRNA-like molecules on which-this theory-would postulate the biosynthetic transformations between amino acids to have occurred. This mechanism makes a prediction on how the role conducted by the aminoacyl-tRNA synthetases (ARSs), in the origin of the genetic code, should have been. Indeed, if the biosynthetic transformations between amino acids occurred on tRNA-like molecules, then there was no need to link amino acids to these molecules because amino acids were already charged on tRNA-like molecules, as the coevolution theory suggests. In spite of the fact that ARSs make the genetic code responsible for the first interaction between a component of nucleic acids and that of proteins, for the coevolution theory the role of ARSs should have been entirely marginal in the genetic code origin. Therefore, I have conducted a further analysis of the distribution of the two classes of ARSs and of their subclasses-in the genetic code table-in order to perform a falsification test of the coevolution theory. Indeed, in the case in which the distribution of ARSs within the genetic code would have been highly significant, then the coevolution theory would be falsified since the mechanism on which it is based would not predict a fundamental role of ARSs in the origin of the genetic code. I found that the statistical significance of the distribution of the two classes of ARSs in the table of the genetic code is low or marginal, whereas that of the subclasses of ARSs statistically significant. However, this is in perfect agreement with the postulates of the coevolution theory. Indeed, the only case of statistical significance-regarding the classes of ARSs-is appreciable for the CAG code, whereas for its complement-the UNN/NUN code-only a marginal significance is measurable. These two codes codify roughly for the two ARS classes, in particular, the CAG code for the class II while the UNN/NUN code for the class I. Furthermore, the subclasses of ARSs show a statistical significance of their distribution in the genetic code table. Nevertheless, the more sensible explanation for these observations would be the following. The observation that would link the two classes of ARSs to the CAG and UNN/NUN codes, and the statistical significance of the distribution of the subclasses of ARSs in the genetic code table, would be only a secondary effect due to the highly significant distribution of the polarity of amino acids and their biosynthetic relationships in the genetic code. That is to say, the polarity of amino acids and their biosynthetic relationships would have conditioned the evolution of ARSs so that their presence in the genetic code would have been detectable. Even if the ARSs would not have-on their own-influenced directly the evolutionary organization of the genetic code. In other words, the role that ARSs had in the origin of the genetic code would have been entirely marginal. This conclusion would be in perfect accord with the predictions of the coevolution theory. Conversely, this conclusion would be in contrast-at least partially-with the physicochemical theories of the origin of the genetic code because they would foresee an absolutely more active role of ARSs in the origin of the organization of the genetic code. Copyright © 2017 Elsevier Ltd. All rights reserved.

  12. Prediction task guided representation learning of medical codes in EHR.

    PubMed

    Cui, Liwen; Xie, Xiaolei; Shen, Zuojun

    2018-06-18

    There have been rapidly growing applications using machine learning models for predictive analytics in Electronic Health Records (EHR) to improve the quality of hospital services and the efficiency of healthcare resource utilization. A fundamental and crucial step in developing such models is to convert medical codes in EHR to feature vectors. These medical codes are used to represent diagnoses or procedures. Their vector representations have a tremendous impact on the performance of machine learning models. Recently, some researchers have utilized representation learning methods from Natural Language Processing (NLP) to learn vector representations of medical codes. However, most previous approaches are unsupervised, i.e. the generation of medical code vectors is independent from prediction tasks. Thus, the obtained feature vectors may be inappropriate for a specific prediction task. Moreover, unsupervised methods often require a lot of samples to obtain reliable results, but most practical problems have very limited patient samples. In this paper, we develop a new method called Prediction Task Guided Health Record Aggregation (PTGHRA), which aggregates health records guided by prediction tasks, to construct training corpus for various representation learning models. Compared with unsupervised approaches, representation learning models integrated with PTGHRA yield a significant improvement in predictive capability of generated medical code vectors, especially for limited training samples. Copyright © 2018. Published by Elsevier Inc.

  13. RNA catalysis and the origins of life

    NASA Technical Reports Server (NTRS)

    Orgel, Leslie E.

    1986-01-01

    The role of RNA catalysis in the origins of life is considered in connection with the discovery of riboszymes, which are RNA molecules that catalyze sequence-specific hydrolysis and transesterification reactions of RNA substrates. Due to this discovery, theories positing protein-free replication as preceding the appearance of the genetic code are more plausible. The scope of RNA catalysis in biology and chemistry is discussed, and it is noted that the development of methods to select (or predict) RNA sequences with preassigned catalytic functions would be a major contribution to the study of life's origins.

  14. Permanent draft genome sequence of Comamonas testosteroni KF-1

    PubMed Central

    Weiss, Michael; Kesberg, Anna I.; LaButti, Kurt M.; Pitluck, Sam; Bruce, David; Hauser, Loren; Copeland, Alex; Woyke, Tanja; Lowry, Stephen; Lucas, Susan; Land, Miriam; Goodwin, Lynne; Kjelleberg, Staffan; Cook, Alasdair M.; Buhmann, Matthias; Thomas, Torsten; Schleheck, David

    2013-01-01

    Comamonas testosteroni KF-1 is a model organism for the elucidation of the novel biochemical degradation pathways for xenobiotic 4-sulfophenylcarboxylates (SPC) formed during biodegradation of synthetic 4-sulfophenylalkane surfactants (linear alkylbenzenesulfonates, LAS) by bacterial communities. Here we describe the features of this organism, together with the complete genome sequence and annotation. The 6,026,527 bp long chromosome (one sequencing gap) exhibits an average G+C content of 61.79% and is predicted to encode 5,492 protein-coding genes and 114 RNA genes. PMID:23991256

  15. Improving protein fold recognition by extracting fold-specific features from predicted residue-residue contacts.

    PubMed

    Zhu, Jianwei; Zhang, Haicang; Li, Shuai Cheng; Wang, Chao; Kong, Lupeng; Sun, Shiwei; Zheng, Wei-Mou; Bu, Dongbo

    2017-12-01

    Accurate recognition of protein fold types is a key step for template-based prediction of protein structures. The existing approaches to fold recognition mainly exploit the features derived from alignments of query protein against templates. These approaches have been shown to be successful for fold recognition at family level, but usually failed at superfamily/fold levels. To overcome this limitation, one of the key points is to explore more structurally informative features of proteins. Although residue-residue contacts carry abundant structural information, how to thoroughly exploit these information for fold recognition still remains a challenge. In this study, we present an approach (called DeepFR) to improve fold recognition at superfamily/fold levels. The basic idea of our approach is to extract fold-specific features from predicted residue-residue contacts of proteins using deep convolutional neural network (DCNN) technique. Based on these fold-specific features, we calculated similarity between query protein and templates, and then assigned query protein with fold type of the most similar template. DCNN has showed excellent performance in image feature extraction and image recognition; the rational underlying the application of DCNN for fold recognition is that contact likelihood maps are essentially analogy to images, as they both display compositional hierarchy. Experimental results on the LINDAHL dataset suggest that even using the extracted fold-specific features alone, our approach achieved success rate comparable to the state-of-the-art approaches. When further combining these features with traditional alignment-related features, the success rate of our approach increased to 92.3%, 82.5% and 78.8% at family, superfamily and fold levels, respectively, which is about 18% higher than the state-of-the-art approach at fold level, 6% higher at superfamily level and 1% higher at family level. An independent assessment on SCOP_TEST dataset showed consistent performance improvement, indicating robustness of our approach. Furthermore, bi-clustering results of the extracted features are compatible with fold hierarchy of proteins, implying that these features are fold-specific. Together, these results suggest that the features extracted from predicted contacts are orthogonal to alignment-related features, and the combination of them could greatly facilitate fold recognition at superfamily/fold levels and template-based prediction of protein structures. Source code of DeepFR is freely available through https://github.com/zhujianwei31415/deepfr, and a web server is available through http://protein.ict.ac.cn/deepfr. zheng@itp.ac.cn or dbu@ict.ac.cn. Supplementary data are available at Bioinformatics online. © The Author (2017). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com

  16. Advanced Subsonic Technology (AST) Area of Interest (AOI) 6: Develop and Validate Aeroelastic Codes for Turbomachinery

    NASA Technical Reports Server (NTRS)

    Gardner, Kevin D.; Liu, Jong-Shang; Murthy, Durbha V.; Kruse, Marlin J.; James, Darrell

    1999-01-01

    AlliedSignal Engines, in cooperation with NASA GRC (National Aeronautics and Space Administration Glenn Research Center), completed an evaluation of recently-developed aeroelastic computer codes using test cases from the AlliedSignal Engines fan blisk and turbine databases. Test data included strain gage, performance, and steady-state pressure information obtained for conditions where synchronous or flutter vibratory conditions were found to occur. Aeroelastic codes evaluated included quasi 3-D UNSFLO (MIT Developed/AE Modified, Quasi 3-D Aeroelastic Computer Code), 2-D FREPS (NASA-Developed Forced Response Prediction System Aeroelastic Computer Code), and 3-D TURBO-AE (NASA/Mississippi State University Developed 3-D Aeroelastic Computer Code). Unsteady pressure predictions for the turbine test case were used to evaluate the forced response prediction capabilities of each of the three aeroelastic codes. Additionally, one of the fan flutter cases was evaluated using TURBO-AE. The UNSFLO and FREPS evaluation predictions showed good agreement with the experimental test data trends, but quantitative improvements are needed. UNSFLO over-predicted turbine blade response reductions, while FREPS under-predicted them. The inviscid TURBO-AE turbine analysis predicted no discernible blade response reduction, indicating the necessity of including viscous effects for this test case. For the TURBO-AE fan blisk test case, significant effort was expended getting the viscous version of the code to give converged steady flow solutions for the transonic flow conditions. Once converged, the steady solutions provided an excellent match with test data and the calibrated DAWES (AlliedSignal 3-D Viscous Steady Flow CFD Solver). However, efforts expended establishing quality steady-state solutions prevented exercising the unsteady portion of the TURBO-AE code during the present program. AlliedSignal recommends that unsteady pressure measurement data be obtained for both test cases examined for use in aeroelastic code validation.

  17. Translation elicits a growth rate-dependent, genome-wide, differential protein production in Bacillus subtilis.

    PubMed

    Borkowski, Olivier; Goelzer, Anne; Schaffer, Marc; Calabre, Magali; Mäder, Ulrike; Aymerich, Stéphane; Jules, Matthieu; Fromion, Vincent

    2016-05-17

    Complex regulatory programs control cell adaptation to environmental changes by setting condition-specific proteomes. In balanced growth, bacterial protein abundances depend on the dilution rate, transcript abundances and transcript-specific translation efficiencies. We revisited the current theory claiming the invariance of bacterial translation efficiency. By integrating genome-wide transcriptome datasets and datasets from a library of synthetic gfp-reporter fusions, we demonstrated that translation efficiencies in Bacillus subtilis decreased up to fourfold from slow to fast growth. The translation initiation regions elicited a growth rate-dependent, differential production of proteins without regulators, hence revealing a unique, hard-coded, growth rate-dependent mode of regulation. We combined model-based data analyses of transcript and protein abundances genome-wide and revealed that this global regulation is extensively used in B. subtilis We eventually developed a knowledge-based, three-step translation initiation model, experimentally challenged the model predictions and proposed that a growth rate-dependent drop in free ribosome abundance accounted for the differential protein production. © 2016 The Authors. Published under the terms of the CC BY 4.0 license.

  18. Are plant formins integral membrane proteins?

    PubMed

    Cvrcková, F

    2000-01-01

    The formin family of proteins has been implicated in signaling pathways of cellular morphogenesis in both animals and fungi; in the latter case, at least, they participate in communication between the actin cytoskeleton and the cell surface. Nevertheless, they appear to be cytoplasmic or nuclear proteins, and it is not clear whether they communicate with the plasma membrane, and if so, how. Because nothing is known about formin function in plants, I performed a systematic search for putative Arabidopsis thaliana formin homologs. I found eight putative formin-coding genes in the publicly available part of the Arabidopsis genome sequence and analyzed their predicted protein sequences. Surprisingly, some of them lack parts of the conserved formin-homology 2 (FH2) domain and the majority of them seem to have signal sequences and putative transmembrane segments that are not found in yeast or animals formins. Plant formins define a distinct subfamily. The presence in most Arabidopsis formins of sequence motifs typical or transmembrane proteins suggests a mechanism of membrane attachment that may be specific to plant formins, and indicates an unexpected evolutionary flexibility of the conserved formin domain.

  19. Translation efficiency of heterologous proteins is significantly affected by the genetic context of RBS sequences in engineered cyanobacterium Synechocystis sp. PCC 6803.

    PubMed

    Thiel, Kati; Mulaku, Edita; Dandapani, Hariharan; Nagy, Csaba; Aro, Eva-Mari; Kallio, Pauli

    2018-03-02

    Photosynthetic cyanobacteria have been studied as potential host organisms for direct solar-driven production of different carbon-based chemicals from CO 2 and water, as part of the development of sustainable future biotechnological applications. The engineering approaches, however, are still limited by the lack of comprehensive information on most optimal expression strategies and validated species-specific genetic elements which are essential for increasing the intricacy, predictability and efficiency of the systems. This study focused on the systematic evaluation of the key translational control elements, ribosome binding sites (RBS), in the cyanobacterial host Synechocystis sp. PCC 6803, with the objective of expanding the palette of tools for more rigorous engineering approaches. An expression system was established for the comparison of 13 selected RBS sequences in Synechocystis, using several alternative reporter proteins (sYFP2, codon-optimized GFPmut3 and ethylene forming enzyme) as quantitative indicators of the relative translation efficiencies. The set-up was shown to yield highly reproducible expression patterns in independent analytical series with low variation between biological replicates, thus allowing statistical comparison of the activities of the different RBSs in vivo. While the RBSs covered a relatively broad overall expression level range, the downstream gene sequence was demonstrated in a rigorous manner to have a clear impact on the resulting translational profiles. This was expected to reflect interfering sequence-specific mRNA-level interaction between the RBS and the coding region, yet correlation between potential secondary structure formation and observed translation levels could not be resolved with existing in silico prediction tools. The study expands our current understanding on the potential and limitations associated with the regulation of protein expression at translational level in engineered cyanobacteria. The acquired information can be used for selecting appropriate RBSs for optimizing over-expression constructs or multicistronic pathways in Synechocystis, while underlining the complications in predicting the activity due to gene-specific interactions which may reduce the translational efficiency for a given RBS-gene combination. Ultimately, the findings emphasize the need for additional characterized insulator sequence elements to decouple the interaction between the RBS and the coding region for future engineering approaches.

  20. Background-Modeling-Based Adaptive Prediction for Surveillance Video Coding.

    PubMed

    Zhang, Xianguo; Huang, Tiejun; Tian, Yonghong; Gao, Wen

    2014-02-01

    The exponential growth of surveillance videos presents an unprecedented challenge for high-efficiency surveillance video coding technology. Compared with the existing coding standards that were basically developed for generic videos, surveillance video coding should be designed to make the best use of the special characteristics of surveillance videos (e.g., relative static background). To do so, this paper first conducts two analyses on how to improve the background and foreground prediction efficiencies in surveillance video coding. Following the analysis results, we propose a background-modeling-based adaptive prediction (BMAP) method. In this method, all blocks to be encoded are firstly classified into three categories. Then, according to the category of each block, two novel inter predictions are selectively utilized, namely, the background reference prediction (BRP) that uses the background modeled from the original input frames as the long-term reference and the background difference prediction (BDP) that predicts the current data in the background difference domain. For background blocks, the BRP can effectively improve the prediction efficiency using the higher quality background as the reference; whereas for foreground-background-hybrid blocks, the BDP can provide a better reference after subtracting its background pixels. Experimental results show that the BMAP can achieve at least twice the compression ratio on surveillance videos as AVC (MPEG-4 Advanced Video Coding) high profile, yet with a slightly additional encoding complexity. Moreover, for the foreground coding performance, which is crucial to the subjective quality of moving objects in surveillance videos, BMAP also obtains remarkable gains over several state-of-the-art methods.

  1. Predicting the Performance of an Axial-Flow Compressor

    NASA Technical Reports Server (NTRS)

    Steinke, R. J.

    1986-01-01

    Stage-stacking computer code (STGSTK) developed for predicting off-design performance of multi-stage axial-flow compressors. Code uses meanline stagestacking method. Stage and cumulative compressor performance calculated from representative meanline velocity diagrams located at rotor inlet and outlet meanline radii. Numerous options available within code. Code developed so user modify correlations to suit their needs.

  2. Advanced turboprop noise prediction based on recent theoretical results

    NASA Technical Reports Server (NTRS)

    Farassat, F.; Padula, S. L.; Dunn, M. H.

    1987-01-01

    The development of a high speed propeller noise prediction code at Langley Research Center is described. The code utilizes two recent acoustic formulations in the time domain for subsonic and supersonic sources. The structure and capabilities of the code are discussed. Grid size study for accuracy and speed of execution on a computer is also presented. The code is tested against an earlier Langley code. Considerable increase in accuracy and speed of execution are observed. Some examples of noise prediction of a high speed propeller for which acoustic test data are available are given. A brisk derivation of formulations used is given in an appendix.

  3. simDEF: definition-based semantic similarity measure of gene ontology terms for functional similarity analysis of genes.

    PubMed

    Pesaranghader, Ahmad; Matwin, Stan; Sokolova, Marina; Beiko, Robert G

    2016-05-01

    Measures of protein functional similarity are essential tools for function prediction, evaluation of protein-protein interactions (PPIs) and other applications. Several existing methods perform comparisons between proteins based on the semantic similarity of their GO terms; however, these measures are highly sensitive to modifications in the topological structure of GO, tend to be focused on specific analytical tasks and concentrate on the GO terms themselves rather than considering their textual definitions. We introduce simDEF, an efficient method for measuring semantic similarity of GO terms using their GO definitions, which is based on the Gloss Vector measure commonly used in natural language processing. The simDEF approach builds optimized definition vectors for all relevant GO terms, and expresses the similarity of a pair of proteins as the cosine of the angle between their definition vectors. Relative to existing similarity measures, when validated on a yeast reference database, simDEF improves correlation with sequence homology by up to 50%, shows a correlation improvement >4% with gene expression in the biological process hierarchy of GO and increases PPI predictability by > 2.5% in F1 score for molecular function hierarchy. Datasets, results and source code are available at http://kiwi.cs.dal.ca/Software/simDEF CONTACT: ahmad.pgh@dal.ca or beiko@cs.dal.ca Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  4. GuiTope: an application for mapping random-sequence peptides to protein sequences.

    PubMed

    Halperin, Rebecca F; Stafford, Phillip; Emery, Jack S; Navalkar, Krupa Arun; Johnston, Stephen Albert

    2012-01-03

    Random-sequence peptide libraries are a commonly used tool to identify novel ligands for binding antibodies, other proteins, and small molecules. It is often of interest to compare the selected peptide sequences to the natural protein binding partners to infer the exact binding site or the importance of particular residues. The ability to search a set of sequences for similarity to a set of peptides may sometimes enable the prediction of an antibody epitope or a novel binding partner. We have developed a software application designed specifically for this task. GuiTope provides a graphical user interface for aligning peptide sequences to protein sequences. All alignment parameters are accessible to the user including the ability to specify the amino acid frequency in the peptide library; these frequencies often differ significantly from those assumed by popular alignment programs. It also includes a novel feature to align di-peptide inversions, which we have found improves the accuracy of antibody epitope prediction from peptide microarray data and shows utility in analyzing phage display datasets. Finally, GuiTope can randomly select peptides from a given library to estimate a null distribution of scores and calculate statistical significance. GuiTope provides a convenient method for comparing selected peptide sequences to protein sequences, including flexible alignment parameters, novel alignment features, ability to search a database, and statistical significance of results. The software is available as an executable (for PC) at http://www.immunosignature.com/software and ongoing updates and source code will be available at sourceforge.net.

  5. Using the NCBI Genome Databases to Compare the Genes for Human & Chimpanzee Beta Hemoglobin

    ERIC Educational Resources Information Center

    Offner, Susan

    2010-01-01

    The beta hemoglobin protein is identical in humans and chimpanzees. In this tutorial, students see that even though the proteins are identical, the genes that code for them are not. There are many more differences in the introns than in the exons, which indicates that coding regions of DNA are more highly conserved than non-coding regions.

  6. Importing statistical measures into Artemis enhances gene identification in the Leishmania genome project.

    PubMed

    Aggarwal, Gautam; Worthey, E A; McDonagh, Paul D; Myler, Peter J

    2003-06-07

    Seattle Biomedical Research Institute (SBRI) as part of the Leishmania Genome Network (LGN) is sequencing chromosomes of the trypanosomatid protozoan species Leishmania major. At SBRI, chromosomal sequence is annotated using a combination of trained and untrained non-consensus gene-prediction algorithms with ARTEMIS, an annotation platform with rich and user-friendly interfaces. Here we describe a methodology used to import results from three different protein-coding gene-prediction algorithms (GLIMMER, TESTCODE and GENESCAN) into the ARTEMIS sequence viewer and annotation tool. Comparison of these methods, along with the CODONUSAGE algorithm built into ARTEMIS, shows the importance of combining methods to more accurately annotate the L. major genomic sequence. An improvised and powerful tool for gene prediction has been developed by importing data from widely-used algorithms into an existing annotation platform. This approach is especially fruitful in the Leishmania genome project where there is large proportion of novel genes requiring manual annotation.

  7. Fast, scalable prediction of deleterious noncoding variants from functional and population genomic data.

    PubMed

    Huang, Yi-Fei; Gulko, Brad; Siepel, Adam

    2017-04-01

    Many genetic variants that influence phenotypes of interest are located outside of protein-coding genes, yet existing methods for identifying such variants have poor predictive power. Here we introduce a new computational method, called LINSIGHT, that substantially improves the prediction of noncoding nucleotide sites at which mutations are likely to have deleterious fitness consequences, and which, therefore, are likely to be phenotypically important. LINSIGHT combines a generalized linear model for functional genomic data with a probabilistic model of molecular evolution. The method is fast and highly scalable, enabling it to exploit the 'big data' available in modern genomics. We show that LINSIGHT outperforms the best available methods in identifying human noncoding variants associated with inherited diseases. In addition, we apply LINSIGHT to an atlas of human enhancers and show that the fitness consequences at enhancers depend on cell type, tissue specificity, and constraints at associated promoters.

  8. Biallelic insertion of a transcriptional terminator via the CRISPR/Cas9 system efficiently silences expression of protein-coding and non-coding RNA genes.

    PubMed

    Liu, Yangyang; Han, Xiao; Yuan, Junting; Geng, Tuoyu; Chen, Shihao; Hu, Xuming; Cui, Isabelle H; Cui, Hengmi

    2017-04-07

    The type II bacterial CRISPR/Cas9 system is a simple, convenient, and powerful tool for targeted gene editing. Here, we describe a CRISPR/Cas9-based approach for inserting a poly(A) transcriptional terminator into both alleles of a targeted gene to silence protein-coding and non-protein-coding genes, which often play key roles in gene regulation but are difficult to silence via insertion or deletion of short DNA fragments. The integration of 225 bp of bovine growth hormone poly(A) signals into either the first intron or the first exon or behind the promoter of target genes caused efficient termination of expression of PPP1R12C , NSUN2 (protein-coding genes), and MALAT1 (non-protein-coding gene). Both NeoR and PuroR were used as markers in the selection of clonal cell lines with biallelic integration of a poly(A) signal. Genotyping analysis indicated that the cell lines displayed the desired biallelic silencing after a brief selection period. These combined results indicate that this CRISPR/Cas9-based approach offers an easy, convenient, and efficient novel technique for gene silencing in cell lines, especially for those in which gene integration is difficult because of a low efficiency of homology-directed repair. © 2017 by The American Society for Biochemistry and Molecular Biology, Inc.

  9. Long non-coding RNA expression patterns in lung tissues of chronic cigarette smoke induced COPD mouse model.

    PubMed

    Zhang, Haiyun; Sun, Dejun; Li, Defu; Zheng, Zeguang; Xu, Jingyi; Liang, Xue; Zhang, Chenting; Wang, Sheng; Wang, Jian; Lu, Wenju

    2018-05-15

    Long non-coding RNAs (lncRNAs) have critical regulatory roles in protein-coding gene expression. Aberrant expression profiles of lncRNAs have been observed in various human diseases. In this study, we investigated transcriptome profiles in lung tissues of chronic cigarette smoke (CS)-induced COPD mouse model. We found that 109 lncRNAs and 260 mRNAs were significantly differential expressed in lungs of chronic CS-induced COPD mouse model compared with control animals. GO and KEGG analyses indicated that differentially expressed lncRNAs associated protein-coding genes were mainly involved in protein processing of endoplasmic reticulum pathway, and taurine and hypotaurine metabolism pathway. The combination of high throughput data analysis and the results of qRT-PCR validation in lungs of chronic CS-induced COPD mouse model, 16HBE cells with CSE treatment and PBMC from patients with COPD revealed that NR_102714 and its associated protein-coding gene UCHL1 might be involved in the development of COPD both in mouse and human. In conclusion, our study demonstrated that aberrant expression profiles of lncRNAs and mRNAs existed in lungs of chronic CS-induced COPD mouse model. From animal models perspective, these results might provide further clues to investigate biological functions of lncRNAs and their potential target protein-coding genes in the pathogenesis of COPD.

  10. Comparison of Space Shuttle Hot Gas Manifold analysis to air flow data

    NASA Technical Reports Server (NTRS)

    Mcconnaughey, P. K.

    1988-01-01

    This paper summarizes several recent analyses of the Space Shuttle Main Engine Hot Gas Manifold and compares predicted flow environments to air flow data. Codes used in these analyses include INS3D, PAGE, PHOENICS, and VAST. Both laminar (Re = 250, M = 0.30) and turbulent (Re = 1.9 million, M = 0.30) results are discussed, with the latter being compared to data for system losses, outer wall static pressures, and manifold exit Mach number profiles. Comparison of predicted results for the turbulent case to air flow data shows that the analysis using INS3D predicted system losses within 1 percent error, while the PHOENICS, PAGE, and VAST codes erred by 31, 35, and 47 percent, respectively. The INS3D, PHOENICS, and PAGE codes did a reasonable job of predicting outer wall static pressure, while the PHOENICS code predicted exit Mach number profiles with acceptable accuracy. INS3D was approximately an order of magnitude more efficient than the other codes in terms of code speed and memory requirements. In general, it is seen that complex internal flows in manifold-like geometries can be predicted with a limited degree of confidence, and further development is necessary to improve both efficiency and accuracy of codes if they are to be used as design tools for complex three-dimensional geometries.

  11. A new method for species identification via protein-coding and non-coding DNA barcodes by combining machine learning with bioinformatic methods.

    PubMed

    Zhang, Ai-bing; Feng, Jie; Ward, Robert D; Wan, Ping; Gao, Qiang; Wu, Jun; Zhao, Wei-zhong

    2012-01-01

    Species identification via DNA barcodes is contributing greatly to current bioinventory efforts. The initial, and widely accepted, proposal was to use the protein-coding cytochrome c oxidase subunit I (COI) region as the standard barcode for animals, but recently non-coding internal transcribed spacer (ITS) genes have been proposed as candidate barcodes for both animals and plants. However, achieving a robust alignment for non-coding regions can be problematic. Here we propose two new methods (DV-RBF and FJ-RBF) to address this issue for species assignment by both coding and non-coding sequences that take advantage of the power of machine learning and bioinformatics. We demonstrate the value of the new methods with four empirical datasets, two representing typical protein-coding COI barcode datasets (neotropical bats and marine fish) and two representing non-coding ITS barcodes (rust fungi and brown algae). Using two random sub-sampling approaches, we demonstrate that the new methods significantly outperformed existing Neighbor-joining (NJ) and Maximum likelihood (ML) methods for both coding and non-coding barcodes when there was complete species coverage in the reference dataset. The new methods also out-performed NJ and ML methods for non-coding sequences in circumstances of potentially incomplete species coverage, although then the NJ and ML methods performed slightly better than the new methods for protein-coding barcodes. A 100% success rate of species identification was achieved with the two new methods for 4,122 bat queries and 5,134 fish queries using COI barcodes, with 95% confidence intervals (CI) of 99.75-100%. The new methods also obtained a 96.29% success rate (95%CI: 91.62-98.40%) for 484 rust fungi queries and a 98.50% success rate (95%CI: 96.60-99.37%) for 1094 brown algae queries, both using ITS barcodes.

  12. Functional Identification and Structure Determination of Two Novel Prolidases from cog1228 in the Amidohydrolase Superfamily

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Xiang, Dao Feng; Patskovsky, Yury; Xu, Chengfu

    2010-12-07

    Two uncharacterized enzymes from the amidohydrolase superfamily belonging to cog1228 were cloned, expressed, and purified to homogeneity. The two proteins, Sgx9260c (gi|44242006) and Sgx9260b (gi|44479596), were derived from environmental DNA samples originating from the Sargasso Sea. The catalytic function and substrate profiles for Sgx9260c and Sgx9260b were determined using a comprehensive library of dipeptides and N-acyl derivative of L-amino acids. Sgx9260c catalyzes the hydrolysis of Gly-L-Pro, L-Ala-L-Pro, and N-acyl derivatives of L-Pro. The best substrate identified to date is N-acetyl-L-Pro with a value of k{sub cat}/K{sub m} of 3 x 10{sup 5} M{sup -1} s{sup -1}. Sgx9260b catalyzes the hydrolysismore » of L-hydrophobic L-Pro dipeptides and N-acyl derivatives of L-Pro. The best substrate identified to date is N-propionyl-L-Pro with a value of k{sub cat}/K{sub m} of 1 x 10{sup 5} M{sup -1} s{sup -1}. Three-dimensional structures of both proteins were determined by X-ray diffraction methods (PDB codes 3MKV and 3FEQ). These proteins fold as distorted ({beta}/{alpha})8-barrels with two divalent cations in the active site. The structure of Sgx9260c was also determined as a complex with the N-methylphosphonate derivative of L-Pro (PDB code 3N2C). In this structure the phosphonate moiety bridges the binuclear metal center, and one oxygen atom interacts with His-140. The {alpha}-carboxylate of the inhibitor interacts with Tyr-231. The proline side chain occupies a small substrate binding cavity formed by residues contributed from the loop that follows {beta}-strand 7 within the ({beta}/{alpha})8-barrel. A total of 38 other proteins from cog1228 are predicted to have the same substrate profile based on conservation of the substrate binding residues. The structure of an evolutionarily related protein, Cc2672 from Caulobacter crecentus, was determined as a complex with the N-methylphosphonate derivative of L-arginine (PDB code 3MTW).« less

  13. Computer analysis of protein functional sites projection on exon structure of genes in Metazoa.

    PubMed

    Medvedeva, Irina V; Demenkov, Pavel S; Ivanisenko, Vladimir A

    2015-01-01

    Study of the relationship between the structural and functional organization of proteins and their coding genes is necessary for an understanding of the evolution of molecular systems and can provide new knowledge for many applications for designing proteins with improved medical and biological properties. It is well known that the functional properties of proteins are determined by their functional sites. Functional sites are usually represented by a small number of amino acid residues that are distantly located from each other in the amino acid sequence. They are highly conserved within their functional group and vary significantly in structure between such groups. According to this facts analysis of the general properties of the structural organization of the functional sites at the protein level and, at the level of exon-intron structure of the coding gene is still an actual problem. One approach to this analysis is the projection of amino acid residue positions of the functional sites along with the exon boundaries to the gene structure. In this paper, we examined the discontinuity of the functional sites in the exon-intron structure of genes and the distribution of lengths and phases of the functional site encoding exons in vertebrate genes. We have shown that the DNA fragments coding the functional sites were in the same exons, or in close exons. The observed tendency to cluster the exons that code functional sites which could be considered as the unit of protein evolution. We studied the characteristics of the structure of the exon boundaries that code, and do not code, functional sites in 11 Metazoa species. This is accompanied by a reduced frequency of intercodon gaps (phase 0) in exons encoding the amino acid residue functional site, which may be evidence of the existence of evolutionary limitations to the exon shuffling. These results characterize the features of the coding exon-intron structure that affect the functionality of the encoded protein and allow a better understanding of the emergence of biological diversity.

  14. Identification of a psoriasis susceptibility candidate gene by linkage disequilibrium mapping with a localized single nucleotide polymorphism map.

    PubMed

    Hewett, Duncan; Samuelsson, Lena; Polding, Joanne; Enlund, Fredrik; Smart, Devi; Cantone, Kathryn; See, Chee Gee; Chadha, Sapna; Inerot, Annica; Enerback, Charlotta; Montgomery, Doug; Christodolou, Chris; Robinson, Phil; Matthews, Paul; Plumpton, Mary; Wahlstrom, Jan; Swanbeck, Gunnar; Martinsson, Tommy; Roses, Allen; Riley, John; Purvis, Ian

    2002-03-01

    Psoriasis is a chronic inflammatory disease of the skin with both genetic and environmental risk factors. Here we describe the creation of a single-nucleotide polymorphism (SNP) map spanning 900-1200 kb of chromosome 3q21, which had been previously recognized as containing a psoriasis susceptibility locus, PSORS5. We genotyped 644 individuals, from 195 Swedish psoriatic families, for 19 polymorphisms. Linkage disequilibrium (LD) between marker and disease was assessed using the transmission/disequilibrium test (TDT). In the TDT analysis, alleles of three of these SNPs showed significant association with disease (P<0.05). A 160-kb interval encompassing these three SNPs was sequenced, and a coding sequence consisting of 13 exons was identified. The predicted protein shares 30-40% homology with the family of cation/chloride cotransporters. A five-marker haplotype spanning the 3' half of this gene is associated with psoriasis to a P value of 3.8<10(-5). We have called this gene SLC12A8, coding for a member of the solute carrier family 12 proteins. It belongs to a class of genes that were previously unrecognized as playing a role in psoriasis pathogenesis.

  15. MED: a new non-supervised gene prediction algorithm for bacterial and archaeal genomes.

    PubMed

    Zhu, Huaiqiu; Hu, Gang-Qing; Yang, Yi-Fan; Wang, Jin; She, Zhen-Su

    2007-03-16

    Despite a remarkable success in the computational prediction of genes in Bacteria and Archaea, a lack of comprehensive understanding of prokaryotic gene structures prevents from further elucidation of differences among genomes. It continues to be interesting to develop new ab initio algorithms which not only accurately predict genes, but also facilitate comparative studies of prokaryotic genomes. This paper describes a new prokaryotic genefinding algorithm based on a comprehensive statistical model of protein coding Open Reading Frames (ORFs) and Translation Initiation Sites (TISs). The former is based on a linguistic "Entropy Density Profile" (EDP) model of coding DNA sequence and the latter comprises several relevant features related to the translation initiation. They are combined to form a so-called Multivariate Entropy Distance (MED) algorithm, MED 2.0, that incorporates several strategies in the iterative program. The iterations enable us to develop a non-supervised learning process and to obtain a set of genome-specific parameters for the gene structure, before making the prediction of genes. Results of extensive tests show that MED 2.0 achieves a competitive high performance in the gene prediction for both 5' and 3' end matches, compared to the current best prokaryotic gene finders. The advantage of the MED 2.0 is particularly evident for GC-rich genomes and archaeal genomes. Furthermore, the genome-specific parameters given by MED 2.0 match with the current understanding of prokaryotic genomes and may serve as tools for comparative genomic studies. In particular, MED 2.0 is shown to reveal divergent translation initiation mechanisms in archaeal genomes while making a more accurate prediction of TISs compared to the existing gene finders and the current GenBank annotation.

  16. Robust prediction of consensus secondary structures using averaged base pairing probability matrices.

    PubMed

    Kiryu, Hisanori; Kin, Taishin; Asai, Kiyoshi

    2007-02-15

    Recent transcriptomic studies have revealed the existence of a considerable number of non-protein-coding RNA transcripts in higher eukaryotic cells. To investigate the functional roles of these transcripts, it is of great interest to find conserved secondary structures from multiple alignments on a genomic scale. Since multiple alignments are often created using alignment programs that neglect the special conservation patterns of RNA secondary structures for computational efficiency, alignment failures can cause potential risks of overlooking conserved stem structures. We investigated the dependence of the accuracy of secondary structure prediction on the quality of alignments. We compared three algorithms that maximize the expected accuracy of secondary structures as well as other frequently used algorithms. We found that one of our algorithms, called McCaskill-MEA, was more robust against alignment failures than others. The McCaskill-MEA method first computes the base pairing probability matrices for all the sequences in the alignment and then obtains the base pairing probability matrix of the alignment by averaging over these matrices. The consensus secondary structure is predicted from this matrix such that the expected accuracy of the prediction is maximized. We show that the McCaskill-MEA method performs better than other methods, particularly when the alignment quality is low and when the alignment consists of many sequences. Our model has a parameter that controls the sensitivity and specificity of predictions. We discussed the uses of that parameter for multi-step screening procedures to search for conserved secondary structures and for assigning confidence values to the predicted base pairs. The C++ source code that implements the McCaskill-MEA algorithm and the test dataset used in this paper are available at http://www.ncrna.org/papers/McCaskillMEA/. Supplementary data are available at Bioinformatics online.

  17. Dynamic and Widespread lncRNA Expression in a Sponge and the Origin of Animal Complexity

    PubMed Central

    Gaiti, Federico; Fernandez-Valverde, Selene L.; Nakanishi, Nagayasu; Calcino, Andrew D.; Yanai, Itai; Tanurdzic, Milos; Degnan, Bernard M.

    2015-01-01

    Long noncoding RNAs (lncRNAs) are important developmental regulators in bilaterian animals. A correlation has been claimed between the lncRNA repertoire expansion and morphological complexity in vertebrate evolution. However, this claim has not been tested by examining morphologically simple animals. Here, we undertake a systematic investigation of lncRNAs in the demosponge Amphimedon queenslandica, a morphologically simple, early-branching metazoan. We combine RNA-Seq data across multiple developmental stages of Amphimedon with a filtering pipeline to conservatively predict 2,935 lncRNAs. These include intronic overlapping lncRNAs, exonic antisense overlapping lncRNAs, long intergenic nonprotein coding RNAs, and precursors for small RNAs. Sponge lncRNAs are remarkably similar to their bilaterian counterparts in being relatively short with few exons and having low primary sequence conservation relative to protein-coding genes. As in bilaterians, a majority of sponge lncRNAs exhibit typical hallmarks of regulatory molecules, including high temporal specificity and dynamic developmental expression. Specific lncRNA expression profiles correlate tightly with conserved protein-coding genes likely involved in a range of developmental and physiological processes, such as the Wnt signaling pathway. Although the majority of Amphimedon lncRNAs appears to be taxonomically restricted with no identifiable orthologs, we find a few cases of conservation between demosponges in lncRNAs that are antisense to coding sequences. Based on the high similarity in the structure, organization, and dynamic expression of sponge lncRNAs to their bilaterian counterparts, we propose that these noncoding RNAs are an ancient feature of the metazoan genome. These results are consistent with lncRNAs regulating the development of animals, regardless of their level of morphological complexity. PMID:25976353

  18. New PAH gene promoter KLF1 and 3'-region C/EBPalpha motifs influence transcription in vitro.

    PubMed

    Klaassen, Kristel; Stankovic, Biljana; Kotur, Nikola; Djordjevic, Maja; Zukic, Branka; Nikcevic, Gordana; Ugrin, Milena; Spasovski, Vesna; Srzentic, Sanja; Pavlovic, Sonja; Stojiljkovic, Maja

    2017-02-01

    Phenylketonuria (PKU) is a metabolic disease caused by mutations in the phenylalanine hydroxylase (PAH) gene. Although the PAH genotype remains the main determinant of PKU phenotype severity, genotype-phenotype inconsistencies have been reported. In this study, we focused on unanalysed sequences in non-coding PAH gene regions to assess their possible influence on the PKU phenotype. We transiently transfected HepG2 cells with various chloramphenicol acetyl transferase (CAT) reporter constructs which included PAH gene non-coding regions. Selected non-coding regions were indicated by in silico prediction to contain transcription factor binding sites. Furthermore, electrophoretic mobility shift assay (EMSA) and supershift assays were performed to identify which transcriptional factors were engaged in the interaction. We found novel KLF1 motif in the PAH promoter, which decreases CAT activity by 50 % in comparison to basal transcription in vitro. The cytosine at the c.-170 promoter position creates an additional binding site for the protein complex involving KLF1 transcription factor. Moreover, we assessed for the first time the role of a multivariant variable number tandem repeat (VNTR) region located in the 3'-region of the PAH gene. We found that the VNTR3, VNTR7 and VNTR8 constructs had approximately 60 % of CAT activity. The regulation is mediated by the C/EBPalpha transcription factor, present in protein complex binding to VNTR3. Our study highlighted two novel promoter KLF1 and 3'-region C/EBPalpha motifs in the PAH gene which decrease transcription in vitro and, thus, could be considered as PAH expression modifiers. New transcription motifs in non-coding regions will contribute to better understanding of the PKU phenotype complexity and may become important for the optimisation of PKU treatment.

  19. Use of fluorescent proteins and color-coded imaging to visualize cancer cells with different genetic properties.

    PubMed

    Hoffman, Robert M

    2016-03-01

    Fluorescent proteins are very bright and available in spectrally-distinct colors, enable the imaging of color-coded cancer cells growing in vivo and therefore the distinction of cancer cells with different genetic properties. Non-invasive and intravital imaging of cancer cells with fluorescent proteins allows the visualization of distinct genetic variants of cancer cells down to the cellular level in vivo. Cancer cells with increased or decreased ability to metastasize can be distinguished in vivo. Gene exchange in vivo which enables low metastatic cancer cells to convert to high metastatic can be color-coded imaged in vivo. Cancer stem-like and non-stem cells can be distinguished in vivo by color-coded imaging. These properties also demonstrate the vast superiority of imaging cancer cells in vivo with fluorescent proteins over photon counting of luciferase-labeled cancer cells.

  20. Identification of functional elements and regulatory circuits by Drosophila modENCODE

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Roy, Sushmita; Ernst, Jason; Kharchenko, Peter V.

    2010-12-22

    To gain insight into how genomic information is translated into cellular and developmental programs, the Drosophila model organism Encyclopedia of DNA Elements (modENCODE) project is comprehensively mapping transcripts, histone modifications, chromosomal proteins, transcription factors, replication proteins and intermediates, and nucleosome properties across a developmental time course and in multiple cell lines. We have generated more than 700 data sets and discovered protein-coding, noncoding, RNA regulatory, replication, and chromatin elements, more than tripling the annotated portion of the Drosophila genome. Correlated activity patterns of these elements reveal a functional regulatory network, which predicts putative new functions for genes, reveals stage- andmore » tissue-specific regulators, and enables gene-expression prediction. Our results provide a foundation for directed experimental and computational studies in Drosophila and related species and also a model for systematic data integration toward comprehensive genomic and functional annotation. Several years after the complete genetic sequencing of many species, it is still unclear how to translate genomic information into a functional map of cellular and developmental programs. The Encyclopedia of DNA Elements (ENCODE) (1) and model organism ENCODE (modENCODE) (2) projects use diverse genomic assays to comprehensively annotate the Homo sapiens (human), Drosophila melanogaster (fruit fly), and Caenorhabditis elegans (worm) genomes, through systematic generation and computational integration of functional genomic data sets. Previous genomic studies in flies have made seminal contributions to our understanding of basic biological mechanisms and genome functions, facilitated by genetic, experimental, computational, and manual annotation of the euchromatic and heterochromatic genome (3), small genome size, short life cycle, and a deep knowledge of development, gene function, and chromosome biology. The functions of {approx}40% of the protein and nonprotein-coding genes [FlyBase 5.12 (4)] have been determined from cDNA collections (5, 6), manual curation of gene models (7), gene mutations and comprehensive genome-wide RNA interference screens (8-10), and comparative genomic analyses (11, 12). The Drosophila modENCODE project has generated more than 700 data sets that profile transcripts, histone modifications and physical nucleosome properties, general and specific transcription factors (TFs), and replication programs in cell lines, isolated tissues, and whole organisms across several developmental stages (Fig. 1). Here, we computationally integrate these data sets and report (i) improved and additional genome annotations, including full-length proteincoding genes and peptides as short as 21 amino acids; (ii) noncoding transcripts, including 132 candidate structural RNAs and 1608 nonstructural transcripts; (iii) additional Argonaute (Ago)-associated small RNA genes and pathways, including new microRNAs (miRNAs) encoded within protein-coding exons and endogenous small interfering RNAs (siRNAs) from 3-inch untranslated regions; (iv) chromatin 'states' defined by combinatorial patterns of 18 chromatin marks that are associated with distinct functions and properties; (v) regions of high TF occupancy and replication activity with likely epigenetic regulation; (vi)mixed TF and miRNA regulatory networks with hierarchical structure and enriched feed-forward loops; (vii) coexpression- and co-regulation-based functional annotations for nearly 3000 genes; (viii) stage- and tissue-specific regulators; and (ix) predictive models of gene expression levels and regulator function.« less

Top